Large language models (LLMs) are becoming increasingly relevant to businesses and organizations. Their ability to understand and analyze data and make sense of complex information can drive innovation, improve operational efficiency, and deliver personalized experiences across industries. By integrating LLMs with various tools, we can build applications that automate tasks, provide insights, and support decision-making processes.
However, building these applications can be complex and time-consuming, and a framework helps streamline development and ensure scalability. A framework provides standardized tools and processes, making it easier to develop, deploy, and maintain effective LLM applications. So, let’s learn about LangChain, the most popular framework for developing LLM applications.
LangChain’s functionality ranges from loading, splitting, embedding, and retrieving data for the LLM to parsing the LLM’s output. It also supports adding tools and agentic capabilities to the LLM and offers hundreds of third-party integrations. The LangChain ecosystem also includes LangGraph for building stateful agents and LangSmith for productionizing LLM applications. You can learn more about LangChain in Building LLM-Powered Applications with LangChain.
In this series of articles, we will learn about the different components of LangChain. Since it all starts with data, we will begin by loading data from various file types and data sources using LangChain’s document loaders.
Document loaders convert data from diverse formats into standardized Document objects. A Document object consists of page_content, which holds the data as a string, an optional ID for the Document, and metadata that provides information about the data.
Let’s create a document object to learn how it works:
To get started, install the LangChain framework using ‘pip install langchain’. The loaders used below are imported from langchain_community, which may need to be installed separately with ‘pip install langchain-community’.
from langchain_core.documents import Document
data = Document(page_content='This is the article about document loaders of Langchain', id=1, metadata={'source':'AV'})
data
>>> Document(id='1', metadata={'source': 'AV'}, page_content='This is the article about document loaders of Langchain')
data.page_content
>>> 'This is the article about document loaders of Langchain'
data.id = 2 # this changes the id of the Document object
As we can see, we can create a Document object with page_content, id, and metadata and access and modify its contents.
There are more than two hundred document loaders in LangChain. They can be categorized by the file type they parse (e.g., CSV, PDF, HTML, Markdown) and by the data source they connect to (e.g., YouTube, Wikipedia, GitHub).
Data sources can be further classified as public and private. Public data sources like YouTube or Wikipedia don’t need access tokens, while private data sources like AWS or Azure do. Let’s use a few document loaders, covering formats such as CSV, HTML, Markdown, JSON, MS Office, and PDF, to understand how they convert data into standardized Document objects.
CSV files can be loaded with CSVLoader. It loads each row of the file as a separate Document.
from langchain_community.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(file_path="./iris.csv", metadata_columns=['species'], csv_args={"delimiter": ","})
data = loader.load()
len(data)
>>> 150 # for 150 rows
We can add any columns to the metadata using the metadata_columns parameter. We can also use a column’s value as the source instead of the file name.
data[0].metadata
>>> {'source': './iris.csv', 'row': 0, 'species': 'setosa'}
# we can change the source to 'setosa' by passing source_column='species' (see the sketch below)
for record in data[:1]:
    print(record)
>>> page_content='sepal_length: 5.1
sepal_width: 3.5
petal_length: 1.4
petal_width: 0.2' metadata={'source': './iris.csv', 'row': 0, 'species': 'setosa'}
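As mentioned in the comment above, here is a minimal sketch of using source_column (assuming the same iris.csv file), so that each Document’s source comes from the ‘species’ column instead of the file path:

from langchain_community.document_loaders.csv_loader import CSVLoader

# use the 'species' column value as each Document's source instead of the file name
loader = CSVLoader(file_path="./iris.csv", source_column="species", csv_args={"delimiter": ","})
data = loader.load()
data[0].metadata
>>> {'source': 'setosa', 'row': 0}  # the source now reflects the species value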
As shown above, LangChain document loaders load the data into Document objects.
We can load an HTML page either from a saved HTML file or directly from a URL.
from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_community.document_loaders import UnstructuredURLLoader
loader = UnstructuredURLLoader(urls=['https://diataxis.fr'], mode='elements')
data = loader.load()
len(data)
>>> 61
The entire HTML page is loaded as one Document if the mode is ‘single’. If the mode is ‘elements’, separate Documents are created based on the HTML tags.
# accessing metadata and content in a document
data[28].metadata
>>> {'languages': ['eng'], 'parent_id': '312017038db4f2ad1e9332fc5a40bb9d',
'filetype': 'text/html', 'url': 'https://diataxis.fr', 'category': 'NarrativeText'}
data[28].page_content
>>> "Diátaxis is a way of thinking about and doing documentation"
Markdown is a markup language for creating formatted text using a simple text editor.
from langchain_community.document_loaders import UnstructuredMarkdownLoader
# can download from here https://github.com/dsanr/best-of-ML/blob/main/README.md
loader = UnstructuredMarkdownLoader('README.md', mode='elements')
data = loader.load()
len(data)
>>> 1458
In addition to ‘single’ and ‘elements’, this loader also has a ‘paged’ mode, which partitions the file based on page numbers.
data[700].metadata
>>> {'source': 'README.md', 'last_modified': '2024-07-09T12:52:53', 'languages': ['eng'], 'filetype': 'text/markdown', 'filename': 'README.md', 'category': 'Title'}
data[700].page_content
>>> 'NeuralProphet (🥈28 · ⭐ 3.7K) - NeuralProphet: A simple forecasting package.'
For this example, we can copy the sample JSON content from the LangChain documentation guide “How to load JSON?”.
from langchain_community.document_loaders import JSONLoader
loader = JSONLoader(file_path='chat.json', jq_schema='.', text_content=False)
data = loader.load()
len(data)
>>> 1
In JSONLoader, we need to specify the schema. If jq_schema='.', all the content is loaded. Depending on the content we need from the JSON, we can change the schema. For example, jq_schema='.title' extracts the title, and jq_schema='.messages[].content' extracts only the content of the messages.
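For instance, assuming chat.json contains a ‘messages’ array whose items have a ‘content’ field, a sketch of loading only the message contents could look like this:

from langchain_community.document_loaders import JSONLoader

# load one Document per message, keeping only its 'content' field
loader = JSONLoader(file_path='chat.json', jq_schema='.messages[].content', text_content=True)
data = loader.load()
# len(data) now equals the number of messages in the file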
Let’s load an MS Word file as an example.
from langchain_community.document_loaders import UnstructuredWordDocumentLoader
loader = UnstructuredWordDocumentLoader(file_path='Polars.docx', mode='elements', chunking_strategy='by_title',
max_characters=200, new_after_n_chars=20)
data = loader.load()
len(data)
>>> 67
As we have seen, LangChain uses the Unstructured library to load files in different formats. Since the libraries are updated frequently, finding documentation for all the parameters can require searching through the source code. We can find the chunking parameters of this loader under the ‘add_chunking_strategy’ function on GitHub.
Multiple PDF parser integrations are available in LangChain. We can compare the various parsers and choose a suitable one. Here is the benchmark.
Some of the available parsers are PyMuPDF, PyPDF, PDFPlumber, etc.
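As a quick comparison, here is a minimal sketch of PyPDFLoader, which loads each page of the PDF as a separate Document (assuming the pypdf package is installed):

from langchain_community.document_loaders import PyPDFLoader

# each page of the PDF becomes its own Document, with the page number in the metadata
loader = PyPDFLoader('how-to-formulate-successful-business-strategy.pdf')
data = loader.load()
# len(data) equals the number of pages in the PDF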
Let’s try UnstructuredPDFLoader:
from langchain_community.document_loaders import UnstructuredPDFLoader
loader = UnstructuredPDFLoader('how-to-formulate-successful-business-strategy.pdf', mode='elements', strategy="auto")
data = loader.load()
len(data)
>>> 177
In this code, mode='elements' splits the PDF into element-level Documents (as with the HTML and Markdown loaders above), and strategy="auto" lets the Unstructured library choose the partitioning strategy based on the document.
If we want to load multiple files from a directory, we can use DirectoryLoader:
from langchain_community.document_loaders import DirectoryLoader
loader = DirectoryLoader(".", glob="**/*.json", loader_cls=JSONLoader, loader_kwargs={'jq_schema': '.', 'text_content':False},
show_progress=True, use_multithreading=True)
docs = loader.load()
len(docs)
>>> 1
As we can see, we specify which loader to use with the loader_cls parameter and pass that loader’s arguments through the loader_kwargs parameter.
If you want to summarize a YouTube video or search through its transcript, this is the loader you need. Make sure you use the video_id, not the entire URL, as shown below:
from langchain_community.document_loaders import YoutubeLoader
video_url = 'https://www.youtube.com/watch?v=LKCVKw9CzFo'  # full URL, shown for reference only
loader = YoutubeLoader(video_id='LKCVKw9CzFo', add_video_info=True)  # pass only the video ID
data = loader.load()
len(data)
>>> 1
We can get the transcript using data[0].page_content and the video information using data[0].metadata.
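If we would rather pass the full URL, YoutubeLoader also provides a from_youtube_url class method that extracts the video ID for us; a sketch using the same video:

from langchain_community.document_loaders import YoutubeLoader

# construct the loader from the full URL instead of the video ID
loader = YoutubeLoader.from_youtube_url('https://www.youtube.com/watch?v=LKCVKw9CzFo', add_video_info=True)
data = loader.load()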
WikipediaLoader fetches Wikipedia article content based on a search query. The code below extracts the top five articles from the Wikipedia search results. Make sure you install the wikipedia package with ‘pip install wikipedia’.
from langchain_community.document_loaders import WikipediaLoader
loader = WikipediaLoader(query='Generative AI', load_max_docs=5, doc_content_chars_max=5000, load_all_available_meta=True)
data = loader.load()
len(data)
>>> 5
We can control the article content length with doc_content_chars_max, and load_all_available_meta=True fetches all the available metadata about each article.
data[0].metadata.keys()
>>> dict_keys(['title', 'summary', 'source', 'categories', 'page_url', 'image_urls', 'related_titles', 'parent_id', 'references', 'revision_id', 'sections'])
for i in data:
    print(i.metadata['title'])
>>> Generative artificial intelligence
AI boom
Generative pre-trained transformer
ChatGPT
Artificial intelligence
LangChain offers a comprehensive and versatile framework for loading data from various sources, making it an invaluable tool for developing applications powered by Large Language Models (LLMs). By integrating multiple file types and data sources, such as CSV files, MS Office documents, PDF files, YouTube videos, and Wikipedia articles, LangChain allows developers to gather and standardize diverse data into Document objects, facilitating seamless data processing and analysis.
In the next article, we will learn why we need to split the documents and how to do it. Stay tuned to Analytics Vidhya Blogs for the next update!
Ans. LangChain offers a range of functionalities, including loading, splitting, embedding, and retrieving data. It also supports parsing LLM outputs, adding tools and agentic capabilities to LLMs, and integrating with hundreds of third-party services. Additionally, it includes components like LangGraph for building stateful agents and LangSmith for productionizing LLM applications.
Ans. Document loaders in LangChain are tools that convert data from various formats (e.g., CSV, PDF, HTML) into standardized Document objects. These objects include the data’s content, an optional ID, and metadata. Document loaders facilitate the seamless integration and processing of data from diverse sources into LLM applications.
Ans. LangChain supports over two hundred document loaders categorized by file type (e.g., CSV, PDF, HTML) and data source (e.g., YouTube, Wikipedia, GitHub). Public data sources like YouTube and Wikipedia can be accessed without tokens, while private data sources like AWS or Azure require access tokens. Each loader is designed to parse and load data appropriately based on the specific format or source.