What are LangChain Document Loaders?


Introduction

LLMs (large language models) are becoming increasingly relevant in various businesses and organizations. Their ability to understand and analyze data and make sense of complex information can drive innovation, improve operational efficiency, and deliver personalized experiences across various industries. Integrating with various tools allows us to build LLM applications that can automate tasks, provide insights, and support decision-making processes.

However, building these applications can be complex and time-consuming, requiring a framework to streamline development and ensure scalability. A framework provides standardized tools and processes, making developing, deploying, and maintaining effective LLM applications easier. So, let’s learn about LangChain, the most popular framework for developing LLM applications.


Overview

  • LangChain Document Loaders convert data from various formats (e.g., CSV, PDF, HTML) into standardized Document objects for LLM applications.
  • They facilitate the seamless integration and processing of diverse data sources, such as YouTube, Wikipedia, and GitHub, into Document objects.
  • Document loaders in LangChain enable developers to manage and standardize content for large language model workflows efficiently.
  • They support a wide range of data formats and sources, enhancing the versatility and scalability of LLM-powered applications.
  • LangChain’s document loaders streamline the conversion of raw data into structured formats, which is essential for building and maintaining effective LLM applications.

LangChain Overview

LangChain's functionality ranges from loading, splitting, embedding, and retrieving data for the LLM to parsing the LLM's output. It also covers adding tools and agentic capabilities to the LLM, plus hundreds of third-party integrations. The LangChain ecosystem further includes LangGraph for building stateful agents and LangSmith for productionizing LLM applications. You can learn more about LangChain in Building LLM-Powered Applications with LangChain.

In a series of articles, we will learn about the different components of LangChain. Since it all starts with data, we will begin by loading data from various file types and data sources with LangChain's document loaders.

What are Document Loaders?

Document loaders convert data from diverse data formats to standardized Document objects. The Document object consists of page_content, which has the data as a string, optionally an ID for the Document, and metadata that provides information on the data.

Let’s create a document object to learn how it works:

To get started, install the LangChain framework using 'pip install langchain'. The loaders used below live in the community package, so you will also need 'pip install langchain-community'.

from langchain_core.documents import Document

data = Document(page_content='This is the article about document loaders of Langchain', id=1, metadata={'source':'AV'})

data
>>> Document(id='1', metadata={'source': 'AV'}, page_content='This is the article about document loaders of Langchain')

data.page_content
>>> 'This is the article about document loaders of Langchain'

data.id = 2 # this changes the id of the Document object

As we can see, we can create a Document object with page_content, id, and metadata, and then access and modify its contents.

Types of Document Loaders

There are more than two hundred document loaders in LangChain. They can be categorized as follows:

  • Based on file type: These document loaders parse and load the documents based on the file type. Example file types include CSV, PDF, HTML, Markdown, etc.
  • Based on data source: They get the data from different data sources and load it into Document objects. Examples of data sources include YouTube, Wikipedia, and GitHub.

Data sources can be further classified as public and private. Public data sources like YouTube or Wikipedia don't need access tokens, while private data sources like AWS or Azure do. Let's use a few document loaders to understand how they work.

CSV (Comma-Separated Values)

CSV files can be loaded with CSVLoader. It loads each row as a Document.

from langchain_community.document_loaders.csv_loader import CSVLoader


loader = CSVLoader(file_path="./iris.csv", metadata_columns=['species'], csv_args={"delimiter": ","})
data = loader.load()
len(data)
>>> 150   # for 150 rows

We can add any columns to the metadata using metadata_columns. We can also use a column's value as the source instead of the file name, as the comment and sketch below show.

data[0].metadata
>>> {'source': './iris.csv', 'row': 0, 'species': 'setosa'}

# we can change the source to 'setosa' with the parameter source_column='species'
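Here is a minimal sketch of that option, assuming the same iris.csv; with source_column, each row's source becomes the value of its species column:

loader2 = CSVLoader(file_path="./iris.csv", source_column='species')
data2 = loader2.load()
data2[0].metadata
>>> {'source': 'setosa', 'row': 0}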

for record in data[:1]:
    print(record)
>>> page_content='sepal_length: 5.1
    sepal_width: 3.5
    petal_length: 1.4
    petal_width: 0.2' metadata={'source': './iris.csv', 'row': 0, 'species': 'setosa'}

As shown, LangChain's CSV loader turns each row into its own Document object.

HTML (HyperText Markup Language)

We can load an HTML page either from a saved HTML file or directly from a URL.

from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_community.document_loaders import UnstructuredURLLoader

loader = UnstructuredURLLoader(urls=['https://diataxis.fr'], mode='elements')
data = loader.load()
len(data)
>>> 61

The entire HTML page is loaded as one Document if the mode is 'single'. If the mode is 'elements', a separate Document is created per HTML element (tag), which is why we got 61 of them above.
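For comparison, here is a minimal sketch of 'single' mode (the loader's default), which should return the whole page as one Document:

single_loader = UnstructuredURLLoader(urls=['https://diataxis.fr'])  # mode defaults to 'single'
single_data = single_loader.load()
len(single_data)
>>> 1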

# accessing metadata and content in a document

data[28].metadata
>>> {'languages': ['eng'], 'parent_id': '312017038db4f2ad1e9332fc5a40bb9d', 
'filetype': 'text/html', 'url': 'https://diataxis.fr', 'category': 'NarrativeText'}

data[28].page_content
>>> "Diátaxis is a way of thinking about and doing documentation"

Markdown

Markdown is a markup language for creating formatted text using a simple text editor.

from langchain_community.document_loaders import UnstructuredMarkdownLoader

# can download from here https://github.com/dsanr/best-of-ML/blob/main/README.md
loader = UnstructuredMarkdownLoader('README.md', mode='elements')
data = loader.load()
len(data)
>>> 1458

In addition to 'single' and 'elements', this loader also has a 'paged' mode, which partitions the file based on page numbers, as sketched below.
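A quick sketch of 'paged' mode with the same README.md; elements are grouped by their page number, and since a markdown file has no page breaks, this typically collapses into far fewer Documents than 'elements' mode:

paged_loader = UnstructuredMarkdownLoader('README.md', mode='paged')
paged_data = paged_loader.load()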

data[700].metadata
>>> {'source': 'README.md', 'last_modified': '2024-07-09T12:52:53', 'languages': ['eng'], 'filetype': 'text/markdown', 'filename': 'README.md', 'category': 'Title'}

data[700].page_content
>>> 'NeuralProphet (🥈28 ·  ⭐ 3.7K) - NeuralProphet: A simple forecasting package.'

JSON

We can copy the sample JSON content from the LangChain documentation page – How to load JSON? – and save it as chat.json. Note that JSONLoader relies on the jq package, installable with 'pip install jq'.

from langchain_community.document_loaders import JSONLoader

loader = JSONLoader(file_path='chat.json', jq_schema='.', text_content=False)
data = loader.load()
len(data)
>>> 1

In JSONLoader, we need to mention the schema. With jq_schema='.', all the content is loaded. Depending on which part of the JSON we need, we can change the schema: for example, jq_schema='.title' for the title, or jq_schema='.messages[].content' to get only the content of the messages, as sketched below.
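Here is a minimal sketch of that last schema, assuming chat.json has a top-level 'messages' list whose items contain a 'content' field:

loader = JSONLoader(file_path='chat.json', jq_schema='.messages[].content', text_content=False)
data = loader.load()
# each message's content becomes its own Document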

MS Office docs

Let’s load an MS Word file as an example.

from langchain_community.document_loaders import UnstructuredWordDocumentLoader

loader = UnstructuredWordDocumentLoader(file_path='Polars.docx', mode='elements', chunking_strategy='by_title', 
                                        max_characters=200, new_after_n_chars=20)
                                        
data = loader.load()
len(data)
>>> 67

As we have seen, LangChain uses the Unstructured library to load files of different formats. Since these libraries are updated frequently, documentation for every parameter can be hard to find; the chunking parameters used above are defined under the 'add_chunking_strategy' function in the Unstructured source code on GitHub.

PDF (Portable Document Format)

Multiple PDF parser integrations are available in LangChain. We can compare the various parsers and choose a suitable one; here is the benchmark.

Some of the available parsers are PyMuPDF, PyPDF, PDFPlumber, etc.

Let's try UnstructuredPDFLoader:

from langchain_community.document_loaders import UnstructuredPDFLoader

loader = UnstructuredPDFLoader('how-to-formulate-successful-business-strategy.pdf', mode='elements', strategy="auto")

data = loader.load()
len(data)
>>> 177

Here is the code explanation:

  • The 'strategy' parameter defines how to process the PDF.
  • The 'hi_res' strategy uses the Detectron2 model to identify the document's layout.
  • The 'ocr_only' strategy uses Tesseract to extract text, even from images.
  • The 'fast' strategy uses pdfminer to extract the text.
  • The default 'auto' strategy picks one of the above based on the document and the parameter arguments, as in the sketch after this list.
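For instance, here is a minimal sketch that forces the 'fast' (pdfminer-based) strategy on the same file instead of letting 'auto' decide; the element count may differ from the 'auto' run above:

loader = UnstructuredPDFLoader('how-to-formulate-successful-business-strategy.pdf', mode='elements', strategy='fast')
data = loader.load()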

Multiple Files

If we want to load multiple files from a directory, we can use DirectoryLoader:

from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader(".", glob="**/*.json", loader_cls=JSONLoader, loader_kwargs={'jq_schema': '.', 'text_content':False},
                         show_progress=True, use_multithreading=True)
                         
docs = loader.load()
len(docs)
>>> 1

As we can see, we specify which loader to use with the loader_cls parameter and pass that loader's arguments through the loader_kwargs parameter.

YouTube

If you want the summary of a YouTube video or want to search through its transcript, this is the loader you need. Make sure you use the video_id, not the entire URL, as shown below.

from langchain_community.document_loaders import YoutubeLoader

video_url = 'https://www.youtube.com/watch?v=LKCVKw9CzFo'
loader = YoutubeLoader(video_id='LKCVKw9CzFo', add_video_info=True)  # video_id is the part after 'v=' in video_url; add_video_info=True needs the pytube package
data = loader.load()
len(data)
>>> 1

We can get the transcript using data[0].page_content and the video information using data[0].metadata.
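If you'd rather not extract the ID by hand, YoutubeLoader also provides a from_youtube_url class method that parses the ID out of a full URL:

loader = YoutubeLoader.from_youtube_url(video_url, add_video_info=True)
data = loader.load()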

Wikipedia

We can get Wikipedia article content based on a search query. The code below extracts the top five articles from Wikipedia's search results. Make sure you install the Wikipedia package with 'pip install wikipedia'.

from langchain_community.document_loaders import WikipediaLoader

loader = WikipediaLoader(query='Generative AI', load_max_docs=5, doc_content_chars_max=5000, load_all_available_meta=True)
data = loader.load()
len(data)
>>> 5

We can control the article content length with doc_content_chars_max, and load_all_available_meta=True returns all the available information about each article.

data[0].metadata.keys()
>>> dict_keys(['title', 'summary', 'source', 'categories', 'page_url', 'image_urls', 'related_titles', 'parent_id', 'references', 'revision_id', 'sections'])

for i in data:
    print(i.metadata['title'])
>>> Generative artificial intelligence
AI boom
Generative pre-trained transformer
ChatGPT
Artificial intelligence

Conclusion

LangChain offers a comprehensive and versatile framework for loading data from various sources, making it an invaluable tool for developing applications powered by Large Language Models (LLMs). By integrating multiple file types and data sources, such as CSV files, MS Office documents, PDF files, YouTube videos, and Wikipedia articles, LangChain allows developers to gather and standardize diverse data into Document objects, facilitating seamless data processing and analysis.

In the next article, we will learn why we need to split the documents and how to do it. Stay tuned to Analytics Vidhya Blogs for the next update!

Frequently Asked Questions

Q1. What is LangChain, and why is it important for developing LLM applications?

Ans. LangChain is the most popular framework for developing LLM applications. Building these applications can be complex and time-consuming, and LangChain provides the standardized tools and processes that make them easier to develop, deploy, and maintain, along with hundreds of third-party integrations and ecosystem components like LangGraph and LangSmith.

Q2. What functionalities does LangChain offer for working with data?

Ans. LangChain offers a range of functionalities, including loading, splitting, embedding, and retrieving data. It also supports parsing LLM outputs, adding tools and agentic capabilities to LLMs, and integrating with hundreds of third-party services. Additionally, it includes components like LangGraph for building stateful agents and LangSmith for productionizing LLM applications.

Q3. What are document loaders in LangChain, and what is their purpose?

Ans. Document loaders in LangChain are tools that convert data from various formats (e.g., CSV, PDF, HTML) into standardized Document objects. These objects include the data’s content, an optional ID, and metadata. Document loaders facilitate the seamless integration and processing of data from diverse sources into LLM applications.

Q4. How does LangChain handle different types of files and data sources?

Ans. LangChain supports over two hundred document loaders categorized by file type (e.g., CSV, PDF, HTML) and data source (e.g., YouTube, Wikipedia, GitHub). Public data sources like YouTube and Wikipedia can be accessed without tokens, while private data sources like AWS or Azure require access tokens. Each loader is designed to parse and load data appropriately based on the specific format or source.


I am a Data Scientist interested in Machine Learning, Natural Language Processing, and Generative AI. I am interested in building products that leverage these technologies to solve real-world problems and drive innovation in various industries.
