In our previous article about LangChain Document Loaders, we explored how LangChain’s document loaders facilitate loading various file types and data sources into an LLM application. Can we send the data to the LLM now? Not so fast. LLMs have a limited context window, measured in tokens, so any data beyond that size is cut off, leading to potential loss of information and less accurate responses. Even if the context size were infinite, more input tokens mean higher costs, and money is not infinite. So, rather than sending all the data to the LLM, it is better to send only the data that is relevant to our query. To achieve this, we need to split the data first, and that is what LangChain Text Splitters are for. Now, let’s learn about LangChain Text Splitters.
Text splitters split large volumes of text into smaller chunks so that we can retrieve more relevant content for the given query. These splitters can be applied directly to raw text or to document objects loaded using LangChain’s document loaders.
Several methods are available for splitting data, each tailored to different types of content and use cases. Here are the various ways we can employ text splitters to enhance data processing.
Also read: A Comprehensive Guide to Using Chains in Langchain
LangChain Text Splitters are essential for handling large documents by breaking them into manageable chunks. This improves performance, enhances contextual understanding, allows parallel processing, and facilitates better data management. Additionally, they enable customized processing and robust error handling, optimizing NLP tasks and making them more efficient and accurate. In the sections below, we discuss the different methods for splitting data into chunks.
First, install the package using 'pip install langchain-text-splitters'.
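As mentioned earlier, a splitter can work on raw strings or on Document objects produced by a document loader. Here is a minimal sketch of the two entry points, split_text and split_documents (the sample string and chunk size are arbitrary, purely for illustration):
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
sample = "LangChain text splitters break long text into smaller chunks for retrieval."
splitter = RecursiveCharacterTextSplitter(chunk_size=40, chunk_overlap=0)
# split_text takes a plain string and returns a list of strings
chunks = splitter.split_text(sample)
# split_documents takes Document objects (e.g. from a loader) and returns Documents
docs = splitter.split_documents([Document(page_content=sample)])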
The text is split based on the number of characters, and we can specify the separator to use for splitting. Let’s understand this with code. You can download the document used here: Free Strategy Formulation E-Book.
from langchain_community.document_loaders import UnstructuredPDFLoader
from langchain_text_splitters import CharacterTextSplitter
# load the data
loader = UnstructuredPDFLoader('how-to-formulate-successful-business-strategy.pdf', mode='single')
data = loader.load()
text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=500,
    chunk_overlap=0,
    is_separator_regex=False,
)
This splitter produces chunks of at most 500 characters each. Since we use the newline character (“\n”) as the separator, the text is split only at newlines. If a chunk exceeds 500 characters but contains no newline, it is returned as is.
texts = text_splitter.split_documents(data)
# Created a chunk of size 535, which is longer than the specified 500
# Created a chunk of size 688, which is longer than the specified 500
len(texts)
>>> 73
for i in texts[48:49]:
    print(len(i.page_content))
    print(i.page_content)
Output
As we can see, the above-displayed chunk has 688 characters.
Rather than using a single separator, this method uses multiple separators. It tries each separator in turn, recursively, until every chunk is smaller than chunk_size. We can use this to split the text by sentence, as shown below.
from langchain_text_splitters import RecursiveCharacterTextSplitter
loader = UnstructuredPDFLoader('how-to-formulate-successful-business-strategy.pdf', mode='single')
data = loader.load()
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", r"(?<=[.?!])\s+"],
    keep_separator=False,
    is_separator_regex=True,
    chunk_size=30,
    chunk_overlap=0,
)
texts = recursive_splitter.split_documents(data)
len(texts)
>>> 293
# a few sample chunks
for text in texts[123:129]:
    print(len(text.page_content))
    print(text.page_content)
Output
As we can see, we specified three separators, the third of which splits the text into sentences using a regex.
Both of the above methods use character counts. Since LLMs count input in tokens, we can also split the data by token count. Different LLMs use different token encodings; let us use the encoding used by GPT-4o and GPT-4o-mini. You can find the model-to-encoding mapping here—GitHub link.
from langchain_text_splitters import TokenTextSplitter
text_splitter = TokenTextSplitter(encoding_name='o200k_base', chunk_size=50, chunk_overlap=0)
texts = text_splitter.split_documents(data)
len(texts)
>>> 105
We can also use character text splitter methods along with token counting.
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name='o200k_base',
    separators=["\n\n", "\n", r"(?<=[.?!])\s+"],
    keep_separator=False,
    is_separator_regex=True,
    chunk_size=10,  # chunk_size is the number of tokens
    chunk_overlap=0,
)
texts = text_splitter.split_documents(data)
len(texts)
>>> 279
for i in texts[:4]:
    print(len(i.page_content))
    print(i.page_content)
Output
As shown, we can use token counting along with a recursive text splitter.
Among the three methods mentioned above, a recursive splitter with either character or token counting is better for splitting plain text data.
While the above methods work well for plain text, if the data has some inherent structure, like HTML or Markdown pages, it is better to split it with that structure in mind.
We can split an HTML page based on its headers.
from langchain_text_splitters import HTMLHeaderTextSplitter, HTMLSectionSplitter
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on, return_each_element=True)
html_header_splits = html_splitter.split_text_from_url('https://diataxis.fr/')
len(html_header_splits)
>>> 37
Here, we split the HTML page fetched from the URL by the headers h1, h2, and h3. We can also use this class with a file path or an HTML string; an example follows the metadata output below.
for header in html_header_splits[20:22]:
    print(header.metadata)
>>> {'Header 1': 'Diátaxis¶'}
{'Header 1': 'Diátaxis¶', 'Header 2': 'Contents¶'} # there is no h3 in this page.
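As noted, the same splitter also accepts a local file or a raw HTML string; a brief sketch (the file name and HTML string here are hypothetical placeholders):
# split a local HTML file (hypothetical path)
splits_from_file = html_splitter.split_text_from_file('diataxis.html')
# split an HTML string directly
html_string = "<h1>Title</h1><p>Intro text.</p><h2>Section</h2><p>More text.</p>"
splits_from_string = html_splitter.split_text(html_string)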
Similarly, we can split Markdown text by its headers with MarkdownHeaderTextSplitter, as sketched below.
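A rough sketch, assuming a small Markdown string made up for illustration:
from langchain_text_splitters import MarkdownHeaderTextSplitter
markdown_text = "# Title\n\nIntro.\n\n## Section 1\n\nDetails of section 1.\n\n## Section 2\n\nDetails of section 2."
md_headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]
md_splitter = MarkdownHeaderTextSplitter(md_headers_to_split_on)
md_splits = md_splitter.split_text(markdown_text)
# each split is a Document whose metadata records the headers it falls under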
We can also split based on other sections of the HTML. For that, we need the HTML as text.
import requests
r = requests.get('https://diataxis.fr/')
sections_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("p", "section"),
]
html_splitter = HTMLSectionSplitter(sections_to_split_on)
html_section_splits = html_splitter.split_text(r.text)
len(html_section_splits)
>>> 18
for section in html_section_splits[1:6]:
    print(len(section.page_content))
    print(section)
Output
Here, we use h1, h2, and p tags to split the data in the HTML page.
Since programming languages have a different structure than plain text, we can split code based on the syntax of the specific language.
from langchain_text_splitters import RecursiveCharacterTextSplitter, Language
PYTHON_CODE = """
def add(a, b):
return a + b
class Calculator:
def __init__(self):
self.result = 0
def add(self, value):
self.result += value
return self.result
def subtract(self, value):
self.result -= value
return self.result
# Call the function
def main():
calc = Calculator()
print(calc.add(5))
print(calc.subtract(2))
if __name__ == "__main__":
main()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=100, chunk_overlap=0)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs
Output
Here, the Python code is split based on syntax keywords like class, def, etc. We can find the separators for different languages here – GitHub Link.
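If you want to inspect those separators without leaving the notebook, the splitter class can list them for a given language (the commented output is what I would expect for Python, abridged):
# list the syntax-aware separators used for Python
RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)
# e.g. ['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']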
A nested JSON object can be split such that the initial JSON keys are present in all the related chunks of text. If there are any long lists inside, we can convert them into dictionaries before splitting. Let’s look at an example.
from langchain_text_splitters import RecursiveJsonSplitter
# Example JSON object
json_data = {
    "company": {
        "name": "TechCorp",
        "location": {
            "city": "Metropolis",
            "state": "NY"
        },
        "departments": [
            {
                "name": "Research",
                "employees": [
                    {"name": "Alice", "age": 30, "role": "Scientist"},
                    {"name": "Bob", "age": 25, "role": "Technician"}
                ]
            },
            {
                "name": "Development",
                "employees": [
                    {"name": "Charlie", "age": 35, "role": "Engineer"},
                    {"name": "David", "age": 28, "role": "Developer"}
                ]
            }
        ]
    },
    "financials": {
        "year": 2023,
        "revenue": 1000000,
        "expenses": 750000
    }
}
# Initialize the RecursiveJsonSplitter with a maximum chunk size
splitter = RecursiveJsonSplitter(max_chunk_size=200, min_chunk_size=20)
# Split the JSON object
chunks = splitter.split_text(json_data, convert_lists=True)
# Process the chunks as needed
for chunk in chunks:
    print(len(chunk))
    print(chunk)
Output
This splitter maintains initial keys such as company and departments if the chunk contains data corresponding to those keys.
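If we want the chunks as Python dicts, or wrapped in Document objects for downstream use, the same splitter also offers split_json and create_documents; a brief sketch reusing the splitter and json_data from above:
# split_json returns the chunks as dicts instead of JSON strings
json_chunks = splitter.split_json(json_data, convert_lists=True)
# create_documents wraps the chunks in Document objects
docs = splitter.create_documents(texts=[json_data], convert_lists=True)
print(docs[0].page_content)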
The above methods work based on the text’s structure. However, splitting between two sentences may not be helpful if they have similar meanings. Instead, we can utilize sentence embeddings and cosine similarity to identify natural break points where the semantic content of adjacent sentences diverges significantly.
The steps are: embed each sentence (together with a small buffer of neighboring sentences), compute the distance between adjacent embeddings, and split at the points where that distance exceeds a chosen threshold.
from langchain_community.document_loaders import WikipediaLoader
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
# make sure to add OPENAI_API_KEY
loader = WikipediaLoader(query='Generative AI', load_max_docs=1, doc_content_chars_max=5000, load_all_available_meta=True)
data = loader.load()
semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(model='text-embedding-3-small'),
    buffer_size=1,
    breakpoint_threshold_type='percentile',
    breakpoint_threshold_amount=70,
)
texts = semantic_splitter.create_documents([data[0].page_content])
len(texts)
>>> 10
for text in texts[:2]:
    print(len(text.page_content))
    print(text.page_content)
Output
The document, which contains 29 sentences, is split into 10 chunks. The following breakpoint threshold types are available, along with their default values (a configuration sketch follows the list):
“percentile”: 95,
“standard_deviation”: 3,
“interquartile”: 1.5,
“gradient”: 95
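For example, to break wherever the distance between adjacent sentence embeddings exceeds three standard deviations instead of a percentile, the chunker could be configured like this (a sketch reusing the document loaded above):
std_splitter = SemanticChunker(
    OpenAIEmbeddings(model='text-embedding-3-small'),
    breakpoint_threshold_type='standard_deviation',
    breakpoint_threshold_amount=3,
)
std_texts = std_splitter.create_documents([data[0].page_content])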
If you’re interested in learning how embedding models compute embeddings for sentences, look for the next article, where we’ll discuss the details.
This article explored various text-splitting methods using LangChain, including splitting by character count, recursive splitting, token count, HTML structure, code syntax, JSON objects, and semantic splitting. Each method offers unique advantages for processing different data types, enhancing the efficiency and relevance of the content sent to LLMs. By understanding and implementing these techniques, you can optimize data for better accuracy and lower costs in your LLM applications.
Q1. What are text splitters in LangChain?
Ans. Text splitters are tools that divide large volumes of text into smaller chunks, making it easier to process and retrieve relevant content for queries in LLM applications.
Q2. Why do we need to split text before sending it to an LLM?
Ans. Splitting text is crucial because LLMs have limits on context window size. Sending smaller, relevant chunks ensures no information is lost and lowers processing costs.
Q3. What splitting methods does LangChain offer?
Ans. LangChain offers various methods such as splitting by character count, token count, recursive splitting, HTML structure, code syntax, JSON objects, and semantic splitting.
Q4. How do we implement a text splitter?
Ans. Implementing a text splitter involves installing the LangChain package, writing code to specify splitting criteria, and applying the splitter to different data formats.
Q5. What is semantic splitting, and when should we use it?
Ans. Semantic splitting uses sentence embeddings and cosine similarity to keep semantically similar content together. It’s ideal for maintaining the context and meaning in text chunks.