The easiest way to learn anything, whether for academics or personal growth, is to break it down into smaller, more manageable chunks. A complex subject can feel overwhelming at first, but dividing it into bite-sized pieces makes it far easier to understand, and even a seemingly small concept can usually be split further. This chunking habit makes it easier to grasp or learn something and forms the foundation for how we process information in everyday life. Surprisingly, machines work in much the same way. Chunking is not just a study trick; it is a cognitive psychology concept that plays a vital role in data processing and in AI systems that use RAG. In this article, we will walk through 8 types of chunking in RAG, with some hands-on examples!
Chunking is the process of breaking down large pieces of text into smaller, more manageable parts. This technique is crucial when working with language models because it ensures that the provided data fits within the model’s context window while maintaining the relevance and quality of the information.
By context window, I mean the amount of input a language model can accept at once. You can supply your own data to the model, but you cannot pass unlimited data. This is because:
There is always a limit on the number of words or tokens that you can provide to the language model, and that limit varies from model to model; OpenAI's models, for example, each have a published context window size (see the token-counting sketch after this list).
Language models perform better when the signal-to-noise ratio is high. In other words, reducing irrelevant or distracting information in the model’s context window can significantly enhance performance.
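Since these limits are counted in tokens rather than words, it helps to check how much of a model's context window your data will actually consume before sending it. Here is a minimal sketch using the tiktoken library; installing tiktoken and the choice of encoding are assumptions (cl100k_base is the tokenizer used by many OpenAI chat models):

import tiktoken

# cl100k_base is the tokenizer used by many OpenAI chat models
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Return the number of tokens the text consumes."""
    return len(encoding.encode(text))

sample = "Clouds come floating into my life, no longer to carry rain or usher storm."
print(count_tokens(sample))  # compare this count against the model's context window limit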
So, the primary goal of chunking is not to split data arbitrarily, but to optimize the way information is presented to the model. Proper chunking enhances the retrievability of useful content and improves the overall performance of applications that rely on AI models.
Anton Troynikov, co-founder of Chroma, points out that unnecessary data within the context window can measurably degrade the overall effectiveness of an application. By focusing only on relevant content, we can optimize the model’s output and ensure more accurate, efficient responses.
Makes sense, right? To summarize, chunking is important because it keeps your data within the model’s context window, raises the signal-to-noise ratio by leaving out irrelevant text, and makes the most useful content easier to retrieve.
In summary, chunking is a foundational step in preparing text data for language models. It helps in balancing data volume, relevance, and retrievability, making it a critical practice in building efficient AI-powered applications.
Let’s understand this with the RAG architecture:
In Retrieval-Augmented Generation (RAG), chunking involves breaking down raw data sources (such as PDFs, spreadsheets, or other documents) into smaller, manageable pieces called “chunks of text.” The system then processes these chunks, converts them into vector embeddings, and stores them in a vector database (e.g., Chroma) to enable efficient retrieval when a user asks a question.
In short, Chunking refers to dividing large text data into smaller, manageable pieces to improve retrieval efficiency and relevance in downstream tasks like search and generation.
Once the chunks are embedded, they are stored in a vector database and indexed so they can be matched against the embedding of a user’s query.
After chunk retrieval, the most relevant chunks are passed to the language model along with the user’s question, and the model generates a grounded answer.
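To make this flow concrete, here is a minimal, illustrative sketch of storing chunks in Chroma and retrieving the ones most similar to a user question. The collection name and sample texts are made up, and Chroma's default embedding function is used:

import chromadb

client = chromadb.Client()  # in-memory client for demonstration
collection = client.create_collection(name="rag_chunks")

# Store a few text chunks; Chroma embeds them with its default embedding function
chunks = [
    "Chunking splits large documents into smaller pieces.",
    "A vector database stores embeddings for similarity search.",
    "The retrieved chunks are passed to the LLM along with the question.",
]
collection.add(documents=chunks, ids=[f"chunk_{i}" for i in range(len(chunks))])

# Retrieve the chunks most relevant to a user query
results = collection.query(query_texts=["How does retrieval work in RAG?"], n_results=2)
print(results["documents"])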
Also read: RAG vs Agentic RAG: A Comprehensive Guide
Let’s understand the drawbacks of RAG.
By implementing the right chunking strategies, the RAG pipeline can achieve more accurate retrieval, richer contextual grounding, and higher-quality response generation, ultimately enhancing the overall system’s reliability and user satisfaction.
Choosing the right chunking strategy involves carefully considering the content type, the embedding model, and the expected user queries. Here’s a guide organized around those three factors:
Content characteristics heavily influence chunking strategy. Example Scenario:
Different embedding models have varying limitations and strengths. Key Considerations:
Steps to Optimize
Understanding what users are likely to search for helps design the chunking strategy. Example User Queries:
In the next section, I will discuss different chunking strategies in detail.
This method is one of the simplest approaches to chunking or splitting text. It divides the text into fixed-sized chunks of N characters, regardless of the content or structure. While it is a basic technique, it serves as an excellent starting point for understanding the fundamentals of text chunking and how it works in practice.
This approach is simple to use; however, it is very rigid and does not take the structure of your text into account.
text = "Clouds come floating into my life, no longer to carry rain or usher storm, but to add color to my sunset sky."
chunks = []
chunk_size = 35
chunk_overlap = 5 # Characters
# Run through the text with the length of your text and iterate every chunk_size,
# considering the overlap for the starting position of the next chunk.
for i in range(0, len(text) - chunk_size + 1, chunk_size - chunk_overlap):
chunk = text[i:i + chunk_size]
chunks.append(chunk)
chunks
Output
['Clouds come floating into my life, ',
'ife, no longer to carry rain or ush',
'r usher storm, but to add color to ']
Step size calculation: the loop advances by chunk_size - chunk_overlap = 35 - 5 = 30 characters each iteration.
Let’s analyze how the loop runs with these values:
First chunk (index 0 to 35):
Extracts the substring “Clouds come floating into my life, “.
The loop then moves forward by 30 characters.
Second chunk (index 30 to 65):
Extracts the substring “ife, no longer to carry rain or ush”.
Notice how the last 5 characters of the previous chunk (“ife, ”) reappear at the start of this chunk.
Third chunk (index 60 to 95):
Extracts the substring “r usher storm, but to add color to “.
Again, there’s an overlap with the last few characters from the second chunk.
%pip install -qU langchain-text-splitters
This command installs the langchain-text-splitters library, which is used for splitting long pieces of text into smaller chunks.
The -q flag suppresses installation output, and -U ensures that the latest version is installed.
from langchain_text_splitters import CharacterTextSplitter

# Load an example document
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)
This code sets up a CharacterTextSplitter object with the following parameters: separator="\n\n" splits the text on blank lines (paragraph breaks); chunk_size=1000 caps each chunk at 1,000 characters; chunk_overlap=200 lets consecutive chunks share up to 200 characters; length_function=len measures chunk length in characters; and is_separator_regex=False treats the separator as a literal string rather than a regular expression.
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
The create_documents() method takes the list of texts (in this case, a single document) and splits it based on the specified parameters (chunk size, overlap, separator).
The result is a list of chunked document objects, where each chunk contains a portion of the original text.
Chunking in action: the splitter first divides the document at the "\n\n" separator and then packs the resulting pieces into chunks of at most 1,000 characters.
Overlap handling: consecutive chunks share up to 200 characters, so context at chunk boundaries is not lost.
Unlike the first method, which ignores document structure, this method recursively divides text using a predefined list of separators and intelligently merges the resulting smaller pieces into larger chunks. The final chunks are optimized to contain no more than N characters, ensuring efficient text processing and context preservation.
It is parameterized by a list of characters. The default list is ["\n\n", "\n", " ", ""], so the splitter tries paragraph breaks first, then line breaks, then spaces, and finally falls back to splitting on individual characters.
%pip install -qU langchain-text-splitters
text = """
The Marvel Universe is a vast and interconnected world filled with superheroes, villains, and epic storytelling that has captivated audiences for decades. Founded by visionaries such as Stan Lee, Jack Kirby, and Steve Ditko, Marvel Comics has introduced some of the most iconic characters in pop culture history. From its early beginnings in 1939 as Timely Publications to its transformation into Marvel Comics in the 1960s, the company has consistently pushed the boundaries of storytelling by creating relatable and dynamic characters. Heroes like Spider-Man, Iron Man, Captain America, and Thor have become household names, each with their own compelling backstories and struggles that resonate with fans across generations. Marvel’s success extends beyond the pages of comic books. The launch of the Marvel Cinematic Universe (MCU) in 2008 with the release of Iron Man revolutionized the film industry, introducing interconnected storylines that culminated in epic crossover events such as The Avengers and Infinity War. The MCU’s success is largely attributed to its ability to blend action, humor, and emotional depth while maintaining the essence of the beloved comic book characters. Audiences have followed the journeys of superheroes as they face powerful foes like Thanos and Loki, all while dealing with their own internal conflicts and responsibilities."""
from langchain_text_splitters import RecursiveCharacterTextSplitter
The RecursiveCharacterTextSplitter is imported from the langchain-text-splitters package.
This class is used to split large text documents into smaller chunks efficiently while preserving context.
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=400,
    chunk_overlap=0,
    length_function=len,
)
text_splitter.create_documents([text])
Output
[Document(metadata={}, page_content='The Marvel Universe is a vast and
interconnected world filled with superheroes, villains, and epic
storytelling that has captivated audiences for decades. Founded by
visionaries such as Stan Lee, Jack Kirby, and Steve Ditko, Marvel Comics has
introduced some of the most iconic characters in pop'),
Document(metadata={}, page_content='culture history. From its early
beginnings in 1939 as Timely Publications to its transformation into Marvel
Comics in the 1960s, the company has consistently pushed the boundaries of
storytelling by creating relatable and dynamic characters. Heroes like
Spider-Man, Iron Man, Captain America, and'),
Document(metadata={}, page_content='Thor have become household names, each
with their own compelling backstories and struggles that resonate with fans
across generations. Marvel’s success extends beyond the pages of comic
books. The launch of the Marvel Cinematic Universe (MCU) in 2008 with the
release of Iron Man revolutionized the'),
Document(metadata={}, page_content='film industry, introducing
interconnected storylines that culminated in epic crossover events such as
The Avengers and Infinity War. The MCU’s success is largely attributed to
its ability to blend action, humor, and emotional depth while maintaining
the essence of the beloved comic book characters.'),
Document(metadata={}, page_content='Audiences have followed the journeys of
superheroes as they face powerful foes like Thanos and Loki, all while
dealing with their own internal conflicts and responsibilities.')]
The resulting list of Document objects contains several chunks of the text. Because chunk_overlap is set to 0, the chunks do not overlap; instead, the splitter tries to break on the separator hierarchy (paragraphs, then lines, then spaces) while keeping each chunk under 400 characters. Here’s a breakdown of the output:
Document-specific chunking is a strategy designed to tailor text-splitting methods to fit different data formats such as images, PDFs, or code snippets. Unlike generic chunking methods, which may not work effectively across various content types, document-specific chunking takes into account the unique structure and characteristics of each format to ensure meaningful segmentation.
For instance, when dealing with Markdown, Python, or JavaScript files, chunking methods are adapted to use format-specific separators, such as headers in Markdown, function definitions in Python, or code blocks in JavaScript. This approach allows for more accurate and context-aware chunking, ensuring that key elements of the content remain intact and understandable.
By adopting document-specific chunking, organizations and developers can efficiently process diverse data types while maintaining logical segmentation, and improving downstream tasks such as search, summarization, and analysis.
%pip install -qU langchain-text-splitters
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs
Output
[Document(metadata={}, page_content='def hello_world():\n print("Hello,
World!")'),
Document(metadata={}, page_content='# Call the function\nhello_world()')]
%pip install -qU langchain-text-splitters
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter
markdown_text = """# 🦜️🔗 LangChain
⚡ Building applications with LLMs through composability ⚡
## What is LangChain?
# Hopefully this code block isn't split
LangChain is a framework for...
As an open-source project in a rapidly developing field, we are extremely open to contributions.
"""
md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=60, chunk_overlap=0
)
md_docs = md_splitter.create_documents([markdown_text])
md_docs
Output
[Document(metadata={}, page_content='# 🦜️🔗 LangChain'),
Document(metadata={}, page_content='⚡ Building applications with LLMs through composability ⚡'),
Document(metadata={}, page_content='## What is LangChain?'),
Document(metadata={}, page_content="# Hopefully this code block isn't split"),
Document(metadata={}, page_content='LangChain is a framework for...'),
Document(metadata={}, page_content='As an open-source project in a rapidly developing field, we'),
Document(metadata={}, page_content='are extremely open to contributions.')]
Semantic chunking is an advanced text-splitting technique that focuses on dividing a document into meaningful chunks based on the actual content and context rather than arbitrary size-based methods such as token count or delimiters. The primary goal of semantic chunking is to ensure that each chunk contains a single, concise meaning, optimizing it for downstream tasks like embedding into vector representations for machine learning applications.
Traditional chunking methods, such as splitting text by a fixed number of tokens or characters, often result in chunks that contain multiple, unrelated meanings. This can dilute the representation when encoding text into vector embeddings, leading to suboptimal retrieval and processing results. By contrast, semantic chunking works by identifying natural meaning boundaries within the text and segmenting it accordingly to ensure each chunk preserves a coherent and unified concept.
For example, in a newspaper article, different paragraphs may cover various aspects of a single story. A naive chunking approach may group unrelated sections together, leading to mixed embeddings that fail to represent any of the topics accurately. Semantic chunking, however, isolates sections with distinct meanings, ensuring that each vector embedding captures the core essence of that portion.
In practice, semantic chunking can be implemented using natural language processing (NLP) techniques such as semantic similarity analysis, topic modeling, or machine learning-based segmentation. These methods analyze the underlying meaning of the text and intelligently determine appropriate chunk boundaries.
By adopting semantic chunking, text processing systems can achieve higher accuracy in tasks such as information retrieval, summarization, and AI-driven insights, ensuring that each chunk represents a concise and meaningful unit of information.
!pip install --quiet langchain_experimental langchain_openai
This command installs the required packages: langchain_experimental (which provides SemanticChunker) and langchain_openai (for the OpenAI embeddings).
The --quiet flag suppresses unnecessary output during installation.
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
The state_of_the_union.txt file is read into a string variable state_of_the_union.
This text will later be split into meaningful chunks based on semantic differences.
import os
from getpass import getpass

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

os.environ["OPENAI_API_KEY"] = getpass("API")
text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="percentile"
)
Initializes the SemanticChunker using OpenAI’s embeddings model.
It will automatically calculate the semantic similarity between sentences to determine where to split the text.
Specifies breakpoint_threshold_type=”percentile”, which means the chunking decision is based on the percentile method for determining split points.
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
Semantic chunking works by determining where to split text based on differences in sentence embeddings, which capture the meaning of sentences numerically. The algorithm calculates the difference in meaning between consecutive sentences and splits them when a certain threshold is exceeded.
The chunking behaviour is controlled by the breakpoint_threshold_type parameter, which supports the following methods: "percentile" (split where the embedding distance between consecutive sentences exceeds a chosen percentile of all distances), "standard_deviation" (split where the distance is more than a set number of standard deviations above the mean), "interquartile" (use the interquartile range of the distances to flag breakpoints), and "gradient" (apply the percentile method to the gradient of the distances, which helps with highly correlated or domain-specific text). The percentile idea is illustrated in the sketch below.
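To see what the percentile method is doing under the hood, here is a rough, simplified sketch (not the SemanticChunker implementation itself): it embeds sentences with OpenAIEmbeddings, measures the cosine distance between consecutive sentences, and breaks wherever the distance exceeds a chosen percentile. The naive sentence splitting and the threshold value are assumptions for illustration.

import numpy as np
from langchain_openai.embeddings import OpenAIEmbeddings

def naive_semantic_split(text: str, percentile: float = 95.0):
    """Split text where consecutive sentences are semantically far apart."""
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    if len(sentences) < 2:
        return [text]

    vectors = np.array(OpenAIEmbeddings().embed_documents(sentences))

    # Cosine distance between each sentence and the one that follows it
    distances = [
        1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        for a, b in zip(vectors[:-1], vectors[1:])
    ]

    # Break wherever the distance exceeds the chosen percentile of all distances
    threshold = np.percentile(distances, percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(". ".join(current) + ".")
            current = []
        current.append(sentence)
    chunks.append(". ".join(current) + ".")
    return chunks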
Agentic chunking is an advanced method of segmenting documents into smaller, meaningful sections by leveraging a large language model (LLM) to identify natural breakpoints in the text. Unlike traditional chunking methods that rely on fixed character counts, agentic chunking analyzes the content to detect semantically relevant boundaries such as paragraph breaks and topic transitions.
By using AI to determine logical divisions within the text, agentic chunking ensures that each chunk retains contextual integrity and meaning, improving the AI’s ability to process, summarize, and respond effectively. This approach enhances information retrieval, content organization, and decision-making processes by creating well-structured, purpose-driven text segments.
Agentic chunking is particularly useful in applications such as knowledge retrieval, automated summarization, and AI-driven insights, where maintaining coherence and relevance is crucial for optimal performance.
Note: Most people refer to it as Agentic Chunking, but it’s primarily based on LLM-driven chunking.
LLM-based chunking is essentially the process of using a large language model (LLM), such as GPT-4, to break down or segment text into more manageable, structured pieces. Instead of rigid rules (like splitting strictly on sentence boundaries or punctuation), LLM-based chunking leverages the model’s understanding of language and context to produce chunks that are more meaningful and coherent.
!pip install agno openai
from typing import List, Optional
from agno.document.base import Document
from agno.document.chunking.strategy import ChunkingStrategy
from agno.models.base import Model
from agno.models.defaults import DEFAULT_OPENAI_MODEL_ID
from agno.models.message import Message
from agno.models.openai import OpenAIChat
import os
os.environ["OPENAI_API_KEY"] = "your_api_key"
class AgenticChunking(ChunkingStrategy):
    """Chunking strategy that uses an LLM to determine natural breakpoints in the text"""

    def __init__(self, model: Optional[Model] = None, max_chunk_size: int = 5000):
        if "OPENAI_API_KEY" not in os.environ:
            raise ValueError("OPENAI_API_KEY environment variable not set.")
        self.model = model or OpenAIChat(DEFAULT_OPENAI_MODEL_ID)
        self.max_chunk_size = max_chunk_size

    def chunk(self, document: Document) -> List[Document]:
        """Split text into chunks using LLM to determine natural breakpoints based on context"""
        if len(document.content) <= self.max_chunk_size:
            return [document]

        chunks: List[Document] = []
        remaining_text = self.clean_text(document.content)
        chunk_meta_data = document.meta_data
        chunk_number = 1

        while remaining_text:
            # Ask model to find a good breakpoint within max_chunk_size
            prompt = f"""Analyze this text and determine a natural breakpoint within the first {self.max_chunk_size} characters.
            Consider semantic completeness, paragraph boundaries, and topic transitions.
            Return only the character position number of where to break the text:

            {remaining_text[: self.max_chunk_size]}"""

            try:
                response = self.model.response([Message(role="user", content=prompt)])
                if response and response.content:
                    break_point = min(int(response.content.strip()), self.max_chunk_size)
                else:
                    break_point = self.max_chunk_size
            except Exception:
                # Fallback to max size if model fails
                break_point = self.max_chunk_size

            # Extract chunk and update remaining text
            chunk = remaining_text[:break_point].strip()

            meta_data = chunk_meta_data.copy()
            meta_data["chunk"] = chunk_number
            chunk_id = None
            if document.id:
                chunk_id = f"{document.id}_{chunk_number}"
            elif document.name:
                chunk_id = f"{document.name}_{chunk_number}"
            meta_data["chunk_size"] = len(chunk)

            chunks.append(
                Document(
                    id=chunk_id,
                    name=document.name,
                    meta_data=meta_data,
                    content=chunk,
                )
            )
            chunk_number += 1

            remaining_text = remaining_text[break_point:].strip()
            if not remaining_text:
                break

        return chunks
# Example usage
document = Document(
    id="doc1",
    content="""Recursive chunking divides the input text into smaller chunks in a hierarchical and iterative manner using a set of separators. If the initial attempt at splitting the text doesn’t produce chunks of the desired size or structure, the method recursively calls itself on the resulting chunks with a different separator or criterion until the desired chunk size or structure is achieved. This means that while the chunks aren’t going to be exactly the same size, they’ll still “aspire” to be of a similar size.""",
    meta_data={"author": "Pankaj"}
)

chunker = AgenticChunking(max_chunk_size=200)
chunks = chunker.chunk(document)

# Print all chunks
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} (ID: {chunk.id}, Size: {len(chunk.content)})")
    print(chunk.content)
    print("-" * 50 + "\n")
Output
Chunk 1 (ID: doc1_1, Size: 179)
Recursive chunking divides the input text into smaller chunks in a
hierarchical and iterative manner using a set of separators. If the initial
attempt at splitting the text doesn’
--------------------------------------------------
Chunk 2 (ID: doc1_2, Size: 132)
t produce chunks of the desired size or structure, the method recursively
calls itself on the resulting chunks with a different sepa
--------------------------------------------------
Chunk 3 (ID: doc1_3, Size: 104)
rator or criterion until the desired chunk size or structure is achieved.
This means that while the chun
--------------------------------------------------
Chunk 4 (ID: doc1_4, Size: 66)
ks aren’t going to be exactly the same size, they’ll still “aspire
--------------------------------------------------
Chunk 5 (ID: doc1_5, Size: 26)
” to be of a similar size.
--------------------------------------------------
from openai import OpenAI
Imports the OpenAI library, required to interact with the GPT API.
content = "An outlier is a data point that significantly deviates from the rest of the data. It can be either much higher or much lower than the other data points, and its pr types of outliers: There are two main types of outliers: Global outliers: Global outliers are isolated data points that are far away from the main body of the data"
This is the input text that will be chunked.
# Initialize client with your API key
client = OpenAI(api_key="API_KEY")
Initializes the OpenAI client using an API key (replace “API_KEY” with an actual key to run the code).
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": """You are an agentic chunker. Decompose the content into clear and simple propositions:
1. Split compound sentences into simple sentences
2. Separate named entities with descriptions
3. Replace pronouns with specific references
4. Output as JSON list of strings"""
        },
        {
            "role": "user",
            "content": f"Here is the content: {content}"
        }
    ],
    temperature=0.3
)
Model: Uses gpt-4o for processing.
Messages: The system message defines GPT’s behavior: breaking down text into simple propositions, separating named entities, avoiding pronouns, and outputting as a JSON list.
The user message provides the actual content for chunking.
Temperature: 0.3 keeps responses relatively consistent by reducing randomness, though the output is not fully deterministic.
print(response.choices[0].message.content)
Output
"An outlier is a data point that significantly deviates from the rest of the data.",
"An outlier can be much higher than the other data points.",
"An outlier can be much lower than the other data points.",
"There are two main types of outliers.",
"Global outliers are isolated data points.",
"Global outliers are far away from the main body of the data."
Section-based chunking is a technique used to divide large texts into meaningful “chunks” or segments based on structural elements like headings, subheadings, paragraphs, or predefined section markers. Unlike topic modeling (which relies on statistical patterns to group content), section-based chunking leverages the document’s inherent structure to create logical divisions.
Structure-Driven:
Relies on document formatting such as headings, subheadings, paragraph breaks, and explicit section markers.
Preserves Context:
Keeps related information together, maintaining narrative flow within sections.
Efficient for Structured Documents:
Works well with academic papers, reports, PDFs, legal documents, etc.
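The hands-on below uses a topic-modeling (LDA) variant that assigns sentences to themes; for purely structure-driven splitting, a header-aware splitter is a natural fit. Here is a small sketch using LangChain's MarkdownHeaderTextSplitter, with an illustrative sample document:

from langchain_text_splitters import MarkdownHeaderTextSplitter

report = """# Annual Report
## Financial Highlights
Revenue grew 12% year over year, driven by the cloud segment.
## Risk Factors
Supply-chain disruption remains the most significant operational risk.
"""

# Split on the document's own headings so each section stays intact
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")]
)
sections = splitter.split_text(report)
for section in sections:
    print(section.metadata, "->", section.page_content)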
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np
import fitz  # PyMuPDF

# Function to extract text from a PDF file
def extract_text_from_pdf(pdf_path):
    pdf_document = fitz.open(pdf_path)
    text = ""
    for page in pdf_document:
        text += page.get_text()
    return text

# Topic-based chunking function
def topic_based_chunk(text, num_topics=3):
    sentences = text.split('. ')
    vectorizer = CountVectorizer()
    sentence_vectors = vectorizer.fit_transform(sentences)

    lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
    lda.fit(sentence_vectors)

    topic_word = lda.components_
    vocabulary = vectorizer.get_feature_names_out()

    # Build a human-readable label for each topic from its top keywords
    topics = []
    for topic_idx, topic in enumerate(topic_word):
        top_words_idx = topic.argsort()[:-6:-1]
        topic_keywords = [vocabulary[i] for i in top_words_idx]
        topics.append(f"Topic {topic_idx + 1}: {', '.join(topic_keywords)}")

    # Assign each sentence to its most likely topic
    chunks_with_topics = []
    for i, sentence in enumerate(sentences):
        topic_assignments = lda.transform(vectorizer.transform([sentence]))
        assigned_topic = np.argmax(topic_assignments)
        chunks_with_topics.append((topics[assigned_topic], sentence))

    return chunks_with_topics

# Replace 'your_file.pdf' with your actual PDF file path
pdf_path = '/content/1738082270933.pdf'
pdf_text = extract_text_from_pdf(pdf_path)

# Get topic-based chunks
topic_chunks = topic_based_chunk(pdf_text, num_topics=3)

# Display results
for topic, chunk in topic_chunks:
    print(f"{topic}: {chunk}\n")
Output
Topic 3: reasoning, r1, deepseek, the, of:
DeepSeek-R1 is a reasoning-focused large language model (LLM) developed to
enhance reasoning capabilities in Generative AI systems through advanced
reinforcement learning techniques.
Explanation: Topic 3 is characterized by keywords like “reasoning,” “R1,” “DeepSeek”, which frequently appear in sentences about the DeepSeek model.
Contextual Chunking in Retrieval-Augmented Generation (RAG) refers to the strategy of segmenting documents or data into meaningful “chunks” that preserve the semantic context. This technique enhances the retrieval and generation performance of RAG models by ensuring that the model has access to coherent, context-rich pieces of information, rather than arbitrary or fragmented text segments.
In RAG systems, the process involves two main steps: first, the source documents are split into chunks and indexed as vector embeddings; second, the most relevant chunks are retrieved at query time and passed to the model as context for generation.
If the chunks are poorly segmented, the retrieval process might fetch incomplete or contextually weak information, leading to subpar generation quality. Contextual chunking helps mitigate this by ensuring that each chunk contains enough semantic information to be useful on its own.
Here’s how you set the chunk process prompt for contextual chunking:
# create chunk context generation chain
from langchain.prompts import ChatPromptTemplate
from langchain.schema import StrOutputParser
from langchain_openai import ChatOpenAI

chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

def generate_chunk_context(document, chunk):
    chunk_process_prompt = """You are an AI assistant specializing in research
                              paper analysis. Your task is to provide brief,
                              relevant context for a chunk of text based on the
                              following research paper.

                              Here is the research paper:
                              <paper>
                              {paper}
                              </paper>

                              Here is the chunk we want to situate within the whole
                              document:
                              <chunk>
                              {chunk}
                              </chunk>

                              Provide a concise context (3-4 sentences max) for this
                              chunk, considering the following guidelines:

                              - Give a short succinct context to situate this chunk
                                within the overall document for the purposes of
                                improving search retrieval of the chunk.
                              - Answer only with the succinct context and nothing
                                else.
                              - Context should be mentioned like 'Focuses on ....'
                                do not mention 'this chunk or section focuses on...'

                              Context:
                           """

    prompt_template = ChatPromptTemplate.from_template(chunk_process_prompt)

    agentic_chunk_chain = (prompt_template
                           | chatgpt
                           | StrOutputParser())

    context = agentic_chunk_chain.invoke({'paper': document, 'chunk': chunk})

    return context
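Once the context generator is defined, each chunk can be prefixed with its generated context before being embedded and indexed. A small illustrative usage sketch, where the document and chunk strings are placeholders:

# Illustrative usage: enrich a chunk with its generated context before embedding
paper_text = "full text of the research paper goes here"
chunk_text = "one chunk produced by your text splitter goes here"

context = generate_chunk_context(paper_text, chunk_text)
enriched_chunk = f"{context}\n\n{chunk_text}"  # this enriched text is what gets embedded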
For more information, refer to this article – A Comprehensive Guide to Building Contextual RAG Systems with Hybrid Search and Reranking
Late Chunking addresses the challenges of maintaining contextual coherence when processing long documents for retrieval applications. Unlike traditional chunking approaches that segment text early in the pipeline, potentially disrupting long-distance contextual dependencies, Late Chunking leverages long-context embedding models to generate contextual chunk embeddings. This ensures that references spread across multiple text segments (like pronouns or entity mentions) are preserved within their broader context, leading to higher-quality vector representations and more effective retrieval performance. This method mitigates the shortcomings of conventional RAG pipelines, particularly in handling anaphoric references and fragmented information.
To see how Jina Embeddings work, explore this: Jina Embeddings.
When breaking down a Wikipedia article into smaller chunks, phrases like “its” or “the city” often refer back to something mentioned earlier, such as “Berlin” in the first sentence. However, splitting the text disconnects these references from the original entity, making it difficult for embedding models to correctly associate them with “Berlin.” This results in less accurate vector representations and weaker performance in retrieval-augmented generation (RAG) systems.
Late Chunking addresses this issue by processing the entire text—or as much of it as possible—through the transformer layer of the embedding model before splitting it into chunks. This approach generates token-level vector representations that capture the full context of the text. Afterward, the system applies mean pooling to each chunk to create embeddings, ensuring they retain important contextual information since the full text was initially considered.
Unlike basic chunking methods that process each chunk in isolation, Late Chunking allows every chunk to retain influence from the broader document context. As a result, references like “its” and “the city” remain correctly associated with “Berlin,” even when appearing in different chunks. This improves RAG systems’ accuracy, making them more context-aware and capable of delivering better, more coherent answers.
!pip install transformers==4.43.4
from transformers import AutoModel
from transformers import AutoTokenizer
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
def chunk_by_sentences(input_text: str, tokenizer: callable):
    """
    Split the input text into sentences using the tokenizer
    :param input_text: The text snippet to split into sentences
    :param tokenizer: The tokenizer to use
    :return: A tuple containing the list of text chunks and their corresponding token spans
    """
    inputs = tokenizer(input_text, return_tensors='pt', return_offsets_mapping=True)
    punctuation_mark_id = tokenizer.convert_tokens_to_ids('.')
    sep_id = tokenizer.convert_tokens_to_ids('[SEP]')
    token_offsets = inputs['offset_mapping'][0]
    token_ids = inputs['input_ids'][0]
    chunk_positions = [
        (i, int(start + 1))
        for i, (token_id, (start, end)) in enumerate(zip(token_ids, token_offsets))
        if token_id == punctuation_mark_id
        and (
            token_offsets[i + 1][0] - token_offsets[i][1] > 0
            or token_ids[i + 1] == sep_id
        )
    ]
    chunks = [
        input_text[x[1] : y[1]]
        for x, y in zip([(1, 0)] + chunk_positions[:-1], chunk_positions)
    ]
    span_annotations = [
        (x[0], y[0]) for (x, y) in zip([(1, 0)] + chunk_positions[:-1], chunk_positions)
    ]
    return chunks, span_annotations
import requests

def chunk_by_tokenizer_api(input_text: str, tokenizer: callable):
    # Define the API endpoint and payload
    url = 'https://tokenize.jina.ai/'
    payload = {
        "content": input_text,
        "return_chunks": "true",
        "max_chunk_length": "1000"
    }

    # Make the API request
    response = requests.post(url, json=payload)
    response_data = response.json()

    # Extract chunks and positions from the response
    chunks = response_data.get("chunks", [])
    chunk_positions = response_data.get("chunk_positions", [])

    # Adjust chunk positions to match the input format
    span_annotations = [(start, end) for start, end in chunk_positions]

    return chunks, span_annotations
nput_text = "Berlin is the capital and largest city of Germany, both by area and by population. Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits. The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."
# determine chunks
chunks, span_annotations = chunk_by_sentences(input_text, tokenizer)
print('Chunks:\n- "' + '"\n- "'.join(chunks) + '"')
Chunks:
- "Berlin is the capital and largest city of Germany, both by area and by
population."
- " Its more than 3.85 million inhabitants make it the European Union's most
populous city, as measured by population within city limits."
- " The city is also one of the states of Germany, and is the third smallest
state in the country in terms of area."
def late_chunking(
    model_output: 'BatchEncoding', span_annotation: list, max_length=None
):
    token_embeddings = model_output[0]
    outputs = []
    for embeddings, annotations in zip(token_embeddings, span_annotation):
        if (
            max_length is not None
        ):  # remove annotations which go beyond the max-length of the model
            annotations = [
                (start, min(end, max_length - 1))
                for (start, end) in annotations
                if start < (max_length - 1)
            ]
        pooled_embeddings = [
            embeddings[start:end].sum(dim=0) / (end - start)
            for start, end in annotations
            if (end - start) >= 1
        ]
        pooled_embeddings = [
            embedding.detach().cpu().numpy() for embedding in pooled_embeddings
        ]
        outputs.append(pooled_embeddings)

    return outputs
# chunk before
embeddings_traditional_chunking = model.encode(chunks)
# chunk afterwards (context-sensitive chunked pooling)
inputs = tokenizer(input_text, return_tensors='pt')
model_output = model(**inputs)
embeddings = late_chunking(model_output, [span_annotations])[0]
import numpy as np
cos_sim = lambda x, y: np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
berlin_embedding = model.encode('Berlin')
for chunk, new_embedding, trad_embeddings in zip(chunks, embeddings, embeddings_traditional_chunking):
    print(f'similarity_new("Berlin", "{chunk}"):', cos_sim(berlin_embedding, new_embedding))
    print(f'similarity_trad("Berlin", "{chunk}"):', cos_sim(berlin_embedding, trad_embeddings))
Output
similarity_new("Berlin", "Berlin is the capital and largest city of Germany,
both by area and by population."): 0.849546
similarity_trad("Berlin", "Berlin is the capital and largest city of Germany,
both by area and by population."): 0.8486219
similarity_new("Berlin", " Its more than 3.85 million inhabitants make it the
European Union's most populous city, as measured by population within city
limits."): 0.82489026
similarity_trad("Berlin", " Its more than 3.85 million inhabitants make it
the European Union's most populous city, as measured by population within
city limits."): 0.70843387
similarity_new("Berlin", " The city is also one of the states of Germany, and
is the third smallest state in the country in terms of area."): 0.8498009
similarity_trad("Berlin", " The city is also one of the states of Germany,
and is the third smallest state in the country in terms of area."):
0.75345534
In the output, you can clearly see the improvement in semantic similarity when late chunking is used.
General performance improvement: the late-chunking embeddings score as high as or higher than the traditional embeddings for every chunk.
Notable improvements in ambiguous references: the chunks that refer to Berlin only as “Its” or “The city” show the largest gains (roughly 0.82 vs. 0.71 and 0.85 vs. 0.75), because the full-document context was available before pooling.
Consistency across examples: with late chunking, all three chunks keep a similar score against “Berlin” (around 0.82 to 0.85), whereas traditional chunking drops sharply for the chunks that never mention Berlin by name.
Chunking is crucial for RAG systems: it determines how data is managed and optimized, and ultimately how reliable the application is. Various chunking strategies, ranging from simple character-based splits to advanced methods like semantic, agentic, and late chunking, help improve data retrievability, contextual relevance, and model performance. Selecting the right chunking approach depends on content type, task requirements, and desired output quality, making it an essential practice for building efficient AI-powered applications.
If you found this article helpful, comment below!