The easiest way to learn anything, whether for academics or personal growth, is to break it down into smaller, more manageable chunks. A complex subject can feel overwhelming at first, but dividing it into bite-sized pieces makes it far easier to understand, and even a seemingly small concept can usually be split further. This chunking habit makes it easier to grasp or learn something and forms the foundation for how we process information in everyday life. Surprisingly, machines work in much the same way. Chunking is not just a study trick; it is a cognitive psychology concept that plays a vital role in data processing and in AI systems that use RAG. In this article, we will walk through 8 types of chunking in RAG, with some hands-on examples!
Chunking is the process of breaking down large pieces of text into smaller, more manageable parts. This technique is crucial when working with language models because it ensures that the provided data fits within the model’s context window while maintaining the relevance and quality of the information.
By context window, I mean the amount of input a language model can accept at once. You can supply your own data to the model, but you cannot pass unlimited data. This is because:
There is always a limit on the number of words or tokens that you can provide to the language model, and that limit varies from model to model; OpenAI's models, for example, each have a published context window size (see the token-counting sketch after this list).
Language models perform better when the signal-to-noise ratio is high. In other words, reducing irrelevant or distracting information in the model’s context window can significantly enhance performance.
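Since these limits are counted in tokens rather than words, it helps to check how much of a model's context window your data will actually consume before sending it. Here is a minimal sketch using the tiktoken library; installing tiktoken and the choice of encoding are assumptions (cl100k_base is the tokenizer used by many OpenAI chat models):

import tiktoken

# cl100k_base is the tokenizer used by many OpenAI chat models
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Return the number of tokens the text consumes."""
    return len(encoding.encode(text))

sample = "Clouds come floating into my life, no longer to carry rain or usher storm."
print(count_tokens(sample))  # compare this count against the model's context window limit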
So, the primary goal of chunking is not to split data arbitrarily, but to optimize the way information is presented to the model. Proper chunking enhances the retrievability of useful content and improves the overall performance of applications that rely on AI models.
Anton Troynikov, co-founder of Chroma, points out that unnecessary data within the context window can measurably degrade the overall effectiveness of an application. By focusing only on relevant content, we can optimize the model’s output and ensure more accurate, efficient responses.
Makes sense, right? To summarize, chunking is important because it keeps your data within the model’s context window, raises the signal-to-noise ratio by leaving out irrelevant text, and makes the most useful content easier to retrieve.
In summary, chunking is a foundational step in preparing text data for language models. It helps in balancing data volume, relevance, and retrievability, making it a critical practice in building efficient AI-powered applications.
Let’s understand this with the RAG architecture:
In Retrieval-Augmented Generation (RAG), chunking involves breaking down raw data sources (such as PDFs, spreadsheets, or other documents) into smaller, manageable pieces called “chunks of text.” The system then processes these chunks, converts them into vector embeddings, and stores them in a vector database (e.g., Chroma) to enable efficient retrieval when a user asks a question.
In short, Chunking refers to dividing large text data into smaller, manageable pieces to improve retrieval efficiency and relevance in downstream tasks like search and generation.
Once the chunks are embedded, they are stored in a vector database and indexed so they can be matched against the embedding of a user’s query.
After chunk retrieval, the most relevant chunks are passed to the language model along with the user’s question, and the model generates a grounded answer.
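To make this flow concrete, here is a minimal, illustrative sketch of storing chunks in Chroma and retrieving the ones most similar to a user question. The collection name and sample texts are made up, and Chroma's default embedding function is used:

import chromadb

client = chromadb.Client()  # in-memory client for demonstration
collection = client.create_collection(name="rag_chunks")

# Store a few text chunks; Chroma embeds them with its default embedding function
chunks = [
    "Chunking splits large documents into smaller pieces.",
    "A vector database stores embeddings for similarity search.",
    "The retrieved chunks are passed to the LLM along with the question.",
]
collection.add(documents=chunks, ids=[f"chunk_{i}" for i in range(len(chunks))])

# Retrieve the chunks most relevant to a user query
results = collection.query(query_texts=["How does retrieval work in RAG?"], n_results=2)
print(results["documents"])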
Also read: RAG vs Agentic RAG: A Comprehensive Guide
Let’s understand the drawbacks of RAG.
By implementing the right chunking strategies, the RAG pipeline can achieve more accurate retrieval, richer contextual grounding, and higher-quality response generation, ultimately enhancing the overall system’s reliability and user satisfaction.
Choosing the right chunking strategy involves carefully considering the content type, the embedding model, and the expected user queries. Here’s a guide organized around those three factors:
Content characteristics heavily influence chunking strategy. Example Scenario:
Different embedding models have varying limitations and strengths. Key Considerations:
Steps to Optimize
Understanding what users are likely to search for helps design the chunking strategy. Example User Queries:
In the next section, I will discuss different chunking strategies in detail.
This method is one of the simplest approaches to chunking or splitting text. It divides the text into fixed-sized chunks of N characters, regardless of the content or structure. While it is a basic technique, it serves as an excellent starting point for understanding the fundamentals of text chunking and how it works in practice.
This approach is simple to use; however, it is very rigid and does not take the structure of your text into account.
text = "Clouds come floating into my life, no longer to carry rain or usher storm, but to add color to my sunset sky."
chunks = []
chunk_size = 35
chunk_overlap = 5 # Characters
# Run through the text with the length of your text and iterate every chunk_size,
# considering the overlap for the starting position of the next chunk.
for i in range(0, len(text) - chunk_size + 1, chunk_size - chunk_overlap):
chunk = text[i:i + chunk_size]
chunks.append(chunk)
chunks
Output
['Clouds come floating into my life, ',
'ife, no longer to carry rain or ush',
'r usher storm, but to add color to ']
Step size calculation: the loop advances by chunk_size - chunk_overlap = 35 - 5 = 30 characters each iteration.
Let’s analyze how the loop runs with these values:
First chunk (index 0 to 35):
Extracts the substring “Clouds come floating into my life, “.
The loop then moves forward by 30 characters.
Second chunk (index 30 to 65):
Extracts the substring “ife, no longer to carry rain or ush”.
Notice how the last 5 characters of the previous chunk (“ife, ”) reappear at the start of this chunk.
Third chunk (index 60 to 95):
Extracts the substring “r usher storm, but to add color to “.
Again, there’s an overlap with the last few characters from the second chunk.
%pip install -qU langchain-text-splitters
This command installs the langchain-text-splitters library, which is used for splitting long pieces of text into smaller chunks.
The -q flag suppresses installation output, and -U ensures that the latest version is installed.
from langchain_text_splitters import CharacterTextSplitter

# Load an example document
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)
This code sets up a CharacterTextSplitter object with the following parameters: separator="\n\n" splits the text on blank lines (paragraph breaks); chunk_size=1000 caps each chunk at 1,000 characters; chunk_overlap=200 lets consecutive chunks share up to 200 characters; length_function=len measures chunk length in characters; and is_separator_regex=False treats the separator as a literal string rather than a regular expression.
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
The create_documents() method takes the list of texts (in this case, a single document) and splits it based on the specified parameters (chunk size, overlap, separator).
The result is a list of chunked document objects, where each chunk contains a portion of the original text.
Chunking in action: the splitter first divides the document at the "\n\n" separator and then packs the resulting pieces into chunks of at most 1,000 characters.
Overlap handling: consecutive chunks share up to 200 characters, so context at chunk boundaries is not lost.
Unlike the first method, which ignores document structure, this method recursively divides text using a predefined list of separators and intelligently merges the resulting smaller pieces into larger chunks. The final chunks are optimized to contain no more than N characters, ensuring efficient text processing and context preservation.
It is parameterized by a list of characters. The default list is ["\n\n", "\n", " ", ""], so the splitter tries paragraph breaks first, then line breaks, then spaces, and finally falls back to splitting on individual characters.
%pip install -qU langchain-text-splitters
text = """
The Marvel Universe is a vast and interconnected world filled with superheroes, villains, and epic storytelling that has captivated audiences for decades. Founded by visionaries such as Stan Lee, Jack Kirby, and Steve Ditko, Marvel Comics has introduced some of the most iconic characters in pop culture history. From its early beginnings in 1939 as Timely Publications to its transformation into Marvel Comics in the 1960s, the company has consistently pushed the boundaries of storytelling by creating relatable and dynamic characters. Heroes like Spider-Man, Iron Man, Captain America, and Thor have become household names, each with their own compelling backstories and struggles that resonate with fans across generations. Marvel’s success extends beyond the pages of comic books. The launch of the Marvel Cinematic Universe (MCU) in 2008 with the release of Iron Man revolutionized the film industry, introducing interconnected storylines that culminated in epic crossover events such as The Avengers and Infinity War. The MCU’s success is largely attributed to its ability to blend action, humor, and emotional depth while maintaining the essence of the beloved comic book characters. Audiences have followed the journeys of superheroes as they face powerful foes like Thanos and Loki, all while dealing with their own internal conflicts and responsibilities."""
from langchain_text_splitters import RecursiveCharacterTextSplitter
The RecursiveCharacterTextSplitter is imported from the langchain-text-splitters package.
This class is used to split large text documents into smaller chunks efficiently while preserving context.
text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=400,
    chunk_overlap=0,
    length_function=len,
)
text_splitter.create_documents([text])
Output
[Document(metadata={}, page_content='The Marvel Universe is a vast and
interconnected world filled with superheroes, villains, and epic
storytelling that has captivated audiences for decades. Founded by
visionaries such as Stan Lee, Jack Kirby, and Steve Ditko, Marvel Comics has
introduced some of the most iconic characters in pop'),
Document(metadata={}, page_content='culture history. From its early
beginnings in 1939 as Timely Publications to its transformation into Marvel
Comics in the 1960s, the company has consistently pushed the boundaries of
storytelling by creating relatable and dynamic characters. Heroes like
Spider-Man, Iron Man, Captain America, and'),
Document(metadata={}, page_content='Thor have become household names, each
with their own compelling backstories and struggles that resonate with fans
across generations. Marvel’s success extends beyond the pages of comic
books. The launch of the Marvel Cinematic Universe (MCU) in 2008 with the
release of Iron Man revolutionized the'),
Document(metadata={}, page_content='film industry, introducing
interconnected storylines that culminated in epic crossover events such as
The Avengers and Infinity War. The MCU’s success is largely attributed to
its ability to blend action, humor, and emotional depth while maintaining
the essence of the beloved comic book characters.'),
Document(metadata={}, page_content='Audiences have followed the journeys of
superheroes as they face powerful foes like Thanos and Loki, all while
dealing with their own internal conflicts and responsibilities.')]
The resulting list of Document objects contains several chunks of the text. Because chunk_overlap is set to 0, the chunks do not overlap; instead, the splitter tries to break on the separator hierarchy (paragraphs, then lines, then spaces) while keeping each chunk under 400 characters. Here’s a breakdown of the output:
Document-specific chunking is a strategy designed to tailor text-splitting methods to fit different data formats such as images, PDFs, or code snippets. Unlike generic chunking methods, which may not work effectively across various content types, document-specific chunking takes into account the unique structure and characteristics of each format to ensure meaningful segmentation.
For instance, when dealing with Markdown, Python, or JavaScript files, chunking methods are adapted to use format-specific separators, such as headers in Markdown, function definitions in Python, or code blocks in JavaScript. This approach allows for more accurate and context-aware chunking, ensuring that key elements of the content remain intact and understandable.
By adopting document-specific chunking, organizations and developers can efficiently process diverse data types while maintaining logical segmentation, and improving downstream tasks such as search, summarization, and analysis.
%pip install -qU langchain-text-splitters
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs
Output
[Document(metadata={}, page_content='def hello_world():\n print("Hello,
World!")'),
Document(metadata={}, page_content='# Call the function\nhello_world()')]
%pip install -qU langchain-text-splitters
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter
markdown_text = """# 🦜️🔗 LangChain
⚡ Building applications with LLMs through composability ⚡
## What is LangChain?
# Hopefully this code block isn't split
LangChain is a framework for...
As an open-source project in a rapidly developing field, we are extremely open to contributions.
"""
md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=60, chunk_overlap=0
)
md_docs = md_splitter.create_documents([markdown_text])
md_docs
Output
[Document(metadata={}, page_content='# 🦜️🔗 LangChain'),
Document(metadata={}, page_content='⚡ Building applications with LLMs through composability ⚡'),
Document(metadata={}, page_content='## What is LangChain?'),
Document(metadata={}, page_content="# Hopefully this code block isn't split"),
Document(metadata={}, page_content='LangChain is a framework for...'),
Document(metadata={}, page_content='As an open-source project in a rapidly developing field, we'),
Document(metadata={}, page_content='are extremely open to contributions.')]
Semantic chunking is an advanced text-splitting technique that focuses on dividing a document into meaningful chunks based on the actual content and context rather than arbitrary size-based methods such as token count or delimiters. The primary goal of semantic chunking is to ensure that each chunk contains a single, concise meaning, optimizing it for downstream tasks like embedding into vector representations for machine learning applications.
Traditional chunking methods, such as splitting text by a fixed number of tokens or characters, often result in chunks that contain multiple, unrelated meanings. This can dilute the representation when encoding text into vector embeddings, leading to suboptimal retrieval and processing results. By contrast, semantic chunking works by identifying natural meaning boundaries within the text and segmenting it accordingly to ensure each chunk preserves a coherent and unified concept.
For example, in a newspaper article, different paragraphs may cover various aspects of a single story. A naive chunking approach may group unrelated sections together, leading to mixed embeddings that fail to represent any of the topics accurately. Semantic chunking, however, isolates sections with distinct meanings, ensuring that each vector embedding captures the core essence of that portion.
In practice, semantic chunking can be implemented using natural language processing (NLP) techniques such as semantic similarity analysis, topic modeling, or machine learning-based segmentation. These methods analyze the underlying meaning of the text and intelligently determine appropriate chunk boundaries.
By adopting semantic chunking, text processing systems can achieve higher accuracy in tasks such as information retrieval, summarization, and AI-driven insights, ensuring that each chunk represents a concise and meaningful unit of information.
!pip install --quiet langchain_experimental langchain_openai
This command installs the required packages: langchain_experimental (which provides SemanticChunker) and langchain_openai (for the OpenAI embeddings).
The --quiet flag suppresses unnecessary output during installation.
# This is a long document we can split up.
with open("state_of_the_union.txt") as f:
    state_of_the_union = f.read()
The state_of_the_union.txt file is read into a string variable state_of_the_union.
This text will later be split into meaningful chunks based on semantic differences.
import os
from getpass import getpass

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

os.environ["OPENAI_API_KEY"] = getpass("API")
text_splitter = SemanticChunker(
    OpenAIEmbeddings(), breakpoint_threshold_type="percentile"
)
Initializes the SemanticChunker using OpenAI’s embeddings model.
It will automatically calculate the semantic similarity between sentences to determine where to split the text.
Specifies breakpoint_threshold_type=”percentile”, which means the chunking decision is based on the percentile method for determining split points.
docs = text_splitter.create_documents([state_of_the_union])
print(docs[0].page_content)
Semantic chunking works by determining where to split text based on differences in sentence embeddings, which capture the meaning of sentences numerically. The algorithm calculates the difference in meaning between consecutive sentences and splits them when a certain threshold is exceeded.
The chunking behaviour is controlled by the breakpoint_threshold_type parameter, which supports the following methods: "percentile" (split where the embedding distance between consecutive sentences exceeds a chosen percentile of all distances), "standard_deviation" (split where the distance is more than a set number of standard deviations above the mean), "interquartile" (use the interquartile range of the distances to flag breakpoints), and "gradient" (apply the percentile method to the gradient of the distances, which helps with highly correlated or domain-specific text). The percentile idea is illustrated in the sketch below.
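To see what the percentile method is doing under the hood, here is a rough, simplified sketch (not the SemanticChunker implementation itself): it embeds sentences with OpenAIEmbeddings, measures the cosine distance between consecutive sentences, and breaks wherever the distance exceeds a chosen percentile. The naive sentence splitting and the threshold value are assumptions for illustration.

import numpy as np
from langchain_openai.embeddings import OpenAIEmbeddings

def naive_semantic_split(text: str, percentile: float = 95.0):
    """Split text where consecutive sentences are semantically far apart."""
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    if len(sentences) < 2:
        return [text]

    vectors = np.array(OpenAIEmbeddings().embed_documents(sentences))

    # Cosine distance between each sentence and the one that follows it
    distances = [
        1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        for a, b in zip(vectors[:-1], vectors[1:])
    ]

    # Break wherever the distance exceeds the chosen percentile of all distances
    threshold = np.percentile(distances, percentile)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(". ".join(current) + ".")
            current = []
        current.append(sentence)
    chunks.append(". ".join(current) + ".")
    return chunks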
Agentic chunking is an advanced method of segmenting documents into smaller, meaningful sections by leveraging a large language model (LLM) to identify natural breakpoints in the text. Unlike traditional chunking methods that rely on fixed character counts, agentic chunking analyzes the content to detect semantically relevant boundaries such as paragraph breaks and topic transitions.
By using AI to determine logical divisions within the text, agentic chunking ensures that each chunk retains contextual integrity and meaning, improving the AI’s ability to process, summarize, and respond effectively. This approach enhances information retrieval, content organization, and decision-making processes by creating well-structured, purpose-driven text segments.
Agentic chunking is particularly useful in applications such as knowledge retrieval, automated summarization, and AI-driven insights, where maintaining coherence and relevance is crucial for optimal performance.
Note: Most people refer to it as Agentic Chunking, but it’s primarily based on LLM-driven chunking.
LLM-based chunking is essentially the process of using a large language model (LLM), such as GPT-4, to break down or segment text into more manageable, structured pieces. Instead of rigid rules (like splitting strictly on sentence boundaries or punctuation), LLM-based chunking leverages the model’s understanding of language and context to produce chunks that are more meaningful and coherent.
!pip install agno openai
from typing import List, Optional
from agno.document.base import Document
from agno.document.chunking.strategy import ChunkingStrategy
from agno.models.base import Model
from agno.models.defaults import DEFAULT_OPENAI_MODEL_ID
from agno.models.message import Message
from agno.models.openai import OpenAIChat
import os
os.environ["OPENAI_API_KEY"] = "your_api_key"
class AgenticChunking(ChunkingStrategy):
    """Chunking strategy that uses an LLM to determine natural breakpoints in the text"""

    def __init__(self, model: Optional[Model] = None, max_chunk_size: int = 5000):
        if "OPENAI_API_KEY" not in os.environ:
            raise ValueError("OPENAI_API_KEY environment variable not set.")
        self.model = model or OpenAIChat(DEFAULT_OPENAI_MODEL_ID)
        self.max_chunk_size = max_chunk_size

    def chunk(self, document: Document) -> List[Document]:
        """Split text into chunks using LLM to determine natural breakpoints based on context"""
        if len(document.content) <= self.max_chunk_size:
            return [document]

        chunks: List[Document] = []
        remaining_text = self.clean_text(document.content)
        chunk_meta_data = document.meta_data
        chunk_number = 1

        while remaining_text:
            # Ask model to find a good breakpoint within max_chunk_size
            prompt = f"""Analyze this text and determine a natural breakpoint within the first {self.max_chunk_size} characters.
            Consider semantic completeness, paragraph boundaries, and topic transitions.
            Return only the character position number of where to break the text:

            {remaining_text[: self.max_chunk_size]}"""

            try:
                response = self.model.response([Message(role="user", content=prompt)])
                if response and response.content:
                    break_point = min(int(response.content.strip()), self.max_chunk_size)
                else:
                    break_point = self.max_chunk_size
            except Exception:
                # Fallback to max size if model fails
                break_point = self.max_chunk_size

            # Extract chunk and update remaining text
            chunk = remaining_text[:break_point].strip()

            meta_data = chunk_meta_data.copy()
            meta_data["chunk"] = chunk_number
            chunk_id = None
            if document.id:
                chunk_id = f"{document.id}_{chunk_number}"
            elif document.name:
                chunk_id = f"{document.name}_{chunk_number}"
            meta_data["chunk_size"] = len(chunk)

            chunks.append(
                Document(
                    id=chunk_id,
                    name=document.name,
                    meta_data=meta_data,
                    content=chunk,
                )
            )
            chunk_number += 1

            remaining_text = remaining_text[break_point:].strip()
            if not remaining_text:
                break

        return chunks
# Example usage
document = Document(
    id="doc1",
    content="""Recursive chunking divides the input text into smaller chunks in a hierarchical and iterative manner using a set of separators. If the initial attempt at splitting the text doesn’t produce chunks of the desired size or structure, the method recursively calls itself on the resulting chunks with a different separator or criterion until the desired chunk size or structure is achieved. This means that while the chunks aren’t going to be exactly the same size, they’ll still “aspire” to be of a similar size.""",
    meta_data={"author": "Pankaj"}
)

chunker = AgenticChunking(max_chunk_size=200)
chunks = chunker.chunk(document)

# Print all chunks
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} (ID: {chunk.id}, Size: {len(chunk.content)})")
    print(chunk.content)
    print("-" * 50 + "\n")
Output
Chunk 1 (ID: doc1_1, Size: 179)
Recursive chunking divides the input text into smaller chunks in a
hierarchical and iterative manner using a set of separators. If the initial
attempt at splitting the text doesn’
--------------------------------------------------
Chunk 2 (ID: doc1_2, Size: 132)
t produce chunks of the desired size or structure, the method recursively
calls itself on the resulting chunks with a different sepa
--------------------------------------------------
Chunk 3 (ID: doc1_3, Size: 104)
rator or criterion until the desired chunk size or structure is achieved.
This means that while the chun
--------------------------------------------------
Chunk 4 (ID: doc1_4, Size: 66)
ks aren’t going to be exactly the same size, they’ll still “aspire
--------------------------------------------------
Chunk 5 (ID: doc1_5, Size: 26)
” to be of a similar size.
--------------------------------------------------
from openai import OpenAI
Imports the OpenAI library, required to interact with the GPT API.
content = "An outlier is a data point that significantly deviates from the rest of the data. It can be either much higher or much lower than the other data points, and its pr types of outliers: There are two main types of outliers: Global outliers: Global outliers are isolated data points that are far away from the main body of the data"
This is the input text that will be chunked.
# Initialize client with your API key
client = OpenAI(api_key="API_KEY")
Initializes the OpenAI client using an API key (replace “API_KEY” with an actual key to run the code).
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": """You are an agentic chunker. Decompose the content into clear and simple propositions:
1. Split compound sentences into simple sentences
2. Separate named entities with descriptions
3. Replace pronouns with specific references
4. Output as JSON list of strings"""
        },
        {
            "role": "user",
            "content": f"Here is the content: {content}"
        }
    ],
    temperature=0.3
)
Model: Uses gpt-4o for processing.
Messages: The system message defines GPT’s behavior: breaking down text into simple propositions, separating named entities, avoiding pronouns, and outputting as a JSON list.
The user message provides the actual content for chunking.
Temperature: 0.3 keeps responses relatively consistent by reducing randomness, though the output is not fully deterministic.
print(response.choices[0].message.content)
Output
"An outlier is a data point that significantly deviates from the rest of the data.",
"An outlier can be much higher than the other data points.",
"An outlier can be much lower than the other data points.",
"There are two main types of outliers.",
"Global outliers are isolated data points.",
"Global outliers are far away from the main body of the data."
Section-based chunking is a technique used to divide large texts into meaningful “chunks” or segments based on structural elements like headings, subheadings, paragraphs, or predefined section markers. Unlike topic modeling (which relies on statistical patterns to group content), section-based chunking leverages the document’s inherent structure to create logical divisions.
Structure-Driven:
Relies on document formatting such as headings, subheadings, paragraph breaks, and explicit section markers.
Preserves Context:
Keeps related information together, maintaining narrative flow within sections.
Efficient for Structured Documents:
Works well with academic papers, reports, PDFs, legal documents, etc.
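The hands-on below uses a topic-modeling (LDA) variant that assigns sentences to themes; for purely structure-driven splitting, a header-aware splitter is a natural fit. Here is a small sketch using LangChain's MarkdownHeaderTextSplitter, with an illustrative sample document:

from langchain_text_splitters import MarkdownHeaderTextSplitter

report = """# Annual Report
## Financial Highlights
Revenue grew 12% year over year, driven by the cloud segment.
## Risk Factors
Supply-chain disruption remains the most significant operational risk.
"""

# Split on the document's own headings so each section stays intact
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "Header 1"), ("##", "Header 2")]
)
sections = splitter.split_text(report)
for section in sections:
    print(section.metadata, "->", section.page_content)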
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np
import fitz  # PyMuPDF

# Function to extract text from a PDF file
def extract_text_from_pdf(pdf_path):
    pdf_document = fitz.open(pdf_path)
    text = ""
    for page in pdf_document:
        text += page.get_text()
    return text

# Topic-based chunking function
def topic_based_chunk(text, num_topics=3):
    sentences = text.split('. ')
    vectorizer = CountVectorizer()
    sentence_vectors = vectorizer.fit_transform(sentences)

    lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
    lda.fit(sentence_vectors)

    topic_word = lda.components_
    vocabulary = vectorizer.get_feature_names_out()

    # Build a human-readable label for each topic from its top keywords
    topics = []
    for topic_idx, topic in enumerate(topic_word):
        top_words_idx = topic.argsort()[:-6:-1]
        topic_keywords = [vocabulary[i] for i in top_words_idx]
        topics.append(f"Topic {topic_idx + 1}: {', '.join(topic_keywords)}")

    # Assign each sentence to its most likely topic
    chunks_with_topics = []
    for i, sentence in enumerate(sentences):
        topic_assignments = lda.transform(vectorizer.transform([sentence]))
        assigned_topic = np.argmax(topic_assignments)
        chunks_with_topics.append((topics[assigned_topic], sentence))

    return chunks_with_topics

# Replace 'your_file.pdf' with your actual PDF file path
pdf_path = '/content/1738082270933.pdf'
pdf_text = extract_text_from_pdf(pdf_path)

# Get topic-based chunks
topic_chunks = topic_based_chunk(pdf_text, num_topics=3)

# Display results
for topic, chunk in topic_chunks:
    print(f"{topic}: {chunk}\n")
Output
Topic 3: reasoning, r1, deepseek, the, of:
DeepSeek-R1 is a reasoning-focused large language model (LLM) developed to
enhance reasoning capabilities in Generative AI systems through advanced
reinforcement learning techniques.
Explanation: Topic 3 is characterized by keywords like “reasoning,” “R1,” “DeepSeek”, which frequently appear in sentences about the DeepSeek model.
Contextual Chunking in Retrieval-Augmented Generation (RAG) refers to the strategy of segmenting documents or data into meaningful “chunks” that preserve the semantic context. This technique enhances the retrieval and generation performance of RAG models by ensuring that the model has access to coherent, context-rich pieces of information, rather than arbitrary or fragmented text segments.
In RAG systems, the process involves two main steps: first, the source documents are split into chunks and indexed as vector embeddings; second, the most relevant chunks are retrieved at query time and passed to the model as context for generation.
If the chunks are poorly segmented, the retrieval process might fetch incomplete or contextually weak information, leading to subpar generation quality. Contextual chunking helps mitigate this by ensuring that each chunk contains enough semantic information to be useful on its own.
Here’s how you set the chunk process prompt for contextual chunking:
# create chunk context generation chain
from langchain.prompts import ChatPromptTemplate
from langchain.schema import StrOutputParser
from langchain_openai import ChatOpenAI

chatgpt = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

def generate_chunk_context(document, chunk):
    chunk_process_prompt = """You are an AI assistant specializing in research
                              paper analysis. Your task is to provide brief,
                              relevant context for a chunk of text based on the
                              following research paper.

                              Here is the research paper:
                              <paper>
                              {paper}
                              </paper>

                              Here is the chunk we want to situate within the whole
                              document:
                              <chunk>
                              {chunk}
                              </chunk>

                              Provide a concise context (3-4 sentences max) for this
                              chunk, considering the following guidelines:

                              - Give a short succinct context to situate this chunk
                                within the overall document for the purposes of
                                improving search retrieval of the chunk.
                              - Answer only with the succinct context and nothing
                                else.
                              - Context should be mentioned like 'Focuses on ....'
                                do not mention 'this chunk or section focuses on...'

                              Context:
                           """

    prompt_template = ChatPromptTemplate.from_template(chunk_process_prompt)

    agentic_chunk_chain = (prompt_template
                           | chatgpt
                           | StrOutputParser())

    context = agentic_chunk_chain.invoke({'paper': document, 'chunk': chunk})

    return context
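Once the context generator is defined, each chunk can be prefixed with its generated context before being embedded and indexed. A small illustrative usage sketch, where the document and chunk strings are placeholders:

# Illustrative usage: enrich a chunk with its generated context before embedding
paper_text = "full text of the research paper goes here"
chunk_text = "one chunk produced by your text splitter goes here"

context = generate_chunk_context(paper_text, chunk_text)
enriched_chunk = f"{context}\n\n{chunk_text}"  # this enriched text is what gets embedded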
For more information, refer to this article – A Comprehensive Guide to Building Contextual RAG Systems with Hybrid Search and Reranking
Late Chunking addresses the challenges of maintaining contextual coherence when processing long documents for retrieval applications. Unlike traditional chunking approaches that segment text early in the pipeline, potentially disrupting long-distance contextual dependencies, Late Chunking leverages long-context embedding models to generate contextual chunk embeddings. This ensures that references spread across multiple text segments (like pronouns or entity mentions) are preserved within their broader context, leading to higher-quality vector representations and more effective retrieval performance. This method mitigates the shortcomings of conventional RAG pipelines, particularly in handling anaphoric references and fragmented information.
To see how Jina Embeddings work, explore this: Jina Embeddings.
When breaking down a Wikipedia article into smaller chunks, phrases like “its” or “the city” often refer back to something mentioned earlier, such as “Berlin” in the first sentence. However, splitting the text disconnects these references from the original entity, making it difficult for embedding models to correctly associate them with “Berlin.” This results in less accurate vector representations and weaker performance in retrieval-augmented generation (RAG) systems.
Late Chunking addresses this issue by processing the entire text—or as much of it as possible—through the transformer layer of the embedding model before splitting it into chunks. This approach generates token-level vector representations that capture the full context of the text. Afterward, the system applies mean pooling to each chunk to create embeddings, ensuring they retain important contextual information since the full text was initially considered.
Unlike basic chunking methods that process each chunk in isolation, Late Chunking allows every chunk to retain influence from the broader document context. As a result, references like “its” and “the city” remain correctly associated with “Berlin,” even when appearing in different chunks. This improves RAG systems’ accuracy, making them more context-aware and capable of delivering better, more coherent answers.
!pip install transformers==4.43.4
from transformers import AutoModel
from transformers import AutoTokenizer
# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
def chunk_by_sentences(input_text: str, tokenizer: callable):
    """
    Split the input text into sentences using the tokenizer
    :param input_text: The text snippet to split into sentences
    :param tokenizer: The tokenizer to use
    :return: A tuple containing the list of text chunks and their corresponding token spans
    """
    inputs = tokenizer(input_text, return_tensors='pt', return_offsets_mapping=True)
    punctuation_mark_id = tokenizer.convert_tokens_to_ids('.')
    sep_id = tokenizer.convert_tokens_to_ids('[SEP]')
    token_offsets = inputs['offset_mapping'][0]
    token_ids = inputs['input_ids'][0]
    chunk_positions = [
        (i, int(start + 1))
        for i, (token_id, (start, end)) in enumerate(zip(token_ids, token_offsets))
        if token_id == punctuation_mark_id
        and (
            token_offsets[i + 1][0] - token_offsets[i][1] > 0
            or token_ids[i + 1] == sep_id
        )
    ]
    chunks = [
        input_text[x[1] : y[1]]
        for x, y in zip([(1, 0)] + chunk_positions[:-1], chunk_positions)
    ]
    span_annotations = [
        (x[0], y[0]) for (x, y) in zip([(1, 0)] + chunk_positions[:-1], chunk_positions)
    ]
    return chunks, span_annotations
import requests

def chunk_by_tokenizer_api(input_text: str, tokenizer: callable):
    # Define the API endpoint and payload
    url = 'https://tokenize.jina.ai/'
    payload = {
        "content": input_text,
        "return_chunks": "true",
        "max_chunk_length": "1000"
    }

    # Make the API request
    response = requests.post(url, json=payload)
    response_data = response.json()

    # Extract chunks and positions from the response
    chunks = response_data.get("chunks", [])
    chunk_positions = response_data.get("chunk_positions", [])

    # Adjust chunk positions to match the input format
    span_annotations = [(start, end) for start, end in chunk_positions]

    return chunks, span_annotations
nput_text = "Berlin is the capital and largest city of Germany, both by area and by population. Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits. The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."
# determine chunks
chunks, span_annotations = chunk_by_sentences(input_text, tokenizer)
print('Chunks:\n- "' + '"\n- "'.join(chunks) + '"')
Chunks:
- "Berlin is the capital and largest city of Germany, both by area and by
population."
- " Its more than 3.85 million inhabitants make it the European Union's most
populous city, as measured by population within city limits."
- " The city is also one of the states of Germany, and is the third smallest
state in the country in terms of area."
def late_chunking(
    model_output: 'BatchEncoding', span_annotation: list, max_length=None
):
    token_embeddings = model_output[0]
    outputs = []
    for embeddings, annotations in zip(token_embeddings, span_annotation):
        if (
            max_length is not None
        ):  # remove annotations which go beyond the max-length of the model
            annotations = [
                (start, min(end, max_length - 1))
                for (start, end) in annotations
                if start < (max_length - 1)
            ]
        pooled_embeddings = [
            embeddings[start:end].sum(dim=0) / (end - start)
            for start, end in annotations
            if (end - start) >= 1
        ]
        pooled_embeddings = [
            embedding.detach().cpu().numpy() for embedding in pooled_embeddings
        ]
        outputs.append(pooled_embeddings)

    return outputs
# chunk before
embeddings_traditional_chunking = model.encode(chunks)
# chunk afterwards (context-sensitive chunked pooling)
inputs = tokenizer(input_text, return_tensors='pt')
model_output = model(**inputs)
embeddings = late_chunking(model_output, [span_annotations])[0]
import numpy as np
cos_sim = lambda x, y: np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
berlin_embedding = model.encode('Berlin')
for chunk, new_embedding, trad_embeddings in zip(chunks, embeddings, embeddings_traditional_chunking):
    print(f'similarity_new("Berlin", "{chunk}"):', cos_sim(berlin_embedding, new_embedding))
    print(f'similarity_trad("Berlin", "{chunk}"):', cos_sim(berlin_embedding, trad_embeddings))
Output
similarity_new("Berlin", "Berlin is the capital and largest city of Germany,
both by area and by population."): 0.849546
similarity_trad("Berlin", "Berlin is the capital and largest city of Germany,
both by area and by population."): 0.8486219
similarity_new("Berlin", " Its more than 3.85 million inhabitants make it the
European Union's most populous city, as measured by population within city
limits."): 0.82489026
similarity_trad("Berlin", " Its more than 3.85 million inhabitants make it
the European Union's most populous city, as measured by population within
city limits."): 0.70843387
similarity_new("Berlin", " The city is also one of the states of Germany, and
is the third smallest state in the country in terms of area."): 0.8498009
similarity_trad("Berlin", " The city is also one of the states of Germany,
and is the third smallest state in the country in terms of area."):
0.75345534
In the output, you can clearly see the improvement in semantic similarity when late chunking is used.
General performance improvement: the late-chunking embeddings score as high as or higher than the traditional embeddings for every chunk.
Notable improvements in ambiguous references: the chunks that refer to Berlin only as “Its” or “The city” show the largest gains (roughly 0.82 vs. 0.71 and 0.85 vs. 0.75), because the full-document context was available before pooling.
Consistency across examples: with late chunking, all three chunks keep a similar score against “Berlin” (around 0.82 to 0.85), whereas traditional chunking drops sharply for the chunks that never mention Berlin by name.
Chunking is crucial for RAG systems: it determines how data is managed and optimized, and ultimately how reliable the application is. Various chunking strategies, ranging from simple character-based splits to advanced methods like semantic, agentic, and late chunking, help improve data retrievability, contextual relevance, and model performance. Selecting the right chunking approach depends on content type, task requirements, and desired output quality, making it an essential practice for building efficient AI-powered applications.
If you found this article helpful, comment below!