Natural Language Processing (NLP) has rapidly advanced, particularly with the emergence of Retrieval-Augmented Generation (RAG) pipelines, which effectively address complex, information-dense queries. By combining the precision of retrieval-based systems with the creativity of generative models, RAG pipelines enhance the ability to answer questions with high relevance and context, whether by extracting sections from research papers, summarizing lengthy documents, or addressing user queries based on extensive knowledge bases. However, a key challenge in RAG pipelines is managing large documents, as entire texts often exceed the token limits of models like GPT-4.
This necessitates document chunking techniques, which break down texts into smaller, more manageable pieces while preserving context and relevance, ensuring that the most meaningful information can be retrieved for improved response accuracy. The effectiveness of a RAG pipeline can be significantly influenced by chunking strategies, whether through fixed sizes, semantic meaning, or sentence boundaries. In this blog, we’ll explore various chunking techniques, provide code snippets for each, and discuss how these methods contribute to building a robust and efficient RAG pipeline. Ready to discover how chunking can enhance your RAG pipeline? Let’s get started!
This article was published as a part of the Data Science Blogathon.
In the context of Retrieval-Augmented Generation (RAG) pipelines, chunking refers to the process of breaking down large documents into smaller, manageable pieces, or chunks, for more effective retrieval and generation. Since most large language models (LLMs) like GPT-4 have limits on the number of tokens they can process at once, chunking ensures that documents are split into sections that the model can handle while preserving the context and meaning necessary for accurate retrieval.
Without proper chunking, a RAG pipeline may miss critical information or provide incomplete, out-of-context responses. The goal is to create chunks that strike a balance between being large enough to retain meaning and small enough to fit within the model’s processing limits. Well-structured chunks help ensure that the retrieval system can accurately identify relevant parts of a document, which the generative model can then use to generate an informed response.
In short, chunking is not just about breaking text into pieces—it’s about designing the right chunks that retain meaning and context, handle multiple modalities, and fit within the model’s constraints. The right chunking strategy can significantly improve both retrieval accuracy and the quality of the responses generated by the pipeline.
Effective chunking helps preserve context, improve retrieval accuracy, and ensure smooth interaction between the retrieval and generation phases in an RAG pipeline. Below, we’ll cover different chunking strategies, explain when to use them, and explore their advantages and disadvantages—each followed by a code example.
Fixed-size chunking splits documents into chunks of a predefined size, typically by word count, token count, or character count.
When to Use:
When you need a simple, straightforward approach and the document structure isn’t critical. It works well when processing smaller, less complex documents.
Advantages:
Disadvantages:
def fixed_size_chunk(text, max_words=100):
words = text.split()
return [' '.join(words[i:i + max_words]) for i in range(0, len(words),
max_words)]
# Applying Fixed-Size Chunking
fixed_chunks = fixed_size_chunk(sample_text)
for chunk in fixed_chunks:
print(chunk, '\n---\n')
Code Output: The output for this and the following codes will be shown for a sample text as below. The final result will vary based on the use case or document considered.
sample_text = """
Introduction
Data Science is an interdisciplinary field that uses scientific methods, processes,
algorithms, and systems to extract knowledge and insights from structured and
unstructured data. It draws from statistics, computer science, machine learning,
and various data analysis techniques to discover patterns, make predictions, and
derive actionable insights.
Data Science can be applied across many industries, including healthcare, finance,
marketing, and education, where it helps organizations make data-driven decisions,
optimize processes, and understand customer behaviors.
Overview of Big Data
Big data refers to large, diverse sets of information that grow at ever-increasing
rates. It encompasses the volume of information, the velocity or speed at which it
is created and collected, and the variety or scope of the data points being
covered.
Data Science Methods
There are several important methods used in Data Science:
1. Regression Analysis
2. Classification
3. Clustering
4. Neural Networks
Challenges in Data Science
- Data Quality: Poor data quality can lead to incorrect conclusions.
- Data Privacy: Ensuring the privacy of sensitive information.
- Scalability: Handling massive datasets efficiently.
Conclusion
Data Science continues to be a driving force in many industries, offering insights
that can lead to better decisions and optimized outcomes. It remains an evolving
field that incorporates the latest technological advancements.
"""
This method chunks text based on natural sentence boundaries. Each chunk contains a set number of sentences, preserving semantic units.
When to Use:
Maintaining coherent ideas is crucial, and splitting mid-sentence would result in losing meaning.
Advantages:
Disadvantages:
import spacy
nlp = spacy.load("en_core_web_sm")
def sentence_chunk(text):
doc = nlp(text)
return [sent.text for sent in doc.sents]
# Applying Sentence-Based Chunking
sentence_chunks = sentence_chunk(sample_text)
for chunk in sentence_chunks:
print(chunk, '\n---\n')
Code Output:
This strategy splits text based on paragraph boundaries, treating each paragraph as a chunk.
When to Use:
Best for structured documents like reports or essays where each paragraph contains a complete idea or argument.
Advantages:
Disadvantages:
def paragraph_chunk(text):
paragraphs = text.split('\n\n')
return paragraphs
# Applying Paragraph-Based Chunking
paragraph_chunks = paragraph_chunk(sample_text)
for chunk in paragraph_chunks:
print(chunk, '\n---\n')
Code Output:
This method uses machine learning models (like transformers) to split text into chunks based on semantic meaning.
When to Use:
When preserving the highest level of context is critical, such as in complex, technical documents.
Advantages:
Disadvantages:
def semantic_chunk(text, max_len=200):
doc = nlp(text)
chunks = []
current_chunk = []
for sent in doc.sents:
current_chunk.append(sent.text)
if len(' '.join(current_chunk)) > max_len:
chunks.append(' '.join(current_chunk))
current_chunk = []
if current_chunk:
chunks.append(' '.join(current_chunk))
return chunks
# Applying Semantic-Based Chunking
semantic_chunks = semantic_chunk(sample_text)
for chunk in semantic_chunks:
print(chunk, '\n---\n')
Code Output:
This strategy handles different content types (text, images, tables) separately. Each modality is chunked independently based on its characteristics.
When to Use:
For documents containing diverse content types like PDFs or technical manuals with mixed media.
Advantages:
Disadvantages:
def modality_chunk(text, images=None, tables=None):
# This function assumes you have pre-processed text, images, and tables
text_chunks = paragraph_chunk(text)
return {'text_chunks': text_chunks, 'images': images, 'tables': tables}
# Applying Modality-Specific Chunking
modality_chunks = modality_chunk(sample_text, images=['img1.png'], tables=['table1'])
print(modality_chunks)
Code Output: The sample text contained only text modality, so only one chunk would be obtained as shown below.
Sliding window chunking creates overlapping chunks, allowing each chunk to share part of its content with the next.
When to Use:
When you need to ensure continuity of context between chunks, such as in legal or academic documents.
Advantages:
Disadvantages:
def sliding_window_chunk(text, chunk_size=100, overlap=20):
tokens = text.split()
chunks = []
for i in range(0, len(tokens), chunk_size - overlap):
chunk = ' '.join(tokens[i:i + chunk_size])
chunks.append(chunk)
return chunks
# Applying Sliding Window Chunking
sliding_chunks = sliding_window_chunk(sample_text)
for chunk in sliding_chunks:
print(chunk, '\n---\n')
Code Output: The image output does not capture the overlap; manual text output is also provided for reference. Note the text overlaps.
--- Applying sliding_window_chunk ---
Chunk 1:
Introduction Data Science is an interdisciplinary field that uses scientific
methods, processes, algorithms, and systems to extract knowledge and insights
from structured and unstructured data. It draws from statistics, computer
science, machine learning, and various data analysis techniques to discover
patterns, make predictions, and derive actionable insights. Data Science can
be applied across many industries, including healthcare, finance, marketing,
and education, where it helps organizations make data-driven decisions, optimize
processes, and understand customer behaviors. Overview of Big Data Big data refers
to large, diverse sets of information that grow at ever-increasing rates.
It encompasses the volume of information, the velocity
--------------------------------------------------
Chunk 2:
refers to large, diverse sets of information that grow at ever-increasing rates.
It encompasses the volume of information, the velocity or speed at which it is
created and collected, and the variety or scope of the data points being covered.
Data Science Methods There are several important methods used in Data Science:
1. Regression Analysis 2. Classification 3. Clustering 4. Neural Networks
Challenges in Data Science - Data Quality: Poor data quality can lead to
incorrect conclusions. - Data Privacy: Ensuring the privacy of sensitive
information. - Scalability: Handling massive datasets efficiently. Conclusion
Data Science continues to be a driving
--------------------------------------------------
Chunk 3:
Ensuring the privacy of sensitive information. - Scalability: Handling massive
datasets efficiently. Conclusion Data Science continues to be a driving force
in many industries, offering insights that can lead to better decisions and
optimized outcomes. It remains an evolving field that incorporates the latest
technological advancements.
--------------------------------------------------
Hierarchical chunking breaks down documents at multiple levels, such as sections, subsections, and paragraphs.
When to Use:
For highly structured documents like academic papers or legal texts, where maintaining hierarchy is essential.
Advantages:
Disadvantages:
def hierarchical_chunk(text, section_keywords):
sections = []
current_section = []
for line in text.splitlines():
if any(keyword in line for keyword in section_keywords):
if current_section:
sections.append("\n".join(current_section))
current_section = [line]
else:
current_section.append(line)
if current_section:
sections.append("\n".join(current_section))
return sections
# Applying Hierarchical Chunking
section_keywords = ["Introduction", "Overview", "Methods", "Conclusion"]
hierarchical_chunks = hierarchical_chunk(sample_text, section_keywords)
for chunk in hierarchical_chunks:
print(chunk, '\n---\n')
Code Output:
This method adapts chunking based on content characteristics (e.g., chunking text at paragraph level, tables as separate entities).
When to Use:
For documents with heterogeneous content, such as eBooks or technical manuals, chunking must vary based on content type.
Advantages:
Disadvantages:
def content_aware_chunk(text):
chunks = []
current_chunk = []
for line in text.splitlines():
if line.startswith(('##', '###', 'Introduction', 'Conclusion')):
if current_chunk:
chunks.append('\n'.join(current_chunk))
current_chunk = [line]
else:
current_chunk.append(line)
if current_chunk:
chunks.append('\n'.join(current_chunk))
return chunks
# Applying Content-Aware Chunking
content_chunks = content_aware_chunk(sample_text)
for chunk in content_chunks:
print(chunk, '\n---\n')
Code Output:
This strategy specifically handles document tables by extracting them as independent chunks and converting them into formats like markdown or JSON for easier processing
When to Use:
For documents that contain tabular data, such as financial reports or technical documents, where tables carry important information.
Advantages:
Disadvantages:
import pandas as pd
def table_aware_chunk(table):
return table.to_markdown()
# Sample table data
table = pd.DataFrame({
"Name": ["John", "Alice", "Bob"],
"Age": [25, 30, 22],
"Occupation": ["Engineer", "Doctor", "Artist"]
})
# Applying Table-Aware Chunking
table_markdown = table_aware_chunk(table)
print(table_markdown)
Code Output: For this example, a table was considered; note that only the table is chunked in the code output.
Token-based chunking splits text based on a fixed number of tokens rather than words or sentences. It uses tokenizers from NLP models (e.g., Hugging Face’s transformers).
When to Use:
For models that operate on tokens, such as transformer-based models with token limits (e.g., GPT-3 or GPT-4).
Advantages:
Disadvantages:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
def token_based_chunk(text, max_tokens=200):
tokens = tokenizer(text)["input_ids"]
chunks = [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
return [tokenizer.decode(chunk) for chunk in chunks]
# Applying Token-Based Chunking
token_chunks = token_based_chunk(sample_text)
for chunk in token_chunks:
print(chunk, '\n---\n')
Code Output
Entity-based chunking leverages Named Entity Recognition (NER) to break text into chunks based on recognized entities, such as people, organizations, or locations.
When to Use:
For documents where specific entities are important to maintain as contextual units, such as resumes, contracts, or legal documents.
Advantages:
Disadvantages:
def entity_based_chunk(text):
doc = nlp(text)
entities = [ent.text for ent in doc.ents]
return entities
# Applying Entity-Based Chunking
entity_chunks = entity_based_chunk(sample_text)
print(entity_chunks)
Code Output: For this purpose, training a specific NER model for the input would be the ideal way. Given output is for reference and code sample.
This strategy splits the document based on topics using techniques like Latent Dirichlet Allocation (LDA) or other topic modeling algorithms to segment the text.
When to Use:
For documents that cover multiple topics, such as news articles, research papers, or reports with diverse subject matter.
Advantages:
Disadvantages:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np
def topic_based_chunk(text, num_topics=3):
# Split the text into sentences for chunking
sentences = text.split('. ')
# Vectorize the sentences
vectorizer = CountVectorizer()
sentence_vectors = vectorizer.fit_transform(sentences)
# Apply LDA for topic modeling
lda = LatentDirichletAllocation(n_components=num_topics, random_state=42)
lda.fit(sentence_vectors)
# Get the topic-word distribution
topic_word = lda.components_
vocabulary = vectorizer.get_feature_names_out()
# Identify the top words for each topic
topics = []
for topic_idx, topic in enumerate(topic_word):
top_words_idx = topic.argsort()[:-6:-1]
topic_keywords = [vocabulary[i] for i in top_words_idx]
topics.append("Topic {}: {}".format(topic_idx + 1, ', '.join(topic_keywords)))
# Generate chunks with topics
chunks_with_topics = []
for i, sentence in enumerate(sentences):
topic_assignments = lda.transform(vectorizer.transform([sentence]))
assigned_topic = np.argmax(topic_assignments)
chunks_with_topics.append((topics[assigned_topic], sentence))
return chunks_with_topics
# Get topic-based chunks
topic_chunks = topic_based_chunk(sample_text, num_topics=3)
# Display results
for topic, chunk in topic_chunks:
print(f"{topic}: {chunk}\n")
Code Output:
This approach splits documents based on page boundaries, commonly used for PDFs or formatted documents where each page is treated as a chunk.
When to Use:
For page-oriented documents, such as PDFs or print-ready reports, where page boundaries have semantic importance.
Advantages:
Disadvantages:
def page_based_chunk(pages):
# Split based on pre-processed page list (simulating PDF page text)
return pages
# Sample pages
pages = ["Page 1 content", "Page 2 content", "Page 3 content"]
# Applying Page-Based Chunking
page_chunks = page_based_chunk(pages)
for chunk in page_chunks:
print(chunk, '\n---\n')
Code Output: The sample text lacks segregation based on page numbers, so the code output is out of scope for this snippet. Readers can take the code snippet and try it on their documents to get the page-based chunked output.
This method chunks documents based on predefined keywords or phrases that signal topic shifts (e.g., “Introduction,” “Conclusion”).
When to Use:
Best for documents that follow a clear structure, such as scientific papers or technical specifications.
Advantages:
Disadvantages:
def keyword_based_chunk(text, keywords):
chunks = []
current_chunk = []
for line in text.splitlines():
if any(keyword in line for keyword in keywords):
if current_chunk:
chunks.append('\n'.join(current_chunk))
current_chunk = [line]
else:
current_chunk.append(line)
if current_chunk:
chunks.append('\n'.join(current_chunk))
return chunks
# Applying Keyword-Based Chunking
keywords = ["Introduction", "Overview", "Conclusion", "Methods", "Challenges"]
keyword_chunks = keyword_based_chunk(sample_text, keywords)
for chunk in keyword_chunks:
print(chunk, '\n---\n')
Code Output:
Hybrid chunking combines multiple chunking strategies based on content type and document structure. For instance, text can be chunked by sentences, while tables and images are handled separately.
When to Use:
For complex documents that contain various content types, such as technical reports, business documents, or product manuals.
Advantages:
Disadvantages:
def hybrid_chunk(text):
paragraphs = paragraph_chunk(text)
hybrid_chunks = []
for paragraph in paragraphs:
hybrid_chunks += sentence_chunk(paragraph)
return hybrid_chunks
# Applying Hybrid Chunking
hybrid_chunks = hybrid_chunk(sample_text)
for chunk in hybrid_chunks:
print(chunk, '\n---\n')
Code Output:
Bonus: The entire notebook is being made available for the reader to use the codes and visualize the chucking outputs easily (notebook link). Feel free to browse through and try out these strategies to build your next RAG application.
Next, we will look into some chunking trafe-offs and try to get some idea on the use case scenarios.
When building a retrieval-augmented generation (RAG) pipeline, optimizing chunking for specific use cases and document types is crucial. Different scenarios have different requirements based on document size, content diversity, and retrieval speed. Let’s explore some optimization strategies based on these factors.
Large-scale documents like academic papers, legal texts, or government reports often span hundreds of pages and contain diverse types of content (e.g., text, images, tables, footnotes). Chunking strategies for such documents should balance between capturing relevant context and keeping chunk sizes manageable for fast and efficient retrieval.
Key Considerations:
Best Strategies for Large-Scale Documents:
Example Use Case:
The size of the chunks directly affects both retrieval speed and the accuracy of results. Larger chunks tend to preserve more context, improving the accuracy of retrieval, but they can slow down the system as they require more memory and computation. Conversely, smaller chunks allow for faster retrieval but at the risk of losing important contextual information.
Key Trade-offs:
Optimization Tactics:
Example Use Case:
Each chunking strategy fits different types of documents and retrieval scenarios, so understanding when to use a particular method can greatly improve performance in an RAG pipeline.
For smaller documents, like FAQs or customer support pages, the retrieval speed is paramount, and maintaining perfect context isn’t always necessary. Strategies like sentence-based chunking or keyword-based chunking can work well.
For long-form documents, such as research papers or legal documents, context matters more, and breaking down by semantic or hierarchical boundaries becomes important.
In documents with mixed content types like images, tables, and text (e.g., scientific reports), modality-specific chunking is crucial to ensure each type of content is handled separately for optimal results.
Documents that cover multiple topics or sections, like eBooks or news articles, benefit from topic-based chunking strategies. This ensures that each chunk focuses on a coherent topic, which is ideal for use cases where specific topics need to be retrieved.
In this blog, we’ve delved into the critical role of chunking within retrieval-augmented generation (RAG) pipelines. Chunking serves as a foundational process that transforms large documents into smaller, manageable pieces, enabling models to retrieve and generate relevant information efficiently. Each chunking strategy presents its own advantages and disadvantages, making it essential to choose the appropriate method based on specific use cases. By understanding how different strategies impact the retrieval process, you can optimize the performance of your RAG system.
Choosing the right chunking strategy depends on several factors, including document type, the need for context preservation, and the balance between retrieval speed and accuracy. Whether you’re working with academic papers, legal documents, or mixed-content files, selecting an appropriate approach can significantly enhance the effectiveness of your RAG pipeline. By iterating and refining your chunking methods, you can adapt to changing document types and user needs, ensuring that your retrieval system remains robust and efficient.
A. Chunking techniques in NLP involve breaking down large texts into smaller, manageable pieces to enhance processing efficiency while preserving context and relevance.
A. The choice of chunking strategy depends on several factors, including the type of document, its structure, and the specific use case. For example, fixed-size chunking might be suitable for smaller documents, while semantic-based chunking is better for complex texts requiring context preservation. Evaluating the pros and cons of each strategy will help determine the best approach for your specific needs.
A. Yes, the choice of chunking strategy can significantly impact the performance of a RAG pipeline. Strategies that preserve context and semantics, such as semantic-based or sentence-based chunking, can lead to more accurate retrieval and generation results. Conversely, methods that break context (e.g., fixed-size chunking) may reduce the quality of the generated responses, as relevant information may be lost between chunks.
A. Chunking techniques improve RAG pipelines by ensuring that only meaningful information is retrieved, leading to more accurate and contextually relevant responses.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.