Can AI generate truly relevant answers at scale? How do we make sure it understands complex, multi-turn conversations? And how do we keep it from confidently spitting out incorrect facts? These are the kinds of challenges that modern AI systems face, especially those built using RAG. RAG combines the power of document retrieval with the fluency of language generation, allowing systems to answer questions with context-aware, grounded responses. While basic RAG systems work well for straightforward tasks, they often stumble with complex queries, hallucinations, and context retention across longer interactions. That’s where advanced RAG techniques come in.
In this blog, we’ll explore how to level up your RAG pipelines, enhancing each stage of the stack: Indexing, Retrieval, and Generation. We’ll walk through powerful methods (with hands-on code) that can help improve relevance, reduce noise, and scale your system’s performance—whether you’re building a healthcare assistant, an educational tutor, or an enterprise knowledge bot.
Let’s look at the Basic RAG framework:
This RAG system architecture shows how chunk embeddings are stored in the vector store. The first step is to load the documents, then split them into chunks using a chunking technique, and then embed each chunk with an embedding model so the content can be searched efficiently and passed to the LLM.
This image depicts the retrieval and generation steps of RAG: the user asks a question, the system searches the vector store to extract results relevant to that question, and the retrieved content is passed to the LLM along with the question so the LLM can provide a structured answer.
Basic RAG systems have clear limitations, especially in demanding situations.
Hence, we’ll go through each part of the RAG stack, i.e. Indexing, Retrieval, and Generation, and discuss improvements using open-source libraries and resources. These advanced RAG techniques apply generally, whether you are building a healthcare chatbot, an educational bot, or another application, and they will improve most RAG systems.
Let’s begin with the Advanced RAG Techniques!
Good indexing is essential for any RAG system. The first step involves how we ingest, break up, and store data. Let’s explore methods for indexing data, focusing on chunking text and using metadata.
Hierarchical Navigable Small Worlds (HNSW) is an effective algorithm for finding similar items in large datasets. It helps in quickly locating approximate nearest neighbors (ANN) by using a structured approach based on graphs.
HNSW relies on a few key components: a proximity graph that links similar vectors, a hierarchy of layers that get denser toward the bottom, an entry point in the sparse top layer, and a greedy search that descends the layers toward the query.
HNSW’s design allows it to find similar items quickly and accurately. This makes it a strong choice for tasks that require efficient searches in large datasets.
The image depicts a simplified HNSW search: starting at the “entry point” (blue), the algorithm navigates the graph towards the “query vector” (yellow). The “nearest neighbor” (striped) is identified by traversing edges based on proximity. This illustrates the core concept of navigating a graph for efficient approximate nearest neighbor search.
Follow these steps to implement the Hierarchical Navigable Small Worlds (HNSW) algorithm with FAISS. This guide includes example outputs to illustrate the process.
First, define the parameters for the HNSW index. You need to specify the size of the vectors and the number of neighbors for each node.
import faiss
import numpy as np
# Set up HNSW parameters
d = 128 # Size of the vectors
M = 32 # Number of neighbors for each node
Create the HNSW index using the parameters defined above.
# Initialize the HNSW index
index = faiss.IndexHNSWFlat(d, M)
Before adding data to the index, set the `efConstruction` parameter. This parameter controls how many neighbors the algorithm considers when building the index.
efConstruction = 200 # Example value for efConstruction
index.hnsw.efConstruction = efConstruction
For this example, generate random data to index. Here, `xb` represents the dataset you want to index.
# Generate random dataset of vectors
n = 10000 # Number of vectors to index
xb = np.random.random((n, d)).astype('float32')
# Add data to the index
index.add(xb) # Build the index
After building the index, set the `efSearch` parameter. It controls how many candidate neighbors are explored during a query: higher values improve recall at the cost of search speed.
efSearch = 100 # Example value for efSearch
index.hnsw.efSearch = efSearch
Now, you can search for the nearest neighbors of your query vectors. Here, `xq` represents the query vectors.
# Generate random query vectors
nq = 5 # Number of query vectors
xq = np.random.random((nq, d)).astype('float32')
# Perform a search for the top k nearest neighbors
k = 5 # Number of nearest neighbors to retrieve
distances, indices = index.search(xq, k)
# Output the results
print("Query Vectors:\n", xq)
print("\nNearest Neighbors Indices:\n", indices)
print("\nNearest Neighbors Distances:\n", distances)
Query Vectors:
[[0.12345678 0.23456789 ... 0.98765432]
[0.23456789 0.34567890 ... 0.87654321]
[0.34567890 0.45678901 ... 0.76543210]
[0.45678901 0.56789012 ... 0.65432109]
[0.56789012 0.67890123 ... 0.54321098]]
Nearest Neighbors Indices:
[[ 123 456 789 101 112]
[ 234 567 890 123 134]
[ 345 678 901 234 245]
[ 456 789 012 345 356]
[ 567 890 123 456 467]]
Nearest Neighbors Distances:
[[0.123 0.234 0.345 0.456 0.567]
[0.234 0.345 0.456 0.567 0.678]
[0.345 0.456 0.567 0.678 0.789]
[0.456 0.567 0.678 0.789 0.890]
[0.567 0.678 0.789 0.890 0.901]]
This approach divides text based on meaning, not just fixed sizes. Each chunk represents a coherent piece of information. We calculate the cosine distance between sentence embeddings. If two sentences are semantically similar (below a threshold), they go in the same chunk. This creates chunks of different lengths based on the content’s meaning.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
# `document` is the raw text to split into semantically coherent chunks
text_splitter = SemanticChunker(OpenAIEmbeddings())
docs = text_splitter.create_documents([document])
print(docs[0].page_content)
This code utilizes SemanticChunker from LangChain, which splits a document into semantically related chunks using OpenAI embeddings. Each resulting chunk aims to capture a coherent semantic unit rather than an arbitrary text segment.
This advanced method uses a language model to create complete statements from text. Each chunk is semantically whole. A language model (e.g., a 7-billion parameter model) processes the text. It breaks it into statements that make sense on their own. The model then combines these into chunks, balancing completeness and context. This method is computationally heavy but offers high accuracy.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set in the environment

async def generate_contexts(document, chunks):
    async def process_chunk(chunk):
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Generate a brief context explaining how this chunk relates to the full document."},
                {"role": "user", "content": f"<document> \n{document} \n</document> \nHere is the chunk we want to situate within the whole document \n<chunk> \n{chunk} \n</chunk> \nPlease give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."}
            ],
            temperature=0.3,
            max_tokens=100
        )
        context = response.choices[0].message.content
        return f"{context} {chunk}"

    # Process all chunks concurrently
    contextual_chunks = await asyncio.gather(
        *[process_chunk(chunk) for chunk in chunks]
    )
    return contextual_chunks
This code snippet utilizes an LLM (likely OpenAI’s GPT-4o via the client.chat.completions.create call) to generate contextual information for each chunk of a document. It processes each chunk asynchronously, prompting the LLM to explain how the chunk relates to the full document. Finally, it returns a list of the original chunks prepended with their generated context, effectively enriching them for improved search retrieval.
Metadata provides extra context. This improves retrieval accuracy. By including metadata like dates, patient age, and pre-existing conditions, you can filter out irrelevant information during searches. Filtering narrows the search, making retrieval more efficient and relevant. When indexing, store metadata alongside the text.
For example, healthcare data includes age, visit date, and specific conditions in patient records. Use this metadata to filter search results so the system retrieves only relevant information. For instance, if a query relates to children, filter out records of patients over 18. This reduces noise and improves relevance, as the sketch below shows.
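As a minimal sketch of metadata filtering, assume the chunks were indexed in a Chroma vector store with an age field in their metadata (the store, field names, and filter value below are illustrative, not part of the original example):

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document
# Index chunks together with their metadata (field names are illustrative)
docs = [
    Document(page_content="Patient presented with mild asthma symptoms...",
             metadata={"age": 9, "visit_date": "2024-03-12", "condition": "asthma"}),
    Document(page_content="Follow-up visit for hypertension management...",
             metadata={"age": 54, "visit_date": "2024-01-05", "condition": "hypertension"}),
]
chroma_db = Chroma.from_documents(docs, embedding=OpenAIEmbeddings())
# Filter at query time so only pediatric records are searched (age < 18)
results = chroma_db.similarity_search(
    "treatment plan for asthma",
    k=2,
    filter={"age": {"$lt": 18}},
)

Because the filter constrains the candidate set before similarity ranking, adult records never compete with pediatric ones for the top-k slots.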
Chunk #1
Source Metadata: {'id': 'doc:1c6f3e3f7ee14027bc856822871572dc:26e9aac7d5494208a56ff0c6cbbfda20', 'source': 'https://plato.stanford.edu/entries/goedel/'}
Source Text:
2.2.1 The First Incompleteness Theorem
In his Logical Journey (Wang 1996) Hao Wang published the
full text of material Gödel had written (at Wang’s request)
about his discovery of the incompleteness theorems. This material had
formed the basis of Wang’s “Some Facts about Kurt
Gödel,” and was read and approved by Gödel:
Chunk #2
Source Metadata: {'id': 'doc:1c6f3e3f7ee14027bc856822871572dc:d15f62c453c64072b768e136080cb5ba', 'source': 'https://plato.stanford.edu/entries/goedel/'}
Source Text:
The First Incompleteness Theorem provides a counterexample to
completeness by exhibiting an arithmetic statement which is neither
provable nor refutable in Peano arithmetic, though true in the
standard model. The Second Incompleteness Theorem shows that the
consistency of arithmetic cannot be proved in arithmetic itself. Thus
Gödel’s theorems demonstrated the infeasibility of the
Hilbert program, if it is to be characterized by those particular
desiderata, consistency and completeness.
Here, we can see that the metadata contains the unique ID and source of each chunk, which provides additional context and makes retrieval easier.
You won’t always have much metadata, but a model like GLiNER can generate it on the fly: GLiNER tags and labels chunks during ingestion to create metadata.
Give GLiNER each chunk along with the tags you want it to identify. If it finds confident matches, it applies those labels; if no matches are confident, no tags are produced. This works well in general, but it may need fine-tuning for niche datasets, and it improves retrieval accuracy at the cost of an extra processing step.
GLiNER can parse incoming queries and match them against metadata labels for filtering.
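Here is a minimal sketch of on-the-fly metadata generation with the open-source gliner package; the checkpoint name, example chunk, and label set are illustrative assumptions:

from gliner import GLiNER
# Load a pretrained GLiNER checkpoint (the name shown here is one example)
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")
chunk = "The patient, a 9-year-old diagnosed with asthma, visited on 2024-03-12."
labels = ["age", "condition", "date"]  # the tags we ask GLiNER to identify
# Predictions below the confidence threshold are simply not returned
entities = model.predict_entities(chunk, labels, threshold=0.5)
# Turn the predictions into metadata for the chunk
metadata = {ent["label"]: ent["text"] for ent in entities}
print(metadata)  # e.g. {'age': '9-year-old', 'condition': 'asthma', 'date': '2024-03-12'}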
Paper: GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer
These techniques build a strong RAG system. They enable efficient retrieval from large datasets. The choice of chunking and metadata use depends on your dataset’s specific needs and features.
Now, let’s focus on the “R” in RAG. How can we improve retrieval from a vector database? This is about retrieving all documents relevant to a query. This greatly increases the chances the LLM can produce high-quality results. Here are several techniques:
Hybrid search combines vector search (which captures semantic meaning) with keyword search (which finds exact matches), using the strengths of both. In AI, many terms are specific keywords: algorithm names, technology terms, LLM names. A vector search alone might miss these; keyword search ensures these important terms are considered. Combining both methods creates a more complete retrieval process, and the two searches run in parallel.
Results are merged and ranked using a weighting system. For example, using Weaviate, you adjust the alpha parameter to balance vector and keyword results. This creates a combined, ranked list.
import weaviate
from langchain_community.retrievers import WeaviateHybridSearchRetriever
from langchain_core.documents import Document
# Connect to a running Weaviate instance (assumes a local deployment on the default port)
client = weaviate.Client("http://localhost:8080")
retriever = WeaviateHybridSearchRetriever(
    client=client,
    index_name="LangChain",
    text_key="text",
    attributes=[],
    create_schema_if_missing=True,
)
retriever.invoke("the ethical implications of AI")
This code connects to a Weaviate instance and initializes a WeaviateHybridSearchRetriever, which combines vector search and keyword search through Weaviate’s hybrid retrieval capabilities. Finally, it executes the query “the ethical implications of AI” to retrieve relevant documents using this hybrid approach.
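To make the weighting concrete, here is a small illustrative sketch of an alpha-weighted fusion of the two score lists; this is a simplification for intuition, not Weaviate’s actual scoring implementation:

def hybrid_score(vector_score: float, keyword_score: float, alpha: float = 0.5) -> float:
    # alpha = 1.0 is pure vector search, alpha = 0.0 is pure keyword search
    return alpha * vector_score + (1 - alpha) * keyword_score

# A document that matches the keywords strongly but is a weaker semantic match
print(hybrid_score(vector_score=0.62, keyword_score=0.91, alpha=0.5))  # 0.765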
Query rewriting recognizes that human queries may not be optimal for databases or language models. Using a language model to rewrite the query significantly improves retrieval.
Retrieval can yield different results based on slight changes in how a query is worded. If the embeddings do not accurately reflect the meaning of the data, this issue can become more pronounced. To address these challenges, prompt engineering or tuning is often used, but this process can be time-consuming.
The MultiQueryRetriever simplifies this task. It uses a large language model (LLM) to create multiple queries from different angles based on a single user input. For each generated query, it retrieves a set of relevant documents. By combining the unique results from all queries, the MultiQueryRetriever provides a broader set of potentially relevant documents. This approach enhances the chances of finding useful information without the need for extensive manual tuning.
import logging
from langchain_openai import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever
chatgpt = ChatOpenAI(model_name="gpt-4o", temperature=0)
# `chroma_db3` is an existing Chroma vector store built earlier
similarity_retriever3 = chroma_db3.as_retriever(search_type="similarity",
                                                search_kwargs={"k": 2})
mq_retriever = MultiQueryRetriever.from_llm(
    retriever=similarity_retriever3, llm=chatgpt,
    include_original=True
)
# Set logging so we can see what queries are generated by the LLM
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)
query = "what is the capital of India?"
docs = mq_retriever.invoke(query)
docs
This code sets up a multi-query retrieval system using LangChain. It generates multiple variations of the input query (“what is the capital of India?”). These variations are then used to query a Chroma vector database (chroma_db3) via a similarity retriever, aiming to broaden the search and capture diverse relevant documents. The MultiQueryRetriever ultimately aggregates and returns the retrieved documents.
[Document(metadata={'article_id': '5117', 'title': 'New Delhi'},
page_content='New Delhi () is the capital of India and a union territory of
the megacity of Delhi. It has a very old history and is home to several
monuments where the city is expensive to live in. In traditional Indian
geography it falls under the North Indian zone. The city has an area of
about 42.7\xa0km. New Delhi has a population of about 9.4 Million people.'),
Document(metadata={'article_id': '4062', 'title': 'Kolkata'},
page_content="Kolkata (spelled Calcutta before 1 January 2001) is the
capital city of the Indian state of West Bengal. It is the second largest
city in India after Mumbai. It is on the east bank of the River Hooghly.
When it is called Calcutta, it includes the suburbs. This makes it the third
largest city of India. This also makes it the world's 8th largest
metropolitan area as defined by the United Nations. Kolkata served as the
capital of India during the British Raj until 1911. Kolkata was once the
center of industry and education. However, it has witnessed political
violence and economic problems since 1954. Since 2000, Kolkata has grown due
to economic growth. Like other metropolitan cities in India, Kolkata
struggles with poverty, pollution and traffic congestion."),
Document(metadata={'article_id': '22215', 'title': 'States and union
territories of India'}, page_content='The Republic of India is divided into
twenty-eight States,and eight union territories including the National
Capital Territory.')]
Context compression helps improve the relevance of retrieved documents. It can work in two main ways: by compressing individual documents so that only the content relevant to the query remains, or by filtering out entire documents that are not relevant.
For the first approach, we can use the LLMChainExtractor, which reviews the initially returned documents and extracts only the content relevant to the query. It may also drop completely irrelevant documents.
Here is how to implement this using LangChain:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI
# Initialize the language model
chatgpt = ChatOpenAI(model_name="gpt-4o", temperature=0)
# Set up a similarity retriever
similarity_retriever = chroma_db3.as_retriever(search_type="similarity", search_kwargs={"k": 3})
# Create the extractor to get relevant content
compressor = LLMChainExtractor.from_llm(llm=chatgpt)
# Combine the retriever and the extractor
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=similarity_retriever)
# Example query
query = "What is the capital of India?"
docs = compression_retriever.invoke(query)
print(docs)
Output:
[Document(metadata={'article_id': '5117', 'title': 'New Delhi'},
page_content='New Delhi is the capital of India and a union territory of the
megacity of Delhi.')]
For a different query:
query = "What is the old capital of India?"
docs = compression_retriever.invoke(query)
print(docs)
[Document(metadata={'article_id': '4062', 'title': 'Kolkata'},
page_content='Kolkata served as the capital of India during the British Raj
until 1911.')]
The `LLMChainFilter` offers a simpler but effective way to filter documents. It uses an LLM chain to decide which documents to keep and which to discard without changing the content of the documents.
Here’s how to implement the filter:
from langchain.retrievers.document_compressors import LLMChainFilter
# Set up the filter
_filter = LLMChainFilter.from_llm(llm=chatgpt)
# Combine the retriever and the filter
compression_retriever = ContextualCompressionRetriever(base_compressor=_filter, base_retriever=similarity_retriever)
# Example query
query = "What is the capital of India?"
docs = compression_retriever.invoke(query)
print(docs)
[Document(metadata={'article_id': '5117', 'title': 'New Delhi'},
page_content='New Delhi is the capital of India and a union territory of the
megacity of Delhi.')]
For another query:
query = "What is the old capital of India?"
docs = compression_retriever.invoke(query)
print(docs)
Output:
[Document(metadata={'article_id': '4062', 'title': 'Kolkata'},
page_content='Kolkata served as the capital of India during the British Raj
until 1911.')]
These strategies help refine the retrieval process by focusing on relevant content. The `LLMChainExtractor` extracts only the necessary parts of documents, while the `LLMChainFilter` decides which documents to keep. Both methods enhance the quality of the information retrieved, making it more relevant to the user’s query.
Pre-trained embedding models are a good starting point, but fine-tuning them on your own data can greatly improve retrieval.
Choosing the Right Models: For specialized fields like medicine, select models pre-trained on relevant data. For example, you can use the MedCPT family of query and document encoders pre-trained on a large scale of 255M query-article pairs from PubMed search logs.
Fine-Tuning with Positive and Negative Pairs: Collect your own data and create pairs of similar (positive) and dissimilar (negative) examples. Fine-tune the model to understand these differences. This helps the model learn domain-specific relationships, improving retrieval.
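As a minimal sketch of fine-tuning with positive pairs using the sentence-transformers library (the base model, example pairs, and hyperparameters are illustrative assumptions; with MultipleNegativesRankingLoss, the other examples in each batch serve as the negatives):

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
# Start from a general-purpose pre-trained embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")
# Positive pairs: texts that should end up close together in embedding space
train_examples = [
    InputExample(texts=["pediatric asthma treatment", "managing asthma in children"]),
    InputExample(texts=["type 2 diabetes diet", "nutrition guidance for T2D patients"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)
# Fine-tune so the model learns domain-specific notions of similarity
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("finetuned-domain-embeddings")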
These combined techniques create a strong retrieval system. This improves the relevance of objects given to the LLM, boosting generation quality.
Also read this: Training and Finetuning Embedding Models with Sentence Transformers v3
Finally, let’s discuss improving the generation quality of a Language Model (LLM). The goal is to give the LLM context that is as relevant to the prompt as possible. Irrelevant data can trigger hallucinations. Here are tips for better generation:
Autocut filters out irrelevant information retrieved from the database. This prevents the LLM from being misled.
from typing import List
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_core.documents import Document
from langchain_core.runnables import chain
# `docs` is the list of Documents to index
vectorstore = PineconeVectorStore.from_documents(
    docs, index_name="sample", embedding=OpenAIEmbeddings()
)
@chain
def retriever(query: str) -> List[Document]:
    # Attach each document's similarity score to its metadata before returning
    docs, scores = zip(*vectorstore.similarity_search_with_score(query))
    for doc, score in zip(docs, scores):
        doc.metadata["score"] = score
    return docs
result = retriever.invoke("dinosaur")
result
This code snippet uses LangChain and Pinecone to perform a similarity search. It embeds documents using OpenAI embeddings, stores them in a Pinecone vector store, and defines a retriever function. The retriever searches for documents similar to a given query (“dinosaur”), calculates similarity scores, and adds these scores to the document metadata before returning the results.
[Document(page_content='In her second book, Dr. Simmons delves deeper into
the ethical considerations surrounding AI development and deployment. It is
an eye-opening examination of the dilemmas faced by developers,
policymakers, and society at large.', metadata={}),
Document(page_content='A comprehensive analysis of the evolution of
artificial intelligence, from its inception to its future prospects. Dr.
Simmons covers ethical considerations, potentials, and threats posed by
AI.', metadata={}),
Document(page_content="In his follow-up to 'Symbiosis', Prof. Sterling takes
a look at the subtle, unnoticed presence and influence of AI in our everyday
lives. It reveals how AI has become woven into our routines, often without
our explicit realization.", metadata={}),
Document(page_content='Prof. Sterling explores the potential for harmonious
coexistence between humans and artificial intelligence. The book discusses
how AI can be integrated into society in a beneficial and non-disruptive
manner.', metadata={})]
We can see that the similarity score is attached to each document, so we can cut off results that fall below a chosen threshold.
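As a minimal illustration of the cutoff itself (the threshold value is an example, and this assumes higher scores mean more similar; some vector stores return distances where lower is better):

def autocut(docs, threshold: float = 0.75):
    # Keep only documents whose similarity score meets the threshold
    return [doc for doc in docs if doc.metadata.get("score", 0.0) >= threshold]

filtered = autocut(result, threshold=0.75)
print(f"Kept {len(filtered)} of {len(result)} retrieved documents")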
Reranking uses a more advanced model to re-evaluate and reorder the initially retrieved objects. This improves the quality of the final retrieved set.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(temperature=0)
# `retriever` is an existing base retriever (e.g., from a vector store) and
# `pretty_print_docs` is a small helper that prints documents readably
compressor = FlashrankRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Jackson Brown"
)
print([doc.metadata["id"] for doc in compressed_docs])
pretty_print_docs(compressed_docs)
This code snippet utilizes FlashrankRerank within a ContextualCompressionRetriever to improve the relevance of retrieved documents. It specifically reranks documents obtained by a base retriever (represented by retriever) based on their relevance to the query “What did the president say about Ketanji Jackson Brown”. Finally, it prints the document IDs and the compressed, reranked documents.
[0, 5, 3]
Document 1:
One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
----------------------------------------------------------------------------------------------------
Document 2:
He met the Ukrainian people.
From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.
Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland.
In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight.
----------------------------------------------------------------------------------------------------
Document 3:
And tonight, I’m announcing that the Justice Department will name a chief prosecutor for pandemic fraud.
By the end of this year, the deficit will be down to less than half what it was before I took office.
The only president ever to cut the deficit by more than one trillion dollars in a single year.
Lowering your costs also means demanding more competition.
I’m a capitalist, but capitalism without competition isn’t capitalism.
It’s exploitation—and it drives up prices.
The output shows that the retrieved chunks are reranked based on their relevance to the query.
Fine-tuning the LLM on domain-specific data greatly enhances its performance. For instance, Meditron 70B is a version of LLaMA 2 70B fine-tuned on medical data using both:
Unsupervised Fine-Tuning: Continue pre-training on a large collection of domain-specific text (e.g., PubMed literature).
Supervised Fine-Tuning: Further refine the model using supervised learning on domain-specific tasks (e.g., medical multiple-choice questions). This specialized training helps the model perform well in the target domain. It outperforms its base model and larger, less specialized models like GPT-3.5 on specific tasks.
This image depicts the process of fine-tuning on task-specific examples. This approach allows developers to specify desired outputs, encourage certain behaviors, and achieve better control over the model’s responses.
RAFT, or Retrieval-Augmented Fine-Tuning, is a method that improves how large language models (LLMs) work in specific fields. It helps these models use relevant information from documents to answer questions more accurately.
The RAFT training setup has a few key components: each training example pairs a question with a mix of relevant (“oracle”) documents and irrelevant (“distractor”) documents, and the target answer is trained to cite the evidence from the relevant documents.
By using this architecture, RAFT enhances the model’s ability to work in specific domains. It provides a reliable way to generate accurate and relevant responses.
The top-left figure depicts the approach of adapting LLMs to reading solutions from a set of positive and distractor documents in contrast to the standard RAG setup, where models are trained based on the retriever outputs, which is a mixture of both memorization and reading. At test time, all methods follow the standard RAG setting, provided with top-k retrieved documents in the context.
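To make the training setup concrete, here is an illustrative sketch of what a single RAFT-style training example might contain; the field names and content are assumptions for illustration, not the exact format from the RAFT paper:

# One RAFT-style training record: a question, the document that actually answers it
# ("oracle"), several unrelated "distractor" documents, and a target answer that
# cites the oracle evidence.
raft_example = {
    "question": "Which theorem shows that the consistency of arithmetic cannot be proved within arithmetic itself?",
    "oracle_documents": [
        "The Second Incompleteness Theorem shows that the consistency of arithmetic cannot be proved in arithmetic itself.",
    ],
    "distractor_documents": [
        "New Delhi is the capital of India and a union territory of the megacity of Delhi.",
        "Kolkata served as the capital of India during the British Raj until 1911.",
    ],
    "answer": "According to the quoted passage, it is Gödel's Second Incompleteness Theorem.",
}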
Improving retrieval and generation in RAG systems is essential for better AI applications. The techniques discussed range from low-effort, high-impact methods (query rewriting, reranking) to more intensive processes (embedding and LLM fine-tuning). The best technique depends on your application’s specific needs and limits. Advanced RAG techniques, when applied thoughtfully, allow developers to build more accurate, reliable, and context-aware AI systems capable of handling complex information needs.