Top 13 Advanced RAG Techniques for Your Next Project

Harsh Mishra | Last Updated: 02 Apr, 2025
17 min read

Can AI generate truly relevant answers at scale? How do we make sure it understands complex, multi-turn conversations? And how do we keep it from confidently spitting out incorrect facts? These are the kinds of challenges that modern AI systems face, especially those built using RAG. RAG combines the power of document retrieval with the fluency of language generation, allowing systems to answer questions with context-aware, grounded responses. While basic RAG systems work well for straightforward tasks, they often stumble with complex queries, hallucinations, and context retention across longer interactions. That’s where advanced RAG techniques come in.

In this blog, we’ll explore how to level up your RAG pipelines, enhancing each stage of the stack: Indexing, Retrieval, and Generation. We’ll walk through powerful methods (with hands-on code) that can help improve relevance, reduce noise, and scale your system’s performance—whether you’re building a healthcare assistant, an educational tutor, or an enterprise knowledge bot.

Where Does Basic RAG Fall Short?

Let’s look at the Basic RAG framework:

Basic RAG architecture (Source: Dipanjan Sarkar)

This RAG architecture shows how chunk embeddings are stored in the vector store. The first step is to load the documents, then split them into chunks using various chunking techniques, and then embed each chunk with an embedding model so that its meaning can be captured and searched efficiently.

This image depicts the retrieval and generation steps of RAG: the user asks a question, the system searches the vector store and extracts the most relevant chunks, and the retrieved content is passed to the LLM along with the question, which then produces a structured, grounded output.

Basic RAG systems have clear limitations, especially in demanding situations.

  • Hallucinations: A major problem is hallucination. The model creates content that is factually wrong or not supported by the source documents. This hurts reliability, particularly in fields like medicine or law where precision is critical.
  • Lack of Domain Specificity: Standard RAG models struggle with specialized topics. Without adapting the retrieval and generation processes to the specific details of a domain, the system risks finding irrelevant or inaccurate information.
  • Complex Conversations: Basic RAG systems have trouble with complex queries or multi-turn conversations. They often lose the context across interactions. This leads to disconnected or incomplete answers. RAG systems must handle increasing query complexity.

Hence, we’ll go through advanced techniques for each part of the RAG stack: Indexing, Retrieval, and Generation. We’ll discuss improvements using open-source libraries and resources. These advanced RAG techniques apply generally, whether you are building a healthcare chatbot, an educational tutor, or another application, and they will improve most RAG systems.

Let’s begin with the Advanced RAG Techniques!

Indexing and Chunking: Building a Strong Foundation

Good indexing is essential for any RAG system. The first step involves how we bring in, break up, and store data. Let’s explore methods for indexing data, focusing on chunking text and using metadata.

1. HNSW: Hierarchical Navigable Small Worlds

Hierarchical Navigable Small Worlds (HNSW) is an effective algorithm for finding similar items in large datasets. It helps in quickly locating approximate nearest neighbors (ANN) by using a structured approach based on graphs.

  • Proximity Graph: HNSW builds a graph where each point connects to nearby points. This structure allows for efficient searching.
  • Hierarchical Structure: The algorithm organizes points into multiple layers. The top layer connects distant points, while lower layers connect closer points. This setup speeds up the search process.
  • Greedy Routing: HNSW uses a greedy method to find neighbors. It starts at a high-level point and moves to the nearest neighbor until it reaches a local minimum. This method reduces the time needed to find similar items.

How does HNSW work?

The working of HNSW includes several key components:

  1. Input Layer: Each data point is represented as a vector in a high-dimensional space.
  2. Graph Construction:
    • Nodes are added to the graph one at a time.
    • Each node is assigned to a layer based on a probability function. This function decides how likely a node is to be placed in a higher layer.
    • The algorithm balances the number of connections and the speed of searches.
  3. Search Process:
    • The search starts at a specific entry point in the top layer.
    • The algorithm moves to the nearest neighbor at each step.
    • Once it reaches a local minimum, it shifts to the next lower layer and continues searching until it finds the closest point in the bottom layer.
  4. Parameters:
    • M: The number of neighbors connected to each node.
    • efConstruction: This parameter affects how many neighbors the algorithm considers when building the graph.
    • efSearch: This parameter influences the search process, determining how many neighbors to evaluate.

HNSW’s design allows it to find similar items quickly and accurately. This makes it a strong choice for tasks that require efficient searches in large datasets.

HNSW graph search (Source: Link)

The image depicts a simplified HNSW search: starting at the “entry point” (blue), the algorithm navigates the graph towards the “query vector” (yellow). The “nearest neighbor” (striped) is identified by traversing edges based on proximity. This illustrates the core concept of navigating a graph for efficient approximate nearest neighbor search.

Hands-on HNSW

Follow these steps to implement the Hierarchical Navigable Small Worlds (HNSW) algorithm with FAISS. This guide includes example outputs to illustrate the process.

Step 1: Set Up HNSW Parameters

First, define the parameters for the HNSW index. You need to specify the size of the vectors and the number of neighbors for each node.

import faiss
import numpy as np
# Set up HNSW parameters
d = 128  # Size of the vectors
M = 32   # Number of neighbors for each node

Step 2: Initialize the HNSW Index

Create the HNSW index using the parameters defined above.

# Initialize the HNSW index
index = faiss.IndexHNSWFlat(d, M)

Step 3: Set efConstruction

Before adding data to the index, set the `efConstruction` parameter. This parameter controls how many neighbors the algorithm considers when building the index.

efConstruction = 200  # Example value for efConstruction
index.hnsw.efConstruction = efConstruction

Step 4: Generate Sample Data

For this example, generate random data to index. Here, `xb` represents the dataset you want to index.

# Generate random dataset of vectors
n = 10000  # Number of vectors to index
xb = np.random.random((n, d)).astype('float32')
# Add data to the index
index.add(xb)  # Build the index

Step 5: Set efSearch

After building the index, set the `efSearch` parameter. This parameter affects the search process.

efSearch = 100  # Example value for efSearch
index.hnsw.efSearch = efSearch

Step 6: Perform the Search

Now you can search for the nearest neighbors of your query vectors. Here, `xq` represents the query vectors.

# Generate random query vectors
nq = 5  # Number of query vectors
xq = np.random.random((nq, d)).astype('float32')
# Perform a search for the top k nearest neighbors
k = 5  # Number of nearest neighbors to retrieve
distances, indices = index.search(xq, k)
# Output the results
print("Query Vectors:\n", xq)
print("\nNearest Neighbors Indices:\n", indices)
print("\nNearest Neighbors Distances:\n", distances)

Output

Query Vectors:
 [[0.12345678 0.23456789 ... 0.98765432]
 [0.23456789 0.34567890 ... 0.87654321]
 [0.34567890 0.45678901 ... 0.76543210]
 [0.45678901 0.56789012 ... 0.65432109]
 [0.56789012 0.67890123 ... 0.54321098]]

Nearest Neighbors Indices:
 [[ 123  456  789  101  112]
 [ 234  567  890  123  134]
 [ 345  678  901  234  245]
 [ 456  789  012  345  356]
 [ 567  890  123  456  467]]

Nearest Neighbors Distances:
 [[0.123 0.234 0.345 0.456 0.567]
 [0.234 0.345 0.456 0.567 0.678]
 [0.345 0.456 0.567 0.678 0.789]
 [0.456 0.567 0.678 0.789 0.890]
 [0.567 0.678 0.789 0.890 0.901]]

2. Semantic Chunking

This approach divides text based on meaning, not just fixed sizes. Each chunk represents a coherent piece of information. We calculate the cosine distance between sentence embeddings. If two sentences are semantically similar (below a threshold), they go in the same chunk. This creates chunks of different lengths based on the content’s meaning. 

  • Pros: Creates more coherent and meaningful chunks, improving retrieval. 
  • Cons: Requires more computation (using a BERT-based encoder).

Hands-on Semantic Chunking

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

# document is the full text you want to split (assumed to be loaded earlier)
text_splitter = SemanticChunker(OpenAIEmbeddings())
docs = text_splitter.create_documents([document])
print(docs[0].page_content)

This code utilizes SemanticChunker from LangChain, which splits a document into semantically related chunks using OpenAI embeddings. Each resulting chunk aims to capture a coherent semantic unit rather than an arbitrary text segment.

3. Language Model-Based Chunking

This advanced method uses a language model to create complete statements from text. Each chunk is semantically whole. A language model (e.g., a 7-billion parameter model) processes the text. It breaks it into statements that make sense on their own. The model then combines these into chunks, balancing completeness and context. This method is computationally heavy but offers high accuracy. 

  • Pros: Adapts to the nuances of the text and creates high-quality chunks. 
  • Cons: Computationally expensive; may need fine-tuning for specific uses.

Hands-on Language Model-Based Chunking

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()  # assumes OPENAI_API_KEY is set in the environment

async def generate_contexts(document, chunks):
    async def process_chunk(chunk):
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Generate a brief context explaining how this chunk relates to the full document."},
                {"role": "user", "content": f"<document> \n{document} \n</document> \nHere is the chunk we want to situate within the whole document \n<chunk> \n{chunk} \n</chunk> \nPlease give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."}
            ],
            temperature=0.3,
            max_tokens=100
        )
        context = response.choices[0].message.content
        return f"{context} {chunk}"
    # Process all chunks concurrently
    contextual_chunks = await asyncio.gather(
        *[process_chunk(chunk) for chunk in chunks]
    )
    return contextual_chunks

This code snippet utilizes an LLM (likely OpenAI’s GPT-4o via the client.chat.completions.create call) to generate contextual information for each chunk of a document. It processes each chunk asynchronously, prompting the LLM to explain how the chunk relates to the full document. Finally, it returns a list of the original chunks prepended with their generated context, effectively enriching them for improved search retrieval.

4. Leveraging Metadata: Adding Context

Adding and Filtering with Metadata

Metadata provides extra context. This improves retrieval accuracy. By including metadata like dates, patient age, and pre-existing conditions, you can filter out irrelevant information during searches. Filtering narrows the search, making retrieval more efficient and relevant. When indexing, store metadata alongside the text. 

For example, healthcare records can include age, visit date, and specific conditions as metadata. Use this metadata to filter search results so the system retrieves only relevant information. For instance, if a query relates to children, filter out records of patients over 18. This reduces noise and improves relevance.

Example

Chunk #1

Source Metadata:  {'id': 'doc:1c6f3e3f7ee14027bc856822871572dc:26e9aac7d5494208a56ff0c6cbbfda20', 'source': 'https://plato.stanford.edu/entries/goedel/'}

Source Text:

2.2.1 The First Incompleteness Theorem

In his Logical Journey (Wang 1996) Hao Wang published the

full text of material Gödel had written (at Wang’s request)

about his discovery of the incompleteness theorems. This material had

formed the basis of Wang’s “Some Facts about Kurt

Gödel,” and was read and approved by Gödel:

Chunk #2

Source Metadata:  {'id': 'doc:1c6f3e3f7ee14027bc856822871572dc:d15f62c453c64072b768e136080cb5ba', 'source': 'https://plato.stanford.edu/entries/goedel/'}

Source Text:

The First Incompleteness Theorem provides a counterexample to

completeness by exhibiting an arithmetic statement which is neither

provable nor refutable in Peano arithmetic, though true in the

standard model. The Second Incompleteness Theorem shows that the

consistency of arithmetic cannot be proved in arithmetic itself. Thus

Gödel’s theorems demonstrated the infeasibility of the

Hilbert program, if it is to be characterized by those particular

desiderata, consistency and completeness.

Here, we can see that the metadata contains the unique ID and source of each chunk, which provides more context and makes retrieval easier.
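To put such metadata to work at query time, most vector stores let you pass a metadata filter alongside the semantic search. Below is a minimal sketch using a Chroma vector store in LangChain; the patient records, field names, and the age threshold are purely illustrative.

from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

# Hypothetical patient records with metadata attached at indexing time
records = [
    Document(page_content="Patient reports mild asthma symptoms.",
             metadata={"age": 10, "visit_date": "2024-03-01", "condition": "asthma"}),
    Document(page_content="Patient reports chronic lower back pain.",
             metadata={"age": 45, "visit_date": "2024-02-15", "condition": "back pain"}),
]

vectorstore = Chroma.from_documents(records, embedding=OpenAIEmbeddings())

# Restrict the search to paediatric records (age < 18) before ranking by similarity
retriever = vectorstore.as_retriever(
    search_kwargs={"k": 2, "filter": {"age": {"$lt": 18}}}
)
docs = retriever.invoke("children with breathing problems")

Because the filter is applied inside the vector store, records of patients over 18 never reach the LLM, which reduces noise in the final answer.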

5. Using GLiNER to Generate Metadata

You won’t always have rich metadata, but a model like GLiNER can generate it on the fly! GLiNER tags and labels chunks during ingestion to create metadata.

Implementation

  • Give GLiNER each chunk along with the tags you want it to identify. If a tag is found with enough confidence, it is applied as a label; if no match is confident, no tags are produced.
  • This works well in general, but may need fine-tuning for niche datasets. It improves retrieval accuracy but adds a processing step.
  • GLiNER can also parse incoming queries and match them against metadata labels for filtering (see the sketch below).

GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer (an online demo is available).
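Below is a minimal sketch of generating metadata with GLiNER during ingestion; the checkpoint name, tags, and threshold are illustrative choices, not requirements.

from gliner import GLiNER

# Load a pretrained GLiNER checkpoint (any available checkpoint works)
model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")

chunk = "Patient, aged 12, was seen on 2024-03-01 for an asthma follow-up."
labels = ["age", "date", "medical condition"]

# Tag the chunk; matches below the confidence threshold are simply dropped
entities = model.predict_entities(chunk, labels, threshold=0.5)

# Turn the predictions into metadata stored alongside the chunk
metadata = {entity["label"]: entity["text"] for entity in entities}
print(metadata)

The same model can be run over incoming queries so that the extracted labels can be matched against this metadata for filtering.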

These techniques build a strong RAG system. They enable efficient retrieval from large datasets. The choice of chunking and metadata use depends on your dataset’s specific needs and features.

Retrieval: Finding the Right Information

Now, let’s focus on the “R” in RAG. How can we improve retrieval from a vector database? This is about retrieving all documents relevant to a query. This greatly increases the chances the LLM can produce high-quality results. Here are several techniques:

6. Hybrid Search

Hybrid search combines vector search (which captures semantic meaning) with keyword search (which finds exact matches), using the strengths of both. In AI, many terms are specific keywords: algorithm names, technology terms, LLM names. A vector search alone might miss these; keyword search ensures these important terms are considered. Combining both methods creates a more complete retrieval process. Both searches run at the same time.

Results are merged and ranked using a weighting system. For example, using Weaviate, you adjust the alpha parameter to balance vector and keyword results. This creates a combined, ranked list. 

  • Pros: Balances precision and recall, improving retrieval quality. 
  • Cons: Requires careful tuning of weights.

Hands-on Hybrid Search

from langchain_community.retrievers import WeaviateHybridSearchRetriever
from langchain_core.documents import Document
retriever = WeaviateHybridSearchRetriever(
   client=client,
   index_name="LangChain",
   text_key="text",
   attributes=[],
   create_schema_if_missing=True,
)
retriever.invoke("the ethical implications of AI")

This code initializes a WeaviateHybridSearchRetriever for retrieving documents from a Weaviate vector database. It combines vector search and keyword search within Weaviate’s hybrid retrieval capabilities. Finally, it executes the query “the ethical implications of AI” to retrieve relevant documents using this hybrid approach.

7. Query Rewriting

Query rewriting recognizes that human queries may not be optimal for databases or language models. Using a language model to rewrite queries significantly improves retrieval.

  1. Rewriting for Vector Databases: This transforms the user’s initial query into a database-friendly format. For example, “what are AI agents and why they are the next big thing in 2025” might become “AI agents big thing 2025”. We can use any LLM to rewrite the query so that it captures its important aspects (see the sketch after this list).
  2. Prompt Rewriting for Language Models: This involves automatically creating prompts to optimize interaction with the language model, improving the quality and accuracy of results. We can use frameworks like DSPy, or any LLM, to rewrite the prompt. These rewritten queries and prompts ensure the search process retrieves relevant documents and the language model is prompted effectively.
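Below is a minimal sketch of rewriting a user query for a vector database with an LLM; the prompt wording is illustrative and can be adapted to your domain.

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

user_query = "what are AI agents and why they are the next big thing in 2025"

# Ask the LLM to compress the question into a keyword-focused search query
rewrite_prompt = (
    "Rewrite the following user question as a short, keyword-focused search query "
    "for a vector database. Answer with the rewritten query only.\n\n"
    f"Question: {user_query}"
)

rewritten_query = llm.invoke(rewrite_prompt).content
print(rewritten_query)  # e.g. something like "AI agents next big thing 2025"

The rewritten query is then embedded and sent to the vector store in place of the raw question.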

Multi Query Retrieval

Retrieval can yield different results based on slight changes in how a query is worded. If the embeddings do not accurately reflect the meaning of the data, this issue can become more pronounced. To address these challenges, prompt engineering or tuning is often used, but this process can be time-consuming.

The MultiQueryRetriever simplifies this task. It uses a large language model (LLM) to create multiple queries from different angles based on a single user input. For each generated query, it retrieves a set of relevant documents. By combining the unique results from all queries, the MultiQueryRetriever provides a broader set of potentially relevant documents. This approach enhances the chances of finding useful information without the need for extensive manual tuning.

import logging

from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_openai import ChatOpenAI

chatgpt = ChatOpenAI(model_name="gpt-4o", temperature=0)

# Set logging so we can see what queries are generated by the LLM
logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

# chroma_db3 is an existing Chroma vector store (assumed to be built earlier)
similarity_retriever3 = chroma_db3.as_retriever(search_type="similarity",
                                                search_kwargs={"k": 2})

mq_retriever = MultiQueryRetriever.from_llm(
    retriever=similarity_retriever3, llm=chatgpt,
    include_original=True
)

query = "what is the capital of India?"
docs = mq_retriever.invoke(query)
docs

This code sets up a multi-query retrieval system using LangChain. It generates multiple variations of the input query (“what is the capital of India?”). These variations are then used to query a Chroma vector database (chroma_db3) via a similarity retriever, aiming to broaden the search and capture diverse relevant documents. The MultiQueryRetriever ultimately aggregates and returns the retrieved documents.

Output

[Document(metadata={'article_id': '5117', 'title': 'New Delhi'},
page_content='New Delhi () is the capital of India and a union territory of
the megacity of Delhi. It has a very old history and is home to several
monuments where the city is expensive to live in. In traditional Indian
geography it falls under the North Indian zone. The city has an area of
about 42.7\xa0km. New Delhi has a population of about 9.4 Million people.'),

 Document(metadata={'article_id': '4062', 'title': 'Kolkata'},
page_content="Kolkata (spelled Calcutta before 1 January 2001) is the
capital city of the Indian state of West Bengal. It is the second largest
city in India after Mumbai. It is on the east bank of the River Hooghly.
When it is called Calcutta, it includes the suburbs. This makes it the third
largest city of India. This also makes it the world's 8th largest
metropolitan area as defined by the United Nations. Kolkata served as the
capital of India during the British Raj until 1911. Kolkata was once the
center of industry and education. However, it has witnessed political
violence and economic problems since 1954. Since 2000, Kolkata has grown due
to economic growth. Like other metropolitan cities in India, Kolkata
struggles with poverty, pollution and traffic congestion."),

 Document(metadata={'article_id': '22215', 'title': 'States and union
territories of India'}, page_content='The Republic of India is divided into
twenty-eight States,and eight union territories including the National
Capital Territory.')]

8. LLM Prompt-based Contextual Compression Retrieval

Context compression helps improve the relevance of retrieved documents. This can occur in two main ways:

  1. Extracting Relevant Content: Remove parts of the retrieved documents that do not relate to the query. This means keeping only the sections that answer the question.
  2. Filtering Irrelevant Documents: Exclude documents that do not relate to the query, without altering the content of the documents themselves.

To achieve this, we can use the LLMChainExtractor, which reviews the initially returned documents and extracts only the relevant content for the query. It may also drop completely irrelevant documents.

Here is how to implement this using LangChain:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

# Initialize the language model
chatgpt = ChatOpenAI(model_name="gpt-4o", temperature=0)

# Set up a similarity retriever
similarity_retriever = chroma_db3.as_retriever(search_type="similarity", search_kwargs={"k": 3})

# Create the extractor to get relevant content
compressor = LLMChainExtractor.from_llm(llm=chatgpt)

# Combine the retriever and the extractor
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=similarity_retriever)

# Example query
query = "What is the capital of India?"
docs = compression_retriever.invoke(query)
print(docs)

Output:

[Document(metadata={'article_id': '5117', 'title': 'New Delhi'},
page_content='New Delhi is the capital of India and a union territory of the
megacity of Delhi.')]

For a different query:

query = "What is the old capital of India?"
docs = compression_retriever.invoke(query)
print(docs)

Output

[Document(metadata={'article_id': '4062', 'title': 'Kolkata'},
page_content='Kolkata served as the capital of India during the British Raj
until 1911.')]

The `LLMChainFilter` offers a simpler but effective way to filter documents. It uses an LLM chain to decide which documents to keep and which to discard without changing the content of the documents.

Here’s how to implement the filter:

from langchain.retrievers.document_compressors import LLMChainFilter
# Set up the filter
_filter = LLMChainFilter.from_llm(llm=chatgpt)
# Combine the retriever and the filter
compression_retriever = ContextualCompressionRetriever(base_compressor=_filter, base_retriever=similarity_retriever)

# Example query
query = "What is the capital of India?"
docs = compression_retriever.invoke(query)
print(docs)

Output

[Document(metadata={'article_id': '5117', 'title': 'New Delhi'},
page_content='New Delhi is the capital of India and a union territory of the
megacity of Delhi.')]

For another query:

query = "What is the old capital of India?"
docs = compression_retriever.invoke(query)
print(docs)

Output:

[Document(metadata={'article_id': '4062', 'title': 'Kolkata'},
page_content='Kolkata served as the capital of India during the British Raj
until 1911.')]

These strategies help refine the retrieval process by focusing on relevant content. The `LLMChainExtractor` extracts only the necessary parts of documents, while the `LLMChainFilter` decides which documents to keep. Both methods enhance the quality of the information retrieved, making it more relevant to the user’s query.

9. Fine-Tuning Embedding Models

Pre-trained embedding models are a good start. Fine-tuning these models on your data greatly improves retrieval.

Choosing the Right Models: For specialized fields like medicine, select models pre-trained on relevant data. For example, you can use the MedCPT family of query and document encoders, pre-trained at large scale on 255M query-article pairs from PubMed search logs.

Fine-Tuning with Positive and Negative Pairs: Collect your own data and create pairs of similar (positive) and dissimilar (negative) examples. Fine-tune the model to understand these differences. This helps the model learn domain-specific relationships, improving retrieval. 

  • Pros: Improves retrieval performance. 
  • Cons: Requires carefully created training data.
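Below is a minimal sketch of fine-tuning an embedding model on positive and negative pairs with the sentence-transformers library; the base model, example pairs, and hyperparameters are illustrative only.

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Start from a general-purpose pretrained model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Positive pair (label 1.0) and negative pair (label 0.0) drawn from your own domain data
train_examples = [
    InputExample(texts=["chest pain on exertion", "angina pectoris symptoms"], label=1.0),
    InputExample(texts=["chest pain on exertion", "knee replacement recovery"], label=0.0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.ContrastiveLoss(model)

# A short training pass; real fine-tuning needs many more pairs and epochs
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)

After training, the model embeds domain-specific queries and documents closer together, which directly improves retrieval.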

These combined techniques create a strong retrieval system. This improves the relevance of objects given to the LLM, boosting generation quality.

Also read this: Training and Finetuning Embedding Models with Sentence Transformers v3

Generation: Crafting High-Quality Responses

Finally, let’s discuss improving the generation quality of a Language Model (LLM). The goal is to give the LLM context that is as relevant to the prompt as possible. Irrelevant data can trigger hallucinations. Here are tips for better generation:

10. Autocut to Remove Irrelevant Information 

Autocut filters out irrelevant information retrieved from the database. This prevents the LLM from being misled.

  • Retrieve and Score Similarity: When a query is made, multiple objects are retrieved with similarity scores.
  • Identify and Cut Off: Use the similarity scores to find a cutoff point where scores drop significantly. Exclude objects beyond this point. This ensures that only the most relevant information is given to the LLM. For example, if you retrieve six objects, scores might drop sharply after the fourth. By looking at the rate of change, you can determine which objects to exclude. A minimal sketch of this cutoff logic follows the hands-on example below.

Hands-on Autocut

from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from typing import List
from langchain_core.documents import Document
from langchain_core.runnables import chain

# docs is a list of Documents to index (assumed to be prepared earlier)
vectorstore = PineconeVectorStore.from_documents(
    docs, index_name="sample", embedding=OpenAIEmbeddings()
)

@chain
def retriever(query: str):
    docs, scores = zip(*vectorstore.similarity_search_with_score(query))
    for doc, score in zip(docs, scores):
        doc.metadata["score"] = score
    return docs

result = retriever.invoke("dinosaur")
result

This code snippet uses LangChain and Pinecone to perform a similarity search. It embeds documents using OpenAI embeddings, stores them in a Pinecone vector store, and defines a retriever function. The retriever searches for documents similar to a given query (“dinosaur”), calculates similarity scores, and adds these scores to the document metadata before returning the results.

Output

[Document(page_content='In her second book, Dr. Simmons delves deeper into
the ethical considerations surrounding AI development and deployment. It is
an eye-opening examination of the dilemmas faced by developers,
policymakers, and society at large.', metadata={}),

 Document(page_content='A comprehensive analysis of the evolution of
artificial intelligence, from its inception to its future prospects. Dr.
Simmons covers ethical considerations, potentials, and threats posed by
AI.', metadata={}),

 Document(page_content="In his follow-up to 'Symbiosis', Prof. Sterling takes
a look at the subtle, unnoticed presence and influence of AI in our everyday
lives. It reveals how AI has become woven into our routines, often without
our explicit realization.", metadata={}),

 Document(page_content='Prof. Sterling explores the potential for harmonious
coexistence between humans and artificial intelligence. The book discusses
how AI can be integrated into society in a beneficial and non-disruptive
manner.', metadata={})]

We can see that the retriever also returns similarity scores; with these, we can cut off results below a threshold or at the point where the scores drop sharply.
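Below is a minimal sketch of the cutoff logic itself, building on the scores attached above; it assumes higher scores mean more similar (flip the comparison if your store returns distances), and the drop threshold is an illustrative value.

def autocut(docs, min_drop=0.2):
    # Keep documents up to the first sharp drop in similarity score
    scores = [doc.metadata["score"] for doc in docs]
    cutoff = len(docs)
    for i in range(1, len(scores)):
        if scores[i - 1] - scores[i] > min_drop:  # large gap between consecutive scores
            cutoff = i
            break
    return docs[:cutoff]

relevant_docs = autocut(result)

Only the documents before the sharp drop are passed to the LLM, keeping the context focused on the most relevant material.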

11. Reranking Retrieved Objects

Reranking uses a more advanced model to re-evaluate and reorder the initially retrieved objects. This improves the quality of the final retrieved set.

  • Overfetch: Initially retrieve more objects than needed.
  • Apply Ranker Model: Use a high-latency model (typically a cross encoder) to re-evaluate relevance. This model considers the query and each object pairwise to reassess similarity.
  • Reorder Results: Based on the new assessment, reorder the objects. Put the most relevant results at the top. This ensures that the most relevant documents are prioritized, improving the data given to the LLM.

Hands-on Reranking Retrieved Objects

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)

# retriever is an existing base retriever (e.g. a vector store retriever built earlier)
compressor = FlashrankRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Jackson Brown"
)
print([doc.metadata["id"] for doc in compressed_docs])
pretty_print_docs(compressed_docs)  # small helper (assumed defined) that prints each document

This code snippet utilizes FlashrankRerank within a ContextualCompressionRetriever to improve the relevance of retrieved documents. It specifically reranks documents obtained by a base retriever (represented by retriever) based on their relevance to the query “What did the president say about Ketanji Jackson Brown”. Finally, it prints the document IDs and the compressed, reranked documents.

Output

[0, 5, 3]

Document 1:

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.

----------------------------------------------------------------------------------------------------

Document 2:

He met the Ukrainian people.

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.

Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland.

In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight.

----------------------------------------------------------------------------------------------------

Document 3:

And tonight, I’m announcing that the Justice Department will name a chief prosecutor for pandemic fraud.

By the end of this year, the deficit will be down to less than half what it was before I took office.  

The only president ever to cut the deficit by more than one trillion dollars in a single year.

Lowering your costs also means demanding more competition.

I’m a capitalist, but capitalism without competition isn’t capitalism.

It’s exploitation—and it drives up prices.

The output shows that the retrieved chunks are reranked based on their relevance to the query.

12. Fine-Tuning the LLM

Fine-tuning the LLM on domain-specific data greatly enhances its performance. For instance, take Meditron 70B, a fine-tuned version of Llama 2 70B for medical data, trained using both:

  • Unsupervised Fine-Tuning: Continue pre-training on a large collection of domain-specific text (e.g., PubMed literature).
  • Supervised Fine-Tuning: Further refine the model with supervised learning on domain-specific tasks (e.g., medical multiple-choice questions).

This specialized training helps the model perform well in the target domain. It outperforms its base model and larger, less specialized models like GPT-3.5 on specific tasks.

Fine-tuning on task-specific examples (Source: Link)

This image depicts the process of fine-tuning on task-specific examples. This approach allows developers to specify desired outputs, encourage certain behaviors, and achieve better control over the model’s responses.

13. Using RAFT: Adapting Language Model to Domain-Specific RAG

RAFT, or Retrieval-Augmented Fine-Tuning, is a method that improves how large language models (LLMs) work in specific fields. It helps these models use relevant information from documents to answer questions more accurately.

  • Retrieval-Augmented Fine Tuning: RAFT combines fine-tuning with retrieval methods. This allows the model to learn from both useful and less useful documents during training.
  • Chain-of-Thought Reasoning: The model generates answers that show its reasoning process. This helps it provide clear and accurate responses based on the documents it retrieves.
  • Dynamic Document Handling: RAFT trains the model to find and use the most relevant documents while ignoring those that do not help answer the question.

Architecture of RAFT

The RAFT architecture includes several key components:

  1. Input Layer: The model takes in a question (Q) and a set of retrieved documents (D), which include both relevant and irrelevant documents.
  2. Processing Layer:
    • The model analyzes the input to find important information in the documents.
    • It creates an answer (A*) that refers to the relevant documents.
  3. Output Layer: The model produces the final answer based on the relevant documents while disregarding the irrelevant ones.
  4. Training Mechanism: During training, some data includes both relevant and irrelevant documents, while other data includes only irrelevant ones. This setup encourages the model to focus on context rather than memorization.
  5. Evaluation: The model’s performance is assessed based on its ability to answer questions accurately using the retrieved documents.

By using this architecture, RAFT enhances the model’s ability to work in specific domains. It provides a reliable way to generate accurate and relevant responses.

Source: Link

The top-left figure depicts the approach of adapting LLMs to reading solutions from a set of positive and distractor documents in contrast to the standard RAG setup, where models are trained based on the retriever outputs, which is a mixture of both memorization and reading. At test time, all methods follow the standard RAG setting, provided with top-k retrieved documents in the context.
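Below is a minimal sketch of how a RAFT-style training example can be assembled; the field names, oracle probability, and documents are illustrative and do not follow a fixed RAFT schema.

import json
import random

def build_raft_example(question, oracle_docs, distractor_docs, cot_answer, p_oracle=0.8):
    # With probability p_oracle keep the relevant (oracle) documents in the context;
    # otherwise train on distractors only, which discourages memorization.
    if random.random() < p_oracle:
        context = oracle_docs + distractor_docs
    else:
        context = list(distractor_docs)
    random.shuffle(context)
    return {
        "question": question,
        "context": context,
        "answer": cot_answer,  # chain-of-thought answer that cites the relevant documents
    }

example = build_raft_example(
    question="What does the First Incompleteness Theorem show?",
    oracle_docs=["The First Incompleteness Theorem exhibits an arithmetic statement that is neither provable nor refutable in Peano arithmetic."],
    distractor_docs=["New Delhi is the capital of India.", "HNSW builds a proximity graph for fast search."],
    cot_answer="Reasoning: the relevant passage states the theorem exhibits a true but unprovable statement. Answer: it shows Peano arithmetic is incomplete.",
)
print(json.dumps(example, indent=2))

A dataset of such examples is then used for supervised fine-tuning, so the model learns to read from relevant documents and ignore distractors.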

Conclusion

Improving retrieval and generation in RAG systems is essential for better AI applications. The techniques discussed range from low-effort, high-impact methods (query rewriting, reranking) to more intensive processes (embedding and LLM fine-tuning). The best technique depends on your application’s specific needs and limits. Advanced RAG techniques, when applied thoughtfully, allow developers to build more accurate, reliable, and context-aware AI systems capable of handling complex information needs.

Harsh Mishra is an AI/ML Engineer who spends more time talking to Large Language Models than actual humans. Passionate about GenAI, NLP, and making machines smarter (so they don’t replace him just yet). When not optimizing models, he’s probably optimizing his coffee intake. 🚀☕
