Retrieval Augmented Generation (RAG) systems are revolutionizing how we interact with information, but they're only as good as the data they retrieve. Optimizing those retrieval results is where the reranker comes in. Think of it as a quality-control step for your search results, ensuring that only the most relevant information makes it into the final output.
This article explores the world of rerankers: why they matter, when you need them, their potential drawbacks, and the main types available. It will also guide you in selecting the best reranker for your specific RAG system and show you how to evaluate its performance.
A reranker is an important component of an information retrieval system, acting as a second-pass filter. An initial search (using methods like semantic or keyword search) returns a set of candidate documents, and the reranker reorders them by their relevance to the specific query, improving the quality of the final results. Because a reranker only scores a small candidate set, it can afford more complex matching techniques than the initial retrieval stage, striking a balance between speed and quality.
This image illustrates a two-stage search process. Reranking is the second stage, where an initial set of search results, based on semantic or keyword matching, is refined to significantly improve the relevance and ordering of the final results, delivering a more accurate and useful outcome for the user’s query.
Imagine your RAG system as a chef, and the retrieved documents as the ingredients. To create a delicious dish (an accurate answer), you need the best ingredients. But what if some of those ingredients are irrelevant or simply don't belong in the recipe? That's where rerankers help!
Here's why you need a reranker. Relying solely on embeddings for retrieval can be problematic: an embedding compresses the meaning of a query or document into a single fixed-size vector, which loses fine-grained detail and can generalize poorly to queries and domains outside the embedding model's training data. Rerankers excel where embeddings fall short: they score each query-document pair directly, capturing subtle relevance signals that a single-vector similarity comparison misses.
A query is used to search a vector database, retrieving the top 25 most relevant documents. These documents are then passed to a “Reranker” module. The reranker refines the results, selecting the top 3 most relevant documents for the final output.
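To make the flow concrete, here is a minimal sketch of this retrieve-then-rerank pattern. Note that vector_search and rerank_model are hypothetical stand-ins for your own retriever and scoring model:

def retrieve_and_rerank(query, vector_search, rerank_model, k_initial=25, k_final=3):
    # Stage 1: fast, recall-oriented search over the whole corpus
    candidates = vector_search(query, top_k=k_initial)
    # Stage 2: slower, precision-oriented scoring of each (query, document) pair
    scored = [(rerank_model(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k_final]]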
The world of rerankers is constantly evolving. Here’s a breakdown of the main types:
| Approach | Examples | Access Type | Performance Level | Cost Range |
|---|---|---|---|---|
| Cross Encoder | Sentence Transformers, Flashrank | Open-source | Excellent | Moderate |
| Multi-Vector | ColBERT | Open-source | Good | Low |
| Fine-tuned Large Language Model | RankZephyr, RankT5 | Open-source | Excellent | High |
| LLM as a Judge | GPT, Claude, Gemini | Proprietary | Top-tier | Very High |
| Rerank API | Cohere, Jina | Proprietary | Excellent | Moderate |
Cross-encoders score pairs of inputs, processing the query and a document together in a single pass. This joint encoding gives them a nuanced understanding of the relationship between the two, making them excellent for precise relevance scoring. However, they require significant computational resources, making them less suitable for real-time, first-stage retrieval.
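For intuition, here is a minimal sketch of what a cross-encoder does at its core, using the CrossEncoder class from the sentence-transformers library (the checkpoint name is one of the publicly available MS MARCO cross-encoders):

from sentence_transformers import CrossEncoder

# Load a pre-trained cross-encoder checkpoint
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do rerankers improve RAG?"
docs = [
    "Rerankers reorder retrieved documents by relevance to the query.",
    "Studio Ghibli was founded in 1985.",
]

# Each (query, document) pair is scored jointly in one forward pass
scores = model.predict([(query, doc) for doc in docs])
print(sorted(zip(scores, docs), reverse=True))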
Here is how to use a cross-encoder reranker (Flashrank) with LangChain:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)

# `retriever` is assumed to be a base retriever (e.g., a vector store retriever)
# created earlier in your pipeline
compressor = FlashrankRerank()
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
    "What did the president say about Ketanji Jackson Brown"
)
print([doc.metadata["id"] for doc in compressed_docs])
pretty_print_docs(compressed_docs)  # assumed helper that prints each document
This code snippet uses FlashrankRerank within a ContextualCompressionRetriever to improve the relevance of retrieved documents. It reranks the documents obtained by the base retriever according to their relevance to the query "What did the president say about Ketanji Jackson Brown", then prints the document IDs followed by the compressed, reranked documents.
Output:
[0, 5, 3]
Document 1:
One of the most serious constitutional responsibilities a President has is
nominating someone to serve on the United States Supreme Court.
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge
Ketanji Brown Jackson. One of our nation’s top legal minds, who will
continue Justice Breyer’s legacy of excellence.
----------------------------------------------------------------------------------------------------
Document 2:
He met the Ukrainian people.
From President Zelenskyy to every Ukrainian, their fearlessness, their
courage, their determination, inspires the world.
Groups of citizens blocking tanks with their bodies. Everyone from students
to retirees teachers turned soldiers defending their homeland.
In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight.
----------------------------------------------------------------------------------------------------
Document 3:
And tonight, I’m announcing that the Justice Department will name a chief prosecutor for pandemic fraud.
By the end of this year, the deficit will be down to less than half what it was before I took office.
The only president ever to cut the deficit by more than one trillion dollars in a single year.
Lowering your costs also means demanding more competition.
I’m a capitalist, but capitalism without competition isn’t capitalism.
It’s exploitation—and it drives up prices.
The output shows that the retrieved chunks have been reordered by their relevance to the query.
Multi-vector models like ColBERT use a late interaction approach: queries and documents are encoded independently, and their interaction happens later, at scoring time. This allows document representations to be pre-computed, leading to faster retrieval and lower computational demands.
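At the heart of ColBERT is the MaxSim operator. Here is a minimal sketch of that late-interaction scoring, assuming query_vecs and doc_vecs are L2-normalized per-token embedding matrices produced by the encoder:

import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    # Similarity of every query token to every document token
    sim = query_vecs @ doc_vecs.T  # shape: (n_query_tokens, n_doc_tokens)
    # Each query token keeps its best-matching document token;
    # summing those per-token maxima gives the document's relevance score
    return sim.max(axis=1).sum()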
Install the RAGatouille library to use the ColBERT reranker:
pip install -U ragatouille
Now set up the ColBERT reranker:
from ragatouille import RAGPretrainedModel
from langchain.retrievers import ContextualCompressionRetriever

# Load the pre-trained ColBERT v2 checkpoint
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# `retriever` is assumed to be a base retriever created earlier
compression_retriever = ContextualCompressionRetriever(
    base_compressor=RAG.as_langchain_document_compressor(), base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(
    "What animation studio did Miyazaki found"
)
print(compressed_docs[0])

Output:
Document(page_content='In June 1985, Miyazaki, Takahata, Tokuma and Suzuki
founded the animation production company Studio Ghibli, with funding from
Tokuma Shoten. Studio Ghibli\'s first film, Laputa: Castle in the Sky
(1986), employed the same production crew of Nausicaä. Miyazaki\'s designs
for the film\'s setting were inspired by Greek architecture and "European
urbanistic templates". Some of the architecture in the film was also
inspired by a Welsh mining town; Miyazaki witnessed the mining strike upon
his first', metadata={'relevance_score': 26.5194149017334})
Pre-trained LLMs do not inherently understand how to measure the relevance of a document to a query. Fine-tuning them on ranking datasets, such as the MS MARCO passage ranking dataset, substantially improves their ability to rank documents.
There are two main types of supervised rerankers, distinguished by their model structure: encoder-decoder models such as RankT5, which cast ranking as a sequence-to-sequence task, and decoder-only models such as RankZephyr, which are instruction-tuned to generate a ranked list directly.
By applying these fine-tuning techniques, we can make LLMs far more effective at understanding and prioritizing relevant documents.
First, install the RankLLM library:
pip install --upgrade --quiet rank_llm
Now set up RankZephyr:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_community.document_compressors.rankllm_rerank import RankLLMRerank

# `retriever` and `query` are assumed to be defined earlier in your pipeline
compressor = RankLLMRerank(top_n=3, model="zephyr")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

compressed_docs = compression_retriever.invoke(query)
pretty_print_docs(compressed_docs)

Output:
Document 1:
Together with our allies –we are right now enforcing powerful economic
sanctions.
We are cutting off Russia’s largest banks from the international financial
system.
Preventing Russia’s central bank from defending the Russian Ruble making
Putin’s $630 Billion “war fund” worthless.
We are choking off Russia’s access to technology that will sap its economic
strength and weaken its military for years to come.
----------------------------------------------------------------------------------------------------
Document 2:
And tonight I am announcing that we will join our allies in closing off
American air space to all Russian flights – further isolating Russia – and
adding an additional squeeze –on their economy. The Ruble has lost 30% of
its value.
The Russian stock market has lost 40% of its value and trading remains
suspended. Russia’s economy is reeling and Putin alone is to blame.
----------------------------------------------------------------------------------------------------
Document 3:
And now that he has acted the free world is holding him accountable.
Along with twenty-seven members of the European Union including France,
Germany, Italy, as well as countries like the United Kingdom, Canada, Japan,
Korea, Australia, New Zealand, and many others, even Switzerland.
We are inflicting pain on Russia and supporting the people of Ukraine. Putin
is now isolated from the world more than ever.
Together with our allies –we are right now enforcing powerful economic
sanctions.
Large language models can also rerank documents directly through prompting strategies: pointwise, listwise, and pairwise methods. These approaches use the LLM as a judge, leveraging its reasoning capabilities to assess the relevance of each document to a query. While they offer competitive effectiveness, the high computational cost and latency of LLM calls can be a barrier to practical use.
import openai

# Note: this example uses the legacy OpenAI SDK interface (openai<1.0)
# Set your OpenAI API key
openai.api_key = 'YOUR_API_KEY'

def pointwise_rerank(query, document):
    # Score a single (query, document) pair in isolation
    prompt = f"Rate the relevance of the following document to the query on a scale from 1 to 10:\n\nQuery: {query}\nDocument: {document}\n\nRelevance Score:"
    response = openai.ChatCompletion.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}]
    )
    return response['choices'][0]['message']['content'].strip()

def listwise_rerank(query, documents):
    # Rank documents in groups using a sliding window
    window_size = 5
    reranked_docs = []
    for i in range(0, len(documents), window_size):
        window = documents[i:i + window_size]
        prompt = f"Given the query, please rank the following documents:\n\nQuery: {query}\nDocuments: {', '.join(window)}\n\nRanked Document Identifiers:"
        response = openai.ChatCompletion.create(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
        ranked_ids = response['choices'][0]['message']['content'].strip().split(', ')
        reranked_docs.extend(ranked_ids)
    return reranked_docs

def pairwise_rerank(query, documents):
    # Compare every pair of documents and tally wins for each
    scores = {}
    for i in range(len(documents)):
        for j in range(i + 1, len(documents)):
            doc1 = documents[i]
            doc2 = documents[j]
            prompt = f"Which document is more relevant to the query?\n\nQuery: {query}\nDocument 1: {doc1}\nDocument 2: {doc2}\n\nAnswer with '1' for Document 1, '2' for Document 2:"
            response = openai.ChatCompletion.create(
                model="gpt-4-turbo",
                messages=[{"role": "user", "content": prompt}]
            )
            winner = response['choices'][0]['message']['content'].strip()
            if winner == '1':
                scores[doc1] = scores.get(doc1, 0) + 1
                scores[doc2] = scores.get(doc2, 0)
            elif winner == '2':
                scores[doc2] = scores.get(doc2, 0) + 1
                scores[doc1] = scores.get(doc1, 0)
    # Sort documents by their win counts, highest first
    ranked_docs = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc for doc, score in ranked_docs]

# Example usage
query = "What are the benefits of using LLMs for document reranking?"
documents = [
    "LLMs can process large amounts of text quickly.",
    "They require extensive fine-tuning for specific tasks.",
    "LLMs can generate human-like text responses.",
    "They are limited by their training data and may produce biased results."
]

# Pointwise reranking
for doc in documents:
    score = pointwise_rerank(query, doc)
    print(f"Document: {doc} - Relevance Score: {score}")

# Listwise reranking
reranked_listwise = listwise_rerank(query, documents)
print(f"Listwise Reranked Documents: {reranked_listwise}")

# Pairwise reranking
reranked_pairwise = pairwise_rerank(query, documents)
print(f"Pairwise Reranked Documents: {reranked_pairwise}")
Output:
Document: LLMs can process large amounts of text quickly. - Relevance Score:
8
Document: They require extensive fine-tuning for specific tasks. - Relevance
Score: 6
Document: LLMs can generate human-like text responses. - Relevance Score: 9
Document: They are limited by their training data and may produce biased
results. - Relevance Score: 5
Listwise Reranked Documents: ['LLMs can generate human-like text responses.',
'LLMs can process large amounts of text quickly.', 'They require extensive
fine-tuning for specific tasks.', 'They are limited by their training data
and may produce biased results.']
Pairwise Reranked Documents: ['LLMs can generate human-like text responses.',
'LLMs can process large amounts of text quickly.', 'They require extensive
fine-tuning for specific tasks.', 'They are limited by their training data
and may produce biased results.']
Private reranking APIs offer a convenient solution for organizations seeking to enhance search systems with semantic relevance without significant infrastructure investment. Companies like Cohere, Jina, and Mixedbread offer these services.
Install the Cohere library:
pip install --upgrade --quiet cohere
Set up Cohere and the ContextualCompressionRetriever:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.llms import Cohere
from langchain.chains import RetrievalQA

llm = Cohere(temperature=0)

# `retriever` is assumed to be a base retriever created earlier
compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

chain = RetrievalQA.from_chain_type(
    llm=llm, retriever=compression_retriever
)
print(chain.invoke({"query": "What did the president say about Ketanji Brown Jackson"}))
Output:
{'query': 'What did the president say about Ketanji Brown Jackson',
'result': " The president speaks highly of Ketanji Brown Jackson, stating
that she is one of the nation's top legal minds, and will continue the
legacy of excellence of Justice Breyer. The president also mentions that he
worked with her family and that she comes from a family of public school
educators and police officers. Since her nomination, she has received
support from various groups, including the Fraternal Order of Police and
judges from both major political parties. \n\nWould you like me to extract
another sentence from the provided text? "}
Selecting the optimal reranker for RAG requires careful evaluation of several factors: the relevance improvement it actually delivers on your data, its latency and computational cost, whether an open-source model or a proprietary API better fits your budget and privacy constraints, and how well it holds up out of domain.
Recent research has highlighted the effectiveness and efficiency of cross-encoders, especially when paired with strong retrievers. While in-domain performance differences might be subtle, out-of-domain scenarios reveal the significant impact a reranker can have. Cross-encoders have shown the ability to outperform most LLMs in reranking tasks (except for GPT-4 in some cases) while being more efficient.
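Whichever reranker you choose, measure it on your own queries with a standard ranking metric. Below is a minimal sketch of NDCG@k, assuming you have graded relevance labels for each query's candidate documents:

import math

def ndcg_at_k(relevances, k):
    # `relevances` holds the relevance grades of documents in ranked order
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the reranked order places documents with grades 3, 2, 0, 1 on top
print(round(ndcg_at_k([3, 2, 0, 1], k=4), 3))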
Choosing the right reranker is an important step in building an accurate RAG system. Understanding the different types of rerankers, along with their strengths, weaknesses, and costs, lets you match the right tool to your pipeline. Carefully selecting and evaluating your reranker can noticeably improve the accuracy and efficiency of your RAG applications, leading to better results and a more dependable system.