Retrieval Augmented Generation (RAG) is quickly replacing traditional search-based approaches, enabling a "chat with your documents" experience. Since its inception, various methods have been proposed to enhance the standard RAG pipeline. The biggest hurdle in RAG is retrieving the right documents: only when the right documents are retrieved can the LLM generate the right answers. In this guide, we will discuss HyDE (Hypothetical Document Embeddings), an approach created to improve retrieval in RAG.
Retrieval Augmented Generation is very popular and widely used today. A simple RAG (Retrieval Augmented Generation) pipeline involves taking in raw text, chunking it into smaller pieces, creating embeddings for all the chunks, and storing those embeddings in a vector store. When a user provides a query, we compare its similarity to the stored chunks and retrieve the most similar ones. Finally, the user query, along with the similar chunks, is sent to the Large Language Model to generate the final answer. This is regular Retrieval Augmented Generation.
This regular, plain Retrieval Augmented Generation has many flaws, starting with the chunking itself. There is no single right chunk size: it largely depends on the type of Large Language Model we are working with, and we often have to try several sizes to get better results. Then comes retrieval, the main focus of this guide.
RAG was developed to prevent Large Language Models from hallucinating. Its effectiveness largely depends on retrieving information relevant to the user query from the vector store. If the retrieval is poor, the Large Language Model will either hallucinate or fail to answer the user's question. One way to improve retrieval is Hypothetical Document Embeddings.
Hypothetical Document Embeddings (HyDE) is one approach to tackle the poor retrieval faced in RAG-based solutions. As the name suggests, HyDE works by generating hypothetical documents that help retrieve more relevant chunks, so that the Large Language Model can take these as input and generate a better answer.
Let’s understand HyDE with the below diagram:
The first step is taking in a user query. In a normal RAG system, we convert the user query into embeddings and send it to the vector store to retrieve similar chunks. With Hypothetical Document Embeddings, however, we first pass the user query to a Large Language Model and ask it to generate a hypothetical answer to the question. The LLM thus produces a fake, hypothetical answer/document whose textual patterns resemble a real answer. We then convert this hypothetical document into embedding vectors and use those embeddings to retrieve similar chunks from the vector store. Finally, we attach the retrieved chunks to the original query and pass them together to the LLM to generate the final answer.
In other words, instead of measuring query-to-answer embedding similarity, we measure answer-to-answer embedding similarity, which tends to yield better retrieval results.
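Conceptually, the whole flow fits in a few lines of pseudocode. The sketch below is only illustrative: generate, embed, and search are placeholder methods for whatever LLM, embedding model, and vector store you use, not real library calls.
# Illustrative sketch of the HyDE flow (placeholder functions, not a real API)
def hyde_rag(user_query, llm, embedder, vector_store, top_k=4):
    # 1. Ask the LLM for a hypothetical answer to the query
    hypothetical_doc = llm.generate(f"Write a short hypothetical answer to: {user_query}")
    # 2. Embed the hypothetical answer instead of the raw query
    query_vector = embedder.embed(hypothetical_doc)
    # 3. Retrieve the chunks closest to the hypothetical answer
    chunks = vector_store.search(query_vector, k=top_k)
    # 4. Answer the original query grounded in the retrieved chunks
    context = "\n\n".join(chunks)
    return llm.generate(f"Context:\n{context}\n\nQuestion: {user_query}")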
In this section, we will build Hypothetical Document Embeddings from scratch and see how well the approach retrieves relevant content. Along with that, we will also look at a LangChain implementation of Hypothetical Document Embeddings.
We will start by downloading and installing the Python libraries:
pip install -q langchain langchain-community langchain-google-genai sentence-transformers chromadb beautifulsoup4
We install the following libraries:
- langchain and langchain-community: the core framework plus the community integrations (WebBaseLoader, the Chroma wrapper) used to build the RAG pipeline
- langchain-google-genai: LangChain integrations for Google's Gemini chat and embedding models
- sentence-transformers: embedding model support
- chromadb: the Chroma vector store that holds the chunk embeddings
- beautifulsoup4: used by WebBaseLoader to parse the fetched web page
Let us implement HyDE by following certain steps:
Let us start by loading the LLM and the embedding models. For this, we will work with the below code:
# --- Setting API KEY ---
import os
os.environ['GOOGLE_API_KEY']='YOUR GOOGLE API KEY'
# --- Model Loading ---
# Import the necessary modules from the langchain_google_genai package.
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_google_genai import ChatGoogleGenerativeAI
# Create a ChatGoogleGenerativeAI object; convert_system_message_to_human passes system messages as human messages (Gemini has no system role).
llm = ChatGoogleGenerativeAI(model="gemini-pro", convert_system_message_to_human=True)
# Create a GoogleGenerativeAIEmbeddings object for embedding our Prompts and documents
Embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
You can visit this link to get your free API Key. After getting the API Key, paste it into the above code in place of "YOUR GOOGLE API KEY".
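If you would rather not hardcode the key in the script, one alternative (a small sketch using Python's standard getpass module) is to prompt for it at runtime:
# Optional: read the API key at runtime instead of hardcoding it
import os
from getpass import getpass
if 'GOOGLE_API_KEY' not in os.environ:
    os.environ['GOOGLE_API_KEY'] = getpass('Enter your Google API Key: ')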
The first step in a general Retrieval Augmented Generation pipeline is data loading. Here is the LangChain code to fetch and load data from the given URL.
# --- Data Loading ---
# Import the WebBaseLoader class from the langchain_community.document_loaders module.
from langchain_community.document_loaders import WebBaseLoader
# Create a WebBaseLoader object with the URL of the blog post to load.
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/")
# Load the blog post and store the documents in the `docs` variable.
docs = loader.load()
After running the above code, the variable docs will contain the documents retrieved from the given web URL. After loading the data, we need to chunk them into smaller pieces so that we can extract/retrieve only relevant data when necessary. To perform this, we will be working with the below code.
Let us now split the data and create chunks.
# --- Splitting / Creating Chunks ---
# Import the RecursiveCharacterTextSplitter class from the
# langchain.text_splitter module.
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Create a RecursiveCharacterTextSplitter object using the provided
# chunk size and overlap.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300,
                                               chunk_overlap=50)
# Split the documents in the `docs` variable into smaller chunks and
# store the resulting splits in the `splits` variable.
splits = text_splitter.split_documents(docs)
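As an optional sanity check (not part of the original pipeline), we can print how many chunks were produced and preview the first one; the exact count depends on the page content at the time you run this:
# Optional: inspect the number of chunks and preview the first one
print(len(splits))
print(splits[0].page_content[:200])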
Now that we have loaded our documents and chunked them, the next step is to store these chunks in a vector store so that we can retrieve them later.
The code for this will be:
# --- Creating and Storing Embeddings in the Vector Store ---
from langchain_community.vectorstores import Chroma
# Passing the Google embeddings to create and store the chunk embeddings
vectorstore = Chroma.from_documents(documents=splits,
                                    collection_name='my-collection',
                                    embedding=Embeddings)
# Creating Retriever
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 4})
We are finally done with the data loading, preprocessing, and storing steps. Next, we will create a Prompt Template for generating hypothetical documents from user queries. The code for this can be found below:
# Importing the Prompt Template
from langchain.prompts import ChatPromptTemplate
# Creating the Prompt Template
template = """For the given question try to generate a hypothetical answer\
Only generate the answer and nothing else:
Question: {question}
"""
Prompt = ChatPromptTemplate.from_template(template)
query = Prompt.format(question = 'What are different Chain of Thought(CoT) Prompting?')
hypothetical_answer = llm.invoke(query).content
print(hypothetical_answer)
Running the above will result in a Hypothetical Document/Answer generated by the Large Language Model based on the given user query.
We can see that based on the user query, the Large Language Model has generated a possible answer i.e. a Hypothetical Document. Now let’s try to retrieve documents from our vector store that are relevant to this Hypothetical Answer/Document.
# retrieval with hypothetical answer/document
similar_docs = retriever.get_relevant_documents(hypothetical_answer)
for doc in similar_docs:
    print(doc.page_content)
    print()
After running the code, below we can see the relevant documents retrieved.
We can see that all four retrieved chunks are closely related to the original query asked by the user. The first three chunks in particular contain ample information for the Large Language Model to generate the answer. Now let us try retrieving the relevant documents with the plain query itself. The code for this will be:
# retrieval with original query
similar_docs = retriever.get_relevant_documents('What are different Chain of Thought(CoT) Prompting?')
for doc in similar_docs:
    print(doc.page_content)
    print()
Output:
Types of CoT Prompts#
Two main types of CoT Prompting:
Chain-of-thought (CoT) prompting (Wei et al. 2022) generates a
sequence of short sentences to describe reasoning logics step by
step, known as reasoning chains or rationales, to eventually lead to
the final answer. The benefit of CoT is more pronounced for complicated
reasoning tasks, while using
Chain-of-Thought (CoT)#
Table of Contents
Basic Prompting
Zero-Shot
Few-shot
Tips for Example Selection
Tips for Example Ordering
Instruction Prompting
Self-Consistency Sampling
Chain-of-Thought (CoT)
Types of CoT Prompts
Tips and Extensions
Automatic Prompt Design
Augmented Language Models
Here we can see that the documents retrieved with the plain query contain less in-depth information than those retrieved with the Hypothetical Document Embeddings approach. So let us pass the documents retrieved the HyDE way to the LLM and see the output it generates.
# Creating the Prompt Template
template = """Answer the following question based on this context:
{context}
Question: {question}
"""
Prompt = ChatPromptTemplate.from_template(template)
# Creating a function to format the retrieved docs
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

# Re-retrieve with the hypothetical answer so we pass the HyDE-retrieved chunks
# (the previous retrieval overwrote similar_docs with the plain-query results)
similar_docs = retriever.get_relevant_documents(hypothetical_answer)
formatted_docs = format_docs(similar_docs)
Query_Prompt = Prompt.format(context=formatted_docs,
                             question="What are different Chain of Thought(CoT) Prompting?")
print(Query_Prompt)
response = llm.invoke(Query_Prompt)
print(response.content)
After running the code, the Large Language Model generated the following response to the user query.
We can see that the LLM has taken in the documents retrieved via the hypothetical answer and generated a correct answer to the user question without hallucinating. This is the manual way of performing Hypothetical Document Embeddings: we do it from scratch by defining a prompt to create a hypothetical answer and then performing a similarity search between this answer and the document chunks.
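Before moving on, if you want to compare the two retrieval strategies more systematically, here is a quick sketch. The helper compare_retrievals is hypothetical (not part of the original walkthrough) and reuses the llm and retriever objects defined above, with an inline prompt that mirrors the hypothetical-answer prompt from earlier.
# Sketch: compare chunks retrieved for the raw query vs. the hypothetical answer
def compare_retrievals(question, llm, retriever):
    hypothetical = llm.invoke(
        f"For the given question try to generate a hypothetical answer. "
        f"Only generate the answer and nothing else:\nQuestion: {question}"
    ).content
    plain_docs = retriever.get_relevant_documents(question)
    hyde_docs = retriever.get_relevant_documents(hypothetical)
    plain_set = {doc.page_content for doc in plain_docs}
    hyde_set = {doc.page_content for doc in hyde_docs}
    overlap = len(plain_set & hyde_set)
    print(f"Chunks retrieved by both strategies: {overlap} of {len(hyde_set)}")
    return plain_docs, hyde_docs

plain_docs, hyde_docs = compare_retrievals(
    'What are different Chain of Thought(CoT) Prompting?', llm, retriever)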
Luckily, LangChain comes with a predefined class for HyDE. Let us take a look at it in the code below:
from langchain_google_genai import GoogleGenerativeAI
from langchain_google_genai import GoogleGenerativeAIEmbeddings
llm = GoogleGenerativeAI(model="gemini-pro")
Embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
from langchain_community.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/")
docs = loader.load()
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300,chunk_overlap=50)
splits = text_splitter.split_documents(docs)
from langchain.chains import HypotheticalDocumentEmbedder
hyde_embeddings = HypotheticalDocumentEmbedder.from_llm(llm,
                                                        Embeddings,
                                                        prompt_key="web_search")
from langchain_community.vectorstores import Chroma
vectorstore = Chroma.from_documents(documents=splits,
                                    collection_name='collection-1',
                                    embedding=hyde_embeddings)
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 4})
from langchain.schema.runnable import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import PromptTemplate, ChatPromptTemplate
template = """Answer the following question based on this context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
response = rag_chain.invoke("What are different Chain of Thought(CoT) prompting?")
print(response)
The code is the same up to the point where we chunk the documents downloaded from the web; the difference is that we now wrap the LLM and embedding model in a HypotheticalDocumentEmbedder and pass that to the vector store.
Now we can just call the rag_chain's invoke() function and pass it the question. The rag_chain takes care of creating the hypothetical answer from the provided query, creating embedding vectors for it, and retrieving similar chunks from the vector store. It then formats these chunks into the Prompt Template and passes the final prompt to the Large Language Model, which generates an answer based on the retrieved chunks and the user query.
Below is the output generated after running this code:
We can see that the answer generated by the LLM is similar to the answer we got when doing Hypothetical Document Embeddings from scratch. Do note, however, that in our tests this built-in HyDE did not always produce results as good as the from-scratch approach, so it is better to evaluate both before settling on one. That said, the HypotheticalDocumentEmbedder takes care of the boilerplate so we can start building RAG applications quickly.
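If you want to verify what the HyDE-backed retriever is actually returning, a small optional check (a sketch, assuming the retriever built on hyde_embeddings above) is to call it directly; because the vector store was created with hyde_embeddings, the query is routed through the hypothetical-document step before the similarity search:
# Optional: inspect which chunks the HyDE-backed retriever returns for a query
docs = retriever.get_relevant_documents("What are different Chain of Thought(CoT) prompting?")
for doc in docs:
    print(doc.page_content[:150])
    print("---")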
In this guide, we delved into Hypothetical Document Embeddings (HyDE), a strategy to improve retrieval accuracy in Retrieval Augmented Generation (RAG) systems. By leveraging HyDE, we aimed to overcome a key limitation of traditional RAG: accurately retrieving the documents relevant to a query. Through the guide and the practical implementation of HyDE using LangChain, we explored its potential to improve retrieval accuracy and reduce hallucinations, thereby contributing to more reliable and contextually relevant responses from Large Language Models (LLMs). By understanding the intricacies of HyDE and its practical application, we can pave the way for more efficient and effective RAG systems.
Q1. What is Retrieval Augmented Generation (RAG)?
A. RAG is a framework for generating text by combining retrieval and generation. It retrieves relevant information from a document store based on a user query and then uses that information to generate a response. However, traditional RAG can struggle if the retrieved information isn't a good match for the query.
Q2. What problem does HyDE solve?
A. The biggest hurdle in RAG is retrieving the right documents. Traditional RAG relies on matching the raw user query, which can be inaccurate. HyDE addresses this by creating "hypothetical documents" based on the user query; these hypothetical documents are then used to retrieve more relevant information from the document store.
Q3. How is HyDE implemented in this guide?
A. The guide implements HyDE using the LangChain library. It includes generating hypothetical documents from the user query, embedding them, and retrieving relevant chunks from the vector store based on those embeddings.
Q4. What are the limitations of HyDE?
A. The quality of the generated hypothetical documents can affect retrieval accuracy, and HyDE needs extra computational resources (an additional LLM call per query) compared to traditional RAG.
Q5. Does LangChain have built-in support for HyDE?
A. Yes. LangChain provides a built-in class called HypotheticalDocumentEmbedder that simplifies the HyDE process. This class handles generating hypothetical documents, embedding them, and retrieving relevant chunks.