Retrieval-Augmented Generation (RAG) has taken the world by storm since its inception. RAG helps Large Language Models (LLMs) generate accurate and factual answers. It addresses the factuality problem by giving the LLM context that is semantically similar to the user query, so that the LLM can work with this context and generate a factually grounded response. We do this by representing our data and the user query as vector embeddings and computing the cosine similarity between them. The problem is that traditional approaches represent each piece of data as a single embedding, which may not be ideal for a good retrieval system. In this guide, we will look at ColBERT, which can retrieve with better accuracy than traditional bi-encoder models.
This article was published as a part of the Data Science Blogathon.
Although LLMs can generate text that is both meaningful and grammatically correct, they suffer from a problem called hallucination. Hallucination is when an LLM confidently generates a wrong answer, that is, it makes up an answer in a way that leads us to believe it is true. This has been a major problem since LLMs were introduced, and it leads to incorrect and factually wrong answers. Retrieval-Augmented Generation was introduced to address it.
In RAG, we take a list of documents or document chunks and encode them into numerical representations called vector embeddings, where a single vector embedding represents a single chunk. These embeddings are stored in a database called a vector store. The models that encode the chunks into embeddings are called encoding models or bi-encoders. They are trained on a large corpus of data, which makes them powerful enough to encode a chunk of a document into a single vector embedding.
When a user sends a query to the LLM, we pass the query through the same encoder to produce a single vector embedding. This embedding is then used to compute a similarity score against the vector embeddings of the document chunks to find the most relevant chunk. The most relevant chunk, or a list of the most relevant chunks, is given to the LLM along with the user query. The LLM uses this extra contextual information to generate an answer that is aligned with the retrieved context. This helps keep the generated content factual and traceable back to its source if necessary.
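As a rough illustration of the retrieval step described above, the sketch below ranks a few chunks by cosine similarity against a query. The three-dimensional vectors are made up for the example (a real encoder produces vectors with hundreds of dimensions), and the function names are our own rather than part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=2):
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy embeddings (hypothetical values): one vector per chunk
chunk_vecs = [
    [0.9, 0.1, 0.0],  # chunk 0
    [0.1, 0.8, 0.3],  # chunk 1
    [0.7, 0.2, 0.1],  # chunk 2
]
query_vec = [1.0, 0.0, 0.0]

for idx, score in top_k(query_vec, chunk_vecs, k=2):
    print(f"chunk {idx}: {score:.3f}")
```

The chunks returned by a function like top_k are what would be handed to the LLM together with the user query.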
The problem with traditional encoder models like all-MiniLM, the OpenAI embedding model, and other encoder models is that they compress the entire text into a single vector embedding. These single-vector representations are useful because they allow efficient and quick retrieval of similar documents. However, the problem lies in the contextuality between the query and the document. A single vector embedding may not be sufficient to store the contextual information of a document chunk, thus creating an information bottleneck.
Imagine that 500 words are being compressed into a single vector of size 782. A single vector embedding may not be sufficient to represent such a chunk, which can give subpar retrieval results in many cases. The single-vector representation may also fail for complex queries or documents. One solution is to represent the document chunk or the query as a list of embedding vectors instead of a single embedding vector. This is where ColBERT comes in.
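To make the bottleneck concrete, here is a minimal sketch in which two different sets of token vectors collapse to the same single vector. It assumes mean pooling, which is one common way to produce a single embedding (whether a particular model pools this way is an assumption), and the two-dimensional token vectors are invented for the example.

```python
def mean_pool(token_vecs):
    """Collapse a list of token vectors into one vector by averaging each dimension."""
    n = len(token_vecs)
    return [sum(dim) / n for dim in zip(*token_vecs)]

# Hypothetical 2-dimensional token embeddings for two different chunks
chunk_a = [[1.0, 0.0], [0.0, 1.0]]  # two very different tokens
chunk_b = [[0.5, 0.5], [0.5, 0.5]]  # two identical tokens

pooled_a = mean_pool(chunk_a)
pooled_b = mean_pool(chunk_b)

# Both chunks end up with the same single-vector representation
print(pooled_a, pooled_b, pooled_a == pooled_b)
```

Once pooled, the two chunks are indistinguishable to the retriever, while their token-level lists remain different.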
ColBERT (Contextualized Late Interaction over BERT) is a bi-encoder that represents text as multi-vector embeddings. It takes a query or a document chunk (or a small document) and creates vector embeddings at the token level. That is, each token gets its own vector embedding, and the query or document is encoded into a list of token-level vector embeddings. The token-level embeddings are generated from a pre-trained BERT model, hence the BERT in the name.
These are then stored in the vector database. Now, when a query comes in, a list of token-level embeddings is created for it, and a matrix multiplication is performed between the query and each document, resulting in a matrix of similarity scores. The overall similarity is obtained by taking, for each query token, the maximum similarity across the document tokens and then summing these maxima. Writing $Q$ for the Query Tokens Matrix and $D$ for the Document Tokens Matrix, the formula is:
$$S(Q, D) = \sum_{i=1}^{N} \max_{1 \le j \le M} \left( Q D^{T} \right)_{ij}$$
In the above equation, we take the dot product between the Query Tokens Matrix (containing N token-level vector embeddings) and the transpose of the Document Tokens Matrix (containing M token-level vector embeddings), and then we take the maximum similarity across the document tokens for each query token. We then sum all these maximum similarities, which gives us the final similarity score between the document and the query. This produces effective and accurate retrieval because the interaction happens at the token level, which leaves room for more contextual understanding between the query and the document.
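The MaxSim scoring described above can be sketched in a few lines of plain Python. The token embeddings below are invented two-dimensional vectors, and maxsim_score is our own helper name. A real implementation would typically compute the full query-document matrix product on a GPU, but the nested loop here performs the same sum-of-maxima calculation.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def maxsim_score(query_tokens, doc_tokens):
    """For each query token take its best dot product with any document token, then sum."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

# Hypothetical token-level embeddings
query = [[1.0, 0.0], [0.0, 1.0]]              # N = 2 query tokens
doc_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # M = 3 document tokens
doc_b = [[0.6, 0.4], [0.5, 0.5]]              # M = 2 document tokens

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(f"doc_a: {score_a:.2f}, doc_b: {score_b:.2f}")
```

Here doc_a scores higher because it has a closely matching token for each query token, even though documents of different lengths are being compared.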
Because the document token embeddings are computed ahead of time and only the MaxSim (maximum similarity) operation is performed at query time, this is called a late interaction step. And because the token-level interactions give us more contextual information, the approach is described as contextualized late interaction. Hence the name Contextualized Late Interaction over BERT, or ColBERT. These computations can be performed in parallel, so they can be computed efficiently. One remaining concern is space: storing a list of token-level vector embeddings for every chunk requires a lot of it. ColBERTv2 addresses this by compressing the embeddings with a technique called residual compression, which reduces the space used.
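The general idea behind residual compression, as we understand it, is to store each token vector as the ID of its nearest centroid plus a coarsely quantized version of the leftover difference (the residual). The toy sketch below illustrates that idea only. The centroids, the four quantization levels, and all function names are hypothetical, and this is a simplification rather than the exact ColBERTv2 procedure.

```python
def nearest_centroid(vec, centroids):
    """Index of the centroid with the smallest squared distance to vec."""
    def sq_dist(c):
        return sum((x - y) ** 2 for x, y in zip(vec, c))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))

def compress(vec, centroids, levels):
    """Store a centroid id plus, per dimension, the index of the closest residual level."""
    cid = nearest_centroid(vec, centroids)
    residual = [x - c for x, c in zip(vec, centroids[cid])]
    codes = [min(range(len(levels)), key=lambda k: abs(r - levels[k])) for r in residual]
    return cid, codes

def decompress(cid, codes, centroids, levels):
    """Approximate the original vector as centroid + quantized residual."""
    return [c + levels[k] for c, k in zip(centroids[cid], codes)]

centroids = [[0.0, 0.0], [1.0, 1.0]]  # hypothetical; normally learned from the data
levels = [-0.3, -0.1, 0.1, 0.3]       # 4 levels, so 2 bits per dimension
vec = [0.9, 1.25]

cid, codes = compress(vec, centroids, levels)
approx = decompress(cid, codes, centroids, levels)
print(cid, codes, [round(x, 2) for x in approx])
```

Instead of one float per dimension, each vector is stored as one small integer for the centroid and a couple of bits per dimension, at the cost of a small reconstruction error.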
In this section, we will get hands-on with ColBERT and check how it performs against a regular embedding model.
We will start by installing the following libraries:
!pip install ragatouille langchain langchain_openai chromadb einops sentence-transformers tiktoken
In the next step, we will download the pre-trained ColBERT model. For this, the code will be:
from ragatouille import RAGPretrainedModel
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
Running the code above will instantiate a ColBERT RAG model. Now let’s download a Wikipedia page and perform retrieval from it. For this, the code will be:
from ragatouille.utils import get_wikipedia_page
document = get_wikipedia_page("Elon_Musk")
print("Character Count:", len(document))
print(document[:1000])
RAGatouille comes with a handy function called get_wikipedia_page, which takes a string and fetches the corresponding Wikipedia page. Here we download the Wikipedia content on Elon Musk and store it in the variable document. Since document is a string, len(document) gives the number of characters, so we print that along with the first 1,000 characters of the document.
The output shows that the Wikipedia page of Elon Musk contains a total of 64,668 characters.
Now we will create an index on this document.
RAG.index(
    # List of Documents
    collection=[document],
    # List of IDs for the above Documents
    document_ids=['elon_musk'],
    # List of Dictionaries for the metadata for the above Documents
    document_metadatas=[{"entity": "person", "source": "wikipedia"}],
    # Name of the index
    index_name="Elon2",
    # Chunk Size of the Document Chunks
    max_document_length=256,
    # Whether to Split Document or Not
    split_documents=True
)
Here we call the .index() method of RAG to index our document. To this, we pass the list of documents, a list of IDs for those documents, a list of metadata dictionaries for them, the name of the index, the maximum length of each document chunk, and whether to split the documents.
Running the code above will split our document into chunks of up to 256 tokens each, embed them with the ColBERT model, which produces a list of token-level vector embeddings for each chunk, and finally store them in an index. This step takes a bit of time to run and can be accelerated with a GPU. Finally, it creates a directory where our index is stored. Here the directory will be “.ragatouille/colbert/indexes/Elon2”.
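To give a feel for what fixed-size chunking means, here is a deliberately simple sketch that splits text into windows of at most 256 whitespace-separated words. This is not how RAGatouille splits documents internally (it works with model tokens rather than words). The helper name chunk_by_tokens and the sample text are our own.

```python
def chunk_by_tokens(text, max_tokens=256):
    """Split text into consecutive chunks of at most max_tokens whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + max_tokens]) for i in range(0, len(tokens), max_tokens)]

# A made-up 600-word text
sample = "word " * 600
chunks = chunk_by_tokens(sample, max_tokens=256)
print([len(c.split()) for c in chunks])
```

Every chunk except possibly the last is full, and each chunk then gets its own list of token-level embeddings in the index.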
Now, we will begin the search. For this, the code will be:
results = RAG.search(query="What companies did Elon Musk find?", k=3, index_name='Elon2')
for i, doc in enumerate(results):
    print(f"---------------------------------- doc-{i} ------------------------------------")
    print(doc["content"])
Running the code will produce the following results:
In the output, we can see that the first and last documents entirely cover the different companies founded by Elon Musk. ColBERT was able to correctly retrieve the relevant chunks needed to answer the query.
Now let’s go a step further and ask it a specific question.
results = RAG.search(
    query="How much Tesla stocks did Elon sold in Decemeber 2022?",
    k=3,
    index_name='Elon2',
)
for i, doc in enumerate(results):
    print(f"---------------------------------- doc-{i} ------------------------------------")
    print(doc["content"])
In the above code, we ask a very specific question about how much Tesla stock Elon sold in December 2022. In the output, doc-1 contains the answer to the question: Elon sold $3.6 billion worth of his Tesla stock. Again, ColBERT was able to retrieve the relevant chunk for the given query.
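To close the RAG loop, the retrieved chunks would be passed to an LLM together with the question. The sketch below shows one plausible way to assemble such a prompt from results shaped like the ones printed above, relying only on the "content" key. The function name build_prompt, the prompt wording, and the stand-in results are our own assumptions.

```python
def build_prompt(query, results):
    """Join retrieved chunks into a numbered context block followed by the question."""
    context = "\n\n".join(f"[{i}] {doc['content']}" for i, doc in enumerate(results))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

# Stand-in results with the same "content" key used above (text is invented)
mock_results = [
    {"content": "First retrieved chunk."},
    {"content": "Second retrieved chunk."},
]
print(build_prompt("Example question?", mock_results))
```

Numbering the chunks makes it easier to trace a generated answer back to the chunk it came from.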
Let’s now try the same questions with other embedding models, both open-source and closed-source:
from langchain_community.embeddings import HuggingFaceEmbeddings
from transformers import AutoModel
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
model_name = "jinaai/jina-embeddings-v2-base-en"
model_kwargs = {'device': 'cpu'}
embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
)
Running this code will download and load the Jina embedding model so that we can work with it.
Now, we need to split our document, create embeddings from the splits, and store them in the Chroma vector store. For this, we work with the following code:
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=256,
    chunk_overlap=0)
splits = text_splitter.split_text(document)
vectorstore = Chroma.from_texts(texts=splits,
                                embedding=embeddings,
                                collection_name="elon")
retriever = vectorstore.as_retriever(search_kwargs={'k': 3})
Running this code will take our document, split it into smaller chunks of up to 256 tokens each, embed these chunks with the Jina embedding model, and store the resulting embedding vectors in the Chroma vector store.
Finally, we create a retriever from it. Now we will perform a vector search and check the results.
docs = retriever.get_relevant_documents("What companies did Elon Musk find?")
for i, doc in enumerate(docs):
    print(f"---------------------------------- doc-{i} ------------------------------------")
    print(doc.page_content)
We can clearly spot the difference between Jina, an embedding model that represents each chunk as a single vector embedding, and ColBERT, which represents each chunk as a list of token-level embedding vectors. ColBERT clearly outperforms in this case.
Now let’s try using a closed-source embedding model like the OpenAI Embedding model.
import os
os.environ["OPENAI_API_KEY"] = "Your API Key"
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4",
    chunk_size=256,
    chunk_overlap=0,
)
splits = text_splitter.split_text(document)
vectorstore = Chroma.from_texts(texts=splits,
                                embedding=embeddings,
                                collection_name="elon_collection")
retriever = vectorstore.as_retriever(search_kwargs={'k': 3})
Here the code is very similar to the one that we have just written.
Running this code will again take our document, split it into smaller chunks of up to 256 tokens, embed each chunk into a single vector embedding with the OpenAI embedding model, and finally store these embeddings in the Chroma vector store. Now let’s try to retrieve the documents relevant to the other question.
docs = retriever.get_relevant_documents("How much Tesla stocks did Elon sold in Decemeber 2022?")
for i, doc in enumerate(docs):
    print(f"---------------------------------- doc-{i} ------------------------------------")
    print(doc.page_content)
Even here we can see a clear difference between the single-vector and the multi-vector embedding representations. The multi-vector representation captures the complex query better, which results in more accurate retrieval.
In conclusion, ColBERT demonstrates a significant advancement in retrieval performance over traditional bi-encoder models by representing text as multi-vector embeddings at the token level. This approach allows for more nuanced contextual understanding between queries and documents, leading to more accurate retrieval results and mitigating the issue of hallucinations commonly observed in LLMs.
A. Traditional bi-encoders compress entire texts into single vector embeddings, potentially losing contextual information. This limits their effectiveness in retrieval tasks, especially with complex queries or documents.
A. ColBERT (Contextualized Late Interaction over BERT) is a bi-encoder model that represents text using token-level vector embeddings. It allows for more nuanced contextual understanding between queries and documents, improving retrieval accuracy.
A. ColBERT generates token-level embeddings for queries and documents, performs matrix multiplication to calculate similarity scores, and then selects the most relevant information based on maximum similarity across tokens. This allows for effective retrieval with contextual understanding.
A. ColBERTv2 optimizes space through the residual compression method, reducing the storage requirements for token-level embeddings while maintaining retrieval accuracy.
A. You can use libraries like RAGatouille to work with ColBERT easily. By indexing documents and queries, you can perform efficient retrieval tasks and generate accurate answers aligned with the context.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.