The intersection of artificial intelligence and data processing has evolved significantly with the rise of multimodal Retrieval-Augmented Generation (RAG) systems. Unlike traditional RAG models that focus only on text, multimodal RAG integrates text, images, audio, and video, allowing for more nuanced, context-aware responses. A key innovation here is Nomic vision embeddings, which place visual and textual data in a single unified embedding space so that the two formats can interact seamlessly. By using these high-quality embeddings, multimodal RAG improves information retrieval, bridges the gap between different content forms, and delivers richer, more informative user experiences.
Multimodal RAG represents a significant advancement in artificial intelligence. It builds upon traditional RAG systems by incorporating diverse data types such as text, images, audio, and video. Unlike conventional RAG systems that primarily process textual information, multimodal RAG is designed to handle and integrate multiple forms of data simultaneously, which allows it to produce more comprehensive, context-aware responses across different modalities.
Key Components of Multimodal RAG
A significant innovation in this field is the incorporation of Nomic vision embeddings, which create a cohesive embedding space for both visual and textual data.
Nomic Embed Vision v1 and v1.5 are high-quality vision embedding models developed by Nomic AI, designed to share the same latent space as their corresponding text embedding models, Nomic Embed Text v1 and v1.5. Because the vision and text encoders operate in the same space, they are well suited for multimodal tasks such as text-to-image retrieval. With a vision encoder of only 92M parameters, Nomic Embed Vision is suitable for high-volume production applications and complements the 137M-parameter Nomic Embed Text.
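To make the idea of a shared latent space concrete, the short sketch below embeds a caption with the text model and an image with the vision model, then compares them with cosine similarity. This is only a minimal illustration, assuming the Hugging Face checkpoints used later in this tutorial, simple mean pooling, and a placeholder image path; matching image-caption pairs should score higher than unrelated ones.
# Minimal illustration of the shared latent space (assumes the nomic-ai checkpoints
# used later in this tutorial; the pooling choice and image path are illustrative).
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoTokenizer, AutoModel, AutoProcessor

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
text_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
vision_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5")

def embed_text(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = text_model(**inputs)
    return out.last_hidden_state.mean(dim=1)  # mean pooling over tokens

def embed_image(path):
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        out = vision_model(**inputs)
    return out.last_hidden_state.mean(dim=1)  # mean pooling over patches

# "chart.png" is a placeholder; replace it with any local image file.
score = F.cosine_similarity(embed_text("a bar chart of coffee revenue"), embed_image("chart.png"))
print(float(score))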
CLIP Models Struggle with Unimodal Tasks
Multimodal models such as CLIP demonstrate remarkable zero-shot capabilities across different modalities. However, CLIP’s text encoders struggle with tasks beyond image retrieval, as seen in benchmarks like MTEB, which evaluates the effectiveness of text embedding models. Nomic Embed Vision aims to address these limitations by aligning a vision encoder with the existing Nomic Embed Text latent space.
To tackle the issue of underperformance on unimodal tasks, such as semantic similarity, Nomic Embed Vision, a vision encoder, was trained alongside Nomic Embed Text, a long-context text encoder. The training method involved freezing the text encoder and training the vision encoder on image-text pairs. This approach not only produced optimal results but also ensured backward compatibility with the embeddings from Nomic Embed Text.
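Conceptually, that recipe can be sketched as follows. The function below is a simplified, illustrative contrastive alignment step in which the encoders, optimizer, and batch are placeholders; it assumes a standard CLIP-style objective and is not Nomic's actual training code.
# Illustrative sketch of the alignment recipe described above: the text encoder is
# frozen and only the vision encoder is updated on image-text pairs.
import torch
import torch.nn.functional as F

def alignment_step(vision_encoder, text_encoder, optimizer, images, captions, temperature=0.07):
    with torch.no_grad():  # frozen text tower preserves the existing Nomic Embed Text space
        text_emb = F.normalize(text_encoder(captions), dim=-1)
    image_emb = F.normalize(vision_encoder(images), dim=-1)
    logits = image_emb @ text_emb.T / temperature   # pairwise similarities
    targets = torch.arange(len(images))             # matching pairs lie on the diagonal
    loss = F.cross_entropy(logits, targets)         # pull images toward their paired captions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()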
Aligning the vision encoder with the existing Nomic Embed Text latent space produces a unified multimodal space that delivers strong performance on image, text, and multimodal tasks, as demonstrated by the ImageNet zero-shot, MTEB, and Datacomp benchmarks.
In this tutorial, we will build a multimodal RAG system that can efficiently retrieve information from a PDF containing both textual and visual content. We will build it on Google Colab using a T4 GPU (free tier).
Install all required Python libraries, including OpenAI, Qdrant, Transformers, Torch, and PyMuPDF.
!pip install openai==1.55.3 httpx==0.27.2
!pip install qdrant_client
!pip install transformers
!pip install transformers torch pillow
!pip install --upgrade nltk
!pip install sentence-transformers
!pip install --upgrade qdrant-client fastembed Pillow
!pip install PyMuPDF
Set up the OpenAI API key and import essential libraries like PyMuPDF, PIL, LangChain, and OpenAI.
import os
import uuid
import base64

import fitz  # PyMuPDF
import numpy as np
import torch
from PIL import Image

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

os.environ["OPENAI_API_KEY"] = ''
Extract the images from every page of the PDF using PyMuPDF and save them to an output folder.
# Images
def extract_images_from_pdf(pdf_path, output_folder):
    """Extracts all images from a PDF and saves them to the output folder."""
    pdf_document = fitz.open(pdf_path)
    os.makedirs(output_folder, exist_ok=True)
    # Iterate through the pages in the PDF
    for page_number in range(len(pdf_document)):
        page = pdf_document[page_number]
        # Get the images embedded in this page
        images = page.get_images(full=True)
        for image_index, img in enumerate(images):
            xref = img[0]
            base_image = pdf_document.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            image_filename = f"page_{page_number+1}_image_{image_index+1}.{image_ext}"
            image_path = os.path.join(output_folder, image_filename)
            with open(image_path, "wb") as image_file:
                image_file.write(image_bytes)
    pdf_document.close()
Use PyMuPDF to extract text from all pages of the PDF and store it in a list.
def extract_text_pdf(path):
    """Extracts text from a PDF using PyMuPDF."""
    doc = fitz.open(path)
    text_results = []
    for page in doc:
        text = page.get_text()
        text_results.append(text)
    return text_results
Save images in the “test” directory and extract text for further processing.
def get_contents(pdf_path, output_directory):
    """Extracts text and images from a PDF, saves the images, and returns the page texts."""
    extract_images_from_pdf(pdf_path, output_directory)
    text_results = extract_text_pdf(pdf_path)
    return text_results

pdf_path = "/content/retailcoffee.pdf"
output_directory = "/content/test"
text_results = get_contents(pdf_path, output_directory)
We use a PDF that contains both text and images or charts to test the multimodal RAG system.
We save the images extracted from the PDF with PyMuPDF in the "test" directory. In the next steps, we create embeddings of these images so that information can later be retrieved from them based on a user query.
Split extracted text into smaller chunks using LangChain’s RecursiveCharacterTextSplitter.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2048,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
    separators=[
        "\n\n",
        "\n",
        " ",
        ".",
        ",",
        "\u200b",  # Zero-width space
        "\uff0c",  # Fullwidth comma
        "\u3001",  # Ideographic comma
        "\uff0e",  # Fullwidth full stop
        "\u3002",  # Ideographic full stop
        "",
    ],
)
doc_texts = text_splitter.create_documents(text_results)
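Before embedding, it is worth a quick look at how many chunks the splitter produced and what one of them looks like:
# Quick sanity check on the splitter output
print(f"{len(doc_texts)} chunks created")
print(doc_texts[0].page_content[:300])  # preview of the first chunk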
Load Nomic’s text and vision embedding models using Hugging Face’s Transformers library.
from transformers import AutoTokenizer, AutoModel
# Load the tokenizer and model
text_tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
text_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
def text_embeddings(text):
    inputs = text_tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    outputs = text_model(**inputs)
    # Mean-pool the token embeddings to get a single vector per input
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings[0].detach().numpy()
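A quick call confirms that the embedder returns one fixed-size vector per input string:
# One input string produces one fixed-size vector
sample_vector = text_embeddings("Starbucks revenue by product")
print(sample_vector.shape)  # (768,) for this model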
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5")
Convert text and images into vector embeddings for efficient retrieval.
# Text embeddings
texts_embeded = [text_embeddings(document.page_content) for document in doc_texts]

# Image embeddings
image_files = sorted(os.listdir(output_directory))  # images saved earlier by extract_images_from_pdf
image_embeddings = []
for img in image_files:
    try:
        image = Image.open(os.path.join(output_directory, img))
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        embeddings = outputs.last_hidden_state
        if embeddings.size(0) > 0:  # ensure the batch size is non-zero
            image_embedding = embeddings.mean(dim=1).squeeze().cpu().numpy()
            image_embeddings.append(image_embedding)
        else:
            print(f"No embeddings for {img}")
    except Exception as e:
        print(e)

# Size of the text and image embeddings
text_embeddings_size = len(texts_embeded[0])
image_embeddings_size = len(image_embeddings[0])
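Because the retriever we build later queries the image collection with a text embedding, the two embedding sizes must match, which they do since both models share the same latent space. A quick check:
# Text queries will be used to search the image collection, so the dimensions must match
print(text_embeddings_size, image_embeddings_size)
assert text_embeddings_size == image_embeddings_size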
Qdrant is an open-source vector database and search engine designed to efficiently store, manage, and query high-dimensional vectors. We save our embeddings in this vector database.
from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")

# Create a collection for the text embeddings
if not client.collection_exists("text1"):
    client.create_collection(
        collection_name="text1",
        vectors_config=models.VectorParams(
            size=text_embeddings_size,  # vector size is defined by the model used
            distance=models.Distance.COSINE,
        ),
    )

client.upload_points(
    collection_name="text1",
    points=[
        models.PointStruct(
            id=str(uuid.uuid4()),
            vector=np.array(texts_embeded[idx]),
            payload={
                "metadata": doc.metadata,
                "content": doc.page_content,
            },
        )
        for idx, doc in enumerate(doc_texts)
    ],
)
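As a quick sanity check, you can count the points stored in the collection; the result should equal the number of text chunks:
# Verify that every chunk was uploaded
print(client.count(collection_name="text1").count)  # should equal len(doc_texts)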
Save image embeddings in a separate Qdrant collection for multimodal retrieval.
if not client.collection_exists("images1"):
client.create_collection(
collection_name="images1",
vectors_config=models.VectorParams(
size=image_embeddings_size, # Vector size is defined by used model
distance=models.Distance.COSINE,
),
)
# Ensure that image_embeddings are not empty
if len(image_embeddings) > 0:
client.upload_points(
collection_name="images1",
points=[
models.PointStruct(
id=str(uuid.uuid4()), # unique id
vector= np.array(image_embeddings[idx]) ,
payload={"image_path": output_directory+'/'+str(image_files[idx])} # Image path as metadata
)
for idx in range(len(image_embeddings))
)
else:
print("No embeddings found")
Retrieve the most relevant text and image embeddings based on a user query.
def MultiModalRetriever(query):
    query_vector = text_embeddings(query)
    # Retrieve text hits
    text_hits = client.query_points(
        collection_name="text1",
        query=query_vector,
        limit=3,
    ).points
    # Retrieve image hits (the text and image embeddings share the same latent space)
    Image_hits = client.query_points(
        collection_name="images1",
        query=query_vector,
        limit=5,
    ).points
    return text_hits, Image_hits
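You can call the retriever directly to inspect which chunks and images it returns before wiring it into generation (the query string here is just an example):
# Inspect retrieval results for a sample query
sample_text_hits, sample_image_hits = MultiModalRetriever("Starbucks revenue by product in 2020")
for hit in sample_text_hits:
    print(round(hit.score, 3), hit.payload["content"][:80])
for hit in sample_image_hits:
    print(round(hit.score, 3), hit.payload["image_path"])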
Use LangChain to process retrieved text and images, generating context-aware responses using GPT-4o.
def MultiModalRAG(context, images, user_query, model):
    # Helper function to encode an image as a base64 string
    def encode_image(image_path):
        if image_path:
            with open(image_path, "rb") as image_file:
                return base64.b64encode(image_file.read()).decode()
        return None

    image_paths = images
    # Encode three of the retrieved images
    img_base64 = encode_image(image_paths[0])
    img_base641 = encode_image(image_paths[1])
    img_base642 = encode_image(image_paths[2])
    message = HumanMessage(
        content=[
            {
                "type": "text",
                "text": "BASED ON RETRIEVED CONTEXT %s ONLY, ANSWER THE FOLLOWING QUERY %s. Context can be tables, texts or Images" % (context, user_query),
            },
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
            },
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{img_base641}"},
            },
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{img_base642}"},
            },
        ],
    )
    model = ChatOpenAI(model=model)
    response = model.invoke([message])
    return response.content
def RAG(query):
    text_hits, Image_hits = MultiModalRetriever(query)
    retrieved_images = [i.payload['image_path'] for i in Image_hits]
    print(retrieved_images)
    answer = MultiModalRAG(text_hits, retrieved_images, query, "gpt-4o")
    return answer
Let us now query our multimodal RAG system with different queries to test its multimodal capability.
RAG("Revenue of Starbucks in billion dollars of Food in 2020?")
Output:
'Based on the chart showing Starbucks' revenue by product for 2020, the revenue from
food is approximately $3 billion.'
The answer to this query is present only in a chart (Fig 4) in the PDF and not in any text, so our multimodal RAG system is able to retrieve this information accurately.
RAG("Explain what the Ansoff Matrix is for Starbucks.")
Output:
'The Ansoff Matrix is a strategic tool that helps businesses like Starbucks analyze
their growth strategies. For Starbucks, it can be broken down as follows:
1. **Market Penetration:** Starbucks focuses on increasing sales of existing products in current markets. This includes enhancing the customer experience, leveraging their mobile app for convenience, and promoting existing offerings.
2. **Product Development:** Starbucks introduces new products for existing markets. Examples include launching new beverage options or introducing meatless breakfast
items to adapt to changing consumer preferences.
3. **Market Development:** This involves Starbucks expanding into new geographical
locations or market segments with existing products. It selects high-traffic
locations and creates a consistent brand image and store experience to attract customers.
4. **Diversification:** Introducing entirely new products to new markets. This could involve Starbucks exploring areas like offering alcoholic beverages to attract different customer demographics.
Overall, the Ansoff Matrix helps Starbucks strategically plan how to grow and adapt in various market conditions by focusing on either current or new products and markets.'
The answer to this query is likewise present only in a diagram (Fig 3) in the PDF and not in any text, so our multimodal RAG system retrieves this information accurately.
RAG("Global coffee consumption in 2017")
Output:
'The global coffee consumption in 2017 was 161.37 million bags.'
The answer to this query, too, is present only in a chart (Fig 1) in the PDF and not in any text, again confirming that our multimodal RAG system can retrieve information from visual content accurately.
The integration of Nomic vision embeddings into multimodal RAG systems represents a major leap in AI, allowing seamless interaction between visual and textual data for enhanced understanding and response generation. By overcoming limitations seen in models like CLIP, Nomic Embed Vision offers a unified embedding space, boosting performance on multimodal tasks. This development paves the way for richer, more context-aware user experiences in high-volume production environments.
Frequently Asked Questions
Q1. What is multimodal Retrieval-Augmented Generation (RAG)?
A. Multimodal Retrieval-Augmented Generation (RAG) is an advanced AI architecture designed to process and synthesize data from various modalities, including text, images, audio, and video, enabling more context-aware and nuanced outputs. Unlike traditional RAG systems that focus primarily on text, multimodal RAG integrates multiple data types for more comprehensive understanding and response generation.
Q2. How do Nomic vision embeddings enhance multimodal RAG?
A. Nomic vision embeddings create a unified embedding space for both visual and textual data, allowing seamless interaction between different formats. This integration improves the system's ability to retrieve and process information across modalities, resulting in richer and more informative user experiences.
Q3. What makes Nomic Embed Vision suitable for multimodal tasks?
A. Nomic Embed Vision is designed to integrate both image and text comprehension in a shared latent space, making it highly suitable for tasks such as text-to-image retrieval. Its 92M-parameter vision encoder complements the 137M-parameter Nomic Embed Text, making it ideal for high-volume production environments.
Q4. How does Nomic Embed Vision address the limitations of CLIP?
A. CLIP models demonstrate strong zero-shot capabilities but struggle with unimodal tasks like semantic similarity. Nomic Embed Vision addresses this by aligning its vision encoder with the Nomic Embed Text latent space, ensuring better performance on a wider range of tasks, including unimodal ones.
Q5. On which benchmarks has Nomic Embed Vision been evaluated?
A. Nomic Embed Vision has been benchmarked on ImageNet zero-shot, MTEB, and Datacomp, showing strong performance across image, text, and multimodal tasks. These benchmarks highlight its ability to bridge the gap between different data types while maintaining high accuracy and efficiency.