Enhancing RAG Systems with Nomic Embeddings

Nibedita Dutta Last Updated : 28 Feb, 2025

The intersection of artificial intelligence and data processing has evolved significantly with the rise of multimodal Retrieval-Augmented Generation (RAG) systems. Multimodal RAG goes beyond traditional models that focus only on text: it integrates data types such as text, images, audio, and video, allowing for more nuanced and context-aware responses. A key innovation is Nomic vision embeddings, which create a unified space for both visual and textual data and enable seamless interaction across formats. By using strong embedding models to generate high-quality representations, multimodal RAG improves information retrieval, bridges the gap between different forms of content, and delivers richer, more informative user experiences.

Learning Objectives

  • Understand the fundamentals of multimodal Retrieval-Augmented Generation systems and their advantages over traditional RAG.
  • Explore the role of Nomic Vision Embeddings in creating a unified embedding space for text and images.
  • Compare Nomic Vision Embeddings with CLIP models and analyze their performance benchmarks.
  • Implement a multimodal RAG system in Python using Nomic Vision and Text Embeddings.
  • Learn how to extract and process textual and visual data from PDFs for multimodal retrieval.

This article was published as a part of the Data Science Blogathon.

What is Multimodal RAG?

Multimodal RAG represents a significant advancement in artificial intelligence. It builds upon traditional RAG systems by incorporating diverse data types such as text, images, audio, and video. Unlike conventional RAG systems that primarily process textual information, multimodal RAG is designed to handle and integrate multiple forms of data simultaneously. This capability allows for a more comprehensive understanding and the generation of responses that are context-aware across different modalities.

Key Components of Multimodal RAG

  • Data Ingestion: The process begins with ingesting various types of data through specialized processors for each format. This ensures that the system can validate, clean, and normalize incoming data while preserving its essential characteristics.
  • Vector Representation: Different modalities are processed using respective neural networks (e.g., CLIP for images or BERT for text) to generate unified vector representations or embeddings. These embeddings maintain semantic relationships across different modalities.
  • Vector Database Storage: The generated embeddings are stored in vector databases that use indexing techniques such as HNSW (or libraries such as FAISS) for efficient retrieval.
  • Query Processing: Incoming queries are analyzed and transformed into the same vector space as the stored data to determine the relevant modalities and generate appropriate embeddings for search.

Nomic Vision Embeddings

A significant innovation in this field of multimodal embeddings is the incorporation of Nomic vision embeddings, which create a cohesive embedding space for both visual and textual data. 

Nomic Embed Vision v1 and v1.5 are high-quality vision embedding models developed by Nomic AI, designed to share the same latent space as their corresponding text embedding models, Nomic Embed Text v1 and v1.5. Because image and text embeddings live in the same space, the models are well-suited for multimodal tasks such as text-to-image retrieval. With a vision encoder of only 92M parameters, Nomic Embed Vision is practical for high-volume production applications, complementing the 137M-parameter Nomic Embed Text.

Why CLIP Models Struggle with Unimodal Tasks

Multimodal models such as CLIP demonstrate remarkable zero-shot capabilities across different modalities. However, CLIP’s text encoders struggle with tasks beyond image retrieval, as seen in benchmarks like MTEB, which evaluates the effectiveness of text embedding models. Nomic Embed Vision aims to address these limitations by aligning a vision encoder with the existing Nomic Embed Text latent space.

Nomic Vision Embeddings
Source: Nomic Blog

To tackle the issue of underperformance on unimodal tasks, such as semantic similarity, Nomic Embed Vision, a vision encoder, was trained alongside Nomic Embed Text, a long-context text encoder. The training method involved freezing the text encoder and training the vision encoder on image-text pairs. This approach not only produced optimal results but also ensured backward compatibility with the embeddings from Nomic Embed Text.
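To make this training recipe concrete, here is a minimal sketch of a single alignment step. It assumes a CLIP-style contrastive objective; the exact loss, temperature, and encoder interfaces are illustrative assumptions rather than details confirmed by the Nomic blog, and vision_encoder / text_encoder stand in for modules that map a batch of images and texts to embedding vectors.

import torch
import torch.nn.functional as F

def alignment_step(vision_encoder, text_encoder, images, texts, temperature=0.07):
    # The text tower is frozen: its embeddings define the target latent space,
    # which is what keeps the new vision embeddings backward compatible with
    # existing Nomic Embed Text vectors.
    with torch.no_grad():
        text_emb = F.normalize(text_encoder(texts), dim=-1)

    # Only the vision tower receives gradients.
    image_emb = F.normalize(vision_encoder(images), dim=-1)

    # Pairwise cosine similarities; matching image-text pairs sit on the diagonal.
    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric InfoNCE-style loss (an assumption, not a confirmed training detail).
    loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
    return loss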

Performance Benchmarks of Nomic Vision Embeddings

As mentioned earlier, existing multimodal models such as CLIP exhibit impressive zero-shot capabilities across different modalities. However, the performance of CLIP’s text encoders is subpar outside of tasks like image retrieval, as evidenced by benchmarks like MTEB, which evaluates the quality of text embedding models. Nomic Embed Vision is specifically designed to address these shortcomings by aligning a vision encoder with the existing Nomic Embed Text latent space. This alignment results in a unified multimodal latent space that delivers strong performance on image, text, and multimodal tasks, as demonstrated by the Imagenet Zero-Shot, MTEB, and Datacomp benchmarks.

Performance Benchmarks of Nomic Embed v1 and v1.5
Source: Nomic Blog

Hands-On Python Implementation of Multimodal RAG with Nomic Vision Embeddings

In this tutorial, we will build a multimodal RAG system that can efficiently retrieve information from a PDF containing both textual and visual content. We will build this on Google Colab using T4 GPU (Free tier).

Step 1: Installing Necessary Libraries

Install all required Python libraries, including OpenAI, Qdrant, Transformers, Torch, and PyMuPDF.

!pip install openai==1.55.3 httpx==0.27.2 
!pip install qdrant_client
!pip install transformers
!pip install transformers torch pillow
!pip install --upgrade nltk
!pip install sentence-transformers
!pip install --upgrade qdrant-client fastembed Pillow
!pip install PyMuPDF

Step 2: Setting OpenAI API key and Importing Necessary Libraries

Set up the OpenAI API key and import essential libraries like PyMuPDF, PIL, LangChain, and OpenAI.

import base64
import os

import fitz  # PyMuPDF
import numpy as np
import openai
import torch
from PIL import Image

from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter

os.environ["OPENAI_API_KEY"] = ''
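If you prefer not to hard-code the key in the notebook, a small alternative (assuming an interactive session such as Colab) is to prompt for it:

import os
from getpass import getpass

# Prompt for the key only if it is not already set in the environment.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")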

Step 3: Extracting Images From PDF

Use PyMuPDF to open the PDF, iterate through its pages, and save every embedded image to an output folder.

#images

def extract_images_from_pdf(pdf_path, output_folder):
    pdf_document = fitz.open(pdf_path)
    os.makedirs(output_folder, exist_ok=True)
    # Iterating through the pages in the PDF
    for page_number in range(len(pdf_document)):
        page = pdf_document[page_number]
        #Function For Getting Images From the PDF Pages
        images = page.get_images(full=True)

        for image_index, img in enumerate(images):
            xref = img[0]
            base_image = pdf_document.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            image_filename = f"page_{page_number+1}_image_{image_index+1}.{image_ext}"
            image_path = os.path.join(output_folder, image_filename)
            with open(image_path, "wb") as image_file:
                image_file.write(image_bytes)
    pdf_document.close()

Step 4: Extracting Text From PDF

Use PyMuPDF to extract text from all pages of the PDF and store it in a list.

def extract_text_pdf(path):
    """Extracts text from a PDF using PyMuPDF."""
    doc = fitz.open(path)
    text_results = []
    for page in doc:
        text = page.get_text()
        text_results.append(text)
    return text_results

Step 5: Saving Extracted Text and Images From PDF

Save images in the “test” directory and extract text for further processing.

def get_contents(pdf_path, output_directory):
    """Extracts text and images from a PDF, saves the images, and returns the page texts."""
    extract_images_from_pdf(pdf_path, output_directory)
    text_results = extract_text_pdf(pdf_path)
    return text_results
  
pdf_path = "/content/retailcoffee.pdf"
output_directory = "/content/test"
text_results=get_contents(pdf_path, output_directory)

The PDF used here (retailcoffee.pdf) contains both text and images/charts, which makes it a good test case for multimodal RAG.

We save the images extracted from the PDF with the PyMuPDF library in the “test” directory. In the next steps, we create embeddings of these images so that information can later be retrieved from them based on a user query.
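A quick sanity check (the exact counts depend on your PDF) confirms that text was extracted from every page and that the images landed in the output directory:

print(f"Extracted text from {len(text_results)} pages")
print("Saved images:", sorted(os.listdir(output_directory)))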

Step 6: Chunking Text Data For RAG

Split extracted text into smaller chunks using LangChain’s RecursiveCharacterTextSplitter.

text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=2048,
        chunk_overlap=50,
        length_function=len,
        is_separator_regex=False,
        separators=[
            "\n\n",
            "\n",
            " ",
            ".",
            ",",
            "\u200b",  # Zero-width space
            "\uff0c",  # Fullwidth comma
            "\u3001",  # Ideographic comma
            "\uff0e",  # Fullwidth full stop
            "\u3002",  # Ideographic full stop
            "",
        ],
    )

doc_texts = text_splitter.create_documents(text_results)
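It is worth inspecting the resulting chunks before embedding them; for example (the output will vary with your PDF):

print(f"Created {len(doc_texts)} chunks")
print(doc_texts[0].page_content[:200])  # preview the first chunk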

Step 7: Loading Nomic Text Embedding Model and Nomic Vision Embedding Model

Load Nomic’s text and vision embedding models using Hugging Face’s Transformers library.

from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model
text_tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
text_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

def text_embeddings(text):
    inputs = text_tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    outputs = text_model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings[0].detach().numpy()
    
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5")
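Note that, per the Nomic model card, the text embedding models are intended to be used with task prefixes such as search_document: and search_query:. The tutorial code embeds raw text, which works, but prepending the prefixes may improve retrieval quality. A small illustrative example (the query string is hypothetical):

# Documents and queries get different task prefixes per the Nomic model card.
doc_vector = text_embeddings("search_document: " + doc_texts[0].page_content)
query_vector = text_embeddings("search_query: Starbucks revenue by product in 2020")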

Step 8: Generating Text and Image Embeddings For Our Data

Convert text and images into vector embeddings for efficient retrieval.

# Text embeddings
texts_embeded = [text_embeddings(document.page_content) for document in doc_texts]

# Image embeddings: iterate over the image files saved in Step 3
image_files = sorted(os.listdir(output_directory))

image_embeddings = []
for img in image_files:
    try:
        image = Image.open(os.path.join(output_directory, img))
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        embeddings = outputs.last_hidden_state
        if embeddings.size(0) > 0:  # Ensure the batch size is non-zero
            image_embedding = embeddings.mean(dim=1).squeeze().cpu().numpy()
            image_embeddings.append(image_embedding)
        else:
            print(f"No embeddings for {img}")
    except Exception as e:
        print(e)

# Size of the text and image embeddings
text_embeddings_size = len(texts_embeded[0])
image_embeddings_size = len(image_embeddings[0])
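Because the two models share a latent space, both embeddings should come out with the same dimensionality (768 for the v1.5 models). A quick check:

print("Text embedding size:", text_embeddings_size)
print("Image embedding size:", image_embeddings_size)
assert text_embeddings_size == image_embeddings_size, "Embedding dimensions do not match"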

Step 9: Storing Text Embeddings in Qdrant 

Qdrant is an open-source vector database and search engine designed to efficiently store, manage, and query high-dimensional vectors. We store our embeddings in this vector database.

import uuid

from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")

# Creating a collection for the text embeddings
if not client.collection_exists("text1"):
    client.create_collection(
        collection_name="text1",
        vectors_config=models.VectorParams(
            size=text_embeddings_size,  # Vector size is defined by the embedding model
            distance=models.Distance.COSINE,
        ),
    )

client.upload_points(
    collection_name="text1",
    points=[
        models.PointStruct(
            id=str(uuid.uuid4()),
            vector=texts_embeded[idx].tolist(),
            payload={
                "metadata": doc.metadata,
                "content": doc.page_content,
            },
        )
        for idx, doc in enumerate(doc_texts)
    ],
)

Step 10: Storing Image Embeddings in Qdrant 

Save image embeddings in a separate Qdrant collection for multimodal retrieval.

# Creating a separate collection for the image embeddings
if not client.collection_exists("images1"):
    client.create_collection(
        collection_name="images1",
        vectors_config=models.VectorParams(
            size=image_embeddings_size,  # Vector size is defined by the embedding model
            distance=models.Distance.COSINE,
        ),
    )

# Ensure that image_embeddings are not empty
if len(image_embeddings) > 0:
    client.upload_points(
        collection_name="images1",
        points=[
            models.PointStruct(
                id=str(uuid.uuid4()),  # unique id
                vector=image_embeddings[idx].tolist(),
                payload={"image_path": os.path.join(output_directory, image_files[idx])},  # Image path as metadata
            )
            for idx in range(len(image_embeddings))
        ],
    )
else:
    print("No embeddings found")
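An optional check (the point counts depend on your PDF) confirms that both collections were populated:

print("Text points stored:", client.count(collection_name="text1").count)
print("Image points stored:", client.count(collection_name="images1").count)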

Step 11: Creating a MultiModal Retriever For Retrieving Images and Text

Retrieve the most relevant text and image embeddings based on a user query.

def MultiModalRetriever(query):
    """Embeds the user query and retrieves the closest text chunks and images."""
    query_vector = text_embeddings(query).tolist()

    # Retrieve text hits
    text_hits = client.query_points(
        collection_name="text1",
        query=query_vector,
        limit=3,
    ).points
    # Retrieve image hits
    Image_hits = client.query_points(
        collection_name="images1",
        query=query_vector,
        limit=5,
    ).points

    return text_hits, Image_hits
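Before wiring the retriever into generation, it helps to inspect what it returns for a sample query (the query below is just an example; scores and paths will differ for your data):

text_hits, image_hits = MultiModalRetriever("Starbucks revenue by product in 2020")

for hit in text_hits:
    print(round(hit.score, 3), hit.payload["content"][:80])
for hit in image_hits:
    print(round(hit.score, 3), hit.payload["image_path"])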

Step 12: Creating a MultiModal RAG using LangChain

Use LangChain to process retrieved text and images, generating context-aware responses using GPT-4o.

def MultiModalRAG(context,images,user_query,model):  
    # Helper function to encode an image as a base64 string
    def encode_image(image_path):
        if image_path:
            with open(image_path, "rb") as image_file:
                return base64.b64encode(image_file.read()).decode()
        return None


    image_paths = images
    # Encode the top three retrieved images (assumes at least three were retrieved)
    img_base64 = encode_image(image_paths[0])
    img_base641 = encode_image(image_paths[1])
    img_base642 = encode_image(image_paths[2])

    message = HumanMessage(
            content=[
                {"type": "text", "text": "BASED ON RETRIEVED CONTEXT %s ONLY, ANSWER THE FOLLOWING QUERY %s. Context can be tables, texts or Images"%(context,user_query)},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{img_base641}"},
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{img_base642}"},
                },
            ],)

    model = ChatOpenAI(model=model)    
    response = model.invoke([message])
    return response.content


def RAG(query):
    text_hits, Image_hits = MultiModalRetriever(query)

    retrieved_context = [hit.payload["content"] for hit in text_hits]
    retrieved_images = [hit.payload["image_path"] for hit in Image_hits]
    print(retrieved_images)
    answer = MultiModalRAG(retrieved_context, retrieved_images, query, "gpt-4o")
    return answer

Querying the Model

Let us now query our multimodal RAG system with different questions to test its multimodal capabilities:

RAG("Revenue of Starbucks in billion dollars of Food in 2020?")

Output:

'Based on the chart showing Starbucks' revenue by product for 2020, the revenue from
food is approximately $3 billion.'

The answer to this query is present only in a chart (Fig 4) in the PDF and not in any text, so our multimodal RAG is able to retrieve this information accurately.

RAG("Explain what the Ansoff Matrix is for Starbucks.")

Output:


'The Ansoff Matrix is a strategic tool that helps businesses like Starbucks analyze
their growth strategies. For Starbucks, it can be broken down as follows:
1. **Market Penetration:** Starbucks focuses on increasing sales of existing
products in current markets. This includes enhancing the customer experience,
leveraging their mobile app for convenience, and promoting existing offerings.
2. **Product Development:** Starbucks introduces new products for existing markets.
Examples include launching new beverage options or introducing meatless breakfast
items to adapt to changing consumer preferences.
3. **Market Development:** This involves Starbucks expanding into new geographical
locations or market segments with existing products. It selects high-traffic
locations and creates a consistent brand image and store experience to attract customers.
4. **Diversification:** Introducing entirely new products to new markets. This could
involve Starbucks exploring areas like offering alcoholic beverages to attract
different customer demographics.
Overall, the Ansoff Matrix helps Starbucks strategically plan how to grow and adapt
in various market conditions by focusing on either current or new products and
markets.'

This answer, too, is present only in a diagram (Fig 3) in the PDF and not in any text, so our multimodal RAG is able to retrieve this information accurately.

RAG("Global coffee consumption in 2017")

Output:


'The global coffee consumption in 2017 was 161.37 million bags.'

Again, the answer is present only in a chart (Fig 1) in the PDF and not in any text, so our multimodal RAG is able to retrieve this information accurately.


Conclusion

The integration of Nomic vision embeddings into multimodal RAG systems represents a major leap in AI, allowing seamless interaction between visual and textual data for enhanced understanding and response generation. By overcoming limitations seen in models like CLIP, Nomic Embed Vision offers a unified embedding space, boosting performance on multimodal tasks. This development paves the way for richer, more context-aware user experiences in high-volume production environments.

Key Takeaways

  • Multimodal Retrieval-Augmented Generation (RAG) systems integrate various data types, such as text, images, audio, and video, enabling more context-aware and nuanced outputs compared to traditional RAG systems focused on text alone.
  • Nomic vision embeddings play a key role by unifying visual and textual data into a single embedding space, enhancing the system’s ability to retrieve and synthesize information across multiple modalities.
  • The multimodal RAG system processes data through specialized ingestion, vector representation, and storage techniques, ensuring efficient retrieval and meaningful responses across diverse content formats.
  • While CLIP models excel in zero-shot capabilities, they struggle with unimodal tasks like semantic similarity. Nomic Embed Vision addresses this by aligning vision and text encoders, improving performance on a wide range of tasks.

Frequently Asked Questions

Q1. What is Multimodal RAG?

A. Multimodal Retrieval-Augmented Generation (RAG) is an advanced AI architecture designed to process and synthesize data from various modalities, including text, images, audio, and video, enabling more context-aware and nuanced outputs. Unlike traditional RAG systems that focus primarily on text, multimodal RAG integrates multiple data types for more comprehensive understanding and response generation.

Q2. How do Nomic Vision Embeddings enhance Multimodal RAG systems?

A. Nomic vision embeddings create a unified embedding space for both visual and textual data, allowing seamless interaction between different formats. This integration improves the system’s ability to retrieve and process information across modalities, resulting in richer and more informative user experiences.

Q3. What is the main advantage of Nomic Embed Vision in multimodal tasks?

A. Nomic Embed Vision is designed to integrate both image and text comprehension in a shared latent space, making it highly suitable for tasks such as text-to-image retrieval. Its 92M parameter vision encoder complements the 137M parameter Nomic Embed Text, making it ideal for high-volume production environments.

Q4. How does Nomic Embed Vision overcome the limitations of CLIP models?

A. CLIP models demonstrate strong zero-shot capabilities but struggle with unimodal tasks like semantic similarity. Nomic Embed Vision addresses this by aligning its vision encoder with the Nomic Embed Text latent space, ensuring better performance on a wider range of tasks, including unimodal tasks.

Q5. What are the key benchmarks that demonstrate Nomic Vision Embeddings’ performance?

A. Nomic Embed Vision has been benchmarked against Imagenet Zero-Shot, MTEB, and Datacomp, showing strong performance across image, text, and multimodal tasks. These benchmarks highlight its ability to bridge the gap between different data types while maintaining high accuracy and efficiency.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Nibedita completed her master’s in Chemical Engineering from IIT Kharagpur in 2014 and is currently working as a Senior Data Scientist. In her current capacity, she works on building intelligent ML-based solutions to improve business processes.
