The intersection of artificial intelligence and data processing has evolved significantly with the rise of multimodal Retrieval-Augmented Generation (RAG) systems. Unlike traditional RAG models that focus only on text, multimodal RAG integrates text, images, audio, and video, allowing for more nuanced, context-aware responses. A key innovation here is Nomic vision embeddings, which place visual and textual data in a single unified embedding space so that the two formats can interact seamlessly. By using these high-quality embeddings, multimodal RAG improves information retrieval, bridges the gap between different content forms, and delivers richer, more informative user experiences.
Multimodal RAG represents a significant advancement in artificial intelligence. It builds upon traditional RAG systems by incorporating diverse data types such as text, images, audio, and video. Unlike conventional RAG systems that primarily process textual information, multimodal RAG is designed to handle and integrate multiple forms of data simultaneously, which allows it to produce more comprehensive, context-aware responses across different modalities.
Key Components of Multimodal RAG
A significant innovation in this field is the incorporation of Nomic vision embeddings, which create a cohesive embedding space for both visual and textual data.
Nomic Embed Vision v1 and v1.5 are high-quality vision embedding models developed by Nomic AI, designed to share the same latent space as their corresponding text embedding models, Nomic Embed Text v1 and v1.5. Because the vision and text encoders operate in the same space, they are well suited for multimodal tasks such as text-to-image retrieval. With a vision encoder of only 92M parameters, Nomic Embed Vision is suitable for high-volume production applications and complements the 137M-parameter Nomic Embed Text.
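To make the idea of a shared latent space concrete, the short sketch below embeds a caption with the text model and an image with the vision model, then compares them with cosine similarity. This is only a minimal illustration, assuming the Hugging Face checkpoints used later in this tutorial, simple mean pooling, and a placeholder image path; matching image-caption pairs should score higher than unrelated ones.
# Minimal illustration of the shared latent space (assumes the nomic-ai checkpoints
# used later in this tutorial; the pooling choice and image path are illustrative).
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoTokenizer, AutoModel, AutoProcessor

tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
text_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
vision_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5")

def embed_text(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = text_model(**inputs)
    return out.last_hidden_state.mean(dim=1)  # mean pooling over tokens

def embed_image(path):
    inputs = processor(images=Image.open(path), return_tensors="pt")
    with torch.no_grad():
        out = vision_model(**inputs)
    return out.last_hidden_state.mean(dim=1)  # mean pooling over patches

# "chart.png" is a placeholder; replace it with any local image file.
score = F.cosine_similarity(embed_text("a bar chart of coffee revenue"), embed_image("chart.png"))
print(float(score))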
CLIP Models Struggle with Unimodal Tasks
Multimodal models such as CLIP demonstrate remarkable zero-shot capabilities across different modalities. However, CLIP’s text encoders struggle with tasks beyond image retrieval, as seen in benchmarks like MTEB, which evaluates the effectiveness of text embedding models. Nomic Embed Vision aims to address these limitations by aligning a vision encoder with the existing Nomic Embed Text latent space.
To tackle the issue of underperformance on unimodal tasks, such as semantic similarity, Nomic Embed Vision, a vision encoder, was trained alongside Nomic Embed Text, a long-context text encoder. The training method involved freezing the text encoder and training the vision encoder on image-text pairs. This approach not only produced optimal results but also ensured backward compatibility with the embeddings from Nomic Embed Text.
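Conceptually, that recipe can be sketched as follows. The function below is a simplified, illustrative contrastive alignment step in which the encoders, optimizer, and batch are placeholders; it assumes a standard CLIP-style objective and is not Nomic's actual training code.
# Illustrative sketch of the alignment recipe described above: the text encoder is
# frozen and only the vision encoder is updated on image-text pairs.
import torch
import torch.nn.functional as F

def alignment_step(vision_encoder, text_encoder, optimizer, images, captions, temperature=0.07):
    with torch.no_grad():  # frozen text tower preserves the existing Nomic Embed Text space
        text_emb = F.normalize(text_encoder(captions), dim=-1)
    image_emb = F.normalize(vision_encoder(images), dim=-1)
    logits = image_emb @ text_emb.T / temperature   # pairwise similarities
    targets = torch.arange(len(images))             # matching pairs lie on the diagonal
    loss = F.cross_entropy(logits, targets)         # pull images toward their paired captions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()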
Aligning the vision encoder with the existing Nomic Embed Text latent space produces a unified multimodal space that delivers strong performance on image, text, and multimodal tasks, as demonstrated by the ImageNet zero-shot, MTEB, and Datacomp benchmarks.
In this tutorial, we will build a multimodal RAG system that can efficiently retrieve information from a PDF containing both textual and visual content. We will build it on Google Colab using a T4 GPU (free tier).
Install all required Python libraries, including OpenAI, Qdrant, Transformers, Torch, and PyMuPDF.
!pip install openai==1.55.3 httpx==0.27.2
!pip install qdrant_client
!pip install transformers
!pip install transformers torch pillow
!pip install --upgrade nltk
!pip install sentence-transformers
!pip install --upgrade qdrant-client fastembed Pillow
!pip install PyMuPDF
Set up the OpenAI API key and import essential libraries like PyMuPDF, PIL, LangChain, and OpenAI.
import os
import uuid
import base64

import fitz  # PyMuPDF
import numpy as np
import torch
from PIL import Image

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI

os.environ["OPENAI_API_KEY"] = ''
Extract the images from every page of the PDF using PyMuPDF and save them to an output folder.
# Images
def extract_images_from_pdf(pdf_path, output_folder):
    """Extracts all images from a PDF and saves them to the output folder."""
    pdf_document = fitz.open(pdf_path)
    os.makedirs(output_folder, exist_ok=True)
    # Iterate through the pages in the PDF
    for page_number in range(len(pdf_document)):
        page = pdf_document[page_number]
        # Get the images embedded in this page
        images = page.get_images(full=True)
        for image_index, img in enumerate(images):
            xref = img[0]
            base_image = pdf_document.extract_image(xref)
            image_bytes = base_image["image"]
            image_ext = base_image["ext"]
            image_filename = f"page_{page_number+1}_image_{image_index+1}.{image_ext}"
            image_path = os.path.join(output_folder, image_filename)
            with open(image_path, "wb") as image_file:
                image_file.write(image_bytes)
    pdf_document.close()
Use PyMuPDF to extract text from all pages of the PDF and store it in a list.
def extract_text_pdf(path):
    """Extracts text from a PDF using PyMuPDF."""
    doc = fitz.open(path)
    text_results = []
    for page in doc:
        text = page.get_text()
        text_results.append(text)
    return text_results
Save images in the “test” directory and extract text for further processing.
def get_contents(pdf_path, output_directory):
    """Extracts text and images from a PDF, saves the images, and returns the page texts."""
    extract_images_from_pdf(pdf_path, output_directory)
    text_results = extract_text_pdf(pdf_path)
    return text_results

pdf_path = "/content/retailcoffee.pdf"
output_directory = "/content/test"
text_results = get_contents(pdf_path, output_directory)
We use a PDF that contains both text and images or charts to test the multimodal RAG system.
We save the images extracted from the PDF with PyMuPDF in the "test" directory. In the next steps, we create embeddings of these images so that information can later be retrieved from them based on a user query.
Split extracted text into smaller chunks using LangChain’s RecursiveCharacterTextSplitter.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2048,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
    separators=[
        "\n\n",
        "\n",
        " ",
        ".",
        ",",
        "\u200b",  # Zero-width space
        "\uff0c",  # Fullwidth comma
        "\u3001",  # Ideographic comma
        "\uff0e",  # Fullwidth full stop
        "\u3002",  # Ideographic full stop
        "",
    ],
)
doc_texts = text_splitter.create_documents(text_results)
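Before embedding, it is worth a quick look at how many chunks the splitter produced and what one of them looks like:
# Quick sanity check on the splitter output
print(f"{len(doc_texts)} chunks created")
print(doc_texts[0].page_content[:300])  # preview of the first chunk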
Load Nomic’s text and vision embedding models using Hugging Face’s Transformers library.
from transformers import AutoTokenizer, AutoModel
# Load the tokenizer and model
text_tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
text_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
def text_embeddings(text):
    inputs = text_tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    outputs = text_model(**inputs)
    # Mean-pool the token embeddings to get a single vector per input
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings[0].detach().numpy()
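A quick call confirms that the embedder returns one fixed-size vector per input string:
# One input string produces one fixed-size vector
sample_vector = text_embeddings("Starbucks revenue by product")
print(sample_vector.shape)  # (768,) for this model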
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch
model = AutoModel.from_pretrained("nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5")
Convert text and images into vector embeddings for efficient retrieval.
# Text embeddings
texts_embeded = [text_embeddings(document.page_content) for document in doc_texts]

# Image embeddings
image_files = sorted(os.listdir(output_directory))  # images saved earlier by extract_images_from_pdf
image_embeddings = []
for img in image_files:
    try:
        image = Image.open(os.path.join(output_directory, img))
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        embeddings = outputs.last_hidden_state
        if embeddings.size(0) > 0:  # ensure the batch size is non-zero
            image_embedding = embeddings.mean(dim=1).squeeze().cpu().numpy()
            image_embeddings.append(image_embedding)
        else:
            print(f"No embeddings for {img}")
    except Exception as e:
        print(e)

# Size of the text and image embeddings
text_embeddings_size = len(texts_embeded[0])
image_embeddings_size = len(image_embeddings[0])
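Because the retriever we build later queries the image collection with a text embedding, the two embedding sizes must match, which they do since both models share the same latent space. A quick check:
# Text queries will be used to search the image collection, so the dimensions must match
print(text_embeddings_size, image_embeddings_size)
assert text_embeddings_size == image_embeddings_size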
Qdrant is an open-source vector database and search engine designed to efficiently store, manage, and query high-dimensional vectors. We save our embeddings in this vector database.
from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")

# Create a collection for the text embeddings
if not client.collection_exists("text1"):
    client.create_collection(
        collection_name="text1",
        vectors_config=models.VectorParams(
            size=text_embeddings_size,  # vector size is defined by the model used
            distance=models.Distance.COSINE,
        ),
    )

client.upload_points(
    collection_name="text1",
    points=[
        models.PointStruct(
            id=str(uuid.uuid4()),
            vector=np.array(texts_embeded[idx]),
            payload={
                "metadata": doc.metadata,
                "content": doc.page_content,
            },
        )
        for idx, doc in enumerate(doc_texts)
    ],
)
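As a quick sanity check, you can count the points stored in the collection; the result should equal the number of text chunks:
# Verify that every chunk was uploaded
print(client.count(collection_name="text1").count)  # should equal len(doc_texts)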
Save image embeddings in a separate Qdrant collection for multimodal retrieval.
if not client.collection_exists("images1"):
client.create_collection(
collection_name="images1",
vectors_config=models.VectorParams(
size=image_embeddings_size, # Vector size is defined by used model
distance=models.Distance.COSINE,
),
)
# Ensure that image_embeddings are not empty
if len(image_embeddings) > 0:
client.upload_points(
collection_name="images1",
points=[
models.PointStruct(
id=str(uuid.uuid4()), # unique id
vector= np.array(image_embeddings[idx]) ,
payload={"image_path": output_directory+'/'+str(image_files[idx])} # Image path as metadata
)
for idx in range(len(image_embeddings))
)
else:
print("No embeddings found")
Retrieve the most relevant text and image embeddings based on a user query.
def MultiModalRetriever(query):
    query_vector = text_embeddings(query)
    # Retrieve text hits
    text_hits = client.query_points(
        collection_name="text1",
        query=query_vector,
        limit=3,
    ).points
    # Retrieve image hits (the text and image embeddings share the same latent space)
    Image_hits = client.query_points(
        collection_name="images1",
        query=query_vector,
        limit=5,
    ).points
    return text_hits, Image_hits
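You can call the retriever directly to inspect which chunks and images it returns before wiring it into generation (the query string here is just an example):
# Inspect retrieval results for a sample query
sample_text_hits, sample_image_hits = MultiModalRetriever("Starbucks revenue by product in 2020")
for hit in sample_text_hits:
    print(round(hit.score, 3), hit.payload["content"][:80])
for hit in sample_image_hits:
    print(round(hit.score, 3), hit.payload["image_path"])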
Use LangChain to process retrieved text and images, generating context-aware responses using GPT-4o.
def MultiModalRAG(context, images, user_query, model):
    # Helper function to encode an image as a base64 string
    def encode_image(image_path):
        if image_path:
            with open(image_path, "rb") as image_file:
                return base64.b64encode(image_file.read()).decode()
        return None

    image_paths = images
    # Encode three of the retrieved images
    img_base64 = encode_image(image_paths[0])
    img_base641 = encode_image(image_paths[1])
    img_base642 = encode_image(image_paths[2])
    message = HumanMessage(
        content=[
            {
                "type": "text",
                "text": "BASED ON RETRIEVED CONTEXT %s ONLY, ANSWER THE FOLLOWING QUERY %s. Context can be tables, texts or Images" % (context, user_query),
            },
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
            },
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{img_base641}"},
            },
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{img_base642}"},
            },
        ],
    )
    model = ChatOpenAI(model=model)
    response = model.invoke([message])
    return response.content
def RAG(query):
    text_hits, Image_hits = MultiModalRetriever(query)
    retrieved_images = [i.payload['image_path'] for i in Image_hits]
    print(retrieved_images)
    answer = MultiModalRAG(text_hits, retrieved_images, query, "gpt-4o")
    return answer
Let us now query our multimodal RAG system with different queries to test its multimodal capability.
RAG("Revenue of Starbucks in billion dollars of Food in 2020?")
Output:
'Based on the chart showing Starbucks' revenue by product for 2020, the revenue from
food is approximately $3 billion.'
The answer to this query is present only in a chart (Fig 4) in the PDF and not in any text, so our multimodal RAG system is able to retrieve this information accurately.
RAG("Explain what the Ansoff Matrix is for Starbucks.")
Output:
'The Ansoff Matrix is a strategic tool that helps businesses like Starbucks analyze
their growth strategies. For Starbucks, it can be broken down as follows:
1. **Market Penetration:** Starbucks focuses on increasing sales of existing products in current markets. This includes enhancing the customer experience, leveraging their mobile app for convenience, and promoting existing offerings.
2. **Product Development:** Starbucks introduces new products for existing markets. Examples include launching new beverage options or introducing meatless breakfast
items to adapt to changing consumer preferences.
3. **Market Development:** This involves Starbucks expanding into new geographical
locations or market segments with existing products. It selects high-traffic
locations and creates a consistent brand image and store experience to attract customers.
4. **Diversification:** Introducing entirely new products to new markets. This could involve Starbucks exploring areas like offering alcoholic beverages to attract different customer demographics.
Overall, the Ansoff Matrix helps Starbucks strategically plan how to grow and adapt in various market conditions by focusing on either current or new products and markets.'
The answer to this query is likewise present only in a diagram (Fig 3) in the PDF and not in any text, so our multimodal RAG system retrieves this information accurately.
RAG("Global coffee consumption in 2017")
Output:
'The global coffee consumption in 2017 was 161.37 million bags.'
The answer to this query, too, is present only in a chart (Fig 1) in the PDF and not in any text, again confirming that our multimodal RAG system can retrieve information from visual content accurately.
The integration of Nomic vision embeddings into multimodal RAG systems represents a major leap in AI, allowing seamless interaction between visual and textual data for enhanced understanding and response generation. By overcoming limitations seen in models like CLIP, Nomic Embed Vision offers a unified embedding space, boosting performance on multimodal tasks. This development paves the way for richer, more context-aware user experiences in high-volume production environments.
Frequently Asked Questions
Q1. What is multimodal Retrieval-Augmented Generation (RAG)?
A. Multimodal Retrieval-Augmented Generation (RAG) is an advanced AI architecture designed to process and synthesize data from various modalities, including text, images, audio, and video, enabling more context-aware and nuanced outputs. Unlike traditional RAG systems that focus primarily on text, multimodal RAG integrates multiple data types for more comprehensive understanding and response generation.
Q2. How do Nomic vision embeddings enhance multimodal RAG?
A. Nomic vision embeddings create a unified embedding space for both visual and textual data, allowing seamless interaction between different formats. This integration improves the system's ability to retrieve and process information across modalities, resulting in richer and more informative user experiences.
Q3. What makes Nomic Embed Vision suitable for multimodal tasks?
A. Nomic Embed Vision is designed to integrate both image and text comprehension in a shared latent space, making it highly suitable for tasks such as text-to-image retrieval. Its 92M-parameter vision encoder complements the 137M-parameter Nomic Embed Text, making it ideal for high-volume production environments.
Q4. How does Nomic Embed Vision address the limitations of CLIP?
A. CLIP models demonstrate strong zero-shot capabilities but struggle with unimodal tasks like semantic similarity. Nomic Embed Vision addresses this by aligning its vision encoder with the Nomic Embed Text latent space, ensuring better performance on a wider range of tasks, including unimodal ones.
Q5. On which benchmarks has Nomic Embed Vision been evaluated?
A. Nomic Embed Vision has been benchmarked on ImageNet zero-shot, MTEB, and Datacomp, showing strong performance across image, text, and multimodal tasks. These benchmarks highlight its ability to bridge the gap between different data types while maintaining high accuracy and efficiency.