Mastering Multimodal RAG with Vertex AI & Gemini for Content and Images

Soumyadarshan Dash Last Updated : 24 Feb, 2025
10 min read

Retrieval Augmented Generation (RAG) has revolutionized how large language models access external data, but traditional approaches are limited to text. With the rise of multimodal data, integrating text and visual information is crucial for comprehensive analysis, especially in complex fields like finance and research. Multimodal RAG addresses this by enabling models to process both text and images for better knowledge retrieval and reasoning. This article explores building a multimodal RAG system using Google’s Gemini models, Vertex AI, and LangChain, guiding you through environment setup, data processing, embedding generation, and constructing a robust document search engine.

Learning Objectives

  • Understand the concept of Multimodal RAG and its significance in enhancing data retrieval.
  • Learn how Gemini can be used to process and integrate both text and visual data.
  • Explore the capabilities of Vertex AI in building scalable AI models for real-time applications.
  • Gain insight into how LangChain facilitates seamless integration of language models with external data sources.
  • Learn how to construct intelligent systems that use textual and visual information to produce precise, context-aware responses.
  • Know how to apply these technologies to use cases like content generation, personalized recommendations, and AI assistants.

Multimodal RAG Model: An Overview

Multimodal RAG models combine visual and textual information to deliver more robust and context-aware outputs. Unlike conventional RAG models, which rely solely on text, multimodal RAG systems are designed to ingest and incorporate visual content such as graphs, charts, and images. This dual-processing capability is particularly valuable for analyzing complex documents where visuals are as informative as the text, such as financial reports, scientific papers, or user manuals.

Figure: Multimodal Retrieval Augmented Generation (RAG) system architecture (Source: Author)

By processing both text and images, the model develops a deeper understanding of the content, leading to more accurate and insightful responses. This integration reduces the risk of generating misleading or contextually incorrect information (commonly known as hallucination in machine learning), resulting in more reliable outputs for decision-making and analysis.

Key Technologies Used

Here’s a summary of each key technology:

  1. Gemini by Google DeepMind: A robust generative AI suite designed for multimodal functions, capable of processing and creating text and images seamlessly.
  2. Vertex AI: A comprehensive platform for developing, deploying, and scaling machine learning models, known for its vector search feature for multimodal data retrieval.
  3. LangChain: A tool that streamlines the integration of large language models (LLMs) with various tools and data sources, supporting the connection between models, embeddings, and external resources.
  4. Retrieval-Augmented Generation (RAG) Framework: Combines retrieval-based and generation-based models to enhance response accuracy by pulling context from external sources before generating outputs, ideal for multimodal content handling.
  5. OpenAI’s DALL·E: An image-generation model that translates textual prompts into visual content, enhancing multimodal RAG outputs with tailored and contextually relevant imagery.
  6. Transformers for Multimodal Processing: The backbone architecture for handling mixed input types, enabling models to process and generate responses involving both text and visual data efficiently.

Model Architecture Explained

The architecture of a multimodal RAG system involves:

  • Gemini for Multimodal Processing: Handles both text and visual inputs, extracting detailed information.
  • Vertex AI Vector Search: Provides a vector store for embedding management, enabling seamless data retrieval.
  • LangChain MultiVectorRetriever: Acts as a mediator for retrieving relevant data from the vector store based on user queries.
  • RAG Framework Integration: Combines retrieved data with generative capabilities to create accurate, context-rich responses.
  • Multimodal Encoder-Decoder: Processes and fuses textual and visual content, ensuring both types of data contribute effectively to the output.
  • Transformers for Hybrid Data Handling: Uses attention mechanisms to align and integrate information from different modalities.
  • Fine-Tuning Pipelines: Customized training routines that adjust the model’s performance based on specific multimodal datasets for enhanced accuracy and context understanding.

Figure: Building a multimodal Retrieval Augmented Generation (RAG) system with Gemini and LangChain

Building a Multimodal RAG System with Vertex AI, Gemini, and LangChain

Now let’s get into the actual coding part. In this section, I will guide you through the steps of building a multimodal RAG system for content and images, using Google Gemini, Vertex AI, and LangChain.

Step 1: Setting Up Your Development Environment

 Let’s begin by setting up the environment.

1. Install necessary packages

The %pip install command installs all the necessary Python libraries, including google-cloud-aiplatform, langchain, and various document-processing libraries like pypdf.

%pip install -U -q google-cloud-aiplatform langchain-core langchain-google-vertexai langchain-text-splitters langchain-community "unstructured[all-docs]" pypdf pydantic lxml pillow matplotlib opencv-python tiktoken

2. Restart the runtime to make sure new packages are accessible

import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

3. Authenticate the notebook environment (Google Colab only)

Add the code to authenticate and initialize the Vertex AI environment. The auth.authenticate_user() function authenticates your Google Cloud account when running in Google Colab.

import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth

    auth.authenticate_user()

Step 2: Define Google Cloud Project Information

  • PROJECT_ID and LOCATION: Define your Google Cloud project and location.
  • Vertex AI SDK Initialization: The aiplatform.init() function initializes the Vertex AI SDK with your project and bucket information.

PROJECT_ID = "YOUR_PROJECT_ID"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

# For Vector Search Staging
GCS_BUCKET = "YOUR_BUCKET_NAME"  # @param {type:"string"}
GCS_BUCKET_URI = f"gs://{GCS_BUCKET}"

Step 3: Initialize the Vertex AI SDK

from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=GCS_BUCKET_URI)

Step 4: Import Necessary Libraries

Import the libraries needed for constructing the document repository and integrating LangChain. This includes LangChain core components, the Vertex AI integrations, IPython display utilities, and the unstructured PDF partitioner used in the retrieval and processing pipeline.

import base64
import os
import re
import uuid

from IPython.display import Image, Markdown, display
from langchain.prompts import PromptTemplate
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_core.documents import Document
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_google_vertexai import (
    ChatVertexAI,
    VectorSearchVectorStore,
    VertexAI,
    VertexAIEmbeddings,
)
from langchain_text_splitters import CharacterTextSplitter
from unstructured.partition.pdf import partition_pdf

# from langchain_community.vectorstores import Chroma  # Optional

Step 5: Define Model Information

MODEL_NAME = "gemini-1.5-flash"
GEMINI_OUTPUT_TOKEN_LIMIT = 8192

EMBEDDING_MODEL_NAME = "text-embedding-004"
EMBEDDING_TOKEN_LIMIT = 2048

TOKEN_LIMIT = min(GEMINI_OUTPUT_TOKEN_LIMIT, EMBEDDING_TOKEN_LIMIT)
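
Before moving on, it can help to run a quick, optional sanity check (an addition to the original walkthrough) to confirm that the model names above resolve in your project and region. The snippet below assumes the imports from Step 4 and the Vertex AI initialization from Step 3 have already run.

# Optional sanity check: confirm the generation and embedding models respond
llm_check = VertexAI(model_name=MODEL_NAME, max_output_tokens=256)
print(llm_check.invoke("Reply with the single word: ready"))

embedding_check = VertexAIEmbeddings(model_name=EMBEDDING_MODEL_NAME)
print(len(embedding_check.embed_query("hello world")))  # text-embedding-004 returns 768-dimensional vectors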

Step 6: Load the Data

1. Get documents and images from GCS

# Download documents and images used in this notebook
!gsutil -m rsync -r gs://github-repo/rag/intro_multimodal_rag/ .
print("Download completed")

2. Extract images, tables, and chunk text from a PDF file

  • Partitions a PDF into tables and text using partition_pdf from unstructured.

pdf_folder_path = "/content/data/" if "google.colab" in sys.modules else "data/"
pdf_file_name = "google-10k-sample-14pages.pdf"

# Extract images, tables, and chunk text from a PDF file.
raw_pdf_elements = partition_pdf(
    filename=pdf_file_name,
    extract_images_in_pdf=False,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=pdf_folder_path,
)

# Categorize extracted elements from a PDF into tables and texts.
tables = []
texts = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        tables.append(str(element))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        texts.append(str(element))

# Optional: Enforce a specific token size for texts
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=10000, chunk_overlap=0
)
joined_texts = " ".join(texts)
texts_4k_token = text_splitter.split_text(joined_texts)
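
Before generating summaries, it is worth a quick, optional check (not in the original flow) that the partitioning and splitting steps produced what you expect.

# Optional: inspect what partition_pdf and the text splitter produced
print(f"Text elements: {len(texts)}")
print(f"Tables: {len(tables)}")
print(f"Chunks after re-splitting: {len(texts_4k_token)}")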

  • Generate summaries of text elements.
  • The generate_text_summaries function uses Vertex AI's Gemini model to summarize the text and tables extracted from the PDF for later use in retrieval.

def generate_text_summaries(
    texts: list[str], tables: list[str], summarize_texts: bool = False
) -> tuple[list, list]:
    """
    Summarize text elements
    texts: List of str
    tables: List of str
    summarize_texts: Bool to summarize texts
    """

    # Prompt
    prompt_text = """You are an assistant tasked with summarizing tables and text for retrieval. \
    These summaries will be embedded and used to retrieve the raw text or table elements. \
    Give a concise summary of the table or text that is well optimized for retrieval. Table or text: {element} """
    prompt = PromptTemplate.from_template(prompt_text)
    empty_response = RunnableLambda(
        lambda x: AIMessage(content="Error processing document")
    )
    # Text summary chain
    model = VertexAI(
        temperature=0, model_name=MODEL_NAME, max_output_tokens=TOKEN_LIMIT
    ).with_fallbacks([empty_response])
    summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

    # Initialize empty summaries
    text_summaries = []
    table_summaries = []

    # Apply to text if texts are provided and summarization is requested
    if texts:
        if summarize_texts:
            text_summaries = summarize_chain.batch(texts, {"max_concurrency": 1})
        else:
            text_summaries = texts

    # Apply to tables if tables are provided
    if tables:
        table_summaries = summarize_chain.batch(tables, {"max_concurrency": 1})

    return text_summaries, table_summaries


# Get text, table summaries
text_summaries, table_summaries = generate_text_summaries(
    texts_4k_token, tables, summarize_texts=True
)
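
With the text and table summaries in place, the next step is to summarize the images. The helper functions below encode each extracted image as a base64 string and ask Gemini to describe it in a retrieval-friendly way.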
def encode_image(image_path: str) -> str:
    """Getting the base64 string"""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def image_summarize(model: ChatVertexAI, base64_image: str, prompt: str) -> str:
    """Make image summary"""
    msg = model.invoke(
        [
            HumanMessage(
                content=[
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{base64_image}"},
                    },
                ]
            )
        ]
    )
    return msg.content


def generate_img_summaries(path: str) -> tuple[list[str], list[str]]:
    """
    Generate summaries and base64 encoded strings for images
    path: Path to list of .jpg files extracted by Unstructured
    """

    # Store base64 encoded images
    img_base64_list = []

    # Store image summaries
    image_summaries = []

    # Prompt
    prompt = """You are an assistant tasked with summarizing images for retrieval. \
    These summaries will be embedded and used to retrieve the raw image. \
    Give a concise summary of the image that is well optimized for retrieval.
    If it's a table, extract all elements of the table.
    If it's a graph, explain the findings in the graph.
    Do not include any numbers that are not mentioned in the image.
    """

    model = ChatVertexAI(model_name=MODEL_NAME, max_output_tokens=TOKEN_LIMIT)

    # Apply to images
    for img_file in sorted(os.listdir(path)):
        if img_file.endswith(".png"):
            base64_image = encode_image(os.path.join(path, img_file))
            img_base64_list.append(base64_image)
            image_summaries.append(image_summarize(model, base64_image, prompt))

    return img_base64_list, image_summaries


# Image summaries
img_base64_list, image_summaries = generate_img_summaries(".")
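
As an optional check (not part of the original walkthrough), confirm that images were found and summarized before indexing them.

# Optional: verify that image summaries were generated
print(f"Summarized {len(img_base64_list)} images")
if image_summaries:
    print(image_summaries[0][:200])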

Step 7: Create and Deploy a Vertex AI Vector Search Index and Endpoint

# https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text-embeddings
DIMENSIONS = 768  # Dimension of the vectors produced by text-embedding-004

index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="mm_rag_langchain_index",
    dimensions=DIMENSIONS,
    approximate_neighbors_count=150,
    leaf_node_embedding_count=500,
    leaf_nodes_to_search_percent=7,
    description="Multimodal RAG LangChain Index",
    index_update_method="STREAM_UPDATE",
)
DEPLOYED_INDEX_ID = "mm_rag_langchain_index_endpoint"

index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=DEPLOYED_INDEX_ID,
    description="Multimodal RAG LangChain Index Endpoint",
    public_endpoint_enabled=True,
)

  • Deploy Index to Index Endpoint

index_endpoint = index_endpoint.deploy_index(
    index=index, deployed_index_id="mm_rag_langchain_deployed_index"
)
index_endpoint.deployed_indexes

Step 8: Create Retriever and Load Documents

# The vectorstore to use to index the summaries
vectorstore = VectorSearchVectorStore.from_components(
    project_id=PROJECT_ID,
    region=LOCATION,
    gcs_bucket_name=GCS_BUCKET,
    index_id=index.name,
    endpoint_id=index_endpoint.name,
    embedding=VertexAIEmbeddings(model_name=EMBEDDING_MODEL_NAME),
    stream_update=True,
)
docstore = InMemoryStore()

id_key = "doc_id"
# Create the multi-vector retriever
retriever_multi_vector_img = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    id_key=id_key,
)

  • Load data into the Document Store and Vector Store

# Raw Document Contents
doc_contents = texts + tables + img_base64_list

doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries + table_summaries + image_summaries)
]

retriever_multi_vector_img.docstore.mset(list(zip(doc_ids, doc_contents)))

# If using Vertex AI Vector Search, this will take a while to complete.
# You can cancel this cell and continue later.
retriever_multi_vector_img.vectorstore.add_documents(summary_docs)
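
Once the streaming update finishes, an optional similarity search directly against the vector store (an addition to the original steps) confirms that the index is serving results before you wire up the full chain.

# Optional: query the vector store directly to confirm retrieval works
hits = vectorstore.similarity_search("revenue growth", k=2)
for hit in hits:
    print(hit.page_content[:120])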

Step 9: Create Chain with Retriever and Gemini LLM

def looks_like_base64(sb):
    """Check if the string looks like base64"""
    return re.match("^[A-Za-z0-9+/]+[=]{0,2}$", sb) is not None


def is_image_data(b64data):
    """
    Check if the base64 data is an image by looking at the start of the data
    """
    image_signatures = {
        b"\xFF\xD8\xFF": "jpg",
        b"\x89\x50\x4E\x47\x0D\x0A\x1A\x0A": "png",
        b"\x47\x49\x46\x38": "gif",
        b"\x52\x49\x46\x46": "webp",
    }
    try:
        header = base64.b64decode(b64data)[:8]  # Decode and get the first 8 bytes
        for sig, format in image_signatures.items():
            if header.startswith(sig):
                return True
        return False
    except Exception:
        return False


def split_image_text_types(docs):
    """
    Split base64-encoded images and texts
    """
    b64_images = []
    texts = []
    for doc in docs:
        # Check if the document is of type Document and extract page_content if so
        if isinstance(doc, Document):
            doc = doc.page_content
        if looks_like_base64(doc) and is_image_data(doc):
            b64_images.append(doc)
        else:
            texts.append(doc)
    return {"images": b64_images, "texts": texts}


def img_prompt_func(data_dict):
    """
    Join the context into a single string
    """
    formatted_texts = "\n".join(data_dict["context"]["texts"])
    messages = [
        {
            "type": "text",
            "text": (
                "You are financial analyst tasking with providing investment advice.\n"
                "You will be given a mix of text, tables, and image(s) usually of charts or graphs.\n"
                "Use this information to provide investment advice related to the user's question. \n"
                f"User-provided question: {data_dict['question']}\n\n"
                "Text and / or tables:\n"
                f"{formatted_texts}"
            ),
        }
    ]

    # Adding image(s) to the messages if present
    if data_dict["context"]["images"]:
        for image in data_dict["context"]["images"]:
            messages.append(
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image}"},
                }
            )
    return [HumanMessage(content=messages)]


# Create RAG chain
chain_multimodal_rag = (
    {
        "context": retriever_multi_vector_img | RunnableLambda(split_image_text_types),
        "question": RunnablePassthrough(),
    }
    | RunnableLambda(img_prompt_func)
    | ChatVertexAI(
        temperature=0,
        model_name=MODEL_NAME,
        max_output_tokens=TOKEN_LIMIT,
    )  # Multi-modal LLM
    | StrOutputParser()
)

Step 10: Test the Model

1. Process User Query

query = "What are the EV / NTM and NTM rev growth for MongoDB, Cloudflare, and Datadog?
"

2. Get Retrieved documents

# List of source documents
docs = retriever_multi_vector_img.get_relevant_documents(query, limit=1)

# We get relevant docs
len(docs)

docs

3. Get generative response
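
The call below uses a plt_img_base64 helper to render a base64-encoded image inline. It is not defined earlier in this article, so here is a minimal version you can use; it simply wraps the string in an HTML img tag for notebook display. Note that docs[3] assumes the fourth retrieved document is one of the base64-encoded images.

from IPython.display import HTML, display


def plt_img_base64(img_base64: str) -> None:
    """Display a base64-encoded image inline in the notebook."""
    display(HTML(f'<img src="data:image/jpeg;base64,{img_base64}"/>'))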

plt_img_base64(docs[3])

Output: the retrieved chart of EV / NTM revenue multiples is displayed.

result = chain_multimodal_rag.invoke(query)

from IPython.display import Markdown as md
md(result)

Practical Applications

  1. Financial Analysis: Information from financial documents such as balance sheets, income statements, and cash flow reports can be extracted to assess a company’s performance and support informed decisions.
  2. Healthcare: Cross-referencing medical records with images such as X-rays helps doctors make accurate diagnoses by comparing a patient’s history with visual data.
  3. Education: In education, providing explanations alongside diagrams aids in visualizing complex concepts, making them easier to understand and enhancing retention for students.

Conclusion

Multimodal RAG (Retrieval-Augmented Generation) combines text and visual data to enhance information retrieval, enabling more contextually accurate and comprehensive AI responses. By leveraging tools like Gemini, Vertex AI, and LangChain, developers can build intelligent systems that efficiently process both textual and visual data.

Gemini enables understanding of diverse data types, while Vertex AI supports scalable model deployment for real-time applications. LangChain streamlines integration with external APIs and databases, allowing seamless interaction with multiple data sources. Together, these technologies provide powerful capabilities for creating context-aware, data-rich systems for use in areas like content generation, personalized recommendations, and interactive AI assistants.

Key Takeaways

  • Multimodal RAG combines text and visual data for more accurate, context-aware information retrieval.
  • Gemini helps process and understand both text and images, enhancing data richness.
  • Vertex AI offers tools for scalable, efficient AI model deployment, improving real-time performance.
  • LangChain simplifies the integration of language models with external data sources, enabling seamless data interaction.
  • These technologies enable the creation of intelligent systems that improve content generation, personalized recommendations, and interactive AI assistants.
  • The combination of these tools broadens the scope of AI applications, making them more versatile and accurate across diverse use cases.

Frequently Asked Questions

Q1. What is Multimodal RAG, and why is it important?

A. Multimodal RAG (Retrieval Augmented Generation) combines text and visual data to improve the accuracy and context of information retrieval, allowing AI systems to provide more comprehensive and relevant responses.

Q2. How does Gemini contribute to Multimodal RAG?

A. Gemini, by Google, is designed to process both text and visual data, enabling AI models to understand and generate insights from mixed data types, enhancing the overall performance of multimodal systems.

Q3. What is Vertex AI, and how does it support building intelligent systems?

A. Vertex AI is a platform from Google Cloud that provides tools for deploying and managing AI models at scale. It streamlines the process of building, training, and optimizing models, making it easier for developers to implement effective multimodal systems.

Q4. What is LangChain, and how does it enhance AI model integration?

A. LangChain is a framework that helps integrate large language models with external data sources, APIs, and databases. It enables seamless interaction with different types of data, enhancing the capabilities of multimodal RAG systems.

Q5. What are some practical applications of Multimodal RAG in real-world scenarios?

A. Multimodal RAG can be applied in areas like personalized recommendations, content generation, image-captioning, healthcare (cross-referencing X-rays with medical records), and AI assistants that provide context-aware responses.

Hello there! I'm Soumyadarshan Dash, a passionate and enthusiastic person when it comes to data science and machine learning. I'm constantly exploring new topics and techniques in this field, always striving to expand my knowledge and skills. In fact, upskilling myself is not just a hobby, but a way of life for me.
