Mastering Multimodal RAG with Vertex AI & Gemini for Content and Images

Soumyadarshan Dash Last Updated : 24 Feb, 2025
10 min read

Retrieval Augmented Generation (RAG) has revolutionized how large language models access external data, but traditional approaches are limited to text. With the rise of multimodal data, integrating text and visual information is crucial for comprehensive analysis, especially in complex fields like finance and research. Multimodal RAG addresses this by enabling models to process both text and images for better knowledge retrieval and reasoning. This article explores building a multimodal RAG system using Google’s Gemini models, Vertex AI, and LangChain, guiding you through environment setup, data processing, embedding generation, and constructing a robust document search engine.

Learning Objectives

  • Understand the concept of Multimodal RAG and its significance in enhancing data retrieval.
  • Learn how Gemini can be used to process and integrate both text and visual data.
  • Explore the capabilities of Vertex AI in building scalable AI models for real-time applications.
  • Gain insight into how LangChain facilitates seamless integration of language models with external data sources.
  • Learn how to construct intelligent systems that use textual and visual information to produce precise, context-aware responses.
  • Know how to apply these technologies to use cases like content generation, personalized recommendations, and AI assistants.

Multimodal RAG Model: An Overview

Multimodal RAG models combine visual and textual information to deliver more robust and context-aware outputs. Unlike conventional RAG models, which rely solely on text, multimodal RAG systems are designed to ingest and incorporate visual content such as graphs, charts, and images. This dual-processing capability is particularly valuable for analyzing complex documents where visuals are as informative as the text, such as financial reports, scientific papers, or user manuals.

Figure: Multimodal Retrieval Augmented Generation (RAG) system architecture (Source: Author)

By processing both text and images, the model develops a deeper understanding of the content, leading to more accurate and insightful responses. This integration reduces the risk of generating misleading or contextually incorrect information (commonly known as hallucination in machine learning), resulting in more reliable outputs for decision-making and analysis.

Key Technologies Used

Here’s a summary of each key technology:

  1. Gemini by Google DeepMind: A robust generative AI suite designed for multimodal functions, capable of processing and creating text and images seamlessly.
  2. Vertex AI: A comprehensive platform for developing, deploying, and scaling machine learning models, known for its vector search feature for multimodal data retrieval.
  3. LangChain: A tool that streamlines the integration of large language models (LLMs) with various tools and data sources, supporting the connection between models, embeddings, and external resources.
  4. Retrieval-Augmented Generation (RAG) Framework: Combines retrieval-based and generation-based models to enhance response accuracy by pulling context from external sources before generating outputs, ideal for multimodal content handling.
  5. OpenAI’s DALL·E: An image-generation model that translates textual prompts into visual content, enhancing multimodal RAG outputs with tailored and contextually relevant imagery.
  6. Transformers for Multimodal Processing: The backbone architecture for handling mixed input types, enabling models to process and generate responses involving both text and visual data efficiently.

Model Architecture Explained

The architecture of a multimodal RAG system involves:

  • Gemini for Multimodal Processing: Handles both text and visual inputs, extracting detailed information.
  • Vertex AI Vector Search: Provides a vector store for embedding management, enabling seamless data retrieval.
  • LangChain MultiVectorRetriever: Acts as a mediator for retrieving relevant data from the vector store based on user queries.
  • RAG Framework Integration: Combines retrieved data with generative capabilities to create accurate, context-rich responses.
  • Multimodal Encoder-Decoder: Processes and fuses textual and visual content, ensuring both types of data contribute effectively to the output.
  • Transformers for Hybrid Data Handling: Uses attention mechanisms to align and integrate information from different modalities.
  • Fine-Tuning Pipelines: Customized training routines that adjust the model’s performance based on specific multimodal datasets for enhanced accuracy and context understanding.

Figure: Building a multimodal Retrieval Augmented Generation (RAG) system with Gemini and LangChain

Building a Multimodal RAG System with Vertex AI, Gemini, and LangChain

Now let’s get into the actual coding part. In this section, I will guide you through the steps of building a multimodal RAG system for content and images, using Google Gemini, Vertex AI, and LangChain.

Step 1: Setting Up Your Development Environment

 Let’s begin by setting up the environment.

1. Install necessary packages

The %pip install command installs all the necessary Python libraries, including google-cloud-aiplatform, langchain, and various document-processing libraries like pypdf.

%pip install -U -q google-cloud-aiplatform langchain-core langchain-google-vertexai langchain-text-splitters langchain-community "unstructured[all-docs]" pypdf pydantic lxml pillow matplotlib opencv-python tiktoken

2. Restart the runtime to make sure new packages are accessible

import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

3. Authenticate the notebook environment (Google Colab only)

Add the code to authenticate and initialize the Vertex AI environment. The auth.authenticate_user() function authenticates your Google Cloud account when running in Google Colab.

import sys

# Additional authentication is required for Google Colab
if "google.colab" in sys.modules:
    # Authenticate user to Google Cloud
    from google.colab import auth

    auth.authenticate_user()

Step 2: Define Google Cloud Project Information

  • PROJECT_ID and LOCATION: Define your Google Cloud project and location.
  • Vertex AI SDK Initialization: The aiplatform.init() function initializes the Vertex AI SDK with your project and bucket information.

PROJECT_ID = "YOUR_PROJECT_ID"  # @param {type:"string"}
LOCATION = "us-central1"  # @param {type:"string"}

# For Vector Search Staging
GCS_BUCKET = "YOUR_BUCKET_NAME"  # @param {type:"string"}
GCS_BUCKET_URI = f"gs://{GCS_BUCKET}"

Step 3: Initialize the Vertex AI SDK

from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=LOCATION, staging_bucket=GCS_BUCKET_URI)

Step 4: Import Necessary Libraries

Import the libraries needed for constructing the document repository and integrating LangChain. This includes LangChain core components, the Vertex AI integrations, IPython display utilities, and the unstructured PDF partitioner used in the retrieval and processing pipeline.

import base64
import os
import re
import uuid

from IPython.display import Image, Markdown, display
from langchain.prompts import PromptTemplate
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_core.documents import Document
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_google_vertexai import (
    ChatVertexAI,
    VectorSearchVectorStore,
    VertexAI,
    VertexAIEmbeddings,
)
from langchain_text_splitters import CharacterTextSplitter
from unstructured.partition.pdf import partition_pdf

# from langchain_community.vectorstores import Chroma  # Optional

Step 5: Define Model Information

MODEL_NAME = "gemini-1.5-flash"
GEMINI_OUTPUT_TOKEN_LIMIT = 8192

EMBEDDING_MODEL_NAME = "text-embedding-004"
EMBEDDING_TOKEN_LIMIT = 2048

TOKEN_LIMIT = min(GEMINI_OUTPUT_TOKEN_LIMIT, EMBEDDING_TOKEN_LIMIT)
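
Before moving on, it can help to run a quick, optional sanity check (an addition to the original walkthrough) to confirm that the model names above resolve in your project and region. The snippet below assumes the imports from Step 4 and the Vertex AI initialization from Step 3 have already run.

# Optional sanity check: confirm the generation and embedding models respond
llm_check = VertexAI(model_name=MODEL_NAME, max_output_tokens=256)
print(llm_check.invoke("Reply with the single word: ready"))

embedding_check = VertexAIEmbeddings(model_name=EMBEDDING_MODEL_NAME)
print(len(embedding_check.embed_query("hello world")))  # text-embedding-004 returns 768-dimensional vectors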

Step 6: Load the Data

1. Get documents and images from GCS

# Download documents and images used in this notebook
!gsutil -m rsync -r gs://github-repo/rag/intro_multimodal_rag/ .
print("Download completed")

2. Extract images, tables, and chunk text from a PDF file

  • Partitions a PDF into tables and text using partition_pdf from unstructured.

pdf_folder_path = "/content/data/" if "google.colab" in sys.modules else "data/"
pdf_file_name = "google-10k-sample-14pages.pdf"

# Extract images, tables, and chunk text from a PDF file.
raw_pdf_elements = partition_pdf(
    filename=pdf_file_name,
    extract_images_in_pdf=False,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=pdf_folder_path,
)

# Categorize extracted elements from a PDF into tables and texts.
tables = []
texts = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        tables.append(str(element))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        texts.append(str(element))

# Optional: Enforce a specific token size for texts
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=10000, chunk_overlap=0
)
joined_texts = " ".join(texts)
texts_4k_token = text_splitter.split_text(joined_texts)
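
Before generating summaries, it is worth a quick, optional check (not in the original flow) that the partitioning and splitting steps produced what you expect.

# Optional: inspect what partition_pdf and the text splitter produced
print(f"Text elements: {len(texts)}")
print(f"Tables: {len(tables)}")
print(f"Chunks after re-splitting: {len(texts_4k_token)}")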

  • Generate summaries of text elements.
  • The generate_text_summaries function uses Vertex AI's Gemini model to summarize the text and tables extracted from the PDF for later use in retrieval.

def generate_text_summaries(
    texts: list[str], tables: list[str], summarize_texts: bool = False
) -> tuple[list, list]:
    """
    Summarize text elements
    texts: List of str
    tables: List of str
    summarize_texts: Bool to summarize texts
    """

    # Prompt
    prompt_text = """You are an assistant tasked with summarizing tables and text for retrieval. \
    These summaries will be embedded and used to retrieve the raw text or table elements. \
    Give a concise summary of the table or text that is well optimized for retrieval. Table or text: {element} """
    prompt = PromptTemplate.from_template(prompt_text)
    empty_response = RunnableLambda(
        lambda x: AIMessage(content="Error processing document")
    )
    # Text summary chain
    model = VertexAI(
        temperature=0, model_name=MODEL_NAME, max_output_tokens=TOKEN_LIMIT
    ).with_fallbacks([empty_response])
    summarize_chain = {"element": lambda x: x} | prompt | model | StrOutputParser()

    # Initialize empty summaries
    text_summaries = []
    table_summaries = []

    # Apply to text if texts are provided and summarization is requested
    if texts:
        if summarize_texts:
            text_summaries = summarize_chain.batch(texts, {"max_concurrency": 1})
        else:
            text_summaries = texts

    # Apply to tables if tables are provided
    if tables:
        table_summaries = summarize_chain.batch(tables, {"max_concurrency": 1})

    return text_summaries, table_summaries


# Get text, table summaries
text_summaries, table_summaries = generate_text_summaries(
    texts_4k_token, tables, summarize_texts=True
)
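
With the text and table summaries in place, the next step is to summarize the images. The helper functions below encode each extracted image as a base64 string and ask Gemini to describe it in a retrieval-friendly way.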
def encode_image(image_path: str) -> str:
    """Getting the base64 string"""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")


def image_summarize(model: ChatVertexAI, base64_image: str, prompt: str) -> str:
    """Make image summary"""
    msg = model.invoke(
        [
            HumanMessage(
                content=[
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{base64_image}"},
                    },
                ]
            )
        ]
    )
    return msg.content


def generate_img_summaries(path: str) -> tuple[list[str], list[str]]:
    """
    Generate summaries and base64 encoded strings for images
    path: Path to list of .jpg files extracted by Unstructured
    """

    # Store base64 encoded images
    img_base64_list = []

    # Store image summaries
    image_summaries = []

    # Prompt
    prompt = """You are an assistant tasked with summarizing images for retrieval. \
    These summaries will be embedded and used to retrieve the raw image. \
    Give a concise summary of the image that is well optimized for retrieval.
    If it's a table, extract all elements of the table.
    If it's a graph, explain the findings in the graph.
    Do not include any numbers that are not mentioned in the image.
    """

    model = ChatVertexAI(model_name=MODEL_NAME, max_output_tokens=TOKEN_LIMIT)

    # Apply to images
    for img_file in sorted(os.listdir(path)):
        if img_file.endswith(".png"):
            base64_image = encode_image(os.path.join(path, img_file))
            img_base64_list.append(base64_image)
            image_summaries.append(image_summarize(model, base64_image, prompt))

    return img_base64_list, image_summaries


# Image summaries
img_base64_list, image_summaries = generate_img_summaries(".")
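
As an optional check (not part of the original walkthrough), confirm that images were found and summarized before indexing them.

# Optional: verify that image summaries were generated
print(f"Summarized {len(img_base64_list)} images")
if image_summaries:
    print(image_summaries[0][:200])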

Step 7: Create and Deploy a Vertex AI Vector Search Index and Endpoint

# https://cloud.google.com/vertex-ai/generative-ai/docs/model-reference/text-embeddings
DIMENSIONS = 768  # Dimension of the vectors produced by text-embedding-004

index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="mm_rag_langchain_index",
    dimensions=DIMENSIONS,
    approximate_neighbors_count=150,
    leaf_node_embedding_count=500,
    leaf_nodes_to_search_percent=7,
    description="Multimodal RAG LangChain Index",
    index_update_method="STREAM_UPDATE",
)
DEPLOYED_INDEX_ID = "mm_rag_langchain_index_endpoint"

index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name=DEPLOYED_INDEX_ID,
    description="Multimodal RAG LangChain Index Endpoint",
    public_endpoint_enabled=True,
)

  • Deploy Index to Index Endpoint

index_endpoint = index_endpoint.deploy_index(
    index=index, deployed_index_id="mm_rag_langchain_deployed_index"
)
index_endpoint.deployed_indexes

Step 8: Create Retriever and Load Documents

# The vectorstore to use to index the summaries
vectorstore = VectorSearchVectorStore.from_components(
    project_id=PROJECT_ID,
    region=LOCATION,
    gcs_bucket_name=GCS_BUCKET,
    index_id=index.name,
    endpoint_id=index_endpoint.name,
    embedding=VertexAIEmbeddings(model_name=EMBEDDING_MODEL_NAME),
    stream_update=True,
)
docstore = InMemoryStore()

id_key = "doc_id"
# Create the multi-vector retriever
retriever_multi_vector_img = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    id_key=id_key,
)

  • Load data into the Document Store and Vector Store

# Raw Document Contents
doc_contents = texts + tables + img_base64_list

doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(text_summaries + table_summaries + image_summaries)
]

retriever_multi_vector_img.docstore.mset(list(zip(doc_ids, doc_contents)))

# If using Vertex AI Vector Search, this will take a while to complete.
# You can cancel this cell and continue later.
retriever_multi_vector_img.vectorstore.add_documents(summary_docs)
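
Once the streaming update finishes, an optional similarity search directly against the vector store (an addition to the original steps) confirms that the index is serving results before you wire up the full chain.

# Optional: query the vector store directly to confirm retrieval works
hits = vectorstore.similarity_search("revenue growth", k=2)
for hit in hits:
    print(hit.page_content[:120])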

Step 9: Create Chain with Retriever and Gemini LLM

def looks_like_base64(sb):
    """Check if the string looks like base64"""
    return re.match("^[A-Za-z0-9+/]+[=]{0,2}$", sb) is not None


def is_image_data(b64data):
    """
    Check if the base64 data is an image by looking at the start of the data
    """
    image_signatures = {
        b"\xFF\xD8\xFF": "jpg",
        b"\x89\x50\x4E\x47\x0D\x0A\x1A\x0A": "png",
        b"\x47\x49\x46\x38": "gif",
        b"\x52\x49\x46\x46": "webp",
    }
    try:
        header = base64.b64decode(b64data)[:8]  # Decode and get the first 8 bytes
        for sig, format in image_signatures.items():
            if header.startswith(sig):
                return True
        return False
    except Exception:
        return False


def split_image_text_types(docs):
    """
    Split base64-encoded images and texts
    """
    b64_images = []
    texts = []
    for doc in docs:
        # Check if the document is of type Document and extract page_content if so
        if isinstance(doc, Document):
            doc = doc.page_content
        if looks_like_base64(doc) and is_image_data(doc):
            b64_images.append(doc)
        else:
            texts.append(doc)
    return {"images": b64_images, "texts": texts}


def img_prompt_func(data_dict):
    """
    Join the context into a single string
    """
    formatted_texts = "\n".join(data_dict["context"]["texts"])
    messages = [
        {
            "type": "text",
            "text": (
                "You are financial analyst tasking with providing investment advice.\n"
                "You will be given a mix of text, tables, and image(s) usually of charts or graphs.\n"
                "Use this information to provide investment advice related to the user's question. \n"
                f"User-provided question: {data_dict['question']}\n\n"
                "Text and / or tables:\n"
                f"{formatted_texts}"
            ),
        }
    ]

    # Adding image(s) to the messages if present
    if data_dict["context"]["images"]:
        for image in data_dict["context"]["images"]:
            messages.append(
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image}"},
                }
            )
    return [HumanMessage(content=messages)]


# Create RAG chain
chain_multimodal_rag = (
    {
        "context": retriever_multi_vector_img | RunnableLambda(split_image_text_types),
        "question": RunnablePassthrough(),
    }
    | RunnableLambda(img_prompt_func)
    | ChatVertexAI(
        temperature=0,
        model_name=MODEL_NAME,
        max_output_tokens=TOKEN_LIMIT,
    )  # Multi-modal LLM
    | StrOutputParser()
)

Step 10: Test the Model

1. Process User Query

query = "What are the EV / NTM and NTM rev growth for MongoDB, Cloudflare, and Datadog?
"

2. Get Retrieved documents

# List of source documents
docs = retriever_multi_vector_img.get_relevant_documents(query, limit=1)

# We get relevant docs
len(docs)

docs

3. Get generative response
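
The call below uses a plt_img_base64 helper to render a base64-encoded image inline. It is not defined earlier in this article, so here is a minimal version you can use; it simply wraps the string in an HTML img tag for notebook display. Note that docs[3] assumes the fourth retrieved document is one of the base64-encoded images.

from IPython.display import HTML, display


def plt_img_base64(img_base64: str) -> None:
    """Display a base64-encoded image inline in the notebook."""
    display(HTML(f'<img src="data:image/jpeg;base64,{img_base64}"/>'))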

plt_img_base64(docs[3])

Output: the retrieved chart of EV / NTM revenue multiples is displayed.

result = chain_multimodal_rag.invoke(query)

from IPython.display import Markdown as md
md(result)

Practical Applications

  1. Financial Analysis: Information from financial documents such as balance sheets, income statements, and cash flow reports can be extracted to assess a company’s performance and support informed decisions.
  2. Healthcare: Cross-referencing medical records with images such as X-rays helps doctors make accurate diagnoses by comparing a patient’s history with visual data.
  3. Education: In education, providing explanations alongside diagrams aids in visualizing complex concepts, making them easier to understand and enhancing retention for students.

Conclusion

Multimodal RAG (Retrieval-Augmented Generation) combines text and visual data to enhance information retrieval, enabling more contextually accurate and comprehensive AI responses. By leveraging tools like Gemini, Vertex AI, and LangChain, developers can build intelligent systems that efficiently process both textual and visual data.

Gemini enables understanding of diverse data types, while Vertex AI supports scalable model deployment for real-time applications. LangChain streamlines integration with external APIs and databases, allowing seamless interaction with multiple data sources. Together, these technologies provide powerful capabilities for creating context-aware, data-rich systems for use in areas like content generation, personalized recommendations, and interactive AI assistants.

Key Takeaways

  • Multimodal RAG combines text and visual data for more accurate, context-aware information retrieval.
  • Gemini helps process and understand both text and images, enhancing data richness.
  • Vertex AI offers tools for scalable, efficient AI model deployment, improving real-time performance.
  • LangChain simplifies the integration of language models with external data sources, enabling seamless data interaction.
  • These technologies enable the creation of intelligent systems that improve content generation, personalized recommendations, and interactive AI assistants.
  • The combination of these tools broadens the scope of AI applications, making them more versatile and accurate across diverse use cases.

Frequently Asked Questions

Q1. What is Multimodal RAG, and why is it important?

A. Multimodal RAG (Retrieval Augmented Generation) combines text and visual data to improve the accuracy and context of information retrieval, allowing AI systems to provide more comprehensive and relevant responses.

Q2. How does Gemini contribute to Multimodal RAG?

A. Gemini, by Google, is designed to process both text and visual data, enabling AI models to understand and generate insights from mixed data types, enhancing the overall performance of multimodal systems.

Q3. What is Vertex AI, and how does it support building intelligent systems?

A. Vertex AI is a platform from Google Cloud that provides tools for deploying and managing AI models at scale. It streamlines the process of building, training, and optimizing models, making it easier for developers to implement effective multimodal systems.

Q4. What is LangChain, and how does it enhance AI model integration?

A. LangChain is a framework that helps integrate large language models with external data sources, APIs, and databases. It enables seamless interaction with different types of data, enhancing the capabilities of multimodal RAG systems.

Q5. What are some practical applications of Multimodal RAG in real-world scenarios?

A. Multimodal RAG can be applied in areas like personalized recommendations, content generation, image-captioning, healthcare (cross-referencing X-rays with medical records), and AI assistants that provide context-aware responses.

Hello there! I'm Soumyadarshan Dash, a passionate and enthusiastic person when it comes to data science and machine learning. I'm constantly exploring new topics and techniques in this field, always striving to expand my knowledge and skills. In fact, upskilling myself is not just a hobby, but a way of life for me.
