In this tutorial, we explore how to set up and execute a sophisticated retrieval-augmented generation (RAG) pipeline in Google Colab. We leverage multiple state-of-the-art tools and libraries, including Gemma 3 for language and vision tasks, Docling for document conversion, LangChain for chain orchestration, and Milvus as our vector database, to build a multimodal system that understands and processes text, tables, and images. Let’s dive into each component and see how they work together.
Multimodal RAG (Retrieval-Augmented Generation) extends traditional text-based RAG systems by integrating multiple data modalities, in this case, text, tables, and images. This means that the pipeline not only processes and retrieves text but also leverages vision models to understand and describe image content, making the solution more comprehensive. This multimodal approach is particularly beneficial for documents like annual reports that often contain visual elements, such as charts and diagrams.
The aim of this project is to build a robust multimodal RAG pipeline that can ingest documents (like PDFs), process text and images, store document embeddings in a vector database, and answer queries by retrieving relevant information. This setup is particularly useful for applications such as analyzing annual reports, extracting financial statements, or summarizing technical papers. By integrating various libraries and tools, we combine the power of language models with document conversion and vector search to create a comprehensive end-to-end solution.
The pipeline uses several key libraries and tools:
- Colab-Xterm to add terminal access inside Colab
- Ollama to serve local models such as Gemma 3 and Llama 3.2
- Hugging Face Transformers and Pillow for model and image handling
- Docling to convert PDFs into structured text, tables, and images
- LangChain to orchestrate prompts and the retrieval chain
- Milvus as the vector database for document embeddings
Why build a multimodal RAG? This approach improves contextual understanding, accuracy, and relevance, especially in fields like healthcare, research, and media analysis. By leveraging cross-modal embeddings, hybrid retrieval strategies, and vision-language models, multimodal RAG systems can provide richer and more insightful responses. The key challenge lies in efficiently integrating and retrieving multimodal data while maintaining coherence and scalability. As AI progresses, developing optimized architectures and retrieval strategies will be crucial for unlocking the full potential of multimodal intelligence.
First, we install the colab-xterm extension to bring a terminal environment directly into Colab. This allows us to run system commands, install packages, and manage our session more flexibly.
!pip install colab-xterm # Install colab-xterm
%load_ext colabxterm # Load the xterm extension
%xterm # Launch an xterm terminal session in Colab
This terminal support is especially useful for installing additional dependencies or managing background processes.
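For example, the Ollama models pulled in the next step need a local Ollama server running in the background. One way to do that (a sketch, assuming a standard Linux Colab runtime) is to run the following inside the xterm session:
curl -fsSL https://ollama.com/install.sh | sh   # install Ollama via its official install script
ollama serve &                                  # start the Ollama server in the background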
We pull specific Ollama models into our environment using simple shell commands. For example:
!ollama pull gemma3:4b
!ollama pull llama3.2
!ollama list
These commands ensure that we have the necessary language and vision models available, such as the powerful Gemma 3 model, which is central to our multimodal processing.
The next step involves installing a host of packages required for our pipeline. This includes libraries for deep learning, text processing, and document handling:
! pip install transformers pillow langchain_community langchain_huggingface langchain_milvus docling langchain_ollama
By installing these packages, we prepare the environment for everything from document conversion to retrieval-augmented generation.
Setting up logging is crucial for monitoring pipeline operations:
import logging
logging.basicConfig(level=logging.INFO)
We also log in to Hugging Face using their CLI to access certain pre-trained models:
!huggingface-cli login
This authentication step is necessary for fetching model artifacts and ensuring smooth integration with Hugging Face’s ecosystem.
The pipeline leverages the Gemma 3 model for both vision and language tasks. For the language side, we set up the model and tokenizer:
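The exact wiring isn’t shown above, so here is a minimal sketch of one way to do it: Gemma 3 is served locally through Ollama via langchain_ollama, while a Hugging Face sentence-transformer (the all-MiniLM-L6-v2 choice is an assumption, swap in your preferred embedding model) provides both the embeddings and the tokenizer that the chunker reuses later.
from langchain_ollama import ChatOllama
from langchain_huggingface import HuggingFaceEmbeddings
from transformers import AutoTokenizer
# Language model: Gemma 3 pulled earlier and served locally by Ollama
model = ChatOllama(model="gemma3:4b", temperature=0)
# Embedding model (assumed choice) used to vectorize chunks for Milvus
embeddings_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings_model = HuggingFaceEmbeddings(model_name=embeddings_model_name)
# Matching tokenizer, reused by Docling's HybridChunker to size chunks
embeddings_tokenizer = AutoTokenizer.from_pretrained(embeddings_model_name)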
Reusing the same Gemma 3 model for both language generation and image understanding enables the system to generate textual descriptions from images, making the pipeline truly multimodal.
We employ Docling’s DocumentConverter to convert PDFs into structured documents. The conversion process involves extracting text, tables, and images from the source PDFs:
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
pdf_pipeline_options = PdfPipelineOptions(
    do_ocr=False,                  # skip OCR; we rely on the PDF's embedded text layer
    generate_picture_images=True,  # extract embedded images so they can be described later
)
format_options = { InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_pipeline_options) }
converter = DocumentConverter(format_options=format_options)
# Define the sources (URLs) of the documents to be converted.
# "https://arxiv.org/pdf/1706.03762"
sources = [
    "https://www.pwc.com/jm/en/research-publications/pdf/basic-understanding-of-a-companys-financials.pdf"
]
# Convert the PDF documents from the sources into an internal document format.
conversions = { source: converter.convert(source=source).document for source in sources }
We’ll be using PwC’s publicly available guide to understanding a company’s financials. I’ve included the PDF link above, and you’re welcome to add your own source links as well!
After conversion, we chunk the document into manageable pieces, separating text from tables and images. This segmentation allows each component to be processed independently:
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.types.doc import TableItem
from langchain_core.documents import Document
# Process text chunks (excluding pure table segments)
texts: list[Document] = []
for source, docling_document in conversions.items():
    for chunk in HybridChunker(tokenizer=embeddings_tokenizer).chunk(docling_document):
        # Skip chunks that consist of a single table; tables are handled separately
        items = chunk.meta.doc_items
        if len(items) == 1 and isinstance(items[0], TableItem):
            continue
        document = Document(
            page_content=chunk.text,
            metadata={"source": source, "ref": "reference details"},
        )
        texts.append(document)
This approach not only improves processing efficiency but also facilitates more precise vector storage and retrieval later.
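The table-only chunks we skipped above still need to reach the vector store. Below is a minimal sketch of one way to handle them, assuming Docling's export_to_markdown is available on each table item (the "ref": "table" label is illustrative): each table becomes its own document.
# Process tables separately, exporting each one as Markdown text
tables: list[Document] = []
for source, docling_document in conversions.items():
    for table in docling_document.tables:
        document = Document(
            page_content=table.export_to_markdown(docling_document),
            metadata={"source": source, "ref": "table"},
        )
        tables.append(document)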
Images from the documents are processed using Pillow. We convert images into base64-encoded strings that can be embedded directly into prompts:
import base64, io, PIL.Image, PIL.ImageOps
def encode_image(image: PIL.Image.Image, format: str = "png") -> str:
    image = PIL.ImageOps.exif_transpose(image) or image
    image = image.convert("RGB")
    buffer = io.BytesIO()
    image.save(buffer, format)
    encoding = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return f"data:image/{format};base64,{encoding}"
Subsequently, these images are fed into our vision model to generate descriptive text, enhancing the multimodal capabilities of our pipeline.
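How this step is wired isn’t shown above, so here is a sketch under a few assumptions: the same gemma3:4b model served by Ollama accepts images, Docling exposes each picture via picture.get_image(), and LangChain’s image_url content blocks carry the base64 data URI (the prompt wording is illustrative):
from langchain_core.messages import HumanMessage
from langchain_ollama import ChatOllama
# Vision model: the same Gemma 3 checkpoint, used here for image description
vision_model = ChatOllama(model="gemma3:4b", temperature=0)
pictures: list[Document] = []
for source, docling_document in conversions.items():
    for picture in docling_document.pictures:
        image = picture.get_image(docling_document)  # PIL image extracted by Docling
        if image is None:
            continue
        message = HumanMessage(content=[
            {"type": "text", "text": "Describe this image in a few sentences."},
            {"type": "image_url", "image_url": {"url": encode_image(image)}},
        ])
        description = vision_model.invoke([message]).content
        pictures.append(Document(page_content=description, metadata={"source": source, "ref": "image"}))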
To enable fast and accurate retrieval of document embeddings, we set up Milvus as our vector store:
import tempfile
from langchain_core.vectorstores import VectorStore
from langchain_milvus import Milvus
db_file = tempfile.NamedTemporaryFile(prefix="vectorstore_", suffix=".db", delete=False).name
vector_db: VectorStore = Milvus(
    embedding_function=embeddings_model,
    connection_args={"uri": db_file},
    auto_id=True,
    enable_dynamic_field=True,
    index_params={"index_type": "AUTOINDEX"},
)
Documents—whether text, tables, or image descriptions—are then added to the vector database, enabling fast and accurate similarity searches during query execution.
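Here is a short sketch of that indexing step, assuming the texts, tables, and pictures lists built earlier; it also stamps each document with a doc_id, which the document prompt defined below expects to find in the metadata:
# Combine all modalities and index them in Milvus
documents = texts + tables + pictures
for i, doc in enumerate(documents):
    doc.metadata["doc_id"] = i  # referenced by the document prompt later on
ids = vector_db.add_documents(documents)
print(f"Indexed {len(ids)} documents")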
Using LangChain’s prompt templates, we create custom prompts to feed context and queries into our language model:
from langchain.prompts import PromptTemplate
prompt = "{input} Given the context: {context}"
prompt_template = PromptTemplate.from_template(template=prompt)
Each retrieved document is wrapped using a document prompt template, ensuring that the model understands the structure of the input context.
We combine the prompt with the vector store to create a retrieval chain that first fetches relevant documents and then uses them to generate a coherent answer:
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
combine_docs_chain = create_stuff_documents_chain(
    llm=model,
    prompt=prompt_template,
    document_prompt=PromptTemplate.from_template(template="""\
Document {doc_id}
{page_content}"""),
    document_separator="\n\n",
)
rag_chain = create_retrieval_chain(
    retriever=vector_db.as_retriever(),
    combine_docs_chain=combine_docs_chain,
)
Queries are then executed against this chain, retrieving context and generating responses based on both the query and the stored document embeddings.
Once the RAG chain is established, you can run queries to retrieve relevant information from your document database. For example:
query = "Explain Three Key Financial Statements Notes"
outputs = rag_chain.invoke({"input": query})
Markdown(outputs['answer'])
query = "tell me the Contents of an annual report"
outputs = rag_chain.invoke({"input": query})
Markdown(outputs['answer'])
query = "what are the benefits of an annual report?"
outputs = rag_chain.invoke({"input": query})
Markdown(outputs['answer'])
The same process can be applied for various queries, such as explaining financial statement notes or summarizing an annual report, thereby demonstrating the versatility of the pipeline.
Here’s the full code: AV-multimodal-gemma3-rag
This pipeline has numerous applications:
- Analyzing annual reports and extracting key financial statements
- Summarizing research papers and technical documentation
- Answering questions over business documents that mix text, tables, and charts
In this tutorial, we demonstrated how to build a multimodal RAG with Gemma 3 in Google Colab. By integrating tools like Colab-Xterm, Ollama models (Gemma 3), Docling, LangChain, and Milvus, we created a system capable of processing text, tables, and images. This powerful setup not only enables effective document retrieval but also supports complex query answering and analysis in diverse applications. Whether you’re dealing with financial reports, research papers, or business intelligence tasks, this pipeline offers a versatile and scalable solution.
Happy coding, and enjoy exploring the possibilities of multimodal retrieval-augmented generation!