In this tutorial, we explore how to set up and execute a sophisticated retrieval-augmented generation (RAG) pipeline in Google Colab. We leverage multiple state-of-the-art tools and libraries, including Gemma 3 for language and vision tasks, Docling for document conversion, LangChain for chain orchestration, and Milvus as our vector database, to build a multimodal system that understands and processes text, tables, and images. Let’s dive into each component and see how they work together.
Multimodal RAG (Retrieval-Augmented Generation) extends traditional text-based RAG systems by integrating multiple data modalities, in this case, text, tables, and images. This means that the pipeline not only processes and retrieves text but also leverages vision models to understand and describe image content, making the solution more comprehensive. This multimodal approach is particularly beneficial for documents like annual reports that often contain visual elements, such as charts and diagrams.
The aim of this project is to build a robust multimodal RAG pipeline that can ingest documents (like PDFs), process text and images, store document embeddings in a vector database, and answer queries by retrieving relevant information. This setup is particularly useful for applications such as analyzing annual reports, extracting financial statements, or summarizing technical papers. By integrating various libraries and tools, we combine the power of language models with document conversion and vector search to create a comprehensive end-to-end solution.
The pipeline uses several key libraries and tools:
- Colab-Xterm to add terminal access inside Colab
- Ollama to serve local models such as Gemma 3 and Llama 3.2
- Hugging Face Transformers and Pillow for model and image handling
- Docling to convert PDFs into structured text, tables, and images
- LangChain to orchestrate prompts and the retrieval chain
- Milvus as the vector database for document embeddings
Why build a multimodal RAG? This approach improves contextual understanding, accuracy, and relevance, especially in fields like healthcare, research, and media analysis. By leveraging cross-modal embeddings, hybrid retrieval strategies, and vision-language models, multimodal RAG systems can provide richer and more insightful responses. The key challenge lies in efficiently integrating and retrieving multimodal data while maintaining coherence and scalability. As AI progresses, developing optimized architectures and retrieval strategies will be crucial for unlocking the full potential of multimodal intelligence.
First, we install the colab-xterm extension to bring a terminal environment directly into Colab. This allows us to run system commands, install packages, and manage our session more flexibly.
!pip install colab-xterm # Install colab-xterm
%load_ext colabxterm # Load the xterm extension
%xterm # Launch an xterm terminal session in Colab
This terminal support is especially useful for installing additional dependencies or managing background processes.
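For example, the Ollama models pulled in the next step need a local Ollama server running in the background. One way to do that (a sketch, assuming a standard Linux Colab runtime) is to run the following inside the xterm session:
curl -fsSL https://ollama.com/install.sh | sh   # install Ollama via its official install script
ollama serve &                                  # start the Ollama server in the background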
We pull specific Ollama models into our environment using simple shell commands. For example:
!ollama pull gemma3:4b
!ollama pull llama3.2
!ollama list
These commands ensure that we have the necessary language and vision models available, such as the powerful Gemma 3 model, which is central to our multimodal processing.
The next step involves installing a host of packages required for our pipeline. This includes libraries for deep learning, text processing, and document handling:
! pip install transformers pillow langchain_community langchain_huggingface langchain_milvus docling langchain_ollama
By installing these packages, we prepare the environment for everything from document conversion to retrieval-augmented generation.
Setting up logging is crucial for monitoring pipeline operations:
import logging
logging.basicConfig(level=logging.INFO)
We also log in to Hugging Face using their CLI to access certain pre-trained models:
!huggingface-cli login
This authentication step is necessary for fetching model artifacts and ensuring smooth integration with Hugging Face’s ecosystem.
The pipeline leverages the Gemma 3 model for both vision and language tasks. For the language side, we set up the model and tokenizer:
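The exact wiring isn’t shown above, so here is a minimal sketch of one way to do it: Gemma 3 is served locally through Ollama via langchain_ollama, while a Hugging Face sentence-transformer (the all-MiniLM-L6-v2 choice is an assumption, swap in your preferred embedding model) provides both the embeddings and the tokenizer that the chunker reuses later.
from langchain_ollama import ChatOllama
from langchain_huggingface import HuggingFaceEmbeddings
from transformers import AutoTokenizer
# Language model: Gemma 3 pulled earlier and served locally by Ollama
model = ChatOllama(model="gemma3:4b", temperature=0)
# Embedding model (assumed choice) used to vectorize chunks for Milvus
embeddings_model_name = "sentence-transformers/all-MiniLM-L6-v2"
embeddings_model = HuggingFaceEmbeddings(model_name=embeddings_model_name)
# Matching tokenizer, reused by Docling's HybridChunker to size chunks
embeddings_tokenizer = AutoTokenizer.from_pretrained(embeddings_model_name)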
Reusing the same Gemma 3 model for both language generation and image understanding enables the system to generate textual descriptions from images, making the pipeline truly multimodal.
We employ Docling’s DocumentConverter to convert PDFs into structured documents. The conversion process involves extracting text, tables, and images from the source PDFs:
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
pdf_pipeline_options = PdfPipelineOptions(
    do_ocr=False,                  # skip OCR; we rely on the PDF's embedded text layer
    generate_picture_images=True,  # extract embedded images so they can be described later
)
format_options = { InputFormat.PDF: PdfFormatOption(pipeline_options=pdf_pipeline_options) }
converter = DocumentConverter(format_options=format_options)
# Define the sources (URLs) of the documents to be converted.
# "https://arxiv.org/pdf/1706.03762"
sources = [
    "https://www.pwc.com/jm/en/research-publications/pdf/basic-understanding-of-a-companys-financials.pdf"
]
# Convert the PDF documents from the sources into an internal document format.
conversions = { source: converter.convert(source=source).document for source in sources }
We’ll be using PwC’s publicly available guide to understanding a company’s financials. I’ve included the PDF link above, and you’re welcome to add your own source links as well!
After conversion, we chunk the document into manageable pieces, separating text from tables and images. This segmentation allows each component to be processed independently:
from docling_core.transforms.chunker.hybrid_chunker import HybridChunker
from docling_core.types.doc import TableItem
from langchain_core.documents import Document
# Process text chunks (excluding pure table segments)
texts: list[Document] = []
for source, docling_document in conversions.items():
    for chunk in HybridChunker(tokenizer=embeddings_tokenizer).chunk(docling_document):
        # Skip chunks that consist of a single table; tables are handled separately
        items = chunk.meta.doc_items
        if len(items) == 1 and isinstance(items[0], TableItem):
            continue
        document = Document(
            page_content=chunk.text,
            metadata={"source": source, "ref": "reference details"},
        )
        texts.append(document)
This approach not only improves processing efficiency but also facilitates more precise vector storage and retrieval later.
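The table-only chunks we skipped above still need to reach the vector store. Below is a minimal sketch of one way to handle them, assuming Docling's export_to_markdown is available on each table item (the "ref": "table" label is illustrative): each table becomes its own document.
# Process tables separately, exporting each one as Markdown text
tables: list[Document] = []
for source, docling_document in conversions.items():
    for table in docling_document.tables:
        document = Document(
            page_content=table.export_to_markdown(docling_document),
            metadata={"source": source, "ref": "table"},
        )
        tables.append(document)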
Images from the documents are processed using Pillow. We convert images into base64-encoded strings that can be embedded directly into prompts:
import base64, io, PIL.Image, PIL.ImageOps
def encode_image(image: PIL.Image.Image, format: str = "png") -> str:
    image = PIL.ImageOps.exif_transpose(image) or image
    image = image.convert("RGB")
    buffer = io.BytesIO()
    image.save(buffer, format)
    encoding = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return f"data:image/{format};base64,{encoding}"
Subsequently, these images are fed into our vision model to generate descriptive text, enhancing the multimodal capabilities of our pipeline.
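How this step is wired isn’t shown above, so here is a sketch under a few assumptions: the same gemma3:4b model served by Ollama accepts images, Docling exposes each picture via picture.get_image(), and LangChain’s image_url content blocks carry the base64 data URI (the prompt wording is illustrative):
from langchain_core.messages import HumanMessage
from langchain_ollama import ChatOllama
# Vision model: the same Gemma 3 checkpoint, used here for image description
vision_model = ChatOllama(model="gemma3:4b", temperature=0)
pictures: list[Document] = []
for source, docling_document in conversions.items():
    for picture in docling_document.pictures:
        image = picture.get_image(docling_document)  # PIL image extracted by Docling
        if image is None:
            continue
        message = HumanMessage(content=[
            {"type": "text", "text": "Describe this image in a few sentences."},
            {"type": "image_url", "image_url": {"url": encode_image(image)}},
        ])
        description = vision_model.invoke([message]).content
        pictures.append(Document(page_content=description, metadata={"source": source, "ref": "image"}))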
To enable fast and accurate retrieval of document embeddings, we set up Milvus as our vector store:
import tempfile
from langchain_core.vectorstores import VectorStore
from langchain_milvus import Milvus
db_file = tempfile.NamedTemporaryFile(prefix="vectorstore_", suffix=".db", delete=False).name
vector_db: VectorStore = Milvus(
    embedding_function=embeddings_model,
    connection_args={"uri": db_file},
    auto_id=True,
    enable_dynamic_field=True,
    index_params={"index_type": "AUTOINDEX"},
)
Documents—whether text, tables, or image descriptions—are then added to the vector database, enabling fast and accurate similarity searches during query execution.
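Here is a short sketch of that indexing step, assuming the texts, tables, and pictures lists built earlier; it also stamps each document with a doc_id, which the document prompt defined below expects to find in the metadata:
# Combine all modalities and index them in Milvus
documents = texts + tables + pictures
for i, doc in enumerate(documents):
    doc.metadata["doc_id"] = i  # referenced by the document prompt later on
ids = vector_db.add_documents(documents)
print(f"Indexed {len(ids)} documents")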
Using LangChain’s prompt templates, we create custom prompts to feed context and queries into our language model:
from langchain.prompts import PromptTemplate
prompt = "{input} Given the context: {context}"
prompt_template = PromptTemplate.from_template(template=prompt)
Each retrieved document is wrapped using a document prompt template, ensuring that the model understands the structure of the input context.
We combine the prompt with the vector store to create a retrieval chain that first fetches relevant documents and then uses them to generate a coherent answer:
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
combine_docs_chain = create_stuff_documents_chain(
    llm=model,
    prompt=prompt_template,
    document_prompt=PromptTemplate.from_template(template="""\
Document {doc_id}
{page_content}"""),
    document_separator="\n\n",
)
rag_chain = create_retrieval_chain(
    retriever=vector_db.as_retriever(),
    combine_docs_chain=combine_docs_chain,
)
Queries are then executed against this chain, retrieving context and generating responses based on both the query and the stored document embeddings.
Once the RAG chain is established, you can run queries to retrieve relevant information from your document database. For example:
query = "Explain Three Key Financial Statements Notes"
outputs = rag_chain.invoke({"input": query})
Markdown(outputs['answer'])
query = "tell me the Contents of an annual report"
outputs = rag_chain.invoke({"input": query})
Markdown(outputs['answer'])
query = "what are the benefits of an annual report?"
outputs = rag_chain.invoke({"input": query})
Markdown(outputs['answer'])
The same process can be applied for various queries, such as explaining financial statement notes or summarizing an annual report, thereby demonstrating the versatility of the pipeline.
Here’s the full code: AV-multimodal-gemma3-rag
This pipeline has numerous applications:
- Analyzing annual reports and extracting key financial statements
- Summarizing research papers and technical documentation
- Answering questions over business documents that mix text, tables, and charts
In this tutorial, we demonstrated how to build a multimodal RAG with Gemma 3 in Google Colab. By integrating tools like Colab-Xterm, Ollama models (Gemma 3), Docling, LangChain, and Milvus, we created a system capable of processing text, tables, and images. This powerful setup not only enables effective document retrieval but also supports complex query answering and analysis in diverse applications. Whether you’re dealing with financial reports, research papers, or business intelligence tasks, this pipeline offers a versatile and scalable solution.
Happy coding, and enjoy exploring the possibilities of multimodal retrieval-augmented generation!