Retrieval Augmented Generation (RAG) systems have become the de facto standard for building intelligent AI assistants that answer questions on custom enterprise data, without the hassle of expensive fine-tuning of large language models (LLMs). A key advantage of RAG systems is that you can easily integrate your own data to augment your LLM's intelligence and get more contextual answers to your questions. The key limitation, however, is that most RAG systems work well only on text data, while a lot of real-world data is multimodal in nature: a mixture of text, images, tables, and more. In this comprehensive hands-on guide, we will look at building a Multimodal RAG system that can handle mixed data formats using intelligent data transformations and multimodal LLMs.
A retrieval augmented generation (RAG) system architecture typically consists of two major steps:
In Step 1, Data Processing and Indexing, we focus on getting our custom enterprise data into a more consumable format. This involves loading the (typically text) content from the documents, splitting large text elements into smaller chunks, converting them into embeddings with an embedder model, and then storing the chunks and their embeddings in a vector database, as depicted in the following figure.
In Step 2, Retrieval and Generation, the workflow starts with the user asking a question. Text chunks similar to the input question are retrieved from the vector database, and then the question and the context document chunks are sent to an LLM to generate a human-like response, as depicted in the following figure.
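To make the two steps concrete, here is a minimal, illustrative sketch of the workflow in plain Python. The split_into_chunks(), embed(), and llm() helpers are hypothetical stand-ins for a real text splitter, embedder model, and LLM call, not part of any specific library.

import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors"""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Step 1: Data processing and indexing - chunk, embed, and store
chunks = split_into_chunks(raw_documents)            # hypothetical text splitter
index = [(chunk, embed(chunk)) for chunk in chunks]  # hypothetical embedder model

# Step 2: Retrieval and generation - embed the question, fetch top-k chunks, prompt the LLM
def answer(question, k=3):
    q_emb = embed(question)
    top_k = sorted(index, key=lambda pair: cosine(q_emb, pair[1]), reverse=True)[:k]
    context = "\n\n".join(chunk for chunk, _ in top_k)
    return llm(f"Use only this context to answer:\n{context}\n\nQuestion: {question}")  # hypothetical LLM call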
This two-step workflow is commonly used in the industry to build a traditional RAG system; however, it does have its own set of limitations, some of which we discuss below in detail.
Traditional RAG systems have several limitations, some of which are as follows:

- They work primarily on text: the data processing pipeline extracts only text content, so information locked in images, charts, and tables is ignored.
- Response quality is bounded by the quality and coverage of the text chunks stored in the vector database; if relevant context is never indexed, it can never be retrieved.
- The LLM in the pipeline reasons only over the retrieved text, so it cannot analyze visual elements even when they contain the answer.
In this article, we will focus particularly on one limitation of traditional RAG systems: their inability to work with multimodal content, compounded by traditional LLMs that can only reason over and analyze text data when generating responses. Before diving into multimodal RAG systems, let's first understand what multimodal data is.
Multimodal data is essentially data belonging to multiple modalities. The formal definition of modality comes from the context of Human-Computer Interaction (HCI) systems, where a modality is defined as a single independent channel of input or output between a computer and a human (more details on Wikipedia). Common computer-human modalities include the following:

- Vision: text, graphics, and images perceived through sight
- Audition: speech and other sounds
- Tactition: touch-based input such as keyboards and touchscreens, and haptic feedback
In short, multimodal data is essentially data that has a mixture of modalities or formats, as seen in the sample document below, with some of the distinct formats highlighted in various colors.
The key focus here is to build a RAG system that can handle documents with a mixture of data modalities, such as text, images, tables, and maybe even audio and video, depending on your data sources. This guide will focus on handling text, images, and tables. One of the key components needed to understand such data is multimodal large language models (LLMs).
Multimodal Large Language Models (LLMs) are essentially transformer-based LLMs that have been pre-trained and fine-tuned on multimodal data to analyze and understand various data formats, including text, images, tables, audio, and video. Ideally, a true multimodal model should be able not just to understand mixed data formats but also to generate them, as shown in the following workflow illustration of NExT-GPT, published in the paper NExT-GPT: Any-to-Any Multimodal Large Language Model.
From the NExT-GPT paper, any true multimodal model would typically have the following stages:

- Multimodal encoding stage: modality-specific encoders transform inputs such as images, audio, and video into representations the core language model can consume.
- LLM understanding and reasoning stage: a central LLM reasons over the encoded inputs and plans the response.
- Multimodal generation stage: modality-specific decoders generate image, audio, or video outputs in addition to text.
However, most current multimodal LLMs available for practical use are one-directional: they can understand mixed data formats but can only generate text responses. The most popular commercial multimodal models are as follows:

- GPT-4o (OpenAI)
- Gemini (Google)
- Claude (Anthropic)
You can also consider open or open-source multimodal LLMs if you want to build a completely open-source solution, or if you have concerns about data privacy or latency and prefer to host everything locally in-house. The most popular open and open-source multimodal models include the following:

- LLaVA-NeXT
- Pixtral 12B (Mistral AI)
For our Multimodal RAG System, we will leverage GPT-4o, one of the most powerful multimodal models currently available.
In this section, we will explore potential ways to build the architecture and workflow of a multimodal RAG system. The following figure illustrates potential approaches in detail and highlights the one we will use in this guide.
Multimodal RAG systems can be implemented in various ways; the above figure illustrates three possible workflows as recommended in the LangChain blog. These include:

- Option 1: Use multimodal embeddings (such as CLIP) to embed images and text together, retrieve the raw images and text chunks with similarity search, and pass them to a multimodal LLM for answer synthesis.
- Option 2: Use a multimodal LLM to produce text summaries from images, embed and retrieve those summaries as plain text, and pass the text to a regular LLM for answer synthesis.
- Option 3: Use a multimodal LLM to produce text summaries from images (and tables), embed and retrieve the summaries while keeping a reference to the raw elements, and pass the raw images and text chunks to a multimodal LLM for answer synthesis.
Option 1 is limited when the images are charts and visuals dense with fine-grained information, which is often the case in real documents. Multimodal embedding models often cannot encode granular details, like the numbers in these visuals, into a meaningful embedding. Option 2 is severely limited because images are not used at all at answer time, even if they contain valuable information, so it is not a truly multimodal RAG system.
Hence, we will proceed with Option 3 as our Multimodal RAG System workflow. In this workflow, we will create summaries out of our images, tables, and, optionally, our text chunks and use a multi-vector retriever, which can help in mapping and retrieving the original image, table, and text elements based on their corresponding summaries.
For our retrieval workflow, we will use a multi-vector retriever, as depicted in the following illustration and recommended in the LangChain blog. Its key purpose is to act as a wrapper that maps every text chunk, table, and image summary to the actual text chunk, table, and image element, which can then be obtained during retrieval.
The workflow illustrated above first uses a document parsing tool like Unstructured to extract the text, table, and image elements separately. Then we pass each extracted element to an LLM to generate a detailed text summary, as depicted above. Next, we store the summaries and their embeddings in a vector database using a popular embedder model such as OpenAI's embedding models. We also store the corresponding raw document element (text, table, image) for each summary in a document store, which can be any database platform; here we use Redis.
The multi-vector retriever links each summary and its embedding to the original document's raw element (text, table, or image) using a common document identifier (doc_id). When a user question comes in, the retriever first fetches the summaries most similar to the question in terms of semantic (embedding) similarity, and then, using the common doc_ids, returns the original text, table, and image elements, which are passed on to the RAG system's LLM as context to answer the question.
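Before looking at the full architecture, here is a minimal, illustrative sketch of this doc_id linkage, using plain dicts in place of a real vector database and document store; the summary and table strings below are made-up examples.

import uuid

summary = "Table of annual wildfires and acres burned, 2018-2022"  # what gets embedded
raw_table = "| Federal | 12.5 | 10.9 | 14.4 | 14.0 | 11.7 |"       # what gets returned

doc_id = str(uuid.uuid4())
vector_index = {doc_id: summary}  # in practice: the summary plus its embedding, in a vector DB
doc_store = {doc_id: raw_table}   # in practice: Redis, a file store, or another database

# At query time: similarity search happens over summaries, but the raw element is returned
retrieved_id = doc_id             # would come from embedding similarity against the question
context_for_llm = doc_store[retrieved_id]
print(context_for_llm)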
Now, let’s dive deep into the detailed system architecture of our multimodal RAG system. We will understand each component in this workflow and what happens step-by-step. The following illustration depicts this architecture in detail.
We will now discuss the key steps of the above-illustrated multimodal RAG system and how it will work. The workflow is as follows:

- Load the documents and use a parser like Unstructured to extract the text chunk, table, and image elements separately.
- Convert the extracted table elements from HTML into Markdown.
- Use a multimodal LLM (GPT-4o here) to generate a text summary of every text chunk, table, and image.
- Convert the summaries into embeddings with an embedder model and store them in a vector database, while storing the raw elements in a document store, linked by a common doc_id.
- At query time, retrieve the summaries most similar to the user question and fetch the corresponding raw elements via the multi-vector retriever.
- Send the user question along with the retrieved raw text, tables, and images to the multimodal LLM to generate the final response.
It’s not too complicated once you see all the components in place and structure the flow using the above steps! Let’s implement this system now in the next section.
We will now implement the Multimodal RAG system we have discussed so far using LangChain. We will load the raw text, table, and image elements from our documents into Redis, and the element summaries with their embeddings into our vector database (Chroma), connecting them with a multi-vector retriever. Connections to LLMs and prompting will be handled with LangChain. For our multimodal LLM, we will use GPT-4o; however, you are free to use any other multimodal LLM, including the open-source options mentioned earlier. It is recommended to use a powerful multimodal LLM that can effectively understand images, tables, and text to generate quality responses.
We start by installing the necessary dependencies: langchain and unstructured, along with openai, chroma, and utilities for data processing and for extracting tables and images.
!pip install langchain
!pip install langchain-openai
!pip install langchain-chroma
!pip install langchain-community
!pip install langchain-experimental
!pip install "unstructured[all-docs]"
!pip install htmltabletomd
# install OCR dependencies for unstructured
!sudo apt-get install tesseract-ocr
!sudo apt-get install poppler-utils
We download a report on wildfire statistics in the US from the Congressional Research Service Reports page, which provides open access to detailed reports. This document has a mixture of text, tables, and images, as shown in the Multimodal Data section above. We will build a simple chat-with-your-PDF application here using our multimodal RAG system, but you can easily extend it to multiple documents.
!wget https://sgp.fas.org/crs/misc/IF10244.pdf
OUTPUT
--2024-08-18 10:08:54-- https://sgp.fas.org/crs/misc/IF10244.pdf
Connecting to sgp.fas.org (sgp.fas.org)|18.172.170.73|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 435826 (426K) [application/pdf]
Saving to: ‘IF10244.pdf’
IF10244.pdf 100%[===================>] 425.61K 732KB/s in 0.6s
2024-08-18 10:08:55 (732 KB/s) - ‘IF10244.pdf’ saved [435826/435826]
We will now use the unstructured library, which provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and more. We will use it to extract and chunk text elements and extract tables and images separately using the following code snippet.
from langchain_community.document_loaders import UnstructuredPDFLoader

doc = './IF10244.pdf'
# takes ~1 min on Colab
loader = UnstructuredPDFLoader(file_path=doc,
                               strategy='hi_res',
                               extract_images_in_pdf=True,
                               infer_table_structure=True,
                               # section-based chunking
                               chunking_strategy="by_title",
                               max_characters=4000, # max size of chunks
                               new_after_n_chars=4000, # preferred size of chunks
                               # chunks smaller than 2000 chars are combined into a larger chunk
                               combine_text_under_n_chars=2000,
                               mode='elements',
                               image_output_dir_path='./figures')
data = loader.load()
len(data)
OUTPUT
7
This tells us that unstructured has successfully extracted seven elements from the document and downloaded the images separately into the `figures` directory. It used section-based chunking to chunk text elements based on the section headings in the document, with a chunk size of roughly 4000 characters. Document-intelligence deep learning models were also used to detect and extract tables and images separately. We can look at the types of the extracted elements using the following code.
[doc.metadata['category'] for doc in data]
OUTPUT
['CompositeElement',
'CompositeElement',
'Table',
'CompositeElement',
'CompositeElement',
'Table',
'CompositeElement']
This tells us we have some text chunks and tables in our extracted content. We can now explore and check out the contents of some of these elements.
# This is a text chunk element
data[0]
OUTPUT
Document(metadata={'source': './IF10244.pdf', 'filetype': 'application/pdf',
'languages': ['eng'], 'last_modified': '2024-04-10T01:27:48', 'page_number':
1, 'orig_elements': 'eJzF...eUyOAw==', 'file_directory': '.', 'filename':
'IF10244.pdf', 'category': 'CompositeElement', 'element_id':
'569945de4df264cac7ff7f2a5dbdc8ed'}, page_content='a. aa = Informing the
legislative debate since 1914 Congressional Research Service\n\nUpdated June
1, 2023\n\nWildfire Statistics\n\nWildfires are unplanned fires, including
lightning-caused fires, unauthorized human-caused fires, and escaped fires
from prescribed burn projects. States are responsible for responding to
wildfires that begin on nonfederal (state, local, and private) lands, except
for lands protected by federal agencies under cooperative agreements. The
federal government is responsible for responding to wildfires that begin on
federal lands. The Forest Service (FS)—within the U.S. Department of
Agriculture—carries out wildfire management ...... Over 40% of those acres
were in Alaska (3.1 million acres).\n\nAs of June 1, 2023, around 18,300
wildfires have impacted over 511,000 acres this year.')
The following snippet depicts one of the table elements extracted.
# This is a table element
data[2]
OUTPUT
Document(metadata={'source': './IF10244.pdf', 'last_modified': '2024-04-
10T01:27:48', 'text_as_html': '<table><thead><tr><th></th><th>2018</th>
<th>2019</th><th>2020</th><th>2021</th><th>2022</th></tr></thead><tbody><tr>
<td colspan="6">Number of Fires (thousands)</td></tr><tr><td>Federal</td>
<td>12.5</td><td>10.9</td><td>14.4</td><td>14.0</td><td>11.7</td></tr><tr>
<td>FS</td>......<td>50.5</td><td>59.0</td><td>59.0</td><td>69.0</td></tr>
<tr><td colspan="6">Acres Burned (millions)</td></tr><tr><td>Federal</td>
<td>4.6</td><td>3.1</td><td>7.1</td><td>5.2</td><td>40</td></tr><tr>
<td>FS</td>......<td>1.6</td><td>3.1</td><td>Lg</td><td>3.6</td></tr><tr>
<td>Total</td><td>8.8</td><td>4.7</td><td>10.1</td><td>7.1</td><td>7.6</td>
</tr></tbody></table>', 'filetype': 'application/pdf', 'languages': ['eng'],
'page_number': 1, 'orig_elements': 'eJylVW1.....AOBFljW', 'file_directory':
'.', 'filename': 'IF10244.pdf', 'category': 'Table', 'element_id':
'40059c193324ddf314ed76ac3fe2c52c'}, page_content='2018 2019 2020 Number of
Fires (thousands) Federal 12.5 10......Nonfederal 4.1 1.6 3.1 1.9 Total 8.8
4.7 10.1 7.1 <0.1 3.6 7.6')
We can see the text content extracted from the table using the following code snippet.
print(data[2].page_content)
OUTPUT
2018 2019 2020 Number of Fires (thousands) Federal 12.5 10.9 14.4 FS 5.6 5.3
6.7 DOI 7.0 5.3 7.6 2021 14.0 6.2 7.6 11.7 5.9 5.8 Other 0.1 0.2 <0.1 0.2
0.1 Nonfederal 45.6 39.6 44.6 45.0 57.2 Total 58.1 Acres Burned (millions)
Federal 4.6 FS 2.3 DOI 2.3 50.5 3.1 0.6 2.3 59.0 7.1 4.8 2.3 59.0 5.2 4.1
1.0 69.0 4.0 1.9 2.1 Other <0.1 <0.1 <0.1 <0.1 Nonfederal 4.1 1.6 3.1 1.9
Total 8.8 4.7 10.1 7.1 <0.1 3.6 7.6
While this could be fed to an LLM, the structure of the table is lost here, so we will instead focus on the HTML table content and apply some transformations to it.
data[2].metadata['text_as_html']
OUTPUT
<table><thead><tr><th></th><th>2018</th><th>2019</th><th>2020</th>
<th>2021</th><th>2022</th></tr></thead><tbody><tr><td colspan="6">Number of
Fires (thousands)</td></tr><tr><td>Federal</td><td>12.5</td><td>10.9</td>
<td>14.4</td><td>14.0</td><td>11.7</td>......<td>45.6</td><td>39.6</td>
<td>44.6</td><td>45.0</td><td>$7.2</td></tr><tr><td>Total</td><td>58.1</td>
<td>50.5</td><td>59.0</td><td>59.0</td><td>69.0</td></tr><tr><td
colspan="6">Acres Burned (millions)</td></tr><tr><td>Federal</td>
<td>4.6</td><td>3.1</td><td>7.1</td><td>5.2</td><td>40</td></tr><tr>
<td>FS</td><td>2.3</td>......<td>1.0</td><td>2.1</td></tr><tr><td>Other</td>
We can render this HTML table as follows to see what it looks like.
from IPython.display import display, Markdown

display(Markdown(data[2].metadata['text_as_html']))
OUTPUT
The extraction does a pretty good job of preserving the table structure, although some of the cell values are incorrect; a powerful LLM like GPT-4o can often compensate for this, as we will see later. One option here is to use a more powerful table-extraction model. Let's now look at converting this HTML table into Markdown. We could put the HTML text directly into prompts (LLMs understand HTML tables well), but Markdown tables work even better, as depicted below.
import htmltabletomd
md_table = htmltabletomd.convert_table(data[2].metadata['text_as_html'])
print(md_table)
OUTPUT
| | 2018 | 2019 | 2020 | 2021 | 2022 |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Number of Fires (thousands) |
| Federal | 12.5 | 10.9 | 14.4 | 14.0 | 11.7 |
| FS | 5.6 | 5.3 | 6.7 | 6.2 | 59 |
| Dol | 7.0 | 5.3 | 7.6 | 7.6 | 5.8 |
| Other | 0.1 | 0.2 | <0.1 | 0.2 | 0.1 |
| Nonfederal | 45.6 | 39.6 | 44.6 | 45.0 | $7.2 |
| Total | 58.1 | 50.5 | 59.0 | 59.0 | 69.0 |
| Acres Burned (millions) |
| Federal | 4.6 | 3.1 | 7.1 | 5.2 | 40 |
| FS | 2.3 | 0.6 | 48 | 41 | 19 |
| Dol | 2.3 | 2.3 | 2.3 | 1.0 | 2.1 |
| Other | <0.1 | <0.1 | <0.1 | <0.1 | <0.1 |
| Nonfederal | 4.1 | 1.6 | 3.1 | Lg | 3.6 |
| Total | 8.8 | 4.7 | 10.1 | 7.1 | 7.6 |
This looks great! Let’s now separate the text and table elements and convert all table elements from HTML to Markdown.
docs = []
tables = []

for doc in data:
    if doc.metadata['category'] == 'Table':
        tables.append(doc)
    elif doc.metadata['category'] == 'CompositeElement':
        docs.append(doc)

# convert all table elements from HTML to Markdown
for table in tables:
    table.page_content = htmltabletomd.convert_table(table.metadata['text_as_html'])

len(docs), len(tables)
OUTPUT
(5, 2)
We can also validate the tables extracted and converted into Markdown.
for table in tables:
    print(table.page_content)
    print()
OUTPUT
We can now view some of the extracted images from the document as shown below.
! ls -l ./figures
OUTPUT
total 144
-rw-r--r-- 1 root root 27929 Aug 18 10:10 figure-1-1.jpg
-rw-r--r-- 1 root root 27182 Aug 18 10:10 figure-1-2.jpg
-rw-r--r-- 1 root root 26589 Aug 18 10:10 figure-1-3.jpg
-rw-r--r-- 1 root root 26448 Aug 18 10:10 figure-2-4.jpg
-rw-r--r-- 1 root root 29260 Aug 18 10:10 figure-2-5.jpg
from IPython.display import Image
Image('./figures/figure-1-2.jpg')
OUTPUT
Image('./figures/figure-1-3.jpg')
OUTPUT
Everything looks to be in order: the images from the document, which are mostly charts and graphs, have been correctly extracted.
We enter our OpenAI key using the getpass() function so we don't accidentally expose the key in our code.
from getpass import getpass
OPENAI_KEY = getpass('Enter Open AI API Key: ')
Next, we set up some system environment variables that will be used later when authenticating our LLM.
import os
os.environ['OPENAI_API_KEY'] = OPENAI_KEY
Next, we create a connection to GPT-4o, the multimodal LLM we will use in our system.
from langchain_openai import ChatOpenAI
chatgpt = ChatOpenAI(model_name='gpt-4o', temperature=0)
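As an optional sanity check (not required for the pipeline), we can send a trivial prompt to confirm the connection and credentials work before proceeding.

# quick connectivity test; invoke() accepts a plain string prompt
print(chatgpt.invoke('Respond with the single word: ready').content)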
We will now build our multi-vector retriever: we will index the image, text chunk, and table element summaries, create their embeddings, and store them in the vector database; store the raw elements in a document store; and connect the two so that we can retrieve the raw image, text, and table elements for user queries.
We will use GPT-4o to produce the table and text summaries. Text summaries are advised when using large chunk sizes (e.g., the 4000-character chunks set above). The summaries are used to retrieve the raw tables and/or raw text chunks later via the multi-vector retriever. Creating summaries of the text elements is optional.
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Prompt
prompt_text = """
You are an assistant tasked with summarizing tables and text, particularly for semantic retrieval.
These summaries will be embedded and used to retrieve the raw text or table elements.
Give a detailed summary of the table or text below that is well optimized for retrieval.
For any table, also add a one-line description of what the table is about besides the summary.
Do not add additional words like Summary: etc.

Table or text chunk:
{element}
"""
prompt = ChatPromptTemplate.from_template(prompt_text)

# Summary chain: pass each element through the prompt, GPT-4o, and a string parser
summarize_chain = (
    {"element": RunnablePassthrough()}
    | prompt
    | chatgpt
    | StrOutputParser()  # extracts the response as text
)

text_docs = [doc.page_content for doc in docs]
table_docs = [table.page_content for table in tables]

text_summaries = summarize_chain.batch(text_docs, {"max_concurrency": 5})
table_summaries = summarize_chain.batch(table_docs, {"max_concurrency": 5})
The above snippet uses a LangChain chain to create a detailed summary of each text chunk and table; we can see the output for some of them below.
# Summary of a text chunk element
text_summaries[0]
OUTPUT
Wildfires include lightning-caused, unauthorized human-caused, and escaped
prescribed burns. States handle wildfires on nonfederal lands, while federal
agencies manage those on federal lands. The Forest Service oversees 193
million acres of the National Forest System, ...... In 2022, 68,988
wildfires burned 7.6 million acres, with over 40% of the acreage in Alaska.
As of June 1, 2023, 18,300 wildfires have burned over 511,000 acres.
# Summary of a table element
table_summaries[0]
OUTPUT
This table provides data on the number of fires and acres burned from 2018 to
2022, categorized by federal and nonfederal sources. \n\nNumber of Fires
(thousands):\n- Federal: Ranges from 10.9K to 14.4K, peaking in 2020.\n- FS
(Forest Service): Ranges from 5.3K to 6.7K, with an anomaly of 59K in
2022.\n- Dol (Department of the Interior): Ranges from 5.3K to 7.6K.\n-
Other: Consistently low, mostly around 0.1K.\n- ....... Other: Consistently
less than 0.1M.\n- Nonfederal: Ranges from 1.6M to 4.1M, with an anomaly of
"Lg" in 2021.\n- Total: Ranges from 4.7M to 10.1M.
This looks pretty good: the summaries are quite informative and should generate good embeddings for retrieval later on.
We will use GPT-4o to produce the image summaries as well. However, since images cannot be passed in directly, we will base64-encode them as strings and pass those to the model. We start by creating a few utility functions to encode images and to generate a summary for any input image by passing it to GPT-4o.
import base64
import os

from langchain_core.messages import HumanMessage

# create a function to encode images
def encode_image(image_path):
    """Return the base64 string for an image file"""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# create a function to summarize the image by passing a prompt to GPT-4o
def image_summarize(img_base64, prompt):
    """Make image summary"""
    chat = ChatOpenAI(model="gpt-4o", temperature=0)
    msg = chat.invoke(
        [
            HumanMessage(
                content=[
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img_base64}"},
                    },
                ]
            )
        ]
    )
    return msg.content
The above functions serve the following purpose:

- encode_image(image_path): opens an image file in binary mode and returns its base64-encoded string representation.
- image_summarize(img_base64, prompt): sends the text prompt along with the base64-encoded image to GPT-4o and returns the generated summary.
We now use the above utilities to summarize each of our images using the following function.
def generate_img_summaries(path):
    """
    Generate summaries and base64-encoded strings for images

    path: Path to the directory of .jpg files extracted by Unstructured
    """
    # Store base64 encoded images
    img_base64_list = []
    # Store image summaries
    image_summaries = []

    # Prompt
    prompt = """You are an assistant tasked with summarizing images for retrieval.
                Remember these images could potentially contain graphs, charts or tables also.
                These summaries will be embedded and used to retrieve the raw image for question answering.
                Give a detailed summary of the image that is well optimized for retrieval.
                Do not add additional words like Summary: etc.
             """

    # Apply to every image in the directory
    for img_file in sorted(os.listdir(path)):
        if img_file.endswith(".jpg"):
            img_path = os.path.join(path, img_file)
            base64_image = encode_image(img_path)
            img_base64_list.append(base64_image)
            image_summaries.append(image_summarize(base64_image, prompt))

    return img_base64_list, image_summaries
# Image summaries
IMG_PATH = './figures'
imgs_base64, image_summaries = generate_img_summaries(IMG_PATH)
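As a quick optional check, we can confirm that we got one summary per extracted image (five, based on the figures directory listing above).

# each base64-encoded image should have a corresponding summary
print(len(imgs_base64), len(image_summaries))
assert len(imgs_base64) == len(image_summaries)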
We can now look at one of the images and its summary just to get an idea of how GPT-4o has generated the image summaries.
# View one of the images
display(Image('./figures/figure-1-2.jpg'))
OUTPUT
# View the image summary generated by GPT-4o
image_summaries[1]
OUTPUT
Line graph showing the number of fires (in thousands) and the acres burned
(in millions) from 1993 to 2022. The left y-axis represents the number of
fires, peaking around 100,000 in the mid-1990s and fluctuating between
50,000 and 100,000 thereafter. The right y-axis represents acres burned,
with peaks reaching up to 10 million acres. The x-axis shows the years from
1993 to 2022. The graph uses a red line to depict the number of fires and a
grey shaded area to represent the acres burned.
Overall, the summary is quite descriptive, and we can embed these summaries into the vector database shortly.
We are now going to add the raw text, table, and image elements and their summaries to a multi-vector retriever using the following strategy:

- Store the raw text chunks, tables, and base64-encoded images in a document store (Redis here).
- Store the summaries of these elements, along with their embeddings, in a vector database (Chroma here).
- Link each summary to its raw element with a common doc_id, so that retrieving a summary returns the corresponding raw element.
The first step is to get the docstore ready. For this, we use the following code to download the open-source version of Redis and start a Redis server locally as a background process.
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update > /dev/null 2>&1
sudo apt-get install redis-stack-server > /dev/null 2>&1
redis-stack-server --daemonize yes
OUTPUT
deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg]
https://packages.redis.io/deb jammy main
Starting redis-stack-server, database path /var/lib/redis-stack
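Before moving on, we can optionally verify the server is reachable; this assumes the redis Python client is available in the environment (it is pulled in by the LangChain Redis integrations, but install it with pip if needed).

import redis

# ping the local Redis server started above; should print True
r = redis.Redis(host='localhost', port=6379)
print(r.ping())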
LangChain enables us to access OpenAI embedding models, which include the newest models: a smaller and highly efficient text-embedding-3-small model, and a larger and more powerful text-embedding-3-large model. We need an embedding model to convert our document chunks into embeddings before storing them in our vector database.
from langchain_openai import OpenAIEmbeddings
# details here: https://openai.com/blog/new-embedding-models-and-api-updates
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')
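As a small illustrative check, we can embed a sample string and inspect the embedding dimensionality; text-embedding-3-small produces 1536-dimensional vectors by default.

# embed a sample query and check the vector size
sample_vector = openai_embed_model.embed_query('wildfire statistics')
print(len(sample_vector))  # 1536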
We now create a function that connects our vectorstore and docstore and indexes the documents, summaries, and embeddings.
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_community.storage import RedisStore
from langchain_community.utilities.redis import get_client
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

def create_multi_vector_retriever(
    docstore, vectorstore, text_summaries, texts, table_summaries, tables,
    image_summaries, images
):
    """
    Create a retriever that indexes summaries but returns raw images or texts
    """
    id_key = "doc_id"

    # Create the multi-vector retriever
    retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=docstore,
        id_key=id_key,
    )

    # Helper function to add documents to the vectorstore and docstore
    def add_documents(retriever, doc_summaries, doc_contents):
        doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
        summary_docs = [
            Document(page_content=s, metadata={id_key: doc_ids[i]})
            for i, s in enumerate(doc_summaries)
        ]
        retriever.vectorstore.add_documents(summary_docs)
        retriever.docstore.mset(list(zip(doc_ids, doc_contents)))

    # Add texts, tables, and images
    # Check that text_summaries is not empty before adding
    if text_summaries:
        add_documents(retriever, text_summaries, texts)
    # Check that table_summaries is not empty before adding
    if table_summaries:
        add_documents(retriever, table_summaries, tables)
    # Check that image_summaries is not empty before adding
    if image_summaries:
        add_documents(retriever, image_summaries, images)

    return retriever
Following are the key components in the above function and their role:

- MultiVectorRetriever: the wrapper that ties the vectorstore (summaries and embeddings) to the docstore (raw elements) through the shared id_key of 'doc_id'.
- add_documents: a helper that generates a unique UUID for each element, adds the summaries (with the doc_id in their metadata) to the vectorstore, and stores the raw elements in the docstore under the same doc_ids via mset.
- The if checks ensure each element type (texts, tables, images) is indexed only when its summary list is non-empty.
We will now create our vectorstore using Chroma as the vector database so we can index summaries and their embeddings shortly.
# The vectorstore to use to index the summaries and their embeddings
chroma_db = Chroma(
    collection_name="mm_rag",
    embedding_function=openai_embed_model,
    collection_metadata={"hnsw:space": "cosine"},
)
We will now create our docstore using Redis as the database platform so we can index the actual document elements (the raw text chunks, tables, and images) shortly. Here we simply connect to the Redis server we started earlier.
# Initialize the storage layer - to store raw images, text and tables
client = get_client('redis://localhost:6379')
redis_store = RedisStore(client=client) # you can use filestore, memorystore, any other DB store also
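To see what the docstore interface looks like, here is a quick illustrative example; RedisStore is a simple byte-oriented key-value store, which the multi-vector retriever uses internally via mset() and mget(). The demo key below is made up.

# store and fetch a throwaway key-value pair (values are bytes)
redis_store.mset([('demo_key', b'demo_value')])
print(redis_store.mget(['demo_key']))  # [b'demo_value']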
We will now index our raw document elements, their summaries, and the embeddings in the docstore and vectorstore, and build the multi-vector retriever.
# Create retriever
retriever_multi_vector = create_multi_vector_retriever(
    redis_store, chroma_db,
    text_summaries, text_docs,
    table_summaries, table_docs,
    image_summaries, imgs_base64,
)
We will now test the retrieval side of our RAG pipeline to see if the multi-vector retriever returns the right text, table, and image elements for user queries. Before we try it, let's create a utility to visualize any retrieved images: we need to convert them back from their base64-encoded form into raw images to view them.
import base64
from io import BytesIO

from IPython.display import display
from PIL import Image

def plt_img_base64(img_base64):
    """Display a base64-encoded string as an image"""
    # Decode the base64 string
    img_data = base64.b64decode(img_base64)
    # Create a BytesIO object
    img_buffer = BytesIO(img_data)
    # Open the image using PIL and render it
    img = Image.open(img_buffer)
    display(img)
This function takes any base64-encoded string representation of an image, converts it back into an image, and displays it. Now let's test our retriever.
# Check retrieval
query = "Tell me about the annual wildfires trend with acres burned"
docs = retriever_multi_vector.invoke(query, limit=5)
# We get 3 relevant docs
len(docs)
OUTPUT
3
We can check out the documents retrieved as follows:
docs
[b'a. aa = Informing the legislative debate since 1914 Congressional Research
Service\n\nUpdated June 1, 2023\n\nWildfire Statistics\n\nWildfires are
unplanned fires, including lightning-caused fires, unauthorized human-caused
fires, and escaped fires from prescribed burn projects ...... and an average
of 7.2 million acres impacted annually. In 2022, 68,988 wildfires burned 7.6
million acres. Over 40% of those acres were in Alaska (3.1 million
acres).\n\nAs of June 1, 2023, around 18,300 wildfires have impacted over
511,000 acres this year.',
b'| | 2018 | 2019 | 2020 | 2021 | 2022 |\n| :--- | :--- | :--- | :--- | :--
- | :--- |\n| Number of Fires (thousands) |\n| Federal | 12.5 | 10.9 | 14.4 |
14.0 | 11.7 |\n| FS | 5.6 | 5.3 | 6.7 | 6.2 | 59 |\n| Dol | 7.0 | 5.3 | 7.6
| 7.6 | 5.8 |\n| Other | 0.1 | 0.2 | <0.1 | 0.2 | 0.1 |\n| Nonfederal |
45.6 | 39.6 | 44.6 | 45.0 | $7.2 |\n| Total | 58.1 | 50.5 | 59.0 | 59.0 |
69.0 |\n| Acres Burned (millions) |\n| Federal | 4.6 | 3.1 | 7.1 | 5.2 | 40
|\n| FS | 2.3 | 0.6 | 48 | 41 | 19 |\n| Dol | 2.3 | 2.3 | 2.3 | 1.0 | 2.1
|\n| Other | <0.1 | <0.1 | <0.1 | <0.1 | <0.1 |\n| Nonfederal
| 4.1 | 1.6 | 3.1 | Lg | 3.6 |\n| Total | 8.8 | 4.7 | 10.1 | 7.1 | 7.6 |\n',
b'/9j/4AAQSkZJRgABAQAAAQABAAD/......RXQv+gZB+RrYooAx/']
It is clear that the first retrieved element is a text chunk, the second retrieved element is a table and the last retrieved element is an image for our given query. We can also use the utility function from above to view the retrieved image.
# view retrieved image
plt_img_base64(docs[2])
OUTPUT
We can definitely see the right context being retrieved based on the user question. Let’s try one more and validate this again.
# Check retrieval
query = "Tell me about the percentage of residences burned by wildfires in 2022"
docs = retriever_multi_vector.invoke(query, limit=5)
# We get 2 docs
docs
OUTPUT
[b'Source: National Interagency Coordination Center (NICC) Wildland Fire
Summary and Statistics annual reports. Notes: FS = Forest Service; DOI =
Department of the Interior. Column totals may not sum precisely due to
rounding.\n\n2022\n\nYear Acres burned (millions) Number of Fires 2015 2020
2017 2006 2007\n\nSource: NICC Wildland Fire Summary and Statistics annual
reports. ...... and structures (residential, commercial, and other)
destroyed. For example, in 2022, over 2,700 structures were burned in
wildfires; the majority of the damage occurred in California (see Table 2).',
b'| | 2019 | 2020 | 2021 | 2022 |\n| :--- | :--- | :--- | :--- | :--- |\n|
Structures Burned | 963 | 17,904 | 5,972 | 2,717 |\n| % Residences | 46% |
54% | 60% | 46% |\n']
This definitely shows that our multi-vector retriever is working well and can retrieve multimodal context documents based on user queries! Next, we build a few utility functions to separate the retrieved base64-encoded images from the text and table elements.
import re
import base64

from langchain_core.documents import Document

# helps in detecting base64 encoded strings
def looks_like_base64(sb):
    """Check if the string looks like base64"""
    return re.match("^[A-Za-z0-9+/]+[=]{0,2}$", sb) is not None

# helps in checking if the base64 encoded data is actually an image
def is_image_data(b64data):
    """
    Check if the base64 data is an image by looking at the start of the data
    """
    image_signatures = {
        b"\xff\xd8\xff": "jpg",
        b"\x89\x50\x4e\x47\x0d\x0a\x1a\x0a": "png",
        b"\x47\x49\x46\x38": "gif",
        b"\x52\x49\x46\x46": "webp",
    }
    try:
        header = base64.b64decode(b64data)[:8]  # decode and get the first 8 bytes
        for sig, format in image_signatures.items():
            if header.startswith(sig):
                return True
        return False
    except Exception:
        return False

# returns a dictionary separating image and text (with table) elements
def split_image_text_types(docs):
    """
    Split base64-encoded images and texts (with tables)
    """
    b64_images = []
    texts = []
    for doc in docs:
        # Extract page_content if the document is of type Document
        if isinstance(doc, Document):
            doc = doc.page_content
        # Raw elements come back from the Redis docstore as bytes
        if isinstance(doc, bytes):
            doc = doc.decode('utf-8')
        if looks_like_base64(doc) and is_image_data(doc):
            b64_images.append(doc)
        else:
            texts.append(doc)
    return {"images": b64_images, "texts": texts}
These utility functions help us separate the text (and table) elements from the image elements in the retrieved context documents. Their functionality is explained in a bit more detail as follows:

- looks_like_base64 uses a regular expression to check whether a string is restricted to the base64 alphabet.
- is_image_data decodes the first few bytes of the base64 data and checks them against the magic-byte signatures of JPEG, PNG, GIF, and WebP files to confirm the data is actually an image.
- split_image_text_types loops through the retrieved documents, decodes raw bytes from the docstore into strings, and returns a dictionary with base64 images under 'images' and everything else under 'texts'.
We can quickly test this function on any retrieval output from our multi-vector retriever as shown below with an example.
# Check retrieval
query = "Tell me detailed statistics of the top 5 years with largest wildfire acres burned"
docs = retriever_multi_vector.invoke(query, limit=5)
r = split_image_text_types(docs)
r
OUTPUT
{'images': ['/9j/4AAQSkZJRgABAQAh......30aAPda8Kn/wCPiT/eP86PPl/56v8A99GpURSgJGTQB//Z'],
'texts': ['Figure 2. Top Five Years with Largest Wildfire Acreage Burned
Since 1960\n\nTable 1. Annual Wildfires and Acres Burned',
'Source: NICC Wildland Fire Summary and Statistics annual reports.\n\nConflagrations Of the 1.6 million wildfires that have occurred
since 2000, 254 exceeded 100,000 acres burned and 16 exceeded 500,000 acres
burned. A small fraction of wildfires become .......']}
Looks like our function is working perfectly and separating out the retrieved context elements as desired.
Now let's connect our multi-vector retriever and prompt instructions to build a multimodal RAG chain. To begin, we create a multimodal prompt function that takes the context texts, tables, and images and structures a proper prompt in the correct format for GPT-4o.
from operator import itemgetter

from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_core.messages import HumanMessage

def multimodal_prompt_function(data_dict):
    """
    Create a multimodal prompt with both text and image context.

    This function formats the provided context from `data_dict`, which contains
    text, tables, and base64-encoded images. It joins the text (with table) portions
    and prepares the image(s) in a base64-encoded format to be included in a message.

    The formatted text and images (context) along with the user question are used to
    construct a prompt for GPT-4o.
    """
    formatted_texts = "\n".join(data_dict["context"]["texts"])
    messages = []

    # Adding image(s) to the messages if present
    if data_dict["context"]["images"]:
        for image in data_dict["context"]["images"]:
            image_message = {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image}"},
            }
            messages.append(image_message)

    # Adding the text for analysis
    text_message = {
        "type": "text",
        "text": (
            f"""You are an analyst tasked with understanding detailed information
                and trends from text documents, data tables, and charts and graphs in images.
                You will be given context information below which will be a mix of
                text, tables, and images usually of charts or graphs.
                Use this information to provide answers related to the user question.
                Do not make up answers; use the provided context documents below and
                answer the question to the best of your ability.

                User question:
                {data_dict['question']}

                Context documents:
                {formatted_texts}

                Answer:
             """
        ),
    }
    messages.append(text_message)
    return [HumanMessage(content=messages)]
This function helps in structuring the prompt to be sent to GPT-4o as explained here:

- It joins all retrieved text and table strings from the context into a single formatted block.
- If the context contains base64-encoded images, each one is appended to the message list as an image_url entry in the data URI format GPT-4o expects.
- It then appends a text message containing the analyst instructions, the user question, and the formatted text context.
- Finally, it wraps everything in a single HumanMessage, which is the format LangChain chat models accept.
We now construct our multimodal RAG chain using the following code snippet.
# Create RAG chain
multimodal_rag = (
    {
        "context": itemgetter('context'),
        "question": itemgetter('input'),
    }
    | RunnableLambda(multimodal_prompt_function)
    | chatgpt
    | StrOutputParser()
)

# Pass the input query to the retriever and get the context document elements
retrieve_docs = (
    itemgetter('input')
    | retriever_multi_vector
    | RunnableLambda(split_image_text_types)
)

# Below, we chain `.assign` calls. This takes a dict and successively adds the
# keys "context" and "answer", where the value for each key is determined by a
# Runnable (a function or chain executing at runtime). This helps in having the
# retrieved context along with the answer generated by GPT-4o.
multimodal_rag_w_sources = (
    RunnablePassthrough.assign(context=retrieve_docs)
                       .assign(answer=multimodal_rag)
)
The chains created above work as follows:

- retrieve_docs extracts the user query from the 'input' key, sends it through the multi-vector retriever, and routes the retrieved elements through split_image_text_types to separate base64 images from text.
- multimodal_rag maps the retrieved context and the question into multimodal_prompt_function, sends the resulting prompt to GPT-4o, and parses the response into a string.
- multimodal_rag_w_sources uses RunnablePassthrough.assign to first attach the retrieved context under the 'context' key and then the generated answer under the 'answer' key, so the final output contains both the sources and the response.
Everything is set up and ready to go; let’s test out our multimodal RAG pipeline!
# Run multimodal RAG chain
query = "Tell me detailed statistics of the top 5 years with largest wildfire acres burned"
response = multimodal_rag_w_sources.invoke({'input': query})
response
OUTPUT
{'input': 'Tell me detailed statistics of the top 5 years with largest
wildfire acres burned',
'context': {'images': ['/9j/4AAQSkZJRgABAa.......30aAPda8Kn/wCPiT/eP86PPl/56v8A99GpURSgJGTQB//Z'],
'texts': ['Figure 2. Top Five Years with Largest Wildfire Acreage Burned
Since 1960\n\nTable 1. Annual Wildfires and Acres Burned',
'Source: NICC Wildland Fire Summary and Statistics annual
reports.\n\nConflagrations Of the 1.6 million wildfires that have occurred
since 2000, 254 exceeded 100,000 acres burned and 16 exceeded 500,000 acres
burned. A small fraction of wildfires become catastrophic, and a small
percentage of fires accounts for the vast majority of acres burned. For
example, about 1% of wildfires become conflagrations—raging, destructive
fires—but predicting which fires will “blow up” into conflagrations is
challenging and depends on a multitude of factors, such as weather and
geography. There have been 1,041 large or significant fires annually on
average from 2018 through 2022. In 2022, 2% of wildfires were classified as
large or significant (1,289); 45 exceeded 40,000 acres in size, and 17
exceeded 100,000 acres. For context, there were fewer large or significant
wildfires in 2021 (943)......']},
'answer': 'Based on the provided context and the image, here are the
detailed statistics for the top 5 years with the largest wildfire acres
burned:\n\n1. **2015**\n - **Acres burned:** 10.13 million\n - **Number
of fires:** 68.2 thousand\n\n2. **2020**\n - **Acres burned:** 10.12
million\n - **Number of fires:** 59.0 thousand\n\n3. **2017**\n -
**Acres burned:** 10.03 million\n - **Number of fires:** 71.5
thousand\n\n4. **2006**\n - **Acres burned:** 9.87 million\n - **Number
of fires:** 96.4 thousand\n\n5. **2007**\n - **Acres burned:** 9.33
million\n - **Number of fires:** 67.8 thousand\n\nThese statistics
highlight the years with the most significant wildfire activity in terms of
acreage burned, showing a trend of large-scale wildfires over the past few
decades.'}
Looks like we are able to get the answer as well as the source context documents used to answer the question! Let's now create a function to format these results and display them in a better way.
def multimodal_rag_qa(query):
    response = multimodal_rag_w_sources.invoke({'input': query})
    print('=='*50)
    print('Answer:')
    display(Markdown(response['answer']))
    print('--'*50)
    print('Sources:')
    text_sources = response['context']['texts']
    img_sources = response['context']['images']
    for text in text_sources:
        display(Markdown(text))
        print()
    for img in img_sources:
        plt_img_base64(img)
        print()
    print('=='*50)
This is a simple function that takes the dictionary output from our multimodal RAG pipeline and displays the results in a nice format. Time to put it to the test!
query = "Tell me detailed statistics of the top 5 years with largest wildfire acres burned"
multimodal_rag_qa(query)
OUTPUT
It does a pretty good job leveraging text and image context documents to answer the question correctly. Let’s try another one.
# Run RAG chain
query = "Tell me about the annual wildfires trend with acres burned"
multimodal_rag_qa(query)
OUTPUT
It does a pretty good job here analyzing tables, images and text context documents to answer the user question with a detailed report. Let’s look at one more example of a very specific query.
# Run RAG chain
query = "Tell me about the number of acres burned by wildfires for the forest service in 2021"
multimodal_rag_qa(query)
OUTPUT
Here you can clearly see that even though some table rows were wrongly extracted, notably the one needed to answer this question, GPT-4o is intelligent enough to look at the surrounding table elements and the retrieved text chunks to give the right answer of 4.1 million instead of 41 million. Of course, this may not always work, and that is where you might need to focus on improving your extraction pipeline.
If you are reading this, I commend your efforts in staying right till the end of this massive guide! We went through an in-depth review of the current challenges of traditional RAG systems, especially in handling multimodal data. We then discussed what multimodal data is, as well as multimodal large language models (LLMs). We covered at length a detailed system architecture and workflow for a multimodal RAG system with GPT-4o. Last but not least, we implemented this multimodal RAG system with LangChain and tested it on various scenarios. Do check out the Colab notebook for easy access to the code, and try improving this system by adding more capabilities, such as support for audio and video!
Q1. What is a Retrieval Augmented Generation (RAG) system?
Ans. A Retrieval Augmented Generation (RAG) system is an AI framework that combines data retrieval with language generation, enabling more contextual and accurate responses without the need for fine-tuning large language models (LLMs).
Q2. What are the limitations of traditional RAG systems?
Ans. Traditional RAG systems primarily handle text data, cannot process multimodal data (like images or tables), and are limited by the quality of the stored data in the vector database.
Q3. What is multimodal data?
Ans. Multimodal data consists of multiple types of data formats such as text, images, tables, audio, video, and more, allowing AI systems to process a combination of these modalities.
Q4. What is a multimodal LLM?
Ans. A multimodal Large Language Model (LLM) is an AI model capable of processing and understanding various data types (text, images, tables) to generate relevant responses or summaries.
Q5. What are some popular multimodal LLMs?
Ans. Some popular multimodal LLMs include GPT-4o (OpenAI), Gemini (Google), Claude (Anthropic), and open-source models like LLaVA-NeXT and Pixtral 12B.