Empowering AI with Senses: A Journey into Multimodal LLMs Part 1

Ram Singh Last Updated: 27 Jan, 2025
9 min read

The human mind naturally perceives language, vision, smell, and touch, enabling us to understand our surroundings. We are particularly inclined toward linguistic thought and visual memory. As GenAI models continue to grow, researchers are now extending their capabilities by incorporating multimodality. Traditional Large Language Models (LLMs) accept only text as input and produce only text as output, which means they do not process or generate data from other modalities such as images, videos, or audio. LLMs have excelled at tasks such as question answering, text summarization, translation, information retrieval, code generation, and reasoning. However, integrating other modalities with LLMs (multimodal LLMs) expands the potential of GenAI models. For instance, training a model on a combination of text and images enables it to solve problems such as visual question answering, image segmentation, and object detection. Likewise, we can add video to the same model for more advanced media-related analysis.

Introduction to Multimodal LLMs

Generative AI is a subfield of machine learning focused on generating new content. Feeding text into a model and generating new text in return is known as text-to-text generation. By extending LLMs with other modalities, however, we open the door to a wide range of use cases such as text-to-image, text-to-video, text-to-speech, image-to-image, and image-to-video. We call such models large multimodal models, or multimodal LLMs. These models are trained on large datasets containing text and other modalities so that the algorithms can learn the relationships among all the input types. Intuitively, they are not limited to a single input or output type; they can be adapted to handle inputs from any modality and generate output accordingly. In this way, multimodal LLMs can be seen as giving the system the ability to process and understand different types of sensory input.

This blog is split into two parts: in the first, I will explore the applications of multimodal LLMs and their common architectures, while in the second, I will train a small vision model.

Datasets

While combining different input types to create multimodal LLMs may appear straightforward, it becomes more complex when 1D, 2D, and 3D data must be processed together. It is a multi-step problem that needs to be solved sequentially, and the data must be carefully curated to enhance the problem-solving capabilities of such models.

For now, we will limit our discussion to text and images. Unlike text, images and videos come in varying sizes and resolutions, so a robust pre-processing technique is needed to standardize all inputs into a single framework. Furthermore, inputs like images, videos, prompts, and metadata should be prepared in a way that helps models build coherent thought processes and maintain logical consistency during inference. Models trained with text, image, and video data are called Large Vision-Language Models (LVLMs).
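
As a simple illustration, here is a minimal preprocessing sketch (assuming PIL and torchvision are installed, with standard ImageNet normalization as a placeholder choice) that turns images of arbitrary resolution into fixed-size tensors a vision encoder can consume:

from PIL import Image
import torchvision.transforms as T

# Illustrative pipeline: every image, regardless of its original resolution,
# is resized, centre-cropped, and normalized to a fixed-shape tensor.
preprocess = T.Compose([
    T.Resize(256),                              # scale the shorter side to 256 px
    T.CenterCrop(224),                          # crop a fixed 224x224 patch
    T.ToTensor(),                               # convert to a [3, 224, 224] float tensor
    T.Normalize(mean=[0.485, 0.456, 0.406],     # ImageNet statistics, a common default
                std=[0.229, 0.224, 0.225]),
])

image = Image.open("quantum.jpg").convert("RGB")
pixel_values = preprocess(image)                # shape: [3, 224, 224]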

Application of Multimodal LLMs

The following image is taken from the Qwen2-VL paper, where researchers trained a vision model based on the Qwen2 LLM that can solve multiple visual use cases.

Source: Qwen2-VL

The figure below demonstrates how a Multimodal Language Model (MMLM) processes different types of input data (image, text, audio, video) to achieve various objectives. The core component of the diagram, the MMLM, integrates all of these modalities and processes them in combination.

A generic understanding of the input and output flow of MMLMs.

Let’s proceed further and understand the different applications of vision models. The complete code used in this blog is available on GitHub.
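
All the snippets below assume an `llm` object, i.e., a LangChain chat model that accepts image content blocks. The exact provider used in the repository is not shown here, so the following setup is only a sketch; any vision-capable chat model wired through LangChain should behave the same way:

import base64
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI   # assumption: any vision-capable chat model works

# Hypothetical setup: the examples only require that `llm` understands
# messages containing both text and image_url content blocks.
llm = ChatOpenAI(model="gpt-4o", temperature=0)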

1. Image Captioning

Image captioning is the task of describing the content of an image in words. People use this capability to generate image descriptions and to come up with engaging captions and relevant hashtags for their social media posts to improve visibility.

# Read the image and encode it as base64 so it can be embedded in the message
image_path = "quantum.jpg"
with open(image_path, 'rb') as image_file:
    image_data = image_file.read()

image_data = base64.b64encode(image_data).decode("utf-8")

prompt = """Explain this image"""

# Build a multimodal message containing both the text prompt and the image
message = HumanMessage(
    content=[
        {"type": "text", "text": prompt},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
        },
    ],
)
response = llm.invoke([message])
print(response.content)

2. Information Extraction

Information extraction is another application of vision models, where we expect the model to retrieve features or data points from images. For example, we can ask the model to identify an object's colour, text, or other attributes. Contemporary models use function calling or JSON-parsing techniques to extract structured data points from images.

import base64
import json

from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

# Schema describing the structured data points we want the model to return
class Retrieval(BaseModel):
    Description: str = Field(description="Describe the image")
    Machine: str = Field(description="Explain what the machine is about")
    Color: str = Field(description="What colors are used in the image")
    People: str = Field(description="Count how many men and women are standing there")

parser = PydanticOutputParser(pydantic_object=Retrieval)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract the requested details as per the given details.\n'{struct_format}'\n"),
    ("human", [
        {
            "type": "image_url",
            "image_url": {"url": "data:image/jpeg;base64,{image_data}"},
        },
    ]),
])

chain = prompt | llm | parser

image_path = "quantum.jpg"
with open(image_path, 'rb') as image_file:
    image_data = image_file.read()
    
image_data = base64.b64encode(image_data).decode("utf-8")


response = chain.invoke({
    "struct_format": parser.get_format_instructions(),
    "image_data": image_data
})

data = json.loads(response.model_dump_json())

for k,v in data.items():
    print(f"{k}: {v}")

3. Visual Interpretation & Reasoning

This use case asks a vision model to analyze an image and perform reasoning over it. For example, the model can interpret the underlying information in images, diagrams, and graphical representations, build a step-by-step analysis, and draw conclusions.
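
The repository does not include a dedicated snippet for this use case, but a minimal sketch, assuming the same `llm` object and a hypothetical chart image named sales_chart.png, looks almost identical to the captioning example:

import base64
from langchain_core.messages import HumanMessage

# Hypothetical chart image; any diagram or graph works the same way.
with open("sales_chart.png", "rb") as image_file:
    image_data = base64.b64encode(image_file.read()).decode("utf-8")

prompt = """Analyze this chart step by step:
1. Describe what the chart shows.
2. Identify the key trends.
3. State the conclusion that follows from them."""

message = HumanMessage(
    content=[
        {"type": "text", "text": prompt},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{image_data}"},
        },
    ],
)
print(llm.invoke([message]).content)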

4. OCR’ing

This is one of the most important use cases in the area of Document AI, where models extract text from images for downstream tasks.

image_path = "qubits.png"
with open(image_path, 'rb') as image_file:
    image_data = image_file.read()
    
image_data = base64.b64encode(image_data).decode("utf-8")

prompt="""Extract all the text from the image"""
message = HumanMessage(
    content=[
        {"type": "text", "text": prompt},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
        },
    ],
)
response = llm.invoke([message])
print(response.content)

5. Object Detection & Segmentation

Vision models can identify objects in images and classify them into defined categories. In object detection, the model locates objects and assigns them to classes, whereas in segmentation, it divides the image into regions based on surrounding pixel values.

import base64
import json
from typing import List

from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

# Schema for the detected objects and their bounding boxes
class Segmentation(BaseModel):
    Object: List[str] = Field(description="Identify each object and give it a name")
    Bounding_box: List[List[int]] = Field(description="Extract the bounding boxes")

parser = PydanticOutputParser(pydantic_object=Segmentation)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract all the image objects and their bounding boxes. You must always return valid JSON.\n'{struct_format}'\n"),
    ("human", [
        {
            "type": "image_url",
            "image_url": {"url": "data:image/jpeg;base64,{image_data}"},
        },
    ]),
])

chain = prompt | llm | parser

image_path = "quantum.jpg"
with open(image_path, 'rb') as image_file:
    image_data = image_file.read()
    
image_data = base64.b64encode(image_data).decode("utf-8")

response = chain.invoke({
    "struct_format": parser.get_format_instructions(),
    "image_data": image_data
})

data = json.loads(response.model_dump_json())

for k,v in data.items():
    print(f"{k}: {v}")

## The plot_bounding_boxes helper and the loaded PIL image `img` are defined in the GitHub repo
plot_bounding_boxes(im=img, labels=data['Object'], bounding_boxes=data['Bounding_box'])
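
The actual plot_bounding_boxes helper lives in the GitHub repository. Purely as an illustration of what such a helper might look like, here is a minimal sketch that assumes the model returns absolute pixel coordinates in [x1, y1, x2, y2] order (some vision models return coordinates normalized to a 0-1000 range, which would need rescaling first):

from PIL import Image, ImageDraw

def plot_bounding_boxes(im, labels, bounding_boxes):
    """Illustrative sketch only: draw labelled boxes on a PIL image,
    assuming [x1, y1, x2, y2] boxes in absolute pixel coordinates."""
    draw = ImageDraw.Draw(im)
    for label, (x1, y1, x2, y2) in zip(labels, bounding_boxes):
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1, max(0, y1 - 12)), label, fill="red")
    im.show()

# img = Image.open("quantum.jpg")   # the image the bounding boxes refer to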

Vision models have a wide range of use cases across various industries and are increasingly being integrated into different platforms like Canva, Fireflies, Instagram, and YouTube.

Architecture of Large Vision-Language Models (LVLMs)

The primary purpose of developing vision models is to unify features from images, videos, and text. Researchers are exploring different architectures to pretrain Large Vision-Language Models (LVLMs). Typically, encoders are employed to extract image features, while text data can be processed using an encoder, a decoder, or a combination of both. Modality projectors, sometimes called connectors, are dense neural networks used to align image features with text representations.
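
As a concrete, if simplified, example, a modality projector can be as small as a two-layer MLP that maps image-encoder outputs into the text model's embedding dimension (the dimensions below are assumptions chosen only for illustration):

import torch.nn as nn

# Hypothetical dimensions: 768-d vision features projected into a 4096-d
# text embedding space so image tokens can sit alongside text tokens.
projector = nn.Sequential(
    nn.Linear(768, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)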

Below is a general overview of common network designs.

1. Two-Tower VLM

The figure below represents the simplest architecture where images and text are encoded separately and trained under a common objective. Here’s a breakdown of the components:

Two-Tower VLM
  • Image Encoder: On the left side, there is an encoder that processes image data. This encoder extracts meaningful features from the image for further processing.
  • Text Encoder: On the right side, a similar encoder processes text data. It transforms the textual data into a format suitable for the shared objective.
  • Objective: The representations from the image and text encoders feed into a shared objective, whose goal is to align the information from both modalities (image and text).

This setup is common in models that aim to learn relationships between images and text. These models also work as the base for multiple downstream tasks like image captioning or visual question answering.
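
A minimal PyTorch sketch of the two-tower idea is shown below, with placeholder encoders and a CLIP-style contrastive objective; the dimensions, temperature, and pooled-feature assumption are illustrative choices, not the design of any particular model:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerVLM(nn.Module):
    """Illustrative sketch: both encoders are placeholders assumed to return
    pooled [batch, dim] features; the objective is a CLIP-style contrastive loss."""
    def __init__(self, image_encoder, text_encoder, img_dim=768, txt_dim=768, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder        # e.g. a ViT backbone (assumption)
        self.text_encoder = text_encoder          # e.g. a transformer text encoder (assumption)
        self.image_proj = nn.Linear(img_dim, embed_dim)
        self.text_proj = nn.Linear(txt_dim, embed_dim)

    def forward(self, images, texts):
        img_emb = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        txt_emb = F.normalize(self.text_proj(self.text_encoder(texts)), dim=-1)
        # Shared objective: matching image-text pairs should have the highest similarity
        logits = img_emb @ txt_emb.t() / 0.07     # temperature-scaled cosine similarity
        targets = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2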

2. Two-Leg VLM

The architecture described below resembles the two-tower approach, but it incorporates a fusion layer (a dense neural network) to merge the features from images and text. Let’s go through each step in detail.

Two-Leg VLM
  • Image Encoder: This component processes input images. It extracts important features and representations from the image data.
  • Text Encoder: The right side component processes textual data. It transforms the text data into meaningful representations.
  • Fusion Layer: The key addition in this design is the fusion layer. After the image and text data are encoded separately, their representations are combined, or fused, in this layer. This is critical for learning relationships between the two modalities (images and text).
  • Objective: Ultimately, the fused data is utilized for a shared objective, which could be a downstream task such as classification, caption generation, or question answering.

In summary, this design encodes image and text data separately and then combines them at the fusion layer to achieve a unified goal. The fusion layer is crucial for leveraging the information from both data types in a coordinated way.
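
A sketch of the extra fusion step, again with placeholder encoders and dimensions chosen only for illustration, assuming each encoder returns a pooled feature vector per example:

import torch
import torch.nn as nn

class TwoLegVLM(nn.Module):
    """Illustrative sketch: encoders are placeholders assumed to return pooled
    [batch, dim] features; the fused representation feeds a classification head."""
    def __init__(self, image_encoder, text_encoder, img_dim=768, txt_dim=768,
                 hidden=1024, num_classes=2):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Fusion layer: a dense network over the concatenated representations
        self.fusion = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),       # e.g. a classification objective
        )

    def forward(self, images, texts):
        img_feat = self.image_encoder(images)     # [batch, img_dim]
        txt_feat = self.text_encoder(texts)       # [batch, txt_dim]
        fused = torch.cat([img_feat, txt_feat], dim=-1)
        return self.fusion(fused)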

3. VLM with Image Encoder – Text Encoder & Decoder

The next architecture we can consider uses an encoder for images and splits the text pathway into an encoder and a decoder. The text is divided into two parts: one part passes through the encoder, while the remaining text feeds into the decoder and learns further relations through cross-attention. One use case is question answering over an image combined with its long description: the image passes through the image encoder, the image description goes through the text encoder, and the question-answer pairs feed into the decoder.

VLM with Image Encoder – Text Encoder & Decoder

Here is an explanation of the different components:

  1. Conv Stage: This step processes images through a convolutional layer to extract features from the image data.
  2. Text Embedding: Text data (such as image descriptions) is embedded into a high-dimensional vector representation.
  3. Concatenate: Both the processed image features and the embedded text features are combined into a unified representation.
  4. Encoder: The concatenated features are passed through an encoder, which transforms the data into a higher-level representation.
  5. Projector: After encoding, the features are projected into a space where they can be more easily integrated with features from the decoder.
  6. Cross Attention: This block enables interaction between the features from the projector and the decoder. In this case, the system learns which parts of the image and text data are most relevant to each other.
  7. Concatenate Features: Instead of using cross-attention, we can stack features from the projector and decoder together.
  8. Decoder: The combined features are passed to a decoder, which processes the integrated information and generates output.
  9. Objective: The objective can be any of the shared objectives described above.

Overall, this diagram represents a system where images and text are processed together. Their features are concatenated or cross-attended, and finally decoded to achieve a specific objective in a multimodal task.
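
The cross-attention step at the heart of this design can be sketched with PyTorch's nn.MultiheadAttention, where the decoder states attend over the projected encoder features (all shapes here are illustrative):

import torch
import torch.nn as nn

# Decoder hidden states query the projected encoder (image + description) features.
d_model = 1024                                     # illustrative model dimension
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

decoder_states = torch.randn(2, 32, d_model)       # [batch, decoder tokens, d_model]
encoder_states = torch.randn(2, 256, d_model)      # [batch, projected encoder tokens, d_model]

attended, _ = cross_attn(query=decoder_states, key=encoder_states, value=encoder_states)
# `attended` then continues through the decoder toward the training objective.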

4. VLM with Encoder-Decoder

Our final architecture passes all images to the encoder, while the text data goes to the decoder. During combined representation learning, we can use either cross-attention or simply concatenate the features from both modalities.

VLM with Image Encoder – Text Encoder – Decoder

Following is a step-by-step explanation:

  • Image Encoder: It extracts visual features from the image, transforming it into a numerical representation that the model can understand.
  • Projector: The projector takes the output from the Image Encoder and projects it into a vector space compatible with the text data.
  • Cross Attention: This is where the core interaction between the image and text happens. It helps the model align the visual information with the relevant textual context.
  • Concatenate Features: Instead of using cross-attention, we can simply stack the features of both modalities for more comprehensive contextual learning.
  • Text Decoder: It takes the concatenated features as input and uses them to predict the next word in the sequence.

The model learns to “view” the images, “comprehend” the text, and then generate a coherent and informative output by aligning the visual and textual information.
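
When feature concatenation is used instead of cross-attention, the projected image tokens are simply prepended to the text token embeddings before the decoder runs. A rough sketch with assumed shapes:

import torch

# Hypothetical shapes: 196 projected image tokens and 32 text tokens,
# both already in the decoder's embedding dimension.
image_tokens = torch.randn(2, 196, 4096)   # image encoder output passed through the projector
text_tokens = torch.randn(2, 32, 4096)     # embedded text prompt

# Concatenate along the sequence dimension; the text decoder then attends
# over the combined sequence with its ordinary self-attention.
decoder_input = torch.cat([image_tokens, text_tokens], dim=1)   # [2, 228, 4096]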

Conclusion

Multimodal LLMs, or Vision-Language Models (VLMs) as discussed in this blog, are trained on image-text datasets to facilitate efficient communication across different data modalities. These models excel at pixel-level understanding and at visual tasks such as object detection and semantic segmentation. However, it is important to highlight that achieving competitive performance with VLMs demands large datasets and significant computational resources. For instance, Qwen2-VL was trained on 1.4 trillion image and text tokens.

While VLMs can handle various visual tasks, they still show limitations in use cases such as reasoning, image interpretation, and extracting complex data.

I will conclude the first part here, hoping it has provided a clear overview of how vision models are generally trained. It is important to note that developing these models requires a strong understanding of matrix operations, model parallelism, flash attention, and hyperparameter tuning. In the next part, we will train our own VLM for a small use case.


I am Ram, a data scientist. I work as an Associate Director of Machine Learning at Cleareye.AI. Throughout my career, I have worked on various AI projects, ranging from traditional algorithms to cutting-edge technologies. I have extensive experience with LLMs and Graph Neural Networks. I am always eager to learn, and my next pursuit involves exploring Quantum computing.
