Empowering AI with Senses: A Journey into Multimodal LLMs Part 1

Ram Singh Last Updated: 27 Jan, 2025
9 min read

The human mind naturally perceives language, vision, smell, and touch, enabling us to understand our surroundings. We are particularly inclined toward linguistic thought and visual memory. As GenAI models continue to grow, researchers are now extending their capabilities by incorporating multimodality. Traditional Large Language Models (LLMs) accept only text as input and produce only text as output, which means they do not process or generate data from other modalities such as images, videos, or audio. LLMs have excelled at tasks such as question answering, text summarization, translation, information retrieval, code generation, and reasoning. However, integrating other modalities with LLMs (multimodal LLMs) expands the potential of GenAI models. For instance, training a model on a combination of text and images enables it to solve problems such as visual question answering, image segmentation, and object detection. Likewise, we can add video to the same model for more advanced media-related analysis.

Introduction to Multimodal LLMs

Generative AI is a subfield of machine learning focused on generating new content. Feeding text into a model and generating new text in return is known as text-to-text generation. By extending LLMs with other modalities, however, we open the door to a wide range of use cases such as text-to-image, text-to-video, text-to-speech, image-to-image, and image-to-video. We call such models large multimodal models, or multimodal LLMs. These models are trained on large datasets containing text and other modalities so that the algorithms can learn the relationships among all the input types. Intuitively, they are not limited to a single input or output type; they can be adapted to handle inputs from any modality and generate output accordingly. In this way, multimodal LLMs can be seen as giving the system the ability to process and understand different types of sensory input.

This blog is split into two parts: in the first, I will explore the applications of multimodal LLMs and their common architectures, while in the second, I will train a small vision model.

Datasets

While combining different input types to create multimodal LLMs may appear straightforward, it becomes more complex when 1D, 2D, and 3D data must be processed together. It is a multi-step problem that needs to be solved sequentially, and the data must be carefully curated to enhance the problem-solving capabilities of such models.

For now, we will limit our discussion to text and images. Unlike text, images and videos come in varying sizes and resolutions, so a robust pre-processing technique is needed to standardize all inputs into a single framework. Furthermore, inputs like images, videos, prompts, and metadata should be prepared in a way that helps models build coherent thought processes and maintain logical consistency during inference. Models trained with text, image, and video data are called Large Vision-Language Models (LVLMs).
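
As a simple illustration, here is a minimal preprocessing sketch (assuming PIL and torchvision are installed, with standard ImageNet normalization as a placeholder choice) that turns images of arbitrary resolution into fixed-size tensors a vision encoder can consume:

from PIL import Image
import torchvision.transforms as T

# Illustrative pipeline: every image, regardless of its original resolution,
# is resized, centre-cropped, and normalized to a fixed-shape tensor.
preprocess = T.Compose([
    T.Resize(256),                              # scale the shorter side to 256 px
    T.CenterCrop(224),                          # crop a fixed 224x224 patch
    T.ToTensor(),                               # convert to a [3, 224, 224] float tensor
    T.Normalize(mean=[0.485, 0.456, 0.406],     # ImageNet statistics, a common default
                std=[0.229, 0.224, 0.225]),
])

image = Image.open("quantum.jpg").convert("RGB")
pixel_values = preprocess(image)                # shape: [3, 224, 224]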

Application of Multimodal LLMs

The following image is taken from the Qwen2-VL paper, where researchers trained a vision model based on the Qwen2 LLM that can solve multiple visual use cases.

Source: Qwen2-VL

The figure below demonstrates how a Multimodal Language Model (MMLM) processes different types of input data (image, text, audio, video) to achieve various objectives. The core component of the diagram, the MMLM, integrates all of these modalities and processes them in combination.

A generic understanding of the input and output flow of MMLMs.

Let’s proceed further and understand the different applications of vision models. The complete code used in this blog is available on GitHub.
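
All the snippets below assume an `llm` object, i.e., a LangChain chat model that accepts image content blocks. The exact provider used in the repository is not shown here, so the following setup is only a sketch; any vision-capable chat model wired through LangChain should behave the same way:

import base64
from langchain_core.messages import HumanMessage
from langchain_openai import ChatOpenAI   # assumption: any vision-capable chat model works

# Hypothetical setup: the examples only require that `llm` understands
# messages containing both text and image_url content blocks.
llm = ChatOpenAI(model="gpt-4o", temperature=0)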

1. Image Captioning

Image captioning is the task of describing the content of an image in words. People use this capability to generate image descriptions and to come up with engaging captions and relevant hashtags for their social media posts to improve visibility.

# Read the image and encode it as base64 so it can be embedded in the message
image_path = "quantum.jpg"
with open(image_path, 'rb') as image_file:
    image_data = image_file.read()

image_data = base64.b64encode(image_data).decode("utf-8")

prompt = """Explain this image"""

# Build a multimodal message containing both the text prompt and the image
message = HumanMessage(
    content=[
        {"type": "text", "text": prompt},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
        },
    ],
)
response = llm.invoke([message])
print(response.content)

2. Information Extraction

Information extraction is another application of vision models, where we expect the model to retrieve features or data points from images. For example, we can ask the model to identify an object's colour, text, or other attributes. Contemporary models use function calling or JSON-parsing techniques to extract structured data points from images.

import base64
import json

from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

# Schema describing the structured data points we want the model to return
class Retrieval(BaseModel):
    Description: str = Field(description="Describe the image")
    Machine: str = Field(description="Explain what the machine is about")
    Color: str = Field(description="What colors are used in the image")
    People: str = Field(description="Count how many men and women are standing there")

parser = PydanticOutputParser(pydantic_object=Retrieval)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract the requested details as per the given details.\n'{struct_format}'\n"),
    ("human", [
        {
            "type": "image_url",
            "image_url": {"url": "data:image/jpeg;base64,{image_data}"},
        },
    ]),
])

chain = prompt | llm | parser

image_path = "quantum.jpg"
with open(image_path, 'rb') as image_file:
    image_data = image_file.read()
    
image_data = base64.b64encode(image_data).decode("utf-8")


response = chain.invoke({
    "struct_format": parser.get_format_instructions(),
    "image_data": image_data
})

data = json.loads(response.model_dump_json())

for k,v in data.items():
    print(f"{k}: {v}")

3. Visual Interpretation & Reasoning

This use case asks a vision model to analyze an image and perform reasoning over it. For example, the model can interpret the underlying information in images, diagrams, and graphical representations, build a step-by-step analysis, and draw conclusions.
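
The repository does not include a dedicated snippet for this use case, but a minimal sketch, assuming the same `llm` object and a hypothetical chart image named sales_chart.png, looks almost identical to the captioning example:

import base64
from langchain_core.messages import HumanMessage

# Hypothetical chart image; any diagram or graph works the same way.
with open("sales_chart.png", "rb") as image_file:
    image_data = base64.b64encode(image_file.read()).decode("utf-8")

prompt = """Analyze this chart step by step:
1. Describe what the chart shows.
2. Identify the key trends.
3. State the conclusion that follows from them."""

message = HumanMessage(
    content=[
        {"type": "text", "text": prompt},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{image_data}"},
        },
    ],
)
print(llm.invoke([message]).content)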

4. OCR’ing

This is one of the most important use cases in the area of Document AI, where models extract text from images for downstream tasks.

image_path = "qubits.png"
with open(image_path, 'rb') as image_file:
    image_data = image_file.read()
    
image_data = base64.b64encode(image_data).decode("utf-8")

prompt="""Extract all the text from the image"""
message = HumanMessage(
    content=[
        {"type": "text", "text": prompt},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
        },
    ],
)
response = llm.invoke([message])
print(response.content)

5. Object Detection & Segmentation

Vision models can identify objects in images and classify them into defined categories. In object detection, the model locates objects and assigns them to classes, whereas in segmentation, it divides the image into regions based on surrounding pixel values.

import base64
import json
from typing import List

from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

# Schema for the detected objects and their bounding boxes
class Segmentation(BaseModel):
    Object: List[str] = Field(description="Identify each object and give it a name")
    Bounding_box: List[List[int]] = Field(description="Extract the bounding boxes")

parser = PydanticOutputParser(pydantic_object=Segmentation)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract all the image objects and their bounding boxes. You must always return valid JSON.\n'{struct_format}'\n"),
    ("human", [
        {
            "type": "image_url",
            "image_url": {"url": "data:image/jpeg;base64,{image_data}"},
        },
    ]),
])

chain = prompt | llm | parser

image_path = "quantum.jpg"
with open(image_path, 'rb') as image_file:
    image_data = image_file.read()
    
image_data = base64.b64encode(image_data).decode("utf-8")

response = chain.invoke({
    "struct_format": parser.get_format_instructions(),
    "image_data": image_data
})

data = json.loads(response.model_dump_json())

for k,v in data.items():
    print(f"{k}: {v}")

## The plot_bounding_boxes helper and the loaded PIL image `img` are defined in the GitHub repo
plot_bounding_boxes(im=img, labels=data['Object'], bounding_boxes=data['Bounding_box'])
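
The actual plot_bounding_boxes helper lives in the GitHub repository. Purely as an illustration of what such a helper might look like, here is a minimal sketch that assumes the model returns absolute pixel coordinates in [x1, y1, x2, y2] order (some vision models return coordinates normalized to a 0-1000 range, which would need rescaling first):

from PIL import Image, ImageDraw

def plot_bounding_boxes(im, labels, bounding_boxes):
    """Illustrative sketch only: draw labelled boxes on a PIL image,
    assuming [x1, y1, x2, y2] boxes in absolute pixel coordinates."""
    draw = ImageDraw.Draw(im)
    for label, (x1, y1, x2, y2) in zip(labels, bounding_boxes):
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1, max(0, y1 - 12)), label, fill="red")
    im.show()

# img = Image.open("quantum.jpg")   # the image the bounding boxes refer to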

Vision models have a wide range of use cases across various industries and are increasingly being integrated into different platforms like Canva, Fireflies, Instagram, and YouTube.

Architecture of Large Vision-Language Models (LVLMs)

The primary purpose of developing vision models is to unify features from images, videos, and text. Researchers are exploring different architectures to pretrain Large Vision-Language Models (LVLMs). Typically, encoders are employed to extract image features, while text data can be processed using an encoder, a decoder, or a combination of both. Modality projectors, sometimes called connectors, are dense neural networks used to align image features with text representations.
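
As a concrete, if simplified, example, a modality projector can be as small as a two-layer MLP that maps image-encoder outputs into the text model's embedding dimension (the dimensions below are assumptions chosen only for illustration):

import torch.nn as nn

# Hypothetical dimensions: 768-d vision features projected into a 4096-d
# text embedding space so image tokens can sit alongside text tokens.
projector = nn.Sequential(
    nn.Linear(768, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)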

Below is a general overview of common network designs.

1. Two-Tower VLM

The figure below represents the simplest architecture where images and text are encoded separately and trained under a common objective. Here’s a breakdown of the components:

Two-Tower VLM
  • Image Encoder: On the left side, there is an encoder that processes image data. This encoder extracts meaningful features from the image for further processing.
  • Text Encoder: On the right side, a similar encoder processes text data. It transforms the textual data into a format suitable for the shared objective.
  • Objective: The representations from the image and text encoders feed into a shared objective, whose goal is to align the information from both modalities (image and text).

This setup is common in models that aim to learn relationships between images and text. These models also work as the base for multiple downstream tasks like image captioning or visual question answering.
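
A minimal PyTorch sketch of the two-tower idea is shown below, with placeholder encoders and a CLIP-style contrastive objective; the dimensions, temperature, and pooled-feature assumption are illustrative choices, not the design of any particular model:

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerVLM(nn.Module):
    """Illustrative sketch: both encoders are placeholders assumed to return
    pooled [batch, dim] features; the objective is a CLIP-style contrastive loss."""
    def __init__(self, image_encoder, text_encoder, img_dim=768, txt_dim=768, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder        # e.g. a ViT backbone (assumption)
        self.text_encoder = text_encoder          # e.g. a transformer text encoder (assumption)
        self.image_proj = nn.Linear(img_dim, embed_dim)
        self.text_proj = nn.Linear(txt_dim, embed_dim)

    def forward(self, images, texts):
        img_emb = F.normalize(self.image_proj(self.image_encoder(images)), dim=-1)
        txt_emb = F.normalize(self.text_proj(self.text_encoder(texts)), dim=-1)
        # Shared objective: matching image-text pairs should have the highest similarity
        logits = img_emb @ txt_emb.t() / 0.07     # temperature-scaled cosine similarity
        targets = torch.arange(logits.size(0), device=logits.device)
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2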

2. Two-Leg VLM

The architecture described below resembles the two-tower approach, but it incorporates a fusion layer (a dense neural network) to merge the features from images and text. Let’s go through each step in detail.

Two-Leg VLM
  • Image Encoder: This component processes input images. It extracts important features and representations from the image data.
  • Text Encoder: The right side component processes textual data. It transforms the text data into meaningful representations.
  • Fusion Layer: The key addition in this design is the fusion layer. After the image and text data are encoded separately, their representations are combined, or fused, in this layer. This is critical for learning relationships between the two modalities (images and text).
  • Objective: Ultimately, the fused data is utilized for a shared objective, which could be a downstream task such as classification, caption generation, or question answering.

In summary, this design encodes image and text data separately and then combines them at the fusion layer to achieve a unified goal. The fusion layer is crucial for leveraging the information from both data types in a coordinated way.
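
A sketch of the extra fusion step, again with placeholder encoders and dimensions chosen only for illustration, assuming each encoder returns a pooled feature vector per example:

import torch
import torch.nn as nn

class TwoLegVLM(nn.Module):
    """Illustrative sketch: encoders are placeholders assumed to return pooled
    [batch, dim] features; the fused representation feeds a classification head."""
    def __init__(self, image_encoder, text_encoder, img_dim=768, txt_dim=768,
                 hidden=1024, num_classes=2):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Fusion layer: a dense network over the concatenated representations
        self.fusion = nn.Sequential(
            nn.Linear(img_dim + txt_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),       # e.g. a classification objective
        )

    def forward(self, images, texts):
        img_feat = self.image_encoder(images)     # [batch, img_dim]
        txt_feat = self.text_encoder(texts)       # [batch, txt_dim]
        fused = torch.cat([img_feat, txt_feat], dim=-1)
        return self.fusion(fused)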

3. VLM with Image Encoder – Text Encoder & Decoder

The next architecture we can consider uses an encoder for images and splits the text pathway into an encoder and a decoder. The text is divided into two parts: one part passes through the encoder, while the remaining text feeds into the decoder and learns further relations through cross-attention. One use case is question answering over an image combined with its long description: the image passes through the image encoder, the image description goes through the text encoder, and the question-answer pairs feed into the decoder.

VLM with Image Encoder – Text Encoder & Decoder

Here is an explanation of the different components:

  1. Conv Stage: This step processes images through a convolutional layer to extract features from the image data.
  2. Text Embedding: Text data (such as image descriptions) is embedded into a high-dimensional vector representation.
  3. Concatenate: Both the processed image features and the embedded text features are combined into a unified representation.
  4. Encoder: The concatenated features are passed through an encoder, which transforms the data into a higher-level representation.
  5. Projector: After encoding, the features are projected into a space where they can be more easily integrated with features from the decoder.
  6. Cross Attention: This block enables interaction between the features from the projector and the decoder. In this case, the system learns which parts of the image and text data are most relevant to each other.
  7. Concatenate Features: Instead of using cross-attention, we can stack features from the projector and decoder together.
  8. Decoder: The combined features are passed to a decoder, which processes the integrated information and generates output.
  9. Objective: The objective can be any of the shared objectives described above.

Overall, this diagram represents a system where images and text are processed together. Their features are concatenated or cross-attended, and finally decoded to achieve a specific objective in a multimodal task.
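
The cross-attention step at the heart of this design can be sketched with PyTorch's nn.MultiheadAttention, where the decoder states attend over the projected encoder features (all shapes here are illustrative):

import torch
import torch.nn as nn

# Decoder hidden states query the projected encoder (image + description) features.
d_model = 1024                                     # illustrative model dimension
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

decoder_states = torch.randn(2, 32, d_model)       # [batch, decoder tokens, d_model]
encoder_states = torch.randn(2, 256, d_model)      # [batch, projected encoder tokens, d_model]

attended, _ = cross_attn(query=decoder_states, key=encoder_states, value=encoder_states)
# `attended` then continues through the decoder toward the training objective.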

4. VLM with Encoder-Decoder

Our final architecture passes all images to the encoder, while the text data goes to the decoder. During combined representation learning, we can use either cross-attention or simply concatenate the features from both modalities.

VLM with Image Encoder – Text Encoder – Decoder

Following is a step-by-step explanation:

  • Image Encoder: It extracts visual features from the image, transforming it into a numerical representation that the model can understand.
  • Projector: The projector takes the output from the Image Encoder and projects it into a vector space compatible with the text data.
  • Cross Attention: This is where the core interaction between the image and text happens. It helps the model align the visual information with the relevant textual context.
  • Concatenate Features: Instead of using cross-attention, we can simply stack the features of both modalities for more comprehensive contextual learning.
  • Text Decoder: It takes the concatenated features as input and uses them to predict the next word in the sequence.

The model learns to “view” the images, “comprehend” the text, and then generate a coherent and informative output by aligning the visual and textual information.
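
When feature concatenation is used instead of cross-attention, the projected image tokens are simply prepended to the text token embeddings before the decoder runs. A rough sketch with assumed shapes:

import torch

# Hypothetical shapes: 196 projected image tokens and 32 text tokens,
# both already in the decoder's embedding dimension.
image_tokens = torch.randn(2, 196, 4096)   # image encoder output passed through the projector
text_tokens = torch.randn(2, 32, 4096)     # embedded text prompt

# Concatenate along the sequence dimension; the text decoder then attends
# over the combined sequence with its ordinary self-attention.
decoder_input = torch.cat([image_tokens, text_tokens], dim=1)   # [2, 228, 4096]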

Conclusion

Multimodal LLMs, or Vision-Language Models (VLMs) as discussed in this blog, are trained on image-text datasets to facilitate efficient communication across different data modalities. These models excel at pixel-level understanding and at visual tasks such as object detection and semantic segmentation. However, it is important to highlight that achieving competitive performance with VLMs demands large datasets and significant computational resources. For instance, Qwen2-VL was trained on 1.4 trillion image and text tokens.

While VLMs can handle various visual tasks, they still show limitations in use cases such as reasoning, image interpretation, and extracting complex data.

I will conclude the first part here, hoping it has provided a clear overview of how vision models are generally trained. It is important to note that developing these models requires a strong understanding of matrix operations, model parallelism, flash attention, and hyperparameter tuning. In the next part, we will train our own VLM for a small use case.


I am Ram, a data scientist. I work as an Associate Director of Machine Learning at Cleareye.AI. Throughout my career, I have worked on various AI projects, ranging from traditional algorithms to cutting-edge technologies. I have extensive experience with LLMs and Graph Neural Networks. I am always eager to learn, and my next pursuit involves exploring Quantum computing.
