The human mind naturally perceives language, vision, smell, and touch, which lets us make sense of our surroundings. We are particularly inclined toward linguistic thought and visual memory. As GenAI models continue to grow, researchers are extending their capabilities by incorporating multimodality. Large Language Models (LLMs) accept only text as input and produce only text as output, so they cannot process or generate data from other modalities such as images, videos, or audio. LLMs excel at tasks such as question answering, text summarization, translation, information retrieval, code generation, and reasoning. However, integrating other modalities with LLMs (Multimodal LLMs) expands what GenAI models can do. For instance, training a model on text and images together enables tasks such as visual Q&A, image segmentation, and object detection. Likewise, adding video to the same model supports more advanced media analysis.
Generative AI is a branch of machine learning focused on generating new content. Feeding text into a model and generating new text from it is known as text-to-text. By extending LLMs with other modalities, however, we open up a wide range of use cases such as text-to-image, text-to-video, text-to-speech, image-to-image, and image-to-video. We call such models Large Multimodal Models (Multimodal LLMs). They are trained on large datasets containing text and other modalities so that the algorithms can learn the relationships among all the input types. Intuitively, these models are not limited to a single input or output type; they can be adapted to handle inputs from any modality and generate output accordingly. In this way, Multimodal LLMs can be seen as giving the system the ability to process and understand different types of sensory input.
This blog is split into two parts: in the first, I will explore the applications of multimodal LLMs and various architectures; in the second, I will train a small vision model.
While combining different input types to create multimodal LLMs may appear straightforward, it becomes more complex when 1D, 2D, and 3D data must be processed together. It is a multi-step problem that has to be solved sequentially, and the data must be carefully curated to strengthen the problem-solving capabilities of such models.
For now, we will limit our discussion to text and images. Unlike text, images and videos come in varying sizes and resolutions, so a robust pre-processing technique is needed to standardize all inputs into a single framework. Furthermore, inputs like images, videos, prompts, and metadata should be prepared in a way that helps models build coherent thought processes and maintain logical consistency during inference. Models trained with text, image, and video data are called Large Vision-Language Models (LVLMs).
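To make that concrete, here is a minimal pre-processing sketch using torchvision; the target size and normalization statistics are illustrative assumptions, since each vision encoder defines its own.

from PIL import Image
from torchvision import transforms

# Standardize images of arbitrary size/resolution into a fixed-size, normalized tensor.
# The 224x224 target and ImageNet statistics are illustrative; real models define their own.
preprocess = transforms.Compose([
    transforms.Resize(256),                  # scale the shorter side
    transforms.CenterCrop(224),              # crop to the square input the encoder expects
    transforms.ToTensor(),                   # convert to a CHW float tensor in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("quantum.jpg").convert("RGB")
pixel_values = preprocess(image).unsqueeze(0)  # shape: (1, 3, 224, 224)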
The following image is taken from the Qwen2-VL paper, where researchers trained a vision model based on the Qwen2 LLM that can solve multiple visual use cases.
The figure below demonstrates how a Multimodal Language Model (MMLM) processes different types of input data (image, text, audio, video) to achieve various objectives. At the core of the diagram, the MMLM integrates all these modalities and processes them in combination.
Let’s proceed further and understand the different applications of vision models. The complete code used in this blog is available on GitHub.
Image captioning is the task of describing the content of an image in words. People use this capability to generate image descriptions and to create engaging captions and relevant hashtags for their social media posts to improve visibility.
import base64

from langchain_core.messages import HumanMessage

# llm is assumed to be a multimodal chat model initialized earlier (see the GitHub repo).
image_path = "quantum.jpg"
with open(image_path, 'rb') as image_file:
    image_data = image_file.read()
image_data = base64.b64encode(image_data).decode("utf-8")

prompt = """explain this image"""

# Build a single human message containing both the text prompt and the base64-encoded image.
message = HumanMessage(
    content=[
        {"type": "text", "text": prompt},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_data}"},
        },
    ],
)

response = llm.invoke([message])
print(response.content)
Information extraction is another application of vision models, where we expect the model to retrieve specific features or data points from an image. For example, we can ask the model to identify an object’s colour, text, or other attributes. Contemporary models use function calling or JSON parsing techniques to extract structured data points from images.
import base64
import json

from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

# Schema describing the data points we want the model to extract from the image.
class Retrieval(BaseModel):
    Description: str = Field(description="Describe the image")
    Machine: str = Field(description="Explain what the machine is about")
    Color: str = Field(description="What are the colors used in the image")
    People: str = Field(description="Count how many men and women are standing there")

parser = PydanticOutputParser(pydantic_object=Retrieval)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract the requested details as per the given format.\n'{struct_format}'\n"),
    ("human", [
        {
            "type": "image_url",
            "image_url": {"url": "data:image/jpeg;base64,{image_data}"},
        },
    ]),
])

chain = prompt | llm | parser

image_path = "quantum.jpg"
with open(image_path, 'rb') as image_file:
    image_data = image_file.read()
image_data = base64.b64encode(image_data).decode("utf-8")

response = chain.invoke({
    "struct_format": parser.get_format_instructions(),
    "image_data": image_data
})

# Convert the parsed Pydantic object to a dict and print each extracted field.
data = json.loads(response.model_dump_json())
for k, v in data.items():
    print(f"{k}: {v}")
Visual reasoning asks a vision model to analyze an image and perform reasoning tasks over it. For example, the model can interpret the information underlying images, diagrams, and graphical representations, produce a step-by-step analysis, and draw a conclusion.
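The same chat-message pattern from the captioning example can be reused for reasoning. Here is a quick sketch, assuming llm is the multimodal chat model initialized earlier and diagram.png is a placeholder image of a chart or diagram.

import base64

from langchain_core.messages import HumanMessage

# diagram.png is a placeholder; llm is the multimodal chat model used in the other examples.
with open("diagram.png", "rb") as image_file:
    image_data = base64.b64encode(image_file.read()).decode("utf-8")

prompt = """Study the diagram, reason step by step, and state the conclusion it supports."""

message = HumanMessage(
    content=[
        {"type": "text", "text": prompt},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{image_data}"},
        },
    ],
)

response = llm.invoke([message])
print(response.content)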
Text extraction (OCR) is one of the most important use cases in the area of Document AI, where models extract text from images for downstream tasks.
import base64

from langchain_core.messages import HumanMessage

image_path = "qubits.png"
with open(image_path, 'rb') as image_file:
    image_data = image_file.read()
image_data = base64.b64encode(image_data).decode("utf-8")

prompt = """Extract all the text from the image"""

message = HumanMessage(
    content=[
        {"type": "text", "text": prompt},
        {
            # The MIME type should match the file format (PNG in this case).
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{image_data}"},
        },
    ],
)

response = llm.invoke([message])
print(response.content)
Vision models can identify objects in images and classify them into defined categories. In object detection, the model locates objects and classifies them, whereas in segmentation, it divides the image into regions based on the surrounding pixel values.
import base64
import json
from typing import List

from langchain.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

# Schema for the detected objects and their bounding boxes.
class Segmentation(BaseModel):
    Object: List[str] = Field(description="Identify the object and give it a name")
    Bounding_box: List[List[int]] = Field(description="Extract the bounding boxes")

parser = PydanticOutputParser(pydantic_object=Segmentation)

prompt = ChatPromptTemplate.from_messages([
    ("system", "Extract all the image objects and their bounding boxes. You must always return valid JSON.\n'{struct_format}'\n"),
    ("human", [
        {
            "type": "image_url",
            "image_url": {"url": "data:image/jpeg;base64,{image_data}"},
        },
    ]),
])

chain = prompt | llm | parser

image_path = "quantum.jpg"
with open(image_path, 'rb') as image_file:
    image_data = image_file.read()
image_data = base64.b64encode(image_data).decode("utf-8")

response = chain.invoke({
    "struct_format": parser.get_format_instructions(),
    "image_data": image_data
})

data = json.loads(response.model_dump_json())
for k, v in data.items():
    print(f"{k}: {v}")
## Complete code, including the plot_bounding_boxes helper, is available on GitHub
plot_bounding_boxes(im=img, labels=data['Object'], bounding_boxes=data['Bounding_box'])
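The plot_bounding_boxes helper used above lives in the GitHub repo. For illustration, here is a minimal stand-in sketch using PIL; it assumes the model returns pixel-space [x_min, y_min, x_max, y_max] boxes (some models return normalized or reordered coordinates, so the repo version may rescale them) and that img is the PIL image loaded from image_path.

from PIL import Image, ImageDraw

def plot_bounding_boxes(im, labels, bounding_boxes):
    """Draw labelled boxes on a copy of the image and display it."""
    annotated = im.copy()
    draw = ImageDraw.Draw(annotated)
    for label, box in zip(labels, bounding_boxes):
        x_min, y_min, x_max, y_max = box
        draw.rectangle([x_min, y_min, x_max, y_max], outline="red", width=3)
        draw.text((x_min, max(y_min - 12, 0)), label, fill="red")
    annotated.show()

img = Image.open(image_path)  # the PIL image referenced in the call above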
Vision models have a wide range of use cases across various industries and are increasingly being integrated into different platforms like Canva, Fireflies, Instagram, and YouTube.
The primary purpose of developing vision models is to unify features from images, videos, and text. Researchers are exploring different architectures to pretrain Large Vision-Language Models (LVLMs).
Typically, encoders are employed to extract image features, while text data can be processed using an encoder, a decoder, or a combination of both. Modality projectors, sometimes called connectors, are dense neural networks used to align image features with text representations.
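As an illustration of what such a projector looks like, here is a minimal PyTorch sketch; the layer sizes are assumptions rather than values from any particular model.

import torch
import torch.nn as nn

# Minimal sketch of a modality projector (connector): a small MLP that maps
# image-encoder features into the text model's embedding space.
class ModalityProjector(nn.Module):
    def __init__(self, vision_dim=1024, text_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, image_features):
        # image_features: (batch, num_patches, vision_dim)
        return self.net(image_features)  # (batch, num_patches, text_dim)

projector = ModalityProjector()
dummy_patches = torch.randn(1, 256, 1024)   # e.g. 256 patch embeddings from a vision encoder
projected = projector(dummy_patches)        # ready to be combined with text embeddings
print(projected.shape)                      # torch.Size([1, 256, 4096])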
Below is a general overview of common network designs.
The figure below represents the simplest architecture, in which an image encoder and a text encoder process the two modalities separately and are trained under a common objective.
This setup is common in models that aim to learn relationships between images and text. These models also serve as a base for multiple downstream tasks such as image captioning or visual question answering.
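Here is a toy sketch of this two-tower idea with a CLIP-style contrastive objective; the linear layers stand in for real vision and text encoders, and all dimensions are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Two towers project images and text into a shared space and are trained contrastively.
image_tower = nn.Linear(2048, 512)   # stand-in for an image encoder
text_tower = nn.Linear(768, 512)     # stand-in for a text encoder

image_feats = F.normalize(image_tower(torch.randn(8, 2048)), dim=-1)
text_feats = F.normalize(text_tower(torch.randn(8, 768)), dim=-1)

logits = image_feats @ text_feats.T / 0.07   # pairwise similarities with a temperature
targets = torch.arange(8)                    # matching image-text pairs lie on the diagonal
loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
print(loss.item())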
The architecture described below resembles the two-tower approach, but it incorporates a fusion layer (a dense neural network) to merge the features from images and text. In this setup, image and text data are encoded separately and then combined at the fusion layer to achieve a unified goal; the fusion layer is what lets the model leverage information from both data types in a coordinated way.
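Here is a toy sketch of the fusion step; the encoders are replaced by random features, and the 10-class head is an arbitrary stand-in for the shared objective.

import torch
import torch.nn as nn

# Image and text features are encoded separately, concatenated, and passed through a
# dense fusion network trained toward a single objective. Dimensions are illustrative.
image_feats = torch.randn(8, 512)   # output of an image encoder
text_feats = torch.randn(8, 512)    # output of a text encoder

fusion = nn.Sequential(
    nn.Linear(512 + 512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),             # e.g. 10 answer classes for a toy VQA-style task
)

logits = fusion(torch.cat([image_feats, text_feats], dim=-1))
print(logits.shape)  # torch.Size([8, 10])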
The next architecture uses an encoder for images and splits the text pipeline between an encoder and a decoder. The text is divided into two parts: one part passes through the text encoder, while the remaining text feeds into the decoder and learns further relations through cross-attention. One use case is question answering over an image combined with its long description: the image passes through the image encoder, the description goes through the text encoder, and the question-answer pairs feed into the decoder. Overall, this design processes images and text together; their features are concatenated or cross-attended, and finally decoded to achieve a specific objective in a multimodal task.
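Below is a toy PyTorch sketch of this split, assuming pre-computed encoder outputs; the decoder cross-attends to the concatenated image and description features while processing the question tokens.

import torch
import torch.nn as nn

# Encoded image patches and encoded description tokens form the memory sequence;
# the decoder cross-attends to it while processing the question tokens.
d_model = 512
image_memory = torch.randn(8, 196, d_model)   # encoded image patches
text_memory = torch.randn(8, 64, d_model)     # encoded description tokens
memory = torch.cat([image_memory, text_memory], dim=1)

decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)

question_tokens = torch.randn(8, 32, d_model)          # embedded question/answer tokens
output = decoder(tgt=question_tokens, memory=memory)   # cross-attention happens inside
print(output.shape)  # torch.Size([8, 32, 512])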
Our final architecture passes all images through an encoder while the text goes directly to the decoder. During combined representation learning, we can either use cross-attention or simply concatenate the features from both modalities.
In this setup, the model learns to “view” the images, “comprehend” the text, and then generate a coherent and informative output by aligning the visual and textual information.
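Here is a toy sketch of the concatenation variant, where projected image tokens are simply prepended to the text embeddings and a causal self-attention stack (standing in for a decoder-only LLM) attends over the combined sequence; every dimension is illustrative.

import torch
import torch.nn as nn

# Projected image tokens are prepended to text embeddings; a causal self-attention stack
# (a stand-in for a pretrained decoder-only LLM) processes the combined sequence.
d_model = 512
image_tokens = torch.randn(4, 256, d_model)   # image features already passed through a projector
text_embeds = torch.randn(4, 64, d_model)     # embedded prompt tokens

sequence = torch.cat([image_tokens, text_embeds], dim=1)  # (batch, 320, d_model)

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
stack = nn.TransformerEncoder(layer, num_layers=2)        # self-attention-only stack

# A causal mask keeps the stack autoregressive, as in a decoder-only LLM.
seq_len = sequence.size(1)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
hidden = stack(sequence, mask=causal_mask)
print(hidden.shape)  # torch.Size([4, 320, 512])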
Multimodal LLMs, or Vision-Language Models (VLMs) as discussed in this blog, are trained on image-text datasets to facilitate efficient communication across different data modalities. These models excel at pixel-level recognition and visual tasks such as object detection and semantic segmentation. However, it is important to highlight that achieving competitive performance with VLMs demands large datasets and significant computational resources. For instance, Qwen2-VL was trained on 1.4 trillion image and text tokens.
While VLMs can handle various visual tasks, they still show limitations in use cases such as reasoning, image interpretation, and extracting complex data.
I will conclude the first part here, hoping it has provided a clear overview of how vision models are generally trained. It is important to note that developing these models requires a strong understanding of matrix operations, model parallelism, flash attention, and hyperparameter tuning. In the next part, we will explore training our own VLM for a small use case.