This article explores Vision Language Models (VLMs) and their advantages over traditional computer vision-based models. It highlights the benefits of multimodal learning, the application of VLMs to tasks such as image captioning and visual question answering, and the pre-training objectives and protocols of two representative models: OpenAI’s CLIP and SimVLM.
Recent developments in multimodal learning draw inspiration from the way humans learn: by interpreting and connecting information from a variety of modalities, including text, images, video, audio, body motion, facial expressions, and physiological signals. This inherently multimodal nature of human learning is a key reason behind the superior performance of joint VLMs, which outperform traditional computer vision-based methods that rely on the vision modality alone.
VLMs have evolved to perform many challenging tasks with steadily increasing efficiency, for example image captioning, phrase grounding (detecting an object in an input image and expressing it as a natural language phrase), text-guided image generation and manipulation, visual question answering, and detection of hate speech in social media content.
In computer vision, visual concept classification and image or video captioning have emerged as two important tasks. In this blog, we discuss how visual concept classification and caption generation (prediction) based on joint vision-language modalities differ from traditional computer vision-based models. We also discuss two types of VLM-based models and their training procedures, detailing joint vision-language models such as CLIP from OpenAI and SimVLM.
As opposed to conventional computer vision-based techniques that consider only visual characteristics, VLM-based classification improves comprehension and analysis by fusing visual data with natural language.
Vision Language Models (VLMs) are a type of multimodal Large Language Model (LLM) that integrates LLMs with computer vision, so that they can both perceive images and videos and contextualize them with corresponding natural language descriptions, whereas traditional visual concept classification methods rely primarily on analyzing visual features. Contextualizing a visual source means understanding its subject or context rather than merely identifying the objects visible in it.
In contrast to traditional methods, VLMs can learn about images and videos from text in addition to visual features, which makes contextualization easier for them. Moreover, learning from natural language supervision gives VLMs an advantage over conventional training methods.
The inherent capability of these models for zero-shot and few-shot learning allows them to categorize images and videos into previously unseen or rarely seen classes, based on an understanding of their context. This stands in contrast to conventional models, which require a sufficient amount of training data for each category they are expected to identify. In other words, state-of-the-art visual concept classification methods are trained to predict a predefined set of object classes, each with numerous examples.
This characteristic restricts their applicability when test data contains previously unseen categories or when only a handful of examples of a category are available. Before VLMs, zero-data learning was mostly explored within computer vision alone. For VLMs, a critical challenge lies instead in crafting precise textual representations for class names.
To perform zero-shot and few-shot transfer learning efficiently, VLM-based visual concept classification methods are trained, unlike traditional methods, on computer vision datasets from diverse domains (for example, geo-localization, OCR, and remote sensing) at the same time, along with vast amounts of image and video descriptions in raw text.
Since training such methods incurs a tremendous cost in time and resources due to this aggregate supervision, it is standard practice to reuse pre-trained models on new examples, although fine-tuning is often required. For this reason, we will refer to the training process as pre-training from now on.
An image encoder, a text encoder, and a method to fuse information from the two encoders are the three main components of a vision-language model. These components work closely together, because the loss functions are designed around both the model architecture and the learning approach. Although this field of study is hardly new, the design of vision-language models has evolved significantly over time.
The current literature primarily uses transformer-based image and text encoders to learn image and text representations, either independently or jointly. Strategic pre-training objectives enable these models to perform a range of downstream tasks. In this section, we discuss two types of pre-training methods: contrastive learning and PrefixLM. Both fuse the vision and language modalities, but they do so in different ways.
Contrastive learning is one of the most popular and successful pre-training objectives for VLMs. Using large datasets of {image, caption} pairs, contrastive learning-based approaches train a text encoder and an image encoder jointly with a contrastive loss, bridging the vision and language modalities. In contrastive learning, input texts and images are mapped to the same feature space so that the distance between the embeddings of a matching image-text pair is minimized, while the distance between non-matching pairs is maximized. Contrastive Language-Image Pre-training (CLIP) is an example of a pre-trained model of this kind available for image classification.
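To make this objective concrete, below is a minimal sketch of a symmetric contrastive loss in PyTorch. It is not the exact CLIP implementation; the batch size, embedding dimension, and temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched {image, caption} pairs.
    Row i of image_embeds is assumed to correspond to row i of text_embeds."""
    # Map both modalities onto the unit sphere so the dot product is cosine similarity
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Pairwise cosine similarities, scaled by a temperature
    logits = image_embeds @ text_embeds.T / temperature
    # The matching caption for image i sits on the diagonal (index i)
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy in both directions: image -> text and text -> image
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.T, targets)
    return (loss_i2t + loss_t2i) / 2

# Toy batch of 8 pairs with 512-dimensional embeddings
images = torch.randn(8, 512)
captions = torch.randn(8, 512)
print(contrastive_loss(images, captions))

Minimizing this loss pulls matching image and text embeddings together while pushing non-matching pairs apart, which is exactly the behavior described above.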
CLIP is a state-of-the-art multimodal VLM, highly capable of zero-data (or few-data) image classification, introduced by OpenAI in 2021. Its main task is learning visual representations from natural language supervision, and it achieves competitive zero-shot (or few-shot) performance on a wide variety of image classification datasets.
The training mechanism of CLIP requires image-text pairs in which the ‘text’ is the caption of the corresponding image. All the text snippets are separated from the images and fed into a text encoder, which is trained to output text features, also called text representations. CLIP uses a Transformer as the text encoder.
Similarly, the images are passed through an image encoder such as ViT, which acts as the computer vision backbone and is trained to produce image features or representations. Both the text and image embeddings have the same dimension and are projected into a shared latent space. More precisely, CLIP maximizes the cosine similarity between matching image and text embeddings, creating a multimodal embedding space by jointly training the image and text encoders.
Use the commands below to set up the environment for inference with CLIP.
$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git
The code snippet below demonstrates how to classify images from the CIFAR100 test set using CLIP, a model that was not exposed to CIFAR100 during pre-training. This example highlights CLIP’s zero-shot capability, using its pre-trained multimodal embeddings for classification. The code is available in the official OpenAI-CLIP GitHub repository.
import os
import clip
import torch
from torchvision.datasets import CIFAR100
# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)
# Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)
# Prepare the inputs
image, class_id = cifar100[3637]
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)
# Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)
# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)
# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")
Another approach to pre-training VLMs is the PrefixLM objective, which also uses a multimodal architecture consisting of a transformer encoder and a transformer decoder. In PrefixLM, the model receives patches of an image together with part of the corresponding caption as the prefix input and predicts a plausible continuation of the caption. More precisely, the prefix text acts as a prompt for the subsequent prediction. Simple Visual Language Model (SimVLM) is a model that uses this pre-training objective.
Simple Visual Language Model (SimVLM) was introduced in 2022 and is mainly applied to image captioning and visual question answering. SimVLM relies on the working principle of generative language models, which are highly capable of predicting the next token of an input text given as a prefix. In contrast to CLIP, instead of learning two distinct feature spaces, one for visual inputs and another for language inputs, this method learns a single feature space from both types of input; we therefore refer to it as a unified multimodal feature space.
In SimVLM’s training mechanism, the model receives successive patches of an image as input. The encoder receives a concatenated sequence of image patches and prefix text as the prefix input, and the decoder then predicts the subsequent text sequence. The SimVLM model is first trained on a text-only dataset without image patches in the prefix, and then pre-trained on an aligned image-text dataset. As mentioned earlier, SimVLM learns a unified multimodal representation, which enables it to perform zero-data and few-data cross-modality transfer learning with high efficiency. Such models handle visual question answering and generate image-conditioned text and captions, as sketched below.
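The following is a highly simplified sketch of the PrefixLM idea in PyTorch, using toy dimensions and random data purely for illustration; the actual SimVLM architecture and training setup differ. It shows how projected image patches and prefix text embeddings are concatenated as the encoder input, while the decoder is trained to predict the remaining caption tokens.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes only; the real SimVLM configuration is much larger
vocab_size, d_model, patch_dim = 1000, 256, 768
num_patches, prefix_len, suffix_len = 16, 5, 7

patch_proj = nn.Linear(patch_dim, d_model)        # projects flattened image patches
token_embed = nn.Embedding(vocab_size, d_model)   # embeds caption tokens
transformer = nn.Transformer(d_model=d_model, batch_first=True)
lm_head = nn.Linear(d_model, vocab_size)          # maps decoder states to vocabulary logits

# Toy inputs: one image split into patches, a caption split into prefix and suffix
patches = torch.randn(1, num_patches, patch_dim)
prefix_tokens = torch.randint(0, vocab_size, (1, prefix_len))
suffix_tokens = torch.randint(0, vocab_size, (1, suffix_len))

# Encoder input: image patch embeddings concatenated with prefix text embeddings
encoder_input = torch.cat([patch_proj(patches), token_embed(prefix_tokens)], dim=1)

# Decoder predicts the suffix with teacher forcing (in practice a start token is prepended)
decoder_input = token_embed(suffix_tokens[:, :-1])
causal_mask = transformer.generate_square_subsequent_mask(decoder_input.size(1))
decoder_states = transformer(encoder_input, decoder_input, tgt_mask=causal_mask)

# PrefixLM loss: cross-entropy only on the caption tokens that follow the prefix
logits = lm_head(decoder_states)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), suffix_tokens[:, 1:].reshape(-1))
print(loss)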
VLMs are more efficient than purely computer vision-based methods for visual concept classification, caption generation, visual question answering, and similar tasks. There are various pre-training methods, each with its own objective; we have discussed two of them here, contrastive learning and PrefixLM, exemplified by CLIP and SimVLM respectively. Both pre-training methods work by fusing image and text embeddings. CLIP is highly capable of zero-shot and few-shot classification, while SimVLM specializes in generative downstream tasks such as caption generation and visual question answering.
A. Tokenization is the process of splitting a text snippet into smaller units of text. For example, if a text snippet is ‘a boy is going to school’, then tokenization yields the tokens ‘a’, ‘boy’, ‘is’, ‘going’, ‘to’, and ‘school’.
A. Encoders learn embeddings from the corresponding inputs, which can be text, images, etc. The learned embeddings are used for further downstream tasks such as classification and prediction.
A. Decoders perform the desired downstream task, taking the already learned embeddings as inputs. The decoder’s output is the predicted probability for each class in the case of classification tasks, and a text snippet in the case of caption generation or VQA.
A. A transformer is a neural network-based architecture that serves as the foundational building block of LLMs.