What are Pre-training Methods of Vision Language Models?

ankita2421969 01 Jul, 2024
8 min read

Introduction

This article explores Vision Language Models (VLMs) and their advantages over traditional, vision-only computer vision models. It highlights the benefits of multimodal learning, its application to tasks such as image captioning and visual question answering, and the pre-training objectives and protocols of two representative models, OpenAI’s CLIP and Google’s SimVLM.

Learning Objectives

  • Understand how VLMs differ from solely computer vision-based models.
  • Learn about various VLM-based pre-training objectives.
  • Explore the training procedures of two state-of-the-art VLMs, CLIP and SimVLM, which rely on these pre-training objectives.
  • Identify the individual application areas of these VLMs.

This article was published as a part of the Data Science Blogathon.

Why Multimodal Learning?

Recent developments in multimodal learning draw inspiration from how humans learn: by interpreting and connecting information across many modalities, including text, images, video, audio, body motion, facial expressions, and physiological signals. Mimicking this inherent nature of human learning is a key reason why joint VLMs outperform traditional computer vision-based methods, which rely on the vision modality alone.

Power of Vision Language Models

VLMs have evolved to perform many challenging tasks with steadily increasing efficiency, including image captioning, phrase grounding (detecting an object in an input image and expressing it as a natural language phrase), text-guided image generation and manipulation, visual question answering, and detection of hate speech in social media content.

In computer vision, visual concept classification and image or video captioning have emerged as two important tasks. In this blog, we discuss how classification and caption generation based on joint vision-language modalities differ from traditional computer vision-based models. We also cover two types of VLM-based models and their training procedures, using CLIP from OpenAI and SimVLM as detailed examples.

How do VLM-based Classifications Differ From Computer Vision-based Classifications?

As opposed to conventional computer vision-based techniques that only consider visual characteristics, VLM-based classifications improve comprehension and analysis by fusing visual data with natural language.

Contextualization

Vision Language Models (VLMs) are a type of multimodal large language model that integrates an LLM with the computer vision field, so the model can both perceive images or videos and contextualize them with corresponding natural language descriptions, whereas traditional visual concept classification methods rely primarily on analyzing visual features. Contextualizing a visual source means understanding its subject or context rather than merely identifying the objects visible in it.

Because VLMs, unlike traditional methods, can learn about images and videos from text in addition to visual features, contextualization comes more easily to them than to traditional models. This ability to learn from natural language supervision is what strengthens VLMs over conventional training methods.


Transfer Learning

The inherent zero-shot and few-shot learning capability of these models lets them categorize images and videos into previously unseen or rarely seen classes based on an understanding of their context. This contrasts with conventional models, which require ample training data for every category they are expected to identify. In other words, traditional visual concept classification methods are trained to predict a predefined set of object classes, each backed by numerous examples.

This restricts their applicability when test data contains previously unseen categories or when only a handful of examples per category exist. Before VLMs, zero-data learning was explored mostly within computer vision itself. For VLMs, a critical challenge instead lies in crafting precise textual representations of class names, since these prompts are what connect a class to the model’s language understanding.
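As a small illustration of what crafting textual representations for class names looks like in practice, the snippet below builds a few prompt sentences per class name; CLIP-style models embed such prompts with the text encoder and may average the resulting embeddings (prompt ensembling). The templates here are purely illustrative, not the wording used by any particular model.

# Illustrative prompt templates for turning class names into text inputs.
# The exact wording is a design choice, not part of any specific model.
templates = [
    "a photo of a {}.",
    "a close-up photo of a {}.",
    "a low-resolution photo of a {}.",
]

def class_prompts(class_name):
    """Return one prompt sentence per template for the given class name."""
    return [template.format(class_name) for template in templates]

print(class_prompts("golden retriever"))
# ['a photo of a golden retriever.', 'a close-up photo of a golden retriever.', ...]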


Diversity in Training Data

To perform zero-shot and few-shot transfer learning efficiently, VLM-based visual concept classification methods are trained, unlike traditional methods, on computer vision datasets spanning diverse domains (for example, geo-localization, OCR, and remote sensing) together with a vast amount of image and video descriptions in raw text.

Because this kind of training incurs a tremendous cost in time and resources due to the aggregated supervision, it is standard practice to reuse pre-trained models on new examples, often with some fine-tuning. For this reason, we refer to the training process as pre-training from here onwards.

Learning Process of VLMs

A vision-language model has three main components: an image encoder, a text encoder, and a strategy for combining information from the two encoders. These components work closely together, since the loss functions are designed around both the model architecture and the learning approach. Although this field of study is hardly new, the design of vision-language models has evolved significantly over time.

The current literature primarily uses transformer-based image and text encoders to learn image and text representations, either independently or jointly. Well-chosen pre-training objectives allow these models to serve a range of downstream tasks. In this section, we discuss two types of pre-training methods, contrastive learning and PrefixLM. Both rely on fusing the vision and language modalities, but they do so in different ways.
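To make the three components concrete, here is a minimal, hypothetical PyTorch skeleton; the class name, arguments, and dimensions are illustrative only and do not reproduce the architecture of any specific model.

import torch.nn as nn

class ToyVLM(nn.Module):
    """Schematic skeleton of a vision-language model: two encoders plus a fusion step."""

    def __init__(self, image_encoder, text_encoder, image_dim, text_dim, embed_dim=512):
        super().__init__()
        self.image_encoder = image_encoder   # e.g. a ViT backbone
        self.text_encoder = text_encoder     # e.g. a text Transformer
        # Projection heads map both modalities into one shared embedding space
        self.image_proj = nn.Linear(image_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, images, token_ids):
        image_features = self.image_proj(self.image_encoder(images))
        text_features = self.text_proj(self.text_encoder(token_ids))
        # How the two are fused depends on the pre-training objective:
        # contrastive models compare these embeddings, while PrefixLM models
        # feed image features to a decoder as part of a prefix.
        return image_features, text_features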

What is Contrastive Learning?

Contrastive learning is one of the most popular and successful pre-training objectives for VLMs. Using large datasets of {image, caption} pairs, contrastive approaches jointly train a text encoder and an image encoder with a contrastive loss, bridging the vision and language modalities. Input texts and images are mapped into the same feature space so that the similarity between the embeddings of matching image-text pairs is maximized while the similarity between non-matching pairs is minimized. Contrastive Language-Image Pre-training (CLIP) is an example of such a pre-trained model for image classification.
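The snippet below is a minimal sketch of the symmetric contrastive loss used by CLIP-style models over a batch of N matching {image, caption} embeddings. It assumes the embeddings are already projected into the shared space and uses a fixed temperature; it illustrates the objective rather than reproducing the exact implementation of any released model.

import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of N matching {image, caption} pairs.

    image_emb, text_emb: tensors of shape (N, D) in the shared embedding space.
    """
    # Normalize so that the dot product equals cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (N, N) similarity matrix; diagonal entries correspond to matching pairs
    logits = image_emb @ text_emb.T / temperature
    targets = torch.arange(image_emb.size(0), device=image_emb.device)

    # Cross-entropy in both directions pulls matching pairs together
    # and pushes non-matching pairs apart
    loss_image_to_text = F.cross_entropy(logits, targets)
    loss_text_to_image = F.cross_entropy(logits.T, targets)
    return (loss_image_to_text + loss_text_to_image) / 2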

Contrastive Language-Image Pre-training (CLIP)

CLIP, introduced by OpenAI in 2021, is a state-of-the-art multimodal VLM that is highly capable of zero-data (or few-data) image classification. Its core task is learning visual representations from natural language supervision, and it achieves competitive zero-shot (or few-shot) performance on a wide variety of image classification datasets.

How Does CLIP Train?

CLIP is trained on image-text pairs in which each text is the caption of its image. The text snippets are separated from the images and fed to a text encoder, which is trained to output text features, also called text representations. CLIP uses a Transformer as its text encoder.

Similarly, the images are passed through an image encoder, such as a ViT, which acts as the computer vision backbone and is trained to produce image features or representations. The text and image embeddings have the same dimension and are projected into a shared latent space. More precisely, CLIP jointly trains the image and text encoders to maximize the cosine similarity between the embeddings of matching image-text pairs, creating a multimodal embedding space. An example notebook in the official repository contains code to run the model.


Use the commands below to set up the environment for inference with CLIP.

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install ftfy regex tqdm
$ pip install git+https://github.com/openai/CLIP.git

The code snippet below demonstrates how to classify images from the CIFAR-100 test set using CLIP, a model that was not exposed to CIFAR-100 during pre-training. The example highlights CLIP’s capability for zero-shot learning by using its pre-trained multimodal embeddings for classification. The code is available in the official GitHub repository of OpenAI CLIP.

import os
import clip
import torch
from torchvision.datasets import CIFAR100

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)

# Prepare the inputs
image, class_id = cifar100[3637]
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)

# Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")

What is PrefixLM?

Another approach to pre-training VLMs is the PrefixLM objective, which typically uses a multimodal encoder-decoder architecture in which both the encoder and the decoder are transformers. In PrefixLM, the model receives part of each image together with the corresponding caption prefix as input and predicts a plausible continuation of the caption. In other words, the prefix acts as the prompt for further prediction. The Simple Visual Language Model (SimVLM) uses this pre-training objective.
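A minimal sketch of the PrefixLM idea is shown below: the language-modeling loss is computed only on the tokens that follow the prefix, so the model learns to continue the caption rather than reconstruct the prefix. The function and argument names are illustrative and assume a generic autoregressive decoder, not SimVLM’s actual implementation.

import torch.nn.functional as F

def prefix_lm_loss(logits, token_ids, prefix_len):
    """Language-modeling loss restricted to tokens after the prefix.

    logits:     (batch, seq_len, vocab_size) decoder outputs
    token_ids:  (batch, seq_len) target token ids for the full sequence
    prefix_len: number of leading positions (image patches + prefix text)
                that receive no loss
    """
    # Shift so that the prediction at position t is scored against token t+1
    pred = logits[:, :-1, :]
    target = token_ids[:, 1:].clone()

    # Mask out the prefix: only the caption continuation is supervised
    target[:, : prefix_len - 1] = -100  # -100 is ignored by cross_entropy

    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target.reshape(-1), ignore_index=-100
    )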

What is SimVLM?

The Simple Visual Language Model (SimVLM) was introduced in 2021 and is mainly applied to image captioning and visual question answering. SimVLM follows the working principle of generative language models, which excel at predicting the next token of an input text given as a prefix. Instead of learning two distinct feature spaces, one for visual inputs and another for language inputs, as CLIP does, SimVLM aims to learn a single feature space from both types of input. We therefore refer to the learned space as a unified multimodal feature space.

How Does SimVLM Train?

During training, SimVLM receives successive image patches as input. In its architecture, the encoder takes a concatenated sequence of image patches and prefix text as the prefix input, and the decoder then predicts the subsequent textual sequence. SimVLM is first trained on a text-only dataset, with no image patches in the prefix, and then pre-trained on an aligned image-text dataset. As mentioned earlier, SimVLM learns a unified multimodal representation, which lets it perform zero-data and few-data cross-modality transfer learning with high efficiency, handling visual question answering and generating image-conditioned text and captions.
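The following is only a schematic PyTorch sketch of the encoder-decoder data flow described above, not SimVLM’s actual code: image patch embeddings are concatenated with prefix text embeddings and fed to the encoder, and the decoder predicts the remaining caption tokens. All module choices and sizes are illustrative assumptions.

import torch
import torch.nn as nn

class ToyPrefixLMModel(nn.Module):
    """Schematic SimVLM-style encoder-decoder; modules and sizes are illustrative."""

    def __init__(self, d_model=512, vocab_size=32000, patch_dim=768):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d_model)      # embed image patches
        self.token_emb = nn.Embedding(vocab_size, d_model)   # embed text tokens
        self.backbone = nn.Transformer(d_model=d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, image_patches, prefix_ids, target_ids):
        # Encoder prefix: image patch embeddings followed by prefix text embeddings
        prefix = torch.cat(
            [self.patch_proj(image_patches), self.token_emb(prefix_ids)], dim=1
        )
        # Decoder autoregressively predicts the remaining caption tokens
        causal_mask = nn.Transformer.generate_square_subsequent_mask(
            target_ids.size(1)
        ).to(target_ids.device)
        decoded = self.backbone(
            src=prefix, tgt=self.token_emb(target_ids), tgt_mask=causal_mask
        )
        return self.lm_head(decoded)  # (batch, target_len, vocab_size) logits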


Conclusion

VLMs are more effective than purely computer vision-based methods for visual concept classification, caption generation, visual question answering, and similar tasks. Various pre-training methods exist, each with its own objective; we have discussed two of them here, contrastive learning and PrefixLM, with CLIP and SimVLM as their respective examples. Both pre-training methods work by fusing image and text representations. CLIP excels at zero-shot and few-shot classification, while SimVLM specializes in generative downstream tasks such as caption generation and visual question answering.

Key Takeaways

  • In contrast to contrastive learning-based pre-training methods, PrefixLM-based methods aim to learn a unified multimodal representation.
  • Both contrastive learning and PrefixLM enable efficient zero-shot and few-shot cross-modality transfer learning, although their application areas differ.
  • Both contrastive learning and PrefixLM fuse the vision and language modalities, but in different ways.
  • Both CLIP and SimVLM adopt transformer architectures as their backbones.

References

  • Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” International conference on machine learning. PMLR, 2021.
  • https://openai.com/index/clip/
  • https://github.com/openai/CLIP/tree/main
  • https://huggingface.co/docs/transformers/en/model_doc/clip
  • https://huggingface.co/blog/vision_language_pretraining
  • Wang, Zirui, et al. “Simvlm: Simple visual language model pretraining with weak supervision.” arXiv preprint arXiv:2108.10904 (2021).

Frequently Asked Questions

Q1. What is tokenization?

A. Tokenization is the process of splitting a text snippet into smaller units called tokens. For example, tokenizing the text snippet ‘a boy is going to school’ can yield the tokens ‘a’, ‘boy’, ‘is’, ‘going’, ‘to’, and ‘school’.

Q2. What is Encoder?

A. An encoder learns embeddings from its inputs, which can be text, images, and so on. The learned embeddings are then used for downstream tasks such as classification and prediction.

Q3. What is Decoder?

A. A decoder performs the desired downstream task, taking the already learned embeddings as input. Its output is the predicted probability for each class in classification tasks, or a text snippet in caption generation and VQA.

Q4. What is Transformer?

A. A transformer is a neural network architecture that serves as the foundational building block of large language models.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
