CLIP VIT-L14: OpenAI’s Multimodal Marvel for Zero-Shot Image Classification

Maigari David Last Updated : 29 Nov, 2024
8 min read

Introduction

OpenAI's CLIP (Contrastive Language Image Pre-training) has driven significant progress in multimodal and natural language models. CLIP VIT-L14 shows how image and text processing tasks can be handled within a single model. Across its different applications, this computer vision system represents both text and images in a shared vector format.

Another great attribute of this model is its capability for zero-shot image classification and for measuring image-text similarity. Other use cases include image clustering and image search. These attributes matter because they are useful across many multimodal machine-learning applications.

Learning Outcomes

  • Understand the core architecture and functioning of OpenAI’s CLIP VIT-L14 model.
  • Learn how CLIP connects images and text using vector representations for multimodal tasks.
  • Explore the process of zero-shot image classification and image-text similarity matching.
  • Gain practical knowledge on running and fine-tuning the CLIP model for various applications.
  • Identify the key limitations and performance benchmarks of the CLIP VIT-L14 model.

This article was published as a part of the Data Science Blogathon.

What is OpenAI’s CLIP VIT L14?

This model is one of the developments initiated by OpenAI researchers to explore what makes computer vision systems strong and efficient. CLIP VIT-L14 was created to test the ‘ability of models to generalize to arbitrary image classification tasks in a zero-shot manner.’

That idea is the foundation the CLIP family is built on: CLIP provides a framework for connecting images and text, which is why it is so well suited to multimodal learning. The model is built on zero-shot transfer and natural language supervision.

This framework is what gives OpenAI's CLIP VIT-L14 its capabilities in image classification, image similarity checking, and connecting text with images, making it an efficient multimodal tool.

Model Architecture of CLIP VIT L14

The structure behind this model's processing is one of the most effective in modern computer vision. CLIP was released with two image-encoder variants: a ResNet encoder and a vision transformer (ViT) encoder.

This article uses the vision transformer variant, CLIP VIT-L14. The model follows a simple but effective two-encoder structure: a vision transformer serves as the image encoder, while a masked self-attention transformer serves as the text encoder. The two encoders are trained with a contrastive loss to maximize the similarity of matching image-text pairs, so running images and text through them produces vector representations that can be compared directly.

Figure: Model architecture of CLIP VIT-L14
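To make the contrastive pre-training idea concrete, here is a minimal sketch of the objective on dummy embeddings. This is not OpenAI's training code, just an illustration assuming a 768-dimensional joint embedding space: both sets of embeddings are L2-normalized, a scaled similarity matrix is built, and a symmetric cross-entropy loss treats the matching pairs on the diagonal as the correct classes.

import torch
import torch.nn.functional as F

# Dummy batch of 4 image and 4 text embeddings (CLIP VIT-L14 projects to 768 dimensions)
image_embeds = F.normalize(torch.randn(4, 768), dim=-1)
text_embeds = F.normalize(torch.randn(4, 768), dim=-1)

logit_scale = torch.tensor(100.0)  # learned temperature; roughly this value in the released model
logits = logit_scale * image_embeds @ text_embeds.T  # 4 x 4 image-text similarity matrix

labels = torch.arange(4)  # the i-th image matches the i-th caption
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
print(loss)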

CLIP VIT-L14: Inputs and Outputs

The model has to be trained on a dataset containing enough visual concepts. Image inputs go through the image encoder and come out as vector representations. The same applies to text: the model takes a text description and encodes it into a vector representation.

Outputs for both cases are vector representations, so you can see how well image-text pairs match. The pre-training is crucial here, as it teaches the model to predict which images were paired with which text in its dataset. Because the training data pairs images with captions such as "a photo of a dog," the model can match new text against the wide range of visual concepts it has seen.
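As a minimal sketch of this encoding step, using the Hugging Face transformers API and the same openai/clip-vit-large-patch14 checkpoint that the rest of this article loads, you can pull out the image and text vectors directly and compare them with cosine similarity (the example image URL and caption are arbitrary choices for illustration):

import torch
import requests
from PIL import Image
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

with torch.no_grad():
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=["a photo of a cat"], return_tensors="pt", padding=True)
    image_vec = model.get_image_features(**image_inputs)  # shape: (1, 768)
    text_vec = model.get_text_features(**text_inputs)     # shape: (1, 768)

print(torch.nn.functional.cosine_similarity(image_vec, text_vec))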

Features of OpenAI’s CLIP

CLIP (Contrastive Language Image Pre-training) was built on a framework that gives it several useful attributes for computer vision, and it exhibits many of these even without fine-tuning. Let's highlight a few features that come with this model.

CLIP’s Efficiency

CLIP can learn from many kinds of data, including unfiltered and highly noisy data, which is a key reason the model performs well at zero-shot transfer. Choosing the vision transformer architecture over ResNet is another crucial factor in the model's computational efficiency.

Flexibility with CLIP

Another feature that makes CLIP stand out is the breadth of concepts it learns directly from natural language, which puts it a step ahead of ImageNet-style classifiers and image-to-caption models. The result is strong zero-shot performance on a range of tasks, including image and object classification, OCR (on images and videos), and geo-localization.

Performance Benchmark of CLIP VIT-L14

Testing this model across various benchmarks has given positive results, but the key question is how it performs compared to other CLIP models. CLIP VIT-L14 has the highest accuracy when generalizing across image classes: its zero-shot accuracy on ImageNet is around 75%, while smaller CLIP models such as CLIP VIT-B32 and CLIP VIT-B16 score below 70%.
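Zero-shot benchmark numbers like the ImageNet accuracy above are typically computed by turning each class name into a prompt such as "a photo of a {label}" and picking the label whose text embedding best matches the image embedding. The zero_shot_classify helper below is not part of any benchmark code, just a hedged sketch of that idea; dataset loading and the full set of prompt templates are omitted:

import torch
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def zero_shot_classify(image, class_names):
    # Build one prompt per class and score the image against all of them
    prompts = [f"a photo of a {name}" for name in class_names]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_classes)
    probs = logits.softmax(dim=1)[0]
    return class_names[probs.argmax().item()], probs

# Example usage with any PIL image:
# label, probs = zero_shot_classify(image, ["cat", "dog", "car"])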

Running the Model

There are several ways to use this CLIP model: you can input an image, run zero-shot classification, and get the output as vector representations, or you can run inference through an API.

Step 1: Importing Necessary Libraries for Image Processing

We’ll begin by importing the essential libraries needed to process images and interact with the CLIP VIT-L14 model, ensuring we have the right tools for image manipulation and analysis.

from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

This code snippet imports the libraries needed for image processing: ‘PIL’ is essential for opening, saving, and modifying images, while ‘requests’ handles fetching the image data from a URL or path before it goes to the processor.

The CLIPProcessor pre-processes the input data (images and text) before feeding it into the CLIPModel, which performs the actual inference and generates predictions or embeddings from the input data.

Step 2: Loading the Pre-trained CLIP Model

We will load the pre-trained CLIP VIT-L14 model, which has been trained to produce aligned image and text embeddings, giving us a robust foundation for accurate image-text analysis and zero-shot classification.

Using a pre-trained model is important because it streamlines the workflow: we can leverage the representations learned during pre-training to get accurate image-to-text understanding without training anything from scratch.

The CLIP processor also handles a key part of the processing: ensuring that the input is compatible with the model so that the image and text can be processed effectively.

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
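Optionally, if a GPU is available, you can move the model there and switch it to evaluation mode. This is a common convenience step rather than something the article's code requires; if you do it, remember that the processed inputs must later be moved to the same device with .to(device):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)  # inputs created later must also be moved with .to(device)
model.eval()              # inference only; disables dropout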

Step 3: Image Processing

The image processing step begins by defining the image URL; ‘requests’ then downloads the image from the web, and PIL opens it before the processor prepares the image and text.

With this code in place, the model can handle image and text inputs for tasks like matching or classification. Here we have the URL of the image alongside the text inputs "a photo of a cat" and "a photo of a dog."

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)


inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)
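If you want to see what the processor actually produced, the returned object behaves like a dictionary of tensors: tokenized text plus the resized, normalized image. A quick inspection (key names may vary slightly across transformers versions):

for name, tensor in inputs.items():
    print(name, tuple(tensor.shape))
# Expected keys are roughly: input_ids, attention_mask, pixel_values
# pixel_values should have shape (1, 3, 224, 224) for this checkpoint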

Output

The goal of this classification is to measure the match, or similarity, between the text and the image. The code below produces the similarity scores for the preprocessed inputs (image and text), and the score for each label is then converted into a probability.

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
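To inspect these numbers, you can print both the raw logits and the softmax probabilities; the exact values may vary slightly across library versions and hardware:

print("logits:", logits_per_image.detach().cpu().numpy().round(1))
print("probs:", probs.detach().cpu().numpy().round(4))
# Roughly: logits around [18.9, 11.7], probabilities around [0.999, 0.001]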

The text-image similarity score predicts which of the inputs ("a cat" or "a dog") matches the image better. In the output, the raw scores are roughly 18.9 and 11.7, respectively, indicating that the first label ("a cat") has a higher text-image similarity score than the second one ("a dog").

Limitations of the CLIP Model

Despite its efficiency and accuracy in image classification and zero-shot settings, CLIP still has a few limitations. The model can struggle with counting objects and with tasks like fine-grained classification, which involve more complex categories and subcategories.

Here is an example that highlights this limitation:

inputs = processor(text=["a photo of a cat", "a photo of a dog", "a photo of a bulldog","a photo of a german shepherd", "a photo of a dalmatian", "a persian cat", "a siamese cat"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities

Fine-grained classification means categorizing objects within a subcategory; in this case, different breeds of cats and dogs appear in the input. As the output here shows, CLIP struggles to classify the different breeds of cats and dogs accurately.
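A convenient way to see this struggle, reusing the probs tensor from the snippet above, is to print each label alongside its probability, sorted from most to least likely; with closely related breeds, the probability mass tends to spread out rather than concentrate on a single correct label:

labels = ["a photo of a cat", "a photo of a dog", "a photo of a bulldog",
          "a photo of a german shepherd", "a photo of a dalmatian",
          "a persian cat", "a siamese cat"]
for label, p in sorted(zip(labels, probs[0].tolist()), key=lambda x: -x[1]):
    print(f"{label}: {p:.3f}")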

Counting Objects
This model was not built to count objects, so its text-image similarity scores can be inaccurate for count-based prompts, as shown in the example below:

url = "https://images.unsplash.com/photo-1517331156700-3c241d2b4d83?q=80&w=1468&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image = Image.open(requests.get(url, stream=True).raw)


inputs = processor(text=["a photo of one cat", "a photo of two cats", "a photo of three cats", "a photo of four cats", "a photo of five cats"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities

Here, the output gives a lower similarity score for two cats (16.9) than for one cat (20.7), which suggests the model considers the image more likely to contain one cat than two. But the image actually contains four cats, so the score for the correct count should have been the highest.

Application of CLIP VIT-L14 Model

CLIP is already making its way into various industries through a range of applications, and its potential with further fine-tuning is also one to watch. Here are some applications of CLIP you can find today:

  • Finding images through search has become easier, and the architecture of models like CLIP makes this process even more streamlined (a minimal search sketch follows this list).
  • The model's multimodal capabilities pair images with text, so CLIP can help generate image captions and retrieve images from a large collection using a simple text description.
  • One of CLIP's major features is its zero-shot classification ability, which is useful for building photo organization and cataloging tools.
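As an illustration of the image-search idea from the list above, here is a minimal sketch. It assumes you have already built a tensor of image embeddings for your collection with get_image_features; the search_images helper and its arguments are hypothetical, not part of the library API, and a real system would precompute and index these vectors:

import torch
import torch.nn.functional as F

def search_images(query, image_embeds, model, processor, top_k=5):
    # image_embeds: (num_images, 768) tensor of precomputed CLIP image features
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        query_vec = model.get_text_features(**text_inputs)  # shape: (1, 768)
    scores = F.cosine_similarity(query_vec, image_embeds)   # shape: (num_images,)
    return scores.topk(min(top_k, len(scores)))

# Example: values, indices = search_images("a dog playing in snow", image_embeds, model, processor)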

Conclusion

OpenAI's exploration of CLIP shows there is much more it can do with computer vision. The model uses a vision transformer architecture, which gives it computational efficiency, and its zero-shot classification abilities and multimodal nature allow for a wide range of applications. However, it is important to understand the model's limitations as well as its capabilities when building on its pre-trained weights.


Key Takeaways

  • CLIP's multimodal capability of connecting images and text is a big factor in its strong performance on tasks like zero-shot image classification, image clustering, and image search. It represents both images and text as vector embeddings.
  • The model can classify images despite being trained on large, unfiltered datasets, and its vision transformer architecture keeps it computationally efficient.
  • The model has some limitations, which are especially visible in tasks that involve counting objects and fine-grained classification.

Frequently Asked Questions

Q1. What is CLIP VIT-L14 used for?

A. This is used to connect images and text in computer vision models. It can perform tasks such as zero-shot image classification, image-text similarity matching, and multimodal machine learning applications like image search and clustering.

Q2. What are the limitations of the CLIP model? 

A. CLIP can struggle with counting objects and with fine-grained classification tasks, such as distinguishing closely related subcategories.

Q3. How does CLIP VIT-L14 process image-text data? 

A. The model encodes image and text inputs into vector representations, compares them to find similarities, and generates classification outputs.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Hey there! I'm David Maigari, a dynamic professional with a passion for technical writing, Web Development, and the AI world. David is also an enthusiast of data science and AI innovations.
