CLIP VIT-L14: OpenAI’s Multimodal Marvel for Zero-Shot Image Classification

maigari74807 23 Sep, 2024
8 min read

Introduction

OpenAI’s CLIP (Contrastive Language-Image Pre-training) marked a major step forward for multimodal models that combine vision and natural language. CLIP VIT-L14 demonstrates how image and text processing can be handled by a single model: it represents both images and text as vectors, and those vector representations power a wide range of applications.

Another notable strength of this model is zero-shot image classification and scoring how well an image matches a text description. Other use cases include image clustering and image search. These capabilities make it useful across many multimodal machine-learning applications.

Learning Outcomes

  • Understand the core architecture and functioning of OpenAI’s CLIP VIT-L14 model.
  • Learn how CLIP connects images and text using vector representations for multimodal tasks.
  • Explore the process of zero-shot image classification and image-text similarity matching.
  • Gain practical knowledge on running and fine-tuning the CLIP model for various applications.
  • Identify the key limitations and performance benchmarks of the CLIP VIT-L14 model.

This article was published as a part of the Data Science Blogathon.

What is OpenAI’s CLIP VIT L14?

This model is one of the developments initiated by OpenAI researchers to see what makes computer vision systems strong and efficient. CLIP VIT LARGE 14 was created to test the ‘ability of models to generalize to arbitrary image classification tasks in a zero-shot manner.’

That goal shows in how CLIP was built: it provides a framework for connecting images and text, which is exactly what makes it well suited to multimodal learning. The model rests on two ideas, zero-shot transfer and natural language supervision.

This framework is what gives OpenAI’s CLIP VIT-L14 its capabilities in image classification, image similarity checks, and connecting text with images, making it an efficient multimodal tool.

Model Architecture of CLIP VIT L14

The architecture behind this model is one of the most effective in modern computer vision. The original CLIP release came with two families of image encoders: a ResNet variant and a Vision Transformer (ViT) variant.

This article focuses on the Vision Transformer variant, CLIP VIT-L14. The design is simple and effective: a Vision Transformer serves as the image encoder, while a Transformer with masked self-attention serves as the text encoder. The two encoders are trained jointly with a contrastive loss to maximize the similarity of matching image-text pairs, so running an image or a piece of text through the model yields a vector representation in a shared embedding space.

Model Architecture of CLIP VIT L14
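
To make the dual-encoder idea concrete, here is a minimal sketch using the Hugging Face transformers API. The blank placeholder image is an assumption for illustration; any PIL image works the same way. Each encoder maps its input to a fixed-size vector in the shared embedding space.

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Placeholder image for illustration; a real photo loaded from disk or a URL works the same way
image = Image.new("RGB", (224, 224))

image_inputs = processor(images=image, return_tensors="pt")
text_inputs = processor(text=["a photo of a dog"], return_tensors="pt", padding=True)

image_embeds = model.get_image_features(**image_inputs)  # output of the image encoder
text_embeds = model.get_text_features(**text_inputs)     # output of the text encoder

print(image_embeds.shape, text_embeds.shape)  # one fixed-size vector per image / per text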

CLIP VIT-L14: Inputs and Outputs

During pre-training, the model sees a large image dataset covering a wide range of visual concepts. At inference time, an image input passes through the image encoder and comes out as a vector representation. The same applies to text: the model takes a text description and encodes it into a vector representation.

Because both outputs are vectors, you can measure how similar any image-text pair is and how well they match. The contrastive pre-training is what makes this work: the model learned to predict which images were paired with which captions in its training data, so at inference time a prompt such as “a photo of a dog” can be matched against the wide range of visual concepts the model has learned.
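
The matching itself is simple vector math. The sketch below uses random tensors as stand-ins for real CLIP embeddings (the 768 dimension is an illustrative choice): normalize the vectors, take dot products to get similarity scores, and apply a softmax to turn the scores into probabilities. The real model additionally scales the scores by a learned temperature before the softmax.

import torch

# Stand-in embeddings: 1 image vector and 3 caption vectors of the same dimensionality
image_embed = torch.randn(1, 768)
text_embeds = torch.randn(3, 768)

# L2-normalize so the dot product becomes cosine similarity
image_embed = image_embed / image_embed.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

similarity = image_embed @ text_embeds.T  # one score per image-text pair
probs = similarity.softmax(dim=-1)        # probabilities over the candidate captions
print(similarity)
print(probs)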

Features of OpenAI’s CLIP

CLIP (Contrastive Language-Image Pre-training) was designed around a framework that gives it several useful properties out of the box, even without fine-tuning. Let’s highlight a few of them.

CLIP’s Efficiency

CLIP can learn from many kinds of data, including unfiltered and highly noisy image-text pairs, which is a big part of why it performs well at zero-shot transfer. Choosing a Vision Transformer over a ResNet image encoder is another crucial factor in the model’s computational efficiency.

Flexibility with CLIP

Another feature that makes CLIP stand out is that it learns visual concepts directly from natural language rather than from a fixed label set such as ImageNet’s. The result is strong zero-shot performance across a variety of tasks, including image and object classification, OCR (in images and videos), and geo-localization.

Performance Benchmark of CLIP VIT-L14

Testing this model across various benchmarks has produced strong results, and the key question is how it compares to the other CLIP checkpoints. CLIP VIT-L14 has the highest accuracy of the CLIP family when it comes to generalizing across image classes: its zero-shot accuracy on ImageNet is around 75%, while smaller variants such as CLIP VIT-B32 and CLIP VIT-B16 score below 70%.
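
Those numbers come from large-scale evaluations, but as a rough illustration you can compare checkpoints on a single example simply by swapping the checkpoint name. The sketch below assumes the three standard OpenAI checkpoints on the Hugging Face Hub; it is not a substitute for a proper benchmark.

from PIL import Image
import requests
from transformers import CLIPModel, CLIPProcessor

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog"]

for name in ["openai/clip-vit-base-patch32",
             "openai/clip-vit-base-patch16",
             "openai/clip-vit-large-patch14"]:
    model = CLIPModel.from_pretrained(name)
    processor = CLIPProcessor.from_pretrained(name)
    inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
    probs = model(**inputs).logits_per_image.softmax(dim=1)
    print(name, probs.tolist())  # per-label probabilities for this one image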

Running the Model

There are various ways to use this CLIP model: you can input an image with candidate text labels to run zero-shot classification and get the outputs as vector representations, or you can run inference through an API.
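
One convenient option, assuming a reasonably recent version of transformers, is the zero-shot image classification pipeline, which wraps the processor and the model into a single call:

from transformers import pipeline

classifier = pipeline("zero-shot-image-classification",
                      model="openai/clip-vit-large-patch14")

# The pipeline accepts an image URL (or a PIL image) plus candidate labels
results = classifier(
    "http://images.cocodataset.org/val2017/000000039769.jpg",
    candidate_labels=["a photo of a cat", "a photo of a dog"],
)
print(results)  # a list of {"label": ..., "score": ...} entries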

Step 1: Importing Necessary Libraries for Image Processing

We’ll begin by importing the essential libraries needed to process images and interact with the CLIP VIT-L14 model, ensuring we have the right tools for image manipulation and analysis.

from PIL import Image
import requests
from transformers import CLIPProcessor, CLIPModel

This snippet imports the libraries we need: ‘PIL’ handles opening, saving, and modifying images, while ‘requests’ fetches the image data from a URL before it is handed to the processor.

The CLIPProcessor pre-processes the input data (images and text) before feeding it into the CLIPModel, which performs the actual inference and generates predictions or embeddings from the input data.

Step 2: Loading the Pre-trained CLIP Model

We will load the pre-trained CLIP VIT-L14 model, which produces image and text embeddings and provides a robust foundation for image analysis and image-text matching tasks.

Using a pre-trained model is important because it streamlines the whole process: we can rely on what the model already learned during pre-training instead of training for image-to-text understanding from scratch.

The CLIP processor also handles a key part of the processing: ensuring that the input is compatible with the model so that the image and text can be processed effectively.

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
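
Optionally, if a GPU is available, you can move the model onto it for faster inference. This is a small sketch, not part of the original walkthrough; any inputs created later must be moved to the same device.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)  # move the model weights to the GPU when one is available
model.eval()              # put the model in inference mode

# Later, send the processed inputs to the same device, e.g.:
# inputs = {k: v.to(device) for k, v in inputs.items()}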

Step 3: Image Processing

The image processing step begins by defining the image URL; ‘requests’ then downloads the image from the web, and ‘PIL’ opens it before the processor prepares the image and text for the model.

With this code in place, the model can handle image and text inputs for tasks like matching or classification. Here we pass the image URL alongside two text inputs, “a photo of a cat” and “a photo of a dog”.

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)


inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

Output

The goal of this classification is to measure how well each text label matches the image. The code below produces similarity scores for the preprocessed inputs (image and text), and a softmax then converts each label’s score into a probability.

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities

The text-image similarity score predicts which of the inputs (“a photo of a cat” or “a photo of a dog”) matches the image better. In the output, the logits are roughly 18.9 and 11.7, respectively, which means the first label (“a photo of a cat”) has a noticeably higher similarity score than the second (“a photo of a dog”).
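
To read the result more directly, a small follow-up snippet (assuming the logits_per_image and probs tensors from the previous step are still in scope) can print each label next to its score and probability:

labels = ["a photo of a cat", "a photo of a dog"]
for label, logit, prob in zip(labels, logits_per_image[0].tolist(), probs[0].tolist()):
    print(f"{label}: logit={logit:.1f}, probability={prob:.4f}")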

Limitations of the CLIP Model

Despite its efficiency and accuracy in zero-shot image classification, CLIP still has a few limitations. The model struggles with counting objects and with fine-grained classification, where it has to distinguish between closely related categories and subcategories.

Here is an example that highlights this limitation:

inputs = processor(text=["a photo of a cat", "a photo of a dog", "a photo of a bulldog","a photo of a german shepherd", "a photo of a dalmatian", "a persian cat", "a siamese cat"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities

Fine-grained classification means categorizing objects into narrow subcategories; here, the prompts include different breeds of cats and dogs. In the resulting output, CLIP struggles to identify the specific breed accurately.

Counting Objects
This model was not built to count objects, so its text-image similarity scores can be inaccurate for counting prompts, as shown in the example below:

url = "https://images.unsplash.com/photo-1517331156700-3c241d2b4d83?q=80&w=1468&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image = Image.open(requests.get(url, stream=True).raw)


inputs = processor(text=["a photo of one cat", "a photo of two cats", "a photo of three cats", "a photo of four cats", "a photo of five cats"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities

Here, the output gives a lower similarity score for “two cats” (16.9) than for “one cat” (20.7), suggesting the model considers one cat more likely than two. But the image actually contains four cats, so a correct model would assign the highest score to that label.

Applications of the CLIP VIT-L14 Model

CLIP is already making its way into various industries, and its potential with further fine-tuning is worth watching too. Here are some working applications of CLIP you can find today:

  • Image search becomes much more streamlined with the architecture of models like CLIP: a text query and a gallery of images can be embedded into the same space and ranked by similarity (a minimal search sketch follows this list). 
  • With its multimodal image-text matching capabilities, CLIP can help generate image captions and retrieve images from a large collection using a simple text description. 
  • One of CLIP’s major features is its zero-shot classification ability, which is useful for building photo organization and cataloging tools. 
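
As a minimal sketch of the image-search idea from the first bullet, you can embed a gallery of images once, embed a text query, and rank the gallery by cosine similarity. The single COCO URL and the example query below are stand-ins for a real gallery and a real user query.

import torch
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# A tiny stand-in "gallery"; in practice these would be your own image files
gallery_urls = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
gallery = [Image.open(requests.get(u, stream=True).raw) for u in gallery_urls]

with torch.no_grad():
    image_embeds = model.get_image_features(**processor(images=gallery, return_tensors="pt"))
    query_embeds = model.get_text_features(
        **processor(text=["two cats lying on a couch"], return_tensors="pt", padding=True)
    )

# Normalize, then rank the gallery by cosine similarity to the text query
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
query_embeds = query_embeds / query_embeds.norm(dim=-1, keepdim=True)
scores = (query_embeds @ image_embeds.T).squeeze(0)
ranking = scores.argsort(descending=True).tolist()
print([(gallery_urls[i], round(scores[i].item(), 3)) for i in ranking])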

Conclusion

OpenAI’s exploration of CLIP shows how much more can be done with computer vision. The model’s Vision Transformer architecture gives it computational efficiency, and its zero-shot classification ability and multimodal nature allow for a wide range of applications. However, it is important to understand the model’s capabilities and limitations before building on it.


Key Takeaways

  • CLIP’s multimodal ability to connect images and text, by representing both as vector embeddings, is a big factor in its strong performance on tasks like zero-shot image classification, image clustering, and search. 
  • The model can classify images even though it was trained on large, unfiltered datasets, and its Vision Transformer architecture helps keep it computationally efficient. 
  • The model has some limitations, and these are especially visible for tasks that involve counting objects and fine-grained classification. 

Frequently Asked Questions

Q1. What is CLIP VIT-L14 used for?

A. CLIP VIT-L14 is used to connect images and text in computer vision systems. It can perform tasks such as zero-shot image classification, image-text similarity matching, and multimodal applications like image search and clustering.

Q2. What are the limitations of the CLIP model? 

A. CLIP can struggle with counting objects and with fine-grained classification tasks, such as distinguishing closely related subcategories.

Q3. How does CLIP VIT-L14 process image-text data? 

A. The model encodes image and text inputs into vector representations, compares them to find similarities, and generates classification outputs.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.


Hey there! I'm David Maigari, a dynamic professional with a passion for technical writing, web development, and the AI world. David is also an enthusiast of data science and AI innovations.
