How to Use MetaCLIP?

Maigari David Last Updated : 27 Feb, 2025
7 min read

Artificial intelligence is always evolving, and OpenAI’s CLIP has long been a standout technology thanks to its performance and architecture. Its multimodal capabilities also underpin other models such as OWL-ViT and DALL-E. Researchers at Meta AI created MetaCLIP by building on CLIP’s data curation concept: the model keeps CLIP’s zero-shot image classification abilities while making the data curation process far more transparent. So, let’s discuss how this model works in zero-shot classification and other features, such as image similarity. We will also highlight the dynamics and performance of MetaCLIP compared to CLIP.

Learning Objectives

  • Understand MetaCLIP’s step forward from CLIP’s architecture. 
  • Learn about the performance benchmark of MetaCLIP. 
  • Get insight into the architecture of this model. 
  • Run inference with MetaCLIP for zero-shot image classification and image similarity checks. 
  • Highlight the limitations and some real-life applications of MetaCLIP. 

This article was published as a part of the Data Science Blogathon.

What is MetaCLIP?

Meta AI developed MetaCLIP, a model that takes a new approach to language-image pre-training. Its focus is data curation: trained on over 400 million curated image-text pairs, the model delivers strong results at high accuracy. 

The model curates its data from metadata, building on CLIP’s concept as introduced in the paper “Demystifying CLIP Data.” MetaCLIP also supports various applications thanks to its features: you can cluster images based on attributes like shape and color, compare two images, and match text with images. 

Understanding MetaCLIP: A Step Forward From CLIP

To understand what makes MetaCLIP special, you need some background on how CLIP was developed. CLIP was a standout model because it introduced zero-shot classification to computer vision, but a fascinating fact about the model is that its foundation and strength come from its careful data curation process. 

However, CLIP’s data source is not publicly accessible, which makes its curation process opaque. This is where MetaCLIP steps in: it follows CLIP’s concept but curates its data from metadata, and it openly documents and shares the data collection process. 

Performance Benchmark

MetaCLIP delivered superior numbers compared to CLIP on various benchmarks, thanks to a more refined data source and curation process. For example, trained on a dataset of 400 million image-text pairs, the model attained 70% accuracy on zero-shot classification, slightly edging out CLIP’s 68%. 

Accuracy improves further to 72% when the data is scaled to 1 billion data points, and reaches as high as 80% on ViT models of different sizes. 

Model Architecture

As mentioned earlier, CLIP’s foundation was built on its dataset rather than its architecture. With little information available about CLIP’s data source, MetaCLIP takes its own approach with metadata-curated datasets, which follows these principles: 

  • The researchers used a brand-new dataset of over 400M image-text pairs collected from various online repositories.
  • MetaCLIP also ensures metadata text entries are linked and clearly mapped to their corresponding textual content.
  • MetaCLIP also establishes a formal algorithm for the data curation process, improving scalability and reducing space complexity. 
  • The model also uses a matching technique (sub-string matching of captions against metadata entries) to bridge the gap between unstructured text and structured metadata. 
  • Using this mapping, the matched entries are also balanced so that no single concept dominates. This helps ensure an even data distribution and supports the pre-training effort (see the sketch after the figure below).
Figure: CLIP architecture (Source: Hugging Face)
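
To make the matching and balancing ideas concrete, here is a minimal, hypothetical Python sketch. The metadata entries, image-text pairs, and the per-entry cap below are toy values chosen purely for illustration; they are not MetaCLIP’s actual metadata or thresholds, and the real pipeline operates on hundreds of millions of pairs.

import random
from collections import defaultdict

# Toy stand-ins: the real metadata has hundreds of thousands of entries and
# the real pool has hundreds of millions of image-text pairs.
metadata_entries = ["doctor", "apple watch", "dog", "sunset"]
pairs = [
    ("img_001.jpg", "A doctor checking an apple watch"),
    ("img_002.jpg", "A dog running at sunset"),
    ("img_003.jpg", "Another photo of a dog"),
]

# 1) Matching: link every caption to the metadata entries it mentions.
matches = defaultdict(list)
for image, caption in pairs:
    for entry in metadata_entries:
        if entry in caption.lower():
            matches[entry].append((image, caption))

# 2) Balancing: cap how many pairs any single entry can contribute, so that
#    very frequent ("head") concepts do not dominate the curated dataset.
max_per_entry = 2  # illustrative cap; the paper uses a much larger threshold
curated = []
for entry, entry_pairs in matches.items():
    random.shuffle(entry_pairs)
    curated.extend(entry_pairs[:max_per_entry])

print(f"Kept {len(curated)} image-text pairs after balancing")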

How to Use MetaCLIP?

MetaCLIP performs well in various tasks, from zero-shot image classification to detecting image similarity. Although there are other ways to use this model, including image retrieval (prompts) and clustering, we will focus on zero-shot image classification. 

Step 1: Importing Necessary Libraries

This code imports the pipeline function from the transformers library, which is used to apply pre-trained models for various AI tasks. It also imports the Image module from the PIL (Pillow) library to handle image processing tasks like opening and modifying images. 

from transformers import pipeline
from PIL import Image

Step 2: Loading Image

Although you can load the image for zero-shot classification from an image URL, using an image path is an effective alternative for images on your local device. The ‘Image.open’ function opens the image with the PIL library and prepares it for further processing.

image_path = "/content/Bald doctor.jpeg"
image = Image.open(image_path)
Output: (the loaded image is displayed)

Step 3: Initializing the Model

Now we will initialize the model.

pipe = pipeline("zero-shot-image-classification", model="facebook/metaclip-b16-fullcc2.5b")

Step 4: Defining the Labels

The next step is to set candidate labels, which are possible categories into which the model can classify the image input. 

# Define candidate labels
candidate_labels = ["doctor", "scientist", "businessman", "teacher", "artist"]

The model will process the image and score each candidate label, so you can see which of the following it predicts: a doctor, scientist, businessman, teacher, or artist. Since the model has a text-image matching feature, it can report how strongly each label matches the image. 

Step 5: Printing Output

result = pipe(image, candidate_labels=candidate_labels)

The pipe function processes the image and assigns a probability score to each candidate label, so you get a result for each label along with its confidence score. 

print(result)

Output:

The output shows that ‘doctor’ has the highest confidence score (0.99106) compared to the other labels.
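
If you only need the single best label programmatically rather than reading the full list, a small sketch like this works on the pipeline’s output, which is a list of dictionaries with ‘label’ and ‘score’ keys:

# The pipeline returns a list of {"score": ..., "label": ...} dictionaries.
# Picking the entry with the highest score gives the predicted class.
top = max(result, key=lambda item: item["score"])
print(f"Predicted label: {top['label']} (score: {top['score']:.4f})")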

Using MetaCLIP for Image-text Similarity  

Another amazing feature of this model is its ability to judge how similar two images are by assigning them a confidence score. I’ll illustrate how you can compare two images with MetaCLIP: 

Importing Libraries

Although the libraries are already available in the environment from the previous example, we import them again, just as in the zero-shot classification setup. 

from transformers import pipeline
from PIL import Image

Initializing the MetaCLIP Model

Next, we initialize the MetaCLIP pipeline that will handle the image similarity task. The image input is the next important step of the process; loading and processing the images you want to compare is shown right after.

pipe = pipeline("zero-shot-image-classification", model="facebook/metaclip-b16-fullcc2.5b")

Image Processing

We need to load the two images we want to check for similarity, opening each one from its specified path with the ‘Image.open’ function from the PIL library. Here, image1 holds the first image from the file path "/content/Alphine Loop Apple Band.jpg", and image2 holds the second image from the file path "/content/Apple Watch Reset.jpg".

image1 = Image.open("/content/Alphine Loop Apple Band.jpg")
image2 = Image.open("/content/Apple Watch Reset.jpg")

We enter the image path to load images to be processed. 

Here are the two images below (both smartwatch photos):

Defining Text Descriptions

Since MetaCLIP is a text-image matching model, describe each image in text form. You can then check how well image1 and image2 match each other’s description.

description1 = "An orange apple watch"              # Example for image1
description2 = "An apple watch with a black band"   # Example for image2

Results

This code performs a cross-similarity check between two images using a zero-shot classification approach with the MetaCLIP pipeline. The first image (image1) is classified using the textual description of the second image (description2), and vice versa (image2 with description1). The resulting confidence scores (result1 and result2) indicate how well each image matches the description of the other.

result1 = pipe(image1, candidate_labels=[description2])
result2 = pipe(image2, candidate_labels=[description1])

Printing the Results

Finally, you display the similarity scores with print(). The first line shows how well image1 matches the description of image2, and the second line shows the reverse. 

print("Similarity Score (Image1 → Image2):", result1)
print("Similarity Score (Image2 → Image1):", result2)

The images are considered similar if both scores are high and different if the scores are low. In the result below, both directions return a score of 1.0. Note, however, that each call passes only a single candidate label, and the pipeline normalises scores across the candidates it is given, so a score of 1.0 here mainly confirms the match rather than measuring its strength; passing both descriptions as candidate labels, or comparing image embeddings directly (see the sketch below the output), gives a more informative measure. 

Output:
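
For a score that directly measures how close the two images are, you can compare their embeddings instead of routing through candidate labels. The sketch below is an assumption-laden alternative: it assumes the facebook/metaclip-b16-fullcc2.5b checkpoint loads as a standard CLIP-style model in transformers (as its model card suggests) and reuses image1 and image2 from above.

import torch
from transformers import AutoModel, AutoProcessor

# Load the same MetaCLIP checkpoint as a raw model to access its embeddings.
model = AutoModel.from_pretrained("facebook/metaclip-b16-fullcc2.5b")
processor = AutoProcessor.from_pretrained("facebook/metaclip-b16-fullcc2.5b")

# Encode both images into MetaCLIP's shared embedding space.
inputs = processor(images=[image1, image2], return_tensors="pt")
with torch.no_grad():
    embeddings = model.get_image_features(**inputs)

# Cosine similarity between the two L2-normalised image embeddings.
embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
similarity = (embeddings[0] @ embeddings[1]).item()
print(f"Image-image cosine similarity: {similarity:.4f}")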

You can check out the linked Colab notebook to see the full code file.

Applications of MetaCLIP

There are various ways to use this model across industries. These include: 

  • MetaCLIP can help create AI systems for image search. Its image clustering ability means it can group images that share certain attributes and retrieve them when matched with a text query (see the sketch after this list).
  • Image captioning and generation is another application of CLIP-style models. MetaCLIP can guide or rank text-to-image generation by scoring how well candidate images match a text prompt.
  • This model is also useful for image combination, as you can merge elements from different images using textual prompts. 
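
As a hypothetical illustration of the image-search idea from the first bullet, the sketch below embeds a small library of local images (the file names are placeholders, not files from this article) and a text query with the same MetaCLIP checkpoint, then ranks the images by cosine similarity to the query.

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Hypothetical local image library -- replace these paths with your own files.
image_paths = ["watch.jpg", "beach.jpg", "laptop.jpg"]
images = [Image.open(path) for path in image_paths]
query = "a laptop on a wooden desk"

model = AutoModel.from_pretrained("facebook/metaclip-b16-fullcc2.5b")
processor = AutoProcessor.from_pretrained("facebook/metaclip-b16-fullcc2.5b")

# Embed the text query and every image into the shared embedding space.
with torch.no_grad():
    text_inputs = processor(text=[query], return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)
    image_inputs = processor(images=images, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)

# Rank the images by cosine similarity to the query.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = (image_emb @ text_emb.T).squeeze(-1)
for path, score in sorted(zip(image_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{path}: {score:.4f}")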

Limitations of MetaCLIP

Let us now explore the limitations of MetaCLIP:

  • Potential Bias: MetaCLIP inherits biases from its online-sourced training data, leading to fairness concerns.
  • Data Dependency: The model’s performance relies heavily on the quality and diversity of metadata curation.
  • Computational Cost: Training and inference require significant resources, making it less accessible for smaller applications.
  • Limited Interpretability: The decision-making process in zero-shot classification lacks full explainability.
  • Ethical Concerns: Issues related to data privacy and responsible AI use remain a challenge.

Conclusion 

MetaCLIP represents a significant advancement over CLIP by improving data curation transparency and refining image-text pair matching. Its superior performance in zero-shot classification and image similarity tasks showcases its potential across various AI applications. With a well-structured metadata-based approach, MetaCLIP enhances scalability and accuracy while addressing some of CLIP’s data limitations. 

However, ethical concerns regarding data sourcing and potential biases remain a challenge. Despite this, MetaCLIP’s innovative architecture and high accuracy make it a powerful tool for advancing multimodal AI applications.

Key Takeaways

  • The data transparency that MetaCLIP brings to the table is a standout factor. It improves the data curation process using a metadata-based dataset of over 400 million image-text pairs.
  • MetaCLIP’s strong performance in zero-shot image classification is another key takeaway from its benchmarks. 
  • MetaCLIP can be used for image search, clustering, captioning, and even text-to-image generation. 

Frequently Asked Questions

Q1. What makes MetaCLIP different from OpenAI’s CLIP?

A. MetaCLIP improves upon CLIP by providing a more structured and transparent metadata-based data curation process. This leads to better performance in zero-shot classification and image-text matching.

Q2. Can MetaCLIP be fine-tuned for specific tasks?

A. While MetaCLIP is primarily designed for zero-shot classification, it can be fine-tuned for specific applications, such as custom image retrieval and captioning.

Q3. What are the ethical concerns associated with MetaCLIP?

A. Since MetaCLIP sources its data from online repositories, it may inherit biases related to cultural and social perspectives, which could impact its fairness and accuracy.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Hey there! I'm David Maigari, a dynamic professional with a passion for technical writing, web development, and the AI world. I'm also an enthusiast of data science and AI innovations.
