Artificial intelligence is always evolving, and OpenAI’s CLIP has long been a standout technology thanks to its performance and architecture. Its multimodal capabilities have also helped advance other models, such as OWL-ViT and DALL-E. Researchers at Meta AI built MetaCLIP on CLIP’s data curation concept. The model improves on CLIP’s zero-shot image classification and makes the data curation process more transparent. So, let’s discuss how this model handles zero-shot classification and other tasks, such as image similarity. We will also highlight how MetaCLIP’s performance compares with CLIP’s.
This article was published as a part of the Data Science Blogathon.
Meta AI developed MetaCLIP, a model that brings a new approach to language-image pre-training. It focuses on data curation and, trained on over 400 million image-text pairs, delivers strong results at high accuracy.
The model curates its data from metadata, using CLIP’s concept as introduced in the paper “Demystifying CLIP Data.” MetaCLIP also has various applications thanks to its features: it lets you cluster images based on attributes like shape and color, compare two images, and match text with images.
To understand what makes MetaCLIP distinctive, you need some background on how CLIP was developed. CLIP was a standout model because it introduced zero-shot classification to computer vision, but one fascinating fact about it is that its foundation and structure come from its careful data curation process.
However, CLIP’s data source is not publicly accessible, which makes its curation process opaque. This is where MetaCLIP steps in: it follows CLIP’s concept but works from metadata, and it improves and openly shares the data collection process.
With a more refined data source and curation process, MetaCLIP delivers superior numbers to CLIP on various benchmarks. For example, trained on a dataset of 400 million image-text pairs, it attains 70% accuracy on zero-shot classification, slightly edging out CLIP’s 68%. Accuracy improves further to 72% when the data is scaled to 1 billion pairs, and MetaCLIP reaches as high as 80% with ViT models of different sizes.
As mentioned earlier, CLIP’s foundation was built on its dataset rather than its architecture. With little information available about CLIP’s data source, MetaCLIP takes its own approach, curating its dataset from metadata according to the principles laid out in “Demystifying CLIP Data.”
MetaCLIP performs well in various tasks, from zero-shot image classification to detecting image similarity. Although there are other ways to use this model, including image retrieval (prompts) and clustering, we will focus on zero-shot image classification.
This code imports the pipeline function from the transformers library, which is used to apply pre-trained models for various AI tasks. It also imports the Image module from the PIL (Pillow) library to handle image processing tasks like opening and modifying images.
from transformers import pipeline
from PIL import Image
Although you can load the image for zero-shot classification from an image URL, using a local file path is an effective alternative for images on your device (a sketch of the URL-based option appears after the snippet below). The Image.open function from the PIL library opens the image and prepares it for further processing.
image_path = "/content/Bald doctor.jpeg"
image = Image.open(image_path)
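If the image you want to classify is hosted online instead, a minimal sketch of the URL-based option could look like this (the URL here is a placeholder, not a real asset):
import requests
from PIL import Image

# Placeholder URL -- replace with a link to a real image
image_url = "https://example.com/sample-image.jpeg"
image = Image.open(requests.get(image_url, stream=True).raw)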
Now we will initialize the model.
pipe = pipeline("zero-shot-image-classification", model="facebook/metaclip-b16-fullcc2.5b")
The next step is to set candidate labels, which are possible categories into which the model can classify the image input.
# Define candidate labels
candidate_labels = ["doctor", "scientist", "businessman", "teacher", "artist"]
The model processes the image and scores each candidate label: doctor, scientist, businessman, teacher, or artist. Since MetaCLIP matches text to images, it returns a probability indicating how well each label fits the image.
result = pipe(image, candidate_labels=candidate_labels)
The function here (pipe) will access the image and assign a probability score to each label. Thus, you will get a result for each label and its associated confidence score.
print(result)
Output:
The output shows that ‘doctor’ has the highest confidence score (0.99106) compared to the other labels.
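If you would rather extract the top prediction programmatically than read the raw output, a small sketch (assuming the pipeline’s usual list-of-dicts format with “label” and “score” keys) is:
# The zero-shot pipeline returns a list of {"label", "score"} dicts
top = max(result, key=lambda r: r["score"])
print(f"Predicted label: {top['label']} (confidence: {top['score']:.4f})")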
Another impressive feature of this model is its ability to estimate how similar two images are by assigning them confidence scores. Here is how you can compare two images with MetaCLIP:
As in the zero-shot classification example, we start by importing the necessary libraries (they may already be present in the environment).
from transformers import pipeline
from PIL import Image
Next, we initialize the MetaCLIP pipeline for the image similarity task. Providing the image inputs is the next important step; loading and processing the images you want to compare is done as shown below.
pipe = pipeline("zero-shot-image-classification", model="facebook/metaclip-b16-fullcc2.5b")
We now load the two images we want to compare, opening each from its specified path with PIL’s Image.open function. So, image1 holds the first image from the file path “/content/Alphine Loop Apple Band.jpg”, and image2 stores the second image from the file path “/content/Apple Watch Reset.jpg”.
image1 = Image.open("/content/Alphine Loop Apple Band.jpg")
image2 = Image.open("/content/Apple Watch Reset.jpg")
Here are the two images:
description1 = "An orange apple watch" # Example for image1
description2 = "An apple watch with a black band" # Example for image2
Since MetaCLIP is a text-image matching model, we describe each image in text form, as done with description1 and description2 above. You can then check how well image1 and image2 match each other’s descriptions.
This code performs a cross-similarity check between two images using a zero-shot classification approach with the MetaCLIP pipeline. The first image (image1) is classified using the textual description of the second image (description2), and vice versa (image2 with description1). The resulting confidence scores (result1 and result2) indicate how well each image matches the description of the other.
result1 = pipe(image1, candidate_labels=[description2])
result2 = pipe(image2, candidate_labels=[description1])
Print results
Finally, you display the similarity scores with print. The first line shows how well image1 matches the description of image2, and the second does the reverse.
print("Similarity Score (Image1 → Image2):", result1)
print("Similarity Score (Image2 → Image1):", result2)
The images are considered similar if both scores are high and dissimilar if both are low. As shown in the result below, the similarity here is evident (score = 1.0).
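If you want a symmetric, text-free measure of similarity instead of cross-matching descriptions, a minimal sketch is to compare the cosine similarity of the two image embeddings directly. This assumes the MetaCLIP checkpoint loads with the standard Hugging Face CLIP classes:
import torch
from transformers import CLIPModel, CLIPProcessor

# Assumption: the MetaCLIP checkpoint is compatible with the standard CLIP classes
model = CLIPModel.from_pretrained("facebook/metaclip-b16-fullcc2.5b")
processor = CLIPProcessor.from_pretrained("facebook/metaclip-b16-fullcc2.5b")

inputs = processor(images=[image1, image2], return_tensors="pt")
with torch.no_grad():
    embeddings = model.get_image_features(**inputs)

# Normalize and take the dot product to get cosine similarity (closer to 1.0 = more similar)
embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
cosine_similarity = (embeddings[0] @ embeddings[1]).item()
print("Cosine similarity (image1, image2):", round(cosine_similarity, 4))
A score near 1.0 indicates near-identical content, which is consistent with the pipeline result above.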
You can check out the link to see the code file on Colab.
There are various ways to use this model across industries. These include:
Let us now explore the limitations of MetaCLIP:
MetaCLIP represents a significant advancement over CLIP by improving data curation transparency and refining image-text pair matching. Its superior performance in zero-shot classification and image similarity tasks showcases its potential across various AI applications. With a well-structured metadata-based approach, MetaCLIP enhances scalability and accuracy while addressing some of CLIP’s data limitations.
However, ethical concerns regarding data sourcing and potential biases remain a challenge. Despite this, MetaCLIP’s innovative architecture and high accuracy make it a powerful tool for advancing multimodal AI applications.
A. MetaCLIP improves upon CLIP by providing a more structured and transparent metadata-based data curation process. This leads to better performance in zero-shot classification and image-text matching.
A. While MetaCLIP is primarily designed for zero-shot classification, it can be fine-tuned for specific applications, such as custom image retrieval and captioning.
A. Since MetaCLIP sources its data from online repositories, it may inherit biases related to cultural and social perspectives, which could impact its fairness and accuracy.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.