Image classification has found wide real-world application as computer vision models have become better and more accurate. These models have many use cases, and zero-shot classification and image-text pair matching are among the most popular.
Google’s SigLIP image classification model is a prominent example, and it posts benchmark results that set it apart. It is an image embedding model that builds on the CLIP framework but uses a better loss function.
This model also works solely on image-text pairs, matching them and providing vector representations and probabilities. SigLIP allows for image classification in smaller batches while accommodating further scaling. What sets Google’s SigLIP apart is the sigmoid loss, which takes it a level above CLIP. The model is trained to judge each image-text pair individually, rather than comparing all pairs at once to see which matches best.
This article was published as a part of the Data Science Blogathon.
This model uses a framework similar to CLIP (Contrastive Language-Image Pre-training), with one key difference. SigLIP is a multimodal computer vision model, which gives it an edge in performance. It uses a vision transformer (ViT) encoder for images: each image is divided into patches, which are then linearly embedded into vectors.
For text, SigLIP uses a transformer encoder that converts the input text sequence into dense embeddings.
So, the model can take images as input and perform zero-shot image classification. It can also take text as input, which is helpful for search queries and image retrieval. One output is a set of image-text similarity scores, which can be used to retrieve images from descriptions when a task demands it. Another is the probability that the input image matches each text, which is the basis of zero-shot classification.
Another part of this model’s architecture is its language learning capability. As mentioned earlier, the Contrastive Language-Image Pre-training framework is the model’s backbone, and it also helps align the image and text representations.
Inference with the model is streamlined, and users can achieve strong performance on the two major tasks: zero-shot classification and image-text similarity scoring.
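The patch step on the image side can be illustrated with a toy sketch. This is not SigLIP's actual code: the 4x4 single-channel image, the patch size of 2, the weight matrix, and the helper names `split_into_patches` and `linear_embed` are all hypothetical, chosen only to show how an image becomes a sequence of patch vectors.

```python
# Toy sketch of ViT-style patch embedding (hypothetical sizes and weights,
# not SigLIP's real implementation).

def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = [
                image[top + dy][left + dx]
                for dy in range(patch_size)
                for dx in range(patch_size)
            ]
            patches.append(patch)
    return patches


def linear_embed(patch, weights):
    """Project one flattened patch with a weight matrix (one row per output dim)."""
    return [sum(w * p for w, p in zip(row, patch)) for row in weights]


# Hypothetical 4x4 "image" with one channel.
toy_image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

# Hypothetical projection to a 2-dimensional embedding.
toy_weights = [
    [1, 0, 0, 0],              # keeps the top-left pixel of each patch
    [0.25, 0.25, 0.25, 0.25],  # averages the four pixels of each patch
]

patches = split_into_patches(toy_image, patch_size=2)
embeddings = [linear_embed(patch, toy_weights) for patch in patches]
print(patches)
print(embeddings)
```

In a real vision transformer the projection weights are learned, and position information is added before the patch vectors go through the transformer layers.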
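Image retrieval from a text query, one of the uses described above, comes down to comparing a text embedding with a set of image embeddings. The sketch below assumes the embeddings have already been produced by the two encoders; the vectors, file names, and the `rank_images` helper are made up for illustration, and cosine similarity is used here as a simple stand-in for the model's own scoring.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(text_embedding, image_embeddings):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(text_embedding, embedding))
        for name, embedding in image_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-dimensional embeddings; real ones have many more dimensions.
query_embedding = [1.0, 0.0, 0.0]
gallery = {
    "box.jpg": [0.9, 0.1, 0.0],
    "plane.jpg": [0.0, 1.0, 0.0],
    "remote.jpg": [0.3, 0.3, 0.9],
}

ranking = rank_images(query_embedding, gallery)
for name, score in ranking:
    print(name, round(score, 4))
```

The same ranking idea works in the other direction: one image embedding compared against several text embeddings gives the per-label scores used for zero-shot classification.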
The change in this model’s architecture has a few consequences. The sigmoid loss opens up the possibility of further scaling with batch size. However, there is still more to be done on performance and efficiency compared with other similar CLIP models.
The latest research aims to shape-optimize this model, with SoViT-400m being examined. It will be interesting to see how its performance compares with other CLIP-like models.
You can run inference in a few steps. First, import the necessary libraries. Next, load the image, either from a link or from a file on your device. Then run the model and read its output, such as the ‘logits,’ to get text-image similarity scores and probabilities. Here is how to start:
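A pairwise sigmoid loss of the kind described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' training code: `similarities[i][j]` is assumed to be the similarity between image i and text j, with the true pairs on the diagonal, and the `temperature` and `bias` defaults are illustrative values only.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(similarities, temperature=10.0, bias=-10.0):
    """Average sigmoid loss over a square image-by-text similarity matrix.

    Each pair is treated as its own binary problem: +1 for matching pairs
    (the diagonal) and -1 for every other pair. No term depends on the
    rest of the batch.
    """
    n = len(similarities)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = temperature * similarities[i][j] + bias
            total += log_sigmoid(label * logit)
    return -total / n


# Hypothetical similarity matrices for a batch of two image-text pairs.
well_aligned = [[1.0, 0.0], [0.0, 1.0]]
misaligned = [[0.0, 1.0], [1.0, 0.0]]

print(pairwise_sigmoid_loss(well_aligned))
print(pairwise_sigmoid_loss(misaligned))
```

Because every term involves only one image-text pair, the loss needs no normalization across the whole batch, which is what makes it easier to vary the batch size.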
from transformers import pipeline
from PIL import Image
import requests
This code imports the libraries needed to load and process images and to run pre-trained models obtained from Hugging Face (HF). PIL handles loading and manipulating the image, while the pipeline from the transformers library streamlines the inference process.
Together, these libraries can retrieve an image from the internet and process it using a machine-learning model for tasks like classification or detection.
This step initializes the zero-shot image classification task using the transformers library and loads the pre-trained model.
# load pipe
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-so400m-patch14-384")
This code loads an image from a local file using PIL. Store the image’s location in ‘image_path’ to identify it in your code; the ‘Image.open’ function then reads it.
# load image
image_path = '/pexels-karolina-grabowska-4498135.jpg'
image = Image.open(image_path)
Alternatively, you can use the image URL, as shown in the code block below:
# load image from a URL
url = 'https://images.pexels.com/photos/4498135/pexels-photo-4498135.jpeg'
response = requests.get(url, stream=True)
image = Image.open(response.raw)
The classifier scores the image against each candidate label. The label with the highest score, here “a box,” is taken as the best match for the image.
# inference
outputs = image_classifier(image, candidate_labels=["a box", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
print(outputs)
The output is a list of labels with their scores.
The “a box” label gets the highest score, 0.877, while the other labels do not come close.
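Since each label gets its own score, it can be useful to accept the top label only when its score is high enough. The helper below is a small sketch of that idea; `best_label` and the `min_score` threshold of 0.5 are assumptions for illustration, and in the sample data only the 0.877 score for “a box” comes from the run above, while the other two scores are invented.

```python
def best_label(outputs, min_score=0.5):
    """Return the top-scoring label, or None when no score reaches min_score."""
    top = max(outputs, key=lambda item: item["score"])
    return top["label"] if top["score"] >= min_score else None


# Same list-of-dicts shape as the `outputs` variable printed above.
# Only 0.877 is from the article; the other scores are hypothetical.
sample_outputs = [
    {"score": 0.877, "label": "a box"},
    {"score": 0.001, "label": "a plane"},
    {"score": 0.0005, "label": "a remote"},
]

print(best_label(sample_outputs))
```

Returning None when every score is low gives the calling code a way to say that none of the candidate labels fits the image.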
The sigmoid is the difference maker in this model’s architecture. The original CLIP model uses the softmax function, which normalizes scores across all candidate labels and so always pushes the image toward one of them. The sigmoid loss function removes this problem, as Google researchers found a way around it.
Here is a typical example:
With CLIP, even when the correct image class is not among the labels, the model still tries to produce a prediction, which will be inaccurate. SigLIP removes this problem with a better loss function. If you try the same task with labels that do not describe the image, every label receives a low score, which is the more accurate result. The example below illustrates this:
With an image of a box in the input, you get an output of 0.0001 for each label.
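The contrast can be reproduced with a few lines of arithmetic. The logits below are hypothetical values, picked to represent an image that matches none of the labels; they are not outputs of either model.

```python
import math


def softmax(logits):
    """Turn logits into scores that always sum to 1 across the labels."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Score each logit independently of the others."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


# Hypothetical logits for three labels, none of which describes the image.
labels = ["a plane", "a remote", "a car"]
logits = [-9.0, -9.5, -10.0]

softmax_scores = softmax(logits)
sigmoid_scores = sigmoid(logits)

for label, soft, sig in zip(labels, softmax_scores, sigmoid_scores):
    print(f"{label}: softmax={soft:.4f} sigmoid={sig:.4f}")
```

With softmax, one label still ends up with more than half of the probability mass even though all three logits are strongly negative. With the sigmoid, all three scores stay close to zero, matching the behavior described above.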
This model has several major uses. Among the most popular potential applications are image classification, image captioning, image retrieval through text descriptions, and visual question answering.
Google SigLIP offers a major improvement in image classification with the Sigmoid function. This model improves accuracy by focusing on individual image-text pair matches, allowing better performance in zero-shot classification tasks.
SigLIP’s ability to scale and provide higher precision makes it a powerful tool in applications like image search, captioning, and visual question answering. Its innovations position it as a standout in the realm of multimodal models.
A. SigLIP uses a Sigmoid loss function, which allows for individual image-text pair matching and leads to better classification accuracy than CLIP’s softmax approach.
A. SigLIP has applications for tasks such as image classification, image captioning, image retrieval through text descriptions, and visual question answering.
A. SigLIP classifies images by comparing them with provided text labels, even if the model hasn’t been trained on those specific labels, making it ideal for zero-shot classification.
A. The Sigmoid loss function helps avoid the limitations of the softmax function by independently evaluating each image-text pair. This results in more accurate predictions without forcing a single class output.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.