Google’s SigLIP: A Significant Momentum in CLIP’s Framework

Maigari David Last Updated : 02 Oct, 2024

Introduction

Image classification has found wide application in real life as computer vision models have become better and more accurate. These models have many use cases, and zero-shot classification and image-text matching are among the most popular.

Google’s SigLIP image classification model is a prominent example, with benchmark results that make it stand out. It is an image embedding model built on the CLIP framework but trained with a better loss function.

This model works on image-text pairs, matching them and producing vector representations and probabilities. SigLIP performs well when trained with smaller batches while still accommodating further scaling. What sets Google’s SigLIP apart is the sigmoid loss, which takes it a level above CLIP. The model is trained to judge each image-text pair individually rather than comparing all the pairs in a batch to see which matches best.

Learning Objectives

  • Understand SigLIP’s framework and model overview.
  • Learn about SigLIP’s state-of-the-art performance.
  • Learn about the sigmoid loss function.
  • Gain insight into some real-life applications of this model.

This article was published as a part of the Data Science Blogathon.

Model Architecture of Google’s SigLIP Model

This model uses a framework similar to CLIP (Contrastive Language-Image Pre-training) but with one key difference. SigLIP is a multimodal model that processes both images and text, which gives it an edge in performance. It uses a vision transformer encoder for images, which means each image is divided into patches that are then linearly embedded into vectors.
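The patch step can be sketched in plain Python. This is only a toy illustration, assuming a single-channel image stored as a list of rows; the function name `split_into_patches` and the 4x4 example are made up for this sketch and are not part of SigLIP's code. In a real vision transformer, each flattened patch would then be multiplied by a learned projection matrix to produce its embedding.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid (list of rows) into flattened, non-overlapping square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    # Walk over the top-left corner of every patch, row by row
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches

# Toy 4x4 single-channel "image" with pixel values 0..15 (hypothetical data)
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = split_into_patches(toy_image, patch_size=2)
print(len(patches))  # 4
print(patches[0])    # [0, 1, 4, 5]
```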

For text, SigLIP uses a transformer encoder that converts the input text sequence into dense embeddings.

So, the model can take images as input and perform zero-shot image classification. It can also take text as input, which is helpful for search queries and image retrieval. One output is a set of image-text similarity scores, which can be used to retrieve images that match a description. Another is the probability that an input image matches each input text, which is the basis of zero-shot classification.
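To make the link between embeddings, similarity scores, and probabilities concrete, here is a minimal sketch. It assumes the image and text encoders have already produced embedding vectors; the three-dimensional vectors, the function names, and the `scale` and `bias` values are all made up for illustration and stand in for the model's real outputs and learned parameters.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def pair_probability(image_emb, text_emb, scale=20.0, bias=-10.0):
    """Probability that one image and one text match.

    scale and bias are hypothetical stand-ins for learned parameters.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    cosine = sum(a * b for a, b in zip(img, txt))
    # Sigmoid of the scaled and shifted similarity, computed per pair
    return 1.0 / (1.0 + math.exp(-(scale * cosine + bias)))

# Toy embeddings (made up for illustration)
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a box": [0.8, 0.2, 0.1],
    "a plane": [0.1, 0.9, 0.3],
}
scores = {label: pair_probability(image_embedding, emb)
          for label, emb in text_embeddings.items()}
print(scores)
```

Each label is scored on its own, so the probabilities are not required to sum to 1.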

Another part of this model’s architecture is its language-image learning. As mentioned earlier, the Contrastive Language-Image Pre-training framework is the model’s backbone, and it is also what aligns the image and text representations.

At inference time the workflow is simple, and users can achieve great performance on the two major tasks, namely zero-shot classification and image-text similarity scoring.

What to Expect: Scaling and Performance Insights of SigLIP

The change of loss function has a few consequences. Because the sigmoid loss treats every image-text pair independently, it does not need a normalization over the whole batch, which opens the possibility of further scaling of the batch size. However, there is still more to be done on performance and efficiency compared with other CLIP-like models.
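A pairwise sigmoid loss of this kind can be sketched as follows. This is a simplified, pure-Python illustration rather than Google's implementation: it assumes unit-length embeddings, treats pair (i, i) as the matching pair and every other pair as a negative, and uses `scale` and `bias` defaults chosen only for the example.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def sigmoid_pairwise_loss(image_embs, text_embs, scale=10.0, bias=-10.0):
    """Sum of independent binary losses over all image-text pairs, averaged per image."""
    n = len(image_embs)
    total = 0.0
    for i in range(n):
        for j in range(n):
            similarity = sum(a * b for a, b in zip(image_embs[i], text_embs[j]))
            label = 1.0 if i == j else -1.0  # +1 for the true pair, -1 otherwise
            total += -log_sigmoid(label * (scale * similarity + bias))
    return total / n

# Two toy unit-length embeddings per modality (made up for illustration)
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = sigmoid_pairwise_loss(images, matched_texts)
shuffled_loss = sigmoid_pairwise_loss(images, shuffled_texts)
print(aligned_loss < shuffled_loss)  # True
```

Every term in the double loop depends on a single pair only, so the sum can be computed in chunks, which is what makes larger batches easier to handle than with a softmax that normalizes over the whole batch.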

Recent research also explores shape-optimized versions of this model, such as SoViT-400m. It will be interesting to see how its performance compares to other CLIP-like models.

Running Inference with SigLIP: Step-by-Step Guide

Here is how you run inference in a few steps. The first step is importing the necessary libraries. Next, you load the image, either from a link or from a file on your device. Finally, you run the model and read its output scores, which are derived from the model’s logits and give the text-image similarity and probability for each label. Here is how to start:

Importing Necessary Libraries

from transformers import pipeline
from PIL import Image
import requests

This code imports the libraries needed to load and process images and to run pre-trained models obtained from Hugging Face (HF). PIL loads and manipulates the image, while the pipeline function from the transformers library streamlines the inference process.

Together, these libraries can retrieve an image from the internet and process it using a machine-learning model for tasks like classification or detection.

Loading the Pre-trained Model

This step sets up the zero-shot image classification task using the transformers library and starts the process by loading the pre-trained model.

# load pipe
image_classifier = pipeline(task="zero-shot-image-classification", model="google/siglip-so400m-patch14-384")

Preparing the Image

This code loads an image from a local file using PIL. Store the file’s location in ‘image_path’ so you can refer to it in your code. Then the ‘Image.open’ function reads it.

# load image
image_path = '/pexels-karolina-grabowska-4498135.jpg'
image = Image.open(image_path)

Alternatively, you can load the image from a URL:

url = 'https://images.pexels.com/photos/4498135/pexels-photo-4498135.jpeg'
# stream=True exposes the raw response body as a file-like object that PIL can read
response = requests.get(url, stream=True)
image = Image.open(response.raw)

Output

The model chooses the label with the highest score as the best match for the image, “a box.”

# inference
outputs = image_classifier(image, candidate_labels=["a box", "a plane", "a remote"])
outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
print(outputs)

The output is a list of labels with their scores. The label “a box” gets the highest score, 0.877, while the other labels do not come close.

Performance Benchmarks: SigLIP vs. Other Models

The sigmoid is the difference maker in this model’s architecture. The original CLIP model applies a softmax across the candidate labels, so the scores always sum to 1 and the model effectively assumes that one of the labels must fit the image. The sigmoid loss function removes this constraint by scoring each image-text pair independently.

The following example illustrates the difference.

With CLIP, even when the correct class is not among the candidate labels, the model still commits to one of them, and that prediction is inaccurate. SigLIP avoids this problem thanks to its loss function: if you run the same task with labels that do not describe the image, every label receives a low score instead of one label being forced to win. For example, with an image of a box as input and candidate labels that do not describe it, the output is 0.0001 for each label.
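The contrast between the two scoring functions can be reproduced with a few lines of Python. The logits below are hypothetical values standing in for three labels that all fit the image poorly; they are not outputs of CLIP or SigLIP.

```python
import math

def softmax(logits):
    """Normalize logits into scores that always sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Score a single logit independently of the others."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logits for three labels that all match the image poorly
logits = [-9.2, -9.5, -9.0]

softmax_scores = softmax(logits)
sigmoid_scores = [sigmoid(x) for x in logits]

print([round(s, 4) for s in softmax_scores])  # one label still looks like a winner
print([round(s, 4) for s in sigmoid_scores])  # every label stays close to zero
```

With the softmax, the least bad label ends up with a sizeable share of the probability, while the sigmoid leaves every score near zero, which signals that none of the labels fits.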

Application of SigLIP Model

This model has several major uses. These are some of the most popular potential applications:

  • Image search: you can create a search engine that lets users find images based on text descriptions.
  • Image captioning: SigLIP can serve as the vision component of a system that captions and analyses images.
  • Visual question answering: you can fine-tune the model to answer questions about images and their content.
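The image search use case can be sketched as a ranking over precomputed embeddings. This is a toy example: the file names, the three-dimensional vectors, and the `search` helper are hypothetical, and in practice the embeddings would come from the model's image and text encoders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_embedding, image_index, top_k=2):
    """Return the names of the top_k images most similar to the text query."""
    scored = [(cosine_similarity(query_embedding, emb), name)
              for name, emb in image_index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]

# Hypothetical precomputed image embeddings, keyed by file name
image_index = {
    "box.jpg": [0.9, 0.1, 0.0],
    "plane.jpg": [0.0, 0.2, 0.9],
    "crate.jpg": [0.7, 0.6, 0.1],
}
# Hypothetical embedding of a text query such as "a cardboard box"
query_embedding = [1.0, 0.2, 0.0]

results = search(query_embedding, image_index)
print(results)  # ['box.jpg', 'crate.jpg']
```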

Conclusion

Google’s SigLIP offers a major improvement in image classification thanks to its sigmoid loss function. This model improves accuracy by focusing on individual image-text pair matches, allowing better performance in zero-shot classification tasks.

SigLIP’s ability to scale and provide higher precision makes it a powerful tool in applications like image search, captioning, and visual question answering. Its innovations position it as a standout in the realm of multimodal models.

Key Takeaways

  • Google’s SigLIP model improves on other CLIP-like models by using a sigmoid loss function, which enhances accuracy and performance in zero-shot image classification.
  • SigLIP excels in tasks involving image-text pair matching, enabling more precise image classification and offering capabilities like image captioning and visual question answering.
  • The model supports scalability for large batch sizes and is versatile across various use cases, such as image retrieval, classification, and search engines based on text descriptions.

Frequently Asked Questions

Q1. What is the key difference between SigLIP and CLIP models?

A. SigLIP uses a Sigmoid loss function, which allows for individual image-text pair matching and leads to better classification accuracy than CLIP’s softmax approach.

Q2. What are the main applications of Google’s SigLIP model?

A. SigLIP has applications for tasks such as image classification, image captioning, image retrieval through text descriptions, and visual question answering.

Q3. How does SigLIP handle zero-shot classification tasks?

A. SigLIP classifies images by comparing them with provided text labels, even if the model hasn’t been trained on those specific labels, making it ideal for zero-shot classification.

Q4. What makes the Sigmoid loss function beneficial for image classification?

A. The Sigmoid loss function helps avoid the limitations of the softmax function by independently evaluating each image-text pair. This results in more accurate predictions without forcing a single class output.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Hey there! I'm David Maigari, a dynamic professional with a passion for technical writing, web development, and the AI world. David is also an enthusiast of data science and AI innovations.
