Boosting image search capabilities has become a critical focus in the realm of digital asset management, e-commerce, and social media platforms. With the ever-increasing volume of visual content generated daily, the need for efficient and accurate image retrieval systems is more pressing than ever. Enter SigLIP 2 (Sigmoid Loss for Language-Image Pre-Training), a state-of-the-art multilingual vision-language encoder developed by Google DeepMind, which promises to revolutionize how we approach image similarity and search tasks. Its innovative architecture not only improves semantic understanding but also excels in zero-shot classification and image-text retrieval. By utilizing a unified training approach that incorporates self-supervised learning and diverse data curation, SigLIP 2 outperforms previous models in extracting meaningful visual representations.
CLIP, which stands for Contrastive Language-Image Pre-training, is a groundbreaking multimodal model developed by OpenAI in 2021. It bridges the gap between computer vision and natural language processing by learning a shared representation space for images and text. This innovative approach allows CLIP to understand and correlate both modalities simultaneously, enabling it to perform tasks like zero-shot image classification, image-text retrieval, and captioning.
Learn More: CLIP VIT-L14: OpenAI’s Multimodal Marvel for Zero-Shot Image Classification
The key components of CLIP are a Text Encoder and an Image Encoder, tied together by a Contrastive Learning Mechanism. This mechanism aligns the representations of text and images by maximizing the similarity between matching pairs and minimizing it for non-matching pairs.
CLIP is trained on a large dataset of image-text pairs, typically involving hundreds of millions of examples. The model learns to predict the most relevant text snippet given an image and vice versa.
Also Read: Google’s SigLIP: A Significant Momentum in CLIP’s Framework
In CLIP, one encoder for images and another for text map the input images and texts to latent representations (embeddings). Once we have the embeddings from the encoders, a similarity score (a dot product) is calculated between each image and text pair, measuring how similar the image and text embeddings are. To train the model to match the correct text to an image (and vice versa), a loss function is used whose objective is to maximize the similarity score for matching image-text pairs and minimize it for non-matching ones.
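To make this concrete, here is a minimal PyTorch sketch of the pairwise similarity computation; the random tensors below simply stand in for real encoder outputs, and the batch size and embedding dimension are arbitrary.

import torch
import torch.nn.functional as F

# Stand-in embeddings for a batch of 4 image-text pairs (random tensors in place
# of real encoder outputs; 512 is an arbitrary embedding dimension)
image_emb = F.normalize(torch.randn(4, 512), dim=-1)  # image encoder output
text_emb = F.normalize(torch.randn(4, 512), dim=-1)   # text encoder output

# Entry (i, j) is the dot-product similarity between image i and text j.
# Training pushes the diagonal (matching pairs) up and the off-diagonal entries down.
similarity = image_emb @ text_emb.T
print(similarity.shape)  # torch.Size([4, 4])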
In CLIP, the softmax function is applied to the model's similarity scores to turn them into a probability distribution over the batch for every image (and, symmetrically, for every text).
In CLIP, this normalization (seen in the denominators) is performed independently twice: once across images and once across texts, as shown in the loss function below.
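With x_i and y_j denoting the normalized image and text embeddings, t a learned temperature, and B the batch size, the softmax-based loss can be written as:

$$
\mathcal{L}_{CLIP} = -\frac{1}{2B}\sum_{i=1}^{B}\left(\log\frac{e^{t\, x_i \cdot y_i}}{\sum_{j=1}^{B} e^{t\, x_i \cdot y_j}} + \log\frac{e^{t\, x_i \cdot y_i}}{\sum_{j=1}^{B} e^{t\, x_j \cdot y_i}}\right)
$$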
The first term in the above equation finds the best text match for a given query image while the second term finds the best image match for a given query text. “B” is the batch size.
SigLIP, developed by Google, follows a similar framework to CLIP but overcomes the main drawback of CLIP's softmax-based loss, its dependence on a batch-wide normalization, by using a sigmoid-based loss that operates independently on each image-text pair. The following is the sigmoid loss function used in SigLIP:
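With the same notation as above, and with z_ij = 1 when image i and text j form a matching pair and z_ij = -1 otherwise (t is a learnable temperature and b a learnable bias), the loss is:

$$
\mathcal{L}_{SigLIP} = -\frac{1}{B}\sum_{i=1}^{B}\sum_{j=1}^{B}\log\frac{1}{1 + e^{\, z_{ij}\,(-t\, x_i \cdot y_j - b)}}
$$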
| CLIP | SigLIP | Inference |
|------|--------|-----------|
| Softmax-based loss | Sigmoid-based loss | SigLIP's loss is neither asymmetric nor dependent on a global normalization factor. As a result, the loss for each pair, whether positive or negative, is independent of the other pairs in the mini-batch. |
| Each GPU stores an N x N matrix to compute all pairwise similarities | No need to store an N x N matrix, as each positive/negative pair is handled independently | Reduces computational overhead due to the memory-efficient loss calculation |
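For intuition, here is a minimal PyTorch sketch of the pairwise sigmoid loss; it is an illustrative re-implementation rather than the official one, and the toy tensors and fixed values of t and b are placeholders.

import torch
import torch.nn.functional as F

def siglip_sigmoid_loss(img_emb, txt_emb, t, b):
    # L2-normalize the embeddings before taking dot products
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T * t + b        # (B, B) pairwise logits
    labels = 2 * torch.eye(logits.size(0)) - 1  # +1 on the diagonal (matches), -1 elsewhere
    # each pair is an independent binary problem: -log sigmoid(label * logit),
    # summed over all pairs and averaged over the batch size B
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)

# toy usage: a batch of 4 image-text pairs with 768-dimensional embeddings
img, txt = torch.randn(4, 768), torch.randn(4, 768)
print(siglip_sigmoid_loss(img, txt, t=10.0, b=-10.0))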
SigLIP 2 models outperform the previous SigLIP versions at all model scales in key areas such as zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). One standout feature is the dynamic resolution (naflex) version, which is especially useful for tasks sensitive to aspect ratio and resolution.
SigLIP 2 introduces a text decoder alongside the existing image and text encoders during training. With the LocCa (Location-aware Captioner) objective, a transformer decoder with cross-attention over the vision encoder's output is added to achieve two key goals: predicting bounding box coordinates for regions referred to in text (referring expression prediction) and generating captions grounded in specific image regions (grounded captioning).
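As a rough illustration only (the layer sizes, depth, and token counts below are made up and do not reflect SigLIP 2's actual decoder configuration), a decoder cross-attending to vision-encoder tokens looks like this in PyTorch:

import torch
import torch.nn as nn

# Placeholder tensors: 576 image-patch tokens and 32 caption tokens, hidden size 768
vision_tokens = torch.randn(1, 576, 768)   # output of the vision encoder
caption_tokens = torch.randn(1, 32, 768)   # embedded caption/query tokens

decoder_layer = nn.TransformerDecoderLayer(d_model=768, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)

# the decoder self-attends over the text tokens and cross-attends to the
# vision-encoder output passed in as "memory"
out = decoder(tgt=caption_tokens, memory=vision_tokens)
print(out.shape)  # torch.Size([1, 32, 768])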
To improve fine-grained local semantics in the image representation, SigLIP 2 adds two further objectives, trained via self-distillation: a Global-Local Loss and a Masked Prediction Loss.
Since image models can be highly sensitive to changes in resolution and aspect ratio, SigLIP 2 introduces two approaches for handling this: a FixRes variant trained at fixed resolutions and a NaFlex variant that supports native aspect ratios and variable resolutions (a minimal loading sketch follows below).
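For reference, the dynamic-resolution variant can be loaded in the same way as the fixed-resolution checkpoints used later in this tutorial; the checkpoint name below is the base-size NaFlex checkpoint from the SigLIP 2 release on Hugging Face and may differ for other model sizes.

from transformers import AutoModel, AutoProcessor

# NaFlex variant: the processor preserves the native aspect ratio instead of
# resizing every image to a fixed square resolution
naflex_ckpt = "google/siglip2-base-patch16-naflex"
naflex_model = AutoModel.from_pretrained(naflex_ckpt)
naflex_processor = AutoProcessor.from_pretrained(naflex_ckpt)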
Now that we have covered some of the key differentiating features of SigLIP 2, let us build an image retrieval system using it in Python.
In the following hands-on tutorial, we will build an image retrieval system that returns similar products when the user searches with an image query. We will also compare the responses from SigLIP 2 against those from SigLIP. We will be using the T4 GPU (free tier) on Google Colab for the implementation.
!pip install datasets sentencepiece
!pip install faiss-cpu
# install the latest version of transformers from source (a recent version is needed for SigLIP 2 support)
!pip install git+https://github.com/huggingface/transformers
import torch
import faiss
from torchvision import transforms
from PIL import Image
from transformers import AutoProcessor, SiglipModel, AutoImageProcessor, AutoModel, AutoTokenizer
import numpy as np
import requests
device = torch.device('cuda' if torch.cuda.is_available() else "cpu")
# load the SigLIP 2 checkpoint (the original SigLIP checkpoint is loaded later for comparison)
model = AutoModel.from_pretrained("google/siglip2-base-patch16-384").to(device)
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-384")
tokenizer = AutoTokenizer.from_pretrained("google/siglip2-base-patch16-384")
def add_vector(embedding, index):
    vector = embedding.detach().cpu().numpy()
    vector = np.float32(vector)
    faiss.normalize_L2(vector)
    index.add(vector)

def embed_siglip(image):
    with torch.no_grad():
        inputs = processor(images=image, return_tensors="pt").to(device)
        image_features = model.get_image_features(**inputs)
    return image_features
add_vector: This function takes a tensor embedding, normalizes it, and adds it to a FAISS index for efficient similarity searching.
embed_siglip: This function takes an image, processes it, passes it through a model to obtain its embedding (feature representation), and returns these features.
API_TOKEN=""
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.huggingface.co/rows?dataset=ceyda/fashion-products-small&config=default&split=train"
def query():
    response = requests.get(API_URL, headers=headers)
    return response.json()
data = query()
Here we load an image dataset, fetching its rows with the requests library after first defining the Hugging Face API token. It is a dataset of fashion products.
index = faiss.IndexFlatL2(768)
# read each image and add its vector to the index
for elem in data["rows"]:
    url = elem["row"]["image"]["src"]
    image = Image.open(requests.get(url, stream=True).raw)
    # generate the embedding of the image
    clip_features = embed_siglip(image)
    # add the vector to FAISS
    add_vector(clip_features, index)

# save the index
faiss.write_index(index, "./siglip_70k.index")
url = "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRsZ4PhHTilpQ5zsG51SPZVrgEhdSfQ7_cg1g&s"
image = Image.open(requests.get(url, stream=True).raw)
with torch.no_grad():
    inputs = processor(images=image, return_tensors="pt").to(device)
    input_features = model.get_image_features(**inputs)

input_features = input_features.detach().cpu().numpy()
input_features = np.float32(input_features)
faiss.normalize_L2(input_features)
distances, indices = index.search(input_features, 3)
Now that we have built the index, let's test it with a few image queries and see how it performs.
Since this is a fashion dataset, we want to query with some fashion products and check whether the model fetches similar-looking products from the database.
We will first query the model with this tan-colored women's bag.
Let us now check the 3 most similar products fetched by the model for this query.
# display the 3 most similar images
from IPython.display import display

for elem in indices[0]:
    url = data["rows"][elem]["row"]["image"]["src"]
    image = Image.open(requests.get(url, stream=True).raw)
    width = 300
    ratio = width / float(image.size[0])
    height = int(float(image.size[1]) * ratio)
    img = image.resize((width, height), Image.Resampling.LANCZOS)
    display(img)
Output from SigLIP 2 Model
As seen from the output of the SigLIP 2 model, all the retrieved images of bags are close to our queried bag.
Let us now check the same with the SigLIP model. We can simply load this model in Step 2 using the following code:
import torch
import faiss
from torchvision import transforms
from PIL import Image
from transformers import AutoProcessor, SiglipModel, AutoImageProcessor, AutoModel, AutoTokenizer
import numpy as np
import requests
device = torch.device('cuda' if torch.cuda.is_available() else "cpu")
model = SiglipModel.from_pretrained("google/siglip-base-patch16-384").to(device)
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-384")
tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-384")
The other subsequent steps can be re-run as before.
Output from SigLIP Model
As seen from the output of the SigLIP model, two of the retrieved bags are similar to those retrieved by the SigLIP 2 model. However, the third image retrieved by SigLIP is not close to our query image, as its color is not close to tan.
Let us check for another query with this input image.
Output from SigLIP 2 model
As seen from the output of the SigLIP 2 model, all the retrieved images of women's shoes are canvas shoes and close to our queried shoe.
Output from SigLIP Model
As seen from the output of the SigLIP model, two of the retrieved shoes are similar to those retrieved by the SigLIP 2 model. However, the third image retrieved by SigLIP is not quite like our query image, as it is not a canvas shoe.
SigLIP 2 represents a significant step forward in the evolution of image-text retrieval and vision-language models. Its advanced features, such as dynamic resolution and improved fine-grained semantic understanding, make it a powerful tool for enhancing image search capabilities across various applications. By addressing key limitations of previous models, SigLIP 2 offers more accurate and efficient image retrieval, positioning it as a valuable asset in fields like e-commerce, digital asset management, and social media.
Q1. What is SigLIP 2 and how does it improve image search?
A. SigLIP 2 is a state-of-the-art multilingual vision-language encoder developed by Google DeepMind. It improves image search by enhancing semantic understanding, enabling better image-text retrieval and zero-shot classification. Its unified training approach and sigmoid-based loss function offer superior performance compared to previous models.
Q2. What new training objectives does SigLIP 2 introduce?
A. SigLIP 2 introduces a Location-Aware Captioner (LocCa) decoder for predicting bounding box coordinates and grounded captioning. It also improves fine-grained local semantics through self-distillation, a Global-Local Loss, and a Masked Prediction Loss, which make it more adept at handling detailed visual information.
Q3. What are the FixRes and NaFlex variants of SigLIP 2?
A. SigLIP 2 models come in two main variants: FixRes and NaFlex. FixRes works with fixed-resolution images, while NaFlex supports variable image aspect ratios and resolutions.
Q4. How does SigLIP 2 compare with its predecessors?
A. SigLIP 2 models outperform their predecessors in tasks like zero-shot classification, image-text retrieval, and localization. They also offer better multilingual understanding and fairness due to a more diverse training dataset.