Boosting image search capabilities has become a critical focus in the realm of digital asset management, e-commerce, and social media platforms. With the ever-increasing volume of visual content generated daily, the need for efficient and accurate image retrieval systems is more pressing than ever. Enter SigLIP 2 (Sigmoid Loss for Language-Image Pre-Training), a state-of-the-art multilingual vision-language encoder developed by Google DeepMind, which promises to revolutionize how we approach image similarity and search tasks. Its innovative architecture not only improves semantic understanding but also excels in zero-shot classification and image-text retrieval. By utilizing a unified training approach that incorporates self-supervised learning and diverse data curation, SigLIP 2 outperforms previous models in extracting meaningful visual representations.
CLIP, which stands for Contrastive Language-Image Pre-training, is a groundbreaking multimodal model developed by OpenAI in 2021. It bridges the gap between computer vision and natural language processing by learning a shared representation space for images and text. This innovative approach allows CLIP to understand and correlate both modalities simultaneously, enabling it to perform tasks like zero-shot image classification, image-text retrieval, and captioning.
Learn More: CLIP VIT-L14: OpenAI’s Multimodal Marvel for Zero-Shot Image Classification
The key components of CLIP are a Text Encoder and an Image Encoder, tied together by a Contrastive Learning Mechanism. This mechanism aligns the representations of text and images by maximizing the similarity between matching pairs and minimizing it for non-matching pairs.
CLIP is trained on a large dataset of image-text pairs, typically involving hundreds of millions of examples. The model learns to predict the most relevant text snippet given an image and vice versa.
Also Read: Google’s SigLIP: A Significant Momentum in CLIP’s Framework
In CLIP, one encoder for images and another for text map the input images and texts to latent representations (embeddings). Once we have the embeddings from the encoders, a similarity score (a dot product) is calculated between each image and text pair, measuring how similar the image and text embeddings are. To train the model to match the correct text to an image (and vice versa), a loss function is used whose objective is to maximize the similarity score for matching image-text pairs and minimize it for non-matching ones.
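To make this concrete, here is a minimal PyTorch sketch of the pairwise similarity computation; the random tensors below simply stand in for real encoder outputs, and the batch size and embedding dimension are arbitrary.

import torch
import torch.nn.functional as F

# Stand-in embeddings for a batch of 4 image-text pairs (random tensors in place
# of real encoder outputs; 512 is an arbitrary embedding dimension)
image_emb = F.normalize(torch.randn(4, 512), dim=-1)  # image encoder output
text_emb = F.normalize(torch.randn(4, 512), dim=-1)   # text encoder output

# Entry (i, j) is the dot-product similarity between image i and text j.
# Training pushes the diagonal (matching pairs) up and the off-diagonal entries down.
similarity = image_emb @ text_emb.T
print(similarity.shape)  # torch.Size([4, 4])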
In CLIP, the softmax function is applied to the model's similarity scores to turn them into a probability distribution over the batch for every image (and, symmetrically, for every text).
In CLIP, this normalization (seen in the denominators) is performed independently twice: once across images and once across texts, as shown in the loss function below.
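With x_i and y_j denoting the normalized image and text embeddings, t a learned temperature, and B the batch size, the softmax-based loss can be written as:

$$
\mathcal{L}_{CLIP} = -\frac{1}{2B}\sum_{i=1}^{B}\left(\log\frac{e^{t\, x_i \cdot y_i}}{\sum_{j=1}^{B} e^{t\, x_i \cdot y_j}} + \log\frac{e^{t\, x_i \cdot y_i}}{\sum_{j=1}^{B} e^{t\, x_j \cdot y_i}}\right)
$$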
The first term in the above equation finds the best text match for a given query image while the second term finds the best image match for a given query text. “B” is the batch size.
SigLIP, developed by Google, follows a similar framework to CLIP but overcomes the main drawback of CLIP's softmax-based loss, its dependence on a batch-wide normalization, by using a sigmoid-based loss that operates independently on each image-text pair. The following is the sigmoid loss function used in SigLIP:
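With the same notation as above, and with z_ij = 1 when image i and text j form a matching pair and z_ij = -1 otherwise (t is a learnable temperature and b a learnable bias), the loss is:

$$
\mathcal{L}_{SigLIP} = -\frac{1}{B}\sum_{i=1}^{B}\sum_{j=1}^{B}\log\frac{1}{1 + e^{\, z_{ij}\,(-t\, x_i \cdot y_j - b)}}
$$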
| CLIP | SigLIP | Inference |
|------|--------|-----------|
| Softmax-based loss | Sigmoid-based loss | SigLIP's loss is neither asymmetric nor dependent on a global normalization factor. As a result, the loss for each pair, whether positive or negative, is independent of the other pairs in the mini-batch. |
| Each GPU stores an N x N matrix to compute all pairwise similarities | No need to store an N x N matrix, as each positive/negative pair is handled independently | Reduces computational overhead due to the memory-efficient loss calculation |
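For intuition, here is a minimal PyTorch sketch of the pairwise sigmoid loss; it is an illustrative re-implementation rather than the official one, and the toy tensors and fixed values of t and b are placeholders.

import torch
import torch.nn.functional as F

def siglip_sigmoid_loss(img_emb, txt_emb, t, b):
    # L2-normalize the embeddings before taking dot products
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T * t + b        # (B, B) pairwise logits
    labels = 2 * torch.eye(logits.size(0)) - 1  # +1 on the diagonal (matches), -1 elsewhere
    # each pair is an independent binary problem: -log sigmoid(label * logit),
    # summed over all pairs and averaged over the batch size B
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)

# toy usage: a batch of 4 image-text pairs with 768-dimensional embeddings
img, txt = torch.randn(4, 768), torch.randn(4, 768)
print(siglip_sigmoid_loss(img, txt, t=10.0, b=-10.0))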
SigLIP 2 models outperform the previous SigLIP versions at all model scales in key areas such as zero-shot classification, image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). One standout feature is the dynamic resolution (naflex) version, which is especially useful for tasks sensitive to aspect ratio and resolution.
SigLIP 2 introduces a text decoder alongside the existing image and text encoders during training. With the LocCa (Location-aware Captioner) objective, a transformer decoder with cross-attention over the vision encoder's output is added to achieve two key goals: predicting bounding box coordinates for regions referred to in text (referring expression prediction) and generating captions grounded in specific image regions (grounded captioning).
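As a rough illustration only (the layer sizes, depth, and token counts below are made up and do not reflect SigLIP 2's actual decoder configuration), a decoder cross-attending to vision-encoder tokens looks like this in PyTorch:

import torch
import torch.nn as nn

# Placeholder tensors: 576 image-patch tokens and 32 caption tokens, hidden size 768
vision_tokens = torch.randn(1, 576, 768)   # output of the vision encoder
caption_tokens = torch.randn(1, 32, 768)   # embedded caption/query tokens

decoder_layer = nn.TransformerDecoderLayer(d_model=768, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)

# the decoder self-attends over the text tokens and cross-attends to the
# vision-encoder output passed in as "memory"
out = decoder(tgt=caption_tokens, memory=vision_tokens)
print(out.shape)  # torch.Size([1, 32, 768])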
To improve fine-grained local semantics in the image representation, SigLIP 2 adds two further objectives, trained via self-distillation: a Global-Local Loss and a Masked Prediction Loss.
Since image models can be highly sensitive to changes in resolution and aspect ratio, SigLIP 2 introduces two approaches for handling this: a FixRes variant trained at fixed resolutions and a NaFlex variant that supports native aspect ratios and variable resolutions (a minimal loading sketch follows below).
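For reference, the dynamic-resolution variant can be loaded in the same way as the fixed-resolution checkpoints used later in this tutorial; the checkpoint name below is the base-size NaFlex checkpoint from the SigLIP 2 release on Hugging Face and may differ for other model sizes.

from transformers import AutoModel, AutoProcessor

# NaFlex variant: the processor preserves the native aspect ratio instead of
# resizing every image to a fixed square resolution
naflex_ckpt = "google/siglip2-base-patch16-naflex"
naflex_model = AutoModel.from_pretrained(naflex_ckpt)
naflex_processor = AutoProcessor.from_pretrained(naflex_ckpt)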
Now that we have covered some of the key differentiating features of SigLIP 2, let us build an image retrieval system using it in Python.
In the following hands-on tutorial, we will build an image retrieval system that returns similar products when the user searches with an image query. We will also compare the responses from SigLIP 2 against those from SigLIP. We will be using the T4 GPU (free tier) on Google Colab for the implementation.
!pip install datasets sentencepiece
!pip install faiss-cpu
# install the latest version of transformers from source (a recent version is needed for SigLIP 2 support)
!pip install git+https://github.com/huggingface/transformers
import torch
import faiss
from torchvision import transforms
from PIL import Image
from transformers import AutoProcessor, SiglipModel, AutoImageProcessor, AutoModel, AutoTokenizer
import numpy as np
import requests
device = torch.device('cuda' if torch.cuda.is_available() else "cpu")
# load the SigLIP 2 checkpoint (the original SigLIP checkpoint is loaded later for comparison)
model = AutoModel.from_pretrained("google/siglip2-base-patch16-384").to(device)
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-384")
tokenizer = AutoTokenizer.from_pretrained("google/siglip2-base-patch16-384")
def add_vector(embedding, index):
    vector = embedding.detach().cpu().numpy()
    vector = np.float32(vector)
    faiss.normalize_L2(vector)
    index.add(vector)

def embed_siglip(image):
    with torch.no_grad():
        inputs = processor(images=image, return_tensors="pt").to(device)
        image_features = model.get_image_features(**inputs)
    return image_features
add_vector: This function takes a tensor embedding, normalizes it, and adds it to a FAISS index for efficient similarity searching.
embed_siglip: This function takes an image, processes it, passes it through a model to obtain its embedding (feature representation), and returns these features.
API_TOKEN=""
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://datasets-server.huggingface.co/rows?dataset=ceyda/fashion-products-small&config=default&split=train"
def query():
    response = requests.get(API_URL, headers=headers)
    return response.json()
data = query()
Here we load an image dataset, fetching its rows with the requests library after first defining the Hugging Face API token. It is a dataset of fashion products.
index = faiss.IndexFlatL2(768)
# read each image and add its vector to the index
for elem in data["rows"]:
    url = elem["row"]["image"]["src"]
    image = Image.open(requests.get(url, stream=True).raw)
    # generate the embedding of the image
    clip_features = embed_siglip(image)
    # add the vector to FAISS
    add_vector(clip_features, index)

# save the index
faiss.write_index(index, "./siglip_70k.index")
url = "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRsZ4PhHTilpQ5zsG51SPZVrgEhdSfQ7_cg1g&s"
image = Image.open(requests.get(url, stream=True).raw)
with torch.no_grad():
    inputs = processor(images=image, return_tensors="pt").to(device)
    input_features = model.get_image_features(**inputs)

input_features = input_features.detach().cpu().numpy()
input_features = np.float32(input_features)
faiss.normalize_L2(input_features)
distances, indices = index.search(input_features, 3)
Now that we have built the index, let's test it with a few image queries and see how it performs.
Since this is a fashion dataset, we want to query with some fashion products and check whether the model fetches similar-looking products from the database.
We will first query the model with this tan-colored women's bag.
Let us now check the 3 most similar products fetched by the model for this query.
# display the 3 most similar images
from IPython.display import display

for elem in indices[0]:
    url = data["rows"][elem]["row"]["image"]["src"]
    image = Image.open(requests.get(url, stream=True).raw)
    width = 300
    ratio = width / float(image.size[0])
    height = int(float(image.size[1]) * ratio)
    img = image.resize((width, height), Image.Resampling.LANCZOS)
    display(img)
Output from SigLIP 2 Model
As seen from the output of the SigLIP 2 model, all the retrieved images of bags are close to our queried bag.
Let us now check the same with the SigLIP model. We can simply load this model in Step 2 using the following code:
import torch
import faiss
from torchvision import transforms
from PIL import Image
from transformers import AutoProcessor, SiglipModel, AutoImageProcessor, AutoModel, AutoTokenizer
import numpy as np
import requests
device = torch.device('cuda' if torch.cuda.is_available() else "cpu")
model = SiglipModel.from_pretrained("google/siglip-base-patch16-384").to(device)
processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-384")
tokenizer = AutoTokenizer.from_pretrained("google/siglip-base-patch16-384")
The other subsequent steps can be re-run as before.
Output from SigLIP Model
As seen from the output of the SigLIP model, two of the retrieved bags are similar to those retrieved by the SigLIP 2 model. However, the third image retrieved by SigLIP is not close to our query image, as its color is not close to tan.
Let us check for another query with this input image.
Output from SigLIP 2 model
As seen from the output of the SigLIP 2 model, all the retrieved images of women's shoes are canvas shoes and close to our queried shoe.
Output from SigLIP Model
As seen from the output of the SigLIP model, two of the retrieved shoes are similar to those retrieved by the SigLIP 2 model. However, the third image retrieved by SigLIP is not quite like our query image, as it is not a canvas shoe.
SigLIP 2 represents a significant step forward in the evolution of image-text retrieval and vision-language models. Its advanced features, such as dynamic resolution and improved fine-grained semantic understanding, make it a powerful tool for enhancing image search capabilities across various applications. By addressing key limitations of previous models, SigLIP 2 offers more accurate and efficient image retrieval, positioning it as a valuable asset in fields like e-commerce, digital asset management, and social media.
Q1. What is SigLIP 2 and how does it improve image search?
A. SigLIP 2 is a state-of-the-art multilingual vision-language encoder developed by Google DeepMind. It improves image search by enhancing semantic understanding, enabling better image-text retrieval and zero-shot classification. Its unified training approach and sigmoid-based loss function offer superior performance compared to previous models.
Q2. What new training objectives does SigLIP 2 introduce?
A. SigLIP 2 introduces a Location-Aware Captioner (LocCa) decoder for predicting bounding box coordinates and grounded captioning. It also improves fine-grained local semantics through self-distillation, a Global-Local Loss, and a Masked Prediction Loss, which make it more adept at handling detailed visual information.
Q3. What are the FixRes and NaFlex variants of SigLIP 2?
A. SigLIP 2 models come in two main variants: FixRes and NaFlex. FixRes works with fixed-resolution images, while NaFlex supports variable image aspect ratios and resolutions.
Q4. How does SigLIP 2 compare with its predecessors?
A. SigLIP 2 models outperform their predecessors in tasks like zero-shot classification, image-text retrieval, and localization. They also offer better multilingual understanding and fairness due to a more diverse training dataset.