Guide to Zero-Shot Image Classification

Shikha Sen 28 Jun, 2024
5 min read

Introduction

This article explores zero-shot learning, a machine learning technique for classifying examples from classes that were not seen during training, with a focus on zero-shot image classification. It covers how zero-shot image classification works, how to implement it, its benefits and challenges, practical applications, and future directions.

Overview

  • Understand the significance of zero-shot learning in machine learning.
  • Examine zero-shot classification and its uses in many fields.
  • Study zero-shot image classification in detail, including its workings and application.
  • Examine the benefits and difficulties associated with zero-shot image classification.
  • Analyse the practical uses and potential future directions of this technology.

What is Zero-Shot Learning?

Zero-shot learning (ZSL) is a machine learning technique that allows a model to identify or classify examples of classes that were not present during training. Its goal is to close the gap between the enormous number of classes that exist in the real world and the small number of classes available for training a model.

Key aspects of zero-shot learning

  • Leverages semantic knowledge about classes.
  • Makes use of metadata or additional information.
  • Enables generalization to unknown classes.

Zero-Shot Classification

Zero-shot classification is one particular application of zero-shot learning. It focuses on assigning instances to classes, including classes that have no examples in the training set.

How It Works

  • The model learns to map input features to a semantic space during training.
  • Class descriptions or attributes are mapped into the same semantic space.
  • The model makes predictions during inference by comparing the representation of the input with class descriptions.
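The three steps above can be sketched with a toy example. This is a minimal illustration, not a real model: the attribute vectors, the attribute names, and the "input embedding" are all made-up assumptions standing in for what a trained model would produce.

```python
import math

# Hypothetical semantic space with three attributes:
# [has_stripes, has_mane, has_wings]. Values are illustrative only.
CLASS_ATTRIBUTES = {
    "horse": [0.0, 1.0, 0.0],
    "tiger": [1.0, 0.0, 0.0],
    "zebra": [1.0, 1.0, 0.0],  # no training images; known only by its description
}

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(input_embedding, class_attributes):
    """Return the best-matching class and the score for every class."""
    scores = {
        label: cosine_similarity(input_embedding, attrs)
        for label, attrs in class_attributes.items()
    }
    return max(scores, key=scores.get), scores

# Assume a trained model mapped an input to this point in the semantic space.
predicted, scores = classify([0.9, 0.8, 0.1], CLASS_ATTRIBUTES)
print(predicted, scores)
```

The input is closest to the "zebra" description even though, in this toy setup, no zebra example was ever used for training.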

Zero-shot classification examples include:

  • Text classification: Categorizing documents into new topics.
  • Audio classification: Recognizing unfamiliar sounds or genres of music.
  • Object recognition: Identifying novel kinds of objects in images or videos.

Zero-Shot Image Classification

Zero-shot image classification is a specific type of zero-shot classification applied to visual data. It allows models to classify images into categories they haven’t explicitly seen during training.

Key differences from traditional image classification:

  • Traditional: Requires labeled examples for each class.
  • Zero-shot: Can classify into new classes without specific training examples.

How Zero-Shot Image Classification Works

  • Multimodal Learning: Zero-shot classification models are commonly trained on large datasets that pair images with textual descriptions. This enables the model to learn how visual features relate to language concepts.
  • Aligned Representations: The model embeds text and images into a shared embedding space, producing aligned representations. This alignment lets the model match image content with textual descriptions.
  • Inference Process: During classification, the model compares the embedding of the input image with the embeddings of the candidate text labels. The label with the highest similarity score is selected as the classification result.
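The inference step can be sketched in plain Python. This is a rough illustration under stated assumptions: the embeddings are invented three-dimensional vectors (real models use hundreds of dimensions), and the scale factor of 100 is an assumed value used here only to sharpen the softmax.

```python
import math

def l2_normalize(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_scores(image_emb, text_embs, scale=100.0):
    """Cosine similarity between the image and each label, turned into probabilities."""
    image_unit = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        text_unit = l2_normalize(text_emb)
        similarity = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(scale * similarity)  # assumed scale, for illustration
    return softmax(logits)

# Made-up embeddings standing in for the outputs of image and text encoders.
labels = ["fox", "bear", "owl"]
image_emb = [0.2, 0.9, 0.1]
text_embs = [[0.1, 1.0, 0.0], [0.9, 0.1, 0.2], [0.0, 0.2, 1.0]]

probs = zero_shot_scores(image_emb, text_embs)
best = labels[max(range(len(labels)), key=probs.__getitem__)]
print(best, probs)
```

Because the image vector points in nearly the same direction as the "fox" text vector, that label receives almost all of the probability mass.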

Implementing Zero-Shot Image Classification

First, we need to install the dependencies:

!pip install -q "transformers[torch]" pillow

There are two main approaches to implementing zero-shot image classification:

Using a Prebuilt Pipeline

from transformers import pipeline
from PIL import Image
import requests
# Set up the pipeline
checkpoint = "openai/clip-vit-large-patch14"
detector = pipeline(model=checkpoint, task="zero-shot-image-classification")

url = "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTuC7EJxlBGYl8-wwrJbUTHricImikrH2ylFQ&s"
image = Image.open(requests.get(url, stream=True).raw)
image
# Perform classification
predictions = detector(image, candidate_labels=["fox", "bear", "seagull", "owl"])
predictions
# Find the dictionary with the highest score
best_result = max(predictions, key=lambda x: x['score'])


# Print the label and score of the best result
print(f"Label with the best score: {best_result['label']}, Score: {best_result['score']}")
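Since the predictions are a list of dictionaries with "score" and "label" keys, ordinary list operations are enough to post-process them. The sketch below keeps the top few labels above a minimum score. The helper name `top_labels`, its parameters, and the sample scores are hypothetical and are not real model output.

```python
def top_labels(predictions, k=2, min_score=0.0):
    """Return up to k predictions with score >= min_score, best first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p for p in ranked if p["score"] >= min_score][:k]

# Illustrative values only; a real run would use the pipeline's output.
sample_predictions = [
    {"score": 0.05, "label": "fox"},
    {"score": 0.02, "label": "bear"},
    {"score": 0.03, "label": "seagull"},
    {"score": 0.90, "label": "owl"},
]

top = top_labels(sample_predictions, k=2, min_score=0.04)
print(top)
```

A minimum score is useful when none of the candidate labels fits the image well, because the scores are spread across the candidates you supplied.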


Manual Implementation

from transformers import AutoProcessor, AutoModelForZeroShotImageClassification
import torch
from PIL import Image
import requests

# Load model and processor
checkpoint = "openai/clip-vit-large-patch14"
model = AutoModelForZeroShotImageClassification.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)
# Load an image 
url = "https://unsplash.com/photos/xBRQfR2bqNI/download?ixid=MnwxMjA3fDB8MXxhbGx8fHx8fHx8fHwxNjc4Mzg4ODEx&force=true&w=640" 
image = Image.open(requests.get(url, stream=True).raw)
image
# Prepare inputs
candidate_labels = ["tree", "car", "bike", "cat"]
inputs = processor(images=image, text=candidate_labels, return_tensors="pt", padding=True)

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)

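# logits_per_image has shape (num_images, num_labels); take the single image's row
logits = outputs.logits_per_image[0]
# The row is 1-D, so apply softmax over its last (and only) dimension
probs = logits.softmax(dim=-1).numpy()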

# Process results
result = [
    {"score": float(score), "label": label}
    for score, label in sorted(zip(probs, candidate_labels), key=lambda x: -x[0])
]
print(result)
# Find the dictionary with the highest score
best_result = max(result, key=lambda x: x['score'])


# Print the label and score of the best result
print(f"Label with the best score: {best_result['label']}, Score: {best_result['score']}")

Benefits of Zero-Shot Image Classification

  • Flexibility: Able to classify images into new categories without any retraining.
  • Scalability: The capacity to quickly adjust to new use cases and domains.
  • Reduced dependence on data: No need for sizable labelled datasets for each new category.
  • Natural language interface: Enables users to define categories with freeform text.
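To illustrate the natural language interface, the sketch below turns short category names into fuller text prompts that could be passed as candidate labels. The helper name `build_prompts` and the template wording are assumptions chosen for illustration; which phrasing works best depends on the model.

```python
def build_prompts(labels, template="a photo of a {}"):
    """Wrap each cleaned-up label in a descriptive text template."""
    return [template.format(label.strip().lower()) for label in labels]

prompts = build_prompts(["Red Fox", " snowy owl "])
print(prompts)
```

Because categories are plain text, adding a new one means adding a string to the list rather than collecting and labelling new images.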

Challenges and Restrictions

  • Accuracy: May not match the performance of specialised models.
  • Ambiguity: May find it difficult to distinguish subtle differences between similar categories.
  • Bias: May inherit biases present in the training data or language models.
  • Computational resources: Because the models are large and complex, they often need more powerful hardware.
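One simple way to cope with the ambiguity problem is to flag predictions where the top two scores are close together. The sketch below assumes predictions in the same list-of-dictionaries format used in the implementation section. The function name, the margin of 0.15, and the sample scores are hypothetical.

```python
def is_ambiguous(predictions, min_margin=0.15):
    """True when the best score leads the runner-up by less than min_margin."""
    scores = sorted((p["score"] for p in predictions), reverse=True)
    if len(scores) < 2:
        return False  # nothing to compare against
    return (scores[0] - scores[1]) < min_margin

# Illustrative values only, not real model output.
close_call = [{"score": 0.48, "label": "wolf"}, {"score": 0.45, "label": "husky"}]
clear_win = [{"score": 0.92, "label": "owl"}, {"score": 0.05, "label": "fox"}]

print(is_ambiguous(close_call), is_ambiguous(clear_win))
```

Flagged images can then be routed to a specialised model or to human review instead of being labelled automatically.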

Applications

  • Content moderation: Adjusting to novel forms of objectionable content
  • E-commerce: Adaptable product search and classification
  • Medical imaging: Recognizing uncommon ailments or adjusting to new diagnostic criteria

Future Directions

  • Improved model architectures
  • Multimodal fusion
  • Few-shot learning integration
  • Explainable AI for zero-shot models
  • Enhanced domain adaptation capabilities


Conclusion

Zero-shot image classification, which builds on the more general idea of zero-shot learning, is a major development in computer vision and machine learning. By enabling models to classify images into previously unseen categories, it offers flexibility and adaptability that conventional classifiers lack. Future research should yield more capable systems that adapt easily to novel visual concepts, with potential impact across a wide range of sectors and applications.

Frequently Asked Questions

Q1. What is the main difference between traditional image classification and zero-shot image classification?

A. Traditional image classification requires labeled examples for each class it can recognize, while zero-shot image classification can categorize images into classes it hasn’t explicitly seen during training.

Q2. How does zero-shot image classification work?

A. It uses multi-modal models trained on large datasets of images and text descriptions. These models learn to create aligned representations of visual and textual information, allowing them to match new images with textual descriptions of categories.

Q3. What are the main advantages of zero-shot image classification?

A. The key advantages include flexibility to classify into new categories without retraining, scalability to new domains, reduced dependency on labeled data, and the ability to use natural language for specifying categories.

Q4. Are there any limitations to zero-shot image classification?

A. Yes, some limitations include potentially lower accuracy compared to specialized models, difficulty with subtle distinctions between similar categories, potentially inherited biases, and higher computational requirements.

Q5. What are some real-world applications of zero-shot image classification?

A. Applications include content moderation, e-commerce product categorization, medical imaging for rare conditions, wildlife monitoring, and object recognition in robotics.
