Zero-shot Object Detection Using Grounding DINO Base

Maigari David | Last Updated: 17 Oct, 2024

Detecting objects in an image requires accuracy, especially when the objects do not take simple, box-like shapes that are easy to localize. Fortunately, numerous models now deliver state-of-the-art performance in object detection.

Zero-shot object detection with the Grounding DINO base is one such efficient model, letting you detect objects the model was never explicitly trained on. Grounding DINO extends a closed-set object detector with a text encoder, enabling open-set object detection.

This model is handy for tasks that use text queries to identify objects. A significant feature is that it does not need labeled data to produce detections. We will discuss all you need to know about the Grounding DINO base model and how it works.

Learning Objectives

  • Learn how zero-shot object detection is done with the Grounding DINO Base. 
  • Get insight into the working principle and operation of this model. 
  • Study the use cases of the Grounding DINO model.
  • Run inference on this model. 
  • Explore real-life applications of the Grounding DINO base. 

This article was published as a part of the Data Science Blogathon.

Use Cases of Zero-shot Object Detection

The core attribute of this model is the ability to identify objects in an image using a text prompt. This capability helps users in various ways: zero-shot object detection can power image search on smartphones and other devices, letting you look for specific places, cities, animals, and other objects. 

Zero-shot detection models can also count instances of a specific object within a group of objects appearing in a single image, as shown in the sketch below. Another fascinating use case is object tracking in videos. 
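For instance, counting can be a simple tally over the labels the detector returns. Below is a minimal sketch, assuming the post-processed results structure used later in this article (one dict per image with a "labels" list); the label strings here are illustrative stand-ins:

from collections import Counter

# Illustrative stand-in for post-processed detections (one dict per image).
results = [{"labels": ["a cat", "a cat", "a remote control"]}]

counts = Counter(results[0]["labels"])
print(counts["a cat"])  # -> 2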

How Does the Grounding DINO Base Work?

The Grounding DINO base does not rely on labeled data; instead, it takes a text prompt and computes a probability score by matching the image against the text. The model starts by identifying the object mentioned in the text. Then, it generates ‘object proposals’ using colors, shapes, and other visual features to locate candidate objects in the image.

[Image: Grounding DINO Base overview. Source: Medium]

So, for each text prompt you pass to the model, Grounding DINO processes the image and scores candidate objects. Each detected object receives a label with a probability score indicating how confidently the object mentioned in the text was found in the image. A good example is shown in the image below:

[Image: Zero-shot object detection example with Grounding DINO Base. Source: Medium]

Model Architecture of Grounding DINO Base

Grounding DINO combines the DINO (DETR with Improved deNoising anchOr boxes) detector with GLIP-style grounded pre-training as its foundation. The architecture couples these two systems for object detection with end-to-end optimization, bridging the gap between language and vision in the model. 

Grounding DINO’s architecture bridges the gap between language and vision using a two-stream approach. Image features are extracted by a visual backbone like Swin Transformer, and text features by a model like BERT. These features are then transformed into a unified representation space through a feature enhancer that includes multiple layers of self-attention mechanisms.
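To make the two-stream idea concrete, here is a minimal, illustrative sketch (not Grounding DINO’s actual code): two stand-in backbones produce image and text features, which are then projected into a shared embedding space. All dimensions and modules here are assumptions for illustration.

import torch
import torch.nn as nn

d_model = 256
# Stand-ins for the real backbones (Swin Transformer for images, BERT for text).
image_backbone = nn.Linear(1024, d_model)   # projects flattened patch features
text_backbone = nn.Linear(768, d_model)     # projects token embeddings

image_patches = torch.randn(1, 900, 1024)   # 900 image tokens (dummy values)
text_tokens = torch.randn(1, 8, 768)        # 8 text tokens (dummy values)

image_feats = image_backbone(image_patches)  # (1, 900, 256)
text_feats = text_backbone(text_tokens)      # (1, 8, 256)
print(image_feats.shape, text_feats.shape)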

[Image: Grounding DINO Base architecture. Source: Medium]

Practically, the first stage of the model takes the text and image inputs. Since it uses two streams, it represents the image and the text separately. These representations are fed into the feature enhancer in the next stage of the process. 

The feature enhancer has multiple layers and handles both text and image features: deformable self-attention enhances the image features, regular self-attention processes the text features, and cross-attention layers fuse the two streams. 
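As a rough illustration, a simplified enhancer layer could look like the sketch below, with regular multi-head attention standing in for the deformable attention used on the image side in the real model; all dimensions are assumptions:

import torch
import torch.nn as nn

d_model = 256
self_attn_img = nn.MultiheadAttention(d_model, 8, batch_first=True)
self_attn_txt = nn.MultiheadAttention(d_model, 8, batch_first=True)
cross_img2txt = nn.MultiheadAttention(d_model, 8, batch_first=True)
cross_txt2img = nn.MultiheadAttention(d_model, 8, batch_first=True)

image_feats = torch.randn(1, 900, d_model)
text_feats = torch.randn(1, 8, d_model)

# Each modality first attends to itself...
image_feats, _ = self_attn_img(image_feats, image_feats, image_feats)
text_feats, _ = self_attn_txt(text_feats, text_feats, text_feats)
# ...then queries the other modality so the two streams fuse.
image_fused, _ = cross_img2txt(image_feats, text_feats, text_feats)
text_fused, _ = cross_txt2img(text_feats, image_feats, image_feats)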

[Image: Feature enhancer layer. Source: Medium]

The next stage, language-guided query selection, makes a major contribution: it leverages the input text to select the image features most relevant to the query. These selected features become decoder queries, which help the decoder locate each object’s position in the image and assign labels from the text description.
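A minimal sketch of this selection step (illustrative only): score each image token by its best similarity to any text token, then keep the top-k image tokens as decoder queries. The shapes and the value of k are assumptions.

import torch

num_queries = 10  # the real model uses 900
image_feats = torch.randn(1, 900, 256)
text_feats = torch.randn(1, 8, 256)

similarity = image_feats @ text_feats.transpose(-1, -2)  # (1, 900, 8)
scores = similarity.max(dim=-1).values                   # best text match per image token
topk = scores.topk(num_queries, dim=-1).indices          # indices of selected tokens
queries = torch.gather(image_feats, 1, topk.unsqueeze(-1).expand(-1, -1, 256))
print(queries.shape)  # (1, 10, 256)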

In the cross-modality decoder, the model integrates the text and image modality features through a series of attention layers and feed-forward networks. The relationship between the visual and textual information is captured here, making it possible to assign the proper labels. 
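One decoder step might be sketched as follows (a simplification of the real layer, with assumed dimensions): each query attends to the text features and the image features in turn, then passes through a feed-forward network.

import torch
import torch.nn as nn

d_model = 256
txt_cross = nn.MultiheadAttention(d_model, 8, batch_first=True)
img_cross = nn.MultiheadAttention(d_model, 8, batch_first=True)
ffn = nn.Sequential(nn.Linear(d_model, 1024), nn.ReLU(), nn.Linear(1024, d_model))

queries = torch.randn(1, 10, d_model)
text_feats = torch.randn(1, 8, d_model)
image_feats = torch.randn(1, 900, d_model)

queries, _ = txt_cross(queries, text_feats, text_feats)    # ground queries in the text
queries, _ = img_cross(queries, image_feats, image_feats)  # localize in the image
queries = ffn(queries)  # refined queries feed the box and label heads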

[Image: Cross-modality decoder layer. Source: Medium]

With these steps complete, the model produces the final results: bounding-box predictions, class-specific confidence filtering, and label assignment. 

Running the Grounding DINO Model

Although you can run this model with the pipeline helper, using the AutoProcessor and AutoModelForZeroShotObjectDetection classes gives you more control, as shown below. 

Importing Necessary Libraries

import requests

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

This code imports the libraries needed for zero-shot object detection: requests for downloading the image, PIL for opening it, and the Transformers classes for the processor and the model. With these, you can perform object detection even without task-specific training. 

Preparing the Environment 

The next step is to define the model ID, pointing at the pre-trained Grounding DINO base checkpoint, and to pick the device (GPU if available, otherwise CPU) for running the model, as shown in the next lines of code:

model_id = "IDEA-Research/grounding-dino-base"
device = "cuda" if torch.cuda.is_available() else "cpu"

Initializing the Model and Processor

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

This code does two main things: it initializes the pre-trained processor and model, and it moves the model to the selected device so object detection runs efficiently. 

Processing the Image

image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)
# Check for cats and remote controls
# VERY important: text queries need to be lowercased + end with a dot
text = "a cat. a remote control."

This code downloads and opens the image from the URL: requests fetches the raw image data as a stream, and the Image.open function loads it. The code also defines the text prompt, so the model looks for ‘a cat’ and ‘a remote control.’ Note that text queries should be lowercased and end with a period for accurate processing. 

Preparing the Input 

Here, you convert the image and text into a format the model understands using PyTorch tensors. The torch.no_grad() context runs inference without tracking gradients, saving computational cost. Finally, the zero-shot object detection model generates predictions based on the text and image.

inputs = processor(images=image, text=text, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

Result and Output

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.4,
    text_threshold=0.3,
    target_sizes=[image.size[::-1]]
)

This is where the post-processor converts the raw model outputs into results that humans can read. It also handles image sizing and dimensions (via target_sizes) while filtering the predictions against the box and text thresholds. 

results
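To inspect the detections, you can iterate over the post-processed results. This is a minimal sketch assuming the output format of this processor version: a list with one dict per image containing "scores", "labels", and "boxes" in (xmin, ymin, xmax, ymax) pixel coordinates.

detections = results[0]
for score, label, box in zip(detections["scores"], detections["labels"], detections["boxes"]):
    box = [round(coord, 1) for coord in box.tolist()]
    print(f"Detected '{label}' with confidence {round(score.item(), 3)} at {box}")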

Image of the input:

[Image: the input photo of cats with remote controls]

The output of the zero-shot object detection confirms the presence of cats and remote controls in the image. 

[Image: the output with detected bounding boxes and confidence scores]
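If you want to draw the boxes on the image yourself, a minimal visualization sketch with PIL (assuming the image and results variables from above) could look like this:

from PIL import ImageDraw

annotated = image.copy()
draw = ImageDraw.Draw(annotated)
for score, label, box in zip(results[0]["scores"], results[0]["labels"], results[0]["boxes"]):
    xmin, ymin, xmax, ymax = box.tolist()
    draw.rectangle([xmin, ymin, xmax, ymax], outline="red", width=3)
    draw.text((xmin, max(ymin - 12, 0)), f"{label}: {score.item():.2f}", fill="red")
annotated.save("annotated.jpg")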

Real-Life Applications of Grounding DINO

There are many ways to apply this model in real-life applications and industries. These include: 

  • Models like Grounding DINO Base can be effective in robotic assistants, as they can identify arbitrary objects described in natural language without needing a fixed label set. 
  • Self-driving cars are another valuable use of this technology. Autonomous cars can use this model to detect cars, traffic lights, and other objects. 
  • This model can also be used as an image analysis tool to identify the objects, people, and other things in an image. 

Conclusion

The Grounding DINO base model provides an innovative approach to zero-shot object detection by effectively combining image and text inputs for accurate identification. Its ability to detect objects without requiring labeled data makes it versatile for various applications, from image search and object tracking to more complex scenarios like autonomous driving. 

This model ensures precise detection and localization based on text prompts by taking advantage of advanced features such as deformable self-attention and cross-modality decoders. Grounding DINO showcases the potential of language-guided object detection and opens new possibilities for real-life applications in AI-driven tasks.

Key Takeaways

  • The model architecture integrates language and vision through a two-stream design. 
  • Applications in robotics, autonomous vehicles, and image analysis suggest that this model has promising potential, and we could see more of its utilization in the future. 
  • Grounding DINO base performs object detection without labels trained into the model’s dataset: it takes text prompts and outputs detections with probability scores. This makes it adaptable to various applications. 


Frequently Asked Questions

Q1. What is zero-shot object detection with Grounding DINO Base?

A. Zero-shot object detection with Grounding DINO Base allows the model to detect objects in images using text prompts without requiring pre-labeled data. It uses a combination of language and visual features to identify and locate objects in real time.

Q2. How does the Grounding DINO Base work?

A. The model processes the input text query and identifies objects in the image by generating an “object proposal” based on color, shape, and other visual features. The text with the highest probability score is considered the detected object.

Q3. What are the applications of Grounding DINO Base?

A. The model has numerous real-world applications, such as image search, object tracking in videos, robotic assistants, and self-driving cars. It can detect objects without prior knowledge, making it versatile across various industries.

Q4. Can Grounding DINO Base work for real-time object detection? 

A. Grounding DINO Base can be utilized for real-time applications, such as autonomous driving or robotic vision, due to its ability to detect objects using text prompts in dynamic environments without needing labeled datasets.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Hey there! I'm David Maigari, a dynamic professional with a passion for technical writing, Web Development, and the AI world. I am also an enthusiast of data science and AI innovations.
