As 2023 is coming to an end, the exciting news for the computer vision community is that Google has recently made strides in the world of zero-shot object detection with the release of OWLv2. This cutting-edge model is now available in 🤗 Transformers and represents one of the most robust zero-shot object detection systems to date. It builds upon the foundation laid by OWL-ViT v1, which was introduced last year.
In this article, we will look at the model’s behavior and architecture and then walk through a practical approach to running inference. Let us get started.
OWLv2’s impressive capabilities can be attributed to its novel self-training approach. The model was trained on a web-scale dataset comprising over 1 billion examples. To achieve this, the authors harnessed the power of OWL-ViT v1, using it to generate pseudo labels, which in turn were used to train OWLv2.
Additionally, the model underwent fine-tuning on detection data, resulting in performance improvements over its predecessor, OWL-ViT v1. This self-training approach opens up web-scale training for open-world localization, mirroring the trends already seen in object classification and language modeling.
While the architecture of OWLv2 is similar to OWL-ViT, there is a notable addition to its object detection head: an objectness classifier that predicts the likelihood that a predicted box contains an object. This objectness score can be used to rank or filter predictions independently of the text queries.
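As a quick, hedged illustration, the sketch below shows how those objectness scores could be inspected. It assumes the detection output exposes an objectness_logits field (one logit per predicted box), as described in the Transformers documentation, and uses a placeholder image path.
# Minimal sketch: inspect per-box objectness scores (assumes an objectness_logits field on the output).
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

image = Image.open("example.jpg")  # placeholder path; use any local image
inputs = processor(text=[["a photo of a cat"]], images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One objectness logit per predicted box; a higher value means the box is more likely
# to contain an object, independently of the text queries.
objectness = torch.sigmoid(outputs.objectness_logits[0])
top_scores, top_indices = objectness.topk(5)
print(top_scores, top_indices)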
Zero-shot learning is a term that has become popular with the rise of generative AI. It is often encountered alongside Large Language Models (LLMs), where a pretrained model handles new tasks or categories without additional task-specific training. Zero-shot object detection is a game-changer in the field of computer vision: it empowers models to detect objects in images without the need for manually annotated bounding boxes for those categories. This not only speeds up the process but also removes tedious manual annotation.
OWLv2 follows a similar approach to OWL-ViT but features an updated image processor, Owlv2ImageProcessor. Additionally, the model relies on CLIPTokenizer to encode text. The Owlv2Processor is a handy tool that combines Owlv2ImageProcessor and CLIPTokenizer, simplifying the process of encoding text. Here’s an example of how to perform object detection using Owlv2Processor and Owlv2ForObjectDetection.
Find the entire code here: https://github.com/inuwamobarak/OWLv2
In this step, we start by installing the 🤗 Transformers library from GitHub.
# Install the 🤗 Transformers library from GitHub.
!pip install -q git+https://github.com/huggingface/transformers.git
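Optionally, you can confirm that the installed build is recent enough to include the OWLv2 classes:
# Optional sanity check: print the installed Transformers version.
import transformers
print(transformers.__version__)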
Here, we load an OWLv2 checkpoint from the hub. Note that checkpoint options are available, and in this example, we load an ensemble checkpoint.
# Load an OWLv2 checkpoint from the hub.
from transformers import Owlv2Processor, Owlv2ForObjectDetection
# Load the processor and model.
processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")
In this step, we load an image on which we want to detect objects.
# Load an image that you want to analyze.
from huggingface_hub import hf_hub_download
from PIL import Image
# Download a sample astronaut image from the OWL-ViT demo Space on the Hub.
filepath = hf_hub_download(repo_id="adirik/OWL-ViT", repo_type="space", filename="assets/astronaut.png")
image = Image.open(filepath)
OWLv2 is capable of detecting objects given text queries. In this step, we prepare the image and text queries for the model using the processor.
# Define the text queries that you want the model to detect.
texts = [['face', 'bag', 'shoe', 'hair']]
# Prepare the image and text for the model using the processor.
inputs = processor(text=texts, images=image, return_tensors="pt")
# Print the shapes of input tensors.
for key, val in inputs.items():
    print(f"{key}: {val.shape}")
In this step, we forward the inputs through the model. We use torch.no_grad() to reduce memory usage since we don’t need gradients at inference time.
# Import the torch library.
import torch
# Perform a forward pass through the model.
with torch.no_grad():
    outputs = model(**inputs)
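Before post-processing, it can be instructive to peek at the raw prediction tensors. The shapes noted in the comments are what I would expect from the OWL-ViT/OWLv2 detection output, where boxes are predicted in normalized center-x, center-y, width, height format:
# Peek at the raw predictions before post-processing.
print(outputs.logits.shape)      # per-box logits over the text queries, e.g. (batch_size, num_boxes, num_queries)
print(outputs.pred_boxes.shape)  # per-box coordinates, e.g. (batch_size, num_boxes, 4)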
In this final step, we convert the model’s outputs to COCO API format and visualize the results by drawing bounding boxes and labels on the image.
# Convert model outputs to COCO API format.
target_sizes = torch.Tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.2)
# Retrieve predictions for the first image.
i = 0
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
# Draw bounding boxes and labels on the image.
from PIL import ImageDraw
draw = ImageDraw.Draw(image)
for box, score, label in zip(boxes, scores, labels):
    box = [round(i, 2) for i in box.tolist()]
    x1, y1, x2, y2 = tuple(box)
    draw.rectangle(xy=((x1, y1), (x2, y2)), outline="red")
    draw.text(xy=(x1, y1), text=text[label])
# Display the image with bounding boxes and labels.
image
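It can also be useful to print each detection with its confidence score:
# Print each detected label with its confidence score and box coordinates.
for box, score, label in zip(boxes, scores, labels):
    print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at {[round(i, 2) for i in box.tolist()]}")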
Next, we perform image-guided one-shot object detection using OWLv2. This means we detect objects in a target image based on an example query image rather than on text queries.
Code: https://github.com/inuwamobarak/OWLv2
# Import necessary libraries
# %matplotlib inline # Uncomment this line for compatibility if using Jupyter Notebook.
import cv2
from PIL import Image
import requests
import torch
from matplotlib import rcParams
import matplotlib.pyplot as plt
# Set the figure size
rcParams['figure.figsize'] = 11, 8
# Load the input image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# Target size (height, width), used later for post-processing.
target_sizes = torch.Tensor([image.size[::-1]])
# Load the query image
query_url = "http://images.cocodataset.org/val2017/000000058111.jpg"
query_image = Image.open(requests.get(query_url, stream=True).raw)
# Display the input image and query image side by side.
fig, ax = plt.subplots(1, 2)
ax[0].imshow(image)
ax[1].imshow(query_image)
After loading the two images, we preprocess the input and print the shape.
# Define the device to use for processing.
device = "cuda" if torch.cuda.is_available() else "cpu"
# Move the model to the same device so it matches the inputs.
model = model.to(device)
# Process input and query images using the preprocessor.
inputs = processor(images=image, query_images=query_image, return_tensors="pt").to(device)
# Print the input names and shapes.
for key, val in inputs.items():
    print(f"{key}: {val.shape}")
Below, we perform image-guided object detection. We print the shapes of the model’s outputs, including vision model outputs.
# Perform image-guided object detection using the model.
with torch.no_grad():
    outputs = model.image_guided_detection(**inputs)
# Print the shapes of the model's outputs.
for k, val in outputs.items():
    if k not in {"text_model_output", "vision_model_output"}:
        print(f"{k}: shape of {val.shape}")

print("\nVision model outputs")
for k, val in outputs.vision_model_output.items():
    print(f"{k}: shape of {val.shape}")
Finally, we visualize the results by drawing bounding boxes on the image. The code converts the image to OpenCV’s BGR channel order for drawing and post-processes the detection results.
# Visualize the results
import numpy as np
# Convert the PIL image (RGB) into a NumPy array in OpenCV's BGR channel order for drawing.
img = cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR)
outputs.logits = outputs.logits.cpu()
outputs.target_pred_boxes = outputs.target_pred_boxes.cpu()
# Post-process the detection results.
results = processor.post_process_image_guided_detection(outputs=outputs, threshold=0.9, nms_threshold=0.3, target_sizes=target_sizes)
boxes, scores = results[0]["boxes"], results[0]["scores"]
# Draw bounding boxes on the image.
for box, score in zip(boxes, scores):
    box = [int(i) for i in box.tolist()]
    img = cv2.rectangle(img, box[:2], box[2:], (255, 0, 0), 5)
    # Choose a y-coordinate for an optional label, keeping it inside a 768-pixel-tall image (not drawn here).
    if box[3] + 25 > 768:
        y = box[3] - 10
    else:
        y = box[3] + 25
# Display the image with predicted bounding boxes.
plt.imshow(img[:, :, ::-1])
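To inspect the matches numerically as well, you can print each retrieved box with its score:
# Print each retrieved box with its similarity score.
for box, score in zip(boxes, scores):
    print(f"Found a match with score {round(score.item(), 3)} at box {[round(i, 2) for i in box.tolist()]}")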
Open-vocabulary object detection has benefited from pre-trained vision-language models. However, it’s often hindered by the limited availability of detection training data. To address this, the authors turned to self-training and existing detectors to generate pseudo-box annotations on image-text pairs. Scaling self-training presents its own set of challenges, including the choice of label space, pseudo-annotation filtering, and training efficiency.
OWLv2 and the OWL-ST self-training recipe have been developed to overcome these challenges. As a result, OWLv2 now surpasses the performance of earlier state-of-the-art open-vocabulary detectors, even at similar training scales of around 10 million examples.
OWLv2’s performance is indeed impressive. With an L/14 architecture, OWL-ST improves Average Precision (AP) on LVIS rare classes from 31.2% to 44.6%, even though the model has seen no human box annotations for these rare classes.
OWL-ST’s ability to scale to over 1 billion examples marks a milestone in web-scale training for open-world localization, similar to what we’ve witnessed in object classification and language modeling.
OWLv2 and the innovative OWL-ST self-training recipe represent a leap forward in zero-shot object detection. These advancements promise to reshape the landscape of computer vision by making it easier and more efficient to detect objects in images without the need for manually annotated bounding boxes. We encourage you to explore OWLv2 and its applications in your projects. The possibilities are exciting, and we can’t wait to see how the computer vision community leverages this technology for groundbreaking solutions.
Q1. What is zero-shot object detection, and why is it important?
A1: Zero-shot object detection is a way for models to detect objects in images without the need for manually annotated bounding boxes. It is important because it streamlines the object detection process and makes it less labor-intensive.
Q2. How does self-training work in OWLv2?
A2: Self-training involves using an existing detector to generate pseudo-box annotations on image-text pairs. OWLv2 leverages this self-training approach to improve performance and scalability.
Q3. What does the objectness classifier in OWLv2 do?
A3: The objectness classifier in OWLv2’s object detection head predicts the likelihood that a predicted box contains an object. This score can be used to rank or filter predictions independently of text queries.
Q4. How can I use OWLv2 for text-conditioned object detection?
A4: Use OWLv2 with processors like Owlv2ImageProcessor, CLIPTokenizer, and Owlv2Processor to perform text-conditioned object detection. Practical examples are available in the article.
Q5. What challenges does self-training address?
A5: Self-training addresses challenges like the choice of label space, pseudo-annotation filtering, and training efficiency when scaling open-vocabulary object detection.
Q6. What applications could benefit from OWLv2?
A6: OWLv2’s capabilities have the potential to benefit many applications in computer vision, including object detection, image understanding, and more. Researchers and developers can leverage this technology for innovative solutions.