Owl ViT is a computer vision model that has become very popular and found applications across many industries. The model takes an image and a text query as input. After processing the image, it returns a confidence score and the location of the object described in the text query within the image.
The model’s vision transformer architecture allows it to understand the relationship between text and images, which is why it uses both an image encoder and a text encoder during processing. Owl ViT builds on CLIP, so image-text similarities are learned accurately with a contrastive loss.
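To make the idea of contrastive image-text matching concrete, here is a minimal, illustrative sketch in PyTorch. The embeddings below are random stand-ins for what the real vision and text encoders would produce; only the similarity computation reflects how CLIP-style matching works.

import torch
import torch.nn.functional as F

# Random stand-ins for encoder outputs (the real model produces these
# from its vision and text encoders).
image_embed = torch.randn(1, 512)          # one image embedding
text_embeds = torch.randn(2, 512)          # e.g. "a photo of a cat", "a photo of a dog"

# CLIP-style matching reduces to cosine similarity between normalized embeddings.
image_embed = F.normalize(image_embed, dim=-1)
text_embeds = F.normalize(text_embeds, dim=-1)
similarity = image_embed @ text_embeds.T   # shape (1, 2): one score per text query
print(similarity)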
Zero-shot object detection is a computer vision technique that lets a model identify objects from classes it has never explicitly been trained on. The model takes an image together with a list of candidate labels and scores how likely each label is to describe an object in the image. It also returns bounding boxes that mark each detected object’s position in the image.
Traditionally, a model would need a large amount of labeled training data to perform these tasks: many images of cars, cats, dogs, bikes, and so on would be used during training. Zero-shot object detection sidesteps this by relying on text-image similarity, so you can simply supply text descriptions and let the model’s language understanding do the rest. This concept is the basis of the model’s architecture, which brings us to the next section.
Owl ViT is an open-source model built on CLIP-based image-text matching. It can detect objects of arbitrary classes by matching image regions to text descriptions.
The model’s foundation is its vision transformer architecture, which splits an image into a sequence of patches that are processed by a transformer encoder. A separate text encoder handles the model’s language understanding and processes the input text query. With this structure, the model can relate text descriptions to regions of the image.
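As a rough illustration of the patching step, the sketch below splits a dummy image tensor into non-overlapping 32x32 patches, assuming the 768x768 input resolution used by the owlvit-base-patch32 checkpoint; the real model does this internally with a convolutional patch embedding.

import torch

# Dummy 3-channel image at an assumed 768x768 input resolution
pixel_values = torch.randn(1, 3, 768, 768)
patch_size = 32  # owlvit-base-patch32 works with 32x32 pixel patches

# Cut the image into non-overlapping patches: (1, 3, 24, 24, 32, 32)
patches = pixel_values.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)

# Flatten into a sequence of patch vectors the transformer encoder can process
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
print(patches.shape)  # torch.Size([1, 576, 3072]) -> 576 patch tokens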
Vision transformer architectures have become popular for many computer vision tasks, and with Owl ViT, zero-shot object detection is the game changer. The model can classify objects in images even when given labels it has never seen before, which streamlines the pre-training process.
To put this theory into practice, we need to meet a few requirements before running the model. We will use the Hugging Face Transformers library, which gives us access to open-source transformer models and toolkits. There are a few steps to running this model, starting with importing the needed libraries.
First, we must import three essential libraries: requests, PIL.Image, and torch. Each of them is needed for the object detection task. Here is a brief breakdown:
The requests library is used for making HTTP requests and accessing APIs. It lets you interact with web servers, for example to download content such as images from a link. The PIL library (Pillow) lets you open, save, and modify images in different file formats. Torch (PyTorch) is a deep learning framework that provides tensor operations, GPU support, and the building blocks for model training and other machine learning tasks.
import requests
from PIL import Image
import torch
Providing properly preprocessed data to Owl ViT is another part of running this model, so we also import the processor and model classes from the transformers library.
from transformers import OwlViTProcessor, OwlViTForObjectDetection
This import gives us the classes that handle input formats, resize images, and prepare text descriptions, so the model receives properly preprocessed data for the detection task it performs.
In our case, we use Owl ViT for object detection, so we define the processor and the inputs the model is expected to handle.
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
image_path = "/content/five cats.jpg"
image = Image.open(image_path)
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
outputs = model(**inputs)
The Owl ViT processor has to be compatible with the inputs you want to use. Calling processor(text=texts, images=image, return_tensors="pt") not only preprocesses both the image and the text descriptions, it also returns the preprocessed data as PyTorch tensors (the "pt" argument).
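If you want to see what the processor actually returns, you can print the tensors it produced. The keys noted below (input_ids, attention_mask, pixel_values) are the ones the standard Transformers processors emit; the exact shapes depend on your image and queries.

# Inspect the preprocessed inputs the processor produced
for name, tensor in inputs.items():
    print(name, tuple(tensor.shape))
# Typically: 'input_ids' and 'attention_mask' for the text queries,
# and 'pixel_values' for the resized image.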
Here, image_path points to a file on our computer, which we open with PIL. This is an alternative to downloading the image from a URL for the object detection task.
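For completeness, here is what the URL-based alternative could look like. The link below is just a sample image (a commonly used COCO photo); replace it with any image URL you like.

import requests
from PIL import Image

# Example image URL; swap in your own link
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)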
There are some image processing parameters commonly used with the OWL-ViT model, and we will briefly look at a few of them here:
The texts variable holds the list of candidate classes: “a photo of a cat” and “a photo of a dog.” The processor then prepares the text and image so they are suitable inputs for the model. The output contains information about the objects detected in the image, in this case a confidence score for each, along with bounding boxes that identify their locations.
# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to COCO API
results = processor.post_process_object_detection(outputs=outputs, threshold=0.1, target_sizes=target_sizes)
This code rescales the predicted bounding boxes to the original image dimensions and converts the raw model outputs into the COCO-style detection format. The result is a structured output of detected objects, each with its bounding box and class label, suitable for evaluation or further use.
Here is a simple breakdown:
target_sizes = torch.Tensor([image.size[::-1]]): this line defines the target image size in (height, width) format. Since PIL reports image.size as (width, height), reversing it gives the (height, width) pair the post-processing step expects, stored as a PyTorch tensor.
Additionally, the code uses the processor’s ‘post_process_object_detection’ method to convert the model’s raw output into bounding boxes and class labels.
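In case it helps, a quick way to confirm the structure of the output: post_process_object_detection returns one dictionary per image with 'scores', 'labels', and 'boxes' entries, where each box is given in (xmin, ymin, xmax, ymax) pixel coordinates.

# Peek at the structure of the post-processed results
first = results[0]
print(first.keys())            # expected: dict_keys(['scores', 'labels', 'boxes'])
print(first["boxes"].shape)    # (num_detections, 4), boxes as (xmin, ymin, xmax, ymax)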
i = 0 # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
Here, we extract the detection results for the image: the bounding boxes, confidence scores, and labels that correspond to the text queries. Full resources for this are available in this notebook.
Finally, we print a summary of the results after completing the object detection task, using the code shown below:
# Print detected objects and rescaled box coordinates
for box, score, label in zip(boxes, scores, labels):
    box = [round(i, 2) for i in box.tolist()]
    print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
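As an optional extra (not part of the original walkthrough), you could draw the detections onto the image with PIL’s ImageDraw for a quick visual check, reusing the variables from the snippets above:

from PIL import ImageDraw

# Draw each detected box and its label on a copy of the image
annotated = image.copy()
draw = ImageDraw.Draw(annotated)
for box, score, label in zip(boxes, scores, labels):
    xmin, ymin, xmax, ymax = box.tolist()
    draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=3)
    draw.text((xmin, ymin), f"{text[label]}: {score.item():.2f}", fill="red")
annotated.save("detections.jpg")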
Many tasks today involve computer vision and object detection, and Owl ViT can come in handy across a range of applications, such as image search, robotics, and assistive tools for visually impaired users.
Computer vision models are traditionally versatile, and Owl ViT is no different. Thanks to its zero-shot capabilities, you can use it without extensive task-specific training. The model’s strength lies in leveraging CLIP and a vision transformer architecture for image-text matching, which makes it straightforward to explore.
Q. What is zero-shot object detection in Owl ViT?
A. Zero-shot object detection allows Owl ViT to identify objects simply by matching textual descriptions against the image, even if it has not been trained on that specific class. This enables the model to detect new objects based on text prompts alone.
Q. How does Owl ViT work?
A. Owl ViT pairs a vision transformer architecture with CLIP, which matches images to text descriptions using contrastive learning. This allows it to recognize objects based on text queries without prior knowledge of specific object classes.
Q. What are some applications of Owl ViT?
A. Owl ViT is useful in image search, robotics, and assistive technology for users with impaired vision, since it can describe and locate objects based on text input.