Image depth estimation is the task of figuring out how far away the objects in an image are. It's an important problem in computer vision because it enables things like 3D reconstruction, augmented reality, and self-driving cars. In the past, depth was estimated with techniques like stereo vision or special sensors; more recently, a deep learning method called Depth Prediction Transformers (DPTs) has emerged.
DPTs are a type of model that learns to estimate depth directly from images. In this article, we'll learn how DPTs work through hands-on coding, why they're useful, and what we can do with them in different applications.
Depth Prediction Transformers (DPTs) are a unique kind of deep learning model that is specifically designed to estimate the depth of objects in images. They make use of a special type of architecture called transformers, which were initially developed for processing language data. However, DPTs adapt and apply this architecture to handle visual data. One of the key strengths of DPTs is their ability to capture intricate relationships between various parts of an image and model dependencies that span across long distances. This enables DPTs to accurately predict the depth or distance of objects in an image.
Depth Prediction Transformers (DPTs) combine vision transformers with an encoder-decoder framework to estimate depth in images. The encoder component captures and encodes features using self-attention mechanisms, enhancing the understanding of relationships between different parts of the image. This improves feature resolution and allows for the capture of fine-grained details. The decoder component reconstructs dense depth predictions by mapping the encoded features back to the original image space, utilizing techniques like upsampling and convolutional layers. The architecture of DPTs enables the model to consider the global context of the scene and model dependencies between different image regions, resulting in accurate depth predictions.
In short, the encoder uses self-attention to encode image features while preserving fine-grained detail, and the decoder turns those features back into a dense depth map; this combination lets DPTs reason about the global context of a scene and produce accurate depth predictions.
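To make the encoder-decoder idea concrete, here is a deliberately simplified sketch, not the actual DPT architecture (whose decoder fuses features from several transformer stages): a toy PyTorch decoder that reshapes a vision transformer's patch tokens into a 2D feature map and upsamples it to a dense depth map. All names and sizes below are illustrative.

import torch
import torch.nn as nn

class ToyDepthDecoder(nn.Module):
    """Toy decoder: turn transformer patch tokens into a dense depth map."""
    def __init__(self, embed_dim=256, patch_grid=14):
        super().__init__()
        self.patch_grid = patch_grid
        self.head = nn.Sequential(
            nn.Conv2d(embed_dim, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            # Upsample the 14x14 patch grid back to 224x224 pixel resolution
            nn.Upsample(scale_factor=16, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, kernel_size=1),  # one depth value per pixel
        )

    def forward(self, tokens):
        # tokens: (batch, num_patches, embed_dim) from a vision transformer encoder
        b, n, c = tokens.shape
        h = w = self.patch_grid
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)  # token sequence -> 2D feature map
        return self.head(feat)  # (batch, 1, H, W) dense depth map

# A 224x224 image split into 16x16 patches gives 14x14 = 196 tokens
tokens = torch.randn(1, 196, 256)
print(ToyDepthDecoder()(tokens).shape)  # torch.Size([1, 1, 224, 224])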
We will now walk through a practical implementation of DPT using the Hugging Face Transformers library. Find the entire code here.
We start by installing the transformers package from its GitHub repository with the following command:
!pip install -q git+https://github.com/huggingface/transformers.git # Install the transformers package from the Hugging Face GitHub repository
Run the !pip install command in a Jupyter Notebook or JupyterLab cell to install the package directly within the notebook environment.
The following code loads a depth estimation model built on the DPT architecture from the Hugging Face Transformers library.
from transformers import DPTFeatureExtractor, DPTForDepthEstimation
# Create a DPT feature extractor
feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-large")
# Create a DPT depth estimation model
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")
The code imports the necessary classes from the Transformers library, DPTFeatureExtractor and DPTForDepthEstimation. It then creates an instance of the DPT feature extractor by calling DPTFeatureExtractor.from_pretrained(), loading the pre-trained weights of the “Intel/dpt-large” model, and, in the same manner, an instance of the DPT depth estimation model with DPTForDepthEstimation.from_pretrained() using the same checkpoint.
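Optionally, and this is an assumption about your setup rather than a required step, the model can be moved to a GPU and switched to inference mode; if you do this, remember to move pixel_values to the same device later. The rest of the walkthrough works on CPU as-is.

import torch

# Optional: use a GPU if one is available and disable training-time behavior
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()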
Next, we load and prepare an image for further processing.
from PIL import Image
import requests
# Specify the URL of the image to download
url = 'https://img.freepik.com/free-photo/full-length-shot-pretty-healthy-young-lady-walking-morning-park-with-dog_171337-18880.jpg?w=360&t=st=1689213531~exp=1689214131~hmac=67dea8e3a9c9f847575bb27e690c36c3fec45b056e90a04b68a00d5b4ba8990e'
# Download and open the image using PIL
image = Image.open(requests.get(url, stream=True).raw)
We import the necessary modules (Image from PIL and requests) to handle image processing and HTTP requests, respectively. The code specifies the URL of the image to download, retrieves the image data with requests.get(), and opens it as a PIL Image object with Image.open().
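One step appears to be missing from the walkthrough: the forward pass below consumes a tensor called pixel_values that has not been created yet. Presumably the image is preprocessed with the feature extractor instantiated earlier; a minimal sketch of that implied step:

# Preprocess the image into the tensor format the model expects
# (resizing and normalization). This step is implied by the later code,
# which uses `pixel_values`.
encoding = feature_extractor(image, return_tensors="pt")
pixel_values = encoding.pixel_values  # shape (1, 3, H, W), e.g. (1, 3, 384, 384)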
import torch
# Use torch.no_grad() to disable gradient computation
with torch.no_grad():
    # Pass the pixel values through the model
    outputs = model(pixel_values)
    # Access the predicted depth values from the outputs
    predicted_depth = outputs.predicted_depth
The above code performs the forward pass of the model to obtain predicted depth values for the input image. We use torch.no_grad() as a context manager to disable gradient computation, which reduces memory usage during inference. We pass the pixel_values tensor through the model with model(pixel_values) and store the result in the outputs variable. Finally, we read the predicted depth values from outputs.predicted_depth and assign them to the predicted_depth variable.
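A quick sanity check can be helpful here. The exact shape depends on the preprocessing; for example, the dpt-large checkpoint typically works on 384x384 inputs:

# The raw prediction is a (batch, height, width) tensor of relative depth
print(predicted_depth.shape)  # e.g. torch.Size([1, 384, 384])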
We now interpolate the predicted depth values to the original image size and convert the output into an image.
import numpy as np
# Interpolate the predicted depth values to the original size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
).squeeze()
# Convert the interpolated depth values to a numpy array
output = prediction.cpu().numpy()
# Scale and format the depth values for visualization
formatted = (output * 255 / np.max(output)).astype('uint8')
# Create an image from the formatted depth values
depth = Image.fromarray(formatted)
depth
We use torch.nn.functional.interpolate() to resize the predicted depth values to the original size of the input image. Note that unsqueeze(1) adds the channel dimension that interpolate() expects, and image.size[::-1] reverses PIL's (width, height) tuple into the (height, width) order required by the size argument. The interpolated depth values are then converted to a numpy array using .cpu().numpy(). Next, the depth values are scaled to the range [0, 255] and cast to uint8 for visualization purposes. Finally, an image is created from the formatted depth values using Image.fromarray().
After executing this code, the depth variable contains the depth map as a grayscale PIL image; evaluating depth as the last expression in a notebook cell displays it.
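For a richer view than showing the depth image alone, the input and the depth map can be displayed side by side with matplotlib (a sketch, assuming matplotlib is installed; note that DPT predicts relative inverse depth, so brighter regions are typically closer to the camera):

import matplotlib.pyplot as plt

# Show the original image and the predicted depth map side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].imshow(image)
axes[0].set_title("Input image")
axes[1].imshow(depth, cmap="inferno")  # brighter = larger predicted (inverse) depth
axes[1].set_title("Predicted depth")
for ax in axes:
    ax.axis("off")
plt.show()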
Depth Prediction Transformers offer several benefits over traditional methods for image depth estimation: they capture fine-grained details, model dependencies that span long distances within an image, take the global context of a scene into account, and estimate depth from a single 2D image without stereo rigs or special sensors.
Image depth estimation using Depth Prediction Transformers has many useful applications across different fields, including autonomous navigation for self-driving cars, augmented reality, 3D reconstruction from ordinary photos, and robotics tasks such as grasping objects and avoiding obstacles.
Image depth estimation using Depth Prediction Transformers provides a strong and precise method for estimating depth from 2D images. By using the transformer architecture within an encoder-decoder framework, DPTs can capture intricate details, understand connections between different parts of the image, and generate accurate depth predictions. This technology can be applied in areas such as autonomous navigation, augmented reality, 3D reconstruction, and robotics, offering exciting possibilities for advancement in these fields. As computer vision progresses, Depth Prediction Transformers will continue to play a crucial role in achieving accurate and dependable depth estimation, leading to improvements and breakthroughs in numerous applications.
Q. What are Depth Prediction Transformers (DPTs)?
A. Depth Prediction Transformers (DPTs) use advanced deep learning techniques to estimate the distance, or depth, of objects in images. They are designed to predict depth accurately by analyzing the details of an image and the relationships between its different parts.
Q. How are DPTs different from traditional depth estimation methods?
A. DPTs take a different approach from older methods: they use transformers, a special kind of architecture originally developed for language processing. This lets DPTs understand the image better and make more precise depth predictions.
Q. Where are DPTs useful?
A. They are particularly helpful in self-driving cars for navigating safely, in augmented reality for making virtual objects look realistic in the real world, in creating 3D models from regular images, and in robotics for tasks like picking up objects and avoiding obstacles.
Q. Can DPTs be combined with other computer vision techniques?
A. Yes. DPTs can be combined with other computer vision methods, such as object recognition or image segmentation. This improves the overall understanding of the scene and makes the depth estimation more accurate, as the sketch below illustrates.
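As a toy illustration of such a combination, suppose we have the output depth array from the walkthrough above and a binary object mask; a real mask would come from a segmentation model, and the one below is just a placeholder:

import numpy as np

# Placeholder mask standing in for a real segmentation result
mask = np.zeros(output.shape, dtype=bool)
mask[100:200, 150:250] = True

# Average the relative depth over the masked object
object_depth = output[mask].mean()
print(f"Mean relative depth of the masked object: {object_depth:.3f}")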
Q. Why do DPTs matter for computer vision?
A. DPTs are a significant step forward for depth estimation in computer vision. They capture fine details, understand relationships between objects, and make precise predictions, which leads to better scene understanding, more accurate object recognition, and more effective depth perception.