Image depth estimation is the task of figuring out how far away the objects in an image are. It's an important problem in computer vision because it enables things like 3D reconstruction, augmented reality, and self-driving cars. In the past, depth was estimated with techniques like stereo vision or special sensors; more recently, a deep learning method called Depth Prediction Transformers (DPTs) has emerged.
DPTs are a type of model that learns to estimate depth directly from images. In this article, we'll learn how DPTs work through hands-on coding, why they're useful, and what we can do with them in different applications.
Depth Prediction Transformers (DPTs) are a unique kind of deep learning model that is specifically designed to estimate the depth of objects in images. They make use of a special type of architecture called transformers, which were initially developed for processing language data. However, DPTs adapt and apply this architecture to handle visual data. One of the key strengths of DPTs is their ability to capture intricate relationships between various parts of an image and model dependencies that span across long distances. This enables DPTs to accurately predict the depth or distance of objects in an image.
Depth Prediction Transformers (DPTs) combine vision transformers with an encoder-decoder framework to estimate depth in images. The encoder component captures and encodes features using self-attention mechanisms, enhancing the understanding of relationships between different parts of the image. This improves feature resolution and allows for the capture of fine-grained details. The decoder component reconstructs dense depth predictions by mapping the encoded features back to the original image space, utilizing techniques like upsampling and convolutional layers. The architecture of DPTs enables the model to consider the global context of the scene and model dependencies between different image regions, resulting in accurate depth predictions.
In short, the encoder uses self-attention to encode image features while preserving fine-grained detail, and the decoder turns those features back into a dense depth map; this combination lets DPTs reason about the global context of a scene and produce accurate depth predictions.
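To make the encoder-decoder idea concrete, here is a deliberately simplified sketch, not the actual DPT architecture (whose decoder fuses features from several transformer stages): a toy PyTorch decoder that reshapes a vision transformer's patch tokens into a 2D feature map and upsamples it to a dense depth map. All names and sizes below are illustrative.

import torch
import torch.nn as nn

class ToyDepthDecoder(nn.Module):
    """Toy decoder: turn transformer patch tokens into a dense depth map."""
    def __init__(self, embed_dim=256, patch_grid=14):
        super().__init__()
        self.patch_grid = patch_grid
        self.head = nn.Sequential(
            nn.Conv2d(embed_dim, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            # Upsample the 14x14 patch grid back to 224x224 pixel resolution
            nn.Upsample(scale_factor=16, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, kernel_size=1),  # one depth value per pixel
        )

    def forward(self, tokens):
        # tokens: (batch, num_patches, embed_dim) from a vision transformer encoder
        b, n, c = tokens.shape
        h = w = self.patch_grid
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)  # token sequence -> 2D feature map
        return self.head(feat)  # (batch, 1, H, W) dense depth map

# A 224x224 image split into 16x16 patches gives 14x14 = 196 tokens
tokens = torch.randn(1, 196, 256)
print(ToyDepthDecoder()(tokens).shape)  # torch.Size([1, 1, 224, 224])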
We will now walk through a practical implementation of DPT using the Hugging Face Transformers library. Find the entire code here.
We start by installing the transformers package from its GitHub repository with the following command:
!pip install -q git+https://github.com/huggingface/transformers.git # Install the transformers package from the Hugging Face GitHub repository
Run the !pip install command in a Jupyter Notebook or JupyterLab cell to install the package directly within the notebook environment.
The following code loads a depth estimation model built on the DPT architecture from the Hugging Face Transformers library.
from transformers import DPTFeatureExtractor, DPTForDepthEstimation
# Create a DPT feature extractor
feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-large")
# Create a DPT depth estimation model
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")
The code imports the necessary classes from the Transformers library, DPTFeatureExtractor and DPTForDepthEstimation. It then creates an instance of the DPT feature extractor by calling DPTFeatureExtractor.from_pretrained(), loading the pre-trained weights of the “Intel/dpt-large” model, and, in the same manner, an instance of the DPT depth estimation model with DPTForDepthEstimation.from_pretrained() using the same checkpoint.
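Optionally, and this is an assumption about your setup rather than a required step, the model can be moved to a GPU and switched to inference mode; if you do this, remember to move pixel_values to the same device later. The rest of the walkthrough works on CPU as-is.

import torch

# Optional: use a GPU if one is available and disable training-time behavior
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()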
Next, we load and prepare an image for further processing.
from PIL import Image
import requests
# Specify the URL of the image to download
url = 'https://img.freepik.com/free-photo/full-length-shot-pretty-healthy-young-lady-walking-morning-park-with-dog_171337-18880.jpg?w=360&t=st=1689213531~exp=1689214131~hmac=67dea8e3a9c9f847575bb27e690c36c3fec45b056e90a04b68a00d5b4ba8990e'
# Download and open the image using PIL
image = Image.open(requests.get(url, stream=True).raw)
We import the necessary modules (Image from PIL and requests) to handle image processing and HTTP requests, respectively. The code specifies the URL of the image to download, retrieves the image data with requests.get(), and opens it as a PIL Image object with Image.open().
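One step appears to be missing from the walkthrough: the forward pass below consumes a tensor called pixel_values that has not been created yet. Presumably the image is preprocessed with the feature extractor instantiated earlier; a minimal sketch of that implied step:

# Preprocess the image into the tensor format the model expects
# (resizing and normalization). This step is implied by the later code,
# which uses `pixel_values`.
encoding = feature_extractor(image, return_tensors="pt")
pixel_values = encoding.pixel_values  # shape (1, 3, H, W), e.g. (1, 3, 384, 384)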
import torch
# Use torch.no_grad() to disable gradient computation
with torch.no_grad():
    # Pass the pixel values through the model
    outputs = model(pixel_values)
    # Access the predicted depth values from the outputs
    predicted_depth = outputs.predicted_depth
The above code performs the forward pass of the model to obtain predicted depth values for the input image. We use torch.no_grad() as a context manager to disable gradient computation, which reduces memory usage during inference. We pass the pixel_values tensor through the model with model(pixel_values) and store the result in the outputs variable. Finally, we read the predicted depth values from outputs.predicted_depth and assign them to the predicted_depth variable.
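A quick sanity check can be helpful here. The exact shape depends on the preprocessing; for example, the dpt-large checkpoint typically works on 384x384 inputs:

# The raw prediction is a (batch, height, width) tensor of relative depth
print(predicted_depth.shape)  # e.g. torch.Size([1, 384, 384])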
We now interpolate the predicted depth values to the original image size and convert the output into an image.
import numpy as np
# Interpolate the predicted depth values to the original size
prediction = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],
    mode="bicubic",
    align_corners=False,
).squeeze()
# Convert the interpolated depth values to a numpy array
output = prediction.cpu().numpy()
# Scale and format the depth values for visualization
formatted = (output * 255 / np.max(output)).astype('uint8')
# Create an image from the formatted depth values
depth = Image.fromarray(formatted)
depth
We use torch.nn.functional.interpolate() to resize the predicted depth values to the original size of the input image. Note that unsqueeze(1) adds the channel dimension that interpolate() expects, and image.size[::-1] reverses PIL's (width, height) tuple into the (height, width) order required by the size argument. The interpolated depth values are then converted to a numpy array using .cpu().numpy(). Next, the depth values are scaled to the range [0, 255] and cast to uint8 for visualization purposes. Finally, an image is created from the formatted depth values using Image.fromarray().
After executing this code, the depth variable contains the depth map as a grayscale PIL image; evaluating depth as the last expression in a notebook cell displays it.
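For a richer view than showing the depth image alone, the input and the depth map can be displayed side by side with matplotlib (a sketch, assuming matplotlib is installed; note that DPT predicts relative inverse depth, so brighter regions are typically closer to the camera):

import matplotlib.pyplot as plt

# Show the original image and the predicted depth map side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].imshow(image)
axes[0].set_title("Input image")
axes[1].imshow(depth, cmap="inferno")  # brighter = larger predicted (inverse) depth
axes[1].set_title("Predicted depth")
for ax in axes:
    ax.axis("off")
plt.show()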
Depth Prediction Transformers offer several benefits over traditional methods for image depth estimation: they capture fine-grained details, model dependencies that span long distances within an image, take the global context of a scene into account, and estimate depth from a single 2D image without stereo rigs or special sensors.
Image depth estimation using Depth Prediction Transformers has many useful applications across different fields, including autonomous navigation for self-driving cars, augmented reality, 3D reconstruction from ordinary photos, and robotics tasks such as grasping objects and avoiding obstacles.
Image depth estimation using Depth Prediction Transformers provides a strong and precise method for estimating depth from 2D images. By using the transformer architecture within an encoder-decoder framework, DPTs can capture intricate details, understand connections between different parts of the image, and generate accurate depth predictions. This technology can be applied in areas such as autonomous navigation, augmented reality, 3D reconstruction, and robotics, offering exciting possibilities for advancement in these fields. As computer vision progresses, Depth Prediction Transformers will continue to play a crucial role in achieving accurate and dependable depth estimation, leading to improvements and breakthroughs in numerous applications.
Q. What are Depth Prediction Transformers (DPTs)?
A. Depth Prediction Transformers (DPTs) use advanced deep learning techniques to estimate the distance, or depth, of objects in images. They are designed to predict depth accurately by analyzing the details of an image and the relationships between its different parts.
Q. How are DPTs different from traditional depth estimation methods?
A. DPTs take a different approach from older methods: they use transformers, a special kind of architecture originally developed for language processing. This lets DPTs understand the image better and make more precise depth predictions.
Q. Where are DPTs useful?
A. They are particularly helpful in self-driving cars for navigating safely, in augmented reality for making virtual objects look realistic in the real world, in creating 3D models from regular images, and in robotics for tasks like picking up objects and avoiding obstacles.
Q. Can DPTs be combined with other computer vision techniques?
A. Yes. DPTs can be combined with other computer vision methods, such as object recognition or image segmentation. This improves the overall understanding of the scene and makes the depth estimation more accurate, as the sketch below illustrates.
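As a toy illustration of such a combination, suppose we have the output depth array from the walkthrough above and a binary object mask; a real mask would come from a segmentation model, and the one below is just a placeholder:

import numpy as np

# Placeholder mask standing in for a real segmentation result
mask = np.zeros(output.shape, dtype=bool)
mask[100:200, 150:250] = True

# Average the relative depth over the masked object
object_depth = output[mask].mean()
print(f"Mean relative depth of the masked object: {object_depth:.3f}")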
Q. Why do DPTs matter for computer vision?
A. DPTs are a significant step forward for depth estimation in computer vision. They capture fine details, understand relationships between objects, and make precise predictions, which leads to better scene understanding, more accurate object recognition, and more effective depth perception.