Hugging Face provides several ways to perform image-to-image generation with pre-trained models and supporting libraries. In this article, we will generate new images from an input image using the UNet2DConditionModel, building on PyTorch and the Hugging Face depth2img approach. Stable Diffusion models make it possible to modify existing images and generate new, creative ones from them.
This article was published as a part of the Data Science Blogathon.
Stable Diffusion models are latent diffusion models: they learn how images are represented, and how noise behaves, in a compressed latent space. They belong to the family of deep generative neural networks.
A beginner may ask what the word “Stable” implies in Stable Diffusion. The generation is stable because the original image and the supplied parameters guide the result as the model moves through the latent space. By contrast, unstable diffusion refers to generation that is unpredictable and essentially random.
The working concept of stable diffusion is a bit complex since it goes beyond regular image processing or deep learning. The attempt in this article is to demystify this concept using simple explanations.
The model uses a probabilistic formulation known as the latent diffusion model (LDM). During training, noise is added to the data step by step so that, after enough steps, the noisy samples follow an approximately normal distribution. In other words, even as noise is applied, the process stays within a well-defined statistical bound. This added noise is referred to as Gaussian noise.
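To make the forward noising process concrete, here is a minimal, purely illustrative sketch (not part of the pipeline we build below) of how Gaussian noise is progressively mixed into a clean latent under an assumed linear noise schedule:
import torch
# Forward diffusion step: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise
# With enough steps, x_t approaches pure Gaussian noise.
betas = torch.linspace(1e-4, 0.02, 1000)            # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta)

def add_noise(x0, t):
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.randn(1, 4, 64, 64)   # a dummy clean latent
x_t = add_noise(x0, t=999)       # after the last step: almost pure noise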
The image below illustrates the diffusion process in a latent space with noise, using the U-Net model.
The denoising itself is carried out by a sequence of denoising autoencoders (DAEs), which extend the standard autoencoder by adding noise to the input and adjusting the reconstruction criterion in the pixel space of the images. Stable Diffusion consists of a few essential parts: a variational autoencoder (VAE), an artificial neural network used in probabilistic graphical models; a U-Net block, a convolutional neural network (CNN) originally designed for image segmentation; and, finally, a text encoder, a trained CLIP ViT-L/14 model that transforms text prompts into an embedding space.
The VAE compresses the image into a lower-dimensional latent space, preserving detail while keeping memory usage manageable.
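As a rough sketch of this compression (using the same AutoencoderKL checkpoint we load later in the article; shapes are indicative), a 512x512 RGB image is encoded to a 4x64x64 latent and decoded back:
import torch
from diffusers import AutoencoderKL

# Rough sketch of the ~8x spatial compression performed by the VAE
vae = AutoencoderKL.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='vae')
with torch.no_grad():
    image = torch.randn(1, 3, 512, 512)                # dummy image tensor in [-1, 1]
    latents = vae.encode(image).latent_dist.sample()   # shape: (1, 4, 64, 64)
    reconstruction = vae.decode(latents).sample        # shape: (1, 3, 512, 512)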
The idea is to transform an existing image into a modified one using a diffuser-based image-generation pipeline: we give the pipeline an image, and it generates another image from it. This is just one of the many use cases of Stable Diffusion. Others include text-to-image, where no image is provided and only a text description of the expected image is given; text-to-video, which generates new videos from text; text-to-speech; and more. In this article, we take a hands-on approach to building a customized pipeline for image-to-image generation.
Let us see the process we will follow to build our customized pipeline.
Installing libraries
# Installing libraries
%pip install --quiet --upgrade diffusers transformers scipy ftfy
# Installing libraries
%pip install --quiet --upgrade accelerate
Let us look at the use of the above libraries:
Diffusers: This library provides the latent diffusion models trained to generate photo-realistic images from text. It is the backbone of everything we build here.
Transformers: These models understand features in a context such as text and turn them into token embeddings. The choice of transformer shapes how well the context is understood.
Scipy: SciPy is a long-standing Python library for scientific computation built on NumPy. It supplies our pipeline with optimization and signal-processing routines.
Ftfy: Short for “fixes text for you”, it repairs broken Unicode and provides other text-processing tools.
Accelerate: It allows the same PyTorch code to run across different configurations and environments with minimal effort.
import numpy as np
from tqdm import tqdm
from PIL import Image
# PyTorch backend
import torch
from torch import autocast
# Transformers containing tools
from transformers import CLIPTextModel, CLIPTokenizer
from transformers import DPTForDepthEstimation, DPTFeatureExtractor
# Accessing our pipeline
from diffusers import AutoencoderKL, UNet2DConditionModel
from diffusers.schedulers.scheduling_pndm import PNDMScheduler
Although Hugging Face provides a simple predefined pipeline for diffusion, in this section we will build our own. This gives us a more flexible approach: we specify exactly what we want and how we want it. We create a number of methods in a class that we will instantiate later. To follow along hands-on, please use the notebook available on GitHub.
# Creating a class for the Diffusion pipeline
class DiffusionPipeline:
    def __init__(self,
                 vae,
                 tokenizer,
                 text_encoder,
                 unet,
                 scheduler):
        self.vae = vae
        self.tokenizer = tokenizer
        self.text_encoder = text_encoder
        self.unet = unet
        self.scheduler = scheduler
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'

    def get_text_embeds(self, text):
        # Tokenizing the text
        text_input = self.tokenizer(text,
                                    padding='max_length',
                                    max_length=self.tokenizer.model_max_length,
                                    truncation=True,
                                    return_tensors='pt')
        # Embedding the tokenized text
        with torch.no_grad():
            text_embeds = self.text_encoder(text_input.input_ids.to(self.device))[0]  # Get embeddings
        return text_embeds

    def get_prompt_embeds(self, prompt):
        if isinstance(prompt, str):
            prompt = [prompt]
        # get conditional embeddings
        cond_embeds = self.get_text_embeds(prompt)
        # get unconditional embeddings
        uncond_embeds = self.get_text_embeds([''] * len(prompt))
        # concatenate the above 2 embeds
        prompt_embeds = torch.cat([uncond_embeds, cond_embeds])
        return prompt_embeds

    def decode_img_latents(self, img_latents):
        img_latents = 1 / self.vae.config.scaling_factor * img_latents
        with torch.no_grad():
            img = self.vae.decode(img_latents).sample
        img = (img / 2 + 0.5).clamp(0, 1)
        img = img.cpu().permute(0, 2, 3, 1).float().numpy()
        return img

    def transform_img(self, img):
        # scale images to [0, 255] and convert to int
        img = (img * 255).round().astype('uint8')
        # convert to PIL Image objects
        img = [Image.fromarray(i) for i in img]
        return img

    def encode_img_latents(self, img, latent_timestep):
        if not isinstance(img, list):
            img = [img]
        img = np.stack([np.array(i) for i in img], axis=0)
        # scale images to [-1, 1]
        img = 2 * ((img / 255.0) - 0.5)
        img = torch.from_numpy(img).float().permute(0, 3, 1, 2)
        img = img.to(self.device)
        # encode images
        img_latents_dist = self.vae.encode(img)
        img_latents = img_latents_dist.latent_dist.sample()
        # scale images
        img_latents = self.vae.config.scaling_factor * img_latents
        # add noise to the latents
        noise = torch.randn(img_latents.shape).to(self.device)
        img_latents = self.scheduler.add_noise(img_latents, noise, latent_timestep)
        return img_latents
Next, we subclass this base pipeline. The new Depth2ImgPipeline class takes the same components plus the depth-estimation models, and will be assigned to a variable that we call on our target image.
class Depth2ImgPipeline(DiffusionPipeline):
    def __init__(self,
                 vae,
                 tokenizer,
                 text_encoder,
                 unet,
                 scheduler,
                 depth_feature_extractor,
                 depth_estimator):
        super().__init__(vae, tokenizer, text_encoder, unet, scheduler)
        self.depth_feature_extractor = depth_feature_extractor
        self.depth_estimator = depth_estimator

    def get_depth_mask(self, img):
        if not isinstance(img, list):
            img = [img]
        width, height = img[0].size
        # pre-process the input image and get its pixel values
        pixel_values = self.depth_feature_extractor(img, return_tensors="pt").pixel_values
        # use autocast for automatic mixed precision (AMP) inference
        with autocast('cuda'):
            depth_mask = self.depth_estimator(pixel_values).predicted_depth
        # get the depth mask
        depth_mask = torch.nn.functional.interpolate(depth_mask.unsqueeze(1),
                                                     size=(height//8, width//8),
                                                     mode='bicubic',
                                                     align_corners=False)
        # scale the mask to [-1, 1]
        depth_min = torch.amin(depth_mask, dim=[1, 2, 3], keepdim=True)
        depth_max = torch.amax(depth_mask, dim=[1, 2, 3], keepdim=True)
        depth_mask = 2.0 * (depth_mask - depth_min) / (depth_max - depth_min) - 1.0
        depth_mask = depth_mask.to(self.device)
        # replicate the mask for classifier-free guidance
        depth_mask = torch.cat([depth_mask] * 2)
        return depth_mask

    # Denoising the image over the latent space
    def denoise_latents(self,
                        img,
                        prompt_embeds,
                        depth_mask,
                        strength,
                        num_inference_steps=20,
                        guidance_scale=7.5,
                        height=512, width=512):
        # clip the value of strength to ensure strength lies in [0, 1]
        strength = max(min(strength, 1), 0)
        # compute timesteps
        self.scheduler.set_timesteps(num_inference_steps)
        init_timestep = int(num_inference_steps * strength)
        t_start = num_inference_steps - init_timestep
        timesteps = self.scheduler.timesteps[t_start:]
        num_inference_steps = num_inference_steps - t_start
        latent_timestep = timesteps[:1].repeat(1)
        latents = self.encode_img_latents(img, latent_timestep)
        # use autocast for automatic mixed precision (AMP) inference
        with autocast('cuda'):
            for i, t in tqdm(enumerate(timesteps)):
                latent_model_input = torch.cat([latents] * 2)
                latent_model_input = torch.cat([latent_model_input, depth_mask], dim=1)
                # predict noise residuals
                with torch.no_grad():
                    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=prompt_embeds)['sample']
                # separate predictions for unconditional and conditional outputs
                noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                # perform guidance
                noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
                # remove the noise from the current sample i.e. go from x_t to x_{t-1}
                latents = self.scheduler.step(noise_pred, t, latents)['prev_sample']
        return latents

    def __call__(self,
                 prompt,
                 img,
                 strength=0.8,
                 num_inference_steps=50,
                 guidance_scale=7.5,
                 height=512, width=512):
        prompt_embeds = self.get_prompt_embeds(prompt)
        depth_mask = self.get_depth_mask(img)
        latents = self.denoise_latents(img,
                                       prompt_embeds,
                                       depth_mask,
                                       strength,
                                       num_inference_steps,
                                       guidance_scale,
                                       height, width)
        img = self.decode_img_latents(latents)
        img = self.transform_img(img)
        return img
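To see how the strength argument interacts with the number of inference steps, here is a small worked example that simply mirrors the arithmetic in denoise_latents above:
# With 50 inference steps and strength=0.8, the first 10 scheduler timesteps
# are skipped and 40 denoising steps actually run.
num_inference_steps = 50
strength = 0.8
init_timestep = int(num_inference_steps * strength)  # 40
t_start = num_inference_steps - init_timestep        # 10
print(init_timestep, t_start)                        # 40 10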
Next, we specify a device so that the heavy computation runs on the GPU.
# Setting a GPU device
device = 'cuda'
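If you are on a machine without a GPU, one optional tweak (not in the original code) is to fall back to the CPU automatically:
# Fall back to CPU when CUDA is not available (optional)
device = 'cuda' if torch.cuda.is_available() else 'cpu'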
Since the prompt is text, we need a way to convert it into something the pipeline can work with. The tokenizer and text encoder handle the prompt, while the autoencoder (VAE) handles the images.
# Loading autoencoder for reconstructing the image
vae = AutoencoderKL.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='vae').to(device)
# Load the tokenizer and the text encoder
tokenizer = CLIPTokenizer.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='tokenizer')
text_encoder = CLIPTextModel.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='text_encoder').to(device)
# Load UNet model
unet = UNet2DConditionModel.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='unet').to(device)
# Load the scheduler that controls the noise schedule during denoising
scheduler = PNDMScheduler(beta_start=0.00085,
                          beta_end=0.012,
                          beta_schedule='scaled_linear',
                          num_train_timesteps=1000)
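As a quick, illustrative check of what the tokenizer and text encoder produce (this simply mirrors get_text_embeds above), a prompt becomes a fixed-length sequence of token ids and then a tensor of per-token embeddings:
# Rough sketch: turn a prompt into token ids, then into embeddings
with torch.no_grad():
    tokens = tokenizer("two red roses",
                       padding='max_length',
                       max_length=tokenizer.model_max_length,
                       truncation=True,
                       return_tensors='pt')
    embeds = text_encoder(tokens.input_ids.to(device))[0]
print(tokens.input_ids.shape)  # (1, 77): one padded sequence of token ids
print(embeds.shape)            # (1, 77, hidden_size): one embedding per token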
DPT (Dense Prediction Transformer) was first released by Intel Labs around 2021. Its architecture leverages vision transformers instead of traditional convolutional networks for dense prediction tasks such as depth estimation and semantic segmentation, where it was trained with over 150 classes. The input image is split into patches that the transformer processes as tokens, somewhat like a form of segmentation.
You can find the official paper on DPT released by Intel here
# Load DPT Depth Estimator for measuring the distance of each pixel
depth_estimator = DPTForDepthEstimation.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='depth_estimator')
# Load DPT Feature Extractor for dense prediction
depth_feature_extractor = DPTFeatureExtractor.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='feature_extractor')
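To see what these two components do on their own (mirroring get_depth_mask above; img here stands for any PIL image, such as the one we load in the next section), the feature extractor prepares the pixel values and the estimator returns a relative depth map:
# Rough sketch: estimate depth for a single PIL image `img`
with torch.no_grad():
    pixel_values = depth_feature_extractor(img, return_tensors='pt').pixel_values
    predicted_depth = depth_estimator(pixel_values).predicted_depth
print(predicted_depth.shape)  # (1, H', W'): a relative depth value per pixel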
We can now initialize our pipeline. Wrapping everything in one object keeps the code brief when we call the pipeline on an input image.
# Initializing pipeline
depth2img = Depth2ImgPipeline(vae,
                              tokenizer,
                              text_encoder,
                              unet,
                              scheduler,
                              depth_feature_extractor,
                              depth_estimator)
We need a helper function to check whether a string is a URL before loading images.
import urllib.parse as parse
import os
import requests
# Determine if a string is a URL
def check_url(string):
    try:
        result = parse.urlparse(string)
        return all([result.scheme, result.netloc, result.path])
    except:
        return False
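A quick sanity check of the helper: strings with a scheme, host, and path count as URLs, while plain file names do not.
print(check_url("https://img.freepik.com/free-vector/two-red-roses-white_1308-35268.jpg"))  # True
print(check_url("my_local_image.jpg"))                                                      # False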
We can now build a function to load the images.
# Load an image
def load_image(image_path):
    if check_url(image_path):
        return Image.open(requests.get(image_path, stream=True).raw)
    elif os.path.exists(image_path):
        return Image.open(image_path)
We are now set to use our model. To generate new images with our pipeline, we first load an image from a URL. Let us view a sample image.
# Getting an image URL
url = "https://img.freepik.com/free-vector/two-red-roses-white_1308-35268.jpg?size=626&ext=jpg&uid=R21895281&ga=GA1.1.821631087.1678296847&semt=ais"
img = load_image(url)
img
Prompt = “two hibiscus flowers”
# Assigning Pipeline to prompt
depth2img("two hibiscus flowers", img)[0]
Let us try another sample.
# Getting an image URL
url = "https://img.freepik.com/free-vector/cute-pink-bicycle-isolated_1284-43044.jpg?t=st=1684396069~exp=1684396669~hmac=fb265438f0680c00b7c156182201f5c15b602bd1733a5b051a2d9c77ff83a4fd"
img = load_image(url)
img
Prompt = “bicycle”
# Assigning Pipeline to prompt
depth2img("bicycle", img)[0]
You can experiment with different images and text. With more detailed prompts, you can generate new creative images.
Before we wrap up, it is vital to consider memory. You may run into memory problems when trying this customized pipeline in a constrained environment with limited CPU RAM or GPU memory. If you are running it on free Google Colab, you are likely to face a “CUDA out of memory” error.
To work around this common Stable Diffusion issue, there are a few quick options. The first is to reduce the image resolution, which saves both memory and processing time (the alternative is simply to pay for more memory). The model is trained at a resolution of around 512 pixels; you can halve this to 256, which frees a noticeable amount of memory.
height=256, width=256): # Reduced image dimension
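Two further options, not shown in the original code, are to downsize the input image before passing it to the pipeline (a smaller image means smaller latents and a smaller depth mask) and to free cached GPU memory between runs:
# Optional memory savers
img = load_image(url).resize((384, 384))  # smaller input -> smaller latents
torch.cuda.empty_cache()                  # release cached GPU memory between runs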
We have looked at building an image-to-image generation pipeline using depth2img pre-trained models. This is an alternative, also powered by Hugging Face, to the prebuilt pipeline that offers less customization. Building the pipeline directly on the pre-trained models makes things more adjustable. Harnessing the power of generative AI and Stable Diffusion could bring many changes and benefits to everyday business processes.
Key Takeaways
Reference Links
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Q. What is image-to-image generation?
A. Image-to-image generation refers to the process of generating new images from existing ones while preserving certain attributes or transforming them into a different representation. It involves mapping input images to corresponding output images using various techniques like generative adversarial networks (GANs) or conditional variational autoencoders (CVAEs).
Q. How does image generation work?
A. The process of image generation typically involves training a machine learning model on a dataset of images. This model learns the underlying patterns and structures of the images and can generate new images by sampling from the learned distribution. The process may involve preprocessing, feature extraction, model training, and post-processing steps.
Q. What is depth-to-image in Stable Diffusion?
A. In Stable Diffusion, depth-to-image refers to generating realistic images from depth maps or information. It involves converting depth representations, which typically indicate the distance of objects from the camera, into visually plausible images with realistic details, textures, and colors. Stable Diffusion is a method that leverages diffusion models for this purpose, enabling high-quality depth-to-image synthesis.
Q. What is image-to-image generation using Stable Diffusion?
A. Image-to-image generation using Stable Diffusion refers to the process of generating new images from existing ones by leveraging the Stable Diffusion method. This approach involves converting input images into a latent space representation, applying diffusion processes to manipulate the latent variables, and then mapping them back to the image space to generate visually coherent and diverse output images with desired attributes or transformations.