Hugging Face provides several ways to perform image-to-image generation with pre-trained models and supporting libraries. In this article, we will generate new images from an input image using the UNet2DConditionModel, building on PyTorch and the Hugging Face depth2img approach. Stable Diffusion models make it possible to modify existing images and generate new, creative ones from them.
This article was published as a part of the Data Science Blogathon.
Stable Diffusion models are latent diffusion models: they learn how images are represented, and how noise behaves, in a compressed latent space. They belong to the family of deep generative neural networks.
A beginner may ask what the word “Stable” implies in Stable Diffusion. The generation is stable because the original image and the supplied parameters guide the result as the model moves through the latent space. By contrast, unstable diffusion refers to generation that is unpredictable and essentially random.
The working concept of stable diffusion is a bit complex since it goes beyond regular image processing or deep learning. The attempt in this article is to demystify this concept using simple explanations.
The model uses a probabilistic formulation known as the latent diffusion model (LDM). During training, noise is added to the data step by step so that, after enough steps, the noisy samples follow an approximately normal distribution. In other words, even as noise is applied, the process stays within a well-defined statistical bound. This added noise is referred to as Gaussian noise.
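To make the forward noising process concrete, here is a minimal, purely illustrative sketch (not part of the pipeline we build below) of how Gaussian noise is progressively mixed into a clean latent under an assumed linear noise schedule:
import torch
# Forward diffusion step: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise
# With enough steps, x_t approaches pure Gaussian noise.
betas = torch.linspace(1e-4, 0.02, 1000)            # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # cumulative product of (1 - beta)

def add_noise(x0, t):
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.randn(1, 4, 64, 64)   # a dummy clean latent
x_t = add_noise(x0, t=999)       # after the last step: almost pure noise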
The image below illustrates the diffusion process in a latent space with noise, using the U-Net model.
The denoising itself is carried out by a sequence of denoising autoencoders (DAEs), which extend the standard autoencoder by adding noise to the input and adjusting the reconstruction criterion in the pixel space of the images. Stable Diffusion consists of a few essential parts: a variational autoencoder (VAE), an artificial neural network used in probabilistic graphical models; a U-Net block, a convolutional neural network (CNN) originally designed for image segmentation; and, finally, a text encoder, a trained CLIP ViT-L/14 model that transforms text prompts into an embedding space.
The VAE compresses the image into a lower-dimensional latent space, preserving detail while keeping memory usage manageable.
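As a rough sketch of this compression (using the same AutoencoderKL checkpoint we load later in the article; shapes are indicative), a 512x512 RGB image is encoded to a 4x64x64 latent and decoded back:
import torch
from diffusers import AutoencoderKL

# Rough sketch of the ~8x spatial compression performed by the VAE
vae = AutoencoderKL.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='vae')
with torch.no_grad():
    image = torch.randn(1, 3, 512, 512)                # dummy image tensor in [-1, 1]
    latents = vae.encode(image).latent_dist.sample()   # shape: (1, 4, 64, 64)
    reconstruction = vae.decode(latents).sample        # shape: (1, 3, 512, 512)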
The idea is to transform an existing image into a modified one using a diffuser-based image-generation pipeline: we give the pipeline an image, and it generates another image from it. This is just one of the many use cases of Stable Diffusion. Others include text-to-image, where no image is provided and only a text description of the expected image is given; text-to-video, which generates new videos from text; text-to-speech; and more. In this article, we take a hands-on approach to building a customized pipeline for image-to-image generation.
Let us see the process we will follow to build our customized pipeline.
Installing libraries
# Installing libraries
%pip install --quiet --upgrade diffusers transformers scipy ftfy
# Installing libraries
%pip install --quiet --upgrade accelerate
Let us look at the use of the above libraries:
Diffusers: This library provides the latent diffusion models trained to generate photo-realistic images from text. It is the backbone of everything we build here.
Transformers: These models understand features in a context such as text and turn them into token embeddings. The choice of transformer shapes how well the context is understood.
Scipy: SciPy is a long-standing Python library for scientific computation built on NumPy. It supplies our pipeline with optimization and signal-processing routines.
Ftfy: Short for “fixes text for you”, it repairs broken Unicode and provides other text-processing tools.
Accelerate: It allows the same PyTorch code to run across different configurations and environments with minimal effort.
import numpy as np
from tqdm import tqdm
from PIL import Image
# PyTorch backend
import torch
from torch import autocast
# Transformers containing tools
from transformers import CLIPTextModel, CLIPTokenizer
from transformers import DPTForDepthEstimation, DPTFeatureExtractor
# Accessing our pipeline
from diffusers import AutoencoderKL, UNet2DConditionModel
from diffusers.schedulers.scheduling_pndm import PNDMScheduler
Although Hugging Face provides a simple predefined pipeline for diffusion, in this section we will build our own. This gives us a more flexible approach: we specify exactly what we want and how we want it. We create a number of methods in a class that we will instantiate later. To follow along hands-on, please use the notebook available on GitHub.
# Creating a class for the Diffusion pipeline
class DiffusionPipeline:
    def __init__(self,
                 vae,
                 tokenizer,
                 text_encoder,
                 unet,
                 scheduler):
        self.vae = vae
        self.tokenizer = tokenizer
        self.text_encoder = text_encoder
        self.unet = unet
        self.scheduler = scheduler
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'

    def get_text_embeds(self, text):
        # Tokenizing the text
        text_input = self.tokenizer(text,
                                    padding='max_length',
                                    max_length=self.tokenizer.model_max_length,
                                    truncation=True,
                                    return_tensors='pt')
        # Embedding the tokenized text
        with torch.no_grad():
            text_embeds = self.text_encoder(text_input.input_ids.to(self.device))[0]  # Get embeddings
        return text_embeds

    def get_prompt_embeds(self, prompt):
        if isinstance(prompt, str):
            prompt = [prompt]
        # get conditional embeddings
        cond_embeds = self.get_text_embeds(prompt)
        # get unconditional embeddings
        uncond_embeds = self.get_text_embeds([''] * len(prompt))
        # concatenate the above 2 embeds
        prompt_embeds = torch.cat([uncond_embeds, cond_embeds])
        return prompt_embeds

    def decode_img_latents(self, img_latents):
        img_latents = 1 / self.vae.config.scaling_factor * img_latents
        with torch.no_grad():
            img = self.vae.decode(img_latents).sample
        img = (img / 2 + 0.5).clamp(0, 1)
        img = img.cpu().permute(0, 2, 3, 1).float().numpy()
        return img

    def transform_img(self, img):
        # scale images to [0, 255] and convert to int
        img = (img * 255).round().astype('uint8')
        # convert to PIL Image objects
        img = [Image.fromarray(i) for i in img]
        return img

    def encode_img_latents(self, img, latent_timestep):
        if not isinstance(img, list):
            img = [img]
        img = np.stack([np.array(i) for i in img], axis=0)
        # scale images to [-1, 1]
        img = 2 * ((img / 255.0) - 0.5)
        img = torch.from_numpy(img).float().permute(0, 3, 1, 2)
        img = img.to(self.device)
        # encode images
        img_latents_dist = self.vae.encode(img)
        img_latents = img_latents_dist.latent_dist.sample()
        # scale images
        img_latents = self.vae.config.scaling_factor * img_latents
        # add noise to the latents
        noise = torch.randn(img_latents.shape).to(self.device)
        img_latents = self.scheduler.add_noise(img_latents, noise, latent_timestep)
        return img_latents
Next, we subclass this base pipeline. The new Depth2ImgPipeline class takes the same components plus the depth-estimation models, and will be assigned to a variable that we call on our target image.
class Depth2ImgPipeline(DiffusionPipeline):
    def __init__(self,
                 vae,
                 tokenizer,
                 text_encoder,
                 unet,
                 scheduler,
                 depth_feature_extractor,
                 depth_estimator):
        super().__init__(vae, tokenizer, text_encoder, unet, scheduler)
        self.depth_feature_extractor = depth_feature_extractor
        self.depth_estimator = depth_estimator

    def get_depth_mask(self, img):
        if not isinstance(img, list):
            img = [img]
        width, height = img[0].size
        # pre-process the input image and get its pixel values
        pixel_values = self.depth_feature_extractor(img, return_tensors="pt").pixel_values
        # use autocast for automatic mixed precision (AMP) inference
        with autocast('cuda'):
            depth_mask = self.depth_estimator(pixel_values).predicted_depth
        # get the depth mask
        depth_mask = torch.nn.functional.interpolate(depth_mask.unsqueeze(1),
                                                     size=(height//8, width//8),
                                                     mode='bicubic',
                                                     align_corners=False)
        # scale the mask to [-1, 1]
        depth_min = torch.amin(depth_mask, dim=[1, 2, 3], keepdim=True)
        depth_max = torch.amax(depth_mask, dim=[1, 2, 3], keepdim=True)
        depth_mask = 2.0 * (depth_mask - depth_min) / (depth_max - depth_min) - 1.0
        depth_mask = depth_mask.to(self.device)
        # replicate the mask for classifier-free guidance
        depth_mask = torch.cat([depth_mask] * 2)
        return depth_mask

    # Denoising the image over the latent space
    def denoise_latents(self,
                        img,
                        prompt_embeds,
                        depth_mask,
                        strength,
                        num_inference_steps=20,
                        guidance_scale=7.5,
                        height=512, width=512):
        # clip the value of strength to ensure strength lies in [0, 1]
        strength = max(min(strength, 1), 0)
        # compute timesteps
        self.scheduler.set_timesteps(num_inference_steps)
        init_timestep = int(num_inference_steps * strength)
        t_start = num_inference_steps - init_timestep
        timesteps = self.scheduler.timesteps[t_start:]
        num_inference_steps = num_inference_steps - t_start
        latent_timestep = timesteps[:1].repeat(1)
        latents = self.encode_img_latents(img, latent_timestep)
        # use autocast for automatic mixed precision (AMP) inference
        with autocast('cuda'):
            for i, t in tqdm(enumerate(timesteps)):
                latent_model_input = torch.cat([latents] * 2)
                latent_model_input = torch.cat([latent_model_input, depth_mask], dim=1)
                # predict noise residuals
                with torch.no_grad():
                    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=prompt_embeds)['sample']
                # separate predictions for unconditional and conditional outputs
                noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                # perform guidance
                noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
                # remove the noise from the current sample i.e. go from x_t to x_{t-1}
                latents = self.scheduler.step(noise_pred, t, latents)['prev_sample']
        return latents

    def __call__(self,
                 prompt,
                 img,
                 strength=0.8,
                 num_inference_steps=50,
                 guidance_scale=7.5,
                 height=512, width=512):
        prompt_embeds = self.get_prompt_embeds(prompt)
        depth_mask = self.get_depth_mask(img)
        latents = self.denoise_latents(img,
                                       prompt_embeds,
                                       depth_mask,
                                       strength,
                                       num_inference_steps,
                                       guidance_scale,
                                       height, width)
        img = self.decode_img_latents(latents)
        img = self.transform_img(img)
        return img
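To see how the strength argument interacts with the number of inference steps, here is a small worked example that simply mirrors the arithmetic in denoise_latents above:
# With 50 inference steps and strength=0.8, the first 10 scheduler timesteps
# are skipped and 40 denoising steps actually run.
num_inference_steps = 50
strength = 0.8
init_timestep = int(num_inference_steps * strength)  # 40
t_start = num_inference_steps - init_timestep        # 10
print(init_timestep, t_start)                        # 40 10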
Next, we specify a device so that the heavy computation runs on the GPU.
# Setting a GPU device
device = 'cuda'
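If you are on a machine without a GPU, one optional tweak (not in the original code) is to fall back to the CPU automatically:
# Fall back to CPU when CUDA is not available (optional)
device = 'cuda' if torch.cuda.is_available() else 'cpu'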
Since the prompt is text, we need a way to convert it into something the pipeline can work with. The tokenizer and text encoder handle the prompt, while the autoencoder (VAE) handles the images.
# Loading autoencoder for reconstructing the image
vae = AutoencoderKL.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='vae').to(device)
# Load the tokenizer and the text encoder
tokenizer = CLIPTokenizer.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='tokenizer')
text_encoder = CLIPTextModel.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='text_encoder').to(device)
# Load UNet model
unet = UNet2DConditionModel.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='unet').to(device)
# Load the scheduler that controls the noise schedule during denoising
scheduler = PNDMScheduler(beta_start=0.00085,
                          beta_end=0.012,
                          beta_schedule='scaled_linear',
                          num_train_timesteps=1000)
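As a quick, illustrative check of what the tokenizer and text encoder produce (this simply mirrors get_text_embeds above), a prompt becomes a fixed-length sequence of token ids and then a tensor of per-token embeddings:
# Rough sketch: turn a prompt into token ids, then into embeddings
with torch.no_grad():
    tokens = tokenizer("two red roses",
                       padding='max_length',
                       max_length=tokenizer.model_max_length,
                       truncation=True,
                       return_tensors='pt')
    embeds = text_encoder(tokens.input_ids.to(device))[0]
print(tokens.input_ids.shape)  # (1, 77): one padded sequence of token ids
print(embeds.shape)            # (1, 77, hidden_size): one embedding per token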
DPT (Dense Prediction Transformer) was first released by Intel Labs around 2021. Its architecture leverages vision transformers instead of traditional convolutional networks for dense prediction tasks such as depth estimation and semantic segmentation, where it was trained with over 150 classes. The input image is split into patches that the transformer processes as tokens, somewhat like a form of segmentation.
You can find the official paper on DPT released by Intel here
# Load DPT Depth Estimator for measuring the distance of each pixel
depth_estimator = DPTForDepthEstimation.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='depth_estimator')
# Load DPT Feature Extractor for dense prediction
depth_feature_extractor = DPTFeatureExtractor.from_pretrained('stabilityai/stable-diffusion-2-depth', subfolder='feature_extractor')
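To see what these two components do on their own (mirroring get_depth_mask above; img here stands for any PIL image, such as the one we load in the next section), the feature extractor prepares the pixel values and the estimator returns a relative depth map:
# Rough sketch: estimate depth for a single PIL image `img`
with torch.no_grad():
    pixel_values = depth_feature_extractor(img, return_tensors='pt').pixel_values
    predicted_depth = depth_estimator(pixel_values).predicted_depth
print(predicted_depth.shape)  # (1, H', W'): a relative depth value per pixel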
We can now initialize our pipeline. Wrapping everything in one object keeps the code brief when we call the pipeline on an input image.
# Initializing pipeline
depth2img = Depth2ImgPipeline(vae,
                              tokenizer,
                              text_encoder,
                              unet,
                              scheduler,
                              depth_feature_extractor,
                              depth_estimator)
We need a helper function to check whether a string is a URL before loading images.
import urllib.parse as parse
import os
import requests
# Determine if a string is a URL
def check_url(string):
    try:
        result = parse.urlparse(string)
        return all([result.scheme, result.netloc, result.path])
    except:
        return False
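A quick sanity check of the helper: strings with a scheme, host, and path count as URLs, while plain file names do not.
print(check_url("https://img.freepik.com/free-vector/two-red-roses-white_1308-35268.jpg"))  # True
print(check_url("my_local_image.jpg"))                                                      # False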
We can now build a function to load the images.
# Load an image
def load_image(image_path):
    if check_url(image_path):
        return Image.open(requests.get(image_path, stream=True).raw)
    elif os.path.exists(image_path):
        return Image.open(image_path)
We are now set to use our model. To generate new images with our pipeline, we first load an image from a URL. Let us view a sample image.
# Getting an image URL
url = "https://img.freepik.com/free-vector/two-red-roses-white_1308-35268.jpg?size=626&ext=jpg&uid=R21895281&ga=GA1.1.821631087.1678296847&semt=ais"
img = load_image(url)
img
Prompt = “two hibiscus flowers”
# Assigning Pipeline to prompt
depth2img("two hibiscus flowers", img)[0]
Let us try another sample.
# Getting an image URL
url = "https://img.freepik.com/free-vector/cute-pink-bicycle-isolated_1284-43044.jpg?t=st=1684396069~exp=1684396669~hmac=fb265438f0680c00b7c156182201f5c15b602bd1733a5b051a2d9c77ff83a4fd"
img = load_image(url)
img
Prompt = “bicycle”
# Assigning Pipeline to prompt
depth2img("bicycle", img)[0]
You can experiment with different images and text. With more detailed prompts, you can generate new creative images.
Before we wrap up, it is vital to consider memory. You may run into memory problems when trying this customized pipeline in a constrained environment with limited CPU RAM or GPU memory. If you are running it on free Google Colab, you are likely to face a “CUDA out of memory” error.
To work around this common Stable Diffusion issue, there are a few quick options. The first is to reduce the image resolution, which saves both memory and processing time (the alternative is simply to pay for more memory). The model is trained at a resolution of around 512 pixels; you can halve this to 256, which frees a noticeable amount of memory.
height=256, width=256): # Reduced image dimension
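Two further options, not shown in the original code, are to downsize the input image before passing it to the pipeline (a smaller image means smaller latents and a smaller depth mask) and to free cached GPU memory between runs:
# Optional memory savers
img = load_image(url).resize((384, 384))  # smaller input -> smaller latents
torch.cuda.empty_cache()                  # release cached GPU memory between runs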
We have looked at building an image-to-image generation pipeline using depth2img pre-trained models. This is an alternative, also powered by Hugging Face, to the prebuilt pipeline that offers less customization. Building the pipeline directly on the pre-trained models makes things more adjustable. Harnessing the power of generative AI and Stable Diffusion could bring many changes and benefits to everyday business processes.
Key Takeaways
Reference Links
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Q. What is image-to-image generation?
A. Image-to-image generation refers to the process of generating new images from existing ones while preserving certain attributes or transforming them into a different representation. It involves mapping input images to corresponding output images using various techniques like generative adversarial networks (GANs) or conditional variational autoencoders (CVAEs).
Q. How does image generation work?
A. The process of image generation typically involves training a machine learning model on a dataset of images. This model learns the underlying patterns and structures of the images and can generate new images by sampling from the learned distribution. The process may involve preprocessing, feature extraction, model training, and post-processing steps.
Q. What is depth-to-image in Stable Diffusion?
A. In Stable Diffusion, depth-to-image refers to generating realistic images from depth maps or information. It involves converting depth representations, which typically indicate the distance of objects from the camera, into visually plausible images with realistic details, textures, and colors. Stable Diffusion is a method that leverages diffusion models for this purpose, enabling high-quality depth-to-image synthesis.
Q. What is image-to-image generation using Stable Diffusion?
A. Image-to-image generation using Stable Diffusion refers to the process of generating new images from existing ones by leveraging the Stable Diffusion method. This approach involves converting input images into a latent space representation, applying diffusion processes to manipulate the latent variables, and then mapping them back to the image space to generate visually coherent and diverse output images with desired attributes or transformations.