With recent advances in AI, the capabilities of Generative AI are being explored, and generating images from text is one of them. Many models can do this, including Stable Diffusion, Imagen, DALL-E 3, Midjourney, DreamBooth, DreamFusion, and more. In this article, we will review the diffusion model used in Stable Diffusion along with its fine-tuning using LoRA.
Learning Objectives

- Understand what a diffusion model is and the components that make up Stable Diffusion: the text encoder, the U-Net, the VAE, and the scheduler.
- Build a Stable Diffusion inference pipeline from these components using the diffusers library.
The diffusion model is a class of deep learning models capable of generating new data similar to what they have seen during training. Stable Diffusion is one such model, and it has the following capabilities:

- Text-to-image generation: creating a new image from a text prompt.
- Image-to-image generation: modifying an existing image, guided by a text prompt.
- Inpainting: restoring or replacing masked or damaged regions of an image.
In summary, the Stable Diffusion model is a versatile deep-learning model with capabilities ranging from creative content generation to image manipulation and restoration. Its adaptability to diverse tasks makes it a valuable tool in various fields, including computer vision, graphics, and creative arts.
Let’s start with the components involved in the Stable Diffusion model:
The task of the text encoder is to transform the input prompt into an embedding space that the U-Net can comprehend. Typically implemented as a simple transformer-based encoder, it maps a sequence of input tokens to a set of latent text embeddings.
Following Imagen, Stable Diffusion does not train the text encoder during its own training phase. Instead, it uses the already pretrained text encoder from CLIP, specifically the CLIPTextModel. CLIP is a multi-modal vision-and-language model used for tasks such as image-text similarity and zero-shot image classification. It uses a ViT-like transformer for visual features and a causal language model for text features; both sets of features are then projected into a latent space of the same dimension.
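As a quick illustration (separate from the pipeline we build below), the following sketch shows how a prompt is tokenized and encoded into text embeddings with the pretrained CLIP text encoder. The printed shape assumes the ViT-L/14 checkpoint, which produces one 768-dimensional vector for each of the 77 tokens in the padded sequence.

import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# tokenize the prompt and pad it to the model's maximum sequence length (77 tokens)
tokens = tokenizer(["an astronaut riding a horse"], padding="max_length",
                   max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt")
with torch.no_grad():
    embeddings = text_encoder(tokens.input_ids)[0]
print(embeddings.shape)  # torch.Size([1, 77, 768])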
The U-Net architecture consists of an encoder and a decoder, each built from ResNet blocks. The encoder compresses the image representation into a lower-resolution representation, while the decoder reconstructs this lower-resolution representation back into the original, higher-resolution representation with less noise. Specifically, the U-Net output predicts the noise residual, which is used to compute the denoised image representation.
To mitigate the loss of crucial information during downsampling, short-cut connections are typically introduced. These connections link the encoder’s downsampling ResNets to the decoder’s upsampling ResNets. Furthermore, the stable diffusion U-Net can condition its output on text embeddings by incorporating cross-attention layers. Both the encoder and decoder sections of the U-Net integrate these cross-attention layers, usually positioning them between ResNet blocks.
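To make this concrete, here is a minimal sketch of a single conditioned U-Net call. The tensor shapes are assumptions based on Stable Diffusion v1 (4 latent channels, 64×64 latents, 768-dimensional CLIP embeddings), and the inputs are random placeholders rather than real data.

import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")

noisy_latents = torch.randn(1, 4, 64, 64)   # placeholder noisy latent image representation
timestep = torch.tensor([10])               # current diffusion timestep
text_embeddings = torch.randn(1, 77, 768)   # placeholder text embeddings from the text encoder

with torch.no_grad():
    # the U-Net is conditioned on the text embeddings via its cross-attention layers
    noise_residual = unet(noisy_latents, timestep, encoder_hidden_states=text_embeddings).sample
print(noise_residual.shape)  # torch.Size([1, 4, 64, 64]) -- same shape as the latents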
The VAE model has two parts: an encoder and a decoder. The encoder converts an image into a low-dimensional latent representation, which serves as the input to the U-Net model, and the decoder transforms the latent representation back into an image. During latent diffusion training, the encoder maps the training images to their latent representations for the forward diffusion process, which gradually adds more noise at each step. During inference, the denoised latents produced by the reverse diffusion process are transformed back into images using the VAE decoder. As we will see, for inference we only need the VAE decoder.
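The round trip through the VAE can be sketched as follows; the input here is a random placeholder rather than a real image, and 0.18215 is the latent scaling constant used by Stable Diffusion v1.

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

image = torch.randn(1, 3, 512, 512)  # placeholder RGB image scaled to [-1, 1]
with torch.no_grad():
    # encode: 512x512 pixels -> 64x64 latents with 4 channels
    latents = vae.encode(image).latent_dist.sample() * 0.18215
    # decode: latents -> reconstructed 512x512 image
    reconstruction = vae.decode(latents / 0.18215).sample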
In this section, we will use these components from the Diffusers library to write our own inference pipeline.
Import all the pretrained models using the diffusers library:
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel

# 1. The VAE model for decoding the latents into images.
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

# 2. The tokenizer and text encoder for encoding the prompt.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# 3. The UNet model for generating the latents.
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet")

# Move the models to the GPU if one is available.
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
vae.to(torch_device)
text_encoder.to(torch_device)
unet.to(torch_device)
In this step, we will define a K-LMS scheduler instead of the pre-defined one. Schedulers are the denoising algorithms that, given the noise residual predicted by the U-Net, compute the less noisy latent representation for the next step.
from diffusers import LMSDiscreteScheduler

scheduler = LMSDiscreteScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")
Let’s define a few parameters to be used for generating images:
prompt = ["an astronaut riding a horse"]
height = 512 # default height of Stable Diffusion
width = 512 # default width of Stable Diffusion
num_inference_steps = 100 # Number of denoising steps
guidance_scale = 7.5 # Scale for classifier-free guidance
generator = torch.manual_seed(32) # Seed generator to create the initial latent noise
batch_size = 1
Get the text embeddings for the prompt, which will be used to condition the U-Net model.
text_input = tokenizer(prompt, padding="max_length", max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")

with torch.no_grad():
    text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]
We will also obtain the unconditional text embeddings used for classifier-free guidance. These are simply the embeddings of the empty string (i.e., padding tokens only), and they must have the same shape as the conditional text embeddings (the same batch size and sequence length).
max_length = text_input.input_ids.shape[-1]
uncond_input = tokenizer(
    [""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt"
)

with torch.no_grad():
    uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]
Classifier-free guidance requires two forward passes per step: one with the conditional text embeddings and one with the unconditional embeddings (uncond_embeddings). In practice, it is more efficient to concatenate both sets of embeddings into a single batch so that one U-Net call handles them together.
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])
Generate initial latent noise:
latents = torch.randn(
    (batch_size, unet.in_channels, height // 8, width // 8),
    generator=generator,
)
latents = latents.to(torch_device)
The initialization of the scheduler involves specifying the chosen num_inference_steps. During this initialization, the scheduler computes the sigmas and determines the exact time step values to use throughout the denoising process.
scheduler.set_timesteps(num_inference_steps)
latents = latents * scheduler.init_noise_sigma
Let’s write the denoising loop:

from tqdm.auto import tqdm

for t in tqdm(scheduler.timesteps):
    # expand the latents since we are doing classifier-free guidance, to avoid two forward passes
    latent_model_input = torch.cat([latents] * 2)
    latent_model_input = scheduler.scale_model_input(latent_model_input, t)

    # predict the noise residual
    with torch.no_grad():
        noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

    # perform classifier-free guidance
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # compute the previous noisy sample x_t -> x_t-1
    latents = scheduler.step(noise_pred, t, latents).prev_sample
Let’s use the VAE decoder to transform the generated latents back into an image.
# scale and decode the image latents with the VAE
latents = 1 / 0.18215 * latents
with torch.no_grad():
    image = vae.decode(latents).sample
Let’s convert the image to PIL to display or save it.
from PIL import Image

image = (image / 2 + 0.5).clamp(0, 1)
image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
images = (image * 255).round().astype("uint8")
pil_images = [Image.fromarray(image) for image in images]
pil_images[0]
The below image is generated using the above code:
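As a side note, if you do not need access to the individual components, the same kind of image can be generated with the high-level StableDiffusionPipeline from the diffusers library. A minimal sketch, assuming a CUDA GPU is available:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4").to("cuda")

generator = torch.manual_seed(32)
image = pipe("an astronaut riding a horse", num_inference_steps=100,
             guidance_scale=7.5, generator=generator).images[0]
image.save("astronaut.png")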
In this article, we explored the components involved in image generation with Stable Diffusion and its capabilities. The following are the key takeaways:
- Unlike models such as Imagen, which operate in pixel space, Stable Diffusion operates in latent space.
- The text encoder converts the text prompt into text embeddings, which are used as conditioning input for the U-Net.
- Latent diffusion significantly improves efficiency by reducing both memory and compute requirements, since the diffusion process runs in a lower-dimensional latent space instead of the actual pixel space.
- A latent seed is used to generate random latent image representations of size 64×64.
- Schedulers are denoising algorithms that use the noise residual predicted by the U-Net to compute a less noisy latent at each step.