Everything You Need To Know About Stable Diffusion

Sarvagya Agrawal Last Updated : 04 Dec, 2024
7 min read

Introduction

Recent advances in AI have opened up the capabilities of generative AI, and generating images from text is one of them. Many models and techniques address this task, including Stable Diffusion, Imagen, DALL-E 3, Midjourney, DreamBooth, and DreamFusion. In this article, we review the diffusion model behind Stable Diffusion, the components it is built from, and how to assemble them into an inference pipeline.


Learning Objectives

  • Understand the basic concept behind Stable Diffusion.
  • Learn the components involved in image generation.
  • Get hands-on experience generating images with Stable Diffusion.

This article was published as a part of the Data Science Blogathon.

Introduction to Stable Diffusion

Diffusion models are a class of deep learning models capable of generating new data similar to what they have seen during training. Stable Diffusion is one such model, with the following capabilities:

Text-to-Image Generation

  • Stable Diffusion translates textual descriptions into visually coherent images. It uses the patterns learned from its training data to create images that align with the provided text prompts.
  • Applications include content creation, where users describe a scene or concept in text and the model generates an image based on that description.
  • Developers can also use the Stable Diffusion API to integrate text-to-image generation into their applications, creating images from textual prompts programmatically.

Image-to-Image Generation

  • Here the user supplies an input image together with a textual prompt that guides the modification. The model combines the visual information from the image with the contextual cues from the text to produce a modified version of the input image.
  • Use cases for this feature range from creative design to image enhancement, where users can specify desired changes or adjustments through both text and visual input.

Inpainting

  • Inpainting is a specialized form of image-to-image generation in which the model restores or completes specific regions of an image that are missing or corrupted. Noise is introduced into those regions, and the model regenerates them so that they are consistent with the rest of the image.
  • This capability finds applications in image restoration, where the model can reconstruct damaged or incomplete images based on the provided information.

Depth-to-Image

  • Depth-to-image generation uses depth information to guide image generation. A depth map describes the distance of objects in a scene, and the model generates an image whose spatial layout follows that depth map.
  • Applications of this feature include computer vision tasks such as 3D reconstruction and scene understanding, where depth information is crucial for interpreting the spatial layout of a scene.

In summary, the Stable Diffusion model is a versatile deep-learning model with capabilities ranging from creative content generation to image manipulation and restoration. Its adaptability to diverse tasks makes it a valuable tool in various fields, including computer vision, graphics, and creative arts.

Understanding the Working of Stable Diffusion

Let’s start with the components involved in the Stable Diffusion model:


Text Encoder

The task of the text encoder is to transform the input prompt into an embedding space that the U-Net can comprehend. Typically implemented as a simple transformer-based encoder, it maps a sequence of input tokens to a set of latent text embeddings.

Influenced by Imagen, Stable Diffusion does not train the text encoder during its own training phase. Instead, it reuses the pretrained text encoder from CLIP, specifically the CLIPTextModel. CLIP is a multi-modal vision and language model that serves several purposes, including image-text similarity and zero-shot image classification. It uses a ViT-like transformer for visual features and a causal language model for text features. The text and visual features are then projected into a latent space with identical dimensions.

U-Net Model as Noise Predictor

The U-Net architecture consists of an encoder and a decoder, each made up of ResNet blocks. The encoder compresses an image representation into a lower resolution, and the decoder reconstructs that lower-resolution representation back to the original resolution with less noise. Specifically, the U-Net output predicts the noise residual, which is then used to compute the denoised image representation.

To mitigate the loss of crucial information during downsampling, short-cut connections are typically introduced. These connections link the encoder’s downsampling ResNets to the decoder’s upsampling ResNets. Furthermore, the stable diffusion U-Net can condition its output on text embeddings by incorporating cross-attention layers. Both the encoder and decoder sections of the U-Net integrate these cross-attention layers, usually positioning them between ResNet blocks.
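The text conditioning described above can be sketched in a few lines. This is a minimal, pure-Python illustration of scaled dot-product cross-attention, assuming the queries come from image-latent positions and the keys and values come from text-token embeddings. It leaves out the learned projection matrices and the multiple heads of the real U-Net, and the function names are illustrative rather than taken from any library.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Let each latent position (query) attend over all text tokens (keys/values).

    Returns the attended features and the attention weights per query.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity between this latent position and every text token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted mix of the text-token value vectors
        mixed = [sum(w * v[c] for w, v in zip(weights, values))
                 for c in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Two latent positions attending over three toy "text tokens"
latent_queries = [[1.0, 0.0], [0.0, 1.0]]
text_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
attended, weights = cross_attention(latent_queries, text_keys, text_values)
print(attended)
```

Each output row is a mixture of the text-token values, which is how information from the prompt can reach every spatial position of the latent.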

Autoencoder (VAE)

The VAE model has two parts: an encoder and a decoder. The encoder converts the image into a low-dimensional latent representation, which serves as the input to the U-Net model. The decoder transforms the latent representation back into an image. During latent diffusion training, the encoder is used to obtain the latent representations of the training images for the forward diffusion process, which gradually adds more noise at each step. During inference, the denoised latents produced by the reverse diffusion process are transformed back into images by the VAE decoder. As we will see, inference needs only the VAE decoder.
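The forward process mentioned above can be illustrated on a toy latent. This is a minimal sketch, assuming the standard closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise; the function name `add_noise_toy` and the flat-list latent are illustrative and not part of the Diffusers library.

```python
import math
import random

def add_noise_toy(latent, alpha_bar, rng):
    """Noise a flat list of latent values to the level given by alpha_bar.

    alpha_bar close to 1 keeps the signal; close to 0 leaves almost pure noise.
    """
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * x + noise_scale * rng.gauss(0.0, 1.0) for x in latent]

rng = random.Random(0)
latent = [0.5, -1.0, 2.0, 0.0]
# From no noise (alpha_bar = 1) to pure noise (alpha_bar = 0)
for alpha_bar in (1.0, 0.5, 0.0):
    noisy = add_noise_toy(latent, alpha_bar, rng)
    print(alpha_bar, [round(v, 3) for v in noisy])
```

Training teaches the U-Net to predict the noise that was mixed in at a given step, which is what makes the reverse process possible.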

Steps to Generate Images with Stable Diffusion

In this section, we use the building blocks of the Diffusers library to write our own inference pipeline.

Step 1.

Import all the pretrained models using the diffusers library.

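import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel

# Run on the GPU when one is available, otherwise fall back to the CPU.
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. The VAE that decodes the latents into image space.
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae").to(torch_device)

# 2. The tokenizer and text encoder that turn the prompt into embeddings.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(torch_device)

# 3. The UNet model for generating the latents.
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
).to(torch_device)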

Step 2.

In this step, we load a K-LMS scheduler instead of the model's pre-defined one. A scheduler is the algorithm that takes the noise predicted by the U-Net at each step and computes the next, less noisy latent representation from the current one.

from diffusers import LMSDiscreteScheduler

scheduler = LMSDiscreteScheduler.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="scheduler"
)

Step 3.

Let’s define a few parameters to be used for generating images:

prompt = ["an astronaut riding a horse"]

height = 512                        # default height of Stable Diffusion
width = 512                         # default width of Stable Diffusion

num_inference_steps = 100           # Number of denoising steps

guidance_scale = 7.5                # Scale for classifier-free guidance

generator = torch.manual_seed(32)   # Seed generator to create the initial latent noise

batch_size = 1

Step 4.

Get the text embeddings for the prompt, which will be used for the U-Net model.

text_input = tokenizer(prompt, padding="max_length", 
  max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt")


with torch.no_grad():
  text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]

Step 5.

Next, we obtain the unconditional text embeddings needed for classifier-free guidance. These are simply the embeddings of the padding tokens, that is, of an empty prompt. They must have the same shape as the conditional text embeddings (same batch size and sequence length).

max_length = text_input.input_ids.shape[-1]

uncond_input = tokenizer(
    [""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt"
)

with torch.no_grad():
  uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]
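To see why the empty prompt yields a tensor of the same shape as the real prompt, here is a toy sketch of "max_length" padding. It assumes a sequence length of 77, which is the model_max_length of the CLIP tokenizer; the token ids and the helper `pad_to_max_length` are hypothetical and do not reproduce the real tokenizer's vocabulary.

```python
def pad_to_max_length(token_ids, max_length, bos_id, eos_id, pad_id):
    """Wrap token ids in BOS/EOS, truncate if needed, then right-pad to max_length."""
    body = token_ids[: max_length - 2]          # leave room for BOS and EOS
    ids = [bos_id] + body + [eos_id]
    return ids + [pad_id] * (max_length - len(ids))

MAX_LENGTH = 77            # sequence length used by the CLIP tokenizer
BOS, EOS, PAD = 1, 2, 0    # hypothetical special-token ids, for illustration only

prompt_ids = pad_to_max_length([11, 12, 13, 14, 15], MAX_LENGTH, BOS, EOS, PAD)
empty_ids = pad_to_max_length([], MAX_LENGTH, BOS, EOS, PAD)

# Both sequences have the same length, so their embeddings can share a batch.
print(len(prompt_ids), len(empty_ids))
```

Because every prompt is padded or truncated to the same length, the conditional and unconditional embeddings can later be concatenated along the batch dimension.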

Step 6.

Classifier-free guidance needs two noise predictions: one conditioned on the text embeddings and one conditioned on the unconditional embeddings (uncond_embeddings). Rather than running two separate forward passes, in practice we concatenate both sets of embeddings into a single batch and run the U-Net once.

text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

Step 7.

Generate initial latent noise:

latents = torch.randn(
  (batch_size, unet.config.in_channels, height // 8, width // 8),
  generator=generator,
)

latents = latents.to(torch_device)
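The shape used above follows from the VAE's downsampling. As a small worked example, assuming the Stable Diffusion v1 settings of 4 latent channels and a spatial downsampling factor of 8 (the helper names are illustrative):

```python
def latent_shape(height, width, batch_size=1, latent_channels=4, vae_factor=8):
    """Shape of the latent tensor for a given output image size."""
    if height % vae_factor or width % vae_factor:
        raise ValueError("height and width must be multiples of the VAE factor")
    return (batch_size, latent_channels, height // vae_factor, width // vae_factor)

def compression_ratio(height, width, latent_channels=4, vae_factor=8):
    """How many RGB pixel values there are per latent value."""
    pixel_values = 3 * height * width
    _, channels, lat_h, lat_w = latent_shape(height, width, 1, latent_channels, vae_factor)
    return pixel_values / (channels * lat_h * lat_w)

print(latent_shape(512, 512))       # (1, 4, 64, 64)
print(compression_ratio(512, 512))  # 48.0
```

The denoising loop therefore works on 48 times fewer values than a pixel-space model would for the same 512×512 image.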

Step 8.

The initialization of the scheduler involves specifying the chosen num_inference_steps. During this initialization, the scheduler computes the sigmas and determines the exact time step values to use throughout the denoising process.

scheduler.set_timesteps(num_inference_steps)

latents = latents * scheduler.init_noise_sigma
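To give a feel for what the scheduler computes here, the following is a simplified, pure-Python sketch of the sigma schedule. It assumes a "scaled linear" beta schedule with beta_start=0.00085, beta_end=0.012 and 1000 training steps. The actual LMSDiscreteScheduler also interpolates sigmas at fractional timesteps and appends a final zero, which this sketch replaces with simple rounding; the function names are illustrative.

```python
import math

def scaled_linear_betas(num_train_timesteps=1000, beta_start=0.00085, beta_end=0.012):
    """Betas that are linear in sqrt-space, then squared."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    n = num_train_timesteps
    return [(lo + (hi - lo) * i / (n - 1)) ** 2 for i in range(n)]

def sigmas_from_betas(betas):
    """sigma_t = sqrt((1 - alpha_bar_t) / alpha_bar_t) for every training step."""
    sigmas, alpha_bar = [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        sigmas.append(math.sqrt((1.0 - alpha_bar) / alpha_bar))
    return sigmas

def inference_timesteps(num_inference_steps, num_train_timesteps=1000):
    """Evenly spaced timesteps from the noisiest step down to 0 (needs >= 2 steps)."""
    last = num_train_timesteps - 1
    stride = last / (num_inference_steps - 1)
    return [last - i * stride for i in range(num_inference_steps)]

train_sigmas = sigmas_from_betas(scaled_linear_betas())
timesteps = inference_timesteps(100)
# Simplification: look up sigma at the nearest integer timestep.
step_sigmas = [train_sigmas[round(t)] for t in timesteps]

init_noise_sigma = max(step_sigmas)  # scale applied to the initial random latents
print(len(timesteps), init_noise_sigma)
```

Multiplying the initial latents by the largest sigma puts them at the noise level the scheduler expects at the first denoising step.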

Step 9.

Let’s write the denoising loop:

from tqdm.auto import tqdm

for t in tqdm(scheduler.timesteps):
  # Duplicate the latents so that the unconditional and text-conditioned
  # predictions come from a single forward pass (classifier-free guidance).
  latent_model_input = torch.cat([latents] * 2)
  latent_model_input = scheduler.scale_model_input(latent_model_input, t)

  # predict the noise residual
  with torch.no_grad():
    noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

  # perform guidance
  noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
  noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

  # compute the previous noisy sample x_t -> x_t-1
  latents = scheduler.step(noise_pred, t, latents).prev_sample
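The guidance step in the loop is a simple extrapolation, which a toy example with plain lists makes concrete. This is only an illustration of the formula used above; `apply_guidance` is a hypothetical helper, and the numbers are made up.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Move from the unconditional prediction toward (and past) the text-conditioned one."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]

uncond = [0.0, 1.0]   # toy unconditional noise prediction
text = [1.0, 3.0]     # toy text-conditioned noise prediction

print(apply_guidance(uncond, text, 0.0))   # [0.0, 1.0]  -> ignores the prompt
print(apply_guidance(uncond, text, 1.0))   # [1.0, 3.0]  -> plain conditional prediction
print(apply_guidance(uncond, text, 7.5))   # [7.5, 16.0] -> exaggerates the prompt's effect
```

A scale above 1 amplifies the difference between the two predictions, which is why larger values of guidance_scale make the image follow the prompt more closely.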

Step 10.

Let’s use the VAE to decode the generated latent into the image.

# scale and decode the image latents with vae
latents = 1 / 0.18215 * latents

with torch.no_grad():
  image = vae.decode(latents).sample

Step 11.

Let’s convert the image to PIL to display or save it.

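from PIL import Image

# map the decoder output from [-1, 1] to [0, 1]
image = (image / 2 + 0.5).clamp(0, 1)

# reorder to (batch, height, width, channels) and convert to 8-bit pixels
image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
images = (image * 255).round().astype("uint8")

pil_images = [Image.fromarray(image) for image in images]
pil_images[0]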

The image below was generated using the above code:

[Image: output of the code above]

Conclusion

In the above article, we explored the components involved in image generation by Stable Diffusion and its capabilities. Following are the key takeaways:

  • Comprehensive insight into the capabilities of diffusion models.
  • Overview of the critical components integral to Stable Diffusion.
  • Practical, hands-on experience in constructing a personalized diffusion pipeline.


Frequently Asked Questions

Q1. Why is Stable Diffusion faster than other models like Imagen?

Unlike models such as Imagen, which operate in pixel space, Stable Diffusion operates in a lower-dimensional latent space.

Q2. What is the role of the text encoder in Stable Diffusion?

It converts the text input into text embeddings, which can be used as input for U-Net.

Q3. What is latent diffusion?

Latent diffusion runs the diffusion process in a lower-dimensional latent space instead of the actual pixel space. This considerably reduces both memory use and compute cost.

Q4. What is a latent seed?

A latent seed is the random seed used to generate the initial random latent image representation, which has a size of 64×64 for a 512×512 image.

Q5. What are schedulers?

They are the denoising algorithms that use the noise predicted by the U-Net model to remove noise from the latent image step by step.


Hi, I'm Sarvagya Agrawal, Software Engineer, with a strong passion for utilizing technology to drive positive change in society. I believe that technology is not just a skill, but an art form that can be leveraged to transform the world.
My primary focus lies in machine learning and web development, with strong programming skills in Python. I have worked on innovative projects, including developing an AI model to calculate cardiovascular risk factors from OCTA scans using computer vision algorithms and creating an AI-based web application for calculating financial risk based on an individual's spending trends.
