Stability AI has been at the forefront of developing open-source diffusion models like Stable Diffusion and Stable Diffusion XL, which have revolutionized the field of text-to-image generation. Now that field gets a major upgrade with the arrival of Stable Diffusion XL Turbo (SDXL Turbo for short). This model from Stability AI promises lightning-fast image creation, pushing the boundaries of real-time generation. In this article, we will walk through the process of setting up and working with this model.
Stable Diffusion is a powerful text-to-image model built on a diffusion process: during training, noise is gradually added to an image and the model learns to reverse that corruption. At generation time, it starts from pure noise and progressively removes it, guided by a Text Prompt, until an image matching the description emerges. Diffusion models like Stable Diffusion XL have several advantages over older-generation methods, including high-quality image outputs, detailed control, and diverse artistic styles.
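To make the forward (noising) half of this process concrete, here is a minimal PyTorch sketch. The alpha_bar_t values and tensor shape are purely illustrative, not a real noise schedule:

import torch

# Forward diffusion sketch: mix a clean image with Gaussian noise.
# alpha_bar_t near 1 keeps most of the image; near 0 leaves mostly noise.
def add_noise(x0: torch.Tensor, alpha_bar_t: float) -> torch.Tensor:
    noise = torch.randn_like(x0)  # pure Gaussian noise
    return (alpha_bar_t ** 0.5) * x0 + ((1 - alpha_bar_t) ** 0.5) * noise

x0 = torch.rand(3, 64, 64)             # a dummy 64x64 RGB "image"
slightly_noisy = add_noise(x0, 0.9)    # early timestep: mostly image
mostly_noise = add_noise(x0, 0.1)      # late timestep: mostly noise

The reverse process, which the model learns, runs in the opposite direction: predicting and removing the noise step by step.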
But the problem comes during the generation process. The time it takes to generate these high-quality images is the main drawback: Stable Diffusion and Stable Diffusion XL typically need somewhere between 20 and 60 denoising iterations to produce good-quality images. A lot of research has therefore gone into reducing generation time, and out of that effort Stability AI came up with SDXL Turbo.
SDXL Turbo is a distilled version of Stable Diffusion XL, built using a novel method called Adversarial Diffusion Distillation (ADD). ADD "tunes" the model for faster inference, drastically reducing image generation time. Unlike traditional Stable Diffusion, which requires tens of steps to produce a high-quality image, SDXL Turbo can achieve similar results in just one to four steps, and it can produce good-quality images even in a single iteration. This translates to real-time image generation, opening up a world of creative possibilities.
Adversarial Diffusion Distillation involves three networks: an ADD-Student, a Discriminator, and a DM-Teacher (Diffusion Model Teacher). First, a real image is corrupted into a noisy image. The ADD-Student then takes in this noisy image and tries to denoise it into a good-quality image in just four diffusion steps. The Discriminator then tries to distinguish between the real image and the image produced by the student, judging whether it is real or fake.
In this process, the student tries to optimize two losses. One is the adversarial loss: it tries to fool the Discriminator by generating images that look like the originals. The other is the distillation loss: the student tries to achieve results comparable to those of the DM-Teacher. Here knowledge is distilled from the DM-Teacher to the ADD-Student, with the student using the teacher's denoised output as its prediction target to decrease the distillation loss. This way, the student learns to generate good-quality images in just a few steps.
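As a rough illustration, here is a highly simplified sketch of one ADD training step. The student, teacher, and discriminator networks are assumed to already exist, and the real method involves additional details (score-distillation weighting, discriminator conditioning) omitted here:

import torch
import torch.nn.functional as F

def add_training_step(student, teacher, discriminator, real_image, sigma_t):
    # Corrupt a real image with Gaussian noise at level sigma_t
    noisy = real_image + sigma_t * torch.randn_like(real_image)

    # The ADD-Student denoises the corrupted image
    student_out = student(noisy, sigma_t)

    # Adversarial loss: fool the discriminator into rating the output as real
    adv_loss = -discriminator(student_out).mean()

    # Distillation loss: match the frozen DM-Teacher's denoised prediction
    with torch.no_grad():
        teacher_out = teacher(noisy, sigma_t)
    distill_loss = F.mse_loss(student_out, teacher_out)

    # The student minimizes both losses together
    return adv_loss + distill_loss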
In this section, we will look at how to get started with SDXL Turbo through Hugging Face. First, we install the necessary libraries.
!pip install diffusers transformers accelerate
# Import the necessary libraries
from diffusers import AutoPipelineForText2Image
import torch
# Load the pre-trained text-to-image diffusion model
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo",    # Model name on the Hugging Face Hub
    torch_dtype=torch.float16,   # Half-precision floating point for efficiency
    variant="fp16"               # Download the half-precision weights variant
)
# Move the model to the GPU for faster processing
pipe.to("cuda")
Now we have successfully downloaded the model and moved it to the GPU. Next, we will give it a Prompt and observe the generated image.
# Define the prompt text for image generation
prompt = "A cinematic shot of a kitten walking down a lush green forest on \
a broad daylight"
# Generate the image using the model
image = pipe(
    prompt=prompt,           # Pass the prompt to the model
    num_inference_steps=1,   # A single diffusion step
    guidance_scale=0.0       # SDXL Turbo runs without classifier-free guidance
).images[0]                  # Access the first generated image
image
The image was generated in only a single inference step, and the generation took less than a second. This takes the existing SD and SDXL models to the next level; they usually take many seconds, and sometimes even minutes, to generate images. Even at this speed, the generated image quality is good.
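To verify the speed on our own hardware, we can wrap the call in a simple timer, reusing the pipe and prompt defined above:

import time

start = time.perf_counter()
image = pipe(prompt=prompt, num_inference_steps=1, guidance_scale=0.0).images[0]
print(f"Generated in {time.perf_counter() - start:.2f} seconds")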
Sometimes, the generated image can be distorted or of poor quality. Think of a generated image containing a human with three eyes, three legs, or more than five fingers. This is not the image we want to generate, so for these cases we provide Negative Prompts, which describe features the model should avoid. Here is an example of such an unusual generated image.
Also, by default, the generated images are of size 512×512. We can generate higher-resolution images by passing a width and height to the pipeline. Now let's try adding a Negative Prompt and changing the resolution of the generated image.
# Define prompts for image generation
prompt = "A cinematic close-up shot of astronauts walking stepping \
down the spacecraft on Mars."
negative_prompt = "blurry image, distorted image, people, triple hands"
# Generate the image using the model, incorporating desired features
image = pipe(
    prompt=prompt,                    # Main prompt guiding image creation
    negative_prompt=negative_prompt,  # Elements to avoid in the image
    num_inference_steps=4,            # Four diffusion steps
    guidance_scale=0.0,               # SDXL Turbo runs without classifier-free guidance
    width=1024,                       # Image width in pixels
    height=1024                       # Image height in pixels
).images[0]                           # Access the first generated image
image
The image generated for the above Prompt can be seen below.
Compared to the first image, here we don't see distortions or an unusual number of body parts. The image closely follows the text we provided to SDXL Turbo.
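If we want to keep any of these results, the pipeline returns standard PIL images, so saving one to disk is a single line (the filename here is arbitrary):

image.save("sdxl_turbo_astronauts.png")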
The potential applications and use cases of SDXL Turbo include:
Gone are the days of painstaking changes in design software. With SDXL Turbo, visuals can be crafted in real time. Imagine sketching out a concept for a science-fiction film and instantly conjuring shimmering alien cities or spaceships bursting with neon, guided by your every descriptive whim. We can leverage it to create vibrant posters from just a few phrases.
We can say goodbye to the frustration of slow, clunky prototyping tools. SDXL Turbo lets our ideas materialize at the speed of imagination, which helps in brainstorming: write down a few lines about a product's features and watch SDXL Turbo create breathtaking mockups with realistic user interfaces.
It will also be helpful in physical product design. Simply describe the shape, materials, and functionality, and witness a virtual prototype materialize before our eyes, ready for immediate tweaks and changes. With SDXL Turbo, iteration cycles become lightning-fast, reducing design time and propelling projects from concept to reality in record time.
We can captivate an audience like never before with presentations that go beyond simple static slides. SDXL Turbo transforms our stories into living images, generated from our Prompts and audience interaction. Imagine telling a story and watching the scenes change with every word we speak. With SDXL Turbo, presentations become immersive journeys, leaving listeners spellbound.
SDXL Turbo marks a thrilling evolution in the realm of text-to-image generation, paving the way for artists and creators to materialize their visions with unprecedented speed. While it does not yet match the intricate detail of slower diffusion models, its real-time capabilities unlock a myriad of possibilities for rapid prototyping, collaborative exploration, and captivating live performances. In this article, we have taken a practical look at how to get started with SDXL Turbo.
A. Diffusion models are text-to-image models trained by gradually adding noise to images and learning to reverse it; at generation time they start from noise and remove it, guided by Text Prompts, to produce high-quality outputs.
A. SDXL Turbo addresses slow image generation in diffusion models by using Adversarial Diffusion Distillation for real-time results.
A. SDXL Turbo is created through Adversarial Diffusion Distillation, allowing it to produce good results in as few as one to four steps, compared to the many steps traditional models need.
A. Yes, SDXL Turbo gives us the option to edit the Image Size and lets us provide Negative Prompts to avoid distortions or unwanted features in generated images.
A. SDXL Turbo is readily available on Hugging Face. We can use the existing diffusers library from Hugging Face to download and run the SDXL Turbo model.
A. SDXL Turbo may have limitations in handling highly complex image details compared to slower models, and the quality of its generated images is somewhat lower than that of the full SDXL models.