Stable diffusion is a powerful generative model for creating high-quality images from noise. It consists of two steps: a forward diffusion process and a reverse diffusion process. In the forward diffusion process, noise is progressively added to an image, gradually degrading its quality. This step is crucial for training, as it teaches the model how images transition from clarity to noise. We covered the details of the forward diffusion process in our previous article.
In reverse diffusion, noise is progressively removed to generate a high-quality image. This article will focus on this process, exploring its mechanisms and mathematical foundations.
The reverse diffusion process aims to convert pure noise into a clean image by iteratively removing noise. Training a diffusion model amounts to learning this reverse process so that the model can reconstruct an image from pure noise. If you are familiar with GANs, this is analogous to training the generator network; the difference is that the diffusion network has an easier job because it doesn't have to do all the work in one step. Instead, it removes a small amount of noise at each of many steps, which is more stable and easier to train, as shown by the authors of this paper.
Many people think that the neural network (confusingly, also called a diffusion model) removes noise from an input image, or that it predicts only the noise to be removed between two adjacent timesteps. Both are incorrect. What the network actually predicts is the entire noise separating the noisy image from the clean one. This means that at timestep t = 600, the model tries to predict the full noise whose removal would take us back to t = 0, not to t = 599.
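A small NumPy sketch can make this concrete. The noise schedule values below are illustrative, and the noise `eps` stands in for what the network is trained to predict; the point is that a perfect prediction of the *entire* noise lets us recover the clean image x0 directly, not just the previous step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear noise schedule over T steps (values are an assumption).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

x0 = rng.standard_normal((8, 8))    # a toy "clean image"
eps = rng.standard_normal((8, 8))   # the total noise the network must predict

t = 600
# Forward process in closed form: x_t mixes x0 with the *entire* noise eps.
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

# If the model predicted eps exactly, we could jump straight back to x0 —
# which is why the prediction target is the full noise, not a one-step increment.
x0_hat = (x_t - np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
print(np.allclose(x0_hat, x0))  # True
```

In practice the sampler does not jump straight to x0; it uses this full-noise prediction to take one small step back to t−1, as described next.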
This is the equation that we took from the paper Denoising Diffusion Probabilistic Models.
It says that p_θ(x_{0:T}) is a chain of Gaussian transitions starting at p(x_T) and applying the single-step transition p_θ(x_{t−1} | x_t) for T iterations: p_θ(x_{0:T}) = p(x_T) ∏_{t=1}^{T} p_θ(x_{t−1} | x_t).
Now let's look at how a single step works and how to turn it into something we can implement.
The single-step transition p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), Σ_θ(x_t, t)) has two parts: the mean μ_θ(x_t, t), which the network parameterizes (in practice via its noise prediction), and the covariance Σ_θ(x_t, t), which in the original DDPM formulation is fixed rather than learned.
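Here is a minimal sketch of one reverse step in NumPy, following the DDPM sampling rule with the fixed-variance choice Σ_θ = β_t·I. The function names and the zero "prediction" in the usage loop are stand-ins, not part of any real library:

```python
import numpy as np

def reverse_step(x_t, eps_pred, t, betas, rng):
    """One DDPM reverse step: sample x_{t-1} from N(mu_theta, beta_t * I)."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    # Mean mu_theta(x_t, t), written in terms of the predicted noise eps_pred.
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_pred) / np.sqrt(alphas[t])
    if t == 0:
        return mean                  # no noise is added at the final step
    sigma = np.sqrt(betas[t])        # fixed variance choice: Sigma_theta = beta_t * I
    return mean + sigma * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)

x = rng.standard_normal((8, 8))      # start from pure noise x_T
for t in range(999, -1, -1):
    eps_pred = np.zeros_like(x)      # stand-in for the trained network's prediction
    x = reverse_step(x, eps_pred, t, betas, rng)
```

With a trained network, `eps_pred` would come from the model at each timestep; the loop structure itself is the whole sampler.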
To know more about the mathematical foundations of the reverse diffusion process refer to this article.
The generation of images using the reverse diffusion process relies highly on how well the model can predict the noise included in the forward diffusion process. This noise prediction capability is developed through a rigorous training process.
The main objective of training the model using reverse diffusion is to predict the noise at each diffusion process step. By minimizing the error between predicted and actual noise, the model learns to denoise the image effectively.
The training data consists of pairs of noisy images and the corresponding noise added at each step during the forward diffusion process. This data is generated by applying the forward diffusion process to a set of clean images, progressively adding noise over multiple steps.
A critical component of the training process is the loss function. The loss function quantifies the difference between predicted and actual noise. One commonly used loss function is the Mean Squared Error (MSE). The model is trained to minimize this MSE loss, thereby improving its ability to predict the noise accurately.
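The data generation and loss described above can be sketched together in a few lines of NumPy. The schedule values and helper names here are illustrative assumptions, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative noise schedule (an assumption, matching the usual DDPM setup).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def make_training_pair(x0, rng):
    """Sample one (noisy image, timestep, target noise) triple via the forward process."""
    t = rng.integers(0, T)
    eps = rng.standard_normal(x0.shape)  # the noise the model must learn to predict
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, t, eps

def mse_loss(eps_pred, eps_true):
    """Mean Squared Error between predicted and actual noise."""
    return float(np.mean((eps_pred - eps_true) ** 2))

x0 = rng.standard_normal((8, 8))          # a toy "clean image"
x_t, t, eps = make_training_pair(x0, rng)
print(mse_loss(eps, eps))                 # a perfect prediction gives 0.0
```

A training loop would feed `(x_t, t)` to the network, compare its output against `eps` with this loss, and backpropagate.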
Convolutional neural networks (CNNs) are the most common type of neural network utilized in the reverse diffusion process for noise prediction. CNNs can capture spatial hierarchies in images, making them well suited to image processing tasks. The architecture may combine multiple convolutional layers, pooling layers, and activation functions to extract and learn complex features from noisy images. There are two common backbone architecture choices for diffusion models: U-Net and Transformer.
After training, the model's performance is assessed on a separate validation dataset that was not used during training. The accuracy of its noise predictions on this set indicates how well it generalizes. Metrics such as mean squared error (MSE), root mean square error (RMSE), mean absolute error (MAE), and R-squared (the coefficient of determination) are commonly used.
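The four metrics mentioned above are standard regression metrics and are straightforward to compute; this is a generic NumPy sketch, with the function name being our own choice:

```python
import numpy as np

def regression_metrics(pred, actual):
    """MSE, RMSE, MAE, and R-squared for noise-prediction quality."""
    err = pred - actual
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((actual - actual.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": float(mse), "rmse": float(rmse), "mae": float(mae), "r2": float(r2)}

actual = np.array([1.0, 2.0, 4.0])
perfect = regression_metrics(actual, actual)
print(perfect["mse"], perfect["r2"])  # 0.0 1.0
```

A perfect predictor scores MSE = 0 and R² = 1; on a validation set, lower MSE/RMSE/MAE and R² closer to 1 indicate better noise prediction.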
Stable diffusion models rely on both the forward and reverse diffusion processes. These processes work together to gradually reduce noise in an image, ultimately producing high-quality results. This iterative refining mechanism is rooted in strong mathematical foundations, making stable diffusion an effective tool in the generative model field. As research in this area progresses, we can anticipate even more advanced applications and developments in this intriguing field.
Ans. In stable diffusion, the reverse diffusion process starts with a noisy image and gradually reduces the noise to produce a high-quality image. It is the opposite of the forward diffusion process, which gradually adds noise to an image.
Ans. The process starts with a noisy image. A neural network estimates the noise at each step, which is then subtracted from the image. This iterative cycle of noise prediction and subtraction continues until a high-quality image is obtained.
Ans. The neural network’s role is to accurately predict the noise at each step of the reverse diffusion process. This prediction is crucial for effectively removing noise and reconstructing the original image.
Ans. The model is trained on pairs of noisy images and the corresponding noise added during the forward diffusion process. The training objective is to minimize the error between predicted and actual noise using a loss function such as Mean Squared Error (MSE).