This project explores diffusion models and sampling from pretrained models like DeepFloyd IF by Stability AI. We explore the denoising capabilities of such models, as well as how to sample finer quality images, image-to-image translations, visual anagrams, and hybrid images. In later parts, we also explore the design and conditioning of such denoising neural nets, based on methods from the paper "Denoising Diffusion Probabilistic Models" (DDPM) by Ho et al.
To begin our foray into diffusion models, we first examine the capabilities of denoising and sampling from pretrained diffusion models.
In diffusion model training, a clean image x_0 is iteratively perturbed, obtaining progressively noisier versions of the image x_t until timestep t = T. The model then tries to reverse this process by predicting the noise in the image at different timesteps and denoising the image. To generate our noisy test input, we take a clean image of the Campanile and apply the forward process x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, where noise ~ N(0, I) and alpha_bar_t determines how much noise is added at timestep t.
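As a rough sketch of how this forward process might look in PyTorch (tensor names and shapes here are illustrative, not the exact project code):

```python
import torch

def forward_noise(x0, t, alpha_bars):
    """Forward process: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise.

    x0:         clean image batch, shape (B, C, H, W)
    t:          timestep index (int or LongTensor of shape (B,))
    alpha_bars: 1-D tensor of cumulative alpha products, indexed by timestep
    """
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)   # broadcast over image dimensions
    noise = torch.randn_like(x0)              # noise ~ N(0, I)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * noise
    return x_t, noise
```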
Classical denoising methods generally entail Gaussian blur filtering. We can observe improvements in the image at low noise levels, but at higher noise levels, the filter fails to recover any features in the image.
Next, we try single-step denoising by sampling from a pretrained UNet. The UNet used was trained with text conditioning, so we have a corresponding text prompt embedding, "a high quality photo", which guides the model's denoising process. To denoise our image, we pass in our noisy input and get a noise estimate. Reversing our forward process, we get x_0 = (x_t - sqrt(1 - alpha_bar_t) * noise_est) / sqrt(alpha_bar_t). This leaves us with a much cleaner denoised version of the image. However, it is still not perfect, and at higher noise levels we can also observe the structure of the image changing.
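A minimal sketch of this one-step recovery, continuing from the forward-process sketch above (the UNet call signature is illustrative, not the actual DeepFloyd API):

```python
def one_step_denoise(unet, x_t, t, alpha_bars, prompt_embed):
    """Estimate the clean image from a single noise prediction."""
    a_bar = alpha_bars[t]
    noise_est = unet(x_t, t, prompt_embed)    # predicted noise, same shape as x_t
    # invert x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * noise
    x0_est = (x_t - torch.sqrt(1 - a_bar) * noise_est) / torch.sqrt(a_bar)
    return x0_est
```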
Since denoising UNets are trained to denoise iteratively, we next implement an iterative denoising process. This is very similar to single-step denoising, but at each timestep we take a linear interpolation between the current noisy image x_t and the estimated clean image x_0, using the denoising equation derived from the DDPM paper. Here, a much better quality image is finally recovered, although the structure has completely changed.
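One step of this update might look like the sketch below, which uses the posterior-mean interpolation from the DDPM paper (the added variance term is omitted for brevity; x0_est comes from the one-step estimate above, and t_prev is the next, smaller timestep in the strided schedule):

```python
def iterative_denoise_step(x_t, x0_est, t, t_prev, alpha_bars):
    """Interpolate between x_t and the clean-image estimate to get x at t_prev."""
    a_bar_t = alpha_bars[t]
    a_bar_prev = alpha_bars[t_prev]
    alpha = a_bar_t / a_bar_prev   # effective alpha for this stride
    beta = 1 - alpha
    x_prev = (torch.sqrt(a_bar_prev) * beta / (1 - a_bar_t)) * x0_est \
           + (torch.sqrt(alpha) * (1 - a_bar_prev) / (1 - a_bar_t)) * x_t
    return x_prev
```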
Instead of starting with a noised input image, we can also start with an image of pure noise. Applying iterative denoising can allow us to generate interesting images from scratch.
Although our sampled images are interesting, their quality is noticeably low. Using Classifier-Free Diffusion Guidance (CFG), it is possible to increase the quality of our images. We compute both a conditional and an unconditional noise estimate and combine them into a new noise estimate noise_est = noise_uncond + gamma * (noise_cond - noise_uncond), where gamma > 1. Why this works so well is still up for vigorous debate; personally, it seems to use the unconditional noise estimate as a baseline and accentuate the noise features observed in the conditional estimate, allowing for sharper features in the recovered image (similar to how we added gain to high frequency features in project 2 for image sharpening).
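As a small sketch (the UNet signature and the guidance scale value are illustrative):

```python
def cfg_noise_estimate(unet, x_t, t, cond_embed, uncond_embed, gamma=7.0):
    """Classifier-free guidance: extrapolate from the unconditional estimate
    toward the conditional one; gamma > 1 accentuates the conditional direction."""
    noise_cond = unet(x_t, t, cond_embed)
    noise_uncond = unet(x_t, t, uncond_embed)
    return noise_uncond + gamma * (noise_cond - noise_uncond)
```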
One application of our new sampling implementation is editing existing images. By adding noise to our images, the diffusion model is allowed to "hallucinate" new things and force the image back onto a natural manifold, as described by the SDEdit algorithm. We can observe that as the starting t increases, our image gets closer and closer to the original Campanile with a few perturbations here and there.
Similarly, we can use the model to also edit web and hand-drawn images. However, due to my poor drawing skills, most of the generated images, even at later stages, are not very representative of what I had intended them to be.
Another application of this process is inpainting, following the RePaint paper. Here, we generate a mask and have the model fill in the masked region while the surrounding image is forced to remain the same. This allows us to be more creative with our images, while generally adhering to the context of the image.
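A minimal sketch of the per-step constraint, reusing the forward-process sketch from earlier (mask == 1 marks the region the model is free to fill in):

```python
def inpaint_step(x_t, x_orig, mask, t, alpha_bars):
    """After each denoising step, reset pixels outside the mask to the original
    image re-noised to the current timestep, as in RePaint."""
    x_orig_t, _ = forward_noise(x_orig, t, alpha_bars)
    return mask * x_t + (1 - mask) * x_orig_t
```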
One last application that we explored was text-conditional image-to-image translation. By changing the text prompt embedding, our noise estimate is conditioned on more specific prompts, transforming our images towards the text prompts. Here the Campanile and jellyfish are conditioned using "a rocket ship", while the cat is conditioned using "a photo of a dog".
A fun exploration suggested by Geng et al. is using diffusion models to create optical illusions that reveal different images depending on the viewing orientation. This entails denoising the image x_t at each timestep with two text prompts: one applied to the image right side up, and one applied to the image flipped upside down, with that second noise estimate flipped back before combining. We then form a composite noise estimate by averaging the two and proceed iteratively as before. This results in pretty fun images; however, sometimes the text prompts can be incompatible and sample rather nonsensical results.
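A sketch of the composite estimate (again with an illustrative UNet signature):

```python
def anagram_noise_estimate(unet, x_t, t, embed_upright, embed_flipped):
    """Visual anagrams (Geng et al.): denoise the upright image with one prompt
    and the vertically flipped image with the other, flip that estimate back,
    then average the two."""
    noise_up = unet(x_t, t, embed_upright)
    noise_flip = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, embed_flipped), dims=[-2])
    return 0.5 * (noise_up + noise_flip)
```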
Another fun exploration, also suggested by Geng et al., is creating hybrid images: images that look like one text prompt from close up and another from far away. This is achieved by creating a composite noise estimate in which one prompt's noise estimate is passed through a lowpass filter and the other's through a highpass filter before summing. However, similarly, sometimes the text prompts can be incompatible and rarely sample something satisfactory.
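A sketch of this composite (the Gaussian blur parameters are illustrative):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, x_t, t, embed_far, embed_near, kernel_size=33, sigma=2.0):
    """Hybrid images: low frequencies from one prompt (seen from far away) plus
    high frequencies from the other (seen up close)."""
    noise_far = unet(x_t, t, embed_far)
    noise_near = unet(x_t, t, embed_near)
    lowpass = TF.gaussian_blur(noise_far, kernel_size, sigma)
    highpass = noise_near - TF.gaussian_blur(noise_near, kernel_size, sigma)
    return lowpass + highpass
```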
Now that we've explored how to use a denoising UNet, we aim to create and train our own diffusion model on the MNIST dataset.
To begin this process, we first create a dataset and dataloader in which we pair a noised image with its clean version. This way we can compute the mean squared error (MSE) loss between the predicted denoised image and the clean image. We use a simple equation for our noising process: z = x + sigma * noise, where noise ~ N(0, I) and sigma ∈ {0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0}.
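In PyTorch this noising step is a one-liner (a sketch, with x assumed to be a batch of clean MNIST images in [0, 1]):

```python
def noise_mnist(x, sigma):
    """Single-step noising: z = x + sigma * noise, noise ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)
```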
Next, the UNet is constructed following this architecture.
As specified, the UNet is trained on a noisy MNIST dataset with sigma = 0.5 using MSE loss. For the model's hyperparameters: batch size = 256, hidden dimension = 128, 5 epochs, and an Adam optimizer with a learning rate of 1e-4.
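A minimal sketch of this training loop under those hyperparameters (function and variable names are illustrative):

```python
import torch.nn as nn

def train_denoiser(unet, dataloader, sigma=0.5, epochs=5, lr=1e-4, device="cuda"):
    """Train the single-step denoiser with MSE loss between prediction and clean image."""
    opt = torch.optim.Adam(unet.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for epoch in range(epochs):
        for x, _ in dataloader:              # digit labels are unused here
            x = x.to(device)
            z = noise_mnist(x, sigma)        # noise images on the fly each batch
            opt.zero_grad()
            loss = loss_fn(unet(z), x)       # predict the clean image directly
            loss.backward()
            opt.step()
```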
Once training is complete, we can visualize the model's ability to denoise. After the first epoch, we can observe the model still struggles to remove the noise completely from the input image. However, after the fifth epoch, we can observe a significant improvement in our output image quality.
To further evaluate our learned model, we test on data points that are out of the training distribution. We use the entire range of sigmas and pass the various noised images into our model. As observed, the model performs relatively well up until sigma = 0.8, after which the model seems to struggle to recover the original image, and instead "hallucinates" a little.
Instead of a single-step denoiser, we now aim to train a UNet model that iteratively denoises images, similar to the one we sampled from before. This requires a slight change in our loss function: now we take the MSE loss between the predicted noise and the actual noise at each timestep, given an image and time t. We also change our model architecture to include two fully connected blocks that help incorporate the timestep into our prediction output.
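A rough sketch of how such a fully connected block can inject the timestep into a feature map (this shows the general idea, not the exact project architecture; the multiplicative injection and layer sizes are assumptions):

```python
class FCBlock(nn.Module):
    """Maps a normalized timestep t to a per-channel modulation of a feature map."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden_dim), nn.GELU(),
                                 nn.Linear(hidden_dim, hidden_dim))

    def forward(self, t, feat):
        # t: (B, 1) normalized timestep; feat: (B, C, H, W) with C == hidden_dim
        scale = self.net(t).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        return feat * scale
```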
This time, the normal MNIST dataset was used, and when running the DDPM forward algorithm, t was uniformly sampled from the range of timesteps and then used to compute the noisy image via the iterative forward noising process from before. Within this process, a custom DDPM schedule was computed using betas in [0.0001, 0.02], with alphas and alpha_bars calculated accordingly. For the model's hyperparameters: batch size = 128, hidden dimension = 64, 20 epochs, and an Adam optimizer with an initial learning rate of 1e-3. An exponential learning rate decay scheduler with a gamma of 0.1 ** (1.0 / num_epochs) is also used.
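A sketch of this schedule and optimizer setup (the number of timesteps is illustrative; the betas are assumed to be linearly spaced):

```python
def ddpm_schedule(beta1=1e-4, beta2=0.02, num_ts=300):
    """Compute the betas, alphas, and alpha_bars used for noising and sampling."""
    betas = torch.linspace(beta1, beta2, num_ts)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    return betas, alphas, alpha_bars

# Adam with initial lr 1e-3 and exponential decay, matching the hyperparameters above.
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.1 ** (1.0 / 20))
```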
Here are also some visualizations of the model's predictions during the training process.
After training is complete, we use the DDPM sampling algorithm to iteratively denoise an image of pure noise to recover our numbers. At epoch 5, it's still pretty clear that the model hasn't learned the numbers' structures and denoises to more abstract figures. Once epoch 20 finishes, the model does seem to generate more sensible outputs that are clean of noise and other visual artifacts.
To improve upon our time-conditioned UNet, we further add two more fully connected blocks to incorporate the class (i.e. the digit) that the image corresponds to. This helps guide the denoiser towards a specified noise pattern, similar to the text embeddings from the earlier sections. We also drop the class conditioning 10% of the time during training so that the UNet is still able to learn noise estimates without class conditioning. In both the time-conditioned and class-conditioned UNets, we can observe the loss bouncing around an asymptote in later iterations, which is expected given that the noise is randomly sampled. However, the class-conditioned UNet is able to achieve lower losses in later iterations, likely thanks to the help of class conditioning.
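A sketch of that conditioning dropout, assuming the class is fed to the UNet as a one-hot vector:

```python
def class_condition(labels, num_classes=10, p_uncond=0.1):
    """One-hot encode the digit class and zero it out 10% of the time so the
    UNet also learns unconditional noise estimates (needed later for CFG)."""
    c = torch.nn.functional.one_hot(labels, num_classes).float()   # (B, 10)
    keep = (torch.rand(labels.shape[0], 1, device=labels.device) > p_uncond).float()
    return c * keep
```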
Here are also some visualizations of the model's predictions during the training process.
After the training is complete, it's time to test our sampling results again. This time, however, the sampling algorithm is changed to use CFG, producing an unconditional and a conditional noise estimate before generating a final composite noise estimate. In this way, we can observe that even at epoch 5, our model is already generating close to perfect images of each number, which are further refined after epoch 20.