This project explores diffusion models and sampling from pretrained models like DeepFloyd IF by Stability AI. We explore the denoising capabilities of such models, as well as how to sample finer quality images, image-to-image translations, visual anagrams, and hybrid images. In later parts, we also explore the design and conditioning of such denoising neural nets, based on methods from the paper "Denoising Diffusion Probabilistic Models" (DDPM) by Ho et al.
To begin our foray into diffusion models, we first examine the capabilities of denoising and sampling from pretrained diffusion models.
In diffusion model training, a clean image x_0 is iteratively perturbed, obtaining progressively noisier versions of the image x_t until timestep t = T. The model then tries to reverse this process by predicting the noise in the image at different timesteps and denoising the image. To generate our noisy test input, we take a clean image of the Campanile and apply the forward process x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise, where noise ~ N(0, I) and alpha_bar_t determines how much noise is added at timestep t.
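As a rough sketch of how this forward process might look in PyTorch (tensor names and shapes here are illustrative, not the exact project code):

```python
import torch

def forward_noise(x0, t, alpha_bars):
    """Forward process: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise.

    x0:         clean image batch, shape (B, C, H, W)
    t:          timestep index (int or LongTensor of shape (B,))
    alpha_bars: 1-D tensor of cumulative alpha products, indexed by timestep
    """
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)   # broadcast over image dimensions
    noise = torch.randn_like(x0)              # noise ~ N(0, I)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * noise
    return x_t, noise
```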
Classical denoising methods generally entail Gaussian blur filtering. We can observe improvements in the image at low noise levels, but at higher noise levels, the filter fails to recover any features in the image.
Next, we try single-step denoising by sampling from a pretrained UNet. The UNet used was trained with text conditioning, so we have a corresponding text prompt embedding, "a high quality photo", which guides the model's denoising process. To denoise our image, we pass in our noisy input and get a noise estimate. Reversing our forward process, we get x_0 = (x_t - sqrt(1 - alpha_bar_t) * noise_est) / sqrt(alpha_bar_t). This leaves us with a much cleaner denoised version of the image. However, it is still not perfect, and at higher noise levels we can also observe the structure of the image changing.
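A minimal sketch of this one-step recovery, continuing from the forward-process sketch above (the UNet call signature is illustrative, not the actual DeepFloyd API):

```python
def one_step_denoise(unet, x_t, t, alpha_bars, prompt_embed):
    """Estimate the clean image from a single noise prediction."""
    a_bar = alpha_bars[t]
    noise_est = unet(x_t, t, prompt_embed)    # predicted noise, same shape as x_t
    # invert x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * noise
    x0_est = (x_t - torch.sqrt(1 - a_bar) * noise_est) / torch.sqrt(a_bar)
    return x0_est
```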
Since denoising UNets are trained to denoise iteratively, we next implement an iterative denoising process. This is very similar to single-step denoising, but at each timestep we take a linear interpolation between the current noisy image x_t and the estimated clean image x_0, using the denoising equation derived from the DDPM paper. Here, a much better quality image is finally recovered, although the structure has completely changed.
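One step of this update might look like the sketch below, which uses the posterior-mean interpolation from the DDPM paper (the added variance term is omitted for brevity; x0_est comes from the one-step estimate above, and t_prev is the next, smaller timestep in the strided schedule):

```python
def iterative_denoise_step(x_t, x0_est, t, t_prev, alpha_bars):
    """Interpolate between x_t and the clean-image estimate to get x at t_prev."""
    a_bar_t = alpha_bars[t]
    a_bar_prev = alpha_bars[t_prev]
    alpha = a_bar_t / a_bar_prev   # effective alpha for this stride
    beta = 1 - alpha
    x_prev = (torch.sqrt(a_bar_prev) * beta / (1 - a_bar_t)) * x0_est \
           + (torch.sqrt(alpha) * (1 - a_bar_prev) / (1 - a_bar_t)) * x_t
    return x_prev
```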
Instead of starting with a noised input image, we can also start with an image of pure noise. Applying iterative denoising can allow us to generate interesting images from scratch.
Although our sampled images are interesting, their quality is noticeably low. Using Classifier-Free Diffusion Guidance (CFG), it is possible to increase the quality of our images. We compute both a conditional and an unconditional noise estimate and combine them into a new noise estimate noise_est = noise_uncond + gamma * (noise_cond - noise_uncond), where gamma > 1. Why this works so well is still up for vigorous debate; personally, it seems to use the unconditional noise estimate as a baseline and accentuate the noise features observed in the conditional estimate, allowing for sharper features in the recovered image (similar to how we added gain to high frequency features in project 2 for image sharpening).
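As a small sketch (the UNet signature and the guidance scale value are illustrative):

```python
def cfg_noise_estimate(unet, x_t, t, cond_embed, uncond_embed, gamma=7.0):
    """Classifier-free guidance: extrapolate from the unconditional estimate
    toward the conditional one; gamma > 1 accentuates the conditional direction."""
    noise_cond = unet(x_t, t, cond_embed)
    noise_uncond = unet(x_t, t, uncond_embed)
    return noise_uncond + gamma * (noise_cond - noise_uncond)
```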
One application of our new sampling implementation is editing existing images. By adding noise to our images, the diffusion model is allowed to "hallucinate" new things and force the image back onto a natural manifold, as described by the SDEdit algorithm. We can observe that as the starting t increases, our image gets closer and closer to the original Campanile with a few perturbations here and there.
Similarly, we can use the model to also edit web and hand-drawn images. However, due to my poor drawing skills, most of the generated images, even at later stages, are not very representative of what I had intended them to be.
Another application of this process is inpainting, following the RePaint paper. Here, we generate a mask and have the model fill in the masked region while the surrounding image is forced to remain the same. This allows us to be more creative with our images, while generally adhering to the context of the image.
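A minimal sketch of the per-step constraint, reusing the forward-process sketch from earlier (mask == 1 marks the region the model is free to fill in):

```python
def inpaint_step(x_t, x_orig, mask, t, alpha_bars):
    """After each denoising step, reset pixels outside the mask to the original
    image re-noised to the current timestep, as in RePaint."""
    x_orig_t, _ = forward_noise(x_orig, t, alpha_bars)
    return mask * x_t + (1 - mask) * x_orig_t
```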
One last application that we explored was text-conditional image-to-image translation. By changing the text prompt embedding, our noise estimate is conditioned on more specific prompts, transforming our images towards the text prompts. Here the Campanile and jellyfish are conditioned using "a rocket ship", while the cat is conditioned using "a photo of a dog".
A fun exploration suggested by Geng et al. is using diffusion models to create optical illusions that reveal different images depending on the viewing orientation. This entails denoising the image x_t at each timestep with two text prompts: one applied to the image right side up, and one applied to the image flipped upside down, with that second noise estimate flipped back before combining. We then form a composite noise estimate by averaging the two and proceed iteratively as before. This results in pretty fun images; however, sometimes the text prompts can be incompatible and sample rather nonsensical results.
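A sketch of the composite estimate (again with an illustrative UNet signature):

```python
def anagram_noise_estimate(unet, x_t, t, embed_upright, embed_flipped):
    """Visual anagrams (Geng et al.): denoise the upright image with one prompt
    and the vertically flipped image with the other, flip that estimate back,
    then average the two."""
    noise_up = unet(x_t, t, embed_upright)
    noise_flip = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, embed_flipped), dims=[-2])
    return 0.5 * (noise_up + noise_flip)
```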
Another fun exploration, also suggested by Geng et al., is creating hybrid images: images that look like one text prompt from close up and another from far away. This is achieved by creating a composite noise estimate in which one prompt's noise estimate is passed through a lowpass filter and the other's through a highpass filter before summing. However, similarly, sometimes the text prompts can be incompatible and rarely sample something satisfactory.
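A sketch of this composite (the Gaussian blur parameters are illustrative):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(unet, x_t, t, embed_far, embed_near, kernel_size=33, sigma=2.0):
    """Hybrid images: low frequencies from one prompt (seen from far away) plus
    high frequencies from the other (seen up close)."""
    noise_far = unet(x_t, t, embed_far)
    noise_near = unet(x_t, t, embed_near)
    lowpass = TF.gaussian_blur(noise_far, kernel_size, sigma)
    highpass = noise_near - TF.gaussian_blur(noise_near, kernel_size, sigma)
    return lowpass + highpass
```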
Now that we've explored how to use a denoising UNet, we aim to create and train our own diffusion model on the MNIST dataset.
To begin this process, we first create a dataset and dataloader in which we pair a noised image with its clean version. This way we can compute the mean squared error (MSE) loss between the predicted denoised image and the clean image. We use a simple equation for our noising process: z = x + sigma * noise, where noise ~ N(0, I) and sigma ∈ {0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0}.
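In PyTorch this noising step is a one-liner (a sketch, with x assumed to be a batch of clean MNIST images in [0, 1]):

```python
def noise_mnist(x, sigma):
    """Single-step noising: z = x + sigma * noise, noise ~ N(0, I)."""
    return x + sigma * torch.randn_like(x)
```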
Next, the UNet is constructed following this architecture.
As specified, the UNet is trained on a noisy MNIST dataset with sigma = 0.5 using MSE loss. For the model's hyperparameters: batch size = 256, hidden dimension = 128, 5 epochs, and an Adam optimizer with a learning rate of 1e-4.
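A minimal sketch of this training loop under those hyperparameters (function and variable names are illustrative):

```python
import torch.nn as nn

def train_denoiser(unet, dataloader, sigma=0.5, epochs=5, lr=1e-4, device="cuda"):
    """Train the single-step denoiser with MSE loss between prediction and clean image."""
    opt = torch.optim.Adam(unet.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for epoch in range(epochs):
        for x, _ in dataloader:              # digit labels are unused here
            x = x.to(device)
            z = noise_mnist(x, sigma)        # noise images on the fly each batch
            opt.zero_grad()
            loss = loss_fn(unet(z), x)       # predict the clean image directly
            loss.backward()
            opt.step()
```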
Once training is complete, we can visualize the model's ability to denoise. After the first epoch, we can observe the model still struggles to remove the noise completely from the input image. However, after the fifth epoch, we can observe a significant improvement in our output image quality.
To further evaluate our learned model, we test on data points that are out of the training distribution. We use the entire range of sigmas and pass the various noised images into our model. As observed, the model performs relatively well up until sigma = 0.8, after which the model seems to struggle to recover the original image, and instead "hallucinates" a little.
Instead of a single-step denoiser, we now aim to train a UNet model that iteratively denoises images, similar to the one we sampled from before. This requires a slight change in our loss function: now we take the MSE loss between the predicted noise and the actual noise at each timestep, given an image and time t. We also change our model architecture to include two fully connected blocks that help incorporate the timestep into our prediction output.
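A rough sketch of how such a fully connected block can inject the timestep into a feature map (this shows the general idea, not the exact project architecture; the multiplicative injection and layer sizes are assumptions):

```python
class FCBlock(nn.Module):
    """Maps a normalized timestep t to a per-channel modulation of a feature map."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden_dim), nn.GELU(),
                                 nn.Linear(hidden_dim, hidden_dim))

    def forward(self, t, feat):
        # t: (B, 1) normalized timestep; feat: (B, C, H, W) with C == hidden_dim
        scale = self.net(t).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        return feat * scale
```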
This time, the normal MNIST dataset was used, and when running the DDPM forward algorithm, t was uniformly sampled from the range of timesteps and then used to compute the noisy image via the iterative forward noising process from before. Within this process, a custom DDPM schedule was computed using betas in [0.0001, 0.02], with alphas and alpha_bars calculated accordingly. For the model's hyperparameters: batch size = 128, hidden dimension = 64, 20 epochs, and an Adam optimizer with an initial learning rate of 1e-3. An exponential learning rate decay scheduler with a gamma of 0.1 ** (1.0 / num_epochs) is also used.
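A sketch of this schedule and optimizer setup (the number of timesteps is illustrative; the betas are assumed to be linearly spaced):

```python
def ddpm_schedule(beta1=1e-4, beta2=0.02, num_ts=300):
    """Compute the betas, alphas, and alpha_bars used for noising and sampling."""
    betas = torch.linspace(beta1, beta2, num_ts)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    return betas, alphas, alpha_bars

# Adam with initial lr 1e-3 and exponential decay, matching the hyperparameters above.
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.1 ** (1.0 / 20))
```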
Here are also some visualizations of the model's predictions during the training process.
After training is complete, we use the DDPM sampling algorithm to iteratively denoise an image of pure noise to recover our numbers. At epoch 5, it's still pretty clear that the model hasn't learned the numbers' structures and denoises to more abstract figures. Once epoch 20 finishes, the model does seem to generate more sensible outputs that are clean of noise and other visual artifacts.
To improve upon our time-conditioned UNet, we further add two more fully connected blocks to incorporate the class (i.e. the digit) that the image corresponds to. This helps guide the denoiser towards a specified noise pattern, similar to the text embeddings from the earlier sections. We also drop the class conditioning 10% of the time during training so that the UNet is still able to learn noise estimates without class conditioning. In both the time-conditioned and class-conditioned UNets, we can observe the loss bouncing around an asymptote in later iterations, which is expected given that the noise is randomly sampled. However, the class-conditioned UNet is able to achieve lower losses in later iterations, likely thanks to the help of class conditioning.
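A sketch of that conditioning dropout, assuming the class is fed to the UNet as a one-hot vector:

```python
def class_condition(labels, num_classes=10, p_uncond=0.1):
    """One-hot encode the digit class and zero it out 10% of the time so the
    UNet also learns unconditional noise estimates (needed later for CFG)."""
    c = torch.nn.functional.one_hot(labels, num_classes).float()   # (B, 10)
    keep = (torch.rand(labels.shape[0], 1, device=labels.device) > p_uncond).float()
    return c * keep
```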
Here are also some visualizations of the model's predictions during the training process.
After the training is complete, it's time to test our sampling results again. This time, however, the sampling algorithm is changed to use CFG, producing an unconditional and a conditional noise estimate before generating a final composite noise estimate. In this way, we can observe that even at epoch 5, our model is already generating close to perfect images of each number, which are further refined after epoch 20.