We begin our journey into neural generative models with diffusion models. In this exciting project, we leverage a pretrained model from DeepFloyd to generate images from inputs such as text prompts, images, or pure RGB noise. Using the pretrained weights, we manually implement iterative diffusion denoising and Classifier-Free Guidance (CFG), and extend the capabilities of the model to image-to-image translation, hole filling (inpainting), and results similar to those of Visual Anagrams (Geng et al., 2024).
DeepFloyd's Stage 1 takes text prompts as input and generates image outputs; Stage 2 then performs super-resolution on those generated images. (YOUR_SEED = 400)
We notice that the model is able to generate images that are quite close to the text prompt.
However, these images were cherry-picked: the model is not perfect and can generate images that are not quite what we expect. We also notice a roughly proportional relationship between image quality and num_inference_steps.
First, we begin by creating our own forward pass that adds noise to an image. Because we are using a pre-trained model from DeepFloyd, we also need to adhere to two constraints: 1) the non-linear noise schedule used to train the model (via a lookup table found in alphas_cumprod), and 2) the number of discrete steps the model was trained with: 1000 steps (t values of 0-999).
More formally, you aren't simply adding noise to the image, but rather blending between the clean image and pure noise. Unlike previous projects, where we blended images linearly as (A_img * selector) + ((1 - selector) * B_img), here the coefficients are square roots, making the relationship non-linear: (A_img * torch.sqrt(selector)) + (torch.sqrt(1 - selector) * B_img). The deviations of A and B are scaled down by the selector and then added together; 'scaling' here refers not to image size but to individual pixel values. I wasn't sure at first why this non-linear relationship is better; the key is that the squared coefficients sum to 1, so the blend preserves the overall variance at every t.
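A minimal sketch of this forward function, assuming alphas_cumprod is the DeepFloyd lookup table (a length-1000 tensor) and im is an image tensor:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Forward (noising) pass: blend the clean image with Gaussian noise at timestep t."""
    alpha_bar = alphas_cumprod[t]              # the "selector" for this timestep
    eps = torch.randn_like(im)                 # epsilon ~ N(0, 1), same shape as the image
    x_t = torch.sqrt(alpha_bar) * im + torch.sqrt(1 - alpha_bar) * eps
    return x_t, eps
```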
Of note: in the formula, ε ~ N(0, 1) does not mean the noise lies in the range 0-1, but rather that it has unit standard deviation (values generally falling between about -4 and 4). Here are the results from the forward function at the corresponding t values:
Before overcomplicating things, one should always evaluate the simple solution: "Can we remove
the noise with convolved Gaussian filters instead of Machine Learning models?" The short answer is "not well
enough".
We manually tune the kernel size and sigma, trying to strike a balance between removing the noise and not losing too much detail from the original image. Here are the results:
Now let's apply what we learned in Project 2 by using a sharpening filter to bring back detail: we subtract one Gaussian level from another and add the resulting sharpness back into the image:
Unfortunately, the information has been lost and cannot be recovered without some digital creativity. We will explore this in the next section.
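For reference, here is a sketch of this classical baseline (Gaussian blur plus unsharp masking), assuming torchvision and a noisy image tensor im_noisy; the kernel size, sigma, and sharpening strength are illustrative values, not the tuned ones:

```python
import torchvision.transforms.functional as TF

# Gaussian blur baseline with hand-tuned kernel size and sigma.
kernel_size, sigma = 7, 2.0
blurred = TF.gaussian_blur(im_noisy, kernel_size, sigma)

# Unsharp masking: the difference between two blur levels approximates the lost
# high-frequency detail, which we add back on top of the blurred estimate.
softer = TF.gaussian_blur(blurred, kernel_size, sigma)
sharpened = blurred + 1.5 * (blurred - softer)
```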
To solve this problem, we can utilize the unreasonable effectiveness of diffusion models, large-scale datasets, and compute to hallucinate the lost data by projecting the known pixels back onto the manifold of images found in the dataset. In this section we do a single-shot prediction of the noise in the image and compute an estimate of the original clean image:
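A sketch of this single-shot estimate, assuming a predict_noise helper that wraps the Stage 1 UNet:

```python
import torch

def one_step_denoise(x_t, t, alphas_cumprod, predict_noise):
    """Single-shot estimate of the clean image x_0 from a noisy image x_t.

    Inverts the forward blend x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps.
    `predict_noise(x, t)` is a hypothetical wrapper around the Stage 1 UNet call.
    """
    alpha_bar = alphas_cumprod[t]
    eps_hat = predict_noise(x_t, t)            # the model's estimate of the added noise
    return (x_t - torch.sqrt(1 - alpha_bar) * eps_hat) / torch.sqrt(alpha_bar)
```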
The only downside to this single-shot approach is that, as shown in the images above, the greater the noise in the image, the more the model struggles to predict it, which causes blurring in the output. We will tackle this in the next section, where we iteratively reduce the noise over multiple diffusion steps.
Instead of the single-shot approach of predicting the noise, we break this up into iterative denoising of the image. My intuition here is that this architecture induces the model to think in terms of coarse-to-fine detail.
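Here is a sketch of the iterative loop over strided timesteps, using the standard DDPM posterior-mean update (the extra noise term is omitted for brevity; strided_timesteps and predict_noise are assumed helpers):

```python
import torch

def iterative_denoise(x, strided_timesteps, i_start, alphas_cumprod, predict_noise):
    """Iteratively denoise x, starting from index i_start of a strided timestep list
    that runs from the noisiest timestep down toward 0."""
    for i in range(i_start, len(strided_timesteps) - 1):
        t, t_prev = strided_timesteps[i], strided_timesteps[i + 1]
        a_bar, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha = a_bar / a_bar_prev             # effective alpha for this strided step
        beta = 1 - alpha

        eps_hat = predict_noise(x, t)
        x0_hat = (x - torch.sqrt(1 - a_bar) * eps_hat) / torch.sqrt(a_bar)

        # Blend the clean estimate with the current noisy image (coarse-to-fine refinement).
        x = (torch.sqrt(a_bar_prev) * beta / (1 - a_bar) * x0_hat
             + torch.sqrt(alpha) * (1 - a_bar_prev) / (1 - a_bar) * x)
    return x
```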
In the section below, we compare iterative diffusion, single-shot diffusion, and the simple blur approach. We see that the diffusion approaches are generally much better than the blur approach, and further that iterative diffusion restores additional details in the image; for example, the definition of the foliage in the background. Given that the information is actually lost, the restoration process of the iterative diffusion HAS to hallucinate the detail, leading to some creative interpretations of the original image.
Instead of sampling from an existing image, we can hallucinate images from scratch by sampling from pure noise. My initial observation is that the lower the stride, the richer the image; a high stride (fewer total iterative steps) often results in solid black or purple outputs. All images are enhanced using super-resolution.
We enhance the controllability of our generative model via Classifier-Free Guidance (CFG), which gives us a control knob to adjust the influence of the text prompt on the generated image. Given that our model can "see" which part of the noise looks like a given text prompt, we compute the delta between the noise estimates for the user-given text prompt and for an empty text prompt, scale that delta by the cfg_scale, then add it to the unconditional noise estimate.
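As a sketch (with predict_noise again standing in for the UNet wrapper, and the text embeddings precomputed):

```python
def cfg_noise(x_t, t, predict_noise, cond_embeds, uncond_embeds, cfg_scale=7.0):
    """Classifier-Free Guidance: push the noise estimate toward the text prompt.

    `predict_noise(x, t, embeds)` is a hypothetical wrapper around the UNet that
    accepts either the prompt embeddings or the empty-prompt ("") embeddings.
    """
    eps_cond = predict_noise(x_t, t, cond_embeds)      # conditioned on the user prompt
    eps_uncond = predict_noise(x_t, t, uncond_embeds)  # conditioned on the empty prompt
    return eps_uncond + cfg_scale * (eps_cond - eps_uncond)
```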
The truly exciting concept ⭐ is that you can extrapolate outside the manifold of images in the original dataset by choosing a CFG value higher than 1! Here you can see that CFG = 7 is literally giving you other-worldly results: (I guess this alternate world is very purple) 😊
Now that we know how to project images and noise onto the manifold of known images, let's explore the manifold ⭐ a bit more. A simple and philosophically interesting question we can ask is: what are some things that the model thinks are visually similar to my input image?
More technically, we project our input image into the manifold and explore the area around it via two variables (see the sketch after this list): 1) the initial amount of noise in the input image, controlled via i_start, and 2) the number of incremental diffusion steps (which allows the model to walk further away from the original noise).
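A minimal sketch of this image-to-image procedure, reusing the forward and iterative_denoise helpers sketched earlier:

```python
def image_to_image(im, i_start, strided_timesteps, alphas_cumprod, predict_noise):
    """Noise the input image up to level i_start, then let the model denoise it back.

    Lower i_start means starting at a noisier timestep, so the result can drift
    further from the original image.
    """
    t_start = strided_timesteps[i_start]
    x_t, _ = forward(im, t_start, alphas_cumprod)    # push the image off the manifold
    return iterative_denoise(x_t, strided_timesteps, i_start,
                             alphas_cumprod, predict_noise)
```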
The images below show how the model hallucinates images that are visually similar to the input image. In practice we see that the lower the i_start and stride, the further we move from visual similarity to the input image. It is nonetheless beautiful to see what visual associations the model has with the input image.
Let's see more results with some additional input images. (The stride was kept at 30 for all images)
Another fantastic application we can explore is interpolating between two concepts, such as "realistic" and "cartoony". We achieve this by using the text prompt "a high quality photo" (which implies a certain degree of realism) and then using a strong CFG value.
We see an incredible result! In the first row, between i_start values of 4 and 10, we see that the model arrived at a generalization of the 'realism / cartoony' axis very similar to the one in Scott McCloud's book, "Understanding Comics".
Some additional inputs from hand-drawn images. You can see that for the first result the model struggles to project this imaginary creature onto the manifold of known images; the second image is a bit more successful.
Using an input image as control, you can make some super rough visual edits to the image and hallucinate variations. In the first example we add a moon to the image and in the second example we add a river.
An emergent capability of this architecture is "hole filling" masked regions of images, known as inpainting. This is achieved by replacing everything outside the mask with the original image (noised to the current level) at each step of the diffusion, which forces the system to use the context of the original photo and integrate it into the hallucinated parts.
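A sketch of the inpainting loop, again reusing the earlier forward and predict_noise helpers (the mask is 1 where we want to hallucinate and 0 where we keep the original):

```python
import torch

def inpaint(im, mask, strided_timesteps, alphas_cumprod, predict_noise):
    """Fill the region where mask == 1 while pinning the rest to the original image."""
    x = torch.randn_like(im)                         # the hole starts as pure noise
    for i in range(len(strided_timesteps) - 1):
        t, t_prev = strided_timesteps[i], strided_timesteps[i + 1]
        a_bar, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        alpha = a_bar / a_bar_prev

        eps_hat = predict_noise(x, t)
        x0_hat = (x - torch.sqrt(1 - a_bar) * eps_hat) / torch.sqrt(a_bar)
        x = (torch.sqrt(a_bar_prev) * (1 - alpha) / (1 - a_bar) * x0_hat
             + torch.sqrt(alpha) * (1 - a_bar_prev) / (1 - a_bar) * x)

        # Pin the known region: re-noise the original image to the next timestep
        # and copy it into every pixel outside the mask.
        x_known, _ = forward(im, t_prev, alphas_cumprod)
        x = mask * x + (1 - mask) * x_known
    return x
```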
I was naturally curious how the inpainting would work on the texture hybrid images from Project 2. I think most of the images are an improvement! 😊
The other critical degree of controllability for this generative model is the 'language' itself, via the text prompts. We can interpolate between two locations on the manifold: one given by an input image and the other given by a text prompt. One easy way to understand the images below is that when the amount of noise on the original image is high, the result can move closer to the location of the text prompt; the less noise there is, the more it has to stick to what it already has (since it can't dramatically alter the image).
Interpolation between the text prompt: "a pencil" and an image of the Campanile at UC Berkeley.
Interpolation between the text prompt: "a dog" and an image of rocks.
Interpolation between the text prompt: "a dog" and an image of a cat.
Similar to Mexican artist Octavio Ocampo's remarkable surrealist paintings from the 1970s, which reveal different images depending on the viewer's perspective, we will use our diffusion generative model to create dual-meaning illustrations. To achieve this, we will implement techniques from the Visual Anagrams paper (Geng et al., 2024).
In order to achieve this illusion of dual meaning, during each iteration of the diffusion process we follow the steps sketched below:
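A sketch of the per-step recipe from the Visual Anagrams paper, with predict_noise as the usual hypothetical UNet wrapper:

```python
import torch

def anagram_noise(x_t, t, predict_noise, embeds_upright, embeds_flipped):
    """One denoising step's noise estimate for a flip-illusion visual anagram."""
    # 1) Estimate the noise for the upright image under the first prompt.
    eps_1 = predict_noise(x_t, t, embeds_upright)
    # 2) Flip the image upside down, estimate the noise under the second prompt,
    #    then flip that estimate back to the upright orientation.
    eps_2 = torch.flip(predict_noise(torch.flip(x_t, dims=[-2]), t, embeds_flipped),
                       dims=[-2])
    # 3) Average the two estimates and use the result in the usual denoising update.
    return (eps_1 + eps_2) / 2
```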
This surprisingly simple technique yields some incredibly creative and striking results!
"an oil painting of an old man" → "an oil painting of people around a campfire"
"an oil painting of an old man" → "an oil painting of a snowy mountain village"
The first text prompt corresponds to the large image, and the second prompt corresponds to the small image.
Now that we've gotten our feet wet playing with pre-trained models, let's implement a diffusion model from scratch! We'll start with a simple unconditional UNet, then move on to time and class conditioning, and re-implement Classifier-Free Guidance.
We begin by 'diffusing' our original images with a simple noising function: original + (normal noise * sigma):
We train a UNet designed to predict the original image given a noisy image at sigma = 0.5:
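A sketch of this training setup (UnconditionalUNet, dataloader, and num_epochs are placeholder names standing in for this project's model and data):

```python
import torch
import torch.nn.functional as F

model = UnconditionalUNet().cuda()                    # placeholder for the simple UNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
sigma = 0.5

for epoch in range(num_epochs):
    for x, _ in dataloader:                           # class labels are unused here
        x = x.cuda()
        z = x + sigma * torch.randn_like(x)           # noisy input at sigma = 0.5
        loss = F.mse_loss(model(z), x)                # predict the clean image directly
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```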
In the figures below you can see that the prediction capabilities improve with additional epochs:
However, given that this model was trained purely at sigma = 0.5, it does not generalize well to other sigmas:
To boost performance across more sigma values, we train a time-conditioned UNet by noising the inputs to different t levels and telling the model which t level each image in the batch is at. So that this conditioning information is not accidentally lost in the encoding part of the architecture, we inject the conditioning signal on the decoding side (after the unflatten).
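One plausible way this injection could look (a sketch only; FCBlock and the exact injection points are assumptions):

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Small MLP mapping the normalized timestep t (shape [B, 1]) to a per-channel vector."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, t):
        return self.net(t)

# Inside the UNet's forward pass, on the decoder side, the embedded timestep
# modulates the feature maps channel-wise, e.g.:
#   unflat = unflat * self.t_block1(t)[:, :, None, None]   # right after the unflatten
#   up1    = up1    * self.t_block2(t)[:, :, None, None]   # after the first up-block
```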
This time our model's goal isn't to directly predict the original image, but rather to predict the noise added to it. Additionally, as in Part A, we sample the model iteratively to improve the details of the generations.
We continue to improve our model by adding class conditioning. This is done by adding a one-hot encoding of the class to the architecture at the same point where we injected the time conditioning. To encourage the model to also work without the class conditioning, we drop out the class vector with a probability of 0.1. Here we also evaluate the test loss to make sure we are not overfitting.
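A sketch of the one-hot encoding with dropout (num_classes and p_uncond are assumptions matching the text):

```python
import torch
import torch.nn.functional as F

def class_condition(labels, num_classes=10, p_uncond=0.1):
    """One-hot encode the class labels, zeroing them out with probability p_uncond.

    Dropping the conditioning occasionally trains the model to also operate
    unconditionally, which is what makes Classifier-Free Guidance possible later.
    """
    c = F.one_hot(labels, num_classes).float()                  # [B, num_classes]
    keep = (torch.rand(c.shape[0], 1, device=c.device) >= p_uncond).float()
    return c * keep
```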
While this was by far the most challenging project in the course, it was equally the most rewarding. With other projects, you could predict what the output would be, but with this one, the results were always so fun to look at. You truly begin to appreciate the power of neural networks and the potential of generative models. I particularly enjoyed feeding the model noise and seeing it hallucinate images. It makes me think of "Do Androids Dream of Electric Sheep?" by Philip K. Dick.
This was a fantastic forcing function to learn about PyTorch, neural networks, and diffusion models!