By Gopi Krishna Tummala
Image Diffusion Models: From U-Net to DiT
Diffusion models revolutionized image generation by learning to reverse a noise process. The journey from U-Net-based architectures to Transformer-based models (DiT) represents a fundamental shift in how we approach generative modeling.
The U-Net Era: Convolutional Foundations
Early diffusion models like DDPM and Stable Diffusion used U-Net architectures. But why is the U-Net architecture ideal for denoising?
Why U-Net for Denoising?
Denoising requires solving a multiscale problem:
- Global Structure Recognition (Encoder): The model must understand the high-level content — “Is this a cat or a car?” This requires compression through downsampling layers to capture semantic structure.
- Fine Detail Reconstruction (Decoder + Skip Connections): The model must remove noise pixel-by-pixel while preserving sharp edges and textures. This requires high-resolution detail that would be lost during compression.
The U-Net Solution:
- Encoder (Downsampling Path): Compresses the image through convolutional layers, capturing global structure and context. Think of it as “zooming out” to see the big picture.
- Decoder (Upsampling Path): Reconstructs the image at full resolution, using the learned global structure to guide denoising.
- Skip Connections: These are the critical innovation — they carry fine-grained details directly from encoder to decoder, bypassing the compression bottleneck. Like a “highway” that preserves pixel-level information.
Visual Analogy for Skip Connections:
Imagine restoring a damaged painting:
- The encoder is like stepping back to see the overall composition (global structure)
- The decoder is like zooming in to fix individual brushstrokes (local details)
- Skip connections are like having a reference photo at full resolution — you can always check the original fine details without losing them through compression
Without skip connections, the network would lose high-frequency details during compression. Without downsampling, it couldn’t capture the semantic content needed to guide denoising.
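To make this concrete, here is a minimal denoising U-Net sketch in PyTorch. It is an illustrative toy, not the DDPM architecture: the channel widths and block counts are arbitrary, and the timestep conditioning a real diffusion U-Net needs is omitted for brevity.

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    # Two 3x3 convs: the workhorse unit at each resolution level
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.SiLU(),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.SiLU(),
    )

class TinyUNet(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.enc1 = block(3, ch)           # full resolution
        self.enc2 = block(ch, 2 * ch)      # 1/2 resolution (after pooling)
        self.mid = block(2 * ch, 2 * ch)   # bottleneck: global structure
        self.dec2 = block(4 * ch, ch)      # input = upsampled mid + skip from enc2
        self.dec1 = block(2 * ch, ch)      # input = upsampled dec2 + skip from enc1
        self.out = nn.Conv2d(ch, 3, 1)     # predict the noise to remove
        self.down = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x):
        s1 = self.enc1(x)                  # skip 1: pixel-level detail
        s2 = self.enc2(self.down(s1))      # skip 2: mid-level features
        h = self.mid(self.down(s2))        # compressed, "zoomed out" view
        h = self.dec2(torch.cat([self.up(h), s2], dim=1))  # skip connection 2
        h = self.dec1(torch.cat([self.up(h), s1], dim=1))  # skip connection 1
        return self.out(h)

noisy = torch.randn(1, 3, 64, 64)
print(TinyUNet()(noisy).shape)  # torch.Size([1, 3, 64, 64])
```

Note how the decoder concatenates the stored encoder activations (`s1`, `s2`) at each level: that concatenation is the skip-connection “highway” described above.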
The forward diffusion process gradually corrupts an image with Gaussian noise:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$

which admits the closed form $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ and $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$.

The reverse denoising process:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)$$

where $\mu_\theta$ is parameterized by a U-Net $\epsilon_\theta$ that predicts the noise to remove, trained with the simplified loss:

$$L_{\text{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\Big[\,\big\|\epsilon - \epsilon_\theta(x_t, t)\big\|^2\,\Big]$$
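A minimal sketch of the corresponding training step, assuming a `model` with signature `model(x_t, t)` that predicts the noise (the toy U-Net above would need timestep conditioning added to fit this interface):

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule from DDPM
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t

def ddpm_loss(model, x0):
    """One training step of the simplified DDPM objective."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                    # random timestep per sample
    eps = torch.randn_like(x0)                       # target noise
    ab = alpha_bar[t].view(b, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps     # closed-form forward process
    return F.mse_loss(model(x_t, t), eps)            # ||eps - eps_theta(x_t, t)||^2
```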
The DiT Revolution: Scalable Transformers
Diffusion Transformers (DiT) (Peebles & Xie, 2023) replaced U-Nets with Vision Transformers, enabling better scaling:
- Patch Embedding: Image is split into patches (e.g., 16×16 pixels)
- Transformer Blocks: Self-attention processes all patches globally
- Conditioning: The original DiT injects class and timestep information via adaptive layer norm (adaLN); text-to-image successors (e.g., PixArt-α, Stable Diffusion 3) add cross-attention or joint attention with text embeddings
The key advantage: global receptive field from the start, rather than building it through stacked convolutions.
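A minimal sketch of one DiT-style block with adaLN conditioning. This is a simplification of the paper's adaLN-Zero design; the dimensions and the single `cond` vector are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """One Transformer block whose norms are modulated by the conditioning signal."""
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # adaLN: the timestep/class embedding predicts per-block scales and shifts
        self.ada = nn.Linear(dim, 4 * dim)

    def forward(self, x, cond):
        # x: (batch, num_patches, dim) patch tokens; cond: (batch, dim) embedding
        s1, b1, s2, b2 = self.ada(cond).unsqueeze(1).chunk(4, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + self.attn(h, h, h, need_weights=False)[0]  # global self-attention
        h = self.norm2(x) * (1 + s2) + b2
        return x + self.mlp(h)
```

Every patch attends to every other patch in the `self.attn` call, which is the single-layer global receptive field discussed above.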
Patches as Tokens: Treating Images Like Language
The Paradigm Shift: Older models (U-Nets) looked at images as a grid of pixels to be convolved. Newer models (Sora, Veo, DiT) treat images like language.
The Analogy:
- U-Nets: Like reading a book word-by-word, but only seeing nearby words. You struggle to connect a character mentioned on page 1 with their action on page 100.
- DiT: Like having the entire book laid out as a single sentence. You can see all “words” (patches) at once and understand how they relate across the entire image.
Why This Matters:
U-Nets struggle to “see” things that are far apart in an image. For example:
- A hand on the left side of an image matching a foot on the right (same person, same pose)
- A shadow on the ground matching the object casting it
- Text in one corner matching a logo in another corner
The Solution (Patches as Tokens):
We chop the image into little squares called patches (like puzzle pieces) and treat these patches exactly like words in a sentence (tokens). This lets the model use Transformers, the architecture behind ChatGPT, as shown in the sketch below.
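Here is what that patchify step looks like in plain PyTorch, assuming 16×16 patches:

```python
import torch

def patchify(img, p=16):
    """Split a (C, H, W) image into a (num_patches, C*p*p) token sequence."""
    c, h, w = img.shape
    assert h % p == 0 and w % p == 0
    img = img.reshape(c, h // p, p, w // p, p)   # carve the patch grid
    img = img.permute(1, 3, 0, 2, 4)             # (grid_h, grid_w, c, p, p)
    return img.reshape(-1, c * p * p)            # flatten each patch to a token

tokens = patchify(torch.randn(3, 256, 256))
print(tokens.shape)  # torch.Size([256, 768]): a 16x16 grid of "words"
```

In a real DiT, a learned linear projection then maps each flattened patch to the model dimension and adds positional embeddings before the Transformer blocks.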
The Benefit: Suddenly, the model understands “context” across the entire image/video at once. This leads to:
- Better consistency in physics (objects don’t randomly change)
- Object permanence (things stay the same across frames)
- Global understanding (the model “sees” the whole scene at once)
This is why modern models (Sora, Veo, Stable Diffusion 3) use DiT architectures — they can maintain consistency across large spatial and temporal scales.
Why Transformers Scale Better: Resolution and Computation
Transformers excel at high-resolution image generation, where U-Nets struggle:
The Scaling Problem:
- U-Nets: A global receptive field must be built up through many stacked convolutions or aggressive downsampling. For large images this means many layers, making training and inference expensive.
- Transformers: Self-attention gives every patch access to every other patch in a single layer, regardless of image size. The cost is attention that grows quadratically with the number of patches, but patchification and latent-space diffusion keep token counts manageable.
Resolution Advantage:
- At low resolutions: U-Nets work well, but Transformers are competitive
- At moderate resolutions: Transformers become significantly more efficient
- At high resolutions: Transformers are the clear choice — U-Nets become computationally prohibitive
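A back-of-the-envelope comparison of the two growth patterns (illustrative arithmetic; assumes 3×3 convs without downsampling, an 8× VAE downsample, and 2×2 latent patches in the style of Stable Diffusion 3):

```python
# The receptive field of stacked 3x3 convs grows by 2 pixels per layer, so
# spanning an N-pixel-wide image needs roughly N/2 layers without downsampling.
# (Real U-Nets downsample to grow it faster, trading away spatial detail.)
for n in (256, 512, 1024):
    conv_layers = n // 2            # layers needed for a global receptive field
    tokens = (n // 8 // 2) ** 2     # 8x VAE downsample, then 2x2 patches
    attn_pairs = tokens ** 2        # pairwise interactions in ONE attention layer
    print(f"{n}px: ~{conv_layers} conv layers vs 1 attention layer "
          f"over {tokens} tokens ({attn_pairs:,} pairs)")
```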
This is why modern high-resolution image generation (Stable Diffusion 3, Flux) uses Transformer-based architectures. The global receptive field isn’t just a nice-to-have — it’s essential for scaling to production-quality resolutions.
Image Diffusion Architecture Evolution
| Architecture | Year | Key Innovation |
|---|---|---|
| DDPM | 2020 | U-Net denoiser, simple noise schedule |
| Stable Diffusion | 2022 | Latent diffusion (VAE encoder/decoder) |
| DiT | 2023 | Pure Transformer, no convolutions |
| SDXL | 2023 | Larger U-Net, better text conditioning |
Latent Diffusion: The Efficiency Breakthrough
Stable Diffusion (Rombach et al., 2022) introduced latent diffusion:
- Images are encoded to a lower-dimensional latent space (e.g., 512×512 → 64×64)
- Diffusion happens in latent space
- Decoder reconstructs high-resolution images
With an 8× reduction per side, the denoiser processes ~64× fewer spatial positions, cutting compute dramatically while maintaining quality.
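The arithmetic behind that claim, assuming Stable Diffusion's 8× VAE downsample and 4 latent channels:

```python
import torch

pixel = torch.zeros(1, 3, 512, 512)    # pixel-space input
latent = torch.zeros(1, 4, 64, 64)     # after the 8x VAE encoder

spatial_reduction = (512 // 64) ** 2   # 64x fewer positions to denoise
data_reduction = pixel.numel() / latent.numel()
print(spatial_reduction, round(data_reduction))  # 64, 48
```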
Latent Diffusion as a System Design Pattern
Interview Question: “How do you make Stable Diffusion fast on consumer hardware?”
Answer: Latent Diffusion (LDM).
This is a critical System Design pattern for production GenAI systems:
- Problem: Pixel-space diffusion on high-resolution images requires massive compute — impractical for consumer GPUs.
- Solution: Compress images to latent space (e.g., 512×512 → 64×64) using a pre-trained VAE encoder, run diffusion in this compressed space, then decode back to pixels.
- Tradeoff: Slight quality loss from compression, but the massive speedup makes it production-viable.
- Production Impact: This is why Stable Diffusion runs on consumer GPUs while pixel-space models require data center infrastructure.
Key Insight: The VAE encoder/decoder learns a “visual grammar” — it compresses images into a space that preserves semantic information while discarding pixel-level redundancy. Diffusion in this compressed space is both faster and often produces better results because the model focuses on structure rather than noise.
The VAE Bottleneck: The “JPEG” Compression Artifacts of AI
The Problem: Latent diffusion works by compressing the image first using a VAE (Variational Autoencoder). But the compressor is imperfect — it is lossy, and it discards information when it compresses too aggressively.
The Analogy: Think of the VAE as a JPEG compressor. If you compress a photo too much, you get:
- Blocky artifacts
- Blurred details
- Lost fine textures
The same happens with VAE compression in diffusion models.
The Symptom: Have you noticed AI-generated images where:
- Text looks garbled or unreadable?
- Faces look waxy or smoothed out?
- Fine details (like hair strands, fabric textures) are missing?
That’s often not the diffusion model’s fault — it’s the compressor’s fault. The VAE literally “blurred” the details before the diffusion model even started working.
The Fix: Modern models address this in several ways:
- Larger, Better VAEs: Models like SDXL and Flux use larger VAE encoders that preserve more detail
- Skip Compression for Critical Details: Some models use a hybrid approach — compress most of the image, but keep text and fine details in pixel space
- Better Training: Training VAEs specifically to preserve important details (like text, faces, fine textures)
Production Impact: This is why you see different quality levels across models. A model with a poor VAE will struggle with text and fine details, even if the diffusion model itself is excellent. The VAE is a critical bottleneck that determines the upper bound on image quality.
Interview Insight: When asked “Why do AI images sometimes look blurry or have garbled text?”, the answer is often the VAE bottleneck, not the diffusion model itself.
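A practical way to confirm this: round-trip an image through the VAE alone, with no diffusion step at all. Any blur or garbled text in the result is purely the compressor's doing. A sketch using the diffusers library (the checkpoint choice and preprocessing are illustrative):

```python
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

img = Image.open("test.png").convert("RGB").resize((512, 512))
x = torch.from_numpy(np.array(img)).float().permute(2, 0, 1) / 127.5 - 1.0
x = x.unsqueeze(0)                           # (1, 3, 512, 512), values in [-1, 1]

with torch.no_grad():
    z = vae.encode(x).latent_dist.sample()   # (1, 4, 64, 64) latent
    recon = vae.decode(z).sample             # back to pixel space

# Anything lost here was lost BEFORE diffusion could ever act on it.
out = ((recon[0].permute(1, 2, 0).clamp(-1, 1) + 1) * 127.5).byte().numpy()
Image.fromarray(out).save("roundtrip.png")
```

Comparing `test.png` with `roundtrip.png` on an image containing small text makes the bottleneck visible immediately.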
Image Generation Pipeline
- Text Encoding: CLIP or T5 encodes text prompt
- Noise Sampling: Start with random noise in latent space
- Iterative Denoising: DiT/U-Net removes noise step-by-step
- VAE Decoding: Convert latent back to pixel space
The result: high-quality images from text prompts, with fine-grained control through guidance and conditioning.
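The four stages as pseudocode, with all component interfaces assumed for illustration (e.g., `scheduler.step` follows the pattern of samplers like DDIM):

```python
import torch

def generate(prompt, text_encoder, denoiser, vae, scheduler, steps=50):
    """Text-to-image generation: encode, denoise iteratively, decode."""
    cond = text_encoder(prompt)              # 1. text encoding (CLIP/T5)
    z = torch.randn(1, 4, 64, 64)            # 2. random noise in latent space
    for t in scheduler.timesteps(steps):     # 3. iterative denoising
        eps = denoiser(z, t, cond)           #    predict noise (DiT or U-Net)
        z = scheduler.step(eps, t, z)        #    remove a slice of noise
    return vae.decode(z)                     # 4. latent -> pixel space
```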
References
DDPM (Foundational Paper)
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS. arXiv
DiT Architecture
- Peebles, W., & Xie, S. (2023). Scalable Diffusion Models with Transformers. ICCV. arXiv
- OpenDiT / PixArt-α: Open-source implementations on GitHub demonstrating DiT scalability
Latent Diffusion (LDM)
- Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. CVPR. arXiv
Further Reading
- Part 1: From Molecules to Machines
- Part 3: Sampling & Guidance
This is Part 2 of the Diffusion Models Series. Part 1 covered the foundations of diffusion models. Part 3 will explore sampling and guidance techniques.