By Gopi Krishna Tummala
Video Diffusion Models: The Temporal Challenge
Video generation extends image diffusion to the temporal dimension, introducing new challenges and architectural innovations.
Why Video is Harder Than Images
Imagine you ask an artist to draw a bird.
One frame? Easy.
Now tell the artist:
“Draw the same bird, but flying — for 5 seconds — at 24 frames per second.”
Suddenly the problem explodes:
- The wings must flap naturally.
- The body must follow a smooth trajectory.
- The shadows must move correctly.
- The bird cannot randomly change color or species between frames.
What the artist feels here is exactly what video models feel:
Temporal constraints are like hidden physical laws.
Break them once and the illusion collapses.
The Mathematical Challenge
For images, your model learns a distribution over single frames:

$$p_\theta(x)$$

For video, it must learn a joint distribution over all frames:

$$p_\theta(x_1, x_2, \ldots, x_F)$$

where each frame must be a plausible image on its own:

$$x_i \sim p(\text{image})$$

and consistent with its neighbors under some transition dynamics:

$$x_{i+1} \approx \mathcal{T}(x_i)$$
This is a high-dimensional Markov process, except the “transition dynamics” (the physics of how things move) are not given — the model must learn them.
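To make the Markov-process framing concrete, here is a toy sketch: a "video" where each frame is produced from the previous one by transition dynamics. The function names and the drifting-spot dynamics are illustrative assumptions, not anything from a real model — the point is just that frame $i+1$ is determined by frame $i$ through dynamics a video model must infer from data.

```python
import numpy as np

# Toy "video" as a Markov process: each frame depends on the previous
# one through transition dynamics the model would have to learn.
def roll_out(x0, transition, num_frames):
    frames = [x0]
    for _ in range(num_frames - 1):
        frames.append(transition(frames[-1]))
    return np.stack(frames)

# Hypothetical dynamics: a bright spot drifting right one pixel per frame.
def drift_right(frame):
    return np.roll(frame, shift=1, axis=1)

x0 = np.zeros((8, 8))
x0[4, 0] = 1.0                                  # spot starts at row 4, column 0
clip = roll_out(x0, drift_right, num_frames=8)
print(clip.shape)        # (8, 8, 8): 8 frames of 8x8 pixels
print(clip[7].argmax())  # 39 = row 4, column 7: the spot drifted right
```

A generative model never sees `drift_right` — it only sees clips like `clip`, and must recover the dynamics from their statistics.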
2024–2025 Research Consensus
Recent papers (CVPR, NeurIPS, ICLR) hammer this point:
- “Video Diffusion Models are Amortized Physical Simulators” (NeurIPS 2024 spotlight)
- “TempoFlow: Learning Coherent Motion Priors for Video Synthesis” (CVPR 2025)
- “DynamiCrafter 2: Learning Temporal Scene Geometry and Non-Rigid Motion” (NeurIPS 2024)
They all converge on one central idea:
To generate believable video, the model must implicitly learn physics.
Even if no one tells it Newton’s laws.
The model discovers that objects have momentum, that light sources cast consistent shadows, that water flows downhill — not through explicit programming, but through the statistical structure of billions of video frames.
The DiT Revolution: Transformers Replace U-Nets
Why U-Net Fails at Video Scaling
U-Nets use convolutions that operate locally:

$$y_{i,j} = \sum_{m,n=-k}^{k} w_{m,n} \, x_{i+m,\, j+n}$$
Great for images.
Disastrous for long-range temporal structure.
To model a 10-second clip (240 frames), the receptive field must span the entire clip — and each convolution layer only extends it by a few pixels or frames, so the network depth required explodes.
Transformers solve this by making the receptive field global:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
This is the key: every patch attends to every other patch, including across time.
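A minimal NumPy sketch of this global receptive field: flatten the patches of all frames into one sequence and run plain self-attention over it (a single head with identity projections, purely for illustration — real models use learned Q/K/V projections and many heads).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def global_spacetime_attention(patches):
    """Self-attention over ALL patches of ALL frames at once.

    patches: (T * P, d) — T frames with P patches each, flattened into
    one sequence so every patch can attend across space AND time.
    """
    d = patches.shape[-1]
    # Toy single-head attention with identity Q/K/V projections.
    scores = patches @ patches.T / np.sqrt(d)  # (T*P, T*P): global receptive field
    return softmax(scores, axis=-1) @ patches

# 4 frames x 16 patches x 8-dim embeddings, flattened to (64, 8)
rng = np.random.default_rng(0)
video_patches = rng.normal(size=(4 * 16, 8))
out = global_spacetime_attention(video_patches)
print(out.shape)  # (64, 8): every patch attended to every other patch
```

The (T·P)² score matrix is exactly why naive full spatio-temporal attention gets expensive — which motivates the factorized designs below.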
DiT → V-DiT → AsymmDiT
Open-source and industry models in 2024–2025 evolved like this:
| Year | Architecture | Major Contribution |
|---|---|---|
| 2023 | DiT | Replace U-Net with pure ViT for diffusion denoiser |
| 2024 | V-DiT / Video DiT | Extend DiT into temporal dimension |
| 2025 | AsymmDiT / Dual-path Attention | Separate spatial vs. temporal attention → faster + higher coherence |
AsymmDiT (e.g., Mochi 1 (Genmo, Apache 2.0), Pyramidal Video LDMs from CVPR 2025) uses:

- Spatial attention, within each frame:

$$h_s = \text{Attn}_{\text{spatial}}(h)$$

- Temporal attention, across frames at each spatial location:

$$h_t = \text{Attn}_{\text{temporal}}(h)$$

and mixes them with a learned gate:

$$h' = \alpha \, h_s + (1 - \alpha) \, h_t$$
Why this works:
- Spatial attention learns image content
- Temporal attention learns object permanence, motion physics
- A learned α lets the model gradually shift focus depending on frame structure
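The factorized attention plus gate can be sketched as follows — a minimal NumPy toy, not the Mochi 1 implementation; the single scalar gate, identity projections, and shapes are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    """Toy single-head self-attention over the second-to-last axis."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

def asymm_block(h, alpha_logit=0.0):
    """Factorized spatial/temporal attention mixed by a learned gate.

    h: (T, P, d) — T frames, P patches per frame, d-dim embeddings.
    alpha_logit: scalar parameter; sigmoid(alpha_logit) is the gate α.
    (Hypothetical simplification of an AsymmDiT-style block.)
    """
    spatial = attend(h)  # patches attend within each frame: (T, P, d)
    # Swap frame/patch axes so attention runs across frames per location.
    temporal = np.swapaxes(attend(np.swapaxes(h, 0, 1)), 0, 1)
    alpha = 1.0 / (1.0 + np.exp(-alpha_logit))  # learned gate in [0, 1]
    return alpha * spatial + (1.0 - alpha) * temporal

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 16, 8))  # 4 frames, 16 patches, 8-dim
out = asymm_block(h)
print(out.shape)  # (4, 16, 8)
```

Note the cost: two attention maps of size P×P and T×T per block instead of one (T·P)×(T·P) map — the source of the "faster + higher coherence" trade-off in the table above.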
This is one of the most powerful “physics proxies” in modern video generation.
The Architecture in Practice
In a typical V-DiT block:
- Patch Embedding: Video is split into 3D patches (height × width × time)
- Spatial Self-Attention: Patches within the same frame attend to each other
- Temporal Self-Attention: Patches across frames attend to each other
- Cross-Attention: Text prompts condition the generation
- Feed-Forward: Standard MLP layers
The key insight: separating spatial and temporal attention allows the model to learn different types of structure independently, then combine them.
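Step 1 above — splitting the video into 3D patches — can be sketched in a few lines of NumPy. The patch sizes (2 frames × 16 × 16 pixels) are illustrative assumptions; real models pick them per architecture.

```python
import numpy as np

def patchify_3d(video, pt=2, ph=16, pw=16):
    """Split a video into non-overlapping 3D (time x height x width) patches.

    video: (T, H, W, C); dimensions assumed divisible by the patch sizes.
    Returns (num_patches, pt * ph * pw * C) tokens for the transformer.
    """
    T, H, W, C = video.shape
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)  # group the patch-index axes first
    return v.reshape(-1, pt * ph * pw * C)

video = np.zeros((8, 64, 64, 3))  # 8 frames of 64x64 RGB
tokens = patchify_3d(video)
print(tokens.shape)  # (64, 1536): 4*4*4 patches, each 2*16*16*3 values
```

Each token would then be linearly projected to the model dimension before entering the attention blocks.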
Diffusion for Video: Intuition → Math
Diffusion models learn to reverse a noise process, transforming random noise into structured video content.
Forward process (adding noise):

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$

Reverse process (denoising):

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

For video, the latent includes a temporal axis:

$$z \in \mathbb{R}^{F \times H \times W \times C}$$

with $F$ = number of frames.
The Temporal Consistency Trick
The trick:
Noise is added independently to each frame, but the denoiser must jointly reconstruct all frames with temporal consistency.
This forces the model to learn temporal structure because that’s the only way to solve the puzzle.
If the model tries to denoise each frame independently, it will produce flickering, inconsistent motion. The only way to generate smooth video is to learn the temporal dependencies.
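A small sketch of the forward side of this trick, assuming the standard $\sqrt{\bar\alpha_t}\,x + \sqrt{1-\bar\alpha_t}\,\epsilon$ corruption (function names are illustrative): noise is drawn independently per frame, so any cross-frame coherence in the output must come from the denoiser, not from the corruption process.

```python
import numpy as np

def add_frame_noise(video, alpha_bar_t, rng):
    """One forward-diffusion corruption of a whole clip.

    Noise is drawn INDEPENDENTLY for every frame, so no temporal
    signal survives in the noise itself.
    video: (T, H, W, C); alpha_bar_t: cumulative noise-schedule value.
    """
    eps = rng.normal(size=video.shape)  # fresh i.i.d. noise per frame
    noisy = np.sqrt(alpha_bar_t) * video + np.sqrt(1.0 - alpha_bar_t) * eps
    return noisy, eps

rng = np.random.default_rng(0)
clip = np.ones((16, 8, 8, 3))  # a trivially "consistent" 16-frame clip
noisy, eps = add_frame_noise(clip, alpha_bar_t=0.5, rng=rng)
print(noisy.shape)  # (16, 8, 8, 3)

# Noise in adjacent frames is uncorrelated: temporal consistency must
# be recovered by the denoiser, jointly over all frames.
c = np.corrcoef(eps[0].ravel(), eps[1].ravel())[0, 1]
print(abs(c) < 0.3)  # True (independent draws)
```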
Modern Video Diffusion Scale
Modern video diffusion models (Wan 2.2, HunyuanVideo, Open-Sora 2 (open-source), VeGa) train on clips of up to:
- 1024×1024 resolution
- 8–24 fps
- 2–14 seconds per clip
This is orders of magnitude larger than early video diffusion.
For a 10-second clip at 24 fps and 1024×1024 resolution, the raw pixel tensor alone holds

$$240 \times 1024 \times 1024 \times 3 \approx 7.5 \times 10^8 \text{ values.}$$
Training on billions of such clips requires:
- Efficient latent compression (VAE encoders)
- Temporal downsampling strategies
- Hierarchical generation (generate keyframes, then interpolate)
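The scale arithmetic behind the first bullet, latent compression, is easy to check. The VAE factors below (8× spatial, 4× temporal downsampling, 16 latent channels) are illustrative assumptions, not the spec of any particular model.

```python
# Rough scale arithmetic for one clip (illustrative numbers only).
frames = 10 * 24                    # 10 s at 24 fps = 240 frames
raw = frames * 1024 * 1024 * 3      # raw RGB pixel values

# Hypothetical VAE: 8x spatial and 4x temporal downsampling, 16 latent channels.
latent = (frames // 4) * (1024 // 8) ** 2 * 16

print(f"raw values:    {raw:,}")         # 754,974,720
print(f"latent values: {latent:,}")      # 15,728,640
print(f"compression:   {raw // latent}x")  # 48x
```

Even a modest latent space like this shrinks the denoiser's workload by well over an order of magnitude, which is what makes transformer attention over whole clips tractable.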
References
V-DiT / Temporal Attention
- Latte / Video DiT: Early works adapting DiT for video with temporal attention mechanisms
- Stable Video Diffusion (SVD): Demonstrates inflating pre-trained 2D models with temporal layers
AsymmDiT (Asymmetric DiT)
- Mochi 1 (Genmo): Open-source model (Apache 2.0) using Asymmetric Diffusion Transformer to separate spatial vs. temporal attention
Further Reading
- Part 3: Sampling & Guidance
- Part 5: Pre-Training & Post-Training
This is Part 4 of the Diffusion Models Series. Part 3 covered sampling and guidance. Part 5 will explore pre-training and post-training pipelines.