By Gopi Krishna Tummala
Pre-Training: Learning the Grammar of the World
Pre-training is like teaching a child the grammar of the world — the model sees millions of images and video clips, and learns the “language” of how pixels, patches, and motion behave.
It’s not just about “make pretty video”; it’s about learning the distribution of real-world spatio-temporal phenomena so that the model can later be directed via prompts.
Key Components of Pre-Training
Large-Scale Data:
- Videos and images of variable resolutions, durations, and aspect ratios
- As seen in Sora’s technical report: unified representation for images & video with variable duration/ratio
- The model must learn to handle diverse content types
Latent Compression:
- Videos are compressed into a smaller latent space (both spatially & temporally) before diffusion
- Sora’s approach: “spacetime patches” as tokens — video is broken into patches that span both space and time (a minimal sketch follows this list)
- This reduces computational requirements while preserving essential information
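To make “spacetime patches” concrete, here is a minimal PyTorch sketch of cutting a latent video tensor into patch tokens. The function name and patch sizes are illustrative, not taken from any published implementation:

```python
import torch

def spacetime_patchify(latent: torch.Tensor,
                       pt: int = 2, ph: int = 2, pw: int = 2) -> torch.Tensor:
    """Cut a latent video (B, C, T, H, W) into flattened spacetime patch tokens.

    Each token covers pt frames x ph x pw latent pixels, so the sequence
    length is (T/pt) * (H/ph) * (W/pw) -- far shorter than raw pixels.
    """
    B, C, T, H, W = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(B, C, T // pt, pt, H // ph, ph, W // pw, pw)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)   # (B, T', H', W', C, pt, ph, pw)
    x = x.flatten(4)                          # one vector per patch
    return x.flatten(1, 3)                    # (B, num_tokens, C*pt*ph*pw)

# Example: a 16-frame 32x32 latent with 4 channels -> 8*16*16 = 2048 tokens.
tokens = spacetime_patchify(torch.randn(1, 4, 16, 32, 32))
print(tokens.shape)  # torch.Size([1, 2048, 32])
```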
Transformer Backbone:
- The denoising network is a transformer over spatio-temporal tokens
- Diffusion loss operates on patch-based latent space
- The model learns to reverse the noise process in this compressed representation
Mathematical Foundation:
The pre-training objective is to learn the reverse diffusion process by predicting the noise added to the latent:

$$\mathcal{L} = \mathbb{E}_{z,\, c,\, t,\, \epsilon \sim \mathcal{N}(0, I)}\Big[\big\lVert \epsilon - \epsilon_\theta(z_t, t, c) \big\rVert^2\Big]$$

Where:
- $z$ is the latent representation of the video, and $z_t$ is its noised version at step $t$
- $\epsilon$ is the noise to predict (and $\epsilon_\theta$ the model’s prediction)
- $t$ is the diffusion timestep
- $c$ is conditioning (e.g., text prompts)
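A minimal sketch of this objective as a single training step, assuming a standard cosine noise schedule and an ε-prediction denoiser; `model` and all names here are illustrative:

```python
import math
import torch
import torch.nn.functional as F

def alpha_bar(t: torch.Tensor, num_steps: int) -> torch.Tensor:
    # Cosine noise schedule (Nichol & Dhariwal, 2021).
    s = 0.008
    return torch.cos((t / num_steps + s) / (1 + s) * math.pi / 2) ** 2

def diffusion_training_step(model, z, c, num_steps: int = 1000) -> torch.Tensor:
    """One epsilon-prediction step on clean video latents z with conditioning c."""
    b = z.shape[0]
    t = torch.randint(0, num_steps, (b,), device=z.device)   # random timesteps
    eps = torch.randn_like(z)                                 # noise to predict
    ab = alpha_bar(t.float(), num_steps).view(b, *([1] * (z.dim() - 1)))
    z_t = ab.sqrt() * z + (1 - ab).sqrt() * eps               # noised latent
    return F.mse_loss(model(z_t, t, c), eps)                  # matches the loss above
```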
Training Data: The Billion-Frame Problem
What Leading Video Models Use (2024–2025)
| Model | Year | Training frames (approx.) | Notes |
|---|---|---|---|
| HunyuanVideo | 2024 | ~1B | Strongest open-source text-to-video model of 2024 |
| Wan 2.2 | 2025 | ~12B | Uses aesthetic + cinematic scoring in data curation |
| Open-Sora 2 | 2025 | ~4B | Fully open pipeline with a detailed technical report |
| Pika 1.5 | 2024 | Undisclosed | Commercial; high-quality proprietary dataset |
Data Quality Requirements
New datasets lean heavily on:
- Scene description consistency: Captions accurately describe what’s happening
- Temporal captions: “at 1s, camera pans left…” — describing actions over time
- Action-rich clips: Sports, wildlife, driving — clips with clear motion
- Cinematic metadata: Shot types, lenses, lighting — professional filmmaking knowledge
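As an illustration, a single curated clip might carry a record like the following; the field names are invented for this example, not taken from any specific dataset:

```python
clip_annotation = {
    "caption": "A golden retriever sprints across a beach at sunset.",
    "temporal": [
        {"t": 0.0, "event": "dog enters frame from the left"},
        {"t": 1.0, "event": "camera pans left, tracking the dog"},
    ],
    "cinematic": {"shot": "wide", "lens_mm": 35, "lighting": "golden hour"},
    "aesthetic_score": 0.82,
}
```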
Recaptioning: The Data Engine (The Missing Link)
The Problem: Raw data from the internet is noisy and poorly labeled. You might find:
- Images labeled “IMG_001.jpg” or “holiday 2012”
- Videos with generic descriptions like “nice view” or “cat video”
- Alt text that’s completely wrong or missing
If you train on bad labels, you get a model that ignores prompts. The model never learns the connection between words like “vibrant,” “calm,” or “silhouette” and their visual meanings.
The Analogy: Imagine trying to learn what a “sunset” looks like, but the teacher only shows you photos labeled “holiday 2012” or “nice view.” You’d never learn the connection between the word “sunset” and the orange sky.
The Fix (Recaptioning): Before training the image/video generator, researchers use a separate, already-capable AI (a vision-language model such as GPT-4V) to look at every training image/video and write a detailed, accurate description.
Example:
- Original caption: “holiday 2012”
- Recaptioned: “A vibrant orange sunset over a calm ocean with silhouette palm trees, warm golden light reflecting on the water, tropical beach scene”
The Result: The image/video generator is now trained on these “perfect” synthetic captions. It learns exactly what “vibrant,” “calm,” and “silhouette” mean visually.
Key Takeaway: Better captions > More data. This is the secret sauce behind DALL-E 3 and Sora’s ability to follow complex instructions.
How It Works:
- Pre-trained Vision-Language Model: Use a model like GPT-4V that can understand images/videos and generate detailed descriptions
- Batch Recaptioning: Process all training data through the VLM to generate high-quality captions (see the sketch after this list)
- Training on Synthetic Captions: Train the diffusion model on these recaptioned pairs instead of original noisy captions
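A minimal sketch of that loop, where `caption_with_vlm` is a stand-in for whatever captioning model or API is actually used:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    path: str
    original_caption: str
    synthetic_caption: str = ""

PROMPT = ("Describe this image or video in one detailed paragraph: "
          "subjects, actions, lighting, colors, and camera framing.")

def recaption(samples: list[Sample], caption_with_vlm) -> list[Sample]:
    """Replace noisy web captions with VLM-written ones.

    caption_with_vlm(path, prompt) is a placeholder; it returns a
    detailed text description of the media at `path`.
    """
    for s in samples:
        s.synthetic_caption = caption_with_vlm(s.path, PROMPT)
    return samples

# The generator is then trained on (media, synthetic_caption) pairs
# instead of (media, original_caption).
```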
Production Impact: This is why modern models (DALL-E 3, Sora, Veo) can follow complex, multi-part prompts. They were trained on captions that actually describe what’s in the image/video, not generic filenames or poor alt text.
Interview Insight: When asked “How do you make a model follow prompts better?”, the answer is often “better training data” — specifically, recaptioning with high-quality vision-language models. This is more important than model architecture improvements.
Framewise Aesthetic Reward Models (FARM, 2025)
A reward function for RLHF on video aesthetic quality. Schematically, it combines a per-frame aesthetic term with a temporal-coherence term:

$$R(x_{1:T}) = \frac{1}{T}\sum_{t=1}^{T} r_{\text{aes}}(x_t) + \lambda\, r_{\text{temp}}(x_{1:T})$$

This rewards:
- Frame-level quality: Each frame is visually appealing
- Temporal coherence: Frames flow smoothly together
The challenge: balancing aesthetic quality with temporal consistency. A model that generates beautiful individual frames but flickers between them is useless.
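A hedged sketch of what such a reward could look like, using mean absolute frame-to-frame difference as a crude proxy for the learned temporal term; both scorers are placeholders:

```python
import torch

def framewise_aesthetic_reward(frames: list[torch.Tensor],
                               aes_model, lam: float = 0.5) -> torch.Tensor:
    """Mean per-frame aesthetic score minus a flicker penalty.

    aes_model is a placeholder per-frame scorer returning a scalar; the
    flicker term is a stand-in for a learned temporal-coherence model.
    """
    per_frame = torch.stack([aes_model(f) for f in frames])            # (T,)
    flicker = torch.stack([(a - b).abs().mean()
                           for a, b in zip(frames[:-1], frames[1:])])  # (T-1,)
    return per_frame.mean() - lam * flicker.mean()
```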
Data Curation Pipeline
Modern video datasets go through the following stages (a schematic pass is sketched after the list):
- Web scraping: Billions of video-text pairs from the internet
- Quality filtering: Remove low-resolution, corrupted, or irrelevant videos
- Caption generation: Use vision-language models to generate detailed captions
- Aesthetic scoring: Rank videos by visual quality
- Temporal annotation: Label actions, camera movements, scene changes
- Deduplication: Remove near-duplicate clips
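A toy version of this pipeline, with placeholder thresholds and components, might look like:

```python
def curate(clips, aesthetic_score, generate_caption, perceptual_hash):
    """Toy curation pass mirroring the pipeline stages above.

    The three callables are placeholders for real components: an aesthetic
    scorer, a VLM captioner, and a near-duplicate hasher.
    """
    seen, kept = set(), []
    for clip in clips:
        if clip.height < 480 or clip.duration_s < 2.0:    # quality filtering
            continue
        if aesthetic_score(clip) < 0.5:                   # aesthetic scoring
            continue
        h = perceptual_hash(clip)                         # deduplication
        if h in seen:
            continue
        seen.add(h)
        clip.caption = generate_caption(clip)             # caption generation
        kept.append(clip)
    return kept
```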
The result: a curated dataset where each clip is:
- High quality
- Well-described
- Temporally rich
- Aesthetically pleasing
Pre-Training Challenges and Tradeoffs
Scale Requirements:
- Modern models train on billions of frames, so compute and data demands are enormous
- Training costs can reach millions of dollars in compute
- Requires massive distributed training infrastructure
Representational Capacity:
- Balancing spatial detail vs temporal coherence
- Higher resolution means more parameters and compute
- Longer videos require more memory and temporal modeling capacity
Data Diversity:
- Ensuring the model sees enough variation in movement, scene types, camera angles
- Avoiding bias toward common patterns (e.g., certain camera movements, scene compositions)
- Handling edge cases: rare motions, unusual perspectives, complex interactions
Efficiency vs Quality:
- Latent compression reduces compute but may lose fine details
- Temporal downsampling speeds training but limits motion fidelity
- Hierarchical generation (keyframes + interpolation) trades quality for speed
Post-Training: Alignment and Human Preferences
After pre-training, the raw diffusion model is powerful but unaligned. It might generate motion that’s physically weird, or video that’s misaligned with user intent. Post-training is how we teach the model to behave.
Think of it like giving the child not just grammar but a style guide: what we actually want them to write, ethically and aesthetically.
The Core Distinction:
- Pre-training teaches what the world looks like — the statistical distribution of video data
- Post-training teaches what humans want to see — alignment with preferences, safety, and controllability
Post-Training Methods
1. Supervised Fine-Tuning (SFT):
- Training on prompt–video (or video + caption) pairs to align with desired outputs
- Improves prompt following and style consistency
- Typically uses a smaller, high-quality curated dataset
2. Preference-Based Alignment (RLHF / DPO):
- Humans rank generated videos; the model is trained to prefer higher-ranked ones
- RLHF: Requires training a separate reward model, then using reinforcement learning
- DPO: Directly optimizes preferences without a reward model (more stable, dominant in 2025)
3. Safety & Moderation Layers:
- Content filters to prevent harmful or inappropriate content
- Watermarking and detection systems for synthetic content
- Content provenance metadata for tracking generated media
Mathematical Foundation: Direct Preference Optimization (DPO)
Recent research has successfully applied DPO to video generation (HuViDPO, Flow-DPO). The preference-ranking loss is:

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}\Big[\log \sigma\Big(\beta \log \tfrac{\pi_\theta(y_w \mid c)}{\pi_{\text{ref}}(y_w \mid c)} - \beta \log \tfrac{\pi_\theta(y_l \mid c)}{\pi_{\text{ref}}(y_l \mid c)}\Big)\Big]$$

Where:
- $y_w$ is a preferred video (rated higher by humans)
- $y_l$ is a less preferred video
- $\pi_\theta(\cdot \mid c)$ is the model’s probability of generating that video given prompt $c$, and $\pi_{\text{ref}}$ is a frozen reference copy of the pre-trained model
- $\beta$ controls how far the policy may drift from the reference, and $\sigma$ is the sigmoid
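The loss itself is straightforward to implement; a minimal sketch, with the log-probabilities supplied externally:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective over batches of (preferred, dispreferred) pairs.

    logp_* are policy log-probabilities and ref_logp_* come from the frozen
    reference model; for video diffusion these are typically replaced by a
    per-sample denoising-loss surrogate, as in Flow-DPO-style methods.
    """
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()
```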
DPO has become the dominant alignment technique in 2025 because it:
- Doesn’t require training a separate reward model
- Directly optimizes for human preferences
- Is more stable than RLHF
Cinematic Reward Models
Some models (OpenAI Sora successors, proprietary) also use “Cinematic Reward Models” which grade:
- Shot composition: Rule of thirds, leading lines, framing
- Color grading: Consistent color palette, mood
- Motion smoothness: No jitter, natural camera movement
- Camera trajectory realism: Camera moves like a real camera operator would
This is a key reason modern systems can produce near-Hollywood-level output: the model learns not just to generate video, but to generate cinematic video, footage that looks like it was shot by a professional filmmaker.
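As a toy illustration of how such graders could be combined; the weights and interfaces are invented for this sketch, not a disclosed design:

```python
# Each grader maps a video to a score in [0, 1]; weights are illustrative.
WEIGHTS = {"composition": 0.3, "color": 0.2, "motion": 0.3, "trajectory": 0.2}

def cinematic_reward(video, graders: dict) -> float:
    return sum(w * graders[name](video) for name, w in WEIGHTS.items())
```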
The Alignment Process
- Collect preferences: Show humans pairs of videos, ask which is better
- Train reward model: Learn to predict human preferences (this explicit step is skipped with DPO)
- Optimize policy: Use RLHF or DPO to align the model with preferences
- Iterate: Repeat with new data and refined preferences (one round is sketched below)
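A schematic version of one round of this loop, with every component left as a placeholder callable:

```python
def alignment_round(policy, prompts, generate, collect_preference, dpo_update):
    """One iteration of the preference loop; all callables are placeholders.

    collect_preference returns the (winner, loser) ordering chosen by a
    human rater; dpo_update applies a DPO step to the policy.
    """
    for prompt in prompts:
        video_a = generate(policy, prompt)
        video_b = generate(policy, prompt)
        winner, loser = collect_preference(prompt, video_a, video_b)
        policy = dpo_update(policy, prompt, winner, loser)
    return policy
```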
The result: models that generate videos humans actually want to watch.
Post-Training Challenges
Compute Constraints:
- Very large video models are hard to fine-tune because of compute + data requirements
- Full model fine-tuning may be impractical; adapter-based methods are common
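A minimal sketch of one common adapter style, a LoRA-style low-rank update on a frozen linear layer (PyTorch, illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update (LoRA-style)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # backbone stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```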
Physical Realism:
- Post-training for physical realism (motion, causality) is still under research
- Models may generate physically implausible motion even after alignment
- Artifact detection remains an active area (e.g., Simple Visual Artifact Detection in Sora-generated Videos)
Balancing Tradeoffs:
- Alignment may reduce creative diversity
- Safety filters may be overly conservative
- Quality vs. controllability: more control may reduce output quality
References
Training Data & Datasets
- Open-Sora 2.0: comprehensive open-source project detailing a hierarchical data pyramid, multi-stage training, and data filtering (GitHub)
- WebVid-2M / Kinetics / UCF-101: public datasets for video action recognition and text-to-video benchmarking
- Wan 2.2: research on aesthetic and cinematic scoring in data curation pipelines
Post-Training & Alignment
- HuViDPO / Flow-DPO: first successful applications of Direct Preference Optimization (DPO) to text-to-video generation (arXiv)
- Improving Video Generation with Human Feedback: systematic pipeline using human feedback with multi-dimensional video reward models (arXiv)
- Hugging Face TRL library: open-source implementations of DPO, PPO (RLHF), and other alignment methods for transformer models (documentation)
Further Reading
- Part 4: Video Diffusion Fundamentals
- Part 6: Diffusion for Action
This is Part 5 of the Diffusion Models Series. Part 4 covered video diffusion fundamentals. Part 6 will explore diffusion for robotics and action planning.