By Gopi Krishna Tummala
Table of Contents
Modern Models: Sora, Veo 3, and Open-Sora
These are not just research toys — they are world simulators that demonstrate the state-of-the-art in video generation.
Sora: World Simulators
OpenAI’s Sora represents a breakthrough in unified video generation:
Key Innovations:
- Spacetime Patches: Diffusion transformer operates on patches that span both space and time, enabling variable duration and aspect ratios
- Unified Representation: Same model handles images and video, with variable resolutions and durations
- Recaptioning Technique: Uses DALL·E 3’s recaptioning to improve caption quality in training data
Architecture:
- Transformer-based denoiser on spacetime patches
- Large-scale pre-training on video + image data
- Post-alignment for prompt-following and safety
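The spacetime-patch idea can be sketched in a few lines: a video tensor is cut into fixed-size blocks spanning both frames and pixels, so any duration or aspect ratio (divisible by the patch size) yields a valid token sequence. A minimal NumPy sketch, with illustrative patch sizes (not Sora's actual values):

```python
import numpy as np

def spacetime_patchify(video, pt=2, ph=8, pw=8):
    """Cut a video tensor (T, H, W, C) into flattened spacetime patches.

    Each patch spans `pt` frames and a `ph` x `pw` spatial window, so the
    sequence length adapts to any duration or aspect ratio divisible by
    the patch size -- the property that lets one model handle variable
    formats.
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # split each axis into (blocks, within-block): (T/pt, pt, H/ph, ph, W/pw, pw, C)
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # group the block axes together: (T/pt, H/ph, W/pw, pt, ph, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)
    # flatten to a token sequence: (num_patches, patch_dim)
    return v.reshape(-1, pt * ph * pw * C)

video = np.random.rand(8, 32, 32, 3)   # 8 frames of 32x32 RGB
tokens = spacetime_patchify(video)
print(tokens.shape)                    # (64, 384): 4*4*4 patches, 2*8*8*3 dims
```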
Capabilities:
- Generate videos up to 60 seconds
- Variable aspect ratios and resolutions
- Strong temporal coherence and object permanence
Limitations:
- Physics and causality limitations (acknowledged by OpenAI)
- Artifact issues (addressed in research like Sugiyama & Kataoka, 2025)
- Watermarking and copyright concerns
- Access constraints (not fully public)
On-Device Sora:
- Research variant optimized for mobile/low compute
- Techniques: Linear Proportional Leap (LPL) to reduce denoising steps
- Temporal Dimension Token Merging (TDTM) for efficiency
- Does not require retraining — works by optimizing inference
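The core intuition behind temporal token merging can be illustrated with a toy sketch: collapse pairs of adjacent frame tokens by averaging so that temporal attention processes half as many positions. This is only an illustration of the idea, not the paper's exact TDTM algorithm:

```python
import numpy as np

def merge_temporal_tokens(tokens):
    """Average each pair of adjacent temporal tokens.

    tokens: (T, N, D) -- T frames, N spatial tokens per frame, D channels.
    Returns (T // 2, N, D), halving the sequence that temporal attention
    layers must process. (Toy sketch of the merging idea; the actual
    TDTM method in On-Device Sora may differ in detail.)
    """
    T, N, D = tokens.shape
    assert T % 2 == 0
    return tokens.reshape(T // 2, 2, N, D).mean(axis=1)

x = np.arange(2 * 1 * 3, dtype=float).reshape(2, 1, 3)  # two frames
print(merge_temporal_tokens(x))  # midpoint of the two frames
```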
Veo 3: Audio and Motion Control
Google/DeepMind’s Veo 3 focuses on high-fidelity video with integrated audio:
Key Innovations:
- Integrated Audio Generation: Includes lip-sync, environmental sound, and dialogue
- Motion Control: Fine-grained control over camera movement, scene dynamics, and object motion
- Cinematic Quality: High-fidelity output with professional filmmaking aesthetics
Architecture:
- Large-scale pre-training (Google-scale compute and data)
- Post-training for alignment, realism, and audio-visual synchronization
- Motion control through conditioning mechanisms
Capabilities:
- High-resolution video generation
- Synchronized audio generation
- Controllable camera and object motion
- Professional-grade cinematic output
Limitations:
- Black box: not fully open research
- Access and cost constraints for users
- Balancing audio, visuals, and motion control is computationally intensive
Open-Sora: Open-Source Alternative
Open-Sora (HPC-AI Tech) provides a fully open-source alternative:
Key Innovations:
- Spatial-Temporal Diffusion Transformer (STDiT): Decouples spatial and temporal attention for efficiency
- 3D Autoencoder: Compact video representation in latent space
- Open Weights + Code: Fully reproducible pipeline
Architecture:
- STDiT backbone with separate spatial/temporal attention
- 3D VAE for latent compression
- Multi-stage training pipeline
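The efficiency win of decoupled attention is easy to see in code: attend over spatial tokens within each frame, then over frames at each spatial location, giving O(T·N² + N·T²) cost instead of O((T·N)²) for full 3D attention. A minimal NumPy sketch (single head, projection weights omitted; illustrative, not Open-Sora's implementation):

```python
import numpy as np

def attention(x):
    """Plain single-head self-attention over the middle axis of (B, L, D).
    Projection weights are omitted for brevity (q = k = v = x)."""
    q = k = v = x
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

def stdit_block(x):
    """Decoupled spatial/temporal attention on x of shape (T, N, D).

    Spatial attention mixes the N tokens inside each frame; temporal
    attention mixes the T frames at each spatial location.
    """
    T, N, D = x.shape
    x = attention(x)                        # spatial: T frames as the batch
    x = attention(x.transpose(1, 0, 2))     # temporal: N locations as the batch
    return x.transpose(1, 0, 2)

out = stdit_block(np.random.rand(4, 16, 8))
print(out.shape)                            # (4, 16, 8)
```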
Capabilities:
- 720p video generation
- ~15 second clip generation
- Open-source and reproducible
- Detailed technical reports
Limitations:
- Lower resolution/quality vs. Sora/Veo
- Generation cost for long/high-res video
- Less advanced audio (depending on version)
- Fewer resources for post-training compared to big organizations
Model Comparison
| Model | Key Strength | Pre-Training Scale | Post-Training | Open Source | Best For |
|---|---|---|---|---|---|
| Sora | Unified representation, long videos | Very large | Extensive | No | Research, high-quality generation |
| Veo 3 | Audio sync, motion control | Very large | Extensive | No | Cinematic content, audio-visual |
| Open-Sora | Reproducibility, open access | Large (open) | Limited | Yes | Research, education, development |
| Mochi 1 | AsymmDiT architecture | Large | Limited | Yes (Apache 2.0) | Open-source video generation |
| HunyuanVideo | Large-scale open model | ~1B frames | Limited | Yes | Open-source baseline |
Motion Modeling: Geometry, Optical Flow, and Diffusion Fields
Recent research (2024–2025) shows a shift:
Models now explicitly learn motion fields, not just pixels.
Good video models either learn motion explicitly (flow/fields) or implicitly (attention across time). When motion is modeled explicitly, the model gets a physics anchor to hang frames on.
FlowVid 2.0 (CVPR 2024)
Uses motion priors learned from optical flow to stabilize animations.
The model learns to predict optical flow:

$$\mathcal{L}_{\text{flow}} = \left\lVert \hat{F} - F \right\rVert^2$$

where $\hat{F}$ is the predicted flow field and $F$ is the ground-truth optical flow.
This ensures that:
- Objects move smoothly
- Motion is physically plausible
- Temporal consistency is maintained
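A flow-supervision signal of this shape reduces to a simple per-pixel distance between predicted and reference motion vectors. A minimal sketch of such a loss (mean endpoint error; FlowVid's exact formulation may differ):

```python
import numpy as np

def flow_loss(flow_pred, flow_gt):
    """Mean endpoint error between predicted and ground-truth flow.

    flow_pred, flow_gt: (H, W, 2) arrays of per-pixel (dx, dy) motion.
    Averaging the per-pixel vector distance penalizes motion that
    deviates from the reference flow field.
    """
    return np.linalg.norm(flow_pred - flow_gt, axis=-1).mean()

gt = np.zeros((4, 4, 2))
pred = np.ones((4, 4, 2))      # every pixel off by (1, 1)
print(flow_loss(pred, gt))     # sqrt(2) ~ 1.4142
```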
DynamiCrafter 2 (NeurIPS 2024)
Learns scene geometry as a latent NeRF-like volume:

$$c(\mathbf{r}) = \mathrm{Render}(V, \mathbf{r})$$

where $V$ is a 3D volume representation and $\mathbf{r}$ is a ray through the scene.
This gives the model:
- 3D understanding: Objects have depth and structure
- View consistency: The same object looks correct from different angles
- Motion in 3D space: Objects move through 3D, not just 2D pixels
Diffusion Video Fields (CVPR 2025)
Represents video as a 4D continuous function:

$$f(x, y, t, \sigma) \rightarrow \text{RGB}$$

Where:
- $(x, y)$ are spatial coordinates
- $t$ is time
- $\sigma$ is the noise level (diffusion timestep)
This gives better:
- Identity preservation: Objects maintain their appearance across frames
- Motion stability: Smooth, continuous motion
- Controllability: Easy to manipulate camera movement, object motion
The key insight: representing video as a continuous function allows the model to interpolate smoothly between frames, rather than generating discrete frames independently.
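The continuous-function view can be made concrete with a toy network: a tiny MLP maps any real-valued $(x, y, t, \sigma)$ to a color, so frames at in-between times fall out by evaluation rather than by generating a fixed discrete grid. All names and sizes here are illustrative, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 64))   # input: (x, y, t, sigma)
W2 = rng.normal(size=(64, 3))   # output: RGB

def video_field(x, y, t, sigma):
    """Continuous 4D field f(x, y, t, sigma) -> RGB, as a tiny random MLP.

    Because the field is defined at any real-valued (x, y, t), a frame
    at an arbitrary time is obtained by querying, not by interpolating
    between discrete generated frames. (Toy sketch; real models
    condition a large network on these coordinates.)
    """
    h = np.tanh(np.array([x, y, t, sigma]) @ W1)
    return h @ W2

frame_a = video_field(0.5, 0.5, t=0.10, sigma=0.0)
frame_b = video_field(0.5, 0.5, t=0.15, sigma=0.0)  # in-between time: just query it
print(frame_a.shape, frame_b.shape)                 # (3,) (3,)
```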
Putting It All Together
The Complete Pipeline
A modern video generation model (2025) works like this:
1. Pre-training (billions of frames):
- Learn the structure of video data
- Learn temporal dependencies
- Learn to denoise video latents
2. Architecture (DiT/V-DiT/AsymmDiT):
- Spatial attention for image content
- Temporal attention for motion
- Cross-attention for text conditioning
3. Motion Learning (explicit motion fields):
- Optical flow for smooth motion
- 3D geometry for view consistency
- Continuous fields for interpolation
4. Post-training (human preferences):
- DPO for preference alignment
- Cinematic reward models for aesthetics
- Instruction following for controllability
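The preference-alignment step can be made concrete with the DPO objective: given a human-preferred and a rejected sample, the loss rewards the model for increasing its likelihood margin on the preferred one relative to a frozen reference model. A minimal single-pair sketch (numbers are illustrative):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_w / logp_l: model log-likelihood of the preferred (w) and
    rejected (l) sample; ref_logp_*: the same under a frozen reference
    model. Minimizing this pushes probability mass toward the
    human-preferred sample without an explicit reward model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Preferred clip already more likely than under the reference -> small loss
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```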
The Physics Connection
The remarkable thing: none of this explicitly programs physics.
The model learns:
- Objects have momentum (from watching things move)
- Light casts shadows (from watching lighting)
- Water flows downhill (from watching water)
- Camera movement is smooth (from watching camera work)
Not through equations, but through statistical patterns in billions of frames.
The model becomes an amortized physical simulator — it doesn’t solve physics equations, but it has learned to generate videos that satisfy physical laws because those laws are encoded in the training data.
Why This Matters
This approach to video generation has implications beyond entertainment:
- Robotics: Models that understand motion can plan robot trajectories
- Scientific simulation: Generate plausible simulations of physical processes
- Education: Visualize complex phenomena (fluid dynamics, particle physics)
- Creative tools: Enable new forms of artistic expression
The future: models that don’t just generate video, but understand the physics underlying motion.
Conclusion
Generative video is one of the most exciting frontiers in AI.
It requires:
- Massive scale: Billions of frames, trillions of parameters
- Novel architectures: DiT, temporal attention, motion fields
- Sophisticated training: Pre-training, post-training, alignment
- Implicit physics: Learning physical laws from data
The result: models that can generate videos that are:
- Visually stunning: High resolution, cinematic quality
- Temporally coherent: Smooth motion, consistent objects
- Physically plausible: Motion that makes sense
- Controllable: Follow text prompts, user instructions
We’re teaching machines the physics of time — not through equations, but through the statistical structure of motion itself.
References
Modern Models
- Sora (OpenAI): Video generation models as world simulators. OpenAI Research
- On-device Sora: Training-free diffusion-based text-to-video for mobile devices. arXiv
- Veo 3: Google/DeepMind’s high-fidelity video generation with audio. Veo 3
- Open-Sora: Democratizing efficient video production for all. arXiv GitHub
- Simple Visual Artifact Detection in Sora-generated Videos: Research on detecting artifacts in Sora outputs. arXiv
- Mora: Enabling generalist video generation via multi-agent framework. arXiv
Motion Learning & Optical Flow
- FlowVid 2.0 (CVPR 2024): Explicit optical flow priors for temporal coherence. arXiv
- DynamiCrafter 2 (NeurIPS 2024): Learning temporal scene geometry and non-rigid motion. arXiv
Further Reading
- Part 6: Diffusion for Action
- Diffusion Models Series Part 1: From Molecules to Machines
- Vision-Language Models: Vision-Language Models Explained
- Physics-Aware Video: Physics-Aware Video Diffusion Models
This is Part 7 of the Diffusion Models Series, concluding our exploration of image and video diffusion models. The series covers foundations, architectures, training pipelines, robotics applications, and state-of-the-art models.