By Gopi Krishna Tummala
Table of Contents
Modern Models: Sora, Veo 3, and Open-Sora
These are not just research toys — they are world simulators that demonstrate the state-of-the-art in video generation.
Sora: World Simulators
OpenAI’s Sora represents a breakthrough in unified video generation:
Key Innovations:
- Spacetime Patches: Diffusion transformer operates on patches that span both space and time, enabling variable duration and aspect ratios
- Unified Representation: Same model handles images and video, with variable resolutions and durations
- Recaptioning Technique: Uses DALL·E 3’s recaptioning to improve caption quality in training data
Architecture:
- Transformer-based denoiser on spacetime patches
- Large-scale pre-training on video + image data
- Post-alignment for prompt-following and safety
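The spacetime-patch idea can be sketched in a few lines: a video tensor is cut into fixed-size blocks spanning both frames and pixels, so any duration or aspect ratio (divisible by the patch size) yields a valid token sequence. A minimal NumPy sketch, with illustrative patch sizes (not Sora's actual values):

```python
import numpy as np

def spacetime_patchify(video, pt=2, ph=8, pw=8):
    """Cut a video tensor (T, H, W, C) into flattened spacetime patches.

    Each patch spans `pt` frames and a `ph` x `pw` spatial window, so the
    sequence length adapts to any duration or aspect ratio divisible by
    the patch size -- the property that lets one model handle variable
    formats.
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # split each axis into (blocks, within-block): (T/pt, pt, H/ph, ph, W/pw, pw, C)
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # group the block axes together: (T/pt, H/ph, W/pw, pt, ph, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)
    # flatten to a token sequence: (num_patches, patch_dim)
    return v.reshape(-1, pt * ph * pw * C)

video = np.random.rand(8, 32, 32, 3)   # 8 frames of 32x32 RGB
tokens = spacetime_patchify(video)
print(tokens.shape)                    # (64, 384): 4*4*4 patches, 2*8*8*3 dims
```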
Capabilities:
- Generate videos up to 60 seconds
- Variable aspect ratios and resolutions
- Strong temporal coherence and object permanence
Limitations:
- Physics and causality limitations (acknowledged by OpenAI)
- Artifact issues (addressed in research like Sugiyama & Kataoka, 2025)
- Watermarking and copyright concerns
- Access constraints (not fully public)
On-Device Sora:
- Research variant optimized for mobile/low compute
- Techniques: Linear Proportional Leap (LPL) to reduce denoising steps
- Temporal Dimension Token Merging (TDTM) for efficiency
- Does not require retraining — works by optimizing inference
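The core intuition behind temporal token merging can be illustrated with a toy sketch: collapse pairs of adjacent frame tokens by averaging so that temporal attention processes half as many positions. This is only an illustration of the idea, not the paper's exact TDTM algorithm:

```python
import numpy as np

def merge_temporal_tokens(tokens):
    """Average each pair of adjacent temporal tokens.

    tokens: (T, N, D) -- T frames, N spatial tokens per frame, D channels.
    Returns (T // 2, N, D), halving the sequence that temporal attention
    layers must process. (Toy sketch of the merging idea; the actual
    TDTM method in On-Device Sora may differ in detail.)
    """
    T, N, D = tokens.shape
    assert T % 2 == 0
    return tokens.reshape(T // 2, 2, N, D).mean(axis=1)

x = np.arange(2 * 1 * 3, dtype=float).reshape(2, 1, 3)  # two frames
print(merge_temporal_tokens(x))  # midpoint of the two frames
```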
Veo 3: Audio and Motion Control
Google/DeepMind’s Veo 3 focuses on high-fidelity video with integrated audio:
Key Innovations:
- Integrated Audio Generation: Includes lip-sync, environmental sound, and dialogue
- Motion Control: Fine-grained control over camera movement, scene dynamics, and object motion
- Cinematic Quality: High-fidelity output with professional filmmaking aesthetics
Architecture:
- Large-scale pre-training (Google-scale compute and data)
- Post-training for alignment, realism, and audio-visual synchronization
- Motion control through conditioning mechanisms
Capabilities:
- High-resolution video generation
- Synchronized audio generation
- Controllable camera and object motion
- Professional-grade cinematic output
Limitations:
- Black box: not fully open research
- Access and cost constraints for users
- Balancing audio, visuals, and motion control is computationally intensive
Open-Sora: Open-Source Alternative
Open-Sora (HPC-AI Tech) provides a fully open-source alternative:
Key Innovations:
- Spatial-Temporal Diffusion Transformer (STDiT): Decouples spatial and temporal attention for efficiency
- 3D Autoencoder: Compact video representation in latent space
- Open Weights + Code: Fully reproducible pipeline
Architecture:
- STDiT backbone with separate spatial/temporal attention
- 3D VAE for latent compression
- Multi-stage training pipeline
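The efficiency win of decoupled attention is easy to see in code: attend over spatial tokens within each frame, then over frames at each spatial location, giving O(T·N² + N·T²) cost instead of O((T·N)²) for full 3D attention. A minimal NumPy sketch (single head, projection weights omitted; illustrative, not Open-Sora's implementation):

```python
import numpy as np

def attention(x):
    """Plain single-head self-attention over the middle axis of (B, L, D).
    Projection weights are omitted for brevity (q = k = v = x)."""
    q = k = v = x
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

def stdit_block(x):
    """Decoupled spatial/temporal attention on x of shape (T, N, D).

    Spatial attention mixes the N tokens inside each frame; temporal
    attention mixes the T frames at each spatial location.
    """
    T, N, D = x.shape
    x = attention(x)                        # spatial: T frames as the batch
    x = attention(x.transpose(1, 0, 2))     # temporal: N locations as the batch
    return x.transpose(1, 0, 2)

out = stdit_block(np.random.rand(4, 16, 8))
print(out.shape)                            # (4, 16, 8)
```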
Capabilities:
- 720p video generation
- ~15 second clip generation
- Open-source and reproducible
- Detailed technical reports
Limitations:
- Lower resolution/quality vs. Sora/Veo
- Generation cost for long/high-res video
- Less advanced audio (depending on version)
- Fewer resources for post-training compared to big organizations
Model Comparison
| Model | Key Strength | Pre-Training Scale | Post-Training | Open Source | Best For |
|---|---|---|---|---|---|
| Sora | Unified representation, long videos | Very large | Extensive | No | Research, high-quality generation |
| Veo 3 | Audio sync, motion control | Very large | Extensive | No | Cinematic content, audio-visual |
| Open-Sora | Reproducibility, open access | Large (open) | Limited | Yes | Research, education, development |
| Mochi 1 | AsymmDiT architecture | Large | Limited | Yes (Apache 2.0) | Open-source video generation |
| HunyuanVideo | Large-scale open model | ~1B frames | Limited | Yes | Open-source baseline |
Motion Modeling: Geometry, Optical Flow, and Diffusion Fields
Recent research (2024–2025) shows a shift:
Models now explicitly learn motion fields, not just pixels.
Good video models either learn motion explicitly (flow/fields) or implicitly (attention across time). When motion is modeled explicitly, the model gets a physics anchor to hang frames on.
FlowVid 2.0 (CVPR 2024)
Uses motion priors learned from optical flow to stabilize animations.
The model learns to predict optical flow:

$$\mathcal{L}_{\text{flow}} = \left\lVert \hat{F} - F \right\rVert^2$$

where $\hat{F}$ is the predicted flow field and $F$ is the ground-truth optical flow.
This ensures that:
- Objects move smoothly
- Motion is physically plausible
- Temporal consistency is maintained
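A flow-supervision signal of this shape reduces to a simple per-pixel distance between predicted and reference motion vectors. A minimal sketch of such a loss (mean endpoint error; FlowVid's exact formulation may differ):

```python
import numpy as np

def flow_loss(flow_pred, flow_gt):
    """Mean endpoint error between predicted and ground-truth flow.

    flow_pred, flow_gt: (H, W, 2) arrays of per-pixel (dx, dy) motion.
    Averaging the per-pixel vector distance penalizes motion that
    deviates from the reference flow field.
    """
    return np.linalg.norm(flow_pred - flow_gt, axis=-1).mean()

gt = np.zeros((4, 4, 2))
pred = np.ones((4, 4, 2))      # every pixel off by (1, 1)
print(flow_loss(pred, gt))     # sqrt(2) ~ 1.4142
```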
DynamiCrafter 2 (NeurIPS 2024)
Learns scene geometry as a latent NeRF-like volume:

$$c(\mathbf{r}) = \mathrm{Render}(V, \mathbf{r})$$

where $V$ is a 3D volume representation and $\mathbf{r}$ is a ray through the scene.
This gives the model:
- 3D understanding: Objects have depth and structure
- View consistency: The same object looks correct from different angles
- Motion in 3D space: Objects move through 3D, not just 2D pixels
Diffusion Video Fields (CVPR 2025)
Represents video as a 4D continuous function:

$$f(x, y, t, \sigma) \rightarrow \text{RGB}$$

Where:
- $(x, y)$ are spatial coordinates
- $t$ is time
- $\sigma$ is the noise level (diffusion timestep)
This gives better:
- Identity preservation: Objects maintain their appearance across frames
- Motion stability: Smooth, continuous motion
- Controllability: Easy to manipulate camera movement, object motion
The key insight: representing video as a continuous function allows the model to interpolate smoothly between frames, rather than generating discrete frames independently.
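The continuous-function view can be made concrete with a toy network: a tiny MLP maps any real-valued $(x, y, t, \sigma)$ to a color, so frames at in-between times fall out by evaluation rather than by generating a fixed discrete grid. All names and sizes here are illustrative, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 64))   # input: (x, y, t, sigma)
W2 = rng.normal(size=(64, 3))   # output: RGB

def video_field(x, y, t, sigma):
    """Continuous 4D field f(x, y, t, sigma) -> RGB, as a tiny random MLP.

    Because the field is defined at any real-valued (x, y, t), a frame
    at an arbitrary time is obtained by querying, not by interpolating
    between discrete generated frames. (Toy sketch; real models
    condition a large network on these coordinates.)
    """
    h = np.tanh(np.array([x, y, t, sigma]) @ W1)
    return h @ W2

frame_a = video_field(0.5, 0.5, t=0.10, sigma=0.0)
frame_b = video_field(0.5, 0.5, t=0.15, sigma=0.0)  # in-between time: just query it
print(frame_a.shape, frame_b.shape)                 # (3,) (3,)
```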
Putting It All Together
The Complete Pipeline
A modern video generation model (2025) works like this:
1. Pre-training (billions of frames):
- Learn the structure of video data
- Learn temporal dependencies
- Learn to denoise video latents
2. Architecture (DiT/V-DiT/AsymmDiT):
- Spatial attention for image content
- Temporal attention for motion
- Cross-attention for text conditioning
3. Motion Learning (explicit motion fields):
- Optical flow for smooth motion
- 3D geometry for view consistency
- Continuous fields for interpolation
4. Post-training (human preferences):
- DPO for preference alignment
- Cinematic reward models for aesthetics
- Instruction following for controllability
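The preference-alignment step can be made concrete with the DPO objective: given a human-preferred and a rejected sample, the loss rewards the model for increasing its likelihood margin on the preferred one relative to a frozen reference model. A minimal single-pair sketch (numbers are illustrative):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_w / logp_l: model log-likelihood of the preferred (w) and
    rejected (l) sample; ref_logp_*: the same under a frozen reference
    model. Minimizing this pushes probability mass toward the
    human-preferred sample without an explicit reward model.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Preferred clip already more likely than under the reference -> small loss
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```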
The Physics Connection
The remarkable thing: none of this explicitly programs physics.
The model learns:
- Objects have momentum (from watching things move)
- Light casts shadows (from watching lighting)
- Water flows downhill (from watching water)
- Camera movement is smooth (from watching camera work)
Not through equations, but through statistical patterns in billions of frames.
The model becomes an amortized physical simulator — it doesn’t solve physics equations, but it has learned to generate videos that satisfy physical laws because those laws are encoded in the training data.
Why This Matters
This approach to video generation has implications beyond entertainment:
- Robotics: Models that understand motion can plan robot trajectories
- Scientific simulation: Generate plausible simulations of physical processes
- Education: Visualize complex phenomena (fluid dynamics, particle physics)
- Creative tools: Enable new forms of artistic expression
The future: models that don’t just generate video, but understand the physics underlying motion.
Conclusion
Generative video is one of the most exciting frontiers in AI.
It requires:
- Massive scale: Billions of frames, trillions of parameters
- Novel architectures: DiT, temporal attention, motion fields
- Sophisticated training: Pre-training, post-training, alignment
- Implicit physics: Learning physical laws from data
The result: models that can generate videos that are:
- Visually stunning: High resolution, cinematic quality
- Temporally coherent: Smooth motion, consistent objects
- Physically plausible: Motion that makes sense
- Controllable: Follow text prompts, user instructions
We’re teaching machines the physics of time — not through equations, but through the statistical structure of motion itself.
References
Modern Models
- Sora (OpenAI): Video generation models as world simulators. OpenAI Research
- On-device Sora: Training-free diffusion-based text-to-video for mobile devices. arXiv
- Veo 3: Google/DeepMind’s high-fidelity video generation with audio. Veo 3
- Open-Sora: Democratizing efficient video production for all. arXiv GitHub
- Simple Visual Artifact Detection in Sora-generated Videos: Research on detecting artifacts in Sora outputs. arXiv
- Mora: Enabling generalist video generation via multi-agent framework. arXiv
Motion Learning & Optical Flow
- FlowVid 2.0 (CVPR 2024): Explicit optical flow priors for temporal coherence. arXiv
- DynamiCrafter 2 (NeurIPS 2024): Learning temporal scene geometry and non-rigid motion. arXiv
Further Reading
- Part 6: Diffusion for Action
- Diffusion Models Series Part 1: From Molecules to Machines
- Vision-Language Models: Vision-Language Models Explained
- Physics-Aware Video: Physics-Aware Video Diffusion Models
This is Part 7 of the Diffusion Models Series, concluding our exploration of image and video diffusion models. The series covers foundations, architectures, training pipelines, robotics applications, and state-of-the-art models.