By Gopi Krishna Tummala
From Pixels to Actions: The Bridge
So far, we’ve seen diffusion models generate images and videos — visual content. But what if we want to generate actions — sequences of motor commands for a robot or autonomous vehicle?
This is where diffusion models connect to robotics, autonomous systems, and embodied AI.
The Key Insight:
Instead of generating pixels, we generate action trajectories:

$$\tau = (a_1, a_2, \dots, a_T)$$

Where each $a_t$ is an action (e.g., joint angles, velocities, steering commands).
The same diffusion process that learns to reverse noise into images can learn to reverse noise into plausible action sequences.
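As a toy illustration of this insight, the forward noising step used for images applies unchanged to an action trajectory; only the array shape differs. The trajectory and schedule value below are made up for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
T, action_dim = 16, 2
# A smooth 2-D action trajectory (e.g., two joint velocities over time).
tau = np.stack([np.linspace(0, 1, T), np.sin(np.linspace(0, np.pi, T))], axis=1)

def add_noise(tau, alpha_bar):
    """q(tau_k | tau_0): scale the clean trajectory and add Gaussian noise."""
    eps = rng.standard_normal(tau.shape)
    return np.sqrt(alpha_bar) * tau + np.sqrt(1 - alpha_bar) * eps

noisy = add_noise(tau, alpha_bar=0.5)
print(noisy.shape)  # (16, 2)
```

The same formula that corrupts an image tensor corrupts a `(horizon, action_dim)` array; nothing about the diffusion machinery is image-specific.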
Diffusion Policy: Predicting Action Sequences
Diffusion Policy (Chi et al., 2023) applies diffusion models to robot control:
The Problem
Traditional robot policies output a single action $a_t$ given the current state $s_t$. But many tasks require multi-step reasoning: the robot must plan a sequence of actions.
Example: A robot arm picking up a cup needs to:
- Move toward the cup
- Open gripper
- Position around cup
- Close gripper
- Lift
A single-action policy struggles here: completing the task requires reasoning over the full sequence of actions, not one step at a time.
The Solution: Diffusion over Action Sequences
Diffusion Policy generates action sequences (trajectories) instead of single actions:

$$A_t = (a_t, a_{t+1}, \dots, a_{t+H-1})$$

Where $H$ is the prediction horizon (e.g., 16 steps into the future).
The Diffusion Process:
- Forward: Add noise to a demonstration trajectory until it becomes random actions
- Reverse: Learn to denoise random actions back into a valid trajectory
- Conditioning: Condition on the current observation (camera image, sensor data)
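The three steps above can be sketched as a single toy training update. The linear map standing in for the noise-prediction network, and all names and shapes, are illustrative assumptions, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(1)
H, action_dim, obs_dim = 16, 2, 4

demo_traj = rng.standard_normal((H, action_dim))  # expert action sequence
obs = rng.standard_normal(obs_dim)                # current observation
alpha_bars = np.linspace(0.99, 0.01, 50)          # cumulative noise schedule

# Forward: pick a random diffusion step and noise the demonstration.
k = int(rng.integers(len(alpha_bars)))
eps = rng.standard_normal(demo_traj.shape)
noisy = np.sqrt(alpha_bars[k]) * demo_traj + np.sqrt(1 - alpha_bars[k]) * eps

# Reverse (training signal): predict the noise from (noisy, obs).
# A linear map stands in for the conditional noise-prediction network.
W = 0.1 * rng.standard_normal((action_dim, action_dim))
pred_eps = noisy @ W + obs[:action_dim]

# The training loss is mean squared error between true and predicted noise.
loss = float(np.mean((pred_eps - eps) ** 2))
print(loss > 0.0)  # a real run minimizes this over many demonstrations
```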
Mathematical Formulation:
The model learns:

$$p_\theta(A_t \mid o_t)$$

Where:
- $A_t$ is the action trajectory
- $o_t$ is the current observation (image, state)
Key Advantage: Diffusion naturally handles multi-modal action distributions. If there are multiple valid ways to complete a task, diffusion can generate diverse trajectories, unlike deterministic policies.
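For intuition, here is a minimal sketch of the sampling side: start the whole action horizon as Gaussian noise and iteratively denoise it, conditioned on the observation. The noise predictor is a hypothetical stand-in for a trained network:

```python
import numpy as np

rng = np.random.default_rng(2)
H, action_dim = 16, 2
betas = np.linspace(1e-4, 0.02, 50)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_theta(tau_k, k, obs):
    """Placeholder noise predictor; a real one is a trained conditional net."""
    return 0.1 * tau_k

obs = np.zeros(4)                           # conditioning observation
tau = rng.standard_normal((H, action_dim))  # start from pure noise
for k in reversed(range(len(betas))):
    eps_hat = eps_theta(tau, k, obs)
    # DDPM posterior mean; the extra noise term is skipped at the last step.
    tau = (tau - betas[k] / np.sqrt(1 - alpha_bars[k]) * eps_hat) / np.sqrt(alphas[k])
    if k > 0:
        tau += np.sqrt(betas[k]) * rng.standard_normal(tau.shape)

print(tau.shape)  # (16, 2): a full action horizon, sampled jointly
```

Because each run starts from different noise, repeated sampling yields different but plausible trajectories, which is exactly the multi-modality discussed above.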
Why Diffusion Works for Actions
1. Multi-Modality:
- There are often multiple valid action sequences for a task
- Diffusion models excel at capturing multi-modal distributions
- Unlike deterministic policies, they can explore diverse solutions
2. Smoothness:
- Actions should change smoothly over time (no sudden jerky movements)
- Diffusion’s iterative denoising naturally produces smooth trajectories
- Because the whole horizon is denoised jointly, consecutive actions stay consistent rather than jerky
3. Constraint Satisfaction:
- Robot actions must satisfy physical constraints (joint limits, collision avoidance)
- Diffusion can be guided to satisfy constraints through conditioning
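A minimal sketch of constraint guidance, assuming a differentiable cost: at each denoising step, nudge the trajectory down the gradient of a soft joint-limit penalty. The multiplicative "denoiser" and the guidance scale here are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)
H, action_dim = 8, 2
limit = 1.0          # pretend joint limit: actions must lie in [-1, 1]
guide_scale = 0.5

def limit_cost_grad(tau):
    """Gradient of sum(max(|a| - limit, 0)^2): zero inside the limits."""
    excess = np.clip(np.abs(tau) - limit, 0.0, None)
    return 2.0 * excess * np.sign(tau)

tau = 3.0 * rng.standard_normal((H, action_dim))  # starts far out of bounds
for _ in range(50):
    tau = 0.95 * tau                              # stand-in denoising pull
    tau -= guide_scale * limit_cost_grad(tau)     # guidance toward limits

print(bool(np.abs(tau).max() <= limit))  # True: all actions within limits
```

The same pattern applies to collision costs or smoothness penalties; anything with a usable gradient can steer the sampler.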
World Models: Generating Future States
World Models use diffusion to predict future states of the environment, not just actions.
The Concept
Instead of generating pixels or actions, generate future observations:

$$p_\theta(o_{t+1}, \dots, o_{t+H} \mid o_t, a_{t:t+H-1})$$

Where:
- $o_t$ is the current observation (camera image, LiDAR, etc.)
- $a_{t:t+H-1}$ are the actions taken (or planned)
- $o_{t+1}, \dots, o_{t+H}$ are the predicted future observations
Application: Training Planning Agents
Use Case: Train a planning agent for autonomous driving:
- Collect data: Record driving videos with actions (steering, acceleration)
- Train world model: Use a small video diffusion model to predict the next 5 seconds of driving
- Train planner: Use the world model to simulate future scenarios, train a planner that avoids collisions
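The three steps can be sketched end to end. Here a hand-written one-dimensional dynamics function stands in for the trained world model, and the planner simply imagines each candidate action sequence and picks the safest:

```python
import numpy as np

rng = np.random.default_rng(4)

def world_model(state, actions):
    """Stand-in for a learned model: 1-D position after velocity commands."""
    return state + np.cumsum(actions)

obstacle = 2.0                                 # predicted obstacle position
candidates = [rng.uniform(-0.5, 0.5, size=10) for _ in range(32)]

def cost(traj):
    # Penalize imagined closeness to the obstacle, reward forward progress.
    return np.sum(np.exp(-np.abs(traj - obstacle))) - traj[-1]

best = min(candidates, key=lambda a: cost(world_model(0.0, a)))
print(len(best))  # 10: a full imagined action sequence, never executed live
```

All rollouts happen inside the model; the real environment is never touched during planning, which is the whole point of model-based training.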
Why This Works:
- The world model learns the dynamics of the environment
- The planner can “imagine” consequences of actions without real-world trial-and-error
- This is model-based reinforcement learning with learned dynamics
Mathematical Formulation
The world model learns:

$$p_\theta(o_{t+1}, \dots, o_{t+H} \mid o_t, a_{t:t+H-1})$$
This is essentially a conditional video diffusion model where:
- The condition is the current observation and actions
- The output is future observations
Key Insight: Video diffusion models implicitly learn physics. By training on real driving data, the model learns how scenes evolve — cars move, pedestrians cross, lights change. This learned physics can be used for planning.
Diffusion for Planning and Prediction
Trajectory Prediction
In autonomous vehicles, predicting other agents’ trajectories is critical:
Problem: Given the current scene, predict where other vehicles/pedestrians will be in the next 5 seconds.
Solution: Use diffusion to generate multiple plausible trajectories:

$$p_\theta(\tau_i \mid \text{scene})$$

Where $\tau_i$ is the future trajectory of another agent $i$.
Why Diffusion:
- Multiple plausible futures (agent could turn left, right, or go straight)
- Diffusion captures this multi-modality naturally
- Can generate diverse, realistic trajectories
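A toy sketch of that multi-modality: each sample starts from different noise and commits to a different maneuver (left, straight, or right). The mode logic below stands in for what a trained diffusion sampler would learn from data:

```python
import numpy as np

rng = np.random.default_rng(5)
headings = np.array([-np.pi / 4, 0.0, np.pi / 4])  # left / straight / right

def sample_future(rng, steps=10):
    z = rng.standard_normal()                      # the "diffusion" noise seed
    heading = headings[int(np.digitize(z, [-0.5, 0.5]))]
    step = np.array([np.cos(heading), np.sin(heading)])
    return np.cumsum(np.tile(step, (steps, 1)), axis=0)

futures = [sample_future(rng) for _ in range(20)]
endpoints = {tuple(np.round(f[-1], 2)) for f in futures}
print(len(endpoints) >= 2)  # different noise -> different plausible futures
```

A deterministic regressor would average these maneuvers into one implausible middle path; sampling keeps the modes separate.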
Motion Planning
For the ego vehicle, diffusion can generate candidate trajectories:
- Generate diverse trajectories using diffusion
- Score each trajectory (safety, comfort, goal progress)
- Select the best trajectory
- Execute the first action, replan
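The four steps above as a sketch, with smoothed random sequences standing in for diffusion samples and made-up weights for the safety, comfort, and progress terms:

```python
import numpy as np

rng = np.random.default_rng(6)

def sample_trajectories(n=64, horizon=12):
    """Stand-in for diffusion sampling: smoothed random steering sequences."""
    raw = rng.standard_normal((n, horizon))
    kernel = np.ones(3) / 3.0
    return np.stack([np.convolve(r, kernel, mode="same") for r in raw])

def score(traj, obstacle_lane=1.5):
    lane = 0.1 * np.cumsum(traj)                          # lateral position
    safety = -np.sum(np.abs(lane - obstacle_lane) < 0.5)  # near-obstacle count
    comfort = -np.sum(np.diff(traj) ** 2)                 # penalize jerk
    progress = -np.sum(np.abs(lane))                      # stay near center
    return safety + 0.1 * comfort + 0.05 * progress

candidates = sample_trajectories()
best = candidates[np.argmax([score(t) for t in candidates])]
first_action = best[0]  # execute only this, then replan next tick
print(best.shape)  # (12,)
```

Executing only the first action before replanning (receding-horizon control) keeps the system responsive to new observations.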
Advantage over traditional planners:
- Naturally handles multi-modal scenarios
- Learns from data (doesn’t require hand-coded rules)
- Can adapt to complex, real-world situations
Robotics Applications
Manipulation Tasks
Diffusion Policy has shown strong performance on:
- Pick and place: Grasping objects and moving them
- Assembly: Putting parts together
- Kitchen tasks: Opening drawers, using tools
Why it works:
- Manipulation requires multi-step sequences
- Diffusion naturally handles the sequential nature
- Can learn from diverse demonstration data
Mobile Robotics
Navigation and path planning:
- Generate diverse paths to a goal
- Avoid obstacles while maintaining smooth motion
- Handle uncertainty in the environment
Human-Robot Interaction
Predicting human intent:
- Use diffusion to predict where a human will move
- Plan robot actions that avoid collisions
- Generate natural, human-like robot motions
Autonomous Vehicles and L4/L5 Systems
Behavior Prediction
Critical for L4/L5 autonomy: Predicting other agents’ behavior.
Diffusion-based prediction:
- Generate multiple plausible futures for each agent
- Each trajectory has an associated probability
- Planner uses these predictions to make safe decisions
Mathematical Formulation:
For each agent $i$:

$$p_\theta(\tau_i \mid \text{context}_i)$$

Where $\text{context}_i$ includes:
- Agent type (car, pedestrian, cyclist)
- Road structure
- Traffic rules
- Historical behavior
Scene Prediction
Predict future scenes (not just agent trajectories):
- Use video diffusion to predict the next 5-10 seconds of the scene
- Includes all agents, road structure, lighting changes
- Planner can “simulate” consequences of actions
Connection to World Models:
- The scene prediction model is a world model for driving
- It learns the dynamics of traffic scenes
- Can be used for planning and safety validation
Closed-Loop Planning
The Complete Pipeline:
- Perception: Process sensor data (cameras, LiDAR) → current state
- Prediction: Use diffusion to predict other agents’ trajectories
- Planning: Generate candidate ego trajectories using diffusion
- Selection: Score and select best trajectory
- Control: Execute first action, repeat
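The pipeline can be wired up as a minimal closed loop with stub components; every function below is a stand-in for a learned module, with made-up shapes:

```python
import numpy as np

rng = np.random.default_rng(7)

def perceive(sensor_frame):
    """Stub perception: sensor frame -> ego state estimate (1-D position)."""
    return float(sensor_frame.mean())

def predict_agents(state, n_futures=8, horizon=10):
    """Stub diffusion predictor: several plausible futures per agent."""
    return state + np.cumsum(rng.standard_normal((n_futures, horizon)), axis=1)

def plan(state, agent_futures, n_candidates=16, horizon=10):
    """Stub diffusion planner: pick the candidate farthest from all futures."""
    cands = state + np.cumsum(rng.standard_normal((n_candidates, horizon)), axis=1)
    clearance = [np.min(np.abs(c[None] - agent_futures)) for c in cands]
    return cands[int(np.argmax(clearance))]

state = 0.0
for _ in range(5):                       # receding-horizon control loop
    frame = rng.standard_normal((4, 4))  # fake sensor data
    state_est = perceive(frame)
    futures = predict_agents(state_est)
    trajectory = plan(state_est, futures)
    state = float(trajectory[0])         # execute first action, replan

print(np.isfinite(state))  # True
```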
Why Diffusion Fits:
- Handles multi-modal scenarios (multiple valid plans)
- Learns from data (doesn’t require hand-coded rules)
- Naturally produces smooth, realistic trajectories
The Connection to Reinforcement Learning
Diffusion models and RL are complementary:
Traditional RL:
- Learns policies through trial-and-error
- Requires many interactions with the environment
- Can be sample-inefficient
Diffusion + RL:
- Diffusion Policy: Learns from demonstrations (imitation learning)
- World Models: Learns environment dynamics, enables model-based RL
- Planning: Uses diffusion to generate candidate actions
Hybrid Approach:
- Pre-train diffusion policy on demonstrations
- Fine-tune with RL for specific tasks
- Use world models for efficient exploration
This combines the data efficiency of imitation learning with the adaptability of RL.
Summary: Diffusion Beyond Visual Generation
Diffusion models aren’t just for images and video — they’re a powerful framework for:
- Action Generation: Diffusion Policy for robot control
- State Prediction: World models for planning
- Trajectory Prediction: Multi-modal future prediction
- Planning: Generating diverse, realistic action sequences
The Common Thread:
All these applications use the same core idea: learn to reverse a noise process to generate structured sequences — whether those sequences are pixels, actions, or states.
Interview Relevance:
- Robotics: How do you plan multi-step actions? → Diffusion Policy
- Autonomous Vehicles: How do you predict other agents? → Diffusion-based trajectory prediction
- System Design: How do you handle multi-modal scenarios? → Diffusion captures diversity naturally
References
Diffusion Policy
- Chi, C., et al. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. RSS 2023. arXiv:2303.04137
World Models
- Ha, D., & Schmidhuber, J. (2018). World Models. NeurIPS 2018. arXiv:1803.10122
- Recent work applying video diffusion to world models for robotics
Trajectory Prediction
- Recent research on diffusion-based trajectory prediction for autonomous vehicles (2024-2025)
Further Reading
- Part 5: Pre-Training & Post-Training
- Part 7: Modern Models & Motion
- Behavior Prediction: Behavior Prediction for Closed-Loop Driving
This is Part 6 of the Diffusion Models Series. Part 5 covered pre-training and post-training. Part 7 will explore modern models like Sora, Veo, and Open-Sora.