By Gopi Krishna Tummala
The Story: The 2-Second Horizon
You are driving through downtown San Francisco. A pedestrian steps off the curb but hesitates. A cyclist swerves to avoid a pothole. A car in the opposite lane inches forward, signaling an unprotected left turn.
A human driver processes this scene, predicts the future states of these agents, and decides to slow down—all in under 200 milliseconds.
In the autonomous vehicle (AV) stack, this is the Prediction Module. It is the bridge between Perception (what do I see?) and Planning (what should I do?). It is also the hardest problem in AV because it requires modeling the most unpredictable variable in physics: Human intent.
This post tells the story of how we went from simple physics equations to Generative AI, tracking the evolution through the lens of academic research and the prestigious Waymo Open Dataset Challenges.
Act I: The Age of Innocence (Physics & Kalman Filters)
In the early days (DARPA Grand Challenges, late 2000s), prediction was handled by classical mechanics. The assumption was simple: Objects in motion stay in motion.
If we knew a car’s position () and velocity (), we could predict its future position using basic kinematics:
The Tool: The Kalman Filter
Because sensors are noisy, we couldn’t trust a single measurement. We used Kalman Filters to estimate the “true” state.
-
Predict: Use physics to guess where the car will be.
-
Update: Measure where the car actually is.
-
Correct: Merge the two based on uncertainty ( and covariance matrices).
Why it failed:
Physics works for rocks, not people. A pedestrian standing at a curb has zero velocity (). Physics predicts they will stay there forever. But a human driver knows they might step out. Physics cannot model intent.
Act II: The Engineer’s Playground (Feature Engineering & XGBoost)
Before the deep learning revolution took over, prediction was the domain of the Feature Engineer. This was the golden era of ADAS (Advanced Driver Assistance Systems) and early highway autopilots.
The prevailing philosophy was that highway driving is a Finite State Machine. A car isn’t just “moving”; it is executing one of a discrete set of maneuvers:
-
Lane Keeping (LK)
-
Lane Change Left (LCL)
-
Lane Change Right (LCR)
The Approach: Hand-Crafted Features + Classifiers
In this era, we didn’t feed raw pixels into a black box. Instead, engineers spent months hand-crafting specific mathematical features that correlated with human intent. The logic was transparent, debuggable, and relied on tabular data.
The “God Features” of 2015:
-
TTC (Time-to-Collision): How many seconds until I hit the car in front?
-
TLC (Time-to-Lane-Crossing): If the current steering angle is held, in how many seconds will the tire cross the paint?
-
Lateral Jerk: Is the driver twitchy?
-
Yaw Rate relative to Lane: Is the car angling toward the gap?
The Model: The Rise of Tree-Based Methods
Once these features were extracted, the problem became a classic classification task. The research landscape was diverse, exploring various architectures: Support Vector Machines (SVMs) were used for their strong margin maximization, Dynamic Bayesian Networks (DBNs) for their ability to model temporal dependencies, and Hidden Markov Models (HMMs) for state transitions.
However, Tree-Based Methods—specifically Gradient Boosted Decision Trees (like XGBoost)—became widely adopted in practical applications.
Why did trees win on the highway?
-
Tabular Dominance: Unlike images, our inputs were “tabular” (spreadsheets of features). Tree ensembles are historically state-of-the-art for this data type.
-
Non-Linear Interactions: A tree can easily learn complex rules like: “IF
Distance_Front< 20m ANDVelocity> 60mph, THENLane_Change_Leftprobability is high.” -
Robustness: They handled heterogeneous data (mixing “seconds” for TTC with “meters” for distance) without needing the heavy normalization required by neural networks.
Where is the vector of hand-tuned features and represents the sum of the prediction scores from decision trees.
The Workflow:
-
Feature Extraction: Calculate (distance to line) and (lateral velocity).
-
Classification: The XGBoost model outputs: “95% probability of Lane Change Left.”
-
Regression: Snap a pre-calculated polynomial (e.g., a Quintic Spline) to the center of the left lane.
The Seminal Research
This approach wasn’t just a hack; it was grounded in rigorous research:
-
Houenou et al. (IV 2013): “Vehicle trajectory prediction based on motion model and maneuver recognition.” This paper formalized the idea of selecting a specific motion model (Constant Acceleration for turning, Constant Velocity for straight) based on a classifier’s output.
-
Lefèvre et al. (IV 2014): “A survey on motion prediction and risk assessment for intelligent vehicles.” Popularized the use of Dynamic Bayesian Networks (DBNs) to infer hidden states (intent) from observed variables (steering, speed).
-
Schlechtriemen et al. (ITSC 2014): Used Gaussian Mixture Models (GMMs) combined with feature-based classifiers to predict lane changes on the Autobahn.
The Verdict
-
The Good: It was interpretable. If the car predicted a lane change falsely, you could look at the decision tree and see: “Ah,
Lateral_Velocity > 0.5triggered the branch.” -
The Bad: It was brittle. It required infinite “If-This-Then-That” rules. What happens in a construction zone where lanes don’t exist? Or in a parking lot where “Lane Keeping” is a meaningless concept?
As we moved from structured highways to chaotic urban centers, the “Hand-Tuned Feature” approach hit a wall. You cannot write a feature for “yielding to a pedestrian who is looking at their phone.”
Act III: The Deep Learning Explosion (Rasterization)
As AVs moved from structured highways to chaotic cities (around 2016–2019), the “3-choice” logic broke down. You cannot classify a parking lot maneuver as just “Lane Change Left.” The industry realized we needed to predict free-form, non-linear motion.
To solve this, engineers borrowed the most powerful hammer available at the time: Computer Vision.
The Architecture: Spatial Backbones & Temporal Heads
The standard architecture during this era followed a two-stage approach: understanding the Space (the map and agents) and then modeling the Time (the movement).
1. The Spatial Backbone (The Eyes)
First, the scene is rasterized into a Bird’s Eye View (BEV) image. This massive 3D tensor is fed into a Convolutional Neural Network (CNN)—typically a ResNet or U-Net.
-
Role: Feature Extraction. The CNN doesn’t predict the future; it compresses the high-dimensional image into a dense “feature vector” or “feature map” that encodes the road geometry and agent positions.
-
Why U-Net? While ResNet was powerful, it downsampled the image, losing fine details. U-Net became the gold standard because its skip connections preserved the spatial resolution, allowing the model to understand exactly where a curb ended.
2. The Temporal Head (The Brain)
Once we have the features, how do we predict the path? This evolved in two distinct phases.
-
Phase 1: The Simple Regressor (CNN + MLP)
Early models took the flattened feature vector from the CNN and fed it into a simple Multi-Layer Perceptron (MLP).
-
Output: A fixed set of numbers representing coordinates .
-
Limitation: MLPs are static. They struggled to capture the temporal continuity of motion. They often predicted “jumpy” trajectories that violated physics.
-
-
Phase 2: The Sequence Modeler (CNN + RNN)
The industry quickly realized that driving is a time-series problem. The MLP was replaced by Recurrent Neural Networks (RNNs), specifically LSTMs or GRUs.
-
Mechanism: The RNN takes the CNN features as the initial state and “unrolls” the future trajectory step-by-step.
-
Benefit: The RNN has “memory.” It ensures that the position at is kinematically consistent with the position at .
-
Landmark Papers of this Era
This architecture—CNN backbone + RNN head—defined the research of Waymo and Uber ATG between 2018 and 2019.
-
Fast and Furious (Luo et al., CVPR 2018): Uber ATG.
A pioneering paper that performed Detection, Tracking, and Forecasting in a single network. It used a 3D CNN backbone to process voxelized LiDAR data, proving that end-to-end learning was faster and more robust than independent modules.
-
ChauffeurNet (Bansal et al., RSS 2019): Waymo.
The definitive implementation of the CNN+RNN pattern. Waymo fed a sequence of rasterized history images into a CNN backbone, and an RNN head generated the ego-vehicle’s future controls. They introduced “perturbation training”—intentionally shaking the simulated car to teach the RNN how to recover from mistakes.
-
IntentNet (Casas et al., CoRL 2018): Uber ATG.
They extended the architecture to predict high-level intent (Left Turn vs. Keep Lane) jointly with the trajectory. By adding a classification head alongside the regression head, the model learned that “turning left” implies a specific curve shape.
The Trade-offs: Why we moved on
Despite these successes, the raster era hit a ceiling:
-
The “Sparse” Problem: A road image is 90% empty asphalt. A CNN wastes millions of computations processing black pixels (zeros) just to find the one car in the corner.
-
Ego-Centric vs. Agent-Centric: This was the fatal flaw.
-
Ego-Centric: The image is centered on your car. A vehicle 100m away is just a tiny blob of pixels at the edge. The CNN features for distant cars are weak.
-
Agent-Centric: To fix this, you have to re-center the image and re-run the CNN for every single car on the road. This explodes computational cost (running ResNet 50 times per frame).
-
This inefficiency led directly to the Vector Revolution (Act IV), where we stopped drawing pixels and started processing graphs.
Act IV: The Vector Revolution (Waymo Open Dataset 2020)
In 2019/2020, Waymo released the Waymo Open Dataset (WOD), highlighting a massive shift from pixels to Vectors.
The Breakthrough: VectorNet
Instead of drawing a lane as pixels, we represent it as a mathematical vector (a line segment with direction).
-
Lanes: Polylines (connected vectors).
-
Agents: Trajectory history vectors.
This ushered in the era of Graph Neural Networks (GNNs). The model could now “reason” about the road topology: “The car is in a left-turn only lane, therefore it must turn left.”
Act V: The Multi-Modal Reality (Multipath & Uncertainty)
One of the biggest lessons from the Waymo challenges was that the future is not deterministic.
If a car approaches a yellow light, two futures are possible:
-
It accelerates to beat the light.
-
It brakes to stop.
An average of these two (driving at half speed into the intersection) is the worst possible prediction.
The Solution: Gaussian Mixture Models (GMMs)
Modern models (like Waymo’s Multipath++) don’t output one line. They output a distribution of possible futures.
Where:
-
: The probability of a specific behavior (e.g., 30% chance of turning left).
-
: The trajectory path.
-
: The uncertainty (how wide is the lane?).
This allows the planner to say: “There is a 10% chance this car cuts me off. Is it safe to proceed?”
Act VI: The “Social” Era (Interaction Transformers)
Until 2021, most models predicted agents independently (Open-Loop). We predicted Car A, then predicted Car B.
The Problem: Traffic is a negotiation. If I merge into a lane, the car behind me slows down. If I wait, they speed up.
The Scene Transformer (Waymo 2021/2022)
Inspired by Large Language Models (LLMs) like GPT, researchers began using Transformers for traffic.
-
Attention Mechanisms: Just as “bank” means something different in “river bank” vs “bank account,” a car’s behavior depends on its neighbors.
-
Joint Prediction: The model predicts the joint state of the scene. “Car A yields IF Car B goes.”
The metric for success shifted in competitions from pure displacement error (ADE/FDE) to Interaction metrics—did we correctly predict who would yield to whom?
Act VII: The Final Frontier (Closed-Loop & Generative AI)
This brings us to today, and the Waymo Sim Agents Challenge (2023-2024).
We realized that minimizing error on a static dataset isn’t enough. A model can have low error but still be useless if it can’t react to the AV’s decisions. We needed to move from Open-Loop (predicting pre-recorded data) to Closed-Loop (simulation).
The “Sim Agents” Concept
In this new paradigm, we don’t just predict a path; we create a digital twin of a human driver (a Sim Agent) inside a simulation.
-
The Test: If the autonomous vehicle honks or swerves, does the predicted Sim Agent react realistically?
-
The Tech: Generative AI & Diffusion Models.
Similar to how Stable Diffusion generates images from noise, Motion Diffusion Models generate realistic driving trajectories from random noise, conditioned on the map context. This captures the “multimodal” nature of driving perfectly—every time you run the model, you get a slightly different, plausible human behavior.
Summary: The Stack Evolution
| Era | Technology | Philosophy | Key Challenge |
|---|---|---|---|
| 2010s | Kalman Filters | Physics-based | Modeling non-linear turns |
| 2013-2016 | XGBoost / DBNs | Feature Engineering (Highway) | Hand-crafted features for maneuver classification |
| 2018 | CNNs / Raster | Pattern Matching | Computational Efficiency |
| 2020 | VectorNet / GNNs | Structured Learning | Understanding Maps/Lanes |
| 2022 | Transformers | Interaction | ”Who yields to whom?“ |
| 2024+ | Diffusion / Sim Agents | Generative / Closed-Loop | Realistic Reactivity |
The Future: From Trajectory to Intent
As we look at the winning entries of the latest CVPR and NeurIPS workshops, the trend is clear: Prediction is merging with Planning.
We are no longer just asking “Where will they be?” ( coordinates). We are asking “What do they want?” (Intent). By using V2X (Vehicle-to-Everything) communication and Generative AI, we are moving toward a world where cars don’t just guess the future—they negotiate it.
Graduate Assignment: The Highway Classifier
Task:
Design a logic gate for a highway predictor.
-
Inputs: (distance to lane line), (lateral velocity), Blinkers (On/Off).
-
Scenario: A car is 0.5m from the left lane line, moving left at 0.5 m/s, with left blinker ON.
-
Bayesian Update:
-
Prior probability of Lane Change: (Rare event).
-
Likelihood given blinker: .
-
Calculate the posterior probability.
-
-
Analysis: Why does this fail if the lane lines are faded?
Further Reading:
- VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation (CVPR 2020)
- Multipath++: Efficient Information Fusion and Trajectory Aggregation (ICRA 2022)
- Waymo Open Dataset: Sim Agents Challenge