Physical AI Glossary
Video Prediction
Video prediction generates future video frames from past observations and optional action inputs, serving as a learned world model for robot planning. Unlike classical physics simulators requiring explicit geometry and dynamics, video prediction models learn visual dynamics directly from data—predicting pixel-level consequences of actions in unstructured environments where analytic models fail.
Quick facts
- Term: Video Prediction
- Domain: Robotics and physical AI
- Last reviewed: 2025-06-08
What Video Prediction Solves in Physical AI
Video prediction addresses the core challenge of model-based robot learning: predicting sensory consequences of actions without hand-engineered physics. Classical model-based reinforcement learning relies on simulators like RoboSuite or ManiSkill that require precise object meshes, friction coefficients, and contact dynamics—parameters unavailable in real-world kitchens, warehouses, or outdoor environments.
Learned video prediction models replace analytic simulators with neural networks trained on visual trajectories. Given observed frames x₀…xₙ and a candidate action sequence aₙ₊₁…aₙ₊ₖ, the model outputs predicted frames xₙ₊₁…xₙ₊ₖ. Policies like RT-1 and RT-2 map observations directly to actions end-to-end without an explicit world model, but explicit video prediction enables interpretable planning: generate candidate action sequences, predict their visual outcomes, and select the trajectory maximizing a reward signal.
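A minimal sketch of this plan-by-prediction loop, assuming a learned model exposed as a `predict_frames(frames, actions)` callable and a task-specific `reward_fn`; both names are placeholders rather than any particular library's API:

```python
import numpy as np

def plan_with_video_prediction(predict_frames, reward_fn, current_frames,
                               action_dim=7, horizon=10, n_candidates=64, rng=None):
    """Random-shooting planner: sample candidate action sequences, predict their
    visual outcomes with the learned model, and keep the highest-reward sequence."""
    rng = rng or np.random.default_rng(0)
    # Candidate actions, e.g. bounded end-effector deltas in [-1, 1].
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    best_score, best_actions = -np.inf, None
    for actions in candidates:
        predicted = predict_frames(current_frames, actions)  # (horizon, H, W, 3)
        score = reward_fn(predicted)
        if score > best_score:
            best_score, best_actions = score, actions
    return best_actions, best_score

# Stand-in model and reward so the loop runs end to end without trained weights.
dummy_predict = lambda frames, actions: np.repeat(frames[-1:], len(actions), axis=0)
dummy_reward = lambda frames: -float(frames[-1].mean())
obs = np.zeros((4, 64, 64, 3), dtype=np.float32)
best_actions, _ = plan_with_video_prediction(dummy_predict, dummy_reward, obs)
print(best_actions.shape)  # (10, 7)
```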
World Models (Ha & Schmidhuber 2018) demonstrated this loop in simulated car racing: a variational autoencoder compresses frames to latent codes, an LSTM predicts future latents conditioned on actions, and a controller optimizes actions in latent space[1]. Scaling this approach to high-resolution robot manipulation requires datasets pairing video with precise action labels—teleoperation trajectories where human operators' joystick commands are logged frame-by-frame.
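A compact PyTorch sketch of that recipe, simplified to a deterministic encoder and GRU (the original paper uses a variational autoencoder plus a mixture-density RNN); the layer sizes and 64×64 input resolution are illustrative choices, not values from the paper:

```python
import torch
import torch.nn as nn

class LatentDynamicsModel(nn.Module):
    """Encode frames to a small latent code, roll the latent forward with an
    action-conditioned RNN, decode predicted latents back to pixels."""

    def __init__(self, latent_dim=32, action_dim=3, hidden_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(                      # 64x64 RGB -> latent code
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.Linear(128 * 6 * 6, latent_dim),
        )
        self.dynamics = nn.GRU(latent_dim + action_dim, hidden_dim, batch_first=True)
        self.to_latent = nn.Linear(hidden_dim, latent_dim)
        self.decoder = nn.Sequential(                      # latent code -> 64x64 RGB
            nn.Linear(latent_dim, 128 * 6 * 6), nn.Unflatten(1, (128, 6, 6)),
            nn.ConvTranspose2d(128, 64, 4, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2), nn.Sigmoid(),
        )

    def forward(self, frames, actions):
        # frames: (B, T, 3, 64, 64); actions: (B, T, action_dim)
        B, T = frames.shape[:2]
        z = self.encoder(frames.flatten(0, 1)).view(B, T, -1)
        h, _ = self.dynamics(torch.cat([z, actions], dim=-1))
        z_next = self.to_latent(h)                          # predicted next latents
        return self.decoder(z_next.flatten(0, 1)).view(B, T, 3, 64, 64)

# Usage: model = LatentDynamicsModel(); preds = model(torch.rand(2, 8, 3, 64, 64), torch.rand(2, 8, 3))
```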
Architecture Families: Autoregressive, Diffusion, and Latent Dynamics
Video prediction architectures fall into three families. Autoregressive models like Video Transformers generate one frame at a time, feeding predictions back as input for the next step. This approach accumulates error over long horizons but enables flexible conditioning on actions or language at each timestep.
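A sketch of that feedback loop, assuming a single-step model wrapped as `predict_next_frame(context, action)` (a placeholder, not a specific framework's interface); it makes explicit why one-step errors compound over the horizon:

```python
import numpy as np

def autoregressive_rollout(predict_next_frame, context_frames, actions):
    """Roll a frame-at-a-time model forward by feeding each prediction back in as
    context for the next step, so errors accumulate with horizon length."""
    context = list(context_frames)
    predicted = []
    for action in actions:
        nxt = predict_next_frame(np.stack(context[-4:]), action)  # condition on recent frames
        predicted.append(nxt)
        context.append(nxt)            # prediction becomes input for the next step
    return np.stack(predicted)
```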
Diffusion models like NVIDIA Cosmos generate entire future sequences in parallel by iteratively denoising Gaussian noise. Imagen Video (2022) and Sora (2024) demonstrated photorealistic generation at scale, but action conditioning remains an open challenge—most diffusion video models train on internet video without robot telemetry[2].
Latent dynamics models compress frames to low-dimensional representations, predict latent trajectories with recurrent or transformer modules, then decode back to pixels. RoboNet (Dasari et al. 2019) trained a 7-robot latent dynamics model on 113,000 trajectories, achieving cross-embodiment transfer by learning shared visual dynamics[3]. DROID extended environment diversity with 76,000 trajectories collected across 564 scenes, but latent models sacrifice pixel-level fidelity for computational efficiency.
All three families require synchronized video-action pairs. LeRobot standardizes this via episode dictionaries mapping frame indices to 6-DOF end-effector poses, gripper states, and joint angles—the minimum metadata for action-conditioned prediction.
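An illustrative episode record carrying that minimum metadata; the field names and layout below are hypothetical and do not reproduce LeRobot's exact schema:

```python
# Hypothetical per-frame record for action-conditioned video prediction
# (illustrative field names, not LeRobot's actual format).
episode = {
    "fps": 15,
    "frames": [
        {
            "index": 0,
            "timestamp": 0.000,                    # seconds since episode start
            "image": "episode_000/frame_000.png",  # or an in-memory array
            "ee_pose": [0.42, -0.08, 0.31, 0.0, 0.707, 0.0, 0.707],  # xyz + quaternion
            "gripper": 0.0,                        # 0 = open, 1 = closed
            "joint_angles": [0.1, -0.6, 0.3, -1.2, 0.0, 1.5, 0.8],   # 7-DOF arm
            "action": [0.01, 0.0, -0.02, 0.0, 0.0, 0.0, 1.0],        # commanded delta + grip
        },
        # ... one entry per frame, synchronized to the video stream
    ],
}
```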
Training Data Requirements: Temporal Density and Action Precision
Video prediction models demand higher temporal resolution than classification datasets. EPIC-KITCHENS-100 captures 90,000 action segments at 60 fps, but segments are trimmed around discrete verbs ("open drawer", "pour water")—insufficient for continuous dynamics modeling[4]. Robot datasets like BridgeData V2 log full episodes at 5-15 Hz with millisecond-synchronized actions, enabling frame-to-frame prediction.
Action precision matters more than camera resolution. A 224×224 RGB stream with 1 kHz force-torque telemetry trains better dynamics models than 1080p video with 10 Hz pose estimates—contact-rich tasks like insertion or wiping depend on sub-centimeter position accuracy and Newton-level force feedback. UMI demonstrates this with a custom gripper logging 6-axis wrench at 100 Hz, paired with wrist-mounted 640×480 video.
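A small sketch of aligning high-rate telemetry to frame timestamps by per-channel linear interpolation, assuming both streams already share a time base; the function and variable names are illustrative:

```python
import numpy as np

def align_telemetry_to_frames(frame_times, telem_times, telem_values):
    """Resample high-rate telemetry (e.g. 1 kHz force-torque) onto lower-rate
    video timestamps (e.g. 15 Hz) by linear interpolation per channel."""
    telem_values = np.asarray(telem_values, dtype=np.float64)
    aligned = np.stack([
        np.interp(frame_times, telem_times, telem_values[:, c])
        for c in range(telem_values.shape[1])
    ], axis=1)
    return aligned   # shape: (n_frames, n_channels)

# Example: a 1 kHz 6-axis wrench stream resampled to 15 Hz frame timestamps.
frame_t = np.arange(0, 2.0, 1 / 15)
telem_t = np.arange(0, 2.0, 1 / 1000)
wrench = np.random.default_rng(0).normal(size=(len(telem_t), 6))
print(align_telemetry_to_frames(frame_t, telem_t, wrench).shape)  # (30, 6)
```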
Truelabel's marketplace indexes 12,000+ collectors contributing teleoperation data across 47 embodiments. Buyers filter by action dimensionality (3-DOF Cartesian vs. 7-DOF joint space), sensor modality (RGB-D, tactile, proprioception), and task domain (kitchen, warehouse, outdoor). Every dataset includes provenance metadata—collector identity, hardware specs, calibration logs—enabling reproducible world model training.
Action Conditioning: From Open-Loop Prediction to Closed-Loop Planning
Open-loop video prediction generates futures without action input, useful for anomaly detection or video compression but insufficient for robot control. Closed-loop prediction conditions each frame on the action taken at that timestep, enabling counterfactual reasoning: "If I rotate the gripper 15° clockwise, will the cup tip over?"
Convolutional Dynamic Neural Advection (Finn et al. 2016) introduced spatial transformers that warp previous frames according to predicted flow fields, conditioned on robot joint velocities. This architecture dominated early work but struggles with occlusions and appearance changes—a cup moving behind a box cannot be reconstructed from flow alone.
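A simplified stand-in for this idea in PyTorch: warp the previous frame with a predicted dense flow field using `grid_sample`. CDNA itself predicts transformation kernels and compositing masks rather than a single flow field, so this approximates the family, not the paper's exact operator:

```python
import torch
import torch.nn.functional as F

def warp_frame_with_flow(frame, flow):
    """Warp a frame with a per-pixel flow field (in pixels).
    frame: (B, 3, H, W); flow: (B, 2, H, W) as (dx, dy)."""
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().unsqueeze(0)   # (1, 2, H, W)
    coords = base + flow                                       # source sampling locations
    # Normalize to [-1, 1] as grid_sample expects, with (x, y) ordering.
    coords[:, 0] = 2 * coords[:, 0] / (W - 1) - 1
    coords[:, 1] = 2 * coords[:, 1] / (H - 1) - 1
    grid = coords.permute(0, 2, 3, 1)                          # (B, H, W, 2)
    return F.grid_sample(frame, grid, mode="bilinear", align_corners=True)

# Sanity check: zero flow reproduces the original frame.
frame = torch.rand(1, 3, 64, 64)
flow = torch.zeros(1, 2, 64, 64)
print(torch.allclose(warp_frame_with_flow(frame, flow), frame, atol=1e-5))
```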
Modern approaches like Genie (Google DeepMind 2024) learn latent action representations from video alone, inferring plausible action spaces without telemetry. Genie trained on 200,000 hours of internet platformer gameplay, discovering jump/crouch/move actions purely from pixel dynamics[5]. Applying this to robotics requires egocentric video datasets like Ego4D (3,670 hours from 74 worldwide locations) where human hand motions proxy for robot actions.
For manipulation, explicit action labels remain superior. Open X-Embodiment aggregates 1M+ trajectories with standardized action spaces (end-effector deltas, gripper binary), enabling cross-dataset pretraining. Video prediction models trained on this corpus generalize to new robots by fine-tuning on 50-500 target-embodiment episodes—a 10× data efficiency gain over training from scratch[6].
Evaluation Metrics: Beyond Pixel MSE
Mean squared error (MSE) between predicted and ground-truth pixels dominated early video prediction benchmarks, but MSE penalizes perceptually irrelevant shifts (a cup predicted 2 pixels left of its true position incurs high MSE despite correct semantics). Perceptual metrics like LPIPS (Learned Perceptual Image Patch Similarity) compare deep features from pretrained ImageNet models, correlating better with human judgments.
Fréchet Video Distance (FVD) extends Fréchet Inception Distance to video by computing statistics over spatiotemporal I3D features. Lower FVD indicates generated videos match the distribution of real videos, but FVD does not measure action-conditioned accuracy—a model generating plausible but incorrect futures (the robot grasps a different object than commanded) can score well.
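A minimal pixel-level baseline for comparison (MSE and PSNR); LPIPS and FVD require pretrained feature extractors and are deliberately left out of this sketch:

```python
import numpy as np

def video_mse_psnr(pred, target):
    """Pixel-level MSE and PSNR for predicted vs. ground-truth frames in [0, 1].
    Perceptual (LPIPS) and distributional (FVD) metrics need pretrained networks
    and are intentionally omitted from this baseline."""
    pred, target = np.asarray(pred, dtype=float), np.asarray(target, dtype=float)
    mse = float(np.mean((pred - target) ** 2))
    psnr = float("inf") if mse == 0 else 10.0 * np.log10(1.0 / mse)
    return mse, psnr

# Example: a predicted clip vs. ground truth, both (T, H, W, C) in [0, 1].
rng = np.random.default_rng(0)
gt = rng.uniform(size=(10, 64, 64, 3))
print(video_mse_psnr(gt + 0.01, gt))
```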
Task-specific metrics matter more for robot planning. CALVIN evaluates video prediction models by their ability to enable downstream policy learning: train a world model on 10,000 play trajectories, use it to generate synthetic rollouts for 34 manipulation tasks, measure success rate of policies trained purely on imagined data[7]. THE COLOSSEUM extends this to 20 real-world tasks with 6 distractor objects, testing whether predicted videos preserve task-relevant object states (drawer open/closed, liquid poured/not poured).
Truelabel's data marketplace tags datasets with evaluation-ready test splits—held-out environments, novel object instances, unseen lighting—enabling apples-to-apples model comparison. Buyers specify target metrics (FVD <150, grasp success >80%) and receive datasets meeting those thresholds.
Sim-to-Real Transfer: When Synthetic Video Prediction Fails
Synthetic data from simulators like RoboSuite or AI2-THOR offers infinite trajectories with perfect action labels, but video prediction models trained purely on simulation fail in real deployment due to the reality gap—differences in lighting, texture, physics fidelity, and sensor noise.
Domain randomization (Tobin et al. 2017) mitigates this by varying simulator parameters (camera position, object colors, friction) during training, forcing models to learn dynamics invariant to visual appearance[8]. Sim-to-real transfer surveys report 60-80% real-world success rates for policies trained on randomized simulation, but video prediction lags behind: generating photorealistic pixels from simulator states remains harder than learning control policies.
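A sketch of per-episode randomization, sampling a fresh configuration before each simulated rollout; the parameter names and ranges below are illustrative and not tied to any specific simulator's API:

```python
import numpy as np

def sample_randomized_sim_config(rng=None):
    """Sample one randomized simulator configuration per episode, in the spirit
    of domain randomization (illustrative parameters and ranges)."""
    rng = rng or np.random.default_rng()
    return {
        "camera_position": rng.uniform([-0.05, -0.05, 0.4], [0.05, 0.05, 0.6]),
        "light_intensity": rng.uniform(0.3, 1.5),
        "object_rgb": rng.uniform(0.0, 1.0, size=3),
        "table_texture_id": int(rng.integers(0, 50)),
        "friction": rng.uniform(0.4, 1.2),
        "camera_noise_std": rng.uniform(0.0, 0.02),
    }
```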
Hybrid approaches combine small real datasets with large synthetic corpora. RoboNet mixed 15,000 real trajectories (7 robots, 113,000 total) with 50,000 simulated MuJoCo rollouts, pretraining a shared latent dynamics model then fine-tuning per-robot decoders on real data[3]. This reduced real data requirements by 5×, but the latent bottleneck discarded high-frequency details (contact transients, deformable object dynamics) critical for dexterous manipulation.
The emerging consensus: real teleoperation data is irreplaceable for video prediction. DROID collected 76,000 real trajectories across 564 scenes, demonstrating that scale in diverse real environments beats synthetic augmentation. Truelabel's 12,000-collector network accelerates this by crowdsourcing teleoperation in buyers' target environments—warehouses, kitchens, construction sites—capturing the long-tail visual distributions simulators cannot replicate.
Egocentric Video: The Highest-Intent Training Signal
Egocentric video—captured from head-mounted or wrist-mounted cameras—provides the visual perspective robots experience during manipulation. Third-person video (static overhead cameras) includes irrelevant background clutter and occludes the gripper-object interaction zone, forcing models to learn viewpoint transformations before predicting dynamics.
EPIC-KITCHENS-100 pioneered large-scale egocentric action recognition with 100 hours of head-mounted GoPro footage across 45 kitchens, but its verb-noun annotations ("take cup", "pour water") lack the continuous action labels needed for video prediction[4]. Ego4D expanded to 3,670 hours across 74 worldwide locations and hundreds of scenarios (cooking, repair, social interaction), adding gaze tracking and audio, but still omits robot telemetry.
Robot-specific egocentric datasets pair wrist cameras with end-effector poses. BridgeData V2 uses a WidowX arm with a wrist-mounted RealSense capturing 640×480 RGB-D at 15 Hz, synchronized to 7-DOF joint angles at 20 Hz. ALOHA adds bimanual teleoperation with dual wrist cameras, enabling prediction of coordinated two-arm tasks (folding towels, opening jars).
Truelabel's teleoperation datasets default to egocentric capture—collectors wear GoPros or mount cameras on robot end-effectors, logging 6-DOF poses via OptiTrack or Vicon motion capture. Buyers specify camera intrinsics (focal length, distortion coefficients) and receive calibration matrices enabling direct projection of predicted 3D trajectories into pixel space.
Multimodal Conditioning: Language, Goals, and Tactile Feedback
Pure visual prediction ignores task intent—a model predicting "the robot moves forward" cannot distinguish between reaching for a cup versus avoiding an obstacle. Language conditioning grounds predictions in natural language instructions: given "pick up the red mug" and initial frames, predict frames showing the gripper approaching the red object.
RT-2 conditions a vision-language-action model on text prompts, but generates actions directly rather than predicted video. SayCan (Google 2022) uses language to score action sequences, but relies on a separate learned value function rather than explicit video rollouts. Combining language with video prediction remains an open problem—most video diffusion models (Sora, Imagen Video) condition on text but lack action inputs.
Goal-conditioned prediction provides a target end state (an image of the desired configuration) and predicts the trajectory connecting initial to goal frames. CALVIN trains policies via hindsight relabeling: any reached state becomes a valid goal, enabling self-supervised learning from play data[7]. Video prediction models can generate goal-reaching trajectories by optimizing actions to minimize pixel distance between predicted final frame and goal image.
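A sketch of the goal-image cost this describes; plugging it into a candidate-ranking loop like the planner sketched earlier turns goal images into action selection (pixel distance is the simplest choice, and latent-space distances are a common, more robust alternative):

```python
import numpy as np

def goal_image_cost(predicted_frames, goal_image):
    """Cost for goal-conditioned planning: mean squared pixel distance between
    the predicted final frame and the goal image. Lower is better."""
    predicted_frames = np.asarray(predicted_frames, dtype=float)
    goal_image = np.asarray(goal_image, dtype=float)
    return float(np.mean((predicted_frames[-1] - goal_image) ** 2))
```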
Tactile conditioning adds contact-rich dynamics invisible to vision. DROID includes GelSight tactile sensors on 12% of trajectories, logging 640×480 tactile images at 30 Hz alongside RGB video. Predicting tactile frames jointly with visual frames improves insertion and wiping tasks by 15-20% over vision-only models, but tactile data remains scarce—only 8,000 of DROID's 76,000 trajectories include touch.
Temporal Horizons: Short-Term Dynamics vs. Long-Horizon Planning
Video prediction accuracy degrades rapidly with horizon length. Models predicting 5-10 frames (0.3-1 second at 10 Hz) achieve near-perfect pixel reconstruction, but 50-frame predictions (5 seconds) accumulate compounding errors—small mistakes in predicted object positions propagate, causing drift and blurring.
Short-horizon prediction suffices for reactive control: RT-1 re-plans every 3 seconds using a 1-second lookahead, correcting errors before they compound. Long-horizon tasks (making a sandwich, cleaning a room) require hierarchical planning: predict high-level subgoal sequences ("open fridge", "grasp lettuce", "place on counter") at 1 Hz, then predict low-level trajectories (gripper waypoints) at 10 Hz within each subgoal.
LongBench evaluates manipulation policies on 50-step real-world tasks (average 120 seconds), finding that models trained on short teleoperation clips (<30 seconds) fail to chain subgoals—they successfully grasp objects but place them in wrong locations or forget intermediate steps. Training on full task episodes (2-5 minutes) improves long-horizon success from 12% to 68%, but such episodes are expensive: a 3-minute teleoperation run at 10 Hz yields 1,800 frames, requiring 15 GB storage for RGB-D plus actions.
Truelabel's marketplace offers both short clips (10-30 seconds, $0.50-2 per trajectory) for reactive policy training and full task episodes (1-5 minutes, $5-20 per trajectory) for hierarchical planning. Buyers specify horizon requirements and receive datasets with episode-length distributions matching their target tasks.
Uncertainty Quantification: Stochastic Prediction and Ensemble Methods
The future is inherently multimodal—a person reaching toward a table might grasp the cup, the plate, or the napkin. Deterministic video prediction models collapse this distribution to a single blurry average (the hand appears semi-transparent, hovering between all three objects). Stochastic models sample from a learned distribution over futures, generating diverse plausible trajectories.
Variational autoencoders (VAEs) model uncertainty via latent codes: sample z ~ N(0, I), concatenate it with observed frames and actions, and decode to predicted frames. Stochastic Video Generation (Denton & Fergus 2018) applied this approach to robot pushing video prediction, but VAE-based predictors can suffer from posterior collapse—the decoder ignores z and produces near-deterministic outputs.
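A sketch of drawing several futures from such a model by decoding with different latent samples; `decode` is a placeholder for a learned decoder taking (context, actions, z), not any specific library's interface:

```python
import numpy as np

def sample_diverse_futures(decode, context_frames, actions, n_samples=8,
                           latent_dim=32, rng=None):
    """Draw several plausible futures from a stochastic (VAE-style) predictor by
    decoding the same context and actions with different z ~ N(0, I) samples."""
    rng = rng or np.random.default_rng(0)
    zs = rng.standard_normal((n_samples, latent_dim))
    return [decode(context_frames, actions, z) for z in zs]
```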
Diffusion models naturally handle multimodality by sampling from learned score functions. NVIDIA Cosmos generates 100+ diverse continuations for the same initial frames, enabling risk-aware planning: reject trajectories where the robot collides with obstacles in >10% of samples. However, action-conditioned diffusion for robotics remains nascent—most work focuses on unconditional internet video.
Ensemble methods train multiple deterministic models with different random seeds, using prediction disagreement as an uncertainty signal. High ensemble variance indicates regions where training data is sparse (novel object configurations, rare contact modes), triggering active data collection. Truelabel's request system automates this loop: buyers upload ensemble predictions, the platform identifies high-disagreement states, collectors receive targeted requests for trajectories covering those states.
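A sketch of the disagreement signal, assuming each ensemble member is a callable mapping (context_frames, actions) to a predicted frame sequence of the same shape:

```python
import numpy as np

def ensemble_disagreement(models, context_frames, actions):
    """Predict the same rollout with every ensemble member and use per-pixel
    variance across members as an epistemic-uncertainty score; high values flag
    states worth targeting for new data collection."""
    rollouts = np.stack([m(context_frames, actions) for m in models])  # (M, T, H, W, C)
    return float(rollouts.var(axis=0).mean())
```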
Computational Cost: Training and Inference Budgets
Video prediction models are parameter-hungry. NVIDIA Cosmos scales to 12 billion parameters, requiring 256 H100 GPUs for pretraining on 20 million video clips—a $2M compute budget at 2025 cloud rates. Smaller models like LeRobot's Diffusion Policy (50M parameters) train on a single A100 in 12 hours using 10,000 trajectories, but achieve 60-70% success rates versus 85-90% for billion-parameter models.
Inference cost matters for real-time control. Autoregressive transformers generate frames sequentially, taking 50-200 ms per frame on an RTX 4090—too slow for 10 Hz control loops. Latent dynamics models compress frames to 64-256 dimensional codes, enabling 5-10 ms prediction latency, but sacrifice pixel-level detail. Diffusion models require 50-1000 denoising steps, taking 1-10 seconds per video—acceptable for offline planning but prohibitive for reactive control.
Model distillation compresses large teacher models into fast student models. Train a 12B-parameter diffusion model on 10M trajectories, then distill into a 50M-parameter latent dynamics model using the teacher's predictions as targets. The student runs 100× faster while retaining 90-95% of the teacher's accuracy. Open X-Embodiment provides pretrained teacher models (RT-2-X, Octo) enabling distillation with 1,000-10,000 target-domain trajectories rather than millions.
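A sketch of one distillation step in PyTorch, assuming teacher and student are both modules mapping (frames, actions) to predicted frames; the names and the plain MSE objective are illustrative choices, not a prescribed recipe:

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, frames, actions, optimizer):
    """One distillation update: the small student regresses onto the large
    teacher's predicted frames instead of (or in addition to) ground truth."""
    with torch.no_grad():
        target = teacher(frames, actions)   # slow, high-quality prediction
    pred = student(frames, actions)         # fast approximation
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```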
Truelabel's data marketplace tags datasets with training cost estimates—GPU-hours required to reach 80% validation accuracy on standard benchmarks—helping buyers budget compute alongside data acquisition.
Commercial Applications: Warehouse Automation and Humanoid Pretraining
Video prediction enables model-based planning in domains where simulation is infeasible. Warehouse picking involves deformable packaging (cardboard boxes, plastic bags), contact-rich grasping (items wedged between shelves), and variable lighting (skylights, moving forklifts)—conditions simulators cannot replicate. Scale AI's Physical AI platform trains video prediction models on 50,000+ warehouse teleoperation trajectories, enabling robots to predict grasp outcomes ("will this suction cup lift the bag?") before attempting picks.
Humanoid pretraining requires predicting full-body dynamics across locomotion, manipulation, and social interaction. Figure AI's partnership with Brookfield collects 10,000 hours of humanoid teleoperation in warehouses, construction sites, and retail stores, training world models that predict 30-second futures conditioned on 34-DOF whole-body actions. These models enable offline policy optimization—generate 1M imagined trajectories, train a policy on successful rollouts, deploy to hardware.
Autonomous vehicles use video prediction for pedestrian trajectory forecasting and occlusion reasoning. Waymo Open Dataset includes 1,000 hours of urban driving with LiDAR, camera, and radar, but lacks the fine-grained action labels (steering angle, throttle, brake at 100 Hz) needed for action-conditioned prediction. Truelabel's automotive collectors log CAN bus telemetry synchronized to camera frames, enabling prediction of vehicle responses to control inputs.
The common thread: video prediction replaces hand-engineered simulators in domains where physics is too complex (deformable objects), environments too diverse (1,000+ warehouse layouts), or tasks too contact-rich (humanoid whole-body control) for analytic models.
Open Challenges: Occlusion, Contact, and Generalization
Occlusion handling remains unsolved. When a robot arm moves behind a table, video prediction models must infer the occluded gripper's position from context (last visible pose, table geometry, action commands). Current models either hallucinate incorrect positions or generate blurry averages. Multiview prediction—training on synchronized cameras from 3-5 viewpoints—improves occlusion robustness by 20-30%, but requires 3-5× more data and compute.
Contact dynamics (friction, slip, deformation) are underrepresented in training data. DROID includes 76,000 trajectories but only 8% involve contact-rich tasks (wiping, insertion, cutting)—most are pick-and-place with rigid objects. Models trained on this distribution fail on deformable manipulation (folding laundry, kneading dough) or high-friction tasks (turning stiff knobs, opening stuck drawers). Truelabel's request system incentivizes contact-rich collection by paying 2-3× premiums for trajectories with force-torque telemetry exceeding 10 N or 1 Nm.
Cross-embodiment generalization tests whether a model trained on WidowX arms transfers to Franka Panda or UR5 robots. Open X-Embodiment demonstrates 60-75% zero-shot transfer by learning embodiment-agnostic visual dynamics, but performance degrades on tasks requiring precise force control (insertion tolerances <1 mm) where kinematic differences matter. Fine-tuning on 500-2,000 target-embodiment trajectories recovers 90-95% of train-from-scratch performance.
The path forward: larger datasets spanning more embodiments, tasks, and environments. Truelabel's 12,000-collector network contributes 500-1,000 new trajectories daily across 47 robot types, 200+ task categories, and 1,500+ physical locations—the scale needed to train generalizable world models.
External references and source context
- [1] World Models
  Ha & Schmidhuber 2018 paper demonstrating latent dynamics prediction for car racing control (worldmodels.github.io)
- [2] NVIDIA Cosmos World Foundation Models
  NVIDIA Cosmos 12B-parameter world foundation model for video prediction (NVIDIA Developer)
- [3] RoboNet: Large-Scale Multi-Robot Learning
  RoboNet paper demonstrating a 7-robot latent dynamics model trained on 113,000 trajectories (arXiv)
- [4] Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100
  EPIC-KITCHENS-100 paper describing 90,000 action segments at 60 fps across 45 kitchens (arXiv)
- [5] Genie: Generative Interactive Environments
  Genie paper demonstrating latent action discovery from 200,000 hours of gameplay video (arXiv)
- [6] Open X-Embodiment: Robotic Learning Datasets and RT-X Models
  Open X-Embodiment paper aggregating 1M+ trajectories with standardized action spaces (arXiv)
- [7] CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks
  CALVIN paper evaluating video prediction via downstream policy learning on 34 tasks (arXiv)
- [8] Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World
  Domain randomization paper by Tobin et al. 2017 for sim-to-real transfer (arXiv)
FAQ
What is the difference between video prediction and video generation?
Video prediction generates future frames conditioned on past observations and actions, maintaining causal consistency with the input sequence. Video generation (e.g., Sora, Imagen Video) creates videos from text prompts without temporal grounding, optimizing for visual quality rather than physical plausibility. Robot planning requires prediction—the model must forecast consequences of specific actions, not generate arbitrary plausible videos.
How much training data does a video prediction model need?
Baseline models achieve 70-80% accuracy on simple pick-and-place with 5,000-10,000 trajectories (50-100 hours of teleoperation). State-of-the-art models like those trained on Open X-Embodiment (1M+ trajectories) or DROID (76,000 trajectories) reach 85-90% on diverse manipulation tasks. Contact-rich tasks (insertion, wiping) require 2-3× more data due to underrepresentation in existing datasets. Truelabel's marketplace enables buyers to start with 1,000-trajectory pilots, scaling to 10,000-100,000 as model performance plateaus.
Can video prediction models trained on simulation transfer to real robots?
Sim-to-real transfer for video prediction lags behind policy transfer. Domain randomization improves real-world success from 20-30% (no randomization) to 60-80%, but generating photorealistic pixels from simulator states remains harder than learning control policies. Hybrid approaches mixing 10-20% real data with synthetic data achieve 75-85% real-world accuracy, but pure simulation training fails due to lighting, texture, and physics gaps. Real teleoperation data is currently irreplaceable for production video prediction models.
What camera resolution and frame rate are needed for robot video prediction?
Most models train on 224×224 or 256×256 RGB at 10-15 Hz—sufficient for tabletop manipulation where objects move <10 cm/s. High-speed tasks (catching, hitting) require 30-60 Hz. Resolution matters less than action precision: 224×224 video with 1 kHz force-torque telemetry outperforms 1080p video with 10 Hz pose estimates for contact-rich tasks. Depth channels (RGB-D) improve occlusion reasoning by 10-15% but double data storage costs. Truelabel's datasets default to 640×480 RGB-D at 15 Hz, balancing quality and cost.
How do video prediction models handle uncertainty and multimodal futures?
Deterministic models collapse multimodal futures into blurry averages. Stochastic models use VAEs (sample latent codes) or diffusion (iterative denoising) to generate diverse trajectories. Ensemble methods train multiple models with different seeds, using prediction disagreement as an uncertainty signal. For robot planning, ensembles are most practical—generate 10-50 rollouts, reject trajectories with collisions or failures in >10% of samples, execute the safest remaining plan. NVIDIA Cosmos and other diffusion models enable 100+ diverse samples but require 1-10 seconds per video, too slow for real-time control.
Find datasets covering video prediction
Truelabel surfaces vetted datasets and capture partners working with video prediction. Send the modality, scale, and rights you need and we route you to the closest match.
Browse Physical AI Datasets