Physical AI Glossary
World Model
A world model is a neural network that learns to predict future environment states given current observations and proposed actions, enabling agents to plan by simulating action sequences internally before physical execution. Training world models requires diverse real-world video capturing causal structure—robotics teams use teleoperation datasets like DROID (76,000 trajectories across 564 scenes) and BridgeData V2 (60,000 demonstrations) to teach models how objects respond to manipulation.
Quick facts
- Term: World Model
- Domain: Robotics and physical AI
- Last reviewed: 2025-05-15
What Is a World Model in Physical AI?
A world model is a learned internal representation that predicts how an environment will change in response to an agent's actions without executing those actions in the real world. In robotics and embodied AI, world models take current sensor observations—camera frames, proprioceptive joint angles, tactile readings—and a proposed action sequence as input, then output predictions of future observations. This predictive capability enables model-based planning: the agent simulates multiple candidate action sequences internally, evaluates predicted outcomes, and selects the best trajectory before committing to physical execution.
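The planning loop itself is compact. The sketch below is illustrative only—`WorldModel.predict`, the candidate count, and the scoring function are hypothetical stand-ins for a trained model and a task reward, not any published implementation:

```python
# Minimal sketch of model-based planning with a world model. All names and
# the placeholder dynamics are illustrative assumptions.
import numpy as np

class WorldModel:
    """Learned dynamics: predicts the next state given state and action."""
    def predict(self, state: np.ndarray, action: np.ndarray) -> np.ndarray:
        # A trained network would go here; toy linear dynamics as a placeholder.
        return state + 0.1 * action

def rollout(model: WorldModel, state: np.ndarray, actions: np.ndarray) -> list:
    """Simulate a candidate action sequence entirely inside the model."""
    states = [state]
    for a in actions:
        states.append(model.predict(states[-1], a))
    return states

def plan(model: WorldModel, state: np.ndarray, goal: np.ndarray,
         n_candidates: int = 64, horizon: int = 10) -> np.ndarray:
    """Sample candidate sequences, score predicted outcomes, pick the best."""
    candidates = np.random.uniform(
        -1, 1, size=(n_candidates, horizon, state.shape[0]))
    def score(seq):  # negative distance of final predicted state to goal
        return -np.linalg.norm(rollout(model, state, seq)[-1] - goal)
    best = max(candidates, key=score)
    return best[0]  # execute only the first action, then replan (MPC style)
```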
World models operate at different levels of abstraction. Low-level world models predict raw sensory observations—actual pixel values of future camera frames—which requires large generative models and substantial compute. Google's RT-1 Robotics Transformer demonstrated that vision-language-action models can implicitly learn world dynamics from 130,000 demonstrations, though explicit world model architectures remain an active research frontier. High-level world models predict abstract state representations in a learned latent space, compressing environment dynamics into compact vectors where prediction is faster and more tractable. Hafner's Dreamer family of algorithms exemplifies this approach, learning latent dynamics models from pixel observations in continuous control tasks.
Hybrid approaches predict at intermediate levels—feature maps, semantic segmentation masks, or object-centric representations—capturing task-relevant structure without pixel-level generation costs. NVIDIA's Cosmos World Foundation Models train on millions of hours of video to learn generalizable physical priors, then fine-tune on domain-specific robotics data. The choice of abstraction level trades off between prediction fidelity, computational cost, and generalization: low-level models capture fine-grained texture and lighting but struggle with long-horizon prediction; high-level models generalize better but may discard task-critical details.
Historical Evolution: From Dyna to Dreamer to Foundation Models
The concept of world models traces back to Sutton's Dyna architecture (1991), which integrated model-based planning with model-free reinforcement learning by using a learned environment model to generate synthetic experience. Early implementations struggled with model error accumulation—small prediction mistakes compound over multi-step rollouts, leading to catastrophic planning failures. For two decades, model-free methods dominated robotics due to their robustness to model mismatch, despite requiring orders of magnitude more real-world interaction data.
The modern resurgence began with Ha and Schmidhuber's World Models (2018), which trained a variational autoencoder to compress visual observations into a low-dimensional latent space, then learned a recurrent neural network to predict latent dynamics. This architecture achieved strong performance on car racing and VizDoom tasks with compact models trainable on a single GPU. The key insight: learning dynamics in latent space rather than pixel space dramatically reduces model complexity and improves sample efficiency.
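A minimal sketch of that two-stage pipeline follows, with the paper's variational encoder simplified to a deterministic one and its mixture-density RNN replaced by a plain GRU; all dimensions are illustrative:

```python
# Sketch of the World Models (2018) pipeline: a VAE ("V") compresses frames
# to a latent vector z, and an RNN ("M") predicts the next z given z and the
# action. Simplified here: deterministic encoder, GRU instead of an MDN-RNN.
import torch
import torch.nn as nn

class Encoder(nn.Module):           # "V": frame -> latent
    def __init__(self, z_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Flatten())
        self.fc = nn.LazyLinear(z_dim)
    def forward(self, frame):
        return self.fc(self.conv(frame))

class LatentDynamics(nn.Module):    # "M": (z_t, a_t) -> z_{t+1}
    def __init__(self, z_dim=32, a_dim=3, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(z_dim + a_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, z_dim)
    def forward(self, z_seq, a_seq):
        h, _ = self.rnn(torch.cat([z_seq, a_seq], dim=-1))
        return self.head(h)         # predicted next latents

enc, dyn = Encoder(), LatentDynamics()
frame = torch.randn(1, 3, 64, 64)                    # one 64x64 RGB frame
z = enc(frame)                                       # (1, 32) latent
z_next = dyn(z.unsqueeze(1), torch.randn(1, 1, 3))   # predict next latent
```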
Hafner's PlaNet (2019) and subsequent Dreamer algorithms (2020-2023) extended this approach to continuous control, demonstrating that latent world models could match or exceed model-free methods on Atari and MuJoCo benchmarks while using 10-50× less environment interaction[1]. Recent foundation model efforts—NVIDIA Cosmos, Google's Genie, OpenAI's Sora—train on internet-scale video to learn generalizable physical priors, then adapt to robotics via fine-tuning. NVIDIA's GR00T N1 technical report describes training on 1.2 million robot trajectories plus 100,000 hours of human video to build a unified world model for humanoid control.
Training Data Requirements: Real-World Video and Teleoperation Datasets
Training world models requires diverse real-world video that captures the causal structure of physical interactions—how objects move, deform, occlude, and respond to forces. Teleoperation datasets have emerged as the most valuable data category for this purpose because they pair human demonstrations with precise action labels, enabling supervised learning of action-conditioned dynamics. The DROID dataset contains 76,000 trajectories across 564 scenes, collected via teleoperation on 18 robot platforms and providing rich coverage of manipulation primitives[2]. BridgeData V2 contributes 60,000 demonstrations of long-horizon tasks in kitchen and tabletop environments, with multi-view camera angles and precise gripper state annotations[3].
Egocentric video datasets like EPIC-KITCHENS-100 (100 hours, 90,000 action segments) and Ego4D (3,600 hours across 74 locations) capture human manipulation from head-mounted cameras, providing naturalistic priors for object affordances and task structure. However, egocentric video lacks precise action labels—researchers must infer hand poses and contact forces from pixels, introducing noise. Open X-Embodiment aggregates 1 million trajectories from 22 robot embodiments, standardizing heterogeneous data sources into a common format for cross-embodiment transfer[4].
Simulation data remains critical for scaling: domain randomization techniques generate infinite synthetic trajectories with perfect ground-truth labels, though sim-to-real transfer requires careful calibration. RLBench provides 100 simulated manipulation tasks with procedurally generated variations, enabling large-scale pretraining before real-world fine-tuning. The optimal training mix balances real-world diversity (capturing distribution of physical phenomena) with simulation scale (enabling high-throughput model iteration).
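One common way to realize that mix is to sample each batch element from real or simulated data with a fixed probability. The sketch below is a toy version; the 30/70 ratio and the dataset objects are assumptions, not a recommendation from any cited paper:

```python
# Sketch of a real/sim training mix: sample each batch element from real
# teleoperation data with probability p_real, otherwise from simulation.
import random

def mixed_batches(real_episodes, sim_episodes, p_real=0.3, batch_size=32):
    """Yield batches biased toward cheap simulated data, anchored in real data."""
    while True:
        yield [
            random.choice(real_episodes) if random.random() < p_real
            else random.choice(sim_episodes)
            for _ in range(batch_size)
        ]
```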
Model Architectures: Latent Dynamics, Video Prediction, and Diffusion
World model architectures fall into three families. Latent dynamics models compress observations into a low-dimensional latent space, then learn a recurrent or transformer-based transition function in that space. Dreamer uses a recurrent state-space model with stochastic latent variables, enabling long-horizon planning via latent imagination. RT-2 extends this by grounding latent dynamics in vision-language representations, transferring web-scale semantic knowledge to robotic control[5].
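The sketch below shows the shape of one RSSM-style transition step, loosely following Dreamer's design: a deterministic recurrent state, a prior over the stochastic latent for imagination, and a posterior that conditions on the encoded observation during training. The dimensions and the diagonal-Gaussian parameterization are illustrative assumptions:

```python
# Minimal sketch of a recurrent state-space model (RSSM) step. The prior is
# used for pure imagination; the posterior corrects with the observation.
import torch
import torch.nn as nn

class RSSMStep(nn.Module):
    def __init__(self, z_dim=32, a_dim=4, h_dim=200, obs_dim=128):
        super().__init__()
        self.cell = nn.GRUCell(z_dim + a_dim, h_dim)
        self.prior = nn.Linear(h_dim, 2 * z_dim)            # mean, log-std
        self.post = nn.Linear(h_dim + obs_dim, 2 * z_dim)   # mean, log-std

    def forward(self, h, z, a, obs_embed=None):
        h = self.cell(torch.cat([z, a], -1), h)              # deterministic path
        stats = self.prior(h) if obs_embed is None \
            else self.post(torch.cat([h, obs_embed], -1))
        mean, log_std = stats.chunk(2, -1)
        z = mean + torch.randn_like(mean) * log_std.exp()    # reparameterized sample
        return h, z
```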
Video prediction models generate future frames directly in pixel space. Ha and Schmidhuber's original World Models used a mixture-density recurrent network to predict distributions over the next latent state rather than raw frames; modern approaches generate pixels directly with diffusion models or autoregressive transformers. NVIDIA Cosmos trains 12-billion-parameter video diffusion models on 20 million hours of driving and robotics footage, achieving photorealistic 10-second predictions at 30 FPS[6]. Video prediction enables zero-shot sim-to-real transfer—the model learns physical priors from internet video, then applies them to novel robot embodiments without task-specific fine-tuning.
Hybrid architectures predict intermediate representations. OpenVLA predicts semantic feature maps rather than raw pixels, reducing computational cost while preserving task-relevant structure. DeepMind's RoboCat learns a shared world model across six robot embodiments by predicting object-centric representations—bounding boxes, segmentation masks, 6-DOF poses—that abstract away embodiment-specific details[7]. The choice of representation determines the model's generalization-efficiency tradeoff: pixel-level models capture fine details but require massive datasets; semantic models generalize from less data but may miss critical low-level cues.
Planning with World Models: Model Predictive Control and Tree Search
World models enable three planning paradigms. Model Predictive Control (MPC) rolls out the world model for a fixed horizon (typically 5-20 steps), optimizes an action sequence via gradient descent or sampling-based methods, executes the first action, then replans from the new state. PlaNet uses the cross-entropy method (CEM) to sample 1,000 candidate action sequences, evaluate their predicted returns, and select the top performers. MPC provides robustness to model error—frequent replanning corrects for accumulated prediction drift—but requires fast inference (sub-100ms per replan for real-time control).
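A compact CEM planner looks roughly like this; `model.predict` and `reward` are assumed interfaces, and the hyperparameters mirror the magnitudes quoted above rather than any specific paper's settings:

```python
# Sketch of cross-entropy method (CEM) planning for PlaNet-style MPC:
# sample action sequences from a Gaussian, keep the elites, refit, repeat.
import numpy as np

def cem_plan(model, state, reward, horizon=12, n_samples=1000,
             n_elites=100, n_iters=5, a_dim=4):
    mean = np.zeros((horizon, a_dim))
    std = np.ones((horizon, a_dim))
    for _ in range(n_iters):
        seqs = mean + std * np.random.randn(n_samples, horizon, a_dim)
        returns = []
        for seq in seqs:                        # evaluate inside the model
            s, ret = state, 0.0
            for a in seq:
                s = model.predict(s, a)
                ret += reward(s, a)
            returns.append(ret)
        elites = seqs[np.argsort(returns)[-n_elites:]]
        mean, std = elites.mean(0), elites.std(0) + 1e-6
    return mean[0]                              # first action; replan next step
```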
Tree search methods like Monte Carlo Tree Search (MCTS) build an explicit search tree over possible action sequences, using the world model to simulate outcomes at each node. MuZero combines learned world models with MCTS to achieve superhuman performance on Atari and board games, demonstrating that learned simulators can replace hand-coded game rules. In robotics, tree search enables long-horizon planning—CALVIN uses a learned world model to plan 10-step manipulation sequences, achieving 88% success on multi-stage tasks[8].
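The sketch below shows the core idea in its simplest form—exhaustive depth-limited search over a discrete action set inside a learned model. Real MCTS as in MuZero replaces the exhaustive loop with UCB-guided selection and a learned value network at the leaves; `model.predict` and `reward` are assumed interfaces:

```python
# Minimal sketch of tree search over a learned model: expand every discrete
# action to a fixed depth in simulation and back up the best discounted return.
def search(model, state, actions, reward, depth=3, gamma=0.99):
    if depth == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions:
        next_state = model.predict(state, a)    # simulated, never executed
        future, _ = search(model, next_state, actions, reward, depth - 1, gamma)
        value = reward(next_state, a) + gamma * future
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action
```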
Latent imagination trains a policy entirely in the world model's latent space, never executing actions during training. Dreamer imagines thousands of trajectories per environment step, learning a value function and policy from imagined experience. This approach achieves 10-50× sample efficiency gains over model-free methods on continuous control benchmarks[1]. RT-1 extends latent imagination to vision-language-action models, enabling robots to plan by imagining the consequences of natural language instructions before execution.
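A Dreamer-style imagination update has roughly the following structure. All module interfaces (`dynamics.step`, `actor`, `critic`, `reward_head`) are hypothetical, and the return estimate is simplified from Dreamer's lambda-returns to a plain discounted sum:

```python
# Sketch of latent imagination: starting from latents encoded from replayed
# real steps, roll the policy forward entirely inside the learned dynamics
# and train actor/critic on the imagined returns.
import torch

def imagine_and_learn(dynamics, actor, critic, reward_head,
                      start_h, start_z, horizon=15, gamma=0.99):
    h, z = start_h, start_z
    rewards, values = [], []
    for _ in range(horizon):                      # no environment interaction
        a = actor(torch.cat([h, z], -1))
        h, z = dynamics.step(h, z, a)             # prior-only rollout
        rewards.append(reward_head(torch.cat([h, z], -1)))
        values.append(critic(torch.cat([h, z], -1)))
    # Discounted imagined return; actor ascends it, critic regresses toward it.
    ret = sum((gamma ** t) * r for t, r in enumerate(rewards))
    actor_loss = -ret.mean()
    critic_loss = sum((v - ret.detach()).pow(2).mean() for v in values)
    return actor_loss, critic_loss
```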
Evaluation Metrics: Prediction Accuracy, Planning Performance, and Generalization
World model quality is measured across three dimensions. Prediction accuracy quantifies how well the model forecasts future observations. Standard metrics include mean squared error (MSE) for pixel predictions, structural similarity index (SSIM) for perceptual quality, and Fréchet Video Distance (FVD) for distribution matching. NVIDIA Cosmos reports FVD scores below 50 on held-out driving scenarios, indicating near-photorealistic 10-second predictions[6]. However, prediction accuracy does not guarantee planning performance—models can generate plausible-looking futures that violate task constraints.
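Per-frame MSE and SSIM are straightforward to compute; the sketch below uses scikit-image and assumes frames as float arrays in [0, 1] with shape (H, W, 3). FVD is omitted because it additionally requires a pretrained I3D video network:

```python
# Sketch of per-frame prediction metrics between predicted and true frames.
import numpy as np
from skimage.metrics import structural_similarity

def frame_metrics(pred: np.ndarray, true: np.ndarray) -> dict:
    mse = float(np.mean((pred - true) ** 2))
    ssim = structural_similarity(pred, true, channel_axis=-1, data_range=1.0)
    return {"mse": mse, "ssim": float(ssim)}
```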
Planning performance measures task success when using the world model for control. Dreamer achieves 90% of model-free performance on Atari while using 20× less environment interaction, demonstrating that latent world models enable sample-efficient learning[1]. CALVIN evaluates long-horizon manipulation: a world model trained on 10,000 demonstrations achieves 88% success on 5-step task chains, versus 34% for model-free baselines[8]. The gap between prediction accuracy and planning performance reveals model limitations—accurate short-term predictions may still accumulate error over multi-step rollouts.
Generalization tests whether world models transfer to novel objects, scenes, or embodiments. Open X-Embodiment trains a unified world model on 22 robot platforms, then evaluates zero-shot transfer to a 23rd unseen embodiment, achieving 67% of in-distribution performance[4]. RoboCat demonstrates cross-embodiment transfer by learning a shared object-centric world model, enabling a 6-DOF arm to leverage demonstrations from a parallel-jaw gripper[7]. Generalization failures often stem from distribution shift—world models overfit to training data statistics (lighting, backgrounds, object textures) rather than learning causal structure.
Sim-to-Real Transfer: Domain Randomization and Reality Gap
Simulation provides infinite training data with perfect ground-truth labels, but world models trained purely in simulation fail when deployed on real robots due to the reality gap—discrepancies in physics, rendering, and sensor noise. Domain randomization addresses this by training on a distribution of simulated environments with randomized textures, lighting, object properties, and dynamics parameters. The model learns to ignore simulation-specific artifacts and focus on invariant causal structure, improving real-world transfer[9].
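In code, domain randomization often amounts to resampling simulator parameters at every episode reset. The parameter names and ranges below are illustrative and not tied to any particular simulator:

```python
# Sketch of domain randomization: sample fresh simulator parameters per
# episode so the model cannot latch onto any one configuration.
import random

def randomize_sim():
    return {
        "friction": random.uniform(0.4, 1.2),
        "object_mass_kg": random.uniform(0.05, 2.0),
        "light_intensity": random.uniform(0.3, 1.5),
        "texture_id": random.randrange(1000),
        "camera_jitter_deg": random.uniform(-5.0, 5.0),
    }

# for episode in range(num_episodes):
#     sim.reset(**randomize_sim())   # hypothetical simulator API
#     collect_trajectory(sim)
```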
The survey "Crossing the Reality Gap: A Survey on Sim-to-Real Transferability" identifies three transfer strategies. System identification measures real-world physics parameters (friction, mass, damping) and configures the simulator to match, minimizing domain gap. Adversarial training learns a world model robust to worst-case perturbations, ensuring the model does not exploit simulation-specific shortcuts. Hybrid approaches pretrain on simulation, then fine-tune on small amounts of real data—RT-1 uses 100,000 simulated trajectories plus 30,000 real demonstrations to achieve 97% success on real-world pick-and-place tasks[10].
RLBench provides 100 manipulation tasks with procedural variation, enabling large-scale sim-to-real experiments. RoboSuite offers photorealistic rendering and accurate contact dynamics, reducing the reality gap for vision-based policies. Despite progress, sim-to-real transfer remains brittle—world models often fail on out-of-distribution real-world scenarios (novel object geometries, unexpected occlusions, lighting changes) that were underrepresented in simulation training distributions.
Foundation Models for World Modeling: Scaling Laws and Pretraining
Recent efforts train world models on internet-scale video to learn generalizable physical priors before task-specific fine-tuning. NVIDIA Cosmos trains 12-billion-parameter video diffusion models on 20 million hours of driving, robotics, and human activity footage, demonstrating that scale improves zero-shot transfer to novel environments[6]. NVIDIA GR00T N1 extends this to humanoid control, training on 1.2 million robot trajectories plus 100,000 hours of human video to build a unified world model for bipedal locomotion and manipulation.
Google's Genie trains on 200,000 hours of internet video games, learning a latent action space and world dynamics from pixels alone—no action labels required. The model generates playable game environments from a single image prompt, demonstrating that world models can learn action-conditioned dynamics via self-supervision. OpenAI's Sora extends video generation to 60-second clips at 1080p resolution, though its application to robotics planning remains unexplored.
Scaling laws for world models remain poorly understood. RT-2 shows that increasing model size from 5B to 55B parameters improves generalization to novel objects and instructions, but the relationship between pretraining data volume, model capacity, and downstream task performance lacks the predictability of language model scaling laws[5]. Open X-Embodiment finds that cross-embodiment transfer improves logarithmically with the number of source robots—adding the 10th embodiment provides less benefit than adding the 2nd[4]. Optimal pretraining strategies remain an open research question: should world models train on diverse internet video or domain-specific robotics data?
Data Formats and Tooling: RLDS, LeRobot, and MCAP
Robotics datasets use specialized formats to store multi-modal sensor streams and action labels. RLDS (Reinforcement Learning Datasets) defines a standardized schema for episodic data—sequences of observations, actions, rewards, and metadata—stored in TensorFlow's TFRecord format. Open X-Embodiment uses RLDS to unify 22 heterogeneous datasets, enabling cross-dataset training without format conversion[4].
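Reading RLDS data follows the schema's nesting: a dataset of episodes, each containing an inner dataset of steps. The dataset name below is a placeholder; the field names follow the standard RLDS step schema:

```python
# Sketch of iterating an RLDS-formatted dataset with TensorFlow Datasets.
import tensorflow_datasets as tfds

ds = tfds.load("some_rlds_dataset", split="train")   # placeholder dataset name
for episode in ds.take(1):
    for step in episode["steps"]:                    # inner dataset of steps
        obs = step["observation"]                    # e.g. camera images, states
        action = step["action"]
        is_last = step["is_last"]                    # episode boundary flag
```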
Hugging Face LeRobot provides a PyTorch-native alternative, storing trajectories in Parquet files with HDF5 blobs for images and point clouds. LeRobot includes 25 pretrained policies (ACT, Diffusion Policy, VQ-BeT) and 15 datasets (ALOHA, PushT, UMI), enabling researchers to train world models without building data pipelines from scratch[11]. MCAP is a container format for multi-modal time-series data, widely used in autonomous vehicles—Waymo Open Dataset distributes 1,000 hours of driving data in MCAP, pairing LiDAR point clouds with camera images and vehicle telemetry.
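Loading a LeRobot dataset is similarly brief. The import path below follows the examples in the LeRobot repository at the time of writing and has changed across releases, so treat it as a sketch to verify against the installed version:

```python
# Sketch of loading a bundled LeRobot dataset (PushT); verify the import
# path against your installed lerobot version.
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("lerobot/pusht")
frame = dataset[0]                   # dict of tensors: images, state, action
print(dataset.num_episodes, list(frame.keys()))
```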
Annotation tooling for world model training data remains immature. Scale AI's Physical AI data engine offers 3D bounding box annotation and point cloud segmentation for autonomous vehicles, but lacks manipulation-specific primitives (grasp annotations, contact labels, force profiles). Truelabel's physical AI marketplace aggregates teleoperation datasets with verified provenance, enabling buyers to filter by robot embodiment, task category, and data volume. As world model architectures standardize, demand for high-quality action-labeled video will drive investment in specialized annotation infrastructure.
Common Failure Modes: Model Error Accumulation and Distributional Shift
World models fail in predictable ways. Model error accumulation occurs when small prediction mistakes compound over multi-step rollouts—a 1% per-step error compounds to roughly 10% after 10 steps and over 60% after 50 steps. PlaNet mitigates this via frequent replanning (every 5-10 steps), correcting for drift by re-observing the true state. MuZero uses value equivalence—the model need not predict observations accurately as long as predicted values match true values—enabling longer planning horizons despite imperfect dynamics.
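The compounding arithmetic is easy to verify:

```python
# Worked check of multiplicative error compounding: a 1% per-step error grows
# to about 10% after 10 steps and roughly 65% after 50, which is why frequent
# replanning matters.
for n in (10, 50):
    print(n, round((1.01 ** n - 1) * 100, 1), "%")   # -> 10.5 %, 64.5 %
```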
Distributional shift causes world models to fail on out-of-distribution states. A model trained on tabletop manipulation may predict nonsensical dynamics when the robot encounters a cluttered shelf or transparent object. Open X-Embodiment addresses this by training on diverse embodiments and environments, but zero-shot transfer to truly novel scenarios (e.g., underwater manipulation, microgravity) remains unreliable[4]. Active learning—identifying high-uncertainty states and collecting targeted demonstrations—can patch distribution gaps, but requires human-in-the-loop data collection.
Causal confusion occurs when world models learn spurious correlations rather than causal structure. A model trained on kitchen videos may learn that "opening the fridge causes the light to turn on," failing to generalize to fridges without interior lights. Google's SayCan addresses this by grounding world models in affordance functions—learned predicates that capture preconditions and effects of actions—rather than raw pixel predictions. Causal discovery methods remain an active research area, with limited deployment in production robotics systems.
Commercial Applications: Autonomous Vehicles, Warehouse Robotics, and Humanoids
World models are entering production in three domains. Autonomous vehicles use world models for trajectory prediction—forecasting how pedestrians, cyclists, and other vehicles will move over the next 5-10 seconds. Waymo's motion prediction system trains on 1,000 hours of annotated driving data, achieving 85% accuracy on 8-second forecasts in urban environments. Tesla's Autopilot uses video prediction models to anticipate occlusions and plan lane changes, though technical details remain proprietary.
Warehouse robotics deploy world models for bin-picking and palletizing. Scale AI's partnership with Universal Robots trains world models on 50,000 pick-and-place demonstrations, enabling robots to predict grasp success before execution and replan when objects shift unexpectedly[12]. CloudFactory's industrial robotics solutions use world models to simulate pallet configurations, optimizing stacking sequences for stability and space efficiency.
Humanoid robotics represent the frontier. NVIDIA GR00T N1 trains a unified world model on 1.2 million trajectories across manipulation, locomotion, and whole-body tasks, enabling a humanoid to predict the consequences of complex action sequences (e.g., "walk to the table, pick up the cup, hand it to the person"). Figure AI's partnership with Brookfield aims to collect 10 million hours of humanoid teleoperation data by 2026, providing the scale needed to train generalizable world models for real-world deployment. Commercial viability hinges on reducing data requirements—current systems need 10,000-100,000 demonstrations per task, limiting economic feasibility.
Research Frontiers: Causal World Models and Compositional Generalization
Three research directions dominate. Causal world models aim to learn the underlying causal graph of the environment—which variables directly influence which others—rather than merely predicting correlations. The paper "General Agents Need World Models" argues that causal structure is necessary for compositional generalization: a robot that understands "pushing causes sliding" can combine this knowledge with "sliding causes falling" to predict that "pushing near an edge causes falling," without observing that specific scenario during training.
Compositional generalization tests whether world models can recombine learned primitives to solve novel tasks. CALVIN evaluates this by training on single-step tasks ("pick up block," "open drawer") then testing on multi-step chains ("pick up block, place in drawer, close drawer"). World models that learn compositional structure achieve 88% success on 5-step chains, versus 34% for non-compositional baselines[8]. ManipArena extends this to 100-step reasoning-oriented tasks, finding that current world models fail beyond 10-step horizons due to compounding prediction errors.
Multi-modal world models integrate vision, language, and haptics. RT-2 grounds world models in vision-language representations, enabling robots to plan from natural language instructions ("hand me the apple") by predicting the visual consequences of candidate action sequences[5]. Tactile world models—predicting contact forces and object deformation from touch—remain underexplored, though DexYCB provides 1,000 hours of tactile manipulation data for training. Unified multi-modal world models that predict across vision, language, audio, and haptics represent the long-term vision for general-purpose physical AI.
Procurement Considerations: Dataset Licensing and Provenance
Buying world model training data requires diligence on three fronts. Licensing determines commercial use rights. RoboNet's dataset license permits research use but prohibits commercial deployment without separate agreement. EPIC-KITCHENS-100 annotations use a non-commercial Creative Commons license, blocking use in production systems. CC-BY-4.0 permits commercial use with attribution, but does not address model training rights—legal ambiguity persists.
Provenance verifies data origin and consent. Truelabel's data provenance framework tracks collector identity, recording conditions, and consent artifacts for every trajectory, enabling buyers to audit compliance with GDPR Article 7 and AI Act transparency requirements. GDPR Article 7 mandates explicit consent for data collection, but many academic datasets lack documented consent, creating liability risk for commercial deployers.
Quality metrics for world model data remain unstandardized. Buyers should verify: (1) action label precision (±1mm for position, ±0.1N for force), (2) temporal synchronization across sensors (<10ms skew), (3) scene diversity (≥100 object instances, ≥10 lighting conditions), (4) failure case coverage (≥10% of trajectories include recoveries from errors). Truelabel's marketplace surfaces these metrics in dataset cards, enabling apples-to-apples comparison across vendors. As world model architectures mature, standardized benchmarks (analogous to ImageNet for vision) will emerge, clarifying procurement requirements.
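A procurement screen over such dataset cards can be a few lines; the field names below are an assumed card schema, not a standard:

```python
# Sketch of screening a dataset card against the thresholds listed above.
def passes_screen(card: dict) -> bool:
    return (
        card.get("position_precision_mm", 1e9) <= 1.0      # ±1mm position
        and card.get("force_precision_n", 1e9) <= 0.1      # ±0.1N force
        and card.get("sensor_skew_ms", 1e9) < 10           # <10ms sync skew
        and card.get("object_instances", 0) >= 100         # scene diversity
        and card.get("lighting_conditions", 0) >= 10
        and card.get("failure_recovery_fraction", 0.0) >= 0.10
    )
```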
External references and source context
- [1] World Models — Dreamer achieves 10-50× sample efficiency gains over model-free methods. worldmodels.github.io
- [2] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset — 76,000 trajectories across 564 scenes. arXiv
- [3] BridgeData V2: A Dataset for Robot Learning at Scale — 60,000 demonstrations with multi-view camera angles. arXiv
- [4] Open X-Embodiment: Robotic Learning Datasets and RT-X Models — 1M trajectories from 22 embodiments with cross-platform transfer metrics. arXiv
- [5] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control — model size scaling from 5B to 55B improves generalization. arXiv
- [6] NVIDIA Cosmos World Foundation Models — 12B-parameter models trained on 20M hours of video, FVD below 50. NVIDIA Developer
- [7] RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation — cross-embodiment transfer via object-centric world modeling. arXiv
- [8] CALVIN — 88% success on 5-step task chains with world model planning. arXiv
- [9] Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World — improves real-world transfer by training on diverse simulated environments. arXiv
- [10] RT-1: Robotics Transformer for Real-World Control at Scale — 97% success on real-world pick-and-place with hybrid sim-real training. arXiv
- [11] LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch — 25 pretrained policies and 15 datasets for world model training. arXiv
- [12] Scale AI and Universal Robots physical AI partnership — world models trained on 50,000 pick-and-place demonstrations. scale.com
FAQ
What is the difference between a world model and a simulator?
A simulator is a hand-coded program that implements known physics equations (rigid body dynamics, collision detection, rendering) to predict environment behavior. A world model is a learned neural network that approximates environment dynamics from data, without explicit physics equations. Simulators are highly accurate within their modeled domain but fail on phenomena outside their scope (soft-body deformation, fluid dynamics, complex contact). World models generalize to novel scenarios by learning patterns from data but accumulate prediction error over long horizons. Hybrid approaches use simulators for coarse prediction, then apply learned world models to correct for unmodeled effects.
How much training data does a world model need?
Data requirements vary by task complexity and model architecture. Latent dynamics models like Dreamer achieve strong performance on continuous control benchmarks with 100,000-500,000 environment steps (10-50 hours of interaction). Video prediction models require 10-100× more data—NVIDIA Cosmos trains on 20 million hours of video to learn generalizable physical priors. Robotics manipulation tasks need 10,000-100,000 demonstrations per task for reliable performance, though foundation models pretrained on diverse data can reduce task-specific requirements to 1,000-5,000 demonstrations. Cross-embodiment transfer via Open X-Embodiment reduces per-robot data needs by sharing knowledge across platforms.
Can world models trained in simulation transfer to real robots?
Sim-to-real transfer succeeds when the world model learns invariant causal structure rather than simulation-specific artifacts. Domain randomization—training on diverse simulated environments with randomized textures, lighting, and dynamics—improves transfer by forcing the model to ignore spurious correlations. Hybrid approaches pretrain on simulation then fine-tune on 1,000-10,000 real-world demonstrations, achieving 80-95% of fully real-trained performance. However, zero-shot sim-to-real transfer (no real data) remains unreliable for contact-rich manipulation—world models overfit to simulation physics and fail on real-world friction, compliance, and sensor noise.
What are the computational costs of training a world model?
Training costs scale with model size and data volume. Latent dynamics models like Dreamer train on a single NVIDIA A100 GPU in 1-3 days for continuous control tasks. Video prediction models require 8-64 GPUs for 1-4 weeks—NVIDIA Cosmos uses 256 A100 GPUs for 2 weeks to train on 20 million hours of video. Foundation models like RT-2 (55B parameters) require 512-1024 GPUs for 2-6 weeks. Inference costs are lower but still significant: real-time planning (30 Hz) with a video prediction model requires 1-4 GPUs, limiting deployment to high-value applications. Latent dynamics models achieve real-time inference on CPU, enabling broader deployment.
How do world models handle partial observability and occlusions?
World models address partial observability by maintaining a belief state—a probability distribution over possible environment states consistent with observations. Recurrent architectures like Dreamer use LSTM or GRU cells to integrate observation history, inferring occluded object positions from motion cues. Transformer-based world models attend over long observation sequences, reasoning about object permanence ("the cup is behind the box"). However, world models struggle with long-duration occlusions—if an object is hidden for 50+ timesteps, prediction uncertainty grows and planning performance degrades. Active perception strategies—moving the camera to reduce occlusion—can mitigate this, but require explicit planning over information-gathering actions.
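A minimal belief-state filter is just a recurrent network over the observation history; the sketch below decodes a hypothetical occluded-object position from the latest hidden state, with all sizes illustrative:

```python
# Sketch of a recurrent belief state: a GRU folds the observation history
# into a hidden vector from which occluded quantities can be decoded.
import torch
import torch.nn as nn

class BeliefFilter(nn.Module):
    def __init__(self, obs_dim=64, belief_dim=128):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, belief_dim, batch_first=True)
        self.decode = nn.Linear(belief_dim, 3)   # e.g. occluded object position

    def forward(self, obs_seq):                  # (batch, time, obs_dim)
        beliefs, _ = self.rnn(obs_seq)
        return self.decode(beliefs[:, -1])       # estimate from latest belief

estimate = BeliefFilter()(torch.randn(1, 20, 64))   # 20 observation steps
```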
What evaluation benchmarks exist for world models in robotics?
CALVIN evaluates long-horizon compositional manipulation, testing whether world models can chain 5-10 primitive skills to solve novel tasks. RLBench provides 100 simulated manipulation tasks with procedural variation, enabling large-scale sim-to-real experiments. ManipArena tests reasoning-oriented manipulation over 100-step horizons, exposing world model failure modes on complex tasks. Open X-Embodiment benchmarks cross-embodiment transfer, measuring zero-shot performance on unseen robot platforms. Real-world benchmarks remain sparse—most evaluations use simulation due to the cost and safety risks of large-scale real-robot experiments. Standardized real-world benchmarks analogous to ImageNet for vision are an active area of community effort.
Find datasets covering world models
Truelabel surfaces vetted datasets and capture partners working with world models. Send us the modality, scale, and rights you need, and we'll route you to the closest match.
Browse Physical AI Datasets