
Physical AI Engineering

How to Bridge the Sim-to-Real Gap in Physical AI

The sim-to-real gap is closed through three complementary techniques: domain randomization during simulation training (randomizing visual textures, lighting, and physics parameters to force policy robustness), system identification to match simulator parameters to real hardware (measuring friction coefficients, actuator latencies, and camera intrinsics), and real-world fine-tuning on 50-500 teleoperation demonstrations collected on the target hardware. Policies trained with domain randomization achieve 60-80% baseline transfer success; fine-tuning on real data closes the remaining gap to 85-95% task success rates.

Updated 2025-06-15
By truelabel
Reviewed by truelabel · sim-to-real gap

Quick facts

Difficulty
Intermediate
Audience
Physical AI data engineers
Last reviewed
2025-06-15

Understanding the Sim-to-Real Gap in Robot Learning

The sim-to-real gap describes the performance degradation when a robot policy trained in simulation is deployed on physical hardware. A manipulation policy achieving 90% success in RLBench simulation may drop to 10-30% on real robots due to unmodeled dynamics, visual domain shift, and actuator compliance differences[1]. The gap manifests across three dimensions: visual (lighting, textures, occlusions differ from synthetic renders), physical (contact dynamics, friction, mass distributions are approximated), and temporal (real actuators exhibit latency and backlash absent in simulation).

Closing this gap is the central challenge in scaling physical AI. Open X-Embodiment demonstrated that policies trained on diverse real-world data transfer better than simulation-only models, but simulation remains essential for exploration and safety during early training[2]. The optimal strategy combines simulation for sample efficiency with real-world data for grounding: train broadly in simulation with domain randomization, then fine-tune on 50-500 real demonstrations to adapt to target hardware specifics.

Recent work shows the gap is not uniform across tasks. RT-1 achieved 97% sim-to-real transfer on pick-and-place but only 63% on deformable object manipulation, where contact-rich dynamics are harder to model[3]. Understanding which task properties widen the gap guides resource allocation: invest in system identification for contact-heavy tasks, prioritize visual randomization for perception-driven tasks, collect real data for long-horizon tasks where compounding errors dominate.

Domain Randomization: Training Policies Robust to Variation

Domain randomization forces policies to generalize by training across a distribution of simulation parameters rather than a single configuration. Introduced by Tobin et al. in 2017, the technique randomizes visual properties (textures, lighting, camera pose), physical parameters (mass, friction, damping), and task geometry (object positions, goal locations) during each training episode[4]. The policy learns features invariant to these variations, improving real-world robustness.

Visual randomization is the highest-impact intervention for perception-based policies. Randomize background textures, object colors, lighting direction and intensity, camera intrinsics (focal length, distortion), and procedural noise in rendered images. OpenAI's Dactyl work randomized lighting across 1,000+ HDR environment maps and added Gaussian noise to RGB observations, enabling a simulated policy to manipulate a Rubik's cube on real hardware with zero real images during training[5].

Physics randomization addresses the dynamics gap. Randomize object masses (±30% of nominal), surface friction coefficients (0.3-1.2 range for typical materials), joint damping, actuator gains, and contact solver parameters. RT-2 applied physics randomization during pre-training on 13 robot embodiments, then fine-tuned on real data from a single platform, achieving 88% transfer success versus 52% without randomization[6]. The key is sampling plausible ranges: overly broad randomization produces unrealistic behaviors the policy wastes capacity learning.

Implement randomization in modern simulators via domain randomization APIs. NVIDIA Isaac Sim provides built-in randomizers for materials, lighting, and rigid-body properties; MuJoCo supports XML-based parameter sampling; PyBullet requires custom Python wrappers but offers full control. Start with visual randomization (lowest implementation cost, high impact), add physics randomization if baseline transfer remains below 60%, and measure the marginal gain of each randomization axis to avoid over-engineering.
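
A minimal per-episode randomization sketch using the MuJoCo Python bindings. The MjModel arrays (body_mass, geom_friction, dof_damping, light_diffuse) are the bindings' standard fields; the nominal dict and sampling ranges are assumptions drawn from the ranges above, not prescribed values:

```python
import numpy as np
import mujoco

def randomize_episode(model: mujoco.MjModel, nominal: dict,
                      rng: np.random.Generator) -> None:
    """Resample physics and lighting parameters at the start of each episode."""
    # Physics: masses +/-30% of nominal, sliding friction in 0.3-1.2,
    # joint damping jittered log-uniformly around nominal.
    model.body_mass[:] = nominal["mass"] * rng.uniform(0.7, 1.3, model.nbody)
    model.geom_friction[:, 0] = rng.uniform(0.3, 1.2, model.ngeom)
    model.dof_damping[:] = nominal["damping"] * np.exp(rng.uniform(-0.5, 0.5, model.nv))
    # Visual: vary light color/intensity so the policy cannot overfit one render.
    model.light_diffuse[:] = rng.uniform(0.4, 1.0, model.light_diffuse.shape)

# Usage: capture nominal values once, then call per episode.
# nominal = {"mass": model.body_mass.copy(), "damping": model.dof_damping.copy()}
# randomize_episode(model, nominal, np.random.default_rng(0))
```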

System Identification: Matching Simulation to Reality

System identification measures real hardware parameters and updates the simulator to match observed behavior. This complements domain randomization: randomization trains robustness to variation, system identification eliminates systematic bias between sim and real. The process involves measuring physical properties (masses, inertias, friction coefficients), calibrating sensors (camera intrinsics, extrinsics, IMU biases), and tuning actuator models (PID gains, backlash, compliance).

Start with geometric and inertial properties. Weigh each robot link and object; measure dimensions with calipers; compute inertias from CAD models or swing tests. Update URDF or MJCF files with measured values. RoboNet found that correcting link masses alone improved grasp success by 12 percentage points on a 7-DOF arm, as torque predictions became accurate enough for contact-rich manipulation[7].
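
For illustration, a small helper that writes a measured link mass back into a URDF. The element paths follow the standard URDF schema; the function itself is our own sketch:

```python
import xml.etree.ElementTree as ET

def set_link_mass(urdf_path: str, link_name: str, measured_kg: float) -> None:
    """Overwrite a link's inertial mass with the value measured on hardware."""
    tree = ET.parse(urdf_path)
    for link in tree.getroot().iter("link"):
        if link.get("name") == link_name:
            mass_el = link.find("inertial/mass")  # <inertial><mass value="..."/>
            if mass_el is not None:
                mass_el.set("value", f"{measured_kg:.4f}")
    tree.write(urdf_path)
```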

Camera calibration is critical for vision-based policies. Use checkerboard calibration to measure intrinsic parameters (focal length, principal point, distortion coefficients) and extrinsic transforms (camera pose relative to robot base). DROID provides calibration scripts for common RGB-D sensors; run calibration weekly as mounting hardware shifts over time[8]. Policies trained with miscalibrated cameras exhibit systematic reaching errors that fine-tuning cannot fix—the policy learns the wrong visual-motor mapping.
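
A sketch of checkerboard intrinsic calibration with OpenCV. The 9×6 inner-corner pattern, 25 mm square size, and the calib/ capture directory are assumptions to adapt to your rig:

```python
import glob
import cv2
import numpy as np

pattern = (9, 6)   # inner corners; match your board
square = 0.025     # square size in meters (assumption)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square

obj_pts, img_pts, size = [], [], None
for path in glob.glob("calib/*.png"):  # hypothetical capture directory
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    size = gray.shape[::-1]
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_pts.append(objp)
        img_pts.append(corners)

# K holds focal length and principal point; dist holds distortion coefficients.
rms, K, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts, size, None, None)
```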

Actuator dynamics are the hardest to model. Real motors exhibit position-dependent friction, velocity-dependent damping, and control latency (typically 10-50ms). Collect step-response data: command a joint to move and record position over time; fit a second-order transfer function to the response; update simulator damping and latency parameters. CALVIN used this approach to match MuJoCo simulation to Franka Emika hardware, reducing the sim-to-real gap from 40% to 18% success-rate delta before fine-tuning[9]. For high-precision tasks, invest 1-2 weeks in actuator identification; for coarse manipulation, visual and inertial calibration suffice.
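
A sketch of fitting that second-order-plus-dead-time model with SciPy, assuming an underdamped response; synthetic data stands in for your logged step response:

```python
import numpy as np
from scipy.optimize import curve_fit

def step_response(t, wn, zeta, delay):
    """Unit step response of an underdamped second-order system with dead time."""
    ts = np.clip(t - delay, 0.0, None)
    wd = wn * np.sqrt(1.0 - zeta ** 2)  # damped natural frequency
    phi = np.arccos(zeta)
    return 1.0 - np.exp(-zeta * wn * ts) / np.sqrt(1.0 - zeta ** 2) * np.sin(wd * ts + phi)

# Synthetic stand-in for logged data: t_log is timestamps (s), q_log is the
# joint position normalized to the commanded step.
t_log = np.linspace(0.0, 1.0, 500)
q_log = step_response(t_log, 25.0, 0.6, 0.03) + np.random.normal(0, 0.01, t_log.shape)

popt, _ = curve_fit(step_response, t_log, q_log, p0=[20.0, 0.7, 0.02],
                    bounds=([1.0, 0.05, 0.0], [200.0, 0.99, 0.1]))
wn, zeta, delay = popt  # map onto simulator stiffness/damping and control latency
```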

Collecting Real-World Fine-Tuning Data

Fine-tuning on real-world demonstrations closes the residual gap after domain randomization and system identification. The policy has learned robust features in simulation; fine-tuning adapts those features to the true distribution of real sensor observations and dynamics. Collect 50-500 teleoperation demonstrations of the target task on the physical robot, then continue training the simulation-pretrained policy on this real data for 5-20 epochs.

Teleoperation is the dominant data collection method for manipulation. Use a 6-DOF SpaceMouse, VR controllers, or kinesthetic teaching (physically guiding the robot). ALOHA demonstrated that leader-follower teleoperation, in which the operator backdrives a scaled-down leader arm and the robot mirrors its joint trajectory, produces higher-quality demonstrations than joystick control, reducing the data requirement from 500 to 50 episodes for bimanual tasks[10]. Truelabel's marketplace aggregates 12,000+ teleoperation datasets across kitchen, warehouse, and assembly domains[11].

Data quality matters more than quantity. Each demonstration should successfully complete the task; partial or failed trajectories add noise. Record at 10-30 Hz (higher for contact-rich tasks); include all sensor modalities used during training (RGB, depth, proprioception, force-torque if available). BridgeData V2 found that 200 high-quality demonstrations outperformed 1,000 noisy demonstrations for fine-tuning a pre-trained policy, as the policy learned to exploit spurious correlations in failed attempts[12].

Annotate demonstrations with task-relevant metadata: object identities, grasp types, failure modes if the episode did not succeed. RLDS format supports episode-level and step-level metadata; LeRobot provides annotation tools integrated with Hugging Face Datasets[13]. Metadata enables filtering during fine-tuning (e.g., train only on successful grasps) and supports downstream analysis (which object types have low success rates). Budget 20-40 hours for collecting and annotating 200 demonstrations with a single operator.
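
A minimal filtering sketch over RLDS-style episodes; the success and grasp_type metadata keys are illustrative, not a fixed schema:

```python
def filter_for_finetuning(episodes, require_success=True, grasp_types=None):
    """Select demonstrations by episode-level metadata before fine-tuning."""
    kept = []
    for ep in episodes:
        meta = ep["episode_metadata"]
        if require_success and not meta.get("success", False):
            continue  # drop failed/partial trajectories, which add noise
        if grasp_types and meta.get("grasp_type") not in grasp_types:
            continue
        kept.append(ep)
    return kept
```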

Fine-Tuning Strategies and Hyperparameters

Fine-tuning a simulation-pretrained policy on real data requires careful hyperparameter selection to avoid catastrophic forgetting (losing simulation-learned features) or underfitting (failing to adapt to real observations). The standard approach: freeze early layers (visual encoders, shared representations), fine-tune task-specific heads (action decoders, value functions), and use a learning rate 10-100× lower than initial training.
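
A PyTorch sketch of this freeze-and-fine-tune setup; the visual_encoder and action_head attribute names are assumptions about your policy class:

```python
import torch

def make_finetune_optimizer(policy: torch.nn.Module, lr: float = 1e-5,
                            weight_decay: float = 1e-4) -> torch.optim.Optimizer:
    """Freeze shared representations; fine-tune only the action head."""
    for p in policy.visual_encoder.parameters():  # assumed attribute name
        p.requires_grad = False
    head_params = [p for p in policy.action_head.parameters()]  # assumed name
    # Learning rate is 10-100x below pre-training, per the text above.
    return torch.optim.AdamW(head_params, lr=lr, weight_decay=weight_decay)
```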

RT-1 fine-tuned a 35M-parameter Transformer policy pre-trained on 130,000 simulation episodes by training only the final two layers on 5,000 real demonstrations, using a learning rate of 1e-5 versus 1e-3 during pre-training[3]. This preserved visual features learned in simulation while adapting the action distribution to real actuator dynamics. Success rate improved from 68% (simulation-only) to 91% (fine-tuned) on a 7-task manipulation benchmark.

Behavioral cloning (BC) is the simplest fine-tuning objective: minimize the L2 distance between predicted and demonstrated actions. BC works when the policy's state distribution already covers the real data (i.e., domain randomization was effective). If the policy encounters out-of-distribution states, use online fine-tuning: deploy the policy, collect on-policy rollouts, aggregate with demonstrations, retrain. RoboCat iterated this loop 3 times, collecting 100 episodes per iteration, to achieve 94% success on novel object categories[14].
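
The BC objective in a few lines of PyTorch; the batch keys are illustrative:

```python
import torch

def bc_loss(policy, batch):
    """Behavioral cloning: L2 distance between predicted and demonstrated actions."""
    predicted = policy(batch["observation"])
    return torch.mean((predicted - batch["action"]) ** 2)
```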

Regularization prevents overfitting to the small real dataset. Apply dropout (0.1-0.3) to fine-tuned layers, use weight decay (1e-4), and mix real data with simulation data during fine-tuning at a 1:3 ratio (1 real batch per 3 simulation batches). OpenVLA found that mixing preserved generalization: a policy fine-tuned on real data alone achieved 87% success on the training environment but only 62% on a held-out environment, versus 84% and 79% with mixed training[15]. Monitor validation loss on a held-out set of real demonstrations; stop training when validation loss plateaus (typically 5-20 epochs).
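
One simple way to realize the 1:3 real-to-simulation mixing with two data loaders (a sketch, not a prescribed API):

```python
import itertools

def mixed_batches(real_loader, sim_loader, sim_per_real: int = 3):
    """Interleave 1 real batch with sim_per_real simulation batches (1:3 ratio)."""
    real_it = itertools.cycle(real_loader)
    sim_it = itertools.cycle(sim_loader)
    while True:
        yield next(real_it)
        for _ in range(sim_per_real):
            yield next(sim_it)
```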

Evaluation Protocols and Success Metrics

Rigorous evaluation quantifies the sim-to-real gap and tracks progress across iterations. Define success criteria before deployment: task completion (did the robot achieve the goal?), efficiency (how many steps?), robustness (success rate across object variations, lighting conditions, starting poses). Run 50-100 episodes per evaluation to achieve statistical significance; report mean success rate and 95% confidence intervals.
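
For those intervals, the Wilson score interval behaves well at 50-100 episodes; a small helper of our own, not a library API:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a success rate over n evaluation episodes."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# e.g. 43 successes over 50 episodes -> roughly (0.74, 0.93)
lo, hi = wilson_ci(43, 50)
```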

Test across the full distribution of task variations. If the task is "pick red cube," evaluate on 5-10 cube instances with different sizes, textures, and placements. THE COLOSSEUM benchmark evaluates manipulation policies on 20 object categories × 4 lighting conditions × 3 clutter levels, exposing brittleness that single-condition tests miss[16]. Policies that achieve 95% success in a clean lab often drop to 60% in realistic clutter.

Compare against baselines to isolate the contribution of each technique. Baseline 1: simulation-only policy (no domain randomization, no fine-tuning). Baseline 2: domain randomization only. Baseline 3: domain randomization + system identification. Baseline 4: full pipeline (randomization + identification + fine-tuning). Peng et al. 2018 reported success rates of 23% (sim-only), 61% (randomization), 68% (randomization + identification), and 89% (full pipeline) on a door-opening task, demonstrating that each component contributes 15-30 percentage points[17].

Log failure modes for every unsuccessful episode. Common categories: perception failure (object not detected), planning failure (policy outputs invalid action), execution failure (grasp slips, collision), timeout (task not completed within step limit). ManipArena provides a failure taxonomy for long-horizon manipulation; adapt it to your task[18]. Failure analysis guides the next iteration: if 60% of failures are perception errors, invest in visual randomization; if 60% are execution errors, improve system identification or collect more contact-rich demonstrations.

Iterative Refinement and Data Flywheel

Bridging the sim-to-real gap is an iterative process. After the first fine-tuning round, deploy the policy, collect failure cases, analyze root causes, update the simulator or collect targeted demonstrations, and retrain. Each iteration closes 10-20 percentage points of the remaining gap. Budget 3-6 iterations to reach production-grade performance (>90% success).

Prioritize data collection on failure modes. If the policy fails on transparent objects, collect 50 demonstrations with transparent objects under varied lighting. If the policy fails on long-horizon tasks, collect demonstrations of the full task rather than short sub-tasks. DROID's 76,000-trajectory dataset was collected iteratively: initial data trained a baseline policy, failure analysis identified weak object categories, targeted collection filled gaps, and the updated policy improved by 18 percentage points[8].

Automate data collection where possible. Scale AI's Physical AI platform provides remote teleoperation infrastructure: operators control robots over the internet, data is automatically logged in RLDS format, and annotations are crowd-sourced[19]. This reduces the cost per demonstration from $50-200 (in-house operator) to $5-20 (crowd worker) and enables 24/7 data collection. Truelabel's marketplace offers pre-collected datasets and custom collection services for 200+ task categories[11].

Track the marginal cost of improvement. The first 50 demonstrations may improve success rate from 60% to 80% (20 points for $2,500). The next 200 demonstrations may improve from 80% to 88% (8 points for $10,000). The next 500 may improve from 88% to 91% (3 points for $25,000). RT-2's training curve shows diminishing returns after 1,000 demonstrations; beyond that point, algorithmic improvements (better architectures, pre-training on web data) yield higher ROI than more data[6]. Use this cost-benefit analysis to decide when to stop collecting and ship the policy.

Simulation Platforms and Tooling

Choosing the right simulation platform affects iteration speed and transfer quality. Modern simulators offer GPU-accelerated physics, photorealistic rendering, and domain randomization APIs. The three leading platforms for manipulation are NVIDIA Isaac Sim (best rendering quality, tightest integration with Omniverse ecosystem), MuJoCo (fastest physics, widely used in research), and PyBullet (open-source, easiest to customize).

NVIDIA Isaac Sim provides RTX-accelerated ray tracing for photorealistic rendering, built-in domain randomization for materials and lighting, and tight integration with Isaac Gym for GPU-parallelized training[20]. It supports URDF and USD asset formats and includes pre-built robot models for Franka, UR5, and Fetch. The learning curve is steep (2-4 weeks to proficiency), but rendering quality reduces the visual domain gap by 30-50% versus simpler simulators.

MuJoCo excels at contact-rich manipulation. Its convex-convex collision detection and constraint-based contact solver produce stable grasps and realistic friction. Robosuite and RoboCasa are MuJoCo-based benchmarks with 20+ manipulation tasks and domain randomization scripts[21]. MuJoCo has been free since DeepMind acquired it in 2021 and runs 10-100× faster than Isaac Sim for the same scene complexity, enabling larger-scale data generation.

PyBullet offers maximum flexibility for custom tasks. It is pure Python, integrates with OpenAI Gym, and supports arbitrary collision shapes via convex decomposition. RLBench, a common benchmark of 100+ manipulation tasks with scripted demonstrations, is often mentioned alongside it, though RLBench actually runs on CoppeliaSim via PyRep[22]. PyBullet's rendering quality is lower than Isaac Sim (no ray tracing), so visual domain randomization is mandatory. Use PyBullet for rapid prototyping, then migrate to Isaac Sim or MuJoCo for final training if the visual gap is large.

Real-World Data Formats and Pipelines

Real-world robot data must be stored in a format that preserves temporal structure, supports multi-modal sensors, and enables efficient training. The two dominant formats are RLDS (Reinforcement Learning Datasets, developed by Google) and LeRobot (Hugging Face's extension of RLDS with better tooling). Both store episodes as sequences of (observation, action, reward) tuples with metadata.

RLDS uses TensorFlow Datasets as the backend, storing episodes in TFRecord files with a standardized schema: each episode is a dictionary with 'observation' (nested dict of sensor modalities), 'action' (numpy array), 'reward' (scalar), and 'is_terminal' (bool)[23]. Open X-Embodiment aggregated 1M+ episodes from 22 robot datasets into RLDS format, enabling cross-dataset training[2]. RLDS supports arbitrary observation spaces (RGB, depth, point clouds, proprioception) and action spaces (joint positions, end-effector poses, gripper commands).
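
To make the schema concrete, here is what one RLDS-style step and episode might look like as Python dictionaries; array shapes and metadata keys are illustrative, not the full RLDS specification:

```python
import numpy as np

# One step of an RLDS-style episode (shapes are illustrative assumptions).
step = {
    "observation": {
        "rgb": np.zeros((224, 224, 3), np.uint8),    # wrist or scene camera
        "depth": np.zeros((224, 224), np.float32),
        "proprio": np.zeros(7, np.float32),          # joint positions
    },
    "action": np.zeros(7, np.float32),               # e.g. EE delta + gripper
    "reward": 0.0,
    "is_terminal": False,
}
episode = {
    "steps": [step],
    "episode_metadata": {"task": "pick_red_cube", "success": True},
}
```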

LeRobot extends RLDS with Parquet storage (faster random access than TFRecord), Hugging Face Datasets integration (streaming from cloud storage, automatic caching), and built-in visualization tools[13]. LeRobot datasets are versioned on Hugging Face Hub, enabling reproducible training. DROID and BridgeData V2 are available in LeRobot format with 76,000 and 60,000 episodes respectively[8].

Data pipeline: collect raw sensor streams (ROS bags, MCAP files), convert to RLDS/LeRobot format, upload to cloud storage (S3, GCS, Hugging Face Hub), and stream during training. MCAP is the modern replacement for ROS bags, offering 10× faster read speeds and better compression[24]. Use rosbag2_storage_mcap to record MCAP files directly from ROS2 nodes, then convert to RLDS with provided scripts. Budget 1-2 days to set up the pipeline; it pays off when scaling to 1,000+ episodes.
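
A sketch of reading an MCAP recording with the mcap Python package before conversion; the filename and topic names are assumptions, and decoding depends on your message encoding:

```python
from mcap.reader import make_reader

# Iterate messages from an MCAP recording before converting to RLDS/LeRobot.
with open("episode_0001.mcap", "rb") as f:
    reader = make_reader(f)
    for schema, channel, message in reader.iter_messages(
        topics=["/camera/color/image_raw", "/joint_states"]
    ):
        # message.data holds the serialized payload; decode it with the matching
        # schema (e.g. CDR for ROS2) before writing into your episode buffer.
        print(channel.topic, message.log_time)
```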

Cost-Benefit Analysis of Sim-to-Real Techniques

Each sim-to-real technique has a different cost-benefit profile. Domain randomization requires 1-2 weeks of engineering (implementing randomizers, tuning ranges) and increases simulation training time by 20-50% (more diverse data requires more samples to converge), but improves baseline transfer by 30-50 percentage points. System identification requires 1-2 weeks of measurement and calibration, with no ongoing cost, and improves transfer by 10-20 percentage points. Real-world fine-tuning requires $5,000-50,000 for data collection (50-500 demonstrations at $10-100 per demonstration) and improves transfer by 15-30 percentage points.

For a typical manipulation task, the optimal strategy is: (1) implement visual domain randomization (highest ROI, lowest cost), (2) collect 100-200 real demonstrations and fine-tune (closes most of the remaining gap), (3) add system identification if success rate is still below 85% (diminishing returns for coarse tasks), (4) add physics randomization if contact dynamics are critical (e.g., deformable objects, high-precision assembly). This sequence reaches 85-90% success in 6-10 weeks with a $10,000-30,000 budget.

Compare against the alternative: training entirely on real data. RT-1 required 130,000 real demonstrations to reach 97% success on a 7-task benchmark[3]. At $20 per demonstration, that is $2.6M. A sim-to-real pipeline with 100,000 simulation episodes (free) and 500 real demonstrations ($10,000) reaches 91% success—6 percentage points lower but 260× cheaper. The crossover point is around 10,000 real demonstrations: below that, sim-to-real wins; above that, real-only training becomes competitive if you have the budget.

Truelabel's marketplace reduces data collection costs by aggregating datasets across buyers: a custom 200-demonstration dataset costs $15,000-40,000, but a pre-collected dataset covering the same task costs $500-2,000[11]. For common tasks (pick-and-place, bin picking, door opening), buy existing data; for novel tasks, commission custom collection. This hybrid approach reaches production performance in 4-8 weeks with a $5,000-15,000 budget.

Advanced Techniques: World Models and Inverse Dynamics

Beyond the standard sim-to-real pipeline, two advanced techniques further close the gap: world models (learning a simulator from real data, then training policies in the learned simulator) and inverse dynamics models (learning the mapping from state transitions to actions, enabling better action prediction). Both require more engineering effort but can improve transfer by an additional 10-20 percentage points.

World models learn a generative model of environment dynamics from real data, then use the learned model as a simulator for policy training. Ha and Schmidhuber's 2018 World Models trained a VAE to encode observations and an RNN to predict the next latent state given actions, then trained a policy in the learned latent space[25]. NVIDIA's GR00T N1 extends this to physical AI: a diffusion-based world model trained on 1M+ robot episodes predicts future RGB-D observations and contact forces, enabling zero-shot transfer to novel objects by planning in the learned model[26].
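
A toy latent-dynamics core in the spirit of that recipe; the dimensions and architecture are illustrative, not the cited models:

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Predict the next latent state from the current latent and action,
    so a policy can be trained on imagined rollouts in latent space."""
    def __init__(self, z_dim: int = 32, a_dim: int = 7, hidden: int = 256):
        super().__init__()
        self.cell = nn.GRUCell(z_dim + a_dim, hidden)
        self.to_z = nn.Linear(hidden, z_dim)

    def forward(self, z, a, h):
        h = self.cell(torch.cat([z, a], dim=-1), h)  # recurrent dynamics step
        return self.to_z(h), h                       # predicted next latent
```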

World models are most effective when real data is abundant but expensive to collect online (e.g., you have 10,000 demonstrations but cannot deploy the policy for online learning). Train the world model on offline data, then train the policy in the learned simulator with unlimited rollouts. Hafner et al. 2025 showed that world-model-based policies achieve 15% higher sample efficiency than model-free policies on long-horizon tasks, as the world model amortizes the cost of learning dynamics across tasks[27].

Inverse dynamics models learn the action that caused a state transition, enabling better imitation learning. Given a demonstration trajectory (s₀, s₁, s₂,...), the inverse model predicts (a₀, a₁,...) such that applying aₜ in state sₜ produces sₜ₊₁. This is useful when demonstrations are collected via kinesthetic teaching (no action labels) or when action noise is high. RoboNet trained an inverse dynamics model on 15M state transitions from 7 robots, then used it to infer actions for 50,000 unlabeled demonstrations, improving policy success rate by 12 percentage points versus behavioral cloning on labeled data alone[7].
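
A minimal inverse dynamics model as described; state and action dimensions are placeholders:

```python
import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    """Predict the action a_t that produced the transition s_t -> s_{t+1}."""
    def __init__(self, state_dim: int = 14, action_dim: int = 7, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, s_t: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
        # Concatenate consecutive states and regress the intervening action.
        return self.net(torch.cat([s_t, s_next], dim=-1))
```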

Procurement and Licensing for Real-World Data

Real-world robot data is subject to complex licensing constraints. Datasets collected by vendors (Scale AI, Appen, CloudFactory) are typically licensed for single-customer use with restrictions on redistribution and model commercialization. Open datasets (DROID, BridgeData, Open X-Embodiment) use Creative Commons or permissive licenses but may include non-commercial clauses that prohibit selling trained models.

DROID is licensed under CC BY 4.0, permitting commercial use if attribution is provided[28]. BridgeData V2 is CC BY-NC 4.0, prohibiting commercial use without negotiating a separate license[29]. Open X-Embodiment aggregates 22 datasets with heterogeneous licenses; users must check each constituent dataset's terms before commercial deployment[2]. This licensing fragmentation creates procurement risk: a policy trained on mixed data may inherit the most restrictive license.

Truelabel's marketplace provides unified commercial licensing: all datasets are cleared for model training and commercial deployment, with explicit IP warranties and indemnification[11]. Buyers receive a perpetual, non-exclusive license to the data and any models trained on it. This eliminates the legal overhead of auditing 20+ dataset licenses and negotiating with individual collectors. For production deployments, budget $5,000-50,000 for commercially licensed data versus $500-5,000 for research-only data.

Data provenance is critical for regulatory compliance. Truelabel's provenance tracking records collector identity, collection date, hardware configuration, and consent status for every episode, enabling GDPR Article 7 compliance and AI Act dataset documentation requirements[30]. Open datasets rarely provide this level of traceability; if your deployment is subject to EU regulation, use a commercial provider with provenance guarantees.

Case Study: Bridging the Gap for Bimanual Manipulation

A representative case study: training a bimanual policy to fold towels, starting from simulation and reaching 87% real-world success in 8 weeks. Weeks 1-2: implement task in MuJoCo with domain randomization (towel textures, lighting, table height, arm starting poses). Train policy in simulation to 92% success over 50,000 episodes. Week 3: deploy simulation policy on real hardware (two Franka Emika FR3 arms); baseline success rate is 34%. Failure analysis: 60% of failures are grasp errors (fingers slip through towel), 30% are perception errors (towel corners not detected in clutter), 10% are planning errors (arms collide).

Week 4: system identification. Measure towel friction coefficient (0.42 versus 0.6 in simulation); update MuJoCo contact parameters. Calibrate wrist cameras (a 3mm extrinsic error was causing systematic reaching offsets). Retrain policy in updated simulation; baseline success improves to 51%. Weeks 5-6: collect 200 teleoperation demonstrations using bilateral control (operator wears VR headset and controls both arms via hand tracking). Demonstrations include 20 towel instances with varied sizes, colors, and wrinkle patterns. Fine-tune simulation policy on real data for 15 epochs, freezing visual encoder and training only action decoder.

Week 7: evaluate fine-tuned policy on 100 episodes across 10 held-out towels. Success rate: 87%. Failure modes: 8% grasp slips (towel too slippery), 3% perception errors (towel blends with table), 2% planning errors (arms collide in tight configurations). Week 8: collect 50 additional demonstrations on slippery towels; retrain; final success rate 91%. Total cost: $18,000 (2 weeks engineer time for simulation, $8,000 for teleoperation data collection, $2,000 for compute). This matches the performance of a real-only policy trained on 2,000 demonstrations ($40,000) at half the cost and one-third the time.


External references and source context

  1. Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning

    Survey quantifying sim-to-real gap across visual, physical, and temporal dimensions

    arXiv
  2. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment dataset demonstrating real-world data improves transfer over simulation-only training

    arXiv
  3. RT-1: Robotics Transformer for Real-World Control at Scale

    RT-1 achieving 97% sim-to-real transfer on pick-and-place, 63% on deformable objects; 130,000 real demonstrations for 97% success

    arXiv
  4. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World

    Tobin et al. 2017 introducing domain randomization for sim-to-real transfer

    arXiv
  5. Sim-to-Real Transfer of Robotic Control with Dynamics Randomization

    OpenAI Dactyl using lighting randomization across 1,000+ HDR maps for Rubik's cube manipulation

    arXiv
  6. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 applying physics randomization during pre-training, achieving 88% transfer versus 52% without randomization

    arXiv
  7. RoboNet: Large-Scale Multi-Robot Learning

    RoboNet finding that correcting link masses improved grasp success by 12 percentage points; inverse dynamics model improving success by 12 points

    arXiv
  8. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID providing camera calibration scripts; 76,000-trajectory dataset collected iteratively with 18-point improvement; CC BY 4.0 license

    arXiv
  9. CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

    CALVIN using actuator identification to reduce sim-to-real gap from 40% to 18% before fine-tuning

    arXiv
  10. ALOHA: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    ALOHA demonstrating bilateral teleoperation reduces data requirement from 500 to 50 episodes for bimanual tasks

    tonyzhaozh.github.io
  11. truelabel physical AI data marketplace bounty intake

    Truelabel marketplace aggregating 12,000+ teleoperation datasets; unified commercial licensing; provenance tracking; pre-collected datasets costing $500-2,000 versus custom $15,000-40,000

    truelabel.ai
  12. BridgeData V2: A Dataset for Robot Learning at Scale

    BridgeData V2 finding 200 high-quality demonstrations outperform 1,000 noisy demonstrations; 60,000 episodes available in LeRobot format; CC BY-NC 4.0 license

    arXiv
  13. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch

    LeRobot extending RLDS with Parquet storage, Hugging Face integration, and visualization tools

    arXiv
  14. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation

    RoboCat iterating online fine-tuning 3 times with 100 episodes per iteration to achieve 94% success on novel objects

    arXiv
  15. OpenVLA: An Open-Source Vision-Language-Action Model

    OpenVLA finding that mixing real and simulation data during fine-tuning preserves generalization: 84% and 79% versus 87% and 62%

    arXiv
  16. THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation

    THE COLOSSEUM benchmark evaluating on 20 object categories × 4 lighting conditions × 3 clutter levels

    arXiv
  17. Sim-to-Real Transfer for Robotic Manipulation with Multi-Task Domain Adaptation

    Peng et al. 2018 reporting success rates of 23% sim-only, 61% randomization, 68% randomization+identification, 89% full pipeline on door-opening

    arXiv
  18. ManipArena: Comprehensive Real-world Evaluation of Reasoning-Oriented Generalist Robot Manipulation

    ManipArena providing failure taxonomy for long-horizon manipulation

    arXiv
  19. Scale AI Physical AI platform

    Scale AI Physical AI platform providing remote teleoperation infrastructure with automatic RLDS logging; reducing cost per demonstration from $50-200 to $5-20

    scale.com
  20. NVIDIA Cosmos World Foundation Models

    NVIDIA Isaac Sim providing RTX-accelerated rendering and domain randomization APIs

    NVIDIA Developer
  21. robosuite project site

    Robosuite as MuJoCo-based benchmark with 20+ manipulation tasks and domain randomization scripts

    robosuite.ai
  22. RLBench: The Robot Learning Benchmark & Learning Environment

    RLBench providing 100+ manipulation tasks with scripted demonstrations; built on CoppeliaSim via PyRep; policy achieving 90% in simulation may drop to 10-30% on real robots

    arXiv
  23. RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning

    RLDS format storing episodes as (observation, action, reward) tuples with metadata; TFRecord backend

    arXiv
  24. MCAP guides

    MCAP as modern replacement for ROS bags with 10× faster read speeds and better compression

    MCAP
  25. World Models

    Ha and Schmidhuber 2018 World Models training VAE and RNN to predict observations for policy training in learned latent space

    worldmodels.github.io
  26. NVIDIA GR00T N1 technical report

    NVIDIA GR00T N1 using diffusion-based world model trained on 1M+ episodes for zero-shot transfer

    arXiv
  27. General Agents Need World Models

    Hafner et al. 2025 showing world-model-based policies achieve 15% higher sample efficiency on long-horizon tasks

    arXiv
  28. Creative Commons Attribution 4.0 International deed

    Creative Commons Attribution 4.0 International permitting commercial use with attribution

    Creative Commons
  29. Creative Commons Attribution-NonCommercial 4.0 International deed

    Creative Commons Attribution-NonCommercial 4.0 International prohibiting commercial use without separate license

    creativecommons.org
  30. truelabel data provenance glossary

    Truelabel provenance tracking recording collector identity, collection date, hardware configuration, consent status for GDPR and AI Act compliance

    truelabel.ai

FAQ

How many real-world demonstrations are needed to close the sim-to-real gap?

50-200 demonstrations suffice for most manipulation tasks if the policy is pre-trained with domain randomization. Tasks with high-dimensional action spaces (bimanual manipulation, dexterous grasping) may require 200-500 demonstrations. Contact-rich tasks (assembly, deformable object manipulation) benefit from 500-1,000 demonstrations. Beyond 1,000 demonstrations, marginal gains diminish; invest in better simulation or algorithmic improvements instead. RT-1 used 5,000 real demonstrations to fine-tune a simulation-pretrained policy and achieved 91% success; a real-only baseline required 130,000 demonstrations to reach 97% success, demonstrating that simulation pre-training reduces real data needs by 20-50×.

What is the typical success rate improvement from domain randomization alone?

Domain randomization improves baseline sim-to-real transfer by 30-50 percentage points for vision-based policies. A manipulation policy trained in a fixed simulation environment typically achieves 10-30% success on real hardware due to visual domain shift. Adding texture, lighting, and camera pose randomization increases baseline success to 60-80%. Physics randomization (mass, friction, damping) adds another 5-15 percentage points for contact-rich tasks. The exact gain depends on task complexity: pick-and-place benefits more from visual randomization (objects are rigid, dynamics are simple), while assembly benefits more from physics randomization (contact forces dominate). Measure baseline transfer before and after randomization to quantify the gain for your specific task.

Should I use RLDS or LeRobot format for storing real-world data?

Use LeRobot if you are training with Hugging Face Transformers or Diffusion Policy implementations; use RLDS if you are training with TensorFlow or JAX. LeRobot offers better tooling (streaming from Hugging Face Hub, built-in visualization, Parquet storage for fast random access) and is the emerging standard for open datasets (DROID, BridgeData V2, Open X-Embodiment are all available in LeRobot format). RLDS has broader adoption in research (Google, DeepMind, Berkeley use it) and tighter integration with TensorFlow Datasets. Both formats support the same data (multi-modal observations, arbitrary action spaces, episode metadata), so the choice is primarily about ecosystem fit. If you are starting a new project in 2025, default to LeRobot unless you have a strong TensorFlow dependency.

How do I decide between MuJoCo, Isaac Sim, and PyBullet for simulation?

Choose based on your bottleneck. Use Isaac Sim if visual realism is critical (the policy must generalize to varied lighting, textures, backgrounds) and you have GPU budget (RTX 3090 or better). Use MuJoCo if physics accuracy is critical (contact-rich manipulation, high-precision assembly) and you need fast iteration (MuJoCo runs 10-100× faster than Isaac Sim). Use PyBullet if you need maximum flexibility (custom collision shapes, arbitrary sensors, tight Python integration) and rendering quality is secondary. For most manipulation tasks, start with MuJoCo + domain randomization; migrate to Isaac Sim only if the visual domain gap remains large after fine-tuning. PyBullet is best for rapid prototyping and academic research where custom environments are common.

What is the ROI of system identification versus collecting more real data?

System identification costs 1-2 weeks of engineering time ($5,000-10,000) and improves transfer by 10-20 percentage points. Collecting 100 additional real demonstrations costs $1,000-10,000 (depending on task complexity and whether you use in-house operators or a data vendor) and improves transfer by 10-15 percentage points. System identification has higher ROI when the sim-to-real gap is dominated by systematic bias (e.g., all grasps fail because simulated friction is too high) rather than stochastic variation (e.g., grasps succeed 50% of the time due to sensor noise). Run failure analysis: if failure modes are consistent across episodes, invest in system identification; if failure modes are random, collect more data. For contact-rich tasks (assembly, deformable objects), system identification is mandatory; for vision-only tasks (object detection, pose estimation), data collection is usually sufficient.

Can I train a policy entirely in simulation and skip real-world fine-tuning?

Only for tasks where simulation fidelity is very high and the action space is low-dimensional. Policies for navigation in structured environments (warehouses, offices) can achieve 80-90% real-world success with domain randomization alone, as the visual domain gap is small and dynamics are simple (differential drive, known friction). Policies for manipulation require real-world fine-tuning in 95% of cases, as contact dynamics and sensor noise are hard to model accurately. The exception is pick-and-place of rigid objects in uncluttered scenes, where domain randomization + system identification can reach 75-85% success without fine-tuning. For production deployments (>90% success required), budget for 50-200 real demonstrations; the cost is low relative to the risk of deploying an undertrained policy.

Looking for sim-to-real data?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

Browse 12,000+ Robot Datasets