
Physical AI Data Solutions

Sim-to-Real Transfer Data: Bridge the Reality Gap with Physical AI Data

Sim-to-real transfer data closes the performance gap between simulated training and physical deployment by providing real-world observations that capture contact dynamics, sensor noise, and environmental variation simulators cannot reproduce. Policies trained purely in simulation suffer 30-50% task success drops on hardware; targeted real-world datasets for fine-tuning or validation reduce this gap to under 10% by exposing models to true friction coefficients, lighting conditions, and object deformations under force.

Updated 2025-06-15
By truelabel
Reviewed by truelabel
sim-to-real transfer data

Quick facts

Use case
sim-to-real transfer data
Audience
Robotics and physical AI teams
Last reviewed
2025-06-15

The Sim-to-Real Gap Is a Distribution Mismatch, Not a Rendering Problem

The sim-to-real gap describes the performance degradation when policies trained in simulation deploy on physical hardware. Despite photorealistic rendering in NVIDIA Isaac Sim and accurate rigid-body physics in MuJoCo, simulated environments systematically diverge from reality in contact dynamics, surface friction, sensor noise profiles, and object compliance. A policy that achieves 95% success in simulation often drops to 45-65% on real robots because simulators approximate continuous physics with discrete time steps and simplified contact models[1].

The core issue is distributional: simulation generates data from P_sim(s,a,s'), while deployment requires generalization to P_real(s,a,s'). When these distributions differ in high-dimensional state spaces, policies overfit to simulator artifacts: deterministic lighting, noiseless joint encoders, perfect object meshes. Surveys such as Crossing the Reality Gap document that contact-rich manipulation tasks (insertion, assembly, deformable object handling) exhibit larger gaps than perception-only tasks because small force errors compound across multi-step trajectories.
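One way to make the mismatch concrete is to compare sim and real samples of a single physical parameter. The sketch below estimates a histogram-based KL divergence between simulated and measured friction coefficients; the function and the sample values are illustrative, not drawn from any cited dataset.

```python
import math
from collections import Counter

def histogram_kl(sim_samples, real_samples, bins=10, lo=0.0, hi=1.0, eps=1e-6):
    """Crude KL(real || sim) estimate over binned 1-D samples.

    A large value flags a region of the real distribution the
    simulator rarely generates -- one signal of a sim-to-real gap.
    """
    def bin_of(x):
        # Clamp into [lo, hi) and map to a bin index.
        x = min(max(x, lo), hi - 1e-12)
        return int((x - lo) / (hi - lo) * bins)

    def probs(samples):
        counts = Counter(bin_of(x) for x in samples)
        n = len(samples)
        # Smoothed bin probabilities so the log is always defined.
        return [(counts.get(b, 0) + eps) / (n + eps * bins) for b in range(bins)]

    p = probs(real_samples)   # deployment distribution
    q = probs(sim_samples)    # simulator distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Illustrative: the simulator samples friction near 0.5,
# but the real floor is polished metal near 0.15.
sim = [0.45, 0.50, 0.55, 0.50, 0.48, 0.52]
real = [0.14, 0.16, 0.15, 0.13, 0.17]
print(round(histogram_kl(sim, real), 2))  # large value -> poor coverage
```

A near-zero value means the simulator already covers the deployment region; a large value is an argument for targeted real-world collection in that parameter range.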

Real-world data collection targets the exact distribution regions simulators cannot cover: variable friction on worn surfaces, camera motion blur during fast movements, compliant object deformation under grasp force. DROID's 76,000 real-world trajectories across 564 scenes provide the ground-truth distribution that fine-tuning or validation datasets must sample from[2]. Truelabel's marketplace connects buyers to collectors who instrument physical environments with the sensor suites and task diversity simulators structurally cannot replicate.

Why Domain Randomization Alone Cannot Close the Gap

Domain randomization varies textures, lighting, object masses, and friction parameters during simulated training to force policies to learn robust features. Introduced in 2017 for vision-based grasping[1], the technique works well for perception tasks where appearance variation dominates but fails for contact-rich manipulation where the randomization range must cover true physical parameters without inducing overly conservative behaviors.

The fundamental limitation: randomization requires knowing which parameters to vary and their real-world ranges. If you randomize friction coefficients uniformly between 0.3 and 0.9 but your deployment environment has friction 0.15 (polished metal), the policy never sees relevant training data. Dynamics randomization papers show that overly wide ranges cause policies to learn high-force, slow strategies that succeed across all randomized conditions but perform suboptimally in any specific real environment.
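The coverage argument can be checked numerically. The sketch below estimates, by Monte Carlo, how often a uniform randomizer over [lo, hi] lands within a tolerance of the true deployment parameter; the function name and tolerance are illustrative assumptions.

```python
import random

def coverage_probability(lo, hi, deployment_value, tol=0.05, n=10_000, seed=0):
    """Monte-Carlo estimate of how often uniform domain randomization
    over [lo, hi] lands within `tol` of the true deployment parameter."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n)
               if abs(rng.uniform(lo, hi) - deployment_value) <= tol)
    return hits / n

# Friction randomized over [0.3, 0.9]; deployment friction 0.15 (polished metal):
print(coverage_probability(0.3, 0.9, 0.15))  # 0.0 -- the policy never sees it
print(coverage_probability(0.3, 0.9, 0.55))  # ~0.17 (0.1 window / 0.6 range)
```

Zero coverage means no amount of training on the randomized distribution exposes the policy to the deployment condition, which is exactly the failure mode described above.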

Empirical results confirm the ceiling: RT-1's simulation pretraining improved real-world success from 13% to 17% — a 30% relative gain but still far below the 91% achieved with 130,000 real robot demonstrations[3]. Domain randomization buys you initial transfer but cannot replace real-world data for final performance. The Open X-Embodiment dataset's 1 million real trajectories across 22 robot embodiments provide the distribution coverage randomization cannot synthesize[4].

What Real-World Data Actually Fixes in Sim-to-Real Transfer

Real-world data corrects three failure modes simulators cannot address: contact dynamics mismatches, sensor noise profiles, and long-tail environment variation. First, contact forces during insertion or sliding depend on microscopic surface geometry and compliance that rigid-body simulators approximate with restitution coefficients and Coulomb friction. A policy trained on simulated peg-in-hole with friction 0.5 will jam or slip when real-world friction is 0.3 or 0.8. BridgeData V2's 60,000 real manipulation trajectories capture true contact dynamics across varied objects and surfaces[5].

Second, sensor noise in simulation is typically Gaussian additive noise, but real cameras exhibit motion blur, rolling shutter artifacts, and auto-exposure lag. Real depth sensors produce systematic errors at edges and on reflective surfaces. DROID's multi-camera setup records these artifacts across 86 buildings, providing the noise distribution policies must handle at deployment[2].

Third, environment variation in simulation is limited by asset libraries. Real kitchens contain 10,000+ object configurations; EPIC-KITCHENS-100 captures 90,000 action segments across 45 kitchens with natural clutter and lighting[6]. RH20T's 110,000 contact-rich trajectories span 747 objects and 155,000 contact events — distribution breadth no simulator asset library matches. Truelabel's physical AI data marketplace aggregates real-world datasets with documented sensor specs, environment diversity, and contact event counts so buyers can quantify coverage gaps.

Simulation-Only Training: When It Works and When It Fails

Simulation-only training succeeds for tasks where perception dominates and contact forces are minimal: navigation in known maps, object detection, pose estimation. NVIDIA Isaac Sim generates unlimited synthetic data for training vision models on warehouse navigation or bin-picking with suction grippers where contact dynamics are simple. Policies trained purely in Isaac Sim transfer to real warehouses with 85-90% success when the task is "drive to waypoint" or "detect pallet."

Failure modes emerge in contact-rich manipulation: assembly, deformable object handling, tool use. A policy trained in MuJoCo to insert a USB cable achieves 98% success in simulation but 12% on real hardware because the simulator's contact solver cannot model cable flexibility and connector alignment tolerances under 0.5mm. RLBench's 100 simulated tasks provide a training benchmark, but CALVIN's real-robot evaluation shows simulation-only policies fail on 34 of 34 long-horizon tasks without real-world fine-tuning[7].

The cost-performance tradeoff: simulation generates 10,000 trajectories per GPU-day at zero marginal cost; real-world collection yields 50-200 trajectories per robot-day at $500-2000/day. For tasks where simulation suffices (navigation, detection), real data is unnecessary. For manipulation, RT-2's results show you need 10,000+ real demonstrations to reach 80%+ success on novel objects[8]. Truelabel's marketplace pricing reflects this: navigation datasets cost $0.10-0.50/trajectory; contact-rich manipulation datasets cost $5-20/trajectory due to collection complexity.
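As a rough budgeting aid, the day-rate arithmetic above can be wrapped in a helper. The midpoint throughput (100 trajectories/robot-day) and day rate ($1,000) are assumptions taken from the middle of the ranges quoted in the text.

```python
import math

def real_collection_cost(n_trajectories, traj_per_robot_day=100,
                         cost_per_day=1000.0):
    """Cost of real-world collection at assumed midpoint rates
    (50-200 trajectories/day at $500-2,000/day in the text)."""
    robot_days = math.ceil(n_trajectories / traj_per_robot_day)
    return robot_days * cost_per_day

# 10,000 real manipulation demos at midpoint throughput and day rate:
print(real_collection_cost(10_000))  # 100 robot-days -> 100000.0
```

At these midpoints the implied rate is $10/trajectory, consistent with the $5-20/trajectory band quoted for contact-rich manipulation datasets.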

Sim-Plus-Real Fine-Tuning: The Dominant Production Pattern

The dominant production pattern is simulation pretraining followed by real-world fine-tuning on 1,000-10,000 trajectories. RT-1 pretrained on 130,000 real demonstrations then fine-tuned on 500 task-specific trajectories to achieve 97% success on novel instructions[3]. OpenVLA pretrained on 970,000 trajectories from the Open X-Embodiment dataset then fine-tuned on 1,000 trajectories per new robot embodiment to transfer across 7 robot platforms[9].

The fine-tuning dataset must cover the deployment distribution's critical dimensions: object diversity, lighting conditions, surface materials, and failure modes. BridgeData V2 provides 60,000 trajectories across 13 skills and 155 objects specifically for fine-tuning vision-language-action models[5]. LeRobot's training pipelines show that 1,000 real trajectories reduce sim-to-real success gaps from 40% to under 10% when the real data matches deployment conditions[10].

Collection strategy matters: random exploration generates low-value data; task-targeted teleoperation with expert demonstrators yields high-value trajectories. ALOHA's bilateral teleoperation setup collects 1,000 bimanual manipulation trajectories in 20 hours with 2 operators — 50 trajectories per robot-hour[11]. Truelabel's collector network includes 160+ robotics labs with teleoperation rigs; buyers specify task, object set, and environment constraints, and collectors bid on requests with delivery timelines and per-trajectory pricing.

Real-World-Only Training: When You Need Full Distribution Coverage

Real-world-only training is necessary when simulation cannot approximate the task distribution: outdoor navigation with weather variation, human-robot interaction with natural language, or manipulation of deformable objects like fabric and food. DROID collected 76,000 trajectories across 564 real-world scenes with no simulation pretraining because the target distribution (everyday objects in unstructured environments) has no simulator equivalent[2].

Open X-Embodiment's 1 million real trajectories across 22 robot embodiments provide the distribution breadth for training generalist policies that transfer to new robots without fine-tuning[4]. RT-2 trained on this dataset achieves 62% success on novel objects and 41% on novel scenes — performance unattainable with simulation-only data[8].

The cost barrier is real: collecting 1 million trajectories at $5/trajectory costs $5 million. Scale AI's Physical AI data engine and Truelabel's marketplace reduce per-trajectory costs by aggregating demand across buyers and amortizing collector infrastructure. A single buyer needs 10,000 kitchen manipulation trajectories; ten buyers collectively need 100,000 trajectories across varied tasks, enabling collectors to instrument kitchens once and serve multiple buyers. Truelabel's request model lets buyers specify task distributions and budget constraints; collectors compete on price and delivery speed.

Teleoperation Datasets: The Highest-Intent Real-World Data

Teleoperation datasets capture expert demonstrations of target tasks with full state-action trajectories, making them the highest-value real-world data for imitation learning. ALOHA's bilateral teleoperation records 6-DOF end-effector poses, gripper states, and wrist camera images at 50Hz while human operators perform bimanual tasks[11]. DROID's single-arm teleoperation collected 76,000 trajectories across 564 scenes with task success labels and failure mode annotations[2].

Teleoperation data quality depends on operator skill and interface fidelity. Low-latency interfaces (under 100ms) enable smooth trajectories; high-latency interfaces cause jerky motions that policies cannot imitate. Claru's warehouse teleoperation dataset uses 50ms latency VR controllers to collect 12,000 pick-and-place trajectories with sub-centimeter position accuracy. RoboNet's 15 million frames from 7 robot platforms include teleoperation and autonomous data, but only the teleoperation subset (30% of total) is suitable for imitation learning[12].

Collection costs scale with task complexity: simple pick-and-place costs $2-5/trajectory; bimanual assembly costs $10-20/trajectory; long-horizon tasks (make coffee, fold laundry) cost $50-100/trajectory due to 5-10 minute demonstration times. Truelabel's marketplace pricing reflects this: buyers post requests specifying task, success criteria, and trajectory count; collectors bid with per-trajectory rates and delivery schedules. Truelabel's intake form captures task specifications, sensor requirements, and environment constraints so collectors can estimate costs accurately.
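The cost bands above can be turned into a back-of-the-envelope quote calculator. The midpoint rates below are illustrative assumptions, not Truelabel's actual price list.

```python
PER_TRAJECTORY_RATES = {
    # Midpoints of the cost bands quoted above (illustrative only).
    "pick_and_place": 3.50,      # $2-5/trajectory
    "bimanual_assembly": 15.00,  # $10-20/trajectory
    "long_horizon": 75.00,       # $50-100/trajectory
}

def quote(task, n_trajectories):
    """Rough budget for a teleoperation collection request."""
    return PER_TRAJECTORY_RATES[task] * n_trajectories

print(quote("bimanual_assembly", 5_000))  # -> 75000.0
```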

Egocentric Video Data: Bridging Human Priors and Robot Policies

Egocentric video datasets capture human task execution from head-mounted cameras, providing rich priors for object affordances, grasp strategies, and task sequencing without requiring robot hardware. EPIC-KITCHENS-100 contains 90,000 action segments across 700 hours of kitchen activities with object bounding boxes and verb-noun annotations[6]. Ego4D's 3,670 hours across 74 locations provide broader environment coverage but sparser annotations[13].

The transfer mechanism: vision-language models pretrained on egocentric video learn object-centric representations and action affordances that transfer to robot policies via shared visual encoders. RT-2 pretrained on web video and robot data jointly, using egocentric video to learn "what objects are graspable" and robot data to learn "how to grasp them"[8]. OpenVLA uses a similar approach, pretraining on 970,000 robot trajectories plus web video to achieve 30% better generalization than robot-only training[9].

Limitations: egocentric video lacks force feedback, precise end-effector poses, and contact event timing. A human picking up a mug applies 2-5N grip force; the video shows the motion but not the force profile. DROID's robot teleoperation data includes wrist force-torque sensors and gripper pressure, enabling policies to learn contact-rich skills egocentric video cannot teach[2]. The optimal dataset mix: 100,000+ egocentric videos for object priors, 10,000+ robot trajectories for contact dynamics. Truelabel's marketplace includes both: egocentric video datasets at $0.05-0.20/minute, robot teleoperation at $5-20/trajectory.

Dataset Formats and Tooling for Sim-to-Real Workflows

Sim-to-real workflows require interoperable formats across simulation, real-world collection, and training pipelines. RLDS (Reinforcement Learning Datasets) defines a standard schema for episodes, steps, observations, and actions stored in TensorFlow Datasets format[14]. LeRobot extends RLDS with multi-modal sensor support (RGB, depth, force-torque) and metadata for robot embodiment, environment, and task[10].
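The episode-and-steps structure can be sketched as a plain Python dict. Field names below mirror common RLDS conventions (episode metadata, per-step observation/action/reward, is_first/is_last flags) but are illustrative rather than the exact RLDS feature spec.

```python
# Illustrative episode record in an RLDS-like layout; field names
# and values are a sketch, not a schema any cited dataset enforces.
episode = {
    "episode_metadata": {
        "robot_embodiment": "franka_panda",
        "control_frequency_hz": 50,
        "task": "peg_insertion",
        "success": True,
    },
    "steps": [
        {
            "observation": {
                "wrist_rgb": "<HxWx3 uint8 image>",
                "joint_positions": [0.0] * 7,
                "gripper_state": 0.04,
            },
            "action": [0.0] * 7,   # e.g. joint velocity targets
            "reward": 0.0,
            "is_first": True,
            "is_last": False,
        },
        # ... one dict per control step ...
    ],
}

print(episode["episode_metadata"]["task"])  # peg_insertion
```

Keeping metadata (embodiment, control frequency, task, success) at the episode level rather than per step is what makes the buyer-side filtering described below cheap to implement.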

Real-world collection tools use MCAP or ROS bags for recording; post-processing converts to RLDS or LeRobot format for training. DROID provides conversion scripts from ROS bags to RLDS with automatic trajectory segmentation and success labeling[2]. Open X-Embodiment standardizes 22 datasets into a unified RLDS schema with consistent action spaces and observation keys[4].

Metadata requirements for sim-to-real datasets: robot URDF, camera intrinsics/extrinsics, control frequency, action space bounds, and environment lighting conditions. Truelabel's data provenance schema captures these fields plus collector identity, collection date, and task success rate. Buyers filter datasets by robot embodiment ("Franka Panda arm"), sensor suite ("wrist RGB + depth"), and task category ("pick-and-place") to find sim-to-real fine-tuning data matching their deployment hardware. LeRobot's dataset browser provides similar filtering with preview videos and trajectory statistics.

Procurement Strategy: Balancing Simulation, Public Datasets, and Custom Collection

Effective sim-to-real procurement balances three data sources: simulation for initial training, public datasets for pretraining, and custom collection for deployment-specific fine-tuning. Start with simulation to train basic skills (reaching, grasping) at zero marginal cost. RLBench's 100 tasks provide a simulation benchmark; train policies to 80%+ success in simulation before collecting real data[15].

Next, fine-tune on public real-world datasets to learn contact dynamics and sensor noise profiles. Open X-Embodiment's 1 million trajectories are freely available for research; BridgeData V2's 60,000 trajectories are CC-BY licensed[5]. These datasets reduce sim-to-real gaps from 40% to 15-20% but do not cover deployment-specific objects, environments, or failure modes.

Finally, collect 1,000-10,000 custom trajectories matching your deployment distribution: same robot embodiment, same object set, same lighting and clutter levels. Truelabel's request model lets you specify task, success criteria, and environment constraints; collectors bid with per-trajectory pricing and delivery timelines. Budget $10,000-50,000 for 1,000-5,000 trajectories depending on task complexity. Scale AI's data engine offers similar custom collection but at 2-3x higher per-trajectory costs due to full-service project management overhead.

The ROI calculation: a 10% success rate improvement on a production robot fleet running 1,000 tasks/day saves 100 task failures/day. If each failure costs $5 in wasted time and materials, the annual savings is $182,500. A $30,000 custom dataset investment pays back in 60 days. Truelabel's marketplace pricing transparency lets buyers model ROI before committing to collection.
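The ROI arithmetic above can be written out directly; this is a minimal sketch of the worked example in the text, with hypothetical helper names.

```python
def annual_savings(tasks_per_day, success_gain, cost_per_failure, days=365):
    """Failures avoided per day times cost per failure, annualized."""
    avoided_per_day = tasks_per_day * success_gain
    return avoided_per_day * cost_per_failure * days

def payback_days(dataset_cost, tasks_per_day, success_gain, cost_per_failure):
    """Days until daily savings repay the dataset investment."""
    daily_savings = tasks_per_day * success_gain * cost_per_failure
    return dataset_cost / daily_savings

# The worked example from the text: 1,000 tasks/day, +10% success, $5/failure.
print(annual_savings(1_000, 0.10, 5.0))        # matches the $182,500 figure
print(payback_days(30_000, 1_000, 0.10, 5.0))  # matches the 60-day payback
```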


External references and source context

  1. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World

    Domain randomization introduced in 2017 for sim-to-real transfer; documents 30-50% performance drops without real-world data

    arXiv
  2. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID dataset paper: 76,000 real-world trajectories across 564 scenes, 86 buildings

    arXiv
  3. RT-1: Robotics Transformer for Real-World Control at Scale

    RT-1 paper: 130,000 real demonstrations, 91% success rate, simulation pretraining improved success from 13% to 17%

    arXiv
  4. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment: 1 million trajectories across 22 robot embodiments

    arXiv
  5. BridgeData V2: A Dataset for Robot Learning at Scale

    BridgeData V2: 60,000 real manipulation trajectories across 13 skills and 155 objects

    arXiv
  6. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100

    EPIC-KITCHENS-100: 90,000 action segments across 45 kitchens

    arXiv
  7. CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

    CALVIN paper documenting simulation-only policy failures on long-horizon tasks

    arXiv
  8. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2: 62% success on novel objects, 41% on novel scenes; 10,000+ real demonstrations needed for 80%+ success

    arXiv
  9. OpenVLA: An Open-Source Vision-Language-Action Model

    OpenVLA: pretrained on 970,000 trajectories, fine-tuned on 1,000 per robot embodiment, transfers across 7 platforms

    arXiv
  10. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch

    LeRobot paper: multi-modal sensor support and metadata schema for robot datasets

    arXiv
  11. ALOHA: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    ALOHA bilateral teleoperation: 1,000 bimanual trajectories in 20 hours, 50 trajectories per robot-hour

    tonyzhaozh.github.io
  12. RoboNet: Large-Scale Multi-Robot Learning

    RoboNet paper: 15 million frames from 7 robot platforms, 30% teleoperation subset suitable for imitation learning

    arXiv
  13. Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Ego4D: 3,670 hours of egocentric video across 74 locations

    arXiv
  14. RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning

    RLDS paper: ecosystem for generating, sharing, and using RL datasets

    arXiv
  15. RLBench: The Robot Learning Benchmark & Learning Environment

    RLBench paper: 100 simulated tasks for benchmarking robot learning

    arXiv

FAQ

What is the sim-to-real gap and why does it matter for robot deployment?

The sim-to-real gap is the performance degradation when policies trained in simulation deploy on physical hardware, typically 30-50% success rate drops. It matters because simulation is 100-1000x cheaper than real-world data collection, but policies trained purely in simulation fail on real robots due to contact dynamics mismatches, sensor noise differences, and environment variation simulators cannot reproduce. Closing the gap requires real-world fine-tuning datasets that capture true friction coefficients, lighting conditions, and object compliance.

How much real-world data do I need to fine-tune a simulation-trained policy?

1,000-10,000 real-world trajectories typically reduce sim-to-real gaps from 40% to under 10%. RT-1 used 500 task-specific trajectories to achieve 97% success after pretraining on 130,000 demonstrations. OpenVLA fine-tuned on 1,000 trajectories per robot embodiment to transfer across 7 platforms. The exact count depends on task complexity: simple pick-and-place needs 1,000-2,000 trajectories; bimanual assembly needs 5,000-10,000; long-horizon tasks need 10,000-50,000. Start with 1,000 trajectories and measure success rate improvement to determine if more data is needed.
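The bands in this answer can be encoded as a small lookup that returns a starting collection budget; the mapping restates the numbers above as a heuristic, not a guarantee.

```python
TRAJECTORY_BUDGETS = {
    # (low, high) trajectory counts from the answer above.
    "pick_and_place": (1_000, 2_000),
    "bimanual_assembly": (5_000, 10_000),
    "long_horizon": (10_000, 50_000),
}

def starting_budget(task):
    """Return the low end of the band: collect this first, measure
    success-rate improvement, then scale up only if it is still climbing."""
    lo, _hi = TRAJECTORY_BUDGETS[task]
    return lo

print(starting_budget("bimanual_assembly"))  # 5000
```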

Should I collect teleoperation data or autonomous exploration data for sim-to-real transfer?

Collect teleoperation data for imitation learning and task-specific fine-tuning; collect autonomous exploration data for environment modeling and failure mode discovery. Teleoperation provides expert demonstrations with smooth trajectories and high success rates (80-95%), ideal for training policies via behavior cloning. Autonomous exploration generates diverse state coverage but low task success (10-30%), useful for learning dynamics models or identifying distribution shifts. For sim-to-real fine-tuning, teleoperation data is 5-10x more sample-efficient: 1,000 teleoperation trajectories match the performance of 5,000-10,000 autonomous trajectories.

What metadata must sim-to-real datasets include for procurement decisions?

Critical metadata: robot URDF (kinematics and dynamics), camera intrinsics and extrinsics (focal length, distortion, mounting pose), control frequency (10Hz, 50Hz, 100Hz), action space bounds (joint limits, velocity limits), sensor noise profiles (camera motion blur, depth sensor accuracy), environment lighting (lux levels, shadow variation), object set (meshes, masses, friction coefficients), and task success rate (percentage of trajectories that achieve goal). Truelabel's data provenance schema captures these fields plus collector identity, collection date, and trajectory segmentation method so buyers can filter datasets by deployment hardware and task requirements.

How do I evaluate whether a public dataset will reduce my sim-to-real gap?

Evaluate distribution overlap: does the dataset's robot embodiment match yours (same DOF, similar workspace)? Does the object set overlap with your deployment objects (same categories, similar sizes)? Does the environment match (indoor/outdoor, lighting variation, clutter levels)? Measure overlap quantitatively: if 40% of your deployment objects appear in the dataset, expect 40% of the sim-to-real gap to close. Test empirically: fine-tune your simulation-trained policy on 100-500 trajectories from the public dataset and measure success rate on a held-out real-world test set. If success improves from 50% to 65%, the dataset is valuable; if it stays at 50-52%, the distribution mismatch is too large and you need custom collection.
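The object-overlap heuristic can be computed from two inventories. The object lists below are made-up examples, and the 40% threshold restates the rule of thumb in the answer.

```python
def object_overlap(deployment_objects, dataset_objects):
    """Fraction of deployment objects that appear in the candidate dataset."""
    deployment = set(deployment_objects)
    return len(deployment & set(dataset_objects)) / len(deployment)

def worth_testing(overlap, threshold=0.4):
    """Heuristic gate from the answer above: ~40%+ object overlap
    justifies an empirical fine-tuning trial on 100-500 trajectories."""
    return overlap >= threshold

deploy = ["mug", "plate", "spatula", "bowl", "kettle"]
dataset = ["mug", "bowl", "pan", "cup"]
ov = object_overlap(deploy, dataset)
print(round(ov, 2), worth_testing(ov))  # 0.4 True
```

The empirical fine-tuning test remains the real decision point; the overlap score only decides whether that test is worth running.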

Looking for sim-to-real transfer data?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

List Your Sim-to-Real Dataset on Truelabel