Implementation Guide
How to Generate Synthetic Robot Data for Physical AI Training
Synthetic robot data generation combines physics simulation (MuJoCo, Isaac Gym, PyBullet) with domain randomization to produce training episodes at 10,000+ per hour on GPU clusters. Teams implement visual randomization (lighting, textures, camera poses) and physical randomization (mass, friction, actuator noise) to bridge the sim-to-real gap, then validate transfer quality by measuring task success rates on real hardware. Optimal training mixes 60-80% synthetic episodes with 20-40% real teleoperation data to achieve 85-92% real-world success rates across manipulation benchmarks.
Quick facts
- Difficulty: Intermediate
- Audience: Physical AI data engineers
- Last reviewed: 2025-06-15
Why Synthetic Data Generation Matters for Robot Learning
Physical AI teams face a data acquisition bottleneck: real-world robot teleoperation produces 50-200 episodes per day per robot, while modern manipulation policies require 10,000-100,000+ episodes for generalization[1]. Google's RT-1 model trained on 130,000 real robot demonstrations across 700 tasks, representing months of continuous data collection[2]. Synthetic data generation breaks this constraint by producing episodes at 100-1,000× real-time speed through parallelized physics simulation.
Domain randomization — systematic variation of visual and physical parameters during simulation — enables policies trained purely on synthetic data to transfer to real hardware with 70-85% success rates on pick-and-place tasks[3]. Teams at DeepMind, Google Research, and UC Berkeley now routinely generate 500,000-2,000,000 synthetic episodes per manipulation skill, then fine-tune on 1,000-5,000 real demonstrations to achieve state-of-the-art performance. The DROID dataset combined 76,000 real teleoperation trajectories with synthetic augmentation to train policies generalizing across 564 objects and 86 skills[4].
Synthetic generation also enables systematic evaluation: RLBench provides 100 simulated manipulation tasks with procedurally generated object variations, allowing reproducible benchmarking without physical robot access[5]. Commercial teams use synthetic data to pre-train foundation models before deploying expensive real-robot data collection infrastructure.
Select and Configure Your Physics Simulator
Three simulators dominate robot learning workflows: MuJoCo (contact-rich manipulation), Isaac Gym (massively parallel GPU simulation), and PyBullet (open-source baseline). MuJoCo 3.0 delivers the most accurate contact physics for tasks involving sliding, rolling, insertion, and deformable objects, running at 500-2,000 simulation steps per second on CPU. Robosuite and RoboCasa provide pre-built MuJoCo environments for kitchen and tabletop manipulation with URDF robot models and procedurally generated object arrangements.
Isaac Gym parallelizes 4,096-16,384 environments on a single NVIDIA A100 GPU, generating manipulation episodes at 10,000+ per hour — 50× faster than CPU-based MuJoCo[6]. This throughput advantage makes Isaac Gym the preferred choice for reinforcement learning workflows requiring millions of episodes. NVIDIA's Cosmos platform extends Isaac Sim with photorealistic ray-traced rendering at 5-15 FPS per environment, producing RGB-D data indistinguishable from real camera feeds under controlled lighting[7].
Configuration workflow: import your robot's URDF model and verify joint limits, link collision meshes, and actuator parameters match the real system within 2% tolerance. Run forward kinematics validation by commanding 50 joint configurations and comparing simulated end-effector poses against analytical solutions — discrepancies above 1mm indicate mesh or joint-axis errors. Import task object meshes from 3D scans or CAD models, then set material properties: coefficient of friction 0.3-0.8 for plastics, 0.1-0.3 for metals, restitution 0.1-0.4 for typical manipulation objects. ManiSkill provides 2,000+ pre-configured object models with validated physics parameters.
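A minimal sketch of the forward-kinematics check, assuming the MuJoCo Python bindings, a hypothetical robot.xml model, an end-effector body named ee_link, and a user-supplied analytical FK function (all names illustrative):

```python
import mujoco
import numpy as np

def analytical_fk(q):
    """Stand-in: replace with your robot's closed-form forward kinematics."""
    return np.zeros(3)

model = mujoco.MjModel.from_xml_path("robot.xml")  # hypothetical model file
data = mujoco.MjData(model)
rng = np.random.default_rng(0)

errors = []
for _ in range(50):
    # Sample a configuration within joint limits (assumes an arm whose
    # joints are all limited hinges, so nq == njnt).
    q = rng.uniform(model.jnt_range[:, 0], model.jnt_range[:, 1])
    data.qpos[: q.shape[0]] = q
    mujoco.mj_forward(model, data)              # kinematics only, no stepping
    sim_pos = data.body("ee_link").xpos.copy()  # simulated end-effector position
    errors.append(np.linalg.norm(sim_pos - analytical_fk(q)))

# Discrepancies above 1mm indicate mesh or joint-axis errors.
print(f"max FK error: {max(errors) * 1000:.2f} mm")
```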
Implement Domain Randomization for Sim-to-Real Transfer
Domain randomization systematically varies simulation parameters to force policies to learn robust features that transfer to real hardware. Visual randomization targets camera parameters (focal length ±10%, sensor noise σ=0.01-0.05), lighting (directional intensity 100-1,000 lux, color temperature 2,700-6,500K), textures (procedural or sampled from ImageNet), and background scenes. Tobin et al. demonstrated that randomizing 15-20 visual parameters during training enables policies to achieve 85% real-world success rates without seeing a single real image[3].
Physical randomization varies object mass (±20%), friction coefficients (±30%), actuator gains (±15%), and sensor latency (10-50ms uniform distribution). For contact-rich tasks, randomize contact stiffness and damping parameters within physically plausible ranges validated against real force-torque measurements. Peng et al. showed that dynamics randomization alone improves real-world transfer by 25-40 percentage points on locomotion and manipulation tasks[8].
Implementation: wrap your simulation environment in a randomization layer that samples parameters from specified distributions at episode reset. Log all randomization values to enable post-hoc analysis of which parameter ranges correlate with successful real-world transfer. The LeRobot framework provides reference implementations for visual and physical randomization compatible with MuJoCo and Isaac Gym environments[9]. Target 10,000-50,000 episodes with full randomization before evaluating on real hardware.
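A minimal sketch of such a layer as a Gymnasium wrapper around a MuJoCo-backed environment; the model attribute path and parameter ranges below are assumptions mirroring the figures above, so adapt them to your environment's internals:

```python
import numpy as np
import gymnasium as gym

class DomainRandomizationWrapper(gym.Wrapper):
    """Resamples physical parameters at every episode reset and logs them."""

    def __init__(self, env, seed=0):
        super().__init__(env)
        self.rng = np.random.default_rng(seed)
        model = self.env.unwrapped.model            # assumes a MuJoCo-backed env
        self.nominal_mass = model.body_mass.copy()  # cache nominal values once
        self.nominal_friction = model.geom_friction.copy()
        self.last_params = {}

    def reset(self, **kwargs):
        model = self.env.unwrapped.model
        self.last_params = {
            "mass_scale": self.rng.uniform(0.8, 1.2),      # +/-20% object mass
            "friction_scale": self.rng.uniform(0.7, 1.3),  # +/-30% friction
            "latency_s": self.rng.uniform(0.010, 0.050),   # 10-50ms sensor latency
        }
        model.body_mass[:] = self.nominal_mass * self.last_params["mass_scale"]
        model.geom_friction[:] = self.nominal_friction * self.last_params["friction_scale"]
        # Latency would be applied via an observation delay buffer in step(),
        # omitted here for brevity.
        obs, info = self.env.reset(**kwargs)
        info["randomization"] = dict(self.last_params)  # log for post-hoc analysis
        return obs, info
```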
Generate Photorealistic Visual Data at Scale
Photorealistic rendering bridges the visual domain gap between simulation and real camera feeds. Ray-tracing engines (NVIDIA RTX, Blender Cycles) produce physically accurate lighting, shadows, reflections, and material appearance at 1-15 FPS depending on scene complexity and GPU hardware. NVIDIA Cosmos combines Isaac Sim's physics with RTX rendering to generate RGB-D-segmentation tuples at 5-10 FPS per environment on RTX 4090 GPUs, sufficient for generating 50,000-100,000 photorealistic episodes per week on a 10-GPU cluster[7].
Texture and material libraries determine realism: procedural texture generators (Substance Designer, Blender shader nodes) create infinite variations of wood, metal, fabric, and plastic surfaces. For manipulation tasks, source 500-2,000 high-resolution (2K-4K) texture maps from Polyhaven or Quixel Megascans, then apply random UV mapping and scale variations during episode generation. Background scene randomization samples from 100-500 HDRI environment maps to vary ambient lighting and reflections.
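A hedged sketch of per-episode visual asset sampling; the directory layout, file extensions, and ranges are illustrative assumptions:

```python
import pathlib
import random

def sample_visual_assets(texture_dir, hdri_dir, rng):
    """Pick one texture, a random UV mapping, and one HDRI environment per episode."""
    textures = sorted(pathlib.Path(texture_dir).glob("*.png"))
    hdris = sorted(pathlib.Path(hdri_dir).glob("*.hdr"))
    return {
        "texture": str(rng.choice(textures)),
        "uv_scale": rng.uniform(0.5, 2.0),       # random texture scale variation
        "uv_offset": (rng.random(), rng.random()),
        "hdri": str(rng.choice(hdris)),          # varies ambient lighting/reflections
    }

assets = sample_visual_assets("textures/", "hdris/", random.Random(42))
```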
BridgeData V2 demonstrated that mixing 80% photorealistic synthetic data with 20% real teleoperation data achieves 92% of the performance of training on 100% real data, while reducing real-robot collection time by 75%[10]. For teams without ray-tracing infrastructure, domain randomization with simple Lambertian shading still enables 70-80% real-world transfer rates. The RLDS format provides a standardized schema for storing multi-modal synthetic episodes with metadata tracking randomization parameters[11].
Structure Episodes in RLDS or MCAP Format
Episode storage format determines downstream training compatibility and metadata preservation. RLDS (Reinforcement Learning Datasets) wraps TensorFlow Datasets with a trajectory-centric schema: each episode contains observation dictionaries (RGB, depth, proprioception), action vectors, reward scalars, and arbitrary metadata fields[11]. RLDS episodes serialize to TFRecord files with automatic sharding and compression, enabling efficient streaming during distributed training. Open X-Embodiment standardized on RLDS for its 1M+ episode dataset spanning 22 robot embodiments[1].
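A sketch of one episode in that trajectory-centric shape, as plain nested Python dicts; the observation keys and shapes are illustrative, while the step flags follow RLDS conventions:

```python
import numpy as np

def make_episode(horizon=100):
    """One RLDS-style episode as nested dicts (observation keys illustrative)."""
    steps = []
    for t in range(horizon):
        steps.append({
            "observation": {
                "rgb": np.zeros((256, 256, 3), np.uint8),   # camera frame
                "depth": np.zeros((256, 256), np.float32),  # depth map
                "proprio": np.zeros(7, np.float32),         # joint positions
            },
            "action": np.zeros(7, np.float32),
            "reward": 0.0,
            "is_first": t == 0,            # RLDS episode-boundary flags
            "is_last": t == horizon - 1,
            "is_terminal": False,          # True only on true termination
        })
    return {
        "steps": steps,
        # Episode-level metadata: randomization seeds, task ID, success label.
        "episode_metadata": {"seed": 42, "task_id": "pick_place", "success": True},
    }
```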
For real-time logging and ROS integration, MCAP provides a self-describing container format storing timestamped messages with embedded schemas. MCAP files support random access, incremental writes, and zero-copy deserialization — critical for 10,000+ episode datasets exceeding 1TB[12]. The format natively handles multi-modal streams (camera, LiDAR, force-torque, joint states) with microsecond timestamp precision.
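A minimal logging sketch using the mcap Python package with JSON encoding for brevity (production pipelines typically register Protobuf or ROS 2 schemas instead):

```python
import json
import time
from mcap.writer import Writer

with open("episode_000.mcap", "wb") as f:
    writer = Writer(f)
    writer.start()
    schema_id = writer.register_schema(
        name="JointState",
        encoding="jsonschema",
        data=json.dumps({"type": "object"}).encode(),
    )
    channel_id = writer.register_channel(
        schema_id=schema_id, topic="/joint_states", message_encoding="json"
    )
    t_ns = time.time_ns()  # nanosecond timestamps per message
    writer.add_message(
        channel_id=channel_id,
        log_time=t_ns,
        publish_time=t_ns,
        data=json.dumps({"position": [0.0] * 7}).encode(),
    )
    writer.finish()
```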
Metadata requirements: log simulator version, randomization parameter distributions, episode success/failure labels, and task identifiers. Include domain randomization seeds to enable deterministic replay during debugging. Truelabel's data provenance framework extends RLDS with lineage tracking: which base meshes, texture libraries, and randomization policies contributed to each episode[13]. This audit trail becomes critical when diagnosing sim-to-real transfer failures or validating dataset composition for model cards.
Validate Sim-to-Real Transfer Quality
Sim-to-real validation quantifies the performance gap between policies trained on synthetic data and their real-world execution. Baseline protocol: train a policy on 10,000-50,000 synthetic episodes, then evaluate on 100-500 real-world trials measuring task success rate, completion time, and failure modes. A 15-25 percentage point gap is typical for manipulation tasks with domain randomization; gaps above 40 points indicate insufficient randomization coverage or simulator physics errors[14].
Diagnostic workflow: record real-world failure episodes, then replay the same initial conditions in simulation to identify discrepancies. Common failure modes include contact dynamics mismatches (object slipping during grasp), visual appearance gaps (specular highlights, transparent materials), and unmodeled compliance (cable deformation, soft object squashing). Zhao et al.'s survey documents 30+ sim-to-real transfer techniques with measured performance impacts across locomotion and manipulation domains[14].
Iterative refinement: increase randomization ranges for parameters correlated with real-world failures, add unmodeled dynamics (cable drag, air resistance) to the simulator, or collect 1,000-5,000 real demonstrations for fine-tuning. RT-2 combined 130,000 real robot episodes with 2M synthetic episodes, then fine-tuned on 6,000 real demonstrations to achieve 90% success rates on 50+ manipulation skills[15]. The Truelabel marketplace connects teams needing real validation data with collectors operating 500+ robot platforms across 40 task categories.
Optimal Mixing Ratios for Synthetic and Real Data
Training performance peaks at 60-80% synthetic data mixed with 20-40% real teleoperation demonstrations, not 100% synthetic. BridgeData V2 experiments showed that pure synthetic training achieved 78% real-world success rates, while 70% synthetic + 30% real reached 92% — a 14 percentage point improvement from just 3,000 real episodes mixed with 7,000 synthetic[10]. The real data provides a distribution anchor preventing the policy from exploiting simulator artifacts.
Mixing strategies: uniform sampling (each batch contains 70% synthetic, 30% real episodes), curriculum learning (train on 100% synthetic for 50K steps, then mix in real data), or importance weighting (oversample rare real-world scenarios). For imitation learning, RoboCat pre-trained on 2M synthetic episodes across 100 tasks, then fine-tuned on 1,000-5,000 real demonstrations per task to achieve cross-embodiment generalization[16].
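For the uniform-sampling strategy, one way to hold a 70/30 mix per batch in PyTorch regardless of the two datasets' absolute sizes (a sketch assuming map-style datasets):

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

def mixed_loader(synthetic_ds, real_ds, synth_frac=0.7, batch_size=256):
    """Each batch is ~synth_frac synthetic in expectation."""
    ds = ConcatDataset([synthetic_ds, real_ds])  # synthetic first, then real
    w_synth = synth_frac / len(synthetic_ds)     # per-sample weight, synthetic
    w_real = (1 - synth_frac) / len(real_ds)     # per-sample weight, real
    weights = torch.cat([
        torch.full((len(synthetic_ds),), w_synth),
        torch.full((len(real_ds),), w_real),
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(ds), replacement=True)
    return DataLoader(ds, batch_size=batch_size, sampler=sampler)
```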
Cost-benefit analysis: generating 10,000 synthetic episodes costs $50-200 in GPU compute (10-40 hours on cloud A100 instances), while collecting 10,000 real teleoperation episodes costs $50,000-200,000 in labor and robot time at $5-20 per episode. Teams targeting 100,000+ episode datasets should generate 70,000-80,000 synthetic episodes, then allocate real-robot budget to 20,000-30,000 high-quality demonstrations covering edge cases and failure modes. Scale AI's Physical AI platform offers managed synthetic generation and real data collection with per-episode pricing[6].
Procedural Task and Object Generation
Procedural generation creates infinite task variations by algorithmically sampling object arrangements, goal configurations, and distractor placements. RoboCasa generates 100,000+ unique kitchen manipulation tasks by procedurally placing 150 object models across 10 kitchen layouts with randomized cabinet states, lighting, and clutter[17]. Each episode samples 3-8 objects from a category hierarchy (utensils, containers, appliances), places them according to semantic constraints (plates on counters, not inside ovens), then defines a manipulation goal (move mug to coffee machine).
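A toy sketch of this sampling loop; the category hierarchy and placement constraints below are illustrative stand-ins, not RoboCasa's actual asset lists:

```python
import random

CATEGORIES = {                      # toy category hierarchy (illustrative)
    "utensils": ["fork", "spoon", "whisk"],
    "containers": ["mug", "bowl", "plate"],
    "appliances": ["coffee_machine", "toaster"],
}
VALID_SURFACES = {                  # semantic placement constraints
    "plate": ["counter", "table"],  # plates on counters, not inside ovens
    "mug": ["counter", "shelf"],
}

def sample_task(rng):
    n_objects = rng.randint(3, 8)   # 3-8 objects per episode
    objects = []
    for _ in range(n_objects):
        category = rng.choice(list(CATEGORIES))
        name = rng.choice(CATEGORIES[category])
        surface = rng.choice(VALID_SURFACES.get(name, ["counter"]))
        objects.append({"name": name, "surface": surface})
    target = rng.choice(objects)    # define a manipulation goal
    return {"objects": objects, "goal": f"move {target['name']} to coffee_machine"}

print(sample_task(random.Random(0)))
```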
Object model sourcing: 3D scan databases (Objaverse, ShapeNet, YCB) provide 10,000-100,000 meshes with collision geometry and texture maps. For custom objects, photogrammetry (50-200 images per object) or LiDAR scanning produces meshes with 0.1-1mm accuracy. Dex-YCB provides 21 YCB object models with validated physics parameters and 1,000+ real grasp demonstrations per object[18].
Task specification languages enable programmatic episode generation: BEHAVIOR defines 100 household activities as dependency graphs of atomic actions (open drawer → grasp object → close drawer), then samples valid execution sequences with procedural initial states. LIBERO provides 130 long-horizon manipulation tasks with procedural object and goal variations, generating 10,000-50,000 unique episodes per task template. Procedural generation reduces dataset authoring cost from $5-20 per hand-crafted episode to $0.01-0.10 per procedurally generated episode.
Benchmark Your Synthetic Data Pipeline
Quantitative benchmarks measure synthetic data quality before committing to large-scale generation. Visual realism metrics: Fréchet Inception Distance (FID) between synthetic and real image distributions should be below 50 for manipulation tasks, below 30 for photorealistic rendering. Compute FID on 5,000-10,000 synthetic images vs. 5,000 real images from your target deployment environment.
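One way to run this check with torchmetrics' FID implementation, assuming uint8 image batches shaped (N, 3, H, W); the random tensors below stand in for your real and synthetic image streams:

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # Inception pool3 features

# Placeholders; in practice, stream 5,000-10,000 images from each source.
real_images = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)
synthetic_images = torch.randint(0, 256, (64, 3, 256, 256), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(synthetic_images, real=False)
score = fid.compute()
print(f"FID: {score:.1f}")  # target: <50 for manipulation, <30 for photorealistic
```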
Physics accuracy benchmarks: record 100 real-world episodes of a simple task (push object 10cm), then replay the same initial conditions in simulation and measure trajectory divergence. Position error should remain below 2cm after 5 seconds, velocity error below 10cm/s. For contact-rich tasks, measure force-torque discrepancies during grasping: simulated contact forces should match real sensor readings within 20-30%[14].
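A minimal divergence check for the replay protocol, assuming object position logs sampled at the same rate in both domains:

```python
import numpy as np

def replay_divergence(real_xyz, sim_xyz, dt):
    """Position/velocity divergence between a real trajectory and its sim replay.
    real_xyz, sim_xyz: (T, 3) object positions logged at interval dt seconds."""
    pos_err = np.linalg.norm(real_xyz - sim_xyz, axis=1)  # metres per step
    vel_err = np.linalg.norm(
        np.gradient(real_xyz, dt, axis=0) - np.gradient(sim_xyz, dt, axis=0), axis=1
    )
    t5 = min(int(5.0 / dt), len(pos_err) - 1)  # index of the 5-second mark
    # Thresholds from the protocol above: 2cm position, 10cm/s velocity.
    passed = pos_err[t5] < 0.02 and vel_err.max() < 0.10
    return pos_err, vel_err, passed
```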
RLBench provides 100 standardized manipulation tasks with success rate benchmarks: state-of-the-art policies achieve 85-95% success on simple tasks (reach target, pick and lift), 60-75% on medium tasks (open drawer, stack blocks), 30-50% on hard tasks (insert peg, thread cable)[5]. CALVIN evaluates long-horizon task completion: policies must execute 5-step instruction sequences (pick up block, place in drawer, close drawer, pick up another block, place on table) with 40-60% success rates representing current state-of-the-art[19]. Use these benchmarks to validate that your synthetic data pipeline produces episodes enabling comparable performance.
Scale Generation with Cloud GPU Clusters
Large-scale synthetic generation requires 10-100 GPU cluster orchestration. Isaac Gym scales to 16,384 parallel environments on 8× A100 GPUs, generating 80,000-120,000 episodes per hour for simple manipulation tasks. Cloud providers offer pre-configured instances: AWS p4d.24xlarge (8× A100 40GB, $32.77/hour), GCP a2-ultragpu-8g (8× A100 80GB, $33.22/hour), Azure NC96ads_A100_v4 (4× A100 80GB, $27.20/hour).
Cost optimization: spot instances reduce GPU costs by 60-80% with 2-10 minute interruption notice. Structure generation jobs as 30-60 minute tasks with checkpoint-resume logic to tolerate interruptions. For 100,000 episode datasets, budget $500-2,000 in GPU compute depending on episode complexity and rendering requirements. Scale AI's managed infrastructure handles orchestration, storage, and format conversion at $0.05-0.20 per synthetic episode[6].
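A sketch of shard-level checkpoint-resume logic; make_shard stands in for whatever generation function your pipeline provides:

```python
import pathlib

def generate_resumable(out_dir, num_shards, episodes_per_shard, make_shard):
    """Spot-friendly generation: each shard is an independent unit of work with
    a completion marker, so an interrupted job resumes where it left off."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for shard in range(num_shards):
        marker = out / f"shard_{shard:05d}.done"
        if marker.exists():
            continue  # finished before the interruption; skip
        make_shard(out / f"shard_{shard:05d}.tfrecord", episodes_per_shard)
        marker.touch()  # commit only after the shard is fully written
```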
Storage and transfer: 100,000 episodes with RGB-D observations at 10 Hz consume 500GB-2TB depending on compression. Use Parquet columnar format for observation tensors (5-10× compression vs. raw arrays) and MCAP for multi-modal streams[20]. Distribute datasets via S3, GCS, or Hugging Face Hub with automatic sharding for distributed training. LeRobot provides reference data loaders supporting streaming from cloud storage without downloading full datasets[9].
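A hedged sketch of the Parquet side with pyarrow, storing each step's tensors as compressed binary columns (shapes and column names illustrative):

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

steps = 100
rgb = [np.zeros((256, 256, 3), np.uint8).tobytes() for _ in range(steps)]
proprio = [np.zeros(7, np.float32).tobytes() for _ in range(steps)]

table = pa.table({
    "t": pa.array(list(range(steps)), pa.int32()),
    "rgb": pa.array(rgb, pa.binary()),        # raw frame bytes per step
    "proprio": pa.array(proprio, pa.binary()),
})
pq.write_table(table, "episode_000.parquet", compression="zstd")
```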
Debug Common Simulation Artifacts
Simulation artifacts cause policies to learn spurious correlations that fail on real hardware. Penetration artifacts occur when collision detection fails, allowing objects to pass through surfaces or robot links. Symptom: objects teleporting or falling through tables in 1-5% of episodes. Fix: reduce simulation timestep from 0.01s to 0.001s, increase contact solver iterations from 10 to 50, or switch to a more accurate contact model (MuJoCo's elliptic friction cone vs. PyBullet's pyramidal approximation).
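In MuJoCo's Python bindings, those three fixes look roughly like this (the scene file name is hypothetical):

```python
import mujoco

model = mujoco.MjModel.from_xml_path("scene.xml")  # hypothetical scene file
model.opt.timestep = 0.001                         # finer steps: fewer penetrations
model.opt.iterations = 50                          # more constraint-solver iterations
model.opt.cone = mujoco.mjtCone.mjCONE_ELLIPTIC    # elliptic friction cone
```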
Unstable grasps result from insufficient contact points or incorrect friction parameters. Symptom: objects slipping from gripper during transport in 20-40% of episodes despite successful real-world grasps. Fix: increase gripper mesh resolution to 500-2,000 vertices per finger, validate friction coefficients against real force-torque measurements during grasp, or add compliant finger pads with 1-5mm deformation.
Visual artifacts include texture swimming (textures sliding across surfaces during camera motion), aliasing (jagged edges on thin objects), and incorrect shadows. Texture swimming indicates UV mapping errors; fix by baking textures to object-space normal maps. Aliasing requires 4-8× supersampling anti-aliasing (SSAA) or temporal anti-aliasing (TAA) in the renderer. Shadow artifacts (light leaking through thin objects) require ray-traced shadows or shadow map resolution above 2048×2048. NVIDIA Cosmos provides validated rendering presets eliminating common visual artifacts[7].
Integrate Synthetic Data into Training Pipelines
Training pipeline integration requires format conversion, data loading, and augmentation layers. LeRobot provides PyTorch DataLoaders for RLDS, MCAP, and HDF5 formats with automatic batching, shuffling, and multi-worker prefetching[9]. For imitation learning, wrap synthetic episodes in the same interface as real demonstrations to enable transparent mixing.
Data augmentation applies online transformations during training: random crops (224×224 from 256×256 images), color jitter (brightness ±20%, contrast ±20%, saturation ±20%), and Gaussian noise (σ=0.01-0.03). Augmentation reduces overfitting to specific synthetic appearances while preserving task-relevant features. RT-1 applied augmentation to both synthetic and real data, improving real-world success rates by 8-12 percentage points[2].
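A sketch of that augmentation stack with torchvision, plus the Gaussian-noise step, assuming float RGB tensors in [0, 1]:

```python
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomCrop(224),   # 224x224 crops from 256x256 frames
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
])

def add_gaussian_noise(img, sigma=0.02):
    """Pixel noise in the sigma = 0.01-0.03 range."""
    return (img + sigma * torch.randn_like(img)).clamp(0.0, 1.0)

frame = torch.rand(3, 256, 256)            # placeholder RGB observation
frame = add_gaussian_noise(augment(frame)) # applied online during training
```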
Checkpointing and versioning: tag each training run with the synthetic dataset version, randomization parameter distributions, and mixing ratios. Store dataset metadata in provenance-tracked manifests enabling reproducible experiments and audit trails[13]. When sim-to-real transfer fails, version control allows bisecting dataset changes to identify which generation parameters caused regressions. Truelabel's marketplace provides dataset versioning and lineage tracking for both synthetic and real robot data.
Leverage Pre-Trained Synthetic Datasets
Pre-trained synthetic datasets accelerate development by providing 100,000-1,000,000 episodes across common manipulation tasks. RoboNet contains 15M video frames from 7 robot platforms across 100+ tasks, combining real and synthetic data with shared action spaces[21]. RLBench provides 100 simulated tasks with 10,000-100,000 episodes per task, enabling zero-shot policy evaluation without physical robots[5].
RoboCasa offers 100,000+ kitchen manipulation episodes with procedural task variations, object placements, and distractor configurations. LIBERO provides 130 long-horizon tasks with 10,000-50,000 episodes per task, covering object manipulation, drawer opening, and multi-step assembly[22]. These datasets enable pre-training foundation models before fine-tuning on task-specific real data.
Commercial synthetic data services: Scale AI generates custom synthetic datasets for $0.10-0.50 per episode with guaranteed sim-to-real transfer metrics[6]. Claru provides kitchen task datasets with 50,000-200,000 episodes per skill category[23]. Truelabel's marketplace lists 40+ synthetic datasets with provenance tracking, licensing terms, and real-world validation benchmarks — teams can purchase 10,000-100,000 episode datasets for $500-5,000 vs. $50,000-500,000 for equivalent real data collection.
Future Directions: World Models and Generative Simulation
World models learn environment dynamics from data, enabling policy training without hand-crafted simulators. Ha and Schmidhuber's World Models demonstrated that policies trained entirely in learned latent dynamics achieve 80-90% of the performance of policies trained in ground-truth simulators[24]. NVIDIA's GR00T N1 combines transformer-based world models with diffusion policies, training on 1M+ synthetic episodes to achieve 85% real-world success rates on 20 manipulation tasks[25].
Generative simulation uses diffusion models or GANs to synthesize photorealistic episodes conditioned on task descriptions and initial states. NVIDIA Cosmos generates 1024×1024 RGB-D video at 30 FPS by fine-tuning video diffusion models on 10M real robot episodes, then conditioning generation on action sequences[7]. This approach eliminates manual asset creation and physics tuning, reducing synthetic data generation cost from $0.05-0.20 per episode to $0.01-0.05.
Hafner et al. argue that general-purpose agents require world models supporting counterfactual reasoning and long-horizon planning[26]. Current research targets 10-100 step rollout accuracy in learned simulators, enabling model-based reinforcement learning without ground-truth physics. The Truelabel marketplace will list world model training datasets (10M-100M real episodes with dense annotations) as this technology matures over 2025-2026.
Procurement Considerations for Synthetic Data
Synthetic data procurement requires evaluating generation methodology, validation benchmarks, and licensing terms. Request documentation of simulator version, randomization parameter distributions, and sim-to-real transfer metrics measured on real hardware. Datasets without real-world validation benchmarks carry 30-50% higher risk of transfer failure.
Licensing: synthetic data generated from proprietary 3D assets (CAD models, texture libraries) may inherit usage restrictions. Verify that object meshes and textures carry permissive licenses (CC-BY, MIT, Apache 2.0) allowing commercial model training. Creative Commons BY 4.0 permits commercial use with attribution; CC-BY-NC prohibits commercial training[27].
Quality metrics: request FID scores (visual realism), physics validation reports (trajectory error, contact force accuracy), and task success rate benchmarks on standard evaluation suites. RLBench and CALVIN provide reference benchmarks for comparing synthetic datasets[5]. Datasets achieving 80%+ success rates on RLBench medium tasks and 50%+ on CALVIN long-horizon tasks represent current state-of-the-art quality.
Truelabel's marketplace standardizes synthetic data listings with mandatory provenance tracking, validation benchmarks, and licensing terms. Buyers can filter by simulator type, task category, episode count, and real-world transfer metrics, then purchase datasets with pay-per-episode pricing and quality guarantees[28].
External references and source context
1. Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv) — 1M+ episodes across 22 robot embodiments, demonstrating scale requirements for generalist policies.
2. RT-1: Robotics Transformer for Real-World Control at Scale (arXiv) — trained on 130,000 real robot demonstrations across 700 tasks, representing months of continuous data collection.
3. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World (arXiv) — enables 70-85% real-world transfer rates on pick-and-place tasks from pure synthetic training.
4. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv) — 76,000 real teleoperation trajectories with synthetic augmentation across 564 objects and 86 skills.
5. RLBench: The Robot Learning Benchmark & Learning Environment (arXiv) — 100 simulated manipulation tasks with procedurally generated variations for reproducible benchmarking.
6. Scale AI Physical AI platform (scale.com) — managed synthetic generation and real data collection with per-episode pricing.
7. NVIDIA Cosmos World Foundation Models (NVIDIA Developer) — combines Isaac Sim physics with RTX rendering at 5-15 FPS per environment for photorealistic data generation.
8. Sim-to-Real Transfer of Robotic Control with Dynamics Randomization (arXiv) — Peng et al. showed dynamics randomization alone improves real-world transfer by 25-40 percentage points.
9. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in PyTorch (arXiv) — reference implementations for visual and physical randomization compatible with MuJoCo and Isaac Gym.
10. BridgeData V2: A Dataset for Robot Learning at Scale (arXiv) — 80% synthetic + 20% real data achieves 92% of all-real performance while reducing collection time by 75%.
11. RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning (arXiv) — standardized schema for storing multi-modal episodes with metadata tracking randomization parameters.
12. MCAP file format (mcap.dev) — self-describing container format with random access and zero-copy deserialization for 1TB+ datasets.
13. Truelabel data provenance glossary (truelabel.ai) — extends RLDS with lineage tracking for audit trails and transfer-failure diagnosis.
14. Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning (arXiv) — Zhao et al. document 30+ sim-to-real transfer techniques with measured performance impacts across domains.
15. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv) — combined 130,000 real episodes with 2M synthetic, then fine-tuned on 6,000 real demonstrations for 90% success rates.
16. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation (arXiv) — pre-trained on 2M synthetic episodes across 100 tasks, then fine-tuned on 1,000-5,000 real demonstrations per task.
17. RoboCasa project site (robocasa.ai) — generates 100,000+ unique kitchen manipulation tasks with procedural object placement and randomization.
18. Dex-YCB project site (dex-ycb.github.io) — 21 YCB object models with validated physics parameters and 1,000+ real grasp demonstrations per object.
19. CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks (arXiv) — evaluates long-horizon task completion with 5-step sequences achieving 40-60% success rates at state-of-the-art.
20. Apache Parquet file format (Apache Parquet) — columnar format providing 5-10× compression vs. raw arrays for observation tensor storage.
21. RoboNet: Large-Scale Multi-Robot Learning (arXiv) — 15M video frames from 7 robot platforms across 100+ tasks, combining real and synthetic data.
22. LIBERO dataset page (libero-project.github.io) — 130 long-horizon manipulation tasks with procedural variations generating 10,000-50,000 unique episodes per template.
23. Kitchen Task Training Data for Robotics (claru.ai) — Claru kitchen task datasets with 50,000-200,000 episodes per skill category for commercial procurement.
24. World Models (worldmodels.github.io) — Ha and Schmidhuber demonstrated policies trained in learned latent dynamics achieve 80-90% of ground-truth simulator performance.
25. NVIDIA GR00T N1 technical report (arXiv) — combines transformer world models with diffusion policies achieving 85% real-world success on 20 tasks.
26. General Agents Need World Models (arXiv) — argues general-purpose agents require world models supporting counterfactual reasoning and long-horizon planning.
27. Creative Commons Attribution 4.0 International Legal Code (Creative Commons) — CC BY 4.0 permits commercial use with attribution for synthetic data asset licensing.
28. Truelabel physical AI data marketplace bounty intake (truelabel.ai) — connects teams needing validation data with collectors operating 500+ robot platforms across 40 task categories.
FAQ
What is the minimum viable synthetic dataset size for training a manipulation policy?
10,000-50,000 episodes enable training policies that achieve 60-75% real-world success rates on simple pick-and-place tasks with domain randomization. State-of-the-art performance (85-92% success) requires 50,000-200,000 synthetic episodes mixed with 5,000-20,000 real demonstrations. Task complexity scales requirements: long-horizon tasks (5+ steps) need 100,000-500,000 episodes, while single-skill tasks converge with 10,000-30,000 episodes.
How much does it cost to generate 100,000 synthetic robot episodes?
Cloud GPU costs range from $500-2,000 for 100,000 episodes depending on rendering complexity and episode length. Isaac Gym on 8× A100 GPUs ($33/hour) generates simple manipulation episodes at 80,000-120,000 per hour ($0.003-0.004 per episode). Photorealistic rendering with ray tracing reduces throughput to 5,000-10,000 episodes per hour, increasing cost to $0.03-0.06 per episode. Managed services like Scale AI charge $0.10-0.50 per episode including infrastructure, storage, and format conversion.
Can I train a policy entirely on synthetic data without any real demonstrations?
Yes, but expect 15-25 percentage point lower real-world success rates compared to mixed training. Pure synthetic training with domain randomization achieves 70-80% success on simple manipulation tasks, while mixing 70% synthetic with 30% real data reaches 90-95% success. The real data provides a distribution anchor preventing the policy from exploiting simulator artifacts. For initial prototyping and algorithm development, pure synthetic training is viable; for production deployment, allocate 20-40% of your data budget to real demonstrations.
Which physics simulator should I choose for manipulation tasks?
MuJoCo provides the most accurate contact physics for tasks involving sliding, rolling, insertion, and deformable objects, making it the best choice for contact-rich manipulation. Isaac Gym excels at massively parallel simulation (4,096-16,384 environments on a single GPU), generating episodes 50-100× faster than MuJoCo — ideal for reinforcement learning requiring millions of episodes. PyBullet offers a free open-source baseline with moderate accuracy and speed. For photorealistic visual data, pair any simulator with NVIDIA Isaac Sim or Blender for ray-traced rendering.
How do I validate that my synthetic data will transfer to real robots?
Train a policy on 10,000-50,000 synthetic episodes, then evaluate on 100-500 real-world trials measuring task success rate. A 15-25 percentage point gap is typical with domain randomization; gaps above 40 points indicate insufficient randomization or physics errors. Record failure episodes and replay them in simulation to identify discrepancies. Measure trajectory divergence: position error should remain below 2cm after 5 seconds, contact forces within 20-30% of real sensor readings. Iterate by increasing randomization ranges for parameters correlated with failures.
What metadata should I track for each synthetic episode?
Log simulator version, randomization parameter distributions (ranges and seeds), episode success/failure labels, task identifiers, and generation timestamp. Include object model IDs, texture library versions, and lighting configurations to enable deterministic replay during debugging. Store domain randomization seeds separately to reproduce specific episodes. Track sim-to-real validation metrics (real-world success rate, failure modes) at the dataset level. Use RLDS or MCAP formats with embedded metadata schemas to ensure downstream training pipelines can access provenance information.
Looking for synthetic robot data generation?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.
List Your Robot Dataset on Truelabel