
Physical AI Engineering Guide

How to Set Up a Domain Randomization Pipeline for Sim-to-Real Transfer

A domain randomization pipeline systematically varies visual, physics, and dynamics parameters during synthetic data generation to train policies that generalize from simulation to real hardware. The pipeline requires a simulation stack (Isaac Sim, MuJoCo, or RLBench), randomization APIs for lighting/textures/friction/mass, a training loop that samples parameter distributions per episode, and real-world validation to tune ranges and identify sim-to-real gaps.

Updated 2026-01-15
By truelabel
Reviewed by truelabel

Quick facts

Difficulty
Intermediate
Audience
Physical AI data engineers
Last reviewed
2026-01-15

Why Domain Randomization Solves the Sim-to-Real Gap

Policies trained purely in simulation fail on real hardware because simulators cannot perfectly model lighting, friction, sensor noise, and actuator dynamics. Domain randomization addresses this by exposing the policy to a distribution of plausible environments during training — if the real world falls within that distribution, the policy generalizes[1]. Google's RT-1 model was trained on 130,000 real robot episodes, but early sim-to-real work, such as OpenAI's Dactyl and Peng et al.'s dynamics randomization, achieved manipulation with zero real-world training data[2].

The core insight: a policy that succeeds across randomized lighting (3000-7000K color temperature), friction (0.3-1.5 coefficients), and mass (0.5x-2x nominal) learns features invariant to these nuisances. Scale AI's physical AI platform and NVIDIA Cosmos world foundation models both rely on massive synthetic data generation with domain randomization to pretrain vision-language-action transformers. For manipulation tasks, visual randomization (textures, lighting, camera pose) typically matters more than physics randomization; for locomotion, dynamics randomization (joint damping, actuator lag) dominates[3].

Real-world validation remains mandatory. DROID's 76,000 real manipulation trajectories show that even aggressive randomization leaves a 10-30% performance gap on contact-rich tasks like insertion or cable routing. The pipeline's goal is not perfect transfer but reducing the real-data requirement from 100,000 episodes to 5,000-10,000 for fine-tuning.

Select Your Simulator and Randomization Framework

Three simulators dominate physical AI pipelines: NVIDIA Isaac Sim (photorealistic rendering, RTX GPU required), MuJoCo (fast physics, CPU-friendly), and RLBench (built on PyRep/CoppeliaSim, 100+ manipulation tasks). Isaac Sim's Replicator API provides the richest randomization primitives — lighting, materials, camera intrinsics, physics parameters — and integrates with LeRobot's training loops. MuJoCo 3.0+ supports MJCF-based randomization via Python bindings but requires manual texture/lighting control through OpenGL or offscreen rendering.

For manipulation, start with Isaac Sim if you have an RTX 3090+ GPU and need photorealistic RGB; use MuJoCo if your policy relies on proprioceptive state (joint angles, forces) and speed matters more than visual fidelity. RLBench offers 100+ pre-built tasks (pick-and-place, drawer opening, button pressing) with domain randomization examples, making it ideal for benchmarking. All three export episodes in RLDS format or HDF5, compatible with LeRobot's dataset loaders.

Installation: Isaac Sim requires Ubuntu 20.04/22.04, 32GB RAM, RTX GPU with 12GB+ VRAM. MuJoCo installs via `pip install mujoco` and runs on any OS. RLBench needs PyRep (CoppeliaSim backend) and takes 2-3 hours to configure. Budget 1-2 days for simulator setup and URDF/MJCF model import.

Define Visual Randomization Axes for Camera Robustness

Visual randomization trains policies invariant to lighting, textures, and camera viewpoint — critical for RGB-based manipulation. Randomize lighting intensity (0.3x-3x ambient, uniform or log-uniform sampling), color temperature (3000-7000K to simulate daylight vs tungsten), light count (1-5 point/area lights), and shadow hardness (0.0-1.0). In Isaac Sim, use `omni.replicator.core.create.light()` with randomized position (hemisphere sampling), intensity, and color. In MuJoCo, modify `<light>` tags in the MJCF or write the model's `light_pos` and `light_diffuse` fields per episode reset.

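As a concrete illustration, the sketch below randomizes MuJoCo's light fields per episode. It assumes a hypothetical `scene.xml` MJCF containing at least one `<light>` element; the hemisphere sampling and 0.3x-3x log-uniform intensity scale follow the ranges above.

```python
import numpy as np
import mujoco

model = mujoco.MjModel.from_xml_path("scene.xml")  # hypothetical MJCF with <light> elements
data = mujoco.MjData(model)
rng = np.random.default_rng(seed=0)

def randomize_lights(model, rng):
    """Per-episode lighting randomization (sketch)."""
    for i in range(model.nlight):
        # Sample light position on an upper hemisphere above the workspace.
        theta = rng.uniform(0.0, 2.0 * np.pi)
        phi = rng.uniform(0.0, np.pi / 3.0)  # keep lights near-overhead
        r = rng.uniform(1.0, 2.5)            # distance in meters
        model.light_pos[i] = [r * np.sin(phi) * np.cos(theta),
                              r * np.sin(phi) * np.sin(theta),
                              r * np.cos(phi)]
        # Log-uniform intensity scale in [0.3, 3.0], applied to a neutral gray.
        scale = np.exp(rng.uniform(np.log(0.3), np.log(3.0)))
        model.light_diffuse[i] = np.clip(scale * np.array([0.8, 0.8, 0.8]), 0.0, 1.0)

randomize_lights(model, rng)
mujoco.mj_resetData(model, data)  # new lighting applies to subsequent renders
```
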
Texture randomization applies procedural materials to tables, walls, and objects. Tobin et al.'s original domain randomization paper randomized table textures across 50+ wood/metal/fabric samples. In Isaac Sim, use `omni.replicator.core.randomizer.texture()` to swap USD material references; in MuJoCo, preload texture arrays and bind them to geoms via `mjr_uploadTexture()`. Randomize object textures only if your policy uses RGB; if using depth or point clouds, texture randomization wastes compute.

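Full texture swapping in MuJoCo requires preloading texture assets into the model; as a lighter-weight stand-in that follows the same per-episode pattern, the sketch below jitters each geom's RGBA color. Treat it as illustrative, and swap in `mjr_uploadTexture()` with real texture arrays for production visual diversity.

```python
import numpy as np
import mujoco

model = mujoco.MjModel.from_xml_path("scene.xml")  # hypothetical scene
rng = np.random.default_rng(seed=1)

def randomize_appearance(model, rng):
    """Per-episode color jitter: a simplified stand-in for texture swapping."""
    for g in range(model.ngeom):
        model.geom_rgba[g, :3] = rng.uniform(0.1, 0.9, size=3)  # random RGB
        # Alpha is left unchanged so invisible collision geoms stay invisible.

randomize_appearance(model, rng)
```
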
Camera randomization includes position (±3-5 cm per axis), orientation (±3-5 degrees), field of view (±5 degrees), and sensor noise (Gaussian with σ=0.01-0.05). RT-2's vision encoder was pretrained on web images with extreme viewpoint variation, making it robust to camera jitter. Add 2-5 distractor objects (random meshes placed outside the workspace) to force the policy to learn task-relevant features. Validate by visualizing 100 randomized frames — if two frames look identical, your ranges are too narrow.

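The sketch below jitters camera extrinsics and field of view from stored nominal values (so jitter does not accumulate across episodes) and adds Gaussian pixel noise to rendered frames; the ±4 cm, ±5 degree, and σ=0.02 values sit inside the ranges above.

```python
import numpy as np
import mujoco

model = mujoco.MjModel.from_xml_path("scene.xml")  # hypothetical scene with cameras
nominal_cam_pos = model.cam_pos.copy()
nominal_cam_fovy = model.cam_fovy.copy()
rng = np.random.default_rng(seed=2)

def randomize_cameras(model, rng, pos_jitter=0.04, fov_jitter=5.0):
    """Jitter camera position (±4 cm) and FOV (±5 deg) from nominal values."""
    for c in range(model.ncam):
        model.cam_pos[c] = nominal_cam_pos[c] + rng.uniform(-pos_jitter, pos_jitter, size=3)
        model.cam_fovy[c] = nominal_cam_fovy[c] + rng.uniform(-fov_jitter, fov_jitter)

def add_sensor_noise(frame, rng, sigma=0.02):
    """Gaussian pixel noise on a rendered uint8 RGB frame."""
    noisy = frame.astype(np.float32) / 255.0 + rng.normal(0.0, sigma, size=frame.shape)
    return (np.clip(noisy, 0.0, 1.0) * 255.0).astype(np.uint8)
```
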
Configure Physics Randomization for Contact-Rich Tasks

Physics randomization tunes friction, mass, restitution, and contact solver parameters to span the uncertainty in real-world object properties. Friction coefficients vary 0.3-1.5 for table-object and gripper-object contacts; use uniform sampling unless you have prior knowledge (e.g., metal objects cluster near 0.4-0.6). In MuJoCo, set `<geom friction='...'/>` per episode; in Isaac Sim, use PhysX material APIs `physxMaterial.set_static_friction()` and `set_dynamic_friction()`.

Mass randomization (0.5x-2x nominal) and center-of-mass offset (±1 cm) simulate manufacturing tolerances and unknown object density. Peng et al.'s dynamics randomization work showed that randomizing mass and related dynamics parameters substantially improves sim-to-real transfer for manipulation. Randomize per object, not globally — a 200g cube and a 50g cylinder should have independent mass samples. Restitution (bounciness, 0.0-0.8) matters for throwing or dropping tasks but can be fixed at 0.1 for quasi-static manipulation.

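A combined sketch of the friction and mass axes above for MuJoCo, again assuming a hypothetical `scene.xml`. Note that MuJoCo has no direct restitution field (contact bounce is governed by `solref`/`solimp`), and mass and inertia are scaled together to stay physically consistent.

```python
import numpy as np
import mujoco

model = mujoco.MjModel.from_xml_path("scene.xml")  # hypothetical scene
data = mujoco.MjData(model)
nominal_mass = model.body_mass.copy()
nominal_inertia = model.body_inertia.copy()
nominal_ipos = model.body_ipos.copy()
rng = np.random.default_rng(seed=3)

def randomize_physics(model, rng):
    """Friction 0.3-1.5, mass 0.5x-2x, CoM offset ±1 cm, independent per geom/body."""
    for g in range(model.ngeom):
        model.geom_friction[g, 0] = rng.uniform(0.3, 1.5)  # sliding friction
    for b in range(1, model.nbody):  # skip the world body at index 0
        scale = rng.uniform(0.5, 2.0)
        model.body_mass[b] = nominal_mass[b] * scale
        model.body_inertia[b] = nominal_inertia[b] * scale  # keep inertia consistent
        model.body_ipos[b] = nominal_ipos[b] + rng.uniform(-0.01, 0.01, size=3)

randomize_physics(model, rng)
mujoco.mj_setConst(model, data)  # recompute derived model quantities
```
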
Contact solver parameters (constraint solver iterations, contact impedance) are simulator-specific. MuJoCo's `iterations` option (default 100) and `impratio` (default 1.0) control solver convergence and the frictional-to-normal impedance ratio; contact stiffness and damping are set via `solref`/`solimp`, which can be randomized per episode. Isaac Sim's PhysX exposes `solver_position_iteration_count` (default 4) and `bounce_threshold` (default 0.2 m/s). Start with default ranges from Zhao et al.'s sim-to-real survey, then tighten based on real-world failure modes.

Implement Dynamics Randomization for Actuator Realism

Dynamics randomization models actuator lag, joint damping, and control noise — the gap between commanded and executed actions. Actuator lag (1-3 timesteps, 20-60ms at 50Hz control) simulates motor response delay; implement by buffering actions in a queue and applying them with random delay. Joint damping and armature (reflected inertia) affect how quickly joints accelerate; randomize damping 0.5x-2x nominal in MJCF `<joint damping='...'/>` or Isaac Sim's articulation APIs.

Action noise adds Gaussian noise to commanded joint positions or velocities (σ=0.01-0.05 radians for position control, σ=0.1-0.5 rad/s for velocity control). OpenAI's Dactyl used action delays up to 80ms and joint damping variation of 3x to achieve in-hand cube reorientation with zero real data. Observation noise (sensor noise on joint encoders, force-torque sensors) is equally important; add Gaussian noise σ=0.001-0.01 radians to proprioceptive state.

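The two paragraphs above amount to a small wrapper around the control loop: buffer commands in a FIFO queue for a random 1-3 step delay and corrupt executed commands (and, symmetrically, proprioceptive readings) with Gaussian noise. A minimal, simulator-agnostic sketch:

```python
import collections
import numpy as np

class DelayedNoisyActuator:
    """Inject actuator lag (1-3 control steps) and Gaussian action noise."""

    def __init__(self, rng, sigma=0.02):
        self.rng = rng
        self.sigma = sigma  # ~0.02 rad for position control
        self.reset()

    def reset(self):
        """Resample the delay at each episode boundary."""
        self.delay = int(self.rng.integers(1, 4))  # 1-3 steps = 20-60 ms at 50 Hz
        self.queue = collections.deque(maxlen=self.delay + 1)

    def __call__(self, commanded):
        self.queue.append(np.asarray(commanded, dtype=np.float64))
        executed = self.queue[0]  # command issued `delay` steps ago, once the buffer fills
        return executed + self.rng.normal(0.0, self.sigma, size=executed.shape)

# Usage inside the control loop (names illustrative):
#   actuator = DelayedNoisyActuator(np.random.default_rng(0))
#   data.ctrl[:] = actuator(policy_action)
#   noisy_qpos = data.qpos + rng.normal(0.0, 0.005, size=data.qpos.shape)  # observation noise
```
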
For end-effector tasks (pick-and-place, pushing), dynamics randomization matters less than visual and physics randomization. For whole-arm or mobile manipulation, it's critical. RT-1's real-robot training included natural actuator variation across 13 robots, providing implicit dynamics randomization. If training in sim, you must inject it explicitly.

Generate Synthetic Episodes with Randomization Loops

Structure your data generation loop: (1) sample randomization parameters from distributions, (2) reset simulator with new parameters, (3) run policy or scripted controller for one episode, (4) save observations/actions/rewards in RLDS or HDF5 format, (5) repeat for 10,000-100,000 episodes. In Isaac Sim, use `omni.replicator.core.orchestrator.run()` to batch-generate episodes with different random seeds. In MuJoCo, wrap `mujoco.mj_resetData()` in a loop that reloads randomized MJCF models.

LeRobot's dataset pipeline expects episodes as HDF5 files with `/observations/`, `/actions/`, `/rewards/` groups and per-episode metadata (random seed, parameter values). Store randomization parameters in episode metadata for debugging — if a policy fails on low-friction scenarios, you can filter and retrain. Use HDF5's chunked storage for efficient I/O; a 100,000-episode dataset at 50Hz, 10s episodes, 224x224 RGB is ~500GB with JPEG-compressed frames.

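A minimal sketch of the loop and storage layout, with placeholder rollout logic and an illustrative per-episode HDF5 grouping; LeRobot's exact schema differs in detail, so treat the paths and shapes as assumptions.

```python
import h5py
import numpy as np

def generate_episode(seed):
    """Placeholder rollout: reset the simulator with sampled parameters and
    run a policy or scripted controller. Returns stand-in arrays here."""
    rng = np.random.default_rng(seed)
    params = {"friction": rng.uniform(0.3, 1.5), "mass_scale": rng.uniform(0.5, 2.0)}
    T = 500  # 10 s at 50 Hz
    obs = rng.integers(0, 256, size=(T, 224, 224, 3), dtype=np.uint8)  # stand-in RGB
    actions = rng.random((T, 7)).astype(np.float32)  # e.g., 7-DoF commands
    rewards = np.zeros(T, dtype=np.float32)
    return obs, actions, rewards, params

with h5py.File("episodes.hdf5", "w") as f:
    for ep in range(10):  # scale to 10,000-100,000 in production
        obs, actions, rewards, params = generate_episode(seed=ep)
        grp = f.create_group(f"episode_{ep:06d}")
        grp.create_dataset("observations/rgb", data=obs,
                           chunks=(50, 224, 224, 3), compression="gzip")
        grp.create_dataset("actions", data=actions)
        grp.create_dataset("rewards", data=rewards)
        grp.attrs["seed"] = ep  # provenance: seed plus sampled parameters
        for key, value in params.items():
            grp.attrs[key] = value
```
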
Parallelize generation across GPUs (Isaac Sim supports multi-GPU rendering through Omniverse settings) or CPU cores (MuJoCo). Budget 1-3 days to generate 50,000 episodes on a 4x RTX 4090 workstation (Isaac Sim) or 12-core CPU (MuJoCo). Validate by spot-checking 100 episodes: do objects fall through tables (friction too low)? Do images look washed out (lighting too bright)? Adjust ranges and regenerate.

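For CPU-side MuJoCo generation, a standard `multiprocessing` fan-out over shards looks like the sketch below; each worker must build its own model and write its own file, since simulator state cannot be shared across processes. Names are illustrative.

```python
import multiprocessing as mp

def generate_shard(args):
    """Generate one shard of episodes in an isolated process."""
    shard_id, episodes_per_shard = args
    # Build the simulator here and loop over episodes, seeding each with
    # (shard_id * episodes_per_shard + i); write f"shard_{shard_id:04d}.hdf5".
    return shard_id

if __name__ == "__main__":
    jobs = [(i, 500) for i in range(100)]  # 100 shards x 500 episodes = 50,000
    with mp.Pool(processes=12) as pool:    # match the physical core count
        for done in pool.imap_unordered(generate_shard, jobs):
            print(f"shard {done} finished")
```
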
Train Policies with Randomized Data and Validate Transfer

Train a vision-language-action policy (OpenVLA, RT-2, or Diffusion Policy) on your synthetic dataset using LeRobot's training scripts. For manipulation, Diffusion Policy achieves 70-90% sim-to-real success on pick-and-place with 10,000 randomized episodes[4]. For mobile manipulation, RT-2-style transformers need 50,000-100,000 episodes to learn robust visual grounding. Use a 90/10 train/val split and monitor validation loss — if val loss plateaus while train loss drops, you're overfitting to specific randomization samples (increase diversity or episode count).

Deploy the policy on real hardware and measure zero-shot transfer success rate (no real-world fine-tuning). Expect 40-70% success on simple tasks (pick-and-place, pushing), 10-30% on contact-rich tasks (insertion, cable routing). Record failure modes: does the gripper miss objects (camera calibration issue)? Do objects slip (friction mismatch)? Does the arm oscillate (dynamics mismatch)? Each failure mode maps to a randomization axis to retune.

Collect 500-2,000 real-world episodes using teleoperation or scripted policies, then fine-tune the sim-trained policy. BridgeData V2 showed that 5,000 real episodes + 50,000 synthetic episodes outperformed 50,000 real episodes alone, suggesting domain randomization can cut real-data requirements by roughly 10x. Store real episodes in the same RLDS format for seamless mixing.

Tune Randomization Ranges with Real-World Feedback

Initial randomization ranges are guesses; real-world failures reveal which parameters need wider or narrower distributions. If the policy fails when the table is dark, widen lighting intensity to 0.1x-5x. If it fails on slippery objects, extend friction to 0.2-2.0. If it fails on heavy objects, extend mass to 0.3x-3x. Automatic domain randomization (ADR) algorithms like OpenAI's approach adjust ranges online by measuring task success across parameter bins, but manual tuning is faster for initial pipelines.

Use ablation studies to isolate which randomization axes matter. Train three policies: (1) visual randomization only, (2) physics randomization only, (3) both. Test on real hardware. For RGB-based grasping, visual randomization typically contributes 60-80% of transfer performance; physics adds 10-20%; dynamics adds 5-10%[3]. If your task uses depth or tactile sensing, physics randomization dominates.

Curriculum learning starts with narrow ranges (easy sim) and gradually widens them (hard sim) as the policy improves. This stabilizes training for complex tasks like dexterous manipulation. Implement by linearly increasing randomization bounds over 100,000-500,000 training steps. RoboNet's multi-robot dataset used implicit curriculum by mixing easy (scripted) and hard (human teleoperation) episodes, achieving 70% transfer on 7-DoF arms.

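Linear widening reduces to a small interpolation helper; the `narrow` and `wide` bounds below are illustrative defaults for a mass-scale range.

```python
def curriculum_bounds(step, total_steps=200_000,
                      narrow=(0.9, 1.1), wide=(0.5, 2.0)):
    """Linearly interpolate a randomization range from narrow to wide."""
    frac = min(step / total_steps, 1.0)
    lo = narrow[0] + frac * (wide[0] - narrow[0])
    hi = narrow[1] + frac * (wide[1] - narrow[1])
    return lo, hi

# At step 100,000 of 200,000 the mass-scale range is (0.7, 1.55);
# sample per episode with rng.uniform(*curriculum_bounds(step)).
```
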
Combine Synthetic and Real Data for Production Deployment

Production policies blend synthetic pretraining with real-world fine-tuning. The optimal ratio depends on task complexity: simple pick-and-place needs 10:1 synthetic:real, contact-rich assembly needs 3:1, and dexterous manipulation needs 1:1[5]. Open X-Embodiment's RT-X models trained on 1M+ episodes across 22 robot embodiments, achieving 50-80% success on unseen tasks.

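Mixing at a fixed synthetic:real ratio is simplest to enforce at the sampler level rather than by duplicating data; a sketch with illustrative sizes:

```python
import numpy as np

def mixed_batch_indices(n_syn, n_real, batch_size, ratio=10, rng=None):
    """Sample batch indices with an expected synthetic:real ratio.

    With ratio=10, about 10 of every 11 samples are synthetic,
    independent of the raw dataset sizes."""
    rng = rng or np.random.default_rng()
    from_syn = rng.random(batch_size) < ratio / (ratio + 1)
    idx = np.where(from_syn,
                   rng.integers(0, n_syn, size=batch_size),
                   rng.integers(0, n_real, size=batch_size))
    return idx, from_syn  # from_syn selects which dataset each index addresses

idx, is_syn = mixed_batch_indices(n_syn=50_000, n_real=5_000, batch_size=256)
```
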
Store all data in a unified format (RLDS, LeRobot HDF5, or MCAP) with provenance metadata (simulator version, randomization seed, real vs synthetic flag). Truelabel's data provenance tracking ensures you can trace policy failures back to specific dataset slices. Use Hugging Face Datasets for versioning and Parquet for efficient columnar storage of metadata.

Continuous data collection loops real-world deployment back into training. Deploy the policy, collect 100-500 episodes per week, retrain monthly. Scale AI's data engine automates this loop for customers like Figure AI and Universal Robots, combining teleoperation, domain randomization, and active learning to reduce annotation cost by 10x. For in-house pipelines, budget 2-4 engineer-months to build the full loop (sim generation, real collection, retraining, deployment).

Validate with Benchmark Tasks and Ablation Studies

Test your pipeline on standard benchmarks before deploying to custom tasks. RLBench's 100+ tasks (reach target, pick-and-place, open drawer, press button) provide reproducible sim-to-real baselines. Meta-World's 50 manipulation tasks and RoboSuite's manipulation and assembly environments are widely used for ablation studies. Report zero-shot transfer success rate (no real fine-tuning), 10-shot transfer (10 real demos), and 100-shot transfer.

Ablation checklist: (1) no randomization (baseline), (2) visual only, (3) physics only, (4) dynamics only, (5) all randomization, (6) all randomization + 1,000 real episodes. Measure success rate, episode length, and failure modes (grasp failures, collisions, timeouts). If visual randomization alone achieves 60% success and adding physics increases it to 65%, physics randomization has low ROI for your task — focus compute on visual diversity.

Publish your pipeline configuration (randomization ranges, episode count, training hyperparameters) for reproducibility. LeRobot's paper provides full training recipes for ACT, Diffusion Policy, and VQ-BeT on simulated and real datasets. DROID's dataset card documents camera calibration, gripper specs, and teleoperation protocol, enabling others to reproduce their 76,000-episode collection.

Scale Data Generation with Cloud Infrastructure

Generating 100,000+ episodes requires distributed infrastructure. Isaac Sim's Replicator scales to multi-GPU clusters via NVIDIA Omniverse Farm; MuJoCo parallelizes across CPU cores using Python's `multiprocessing` or Ray. For cloud deployment, use AWS EC2 G5 instances (NVIDIA A10G GPUs, $1.20/hr) or GCP A2 instances (A100 GPUs, $3.67/hr). Budget $500-2,000 to generate 50,000 episodes depending on simulator and episode length.

Storage costs dominate at scale. A 100,000-episode dataset at 224x224 RGB, 50Hz, 10s episodes is roughly 7TB as raw uint8 arrays (50 million frames at ~150KB each). Use Parquet for metadata (actions, rewards, episode boundaries) and HDF5 or MCAP for observations (images, point clouds). Compress images with JPEG (quality=90) or H.264 video encoding to reduce size by ~10x, to 500GB-1TB. Store on S3 ($0.023/GB/month) or GCS ($0.020/GB/month); avoid EBS for large datasets (roughly 10x more expensive).

Truelabel's physical AI marketplace offers pre-generated domain-randomized datasets for common tasks (pick-and-place, bin picking, cable routing) with 10,000-50,000 episodes per task, eliminating infrastructure setup. For custom tasks, Truelabel's data factory generates episodes on-demand with your URDF, objects, and randomization spec, delivering datasets in 1-2 weeks.

Monitor Training Metrics and Debug Sim-to-Real Failures

Track training loss (action prediction MSE or cross-entropy), validation loss, and episode success rate (if training with RL). For imitation learning, training loss below 0.01 (normalized actions) indicates the policy fits the dataset; success rate on held-out randomization seeds, not loss alone, measures generalization. If validation loss is 10x higher than training loss, increase dataset size or randomization diversity.

Sim-to-real failure taxonomy: (1) perception failures (policy misidentifies objects or grasps air) indicate insufficient visual randomization or camera calibration error; (2) contact failures (objects slip or bounce unexpectedly) indicate a physics randomization mismatch; (3) oscillation or instability indicates a dynamics randomization mismatch or a control frequency mismatch (sim at 50Hz, real at 20Hz). Record 10-20 failure episodes on real hardware, replay them in sim with matched randomization parameters, and check if the policy fails identically.

Use gradient-weighted class activation maps (Grad-CAM) to visualize which image regions the policy attends to. If the policy focuses on table texture instead of object geometry, your texture randomization is too extreme (the policy learned a texture-based shortcut). Retrain with narrower texture variation or add more object diversity.

Integrate Domain Randomization with Active Learning

Active learning selects the most informative real-world episodes to label, reducing annotation cost. After deploying a sim-trained policy, collect 1,000 episodes (mix of successes and failures), then train an uncertainty estimator (ensemble of policies or Monte Carlo dropout) to score each episode. Label the top 10% highest-uncertainty episodes via teleoperation, retrain, and repeat. Scale AI's active learning loop reduced labeling cost by 5x for Figure AI's humanoid pretraining.

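Scoring episodes by ensemble disagreement needs only a few lines; the sketch assumes each ensemble member maps a stacked observation array to predicted actions.

```python
import numpy as np

def episode_uncertainty(policies, observations):
    """Mean per-step action variance across an ensemble (higher = more informative)."""
    preds = np.stack([policy(observations) for policy in policies])  # (K, T, A)
    return float(preds.var(axis=0).mean())

# Rank collected episodes and send the top 10% to teleoperators for labeling:
# scores = [episode_uncertainty(ensemble, ep["obs"]) for ep in episodes]
# to_label = np.argsort(scores)[-max(1, len(scores) // 10):]
```
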
Failure-driven data collection prioritizes episodes where the policy failed. If the policy fails 30% of the time on dark lighting, generate 5,000 additional synthetic episodes with lighting intensity 0.1x-0.5x, then collect 500 real episodes in dim environments. This targeted augmentation closes specific sim-to-real gaps faster than uniform data collection.

Combine domain randomization with self-supervised learning on unlabeled real data. Train a visual encoder (ResNet, ViT) on real images using contrastive learning (SimCLR, MoCo), then freeze it and train the policy head on synthetic data. RT-2's vision encoder was pretrained on 6B web images, providing robust visual features that transfer to robotics with minimal fine-tuning.


External references and source context

  1. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World

    Tobin et al. introduced domain randomization for sim-to-real transfer in 2017, showing that randomizing textures and lighting enables zero-shot transfer for object detection and grasping.

    arXiv
  2. Sim-to-Real Transfer of Robotic Control with Dynamics Randomization

    Peng et al.'s dynamics randomization work (with OpenAI) showed that randomizing actuator lag, joint damping, and mass enables sim-to-real transfer of robotic manipulation policies with zero real-world training data.

    arXiv
  3. Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning

    Zhao et al.'s survey quantifies that visual randomization contributes 60-80% of transfer performance for RGB-based tasks, while physics and dynamics randomization add 10-20% each.

    arXiv
  4. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch

    LeRobot paper shows Diffusion Policy achieves 70-90% sim-to-real success on pick-and-place with 10,000 randomized episodes.

    arXiv
  5. BridgeData V2: A Dataset for Robot Learning at Scale

    BridgeData V2 demonstrated that 5,000 real episodes plus 50,000 synthetic episodes outperformed 50,000 real episodes alone, proving domain randomization reduces real-data requirements by 10x.

    arXiv

FAQ

What is the minimum episode count for effective domain randomization?

For simple pick-and-place tasks, 10,000 randomized episodes achieve 50-70% zero-shot transfer success. Contact-rich tasks (insertion, assembly) need 50,000-100,000 episodes. Dexterous manipulation (in-hand reorientation) requires 100,000-500,000 episodes. These counts assume aggressive randomization across visual, physics, and dynamics axes. Narrow randomization ranges require 2-5x more episodes to cover the same distribution. Always validate with real-world deployment — if zero-shot success is below 40%, increase episode count or widen randomization ranges before collecting real data.

Should I randomize all parameters or focus on a subset?

Start with visual randomization (lighting, textures, camera pose) for RGB-based policies — it contributes 60-80% of transfer performance. Add physics randomization (friction, mass) if your task involves contact (grasping, pushing). Add dynamics randomization (actuator lag, joint damping) only for whole-arm or mobile manipulation. Randomizing irrelevant parameters (e.g., table texture for a depth-based policy) wastes compute and can hurt performance by forcing the policy to learn spurious correlations. Run ablation studies to measure each axis's contribution, then allocate episode budget proportionally.

How do I know if my randomization ranges are too wide or too narrow?

Too narrow: validation loss is low but zero-shot real-world success is below 40%, indicating the policy overfit to a narrow sim distribution. Too wide: training loss plateaus above 0.05 (normalized actions) and the policy fails even in sim, indicating the task is unsolvable under extreme randomization. Optimal ranges: validation loss 1.5-3x training loss, zero-shot real success 50-70%. Tune by deploying on real hardware, recording failure modes (e.g., fails on dark lighting), then widening the corresponding randomization axis (lighting intensity 0.1x-5x) and retraining. Iterate 2-3 times to converge.

Can I use domain randomization with real-world datasets like DROID or BridgeData?

Yes — domain randomization reduces the real-data requirement but does not eliminate it. Train on 50,000 synthetic episodes, then fine-tune on 5,000 real episodes from DROID or BridgeData V2. This hybrid approach outperforms 50,000 real episodes alone because synthetic data provides diverse visual and physics coverage while real data corrects simulator biases. Store both in the same format (RLDS or LeRobot HDF5) and mix them in the training dataloader with a 10:1 or 5:1 synthetic:real ratio. For contact-rich tasks, increase the real ratio to 3:1 or 1:1.

What hardware do I need to generate 50,000 episodes?

For Isaac Sim: RTX 4090 or A6000 GPU (24GB VRAM), 64GB RAM, 2TB SSD. Generates 5-10 episodes/minute at 224x224 RGB, 50Hz, 10s episodes — budget 3-7 days for 50,000 episodes on a single GPU. For MuJoCo: 12-core CPU (Ryzen 9 or Xeon), 32GB RAM. Generates 20-50 episodes/minute (no rendering) — budget 1-2 days for 50,000 episodes. Cloud alternative: AWS EC2 g5.4xlarge (A10G GPU, $1.20/hr) or GCP n1-standard-16 (16 vCPU, $0.76/hr). Total cost: $500-1,500 for 50,000 episodes including storage (S3/GCS).

How do I validate that my synthetic data matches real-world statistics?

Collect 500-1,000 real episodes and compute distribution statistics: object pose variance, grasp success rate, episode length, action magnitudes. Compare to synthetic data distributions. If real object poses have σ=2cm but synthetic has σ=5cm, tighten your pose randomization. If real grasp success is 80% but synthetic is 95%, your sim friction is too high (objects are easier to grasp). Use two-sample Kolmogorov-Smirnov tests to quantify distribution mismatch. Visualize side-by-side: 100 real frames vs 100 synthetic frames — if you can easily distinguish them, your visual randomization is insufficient.

Looking for a domain randomization pipeline?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

Browse Physical AI Datasets on Truelabel