Physical AI Evaluation
How to Evaluate Sim-to-Real Transfer Performance
Sim-to-real transfer evaluation proceeds in three phases: establish simulation baselines across 1,000+ episodes, measuring success rates and action distributions; execute controlled real-world trials with matched initial conditions while logging visual and dynamics discrepancies; and attribute performance gaps to specific sources (visual domain shift, physics mismatch, or actuation errors) using diagnostic metrics such as CLIP embedding distance and trajectory RMSE to guide domain randomization or fine-tuning interventions.
Quick facts
- Difficulty: Intermediate
- Audience: Physical AI data engineers
- Last reviewed: 2025-06-15
Why Sim-to-Real Evaluation Matters for Robot Deployment
Simulation training generates robot policies at scale—RT-1 trained on 130,000 episodes from seven robots, Open X-Embodiment pooled 527 skills across 22 embodiments—but real-world success rates drop 20-60% on transfer[1]. The reality gap stems from three sources: visual domain shift (lighting, textures, camera noise), dynamics mismatch (contact models, friction, actuator lag), and task distribution drift (object pose variance, distractor clutter).
Systematic evaluation isolates these failure modes. Without controlled protocols, teams waste weeks tuning hyperparameters that address symptoms rather than root causes. Domain randomization closes visual gaps by varying sim rendering parameters; dynamics randomization perturbs mass, friction, and PID gains; but both require gap measurements to guide intervention priority. A policy failing 80% of real grasps due to depth estimation errors needs different fixes than one failing on contact-rich insertion.
DROID's 76,000 real-world trajectories show that policies pretrained on diverse sim data then fine-tuned on 200-500 real demonstrations achieve 85-92% real success rates[2]. Evaluation infrastructure—logging, metrics, attribution—determines how fast you reach that threshold. The Scale Physical AI platform and truelabel's data marketplace supply real-world validation datasets, but you must design the evaluation protocol that surfaces actionable gaps.
Establish Simulation Baselines Before Real-World Transfer
Run 1,000+ simulation episodes across your full task distribution before touching hardware. Record per-task success rates, failure taxonomies (grasp slip, collision, timeout), and action trajectory statistics (mean gripper velocity, joint acceleration peaks). A sim success rate below 85% signals undertrained policies—real performance will be 15-30 percentage points lower[3]. Fix simulation training first.
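A short aggregation pass over your episode logs surfaces undertrained tasks before any hardware time is booked. The sketch below assumes each episode record is a dict with hypothetical task, success, and failure_mode fields, and a load_episodes loader; adapt the names to your own logging schema.

```python
from collections import Counter, defaultdict

def summarize_sim_baseline(episodes):
    """Aggregate per-task success rates and failure taxonomies from episode logs."""
    per_task = defaultdict(lambda: {"n": 0, "successes": 0, "failures": Counter()})
    for ep in episodes:
        stats = per_task[ep["task"]]
        stats["n"] += 1
        if ep["success"]:
            stats["successes"] += 1
        else:
            stats["failures"][ep["failure_mode"]] += 1

    return {
        task: {
            "episodes": stats["n"],
            "success_rate": stats["successes"] / stats["n"],
            "failure_modes": dict(stats["failures"].most_common()),
        }
        for task, stats in per_task.items()
    }

# Example: flag tasks below the 85% sim-success threshold before real trials.
# report = summarize_sim_baseline(load_episodes("sim_runs/"))  # load_episodes is your own loader
# undertrained = {t: r for t, r in report.items() if r["success_rate"] < 0.85}
```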
Compute observation statistics: image channel means/stds, depth histogram percentiles, proprioceptive state ranges. Compare against a real-world calibration set of 50-100 trajectories collected via teleoperation. ALOHA's teleoperation setup provides this baseline efficiently. If sim RGB images average 140/255 brightness while real images average 95/255, you have a 32% luminance gap that domain randomization must address by varying light intensity ±40% during training.
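A minimal sketch of that comparison, assuming images arrive as uint8 arrays of shape (N, H, W, 3) and depth maps in meters; the flagging thresholds in the comments are illustrative rather than canonical.

```python
import numpy as np

def observation_gap(sim_images, real_images, sim_depth, real_depth):
    """Compare first-order observation statistics between sim and real data."""
    sim_brightness = sim_images.astype(np.float32).mean()
    real_brightness = real_images.astype(np.float32).mean()
    # Relative luminance gap, e.g. sim 140/255 vs real 95/255 -> ~0.32
    luminance_gap = abs(sim_brightness - real_brightness) / max(sim_brightness, real_brightness)

    percentiles = [5, 25, 50, 75, 95]
    depth_gap = np.abs(
        np.percentile(sim_depth, percentiles) - np.percentile(real_depth, percentiles)
    )

    return {
        "sim_brightness": sim_brightness,
        "real_brightness": real_brightness,
        "luminance_gap": luminance_gap,          # flag if > ~0.2
        "depth_percentile_gap_m": depth_gap,     # flag if any entry > ~0.05 m
    }
```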
Rank tasks by transfer difficulty. Policies trained on RoboSuite pick-and-place transfer better than those trained on deformable object manipulation; tasks with <5mm precision requirements (peg insertion, connector mating) show 40-60% real success drops versus 10-20% for coarse manipulation[4]. Start real evaluation with easiest tasks to validate your measurement stack, then progress to harder scenarios. Log every sim episode to RLDS format or MCAP for reproducible analysis.
Design Controlled Real-World Evaluation Protocols
Match initial conditions between sim and real trials. If your sim policy trains on objects placed within a 10cm×10cm region, real evaluation must use the same bounds—measured with ArUco markers or motion capture. Uncontrolled pose variance inflates failure rates by 15-25%[5]. Use 3D-printed jigs or laser-cut templates to ensure repeatability across 20-50 trials per task.
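One lightweight pre-trial gate rejects episodes whose starting object pose drifts outside the sim training region. The sketch below assumes OpenCV 4.7+ (for ArucoDetector), a calibrated overhead camera roughly centered on the workspace, and a single marker fixed to the object; MARKER_SIZE_M and WORKSPACE_BOUNDS are placeholders for your setup.

```python
import cv2
import numpy as np

MARKER_SIZE_M = 0.04                           # printed ArUco marker edge length
WORKSPACE_BOUNDS = np.array([[-0.05, 0.05],    # x range (m), camera frame
                             [-0.05, 0.05]])   # y range (m): the 10cm x 10cm region

def initial_pose_in_bounds(image, camera_matrix, dist_coeffs):
    """Detect the object's ArUco marker and check its translation against the
    training region used in simulation (simplified to the camera x-y plane)."""
    aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
    detector = cv2.aruco.ArucoDetector(aruco_dict, cv2.aruco.DetectorParameters())
    corners, ids, _ = detector.detectMarkers(cv2.cvtColor(image, cv2.COLOR_BGR2GRAY))
    if ids is None:
        return False, None

    # 3D corners of a square marker centered at the origin, z = 0.
    half = MARKER_SIZE_M / 2.0
    obj_pts = np.array([[-half, half, 0], [half, half, 0],
                        [half, -half, 0], [-half, -half, 0]], dtype=np.float32)
    ok, rvec, tvec = cv2.solvePnP(
        obj_pts, corners[0].reshape(4, 2).astype(np.float32),
        camera_matrix, dist_coeffs)
    if not ok:
        return False, None

    tx, ty, _ = tvec.ravel()
    in_bounds = (WORKSPACE_BOUNDS[0, 0] <= tx <= WORKSPACE_BOUNDS[0, 1] and
                 WORKSPACE_BOUNDS[1, 0] <= ty <= WORKSPACE_BOUNDS[1, 1])
    return in_bounds, tvec.ravel()
```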
Standardize lighting and backgrounds. EPIC-KITCHENS captured 100 hours across varied home environments, but evaluation requires consistency. Install diffuse LED panels (5000-6500K color temperature) to eliminate shadows; use neutral backdrops (18% gray cards) to reduce segmentation noise. Measure illuminance with a lux meter—target ±10% variance across trials. If your sim uses HDR environment maps, capture real-world lighting with a 360° camera and match sim parameters.
Log synchronized sensor streams: RGB-D at 30Hz, proprioceptive state at 100Hz, commanded actions at policy frequency (10-20Hz). LeRobot's dataset format provides a reference schema. Record ground-truth object poses via external tracking (OptiTrack, Vicon) for the first 10 trials to validate policy perception. Store raw data in MCAP containers with ROS2 message definitions—this enables post-hoc replay and alternative metric computation without re-running expensive real trials.
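A minimal writer sketch using the mcap Python package with JSON-encoded messages; production pipelines typically write ROS2 CDR-encoded messages instead (for example via mcap-ros2-support), and the topic name and data source here are illustrative.

```python
import json
import time

from mcap.writer import Writer

with open("trial_0001.mcap", "wb") as f:
    writer = Writer(f)
    writer.start()
    schema_id = writer.register_schema(
        name="robot_state", encoding="jsonschema",
        data=json.dumps({"type": "object"}).encode(),
    )
    channel_id = writer.register_channel(
        topic="/proprio/state", message_encoding="json", schema_id=schema_id,
    )
    # stream_proprioception_at_100hz() stands in for your own data source.
    for state in stream_proprioception_at_100hz():
        t_ns = time.time_ns()
        writer.add_message(
            channel_id=channel_id,
            log_time=t_ns,
            publish_time=t_ns,
            data=json.dumps(state).encode(),
        )
    writer.finish()
```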
Execute Real-World Trials and Collect Gap Diagnostics
Run 50-100 real-world episodes per task, stratified across object instances and initial poses. For each trial, log success/failure, failure mode taxonomy (grasp slip, collision, wrong object, timeout), and trajectory data. Compute real-world success rate and compare to sim baseline—a 35% drop (e.g., 90% sim → 55% real) indicates severe transfer issues requiring multi-pronged fixes.
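Because 50-100 trials leave wide confidence intervals, report an interval alongside the point estimate. A small Wilson-interval sketch (trial counts are illustrative):

```python
import numpy as np

def wilson_interval(successes, n, z=1.96):
    """95% Wilson score interval for a binomial success rate."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Example: 55% real success over 60 trials vs a 90% sim baseline.
real_lo, real_hi = wilson_interval(successes=33, n=60)
sim_rate = 0.90
print(f"real success 95% CI: [{real_lo:.2f}, {real_hi:.2f}]")
print(f"sim-to-real gap: {sim_rate - 33 / 60:.2f}")  # ~0.35 -> severe transfer issues
```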
Measure visual domain shift with embedding distances. Extract CLIP ViT-L/14 embeddings from sim and real RGB observations at corresponding trajectory timesteps. Compute mean cosine distance—values >0.25 indicate significant appearance gaps[6]. OpenVLA's 7B vision-language-action model shows that policies robust to CLIP distance <0.20 transfer with <10% success degradation. If distance exceeds 0.30, expand domain randomization: vary texture maps, add procedural noise, randomize camera exposure ±2 stops.
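A sketch of this metric using the Hugging Face transformers CLIP implementation; pairing sim and real frames at matching trajectory timesteps is left to your data pipeline.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_embeddings(images):
    """images: list of PIL.Image or HxWx3 uint8 arrays."""
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def mean_cosine_distance(sim_frames, real_frames):
    sim_emb = clip_embeddings(sim_frames)
    real_emb = clip_embeddings(real_frames)
    cos_sim = (sim_emb * real_emb).sum(dim=-1)   # one pair per timestep
    return float((1.0 - cos_sim).mean())         # >0.25 -> significant appearance gap
```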
Quantify dynamics mismatch via trajectory RMSE. Replay real robot joint commands in simulation and compare resulting trajectories. Position RMSE >3cm or orientation RMSE >8° suggests physics model errors. Dynamics randomization addresses this by perturbing link masses ±20%, friction coefficients 0.5-2.0×, and PID gains ±30% during training. For contact-rich tasks, measure contact force distributions—real grasps often apply 2-5N while sim applies 0.5-1.5N due to simplified contact models[7]. NVIDIA Cosmos world foundation models offer learned physics priors that reduce this gap.
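A sketch of both RMSE computations, assuming time-aligned (T, 3) position arrays in meters and (T, 4) xyzw quaternions:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def trajectory_rmse(real_pos, sim_pos, real_quat, sim_quat):
    """Compare replayed sim trajectories against logged real trajectories."""
    pos_rmse = np.sqrt(np.mean(np.sum((real_pos - sim_pos) ** 2, axis=1)))

    # Geodesic angle between real and sim orientation at each timestep.
    rel = R.from_quat(real_quat) * R.from_quat(sim_quat).inv()
    angles_deg = np.degrees(np.linalg.norm(rel.as_rotvec(), axis=1))
    ori_rmse = np.sqrt(np.mean(angles_deg ** 2))

    return pos_rmse, ori_rmse  # flag physics mismatch if >0.03 m or >8 deg
```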
Attribute Failures to Specific Gap Sources
Decompose each failure into visual, dynamics, or policy components. For grasp failures, check: (1) Was the object correctly segmented? Compare real and sim depth maps—if real depth noise >5mm std, add depth augmentation. (2) Did the gripper reach the target pose? Measure end-effector position error—if >2cm, dynamics mismatch dominates. (3) Did contact forces match expectations? If real slip occurs at 40% lower force than sim, friction coefficients need adjustment.
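That checklist maps naturally onto a coarse attribution helper. The function below is a hypothetical sketch that mirrors the thresholds above; tune them to your robot and sensors.

```python
def attribute_grasp_failure(depth_noise_std_m, ee_pos_error_m,
                            real_slip_force_n, sim_slip_force_n):
    """Map per-trial diagnostics onto coarse gap labels; thresholds follow the text."""
    causes = []
    if depth_noise_std_m > 0.005:                   # check (1): depth/segmentation quality
        causes.append("visual: depth noise, add depth augmentation")
    if ee_pos_error_m > 0.02:                       # check (2): reaching accuracy
        causes.append("dynamics: end-effector tracking error")
    if real_slip_force_n < 0.6 * sim_slip_force_n:  # check (3): slip at ~40% lower force
        causes.append("dynamics: sim friction coefficients too optimistic")
    return causes or ["policy: inputs nominal, suspect generalization failure"]
```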
Use ablation studies to isolate causes. Run the policy on real RGB + sim depth to test visual transfer; run on sim RGB + real proprioception to test dynamics. RT-1's evaluation showed that 60% of failures stemmed from visual domain shift, 25% from dynamics, 15% from policy generalization. Your task mix will differ—insertion tasks skew toward dynamics (70%), while bin-picking skews visual (80%).
Quantify the cost of each gap source. If closing visual gaps via 500 additional sim training hours yields +12% real success but closing dynamics gaps via 200 real fine-tuning demonstrations yields +18%, prioritize real data collection. Truelabel's marketplace offers real robot trajectories at $50-200 per task-hour; Scale AI's Universal Robots partnership provides annotation infrastructure. BridgeData V2's 60,000 trajectories cost ~$400K to collect[5]—budget accordingly based on your gap attribution results.
Close Gaps Through Domain Randomization or Fine-Tuning
For visual gaps (CLIP distance >0.25), expand domain randomization. Vary lighting intensity ±50%, color temperature 3000-7000K, add Gaussian noise σ=0.02-0.05, apply random crops and perspective transforms. Tobin et al. 2017 showed that randomizing five visual parameters (lighting, texture, camera pose, background, distractor objects) closed 70% of the sim-to-real gap for object detection. Retrain for 20-40% additional sim episodes—typically 2,000-5,000 episodes for manipulation tasks.
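The image-space half of that recipe can be expressed as a torchvision augmentation pipeline (renderer-side parameters such as light intensity, color temperature, and texture maps are varied in the simulator config instead); the parameter ranges below mirror the text and are starting points, not canonical values.

```python
import torch
from torchvision import transforms

def add_gaussian_noise(img, sigma_range=(0.02, 0.05)):
    """Additive Gaussian pixel noise with per-image sigma sampled from sigma_range."""
    sigma = torch.empty(1).uniform_(*sigma_range).item()
    return (img + sigma * torch.randn_like(img)).clamp(0.0, 1.0)

visual_randomization = transforms.Compose([
    transforms.ToTensor(),                                  # HWC uint8 -> CHW float in [0, 1]
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomPerspective(distortion_scale=0.2, p=0.5),
    transforms.ColorJitter(brightness=0.5, contrast=0.3, saturation=0.3, hue=0.05),
    transforms.Lambda(add_gaussian_noise),
])
```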
For dynamics gaps (trajectory RMSE >3cm), implement dynamics randomization. Perturb object masses ±25%, surface friction 0.4-1.5×, joint damping ±40%, actuator time constants ±30%. OpenAI's Dactyl work used 1,000+ randomized physics parameters to achieve cube reorientation transfer. Start with 10-15 parameters covering the dominant failure modes identified in your attribution analysis, then expand if real success rates plateau.
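A per-episode randomization sketch for a MuJoCo model, perturbing masses, sliding friction, and joint damping within the ranges above; the XML path is a placeholder, and the exact MjModel fields you perturb should follow your attribution analysis.

```python
import numpy as np
import mujoco

base = mujoco.MjModel.from_xml_path("scene.xml")   # path is a placeholder

def randomized_model(rng: np.random.Generator) -> mujoco.MjModel:
    """Return a fresh model with dynamics parameters sampled around the base values."""
    model = mujoco.MjModel.from_xml_path("scene.xml")
    model.body_mass[:] = base.body_mass * rng.uniform(0.75, 1.25, base.body_mass.shape)
    # geom_friction columns: sliding, torsional, rolling friction.
    model.geom_friction[:, 0] = base.geom_friction[:, 0] * rng.uniform(
        0.4, 1.5, base.geom_friction.shape[0])
    model.dof_damping[:] = base.dof_damping * rng.uniform(0.6, 1.4, base.dof_damping.shape)
    return model

# model = randomized_model(np.random.default_rng(seed=0)); data = mujoco.MjData(model)
```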
When randomization yields <5% improvement after two iterations, switch to real-world fine-tuning. Collect 200-500 demonstrations via teleoperation or kinesthetic teaching. LeRobot's training scripts support fine-tuning RT-2 and diffusion policies on small real datasets. Budget 40-80 hours for data collection (2-4 hours per task × 20 tasks) plus 10-20 GPU-hours for training. Real fine-tuning typically recovers 60-80% of the remaining sim-to-real gap[8].
Validate Transfer with Benchmark Suites and Long-Horizon Tasks
Test on standardized benchmarks to compare against published baselines. THE COLOSSEUM provides 20 manipulation tasks with difficulty ratings; CALVIN offers 34 long-horizon tasks in kitchen environments; ManiSkill covers 2,000+ object instances across 20 task families. Report success rates, average episode length, and failure mode distributions—this enables apples-to-apples comparison with RT-1 (97% on seen tasks, 76% on unseen), OpenVLA (87% generalist performance), and RoboCat (90% after 1,000 fine-tuning demos).
Evaluate long-horizon robustness with 5-10 step task chains. LongBench shows that policies achieving 85% single-step success drop to 45-60% on three-step sequences due to compounding errors. Test your policy on "pick A, place in B, pick C, stack on A" sequences—if success drops >30% versus single-step tasks, your policy lacks error recovery. ManipArena's reasoning-oriented tasks require 8-12 step plans; real-world deployment demands this capability.
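The drop is roughly what independent per-step success predicts; under that (optimistic) independence assumption, chain success over k steps is:

```latex
P_{\text{chain}} \approx p^{k}, \qquad 0.85^{3} \approx 0.61
```

Real sequences land at or below this bound because early errors shift later-step initial conditions outside the training distribution.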
Measure sample efficiency: how many real demonstrations close the gap to within 5% of sim performance? DROID policies reached 85% real success with 200-500 demos; RT-1 needed 130K sim + 3K real episodes. If your policy requires >1,000 real demos per task, revisit sim training—poor sim generalization cannot be fixed cheaply with real data. Track cost per percentage point of real success gain to guide data acquisition budgets.
Document Results and Build Continuous Evaluation Infrastructure
Publish evaluation reports with: sim baseline metrics (success rate, failure modes, observation statistics), real-world results (success rate, gap magnitude, failure attribution), intervention details (randomization parameters, fine-tuning dataset size), and final transfer performance. Include per-task breakdowns—aggregate metrics hide critical details. Model Cards for Model Reporting and Datasheets for Datasets provide documentation templates.
Build automated evaluation pipelines that run nightly. Use RLBench or RoboSuite for sim regression tests; schedule weekly real-world validation runs (10-20 trials per task) to catch performance drift. LeRobot's evaluation scripts integrate with Weights & Biases for metric tracking. Set alert thresholds: if real success drops >10% week-over-week, trigger gap re-attribution.
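A minimal week-over-week drift check, assuming per-task success rates are already aggregated; alert routing (Slack, W&B alerts, pager) is left to your infrastructure.

```python
def weekly_drift_alerts(weekly_rates, threshold=0.10):
    """Flag tasks whose real-world success dropped more than `threshold` week-over-week.

    weekly_rates: dict mapping task name -> list of success rates ordered by week,
    e.g. {"peg_insertion": [0.82, 0.80, 0.66]}.
    """
    alerts = []
    for task, rates in weekly_rates.items():
        if len(rates) >= 2 and (rates[-2] - rates[-1]) > threshold:
            alerts.append(
                f"{task}: success dropped {rates[-2]:.0%} -> {rates[-1]:.0%}; "
                "trigger gap re-attribution"
            )
    return alerts
```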
Archive all evaluation data—sim episodes, real trajectories, sensor logs, failure videos—in RLDS or MCAP format with provenance metadata. Open X-Embodiment demonstrates the value of shared evaluation datasets: 22 institutions contributed data, enabling cross-embodiment transfer research. Your evaluation logs become training data for future policies—budget storage accordingly (expect 50-200GB per 1,000 real episodes with RGB-D at 30Hz).
External references and source context
1. Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning (arXiv). Survey documenting 20-60% real-world success rate drops on sim-to-real transfer.
2. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv). DROID policies achieving 85-92% real success with 200-500 fine-tuning demonstrations.
3. Sim-to-Real Transfer of Robotic Control with Dynamics Randomization (arXiv). Dynamics randomization research showing 15-30 percentage point real performance drops from sim baselines.
4. THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation (arXiv). Benchmark showing 40-60% success drops for precision tasks versus 10-20% for coarse manipulation.
5. BridgeData V2: A Dataset for Robot Learning at Scale (arXiv). Dataset with 60,000 trajectories, costing approximately $400K to collect.
6. OpenVLA: An Open-Source Vision-Language-Action Model (arXiv). 7B vision-language-action model showing <10% degradation with CLIP distance <0.20.
7. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv). Vision-language-action model transferring web knowledge to robotic control.
8. Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). Pooled 527 skills across 22 embodiments, demonstrating cross-embodiment transfer.
FAQ
What sim-to-real success rate drop is acceptable for production deployment?
Target <15% drop for production systems. Policies achieving 90% sim success should reach ≥75% real success after domain randomization and fine-tuning. Drops >25% indicate fundamental sim-real misalignment requiring physics model improvements or expanded real training data. Safety-critical applications (surgical robots, autonomous vehicles) demand <5% gaps, achievable only with high-fidelity simulators and extensive real validation datasets.
How many real-world trials are needed to reliably measure transfer performance?
Run 50-100 trials per task for statistical significance. With 50 trials, you can detect ±14% success rate differences at 95% confidence (binomial proportion test). For tasks with high variance (e.g., deformable object manipulation), increase to 100-200 trials. Stratify across object instances (5-10 per category) and initial conditions (3-5 pose clusters) to avoid sampling bias that inflates or deflates measured performance.
Should I prioritize domain randomization or real-world fine-tuning?
Start with domain randomization—it scales better and costs less. Randomizing 10-15 visual and dynamics parameters typically closes 60-75% of the sim-to-real gap for $0 marginal cost beyond compute. Switch to real fine-tuning when randomization improvements plateau (<3% gain per iteration) or when task-specific real data is cheap to collect. Budget $50-200 per task-hour for teleoperation; 200-500 demos (40-80 collection hours) usually suffice for 10-15% additional real success gain.
What metrics best predict real-world transfer success?
CLIP embedding distance between sim and real observations predicts visual transfer: <0.20 correlates with <10% success drop, >0.30 predicts >25% drop. For dynamics, trajectory RMSE when replaying real commands in sim: <2cm position error and <5° orientation error indicate good physics alignment. Combine these with sim task diversity metrics—policies trained on 50+ object instances transfer 15-20% better than those trained on 10 instances.
How do I attribute failures when multiple gap sources interact?
Use ablation studies with mixed sim-real inputs. Run the policy on: (1) real RGB + sim depth/proprioception to isolate visual gaps, (2) sim RGB + real depth/proprioception to isolate dynamics gaps, (3) real RGB-D + sim actions to test perception, (4) sim RGB-D + real actions to test control. Compare success rates across conditions—the largest drop identifies the dominant gap source. For contact-rich tasks, log force-torque sensor data and compare real vs. sim contact force distributions.
What simulation fidelity is required for successful transfer?
Physics timestep ≤1ms, contact solver with ≥10 iterations, and realistic actuator models (PID gains, velocity limits, backlash) are minimum requirements. Visual fidelity matters less than diversity—policies trained on 100 procedurally randomized textures transfer better than those trained on 10 photorealistic textures. For manipulation, simulate friction coefficients within ±20% of real values and object masses within ±15%. High-fidelity simulators like NVIDIA Isaac Sim or MuJoCo with carefully tuned parameters achieve 80-90% real success without fine-tuning for pick-and-place tasks.
Looking for sim-to-real transfer evaluation?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.
Browse Physical AI Datasets