Physical AI Evaluation
How to Evaluate Robot Policy Performance
Robot policy evaluation requires binary success criteria (e.g., object lifted >5cm AND placed within 3cm of target), controlled variation across object poses and lighting, minimum 50 trials per condition for 80% power at p<0.05, video logging of every trial, and failure-mode taxonomies that map directly to data collection priorities—teams shipping policies without this rigor see 40–60% deployment failure rates.
Quick facts
- Difficulty: Intermediate
- Audience: Physical AI data engineers
- Last reviewed: 2025-01-15
Why Rigorous Evaluation Separates Research from Deployment
Academic robot learning papers report success rates on narrow task distributions—RT-1 achieved 97% on 17 tasks in controlled settings, but deployment teams discover 40–60% failure rates when object poses, lighting, or clutter deviate from training distributions[1]. The gap stems from evaluation protocols that conflate in-distribution performance with generalization. Real-world deployment demands evaluation across the full operational design domain: every object pose the system will encounter, every lighting condition, every occlusion pattern.
Statistically rigorous evaluation requires minimum sample sizes—50 trials per condition yields 80% power to detect a 20-percentage-point difference at p<0.05 significance[2]. Teams running 10–20 trials per condition cannot distinguish true policy improvements from random variation. Open X-Embodiment aggregated 527 skills across 22 robot embodiments but noted that per-task sample sizes in contributed datasets ranged from 12 to 2,847 episodes, making cross-dataset performance comparisons statistically invalid.
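As a minimal sketch of the pre-trial power analysis, the snippet below uses statsmodels to solve for trials per condition given a baseline rate, a target improvement, 80% power, and p<0.05. The exact trial count depends on the assumed baseline rate and the approximation used, so treat the output as a planning estimate rather than the fixed "50 trials" figure.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Baseline success rate and the improvement we want to detect (illustrative values).
p_baseline = 0.60
p_improved = 0.80  # a 20-percentage-point improvement

# Cohen's h effect size for comparing two proportions.
effect_size = proportion_effectsize(p_improved, p_baseline)

# Trials per condition for 80% power at alpha = 0.05 (two-sided).
n_per_condition = NormalIndPower().solve_power(
    effect_size=effect_size, power=0.80, alpha=0.05, alternative="two-sided"
)
print(f"Trials per condition: {n_per_condition:.0f}")
```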
Failure mode taxonomies connect evaluation results to data collection priorities. If 35% of failures stem from grasp pose errors on reflective objects, the next data collection sprint targets reflective-surface grasps with pose diversity. DROID collected 76,000 trajectories across 564 skills and 86 buildings specifically to cover failure modes identified in prior BridgeData V2 evaluations. Without structured failure logging, teams iterate blindly—adding more data without addressing the distribution gaps that cause deployment failures.
Define Task-Specific Binary Success Criteria
Vague criteria like 'the robot successfully picks up the object' are unenforceable—does partial contact count? Does the robot need to transport the object or just lift it? Binary success criteria eliminate ambiguity: 'Success = object lifted >5cm above table surface AND transported to within 3cm of target position AND released without bouncing >1cm from target.' Every condition is measurable with sensors or post-hoc video analysis.
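A minimal sketch of how those thresholds might be codified so the label comes from measurements rather than evaluator judgment. The field names and example values are illustrative assumptions, not a standard schema; substitute whatever your sensors or video analysis actually produce.

```python
from dataclasses import dataclass

@dataclass
class PickPlaceMeasurements:
    # Post-hoc measurements from sensors or video analysis (illustrative fields).
    lift_height_m: float      # peak object height above the table surface
    placement_error_m: float  # final distance from the target position
    bounce_distance_m: float  # displacement after release

def is_success(m: PickPlaceMeasurements) -> bool:
    """Binary success: lifted >5 cm AND placed within 3 cm AND bounced <=1 cm."""
    return (
        m.lift_height_m > 0.05
        and m.placement_error_m <= 0.03
        and m.bounce_distance_m <= 0.01
    )

# Example: lifted 12 cm, landed 2 cm from target, bounced 0.5 cm.
print(is_success(PickPlaceMeasurements(0.12, 0.02, 0.005)))  # True
```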
For insertion tasks: 'Success = peg inserted into hole with top surface flush within 2mm AND no contact force exceeded 30N during insertion AND insertion completed within 15 seconds.' RT-2 evaluations used similar force-threshold criteria to distinguish successful insertions from jamming events. Multi-step tasks require success definitions for each substep and the full sequence—CALVIN defines 34 long-horizon tasks as sequences of atomic skills, scoring partial credit for completed substeps.
Pre-calibrate inter-rater reliability before running trials. Two evaluators independently score 20 recorded trials using the written criteria, then compute Cohen's kappa—values below 0.8 indicate ambiguous criteria that need refinement[3]. EPIC-KITCHENS-100 achieved 0.91 kappa on action-segment boundaries by iteratively refining annotation guidelines with evaluator feedback. Create a one-page score sheet listing trial number, initial condition (object ID, pose, lighting), binary success/failure, completion time, and failure mode dropdown (grasp failure, collision, timeout, other).
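Computing the agreement statistic is a one-liner with scikit-learn; the sketch below assumes the two evaluators' binary scores for the 20 calibration trials are already in lists (the values shown are placeholders).

```python
from sklearn.metrics import cohen_kappa_score

# Binary success/failure scores from two evaluators on the same 20 recorded trials
# (placeholder values; substitute the columns from your score sheets).
rater_a = [1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0]
rater_b = [1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")  # below 0.8 means the criteria need refinement
```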
Design Evaluation Protocols with Controlled Variation
Identify the operational design domain (ODD)—the full range of conditions the deployed policy will encounter. For a bin-picking task: object poses spanning 360° rotation and ±10cm position variation, lighting from 200 to 800 lux, clutter density from 0 to 5 distractor objects. Naive evaluation samples one pose per object and declares success, but ManipArena showed that policies achieving 89% success on canonical poses drop to 34% when poses are uniformly sampled across the full SE(3) space.
Latin hypercube sampling or full factorial designs ensure coverage without exhaustive enumeration. For 3 objects × 5 poses × 3 lighting conditions = 45 combinations, a 50-trial budget allows 1.1 trials per combination—statistically useless. Instead, run 50 trials with Latin hypercube sampling across the continuous pose and lighting ranges, ensuring every decile of each variable is represented[4]. RoboNet used stratified sampling across 7 robot platforms and 113 camera viewpoints to ensure coverage, but many contributed datasets had <10 examples per viewpoint.
Randomize trial order to avoid confounds—if you evaluate all easy poses first, robot wear or battery depletion will bias later trials. Use a random number generator to shuffle the trial sequence before execution. LongBench evaluated 120 long-horizon tasks with randomized object placements and found that policies trained on fixed orderings failed 67% of randomized trials despite 91% success on canonical orderings. Document the randomization seed in your evaluation log for reproducibility.
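A sketch of both steps using SciPy's quasi-Monte Carlo module: draw 50 trial conditions with Latin hypercube sampling over the bin-picking ranges given above, then shuffle the execution order with a logged seed. The variable bounds and the seed value are illustrative assumptions.

```python
import numpy as np
from scipy.stats import qmc

SEED = 20250115  # record this seed in the evaluation log
n_trials = 50

# Continuous variables: object yaw (deg), x offset (cm), y offset (cm), lighting (lux).
sampler = qmc.LatinHypercube(d=4, seed=SEED)
unit_samples = sampler.random(n=n_trials)
lower = [0.0, -10.0, -10.0, 200.0]
upper = [360.0, 10.0, 10.0, 800.0]
conditions = qmc.scale(unit_samples, lower, upper)

# Shuffle execution order so easy and hard conditions are not clustered in time.
rng = np.random.default_rng(SEED)
trial_order = rng.permutation(n_trials)
for trial_idx in trial_order:
    yaw_deg, dx_cm, dy_cm, lux = conditions[trial_idx]
    # ...set up the scene to these values and run the trial...
```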
Execute Evaluation Trials with Rigorous Logging
Run every trial to completion or timeout—do not abort early even if failure is obvious. Early aborts introduce selection bias (you only log 'interesting' failures) and prevent analysis of recovery behaviors. Set a maximum trial duration (e.g., 60 seconds for pick-and-place, 180 seconds for multi-step assembly) and log timeout as a distinct failure mode.
Record video from multiple viewpoints for every trial—wrist camera, third-person overhead, third-person side view. DROID collected 76,000 trajectories with synchronized wrist and third-person video at 15 Hz, enabling post-hoc failure analysis when real-time telemetry missed grasp-pose errors. Store videos in MCAP or HDF5 containers with frame-synchronized robot state (joint positions, gripper force, end-effector pose) and trial metadata (object ID, initial pose, success label).
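One possible way to bundle frame-synchronized video, robot state, and trial metadata into a single HDF5 file with h5py is sketched below. The dataset names, array shapes, and metadata keys are illustrative assumptions rather than a fixed schema; MCAP is an equally valid container.

```python
import h5py
import numpy as np

def write_trial(path, video, timestamps, joint_pos, gripper_force, meta):
    """Store one trial: video frames, synchronized robot state, and metadata attributes."""
    with h5py.File(path, "w") as f:
        f.create_dataset("video/wrist", data=video, compression="gzip")  # (T, H, W, 3) uint8
        f.create_dataset("timestamps", data=timestamps)                  # (T,) seconds
        f.create_dataset("robot/joint_positions", data=joint_pos)        # (T, 7)
        f.create_dataset("robot/gripper_force", data=gripper_force)      # (T,)
        for key, value in meta.items():                                  # e.g. object_id, success
            f.attrs[key] = value

# Illustrative call with dummy arrays for a 60-frame trial at 15 Hz.
write_trial(
    "trial_0001.h5",
    video=np.zeros((60, 240, 320, 3), dtype=np.uint8),
    timestamps=np.arange(60) / 15.0,
    joint_pos=np.zeros((60, 7)),
    gripper_force=np.zeros(60),
    meta={"object_id": "mug_03", "success": False, "failure_mode": "grasp_slippage"},
)
```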
Log every failure mode using a predefined taxonomy: grasp failure (no contact, slippage, collision), transport failure (dropped object, collision with obstacle), placement failure (missed target, excessive force), timeout. RT-1 evaluations logged 13 distinct failure modes across 17 tasks, revealing that 42% of failures stemmed from grasp-pose errors on transparent objects—a distribution gap addressed in subsequent data collection. Independent evaluators should log failure modes without consulting the policy developer to avoid confirmation bias.
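A minimal sketch of that taxonomy as an enumeration, so evaluators pick from a fixed vocabulary instead of free text and tallies stay consistent across trials. The category names mirror the taxonomy above; the logged labels are placeholders.

```python
from enum import Enum
from collections import Counter

class FailureMode(Enum):
    GRASP_NO_CONTACT = "grasp_no_contact"
    GRASP_SLIPPAGE = "grasp_slippage"
    GRASP_COLLISION = "grasp_collision"
    TRANSPORT_DROP = "transport_drop"
    TRANSPORT_COLLISION = "transport_collision"
    PLACEMENT_MISSED_TARGET = "placement_missed_target"
    PLACEMENT_EXCESSIVE_FORCE = "placement_excessive_force"
    TIMEOUT = "timeout"

# Tally failure modes across logged trials (placeholder labels).
logged = [FailureMode.GRASP_SLIPPAGE, FailureMode.TIMEOUT, FailureMode.GRASP_SLIPPAGE]
print(Counter(mode.value for mode in logged))
```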
Analyze Results with Statistical Rigor
Compute success rate with 95% confidence intervals using the Wilson score interval (not normal approximation, which fails for rates near 0% or 100%). For 50 trials with 40 successes, the Wilson interval is [67%, 89%]—a 22-percentage-point range that makes claims like '80% success' misleading[5]. OpenVLA reported success rates with exact binomial confidence intervals across 29 tasks, showing that many task-level rates had ±15-percentage-point uncertainty.
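statsmodels computes the Wilson interval directly; the sketch below reproduces the 40-of-50 example above.

```python
from statsmodels.stats.proportion import proportion_confint

successes, trials = 40, 50
low, high = proportion_confint(successes, trials, alpha=0.05, method="wilson")
print(f"Success rate {successes / trials:.0%}, 95% Wilson CI [{low:.0%}, {high:.0%}]")
# -> roughly [67%, 89%], a 22-percentage-point range
```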
Use chi-squared tests to compare success rates across conditions (e.g., does the policy perform differently on red vs. blue objects?). For continuous metrics like completion time, use Mann-Whitney U tests (non-parametric) rather than t-tests, because robot task durations are rarely normally distributed. RLDS provides trajectory-level statistics utilities for computing percentile-based metrics that are robust to outliers.
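Both tests are available in SciPy; the counts and completion times in this sketch are placeholder values.

```python
import numpy as np
from scipy.stats import chi2_contingency, mannwhitneyu

# Success/failure counts for two conditions (e.g. red vs. blue objects).
contingency = np.array([[42, 8],    # red:  42 successes, 8 failures
                        [31, 19]])  # blue: 31 successes, 19 failures
chi2, p_chi2, _, _ = chi2_contingency(contingency)

# Completion times in seconds for two policies; durations are rarely normally distributed.
times_a = [12.1, 14.8, 11.3, 38.0, 13.5]
times_b = [16.2, 18.4, 15.1, 44.2, 17.9]
u_stat, p_u = mannwhitneyu(times_a, times_b)

print(f"chi-squared p={p_chi2:.3f}, Mann-Whitney U p={p_u:.3f}")
```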
Failure mode frequency tables reveal data collection priorities. If 35% of failures are grasp-pose errors, 28% are collisions, 22% are timeouts, and 15% are placement errors, the next data sprint should allocate 35% of collection budget to grasp diversity. BridgeData V2 used failure-mode analysis from prior evaluations to guide collection of 13,000 trajectories focused on previously underrepresented grasp types, improving success rates from 73% to 87% on the same task distribution.
Connect Failure Analysis to Data Collection Recommendations
Every evaluation report must end with actionable data collection priorities ranked by failure-mode frequency and deployment impact. If reflective-object grasp failures account for 30% of total failures and reflective objects represent 40% of production volume, that failure mode has 0.30 × 0.40 = 12% deployment impact—higher priority than a 20% failure mode affecting only 10% of production volume.
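The impact score (failure share × production share) is easy to compute as a small ranking helper; the numbers below reuse the worked example, and the mode names are illustrative.

```python
# failure_share: fraction of total failures; production_share: fraction of deployment volume.
failure_modes = {
    "reflective_object_grasp": {"failure_share": 0.30, "production_share": 0.40},
    "clutter_collision":       {"failure_share": 0.20, "production_share": 0.10},
}

ranked = sorted(
    ((name, v["failure_share"] * v["production_share"]) for name, v in failure_modes.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, impact in ranked:
    print(f"{name}: {impact:.0%} deployment impact")
# reflective_object_grasp: 12% > clutter_collision: 2%
```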
Specify collection parameters for each priority: 'Collect 200 grasp trajectories on reflective cylindrical objects (diameter 3–8cm) under 400–800 lux lighting, with grasp poses uniformly sampled across 360° rotation and ±5cm position variation.' Truelabel's physical AI marketplace intake forms require these specifications because vague requests ('we need more reflective object data') yield datasets that do not address the actual distribution gap.
Sim-to-real gap analysis identifies when simulation training data is insufficient. If the policy achieves 95% success in RoboSuite simulation but 60% in real-world trials, compute per-failure-mode sim vs. real discrepancies. Sim-to-real transfer surveys show that contact-rich tasks (insertion, assembly) have 2–3× higher sim-to-real gaps than pick-and-place, requiring real-world data collection rather than domain randomization. Domain randomization improves transfer for vision but rarely closes the gap for contact dynamics—teams need real force-torque trajectories.
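One way to make the per-failure-mode comparison concrete is to difference the two frequency tables, as in the sketch below; the mode names and rates are placeholder values, not measured data.

```python
# Failure rates per mode, measured separately in simulation and on hardware (placeholders).
sim_failure_rate = {"grasp": 0.02, "insertion_jam": 0.01, "collision": 0.02}
real_failure_rate = {"grasp": 0.08, "insertion_jam": 0.22, "collision": 0.10}

gaps = {mode: real_failure_rate[mode] - sim_failure_rate[mode] for mode in sim_failure_rate}
for mode, gap in sorted(gaps.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{mode}: sim-to-real gap {gap:+.0%}")
# Contact-rich modes (e.g. insertion jams) typically dominate the gap.
```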
Benchmark Against Public Datasets and Leaderboards
Compare your policy's performance to published baselines on the same task distribution. Open X-Embodiment provides RT-X model checkpoints and evaluation protocols for 527 skills—if your policy achieves 78% on a task where RT-X achieves 82%, the 4-percentage-point gap may stem from training data volume (RT-X trained on 1M+ trajectories) rather than architecture.
ManipArena introduced 60 real-world manipulation tasks with standardized evaluation protocols and public leaderboards, enabling direct comparison across research groups. Policies evaluated on private task distributions cannot be compared—teams must either evaluate on public benchmarks or publish their task definitions and initial-condition sampling code for reproducibility.
Leaderboard submissions require full evaluation logs—video, robot state, trial metadata—to enable third-party verification. LongBench requires submitters to upload MCAP files with synchronized video and state for every trial, preventing cherry-picked results. Data provenance tracking ensures that evaluation datasets are not contaminated by training data—a policy trained on 10,000 trajectories from a task should not be evaluated on a subset of those same trajectories.
Iterate Evaluation Protocols as Policies Improve
As policies improve, evaluation protocols must increase in difficulty to maintain discriminative power. A protocol that yields 90% success for all candidate policies provides no signal for model selection. THE COLOSSEUM introduced progressively harder evaluation tiers (bronze, silver, gold) with success rates of 85%, 60%, and 35% for state-of-the-art policies, ensuring that the benchmark remains useful as the field advances.
Add adversarial test cases that probe known failure modes—if the policy struggles with transparent objects, create an evaluation set with 50% transparent objects rather than the 10% in the training distribution. RoboCat used adversarial object selection to identify distribution gaps, then collected targeted data to close those gaps, iterating evaluation and collection in a closed loop.
Longitudinal tracking of evaluation metrics reveals whether improvements are real or overfitting to the evaluation set. If success rates improve from 70% to 85% over three data collection sprints but failure modes remain unchanged (still 40% grasp errors, 30% collisions), the policy is memorizing evaluation-set object poses rather than learning generalizable skills. EPIC-KITCHENS maintains held-out test sets that are never released publicly, preventing overfitting to leaderboard benchmarks.
Document Evaluation Methodology for Reproducibility
Publish a methods appendix with every evaluation report: task definitions, success criteria, initial-condition sampling procedure, trial count per condition, randomization seed, video frame rate, robot state logging frequency, failure-mode taxonomy, statistical tests used, and confidence interval method. Without this documentation, other teams cannot reproduce your results or compare their policies to yours.
Model Cards for Model Reporting and Datasheets for Datasets provide templates for structured documentation, but robot policy evaluation requires additional fields: robot platform (make, model, firmware version), end-effector (gripper type, force limits), workspace dimensions, object set (IDs, masses, friction coefficients), and environmental conditions (lighting range, temperature, humidity). RLDS episode metadata schemas capture many of these fields but lack standardized vocabularies—teams use inconsistent units and coordinate frames.
Version control for evaluation protocols prevents drift—if you modify success criteria or add failure modes mid-project, results before and after the change are not comparable. Use semantic versioning (v1.0, v1.1, v2.0) and document changes in a changelog. LeRobot maintains versioned evaluation scripts in the repository, ensuring that results reported in papers can be reproduced with the exact protocol version used.
External references and source context
1. RT-1: Robotics Transformer for Real-World Control at Scale (arXiv). Cited claim: RT-1 achieved 97% success on 17 tasks in controlled settings.
2. THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation (arXiv). Cited claim: 50 trials per condition yields 80% power to detect a 20-percentage-point difference at p<0.05.
3. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100 (arXiv). Cited claim: EPIC-KITCHENS-100 achieved 0.91 Cohen's kappa on action-segment boundaries through iterative guideline refinement.
4. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World (arXiv). Cited for domain randomization techniques for sim-to-real transfer.
5. THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation (arXiv). Cited for statistical rigor requirements for robot policy evaluation.
FAQ
How many trials do I need to detect a 10-percentage-point improvement in success rate?
For 80% statistical power at p<0.05 significance, detecting a 10-percentage-point improvement from a 70% baseline requires 199 trials per condition (398 total for baseline vs. improved policy). Detecting a 20-percentage-point improvement requires 50 trials per condition. Use a power analysis calculator with the baseline success rate, target improvement, desired power (typically 0.8), and significance level (typically 0.05) to compute sample size before running trials—underpowered evaluations waste robot time and cannot distinguish real improvements from noise.
Should I evaluate in simulation or on real hardware?
Evaluate on real hardware for any task involving contact dynamics (insertion, assembly, deformable object manipulation) because simulation contact models have 2–3× higher error rates than vision-based tasks. Evaluate in simulation first for pick-and-place or navigation tasks to identify obvious failures before consuming real robot time, but always validate final policies on real hardware across the full operational design domain. Sim-to-real transfer gaps are task-specific—domain randomization closes the gap for some vision tasks but rarely for contact-rich manipulation.
What video frame rate and resolution do I need for failure analysis?
15–30 Hz frame rate and 720p resolution are sufficient for most manipulation tasks—higher rates (60+ Hz) are needed only for high-speed tasks like catching or striking. Synchronize video frames with robot state logs to within 10ms so you can correlate visual events (object slip, collision) with force-torque spikes. Store videos in MCAP or HDF5 containers with frame timestamps rather than separate MP4 files, because post-hoc synchronization is error-prone and loses sub-frame timing precision.
How do I handle partial successes in multi-step tasks?
Define success criteria for each substep independently and log a binary success vector (e.g., [1,1,0,0] for a 4-step task where steps 1 and 2 succeeded but steps 3 and 4 failed). Compute per-step success rates and full-sequence success rate separately—a policy with 90% per-step success has only 66% full-sequence success for a 4-step task (0.9^4). Partial-success logging reveals which substeps are bottlenecks and guides targeted data collection for those substeps.
What failure-mode taxonomy should I use?
Start with a coarse taxonomy (grasp failure, transport failure, placement failure, timeout, other) and refine based on observed failure patterns. After 50 trials, split 'grasp failure' into no-contact, slippage, and collision subcategories if those account for >10% of failures each. Use a hierarchical taxonomy (grasp failure → slippage → rotational slippage vs. translational slippage) only if you have >200 trials and need fine-grained data collection guidance. Overly detailed taxonomies with <5 examples per category are statistically useless.
How do I compare policies trained on different datasets?
Evaluate all policies on the same held-out test set with identical initial conditions, trial order, and success criteria. Report success rates with 95% confidence intervals and use chi-squared tests to determine if differences are statistically significant. If policies were trained on datasets with different object sets or task distributions, you cannot directly compare success rates—instead, report per-object or per-task success rates and analyze which distribution differences explain performance gaps. Cross-dataset comparisons require public benchmarks like Open X-Embodiment or ManipArena with standardized evaluation protocols.
Looking to evaluate robot policy performance?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.
Browse Physical AI Datasets