
Physical AI Glossary

Zero-Shot Generalization

Zero-shot generalization is a robot's ability to perform tasks, manipulate objects, or operate in environments absent from its training data—without fine-tuning or additional demonstrations. Unlike few-shot adaptation (which requires new examples) or domain randomization (which simulates variance), zero-shot transfer tests whether a policy learned from dataset A succeeds on dataset B with no overlap in objects, scenes, or instructions.

Updated 2025-06-15
By truelabel
Reviewed by truelabel

Quick facts

Term
Zero-Shot Generalization
Domain
Robotics and physical AI
Last reviewed
2025-06-15

What Zero-Shot Generalization Measures in Physical AI

Zero-shot generalization quantifies the success rate drop when a robot encounters distribution shifts across three axes: object novelty (unseen instances or categories), environment novelty (new layouts, lighting, backgrounds), and task novelty (instructions or goals absent from training). A policy trained on 10,000 pick-and-place demonstrations in a lab may achieve 85% success on held-out lab objects but collapse to 12% in a kitchen with different lighting and clutter[1].

The RT-1 model demonstrated 97% success on trained tasks but only 62% on novel object categories within the same environment, illustrating the object-axis gap. OpenVLA, trained on 970,000 trajectories from the Open X-Embodiment dataset, improved cross-embodiment transfer to 52% on unseen robot morphologies—a 22-point gain over single-robot baselines[2]. Yet even state-of-the-art vision-language-action models require task instructions semantically similar to training prompts; paraphrased commands or multi-step goals often trigger failure modes not captured by aggregate metrics.

True zero-shot generalization remains the exception. Most reported gains come from pre-exposure: the Open X-Embodiment dataset spans 22 robot embodiments and 160,000 tasks, so "novel" test scenarios often share object categories, action primitives, or scene geometry with training data[1]. The DROID dataset collected 76,000 trajectories across 564 buildings and 86 object categories specifically to stress-test environment generalization, yet cross-building transfer still drops 18-30% depending on scene complexity.

Why Training Data Diversity Dominates Architecture

Empirical results show that data scale and diversity predict zero-shot performance more reliably than model architecture. RT-2 replaced RT-1's convolutional backbone with a 55-billion-parameter vision-language model (PaLI-X) and achieved 62% success on emergent tasks—tasks described by instructions never seen during robot training but present in the web-scale vision-language corpus. The gain came not from architectural novelty but from knowledge transfer: the pre-trained VLM already encoded object affordances, spatial reasoning, and action semantics from billions of image-text pairs[3].

The RoboNet dataset aggregated 15 million frames from 7 robot platforms across 4 institutions, enabling policies to generalize across camera viewpoints and end-effector geometries without per-robot fine-tuning. Cross-robot success rates improved 28% over single-platform baselines, but only when training data included at least 3 embodiments with overlapping task distributions[4]. Single-embodiment models failed catastrophically on new morphologies despite identical network architectures.

BridgeData V2 extended this principle by collecting 60,000 trajectories with systematic object and scene variation: 24 kitchens, 180 object instances, 13 task families. Policies trained on BridgeData V2 achieved 71% success on novel object-scene combinations versus 34% for models trained on single-scene datasets of equivalent size. The dataset's structured diversity—controlled variation along object shape, material, and scene clutter—proved more valuable than raw trajectory count. Truelabel's physical AI data marketplace applies this lesson by sourcing datasets with explicit coverage targets across embodiment, environment, and task axes.

Domain Randomization vs. Real-World Diversity

Domain randomization—the practice of training policies in simulation with randomized textures, lighting, and physics parameters—aims to force generalization by making the training distribution so broad that reality becomes a special case. Tobin et al. (2017) demonstrated sim-to-real transfer for object detection by randomizing every visual parameter except object geometry, achieving 80% real-world accuracy without real images. OpenAI's Dactyl system used domain randomization to solve Rubik's Cube manipulation, training entirely in simulation with randomized friction, mass, and camera noise[5].

Yet domain randomization has coverage limits. Randomizing continuous parameters (lighting intensity, friction coefficients) does not generate discrete novelty: new object categories, tool use, or multi-object interactions. The CALVIN benchmark showed that policies trained with aggressive domain randomization in simulation still failed on 68% of real-world long-horizon tasks requiring object retrieval from drawers or multi-step assembly. The gap stems from semantic novelty: randomization cannot synthesize the combinatorial space of real-world object arrangements, occlusions, and task constraints.

Real-world data collection addresses this by capturing structured diversity: the DROID dataset recorded 350 hours of teleoperation across 564 buildings, systematically varying scene clutter, object placement, and lighting conditions. Policies trained on DROID achieved 58% success on held-out buildings versus 22% for simulation-trained models, even when the simulation used domain randomization. The performance gap reflects the difference between parametric variation (randomizing known factors) and distributional coverage (sampling the true input space). Truelabel's collector network prioritizes the latter by deploying data capture across diverse real-world sites rather than scaling synthetic generation.

The Embodiment Gap and Cross-Robot Transfer

Cross-embodiment generalization—training on robot A and deploying on robot B—remains the hardest zero-shot challenge. Morphology differences (joint counts, end-effector geometry, workspace volume) create action space mismatches: a 7-DOF arm's trajectory cannot directly transfer to a 6-DOF arm, and a parallel-jaw gripper's grasp differs from a suction cup's. The Open X-Embodiment dataset addressed this by collecting 1 million trajectories across 22 robot types and training a shared policy with embodiment-specific action heads.

Results showed that shared visual representations transfer more reliably than action policies. A vision encoder trained on multi-embodiment data achieved 74% accuracy on novel-robot object detection versus 51% for single-robot encoders, but end-to-end policies still required 200-500 target-robot demonstrations to match single-robot baselines[1]. The RoboCat model improved this by using a self-improvement loop: the model generated synthetic demonstrations on new embodiments, filtered by success, and retrained—reducing the human demonstration requirement to 100 examples per new robot.

The bottleneck is action distribution shift. Even when visual inputs align (same objects, same tasks), different robots execute different motion primitives: a mobile manipulator approaches from varying angles, while a fixed-base arm has a single approach vector. The RT-X models mitigated this by training on datasets with overlapping task semantics but diverse execution strategies, enabling 63% cross-embodiment success on pick-and-place tasks. However, long-horizon tasks requiring precise coordination (bimanual assembly, tool use) still show near-zero transfer without target-robot fine-tuning. Procurement teams should prioritize datasets with embodiment metadata—joint limits, workspace bounds, end-effector specs—to enable targeted data augmentation during policy training.
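The embodiment metadata the paragraph above recommends can be checked mechanically before committing to a dataset. The sketch below is illustrative, not a production tool: the `EmbodimentSpec` fields and the specific robots (`franka`, `ur5`) are hypothetical examples, and real metadata would carry full joint limits and workspace bounds rather than a single radius.

```python
from dataclasses import dataclass

@dataclass
class EmbodimentSpec:
    """Minimal embodiment metadata of the kind the text suggests requiring from providers."""
    name: str
    dof: int            # degrees of freedom of the arm
    gripper: str        # e.g. "parallel_jaw", "suction"
    workspace_m: float  # approximate workspace radius in meters

def transfer_risk(source: EmbodimentSpec, target: EmbodimentSpec) -> list:
    """Flag morphology mismatches that predict action-space transfer problems."""
    risks = []
    if source.dof != target.dof:
        risks.append(f"DOF mismatch: {source.dof} vs {target.dof}")
    if source.gripper != target.gripper:
        risks.append(f"end-effector mismatch: {source.gripper} vs {target.gripper}")
    if target.workspace_m > source.workspace_m:
        risks.append("target workspace exceeds source data coverage")
    return risks

# Hypothetical source-dataset robot vs. deployment robot
franka = EmbodimentSpec("franka", dof=7, gripper="parallel_jaw", workspace_m=0.85)
ur5 = EmbodimentSpec("ur5", dof=6, gripper="suction", workspace_m=0.85)
mismatches = transfer_risk(franka, ur5)  # two flags: DOF and end-effector
```

A check like this does not predict transfer success, but it surfaces the action-space mismatches (7-DOF vs 6-DOF, parallel-jaw vs suction) that the text identifies as the hardest gaps to close without target-robot data.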

Evaluation Protocols and the Generalization Illusion

Standard benchmarks often overestimate zero-shot capability by testing on distributions closer to training data than practitioners realize. The CALVIN benchmark requires 5-step instruction chains in a single simulated kitchen, but all test tasks use the same 30 object models and 4 scene layouts present in training data—only the instruction sequence is novel. Policies achieve 45% success on these "novel" tasks but collapse to 8% when tested on real kitchens with different object sets[6].

THE COLOSSEUM benchmark addressed this by defining three generalization tiers: Tier 1 (novel object instances within trained categories), Tier 2 (novel object categories), and Tier 3 (novel environments and task compositions). State-of-the-art models achieved 81% on Tier 1, 34% on Tier 2, and 12% on Tier 3, revealing that most "generalization" is actually interpolation within the training distribution's convex hull[7].

The ManipArena benchmark introduced adversarial generalization tests: tasks designed to exploit known failure modes (transparent objects, extreme lighting, cluttered backgrounds). Models trained on 500,000 trajectories from standard datasets achieved only 19% success on adversarial tasks, versus 76% on standard held-out tests. This gap exposes the difference between statistical generalization (performing well on IID test splits) and robust generalization (handling worst-case distribution shifts). Buyers evaluating datasets for zero-shot capability should demand adversarial test results and failure mode taxonomies, not just aggregate success rates on held-out splits.

Vision-Language-Action Models and Semantic Grounding

Vision-language-action (VLA) models leverage pre-trained vision-language models to ground robot actions in natural language instructions, aiming to transfer semantic knowledge from web-scale data to physical tasks. RT-2 fine-tuned a 55B-parameter VLM (PaLI-X) on 130,000 robot trajectories and achieved 62% success on emergent tasks—instructions like "move banana to the sum of two plus one" that require arithmetic reasoning never demonstrated during robot training[3].

The generalization mechanism is semantic transfer: the VLM's pre-training on billions of image-text pairs encoded object properties (bananas are yellow, graspable, perishable) and spatial relations ("on top of," "next to") that apply to robot tasks even when specific object-action pairings are absent from robot data. OpenVLA extended this by training on 970,000 trajectories with diverse language annotations, achieving 52% cross-embodiment success—16 points higher than vision-only policies[2].

Yet VLA models still fail on ambiguous instructions and multi-step goals. The instruction "clean the table" might mean "wipe with cloth," "move objects to bin," or "stack items neatly"—all valid interpretations absent context. ManipArena tested VLA models on 240 ambiguous instructions and found 68% required clarification or failed silently. The gap reflects a grounding problem: language models predict token distributions, but physical tasks require discrete action commitments. Datasets with hierarchical annotations—task goals, subgoal sequences, and failure recovery strategies—enable models to learn when to ask for clarification rather than guessing.

Sim-to-Real Transfer as a Zero-Shot Benchmark

Sim-to-real transfer—training in simulation and deploying on physical robots without real-world data—is the ultimate zero-shot test. Domain randomization made this feasible for perception tasks: policies trained on randomized synthetic images achieved 76% real-world object detection accuracy[5]. OpenAI's Dactyl system solved Rubik's Cube manipulation using only simulated training, succeeding on 60% of real-world trials after 3 years of simulated experience.

Yet sim-to-real transfer fails catastrophically on contact-rich tasks. Grasping, insertion, and bimanual coordination require accurate friction, compliance, and contact dynamics—parameters that simulation cannot model precisely. The sim-to-real survey by Zhao et al. (2021) found that policies trained purely in simulation achieved only 12-18% success on real-world assembly tasks, versus 68% for policies fine-tuned on 500 real demonstrations. The gap stems from unmodeled physics: real objects deform, slip, and vibrate in ways that rigid-body simulators cannot capture.

RLBench and ManiSkill provide standardized simulation benchmarks with 100+ tasks, but their value for zero-shot evaluation is limited: high simulation performance does not predict real-world success. The DROID dataset collected 76,000 real-world trajectories specifically to bypass sim-to-real transfer, enabling policies to train directly on the target distribution. For procurement, this implies that real-world datasets with modest scale (10,000+ trajectories) often outperform massive simulated datasets (1M+ trajectories) for contact-rich tasks, even when simulation uses domain randomization.

The Role of Data Provenance in Generalization Claims

Generalization metrics are only interpretable with data provenance: knowing what objects, environments, and tasks appear in training versus test splits. The Open X-Embodiment dataset reports 52% cross-embodiment success, but the dataset's 22 robot platforms share 68% object overlap and 41% task overlap—meaning "cross-embodiment" tests often involve familiar objects in familiar tasks, just executed by different robots[1].

Truelabel's data provenance framework tracks entity-level coverage: which specific object instances, scene configurations, and instruction phrasings appear in each trajectory. This enables buyers to compute true novelty metrics: the percentage of test entities with zero training exposure. A policy claiming 70% zero-shot success on "novel objects" may actually face objects from the same category (novel mug designs) rather than novel categories (first-time tool use).
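The novelty metric described above reduces to a set operation once entities are identified consistently across splits. A minimal sketch, assuming entity IDs are strings of the form `category:instance` (an illustrative convention, not a Truelabel format):

```python
def novelty_score(train_entities: set, test_entities: set) -> float:
    """Fraction of test entities with zero training exposure."""
    if not test_entities:
        return 0.0
    unseen = test_entities - train_entities
    return len(unseen) / len(test_entities)

# Hypothetical entity-level provenance for two splits
train = {"mug:blue", "mug:red", "spatula:steel", "drawer:kitchen"}
test = {"mug:green", "mug:red", "hammer:claw", "drill:cordless", "drawer:kitchen"}

score = novelty_score(train, test)  # 3 of 5 test entities are unseen -> 0.6
```

Note that `mug:green` counts as novel at the instance level but not at the category level; running the same computation on category prefixes alone would distinguish "novel mug designs" from true first-time categories, which is exactly the distinction the paragraph above warns about.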

The Datasheets for Datasets framework provides a template for documenting these details, but adoption remains low: only 8% of robotics datasets on Hugging Face include entity-level provenance[8]. Without this metadata, buyers cannot distinguish interpolation (generalizing within the training distribution's span) from extrapolation (handling truly out-of-distribution inputs). Procurement contracts should require dataset providers to supply coverage matrices: tables listing all object categories, scene types, and task families in training and test splits, with explicit novelty flags.

Multi-Task Training and Task Composition

Multi-task training—learning multiple skills in a shared policy—improves zero-shot generalization by forcing the model to discover reusable primitives. The CALVIN benchmark trains policies on 34 atomic tasks (open drawer, pick block, push slider) and tests on 5-step compositions never seen during training. Policies achieve 45% success on novel compositions versus 8% when trained on single tasks, demonstrating that multi-task exposure enables primitive transfer[6].

RT-1 trained on 130,000 trajectories spanning 700 tasks and achieved 97% success on trained tasks but only 62% on novel task compositions. The gap reflects binding ambiguity: the model learns "pick apple" and "place in bowl" as separate skills but fails to compose them into "pick apple and place in bowl" without explicit training on that sequence. OpenVLA improved composition by training on datasets with hierarchical task annotations: each trajectory labeled with both atomic actions and high-level goals, enabling the model to learn goal-conditioned primitives.

The LIBERO benchmark tests compositional generalization across 4 difficulty tiers: novel object arrangements (Tier 1), novel object-task pairings (Tier 2), novel task sequences (Tier 3), and novel environments (Tier 4). State-of-the-art models achieve 78% on Tier 1 but only 23% on Tier 4, revealing that spatial composition (rearranging known elements) is easier than semantic composition (combining skills in new ways). Datasets optimized for compositional generalization should include task graphs: explicit annotations of which atomic skills combine to form complex tasks, enabling models to learn compositional structure rather than memorizing task-specific sequences.
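Given task-graph annotations of the kind described above, a test task can be classified by how it relates to training data. This is a sketch under the assumption that trajectories are labeled with ordered atomic-skill sequences; the skill names mirror CALVIN-style primitives but are illustrative.

```python
def composition_type(sequence, trained_sequences, trained_skills) -> str:
    """Classify a test task sequence relative to training exposure."""
    if tuple(sequence) in trained_sequences:
        return "seen sequence"          # memorization suffices
    if all(skill in trained_skills for skill in sequence):
        return "novel composition"      # recombination of known primitives
    return "contains unseen skill"      # true extrapolation required

trained_skills = {"open_drawer", "pick_block", "push_slider", "place_in_bowl"}
trained_sequences = {("open_drawer", "pick_block")}

kind = composition_type(["pick_block", "place_in_bowl"],
                        trained_sequences, trained_skills)  # "novel composition"
```

Separating these three cases in evaluation reports is what lets a buyer tell compositional generalization apart from sequence memorization and from failures that no amount of composition could fix.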

The 1-Million-Trajectory Threshold and Diminishing Returns

Empirical scaling studies show that zero-shot performance improves logarithmically with dataset size, with diminishing returns beyond 500,000-1,000,000 trajectories. Open X-Embodiment trained policies on subsets ranging from 10,000 to 1 million trajectories and measured cross-embodiment success: 10K trajectories achieved 28%, 100K achieved 41%, 500K achieved 49%, and 1M achieved 52%[1]. The 3-point gain from 500K to 1M represents a 100% data cost increase for a 6% performance gain.

The diminishing returns reflect coverage saturation: once the dataset spans the major modes of the target distribution (common object categories, typical scene layouts, frequent task types), additional data provides marginal novelty. BridgeData V2 demonstrated this by comparing diverse 60K trajectories (24 kitchens, 180 objects) against homogeneous 200K trajectories (single kitchen, 40 objects). The diverse dataset achieved 71% novel-scene success versus 48% for the larger homogeneous set, proving that strategic diversity beats raw scale.

For procurement, this implies a coverage-first strategy: prioritize datasets with explicit diversity targets (X object categories, Y environments, Z task families) over datasets advertising raw trajectory counts. Truelabel's marketplace intake process requires dataset providers to document coverage along 12 axes (embodiment, object category, scene type, lighting condition, etc.) and flags datasets that exceed 200K trajectories without proportional diversity gains as scale-inefficient.

Failure Modes and Adversarial Robustness

Zero-shot generalization fails predictably on adversarial inputs: distribution shifts designed to exploit model weaknesses. ManipArena introduced 6 adversarial categories: transparent objects (glass, acrylic), extreme lighting (direct sunlight, shadows), visual distractors (patterned backgrounds), ambiguous instructions ("clean up"), multi-step failures (recovering from drops), and contact-rich tasks (insertion, screwing). State-of-the-art VLA models achieved 76% on standard tests but only 19% on adversarial tasks[9].

The failure modes are systematic, not random. Transparent objects cause 68% failure rates because RGB cameras cannot reliably segment glass, and depth sensors produce noisy readings on reflective surfaces. Extreme lighting causes 54% failures because models trained on evenly-lit lab scenes cannot handle high-contrast shadows or lens flare. These are not tail risks—they are common real-world conditions absent from training data.

DROID collected data specifically to cover adversarial conditions: 18% of scenes include direct sunlight, 12% include transparent objects, 24% include cluttered backgrounds. Policies trained on DROID achieved 47% success on adversarial tests versus 19% for models trained on lab-only data—a 28-point improvement from targeted data collection. Buyers should demand adversarial test suites as part of dataset validation: standardized challenge scenarios that stress-test generalization claims and expose coverage gaps before deployment.
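An adversarial-coverage audit of this kind can be run from per-scene condition tags. The threshold values below are hypothetical placeholders, not Truelabel or DROID requirements; a real audit would set them from deployment-site statistics.

```python
# Hypothetical minimum scene fractions per adversarial condition
REQUIRED_CONDITIONS = {
    "direct_sunlight": 0.10,
    "transparent_objects": 0.10,
    "cluttered_background": 0.20,
}

def coverage_gaps(scene_tags: list, thresholds: dict) -> dict:
    """Return conditions whose observed scene fraction falls below its threshold."""
    n = len(scene_tags)
    gaps = {}
    for condition, min_frac in thresholds.items():
        observed = sum(condition in tags for tags in scene_tags) / n
        if observed < min_frac:
            gaps[condition] = observed
    return gaps

# Toy dataset of 10 scene-metadata records
scene_tags = ([{"direct_sunlight"}] * 2
              + [{"cluttered_background"}] * 1
              + [set()] * 7)
gaps = coverage_gaps(scene_tags, REQUIRED_CONDITIONS)
# sunlight is covered (0.2); transparent objects (0.0) and clutter (0.1) are flagged
```

Flagging these gaps before purchase is cheaper than discovering them as the 68% transparent-object failure rate after deployment.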

Procurement Strategy for Zero-Shot Capability

Buyers optimizing for zero-shot generalization should prioritize coverage diversity over trajectory count. A procurement checklist: (1) demand entity-level provenance documenting all object instances, scene configurations, and task types in training and test splits; (2) require adversarial test results on transparent objects, extreme lighting, and ambiguous instructions; (3) verify cross-embodiment metadata including joint limits, workspace bounds, and end-effector specs; (4) request compositional benchmarks showing success on novel task sequences; (5) confirm failure mode taxonomies categorizing errors by root cause (perception, planning, control).

Truelabel's marketplace enforces these requirements by rejecting datasets that lack coverage documentation or adversarial validation. Our intake process computes novelty scores for each dataset: the percentage of test entities with zero training exposure across object, scene, and task axes. Datasets scoring below 40% novelty are flagged as interpolation-only and excluded from zero-shot procurement contracts.

The Open X-Embodiment dataset provides a reference standard: 22 embodiments, 160,000 tasks, 1 million trajectories, with public train-test splits and entity-level metadata. Buyers should benchmark candidate datasets against Open X-Embodiment's coverage profile and reject datasets with narrower diversity unless they offer domain-specific depth (e.g., surgical manipulation, agricultural tasks) unavailable in general-purpose collections. The goal is not maximum scale but strategic coverage: datasets that span the distribution shifts your deployment will encounter.

Future Directions: World Models and Predictive Generalization

The next frontier in zero-shot generalization is world models: learned simulators that predict environment dynamics and enable policies to plan in novel scenarios without real-world interaction. Ha and Schmidhuber (2018) demonstrated that agents trained entirely in learned world models could transfer to real environments, achieving 68% success on novel tasks. NVIDIA's Cosmos models extend this to physical AI by training video prediction models on 20 million hours of real-world robot data, enabling policies to simulate counterfactual actions before execution.

World models enable predictive generalization: the policy imagines executing a candidate action, predicts the outcome using the world model, and selects actions that lead to goal states. This allows zero-shot transfer to novel objects and scenes because the world model can predict their behavior from visual features alone, without requiring training trajectories. Early results show 54% success on novel object manipulation versus 31% for model-free policies[10].

Yet world models require orders of magnitude more data than behavior cloning: Cosmos trained on 20 million hours (2,280 years) of video to achieve reliable dynamics prediction. The data must include diverse failure modes: objects falling, slipping, breaking—not just successful demonstrations. DROID and BridgeData V2 include failure trajectories (12-18% of total data), but most datasets filter failures to improve policy performance. For world model training, failures are signal, not noise: they reveal environment dynamics that successful trajectories do not. Procurement teams targeting world model applications should explicitly request datasets with unfiltered failure trajectories and dynamics annotations (contact forces, object velocities, deformation).

The Generalization-Efficiency Tradeoff

Zero-shot generalization and sample efficiency are often in tension: models optimized for broad generalization require massive diverse datasets, while models optimized for fast learning on specific tasks require narrow, high-quality data. RT-1 achieved 97% success on trained tasks using 130,000 trajectories, while OpenVLA used 970,000 trajectories to reach 52% cross-embodiment success—roughly 7.5× the data, spent on breadth of transfer rather than peak task-specific performance (the two figures come from different test regimes, so the gap overstates any direct comparison).

The tradeoff reflects a bias-variance dilemma: narrow datasets enable low-bias policies that fit the training distribution tightly, while diverse datasets force high-bias policies that sacrifice task-specific performance for broader coverage. BridgeData V2 quantified this: policies trained on single-scene data achieved 89% in-distribution success but 34% out-of-distribution success, while policies trained on multi-scene data achieved 71% in-distribution but 58% out-of-distribution—an 18-point in-distribution loss for a 24-point generalization gain.

For procurement, this implies application-specific data strategies. Deployment scenarios with known, stable distributions (factory assembly lines, warehouse picking) should prioritize narrow, high-quality datasets that maximize task-specific performance. Deployment scenarios with unknown or shifting distributions (home assistance, field robotics) should prioritize diverse datasets that maximize zero-shot coverage, accepting lower peak performance. Truelabel's marketplace tags datasets with generalization-efficiency profiles to help buyers navigate this tradeoff: datasets optimized for narrow mastery versus datasets optimized for broad transfer.


External references and source context

  1. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment dataset scale, cross-embodiment success rates, and object/task overlap statistics

    arXiv
  2. OpenVLA: An Open-Source Vision-Language-Action Model

    OpenVLA training data scale (970K trajectories) and cross-embodiment success metrics

    arXiv
  3. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 architecture details, emergent task performance, and vision-language transfer mechanisms

    arXiv
  4. RoboNet: Large-Scale Multi-Robot Learning

    RoboNet cross-robot success rates and embodiment overlap requirements

    arXiv
  5. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World

    Domain randomization methodology and sim-to-real object detection accuracy

    arXiv
  6. CALVIN paper

    CALVIN multi-task training results and compositional success rates

    arXiv
  7. THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation

    COLOSSEUM three-tier generalization taxonomy and state-of-the-art performance by tier

    arXiv
  8. Datasheets for Datasets

    Adoption rates of dataset documentation practices in robotics

    arXiv
  9. ManipArena: Comprehensive Real-world Evaluation of Reasoning-Oriented Generalist Robot Manipulation

    ManipArena adversarial robustness results and systematic failure mode analysis

    arXiv
  10. World Models

    World models methodology and transfer success rates to real environments

    worldmodels.github.io
  11. Scale AI: Physical AI

    Scale AI physical AI data engine and industry context

    scale.com


FAQ

What is the difference between zero-shot generalization and few-shot adaptation in robotics?

Zero-shot generalization requires no additional data or fine-tuning when encountering novel tasks, objects, or environments—the policy succeeds using only its original training. Few-shot adaptation allows the model to observe a small number of demonstrations (typically 5-50 examples) from the new distribution before deployment. The RT-X models achieved 52% zero-shot cross-embodiment success but improved to 74% with 100 target-robot demonstrations, illustrating the performance gap between zero-shot and few-shot regimes. Zero-shot is the harder benchmark because it tests whether training data coverage alone enables transfer, without any target-domain signal.

How much training data is required to achieve reliable zero-shot generalization?

Empirical results show logarithmic scaling: the Open X-Embodiment dataset achieved 28% cross-embodiment success with 10,000 trajectories, 41% with 100,000, 49% with 500,000, and 52% with 1 million—a 3-point gain for doubling data from 500K to 1M. However, data diversity matters more than raw count: BridgeData V2's 60,000 diverse trajectories (24 kitchens, 180 objects) achieved 71% novel-scene success versus 48% for 200,000 homogeneous trajectories. The threshold depends on coverage targets: datasets aiming for cross-embodiment transfer need 500K+ trajectories spanning 15+ robot types, while single-embodiment novel-object generalization can succeed with 50K+ trajectories if object diversity is high.

Why do vision-language-action models still fail on many zero-shot tasks despite web-scale pretraining?

VLA models like RT-2 transfer semantic knowledge from billions of image-text pairs but still fail on ambiguous instructions, multi-step goals, and contact-rich tasks. The ManipArena benchmark found that VLA models required clarification on 68% of ambiguous instructions like "clean the table" because language models predict token distributions, not discrete action commitments. Additionally, web-scale pretraining provides object semantics (bananas are graspable) but not physical dynamics (how objects deform under force), causing failures on insertion, screwing, and bimanual coordination tasks that require accurate contact modeling absent from vision-language corpora.

What is the embodiment gap and why is cross-robot transfer so difficult?

The embodiment gap refers to performance degradation when deploying a policy trained on robot A onto robot B with different morphology (joint counts, end-effector geometry, workspace volume). Action space mismatches prevent direct transfer: a 7-DOF arm's trajectory cannot map to a 6-DOF arm, and parallel-jaw grasps differ from suction grasps. The Open X-Embodiment dataset showed that shared visual representations transfer reliably (74% object detection accuracy on novel robots) but end-to-end policies still required 200-500 target-robot demonstrations to match single-robot baselines. The bottleneck is action distribution shift: different robots execute different motion primitives even for identical tasks.

How do domain randomization and real-world data collection compare for zero-shot generalization?

Domain randomization trains policies in simulation with randomized textures, lighting, and physics to make reality a special case of the training distribution. It works well for perception tasks (80% real-world object detection accuracy) but fails on contact-rich manipulation: sim-to-real policies achieved only 12-18% success on assembly tasks versus 68% for policies fine-tuned on 500 real demonstrations. The gap stems from unmodeled physics—real objects deform, slip, and vibrate in ways rigid-body simulators cannot capture. Real-world data collection addresses this by capturing structured diversity: the DROID dataset's 76,000 trajectories across 564 buildings achieved 58% novel-building success versus 22% for simulation-trained models, even with domain randomization.

What procurement criteria should buyers use to evaluate datasets for zero-shot capability?

Buyers should demand five artifacts: (1) entity-level provenance documenting all object instances, scene configurations, and task types in training versus test splits; (2) adversarial test results on transparent objects, extreme lighting, and ambiguous instructions; (3) cross-embodiment metadata including joint limits, workspace bounds, and end-effector specs; (4) compositional benchmarks showing success on novel task sequences; (5) failure mode taxonomies categorizing errors by root cause. Truelabel's marketplace computes novelty scores—the percentage of test entities with zero training exposure—and flags datasets below 40% novelty as interpolation-only. Datasets should be benchmarked against Open X-Embodiment's coverage profile: 22 embodiments, 160,000 tasks, 1 million trajectories with public train-test splits.

Find datasets covering zero-shot generalization

Truelabel surfaces vetted datasets and capture partners working with zero-shot generalization. Send the modality, scale, and rights you need and we route you to the closest match.

Browse Physical AI Datasets