Physical AI Glossary
Zero-Shot Manipulation
Zero-shot manipulation is a robot's ability to grasp, move, or interact with objects it has never encountered during training. Unlike task-specific controllers trained on fixed object sets, zero-shot policies generalize from diverse training data to novel instances by learning transferable representations of shape, affordances, and physical dynamics rather than memorizing object identities.
Quick facts
- Term: Zero-Shot Manipulation
- Domain: Robotics and physical AI
- Last reviewed: 2025-06-15
What Zero-Shot Manipulation Solves in Real-World Robotics
Industrial robots excel at repetitive tasks with known objects but fail when confronted with novel items. A warehouse robot trained on 500 SKUs cannot handle the 501st without retraining. A kitchen assistant that learned to grasp mugs cannot transfer that skill to wine glasses without explicit examples. Zero-shot manipulation addresses this brittleness by training policies on data diverse enough that they learn general manipulation principles rather than object-specific heuristics.
The RT-2 vision-language-action model demonstrated zero-shot transfer by training on 580,000 robot trajectories plus web-scale vision-language data, achieving 62% success on unseen objects versus 32% for RT-1[1]. Open X-Embodiment aggregated 1 million trajectories across 22 robot embodiments and 160 tasks, showing that cross-embodiment pretraining improves zero-shot performance by 50% over single-robot datasets[2]. The core insight: diversity in training objects, scenes, and embodiments teaches policies to recognize affordances (graspable edges, pushable surfaces, articulated joints) that generalize across instances.
Real-world deployment demands zero-shot capability because exhaustive data collection is economically infeasible. A home robot will encounter thousands of unique household items; a hospital robot will face patient-specific medical devices; a retail robot will see seasonal product variations. DROID's 76,000 trajectories across 564 skills and 86 locations provide the scene diversity necessary for zero-shot transfer, but even this scale covers only a fraction of real-world variability[3].
Vision-Language-Action Models as Zero-Shot Engines
Vision-language-action (VLA) models achieve zero-shot manipulation by grounding language instructions in visual observations and robot actions. RT-2 fine-tunes a pretrained vision-language model (PaLI-X) on robot trajectory data, inheriting web-scale semantic knowledge that enables reasoning about object categories, spatial relationships, and task constraints never seen in robot demonstrations. When instructed to "pick up the stapler," RT-2 leverages its pretrained understanding of stapler appearance and function to generalize to novel stapler designs.
OpenVLA extends this approach with a 7B-parameter model trained on 970,000 trajectories from Open X-Embodiment, achieving 16.5% absolute improvement over RT-1 on unseen objects[4]. The architecture combines a SigLIP vision encoder with a Llama 2 language backbone, enabling compositional generalization: "pick up the red mug" transfers to "pick up the blue bowl" by composing color and object concepts. Language conditioning provides a structured prior over manipulation strategies that pure vision-based policies lack.
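To make the pipeline concrete, the sketch below maps an image and an instruction to discretized action tokens in the way VLA policies generally do; the dimensions, the hashing-based text embedding, and the random projection standing in for a vision encoder are illustrative assumptions, not the RT-2 or OpenVLA implementation.

```python
import numpy as np

# Toy zero-shot VLA sketch: image + instruction -> discretized action tokens.
# All dimensions, vocabularies, and "encoder" weights are illustrative
# placeholders, not the RT-2 or OpenVLA architecture.

RNG = np.random.default_rng(0)
EMBED_DIM = 64
N_ACTION_BINS = 256            # each action dimension discretized into 256 bins
ACTION_DIMS = 7                # e.g. 6-DoF end-effector delta + gripper

vision_proj = RNG.normal(size=(3 * 32 * 32, EMBED_DIM)) * 0.01   # stand-in for a vision encoder
action_head = RNG.normal(size=(EMBED_DIM, ACTION_DIMS * N_ACTION_BINS)) * 0.01

def embed_text(instruction: str) -> np.ndarray:
    """Stand-in for a language-model embedding (hashing trick, not Llama 2)."""
    vec = np.zeros(EMBED_DIM)
    for tok in instruction.lower().split():
        vec[hash(tok) % EMBED_DIM] += 1.0
    return vec / max(len(instruction.split()), 1)

def predict_action(image: np.ndarray, instruction: str) -> np.ndarray:
    """Fuse vision and language features, then decode discretized action tokens."""
    img_feat = image.reshape(-1) @ vision_proj
    fused = img_feat + embed_text(instruction)            # naive fusion, for illustration only
    logits = (fused @ action_head).reshape(ACTION_DIMS, N_ACTION_BINS)
    bins = logits.argmax(axis=-1)                         # one action token per dimension
    return bins / (N_ACTION_BINS - 1) * 2.0 - 1.0         # de-tokenize to [-1, 1] deltas

image = RNG.random((3, 32, 32))
print(predict_action(image, "pick up the red mug"))       # 7-dim continuous action
```

The key structural point the sketch preserves is that the action head sits on top of a shared vision-language representation, so a novel instruction or object only has to be recognized, not re-trained.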
VLA models require massive pretraining compute but amortize that cost across all downstream tasks. Scale AI's Physical AI platform provides the data infrastructure to train VLAs at scale, offering annotation services for trajectory labeling, language grounding, and success verification. The economic model shifts from per-task data collection to centralized pretraining followed by zero-shot deployment, reducing marginal cost per new task to near zero.
Dataset Diversity Requirements for Zero-Shot Transfer
Zero-shot manipulation performance scales with training data diversity across four dimensions: object variety, scene complexity, embodiment heterogeneity, and task distribution. Open X-Embodiment demonstrated that aggregating datasets from 22 robot platforms (WidowX, Franka, UR5, mobile manipulators) improves zero-shot success rates by 50% compared to single-embodiment training, because cross-embodiment data forces policies to learn embodiment-invariant manipulation strategies[2].
Object diversity matters more than dataset size for zero-shot transfer. BridgeData V2 collected 60,000 trajectories across 2,000+ object instances in 13 kitchens, achieving 81% zero-shot success on novel objects versus 34% for policies trained on 100 objects repeated 600 times each[5]. The lesson: roughly 30 demonstrations per object across 2,000 objects beats 600 demonstrations on each of 100 objects for generalization. Scene diversity follows the same logic: policies trained in visually varied environments learn to ignore irrelevant background features and focus on manipulation-relevant geometry.
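As a back-of-the-envelope illustration of that allocation argument, the snippet below compares two ways of spending the same trajectory budget; the counts echo the BridgeData V2 comparison and the coverage ratio is a hypothetical heuristic, not a published metric.

```python
# Same 60,000-trajectory budget, allocated two ways. The "coverage ratio" is a
# hypothetical heuristic for zero-shot potential, not a published formula.
budget = 60_000

wide = {"objects": 2_000, "demos_per_object": budget // 2_000}   # 30 demos each
narrow = {"objects": 100, "demos_per_object": budget // 100}     # 600 demos each

for name, d in [("wide", wide), ("narrow", narrow)]:
    total = d["objects"] * d["demos_per_object"]
    print(f"{name}: {d['objects']} objects x {d['demos_per_object']} demos = {total} trajectories")

# Simple proxy: unique objects seen for the same collection budget.
print("coverage ratio (wide / narrow):", wide["objects"] / narrow["objects"])  # 20x more unique objects
```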
Task diversity enables compositional zero-shot transfer. CALVIN's benchmark evaluates policies on chains of 5 sequential tasks, testing whether a policy can compose learned primitives (pick, place, push, slide) into novel sequences[6]. Truelabel's physical AI marketplace aggregates datasets across these diversity dimensions, providing buyers with coverage maps that quantify object, scene, and task distributions to assess zero-shot transfer potential before procurement.
Affordance Learning and Geometric Generalization
Zero-shot manipulation relies on learning affordances—action-relevant object properties like graspable edges, pushable surfaces, and articulated joints—that transfer across object instances. A policy that learns "grasp cylindrical handles" generalizes from mugs to hammers to spray bottles without per-object training. Affordance representations emerge from training on diverse objects with shared geometric properties, enabling category-level generalization.
RoboCat demonstrated self-improving zero-shot manipulation by iteratively collecting data on novel objects, fine-tuning on that data, and using the improved policy to collect more data. After 5 iterations, RoboCat achieved 36% success on unseen objects versus 12% for the base policy, showing that active learning accelerates zero-shot capability[7]. The key insight: policies must learn to recognize when they lack data for a novel object and request targeted demonstrations rather than failing silently.
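The self-improvement cycle described above can be sketched as a simple loop; `collect_trajectories`, `finetune`, and the success model below are hypothetical stand-ins for the real data-collection and training machinery, not RoboCat's actual pipeline.

```python
import random

random.seed(0)

# Self-improvement loop in the spirit of RoboCat (sketch only): the policy collects
# data on novel objects, is fine-tuned on that data, and the improved policy collects
# more. The helpers are placeholders, not the actual agent.

def collect_trajectories(policy_quality: float, novel_objects: list, n_per_object: int = 20):
    """Simulate data collection: better policies produce more successful trajectories."""
    return [
        {"object": obj, "success": random.random() < policy_quality}
        for obj in novel_objects
        for _ in range(n_per_object)
    ]

def finetune(policy_quality: float, trajectories: list) -> float:
    """Simulate fine-tuning: successful trajectories nudge the policy upward."""
    success_fraction = sum(t["success"] for t in trajectories) / max(len(trajectories), 1)
    return min(1.0, policy_quality + 0.2 * success_fraction)

policy_quality = 0.12            # base zero-shot success on unseen objects
novel_objects = [f"object_{i}" for i in range(25)]

for iteration in range(5):
    data = collect_trajectories(policy_quality, novel_objects)
    policy_quality = finetune(policy_quality, data)
    print(f"iteration {iteration + 1}: success rate ~{policy_quality:.2f}")
```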
Geometric reasoning enables zero-shot transfer across object scales and orientations. Policies trained with domain randomization—varying object size, color, texture, and lighting during training—learn scale-invariant and appearance-invariant representations that generalize to novel instances[8]. Point cloud representations provide explicit geometric structure that CNNs must infer from RGB images, improving zero-shot performance on objects with novel appearances but familiar shapes. PointNet architectures process 3D point clouds directly, learning permutation-invariant features that transfer across object instances with shared geometry.
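The snippet below is a minimal numpy sketch of the permutation-invariance property that makes point-based encoders attractive: a shared per-point transform followed by a symmetric max-pool produces the same global feature regardless of point order. The weights and dimensions are arbitrary placeholders, not a trained PointNet.

```python
import numpy as np

rng = np.random.default_rng(42)

# Shared per-point transform (one linear layer + ReLU here), then a symmetric
# max-pool. This is the core PointNet idea in miniature; weights are random
# placeholders rather than learned parameters.
W = rng.normal(size=(3, 128))   # maps each (x, y, z) point to a 128-d feature
b = rng.normal(size=(128,))

def pointnet_feature(points: np.ndarray) -> np.ndarray:
    """points: (N, 3) array -> (128,) global feature, invariant to point ordering."""
    per_point = np.maximum(points @ W + b, 0.0)   # same transform applied to every point
    return per_point.max(axis=0)                  # max-pool is order-independent

cloud = rng.normal(size=(1024, 3))                # a synthetic point cloud
shuffled = cloud[rng.permutation(len(cloud))]

print(np.allclose(pointnet_feature(cloud), pointnet_feature(shuffled)))  # True
```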
Sim-to-Real Transfer as Zero-Shot Deployment
Sim-to-real transfer is a form of zero-shot manipulation where policies trained entirely in simulation generalize to real-world objects and scenes without real-world training data. Domain randomization varies simulation parameters (lighting, textures, object properties, camera noise) to force policies to learn features robust to distribution shift, enabling zero-shot transfer to the real world[8].
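A domain-randomization sampler might look like the sketch below; the parameter names and ranges are illustrative assumptions rather than the configuration of any particular simulator.

```python
import random

# Illustrative domain-randomization sampler: each training episode gets freshly
# randomized visual and physical parameters so the policy cannot latch onto any
# single simulated appearance. Names and ranges are arbitrary examples.

def sample_randomization() -> dict:
    return {
        "light_intensity":   random.uniform(0.3, 1.5),
        "light_azimuth_deg": random.uniform(0.0, 360.0),
        "table_texture_id":  random.randint(0, 499),        # pick from a texture bank
        "object_scale":      random.uniform(0.8, 1.2),
        "object_mass_kg":    random.uniform(0.05, 1.0),
        "friction_coeff":    random.uniform(0.4, 1.2),
        "camera_jitter_px":  random.gauss(0.0, 2.0),
        "rgb_noise_std":     random.uniform(0.0, 0.05),
    }

for episode in range(3):
    params = sample_randomization()
    # env.reset(**params)   # hypothetical: apply parameters before rolling out the policy
    print(f"episode {episode}: {params}")
```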
RLBench provides 100 simulated manipulation tasks with procedurally generated object variations, enabling policies to train on millions of episodes before real-world deployment[9]. Policies trained on RLBench with domain randomization achieve 60-80% zero-shot success rates on real-world analogs of simulated tasks, demonstrating that simulation diversity can substitute for real-world data collection in constrained scenarios. However, contact-rich tasks (insertion, assembly, deformable object manipulation) remain challenging for sim-to-real transfer due to inaccurate physics modeling.
Hybrid approaches combine simulation pretraining with minimal real-world fine-tuning. RT-1 pretrained on 130,000 real-world demonstrations achieves 97% success on trained tasks but only 30% on novel objects; adding 10,000 simulated demonstrations with domain randomization improves zero-shot performance to 45%[10]. The economic implication: simulation can reduce real-world data requirements by 10-100× for tasks where physics fidelity is sufficient, but cannot fully replace real-world data for contact-rich manipulation.
Evaluation Benchmarks for Zero-Shot Manipulation
Rigorous zero-shot evaluation requires test sets with objects, scenes, and tasks disjoint from training data. THE COLOSSEUM benchmark provides 20 manipulation tasks with 50 object variations per task, explicitly partitioning objects into train/test splits to measure zero-shot generalization[11]. Policies are evaluated on success rate, execution time, and sample efficiency (demonstrations required to achieve 80% success on a new task).
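The protocol reduces to partitioning object instances before training and scoring rollouts only on the held-out split, as in the sketch below; the object list, the rollout stub, and the trial counts are hypothetical.

```python
import random

random.seed(0)

# Partition object instances into disjoint train/test sets BEFORE training, then
# report success rate only on held-out objects. `rollout_succeeds` is a stub
# standing in for running the policy on a real or simulated robot.

objects = [f"object_{i:03d}" for i in range(250)]
random.shuffle(objects)
train_objects, test_objects = objects[:200], objects[200:]   # disjoint by construction
assert not set(train_objects) & set(test_objects)

def rollout_succeeds(policy, obj: str) -> bool:
    return random.random() < 0.6    # placeholder for an actual evaluation rollout

def zero_shot_success_rate(policy, held_out: list, trials_per_object: int = 10) -> float:
    results = [rollout_succeeds(policy, obj) for obj in held_out for _ in range(trials_per_object)]
    return sum(results) / len(results)

print(f"zero-shot success: {zero_shot_success_rate(policy=None, held_out=test_objects):.1%}")
```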
ManipArena evaluates reasoning-oriented manipulation with 100 long-horizon tasks requiring multi-step planning and tool use. Zero-shot performance on ManipArena correlates with real-world deployment success because the benchmark tests compositional generalization—combining learned primitives into novel sequences—rather than memorization[12]. Policies that achieve 70% success on ManipArena typically achieve 50-60% success on real-world analogs, providing a calibrated proxy for deployment readiness.
Cross-embodiment evaluation tests whether policies generalize across robot platforms. Open X-Embodiment trains policies on 21 robot types and evaluates on a held-out 22nd embodiment, measuring zero-shot transfer across kinematic structures, gripper designs, and sensor configurations[2]. Policies trained on diverse embodiments achieve 40% success on novel robots versus 5% for single-embodiment policies, demonstrating that embodiment diversity is as critical as object diversity for generalist manipulation. LeRobot's evaluation suite standardizes cross-embodiment benchmarks with reproducible train/test splits and metrics.
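Cross-embodiment evaluation can be expressed as a leave-one-embodiment-out loop, sketched below with placeholder training and evaluation stubs; the platform list is illustrative, and the placeholder success rates simply echo the figures quoted above.

```python
# Leave-one-embodiment-out evaluation (sketch): train on all platforms except one,
# then measure zero-shot success on the held-out robot. Stubs replace real training.

embodiments = ["widowx", "franka", "ur5", "kuka_iiwa", "mobile_manipulator"]

def train_policy(train_embodiments: list) -> dict:
    return {"trained_on": tuple(train_embodiments)}           # placeholder policy object

def evaluate(policy: dict, embodiment: str) -> float:
    # Placeholder: multi-embodiment training ~40% vs ~5% for single-embodiment (figures above).
    return 0.40 if len(policy["trained_on"]) > 1 else 0.05

for held_out in embodiments:
    train_set = [e for e in embodiments if e != held_out]
    policy = train_policy(train_set)
    print(f"held-out {held_out}: zero-shot success {evaluate(policy, held_out):.0%}")
```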
Language Grounding for Compositional Zero-Shot Transfer
Language-conditioned policies achieve compositional zero-shot transfer by grounding natural language instructions in visual observations and robot actions. A policy trained on "pick up the red block" and "pick up the blue mug" can zero-shot generalize to "pick up the red mug" by composing color and object concepts. RT-2 demonstrates this capability by fine-tuning a vision-language model on robot trajectories, inheriting compositional reasoning from web-scale pretraining[1].
Language provides a structured prior over manipulation strategies that pure vision-based policies lack. When instructed to "open the drawer," a language-conditioned policy leverages pretrained knowledge that drawers have handles and require pulling motions, even if the specific drawer design is novel. SayCan combines language models with affordance functions to ground high-level instructions ("bring me a snack") into executable low-level actions (navigate to kitchen, open fridge, grasp apple), enabling zero-shot task planning[13].
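The SayCan-style combination can be sketched as a per-skill product of a language-model relevance score and an affordance score; the skill list and probabilities below are made-up numbers for illustration, not outputs of an actual model.

```python
# SayCan-style skill selection (sketch): combine the language model's estimate of
# how useful a skill is for the instruction with the affordance function's estimate
# of whether the skill can succeed from the current state. Numbers are illustrative.

instruction = "bring me a snack"

# p_llm: relevance of each skill to the instruction, from a language model.
p_llm = {
    "navigate_to_kitchen": 0.35,
    "open_fridge":         0.25,
    "grasp_apple":         0.30,
    "wipe_table":          0.10,
}

# p_affordance: likelihood each skill succeeds from the current state
# (the robot is in the living room, so grasping the apple is not yet feasible).
p_affordance = {
    "navigate_to_kitchen": 0.95,
    "open_fridge":         0.10,
    "grasp_apple":         0.05,
    "wipe_table":          0.80,
}

scores = {skill: p_llm[skill] * p_affordance[skill] for skill in p_llm}
next_skill = max(scores, key=scores.get)
print(f"'{instruction}' -> execute: {next_skill} (score {scores[next_skill]:.3f})")
```

Repeating this selection after each executed skill yields a grounded multi-step plan without any task-specific training.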
Language grounding requires aligned vision-language-action datasets where trajectories are annotated with natural language descriptions. DROID provides 76,000 trajectories with free-form language annotations, enabling policies to learn the mapping between linguistic concepts and manipulation primitives[3]. Scale AI's partnership with Universal Robots demonstrates industrial-scale language annotation for manipulation data, providing the infrastructure to train VLA models for zero-shot deployment in manufacturing and logistics.
Data Provenance and Zero-Shot Performance Prediction
Zero-shot manipulation performance depends critically on training data composition, but most datasets lack the metadata to predict generalization. A policy trained on 100,000 trajectories may achieve 80% zero-shot success if those trajectories span 5,000 object instances, or 20% success if they repeat 50 objects 2,000 times each. Data provenance tracking records object distributions, scene variations, and task coverage to enable buyers to assess zero-shot potential before procurement.
Dataset cards for physical AI must report diversity metrics: unique object count, object category distribution, scene complexity (clutter, occlusion, lighting variation), and task distribution. BridgeData V2's documentation provides exemplary transparency, reporting 2,000+ object instances across 13 kitchens with per-object demonstration counts[5]. This granularity enables buyers to estimate whether a dataset provides sufficient coverage for their target deployment environment.
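A dataset card's diversity metrics can be computed from per-trajectory metadata along the lines sketched below; the metadata schema and example records are hypothetical, not BridgeData V2's actual format.

```python
from collections import Counter

# Hypothetical per-trajectory metadata records; a real dataset card would be
# computed over the full dataset with its own schema.
trajectories = [
    {"object": "mug_red_01",  "category": "mug",     "scene": "kitchen_03", "task": "pick_place"},
    {"object": "mug_blue_02", "category": "mug",     "scene": "kitchen_07", "task": "pick_place"},
    {"object": "spatula_01",  "category": "tool",    "scene": "kitchen_03", "task": "push"},
    {"object": "drawer_left", "category": "fixture", "scene": "kitchen_11", "task": "open"},
]

def dataset_card_metrics(trajs: list) -> dict:
    objects = Counter(t["object"] for t in trajs)
    return {
        "num_trajectories":      len(trajs),
        "unique_objects":        len(objects),
        "category_distribution": dict(Counter(t["category"] for t in trajs)),
        "unique_scenes":         len({t["scene"] for t in trajs}),
        "task_distribution":     dict(Counter(t["task"] for t in trajs)),
        "demos_per_object_min":  min(objects.values()),
        "demos_per_object_max":  max(objects.values()),
    }

print(dataset_card_metrics(trajectories))
```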
Truelabel's marketplace enforces structured metadata for all physical AI datasets, requiring sellers to report object inventories, scene descriptions, and task taxonomies. Buyers can query datasets by diversity metrics ("show me datasets with >1,000 unique kitchen objects") and assess zero-shot transfer potential through coverage analysis. This infrastructure reduces procurement risk by surfacing datasets with the diversity necessary for generalist policies.
Economic Model: Pretraining Cost vs. Per-Task Deployment Cost
Zero-shot manipulation inverts the traditional robotics cost structure. Task-specific policies require $50,000-$500,000 in data collection and training per task, making deployment economically viable only for high-volume applications. Zero-shot policies require $5M-$50M in pretraining but near-zero marginal cost per new task, amortizing upfront investment across thousands of downstream applications.
RT-2's training consumed approximately 10,000 TPU-hours on 580,000 robot trajectories plus web-scale vision-language data, representing ~$2M in compute and data costs[1]. However, the resulting policy generalizes to hundreds of novel objects and tasks without additional training, reducing per-task cost to <$10,000 for deployment engineering. This economic model favors centralized pretraining by well-capitalized entities (Scale AI, Google DeepMind, NVIDIA) followed by zero-shot deployment by end users.
The data marketplace model enables cost sharing across buyers. Truelabel aggregates demand from multiple buyers to fund large-scale data collection, then licenses the resulting datasets to all participants. A $1M dataset funded by 50 buyers costs each buyer $20,000, versus $1M for proprietary collection. This cooperative model accelerates the transition to zero-shot manipulation by making pretraining-scale datasets economically accessible to mid-market robotics companies.
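The cost argument in this section is simple amortization arithmetic, sketched below using the figures quoted above as rough assumptions.

```python
# Amortization arithmetic using the section's rough figures (all assumptions):
# a task-specific pipeline pays per task, a pretrained zero-shot policy pays once.

pretraining_cost = 20_000_000        # one-time cost, within the quoted $5M-$50M range
per_task_deploy  = 10_000            # deployment engineering per new task (zero-shot)
per_task_bespoke = 200_000           # per-task data collection + training (task-specific)

for n_tasks in (10, 100, 1_000):
    zero_shot = pretraining_cost + per_task_deploy * n_tasks
    bespoke   = per_task_bespoke * n_tasks
    print(f"{n_tasks:>5} tasks: zero-shot ${zero_shot / n_tasks:>10,.0f}/task, "
          f"task-specific ${bespoke / n_tasks:>10,.0f}/task")

# Cooperative dataset funding: a $1M collection effort split across 50 buyers.
print(f"per-buyer cost: ${1_000_000 / 50:,.0f}")
```

Under these assumptions the zero-shot approach only wins once the pretrained policy is reused across on the order of a hundred or more tasks, which is why centralized pretraining plus broad reuse is the economic premise of the model.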
External references and source context
1. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. RT-2 trained on 580,000 trajectories achieved 62% zero-shot success on novel objects versus 32% for RT-1. (arXiv)
2. Open X-Embodiment: Robotic Learning Datasets and RT-X Models. Open X-Embodiment aggregated 1 million trajectories across 22 embodiments, improving zero-shot performance by 50% over single-robot datasets. (arXiv)
3. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. DROID provides 76,000 trajectories across 564 skills and 86 locations with language annotations. (arXiv)
4. OpenVLA: An Open-Source Vision-Language-Action Model. The 7B OpenVLA model trained on 970,000 trajectories achieved a 16.5% absolute improvement over RT-1 on unseen objects. (arXiv)
5. BridgeData V2: A Dataset for Robot Learning at Scale. BridgeData V2 collected 60,000 trajectories across 2,000+ objects, achieving 81% zero-shot success versus 34% for narrow object sets. (arXiv)
6. CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks. The CALVIN benchmark evaluates policies on chains of 5 sequential tasks to test compositional generalization. (arXiv)
7. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation. RoboCat achieved 36% zero-shot success after 5 self-improvement iterations versus 12% for the baseline. (arXiv)
8. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. Domain randomization enables sim-to-real transfer by training on varied simulation parameters. (arXiv)
9. RLBench: The Robot Learning Benchmark & Learning Environment. RLBench provides 100 simulated tasks, enabling policies to train on millions of episodes. (arXiv)
10. RT-1: Robotics Transformer for Real-World Control at Scale. RT-1 trained on 130,000 real demonstrations achieves 97% on trained tasks but 30% zero-shot; simulation augmentation improves this to 45%. (arXiv)
11. THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation. THE COLOSSEUM provides 20 tasks with 50 object variations and explicit train/test splits. (arXiv)
12. ManipArena: Comprehensive Real-world Evaluation of Reasoning-Oriented Generalist Robot Manipulation. ManipArena evaluates 100 long-horizon reasoning tasks; 70% benchmark success correlates with 50-60% real-world success. (arXiv)
13. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. SayCan grounds high-level language instructions into executable manipulation primitives. (arXiv)
14. RoboNet: Large-Scale Multi-Robot Learning. RoboNet demonstrated that multi-robot training improves generalization through embodiment diversity. (arXiv)
15. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset. EPIC-KITCHENS provides egocentric video of kitchen manipulation with 2,000+ object interactions. (arXiv)
16. NVIDIA Cosmos World Foundation Models. NVIDIA Cosmos world foundation models enable simulation-based pretraining for physical AI. (NVIDIA Developer)
17. Labelbox. Labelbox provides annotation tooling for robot trajectory labeling and language grounding. (labelbox.com)
18. Encord. Encord offers multi-modal annotation for vision-language-action dataset creation. (encord.com)
19. Kognic. Kognic specializes in annotation for autonomous systems and robotics datasets. (kognic.com)
20. Segments.ai. Segments.ai provides point cloud and multi-sensor labeling for 3D manipulation data. (segments.ai)
21. V7 Darwin. V7 Darwin offers workflow automation for large-scale robot trajectory annotation. (v7darwin.com)
22. Appen AI Data. Appen provides managed annotation services for physical AI training data. (appen.com)
23. Scale AI: Expanding Our Data Engine for Physical AI. Scale AI expanded its data engine for physical AI with trajectory annotation and language grounding. (scale.com)
24. Kitchen Task Training Data for Robotics. Claru offers kitchen task training data for robotic manipulation. (claru.ai)
25. Custom Robot Teleoperation Data Collection Service, Silicon Valley Robotics Center. Silicon Valley Robotics Center provides custom teleoperation data collection services. (roboticscenter.ai)
FAQ
What is the difference between zero-shot manipulation and few-shot manipulation?
Zero-shot manipulation requires no demonstrations of the target object or task during training—the policy must generalize entirely from experience with other objects. Few-shot manipulation allows 1-10 demonstrations of the novel object, enabling rapid adaptation through meta-learning or fine-tuning. RT-2 achieves 62% zero-shot success on novel objects and 89% few-shot success with 10 demonstrations, showing that minimal task-specific data significantly improves performance. The economic trade-off: zero-shot deployment has zero marginal data cost but lower success rates; few-shot deployment requires per-task data collection but achieves higher reliability.
Can zero-shot manipulation work for contact-rich tasks like insertion and assembly?
Contact-rich tasks remain challenging for zero-shot manipulation because they require precise force control and accurate physics modeling that current vision-based policies struggle to infer from visual observation alone. Policies trained on peg-in-hole insertion with 0.1mm clearance achieve 40% zero-shot success on novel peg shapes versus 85% for pick-and-place tasks. Tactile sensing improves zero-shot performance on contact-rich tasks by providing direct force feedback, but tactile datasets remain scarce. DROID includes 5,000 contact-rich trajectories with proprioceptive data, representing 6.5% of the dataset—insufficient for robust zero-shot transfer. The current state: zero-shot manipulation works well for pick-and-place and non-contact tasks but requires task-specific fine-tuning for high-precision assembly.
How much training data is required to achieve reliable zero-shot manipulation?
Empirical results suggest 500,000-1,000,000 diverse trajectories are necessary for reliable zero-shot manipulation across broad object categories. RT-2 trained on 580,000 trajectories achieves 62% zero-shot success; Open X-Embodiment trained on 1 million trajectories achieves 70% success. However, data diversity matters more than volume—BridgeData V2's 60,000 trajectories across 2,000 objects outperform datasets with 200,000 trajectories on 200 objects. The practical threshold: 10-50 demonstrations per object across 1,000-5,000 object instances provides sufficient diversity for category-level generalization. Truelabel's marketplace aggregates datasets meeting these thresholds, enabling buyers to procure pretraining-scale data without proprietary collection.
What role does simulation play in zero-shot manipulation?
Simulation enables policies to train on millions of episodes with procedurally generated object variations, learning robust features that transfer to real-world deployment. Domain randomization—varying lighting, textures, object properties, and physics parameters—forces policies to ignore simulation artifacts and focus on manipulation-relevant geometry. RLBench policies trained on 10 million simulated episodes achieve 60-80% zero-shot success on real-world analogs, demonstrating that simulation diversity can substitute for real-world data in constrained scenarios. However, contact-rich tasks and deformable object manipulation remain challenging for sim-to-real transfer due to inaccurate physics modeling. The current best practice: pretrain on simulation for geometric reasoning, then fine-tune on 5,000-10,000 real-world trajectories for contact dynamics.
How do you evaluate whether a dataset will enable zero-shot manipulation?
Dataset evaluation for zero-shot manipulation requires analyzing four diversity dimensions: object variety (unique instance count and category distribution), scene complexity (clutter, occlusion, lighting variation), embodiment heterogeneity (robot platforms and gripper types), and task distribution (primitive action coverage). A dataset with 10,000 trajectories on 50 objects will not enable zero-shot transfer; a dataset with 10,000 trajectories on 2,000 objects likely will. Truelabel's marketplace provides structured metadata for all datasets, enabling buyers to query by diversity metrics and assess zero-shot potential through coverage analysis. The practical heuristic: target ≥1,000 unique object instances, ≥5 scene environments, and ≥10 demonstrations per object for category-level generalization.
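That heuristic can be applied as a quick screening check, as in the sketch below; the thresholds repeat the figures above and the example dataset-card values are invented.

```python
# Quick screening check against the heuristic thresholds above. The `card`
# values are example numbers, as if read from a dataset card like the one
# sketched earlier; thresholds mirror the heuristic in this answer.

THRESHOLDS = {
    "unique_objects":       1_000,
    "unique_scenes":        5,
    "min_demos_per_object": 10,
}

card = {
    "unique_objects":       2_300,
    "unique_scenes":        13,
    "min_demos_per_object": 8,
}

for metric, threshold in THRESHOLDS.items():
    status = "OK " if card[metric] >= threshold else "LOW"
    print(f"[{status}] {metric}: {card[metric]} (target >= {threshold})")
```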
Find datasets covering zero-shot manipulation
Truelabel surfaces vetted datasets and capture partners working on zero-shot manipulation. Tell us the modality, scale, and rights you need, and we will route you to the closest match.
Browse Physical AI Datasets