Foundation Model Robotics
Foundation model robotics refers to large neural networks—typically 100M to 10B+ parameters—pretrained on internet-scale vision and language data, then fine-tuned on robot demonstrations to produce generalist policies that follow natural language instructions and manipulate novel objects across embodiments. The architecture pattern is Vision-Language-Action (VLA): a vision encoder processes camera frames, a language model backbone integrates visual features with text instructions, and an action head outputs robot-executable commands.
Quick facts
- Topic: Foundation Model Robotics
- Audience: Procurement leads, ML ops, robotics engineers
- Deliverable: Buyer-facing reference + procurement guidance
Architecture: Vision Encoders, Language Backbones, and Action Heads
Foundation model robotics systems decompose into three components. The vision encoder, typically a Vision Transformer (ViT) or SigLIP model, processes RGB camera observations into spatial feature maps. The language model backbone (PaLM, LLaMA, or similar) integrates visual features with natural language task instructions via cross-attention or token concatenation. The action head maps the joint representation to robot-executable outputs: RT-1 discretizes actions into 256 bins per dimension, while OpenVLA outputs continuous 7-DoF action vectors.
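The 256-bin discretization mentioned above can be illustrated with a short sketch. This is a minimal example under stated assumptions, not RT-1's actual code: the action limits, the 7-dimensional layout, and the function names `discretize` and `undiscretize` are hypothetical and chosen only for illustration.

```python
import numpy as np

NUM_BINS = 256  # bins per action dimension, as described for RT-1


def discretize(action, low, high, num_bins=NUM_BINS):
    """Map each continuous action dimension to an integer bin in [0, num_bins - 1]."""
    action = np.clip(action, low, high)
    scaled = (action - low) / (high - low)  # normalize to [0, 1]
    return np.minimum((scaled * num_bins).astype(int), num_bins - 1)


def undiscretize(bins, low, high, num_bins=NUM_BINS):
    """Recover the bin-center value for each discrete token."""
    return low + (bins + 0.5) / num_bins * (high - low)


# Hypothetical 7-DoF action (position delta, rotation delta, gripper).
# The [-1, 1] limits are illustrative, not taken from any specific robot.
low = np.full(7, -1.0)
high = np.full(7, 1.0)
action = np.array([0.10, -0.25, 0.0, 0.5, -1.0, 1.0, 0.33])

tokens = discretize(action, low, high)
recovered = undiscretize(tokens, low, high)
max_error = np.abs(recovered - action).max()
```

With uniform bins the reconstruction error is bounded by half a bin width, here (2.0 / 256) / 2, which gives a rough sense of the precision ceiling that discretized action heads accept.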
Pretraining occurs on internet data—LAION-5B image-text pairs, WebLI captions, YouTube videos—providing zero-shot visual recognition and language grounding that would require millions of robot demonstrations to learn from scratch[1]. Fine-tuning then adapts the model to robot control using teleoperation or scripted demonstration datasets. Open X-Embodiment aggregates 1M+ trajectories across 22 robot embodiments, enabling cross-platform transfer. The action head is the only component trained exclusively on robot data; vision and language layers leverage internet pretraining.
Deployment requires real-time inference—typically 10-20 Hz control loops. Scale AI's Physical AI platform provides GPU-accelerated serving infrastructure for VLA models. Latency budgets constrain model size: 7B-parameter models run on edge GPUs like NVIDIA RTX 4090, while 30B+ models require datacenter inference.
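The latency constraint follows from simple arithmetic: a control loop at f Hz leaves 1000 / f milliseconds per step. The sketch below works this out for the 10-20 Hz range; the inference and overhead timings are hypothetical placeholders, not measurements of any model.

```python
def step_budget_ms(control_hz: float) -> float:
    """Time available per control step, in milliseconds."""
    return 1000.0 / control_hz


def fits_budget(inference_ms: float, overhead_ms: float, control_hz: float) -> bool:
    """True if inference plus sensing/actuation overhead fits in one control step."""
    return inference_ms + overhead_ms <= step_budget_ms(control_hz)


budget_10hz = step_budget_ms(10)  # 100.0 ms per step
budget_20hz = step_budget_ms(20)  # 50.0 ms per step

# Hypothetical numbers: 60 ms model inference, 15 ms camera and actuation overhead.
ok_at_10hz = fits_budget(60.0, 15.0, 10)  # 75 ms fits within 100 ms
ok_at_20hz = fits_budget(60.0, 15.0, 20)  # 75 ms exceeds 50 ms
```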
Training Data Requirements: Teleoperation, Scripted Play, and Synthetic Trajectories
Foundation model robotics training pipelines consume three data categories. Teleoperation datasets—human operators controlling robots via VR or kinesthetic teaching—capture high-quality manipulation trajectories. DROID contains 76K teleoperated episodes across 564 scenes and 86 objects, recorded at 10 Hz with wrist-mounted RGB-D cameras. BridgeData V2 provides 60K demonstrations of kitchen tasks using a WidowX robot arm. Teleoperation data exhibits high task success rates (80-95%) but limited scene diversity[2].
Scripted play datasets—robots executing randomized action sequences in instrumented environments—provide broader coverage at lower quality. RoboNet aggregates 15M frames from 7 robot platforms performing scripted interactions. Success rates drop to 20-40%, but the dataset spans 1,000+ object configurations. Scripted data trains vision encoders to recognize object affordances; teleoperation data teaches task completion.
Synthetic trajectories from simulators (RLBench, ManiSkill, RoboSuite) generate unlimited demonstrations but suffer from sim-to-real gaps. Domain randomization varies lighting, textures, and physics parameters during training to improve real-world transfer. NVIDIA Cosmos World Foundation Models pretrain on 20M synthetic video frames before fine-tuning on 100K real demonstrations, achieving 73% real-world task success versus 45% for sim-only training[3].
Benchmark Datasets: Open X-Embodiment, DROID, and CALVIN
Three benchmark datasets anchor foundation model robotics research. Open X-Embodiment aggregates 1M+ episodes from 22 robot platforms—Franka Panda, WidowX, Google Robot, ALOHA—spanning 160K unique scenes[4]. Each episode includes RGB-D video at 10-30 Hz, proprioceptive joint states, gripper commands, and natural language task annotations. The dataset uses RLDS (Reinforcement Learning Datasets) format with HDF5 storage, enabling TensorFlow and PyTorch loaders.
DROID (Distributed Robot Interaction Dataset) contains 76K teleoperated manipulation episodes collected across 564 real-world scenes. Each trajectory includes dual wrist-camera RGB-D streams, 7-DoF end-effector poses, and binary success labels. DROID's diversity—86 object categories, 12 task families—makes it a primary fine-tuning corpus for generalist policies. Hugging Face LeRobot provides DROID loaders with automatic train/val splits.
CALVIN (Composing Actions from Language and Vision) benchmarks long-horizon task planning. The dataset contains 24K scripted episodes in a simulated kitchen, each comprising 3-5 sequential subtasks ('open drawer', 'grasp block', 'place in drawer'). Language instructions specify task sequences; models must infer subtask boundaries from visual observations. CALVIN's multi-step structure tests compositional generalization—a key gap in current VLA architectures.
Model Families: RT-1, RT-2, OpenVLA, and pi-zero
RT-1 (Robotics Transformer) introduced the VLA architecture in December 2022. The model uses a ViT-based vision encoder, a 35M-parameter Transformer backbone, and a discretized action head with 256 bins per dimension. Trained on 130K demonstrations from Google's mobile manipulation fleet, RT-1 achieves 97% success on seen tasks and 76% on novel object configurations[5]. The action discretization enables autoregressive sampling but limits precision for contact-rich tasks.
RT-2 replaces RT-1's vision encoder with PaLI-X, a 55B-parameter vision-language model pretrained on WebLI (10B image-text pairs). The language backbone provides zero-shot reasoning: RT-2 follows instructions like 'move the extinct animal' by grounding 'extinct' to a toy dinosaur via internet knowledge. RT-2 improves novel-task success from 76% to 83% over RT-1 while using the same 130K robot demonstrations[1].
OpenVLA is a 7B-parameter open-source VLA released June 2024. The architecture uses SigLIP for vision encoding and LLaMA-2-7B for language modeling, with a continuous action head outputting 7-DoF vectors. Trained on Open X-Embodiment's 1M episodes, OpenVLA achieves 78% success on cross-embodiment transfer tasks. The model runs at 15 Hz on an NVIDIA RTX 4090, enabling edge deployment.
pi-zero (Physical Intelligence's foundation model, announced October 2024) uses a diffusion-based action head that denoises 16-step action sequences. This architecture handles contact dynamics better than single-step predictors: pi-zero achieves 91% success on cloth folding versus 67% for RT-2[6]. The model trains on 500K demonstrations spanning household, warehouse, and assembly tasks.
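To make the idea of an action head that denoises a multi-step action sequence concrete, here is a toy sketch. It is not pi-zero's sampler: the denoiser is a stand-in that returns a known target, the update rule is a simple deterministic interpolation, and the names `toy_denoiser` and `sample_action_chunk`, the 7-DoF width, and the 10 refinement steps are all assumptions for illustration. The sketch also reads "16-step" as a chunk of 16 consecutive actions.

```python
import numpy as np

HORIZON, ACTION_DIM, NUM_STEPS = 16, 7, 10  # chunk of 16 actions, 7-DoF each


def toy_denoiser(noisy_actions, target):
    """Stand-in for a learned network: returns an estimate of the clean chunk.

    A real action head would condition on images and language; here the
    'prediction' is a known target so the loop can be checked end to end.
    """
    return target


def sample_action_chunk(target, num_steps=NUM_STEPS, seed=0):
    """Iteratively refine pure noise into an action chunk."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((HORIZON, ACTION_DIM))  # start from Gaussian noise
    for remaining in range(num_steps, 0, -1):
        x0_hat = toy_denoiser(x, target)  # predicted clean chunk
        x = x + (x0_hat - x) / remaining  # move a fraction of the way toward it
    return x


# Illustrative target: every dimension ramps linearly from 0 to 1 over the chunk.
target = np.tile(np.linspace(0.0, 1.0, HORIZON)[:, None], (1, ACTION_DIM))
chunk = sample_action_chunk(target)
```

The point of the sketch is the shape of the computation: the policy emits a whole (16, 7) chunk per inference call and reaches it through several refinement passes rather than one forward prediction.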
Pretraining Strategies: Internet Data, Multi-Robot Aggregation, and World Models
Foundation model robotics pretraining follows three paradigms. Internet pretraining initializes vision and language components on web-scale data before robot fine-tuning. RT-2 uses PaLI-X weights from 10B image-text pairs, providing object recognition and spatial reasoning priors. This approach reduces robot data requirements by 10-100× versus training from scratch[1].
Multi-robot aggregation pools demonstrations from diverse platforms to learn embodiment-invariant representations. RoboCat trains on 253K episodes from 6 robot arms with different kinematics, then adapts to new embodiments with 100-1,000 demonstrations. Cross-embodiment pretraining improves sample efficiency: a model pretrained on Franka Panda data achieves 65% success on WidowX after 500 fine-tuning episodes, versus 42% when training WidowX-only from scratch[7].
World model pretraining—learning video prediction models of physical dynamics—is emerging as a third paradigm. NVIDIA Cosmos trains diffusion transformers on 20M synthetic video frames to predict future observations given actions. The learned world model serves as a dynamics prior for model-based planning. World Models demonstrated this approach in 2018 for simple environments; 2024 systems scale to high-dimensional manipulation. Recent work argues world models are necessary for general agents that plan beyond immediate actions.
Fine-Tuning Protocols: Task-Specific Adaptation and In-Context Learning
Foundation models adapt to new tasks via fine-tuning or in-context learning. Task-specific fine-tuning updates model weights on 100-10,000 demonstrations of the target task. LeRobot's ACT training pipeline fine-tunes OpenVLA on custom datasets using LoRA (Low-Rank Adaptation), updating <1% of parameters while preserving pretrained knowledge. Fine-tuning for 5,000 steps on 500 demonstrations achieves 80-90% task success for pick-and-place operations[8].
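The claim that LoRA updates under 1% of parameters can be checked with a small worked example. The sketch below applies a low-rank update to a single frozen weight matrix; the layer size, rank, and scaling factor are illustrative assumptions and do not reflect OpenVLA's or LeRobot's actual configuration.

```python
import numpy as np

# Illustrative sizes only: one 1024 x 1024 layer with a rank-4 adapter.
d_out, d_in, rank, alpha = 1024, 1024, 4, 8

rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)).astype(np.float32)        # frozen pretrained weight
A = (0.01 * rng.standard_normal((rank, d_in))).astype(np.float32)  # trainable down-projection
B = np.zeros((d_out, rank), dtype=np.float32)                     # trainable, zero-initialized


def lora_forward(x):
    """Frozen path plus scaled low-rank update: x W^T + (alpha / rank) * x A^T B^T."""
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T


trainable_params = A.size + B.size          # 2 * 1024 * 4 = 8192
trainable_fraction = trainable_params / W.size  # 8192 / 1048576 = 0.0078125

x = rng.standard_normal((2, d_in)).astype(np.float32)
y = lora_forward(x)
```

Because B starts at zero, the adapted layer initially reproduces the pretrained layer exactly, which is one reason this scheme tends to preserve pretrained behavior early in fine-tuning.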
In-context learning provides task examples in the prompt without weight updates. RT-2 accepts 1-5 demonstration videos as context, then generates actions for novel scenes. This approach works for tasks similar to pretraining data but degrades on out-of-distribution scenarios. In-context success rates range from 60% (novel objects) to 85% (seen objects, new arrangements)[1].
Scale AI's partnership with Universal Robots demonstrates production fine-tuning workflows. Customers upload 200-500 teleoperation demonstrations via Scale's data engine; the platform fine-tunes a base VLA model and deploys it to UR cobots within 48 hours. This turnkey approach reduces deployment time from months to days for repetitive manipulation tasks.
Data Formats and Tooling: RLDS, MCAP, and LeRobot
Robot demonstration datasets use specialized formats to store multi-modal trajectories. RLDS (Reinforcement Learning Datasets) defines a schema for episodes containing observations (images, joint states), actions, rewards, and metadata. RLDS datasets serialize to TensorFlow's TFRecord format or HDF5, with per-episode compression reducing storage by 60-80%[9].
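The episode schema described above can be pictured as nested dictionaries. The sketch below builds one in plain Python rather than through the RLDS or TensorFlow Datasets libraries. The step-level fields (`observation`, `action`, `reward`, `is_first`, `is_last`, `is_terminal`) follow the RLDS convention; the observation keys, image size, and the `episode_metadata` and `language_instruction` names are assumptions for illustration.

```python
import numpy as np


def make_step(image, joint_state, action, reward, is_first, is_last, is_terminal):
    """One timestep in an RLDS-style episode."""
    return {
        "observation": {"image": image, "joint_state": joint_state},
        "action": action,
        "reward": reward,
        "is_first": is_first,
        "is_last": is_last,
        "is_terminal": is_terminal,
    }


def make_episode(num_steps, instruction):
    """Build a placeholder episode with zero-filled observations and actions."""
    steps = []
    for t in range(num_steps):
        last = t == num_steps - 1
        steps.append(make_step(
            image=np.zeros((64, 64, 3), dtype=np.uint8),   # hypothetical camera frame
            joint_state=np.zeros(7, dtype=np.float32),      # hypothetical 7-joint arm
            action=np.zeros(7, dtype=np.float32),
            reward=1.0 if last else 0.0,                    # sparse success reward
            is_first=(t == 0),
            is_last=last,
            is_terminal=last,
        ))
    # Metadata key names below are illustrative, not mandated by RLDS.
    return {"steps": steps, "episode_metadata": {"language_instruction": instruction}}


episode = make_episode(5, "place the block in the drawer")
```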
MCAP is a container format for time-series sensor data, widely adopted in robotics for its efficient random access and ROS 2 compatibility. rosbag2_storage_mcap enables direct recording from ROS nodes. MCAP files store synchronized camera streams, LiDAR point clouds, and IMU data with microsecond timestamps, critical for learning contact-rich manipulation policies.
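Microsecond timestamps matter because streams recorded at different rates have to be aligned before training. The sketch below shows nearest-timestamp matching between a camera stream and an IMU stream. It is format-agnostic and does not use the MCAP library; the sample rates, the skew tolerance, and the function names are hypothetical.

```python
from bisect import bisect_left


def nearest_index(sorted_ts, t):
    """Index of the timestamp in sorted_ts closest to t (earlier one wins ties)."""
    i = bisect_left(sorted_ts, t)
    if i == 0:
        return 0
    if i == len(sorted_ts):
        return len(sorted_ts) - 1
    before, after = sorted_ts[i - 1], sorted_ts[i]
    return i if (after - t) < (t - before) else i - 1


def align(camera_ts, imu_ts, max_skew_us):
    """Pair each camera frame with the nearest IMU sample within max_skew_us."""
    pairs = []
    for cam_idx, t in enumerate(camera_ts):
        j = nearest_index(imu_ts, t)
        if abs(imu_ts[j] - t) <= max_skew_us:
            pairs.append((cam_idx, j))
    return pairs


# Timestamps in microseconds: camera at roughly 30 Hz, IMU at 200 Hz.
camera_ts = [0, 33_333, 66_667, 100_000]
imu_ts = list(range(0, 100_001, 5_000))
pairs = align(camera_ts, imu_ts, max_skew_us=2_500)
```

Frames with no sample inside the tolerance are dropped rather than paired with stale data, which is a conservative choice for contact-rich tasks.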
Hugging Face LeRobot provides a unified interface for 25+ robot datasets—DROID, BridgeData V2, CALVIN, RoboNet—with automatic format conversion to LeRobot's Parquet-based schema. The library includes data loaders for PyTorch, visualization tools, and reference training scripts for Diffusion Policy and ACT. LeRobot reduces dataset integration time from weeks to hours, accelerating foundation model research.
Sim-to-Real Transfer: Domain Randomization and Dynamics Adaptation
Simulators generate unlimited training data but introduce reality gaps—discrepancies in physics, rendering, and sensor noise. Domain randomization varies simulation parameters (lighting, textures, object masses, friction coefficients) during training, forcing models to learn robust features invariant to these factors. Policies trained with randomization transfer to real robots with 70-85% success versus 30-50% for fixed-parameter simulation[10].
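Domain randomization amounts to drawing a fresh set of simulation parameters for every training episode. The sketch below samples the four factors named above; the parameter ranges, the `SimParams` structure, and the texture count are illustrative assumptions rather than values from any particular simulator.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    light_intensity: float  # relative brightness multiplier
    texture_id: int         # index into a hypothetical texture bank
    object_mass_kg: float
    friction: float         # surface friction coefficient


def sample_params(rng: random.Random) -> SimParams:
    """Draw one randomized configuration. All ranges are illustrative."""
    return SimParams(
        light_intensity=rng.uniform(0.3, 1.5),
        texture_id=rng.randrange(0, 100),
        object_mass_kg=rng.uniform(0.05, 1.0),
        friction=rng.uniform(0.4, 1.2),
    )


rng = random.Random(0)  # seeded so a training run can be reproduced
configs = [sample_params(rng) for _ in range(1000)]  # one configuration per episode
```

In a training loop, each sampled configuration would be applied to the simulator before the episode is reset, so the policy never sees the same rendering and physics settings twice in a row.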
Dynamics adaptation fine-tunes simulator-trained models on small real-world datasets (100-1,000 episodes) to correct physics mismatches. Multi-task domain adaptation learns a mapping from simulated to real observations using adversarial training. This approach achieves 78% real-world success after 500 real demonstrations, versus 91% for models trained entirely on 10,000 real episodes: 20× fewer real demonstrations, at a cost of 13 percentage points of success[11].
RLBench provides 100 simulated manipulation tasks with domain randomization built in. NVIDIA Cosmos combines synthetic pretraining (20M frames) with real fine-tuning (100K demonstrations), achieving 73% real-world success, between the pure-sim (45%) and pure-real (91%) baselines. The optimal ratio is task-dependent: contact-rich tasks (peg insertion, cloth folding) require more real data than vision-dominant tasks (object sorting)[3].
Evaluation Benchmarks: Success Rate, Generalization, and Long-Horizon Planning
Foundation model robotics evaluation measures three capabilities. Task success rate—percentage of episodes achieving the goal within a time limit—is the primary metric. RT-1 reports 97% success on seen tasks, 76% on novel objects. Benchmarks like THE COLOSSEUM test 50 manipulation tasks across 10 object categories, providing standardized comparisons.
Generalization metrics assess performance on out-of-distribution scenarios: novel objects (seen category, new instance), novel scenes (new backgrounds, lighting), and novel embodiments (different robot kinematics). Open X-Embodiment's cross-embodiment benchmark trains on 20 platforms and evaluates on 2 held-out robots, measuring zero-shot transfer. Top models achieve 60-70% success on novel embodiments versus 80-90% on training platforms[4].
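Success rate and generalization gap are straightforward to compute once each evaluation episode is tagged with its split. The sketch below uses made-up episode outcomes, and the split names are hypothetical labels rather than those of any published benchmark.

```python
from collections import defaultdict


def success_by_split(episodes):
    """Return {split: success_rate} from (split, succeeded) records."""
    counts = defaultdict(lambda: [0, 0])  # split -> [successes, total]
    for split, succeeded in episodes:
        counts[split][0] += int(succeeded)
        counts[split][1] += 1
    return {split: wins / total for split, (wins, total) in counts.items()}


# Made-up outcomes: 9 of 10 successes on a training platform,
# 6 of 10 on a held-out embodiment.
episodes = (
    [("seen", True)] * 9 + [("seen", False)]
    + [("novel_embodiment", True)] * 6 + [("novel_embodiment", False)] * 4
)

rates = success_by_split(episodes)
generalization_gap = rates["seen"] - rates["novel_embodiment"]
```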
Long-horizon planning benchmarks—CALVIN, LongBench—test multi-step task completion. CALVIN requires executing 3-5 sequential subtasks from a single language instruction ('prepare breakfast'). Current VLA models complete 2.1 subtasks on average before failure, versus 4.8 for human teleoperation[12]. Compositional generalization remains an open challenge: models struggle to chain learned primitives into novel sequences.
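The "subtasks completed before failure" figure can be computed as the average length of the successful prefix of each rollout. The sketch below assumes each rollout is recorded as an ordered list of per-subtask outcomes; the data is made up and the function names are hypothetical.

```python
def completed_in_a_row(subtask_results):
    """Count subtasks completed before the first failure."""
    count = 0
    for succeeded in subtask_results:
        if not succeeded:
            break
        count += 1
    return count


def average_chain_length(rollouts):
    """Mean number of consecutive subtasks completed across rollouts."""
    return sum(completed_in_a_row(r) for r in rollouts) / len(rollouts)


# Made-up rollouts of a 5-subtask chain.
rollouts = [
    [True, True, False, False, False],   # 2 completed
    [True, True, True, False, False],    # 3 completed
    [True, False, True, True, True],     # 1 completed: later successes do not count
    [True, True, True, True, True],      # 5 completed
]

avg = average_chain_length(rollouts)
```

Counting only the prefix reflects how chained tasks work in practice: once a subtask fails, the scene is usually no longer in the state the next instruction assumes.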
Commercial Deployment: Scale AI, Physical Intelligence, and Figure AI
Foundation model robotics is transitioning from research to production. Scale AI's Physical AI platform provides end-to-end infrastructure: data collection via teleoperation interfaces, annotation with 3D bounding boxes and semantic segmentation, model training on H100 clusters, and deployment to edge devices. Scale's partnership with Universal Robots delivers fine-tuned VLA models for UR cobots, targeting warehouse pick-and-place and assembly tasks.
Physical Intelligence (pi-zero's developer) raised $400M in November 2024 to commercialize generalist robot policies. The company's model trains on 500K demonstrations spanning household (laundry folding, dishwasher loading), warehouse (box sorting, pallet stacking), and assembly (cable routing, screw driving) tasks. Pi-zero achieves 85-91% success across these domains, versus 60-75% for task-specific models[6].
Figure AI's partnership with Brookfield Asset Management focuses on humanoid pretraining datasets. Brookfield operates 1,200 industrial facilities; Figure will deploy humanoid robots to collect manipulation data in warehouses, manufacturing plants, and logistics centers. The partnership targets 10M demonstration hours by 2026—100× larger than existing datasets—to train foundation models for bipedal manipulation[13].
Data Provenance and Licensing for Robot Foundation Models
Foundation model robotics datasets raise provenance and licensing questions absent from internet pretraining. Teleoperation data often contains proprietary scene configurations, product designs, or manufacturing processes. Data provenance tracking—recording collector identity, timestamp, sensor calibration, and usage rights—is critical for commercial deployment. Truelabel's physical AI marketplace enforces provenance metadata for all listed datasets, including collector agreements and derivative-work permissions.
Licensing terms vary widely. RoboNet uses a permissive Apache 2.0 license allowing commercial use. EPIC-KITCHENS-100 restricts commercial use via a CC BY-NC 4.0 license, limiting foundation model pretraining for production systems. DROID's license permits commercial fine-tuning but prohibits redistribution of raw video—a common pattern for privacy-sensitive datasets.
Government procurement adds compliance layers. FAR Subpart 27.4 requires US federal contractors to document data rights and usage restrictions. Victoria's data procurement guidelines mandate provenance audits for AI training datasets. Foundation model vendors must maintain per-sample lineage to satisfy these requirements, a capability Truelabel's marketplace provides via cryptographic provenance chains.
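One common way to build a tamper-evident lineage record is a hash chain, in which each entry's hash covers both its own metadata and the previous entry's hash. The sketch below illustrates that general technique only; it is not a description of Truelabel's implementation, and the record fields, the genesis value, and the function names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first link in the chain


def record_hash(record, prev_hash):
    """SHA-256 over the previous hash plus a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def build_chain(records):
    """Link records so that each entry commits to everything before it."""
    chain, prev = [], GENESIS
    for rec in records:
        digest = record_hash(rec, prev)
        chain.append({"record": rec, "prev_hash": prev, "hash": digest})
        prev = digest
    return chain


def verify_chain(chain):
    """Recompute every hash; any edited record or broken link fails verification."""
    prev = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev:
            return False
        if record_hash(entry["record"], prev) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


# Hypothetical per-sample provenance metadata.
records = [
    {"sample_id": "ep-0001", "collector": "operator-17", "license": "commercial"},
    {"sample_id": "ep-0002", "collector": "operator-04", "license": "commercial"},
]
chain = build_chain(records)
valid_before = verify_chain(chain)

chain[0]["record"]["license"] = "research-only"  # simulate tampering
valid_after = verify_chain(chain)
```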
External references and source context
- [1] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv)
  RT-2 uses PaLI-X 55B-parameter vision-language model, WebLI 10B image-text pretraining, 83% novel-task success, internet knowledge grounding
- [2] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv)
  DROID contains 76K teleoperated episodes, 564 scenes, 86 object categories, dual wrist-camera RGB-D, 80-95% teleoperation success rates
- [3] NVIDIA Cosmos World Foundation Models (NVIDIA Developer)
  NVIDIA Cosmos pretrains on 20M synthetic video frames, 100K real demonstrations, 73% real-world success versus 45% sim-only and 91% real-only
- [4] Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv)
  Open X-Embodiment aggregates 1M+ episodes from 22 robot platforms, 160K unique scenes, cross-embodiment transfer benchmarks, 60-70% novel-embodiment success
- [5] RT-1: Robotics Transformer for Real-World Control at Scale (arXiv)
  RT-1 architecture details, 130K training demonstrations, 97% seen-task success, 76% novel-object success, 256-bin action discretization
- [6] Scale AI: Expanding Our Data Engine for Physical AI (scale.com)
  Physical Intelligence pi-zero model details, 500K demonstrations, 85-91% success across household/warehouse/assembly domains
- [7] RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation (arXiv)
  RoboCat trains on 253K episodes from 6 robot arms, 65% WidowX success after 500 fine-tuning episodes versus 42% embodiment-only training
- [8] LeRobot documentation (Hugging Face)
  Hugging Face LeRobot documentation, DROID loaders, train/val splits
- [9] RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning (arXiv)
  RLDS schema for episodes with observations, actions, rewards, metadata; 60-80% storage compression
- [10] Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World (arXiv)
  Domain randomization varies lighting, textures, physics parameters, 70-85% sim-to-real success versus 30-50% fixed-parameter
- [11] Sim-to-Real Transfer for Robotic Manipulation with Multi-Task Domain Adaptation (arXiv)
  Multi-task domain adaptation for sim-to-real transfer, 78% success after 500 real demonstrations versus 91% with 10,000 real episodes
- [12] CALVIN paper (arXiv)
  CALVIN contains 24K episodes with 3-5 sequential subtasks, tests compositional generalization, models complete 2.1 subtasks average versus 4.8 human
- [13] Figure + Brookfield humanoid pretraining dataset partnership (figure.ai)
  Figure AI partnership with Brookfield Asset Management, 1,200 facilities, 10M demonstration hours target by 2026
FAQ
What is the difference between a vision-language-action model and a traditional robot learning policy?
Traditional robot learning policies map observations directly to actions using task-specific neural networks trained on 1,000-10,000 demonstrations of a single task. Vision-language-action (VLA) models add a language model backbone pretrained on internet text, enabling zero-shot task specification via natural language instructions and transfer learning across tasks. VLA models like RT-2 achieve 83% success on novel tasks versus 45% for task-specific policies, using the same 130K demonstration corpus. The language component provides compositional reasoning—understanding 'extinct animal' or 'heaviest object'—that pure vision-action models cannot learn from robot data alone.
How much robot demonstration data is required to fine-tune a foundation model for a new task?
Fine-tuning requirements depend on task complexity and similarity to pretraining data. Simple pick-and-place tasks require 100-500 demonstrations when fine-tuning OpenVLA or RT-2, achieving 75-85% success rates. Contact-rich tasks like cloth folding or cable routing require 1,000-5,000 demonstrations for 80%+ success. Novel embodiments (robot platforms not in the pretraining set) require 500-2,000 demonstrations to adapt kinematics and gripper geometry. In-context learning—providing examples in the prompt without weight updates—works for tasks very similar to pretraining data but typically achieves only 60-70% success versus 85-90% for fine-tuned models.
What data formats are used for robot foundation model training datasets?
Robot datasets use RLDS (Reinforcement Learning Datasets) for episode-structured data, MCAP for time-series sensor streams, and HDF5 for hierarchical storage. RLDS defines a schema with observations (RGB-D images, joint states), actions, rewards, and language annotations, serialized to TensorFlow TFRecord or HDF5 files. MCAP stores synchronized multi-sensor data (cameras, LiDAR, IMU) with microsecond timestamps, critical for contact-rich manipulation. HDF5 provides hierarchical groups for organizing episodes, and per-episode compression reduces storage by 60-80%. Hugging Face LeRobot converts 25+ dataset formats to a unified Parquet schema, enabling cross-dataset training without custom loaders.
Can foundation models trained on simulation data work on real robots without real-world demonstrations?
Pure sim-to-real transfer (zero real demonstrations) from a fixed-parameter simulator achieves 30-50% success on real robots due to reality gaps in physics, rendering, and sensor noise. Domain randomization, which varies simulation parameters during training, improves real-world success to 70-85% without real data. However, production systems typically combine synthetic pretraining with real-world fine-tuning: NVIDIA Cosmos trains on 20M synthetic frames then fine-tunes on 100K real demonstrations, achieving 73% real success versus 45% for sim-only and 91% for real-only training. The optimal ratio is task-dependent: vision-dominant tasks (object sorting) transfer better than contact-rich tasks (peg insertion, cloth folding).
What are the computational requirements for deploying a foundation model on a robot?
Deployment requires real-time inference at 10-20 Hz control frequencies. 7B-parameter models like OpenVLA run at 15 Hz on NVIDIA RTX 4090 GPUs (consumer edge hardware), consuming 180W. 30B+ parameter models require datacenter GPUs—A100 or H100—for 10 Hz inference, unsuitable for mobile robots. Quantization (INT8, INT4) reduces latency by 2-4× but degrades task success by 5-10 percentage points. Edge deployment typically uses 1B-7B models; cloud-connected robots can query 30B+ models with 100-200ms round-trip latency, acceptable for non-contact tasks but problematic for dynamic manipulation.
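As a rough illustration of what INT8 quantization does to a weight tensor, the sketch below applies symmetric per-tensor quantization with NumPy. It is a minimal example that assumes a non-zero weight tensor; production toolchains typically use per-channel scales and calibration data, and nothing here reflects a specific model's deployment pipeline.

```python
import numpy as np


def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    scale = np.abs(weights).max() / 127.0  # assumes weights are not all zero
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from INT8 codes."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)  # stand-in weight matrix

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

max_error = np.abs(w - w_hat).max()    # bounded by roughly scale / 2
compression = w.nbytes / q.nbytes      # float32 to int8 storage ratio
```

The rounding error introduced per weight is small, but it accumulates across layers, which is consistent with the accuracy cost of quantization noted above.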