Physical AI Glossary
Multimodal Foundation Model
A multimodal foundation model is a large-scale transformer pretrained on text, images, video, audio, and action data that learns cross-modal representations transferable to downstream tasks. For physical AI, these models bridge internet-scale knowledge and embodied robot behavior by processing sensor streams and language instructions in a unified architecture.
Quick facts
- Term: Multimodal Foundation Model
- Domain: Robotics and physical AI
- Last reviewed: 2025-06-15
Architecture: Unified Token Spaces Across Modalities
Modern multimodal foundation models convert every input—images, text, audio, robot actions—into token sequences processed by a shared transformer decoder. OpenVLA demonstrates this pattern: a Vision Transformer encodes camera frames into 256 visual tokens, a language tokenizer converts instructions into text tokens, and the language backbone autoregressively predicts discrete action tokens that are de-tokenized back into continuous robot actions[1]. The key innovation is modality-agnostic attention: the transformer learns cross-modal dependencies without hardcoded fusion rules.
Image encoders typically use ViT architectures that patchify frames into 16×16 or 14×14 grids, project patches through learned embeddings, and add positional encodings. RT-2 uses a 22B-parameter PaLI-X vision encoder pretrained on 10 billion image-text pairs, then fine-tunes on 130,000 robot demonstrations[2]. Text tokenizers apply byte-pair encoding or SentencePiece to map language into discrete IDs. Action spaces are either discretized into bins (RT-1's 256-bin vocabulary) or projected into continuous vectors decoded by diffusion or autoregressive heads.
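Action discretization is simple to implement. The following minimal sketch, assuming a symmetric [-1, 1] normalization range and uniform bins (the exact bounds and binning scheme vary by model), round-trips a 7-DOF action through a 256-bin vocabulary:

```python
import numpy as np

NUM_BINS = 256  # illustrative 256-bin action vocabulary, RT-1 style

def discretize(action: np.ndarray, low=-1.0, high=1.0) -> np.ndarray:
    """Map continuous actions in [low, high] to integer token IDs in [0, 255]."""
    action = np.clip(action, low, high)
    normalized = (action - low) / (high - low)  # scale to [0, 1]
    return np.minimum((normalized * NUM_BINS).astype(np.int64), NUM_BINS - 1)

def undiscretize(tokens: np.ndarray, low=-1.0, high=1.0) -> np.ndarray:
    """Map token IDs back to the continuous value at each bin center."""
    centers = (tokens.astype(np.float64) + 0.5) / NUM_BINS
    return centers * (high - low) + low

# A 7-DOF end-effector action round-trips with at most half a bin of error.
a = np.array([0.12, -0.83, 0.4, 0.0, 0.99, -0.2, 1.0])
recovered = undiscretize(discretize(a))
```

Decoding to bin centers bounds reconstruction error at half a bin width, which is why 256 bins suffice for most tabletop manipulation.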
The transformer backbone processes concatenated token sequences through self-attention layers that compute pairwise interactions between all tokens regardless of source modality. RT-1 stacks 8 transformer layers with 12 attention heads each, totaling 35 million parameters in the policy network[3]. Larger models like RT-2 leverage pretrained vision-language model weights (the 55B-parameter PaLI-X or the 12B-parameter PaLM-E variants) and freeze most layers during robot fine-tuning to preserve internet-scale knowledge while adapting output heads to action prediction.
Pretraining Paradigms: From Web Data to Embodied Trajectories
Multimodal foundation models follow a two-stage training recipe. Stage one pretrains on internet-scale datasets: CLIP trained on 400 million image-text pairs, Flamingo on 2.3 billion image-text-video samples, GPT-4V on undisclosed web corpora exceeding trillions of tokens. This stage learns general visual semantics, language understanding, and cross-modal alignment through contrastive objectives (CLIP) or autoregressive next-token prediction (GPT-4V).
Stage two adapts pretrained weights to embodied tasks via robot demonstration data. Open X-Embodiment aggregated 970,000 trajectories across 22 robot embodiments from 34 institutions, enabling RT-X models to generalize across morphologies[4]. DROID contributed 76,000 trajectories spanning 564 skills and 86 locations, emphasizing real-world diversity over lab-controlled scenarios[5]. Fine-tuning typically freezes vision encoders and language model layers while training only the action decoder and cross-attention modules to preserve pretrained representations.
Data mixing ratios critically impact downstream performance. RT-2 mixes web data and robot data at 95:5 during co-training, maintaining language capabilities while learning action grounding. LeRobot provides open tooling for this workflow: users can pretrain on ImageNet or LAION subsets, then fine-tune on task-specific teleoperation datasets like ALOHA or BridgeData V2[6]. The framework supports HDF5, MCAP, and Parquet formats for trajectory storage, enabling practitioners to plug custom datasets into standard training pipelines.
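A co-training mixer can be sketched in a few lines. This is an illustrative sampling scheme, not RT-2's actual pipeline; the dataset objects and ratio handling are assumptions:

```python
import itertools
import random

def mixed_batches(web_dataset, robot_dataset, web_ratio=0.95, num_steps=10_000):
    """Yield (batch, source) pairs drawn at a fixed web:robot probability."""
    web_iter = itertools.cycle(web_dataset)      # cycle so sources never exhaust
    robot_iter = itertools.cycle(robot_dataset)
    for _ in range(num_steps):
        if random.random() < web_ratio:
            yield next(web_iter), "web"          # image-text batch, no actions
        else:
            yield next(robot_iter), "robot"      # trajectory batch with actions
```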
Vision-Language-Action Models: Grounding Language in Robot Control
Vision-language-action models extend vision-language models by adding an action prediction head that outputs robot commands conditioned on visual observations and natural language instructions. The canonical example is RT-1, which takes RGB images and text prompts as input and predicts 7-DOF end-effector actions at 3 Hz[3]. The model architecture stacks a pretrained EfficientNet image encoder, a Universal Sentence Encoder for language, and a Transformer policy that attends over both modalities before decoding discrete action tokens.
RT-2 scales this approach by initializing from PaLI-X, a 55B-parameter vision-language model pretrained on web data. Fine-tuning on 130,000 robot demonstrations yields a policy that, across more than 6,000 evaluation trials, demonstrates emergent skills such as manipulating object categories never seen during robot training[2]. The key insight: internet-scale pretraining provides semantic priors ("a banana is yellow and curved") that transfer to physical manipulation without requiring millions of robot examples.
OpenVLA open-sources this paradigm with a 7B-parameter model trained on 970,000 Open X-Embodiment trajectories. The architecture uses a Llama 2 language backbone, a fused SigLIP and DINOv2 vision encoder, and a discrete-token action decoder. Inference runs at 10 Hz on a single A100 GPU, enabling real-time control for tabletop manipulation[1]. Practitioners can fine-tune OpenVLA on 50-500 task-specific demonstrations using LeRobot's training scripts, achieving 80%+ success rates on novel object configurations after 24-48 hours of training.
Cross-Modal Attention Mechanisms: How Models Fuse Modalities
Cross-modal attention is the computational primitive that enables multimodal foundation models to reason jointly over vision, language, and action. In a standard transformer, each token attends to all other tokens via learned query-key-value projections. For multimodal inputs, this means image patch tokens can attend to language tokens and vice versa, discovering correlations like "the word 'red' aligns with crimson pixel regions."
Flamingo introduced the perceiver resampler architecture: a small cross-attention module that compresses variable-length image sequences into a fixed number of visual tokens (64 or 256) before feeding them to the language model. This design reduces computational cost from O(n²) to O(n·k) where k is the resampler output size. RT-2 adopts this pattern, using a 6-layer perceiver to distill 1024 ViT patches into 256 visual tokens that condition the language model's action predictions[2].
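A perceiver-style resampler is compact to express in PyTorch. The sketch below is an illustrative reduction (layer count, dimensions, and residual structure are assumptions, not Flamingo's exact configuration); it shows how k learned latent queries cross-attend to n patch tokens so that cost scales as O(n·k):

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    def __init__(self, dim=1024, num_latents=64, num_heads=8, depth=2):
        super().__init__()
        # k learned latent queries, shared across all inputs.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(depth)
        )
        self.norm = nn.LayerNorm(dim)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, n_patches, dim) -> (batch, num_latents, dim)
        x = self.latents.expand(patch_tokens.size(0), -1, -1)
        for attn in self.layers:
            # Latents are queries; image patches supply keys and values,
            # so attention cost is O(n * k) rather than O(n^2).
            out, _ = attn(query=x, key=patch_tokens, value=patch_tokens)
            x = x + out
        return self.norm(x)

# 1024 ViT patch tokens compressed to 64 visual tokens:
tokens = PerceiverResampler()(torch.randn(2, 1024, 1024))  # -> (2, 64, 1024)
```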
Alternative fusion strategies include early fusion (concatenate all tokens before the first transformer layer), late fusion (process modalities separately then merge outputs), and hierarchical fusion (cross-attend at multiple depths). OpenVLA uses early fusion: visual tokens and language tokens are concatenated into a single sequence, then processed by a unified Llama 2 decoder with causal masking[1]. This maximizes parameter sharing but requires careful positional encoding to distinguish modality boundaries.
Attention visualization reveals learned cross-modal alignments. In RT-2, action prediction tokens attend strongly to object regions mentioned in the language instruction ("pick up the apple" → high attention on red fruit pixels). These attention maps provide interpretability: if a model fails, inspecting attention weights shows whether the failure stems from vision (wrong object localization) or language (misunderstood instruction).
Training Data Requirements: Scale, Diversity, and Embodiment Coverage
Multimodal foundation models for physical AI require three data categories: internet-scale vision-language pairs (hundreds of millions to billions), robot demonstration trajectories (tens of thousands to millions), and simulation rollouts (optional, millions of episodes). The first category provides semantic grounding, the second teaches embodied skills, the third enables safe exploration of edge cases.
Open X-Embodiment established the current scale benchmark: 970,000 real-robot trajectories spanning 22 embodiments and 160 tasks[4]. This dataset enabled RT-X models to achieve 50%+ zero-shot success rates on unseen robot morphologies—a 3× improvement over single-embodiment training. DROID contributed 76,000 trajectories emphasizing geographic and demographic diversity: 86 collection sites across 13 US states, 100+ object categories, 564 manipulation skills[5].
Data diversity matters more than raw volume for generalization. A model trained on 10,000 diverse trajectories (varied objects, backgrounds, lighting, embodiments) outperforms one trained on 100,000 homogeneous lab demonstrations. BridgeData V2 deliberately varied camera angles, table textures, and distractor objects across 60,000 trajectories to improve robustness[7]. Truelabel's marketplace aggregates 12,000+ collectors contributing teleoperation data across residential kitchens, warehouses, and retail environments—capturing real-world variability absent from lab datasets.
Embodiment coverage determines transfer learning effectiveness. Models pretrained on data from 20+ robot types (Franka Panda, UR5, Kinova Gen3, mobile manipulators) generalize better to novel hardware than models trained on a single platform. LeRobot supports 15 robot embodiments out-of-the-box, with adapters for custom kinematics and action spaces[6]. Practitioners should prioritize datasets matching their target deployment environment: warehouse automation benefits from teleoperation warehouse data, while kitchen robotics requires kitchen task datasets.
Evaluation Metrics: Beyond Task Success Rate
Standard robot learning metrics—task success rate, execution time, collision frequency—are necessary but insufficient for evaluating multimodal foundation models. These models must also demonstrate language grounding (does "pick up the red block" select the correct object?), generalization (does the model work on novel objects, backgrounds, embodiments?), and sample efficiency (how many demonstrations are needed for fine-tuning?).
Language grounding is measured via instruction following accuracy: the percentage of trials where the robot executes the semantically correct action given a natural language command. Across roughly 6,000 evaluation trials on unseen conditions, RT-2 reports 62% success, compared to 32% for RT-1 trained only on robot data[2]. Evaluation requires diverse instruction phrasings ("grasp the apple" vs "pick up the red fruit") to test robustness to linguistic variation.
Generalization benchmarks test zero-shot transfer to novel conditions. Open X-Embodiment defines cross-embodiment transfer: train on robots A, B, C and evaluate on robot D without fine-tuning. RT-X achieves 50% success on this metric[4]. Cross-task transfer trains on tasks 1-100 and evaluates on tasks 101-200. Cross-environment transfer trains in lab settings and evaluates in homes or warehouses. DROID emphasizes geographic diversity as a generalization axis: models trained on data from 10 US states outperform single-location models by 18% when deployed in new regions[5].
Sample efficiency quantifies fine-tuning cost. A well-pretrained model should reach 80% task success with 50-500 demonstrations, versus 5,000-50,000 for training from scratch. OpenVLA demonstrates 10× sample efficiency gains: 100 demonstrations suffice for novel tabletop tasks after pretraining on Open X-Embodiment[1]. Practitioners should budget 24-48 GPU-hours for fine-tuning 7B-parameter models on task-specific datasets of 100-1,000 trajectories.
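An evaluation harness for instruction following can be small. The sketch below assumes hypothetical `policy` and `env` interfaces (not any specific benchmark's API) and illustrates scoring across paraphrased instructions:

```python
def instruction_following_accuracy(policy, env, episodes):
    """Fraction of rollouts that satisfy the commanded instruction.

    episodes: iterable of (instruction, list_of_paraphrases) pairs.
    """
    successes, total = 0, 0
    for instruction, paraphrases in episodes:
        # Evaluate every phrasing to probe robustness to linguistic variation.
        for phrasing in (instruction, *paraphrases):
            obs, done, info = env.reset(instruction=phrasing), False, {}
            while not done:
                action = policy.predict(obs, phrasing)
                obs, done, info = env.step(action)
            successes += int(info.get("task_success", False))
            total += 1
    return successes / total
```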
Deployment Considerations: Latency, Hardware, and Safety
Deploying multimodal foundation models on physical robots requires balancing model capacity against real-time control constraints. Most manipulation tasks demand 3-10 Hz action prediction rates, implying 100-333 ms inference budgets. A 7B-parameter model running on an NVIDIA A100 GPU achieves 10 Hz (100 ms per forward pass), while a 55B-parameter model requires multi-GPU parallelism or quantization to meet real-time requirements.
OpenVLA provides deployment benchmarks: the 7B model runs at 10 Hz on a single A100, 5 Hz on an RTX 4090, and 2 Hz on a Jetson AGX Orin[1]. Quantization to INT8 or INT4 precision doubles throughput with <5% accuracy loss. For edge deployment, practitioners can distill large models into smaller student networks: a 1B-parameter student trained on a 55B-parameter teacher's outputs retains 85% of task performance while running at 30 Hz on embedded GPUs.
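For teams using the Hugging Face release, a quantized load might look like the sketch below. The `openvla/openvla-7b` checkpoint name follows the public release; whether a given checkpoint supports 8-bit loading, and the exact inference helpers it exposes, should be verified against the project's documentation:

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

# 8-bit quantization via bitsandbytes roughly halves memory at the cost of
# a small accuracy drop; INT4 variants trade further in the same direction.
quant = BitsAndBytesConfig(load_in_8bit=True)

processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    quantization_config=quant,
    torch_dtype=torch.bfloat16,   # dtype for the non-quantized modules
    trust_remote_code=True,
)
```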
Safety mechanisms are critical for physical deployment. Action clamping limits predicted velocities and accelerations to safe ranges (e.g., max 0.5 m/s end-effector speed). Collision detection monitors joint torques and halts execution if forces exceed thresholds. Uncertainty estimation via ensemble predictions or dropout at inference time flags low-confidence actions for human review. RT-1 uses a learned termination classifier that predicts when a task is complete or unrecoverable, preventing infinite loops[3].
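A minimal action-clamping wrapper can enforce these limits before commands reach the controller. The 0.5 m/s speed limit follows the example in the text; the 30 Nm torque threshold is a hypothetical value, not a certified limit:

```python
import numpy as np

MAX_EE_SPEED = 0.5        # m/s, example end-effector speed limit from the text
MAX_JOINT_TORQUE = 30.0   # Nm, hypothetical per-joint threshold

def safe_action(action_velocity: np.ndarray, joint_torques: np.ndarray) -> np.ndarray:
    """Clamp commanded velocity to the safe envelope; halt on excess torque."""
    speed = np.linalg.norm(action_velocity[:3])
    if speed > MAX_EE_SPEED:
        # Rescale translational velocity, preserving direction.
        action_velocity = action_velocity.copy()
        action_velocity[:3] *= MAX_EE_SPEED / speed
    if np.any(np.abs(joint_torques) > MAX_JOINT_TORQUE):
        raise RuntimeError("Torque limit exceeded: halting execution")
    return action_velocity
```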
Model versioning and rollback procedures are essential for production systems. Data provenance tracking links each model checkpoint to its training data sources, enabling audits when failures occur. If a deployed model exhibits unexpected behavior, operators can trace the issue to specific training trajectories, retrain with corrected data, and redeploy within hours. This workflow requires standardized metadata schemas—LeRobot embeds dataset cards with collection dates, robot IDs, and task labels in every HDF5 file[6].
Open-Source Implementations: LeRobot, OpenVLA, and RT-X
The open-source ecosystem for multimodal foundation models has matured rapidly since 2023. LeRobot, released by Hugging Face in 2024, provides end-to-end tooling: dataset loaders for 15+ robot embodiments and common trajectory formats (HDF5, MCAP, RLDS), pretrained model checkpoints (ACT, Diffusion Policy, VQ-BeT), training scripts with mixed-precision and distributed data parallelism, and evaluation harnesses for simulation and real-robot benchmarks[6]. The library has 8,000+ GitHub stars and 200+ contributors as of early 2025.
OpenVLA released a 7B-parameter vision-language-action model trained on 970,000 Open X-Embodiment trajectories. The model weights, training code, and inference server are Apache 2.0 licensed. Practitioners can fine-tune OpenVLA on custom datasets using a single A100 GPU and 50-500 demonstrations, achieving 80%+ success rates on novel tasks after 24 hours of training[1]. The project includes Docker containers for reproducible environments and ROS 2 integration for real-robot deployment.
RT-X is a family of models trained on Open X-Embodiment data, ranging from the 35M-parameter RT-1-X to the 55B-parameter RT-2-X. The lightweight RT-1-X runs in real time on edge GPUs, suitable for mobile manipulators. The 55B model achieves state-of-the-art performance on cross-embodiment transfer benchmarks but requires A100 clusters for training[4]. Google released RT-1 and RT-2 model cards with architecture details, training hyperparameters, and evaluation protocols, enabling independent replication.
These open implementations democratize access to physical AI capabilities. A robotics startup can download OpenVLA, fine-tune on 100 task-specific demonstrations collected via Truelabel's marketplace, and deploy a production-ready policy within one week. This workflow was impossible before 2023, when foundation models required proprietary datasets and million-dollar compute budgets.
Limitations and Active Research Directions
Current multimodal foundation models face four major limitations. First, they struggle with long-horizon tasks requiring 50+ sequential actions. Most models are trained on 10-30 second demonstrations and fail to generalize to multi-minute tasks like "clean the entire kitchen." Hierarchical policies that decompose long tasks into subgoals are an active research direction.
Second, sample efficiency remains poor for fine-tuning on novel embodiments. Adapting a pretrained model to a new robot with different kinematics, sensors, or action spaces requires 500-5,000 demonstrations—still prohibitive for many applications. Meta-learning approaches like Model-Agnostic Meta-Learning (MAML) and few-shot imitation learning aim to reduce this to 10-50 demonstrations.
Third, models lack robust failure recovery. When a manipulation attempt fails (object slips, gripper misaligns), most policies repeat the same action indefinitely rather than trying alternative strategies. Incorporating online reinforcement learning or replanning mechanisms could address this, but real-world RL remains sample-inefficient and safety-critical.
Fourth, data diversity gaps limit real-world deployment. Most open datasets come from lab environments with controlled lighting, uncluttered backgrounds, and cooperative objects. DROID made progress by collecting in 86 real-world sites, but the dataset still underrepresents challenging conditions: outdoor settings, dynamic obstacles, adversarial objects[5]. Truelabel's marketplace addresses this by incentivizing collectors to contribute data from diverse environments—warehouses, retail stores, residential kitchens—with paid requests (bounties) for high-value scenarios.
Active research directions include world models (learning predictive models of environment dynamics to enable planning), sim-to-real transfer (training in simulation then adapting to reality with minimal real data), multi-task learning (single policies that handle 1,000+ distinct skills), and embodied reasoning (models that can explain their decisions in natural language). NVIDIA's Cosmos world foundation models represent one approach: pretrain on video to learn physics priors, then fine-tune on robot data for action prediction[8].
Commercial Landscape: Vendors, Pricing, and Procurement
The commercial market for multimodal foundation model training data has three tiers. Tier one vendors like Scale AI offer end-to-end data engines: custom data collection, annotation, quality assurance, and dataset delivery. Pricing ranges from $50-500 per trajectory depending on task complexity, with minimum orders of 10,000 trajectories ($500K-5M contracts). Scale's physical AI division has raised $1B+ and serves customers including OpenAI, Meta, and Tesla[9].
Tier two vendors like Appen, Labelbox, and Encord provide annotation platforms and managed workforces but require customers to supply raw data. Pricing is $10-100 per trajectory for annotation-only services. These vendors excel at high-volume 2D image/video labeling but have limited robotics expertise—most lack native support for 3D point clouds, force-torque sensors, or proprioceptive data.
Tier three is the emerging peer-to-peer marketplace model. Truelabel operates a request-based platform where robotics teams post data collection requests ("500 trajectories of warehouse picking, $50 each") and 12,000+ collectors worldwide contribute data. Pricing is 50-80% lower than tier-one vendors ($25-100 per trajectory), with faster turnaround (days vs months) and greater diversity (residential, retail, industrial environments)[10]. The platform handles payment escrow, quality verification, and licensing—collectors retain copyright but grant perpetual commercial use rights to buyers.
Procurement considerations include data rights (exclusive vs non-exclusive, commercial use, derivative works), quality guarantees (success rate thresholds, re-collection policies), delivery formats (HDF5, MCAP, RLDS), and metadata completeness (robot specs, sensor calibration, task labels). Buyers should request sample datasets before committing to large orders and verify that vendors provide provenance metadata linking each trajectory to collection conditions, annotator IDs, and timestamps.
Integration with Existing ML Pipelines
Integrating multimodal foundation models into production ML pipelines requires adapting data loaders, training loops, and deployment infrastructure. Most robotics teams use PyTorch or JAX for model development, ROS or ROS 2 for robot control, and cloud platforms (AWS, GCP, Azure) for training. LeRobot bridges these ecosystems with native support for HDF5, MCAP, and RLDS dataset formats, PyTorch DataLoader integration, and ROS 2 action servers for real-robot deployment[6].
Data preprocessing pipelines must handle multimodal inputs: images require resizing, normalization, and augmentation (random crops, color jitter); language requires tokenization and padding; actions require normalization to [-1, 1] ranges and discretization for autoregressive models. LeRobot's dataset API provides these transforms out-of-the-box, with configurable augmentation policies. For custom datasets, practitioners implement a `__getitem__` method that returns a dictionary with keys `observation.image`, `observation.state`, `action`, and `language_instruction`.
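A custom dataset following this convention might look like the sketch below. The HDF5 layout (datasets named `images`, `states`, `actions` and an `instruction` attribute) is a hypothetical example, not LeRobot's internal schema:

```python
import h5py
import torch
from torch.utils.data import Dataset

class TeleopDataset(Dataset):
    """Per-frame samples from a single-episode HDF5 file (hypothetical layout)."""

    def __init__(self, path: str):
        self.file = h5py.File(path, "r")
        self.length = self.file["actions"].shape[0]

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        image = torch.from_numpy(self.file["images"][idx]).float() / 255.0
        state = torch.from_numpy(self.file["states"][idx]).float()
        # Actions are assumed stored pre-normalized to [-1, 1].
        action = torch.from_numpy(self.file["actions"][idx]).float()
        return {
            "observation.image": image.permute(2, 0, 1),  # HWC -> CHW
            "observation.state": state,
            "action": action,
            "language_instruction": self.file.attrs["instruction"],
        }
```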
Training loops follow standard supervised learning patterns: sample a batch of trajectories, forward pass through the model, compute loss (cross-entropy for discrete actions, MSE for continuous), backward pass, optimizer step. Mixed-precision training (FP16 or BF16) reduces memory usage by 50% and speeds up training by 2-3×. Distributed data parallelism across 8-64 GPUs is standard for models >1B parameters. LeRobot includes Accelerate integration for seamless multi-GPU training without manual device placement[6].
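Condensed into code, one mixed-precision training step looks roughly like this. The `model` interface, the `action_tokens` batch key, and the 256-bin action vocabulary are assumptions carried over from earlier sections:

```python
import torch
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # scales the loss to avoid FP16 gradient underflow
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for batch in dataloader:
    optimizer.zero_grad()
    with autocast(dtype=torch.float16):
        logits = model(batch["observation.image"].cuda(),
                       batch["language_instruction"])
        # Cross-entropy over discretized action tokens; use MSE instead
        # for continuous action heads.
        loss = torch.nn.functional.cross_entropy(
            logits.flatten(0, 1),                    # (batch*seq, 256)
            batch["action_tokens"].cuda().flatten(),
        )
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```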
Deployment infrastructure varies by application. Cloud-based inference (model runs on remote servers, robot streams observations and receives actions over network) introduces 50-200 ms latency, acceptable for slow manipulation but not high-speed assembly. Edge inference (model runs on robot's onboard GPU) achieves <10 ms latency but requires model compression. Hybrid approaches run vision encoders on edge GPUs and language models in the cloud, balancing latency and compute cost.
Case Study: Fine-Tuning OpenVLA for Warehouse Picking
A logistics company deployed OpenVLA for automated warehouse picking, fine-tuning the pretrained 7B model on 500 task-specific demonstrations. The workflow illustrates practical considerations for multimodal foundation model deployment.
Data collection: The company used Truelabel's marketplace to source 500 teleoperation trajectories of warehouse picking tasks: grasping cardboard boxes, plastic bins, and soft packages from shelves at varying heights. Collectors used a Franka Panda arm with a wrist-mounted RealSense D435 camera, recording RGB images at 30 Hz, joint positions at 100 Hz, and gripper states at 10 Hz. Data was delivered in MCAP format with embedded metadata (object categories, shelf heights, lighting conditions).
Preprocessing: The team converted MCAP files to LeRobot's HDF5 format using the library's conversion script. Images were resized to 224×224, normalized to ImageNet statistics, and augmented with random crops and color jitter. Actions (7-DOF end-effector poses) were normalized to [-1, 1] and discretized into 256 bins for autoregressive prediction. Language instructions ("pick the blue box from the top shelf") were tokenized using OpenVLA's SentencePiece vocabulary.
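The image and action transforms described above can be sketched as follows; the augmentation parameters are illustrative, not the company's exact settings:

```python
import torch
from torchvision import transforms

# Standard ImageNet normalization statistics, as referenced in the text.
image_transform = transforms.Compose([
    transforms.ToPILImage(),
    transforms.RandomResizedCrop(224, scale=(0.9, 1.0)),   # random crop to 224x224
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # color jitter
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def normalize_action(action: torch.Tensor, low: torch.Tensor, high: torch.Tensor):
    """Scale a 7-DOF pose into [-1, 1] using per-dimension dataset bounds."""
    return 2.0 * (action - low) / (high - low) - 1.0
```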
Training: Fine-tuning ran on a single A100 GPU for 48 hours, processing 500 trajectories (≈15,000 frames) for 100 epochs. The learning rate was 1e-5, batch size 16, with gradient accumulation over 4 steps. Only the action decoder and final transformer layers were unfrozen; vision and language encoders remained frozen to preserve pretrained representations. Validation loss plateaued after 60 epochs, indicating convergence.
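The partial-unfreezing setup might be expressed as below, with hypothetical module paths (`action_decoder`, `backbone.layers`) standing in for the real model attributes:

```python
import torch

# Freeze everything, then re-enable gradients only for the action decoder
# and the last transformer blocks, as described in the case study.
for param in model.parameters():
    param.requires_grad = False
for param in model.action_decoder.parameters():
    param.requires_grad = True
for block in model.backbone.layers[-2:]:
    for param in block.parameters():
        param.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)
# Effective batch size 64 = batch 16 x gradient accumulation over 4 steps.
```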
Evaluation: The fine-tuned model achieved 82% success rate on 100 held-out test tasks, compared to 34% for the pretrained model without fine-tuning and 91% for human teleoperation. Failure modes included misalignment on reflective packages (8% of trials) and collisions with shelf edges (10%). The company deployed the model in a pilot warehouse, processing 200 picks per day with human oversight for failure recovery.
Future Directions: Scaling Laws and Emergent Capabilities
Scaling laws for multimodal foundation models remain poorly understood compared to language models. In language, model performance follows predictable power laws: doubling parameters or data improves loss by a fixed percentage. For embodied AI, the relationship between model size, data volume, and task performance is nonlinear and task-dependent.
Empirical evidence suggests data diversity matters more than volume for physical AI. Open X-Embodiment showed that 970,000 diverse trajectories (22 embodiments, 160 tasks) outperform 5 million homogeneous trajectories from a single robot[4]. This contrasts with language models, where scaling data volume monotonically improves performance. The implication: robotics teams should prioritize collecting data across varied environments, objects, and embodiments rather than maximizing trajectory count in a single setting.
Emergent capabilities—skills that appear suddenly at certain scale thresholds—have been observed in vision-language models but not yet in vision-language-action models. GPT-4V exhibits chain-of-thought reasoning and few-shot learning that GPT-3 lacks. Will 100B-parameter robot policies exhibit analogous emergent skills like tool use, multi-step planning, or failure recovery? Current evidence is mixed: RT-2's 55B model shows better generalization than RT-1's 35M model, but the improvement is gradual rather than discontinuous[2].
Future research will likely focus on world models: learning predictive models of environment dynamics to enable planning and counterfactual reasoning. NVIDIA's Cosmos pretrains on video to learn physics priors, then fine-tunes on robot data[8]. This approach could reduce the need for massive robot datasets by transferring knowledge from passive video observation. Another direction is multi-agent learning: training policies that coordinate multiple robots, requiring datasets of synchronized multi-robot trajectories—a category Truelabel's marketplace is beginning to support.
Regulatory and Ethical Considerations
Deploying multimodal foundation models in physical environments raises regulatory and ethical questions absent from purely digital AI. Safety certification is required for robots operating near humans: ISO 10218 for industrial robots, ISO 13482 for service robots. Models must demonstrate bounded failure modes—a misclassification should not cause injury. This requires extensive testing on adversarial inputs and edge cases, which most open datasets lack.
Data privacy is critical when training on data collected in homes, hospitals, or retail stores. GDPR Article 7 requires explicit consent for data collection, and subjects must be able to request deletion[11]. Truelabel's platform implements consent workflows: collectors confirm they have rights to share data, and buyers receive attestations of compliance. For datasets containing human faces or voices, anonymization (face blurring, voice distortion) is standard practice.
Bias and fairness concerns arise when models are trained on geographically or demographically skewed data. If a warehouse picking model is trained only on data from US facilities, it may fail in warehouses with different shelf designs, lighting, or product packaging common in other regions. DROID addressed this by collecting data across 13 US states and multiple demographic groups, but global coverage remains limited[5]. Buyers should audit dataset demographics and request additional collection to fill gaps.
Intellectual property questions include: Who owns a model trained on crowdsourced data? Can a model trained on CC-BY datasets be commercialized? Truelabel's licensing model grants buyers perpetual commercial use rights while collectors retain copyright, similar to stock photo marketplaces. For datasets with restrictive licenses (CC-BY-NC, research-only), buyers must negotiate separate commercial terms or avoid those sources entirely.
Comparison with Single-Modality Models
Multimodal foundation models outperform single-modality models on tasks requiring cross-modal reasoning but incur higher training costs and inference latency. A vision-only model (e.g., a ResNet trained on ImageNet) can classify objects but cannot follow language instructions. A language-only model (e.g., GPT-3) can generate text but cannot perceive visual scenes. Multimodal models unify these capabilities at the cost of 10-100× more parameters and training data.
For tasks with fixed, well-defined objectives ("detect apples in images"), single-modality models are more efficient. A YOLOv8 object detector trained on 10,000 labeled images achieves 95% mAP and runs at 100 FPS on edge GPUs. A multimodal model like RT-2 achieves similar detection accuracy but runs at 10 FPS and requires 130,000 robot demonstrations for training[2]. The multimodal model's advantage is flexibility: it can handle novel instructions ("pick up the reddest apple") without retraining, while the single-modality detector cannot.
For robotics, the flexibility advantage is decisive. Real-world deployments encounter unbounded variation in objects, environments, and task specifications. A warehouse may stock 10,000 SKUs that change weekly; a multimodal model can adapt to new products via language descriptions without retraining. A single-modality vision model would require retraining on labeled images of each new SKU—a prohibitive data collection burden.
Practitioners should use single-modality models for high-throughput, low-latency tasks with fixed objectives (quality inspection, barcode scanning) and multimodal models for flexible, instruction-following tasks (general-purpose manipulation, household assistance). Hybrid architectures are emerging: a fast vision model for object detection feeds into a multimodal model for action selection, balancing speed and flexibility.
Tooling Ecosystem: Annotation, Training, and Deployment
The tooling ecosystem for multimodal foundation models spans data annotation, model training, and robot deployment. Annotation tools for robot data must handle temporal sequences, 3D geometry, and multimodal alignment. Labelbox supports video annotation with frame-by-frame bounding boxes and keypoint tracking but lacks native support for 3D point clouds or force-torque data[12]. Segments.ai specializes in multi-sensor data labeling, supporting LiDAR point clouds, radar, and camera fusion[13].
Training frameworks include LeRobot (PyTorch-based, 15+ robot embodiments, HDF5/MCAP/RLDS loaders), TF-Agents (TensorFlow-based, RLDS-native, maintained by Google), and robosuite (MuJoCo simulation, domain randomization). LeRobot has the most active open-source community, with 200+ contributors and monthly releases[6]. The library includes pretrained checkpoints for ACT, Diffusion Policy, and VQ-BeT, enabling practitioners to start from strong baselines.
Deployment tools bridge trained models and robot hardware. ROS 2 action servers wrap model inference in a standard interface: the robot sends observations (images, joint states) and receives actions (target poses, gripper commands). LeRobot provides ROS 2 integration scripts that launch an action server, load a trained model, and handle message serialization[6]. For non-ROS robots, practitioners implement custom inference loops using the model's Python API.
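For the non-ROS case, a fixed-rate inference loop can be sketched as follows, with `camera`, `robot`, and `policy` as hypothetical interfaces:

```python
import time

CONTROL_HZ = 10
PERIOD = 1.0 / CONTROL_HZ

def control_loop(policy, camera, robot, instruction: str):
    """Poll sensors, run the policy, and send commands at a fixed rate."""
    while not robot.task_done():
        start = time.monotonic()
        obs = {"image": camera.read(), "state": robot.joint_positions()}
        action = policy.predict(obs, instruction)
        robot.send_action(action)
        # Sleep off the remainder of the control period to hold 10 Hz.
        time.sleep(max(0.0, PERIOD - (time.monotonic() - start)))
```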
Monitoring and debugging tools are critical for production deployments. Foxglove Studio visualizes robot data streams in real time: camera feeds, joint trajectories, action predictions, attention maps. When a model fails, operators can replay the episode, inspect attention weights to diagnose whether the failure was perceptual (wrong object detected) or policy-level (correct detection, wrong action). Provenance tracking links each failure to the training data that influenced the model's behavior, enabling targeted data collection to fix specific failure modes.
Cost-Benefit Analysis: When to Use Foundation Models
Multimodal foundation models require significant upfront investment—$50K-500K for data collection, $10K-100K for compute, weeks to months of engineering effort—but offer long-term cost savings through transfer learning and reduced per-task data requirements. The break-even point depends on the number of tasks and deployment scale.
Single-task scenario: A company needs a robot to perform one task (e.g., bin picking) in one environment (e.g., a single warehouse). Training a task-specific model from scratch requires 5,000-50,000 demonstrations at $10-50 each ($50K-2.5M data cost) plus $5K-20K compute. Fine-tuning a pretrained foundation model requires 500-5,000 demonstrations ($5K-250K data cost) plus $10K-50K compute (higher per-task compute due to larger models). For a single task, the foundation model approach is cost-competitive only if data collection is the dominant cost.
Multi-task scenario: The company needs robots to perform 10 tasks across 5 warehouses (50 task-environment combinations). Training task-specific models requires 50 × 5,000 = 250,000 demonstrations ($2.5M-12.5M data cost). Fine-tuning a foundation model requires 50 × 500 = 25,000 demonstrations ($250K-1.25M data cost) plus one-time pretraining cost ($100K-500K). Total foundation model cost: $350K-1.75M, a 5-10× savings. The break-even point is ≈5-10 tasks.
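The break-even arithmetic can be made concrete with midpoint assumptions from the ranges above:

```python
# Worked version of the multi-task comparison, using midpoint costs
# from the text ($30/demo from the $10-50 range; $300K pretraining
# from the $100K-500K range).
TASKS, SITES = 10, 5
scratch_demos_per_task = 5_000
finetune_demos_per_task = 500
cost_per_demo = 30          # USD
pretraining_cost = 300_000  # USD, one-time

combos = TASKS * SITES      # 50 task-environment combinations
scratch = combos * scratch_demos_per_task * cost_per_demo        # $7.5M
foundation = (combos * finetune_demos_per_task * cost_per_demo
              + pretraining_cost)                                # $1.05M
print(f"scratch ${scratch:,}, foundation ${foundation:,}, "
      f"savings {scratch / foundation:.1f}x")                    # ~7.1x
```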
Deployment scale amplifies savings. A model deployed on 100 robots amortizes data and compute costs across all units. If each robot generates $50K/year in labor savings, a $500K model development cost pays back in 10 robot-years (about five weeks for a 100-robot fleet). Truelabel's marketplace model reduces data costs by 50-80% compared to traditional vendors, lowering the break-even point to 2-5 tasks.
Practitioners should use foundation models when: (1) deploying across multiple tasks or environments, (2) expecting frequent task changes requiring rapid adaptation, (3) operating at scale (10+ robots), or (4) requiring generalization to novel objects or instructions. For single-task, fixed-environment, small-scale deployments, task-specific models remain more cost-effective.
External references and source context
- [1] OpenVLA: An Open-Source Vision-Language-Action Model (arXiv). Architecture details: Llama 2 backbone, fused SigLIP and DINOv2 vision encoder, discrete-token action decoder, 10 Hz inference on A100.
- [2] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv). 55B-parameter PaLI-X pretrained on 10B image-text pairs, fine-tuned on 130,000 robot demonstrations, evaluated across roughly 6,000 trials.
- [3] RT-1: Robotics Transformer for Real-World Control at Scale (arXiv). Architecture: 8 transformer layers, 12 attention heads, 35M parameters, 3 Hz action prediction, learned termination classifier.
- [4] Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). 970,000 trajectories, 22 embodiments, 34 institutions, 160 tasks; RT-X achieves 50% cross-embodiment transfer.
- [5] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv). 76,000 trajectories, 564 skills, 86 locations, 13 US states, 100+ object categories; emphasizes real-world diversity.
- [6] LeRobot: State-of-the-Art Machine Learning for Real-World Robotics in PyTorch (arXiv). Open-source framework supporting 15 robot embodiments, HDF5/MCAP/RLDS formats, PyTorch integration, ROS 2 deployment.
- [7] BridgeData V2: A Dataset for Robot Learning at Scale (arXiv). 60,000 trajectories with varied camera angles, table textures, and distractor objects for robustness.
- [8] NVIDIA Cosmos World Foundation Models (NVIDIA Developer). World foundation models for video-based physics pretraining.
- [9] Scale AI physical AI (scale.com). Data engine for custom collection and annotation.
- [10] Truelabel physical AI data marketplace (truelabel.ai). Request-based platform with 12,000+ collectors; 50-80% cost reduction versus traditional vendors.
- [11] GDPR Article 7, Conditions for consent (GDPR-Info.eu). Requirements for explicit consent in data collection.
- [12] Labelbox documentation (docs.labelbox.com). Video annotation and tracking features.
- [13] Segments.ai multi-sensor data labeling (segments.ai). LiDAR, radar, and camera fusion support.
FAQ
What is the difference between a vision-language model and a vision-language-action model?
A vision-language model (VLM) like CLIP or GPT-4V processes images and text to perform tasks like image captioning, visual question answering, or zero-shot classification. A vision-language-action model (VLA) extends VLMs by adding an action prediction head that outputs robot commands. VLAs take images and language instructions as input and produce continuous or discrete actions (joint velocities, end-effector poses, gripper states) as output. RT-1 and RT-2 are canonical VLA examples: they use pretrained VLM encoders but add transformer-based policy decoders trained on robot demonstration data. The key distinction is embodiment: VLAs are trained on trajectories that pair observations with actions, enabling closed-loop control, while VLMs are trained on static image-text pairs without action labels.
How much robot data is needed to fine-tune a pretrained multimodal foundation model?
Fine-tuning a pretrained multimodal foundation model for a novel task typically requires 50-5,000 demonstrations depending on task complexity and model size. Simple tabletop manipulation tasks (pick-and-place, pushing) can reach 80% success rates with 50-500 demonstrations when fine-tuning models like OpenVLA or RT-2. Complex tasks requiring precise force control or multi-step reasoning (assembly, deformable object manipulation) may require 1,000-5,000 demonstrations. This is 10-100× fewer demonstrations than training from scratch, which typically requires 50,000-500,000 trajectories. The sample efficiency gain comes from pretrained vision and language representations: the model already understands object semantics and spatial relationships from internet-scale data, so fine-tuning only needs to adapt the action decoder to the specific task and embodiment.
Can multimodal foundation models generalize to robot hardware they were not trained on?
Yes, with caveats. Models pretrained on diverse embodiments (10+ robot types) can achieve 30-50% zero-shot success rates on novel hardware, as demonstrated by RT-X on the Open X-Embodiment dataset. However, performance improves significantly with fine-tuning: 100-500 demonstrations on the target robot typically boost success rates to 70-90%. The degree of generalization depends on morphological similarity—a model trained on 6-DOF arms generalizes better to other 6-DOF arms than to mobile manipulators or humanoids. Action space compatibility is critical: models trained with end-effector pose actions can transfer to robots with different kinematics via inverse kinematics, but models trained with joint-space actions require embodiment-specific fine-tuning. Practitioners should budget for 100-1,000 demonstrations when deploying pretrained models on novel hardware.
What data formats are standard for multimodal robot datasets?
The three dominant formats are HDF5, MCAP, and RLDS. HDF5 is a hierarchical container format that stores images, actions, and metadata in a single file with efficient random access; it is widely used in academia and supported by LeRobot, RoboMimic, and CALVIN. MCAP is a modern format designed for multi-sensor robotics data, supporting arbitrary message schemas and efficient streaming; it is the default for ROS 2 and Foxglove tooling. RLDS (Reinforcement Learning Datasets) is a TensorFlow-native format that represents trajectories as sequences of (observation, action, reward) tuples; it is used by Google Research and supports efficient data pipelines for large-scale training. Most tooling can convert between formats—LeRobot provides converters for HDF5 ↔ MCAP ↔ RLDS. Practitioners should choose based on their ML framework (PyTorch → HDF5, TensorFlow → RLDS) and robot middleware (ROS 2 → MCAP).
How do I evaluate whether a pretrained model will work for my use case before committing to fine-tuning?
Run zero-shot evaluation on a small test set (10-50 episodes) in your target environment. Deploy the pretrained model without fine-tuning and measure task success rate, execution time, and failure modes. If zero-shot success is >20%, fine-tuning will likely reach 70-90% with 100-1,000 demonstrations. If zero-shot success is <5%, the pretrained model lacks relevant priors and fine-tuning may require 5,000+ demonstrations—consider training from scratch or sourcing a model pretrained on more similar data. Inspect failure modes: if the model correctly identifies objects but selects wrong actions, the vision encoder is adequate and fine-tuning should focus on the action decoder. If the model misidentifies objects, the vision encoder may need fine-tuning or replacement. Request sample datasets from vendors like Truelabel to test data quality and format compatibility before purchasing large volumes.
What are the main cost drivers for training a multimodal foundation model from scratch versus fine-tuning?
Training from scratch costs $100K-10M depending on model size and data volume: $50K-5M for data collection (100K-10M trajectories at $0.50-50 each), $10K-1M for compute (1,000-100,000 GPU-hours at $1-10/hour), and $10K-500K for engineering (3-12 months of ML engineer time). Fine-tuning a pretrained model costs $10K-500K: $5K-250K for task-specific data (500-5,000 demonstrations), $1K-50K for compute (100-5,000 GPU-hours), and $5K-200K for engineering (1-6 months). The 10-100× cost difference comes from reduced data requirements (pretrained models need 10-100× fewer demonstrations) and faster iteration (fine-tuning converges in days vs weeks). For organizations deploying across multiple tasks, the one-time pretraining cost amortizes: a $1M pretrained model used for 20 tasks costs $50K per task, competitive with task-specific training.