Physical AI Glossary

Vision-Language-Action Model

A Vision-Language-Action (VLA) model is a neural architecture that processes camera images and natural-language instructions to produce robot control outputs. VLA models typically start from components pretrained on internet-scale image-text pairs (e.g., CLIP or SigLIP encoders), then fine-tune on robot demonstration datasets to ground semantic concepts in continuous action spaces. Google's RT-2 fine-tunes a web-pretrained vision-language model on robot demonstrations, achieving 62% success on novel tasks versus 32% for behavior-cloning baselines.

Updated 2026-07-14 · 16 min read

By Truelabel Team

Reviewed by Truelabel Team · Jul 14, 2026


Quick facts

Topic: Vision-Language-Action Model
Audience: Procurement leads, ML ops, robotics engineers
Deliverable: Buyer-facing reference + procurement guidance

What Is a Vision-Language-Action Model?

A Vision-Language-Action (VLA) model is a neural network that accepts visual observations (RGB images, depth maps, point clouds) and natural-language instructions as inputs and emits robot control actions (joint velocities, end-effector poses, gripper commands) as outputs. VLA models represent the convergence of three previously separate capabilities: computer vision for scene understanding, natural language processing for instruction parsing, and motor control for physical manipulation. The core architectural insight is that large-scale pretraining on billions of image-text pairs provides a rich foundation of world knowledge that transfers to physical robot control when fine-tuned on robot demonstration data.

Architecturally, VLA models consist of three components: a vision encoder (typically a Vision Transformer such as SigLIP or DINOv2), a language model backbone (often a pretrained LLM like PaLM or LLaMA), and an action head that maps the language model's hidden states to robot action outputs. The action head must output at the control frequency of the robot, typically 10–50 Hz for manipulation tasks. RT-2's largest variant builds on the 55B-parameter PaLI-X vision-language model (a smaller variant uses the 12B-parameter PaLM-E) and is fine-tuned on 130,000 robot demonstrations^[1]. OpenVLA adopts a 7B-parameter Llama-2 backbone trained on the Open X-Embodiment dataset, achieving 52% success on unseen tasks in the LIBERO benchmark^[2].

VLA models differ from traditional behavior-cloning policies in two critical ways. First, they leverage semantic priors from internet-scale pretraining: a VLA model pretrained on billions of image-text pairs already understands concepts like 'red cup,' 'kitchen counter,' and 'pick up,' and the robot learning stage then grounds these concepts in physical action spaces. Second, VLA models generalize across embodiments: RT-X demonstrates that a single VLA policy trained on data from 22 distinct robot types outperforms specialist policies trained on individual robots^[3]. This kind of cross-embodiment transfer is much harder for vision-only behavior-cloning models, which lack the semantic grounding that language provides.

Historical Evolution: From Vision-Language Models to Physical Control

The VLA architecture emerged from three parallel research threads. The first thread was vision-language models (VLMs) for image understanding. OpenAI's CLIP (2021) demonstrated that contrastive pretraining on 400 million image-text pairs produces visual representations that transfer to downstream tasks without task-specific fine-tuning. DeepMind's Flamingo (2022) extended this to few-shot visual question answering, and Google's PaLI (2022) scaled to 17 billion parameters. These models established that large-scale vision-language pretraining captures rich semantic knowledge about objects, scenes, and actions.

The second thread was large-scale robot learning datasets. RoboNet (2019) aggregated 15 million video frames from 7 robot platforms, demonstrating that multi-robot datasets improve generalization^[4]. BridgeData V2 (2023) contributed 60,000 trajectories collected across 24 environments, many of them kitchen scenes. Open X-Embodiment (2023) pooled 60 existing robot datasets covering 22 robot embodiments into a single 1-million-trajectory corpus with standardized action spaces and observation formats.

The third thread was language-conditioned robot policies. Google's SayCan (2022) used an LLM to decompose high-level instructions into low-level skills, but required hand-engineered skill libraries. Google's RT-1 Robotics Transformer (2022) was the first end-to-end VLA model, training a 35M-parameter transformer on 130,000 demonstrations to map language instructions and images directly to actions. RT-1 achieved 97% success on seen tasks and 76% on novel object configurations.

RT-2 (2023) scaled RT-1 by initializing from web-pretrained vision-language models of up to 55B parameters (PaLI-X, alongside a 12B-parameter PaLM-E variant), then fine-tuning on robot demonstrations. This approach improved novel-task success from 32% (RT-1) to 62%^[5]. OpenVLA (2024) open-sourced a 7B-parameter VLA model trained on Open X-Embodiment data, achieving state-of-the-art performance on cross-embodiment benchmarks. NVIDIA GR00T N1 (2025) introduced a 1.3B-parameter VLA model trained on 386,000 robot trajectories, demonstrating 68% success on long-horizon manipulation tasks^[6].

Training Data Requirements for VLA Models

VLA models require two distinct data types: pretraining data (vision-language pairs from the web) and fine-tuning data (robot demonstration trajectories). Pretraining data provides semantic grounding: a VLA model pretrained on billions of image-text pairs learns that 'red cup' refers to a cylindrical object with specific visual features, that 'kitchen counter' is a horizontal surface, and that 'pick up' implies a grasping motion. Fine-tuning data grounds these semantic concepts in physical action spaces: the model learns that 'pick up the red cup' maps to a specific sequence of joint velocities and gripper commands.

Pretraining datasets are typically sourced from web-scale vision-language corpora. RT-2's largest backbone, PaLI-X, was pretrained on a web corpus of roughly 10 billion image-text pairs. OpenVLA's SigLIP encoder was likewise pretrained on web-scale image-text pairs. These pretraining datasets are not robot-specific; they contain images of everyday objects, scenes, and activities paired with natural-language captions. The pretraining stage is computationally expensive (thousands of GPU-hours) but is performed once and amortized across all downstream robot tasks.

Fine-tuning datasets are robot demonstration trajectories collected via teleoperation, kinesthetic teaching, or scripted policies. Each trajectory consists of a sequence of observations (RGB images, depth maps, proprioceptive state), actions (joint velocities, end-effector poses, gripper commands), and a natural-language instruction. Open X-Embodiment aggregates data from 22 robot embodiments into a unified corpus of 1 million trajectories^[7]. DROID contributes 76,000 teleoperation trajectories across 564 scenes and 86 tasks, collected by 50 operators over 12 months^[8]. BridgeData V2 provides 60,000 kitchen manipulation demonstrations with pixel-aligned language annotations.

Data quality is critical: labeling errors (incorrect action annotations, misaligned timestamps, missing modalities) degrade downstream policies. Truelabel's physical AI marketplace enforces provenance tracking for every trajectory: collector identity, hardware configuration, calibration parameters, and annotation protocol are recorded in structured metadata. This provenance layer is essential for debugging distribution shift and ensuring reproducibility.
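The quality problems listed above (missing modalities, misaligned timestamps, incomplete provenance) can be caught with a simple structural check before training. The following is a minimal sketch, not Truelabel's actual schema: the `Step` and `Trajectory` classes, the required modality names, and the provenance field names are all hypothetical, chosen only to mirror the fields named in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    timestamp: float   # seconds since episode start
    observation: dict  # modality name -> payload, e.g. "rgb", "proprio"
    action: list       # action vector recorded for this step


@dataclass
class Trajectory:
    instruction: str
    steps: list
    provenance: dict = field(default_factory=dict)


# Assumed names for this sketch; a real dataset defines its own.
REQUIRED_MODALITIES = ("rgb", "proprio")
REQUIRED_PROVENANCE = (
    "collector_id", "hardware_config", "calibration", "annotation_protocol",
)


def validate(traj, action_dim, max_gap_s):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for key in REQUIRED_PROVENANCE:
        if key not in traj.provenance:
            problems.append(f"missing provenance field: {key}")
    prev_t = None
    for i, step in enumerate(traj.steps):
        for modality in REQUIRED_MODALITIES:
            if modality not in step.observation:
                problems.append(f"step {i}: missing modality {modality}")
        if len(step.action) != action_dim:
            problems.append(
                f"step {i}: action has {len(step.action)} dims, expected {action_dim}"
            )
        if prev_t is not None:
            gap = step.timestamp - prev_t
            if gap <= 0:
                problems.append(f"step {i}: non-increasing timestamp")
            elif gap > max_gap_s:
                problems.append(f"step {i}: gap of {gap:.3f}s exceeds {max_gap_s}s")
        prev_t = step.timestamp
    return problems
```

A check like this only catches structural errors. It cannot tell whether a recorded action is the one the robot actually executed, which still requires calibration data and spot checks against video.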

Architecture Components: Vision Encoders, Language Backbones, and Action Heads

VLA models consist of three architectural components. The vision encoder processes camera images into visual tokens. Most VLA models use Vision Transformers (ViTs) pretrained on large-scale image or image-text datasets. RT-2 reuses the ViT encoder that is already part of its PaLI-X or PaLM-E backbone. OpenVLA fuses features from two encoders: SigLIP, pretrained contrastively on image-text pairs, and DINOv2, a self-supervised ViT pretrained on 142 million images. The vision encoder outputs a sequence of visual tokens (typically 256–1024 tokens per image) that are concatenated with language tokens and fed to the language model backbone.
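The 256–1024 token range quoted above follows from simple patch arithmetic. As a rough sketch, assuming a plain ViT that emits one token per non-overlapping square patch (no class token, pooling, or token merging, which real encoders may add), the count can be worked out as follows. The function names are illustrative.

```python
def num_visual_tokens(image_size: int, patch_size: int) -> int:
    """Tokens for a square image split into non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def sequence_length(image_size: int, patch_size: int,
                    num_images: int, num_text_tokens: int) -> int:
    """Backbone input length: visual tokens for every camera plus instruction tokens."""
    return num_images * num_visual_tokens(image_size, patch_size) + num_text_tokens


# A 224x224 image with 14x14 patches gives 16 x 16 = 256 tokens;
# doubling the resolution to 448x448 quadruples that to 1024.
tokens_224 = num_visual_tokens(224, 14)
tokens_448 = num_visual_tokens(448, 14)
```

Because the count grows with the square of the resolution and linearly with the number of cameras, image size and camera count are the main levers on backbone sequence length, and therefore on inference latency.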

The language model backbone processes both visual tokens and language instruction tokens. Most VLA models use pretrained LLMs: RT-2 uses PaLI-X (55B parameters) or PaLM-E (12B parameters), OpenVLA uses Llama-2 (7B parameters), and GR00T N1 uses a custom 1.3B-parameter transformer. The language model backbone is typically frozen during fine-tuning (to preserve semantic knowledge) or fine-tuned with a low learning rate. The language model outputs a sequence of hidden states that are passed to the action head.

The action head maps language model hidden states to robot action outputs. It must keep up with the control frequency of the robot, typically 10–50 Hz for manipulation tasks. RT-1 discretizes each action dimension into 256 bins and predicts the bin as a classification output rather than regressing a continuous value. RT-2 keeps the 256-bin discretization and treats action prediction as a token generation problem, allowing the model to leverage the language model's autoregressive decoding. OpenVLA likewise discretizes each action dimension into 256 bins and predicts action tokens autoregressively, reusing the language model's decoding stack rather than a diffusion process.
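The 256-bin discretization described above can be sketched in a few lines. This is a minimal illustration using uniform bins over an assumed `[low, high]` range per dimension. It is not the exact tokenizer of RT-2 or OpenVLA, which choose their bounds from dataset statistics and map bins onto vocabulary tokens.

```python
NUM_BINS = 256


def discretize(value: float, low: float, high: float, num_bins: int = NUM_BINS) -> int:
    """Map a continuous value to a bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # The top edge (fraction == 1.0) would land in bin num_bins, so cap it.
    return min(int(fraction * num_bins), num_bins - 1)


def undiscretize(bin_index: int, low: float, high: float, num_bins: int = NUM_BINS) -> float:
    """Map a bin index back to the center of its bin."""
    width = (high - low) / num_bins
    return low + (bin_index + 0.5) * width


def tokenize_action(action, lows, highs):
    """Turn an action vector into one bin index per dimension."""
    return [discretize(a, lo, hi) for a, lo, hi in zip(action, lows, highs)]


def detokenize_action(tokens, lows, highs):
    """Recover an approximate continuous action from bin indices."""
    return [undiscretize(t, lo, hi) for t, lo, hi in zip(tokens, lows, highs)]
```

The round trip is lossy: each dimension is recovered only to within half a bin width, which is (high - low) / 512 for 256 bins. That quantization error is the price paid for reusing a language model's token-prediction machinery.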

Action representations vary by task. For manipulation tasks, actions are typically 7-dimensional: a 6-DOF end-effector pose or pose delta (3D position plus 3D orientation) and a gripper command. When orientation is stored as a 4D quaternion, the pose alone takes seven numbers and the gripper command adds an eighth. For mobile manipulation, actions include base velocities (2D linear, 1D angular) in addition to arm commands. RT-X standardizes action spaces across 22 distinct embodiments by normalizing joint ranges to [-1, 1] and resampling trajectories to a common 10 Hz control frequency.
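The two standardization steps mentioned above, range normalization to [-1, 1] and resampling to a common control frequency, can be sketched as follows. This is an illustrative version under simple assumptions (known per-dimension limits, strictly increasing timestamps, linear interpolation of a scalar signal), not the preprocessing code of any particular dataset.

```python
def normalize(value: float, low: float, high: float) -> float:
    """Affinely map [low, high] onto [-1, 1]."""
    return 2.0 * (value - low) / (high - low) - 1.0


def denormalize(value: float, low: float, high: float) -> float:
    """Inverse of normalize: map [-1, 1] back onto [low, high]."""
    return low + (value + 1.0) * (high - low) / 2.0


def resample(timestamps, values, target_hz):
    """Linearly interpolate a scalar signal onto a uniform grid at target_hz.

    Assumes at least two samples and strictly increasing timestamps.
    """
    t0, t_end = timestamps[0], timestamps[-1]
    n_out = int((t_end - t0) * target_hz + 1e-9) + 1
    out = []
    j = 0  # index of the left end of the current interpolation segment
    for k in range(n_out):
        t = t0 + k / target_hz
        while j < len(timestamps) - 2 and timestamps[j + 1] < t:
            j += 1
        ta, tb = timestamps[j], timestamps[j + 1]
        w = (t - ta) / (tb - ta)
        out.append(values[j] + w * (values[j + 1] - values[j]))
    return out
```

Linear interpolation is reasonable for positions and velocities but not for every action type: a binary gripper command should be held or snapped to the nearest sample, and quaternion orientations need spherical interpolation instead.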

Cross-Embodiment Transfer and Multi-Robot Datasets

Cross-embodiment transfer is the ability of a VLA model trained on data from multiple robot types to generalize to new robot embodiments. RT-X demonstrates that a single VLA policy trained on data from 22 distinct robot types outperforms specialist policies trained on individual robots by 50% on average^[9]. This transfer is enabled by the semantic grounding provided by vision-language pretraining: the model learns that 'pick up the red cup' refers to a high-level goal that can be achieved by different kinematic chains.

Multi-robot datasets are essential for cross-embodiment transfer. Open X-Embodiment aggregates data from 22 robot embodiments into a unified corpus of 1 million trajectories, standardizing observation formats (RGB images, depth maps, proprioceptive state) and action spaces (normalized joint velocities, end-effector poses). RoboNet contributed 15 million video frames from 7 robot platforms, demonstrating that multi-robot datasets improve generalization to novel objects and scenes^[10]. DROID provides 76,000 teleoperation trajectories across 564 scenes collected by 50 operators, with standardized action spaces and observation formats.

Standardization is critical for cross-embodiment transfer. Hugging Face LeRobot defines a common trajectory format (observations, actions, rewards, episode boundaries) that is compatible with 15 robot datasets. RLDS (Reinforcement Learning Datasets) provides a TensorFlow-based format for storing robot trajectories with standardized metadata (robot type, control frequency, action space). MCAP is a container format for multi-modal time-series data (camera images, LiDAR point clouds, IMU readings) that is used by DROID and other large-scale robot datasets.

Cross-embodiment transfer has practical implications for data procurement. A buyer training a VLA model for a new robot embodiment can bootstrap from existing multi-robot datasets rather than collecting demonstrations from scratch. Truelabel's physical AI marketplace indexes robot demonstration trajectories across many embodiments, with standardized action spaces and observation formats. Buyers can filter by embodiment type (mobile manipulator, fixed-base arm, humanoid), task category (pick-and-place, assembly, navigation), and scene complexity (tabletop, kitchen, warehouse).

Inference Latency, Control Frequency, and Evaluation Benchmarks

VLA models are evaluated on two types of benchmarks: simulation benchmarks (low-cost, high-throughput) and real-world benchmarks (high-cost, low-throughput). Simulation benchmarks include CALVIN (long-horizon manipulation in a simulated kitchen), LIBERO (130 tasks across 10 scenes), and RLBench (100 tasks in a simulated tabletop environment). Real-world benchmarks include long-horizon manipulation suites requiring 10–40 steps per task and reasoning-oriented manipulation evaluations.

Success metrics vary by benchmark. Task success rate is the percentage of episodes in which the robot achieves the goal (e.g., 'pick up the red cup and place it on the blue plate'). Generalization success rate measures performance on novel objects, scenes, or instructions not seen during training. RT-2 reports 62% success on novel tasks versus 32% for behavior-cloning baselines^[5]. Cross-embodiment success rate measures performance when a VLA model trained on one robot type is deployed on a different robot type. RT-X reports 50% improvement in cross-embodiment success when training on multi-robot datasets versus single-robot datasets^[9].
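When comparing the figures above, it helps to be explicit about whether an improvement is absolute (percentage points) or relative (percent of the baseline), since the same pair of numbers yields very different headlines. The sketch below works through the RT-2 figures quoted in the text. The function names are illustrative, and the text does not say which convention the "50% improvement" reported for RT-X follows.

```python
def success_rate(outcomes) -> float:
    """Fraction of episodes that succeeded; outcomes is a sequence of booleans."""
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def absolute_gain(new: float, baseline: float) -> float:
    """Improvement in percentage points, expressed as a fraction."""
    return new - baseline


def relative_gain(new: float, baseline: float) -> float:
    """Improvement as a fraction of the baseline's success rate."""
    return (new - baseline) / baseline


# RT-2 versus baseline on novel tasks, as quoted above: 62% versus 32%.
abs_gain = absolute_gain(0.62, 0.32)   # about 0.30, i.e. 30 percentage points
rel_gain = relative_gain(0.62, 0.32)   # about 0.94, i.e. roughly a 94% relative gain
```

Reporting the episode count alongside the rate matters as well: a 62% success rate over 50 episodes carries far more uncertainty than the same rate over 5,000.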

Data efficiency is measured by the number of demonstrations required to achieve a target success rate. OpenVLA achieves 52% success on LIBERO tasks with 100,000 demonstrations, versus 38% for behavior-cloning baselines with the same data budget^[11]. Inference latency is the time from observation to action output, which must be below the control period (typically 20–100 ms for manipulation tasks). GR00T N1 achieves 15 ms inference latency on an NVIDIA Jetson AGX Orin, enabling 50 Hz control^[12]. OpenVLA's own latency profile is different and often misquoted: base OpenVLA decodes 7 discrete action tokens autoregressively at about 4.2 Hz (240 ms/step on an A100), while the OpenVLA-OFT recipe reaches 109.7 Hz (73 ms per 8-action chunk) via parallel decoding and action chunking — a 26x throughput gain^[13]. The OpenVLA 15 ms inference guide works through how OFT gets there and what control frequency your training data must support.
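The latency and frequency figures above are linked by simple arithmetic: the control period is the reciprocal of the control frequency, and action chunking multiplies the number of actions delivered per inference call. A minimal sketch, using only the numbers quoted in the text (function names are illustrative):

```python
def control_period_ms(control_hz: float) -> float:
    """Time budget per control step, in milliseconds."""
    return 1000.0 / control_hz


def effective_action_rate_hz(latency_ms: float, actions_per_inference: int = 1) -> float:
    """Actions produced per second when each inference call yields a chunk of actions."""
    return actions_per_inference * 1000.0 / latency_ms


def meets_deadline(latency_ms: float, control_hz: float) -> bool:
    """True if a single inference fits inside one control period."""
    return latency_ms <= control_period_ms(control_hz)


base_openvla_hz = effective_action_rate_hz(240.0)    # about 4.2 Hz, one action per call
openvla_oft_hz = effective_action_rate_hz(73.0, 8)   # about 109.6 Hz, 8 actions per call
fits_50_hz = meets_deadline(15.0, 50.0)              # 15 ms fits inside a 20 ms period
```

The chunked figure computed from the rounded 73 ms latency comes out marginally below the quoted 109.7 Hz, presumably because the quoted latency is itself rounded. Note also that chunking raises throughput without shortening reaction time: the policy still sees a new observation only once per inference call.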

Benchmark reproducibility is a persistent challenge: published VLA results are often hard to reproduce because of missing hyperparameters, undocumented data preprocessing, or hardware-specific tuning. Truelabel's physical AI marketplace records provenance metadata for every trajectory (collector identity, hardware configuration, calibration parameters, and annotation protocol), which supports reproducible benchmarking.

Common Misconceptions About VLA Models

Misconception 1: VLA models require millions of robot demonstrations. While early VLA models like RT-1 trained on 130,000 demonstrations, recent models leverage vision-language pretraining to reduce data requirements. OpenVLA achieves 52% success on novel tasks with 100,000 demonstrations^[11], and GR00T N1 achieves 68% success on long-horizon tasks with 386,000 demonstrations^[6]. The pretraining stage (billions of image-text pairs) is performed once and amortized across all downstream tasks.

Misconception 2: VLA models generalize to any robot embodiment. Cross-embodiment transfer is effective when the target embodiment is similar to the training embodiments (e.g., both are 7-DOF arms with parallel-jaw grippers). Transfer degrades when the target embodiment has a different kinematic structure (e.g., training on fixed-base arms, deploying on mobile manipulators) or different action spaces (e.g., training on joint velocities, deploying on end-effector poses). RT-X reports 50% improvement in cross-embodiment success when training on multi-robot datasets, but absolute success rates remain below single-embodiment specialist policies for highly dissimilar embodiments.

Misconception 3: VLA models understand physics and causality. VLA models learn correlations between visual observations, language instructions, and actions, but do not learn explicit physics models or causal relationships. World models are a complementary approach that learns forward dynamics models (predicting future observations given current observations and actions), enabling model-based planning and counterfactual reasoning. NVIDIA Cosmos combines VLA models with world models to enable long-horizon planning in complex environments.

Misconception 4: VLA models are production-ready. Most published VLA models are research prototypes evaluated in controlled lab environments. Deployment in unstructured real-world environments requires additional engineering: safety constraints (collision avoidance, joint limits), failure recovery (detecting and recovering from execution errors), and online adaptation (updating the policy based on deployment data). Scale AI's partnership with Universal Robots demonstrates production deployment of VLA models in warehouse environments, but reports that 30% of deployment effort is spent on safety and failure recovery.
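One of the safety layers mentioned above, joint limits, is commonly enforced by a filter that sits between the policy and the robot controller. The following is a minimal sketch under assumed conventions (position targets per joint, a fixed per-step rate limit). The function names and limits are hypothetical, and collision avoidance is deliberately out of scope.

```python
def clamp(x: float, lo: float, hi: float) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def safety_filter(commanded, current, lower, upper, max_step):
    """Rate-limit each joint target, then keep it inside its position limits.

    commanded: joint targets proposed by the policy
    current:   measured joint positions
    lower/upper: per-joint position limits
    max_step:  largest allowed change per control step (same units as positions)
    """
    safe = []
    for cmd, cur, lo, hi in zip(commanded, current, lower, upper):
        step = clamp(cmd - cur, -max_step, max_step)  # rate limit
        safe.append(clamp(cur + step, lo, hi))        # position limit
    return safe
```

A filter of this kind is independent of the learned policy, so it can be tested exhaustively and keeps working even when the policy outputs something unexpected.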

Data Formats and Tooling for VLA Training

VLA training pipelines require standardized data formats for robot trajectories. Hugging Face LeRobot defines a common trajectory format: each episode is a dictionary with keys `observation` (RGB images, depth maps, proprioceptive state), `action` (joint velocities, end-effector poses, gripper commands), `reward` (scalar reward signal), and `episode_end` (boolean flag). Observations and actions are stored as NumPy arrays or PyTorch tensors, with standardized shapes and data types.

RLDS (Reinforcement Learning Datasets) provides a TensorFlow-based format for storing robot trajectories with standardized metadata: robot type, control frequency, action space, observation modalities, and data collection protocol. RLDS datasets are stored as TFRecord files (efficient binary format) and can be loaded with the TensorFlow Datasets API. The RLDS GitHub repository provides conversion scripts for 15 robot datasets, including BridgeData, RoboNet, and DROID.

MCAP is a container format for multi-modal time-series data (camera images, LiDAR point clouds, IMU readings, joint states). MCAP files use a row-oriented, append-only layout in which messages are grouped into chunks that can be compressed (Zstandard, LZ4) and indexed, so readers can seek to arbitrary timestamps. DROID uses MCAP to store 76,000 teleoperation trajectories with synchronized RGB images, depth maps, and proprioceptive state. MCAP guides provide conversion scripts for ROS bag files, HDF5 files, and Parquet files.

Data preprocessing is critical for VLA training. Image normalization: RGB images are typically normalized to [0, 1] or [-1, 1] and resized to 224×224 or 256×256 pixels. Action normalization: joint velocities and end-effector poses are normalized to [-1, 1] to stabilize training. Temporal alignment: observations and actions must be synchronized to the same timestamp, accounting for sensor latency and control delays. LeRobot's preprocessing pipeline provides reference implementations for image normalization, action normalization, and temporal alignment.
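Two of the preprocessing steps above, pixel normalization and temporal alignment, can be sketched with the standard library alone. This is an illustrative version, not LeRobot's implementation: it assumes 8-bit pixel values, sorted action timestamps, and a single fixed latency offset between camera and controller.

```python
from bisect import bisect_left


def normalize_pixel(value: int, signed: bool = False) -> float:
    """Scale an 8-bit pixel value to [0, 1], or to [-1, 1] when signed is True."""
    x = value / 255.0
    return 2.0 * x - 1.0 if signed else x


def align_nearest(obs_times, act_times, latency_s: float = 0.0):
    """For each observation, return the index of the temporally nearest action.

    latency_s shifts each observation timestamp to compensate for a known
    sensor or control delay. act_times must be sorted in increasing order.
    """
    indices = []
    for t in obs_times:
        target = t + latency_s
        i = bisect_left(act_times, target)
        if i == 0:
            indices.append(0)
        elif i == len(act_times):
            indices.append(len(act_times) - 1)
        else:
            before, after = act_times[i - 1], act_times[i]
            indices.append(i if after - target < target - before else i - 1)
    return indices
```

In practice it is worth also recording the residual time difference for each matched pair and dropping pairs above a threshold, since a nearest match can still be far away when a sensor stalls.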

Future Directions: World Models, Sim-to-Real, and Foundation Models

Three research directions are shaping the future of VLA models. World models learn forward dynamics models (predicting future observations given current observations and actions), enabling model-based planning and counterfactual reasoning. World Models (Ha & Schmidhuber, 2018) demonstrated that learning a compact latent representation of the environment enables sample-efficient reinforcement learning. NVIDIA Cosmos combines VLA models with world models to enable long-horizon planning in complex environments, achieving 68% success on tasks requiring 10–40 steps.

Sim-to-real transfer trains VLA models in simulation and deploys them on real robots. Domain randomization varies visual appearance, object properties, and dynamics parameters during simulation training to improve real-world robustness. A 2021 survey reports that sim-to-real transfer reduces real-world data requirements by 10× but introduces a 20–40% performance gap compared to real-world training. RLBench provides a simulation benchmark with 100 manipulation tasks and domain randomization support.
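Domain randomization, as described above, amounts to drawing a fresh set of simulation parameters for every training episode. A minimal sketch follows. The parameter names and ranges are hypothetical placeholders, and a real setup would pass the sampled values to its simulator's own configuration API.

```python
import random

# Hypothetical parameters and ranges, for illustration only.
RANDOMIZATION_RANGES = {
    "light_intensity": (0.5, 1.5),   # multiplier on nominal lighting
    "object_mass_kg": (0.05, 0.5),
    "friction": (0.4, 1.2),
    "camera_yaw_deg": (-5.0, 5.0),
}


def sample_episode_params(rng: random.Random, ranges=RANDOMIZATION_RANGES) -> dict:
    """Draw one value per parameter, uniformly within its range."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}


# Seeding the generator makes the sequence of randomized episodes reproducible.
rng = random.Random(0)
episode_params = [sample_episode_params(rng) for _ in range(3)]
```

Uniform sampling is the simplest choice. Ranges that are too narrow leave a reality gap, while ranges that are too wide can make the task needlessly hard to learn, so the ranges themselves usually need tuning against real-world evaluations.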

Foundation models for robotics are large-scale VLA models pretrained on diverse robot datasets and fine-tuned for specific tasks. RT-X demonstrates that a single VLA policy trained on 22 robot types outperforms specialist policies by 50% on average. OpenVLA open-sources a 7B-parameter foundation model trained on Open X-Embodiment data, enabling researchers to fine-tune on custom tasks with 1,000–10,000 demonstrations. Figure AI's partnership with Brookfield aims to collect 100 million humanoid robot trajectories to train a foundation model for warehouse automation.

Data marketplaces are emerging to support foundation model training. Truelabel's physical AI marketplace indexes robot demonstration trajectories across many embodiments, with standardized action spaces and observation formats. Buyers can filter by embodiment type, task category, and scene complexity, and purchase trajectories with provenance metadata (collector identity, hardware configuration, calibration parameters). NVIDIA's Physical AI Data Factory Blueprint provides reference architectures for large-scale data collection, annotation, and quality control.


External references and source context

[1] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv. Cited for: RT-2 fine-tuned on 130,000 robot demonstrations.
[2] OpenVLA: An Open-Source Vision-Language-Action Model. arXiv. Cited for: OpenVLA achieves 52% success on unseen LIBERO tasks.
[3] Open X-Embodiment: Robotic Learning Datasets and RT-X Models. arXiv. Cited for: RT-X trained on 22 robot types, 1 million trajectories.
[4] RoboNet: Large-Scale Multi-Robot Learning. arXiv. Cited for: RoboNet aggregated 15 million video frames from 7 robot platforms.
[5] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv. Cited for: RT-2 achieves 62% success on novel tasks versus 32% for baselines.
[6] NVIDIA GR00T N1 technical report. arXiv. Cited for: GR00T N1 trained on 386,000 trajectories, achieves 68% success on long-horizon tasks.
[7] Open X-Embodiment: Robotic Learning Datasets and RT-X Models. arXiv. Cited for: Open X-Embodiment aggregates 1 million trajectories from 22 distinct embodiments.
[8] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. arXiv. Cited for: DROID provides 76,000 teleoperation trajectories across 564 scenes.
[9] Open X-Embodiment: Robotic Learning Datasets and RT-X Models. arXiv. Cited for: RT-X demonstrates 50% improvement in cross-embodiment success.
[10] RoboNet: Large-Scale Multi-Robot Learning. arXiv. Cited for: RoboNet demonstrates multi-robot datasets improve generalization to novel objects.
[11] OpenVLA: An Open-Source Vision-Language-Action Model. arXiv. Cited for: OpenVLA achieves 52% success with 100,000 demonstrations.
[12] NVIDIA GR00T N1 technical report. arXiv. Cited for: GR00T N1 achieves 15 ms inference latency on Jetson AGX Orin.
[13] Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success. arXiv. Cited for: base OpenVLA runs at 4.2 Hz / 240 ms (7 discrete action tokens, autoregressive) versus OpenVLA-OFT at 109.7 Hz / 73 ms per 8-action chunk via parallel decoding and action chunking (26x throughput).


FAQ

What is the difference between a VLA model and a vision-language model (VLM)?

A vision-language model (VLM) processes images and text to produce text outputs (e.g., image captions, visual question answering). A vision-language-action (VLA) model extends VLMs so that they also produce robot control outputs (joint velocities, end-effector poses, gripper commands). VLA models are pretrained on vision-language data (billions of image-text pairs) then fine-tuned on robot demonstration data to ground semantic concepts in physical action spaces. RT-2 uses a VLM of up to 55B parameters (PaLI-X) as the backbone and, rather than adding a separate regression head, emits each action dimension as a discrete token (256 bins per dimension) that is decoded back into an end-effector command.

How much robot demonstration data is required to train a VLA model?

Data requirements depend on the pretraining strategy. VLA models pretrained on vision-language data (e.g., RT-2, OpenVLA) require 100,000–500,000 robot demonstrations for fine-tuning. Policies trained without web-scale vision-language pretraining need at least comparable amounts of robot data and generalize less well: RT-1 trained on 130,000 demonstrations but reached 32% success on novel tasks versus 62% for RT-2. Cross-embodiment transfer reduces data requirements: RT-X achieves 50% improvement in success rate when training on multi-robot datasets versus single-robot datasets. Truelabel's physical AI marketplace indexes robot demonstration trajectories across many embodiments, enabling buyers to bootstrap VLA training with existing data.

Can VLA models generalize to new robot embodiments?

VLA models exhibit cross-embodiment transfer when the target embodiment is similar to the training embodiments (e.g., both are 7-DOF arms with parallel-jaw grippers). RT-X demonstrates that a single VLA policy trained on 22 robot types outperforms specialist policies by 50% on average. Transfer degrades when the target embodiment has a different kinematic structure (e.g., training on fixed-base arms, deploying on mobile manipulators) or different action spaces (e.g., training on joint velocities, deploying on end-effector poses). Fine-tuning on 1,000–10,000 demonstrations from the target embodiment typically recovers 80–90% of specialist-policy performance.

What data formats are used for VLA training datasets?

VLA training datasets use three primary formats. Hugging Face LeRobot defines a dictionary format with keys `observation`, `action`, `reward`, and `episode_end`, stored as Parquet files or HDF5 files. RLDS (Reinforcement Learning Datasets) uses TFRecord files with standardized metadata (robot type, control frequency, action space). MCAP is a container format for multi-modal time-series data (camera images, LiDAR point clouds, IMU readings), used by DROID and other large-scale datasets. All three formats support efficient compression, random access, and standardized preprocessing pipelines.

What are the main challenges in deploying VLA models in production?

Production deployment requires three engineering layers beyond research prototypes. **Safety constraints**: collision avoidance, joint limits, and workspace boundaries must be enforced at the control level. **Failure recovery**: the system must detect execution errors (e.g., dropped objects, unreachable goals) and trigger recovery behaviors (e.g., re-grasping, replanning). **Online adaptation**: the policy must update based on deployment data to handle distribution shift (e.g., new object types, lighting conditions). Scale AI's partnership with Universal Robots reports that 30% of deployment effort is spent on safety and failure recovery, and that online adaptation improves success rates by 15–25% over static policies.
