Vision-language-action models
VLA training data
VLA training data is the paired vision, language, and action signal that vision-language-action models such as OpenVLA, RT-2, π0, and GR00T N1 train on: robot teleoperation traces, instruction-grounded demonstrations, egocentric video, action labels, and timing metadata. Truelabel routes VLA sourcing requests to capture partners that can deliver paired modalities for the buyer's target embodiment.
Quick facts
- Observation: Egocentric, wrist, or external video
- Task context: Instruction text, object set, environment, success criteria
- Action: Joint states, end-effector poses, or discrete action labels
- Format: RLDS, LeRobot, HDF5, MCAP, or buyer-defined schema
- QA: Observation-action sync and complete task episodes
Comparison
| Model need | Data requirement | Bounty implication |
|---|---|---|
| Visual grounding | Diverse real-world observations | Specify camera and environment |
| Language grounding | Instructions and task labels | Define task taxonomy |
| Action grounding | States, actions, trajectories | Set sync and format rules |
| Evaluation | Held-out accepted episodes | Start with an eval request |
Provider list — VLA training data
10 reference models and datasets covering VLA training data. Each entry summarizes its strongest fit and a buyer-relevant signal so you can shortcut the discovery loop.
#1
OpenVLA
Open-source 7B-parameter VLA model from Stanford / TRI / UC Berkeley — released with weights and training recipe.
Best for: Reference model when designing VLA training data shape (vision token format, action representation, instruction grounding).
#2
RT-2 (Google DeepMind)
Vision-language-action model that co-fine-tunes web-scale VLM with robotics data — defines the modern VLA benchmark.
Best for: Architecture reference; the data recipe (web VLM data + robot trajectories) is the template most production VLA programs follow.
#3
π0 (Physical Intelligence)
Foundation VLA model from Physical Intelligence trained on a large mix of teleop, manipulation, and language data.
Best for: Frontier VLA reference; informs scale and diversity requirements for production training data.
#4
NVIDIA GR00T N1
NVIDIA's open VLA foundation model for humanoids with synthetic-data-heavy training recipe and public weights.
Best for: Sim-first VLA training pattern; useful when synthetic data is part of the production mix.
#5
Open X-Embodiment / RT-X
Cross-embodiment dataset from 21 institutions spanning 22 robot embodiments; anchors the dominant VLA pretraining recipe.
Best for: Cross-robot VLA pretraining corpus before deployment-specific fine-tune.
#6
DROID
76k Franka demonstrations with synchronized vision and, in many cases, language annotation.
Best for: Real-world manipulation slice for VLA fine-tune when single-arm Franka matches deployment.
#7
BridgeData V2
60,096 instruction-conditioned manipulation trajectories with language labels.
Best for: Affordable, well-documented VLA training data with strong instruction grounding.
#8
Hugging Face LeRobot Bridge / DROID variants
Curated LeRobot conversions of canonical VLA datasets (Bridge, DROID, ALOHA) in modern Parquet format.
Best for: Off-the-shelf ingestion path for VLA training when you want modern format conventions.
#9
RoboCat training set
DeepMind's self-improving generalist manipulation agent; a reference for growing training data through self-improvement rather than a distributable corpus.
Best for: Reference for self-improvement loops in VLA data strategy; the underlying corpus is not redistributable.
#10
SayCan
Affordance-grounded language-to-action work from Google — defines the language → action grounding pattern many VLAs imitate.
Best for: Reference for language grounding shape in VLA training data (especially affordance tags + step verbs).
What a VLA dataset should include
Modern VLA models need robot episodes where vision, language, and actions stay aligned; OpenVLA demonstrates this at scale with 970k Open X-Embodiment robot episodes [1]. The action stream should expose the control space the model must produce: RT-2 describes robot actions as text-token-like action strings paired with images and language commands [2]. Language-instruction density should be explicit at the trajectory level: BridgeData V2 reports 60,096 trajectories across 13 skills and 24 environments, each labeled with a natural-language instruction [3].
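RT-2's action-as-text idea can be illustrated with a toy discretizer: each continuous action dimension is clipped to a range, binned, and emitted as a token string. The bin count, range, and string format below are assumptions for illustration, not RT-2's published scheme:

```python
def action_to_tokens(action, low=-1.0, high=1.0, bins=256):
    """Map each continuous action dimension to an integer bin and
    render the result as a space-separated token string.
    Range and bin count are illustrative, not RT-2's exact recipe."""
    tokens = []
    for value in action:
        clipped = min(max(value, low), high)          # clamp to [low, high]
        bin_id = round((clipped - low) / (high - low) * (bins - 1))
        tokens.append(str(bin_id))
    return " ".join(tokens)

# A 7-DoF action: xyz translation, xyz rotation, gripper open/close.
print(action_to_tokens([0.1, -0.5, 0.0, 0.2, 0.0, 0.0, 1.0]))
```

The practical implication for data buyers: the action stream must be delivered with known units and ranges, or this kind of tokenization cannot be applied consistently across episodes.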
Why real-world data matters
Real-world diversity is the reason Open X-Embodiment pools more than 1M robot trajectories across 22 robot embodiments instead of relying on one lab setup [4]. DROID adds the deployment-side view: 76k demonstrations across 564 scenes and 86 tasks improved policy performance, robustness, and generalization [5].
[6]"Using human video capture in a variety of Brookfield environments, Figure will amass critical AI training data for Helix to teach humanoid robots how to move, perceive, and act across a spectrum of human-centric spaces."
For contact-rich work, RH20T shows why RGB alone is not always enough by pairing real-world sequences with visual, force, audio, action, and human-demonstration modalities [7]. Buyers should also specify delivery formats up front because LeRobot standardizes robotics datasets around synchronized video or images plus state/action data [8].
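The observation-action sync requirement can be made concrete with a small QA sketch, assuming each stream ships per-sample timestamps. The field shapes and the 20 ms tolerance are illustrative choices, not a LeRobot or Truelabel rule:

```python
def check_sync(obs_ts, act_ts, max_skew_s=0.02):
    """Flag observation/action streams whose paired timestamps drift
    apart by more than max_skew_s seconds. Assumes equal-length
    streams sampled at a shared nominal rate."""
    if len(obs_ts) != len(act_ts):
        return False                      # incomplete episode pairing
    return all(abs(o - a) <= max_skew_s for o, a in zip(obs_ts, act_ts))

obs = [0.00, 0.05, 0.10, 0.15]
act = [0.01, 0.05, 0.11, 0.15]
print(check_sync(obs, act))  # drift stays within the 20 ms tolerance
```

A check like this runs cheaply at ingestion time and catches the two failure modes named in the QA row above: misaligned streams and truncated episodes.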
External references and source context
- [1] OpenVLA project (openvla.github.io): OpenVLA is a VLA model trained on 970k robot episodes from Open X-Embodiment, so VLA training data must pair visual observations, language, and executable action labels at robot-episode scale.
- [2] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (robotics-transformer2.github.io): RT-2 represents robot actions as text-token-like action strings paired with image observations and language commands, making action-token coverage a core VLA data requirement.
- [3] BridgeData V2 project site (rail-berkeley.github.io): BridgeData V2 reports 60,096 trajectories across 13 skills and 24 environments, each labeled with a natural-language instruction, supporting the language-instruction-density requirement for VLA datasets.
- [4] Open X-Embodiment project site (robotics-transformer-x.github.io): Open X-Embodiment pools 1M+ real robot trajectories across 22 robot embodiments and many datasets, supporting the need for embodiment, task, and scene diversity in VLA training data.
- [5] DROID project site (droid-dataset.github.io): DROID reports 76k demonstrations / 350 hours across 564 scenes and 86 tasks, and reports improved performance, robustness, and generalization from real-world robot manipulation data.
- [6] Figure + Brookfield humanoid pretraining dataset partnership (figure.ai): Figure and Brookfield describe real-world humanlike navigation and manipulation data across household environments as necessary training data for scaling a proprietary vision-language-action humanoid model.
- [7] RH20T project site (rh20t.github.io): RH20T includes over 110,000 contact-rich sequences with visual, force, audio, action, and human-demonstration modalities, supporting the need for multimodal real-world capture when VLA buyers need more than RGB video.
- [8] LeRobot GitHub repository (GitHub): LeRobot describes a standardized LeRobotDataset format with synchronized video or images plus state/action data, supporting buyer requirements for delivery formats and schema validation.
FAQ
What is a VLA model?
A VLA model, or vision-language-action model, connects visual input, language or task context, and actions. In robotics, VLA models use observations and instructions to produce behavior in the physical world.
What data does a VLA model need?
VLA models need observations, task context, and action data. That can mean video, instructions, robot states, trajectories, task labels, success markers, and metadata in a format the training pipeline can consume.
Can egocentric video help VLA training?
Yes. Egocentric video can help models learn human-object interactions and task context. For action-producing models, buyers may also need robot demonstrations or teleoperation traces with action data.
Which formats are common for VLA datasets?
Common formats include RLDS, LeRobot, HDF5, MCAP, ROS bag, zarr, WebDataset, JSON, and CSV. The sourcing request should define the expected schema before suppliers submit samples.
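One way to define the expected schema before suppliers submit samples is a required-fields check run against each sample. Everything below (the grouping, the field names) is a hypothetical buyer schema for illustration, not a standard:

```python
# Hypothetical required schema for one episode sample; real sourcing
# requests would define their own groups and fields.
REQUIRED_KEYS = {
    "observation": {"frames", "timestamps_s"},
    "task": {"instruction"},
    "action": {"values", "timestamps_s"},
}

def validate_sample(sample):
    """Return the missing fields in a supplier sample; an empty list
    means the sample matches the expected top-level schema."""
    missing = []
    for group, keys in REQUIRED_KEYS.items():
        if group not in sample:
            missing.append(group)
            continue
        missing.extend(f"{group}.{key}" for key in keys - sample[group].keys())
    return missing

sample = {
    "observation": {"frames": [], "timestamps_s": []},
    "task": {},                      # instruction missing
    "action": {"values": [], "timestamps_s": []},
}
print(validate_sample(sample))  # flags the missing task.instruction field
```

The same check works regardless of container format (RLDS, LeRobot, HDF5, MCAP) once a sample has been loaded into a dict-like structure.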
Looking for VLA training data?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.
Request VLA training data