VLA training data

VLA training data is the paired vision, language, and action signal that vision-language-action models — OpenVLA, RT-2, π0, GR00T N1 — train on: robot teleoperation traces, instruction-grounded demonstrations, egocentric video, action labels, and timing metadata. Truelabel routes VLA sourcing requests to capture partners that can deliver paired modalities at the buyer's target embodiment.

Updated 2026-05-05
By truelabel
Reviewed by truelabel

Quick facts

Observation: Egocentric, wrist, or external video
Task context: Instruction text, object set, environment, success criteria
Action: Joint states, end-effector poses, or discrete action labels
Format: RLDS, LeRobot, HDF5, MCAP, or buyer-defined schema
QA: Observation-action sync and complete task episodes

Comparison

Model need | Data requirement | Bounty implication
Visual grounding | Diverse real-world observations | Specify camera and environment
Language grounding | Instructions and task labels | Define task taxonomy
Action grounding | States, actions, trajectories | Set sync and format rules
Evaluation | Held-out accepted episodes | Start with an eval request
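
For the evaluation row above, one mechanically checkable pattern is to carve a deterministic held-out slice out of accepted episodes before training ever sees them. The hash-based split below is a generic sketch, not a truelabel requirement; the 10% fraction and the episode-ID hashing scheme are assumptions.

```python
import hashlib

def is_holdout(episode_id: str, holdout_fraction: float = 0.10) -> bool:
    """Deterministically assign an episode to the held-out eval split.

    Hashing the episode ID keeps the split stable across re-deliveries,
    so later data drops never leak into an existing eval set.
    """
    digest = hashlib.sha256(episode_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # map hash to [0, 1]
    return bucket < holdout_fraction

# Example: filter accepted episode IDs into a training pool.
train_ids = [e for e in ["ep_0001", "ep_0002", "ep_0003"] if not is_holdout(e)]
```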

Provider list — VLA training data

10 providers covering VLA training data. Each entry summarizes the provider's strongest fit and a buyer-bottleneck signal so you can shortcut the discovery loop.

  1. OpenVLA

    Open-source 7B-parameter VLA model from Stanford / TRI / UC Berkeley — released with weights and training recipe.

    Best for: Reference model when designing VLA training data shape (vision token format, action representation, instruction grounding).

  2. RT-2 (Google DeepMind)

    Vision-language-action model that co-fine-tunes a web-scale VLM together with robotics data; it defines the modern VLA benchmark.

    Best for: Architecture reference; the data recipe (web VLM data + robot trajectories) is the template most production VLA programs follow.

  3. π0 (Physical Intelligence)

    Foundation VLA model from Physical Intelligence trained on a large mix of teleop, manipulation, and language data.

    Best for: Frontier VLA reference; informs scale and diversity requirements for production training data.

  4. NVIDIA GR00T N1

    NVIDIA's open VLA foundation model for humanoids, with a synthetic-data-heavy training recipe and public weights.

    Best for: Sim-first VLA training pattern; useful when synthetic data is part of the production mix.

  5. Open X-Embodiment / RT-X

    Cross-embodiment dataset pooling trajectories from 22 robot embodiments across many institutions; it anchors the dominant VLA pretraining recipe.

    Best for: Cross-robot VLA pretraining corpus before deployment-specific fine-tune.

  6. DROID

    76k Franka demonstrations with synchronized vision; many episodes include natural-language annotations.

    Best for: Real-world manipulation slice for VLA fine-tune when single-arm Franka matches deployment.

  7. BridgeData V2

    60,096 instruction-conditioned manipulation trajectories with language labels.

    Best for: Affordable, well-documented VLA training data with strong instruction grounding.

  8. Hugging Face LeRobot Bridge / DROID variants

    Curated LeRobot conversions of canonical VLA datasets (Bridge, DROID, ALOHA) in modern Parquet format.

    Best for: Off-the-shelf ingestion path for VLA training when you want modern format conventions.

  9. RoboCat training set

    DeepMind's self-improving foundation agent; a reference for VLA scaling behavior.

    Best for: Conceptual reference for self-improvement data loops; the underlying corpus is not redistributable.

  10. SayCan

    Affordance-grounded language-to-action work from Google — defines the language → action grounding pattern many VLAs imitate.

    Best for: Reference for language grounding shape in VLA training data (especially affordance tags + step verbs).

What a VLA dataset should include

Modern VLA models need robot episodes where vision, language, and actions stay aligned, as OpenVLA shows with 970k Open X-Embodiment robot episodes [1]. The action stream should expose the control space the model must produce; RT-2 describes robot actions as text-token-like action strings paired with images and language commands [2]. Language-instruction density should be explicit at trajectory level: BridgeData V2 reports 53,896 trajectories across 13 skills and 24 environments, each labeled with a natural-language instruction [3].
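
As a rough illustration of the RT-2-style action representation, continuous actions can be binned per dimension and emitted as integer tokens next to the image and instruction. The sketch below assumes 256 uniform bins and a fixed per-dimension range; it is not the actual RT-2 tokenizer.

```python
import numpy as np

def action_to_tokens(action: np.ndarray, low: np.ndarray, high: np.ndarray,
                     num_bins: int = 256) -> list[int]:
    """Discretize a continuous action vector into integer bins, one per dimension.

    Each bin index can then be emitted as a text-like token alongside the image
    and instruction, which is the general shape RT-2-style models train on.
    """
    clipped = np.clip(action, low, high)
    normalized = (clipped - low) / (high - low)          # scale to [0, 1]
    bins = np.floor(normalized * (num_bins - 1)).astype(int)
    return bins.tolist()

# Example: a 7-DoF end-effector delta mapped into 256 bins per dimension.
low, high = -np.ones(7), np.ones(7)
tokens = action_to_tokens(np.array([0.1, -0.2, 0.0, 0.3, 0.0, 0.0, 1.0]), low, high)
```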

Why real-world data matters

Real-world diversity is the reason Open X-Embodiment pools more than 1M robot trajectories across 22 robot embodiments instead of relying on one lab setup [4]. DROID adds the deployment-side view: 76k demonstrations across 564 scenes and 86 tasks improved policy performance, robustness, and generalization [5].

"Using human video capture in a variety of Brookfield environments, Figure will amass critical AI training data for Helix to teach humanoid robots how to move, perceive, and act across a spectrum of human-centric spaces."

[6]

For contact-rich work, RH20T shows why RGB alone is not always enough by pairing real-world sequences with visual, force, audio, action, and human-demonstration modalities [7]. Buyers should also specify delivery formats up front because LeRobot standardizes robotics datasets around synchronized video or images plus state/action data [8].
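
Observation-action sync is one of the few QA gates from the quick facts that can be verified mechanically before acceptance. The check below assumes per-step timestamps for camera frames and action commands are available; the 25 ms tolerance is an illustrative number, not a LeRobot or truelabel threshold.

```python
import numpy as np

def check_sync(frame_ts: np.ndarray, action_ts: np.ndarray,
               tolerance_s: float = 0.025) -> bool:
    """Return True if every action command has a camera frame within tolerance.

    frame_ts and action_ts are 1-D arrays of timestamps in seconds.
    """
    if len(frame_ts) == 0 or len(action_ts) == 0:
        return False
    # For each action timestamp, distance to the nearest frame timestamp.
    nearest = np.min(np.abs(action_ts[:, None] - frame_ts[None, :]), axis=1)
    return bool(np.all(nearest <= tolerance_s))
```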

Use the references below to move from category-level context into specific task, dataset, format, and comparison detail.

External references and source context

  1. OpenVLA project

    OpenVLA is a VLA model trained on 970k robot episodes from Open X-Embodiment, so VLA training data must pair visual observations, language, and executable action labels at robot-episode scale.

    openvla.github.io
  2. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 represents robot actions as text-token-like action strings paired with image observations and language commands, making action-token coverage a core VLA data requirement.

    robotics-transformer2.github.io
  3. BridgeData V2 project site

    BridgeData V2 reports 53,896 trajectories across 13 skills and 24 environments, each labeled with a natural-language instruction, supporting the language-instruction-density requirement for VLA datasets.

    rail-berkeley.github.io
  4. Open X-Embodiment project site

    Open X-Embodiment pools 1M+ real robot trajectories across 22 robot embodiments and many datasets, supporting the need for embodiment, task, and scene diversity in VLA training data.

    robotics-transformer-x.github.io
  5. DROID project site

    DROID reports 76k demonstrations / 350 hours across 564 scenes and 86 tasks, and reports improved performance, robustness, and generalization from real-world robot manipulation data.

    droid-dataset.github.io
  6. Figure + Brookfield humanoid pretraining dataset partnership

    Figure and Brookfield describe real-world humanlike navigation and manipulation data across household environments as necessary training data for scaling a proprietary vision-language-action humanoid model.

    figure.ai
  7. RH20T project site

    RH20T includes over 110,000 contact-rich sequences with visual, force, audio, action, and human-demonstration modalities, supporting the need for multimodal real-world capture when VLA buyers need more than RGB video.

    rh20t.github.io
  8. LeRobot GitHub repository

    LeRobot describes a standardized LeRobotDataset format with synchronized video or images plus state/action data, supporting buyer requirements for delivery formats and schema validation.

    github.com/huggingface/lerobot

FAQ

What is a VLA model?

A VLA model, or vision-language-action model, connects visual input, language or task context, and actions. In robotics, VLA models use observations and instructions to produce behavior in the physical world.

What data does a VLA model need?

VLA models need observations, task context, and action data. That can mean video, instructions, robot states, trajectories, task labels, success markers, and metadata in a format the training pipeline can consume.

Can egocentric video help VLA training?

Yes. Egocentric video can help models learn human-object interactions and task context. For action-producing models, buyers may also need robot demonstrations or teleoperation traces with action data.

Which formats are common for VLA datasets?

Common formats include RLDS, LeRobot, HDF5, MCAP, ROS bag, zarr, WebDataset, JSON, and CSV. The sourcing request should define the expected schema before suppliers submit samples.
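
Whichever container format the request names, the expected schema can be pinned down and checked against supplier samples before a full delivery. The validator below works on a JSON-style episode manifest with illustrative key names; it is a sketch of the idea, not a fixed truelabel schema.

```python
REQUIRED_EPISODE_KEYS = {"episode_id", "instruction", "environment",
                         "success", "steps"}
REQUIRED_STEP_KEYS = {"timestamp_s", "image_path", "joint_positions", "action"}

def validate_episode(episode: dict) -> list[str]:
    """Return a list of human-readable schema violations (empty means valid)."""
    errors = [f"missing episode key: {k}"
              for k in REQUIRED_EPISODE_KEYS - episode.keys()]
    for i, step in enumerate(episode.get("steps", [])):
        errors += [f"step {i}: missing key {k}"
                   for k in REQUIRED_STEP_KEYS - step.keys()]
    return errors
```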

Looking for VLA training data?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

Request VLA training data