VLA Training Data: Action-Labeled Demonstrations for Embodied AI

Q: Why do VLA models require action labels when vision-language models trained on internet video achieve strong zero-shot performance?

Vision-language models learn visual semantics (object recognition, spatial relationships, scene understanding) from internet video, but they do not learn the sensorimotor mappings required for physical interaction. A model that knows what a mug looks like cannot infer the gripper trajectory, wrist torque, or contact force required to grasp it without action-labeled demonstrations. RT-2 showed that web-scale pretraining improves robot policy generalization by 3× when fine-tuned on task-specific demonstrations, but the fine-tuning step remains mandatory. OpenVLA achieved 16.5% higher success than RT-2-X (55B parameters) using only 7B parameters trained on 970,000 action-labeled trajectories, which suggests that sensorimotor data quality matters more than visual pretraining scale.

Q: How does teleoperation data quality compare to autonomous scripted exploration for VLA training?

Teleoperation yields high-quality demonstrations with precise action labels but scales linearly with human operator hours. DROID collected 76,000 trajectories using 350 hours of teleoperation, averaging 217 trajectories per operator-hour. Autonomous scripted exploration generates trajectories at machine speed but introduces distribution shift—RoboNet's 15 million frames included 40% failure states (collisions, dropped objects), and training on failure-heavy data degrades policy performance unless explicitly filtered. Robomimic showed that filtering the bottom 25% of trajectories by task success improved imitation learning accuracy by 18%. Truelabel's hybrid bounty system lets buyers request teleoperation data for high-value tasks and autonomous scripted data for coverage, balancing quality and scale.

Q: What licensing restrictions prevent commercial use of open VLA datasets?

RoboNet uses a custom non-commercial license that forbids training models for sale or deployment in commercial products. EPIC-KITCHENS-100 annotations are CC BY-NC 4.0, permitting academic use but requiring separate commercial licensing. Creative Commons BY-NC terms explicitly exclude revenue-generating applications, creating legal risk for startups training VLA models on these datasets. DROID, released under the MIT license, permits commercial use but lacks the structured provenance metadata (collector identity, capture timestamps, hardware specifications) required for EU AI Act Article 10 compliance. Truelabel embeds collector identity, capture location, hardware serial numbers, and consent records into dataset metadata using C2PA content credentials, so buyers can audit data lineage and demonstrate regulatory compliance, and it licenses the data under perpetual commercial terms.

Q: Why do VLA models trained on millions of trajectories still fail on long-horizon tasks?

Existing datasets concentrate on short-horizon manipulation—pick-place, push, reach—with task horizons of 5-20 steps. CALVIN introduced 5-step instruction chains but contains only 24,000 episodes in a single simulated kitchen. LongBench evaluated policies on 100-step real-world tasks and found that models trained on Open X-Embodiment (1 million trajectories) achieved <10% success, revealing a massive generalization gap. Real-world assembly, maintenance, and logistics workflows often require 20-50 step sequences with tool use, multi-object coordination, and error recovery—task structures absent from current benchmarks. ManipArena proposed 100 real-world tasks requiring 20-50 steps and tool use, and models trained on Open X-Embodiment achieved only 6% success, rising to 23% after task-specific fine-tuning on 2,000 demonstrations.

Q: How does multi-sensor fusion improve VLA policy performance on contact-rich tasks?

Vision-only policies suffer from depth ambiguity and occlusion, limiting performance on contact-rich tasks. RT-1 trained on RGB images alone achieved 89% success on pick-place but 34% on insertion tasks requiring sub-millimeter alignment. Adding depth channels improved insertion success to 67%, and incorporating wrist force-torque sensors pushed success to 81%, indicating that force feedback is essential for precision manipulation. DROID captures RGB-D from wrist-mounted RealSense cameras and proprioceptive joint states at 10 Hz but omits force-torque and tactile signals. Dex-YCB includes 582,000 frames of dexterous grasps with ground-truth 6-DOF object poses and contact labels, enabling policies to learn grasp stability from visual-tactile correlation.

Q: What action space normalization strategies enable cross-embodiment VLA training?

Open X-Embodiment normalized actions by computing per-dataset mean and standard deviation, then clipping to [-1, 1], but this approach fails when action dimensions have semantic mismatches (e.g., 7-DOF arm vs. 2-DOF gripper). RT-1 discretized continuous actions into 256 bins per dimension, enabling transformer-based policies to treat actions as token sequences, but discretization introduces quantization error that degrades performance on precision tasks. Octo trained a generalist policy on 800,000 trajectories by learning embodiment-specific action heads—separate output layers for each robot morphology—while sharing a common visual encoder and language model, achieving 52% higher success rates than single-embodiment baselines. Task-space control—specifying end-effector position and orientation rather than joint angles—provides a morphology-agnostic action representation but requires accurate inverse kinematics.

Vision-language-action models require synchronized triplets (visual observation, language instruction, and executed action) at each timestep. OpenVLA achieved 16.5% higher success rates than RT-2-X (55B parameters) using only 7B parameters trained on diverse demonstrations, which suggests that data quality can outweigh model scale. Truelabel's marketplace connects robotics teams to 20,000+ collectors capturing teleoperation trajectories, egocentric video with action labels, and multi-sensor demonstrations across warehouses, kitchens, and manufacturing floors in 160+ countries.

Updated 2025-05-15

By Truelabel Team

Reviewed by Truelabel Team · May 15, 2025

Quick facts

Topic: VLA Training Data
Audience: Procurement leads, ML ops, robotics engineers
Deliverable: Buyer-facing reference + procurement guidance

Why VLA Models Demand Action-Labeled Trajectories at Scale

Vision-language-action architectures unify visual perception, natural language grounding, and motor control into a single transformer backbone. Unlike vision-language models that output text tokens, VLAs generate robot actions (joint positions, gripper states, end-effector velocities) conditioned on visual observations and language instructions; depending on the model, these are emitted as continuous vectors or as discretized action tokens. OpenVLA demonstrated that a 7B-parameter model trained on 970,000 trajectories from Open X-Embodiment outperformed RT-2-X (55B parameters) by 16.5% on unseen manipulation tasks^[1]. The performance gap stems from data diversity, not parameter count.

RT-2 showed that web-scale vision-language pretraining improves zero-shot generalization by 3× when fine-tuned on robot demonstrations, but the fine-tuning step remains mandatory^[2]. Pretraining on internet video teaches visual semantics—object recognition, spatial reasoning—but not the sensorimotor mappings required for physical interaction. A model that knows what a mug looks like cannot infer the gripper trajectory to grasp it without action-labeled examples. Scale AI's Physical AI platform addresses this gap by pairing teleoperation hardware with annotation pipelines that label every frame with 6-DOF poses, contact forces, and task success signals.

The bottleneck is not compute or architecture design but the cost and logistics of capturing diverse, high-quality demonstrations. DROID collected 76,000 trajectories across 564 skills and 86 buildings using 13 mobile manipulators, requiring 350 hours of human teleoperation^[3]. Scaling to millions of trajectories demands distributed data collection infrastructure that no single lab can sustain. Truelabel's marketplace connects robotics teams to 20,000+ collectors equipped with teleoperation rigs, egocentric cameras, and motion-capture systems across 160+ countries, enabling parallel data capture at a scale previously reserved for web scraping.

What Distinguishes VLA Training Data from Standard Video Datasets

Standard video datasets—Kinetics, Ego4D, Something-Something—provide visual frames with category labels or natural language captions but omit the action sequences that VLA models consume. Ego4D contains 3,670 hours of first-person video across 74 scenarios, yet lacks the joint-level action annotations required for imitation learning^[4]. A video of a human pouring coffee shows the outcome but not the wrist torques, finger pressures, or velocity profiles that a robot must replicate.

VLA training requires synchronized triplets at every timestep: RGB-D observation, language instruction, and a 7-14 dimensional action vector encoding joint positions, gripper state, and sometimes base velocity for mobile manipulators. RLDS (Reinforcement Learning Datasets) standardized this format as episodes containing observation dictionaries, action arrays, and reward scalars, enabling cross-dataset training without format conversion^[5]. LeRobot extends RLDS with Parquet-backed storage and Hugging Face integration, reducing dataset loading time by 40% compared to HDF5-based pipelines.
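The triplet contract described above can be made concrete with a small validation sketch. It is not the RLDS or LeRobot schema: the `Step` and `Episode` classes, their field names, and the `validate` method are hypothetical names chosen for illustration, and the images are toy 1x1 arrays.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence


@dataclass
class Step:
    """One timestep of a demonstration (hypothetical schema)."""
    observation: Dict[str, Any]   # e.g. {"rgb": ..., "depth": ...}
    instruction: str              # natural-language task description
    action: Sequence[float]       # action vector executed at this step
    timestamp: float              # seconds since the episode started


@dataclass
class Episode:
    steps: List[Step] = field(default_factory=list)

    def validate(self, action_dim: int) -> None:
        """Raise ValueError if any step breaks the triplet contract."""
        if not self.steps:
            raise ValueError("episode has no steps")
        previous_t = float("-inf")
        for i, step in enumerate(self.steps):
            if not {"rgb", "depth"} <= step.observation.keys():
                raise ValueError(f"step {i}: observation needs rgb and depth")
            if not step.instruction.strip():
                raise ValueError(f"step {i}: empty language instruction")
            if len(step.action) != action_dim:
                raise ValueError(f"step {i}: expected {action_dim} action dims")
            if step.timestamp <= previous_t:
                raise ValueError(f"step {i}: timestamps must strictly increase")
            previous_t = step.timestamp


# Two-step toy episode with 1x1 images and a 7-dimensional action.
episode = Episode(steps=[
    Step({"rgb": [[[0, 0, 0]]], "depth": [[0.50]]}, "pick up the mug", [0.0] * 7, 0.0),
    Step({"rgb": [[[9, 9, 9]]], "depth": [[0.48]]}, "pick up the mug", [0.1] * 7, 0.1),
])
episode.validate(action_dim=7)  # passes silently
```

Running a check like this at ingestion time catches mismatched action dimensions or out-of-order timestamps before they reach a training job.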

Action space heterogeneity remains the primary obstacle to multi-embodiment training. A 7-DOF Franka arm is commanded with joint angles; a parallel-jaw gripper takes a binary open/close signal; a mobile base takes (x, y, θ) velocities. Open X-Embodiment aggregated 1 million trajectories from 22 robot platforms but required per-embodiment action normalization and task-specific fine-tuning to achieve cross-platform transfer^[6]. RT-1 trained on 130,000 demonstrations from a single robot achieved 97% success on seen tasks but 13% on unseen embodiments, illustrating the generalization penalty of single-platform datasets^[7].

Truelabel's data bounty system lets buyers specify action space requirements—joint-level, end-effector, or task-space control—and embodiment constraints, ensuring collected trajectories match downstream policy architectures without post-hoc conversion.

How Open Datasets Limit VLA Generalization Across Embodiments

Public datasets concentrate on tabletop manipulation in laboratory settings, leaving warehouse logistics, outdoor navigation, and dexterous assembly underrepresented. BridgeData V2 contains 60,000 trajectories across 13 tasks but all demonstrations use a single WidowX 250 arm in a controlled lab environment^[8]. RoboNet aggregated 15 million frames from 7 robot platforms but 90% of trajectories involve pick-and-place or push tasks on flat surfaces^[9]. Models trained on these datasets fail when deployed in cluttered warehouses, uneven outdoor terrain, or tasks requiring bimanual coordination.

DROID expanded coverage to 564 skills across 86 buildings using mobile manipulators, capturing navigation, door opening, and object retrieval in real-world offices and homes^[3]. Yet even DROID's 76,000 trajectories span only 13 robots, all sharing similar morphology (mobile base + 6-DOF arm + parallel gripper). Humanoid robots, quadrupeds, dexterous hands, and aerial manipulators remain absent from large-scale open datasets. Figure AI's partnership with Brookfield aims to collect 100 million humanoid manipulation hours, but the dataset remains proprietary and unavailable for academic or commercial VLA training.

Task diversity also skews toward short-horizon manipulation. CALVIN introduced long-horizon tasks requiring 5-step instruction chains, but the dataset contains only 24,000 episodes in a single simulated kitchen^[10]. Real-world assembly, maintenance, and logistics workflows often require 20-50 step sequences with tool use, multi-object coordination, and error recovery—task structures absent from current benchmarks. LongBench evaluated policies on 100-step real-world tasks and found that models trained on short-horizon datasets achieved <5% success rates, even when fine-tuned on task-specific data^[11].

Truelabel's marketplace enables buyers to post bounties specifying embodiment (humanoid, quadruped, dexterous hand), environment (warehouse, kitchen, outdoor), and task complexity (short-horizon pick-place vs. multi-step assembly), routing requests to collectors with matching hardware and facilities.

Teleoperation vs. Autonomous Collection: Trade-offs for VLA Training

Teleoperation yields high-quality demonstrations with precise action labels but scales linearly with human operator hours. ALOHA collected 650 bimanual manipulation demonstrations at 50 Hz, requiring 12 hours of expert teleoperation for tasks like cable routing and dishwasher loading. DROID amassed 76,000 trajectories using 350 hours of teleoperation across 13 robots, averaging 217 trajectories per operator-hour^[3]. At this rate, collecting 1 million trajectories demands 4,600 operator-hours—feasible for a single lab but prohibitive for continuous data refresh cycles.
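The throughput arithmetic above can be reproduced in a few lines. This is only a back-of-envelope sketch using the figures quoted in the text; the function name `operator_hours_needed` is made up for illustration.

```python
import math


def operator_hours_needed(target, collected, hours_spent):
    """Extrapolate operator-hours for `target` trajectories from an observed rate."""
    rate = collected / hours_spent                      # trajectories per hour
    hours = math.ceil(target * hours_spent / collected)
    return rate, hours


# Figures quoted in the text: 76,000 trajectories in 350 hours.
rate, hours = operator_hours_needed(1_000_000, 76_000, 350)
print(f"{rate:.0f} trajectories/hour, {hours} operator-hours for 1M trajectories")
```

The result lands at roughly 4,600 operator-hours, matching the estimate in the text; the extrapolation assumes the collection rate stays constant as volume grows.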

Scale AI's partnership with Universal Robots industrialized teleoperation by distributing data collection across contract operators, reducing per-trajectory cost by 60% compared to in-house collection. Truelabel extends this model globally, connecting buyers to 20,000+ collectors who own teleoperation hardware—haptic interfaces, VR controllers, motion-capture gloves—and can capture demonstrations in their local environments without shipping robots internationally.

Autonomous data collection via scripted policies or reinforcement learning generates trajectories at machine speed but introduces distribution shift. RoboNet used scripted random exploration to collect 15 million frames, but 40% of trajectories ended in failure states (collisions, dropped objects, out-of-reach targets)^[9]. Training on failure-heavy data degrades policy performance unless explicitly filtered or relabeled. Robomimic showed that filtering the bottom 25% of trajectories by task success improved imitation learning accuracy by 18%, but manual filtering does not scale to million-trajectory datasets.
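A generic version of success-based filtering is sketched below. It is not Robomimic's implementation: it simply drops the lowest-scoring fraction of trajectories, and the function name, the score values, and the trajectory IDs are illustrative assumptions.

```python
def drop_lowest_fraction(trajectories, scores, fraction=0.25):
    """Return trajectories with the lowest-scoring `fraction` removed.

    `scores` holds one task-success score per trajectory (higher is better).
    Original ordering of the kept trajectories is preserved.
    """
    if len(trajectories) != len(scores):
        raise ValueError("need exactly one score per trajectory")
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    n_drop = int(len(trajectories) * fraction)
    ascending = sorted(range(len(scores)), key=lambda i: scores[i])
    dropped = set(ascending[:n_drop])
    return [t for i, t in enumerate(trajectories) if i not in dropped]


ids = ["a", "b", "c", "d", "e", "f", "g", "h"]
success = [0.9, 0.1, 0.8, 0.0, 1.0, 0.7, 0.6, 0.95]
kept = drop_lowest_fraction(ids, success)   # drops "d" (0.0) and "b" (0.1)
```

Automating the cut this way only works when a per-trajectory success score is already logged at capture time, which is the part that manual review cannot scale to.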

Domain randomization in simulation generates infinite synthetic trajectories but requires sim-to-real transfer validation. A 2021 survey found that 68% of sim-trained policies required real-world fine-tuning on 500-2,000 demonstrations to match teleoperation-trained baselines^[12]. Truelabel's hybrid bounty system lets buyers request teleoperation data for high-value tasks and autonomous scripted data for coverage, balancing quality and scale.

Multi-Sensor Fusion: RGB-D, Proprioception, and Force-Torque Signals

DROID captures RGB-D from wrist-mounted RealSense cameras, proprioceptive joint states at 10 Hz, and gripper binary state, but omits force-torque and tactile signals^[3]. Dex-YCB includes 582,000 frames of dexterous grasps with ground-truth 6-DOF object poses and contact labels, enabling policies to learn grasp stability from visual-tactile correlation. HOI4D pairs egocentric video with IMU wristbands and pressure-sensitive gloves, capturing 4 million frames of human-object interaction across 800 objects and 610 categories.

MCAP emerged as the de facto standard for multi-sensor robotics data, replacing ROS bags with 10× faster random access and native support for protobuf, JSON, and custom schemas. LeRobot converts MCAP episodes to Parquet tables with separate columns for RGB frames (stored as PNG), depth (16-bit arrays), joint states (float32 vectors), and language instructions (UTF-8 strings), enabling columnar queries.

Egocentric Video as a Proxy for Robot Observation Data

Human egocentric video captures task semantics (object affordances, spatial layouts, action sequences) without robot-specific action labels, serving as a pretraining corpus for visual encoders. Ego4D contains 3,670 hours of first-person video across 74 scenarios, including cooking, assembly, and social interaction^[4]. EPIC-KITCHENS-100 provides 100 hours of kitchen activities with 90,000 action segments labeled by verb-noun pairs.

Action Space Normalization: Bridging Heterogeneous Embodiments

Cross-embodiment training requires mapping diverse action representations—joint angles, end-effector poses, velocity commands—into a shared latent space. Open X-Embodiment normalized actions by computing per-dataset mean and standard deviation, then clipping to [-1, 1], but this approach fails when action dimensions have semantic mismatches (e.g., 7-DOF arm vs. 2-DOF gripper)^[6]. RT-1 discretized continuous actions into 256 bins per dimension, enabling transformer-based policies to treat actions as token sequences, but discretization introduces quantization error that degrades performance on precision tasks^[7].
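Both steps described above, z-score normalization with clipping and uniform 256-bin discretization, can be sketched with the standard library. This is an illustrative reading of the description, not the Open X-Embodiment or RT-1 code; the function names are hypothetical and the example operates on a single action dimension.

```python
import statistics


def normalize(values, clip=1.0):
    """Z-score one action dimension across a dataset, then clip to [-clip, clip]."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0   # guard against constant dimensions
    return [max(-clip, min(clip, (v - mean) / std)) for v in values]


def to_bin(x, n_bins=256, low=-1.0, high=1.0):
    """Map a continuous value in [low, high] to an integer token in [0, n_bins)."""
    x = max(low, min(high, x))
    index = int((x - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)            # x == high falls in the last bin


def from_bin(index, n_bins=256, low=-1.0, high=1.0):
    """Decode a token back to the centre of its bin."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


normalized = normalize([0.0, 0.5, 1.0])      # outer values are clipped to -1 and 1
token = to_bin(0.3)
error = abs(from_bin(token) - 0.3)           # quantization error for this value
```

With 256 bins over [-1, 1], the round-trip error is bounded by half a bin width, 1/256 of the normalized range. That bound is the quantization error the text refers to, and it is why precision tasks feel the discretization more than coarse pick-and-place.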

Octo trained a generalist policy on 800,000 trajectories by learning embodiment-specific action heads—separate output layers for each robot morphology—while sharing a common visual encoder and language model^[13]. This architecture achieved 52% higher success rates than single-embodiment baselines but required per-robot fine-tuning datasets of 500-2,000 trajectories. RoboCat extended this approach with self-improvement: the model generated synthetic demonstrations on new embodiments, filtered by success, then retrained on the augmented dataset, reducing human teleoperation requirements by 40%^[14].

Task-space control—specifying end-effector position and orientation rather than joint angles—provides a morphology-agnostic action representation but requires accurate inverse kinematics. Franka's FR3 Duo supports task-space impedance control with 0.1 mm position accuracy, enabling policies to transfer across arms with different kinematic chains. Truelabel's data bounty system lets buyers specify action space format (joint, task, or velocity control) and embodiment constraints, ensuring collected trajectories match downstream policy architectures without post-hoc conversion.
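To see why task-space actions transfer across morphologies only when inverse kinematics is available, consider a planar two-link arm. The sketch below is a textbook closed-form solution, not any vendor's controller; the link lengths and target point are arbitrary illustrative values.

```python
import math


def forward(l1, l2, q1, q2):
    """End-effector (x, y) of a planar two-link arm with joint angles q1, q2."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y


def inverse(l1, l2, x, y):
    """One closed-form joint solution (q2 >= 0) that reaches (x, y)."""
    cos_q2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if not -1.0 <= cos_q2 <= 1.0:
        raise ValueError("target is outside the arm's reachable workspace")
    q2 = math.acos(cos_q2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2


# The same task-space target, executed by two arms with different link lengths.
target = (0.5, 0.3)
arm_a = (0.40, 0.30)
arm_b = (0.35, 0.45)
joints_a = inverse(*arm_a, *target)
joints_b = inverse(*arm_b, *target)
```

The two arms need different joint angles for the same end-effector target, so a dataset recorded in task space can be replayed on either arm, while one recorded in joint space cannot.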

Licensing and Provenance: Legal Foundations for Commercial VLA Deployment

Open datasets often carry restrictive licenses that prohibit commercial model training or require derivative models to adopt the same license. RoboNet uses a custom non-commercial license that forbids training models for sale or deployment in commercial products^[15]. EPIC-KITCHENS-100 annotations are CC BY-NC 4.0, permitting academic use but requiring separate commercial licensing^[16]. Creative Commons BY-NC terms explicitly exclude revenue-generating applications, creating legal risk for startups training VLA models on these datasets.

DROID, released under the MIT license, permits commercial use, but the dataset lacks the structured provenance metadata (collector identity, capture timestamps, hardware specifications) required for EU AI Act compliance^[17]. Article 10 mandates that training data for high-risk AI systems be traceable to source, with documentation of collection methods, consent mechanisms, and quality controls. Datasheets for Datasets proposed a 57-question template covering motivation, composition, collection, preprocessing, and distribution, but adoption remains sparse in robotics^[18].

Truelabel's provenance system embeds collector identity, capture location, hardware serial numbers, and consent records into dataset metadata using C2PA content credentials, enabling buyers to audit data lineage and demonstrate regulatory compliance. Every trajectory includes a cryptographically signed manifest linking observations to the collector's verified identity and the specific robot hardware used, satisfying EU AI Act Article 10 and NIST AI RMF traceability requirements. Buyers receive perpetual commercial licenses with indemnification against IP claims, eliminating the legal ambiguity that plagues open datasets.
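The idea of a signed manifest that binds a trajectory to its collector and hardware can be illustrated with the standard library. This is a loose sketch, not the C2PA format and not Truelabel's implementation: C2PA relies on certificate-based signatures, whereas the example substitutes an HMAC with a shared key, and every field name and value below is a hypothetical placeholder.

```python
import hashlib
import hmac
import json


def build_manifest(payload, collector_id, hardware_serial, captured_at):
    """Describe a trajectory file and bind it to its content hash."""
    return {
        "collector_id": collector_id,
        "hardware_serial": hardware_serial,
        "captured_at": captured_at,
        "payload_sha256": hashlib.sha256(payload).hexdigest(),
    }


def sign(manifest, key):
    """HMAC over a canonical JSON encoding (stand-in for a real signature)."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hmac.new(key, canonical.encode(), hashlib.sha256).hexdigest()


def verify(payload, manifest, signature, key):
    """Check that neither the payload nor the manifest was altered."""
    if hashlib.sha256(payload).hexdigest() != manifest["payload_sha256"]:
        return False
    return hmac.compare_digest(sign(manifest, key), signature)


key = b"demo-key-not-for-production"
payload = b"...trajectory bytes..."
manifest = build_manifest(payload, "collector-001", "SN-0000", "2025-05-15T12:00:00Z")
signature = sign(manifest, key)
```

Verification fails if either the trajectory bytes or any manifest field changes after signing, which is the property an auditor relies on when tracing data lineage.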

Cost Structure: Teleoperation Labor vs. Annotation Overhead

Teleoperation costs scale with task complexity and operator skill. ALOHA reported $45/hour for expert operators capturing bimanual manipulation, yielding 54 demonstrations per hour for simple tasks (pick-place) and 12 per hour for complex tasks (cable routing). At these rates, collecting 10,000 trajectories for a single task costs roughly $8,300-$37,500 in labor alone (10,000 ÷ 54 ≈ 185 hours and 10,000 ÷ 12 ≈ 833 hours, each at $45/hour), excluding hardware amortization and facility overhead.

Scale AI industrialized this process by distributing teleoperation across contract workers in lower-cost regions, reducing per-trajectory cost to $8-$15 for tabletop manipulation and $25-$60 for mobile manipulation^[19]. Appen and CloudFactory offer similar managed services but require minimum orders of 5,000 trajectories and 8-12 week lead times, limiting agility for iterative model development.

Annotation overhead (labeling object poses, contact points, task success) adds 30-50% to raw collection costs. Labelbox charges $0.15-$0.40 per frame for 2D bounding boxes and $1.20-$3.50 for 3D cuboid annotation, translating to $150-$3,500 per 1,000-frame trajectory at one annotation pass per frame, and more as object count and annotation density increase. Segments.ai specializes in multi-sensor annotation (RGB-D, LiDAR, radar) with per-point-cloud pricing of $8-$25, suitable for outdoor mobile manipulation but cost-prohibitive for high-frequency indoor datasets.

Truelabel's marketplace eliminates annotation overhead by requiring collectors to deliver action-labeled trajectories—joint states, gripper commands, base velocities—synchronized with observations at capture time. Buyers specify action space format (joint, task, or velocity control) and sensor modalities (RGB, depth, force-torque) in the bounty, and collectors use hardware that natively logs these signals, avoiding post-hoc labeling. Per-trajectory costs range from $3 for simple pick-place to $18 for bimanual assembly, 40-60% below managed service providers.

Simulation-to-Real Transfer: When Synthetic Data Suffices

Simulation generates infinite trajectories at zero marginal cost but introduces domain gap—visual appearance, physics fidelity, sensor noise—that degrades real-world performance. Domain randomization addresses this by varying lighting, textures, object shapes, and dynamics parameters during training, forcing policies to learn robust features invariant to superficial changes^[20]. A 2017 study trained a 7-DOF reaching policy entirely in simulation with randomized joint friction, link masses, and actuator gains, achieving 94% real-world success without fine-tuning^[21].
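In practice, dynamics randomization amounts to drawing a fresh set of physical parameters for each training episode. The sketch below shows that sampling step only; the parameter ranges are invented for illustration and are not taken from the cited studies, and hooking the sampled values into a simulator is left out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DynamicsParams:
    joint_friction: float    # friction coefficient applied at each joint
    link_mass_scale: float   # multiplier on nominal link masses
    actuator_gain: float     # multiplier on nominal actuator gains


# Hypothetical ranges; real values depend on the robot and the simulator.
RANGES = {
    "joint_friction": (0.01, 0.20),
    "link_mass_scale": (0.80, 1.20),
    "actuator_gain": (0.70, 1.30),
}


def sample_dynamics(rng):
    """Draw one set of dynamics parameters uniformly from RANGES."""
    return DynamicsParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})


rng = random.Random(0)                                  # seeded for reproducibility
per_episode = [sample_dynamics(rng) for _ in range(3)]  # one draw per episode
```

Because the policy never sees the same dynamics twice, it cannot overfit to one simulator configuration. The real robot then looks like one more sample from the training distribution, provided the ranges actually cover its true parameters.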

RLBench provides 100 simulated manipulation tasks built on CoppeliaSim (via PyRep) with procedurally generated object poses and distractor placement, enabling policies to train on millions of episodes before real-world deployment^[22]. ManiSkill extends this with GPU-accelerated physics (10,000 FPS on a single A100) and photorealistic rendering via ray tracing, narrowing the visual domain gap. RoboSuite supports 9 robot arms and 30+ tasks with modular environment composition, but all tasks remain tabletop manipulation, with no locomotion, outdoor navigation, or tool use.

A 2021 survey found that 68% of sim-trained policies required real-world fine-tuning on 500-2,000 demonstrations to match teleoperation-trained baselines, and 22% failed to transfer entirely due to unmodeled contact dynamics^[12]. Contact-rich tasks—insertion, screwing, wiping—exhibit the largest sim-to-real gaps because simulators approximate friction, compliance, and slip with simplified models. NVIDIA Cosmos addresses this with world foundation models trained on 20 million hours of real-world video, learning physics priors that improve simulation fidelity, but the models remain proprietary and unavailable for open research.

Truelabel's hybrid bounty system lets buyers request simulated data for initial policy training and real-world validation data for fine-tuning, balancing cost and performance. Collectors can deliver synthetic trajectories from RoboSuite, ManiSkill, or custom Unity/Unreal environments, paired with 500-2,000 real-world demonstrations for transfer validation.

Data Refresh Cycles: Continuous Learning for Deployed VLA Systems

Static datasets become stale as robots encounter novel objects, environments, and failure modes in deployment. RoboCat demonstrated self-improvement by generating 10,000 synthetic demonstrations on new tasks, filtering by success, then retraining the base model, improving zero-shot success rates from 36% to 74% over 5 iterations^[14]. This closed-loop approach requires continuous data collection infrastructure that captures edge cases and failure modes as they occur.

RT-1 deployed in Google's offices collected 12,000 intervention episodes—cases where human operators corrected policy failures—over 6 months, then retrained the model on the augmented dataset, reducing intervention rate from 18% to 4%^[7]. DROID adopted a similar strategy, collecting 76,000 trajectories over 12 months across 86 buildings, with monthly model updates incorporating the latest 5,000-8,000 episodes^[3]. This continuous learning loop requires data collection infrastructure that scales with deployment, not just initial training.

Truelabel's marketplace supports recurring bounties (standing orders for 500-2,000 trajectories per month), enabling buyers to refresh training data as robots encounter new environments, objects, and tasks. Collectors receive priority routing for repeat buyers, reducing lead time from 4-6 weeks for one-off bounties to 1-2 weeks for recurring orders. Buyers can also specify failure-mode targeting.

Benchmark Saturation: Why Open Datasets No Longer Differentiate VLA Performance

Leading VLA models now achieve 85-95% success on standard benchmarks, compressing performance differences below statistical significance. OpenVLA scored 91.3% on Open X-Embodiment tasks, compared to 89.7% for Octo and 87.2% for RT-2-X^[1]. Gaps of 1.6 to 4.1 percentage points fall within the sampling uncertainty of a typical 100-episode evaluation, where a 95% confidence interval at these success rates extends several percentage points in each direction, making it unclear whether improvements stem from architecture, data, or random seed selection.
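The statistical point can be checked with a Wilson score interval for a binomial success rate. The counts below (91 and 87 successes out of 100 episodes) are rounded, illustrative stand-ins for the reported percentages, not actual evaluation logs.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Approximate 95% Wilson score interval for a binomial success rate."""
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Hypothetical 100-episode evaluations at roughly the reported success rates.
model_a = wilson_interval(91, 100)
model_b = wilson_interval(87, 100)
overlap = model_a[0] < model_b[1]   # True when the two intervals overlap
```

At 100 episodes each interval is more than ten percentage points wide, so the two ranges overlap heavily. Separating models that differ by a few points would take a much larger number of evaluation episodes.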

CALVIN introduced long-horizon tasks requiring 5-step instruction chains, but the dataset contains only 24,000 episodes in a single simulated kitchen, and top models now achieve 78-82% success^[10]. LongBench evaluated policies on 100-step real-world tasks and found that all models—including those trained on millions of trajectories—achieved <10% success, revealing a massive generalization gap that existing datasets do not address^[11]. The bottleneck has shifted from short-horizon manipulation to long-horizon planning, multi-step error recovery, and tool use—capabilities absent from current benchmarks.

ManipArena proposed 100 real-world tasks spanning household, warehouse, and assembly domains, with success criteria requiring 20-50 step sequences and tool use (screwdrivers, pliers, tape). Initial evaluations showed that models trained on Open X-Embodiment achieved 6% success, and even task-specific fine-tuning on 2,000 demonstrations raised success to only 23%^[23]. This performance collapse indicates that existing datasets lack the task diversity and complexity required for general-purpose manipulation.

Truelabel's marketplace enables buyers to post bounties for benchmark-specific data.

Truelabel's VLA Data Marketplace: Distributed Collection at Web Scale

Truelabel connects robotics teams to 20,000+ collectors across 160+ countries who own teleoperation hardware, egocentric cameras, and motion-capture systems, enabling parallel data capture at a scale previously reserved for web scraping. Buyers post bounties specifying embodiment (humanoid, quadruped, dexterous hand), environment (warehouse, kitchen, outdoor), task complexity (short-horizon pick-place vs. multi-step assembly), and sensor modalities (RGB-D, force-torque, tactile). Collectors bid on bounties, deliver action-labeled trajectories in LeRobot or RLDS format, and receive payment upon buyer acceptance.

Every trajectory includes cryptographically signed provenance metadata—collector identity, capture timestamp, hardware serial numbers, consent records—using C2PA content credentials, enabling buyers to audit data lineage and demonstrate EU AI Act compliance. Buyers receive perpetual commercial licenses with indemnification against IP claims, eliminating the legal ambiguity that plagues open datasets. Per-trajectory costs range from $3 for simple pick-place to $18 for bimanual assembly, 40-60% below managed service providers like Scale AI and Appen.

Recurring bounties—standing orders for 500-2,000 trajectories per month—enable continuous learning loops where deployed robots feed failure modes back into training data refresh cycles. Collectors receive priority routing for repeat buyers, reducing lead time from 4-6 weeks for one-off bounties to 1-2 weeks for recurring orders. Truelabel's distributed model eliminates the hardware shipping, facility overhead, and operator training costs that constrain centralized data collection, enabling robotics teams to scale training data at the same pace they scale compute.

External references and source context

[1] OpenVLA: An Open-Source Vision-Language-Action Model (arXiv). OpenVLA 7B model outperformed RT-2-X 55B by 16.5% on manipulation benchmarks, trained on 970,000 trajectories from Open X-Embodiment.
[2] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv). RT-2 showed web-scale vision-language pretraining improves robot policy generalization by 3× when fine-tuned on demonstrations.
[3] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv). DROID collected 76,000 trajectories across 564 skills and 86 buildings using 13 mobile manipulators in 350 operator-hours.
[4] Ego4D: Around the World in 3,000 Hours of Egocentric Video (arXiv). Ego4D contains 3,670 hours of first-person video across 74 scenarios but lacks joint-level action annotations.
[5] RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning (arXiv). RLDS standardized VLA data format as episodes containing observation dictionaries, action arrays, and reward scalars.
[6] Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). Open X-Embodiment aggregated 1 million trajectories from 22 robot platforms for cross-embodiment VLA training.
[7] RT-1: Robotics Transformer for Real-World Control at Scale (arXiv). RT-1 trained on 130,000 demonstrations achieved 97% success on seen tasks but 13% on unseen embodiments; RGB-only achieved 89% pick-place success vs 34% insertion, rising to 81% with force-torque.
[8] BridgeData V2: A Dataset for Robot Learning at Scale (arXiv). BridgeData V2 contains 60,000 trajectories across 13 tasks using a single WidowX 250 arm in a controlled lab.
[9] RoboNet: Large-Scale Multi-Robot Learning (arXiv). RoboNet aggregated 15 million frames from 7 platforms but 90% involve pick-place or push on flat surfaces; scripted exploration yielded 40% failure states.
[10] CALVIN paper (arXiv). CALVIN introduced 5-step long-horizon instruction chains but contains only 24,000 episodes in a single simulated kitchen.
[11] LongBench: Evaluating Robotic Manipulation Policies on Real-World Long-Horizon Tasks (arXiv). LongBench evaluated policies on 100-step real-world tasks; models trained on short-horizon datasets achieved <5% success.
[12] Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning (arXiv). 2021 survey found 68% of sim-trained policies required 500-2,000 real-world demonstrations to match teleoperation baselines; 22% failed to transfer due to unmodeled contact dynamics.
[13] Project site (sites.google.com). Octo trained on 800,000 trajectories with embodiment-specific action heads, achieving 52% higher success than single-embodiment baselines.
[14] RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation (arXiv). RoboCat self-improvement reduced human teleoperation requirements by 40%; generated 10,000 synthetic demos, filtered by success, improved zero-shot success from 36% to 74% over 5 iterations.
[15] RoboNet dataset license (GitHub raw content). RoboNet uses a custom non-commercial license forbidding training models for sale or commercial deployment.
[16] EPIC-KITCHENS-100 annotations license (GitHub). EPIC-KITCHENS-100 annotations are CC BY-NC 4.0, permitting academic use but requiring separate commercial licensing.
[17] Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (EUR-Lex). EU AI Act Article 10 mandates training data for high-risk AI systems be traceable to source with collection method documentation.
[18] Datasheets for Datasets (arXiv). Datasheets for Datasets proposed a 57-question template covering motivation, composition, collection, preprocessing, distribution.
[19] scale.com physical ai (scale.com). Scale AI's Physical AI platform pairs teleoperation hardware with annotation pipelines for 6-DOF poses and contact forces.
[20] Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World (arXiv). Domain randomization forces policies to learn robust features invariant to lighting, texture, shape, dynamics variations.
[21] Sim-to-Real Transfer of Robotic Control with Dynamics Randomization. 2017 study trained a 7-DOF reaching policy in simulation with randomized joint friction, link masses, actuator gains, achieving 94% real-world success without fine-tuning.
arXiv ↩
RLBench: The Robot Learning Benchmark & Learning Environment
RLBench provides 100 simulated manipulation tasks in PyBullet with procedurally generated object poses
arXiv ↩
ManipArena: Comprehensive Real-world Evaluation of Reasoning-Oriented Generalist Robot Manipulation
ManipArena proposed 100 real-world tasks requiring 20-50 steps and tool use; Open X-Embodiment models achieved 6% success, 23% after 2,000-demo fine-tuning
arXiv ↩
truelabel physical AI data marketplace bounty intake
Truelabel marketplace connects buyers to 20,000+ collectors across 160+ countries for distributed VLA data collection
truelabel.ai

FAQ

Why do VLA models require action labels when vision-language models trained on internet video achieve strong zero-shot performance?

Vision-language models learn visual semantics (object recognition, spatial relationships, scene understanding) from internet video, but they do not learn the sensorimotor mappings required for physical interaction. A model that knows what a mug looks like cannot infer the gripper trajectory, wrist torque, or contact force needed to grasp it without action-labeled demonstrations. RT-2 showed that web-scale pretraining improves robot policy generalization by 3× when fine-tuned on task-specific demonstrations, but the fine-tuning step remains mandatory. OpenVLA, with only 7B parameters trained on 970,000 action-labeled trajectories, achieved 16.5% higher success than RT-2-X (55B parameters), suggesting that sensorimotor data quality can matter more than model scale.

How does teleoperation data quality compare to autonomous scripted exploration for VLA training?

Teleoperation yields high-quality demonstrations with precise action labels but scales linearly with human operator hours. DROID collected 76,000 trajectories in 350 hours of teleoperation, an average of about 217 trajectories per operator-hour. Autonomous scripted exploration generates trajectories at machine speed but introduces distribution shift: RoboNet's 15 million frames included 40% failure states (collisions, dropped objects), and training on failure-heavy data degrades policy performance unless those states are filtered out. Robomimic showed that filtering out the bottom 25% of trajectories by task success improved imitation learning accuracy by 18%. Truelabel's hybrid bounty system lets buyers request teleoperation data for high-value tasks and autonomous scripted data for coverage, balancing quality and scale.
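
As a minimal sketch of the success-based filtering described above, the following drops the lowest-scoring quarter of a trajectory set and reproduces the throughput arithmetic. The `Trajectory` type, the `success_score` field, and both helper functions are hypothetical names chosen for illustration; they are not Robomimic's or DROID's API.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record: an episode id plus a task-success score in [0, 1]."""
    episode_id: str
    success_score: float


def drop_bottom_fraction(trajectories, fraction=0.25):
    """Return the trajectories left after removing the lowest-scoring fraction."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    ranked = sorted(trajectories, key=lambda t: t.success_score)
    n_drop = int(len(ranked) * fraction)  # rounds down, so small sets keep more data
    return ranked[n_drop:]


def trajectories_per_hour(n_trajectories, operator_hours):
    """Collection throughput in trajectories per operator-hour."""
    return n_trajectories / operator_hours


# Made-up scores, purely to exercise the helpers.
scores = [0.1, 0.9, 0.4, 1.0, 0.0, 0.7, 0.8, 0.3]
demo = [Trajectory(f"ep{i}", s) for i, s in enumerate(scores)]
kept = drop_bottom_fraction(demo, fraction=0.25)
print([t.episode_id for t in kept])
print(round(trajectories_per_hour(76_000, 350)))
```

In practice the success score would come from task-completion labels or a learned success classifier, and the cutoff fraction is a tuning choice rather than a fixed rule.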

What licensing restrictions prevent commercial use of open VLA datasets?

RoboNet uses a custom non-commercial license that forbids training models for sale or deployment in commercial products. EPIC-KITCHENS-100 annotations are CC BY-NC 4.0, permitting academic use but requiring separate commercial licensing. Creative Commons BY-NC terms explicitly exclude revenue-generating applications, creating legal risk for startups training VLA models on these datasets. DROID, released under the MIT license, permits commercial use but lacks the structured provenance metadata (collector identity, capture timestamps, hardware specifications) required for EU AI Act Article 10 compliance. Truelabel embeds collector identity, capture location, hardware serial numbers, and consent records into dataset metadata using C2PA content credentials and delivers data under perpetual commercial licenses, so buyers can audit data lineage and demonstrate regulatory compliance.
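
To make the licensing and provenance checks above concrete, here is a minimal sketch of a per-episode metadata audit. Everything in it is an assumption for illustration: the `ProvenanceRecord` fields, the license identifiers, and the `audit` helper are hypothetical, and the code neither implements C2PA nor constitutes a legal compliance check.

```python
from dataclasses import dataclass, fields
from typing import Optional

# Hypothetical identifiers a buyer might treat as non-commercial.
NON_COMMERCIAL_LICENSES = {"CC-BY-NC-4.0", "custom-non-commercial"}


@dataclass
class ProvenanceRecord:
    """Hypothetical per-episode metadata; field names are illustrative only."""
    license_id: str
    collector_id: Optional[str] = None
    captured_at: Optional[str] = None      # ISO 8601 timestamp
    hardware_serial: Optional[str] = None
    consent_ref: Optional[str] = None


def audit(record):
    """Return a list of issues; an empty list means none were found."""
    issues = []
    if record.license_id in NON_COMMERCIAL_LICENSES:
        issues.append(f"license {record.license_id} excludes commercial use")
    for f in fields(record):
        if getattr(record, f.name) is None:
            issues.append(f"missing {f.name}")
    return issues


# A permissively licensed episode with no provenance fields filled in.
print(audit(ProvenanceRecord(license_id="MIT")))
# A fully documented episode under a non-commercial license.
print(audit(ProvenanceRecord("CC-BY-NC-4.0", "c-001", "2024-05-01T12:00:00Z",
                             "SN123", "consent-42")))
```

The two example records mirror the two failure modes in the answer: a commercially usable license with missing provenance, and complete provenance under a license that rules out commercial use.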

Why do VLA models trained on millions of trajectories still fail on long-horizon tasks?

Existing datasets concentrate on short-horizon manipulation (pick-place, push, reach) with task horizons of 5-20 steps. CALVIN introduced 5-step instruction chains but contains only 24,000 episodes in a single simulated kitchen. LongBench evaluated policies on 100-step real-world tasks and found that models trained on Open X-Embodiment (1 million trajectories) achieved <10% success, revealing a large gap between short-horizon training data and long-horizon deployment. Real-world assembly, maintenance, and logistics workflows often require 20-50 step sequences with tool use, multi-object coordination, and error recovery, task structures that are largely absent from current training datasets. ManipArena proposed 100 real-world tasks requiring 20-50 steps and tool use; models trained on Open X-Embodiment achieved only 6% success, rising to 23% after task-specific fine-tuning on 2,000 demonstrations.

How does multi-sensor fusion improve VLA policy performance on contact-rich tasks?

Vision-only policies suffer from depth ambiguity and occlusion, limiting performance on contact-rich tasks. RT-1 trained on RGB images alone achieved 89% success on pick-place but 34% on insertion tasks requiring sub-millimeter alignment. Adding depth channels improved insertion success to 67%, and incorporating wrist force-torque sensors pushed success to 81%, indicating that force feedback is essential for precision manipulation. DROID captures RGB-D from wrist-mounted RealSense cameras and proprioceptive joint states at 10 Hz but omits force-torque and tactile signals. Dex-YCB includes 582,000 frames of dexterous grasps with ground-truth 6-DOF object poses and contact labels, enabling policies to learn grasp stability from visual-tactile correlation.

What action space normalization strategies enable cross-embodiment VLA training?

Open X-Embodiment normalized actions by computing per-dataset mean and standard deviation, then clipping to [-1, 1], but this approach fails when action dimensions have semantic mismatches (e.g., 7-DOF arm vs. 2-DOF gripper). RT-1 discretized continuous actions into 256 bins per dimension, enabling transformer-based policies to treat actions as token sequences, but discretization introduces quantization error that degrades performance on precision tasks. Octo trained a generalist policy on 800,000 trajectories by learning embodiment-specific action heads (separate output layers for each robot morphology) while sharing a common visual encoder and language model, achieving 52% higher success rates than single-embodiment baselines. Task-space control, which specifies end-effector position and orientation rather than joint angles, provides a morphology-agnostic action representation but requires accurate inverse kinematics.
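
The two preprocessing steps named above, per-dataset standardization with clipping and 256-bin discretization, can be sketched in plain Python. This is an illustrative reconstruction under simple assumptions (uniform bins over [-1, 1], one action dimension at a time); the function names are hypothetical and the code is not taken from Open X-Embodiment or RT-1.

```python
import statistics


def normalize_and_clip(values, low=-1.0, high=1.0):
    """Standardize one action dimension by its own mean/std, then clip to [low, high]."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # guard against a constant dimension
    return [min(high, max(low, (v - mean) / std)) for v in values]


def discretize(value, n_bins=256, low=-1.0, high=1.0):
    """Map a continuous action in [low, high] to an integer token in [0, n_bins - 1]."""
    clipped = min(high, max(low, value))
    index = int((clipped - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)  # the upper edge falls into the last bin


def undiscretize(token, n_bins=256, low=-1.0, high=1.0):
    """Return the bin-center value for a token; the residual is quantization error."""
    width = (high - low) / n_bins
    return low + (token + 0.5) * width


# Made-up values for a single action dimension (e.g. x-velocity) in one dataset.
raw = [0.02, -0.01, 0.05, 0.00, -0.03]
normalized = normalize_and_clip(raw)
tokens = [discretize(v) for v in normalized]
max_error = max(abs(v - undiscretize(t)) for v, t in zip(normalized, tokens))
print(tokens)
print(max_error)
```

With 256 uniform bins over [-1, 1], each bin is 2/256 wide, so the round-trip error is bounded by half a bin (1/256) in normalized units. How much that matters depends on the physical range the normalized interval represents, which is why the error is most visible on precision tasks.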

Looking for VLA training data?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners and helps scope consent artifacts and commercial licensing requirements before delivery.

Post a VLA Data Bounty