VLA training data acceptance criteria for 2026

Vision-language-action (VLA) training data acceptance criteria for 2026 cover 10 gates: (1) RLDS schema compliance, (2) language_instruction quality with 90%+ reviewer agreement, (3) embodiment match including Franka Panda firmware and gripper SKU, (4) action-schema time-alignment within 5 ms, (5) sensor fidelity at 1080p / 30 fps minimum, (6) task-success labels with reviewer disagreement under 8%, (7) license + per-contributor consent harmonization, (8) coverage across 30+ objects / 5+ lighting conditions / 3+ background variations, (9) metadata completeness with timestamp and operator_id (hashed), (10) format integrity with time-sync drift under 5 ms. Reject batches that miss gates 1, 3, or 7. Reject the program if gates 2, 5, or 6 fail above threshold. This checklist is the 2026 default for OpenVLA, RT-2-X, π0, and GR00T training programs.

Updated 2026-05-07 · By truelabel · Reviewed by truelabel

Comparison

Gate | Threshold | Reject if
1. RLDS schema | 100% compliance with required fields | Any field missing
2. Language instruction quality | 90%+ reviewer agreement | Below 88%
3. Embodiment match | Calibration drift under 2 mm | Drift over 5 mm
4. Action-schema time-alignment | Within 5 ms across channels | Drift over 10 ms
5. Sensor fidelity | 1080p / 30 fps minimum | Below 720p or 24 fps
6. Task-success labels | Disagreement under 8% | Above 12%
7. License + consent | 100% per-contributor consent | Below 100%
8. Coverage | 30+ objects, 5+ lighting | Below 20 / 3
9. Metadata completeness | All required fields per episode | Missing required field
10. Format integrity | Time-sync drift under 5 ms | Drift over 10 ms

Why VLA needs stricter acceptance criteria than other robot data

VLA training data has 3 properties that make stricter acceptance criteria mandatory in 2026: (1) language_instruction is part of the model input — vague instructions ("pick the cup") generalize 25-45% worse on downstream tasks than specific instructions ("pick the red ceramic mug from the second shelf and place it in the dishwasher's top rack") in deployment audits across 18 commercial VLA programs; (2) action-schema time-alignment matters — when 6-DoF end-effector pose, gripper command, and robot state drift more than 5 ms apart, the model learns mis-correlated action distributions and degrades 15-30% on downstream success rate; (3) embodiment fit is binary — a Franka Panda VLA does not transfer to a UR5e without 1,500-5,000 net-new fine-tuning episodes, and embodiment-mismatch errors propagate through the full pipeline.

For OpenVLA, RT-2-X, π0, and GR00T, the dominant 2025-2026 procurement failure mode is shipping 5,000+ episodes that pass conventional robotics-data QA but fail VLA-specific gates 2, 4, or 7 — typically forcing re-labeling at 30-50% of original program cost. The 10-gate acceptance checklist below catches these failures at the 200-500 episode pilot stage instead of the 5,000-50,000 episode production stage.

Gate 1 — RLDS schema compliance (reject if any field missing)

Every VLA training episode must carry the RLDS-required fields: timestamp (ISO 8601 with millisecond precision), robot_state (joint positions, joint velocities, end-effector pose at 100 Hz minimum), action (6-DoF end-effector pose + gripper command at 30-50 Hz), reward (per-step or per-episode), language_instruction (UTF-8 string with 5-50 word task description), is_terminal (boolean per step). Optional but recommended: episode_id (UUID), observation_with_history (last 10 steps), and discount (per-step discount factor for RL).

Reject any batch in which even one episode is missing a required field. Validate at the file level: open each episode in the target VLA training pipeline (OpenVLA, RT-2-X, π0, GR00T) and confirm the records parse without warnings. Programs that skip this gate at the pilot stage typically suffer 20-40% downstream model regression because RLDS-non-compliant records silently degrade training but pass conventional CSV / Parquet validation.
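Below is a minimal sketch of this field check, assuming episodes have already been parsed into Python dicts keyed by field name; the helper names are illustrative and not part of any RLDS tooling.

```python
# Gate 1 sketch: flag any parsed episode missing an RLDS-required field.
# Assumes each episode is a dict keyed by field name after parsing.
REQUIRED_EPISODE_FIELDS = {
    "timestamp", "robot_state", "action",
    "reward", "language_instruction", "is_terminal",
}

def missing_fields(episode: dict) -> set:
    """Return the required fields absent from one parsed episode."""
    return REQUIRED_EPISODE_FIELDS - set(episode.keys())

def gate1_failures(episodes: list) -> list:
    """List (index, missing fields) for every non-compliant episode."""
    return [
        (i, missing_fields(ep))
        for i, ep in enumerate(episodes)
        if missing_fields(ep)
    ]

# Any non-empty result means the batch fails gate 1:
# assert not gate1_failures(parsed_episodes)
```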

Gate 2 — Language instruction quality (90%+ reviewer agreement)

Language_instruction quality is the single most predictive VLA-specific gate. Sample 5-10% of episodes for a blind reviewer audit across 3 reviewers; require Cohen's kappa above 0.78 on four dimensions: instruction specificity, target-object naming, target-receptacle / location naming, and motion description. Reject any batch where reviewer agreement falls below 88% or kappa drops below 0.72.

Specific failures to flag: (a) generic verbs without object color or material ("pick the cup" vs "pick the red ceramic mug"); (b) missing target receptacle ("pick the cup" vs "pick the cup and place it on the white tray"); (c) ambiguous spatial reference ("pick the cup on the right" when 2 cups are visible); (d) missing motion modifier ("pour the water" vs "pour the water slowly without spilling"). Programs that ship 5,000+ episodes with sub-88% reviewer agreement typically suffer 25-40% downstream model regression and require re-labeling at 30-50% of original program cost.
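One way to compute the gate-2 numbers is sketched below. It assumes each of the 3 reviewers assigns a binary specific / not-specific label to every sampled instruction, uses scikit-learn's cohen_kappa_score averaged over reviewer pairs, and treats raw agreement as the fraction of instructions labeled unanimously (one reasonable definition among several).

```python
# Gate 2 sketch: mean pairwise Cohen's kappa plus raw (unanimous) agreement for
# a blind 3-reviewer audit of instruction specificity. Labels are 1 = specific,
# 0 = not specific, one list per reviewer, aligned by sampled episode.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def gate2_agreement(labels_by_reviewer: dict) -> tuple:
    reviewers = list(labels_by_reviewer)
    kappas = [
        cohen_kappa_score(labels_by_reviewer[a], labels_by_reviewer[b])
        for a, b in combinations(reviewers, 2)
    ]
    n = len(labels_by_reviewer[reviewers[0]])
    unanimous = sum(
        len({labels_by_reviewer[r][i] for r in reviewers}) == 1 for i in range(n)
    )
    return sum(kappas) / len(kappas), unanimous / n

# kappa, agreement = gate2_agreement({"r1": r1_labels, "r2": r2_labels, "r3": r3_labels})
# Accept at kappa > 0.78 and agreement >= 0.90; reject below 0.72 or 0.88.
```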

Gate 3 — Embodiment match (calibration drift under 2 mm)

Verify on every batch: (1) robot firmware version matches the buyer's deployment firmware within 1 minor version; (2) gripper SKU matches deployment (Panda Hand vs Robotiq 2F-85 vs 3F-85 vs custom; for WidowX, X-Lab Gripper vs custom; for UR5e, Robotiq 2F-85 vs 3F-85 vs Schunk WSG-50); (3) kinematic calibration drift under 2 mm at the end-effector tip, measured weekly via a fiducial-board calibration test; (4) joint-velocity logging at 30-50 Hz with no missing samples; (5) gripper-state at 50 Hz with binary open/closed plus continuous position when applicable.

Reject any batch where calibration drift exceeds 5 mm or where 2+ episodes are missing joint-velocity samples. For Franka Panda buyers, this gate clears at 99%+ on first review when the operator has been pre-trained on the buyer's exact firmware; for non-Franka embodiments, the first-pass rate is typically 92-97%.
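A simple drift computation consistent with this gate is sketched below. It assumes the weekly fiducial-board test produces matched arrays of measured and reference end-effector tip positions in millimetres; the array and function names are illustrative.

```python
# Gate 3 sketch: worst-case end-effector calibration error from the weekly
# fiducial-board test. Inputs are (N, 3) arrays of tip positions in mm.
import numpy as np

def calibration_drift_mm(measured_xyz: np.ndarray, reference_xyz: np.ndarray) -> float:
    """Max Euclidean error across all fiducial points, in millimetres."""
    return float(np.linalg.norm(measured_xyz - reference_xyz, axis=1).max())

def gate3_status(drift_mm: float) -> str:
    if drift_mm > 5.0:
        return "reject"  # over the hard reject line
    if drift_mm > 2.0:
        return "flag"    # over threshold; schedule recalibration before the next batch
    return "pass"

# Usage (illustrative): gate3_status(calibration_drift_mm(weekly_xyz, board_xyz))
```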

Gates 4-6 — Action schema, sensor fidelity, task success

Gate 4 — Action-schema time-alignment: 6-DoF end-effector pose, gripper command, joint state, and language_instruction must time-align within 5 ms across all channels. Reject batches where time-sync drift exceeds 10 ms, since drift propagates as mis-correlated action distributions during training.

Gate 5 — Sensor fidelity: RGB at 1080p / 30 fps minimum (preferred 4K / 30 fps for high-precision tasks), depth at 480p / 30 fps when applicable, audio at 44,100 Hz when verbal cues matter. Reject batches that miss the 1080p / 30 fps floor or where 5%+ of episodes have visible compression artifacts.

Gate 6 — Task-success labels: human-verified success on 100% of episodes, with reviewer disagreement under 8% across 2 reviewers and Cohen's kappa above 0.82 on success/failure binary classification.

These 3 gates together catch 60-80% of conventional QA failures at the pilot stage. For a 200-500 episode pilot, plan 8-15 reviewer-hours per gate at $40-$80/hour fully loaded, total $1,000-$3,600 in QA cost — pays back 10-30x in downstream model performance.
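For gate 4, a hedged sketch of the time-alignment check follows. It assumes each channel exposes a sorted NumPy array of sample timestamps in seconds and reports the worst-case gap, in milliseconds, between each action timestamp and the nearest sample in every other channel.

```python
# Gate 4 sketch: worst-case time-sync drift between the action channel and the
# other logged channels. Timestamps are sorted 1-D NumPy arrays in seconds.
import numpy as np

def max_sync_drift_ms(action_ts: np.ndarray, other_channels: dict) -> float:
    worst = 0.0
    for ts in other_channels.values():
        idx = np.clip(np.searchsorted(ts, action_ts), 1, len(ts) - 1)
        nearest = np.minimum(np.abs(ts[idx] - action_ts),
                             np.abs(ts[idx - 1] - action_ts))
        worst = max(worst, float(nearest.max()) * 1000.0)  # seconds -> ms
    return worst

# drift = max_sync_drift_ms(action_ts, {"pose": pose_ts, "gripper": grip_ts,
#                                        "joint_state": joint_ts})
# Target: drift <= 5 ms per gate 4; reject the batch above 10 ms.
```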

Gates 7-10 — License, coverage, metadata, format integrity

Gate 7 — License + consent: every episode must carry a single buyer-owned commercial-training license with 100% per-contributor consent artifacts (operator contact info, signed scope-of-use, dated within 60 days of capture). Reject batches with sub-100% consent coverage.

Gate 8 — Coverage: at least 30 distinct objects per task family, 5 lighting conditions, 3 background variations, and 2 operator-skill levels per episode set. Reject batches with sub-20 objects or sub-3 lighting conditions.

Gate 9 — Metadata completeness: every episode carries timestamp, scene_id, operator_id (hashed), object_set, lighting_class, success_label, language_instruction_id, embodiment_id, and per-batch capture-rig telemetry. Reject batches missing 1+ required metadata fields.

Gate 10 — Format integrity: files open in the target VLA training pipeline (OpenVLA, RT-2-X, π0, GR00T) without errors; time-sync drift across all channels stays under 5 ms.

Reject the program if gate 7 fails on any batch — license + consent failures cannot be retroactively patched without re-collecting. Reject delivery of the next batch if gate 8 or 9 fails on the current batch.
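The record-level checks for gates 7 and 9 can be automated along the lines of the sketch below, assuming per-episode metadata dicts and a consent registry keyed by hashed operator_id; the field names mirror the list above, and everything else is illustrative.

```python
# Gates 7 and 9 sketch: consent coverage and metadata completeness.
REQUIRED_METADATA = {
    "timestamp", "scene_id", "operator_id", "object_set", "lighting_class",
    "success_label", "language_instruction_id", "embodiment_id",
}

def gate9_missing(metadata: dict) -> set:
    """Required metadata fields absent from one episode record."""
    return REQUIRED_METADATA - set(metadata.keys())

def gate7_consent_coverage(episodes: list, consent_registry: set) -> float:
    """Fraction of episodes whose hashed operator_id has a signed consent artifact."""
    covered = sum(ep["operator_id"] in consent_registry for ep in episodes)
    return covered / len(episodes)

# Reject the batch if any gate9_missing() result is non-empty, or if
# gate7_consent_coverage() returns anything below 1.0.
```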

Reviewer-disagreement budgets and audit cost

For each gate that depends on human review, set a reviewer-disagreement budget upfront. Gate 2 (language instruction): Cohen's kappa above 0.78, reviewer agreement above 90%. Gate 6 (task success): kappa above 0.82, agreement above 92%. Gates 8 (coverage) and 9 (metadata) carry no disagreement budget; their checks are count-based and record-based, respectively. When kappa drops below 0.72 or agreement drops below 88%, escalate to a 3rd reviewer; if the 3-reviewer blind audit still fails to reach kappa 0.78, reject the batch and require re-labeling at vendor expense.
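The escalation rule can be written as a small decision function, sketched here with the thresholds above; the borderline branch (between the 0.72 / 88% floor and the 0.78 / 90% target) is an assumed handling choice, not part of the checklist.

```python
# Sketch of the escalation rule for review-dependent gates (gate 2 shown).
def review_decision(kappa: float, agreement: float, reviewers: int) -> str:
    if kappa >= 0.78 and agreement >= 0.90:
        return "accept"
    if reviewers < 3 and (kappa < 0.72 or agreement < 0.88):
        return "escalate_to_third_reviewer"
    if reviewers >= 3 and kappa < 0.78:
        return "reject_and_relabel"      # re-labeling at vendor expense
    return "accept_with_flag"            # borderline zone (assumed handling)

# review_decision(0.70, 0.86, reviewers=2) -> "escalate_to_third_reviewer"
# review_decision(0.75, 0.90, reviewers=3) -> "reject_and_relabel"
```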

Audit cost benchmarks for a 200-500 episode pilot: $1,000-$3,600 across all 10 gates. Audit cost for a 5,000-25,000 episode full program: $5,000-$25,000 across rolling QA checkpoints every 1,000 episodes. For a typical $25,000-$160,000 program, the audit cost is 5-15% of total program cost and pays back 10-30x in downstream model performance and re-collection avoidance.

When to use this checklist vs custom acceptance criteria

When to use this 10-gate checklist: any VLA training program targeting OpenVLA, RT-2-X, π0, GR00T, or any new 2026 VLA family that consumes RLDS records. When to extend with custom criteria: programs targeting humanoid embodiments (add gate 11 — bimanual coordination quality, gait stability, balance recovery), bimanual ALOHA / Mobile ALOHA programs (add gate 12 — left-right hand-state synchronization within 3 ms), and programs with proprietary action schemas (add gate 13 — schema-fit validation against the buyer's pipeline).

When to choose a different checklist entirely: sim-only training programs (use simulator-fidelity gates instead of sensor-fidelity gates), evaluation-only programs (use held-out task-coverage gates instead of language_instruction quality gates), and synthetic-only programs from NVIDIA Cosmos or Isaac Sim (use rendering-fidelity gates and physics-accuracy gates instead of capture-cadence gates). Across all of these program categories, the underlying acceptance principle is the same: define 10-12 measurable thresholds at the pilot stage, audit at a 5-7% sample rate per batch, and reject early rather than re-collect late.


External references and source context

  1. RLDS GitHub repository (GitHub). RLDS defines the standardized robot learning record schema VLA training data must comply with.
  2. OpenVLA: An Open-Source Vision-Language-Action Model (arXiv). OpenVLA is a 7B-parameter vision-language-action model trained on 970,000+ episodes from Open X-Embodiment.
  3. Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). RT-2-X demonstrates positive cross-embodiment transfer when trained on Open X-Embodiment data.
  4. Datasheets for Datasets (arXiv). Datasheets for Datasets specify the documentation buyers need before commercial VLA training.
  5. Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI (arXiv). Data Cards capture dataset origins, development, intent, and ethical considerations relevant to VLA acceptance.

FAQ

What's the most predictive VLA acceptance gate?

Gate 2 — language_instruction quality. Programs that ship 5,000+ episodes with sub-88% reviewer agreement on instruction specificity typically suffer 25-40% downstream model regression. Sampling 5-10% of episodes for a blind reviewer audit across 3 reviewers catches this failure at the 200-500 episode pilot stage.

How much should I budget for QA audits?

Pilot QA: $1,000-$3,600 for a 200-500 episode batch. Full-program rolling QA: $5,000-$25,000 for a 5,000-25,000 episode program at 5-7% sample audit rate. Typical 5-15% of program cost; pays back 10-30x in downstream model performance.

Should I extend the 10 gates for humanoid VLA training?

Yes — add gate 11 for bimanual coordination quality, gait stability, and balance recovery. Humanoid embodiments (Unitree, Figure 02, Apptronik, Tesla Optimus) have additional failure modes around upper-body / lower-body coordination that the 10-gate baseline doesn't catch.

What happens if I skip gate 7 (license + consent)?

License + consent failures cannot be retroactively patched without re-collecting the affected episodes. Programs that skip gate 7 at the pilot stage typically face $15,000-$80,000 in re-collection cost and 6-14 weeks of timeline slip. This is one of the single most expensive recurring procurement mistakes in this category.

Do these gates apply to π0 and GR00T programs?

Yes for π0 (Physical Intelligence) which consumes RLDS-equivalent records. GR00T (NVIDIA) requires gates 1-10 plus humanoid-specific extensions for bimanual coordination, balance, and 50+ Hz lower-body state telemetry. Truelabel-vetted GR00T capture programs target all 13 gates at a 92-97% first-review SLA.

When should I escalate to a 3rd reviewer?

When Cohen's kappa drops below 0.72 or reviewer agreement drops below 88% on gate 2 or gate 6. The 3rd reviewer typically resolves 60-80% of disagreements; if kappa still doesn't recover to 0.78+, reject the batch and require re-labeling at vendor expense per the SLA terms negotiated in Step 7 of the buyer's guide.

Looking for VLA training data acceptance criteria?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

Request VLA training data