Buyer's guide
Physical AI training data buyer's guide for 2026
Buying physical AI training data in 2026 means navigating 6 modality classes (egocentric video, teleoperation, robot demonstrations, evaluation, synthetic, multimodal sensor fusion), 22+ embodiment types (Franka Panda, WidowX 250, UR5e, xArm 7, Stretch 3, Kuka iiwa, Sawyer, ALOHA, Mobile ALOHA, AgiBot, Unitree, custom humanoid), 8 license categories (Apache-2.0, MIT, CC BY 4.0, CC BY-NC 4.0, research-only, commercial-only, custom-research, no-license), and 5 delivery formats (RLDS, MCAP, Parquet, HDF5, LeRobotDataset v3.0). This 8-step buyer's guide covers vendor selection, sample QA gates, contract terms, and pricing benchmarks for 2026 procurement programs.
Overview
| Step | Action | Output |
|---|---|---|
| 1. Define modality | Egocentric video, teleop, demos, eval, synthetic, sensor fusion | Modality spec sheet |
| 2. Specify embodiment | Franka Panda, WidowX 250, UR5e, etc. | Embodiment + rig spec |
| 3. Choose license | Apache-2.0, MIT, commercial, custom | License decision |
| 4. Pick format | RLDS, MCAP, Parquet, HDF5, LeRobotDataset v3.0 | Format spec |
| 5. Run vendor bake-off | 2-3 candidate marketplaces, 200-500 episode pilots | Pilot QA results |
| 6. Define QA gates | 12 acceptance criteria, reviewer thresholds | QA rubric |
| 7. Negotiate contract | SLA, indemnification, consent, exclusivity | Signed contract |
| 8. Scale and re-verify | 5,000-50,000 episodes with rolling QA | Production corpus |
Step 1 — Define your modality and task family
Physical AI training data spans 6 modality classes: (1) egocentric first-person video at 1080p / 30 fps with hand-pose tracking, gaze, and language_instruction; (2) teleoperation traces at 30-50 Hz control cadence with 100-200 Hz state telemetry, 6-DoF end-effector pose, joint-velocity logging, and gripper state; (3) robot demonstrations including kinesthetic teaching and motion-capture replay; (4) evaluation datasets for held-out testing with 200-2,000 task variants per evaluation suite; (5) synthetic data from NVIDIA Cosmos, Isaac Sim, MuJoCo MJX, or CoppeliaSim; (6) multimodal sensor fusion combining LiDAR (16-128 channel), multi-camera RGB-D, IMU at 200 Hz, force-torque at 1 kHz, and tactile signals.
For each modality, write a 1-page spec covering: target task family (5-15 tasks), expected episode count (5,000-50,000), per-episode duration (30-300 seconds), capture cadence, sensor fidelity requirements, language_instruction depth, and acceptance success rate (typically 92-97% on first review). The 1-page spec is the single most predictive procurement document: programs that skip it typically suffer 25-40% downstream model regression after deployment because vendor and buyer never aligned on the task taxonomy. A structured sketch of the spec follows the list below.
- Egocentric video — 1080p / 30 fps, hand-pose, gaze
- Teleoperation — 30-50 Hz control, 100-200 Hz state
- Robot demonstrations — kinesthetic + motion-capture replay
- Evaluation datasets — 200-2,000 task variants per suite
- Synthetic — Cosmos / Isaac Sim / MuJoCo / CoppeliaSim
- Multimodal sensor fusion — LiDAR + camera + IMU + tactile
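To make the 1-page spec concrete, here is a minimal sketch of it as a structured record. The field names are illustrative rather than a required schema, and the example values sit inside the ranges above.

```python
from dataclasses import dataclass

@dataclass
class ModalitySpec:
    """1-page modality spec from Step 1 (illustrative field names)."""
    modality: str                        # one of the 6 modality classes
    task_family: list[str]               # 5-15 target tasks
    episode_count: int                   # 5,000-50,000 expected episodes
    episode_duration_s: tuple[int, int]  # 30-300 seconds per episode
    control_cadence_hz: int              # e.g. 30-50 Hz for teleop control
    state_telemetry_hz: int              # e.g. 100-200 Hz state telemetry
    min_first_pass_acceptance: float     # typically 0.92-0.97

spec = ModalitySpec(
    modality="teleoperation",
    task_family=["pick_place", "drawer_open", "cable_route"],
    episode_count=10_000,
    episode_duration_s=(30, 300),
    control_cadence_hz=50,
    state_telemetry_hz=200,
    min_first_pass_acceptance=0.95,
)
```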
Step 2 — Match embodiment to deployment robot
Embodiment fit dominates physical AI data quality in 2026. The 22+ embodiment types in active commercial use include: Franka Panda 7-DoF (most common, ~30% of programs), WidowX 250 (~15%), UR5e and UR10e (~12%), xArm 7 (~8%), Stretch 3 (~5%), Kuka iiwa (~5%), Sawyer (~3%), Yaskawa (~3%), FANUC (~2%), ABB (~2%), bimanual ALOHA / Mobile ALOHA (~5%), humanoid Unitree H1/G1 (~3%), Figure 02 (~2%), Apptronik Apollo (~2%), Tesla Optimus (~1%), AgiBot (~1%), and custom industrial arms (~1%). For each embodiment, the buyer should require: firmware version, gripper SKU (Panda Hand vs Robotiq 2F-85 vs custom), kinematic calibration drift under 2 mm, joint-velocity logging at 30-50 Hz, and gripper-state logging at 50 Hz minimum.
When the buyer's deployment robot is on the rare side (FANUC, ABB, Yaskawa, custom), open-license corpora typically under-cover by 80-95%, and 5,000-25,000 net-new commercial-license episodes are required to recover the 30-55% deployment-side degradation. When the buyer's robot is Franka Panda or WidowX 250, DROID and BridgeData V2 provide strong open-license pretraining substrates and only 5,000-15,000 net-new fine-tuning episodes are typically needed.
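These rig requirements are simple to encode as a pre-award validation pass. A minimal sketch, assuming hypothetical metadata field names rather than any vendor's actual API:

```python
def check_embodiment_rig(rig: dict) -> list[str]:
    """Flag violations of the Step 2 rig requirements (illustrative thresholds)."""
    failures = []
    if not rig.get("firmware_version"):
        failures.append("missing firmware_version")
    if not rig.get("gripper_sku"):
        failures.append("missing gripper_sku (Panda Hand, Robotiq 2F-85, custom)")
    if rig.get("calibration_drift_mm", float("inf")) > 2.0:
        failures.append("kinematic calibration drift exceeds 2 mm")
    if rig.get("joint_velocity_log_hz", 0) < 30:
        failures.append("joint-velocity logging below 30 Hz")
    if rig.get("gripper_state_hz", 0) < 50:
        failures.append("gripper-state logging below 50 Hz minimum")
    return failures
```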
Step 3 — Pick license posture and indemnification rider
License posture in 2026 has 8 categories: (1) Apache-2.0 — permissive, commercial OK, license and copyright-notice retention required (cadene/droid is the canonical example); (2) MIT — permissive, commercial OK, attribution required (BridgeData V2); (3) CC BY 4.0 — commercial OK with attribution and NOTICE file (RoboNet); (4) CC BY-NC 4.0 — non-commercial only, hard block on paid products (EPIC-KITCHENS); (5) research-only with named-PI Data Use Agreement (Ego4D); (6) commercial-only with buyer-owned single license (Truelabel, Scale AI, Encord net-new programs); (7) custom-research with case-by-case commercial exception (most lab-specific datasets); (8) no-license-file — requires upstream license review (10-20% of Hugging Face robotics datasets).
For enterprise legal review, the typical due-diligence cost per license category is: Apache-2.0 / MIT — 2-4 hours per dataset; CC BY 4.0 — 4-8 hours plus NOTICE file build; research-only — 8-24 hours plus DUA negotiation (often unsuccessful for commercial); commercial buyer-owned — 4-12 hours plus indemnification rider review. Indemnification riders typically run $5,000-$40,000 per program for 5,000-25,000 episode tier and cover IP infringement, contributor-consent claims, and biometric / personality-rights exposure.
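For triage before full legal review, the eight categories reduce to a small lookup table. A sketch with invented category keys; the review-hour ranges restate the estimates above:

```python
# Commercial-use posture per license category (Step 3). Review-hour ranges
# are the per-dataset due-diligence estimates from this guide; None means
# the guide gives no figure for that category.
LICENSE_POSTURE = {
    "apache-2.0":      {"commercial_ok": True,  "review_hours": (2, 4)},
    "mit":             {"commercial_ok": True,  "review_hours": (2, 4)},
    "cc-by-4.0":       {"commercial_ok": True,  "review_hours": (4, 8)},   # + NOTICE file build
    "cc-by-nc-4.0":    {"commercial_ok": False, "review_hours": None},     # hard block on paid products
    "research-only":   {"commercial_ok": False, "review_hours": (8, 24)},  # DUA, often unsuccessful
    "commercial-only": {"commercial_ok": True,  "review_hours": (4, 12)},  # + rider review
    "custom-research": {"commercial_ok": None,  "review_hours": None},     # case-by-case exception
    "no-license-file": {"commercial_ok": None,  "review_hours": None},     # upstream review required
}

# None means "unresolved without counsel": route to legal before any training use.
assert LICENSE_POSTURE["cc-by-nc-4.0"]["commercial_ok"] is False
```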
Step 4 — Select delivery format and schema
Delivery format shapes downstream training pipeline cost. The 5 dominant formats in 2026: (1) RLDS — Google Research's record schema for robot learning data with timestamp, robot_state, action, reward, language_instruction, is_terminal; (2) MCAP — open container for multimodal log data, used in 80%+ of teleoperation pipelines; (3) Parquet — columnar format ideal for large-scale dataset storage with Hugging Face and Spark integration; (4) HDF5 — hierarchical format used by 60%+ of academic robotics datasets; (5) LeRobotDataset v3.0 — Hugging Face's standardized format with bundled video / state / action / metadata. For VLA training, RLDS is the preferred schema because OpenVLA, RT-2-X, and π0 pipelines consume RLDS records natively.
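To make the schema choice concrete, here is what a single step record looks like using the RLDS field names listed above. This is an illustrative dict, not the canonical TFDS feature spec from the RLDS repository:

```python
# One RLDS-style step record (illustrative values; the canonical schema is
# defined as TFDS features in the RLDS repository).
step = {
    "timestamp": 1761923400.033,       # capture time in seconds
    "robot_state": [0.12, -0.48, 0.33, 0.00, 0.71, -0.02, 0.55],  # e.g. 7 joint positions
    "action": [0.01, 0.00, -0.02, 0.0, 0.0, 0.0, 1.0],            # e.g. delta EE pose + gripper
    "reward": 0.0,
    "language_instruction": "place the red mug on the upper shelf",
    "is_terminal": False,
}
episode = {"steps": [step], "episode_metadata": {"scene_id": "scene_0042"}}
```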
When the buyer's pipeline is RLDS-native, require RLDS-compliant delivery from day 1; the alternative is a 200-400 hour engineering integration cost to map vendor-specific formats. When the buyer's pipeline is MCAP-native, accept MCAP and convert to RLDS at training time. For Parquet / HDF5 / LeRobotDataset v3.0 deliveries, the conversion cost is typically 100-200 hours per pipeline.
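When an HDF5 delivery has to feed an RLDS-native pipeline, the conversion reduces to a field-mapping pass per episode. A minimal sketch using h5py; the dataset paths inside the file are hypothetical and vary by vendor:

```python
import h5py

def hdf5_episode_to_steps(path: str) -> list[dict]:
    """Map one vendor HDF5 episode to RLDS-style step dicts (illustrative paths)."""
    with h5py.File(path, "r") as f:
        states = f["observations/robot_state"][:]   # hypothetical dataset path
        actions = f["actions"][:]                   # hypothetical dataset path
        instruction = f.attrs.get("language_instruction", "")
        n = len(actions)
        return [
            {
                "robot_state": states[i].tolist(),
                "action": actions[i].tolist(),
                "language_instruction": instruction,
                "is_terminal": i == n - 1,
            }
            for i in range(n)
        ]
```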
Step 5 — Run a vendor bake-off
Run a 4-week bake-off across 2-3 candidate marketplaces before committing to a full program. Week 1: ship the same 200-500 episode pilot spec to each candidate with RLDS-compliant delivery, per-contributor consent artifacts, a single buyer-owned commercial license, RGB at 1080p / 30 fps, 6-DoF end-effector pose at 100 Hz, joint-velocity logging at 30-50 Hz, and human-verified task-success labels. Week 2: each marketplace ships its pilot batch; measure delivery-date adherence, sample-to-acceptance rate, reviewer disagreement, and per-clip QA-gate failure rate. Week 3: blind-rank the pilot deliveries against 8 buyer-decision criteria; require 2 independent reviewers per delivery and surface only the top 2 finalists. Week 4: negotiate full-program pricing with the top 2 finalists.
The pricing spread between the #1 and #2 finalists typically reaches 40-60% of program cost once SLA, indemnification rider, and license terms are factored in. Skipping the bake-off is the single most expensive procurement mistake — recurring industry patterns show that programs shipping 5,000+ episodes without a competitive pilot routinely surface gate failures late, with re-collection cost typically 60-110% of the original program cost. The bake-off itself costs $2,250-$7,500 across 2-3 candidates and pays back 5-15x in pricing leverage on the full program.
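Week 3's blind ranking can be run as a weighted score across the 8 buyer-decision criteria, averaged over the 2 independent reviewers per delivery. A sketch with placeholder criteria names and weights:

```python
# Placeholder criteria and weights; scores are 0-10 reviewer averages per vendor.
WEIGHTS = {
    "acceptance_rate": 0.25, "delivery_adherence": 0.15, "qa_gate_failures": 0.15,
    "reviewer_disagreement": 0.10, "license_terms": 0.10, "consent_artifacts": 0.10,
    "format_compliance": 0.10, "pricing": 0.05,
}

def blind_rank(pilots: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Return the top 2 finalists by weighted score across all criteria."""
    scored = {
        vendor: sum(WEIGHTS[c] * scores[c] for c in WEIGHTS)
        for vendor, scores in pilots.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:2]
```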
Step 6 — Define 12 sample QA gates
Every physical AI training data program should clear 12 acceptance gates: (1) modality + task match — episodes match the 1-page spec from Step 1; (2) embodiment match — firmware, gripper SKU, calibration drift under 2 mm; (3) sensor-fidelity — RGB at 1080p / 30 fps minimum; (4) action-schema compliance — RLDS / MCAP / format spec from Step 4; (5) capture cadence — 30-50 Hz teleop control, 100-200 Hz state; (6) operator-skill calibration — 50-100 episode skill-calibration set with 90%+ success; (7) license + consent — single buyer-owned commercial license, 100% per-contributor consent; (8) task-success labels — human-verified success on 100% of episodes with reviewer disagreement under 8%; (9) language_instruction quality (VLA only) — reviewer agreement above 90% on instruction specificity; (10) coverage — at least 30 distinct objects, 5 lighting conditions, 3 background variations, 2 operator-skill levels; (11) metadata completeness — timestamp, scene_id, operator_id (hashed), object_set, lighting_class, success_label per episode; (12) format integrity — files open in target pipeline without errors, time-sync drift under 5 ms across all channels.
Reject batches that miss gates (1), (2), (4), (7); reject the program if gate (3), (5), or (8) failure rate exceeds 8%. A typical pilot of 200-500 episodes ships in 7-14 days at $750-$2,500; the full program of 5,000-25,000 episodes ships in 60-120 days at $25,000-$160,000.
- 1. Modality + task match
- 2. Embodiment match
- 3. Sensor-fidelity (1080p / 30 fps)
- 4. Action-schema compliance
- 5. Capture cadence (30-50 Hz)
- 6. Operator-skill calibration
- 7. License + consent
- 8. Task-success labels
- 9. Language_instruction quality (VLA)
- 10. Coverage (objects / lighting / backgrounds)
- 11. Metadata completeness
- 12. Format integrity
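Gates (11) and (12) are the most mechanical of the twelve and worth automating ahead of human review. A minimal per-episode check, assuming hypothetical field names for the delivered metadata:

```python
REQUIRED_METADATA = {"timestamp", "scene_id", "operator_id",
                     "object_set", "lighting_class", "success_label"}

def check_gates_11_12(episode: dict) -> dict[str, bool]:
    """Gate 11: metadata completeness. Gate 12: time-sync drift under 5 ms."""
    missing = REQUIRED_METADATA - episode["metadata"].keys()
    # Worst pairwise timestamp spread across sensor channels at each sync index;
    # assumes channels arrive with aligned, equal-length timestamp arrays.
    drift_s = max(
        max(ts) - min(ts) for ts in zip(*episode["channel_timestamps"].values())
    )
    return {
        "gate_11_metadata_complete": not missing,
        "gate_12_time_sync_ok": drift_s < 0.005,
    }
```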
Step 7 — Negotiate contract and SLA
Physical AI training data contracts in 2026 should cover 8 terms: (1) delivery date with daily SLA of 0.5-1.0% of program cost for slip; (2) acceptance threshold with 92-97% first-pass rate guaranteed and re-collection at vendor expense for batches below threshold; (3) per-contributor consent artifacts with audit-trail access for 24 months minimum; (4) indemnification rider covering IP infringement, biometric / personality-rights exposure, and contributor-consent claims at $5,000-$40,000 program-level cap; (5) exclusivity terms (Truelabel-vetted programs typically include 6-12 months of buyer-exclusive use); (6) revision loop with 1-2 free re-collections of failed batches plus per-batch revision pricing; (7) data-deletion clause covering vendor-side retention after delivery (typically 30-90 days); (8) termination clause with prorated refund for unbuilt episodes.
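Term (1) is simple arithmetic worth modeling before signing. A sketch using the daily rates from this guide:

```python
def sla_slip_credit(program_cost: float, days_late: int,
                    daily_rate: float = 0.0075) -> float:
    """Daily SLA credit for delivery slip at 0.5-1.0% of program cost per day."""
    return program_cost * daily_rate * days_late

# A $100,000 program slipping 10 days at 0.75%/day accrues a $7,500 credit.
print(sla_slip_credit(100_000, 10))  # 7500.0
```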
For a typical $25,000-$160,000 program, a 6-12 hour contract negotiation with the legal team typically captures $3,000-$25,000 in additional protection. Skipping this step is the second most expensive procurement mistake — recurring industry patterns show programs without negotiated SLA terms absorbing $15,000-$80,000 in re-collection cost that would have been the vendor's expense under a properly negotiated contract.
Step 8 — Scale and re-verify
After the pilot clears all 12 QA gates, scale to the full 5,000-50,000 episode program with rolling QA: every 1,000 episodes, run a 5-7% sample audit across 3 reviewers, and hold delivery of the next 1,000-episode batch if the audit fails any gate. For the full program timeline (60-120 days), expect a 60-90 day delivery cadence with 4-6 audit checkpoints, a 92-97% first-pass acceptance rate at each checkpoint, and 1-3 re-collected batches across the program. Truelabel-vetted programs target gate (7) at 96-99% on first review, gate (3) at 92-97%, and gate (1) at 99%+ when the spec from Step 1 is well-defined.
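The rolling audit is straightforward to script. A sketch assuming each batch arrives as a flat list of episode IDs; all 3 reviewers score the same sample so reviewer disagreement stays measurable:

```python
import random

def rolling_audit_sample(batch_ids: list[str], rate: float = 0.06,
                         reviewers: int = 3, seed: int = 0) -> dict[str, list[str]]:
    """Sample 5-7% of a 1,000-episode batch; assign the full sample to all reviewers."""
    rng = random.Random(seed)
    n = max(1, round(len(batch_ids) * rate))
    sample = rng.sample(batch_ids, n)
    return {f"reviewer_{i + 1}": sample for i in range(reviewers)}
```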
When the program completes, archive the full corpus in 2 redundant buyer-owned storage locations (typically S3 + Azure Blob Storage or equivalent), retain all per-contributor consent artifacts for 24 months minimum, and run a final consolidated QA pass before ingesting into the training pipeline. The 8-step procurement playbook is the 2026 default for $25,000-$160,000 programs and scales linearly to $500,000-$2,000,000 enterprise programs with additional steps for compliance (HIPAA, SOC 2, GDPR), per-jurisdiction consent, and federated capture across 3-5 contributor regions.
External references and source context
- RLDS GitHub repository (GitHub) — RLDS defines a standardized robot learning record schema with timestamp, robot_state, action, reward, language_instruction, and is_terminal fields.
- MCAP file format (mcap.dev) — MCAP is an open-source container format for multimodal log data used in 80%+ of teleoperation pipelines.
- LeRobot dataset documentation (Hugging Face) — LeRobotDataset v3.0 standardizes robot learning data delivery across video, state, action, and metadata fields.
- Datasheets for Datasets (arXiv) — Datasheets for Datasets specify the documentation buyers need before commercial training, including motivation, composition, and recommended uses.
- Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI (arXiv) — Data Cards capture dataset origins, development, intent, and ethical considerations for buyer review.
- DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv) — DROID provides a real-world robot manipulation reference at 76,000 demonstrations, 564 scenes, and 86 tasks for buyers scoping commercial alternatives.
FAQ
What's the typical pricing for a 5,000-episode physical AI training data program in 2026?
Truelabel: $25,000-$60,000. Encord: $80,000-$120,000. Scale AI: $200,000-$300,000 minimum. Appen: $50,000-$90,000. Labelbox: $60,000-$100,000. The price spread reflects SLA, license terms, and indemnification rider differences, not raw collection cost.
How long does a typical pilot batch take?
Truelabel ships 200-500 episode pilots in 7-14 days at $750-$2,500. Encord 14-21 days at $4,000-$8,000. Scale AI typically requires 30-60 days including onboarding. The pilot is the single best signal on full-program acceptance rate.
Should I always run a vendor bake-off?
Yes for any program above $25,000. A 4-week bake-off across 2-3 candidates costs $2,250-$7,500 and typically returns 5-15x that in pricing leverage. Industry patterns show programs that skip the bake-off frequently require re-collection at 60-110% of original program cost.
What's the most common procurement mistake?
Skipping the 1-page spec document in Step 1. Programs that skip the spec typically suffer 25-40% downstream model regression because vendor and buyer never align on task taxonomy, and the gap surfaces only after 5,000+ episodes are delivered.
When should I prefer open-license over commercial-license data?
Use open-license (Apache-2.0, MIT, CC BY 4.0) for pretraining substrates where embodiment fit is loose. Use commercial-license for fine-tuning on the buyer's exact embodiment, workspace, and task family. The 2026 hybrid recipe is: pretrain on open + fine-tune on commercial under a single buyer-owned license.
What goes in an indemnification rider?
Coverage for IP infringement, biometric / personality-rights exposure, contributor-consent claims, and license-warranty breach. Typical caps: $5,000-$40,000 program-level for 5,000-25,000 episode tiers. Skipping the indemnification rider is one of the most expensive recurring procurement mistakes in this category.
Ready to buy physical AI training data in 2026?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.
Request physical AI data