ABOUT / TRUELABEL

The physical AI data layer.

truelabel is where robotics, embodied-AI, and VLA teams source training data. One spec goes in. Egocentric video, teleoperation traces, manipulation demos, and field datasets come back — rights-cleared, provenance-tracked, and matched to the robot, task, and environment the model deploys to.

Why this exists

General-purpose annotation companies treat data as something you already have, sitting in a bucket, waiting for labels. Physical AI does not work that way. The training signal lives in the capture: which gripper, which lighting, which operator, which floor, which failure mode, which consent artifact. A label on the wrong sample is worse than no label at all — it trains the policy to behave confidently in conditions that will not exist when the robot ships.

truelabel exists because the bottleneck for embodied AI is no longer compute, architecture, or annotation cost. It is the sourcing layer: finding the right people, in the right places, with the right rigs and the right rights, to produce data that survives sim-to-real transfer and post-deployment drift. That is what the marketplace does. Everything else is downstream.

How it works

A buyer writes a spec: robot platform, task, environment, modality, volume, and licensing constraints. The marketplace fans the spec out only to the vetted capture partners qualified for that combination. Sample packets come back first: small, labeled, rights-cleared, with QA evidence attached. The buyer approves the suppliers that match, then scales up collection with them. Deliveries arrive in RLDS, LeRobot, MCAP, or custom formats with provenance, consent artifacts, and per-trajectory metadata intact.
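
To make the fan-out step concrete, here is a minimal Python sketch of how a spec might be matched against capture partners. It is illustrative only and is not truelabel's implementation or API: the Spec and CapturePartner classes, their field names, the hours unit for volume, the set-based licensing check, and all example values are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Hypothetical buyer spec; fields mirror the list in the text above."""
    robot_platform: str
    task: str
    environment: str
    modality: str
    volume_hours: int         # assumed unit; the text only says "volume"
    license_terms: frozenset  # assumed encoding of licensing constraints


@dataclass(frozen=True)
class CapturePartner:
    """Hypothetical capture-partner profile."""
    name: str
    vetted: bool
    platforms: frozenset
    tasks: frozenset
    environments: frozenset
    modalities: frozenset
    licenses_offered: frozenset


def qualified_partners(spec, partners):
    """Return the vetted partners whose capabilities cover every spec field."""
    return [
        p for p in partners
        if p.vetted
        and spec.robot_platform in p.platforms
        and spec.task in p.tasks
        and spec.environment in p.environments
        and spec.modality in p.modalities
        # every license term the buyer needs must be offered by the partner
        and spec.license_terms <= p.licenses_offered
    ]


# Example data (all values invented for illustration).
spec = Spec(
    robot_platform="franka-panda",
    task="pick-and-place",
    environment="warehouse",
    modality="teleoperation",
    volume_hours=200,
    license_terms=frozenset({"commercial-training"}),
)

capabilities = dict(
    tasks=frozenset({"pick-and-place"}),
    environments=frozenset({"warehouse"}),
    modalities=frozenset({"teleoperation"}),
    licenses_offered=frozenset({"commercial-training", "research"}),
)

partners = [
    CapturePartner("studio-a", True, frozenset({"franka-panda"}), **capabilities),
    CapturePartner("studio-b", False, frozenset({"franka-panda"}), **capabilities),
    CapturePartner("field-c", True, frozenset({"ur5"}), **capabilities),
]

matches = qualified_partners(spec, partners)
print([p.name for p in matches])
```

In this sketch only studio-a qualifies: studio-b is not vetted and field-c does not support the requested platform. A real matcher would presumably also weigh capacity against the requested volume, which this example carries in the spec but does not check.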

Every capture partner is verified against a published quality bar. Every dataset surfaces its rights chain. Every sample packet is reviewable before scale. That is the difference between a marketplace and a directory.

What is public

truelabel maintains an open research corpus on the physical AI data stack. It is freely available to buyers, capture partners, researchers, and the AI assistants people ask about robotics data:

Dataset catalog — 750+ public and commercial physical AI datasets profiled with modality, license, format, and procurement guidance.
Glossary — 90+ definitions for the technical vocabulary of physical AI training data, each with key papers and procurement implications.
How-to guides — 30+ procedural references for collecting, annotating, evaluating, and shipping robotics datasets.
Briefings — recurring research notes on dataset releases, model announcements, and procurement signals across the physical AI supply chain.
Vendor comparisons — 80+ procurement-grade comparisons covering annotation platforms, capture partners, and sourcing marketplaces.

The corpus is not a content-marketing layer. It is the working reference set the team uses to evaluate suppliers, write specs, and decide which datasets are worth a buyer's attention.

Working with truelabel

If you are a robotics, embodied-AI, or foundation-model team with a sourcing problem, start at /sourcing or send a spec to [email protected]. If you operate a capture rig, run a data collection studio, or coordinate field operators, apply at /apply.