Reference
Physical AI data glossary
Plain-English definitions of the terms buyers and suppliers use when scoping physical AI data bounties — modalities, capture rigs, formats, and metadata.
How to use this hub
Start here when you know the broad category but haven't nailed the exact bounty spec yet. Each linked page narrows the request into a concrete data shape: modality, task, environment, metadata, rights, consent, delivery format, and sample QA. That structure is what turns a vague physical AI data need into something a supplier can prove or reject with evidence.
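As a rough illustration, the eight parts of the data shape named above can be written down as a small record so that gaps are visible before a bounty is posted. The sketch below uses Python; the `BountySpec` class, its field names, and the example values are hypothetical and are not a truelabel schema.

```python
from dataclasses import dataclass, fields


@dataclass
class BountySpec:
    """Hypothetical record mirroring the eight parts of a concrete data shape."""
    modality: str = ""
    task: str = ""
    environment: str = ""
    metadata: str = ""
    rights: str = ""
    consent: str = ""
    delivery_format: str = ""
    sample_qa: str = ""


def missing_parts(spec: BountySpec) -> list[str]:
    """Return the names of the parts that are still blank."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]


# Illustrative values only; a real spec would come from the detail page.
draft = BountySpec(
    modality="egocentric video",
    task="kitchen manipulation",
    delivery_format="MP4 plus JSON manifest",
)
print(missing_parts(draft))
```

A draft that still returns names from `missing_parts` is probably too vague for a supplier to prove or reject with evidence.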
The hub isn't meant to be the last page you read. It should hand off to a detail page where the specific intent is answered with sample specs, comparison tables, proof requirements, and external source context.
10 pages
Consent artifact
Glossary
A consent artifact is a record showing that a contributor or site granted permission for data capture and downstream use. The term matters because it turns a model or procurement concept into concrete data requirements that samples can be evaluated against.
Data provenance
Glossary
Data provenance is the record of where data came from, how it was collected, what rights apply, and how it changed before delivery.
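The four parts of that definition (origin, collection method, rights, and changes before delivery) map naturally onto a small record. The Python sketch below is one possible shape, not a standard: `ProvenanceRecord`, its field names, and the example values are hypothetical, and the SHA-256 digest is an added assumption for tying a delivered file to its record.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ProvenanceRecord:
    """Hypothetical provenance record for one delivered file."""
    source: str             # where the data came from
    collection_method: str  # how it was collected
    rights: str             # what rights apply
    changes: list[str] = field(default_factory=list)  # edits before delivery

    def log_change(self, description: str) -> None:
        """Append one processing step, in the order it happened."""
        self.changes.append(description)


def content_digest(data: bytes) -> str:
    """SHA-256 hex digest of the file bytes (an assumption, not a requirement)."""
    return hashlib.sha256(data).hexdigest()


# Illustrative values only.
record = ProvenanceRecord(
    source="warehouse site A",
    collection_method="head-mounted camera",
    rights="licensed for model training",
)
record.log_change("faces blurred")
record.log_change("re-encoded to H.264")
digest = content_digest(b"example file bytes")
print(record.changes, len(digest))
```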
Egocentric data
Glossary
Egocentric data is first-person video or sensor data captured from the perspective of a person or embodied actor.
Off-the-shelf dataset
Glossary
An off-the-shelf dataset is an existing dataset that a supplier can license without running a new capture program.
Physical AI training data
Glossary
Physical AI training data is data that teaches models to perceive, reason about, and act in real or simulated physical environments.
Robot demonstrations
Glossary
Robot demonstrations are task examples in which a robot or human demonstrator completes a behavior that a model should learn or be evaluated on.
Sim-to-real gap
Glossary
The sim-to-real gap is the difference in performance between behavior learned in simulation and the same behavior deployed in real physical environments.
Teleoperation data
Glossary
Teleoperation data consists of robot observations, state, and action traces recorded while a human remotely controls the robot.
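To make the three streams in that definition concrete, the Python sketch below models one teleoperation episode as a list of timesteps and runs a few basic sanity checks on it. The `TeleopStep` layout, the field names, and the checks are assumptions chosen for illustration; real delivery formats differ.

```python
from dataclasses import dataclass


@dataclass
class TeleopStep:
    """Hypothetical single timestep of a teleoperation episode."""
    timestamp_s: float
    observation: dict[str, str]  # e.g. camera name -> frame file path
    state: list[float]           # e.g. joint positions
    action: list[float]          # operator command recorded at this step


def check_episode(steps: list[TeleopStep]) -> list[str]:
    """Return a list of problems; an empty list means the basic checks pass."""
    if not steps:
        return ["episode has no steps"]
    problems = []
    action_dim = len(steps[0].action)
    for i, step in enumerate(steps):
        if len(step.action) != action_dim:
            problems.append(f"step {i}: action length differs from step 0")
        if not step.observation:
            problems.append(f"step {i}: no observation")
        if i > 0 and step.timestamp_s <= steps[i - 1].timestamp_s:
            problems.append(f"step {i}: timestamp does not increase")
    return problems


# Illustrative two-step episode.
episode = [
    TeleopStep(0.0, {"wrist_cam": "frame_0000.jpg"}, [0.0, 0.10], [0.0, 0.00]),
    TeleopStep(0.1, {"wrist_cam": "frame_0001.jpg"}, [0.0, 0.12], [0.0, 0.05]),
]
print(check_episode(episode))
```

Checks like these are the kind of sample QA a buyer might ask for before scale-up; they do not say anything about whether the demonstrated behavior is useful.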
VLA model
Glossary
A VLA model is a vision-language-action model that connects visual observations and language context to physical actions.
World model AI
Glossary
World model AI refers to models that learn predictive structure about environments, objects, motion, and consequences.
Procurement questions before posting a bounty
- What exact model behavior or evaluation question should this data improve?
- Which modality, camera viewpoint, robot state, or metadata stream is required?
- What evidence proves the supplier has rights, consent, and provenance?
- Which delivery format must the sample open in before scale-up?
- What specific failure reasons should cause sample rejection?
Quality gate before a page becomes a deal spec
A page in this hub should not be treated as a finished procurement document by itself. It is a starting point for a bounty. Before a buyer funds capture or licenses off-the-shelf data, the page needs to become a short operating spec: accepted examples, rejected examples, file format, metadata fields, consent requirements, delivery location, and a named reviewer who can approve the sample.
The practical test is simple: if two suppliers read the same detail page, would they submit comparable samples? If not, the buyer needs to narrow the request into a more specific bounty. The strongest truelabel references help with that narrowing by linking from broad hubs into task pages, dataset profiles, format guides, glossary definitions, and public dataset alternatives.
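One way to apply the two-supplier test to actual samples is to compare their manifests. The Python sketch below assumes each supplier's sample comes with a manifest dictionary containing `file_format` and `metadata_fields` keys; both keys and the example values are hypothetical.

```python
def comparability_gaps(manifest_a: dict, manifest_b: dict) -> dict:
    """Compare two hypothetical sample manifests on format and metadata fields."""
    fields_a = set(manifest_a.get("metadata_fields", []))
    fields_b = set(manifest_b.get("metadata_fields", []))
    return {
        "format_matches": manifest_a.get("file_format") == manifest_b.get("file_format"),
        "only_in_a": sorted(fields_a - fields_b),
        "only_in_b": sorted(fields_b - fields_a),
    }


# Illustrative manifests from two suppliers answering the same spec.
supplier_a = {"file_format": "mp4", "metadata_fields": ["camera_pose", "task_label"]}
supplier_b = {"file_format": "mp4", "metadata_fields": ["task_label", "lighting"]}
print(comparability_gaps(supplier_a, supplier_b))
```

Fields that appear in only one manifest suggest the spec left that requirement open to interpretation.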
| Gate | Question | Pass signal |
|---|---|---|
| Intent | What model behavior does the data improve? | The objective is tied to a task, benchmark, or evaluation gap. |
| Evidence | What proves a supplier can deliver? | A sample package includes files, manifest, rights, and QA notes. |
| Ingestion | Can the buyer load the sample? | The sample opens in the expected format or converter. |
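The three gates in the table can be sketched as a simple checklist over a sample package. In the Python below, `SamplePackage` and its fields are hypothetical stand-ins for whatever a reviewer actually records, and each pass signal is reduced to a yes/no value for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SamplePackage:
    """Hypothetical summary of a supplier's sample, filled in by a reviewer."""
    objective: str = ""  # task, benchmark, or evaluation gap the data targets
    files: list[str] = field(default_factory=list)
    has_manifest: bool = False
    has_rights: bool = False
    has_qa_notes: bool = False
    opens_in_expected_format: bool = False


def run_gates(sample: SamplePackage) -> dict[str, bool]:
    """Map each gate name from the table to a pass/fail result."""
    evidence = (
        bool(sample.files)
        and sample.has_manifest
        and sample.has_rights
        and sample.has_qa_notes
    )
    return {
        "Intent": bool(sample.objective.strip()),
        "Evidence": evidence,
        "Ingestion": sample.opens_in_expected_format,
    }


# Illustrative sample that is missing its rights documentation.
sample = SamplePackage(
    objective="grasp success on cluttered shelves",
    files=["episode_001.mp4"],
    has_manifest=True,
    has_qa_notes=True,
    opens_in_expected_format=True,
)
print(run_gates(sample))
```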
Hub FAQ
How should buyers use the Physical AI data glossary hub?
Use the Physical AI data glossary hub to move from a broad physical AI data need to a concrete page that sets out modality, sample, QA, format, rights, and supplier-evidence requirements.
Are these pages public datasets?
No. These pages are sourcing and specification guides for posting bounties. They help buyers define what a supplier must prove before data is accepted.
Why does this hub link to so many detail pages?
Each detail page handles one specific task, dataset, comparison, definition, or format. The hub is the index that helps a buyer pick the right one for the bounty they want to post.
What makes a page ready for a bounty?
A page is ready when it names a model objective, concrete files, metadata requirements, rights and consent expectations, sample QA checks, and a delivery format.
External source context
- Scale AI physical AI data engine
Shows enterprise demand for custom physical AI collection and enrichment programs.
- NVIDIA Physical AI Data Factory Blueprint
Frames physical AI data as an end-to-end factory problem spanning curation, generation, evaluation, and delivery.
- Open X-Embodiment
Baseline open robotics dataset for cross-embodiment tasks and VLA pretraining discussions.
- Ego4D dataset
Canonical egocentric video benchmark for first-person physical-world capture and its limitations.