Physical AI Data Engineering
How to Build a Language-Conditioned Dataset for Physical AI
A language-conditioned dataset pairs natural language instructions with robot demonstrations, enabling vision-language-action (VLA) models to follow free-form commands. Build one by defining a task ontology mapping instructions to behaviors, recording synchronized multi-modal data (RGB-D video, proprioception, audio), collecting demonstrations with concurrent language scaffolding, generating paraphrases to expand linguistic diversity, validating alignment between language and action trajectories, and formatting outputs for VLA training frameworks like LeRobot or RLDS.
Quick facts
- Difficulty: Intermediate
- Audience: Physical AI data engineers
- Last reviewed: 2026-01-20
The Language-Action Alignment Problem in Physical AI
Language-conditioned datasets solve a fundamental problem in physical AI: bridging the gap between how humans describe tasks and how robots execute them. Traditional robot datasets record state-action trajectories without semantic context, forcing models to infer intent from visual patterns alone. RT-1 demonstrated that pairing 130,000 robot demonstrations with natural language instructions improves task success rates by 40% over vision-only baselines[1]. The challenge is that language is ambiguous, context-dependent, and compositional — a single instruction like 'put the cup on the table' can map to thousands of valid trajectories depending on cup pose, table clutter, and gripper constraints.
The Open X-Embodiment dataset aggregated 1 million trajectories from 22 robot platforms, but only 60% included language annotations, and annotation quality varied wildly across sources[2]. Some trajectories had goal-level instructions ('clean the table'), others had step-level commands ('grasp the red block'), and many had post-hoc descriptions that didn't match demonstrator intent. This inconsistency degrades model performance: RT-2 showed that training on mixed-quality language data reduces zero-shot generalization by 25% compared to curated instruction sets[3].
Language-conditioned datasets require three alignment properties. Temporal alignment ensures instructions correspond to the correct action window — a 'pick up the apple' command must align with the grasp trajectory, not the preceding reach. Semantic alignment ensures language granularity matches action complexity — 'move left' is too vague for a 6-DOF manipulation task. Distributional alignment ensures language diversity reflects real-world command variation — if 80% of your training instructions use 'pick up' but users say 'grab' or 'take', your model will fail at deployment. DROID addressed this by collecting 76,000 demonstrations with concurrent verbal narration from 50 operators, capturing natural linguistic variation[4].
Define Your Language Ontology and Task Space
Before recording a single demonstration, formalize the relationship between language and robot behavior in your domain. A task ontology is a structured mapping from canonical instruction templates to executable robot behaviors, expanded with paraphrases to cover linguistic variation. For a kitchen manipulation domain, the canonical template 'pick up {object} from {location}' might expand to 15 paraphrases: 'grab the {object} off the {location}', 'take the {object} sitting on the {location}', 'get the {object} from the {location}', 'lift the {object} on the {location}'. Store this ontology in YAML or JSON as ground truth for annotator training and post-hoc validation.
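The sketch below shows one way to lay out such an ontology in Python and serialize it to YAML. The domain, objects, and paraphrases are illustrative, and the slot names `{object}` and `{location}` are an assumption of this example, not a required convention.

```python
# A minimal sketch of a task ontology, assuming a kitchen-manipulation domain.
# Template names, objects, and paraphrases are illustrative.
import yaml  # pip install pyyaml

ontology = {
    "domain": "kitchen_manipulation",
    "objects": ["apple", "cup", "red block"],
    "locations": ["table", "counter", "shelf"],
    "templates": [
        {
            "canonical": "pick up the {object} from the {location}",
            "paraphrases": [
                "grab the {object} off the {location}",
                "take the {object} sitting on the {location}",
                "get the {object} from the {location}",
            ],
            "skill": "pick",   # executable behavior this template maps to
            "level": "step",   # goal | step | motion
        },
    ],
}

# Sanity check: every paraphrase must keep the canonical placeholders,
# otherwise slot filling will silently drop arguments later.
for template in ontology["templates"]:
    slots = {s for s in ("{object}", "{location}") if s in template["canonical"]}
    for p in template["paraphrases"]:
        missing = [s for s in slots if s not in p]
        assert not missing, f"{p!r} is missing slots {missing}"

with open("task_ontology.yaml", "w") as f:
    yaml.safe_dump(ontology, f, sort_keys=False)
```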
Define language granularity across three levels. Goal-level instructions describe desired end states ('put all fruits in the bowl'). Step-level instructions describe individual manipulation primitives ('reach for the apple on the left'). Motion-level narrations describe continuous trajectories ('move the arm slowly to the right'). RT-2 requires only goal-level instructions, while hierarchical planners like SayCan need all three levels[5]. The CALVIN benchmark uses a hybrid approach: goal-level instructions for task specification plus step-level annotations for 34 atomic skills[6].
Document your vocabulary constraints. If your target deployment environment uses domain-specific jargon ('stage the pallet' in warehousing, 'prep the specimen' in lab automation), include those terms in your ontology. If you're training a consumer robot, avoid technical language — EPIC-KITCHENS annotations use everyday kitchen verbs ('chop', 'pour', 'stir') rather than robotics terminology[7]. Specify forbidden constructions: negations ('don't touch the red block') and conditionals ('if the cup is full, then pour') are hard to ground without explicit world models. The BridgeData V2 ontology explicitly excludes temporal connectives ('before', 'after', 'while') because their 7-DOF robot lacked the perception stack to resolve them[8].
Set Up the Multi-Modal Recording Pipeline
Language-conditioned datasets require synchronized capture of RGB-D video, proprioceptive state, and audio or text annotations. Use hardware timestamps, not software logging, to avoid drift — a 100ms misalignment between camera frames and joint angles will corrupt action labels. ROS bag files provide nanosecond-precision timestamps via the ROS clock, but require careful configuration: set `/use_sim_time` to false and sync all nodes to a common NTP server or PTP grandmaster clock[9].
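As a sanity check on synchronization, a short script can audit the gap between each camera frame and its nearest proprioceptive sample. The sketch below assumes you have already extracted hardware timestamps (in nanoseconds) into sorted NumPy arrays; the file names are illustrative.

```python
# A minimal drift audit between two recorded streams, assuming per-sample
# hardware timestamps in nanoseconds have been dumped to .npy files.
import numpy as np

camera_ts = np.load("camera_timestamps_ns.npy")   # one entry per frame
joints_ts = np.load("joint_timestamps_ns.npy")    # one entry per state sample (sorted)

# For each camera frame, find the nearest joint-state sample and measure the gap.
idx = np.searchsorted(joints_ts, camera_ts)
idx = np.clip(idx, 1, len(joints_ts) - 1)
nearest = np.where(
    np.abs(joints_ts[idx] - camera_ts) < np.abs(joints_ts[idx - 1] - camera_ts),
    joints_ts[idx],
    joints_ts[idx - 1],
)
gap_ms = np.abs(nearest - camera_ts) / 1e6

print(f"median gap: {np.median(gap_ms):.2f} ms, worst gap: {gap_ms.max():.2f} ms")
if gap_ms.max() > 100:
    print("WARNING: camera/joint streams drift beyond 100 ms, check clock sync")
```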
For visual streams, record at ≥30 Hz from multiple viewpoints. DROID uses 3 RGB cameras (wrist-mounted, third-person static, third-person dynamic) plus a wrist-mounted depth sensor, yielding 4 synchronized streams at 15 Hz[4]. Higher frame rates improve action prediction for fast motions, but increase storage costs — a 10-minute episode at 60 Hz with three 1080p cameras generates 18 GB uncompressed. Use H.264 or H.265 encoding with a quality-based rate control (CRF 18-23) to balance fidelity and size. Store depth as 16-bit PNG or lossless compressed streams; lossy depth compression introduces artifacts that degrade grasp pose estimation.
Record proprioceptive state at the robot's control frequency (typically 100-1000 Hz). Log joint positions, velocities, torques, and gripper state. If your robot has force-torque sensors, log those too — UMI's gripper dataset includes 6-axis F/T readings that improve contact-rich manipulation[10]. For language, record audio at ≥16 kHz if collecting verbal narration, or use a text annotation interface for typed instructions. The LeRobot framework provides a reference recording script that synchronizes video, robot state, and text annotations into a single MCAP container[11].
Collect Demonstrations with Concurrent Language Scaffolding
The timing of language annotation determines alignment quality. Concurrent scaffolding — where the demonstrator narrates actions as they perform them — produces tighter language-action coupling than post-hoc annotation. EPIC-KITCHENS used head-mounted cameras with concurrent audio narration from 32 participants, capturing 700 hours of kitchen activities with natural linguistic variation[7]. The dataset's verb-noun annotations ('open door', 'cut tomato') were extracted from these narrations, preserving temporal alignment within 0.5 seconds.
For teleoperation, integrate a push-to-talk interface into the control station. The demonstrator presses a button, speaks the instruction, then executes the task. This workflow ensures the instruction precedes the action by a known offset (typically 0.5-2 seconds), simplifying alignment during post-processing. DROID used this approach with 50 non-expert operators, collecting 76,000 demonstrations across 564 skills and 86 environments[4]. Each demonstration includes a goal-level instruction ('pick up the blue block') plus optional step-level narration ('reaching for the block', 'grasping', 'lifting').
If concurrent narration isn't feasible, use a two-pass annotation workflow. First, collect demonstrations without language. Second, replay the demonstrations to annotators who write instructions while watching the video. BridgeData V2 used this approach, showing annotators 10-second clips and asking 'What task is the robot performing?' and 'Describe the key steps'[8]. This method produces cleaner language (no filler words, no false starts) but loses the temporal precision of concurrent scaffolding. To mitigate, ask annotators to mark instruction timestamps by clicking on the video timeline when each new sub-goal begins.
Generate Paraphrases and Augment Language Diversity
A dataset with 10,000 demonstrations but only 50 unique instruction templates will produce a model that overfits to template syntax. Paraphrase generation expands linguistic diversity without collecting new demonstrations. Use a two-stage approach: template-based augmentation for controlled variation, then model-based augmentation for open-ended paraphrases.
Template-based augmentation applies syntactic transformations to canonical instructions. For 'pick up the {object}', generate 'grab the {object}', 'take the {object}', 'lift the {object}', 'grasp the {object}'. For 'move to the {location}', generate 'go to the {location}', 'navigate to the {location}', 'head to the {location}'. This approach is deterministic and preserves semantic meaning, but produces stilted language. RT-1 used template-based augmentation to expand 512 canonical instructions to 4,096 paraphrases, improving task success by 12%[1].
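A minimal sketch of template-based expansion follows, assuming the verb substitutions and slot values shown here; swap in the vocabulary from your own ontology.

```python
# Template-based augmentation: substitute verbs and fill slots deterministically.
from itertools import product

verbs = ["pick up", "grab", "take", "lift", "grasp"]
objects = ["apple", "cup", "red block"]
locations = ["table", "counter", "shelf"]

canonical = "pick up the {object} from the {location}"

instructions = []
for verb, obj, loc in product(verbs, objects, locations):
    template = canonical.replace("pick up", verb, 1)
    instructions.append(template.format(object=obj, location=loc))

print(len(instructions), "instructions")  # 5 verbs x 3 objects x 3 locations = 45
print(instructions[:3])
```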
Model-based augmentation uses large language models to generate free-form paraphrases. Prompt a model like GPT-4 or OpenVLA's language backbone with 'Rewrite the following robot instruction in 5 different ways: {instruction}'. Filter outputs to remove paraphrases that change semantic meaning — 'pick up the red block' should not become 'pick up the blue block'. RT-2 used this approach with a 70B parameter language model, generating 10 paraphrases per instruction and validating them with human reviewers[3]. Acceptance rate was 85%, with most rejections due to hallucinated object attributes or spatial relations.
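The sketch below illustrates model-based augmentation with the OpenAI Python client; the model name, prompt wording, and output parsing are assumptions of this example, and any chat-capable model, hosted or open-weight, can stand in.

```python
# A minimal paraphrase-generation sketch, assuming the OpenAI Python client
# and the OPENAI_API_KEY environment variable. Model name is illustrative.
from openai import OpenAI

client = OpenAI()

def paraphrase(instruction: str, n: int = 5) -> list[str]:
    prompt = (
        f"Rewrite the following robot instruction in {n} different ways, "
        f"one per line, without changing objects, colors, or spatial relations:\n"
        f"{instruction}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,
    )
    lines = response.choices[0].message.content.strip().splitlines()
    # Drop numbering artifacts ("1. ", "- ") and empty lines before returning.
    return [line.lstrip("0123456789.- ").strip() for line in lines if line.strip()]

print(paraphrase("pick up the red block from the table"))
```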
Validate paraphrase quality by measuring semantic similarity (cosine similarity in sentence embedding space) and lexical diversity (fraction of unique n-grams). Target a semantic similarity of 0.85-0.95 — lower values indicate semantic drift, higher values indicate redundant paraphrases. Target a lexical diversity of ≥0.6 (60% unique bigrams) to ensure the model sees varied syntax. The Open X-Embodiment dataset reports a lexical diversity of 0.72 across 1 million instructions, with semantic similarity of 0.89[2].
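A minimal validation sketch, assuming the sentence-transformers package; the MiniLM model name is an illustrative stand-in for whatever embedding model you standardize on, and the thresholds mirror the targets above.

```python
# Paraphrase quality checks: embedding similarity plus bigram diversity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(original: str, paraphrase: str) -> float:
    emb = model.encode([original, paraphrase], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def bigram_diversity(instructions: list[str]) -> float:
    bigrams = []
    for text in instructions:
        tokens = text.lower().split()
        bigrams.extend(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / max(len(bigrams), 1)

sim = semantic_similarity("pick up the red block", "grab the red block")
if not (0.85 <= sim <= 0.95):
    print(f"flag for review: similarity={sim:.2f}")

corpus = ["pick up the red block", "grab the red block", "take the red block"]
print(f"bigram diversity: {bigram_diversity(corpus):.2f}")  # target >= 0.6
```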
Validate Language-Action Alignment
Misaligned language-action pairs degrade model performance more than missing annotations. A 'pick up the apple' instruction paired with a 'put down the cup' trajectory will teach the model incorrect associations. Validate alignment using three methods: temporal overlap analysis, semantic consistency checks, and human review.
Temporal overlap analysis verifies that instruction timestamps fall within the corresponding action window. For each instruction, extract the robot's end-effector trajectory from the timestamp to the next instruction or episode end. Compute the trajectory's bounding box and check that it intersects the mentioned object's bounding box (from object detection). If the instruction is 'pick up the red block' but the gripper never enters the red block's vicinity, flag the pair for review. DROID used this method to filter 8% of demonstrations where concurrent narration was misaligned due to operator delays[4].
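A minimal sketch of the overlap check follows, assuming you have already projected the gripper trajectory into image coordinates for the action window and run an object detector on the mentioned object; the array and file names are illustrative.

```python
# Temporal-overlap validation: does the gripper ever approach the named object?
import numpy as np

def trajectory_bbox(points: np.ndarray, margin: float = 10.0) -> tuple:
    """Axis-aligned bounding box (x0, y0, x1, y1) around gripper positions."""
    x0, y0 = points.min(axis=0) - margin
    x1, y1 = points.max(axis=0) + margin
    return x0, y0, x1, y1

def boxes_intersect(a: tuple, b: tuple) -> bool:
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

gripper_px = np.load("gripper_pixels.npy")    # (T, 2) image-plane positions in window
object_box = (412.0, 188.0, 486.0, 251.0)     # detector output for "red block"

if not boxes_intersect(trajectory_bbox(gripper_px), object_box):
    print("flag: gripper never approaches the mentioned object, review alignment")
```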
Semantic consistency checks use vision-language models to verify that the instruction matches the visual scene. For each instruction-frame pair, prompt a VLM like CLIP or OpenVLA with 'Is the robot performing this action: {instruction}?' and threshold the confidence score. RT-2 used this approach during data curation, rejecting 15% of instruction-frame pairs where CLIP similarity was below 0.7[3]. This method catches semantic drift (instruction says 'red block' but video shows blue block) but misses temporal misalignment (instruction is correct but 5 seconds early).
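A sketch of this check using the Hugging Face transformers CLIP implementation is below. The model name is illustrative; note that raw CLIP cosine similarities run lower than calibrated scores, so the cutoff here is a placeholder to be tuned on a labeled sample rather than the 0.7 figure cited above.

```python
# Semantic consistency check: score instruction-frame pairs with CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(frame_path: str, instruction: str) -> float:
    image = Image.open(frame_path)
    inputs = processor(text=[instruction], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Cosine similarity between the projected image and text embeddings.
    return torch.nn.functional.cosine_similarity(
        outputs.image_embeds, outputs.text_embeds
    ).item()

score = clip_score("episode_0042/frame_0120.jpg", "pick up the red block")
if score < 0.3:  # illustrative threshold; calibrate on a labeled sample
    print(f"flag for review: CLIP similarity {score:.2f}")
```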
Human review remains the gold standard. Sample 5-10% of demonstrations, show annotators the video plus instruction, and ask 'Does this instruction accurately describe what the robot is doing?' BridgeData V2 used this protocol with 3 annotators per sample, achieving 92% inter-annotator agreement[8]. Flag demonstrations where ≥2 annotators disagree for re-annotation or removal. Budget 2-3 hours of review time per 100 demonstrations.
Format for VLA Training and Generate Splits
VLA training frameworks expect specific data formats. LeRobot uses a columnar Parquet schema with separate files for metadata, episodes, and frames[11]. RLDS (Reinforcement Learning Datasets) uses TFRecord files with nested protocol buffers for observations, actions, and language[12]. Open X-Embodiment defines a common schema that both frameworks can ingest, with fields for `observation/image`, `observation/state`, `action`, `language_instruction`, and `language_embedding`[2].
For each demonstration, store the full instruction at the episode level and optional step-level instructions at the frame level. If your dataset includes paraphrases, store all paraphrases in a list field and sample one at random during training — this prevents the model from memorizing a single phrasing. Precompute language embeddings using the same encoder your VLA will use (e.g., T5-Base for RT-2, SigLIP for OpenVLA) and store them alongside raw text. This reduces training-time compute by 30-40% for large datasets.
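A minimal sketch of episode-level language storage with random paraphrase sampling and precomputed embeddings follows. The encoder here (sentence-transformers MiniLM) is a stand-in; use whatever text encoder your VLA consumes, and treat the field names as assumptions of this example.

```python
# Store all phrasings per episode, precompute embeddings, sample at train time.
import random
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # swap for your VLA's text encoder

episode = {
    "episode_id": "ep_000123",
    "language_instruction": "pick up the red block from the table",
    "paraphrases": [
        "grab the red block off the table",
        "take the red block sitting on the table",
    ],
}

# Precompute one embedding per phrasing so training never touches the encoder.
phrasings = [episode["language_instruction"]] + episode["paraphrases"]
episode["language_embedding"] = encoder.encode(phrasings)  # (n_phrasings, dim)

def sample_language(ep: dict) -> tuple[str, np.ndarray]:
    """Pick a random phrasing (and its embedding) for one training step."""
    texts = [ep["language_instruction"]] + ep["paraphrases"]
    i = random.randrange(len(texts))
    return texts[i], ep["language_embedding"][i]
```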
Generate train/val/test splits that preserve task diversity. A naive random split will leak similar demonstrations across splits, inflating validation metrics. Instead, split by task instance (if you have 100 'pick up apple' demonstrations, put 80 in train, 10 in val, 10 in test) or by environment (if you collected data in 10 kitchens, use 8 for train, 1 for val, 1 for test). DROID uses an environment-based split, ensuring the test set contains unseen backgrounds, lighting, and object arrangements[4]. Target an 80/10/10 split for datasets with ≥10,000 demonstrations, or 70/15/15 for smaller datasets where validation signal is noisier.
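A minimal sketch of an environment-based split, assuming each episode record carries an environment identifier; the synthetic episode index and 80/10/10 ratios are illustrative.

```python
# Environment-based split: unseen environments go entirely to val/test.
import random
from collections import defaultdict

episodes = [
    {"episode_id": f"ep_{i:06d}", "environment": f"kitchen_{i % 10}"}
    for i in range(10_000)
]  # placeholder; load your real episode index here

by_env = defaultdict(list)
for ep in episodes:
    by_env[ep["environment"]].append(ep["episode_id"])

envs = sorted(by_env)
random.seed(0)
random.shuffle(envs)

n_train = int(0.8 * len(envs))
n_val = int(0.1 * len(envs))
splits = {
    "train": [eid for env in envs[:n_train] for eid in by_env[env]],
    "val":   [eid for env in envs[n_train:n_train + n_val] for eid in by_env[env]],
    "test":  [eid for env in envs[n_train + n_val:] for eid in by_env[env]],
}
print({k: len(v) for k, v in splits.items()})
```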
Storage Formats and Compression for Multi-Modal Data
Language-conditioned datasets are storage-intensive. A 10,000-episode dataset with 3 RGB cameras, 1 depth stream, and 100 Hz proprioception generates 2-5 TB uncompressed. Choose formats that balance compression ratio, random access speed, and ecosystem compatibility.
For video, MCAP is the emerging standard for multi-modal robotics data. It stores synchronized video, sensor data, and metadata in a single file with efficient seeking and partial reads[13]. LeRobot uses MCAP as its native format, with H.264-encoded video and Zstd-compressed sensor streams[11]. For datasets that must integrate with existing ML pipelines, Apache Parquet offers columnar storage with per-column compression, reducing size by 60-80% compared to uncompressed CSV while maintaining fast filtering and aggregation[14].
For depth data, avoid lossy compression. Store 16-bit depth maps as PNG with lossless compression, or use HDF5 with gzip or lz4 compression[15]. DROID stores depth as 16-bit PNG, achieving 3:1 compression ratios without quality loss[4]. For point clouds, use PCD format with binary encoding, or convert to Parquet for integration with cloud data lakes[16].
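A minimal sketch of lossless 16-bit depth storage with OpenCV, assuming depth arrives as a float array in meters; the millimeter scaling convention is a common choice, not a requirement.

```python
# Lossless 16-bit depth storage as PNG.
import cv2
import numpy as np

depth_m = np.load("depth_frame.npy")                     # (H, W) float32, meters
depth_mm = np.clip(depth_m * 1000.0, 0, 65535).astype(np.uint16)

# PNG is lossless; OpenCV writes uint16 arrays as 16-bit PNGs directly.
cv2.imwrite("depth_frame.png", depth_mm)

# Round-trip check: values must be bit-exact after decode.
restored = cv2.imread("depth_frame.png", cv2.IMREAD_UNCHANGED)
assert np.array_equal(restored, depth_mm)
```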
Language annotations are small (typically <1 KB per instruction) but benefit from structured storage. Store instructions in a separate metadata file (JSON or Parquet) with foreign keys linking to episode IDs. This allows you to update annotations without re-encoding video. Open X-Embodiment uses this approach, storing language in a `language.parquet` file with columns for `episode_id`, `timestamp`, `instruction`, and `paraphrase_id`[2].
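A minimal sketch of such a metadata table using pandas with a pyarrow backend; the column names follow the schema described above and the episode IDs are illustrative.

```python
# Language metadata stored separately from video, keyed by episode_id.
import pandas as pd  # requires pyarrow (or fastparquet) for to_parquet

language = pd.DataFrame(
    {
        "episode_id": ["ep_000123", "ep_000123", "ep_000124"],
        "timestamp": [0.0, 4.2, 0.0],  # seconds from episode start
        "instruction": [
            "pick up the red block from the table",
            "place it in the bowl",
            "open the drawer",
        ],
        "paraphrase_id": [0, 0, 0],
    }
)

# Keeping language in its own Parquet file means annotations can be revised
# without re-encoding any video or sensor containers.
language.to_parquet("language.parquet", index=False)
```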
Licensing and Provenance for Language-Conditioned Datasets
Language-conditioned datasets inherit licensing constraints from three sources: the robot demonstrations, the language annotations, and any pre-trained models used for paraphrase generation. If you collect demonstrations in a commercial kitchen, you may need location releases. If annotators generate instructions, those annotations are copyrightable works. If you use GPT-4 to generate paraphrases, OpenAI's terms prohibit using outputs to train competing models.
Data provenance tracking is critical for compliance and reproducibility. Record the identity of each demonstrator, annotator, and paraphrase model. Store this metadata in a `provenance.json` file with fields for `collector_id`, `annotation_method`, `paraphrase_model`, and `collection_date`. DROID includes provenance metadata for all 76,000 demonstrations, enabling users to filter by operator experience level or annotation quality[4].
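A minimal sketch of writing that provenance file; the field names follow the structure above and the values are illustrative.

```python
# Per-episode provenance metadata written alongside the dataset.
import json

provenance = {
    "ep_000123": {
        "collector_id": "operator_17",
        "annotation_method": "concurrent_narration",
        "paraphrase_model": "gpt-4o",
        "collection_date": "2025-11-04",
    }
}

with open("provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```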
For public release, choose a license that matches your intended use case. CC-BY-4.0 allows commercial use with attribution, suitable for datasets you want to maximize adoption[17]. CC-BY-NC-4.0 restricts commercial use, suitable for academic datasets you want to keep in the research commons[18]. BridgeData V2 uses CC-BY-4.0, while EPIC-KITCHENS uses a custom license that permits research use but requires separate negotiation for commercial deployment[7].
Benchmarking and Continuous Improvement
A language-conditioned dataset is never finished. As VLA architectures evolve, you'll need to add new annotation types, expand task coverage, or re-annotate demonstrations with higher-quality language. Establish a benchmarking pipeline that tracks model performance as you iterate on the dataset.
Define task success metrics that align with your deployment goals. For manipulation, measure grasp success rate, placement accuracy, and task completion time. For navigation, measure goal-reaching success and collision rate. For each metric, report performance stratified by instruction complexity — models often succeed on simple commands ('pick up the block') but fail on compositional instructions ('pick up the red block and place it to the left of the blue block'). CALVIN reports success rates for 1-step, 2-step, 3-step, 4-step, and 5-step instruction chains, showing that performance degrades exponentially with chain length[6].
Track language diversity metrics over time. Compute the vocabulary size (unique tokens), lexical diversity (unique n-grams / total n-grams), and semantic coverage (fraction of task space covered by instructions). If you add 1,000 new demonstrations but vocabulary size doesn't increase, you're collecting redundant language. Open X-Embodiment tracks these metrics across 22 datasets, showing that vocabulary size plateaus after 50,000 demonstrations but semantic coverage continues to improve with task diversity[2].
Use active learning to prioritize new data collection. Train a VLA on your current dataset, deploy it in simulation or on a real robot, and log failure cases. Cluster failures by failure mode (grasp failures, collision, timeout) and instruction type. Collect new demonstrations that target the highest-frequency failure modes. RT-2 used this approach to improve performance on long-horizon tasks, adding 5,000 demonstrations of multi-step instructions and improving success rates by 18%[3].
External references and source context
- [1] RT-1: Robotics Transformer for Real-World Control at Scale (arXiv). RT-1 trained on 130,000 demonstrations with language instructions, improving task success by 40% over vision-only baselines.
- [2] Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). Aggregated 1 million trajectories from 22 platforms, but only 60% included language annotations, with varying quality.
- [3] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv). Showed that mixed-quality language data reduces zero-shot generalization by 25% compared to curated instruction sets.
- [4] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv). Collected 76,000 demonstrations with concurrent verbal narration from 50 operators across 564 skills and 86 environments.
- [5] Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (arXiv). SayCan requires goal-level and step-level language annotations for hierarchical planning over skill libraries.
- [6] CALVIN benchmark paper (arXiv). Uses goal-level instructions plus step-level annotations for 34 atomic skills, reporting success rates for 1-5 step instruction chains.
- [7] Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100 (arXiv). Captured 700 hours of kitchen activities with concurrent audio narration from 32 participants using everyday verbs.
- [8] BridgeData V2: A Dataset for Robot Learning at Scale (arXiv). Used a two-pass annotation workflow, achieved 92% inter-annotator agreement, and explicitly excludes temporal connectives.
- [9] Reading from a ROS bag file (docs.ros.org). ROS bag files provide nanosecond-precision timestamps via the ROS clock for synchronized multi-modal recording.
- [10] UMI project site (umi-gripper.github.io). The UMI gripper dataset includes 6-axis force-torque readings that improve contact-rich manipulation performance.
- [11] LeRobot documentation (Hugging Face). Provides reference recording scripts and uses a columnar Parquet schema with MCAP containers.
- [12] RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning (arXiv). Uses TFRecord files with nested protocol buffers for observations, actions, and language.
- [13] MCAP guides (MCAP). MCAP stores synchronized video, sensor data, and metadata in a single file with efficient seeking and partial reads.
- [14] Apache Arrow Parquet files (Apache Arrow). Parquet offers columnar storage with per-column compression, reducing size by 60-80% compared to uncompressed formats.
- [15] Introduction to HDF5 (The HDF Group). HDF5 with gzip or lz4 compression provides lossless storage for 16-bit depth maps and sensor data.
- [16] PCD file format (Point Cloud Library). PCD with binary encoding is the standard for point cloud storage in robotics applications.
- [17] Attribution 4.0 International deed (Creative Commons). CC-BY-4.0 allows commercial use with attribution, maximizing dataset adoption for public releases.
- [18] Creative Commons Attribution-NonCommercial 4.0 International deed (creativecommons.org). CC-BY-NC-4.0 restricts commercial use, suitable for academic datasets intended for the research commons.
Additional resources (not cited above)
- Scale AI: Expanding Our Data Engine for Physical AI (scale.com). Demonstrates enterprise demand for language-conditioned robot datasets.
- Labelbox documentation (docs.labelbox.com). Annotation tooling for multi-modal robotics data including video, sensor streams, and language labels.
- Encord Annotate (encord.com). Annotation platform for robotics datasets with support for temporal alignment and multi-modal data.
- Roboflow Annotate (roboflow.com). Annotation tools and dataset management for computer vision and robotics applications.
- Dataloop annotation (dataloop.ai). Annotation platform with support for video, point clouds, and language labeling for robotics datasets.
- Truelabel physical AI data marketplace (truelabel.ai). Connects buyers and collectors for custom language-conditioned robot dataset collection.
FAQ
What is the minimum dataset size for training a vision-language-action model?
Minimum viable dataset size depends on task complexity and model architecture. For single-task models (one robot, one environment, narrow skill set), 500-1,000 demonstrations can achieve 70-80% success rates with behavior cloning. For multi-task models, you need 10,000+ demonstrations to learn generalizable policies. RT-1 used 130,000 demonstrations across 700 tasks. RT-2 scaled to 1 million demonstrations by mixing robot data with web data. If you're fine-tuning a pre-trained VLA like OpenVLA, 100-500 demonstrations per new task is often sufficient. Quality matters more than quantity — 1,000 high-quality demonstrations with diverse language annotations outperform 10,000 demonstrations with repetitive instructions.
Should I use goal-level or step-level language annotations?
Use goal-level annotations if your model will receive high-level commands from end users ('clean the table', 'make a sandwich'). Use step-level annotations if your model needs to follow detailed instructions ('pick up the red block', 'move left 10 cm') or if you're training a hierarchical planner. RT-2 uses only goal-level annotations because it learns to decompose goals into actions through imitation learning. SayCan uses both goal-level and step-level annotations because it explicitly plans over a library of skills. If you're unsure, collect both — you can always ignore step-level annotations during training, but you can't add them retroactively without re-annotating.
How do I handle ambiguous instructions like 'put it there'?
Ambiguous instructions require grounding through visual context or dialogue history. If your dataset includes multi-turn interactions, store the full dialogue history and use it as additional context during training. If demonstrations are single-turn, avoid ambiguous instructions during collection — instruct annotators to use explicit references ('put the red block on the table' not 'put it there'). For deployed systems, use clarification dialogues: if the model's confidence is below a threshold, ask the user to rephrase. RT-2 handles ambiguity by training on web data that includes visual grounding examples, allowing it to resolve pronouns and spatial references from context.
What paraphrase generation model should I use?
For high-quality paraphrases with semantic preservation, use GPT-4 or Claude 3.5 with careful prompting. For cost-effective generation at scale, use open-weight models like Llama 3.1 70B or Mixtral 8x22B. For domain-specific paraphrases (medical robotics, industrial automation), fine-tune a smaller model on your task ontology. Always validate paraphrases with human review — budget 10-15% of your paraphrase generation cost for quality control. RT-2 used a 70B parameter model and achieved 85% acceptance rate. If acceptance rate drops below 70%, your prompts need refinement or your base model is too weak.
How do I version and update a language-conditioned dataset?
Use semantic versioning (major.minor.patch) and document changes in a CHANGELOG. Increment the major version when you change the task ontology, add new robots, or modify the data schema in backward-incompatible ways. Increment the minor version when you add new demonstrations, expand language coverage, or improve annotation quality. Increment the patch version for bug fixes (correcting misaligned timestamps, fixing metadata errors). Store each version in a separate directory or repository tag. BridgeData V2 is a major version increment over BridgeData V1 because it added new tasks and changed the action space. DROID uses date-based versioning (DROID-2024-03) to track monthly data releases.
Looking for a language-conditioned dataset?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.
Post a language-conditioned data request