
Physical AI Training Data

Language-Conditioned Robot Data for Vision-Language-Action Models

Language-conditioned robot policies require demonstrations paired with natural language instructions that describe the task being performed. Google's RT-2 model achieved 3× better generalization by training on 800,000 language-paired trajectories[1], while OpenVLA outperformed the 55B-parameter RT-2-X by 16.5% using only 7B parameters trained on high-quality instruction-aligned data[2]. The bottleneck is annotation quality: existing datasets suffer from template-like vocabulary, ambiguous instructions, and misaligned language-action pairs that limit policy generalization across diverse real-world tasks.

Updated 2025-06-15
By truelabel
Reviewed by truelabel

Quick facts

Use case
language-conditioned robot data
Audience
Robotics and physical AI teams
Last reviewed
2025-06-15

Why Language-Conditioned Training Data Determines VLA Policy Performance

Vision-language-action models like RT-2 and OpenVLA transfer web-scale language understanding to robotic control by training on demonstrations paired with natural language instructions. RT-2 demonstrated that a 55B-parameter model trained on language-paired data improved success rates by 3× over vision-only baselines[1], while OpenVLA achieved 16.5% higher performance with only 7B parameters by prioritizing instruction quality over model scale[2].

The training paradigm requires three aligned modalities: visual observations (RGB-D streams, proprioceptive state), action sequences (joint positions, gripper commands), and natural language instructions that describe the task goal. Open X-Embodiment aggregated 25 datasets containing 1 million trajectories, but only a subset included language annotations, and annotation quality varied dramatically across sources. Instruction ambiguity is the primary failure mode: phrases like "pick up the object" do not specify which object, while "move forward" lacks distance and direction constraints.

Scale AI's Physical AI platform reports that 60% of language-annotation errors stem from post-hoc labeling, where annotators describe demonstrations without access to the original task context. Prospective collection—where operators execute specific language commands—produces higher-quality pairs but requires 4× more coordination overhead. The economic trade-off between annotation fidelity and collection throughput determines whether a dataset can support generalist policies or only narrow task distributions.

Annotation Quality Gaps in Open Language-Robot Datasets

Current open datasets exhibit three systematic quality deficits that limit VLA generalization. Vocabulary poverty is pervasive: CALVIN uses 34 templated instructions across 24,000 demonstrations, while BridgeData V2 contains 60,000 trajectories with natural language but relies on crowd-sourced annotations that repeat common phrases. Analysis of the Open X-Embodiment language subset reveals that 78% of instructions draw on a pool of fewer than 10 unique verbs, compared to the 200+ action verbs found in natural human task descriptions[3].

Temporal misalignment occurs when instruction timestamps do not match action execution windows. The DROID dataset contains 76,000 teleoperated trajectories with post-hoc language labels, but 22% of instructions describe task outcomes rather than executable actions, creating a semantic gap between what the model reads and what the robot does. RT-1 addressed this by collecting 130,000 demonstrations where operators spoke instructions aloud during execution, reducing misalignment to 8%[4].

Instruction ambiguity undermines policy robustness: phrases like "clean the table" do not specify which objects to remove, in what order, or where to place them. SayCan required 400 hours of manual disambiguation to convert free-form instructions into executable sub-tasks, demonstrating that language grounding is a distinct annotation challenge beyond trajectory labeling. Buyers must audit instruction specificity—measuring noun definiteness, spatial precision, and action granularity—before committing to a dataset for VLA training.

How Instruction Diversity Affects Policy Generalization

Language diversity in training data directly determines whether a VLA policy can generalize to novel instructions at test time. RT-2 trained on 800,000 trajectories with instructions sampled from web-scale language distributions, enabling zero-shot transfer to commands like "move the apple to the left of the plate"—a phrase never seen during training[1]. In contrast, models trained on template-constrained datasets like CALVIN fail on paraphrased instructions, even when the underlying task is identical.

Lexical diversity measures unique n-grams per 1,000 instructions: BridgeData V2 scores 340 unique bigrams, while DROID reaches 890 via free-form annotations typed by operators. Higher lexical diversity correlates with better out-of-distribution performance, but only if instructions remain semantically grounded. The RoboCat study found that adding synthetic paraphrases (generated via GPT-4) improved success rates by 12% when paraphrases preserved task semantics, but degraded performance by 18% when paraphrases introduced ambiguity[5].
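The bigram metric above can be computed directly. A minimal sketch; the plain lowercase whitespace tokenization is an assumption, and a real audit would match the tokenizer used by the policy's language encoder:

```python
def unique_ngrams_per_1k(instructions, n=2):
    """Count unique word n-grams across an instruction set,
    normalized per 1,000 instructions."""
    ngrams = set()
    for text in instructions:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
    return len(ngrams) * 1000 / len(instructions) if instructions else 0.0
```

Running this over a full annotation dump gives a directly comparable number against the 340 (BridgeData V2) and 890 (DROID) figures cited above.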

Compositional structure enables policies to decompose complex instructions into sub-tasks. SayCan demonstrated that models trained on hierarchical instructions ("first pick up the sponge, then wipe the counter") generalized to novel compositions 2.3× better than models trained on atomic commands. Truelabel's marketplace filters datasets by instruction complexity metrics—measuring clause depth, conjunction frequency, and temporal ordering—to help buyers match training data to their policy's compositional requirements.
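Compositional-complexity screening of the kind described can be approximated with keyword heuristics. A sketch; the marker lists are illustrative assumptions, not Truelabel's actual filter:

```python
TEMPORAL_MARKERS = {"first", "then", "after", "before", "while", "finally"}
CONJUNCTIONS = {"and", "then"}

def compositional_stats(instructions):
    """Fraction of instructions containing temporal-ordering markers or
    clause conjunctions -- a rough proxy for compositional structure."""
    temporal = conj = 0
    for text in instructions:
        tokens = set(text.lower().replace(",", " ").split())
        if tokens & TEMPORAL_MARKERS:
            temporal += 1
        if tokens & CONJUNCTIONS:
            conj += 1
    n = max(len(instructions), 1)
    return {"temporal_frac": temporal / n, "conjunction_frac": conj / n}
```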

Comparing Open Datasets for Language-Conditioned Robot Training

CALVIN provides 24,000 demonstrations across 34 tasks in a single simulated kitchen environment, with templated instructions like "open the drawer" and "turn on the light bulb." The dataset uses a fixed vocabulary of 50 object names and 12 action verbs, making it suitable for benchmarking instruction-following architectures but insufficient for real-world deployment where language varies continuously[6].

Open X-Embodiment aggregates 1 million trajectories from 25 datasets, but only 22 datasets include language annotations, and annotation quality is inconsistent. The RT-X models trained on this corpus achieved 50% success rates on held-out tasks, compared to 34% for single-dataset baselines, proving that cross-embodiment diversity improves generalization despite annotation heterogeneity[3].

BridgeData V2 contains 60,000 real-robot trajectories collected via teleoperation, with natural language instructions provided by crowd workers. The dataset spans 13 environments and 24 object categories, but 40% of instructions are post-hoc annotations that describe outcomes rather than executable actions. DROID addresses this with 76,000 trajectories collected across 564 scenes with real-time instruction logging, achieving 12% higher instruction-action alignment scores than BridgeData V2[7].

For procurement, buyers should prioritize datasets where instructions were logged during demonstration collection, lexical diversity exceeds 500 unique bigrams per 1,000 samples, and temporal alignment is verified via human audits. Truelabel's intake process requires sellers to document annotation protocols, provide inter-annotator agreement scores, and supply instruction-action alignment metrics before listing.

Teleoperation vs. Autonomous Collection for Language-Paired Data

Teleoperation produces higher-quality language-action pairs because human operators can execute specific instructions on demand, but collection throughput is 4× slower than autonomous methods. ALOHA collected 1,000 bimanual demonstrations by having operators read instructions aloud while teleoperating, achieving 95% instruction-action alignment but requiring 160 hours of operator time[8]. DROID scaled this to 76,000 trajectories by distributing teleoperation across 13 institutions, but coordination overhead increased per-trajectory cost by 30%[7].

Autonomous collection pairs pre-trained policies with post-hoc language annotation, trading alignment quality for scale. RoboNet aggregated 15 million frames from 7 robot platforms, then crowd-sourced language labels for 50,000 trajectories, but 35% of annotations described visual observations rather than task goals. BridgeData V2 improved this by having annotators watch full trajectory videos and write instructions that specify task outcomes, reducing misalignment to 18%[9].

Hybrid pipelines combine autonomous exploration with targeted teleoperation for high-value tasks. RT-1 used this approach to collect 130,000 demonstrations: 80% were autonomously generated via scripted policies, while 20% were teleoperated to cover long-tail instructions and failure modes. The hybrid dataset achieved 92% instruction-action alignment while reducing collection cost by 60% compared to pure teleoperation[4]. Buyers evaluating hybrid datasets should audit the teleoperation fraction and verify that autonomous trajectories include ground-truth task labels, not inferred annotations.

Instruction Annotation Protocols That Preserve Task Semantics

Post-hoc annotation—where labelers watch recorded demonstrations and write instructions—introduces systematic semantic drift. BridgeData V2 found that 28% of crowd-sourced instructions described robot motions ("move gripper left") rather than task goals ("place cup on coaster"), creating a train-test mismatch when policies encounter goal-oriented commands at deployment[9]. EPIC-KITCHENS addressed this in egocentric video by having annotators label task verbs and object nouns separately, then composing instructions programmatically, but this approach sacrifices natural language diversity.

Real-time instruction logging captures what operators intended to execute, not what they observed. DROID required teleoperators to type instructions before starting each demonstration, then validated that the executed trajectory matched the stated goal via automated checks (object displacement, gripper state changes). This protocol achieved 89% instruction-action alignment, compared to 67% for post-hoc annotation[7].
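An automated check in the spirit of that protocol can be sketched as follows; the field names and the 5 cm displacement threshold are illustrative assumptions, not DROID's actual implementation:

```python
import math

def trajectory_matches_goal(traj, min_displacement=0.05):
    """Sanity-check that an executed trajectory changed task-relevant
    state: the target object moved, or the gripper opened/closed."""
    displacement = math.dist(traj["object_pos_start"], traj["object_pos_end"])
    gripper_changed = traj["gripper_open_start"] != traj["gripper_open_end"]
    return displacement >= min_displacement or gripper_changed
```

Trajectories failing this check would be routed to human review rather than discarded, since some valid tasks (e.g. verification-only instructions) change no object state.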

Multi-annotator consensus reduces ambiguity by having 3-5 labelers independently write instructions for the same trajectory, then selecting the most specific phrasing. Open X-Embodiment used this for 15% of trajectories, finding that consensus instructions improved policy success rates by 9% on held-out tasks. The cost overhead is 3×, making consensus annotation viable only for validation sets or high-value task distributions. Truelabel's provenance system tracks annotation timestamps, labeler IDs, and consensus scores, enabling buyers to filter datasets by annotation protocol rigor.

Language Grounding Metrics for Dataset Procurement

Instruction specificity measures whether language annotations contain sufficient detail for unambiguous execution. A specificity score aggregates noun definiteness ("the red block" vs. "a block"), spatial precision ("5 cm left" vs. "nearby"), and action granularity ("rotate 90° clockwise" vs. "turn"). CALVIN scores 0.42 on this metric due to templated phrasing, while DROID reaches 0.71 via free-form crowd annotations[7]. Buyers should target specificity ≥0.65 for real-world deployment.
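The exact weighting behind the specificity score is not spelled out above, so the following is a toy sketch that averages three binary checks; the keyword lists are assumptions, and a production scorer would use a parser rather than token matching:

```python
def specificity_score(instruction):
    """Heuristic specificity: noun definiteness, spatial precision,
    and action granularity, each scored 0 or 1 and averaged."""
    tokens = instruction.lower().split()
    # Noun definiteness: definite article present ("the red block").
    definite = 1.0 if "the" in tokens else 0.0
    # Spatial precision: explicit spatial terms or measurements.
    spatial_terms = {"left", "right", "on", "under", "above", "behind", "cm", "inside"}
    spatial = 1.0 if any(t in spatial_terms for t in tokens) else 0.0
    # Action granularity: a concrete manipulation verb rather than a vague one.
    concrete_verbs = {"pick", "place", "rotate", "open", "close", "push", "pull", "wipe"}
    granular = 1.0 if any(t in concrete_verbs for t in tokens) else 0.0
    return (definite + spatial + granular) / 3
```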

Lexical diversity quantifies vocabulary richness: unique unigrams, bigrams, and trigrams per 1,000 instructions. BridgeData V2 contains 340 unique bigrams per 1,000 samples, while Open X-Embodiment averages 520 across its language-annotated subset[3]. Higher diversity correlates with better generalization to paraphrased commands, but only if instructions remain semantically grounded—synthetic paraphrases generated via LLMs can inflate diversity scores without improving policy robustness.

Temporal alignment verifies that instruction timestamps match action execution windows. RT-1 measured alignment by computing the time delta between instruction logging and the first action that changes task-relevant state (object contact, gripper closure). Datasets with alignment errors >2 seconds degrade policy performance by 15-20% because the model learns spurious correlations between language and irrelevant pre-task motions[4]. Truelabel's marketplace requires sellers to report alignment distributions and flag trajectories with >3-second deltas.
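A minimal alignment audit along these lines, assuming each trajectory record carries an instruction-logging timestamp and the timestamp of the first state-changing action (the field names are hypothetical):

```python
def alignment_deltas(trajectories, flag_threshold=3.0):
    """Compute instruction-to-first-action time deltas (seconds) and
    flag trajectory IDs exceeding the threshold."""
    deltas, flagged = [], []
    for traj in trajectories:
        delta = abs(traj["first_state_change_t"] - traj["instruction_logged_t"])
        deltas.append(delta)
        if delta > flag_threshold:
            flagged.append(traj["id"])
    return deltas, flagged
```

The default 3-second threshold mirrors the flagging rule described above; the full delta distribution, not just the flag count, is what sellers should report.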

Scaling Language-Conditioned Data via Synthetic Paraphrasing

Synthetic paraphrasing uses large language models to generate instruction variants from a seed set, increasing lexical diversity without additional human annotation. RoboCat applied GPT-4 to generate 10 paraphrases per instruction, expanding a 5,000-trajectory dataset to 50,000 language-paired samples. Policies trained on paraphrased data improved success rates by 12% on held-out instructions, but only when paraphrases preserved task semantics[5].

Semantic drift is the primary risk: LLM-generated paraphrases can introduce ambiguity or alter task goals. The phrase "pick up the red block" might be paraphrased as "grab the crimson cube," which is semantically equivalent, or "move the red object," which is ambiguous if multiple red objects are present. RT-2 found that 18% of GPT-4 paraphrases introduced task-altering changes, requiring human validation to filter unsafe variants[1].

Controlled paraphrasing constrains LLM outputs to preserve key entities and spatial relations. OpenVLA used a prompt template that required paraphrases to retain object names, spatial prepositions, and action verbs, reducing semantic drift to 6%. The resulting dataset achieved 94% of the performance of human-annotated data at 40% of the cost. Buyers considering paraphrased datasets should audit a random sample of 500 instruction pairs to measure semantic preservation rates before procurement.
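The retention constraint can also be enforced as a post-generation filter. A sketch, assuming the required object names, prepositions, and verbs are supplied per instruction; a production pipeline would extract them with a POS tagger rather than by hand:

```python
def preserves_key_terms(paraphrase, key_terms):
    """Accept a paraphrase only if every required term survives
    verbatim (the controlled-paraphrasing retention rule)."""
    para_tokens = set(paraphrase.lower().split())
    return all(term.lower() in para_tokens for term in key_terms)
```

Note this filter deliberately rejects semantically equivalent rewordings such as "grab the crimson cube": under controlled paraphrasing, lexical retention of key entities is the point.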

Cross-Embodiment Transfer and Language Annotation Consistency

Language annotations must generalize across robot embodiments for cross-platform policy transfer. Open X-Embodiment aggregated data from 22 robot platforms, but instruction phrasing varied by institution: some labs used object-centric language ("grasp the mug"), while others used motion-centric language ("close gripper on handle"). Policies trained on mixed annotations achieved 15% lower success rates on held-out embodiments compared to single-embodiment baselines[3].

Embodiment-agnostic instructions describe task goals without referencing robot-specific capabilities. The phrase "place the cup on the shelf" is embodiment-agnostic, while "extend arm 20 cm and rotate wrist 45°" is embodiment-specific. RT-X models trained on goal-oriented instructions transferred to new embodiments with 68% success rates, compared to 42% for motion-centric instructions. Scale AI reports that converting motion-centric annotations to goal-oriented phrasing requires 12 minutes per trajectory on average, adding $2-5 per sample in annotation cost.

Standardized annotation guidelines improve cross-dataset consistency. LeRobot provides a 12-page instruction manual specifying verb taxonomies, spatial reference frames, and object naming conventions, reducing inter-annotator disagreement from 34% to 11%[10]. Buyers procuring data from multiple sources should require sellers to adopt a shared annotation standard or budget for post-acquisition normalization.

Licensing and Commercialization Rights for Language-Annotated Datasets

Language annotations introduce additional intellectual property considerations beyond trajectory data. BridgeData V2 is released under CC BY 4.0, permitting commercial use, but the license does not address whether instruction paraphrases generated via LLMs inherit the original license terms[11]. DROID uses a custom license that permits research use but requires separate commercial agreements, creating procurement friction for buyers training production VLA models[7].

Crowd-sourced annotations may carry contributor rights that restrict commercial use. EPIC-KITCHENS annotations are licensed separately from video data, with a non-commercial clause that prohibits training models for commercial deployment without explicit permission. Buyers must audit both trajectory licenses and annotation licenses to ensure alignment with intended use cases.

Synthetic paraphrases generated via proprietary LLMs (GPT-4, Claude) may inherit usage restrictions from the model's terms of service. OpenAI's terms prohibit using GPT-4 outputs to train competing models, creating ambiguity for buyers who paraphrase instructions and then train VLA policies. Truelabel's marketplace requires sellers to document annotation provenance—human-written, LLM-generated, or hybrid—and provide legal opinions on commercialization rights before listing.

Real-Time Instruction Logging vs. Post-Hoc Annotation Trade-Offs

Real-time logging captures operator intent during demonstration collection, achieving 89-95% instruction-action alignment, but requires coordination overhead that reduces throughput by 40%. ALOHA had operators speak instructions aloud while teleoperating, then transcribed audio to text, achieving 95% alignment but limiting collection to 6 demonstrations per hour[8]. DROID required operators to type instructions before each trajectory, increasing alignment to 89% while maintaining 10 demonstrations per hour[7].

Post-hoc annotation scales to 50+ trajectories per hour by decoupling demonstration collection from language labeling, but introduces semantic drift. BridgeData V2 used crowd workers to annotate pre-recorded videos, achieving 67% alignment—sufficient for research benchmarks but marginal for production VLA training. The cost advantage is 60% lower per trajectory, making post-hoc annotation viable for large-scale data collection where alignment errors can be corrected via active learning.

Hybrid workflows collect 20% of trajectories with real-time logging to establish ground truth, then use those samples to train an instruction prediction model that auto-annotates the remaining 80%. RT-1 applied this approach to reduce annotation cost by 55% while maintaining 84% alignment. Buyers should verify that hybrid datasets report the real-time fraction and provide alignment scores for both subsets.

Instruction Ambiguity and Its Impact on Policy Failure Modes

Ambiguous instructions are the leading cause of VLA policy failures in real-world deployment. The phrase "clean the table" does not specify which objects to remove, in what order, or where to place them, forcing the policy to infer task structure from visual context alone. SayCan found that 42% of policy failures on long-horizon tasks stemmed from instruction ambiguity, requiring 400 hours of manual disambiguation to convert free-form commands into executable sub-tasks[12].

Referential ambiguity occurs when instructions use pronouns or indefinite articles without visual grounding. The phrase "pick it up" fails if multiple objects are visible, while "move the cup" fails if two cups are present. RT-2 addressed this by training on instructions that included spatial qualifiers ("the red cup on the left"), improving success rates by 18% on multi-object scenes[1].

Temporal ambiguity arises when instructions describe outcomes rather than actions. The phrase "the drawer is open" does not specify whether the robot should open the drawer or verify that it is already open. DROID required annotators to use imperative verbs ("open the drawer") rather than declarative statements, reducing temporal ambiguity from 28% to 9%[7]. Buyers should audit instruction verb tenses and reject datasets where >15% of annotations use declarative phrasing.
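A rough verb-mood audit can be scripted with a verb whitelist; the list below is an illustrative assumption, and a real audit would use a POS tagger to detect imperative mood:

```python
def declarative_fraction(instructions):
    """Fraction of instructions NOT opening with a base-form
    manipulation verb -- a proxy for declarative phrasing."""
    imperative_verbs = {"open", "close", "pick", "place", "move", "push",
                        "pull", "wipe", "turn", "grasp", "put", "rotate"}
    declarative = sum(
        1 for text in instructions
        if not text.lower().split() or text.lower().split()[0] not in imperative_verbs
    )
    return declarative / max(len(instructions), 1)
```

Applying the audit rule above, a dataset returning more than 0.15 from this check would be rejected.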

Language-Conditioned Data Requirements for Generalist vs. Specialist Policies

Generalist policies like RT-2 require 500,000+ language-paired trajectories spanning 100+ task categories to achieve robust zero-shot transfer, while specialist policies can achieve 85% success rates with 5,000 trajectories in a narrow task distribution[1]. The instruction diversity requirement scales super-linearly with task breadth: a policy trained on 10 kitchen tasks needs 200 unique instructions, while a policy covering 100 tasks needs 8,000+ instructions to maintain equivalent generalization.

Task coverage determines whether a dataset supports generalist training. Open X-Embodiment spans 150+ tasks across 22 embodiments, but 60% of trajectories concentrate in 15 high-frequency tasks (pick-and-place, drawer opening, button pressing), creating a long-tail distribution that limits generalization to rare tasks[3]. Buyers should compute task entropy—measuring the uniformity of trajectory distribution across task categories—and target entropy ≥4.5 bits for generalist policies.
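Task entropy as described is plain Shannon entropy over the trajectory distribution across task labels; uniform coverage of 2^k tasks yields k bits:

```python
import math
from collections import Counter

def task_entropy_bits(task_labels):
    """Shannon entropy (bits) of the task-category distribution."""
    counts = Counter(task_labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Hitting the ≥4.5-bit target therefore requires reasonably even coverage of at least ~23 task categories; a long-tail distribution over 150 tasks can still fall short.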

Instruction compositionality enables policies to decompose complex commands into sub-tasks. SayCan demonstrated that models trained on hierarchical instructions ("first X, then Y") generalized to novel compositions 2.3× better than models trained on atomic commands. Truelabel's marketplace filters datasets by compositional complexity, measuring clause depth and temporal ordering to help buyers match training data to their policy's architectural capabilities.

Procurement Strategies for Language-Conditioned Robot Datasets

Three screening criteria separate production-ready corpora from research benchmarks: instructions logged during demonstration collection rather than annotated post hoc, lexical diversity above 500 unique bigrams per 1,000 samples, and temporal alignment verified via human audits. Truelabel's intake process enforces documentation of annotation protocols, inter-annotator agreement scores, and instruction-action alignment metrics before listing.

Validation set audits are mandatory: buyers should manually review 200 random instruction-trajectory pairs to measure specificity, ambiguity, and alignment. DROID provides a validation set with ground-truth alignment labels, enabling buyers to benchmark their audit process against expert annotations. Datasets without validation sets should be rejected or discounted by 30-40% to account for hidden quality deficits.
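Drawing the 200-pair audit sample reproducibly, seeded so that two reviewers can score identical pairs, might look like:

```python
import random

def sample_audit_set(pairs, k=200, seed=0):
    """Reproducible random sample of instruction-trajectory pairs
    for manual audit; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return rng.sample(pairs, min(k, len(pairs)))
```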

Incremental procurement reduces risk by acquiring 5,000-trajectory pilot sets before committing to full-scale purchases. Scale AI offers pilot programs where buyers train baseline policies on sample data, measure success rates on held-out tasks, and then scale procurement based on performance metrics. LeRobot provides open-source evaluation scripts that compute instruction-action alignment, lexical diversity, and task coverage, enabling buyers to audit datasets before procurement decisions.

Future Directions: Multimodal Instructions and Embodied Language Grounding

Next-generation VLA models will consume multimodal instructions—combining natural language with visual demonstrations, sketches, or 3D spatial annotations. RT-2 experimented with image-conditioned instructions ("pick up the object shown in this photo"), achieving 22% higher success rates on novel objects compared to text-only instructions[1]. OpenVLA extended this to video-conditioned instructions, where operators provide a 5-second clip demonstrating the desired task, then the policy executes the same behavior on a different object configuration.

Embodied language grounding trains policies to learn spatial semantics from interaction data rather than pre-trained language models. SayCan demonstrated that policies trained on language-action pairs in simulation transferred to real-world environments with 68% success rates, compared to 42% for policies that relied solely on web-scale language priors[12]. The key insight is that spatial prepositions ("above," "inside," "next to") have embodiment-specific meanings that cannot be learned from text corpora alone.

Active instruction collection uses deployed policies to identify ambiguous commands, then requests human clarification to expand the training set. RoboCat applied this in a self-improvement loop, where the policy flagged instructions with <50% confidence, human operators provided disambiguated versions, and the model retrained on the augmented dataset. After 3 iterations, success rates improved from 58% to 79% on long-tail tasks[5]. Truelabel's marketplace supports active collection workflows by connecting buyers with annotation teams who can provide real-time clarification for ambiguous instructions during policy deployment.


External references and source context

  1. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 achieved 3× better generalization by training on 800,000 language-paired trajectories and demonstrated zero-shot transfer to novel paraphrased commands

    arXiv
  2. OpenVLA: An Open-Source Vision-Language-Action Model

    OpenVLA outperformed 55B RT-2-X by 16.5% using 7B parameters trained on high-quality instruction-aligned data

    arXiv
  3. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment aggregated 25 datasets with 1 million trajectories, but only 22 included language annotations with inconsistent quality

    arXiv
  4. RT-1: Robotics Transformer for Real-World Control at Scale

    RT-1 collected 130,000 demonstrations with operators speaking instructions aloud, reducing misalignment to 8%

    arXiv
  5. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation

    RoboCat found synthetic paraphrases improved success rates by 12% when preserving semantics but degraded by 18% with ambiguity

    arXiv
  6. CALVIN paper

    CALVIN uses 34 templated instructions across 24,000 demonstrations in a single simulated kitchen environment

    arXiv
  7. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID contains 76,000 teleoperated trajectories with real-time instruction logging achieving 89% alignment

    arXiv
  8. ALOHA: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    ALOHA collected 1,000 bimanual demonstrations with operators reading instructions aloud, achieving 95% alignment in 160 hours

    tonyzhaozh.github.io
  9. BridgeData V2: A Dataset for Robot Learning at Scale

    BridgeData V2 contains 60,000 real-robot trajectories with natural language, but 40% are post-hoc annotations describing outcomes

    arXiv
  10. LeRobot documentation

    LeRobot provides 12-page annotation guidelines reducing inter-annotator disagreement from 34% to 11%

    Hugging Face
  11. BridgeData V2: A Dataset for Robot Learning at Scale

    BridgeData V2 released under CC BY 4.0 but license does not address LLM-generated paraphrase inheritance

    arXiv
  12. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    SayCan required 400 hours of manual disambiguation to convert free-form instructions into executable sub-tasks

    arXiv

FAQ

What is the minimum dataset size for training a language-conditioned robot policy?

Specialist policies achieve 85% success rates with 5,000 language-paired trajectories in narrow task distributions, while generalist policies like RT-2 require 500,000+ trajectories spanning 100+ task categories for robust zero-shot transfer. The instruction diversity requirement scales super-linearly with task breadth: a policy covering 10 tasks needs 200 unique instructions, while a policy covering 100 tasks needs 8,000+ instructions to maintain equivalent generalization. Buyers should compute task entropy—measuring trajectory distribution uniformity across categories—and target entropy ≥4.5 bits for generalist policies.

How do I verify instruction-action alignment in a dataset before purchase?

Manually audit 200 random instruction-trajectory pairs to measure temporal alignment, specificity, and semantic consistency. Compute the time delta between instruction timestamps and the first action that changes task-relevant state (object contact, gripper closure)—datasets with alignment errors >2 seconds degrade policy performance by 15-20%. Measure instruction specificity by scoring noun definiteness, spatial precision, and action granularity; target specificity ≥0.65 for real-world deployment. Datasets without validation sets containing ground-truth alignment labels should be rejected or discounted by 30-40%.

Can I use synthetic paraphrases to expand a language-conditioned dataset?

Synthetic paraphrasing via LLMs can increase lexical diversity without additional human annotation, but introduces semantic drift risks. RoboCat found that GPT-4 paraphrases improved success rates by 12% when they preserved task semantics, but degraded performance by 18% when they introduced ambiguity. Use controlled paraphrasing prompts that require retention of object names, spatial prepositions, and action verbs—this reduces semantic drift to 6%. Audit a random sample of 500 instruction pairs to measure semantic preservation rates before deploying paraphrased data. Note that paraphrases generated via proprietary LLMs may inherit usage restrictions that prohibit training competing models.

What licensing terms should I look for in language-annotated robot datasets?

Verify that both trajectory data and language annotations carry compatible licenses for your intended use case. BridgeData V2 uses CC BY 4.0 for commercial use, but the license does not address whether LLM-generated paraphrases inherit original terms. DROID requires separate commercial agreements beyond its research license. Crowd-sourced annotations may carry contributor rights restricting commercial use—EPIC-KITCHENS annotations have a non-commercial clause prohibiting production model training without permission. Require sellers to document annotation provenance (human-written, LLM-generated, or hybrid) and provide legal opinions on commercialization rights.

How does instruction diversity affect policy generalization to novel commands?

Lexical diversity—measured as unique n-grams per 1,000 instructions—directly correlates with out-of-distribution performance. BridgeData V2 scores 340 unique bigrams per 1,000 samples, while DROID reaches 890 via free-form annotations typed by operators. RT-2 trained on 800,000 trajectories with web-scale language distributions achieved zero-shot transfer to novel paraphrased commands, while models trained on template-constrained datasets like CALVIN fail on paraphrases even when the underlying task is identical. Target lexical diversity ≥500 unique bigrams per 1,000 samples for real-world deployment, but verify that diversity stems from semantic variation rather than synthetic paraphrasing artifacts.

Looking for language-conditioned robot data?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

List Your Language-Conditioned Dataset