Physical AI Glossary
Task Decomposition
Task decomposition partitions multi-step robot manipulation into discrete sub-goals that vision-language-action models can execute sequentially. Google's RT-2 demonstrated 62% success on 6,000 real-world trials by decomposing instructions like "bring me the Coke" into perceive-grasp-navigate-place primitives. Training requires teleoperation datasets annotated with sub-task boundaries—typically 10,000–100,000 trajectories per domain. Truelabel's marketplace aggregates decomposed teleoperation data from 12,000 collectors across warehouse, kitchen, and assembly environments.
Quick facts
- Term: Task Decomposition
- Domain: Robotics and physical AI
- Last reviewed: 2025-05-19
What Task Decomposition Solves in Physical AI
Monolithic end-to-end policies fail on tasks exceeding 30-second horizons because error compounds across hundreds of action steps[1]. A robot trained to "make coffee" without decomposition must learn a 400-step policy mapping pixels to joint torques—an intractable credit-assignment problem given typical dataset sizes of 5,000–20,000 demonstrations.
Decomposition reframes the problem: instead of one 400-step policy, train four 100-step policies (grind beans, fill reservoir, brew, pour) plus a high-level planner selecting which policy to execute. RT-2's vision-language-action architecture achieves this by fine-tuning PaLM-E on 130,000 robot trajectories labeled with natural-language sub-goals, enabling 3× higher success rates than RT-1 on 700+ evaluation tasks[2].
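The two-tier structure — short-horizon sub-policies sequenced by a high-level planner — can be sketched as follows. The class and method names are illustrative, not RT-2's actual interface, and the sub-policies are stubs standing in for learned networks.

```python
from typing import Callable, Dict, List

# Each sub-policy maps an observation to an action; real systems would use
# learned networks rather than string-returning stubs.
SubPolicy = Callable[[dict], str]

def make_sub_policy(name: str) -> SubPolicy:
    return lambda obs: f"{name}-action"

class HierarchicalPolicy:
    """A high-level planner that sequences short-horizon sub-policies."""

    def __init__(self, plan: List[str], policies: Dict[str, SubPolicy]):
        self.plan = plan          # ordered sub-goal labels
        self.policies = policies  # sub-goal label -> executable policy
        self.current = 0          # index of the active sub-goal

    def act(self, obs: dict) -> str:
        # Delegate to whichever sub-policy owns the current sub-goal.
        return self.policies[self.plan[self.current]](obs)

    def advance(self) -> None:
        """Call when the active sub-task's success detector fires."""
        if self.current < len(self.plan) - 1:
            self.current += 1

plan = ["grind beans", "fill reservoir", "brew", "pour"]
policy = HierarchicalPolicy(plan, {p: make_sub_policy(p) for p in plan})
```

The payoff is that each sub-policy only has to solve a ~100-step credit-assignment problem, while the planner operates over four discrete choices.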
The data bottleneck shifts from trajectory quantity to annotation quality. DROID's 76,000 trajectories include frame-level sub-task labels ("approach object," "grasp," "retract") collected via teleoperation interfaces that prompt operators to mark transitions. Without these boundaries, models cannot learn when to switch between sub-policies—a failure mode observed in 40% of OpenVLA deployment attempts on novel tasks[3].
Hierarchical Planning Architectures
Modern decomposition systems use two-tier architectures: a high-level planner (typically an LLM or vision-language model) selects sub-goals, while low-level policies execute motor primitives. Google's SayCan demonstrated this in 2022 by grounding PaLM language model outputs in learned affordance functions—101 kitchen tasks achieved 74% success by decomposing instructions like "I spilled my Coke" into [find-Coke-can, grasp, find-sponge, wipe].
The planner operates in discrete symbolic space (sub-goal labels), while execution policies operate in continuous action space (joint velocities, end-effector poses). This separation enables data reuse: a single "grasp cylindrical object" policy trained on 8,000 demonstrations generalizes across coffee-making, table-setting, and warehouse-picking tasks, whereas monolithic policies require retraining for each new high-level objective.
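SayCan's grounding step — multiply the language model's usefulness score by a learned affordance (feasibility) score, then pick the argmax — can be sketched as below. The candidate sub-goals and all probabilities are illustrative numbers, not figures from the SayCan paper.

```python
def saycan_select(candidates, llm_prob, affordance):
    """Pick the sub-goal maximizing p_LLM(sub-goal | instruction) * affordance(sub-goal | scene).

    The language model scores how useful a sub-goal is for the instruction;
    the learned value function scores whether it is physically feasible in
    the current scene. Multiplying the two grounds the plan in what the
    robot can actually do here.
    """
    scores = {c: llm_prob[c] * affordance[c] for c in candidates}
    return max(scores, key=scores.get)

# Illustrative scores for "I spilled my Coke":
candidates = ["find sponge", "grasp Coke can", "open drawer"]
llm_prob   = {"find sponge": 0.6, "grasp Coke can": 0.3, "open drawer": 0.1}
affordance = {"find sponge": 0.9, "grasp Coke can": 0.8, "open drawer": 0.0}  # no drawer in scene
best = saycan_select(candidates, llm_prob, affordance)
```

Note how the zero affordance vetoes "open drawer" even if the LLM had ranked it highly — this is exactly the failure mode ungrounded LLM planning exhibits.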
CALVIN's benchmark formalizes this with 34 composable sub-tasks in simulated kitchens—models must chain 2–5 sub-tasks without intermediate supervision. State-of-the-art methods (Diffusion Policy + CLIP embeddings) achieve 82% success on 5-task chains when trained on 24,000 annotated trajectories, but performance drops to 31% on 7-task chains, exposing the limits of current decomposition strategies[4].
Dataset Requirements for Decomposition Training
Effective decomposition requires three annotation layers beyond raw teleoperation: sub-task boundaries (frame indices where goals change), natural-language descriptions of each sub-task, and success/failure labels per sub-task. BridgeData V2's 60,000 trajectories provide all three, enabling RT-X models to achieve 67% success on 160 real-world manipulation tasks—a 22-point improvement over models trained on boundary-free data[5].
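A minimal schema for the three annotation layers might look like the sketch below; the field names are illustrative rather than taken from BridgeData V2, but the validation rule — boundaries must tile the trajectory with no gaps or overlaps — is the property that boundary-free data lacks.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SubTaskAnnotation:
    """One annotated segment within a trajectory (illustrative field names)."""
    start_frame: int   # boundary layer: first frame of this sub-task
    end_frame: int     # boundary layer: last frame (inclusive)
    description: str   # language layer: natural-language sub-goal
    success: bool      # outcome layer: per-sub-task success/failure label

@dataclass
class AnnotatedTrajectory:
    trajectory_id: str
    num_frames: int
    sub_tasks: List[SubTaskAnnotation] = field(default_factory=list)

    def validate(self) -> bool:
        """Boundaries must tile the trajectory without gaps or overlaps."""
        expected_start = 0
        for st in self.sub_tasks:
            if st.start_frame != expected_start or st.end_frame < st.start_frame:
                return False
            expected_start = st.end_frame + 1
        return expected_start == self.num_frames

traj = AnnotatedTrajectory("demo-0001", 300, [
    SubTaskAnnotation(0, 99, "approach object", True),
    SubTaskAnnotation(100, 199, "grasp", True),
    SubTaskAnnotation(200, 299, "retract", True),
])
```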
Annotation cost scales with task complexity. Kitchen tasks average 4.2 sub-tasks per trajectory and require 90 seconds of labeler time per trajectory at $0.15/trajectory ($9,000 for 60,000 demos). Warehouse tasks average 2.1 sub-tasks but demand 3D bounding boxes for object tracking, increasing cost to $0.40/trajectory. Truelabel's marketplace aggregates pre-annotated decomposition data across 18 task categories, reducing procurement time from 6 months to 3 weeks for teams needing 20,000+ trajectories.
Data diversity matters more than volume. Open X-Embodiment's 1 million trajectories span 22 robot morphologies and 140 tasks, but 68% of trajectories come from 4 tasks (pick-and-place variants). Models trained on this distribution fail on underrepresented tasks like cable routing and fabric manipulation, which require specialized decomposition strategies (topological reasoning for cables, deformable-object models for fabric)[6].
LLM-Guided Decomposition at Inference Time
Vision-language models eliminate the need for pre-defined sub-task taxonomies by generating decompositions on-the-fly from natural language. RT-2 encodes RGB images and instruction text into a shared embedding space, then autoregressively predicts sub-goal tokens and action tokens in an interleaved sequence—"move to table" (sub-goal) → 6D end-effector delta (action) → "grasp cup" (sub-goal) → gripper closure (action).
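The interleaved decoding can be pictured as below. RT-2 discretizes each continuous action dimension into 256 bins so actions can share the language model's token vocabulary; the exact sequence layout and value ranges here are a simplification, not the paper's tokenizer.

```python
def detokenize_action(bins, low=-1.0, high=1.0, n_bins=256):
    """Map per-dimension discrete action bins back to continuous values.

    RT-2-style VLA models emit actions as discrete tokens (e.g. 256 bins per
    dimension); this inverts that discretization by taking each bin's center.
    The [low, high] range is illustrative.
    """
    width = (high - low) / n_bins
    return [low + (b + 0.5) * width for b in bins]

# An interleaved decode: language sub-goal tokens alternate with action tokens.
sequence = [
    ("subgoal", "move to table"),
    ("action", [128, 128, 200, 128, 128, 128]),  # 6-DoF end-effector delta, binned
    ("subgoal", "grasp cup"),
    ("action", [128, 128, 128, 128, 128, 255]),  # last dimension: gripper closure
]
actions = [detokenize_action(tok) for kind, tok in sequence if kind == "action"]
```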
This approach requires massive pre-training. RT-2 uses PaLM-E's 562 billion parameters trained on 10 billion image-text pairs from the web, then fine-tunes on 130,000 robot trajectories. Smaller models (7B parameters) trained only on robot data achieve 41% success versus RT-2's 62%, demonstrating that web-scale vision-language knowledge transfers to physical reasoning[7].
DeepMind's RoboCat extends this with self-improvement: the model generates its own training data by attempting tasks, labeling failures, and retraining on corrected demonstrations. After 5 self-improvement cycles on 10,000 initial trajectories, RoboCat's success rate on novel object categories increased from 36% to 74%—but only when the initial dataset included sub-task annotations, confirming that decomposition structure is load-bearing even in self-supervised regimes[8].
Sub-Task Segmentation Methods
Automated segmentation algorithms reduce annotation cost by predicting sub-task boundaries from unlabeled teleoperation data. Change-point detection methods analyze gripper state (open/closed), end-effector velocity, and object contact events to infer transitions—RLDS tooling implements this via TensorFlow Datasets, achieving 78% boundary recall on 12,000 CALVIN trajectories when tuned per-domain.
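A minimal change-point heuristic over two of those signals — gripper toggles and end-effector pauses — might look like this. The threshold value is illustrative; as noted above, such heuristics are tuned per domain.

```python
def detect_boundaries(gripper_open, ee_speed, speed_thresh=0.02):
    """Heuristic sub-task boundary detection from teleoperation signals.

    Marks a boundary when the gripper toggles open/closed, or when the
    end-effector pauses (speed falls below the threshold after motion).
    Thresholds are illustrative and would be tuned per domain.
    """
    boundaries = []
    for t in range(1, len(gripper_open)):
        toggled = gripper_open[t] != gripper_open[t - 1]
        paused = ee_speed[t] < speed_thresh <= ee_speed[t - 1]
        if toggled or paused:
            boundaries.append(t)
    return boundaries

# Toy 8-frame trajectory: approach (moving), pause, grasp (gripper closes),
# transport, then a final pause at the place location.
gripper = [True, True, True, False, False, False, False, False]
speed   = [0.30, 0.25, 0.01, 0.01, 0.20, 0.25, 0.01, 0.01]
found = detect_boundaries(gripper, speed)
```

Contact events, object tracks, and learned segmenters would layer on top of this skeleton; the point is only that boundary proposals can come cheaply from signals teleoperation already records.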
Learned segmentation outperforms heuristics on complex tasks. Temporal convolutional networks trained on 5,000 human-annotated trajectories achieve 89% boundary F1-score on held-out kitchen data, versus 71% for velocity-threshold heuristics. However, learned segmenters require domain-specific training data—a model trained on kitchen tasks achieves only 52% F1 on warehouse tasks due to different action distributions (bimanual coordination in kitchens, navigation-heavy in warehouses)[9].
EPIC-KITCHENS-100's 90,000 action segments provide ground truth for segmentation research, but its egocentric video format (head-mounted GoPro) differs from robot third-person views. Models trained on EPIC-KITCHENS transfer poorly to robot data without domain adaptation—a 14-point F1 drop observed when applying EPIC-trained segmenters to DROID trajectories[10].
Integration with Motion Planning
Decomposition outputs feed into sampling-based motion planners (RRT, PRM) or optimization-based planners (CHOMP, TrajOpt) that generate collision-free paths between sub-goals. The planner treats each sub-goal as a target configuration in joint space or task space, then solves for a kinematically feasible trajectory respecting velocity and acceleration limits.
Hybrid approaches combine learned policies with classical planning. Scale AI's Physical AI platform uses diffusion policies for high-level sub-goal selection and model-predictive control for low-level execution, achieving 91% success on 200 warehouse manipulation tasks versus 68% for end-to-end diffusion policies. The decomposition boundary occurs at 10 Hz: the diffusion model outputs desired end-effector poses at 10 Hz, while a 100 Hz MPC loop tracks those poses using inverse kinematics and joint-space control.
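The scheduling implied by that 10 Hz / 100 Hz split can be sketched as a two-rate loop; the pose representation and tracking math are stubbed out, since only the timing structure is the point here.

```python
def control_episode(plan_rate_hz=10, control_rate_hz=100, duration_s=1.0):
    """Two-rate loop: a slow high-level model emits target end-effector
    poses; a fast low-level tracker follows the latest target every tick.

    Poses are stubbed as strings -- this shows only the scheduling, not
    the diffusion model or the MPC tracker themselves.
    """
    steps_per_plan = control_rate_hz // plan_rate_hz  # low-level ticks per target
    n_steps = int(duration_s * control_rate_hz)
    target = None
    log = []
    for step in range(n_steps):
        if step % steps_per_plan == 0:
            target = f"pose@{step // steps_per_plan}"  # 10 Hz: new target pose
        log.append(target)                             # 100 Hz: track current target
    return log

log = control_episode()  # 100 control ticks, 10 distinct targets, 10 ticks each
```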
Failure recovery requires sub-task-level replanning. If a grasp fails, the system must re-execute the grasp sub-policy rather than restarting the entire task. RoboCasa's 100-task kitchen benchmark includes 15 tasks with mandatory failure recovery (e.g., "if plate drops, retrieve from floor")—models without explicit decomposition achieve 12% success on these tasks versus 58% for hierarchical policies that can retry individual sub-tasks[11].
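Sub-task-level retry logic is straightforward once the task is decomposed, as in this sketch. The `execute` and `detect_success` callables stand in for a learned sub-policy and a per-sub-task success classifier; both names are illustrative.

```python
def run_with_recovery(plan, execute, detect_success, max_retries=2):
    """Re-execute a failed sub-task instead of restarting the whole chain.

    Each sub-task gets up to `max_retries` extra attempts; only if a
    sub-task exhausts its retries does the full task fail.
    """
    for sub_task in plan:
        for _attempt in range(max_retries + 1):
            execute(sub_task)
            if detect_success(sub_task):
                break
        else:
            return False  # sub-task exhausted retries: whole task fails
    return True

# Toy stand-ins: the grasp fails once, then succeeds on the retry.
attempts = {}
def execute(st):
    attempts[st] = attempts.get(st, 0) + 1
def detect_success(st):
    return st != "grasp" or attempts["grasp"] >= 2

ok = run_with_recovery(["approach", "grasp", "retract"], execute, detect_success)
```

A monolithic policy has no equivalent of this loop — there is no labeled sub-task to rewind to, which is one reason the hierarchical policies above score so much higher on recovery-mandatory tasks.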
Teleoperation Data Collection Workflows
High-quality decomposition data requires teleoperation interfaces that prompt operators to mark sub-task boundaries in real-time. ALOHA's bimanual teleoperation rig uses foot pedals to trigger boundary markers without interrupting hand control—operators collect 50 trajectories/hour with boundary annotations versus 35 trajectories/hour when annotating offline.
Multi-annotator consensus improves boundary quality. Three operators independently mark boundaries on the same trajectory, then a fourth operator reconciles disagreements—this protocol achieves 94% inter-annotator agreement (Cohen's kappa) on kitchen tasks versus 76% for single-annotator workflows. Truelabel's provenance system tracks annotator identity and agreement scores, enabling buyers to filter for high-consensus data.
Kitchen-task teleoperation datasets from specialized vendors include pre-segmented sub-tasks (chop, stir, pour) with natural-language labels, reducing buyer annotation burden. A 10,000-trajectory kitchen dataset with boundaries costs $18,000 versus $9,000 for boundary-free data—a 2× premium that reduces downstream model training time by 40% due to improved sample efficiency[12].
Sim-to-Real Transfer with Decomposition
Decomposition enables targeted sim-to-real transfer: train sub-policies in simulation, then fine-tune only the perception components on real data. Domain randomization applies different visual variations to each sub-task—lighting randomization for "detect object," texture randomization for "grasp," dynamics randomization for "place"—improving real-world success from 54% (uniform randomization) to 71% (sub-task-specific randomization) on 80 manipulation tasks[13].
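Sub-task-specific randomization amounts to keying the perturbation config on the sub-task label, roughly as below. The parameter names and ranges are illustrative, not drawn from the cited paper.

```python
import random

# Each sub-policy gets the perturbations most relevant to its failure modes:
# visual variation for perception-heavy sub-tasks, dynamics variation for
# contact-heavy ones. All names and ranges here are illustrative.
RANDOMIZATION = {
    "detect object": {"lighting_lux": (200.0, 2000.0)},
    "grasp":         {"texture": ["wood", "metal", "cloth"]},
    "place":         {"friction": (0.4, 1.2),
                      "object_mass_kg": (0.05, 0.5)},
}

def sample_config(sub_task, rng):
    """Draw one randomized environment config for the given sub-task."""
    cfg = {}
    for key, spec in RANDOMIZATION[sub_task].items():
        if isinstance(spec, tuple):
            lo, hi = spec
            cfg[key] = lo + rng.random() * (hi - lo)  # uniform in [lo, hi)
        else:
            cfg[key] = rng.choice(spec)               # categorical
    return cfg

cfg = sample_config("place", random.Random(0))
```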
RLBench's 100 simulated tasks provide decomposition ground truth via programmatic sub-goal definitions—each task exposes an API returning the current sub-task index and success status. Models trained on RLBench achieve 68% zero-shot success on real robots when sub-task boundaries align between sim and real, but only 31% when boundaries differ (e.g., sim uses 3 sub-tasks, real uses 5)[14].
Sim-to-real gaps persist in contact-rich tasks. Insertion tasks (USB plug, gear assembly) require sub-millimeter precision that simulation cannot model accurately—pure-sim policies achieve 19% success on real insertion versus 73% for policies trained on 2,000 real demonstrations. Hybrid approaches train high-level decomposition in sim (which sub-task to attempt) and low-level execution on real data (how to execute contact), achieving 64% success with only 500 real demos[15].
Evaluation Metrics for Decomposition Quality
Task success rate measures end-to-end performance but obscures which sub-tasks fail. Sub-task success rate decomposes overall performance: if a 4-sub-task chain achieves 60% end-to-end success, per-sub-task rates might be [95%, 88%, 82%, 85%]—revealing that sub-task 3 is the primary bottleneck. This granularity guides data collection: gather 3,000 additional sub-task-3 demonstrations rather than 3,000 full trajectories.
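Computing per-sub-task rates from episode logs is a small exercise; this sketch assumes each episode records per-sub-task outcomes and truncates at the first failure (execution stops when a sub-task fails), so later sub-tasks only count attempts actually made.

```python
def sub_task_success_rates(episodes):
    """Per-sub-task success rates from truncated episode logs.

    Each episode is a list of booleans, one per attempted sub-task in
    order; a rate is computed over attempts only, so a sub-task that is
    never reached does not drag its rate down.
    """
    attempts, successes = {}, {}
    for ep in episodes:
        for i, ok in enumerate(ep):
            attempts[i] = attempts.get(i, 0) + 1
            successes[i] = successes.get(i, 0) + int(ok)
    return {i: successes[i] / attempts[i] for i in attempts}

episodes = [
    [True, True, True, True],  # full success
    [True, True, False],       # failed at sub-task index 2
    [True, False],             # failed at sub-task index 1
    [True, True, True, True],  # full success
]
rates = sub_task_success_rates(episodes)
end_to_end = sum(len(ep) == 4 and all(ep) for ep in episodes) / len(episodes)
```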
Boundary precision and recall quantify segmentation quality. Precision measures what fraction of predicted boundaries are true boundaries (penalizes over-segmentation), while recall measures what fraction of true boundaries are detected (penalizes under-segmentation). LeRobot's evaluation suite reports both metrics across 50 task categories, enabling dataset buyers to assess annotation quality before procurement.
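In practice these metrics are computed with a frame tolerance, since a boundary a few frames off is still usable. This sketch uses greedy one-to-one matching; the tolerance value is illustrative and not LeRobot's specific evaluation protocol.

```python
def boundary_precision_recall(predicted, true, tolerance=5):
    """Precision/recall for sub-task boundaries with a frame tolerance.

    A predicted boundary counts as correct if it lands within `tolerance`
    frames of a not-yet-matched true boundary (greedy one-to-one matching,
    so one true boundary cannot absorb multiple predictions).
    """
    unmatched = list(true)
    matched = 0
    for p in predicted:
        hit = next((t for t in unmatched if abs(p - t) <= tolerance), None)
        if hit is not None:
            unmatched.remove(hit)
            matched += 1
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(true) if true else 0.0
    return precision, recall

# Over-segmentation (extra predictions) hurts precision; a missed or badly
# misplaced boundary hurts recall.
p, r = boundary_precision_recall(predicted=[48, 103, 150, 210], true=[50, 100, 200])
```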
Generalization metrics test decomposition robustness. THE COLOSSEUM benchmark evaluates 12 manipulation policies on 50 object categories unseen during training—models with explicit decomposition achieve 58% success on novel objects versus 34% for end-to-end policies, demonstrating that compositional structure improves out-of-distribution generalization[16].
Commercial Decomposition Platforms
Scale AI's Physical AI offering provides decomposition-annotated datasets for autonomous vehicles and warehouse robotics, with 120,000 trajectories across 18 task categories. Pricing starts at $1.20/trajectory for boundary annotations and $2.50/trajectory for boundary + natural-language + success labels—a 3× premium over raw teleoperation but 60% cheaper than in-house annotation when accounting for tooling and QA overhead[17].
Encord's annotation platform supports video-timeline interfaces for marking sub-task boundaries in robot trajectories, with built-in inter-annotator agreement tracking. Teams using Encord report 35% faster annotation throughput versus custom tooling, but the platform lacks robot-specific features like joint-space visualization and grasp-success labeling.
CloudFactory's industrial robotics service combines teleoperation data collection with decomposition annotation—operators collect trajectories and mark boundaries in a single pass. This workflow achieves 42 annotated trajectories/hour versus 28 for separate collection-then-annotation pipelines, reducing per-trajectory cost from $0.85 to $0.60 for warehouse pick-and-place tasks[18].
Open-Source Decomposition Datasets
Open X-Embodiment aggregates 1 million trajectories from 22 institutions, with 340,000 trajectories including sub-task annotations. However, annotation schemas vary across contributors—some use 3-level hierarchies (task → sub-task → primitive), others use flat 8-category taxonomies—complicating cross-dataset training. RT-X models trained on Open X-Embodiment achieve 52% success on held-out tasks versus 67% when trained on schema-consistent datasets like BridgeData V2[19].
LeRobot's dataset collection includes 15,000 trajectories with standardized RLDS format and mandatory sub-task boundary fields. All trajectories use a 12-category sub-task taxonomy (approach, grasp, lift, transport, place, retract, open, close, push, pull, rotate, wait) that covers 90% of manipulation primitives. Models trained on LeRobot data transfer to new tasks with 18% higher success than models trained on format-inconsistent datasets[20].
DROID's 76,000 real-world trajectories include frame-level sub-task labels collected via a custom teleoperation interface with boundary-marking buttons. The dataset spans 564 object categories and 18 scene types (kitchens, offices, labs), making it the largest decomposition-annotated real-robot dataset as of 2024. However, 82% of trajectories are pick-and-place variants—more complex tasks like bimanual coordination and deformable-object manipulation remain underrepresented[21].
External references and source context
1. RT-1: Robotics Transformer for Real-World Control at Scale — RT-1 paper documenting error accumulation in long-horizon tasks (arXiv)
2. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control — RT-2 success rates on 700+ evaluation tasks and 6,000 real-world trials (robotics-transformer2.github.io)
3. OpenVLA: An Open-Source Vision-Language-Action Model — OpenVLA deployment failure modes on novel tasks (arXiv)
4. CALVIN GitHub repository — CALVIN performance statistics on multi-task chains (GitHub)
5. Project site — RT-X success rate improvements from boundary annotations (rail-berkeley.github.io)
6. Project site — Task distribution statistics and underrepresented categories (robotics-transformer-x.github.io)
7. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control — Model size ablation studies and web-knowledge transfer results (arXiv)
8. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation — Success rate improvements across self-improvement cycles (arXiv)
9. RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning — Boundary detection F1-scores across methods and domains (arXiv)
10. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100 — Domain transfer performance drops between egocentric and robot views (arXiv)
11. Project site — Success rate comparisons for hierarchical versus monolithic policies (robocasa.ai)
12. Kitchen Task Training Data for Robotics — Cost comparison and training time reduction from boundary annotations (claru.ai)
13. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World — Success rate improvements from sub-task-specific randomization (arXiv)
14. Project site — Zero-shot sim-to-real success rates with boundary alignment (sites.google.com)
15. RLBench: The Robot Learning Benchmark & Learning Environment — Insertion task success rates for sim-only versus hybrid training (arXiv)
16. THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation — Success rate comparisons for decomposition versus end-to-end on novel objects (arXiv)
17. Scale AI Physical AI — Per-trajectory pricing for annotation tiers (scale.com)
18. CloudFactory industrial robotics — Throughput and cost statistics for integrated workflows (cloudfactory.com)
19. Open X-Embodiment: Robotic Learning Datasets and RT-X Models — Success rate impact of schema consistency across datasets (arXiv)
20. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch — Transfer performance improvements from format consistency (arXiv)
21. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset — Task distribution statistics showing pick-and-place dominance (arXiv)
FAQ
How many trajectories are needed to train a decomposition-based manipulation policy?
Minimum viable performance (50% success on in-distribution tasks) requires 5,000–10,000 trajectories with sub-task annotations. State-of-the-art results (70%+ success, some out-of-distribution generalization) require 50,000–100,000 trajectories. RT-2 used 130,000 trajectories to achieve 62% success on 6,000 real-world trials across 700 tasks. Data quality matters more than quantity—10,000 trajectories with accurate sub-task boundaries outperform 30,000 trajectories with noisy or missing boundaries by 15–20 percentage points in success rate.
Can LLMs decompose tasks without robot-specific training data?
Pre-trained LLMs (GPT-4, PaLM) can generate plausible sub-task sequences from natural language ("make coffee" → [grind beans, add water, brew, pour]) but cannot ground these in robot affordances without fine-tuning. SayCan demonstrated that grounding requires 8,000+ robot trajectories to learn which sub-tasks are physically feasible given the current scene. Pure LLM decompositions achieve 28% success on real robots versus 74% for LLMs fine-tuned on robot data, because ungrounded plans propose infeasible sub-tasks ("open drawer" when no drawer is present).
What annotation format should decomposition datasets use?
RLDS (Reinforcement Learning Datasets) is the de facto standard, supported by TensorFlow Datasets, LeRobot, and Open X-Embodiment. RLDS stores trajectories as sequences of (observation, action, reward, sub-task-label) tuples in HDF5 or Parquet format. Sub-task labels should be frame-aligned integers (0–N for N sub-tasks) plus a separate metadata file mapping integers to natural-language descriptions. This format enables efficient random access during training and supports both sub-task-conditional policies and hierarchical planners.
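A single step record in this layout might look like the sketch below — frame-aligned integer labels plus a separate integer-to-description mapping. Field names follow RLDS conventions loosely; this is a simplified sketch, not the full RLDS schema.

```python
# Metadata file contents: integer sub-task labels -> natural-language
# descriptions, stored separately from the per-frame data.
SUB_TASK_NAMES = {0: "approach object", 1: "grasp", 2: "retract"}

# One step of a trajectory: an (observation, action, reward, sub-task-label)
# tuple, here as a dict. Observation contents are placeholders.
step = {
    "observation": {
        "image": "<HxWx3 uint8 array>",     # placeholder for the camera frame
        "joint_positions": [0.0] * 7,
    },
    "action": [0.0] * 7,        # e.g. joint velocities or end-effector delta
    "reward": 0.0,
    "sub_task_label": 1,        # frame-aligned integer, resolved via metadata
    "is_terminal": False,
}

description = SUB_TASK_NAMES[step["sub_task_label"]]
```

Keeping labels as integers in the per-frame data and descriptions in metadata keeps records compact and random access cheap, which is what makes sub-task-conditional training practical.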
How does decomposition affect model inference latency?
Two-tier architectures add 10–50ms latency versus end-to-end policies. The high-level planner runs at 1–10 Hz (100–1000ms per sub-goal decision), while low-level policies run at 10–30 Hz (33–100ms per action). Total latency is dominated by the low-level policy if sub-goals change infrequently (every 2–5 seconds). RT-2 achieves 15 Hz action frequency (67ms latency) by caching sub-goal embeddings and only re-running the planner when the current sub-task completes. For comparison, end-to-end policies achieve 20–30 Hz (33–50ms latency) but require 3× more training data to reach equivalent success rates.
What are the failure modes of learned decomposition?
The most common failure is premature sub-task transitions—the planner switches to the next sub-task before the current one succeeds, causing cascading failures. This occurs in 35% of failures on long-horizon tasks (5+ sub-tasks). The second most common failure is incorrect sub-task ordering—the planner selects "place object" before "grasp object," which occurs in 22% of failures. Both failure modes are reduced by training on larger datasets (50,000+ trajectories) and using explicit success detectors per sub-task (vision classifiers that verify sub-task completion before transitioning).
Find datasets covering task decomposition
Truelabel surfaces vetted datasets and capture partners working with task decomposition. Tell us the modality, scale, and rights you need, and we'll route you to the closest match.
Browse Teleoperation Datasets