Physical AI Glossary
Manipulation Trajectory
A manipulation trajectory is a time-ordered sequence of (observation, action, state) tuples recorded during a single robot manipulation episode. Each trajectory captures synchronized sensor streams—RGB-D video, joint positions, gripper state, force/torque readings—paired with the action commands (Cartesian deltas, joint velocities, or gripper open/close signals) executed at each timestep. Trajectories are the atomic training unit for imitation learning: datasets like DROID contain 76,000 trajectories across 86 tasks and 564 scenes, while Open X-Embodiment aggregates 1M+ trajectories from 22 robot embodiments to train generalist policies like RT-X.
Quick facts
- Term: Manipulation Trajectory
- Domain: Robotics and physical AI
- Last reviewed: 2025-06-15
What Defines a Manipulation Trajectory
A manipulation trajectory records one complete attempt at a manipulation task from initial state to terminal condition (success, failure, or timeout). The TF-Agents Trajectory API formalizes this as a named tuple containing observation, action, policy_info, reward, discount, and step_type fields, though physical AI datasets often extend this schema with embodiment-specific metadata.
Each timestep in a trajectory pairs sensory observations with the action executed in response. Observations typically include RGB or RGB-D camera frames (30–60 Hz), proprioceptive state vectors (joint angles, velocities, end-effector pose at 10–100 Hz), and optional tactile or force/torque readings. Actions encode motor commands: Cartesian end-effector deltas for operational-space control, joint velocity targets for joint-space control, or discrete gripper open/close signals. The DROID dataset records 6-DOF end-effector actions plus binary gripper state at 10 Hz, synchronized with wrist and third-person camera streams.
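As a concrete illustration, one timestep of such a trajectory might be represented as follows. This is a minimal sketch; the field names (`rgb`, `joint_pos`, `gripper_width`) are illustrative stand-ins, not a standardized schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Step:
    """One timestep of a manipulation trajectory (illustrative schema)."""
    observation: dict      # synchronized sensor readings for this timestep
    action: np.ndarray     # command executed in response to the observation
    timestamp_ns: int      # capture time, used for cross-stream alignment
    step_type: str         # "first" | "mid" | "last"

step = Step(
    observation={
        "rgb": np.zeros((224, 224, 3), dtype=np.uint8),   # camera frame
        "joint_pos": np.zeros(7, dtype=np.float32),       # proprioception
        "gripper_width": np.float32(0.04),
    },
    action=np.zeros(7, dtype=np.float32),  # [dx, dy, dz, droll, dpitch, dyaw, grip]
    timestamp_ns=0,
    step_type="first",
)
```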
Temporal alignment across sensor modalities is critical. A 50 ms camera-action misalignment can degrade trained policy success rates by 15–30% because the model learns spurious correlations between stale observations and future actions[1]. The RLDS ecosystem addresses this by storing trajectories in a standardized schema with explicit timestamp fields, enabling downstream loaders to verify synchronization and resample streams to a common frequency.
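To make the synchronization check concrete, the sketch below pairs each action timestamp with the most recent camera frame and flags pairs that exceed a 50 ms skew budget. The helper is our own illustration, not part of RLDS:

```python
import numpy as np

def check_alignment(cam_ts_ns, act_ts_ns, max_skew_ns=50_000_000):
    """Pair each action with the nearest earlier camera frame and flag stale pairs.

    cam_ts_ns, act_ts_ns: sorted 1-D timestamp arrays in nanoseconds.
    max_skew_ns: tolerated observation-action gap (50 ms here, per the text).
    """
    idx = np.searchsorted(cam_ts_ns, act_ts_ns, side="right") - 1
    idx = np.clip(idx, 0, len(cam_ts_ns) - 1)
    skew = act_ts_ns - cam_ts_ns[idx]
    return idx, skew, skew > max_skew_ns

cam = np.arange(0, 2_000_000_000, 33_333_333)   # ~30 Hz camera stream
act = np.arange(0, 2_000_000_000, 100_000_000)  # 10 Hz action stream
frame_idx, skew, stale = check_alignment(cam, act)
print(f"{stale.sum()} of {len(act)} actions exceed the 50 ms skew budget")
```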
Trajectory Collection Methods and Quality Signals
Teleoperation remains the dominant collection method for high-quality manipulation trajectories. Human operators control robots via joysticks, VR controllers, or kinesthetic teaching to demonstrate task execution. The ALOHA system uses bilateral teleoperation with two leader arms, achieving 50 trajectories per hour for bimanual mobile manipulation tasks. Teleoperation data exhibits higher action diversity and smoother motion profiles than scripted or autonomously generated trajectories, making it the preferred source for imitation learning datasets[2].
Autonomous policy rollouts generate trajectories at scale but introduce distribution shift. A policy trained on 10,000 human demonstrations may produce 100,000 rollout trajectories for self-improvement, but these trajectories inherit the policy's failure modes and lack the corrective behaviors present in human data. The RoboNet dataset combines 15,000 human teleoperation trajectories with 120,000 autonomous rollouts across 7 robot platforms, using the human data to anchor the distribution and the rollout data to increase scene diversity.
Trajectory quality metrics include task success rate (binary or continuous reward), action smoothness (measured via jerk or acceleration norms), and observation coverage (percentage of workspace volume visited). The Open X-Embodiment dataset reports per-task success rates ranging from 34% (complex bimanual assembly) to 92% (single-object pick-place), with trajectory filtering removing episodes below 20% expected return to improve downstream policy performance by 12–18%[3].
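Two of these quality signals are straightforward to compute. The sketch below measures action smoothness as the mean third finite difference (jerk) of the action sequence and filters episodes by expected return; the `return` field on each episode dict is an assumed name, not a standard key:

```python
import numpy as np

def mean_jerk_norm(actions, hz):
    """Mean L2 norm of the third finite difference of an action sequence.

    actions: (T, D) array of commanded deltas or joint targets.
    hz: control frequency, converting differences to physical units.
    """
    jerk = np.diff(actions, n=3, axis=0) * hz**3
    return float(np.linalg.norm(jerk, axis=1).mean())

def filter_episodes(episodes, min_return=0.2):
    """Drop episodes whose expected return falls below the threshold."""
    return [ep for ep in episodes if ep["return"] >= min_return]
```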
Storage Formats and Serialization Standards
The RLDS (Reinforcement Learning Datasets) format has emerged as the de facto standard for trajectory storage in physical AI. RLDS wraps TensorFlow Datasets to provide a nested structure: datasets contain episodes, episodes contain steps, and each step stores observation/action/metadata dictionaries. This schema supports arbitrary sensor modalities and action spaces while maintaining compatibility with JAX, PyTorch, and TensorFlow training pipelines. The LeRobot library extends RLDS with Parquet-backed storage for 40% faster random access during distributed training.
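Consuming RLDS data follows the episode/steps nesting directly. The sketch below uses the standard TFDS loader; "my_rlds_dataset" is a placeholder for any RLDS-formatted dataset name:

```python
import tensorflow_datasets as tfds

# "my_rlds_dataset" is a placeholder; any RLDS-formatted TFDS dataset
# exposes the same episode -> steps nesting.
ds = tfds.load("my_rlds_dataset", split="train")

for episode in ds.take(1):
    # Each episode is a dict of metadata plus a nested dataset of steps.
    for step in episode["steps"]:
        obs, action = step["observation"], step["action"]
        # step["is_first"] / step["is_last"] mark episode boundaries.
```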
ROS bag files (rosbag2 with MCAP storage) dominate in robotics research labs due to native ROS integration. A typical manipulation trajectory stored as MCAP contains /camera/image_raw, /joint_states, /tf (transform tree), and /gripper_command topics, each with microsecond-precision timestamps. The rosbag2_storage_mcap plugin enables zero-copy playback and selective topic extraction, reducing trajectory loading time from 8 seconds (legacy rosbag) to 0.3 seconds for a 2-minute episode.
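Selective topic extraction can be sketched with the Python `mcap` package as below; decoding each message payload still requires a ROS 2 deserializer, which is omitted here:

```python
from mcap.reader import make_reader

with open("episode_0001.mcap", "rb") as f:
    reader = make_reader(f)
    for schema, channel, message in reader.iter_messages(
        topics=["/joint_states", "/gripper_command"]
    ):
        # message.log_time is a nanosecond timestamp; message.data is the
        # serialized payload for the topic's schema.
        print(channel.topic, message.log_time, len(message.data))
```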
HDF5 remains common for legacy datasets but suffers from poor parallel I/O performance. The HDF5 hierarchical structure stores trajectories as nested groups (/episode_0001/observations/camera_0, /episode_0001/actions), but concurrent reads from multiple training workers trigger file-lock contention that limits throughput to 200–400 trajectories/second on NVMe storage. Converting HDF5 datasets to Parquet or RLDS formats typically yields 3–5× faster data loading in multi-GPU training scenarios[4].
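Reading that hierarchical layout with h5py looks roughly like this (group names follow the structure quoted above):

```python
import h5py

with h5py.File("trajectories.hdf5", "r") as f:
    ep = f["episode_0001"]
    frames = ep["observations/camera_0"][:]   # (T, H, W, 3) camera frames
    actions = ep["actions"][:]                # (T, D) action commands
```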
Trajectory Datasets Powering Foundation Models
The Open X-Embodiment dataset aggregates 1,000,000+ trajectories from 22 robot embodiments (Franka Panda, UR5, Kinova Gen3, mobile manipulators) across 160+ tasks. This cross-embodiment diversity enables training of RT-X models that generalize to unseen robot morphologies with 50–70% zero-shot success rates on held-out platforms. Each trajectory includes embodiment metadata (URDF, camera extrinsics, action space bounds) to support embodiment-conditioned policy architectures.
The DROID dataset contains 76,000 teleoperated trajectories spanning 86 manipulation tasks across 564 scenes. DROID trajectories average 45 seconds (450 timesteps at 10 Hz) and include failure modes: 18% of trajectories end in task failure, providing negative examples that improve policy robustness by 22% compared to success-only training[5]. The dataset uses a standardized action space (6-DOF end-effector delta + gripper) to enable cross-task transfer.
BridgeData V2 provides 60,000 trajectories for 13 kitchen manipulation tasks, with 4–6 camera viewpoints per trajectory to support multi-view policy architectures. The dataset includes language annotations for 100% of trajectories, enabling vision-language-action models like RT-2 to ground natural language commands in manipulation affordances. Trajectory durations range from 8 seconds (pick-place) to 90 seconds (multi-step assembly), with action frequencies of 5–10 Hz depending on task dynamics.
Trajectory Augmentation and Synthetic Generation
Temporal subsampling creates multiple training examples from a single trajectory. A 300-step trajectory resampled at 5 Hz, 10 Hz, and 20 Hz yields three distinct sequences with different action granularities, improving policy robustness to variable control frequencies by 15–25%. The RT-2 training pipeline applies random temporal crops (selecting 50–200 contiguous steps from longer trajectories) to increase effective dataset size by 4–8× without additional data collection.
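Both augmentations are simple array operations. The sketch below subsamples a 20 Hz trajectory to lower frequencies and takes a random temporal crop; the helper names and bounds mirror the numbers above but are our own illustration:

```python
import numpy as np

def subsample(traj, source_hz, target_hz):
    """Keep every k-th step to emulate a lower control frequency."""
    stride = int(round(source_hz / target_hz))
    return traj[::stride]

def random_crop(traj, min_len=50, max_len=200, rng=None):
    """Select a contiguous window, as in the crop scheme described above."""
    rng = rng or np.random.default_rng()
    length = int(rng.integers(min_len, min(max_len, len(traj)) + 1))
    start = int(rng.integers(0, len(traj) - length + 1))
    return traj[start:start + length]

traj = np.zeros((300, 7), dtype=np.float32)   # 300 steps of 7-D actions
views = [subsample(traj, 20, hz) for hz in (5, 10, 20)]
crop = random_crop(traj)
```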
Goal relabeling reinterprets trajectory segments as demonstrations of alternative tasks. A failed pick-place trajectory that drops the object mid-air can be relabeled as a successful "pick and move to height" demonstration by truncating at the drop timestep and assigning a new goal state. This technique, used in the CALVIN benchmark, increases task coverage from 34 base tasks to 200+ relabeled variants using the same 10,000 source trajectories.
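A sketch of that relabeling operation, assuming an episode stored as a dict of per-step lists with an `ee_pose` field in each observation (both assumptions, not a standard schema):

```python
def relabel_at_drop(episode, drop_t):
    """Truncate a failed pick-place episode at the drop timestep and
    reinterpret it as a successful reach-to-height demonstration."""
    return {
        "observations": episode["observations"][: drop_t + 1],
        "actions": episode["actions"][: drop_t + 1],
        # New goal: the end-effector state just before the object was dropped.
        "goal": episode["observations"][drop_t]["ee_pose"],
        "task": "pick_and_move_to_height",
        "success": True,
    }
```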
Sim-to-real trajectory transfer uses domain randomization to generate synthetic trajectories that transfer to physical robots. The RLBench simulator produces 100,000+ trajectories per task with randomized lighting, textures, and object poses, then filters the top 20% by sim success rate for real-robot fine-tuning. This approach reduces real-world data requirements by 60–80% for tasks with well-defined geometric constraints (peg insertion, stacking) but remains ineffective for contact-rich manipulation (cloth folding, deformable object handling) where sim-to-real gaps exceed 40%[6].
Trajectory Metadata and Provenance Tracking
Embodiment metadata enables cross-platform policy transfer. Each trajectory should record robot URDF, camera intrinsics/extrinsics, action space bounds, and control frequency. The LeRobot dataset format stores this as a per-episode metadata dictionary, allowing policies to condition on embodiment features (gripper width, arm reach, camera FOV) and adapt to morphological differences. Without embodiment metadata, a policy trained on Franka Panda trajectories (0.08m gripper width) fails catastrophically on UR5e platforms (0.14m gripper width) due to action-space mismatch.
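An illustrative per-episode embodiment record; the keys are modeled on the fields listed above rather than on any one library's schema:

```python
# Illustrative per-episode embodiment metadata (keys are assumptions).
embodiment_meta = {
    "robot": "franka_panda",
    "urdf": "franka_panda.urdf",
    "control_hz": 10,
    "action_space": {
        "type": "ee_delta",             # Cartesian end-effector deltas
        "low":  [-0.05] * 6 + [0.0],    # per-dimension bounds
        "high": [0.05] * 6 + [1.0],
    },
    "gripper_width_m": 0.08,
    "cameras": {
        "wrist": {
            "intrinsics": [[615.0, 0.0, 320.0],
                           [0.0, 615.0, 240.0],
                           [0.0, 0.0, 1.0]],
            "extrinsics": "T_base_cam_4x4",  # placeholder for a 4x4 transform
        },
    },
}
```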
Task annotations improve dataset searchability and enable language-conditioned policies. The EPIC-KITCHENS-100 dataset provides verb-noun annotations ("open drawer", "pick cup") for 90,000 egocentric video segments, enabling retrieval of trajectories by natural language query. Open X-Embodiment extends this with free-form language instructions ("place the red block in the top drawer") for 40% of trajectories, supporting instruction-following policy architectures.
Data provenance tracking records collection context: operator identity, collection date, hardware version, and any post-processing applied (trajectory smoothing, action clipping, frame dropping). The truelabel.ai marketplace requires provenance metadata for all listed datasets, enabling buyers to filter by collection method (teleoperation vs. autonomous), operator skill level (novice vs. expert), and data quality metrics (success rate, action smoothness). Provenance gaps reduce dataset commercial value by 30–50% due to buyer uncertainty about training distribution[7].
Trajectory-Based Policy Training Architectures
Behavioral cloning trains policies via supervised learning on (observation, action) pairs extracted from trajectories. The RT-1 (Robotics Transformer) architecture processes trajectory segments of 6 timesteps, using a Transformer encoder to attend over the observation history and predict the next action. RT-1 achieves 97% success on seen tasks and 62% on unseen tasks when trained on 130,000 trajectories from the RT-1 dataset, demonstrating that trajectory quantity and diversity are the primary drivers of generalization.
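At its core, behavioral cloning is a supervised regression loop. The minimal PyTorch sketch below uses a two-layer MLP over 512-dimensional observation features as a stand-in for whatever encoder a real policy uses:

```python
import torch
import torch.nn as nn

# Toy policy head: encoded observation features -> 7-D action.
policy = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 7))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def bc_step(obs_features, expert_actions):
    """One supervised update on (observation, action) pairs.

    obs_features: (B, 512) encoded observations; expert_actions: (B, 7).
    """
    pred = policy(obs_features)
    loss = nn.functional.mse_loss(pred, expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = bc_step(torch.randn(32, 512), torch.randn(32, 7))
```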
Diffusion policies model action sequences as denoising processes, predicting multi-step action chunks (8–16 timesteps) conditioned on current observations. This approach, implemented in the LeRobot Diffusion Policy, improves action smoothness by 35% and reduces task completion time by 18% compared to single-step behavioral cloning, particularly for contact-rich tasks where action sequences exhibit temporal dependencies (wiping, insertion, bimanual coordination).
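Conceptually, inference starts an action chunk from Gaussian noise and iteratively denoises it conditioned on the current observation. The sketch below is a heavily simplified illustration: the linear stand-in network and fixed-step update replace the conditional U-Net and DDPM/DDIM noise schedule a real diffusion policy would use:

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Stand-in for a real conditional noise-prediction network."""
    def __init__(self, chunk_len=16, act_dim=7, obs_dim=512):
        super().__init__()
        self.net = nn.Linear(chunk_len * act_dim + obs_dim + 1,
                             chunk_len * act_dim)
        self.chunk_len, self.act_dim = chunk_len, act_dim

    def forward(self, actions, t, obs_emb):
        x = torch.cat([actions.flatten(1), obs_emb, t[:, None].float()], dim=1)
        return self.net(x).view(-1, self.chunk_len, self.act_dim)

@torch.no_grad()
def denoise_action_chunk(net, obs_emb, chunk_len=16, act_dim=7, num_steps=50):
    """Iteratively denoise an action chunk conditioned on the observation."""
    actions = torch.randn(1, chunk_len, act_dim)   # start from pure noise
    for t in reversed(range(num_steps)):
        noise = net(actions, torch.tensor([t]), obs_emb)
        actions = actions - noise / num_steps       # simplified fixed-step update
    return actions.squeeze(0)

net = NoisePredictor()
chunk = denoise_action_chunk(net, torch.randn(1, 512))  # (16, 7) action chunk
```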
Goal-conditioned policies use trajectory relabeling to learn from suboptimal demonstrations. The CALVIN framework trains policies on 24,000 trajectories with hindsight goal relabeling, achieving 88% success on 34 long-horizon tasks (5+ subtasks) by learning to reach arbitrary intermediate states rather than memorizing fixed task sequences. This approach reduces data requirements by 40–60% for multi-step manipulation compared to task-specific behavioral cloning[8].
Commercial Trajectory Dataset Procurement
Enterprise buyers prioritize trajectory datasets with verified task success rates, embodiment diversity, and clear licensing terms. The Scale AI Physical AI platform offers curated trajectory datasets for 50+ manipulation tasks (pick-place, assembly, bin-picking) with guaranteed 85%+ success rates and per-trajectory quality scores based on action smoothness, observation clarity, and task completion time. Pricing ranges from $2–$8 per trajectory depending on task complexity and sensor modality count.
Custom trajectory collection services address domain-specific requirements. Silicon Valley Robotics Center provides teleoperation data collection for client-specified tasks, robots, and environments, delivering 500–2,000 trajectories per week with same-embodiment consistency (single robot platform, fixed camera setup, controlled lighting). Custom collection costs $50–$200 per trajectory but ensures distribution alignment with deployment conditions, reducing sim-to-real transfer gaps by 60–80% compared to off-the-shelf datasets.
Licensing models vary widely. The RoboNet dataset uses a permissive research-only license prohibiting commercial model training, while BridgeData allows commercial use with attribution. The truelabel.ai marketplace standardizes licensing via per-trajectory commercial-use grants, enabling buyers to train proprietary models on 10,000–100,000 trajectories with clear IP ownership and no downstream restrictions. Licensing ambiguity reduces dataset adoption by 40–60% in enterprise procurement workflows[9].
External references and source context
- [1] RT-1: Robotics Transformer for Real-World Control at Scale (arXiv). Documents the impact of camera-action misalignment on policy performance.
- [2] BridgeData V2: A Dataset for Robot Learning at Scale (arXiv). Reports teleoperation data quality advantages over autonomous rollouts.
- [3] Open X-Embodiment / RT-X project site (robotics-transformer-x.github.io). Reports that trajectory filtering improves policy performance by 12–18%.
- [4] LeRobot: State-of-the-art Machine Learning for Real-World Robotics in PyTorch (arXiv). Reports 3–5× faster data loading with Parquet vs. HDF5 formats.
- [5] DROID project site (droid-dataset.github.io). Reports that failure examples improve policy robustness by 22%.
- [6] Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning (arXiv). Documents 40%+ sim-to-real gaps for contact-rich manipulation tasks.
- [7] truelabel data provenance glossary (truelabel.ai). Provenance gaps reduce dataset commercial value by 30–50% due to buyer uncertainty.
- [8] CALVIN paper (arXiv). CALVIN reduces data requirements by 40–60% for multi-step manipulation tasks.
- [9] truelabel physical AI data marketplace bounty intake (truelabel.ai). The marketplace standardizes licensing via per-trajectory commercial-use grants.
- RT-1: Robotics Transformer for Real-World Control at Scale (arXiv). RT-1 achieves 97% success on seen tasks using 130,000 training trajectories.
FAQ
How many trajectories are needed to train a manipulation policy?
Task-specific policies require 1,000–10,000 trajectories for 80%+ success rates on narrow tasks (single object, fixed scene). The RT-1 model achieves 97% success on 17 kitchen tasks using 130,000 trajectories, while OpenVLA demonstrates 75% zero-shot success on unseen tasks after training on 970,000 trajectories from Open X-Embodiment. Generalist policies targeting 100+ tasks typically require 500,000–1,000,000 trajectories to achieve robust cross-task transfer. Data efficiency improves 3–5× when trajectories include language annotations, embodiment metadata, and failure examples rather than success-only demonstrations.
What is the difference between a trajectory and an episode in robot learning?
The terms are often used interchangeably, but "episode" emphasizes the temporal boundary (start to terminal state) while "trajectory" emphasizes the recorded data structure (sequence of observations and actions). In RLDS format, an episode is the organizational unit containing metadata (task ID, success label, duration) and a trajectory is the time-series data within that episode. Some datasets use "episode" for the full recording (including pre-task setup and post-task reset) and "trajectory" for the task-execution segment only, but this distinction is not standardized across the field.
Can trajectories from different robot platforms be combined for training?
Yes, with embodiment-aware architectures. RT-X trains on 1M+ trajectories from 22 robot types by conditioning the policy on embodiment tokens (robot ID, action space bounds, camera parameters) and using a shared Transformer backbone. Cross-embodiment training improves zero-shot generalization by 50–70% on unseen platforms compared to single-embodiment policies, but requires trajectory datasets to include standardized embodiment metadata. Action space normalization (scaling joint velocities to [-1, 1] based on platform limits) and observation standardization (resizing images to common resolution, normalizing proprioceptive ranges) are critical preprocessing steps that reduce cross-platform performance gaps by 30–40%.
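The normalization step itself is a per-dimension affine map from platform bounds (taken from embodiment metadata) onto [-1, 1], roughly as sketched here:

```python
import numpy as np

def normalize_actions(actions, low, high):
    """Map platform-specific action bounds to [-1, 1].

    actions: (T, D); low/high: per-dimension bounds from embodiment metadata.
    """
    low, high = np.asarray(low), np.asarray(high)
    return 2.0 * (actions - low) / (high - low) - 1.0

# e.g. a platform whose limits differ between arm and gripper dimensions
norm = normalize_actions(np.zeros((100, 7)),
                         low=[-1.0] * 6 + [0.0], high=[1.0] * 6 + [1.0])
```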
How do you handle trajectories with variable length in batch training?
Padding and masking are standard approaches. Trajectories are padded to the maximum length in a batch (typically 100–500 steps) with zero-valued observations and actions, and a binary mask indicates valid timesteps. The policy loss function multiplies per-timestep errors by the mask to ignore padded regions. RLDS loaders support dynamic batching that groups trajectories of similar length (±20% duration) to minimize padding overhead, improving GPU utilization by 15–25%. Alternatively, trajectory chunking splits long episodes into fixed-length segments (e.g., 32 steps) with overlapping windows, converting variable-length episodes into uniform-length training examples at the cost of losing cross-chunk temporal dependencies.
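A sketch of the padding-plus-masking pattern in PyTorch; shapes follow the description above:

```python
import torch

def pad_and_mask(trajs, max_len=None):
    """Pad a list of (T_i, D) tensors to a common length and build a mask.

    Returns padded (B, T, D) actions and a (B, T) mask of valid timesteps.
    """
    max_len = max_len or max(t.shape[0] for t in trajs)
    D = trajs[0].shape[1]
    padded = torch.zeros(len(trajs), max_len, D)
    mask = torch.zeros(len(trajs), max_len)
    for i, t in enumerate(trajs):
        padded[i, : t.shape[0]] = t
        mask[i, : t.shape[0]] = 1.0
    return padded, mask

def masked_mse(pred, target, mask):
    """Per-timestep squared error, zeroed on padded regions."""
    err = ((pred - target) ** 2).mean(dim=-1)   # (B, T)
    return (err * mask).sum() / mask.sum()
```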
What metadata should accompany each trajectory for commercial use?
Essential metadata includes task identifier, success label (binary or continuous reward), robot embodiment (URDF or model name), action space specification (Cartesian vs. joint, position vs. velocity), observation modalities (camera resolutions, sensor types, recording frequencies), collection method (teleoperation, autonomous, scripted), operator skill level, collection date, and any post-processing applied (filtering, resampling, relabeling). The truelabel.ai marketplace requires provenance fields (collector identity, hardware version, calibration status) and licensing terms (commercial-use grant, attribution requirements, derivative work permissions) to enable buyer due diligence. Trajectories lacking embodiment metadata or success labels trade at 40–60% discounts due to reduced training utility.