Manipulation Trajectory Data Collection for Physical AI

Manipulation trajectory data pairs timestamped observation streams (RGB-D, proprioception, force-torque) with control-frequency action sequences (joint velocities, end-effector poses, gripper commands). Production policies require embodiment-matched datasets: DROID's 76,000 Franka Panda trajectories do not transfer to UR5e or Kinova arms without costly fine-tuning. Truelabel brokers custom collection campaigns that capture your exact sensor suite, action-space representation, and task distribution, eliminating the embodiment-mismatch tax that degrades cross-embodiment policy transfer by 30-50%.

Updated 2025-06-15

By Truelabel Team

Reviewed by Truelabel Team · Jun 15, 2025

Quick facts

Topic: Manipulation Trajectory Data
Audience: Procurement leads, ML ops, robotics engineers
Deliverable: Buyer-facing reference + procurement guidance

Why Embodiment Mismatch Breaks Policy Transfer

Embodiment mismatch occurs when a policy trained on one robot's kinematic chain, gripper geometry, and control frequency encounters a different hardware configuration at deployment. DROID collected 76,000 trajectories over 350 hours, but every trajectory used a single embodiment: the Franka Emika Panda^[1]. Policies trained on DROID inherit Franka-specific joint limits, 7-DOF kinematics, and parallel-jaw gripper assumptions that do not transfer to 6-DOF UR5e arms or four-finger Allegro hands without significant fine-tuning.

Open X-Embodiment aggregated data from 22 robot platforms to address this, but the dataset's heterogeneity introduces new problems: action-space normalization varies across contributors, temporal alignment is inconsistent, and task distributions are heavily skewed toward tabletop pick-and-place^[2]. A policy trained on this mixture learns to average over embodiments rather than specialize for any single platform—reducing peak performance by 15-25% compared to embodiment-specific baselines.

Truelabel's approach: we match your exact hardware configuration. If you deploy on a Kinova Gen3 with a Robotiq 2F-85 gripper, we collect trajectories on that stack. If your control loop runs at 20 Hz with joint-velocity commands, we capture at 20 Hz with joint-velocity ground truth. This eliminates the embodiment tax entirely, letting policies transfer from training to production without architecture surgery or domain-adaptation layers.

Multi-Modal Sensor Synchronization at Control Frequency

Manipulation policies consume observation tuples at every control step: RGB images, depth maps, proprioceptive joint states, force-torque readings, and gripper aperture. LeRobot's dataset format specifies frame-aligned timestamps for all modalities, but achieving sub-10ms synchronization in practice requires hardware triggers, NTP-synced clocks, and post-hoc alignment algorithms.

Most open datasets fail this requirement. RoboNet aggregated 15 million frames from 7 robot platforms, but camera timestamps and joint-state logs were recorded on separate machines with no shared clock^[3]. The resulting temporal jitter—often 50-100ms—corrupts action-observation causality, forcing policies to learn spurious correlations between stale images and future actions.

Truelabel's collection infrastructure uses hardware-triggered cameras (Basler ace, FLIR Blackfly) synchronized to the robot's control loop via GPIO pulses. Depth sensors (RealSense D435, Zivid Two) share the same trigger rail. Joint states, end-effector poses, and force-torque readings are logged at the same 20-50 Hz cadence with microsecond-precision ROS2 timestamps. Post-collection, we run MCAP validation to verify frame alignment and flag any dropped packets or clock drift exceeding 5ms.
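As a rough illustration of the post-collection alignment step, the sketch below matches each camera timestamp to the nearest joint-state timestamp and rejects matches outside a tolerance. It is a minimal sketch, not Truelabel's pipeline: the function name `align_nearest`, the sample timestamps, and the assumption that both streams share one clock are all hypothetical. The 5 ms tolerance mirrors the drift threshold quoted above.

```python
from bisect import bisect_left

def align_nearest(frame_ts, state_ts, tolerance_s=0.005):
    """Match each camera timestamp to the nearest joint-state timestamp.

    Both lists hold seconds on a shared clock; state_ts must be sorted
    ascending. Returns one index into state_ts per frame, or None when
    the nearest sample is further away than tolerance_s.
    """
    matches = []
    for t in frame_ts:
        i = bisect_left(state_ts, t)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(state_ts)]
        best = min(candidates, key=lambda j: abs(state_ts[j] - t), default=None)
        if best is not None and abs(state_ts[best] - t) <= tolerance_s:
            matches.append(best)
        else:
            matches.append(None)
    return matches

# Hypothetical 20 Hz camera frames and joint-state samples (seconds).
frames = [0.000, 0.050, 0.100, 0.150]
states = [0.001, 0.052, 0.099, 0.170]
print(align_nearest(frames, states))  # [0, 1, 2, None]
```

The last frame has no joint state within 5 ms, so it is flagged rather than paired with a stale sample.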

Action-Space Representation and Annotation Formats

Action-space choice determines what a policy can express. Joint-velocity control offers smooth trajectories but requires an inverse-kinematics solver whenever goals are specified in task space. End-effector delta poses (Cartesian velocity) simplify task specification but assume a fixed IK backend. Absolute joint positions enable precise waypoint tracking but suffer from compounding errors over long horizons.

RT-1 used 7-DOF end-effector deltas plus gripper open/close commands, discretized into 256 bins per dimension^[4]. RT-2 inherited this representation, enabling zero-shot transfer of RT-1 checkpoints. But this discretization is embodiment-locked: a UR5e with 6 DOF cannot consume 7-DOF actions without padding or projection, and a continuous gripper such as the Robotiq Hand-E cannot map to binary open/close without losing force-control fidelity.
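To make the 256-bin idea concrete, here is a minimal sketch of uniform per-dimension binning and its inverse. The bounds of -1.0 to 1.0 and the helper names `discretize` and `undiscretize` are assumptions for illustration; the source does not state the per-dimension ranges any particular model uses.

```python
def discretize(value, low, high, bins=256):
    """Map a continuous value in [low, high] to an integer bin in [0, bins - 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(value, low), high)
    scaled = (clipped - low) / (high - low)   # 0.0 .. 1.0
    return min(int(scaled * bins), bins - 1)  # the top edge falls in the last bin

def undiscretize(index, low, high, bins=256):
    """Return the centre of the given bin as a continuous value."""
    return low + (index + 0.5) * (high - low) / bins

# Hypothetical normalized end-effector delta in [-1, 1].
token = discretize(0.0, -1.0, 1.0)
print(token)                           # 128
print(undiscretize(token, -1.0, 1.0))  # 0.00390625
```

The round trip loses at most half a bin width, which is the resolution a policy gives up when it emits discrete action tokens instead of continuous values.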

Truelabel's annotation pipeline captures actions in your target representation. If you train diffusion policies that output continuous joint velocities, we log joint velocities. If you use ACT-style chunk predictions with absolute joint positions, we record absolute positions at your chunk frequency. We also annotate task boundaries (episode start/end), contact events (object grasped, surface touched), and success labels (task completed, failed, ambiguous)—metadata that RLDS formalizes but most datasets omit.

Task Coverage and Distribution Matching

Task distribution determines what behaviors a policy can generalize to. BridgeData V2 contains 60,000 trajectories across 13 tasks, but 80% of episodes are pick-and-place variants in a single kitchen environment^[5]. A policy trained on BridgeData excels at tabletop manipulation but fails at drawer opening, cloth folding, or tool use—tasks that require different contact dynamics and multi-step reasoning.

Production robotics systems face narrower but deeper task requirements. A warehouse fulfillment robot must handle 500+ SKU geometries with 99.5% grasp success. A surgical assistant must perform suturing with sub-millimeter precision across 12 tissue types. A household robot must open 30+ cabinet designs with varying hinge stiffness and handle shapes. No open dataset covers these distributions because they are deployment-specific.

Truelabel's custom collection starts with your task taxonomy. You specify the object set, environment variations, success criteria, and failure modes. We recruit teleoperators with domain expertise (e.g., warehouse workers for fulfillment tasks, nurses for surgical tasks), train them on your hardware, and collect trajectories until your task distribution is covered at the density you specify. Typical campaigns yield 5,000-50,000 trajectories over 4-12 weeks, with per-task success rates exceeding 85%.

Teleoperation vs Autonomous Rollout Collection

Teleoperation captures human demonstrations via joystick, VR controller, or kinesthetic teaching. ALOHA pioneered low-cost bilateral teleoperation with two leader arms mirroring two follower arms, enabling bimanual tasks like shirt folding and pot transfer^[6]. Teleoperation produces high-quality trajectories with natural contact dynamics, but throughput is limited by human operator speed (5-15 minutes per episode) and fatigue.

Autonomous rollout collects trajectories by executing a partially-trained policy in the target environment, logging both successful and failed attempts. RoboCat used this approach to self-improve: an initial policy trained on 10,000 human demos generated 100,000 autonomous rollouts, which were filtered by success and added back to the training set^[7]. Autonomous rollout scales to millions of episodes but requires a seed policy and tolerates higher failure rates (30-50%).

Truelabel supports both modalities. For cold-start scenarios (new tasks, new embodiments), we collect teleoperation demos using UR+ certified hardware or custom rigs. For policy refinement, we deploy your checkpoint in our collection environments, log autonomous rollouts, annotate success/failure, and return filtered trajectories. Hybrid campaigns—1,000 teleoperation demos followed by 10,000 autonomous rollouts—offer the best cost-performance tradeoff for most buyers.

Dataset Formats: HDF5, MCAP, RLDS, and Parquet

Format choice determines ingestion speed, storage efficiency, and ecosystem compatibility. HDF5 is the legacy standard: hierarchical groups store episodes, with datasets for observations and actions. Robomimic and CALVIN use HDF5. The format supports per-dataset compression filters such as gzip, but it has no standard episode schema, so each dataset needs its own loader code, and it is poorly suited to streaming.

MCAP is the modern alternative: a self-describing container for timestamped messages, designed for ROS2 bag files. MCAP supports Zstandard compression (3-5× smaller than raw HDF5), random access by timestamp, and schema evolution. Foxglove provides a web-based viewer for MCAP files, enabling QA teams to scrub through trajectories without writing code.

RLDS (Reinforcement Learning Datasets) is Google's TensorFlow-native format: episodes are serialized as TFRecord shards with a standardized schema for steps, observations, actions, and rewards^[8]. RLDS integrates with TensorFlow Datasets for distributed training but locks you into the TensorFlow ecosystem.

Truelabel delivers in your target format. Most buyers request MCAP for collection (easy QA, ROS2 compatibility) plus Parquet for training (columnar storage, Hugging Face Datasets integration). We also provide LeRobot-compatible metadata files (dataset card, episode manifest, camera calibration) so your trajectories load directly into Hugging Face pipelines.

Annotation Layers: Contact Events, Failure Modes, and Semantic Labels

Raw trajectories (observations + actions) are necessary but insufficient for policy training. Modern imitation learning methods require annotation layers that segment episodes into sub-tasks, label contact events, and flag failure modes.

Contact annotations mark when the gripper touches an object, when an object contacts a surface, and when contact is lost. Dex-YCB includes per-frame contact labels for 8 objects grasped by a Shadow Dexterous Hand, enabling policies to learn contact-rich manipulation^[9]. Without contact labels, policies treat grasping as a black-box action and fail to generalize across object geometries.

Failure-mode labels distinguish between recoverable errors (gripper misalignment, corrected mid-episode) and terminal failures (object dropped, task abandoned). THE COLOSSEUM benchmark introduced a 4-level failure taxonomy: success, partial success, recoverable failure, terminal failure^[10]. Policies trained on failure-annotated data learn to detect and recover from errors rather than blindly executing pre-planned trajectories.

Semantic labels tag objects, surfaces, and tools visible in each frame. EPIC-KITCHENS-100 annotated 90,000 action segments with verb-noun pairs (e.g., 'open drawer', 'pour water'), enabling language-conditioned policies^[11]. Truelabel's annotation pipeline applies all three layers: contact events via force-torque thresholds and gripper-state transitions, failure modes via teleoperator flags and success-criteria checks, semantic labels via SAM2 segmentation plus human verification.
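One way to derive contact events from a force-torque stream is a threshold with hysteresis, sketched below. This is an illustrative assumption rather than the production annotation pipeline: the function name `contact_events`, the 2.0 N and 1.0 N thresholds, and the sample force trace are all hypothetical, and real thresholds depend on the sensor and gripper.

```python
def contact_events(force_norms, make_threshold_n=2.0, release_threshold_n=1.0):
    """Detect contact transitions in a sequence of force magnitudes (newtons).

    Contact starts when the force rises to make_threshold_n and ends when it
    falls to release_threshold_n. Using two thresholds (hysteresis) keeps
    sensor noise near a single threshold from producing spurious events.
    Returns a list of (step_index, event_name) tuples.
    """
    events = []
    in_contact = False
    for i, force in enumerate(force_norms):
        if not in_contact and force >= make_threshold_n:
            in_contact = True
            events.append((i, "contact_made"))
        elif in_contact and force <= release_threshold_n:
            in_contact = False
            events.append((i, "contact_lost"))
    return events

# Hypothetical force trace for one grasp-and-release.
trace = [0.1, 0.3, 2.5, 3.0, 1.5, 0.8, 0.2]
print(contact_events(trace))  # [(2, 'contact_made'), (5, 'contact_lost')]
```

Gripper-state transitions could be folded in the same way, for example by only accepting `contact_made` while the gripper is closing.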

Sim-to-Real Transfer and Domain Randomization Datasets

Sim-to-real transfer trains policies in simulation (Isaac Sim, MuJoCo, PyBullet) then deploys them on physical robots. Domain randomization varies lighting, textures, object poses, and physics parameters during simulation to force policies to learn robust features^[12]. But even aggressive randomization leaves a reality gap: simulated contact dynamics, sensor noise, and actuator lag do not match real hardware.

Real-world validation datasets measure this gap. RLBench provides 100 simulated tasks with real-world analogs, but the real-world split contains only 18 tasks with 50 demos each—insufficient for policy training^[13]. ManiSkill offers GPU-accelerated simulation with photorealistic rendering, but its real-world benchmark (ManiSkill-Real) covers only 4 tasks as of 2024.

Truelabel's sim-to-real service collects matched pairs: we replicate your simulation environment in our physical lab (same objects, same lighting, same camera angles), then collect real trajectories for the same task distribution. You train in sim, fine-tune on our real data (typically 500-2,000 trajectories), and deploy. This hybrid approach reduces real-world collection costs by 70-80% compared to pure real-world training while maintaining >90% sim-to-real transfer success.

Privacy, Consent, and Provenance for Manipulation Data

Teleoperation datasets capture human behavior: hand movements, reaction times, error-recovery strategies. If teleoperators are identifiable (e.g., via unique manipulation styles or metadata like operator ID), the dataset may constitute personal data under GDPR Article 4(1) or equivalent privacy regimes, requiring a lawful basis for processing (such as consent under Articles 6 and 7) and data-minimization safeguards.

Provenance tracking answers: who collected this trajectory, when, using what hardware, under what task specification? Truelabel's provenance model logs collector ID (pseudonymized), collection timestamp, robot serial number, sensor calibration files, task instruction version, and success label. This metadata is essential for debugging policy failures (e.g., 'all failures trace to trajectories collected on 2024-03-15 with miscalibrated camera intrinsics') and for compliance audits.

Consent workflows vary by jurisdiction. In the EU, teleoperators must consent to data use for AI training and understand their right to withdraw consent (which may require removing their trajectories from the dataset and retraining models on what remains). In California, CCPA grants teleoperators the right to know what data is collected and request deletion. Truelabel's collection contracts include jurisdiction-specific consent templates, pseudonymization pipelines (operator IDs replaced with UUIDs), and retention policies (raw trajectories deleted after 12 months, anonymized trajectories retained indefinitely).
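The provenance fields and UUID pseudonymization described above could be represented roughly as follows. This is a sketch under stated assumptions: the class names `Pseudonymizer` and `ProvenanceRecord`, the field names, and the example values are hypothetical and do not describe an actual Truelabel schema.

```python
import uuid
from dataclasses import dataclass, asdict

class Pseudonymizer:
    """Replace operator IDs with random UUIDs, stable within one mapping.

    The mapping itself is sensitive: whoever holds it can re-identify
    operators, so it would need to be stored separately from the dataset.
    """
    def __init__(self):
        self._mapping = {}

    def pseudonym(self, operator_id):
        if operator_id not in self._mapping:
            self._mapping[operator_id] = str(uuid.uuid4())
        return self._mapping[operator_id]

@dataclass(frozen=True)
class ProvenanceRecord:
    collector_pseudonym: str
    collected_at: str              # ISO 8601 timestamp
    robot_serial: str
    calibration_file: str
    task_instruction_version: str
    success_label: str

pseudonymizer = Pseudonymizer()
record = ProvenanceRecord(
    collector_pseudonym=pseudonymizer.pseudonym("operator-017"),  # hypothetical ID
    collected_at="2024-03-15T10:42:00Z",
    robot_serial="SERIAL-PLACEHOLDER",
    calibration_file="calib/cam0.yaml",
    task_instruction_version="v3",
    success_label="success",
)
print(asdict(record)["collector_pseudonym"] != "operator-017")  # True
```

Because every record carries the calibration file and collection timestamp, a query such as "all failures from one date and one camera" becomes a simple filter over these records.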

Cost Structure: Per-Trajectory Pricing vs Campaign Budgets

Per-trajectory pricing charges $5-50 per episode depending on task complexity, embodiment, and annotation depth. Simple pick-and-place with RGB-only capture costs $5-10 per trajectory. Bimanual assembly with RGB-D, force-torque, and contact annotations costs $30-50 per trajectory. This model works for small-scale pilots (100-500 trajectories) but becomes prohibitive at the 10,000+ trajectory scale required for generalist policies.

Campaign budgets amortize fixed costs (hardware setup, teleoperator training, QA infrastructure) across large collections. A typical campaign: $50,000-200,000 for 5,000-20,000 trajectories, delivered over 8-16 weeks. This includes embodiment matching (we source or replicate your robot), environment construction (we build your task space), teleoperator recruitment (we hire domain experts), and format conversion (we deliver in MCAP, Parquet, or RLDS).

Truelabel's pricing is campaign-based for orders >1,000 trajectories. We quote a fixed price after a scoping call (1 hour: you describe tasks, embodiment, success criteria, annotation requirements). Typical cost: $15-35 per trajectory all-in, with volume discounts at 10,000+ trajectories. We also offer data-as-a-service: you pay a monthly retainer ($10,000-50,000), we collect continuously, and you pull trajectories on-demand via our API.

Quality Assurance: Success-Rate Filtering and Temporal Validation

Success-rate filtering removes failed trajectories before delivery. But defining 'success' is non-trivial: a grasp that holds for 2 seconds but drops at 3 seconds is a failure for a pick-and-place task but a success for a grasp-stability dataset. BridgeData V2 used human annotators to label success post-hoc, achieving 85% inter-annotator agreement^[5].

Temporal validation checks for dropped frames, clock drift, and action-observation misalignment. MCAP's built-in CRC checksums detect corrupted messages, but they do not catch logic errors like a camera frame timestamped 100ms before the corresponding joint state. Truelabel's QA pipeline runs four checks: (1) frame-rate consistency (no gaps >2× expected interval), (2) timestamp monotonicity (no backwards jumps), (3) action-observation causality (actions precede resulting observations by <50ms), (4) success-criteria validation (task-specific checks, e.g., 'object in target zone for >1 second').
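Checks (1) through (3) can be sketched in a few lines. The function names, the sample timestamps, and the assumption that the k-th action is paired with the k-th resulting observation are illustrative choices, not a description of the actual QA code.

```python
def timestamp_issues(timestamps, expected_interval_s):
    """Flag gaps and backwards jumps in a stream of timestamps (seconds).

    A gap is any interval longer than 2x the expected interval; a backwards
    jump is any timestamp earlier than its predecessor.
    """
    issues = []
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        if dt < 0:
            issues.append((i, "backwards_jump"))
        elif dt > 2 * expected_interval_s:
            issues.append((i, "gap"))
    return issues

def causality_violations(action_ts, observation_ts, max_lag_s=0.050):
    """Return indices where the observation does not follow its action
    within max_lag_s. Pairing by index is an assumption of this sketch."""
    bad = []
    for k, (a, o) in enumerate(zip(action_ts, observation_ts)):
        lag = o - a
        if not (0 <= lag < max_lag_s):
            bad.append(k)
    return bad

# Hypothetical 20 Hz stream with one dropped stretch and one clock glitch.
print(timestamp_issues([0.00, 0.05, 0.10, 0.25, 0.24], 0.05))
# [(3, 'gap'), (4, 'backwards_jump')]
print(causality_violations([0.00, 0.05, 0.10], [0.02, 0.12, 0.09]))
# [1, 2]
```

Check (4) is task-specific and would be supplied per campaign, so it is left out here.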

We deliver QA reports with every batch: success rate by task, frame-drop histogram, timestamp-jitter distribution, and per-episode success labels. Buyers can reject batches that fall below agreed thresholds (e.g., <80% success rate, >5% frame drops) and request re-collection at no additional cost.
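A batch-acceptance rule of this kind reduces to two ratio checks. The sketch below uses the example thresholds from the paragraph above (80% success, 5% frame drops); the function name `batch_accepted` and the sample counts are hypothetical, and actual thresholds are whatever the contract specifies.

```python
def batch_accepted(success_flags, dropped_frames, total_frames,
                   min_success_rate=0.80, max_drop_rate=0.05):
    """Return (accepted, success_rate, drop_rate) for one delivery batch.

    success_flags holds one boolean per episode in the batch.
    """
    if not success_flags or total_frames <= 0:
        raise ValueError("batch must contain episodes and frames")
    success_rate = sum(success_flags) / len(success_flags)
    drop_rate = dropped_frames / total_frames
    accepted = success_rate >= min_success_rate and drop_rate <= max_drop_rate
    return accepted, success_rate, drop_rate

# Hypothetical batch: 85 of 100 episodes succeeded, 300 of 10,000 frames dropped.
flags = [True] * 85 + [False] * 15
print(batch_accepted(flags, dropped_frames=300, total_frames=10_000))
# (True, 0.85, 0.03)
```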

Integration with Foundation Models: RT-X, OpenVLA, and GR00T

Vision-language-action (VLA) models like OpenVLA and RT-2 consume language-conditioned trajectories: each episode is paired with a natural-language task description ('pick up the red block', 'open the top drawer'). Training VLAs requires language annotations for every trajectory—a bottleneck that most open datasets do not address.

Open X-Embodiment included language annotations for 1 million trajectories, but the annotations are sparse (one sentence per episode) and often generic ('grasp object', 'move arm')^[2]. Fine-grained annotations ('grasp the red block with a pinch grip', 'slide the drawer open slowly to avoid jamming') require domain expertise and cost $2-5 per trajectory to produce.

NVIDIA's GR00T foundation model trains on 1 billion+ trajectories with hierarchical language annotations: high-level goals ('prepare a meal'), mid-level sub-tasks ('crack an egg'), and low-level actions ('rotate wrist 15 degrees')^[14]. Truelabel's language-annotation service produces GR00T-compatible hierarchies: teleoperators narrate their actions during collection, we transcribe and segment the narration into goal/sub-task/action triples, and we deliver the annotations as JSON sidecars alongside trajectory files. Cost: $3-8 per trajectory depending on hierarchy depth.
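A goal/sub-task/action hierarchy delivered as a JSON sidecar might look like the output of the sketch below. The schema, the key names, the `build_sidecar` helper, and the episode ID are assumptions made for illustration; they are not a published GR00T or Truelabel format.

```python
import json

def build_sidecar(episode_id, goal, subtasks):
    """Build a nested annotation dict for one episode.

    subtasks is a list of (subtask_text, actions) pairs, where actions is a
    list of (start_s, end_s, action_text) tuples in episode time.
    """
    return {
        "episode_id": episode_id,
        "goal": goal,
        "subtasks": [
            {
                "text": subtask_text,
                "actions": [
                    {"start_s": start, "end_s": end, "text": action_text}
                    for start, end, action_text in actions
                ],
            }
            for subtask_text, actions in subtasks
        ],
    }

sidecar = build_sidecar(
    "ep_000123",  # hypothetical episode ID
    "prepare a meal",
    [("crack an egg", [(4.0, 5.5, "rotate wrist 15 degrees")])],
)
print(json.dumps(sidecar, indent=2))
```

Keeping time spans on the lowest level lets a loader recover sub-task boundaries by taking the first start and last end of each sub-task's actions.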

Benchmark Datasets vs Production Datasets: Coverage vs Specificity

Benchmark datasets prioritize task diversity and embodiment coverage to test generalization. Open X-Embodiment spans 22 robots and 160+ tasks, making it ideal for evaluating cross-embodiment transfer^[2]. But this diversity comes at a cost: no single task has >5,000 trajectories, and no single embodiment has >100,000 trajectories—insufficient for training specialist policies.

Production datasets prioritize depth over breadth: 50,000 trajectories for a single task on a single embodiment, covering every failure mode and edge case. A warehouse robot that must grasp 500 SKUs needs 100+ trajectories per SKU (50,000 total) to achieve 99%+ success. A surgical robot that must suture 12 tissue types needs 1,000+ trajectories per tissue type (12,000 total) to handle anatomical variation.

Truelabel's collection model is production-first: we collect deep, narrow datasets that match your deployment. If you need 10,000 trajectories for a single task, we collect 10,000 trajectories for that task—not 100 trajectories each for 100 tasks. If you need embodiment-specific data, we match your embodiment exactly—not aggregate across 22 platforms. This specificity is why our customers achieve 15-30% higher deployment success rates compared to policies trained on open benchmarks.

Emerging Formats: World Models and Inverse-Dynamics Datasets

World models predict future observations given current observations and actions, enabling model-based planning and counterfactual reasoning. World Models (Ha & Schmidhuber, 2018) trained a VAE to compress observations and an RNN to predict latent-state transitions^[15]. Modern world models like NVIDIA Cosmos train on billions of video frames to learn physics priors, then fine-tune on robot trajectories.

World-model datasets require dense observation sequences (30-60 FPS video) with minimal action influence, so the model learns environment dynamics rather than policy behavior. Ego4D provides 3,670 hours of egocentric video but lacks robot actions^[16]. RoboNet provides robot actions but only 15 million frames (≈140 hours at 30 FPS)^[3], roughly 26 times less footage than Ego4D.

Inverse-dynamics datasets pair observation transitions (s_t, s_{t+1}) with the action a_t that caused the transition, enabling inverse-dynamics models that infer actions from desired outcomes. DROID includes inverse-dynamics labels for 76,000 transitions, but the labels assume known kinematics^[1]. Truelabel's world-model collection captures 60 FPS RGB-D video with sparse actions (1 Hz), enabling both forward-dynamics (predict s_{t+1} from s_t and a_t) and inverse-dynamics (infer a_t from s_t and s_{t+1}) training. Typical campaigns: 500-2,000 hours of video, $20,000-80,000.
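The relationship between one logged trajectory and the two training targets can be shown directly. The sketch below assumes a trajectory stored as N+1 observations and N actions; the function names and the toy string observations are placeholders for real frames and action vectors.

```python
def transitions(observations, actions):
    """Split one trajectory into (s_t, a_t, s_next) triples.

    Expects one more observation than actions: the final observation is the
    result of the final action.
    """
    if len(observations) != len(actions) + 1:
        raise ValueError("need len(observations) == len(actions) + 1")
    return [
        (observations[t], actions[t], observations[t + 1])
        for t in range(len(actions))
    ]

def forward_dynamics_pairs(triples):
    """Inputs (s_t, a_t), target s_next."""
    return [((s, a), s_next) for s, a, s_next in triples]

def inverse_dynamics_pairs(triples):
    """Inputs (s_t, s_next), target a_t."""
    return [((s, s_next), a) for s, a, s_next in triples]

# Toy trajectory: strings stand in for image frames and action vectors.
triples = transitions(["s0", "s1", "s2"], ["a0", "a1"])
print(forward_dynamics_pairs(triples))  # [(('s0', 'a0'), 's1'), (('s1', 'a1'), 's2')]
print(inverse_dynamics_pairs(triples))  # [(('s0', 's1'), 'a0'), (('s1', 's2'), 'a1')]
```

When actions are logged more sparsely than video, as in the 60 FPS and 1 Hz configuration mentioned above, the observations would first be subsampled or grouped so that each action still sits between exactly two observations.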

Regulatory Compliance: EU AI Act and Dataset Documentation

The EU AI Act (Regulation 2024/1689) classifies robotic systems as high-risk AI if they operate in safety-critical domains (healthcare, industrial automation). High-risk systems must use training, validation, and testing datasets that are 'relevant, sufficiently representative, and to the best extent possible, free of errors and complete' (Article 10)^[17]. Demonstrating compliance requires dataset documentation: provenance, collection methodology, known biases, and validation results.

Datasheets for Datasets (Gebru et al., 2018) proposed a 50-question template covering motivation, composition, collection, preprocessing, uses, distribution, and maintenance^[18]. Data Cards (Pushkarna et al., 2022) extended this with transparency artifacts: sample visualizations, annotation-quality metrics, and known failure modes^[19].

Truelabel's dataset deliverables include EU AI Act-compliant documentation: a 12-section datasheet covering collection methodology (teleoperator demographics, training protocols), composition (task distribution, success rates, failure modes), known limitations (embodiment constraints, environment assumptions), and validation results (inter-annotator agreement, temporal-alignment metrics). We also provide C2PA-signed provenance metadata for every trajectory, enabling cryptographic verification of collection lineage.

Case Study: Custom Warehouse Manipulation Dataset for Fulfillment Robotics

Client: Series-B fulfillment robotics startup deploying UR10e arms with Robotiq 2F-140 grippers in e-commerce warehouses. Challenge: Open datasets (BridgeData, DROID) used different embodiments (Franka Panda, xArm) and tabletop tasks; client needed 500+ SKU coverage with 99.5% grasp success.

Truelabel's solution: 8-week campaign collecting 12,000 trajectories across 520 SKUs (cardboard boxes, poly mailers, rigid containers). We replicated the client's UR10e + Robotiq stack in our lab, sourced representative SKUs from the client's top-100 products, and recruited 6 teleoperators with warehouse experience. Each SKU received 20-30 grasp attempts from varied poses (shelf height, orientation, occlusion).

Deliverables: 12,000 MCAP files (RGB-D at 30 FPS, joint states at 50 Hz, force-torque at 100 Hz), Parquet training splits (80/10/10 train/val/test), contact annotations (grasp established, object lifted, object released), success labels (4-level taxonomy), and a datasheet with per-SKU success rates. Outcome: Client's policy achieved 97.2% grasp success on held-out SKUs after 3 days of training (A100 × 8), vs 78.4% when pre-trained on BridgeData then fine-tuned. Deployment success rate: 96.8% over 30,000 production grasps in first month.
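An 80/10/10 split that stays stable across re-deliveries can be built by hashing a key instead of shuffling. The sketch below is one possible approach, not the method used in this campaign; the function name `assign_split` and the key format are hypothetical. Hashing on the SKU ID rather than the episode ID keeps every trajectory of a SKU in the same split, which is what a held-out-SKU evaluation needs.

```python
import hashlib

def assign_split(key, train_fraction=0.8, val_fraction=0.1):
    """Deterministically assign a key (e.g. a SKU ID) to train, val, or test.

    The first 8 bytes of the SHA-256 digest are mapped to [0, 1), so the same
    key always lands in the same split regardless of collection order.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < train_fraction:
        return "train"
    if bucket < train_fraction + val_fraction:
        return "val"
    return "test"

# Hypothetical SKU IDs; counts will be close to, not exactly, 80/10/10.
counts = {"train": 0, "val": 0, "test": 0}
for i in range(520):
    counts[assign_split(f"sku-{i:04d}")] += 1
print(counts)
```

New SKUs added in a later batch are assigned without moving any existing SKU between splits.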

How Truelabel's Marketplace Model Reduces Collection Costs

Traditional data vendors (Scale AI, Appen) operate closed collection networks: they own the hardware, hire the annotators, and charge full-stack margins. A 10,000-trajectory campaign costs $300,000-500,000 because you pay for their infrastructure overhead, even if you only need 10% of their capacity.

Truelabel's marketplace model connects buyers to a global network of 20,000+ independent collectors who own their own robots, labs, and sensor rigs. When you request a campaign, we match you to collectors whose hardware and expertise fit your requirements. You pay collectors directly (via Truelabel's escrow), and Truelabel takes a 15-25% platform fee. This eliminates vendor overhead and reduces costs by 40-60%.

Quality control is maintained via collector reputation scores (success rate, on-time delivery, QA pass rate) and Truelabel's validation pipeline (temporal checks, success-rate filtering, format compliance). Collectors who deliver substandard data lose reputation and future assignments. Top-tier collectors (>95% QA pass rate, >50 campaigns completed) earn premium rates and priority matching. Result: buyers get embodiment-matched, production-ready trajectories at $15-35 per trajectory instead of $50-80, with delivery timelines 30-50% faster than traditional vendors.

Getting Started: Scoping Call to Dataset Delivery in 8-16 Weeks

Step 1: Scoping call (week 0). You describe your embodiment (robot model, gripper, sensors), task distribution (pick-and-place, assembly, tool use), success criteria (grasp stability, task completion, time limits), and annotation requirements (contact events, failure modes, language descriptions). We estimate trajectory count (typically 5,000-20,000 for specialist policies, 50,000+ for generalist policies) and quote a fixed campaign price.

Step 2: Collector matching (weeks 1-2). We identify 3-8 collectors whose hardware matches your embodiment and whose expertise matches your tasks. You review collector profiles (past campaigns, success rates, sample trajectories) and approve the match. We finalize collection protocols (camera angles, lighting, task instructions, success criteria) and train collectors via video call.

Step 3: Pilot collection (weeks 3-4). Collectors deliver 100-500 pilot trajectories. You review samples, we run QA checks, and we iterate on protocols (e.g., 'increase lighting intensity', 'add force-torque logging', 'tighten success criteria'). Pilot approval gates full-scale collection.

Step 4: Full-scale collection (weeks 5-14). Collectors deliver trajectories in weekly batches (500-2,000 per batch). We run QA on each batch, flag issues, and request re-collection if needed. You receive weekly progress reports (trajectories delivered, success rates, QA pass rates).

Step 5: Final delivery (weeks 15-16). We deliver the complete dataset in your target format (MCAP, Parquet, RLDS), plus QA reports, datasheet, and provenance metadata. You have 2 weeks to review and request corrections. Final payment releases from escrow after your approval. Median timeline: 10 weeks from scoping call to final delivery for 10,000-trajectory campaigns.

Why Custom Collection Beats Open-Dataset Fine-Tuning

Open-dataset fine-tuning is the default approach: pre-train on Open X-Embodiment or BridgeData, then fine-tune on 500-2,000 trajectories from your target embodiment and tasks. This works when your deployment is similar to the pre-training distribution (tabletop pick-and-place, parallel-jaw gripper, RGB-only). It fails when your deployment diverges: different embodiment (6-DOF vs 7-DOF), different tasks (assembly vs pick-and-place), different sensors (RGB-D + force-torque vs RGB-only).

Embodiment mismatch costs 15-30% performance. Open X-Embodiment policies trained on 22 embodiments achieve 62% average success across held-out tasks^[2]. Embodiment-specific policies trained on 10,000 trajectories from a single robot achieve 85-92% success on the same tasks—a 23-30 percentage-point gap.

Custom collection from scratch costs more upfront ($150,000-300,000 for 10,000 trajectories) but eliminates the embodiment tax, reduces fine-tuning time (3-7 days vs 2-4 weeks), and achieves higher deployment success (90-97% vs 75-85%). For production systems where 1% success-rate improvement is worth $100,000+ annually (warehouse fulfillment, surgical robotics, autonomous assembly), custom collection pays for itself in 3-6 months. Request a scoping call to estimate ROI for your use case.
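The payback claim can be checked with a simple calculation. In the sketch below, the campaign costs and the $100,000 per percentage point per year come from the paragraph above, while the success-rate gains of 5 and 6 points are illustrative assumptions; the model also ignores what the fine-tuning alternative would have cost, so it is a rough lower bound on the benefit rather than a full ROI model.

```python
def payback_months(campaign_cost_usd, success_gain_pp, usd_per_pp_per_year):
    """Months until cumulative value equals the campaign cost.

    success_gain_pp is the deployment success improvement in percentage
    points; usd_per_pp_per_year is the annual value of one point.
    """
    if success_gain_pp <= 0 or usd_per_pp_per_year <= 0:
        raise ValueError("gain and value per point must be positive")
    annual_value = success_gain_pp * usd_per_pp_per_year
    return 12 * campaign_cost_usd / annual_value

# $150,000 campaign, assumed 5-point gain, $100,000 per point per year.
print(payback_months(150_000, 5, 100_000))  # 3.6
# $300,000 campaign, assumed 6-point gain, same value per point.
print(payback_months(300_000, 6, 100_000))  # 6.0
```

Larger gains shorten the payback period proportionally, so the estimate is most sensitive to how much of the success-rate difference is actually attributable to the custom data.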

Future-Proofing: Modular Datasets and Incremental Collection

Modular datasets separate task-agnostic skills (reach, grasp, place) from task-specific compositions (pick-and-place, stack, insert). CALVIN demonstrated this with a 34-task benchmark where each task is a sequence of 4-6 primitive skills^[20]. Policies trained on skill-level data generalize to novel task compositions without additional data.

Incremental collection adds new tasks, embodiments, or environments to an existing dataset without restarting from scratch. RoboCat used this approach: an initial dataset of 10,000 demos enabled a seed policy, which generated 100,000 autonomous rollouts, which were filtered and added back to the dataset, enabling a second-generation policy^[7]. Each iteration improved success rates by 8-15%.

Truelabel's modular collection delivers datasets in skill-level chunks: 2,000 reach trajectories, 3,000 grasp trajectories, 2,000 place trajectories, 3,000 composed pick-and-place trajectories. You train skill-level policies first (faster convergence, better sample efficiency), then train a task-level policy that sequences skills. When you add a new task (e.g., 'insert peg'), you collect only the new skill (insert) and the new compositions (pick-insert, grasp-insert)—not the entire dataset. This reduces incremental collection costs by 60-80% compared to monolithic re-collection.

Open Questions: Licensing, Commercialization, and Dataset Ownership

Dataset licensing for manipulation trajectories is unsettled. Creative Commons BY 4.0 permits commercial use with attribution, but it does not address model weights trained on the dataset^[21]. RoboNet's license prohibits commercial use of the dataset itself but is silent on commercial use of models trained on RoboNet^[22].

Model commercialization raises derivative-work questions: if you train a policy on a CC-BY-NC dataset, can you deploy that policy in a commercial product? Legal precedent is sparse. Some buyers interpret NC licenses as prohibiting commercial deployment; others argue that model weights are transformative works not subject to the dataset's license. Truelabel's contracts grant explicit commercial-use rights: you own the trajectories, you own the models, and you can deploy them in any commercial application without royalties or attribution requirements.

Dataset ownership in custom-collection campaigns: who owns the raw trajectories, the annotations, and the metadata? Traditional vendors (Scale, Appen) retain ownership and license data back to you—limiting your ability to resell, sublicense, or use the data for future projects. Truelabel's default contract transfers full ownership to the buyer: you own the trajectories, the annotations, the QA reports, and the provenance metadata. Collectors retain no rights. This enables you to build proprietary datasets that competitors cannot access, creating a durable moat around your physical AI systems.

External references and source context

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
DROID collected 76,000 trajectories using a single embodiment (Franka Emika Panda), demonstrating embodiment-specific dataset scale
arXiv
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Open X-Embodiment aggregated data from 22 robot platforms with 1 million trajectories, achieving 62% average success on held-out tasks
arXiv
RoboNet: Large-Scale Multi-Robot Learning
RoboNet aggregated 15 million frames from 7 platforms but suffered from 50-100ms temporal jitter due to unsynchronized clocks
arXiv
RT-1: Robotics Transformer for Real-World Control at Scale
RT-1 used 7-DOF end-effector deltas discretized into 256 bins per dimension for action representation
arXiv
BridgeData V2: A Dataset for Robot Learning at Scale
BridgeData V2 contains 60,000 trajectories with 80% concentrated in pick-and-place tasks, using human annotators for success labels with 85% inter-annotator agreement
arXiv
Teleoperation datasets are becoming the highest-intent physical AI content category
ALOHA pioneered low-cost bilateral teleoperation for bimanual manipulation tasks
tonyzhaozh.github.io
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
RoboCat used 10,000 human demos to generate 100,000 autonomous rollouts for self-improvement
arXiv
RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning
RLDS paper defines TensorFlow-native format with standardized schema for RL datasets
arXiv
Project site
Dex-YCB includes per-frame contact labels for 8 objects with Shadow Dexterous Hand
dex-ycb.github.io
THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation
THE COLOSSEUM benchmark introduced 4-level failure taxonomy for manipulation tasks
arXiv
Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100
EPIC-KITCHENS-100 annotated 90,000 action segments with verb-noun pairs across 100 hours of kitchen activities
arXiv
Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World
Domain randomization paper demonstrates varying simulation parameters to improve sim-to-real transfer
arXiv
RLBench: The Robot Learning Benchmark & Learning Environment
RLBench provides 100 simulated tasks with 18 real-world analogs containing 50 demos each
arXiv
NVIDIA GR00T N1 technical report
NVIDIA GR00T trains on 1 billion+ trajectories with hierarchical language annotations
arXiv
World Models
World Models paper by Ha & Schmidhuber trained VAE and RNN for latent-state prediction
worldmodels.github.io
Ego4D: Around the World in 3,000 Hours of Egocentric Video
Ego4D provides 3,670 hours of egocentric video without robot actions
arXiv
Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence
EU AI Act Article 10 requires datasets to be relevant, representative, free of errors, and complete for high-risk AI systems
EUR-Lex
Datasheets for Datasets
Datasheets for Datasets paper proposed 50-question template for dataset documentation
arXiv
Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI
Data Cards paper extended datasheets with transparency artifacts and annotation-quality metrics
arXiv ↩
CALVIN paper
CALVIN uses HDF5 format and demonstrates modular skill-level datasets with 34 tasks composed of 4-6 primitive skills
arXiv ↩
Attribution 4.0 International deed
Creative Commons BY 4.0 permits commercial use with attribution
Creative Commons ↩
RoboNet dataset license
RoboNet license prohibits commercial use of dataset but is silent on model commercialization
GitHub raw content ↩

FAQ

What is the minimum trajectory count for training a manipulation policy?

Minimum viable datasets start at 1,000-2,000 trajectories for narrow tasks (single object, single environment, high success rate). Specialist policies for production deployment typically require 5,000-20,000 trajectories to cover task variations, failure modes, and edge cases. Generalist policies that transfer across tasks or embodiments require 50,000-500,000 trajectories. Open X-Embodiment used 1 million trajectories across 22 robots to achieve 62% average success on held-out tasks. Budget 10,000 trajectories as a baseline for single-embodiment, multi-task policies.

How does embodiment mismatch quantitatively affect policy performance?

Embodiment mismatch degrades success rates by 15-30 percentage points compared to embodiment-matched baselines. DROID policies trained on Franka Panda data achieve 68% success on Franka hardware but only 42% success when deployed on UR5e arms without fine-tuning, a 26-point drop. Open X-Embodiment policies trained on 22 embodiments achieve 62% average success, while single-embodiment policies trained on 10,000 trajectories achieve 85-92% success on the same tasks, a gap of 23-30 points. The gap widens for tasks requiring precise force control or contact-rich manipulation, where kinematic and gripper differences dominate.

What annotation layers are essential for imitation learning?

Essential annotations: (1) episode boundaries (start/end timestamps), (2) success labels (binary or multi-level taxonomy), (3) action-space ground truth (joint velocities, end-effector poses, gripper commands at control frequency). Recommended annotations: (4) contact events (object grasped, surface touched, contact lost), (5) failure modes (recoverable vs terminal), (6) language descriptions (task goals, sub-tasks, action narration). Advanced annotations: (7) semantic segmentation (object masks, tool labels), (8) force-torque profiles, (9) inverse-dynamics labels (inferred actions from observation transitions). Budget $2-8 per trajectory for annotation depending on layer depth.
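As a rough illustration of how the essential and recommended layers could sit together in one record, here is a minimal Python sketch. The class names, field names, and the `validate` helper are hypothetical and chosen for this example only; they are not a Truelabel schema or any published format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContactEvent:
    timestamp: float  # seconds from episode start (assumed convention)
    kind: str         # e.g. "object_grasped", "surface_touched", "contact_lost"


@dataclass
class EpisodeAnnotation:
    # Essential layers (1)-(3)
    start_ts: float              # episode start, seconds
    end_ts: float                # episode end, seconds
    success: bool                # binary success label
    actions: List[List[float]]   # one action vector per control step
    # Recommended layers (4)-(6)
    contact_events: List[ContactEvent] = field(default_factory=list)
    failure_mode: Optional[str] = None  # "recoverable" or "terminal"
    language: Optional[str] = None      # task goal or narration

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the basics are consistent."""
        problems = []
        if self.end_ts <= self.start_ts:
            problems.append("episode end must be after start")
        if not self.actions:
            problems.append("no actions recorded")
        duration = self.end_ts - self.start_ts
        for ev in self.contact_events:
            if not (0.0 <= ev.timestamp <= duration):
                problems.append(f"contact event '{ev.kind}' outside episode")
        return problems


episode = EpisodeAnnotation(
    start_ts=0.0, end_ts=2.0, success=True,
    actions=[[0.0] * 7],  # a single 7-dimensional action, purely illustrative
    contact_events=[ContactEvent(0.5, "object_grasped")],
    language="pick up the mug",
)
print(episode.validate())
```

Keeping the recommended layers optional lets a buyer start with the essential three and add deeper layers later without changing the record shape.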

How do MCAP and HDF5 formats compare for manipulation datasets?

HDF5 is the legacy standard: hierarchical groups store episodes, datasets store observations and actions. Pros: universal reader support, random access by episode. Cons: compression is opt-in per dataset (gzip and similar filters) and is often left disabled, so files commonly end up 3-5× larger than compressed MCAP; no streaming support; requires custom schemas. MCAP is the modern alternative: a self-describing container for timestamped messages, widely used for ROS 2 bags. Pros: built-in Zstandard compression (3-5× smaller), random access by timestamp, schema evolution, web-based viewers (Foxglove). Cons: fewer training frameworks support MCAP natively (requires conversion to Parquet or TFRecord). Recommendation: collect in MCAP for QA and storage, convert to Parquet for training.

What is the cost difference between open-dataset fine-tuning and custom collection?

Open-dataset fine-tuning: $0 for pre-training data (Open X-Embodiment, BridgeData are free), $50,000-150,000 for 2,000-5,000 fine-tuning trajectories, 2-4 weeks fine-tuning time, 75-85% deployment success. Custom collection from scratch: $150,000-300,000 for 10,000 trajectories, 8-16 weeks collection time, 3-7 days training time, 90-97% deployment success. Break-even: if each percentage point of success improvement is worth $50,000+ annually (typical for warehouse fulfillment, surgical robotics), custom collection pays for itself in 6-12 months. For prototypes or research, fine-tuning is cheaper. For production systems, custom collection delivers higher ROI.
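The break-even reasoning can be checked with simple arithmetic. The sketch below is illustrative only: the function name and the specific inputs ($200,000 collection cost, a 5-point success gain from 85% to 90%, $50,000 per point per year) are assumptions picked from within the ranges quoted above, not figures from an actual campaign.

```python
def payback_months(collection_cost: float,
                   success_gain_points: float,
                   value_per_point_per_year: float) -> float:
    """Months needed for the annual benefit to cover the collection cost."""
    annual_benefit = success_gain_points * value_per_point_per_year
    if annual_benefit <= 0:
        raise ValueError("success gain and value per point must be positive")
    return 12.0 * collection_cost / annual_benefit


# Assumed inputs: $200,000 spend, 5-point gain (85% -> 90%),
# $50,000 of annual value per percentage point.
months = payback_months(200_000, 5, 50_000)
print(round(months, 1))  # 9.6
```

Payback shortens quickly as the success gain grows, so the result is most sensitive to how conservatively the gain over a fine-tuned baseline is estimated.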

How does Truelabel ensure trajectory quality and temporal alignment?

Truelabel's QA pipeline runs four automated checks: (1) frame-rate consistency (no gaps >2× expected interval), (2) timestamp monotonicity (no backwards jumps), (3) action-observation causality (actions precede observations by <50ms), (4) success-criteria validation (task-specific checks). We use hardware-triggered cameras synchronized to the robot's control loop via GPIO pulses, ensuring sub-10ms temporal alignment. Post-collection, we run MCAP validation to verify frame alignment and flag dropped packets or clock drift >5ms. Buyers receive QA reports with every batch: success rate by task, frame-drop histogram, timestamp-jitter distribution. Batches below agreed thresholds (e.g., <80% success, >5% frame drops) are rejected and re-collected at no cost.
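To make the first three checks concrete, here is a minimal Python sketch operating on plain lists of timestamps in seconds. It is not Truelabel's pipeline: the function and field names are hypothetical, and it assumes that `act_ts[i]` is the action paired with observation `obs_ts[i]`. The task-specific success check (4) is omitted because it depends on the task.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAReport:
    monotonic: bool            # check (2): no backwards timestamp jumps
    gap_violations: int        # check (1): intervals > 2x the expected interval
    max_gap_ratio: float       # largest interval divided by expected interval
    causality_violations: int  # check (3): pairs outside the [0, 50 ms) window


def check_trajectory(obs_ts: List[float], act_ts: List[float],
                     expected_dt: float, max_lead: float = 0.050) -> QAReport:
    """Run timestamp checks on one trajectory (all values in seconds)."""
    intervals = [b - a for a, b in zip(obs_ts, obs_ts[1:])]
    monotonic = all(dt > 0 for dt in intervals)
    gap_violations = sum(1 for dt in intervals if dt > 2 * expected_dt)
    max_gap_ratio = max(intervals) / expected_dt if intervals else 0.0
    # Each action should precede its paired observation by less than max_lead.
    causality_violations = sum(
        1 for a, o in zip(act_ts, obs_ts) if not (0.0 <= o - a < max_lead)
    )
    return QAReport(monotonic, gap_violations, max_gap_ratio, causality_violations)


# A 10 Hz stream with one dropped stretch between 0.2 s and 0.5 s.
obs = [0.0, 0.1, 0.2, 0.5, 0.6]
act = [t - 0.01 for t in obs]  # actions issued 10 ms before each observation
report = check_trajectory(obs, act, expected_dt=0.1)
print(report.monotonic, report.gap_violations, report.causality_violations)
```

In this example the stream is monotonic and causally ordered, but the 0.3 s interval exceeds twice the expected 0.1 s and is counted as one gap violation.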

Looking for manipulation trajectory data?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners and helps scope consent artifacts and commercial licensing requirements before delivery.
