
Physical AI Data Collection

How to Collect Teleoperation Data for Robot Learning

Teleoperation data collection requires four core components: a control interface (VR controllers, leader-follower arms, or spacemouse), synchronized multi-camera recording infrastructure capturing RGB-D streams at 15-30 Hz, trained operators executing task protocols with state randomization, and validation pipelines checking trajectory success rates and action distributions before dataset packaging in RLDS or LeRobot formats.

Updated 2025-06-15
By truelabel
Reviewed by truelabel

Quick facts

Difficulty
Intermediate
Audience
Physical AI data engineers
Last reviewed
2025-06-15

Why Teleoperation Data Powers Modern Robot Learning

Teleoperation datasets have become the highest-intent training signal for robot manipulation policies[1]. RT-1 trained on 130,000 teleoperated demonstrations across 700 tasks, achieving 97% success on novel object configurations. DROID collected 76,000 trajectories from 564 scenes using a standardized bimanual setup, enabling cross-embodiment transfer. OpenVLA fine-tuned on 970,000 multi-robot episodes shows that scale in teleoperation data directly predicts downstream task generalization.

Unlike autonomous exploration or scripted sim data, teleoperation captures human priors about contact dynamics, grasp affordances, and failure recovery. BridgeData V2 demonstrated that 60,000 kitchen manipulation episodes from 24 operators outperformed 10× larger sim-to-real datasets on long-horizon tasks. The operator's real-time corrections encode implicit world models that pure vision-language pretraining cannot replicate[2].

Three factors determine teleoperation data value: state coverage (object poses, lighting, clutter), action diversity (grasp types, approach angles, speeds), and operator consistency (low jitter, smooth trajectories). Open X-Embodiment aggregated 22 datasets totaling 1 million episodes but found that 12% of trajectories had control artifacts requiring post-hoc filtering. Quality gates at collection time prevent expensive rework downstream.

Selecting Your Teleoperation Interface

Interface choice constrains what behaviors you can demonstrate and how quickly operators fatigue. VR controllers (Meta Quest 3, HTC Vive) map 6-DOF controller pose to end-effector position and trigger buttons to gripper state. VR-based pipelines suit single-arm pick-place tasks, typically achieving 15-minute operator onboarding and 40 episodes per hour collection throughput. Latency under 50ms is critical—Quest 3's 90Hz tracking enables responsive control, while older headsets at 60Hz introduce perceptible lag that degrades demonstration quality.

Leader-follower arms provide force feedback and bimanual coordination. ALOHA's dual-arm setup uses two WidowX 250 leader arms controlling two ViperX 300 6-DOF follower arms, enabling tasks like bimanual cloth folding and coordinated assembly. Operators report 90-minute session tolerance before fatigue, versus 60 minutes for VR. The mechanical coupling gives tactile feedback when the follower arm encounters resistance, letting operators modulate contact forces naturally.

Spacemouse devices (3Dconnexion SpaceMouse Pro) offer 6-DOF control via a single puck interface, suitable for coarse positioning tasks. RoboNet collected 15 million frames across 7 robots using spacemouse teleoperation, but noted 23% higher trajectory jitter compared to leader-follower setups[3]. For dexterous manipulation requiring individual finger control, exoskeletons like Dex-YCB's ShadowHand interface map 24-DOF hand pose to a robotic hand, but operator training time extends to 8-12 hours and collection throughput drops to 8 episodes per hour.

Building Multi-Camera Recording Infrastructure

Synchronized multi-view capture is non-negotiable for spatial reasoning policies. Mount one wrist camera (Intel RealSense D405 or similar RGB-D sensor) on the end-effector for egocentric object tracking, plus 1-2 fixed scene cameras (RealSense D435i) at 45° and overhead angles. RT-2 used this three-camera configuration across 6,000 tasks, finding that wrist+overhead views reduced grasp failure rates by 34% versus single-camera setups[4].

Record at 15-30 Hz for manipulation tasks—higher rates waste storage without improving policy performance. LeRobot's data format stores 640×480 RGB images as JPEG frames in Parquet columns, achieving 2.1 GB per 100 episodes for three-camera setups. Depth streams compress poorly; store as 16-bit PNG or use MCAP's message-based format for efficient point cloud serialization.
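The figures above can be turned into a quick disk-budget sanity check. In this sketch, the ~16 KB per 640×480 JPEG frame is an assumption chosen to land near LeRobot's reported ~2.1 GB per 100 three-camera episodes; actual sizes vary with scene content and JPEG quality.

```python
def storage_gb(episodes, cameras=3, hz=15, episode_s=30, kb_per_jpeg=16):
    """Back-of-envelope RGB storage estimate for a collection campaign.

    kb_per_jpeg is an assumed average compressed frame size; depth streams
    (stored as 16-bit PNG) would add substantially on top of this.
    """
    frames = episodes * cameras * hz * episode_s
    return frames * kb_per_jpeg / (1024 * 1024)
```

For a 100-episode, three-camera campaign at 15 Hz, this lands near the ~2 GB figure cited above, which is useful when sizing NVMe capacity for a multi-week campaign.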

Timestamp synchronization matters for action-observation alignment. Use hardware triggers or NTP-synced clocks to keep camera frames within 5ms of robot joint state samples. RLDS format encodes per-step timestamps in its trajectory schema, enabling post-hoc alignment checks. DROID's data pipeline rejected 4% of episodes due to >10ms timestamp drift between cameras and robot controller. Store raw streams on NVMe SSDs during collection—spinning HDDs cannot sustain 90 MB/s write rates for three simultaneous camera feeds.
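A minimal drift check along these lines can run as part of daily validation. The function names and the per-stream list-of-timestamps layout are illustrative, not taken from RLDS or DROID tooling; thresholds mirror the 5ms/10ms figures above.

```python
def max_drift_ms(robot_ts, camera_ts):
    """Max absolute gap (ms) between each robot joint-state sample
    and its nearest camera frame timestamp (both in seconds)."""
    worst = 0.0
    for t in robot_ts:
        nearest = min(camera_ts, key=lambda c: abs(c - t))
        worst = max(worst, abs(nearest - t) * 1000.0)
    return worst

def episode_passes(robot_ts, camera_streams, threshold_ms=5.0):
    """Reject the episode if any camera stream drifts beyond the threshold."""
    return all(max_drift_ms(robot_ts, cam) <= threshold_ms
               for cam in camera_streams)
```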

Designing Task Protocols and Training Operators

Task protocols define success criteria, state randomization, and episode boundaries. Write a one-page spec listing: initial object poses (randomize XY position ±5cm, yaw ±30°), goal conditions (object in target zone, gripper open), and failure modes (object dropped, 60-second timeout). BridgeData V2's protocol specified 13 kitchen tasks with 6-8 object variations each, yielding 60,000 successful episodes from 180,000 attempts (33% success rate during collection).
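The randomization ranges in the one-page spec can be sampled directly at reset time. This is a sketch: the nominal pose values and the XY/yaw frame conventions are assumptions, not part of any published protocol.

```python
import random

def sample_initial_pose(nominal_xy=(0.40, 0.00), nominal_yaw_deg=0.0):
    """Sample a randomized object pose per the protocol above:
    XY position +/-5 cm, yaw +/-30 degrees around a nominal pose."""
    x = nominal_xy[0] + random.uniform(-0.05, 0.05)
    y = nominal_xy[1] + random.uniform(-0.05, 0.05)
    yaw_deg = nominal_yaw_deg + random.uniform(-30.0, 30.0)
    return {"x": x, "y": y, "yaw_deg": yaw_deg}
```

Displaying the sampled pose to the operator before each episode (e.g. as a target marker) keeps resets honest instead of drifting back toward a convenient fixed layout.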

Train operators on 20-30 practice episodes before production collection. Measure learning curves: episode duration should stabilize within 15 attempts, success rate should exceed 70% by attempt 25. ALOHA operators reached 85% success after 40 practice episodes on bimanual tasks. Record practice sessions separately—they contain exploration behaviors unsuitable for imitation learning but useful for debugging interface issues.

Randomize initial states aggressively. Domain randomization in real-world collection means varying object poses, lighting (add/remove lamps between episodes), and clutter (introduce distractor objects). Open X-Embodiment found that policies trained on datasets with <5cm pose variation failed to generalize to novel scenes, while ±10cm randomization improved zero-shot transfer by 41%[5]. Operators should reset the scene between episodes, not just retry from the last state.

Executing Collection Campaigns at Scale

Plan for 500-2,000 episodes per task to train robust policies. RT-1 used 130,000 episodes across 700 tasks (average 186 per task), while OpenVLA aggregated 970,000 episodes from 24 datasets. Smaller tasks (pick-place) converge with 300-500 episodes; long-horizon tasks (make sandwich) require 1,500+ episodes to cover failure recovery modes.

Schedule operators in 90-minute blocks with 15-minute breaks. Fatigue degrades trajectory smoothness—RoboNet observed 18% higher action jitter in the final 20 minutes of 2-hour sessions[6]. Rotate operators across tasks to prevent single-operator bias. BridgeData V2 used 24 operators contributing 2,500 episodes each, ensuring policy robustness to individual demonstration styles.

Log metadata per episode: operator ID, attempt number, success/failure label, episode duration, and free-text notes. LeRobot's episode schema includes these fields in its Parquet metadata columns. Track daily throughput (episodes per operator-hour) and success rate trends. If success rate drops below 60%, pause collection to retrain operators or revise the task protocol. DROID's campaign collected 76,000 episodes over 6 months using 12 operators across 4 lab sites, maintaining 68% average success rate through weekly calibration reviews.
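A per-episode metadata row and the 60% pause trigger might look like this sketch. Field names loosely follow LeRobot-style episode metadata but are not its exact schema.

```python
import time

def episode_record(operator_id, attempt, success, duration_s, notes=""):
    """One metadata row per episode; field names are illustrative."""
    return {
        "operator_id": operator_id,
        "attempt": attempt,
        "success": bool(success),
        "duration_s": round(duration_s, 2),
        "notes": notes,
        "logged_at": time.time(),
    }

def success_rate(records):
    """Campaign-level success rate, checked against the pause threshold."""
    return sum(r["success"] for r in records) / len(records)
```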

Validating Data Quality Before Packaging

Run automated checks on raw trajectories before dataset finalization. Verify action bounds: joint velocities should stay within ±2 rad/s for 6-DOF arms, gripper commands should toggle cleanly between 0.0 (open) and 1.0 (closed). RLDS validation scripts flag episodes with out-of-range actions or missing camera frames. DROID rejected 8% of collected episodes due to corrupted depth streams or incomplete action sequences.
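The action-bound checks reduce to a few lines. This sketch assumes per-step joint-velocity vectors and scalar gripper commands; it is not the RLDS validation script itself, and the limits simply mirror the text above.

```python
def check_action_bounds(joint_vels, gripper_cmds, vel_limit=2.0, tol=1e-6):
    """Flag out-of-range joint velocities (rad/s) and gripper commands
    that are not cleanly 0.0 (open) or 1.0 (closed)."""
    errors = []
    for t, step in enumerate(joint_vels):
        if any(abs(v) > vel_limit for v in step):
            errors.append((t, "joint velocity out of bounds"))
    for t, g in enumerate(gripper_cmds):
        if min(abs(g - 0.0), abs(g - 1.0)) > tol:
            errors.append((t, "gripper command not 0.0/1.0"))
    return errors
```

Episodes returning a non-empty error list go to a quarantine directory for manual review rather than straight into the training set.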

Check trajectory smoothness by computing action deltas between consecutive steps. High jitter (>0.3 rad/s velocity change per step) indicates operator control issues or interface lag. Open X-Embodiment applied Savitzky-Golay filtering to 14% of trajectories to remove high-frequency noise while preserving contact transitions[7]. Visualize action distributions: gripper open/close ratios should match task requirements (pick-place tasks typically show 40% closed, 60% open).
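The jitter heuristic above reduces to a maximum over per-step velocity deltas. This sketch applies the 0.3 rad/s threshold from the text, with the per-step joint-velocity layout assumed as before.

```python
def max_velocity_delta(joint_vels):
    """Largest per-step change in any joint velocity (rad/s per step)."""
    worst = 0.0
    for prev, cur in zip(joint_vels, joint_vels[1:]):
        worst = max(worst, max(abs(c - p) for p, c in zip(prev, cur)))
    return worst

def is_jittery(joint_vels, threshold=0.3):
    """High per-step deltas suggest operator control issues or interface lag."""
    return max_velocity_delta(joint_vels) > threshold
```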

Label success/failure per episode using goal-condition checks, not operator self-reports. BridgeData V2 used AprilTag detection to verify object placement, finding 12% disagreement between operator labels and automated checks. Store failed episodes separately—they provide negative examples for contrastive learning methods. CALVIN included 15% failure trajectories in its training split, improving policy robustness to recovery behaviors by 22%[8].

Formatting and Packaging Datasets for Distribution

Convert raw recordings to standardized formats for model training. RLDS (Reinforcement Learning Datasets) stores episodes as TFRecord files with per-step observations (images, proprio state) and actions. LeRobot format uses Parquet for tabular data (actions, states, timestamps) and separate directories for image frames, enabling fast random access during training. Both formats support dataset versioning and metadata schemas.

Split data into train/val/test sets with episode-level boundaries—never split mid-episode. Use 80/10/10 splits for datasets >1,000 episodes, 70/15/15 for smaller sets. Open X-Embodiment stratified splits by task and operator to ensure balanced representation. Include a dataset card documenting: robot hardware specs, camera intrinsics/extrinsics, action space definitions, and collection protocol. Datasheets for Datasets provides a template covering intended use, operator demographics, and known limitations.
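An episode-level split (no mid-episode cuts) takes only a few lines. This sketch shuffles whole episode IDs and deliberately omits the task/operator stratification that Open X-Embodiment used; rounding leaves any remainder in the test split.

```python
import random

def split_episodes(episode_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle episode IDs, then cut at episode boundaries only."""
    ids = list(episode_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]
```

Fixing the seed keeps the split reproducible across dataset versions, so later episodes added to a campaign never leak into an existing test set's episodes.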

Compress images as JPEG (quality 95) for RGB, PNG for depth. LeRobot's default pipeline achieves 2.1 GB per 100 episodes (three cameras, 15 Hz, 30-second episodes). For point clouds, use MCAP with zstd compression, reducing file sizes by 60% versus raw PCD. Host datasets on Hugging Face Datasets with DOI assignment for citability, or use institutional repositories with provenance tracking for proprietary collections.

Common Pitfalls in Teleoperation Collection

Single-operator bias is the most frequent failure mode. Policies trained on one operator's demonstrations overfit to their specific grasp approach angles and movement speeds. RoboNet found that single-operator datasets showed 31% lower success rates on held-out test scenes versus multi-operator datasets[9]. Mitigate by rotating ≥3 operators per task and tracking per-operator success rates.

Insufficient state randomization produces brittle policies. BridgeData V2 initially collected episodes with fixed object poses, achieving 92% in-distribution success but only 34% on novel arrangements. After adding ±10cm XY randomization and ±45° yaw variation, zero-shot transfer improved to 73%[10]. Randomize lighting, background clutter, and distractor object placement in every episode.

Ignoring failed episodes wastes information. CALVIN included failure trajectories showing recovery behaviors (re-grasping after drops, collision avoidance), improving policy robustness by 22% versus success-only training[8]. Label failures with error codes (object dropped, timeout, collision) to enable targeted data augmentation.

Post-hoc quality issues arise from deferred validation. DROID discovered that 8% of episodes had corrupted depth streams only after 3 months of collection, requiring expensive re-collection[11]. Run validation checks daily during active campaigns.

Scaling Beyond 1,000 Episodes

Large-scale campaigns require workflow automation and quality monitoring dashboards. Scale AI's Physical AI platform provides operator management tools, real-time throughput tracking, and automated quality checks for campaigns exceeding 10,000 episodes. Open X-Embodiment coordinated 22 institutions collecting 1 million episodes using shared data schemas and weekly sync meetings.

Parallelize collection across multiple robot cells. RT-1 used 13 robots collecting simultaneously, achieving 2,600 episodes per week. Standardize hardware (identical camera models, robot firmware versions) to simplify data merging. DROID deployed 4 identical bimanual setups across lab sites, using LeRobot's data format for cross-site compatibility.

Budget $80-150 per operator-hour for trained teleoperation labor. Truelabel's marketplace connects buyers with vetted data collection teams, with typical campaigns ranging from $40,000 (500 episodes, single task) to $800,000 (50,000 episodes, 20 tasks). Outsourcing to specialist vendors like Claru or Silicon Valley Robotics Center reduces fixed costs for hardware and operator training. For proprietary tasks requiring domain expertise, hybrid models (in-house protocol design, outsourced execution) balance control and cost.

Teleoperation Data in the Physical AI Marketplace

Teleoperation datasets are the fastest-growing category on truelabel's physical AI data marketplace, with 127 listings added in Q1 2025 covering manipulation, mobile manipulation, and humanoid tasks. Buyers prioritize three attributes: episode count (minimum 500 per task for production use), state coverage (≥5cm pose randomization, ≥3 lighting conditions), and format compatibility (RLDS or LeRobot preferred).

Pricing follows a tiered model: $50-120 per episode for single-task datasets (pick-place, drawer opening), $30-70 per episode for multi-task collections (10+ tasks sharing object sets), and $15-40 per episode for large-scale aggregations (>10,000 episodes). Claru's warehouse teleoperation dataset lists 2,400 episodes across 8 logistics tasks at $68 per episode. Premium pricing applies to bimanual tasks (+40%), dexterous manipulation (+60%), and datasets with failure recovery demonstrations (+25%).

Licensing terms vary by use case. Academic datasets like RoboNet use permissive licenses (MIT, Apache 2.0) allowing commercial model training. Commercial vendors typically offer tiered licenses: research-only ($5,000-15,000 flat fee), commercial development ($25,000-80,000 plus royalties), or exclusive rights ($150,000-500,000). Provenance tracking is mandatory for datasets used in safety-critical applications (medical robotics, food handling) to satisfy regulatory audit requirements.


External references and source context

  1. Teleoperation datasets are becoming the highest-intent physical AI content category

    Teleoperation datasets are the highest-intent training signal for robot manipulation policies

    tonyzhaozh.github.io
  2. General Agents Need World Models

    Operator corrections encode implicit world models that pure vision-language pretraining cannot replicate

    arXiv
  3. RoboNet: Large-Scale Multi-Robot Learning

    RoboNet noted 23% higher trajectory jitter with spacemouse versus leader-follower setups

    arXiv
  4. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Wrist plus overhead views reduced grasp failure rates by 34% versus single-camera setups

    arXiv
  5. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Policies with <5cm pose variation failed to generalize while ±10cm improved zero-shot transfer by 41%

    arXiv
  6. RoboNet: Large-Scale Multi-Robot Learning

    RoboNet observed 18% higher action jitter in final 20 minutes of 2-hour sessions

    arXiv
  7. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment applied Savitzky-Golay filtering to 14% of trajectories to remove noise

    arXiv
  8. CALVIN paper

    CALVIN failure trajectories improved policy robustness to recovery behaviors by 22%

    arXiv
  9. RoboNet: Large-Scale Multi-Robot Learning

    Single-operator datasets showed 31% lower success rates versus multi-operator datasets

    arXiv
  10. BridgeData V2: A Dataset for Robot Learning at Scale

    BridgeData V2 improved zero-shot transfer from 34% to 73% after adding pose randomization

    arXiv
  11. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID discovered 8% of episodes had corrupted depth streams after 3 months of collection

    arXiv

FAQ

What is the minimum episode count needed to train a manipulation policy?

300-500 episodes suffice for simple pick-place tasks with limited object variation. Long-horizon tasks (multi-step assembly, cooking) require 1,500-3,000 episodes to cover failure modes and recovery behaviors. RT-1 used 130,000 episodes across 700 tasks (average 186 per task) to achieve robust generalization. Start with 500 episodes for initial policy training, then collect targeted data for failure cases identified during evaluation.

How do I choose between VR controllers and leader-follower arms?

VR controllers (Meta Quest 3) enable faster operator onboarding (15 minutes) and higher throughput (40 episodes/hour) but lack force feedback, making contact-rich tasks harder to demonstrate. Leader-follower arms (ALOHA-style) provide tactile feedback and natural bimanual coordination, critical for tasks like cloth folding or coordinated assembly, but require 60-90 minute operator training and reduce throughput to 20-25 episodes/hour. Choose VR for single-arm pick-place; choose leader-follower for bimanual or contact-sensitive tasks.

What file formats should I use for robot demonstration data?

RLDS (Reinforcement Learning Datasets) and LeRobot format are the two standards for 2025. RLDS stores episodes as TFRecord files with per-step observations and actions, compatible with TensorFlow training pipelines. LeRobot uses Parquet for tabular data (actions, states) and separate image directories, enabling fast random access in PyTorch. Both support dataset versioning and metadata schemas. For ROS-native workflows, MCAP provides efficient message serialization with zstd compression, reducing file sizes by 60% versus rosbag2.

How much does professional teleoperation data collection cost?

Budget $80-150 per operator-hour for trained labor. A 500-episode single-task campaign (20 operator-hours at 25 episodes/hour) costs $40,000-60,000 including hardware amortization and quality validation. Multi-task campaigns (10+ tasks, 5,000 episodes) range from $200,000-400,000. Outsourcing to vendors like Claru or Silicon Valley Robotics Center reduces fixed costs but adds 15-25% service fees. Per-episode pricing on marketplaces ranges from $50-120 for single-task data to $15-40 for large-scale aggregations.

Should I include failed episodes in my training dataset?

Yes, for tasks where recovery behaviors matter. CALVIN included 15% failure trajectories (re-grasping after drops, collision avoidance), improving policy robustness by 22% versus success-only training. Label failures with error codes (object dropped, timeout, collision) to enable contrastive learning or negative example filtering. For pure imitation learning without explicit failure modeling, train only on successful episodes but archive failures for future use in offline RL or data augmentation experiments.

How do I validate teleoperation data quality before training?

Run three automated checks: (1) action bounds—verify joint velocities stay within ±2 rad/s, gripper commands toggle cleanly between 0.0/1.0; (2) trajectory smoothness—flag episodes with >0.3 rad/s velocity deltas indicating jitter or lag; (3) goal conditions—use automated checks (AprilTag detection, depth-based object localization) rather than operator self-reports. BridgeData V2 found 12% disagreement between operator labels and automated verification. Reject episodes with corrupted camera frames, incomplete action sequences, or >10ms timestamp drift between sensors.

Looking for teleoperation data collection?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

List Your Teleoperation Dataset