
Physical AI Data Collection

How to Set Up a Teleoperation Rig for Physical AI Data Collection

A teleoperation rig captures human demonstrations for robot imitation learning by pairing an input device (leader arm, VR controller, or SpaceMouse) with a follower robot, synchronized cameras, and a recording pipeline that logs joint states, end-effector poses, RGB-D streams, and task metadata into MCAP or HDF5 containers at 15-30 Hz. Production rigs balance interface fidelity (leader-follower arms yield 85-90% task success vs 60-75% for VR), hardware cost ($800-$16,000 per station), and operator throughput (20-50 episodes per 8-hour shift). The DROID dataset collected 76,000 trajectories across 564 skills using this architecture[1], while Open X-Embodiment aggregated 1 million episodes from 22 robot embodiments[5].

Updated 2026-01-15
By truelabel
Reviewed by truelabel

Quick facts

Difficulty
Intermediate
Audience
Physical AI data engineers
Last reviewed
2026-01-15

Why Teleoperation Rig Design Determines Dataset Quality

Teleoperation rigs are the primary data-generation infrastructure for imitation learning policies like RT-1, RT-2, and OpenVLA. A rig's interface fidelity, sensor coverage, and recording latency directly constrain downstream policy performance. The DROID dataset demonstrated that diverse teleoperation setups across 564 manipulation skills produce generalist policies that transfer to unseen tasks 40% more effectively than single-embodiment datasets[1].

Three architectural decisions dominate rig quality: input device selection (leader-follower vs VR vs keyboard), camera placement and calibration, and the recording pipeline's timestamp synchronization. Leader-follower arms like ALOHA achieve 85-90% task success rates because kinematic mirroring preserves human dexterity, but cost $8,000-$16,000 per bimanual station. VR interfaces using Meta Quest 3 or HTC Vive cost $400-$800 but introduce 20-40ms latency and require operator training to map 6-DoF controller poses to end-effector commands. SpaceMouse devices cost $200-$400 and suit coarse pick-place tasks but cannot capture fine-grained finger articulation.

Camera calibration errors propagate into policy training as spatial misalignment between RGB observations and proprioceptive states. BridgeData V2 uses 3-4 calibrated RGB-D cameras per workspace to provide 270° coverage, logging 60,000 trajectories with ChArUco board extrinsics[2]. Recording pipelines must timestamp-align joint states, camera frames, and gripper signals within 5ms to prevent action-observation desynchronization that degrades policy convergence. The LeRobot framework standardizes this via HDF5 episode containers with nanosecond-precision timestamps.

Choosing the Teleoperation Interface: Leader-Follower vs VR vs SpaceMouse

Leader-follower arms provide the highest demonstration fidelity by mechanically mirroring operator hand motions to the follower robot. The ALOHA system pairs two WidowX 250 leader arms with two ViperX 300 6-DoF follower arms, enabling bimanual tasks like cloth folding and dishwasher loading. Each leader arm costs approximately $4,000; a full bimanual station totals $16,000. The kinematic correspondence is intuitive: operators grasp objects with the leader gripper, and the follower replicates the motion with sub-centimeter precision. ALOHA collected 650 bimanual demonstrations for ACT policy training, achieving 80% success on unseen dish-racking tasks[3].

VR teleoperation uses head-mounted displays (Meta Quest 3, HTC Vive) to render the robot workspace from camera feeds while hand controllers command end-effector poses. Libraries like dex-retargeting and AnyTeleop map VR controller 6-DoF poses to robot inverse kinematics. VR interfaces cost $400-$800 but introduce 20-40ms round-trip latency from pose tracking to robot actuation. Operators require 2-4 hours of training to internalize the controller-to-gripper mapping. RoboNet used VR teleoperation across 7 robot platforms to collect 15 million frames, demonstrating cross-embodiment generalization[4].

SpaceMouse devices (3Dconnexion SpaceMouse Wireless, $200-$400) provide 6-DoF input via a single puck controller. Operators translate and rotate the puck to command end-effector velocity. SpaceMouse suits coarse manipulation (pick-place, push) but cannot capture finger articulation or bimanual coordination. Throughput is 30-50 episodes per 8-hour shift for simple tasks. Scale AI's physical-AI data engine uses SpaceMouse interfaces for warehouse pick-place datasets, prioritizing cost over dexterity.

Camera Selection and Placement for Multi-View RGB-D Coverage

Camera placement determines the policy's observational grounding. Imitation learning models like RT-1 and Diffusion Policy require 2-4 synchronized RGB or RGB-D cameras to resolve occlusions and provide depth cues for grasping. Intel RealSense D435 cameras ($200 each) deliver 1280×720 RGB at 30 Hz with aligned depth streams; the depth sensor covers an 87° horizontal FOV (the RGB sensor covers 69°). Logitech C920 webcams ($80 each) provide 1080p RGB at 30 Hz but lack depth; they suit tasks where depth is inferred from motion parallax.

BridgeData V2 mounts three RealSense D435 cameras per workspace: one overhead (60cm above table), one wrist-mounted (15cm from gripper), and one front-angled (45° elevation, 80cm distance). This configuration provides 270° azimuthal coverage and resolves gripper-object occlusions during insertion tasks[2]. Wrist cameras are critical for fine manipulation—DROID found that policies trained without wrist views fail 35% more often on peg-insertion and threading tasks[1].

Camera calibration uses ChArUco boards (supported by OpenCV's aruco module) to compute extrinsic transforms between camera frames and the robot base frame. The Tsai-Lenz hand-eye calibration method solves for the wrist-camera transform by recording 15-20 robot poses with the board visible. Calibration residuals below 2mm are necessary for sub-centimeter grasping accuracy. LeRobot's calibration utilities automate ChArUco detection and export camera intrinsics/extrinsics as JSON for downstream policy training. Verify calibration by commanding the robot to touch a known 3D point and measuring the pixel-space reprojection error; target <3 pixels at 1280×720 resolution.
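
As a worked illustration, the sketch below chains OpenCV's ChArUco detection into `cv2.calibrateHandEye` with the Tsai-Lenz solver. It assumes the legacy `cv2.aruco` function API (opencv-contrib-python releases before 4.7; newer versions expose `CharucoDetector`/`ArucoDetector` classes instead), and the board geometry and capture list are placeholders to adapt to your rig.

```python
import cv2
import numpy as np

# Illustrative 7x5 board with 40 mm squares and 30 mm markers.
dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_5X5_100)
board = cv2.aruco.CharucoBoard_create(7, 5, 0.04, 0.03, dictionary)

def board_pose(image, K, dist):
    """Return the board's rotation matrix and translation in the camera frame."""
    corners, ids, _ = cv2.aruco.detectMarkers(image, dictionary)
    if ids is None:
        return None, None
    _, ch_corners, ch_ids = cv2.aruco.interpolateCornersCharuco(corners, ids, image, board)
    ok, rvec, tvec = cv2.aruco.estimatePoseCharucoBoard(
        ch_corners, ch_ids, board, K, dist, None, None)
    return (cv2.Rodrigues(rvec)[0], tvec) if ok else (None, None)

def hand_eye(captures, K, dist):
    """captures: 15-20 tuples of (image, R_gripper2base, t_gripper2base),
    with gripper->base transforms taken from the robot's forward kinematics."""
    R_g2b, t_g2b, R_t2c, t_t2c = [], [], [], []
    for image, R, t in captures:
        Rb, tb = board_pose(image, K, dist)
        if Rb is None:
            continue  # board not visible in this pose
        R_g2b.append(R); t_g2b.append(t); R_t2c.append(Rb); t_t2c.append(tb)
    # Tsai-Lenz solve for the wrist camera-to-gripper transform.
    return cv2.calibrateHandEye(R_g2b, t_g2b, R_t2c, t_t2c,
                                method=cv2.CALIB_HAND_EYE_TSAI)
```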

Recording Pipeline Architecture: MCAP vs HDF5 vs RLDS

The recording pipeline logs synchronized streams of joint states, camera frames, gripper signals, and task metadata into a container format. Three formats dominate physical-AI datasets: MCAP, HDF5, and RLDS. MCAP is a self-describing binary container format, adopted as the default ROS 2 bag format, supporting nanosecond timestamps and zero-copy playback. MCAP schemas (Protobuf, ROS 2 CDR, or JSON) encode joint states, images, and point clouds in a single file with per-chunk CRC validation. The DROID dataset uses MCAP to store 76,000 trajectories totaling 3.2 TB, enabling random-access episode retrieval for distributed training[1].

HDF5 organizes episodes as hierarchical groups with datasets for observations, actions, and rewards. LeRobot's HDF5 schema stores RGB frames as JPEG-compressed byte arrays (reducing file size by 80% vs raw pixels) and joint states as float32 arrays. Each episode group contains a `timestamps` dataset with int64 nanosecond offsets from episode start. HDF5's chunked storage enables parallel writes from multiple camera threads. ALOHA records bimanual demonstrations as HDF5 files with 30 Hz joint states and 15 Hz RGB frames, totaling 1.2 GB per 100-episode dataset[3].
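
A minimal h5py sketch of this layout follows; the group and dataset names are illustrative, not LeRobot's published schema.

```python
import cv2
import h5py
import numpy as np

def write_episode(path, joint_states, frames, timestamps_ns):
    """joint_states: (T, 7) float32; frames: list of HxWx3 uint8 RGB images;
    timestamps_ns: (T,) int64 absolute nanosecond timestamps."""
    with h5py.File(path, "w") as f:
        ep = f.create_group("episode_0")
        ep.create_dataset("observations/joint_states",
                          data=np.asarray(joint_states, dtype=np.float32))
        ts = np.asarray(timestamps_ns, dtype=np.int64)
        ep.create_dataset("timestamps", data=ts - ts[0])  # offsets from episode start
        # JPEG-compress each frame (~80% smaller than raw pixels) into a
        # variable-length uint8 dataset.
        jpeg = [cv2.imencode(".jpg", im)[1] for im in frames]
        vlen = h5py.vlen_dtype(np.dtype("uint8"))
        rgb = ep.create_dataset("observations/rgb_0", (len(jpeg),), dtype=vlen)
        for i, buf in enumerate(jpeg):
            rgb[i] = buf.ravel()
```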

RLDS (Reinforcement Learning Datasets) wraps TensorFlow Datasets with a standardized schema for episodes, steps, observations, and actions. RLDS's Apache Beam pipeline converts raw logs into sharded TFRecord files optimized for TPU training. Open X-Embodiment uses RLDS to unify 1 million episodes from 22 robot types, enabling cross-embodiment policy pretraining[5]. RLDS adds 15-20% storage overhead vs raw HDF5 but integrates natively with TensorFlow data loaders.

Workspace Configuration and Task Setup for Reproducible Demonstrations

Workspace layout affects demonstration consistency and policy generalization. A standardized workspace includes a flat table (80×120 cm), fixed object spawn zones marked with tape or 3D-printed fixtures, and consistent lighting (5000K LED panels, 800 lux at table height). RoboCasa uses kitchen counter mockups with drawer handles, cabinet doors, and appliance replicas to collect 100,000 household manipulation demonstrations[6]. Object placement variance during data collection improves policy robustness—BridgeData V2 randomizes object positions within a 10×10 cm spawn grid, increasing generalization by 25% over fixed-layout datasets[2].

Task instructions must be unambiguous and operator-verifiable. Define success criteria as binary predicates (e.g., "mug is upright in the drying rack, handle facing left") rather than subjective descriptions. DROID's 564 task definitions include pre-condition checks ("drawer is closed") and post-condition assertions ("all utensils in drawer") to filter incomplete demonstrations. Operators mark task success via a foot pedal or keyboard shortcut, which the recording pipeline logs as a boolean flag in the episode metadata.

Reset procedures between episodes are critical for throughput. Manual resets (operator repositions objects) take 30-60 seconds per episode. Automated resets using a second robot arm or conveyor belt reduce reset time to 10-15 seconds but add $5,000-$10,000 in hardware cost. ALOHA uses manual resets for bimanual tasks, achieving 20-30 episodes per 8-hour shift[3]. Scale AI's warehouse teleoperation rigs use conveyor-fed object presentation, reaching 50 episodes per shift for pick-place tasks.

Operator Training and Demonstration Quality Validation

Operator skill directly determines dataset quality. Novice operators produce jerky trajectories with 30-50% higher action variance than experts, degrading policy smoothness. Training protocols should include 2-4 hours of practice on representative tasks before production data collection. ALOHA's operator training begins with single-arm pick-place (1 hour), progresses to bimanual coordination (2 hours), and concludes with timed task execution (1 hour). Operators must achieve 80% task success on 10 consecutive practice episodes before contributing to the production dataset[3].

Real-time quality metrics help operators self-correct. Display joint velocity histograms, gripper force traces, and camera feed overlays during teleoperation. Sudden velocity spikes (>2 rad/s) indicate collisions or control instability. LeRobot's teleoperation GUI shows live plots of joint positions and end-effector trajectories, enabling operators to identify and abort low-quality demonstrations before saving.
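
A minimal spike detector along these lines (the 2 rad/s threshold follows the text; the sampling period is an assumption to set from your recording rate):

```python
import numpy as np

def velocity_spikes(positions, dt, limit=2.0):
    """positions: (T, n_joints) joint angles sampled every dt seconds.
    Returns timestep indices where any joint exceeds the velocity limit."""
    vel = np.diff(positions, axis=0) / dt  # finite-difference velocity, rad/s
    return np.flatnonzero(np.any(np.abs(vel) > limit, axis=1))

# e.g. velocity_spikes(episode_joints, dt=1/30) -> indices worth reviewing
```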

Post-collection validation filters outlier episodes. Compute per-episode statistics: trajectory length (number of timesteps), action smoothness (mean absolute jerk), and task success rate. DROID discards episodes with <10 timesteps (incomplete) or >500 timesteps (operator struggled), retaining 92% of raw demonstrations[1]. BridgeData V2 uses automated success detection via object-tracking algorithms, rejecting 8% of episodes where the target object did not reach the goal region[2]. Manual review of 5-10% of episodes (stratified by operator and task) catches systematic errors like incorrect camera angles or workspace clutter.
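
These statistics reduce to a few lines of NumPy. The sketch below uses the DROID-style length bounds quoted above, with the jerk threshold left to tune per task.

```python
import numpy as np

def episode_stats(positions, dt):
    """Length and smoothness statistics for one episode's joint trajectory."""
    vel = np.diff(positions, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    jerk = np.diff(acc, axis=0) / dt
    return {"length": len(positions),
            "mean_abs_jerk": float(np.mean(np.abs(jerk)))}

def keep_episode(stats, min_len=10, max_len=500):
    # Discard incomplete (<10 steps) or struggling (>500 steps) episodes.
    return min_len <= stats["length"] <= max_len
```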

Synchronization and Timestamp Alignment Across Sensors

Timestamp misalignment between joint states and camera frames causes action-observation desynchronization, where the policy learns to predict actions for stale observations. Target synchronization error is <5ms across all sensors. Hardware-triggered cameras (e.g., RealSense D435 with external trigger input) eliminate software scheduling jitter by synchronizing frame capture to a master clock. MCAP's nanosecond timestamps preserve sub-millisecond timing for post-hoc alignment.

ROS 2 nodes stamp messages with the node clock (`now()` on the node's `rclcpp::Clock`). Enable the real-time kernel patch (PREEMPT_RT) on the recording workstation to reduce scheduling latency from 10-20ms to <1ms. ROS 2 bag recording writes timestamped messages to MCAP files with per-topic QoS policies. Configure camera nodes with `RELIABLE` QoS and joint-state nodes with `BEST_EFFORT` to prioritize low-latency proprioception over guaranteed image delivery.
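
A minimal rclpy equivalent of this QoS split is sketched below (topic names are assumptions; the rclcpp discussion above maps directly onto the Python API):

```python
import rclpy
from rclpy.node import Node
from rclpy.qos import QoSProfile, QoSReliabilityPolicy
from sensor_msgs.msg import Image, JointState

class TeleopRecorder(Node):
    def __init__(self):
        super().__init__("teleop_recorder")
        img_qos = QoSProfile(depth=10, reliability=QoSReliabilityPolicy.RELIABLE)
        js_qos = QoSProfile(depth=100, reliability=QoSReliabilityPolicy.BEST_EFFORT)
        self.create_subscription(Image, "/camera_0/color/image_raw",
                                 self.on_image, img_qos)
        self.create_subscription(JointState, "/joint_states",
                                 self.on_joints, js_qos)

    def on_image(self, msg):
        pass  # msg.header.stamp carries the capture time for post-hoc alignment

    def on_joints(self, msg):
        pass  # buffer proprioception for the episode writer

def main():
    rclpy.init()
    rclpy.spin(TeleopRecorder())
```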

LeRobot's episode synchronization resamples all sensor streams to a common 30 Hz timeline using linear interpolation for joint states and nearest-neighbor sampling for images. The resampling step computes a global episode timestamp array (0, 33ms, 66ms,...) and queries each sensor's raw timestamp buffer to find the closest sample. Verify synchronization by plotting joint position vs camera frame index—discontinuities indicate dropped frames or clock drift. DROID's recording pipeline uses NTP-synchronized clocks across distributed data-collection sites, achieving <2ms inter-site timestamp alignment[1].
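
The resampling scheme reduces to a short NumPy routine. The sketch below mirrors the described approach (linear interpolation for joints, nearest-neighbor lookup for frame indices) but is an illustration, not LeRobot's actual implementation:

```python
import numpy as np

def resample_episode(joint_ts, joints, frame_ts, hz=30):
    """joint_ts, frame_ts: increasing nanosecond timestamps; joints: (T, dof)."""
    grid = np.arange(joint_ts[0], joint_ts[-1], 1e9 / hz)  # common timeline
    joints_rs = np.stack([np.interp(grid, joint_ts, joints[:, j])
                          for j in range(joints.shape[1])], axis=1)
    # Nearest-neighbor frame index for each grid point.
    idx = np.clip(np.searchsorted(frame_ts, grid), 1, len(frame_ts) - 1)
    nearest = np.where(grid - frame_ts[idx - 1] < frame_ts[idx] - grid,
                       idx - 1, idx)
    return grid, joints_rs, nearest
```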

Data Storage and Episode Metadata for Dataset Provenance

Episode metadata enables dataset filtering, provenance tracking, and licensing compliance. Minimum metadata fields include: episode ID (UUID), task name, operator ID (anonymized), success flag, episode duration, robot model, camera serials, and recording timestamp. LeRobot's metadata schema adds `task_difficulty` (easy/medium/hard), `environment_id` (workspace identifier), and `data_collection_version` (rig configuration hash) to support stratified train/test splits.
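
A concrete metadata record with these fields might look like the following (all values are illustrative):

```python
import json
import uuid
from datetime import datetime, timezone

metadata = {
    "episode_id": str(uuid.uuid4()),
    "task_name": "put_mug_in_drying_rack",
    "operator_id": "op_017",                      # anonymized
    "success": True,
    "duration_s": 42.3,
    "robot_model": "franka_emika_panda",
    "camera_serials": ["021222071234", "021222075678"],
    "recorded_at": datetime.now(timezone.utc).isoformat(),
    "task_difficulty": "medium",                  # LeRobot-style extras
    "environment_id": "workspace_03",
    "data_collection_version": "rig_cfg_9f2a1c",  # rig configuration hash
}

with open("episode_meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```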

Data provenance tracks the lineage of each episode from raw sensor logs to processed training samples. PROV-DM is a W3C standard for provenance graphs, encoding entities (datasets), activities (calibration, filtering), and agents (operators, scripts). Open X-Embodiment embeds PROV-DM metadata in RLDS datasets, documenting which episodes were collected under which rig configurations[5]. This enables buyers to filter by camera type, robot model, or operator experience level.

Storage costs scale with episode count and sensor resolution. A 30-second episode at 30 Hz with 3 RGB cameras (1280×720 JPEG) and 7-DoF joint states totals roughly 135 MB (3 cameras × 900 frames × 50 KB/frame). A 10,000-episode dataset then requires about 1.35 TB. BridgeData V2's 60,000 episodes occupy 2.7 TB with 4-camera coverage[2]. Cloud storage (AWS S3, Google Cloud Storage) costs $0.023/GB/month; a 3 TB dataset incurs about $70/month. Truelabel's marketplace hosts datasets with per-episode licensing, enabling buyers to purchase subsets (e.g., 1,000 episodes of a specific task) rather than full archives.
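
The arithmetic above fits in a short back-of-envelope estimator:

```python
def episode_storage_mb(duration_s=30, hz=30, n_cameras=3, kb_per_frame=50):
    frames = duration_s * hz * n_cameras    # 2,700 frames for the example above
    return frames * kb_per_frame / 1000     # ~135 MB; joint states are negligible

dataset_tb = 10_000 * episode_storage_mb() / 1e6  # ~1.35 TB for 10,000 episodes
monthly_usd = dataset_tb * 1000 * 0.023           # S3 standard: ~$31/month
```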

Optimizing Operator Throughput and Cost per Episode

Operator throughput determines dataset collection cost. A single operator producing 30 episodes per 8-hour shift at $25/hour yields $6.67 per episode in labor cost. Hardware amortization adds $1-3 per episode (assuming $16,000 rig cost amortized over 10,000 episodes). Scale AI's data engine targets $5-10 per episode for warehouse manipulation tasks using SpaceMouse interfaces and automated resets.

Task complexity inversely correlates with throughput. Simple pick-place tasks (grasp object, move to bin) achieve 40-50 episodes per shift. Bimanual tasks (fold towel, load dishwasher) drop to 15-25 episodes per shift due to longer execution time and higher reset overhead. ALOHA's bimanual dish-racking task averages 90 seconds per episode (60s execution + 30s reset), yielding 20 episodes per shift[3]. DROID's 564 tasks range from 10-second drawer-opening (50 episodes/shift) to 180-second meal-prep sequences (8 episodes/shift)[1].

Parallelizing data collection across multiple rigs scales linearly with hardware investment. BridgeData V2 deployed 4 teleoperation stations in parallel, collecting 60,000 episodes over 6 months (250 episodes/day)[2]. Open X-Embodiment aggregated data from 21 institutions, each contributing 10,000-100,000 episodes, reaching 1 million total episodes[5]. Distributed collection requires standardized rig configurations and centralized metadata schemas to ensure dataset compatibility.

Validating End-to-End Pipeline with Policy Training

The ultimate validation of a teleoperation rig is whether policies trained on its data generalize to unseen scenarios. Collect a pilot dataset of 100-200 episodes on a representative task, train a baseline policy (Diffusion Policy or ACT), and measure task success rate on 20 held-out test episodes. Target 70-80% success for simple tasks, 50-60% for complex bimanual tasks.

LeRobot's training scripts load HDF5 episodes, apply data augmentation (random crops, color jitter), and train a Diffusion Policy model for 100,000 steps on a single GPU (8 hours). Evaluate the policy in simulation first (if a simulator exists) to catch gross errors like inverted joint directions or incorrect camera extrinsics. RoboSuite and ManiSkill provide simulated environments for pick-place and assembly tasks, enabling zero-cost policy iteration before real-robot deployment.
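
As an illustration of the loading step, the sketch below wraps the HDF5 layout from earlier in a PyTorch `Dataset` with a crude random-crop augmentation. It is a hedged stand-in, not LeRobot's actual loader:

```python
import cv2
import h5py
import numpy as np
import torch
from torch.utils.data import Dataset

class TeleopEpisode(Dataset):
    """Reads the illustrative episode_0 layout written earlier."""
    def __init__(self, path):
        self.f = h5py.File(path, "r")
        self.ep = self.f["episode_0"]
        self.n = len(self.ep["timestamps"])

    def __len__(self):
        return self.n

    def __getitem__(self, i):
        buf = np.asarray(self.ep["observations/rgb_0"][i], dtype=np.uint8)
        img = cv2.imdecode(buf, cv2.IMREAD_COLOR)   # decode stored JPEG
        r, c = np.random.randint(0, 8, size=2)      # crude random crop
        img = cv2.resize(img[r:, c:], (224, 224))
        joints = self.ep["observations/joint_states"][i]
        return (torch.from_numpy(img).permute(2, 0, 1).float() / 255.0,
                torch.from_numpy(joints))
```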

Real-robot evaluation requires 20-50 test rollouts per task to estimate success rate with 95% confidence intervals. ALOHA's bimanual policies achieve 80% success on dish-racking after training on 650 demonstrations, but only 40% success when trained on 200 demonstrations—indicating that dataset size is a critical variable[3]. If test success is <50%, diagnose via: (1) visualize policy-predicted actions vs ground-truth actions on training episodes (high error indicates model underfitting), (2) check camera calibration by overlaying predicted grasp points on RGB frames (misalignment indicates extrinsic errors), (3) review operator demonstrations for consistency (high action variance indicates operator training gaps).
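
Why 20-50 rollouts? A Wilson score interval makes the estimate's uncertainty explicit:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a binomial success rate."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(wilson_ci(16, 20))   # 80% over 20 rollouts -> roughly (0.58, 0.92)
print(wilson_ci(40, 50))   # 80% over 50 rollouts -> roughly (0.67, 0.89)
```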

Scaling to Multi-Task and Multi-Embodiment Datasets

Multi-task datasets enable generalist policies that transfer across manipulation primitives. DROID's 564 tasks span grasping, pushing, insertion, folding, and pouring, collected across 6 robot embodiments (Franka Emika, UR5, Kinova Gen3)[1]. Task diversity improves zero-shot generalization—policies trained on 500+ tasks succeed on 40% of unseen tasks without fine-tuning, vs 10% for single-task policies.

Multi-embodiment datasets require embodiment-agnostic action representations. Open X-Embodiment normalizes joint actions to [-1, 1] ranges and represents end-effector poses in a canonical base frame, enabling policies to transfer between 7-DoF and 6-DoF arms[5]. RT-2 uses language-conditioned policies that map task descriptions ("pick up the red mug") to actions, abstracting over embodiment-specific joint configurations[7].
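
Per-dimension min/max normalization of this kind takes only a few lines; this is a sketch of the general technique, not Open X-Embodiment's exact code:

```python
import numpy as np

def fit_action_norm(actions):
    """actions: (N, dof) stacked across the dataset; returns per-dim bounds."""
    return actions.min(axis=0), actions.max(axis=0)

def normalize(a, lo, hi):
    return 2.0 * (a - lo) / (hi - lo) - 1.0  # map each dim to [-1, 1]

def denormalize(a, lo, hi):
    return (a + 1.0) / 2.0 * (hi - lo) + lo  # back to embodiment units
```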

Truelabel's marketplace aggregates teleoperation datasets from 12,000 collectors, providing buyers with task-filtered subsets (e.g., "all grasping tasks with wrist cameras") and embodiment-filtered subsets (e.g., "Franka Emika only"). Buyers specify task taxonomies, camera requirements, and success-rate thresholds, and the platform returns matching episodes with per-episode pricing. This marketplace model reduces the capital cost of building proprietary teleoperation rigs from $50,000-$200,000 to $5,000-$20,000 in dataset licensing fees.

Common Failure Modes and Debugging Strategies

Teleoperation rigs fail in predictable ways. Camera desynchronization manifests as policies that grasp 5-10 cm away from target objects. Diagnose by plotting camera frame timestamps vs joint-state timestamps—gaps >10ms indicate dropped frames. Fix by enabling hardware triggering or increasing camera node priority in the ROS 2 executor. MCAP's message inspection tools visualize per-topic timestamp distributions.
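
A quick NumPy diagnosis for dropped frames, using the nominal 30 Hz period and the 10ms gap threshold quoted above:

```python
import numpy as np

def frame_gaps(frame_ts_ns, hz=30, slack_ms=10):
    """Return indices and sizes (ms) of inter-frame gaps exceeding the
    nominal period plus slack, a signature of dropped frames."""
    deltas = np.diff(np.asarray(frame_ts_ns, dtype=np.int64))
    bad = np.flatnonzero(deltas > 1e9 / hz + slack_ms * 1e6)
    return bad, deltas[bad] / 1e6
```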

Calibration drift occurs when camera mounts shift due to vibration or thermal expansion. Symptoms include policies that succeed in the morning but fail in the afternoon. Re-run ChArUco calibration weekly and log calibration residuals in episode metadata. LeRobot's calibration validator compares current extrinsics to baseline and alerts if translation error exceeds 5mm or rotation error exceeds 2°.
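
A drift check in the spirit of that validator, using the thresholds from the text (the rotation error is the angle of the relative rotation between extrinsics):

```python
import numpy as np

def extrinsic_drift(R_base, t_base, R_new, t_new):
    """Translation (mm) and rotation (degrees) between two extrinsics."""
    trans_mm = np.linalg.norm(t_new - t_base) * 1000.0
    R_delta = R_new @ R_base.T
    cos_angle = np.clip((np.trace(R_delta) - 1.0) / 2.0, -1.0, 1.0)
    return trans_mm, np.degrees(np.arccos(cos_angle))

def needs_recalibration(R_base, t_base, R_new, t_new):
    mm, deg = extrinsic_drift(R_base, t_base, R_new, t_new)
    return mm > 5.0 or deg > 2.0
```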

Operator fatigue degrades demonstration quality after 4-6 hours of continuous teleoperation. Action variance increases by 40-60% in the final 2 hours of an 8-hour shift[1]. Mitigate by scheduling 15-minute breaks every 90 minutes and rotating operators across tasks. ALOHA's data-collection protocol limits operators to 4-hour shifts with mandatory breaks[3].

Gripper force miscalibration causes policies to drop objects or crush fragile items. Teleoperation rigs must log gripper force alongside joint states. DROID uses Robotiq 2F-85 grippers with force feedback, recording 0-100 N force traces at 30 Hz[1]. Policies trained without force data fail on 25% of tasks requiring force modulation (e.g., closing a drawer without slamming).

Licensing and Commercialization of Teleoperation Datasets

Teleoperation datasets are high-value assets—BridgeData V2's 60,000 episodes required 6 months of operator time and $100,000 in hardware/labor costs[2]. Licensing terms determine commercial viability. Creative Commons BY 4.0 permits commercial use with attribution, suitable for open-source research datasets. CC BY-NC 4.0 restricts commercial use, limiting buyers to academic research.

Proprietary datasets use custom licenses specifying: (1) permitted use cases (training, evaluation, redistribution), (2) geographic restrictions, (3) model commercialization rights, (4) derivative work terms. Scale AI's physical-AI datasets license episodes at $0.50-$5.00 per episode depending on task complexity and exclusivity. Truelabel's marketplace enforces per-episode licensing via cryptographic provenance tags, enabling buyers to audit dataset lineage and verify operator credentials.

Data provenance is critical for regulatory compliance. The EU AI Act requires high-risk AI systems to document training data sources, collection methods, and operator consent. PROV-DM metadata embedded in episode files provides auditable lineage from raw sensor logs to processed training samples. Open X-Embodiment publishes PROV-DM graphs for all 1 million episodes, enabling buyers to filter by data-collection site, operator demographics, and rig configuration[5].

Future Directions: Autonomous Data Collection and Sim-to-Real

Autonomous data collection reduces operator labor by using policies to generate synthetic demonstrations. RoboCat uses a self-improvement loop: train a policy on human demonstrations, deploy it to collect autonomous rollouts, filter successful episodes, and retrain on the augmented dataset[8]. After 5 iterations, RoboCat's autonomous data matches human teleoperation quality on 60% of tasks.

Domain randomization in simulation generates infinite synthetic demonstrations by varying object textures, lighting, and physics parameters. NVIDIA Cosmos uses diffusion models to generate photorealistic robot manipulation videos conditioned on task descriptions, producing 1 million synthetic episodes for pretraining[9]. Sim-to-real transfer success rates reach 70-80% when synthetic data is mixed with 10-20% real teleoperation data.

Truelabel's marketplace will integrate autonomous data collection by 2026, enabling buyers to specify task requirements and receive datasets generated by a fleet of 12,000 distributed robots. Operators transition from teleoperation to quality validation, reviewing autonomous rollouts and flagging failures for retraining. This hybrid model reduces cost per episode from $5-10 to $1-2 while maintaining 80%+ task success rates.


External references and source context

  1. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID dataset collected 76,000 trajectories across 564 skills using teleoperation rigs; demonstrates cross-embodiment generalization and operator fatigue effects.

    arXiv
  2. BridgeData V2: A Dataset for Robot Learning at Scale

    BridgeData V2 collected 60,000 trajectories with 3-4 calibrated RGB-D cameras; demonstrates object placement randomization improves generalization by 25%.

    arXiv
  3. ALOHA: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    ALOHA bimanual teleoperation system using ViperX 300 leader-follower arms; collected 650 demonstrations achieving 80% dish-racking success.

    tonyzhaozh.github.io
  4. RoboNet: Large-Scale Multi-Robot Learning

    RoboNet collected 15 million frames across 7 robot platforms using VR teleoperation; demonstrates cross-embodiment learning.

    arXiv
  5. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment aggregated 1 million episodes from 22 robot embodiments; uses RLDS format and PROV-DM provenance metadata.

    arXiv
  6. RoboCasa project site

    RoboCasa kitchen counter mockups for household manipulation; collected 100,000 demonstrations.

    robocasa.ai
  7. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 uses language-conditioned policies and web-scale pretraining; demonstrates cross-embodiment transfer via embodiment-agnostic representations.

    arXiv
  8. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation

    RoboCat self-improving generalist agent using autonomous data collection loop; autonomous data matches human quality on 60% of tasks after 5 iterations.

    arXiv
  9. NVIDIA Cosmos World Foundation Models

    NVIDIA Cosmos uses diffusion models to generate photorealistic robot manipulation videos; produces 1 million synthetic episodes for pretraining.

    NVIDIA Developer

FAQ

What is the minimum hardware cost to build a functional teleoperation rig for robot manipulation data collection?

A minimal single-arm rig costs $3,000-$5,000: robot arm with position control API ($1,500-$2,500 for a used UR3 or Franka Emika), SpaceMouse or VR controller ($200-$800), two RGB cameras ($160 for 2× Logitech C920), and a Linux workstation with GPU ($1,000-$1,500). This configuration suits simple pick-place tasks at 30-40 episodes per 8-hour shift. Leader-follower rigs like ALOHA cost $16,000 for bimanual setups but achieve 85-90% task success vs 60-75% for VR interfaces. Production rigs add $2,000-$5,000 for RGB-D cameras (Intel RealSense D435), calibration fixtures, and automated reset mechanisms.

How many demonstration episodes are required to train a functional imitation learning policy?

Simple pick-place tasks require 50-200 episodes for 70-80% test success using Diffusion Policy or ACT. Bimanual tasks (folding, dishwasher loading) require 300-1,000 episodes for 60-70% success. ALOHA achieved 80% success on dish-racking with 650 demonstrations, but only 40% with 200 demonstrations. Multi-task generalist policies like RT-2 and OpenVLA require 10,000-1,000,000 episodes across diverse tasks to achieve 50-60% zero-shot success on unseen tasks. Dataset size trades off with task complexity and policy architecture—transformer-based policies require 3-5× more data than convolutional policies for equivalent performance.

What recording frame rate and timestamp precision are necessary for imitation learning datasets?

Joint states should be recorded at 15-30 Hz to capture manipulation dynamics without aliasing fast motions. RGB cameras typically run at 15-30 Hz (constrained by USB bandwidth for multiple cameras). Depth cameras (Intel RealSense) run at 15-30 Hz for aligned RGB-D streams. Timestamp precision must be <5ms across all sensors to prevent action-observation desynchronization. MCAP and HDF5 formats support nanosecond timestamps. Enable real-time kernel patches (PREEMPT_RT) on the recording workstation to reduce scheduling jitter from 10-20ms to <1ms. Hardware-triggered cameras eliminate software jitter entirely by synchronizing frame capture to a master clock.

How do I validate that my teleoperation rig is producing high-quality demonstrations before collecting a full dataset?

Collect a pilot dataset of 100-200 episodes on a representative task, train a baseline policy (Diffusion Policy or ACT via LeRobot), and measure test success rate on 20 held-out episodes. Target 70-80% success for simple tasks. If success is <50%, diagnose via: (1) visualize policy-predicted actions vs ground-truth actions (high error indicates model underfitting or data quality issues), (2) overlay predicted grasp points on RGB frames to check camera calibration (misalignment indicates extrinsic errors), (3) compute per-episode action variance (high variance indicates operator inconsistency). Real-time quality metrics during teleoperation (joint velocity histograms, gripper force traces) help operators self-correct before saving low-quality episodes.

What are the most common failure modes when scaling from a prototype rig to production data collection?

Camera desynchronization causes policies to grasp 5-10 cm away from targets—fix by enabling hardware triggering or increasing camera node priority. Calibration drift (camera mounts shift due to vibration) causes policies to fail after 1-2 weeks—re-run ChArUco calibration weekly and log residuals in metadata. Operator fatigue increases action variance by 40-60% after 4-6 hours—limit shifts to 4 hours with mandatory breaks. Gripper force miscalibration causes policies to drop or crush objects—log force traces alongside joint states and validate force modulation on test tasks. Storage bottlenecks occur when writing 3-4 camera streams to disk—use NVMe SSDs (3,000+ MB/s write speed) and JPEG compression to reduce file size by 80%.

Can I use teleoperation datasets collected on one robot embodiment to train policies for a different robot?

Cross-embodiment transfer requires embodiment-agnostic action representations. Open X-Embodiment normalizes joint actions to [-1, 1] ranges and represents end-effector poses in a canonical base frame, enabling policies to transfer between 7-DoF and 6-DoF arms. RT-2 uses language-conditioned policies that abstract over embodiment-specific joint configurations. Transfer success rates are 40-60% for similar embodiments (e.g., Franka Emika to UR5) and 20-30% for dissimilar embodiments (e.g., 6-DoF arm to humanoid hand). Mixing 10-20% target-embodiment data with source-embodiment data improves transfer to 70-80%. DROID collected data across 6 embodiments and demonstrated 40% zero-shot transfer to unseen robots.

Looking for teleoperation rig setup?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

List Your Teleoperation Dataset on Truelabel