
Physical AI Data Collection

How to Record Bimanual Robot Demonstrations

Bimanual demonstration recording captures synchronized dual-arm manipulation trajectories for training policies like ALOHA and RT-X. Core requirements: hardware synchronization across two robot arms (≤5ms timestamp drift), teleoperation interfaces that map human bimanual input to dual end-effectors, and storage formats (RLDS, HDF5, MCAP) that preserve per-arm action-observation tuples with shared episode metadata. Quality hinges on temporal alignment, action space consistency across arms, and operator training for coordinated two-hand tasks.

Updated 2025-01-15
By truelabel
Reviewed by truelabel · bimanual robot demonstrations

Quick facts

Difficulty: Intermediate
Audience: Physical AI data engineers
Last reviewed: 2025-01-15

Why Bimanual Demonstrations Matter for Physical AI

Bimanual manipulation—tasks requiring coordinated control of two arms—represents a frontier in physical AI. Single-arm policies struggle with object handoffs, large-object manipulation, and tasks requiring stabilization plus actuation (e.g., unscrewing a jar lid). The Open X-Embodiment dataset aggregates 1M+ trajectories but bimanual episodes remain <8% of the corpus[1], creating a data bottleneck for dual-arm generalization.

ALOHA (A Low-cost Open-source Hardware System for Bimanual Teleoperation) demonstrated that 50-episode bimanual datasets enable imitation learning for complex tasks like cable routing and dishwasher loading[2]. Subsequent work—DROID (76K trajectories, 564 skills) and RH20T (110K dual-hand episodes)—proved that scale improves bimanual policy robustness. Yet recording quality varies: timestamp misalignment >10ms between arms degrades action prediction accuracy by 15-30% in transformer-based policies[3].

Bimanual data collection costs 2-3× single-arm pipelines due to hardware duplication, calibration complexity, and operator training overhead. Scale AI's Physical AI platform and Truelabel's marketplace now offer bimanual teleoperation services, but in-house recording remains necessary for proprietary tasks or novel morphologies. This guide covers hardware setup, synchronization strategies, teleoperation workflows, and format choices for production-grade bimanual datasets.

Hardware Architecture for Dual-Arm Recording

Bimanual recording requires two robot arms with matched or complementary workspaces, synchronized control loops, and a teleoperation interface that maps human bimanual input to dual end-effectors. Hardware choices fall into three tiers: research-grade systems (Franka Panda pairs, UR5e dual setups), low-cost open-hardware (ALOHA, Trossen ViperX 300 pairs), and custom industrial rigs.

Franka's FR3 Duo ships as a pre-calibrated bimanual cell with shared base frame and 7-DoF arms, eliminating manual extrinsic calibration. UR cobots paired with OnRobot grippers offer similar reliability but require custom mounting and hand-eye calibration. ALOHA's design—two Leader arms (human input) and two Follower arms (robot execution)—costs ~$20K and supports 6-DoF teleoperation with sub-5ms latency[2]. The Leader-Follower architecture decouples human ergonomics from robot kinematics, critical for tasks requiring precision or awkward poses.

Synchronization is the hardest constraint. Each arm runs an independent control loop (typically 100-1000 Hz); recording must timestamp observations and actions from both arms with shared clock references. ROS 1 uses system time (`rospy.Time.now()`), vulnerable to NTP drift and multi-machine clock skew. ROS 2 improves this with DDS time synchronization, but hardware triggers (e.g., GPIO pulses from a master clock) remain the gold standard for <1ms alignment. MCAP and rosbag2_storage_mcap preserve nanosecond timestamps and support post-hoc re-synchronization via message correlation.
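When hardware triggers are unavailable, the software fallback is to stamp every reading against one clock on one machine. A minimal sketch, assuming both arm drivers can be polled from a single process; the reader functions are hypothetical placeholders, not any specific driver API:

```python
import time

def read_left_arm():
    """Placeholder for a driver call returning 7 joint positions (hypothetical)."""
    return [0.0] * 7

def read_right_arm():
    """Placeholder for a driver call returning 7 joint positions (hypothetical)."""
    return [0.0] * 7

def record_step():
    """Stamp both arms against one monotonic clock and log the serial-read skew."""
    t_left = time.monotonic_ns()
    q_left = read_left_arm()
    t_right = time.monotonic_ns()
    q_right = read_right_arm()
    skew_ms = (t_right - t_left) / 1e6  # keep this well under the 5 ms budget
    return {"t_left_ns": t_left, "q_left": q_left,
            "t_right_ns": t_right, "q_right": q_right,
            "skew_ms": skew_ms}

if __name__ == "__main__":
    step = record_step()
    print(f"inter-arm read skew: {step['skew_ms']:.3f} ms")
```

Logging the skew per step makes post-hoc filtering of badly timed episodes straightforward.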

Camera placement for bimanual tasks requires 3-5 viewpoints: wrist-mounted cameras on each arm (egocentric manipulation detail), a static overhead camera (global scene context), and optional side cameras for occlusion handling. DROID uses 4 cameras per episode (2 wrist, 2 external) and records at 15 Hz to balance storage cost and temporal resolution[4]. Depth cameras (RealSense D435, Zed 2) add 3D geometry but double bandwidth; many pipelines defer depth to inference-time reconstruction.

Teleoperation Interfaces and Operator Training

Teleoperation quality sets the performance ceiling of a dataset. Bimanual interfaces must provide intuitive two-hand control, haptic feedback (optional but beneficial), and low cognitive load to sustain 20-60 minute recording sessions. Three interface paradigms dominate: Leader-Follower (ALOHA), VR controllers (Meta Quest, Vive), and motion-capture gloves (Manus, StretchSense).

Leader-Follower systems—where operators manipulate physical robot arms that mirror their motion onto Follower arms—offer the lowest learning curve. ALOHA operators achieve task proficiency in 10-30 demonstrations[2]; the physical correspondence between Leader and Follower eliminates the cognitive translation step required by VR. Drawbacks: workspace mismatch (Leader arms may have different reach than Followers), and mechanical backlash in Leader joints introduces noise into recorded actions.

OpenVLA and RT-2 training pipelines increasingly use VR teleoperation for flexibility: operators work in arbitrary environments without transporting Leader hardware. VR controllers map 6-DoF hand poses to end-effector targets; inverse kinematics solves for joint commands. Latency is critical—>50ms hand-to-robot delay causes oscillations and task failure. DROID's data collection used Quest 2 headsets with custom Unity apps, achieving 30 Hz control rates and 20ms round-trip latency[4].
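A minimal sketch of the latency guard implied above: map a controller pose into the robot base frame and skip the cycle if the pose is older than ~50 ms. The frame offset, callback, and pose layout are illustrative assumptions, not part of any headset SDK:

```python
import time
import numpy as np

MAX_LATENCY_S = 0.05  # pause teleop above ~50 ms hand-to-robot delay

def controller_to_target(pos_m, quat_xyzw, workspace_offset=np.array([0.4, 0.0, 0.2])):
    """Map a 6-DoF controller pose into the robot base frame (offset is illustrative)."""
    return np.asarray(pos_m) + workspace_offset, np.asarray(quat_xyzw)

def teleop_step(pos_m, quat_xyzw, pose_timestamp_s, send_target):
    """Forward the target only if the pose is fresh; otherwise hold to avoid oscillation."""
    if time.time() - pose_timestamp_s > MAX_LATENCY_S:
        return False  # stale pose: skip this control cycle
    pos, quat = controller_to_target(pos_m, quat_xyzw)
    send_target(pos, quat)  # hypothetical hook into the IK/controller stack
    return True
```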

Operator training is non-trivial. Bimanual tasks require coordinated timing (e.g., one hand stabilizes while the other manipulates), spatial reasoning (avoiding self-collisions), and consistency (repeatable grasp strategies). Training protocols: 5-10 practice episodes per task, video review of failure modes, and inter-operator calibration sessions to align action distributions. EPIC-KITCHENS found that operator variability accounts for 12-18% of action label variance in egocentric manipulation datasets[5]; bimanual tasks amplify this due to coordination complexity.

Synchronization and Timestamping Strategies

Temporal alignment between arms, cameras, and force sensors is the most common failure mode in bimanual datasets. Policies trained on misaligned data learn spurious correlations (e.g., left-arm action conditioned on right-arm observation from 50ms in the future), degrading sim-to-real transfer and cross-embodiment generalization.

Hardware synchronization uses a master clock (e.g., a microcontroller broadcasting GPIO triggers at 100 Hz) to timestamp all sensors. Each device latches its measurement on the rising edge of the trigger pulse, guaranteeing <1ms alignment. MCAP's log_time vs publish_time fields distinguish capture time (when the sensor sampled) from message time (when the data entered the recording pipeline), enabling post-hoc drift correction. This approach requires custom firmware or hardware trigger support—unavailable on many commercial robots.

Software synchronization relies on NTP or PTP (Precision Time Protocol) to align system clocks across machines. ROS 2's DDS middleware supports PTP; achievable accuracy is 1-10ms on Ethernet, 10-50ms on WiFi. For bimanual recording, run both arm controllers and the camera server on a single machine with a real-time kernel (PREEMPT_RT) to minimize scheduling jitter. RLDS (Reinforcement Learning Datasets) recommends recording raw timestamps and applying offline synchronization via cross-correlation of overlapping sensor modalities (e.g., aligning wrist camera frames by detecting gripper motion)[6].
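A minimal sketch of the offline cross-correlation step, assuming two scalar signals (e.g., left- and right-arm gripper speed) already resampled to the same rate; under this sign convention a negative lag means the second signal lags the first:

```python
import numpy as np

def estimate_offset_ms(sig_a, sig_b, rate_hz):
    """Estimate the constant lag of sig_b relative to sig_a (both 1-D, same rate)
    by locating the peak of their normalized cross-correlation."""
    a = (sig_a - sig_a.mean()) / (sig_a.std() + 1e-8)
    b = (sig_b - sig_b.mean()) / (sig_b.std() + 1e-8)
    corr = np.correlate(a, b, mode="full")
    lag_samples = np.argmax(corr) - (len(b) - 1)  # negative: sig_b lags sig_a
    return 1000.0 * lag_samples / rate_hz

if __name__ == "__main__":
    rate = 100.0                       # Hz, e.g. downsampled gripper speed
    t = np.arange(0, 10, 1 / rate)
    left = np.sin(2 * np.pi * 0.5 * t)
    right = np.roll(left, 3)           # simulate the right arm lagging by 30 ms
    print(f"estimated offset: {estimate_offset_ms(left, right, rate):.1f} ms")
```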

Validation: plot per-arm action timeseries and verify phase alignment for coordinated tasks. A 10ms offset in a 1 Hz pick-and-place task is negligible; the same offset in a 10 Hz bimanual assembly task causes 10% of actions to reference stale observations. RT-1's data pipeline discards episodes with >15ms timestamp drift between modalities[3], a threshold derived from empirical policy performance degradation.

Action and Observation Space Design

Bimanual action spaces must encode per-arm commands while preserving task semantics. Common representations: joint positions (7-DoF × 2 arms = 14-D), end-effector poses (6-DoF SE(3) × 2 = 12-D + 2 gripper states), or hybrid (joints for one arm, Cartesian for the other). Choice depends on policy architecture and task structure.

Joint-space actions are hardware-specific but avoid singularities and kinematic ambiguities. ALOHA records 14-D joint positions at 50 Hz; policies learn direct joint-to-joint mappings, simplifying sim-to-real transfer for the same morphology[2]. Drawback: zero-shot transfer to different arm models requires inverse kinematics and workspace remapping. Open X-Embodiment normalizes actions to end-effector deltas (Δx, Δy, Δz, Δroll, Δpitch, Δyaw, Δgripper) to enable cross-embodiment training, but this loses joint-space constraints (e.g., elbow-up vs elbow-down solutions)[1].
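A sketch of the end-effector delta normalization described above for one arm, assuming poses are stored as xyz positions plus xyzw quaternions and gripper state in [0, 1]; concatenating left and right outputs gives the 12-D + 2 gripper bimanual action:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def ee_delta_actions(positions, quats_xyzw, gripper):
    """Convert per-step end-effector poses for one arm into delta actions.

    positions: (T, 3) xyz in meters; quats_xyzw: (T, 4); gripper: (T,) in [0, 1].
    Returns (T-1, 7) actions: (dx, dy, dz, droll, dpitch, dyaw, gripper_target).
    """
    dpos = np.diff(positions, axis=0)                    # translation deltas
    rots = R.from_quat(quats_xyzw)
    drot = (rots[1:] * rots[:-1].inv()).as_euler("xyz")  # relative rotation as roll/pitch/yaw
    return np.concatenate([dpos, drot, gripper[1:, None]], axis=1)
```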

Observation spaces for bimanual tasks require 4-6 camera views to resolve occlusions. Wrist cameras (one per arm) capture manipulation detail; external cameras provide scene context. DROID uses 224×224 RGB images at 15 Hz, compressed to JPEG (quality=95) for storage efficiency[4]. Proprioceptive observations—joint positions, velocities, torques—are recorded at control frequency (100-1000 Hz) then downsampled to match camera rate during training. Force-torque sensors at the wrist add 6-D per arm but require careful calibration to remove gravity and inertial biases.

The RLDS and LeRobot dataset formats define per-timestep observation dicts with nested arm-specific keys (e.g., `observation['left_arm']['joint_pos']`, `observation['right_arm']['wrist_image']`). This structure supports heterogeneous arms (different DoF, sensor suites) and simplifies multi-arm policy architectures that process each arm's stream independently before fusing at the action head.
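A sketch of one timestep in this nested per-arm layout; the key names and shapes are illustrative and should match whatever schema your dataset declares:

```python
import numpy as np

# One timestep in the nested per-arm layout described above (illustrative keys).
timestep = {
    "observation": {
        "left_arm": {
            "joint_pos": np.zeros(7, dtype=np.float32),
            "wrist_image": np.zeros((224, 224, 3), dtype=np.uint8),
        },
        "right_arm": {
            "joint_pos": np.zeros(7, dtype=np.float32),
            "wrist_image": np.zeros((224, 224, 3), dtype=np.uint8),
        },
        "overhead_image": np.zeros((224, 224, 3), dtype=np.uint8),
    },
    "action": {
        "left_arm": np.zeros(7, dtype=np.float32),   # joint targets or deltas
        "right_arm": np.zeros(7, dtype=np.float32),
    },
    "timestamp_ns": 0,
}
```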

Storage Formats: HDF5, RLDS, and MCAP Trade-offs

Bimanual datasets range from 50 GB (100 episodes, 4 cameras, 30 seconds each) to 5 TB (10K episodes, 6 cameras, 2 minutes each). Format choice affects storage cost, read performance during training, and ecosystem compatibility. Three formats dominate: HDF5, RLDS, and MCAP.

HDF5 is the incumbent standard for robotics datasets. ALOHA, DROID, and RH20T all ship HDF5 files with hierarchical episode structure: `/episode_0/observations/left_wrist_image`, `/episode_0/actions/left_arm_joints`[2]. HDF5 supports chunked compression (gzip, lz4), random access, and partial reads—critical for training dataloaders that sample random timesteps. Drawbacks: no native timestamp semantics (timestamps stored as arrays), limited metadata schema (attributes are unstructured key-value pairs), and poor append performance (adding episodes requires rewriting indices).
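A minimal h5py sketch of this hierarchical layout with chunked gzip compression; array shapes are illustrative and only the left arm is shown (repeat the datasets for the right arm):

```python
import h5py
import numpy as np

T = 450  # 30 s at 15 Hz (illustrative)

with h5py.File("episode_0.hdf5", "w") as f:
    obs = f.create_group("episode_0/observations")
    act = f.create_group("episode_0/actions")
    # Chunk by timestep so dataloaders can read random indices without full decompression.
    obs.create_dataset("left_wrist_image",
                       data=np.zeros((T, 224, 224, 3), dtype=np.uint8),
                       chunks=(1, 224, 224, 3), compression="gzip")
    obs.create_dataset("left_arm_joint_pos",
                       data=np.zeros((T, 7), dtype=np.float32), compression="gzip")
    act.create_dataset("left_arm_joints",
                       data=np.zeros((T, 7), dtype=np.float32), compression="gzip")
    # Metadata lands in unstructured key-value attributes, as noted above.
    f["episode_0"].attrs["operator_id"] = "op_03"
    f["episode_0"].attrs["task"] = "bimanual_assembly"
```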

RLDS wraps TensorFlow Datasets with reinforcement-learning semantics: episodes as sequences of (observation, action, reward, discount) tuples. RLDS uses Parquet or TFRecord backends, enabling columnar compression (10-30% smaller than HDF5 for image-heavy datasets) and integration with Hugging Face Datasets[6]. RLDS enforces schema validation (observation/action spaces declared upfront) but lacks native support for variable-length episodes or multi-rate sensors (e.g., 1000 Hz joint states, 15 Hz images).

MCAP is a container format designed for multi-modal time-series data. It preserves nanosecond timestamps, supports arbitrary message schemas (Protobuf, JSON, ROS), and enables efficient time-range queries. ROS 2 uses MCAP as the default bag format; Foxglove provides visualization and annotation tools. MCAP excels at raw data archival (pre-processing) but requires conversion to HDF5 or RLDS for training—most policy codebases expect dense arrays, not message streams. Hybrid pipelines: record to MCAP, process to RLDS/HDF5 for training.

Quality Validation and Failure Mode Detection

Bimanual datasets fail silently: timestamp drift, action clipping, and camera occlusions degrade policy performance without obvious errors in the data files. Validation must check structural integrity, statistical properties, and task-specific semantics.

Structural checks: verify episode lengths (discard <5 second episodes—insufficient context for policies), confirm action/observation shapes match declared spaces, detect NaN or inf values (common in force-torque sensors during contact), and validate timestamp monotonicity (out-of-order messages indicate recording bugs). LeRobot's dataset validator automates these checks and flags episodes with >5% missing frames[7].
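A minimal sketch of these structural checks over a per-episode dict of arrays; the key names are illustrative and the thresholds follow the text above:

```python
import numpy as np

MIN_EPISODE_S = 5.0

def structural_errors(ep, control_hz):
    """Return a list of structural problems for one episode.

    ep: dict with 'timestamps_ns' (T,), 'actions' (T, D), 'wrench' (T, 12) arrays
    (illustrative keys; 12 = 6-D force-torque per arm x 2).
    """
    errors = []
    T = len(ep["timestamps_ns"])
    if T / control_hz < MIN_EPISODE_S:
        errors.append(f"episode too short: {T / control_hz:.1f} s")
    if np.any(np.diff(ep["timestamps_ns"]) <= 0):
        errors.append("timestamps not strictly increasing")
    for key in ("actions", "wrench"):
        if not np.all(np.isfinite(ep[key])):
            errors.append(f"NaN/inf values in {key}")
    return errors
```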

Statistical checks: plot per-joint action distributions and flag outliers (>3σ from mean suggests miscalibration or collision), compute inter-arm correlation for coordinated tasks (low correlation in bimanual assembly indicates poor synchronization), and measure action smoothness via finite differences (high-frequency noise indicates inadequate filtering). RT-1's data pipeline applies Butterworth low-pass filters (10 Hz cutoff) to joint velocities before computing actions[3].
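A sketch of two of these checks, assuming joint velocities and actions as (T, D) arrays: a zero-phase 10 Hz Butterworth low-pass (the cutoff the RT-1 pipeline reportedly uses) and a 3σ per-joint outlier fraction:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def smooth_joint_velocities(qvel, fs_hz, cutoff_hz=10.0):
    """Zero-phase low-pass filter of joint velocities (T, D) before computing actions."""
    b, a = butter(N=4, Wn=cutoff_hz, btype="low", fs=fs_hz)
    return filtfilt(b, a, qvel, axis=0)

def outlier_fraction(actions, n_sigma=3.0):
    """Fraction of action samples more than n_sigma from each joint's mean."""
    z = np.abs(actions - actions.mean(axis=0)) / (actions.std(axis=0) + 1e-8)
    return float((z > n_sigma).mean())
```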

Task-specific validation: for pick-and-place, verify gripper state transitions (open→closed→open), check object displacement (>5 cm from start to goal), and confirm success via final-frame image classification. For bimanual assembly, validate that both arms' end-effectors enter the task workspace (bounding box around assembly target) and that contact forces exceed thresholds during insertion. DROID uses human annotators to label success/failure for 10% of episodes, then trains a classifier to auto-label the remaining 90%[4].

Failure taxonomy: timestamp drift (>10ms between arms), action clipping (joint limits violated in 5%+ of steps), camera occlusion (one arm blocks the other's wrist camera for >30% of episode), and task incompletion (object not displaced, assembly not seated). Discard or re-record episodes with any critical failure; marginal episodes (e.g., minor occlusion) can be retained if the dataset is <500 episodes.

Operator Workflows and Session Management

Recording 100-1000 bimanual episodes requires structured workflows to maintain quality and operator morale. Session design: 20-40 minute blocks (cognitive fatigue degrades performance beyond 40 minutes), 5-10 episodes per task per session, and mandatory breaks between sessions. EPIC-KITCHENS recorded 100 hours of egocentric manipulation across 37 participants over 6 months; bimanual tasks required 2× the recording time per episode due to coordination complexity[5].

Task decomposition: break complex tasks into sub-tasks (e.g., 'grasp left object', 'transfer to right hand', 'insert into fixture') and record each sub-task separately. This enables compositional policy training and simplifies failure diagnosis. ALOHA's cable routing task decomposes into 4 sub-tasks; policies trained on sub-task datasets achieve 80% success vs 50% for end-to-end policies[2].

Metadata capture: record operator ID, task variant (object size, color, pose), environment conditions (lighting, background clutter), and success/failure labels. Store metadata in episode-level attributes (HDF5) or as separate JSON manifests (RLDS). Open X-Embodiment defines a metadata schema with 40+ fields including robot morphology, camera intrinsics, and task language descriptions[1].
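A sketch of an episode-level JSON manifest; the field names loosely echo the metadata fields mentioned above but are illustrative, not a formal schema:

```python
import json

manifest = {
    "episode_id": "ep_000412",
    "operator_id": "op_03",
    "task": "insert_connector",
    "task_variant": {"object_color": "red", "object_size_mm": 40},
    "environment": {"lighting": "overhead_led", "background": "cluttered"},
    "robot": {"left": "vx300s", "right": "vx300s", "control_hz": 50},
    "cameras": ["left_wrist", "right_wrist", "overhead", "side"],
    "success": True,
    "recorded_at": "2025-01-10T14:32:00Z",
}

with open("ep_000412.json", "w") as f:
    json.dump(manifest, f, indent=2)
```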

Version control: treat datasets as code. Use Git LFS or DVC to track dataset versions, commit metadata schemas, and document processing scripts. Truelabel's provenance tracking links dataset versions to trained model checkpoints, enabling reproducibility audits and data-centric debugging. When a policy fails on a task variant, trace back to the episode subset used for training and identify coverage gaps.

Processing Pipelines: From Raw Logs to Training-Ready Data

Raw bimanual logs (MCAP, rosbag2) require 5-10 processing steps before training: timestamp alignment, action computation, image preprocessing, data augmentation, and format conversion. Pipelines must be deterministic (same input → same output) and versioned (processing code + hyperparameters tracked in Git).

Timestamp alignment: resample all modalities to a common frequency (typically camera rate, 10-30 Hz). Use linear interpolation for joint positions, nearest-neighbor for images, and forward-fill for discrete states (gripper open/closed). RLDS provides `align_and_batch` utilities that handle multi-rate sensors[6]. Validate alignment by plotting overlaid timeseries and checking for phase shifts.
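A hand-rolled numpy sketch of the three rules above (not the RLDS `align_and_batch` utility itself), assuming nanosecond timestamp arrays per modality:

```python
import numpy as np

def resample_joints(t_src_ns, q_src, t_target_ns):
    """Linear interpolation of joint positions (T, D) onto target (camera) timestamps."""
    return np.stack([np.interp(t_target_ns, t_src_ns, q_src[:, d])
                     for d in range(q_src.shape[1])], axis=1)

def nearest_frame_indices(t_frames_ns, t_target_ns):
    """Nearest-neighbor lookup of image frames at each target timestamp."""
    idx = np.clip(np.searchsorted(t_frames_ns, t_target_ns), 1, len(t_frames_ns) - 1)
    left_closer = (t_target_ns - t_frames_ns[idx - 1]) < (t_frames_ns[idx] - t_target_ns)
    return np.where(left_closer, idx - 1, idx)

def forward_fill_gripper(t_src_ns, g_src, t_target_ns):
    """Forward-fill discrete gripper state (0=open, 1=closed) onto target timestamps."""
    idx = np.clip(np.searchsorted(t_src_ns, t_target_ns, side="right") - 1, 0, None)
    return g_src[idx]
```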

Action computation: convert raw joint positions to actions (position deltas, velocities, or torques). For position control, actions are `a_t = q_{t+1} - q_t` where `q` is joint position. Apply Savitzky-Golay filters (window=5, polynomial=2) to smooth noise before differentiation. RT-1 clips actions to [-2, +2] radians/sec to remove outliers from encoder glitches[3].
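A minimal sketch of this step, assuming joint positions `q` of shape (T, D) sampled at `fs_hz`; the clip reproduces the RT-1-style ±2 rad/s cut expressed as a per-step delta:

```python
import numpy as np
from scipy.signal import savgol_filter

def position_delta_actions(q, fs_hz, clip_rad_per_s=2.0):
    """Compute a_t = q_{t+1} - q_t after Savitzky-Golay smoothing (window=5, poly=2),
    clipping implied joint velocities to +/- clip_rad_per_s."""
    q_smooth = savgol_filter(q, window_length=5, polyorder=2, axis=0)
    a = np.diff(q_smooth, axis=0)
    max_step = clip_rad_per_s / fs_hz   # rad/s expressed as a per-step position delta
    return np.clip(a, -max_step, max_step)
```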

Image preprocessing: resize to model input resolution (224×224 for ResNet, 256×256 for ViT), normalize to [0,1] or ImageNet statistics, and optionally apply data augmentation (random crops, color jitter). OpenVLA uses CLIP preprocessing (bicubic resize, center crop) to match its vision encoder's pretraining distribution[8]. Store preprocessed images as JPEG (quality=95) or PNG; avoid lossy compression >2× (artifacts degrade visual features).
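A sketch of the per-frame preprocessing and storage settings above using Pillow; the normalization constants are the standard ImageNet statistics and the function names are illustrative:

```python
from PIL import Image
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess_frame(path, size=224):
    """Bicubic resize to model resolution and ImageNet-normalize one RGB frame."""
    img = Image.open(path).convert("RGB").resize((size, size), Image.BICUBIC)
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return (arr - IMAGENET_MEAN) / IMAGENET_STD

def store_frame(img, out_path):
    """Store at JPEG quality 95, matching the storage setting described above."""
    img.save(out_path, format="JPEG", quality=95)
```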

Data augmentation: spatial (random crops, horizontal flips—only if task symmetry allows), temporal (random frame skips to simulate variable control rates), and domain randomization (background replacement, lighting variation). Domain randomization improves sim-to-real transfer but can hurt real-to-real generalization if applied too aggressively[9]. A/B test augmentation strategies by training policies on augmented vs non-augmented data and measuring success rates on held-out tasks.

Bimanual Policy Architectures and Data Requirements

Bimanual policies must model inter-arm coordination, handle variable task horizons (5-50 seconds), and generalize across object poses and environment clutter. Architecture families: behavior cloning (BC) with transformers, diffusion policies, and vision-language-action (VLA) models. Data requirements scale with model capacity and task diversity.

ALOHA's ACT (Action Chunking with Transformers) predicts 100-step action sequences (chunk size k=100) from current observations, enabling smooth long-horizon execution. ACT requires 50-200 demonstrations per task; performance saturates at ~500 episodes[2]. The model processes each arm's observation stream with separate encoders, concatenates latent representations, and decodes joint action chunks. Training: 10K gradient steps on a single A100 (6 hours).

OpenVLA fine-tunes a 7B-parameter VLA model on bimanual tasks using 1K-10K episodes from Open X-Embodiment. The model ingests language task descriptions ('pick up the red block with the left hand and place it in the blue bin') and predicts per-arm actions autoregressively[8]. Data requirements: 10× higher than ACT due to model scale, but zero-shot transfer to new tasks improves with scale (70% success on unseen tasks vs 40% for ACT).

Diffusion policies model action distributions as denoising processes, enabling multi-modal behaviors (e.g., grasp from left or right). LeRobot's diffusion policy implementation requires 100-500 episodes per task and achieves state-of-the-art performance on RoboMimic benchmarks[10]. Training: 50K steps on 4× A100 (12 hours). Diffusion policies excel at contact-rich bimanual tasks (assembly, cable routing) where action distributions are multi-modal.

Data scaling laws: policy success rate improves log-linearly with dataset size up to 1K-10K episodes, then plateaus. RT-1 trained on 130K episodes (700 tasks) achieves 97% success on seen tasks, 60% on unseen tasks[3]. Bimanual tasks require 2-3× more data than single-arm equivalents due to coordination complexity. Budget 500-1000 episodes for single-task mastery, 5K-10K for multi-task generalization.

Cost Models and Resource Planning

Bimanual data collection costs $50-500 per episode depending on hardware, operator skill, and task complexity. Budget components: hardware amortization ($20K-200K upfront, 2-5 year lifespan), operator labor ($30-100/hour for trained teleoperators), compute for processing ($0.10-1.00 per episode for image encoding and format conversion), and storage ($0.02-0.10 per GB per month for cloud archival).

Hardware tiers: ALOHA-style rigs ($20K, 6-DoF per arm, 50 Hz control) suit research and prototyping. Franka FR3 Duo ($180K, 7-DoF, 1 kHz control, force sensing) targets production deployments requiring precision and compliance. Custom industrial rigs (UR10e pairs, ABB YuMi) range $100K-300K but offer payload and reach advantages.

Operator training: 10-20 hours to achieve proficiency on simple tasks (pick-and-place), 40-80 hours for complex tasks (bimanual assembly, cable routing). Experienced operators record 3-5 episodes per hour for 30-second tasks, 1-2 episodes per hour for 2-minute tasks. Hiring: robotics technicians ($30-50/hour), mechanical engineering students ($20-30/hour), or specialized teleoperation services (Scale AI, Truelabel) at $100-200/hour fully loaded.

Processing compute: a 100-episode dataset (4 cameras, 30 seconds, 15 Hz) generates 180K images. JPEG encoding at quality=95 takes 0.5 CPU-seconds per image (25 CPU-hours total, $2.50 on AWS c6i.4xlarge). Format conversion (MCAP → HDF5) adds 10-20% overhead. Budget $5-10 per 100 episodes for processing.
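A back-of-envelope check of those figures; the per-CPU-hour rate is an assumption chosen to reproduce the $2.50 figure above, and actual per-vCPU pricing varies by instance and region:

```python
# All inputs illustrative; see the paragraph above for the source figures.
episodes, cameras, seconds, hz = 100, 4, 30, 15
images = episodes * cameras * seconds * hz       # 180,000 frames
cpu_hours = images * 0.5 / 3600                  # 0.5 CPU-seconds per JPEG encode
cost_usd = cpu_hours * 0.10                      # assumed ~$0.10 per CPU-hour
print(f"{images:,} images, {cpu_hours:.0f} CPU-hours, ~${cost_usd:.2f}")
```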

Storage: raw MCAP logs are 2-5 GB per episode (uncompressed images, high-rate joint states). Processed HDF5 datasets compress to 0.5-1 GB per episode (JPEG images, downsampled proprioception). A 1000-episode dataset requires 500-1000 GB; S3 Standard costs $23-46/month, Glacier Deep Archive $1-2/month. Use tiered storage: hot data (last 3 months) on SSD, warm data (3-12 months) on S3 Standard, cold data (>12 months) on Glacier.

Licensing and Commercialization Considerations

Bimanual datasets are high-value assets; licensing terms determine monetization potential and liability exposure. Key decisions: open vs proprietary, commercial-use restrictions, and derivative work rights. Creative Commons BY 4.0 permits commercial use with attribution; CC BY-NC 4.0 restricts commercial use, limiting buyer interest.

Open X-Embodiment aggregates datasets under heterogeneous licenses: some CC-BY, others custom academic-only terms. This creates compliance risk for commercial model training—buyers must audit per-dataset licenses[1]. Truelabel's marketplace standardizes licensing with tiered commercial terms: research-only (free), startup-friendly ($500-5K per dataset), and enterprise (custom negotiation for exclusive rights).

Data provenance is a regulatory requirement under EU AI Act Article 10 (training data governance) and NIST AI RMF (data quality documentation). Record: operator consent (GDPR Article 7 for EU operators), scene participant consent (if humans appear in camera frames), hardware calibration certificates, and processing pipeline versions. Truelabel's provenance tracking generates audit trails linking dataset versions to collection sessions, operators, and processing scripts.

Intellectual property: teleoperation data is a derivative work of the operator's actions (copyrightable) and the robot's sensor outputs (not copyrightable). Employment agreements should assign dataset IP to the employer. For crowdsourced data, use work-for-hire contracts or explicit IP assignment clauses. US FAR Subpart 27.4 governs data rights in government-funded projects; default terms grant the government unlimited rights, restricting commercial licensing.

Integration with Foundation Models and Sim-to-Real Pipelines

Bimanual datasets increasingly serve as fine-tuning data for foundation models (RT-2, OpenVLA, NVIDIA GR00T) and validation sets for sim-to-real transfer. Integration requirements: standardized observation spaces (RGB images, proprioception), language annotations (task descriptions), and success labels (binary or continuous reward signals).

RT-2 fine-tunes a 55B-parameter vision-language model (PaLI-X) on 130K robot episodes, including 8K bimanual trajectories from ALOHA and internal datasets[11]. The model predicts discretized actions (256 bins per dimension) conditioned on image observations and language instructions. Fine-tuning: 10K steps on 64 TPUv4 chips (24 hours, $15K compute cost). RT-2 achieves 62% success on unseen bimanual tasks vs 34% for RT-1 (trained from scratch).

NVIDIA GR00T N1 trains on 1M+ robot trajectories (including 50K bimanual episodes) using a transformer-based world model architecture. The model predicts future observations and rewards, enabling model-predictive control for long-horizon tasks[12]. Data requirements: 10× higher than behavior cloning due to world model capacity, but enables zero-shot transfer to new embodiments via simulation rollouts.

Sim-to-real validation: record 10-50 real-world episodes for a task, train a policy in simulation (MuJoCo, Isaac Sim), and measure sim-to-real success rate. Domain randomization improves transfer (60-80% success) but requires real-world data to tune randomization ranges[9]. DROID provides sim-to-real benchmarks for 20 bimanual tasks; median success rate is 55% for policies trained purely in sim, 85% for policies fine-tuned on 50 real episodes[4].

Emerging Standards and Ecosystem Tooling

Bimanual dataset tooling is fragmenting across research labs and vendors. Emerging standards: RLDS for RL datasets, LeRobot dataset format for Hugging Face integration, and MCAP for raw log archival. Ecosystem gaps: no standard metadata schema for bimanual tasks, no cross-platform teleoperation APIs, and limited tooling for multi-dataset aggregation.

LeRobot (Hugging Face) provides a unified interface for 15+ robot datasets including ALOHA, DROID, and RH20T. The library handles format conversion (HDF5 → Parquet), schema validation, and dataloader generation for PyTorch/JAX[10]. LeRobot's dataset viewer enables browser-based episode playback and annotation, reducing QA overhead. Limitation: LeRobot assumes fixed observation/action spaces; heterogeneous bimanual rigs (different DoF per arm) require custom adapters.

Foxglove Studio visualizes MCAP logs with synchronized multi-camera playback, 3D robot state rendering, and time-series plots. The tool supports annotation (success/failure labels, task phase boundaries) and export to HDF5/RLDS. Foxglove's cloud platform ($50/user/month) enables distributed annotation teams—critical for 1K+ episode datasets.

Truelabel's marketplace aggregates bimanual datasets from 50+ collectors with standardized metadata (robot morphology, task taxonomy, license terms) and provenance tracking (operator IDs, collection dates, processing versions). Buyers filter by task type (assembly, pick-and-place, cable routing), embodiment (Franka, UR, ALOHA), and dataset size (50-10K episodes). Pricing: $500-50K per dataset depending on exclusivity and task complexity.

Case Study: Scaling from 50 to 5000 Episodes

A robotics startup building a bimanual assembly system scaled their dataset from 50 pilot episodes (single task, single operator) to 5000 production episodes (20 tasks, 8 operators) over 18 months. Key lessons: invest in tooling early, parallelize data collection, and iterate on task decomposition.

Phase 1 (months 1-3): recorded 50 episodes of a single assembly task using an ALOHA rig and one trained operator. Data stored as raw rosbag files; processing done manually in Jupyter notebooks. Policy (ACT) achieved 60% success rate. Bottleneck: no automated QA—10% of episodes had timestamp drift or action clipping, discovered only after training.

Phase 2 (months 4-9): built an automated processing pipeline (MCAP → HDF5) with structural validation (timestamp monotonicity, action bounds) and statistical checks (action smoothness, inter-arm correlation). Hired 3 additional operators; recorded 500 episodes across 5 task variants. Policy success improved to 75%. Bottleneck: operator variability—action distributions differed by 20-30% across operators, degrading generalization.

Phase 3 (months 10-18): standardized operator training (10-hour curriculum, video review, inter-operator calibration sessions). Deployed 4 ALOHA rigs in parallel; recorded 5000 episodes across 20 tasks. Integrated LeRobot for dataset management and Foxglove for distributed QA. Policy success reached 85% on seen tasks, 55% on unseen tasks. Cost: $250K total ($50 per episode fully loaded).

Key insight: data quality matters more than quantity up to 500 episodes; beyond 500, diversity (task variants, operators, environments) drives generalization. The team's final dataset—5000 episodes, 20 tasks, 8 operators—outperformed a 10K-episode dataset from a single operator on a single task by 15 percentage points on unseen task success rate.

Future Directions: Multi-Modal Sensing and Humanoid Data

Bimanual data collection is evolving toward richer sensing (tactile, audio, depth) and humanoid morphologies (torso, head, mobile base). DROID added wrist-mounted tactile sensors (DIGIT) in 2024, capturing contact geometry at 30 Hz; policies trained on tactile data achieve 20% higher success on insertion tasks[4]. Audio (microphone arrays) enables failure detection (e.g., part drop sounds) and task phase segmentation (e.g., snap-fit clicks).

Humanoid data is the next frontier. Figure AI's partnership with Brookfield targets 100K hours of humanoid teleoperation data for warehouse tasks; the dataset will include bimanual manipulation, bipedal locomotion, and whole-body coordination[13]. NVIDIA GR00T trains on humanoid data from 5 embodiments (Agility Digit, Boston Dynamics Atlas, Unitree H1); bimanual episodes comprise 30% of the corpus[12].

Data marketplaces are professionalizing. Truelabel now offers bimanual data collection as a service: buyers specify tasks, Truelabel coordinates operators and hardware, delivers processed datasets in 2-4 weeks. Pricing: $100-300 per episode depending on task complexity and exclusivity. Scale AI launched a similar service in 2024, targeting automotive and logistics customers.

Open challenges: standardizing metadata schemas (task taxonomies, success criteria), cross-embodiment transfer (policies trained on Franka deployed to UR), and privacy-preserving data sharing (federated learning, differential privacy). The community needs a 'Bimanual ImageNet'—a 100K-episode multi-task dataset with standardized evaluation protocols—to benchmark progress and enable fair comparisons across policy architectures.


External references and source context

  1. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment aggregates 1M+ trajectories but bimanual episodes remain <8% of the corpus, creating a data bottleneck for dual-arm generalization.

    arXiv
  2. Teleoperation datasets are becoming the highest-intent physical AI content category

    ALOHA demonstrated that 50-episode bimanual datasets enable imitation learning for complex tasks like cable routing and dishwasher loading.

    tonyzhaozh.github.io
  3. RT-1: Robotics Transformer for Real-World Control at Scale

    RT-1 shows timestamp misalignment >10ms degrades action prediction accuracy by 15-30%; applies Butterworth filters and clips actions to [-2, +2] rad/s; discards episodes with >15ms drift.

    arXiv
  4. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID provides 76K trajectories across 564 skills; uses 4 cameras per episode at 15 Hz; added tactile sensors in 2024 for 20% higher insertion task success.

    arXiv
  5. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100

    EPIC-KITCHENS found operator variability accounts for 12-18% of action label variance; recorded 100 hours across 37 participants over 6 months.

    arXiv
  6. RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning

    RLDS recommends recording raw timestamps and applying offline synchronization via cross-correlation; provides align_and_batch utilities for multi-rate sensors.

    arXiv
  7. LeRobot dataset documentation

    LeRobot dataset format supports per-arm observation streams; validator flags episodes with >5% missing frames.

    Hugging Face
  8. OpenVLA: An Open-Source Vision-Language-Action Model

    OpenVLA fine-tunes 7B-parameter VLA model on 1K-10K bimanual episodes; achieves 70% success on unseen tasks; uses CLIP preprocessing.

    arXiv
  9. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World

    Domain randomization improves sim-to-real transfer to 60-80% success but requires real-world data to tune randomization ranges.

    arXiv
  10. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch

    LeRobot provides unified interface for 15+ robot datasets including ALOHA, DROID, RH20T; handles format conversion and schema validation.

    arXiv
  11. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 fine-tunes 55B-parameter PaLI-X on 130K episodes including 8K bimanual trajectories; achieves 62% success on unseen bimanual tasks.

    arXiv
  12. NVIDIA GR00T N1 technical report

    NVIDIA GR00T N1 trains on 1M+ trajectories including 50K bimanual episodes; uses transformer-based world model architecture.

    arXiv
  13. Figure + Brookfield humanoid pretraining dataset partnership

    Figure AI's partnership with Brookfield targets 100K hours of humanoid teleoperation data for warehouse tasks.

    figure.ai

FAQ

What is the minimum number of bimanual demonstrations needed to train a functional policy?

50-200 demonstrations suffice for single-task behavior cloning with transformer architectures like ACT. ALOHA achieved 80% success on cable routing with 50 episodes. Multi-task generalization requires 500-1000 episodes per task, and cross-embodiment transfer (e.g., Open X-Embodiment) benefits from 5K-10K episodes across diverse morphologies. Diffusion policies and VLA models require 2-3× more data than behavior cloning due to higher model capacity.

How do I synchronize two robot arms to sub-10ms accuracy without hardware triggers?

Use a single machine with a real-time kernel (PREEMPT_RT) to run both arm controllers and the camera server, minimizing scheduling jitter. Enable PTP (Precision Time Protocol) on Ethernet for 1-10ms clock synchronization. Record raw timestamps and apply offline synchronization via cross-correlation of overlapping sensor modalities (e.g., aligning wrist camera frames by detecting gripper motion). RLDS and MCAP support post-hoc timestamp correction. Validate alignment by plotting per-arm action timeseries and checking phase coherence for coordinated tasks.

What are the most common failure modes in bimanual datasets and how do I detect them?

Timestamp drift (>10ms between arms) degrades action prediction accuracy by 15-30%. Detect via cross-correlation of action timeseries. Action clipping (joint limits violated in 5%+ of steps) indicates miscalibration; detect via per-joint distribution analysis. Camera occlusion (one arm blocks the other's wrist camera for >30% of episode) reduces visual context; detect via image entropy or optical flow magnitude. Task incompletion (object not displaced, assembly not seated) wastes training budget; detect via final-frame image classification or human annotation of 10% of episodes.

Should I use HDF5, RLDS, or MCAP for storing bimanual demonstrations?

Use MCAP for raw log archival (preserves nanosecond timestamps, supports arbitrary message schemas, enables time-range queries). Convert to HDF5 for training if your policy codebase expects dense arrays (most PyTorch/JAX pipelines). Use RLDS if you need TensorFlow Datasets integration, columnar compression (10-30% smaller than HDF5), or schema validation. Hybrid pipelines are common: record to MCAP, process to HDF5/RLDS for training. ALOHA, DROID, and RH20T all ship HDF5; Open X-Embodiment uses RLDS.

How much does it cost to collect 1000 bimanual episodes in-house vs outsourcing?

In-house: $50-150 per episode ($50K-150K total) including hardware amortization ($20K-200K upfront, 2-5 year lifespan), operator labor ($30-100/hour, 3-5 episodes/hour for 30-second tasks), processing compute ($5-10 per 100 episodes), and storage ($0.50-1 per episode). Outsourcing to Scale AI or Truelabel: $100-300 per episode ($100K-300K total) with 2-4 week delivery, no hardware investment, and guaranteed quality (structural validation, success labeling). Outsourcing makes sense for <500 episodes or novel tasks requiring rapid iteration; in-house is cost-effective at 1K+ episodes.

Can I use VR controllers instead of Leader-Follower arms for bimanual teleoperation?

Yes. VR controllers (Meta Quest, Vive) map 6-DoF hand poses to end-effector targets via inverse kinematics. Advantages: no Leader hardware to transport, arbitrary workspace, and lower upfront cost ($500 for headset vs $10K for Leader arms). Disadvantages: higher cognitive load (no physical correspondence), latency sensitivity (>50ms causes oscillations), and occlusion issues (controllers lost when hands overlap). DROID used Quest 2 with 30 Hz control and 20ms latency, achieving 85% task success. VR works best for pick-and-place and manipulation; Leader-Follower excels at contact-rich assembly and cable routing.

Looking for bimanual robot demonstrations?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

List Your Bimanual Dataset on Truelabel