How to Design a Teleoperation Interface for Robot Data Collection

A production teleoperation interface rests on four design pillars: control-mode selection (position, velocity, or hybrid mapping), sub-50ms end-to-end latency, multi-camera operator feedback with task-relevant overlays, and episode workflow automation. Position control via leader-follower arms or VR controllers tends to produce the highest-quality manipulation demonstrations because operators directly specify target poses. The DROID dataset collected 76,000 trajectories across 564 scenes using this architecture, which suggests that interface design strongly influences dataset scale and task coverage.

Updated 2025-06-08

By Truelabel Team

Reviewed by Truelabel Team · Jun 8, 2025


Quick facts

Topic: How to Design a Teleoperation Interface
Audience: Procurement leads, ML ops, robotics engineers
Deliverable: Operational playbook with sample workflow + accept-rule criteria

Control Mode Selection: Position, Velocity, or Hybrid Mapping

Control mode determines how operator input translates to robot motion and directly impacts demonstration naturalness. Position control maps input-device pose to robot end-effector pose in real time — leader-follower arms mirror joint angles, VR controllers specify 6-DoF targets, and the robot tracks these commands with impedance or admittance control. DROID's 76,000-trajectory dataset used position control exclusively because it produces smooth, intent-preserving trajectories where the operator's spatial reasoning transfers directly to the robot^[1].

Velocity control maps input magnitude to end-effector velocity — SpaceMouse displacement sets linear and angular speeds, gamepad joysticks control Cartesian velocities. This mode suits slow, deliberate tasks like inspection or surface following but struggles with fast reaching motions because operators must simultaneously judge distance and deceleration timing. RoboNet's 15 million frames mixed velocity and position control across seven robot platforms, revealing that velocity-controlled episodes required 40% more operator corrections than position-controlled equivalents^[2].

Hybrid architectures combine modes within a single interface. Universal Manipulation Interface uses position control for the gripper's 6-DoF pose and velocity control for mobile-base navigation, allowing operators to walk through environments while precisely manipulating objects. Scale AI's physical-AI data engine supports mode switching mid-episode — operators toggle between position control for contact-rich assembly and velocity control for free-space transit. Choose position control for manipulation-heavy tasks, velocity for navigation or inspection, and hybrid when task phases have distinct control requirements.
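The difference between the two mappings can be sketched in a few lines. This is a minimal illustration, not any project's actual controller: the `Position` type, the function names, and the translation-only simplification (orientation is omitted) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    """Cartesian position in metres (orientation omitted for brevity)."""
    x: float
    y: float
    z: float


def position_control_target(device: Position, device_origin: Position,
                            robot_origin: Position, scale: float = 1.0) -> Position:
    """Position mode: device displacement maps directly (times scale) to robot displacement."""
    return Position(
        robot_origin.x + scale * (device.x - device_origin.x),
        robot_origin.y + scale * (device.y - device_origin.y),
        robot_origin.z + scale * (device.z - device_origin.z),
    )


def velocity_control_target(current: Position, deflection: tuple[float, float, float],
                            max_speed: float, dt: float) -> Position:
    """Velocity mode: deflection in [-1, 1] per axis sets speed; the target advances by speed * dt."""
    dx, dy, dz = (max(-1.0, min(1.0, d)) for d in deflection)
    return Position(
        current.x + dx * max_speed * dt,
        current.y + dy * max_speed * dt,
        current.z + dz * max_speed * dt,
    )
```

In position mode the operator's hand displacement fixes where the robot ends up. In velocity mode the same input only fixes how fast the target moves, so the operator has to time the stop, which is the extra judgement the text describes.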

Latency Optimization: Achieving Sub-50ms End-to-End Response

End-to-end latency, the delay between operator input and visible robot motion, must stay below 50ms for natural teleoperation. Latencies above 100ms force operators to predict robot state, increasing cognitive load and error rates. Measure end-to-end latency by filming an LED controlled by the input device alongside the robot at 240fps, then counting frames between the LED state change and the corresponding robot motion; each frame resolves roughly 4ms (1/240 s ≈ 4.2ms).
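Converting the frame count from the high-speed video into a latency figure is simple arithmetic. The helper below is a hypothetical name used only for illustration:

```python
def frames_to_latency_ms(frame_count: int, fps: float = 240.0) -> float:
    """Convert frames counted between LED change and robot motion into milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frame_count * 1000.0 / fps
```

At 240fps, 12 frames between the LED change and the first robot motion corresponds to 50ms, so counts of 12 frames or more indicate the interface is at or over the target.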

Network transport dominates latency budgets in remote teleoperation. Use wired Gigabit Ethernet for all control and video streams — WiFi introduces 15-40ms jitter that compounds with application-layer delays. For internet-based teleoperation, WebRTC with TURN relay adds 80-150ms depending on geographic distance; this is acceptable for slow tasks but prohibitive for contact-rich manipulation. Scale AI's Universal Robots partnership demonstrated that local-network teleoperation achieves 12ms control latency versus 95ms over public internet^[3].

Control-loop architecture determines processing latency. Run the control node in a thread scheduled with `SCHED_FIFO` real-time priority and pin it to a dedicated CPU core via `taskset` to limit preemption. In ROS2, `rclcpp::executors::StaticSingleThreadedExecutor` reduces per-spin executor overhead compared with the default `SingleThreadedExecutor`. Video encoding is the second-largest latency source: use hardware H.264 encoders (NVENC, QuickSync) at 30fps and 2Mbps to keep encoding under 10ms. USB cameras add 30-60ms; GigE Vision cameras with PTP synchronization reduce this to 8-15ms. Target 20ms control loop, 10ms video encode, 5ms network transport, 10ms display refresh for a 45ms total.
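One way to keep the latency budget honest is to encode it as data and compare measured stage timings against it. The sketch below uses the stage targets from the paragraph above; the dictionary keys and function names are illustrative assumptions.

```python
# Per-stage targets in milliseconds, taken from the budget described in the text.
TARGET_BUDGET_MS = {
    "control_loop": 20.0,
    "video_encode": 10.0,
    "network_transport": 5.0,
    "display_refresh": 10.0,
}
END_TO_END_LIMIT_MS = 50.0


def total_ms(stages: dict) -> float:
    """Sum the per-stage latencies."""
    return sum(stages.values())


def stages_over_budget(measured: dict, budget: dict = TARGET_BUDGET_MS) -> dict:
    """Return {stage: excess_ms} for every stage whose measurement exceeds its target."""
    return {
        name: measured[name] - budget[name]
        for name in budget
        if measured.get(name, 0.0) > budget[name]
    }
```

The targets sum to 45ms, leaving 5ms of headroom under the 50ms limit. Reporting the excess per stage points at the component to fix rather than just flagging that the total is too high.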

Operator Feedback Display: Multi-Camera Views and Task Overlays

Operator feedback quality determines demonstration success rate and task coverage. Mount cameras to provide overlapping fields of view covering the manipulation workspace, gripper approach vectors, and contact surfaces. DROID used three cameras per robot: wrist-mounted for gripper state, shoulder-mounted for workspace context, and third-person static for global scene understanding. This configuration reduced operator requests for camera adjustments by 60% compared to single-camera setups^[1].

Display all camera streams simultaneously in a tiled layout with the primary manipulation view occupying 50% of screen area. Render at native camera resolution without downsampling — operators need to see contact geometry and object edges clearly. Add real-time overlays for gripper state (open/closed indicator), end-effector pose (position and orientation relative to workspace frame), and force/torque readings if available. UMI's interface projects the gripper's 6-DoF pose as a 3D wireframe overlay on the wrist camera feed, giving operators immediate feedback on alignment errors.

Implement episode-state indicators in the display header: current episode number, elapsed time, recording status (armed/recording/paused), and last-saved episode timestamp. Add a live trajectory preview showing the robot's path over the last 2 seconds as a fading trail in the primary camera view. This preview helps operators assess motion smoothness and catch unintended jerks before completing the episode. LeRobot's teleoperation toolkit includes a web-based operator dashboard with all these elements as reusable React components.

Episode Workflow Automation: Recording, Replay, and Annotation

Episode workflow determines operator throughput and dataset consistency. Implement a state machine with four states: idle (operator positions robot for task start), armed (operator ready, waiting for trigger), recording (capturing demonstration), and review (operator inspects trajectory before saving or discarding). Use a foot pedal or dedicated button for state transitions — hand-held triggers force operators to release the input device, breaking immersion.
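The four-state workflow can be sketched as a small transition table. The event names (`pedal`, `save`, `discard`) and the class name are assumptions chosen for illustration; a real interface would also trigger recording and replay side effects on each transition.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()       # operator positions the robot for task start
    ARMED = auto()      # operator ready, waiting for trigger
    RECORDING = auto()  # capturing the demonstration
    REVIEW = auto()     # operator inspects before saving or discarding


# (state, event) -> next state. Event names are illustrative.
TRANSITIONS = {
    (State.IDLE, "pedal"): State.ARMED,
    (State.ARMED, "pedal"): State.RECORDING,
    (State.RECORDING, "pedal"): State.REVIEW,
    (State.REVIEW, "save"): State.IDLE,
    (State.REVIEW, "discard"): State.IDLE,
}


class EpisodeWorkflow:
    def __init__(self) -> None:
        self.state = State.IDLE
        self.saved = 0
        self.discarded = 0

    def handle(self, event: str) -> State:
        """Apply an event; events that are invalid in the current state are ignored."""
        key = (self.state, event)
        if key not in TRANSITIONS:
            return self.state
        if self.state is State.REVIEW:
            if event == "save":
                self.saved += 1
            else:
                self.discarded += 1
        self.state = TRANSITIONS[key]
        return self.state
```

Ignoring invalid events (for example, a pedal press during review) keeps an accidental press from skipping the save-or-discard decision.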

DROID's data-collection pipeline automated episode numbering, file naming, and metadata injection, allowing operators to complete 15 episodes per hour versus 8 episodes per hour with manual workflows^[1]. After each recording, auto-play a 2x-speed replay of the robot executing the captured trajectory while the operator prepares the next scene. This immediate replay catches recording errors (missed grasp, collision, incomplete task) before the operator moves to the next episode, reducing post-collection filtering by 35%.

Embed task-specific annotation prompts in the workflow. After replay, display a checklist for binary success/failure, task-phase labels (reach, grasp, transport, place), and free-text notes for anomalies. Open X-Embodiment's 1 million episodes used structured annotation schemas that operators completed in under 10 seconds per episode, producing rich metadata for downstream filtering and policy training^[4]. Store annotations in the same HDF5 or MCAP file as trajectory data using RLDS episode structure to maintain provenance. Truelabel's data-provenance glossary details why co-located metadata prevents annotation drift during dataset aggregation.
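A structured annotation record might look like the sketch below. The schema (field names, the `PhaseLabel` type, step-index ranges) is hypothetical and is not taken from RLDS or Open X-Embodiment; it only illustrates validating the checklist and serializing it so it can be stored alongside the trajectory, for example as a string attribute in the episode file.

```python
import json
from dataclasses import dataclass, field, asdict

# Task-phase labels named in the text.
PHASES = ("reach", "grasp", "transport", "place")


@dataclass
class PhaseLabel:
    phase: str
    start_step: int
    end_step: int


@dataclass
class EpisodeAnnotation:
    episode_id: str
    success: bool
    phases: list = field(default_factory=list)
    notes: str = ""

    def validate(self) -> None:
        """Reject unknown phase names and inverted step ranges."""
        for p in self.phases:
            if p.phase not in PHASES:
                raise ValueError(f"unknown phase: {p.phase}")
            if p.start_step > p.end_step:
                raise ValueError(f"phase {p.phase} ends before it starts")

    def to_json(self) -> str:
        """Validate, then serialize to a JSON string for co-located storage."""
        self.validate()
        return json.dumps(asdict(self), sort_keys=True)
```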

Input Device Selection: Leader Arms, VR Controllers, and SpaceMouse

Input device choice constrains control modes and operator learning curves. Leader-follower arms provide the most intuitive position control because the operator's hand motion directly mirrors the robot's motion. ALOHA's bimanual teleoperation uses two leader arms to control two follower arms, enabling complex bimanual tasks like cable routing and lid unscrewing. Leader arms cost $3,000-$15,000 per arm and require workspace matching between leader and follower — the leader's joint limits must encompass the follower's reachable space.

VR controllers (Meta Quest, Valve Index) offer 6-DoF position control at $300-$1,000 per system and support arbitrary workspace scaling, so operators can control a 2-meter-reach robot from a 1-meter tracking volume. DROID used Quest 2 controllers for 76,000 trajectories across 13 institutions, showing that VR scales to distributed data collection^[1]. VR controllers lack force feedback, so operators cannot feel contact forces; add visual force indicators or auditory cues when gripper force exceeds thresholds. When the setup also requires operators to wear a headset for extended periods, plan 15-minute breaks every 90 minutes to prevent fatigue.

SpaceMouse and gamepad controllers provide velocity control at $150-$400 but require more operator training because the mapping from input to motion is indirect. RoboNet's multi-robot dataset used SpaceMouse for some platforms and reported 2-3 hours of operator training before consistent demonstration quality^[2]. SpaceMouse suits tasks with slow, continuous motion like pouring or surface wiping. For manipulation-heavy datasets, prioritize leader arms or VR controllers to minimize operator learning time and maximize demonstration naturalness.

Ergonomic Safeguards: Preventing Operator Fatigue and Repetitive Strain

Sustained teleoperation causes shoulder, wrist, and neck strain if ergonomics are neglected. Position the operator workstation so the input device rests at elbow height with forearms parallel to the floor. For leader arms, mount the base at desk height and ensure the operator's shoulder remains relaxed — reaching above shoulder height for extended periods causes rotator-cuff strain. DROID's data-collection protocol limited continuous teleoperation to 45-minute blocks with 10-minute breaks, reducing operator-reported discomfort by 50% compared to 90-minute sessions^[1].

Implement software-enforced break reminders after every 20 episodes or 60 minutes of recording time, whichever comes first. Display a countdown timer during breaks and disable recording until the break completes. ALOHA's bimanual setup added wrist rests to the leader-arm base plates, allowing operators to support their forearms during idle periods between episodes. For VR teleoperation, use head-mounted displays with adjustable interpupillary distance and counterweighted head straps to distribute load across the skull rather than the nose bridge.

Monitor operator performance metrics to detect fatigue before it degrades demonstration quality. Track episode completion time, retry rate (discarded episodes per successful episode), and trajectory smoothness (sum of squared jerks). Open X-Embodiment observed that retry rates increased 25% after 90 minutes of continuous teleoperation, signaling cognitive fatigue^[4]. When retry rate exceeds baseline by 20%, prompt the operator to take an unscheduled break. Rotate operators across different task types every 2 hours to prevent task-specific repetitive strain — alternating between bimanual assembly and single-arm pick-place reduces wrist flexion repetition.
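The break rules described in this section reduce to two checks: a scheduled break after 20 episodes or 60 minutes, and an unscheduled break when retry rate exceeds baseline by 20%. The sketch below uses those thresholds as defaults; the function names and the handling of zero successful episodes are assumptions.

```python
def retry_rate(discarded: int, successful: int) -> float:
    """Discarded episodes per successful episode."""
    if successful == 0:
        return float("inf") if discarded else 0.0
    return discarded / successful


def scheduled_break_due(episodes_since_break: int, minutes_since_break: float,
                        max_episodes: int = 20, max_minutes: float = 60.0) -> bool:
    """Break after 20 episodes or 60 minutes of recording, whichever comes first."""
    return episodes_since_break >= max_episodes or minutes_since_break >= max_minutes


def fatigue_break_due(current_rate: float, baseline_rate: float, margin: float = 0.20) -> bool:
    """Prompt an unscheduled break when retry rate exceeds baseline by more than the margin."""
    return current_rate > baseline_rate * (1.0 + margin)
```

For an operator with a baseline of 0.2 discards per success, the unscheduled-break prompt fires once the recent rate rises above 0.24. The baseline should be computed per operator and per task, since retry rates vary with task difficulty.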

Real-Time Robot Control APIs: ROS2, SDK Integration, and Safety Limits

Robot control APIs determine achievable latency and safety-layer integration. ROS2 provides standardized interfaces for joint-position commands, Cartesian-pose goals, and gripper actuation via `trajectory_msgs`, `control_msgs`, and `moveit_msgs`. Use `trajectory_msgs/JointTrajectory` messages with single-point trajectories and 10ms timestamps for real-time position control, because multi-point trajectories introduce interpolation delays. ROS2's publish-subscribe architecture decouples teleoperation nodes from robot drivers, allowing the same operator interface to control different robot platforms by swapping the hardware-abstraction layer.

Proprietary SDKs (Franka Emika's libfranka, Universal Robots' RTDE, ABB's EGM) offer lower latency than ROS2 because they bypass middleware serialization. Franka's FR3 robot achieves 1ms control cycles via libfranka's real-time Ethernet interface, compared to 10-20ms for ROS2 control loops. Proprietary SDKs require custom integration for each robot model but are necessary for sub-10ms latency in contact-rich tasks like insertion or surface following.

Implement safety limits in the control loop independent of the robot's built-in safety system. Clamp commanded joint velocities to 80% of the robot's maximum to leave margin for emergency stops. Define Cartesian workspace boundaries as convex polytopes and reject commands that would move the end-effector outside these bounds. DROID's safety layer monitored gripper force and automatically paused recording when force exceeded 50N, preventing damage to objects and the robot during operator errors^[1]. Log all safety-limit violations with timestamps and operator IDs for post-collection analysis — repeated violations indicate inadequate operator training or poorly defined workspace boundaries.
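The two software limits described above can be sketched without any robot SDK. Here a convex polytope is represented as a list of half-spaces `(normal, offset)` meaning `normal · p <= offset`; that representation, the helper names, and the axis-aligned box example are assumptions for illustration.

```python
def clamp_joint_velocities(commanded, max_velocities, fraction=0.8):
    """Clamp each commanded joint velocity to +/- fraction of that joint's maximum."""
    return [
        max(-fraction * vmax, min(fraction * vmax, v))
        for v, vmax in zip(commanded, max_velocities)
    ]


def inside_polytope(point, halfspaces):
    """True if the point satisfies normal . point <= offset for every half-space."""
    return all(
        sum(n * p for n, p in zip(normal, point)) <= offset
        for normal, offset in halfspaces
    )


def box_halfspaces(lower, upper):
    """Build the six half-spaces of an axis-aligned box (a simple convex polytope)."""
    spaces = []
    for axis in range(3):
        normal = [0.0, 0.0, 0.0]
        normal[axis] = 1.0
        spaces.append((tuple(normal), upper[axis]))              # p[axis] <= upper
        spaces.append((tuple(-n for n in normal), -lower[axis]))  # p[axis] >= lower
    return spaces
```

A command whose target fails `inside_polytope` should be rejected before it reaches the robot driver, and the rejection logged with a timestamp and operator ID as the text recommends.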

Data Storage and Episode Structure: HDF5, MCAP, and RLDS Formats

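Episode storage format determines downstream compatibility with training frameworks. RLDS (Reinforcement Learning Datasets) defines a standardized episode structure with `steps` containing observations, actions, rewards, and metadata. RLDS datasets are built on TensorFlow Datasets and are typically stored as TFRecord files; PyTorch-oriented libraries such as LeRobot use their own Parquet-based format, so RLDS episodes are usually converted before ingestion. The core RLDS step schema covers the observation, action, reward, discount, and episode-boundary flags (`is_first`, `is_last`, `is_terminal`). Per-step timestamps, camera-intrinsics matrices, and action-space definitions are not mandated by that schema, so add them as extra step or episode metadata to keep episodes self-describing.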

HDF5 offers hierarchical storage with compression and chunking for large multi-camera datasets. Store each episode as an HDF5 group with datasets for `observations/camera_0`, `observations/camera_1`, `actions`, and `metadata`. Use GZIP compression level 4 for image data (3:1 compression ratio with negligible decode overhead) and no compression for action arrays (already compact). DROID's 76,000 episodes occupy 1.2TB in HDF5 format with three 640x480 cameras per episode at 15Hz, averaging 16MB per 30-second episode^[1].
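A quick storage estimate helps sanity-check figures like these before provisioning disks. The sketch below assumes 8-bit RGB frames (3 bytes per pixel) and decimal units (1MB = 10^6 bytes); both are assumptions, and the function names are illustrative.

```python
def raw_episode_bytes(width: int, height: int, channels: int,
                      cameras: int, fps: float, seconds: float) -> int:
    """Uncompressed image bytes for one episode."""
    frames = int(fps * seconds)
    return width * height * channels * cameras * frames


def dataset_bytes(episodes: int, bytes_per_episode: int) -> int:
    """Total dataset size for a given average episode size."""
    return episodes * bytes_per_episode


raw = raw_episode_bytes(640, 480, 3, cameras=3, fps=15, seconds=30)
total = dataset_bytes(76_000, 16_000_000)
implied_ratio = raw / 16_000_000
```

Under these assumptions, 76,000 episodes at 16MB each come to about 1.2TB, matching the total quoted above. The uncompressed images for one 30-second episode come to about 1.24GB, so a 16MB average implies a compression ratio near 78:1, well beyond the 3:1 GZIP figure. Measure the ratio on your own recordings before sizing storage.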

MCAP is a container format for time-series data designed for robotics, supporting arbitrary message schemas and zero-copy playback. MCAP files store ROS2 messages, Protobuf, or JSON with microsecond timestamps and channel-based indexing for fast random access. Foxglove's MCAP tooling enables browser-based episode replay and annotation without custom parsers. Choose RLDS for maximum training-framework compatibility, HDF5 for large-scale storage efficiency, and MCAP for ROS2-native workflows with rich tooling support. Store camera calibration, robot URDF, and task instructions in episode metadata to maintain data provenance across dataset versions.

Multi-Operator Scaling: Distributed Collection and Quality Control

Scaling teleoperation beyond a single operator requires standardized training, quality metrics, and infrastructure. DROID collected data at 13 institutions by distributing identical hardware kits (robot, cameras, VR controllers) and a Docker container with the teleoperation interface, ensuring consistent operator experience^[1]. Provide operators with a 30-minute video tutorial covering control-mode basics, episode-workflow steps, and common failure modes (missed grasps, collisions, incomplete tasks). Follow with 10 supervised practice episodes where an experienced operator reviews trajectories in real time and provides feedback.

Define quantitative quality metrics for automated filtering. Measure trajectory smoothness as the sum of squared jerks across all joints — smooth demonstrations have jerk sums below 500 rad²/s⁶ for typical manipulation tasks. Compute task-success rate by replaying episodes on the robot and checking goal conditions (object in target location, lid closed, cable routed). Open X-Embodiment filtered 15% of collected episodes based on jerk thresholds and success-rate checks, improving policy convergence speed by 30%^[4].
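The smoothness metric can be computed from recorded joint positions with a third-order finite difference. This is a minimal sketch assuming uniformly sampled joint angles in radians; the function name and input layout (a list of samples, each a list of joint angles) are assumptions.

```python
def sum_squared_jerk(joint_positions, dt):
    """Sum of squared jerk over all joints and time steps, in rad^2/s^6.

    joint_positions: list of samples, each a list of joint angles (rad),
    sampled every dt seconds. Jerk is estimated with the forward third
    difference (q[i+3] - 3*q[i+2] + 3*q[i+1] - q[i]) / dt**3.
    """
    if dt <= 0:
        raise ValueError("dt must be positive")
    total = 0.0
    for i in range(len(joint_positions) - 3):
        q0, q1, q2, q3 = joint_positions[i:i + 4]
        for a, b, c, d in zip(q0, q1, q2, q3):
            jerk = (d - 3.0 * c + 3.0 * b - a) / dt ** 3
            total += jerk * jerk
    return total
```

Because this is a plain sum, the value grows with episode length and sample rate. A fixed threshold such as the 500 rad²/s⁶ mentioned above is only comparable across episodes recorded at the same rate and of similar duration, and finite differences amplify sensor noise, so low-pass filtering the positions first is usually advisable.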

Centralize episode storage and quality dashboards. Upload episodes to a shared S3 bucket or Truelabel's physical-AI marketplace immediately after collection, triggering automated quality checks (file integrity, metadata completeness, jerk analysis). Display per-operator statistics (episodes per hour, retry rate, average jerk) on a live dashboard to identify operators needing additional training. Scale AI's data engine uses similar dashboards to manage distributed annotation workforces, applying the same principles to teleoperation scaling^[5]. Rotate high-performing operators across task types to build a dataset with diverse demonstration styles, improving policy robustness to operator variability.

Task-Specific Interface Customization: Bimanual, Mobile, and Contact-Rich Tasks

Generic teleoperation interfaces require customization for task-specific constraints. Bimanual tasks (folding, cable routing, lid unscrewing) need synchronized dual-arm control with independent position commands and coordinated gripper timing. ALOHA's bimanual setup uses two leader arms with mirrored workspaces, allowing operators to perform bimanual motions as naturally as single-arm tasks. Add a coordination mode where one arm's motion is relative to the other's frame — useful for tasks like threading a cable through a held connector.

Mobile manipulation combines base navigation with arm control, requiring mode switching or hybrid input devices. UMI uses a handheld gripper for 6-DoF manipulation and the operator's walking motion for mobile-base commands, captured via VR headset tracking. This embodied interface reduces cognitive load because base motion is implicit rather than joystick-controlled. For tasks requiring precise base positioning (docking, narrow doorways), add a fine-positioning mode where base velocity scales down by 10x.

Contact-rich tasks (insertion, wiping, polishing) benefit from force feedback or visual force indicators. Mount a 6-axis force/torque sensor at the robot's wrist and display force magnitude as a color-coded bar in the operator interface — green for under 10N, yellow for 10-30N, red above 30N. DROID's contact-rich episodes used this visual feedback to help operators modulate insertion forces, reducing object damage by 40% compared to no force feedback^[1]. For leader arms with force feedback, map measured forces to vibration intensity in the leader's handle, giving operators haptic cues during contact. Customize the interface's episode-review checklist to include task-specific success criteria — insertion depth for peg-in-hole, surface coverage for wiping, torque applied for fastening.
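The color-coded force bar reduces to a magnitude computation and two thresholds. The sketch below uses the 10N and 30N bands from the text; the function names and the choice to treat exactly 10N and 30N as yellow are assumptions.

```python
import math


def force_magnitude(fx: float, fy: float, fz: float) -> float:
    """Euclidean norm of the force components reported by the wrist sensor, in newtons."""
    return math.sqrt(fx * fx + fy * fy + fz * fz)


def force_color(magnitude_n: float, low: float = 10.0, high: float = 30.0) -> str:
    """Green under `low`, yellow from `low` through `high`, red above `high`."""
    if magnitude_n < low:
        return "green"
    if magnitude_n <= high:
        return "yellow"
    return "red"
```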

Simulation Integration: Sim-to-Real Transfer and Synthetic Demonstrations

Teleoperation interfaces can control simulated robots to generate synthetic demonstrations for sim-to-real transfer. RoboSuite and RLBench provide simulated environments with teleoperation APIs compatible with the same input devices used for real robots. Collect demonstrations in simulation first to validate task feasibility and operator training before deploying to hardware. Domain randomization during simulated teleoperation — varying object textures, lighting, and physics parameters — produces demonstrations that transfer to real robots with 60-80% success rates^[6].

CALVIN's simulated teleoperation uses VR controllers to collect 24,000 language-annotated episodes across 34 tasks, providing a large-scale dataset for language-conditioned policy pretraining. Operators perform tasks in simulation while speaking task descriptions aloud, which are transcribed and aligned with episode segments. This approach scales language grounding without real-robot costs. Simulated teleoperation also enables counterfactual data collection — operators intentionally perform task variations (different grasp points, approach angles) that are risky on real hardware but valuable for policy robustness.

Use simulation to prototype interface changes before hardware deployment. Test new control modes, camera placements, and overlay designs in simulation where iteration is fast and risk-free. Robomimic's teleoperation toolkit includes a simulation harness that logs operator actions and interface interactions, allowing A/B testing of interface variants. Measure operator throughput (episodes per hour) and demonstration quality (jerk, success rate) for each variant in simulation, then deploy the best-performing design to real robots. This simulation-first workflow reduces hardware downtime and accelerates interface optimization.

Compliance and Safety: Operator Training, Incident Logging, and Risk Mitigation

Teleoperation introduces safety risks from robot motion in human-occupied spaces and repetitive-strain injuries from prolonged operation. Train operators on emergency-stop procedures before their first teleoperation session — ensure they can identify and reach the robot's physical e-stop button within 2 seconds. Implement a software e-stop in the interface (large red button, spacebar hotkey) that sends a halt command and disables all control inputs. Test e-stop response time during operator training by triggering unexpected robot motions and measuring time to stop.

Log all teleoperation sessions with operator ID, start/end timestamps, episode count, and any safety-limit violations or e-stop activations. DROID's incident logs recorded 47 e-stop activations across 76,000 episodes (0.06% rate), with 80% caused by workspace-boundary violations and 20% by excessive gripper force^[1]. Review incident logs weekly to identify systemic issues — repeated violations of the same workspace boundary indicate the boundary is too restrictive or poorly communicated to operators.
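A weekly incident review can start from two numbers: the incident rate per episode and the share of incidents by cause. The record fields and cause labels below are hypothetical, chosen only to illustrate the calculation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    operator_id: str
    timestamp: float  # seconds since epoch
    cause: str        # illustrative labels, e.g. "workspace_boundary", "gripper_force"


def incident_rate_percent(incident_count: int, episode_count: int) -> float:
    """Incidents per episode, expressed as a percentage."""
    if episode_count <= 0:
        raise ValueError("episode_count must be positive")
    return 100.0 * incident_count / episode_count


def cause_shares(incidents) -> dict:
    """Fraction of incidents attributed to each cause."""
    counts = Counter(i.cause for i in incidents)
    total = sum(counts.values())
    return {cause: n / total for cause, n in counts.items()} if total else {}
```

With the figures quoted above, `incident_rate_percent(47, 76000)` rounds to 0.06, matching the stated rate.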

For commercial data-collection operations, comply with occupational safety regulations (OSHA in the US, HSE in the UK). Conduct ergonomic assessments of operator workstations and provide adjustable chairs, monitor arms, and input-device mounts. Scale AI's physical-AI data engine operates under ISO 45001 occupational health and safety management, applying the same rigor to teleoperation as to traditional annotation work^[5]. Document operator training completion, break schedules, and ergonomic assessments in a compliance database. For datasets intended for commercial licensing, include operator consent forms and data-usage agreements to ensure downstream buyers have clear provenance and usage rights.


External references and source context

[1] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. arXiv. Cited for: the DROID dataset collected 76,000 trajectories across 564 scenes using position-control teleoperation with VR controllers and three-camera operator feedback.
[2] RoboNet: Large-Scale Multi-Robot Learning. arXiv. Cited for: RoboNet's 15 million frames mixed velocity and position control, revealing velocity-controlled episodes required 40% more operator corrections.
[3] Scale AI, Universal Robots, physical AI. scale.com. Cited for: Scale AI's Universal Robots partnership demonstrated 12ms local-network latency versus 95ms over public internet.
[4] Open X-Embodiment: Robotic Learning Datasets and RT-X Models. arXiv. Cited for: Open X-Embodiment's 1 million episodes used structured annotation schemas completed in under 10 seconds per episode.
[5] Scale AI, physical AI. scale.com. Cited for: Scale AI's physical-AI data engine supports mode switching and distributed teleoperation infrastructure.
[6] Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. arXiv. Cited for: domain randomization during simulated teleoperation produces demonstrations that transfer to real robots with 60-80% success rates.

FAQ

What control latency is acceptable for high-quality manipulation demonstrations?

End-to-end latency below 50ms enables natural teleoperation for manipulation tasks. Latencies between 50-100ms are usable but increase operator cognitive load and reduce demonstration smoothness. Above 100ms, operators must predict robot state, leading to jerky trajectories and higher error rates. Measure latency by filming an LED controlled by the input device alongside the robot at 240fps, counting frames between LED change and robot motion. Achieve sub-50ms latency with wired Gigabit Ethernet, real-time OS threads for control loops, hardware video encoding, and GigE Vision cameras with PTP synchronization.

How many cameras are needed for effective operator feedback?

Three cameras provide sufficient coverage for most manipulation tasks: wrist-mounted for gripper state and contact geometry, shoulder-mounted for workspace context and approach vectors, and third-person static for global scene understanding. DROID's 76,000-trajectory dataset used this three-camera configuration and reduced operator requests for camera adjustments by 60% compared to single-camera setups. Display all streams simultaneously in a tiled layout with the primary manipulation view occupying 50% of screen area. Add real-time overlays for gripper state, end-effector pose, and force readings to augment visual feedback.

Which input device produces the highest-quality demonstrations?

Leader-follower arms and VR controllers produce higher-quality manipulation demonstrations than velocity-control devices like SpaceMouse or gamepads because they enable direct position control where operator hand motion mirrors robot motion. DROID collected 76,000 trajectories using VR controllers, and ALOHA demonstrated complex bimanual tasks with leader arms. Leader arms cost $3,000-$15,000 per arm but provide the most intuitive control. VR controllers cost $300-$1,000 and support arbitrary workspace scaling, making them ideal for distributed data collection. Choose leader arms for single-site high-throughput collection and VR controllers for multi-site scaling.

How do you prevent operator fatigue during extended teleoperation sessions?

Limit continuous teleoperation to 45-minute blocks with 10-minute breaks to prevent shoulder, wrist, and neck strain. Position input devices at elbow height with forearms parallel to the floor, and ensure operators' shoulders remain relaxed. Implement software-enforced break reminders after every 20 episodes or 60 minutes of recording time. Monitor retry rate (discarded episodes per successful episode) and trajectory smoothness — when retry rate exceeds baseline by 20%, prompt an unscheduled break. DROID's protocol reduced operator-reported discomfort by 50% compared to 90-minute sessions. Rotate operators across different task types every 2 hours to prevent task-specific repetitive strain.

What episode storage format is best for robot learning datasets?

RLDS (Reinforcement Learning Datasets) provides maximum compatibility with training frameworks like LeRobot, TensorFlow, and PyTorch by defining standardized episode structure with observations, actions, rewards, and metadata. HDF5 offers hierarchical storage with compression for large multi-camera datasets — DROID's 76,000 episodes occupy 1.2TB in HDF5 with three cameras per episode. MCAP supports ROS2 messages with microsecond timestamps and zero-copy playback, ideal for ROS2-native workflows. Choose RLDS for training-framework compatibility, HDF5 for storage efficiency, and MCAP for ROS2 integration with rich tooling support like Foxglove's browser-based replay.
