Physical AI Data Engineering
How to Create a Robot Demonstration Dataset
Creating a robot demonstration dataset requires five engineering phases: defining the observation-action contract and task specifications, assembling teleoperation hardware with synchronized multi-modal recording, training operators and running pilot collections to validate quality metrics, executing full-scale collection with real-time QA monitoring, and post-processing episodes into training-ready formats like RLDS or LeRobot HDF5 with train-validation splits and metadata.
Quick facts
- Difficulty: Intermediate
- Audience: Physical AI data engineers
- Last reviewed: 2025-06-15
Why Demonstration Datasets Drive Modern Robot Learning
Demonstration datasets have become the primary training substrate for generalist robot policies. RT-1 trained on 130,000 demonstrations across 700 tasks to achieve a 97% success rate on those instructions[1]. Open X-Embodiment aggregated 1 million episodes from 22 robot embodiments to train RT-X models that generalize across morphologies[2]. DROID collected 76,000 manipulation trajectories in 564 real-world scenes using a standardized teleoperation protocol[3].
The shift from scripted data to human demonstrations reflects three technical realities. First, teleoperation captures implicit recovery behaviors that scripted policies miss—when a human operator corrects a grasp mid-episode, that correction becomes training signal for robustness. Second, demonstrations encode task semantics through temporal structure: the sequence of approach, grasp, lift, transport, and release phases provides richer supervision than isolated state-action pairs. Third, large-scale demonstration datasets enable vision-language-action models like OpenVLA to ground natural language instructions in physical affordances[4].
Dataset scale requirements vary by architecture. Behavioral cloning policies trained on single tasks converge with 50-200 episodes. Diffusion policies targeting multi-task generalization require 500-2,000 episodes per task family. Vision-language-action transformers demand 100,000+ episodes spanning diverse scenes and objects to learn transferable representations. BridgeData V2 demonstrated that dataset diversity—measured by unique object-scene combinations—matters more than raw episode count for out-of-distribution generalization[5].
Observation-Action Contract Specification
The observation-action contract defines the exact tensor shapes, data types, coordinate frames, and sampling rates that bridge hardware and model. Every modality must specify its dimensionality, units, and synchronization requirements before collection begins. A typical bimanual manipulation contract includes image_left (480×640×3 uint8, 30 Hz), image_right (480×640×3 uint8, 30 Hz), joint_positions (14-dim float32 radians, 50 Hz), gripper_states (2-dim float32 normalized 0-1, 50 Hz), and end_effector_poses (2×7-dim float32 position+quaternion in base frame, 50 Hz).
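A minimal sketch of how such a contract can be pinned down in code before collection starts. The modality names, shapes, and rates mirror the bimanual example above; the `ModalitySpec` class, the action entries, and the `validate_frame` helper are illustrative rather than part of any standard:

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class ModalitySpec:
    """Shape, dtype, unit, and sampling rate for one observation or action stream."""
    shape: tuple
    dtype: np.dtype
    unit: str
    rate_hz: float


# Illustrative bimanual contract mirroring the example above.
OBSERVATION_CONTRACT = {
    "image_left": ModalitySpec((480, 640, 3), np.dtype(np.uint8), "rgb", 30.0),
    "image_right": ModalitySpec((480, 640, 3), np.dtype(np.uint8), "rgb", 30.0),
    "joint_positions": ModalitySpec((14,), np.dtype(np.float32), "radians", 50.0),
    "gripper_states": ModalitySpec((2,), np.dtype(np.float32), "normalized_0_1", 50.0),
    "end_effector_poses": ModalitySpec((2, 7), np.dtype(np.float32), "m_quat_base_frame", 50.0),
}

# Hypothetical action side of the contract for a position-controlled bimanual robot.
ACTION_CONTRACT = {
    "joint_targets": ModalitySpec((14,), np.dtype(np.float32), "radians", 10.0),
    "gripper_commands": ModalitySpec((2,), np.dtype(np.float32), "normalized_0_1", 10.0),
}


def validate_frame(frame: dict, contract: dict) -> None:
    """Raise if a recorded frame violates the declared shapes or dtypes."""
    for name, spec in contract.items():
        arr = np.asarray(frame[name])
        assert arr.shape == spec.shape, f"{name}: shape {arr.shape} != {spec.shape}"
        assert arr.dtype == spec.dtype, f"{name}: dtype {arr.dtype} != {spec.dtype}"
```

Freezing the contract as data rather than prose means the recorder, QA scripts, and training loaders can all validate against the same source of truth.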
Coordinate frame choices cascade through the entire pipeline. RLDS standardizes on right-handed Cartesian frames with Z-up for end-effector poses and joint angles in radians following ROS conventions[6]. Gripper commands use normalized scalars where 0.0 represents fully closed and 1.0 fully open, avoiding hardware-specific units. Camera extrinsics must be calibrated and stored as 4×4 homogeneous transforms from camera optical frame to robot base frame.
Action space design determines policy expressiveness. Position control actions specify target joint angles or Cartesian poses; velocity control actions specify joint velocities or end-effector twists. Diffusion Policy implementations typically use position control with 10 Hz action frequency and 16-step action chunks to balance reactivity and trajectory smoothness[7]. Hybrid action spaces that combine high-level discrete mode switches (approach, grasp, transport) with continuous parameters enable hierarchical policies but complicate the observation contract with mode indicators.
Temporal alignment across modalities requires hardware timestamps, not software logging times. Use ROS2 message timestamps or equivalent middleware timing to synchronize camera frames, joint states, and force-torque readings within 10 milliseconds[8]. Misaligned observations corrupt the causal structure that imitation learning depends on—a 50ms camera lag means the policy learns to react to stale visual information.
Teleoperation Hardware and Recording Infrastructure
Teleoperation hardware determines dataset quality more than any post-processing step. The two dominant paradigms are leader-follower systems and 6-DOF input devices. Leader-follower setups like ALOHA use a kinematically matched leader arm that the operator manipulates while joint encoders stream target positions to the follower robot[9]. This approach provides intuitive whole-arm control and natural bimanual coordination but requires custom hardware fabrication. 6-DOF devices like SpaceMouse or Phantom Omni map 3D translations and rotations to end-effector commands through inverse kinematics, enabling operation with commercial hardware but introducing a cognitive mapping layer.
DROID standardized on a SpaceMouse Pro paired with a foot pedal for gripper control, achieving 76,000 episodes across 564 scenes with 18 operators[10]. The key insight: operator training time matters more than hardware sophistication. DROID operators completed 2-hour training sessions focusing on smooth motion profiles and consistent grasp approaches before contributing to the production dataset. Scale AI's partnership with Universal Robots demonstrated that even collaborative robot arms with compliant control can generate high-quality demonstrations when operators receive structured feedback on trajectory smoothness metrics[11].
Recording infrastructure must capture all modalities synchronously without dropping frames. A minimal ROS2 recording node subscribes to camera image topics, joint state topics, and gripper state topics, then writes timestamped messages to MCAP files using the rosbag2_storage_mcap plugin[12]. MCAP provides indexed random access and efficient compression, critical for datasets exceeding 1 TB. For non-ROS systems, implement a custom recorder that writes HDF5 files with one group per episode and datasets for each modality[13].
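For the non-ROS path, a recorder along the following lines captures the idea. It uses h5py with an assumed one-group-per-episode layout; dataset names, the `timestamp_ns` field, and compression settings are illustrative:

```python
import time

import h5py
import numpy as np


def write_episode(path: str, episode_index: int, frames: list) -> None:
    """Write one episode's buffered frames to an HDF5 group, one dataset per modality.

    `frames` is a list of dicts keyed by modality name; each dict also carries a
    hardware timestamp under 'timestamp_ns'. Layout is illustrative.
    """
    with h5py.File(path, "a") as f:
        grp = f.create_group(f"episode_{episode_index:06d}")
        modalities = [k for k in frames[0] if k != "timestamp_ns"]
        for name in modalities:
            stacked = np.stack([np.asarray(fr[name]) for fr in frames])
            # gzip keeps large image streams manageable; chunk per frame for random access.
            grp.create_dataset(name, data=stacked, compression="gzip",
                               chunks=(1, *stacked.shape[1:]))
        grp.create_dataset("timestamp_ns",
                           data=np.array([fr["timestamp_ns"] for fr in frames], dtype=np.int64))
        grp.attrs["recorded_at"] = time.strftime("%Y-%m-%dT%H:%M:%S")
        grp.attrs["num_frames"] = len(frames)
```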
Camera placement follows task-specific heuristics. Tabletop manipulation benefits from a fixed third-person view 60-80 cm above the workspace at 30-45° elevation plus wrist-mounted cameras on each end-effector. RT-2 used a single overhead camera for 6,000 tasks, demonstrating that view diversity matters less than consistent framing[14]. Depth cameras add 10-20% to dataset size but enable policies to reason about occlusions and 3D geometry—essential for bin picking and dense clutter scenarios.
Operator Training and Pilot Collection
Operator training determines dataset consistency. A structured training protocol includes three phases: hardware familiarization (30 minutes), task demonstration by an expert (15 minutes), and supervised practice until the operator achieves 80% success rate on a held-out validation task (1-2 hours). During practice, monitor trajectory smoothness using jerk metrics—the third derivative of position—and flag episodes with jerk spikes exceeding 50 m/s³ for review.
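A possible implementation of the jerk check, assuming end-effector positions sampled at a fixed rate. The 50 m/s³ threshold comes from the protocol above; everything else is illustrative:

```python
import numpy as np


def max_jerk(positions: np.ndarray, rate_hz: float = 50.0) -> float:
    """Return peak jerk magnitude (m/s^3) from a (T, 3) end-effector position trace.

    Jerk is the third finite difference of position divided by the timestep cubed.
    """
    dt = 1.0 / rate_hz
    jerk = np.diff(positions, n=3, axis=0) / dt**3
    return float(np.linalg.norm(jerk, axis=-1).max())


def flag_for_review(positions: np.ndarray, threshold: float = 50.0) -> bool:
    """Flag an episode whose jerk spikes exceed the review threshold above."""
    return max_jerk(positions) > threshold
```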
Pilot collection validates the entire pipeline before scaling. Collect 50-100 episodes with 2-3 trained operators, then run automated quality checks: verify that all modalities have matching episode lengths within 5 frames, confirm that gripper state transitions from open to closed occur in every pick episode, check that camera exposure remains stable across lighting conditions, and validate that action magnitudes fall within expected ranges (joint velocities below 2 rad/s for safe manipulation).
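These pilot checks are straightforward to automate. The sketch below assumes each episode is held in memory as a dict of arrays with the modality names used earlier; the key names and 50 Hz sampling assumption are illustrative, and the camera exposure check is omitted for brevity:

```python
import numpy as np


def pilot_quality_checks(episode: dict) -> list:
    """Run the pilot-phase checks described above on one in-memory episode.

    Returns a list of human-readable failures (empty list means the episode passes).
    """
    failures = []

    # 1. All modalities should have matching lengths within 5 frames.
    lengths = {k: len(v) for k, v in episode.items()}
    if max(lengths.values()) - min(lengths.values()) > 5:
        failures.append(f"length mismatch across modalities: {lengths}")

    # 2. Pick episodes should contain an open-to-closed gripper transition.
    gripper = np.asarray(episode["gripper_states"])
    if not ((gripper[:-1] > 0.5) & (gripper[1:] <= 0.5)).any():
        failures.append("no open-to-closed gripper transition found")

    # 3. Joint velocities should stay below 2 rad/s for safe manipulation.
    joints = np.asarray(episode["joint_positions"])
    velocities = np.abs(np.diff(joints, axis=0)) * 50.0  # assumes 50 Hz sampling
    if velocities.max() > 2.0:
        failures.append(f"joint velocity {velocities.max():.2f} rad/s exceeds 2 rad/s")

    return failures
```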
BridgeData's pilot phase revealed that 15% of episodes contained unintended collisions that corrupted the demonstration[15]. The solution: add a real-time collision detection layer that monitors joint torques and automatically flags episodes where torque exceeds 20 Nm on any joint. Operators review flagged episodes immediately and decide whether to discard or annotate the collision as intentional (e.g., pushing a drawer closed).
Task specification refinement happens during pilot collection. If operators achieve only 60% success rate on a task after training, the task is either too difficult for teleoperation or the success criteria are ambiguous. CALVIN's task suite iterated through three pilot rounds, ultimately removing tasks that required sub-millimeter precision or force feedback that the teleoperation hardware could not provide[16]. The final suite balanced difficulty to maintain operator engagement while ensuring that 90% of episodes reached the goal state.
Full-Scale Collection with Real-Time Quality Assurance
Full-scale collection requires real-time quality assurance to prevent accumulating unusable data. Implement three QA layers: per-episode automated checks, per-session operator monitoring, and per-dataset validation. Per-episode checks run immediately after recording and flag episodes with missing frames, out-of-range joint positions, or camera exposure failures. Operators review flagged episodes within 5 minutes and re-record if necessary.
Per-session monitoring tracks operator performance metrics across a 2-hour collection shift. Compute success rate (percentage of episodes reaching goal state), average episode duration, and trajectory smoothness score (inverse of mean jerk). If an operator's success rate drops below 70% or smoothness score degrades by more than 20% relative to their baseline, pause collection and investigate fatigue or hardware issues. EPIC-KITCHENS collected 100 hours of egocentric video across 45 kitchens by scheduling operators for 90-minute sessions with 15-minute breaks[17].
Per-dataset validation runs nightly on the accumulated data. Check for distribution drift in observation statistics: compute mean and standard deviation of joint positions, end-effector poses, and pixel intensities, then flag any modality whose statistics shift by more than 2 standard deviations from the pilot baseline. This catches systematic issues like camera calibration drift or robot base frame changes. RoboNet aggregated data from 7 robot platforms and discovered that 12% of episodes had inconsistent coordinate frame conventions[18].
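A nightly drift check of this kind can be a few lines of NumPy. The pooling of all frames into per-modality arrays and the key names are assumptions:

```python
import numpy as np


def drift_flags(baseline: dict, nightly: dict, num_sigma: float = 2.0) -> dict:
    """Compare nightly per-modality statistics against the pilot baseline.

    `baseline` and `nightly` map modality names to arrays of shape (N, D) with all
    frames pooled. A modality is flagged when its nightly mean shifts by more than
    `num_sigma` baseline standard deviations in any dimension.
    """
    flags = {}
    for name, base in baseline.items():
        base = np.asarray(base, dtype=np.float64)
        new = np.asarray(nightly[name], dtype=np.float64)
        mu, sigma = base.mean(axis=0), base.std(axis=0) + 1e-8
        shift = np.abs(new.mean(axis=0) - mu) / sigma
        flags[name] = bool((shift > num_sigma).any())
    return flags
```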
Collection velocity targets depend on task complexity. Simple pick-and-place tasks yield 20-30 episodes per operator-hour. Multi-step assembly tasks with precise alignment requirements drop to 8-12 episodes per operator-hour. Scale AI's data engine processes 10,000+ robot demonstrations per week by parallelizing collection across multiple sites and operators[19]. For academic labs, a single robot with 2 trained operators collecting 4 hours per day generates 200-300 episodes per week.
Post-Processing: Cleaning, Filtering, and Normalization
Post-processing transforms raw recordings into training-ready episodes. The cleaning phase removes incomplete episodes, filters out failures that do not provide useful training signal, and trims pre-task and post-task idle periods. Incomplete episodes—those where the operator aborted mid-task or the recording crashed—are discarded entirely. Failure episodes where the robot dropped an object or missed a grasp are retained if they contain recovery attempts but discarded if the operator immediately reset without correction.
Temporal trimming uses heuristic rules to identify task start and end frames. For pick-and-place tasks, detect task start as the first frame where gripper velocity exceeds 0.05 m/s and task end as the first frame after gripper opening where end-effector velocity drops below 0.02 m/s for 10 consecutive frames. Manual review of 50 randomly sampled episodes validates that heuristics capture 95%+ of task boundaries correctly. RLDS provides a standardized episode boundary annotation format that downstream training code can consume without custom parsing[20].
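One way the trimming heuristic might look in code, assuming end-effector positions and a scalar gripper signal at 50 Hz. The thresholds follow the rule above; the array layout and the use of end-effector speed as the motion signal are assumptions:

```python
import numpy as np


def trim_bounds(ee_positions: np.ndarray, gripper: np.ndarray, rate_hz: float = 50.0):
    """Heuristic start/end frames for a pick-and-place episode, per the rules above.

    `ee_positions` is (T, 3) in metres; `gripper` is (T,) normalized 0 (closed) to 1 (open).
    """
    speed = np.linalg.norm(np.diff(ee_positions, axis=0), axis=-1) * rate_hz  # m/s, length T-1

    # Start: first frame where the end-effector moves faster than 0.05 m/s.
    start = int(np.argmax(speed > 0.05))

    # End: first frame after the gripper re-opens where speed stays below 0.02 m/s
    # for 10 consecutive frames.
    reopen = int(np.argmax((gripper[:-1] < 0.5) & (gripper[1:] >= 0.5)))
    end = len(gripper) - 1
    for t in range(reopen, len(speed) - 10):
        if (speed[t:t + 10] < 0.02).all():
            end = t
            break
    return start, end
```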
Normalization standardizes observation and action ranges across episodes. Joint positions are normalized to [-1, 1] using the robot's joint limits. End-effector poses are expressed relative to a canonical workspace frame—typically the robot base frame with origin at the workspace center. Image observations are resized to the model's expected input resolution (commonly 224×224 or 256×256) using bilinear interpolation and pixel values are scaled to [0, 1] float32.
Action normalization requires careful handling of temporal dependencies. LeRobot's dataset format stores actions as deltas (change from current state to next state) rather than absolute targets, enabling policies to learn relative corrections[21]. For position-controlled robots, compute action deltas as `action_t = state_{t+1} - state_t` in joint space. For velocity-controlled robots, actions are already in delta form. Clip action magnitudes to 3 standard deviations from the mean to remove outliers caused by teleoperation glitches.
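A compact sketch of joint normalization and delta-action computation with 3-sigma clipping. Per-episode statistics are used here for brevity; a production pipeline would clip against dataset-wide statistics:

```python
import numpy as np


def normalize_joints(joints: np.ndarray, lower: np.ndarray, upper: np.ndarray) -> np.ndarray:
    """Map joint positions from [lower, upper] (the robot's joint limits) to [-1, 1]."""
    return 2.0 * (joints - lower) / (upper - lower) - 1.0


def delta_actions(states: np.ndarray, clip_sigma: float = 3.0) -> np.ndarray:
    """Compute per-step deltas a_t = s_{t+1} - s_t and clip 3-sigma outliers.

    `states` is (T, D) joint positions for a position-controlled robot.
    """
    deltas = np.diff(states, axis=0)
    mu, sigma = deltas.mean(axis=0), deltas.std(axis=0) + 1e-8
    return np.clip(deltas, mu - clip_sigma * sigma, mu + clip_sigma * sigma)
```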
Dataset Formatting and Train-Validation Splits
Dataset formatting determines compatibility with training frameworks. The two dominant formats are RLDS (Reinforcement Learning Datasets) and LeRobot HDF5. RLDS stores episodes as TFRecord files with a standardized schema: each episode is a sequence of steps, each step contains observations (dict of tensors), actions (tensor), rewards (scalar), and metadata[6]. LeRobot uses HDF5 with one group per episode and datasets for each modality, plus a top-level metadata group storing episode lengths, success labels, and task identifiers[7].
Train-validation splits must respect temporal and scene diversity. Random splits that assign episodes to train or validation sets independently risk data leakage when multiple episodes share the same object arrangement or lighting condition. Instead, split by collection session: assign entire sessions (typically 20-50 episodes collected consecutively) to train or validation. This ensures that validation episodes test generalization to unseen scene configurations. A typical split allocates 85% of sessions to training, 10% to validation, and 5% to a held-out test set.
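Session-level splitting is simple to implement. The function below assumes each episode carries a session identifier and uses the 85/10/5 proportions from the text:

```python
import numpy as np


def split_by_session(session_ids: list, seed: int = 0,
                     fractions=(0.85, 0.10, 0.05)) -> dict:
    """Assign whole collection sessions to train/val/test splits.

    `session_ids` lists the session each episode belongs to; returns a mapping from
    split name to the set of session ids assigned to it.
    """
    rng = np.random.default_rng(seed)
    unique = np.array(sorted(set(session_ids)))
    rng.shuffle(unique)
    n = len(unique)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    return {
        "train": set(unique[:n_train]),
        "val": set(unique[n_train:n_train + n_val]),
        "test": set(unique[n_train + n_val:]),
    }
```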
Open X-Embodiment introduced cross-embodiment splits where validation episodes come from robot platforms not seen during training[22]. This tests whether policies learn task semantics rather than embodiment-specific motion patterns. For single-embodiment datasets, split by task variation: if collecting pick-and-place with 20 object models, assign 17 objects to training and 3 to validation.
Metadata fields enable downstream filtering and analysis. Store per-episode metadata including operator_id (anonymized identifier), collection_date (ISO 8601 timestamp), success_label (boolean), task_id (string identifier), scene_id (string identifier), and episode_duration_seconds (float). Store per-dataset metadata including robot_model, camera_intrinsics (3×3 matrix per camera), camera_extrinsics (4×4 transform per camera), action_space_description (string), and observation_space_description (string). Datasheets for Datasets provides a comprehensive metadata template that covers intended use, data collection process, and known limitations[23].
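A hypothetical per-episode metadata record serialized to JSON; the field names follow the list above and all values are invented for illustration:

```python
import json

# Illustrative per-episode metadata record.
episode_metadata = {
    "episode_id": "ep_000123",
    "operator_id": "op_07",                     # anonymized identifier
    "collection_date": "2025-05-02T14:31:05Z",  # ISO 8601
    "success_label": True,
    "task_id": "pick_place_mug",
    "scene_id": "tabletop_cluttered_03",
    "episode_duration_seconds": 18.4,
}

with open("ep_000123.json", "w") as f:
    json.dump(episode_metadata, f, indent=2)
```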
Quality Validation and Training Smoke Tests
Quality validation runs a battery of automated checks before releasing the dataset. Check 1: verify that all episodes have matching observation and action sequence lengths within 1 frame. Check 2: confirm that action magnitudes fall within 3 standard deviations of the dataset mean for each action dimension. Check 3: validate that camera intrinsics matrices are well-formed (positive focal lengths, principal point inside the image bounds) and that focal lengths fall within 10% of manufacturer specifications. Check 4: ensure that at least 80% of episodes have success_label=True.
Distribution analysis detects collection biases. Plot histograms of initial object positions, final object positions, episode durations, and gripper open-close cycle counts. If object positions cluster in one region of the workspace, the dataset lacks spatial diversity. If 90% of episodes complete in under 15 seconds but 10% take over 60 seconds, investigate whether long episodes represent difficult variations or operator errors. Paullada et al. surveyed 1,000 ML datasets and found that 43% lacked documentation of known distribution biases[24].
Training smoke tests validate that the dataset format is compatible with target training frameworks. Write a minimal training script that loads 10 episodes, constructs batches, and runs one forward pass through a small policy network. For LeRobot's Diffusion Policy, this means loading image observations, normalizing them, passing them through a ResNet encoder, and predicting 16-step action sequences[25]. If the smoke test fails, debug data loading before scaling to full training.
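A generic smoke test along these lines exercises loading, batching, and a forward pass without depending on any particular framework's loaders. The HDF5 layout, file name, and the tiny stand-in network below are assumptions; this is not LeRobot's Diffusion Policy, only a check that the data path works end to end:

```python
import h5py
import numpy as np
import torch
import torch.nn as nn

ACTION_DIM, CHUNK = 14, 16


class TinyPolicy(nn.Module):
    """Small stand-in network: image -> 16-step action chunk, just to exercise the data path."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, ACTION_DIM * CHUNK)

    def forward(self, images):
        return self.head(self.encoder(images)).view(-1, CHUNK, ACTION_DIM)


def smoke_test(path: str, num_episodes: int = 10) -> None:
    """Load a handful of episodes, build one batch, and run a single forward pass."""
    frames = []
    with h5py.File(path, "r") as f:
        for name in list(f.keys())[:num_episodes]:
            img = f[name]["image_left"][0]          # first frame of each episode
            frames.append(img.astype(np.float32) / 255.0)
    batch = torch.from_numpy(np.stack(frames)).permute(0, 3, 1, 2)  # NHWC -> NCHW
    actions = TinyPolicy()(batch)
    print("action chunk shape:", tuple(actions.shape))  # expect (N, 16, 14)


smoke_test("demo_dataset.hdf5")
```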
Dataset versioning prevents training on inconsistent data. Use semantic versioning (major.minor.patch) where major increments indicate breaking changes to the observation-action contract, minor increments add new episodes or tasks, and patch increments fix metadata or correct mislabeled episodes. Store dataset versions in a registry with release notes documenting changes. Truelabel's marketplace requires dataset providers to maintain version history and provenance metadata for every release[26].
Scale Guidelines by Policy Architecture
Behavioral cloning policies trained on single tasks converge with 50-200 demonstrations depending on task complexity. A pick-and-place task with fixed object positions and deterministic success criteria requires 50-100 episodes. Adding object pose variation increases the requirement to 100-150 episodes. Introducing multiple object models or scene clutter pushes the requirement to 150-200 episodes. Beyond 200 episodes, single-task BC policies show diminishing returns—validation success rate improves by less than 2% per additional 50 episodes.
Diffusion policies targeting multi-task generalization require 500-2,000 episodes per task family. Diffusion Policy trained on 200 episodes per task achieved 85% success on in-distribution test scenes[27]. Scaling to 500 episodes per task improved out-of-distribution generalization by 12 percentage points. Task families—groups of tasks sharing similar motion primitives like pick-place-stack—benefit from shared data: 1,000 episodes split across 5 related tasks outperform 200 episodes on a single task.
Vision-language-action transformers demand 100,000+ episodes spanning diverse scenes, objects, and language instructions. RT-1 trained on 130,000 demonstrations across 700 tasks[1]. Open X-Embodiment aggregated 1 million episodes from 22 embodiments[22]. OpenVLA required 970,000 episodes from the Open X-Embodiment dataset to achieve 50% success on unseen tasks with novel objects[4]. The scaling law appears log-linear: doubling dataset size improves zero-shot task success rate by 5-8 percentage points up to 1 million episodes.
Dataset diversity metrics predict generalization better than raw episode count. Measure diversity as the number of unique object-scene-instruction tuples. A dataset with 10,000 episodes but only 50 unique scenes has lower effective diversity than a dataset with 5,000 episodes across 200 scenes. BridgeData V2 demonstrated that policies trained on 10,000 diverse episodes outperformed policies trained on 50,000 episodes from a narrow distribution[28].
Data Requirements by Task Complexity
Task complexity determines minimum viable dataset size. Simple pick-and-place with fixed object positions and no occlusions requires 50-100 episodes. Adding 6-DOF object pose variation increases the requirement to 150-200 episodes. Introducing multiple object models (5-10 distinct shapes) pushes the requirement to 300-500 episodes. Dense clutter scenarios where the robot must move obstacles to reach the target object require 800-1,200 episodes to capture the combinatorial space of object arrangements.
Multi-step tasks scale super-linearly with step count. A two-step task (pick object A, place in bin B) requires 200-300 episodes. A four-step task (pick A, place in B, pick C, place in D) requires 600-900 episodes—not 400-600 as linear scaling would predict. The reason: multi-step tasks have more failure modes and require demonstrations of recovery behaviors at each step. CALVIN's 5-step task chains required 24,000 episodes to achieve 60% success on unseen chains[16].
Bimanual coordination adds 50-100% to data requirements. A single-arm pick-and-place task that requires 200 episodes becomes a 300-400 episode requirement when both arms must coordinate. The challenge: bimanual tasks have coupled action spaces where each arm's motion depends on the other's state. Demonstrations must cover the space of coordination patterns—simultaneous motion, sequential motion, and leader-follower patterns.
Long-horizon tasks with sparse rewards require dense subgoal annotations or hierarchical data collection. A task like 'prepare a sandwich' that takes 2-3 minutes and involves 15-20 primitive actions cannot be learned from end-to-end demonstrations alone. Either annotate subgoal completions (bread sliced, cheese placed, sandwich assembled) or collect separate datasets for each primitive and compose them hierarchically. EPIC-KITCHENS annotated 90,000 action segments across 100 hours of cooking video to enable learning of long-horizon meal preparation[29].
Common Failure Modes and Mitigation Strategies
Observation misalignment causes policies to learn spurious correlations. If camera timestamps lag joint state timestamps by 50 milliseconds, the policy learns to react to stale visual information and fails when deployed with correct synchronization. Mitigation: use hardware timestamps from the robot middleware, not software logging times. Validate synchronization by recording a calibration sequence where the robot moves through a known trajectory while a checkerboard is visible—cross-correlate joint positions with checkerboard pose estimates to measure lag.
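Lag estimation from the calibration sequence can be done with a simple cross-correlation. The choice of signals (one joint angle versus one checkerboard pose component) and the resampling of both to a common rate are left to the caller; the function below is a sketch under those assumptions:

```python
import numpy as np


def estimate_lag_ms(joint_signal: np.ndarray, camera_signal: np.ndarray,
                    rate_hz: float = 50.0, max_lag: int = 25) -> float:
    """Estimate camera-vs-joint-state lag in milliseconds by cross-correlation.

    Both signals must be 1-D, equal length, and resampled to `rate_hz`.
    A positive result means the camera stream trails the joint stream.
    """
    a = (joint_signal - joint_signal.mean()) / (joint_signal.std() + 1e-8)
    b = (camera_signal - camera_signal.mean()) / (camera_signal.std() + 1e-8)
    lags = np.arange(-max_lag, max_lag + 1)
    scores = [float(np.dot(a, np.roll(b, k))) for k in lags]
    best_k = lags[int(np.argmax(scores))]
    # If b[t] ~= a[t - d] (camera lags by d samples), alignment peaks at k = -d,
    # so the camera lag in samples is -best_k.
    return -1000.0 * best_k / rate_hz
```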
Inconsistent coordinate frames corrupt spatial reasoning. If 20% of episodes use a different base frame origin due to robot recalibration, the policy learns a mixture distribution that generalizes poorly. Mitigation: store camera extrinsics and robot base frame transforms in per-episode metadata. Run a validation script that checks whether end-effector poses in the base frame fall within the expected workspace bounds for all episodes. Flag episodes with outlier poses for review.
Operator fatigue degrades demonstration quality after 60-90 minutes of continuous teleoperation. Trajectories become jerkier, success rates drop, and operators take shortcuts that violate task specifications. Mitigation: schedule 15-minute breaks every 90 minutes. Monitor per-session smoothness metrics and pause collection if an operator's jerk score increases by more than 30% relative to their baseline. DROID's collection protocol limited operators to 4-hour shifts with mandatory breaks[3].
Dataset imbalance biases policies toward common scenarios. If 80% of pick episodes involve grasping objects from the left side of the workspace, the policy fails on right-side grasps. Mitigation: track spatial distribution of object initial poses and success outcomes. If any workspace region has fewer than 10% of episodes, explicitly collect additional data in that region. Use stratified sampling during training to balance rare and common scenarios.
Dataset Documentation and Metadata Standards
Dataset documentation enables reproducibility and downstream use. A complete dataset release includes a README with dataset overview, task descriptions, observation-action contract specification, collection protocol summary, known limitations, and citation information. The README should answer: what tasks does this dataset cover, what robot platform was used, how many episodes are included, what is the train-validation-test split, and what are the intended use cases.
Datasheets for Datasets provides a structured template covering motivation (why was the dataset created), composition (what instances does it contain), collection process (how was data acquired), preprocessing (what cleaning was applied), uses (what tasks is it suitable for), distribution (how is it released), and maintenance (who maintains it)[23]. A datasheet for a robot demonstration dataset should specify teleoperation hardware, operator training protocol, quality assurance procedures, and any known distribution biases.
Metadata schemas enable programmatic filtering and search. Store per-episode metadata as JSON or YAML files with fields including episode_id, task_id, success_label, operator_id, collection_timestamp, episode_duration_seconds, and robot_configuration. Store per-dataset metadata including robot_model, end_effector_type, camera_models, observation_space_spec (dict mapping modality names to shapes and dtypes), action_space_spec (dict mapping action dimensions to units and ranges), and coordinate_frame_conventions.
Data provenance tracking documents the full lineage from raw recordings to processed episodes[30]. Record which version of the recording software was used, which post-processing scripts were applied, what filtering criteria removed episodes, and what normalization transforms were applied. This enables debugging when policies trained on the dataset exhibit unexpected behaviors—you can trace back to the raw data and identify whether the issue originated in collection or processing.
Licensing and Commercial Use Considerations
Dataset licensing determines downstream use rights. Academic datasets typically use permissive licenses like Creative Commons Attribution 4.0 that allow commercial use with attribution[31]. Some datasets use CC BY-NC to restrict commercial use, but this prevents companies from training production models on the data[32]. Restrictive licenses reduce dataset impact—researchers avoid datasets they cannot use in commercial projects.
Robot platform licensing affects dataset distribution. If demonstrations were collected on a robot with proprietary control software, the dataset may inherit licensing restrictions from the robot vendor. Check the robot's end-user license agreement for clauses about data collection and distribution. Some vendors explicitly permit dataset release; others require written permission. RoboNet's dataset license explicitly states that data was collected on robots with permissive academic licenses[33].
Privacy considerations arise when datasets include human operators or bystanders in camera views. If wrist-mounted cameras capture human hands or faces, the dataset may require consent from operators and anonymization of identifying features. GDPR Article 7 sets the conditions for valid consent when consent is the lawful basis for processing personal data of EU residents[34]. Mitigation: blur faces and skin tones in camera images, or collect data in controlled environments without bystanders.
Truelabel's marketplace requires dataset providers to specify licensing terms, commercial use permissions, and any platform-specific restrictions before listing[26]. Buyers can filter datasets by license type to ensure compatibility with their intended use case. Clear licensing accelerates dataset adoption—ambiguous terms force buyers to seek legal review, adding weeks to procurement timelines.
Integration with Training Frameworks
Training framework integration determines how quickly researchers can use your dataset. LeRobot provides native loaders for its HDF5 format that handle batching, normalization, and augmentation[7]. To make a dataset LeRobot-compatible, structure HDF5 files with one group per episode, datasets named observation.image, observation.state, and action, and a metadata group with episode_lengths and success_labels arrays.
RLDS datasets integrate with TensorFlow Datasets and JAX training loops[20]. Convert episodes to TFRecord format using the RLDS builder API, which handles sharding, compression, and metadata generation. RLDS datasets appear in the TensorFlow Datasets catalog and can be loaded with a single line: `dataset = tfds.load('your_dataset_name')`. This discoverability drives adoption—researchers browse the catalog and try datasets without manual download and parsing.
Custom loaders require documentation and examples. Provide a Python script that loads one episode, prints observation and action shapes, and visualizes a camera frame. Include a minimal training example that loads batches and runs one forward pass through a policy network. LeRobot's training examples demonstrate end-to-end workflows from dataset loading through policy evaluation[25].
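A minimal inspection script for the HDF5 layout assumed earlier in this article; the file name and dataset names are illustrative:

```python
import h5py
import matplotlib.pyplot as plt

# Inspect the first episode of an assumed layout: one group per episode,
# one dataset per modality.
with h5py.File("demo_dataset.hdf5", "r") as f:
    episode = f[sorted(f.keys())[0]]
    for name, dset in episode.items():
        print(f"{name}: shape={dset.shape}, dtype={dset.dtype}")
    frame = episode["image_left"][0]

plt.imshow(frame)
plt.title("first camera frame of episode 0")
plt.axis("off")
plt.show()
```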
Dataset hosting affects accessibility. Hugging Face Datasets Hub provides free hosting for datasets under 300 GB with built-in versioning and download management. For larger datasets, use cloud storage (AWS S3, Google Cloud Storage) with public read access and provide download scripts that handle resumable transfers. Hugging Face's dataset loading library supports streaming from cloud storage, enabling training on datasets too large to fit in local storage[35].
Emerging Standards and Future Directions
Standardization efforts aim to reduce dataset fragmentation. Open X-Embodiment introduced a common data format that 22 institutions adopted to enable cross-embodiment training[22]. The format specifies observation and action tensor names, coordinate frame conventions, and metadata fields. Adopting this standard makes your dataset immediately compatible with RT-X models and other Open X-Embodiment consumers.
NVIDIA's Cosmos World Foundation Models initiative is developing standardized formats for physical AI datasets that include 3D scene representations, object-centric annotations, and physics metadata[36]. These formats enable training world models that predict future states given actions—a capability that current demonstration datasets do not support because they lack 3D geometry and physics ground truth.
Multi-modal datasets that combine demonstrations with natural language annotations, force-torque readings, and tactile sensor data are becoming the new frontier. DROID included language instructions for 30% of episodes, enabling language-conditioned policy training[3]. Future datasets will likely include audio (to capture contact sounds), thermal imaging (to detect object temperature), and event cameras (to capture high-speed motion).
Synthetic-to-real transfer is reducing the need for large-scale real-world collection. Domain randomization techniques enable policies trained entirely in simulation to transfer to real robots[37]. However, simulation still struggles with contact-rich manipulation and deformable objects. Hybrid datasets that combine 10,000 real demonstrations with 100,000 synthetic episodes may become the standard approach—real data provides contact dynamics and object interaction priors, synthetic data provides scale and diversity.
Cost and Timeline Estimation
Dataset creation costs scale with episode count and task complexity. A 500-episode single-task dataset with one robot and two trained operators requires 4-6 weeks: 1 week for hardware setup and pilot collection, 2-3 weeks for full-scale collection at 25 episodes per operator-day, and 1-2 weeks for post-processing and validation. Personnel costs dominate: two operators at 20 hours per week for 3 weeks equals 120 operator-hours at $25-50 per hour, totaling $3,000-6,000.
Multi-task datasets with 5,000 episodes across 10 tasks require 12-16 weeks and $30,000-60,000 in operator costs. Add hardware costs: a teleoperation-capable robot arm ($15,000-50,000), RGB-D cameras ($500-2,000 per camera), teleoperation device ($500-5,000), and compute infrastructure for data storage and processing ($2,000-5,000). Total capital expenditure for a new lab: $20,000-60,000.
Scale AI's data engine offers turnkey demonstration dataset collection at $50-200 per episode depending on task complexity and volume[19]. For a 1,000-episode dataset, this translates to $50,000-200,000—competitive with in-house collection when accounting for hardware amortization and operator training overhead. Truelabel's marketplace lists existing datasets at $0.10-5.00 per episode, enabling buyers to acquire 10,000 episodes for $1,000-50,000 without collection overhead[26].
Timeline compression requires parallel collection. Running two robots simultaneously with four operators doubles throughput but increases coordination overhead. DROID collected 76,000 episodes in 6 months by parallelizing across 18 operators and multiple robot platforms[3]. For academic labs with limited resources, partnering with other institutions to share data collection burden is increasingly common—each lab contributes 1,000 episodes to a shared 10,000-episode dataset.
External references and source context
[1] RT-1: Robotics Transformer for Real-World Control at Scale. RT-1 trained on 130,000 demonstrations across 700 tasks, achieving a 97% success rate. arXiv.
[2] Open X-Embodiment: Robotic Learning Datasets and RT-X Models. Open X-Embodiment aggregated 1 million episodes from 22 robot embodiments. arXiv.
[3] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. DROID collected 76,000 manipulation trajectories in 564 real-world scenes. arXiv.
[4] OpenVLA: An Open-Source Vision-Language-Action Model. Vision-language-action model grounding natural language in physical affordances. arXiv.
[5] BridgeData V2: A Dataset for Robot Learning at Scale. BridgeData V2 demonstrated that dataset diversity matters more than raw episode count. arXiv.
[6] RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning. RLDS standardizes coordinate frames and episode format for reinforcement learning datasets. arXiv.
[7] LeRobot documentation. LeRobot dataset format and training workflows. Hugging Face.
[8] ROS bag documentation. ROS bag recording with hardware timestamps for temporal synchronization. docs.ros.org.
[9] ALOHA project site. ALOHA leader-follower teleoperation system for bimanual manipulation. tonyzhaozh.github.io.
[10] DROID project site. DROID standardized SpaceMouse teleoperation across 18 operators and 564 scenes. droid-dataset.github.io.
[11] Scale AI and Universal Robots physical AI partnership. Scale AI partnership with Universal Robots for demonstration data collection. scale.com.
[12] MCAP file format. MCAP file format for indexed random access and efficient compression. mcap.dev.
[13] Introduction to HDF5. HDF5 hierarchical data format for multi-modal episode storage. The HDF Group.
[14] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. RT-2 used a single overhead camera for 6,000 tasks, demonstrating consistent framing. arXiv.
[15] BridgeData project site. BridgeData pilot phase revealed 15% of episodes contained unintended collisions. rail-berkeley.github.io.
[16] CALVIN paper. CALVIN task suite iterated through three pilot rounds to balance difficulty. arXiv.
[17] Scaling Egocentric Vision: The EPIC-KITCHENS Dataset. EPIC-KITCHENS collected 100 hours of egocentric video with 90-minute operator sessions. arXiv.
[18] RoboNet: Large-Scale Multi-Robot Learning. RoboNet aggregated data from 7 platforms and found 12% of episodes had inconsistent coordinate frames. arXiv.
[19] Scale AI physical AI. Scale AI data engine processes 10,000+ robot demonstrations per week. scale.com.
[20] RLDS: Reinforcement Learning Datasets. RLDS ecosystem for generating, sharing, and using reinforcement learning datasets. GitHub.
[21] LeRobot dataset documentation. LeRobot dataset format stores actions as deltas for relative corrections. Hugging Face.
[22] Open X-Embodiment: Robotic Learning Datasets and RT-X Models. Cross-embodiment splits and common data format adoption. arXiv.
[23] Datasheets for Datasets. Structured template for dataset documentation. arXiv.
[24] Data and its (dis)contents: A survey of dataset development and use in machine learning research. Survey found 43% of ML datasets lacked documentation of distribution biases. Patterns.
[25] Diffusion Policy training example. LeRobot Diffusion Policy training example for smoke testing dataset format. GitHub.
[26] Truelabel physical AI data marketplace bounty intake. Truelabel marketplace requires version history and provenance metadata. truelabel.ai.
[27] Diffusion Policy training example. Diffusion Policy uses 10 Hz action frequency with 16-step action chunks. GitHub.
[28] BridgeData V2: A Dataset for Robot Learning at Scale. Policies trained on 10,000 diverse episodes outperformed 50,000 episodes from a narrow distribution. arXiv.
[29] Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100. EPIC-KITCHENS annotated 90,000 action segments across 100 hours of cooking video. arXiv.
[30] Truelabel data provenance glossary. Data provenance tracking documents full lineage from raw recordings to processed episodes. truelabel.ai.
[31] Creative Commons Attribution 4.0 International deed. CC BY 4.0 allows commercial use with attribution. Creative Commons.
[32] Creative Commons Attribution-NonCommercial 4.0 International deed. CC BY-NC restricts commercial use, preventing production model training. creativecommons.org.
[33] RoboNet dataset license. RoboNet dataset license explicitly permits academic and commercial use. GitHub raw content.
[34] GDPR Article 7: Conditions for consent. Conditions for valid consent for processing personal data of EU residents. GDPR-Info.eu.
[35] Hugging Face Datasets documentation. Hugging Face Datasets library supports streaming from cloud storage. Hugging Face.
[36] NVIDIA Cosmos World Foundation Models. NVIDIA Cosmos is developing standardized formats for physical AI datasets with 3D representations. NVIDIA Developer.
[37] Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. Domain randomization enables sim-to-real transfer for robotic manipulation. arXiv.
FAQ
How many demonstrations do I need to train a robot manipulation policy?
The required number depends on your policy architecture and task complexity. Behavioral cloning on a single fixed task converges with 50-200 episodes. Multi-task diffusion policies require 500-2,000 episodes per task family. Vision-language-action transformers like RT-1 and OpenVLA need 100,000+ episodes across diverse scenes and objects. Dataset diversity—measured by unique object-scene combinations—matters more than raw episode count for out-of-distribution generalization. BridgeData V2 demonstrated that 10,000 diverse episodes outperform 50,000 episodes from a narrow distribution.
What teleoperation hardware should I use for demonstration collection?
The choice depends on your budget and task requirements. Leader-follower systems like ALOHA provide intuitive bimanual control but require custom fabrication. Commercial 6-DOF devices like SpaceMouse Pro cost $500-1,500 and work with any robot through inverse kinematics mapping. DROID standardized on SpaceMouse Pro with foot pedal gripper control and collected 76,000 episodes across 18 operators. Operator training time matters more than hardware sophistication—invest in structured training protocols that teach smooth motion profiles and consistent task execution.
How do I ensure temporal synchronization across cameras and robot state?
Use hardware timestamps from your robot middleware (ROS2 message timestamps or equivalent), not software logging times. Subscribe to all sensor topics in a single recording node and write timestamped messages to MCAP or HDF5 files. Validate synchronization by recording a calibration sequence where the robot moves through a known trajectory while a checkerboard is visible in camera views—cross-correlate joint positions with checkerboard pose estimates to measure lag. Target synchronization within 10 milliseconds across all modalities. Misaligned observations corrupt the causal structure that imitation learning depends on.
What dataset format should I use for robot demonstrations?
The two dominant formats are RLDS (Reinforcement Learning Datasets) and LeRobot HDF5. RLDS stores episodes as TFRecord files with a standardized schema and integrates with TensorFlow Datasets. LeRobot uses HDF5 with one group per episode and provides native PyTorch loaders. Choose RLDS if you train with JAX or TensorFlow; choose LeRobot HDF5 if you train with PyTorch. Both formats support multi-modal observations, variable-length episodes, and rich metadata. Open X-Embodiment introduced a common format that 22 institutions adopted—using this standard makes your dataset immediately compatible with RT-X models.
How should I split my dataset into train, validation, and test sets?
Split by collection session, not by random episode assignment. Assign entire sessions (20-50 consecutive episodes) to train or validation to prevent data leakage when multiple episodes share the same scene configuration. A typical split allocates 85% of sessions to training, 10% to validation, and 5% to held-out test. For multi-object tasks, split by object model—assign 17 of 20 objects to training and 3 to validation. This tests whether policies learn task semantics rather than memorizing specific object appearances. Open X-Embodiment introduced cross-embodiment splits where validation episodes come from robot platforms not seen during training.
What metadata should I include with each episode?
Store per-episode metadata including operator_id (anonymized), collection_date (ISO 8601 timestamp), success_label (boolean), task_id (string), scene_id (string), and episode_duration_seconds (float). Store per-dataset metadata including robot_model, camera_intrinsics (3×3 matrix per camera), camera_extrinsics (4×4 transform per camera), action_space_description, and observation_space_description. Use Datasheets for Datasets as a template for comprehensive documentation covering motivation, composition, collection process, preprocessing, intended uses, and known limitations. Data provenance tracking documents the full lineage from raw recordings to processed episodes, enabling debugging when policies exhibit unexpected behaviors.
Looking for a robot demonstration dataset?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.
List Your Robot Dataset on Truelabel