Physical AI Data Engineering
How to Build a Manipulation Dataset for Robot Learning
A manipulation dataset pairs robot trajectories with multi-modal observations (RGB-D, proprioception, force) collected via teleoperation or scripted policies. Production pipelines require task specification, hardware calibration (camera intrinsics, temporal sync), teleoperation interfaces (VR, leader-follower, SpaceMouse), episode recording in RLDS or HDF5 formats, and validation against success metrics before training. The Open X-Embodiment consortium aggregated 1 million trajectories across 22 robot embodiments, demonstrating that cross-embodiment generalization demands standardized action spaces and rich language annotations alongside pixel observations.
Quick facts
- Difficulty: Intermediate
- Audience: Physical AI data engineers
- Last reviewed: 2026-01-15
Define Task Scope and Success Criteria Before Hardware Setup
Start with a task specification document that enumerates every manipulation primitive your policy must execute: pick-and-place, drawer opening, articulated object manipulation, bimanual coordination. Each task variant requires explicit success conditions (object within 2 cm of target pose, drawer angle exceeds 45 degrees, grasp stability over 3 seconds) and initial state distributions (object count, pose sampling bounds, lighting conditions). Open X-Embodiment aggregated 527 skills across 160,000 tasks by enforcing per-task success predicates and language annotations[1]. Document observation modalities (RGB-D resolution, frame rate, proprioceptive state vector layout) and action space (end-effector delta SE(3) with gripper command versus joint position targets) in a versioned YAML contract.
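A minimal sketch of such a contract, built as a Python dict and serialized with PyYAML; the field names and bounds are illustrative, not a fixed schema:

```python
# Sketch of a versioned task/observation/action contract (illustrative field
# names and bounds, not a fixed schema). Requires: pip install pyyaml
import yaml

contract = {
    "version": "1.2.0",
    "task": {
        "name": "drawer_open",
        "success": {"drawer_angle_deg": {"min": 45.0}},
        "initial_state": {
            "object_count": [1, 3],
            "pose_xy_cm": [-10.0, 10.0],   # uniform sampling bounds
            "yaw_deg": [-15.0, 15.0],
        },
    },
    "observations": {
        "wrist_rgb": {"shape": [512, 512, 3], "dtype": "uint8", "rate_hz": 6},
        "state": {"layout": "7 joint positions + 7 joint velocities"},
    },
    "action": {
        "space": "ee_delta_se3_plus_gripper",
        "dim": 8,
        "frame": "base",   # right-hand rule, z-up
    },
}

with open("task_contract.yaml", "w") as f:
    yaml.safe_dump(contract, f, sort_keys=False)
```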
RT-1 was trained on 130,000 episodes with 6 Hz control, 512×512 RGB, and 7-DOF delta actions, while DROID collected 76,000 trajectories at 5 Hz with wrist-mounted cameras and absolute joint commands[2]. Coordinate frame conventions (right-hand rule, z-up, base frame versus end-effector frame) must be locked before data collection begins to avoid post-hoc transformations that introduce noise. Specify language instruction templates (imperative commands, object-centric descriptions, spatial relations) and whether you need human-verified success labels or automated heuristics. BridgeData V2 paired 60,000 trajectories with natural language goals, enabling RT-2 to transfer web-scale vision-language knowledge to robotic control[3].
Select Robot Embodiment and Gripper Configuration for Target Tasks
Embodiment choice constrains dataset transferability: 7-DOF arms (Franka Emika Panda, UR5e) offer redundancy for obstacle avoidance, while 6-DOF arms (ViperX, Kinova Gen3) simplify inverse kinematics but limit workspace dexterity. Franka FR3 Duo dual-arm systems enable bimanual tasks but double data collection complexity and require synchronized control loops. Gripper selection impacts grasp diversity: parallel-jaw grippers (Robotiq 2F-85) handle 80 percent of tabletop tasks, while underactuated hands (Allegro, Shadow) capture richer contact dynamics at the cost of teleoperation difficulty[1]. DexYCB recorded 582,000 frames of dexterous grasps with MANO hand pose annotations, but teleoperation required expert operators and 15-minute setup per object.
Wrist-mounted cameras (Intel RealSense D435, ZED Mini) provide egocentric views critical for contact-rich tasks, while external cameras (overhead, side-view) supply global context for navigation and multi-object scenes. EPIC-KITCHENS-100 demonstrated that egocentric video alone misses third-person spatial reasoning, prompting DROID to deploy 3-camera rigs (wrist, shoulder, external) across 564 scenes[2]. Temporal synchronization matters: Scale AI's Physical AI platform enforces hardware-triggered capture with sub-millisecond jitter using PTP (Precision Time Protocol) to align RGB, depth, and proprioceptive streams.
Build Teleoperation Interface with Sub-100ms Latency and Force Feedback
Teleoperation quality determines dataset diversity and success rate. VR controllers (Meta Quest 3, Valve Index) offer 6-DOF tracking at 90 Hz but require custom OpenXR SDK integration for gripper mapping and haptic feedback. Leader-follower systems (ALOHA, UMI Gripper) provide intuitive kinesthetic teaching: ALOHA collected 825 bimanual episodes with 50-gram force sensors in each fingertip, enabling contact-rich tasks like cable routing and dishwasher loading[4]. SpaceMouse interfaces (3Dconnexion SpaceMouse Wireless) map 6-DOF input to end-effector velocity commands but lack force feedback, reducing grasp success rates by 15-20 percent versus haptic systems. Implement dead-man switches and emergency stops in the control loop: ROS bag recording should trigger only when the operator confirms task start to avoid recording idle states.
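A minimal sketch of that gating logic; read_spacemouse, send_ee_velocity, stop_robot, and deadman_pressed are hypothetical stand-ins for your input device and robot drivers:

```python
# Dead-man-switch gate for teleop velocity commands (sketch; the four callables
# are hypothetical stand-ins for your input device and robot control stack).
import time

VEL_SCALE = 0.05   # m/s per unit of SpaceMouse deflection
LOOP_HZ = 100      # keeps the command loop well under the 10 ms budget

def teleop_loop(read_spacemouse, send_ee_velocity, stop_robot, deadman_pressed):
    period = 1.0 / LOOP_HZ
    while True:
        t0 = time.monotonic()
        twist, gripper = read_spacemouse()   # 6-DOF twist + gripper command
        if deadman_pressed():
            send_ee_velocity([v * VEL_SCALE for v in twist], gripper)
        else:
            stop_robot()                     # zero velocity when switch released
        time.sleep(max(0.0, period - (time.monotonic() - t0)))
```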
Latency budgets: visual feedback under 50 ms (camera capture to display), control loop under 10 ms (command to actuator), total teleoperation latency under 100 ms to maintain operator flow. RLDS introduced episode boundaries and metadata fields (success, language annotation, operator ID) as first-class schema elements, enabling LeRobot to filter 12,000 episodes by success rate and task difficulty during training[5]. Record operator gaze (Tobii eye tracker, Pupil Labs) and hand pose (MediaPipe Hands, Leap Motion) as auxiliary signals: Ego4D showed that gaze-object alignment predicts manipulation intent 3 seconds before contact, useful for imitation learning priors.
Implement Multi-Camera Synchronization and Calibration Pipelines
Hardware-triggered capture eliminates frame skew: configure all cameras to trigger on a shared GPIO pulse or PTP clock signal, ensuring RGB-D pairs align within 1 millisecond. MCAP supports multi-channel recording with nanosecond timestamps, used by Foxglove to replay 4-camera streams with proprioceptive data in a unified timeline. Calibrate camera intrinsics with ArUco or ChArUco targets to sub-pixel reprojection error, then solve the camera-to-robot transform with OpenCV's calibrateHandEye using Tsai-Lenz or Park-Martin methods. Verify calibration by projecting known 3D points (gripper fingertips, table corners) into image space and measuring pixel error; reject calibrations with mean error above 2 pixels. Depth alignment: RealSense cameras require librealsense2 alignment filters to register depth to RGB, while ZED cameras provide pre-aligned depth maps at 720p/60fps.
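A sketch of the hand-eye solve with OpenCV's calibrateHandEye; the pose lists are placeholders that you fill with 15-20 real pose pairs collected across the workspace:

```python
# Hand-eye calibration sketch (solves AX = XB for the camera-to-gripper
# transform). R_gripper2base/t_gripper2base come from forward kinematics;
# R_target2cam/t_target2cam come from ArUco/ChArUco detections at the same
# instants. The empty lists below are placeholders: fill with 15-20 pose pairs.
import cv2
import numpy as np

R_gripper2base, t_gripper2base = [], []   # each entry: (3,3) R, (3,1) t
R_target2cam, t_target2cam = [], []

R_cam2gripper, t_cam2gripper = cv2.calibrateHandEye(
    R_gripper2base, t_gripper2base,
    R_target2cam, t_target2cam,
    method=cv2.CALIB_HAND_EYE_TSAI,   # or cv2.CALIB_HAND_EYE_PARK
)

# Pack into a 4x4 homogeneous transform for storage alongside K, D, R, T.
T_cam2gripper = np.eye(4)
T_cam2gripper[:3, :3] = R_cam2gripper
T_cam2gripper[:3, 3] = np.asarray(t_cam2gripper).ravel()
print(T_cam2gripper)
```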
Store calibration matrices (K, D, R, T) in YAML files versioned with dataset metadata: RLDS embeds camera_info messages in episode metadata, enabling downstream consumers to reproject depth to point clouds without external calibration files[6]. Test synchronization by recording a fast-moving object (metronome, LED blinker) and verifying frame alignment across all cameras within 5 milliseconds.
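A small check along those lines, assuming you have already extracted per-camera timestamp arrays (in seconds) from the recording:

```python
# Verify multi-camera sync: worst-case per-frame timestamp spread across cameras.
import numpy as np

def max_sync_skew(timestamps_by_camera: dict) -> float:
    """Return the worst per-frame spread (seconds) across all cameras."""
    n = min(len(t) for t in timestamps_by_camera.values())
    stacked = np.stack([t[:n] for t in timestamps_by_camera.values()])
    return float((stacked.max(axis=0) - stacked.min(axis=0)).max())

ts = {
    "wrist": np.array([0.0000, 0.0333, 0.0667]),
    "overhead": np.array([0.0004, 0.0334, 0.0671]),
}
assert max_sync_skew(ts) < 0.005, "cameras drifted beyond the 5 ms budget"
```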
Design Episode Recording Schema in RLDS or HDF5 Format
RLDS (Reinforcement Learning Datasets) defines episodes as sequences of (observation, action, reward, discount) tuples with metadata (language, success, scene_id)[6]. Each step contains nested dictionaries: observation['image']['wrist_rgb'] as uint8[H,W,3], observation['state'] as float32[14] (7 joint positions + 7 velocities), action as float32[8] (7-DOF delta + gripper). LeRobot dataset v3 applies H.264 video compression, reducing 50,000 episodes from 12 TB to 1.8 TB while preserving visual fidelity for policy training[7]. HDF5 offers hierarchical storage: create groups /episode_000001/observations/images/cam0 with chunked datasets (chunk size 10 frames) and gzip compression level 4 for 3:1 compression ratios. HDF5 supports parallel writes via MPI-IO, enabling real-time recording at 30 Hz across 4 cameras without frame drops. Attach episode-level metadata as HDF5 attributes, e.g. group.attrs['operator_id'] = 'user_42'.
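A minimal h5py sketch of that layout (shapes and values illustrative):

```python
# Write one episode in the hierarchical HDF5 layout described above.
import h5py
import numpy as np

T, H, W = 150, 480, 640   # steps per episode, image height/width (illustrative)
with h5py.File("episode_000001.hdf5", "w") as f:
    ep = f.create_group("episode_000001")
    ep.attrs["operator_id"] = "user_42"   # episode-level metadata as attributes
    ep.attrs["success"] = True
    obs = ep.create_group("observations")
    obs.create_dataset(
        "images/cam0",
        data=np.zeros((T, H, W, 3), dtype=np.uint8),
        chunks=(10, H, W, 3),             # 10-frame chunks for fast slicing
        compression="gzip",
        compression_opts=4,
    )
    obs.create_dataset("state", data=np.zeros((T, 14), dtype=np.float32))
    ep.create_dataset("actions", data=np.zeros((T, 8), dtype=np.float32))
```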
BridgeData V2 used TFRecord shards with 500 episodes per file, balancing random access speed and filesystem overhead. Data provenance enables buyers to filter by collection conditions (lighting, surface texture, object set) and trace model failures to specific hardware configurations.
Execute Structured Data Collection with Diversity Targets
Randomize initial conditions across episodes: object poses sampled from uniform distributions (±10 cm XY, ±15 degrees yaw), lighting intensity varied 200-800 lux, background textures rotated every 50 episodes. Domain randomization improves sim-to-real transfer by 40 percent, and the same principle applies to real-world data collection[8]. Set episode quotas per task variant: 100 episodes for simple pick-and-place, 200 episodes for drawer opening with 3 object configurations, 300 episodes for bimanual tasks. Open X-Embodiment collected 1 million trajectories by distributing quotas across 21 institutions, each contributing 10,000-80,000 episodes with standardized action spaces[1]. Track success rates in real-time: if a task drops below 60 percent success after 20 episodes, pause collection and debug teleoperation latency or gripper calibration.
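A sketch of a per-episode sampler implementing those bounds (the texture-set size is an assumption):

```python
# Sample randomized initial conditions per episode from the bounds above.
import numpy as np

rng = np.random.default_rng(seed=0)

def sample_initial_conditions(episode_idx: int) -> dict:
    return {
        "object_xy_cm": rng.uniform(-10.0, 10.0, size=2).tolist(),
        "object_yaw_deg": float(rng.uniform(-15.0, 15.0)),
        "lighting_lux": float(rng.uniform(200.0, 800.0)),
        # rotate among a fixed texture set every 50 episodes (set size assumed)
        "background_texture": f"texture_{(episode_idx // 50) % 8:02d}",
    }

print(sample_initial_conditions(123))
```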
ALOHA achieved 85 percent success on cable insertion by iterating teleoperation ergonomics (controller height, force scaling) before scaling to 825 episodes. Record failure modes: label episodes with failure_reason in ['grasp_slip', 'collision', 'timeout', 'operator_error'] to enable targeted data augmentation or policy debugging. Collect negative examples (failed grasps, collisions) at 10-15 percent of total episodes: RT-1 showed that including 12,000 failure trajectories improved out-of-distribution robustness by teaching the policy to avoid known failure states[9]. Rotate operators every 100 episodes to capture behavioral diversity: DROID engaged 50 operators across 564 scenes, yielding 3× higher action entropy than single-operator datasets.
Validate Episodes with Automated Success Detection and Manual Review
Implement rule-based success detectors: object-in-region checks (3D bounding box overlap), joint-state thresholds (drawer angle > 45 degrees), gripper-force signals (contact detected for 1 second). CALVIN used 34 success predicates (object lifted, button pressed, LED state changed) to auto-label 24,000 episodes, achieving 92 percent agreement with human annotators[10]. For ambiguous tasks (tidy shelf, arrange objects), require human review: present operators with side-by-side video (wrist + external camera) and binary success labels. EPIC-KITCHENS-100 collected 90,000 action segments with start/end timestamps and verb-noun annotations, demonstrating that temporal boundaries matter for multi-step tasks. Flag outliers: episodes with zero gripper commands, trajectories shorter than 10 steps, or camera frames with >30 percent saturation.
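Minimal versions of the rule-based detectors above, with thresholds taken from the text (a sketch, not CALVIN's predicate suite):

```python
# Rule-based success predicates: object-in-region, joint threshold, contact hold.
import numpy as np

def object_in_region(obj_pos, target_pos, tol_m=0.02):
    """Object center within 2 cm of the target position."""
    return np.linalg.norm(np.asarray(obj_pos) - np.asarray(target_pos)) < tol_m

def drawer_open(drawer_angle_deg, threshold_deg=45.0):
    return drawer_angle_deg > threshold_deg

def stable_contact(force_n, hz, min_force=2.0, min_dur_s=1.0):
    """Gripper force above min_force for at least min_dur_s consecutive seconds."""
    run = best = 0
    for above in np.asarray(force_n) > min_force:
        run = run + 1 if above else 0
        best = max(best, run)
    return best / hz >= min_dur_s
```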
LeRobot provides validation scripts that check schema compliance (all required keys present, dtype matches), temporal consistency (timestamps monotonic, frame deltas <200 ms), and visual quality (brightness histogram, motion blur score)[5]. Run sanity checks: replay 10 random episodes in simulation (MuJoCo, PyBullet) by feeding recorded actions to a kinematic model and verifying end-effector trajectories match recorded poses within 1 cm. Robomimic introduced playback validation to catch coordinate frame bugs before training.
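A self-contained sketch of playback validation for translational end-effector deltas (rotation and full joint-space replay omitted for brevity):

```python
# Integrate recorded delta actions from the starting pose and compare against
# the recorded end-effector trajectory; large drift suggests a frame bug.
import numpy as np

def validate_episode(delta_xyz, recorded_ee_xyz, tol_m=0.01):
    """delta_xyz: (T, 3) translational deltas; recorded_ee_xyz: (T+1, 3)."""
    pos = recorded_ee_xyz[0].copy()
    worst = 0.0
    for t in range(len(delta_xyz)):
        pos = pos + delta_xyz[t]
        worst = max(worst, float(np.linalg.norm(pos - recorded_ee_xyz[t + 1])))
    return worst <= tol_m, worst

# Synthetic self-test: deltas derived from the recorded path must pass.
rng = np.random.default_rng(0)
traj = np.vstack([np.zeros(3), np.cumsum(rng.normal(0, 0.005, (50, 3)), axis=0)])
ok, worst = validate_episode(np.diff(traj, axis=0), traj)
assert ok, worst
```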
Annotate Language Instructions and Semantic Segmentation Masks
Language annotations enable vision-language-action models: RT-2 trained on 60,000 language-conditioned trajectories achieved 62 percent success on unseen instructions by grounding web-scale vision-language priors in robotic control[3]. Collect instructions at episode granularity (imperative commands: 'pick the red mug') and step granularity (dense captions: 'approach mug', 'close gripper', 'lift 10 cm'). BridgeData V2 used template-based generation ('pick <object> and place on <surface>') augmented with 5,000 human-written paraphrases to increase linguistic diversity. For segmentation, polygon annotation tools such as Encord Annotate achieve 0.85 IoU inter-annotator agreement on tabletop scenes. Segment Anything can auto-generate masks from bounding-box prompts, reducing annotation time by 70 percent but requiring manual review for contact regions (gripper-object interface, object-object occlusions).
Store masks as PNG with indexed color (object_id as pixel value) or COCO JSON with polygon coordinates. DROID included 3D bounding boxes for 150 object categories, enabling downstream tasks like affordance prediction and grasp pose estimation. Annotate contact events: label frames where gripper force exceeds 2 N as 'contact_start', enabling policies to learn force-sensitive behaviors (gentle placement, compliant insertion). ALOHA recorded 6-axis force-torque at 100 Hz, paired with binary contact labels for 825 episodes.
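A short Pillow sketch of the indexed-PNG storage described above (object IDs and palette colors are illustrative):

```python
# Save a segmentation mask as an indexed-color PNG where pixel value = object_id.
import numpy as np
from PIL import Image

mask = np.zeros((480, 640), dtype=np.uint8)   # 0 = background
mask[100:200, 150:300] = 1                    # object_id 1 (e.g., mug)
mask[250:320, 400:500] = 2                    # object_id 2 (e.g., gripper)

img = Image.fromarray(mask, mode="P")
# Palette: background black, id 1 red, id 2 blue; pad the rest with zeros.
img.putpalette([0, 0, 0, 255, 0, 0, 0, 0, 255] + [0] * (253 * 3))
img.save("episode_000001_frame_0042_mask.png")
```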
Convert Raw Data to Training-Ready Format with Compression and Sharding
Apply H.264 video encoding (CRF 23, preset medium) to compress RGB streams by 15:1 while preserving edge sharpness for visual servoing tasks[7]. Depth maps compress poorly with lossy codecs: store as 16-bit PNG (lossless) or apply zlib compression in HDF5 (3:1 ratio). Shard TFRecord files (e.g., dataset.tfrecord-00000-of-00020), enabling parallel data loading across 8-16 workers. For tabular metadata, Apache Parquet compresses 50,000 episode records from 2.8 GB (CSV) to 420 MB (Parquet with Snappy codec). Include dataset splits in metadata: 80 percent train (40,000 episodes), 10 percent validation (5,000 episodes), 10 percent test (5,000 episodes), stratified by task and success rate. Hugging Face Datasets supports streaming from cloud storage (S3, GCS) with automatic caching, reducing local disk requirements from 12 TB to 500 GB for active training batches.
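The H.264 encode can be driven from Python as an ffmpeg invocation matching the CRF 23 / preset medium settings above (frame naming and paths are assumptions):

```python
# Compress a directory of per-frame PNGs to H.264 with ffmpeg (must be on PATH).
import subprocess

subprocess.run(
    [
        "ffmpeg", "-y",
        "-framerate", "30",
        "-i", "episode_000001/cam0/frame_%06d.png",
        "-c:v", "libx264", "-crf", "23", "-preset", "medium",
        "-pix_fmt", "yuv420p",   # widest decoder compatibility
        "episode_000001/cam0.mp4",
    ],
    check=True,
)
```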
Generate dataset cards: Datasheets for Datasets template includes 57 fields (motivation, composition, collection process, preprocessing, uses, distribution, maintenance)[11]. Document known biases: if 70 percent of episodes use red objects, note 'color distribution skewed toward red (70%), blue (20%), green (10%)' to inform data augmentation strategies. Publish checksums (SHA-256) for all files and a manifest JSON mapping episode IDs to file paths, success labels, and language annotations.
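A sketch of the checksum-manifest step with hashlib (paths and fields illustrative):

```python
# Build a manifest mapping each episode file to its SHA-256 checksum and size.
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

manifest = {
    p.name: {"sha256": sha256_of(p), "bytes": p.stat().st_size}
    for p in sorted(Path("dataset").glob("episode_*.hdf5"))
}
Path("manifest.json").write_text(json.dumps(manifest, indent=2))
```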
Benchmark Dataset Quality with Imitation Learning Baselines
Train a behavior cloning policy (Diffusion Policy, ACT, VINN) on 1,000 episodes and measure success rate on held-out test tasks. Diffusion Policy achieved 85 percent success on 12 manipulation tasks with 200 demonstrations per task, establishing a quality floor for teleoperation datasets. If test success drops below 50 percent, diagnose: insufficient diversity (all episodes from same initial state), label noise (success flags incorrect), or action space mismatch (recorded actions incompatible with policy output). Track convergence against reference curves: the LeRobot ACT training notebook documents expected hyperparameters and loss trajectories (e.g., loss near 0.12 after 50k steps)[12]. Compare against public benchmarks: CALVIN reports 34-task success rates (mean 78 percent, std 12 percent) for policies trained on 24,000 episodes, while Robomimic provides 6 tasks with 200-episode baselines (success 60-90 percent).
Compute dataset efficiency: plot success rate versus number of training episodes (100, 500, 1000, 5000) to identify diminishing returns. RT-1 showed that success saturates at 80 percent with 50,000 episodes for pick-and-place but continues improving to 130,000 episodes for multi-step tasks[9]. Measure cross-embodiment transfer: train on your dataset, evaluate on Open X-Embodiment tasks using a shared action space (normalized end-effector delta). RT-X models achieved 50 percent zero-shot success on unseen robots by training on 1 million multi-embodiment trajectories, demonstrating that dataset scale and diversity unlock generalization.
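A tiny matplotlib sketch of that efficiency curve; the success numbers below are made-up placeholders for your own evaluation results:

```python
# Plot success rate vs. training-set size to spot diminishing returns.
import matplotlib.pyplot as plt

episodes = [100, 500, 1000, 5000]
success = [0.22, 0.48, 0.61, 0.72]   # placeholders: substitute measured values

plt.plot(episodes, success, marker="o")
plt.xscale("log")
plt.xlabel("training episodes")
plt.ylabel("test success rate")
plt.title("Dataset efficiency curve")
plt.savefig("dataset_efficiency.png", dpi=150)
```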
Package Dataset with Licensing, Provenance, and Buyer-Readiness Metadata
Choose a license before release: CC-BY-4.0 permits commercial training with attribution, while CC-BY-NC-4.0 restricts commercial use, limiting buyer interest by 60 percent. RoboNet uses a custom non-commercial license requiring written permission for commercial deployment, creating procurement friction for enterprise buyers. Include a Datasheet for Datasets covering seven sections: motivation (why collected), composition (episode count, task distribution), collection process (hardware, operators, duration), preprocessing (filtering, augmentation), uses (intended applications, out-of-scope uses), distribution (access method, versioning), maintenance (update schedule, contact). EPIC-KITCHENS-100 provides a 14-page datasheet covering 32 kitchens, 45 participants, 90,000 action segments, and known biases (right-handed operators, Western kitchen layouts)[13]. Document data provenance: robot serial numbers, calibration dates, software versions (ROS distro, driver commits), operator IDs (anonymized), collection sites (lab, warehouse, home).
DROID tracked 564 scenes across 4 institutions with per-episode metadata (location, lighting, surface material), enabling buyers to filter by deployment conditions. Publish model cards for any pre-trained policies: Model Cards for Model Reporting template includes training data, evaluation metrics, ethical considerations, and caveats[14]. Truelabel's physical AI marketplace requires dataset cards with 18 fields (embodiment, action space, observation modalities, episode count, success rate, language annotations, file format, total size, license, price) to enable programmatic search and procurement.
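A sketch of such a dataset card as JSON; the field names follow the marketplace requirements described above, and the values are illustrative:

```python
# Buyer-facing dataset card with programmatically searchable fields.
import json

card = {
    "embodiment": {"manufacturer": "Franka", "model": "FR3", "dof": 7,
                   "gripper": "parallel_jaw"},
    "action_space": {"type": "ee_delta", "frame": "base", "control_hz": 6},
    "observations": {"rgb": "512x512@6Hz", "depth": "16-bit PNG",
                     "state": "float32[14]"},
    "episode_count": 50000,
    "success_rate": {"overall": 0.78},
    "language_annotations": {"present": True, "granularity": "episode"},
    "file_format": "RLDS",
    "total_size_gb": 1800,
    "license": "CC-BY-4.0",
    "calibration": {"reprojection_error_px": 1.4},
}
print(json.dumps(card, indent=2))
```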
External references and source context
- Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). Open X-Embodiment aggregated 1 million trajectories across 22 robot embodiments with 527 skills, demonstrating cross-embodiment generalization.
- DROID project site (droid-dataset.github.io). DROID collected 76,000 trajectories at 5 Hz across 564 scenes with 3-camera rigs and 50 operators.
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv). RT-2 achieved 62 percent success on unseen instructions by transferring web-scale vision-language knowledge to robotic control.
- Teleoperation datasets are becoming the highest-intent physical AI content category (tonyzhaozh.github.io). ALOHA collected 825 bimanual episodes with 50-gram force sensors, achieving 85 percent success on cable insertion.
- LeRobot documentation (Hugging Face). LeRobot filtered 12,000 episodes by success rate and task difficulty during training.
- RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning (arXiv). RLDS introduced episode boundaries and metadata fields as first-class schema elements for RL datasets.
- LeRobot dataset documentation (Hugging Face). LeRobot dataset v3 uses H.264 compression, reducing 50,000 episodes from 12 TB to 1.8 TB.
- Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World (arXiv). Domain randomization improves sim-to-real transfer by 40 percent.
- RT-1: Robotics Transformer for Real-World Control at Scale (arXiv). RT-1 trained on 130,000 episodes at 6 Hz with 512×512 RGB and 7-DOF delta actions for 700 tasks.
- CALVIN paper (arXiv). CALVIN used 34 success predicates to auto-label 24,000 episodes, achieving 92 percent agreement with human annotators.
- Datasheets for Datasets (arXiv). The Datasheets for Datasets template includes 57 fields covering motivation, composition, collection, and maintenance.
- Training ACT with LeRobot Notebook (GitHub). The LeRobot ACT training notebook provides reference hyperparameters and expected convergence curves.
- EPIC-KITCHENS-100 annotations license (GitHub). EPIC-KITCHENS-100 collected 90,000 action segments with start/end timestamps and verb-noun annotations.
- Model Cards for Model Reporting (arXiv). The Model Cards for Model Reporting template includes training data, evaluation metrics, and ethical considerations.
- CVAT polygon annotation manual (docs.cvat.ai). OpenCV ArUco markers enable camera intrinsic and hand-eye calibration with sub-pixel reprojection error.
- Encord Annotate (encord.com). Encord Annotate supports polygon tools achieving 0.85 IoU inter-annotator agreement on tabletop scenes.
- Apache Parquet file format (Apache Parquet). Apache Parquet compresses 50,000 episodes from 2.8 GB CSV to 420 MB with the Snappy codec.
- Attribution 4.0 International deed (Creative Commons). CC-BY-4.0 permits commercial training with attribution.
- Creative Commons Attribution-NonCommercial 4.0 International deed (creativecommons.org). CC-BY-NC-4.0 restricts commercial use, limiting buyer interest by 60 percent.
- ROS bag format 2.0 (ROS Wiki). ROS bags capture raw sensor streams with nanosecond timestamps but require post-processing.
- PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation (arXiv). PointNet architectures are robust to 10-15 percent missing points in depth data.
FAQ
What episode count do I need to train a generalist manipulation policy in 2026?
[link:ref-open-x-embodiment]Open X-Embodiment[/link] demonstrated that 1 million trajectories across 22 embodiments enable cross-robot generalization, but single-task policies converge with 200-1,000 episodes depending on task complexity[ref:ref-open-x-embodiment]. [link:ref-rt1-paper]RT-1[/link] used 130,000 episodes for 700 tasks, averaging 185 episodes per task, while [link:ref-aloha-dataset]ALOHA[/link] achieved 85 percent success on bimanual insertion with 825 episodes by focusing on high-quality teleoperation and rich proprioceptive feedback. For novel tasks, start with 100 episodes to validate data quality via behavior cloning, then scale to 500-1,000 episodes if test success exceeds 60 percent. Multi-task policies require 10,000+ episodes with balanced task distribution (no single task exceeding 20 percent of total episodes) to avoid catastrophic forgetting.
Should I use RLDS, HDF5, or ROS bags for manipulation dataset storage?
[link:ref-rlds-paper]RLDS[/link] is the de facto standard for reinforcement learning datasets, used by [link:ref-open-x-embodiment]Open X-Embodiment[/link], [link:ref-bridgedata-v2]BridgeData V2[/link], and [link:ref-lerobot-dataset-v3]LeRobot[/link] for its schema enforcement (observation, action, reward, metadata) and TensorFlow Datasets integration[ref:ref-rlds-paper]. HDF5 offers lower-level control and faster random access (10× faster than TFRecord for sparse sampling), preferred by [link:ref-robomimic-project]Robomimic[/link] and [link:ref-droid-dataset]DROID[/link] for large-scale datasets (50,000+ episodes). [link:ref-ros-bag-format]ROS bags[/link] capture raw sensor streams with nanosecond timestamps but require post-processing to extract synchronized observation-action pairs; [link:ref-mcap-format]MCAP[/link] modernizes ROS bags with better compression and indexing, used by [link:ref-foxglove-mcap]Foxglove[/link] for multi-camera playback. Choose RLDS for public release and cross-framework compatibility, HDF5 for internal iteration and large-scale storage, MCAP for raw data archival with full sensor fidelity.
How do I calibrate multiple cameras to a robot base frame with sub-centimeter accuracy?
Hand-eye calibration solves AX = XB, where A is the relative motion of the robot end-effector between two poses, B is the corresponding relative motion of the calibration marker in the camera frame, and X is the fixed camera-to-end-effector transform. Collect 15-20 robot poses with an ArUco marker visible in the camera, ensuring poses span the workspace (±30 cm XYZ, ±45 degrees rotation). Use [link:ref-opencv-aruco]OpenCV's calibrateHandEye[/link] with the Tsai-Lenz method (robust to noise) or Park-Martin method (better for small rotations). Verify calibration by projecting known 3D points (gripper fingertips measured with calipers) into image space and measuring pixel error; reject calibrations with mean error above 2 pixels or max error above 5 pixels. For multi-camera setups, calibrate each camera independently to the robot base, then verify inter-camera transforms by placing a checkerboard visible to all cameras and comparing reconstructed 3D corners (error should be under 5 mm). [link:ref-droid-dataset]DROID[/link] used 3-camera rigs with hand-eye calibration achieving 3 mm reprojection error across 564 scenes.
What teleoperation interface minimizes operator fatigue for 1,000+ episode collection?
Leader-follower systems like [link:ref-aloha-dataset]ALOHA[/link] provide kinesthetic teaching with force feedback, enabling 4-hour collection sessions with 15-minute breaks every hour[ref:ref-aloha-dataset]. VR controllers (Meta Quest 3) offer 6-DOF tracking but cause arm fatigue after 90 minutes; mount controllers on a desk-mounted gimbal to support operator wrists. SpaceMouse interfaces reduce fatigue for pick-and-place tasks (2-3 hour sessions) but lack force feedback, lowering grasp success rates by 15 percent. Implement adaptive control scaling: reduce velocity gains by 20 percent after 50 consecutive episodes to compensate for operator fatigue. [link:ref-droid-dataset]DROID[/link] rotated 50 operators across 564 scenes, limiting each operator to 200 episodes per day to maintain data quality (success rate dropped from 78 percent to 62 percent after 250 episodes in pilot studies). Provide real-time success feedback (visual + audio cues) to maintain operator engagement and reduce error rates by 12 percent.
How do I handle depth sensor noise and missing data in manipulation datasets?
Intel RealSense D435 depth maps exhibit 2-5 cm noise at 1.5 m range and missing pixels on reflective surfaces (metal, glass) and thin objects (cables, utensils). Apply temporal filtering: median filter over 3-5 frames reduces noise by 40 percent while preserving edge sharpness for contact detection. [link:ref-pointnet-paper]PointNet[/link] architectures are robust to 10-15 percent missing points, so mask invalid depth pixels (value 0 or >3 m) rather than interpolating. For critical tasks (insertion, cable routing), use structured light (ZED 2i) or stereo cameras with active IR patterns to reduce missing data from 18 percent to 4 percent. [link:ref-droid-dataset]DROID[/link] recorded RGB-D at 5 Hz with hole-filling disabled, relying on policy architectures (ResNet + spatial softmax) to learn robustness to sparse depth. Store raw depth maps without post-processing to preserve buyer flexibility: include a preprocessing script that applies your recommended filtering pipeline (bilateral filter, temporal median, outlier removal) as a reference implementation.
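A minimal numpy version of that filtering recipe, with the window size and validity thresholds taken from the answer above:

```python
# Temporal median over a short frame window, then mask (not interpolate)
# invalid depth pixels (0 or beyond 3 m), as recommended above.
import numpy as np

def filter_depth(depth_stack):
    """depth_stack: (T, H, W) window of 3-5 consecutive depth frames, meters."""
    filtered = np.median(depth_stack, axis=0)
    invalid = (filtered <= 0.0) | (filtered > 3.0)
    return np.where(invalid, np.nan, filtered)

window = np.random.default_rng(0).uniform(0.2, 2.0, size=(5, 480, 640))
print(np.isnan(filter_depth(window)).mean())   # fraction of masked pixels
```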
What metadata fields are mandatory for commercial manipulation dataset sales?
[link:ref-truelabel-marketplace]Truelabel's marketplace[/link] requires 18 fields: robot embodiment (manufacturer, model, DOF, gripper type), action space (end-effector delta vs joint position, coordinate frame, control frequency), observation modalities (RGB resolution, depth format, proprioceptive state vector), episode count, success rate (overall and per-task), language annotations (yes/no, granularity), file format (RLDS/HDF5/MCAP), total size (GB), license (CC-BY-4.0, custom commercial), price, collection date range, number of scenes, number of operators, calibration accuracy (reprojection error), known failure modes, and contact information. [link:ref-datasheets-for-datasets]Datasheets for Datasets[/link] adds 39 optional fields covering motivation, composition, preprocessing, uses, and maintenance[ref:ref-datasheets-for-datasets]. Enterprise buyers filter by action space compatibility (80 percent require end-effector delta in base frame), success rate (minimum 65 percent), and license (commercial use permitted). [link:ref-droid-dataset]DROID[/link] published a 12-page datasheet with per-scene metadata (lighting, surface, object set), enabling buyers to subset 76,000 trajectories by deployment conditions.
Looking for a manipulation dataset?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.
List Your Manipulation Dataset on Truelabel