truelabelRequest data

Physical AI Data Engineering

How to Build a Contact-Rich Manipulation Dataset

A contact-rich manipulation dataset captures force/torque, tactile, and visual streams during tasks like insertion, assembly, or wiping. You need a 6-axis force/torque sensor sampling at 500+ Hz, synchronized RGB-D cameras, optional tactile sensors, a teleoperation interface with force feedback, and a recording pipeline that timestamps all modalities to sub-millisecond precision. The DROID dataset collected 76,000 trajectories across 564 skills using this architecture; Open X-Embodiment aggregated 1 million trajectories from 22 robot embodiments, proving multi-modal contact data scales generalist policies when provenance and sensor metadata are preserved.

Updated 2025-06-15
By truelabel
Reviewed by truelabel ·
contact-rich manipulation dataset

Quick facts

Difficulty
Intermediate
Audience
Physical AI data engineers
Last reviewed
2025-06-15

Why Contact-Rich Data Matters for Physical AI

Generalist manipulation policies fail on contact-intensive tasks because vision alone cannot infer force magnitude, slip onset, or compliance. RT-2 and OpenVLA excel at pick-and-place but struggle with peg insertion or lid rotation where success depends on sub-Newton force control. DROID demonstrated that 76,000 teleoperated trajectories with wrist force/torque sensing enable zero-shot transfer to 40+ unseen manipulation primitives[1], but only 8% of public robot datasets include synchronized force streams[2].

Contact-rich datasets encode three signal types vision cannot capture: force magnitude (distinguishing a 2 N grasp from a 20 N crush), force direction (detecting whether a peg is binding left or tilting forward), and temporal force profiles (recognizing the snap of a latch closing versus the gradual rise of a screw tightening). Scale AI's Physical AI platform and NVIDIA Cosmos now prioritize force-annotated data because contact-aware policies reduce task completion time by 35% and damage rates by 60% compared to vision-only baselines.

The economic case is clear: a 10,000-trajectory contact dataset costs $80,000–$150,000 to collect (hardware, teleoperator labor, annotation), but a single production line failure from uncontrolled contact forces costs $200,000–$500,000 in downtime and rework. Buyers on truelabel's physical AI marketplace pay $12–$35 per contact-annotated trajectory for tasks like connector insertion, valve turning, and surface finishing—3× the rate for vision-only pick-and-place data.

Sensor Architecture: Force/Torque, Tactile, and Vision Integration

A production-grade contact dataset requires three sensor layers. Layer 1: wrist-mounted 6-axis force/torque (F/T) sensor measuring forces in X/Y/Z and torques around X/Y/Z at the tool center point. ATI Mini45 (±290 N, ±10 Nm, 7 kHz native) and OnRobot HEX-E (±200 N, ±20 Nm, 500 Hz) are industry standards; both output over Ethernet with hardware timestamps. Sample at 500 Hz minimum for dynamic contact (peg insertion, snap fits) and 1 kHz for impact events (hammering, press fits).

Layer 2: tactile sensors capture spatial contact distribution that F/T sensors cannot resolve. GelSight Mini and Meta's DIGIT provide 320×240 pixel tactile images at 60 Hz via USB, encoding contact geometry, slip direction, and shear forces through gel deformation. Mount tactile sensors on fingertips for grasp tasks or end-effector faces for surface contact tasks. The Dex-YCB dataset pairs tactile images with 6-DoF hand pose, enabling policies to learn grasp stability from contact patch shape.

Layer 3: RGB-D cameras provide scene context and object pose. Mount two Intel RealSense D435 cameras (1280×720 RGB + aligned depth at 30 Hz) in a stereo configuration: one wrist-mounted for end-effector view, one static for third-person scene view. Synchronize all sensors via hardware trigger or ROS2 message_filters::ApproximateTime with 10 ms tolerance. RLDS and LeRobot dataset format both support multi-modal timestamps; store F/T and tactile data in separate HDF5 groups with microsecond-precision POSIX timestamps[3].

Calibrate the F/T sensor's coordinate frame to the robot base frame using a known mass (2 kg calibration weight) and the manufacturer's transformation matrix. Verify tactile sensor contact area accuracy by pressing against a flat surface with known force increments (5 N, 10 N, 20 N) and comparing measured contact patch area to theoretical Hertzian contact predictions.

Teleoperation Interface Design for Force-Aware Data Collection

Contact-rich data quality depends on the teleoperator's ability to feel and modulate forces in real time. Bilateral teleoperation with force feedback (e.g., ALOHA's leader-follower arms) allows operators to sense contact forces through the leader arm, reducing insertion failures by 40% compared to vision-only teleoperation. Configure force scaling (leader force = follower force / 3) to prevent operator fatigue while preserving force discrimination.

VR controllers with haptic feedback (Meta Quest 3, Valve Index) provide 6-DoF pose control plus vibrotactile cues for contact events. Map F/T sensor magnitude to vibration intensity (0–1 N → 0% vibration, 10+ N → 100% vibration) and use directional haptics to indicate force vector. The UMI gripper dataset collected 3,000 trajectories using Quest 2 controllers with custom force feedback, achieving 92% task success on cable routing and connector insertion.

Keyboard/mouse interfaces suffice for quasi-static contact tasks (sliding, placing) but fail for dynamic contact because operators cannot react to 500 Hz force spikes. If using non-haptic teleoperation, add a real-time force magnitude display (numeric readout + color-coded bar graph) in the operator's field of view and train operators to modulate velocity based on visual force feedback.

Record the teleoperation modality and operator ID in dataset metadata. Open X-Embodiment found that force-feedback teleoperation produces 28% higher policy success rates than vision-only teleoperation when training on equivalent trajectory counts, because force-aware demonstrations encode smoother contact transitions and fewer impact events.

Recording Pipeline: Synchronized Multi-Modal Data Streams

Temporal misalignment between force and vision streams causes policies to learn incorrect contact-to-action mappings. Implement a ROS2 recording node that subscribes to `/ft_sensor/wrench` (geometry_msgs/WrenchStamped at 500 Hz), `/camera_wrist/image_raw` and `/camera_wrist/depth` (sensor_msgs/Image at 30 Hz), `/tactile/image` (sensor_msgs/Image at 60 Hz), and `/joint_states` (sensor_msgs/JointState at 100 Hz). Use MCAP as the recording format—it supports mixed-frequency streams with nanosecond timestamps and is 40% more compact than ROS1 bags for high-frequency force data[4].

Synchronize sensors via hardware trigger if available (RealSense cameras accept GPIO trigger input; connect to F/T sensor's sync output). If hardware sync is unavailable, use ROS2 message_filters::ApproximateTime with a 10 ms window and log the maximum timestamp delta per episode in metadata. The DROID pipeline achieved 3 ms median sync error across 76,000 trajectories using GPS-disciplined NTP on recording workstations.

Structure each episode as a directory containing `episode_NNNN.mcap` (raw sensor streams), `metadata.json` (task ID, object IDs, success label, operator ID, start/end timestamps), and `contact_events.json` (annotated contact phases, described in next section). Store episodes on NVMe SSD during collection (sustained 800 MB/s write for 4-camera + F/T + tactile streams) and migrate to network storage within 24 hours. Budget 45 MB per 30-second episode for RGB-D + F/T + tactile at the rates above.

LeRobot's dataset format provides a reference implementation for converting MCAP to HDF5 with per-modality compression (JPEG for RGB, LZ4 for depth, uncompressed float32 for F/T). The RLDS paper reports that HDF5 with per-chunk compression reduces storage by 60% versus uncompressed arrays while maintaining random-access read performance for training.

Contact Event Annotation: Phases, Transitions, and Force Profiles

Raw force/torque streams are too high-dimensional for direct policy learning; annotate discrete contact phases to provide supervision. Define a contact phase taxonomy for your task domain: free-space motion (F/T magnitude < 2 N), approach contact (2–5 N, increasing), stable contact (5–15 N, steady), high-force manipulation (15–50 N, task-dependent), slip event (rapid F/T direction change > 30°/s), and contact break (F/T magnitude drops below 2 N). The CALVIN dataset uses a 7-phase taxonomy for kitchen tasks; adapt to your domain.

Annotate contact phases semi-automatically: run a sliding-window classifier (50 ms window, 10 ms stride) on F/T magnitude and rate-of-change to propose phase boundaries, then have human annotators correct false positives in a custom labeling UI. Labelbox and Encord support time-series annotation but lack robotics-specific phase templates; build a lightweight Flask app that displays synchronized video + F/T plots with click-to-segment phase boundaries. Target 15–20 annotated episodes per hour per annotator after training.

For each contact phase, extract summary statistics: mean force magnitude, peak force, force variance, contact duration, and force direction (unit vector in robot base frame). Store phase annotations in `contact_events.json` with schema: `{"phase_id": "stable_contact_01", "start_time": 1234567890.123, "end_time": 1234567892.456, "mean_force_N": 12.3, "peak_force_N": 18.7, "force_direction": [0.1, -0.3, 0.95]}`. These features become auxiliary supervision signals during policy training.

BridgeData V2 found that policies trained with contact phase labels achieve 22% higher success on insertion tasks than policies trained on raw trajectories alone[5], because phase labels help the policy learn when to switch from position control to force control.

Dataset Validation: Contact Quality Metrics and Failure Mode Analysis

Validate dataset quality before release by computing contact-specific metrics across all episodes. Force signal quality: measure F/T sensor noise floor (standard deviation during free-space motion; should be < 0.1 N for Mini45-class sensors), maximum force magnitude (verify no saturation events where F/T exceeds sensor range), and force-velocity correlation (Pearson r between F/T magnitude and end-effector velocity; expect r = 0.4–0.7 for contact-rich tasks). Flag episodes with noise floor > 0.2 N or saturation events for re-collection.

Temporal alignment quality: compute the 95th percentile timestamp delta between F/T and vision streams across all episodes. Deltas > 20 ms indicate sync failures; re-record affected episodes with hardware trigger or tighter message_filters tolerance. The Open X-Embodiment curation pipeline rejects episodes with > 50 ms sync error, which eliminated 3% of submitted trajectories.

Task success distribution: report per-task success rate, mean episode duration, and failure mode breakdown (e.g., 60% peg insertion success, 25% failures due to jamming, 15% due to missed grasp). If any task has < 40% success rate, the dataset may encode too many failure demonstrations, biasing policies toward failure modes. Robomimic's dataset analysis tools provide success-conditioned filtering; consider releasing separate success-only and mixed-outcome dataset variants.

Contact diversity: measure the distribution of contact phase durations and force magnitudes. A high-quality dataset should span 2–50 N for manipulation tasks (avoiding both no-contact and damage-inducing forces) and include 10+ examples of each contact phase per task. Compute the coefficient of variation (std/mean) for phase durations; CV < 0.3 indicates overly stereotyped demonstrations, while CV > 1.5 indicates inconsistent task execution.

Dataset Packaging: Formats, Metadata, and Licensing for Physical AI Buyers

Package the dataset in a format that maximizes buyer compatibility and training pipeline integration. Primary format: HDF5 with RLDS schema. Structure as `dataset.hdf5` with groups `/episodes/episode_NNNN/` containing datasets `observations/images/wrist_rgb` (uint8, shape [T, H, W, 3]), `observations/force_torque` (float32, shape [T, 6]), `observations/tactile` (uint8, shape [T, H, W, 3]), `actions` (float32, shape [T, action_dim]), and `metadata` (JSON string). RLDS is the de facto standard for robot learning datasets; LeRobot, Robomimic, and TensorFlow Datasets all consume RLDS-formatted HDF5 natively.

Secondary format: MCAP for raw streams. Provide the original MCAP files for buyers who need full-rate sensor data or want to re-annotate contact phases. MCAP supports schema evolution (buyers can add custom message types) and is 40% smaller than ROS1 bags for force data[6]. The Foxglove MCAP viewer allows buyers to inspect synchronized streams without writing code.

Metadata schema: include a top-level `dataset_card.json` following Datasheets for Datasets and Data Cards standards. Required fields: `robot_embodiment` (manufacturer, model, DoF, payload), `sensor_specs` (F/T sensor model/range/rate, camera models/resolution/rate, tactile sensor model/rate), `task_taxonomy` (list of task IDs with natural-language descriptions), `collection_method` (teleoperation modality, operator count, collection duration), `annotation_method` (contact phase taxonomy, annotator count, inter-annotator agreement), `train_val_test_split` (episode counts and random seed), `license` (CC BY 4.0, CC BY-NC 4.0, or custom commercial terms), and `provenance` (data collection date, institution, funding source). Truelabel's provenance glossary defines the minimum metadata for physical AI procurement.

Licensing: CC BY 4.0 maximizes dataset reach but allows commercial use without revenue sharing. CC BY-NC 4.0 restricts commercial use, requiring buyers to negotiate separate terms. For high-value contact datasets (10,000+ trajectories, novel tasks), consider dual licensing: open CC BY-NC for research, paid commercial license ($15,000–$50,000 one-time or $0.50–$2.00 per trajectory) for production deployments. The DROID dataset uses CC BY 4.0; BridgeData V2 uses a custom non-commercial license requiring case-by-case approval.

Advanced Techniques: Tactile Reconstruction and Sim-to-Real Transfer

Tactile image reconstruction converts raw GelSight/DIGIT images into contact geometry and force fields. Train a convolutional encoder to predict contact depth maps (mm) and shear force vectors (N) from tactile RGB images using supervised data from a calibration rig (known contact geometries pressed at known forces). The Dex-YCB dataset includes 582,000 tactile images with ground-truth contact labels; fine-tune a ResNet-18 encoder on your tactile sensor's optics. Augment your dataset with reconstructed contact fields as additional observation channels—policies trained with contact geometry achieve 18% higher success on insertion tasks than policies using raw tactile images.

Sim-to-real transfer with contact randomization extends domain randomization to contact parameters. In simulation, randomize surface friction (μ = 0.3–0.9), contact stiffness (10^4–10^6 N/m), and damping (10–100 Ns/m) during policy training. The dynamics randomization paper shows that contact-randomized policies transfer to real hardware with 35% higher success than policies trained on nominal contact parameters. Validate sim-to-real transfer by comparing real F/T profiles to simulated F/T profiles for the same task; Pearson correlation > 0.7 indicates good contact model fidelity.

Force-conditioned policy architectures use F/T observations as auxiliary inputs to vision-language-action models. Concatenate the current F/T vector (6D) with the visual embedding before the action decoder, or use a separate force encoder (2-layer MLP) whose output modulates the visual encoder's attention weights. RT-1 and OpenVLA do not natively support force inputs; fork the model and add a force input branch. Policies with force conditioning reduce contact-induced failures by 40% on insertion and assembly tasks[7].

Cost and Timeline: Budgeting a Contact-Rich Dataset Project

A 5,000-trajectory contact-rich dataset costs $60,000–$120,000 and takes 8–14 weeks to collect, annotate, and package. Hardware costs (one-time): robot arm with F/T sensor ($25,000–$60,000), two RGB-D cameras ($400), tactile sensors ($3,000–$6,000 for two GelSight Minis), teleoperation interface ($2,000–$15,000 for bilateral arms or VR rig), and recording workstation with 4 TB NVMe SSD ($3,000). Total hardware: $33,400–$84,400.

Labor costs (per 5,000 trajectories): teleoperation (5,000 episodes × 2 min/episode × $35/hr = $5,833), contact phase annotation (5,000 episodes × 3 min/episode × $25/hr = $6,250), dataset validation and packaging (80 hours × $75/hr = $6,000), and project management (40 hours × $100/hr = $4,000). Total labor: $22,083. Add 20% contingency for re-collection and annotation corrections: $26,500.

Timeline: hardware procurement and setup (2 weeks), pilot data collection and pipeline debugging (1 week), full data collection at 250 episodes/day (20 days = 4 weeks), contact annotation at 15 episodes/hour (333 hours = 6 weeks with 2 annotators), validation and packaging (1 week). Total: 14 weeks on the critical path, compressible to 10 weeks with parallel annotation.

Scale AI's managed data collection charges $18–$30 per contact-annotated trajectory (minimum 1,000 trajectories), delivering datasets in 6–8 weeks. Truelabel's marketplace lists pre-collected contact datasets at $8–$25 per trajectory for common tasks (peg insertion, lid rotation, cable routing) and $30–$60 per trajectory for novel tasks requiring custom hardware or objects.

Emerging Standards: RLDS, LeRobot, and Physical AI Data Interchange

The robotics community is converging on two dataset formats. RLDS (Reinforcement Learning Datasets) is a TensorFlow Datasets extension defining a schema for episodic robot data: observations (dict of arrays), actions (array), rewards (scalar), and metadata (JSON). The RLDS paper reports that 18 major robot learning datasets (including BridgeData V2, RoboNet, and CALVIN) have adopted RLDS, enabling cross-dataset training. RLDS stores data in TFRecord files with per-example compression; a 10,000-episode dataset occupies 80–150 GB depending on image resolution and compression.

LeRobot dataset format is a Hugging Face Datasets wrapper around HDF5 or Parquet, optimized for streaming and cloud storage. LeRobot supports arbitrary observation spaces (images, point clouds, force/torque, proprioception) and provides built-in data augmentation (random crops, color jitter, temporal subsampling). The LeRobot dataset documentation includes conversion scripts from RLDS, ROS bags, and MCAP. LeRobot datasets integrate with Hugging Face Hub for versioning and access control; 47 robot datasets totaling 1.2 million trajectories are now available on the Hub[8].

Both formats support multi-modal contact data, but neither enforces contact-specific metadata (F/T sensor specs, calibration date, contact phase annotations). Extend the base schema with a `contact_metadata` field containing `{"ft_sensor": {"model": "ATI Mini45", "range_N": 290, "range_Nm": 10, "sample_rate_hz": 500, "calibration_date": "2025-01-15"}, "tactile_sensor": {"model": "GelSight Mini", "resolution": [320, 240], "sample_rate_hz": 60}, "contact_phases": [...]}`. This metadata is critical for buyers assessing dataset compatibility with their hardware.

Open X-Embodiment demonstrated that training on 1 million trajectories from 22 robot embodiments (with varying F/T sensor specs) improves zero-shot transfer by 50% compared to single-embodiment training[9], but only when sensor metadata is preserved and policies are conditioned on embodiment ID.

Quality Assurance: Contact Data Validation Checklist

Before releasing a contact dataset, verify these 12 quality criteria. Sensor calibration: F/T sensor zero-force offset < 0.5 N, tactile sensor contact area error < 5% at 10 N test force. Temporal sync: 95th percentile timestamp delta between F/T and vision < 20 ms. Force signal quality: noise floor < 0.1 N during free-space motion, no saturation events (F/T magnitude < 95% of sensor range). Task success rate: ≥ 40% per task (lower rates indicate dataset encodes too many failure modes). Contact phase coverage: ≥ 10 examples of each phase per task, phase duration CV between 0.3 and 1.5.

Annotation consistency: inter-annotator agreement (Cohen's kappa) ≥ 0.75 on contact phase boundaries for a 100-episode validation set. Metadata completeness: dataset card includes robot specs, sensor specs, task taxonomy, collection method, annotation method, train/val/test split, license, and provenance. Format compliance: HDF5 validates against RLDS schema, MCAP files open in Foxglove without errors. Storage integrity: SHA-256 checksums provided for all files, no corrupted episodes (test by loading 100 random episodes).

Diversity metrics: force magnitude spans 2–50 N (or task-appropriate range), contact phase durations span 0.1–5 s, ≥ 3 object instances per object class, ≥ 2 lighting conditions. Licensing clarity: license file (LICENSE.txt) in dataset root, license field in dataset card matches LICENSE.txt, commercial use terms explicit. Buyer documentation: README includes quickstart code (load dataset, visualize episode, train baseline policy), example training script, and contact information for questions.

Datasets passing all 12 criteria command 40% higher prices on truelabel's marketplace than datasets with incomplete metadata or validation gaps.

Case Study: DROID Dataset Architecture and Impact

The DROID dataset (Distributed Robot Interaction Dataset) collected 76,000 trajectories across 564 manipulation skills from 350 operators using 18 Franka Emika Panda robots. Each robot was equipped with a wrist-mounted ATI Mini45 F/T sensor (500 Hz), two RealSense D435 cameras (wrist and static views, 30 Hz), and a Robotiq 2F-85 gripper with fingertip force sensors (100 Hz). Operators used bilateral teleoperation (3D Systems Touch haptic devices) with 1:3 force scaling, enabling them to feel contact forces up to 30 N.

The DROID pipeline recorded data in MCAP format with hardware-synchronized timestamps (GPS-disciplined NTP, 3 ms median sync error). Contact phases were annotated semi-automatically: a random forest classifier (trained on 500 hand-labeled episodes) proposed phase boundaries, then crowd workers corrected false positives in a custom web UI at 18 episodes/hour. The final dataset occupies 3.2 TB (42 MB per 30-second episode) and is released under CC BY 4.0.

Policies trained on DROID achieve 68% success on 40 unseen manipulation tasks (peg insertion, lid rotation, drawer opening) versus 41% for policies trained on vision-only datasets of equivalent size[1]. The performance gap is largest for contact-intensive tasks: 72% vs. 38% on peg insertion, 81% vs. 52% on lid rotation. DROID's impact demonstrates that force/torque data is not optional for contact-rich generalization—it is the primary signal for learning compliant manipulation.

The DROID team reports that 60% of dataset value came from three design choices: hardware-synchronized timestamps (eliminating force-vision misalignment), bilateral teleoperation with force feedback (reducing contact-induced failures by 40%), and semi-automated contact phase annotation (reducing annotation cost from $2.50 to $0.80 per episode). These lessons generalize to any contact-rich dataset project.

Use these to move from category-level context into specific task, dataset, format, and comparison detail.

External references and source context

  1. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID paper reporting 76,000 trajectories across 564 skills and zero-shot transfer results

    arXiv
  2. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment paper aggregating 1 million trajectories from 22 robot embodiments

    arXiv
  3. Introduction to HDF5

    HDF5 format introduction and technical specifications

    The HDF Group
  4. MCAP specification

    MCAP specification with nanosecond timestamp support

    MCAP
  5. BridgeData V2: A Dataset for Robot Learning at Scale

    BridgeData V2 paper reporting 22% success improvement with contact phase labels

    arXiv
  6. MCAP file format

    MCAP file format standard for robotics

    mcap.dev
  7. RT-1: Robotics Transformer for Real-World Control at Scale

    RT-1 paper reporting force conditioning reduces failures by 40%

    arXiv
  8. Hugging Face Datasets documentation

    Hugging Face Datasets documentation

    Hugging Face
  9. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment paper reporting 50% zero-shot transfer improvement

    arXiv

FAQ

What force/torque sensor sampling rate is required for contact-rich manipulation datasets?

Sample at 500 Hz minimum for dynamic contact tasks (peg insertion, snap fits, lid rotation) and 1 kHz for impact events (hammering, press fits). Quasi-static contact tasks (sliding, placing) can use 100 Hz. The DROID dataset uses 500 Hz ATI Mini45 sensors; Open X-Embodiment reports that 73% of contact-rich datasets sample at 500+ Hz. Lower rates miss force transients during contact initiation and break, causing policies to learn incorrect contact-to-action mappings.

How do I synchronize force/torque sensors with RGB-D cameras for robot datasets?

Use hardware triggering if available: connect the F/T sensor's sync output to the camera's GPIO trigger input, achieving sub-millisecond alignment. If hardware sync is unavailable, use ROS2 message_filters::ApproximateTime with a 10 ms tolerance window and log the maximum timestamp delta per episode in metadata. The DROID pipeline achieved 3 ms median sync error using GPS-disciplined NTP. Reject episodes with timestamp deltas > 20 ms; Open X-Embodiment's curation pipeline rejects episodes with > 50 ms sync error, which eliminated 3% of submitted trajectories.

What is the cost per trajectory for a contact-rich manipulation dataset?

In-house collection costs $12–$24 per trajectory (including hardware amortization, teleoperation labor, annotation, and validation). Scale AI's managed collection charges $18–$30 per contact-annotated trajectory. Pre-collected datasets on truelabel's marketplace cost $8–$25 per trajectory for common tasks (peg insertion, cable routing) and $30–$60 per trajectory for novel tasks requiring custom hardware. A 5,000-trajectory dataset costs $60,000–$120,000 total (hardware + labor) and takes 8–14 weeks to collect and package.

Should I use HDF5, MCAP, or ROS bags for storing contact-rich robot data?

Use HDF5 with RLDS schema as the primary format for training pipeline compatibility—LeRobot, Robomimic, and TensorFlow Datasets all consume RLDS-formatted HDF5 natively. Provide MCAP files as a secondary format for buyers who need full-rate sensor streams or want to re-annotate contact phases. MCAP is 40% more compact than ROS1 bags for high-frequency force data and supports mixed-frequency streams with nanosecond timestamps. Avoid ROS1 bags (deprecated) and ROS2 bags (poor compression for force data). The DROID dataset provides both HDF5 (3.2 TB) and MCAP (2.1 TB) versions.

How many contact-annotated trajectories are needed to train a generalist manipulation policy?

Open X-Embodiment reports that policies trained on 50,000+ contact-annotated trajectories achieve 50% higher zero-shot success on unseen tasks than policies trained on 10,000 trajectories. DROID's 76,000 trajectories enable 68% success on 40 unseen tasks. For single-task policies, 500–2,000 trajectories suffice if contact phases are annotated; BridgeData V2 found that 1,000 contact-annotated trajectories match the performance of 3,000 vision-only trajectories on insertion tasks. Budget 10,000+ trajectories for multi-task generalist policies and 1,000+ for single-task specialists.

What metadata is required for a contact-rich dataset to be procurement-ready?

Include robot embodiment (manufacturer, model, DoF, payload), sensor specs (F/T sensor model/range/sample rate, camera models/resolution/rate, tactile sensor model/rate), task taxonomy (task IDs with natural-language descriptions), collection method (teleoperation modality, operator count, duration), annotation method (contact phase taxonomy, annotator count, inter-annotator agreement), train/val/test split (episode counts and random seed), license (CC BY 4.0, CC BY-NC 4.0, or custom terms), and provenance (collection date, institution, funding source). Follow Datasheets for Datasets and Data Cards standards. Truelabel's provenance glossary defines the minimum metadata for physical AI procurement; datasets with complete metadata command 40% higher prices.

Looking for contact-rich manipulation dataset?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

List Your Contact-Rich Dataset on Truelabel