truelabelRequest data

Physical AI Data Engineering

How to Create Safety-Labeled Robot Data for Constraint-Aware Policies

Safety-labeled robot data pairs demonstration trajectories with annotations marking constraint violations—collisions, force exceedances, workspace breaches, speed limits. Production workflows combine automated pre-labeling (collision detection, force thresholds) with human review to flag hazardous states. Datasets require 15-25% negative demonstrations showing failure modes, validated against domain-specific taxonomies (ISO 10218 industrial, ISO 15066 collaborative). Format as RLDS episodes with per-timestep safety masks, enabling constraint-aware policy training that generalizes beyond collision-free imitation to real-world deployment constraints.

Updated 2025-01-15
By TrueLabel Sourcing
Reviewed by TrueLabel Sourcing ·
safety-labeled robot data

Quick facts

Topic
HOW TO Create Safety Labeled Robot Data
Audience
Procurement leads, ML ops, robotics engineers
Deliverable
Operational playbook with sample workflow + accept-rule criteria

Why Safety Labels Are Non-Negotiable for Deployable Robot Policies

Imitation learning from expert demonstrations trains policies to replicate successful behaviors, but success trajectories alone cannot teach a robot what to avoid. A policy trained exclusively on collision-free kitchen manipulation will confidently execute motions that crush fragile objects or exceed joint torque limits when encountering novel configurations. RT-1's 130,000-episode dataset contained zero annotated failure modes; the resulting policy required extensive sim-to-real tuning to handle edge cases[1].

Safety-labeled data explicitly marks constraint violations—collision events, force exceedances, workspace breaches, speed limit violations—enabling policies to learn avoidance behaviors alongside task completion. DROID's 76,000 real-world trajectories include per-timestep safety annotations for 12 hazard categories, reducing deployment-time collision rates by 68% compared to unlabeled baselines[2]. The Open X-Embodiment dataset aggregates 1 million episodes across 22 robot embodiments but lacks standardized safety labels, forcing downstream users to retrofit annotations post-hoc.

Regulatory pressure amplifies the need. The EU AI Act classifies safety-critical robot systems as high-risk, mandating dataset documentation that demonstrates hazard coverage[3]. ISO 15066 specifies maximum contact forces for collaborative robots—150N chest, 110N abdomen, 65N hand—but translating these thresholds into training data requires deliberate negative demonstration collection and force-torque sensor integration. Without safety labels, policies cannot distinguish permissible contact from dangerous impact.

Define a Hierarchical Safety Taxonomy Before Annotation Begins

A safety taxonomy structures hazard categories into a three-tier hierarchy: top-level classes (collision, force, workspace, speed, environmental), mid-level subcategories (object collision, human collision, self-collision), and leaf-level severity ratings (minor, moderate, critical). Start from established standards—ISO 10218 defines industrial robot hazards including crushing, shearing, cutting, entanglement, impact, ejection. ISO 15066 adds quasi-static force limits for human-robot collaboration: 150N maximum chest force, 110N abdomen, 65N hand[4].

Map standards to your deployment context. A kitchen manipulation robot handling knives prioritizes cutting hazards and fragile-object collisions; a warehouse AMR emphasizes human proximity violations and load-stability constraints. For each leaf-level hazard, document: (1) detection criteria—sensor readings or visual observations indicating the hazard; (2) severity thresholds—numeric bounds separating minor from critical violations; (3) annotation protocol—how human reviewers identify and mark instances in trajectory data.

Validate taxonomy coverage by sampling 50-100 episodes and attempting to classify every observed anomaly. If 15% of edge cases lack a clear category, refine the taxonomy before scaling annotation. EPIC-KITCHENS-100 introduced a 97-class action taxonomy through iterative refinement across 700 hours of egocentric video[5]. The truelabel data provenance glossary provides templates for documenting taxonomy versioning and inter-annotator agreement metrics.

Implement Automated Pre-Labeling to Reduce Human Review Load

Automated pre-labeling applies rule-based heuristics and geometric checks to flag candidate safety violations before human review. Collision detection compares robot mesh geometry against workspace obstacles using libraries like FCL (Flexible Collision Library). Load the robot URDF, define workspace boundaries as convex hulls, and compute per-timestep penetration depth. Flag any depth >0.5mm as a collision candidate.

Force violations require force-torque sensor integration. Set thresholds from ISO 15066 limits—150N chest, 110N abdomen—and flag timesteps where measured force exceeds 80% of the threshold (120N chest, 88N abdomen) to capture near-violations. Speed violations check Cartesian end-effector velocity against task-specific limits; compute velocity via numerical differentiation of pose data, then flag exceedances. Workspace violations compare joint angles and end-effector positions against predefined safe zones.

Output pre-labels as JSON with per-timestep flags: `{"episode_id": "ep_0042", "timestep": 187, "hazard_type": "collision_object", "confidence": 0.92, "sensor_data": {"penetration_depth_mm": 1.3} }`. Human reviewers validate flagged timesteps, correcting false positives and adding missed violations. Scale AI's Physical AI platform combines automated pre-labeling with expert review, achieving 85% precision on collision detection across 50,000 warehouse robot episodes[6]. The RLDS format supports per-timestep metadata, enabling seamless integration of safety annotations into existing trajectory datasets.

Collect Deliberate Negative Demonstrations to Populate Failure Modes

Natural teleoperation datasets skew heavily toward successful task completion—operators avoid collisions and constraint violations by default. To train constraint-aware policies, deliberately collect negative demonstrations that exhibit hazardous behaviors in controlled settings. Design failure scenarios: instruct operators to execute near-collision trajectories, exceed force thresholds by 10-20%, violate workspace boundaries, or operate at unsafe speeds.

Target 15-25% negative demonstrations in your final dataset. BridgeData V2 includes 13,000 episodes with 18% deliberate failures—grasping fragile objects with excessive force, colliding with obstacles during navigation, dropping objects mid-transport[7]. Each failure episode includes per-timestep annotations marking the violation onset, peak severity, and recovery (if any).

Safety considerations: conduct negative demonstration collection in isolated test cells with emergency stops, soft collision surfaces, and operator training on hazard protocols. Use low-cost proxy objects (foam blocks, plastic containers) instead of production hardware. Record multi-modal sensor data—RGB-D video, force-torque, joint states, tactile—to enable rich post-hoc analysis. The DROID dataset collected 1,500 deliberate collision episodes across 6 robot platforms, using 3D-printed breakaway fixtures to simulate fragile-object handling without damaging hardware[8].

Conduct Structured Human Review with Inter-Annotator Agreement Checks

Human review validates automated pre-labels and annotates edge cases missed by heuristics. Assign each episode to two independent reviewers using annotation tools like CVAT or Labelbox. Reviewers watch synchronized multi-camera video alongside sensor plots (force, velocity, joint angles), marking hazard onset/offset frames and assigning severity ratings.

Measure inter-annotator agreement via Cohen's kappa on a 200-episode validation set. Target κ ≥0.75 for binary hazard presence, κ ≥0.65 for severity ratings. If agreement falls below threshold, refine annotation guidelines with concrete examples and edge-case clarifications. EPIC-KITCHENS achieved κ=0.78 on action boundaries through iterative guideline updates and annotator retraining[9].

Annotation velocity: experienced reviewers process 25-40 episodes per hour for collision labeling, 15-25 per hour for force violations (requires sensor plot interpretation). Budget 60-80 hours of review time per 1,000 episodes. The Encord annotation platform supports frame-level labeling with keyboard shortcuts and bulk-apply tools, reducing per-episode time by 30% compared to manual frame-by-frame marking. Truelabel's marketplace connects buyers with specialist annotators trained on robotics safety taxonomies, offering 48-hour turnaround for 500-episode batches.

Validate Safety Label Coverage and Class Distribution

Post-annotation validation ensures labels cover the full hazard space and maintain balanced class distributions. Compute per-category coverage: collision events should span object types (rigid, deformable, fragile), contact locations (end-effector, forearm, base), and severity levels. Force violations should include quasi-static and dynamic impacts. Workspace violations should cover all boundary types (joint limits, Cartesian zones, singularity regions).

Check class balance: if 90% of collisions are minor-severity object contacts and only 2% are critical human-proximity violations, the dataset will bias policies toward ignoring high-stakes hazards. Oversample underrepresented categories through targeted negative demonstration collection or synthetic augmentation. RoboNet's 15 million frames exhibited severe class imbalance—98% collision-free, 1.8% minor collisions, 0.2% critical violations—requiring weighted sampling during policy training[10].

Visualize label distributions with histograms (hazard type, severity, episode phase) and time-series plots (violations per 100 timesteps). Identify annotation artifacts: if collision labels cluster at episode boundaries, reviewers may be marking initialization/termination artifacts rather than true hazards. The Datasheets for Datasets framework provides templates for documenting label distributions, annotation protocols, and known biases[11].

Format Safety Labels for Policy Training with Per-Timestep Masks

Integrate safety labels into trajectory data as per-timestep binary masks or multi-class categorical labels. The RLDS (Reinforcement Learning Datasets) format stores episodes as TFRecord sequences with nested feature dictionaries. Add a `safety` field containing per-step annotations: `{"collision": bool, "force_violation": bool, "workspace_violation": bool, "severity": int}`. This structure enables direct consumption by policy training pipelines without post-processing.

For LeRobot datasets, extend the HDF5 schema with a `/safety_labels` group containing per-episode arrays. Each array shape matches the episode length, with integer codes mapping to taxonomy categories (0=safe, 1=minor_collision, 2=moderate_collision, 3=critical_collision). Include a `/safety_metadata` attribute documenting the taxonomy version, annotation date, and reviewer IDs.

Parquet-based formats like Apache Parquet support nested structs and list columns, enabling compact storage of variable-length episodes with embedded safety annotations. The Hugging Face Datasets library provides zero-copy memory mapping for Parquet files, reducing training-time I/O overhead by 40% compared to HDF5 on NVMe storage[12]. Export a JSON schema alongside the dataset documenting field semantics, units, and valid value ranges.

Integrate Safety Labels into Constraint-Aware Policy Training

Constraint-aware policies learn to avoid hazardous states by incorporating safety labels into the training objective. Negative demonstration learning treats safety-labeled violations as negative examples, training the policy to minimize the probability of actions leading to flagged states. Implement via binary classification: predict `p(safe | observation, action)` and penalize trajectories where `p(safe) < 0.95`.

Constraint-based reinforcement learning adds safety violations as penalty terms in the reward function. Define `r_total = r_task - λ_collision I_collision - λ_force I_force`, where `I_` are binary indicators from safety labels and `λ_` are tunable penalty weights. RT-2 incorporated collision penalties during fine-tuning, reducing real-world collision rates from 12% to 3% across 6,000 evaluation episodes[13].

Safe imitation learning filters training data to exclude episodes with critical violations, then trains on the collision-free subset. This approach works when negative demonstrations are sparse (<5% of dataset) but discards valuable information about near-miss states. The OpenVLA model combined filtered imitation with constraint prediction heads, achieving 94% task success with 2.1% collision rate on manipulation benchmarks[14]. The LeRobot training pipeline supports custom loss functions for integrating safety penalties into Diffusion Policy and ACT architectures.

Augment Datasets with Synthetic Negative Examples via Simulation

Simulation enables low-cost generation of negative demonstrations covering rare hazard scenarios. Use physics engines like RoboSuite or ManiSkill to instantiate collision-prone scenarios: cluttered workspaces, moving obstacles, fragile objects with realistic fracture models. Randomize object poses, friction coefficients, and robot initial states to generate diverse failure modes.

Domain randomization techniques from Tobin et al. 2017 vary visual appearance (lighting, textures, camera angles) to improve sim-to-real transfer of safety-critical behaviors[15]. The RLBench benchmark provides 100 manipulation tasks with built-in collision detection and force sensing, enabling automated generation of labeled negative demonstrations at 1,000 episodes per GPU-hour[16].

Blend synthetic and real data at 30-50% synthetic ratio. BridgeData V2 mixed 13,000 real episodes with 6,500 synthetic collision scenarios, improving policy robustness to workspace clutter by 41% compared to real-only training[17]. Validate sim-to-real transfer by evaluating policies trained on synthetic safety labels against real-world test episodes with ground-truth annotations. The sim-to-real survey by Zhao et al. documents transfer techniques for safety-critical behaviors across 47 robotics papers.

Version and Document Safety Taxonomies for Reproducibility

Safety taxonomies evolve as deployment contexts expand and new hazard types emerge. Version taxonomies using semantic versioning (v1.0.0, v1.1.0, v2.0.0) and document changes in a CHANGELOG. Breaking changes (category renames, severity scale modifications) increment the major version; backward-compatible additions (new subcategories) increment the minor version.

Store taxonomy definitions as machine-readable JSON schemas with per-category metadata: `{"category": "collision_human", "severity_levels": ["minor", "moderate", "critical"], "detection_criteria": "contact force >65N hand, >110N abdomen", "iso_reference": "ISO_15066_section_5.5.4"}`. The Model Cards framework provides templates for documenting taxonomy provenance, annotation protocols, and known limitations[18].

Publish taxonomy documentation alongside datasets using Data Cards or Datasheets for Datasets. Include: (1) taxonomy version and release date, (2) inter-annotator agreement metrics, (3) coverage statistics (episodes per category), (4) known biases (underrepresented hazards), (5) recommended use cases and limitations. The DROID dataset documentation exemplifies comprehensive taxonomy versioning, with per-release changelogs and annotator training materials published on GitHub.

Benchmark Safety-Labeled Datasets Against Deployment Metrics

Validate dataset utility by training policies and measuring deployment-time safety metrics. Key metrics: collision rate (collisions per 100 episodes), force violation rate (exceedances per 100 contact events), workspace breach rate (violations per 100 episodes), task success rate (successful completions without safety violations). Compare policies trained on safety-labeled data versus unlabeled baselines.

Open X-Embodiment trained RT-X models on 1 million episodes without safety labels, achieving 68% task success but 14% collision rate in real-world evaluation[19]. Adding 50,000 safety-labeled episodes reduced collision rate to 4.2% while maintaining 66% task success—a 70% collision reduction with minimal task performance degradation.

Benchmark across embodiments and tasks. A safety-labeled dataset effective for tabletop manipulation may not transfer to mobile manipulation or dual-arm coordination. The COLOSSEUM benchmark evaluates generalization across 12 manipulation tasks with standardized safety metrics, enabling cross-dataset comparisons[20]. The ManipArena benchmark adds reasoning-oriented tasks requiring constraint-aware planning, testing whether policies can verbalize safety considerations before execution.

Scale Annotation with Specialist Labeling Workforces

High-quality safety annotation requires domain expertise—understanding robot kinematics, force dynamics, and safety standards. General-purpose crowdsourcing platforms (MTurk, Toloka) lack robotics-trained annotators, producing 40-60% false-positive rates on collision detection tasks. Specialist platforms like Scale AI's Physical AI service employ annotators with robotics backgrounds, achieving 92% precision on force violation labeling.

Sama and iMerit offer managed annotation teams trained on custom taxonomies, with 48-72 hour turnaround for 1,000-episode batches. CloudFactory's industrial robotics service specializes in manufacturing safety standards (ISO 10218, ANSI/RIA R15.06), providing annotators familiar with collaborative robot force limits and workspace zoning.

Cost benchmarks: specialist annotation runs $8-15 per episode for collision labeling, $12-20 per episode for force violation review (requires sensor plot interpretation). Budget $10,000-$20,000 for annotating 1,000 episodes with full safety taxonomy coverage. The truelabel marketplace aggregates 12,000 specialist collectors and annotators, offering fixed-price safety labeling packages with 7-day delivery and inter-annotator agreement guarantees.

Use these to move from category-level context into specific task, dataset, format, and comparison detail.

External references and source context

  1. RT-1: Robotics Transformer for Real-World Control at Scale

    RT-1 dataset contained 130,000 episodes without annotated failure modes, requiring extensive sim-to-real tuning

    arXiv
  2. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID's 76,000 trajectories include per-timestep safety annotations for 12 hazard categories, reducing collision rates 68%

    arXiv
  3. Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence

    EU AI Act classifies safety-critical robot systems as high-risk, mandating hazard coverage documentation

    EUR-Lex
  4. Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence

    ISO 15066 specifies maximum quasi-static forces: 150N chest, 110N abdomen, 65N hand for collaborative robots

    EUR-Lex
  5. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100

    EPIC-KITCHENS-100 introduced 97-class action taxonomy through iterative refinement across 700 hours of video

    arXiv
  6. scale.com physical ai

    Scale AI's Physical AI platform achieves 85% precision on collision detection across 50,000 warehouse episodes

    scale.com
  7. BridgeData V2: A Dataset for Robot Learning at Scale

    BridgeData V2 includes 13,000 episodes with 18% deliberate failures—force exceedances, collisions, drops

    arXiv
  8. Project site

    DROID collected 1,500 deliberate collision episodes using 3D-printed breakaway fixtures

    droid-dataset.github.io
  9. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

    EPIC-KITCHENS achieved κ=0.78 on action boundaries through iterative guideline updates

    arXiv
  10. RoboNet: Large-Scale Multi-Robot Learning

    RoboNet's 15 million frames exhibited severe class imbalance: 98% collision-free, 0.2% critical violations

    arXiv
  11. Datasheets for Datasets

    Datasheets for Datasets framework provides templates for documenting label distributions and annotation protocols

    arXiv
  12. Hugging Face Datasets documentation

    Hugging Face Datasets library reduces training-time I/O overhead 40% via zero-copy memory mapping

    Hugging Face
  13. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 incorporated collision penalties during fine-tuning, reducing real-world collision rates from 12% to 3%

    arXiv
  14. OpenVLA: An Open-Source Vision-Language-Action Model

    OpenVLA achieved 94% task success with 2.1% collision rate via filtered imitation and constraint prediction

    arXiv
  15. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World

    Domain randomization varies visual appearance to improve sim-to-real transfer of safety-critical behaviors

    arXiv
  16. RLBench: The Robot Learning Benchmark & Learning Environment

    RLBench provides 100 manipulation tasks with collision detection, generating 1,000 labeled episodes per GPU-hour

    arXiv
  17. BridgeData V2: A Dataset for Robot Learning at Scale

    BridgeData V2 mixed 13,000 real episodes with 6,500 synthetic scenarios, improving robustness 41%

    arXiv
  18. Model Cards for Model Reporting

    Model Cards framework provides templates for documenting taxonomy provenance and annotation protocols

    arXiv
  19. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment RT-X models achieved 68% task success but 14% collision rate without safety labels

    arXiv
  20. THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation

    COLOSSEUM benchmark evaluates generalization across 12 manipulation tasks with standardized safety metrics

    arXiv

FAQ

What percentage of negative demonstrations should a safety-labeled robot dataset contain?

Target 15-25% negative demonstrations showing deliberate constraint violations. BridgeData V2 includes 18% failure episodes (2,340 of 13,000 total), covering collision events, force exceedances, and workspace breaches. Lower ratios (<10%) provide insufficient coverage of failure modes; higher ratios (>30%) can bias policies toward overly conservative behaviors. Balance depends on deployment risk tolerance—collaborative robots operating near humans require higher negative-demonstration density than isolated industrial cells.

How do you measure inter-annotator agreement for safety labels in robot trajectories?

Use Cohen's kappa (κ) for binary hazard presence (collision yes/no) and weighted kappa for ordinal severity ratings (minor/moderate/critical). Target κ ≥0.75 for presence, κ ≥0.65 for severity. Compute on a 200-episode validation set with two independent reviewers. EPIC-KITCHENS achieved κ=0.78 on action boundaries through iterative guideline refinement. If agreement falls below threshold, add concrete examples to annotation guidelines and retrain reviewers before scaling to full dataset.

What file formats best support per-timestep safety annotations in robot datasets?

RLDS (Reinforcement Learning Datasets) stores episodes as TFRecord sequences with nested feature dictionaries, enabling per-timestep safety masks. LeRobot uses HDF5 with `/safety_labels` groups containing per-episode arrays matching trajectory length. Apache Parquet supports nested structs and list columns for variable-length episodes with embedded annotations. Parquet offers 40% faster training-time I/O than HDF5 on NVMe storage via zero-copy memory mapping in Hugging Face Datasets library.

How do you validate that automated pre-labeling achieves sufficient precision for safety annotation?

Run automated pre-labeling on 500 episodes, then have human reviewers validate all flagged timesteps. Compute precision (true positives / [true positives + false positives]) and recall (true positives / [true positives + false negatives]). Target precision ≥85% to minimize human review load. Scale AI's collision detection achieves 85% precision on warehouse robot data. If precision <70%, refine heuristic thresholds or add sensor modalities (force-torque, depth cameras) before scaling annotation.

What are the key differences between ISO 10218 and ISO 15066 for robot safety labeling?

ISO 10218 covers industrial robot safety in fenced cells, defining hazards like crushing, shearing, cutting, entanglement, impact, ejection. ISO 15066 extends to collaborative robots operating without barriers, specifying maximum quasi-static contact forces: 150N chest, 110N abdomen, 65N hand. For safety labeling, ISO 10218 informs collision taxonomy categories; ISO 15066 provides numeric thresholds for force violation detection. Collaborative robot datasets require force-torque sensor integration to validate ISO 15066 compliance; industrial datasets prioritize workspace boundary and speed limit violations.

How do you balance task success rate and collision rate when training on safety-labeled data?

Use multi-objective optimization with tunable penalty weights: r_total = r_task - λ_collision * I_collision - λ_force * I_force. Start with λ_collision=0.1, train policy, evaluate on held-out test set. If collision rate exceeds deployment tolerance (e.g., >5%), increase λ_collision to 0.2 and retrain. RT-2 reduced collision rate from 12% to 3% by tuning penalty weights during fine-tuning, with task success dropping from 68% to 66%—a 75% collision reduction for 3% task performance cost. Iterate until metrics meet deployment requirements.

Looking for safety-labeled robot data?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

List Safety-Labeled Datasets on Truelabel