Physical AI Data Engineering

How to Create a Semantic Segmentation Dataset for Physical AI

Creating a semantic segmentation dataset requires defining a class taxonomy aligned to robot perception tasks, collecting representative imagery from target environments, annotating pixel-level masks using tools like CVAT or Label Studio with SAM 2 model assistance, validating annotations against inter-annotator agreement thresholds above 85% IoU, and exporting to training-ready formats like COCO JSON or HDF5. For manipulation robots, proven taxonomies include 8-12 classes covering graspable objects, support surfaces, obstacles, robot links, and human hands. Annotation velocity reaches 15-30 images per hour with model-assisted workflows compared to 3-5 images per hour for manual polygon tracing.

Updated 2025-05-15
By truelabel
Reviewed by truelabel
semantic segmentation dataset

Quick facts

Difficulty
Intermediate
Audience
Physical AI data engineers
Last reviewed
2025-05-15

Why Semantic Segmentation Datasets Power Physical AI Perception

Semantic segmentation assigns a class label to every pixel in an image, producing dense spatial understanding that object detection bounding boxes cannot provide. Manipulation robots use segmentation masks to compute grasp poses on irregular objects, plan collision-free trajectories around obstacles, and distinguish support surfaces from background clutter[4]. Autonomous vehicles rely on segmentation to separate drivable road surface from curbs, pedestrians, and lane markings at centimeter precision[1].

The Open X-Embodiment dataset aggregated 1 million robot trajectories but found that only 22% of contributing datasets included pixel-level segmentation annotations, limiting cross-embodiment transfer for vision-based policies[2]. DROID's 76,000 manipulation trajectories included RGB-D streams but no semantic labels, forcing downstream users to train segmentation models from scratch or rely on zero-shot foundation models with 60-70% accuracy on novel objects[3].

Segmentation datasets require 8-12× more annotation labor than bounding boxes due to pixel-level precision demands. A 5,000-image dataset with 12 classes and an average of 6 instances per image costs $18,000-$36,000 for manual annotation at $0.60-$1.20 per mask, compared to $3,000-$6,000 for equivalent bounding box labels[4]. Model-assisted workflows using the Segment Anything Model (SAM) reduce per-mask time from 90 seconds to 25 seconds, cutting costs by 65-72%[5].

Designing a Class Taxonomy for Robot Perception Tasks

Class taxonomy design determines what your robot can perceive and how it generalizes to novel scenes. Start by cataloging 100-200 representative images from your target environment and manually listing every visible entity. Group entities by functional role rather than visual similarity — a cardboard box and a plastic bin both serve as containers, even though their textures differ.

For tabletop manipulation, proven taxonomies include 8-12 classes: target_object (items the robot will grasp, subdivided by SKU if needed), support_surface (tables, counters, shelves), container (bins, boxes, drawers), obstacle_rigid (appliances, walls, fixtures), obstacle_deformable (cloth, paper, bags), robot_arm (the robot's own links visible in egocentric views), gripper (end-effector annotated separately for grasp planning), human_hand (critical for collaborative scenarios), and background (everything else). The BridgeData V2 dataset used 9 classes across 13 kitchen environments, achieving 91% segmentation mIoU on held-out scenes[6].
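A minimal sketch of how such a taxonomy can be pinned down in code before annotation starts, so class IDs stay stable across the annotation platform, export scripts, and training pipeline. The names follow the 9-class example above; the dictionary itself is illustrative, not a required schema.

```python
# Illustrative taxonomy for a tabletop manipulation dataset. Background is 0
# so exported masks follow the background=0, classes 1-N convention used
# during export validation later in this guide.
TAXONOMY = {
    "background": 0,
    "target_object": 1,
    "support_surface": 2,
    "container": 3,
    "obstacle_rigid": 4,
    "obstacle_deformable": 5,
    "robot_arm": 6,
    "gripper": 7,
    "human_hand": 8,
}

ID_TO_NAME = {v: k for k, v in TAXONOMY.items()}
NUM_CLASSES = len(TAXONOMY)
```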

Document boundary rules in an 8-15 page annotation guideline. Specify how to handle occlusions (annotate visible pixels only or infer hidden regions), thin structures (cables, utensil handles — require 2-pixel minimum width), transparent objects (glass, plastic wrap — annotate container boundaries), and reflective surfaces (metal appliances — annotate physical surface, ignore reflections). Include 3-5 visual examples per class showing correct and incorrect annotations. Test guidelines on 50 pilot images with 3 annotators and measure inter-annotator agreement; revise rules until IoU exceeds 85% for all classes[5].

Selecting Annotation Tools and Model-Assisted Workflows

Annotation platforms differ in model-assisted labeling support, export formats, and cost structure. CVAT is open-source with SAM integration via serverless functions, supporting COCO JSON and YOLO segmentation exports. Labelbox offers hosted SAM 2 with active learning pipelines that prioritize high-uncertainty images, reducing annotation volume by 30-40% for equivalent model performance[7]. Encord provides frame-by-frame video segmentation with temporal consistency checks, critical for robot trajectory datasets where object masks must track across 200-500 frame sequences[5].

Model-assisted labeling workflows use foundation models to generate initial masks that annotators refine. SAM 2 accepts point prompts (single click on object center), box prompts (rough bounding box), or mask prompts (coarse polygon) and returns pixel-accurate masks in 0.3-0.8 seconds per instance on an RTX 4090[8]. For a 5,000-image dataset with 6 instances per image, SAM assistance reduces total annotation time from 750 hours (manual polygon tracing at 90 seconds per mask) to 210 hours (point prompts + refinement at 25 seconds per mask), saving $16,200 at $30/hour labor rates[4].
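The savings figure follows directly from the per-mask timings; a back-of-envelope calculation using the example rates above looks like this:

```python
# Reproduces the worked example above; all rates are the article's figures.
images = 5_000
instances_per_image = 6
masks = images * instances_per_image           # 30,000 masks

manual_hours = masks * 90 / 3600               # 90 s/mask  -> 750 h
assisted_hours = masks * 25 / 3600             # 25 s/mask  -> ~208 h
hourly_rate = 30.0                             # USD per annotator hour

savings = (manual_hours - assisted_hours) * hourly_rate
print(f"manual {manual_hours:.0f} h, assisted {assisted_hours:.0f} h, "
      f"savings ${savings:,.0f}")              # ~$16,250; $16,200 with the rounded 210 h figure
```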

Configure your platform to pre-segment images with SAM before human review. In CVAT, deploy the SAM serverless function and enable auto-annotation on upload. In Label Studio, install the SAM backend via Label Studio ML and set confidence threshold to 0.7 to surface only high-quality initial masks. Annotators then click to accept, adjust boundaries with polygon editing tools, or reject and re-prompt. Track per-annotator refinement rates — if >40% of masks require major edits, your prompts or class definitions need revision[7].
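For teams wiring up their own pre-segmentation step outside a platform, a minimal point-prompt sketch with the original segment-anything package looks like the following; the CVAT serverless function and Label Studio ML backend wrap an equivalent call, and SAM 2's Python API uses different class names. The checkpoint path, image file, and click coordinates are placeholders.

```python
# Point-prompt pre-segmentation sketch with the segment-anything package
# (pip install segment-anything). Checkpoint, file name, and click
# coordinates are placeholders.
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("frame_000123.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

point = np.array([[412, 305]])   # one click near the object center
label = np.array([1])            # 1 = foreground point
masks, scores, _ = predictor.predict(
    point_coords=point, point_labels=label, multimask_output=True
)
best_mask = masks[np.argmax(scores)]   # proposal handed to annotators for refinement
```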

Annotation Execution and Quality Control Protocols

Annotation quality determines downstream model performance more than dataset size. A 2,000-image dataset with 95% mask accuracy outperforms a 10,000-image dataset with 75% accuracy for the same training compute budget[5]. Implement three-tier quality control: real-time validation (platform checks for mask completeness, minimum area thresholds, class consistency), peer review (10-15% of images reviewed by a second annotator with IoU comparison), and expert audit (project lead reviews 5% of images weekly, focusing on edge cases).

Measure inter-annotator agreement on 100 gold-standard images annotated by 3 independent annotators. Compute mean IoU per class — target ≥85% for rigid objects, ≥75% for deformable objects, ≥70% for thin structures. If agreement falls below thresholds, revise annotation guidelines with clarifying examples and re-train annotators. The EPIC-KITCHENS dataset achieved 89% mean IoU across 100 object classes by iterating guidelines through 4 revision cycles with 37 annotators[9].
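A minimal sketch of the agreement computation, assuming each annotator's output is a single-channel label map of class IDs with background as 0:

```python
import numpy as np

def per_class_iou(mask_a: np.ndarray, mask_b: np.ndarray, num_classes: int) -> dict:
    """IoU per class between two annotators' (H, W) label maps; skips background."""
    ious = {}
    for c in range(1, num_classes):
        a, b = mask_a == c, mask_b == c
        union = np.logical_or(a, b).sum()
        if union == 0:
            continue  # class absent for both annotators on this image
        ious[c] = np.logical_and(a, b).sum() / union
    return ious

# Average per-class IoU over the gold-standard set and over annotator pairs,
# then compare each class against the 85% / 75% / 70% thresholds above.
```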

Track annotation velocity and error rates per annotator. Expect 15-30 images per hour with model assistance (3-5 images per hour manual). Flag annotators with velocity >2× team median (likely skipping refinement) or <0.5× median (over-editing or tool unfamiliarity). Provide weekly feedback with specific mask examples showing correct boundary placement, occlusion handling, and class assignment. Budget 10-15% of total annotation hours for quality control and rework[7].

Handle edge cases with explicit protocols. For occlusions, annotate only visible pixels unless the annotation guideline specifies amodal completion (inferring hidden regions). For overlapping instances, assign pixels to the foreground object and mark occlusion boundaries. For ambiguous classes (is a cutting board a support_surface or a target_object?), defer to functional role in the robot's task — if the robot grasps it, it's a target_object; if objects rest on it, it's a support_surface. Document all edge-case decisions in the guideline appendix[10].

Export Formats and Training Pipeline Integration

Segmentation annotations export to multiple formats depending on training framework and model architecture. COCO JSON is the most widely supported format, storing masks as polygons or run-length-encoded (RLE) bitmaps with per-instance class IDs, bounding boxes, and image metadata. PyTorch's torchvision.datasets.CocoDetection and TensorFlow's TFDS COCO loader parse this format natively. YOLO segmentation uses per-image text files with normalized polygon coordinates, optimized for YOLOv8-seg and YOLOv9-seg training. HDF5 stores masks as 2D integer arrays (height × width) with class IDs as pixel values, enabling fast random access for large datasets but requiring custom data loaders[11].
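A minimal sketch of producing a COCO-style instance entry from a binary mask with pycocotools; the image and category IDs are placeholders, and a real export would also populate the images and categories sections of the JSON.

```python
import numpy as np
from pycocotools import mask as mask_utils

binary_mask = np.zeros((480, 640), dtype=np.uint8)
binary_mask[100:200, 150:300] = 1                        # placeholder instance

rle = mask_utils.encode(np.asfortranarray(binary_mask))  # encode() needs Fortran order
rle["counts"] = rle["counts"].decode("ascii")            # make counts JSON-serializable

annotation = {
    "image_id": 123,                   # placeholder IDs
    "category_id": 2,                  # e.g. support_surface
    "segmentation": rle,
    "area": int(binary_mask.sum()),
    "bbox": [150, 100, 150, 100],      # x, y, width, height
    "iscrowd": 0,
}
```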

For robot trajectory datasets, integrate segmentation masks into existing RLDS episode structures or LeRobot HDF5 schemas. RLDS episodes store observations as nested dictionaries; add a 'segmentation_mask' key alongside 'rgb' and 'depth' with shape (H, W) and dtype uint8. LeRobot's observation schema supports arbitrary keys — append masks as 'observation.segmentation' with the same timestamp alignment as RGB frames. The DROID dataset format uses HDF5 groups per trajectory with '/observations/images/segmentation' arrays, enabling frame-by-frame mask lookup during policy training[3].
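A sketch of appending masks to an existing HDF5 trajectory with h5py, following the group layout described above; the RGB dataset path and file name are assumptions about the target schema.

```python
import h5py
import numpy as np

with h5py.File("trajectory_0001.hdf5", "a") as f:
    rgb = f["/observations/images/rgb"]            # assumed (T, H, W, 3) dataset
    T, H, W = rgb.shape[:3]
    masks = np.zeros((T, H, W), dtype=np.uint8)    # filled from the annotation export
    f.create_dataset(
        "/observations/images/segmentation",
        data=masks,
        compression="gzip",                        # class-ID masks compress well
    )
```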

Validate exports before training by loading 10 random samples and visualizing masks overlaid on RGB images. Check for class ID consistency (background=0, classes 1-N match taxonomy order), mask completeness (no missing instances), boundary precision (masks align with object edges within 2-3 pixels), and format compliance (COCO JSON validates against the official schema, YOLO files parse without errors). Use Roboflow's dataset health check or write a custom validation script that computes per-class instance counts, mean mask area, and boundary smoothness metrics[12].
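A sketch of the per-sample checks, assuming masks are exported as single-channel PNG label maps stored alongside their RGB frames; the class count and overlay tint are illustrative.

```python
import numpy as np
from PIL import Image

NUM_CLASSES = 13  # background + 12 classes

def validate_sample(image_path: str, mask_path: str) -> None:
    image = np.array(Image.open(image_path).convert("RGB"))
    mask = np.array(Image.open(mask_path))                  # single-channel class IDs
    assert mask.shape == image.shape[:2], "mask/image size mismatch"
    assert mask.max() < NUM_CLASSES, f"unknown class ID {int(mask.max())}"
    # Quick visual check: tint labeled pixels red and save an overlay for review.
    overlay = image.copy()
    labeled = mask > 0
    overlay[labeled] = (0.5 * overlay[labeled] + [127, 0, 0]).astype(np.uint8)
    Image.fromarray(overlay).save(mask_path.replace(".png", "_overlay.png"))
```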

Training Baseline Models for Dataset Validation

Training a segmentation model on your dataset validates annotation quality and coverage before releasing to downstream users. Use a lightweight architecture like DeepLabV3+ with a MobileNetV2 backbone for fast iteration — training on 2,000 images with 12 classes takes 4-6 hours on a single RTX 4090, reaching 75-85% mIoU on a held-out test set if annotations are high-quality[5]. If mIoU falls below 70%, inspect per-class confusion matrices to identify annotation errors (e.g., support_surface frequently misclassified as background suggests inconsistent boundary annotation).
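One way to stand up this baseline is the segmentation_models_pytorch package, which offers the DeepLabV3+/MobileNetV2 combination directly; the sketch below omits data loading, augmentation, and checkpointing.

```python
import torch
import segmentation_models_pytorch as smp

NUM_CLASSES = 13  # background + 12 classes
model = smp.DeepLabV3Plus(
    encoder_name="mobilenet_v2",
    encoder_weights="imagenet",
    classes=NUM_CLASSES,
).cuda()

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(images: torch.Tensor, masks: torch.Tensor) -> float:
    """images: (B, 3, H, W) normalized floats; masks: (B, H, W) long class IDs."""
    optimizer.zero_grad()
    logits = model(images.cuda())          # (B, NUM_CLASSES, H, W)
    loss = criterion(logits, masks.cuda())
    loss.backward()
    optimizer.step()
    return loss.item()
```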

Split your dataset 70% train, 15% validation, 15% test with stratification by scene and lighting condition to prevent overfitting to specific environments. For robot datasets, ensure test scenes include novel object arrangements, lighting variations, and occlusion patterns not present in training. The BridgeData V2 split held out 2 of 13 kitchens entirely for testing, revealing a 12-point mIoU drop on unseen environments that prompted additional data collection in visually distinct spaces[6].
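A sketch of a scene-level split that keeps whole environments out of training, assuming each sample record carries a 'scene_id' field; the 70/15/15 ratio is applied to scenes rather than individual images.

```python
import random
from collections import defaultdict

def split_by_scene(samples, seed=0):
    """samples: list of dicts, each with at least a 'scene_id' key."""
    by_scene = defaultdict(list)
    for s in samples:
        by_scene[s["scene_id"]].append(s)
    scenes = list(by_scene)
    random.Random(seed).shuffle(scenes)
    n = len(scenes)
    train_s = set(scenes[: int(0.7 * n)])
    val_s = set(scenes[int(0.7 * n): int(0.85 * n)])
    split = {"train": [], "val": [], "test": []}
    for scene, items in by_scene.items():
        key = "train" if scene in train_s else "val" if scene in val_s else "test"
        split[key].extend(items)
    return split
```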

Benchmark against zero-shot foundation models to quantify annotation value. Run SAM 2 or NVIDIA Cosmos on your test set with automatic mask generation (no prompts) and compute mIoU against ground truth. If your trained model achieves <10-point improvement over zero-shot, your dataset may lack sufficient diversity or annotation quality to justify the labeling cost. Target ≥15-point improvement for manipulation tasks, ≥20-point for outdoor navigation where foundation models struggle with domain shift[13].

Document baseline results in a dataset card following the Datasheets for Datasets framework: model architecture, training hyperparameters, per-class mIoU, inference latency, and failure modes (e.g., thin structures, transparent objects). Include qualitative examples showing correct predictions and common errors. This transparency helps buyers assess dataset fit for their perception stack and sets realistic performance expectations[14].

Handling Challenging Object Categories in Physical AI

Certain object categories resist accurate segmentation due to material properties or geometric complexity. Transparent objects (glass containers, plastic wrap) lack clear boundaries in RGB images; annotate container edges where material transitions occur and document that masks represent physical extent, not visual appearance. Depth sensors often fail on transparent surfaces — the Dex-YCB dataset excluded glass objects entirely after finding 40% of depth pixels invalid[15]. Consider multi-modal annotation with polarized imaging or structured light to capture transparent boundaries.

Thin structures (cables, utensil handles, plant stems) require sub-pixel precision that standard polygon tools struggle to capture. Set minimum annotation width to 2-3 pixels and accept that masks will be approximate. For critical thin structures (e.g., surgical instruments), use skeleton-based annotation where annotators trace centerlines with thickness attributes, then generate masks via morphological dilation. The EPIC-KITCHENS dataset used 3-pixel minimum width for utensil handles, accepting 15-20% boundary error as unavoidable given 1920×1080 resolution[9].
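A sketch of the skeleton-based approach using OpenCV: rasterize the traced centerline as a 1-pixel polyline, then dilate it to the recorded thickness. The function name and coordinates are illustrative.

```python
import cv2
import numpy as np

def centerline_to_mask(points_xy, thickness_px, height, width):
    """Rasterize an annotated centerline, then dilate to the recorded width."""
    mask = np.zeros((height, width), dtype=np.uint8)
    pts = np.asarray(points_xy, dtype=np.int32).reshape(-1, 1, 2)
    cv2.polylines(mask, [pts], isClosed=False, color=1, thickness=1)
    kernel = cv2.getStructuringElement(
        cv2.MORPH_ELLIPSE, (thickness_px, thickness_px)
    )
    return cv2.dilate(mask, kernel)

# e.g. a utensil handle traced with three clicks at the 3-pixel minimum width
handle_mask = centerline_to_mask([(120, 340), (180, 352), (240, 360)], 3, 480, 640)
```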

Deformable objects (cloth, bags, food items) change shape between frames in video datasets, requiring per-frame annotation rather than interpolation. Budget 3-5× more annotation time for deformable object sequences compared to rigid objects. The RH20T dataset annotated cloth manipulation tasks at 5 fps instead of 30 fps, interpolating intermediate frames with optical flow and manually correcting 20% of interpolated masks where flow failed[16].

Reflective surfaces (metal appliances, polished countertops) show mirror images of surrounding objects that annotators may incorrectly segment as separate instances. Annotation guidelines must specify: annotate only the physical surface, ignore reflections. If a reflected object is also physically present in the scene, annotate both instances separately. Use matte lighting during data collection to minimize reflections — the RoboCasa simulation environment disables specular reflections on all surfaces to simplify segmentation ground truth generation[17].

Dataset Documentation and Provenance for Physical AI Buyers

Physical AI dataset buyers require detailed provenance documentation to assess training legality, model commercialization rights, and data quality. Create a dataset card following the Datasheets for Datasets template: collection methodology, annotation protocols, annotator demographics and training, quality metrics (inter-annotator IoU, expert audit pass rate), known limitations (lighting conditions, object categories, failure modes), and intended use cases. The DROID dataset card specifies 76,000 trajectories across 564 scenes with 18 robot embodiments, but omits annotator training details and inter-annotator agreement, limiting buyer confidence in label quality[3].

Document data provenance at image and annotation level. For each image, record: capture timestamp, camera model and settings, scene identifier, lighting conditions, and any post-processing (color correction, cropping). For each annotation, record: annotator ID, annotation timestamp, tool version, model-assistance method (manual, SAM-assisted, auto-generated), and review status (unreviewed, peer-reviewed, expert-approved). Store provenance in a separate JSON or CSV file linked to image IDs, enabling buyers to filter by quality tier or exclude auto-generated annotations[18].
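An illustrative pair of provenance records, one per image and one per annotation, written as JSON lines and joined on image_id; the field names and values are an example schema rather than a standard.

```python
import json

image_record = {
    "image_id": "kitchen_03_000412",
    "captured_at": "2025-03-14T10:22:05Z",
    "camera": "Intel RealSense D435i",
    "scene_id": "kitchen_03",
    "lighting": "overhead_led",
    "post_processing": ["color_correction"],
}

annotation_record = {
    "image_id": "kitchen_03_000412",
    "annotator_id": "ann_017",
    "annotated_at": "2025-03-20T14:05:41Z",
    "tool": "cvat-2.11",
    "assistance": "sam_point_prompt",   # manual | sam_point_prompt | auto_generated
    "review_status": "peer_reviewed",   # unreviewed | peer_reviewed | expert_approved
}

with open("provenance.jsonl", "a") as f:
    for record in (image_record, annotation_record):
        f.write(json.dumps(record) + "\n")
```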

Specify licensing terms that address model training and commercialization. Creative Commons BY 4.0 permits commercial use with attribution but does not explicitly grant model training rights — buyers in EU jurisdictions under the AI Act may require explicit training grants. Consider dual licensing: CC BY 4.0 for academic use, commercial license for production deployments. The BridgeData V2 dataset uses MIT license for code and data, granting unrestricted commercial use, but does not address model output ownership or derivative work rights[6].

Publish dataset statistics: total images, per-class instance counts, mean instances per image, mask area distributions, and scene diversity metrics (number of unique environments, lighting conditions, camera viewpoints). Include a data sheet with annotation cost breakdown (labor hours, tooling costs, quality control overhead) to help buyers benchmark pricing. The truelabel marketplace requires these statistics for all listed datasets, enabling buyers to compare coverage and cost-efficiency across vendors[19].

Scaling Annotation with Distributed Teams and Active Learning

Scaling beyond 5,000 images requires distributed annotation teams and active learning to maintain quality while controlling costs. Partition your dataset into batches of 500-1,000 images and assign each batch to a dedicated annotator pair (primary annotator + peer reviewer). Rotate batches across pairs weekly to prevent annotator drift where individuals develop idiosyncratic boundary placement styles. The EPIC-KITCHENS dataset used 37 annotators across 18 months, rotating assignments every 2 weeks and conducting monthly calibration sessions where all annotators labeled the same 20 images and discussed disagreements[9].

Implement active learning to prioritize high-value images for annotation. Train an initial segmentation model on 1,000-2,000 labeled images, then run inference on the remaining unlabeled pool. Select images for annotation based on: prediction uncertainty (high entropy in per-pixel class probabilities), novel object configurations (low feature similarity to training set), or model disagreement (ensemble of 3 models produces inconsistent masks). Labelbox's active learning pipeline reduces annotation volume by 35-45% for equivalent model performance by focusing labeling effort on informative samples[7].
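A minimal sketch of the uncertainty criterion: rank unlabeled images by mean per-pixel prediction entropy and send the highest-entropy images to annotators first. The model and the image-loading step are assumed to exist in the surrounding pipeline.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_entropy(model, image: torch.Tensor) -> float:
    """image: (3, H, W) normalized float tensor; higher entropy = more uncertain."""
    logits = model(image.unsqueeze(0).cuda())              # (1, C, H, W)
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=1)
    return entropy.mean().item()

# ranked = sorted(unlabeled_images, key=lambda img: mean_entropy(model, img), reverse=True)
```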

Budget annotation costs realistically: $0.60-$1.20 per mask for manual polygon tracing, $0.15-$0.35 per mask with SAM assistance, plus 15% overhead for quality control and rework. A 10,000-image dataset with 6 instances per image (60,000 masks) costs $9,000-$21,000 with model assistance, $36,000-$72,000 manual. Offshore annotation teams (Philippines, India, Kenya) charge $8-$15/hour compared to $25-$40/hour for US-based annotators, but require more intensive quality oversight — budget 20-25% of annotation hours for review instead of 10-15%[4].

Track annotation velocity and error rates in a live dashboard. Flag batches with <80% first-pass acceptance rate (high refinement needs) or >2× median annotation time (annotator struggling with guidelines or tooling). Conduct weekly calibration calls where the project lead reviews edge cases, updates guidelines, and demonstrates correct annotation techniques. Archive all guideline revisions with version numbers and effective dates so annotations can be traced to the ruleset in effect when they were created[5].


External references and source context

  1. Dataset page

    Autonomous vehicle segmentation precision requirements

    waymo.com
  2. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment dataset segmentation coverage statistics

    arXiv
  3. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID dataset RGB-D streams without semantic labels

    arXiv
  4. Scale AI: Expanding Our Data Engine for Physical AI

    Scale AI's data engine for physical AI and annotation cost benchmarks

    scale.com
  5. encord.com annotate

    Encord annotation velocity and quality metrics

    encord.com
  6. BridgeData V2: A Dataset for Robot Learning at Scale

    BridgeData V2 class taxonomy and segmentation mIoU

    arXiv
  7. labelbox

    Labelbox active learning and annotation reduction

    labelbox.com
  8. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch

    Segment Anything Model (SAM) performance metrics

    arXiv
  9. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100

    EPIC-KITCHENS inter-annotator agreement and annotation protocols

    arXiv
  10. CVAT polygon annotation manual

    CVAT polygon editing and quality protocols

    docs.cvat.ai
  11. Introduction to HDF5

    HDF5 format for segmentation mask storage

    The HDF Group
  12. roboflow.com annotate

    Roboflow dataset health check and validation

    roboflow.com
  13. NVIDIA Cosmos World Foundation Models

    NVIDIA Cosmos zero-shot segmentation benchmarks

    NVIDIA Developer
  14. Datasheets for Datasets

    Datasheets for Datasets documentation framework

    arXiv
  15. Project site

    Dex-YCB transparent object depth sensor failures

    dex-ycb.github.io
  16. Project site

    RH20T deformable object annotation at reduced frame rate

    rh20t.github.io
  17. Project site

    RoboCasa simulation environment reflective surface handling

    robocasa.ai
  18. truelabel data provenance glossary

    Data provenance documentation requirements

    truelabel.ai
  19. truelabel physical AI data marketplace bounty intake

    Truelabel marketplace dataset statistics requirements

    truelabel.ai

FAQ

What is the minimum dataset size for training a semantic segmentation model for robot manipulation?

Minimum viable datasets contain 1,500-2,500 images with 8-12 classes for tabletop manipulation tasks, achieving 70-80% mIoU on held-out test scenes. Larger datasets (5,000-10,000 images) reach 85-92% mIoU and generalize better to novel object arrangements and lighting conditions. The BridgeData V2 dataset used 13,000 trajectories with segmentation annotations across 13 kitchen environments to train policies that generalized to unseen kitchens with 78% success rate. For outdoor navigation or warehouse robotics, target 10,000-25,000 images to cover diverse lighting, weather, and clutter conditions.

How does model-assisted annotation with SAM 2 reduce labeling costs?

SAM 2 model-assisted workflows reduce per-mask annotation time from 90 seconds (manual polygon tracing) to 25 seconds (point prompt + boundary refinement), cutting labor costs by 65-72%. For a 5,000-image dataset with 6 instances per image (30,000 masks), SAM assistance saves $16,200 at $30/hour labor rates compared to fully manual annotation. Quality remains high — inter-annotator IoU with SAM assistance averages 87-91% compared to 89-93% for manual annotation, a 2-4 point difference that does not significantly impact downstream model performance. SAM 2 struggles with thin structures under 5 pixels wide and transparent objects, requiring manual annotation for these categories.

What annotation formats are compatible with PyTorch and TensorFlow segmentation training pipelines?

COCO JSON is the most widely supported format, compatible with PyTorch's torchvision.datasets.CocoDetection and TensorFlow's TFDS COCO loader. COCO stores masks as polygons or run-length-encoded (RLE) bitmaps with per-instance class IDs, bounding boxes, and image metadata. YOLO segmentation format uses per-image text files with normalized polygon coordinates, optimized for YOLOv8-seg and YOLOv9-seg. HDF5 stores masks as 2D integer arrays (height × width) with class IDs as pixel values, enabling fast random access but requiring custom data loaders. For robot trajectory datasets, integrate masks into RLDS episode structures or LeRobot HDF5 schemas by adding a 'segmentation_mask' key alongside RGB and depth observations.

How do you measure annotation quality for semantic segmentation datasets?

Measure inter-annotator agreement on 100 gold-standard images annotated by 3 independent annotators. Compute mean Intersection over Union (IoU) per class — target ≥85% for rigid objects, ≥75% for deformable objects, ≥70% for thin structures. Implement three-tier quality control: real-time validation (platform checks for mask completeness and class consistency), peer review (10-15% of images reviewed by a second annotator), and expert audit (project lead reviews 5% of images weekly). Track per-annotator refinement rates — if >40% of model-assisted masks require major edits, annotation guidelines or prompts need revision. Train a baseline segmentation model and compute per-class mIoU on a held-out test set; if mIoU falls below 70%, inspect confusion matrices to identify systematic annotation errors.

What are the most challenging object categories for semantic segmentation in robotics?

Transparent objects (glass, plastic wrap) lack clear RGB boundaries and cause depth sensor failures — annotate container edges where material transitions occur and document that masks represent physical extent. Thin structures (cables, utensil handles) require sub-pixel precision; set minimum annotation width to 2-3 pixels and accept 15-20% boundary error. Deformable objects (cloth, bags, food) change shape between frames, requiring per-frame annotation at 3-5× the time cost of rigid objects. Reflective surfaces (metal appliances, polished countertops) show mirror images that annotators may incorrectly segment as separate instances — guidelines must specify annotating only physical surfaces and ignoring reflections. The EPIC-KITCHENS dataset excluded highly transparent objects and used 3-pixel minimum width for thin structures, accepting approximate boundaries as unavoidable at 1920×1080 resolution.

How should semantic segmentation datasets be documented for physical AI buyers?

Create a dataset card following the Datasheets for Datasets template: collection methodology, annotation protocols, annotator training, inter-annotator IoU metrics, expert audit pass rates, known limitations (lighting, object categories, failure modes), and intended use cases. Document data provenance at image level (capture timestamp, camera model, scene ID, lighting, post-processing) and annotation level (annotator ID, timestamp, tool version, model-assistance method, review status). Specify licensing terms that address model training and commercialization — Creative Commons BY 4.0 permits commercial use but may not satisfy EU AI Act training-right requirements; consider dual licensing for academic and commercial use. Publish dataset statistics: total images, per-class instance counts, mean instances per image, mask area distributions, scene diversity metrics, and annotation cost breakdown to help buyers benchmark pricing and coverage.

Looking for a semantic segmentation dataset?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

List Your Segmentation Dataset on Truelabel