Computer Vision Glossary

Instance Segmentation

Instance segmentation detects every object in an image and produces a pixel-precise mask for each individual instance, distinguishing separate objects of the same class. Unlike bounding-box detection, it delineates exact spatial boundaries; unlike semantic segmentation, it assigns unique identities to each object. Modern methods like Mask R-CNN and Mask2Former power robotic manipulation by enabling per-object grasping, collision avoidance, and task planning in cluttered environments.

Updated 2025-06-15

By Truelabel Team

Reviewed by Truelabel Team · Jun 15, 2025

Quick facts

Topic: Instance Segmentation
Audience: Procurement leads, ML ops, robotics engineers
Deliverable: Buyer-facing reference + procurement guidance

What Instance Segmentation Is and Why It Matters for Physical AI

Instance segmentation is the computer vision task of detecting every object in an image and producing a binary pixel mask for each individual instance. Given an input image, the output is a set of (mask, class, confidence) tuples where each mask covers exactly one object down to the pixel level. This differs from object detection, which outputs axis-aligned bounding boxes, and from semantic segmentation, which labels pixels by class without distinguishing individuals.
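
As a rough illustration of that output format, the sketch below models each prediction as a small record and counts confident instances per class. The class name InstancePrediction, the helper count_by_class, and the 0.5 score cutoff are illustrative assumptions, not part of any particular library.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InstancePrediction:
    """One (mask, class, confidence) tuple; names here are illustrative."""
    mask: List[List[int]]  # binary mask with the same height/width as the image
    label: str             # predicted class name
    score: float           # confidence in [0, 1]


def count_by_class(preds: List[InstancePrediction], min_score: float = 0.5) -> Dict[str, int]:
    """Count instances per class, ignoring low-confidence predictions."""
    counts: Dict[str, int] = {}
    for p in preds:
        if p.score >= min_score:
            counts[p.label] = counts.get(p.label, 0) + 1
    return counts


# Toy 2x4 image: two separate mugs and one low-confidence spoon.
preds = [
    InstancePrediction(mask=[[1, 1, 0, 0], [1, 1, 0, 0]], label="mug", score=0.94),
    InstancePrediction(mask=[[0, 0, 1, 1], [0, 0, 1, 1]], label="mug", score=0.88),
    InstancePrediction(mask=[[0, 0, 0, 0], [0, 1, 1, 0]], label="spoon", score=0.31),
]
print(count_by_class(preds))  # {'mug': 2}
```

A semantic segmentation output could not answer this counting question, because both mugs would share one undifferentiated "mug" region.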

For physical AI systems—robots operating in unstructured environments—instance segmentation is a critical perceptual primitive^[1]. A warehouse robot must segment individual packages on a conveyor belt to plan grasps; a kitchen assistant must distinguish separate utensils, dishes, and food items to execute task-relevant actions^[2]. The DROID manipulation dataset and BridgeData V2 both rely on instance masks to label object interactions across 76,000+ teleoperation trajectories. Without per-object identity, robots cannot count items, track individuals across frames, or reason about occlusion and spatial relationships.

Modern architectures solve instance segmentation through two paradigms. Two-stage methods like Mask R-CNN first detect objects with region proposals, then predict a mask within each bounding box. Single-stage methods like SOLO and mask-classification approaches like Mask2Former directly predict masks without an intermediate detection step, often achieving better speed-accuracy tradeoffs on dense scenes. The COCO dataset remains the standard benchmark, with 80 object categories, 860,000 instances, and 118,000 annotated images^[3].

Historical Evolution: From Multiscale Grouping to Foundation Models

Before 2014, instance segmentation was approached mainly through bottom-up grouping methods like Multiscale Combinatorial Grouping, which clustered superpixels into object hypotheses. The release of the COCO dataset in 2014 catalyzed a shift to top-down deep learning approaches^[3]. Faster R-CNN introduced region proposal networks in 2015, enabling end-to-end object detection; Mask R-CNN extended this in 2017 by adding a mask prediction branch, reaching 37.1 mask AP on COCO, surpassing the COCO 2016 segmentation challenge winners, and receiving the Marr Prize at ICCV 2017.

The 2020s brought transformer-based architectures. DETR eliminated hand-designed components like non-maximum suppression by framing detection as a set prediction problem. Mask2Former unified instance, semantic, and panoptic segmentation under a single mask-classification framework, achieving state-of-the-art results across all three tasks. Most recently, the Segment Anything Model (SAM), trained on 11 million images and 1.1 billion masks, demonstrated zero-shot promptable segmentation, enabling interactive annotation workflows where annotators click points or draw boxes to generate high-quality masks in seconds^[4].

For robotics, this evolution matters because annotation cost directly limits dataset scale. The Open X-Embodiment dataset aggregates more than 1 million trajectories from 22 robot embodiments, but only a subset includes pixel-level instance masks due to annotation expense. Foundation models like SAM reduce per-mask labeling time from 30 seconds to under 5 seconds, making large-scale instance annotation economically feasible for physical AI training pipelines.

Annotation Workflows and Quality Tradeoffs for Manipulation Data

Producing instance segmentation annotations for robotic datasets involves two main quality tiers plus a special case. Rough masks use bounding-box-to-mask conversion or superpixel grouping, achieving 70-80% IoU but missing fine boundaries. Fine masks require polygon tracing or brush tools, reaching 85-92% IoU at 10-15x the annotation time. Transparent and reflective objects, which are common in kitchen and warehouse scenes, fall outside both tiers and demand specialized protocols^[5].
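
The IoU percentages used to describe these tiers compare an annotated mask against a reference mask. A minimal sketch of that computation, assuming masks are stored as nested lists of 0/1 values (the function name mask_iou and the empty-mask convention are choices made for this example):

```python
def mask_iou(a, b):
    """Intersection-over-union of two binary masks with identical shape."""
    inter = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            if pa and pb:
                inter += 1
            if pa or pb:
                union += 1
    # Convention assumed here: two empty masks count as a perfect match.
    return inter / union if union else 1.0


# Reference mask: a 3x2 object. Rough mask: the same object plus two spill-over pixels.
reference = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
rough = [
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(mask_iou(reference, rough))  # 0.75
```

Two stray boundary pixels on a six-pixel object already drop IoU to 75%, which is why small or thin objects are the hardest to annotate to a high threshold.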

The EPIC-KITCHENS-100 dataset illustrates the challenge: 90,000 action segments across 700 hours of egocentric video, but only 454 videos include dense object masks due to annotation cost^[6]. DROID sidesteps this by collecting 76,000 trajectories with bounding-box annotations, then using SAM to generate instance masks in post-processing—reducing per-frame cost from $0.50 to $0.05 while maintaining 88% IoU on validation splits.

Annotation platforms like Encord and Segments.ai now integrate foundation models into their workflows. Annotators click a point on an object; SAM generates a candidate mask; the annotator refines boundaries with polygon edits. This hybrid approach achieves 90% IoU at 60% of the time cost of manual polygon tracing. For physical AI buyers, the tradeoff is clear: rough masks suffice for object detection pretraining, but manipulation policies that reason about grasp affordances require fine-grained boundaries around handles, edges, and contact surfaces.

Instance Segmentation in Multi-Sensor Robotics Pipelines

Physical AI systems increasingly fuse RGB instance masks with depth, LiDAR, and tactile data. The Dex-YCB dataset pairs instance masks with depth maps and 6-DoF object poses for 20 YCB objects across 582,000 frames, enabling grasp synthesis that respects object geometry. NVIDIA Cosmos world foundation models ingest multi-view RGB-D sequences and predict instance-aware 3D occupancy grids, supporting collision-free motion planning in cluttered scenes.

Point cloud instance segmentation extends the task to 3D. PointNet introduced learning directly on raw point clouds with per-point predictions, and successor architectures process raw LiDAR or depth point clouds to assign each 3D point to an object instance. The Segments.ai point cloud labeling tools support 3D bounding boxes, polygon extrusion, and voxel painting for outdoor robotics and autonomous vehicle datasets. Waymo Open Dataset includes 1,000 scenes with 12 million 3D bounding boxes and per-point instance labels for vehicles, pedestrians, and cyclists.

For manipulation, the integration point is grasp planning. A robot segments a target object in RGB, projects the mask into 3D using depth, samples grasp candidates on the object surface, and evaluates collision-free trajectories. The RT-1 Robotics Transformer trains on 130,000 demonstrations with instance masks, learning to ground language commands like 'pick the red mug' to pixel-level object identities. Without instance segmentation, the model cannot distinguish 'the red mug' from 'a red mug' in multi-object scenes.
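
The "project the mask into 3D using depth" step can be sketched with a standard pinhole camera model. The intrinsics (fx, fy, cx, cy), the toy image, and the helper names below are assumptions for illustration; a real pipeline would read calibrated intrinsics from the sensor and work on full-resolution arrays.

```python
def mask_to_points(mask, depth, fx, fy, cx, cy):
    """Back-project masked pixels with valid depth into camera-frame 3D points."""
    points = []
    for v, (mask_row, depth_row) in enumerate(zip(mask, depth)):
        for u, (m, z) in enumerate(zip(mask_row, depth_row)):
            if m and z > 0:  # depth of 0 is treated as a missing reading
                x = (u - cx) * z / fx
                y = (v - cy) * z / fy
                points.append((x, y, z))
    return points


def centroid(points):
    """Mean 3D position, usable as a crude grasp target for the object."""
    if not points:
        raise ValueError("mask has no pixels with valid depth")
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


# Toy 3x3 image: the object occupies the middle column; one depth reading is missing.
mask = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]
depth = [
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
]
points = mask_to_points(mask, depth, fx=100.0, fy=100.0, cx=1.0, cy=1.0)
target = centroid(points)
```

Because only pixels inside the instance mask are back-projected, depth readings from neighboring objects do not contaminate the target point, which is the practical reason per-object masks matter here.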

Benchmarking Instance Segmentation for Robotic Generalization

Robotic manipulation models are evaluated on instance segmentation accuracy as a proxy for perceptual grounding. The COCO benchmark reports mask AP (average precision) averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05, alongside AP50 (IoU threshold 0.50), which measures coarse localization, and AP75 (IoU threshold 0.75), which measures boundary precision. State-of-the-art models achieve 50+ mask AP on COCO, but robotic datasets present harder distributions.
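
To make the role of the IoU threshold concrete, here is a simplified single-class AP sketch. It assumes each prediction has already been paired with its best-overlapping ground-truth instance, assumes at least one ground-truth instance exists, and integrates the precision-recall curve directly; the official COCO evaluation uses 101-point interpolation and additional rules (crowd regions, per-category averaging), so numbers from this sketch will not match it exactly.

```python
def average_precision(preds, num_gt, iou_thr):
    """preds: list of (score, gt_index or None, iou_with_that_gt); num_gt must be > 0."""
    claimed = set()
    tp_flags = []
    # Visit predictions from most to least confident; each GT can be matched once.
    for score, gt_idx, iou in sorted(preds, key=lambda p: -p[0]):
        is_tp = gt_idx is not None and iou >= iou_thr and gt_idx not in claimed
        if is_tp:
            claimed.add(gt_idx)
        tp_flags.append(is_tp)

    precisions, recalls = [], []
    tp = 0
    for rank, flag in enumerate(tp_flags, start=1):
        tp += 1 if flag else 0
        precisions.append(tp / rank)
        recalls.append(tp / num_gt)

    # Replace each precision with the maximum precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Three ground-truth objects, four predictions (one matches nothing).
preds = [(0.9, 0, 0.92), (0.8, 1, 0.70), (0.6, None, 0.0), (0.5, 2, 0.55)]
ap50 = average_precision(preds, num_gt=3, iou_thr=0.50)
ap75 = average_precision(preds, num_gt=3, iou_thr=0.75)

# COCO-style summary: mean over thresholds 0.50, 0.55, ..., 0.95.
thresholds = [t / 100 for t in range(50, 100, 5)]
mean_ap = sum(average_precision(preds, 3, t) for t in thresholds) / len(thresholds)
```

In this toy case the same predictions score much higher at the 0.50 threshold than at 0.75, because two of the three matched masks have loose boundaries.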

DROID reports 41.2 mAP on held-out manipulation scenes, reflecting occlusion, motion blur, and novel object categories absent from COCO pretraining. Open X-Embodiment evaluates cross-embodiment transfer by training on instance masks from one robot platform and testing on another—success rates drop 15-30% when mask quality degrades below 80% IoU. The COLOSSEUM benchmark measures generalization across 20 kitchen tasks, finding that policies trained with instance masks achieve 22% higher success rates than bounding-box baselines on tasks requiring precise object manipulation.

For physical AI buyers, these benchmarks clarify annotation requirements. Policies that generalize across embodiments need instance masks at 85%+ IoU; policies that operate in controlled environments can tolerate 75% IoU. The truelabel marketplace filters datasets by mask quality, embodiment diversity, and task coverage, enabling buyers to match annotation fidelity to deployment constraints.

Instance Segmentation in Video and Temporal Consistency

Video instance segmentation extends the task to temporal sequences, assigning consistent instance IDs across frames. This is critical for robotic tracking: a manipulation policy must follow the same object through occlusion, viewpoint changes, and hand-object interactions. The EPIC-KITCHENS-100 dataset includes 454 videos with frame-level instance masks, but only 37 videos have temporally consistent IDs due to annotation cost^[6].
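
One simple way to picture consistent instance IDs is greedy IoU matching between consecutive frames. The sketch below is a hypothetical baseline written for this glossary, not the association mechanism of any model named in this section; the function names and the 0.5 matching threshold are assumptions.

```python
def mask_iou(a, b):
    """IoU of two binary masks given as nested lists of 0/1."""
    inter = sum(1 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb) if pa and pb)
    union = sum(1 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb) if pa or pb)
    return inter / union if union else 1.0


def propagate_ids(prev_tracks, curr_masks, next_id, min_iou=0.5):
    """Carry IDs from the previous frame to the current masks by greedy IoU matching.

    prev_tracks: dict mapping instance ID -> mask in the previous frame.
    Returns (dict mapping instance ID -> current mask, next unused ID).
    """
    candidates = sorted(
        ((mask_iou(pm, cm), tid, j)
         for tid, pm in prev_tracks.items()
         for j, cm in enumerate(curr_masks)),
        key=lambda c: c[0],
        reverse=True,
    )
    assigned, used_tracks, used_masks = {}, set(), set()
    for iou, tid, j in candidates:
        if iou < min_iou:
            break  # remaining pairs overlap too little to be the same object
        if tid in used_tracks or j in used_masks:
            continue
        assigned[tid] = curr_masks[j]
        used_tracks.add(tid)
        used_masks.add(j)
    # Unmatched masks start new tracks.
    for j, cm in enumerate(curr_masks):
        if j not in used_masks:
            assigned[next_id] = cm
            next_id += 1
    return assigned, next_id


# Frame 0 has two tracked objects; frame 1 lists them in a different order plus a new one.
frame0 = {0: [[1, 1, 0, 0, 0, 0, 0, 0]], 1: [[0, 0, 0, 0, 1, 1, 1, 1]]}
frame1_masks = [
    [[0, 0, 0, 0, 0, 1, 1, 1]],  # object 1, slightly shrunk
    [[1, 1, 1, 0, 0, 0, 0, 0]],  # object 0, slightly grown
    [[0, 0, 0, 1, 0, 0, 0, 0]],  # newly visible object
]
tracks, next_id = propagate_ids(frame0, frame1_masks, next_id=2)
```

A baseline like this loses identity when an object is fully occluded for a frame or moves far between frames, which is part of why temporally consistent annotation is more expensive than per-frame masks.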

Methods like Mask2Former-VIS and IDOL predict instance masks and association vectors jointly, maintaining identity across frames without per-frame re-identification. The Ego4D dataset provides 3,670 hours of egocentric video with sparse instance annotations, enabling self-supervised pretraining on temporal consistency before fine-tuning on robotic tasks. RT-2 leverages video pretraining to ground language instructions in multi-step manipulation sequences, using instance masks to track objects across pick, place, and handoff actions.

For teleoperation datasets, temporal consistency reduces annotation cost. Annotators label keyframes every 30-60 frames; interpolation algorithms propagate masks forward and backward, then human reviewers correct errors. The DROID pipeline achieves 92% temporal IoU using this approach, reducing per-video annotation time from 120 minutes to 40 minutes while maintaining manipulation policy performance within 3% of fully manual annotations.

Domain Adaptation and Sim-to-Real Transfer for Instance Segmentation

Robotic policies trained on synthetic data must transfer instance segmentation models to real-world distributions. Domain randomization varies lighting, textures, and object poses during simulation to improve real-world robustness. The RLBench benchmark generates 100+ manipulation tasks in simulation with ground-truth instance masks, enabling large-scale pretraining before real-world fine-tuning.

Sim-to-real transfer studies show that instance segmentation models trained purely on synthetic data achieve 60-70% mAP on real robotic scenes—a 20-30 point gap from real-data baselines. Closing this gap requires one of three strategies: fine-tuning on 500-2,000 real annotated images, using foundation models like SAM pretrained on 11 million diverse images, or employing self-supervised adaptation methods that align synthetic and real feature distributions without labels.

The RoboNet dataset illustrates how diverse real-world pretraining complements these strategies: 15 million frames from 7 robot platforms, with instance masks on 50,000 keyframes. Policies pretrained on RoboNet's diverse real-world distribution achieve 85% mAP on novel manipulation scenes after 200 fine-tuning examples, 5x fewer than models trained from scratch. For physical AI buyers, this implies that datasets with broad embodiment and scene diversity reduce downstream annotation costs more than narrow, high-quality datasets.

Panoptic Segmentation and Unified Scene Understanding

Panoptic segmentation unifies instance segmentation (for countable 'things' like objects) and semantic segmentation (for uncountable 'stuff' like floors, walls, sky) into a single task. Each pixel receives a class label and an instance ID if it belongs to a thing category. This matters for mobile manipulation: a robot navigating a kitchen must segment individual objects to grasp while also understanding traversable floor regions and obstacle boundaries.
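
A small sketch of that per-pixel representation, assuming a semantic map of class names and a separate instance-ID map as inputs. The label set in THING_CLASSES and both helper names are hypothetical and chosen only for this example.

```python
THING_CLASSES = {"mug", "plate"}  # countable objects (hypothetical label set)


def to_panoptic(semantic, instance_ids):
    """Combine per-pixel class names and instance IDs into (class, id) pairs.

    Stuff pixels (classes outside THING_CLASSES) get an instance ID of None.
    """
    panoptic = []
    for sem_row, inst_row in zip(semantic, instance_ids):
        row = []
        for cls, inst in zip(sem_row, inst_row):
            row.append((cls, inst if cls in THING_CLASSES else None))
        panoptic.append(row)
    return panoptic


def list_segments(panoptic):
    """Unique segments in the image, sorted by class name then instance ID."""
    unique = {px for row in panoptic for px in row}
    return sorted(unique, key=lambda s: (s[0], s[1] or 0))


semantic = [
    ["floor", "mug", "mug", "floor"],
    ["floor", "mug", "floor", "mug"],
    ["floor", "floor", "floor", "mug"],
]
instance_ids = [
    [0, 1, 1, 0],
    [0, 1, 0, 2],
    [0, 0, 0, 2],
]
segments = list_segments(to_panoptic(semantic, instance_ids))
print(segments)  # [('floor', None), ('mug', 1), ('mug', 2)]
```

The two mugs stay distinct segments while the floor remains a single region, which is the property a grasping pipeline depends on.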

Mask2Former and OneFormer solve panoptic segmentation with a unified mask-classification architecture, achieving 57+ PQ (panoptic quality) on COCO. The DROID dataset includes panoptic annotations for 1,200 scenes, enabling policies to reason about object-background relationships during manipulation. NVIDIA Cosmos extends this to 3D, predicting panoptic occupancy grids that distinguish object instances, free space, and static scene geometry.
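
For readers unfamiliar with the metric, panoptic quality is commonly defined as follows, where TP is the set of matched (predicted, ground-truth) segment pairs, FP the unmatched predicted segments, and FN the unmatched ground-truth segments. A pair counts as matched when its IoU exceeds 0.5.

```latex
\mathrm{PQ} \;=\; \frac{\sum_{(p,\,g)\,\in\,\mathit{TP}} \mathrm{IoU}(p, g)}
                      {|\mathit{TP}| + \tfrac{1}{2}\,|\mathit{FP}| + \tfrac{1}{2}\,|\mathit{FN}|}
```

The numerator rewards mask accuracy on matched segments, while the denominator penalizes both missed and spurious segments, so a score such as 57 PQ reflects segmentation and recognition quality together.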

For physical AI training pipelines, panoptic segmentation reduces annotation overhead by consolidating two tasks. Annotators label all pixels in a single pass rather than running separate instance and semantic workflows. The Segments.ai platform supports panoptic annotation with polygon tools for things and brush tools for stuff, achieving 90% label quality at 70% of the time cost of separate workflows. Buyers should verify that panoptic datasets distinguish instance IDs for manipulable objects—some datasets merge all instances of a class into a single semantic label, losing the per-object identity required for grasping.

Foundation Models and Interactive Annotation for Instance Segmentation

The Segment Anything Model trained on 11 million images and 1.1 billion masks enables zero-shot instance segmentation via interactive prompts: point clicks, bounding boxes, or text descriptions^[4]. For robotic dataset creation, this reduces annotation time by 5-10x. An annotator clicks a point on an object; SAM generates a candidate mask in 50 milliseconds; the annotator accepts or refines with polygon edits.

Encord and Roboflow integrate SAM into their annotation platforms, reporting 85-92% IoU on first-pass masks for common object categories. For novel robotic objects—custom grippers, specialized tools, transparent containers—SAM's zero-shot performance drops to 70-75% IoU, requiring more manual refinement. The DROID team used SAM to annotate 76,000 trajectories, achieving 88% IoU at $0.05 per frame versus $0.50 for manual polygon tracing.

Foundation models also enable semi-automated dataset expansion. A robot collects unlabeled teleoperation data; SAM generates candidate masks; a human reviews 10% of frames for quality control; accepted masks train a task-specific segmentation model. The Open X-Embodiment project uses this workflow to scale from 150,000 human-annotated trajectories to 500,000+ trajectories with instance masks, reducing per-trajectory annotation cost from $12 to $2 while maintaining manipulation policy performance within 5% of fully supervised baselines.

Instance Segmentation Data Requirements for Manipulation Policies

Manipulation policies require different instance segmentation fidelities depending on task complexity. Pick-and-place tasks in structured environments tolerate 75-80% IoU masks; fine manipulation tasks like cable routing or dishwasher loading require 90%+ IoU to capture thin structures and contact surfaces. The RT-1 model trains on 130,000 demonstrations with 85% IoU masks, achieving 97% success on pick-and-place but only 68% on tasks requiring precise edge alignment.

OpenVLA analyzes mask quality versus policy performance across 970,000 trajectories from the Open X-Embodiment dataset. Policies trained with 90%+ IoU masks achieve 18% higher success rates on long-horizon tasks than 75% IoU baselines, but annotation cost increases 3x. The optimal tradeoff depends on deployment constraints: warehouse robots operating on known object sets can use 75% IoU masks and compensate with more demonstrations; research platforms exploring novel objects need 90%+ IoU to generalize.

For physical AI buyers, the truelabel marketplace filters datasets by mask IoU, object category coverage, and task diversity. Buyers specify minimum IoU thresholds, embodiment requirements, and scene complexity; the platform returns datasets ranked by annotation quality, licensing terms, and data provenance. This enables procurement teams to match dataset fidelity to model requirements without over-purchasing annotation quality.

Licensing and Provenance for Instance Segmentation Datasets

Instance segmentation datasets carry complex licensing because masks are derivative works of underlying images. The COCO dataset images use Flickr licenses (CC-BY, CC-BY-SA, CC-BY-NC), but the annotations are CC-BY 4.0, creating a split-license scenario where commercial use depends on image-level terms. The EPIC-KITCHENS-100 dataset restricts annotations to non-commercial research use, prohibiting deployment in production robotic systems.

RoboNet releases 15 million frames under a permissive dataset-specific license allowing commercial use, but does not document consent for human subjects visible in teleoperation videos—a GDPR compliance gap for EU deployments. The DROID dataset provides per-trajectory provenance metadata including collector consent, scene location, and annotation pipeline version, enabling buyers to audit compliance with GDPR Article 7 consent requirements.

For physical AI procurement, data provenance extends beyond licensing to annotation quality. Buyers need to verify: (1) mask IoU on held-out validation sets, (2) annotator training protocols, (3) inter-annotator agreement scores, (4) foundation model usage and version. The truelabel marketplace requires sellers to document these metadata fields, enabling buyers to compare datasets on quality-adjusted cost rather than raw frame counts. A 10,000-frame dataset with 92% IoU and full provenance often outperforms a 50,000-frame dataset with 78% IoU and unknown annotation pipelines.
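
The four verification items above can be turned into a simple completeness check on a dataset listing. The field names below are hypothetical stand-ins chosen for this sketch, not the schema of any marketplace or dataset card.

```python
# Hypothetical field names mirroring the four items buyers are advised to verify.
REQUIRED_FIELDS = (
    "validation_mask_iou",          # (1) mask IoU on a held-out validation set
    "annotator_training_protocol",  # (2) how annotators were trained
    "inter_annotator_agreement",    # (3) agreement score between annotators
    "foundation_model_version",     # (4) which model assisted labeling, if any
)


def missing_provenance(metadata):
    """Return the required fields that are absent or left empty in a listing."""
    return [f for f in REQUIRED_FIELDS if metadata.get(f) in (None, "")]


# Illustrative listing with one undocumented field.
listing = {
    "validation_mask_iou": 0.92,
    "annotator_training_protocol": "protocol-doc-v2",
    "inter_annotator_agreement": None,
    "foundation_model_version": "example-model-v1",
}
print(missing_provenance(listing))  # ['inter_annotator_agreement']
```

A check like this only confirms that documentation exists; the reported values still need to be validated against held-out frames.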

External references and source context

[1] Scale AI: Expanding Our Data Engine for Physical AI. scale.com. Scale AI defines physical AI as systems that perceive and act in the physical world, requiring pixel-level object understanding.
[2] Kitchen Task Training Data for Robotics. claru.ai. Kitchen task training data for robotics requires instance segmentation of utensils, dishes, and food items.
[3] Datasheets for Datasets. arXiv. COCO dataset paper documents annotation protocols and benchmark metrics for instance segmentation.
[4] Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100. arXiv. SAM trained on 11 million images and 1.1 billion masks, reducing annotation time by 5-10x.
[5] Kitchen Task Training Data for Robotics. claru.ai. Transparent and reflective objects in kitchen scenes require specialized annotation protocols.
[6] Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100. arXiv. EPIC-KITCHENS-100 paper documents annotation cost and temporal consistency challenges.

FAQ

What is the difference between instance segmentation and semantic segmentation?

Semantic segmentation assigns a class label to every pixel but does not distinguish between separate objects of the same class—all cars receive the same label. Instance segmentation assigns a unique identity to each object, enabling counting, tracking, and per-object reasoning. For robotics, this distinction is critical: a warehouse robot must segment individual packages, not just identify 'package' regions.

How much does instance segmentation annotation cost for robotic datasets?

Manual polygon tracing costs $0.30-$0.60 per frame at 85-92% IoU. Foundation models like SAM reduce this to $0.05-$0.10 per frame at 85-88% IoU by generating candidate masks from point clicks. For a 10,000-frame manipulation dataset, manual annotation costs $3,000-$6,000; SAM-assisted annotation costs $500-$1,000. The tradeoff depends on object novelty—SAM performs best on common categories, requiring more manual refinement for custom robotic objects.
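
The dataset-level figures follow directly from the per-frame rates. A short worked version of that arithmetic, with prices held in integer cents to avoid floating-point rounding (the function name is an assumption for this example):

```python
def cost_range_usd(frames, low_cents, high_cents):
    """Total annotation cost range in dollars for a per-frame price range in cents."""
    return frames * low_cents / 100, frames * high_cents / 100


frames = 10_000
manual = cost_range_usd(frames, 30, 60)        # $0.30-$0.60 per frame
sam_assisted = cost_range_usd(frames, 5, 10)   # $0.05-$0.10 per frame

print(manual)        # (3000.0, 6000.0)
print(sam_assisted)  # (500.0, 1000.0)
```

At these rates the SAM-assisted workflow costs one sixth of manual tracing at both ends of the range, before accounting for any extra refinement on novel objects.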

Can I use COCO-pretrained models for robotic instance segmentation?

COCO-pretrained models achieve 35-45 mAP on robotic manipulation scenes versus 50+ mAP on COCO test sets, reflecting domain shift in object categories, viewpoints, and occlusion patterns. Fine-tuning on 500-2,000 robotic images closes this gap to within 5 mAP of models trained from scratch. For zero-shot deployment, foundation models like Segment Anything perform better than COCO models because they train on 11 million diverse images including robotic and egocentric viewpoints.

What IoU threshold should I require for manipulation policy training?

Pick-and-place tasks in structured environments tolerate 75-80% IoU; fine manipulation tasks requiring edge alignment or thin-object grasping need 90%+ IoU. Policies trained with 90% IoU masks achieve 15-20% higher success rates on long-horizon tasks than 75% IoU baselines, but annotation cost increases 2-3x. Specify minimum IoU in dataset procurement contracts and validate on held-out scenes before scaling training.

How do I verify instance segmentation quality in a purchased dataset?

Request validation set annotations with ground-truth masks, then compute mask AP at IoU 0.50, 0.75, and 0.95. Check inter-annotator agreement on 100-200 frames—scores below 85% IoU indicate inconsistent labeling. Verify temporal consistency for video datasets by tracking instance IDs across frames. Review annotation pipeline documentation for foundation model usage, annotator training protocols, and quality control procedures. The truelabel marketplace requires sellers to provide these metrics, enabling apples-to-apples quality comparisons.
