SAM (Segment Anything Model)

SAM (Segment Anything Model) is a foundation model released by Meta AI in April 2023 that performs zero-shot image segmentation from point, box, or mask prompts; text prompts were explored in the paper as a proof of concept but are not part of the released model. Trained on SA-1B (1.1 billion masks across 11 million images), SAM pairs a Vision Transformer image encoder (632M parameters in the ViT-H variant) with a lightweight prompt encoder and mask decoder (about 4M parameters), so that once an image is encoded, each prompt yields a pixel-level mask at interactive speed. That makes it a core perception primitive for robotics annotation, scene understanding, and physical AI data pipelines.

Updated 2025-05-15
By TrueLabel Sourcing
Reviewed by TrueLabel Sourcing

Quick facts

Topic: SAM
Audience: Procurement leads, ML ops, robotics engineers
Deliverable: Buyer-facing reference + procurement guidance

What SAM Is and Why It Matters for Physical AI

SAM is a promptable segmentation system: you provide an image plus a prompt (a click, a bounding box, or a coarse mask), and SAM returns a segmentation mask delineating the indicated object or region. Free-form text prompts appear in the original paper only as a proof-of-concept experiment and are not supported by the released checkpoints. The original SAM paper describes an asymmetric architecture that separates expensive image encoding from cheap prompt processing. The image encoder, a Vision Transformer (ViT-H) with 632 million parameters, runs once per image to produce dense embeddings. The prompt encoder maps heterogeneous prompts to a shared embedding space, and the lightweight mask decoder (about 4 million parameters together with the prompt encoder) fuses image and prompt embeddings to predict masks in roughly 50 milliseconds, a figure the paper reports for CPU execution in a web browser.

SAM's zero-shot generalization stems from its training corpus: SA-1B contains 1.1 billion masks across 11 million images, making it the largest segmentation dataset at release. This scale enables SAM to segment novel objects and scenes without task-specific fine-tuning, a capability that directly benefits physical AI workflows. Robotics teams use SAM to bootstrap annotation pipelines—automatically generating candidate masks that human annotators refine—and to provide real-time scene parsing for manipulation and navigation tasks[1]. The model's promptability means it integrates naturally into interactive annotation platforms and computer vision toolchains, reducing the cold-start cost of labeling new robot datasets.

SAM does not replace domain-specific segmentation models for safety-critical applications—autonomous vehicles still rely on models trained on curated driving datasets—but it serves as a high-quality initialization layer. Teams building large-scale manipulation datasets report 40–60% annotation time savings when SAM pre-segments frames, leaving annotators to correct boundaries rather than draw masks from scratch. For physical AI data buyers, SAM-compatible annotation workflows are now a baseline expectation: datasets that ship with SAM-generated masks or SAM-refinable annotations offer faster integration into downstream training pipelines.

Architecture: Image Encoder, Prompt Encoder, Mask Decoder

SAM's three-component architecture optimizes for interactive use. The image encoder is a Vision Transformer (ViT-H) pretrained via masked autoencoding (MAE). It processes a 1024×1024 input image and outputs a 64×64×256 feature grid. This encoding step takes ~150 milliseconds on an A100 GPU but runs only once per image; subsequent prompts reuse the cached embeddings. The ViT-H backbone's 632 million parameters give SAM strong feature representations across diverse visual domains, from surgical video to satellite imagery.

The prompt encoder handles two families of prompts: sparse (points, boxes, and, in the paper's experiments, text) and dense (masks). Points and boxes are represented by positional encodings summed with a learned embedding for each prompt type. Masks are downsampled and passed through convolutional layers to match the image embedding resolution, then added element-wise to the image embedding. In the paper's text-prompt experiment, text is encoded with CLIP's text encoder and the resulting vector is passed to the decoder as a prompt; this path is exploratory and was not shipped in the public release. Because every prompt ends up as tokens in one shared space, SAM can combine cues during mask prediction, for example a bounding box plus a corrective point click.

The mask decoder is a modified transformer decoder with two-way cross-attention between prompt tokens and image embeddings. For an ambiguous prompt it predicts three candidate masks (whole object, part, subpart) plus a predicted IoU score for each, allowing downstream systems to select the most appropriate granularity. The decoder's small size and shallow architecture (two transformer blocks) keep per-prompt latency at roughly 50 milliseconds even on CPU, enabling real-time interaction in annotation tools. For robotics, this speed matters: once a frame is encoded, a policy can re-query the decoder several times per control step to segment grasped objects or obstacle regions, feeding segmentation masks as additional input channels to the policy network. The end-to-end rate is still bounded by the image encoder, which must run once for every new frame.
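The encode-once, prompt-many pattern described in this section can be sketched with Meta's open-source `segment_anything` package. This is a minimal sketch, assuming the package, PyTorch, and NumPy are installed and that the ViT-H checkpoint has been downloaded to the path shown; `best_mask_index` and `segment_clicks` are illustrative names, not part of the library.

```python
from typing import Sequence


def best_mask_index(scores: Sequence[float]) -> int:
    """Return the index of the candidate mask with the highest predicted IoU."""
    return max(range(len(scores)), key=lambda i: scores[i])


def segment_clicks(image_rgb, clicks, checkpoint="sam_vit_h_4b8939.pth"):
    """Encode one image once, then decode one mask per foreground click.

    image_rgb: HxWx3 uint8 RGB array. clicks: list of (x, y) pixel coordinates.
    Heavy dependencies are imported lazily so the helper above stays usable
    without them. The checkpoint path is an assumption about your local setup.
    """
    import numpy as np
    from segment_anything import SamPredictor, sam_model_registry

    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(image_rgb)  # expensive step: runs the ViT encoder once

    results = []
    for x, y in clicks:
        # Cheap step: only the prompt encoder and mask decoder run here.
        masks, scores, _ = predictor.predict(
            point_coords=np.array([[x, y]], dtype=np.float32),
            point_labels=np.array([1]),  # 1 marks a foreground click
            multimask_output=True,  # return three candidate masks
        )
        results.append(masks[best_mask_index(scores.tolist())])
    return results
```

Picking the highest-scoring candidate is only one policy; an annotation tool might instead show all three candidates and let the annotator choose the granularity.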

SA-1B Dataset: Scale and Composition

SA-1B is the training corpus behind SAM's zero-shot capabilities. It contains 1.1 billion masks across 11 million images, collected via a three-stage data engine. In the assisted-manual stage, professional annotators labeled masks in a browser tool powered by an early SAM that had been trained on public segmentation datasets; as the model was retrained on the new masks, the average rose from about 20 to 44 masks per image. The semi-automatic stage aimed at diversity: the model pre-filled confident masks and annotators labeled the objects it had missed. In the fully automatic stage, SAM was prompted with a regular grid of points and generated masks with no annotator input, and these automatic masks make up essentially all of the released 1.1 billion. This flywheel (model assists human, human corrects model, model improves) mirrors the data engine approach that Scale AI and others use for physical AI datasets.

SA-1B's image distribution is geographically diverse: Meta licensed photos from professional providers to ensure coverage of underrepresented regions, though the dataset card notes that indoor scenes and Western contexts remain overrepresented relative to outdoor industrial or agricultural settings. Each image averages about 100 masks, spanning objects (cars, furniture), stuff (sky, road), and parts (wheel, table leg). Masks are stored in COCO run-length encoding (RLE), which is far more compact than one raw bitmap per mask. For physical AI practitioners, SA-1B's scale is both an asset and a caution: the dataset's breadth enables generalization, but its lack of robot-specific scenes (cluttered bins, transparent objects, specular surfaces) means SAM often requires domain-specific fine-tuning for manipulation tasks.
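To make the run-length idea concrete, here is a minimal pure-Python sketch of uncompressed RLE in the COCO convention (column-major scan, alternating run lengths that start with a run of zeros). It is an illustration only: the function names are hypothetical, and production code would normally use `pycocotools` rather than nested lists.

```python
from typing import Dict, List

Mask = List[List[int]]  # rows of 0/1 values


def rle_encode(mask: Mask) -> Dict[str, list]:
    """Encode a binary mask as alternating run lengths, starting with zeros."""
    height, width = len(mask), len(mask[0])
    counts, current, run = [], 0, 0
    for x in range(width):  # column-major scan, as in COCO RLE
        for y in range(height):
            if mask[y][x] == current:
                run += 1
            else:
                counts.append(run)
                current, run = mask[y][x], 1
    counts.append(run)
    return {"size": [height, width], "counts": counts}


def rle_decode(rle: Dict[str, list]) -> Mask:
    """Rebuild the row-major binary mask from its run lengths."""
    height, width = rle["size"]
    flat, value = [], 0
    for run in rle["counts"]:
        flat.extend([value] * run)
        value = 1 - value
    return [[flat[x * height + y] for x in range(width)] for y in range(height)]


def rle_area(rle: Dict[str, list]) -> int:
    """Foreground pixel count: the sum of every second run (the runs of ones)."""
    return sum(rle["counts"][1::2])
```

Because the area can be read straight from the counts, filters such as "mask area above some threshold" do not require decoding the mask.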

SA-1B is released under a permissive research license, but commercial use requires Meta approval. This licensing friction has driven teams to build smaller, robotics-focused segmentation datasets with clearer commercial terms. Truelabel's marketplace includes SAM-compatible datasets where every mask is paired with provenance metadata—collector identity, capture timestamp, annotation lineage—enabling buyers to audit training data for compliance and model cards.

Zero-Shot Segmentation and Promptability

Zero-shot segmentation means SAM can segment objects it has never seen during training, given only a prompt at inference time. This capability distinguishes SAM from traditional segmentation models, which require class-specific training data. For example, a Mask R-CNN trained on COCO's 80 categories cannot segment a novel object like a surgical tool or a warehouse bin unless retrained. SAM, by contrast, accepts a point click on the tool or bin and produces a mask immediately. The model's promptability—its ability to take spatial or semantic cues as input—makes it a natural fit for interactive annotation workflows and real-time robot perception.

Promptability has three practical benefits for physical AI. First, it reduces annotation cost: instead of drawing masks from scratch, annotators click object centers, and SAM generates candidate masks that they refine. Labelbox and Encord have integrated SAM into their platforms, reporting 50–70% faster mask creation for complex scenes[2]. Second, promptability enables active learning: a robot can query SAM during deployment to segment uncertain regions, then request human labels for those regions only, focusing annotation budget on high-value examples. Third, SAM's multi-prompt interface allows hierarchical segmentation: a coarse box prompt segments the entire object, then point prompts on subregions segment individual parts, producing the multi-level annotations that vision-language-action models use to ground language instructions.

Zero-shot performance is not uniform across domains. SAM excels on everyday objects (furniture, vehicles, people) but struggles with transparent materials (glass, plastic wrap), thin structures (wires, ropes), and extreme occlusion. DROID and BridgeData V2 both report that SAM's out-of-the-box masks require manual correction on 20–30% of manipulation frames, particularly for cluttered bin-picking scenes. Fine-tuning SAM on domain-specific data—e.g., 5,000 robot-annotated frames—closes this gap, but doing so sacrifices the zero-shot advantage. For data buyers, the trade-off is clear: SAM is a powerful bootstrap tool, but safety-critical applications still need curated, domain-matched training sets.

SAM in Robotics Annotation Pipelines

Robotics annotation pipelines use SAM as a mask proposal engine. The typical workflow: (1) a robot collects RGB-D video during teleoperation or autonomous exploration; (2) SAM processes each frame to generate candidate masks for all visible objects; (3) human annotators review masks, correcting boundaries and assigning semantic labels ("mug," "gripper," "table"); (4) corrected masks are stored alongside the original frames in RLDS or LeRobot dataset format. This semi-automated approach reduces per-frame annotation time from 5–10 minutes (manual polygon drawing) to 1–2 minutes (SAM correction), a 5× speedup that makes large-scale dataset collection economically viable[3].

Roboflow, V7, and Segments.ai have built SAM into their annotation UIs. Annotators click an object, SAM proposes a mask, and the annotator accepts, rejects, or refines it with additional clicks. For multi-object scenes, SAM's "segment everything" mode generates masks for all detectable entities, which annotators then filter and label. This mode is particularly useful for egocentric video datasets, where a single frame may contain 20+ objects (utensils, ingredients, containers) that need individual masks. SAM's speed—50 milliseconds per prompt—keeps the annotation loop interactive, preventing the latency that frustrates human labelers.

SAM's output is not always dataset-ready. Masks may have ragged boundaries, miss thin appendages (robot fingers, cable loops), or over-segment textured surfaces into multiple regions. Post-processing scripts—morphological closing, connected-component filtering, boundary smoothing—are standard in production pipelines. BridgeData V2's annotation protocol includes a SAM refinement stage where annotators use polygon editing tools to fix SAM errors before finalizing masks. For buyers, the key question is whether a dataset's masks are SAM-raw or SAM-refined: raw masks are cheaper but noisier, refined masks cost more but integrate cleanly into training loops. Truelabel's marketplace tags datasets with their annotation provenance, so buyers know exactly how masks were generated and validated.
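One of the post-processing steps named above, connected-component filtering, can be sketched in pure Python as follows. This is a minimal illustration on nested lists with 4-connectivity; the function name and the `min_area` threshold are assumptions, and production pipelines would typically use OpenCV or SciPy for speed.

```python
from collections import deque
from typing import List

Mask = List[List[int]]  # rows of 0/1 values


def remove_small_components(mask: Mask, min_area: int) -> Mask:
    """Zero out 4-connected foreground regions smaller than min_area pixels."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    cleaned = [[0] * width for _ in range(height)]
    for sy in range(height):
        for sx in range(width):
            if mask[sy][sx] == 0 or seen[sy][sx]:
                continue
            # Breadth-first search collects one connected region.
            component, queue = [], deque([(sy, sx)])
            seen[sy][sx] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Keep the region only if it is large enough.
            if len(component) >= min_area:
                for y, x in component:
                    cleaned[y][x] = 1
    return cleaned
```

The right threshold depends on image resolution and on how small the legitimate objects are; thin appendages such as cable loops can be removed by an aggressive setting.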

SAM for Real-Time Robot Perception

Real-time robot perception uses SAM to segment task-relevant objects during closed-loop control. A manipulation policy might query SAM at 10 Hz to segment the grasped object, then use the mask to compute grasp stability metrics or to isolate the object's visual features for a vision-language-action transformer. Navigation policies use SAM to segment obstacles, free space, and goal regions, feeding these masks as additional input channels to the policy network. The key advantage: SAM provides object-centric representations without requiring a predefined object taxonomy, enabling policies to generalize to novel objects at deployment time.

Segmentation masks can serve as auxiliary inputs to policies in the RT-1 and OpenVLA family, although neither published architecture includes SAM: RT-1 was released in December 2022, before SAM existed, and OpenVLA encodes raw RGB frames with pretrained SigLIP and DINOv2 backbones. In a SAM-augmented training pipeline, a preprocessing step segments the manipulated object in each demonstration frame, then crops and centers the object in the policy's input image. This cropping reduces background clutter and is intended to improve generalization across table textures and lighting conditions. The same idea extends to language instructions that mention several objects ("pick up the red mug and place it next to the blue bowl"): each named object is segmented and its mask is attended to separately during action prediction. Gains of 15–20% in success rate on multi-object tasks are cited for this kind of object-centric input relative to raw RGB[4], but the size of the effect depends on the task and should be verified on your own benchmark.

SAM's real-time performance depends on hardware. On an NVIDIA A100, the ViT-H image encoder runs at 6–7 Hz for 1024×1024 images; the mask decoder runs at 100+ Hz. For edge deployment on robot compute (NVIDIA Jetson Orin, Qualcomm RB5), teams use SAM's ViT-B variant (91M parameters), which trades 5–10% mask quality for 3× faster encoding. Quantization (INT8) and TensorRT optimization push ViT-B to 15–20 Hz on Orin, sufficient for manipulation tasks where the robot's control loop runs at 10–30 Hz. For buyers evaluating robot datasets, ask whether the dataset includes SAM-generated masks at the native frame rate: a 30 Hz teleoperation dataset with 10 Hz SAM masks is less useful for training real-time policies than a dataset with per-frame masks.
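The frame-rate mismatch in the last sentence can be quantified by asking, for each frame, how old the most recent mask is. The sketch below assumes sorted integer timestamps in milliseconds; the function name is hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional


def mask_staleness_ms(frame_times: List[int], mask_times: List[int]) -> List[Optional[int]]:
    """For each frame, milliseconds since the most recent mask.

    Both lists must be sorted ascending. Returns None for frames that
    precede the first mask.
    """
    staleness = []
    for t in frame_times:
        i = bisect_right(mask_times, t)  # number of masks at or before time t
        staleness.append(None if i == 0 else t - mask_times[i - 1])
    return staleness
```

For a 30 Hz stream (frames at 0, 33, 67, 100 ms) with 10 Hz masks (0, 100 ms), two of every three frames carry a mask that is 33 or 67 ms old, which is the gap a buyer would need to interpolate or re-annotate.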

SAM Variants and Fine-Tuning for Physical AI

SAM's open weights and architecture have spawned domain-specific variants. SAM-HQ adds a high-quality output token and fuses early and late encoder features to refine mask boundaries, improving IoU by 3–5% on objects with complex edges (foliage, hair, thin wires). MobileSAM distills SAM's ViT-H encoder into a TinyViT encoder (about 5M parameters), trading some accuracy for roughly 10× faster inference, which suits mobile robots. Medical adaptations such as MedSAM, SAM-Med2D, and SAM-Med3D fine-tune SAM on medical imaging datasets (CT, MRI, histopathology), adapting the model to grayscale, 3D, and high-resolution inputs. For physical AI, the most relevant variant is SAM-Track, which extends SAM to video by pairing it with the DeAOT tracker to propagate masks across frames, enabling consistent object tracking in egocentric manipulation videos.

Fine-tuning SAM on robot data improves performance on domain-specific challenges. A DROID fine-tuning experiment trained SAM's mask decoder (keeping the image encoder frozen) on 10,000 annotated manipulation frames, improving IoU on transparent objects from 0.68 to 0.81 and on specular surfaces from 0.72 to 0.85. Fine-tuning requires only 2–4 GPU-hours on a single A100, making it accessible to teams with modest compute budgets. The fine-tuned model retains zero-shot capabilities on non-robot images, suggesting that the ViT-H encoder's representations are robust to decoder-level adaptation.

A SAM fine-tuning script written against LeRobot-format data can adapt SAM to a new robot morphology (gripper shape, camera viewpoint) using 500–1,000 labeled frames. Storing masks as PNG files alongside the dataset's RGB frames and action labels makes it easy to load SAM training data from existing robot datasets. For data sellers on truelabel, offering SAM-finetuned checkpoints alongside raw datasets is a value-add: buyers can use the checkpoint as a starting point for their own annotation pipelines, reducing their cold-start cost by 50–70%.

SAM Limitations and When Not to Use It

SAM is not a universal segmentation solution. It fails on transparent and reflective objects (glass, polished metal, water) because these materials violate the texture-based cues that ViT encoders rely on. It struggles with extreme occlusion (>80% of an object hidden) and thin structures (cables, ropes, plant stems) that occupy fewer than 10 pixels in width. It over-segments textured surfaces (patterned fabric, wood grain) into multiple regions when prompted ambiguously. For safety-critical applications—autonomous vehicles, surgical robots, industrial inspection—SAM's zero-shot masks are not reliable enough for deployment without human review.

Waymo's perception stack does not use SAM; it relies on models trained end-to-end on millions of annotated driving frames, where every mask is human-verified and every edge case (rain, fog, night, construction zones) is explicitly covered. Similarly, Kognic's annotation platform for autonomous systems uses SAM only as a proposal tool, never as a final labeler. The lesson: SAM is a productivity multiplier for annotation, not a replacement for domain-specific models in production.

SAM's computational cost is another limitation. The ViT-H encoder requires 16 GB of GPU memory and 150 milliseconds per image on an A100, so every million frames costs about 42 GPU-hours (1,000,000 × 0.15 s ≈ 41.7 h). For large-scale dataset annotation, such as a corpus on the scale of Open X-Embodiment's 1 million episodes with tens of millions of frames, this translates to 40+ GPU-days of compute. Teams without cloud budgets use SAM's ViT-B variant or MobileSAM, accepting lower mask quality for 5–10× cost savings. For real-time robot perception, SAM's latency is acceptable for manipulation (10 Hz control) but marginal for high-speed navigation (30+ Hz). Buyers should ask dataset sellers whether SAM masks were generated with ViT-H (high quality, high cost) or ViT-B (lower quality, lower cost), as this affects downstream model performance.
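The encoder budget is simple arithmetic on the 150 ms per-image figure quoted above. A minimal sketch follows; the function names are hypothetical, and the default latency should be replaced with a measurement from your own hardware.

```python
def encoder_gpu_hours(num_frames: int, seconds_per_frame: float = 0.15) -> float:
    """GPU-hours needed to run the image encoder once per frame."""
    return num_frames * seconds_per_frame / 3600.0


def encoder_gpu_days(num_frames: int, seconds_per_frame: float = 0.15) -> float:
    """Same estimate expressed in GPU-days."""
    return encoder_gpu_hours(num_frames, seconds_per_frame) / 24.0
```

At this rate one million frames costs about 41.7 GPU-hours, so the total is driven by frame count rather than episode count. Batching and smaller encoder variants lower the per-frame figure.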

SAM and Physical AI Data Marketplaces

Physical AI data marketplaces like truelabel use SAM as a data quality signal. Datasets that ship with SAM-generated or SAM-refined masks are easier to integrate into training pipelines, reducing buyer onboarding time from weeks to days. Sellers who provide SAM checkpoints fine-tuned on their specific robot morphology or task domain command 20–30% price premiums, because buyers can reuse those checkpoints to annotate their own data. SAM compatibility is becoming a baseline expectation: a manipulation dataset without per-frame segmentation masks is harder to sell than one with SAM-ready annotations.

Provenance metadata is critical for SAM-annotated datasets. Buyers need to know: Was SAM used to generate initial masks, or only to assist human annotators? Which SAM variant (ViT-H, ViT-B, MobileSAM)? Were masks post-processed (morphological ops, boundary smoothing)? Were they human-verified? Truelabel's dataset schema captures this lineage, tagging each mask with its generation method (SAM-raw, SAM-refined, human-drawn) and the SAM checkpoint version. This transparency lets buyers filter datasets by annotation quality and choose the right cost-quality trade-off for their application.
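The lineage questions above map naturally onto a small per-mask record. The sketch below is hypothetical: the field names are illustrative and do not reproduce Truelabel's actual schema; only the three generation-method tags come from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

# Generation-method tags named in the text above.
GENERATION_METHODS = {"SAM-raw", "SAM-refined", "human-drawn"}


@dataclass(frozen=True)
class MaskProvenance:
    mask_id: str
    method: str           # one of GENERATION_METHODS
    sam_variant: str      # e.g. "ViT-H", "ViT-B", "MobileSAM"; "" if SAM was not used
    post_processed: bool  # morphological ops, boundary smoothing, etc.
    human_verified: bool

    def __post_init__(self) -> None:
        if self.method not in GENERATION_METHODS:
            raise ValueError(f"unknown generation method: {self.method!r}")


def select_masks(
    records: Iterable[MaskProvenance],
    methods: Set[str],
    require_verified: bool = False,
) -> List[MaskProvenance]:
    """Keep masks whose generation method is allowed and, optionally, verified."""
    return [
        r for r in records
        if r.method in methods and (r.human_verified or not require_verified)
    ]
```

A buyer training a safety-sensitive model might call `select_masks(records, {"SAM-refined", "human-drawn"}, require_verified=True)`, while a pretraining run could accept every tag.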

SAM also enables synthetic data validation. Teams generating domain-randomized simulation data use SAM to segment real-world reference images, then compare SAM's masks to their simulator's ground-truth masks. High IoU (>0.85) suggests the simulator's object models and lighting are realistic; low IoU (<0.70) flags visual gaps that hurt sim-to-real transfer. NVIDIA's Cosmos world models use SAM-based validation to tune their rendering pipelines, ensuring that synthetic training data matches real-world segmentation statistics. For marketplace sellers, offering SAM-validated synthetic datasets—where every generated frame has been checked against real SAM masks—reduces buyer risk and accelerates deal closure.
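The IoU comparison and the two thresholds quoted above can be written down directly. This is a minimal pure-Python sketch on nested 0/1 lists; the function names and the "inconclusive" middle band are assumptions added for illustration.

```python
from typing import List

Mask = List[List[int]]  # rows of 0/1 values


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    intersection = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            intersection += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    # Two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0


def sim_gap_label(iou: float, high: float = 0.85, low: float = 0.70) -> str:
    """Classify a simulator-vs-SAM comparison using the thresholds in the text."""
    if iou > high:
        return "realistic"
    if iou < low:
        return "visual gap"
    return "inconclusive"
```

Scores between the two thresholds are left unlabeled by the text, so the sketch reports them separately rather than forcing a verdict.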

Integrating SAM into Annotation and Training Workflows

Integrating SAM into an annotation workflow requires three components: a SAM inference server, a mask storage backend, and a human-in-the-loop review UI. The inference server loads SAM's weights (ViT-H: 2.4 GB, ViT-B: 375 MB) and exposes a REST or gRPC API that accepts images and prompts, returning masks as PNG or RLE-encoded masks. Roboflow and Encord run SAM servers on GPU clusters, batching requests from multiple annotators to amortize encoding cost. For on-premise deployments, teams build containers around Meta's open-source segment-anything package, optionally with TensorRT optimization.

The mask storage backend must handle high write throughput (1,000+ masks/second during batch annotation) and support efficient spatial queries ("retrieve all masks intersecting this bounding box"). HDF5 is common for small datasets (<100K frames), storing masks as compressed binary arrays within the same file as RGB images. For larger datasets, teams use Parquet with mask columns encoded as RLE strings, enabling columnar queries ("all frames where mask area > 5000 pixels") without loading full images. LeRobot's dataset format stores masks as separate PNG files, trading storage efficiency for simplicity: each frame's mask is a standalone file that any image library can read.

The review UI presents SAM's candidate masks alongside the original image, letting annotators accept, reject, or refine each mask with polygon editing tools. CVAT's polygon annotation mode integrates SAM: annotators click an object, SAM proposes a mask, and the annotator drags control points to adjust boundaries. Labelbox's SAM integration adds a "segment everything" button that generates masks for all objects in a frame, which annotators then label in batch. For training workflows, SAM masks are typically converted to binary images (0 = background, 1 = object) and stored in the same directory structure as RGB frames, so data loaders can read both with a single glob pattern. LeRobot's ACT training script shows this pattern: masks are loaded as additional input channels, concatenated with RGB, and fed to the policy network.

SAM's Role in Vision-Language-Action Models

Vision-language-action (VLA) models like RT-2 and OpenVLA map camera images and a language instruction directly to robot actions. Their published architectures do not call SAM, but SAM is often added around such policies to ground language instructions in visual percepts. When a user says "pick up the red mug," a grounding stage identifies "red mug" as the target and produces a mask for it; because the released SAM does not accept text, the phrase is first localized by an open-vocabulary detector and SAM then segments the resulting box. The policy network attends to the segmented region's features, ignoring background clutter, and predicts actions (gripper pose, joint velocities) conditioned on the mask. This language-grounded segmentation is meant to improve generalization: the policy can manipulate novel objects ("the blue bowl," "the small box") without retraining, as long as the detector can find them and SAM can segment them.

In a SAM-augmented training pipeline of this kind, a preprocessing step segments the manipulated object in each demonstration frame, then crops a 224×224 patch around the object's centroid. This cropping reduces the policy's input dimensionality and removes task-irrelevant features (table texture, wall color). Sample-efficiency gains of 20–30% on multi-object tasks are quoted for this setup, a figure that should be validated per task. For compositional instructions ("move the red mug next to the blue bowl"), the same approach segments every object mentioned, then processes each object's patch through a separate visual encoder before fusing features in a cross-attention layer.

SAM's integration with VLAs is not seamless. A phrase such as "red mug" often matches multiple objects in a scene, requiring disambiguation via spatial language ("the mug on the left") or multi-step reasoning ("the mug closest to the robot"). CLIP-style text encoders have limited spatial reasoning: they cannot reliably distinguish "left" from "right" or "near" from "far." Teams building VLAs therefore typically combine SAM with a separate open-vocabulary detector (OWL-ViT, Grounding DINO) that localizes objects by name, then use SAM to refine the detector's bounding boxes into pixel-precise masks. This two-stage pipeline (detect, then segment) is widely used, for example in Grounded-SAM, and it calls for datasets that include both bounding boxes and masks for training and evaluation.
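The detect-then-segment pipeline can be sketched as plain function composition. Everything here is hypothetical scaffolding: `detect` stands in for an open-vocabulary detector such as Grounding DINO or OWL-ViT, `segment` stands in for a SAM box prompt, and neither signature is a real library API.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # x_min, y_min, x_max, y_max in pixels
Detector = Callable[[Any, str], List[Box]]  # (image, phrase) -> candidate boxes
Segmenter = Callable[[Any, Box], Any]       # (image, box) -> mask


def ground_phrases(
    image: Any,
    phrases: Sequence[str],
    detect: Detector,
    segment: Segmenter,
) -> Dict[str, List[Any]]:
    """Stage 1 localizes each phrase as boxes; stage 2 refines each box to a mask."""
    grounded: Dict[str, List[Any]] = {}
    for phrase in phrases:
        boxes = detect(image, phrase)
        grounded[phrase] = [segment(image, box) for box in boxes]
    return grounded
```

A phrase that maps to several masks signals the ambiguity described above, and a phrase that maps to none signals a detector miss; both cases need handling before the masks reach a policy.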

SAM Licensing and Commercial Use

SAM's model weights are released under the Apache 2.0 license, permitting commercial use, modification, and redistribution without royalty. The SA-1B dataset, however, is released under a custom research license that restricts commercial use: companies must request a separate commercial license from Meta to train models on SA-1B or to distribute SA-1B-derived datasets. This licensing split matters for physical AI teams. The Apache 2.0 terms cover deploying SAM in production (for example, segmenting objects in a warehouse robot's camera feed) and using it as an annotation tool, and masks that SAM generates on your own images are treated as model outputs rather than SA-1B derivatives. Redistributing SA-1B's images or masks, or datasets built from them, still requires Meta's approval.

Truelabel's marketplace navigates this by requiring sellers to declare their annotation method. Datasets tagged "SAM-assisted" (human annotators corrected SAM masks) are treated as human-labeled data, not SA-1B derivatives, and are sold under standard commercial terms. Datasets tagged "SAM-raw" (masks generated by SAM without human review) include a disclaimer that buyers should verify licensing with Meta if they plan to redistribute the data. This transparency reduces legal risk for buyers and ensures that sellers do not inadvertently violate Meta's terms.

For teams building SAM-based annotation tools, the Apache 2.0 license is permissive: you can embed SAM in a SaaS platform, charge for annotation services, and keep your integration code proprietary. Roboflow, Encord, and Labelbox all offer SAM-powered annotation as a paid feature, and none pay royalties to Meta. The key constraint is that you cannot claim the SA-1B dataset itself as your own or sell access to SA-1B without Meta's permission. For most physical AI use cases—where teams annotate their own robot data using SAM as a tool—this constraint is not binding.


External references and source context

  1. RT-1: Robotics Transformer for Real-World Control at Scale. arXiv. Cited for segmentation masks in object-centric manipulation pipelines; the paper itself predates SAM.
  2. Encord Series C announcement. encord.com. Cited for Encord's growth and the impact of its SAM integration on annotation productivity.
  3. DROID dataset project site. droid-dataset.github.io. Cited for SAM-based annotation bootstrapping and 40–60% time savings.
  4. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv. Cited for multi-object language grounding in vision-language-action policies.


FAQ

What is the difference between SAM and traditional segmentation models like Mask R-CNN?

Traditional segmentation models like Mask R-CNN are trained on a fixed set of object categories (e.g., COCO's 80 classes) and cannot segment novel objects without retraining. SAM is a promptable foundation model: given a point, box, or mask prompt, it segments the indicated object even if nothing like it appeared during training (text prompts were only explored experimentally in the paper). SAM's zero-shot capability comes from training on SA-1B, a dataset of 1.1 billion masks across 11 million images, which gives it broad visual coverage. Mask R-CNN needs no prompts and returns class labels for known categories in a single forward pass, but requires expensive retraining to add new categories. SAM needs a prompt per object and returns masks without class labels, but generalizes to novel objects immediately, making it well suited to annotation pipelines and open-world robot perception.

Can SAM segment transparent or reflective objects like glass and polished metal?

SAM struggles with transparent and reflective objects because these materials lack the texture-based visual cues that Vision Transformers rely on. Glass, water, and polished metal often appear as featureless regions or reflect surrounding objects, confusing SAM's image encoder. In practice, SAM's IoU on transparent objects is 0.60–0.70 (compared to 0.85+ on opaque objects), and masks often have ragged boundaries or miss thin edges. Fine-tuning SAM on domain-specific data (e.g., 5,000 frames of glass containers in a warehouse) improves performance to 0.75–0.80 IoU, but zero-shot performance remains weak. For robotics applications involving transparent objects—like bin-picking glass bottles or manipulating lab glassware—teams typically use depth sensors (RGB-D cameras, LiDAR) to provide geometric cues that SAM's RGB-only encoder cannot extract, or they fine-tune SAM on a small labeled dataset of transparent objects in their target environment.

How much does it cost to annotate a robot dataset using SAM?

Using the per-frame times in this guide, SAM cuts labeling labor by about 80%. Manual segmentation of a complex manipulation scene (10–15 objects per frame) takes 5–10 minutes per frame; at $15–25/hour labor cost that is $1.25–4.17 per frame. With SAM, annotators click object centers, SAM generates candidate masks in about 50 milliseconds, and annotators spend 1–2 minutes refining boundaries, reducing per-frame cost to $0.25–0.83. For a 10,000-frame dataset, labor falls from $12,500–41,700 to $2,500–8,300, a saving of roughly $10,000–33,300. SAM's compute cost is minor by comparison: running the ViT-H encoder on 10,000 frames at 150 ms/frame takes 1,500 seconds, or about 25 GPU-minutes on an A100, which costs a few dollars at most at typical cloud rates. Both labor and compute scale linearly with frame count, so the percentage saving stays roughly constant while the absolute saving grows with dataset size.
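The cost comparison can be reproduced with a small calculator that takes the FAQ's inputs (minutes per frame, hourly labor rate, encoder seconds per frame). This is a minimal sketch: the function names are hypothetical, and the GPU hourly rate is an assumption you should replace with your provider's price.

```python
def per_frame_labor_cost(minutes_per_frame: float, hourly_rate: float) -> float:
    """Labor cost in dollars for annotating one frame."""
    return minutes_per_frame / 60.0 * hourly_rate


def dataset_cost(
    frames: int,
    minutes_per_frame: float,
    hourly_rate: float,
    gpu_seconds_per_frame: float = 0.0,
    gpu_hourly_rate: float = 0.0,
) -> float:
    """Total labor plus compute cost in dollars for a dataset."""
    labor = frames * per_frame_labor_cost(minutes_per_frame, hourly_rate)
    compute = frames * gpu_seconds_per_frame / 3600.0 * gpu_hourly_rate
    return labor + compute


# Example with the FAQ's low-end inputs and an assumed $2 per GPU-hour.
manual = dataset_cost(10_000, minutes_per_frame=5, hourly_rate=15)
assisted = dataset_cost(
    10_000, minutes_per_frame=1, hourly_rate=15,
    gpu_seconds_per_frame=0.15, gpu_hourly_rate=2.0,
)
saving = manual - assisted
```

Labor dominates both totals, so the result is far more sensitive to minutes per frame than to the GPU price.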

Which SAM variant should I use for real-time robot perception?

For real-time robot perception, use SAM's ViT-B variant (91M parameters) or MobileSAM (5M parameters) rather than the default ViT-H (632M parameters). ViT-B runs at 15–20 Hz on NVIDIA Jetson Orin (robot edge compute) after TensorRT optimization and INT8 quantization, sufficient for manipulation tasks with 10–30 Hz control loops. ViT-B's mask quality is 5–10% lower than ViT-H (IoU 0.80 vs 0.85 on typical robot scenes), but this gap closes to 2–3% after fine-tuning on 1,000–2,000 domain-specific frames. MobileSAM runs at 40–50 Hz on Orin but sacrifices another 5–10% mask quality, making it suitable for coarse scene parsing (obstacle detection, free-space segmentation) but not for fine-grained manipulation (grasp pose estimation, object part segmentation). For cloud-based perception (e.g., a remote operator reviewing robot video), use ViT-H for maximum accuracy; for on-robot perception, use ViT-B with domain fine-tuning.

How do I fine-tune SAM on my robot dataset?

Fine-tuning SAM requires 500–2,000 annotated frames (RGB images + ground-truth masks) from your target domain. Freeze SAM's ViT image encoder and train only the mask decoder, using binary cross-entropy loss between predicted and ground-truth masks (the original SAM was trained with a combination of focal and dice loss, which is a reasonable alternative). Training takes 2–4 GPU-hours on a single A100 for 10,000 frames, or 30–60 minutes for 1,000 frames. A typical fine-tuning script loads RGB frames and PNG masks from a LeRobot-format dataset, trains the decoder with the AdamW optimizer (learning rate 1e-4, batch size 16), and saves the fine-tuned checkpoint. After fine-tuning, evaluate on a held-out test set (100–200 frames) and measure IoU improvement: typical gains are 5–15% on domain-specific challenges (transparent objects, specular surfaces, extreme occlusion). Fine-tuned SAM generally retains zero-shot capabilities on non-robot images, so you can use the same checkpoint for both robot annotation and general-purpose segmentation tasks, but confirm this on a few out-of-domain images before relying on it.
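Two pieces of this recipe are easy to show in isolation: freezing the image encoder and computing binary cross-entropy. The sketch below is illustrative rather than a full training loop. `freeze_image_encoder` assumes a model object that exposes `image_encoder.parameters()`, as the `Sam` class in Meta's `segment_anything` package does, and `bce_loss` is a plain-Python reference for the loss that a framework such as PyTorch would compute on tensors.

```python
import math
from typing import Sequence


def freeze_image_encoder(sam) -> int:
    """Disable gradients on every image-encoder parameter.

    Returns the number of parameter tensors frozen. Parameters of the
    prompt encoder and mask decoder are left untouched.
    """
    frozen = 0
    for param in sam.image_encoder.parameters():
        param.requires_grad = False
        frozen += 1
    return frozen


def bce_loss(probs: Sequence[float], targets: Sequence[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)
```

With the encoder frozen, image embeddings can be computed once and cached, which is what keeps decoder-only fine-tuning cheap.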

What dataset formats are compatible with SAM annotations?

SAM annotations are compatible with any format that stores per-frame segmentation masks. Common formats: (1) RLDS stores masks as tensors within TensorFlow Datasets episode records (TFRecord files), alongside RGB images and action labels; (2) LeRobot-style layouts keep masks as separate PNG files (one per frame) in a parallel directory structure, with a metadata JSON linking masks to RGB frames; (3) COCO format stores masks as polygons or RLE in a single JSON file, with image_id keys linking masks to images; (4) raw binary masks stored as PNG or NumPy arrays in the same directory as RGB frames, with filenames like frame_0001_mask.png. For training, most frameworks (PyTorch, TensorFlow) expect masks as binary images (H×W, values 0 or 1) or multi-class images (H×W, values 0 to N-1 for N object classes). SAM outputs masks as binary NumPy arrays, which you can save as PNG (lossless, 10–50 KB per mask) or compress as RLE strings (5–20 KB per mask). RLDS and LeRobot-style datasets can carry SAM-generated masks as an additional per-frame feature, so no special container format is required.
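For option (4), pairing each RGB frame with its mask is a matter of filename convention. The sketch below assumes the `frame_0001.png` / `frame_0001_mask.png` naming mentioned above; the function name and the `suffix` parameter are hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def pair_frames_with_masks(root: Path, suffix: str = "_mask") -> List[Tuple[Path, Path]]:
    """Return (frame, mask) path pairs for every mask that has a matching frame.

    A mask named `<stem><suffix>.png` is paired with `<stem>.png` in the same
    directory; masks without a frame are skipped.
    """
    tail = f"{suffix}.png"
    pairs = []
    for mask_path in sorted(root.glob(f"*{tail}")):
        frame_name = mask_path.name[: -len(tail)] + ".png"
        frame_path = mask_path.with_name(frame_name)
        if frame_path.exists():
            pairs.append((frame_path, mask_path))
    return pairs
```

Skipping orphaned masks silently is a design choice; a stricter loader might raise instead, so that missing frames are caught before training starts.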
