Quality Assurance
How to Measure Inter-Annotator Agreement for Physical AI Data
Inter-annotator agreement (IAA) quantifies consistency between multiple human annotators labeling the same data. For physical AI datasets, measure IAA by designing a 15-25% overlap protocol where annotator pairs independently label identical episodes, then compute metric-specific scores: Cohen's kappa or Fleiss' kappa for categorical labels (object classes, grasp types), intraclass correlation coefficient (ICC) for continuous values (force measurements, trajectory smoothness), and Krippendorff's alpha for temporal or ordinal annotations. Scores above 0.80 indicate strong agreement; 0.60-0.80 moderate; below 0.60 signals taxonomy ambiguity or insufficient training requiring immediate remediation.
Quick facts
- Difficulty: Intermediate
- Audience: Physical AI data engineers
- Last reviewed: 2026-01-15
Why Inter-Annotator Agreement Matters for Physical AI Training Data
Physical AI models learn manipulation policies, navigation behaviors, and world-model priors from human-annotated demonstration data. When annotators disagree on grasp affordances, object boundaries, or action segmentation, the resulting label noise directly degrades policy performance and sim-to-real transfer[1]. Open X-Embodiment aggregated 22 datasets totaling 1 million trajectories but noted that inconsistent annotation schemas across sources required extensive post-hoc harmonization[2]. DROID collected 76,000 trajectories from 564 scenes but reported that 12% of episodes required re-annotation after IAA audits revealed systematic labeler confusion on contact-state transitions.
IAA measurement serves three functions in production pipelines. First, it validates that your annotation taxonomy is learnable by humans — if trained annotators cannot agree, the model cannot learn consistent features. Second, it identifies weak annotators or ambiguous edge cases requiring additional training or taxonomy refinement. Third, it provides a quantitative quality gate: datasets with IAA below 0.70 on critical label types should not enter training pipelines. Scale AI's physical AI data engine reports that clients deploying manipulation policies in warehouse automation require minimum 0.85 kappa on grasp-type labels and 0.90 ICC on end-effector pose annotations.
The cost of skipping IAA measurement compounds downstream. BridgeData V2 initially trained policies on 60,000 trajectories but discovered during deployment that 18% of pick annotations conflated pre-grasp approach with contact, forcing a 9-week re-annotation cycle. Measuring IAA on 15% overlap samples during collection would have surfaced this taxonomy gap within 72 hours at 6% of the remediation cost.
Designing the Overlap Protocol for Robotics Annotation Projects
The overlap protocol specifies which episodes receive independent multi-annotator labels, how annotators are paired, and how overlap samples are distributed across task types and difficulty levels. For datasets under 5,000 episodes, annotate 20-25% overlap; for datasets exceeding 10,000 episodes, 15% overlap provides sufficient statistical power while controlling cost[3]. Stratify overlap samples to match the full dataset distribution across environments, object categories, and task complexity — random sampling from a balanced dataset achieves this automatically, but verify that rare edge cases (occlusions, contact failures, multi-object scenes) appear in overlap at their true prevalence.
Annotator pairing determines whether you can isolate individual annotator drift versus systemic taxonomy problems. Round-robin pairing ensures every annotator pair contributes overlap samples: with 5 annotators (A, B, C, D, E), the overlap set must include AB, AC, AD, AE, BC, BD, BE, CD, CE, DE pairs in roughly equal counts. This enables pairwise Cohen's kappa analysis to identify which specific annotators diverge. Labelbox and Encord both support consensus workflows where the platform automatically routes overlap episodes to designated annotator pairs and tracks pairwise agreement in real time.
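As a sketch, round-robin assignment can be generated directly from the annotator roster; the function name and episode-ID format below are illustrative rather than a specific platform's API:

```python
from itertools import combinations, cycle

def assign_overlap_pairs(overlap_episode_ids, annotators):
    """Distribute overlap episodes across all annotator pairs in round-robin order."""
    pairs = list(combinations(annotators, 2))   # 5 annotators -> 10 pairs (AB, AC, ..., DE)
    pair_cycle = cycle(pairs)
    return {episode_id: next(pair_cycle) for episode_id in overlap_episode_ids}

# 240 overlap episodes and 5 annotators -> each of the 10 pairs labels 24 episodes.
assignments = assign_overlap_pairs([f"ep_{i:04d}" for i in range(240)], list("ABCDE"))
```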
Critical: overlap annotations must be independent. Annotators must not see each other's labels during annotation, and the annotation tool must not display prior labels when an annotator opens an overlap episode. Dataloop enforces this via role-based access control where annotators see only their own label history. If annotators work in the same physical space, use headphones and privacy screens to prevent cross-contamination. For remote teams, stagger assignment timing so annotators complete overlap episodes on different days, reducing the risk of discussion-based convergence that inflates measured agreement without improving true label quality.
Selecting Agreement Metrics by Label Type and Annotation Modality
No single IAA metric fits all annotation types. Cohen's kappa and Fleiss' kappa measure categorical agreement but assume labels are unordered; ICC quantifies continuous-value consistency; Krippendorff's alpha handles ordinal, interval, and missing data. Match the metric to your label structure or risk misleading quality signals.
Binary and nominal categorical labels (object present/absent, grasp success/failure, left-hand/right-hand) use Cohen's kappa for two annotators or Fleiss' kappa for three or more. Both correct for chance agreement: a kappa of 0.80 means annotators agree 80% more than random guessing would predict. Interpretation thresholds from Landis & Koch: below 0.40 is poor, 0.40-0.60 moderate, 0.60-0.80 substantial, above 0.80 near-perfect[4]. EPIC-KITCHENS-100 reported Fleiss' kappa of 0.76 for verb-class labels and 0.68 for noun-class labels across 3 annotators, flagging nouns for taxonomy refinement.
Continuous annotations (end-effector XYZ coordinates, joint angles, force sensor readings, trajectory smoothness scores) require intraclass correlation coefficient (ICC). ICC ranges 0-1 where values above 0.90 indicate excellent reliability, 0.75-0.90 good, 0.50-0.75 moderate, below 0.50 poor. Use two-way random-effects ICC(2,1) for absolute agreement when annotators are a random sample from a larger pool. Sama reports that warehouse manipulation clients require ICC above 0.88 for 6-DOF pose annotations to meet deployment accuracy targets.
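One way to compute ICC(2,1) is the pingouin package; the long-format column names and scores below are illustrative:

```python
import pandas as pd
import pingouin as pg

# Long-format table: one row per (episode, annotator) continuous rating,
# e.g. a trajectory-smoothness score assigned independently by each annotator.
df = pd.DataFrame({
    "episode":   [f"ep{i}" for i in range(8) for _ in ("A", "B")],
    "annotator": ["A", "B"] * 8,
    "score":     [0.82, 0.79, 0.41, 0.45, 0.93, 0.90, 0.55, 0.60,
                  0.71, 0.68, 0.30, 0.35, 0.88, 0.84, 0.62, 0.59],
})

icc = pg.intraclass_corr(data=df, targets="episode", raters="annotator", ratings="score")
# The "ICC2" row corresponds to ICC(2,1): absolute agreement with annotators
# treated as a random sample from a larger pool.
print(icc.set_index("Type").loc["ICC2", ["ICC", "CI95%"]])
```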
Temporal segmentation (action start/end frames, contact timestamps, phase boundaries) uses mean absolute difference (MAD) in frames or Krippendorff's alpha with interval distance. For a 30 fps video, MAD under 10 frames (333 ms) is acceptable for coarse action boundaries; under 3 frames (100 ms) for fine-grained contact events. Free-text annotations (failure-mode descriptions, object-attribute lists) require semantic similarity metrics like BERTScore rather than exact-match kappa, but these are rare in robotics datasets where structured taxonomies dominate.
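A minimal sketch of boundary MAD between two annotators, assuming action-start frames aligned per episode (the values are illustrative):

```python
import numpy as np

# Action-start frames for the same overlap episodes, recorded at 30 fps.
starts_a = np.array([112, 340, 501, 1288])   # annotator A
starts_b = np.array([118, 332, 503, 1279])   # annotator B

mad_frames = np.mean(np.abs(starts_a - starts_b))
mad_ms = mad_frames / 30.0 * 1000            # convert frames to milliseconds at 30 fps
print(f"MAD: {mad_frames:.1f} frames ({mad_ms:.0f} ms)")  # ~6.2 frames, ~208 ms
```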
Computing Agreement Scores and Generating Diagnostic Reports
Compute IAA scores using scikit-learn for Cohen's kappa, pingouin for ICC, or the krippendorff Python package for Krippendorff's alpha. For a dataset with 1,200 episodes and 20% overlap (240 episodes, 2 annotators each), the computation workflow is: (1) export overlap annotations to a pandas DataFrame with columns episode_id, annotator_id, label; (2) pivot to wide format where each row is an episode and each column is one annotator's label; (3) pass the annotator columns to the metric function. Cohen's kappa example: `from sklearn.metrics import cohen_kappa_score; kappa = cohen_kappa_score(df['annotator_A'], df['annotator_B'])`. For Fleiss' kappa with 3+ annotators, use `from statsmodels.stats.inter_rater import fleiss_kappa; kappa = fleiss_kappa(ratings_matrix)`, where ratings_matrix has one row per episode and one column per label category containing the count of annotators who chose that category.
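A runnable version of that workflow, with illustrative labels (the column names match the export format above):

```python
import pandas as pd
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# (1) Long-format export: one row per (episode, annotator) label.
df = pd.DataFrame({
    "episode_id":   ["ep1", "ep1", "ep2", "ep2", "ep3", "ep3", "ep4", "ep4"],
    "annotator_id": ["A",   "B",   "A",   "B",   "A",   "B",   "A",   "B"],
    "label":        ["pinch", "pinch", "power", "pinch", "power", "power", "pinch", "pinch"],
})

# (2) Pivot to wide format: one row per episode, one column per annotator.
wide = df.pivot(index="episode_id", columns="annotator_id", values="label")

# (3) Cohen's kappa for the two-annotator case.
print(f"Cohen's kappa: {cohen_kappa_score(wide['A'], wide['B']):.2f}")

# For 3+ annotators, aggregate_raters converts raw labels into the
# (episodes x categories) count matrix that fleiss_kappa expects:
# counts, _ = aggregate_raters(wide.to_numpy())
# print(fleiss_kappa(counts))
```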
Generate per-label-type reports to isolate which annotation dimensions drive low agreement. If overall kappa is 0.72 but grasp-type kappa is 0.58 while object-class kappa is 0.89, the taxonomy problem is localized to grasp types. Roboflow and V7 both provide built-in IAA dashboards that compute per-class kappa and highlight episodes with maximum disagreement for review. Export these high-disagreement episodes as a calibration set: have annotators re-label them together in a consensus session to surface the taxonomy ambiguities causing divergence.
Pairwise agreement matrices reveal annotator-specific drift. Compute Cohen's kappa for every annotator pair and visualize as a heatmap: if annotator C shows low agreement with all peers (kappa 0.55-0.62) while other pairs average 0.78-0.84, annotator C requires retraining or replacement. Track agreement over time by computing rolling kappa on 500-episode windows: if kappa declines from 0.82 in week 1 to 0.68 in week 4, annotator fatigue or taxonomy drift is degrading quality. Appen and CloudFactory both implement automated annotator performance tracking with alert thresholds.
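A sketch of the pairwise matrix, assuming the wide-format DataFrame from the previous step with one column per annotator:

```python
import itertools
import pandas as pd
from sklearn.metrics import cohen_kappa_score

def pairwise_kappa_matrix(wide: pd.DataFrame) -> pd.DataFrame:
    """Cohen's kappa for every annotator pair; NaN where a pair shares no episodes."""
    annotators = list(wide.columns)
    matrix = pd.DataFrame(index=annotators, columns=annotators, dtype=float)
    for a, b in itertools.combinations(annotators, 2):
        both = wide[[a, b]].dropna()          # only episodes labeled by both annotators
        if len(both) > 0:
            k = cohen_kappa_score(both[a], both[b])
            matrix.loc[a, b] = matrix.loc[b, a] = k
    return matrix

# Visualize as a heatmap to spot an annotator whose agreement lags the team:
# import seaborn as sns; sns.heatmap(pairwise_kappa_matrix(wide), annot=True, vmin=0, vmax=1)
```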
Diagnosing and Resolving Disagreement Sources in Physical AI Annotations
Low IAA scores signal one of four root causes: ambiguous taxonomy definitions, insufficient annotator training, edge cases absent from guidelines, or genuine label subjectivity requiring consensus protocols. Disambiguate by analyzing confusion matrices and high-disagreement episodes.
Taxonomy ambiguity appears as systematic confusion between specific label pairs. If 40% of disagreements are grasp-type confusions between 'pinch' and 'precision', the taxonomy does not clearly distinguish these categories. Resolution: add visual examples of each grasp type to guidelines, specify decision rules (pinch uses thumb + 1 finger, precision uses thumb + 2+ fingers), and re-annotate 50 calibration episodes as a team to align mental models. DROID's annotation guidelines include 12 grasp-type exemplars with failure-mode counterexamples to reduce this confusion.
Insufficient training manifests as one annotator diverging from the group or agreement improving over time as annotators learn. If annotator D's pairwise kappa with others is 0.58 while the team average is 0.79, annotator D needs additional training. If kappa rises from 0.64 in the first 200 episodes to 0.81 in episodes 800-1000, the team is converging but initial data quality is compromised. Resolution: discard or re-annotate early low-agreement batches, extend onboarding from 50 to 150 calibration episodes, and require new annotators to achieve 0.75 kappa on a held-out test set before contributing to production.
Edge-case gaps emerge when rare scenarios (occlusions, lighting failures, multi-hand interactions) lack guideline coverage. These appear as isolated high-disagreement episodes rather than systematic confusion. Resolution: extract these episodes, have the annotation lead provide ground-truth labels with rationale, append to guidelines as edge-case examples, and re-annotate. EPIC-KITCHENS maintains a living edge-case appendix updated after each annotation sprint.
Genuine subjectivity occurs when human perception legitimately varies (is a grasp 'stable' if the object slips 2mm? is contact 'initiated' at first touch or at force threshold?). Resolution: convert subjective labels to objective measurements (replace 'stable' with binary did-object-fall, replace 'contact' with force > 0.5N), or implement consensus protocols where 3 annotators vote and majority wins. Scale's physical AI workflows use 3-annotator consensus for all manipulation success/failure labels to eliminate subjectivity.
Setting Acceptance Thresholds and Integrating IAA into Production Pipelines
IAA thresholds must be task-specific and tied to downstream model performance requirements. For safety-critical applications (surgical robotics, autonomous vehicles), require kappa above 0.90 on all labels. For warehouse automation, 0.80-0.85 on grasp types and object classes is standard. For research datasets, 0.70 is often acceptable if the goal is exploratory model development rather than deployment.
Implement IAA as a continuous quality gate rather than a one-time audit. Configure your annotation platform to compute rolling IAA on the most recent 500 overlap episodes and halt annotation if kappa drops below threshold. Labelbox's quality workflows support automated annotator lockout when individual pairwise kappa falls below 0.70 for 3 consecutive batches. Encord Active flags episodes where model predictions disagree with human labels by more than 2 standard deviations, routing them to overlap annotation to verify whether the model or the annotator is correct.
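For custom pipelines, a minimal sketch of such a gate might look like the following; the window size, threshold, and pipeline hook are illustrative assumptions:

```python
from sklearn.metrics import cohen_kappa_score

def rolling_kappa_gate(labels_a, labels_b, window=500, threshold=0.75):
    """Return True (halt annotation) if kappa over the most recent overlap window drops below threshold."""
    recent_a, recent_b = labels_a[-window:], labels_b[-window:]
    if len(recent_a) < window:
        return False   # not enough overlap episodes yet to evaluate the gate
    return cohen_kappa_score(recent_a, recent_b) < threshold

# Example wiring: check the gate each time an overlap episode is double-labeled.
# if rolling_kappa_gate(history_a, history_b):
#     pause_assignments_and_schedule_calibration()   # hypothetical pipeline hook
```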
Document IAA methodology and scores in dataset cards to enable buyers to assess fitness for their use case. Datasheets for Datasets recommends reporting: overlap percentage, annotator count, pairing strategy, metric choice with justification, per-label-type scores, and remediation actions taken for low-agreement labels. Truelabel's marketplace requires sellers to disclose IAA scores for all categorical and continuous label types, with datasets below 0.70 on any dimension flagged as research-grade rather than production-ready.
For multi-stage annotation (bounding boxes → segmentation masks → attribute tags), measure IAA at each stage independently. A dataset may achieve 0.88 kappa on object detection but only 0.62 on fine-grained attribute labels, signaling that attributes need taxonomy refinement while detection quality is production-ready. BridgeData V2 reported 0.84 agreement on action-verb labels but 0.71 on object-state labels, leading the team to simplify the state taxonomy from 12 categories to 6 before collecting additional data.
Common IAA Pitfalls in Physical AI Annotation Projects
Pitfall 1: Measuring agreement on non-independent annotations. If annotators discuss difficult episodes before labeling or see each other's labels in the annotation tool, measured IAA will be artificially inflated. This false signal masks taxonomy problems that will resurface when new annotators join or when the model encounters distribution shift. Enforce strict independence: annotators must not communicate about overlap episodes until after both have submitted labels.
Pitfall 2: Using accuracy instead of agreement. Accuracy measures annotator labels against a ground-truth reference; agreement measures consistency between annotators without assuming ground truth exists. For physical AI data, ground truth is often ambiguous (when exactly does a grasp 'begin'?), making agreement the correct quality signal. Kappa corrects for chance agreement; raw percent-agreement does not. A dataset where annotators agree 85% of the time can still have a low kappa if the label distribution is imbalanced: if 90% of episodes are class A, chance agreement alone is 0.9^2 + 0.1^2 = 0.82, so kappa = (0.85 - 0.82) / (1 - 0.82) is roughly 0.17.
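A quick check of that arithmetic, computing chance-corrected kappa directly from raw agreement and per-annotator label prevalences:

```python
def kappa_from_agreement(p_observed, prevalences_a, prevalences_b):
    """Chance-corrected kappa from raw agreement and each annotator's label prevalences."""
    p_chance = sum(pa * pb for pa, pb in zip(prevalences_a, prevalences_b))
    return (p_observed - p_chance) / (1.0 - p_chance)

# 85% raw agreement with a 90/10 class split for both annotators:
print(kappa_from_agreement(0.85, [0.9, 0.1], [0.9, 0.1]))  # ~0.17
```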
Pitfall 3: Ignoring class imbalance in kappa interpretation. Kappa is sensitive to label prevalence: rare classes contribute disproportionately to disagreement. If 'precision-grasp' appears in 3% of episodes and annotators miss it 50% of the time, overall kappa may still be 0.80 but precision-grasp recall is unacceptable. Compute per-class kappa and set minimum thresholds for rare but critical classes. iMerit's annotation guidelines require per-class kappa above 0.75 for all classes representing more than 1% of the dataset.
Pitfall 4: Treating IAA as a one-time gate rather than continuous monitoring. Annotator performance drifts over time due to fatigue, taxonomy updates, or team turnover. A dataset that achieved 0.82 kappa in month 1 may degrade to 0.71 by month 3 if no ongoing calibration occurs. Implement weekly calibration sessions where the team re-annotates 20 consensus episodes together, discusses disagreements, and updates guidelines. Sama reports that clients maintaining weekly calibration sustain kappa above 0.80 for 6+ month projects, while teams without calibration see 15-20% kappa decay after 8 weeks.
Building an Organizational IAA Benchmarking System for Robotics Data
Mature physical AI organizations maintain IAA benchmark suites to evaluate new annotators, test taxonomy changes, and compare vendor quality. A benchmark suite is a fixed set of 100-200 episodes with expert-consensus ground-truth labels spanning the full task and difficulty distribution. New annotators must achieve minimum kappa (typically 0.75) on the benchmark before contributing to production. When you revise the taxonomy, re-annotate the benchmark to quantify whether the change improved agreement.
Benchmark construction requires 3-5 expert annotators (annotation leads or domain specialists) to independently label the candidate episodes, compute pairwise kappa, resolve disagreements in consensus sessions, and freeze the resulting labels as ground truth. Select episodes to cover: common cases (60%), edge cases (25%), and known-difficult scenarios (15%). Open X-Embodiment maintains a 150-episode benchmark spanning 8 task families and 12 robot morphologies, requiring new dataset contributors to achieve 0.78 kappa before their data enters the aggregated corpus.
Track benchmark performance over time to detect taxonomy drift or training-program degradation. If new-annotator benchmark kappa declines from 0.79 in Q1 to 0.71 in Q3, either the training program has weakened or the benchmark has become stale relative to evolving task complexity. Refresh the benchmark annually by adding 30-50 episodes representing new task types or failure modes encountered in production.
For vendor evaluation, send the benchmark suite to 3-5 candidate annotation vendors and compare their kappa against your ground truth. CloudFactory, Appen, and Kognic all accept benchmark-based RFPs where clients provide test episodes and evaluate vendor quality before awarding contracts. Vendors achieving kappa above 0.82 on your benchmark are likely to maintain that quality on production data; vendors below 0.75 will require extensive oversight and rework.
Advanced IAA Techniques for Multi-Modal Physical AI Annotations
Multi-modal robotics datasets combine RGB video, depth maps, point clouds, force/torque sensors, and proprioceptive joint states, each requiring modality-specific agreement metrics. For 3D bounding boxes in point clouds, measure IoU (intersection over union) agreement: compute IoU between each annotator pair's boxes and report mean IoU across overlap samples. IoU above 0.85 indicates strong spatial agreement; below 0.70 signals inconsistent box placement. Segments.ai and Kognic both provide IoU-based quality dashboards for LiDAR and depth annotations.
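A minimal sketch for axis-aligned boxes represented as (min, max) corners; oriented boxes need a geometry library and are not covered here:

```python
import numpy as np

def box_iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    a, b = np.asarray(box_a, dtype=float), np.asarray(box_b, dtype=float)
    overlap = np.clip(np.minimum(a[3:], b[3:]) - np.maximum(a[:3], b[:3]), 0.0, None)
    inter = overlap.prod()
    vol_a = (a[3:] - a[:3]).prod()
    vol_b = (b[3:] - b[:3]).prod()
    return inter / (vol_a + vol_b - inter)

# Mean IoU across overlap samples is the spatial agreement score for an annotator pair.
ious = [box_iou_3d(a, b) for a, b in [
    ((0.00, 0.00, 0.0, 0.20, 0.20, 0.1),
     (0.02, 0.01, 0.0, 0.21, 0.20, 0.1)),
]]
print(np.mean(ious))
```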
For temporal action segmentation, measure boundary precision and recall in addition to kappa. If annotators agree that a 'grasp' action occurred but disagree on start/end frames by 15 frames (500ms at 30fps), kappa may be high but temporal precision is poor. Compute mean absolute frame difference for action boundaries and set thresholds based on task requirements: coarse action segmentation tolerates 10-frame MAD, fine-grained contact detection requires under 3-frame MAD.
For trajectory annotations (demonstrated end-effector paths, waypoint sequences), measure Fréchet distance between annotator trajectories. Fréchet distance quantifies the maximum deviation between two curves, capturing both spatial and temporal alignment. For manipulation tasks, Fréchet distance under 2cm indicates strong agreement; 2-5cm moderate; above 5cm suggests annotators are capturing fundamentally different motion strategies. LeRobot's trajectory evaluation tools compute Fréchet distance between human demonstrations and policy rollouts to quantify imitation fidelity.
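A sketch of the discrete Fréchet distance via dynamic programming over point pairs; the trajectory arrays are illustrative:

```python
import numpy as np

def discrete_frechet(traj_p: np.ndarray, traj_q: np.ndarray) -> float:
    """Discrete Fréchet distance between two trajectories of shape (N, 3) and (M, 3)."""
    n, m = len(traj_p), len(traj_q)
    dist = np.linalg.norm(traj_p[:, None, :] - traj_q[None, :, :], axis=-1)
    ca = np.zeros((n, m))
    ca[0, 0] = dist[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], dist[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], dist[0, j])
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), dist[i, j])
    return ca[-1, -1]

# Two annotators' waypoint sequences for the same demonstration (meters):
traj_a = np.array([[0.00, 0.00, 0.10], [0.10, 0.05, 0.12], [0.20, 0.10, 0.15]])
traj_b = np.array([[0.00, 0.01, 0.10], [0.11, 0.06, 0.13], [0.21, 0.10, 0.16]])
print(f"Fréchet distance: {discrete_frechet(traj_a, traj_b) * 100:.1f} cm")
```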
For hierarchical annotations (scene → objects → parts → attributes), measure agreement at each level independently and report the full hierarchy. A dataset may achieve 0.90 kappa on object detection, 0.82 on part segmentation, but only 0.68 on attribute labels, revealing that attribute taxonomy needs refinement while higher-level annotations are production-ready. Dataloop's hierarchical annotation workflows compute per-level IAA automatically and flag levels below threshold for review.
IAA Reporting Standards for Physical AI Dataset Documentation
Dataset buyers need IAA transparency to assess whether your data meets their quality bar. Datasheets for Datasets and Data Cards both recommend dedicated IAA sections reporting: (1) overlap protocol (percentage, pairing strategy, stratification method), (2) annotator count and training duration, (3) metric choice with justification, (4) per-label-type scores with confidence intervals, (5) remediation actions for low-agreement labels, and (6) benchmark performance if available.
For categorical labels, report Fleiss' kappa or Cohen's kappa with 95% confidence intervals computed via bootstrap resampling (1000 iterations). For continuous labels, report ICC with confidence intervals and specify the ICC variant (ICC(2,1) for absolute agreement, ICC(3,1) for consistency). For temporal labels, report mean absolute frame difference with standard deviation. Include confusion matrices for multi-class labels to show which class pairs drive disagreement.
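A minimal bootstrap sketch for a kappa confidence interval, assuming paired label arrays from the overlap set:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_with_ci(labels_a, labels_b, n_boot=1000, seed=0):
    """Cohen's kappa with a 95% bootstrap confidence interval over overlap episodes."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    rng = np.random.default_rng(seed)
    point = cohen_kappa_score(labels_a, labels_b)
    samples = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(labels_a), size=len(labels_a))  # resample episodes
        samples.append(cohen_kappa_score(labels_a[idx], labels_b[idx]))
    # Degenerate resamples (a single label class) can yield NaN and are ignored.
    lo, hi = np.nanpercentile(samples, [2.5, 97.5])
    return point, (lo, hi)
```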
Truelabel's data provenance framework extends standard dataset cards with machine-readable IAA metadata: overlap episode IDs, per-episode agreement scores, annotator pseudonyms (for privacy), and remediation history. This enables buyers to filter datasets by minimum IAA threshold (e.g., show only datasets with kappa > 0.80 on grasp labels) and to audit whether low-agreement episodes were re-annotated or excluded.
For datasets sold on marketplaces, consider publishing IAA scores as a quality signal. Roboflow Universe displays annotation quality badges (gold/silver/bronze) based on IAA audits, with gold requiring kappa above 0.85 on all label types. Buyers pay 30-50% premiums for gold-tier datasets because high IAA directly translates to faster model convergence and better sim-to-real transfer. Transparent IAA reporting differentiates professional dataset vendors from hobbyist contributors and justifies premium pricing for production-grade physical AI training data.
Integrating IAA Measurement into Annotation Platform Workflows
Modern annotation platforms embed IAA computation into the labeling workflow, enabling real-time quality monitoring without manual scripting. Labelbox's consensus workflows automatically route overlap episodes to designated annotator pairs, compute Cohen's kappa after both submit labels, and surface high-disagreement episodes for review. Annotation leads configure per-project kappa thresholds (e.g., 0.75 for object classes, 0.80 for grasp types); when rolling kappa drops below threshold, the platform halts new assignments and triggers a calibration session.
Encord implements active learning loops where the platform identifies episodes with high model uncertainty, routes them to 3 annotators for consensus labeling, and uses the resulting high-confidence labels to retrain the model. This approach concentrates annotation budget on informative examples while maintaining IAA above 0.85 on the consensus subset. V7's workflow engine supports multi-stage review where junior annotators label episodes, senior annotators audit a 20% sample, and IAA between junior and senior tiers determines whether juniors graduate to unsupervised production work.
For teams building custom annotation tools, integrate IAA dashboards that update in real time as annotators submit overlap labels. Display per-annotator pairwise kappa matrices, per-class confusion matrices, and time-series plots of rolling kappa over the most recent 500 episodes. Segments.ai's API provides programmatic access to IAA metrics, enabling teams to build custom alerting (Slack notifications when kappa drops below 0.70) or automated annotator reassignment (route difficult episodes to high-agreement annotators).
For distributed annotation teams, implement weekly calibration rituals where the team synchronously re-annotates 15-20 consensus episodes, discusses disagreements, and updates guidelines. Record these sessions and add consensus examples to the onboarding curriculum for new annotators. iMerit's Ango Hub includes built-in calibration workflows where annotation leads can flag episodes for team review, collect votes, and publish consensus labels as training examples.
Cost-Benefit Analysis of IAA Measurement in Physical AI Data Pipelines
IAA measurement adds 15-25% overhead to annotation budgets: 20% overlap doubles the annotation cost for those episodes, and computing/reviewing IAA scores requires 10-20 hours per 1,000-episode batch. For a 10,000-episode dataset at $8 per episode, IAA measurement costs $16,000-20,000. The return on investment comes from avoiding downstream rework and model performance losses.
A manipulation policy trained on 5,000 episodes with 0.68 kappa (poor agreement) may require 12,000 episodes to reach the same performance as a policy trained on 3,000 episodes with 0.85 kappa (strong agreement), because label noise forces the model to learn from contradictory examples[5]. At $8 per episode, the low-IAA dataset costs $96,000 versus $24,000 for the high-IAA dataset — a $72,000 penalty for skipping quality measurement. Additionally, low-IAA datasets often require post-hoc re-annotation when deployment testing reveals systematic labeling errors, adding 6-12 week delays and 30-50% cost overruns.
For safety-critical applications, the cost of deploying a model trained on low-IAA data includes liability exposure and reputational damage. A surgical robotics system that misclassifies tissue types 8% of the time due to inconsistent training labels faces regulatory rejection and potential patient harm. The $50,000 cost of rigorous IAA measurement during dataset collection is negligible compared to the $5-10 million cost of a failed FDA submission or product recall.
For research datasets, the calculus differs: if the goal is exploratory model development rather than deployment, 0.70 kappa may be acceptable and the 20% IAA overhead may not justify the cost. However, even research datasets benefit from basic IAA measurement (10% overlap, per-class kappa reporting) to ensure that observed model behaviors reflect true task structure rather than annotation artifacts. Open X-Embodiment's aggregation process rejected 4 of 26 candidate datasets due to IAA below 0.65, preventing low-quality data from contaminating the shared corpus.
Future Directions: Automated IAA Estimation and Model-Assisted Consensus
Emerging techniques use model predictions to estimate IAA without requiring full double-annotation. Train an initial model on 1,000 high-IAA episodes, then use model confidence as a proxy for annotation difficulty: episodes where the model is uncertain (high entropy, low confidence) are likely to have low human agreement and should be routed to overlap annotation. This selective overlap strategy reduces annotation cost by 40-60% while maintaining quality coverage of difficult cases.
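A sketch of the routing rule, assuming per-episode class probabilities from the initial model (the entropy threshold is an illustrative value to tune per project):

```python
import numpy as np

def route_to_overlap(class_probs: np.ndarray, entropy_threshold: float = 0.5) -> np.ndarray:
    """Flag episodes whose prediction entropy exceeds the threshold for double-annotation.

    class_probs: array of shape (n_episodes, n_classes), rows summing to 1.
    """
    probs = np.clip(class_probs, 1e-12, 1.0)
    entropy = -(probs * np.log(probs)).sum(axis=1)   # high entropy = model uncertain
    return entropy > entropy_threshold               # True -> send to two annotators

# Confident episodes stay single-annotated; uncertain ones get overlap annotation.
probs = np.array([[0.97, 0.02, 0.01],    # confident
                  [0.40, 0.35, 0.25]])   # uncertain
print(route_to_overlap(probs))           # [False  True]
```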
Model-assisted consensus workflows present annotators with model predictions as a starting point, reducing annotation time by 30-50% for high-confidence predictions while preserving human judgment for ambiguous cases. Encord Active and Dataloop both support this workflow, computing IAA between human corrections and model predictions to identify systematic model failures requiring additional training data. When human-model agreement exceeds 0.90, the model can auto-label low-difficulty episodes with human spot-checks, further reducing cost.
Large language models and vision-language models are beginning to serve as synthetic annotators for IAA estimation. Prompt GPT-4V or Gemini to label a sample of episodes, compute agreement between the model and human annotators, and use the model as a third annotator to break ties in 2-annotator disagreements. Early results from RT-2's vision-language-action training suggest that VLM-generated labels achieve 0.72-0.78 agreement with expert human annotators on object-class and action-verb labels, sufficient for use as a tiebreaker but not yet reliable enough to replace human consensus entirely. As VLM capabilities improve, hybrid human-model annotation pipelines will likely become standard for cost-sensitive physical AI projects.
External references and source context
[1] Sim-to-Real Transfer of Robotic Control with Dynamics Randomization (arXiv): label noise from low IAA degrades sim-to-real transfer in robotic control policies.
[2] Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv): aggregated 22 datasets but required extensive harmonization due to inconsistent annotation schemas.
[3] Data and its (dis)contents: A survey of dataset development and use in machine learning research (Patterns): 15-25% overlap provides sufficient statistical power for IAA measurement in ML datasets.
[4] Datasheets for Datasets (arXiv): Landis & Koch interpretation thresholds: below 0.40 poor, 0.40-0.60 moderate, 0.60-0.80 substantial, above 0.80 near-perfect.
[5] Large image datasets: A pyrrhic win for computer vision? (arXiv): label noise from low IAA forces models to learn from contradictory examples, requiring 2-4x more data for equivalent performance.
FAQ
What is a good inter-annotator agreement score for robotics datasets?
For production physical AI datasets, target Cohen's kappa or Fleiss' kappa above 0.80 for categorical labels (object classes, grasp types, action verbs) and ICC above 0.85 for continuous annotations (pose coordinates, force measurements). Scores of 0.60-0.80 indicate moderate agreement suitable for research but requiring caution for deployment. Below 0.60 signals taxonomy ambiguity or insufficient annotator training demanding immediate remediation. Safety-critical applications (surgical robotics, autonomous vehicles) require kappa above 0.90. Warehouse automation typically accepts 0.80-0.85. The threshold depends on downstream model performance requirements and the cost of errors in deployment.
How much annotation overlap is needed to measure IAA reliably?
For datasets under 5,000 episodes, annotate 20-25% overlap with independent multi-annotator labels. For datasets exceeding 10,000 episodes, 15% overlap provides sufficient statistical power. Stratify overlap samples to match the full dataset distribution across task types, environments, and difficulty levels. Use round-robin annotator pairing so every annotator pair contributes overlap samples in roughly equal proportion, enabling pairwise agreement analysis to isolate weak annotators. Overlap annotations must be strictly independent — annotators must not see each other's labels or discuss episodes until both have submitted. Platforms like Labelbox and Encord enforce this via role-based access control and staggered assignment timing.
Should I use Cohen's kappa or Fleiss' kappa for robotics annotations?
Use Cohen's kappa when exactly two annotators label each overlap episode; use Fleiss' kappa when three or more annotators label the same episodes. Both metrics correct for chance agreement, but Fleiss' kappa generalizes to multiple raters while Cohen's kappa is limited to pairs. For continuous annotations like end-effector poses or joint angles, neither kappa variant applies — use intraclass correlation coefficient (ICC) instead. For temporal segmentation (action boundaries, contact timestamps), compute mean absolute frame difference or use Krippendorff's alpha with interval distance. Match the metric to your label structure: categorical unordered labels use kappa, continuous values use ICC, ordinal or temporal labels use Krippendorff's alpha.
What causes low inter-annotator agreement in physical AI datasets?
Low IAA stems from four root causes. First, ambiguous taxonomy definitions where label categories lack clear decision boundaries (e.g., 'pinch' versus 'precision' grasp without visual exemplars). Second, insufficient annotator training where new annotators have not internalized the taxonomy through enough calibration episodes. Third, edge-case gaps where rare scenarios (occlusions, multi-object interactions, lighting failures) lack guideline coverage. Fourth, genuine label subjectivity where human perception legitimately varies (e.g., when exactly a grasp becomes 'stable'). Diagnose by analyzing confusion matrices to identify systematic label-pair confusions, computing pairwise kappa to isolate weak annotators, and extracting high-disagreement episodes for consensus review. Resolution strategies include adding visual examples to guidelines, extending onboarding from 50 to 150 calibration episodes, converting subjective labels to objective measurements, or implementing 3-annotator consensus voting.
How do I integrate IAA measurement into a continuous annotation pipeline?
Configure your annotation platform to compute rolling IAA on the most recent 500 overlap episodes and halt annotation if kappa drops below threshold. Labelbox and Encord support automated annotator lockout when individual pairwise kappa falls below 0.70 for three consecutive batches. Implement weekly calibration sessions where the team re-annotates 20 consensus episodes together, discusses disagreements, and updates guidelines. Track per-annotator pairwise kappa matrices to identify drift: if an annotator's agreement with peers declines from 0.82 to 0.68 over four weeks, trigger retraining. For multi-stage annotation (detection → segmentation → attributes), measure IAA at each stage independently and set stage-specific thresholds. Document IAA methodology and per-label-type scores in dataset cards so buyers can assess fitness for their use case. Truelabel's marketplace requires sellers to disclose IAA scores, flagging datasets below 0.70 as research-grade rather than production-ready.
Can I use model predictions to reduce IAA measurement cost?
Yes, via selective overlap strategies. Train an initial model on 1,000 high-IAA episodes, then route only high-uncertainty predictions (low confidence, high entropy) to overlap annotation. Episodes where the model is confident are likely easy cases with high human agreement, so single-annotation suffices. This reduces overlap from 20% to 8-12% while maintaining quality coverage of difficult cases, cutting annotation cost by 40-60%. Model-assisted consensus workflows present annotators with model predictions as starting points, reducing annotation time by 30-50% for high-confidence predictions. Compute IAA between human corrections and model predictions to identify systematic model failures. When human-model agreement exceeds 0.90, auto-label low-difficulty episodes with human spot-checks. Encord Active and Dataloop both support these workflows, and emerging VLMs like GPT-4V achieve 0.72-0.78 agreement with expert annotators on object and action labels, sufficient for use as tiebreakers in 2-annotator disagreements.
Looking for how to measure inter-annotator agreement?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.
Browse Physical AI Datasets