
Glossary

Activity Annotation

Activity annotation assigns semantic labels and precise temporal boundaries to human actions in video, producing structured timelines of start/end timestamps, action classes, and object tags. Granularity ranges from atomic motor primitives (grasp, release) to multi-minute tasks (prepare meal). The EPIC-KITCHENS-100 dataset contains 90,000 action segments across 100 hours of egocentric kitchen video[2], while DROID captures 76,000 manipulation trajectories (350 hours of interaction data) across 564 diverse real-world scenes[3].

Updated 2025-05-19
By truelabel
Reviewed by truelabel

Quick facts

Term: Activity Annotation
Domain: Robotics and physical AI
Last reviewed: 2025-05-19

What Activity Annotation Delivers for Physical AI

Activity annotation transforms raw video into machine-readable action sequences by marking when each action starts, when it ends, what verb describes it, and which objects participate. A single annotation might record that a human grasped a knife at frame 1,203, sliced downward from frame 1,210 to 1,228, then released the knife at frame 1,235 — each interval tagged with verb and noun labels.
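
A minimal sketch of how one such sequence might be serialized, using the verb/noun fields described later in this entry (the exact field names and frame spans here are illustrative, not a standard schema):

```python
# Illustrative record for the grasp -> slice -> release sequence above.
# Field names are hypothetical; real datasets define their own schemas.
annotations = [
    {"start_frame": 1203, "end_frame": 1210, "verb_label": "grasp",   "noun_label": "knife"},
    {"start_frame": 1210, "end_frame": 1228, "verb_label": "slice",   "noun_label": "knife"},
    {"start_frame": 1228, "end_frame": 1235, "verb_label": "release", "noun_label": "knife"},
]
```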

Physical AI systems consume these annotations in three ways. Imitation learning policies like RT-1 and RT-2 map annotated action sequences to robot control primitives, learning to replicate human demonstrations. Vision-language-action models such as OpenVLA ground natural-language instructions in annotated video, enabling zero-shot task generalization. World models use temporal action boundaries to segment continuous experience into discrete state transitions, improving sample efficiency during reinforcement learning[1].

Annotation granularity determines model capability. Coarse labels like 'cook meal' span minutes but lack the motor detail robots need. Atomic labels like 'rotate wrist 15° clockwise' capture individual joint commands but require 10–50× more annotations per video hour. The EPIC-KITCHENS-100 dataset balances both: 90,000 action segments at intermediate granularity plus 20,000 atomic narrations[2]. DROID goes further, pairing 76,000 teleoperation trajectories with object-centric action labels across 564 real-world scenes[3].

Temporal precision matters as much as label accuracy. A 200-millisecond boundary error in a 'grasp' annotation can teach a robot to close its gripper before contact or after the object has moved. Professional annotation platforms like Encord and V7 provide frame-accurate timeline editors, but human reaction time still introduces ±3–5 frame jitter at 30 fps. High-stakes applications pre-filter annotations by inter-annotator agreement: segments where two independent labelers disagree by more than 10 frames get re-reviewed or discarded.
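
A minimal sketch of that pre-filter, assuming each segment has been labeled independently by two annotators (the 10-frame threshold follows the rule above; the data layout is illustrative):

```python
def needs_review(seg_a, seg_b, max_disagreement=10):
    """Flag a segment for re-review when the two annotators' start or end
    boundaries differ by more than `max_disagreement` frames."""
    return (abs(seg_a["start_frame"] - seg_b["start_frame"]) > max_disagreement
            or abs(seg_a["end_frame"] - seg_b["end_frame"]) > max_disagreement)

# Annotator B marks the grasp 12 frames late, so the segment is flagged.
print(needs_review({"start_frame": 1203, "end_frame": 1210},
                   {"start_frame": 1215, "end_frame": 1221}))  # True
```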

Taxonomy Design: Verbs, Nouns, and Hierarchies

An activity taxonomy defines the vocabulary of action labels and their relationships. Flat taxonomies list 50–300 independent action classes with no parent-child structure — simple to annotate but brittle when new actions appear. Hierarchical taxonomies organize actions into trees: 'manipulate object' branches into 'grasp,' 'move,' 'release,' each with finer subdivisions. The EPIC-KITCHENS-100 taxonomy separates verbs (97 classes like 'cut,' 'pour,' 'open') from nouns (300 object classes like 'knife,' 'bowl,' 'door'), yielding 4,000+ valid verb-noun pairs without enumerating every combination[2].
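
A toy illustration of that verb-noun factoring, with a hypothetical compatibility map standing in for the affordance rules a real taxonomy would document:

```python
# Verbs and nouns are annotated separately; a compatibility map (hypothetical
# here) rules out pairs that make no physical sense, e.g. 'pour' + 'door'.
verbs = ["cut", "pour", "open"]
nouns = ["knife", "bowl", "door"]
compatible = {"cut": {"knife"}, "pour": {"bowl"}, "open": {"bowl", "door"}}

valid_pairs = [(v, n) for v in verbs for n in nouns if n in compatible[v]]
print(valid_pairs)  # [('cut', 'knife'), ('pour', 'bowl'), ('open', 'bowl'), ('open', 'door')]
```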

Robot-centric taxonomies prioritize affordance alignment — labels that map directly to control primitives. The CALVIN benchmark uses 34 atomic actions ('slide block left,' 'rotate knob') that correspond one-to-one with policy outputs. LeRobot datasets adopt a two-tier structure: high-level task names ('pick and place,' 'drawer open') for episode retrieval, plus low-level action vectors (joint positions, gripper state) for policy training[4].

Egocentric video taxonomies emphasize object interactions because the camera wearer's hands dominate the frame. Ego4D annotates 3,670 hours across 74 scenarios with a 110-verb, 478-noun taxonomy, capturing how humans manipulate objects in homes, factories, and hospitals[5]. Third-person surveillance taxonomies instead focus on whole-body poses and scene context — less useful for robot manipulation but critical for human-robot collaboration safety.

Taxonomy coverage determines dataset reusability. A kitchen-only taxonomy with 'chop,' 'stir,' 'pour' won't transfer to warehouse picking tasks. The Open X-Embodiment consortium aggregated 22 datasets spanning kitchens, labs, warehouses, and outdoor environments, then harmonized 527 task descriptions into a unified taxonomy that 90% of contributing teams could map to[6]. Buyers should verify that a dataset's taxonomy covers their target domain or budget for re-annotation.

Temporal Segmentation Methods and Tooling

Temporal segmentation marks the exact frame where one action ends and the next begins. Manual frame-by-frame annotation remains the gold standard: annotators scrub through video at 0.25× speed, clicking start/end boundaries while watching for motion changes. CVAT and Labelbox provide timeline interfaces with keyboard shortcuts for frame-stepping, but human annotators still require 15–30 minutes per video minute for atomic-level segmentation.

Keyframe interpolation accelerates coarse annotation. Annotators mark action boundaries at 1-second intervals, then interpolation algorithms refine boundaries by detecting motion discontinuities in optical flow. This cuts annotation time by 40–60% but introduces 5–10 frame errors at action transitions. Scale AI's Physical AI platform combines interpolation with active learning: the system flags low-confidence boundaries for human review, achieving 95% of manual accuracy at 3× speed[7].
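
One way such a refinement step could work is to snap each coarse boundary to the nearest motion discontinuity. The sketch below assumes a precomputed per-frame mean optical-flow magnitude and is not how any particular platform implements it:

```python
import numpy as np

def refine_boundary(coarse_frame, flow_magnitude, window=15):
    """Move a coarse boundary to the largest frame-to-frame change in mean
    optical-flow magnitude within +/- `window` frames of the original guess."""
    lo = max(1, coarse_frame - window)
    hi = min(len(flow_magnitude) - 1, coarse_frame + window)
    deltas = np.abs(np.diff(flow_magnitude[lo - 1:hi + 1]))
    return lo + int(np.argmax(deltas))

# Synthetic trace: motion drops sharply at frame 105, coarse guess was 112.
flow = np.concatenate([np.full(105, 4.0), np.full(45, 0.3)])
print(refine_boundary(112, flow))  # 105
```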

Pre-segmentation models like InternVideo and VideoMAE predict candidate action boundaries, which annotators then correct. On EPIC-KITCHENS, pre-segmentation reduces annotation time from 22 minutes to 8 minutes per video minute while maintaining 92% boundary agreement with ground truth[2]. The catch: pre-segmentation models trained on kitchen video fail in novel domains (warehouses, hospitals), requiring domain-specific fine-tuning or fallback to manual annotation.

File formats encode temporal annotations as JSON arrays of `{start_frame, end_frame, verb_label, noun_label}` objects. The RLDS standard wraps these in TFRecord episodes with synchronized RGB frames, depth maps, and robot state vectors, enabling end-to-end policy training without format conversion[8]. MCAP stores annotations as timestamped messages alongside ROS bag sensor streams, preserving microsecond synchronization for real-time playback.
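
A minimal loader for that JSON layout (the file path, sorting, and sanity checks are illustrative; RLDS and MCAP tooling ship their own readers):

```python
import json

def load_segments(path):
    """Read a JSON array of {start_frame, end_frame, verb_label, noun_label}
    objects, sort by start time, and reject empty or inverted segments."""
    with open(path) as f:
        segments = json.load(f)
    segments.sort(key=lambda s: s["start_frame"])
    for seg in segments:
        if seg["end_frame"] <= seg["start_frame"]:
            raise ValueError(f"empty or inverted segment: {seg}")
    return segments
```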

Inter-Annotator Agreement and Quality Control

Inter-annotator agreement (IAA) quantifies how consistently multiple humans label the same video. Temporal IoU measures boundary overlap: if Annotator A marks 'grasp' from frame 100–120 and Annotator B marks 98–118, IoU = 18 / 22 = 82%. Production datasets target ≥80% mean IoU across all action segments. EPIC-KITCHENS-100 reports 84% IoU on verb labels and 79% on noun labels after two-pass review[2].
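
The computation behind that example, as a small helper (representing segments as (start_frame, end_frame) tuples is an assumption):

```python
def temporal_iou(a, b):
    """Temporal IoU between two (start_frame, end_frame) segments."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((100, 120), (98, 118)))  # 0.818..., the 82% from the example
```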

Cohen's kappa adjusts for chance agreement in label assignment. A kappa of 0.6–0.8 indicates substantial agreement; below 0.4 signals taxonomy ambiguity or insufficient annotator training. The Ego4D team discarded 12% of initial annotations due to kappa < 0.5, then revised taxonomy definitions and retrained annotators, raising kappa to 0.71[5].
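
Kappa on label assignments can be computed directly with scikit-learn; the toy labels below are illustrative:

```python
from sklearn.metrics import cohen_kappa_score

# Verb labels from two annotators on the same ten segments (toy data).
annotator_a = ["grasp", "cut", "cut", "pour", "open", "grasp", "cut", "pour", "open", "grasp"]
annotator_b = ["grasp", "cut", "cut", "pour", "open", "cut", "cut", "pour", "grasp", "grasp"]
print(cohen_kappa_score(annotator_a, annotator_b))  # chance-corrected agreement
```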

Quality control pipelines run three checks. Intra-annotator consistency re-shows 10% of videos to the same annotator weeks later; drift > 15% triggers retraining. Cross-validation assigns 20% of videos to two independent annotators; disagreements above threshold go to expert arbitration. Automated outlier detection flags annotations with abnormal duration (e.g., 'grasp' lasting 8 seconds) or impossible transitions (e.g., 'release' before 'grasp').
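
A sketch of the automated checks, with hypothetical duration caps and a simple grasp/release ordering rule (segments are assumed sorted by start frame):

```python
MAX_DURATION_FRAMES = {"grasp": 90, "release": 60}  # hypothetical caps at 30 fps

def flag_outliers(segments):
    """Return (segment, reason) pairs for abnormal durations and for a
    'release' that appears before any preceding 'grasp'."""
    flags, holding = [], False
    for seg in segments:
        verb = seg["verb_label"]
        duration = seg["end_frame"] - seg["start_frame"]
        if duration > MAX_DURATION_FRAMES.get(verb, 900):
            flags.append((seg, "abnormal duration"))
        if verb == "grasp":
            holding = True
        elif verb == "release":
            if not holding:
                flags.append((seg, "release before grasp"))
            holding = False
    return flags
```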

Appen and Sama operate distributed annotation workforces with built-in IAA tracking, but their general-purpose platforms lack robot-specific validation rules. CloudFactory offers custom QC pipelines for manipulation datasets, including physics-based checks (gripper can't teleport 50 cm between frames) and affordance validation (can't 'pour' from a closed bottle)[9]. Buyers should request IAA reports and sample 5% of delivered annotations before accepting full datasets.

Dataset Scale Requirements for Robot Policies

Modern vision-language-action models require 10,000–100,000 annotated demonstrations to generalize across tasks. RT-1 trained on 130,000 robot trajectories spanning 700 tasks, achieving 97% success on seen tasks and 76% on novel object configurations[10]. RT-2 added 6 billion web images to the training mix, improving zero-shot generalization to 62% on unseen tasks[11].

Atomic-level annotation scales poorly: a 10-minute video with 1,200 atomic actions requires 4–6 hours of expert annotation time. The DROID dataset collected 76,000 trajectories via teleoperation, then annotated only task-level labels ('pick apple,' 'open drawer') rather than per-frame actions, cutting annotation cost by 80% while preserving policy training utility[3]. LeRobot datasets follow the same pattern: episode-level task descriptions plus dense robot state logs, no per-frame action labels[4].

Domain diversity matters more than raw volume. A policy trained on 50,000 kitchen demonstrations won't transfer to warehouse picking. The Open X-Embodiment dataset aggregates 1 million trajectories across 22 robot platforms and 20 environments, enabling cross-embodiment transfer that single-domain datasets cannot match[6]. RoboNet pioneered this approach in 2019 with 15 million frames from 7 robots, demonstrating that multi-robot data improves sim-to-real transfer by 34%[12].

Annotation budget allocation: for a 10,000-demonstration dataset, allocate 60% of budget to data collection (teleoperation, scripted tasks), 25% to coarse task-level annotation, 10% to quality control, and 5% to metadata (camera calibration, object inventories). Fine-grained atomic annotation should be reserved for the 500–1,000 highest-value demonstrations that will serve as few-shot exemplars.
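
As a worked example of that split, assuming a hypothetical $200K total budget:

```python
budget = 200_000  # hypothetical total, in dollars
split = {"collection": 0.60, "task-level annotation": 0.25,
         "quality control": 0.10, "metadata": 0.05}
for item, share in split.items():
    print(f"{item:>22}: ${budget * share:>9,.0f}")
# collection: $120,000 | task-level annotation: $50,000 | QC: $20,000 | metadata: $10,000
```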

Egocentric vs. Third-Person Annotation Trade-offs

Egocentric video captures the world from the actor's viewpoint, with hands and manipulated objects dominating the frame. EPIC-KITCHENS-100 and Ego4D pioneered large-scale egocentric annotation, yielding datasets where 80% of pixels belong to task-relevant objects — ideal for learning object-centric policies[2]. Egocentric annotation is 30–40% faster than third-person because annotators don't track full-body pose, only hand-object interactions.

Third-person video shows the actor's full body and surrounding environment, critical for tasks requiring spatial reasoning (navigation, multi-agent coordination). The Something-Something v2 dataset contains 220,000 third-person videos of humans manipulating objects, annotated with 174 fine-grained action classes[13]. Third-person annotation requires tracking both the actor and objects across occlusions, increasing annotation time by 50% compared to egocentric.

Robot deployment context determines optimal viewpoint. Wrist-mounted cameras on robot arms produce egocentric video that matches human egocentric datasets, enabling direct policy transfer. LeRobot datasets use wrist cameras for 90% of manipulation tasks[4]. Overhead cameras in warehouses and factories produce third-person video; policies trained on egocentric human data require viewpoint adaptation layers that cost 10–20% of downstream task performance.

Multi-view annotation combines both perspectives. The DROID dataset synchronizes wrist, shoulder, and static third-person cameras, annotating object interactions in the egocentric stream while tracking full-body pose in third-person[3]. This costs 2× the annotation budget but enables policies that reason about both fine manipulation and coarse navigation. Buyers should match annotation viewpoint to deployment camera configuration or budget for viewpoint transfer losses.

Annotation Platforms and Service Providers

Annotation platforms fall into three tiers. Self-service tools like CVAT (open-source) and Roboflow let teams annotate in-house, offering timeline editors and export to common formats. Setup cost is low, but in-house annotation requires dedicated annotator training and QC infrastructure. Labelbox and Encord add workflow management and IAA tracking for $500–2,000/month, suitable for teams annotating 100–500 hours/year[14].

Managed annotation services like Scale AI, Appen, and CloudFactory provide trained annotator workforces and handle QC. Scale's Physical AI offering specializes in robot manipulation, delivering 95% IAA on atomic actions at $80–150 per video hour[7]. Sama focuses on ethical sourcing, employing annotators in Kenya and Uganda at living wages; cost is 20% higher but appeals to buyers with ESG mandates[15].

Specialist robotics vendors like Kognic and iMerit understand robot-specific validation rules (gripper physics, affordance constraints). Kognic annotates multi-sensor streams (RGB, depth, LiDAR) with microsecond synchronization, critical for outdoor mobile manipulation[16]. Claru offers pre-annotated kitchen and warehouse datasets plus custom collection, targeting buyers who need domain-specific data without building annotation pipelines.

Platform selection criteria: throughput (hours annotated per week), domain expertise (kitchen vs. warehouse vs. outdoor), format support (RLDS, MCAP, custom), and QC transparency (can you audit rejected annotations?). Request 10-hour pilot projects from 2–3 vendors, then compare IAA, turnaround time, and cost per usable annotation hour.

Common Failure Modes and Mitigation Strategies

Taxonomy drift occurs when annotators invent new labels or redefine existing ones mid-project. A 'grasp' label might start meaning 'fingers touch object' but drift to 'fingers close around object' after 500 annotations. Mitigation: lock taxonomy definitions in a shared glossary, run weekly calibration sessions where annotators label the same 5-minute video and discuss disagreements, and flag any label used < 10 times in the first 20% of data for review.
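
The '< 10 uses in the first 20% of data' rule is easy to automate; a minimal sketch, assuming the segment layout used earlier:

```python
from collections import Counter

def labels_to_review(segments, min_count=10, head_fraction=0.20):
    """Flag verb labels used fewer than `min_count` times in the first
    `head_fraction` of annotated segments."""
    head = segments[: max(1, int(len(segments) * head_fraction))]
    counts = Counter(seg["verb_label"] for seg in head)
    return sorted(label for label, n in counts.items() if n < min_count)
```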

Boundary ambiguity plagues continuous actions. When does 'reach' end and 'grasp' begin — when fingers touch the object, or when the gripper closes? The EPIC-KITCHENS-100 team resolved this by defining boundaries as the first frame where the action's effect is visible: 'grasp' starts when the object begins moving with the hand[2]. This rule cut boundary disagreement from 28% to 11%.

Rare action under-sampling happens when annotators spend 90% of effort on common actions ('reach,' 'grasp') and rush through rare ones ('unscrew,' 'flip'). Policies trained on this data fail on rare actions despite high overall accuracy. Mitigation: stratified sampling that over-represents rare actions during QC review, and per-class IAA reporting that flags low-agreement labels for re-annotation.

Annotation lag in active learning loops: models trained on week-old annotations can't guide collection of new data. Scale AI offers 24-hour annotation turnaround for priority batches, enabling daily model updates during data collection sprints[7]. For teams annotating in-house, dedicate one annotator to same-day QC of newly collected data, even if it means slower progress on the backlog.

Activity Annotation in the Physical AI Data Marketplace

Activity-annotated datasets trade on truelabel's physical AI marketplace under three licensing models. Royalty-free datasets like RoboNet (CC BY 4.0) allow unlimited commercial use but lack exclusivity — competitors access the same data[17]. Usage-based licenses charge per training run or deployed robot, aligning cost with value but requiring audit trails. Exclusive licenses grant sole access for 12–36 months, commanding 5–10× premiums but eliminating data leakage to competitors.

Buyers prioritize four metadata signals. Annotation provenance (who labeled it, with what tool, at what IAA threshold) determines trust. Domain coverage (kitchen, warehouse, hospital) predicts transfer performance. Temporal resolution (atomic vs. coarse) sets the ceiling on policy granularity. Format compatibility (RLDS, MCAP, HDF5) determines integration cost. The truelabel provenance standard requires sellers to disclose annotator training protocols, QC pass rates, and per-class IAA distributions[18].

Market pricing: coarse task-level annotations (1 label per 10-second clip) cost $2–5 per video minute. Intermediate action annotations (5–10 labels per minute) cost $15–30. Atomic annotations (30–50 labels per minute) cost $80–150. Multi-view synchronized annotation adds 50–100% premium. Exclusive licenses multiply base cost by 5–10×. A 1,000-hour dataset with intermediate annotations lists at $900K–1.8M non-exclusive, $4.5M–9M exclusive.
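
The non-exclusive list price follows directly from the per-minute rates; a quick check of the arithmetic:

```python
hours = 1_000
minutes = hours * 60
low, high = 15 * minutes, 30 * minutes                  # intermediate annotations: $15-30/min
print(f"non-exclusive: ${low:,}-${high:,}")             # $900,000-$1,800,000
print(f"exclusive at 5x: ${low * 5:,}-${high * 5:,}")   # $4,500,000-$9,000,000
```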

Dataset reusability drives ROI. A kitchen manipulation dataset annotated for 'pick apple' transfers to 'pick orange' with zero re-annotation but fails on 'open jar' (different affordance). The Open X-Embodiment consortium's cross-embodiment benchmark shows that policies trained on 10 diverse datasets outperform those trained on 10× more data from a single domain[6]. Buyers should prioritize domain diversity over raw annotation volume when budget is constrained.

Use the references below to move from category-level context into specific task, dataset, format, and comparison detail.

External references and source context

  1. World Models (worldmodels.github.io). World models use temporal action boundaries to segment continuous experience into discrete state transitions for improved RL sample efficiency.
  2. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100 (arXiv). EPIC-KITCHENS-100 contains 90,000 action segments across 100 hours of egocentric kitchen video with 84% inter-annotator agreement on temporal boundaries.
  3. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv). DROID captures 76,000 manipulation trajectories, totaling 350 hours of interaction data across 564 diverse real-world scenes, with task-level annotation.
  4. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch (arXiv). LeRobot datasets use a two-tier structure with task-level labels for retrieval and dense robot state logs for policy training.
  5. Ego4D: Around the World in 3,000 Hours of Egocentric Video (arXiv). Ego4D annotates 3,670 hours across 74 scenarios with a 110-verb, 478-noun taxonomy, achieving Cohen's kappa of 0.71 after taxonomy revision.
  6. Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). Open X-Embodiment aggregates 1 million trajectories across 22 robot platforms, harmonizing 527 task descriptions into a unified taxonomy.
  7. Scale AI Physical AI (scale.com). The Scale AI Physical AI platform combines interpolation with active learning to achieve 95% of manual annotation accuracy at 3× speed.
  8. RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning (arXiv). The RLDS standard wraps temporal annotations in TFRecord episodes with synchronized sensor streams for end-to-end policy training.
  9. CloudFactory industrial robotics (cloudfactory.com). CloudFactory offers custom QC pipelines for manipulation datasets, including physics-based validation and affordance checks.
  10. RT-1: Robotics Transformer for Real-World Control at Scale (arXiv). RT-1 trained on 130,000 robot trajectories spanning 700 tasks, achieving 97% success on seen tasks.
  11. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv). RT-2 added 6 billion web images to training, improving zero-shot generalization to 62% on unseen tasks.
  12. RoboNet: Large-Scale Multi-Robot Learning (arXiv). RoboNet demonstrated that multi-robot data from 7 platforms improves sim-to-real transfer by 34% over single-robot datasets.
  13. Something-Something v2 dataset page (developer.qualcomm.com). Something-Something v2 contains 220,000 third-person videos annotated with 174 fine-grained action classes.
  14. Labelbox (labelbox.com). Labelbox provides workflow management and inter-annotator agreement tracking for $500–2,000/month.
  15. Sama (sama.com). Sama focuses on ethical annotation sourcing with living-wage employment in Kenya and Uganda.
  16. Kognic platform (kognic.com). Kognic annotates multi-sensor streams with microsecond synchronization for outdoor mobile manipulation.
  17. RoboNet dataset license (GitHub). RoboNet is licensed under CC BY 4.0, allowing unlimited commercial use without exclusivity.
  18. truelabel data provenance glossary (truelabel.ai). The truelabel provenance standard requires disclosure of annotator training protocols, QC pass rates, and per-class IAA distributions.


FAQ

What is the difference between activity annotation and action recognition?

Activity annotation is the human labeling process that produces ground-truth temporal boundaries and action labels in video. Action recognition is the machine learning task of automatically predicting those labels from raw video. Annotation creates the training data; recognition is the model trained on that data. High-quality annotation with ≥80% inter-annotator agreement is required to train recognition models that exceed 70% accuracy on held-out test sets.

How many annotated demonstrations do I need to train a robot manipulation policy?

Vision-language-action models like RT-1 require 10,000–100,000 demonstrations to generalize across tasks. Imitation learning policies for single tasks (pick-and-place, drawer opening) can succeed with 500–2,000 demonstrations if the environment is controlled. Domain diversity matters more than raw count: 5,000 demonstrations across 10 environments outperform 50,000 from a single kitchen. Budget 60% of data spend on collection, 25% on annotation, 15% on QC and metadata.

Can I use egocentric human video to train a robot with a wrist-mounted camera?

Yes, if the camera viewpoint and field-of-view match. Policies trained on EPIC-KITCHENS egocentric video transfer to robots with wrist cameras at 60–75% of human performance, but fail when the robot camera is mounted on the shoulder or overhead. Viewpoint adaptation layers (learned transforms between human and robot perspectives) recover 10–20% of lost performance but require 500–1,000 robot demonstrations for training. Match annotation viewpoint to deployment camera configuration whenever possible.

What inter-annotator agreement threshold should I require from annotation vendors?

Target ≥80% temporal IoU for action boundaries and ≥85% label accuracy for verb/noun classes. EPIC-KITCHENS-100 reports 84% IoU and 91% label accuracy after two-pass review. Below 75% IoU, boundary errors corrupt policy learning; below 80% label accuracy, the model learns incorrect action semantics. Request per-class IAA breakdowns — rare actions often have 10–20% lower agreement than common ones and may need targeted re-annotation.

Should I annotate at atomic or task level for robot imitation learning?

Task-level annotation (one label per 10–30 second episode) suffices for high-level policy learning and costs 5–10× less than atomic annotation. Atomic annotation (one label per 0.5–2 second action primitive) is required only when you need interpretable failure diagnosis or plan to compose learned primitives into novel task sequences. The DROID dataset demonstrates that task-level labels plus dense robot state logs enable policy training without per-frame action labels, cutting annotation cost by 80% with no performance loss.

How do I validate that an activity-annotated dataset will transfer to my domain?

Request a 10-hour sample covering your target tasks and environments. Train a baseline policy on the sample, then measure zero-shot performance on 20 held-out episodes in your deployment environment. If success rate exceeds 40%, the dataset likely transfers; below 20% signals domain mismatch requiring re-annotation or data augmentation. Check taxonomy coverage: ≥80% of your target actions should appear in the sample. Verify temporal resolution matches your control frequency (30 Hz for most manipulation, 10 Hz for navigation).

Find datasets covering activity annotation

Truelabel surfaces vetted datasets and capture partners working with activity annotation. Tell us the modality, scale, and rights you need, and we'll route you to the closest match.

Browse Activity-Annotated Datasets