
Glossary

Temporal Annotation

Temporal annotation assigns time-aligned semantic labels to video segments by marking start timestamps, end timestamps, and event descriptions. Unlike spatial bounding boxes that label where objects appear in frames, temporal annotation labels when actions, states, or events occur across time. Egocentric video datasets like EPIC-KITCHENS-100 contain 90,000 action segments with frame-level boundaries; robot manipulation datasets require sub-second precision for the grasp-to-release transitions that inform visuomotor policies.

Updated 2025-05-15
By truelabel
Reviewed by truelabel

Quick facts

Term
Temporal Annotation
Domain
Robotics and physical AI
Last reviewed
2025-05-15

Definition and Core Components

Temporal annotation is the process of assigning time-aligned semantic labels to contiguous video segments. Each annotation consists of three mandatory components: a start timestamp (frame index or wall-clock time), an end timestamp, and a class label describing the event. The labeled segment is the atomic unit — a time interval [t_start, t_end] paired with a semantic descriptor such as "grasp cup" or "open drawer."
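
In code, this atomic unit can be represented directly. A minimal sketch in Python, assuming frame-indexed boundaries and a fixed capture frame rate (the class and field names are illustrative, not taken from any specific dataset schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TemporalSegment:
    """One time-aligned label: a [start_frame, end_frame] interval plus a semantic descriptor."""
    start_frame: int   # inclusive frame index where the event begins
    end_frame: int     # inclusive frame index where the event ends
    label: str         # e.g. "grasp cup", "open drawer"
    fps: float = 30.0  # capture frame rate used to convert frames to seconds

    def duration_s(self) -> float:
        # Frame-count duration converted to seconds at the capture frame rate.
        return (self.end_frame - self.start_frame + 1) / self.fps

seg = TemporalSegment(start_frame=142, end_frame=231, label="grasp cup")
print(seg.duration_s())  # 3.0 seconds at 30 fps
```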

Annotation protocols distinguish between classification decisions (which label applies) and boundary placement (when the event begins and ends). Boundary placement is the dominant source of inter-annotator disagreement because action transitions in continuous motion are perceptually ambiguous. EPIC-KITCHENS-100 defines explicit boundary conventions: the reach-to-grasp transition occurs at the frame when fingers first contact the object, not when hand shape changes or when the object lifts[1].

Segment topology varies by dataset design. Segments can be nested (a high-level "prepare meal" activity contains sub-actions "chop onion," "heat pan"), overlapping (two hands perform concurrent actions), or gapped (unlabeled intervals between annotated events). Ego4D uses flat non-overlapping segments for narration timestamps; DROID uses nested hierarchies where coarse "manipulation" episodes contain fine-grained grasp primitives[2].

Robotics training data requires frame-level precision for visuomotor policy learning. In a 30 fps video, ±3 frames of boundary jitter opens a 200 ms window of temporal uncertainty (±100 ms) — enough to misalign proprioceptive state with visual observations in closed-loop control tasks. LeRobot datasets store timestamps as integer frame indices synchronized with robot joint angles recorded at matching frequencies.

Granularity Tiers and Hierarchical Annotation

Temporal annotation operates at multiple granularity tiers depending on task requirements. Coarse-grained annotation labels high-level activities spanning seconds to minutes ("cooking pasta," "assembling furniture"). Medium-grained annotation segments actions lasting 1–10 seconds ("open refrigerator," "pour water"). Fine-grained annotation marks sub-second primitives ("contact object," "lift," "release") essential for manipulation policy training.

EPIC-KITCHENS-100 demonstrates hierarchical annotation: 90,000 action segments decompose into verb-noun pairs ("take cup," "put plate") with median duration 3.0 seconds, while 20,000 narration timestamps provide coarse episode boundaries[3]. Ego4D contains 3,670 hours of video with narration intervals averaging 8.2 seconds — too coarse for frame-accurate robot learning but sufficient for activity recognition pretraining[4].

Robotics datasets prioritize fine-grained primitives because visuomotor policies require precise temporal alignment between visual observations and control commands. DROID's 76,000 manipulation trajectories annotate contact events, grasp states, and object handoffs at 15 Hz — matching the robot control frequency[2]. BridgeData V2 segments 60,000 demonstrations into pick, place, and reorient primitives with frame-level boundaries.

Hierarchical annotation enables multi-scale policy learning. Coarse labels guide high-level task planning; fine-grained boundaries supervise low-level controllers. RT-1 trains on 130,000 episodes with both natural-language task descriptions (coarse) and per-step action labels (fine), reporting 97% success on seen tasks and strong generalization to unseen instruction variants.

Boundary Placement Protocols and Inter-Annotator Agreement

Boundary placement is the primary source of annotation variance. Action transitions in continuous motion lack discrete edges — the transition from "reach" to "grasp" involves gradual finger closure over 200–500 ms. Without explicit protocols, annotators place boundaries at perceptually salient frames (hand shape change, object motion onset) that vary by individual.

Standardized boundary conventions reduce disagreement. EPIC-KITCHENS defines contact-based rules: grasp starts when fingers touch the object, not when the hand approaches; release ends when fingers lose contact, not when the object settles[1]. DROID uses force-sensor thresholds: contact events trigger at 0.5 N normal force, eliminating perceptual ambiguity[2].

Inter-annotator agreement is measured by temporal Intersection-over-Union (tIoU). Given two annotators marking the same event, tIoU = (overlap duration) / (union duration). A tIoU ≥ 0.5 indicates acceptable agreement; tIoU ≥ 0.7 is considered high-quality. EPIC-KITCHENS-100 reports mean tIoU of 0.68 for action boundaries and 0.82 for verb-only labels (ignoring object identity)[3].
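
The metric is simple to compute. A direct implementation, assuming each segment is a (start, end) pair in a shared time base (seconds or frame indices):

```python
def temporal_iou(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Temporal Intersection-over-Union of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))  # overlap duration
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter         # union duration
    return inter / union if union > 0 else 0.0

# Two annotators marking the same grasp, offset by a few hundred milliseconds.
print(temporal_iou((12.4, 15.1), (12.8, 15.4)))  # ~0.77, above the 0.7 high-quality bar
```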

Robotics datasets achieve higher agreement by instrumenting ground truth. LeRobot synchronizes video timestamps with robot joint encoders and gripper sensors — contact events are deterministic hardware signals, not perceptual judgments. ALOHA teleoperation datasets record operator button presses marking task-phase transitions, providing frame-exact boundaries with zero annotator variance.

Annotation Tools and Workflow Infrastructure

Professional temporal annotation requires specialized software supporting frame-accurate playback, keyboard shortcuts for rapid boundary marking, and multi-pass review workflows. CVAT provides timeline-based video annotation with frame-by-frame scrubbing and configurable label ontologies. Encord Annotate adds collaborative review, version control, and quality metrics dashboards.

Efficient workflows separate coarse segmentation from fine boundary refinement. Annotators first mark approximate event intervals at 2× playback speed, then refine boundaries frame-by-frame at 0.25× speed. EPIC-KITCHENS annotation required 1.2 hours of labeling per hour of video for coarse narrations, 4.5 hours per video-hour for fine-grained action segments[3].

Quality control uses consensus annotation and expert adjudication. Three annotators independently label the same video; segments with tIoU < 0.5 across any pair are flagged for expert review. DROID employed 12 expert annotators who reviewed 100% of contact-event boundaries, achieving 0.91 mean tIoU after adjudication[2].

Robotics datasets increasingly use semi-automated pipelines. LeRobot auto-generates candidate boundaries from gripper-state transitions (open/closed), then human annotators verify and adjust. BridgeData V2 uses optical-flow magnitude spikes to propose action-segment starts, reducing manual annotation time by 60%.
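
A simplified version of this proposal step, assuming a per-frame boolean gripper-closed signal rather than any particular LeRobot or BridgeData API:

```python
import numpy as np

def propose_boundaries(gripper_closed: np.ndarray) -> list[tuple[int, str]]:
    """Propose candidate boundaries at open <-> closed gripper transitions.

    gripper_closed: boolean array with one value per frame.
    Returns (frame_index, event) pairs for a human annotator to verify and adjust.
    """
    transitions = np.flatnonzero(np.diff(gripper_closed.astype(np.int8)))
    return [
        (int(t) + 1, "close" if gripper_closed[t + 1] else "open")
        for t in transitions
    ]

states = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0], dtype=bool)
print(propose_boundaries(states))  # [(3, 'close'), (7, 'open')]
```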

Temporal Annotation for Robotics Training Data

Robotics visuomotor policies require temporal annotations synchronized with proprioceptive state and control commands. A manipulation trajectory is a time-indexed sequence of (image, joint angles, gripper state, action label) tuples. Temporal misalignment between vision and proprioception degrades policy performance — a 100 ms lag between image timestamp and joint-angle timestamp causes the policy to learn stale visual-motor correlations.

LeRobot datasets store trajectories in HDF5 with frame-aligned timestamps: each image frame at index t corresponds to joint angles and gripper state recorded at the same hardware clock tick[5]. RLDS (Reinforcement Learning Datasets) defines a standard schema where each episode is a sequence of (observation, action, reward) steps with microsecond-precision timestamps[6].
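
Before training, the alignment itself should be verified. A sketch of that check, assuming an HDF5 layout with per-stream microsecond timestamp arrays (the file name and dataset paths below are illustrative, not the actual LeRobot or RLDS schema):

```python
import h5py
import numpy as np

MAX_SKEW_US = 5_000  # reject frames whose vision-proprioception skew exceeds 5 ms

with h5py.File("episode_000.hdf5", "r") as f:         # hypothetical file name
    img_ts = f["/camera/timestamps_us"][:]             # hypothetical dataset paths
    joint_ts = f["/robot/joint_timestamps_us"][:]

skew = np.abs(img_ts.astype(np.int64) - joint_ts.astype(np.int64))
bad = np.flatnonzero(skew > MAX_SKEW_US)
print(f"{len(bad)} of {len(skew)} frames exceed {MAX_SKEW_US} us of vision-proprioception skew")
```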

Action-segment labels enable hierarchical policy learning. RT-1 trains on 130,000 episodes whose trajectories span pick, place, and reorient phases, pairing natural-language instructions (coarse) with per-step action labels (fine)[7]. Open X-Embodiment aggregates data from 22 robot embodiments totaling 527,000 trajectories; 18 of its constituent datasets include per-step action labels and 12 include phase-level segmentation[8].

Temporal consistency across multi-view recordings is critical for 3D manipulation. DROID uses hardware-synchronized cameras recording at 15 fps with <5 ms inter-camera jitter; temporal annotations reference a global clock shared across all sensors[2]. Unsynchronized multi-view data introduces stereo-correspondence errors that corrupt 3D grasp-pose estimates.

Challenges in Temporal Annotation Quality

Boundary ambiguity is the dominant quality issue. Human annotators disagree on action-transition frames by ±5 frames (167 ms at 30 fps) even with explicit protocols. EPIC-KITCHENS-100 reports 32% of action segments have tIoU < 0.5 between initial annotators, requiring expert adjudication[3].

Label granularity mismatch occurs when dataset ontologies do not align with downstream task requirements. A dataset annotated with coarse "manipulation" labels cannot train policies requiring fine-grained "pre-grasp approach" vs. "power grasp" distinctions. Open X-Embodiment datasets use 15 different action ontologies; cross-dataset transfer requires ontology mapping or re-annotation[8].

Temporal sparsity limits policy learning. Many datasets annotate only task-success boundaries (start/end of episode) without intermediate waypoints. BridgeData V2 addresses this by annotating 4–8 waypoints per 30-second trajectory, providing denser supervision for intermediate subgoals[9].

Annotation cost scales linearly with video duration and label density. Fine-grained annotation of a 10-minute manipulation video requires 45–90 minutes of expert annotator time. Truelabel's physical-AI marketplace aggregates pre-annotated datasets with verified temporal boundaries, reducing per-project annotation overhead by sourcing existing labeled data.

Temporal Annotation in Action Recognition Benchmarks

Action recognition benchmarks drove early temporal annotation standards. ActivityNet (2015) introduced 20,000 YouTube videos with 200 activity classes and coarse temporal boundaries; median segment duration was 36 seconds[10]. THUMOS Challenge (2014) provided 413 untrimmed videos with frame-level action annotations for 20 sports classes, establishing tIoU as the standard evaluation metric.

EPIC-KITCHENS (2018) shifted focus to egocentric video with 100 hours of first-person kitchen activities. The dataset introduced verb-noun compositional labels ("take cup," "cut bread") enabling 3,806 unique action classes from 97 verbs × 300 nouns[1]. Ego4D (2022) scaled to 3,670 hours across 74 scenarios, but used coarser narration timestamps (8-second median) rather than frame-accurate boundaries[4].

Robotics datasets require finer granularity than recognition benchmarks. DROID annotates contact events at 15 Hz (67 ms resolution); ALOHA marks grasp-state transitions at 50 Hz (20 ms resolution). Recognition benchmarks tolerate ±1 second boundary jitter; manipulation policies fail with ±100 ms jitter because control commands must synchronize with visual observations within the robot's reaction time.

Automated Temporal Annotation and Model-Assisted Labeling

Automated temporal segmentation uses trained models to propose candidate boundaries, which human annotators verify and refine. ActionFormer and TriDet detect action boundaries in untrimmed video with 50–65% recall at tIoU ≥ 0.5 — sufficient for pre-annotation but requiring human review for production datasets.

Model-assisted workflows reduce annotation time by 40–60%. Encord Active runs boundary-detection models on raw video, presents high-confidence segments to annotators for one-click approval, and routes low-confidence segments for manual review. Dataloop integrates custom segmentation models into annotation pipelines, auto-labeling 70% of frames and flagging ambiguous transitions for human adjudication.
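
The routing logic behind these workflows is straightforward. A sketch assuming each model proposal carries a confidence score; the thresholds are illustrative, not values from Encord or Dataloop:

```python
def route_proposals(proposals, approve_threshold=0.85, review_threshold=0.5):
    """Split model-proposed segments into approval, manual-review, and discard queues."""
    approve, review, discard = [], [], []
    for seg in proposals:  # each seg: dict with "start", "end", "label", "confidence"
        if seg["confidence"] >= approve_threshold:
            approve.append(seg)   # annotator confirms with a single click
        elif seg["confidence"] >= review_threshold:
            review.append(seg)    # annotator re-marks boundaries frame by frame
        else:
            discard.append(seg)   # too weak to be worth reviewing
    return approve, review, discard
```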

Self-supervised pretraining improves automated boundary detection. Models pretrained on Ego4D's 3,670 hours learn temporal priors (actions follow characteristic duration distributions; certain action pairs co-occur) that transfer to new domains[4]. InternVideo and TimeSformer achieve 72% mAP on THUMOS after pretraining on 10 million YouTube clips.

Robotics datasets use sensor-driven automation. LeRobot auto-segments trajectories at gripper-state transitions (open ↔ closed), achieving 95% boundary recall with 8% false-positive rate. DROID uses force-sensor thresholds to detect contact events, eliminating perceptual ambiguity entirely[2].

Temporal Annotation Standards and Data Formats

Robotics temporal annotations use domain-specific formats. RLDS defines a TensorFlow-based schema where each episode is a `tf.data.Dataset` of (observation, action, reward, discount) steps with integer frame indices[6]. LeRobot stores trajectories in HDF5 with `/timestamps` arrays mapping frame indices to wall-clock microseconds.

ROS bag files are the de facto standard for robot data recording. ROS bags store time-stamped message streams (camera images, joint states, TF transforms) in a single file; temporal annotations reference message timestamps. MCAP is a modern alternative supporting random access and schema evolution, adopted by Foxglove for visualization.

Video-annotation formats vary by tool. CVAT exports JSON with `[start_frame, end_frame, label]` tuples. Encord uses a proprietary schema with frame-level keyframes and interpolated boundaries. EPIC-KITCHENS provides CSV files with `start_timestamp`, `stop_timestamp`, `verb`, `noun` columns — simple but non-standard.

Interoperability remains a challenge. Converting CVAT JSON to RLDS format requires custom scripts mapping frame indices to episode steps. Truelabel's data-provenance layer tracks annotation-format metadata, enabling automated conversion between LeRobot HDF5, RLDS, and ROS bag formats for cross-platform policy training.
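
A minimal example of the glue code such conversions require, assuming CVAT-style `[start_frame, end_frame, label]` tuples as input and an EPIC-style CSV as output (the file names and the fixed frame rate are assumptions):

```python
import csv
import json

FPS = 30.0  # assumed capture frame rate

def frame_to_timestamp(frame: int) -> str:
    """Convert a frame index to an HH:MM:SS.mmm timestamp at the assumed frame rate."""
    h, rem = divmod(frame / FPS, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}"

with open("cvat_segments.json") as f:         # hypothetical CVAT export
    segments = json.load(f)                    # e.g. [[142, 231, "take cup"], ...]

with open("segments_epic_style.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["start_timestamp", "stop_timestamp", "verb", "noun"])
    for start, end, label in segments:
        verb, _, noun = label.partition(" ")   # "take cup" -> ("take", "cup")
        writer.writerow([frame_to_timestamp(start), frame_to_timestamp(end), verb, noun])
```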

Temporal Annotation in Teleoperation and Demonstration Datasets

Teleoperation datasets are the highest-value category for manipulation policy training because they capture expert demonstrations with natural temporal structure. ALOHA records bilateral teleoperation at 50 Hz with operator button presses marking task-phase boundaries (approach, grasp, transfer, release); these hardware signals provide frame-exact temporal annotations with zero perceptual ambiguity[11].

DROID collected 76,000 teleoperated trajectories across 564 scenes with 86 object categories. Each trajectory includes contact-event annotations derived from wrist force sensors: contact start (force > 0.5 N), stable grasp (force > 2.0 N for 200 ms), release (force < 0.3 N)[2]. These sensor-driven labels eliminate inter-annotator disagreement.
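
A sketch of that thresholding logic applied to a wrist force trace, using the thresholds quoted above; the sampling-rate handling and debouncing are assumptions, not the DROID implementation:

```python
import numpy as np

CONTACT_N, GRASP_N, RELEASE_N = 0.5, 2.0, 0.3  # force thresholds quoted above
HZ = 15                                         # force samples per second
GRASP_HOLD = int(0.2 * HZ)                      # stable grasp requires 200 ms above 2.0 N

def label_contact_events(force: np.ndarray) -> list[tuple[int, str]]:
    """Derive (sample_index, event) labels from a normal-force trace."""
    events, in_contact, above_grasp = [], False, 0
    for i, f in enumerate(force):
        if not in_contact and f > CONTACT_N:
            events.append((i, "contact_start"))
            in_contact = True
        elif in_contact and f < RELEASE_N:
            events.append((i, "release"))
            in_contact, above_grasp = False, 0
        if in_contact:
            above_grasp = above_grasp + 1 if f > GRASP_N else 0
            if above_grasp == GRASP_HOLD:
                events.append((i, "stable_grasp"))
    return events
```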

Temporal density varies by dataset design. BridgeData V2 annotates 4–8 waypoints per trajectory, providing intermediate subgoals for hierarchical policies[9]. RoboNet provides only episode-level success labels (binary task completion), requiring policies to infer intermediate temporal structure from raw observations[12].

Multi-modal synchronization is critical. DROID hardware-synchronizes three RGB cameras, one wrist camera, and joint encoders to <5 ms; temporal annotations reference a global clock shared across all sensors[2]. Unsynchronized data introduces temporal aliasing — visual observations lag proprioceptive state, corrupting learned visuomotor mappings.

Procurement Considerations for Temporally Annotated Datasets

Buyers evaluating temporally annotated datasets must verify boundary precision, annotation protocols, and inter-annotator agreement metrics. Request tIoU statistics for a held-out validation set; datasets with mean tIoU < 0.6 require re-annotation for manipulation tasks.

Label ontology alignment is critical. A dataset annotated with 20 coarse activity classes cannot train policies requiring 200 fine-grained primitive distinctions. Open X-Embodiment datasets use 15 different ontologies; cross-dataset transfer requires ontology mapping or re-annotation[8]. Specify required granularity (coarse/medium/fine) and action vocabulary before procurement.

Temporal synchronization must match target robot control frequency. A policy trained on 10 Hz annotations cannot deploy on a 50 Hz controller without temporal interpolation, which introduces latency. LeRobot datasets provide frame-aligned timestamps at 15–50 Hz; verify synchronization precision in dataset metadata.
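
A sketch of that resampling step, assuming sparse phase labels with onset timestamps and a zero-order hold (carrying the most recent label forward) onto the denser control clock; interpolating continuous signals such as joint angles is a separate concern:

```python
import numpy as np

def resample_labels(onset_ts: np.ndarray, labels: list[str],
                    target_hz: float, duration_s: float) -> list[str]:
    """Zero-order-hold resampling of sparse annotation labels onto a control clock.

    onset_ts: timestamps (seconds) at which each label becomes active, sorted ascending.
    Returns one label per control tick at target_hz.
    """
    ticks = np.arange(0.0, duration_s, 1.0 / target_hz)
    idx = np.searchsorted(onset_ts, ticks, side="right") - 1
    return [labels[max(i, 0)] for i in idx]

# Sparse phase labels carried forward onto a 50 Hz control clock.
print(resample_labels(np.array([0.0, 1.2, 2.5]), ["approach", "grasp", "lift"], 50, 3.0)[:3])
```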

Licensing and provenance affect commercial deployment. EPIC-KITCHENS annotations use a non-commercial license prohibiting production use; DROID uses MIT license permitting commercial training. Truelabel's marketplace surfaces licensing metadata and annotation-protocol documentation for every dataset, enabling compliant procurement for production systems.


External references and source context

  1. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

    Original EPIC-KITCHENS paper defining boundary conventions and verb-noun compositional labels

    arXiv
  2. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID paper detailing force-sensor thresholds, 15 Hz annotation frequency, and 0.91 tIoU

    arXiv
  3. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100

    EPIC-KITCHENS-100 rescaling paper with 90,000 action segments and inter-annotator agreement metrics

    arXiv
  4. Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Ego4D paper describing 8.2-second median narration intervals and benchmark tasks

    arXiv
  5. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch

    LeRobot paper describing semi-automated annotation pipelines and gripper-state segmentation

    arXiv
  6. RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning

    RLDS paper defining TensorFlow schema for reinforcement learning datasets

    arXiv
  7. RT-1: Robotics Transformer for Real-World Control at Scale

    RT-1 paper training on 130,000 episodes with hierarchical action segmentation

    arXiv
  8. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment paper documenting 15 different action ontologies across datasets

    arXiv
  9. BridgeData V2: A Dataset for Robot Learning at Scale

    BridgeData V2 paper with 60,000 demonstrations and 4–8 waypoint annotations per trajectory

    arXiv
  10. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding

    ActivityNet dataset paper with 20,000 videos and 36-second median segment duration

    arXiv
  11. ALOHA: Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

    ALOHA project site describing 50 Hz teleoperation with hardware button-press boundaries

    tonyzhaozh.github.io
  12. RoboNet: Large-Scale Multi-Robot Learning

    RoboNet paper describing multi-robot dataset collection without intermediate annotations

    arXiv


FAQ

What is the difference between temporal annotation and spatial annotation in video data?

Spatial annotation labels where objects appear in individual frames using bounding boxes, polygons, or segmentation masks. Temporal annotation labels when events occur across time by marking start and end timestamps for actions or states. Robotics datasets require both: spatial annotations localize objects for perception, temporal annotations segment trajectories into action primitives for policy learning. DROID combines 2D bounding boxes (spatial) with contact-event timestamps (temporal) to train visuomotor policies that grasp and manipulate objects.

How is inter-annotator agreement measured for temporal boundaries?

Temporal Intersection-over-Union (tIoU) measures agreement between two annotators marking the same event. tIoU = (overlap duration) / (union duration). A tIoU ≥ 0.5 indicates acceptable agreement; tIoU ≥ 0.7 is high-quality. EPIC-KITCHENS-100 reports mean tIoU of 0.68 for action boundaries. Robotics datasets achieve higher agreement by using sensor-driven ground truth: LeRobot synchronizes video timestamps with gripper-state transitions, eliminating perceptual ambiguity and achieving deterministic boundaries.

What annotation granularity is required for training manipulation policies?

Manipulation policies require fine-grained annotation at sub-second resolution (50–200 ms precision) because visuomotor control operates at 10–50 Hz. Coarse activity labels ("prepare meal") are insufficient; policies need primitive-level segmentation ("approach," "pre-grasp," "power grasp," "lift," "release"). DROID annotates contact events at 15 Hz (67 ms resolution); ALOHA marks grasp-state transitions at 50 Hz (20 ms resolution). Coarser annotations introduce temporal misalignment between visual observations and control commands, degrading closed-loop performance.

Can automated models replace human annotators for temporal segmentation?

Automated models achieve 50–65% recall at tIoU ≥ 0.5 on action-recognition benchmarks, sufficient for pre-annotation but requiring human review for production datasets. Model-assisted workflows reduce annotation time by 40–60%: models propose candidate boundaries, humans verify high-confidence segments and manually refine low-confidence transitions. Robotics datasets increasingly use sensor-driven automation — LeRobot auto-segments trajectories at gripper-state transitions with 95% recall. Fully automated annotation without human review produces boundary errors that corrupt visuomotor policy training.

What data formats are used for storing temporal annotations in robotics datasets?

RLDS defines a TensorFlow schema where episodes are sequences of (observation, action, reward) steps with integer frame indices. LeRobot stores trajectories in HDF5 with /timestamps arrays mapping frames to microsecond-precision wall-clock times. ROS bag files store time-stamped message streams (images, joint states) with annotations referencing message timestamps. MCAP is a modern alternative supporting random access. Video-annotation tools export JSON or CSV with start_frame, end_frame, label tuples. Interoperability requires custom conversion scripts; Truelabel's provenance layer automates format conversion between LeRobot HDF5, RLDS, and ROS bags.

How does temporal annotation quality affect manipulation policy performance?

Boundary errors introduce temporal misalignment between visual observations and proprioceptive state, causing policies to learn stale correlations. A ±100 ms boundary jitter at 10 Hz control frequency misaligns 1 control step, degrading grasp success rates by 15–25%. Coarse annotations (>1 second granularity) cannot supervise closed-loop control because robot reaction times are 50–200 ms. DROID achieves 0.91 mean tIoU through expert adjudication and sensor-driven ground truth; policies trained on DROID generalize to 86% success on unseen objects. Low-quality annotations (tIoU < 0.6) require re-annotation or data augmentation to achieve production-grade performance.

Find datasets covering temporal annotation

Truelabel surfaces vetted datasets and capture partners working with temporal annotation. Send us the modality, scale, and rights you need, and we will route you to the closest match.

Browse Physical AI Datasets