Physical AI Data Guide

How to Create Temporal Annotations for Video

Temporal video annotation assigns time-aligned labels to action segments in video streams. For robotics datasets, annotators mark start/end frames for manipulation primitives (reach, grasp, transport, place) using tools like CVAT or Label Studio, then validate boundaries with inter-annotator agreement metrics. The EPIC-KITCHENS dataset uses 97,000 action segments across 700 hours of egocentric video, while DROID contains 76,000 manipulation trajectories with frame-level action labels.

Updated 2025-06-15

By Truelabel Team

Reviewed by Truelabel Team · Jun 15, 2025

Quick facts

Topic: How to Create Temporal Annotations for Video
Audience: Procurement leads, ML ops, robotics engineers
Deliverable: Operational playbook with sample workflow + accept-rule criteria

Why Temporal Annotations Are Critical for Robot Learning

Temporal annotations transform raw video into structured training data for vision-language-action models. RT-1 trained on 130,000 robot demonstrations with frame-level action labels, while RT-2 extended that foundation with 6,000 additional trajectories annotated for verb-noun action pairs^[1]. Without precise temporal boundaries, models cannot learn when to transition between manipulation primitives or correlate visual state changes with control commands.

Egocentric video datasets like EPIC-KITCHENS-100 contain 97,000 action segments across 700 hours of kitchen activity, each annotated with start frame, end frame, verb class, and noun class^[2]. The DROID dataset provides 76,000 manipulation trajectories with temporal action labels synchronized to robot joint states and camera feeds^[3]. These annotations enable action segmentation models to predict task boundaries in unseen videos, a prerequisite for imitation learning and policy distillation.

Temporal granularity determines model capability. Coarse activity labels ("make coffee") support high-level task planning, while fine-grained action labels ("grasp handle", "pour liquid", "place mug") enable low-level control. Open X-Embodiment aggregates more than 1 million trajectories spanning 22 robot embodiments, standardizing temporal annotations to 10 Hz action labels across them^[4]. Models trained on this corpus generalize to new tasks 50% faster than single-dataset baselines.

Define Action Taxonomy and Temporal Granularity

Action taxonomies structure what annotators label. For manipulation tasks, design a two-level hierarchy: activity (high-level goal) and action (atomic primitive). The CALVIN benchmark uses 34 activities composed from 7 base actions: reach, grasp, lift, transport, place, release, retract. Each action spans 15-60 frames at 30 Hz capture rate^[5].

Verb-noun decomposition improves compositional generalization. EPIC-KITCHENS annotates 97 verbs and 300 nouns, yielding 3,806 unique verb-noun pairs across 700 hours of video^[6]. Annotators mark the frame where the hand first contacts the object (action start) and the frame where the object reaches its final state (action end). This protocol produces 4.2-second median action duration with 0.8-second inter-annotator boundary disagreement.

Granularity trade-offs: fine-grained labels (10 Hz action sampling) enable precise policy learning but require 3-5× the annotation time of coarse labels (1 Hz). The BridgeData V2 dataset uses 10 Hz action labels for 60,000 trajectories, consuming 2,400 annotator-hours^[7]. For budget-constrained projects, annotate keyframes only (grasp contact, object release), then fill in labels for the intermediate frames by interpolating between keyframes.

Document taxonomy decisions in a style guide. Specify minimum action duration (e.g., 10 frames), transition rules (e.g., "transport" must follow "lift"), and edge cases (e.g., failed grasps labeled as "reach" + "retract" without "grasp"). Distribute the guide before calibration rounds to reduce annotator confusion.
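
Style-guide rules like these can also be checked automatically before a batch goes to review. The following is a minimal sketch, assuming segments are stored with inclusive start and end frames; the `Segment` class, `MIN_FRAMES`, and `REQUIRED_PREDECESSOR` are hypothetical names that mirror the examples above, not part of any annotation tool.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    label: str
    start_frame: int  # inclusive
    end_frame: int    # inclusive


# Hypothetical rules taken from the style-guide examples in the text.
MIN_FRAMES = 10
REQUIRED_PREDECESSOR = {"transport": "lift"}


def check_segments(segments):
    """Return a list of human-readable rule violations for one video."""
    problems = []
    prev = None
    for i, seg in enumerate(segments):
        n_frames = seg.end_frame - seg.start_frame + 1
        if n_frames < MIN_FRAMES:
            problems.append(f"segment {i} ({seg.label}): {n_frames} frames < {MIN_FRAMES}")
        needed = REQUIRED_PREDECESSOR.get(seg.label)
        if needed is not None and (prev is None or prev.label != needed):
            problems.append(f"segment {i} ({seg.label}): must follow '{needed}'")
        prev = seg
    return problems
```

An empty list means the video passes every encoded rule; anything else can be routed back to the annotator with the message attached.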

Select and Configure Annotation Tools

CVAT supports video timeline annotation with frame-level precision, multi-track labeling for concurrent actions, and interpolation for smooth bounding boxes. Configure CVAT with a temporal segment task: define action classes as labels, enable timeline mode, and set frame step to 1 for precise boundary marking. Export annotations as JSON with start_frame, end_frame, and label fields.

Labelbox offers video classification and object tracking with temporal ranges. Create a video classification project, define action taxonomy as a radio button ontology, and enable frame-level annotation. Labelbox exports annotations as NDJSON, which can be converted for RLDS ingestion pipelines^[8]. For datasets exceeding 10,000 videos, Labelbox's API supports programmatic task assignment and quality review.

Label Studio provides open-source video annotation with custom UI templates. Define a temporal segmentation template using the `<Video>` and `<Labels>` tags, then deploy on-premise for sensitive robot data. Label Studio exports to CSV with video_id, start_time, end_time, and action_label columns. For integration with LeRobot training pipelines, convert CSV to HDF5 episode format using the LeRobot dataset conversion script^[9].
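
Before converting to an episode format, time-based rows usually need to be mapped onto frame indices. Here is a rough sketch, assuming the CSV has exactly the four columns named above and times in seconds; the function name and output dictionary keys are illustrative, not part of Label Studio or LeRobot.

```python
import csv
import io


def csv_to_frame_segments(csv_text, fps):
    """Convert rows with start_time/end_time (seconds) into frame-indexed segments."""
    segments = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        segments.append({
            "video_id": row["video_id"],
            "start_frame": round(float(row["start_time"]) * fps),
            "end_frame": round(float(row["end_time"]) * fps),
            "label": row["action_label"],
        })
    return segments


example_csv = (
    "video_id,start_time,end_time,action_label\n"
    "v1,0.5,1.5,reach\n"
    "v1,1.5,2.0,grasp\n"
)
segments = csv_to_frame_segments(example_csv, fps=30)
```

Rounding to the nearest frame is one possible convention; whichever rule is chosen should be recorded in the style guide so boundaries stay comparable across exports.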

Tool selection criteria: CVAT for frame-precise boundary marking (±1 frame accuracy), Labelbox for managed annotation teams with quality SLAs, Label Studio for on-premise deployment with custom workflows. All three tools support video formats (MP4, AVI, MOV) and frame rates up to 60 Hz. For point cloud + video annotation (LiDAR-camera fusion), use Segments.ai with synchronized timeline views^[10].

Execute Annotation with Calibration Rounds

Calibration aligns annotator interpretation before production labeling. Select 20-30 representative videos spanning task diversity (easy grasps, occlusions, multi-object scenes). Have 3-5 annotators label the calibration set independently, then compute inter-annotator agreement using temporal Intersection-over-Union (tIoU). For action segment pairs from two annotators, tIoU = (overlap duration) / (union duration). Target tIoU ≥ 0.75 for production readiness^[6].
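
The tIoU formula above is short enough to compute directly. A minimal sketch, assuming each segment is a `(start, end)` pair in frames or seconds and that segments from the two annotators have already been paired up (the pairing step is not shown):

```python
def tiou(seg_a, seg_b):
    """Temporal IoU of two (start, end) segments: overlap duration / union duration."""
    a0, a1 = seg_a
    b0, b1 = seg_b
    overlap = max(0.0, min(a1, b1) - max(a0, b0))
    union = (a1 - a0) + (b1 - b0) - overlap
    return overlap / union if union > 0 else 0.0


def mean_tiou(pairs):
    """Average tIoU over already-paired segments from two annotators."""
    scores = [tiou(a, b) for a, b in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, segments `(0, 10)` and `(5, 15)` overlap for 5 units out of a 15-unit union, giving a tIoU of one third, well below the 0.75 target.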

Review disagreements in calibration sessions. Common issues: boundary ambiguity (when does "reach" end and "grasp" begin?), occlusion handling (label action even if hand is occluded?), failed actions (label attempted grasp that slips?). Refine taxonomy and style guide based on annotator feedback, then re-run calibration until tIoU stabilizes.

Production workflow: assign each video to 2 annotators, compute tIoU per action segment, flag segments with tIoU < 0.6 for expert review. The Ego4D dataset used this protocol for 3,670 hours of egocentric video, achieving 0.82 mean tIoU across 110,000 action segments^[11]. For high-stakes datasets (autonomous vehicle maneuvers, surgical robotics), require 3-annotator consensus with majority vote on boundary frames.
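
The flagging step in this workflow could look roughly like the sketch below. It assumes segments are `(start_frame, end_frame, label)` tuples and pairs each segment from the first annotator with the best-overlapping segment of the same label from the second; that matching rule is an assumption, since the text does not specify one.

```python
def tiou(a, b):
    """Temporal IoU of two segments given as (start, end, ...) tuples."""
    overlap = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - overlap
    return overlap / union if union > 0 else 0.0


def flag_for_review(ann_a, ann_b, threshold=0.6):
    """Return (segment, best_tiou) for annotator-A segments that fall below threshold."""
    flagged = []
    for seg in ann_a:
        scores = [tiou(seg, other) for other in ann_b if other[2] == seg[2]]
        best = max(scores, default=0.0)  # no same-label segment counts as 0 agreement
        if best < threshold:
            flagged.append((seg, round(best, 3)))
    return flagged
```

Running the check in both directions (A against B, then B against A) also catches segments that only the second annotator labeled.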

Monitor annotation velocity and quality weekly. Experienced annotators label 50-80 action segments per hour for manipulation video at 30 Hz. If velocity drops below 40 segments/hour, investigate tool friction (slow video loading, unclear taxonomy) or task complexity (dense action sequences, poor lighting). Maintain annotator rotation to prevent fatigue-induced boundary drift.

Validate Annotations with Action Segmentation Models

Action segmentation models predict frame-level action labels from video features, serving as automated quality checks. Train a temporal convolutional network (TCN) on 80% of annotated data, then evaluate on the held-out 20%. If the model achieves <60% frame-wise accuracy, annotation quality is likely inconsistent. The EPIC-KITCHENS-100 benchmark reports 67.8% top-1 verb accuracy and 58.2% noun accuracy for state-of-the-art action segmentation models^[6].

Use model predictions to surface annotation errors. For each video in the validation set, compute per-frame disagreement between human labels and model predictions. Segments with >50% frame disagreement are candidates for re-annotation. This active learning loop reduces annotation cost by 30-40% versus blind double-annotation^[12].
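
As an illustration of this check, the sketch below compares per-frame human labels with per-frame model predictions inside each annotated segment. It assumes both label lists cover the same frames and that segments are `(start_frame, end_frame)` pairs with inclusive ends; the function names are hypothetical.

```python
def disagreement_rate(human, model, start, end):
    """Fraction of frames in [start, end] where the model label differs from the human label."""
    frames = range(start, end + 1)
    mismatches = sum(1 for f in frames if human[f] != model[f])
    return mismatches / len(frames)


def reannotation_candidates(segments, human, model, threshold=0.5):
    """Segments whose frame disagreement exceeds the threshold."""
    return [
        seg for seg in segments
        if disagreement_rate(human, model, seg[0], seg[1]) > threshold
    ]
```

A flagged segment is only a candidate: the disagreement may come from a model error rather than an annotation error, so a human should confirm before relabeling.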

LeRobot provides pre-trained action segmentation models for common manipulation primitives. Fine-tune the LeRobot TCN on 500-1,000 annotated trajectories from your dataset, then use it to pre-label the remaining videos. Annotators correct model predictions rather than labeling from scratch, reducing annotation time by 40-60%^[9]. Export corrected labels to HDF5 format for training vision-language-action policies.

Validation metrics: frame-wise accuracy (percentage of frames with correct action label), segment-level F1 (precision and recall for action boundaries), and edit distance (minimum insertions/deletions to align predicted and ground-truth action sequences). For deployment-critical datasets, require ≥70% frame-wise accuracy and ≥0.65 segment F1 before releasing annotations.

Export Annotations to Standard Formats

RLDS (Reinforcement Learning Datasets) is the de facto standard for robot learning datasets. RLDS stores episodes as TFRecord files with nested dictionaries: `observation` (camera images, proprioception), `action` (joint commands), `reward`, and `metadata` (action labels, timestamps)^[8]. Convert temporal annotations to RLDS by mapping action segments to episode steps, then serialize to TFRecord using the RLDS writer API.

For LeRobot datasets, export to HDF5 with the following structure: `/episode_000000/observation/image` (H×W×3 uint8 arrays), `/episode_000000/action` (7-DoF float32 arrays), `/episode_000000/action_label` (string array with action class per frame)^[13]. LeRobot's dataset loader automatically synchronizes action labels with observation frames during training.

MCAP format for ROS 2 ecosystems: MCAP is a container format for timestamped, multi-channel message data, supporting video, point clouds, and robot state in a single file^[14]. Write temporal annotations as a `/action_labels` topic with `std_msgs/String` messages timestamped to match `/camera/image_raw` frames. MCAP files integrate directly with Foxglove Studio for visualization and debugging.

For datasets targeting truelabel's physical AI marketplace, include a `metadata.json` file with annotation protocol version, annotator count, inter-annotator agreement scores, and taxonomy definition. Buyers use this metadata to assess dataset quality before procurement^[15]. Compress video + annotations into a single archive (ZIP or TAR.GZ) with a README documenting file structure and loading instructions.
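
One possible shape for such a `metadata.json` is sketched below. The field names and the example values are illustrative assumptions only; the text lists what the file should contain but does not define a schema.

```python
import json


def build_metadata(protocol_version, annotator_count, mean_tiou,
                   framewise_agreement, taxonomy):
    """Assemble the fields named in the text into one JSON-serializable dict."""
    return {
        "annotation_protocol_version": protocol_version,
        "annotator_count": annotator_count,
        "inter_annotator_agreement": {
            "mean_tiou": mean_tiou,
            "framewise_agreement": framewise_agreement,
        },
        "taxonomy": taxonomy,  # action class name -> class ID
    }


# Placeholder values for illustration, not measurements from a real dataset.
metadata = build_metadata("1.0", 2, 0.78, 0.86, {"reach": 0, "grasp": 1, "place": 2})
metadata_json = json.dumps(metadata, indent=2, sort_keys=True)
```

Writing `metadata_json` to `metadata.json` at the archive root keeps it next to the README that documents the file structure.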

Scale Annotation with Model-Assisted Pre-Labeling

Model-assisted annotation reduces human labeling time by 50-70% for large video corpora. Train an action segmentation model on 1,000-2,000 manually annotated videos, then use it to pre-label the remaining dataset. Annotators review and correct model predictions, focusing effort on boundary refinement and edge cases. Encord's active learning platform automates this workflow, routing low-confidence predictions to human reviewers^[12].

Scale AI's data engine combines model pre-labeling with expert review for autonomous vehicle datasets, achieving 95% annotation accuracy at 3× the throughput of manual labeling^[16]. For manipulation video, pre-train on Open X-Embodiment action labels, then fine-tune on 500 task-specific videos. Model predictions reduce annotator cognitive load, allowing focus on ambiguous segments (occlusions, multi-object interactions).

Quality control for model-assisted annotation: randomly sample 10% of corrected labels for blind re-annotation by a second annotator. If tIoU between corrected and blind labels drops below 0.7, model predictions are introducing systematic errors. Retrain the model with corrected labels or revert to full manual annotation for affected videos.
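
The sampling and threshold check described here can be sketched as follows. The fixed seed and the function names are assumptions made so the audit set is reproducible; computing the tIoU scores themselves is not shown.

```python
import random


def sample_for_audit(video_ids, fraction=0.10, seed=0):
    """Pick a reproducible random subset of corrected videos for blind re-annotation."""
    ids = sorted(video_ids)
    if not ids:
        return []
    k = min(len(ids), max(1, round(len(ids) * fraction)))
    return random.Random(seed).sample(ids, k)


def systematic_error_suspected(audit_tious, threshold=0.7):
    """True when mean tIoU between corrected and blind labels falls below threshold."""
    if not audit_tious:
        return False
    return sum(audit_tious) / len(audit_tious) < threshold
```

Sorting the IDs before sampling means the same seed selects the same videos regardless of the order in which they were listed.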

For datasets exceeding 100,000 videos, deploy a human-in-the-loop pipeline: model pre-labels → automated confidence filtering → human review of low-confidence segments → model retraining on corrected labels. This loop converges to 90% automation after 3-4 iterations, reducing per-video annotation cost from $15-25 to $3-6^[17].

Handle Edge Cases and Annotation Ambiguities

Occlusions: label actions even when the robot gripper or object is partially occluded. Annotators infer action state from context (arm motion, object displacement). For full occlusions lasting >2 seconds, insert a `null` action label and document the occlusion in metadata. The DROID dataset uses this protocol for 8% of frames where camera view is blocked^[3].
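
The rule for long occlusions can be applied mechanically once per-frame occlusion flags exist. This sketch assumes such flags are available (from annotators or a detector, which is outside what the text specifies) and uses Python's `None` as the `null` label.

```python
def mask_long_occlusions(labels, occluded, fps, max_seconds=2.0):
    """Replace labels with None wherever a full occlusion lasts longer than max_seconds."""
    out = list(labels)
    limit = int(max_seconds * fps)  # longest run (in frames) that keeps its labels
    run_start = None
    # A trailing False acts as a sentinel so a run ending on the last frame is closed.
    for i, flag in enumerate(list(occluded) + [False]):
        if flag and run_start is None:
            run_start = i
        elif not flag and run_start is not None:
            if i - run_start > limit:
                out[run_start:i] = [None] * (i - run_start)
            run_start = None
    return out
```

Shorter occlusions keep their inferred labels, matching the guidance to label through partial or brief occlusions.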

Failed actions: distinguish between attempted and completed actions. A failed grasp (fingers close but object slips) is labeled `grasp_attempt` + `retract`, not `grasp` + `lift`. This distinction is critical for learning failure recovery policies. RoboCasa includes 12,000 failed manipulation attempts with temporal annotations, enabling models to predict and avoid failure modes^[18].

Concurrent actions: for bimanual manipulation, annotate left-hand and right-hand actions on separate timeline tracks. Each track has independent start/end frames and action labels. Export as multi-track JSON with `hand: left` and `hand: right` metadata fields. ALOHA teleoperation datasets use this format for 10,000 bimanual trajectories^[19].
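
A multi-track export along these lines might look like the sketch below. The overall layout, the `video_id` value, and the helper function are hypothetical; only the `hand` field and the per-track start/end frames come from the text.

```python
import json


def active_frames(track):
    """Set of frame indices covered by any segment on one track (inclusive ends)."""
    frames = set()
    for seg in track["segments"]:
        frames.update(range(seg["start_frame"], seg["end_frame"] + 1))
    return frames


annotation = {
    "video_id": "demo_0001",  # hypothetical identifier
    "tracks": [
        {"hand": "left", "segments": [{"start_frame": 0, "end_frame": 30, "label": "hold"}]},
        {"hand": "right", "segments": [{"start_frame": 20, "end_frame": 50, "label": "pour"}]},
    ],
}

left, right = annotation["tracks"]
both_hands_busy = active_frames(left) & active_frames(right)  # frames 20..30
exported = json.dumps(annotation, indent=2)
```

Keeping each hand on its own track lets overlapping actions coexist without forcing annotators to pick a single dominant label per frame.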

Transition frames: when action boundaries are ambiguous (e.g., "reach" blending into "grasp"), annotators mark the frame where the dominant action characteristic appears. For "grasp", mark the frame where fingers first contact the object, even if finger closure continues for 5-10 frames. Consistent transition rules reduce inter-annotator boundary disagreement from 1.2 seconds to 0.6 seconds^[6].

Integrate Temporal Annotations with Robot State Data

Temporal video annotations gain value when synchronized with robot proprioception. For each action segment, align video frames with joint positions, gripper state, and end-effector pose from the robot's control log. LeRobot datasets store this alignment in HDF5: `/episode_000000/observation/qpos` (joint angles) and `/episode_000000/observation/qvel` (joint velocities) share the same temporal index as `/episode_000000/observation/image`^[13].

Timestamp synchronization: robot control loops run at 10-100 Hz, while video captures at 30-60 Hz. Use hardware-triggered cameras or software timestamps to align modalities within 10 ms. The DROID dataset synchronizes 3 RGB cameras, 1 wrist camera, and 7-DoF joint state at 10 Hz using ROS 2 message timestamps^[3]. Export synchronized data to MCAP format for frame-accurate playback^[14].
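
For software-timestamp alignment, a nearest-neighbor match with a tolerance is a common starting point. The sketch below assumes both timestamp lists are in seconds and that the robot-state timestamps are sorted; it is not taken from DROID or any specific pipeline.

```python
from bisect import bisect_left


def align_frames_to_states(frame_ts, state_ts, tolerance=0.010):
    """For each frame timestamp, return the index of the nearest state sample,
    or None when no sample lies within the tolerance (10 ms by default)."""
    matches = []
    for t in frame_ts:
        i = bisect_left(state_ts, t)
        # Only the neighbors on either side of the insertion point can be nearest.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(state_ts)]
        best = min(candidates, key=lambda j: abs(state_ts[j] - t), default=None)
        if best is not None and abs(state_ts[best] - t) <= tolerance:
            matches.append(best)
        else:
            matches.append(None)
    return matches
```

Frames that come back as `None` have no state sample close enough and can be dropped or interpolated, depending on how strict the dataset needs to be.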

For teleoperation datasets, annotate human demonstrator actions separately from robot actions. The demonstrator's hand motion (visible in egocentric video) precedes the robot's motion by 100-300 ms due to control latency. ALOHA records both human and robot video streams, annotating human actions on the egocentric stream and robot actions on the third-person stream^[19]. This dual annotation enables models to learn latency compensation.

Multi-modal action labels: extend temporal annotations with object affordances and contact states. For a "grasp" action, annotate the grasped object class, grasp type (power vs. precision), and contact points on the object mesh. Dex-YCB provides 582,000 frames with hand pose, object pose, and grasp type annotations for 20 objects^[20]. These rich labels enable models to generalize grasping strategies across object categories.

Benchmark Annotation Quality with Standard Metrics

Temporal Intersection-over-Union (tIoU): for two action segments A and B, tIoU = (overlap duration) / (union duration). Compute tIoU for all segment pairs between two annotators, then report mean tIoU across action classes. Target ≥0.75 for production datasets. The EPIC-KITCHENS-100 dataset reports 0.82 mean tIoU across 97,000 action segments^[6].

Frame-wise agreement: percentage of frames where two annotators assign the same action label. This metric is stricter than tIoU because it penalizes boundary disagreements frame-by-frame. For manipulation video at 30 Hz, target ≥85% frame-wise agreement. Lower agreement (<80%) indicates taxonomy ambiguity or insufficient annotator training.
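
Frame-wise agreement reduces to a single comparison over two label sequences. A minimal sketch, assuming each annotator's segments have already been expanded into one label per frame:

```python
def framewise_agreement(labels_a, labels_b):
    """Fraction of frames on which two per-frame label sequences agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must cover the same number of frames")
    if not labels_a:
        return 0.0
    same = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return same / len(labels_a)
```

Multiply the result by 100 to compare it against the percentage targets above.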

Segment-level F1 score: treat action segments as detection targets. A predicted segment matches a ground-truth segment if tIoU ≥ 0.5. Compute precision (percentage of predicted segments that match ground truth) and recall (percentage of ground-truth segments that are detected). F1 = 2 × (precision × recall) / (precision + recall). State-of-the-art action segmentation models achieve 0.65-0.75 F1 on EPIC-KITCHENS-100^[6].
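
The matching rule above leaves open how to pair segments when several overlap. The sketch below uses a greedy one-to-one match on same-label segments, which is one reasonable choice rather than the only one; segments are assumed to be `(start, end, label)` tuples.

```python
def tiou(a, b):
    """Temporal IoU of two segments given as (start, end, ...) tuples."""
    overlap = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - overlap
    return overlap / union if union > 0 else 0.0


def segment_f1(pred, truth, threshold=0.5):
    """Return (precision, recall, f1) with each ground-truth segment matched at most once."""
    matched = set()
    true_positives = 0
    for p in pred:
        best_j, best = None, 0.0
        for j, t in enumerate(truth):
            if j in matched or t[2] != p[2]:
                continue
            score = tiou(p, t)
            if score > best:
                best_j, best = j, score
        if best_j is not None and best >= threshold:
            matched.add(best_j)
            true_positives += 1
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

With three ground-truth segments and two predictions of which one matches, precision is 0.5, recall is one third, and F1 is 0.4.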

Edit distance: minimum number of insertions, deletions, and substitutions to transform the predicted action sequence into the ground-truth sequence. Normalized edit distance (edit distance / sequence length) measures temporal consistency. For robot learning datasets, target normalized edit distance <0.15. High edit distance (>0.25) indicates frequent action boundary errors or missing segments.
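
The edit distance here is the standard Levenshtein distance applied to sequences of action labels. In the sketch below, normalizing by the longer of the two sequences is an assumption, since the text only says "sequence length".

```python
def edit_distance(pred, truth):
    """Levenshtein distance between two label sequences (insert, delete, substitute)."""
    prev = list(range(len(truth) + 1))
    for i, p in enumerate(pred, 1):
        cur = [i]
        for j, t in enumerate(truth, 1):
            cur.append(min(
                prev[j] + 1,             # delete p from the prediction
                cur[j - 1] + 1,          # insert t into the prediction
                prev[j - 1] + (p != t),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def normalized_edit_distance(pred, truth):
    """Edit distance divided by the longer sequence length (an assumed convention)."""
    longest = max(len(pred), len(truth))
    return edit_distance(pred, truth) / longest if longest else 0.0
```

A prediction of reach, grasp, place against a ground truth of reach, grasp, lift, place needs one insertion, so the normalized distance is 0.25.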

Maintain Annotation Consistency Across Dataset Versions

Dataset versioning prevents annotation drift as you scale from pilot (1,000 videos) to production (100,000 videos). Assign a version number (e.g., v1.0, v1.1) to each annotation release, documenting taxonomy changes, tool updates, and quality thresholds. EPIC-KITCHENS evolved from 55 hours (v1) to 100 hours (v2) to 700 hours (v3), maintaining backward compatibility by preserving action class IDs^[6].

Annotation style guide updates: when refining taxonomy (e.g., splitting "grasp" into "power_grasp" and "precision_grasp"), re-annotate a 10% sample of existing videos to measure impact. If tIoU between old and new annotations drops below 0.7, the taxonomy change is too disruptive—consider adding new classes without modifying existing ones.

For datasets on truelabel's marketplace, version metadata includes annotation protocol hash, annotator pool size, and quality metrics (mean tIoU, frame-wise agreement). Buyers filter datasets by version to ensure training/validation splits use consistent annotations^[15]. Breaking changes (e.g., re-labeling all videos with a new taxonomy) require a major version bump (v1.x → v2.0).

Backward compatibility: when adding new action classes, preserve existing class IDs and append new classes to the taxonomy. Export annotations with a `schema_version` field in metadata.json, allowing data loaders to handle version-specific parsing. LeRobot's dataset loader supports schema versioning via the `dataset_version` parameter^[13].
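
Appending classes without disturbing existing IDs can be enforced in code rather than by convention. A minimal sketch, assuming the taxonomy is stored as a mapping from class name to integer ID; the function name is hypothetical.

```python
def extend_taxonomy(class_ids, new_classes):
    """Return a new mapping with new classes appended; existing IDs are never changed."""
    updated = dict(class_ids)
    next_id = max(updated.values(), default=-1) + 1
    for name in new_classes:
        if name not in updated:  # skip names that already have an ID
            updated[name] = next_id
            next_id += 1
    return updated


v1 = {"reach": 0, "grasp": 1, "place": 2}
v1_1 = extend_taxonomy(v1, ["power_grasp", "precision_grasp"])
```

Because the original mapping is copied rather than modified, older annotation files that reference IDs 0 to 2 remain valid under the extended taxonomy.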

Estimate Annotation Cost and Timeline

Annotation cost scales with video duration, action density, and quality requirements. For manipulation video at 30 Hz with 5-10 actions per minute, experienced annotators label 50-80 action segments per hour. At $25/hour annotator cost, that works out to roughly $0.30-0.50 per action segment for single-pass annotation; the cost per video-minute then depends on action density. Double annotation (two annotators per video) doubles cost but improves quality by 20-30%^[17].
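
Unit costs can be derived from the hourly rate and throughput quoted above, which makes it easy to re-run the estimate for a different rate or action density. This is a back-of-the-envelope sketch with hypothetical function names, not a pricing model.

```python
def cost_per_segment(hourly_rate, segments_per_hour):
    """Labor cost of labeling one action segment."""
    return hourly_rate / segments_per_hour


def cost_per_video_minute(hourly_rate, segments_per_hour, actions_per_minute):
    """Labor cost of one minute of video at a given action density."""
    return cost_per_segment(hourly_rate, segments_per_hour) * actions_per_minute


fast = cost_per_segment(25, 80)   # 0.3125 dollars per segment
slow = cost_per_segment(25, 50)   # 0.5 dollars per segment
sparse_minute = cost_per_video_minute(25, 80, 5)   # 5 actions/minute, fast annotator
dense_minute = cost_per_video_minute(25, 50, 10)   # 10 actions/minute, slow annotator
```

The per-segment cost is fixed by throughput alone, while the per-video-minute cost grows with the number of actions in each minute, so dense manipulation footage costs more per minute than sparse footage at the same hourly rate.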

Calibration overhead: budget 40-60 annotator-hours for calibration rounds before production labeling. This includes taxonomy training (8 hours), pilot annotation (20 hours), disagreement review (12 hours), and style guide refinement (8 hours). Calibration cost is amortized across the full dataset, adding $0.05-0.10 per video-minute for datasets exceeding 10,000 videos.

Model-assisted annotation reduces cost by 50-70% after initial model training. Training an action segmentation model requires 1,000-2,000 manually annotated videos (200-400 annotator-hours at $25/hour = $5,000-10,000). Once trained, the model pre-labels remaining videos, and annotators correct predictions at 2-3× the speed of manual labeling. Break-even occurs at ~5,000 videos; beyond that, model-assisted annotation is cheaper.

Timeline estimates: for a 10,000-video dataset (30 minutes average duration, 5 actions/minute), expect 12-16 weeks end-to-end. Weeks 1-2: taxonomy design and tool configuration. Weeks 3-4: calibration and pilot annotation (500 videos). Weeks 5-12: production annotation (9,500 videos at 1,200 videos/week). Weeks 13-14: validation and export. Weeks 15-16: model training and quality audit. Parallel annotation teams reduce the timeline by 30-50% but require stricter quality monitoring.

Procure Pre-Annotated Datasets from Truelabel Marketplace

Truelabel's physical AI data marketplace aggregates temporal video datasets from 20,000+ collectors, with action annotations validated by domain experts. Buyers filter by task type (manipulation, navigation, teleoperation), annotation granularity (coarse activity vs. fine-grained action), and quality metrics (tIoU, frame-wise agreement)^[15].

Pre-annotated datasets reduce time-to-model by 8-12 weeks versus in-house annotation. DROID provides 76,000 manipulation trajectories with frame-level action labels, ready for LeRobot training pipelines^[3]. BridgeData V2 offers 60,000 trajectories with 10 Hz action labels across 24 manipulation tasks^[7]. Both datasets are available on truelabel with provenance metadata and commercial-use licenses.

Quality guarantees: truelabel-listed datasets include inter-annotator agreement scores, validation model performance, and annotator pool demographics. Datasets with tIoU <0.7 or frame-wise agreement <80% are flagged for buyer review. For mission-critical applications (surgical robotics, autonomous vehicles), request datasets with triple-annotation and expert consensus.

Custom annotation services: truelabel connects buyers with annotation vendors (CloudFactory, Appen, Sama) for proprietary video datasets. Vendors provide SLA-backed annotation with 95% accuracy guarantees, 2-4 week turnaround for 10,000-video datasets, and GDPR-compliant data handling^[17]. Pricing ranges from $0.40-0.80 per video-minute depending on action density and quality requirements.

External references and source context

[1] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv. RT-2 extended RT-1 with 6,000 additional trajectories annotated for verb-noun action pairs.
[2] Scaling Egocentric Vision: The EPIC-KITCHENS Dataset. arXiv. EPIC-KITCHENS action annotation protocol with start frame, end frame, verb class, and noun class.
[3] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. arXiv. DROID synchronizes 4 cameras and 7-DoF joint state at 10 Hz with temporal action labels.
[4] Open X-Embodiment: Robotic Learning Datasets and RT-X Models. arXiv. Open X-Embodiment standardizes temporal annotations to 10 Hz action labels across diverse embodiments.
[5] CALVIN paper. arXiv. CALVIN actions span 15-60 frames at 30 Hz capture rate.
[6] Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100. arXiv. EPIC-KITCHENS-100 reports 4.2-second median action duration with 0.8-second inter-annotator boundary disagreement and 0.82 mean tIoU.
[7] BridgeData V2: A Dataset for Robot Learning at Scale. arXiv. BridgeData V2 consumed 2,400 annotator-hours for 60,000 trajectories with 10 Hz action labels.
[8] RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning. arXiv. RLDS stores episodes as TFRecord files with observation, action, reward, and metadata.
[9] LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch. arXiv. LeRobot provides pre-trained action segmentation models and dataset conversion scripts.
[10] segments.ai: the 8 best point cloud labeling tools. segments.ai. Segments.ai supports point cloud and video annotation with synchronized timelines.
[11] Ego4D: Around the World in 3,000 Hours of Egocentric Video. arXiv. Ego4D dataset used 2-annotator protocol for 3,670 hours achieving 0.82 mean tIoU across 110,000 action segments.
[12] encord.com active. encord.com. Encord active learning platform reduces annotation cost by 30-40% versus blind double-annotation.
[13] LeRobot dataset documentation. Hugging Face. LeRobot HDF5 format stores synchronized action labels with observation frames.
[14] MCAP guides. MCAP. MCAP files integrate with Foxglove Studio for visualization and support ROS 2 topics.
[15] truelabel data provenance glossary. truelabel.ai. Truelabel marketplace includes provenance metadata and quality metrics for dataset filtering.
[16] Scale AI: Expanding Our Data Engine for Physical AI. scale.com. Scale AI data engine achieves 95% annotation accuracy at 3× manual labeling throughput.
[17] cloudfactory.com accelerated annotation. cloudfactory.com. CloudFactory accelerated annotation reduces per-video cost from $15-25 to $3-6 with human-in-the-loop pipelines.
[18] Project site. robocasa.ai. RoboCasa includes 12,000 failed manipulation attempts with temporal annotations.
[19] Teleoperation datasets are becoming the highest-intent physical AI content category. tonyzhaozh.github.io. ALOHA teleoperation datasets use multi-track format for 10,000 bimanual trajectories.
[20] Project site. dex-ycb.github.io. Dex-YCB provides 582,000 frames with hand pose, object pose, and grasp type annotations.

FAQ

What is the difference between temporal annotation and frame-level annotation?

Temporal annotation assigns action labels to time segments (start frame to end frame), capturing action duration and boundaries. Frame-level annotation assigns a label to every individual frame, which is more granular but 10-30× more expensive. For robot learning, temporal annotation is sufficient because policies operate at 10-30 Hz, not per-frame. EPIC-KITCHENS uses temporal segments with 0.8-second median boundary precision, while DROID uses 10 Hz frame-level labels synchronized to robot control frequency.

How do I handle videos where multiple actions occur simultaneously?

Use multi-track annotation with separate timelines for each action stream. For bimanual manipulation, create left-hand and right-hand tracks with independent start/end frames. For concurrent object interactions (e.g., stirring while pouring), create object-specific tracks. Export as multi-track JSON with track metadata (hand ID, object ID). CVAT and Label Studio support multi-track timelines natively. ALOHA teleoperation datasets use this format for 10,000 bimanual trajectories.

What inter-annotator agreement threshold should I target?

Target temporal Intersection-over-Union (tIoU) ≥0.75 for production datasets. tIoU measures overlap between action segments from two annotators: (overlap duration) / (union duration). EPIC-KITCHENS-100 reports 0.82 mean tIoU across 97,000 segments. For frame-wise agreement, target ≥85%. Lower agreement (<0.7 tIoU or <80% frame-wise) indicates taxonomy ambiguity or insufficient annotator training. Run calibration rounds until agreement stabilizes before scaling to production.

Can I use pre-trained models to reduce annotation cost?

Yes. Train an action segmentation model on 1,000-2,000 manually annotated videos, then use it to pre-label the remaining dataset. Annotators correct model predictions rather than labeling from scratch, reducing annotation time by 50-70%. Encord and Scale AI provide model-assisted annotation platforms with active learning loops. For manipulation video, fine-tune LeRobot's pre-trained TCN on your task-specific data. Break-even occurs at ~5,000 videos; beyond that, model-assisted annotation is cheaper than full manual labeling.

How do I synchronize video annotations with robot state data?

Use hardware-triggered cameras or software timestamps to align video frames with robot joint positions within 10 ms. Store synchronized data in MCAP format (for ROS 2) or HDF5 (for LeRobot). DROID synchronizes 4 cameras and 7-DoF joint state at 10 Hz using ROS 2 timestamps. LeRobot datasets store video frames and joint positions in the same HDF5 episode structure with shared temporal indices. For teleoperation data, annotate human and robot actions separately to account for 100-300 ms control latency.

What annotation tools are best for large-scale video datasets?

CVAT for frame-precise boundary marking (±1 frame accuracy) and open-source deployment. Labelbox for managed annotation teams with quality SLAs and API-driven task assignment (scales to 100,000+ videos). Label Studio for on-premise deployment with custom UI templates. For point cloud + video (LiDAR-camera fusion), use Segments.ai with synchronized timeline views. All tools export to JSON or CSV; convert to RLDS (TFRecord) or LeRobot (HDF5) for training pipelines.
