
Physical AI Data Collection

How to Collect Egocentric Video Data for Physical AI Training

Egocentric video collection requires head- or chest-mounted action cameras (GoPro Hero 12+, DJI Action 4) recording at 1080p-4K 30fps with wide FOV, structured collection protocols defining target activities and environments, participant consent under GDPR Article 7 or equivalent privacy frameworks, real-time quality monitoring for lighting and framing consistency, post-processing pipelines for face blurring and PII redaction, and temporal segmentation with action annotations in formats compatible with RLDS or LeRobot schemas.

Updated 2026-01-15
By truelabel
Reviewed by truelabel · egocentric video data collection

Quick facts

Difficulty
Intermediate
Audience
Physical AI data engineers
Last reviewed
2026-01-15

Why Egocentric Video Is Critical for Physical AI Pretraining

Egocentric video captures the world from a human operator's viewpoint, providing implicit gaze direction, hand-object interaction geometry, and environmental context that third-person cameras miss. EPIC-KITCHENS-100 demonstrated that 100 hours of kitchen activity from 45 participants yields 90,000 action segments[1], proving egocentric footage scales annotation density far beyond static image datasets. Ego4D extended this to 3,670 hours across 74 scenarios in 9 countries, establishing egocentric video as the highest-diversity pretraining modality for embodied agents[2].

Physical AI models like RT-2 and OpenVLA require visual priors that generalize across manipulation contexts — egocentric video provides those priors at 30fps temporal resolution with natural lighting variation and occlusion patterns. Unlike teleoperation datasets that capture 200-500 task demonstrations per robot[3], a single human wearing a head-mounted camera for 8 hours generates 28,800 seconds (roughly 864,000 frames at 30fps) of continuous visual experience covering dozens of object categories and interaction modes.

Scale AI's Physical AI platform and NVIDIA Cosmos both prioritize egocentric video ingestion because it bridges the sim-to-real gap: human activity footage contains the same visual ambiguities (motion blur, partial occlusions, lighting gradients) that robots encounter in deployment. Pretraining on egocentric video reduces the need for domain randomization in simulation[4] by exposing models to real-world visual distributions from the start.

Camera Hardware Selection and Mounting Configurations

GoPro Hero 12 and Hero 13 remain the standard for egocentric collection due to 5.3K60 recording, HyperSmooth 6.0 stabilization, and SuperView FOV (16:9 aspect ratio with 155° horizontal coverage). Mount the Hero 12 on a head strap for gaze-aligned capture or a chest harness for hand-workspace focus. Head mounting introduces 15-20% more motion blur than chest mounting but captures implicit attention signals — where the participant looks correlates with task-relevant objects.

DJI Action 4 offers 4K120fps and a magnetic quick-release mount system, reducing setup time when rotating cameras across multiple participants. Action 4's 155° FOV matches GoPro SuperView, ensuring dataset consistency if you mix hardware. Both cameras support 10-bit color depth, which preserves shadow detail in indoor environments where physical AI tasks (kitchen manipulation, warehouse picking) occur.

Chest-mount geometry: position the camera 20-25cm below the chin, angled 15° downward. This framing centers the hand workspace in the lower two-thirds of the frame while keeping the horizon in the upper third, similar to the wrist-camera perspective used in DROID and BridgeData V2. Chest mounting reduces head-rotation artifacts that confuse optical flow estimators during annotation.

Storage requirements: 1080p30 H.265 encoding yields 8-12 GB per hour; 4K30 yields 25-35 GB per hour. Budget roughly 1.5 TB per 100 hours of raw 1080p footage (3-4 TB at 4K) before transcoding. Use SanDisk Extreme Pro 512GB microSD cards (V30 speed class minimum) to avoid dropped frames during long recording sessions.
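
As a planning aid, a short back-of-the-envelope calculation like the sketch below converts hours and recording profile into storage needs; the GB-per-hour midpoints and the 25% overhead factor are assumptions based on the ranges above.

```python
# Rough storage planner for an egocentric collection campaign.
# GB-per-hour values are midpoints of the ranges quoted above; the 25%
# overhead for retakes, test clips, and metadata is an assumption.
GB_PER_HOUR = {"1080p30_h265": 10, "4k30_h265": 30}

def storage_tb(hours: float, profile: str, overhead: float = 1.25) -> float:
    """Estimated terabytes of raw footage before transcoding."""
    return hours * GB_PER_HOUR[profile] * overhead / 1000

print(f"{storage_tb(100, '1080p30_h265'):.2f} TB for 100 h at 1080p30")
print(f"{storage_tb(100, '4k30_h265'):.2f} TB for 100 h at 4K30")
```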

Designing Collection Protocols for Target Activities

Define a task taxonomy before recruiting participants. EPIC-KITCHENS-100 used 97 verb classes (take, put, open, close, wash, cut, mix, and pour are among the most frequent) and 300 noun classes (objects), yielding 4,000+ unique verb-noun action pairs[1]. Your taxonomy should match the downstream robot's capability envelope: if training a kitchen assistant, prioritize container manipulation, utensil use, and appliance interaction. If training a warehouse robot, prioritize bin picking, box stacking, and pallet navigation.
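
A taxonomy can live as a small versioned data structure that collection scripts and the annotation tool both import. The sketch below is illustrative only: the verbs are the frequent EPIC-KITCHENS examples named above, while the noun subset and the blocked-pair filter are hypothetical.

```python
# Illustrative task taxonomy for a kitchen-assistant collection effort.
# Verbs are the frequent EPIC-KITCHENS examples named above; the noun subset
# and the blocked combinations are hypothetical, not a published schema.
VERBS = ["take", "put", "open", "close", "wash", "cut", "mix", "pour"]
NOUNS = ["knife", "drawer", "cup", "cutting_board", "faucet", "pan"]  # grow to 200-500 nouns in practice

def valid_action_pairs(verbs, nouns, blocked=frozenset({("cut", "faucet"), ("pour", "knife")})):
    """Enumerate verb-noun labels, dropping pairs outside the robot's capability envelope."""
    return [(v, n) for v in verbs for n in nouns if (v, n) not in blocked]

print(len(valid_action_pairs(VERBS, NOUNS)), "candidate action labels")
```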

Scenario scripts provide loose structure without over-constraining natural behavior. Example: "Prepare a sandwich using ingredients from the refrigerator, pantry, and countertop. Use at least 3 utensils. Clean up afterward." Scripts should specify goals (sandwich prepared) but not action sequences (spread mayo, then add lettuce) — rigid scripts produce unnatural pauses and exaggerated movements that don't transfer to robot policies.

Environment diversity matters more than participant count. Ego4D collected across 74 scenarios in 855 unique locations, demonstrating that visual diversity (lighting, object arrangements, background clutter) improves model generalization more than recording the same task 1,000 times in one kitchen[2]. Recruit 10-15 participants and rotate them through 5-8 distinct environments rather than recording 50 participants in a single lab space.

Session length: 30-45 minutes per participant per session. Longer sessions increase fatigue-related errors (dropping objects, skipping steps); shorter sessions waste setup time. Schedule 2-3 sessions per participant across different days to capture temporal variation (morning vs. evening lighting, different clothing, different object placements).

Participant Recruitment, Consent, and Training

Recruit participants who regularly perform the target activities in their daily routines — experienced cooks for kitchen datasets, warehouse workers for logistics datasets. Novices produce hesitant, unnatural movements; experts produce fluid, efficient actions that better represent the skill level robots should emulate. EPIC-KITCHENS recruited participants from their own homes, ensuring familiarity with kitchen layouts and object locations[1].

Consent under GDPR Article 7 requires explicit, informed, freely given agreement[5]. Consent forms must specify: (1) video will be used for AI model training, (2) faces and identifiable features will be blurred before distribution, (3) participants can withdraw consent at any time and request deletion of their footage (Article 7(3) requires withdrawal to be as easy as giving consent), (4) video may be shared with third-party annotation vendors under data processing agreements. If collecting in the EU, appoint a data protection officer where required and document the lawful basis (consent or legitimate interest) in your processing register.

Training session: 15-20 minutes before first recording. Demonstrate camera mounting, explain the scenario script, and conduct a 3-minute test recording to verify framing and audio levels. Show participants the test footage and confirm the hand workspace is centered. Instruct participants to narrate actions aloud ("I'm opening the drawer," "I'm picking up the knife") — narration provides weak supervision for action segmentation and improves annotation throughput by 25-30%[1].

Compensation: $25-40 per hour for general activities, $50-75 per hour for specialized skills (surgical tasks, industrial assembly). Truelabel's marketplace connects collectors with buyers who pay premium rates for high-diversity egocentric footage in underrepresented domains (outdoor manipulation, multi-agent coordination).

Executing Collection with Real-Time Quality Monitoring

Assign a collection supervisor to monitor live camera feeds via GoPro Quik or DJI Mimo apps during recording sessions. The supervisor checks for: (1) correct framing (hands visible in lower two-thirds of frame), (2) stable lighting (no direct sunlight causing lens flare), (3) audio clarity (narration audible above background noise), (4) no privacy violations (bystanders in frame, visible documents). Stop and restart recording if any issue persists beyond 30 seconds.

Lighting consistency: indoor environments should maintain 300-500 lux ambient lighting. Use diffused LED panels to eliminate harsh shadows on the hand workspace. Avoid recording near windows during midday — the dynamic range between window highlights and indoor shadows exceeds what action camera sensors can capture, causing clipped whites or crushed blacks. EPIC-KITCHENS recorded in participants' homes without supplemental lighting, accepting natural variation as a robustness feature[1], but this approach increases annotation difficulty when shadows obscure object boundaries.

Backup recording: run a secondary camera (smartphone mounted on tripod) capturing a wide-angle third-person view of the scene. This backup footage aids annotation when the egocentric view is occluded (participant's body blocks the hand workspace) and provides ground-truth object positions for evaluating egocentric depth estimation. Store backup footage separately — it's not part of the egocentric dataset but serves as annotation reference.

Session metadata: log participant ID, environment ID, scenario script, start/end timestamps, and any anomalies (dropped frames, audio glitches) in a CSV file. This metadata becomes the foundation for data provenance tracking when packaging the dataset for distribution.
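
A minimal logger for that session metadata might look like the sketch below; the column names and the free-text anomaly field are assumptions, not a fixed schema.

```python
import csv
import os
from datetime import datetime, timezone

# Fields mirror the session log described above; names are illustrative.
FIELDS = ["participant_id", "environment_id", "scenario", "start_utc", "end_utc", "anomalies"]

def log_session(path: str, **row) -> None:
    """Append one session record, writing the header only when the file is new."""
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({k: row.get(k, "") for k in FIELDS})

log_session(
    "sessions.csv",
    participant_id="P07",
    environment_id="kitchen_03",
    scenario="sandwich_prep_v2",
    start_utc=datetime(2026, 1, 12, 9, 30, tzinfo=timezone.utc).isoformat(),
    end_utc=datetime(2026, 1, 12, 10, 5, tzinfo=timezone.utc).isoformat(),
    anomalies="2s audio dropout at 00:14:20",
)
```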

Post-Processing for Privacy Compliance and Quality Filtering

Face blurring is mandatory before annotation or distribution. Use Scale AI's video annotation pipeline or open-source tools (DeepPrivacy, Azure Video Indexer) to detect and blur faces frame-by-frame. Blur radius should be 40-60 pixels to ensure faces are unrecognizable while preserving scene context. Extend blurring to reflective surfaces (mirrors, windows, phone screens) where faces may appear indirectly. GDPR requires demonstrable technical measures to prevent re-identification, on top of the consent conditions in Article 7[5].

PII redaction: blur visible text on documents, credit cards, ID badges, and computer screens. Use OCR (Tesseract, Google Cloud Vision API) to detect text regions, then apply Gaussian blur. Audio narration may contain names, addresses, or phone numbers — run speech-to-text (Whisper, Google Speech-to-Text) and redact identified PII with silence or beep tones. Store the redaction log separately for audit purposes.
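
For the audio side, a hedged sketch of that narration-PII pass could combine Whisper transcription with simple regex checks, writing flagged time ranges to a redaction log for a later muting step; the patterns shown are illustrative, not an exhaustive PII detector.

```python
import json
import re
import whisper  # openai-whisper; assumes the package and ffmpeg are installed

# Illustrative PII patterns only; extend with names, addresses, etc. as needed.
PII_PATTERNS = [
    re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),   # phone-number-like strings
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),          # email addresses
]

def flag_pii_segments(audio_path: str, log_path: str) -> list[dict]:
    """Transcribe narration and log segments whose text matches a PII pattern."""
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    flagged = [
        {"start": seg["start"], "end": seg["end"], "text": seg["text"]}
        for seg in result["segments"]
        if any(p.search(seg["text"]) for p in PII_PATTERNS)
    ]
    with open(log_path, "w") as f:
        json.dump(flagged, f, indent=2)  # time ranges to silence or beep in a later audio pass
    return flagged
```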

Quality filtering: discard clips with >20% motion blur (measured by Laplacian variance <100), >30% frame occlusion (hands out of frame), or lighting failures (mean pixel intensity <30 or >220 in 8-bit RGB). EPIC-KITCHENS-100 rejected 15% of raw footage during quality review[1]. Automated filtering (OpenCV blur detection, histogram analysis) reduces manual review time by 70-80%.
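
The automated screen can be a single OpenCV pass per clip, as in the sketch below; it applies the Laplacian-variance and intensity thresholds quoted above, while the frame-sampling stride and the 10% lighting-failure cutoff are assumptions.

```python
import cv2  # opencv-python

def screen_clip(path: str, blur_thresh: float = 100.0, sample_every: int = 15) -> dict:
    """Flag clips with excessive motion blur or lighting failures using sampled frames."""
    cap = cv2.VideoCapture(path)
    blurry = bad_light = total = idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if cv2.Laplacian(gray, cv2.CV_64F).var() < blur_thresh:  # blur: Laplacian variance < 100
                blurry += 1
            if not 30 <= gray.mean() <= 220:                         # lighting: mean intensity outside 30-220
                bad_light += 1
            total += 1
        idx += 1
    cap.release()
    blur_frac = blurry / max(total, 1)
    light_frac = bad_light / max(total, 1)
    return {
        "blur_fraction": blur_frac,
        "lighting_fail_fraction": light_frac,
        "reject": blur_frac > 0.20 or light_frac > 0.10,  # the 10% lighting cutoff is an assumption
    }
```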

Transcoding for distribution: convert raw H.265 files to H.264 MP4 at 1080p30 using FFmpeg with CRF 23 (visually lossless compression). H.264 has broader decoder support than H.265, ensuring compatibility with annotation tools (CVAT, Labelbox, Encord). Store original raw files on cold storage (AWS Glacier, Google Coldline) for 12 months in case re-processing is needed.
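
A minimal transcode step can shell out to FFmpeg with the settings above; the output naming, scaling filter, and AAC audio re-encode are assumptions to adjust for your pipeline.

```python
import subprocess
from pathlib import Path

def transcode(src: Path, dst_dir: Path) -> Path:
    """Convert a raw H.265 capture to H.264 MP4 at 1080p30 with CRF 23."""
    dst = dst_dir / (src.stem + "_1080p30.mp4")
    subprocess.run(
        [
            "ffmpeg", "-y", "-i", str(src),
            "-vf", "scale=-2:1080", "-r", "30",      # 1080p at 30fps
            "-c:v", "libx264", "-crf", "23", "-preset", "medium",
            "-c:a", "aac", "-b:a", "128k",            # audio re-encode is an assumption
            str(dst),
        ],
        check=True,
    )
    return dst
```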

Temporal Segmentation and Action Annotation

Temporal segmentation divides continuous video into discrete action clips. Use narration timestamps as initial segment boundaries, then refine with optical flow analysis to detect action start/end frames (flow magnitude drops below threshold when hands stop moving). EPIC-KITCHENS achieved 98% segment boundary agreement between annotators using this hybrid approach[1].
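
A sketch of that refinement step, assuming Farneback optical flow and a ±2-second search window around each narration timestamp (both the window and the 0.5 px/frame threshold are assumptions):

```python
import cv2
import numpy as np

def refine_boundary(path: str, narration_t: float, fps: float = 30.0,
                    window_s: float = 2.0, flow_thresh: float = 0.5) -> float:
    """Snap a narration timestamp to the first nearby frame where hand motion settles."""
    cap = cv2.VideoCapture(path)
    start = max(int((narration_t - window_s) * fps), 1)
    end = int((narration_t + window_s) * fps)
    cap.set(cv2.CAP_PROP_POS_FRAMES, start - 1)
    ok, prev = cap.read()
    if not ok:
        cap.release()
        return narration_t
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    boundary = narration_t
    for idx in range(start, end):
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        if np.linalg.norm(flow, axis=2).mean() < flow_thresh:  # motion has settled
            boundary = idx / fps
            break
        prev_gray = gray
    cap.release()
    return boundary
```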

Action labels follow a verb-noun structure: "take knife," "open drawer," "pour water." Define a closed verb set (8-15 verbs) and an open noun set (200-500 objects) before annotation begins. Use Labelbox or Encord for frame-level bounding boxes around manipulated objects — bounding boxes provide spatial grounding for vision-language-action models like RT-2 and OpenVLA.

Annotation throughput: experienced annotators label 15-20 action segments per hour with bounding boxes, or 40-50 segments per hour with verb-noun labels only. Narration reduces annotation time by 25-30% because annotators don't need to infer action semantics from visual cues alone[1]. Budget $0.50-1.50 per annotated segment depending on label complexity (bounding boxes cost 3x more than verb-noun pairs).

Format conversion: export annotations to RLDS or LeRobot dataset format for compatibility with robot learning frameworks. RLDS stores episodes as TFRecord files with nested trajectory structures[6]; LeRobot stores tabular data in Parquet files alongside referenced MP4 video files[7]. Both formats support arbitrary metadata fields for provenance tracking (participant ID, environment ID, collection date).
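
Before committing to either format, it can help to stage annotations in a neutral per-episode intermediate and convert with each framework's own tooling; the field names below are assumptions, not the actual RLDS or LeRobot schema.

```python
import json

# Neutral intermediate one might build before converting to RLDS or LeRobot;
# keys are illustrative, not either format's schema.
episode = {
    "episode_metadata": {
        "participant_id": "P07",
        "environment_id": "kitchen_03",
        "collection_date": "2026-01-12",
        "camera": {"model": "GoPro Hero 12", "fov_deg": 155, "resolution": "1920x1080", "fps": 30},
    },
    "steps": [
        {
            "frame_index": 1240,
            "timestamp_s": 41.33,
            "action_label": {"verb": "open", "noun": "drawer"},
            "object_bbox_xyxy": [612, 480, 890, 702],
        },
        # ... one entry per annotated step
    ],
}

with open("episode_000123.json", "w") as f:
    json.dump(episode, f, indent=2)
```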

Dataset Packaging and Distribution Considerations

Licensing: Creative Commons BY 4.0 permits commercial use with attribution, making it the default choice for datasets intended for model pretraining. CC BY-NC 4.0 restricts commercial use, limiting dataset utility for companies training proprietary models. EPIC-KITCHENS-100 uses a custom research-only license requiring separate commercial agreements[8] — this approach maximizes citation count but reduces industry adoption.

Metadata schema: include participant demographics (age range, handedness), environment characteristics (indoor/outdoor, lighting type, object density), and collection parameters (camera model, FOV, resolution, frame rate) in a JSON sidecar file. Datasheets for Datasets provides a 57-question template covering motivation, composition, collection process, preprocessing, and intended use[9]. Buyers use datasheets to assess dataset fit before downloading 500 GB of video.
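
A minimal sidecar along those lines, with illustrative key names rather than a standardized schema:

```python
import json

# Dataset-level sidecar sketched from the fields listed above; pair it with a full datasheet.
sidecar = {
    "participants": {"count": 12, "age_ranges": ["18-29", "30-44", "45-59"], "handedness": {"right": 10, "left": 2}},
    "environments": [{"id": "kitchen_03", "setting": "indoor", "lighting": "mixed LED/daylight", "object_density": "high"}],
    "collection": {"camera_models": ["GoPro Hero 12", "DJI Action 4"], "fov_deg": 155, "resolution": "1920x1080", "fps": 30},
}

with open("dataset_metadata.json", "w") as f:
    json.dump(sidecar, f, indent=2)
```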

Hosting options: Hugging Face Datasets supports streaming video via HTTP range requests, enabling users to preview clips without downloading the full dataset. AWS S3 with CloudFront CDN provides <100ms latency for global access. Truelabel's marketplace handles hosting, licensing, and payment processing for egocentric datasets, taking a 15% platform fee in exchange for buyer discovery and compliance infrastructure.

Versioning: use semantic versioning (v1.0.0, v1.1.0, v2.0.0) to track dataset updates. Increment major version when adding new participants or environments (breaks train/test splits), minor version when adding annotations to existing clips (preserves splits), patch version when fixing metadata errors. Store version history in a CHANGELOG.md file and tag releases in Git LFS or DVC for reproducibility.

Quality Metrics and Validation Benchmarks

Annotation agreement: measure inter-annotator agreement using Cohen's kappa for verb-noun labels (target κ >0.85) and IoU for bounding boxes (target IoU >0.75). EPIC-KITCHENS achieved κ=0.87 for verb labels and κ=0.82 for noun labels across 3 annotators[1]. Low agreement (<0.70) indicates ambiguous action boundaries or under-specified label definitions — refine the taxonomy and re-annotate problem segments.
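
Both checks take only a few lines with scikit-learn plus a small IoU helper; the toy label lists and boxes below are illustrative.

```python
from sklearn.metrics import cohen_kappa_score

def bbox_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) order."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0) * max(iy2 - iy1, 0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

# Toy inputs: the same segments labeled by two annotators, plus one box pair.
annotator_a = ["take", "open", "pour", "cut", "open"]
annotator_b = ["take", "open", "pour", "mix", "open"]
print("verb kappa:", cohen_kappa_score(annotator_a, annotator_b))
print("bbox IoU:", bbox_iou((100, 120, 300, 360), (110, 130, 310, 370)))
```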

Visual diversity metrics: compute CLIP embedding variance across frames to quantify scene diversity. High variance (>0.4 in cosine distance) indicates diverse object arrangements, lighting conditions, and viewpoints. Ego4D reported embedding variance of 0.52 across 74 scenarios, compared to 0.31 for single-environment datasets[2]. Diversity correlates with downstream model generalization — models pretrained on high-diversity egocentric video achieve 12-18% higher success rates on out-of-distribution manipulation tasks[10].
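
One plausible way to operationalize that metric is to embed sampled frames with CLIP and report the mean pairwise cosine distance, as sketched below; treating that mean as the "variance in cosine distance" is an assumption about the definition, and the checkpoint name is just a common default.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def frame_diversity(frame_paths: list[str]) -> float:
    """Mean pairwise cosine distance between CLIP embeddings of sampled frames."""
    images = [Image.open(p).convert("RGB") for p in frame_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = torch.nn.functional.normalize(feats, dim=-1)
    cos = feats @ feats.T                                  # pairwise cosine similarities
    n = cos.shape[0]
    off_diag = cos[~torch.eye(n, dtype=torch.bool)]        # drop self-similarities
    return float((1 - off_diag).mean())
```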

Benchmark tasks: define 3-5 downstream evaluation tasks (object detection, action recognition, hand-object contact prediction) and report baseline model performance. EPIC-KITCHENS includes action recognition benchmarks with top-1 accuracy targets (45-50% for verb classification, 35-40% for noun classification)[1]. Buyers use benchmark results to compare datasets without running their own experiments.

Provenance documentation: log collection date, camera serial numbers, annotator IDs, and software versions (FFmpeg, OpenCV, annotation tool) in a provenance.json file following W3C PROV-DM structure. Provenance enables reproducibility audits and satisfies EU AI Act Article 10 requirements for training data documentation[11].

The references below move from category-level context into specific task, dataset, format, and comparison detail.

External references and source context

  1. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100 (arXiv). EPIC-KITCHENS-100 dataset scale: 100 hours, 45 participants, 90,000 action segments.
  2. Ego4D: Around the World in 3,000 Hours of Egocentric Video (arXiv). Ego4D scale: 3,670 hours, 74 scenarios, 855 locations, 9 countries.
  3. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv). DROID dataset: 76,000 trajectories across 564 scenes, 86 object categories.
  4. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World (arXiv). Domain randomization for sim-to-real transfer in robot learning.
  5. GDPR Article 7 — Conditions for consent (GDPR-Info.eu). Consent requirements for personal data processing.
  6. RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning (arXiv). RLDS dataset format specification and TFRecord structure.
  7. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch (arXiv). LeRobot framework for robot learning in PyTorch.
  8. EPIC-KITCHENS-100 annotations license (GitHub). EPIC-KITCHENS-100 custom research license terms.
  9. Datasheets for Datasets (arXiv). Datasheets for Datasets framework and 57-question template.
  10. OpenVLA: An Open-Source Vision-Language-Action Model (arXiv). OpenVLA model performance and generalization metrics.
  11. Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (EUR-Lex). EU AI Act Article 10 training data documentation requirements.

FAQ

What frame rate should I use for egocentric video collection?

Record at 30fps for general manipulation tasks and 60fps for high-speed actions (throwing, catching, rapid tool use). Higher frame rates (120fps, 240fps) are unnecessary for robot learning — most robot controllers operate at 10-30 Hz, so 30fps video already oversamples the control frequency. 60fps doubles storage costs without improving model performance on standard benchmarks. EPIC-KITCHENS and Ego4D both use 30fps as the standard.

How many participants do I need for a useful egocentric dataset?

10-15 participants across 5-8 environments yields sufficient diversity for pretraining vision-language-action models. EPIC-KITCHENS-100 used 45 participants in 45 kitchens and produced 90,000 action segments. Ego4D recorded hundreds of camera wearers across 74 scenarios and 855 locations and produced 3,670 hours. Prioritize environment diversity over participant count — 10 participants in 10 distinct kitchens beats 50 participants in one kitchen for model generalization.

Do I need IRB approval for egocentric video collection?

IRB approval is required if you plan to publish research using the dataset or if your institution receives federal funding (NIH, NSF). Commercial data collection for model training does not require IRB approval but must comply with GDPR (EU), CCPA (California), or equivalent privacy laws. Obtain written informed consent from all participants, document data processing activities, and implement technical measures (face blurring, PII redaction) to prevent re-identification. Consult legal counsel before collecting video in public spaces or from minors.

What annotation format should I use for egocentric video?

Use RLDS (Reinforcement Learning Datasets) for robot learning applications or the LeRobot dataset format for Hugging Face ecosystem compatibility. RLDS stores episodes as TFRecord files with nested trajectory structures; LeRobot stores tabular data in Parquet files alongside referenced MP4 video files. Both formats support frame-level action labels, bounding boxes, and arbitrary metadata. Avoid proprietary formats (CVAT XML, Labelbox JSON) unless you plan to keep the dataset internal — open formats maximize reusability and buyer interest.

How do I handle bystanders who appear in egocentric footage?

Blur all faces with automated face detection (DeepPrivacy, Azure Video Indexer); the participant's hands and workspace remain unobscured. If a bystander occupies >30% of the frame for >5 seconds, discard the clip or obtain separate consent from the bystander. GDPR Article 6 permits incidental capture of bystanders under legitimate interest, but you must be able to demonstrate that measures such as blurring are sufficient to prevent identification. Document all bystander encounters in session metadata and retain consent records for as long as the footage is processed; GDPR sets no fixed retention period, so document your retention policy and apply it consistently.

What is the typical cost to collect 100 hours of egocentric video?

Budget $10,000-13,000 for the collection phase of 100 hours: participant compensation ($25-40/hour × 100 hours = $2,500-4,000), camera hardware ($400-600 per camera × 5 cameras = $2,000-3,000), storage infrastructure ($500-1,000 for 2+ TB of NAS capacity), and supervisor time ($50/hour × 100 hours = $5,000). Annotation dominates total cost and comes on top of that figure: at $0.50-1.50 per segment for roughly 30,000 segments, expect another $15,000-45,000. Narration during collection reduces annotation time by 25-30%, saving $4,000-12,000 on a 100-hour dataset.

Looking for egocentric video data collection?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

List Your Egocentric Dataset on Truelabel