How to Collect Egocentric Video Data for Physical AI (2026 Field Playbook)
To collect egocentric video data, mount a wide-FOV camera (GoPro Hero 13, Project Aria, or Apple Vision Pro) on the head or chest, record everyday manipulation tasks at 1080p/30fps with consented participants, monitor framing and lighting live, blur faces and PII, segment the footage into sub-actions, and export to a robot-ready format like RLDS or LeRobot. Diversity of environments and collectors matters more than raw hours. Budget roughly $10 to $30 per usable hour for human capture. Plan for an intermediate-difficulty project that runs a few weeks end to end.
Quick facts
- Capture spec: 1080p / 30fps, head- or chest-mounted, wide FOV
- Diversity target: coverage matrix, >=3 participants per environment-activity cell
- Consent: GDPR Art. 6/7 basis, face + PII redaction before delivery
- Delivery format: RLDS or LeRobot, with per-session provenance
- Cost guide: ~$10–$30 per usable hour (human capture)
Before you start: prerequisites, difficulty, and timeline
Egocentric video collection is an intermediate-difficulty project. You don't need a robotics lab, but you do need a capture rig, a small recruitment and QA process, and a privacy pipeline before a single frame ships to a buyer. A focused custom program runs a few weeks from writing the protocol to delivering the first robot-ready batch, and the timeline is driven by recruitment reach and QA throughput rather than by the recording itself.
What to have in place falls into three buckets. Hardware: a wide-FOV camera plus a head or chest mount, fast V30-class cards, spare batteries (a single GoPro Enduro battery gives you over 2.5 hours of continuous recording at 1080p/30fps, so plan around 150 minutes per battery and carry spares) [1], and storage for raw plus a transcoded copy. People: consented collectors who already perform the target task, and a supervisor (or a self-run checklist) on the live feed. Process: a written task taxonomy, a consent form with a lawful basis, a redaction step, and a target delivery format. The Tools and Technologies list below is the working stack most programs converge on.
- 01 Capture: A wide-FOV action camera (GoPro Hero 13, DJI Action) for the workhorse tier; smart glasses or a headset (Project Aria Gen 2, Apple Vision Pro, Pico 4 Ultra) when you need eye tracking or per-joint hand pose; an RGB-D camera (Intel RealSense D455, 86° × 57° depth FOV) only when you need metric depth.
- 02 Mounts: Head strap for gaze-aligned capture, chest harness for a steadier hand-workspace view, wrist mount for grasp-level detail. Most manipulation programs run chest or head, and many run both.
- 03 Redaction and QA: A face-detection-and-blur pass over every frame plus reflective surfaces, a quality filter (motion-blur, occlusion, brightness thresholds), and a human spot-check on a 10–15% daily sample plus all borderline clips.
- 04 Encoding and delivery: H.265 on-device for storage, transcode to H.264 MP4 for annotation-tool compatibility, then export to RLDS (TFRecord) or LeRobot (Parquet) with per-session provenance attached.
What egocentric video data is, and why robots need it
Egocentric video is first-person footage shot from the viewpoint of a person doing a real task. The camera sits on the head, the chest, or a pair of glasses, so the frame carries what a third-person camera can't capture: where the hands are, what the eyes track, and the exact geometry of a grasp as it happens. That hand-object signal is the part a manipulation policy needs to learn from.
The field moved fast because the footage is cheap. Industry estimates put human egocentric capture at roughly $10 to $30 per hour against $50 to $200 per hour for robot teleoperation [2]. You're paying a person to do a task they already know, instead of paying an operator to drive a robot arm through it one trajectory at a time.
This idea has a research lineage. R3M, a visual representation pretrained on Ego4D human video, improved downstream manipulation task success by over 20% versus training from scratch and by over 10% versus representations like CLIP and MoCo, and let a real Franka arm learn tasks from just 20 demonstrations [3]. That established egocentric human video as a pretraining substrate for robot perception before robot-only data was abundant.
The 2025-2026 results push it into deployment. Figure trained its Helix model for Project Go-Big "Using 100% egocentric human video data, collected passively as people do behaviors in real Brookfield homes," and reported it "required no robot demonstrations whatsoever" [4]. Georgia Tech's EgoMimic co-trains policies on human egocentric video plus teleoperated robot data: a model "trained on 2 hours robot data + 1 hour hand data strongly outperforms" a baseline "trained on 3 hours of robot data" [5], and Meta reported the approach delivering "a 400% increase in [the] robot's performance across various tasks" with just 90 minutes of Project Aria recordings, work presented at ICRA 2025 [6]. In those results, an hour of hand-camera footage did at least the work of an hour of robot teleoperation, at a fraction of the cost.
Choose your capture hardware: cameras, FOV, weight, mounts
Three hardware tiers cover almost every program, and the right one depends on what signal you need and how long collectors have to wear it. The comparison table below summarizes the tradeoffs; the detail follows.
Action-camera tier (the workhorse). A GoPro Hero 13 Black weighs 157g and shoots wide-angle video, and the Ultra Wide Lens Mod extends its field of view to 177° [7]. It's the default because it's rugged, cheap, and collectors already understand it. Mount it on a head strap for gaze-aligned capture or a chest harness for a stable hand-workspace view. DJI's Action line is an equivalent substitute when you want a magnetic quick-release to rotate one camera across several collectors.
Smart-glasses and headset tier (lightweight, sensor-rich). Meta's Project Aria Gen 2 research glasses weigh about 75g and run 6 to 8 hours, with RGB plus 6DOF SLAM, eye-tracking cameras, spatial microphones, IMUs, and on-device hand tracking; the Gen 2 unveil added a PPG heart-rate sensor and a contact microphone [8]. Apple Vision Pro is the capture device behind EgoDex and gives you per-joint hand and finger tracking with on-device SLAM [9]. The Pico 4 Ultra sits at the heavier end at 580g with roughly 105° per eye and on-device hand tracking, and serves as a common baseline rig for paid collection programs [10].
Multi-camera and RGB-D rigs (when you need depth). If your task needs metric depth or stereo, add an Intel RealSense D455, which pairs stereo depth with a global-shutter RGB sensor matched to its 86° × 57° depth field of view, with depth error under 2% at 4m and an ideal range of 0.6 to 6m [11]. Reach for this tier only when monocular video genuinely isn't enough, because it adds weight, calibration overhead, and sync tolerances you now have to police.
Mounting geometry decides half your data quality. Head mounts capture the eye-gaze signal but add motion blur as the head swings. Chest mounts trade the gaze signal for a steadier hand-workspace view: position the camera roughly 20 to 25cm below the chin and angle it about 15° down, so the hands sit in the lower two-thirds of the frame and the horizon stays in the top third. Wrist mounts get you close to the grasp but lose scene context. Most manipulation programs run chest or head, and many run both.
A quick way to choose. If your downstream model consumes gaze or attention, go head-mounted and accept the motion blur. If it consumes clean hand trajectories, go chest-mounted for the steadier frame. If it needs dexterous finger-level detail, the headset tier is the only one that gives you per-joint hand tracking without bolting on a separate glove or marker rig. EgoDex's 194 tabletop tasks were built on exactly that capability with Apple Vision Pro [9], and EgoLive's custom JoyEgoCam rig pushed the stereo, wide-FOV end of the spectrum at 2160×2160 per camera [12] to get depth-bearing first-person footage at scale.
| Camera / rig | FOV | Weight | Resolution / fps | Best for | Cost tier |
|---|---|---|---|---|---|
| GoPro Hero 13 Black | Up to 177° (Ultra Wide Lens Mod) | 157g | Up to 5.3K60; capture at 1080p/30fps | Workhorse manipulation capture; rugged, familiar | $ (~$400 body) |
| DJI Action (4/5) | Wide (~155°) | ~145g class | 4K; capture at 1080p/30fps | Quick-release to rotate one camera across collectors | $ low |
| Project Aria Gen 2 (Meta) | RGB + 6DOF SLAM | ~75g | RGB + eye/hand tracking, spatial audio | Lightweight, sensor-rich research capture | $$$ (research program) |
| Apple Vision Pro | Headset stereo | Headset | Per-joint hand + finger tracking, on-device SLAM | Dexterous finger-level detail (EgoDex device) | $$$ high |
| Pico 4 Ultra | ~105° per eye | 580g | On-device hand tracking | Baseline paid-collection headset rig | $$ mid |
| Intel RealSense D455 | 86° × 57° depth FOV | RGB-D rig | Global-shutter RGB + stereo depth; 0.6–6m range | Tasks that need metric depth or stereo | $$$ (RGB-D add-on) |
Dial in recording settings
Record manipulation at 1080p and 30fps with a linear or wide lens profile and electronic stabilization on. That's the de-facto standard, and the open datasets confirm it: Build AI's Egocentric-10K is 1080p at 30fps [13], and EgoVid-5M is curated from 1080p footage [14]. You don't need more.
The frame-rate question trips people up. Most robot controllers run at 10 to 30 Hz, so 30fps already oversamples the control loop. Push to 60fps only for genuinely fast actions like throwing, catching, or rapid tool changes. Recording at 120fps or higher just multiplies your storage bill for frames the policy will never use.
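The oversampling argument can be made concrete with a small calculation. The sketch below is illustrative only: the function names are hypothetical, and it assumes the policy consumes frames at a fixed control rate and that you subsample by keeping every Nth frame.

```python
def subsample_stride(capture_fps: float, control_hz: float) -> int:
    """Return N for 'keep every Nth frame'.

    When capture_fps >= control_hz, the retained frame rate stays at or
    above control_hz. When capture is slower than control, N is 1 and
    every frame is kept.
    """
    if capture_fps <= 0 or control_hz <= 0:
        raise ValueError("rates must be positive")
    return max(1, int(capture_fps // control_hz))


def kept_rate(capture_fps: float, control_hz: float) -> float:
    """Frame rate that remains after subsampling with the stride above."""
    return capture_fps / subsample_stride(capture_fps, control_hz)


if __name__ == "__main__":
    # Capture rates from the text against a 10 Hz controller.
    for fps in (30, 60, 120):
        print(fps, "fps ->", "keep 1 in", subsample_stride(fps, 10))
```

For a 10 Hz controller, 30fps footage keeps one frame in three, while 120fps footage keeps one in twelve, so most of the extra frames are stored and then discarded.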
Storage planning is simple arithmetic. H.265 at 1080p/30fps lands around 8 to 12 GB per hour; 4K/30fps lands closer to 25 to 35 GB per hour. At program scale that works out to roughly 1 TB per 100 hours at 1080p and around 3 TB per 100 hours at 4K, and keeping both a raw and a transcoded copy pushes you toward the higher end of a 1 to 5 TB per 100-hour planning range. Budget for both copies, and use a fast V30-class card so you don't drop frames mid-session. Encode in H.265 on the device for the storage savings, then transcode to H.264 for delivery so every downstream annotation tool can decode it.
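The arithmetic above fits in a few lines. This is a planning sketch, not a sizing tool: `plan_storage_tb` is a hypothetical helper, it uses decimal terabytes (1 TB = 1,000 GB), and the per-hour rate for any second copy is a number you supply from your own transcode settings.

```python
def plan_storage_tb(hours: float, gb_per_hour_by_copy: dict) -> float:
    """Total storage in decimal TB for every copy you keep.

    gb_per_hour_by_copy maps a copy name (for example "raw_h265") to its
    measured or estimated size in GB per recorded hour.
    """
    if hours < 0:
        raise ValueError("hours must be non-negative")
    total_gb = hours * sum(gb_per_hour_by_copy.values())
    return total_gb / 1000.0


if __name__ == "__main__":
    # Figures from the text: 8 to 12 GB/hr at 1080p H.265, 25 to 35 GB/hr at 4K.
    print(plan_storage_tb(100, {"raw_h265_1080p": 8}))   # low end, one copy
    print(plan_storage_tb(100, {"raw_h265_1080p": 12}))  # high end, one copy
    print(plan_storage_tb(100, {"raw_h265_4k": 30}))     # 4K midpoint, one copy
```

Add a second entry to the dictionary for the transcoded copy once you have measured its real size on a test session.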
Three settings ruin more egocentric footage than anything else, so lock them before the first session. First, the lens profile: shoot with a linear or wide profile, not a heavily fisheyed "SuperView" look, because extreme barrel distortion makes hand-pose and depth estimation harder downstream. Second, stabilization: leave electronic stabilization on for handheld realism, but don't crank it so high that the crop throws the hands out of frame at the edges. Third, white balance: lock it per environment instead of leaving it on auto, so the color doesn't pump every time the collector turns toward a window. None of this is exotic, and it's the difference between footage a model can learn from and footage your annotators quietly throw away.
Design the collection protocol
Write a task taxonomy before you recruit anyone. EPIC-KITCHENS organizes kitchen activity into a verb-noun structure and yields about 90,000 action segments from 100 hours across 45 kitchens [15]. Borrow that structure: a small closed set of verbs (take, put, open, close, pour, cut) crossed with an open set of nouns (the objects in your domain). Match the taxonomy to the robot's actual capability envelope. A kitchen assistant needs container and utensil manipulation; a warehouse robot needs bin-picking and box-stacking.
Use scenario scripts that specify goals, not action sequences. "Make a sandwich using ingredients from the fridge and pantry, clean up after" produces natural behavior. "Spread mayo, then add lettuce, then close the bread" produces stilted, exaggerated movements that don't transfer to a policy. You want the messy, efficient way a real person does the task.
Environment diversity outweighs participant count, and the datasets prove it. Ego4D spans 9 countries and 74 worldwide locations with 923 unique participants [16], and DROID was collected by 50 data collectors across North America, Asia, and Europe over 12 months precisely to vary the scene [17]. Ten collectors across ten distinct kitchens will generalize better than fifty collectors in one lab.
Make diversity something you can measure. Build a coverage matrix with environments as rows and activities as columns, then track how many distinct participants you've captured in each cell. A practical floor is at least three different participants per environment-activity combination, so no single person's habits dominate any cell. Vary lighting, object arrangement, background clutter, and time of day across the matrix.
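One way to track the matrix is to count distinct participants per cell and list the cells still under the floor. The sketch below assumes each session is logged as an (environment, activity, participant) tuple; the function names and sample IDs are hypothetical.

```python
from collections import defaultdict


def coverage_counts(sessions):
    """Map (environment, activity) -> number of distinct participants."""
    cells = defaultdict(set)
    for environment, activity, participant in sessions:
        cells[(environment, activity)].add(participant)
    return {cell: len(people) for cell, people in cells.items()}


def undercovered(sessions, environments, activities, floor=3):
    """List (environment, activity, count) for every cell below the floor.

    Cells with no sessions at all are reported with a count of 0.
    """
    counts = coverage_counts(sessions)
    return [
        (env, act, counts.get((env, act), 0))
        for env in environments
        for act in activities
        if counts.get((env, act), 0) < floor
    ]


if __name__ == "__main__":
    log = [
        ("kitchen_a", "pour", "p1"),
        ("kitchen_a", "pour", "p2"),
        ("kitchen_a", "pour", "p3"),
        ("kitchen_a", "pour", "p1"),  # repeat session, same participant
        ("kitchen_b", "pour", "p1"),
    ]
    for cell in undercovered(log, ["kitchen_a", "kitchen_b"], ["pour", "cut"]):
        print(cell)
```

Counting distinct participants rather than sessions is the point: a collector who records the same cell five times still counts once.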
Session length is a fatigue tradeoff. Run 30 to 45 minutes per collector per session: go longer and you accumulate fatigue errors (dropped objects, skipped steps, rushed motion); go shorter and setup overhead eats your budget. Schedule two or three sessions per collector across different days so you capture temporal variation, because the same person in the same kitchen at 8am and 6pm gives you different light, different clothing, and slightly different object placement. Spreading sessions out instead of cramming them into one afternoon is cheap diversity.
Recruit, consent, and train collectors
Who makes a good collector. Recruit people who already do the target task. Experienced cooks produce fluid kitchen footage; warehouse workers produce realistic picking motion. Novices hesitate and over-articulate, which is the motion you don't want a robot to copy. EPIC-KITCHENS recorded participants in their own homes for this reason, since familiarity with the space produces natural, goal-directed behavior [15].
Is egocentric video collection a real paid job? Yes, and it's a fast-growing one. Build AI's Egocentric-10K was collected across 85 factories worldwide, with the project's launch announcement reporting roughly 2,100 factory workers [13], and DROID was built by 50 data collectors across three continents over a year [17]. Paid programs recruit through freelance and microtask platforms; the work is to wear a rig, perform consented everyday or occupational tasks, and submit clean footage that passes a QA bar. Truelabel runs paid egocentric collection programs across a network of more than 20,000 collectors spanning Asia, Africa, and the Americas.
Consent and privacy law. Establish a lawful basis under GDPR Article 6 ("Lawfulness of processing") and meet the consent conditions in Article 7 before a single frame is recorded [18]. The consent form should state plainly that the footage trains AI models, that faces and identifiable features are blurred before distribution, that the collector can withdraw and request deletion, and that the data may pass to downstream buyers under a data-processing agreement. If you're training a high-risk system, EU AI Act Article 10 ("Data and data governance") requires documented, representative, and bias-examined training data, with high-risk obligations enforceable from August 2, 2026 [19]. Build the documentation as you collect, not after.
Train before the first take. Run a short onboarding: demonstrate the mount, walk the scenario script, record a two-minute test, and replay it with the collector so they can see whether the hands sit in frame. Ask collectors to narrate what they're doing ("I'm opening the drawer"). That narration is weak supervision: it gives your annotators temporal cues at no extra cost and speeds up segmentation later.
Run the shoot with real-time QA
Reject bad footage on the spot, before it's paid for and ingested. Put a supervisor (or a checklist the collector self-runs) on the live feed and watch four things: framing (hands in the lower two-thirds), lighting (no blown-out windows or lens flare), audio (narration audible over background noise), and privacy (no bystanders or readable documents in frame). If any fault persists past about 30 seconds, stop and restart. Then, off the live feed, review a 10–15% random sample of each day's accepted footage so problems that slip past the live check surface within a day rather than at delivery.
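The two rules in this step, restart on a fault that persists past about 30 seconds and review a random daily sample, can be sketched as follows. The names are hypothetical, and the sketch assumes the supervisor (or an automated check) records one pass/fail flag per fixed check interval.

```python
import random


def should_restart(fault_flags, check_period_s, max_fault_s=30.0):
    """True if any unbroken run of faults lasts longer than max_fault_s.

    fault_flags holds one boolean per check interval (True = fault seen).
    """
    run_s = 0.0
    for faulty in fault_flags:
        run_s = run_s + check_period_s if faulty else 0.0
        if run_s > max_fault_s:
            return True
    return False


def daily_spot_check(clip_ids, fraction=0.10, seed=None):
    """Pick a random sample of accepted clips for off-feed review."""
    clips = list(clip_ids)
    if not clips:
        return []
    k = min(len(clips), max(1, round(len(clips) * fraction)))
    return random.Random(seed).sample(clips, k)


if __name__ == "__main__":
    # Checks every 5 seconds; seven consecutive faults is 35 seconds.
    print(should_restart([True] * 7, check_period_s=5.0))
    print(daily_spot_check([f"clip_{i:03d}" for i in range(40)], fraction=0.15, seed=7))
```

Passing a fixed `seed` makes the day's sample reproducible, which helps when a reviewer needs to re-audit the same clips later.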
Keep indoor lighting steady and diffuse, and avoid shooting straight at a midday window. The dynamic range between a bright window and an indoor shadow exceeds what an action camera captures cleanly, so you end up with clipped highlights or crushed shadows over the exact hand region you care about. A couple of diffused LED panels fix most of it.
Log session metadata as you go: collector ID, environment ID, scenario, start and end timestamps, and any anomaly. That log is the seed of your provenance record, and without it you can't prove where a clip came from or which consent covers it. A cheap second camera on a tripod, capturing a wide third-person view, is worth running as a backup, because it disambiguates the annotation when the collector's own body occludes the workspace.
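A session log can be as simple as one JSON line per session. The record below is a minimal sketch: the field names are hypothetical, timestamps are assumed to be ISO 8601 strings with a UTC offset, and `consent_ref` is included on the assumption that each session points at a stored consent artifact.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import List


@dataclass
class SessionRecord:
    collector_id: str
    environment_id: str
    scenario: str
    start_utc: str          # ISO 8601, e.g. "2026-03-02T08:00:00+00:00"
    end_utc: str
    consent_ref: str        # pointer to the signed consent artifact
    anomalies: List[str] = field(default_factory=list)


def duration_minutes(record: SessionRecord) -> float:
    """Session length in minutes, computed from the two timestamps."""
    start = datetime.fromisoformat(record.start_utc)
    end = datetime.fromisoformat(record.end_utc)
    return (end - start).total_seconds() / 60.0


def to_jsonl_line(record: SessionRecord) -> str:
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    rec = SessionRecord(
        "c-017", "kitchen-04", "make_sandwich",
        "2026-03-02T08:00:00+00:00", "2026-03-02T08:40:00+00:00",
        "consent-0091", anomalies=["battery swap at minute 22"],
    )
    print(duration_minutes(rec))
    print(to_jsonl_line(rec))
```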
Process for privacy and quality
Redact faces and PII before anything else touches the footage. Run frame-by-frame face detection and blur every face, then extend the blur to reflective surfaces where a face can reappear: mirrors, windows, phone screens. Treat the recall of your face detector as a process target you measure and report, not a marketing number. Even a strong detector like RetinaFace reports 91.4% AP on the WIDER FACE Hard test set [20], which is well short of perfect on hard cases, so a real pipeline pairs automated detection with a human spot-check pass rather than trusting detection alone. Then redact readable text on documents, screens, and badges, and scrub names or addresses from any narration audio.
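Measuring recall from the human spot-check is a one-line calculation. In the sketch below the function names are hypothetical and the 0.99 target is a placeholder assumption, not a figure from any standard; set your own target and report against it.

```python
def redaction_recall(faces_blurred: int, faces_missed: int) -> float:
    """Recall on an audited sample: blurred faces / all faces the reviewer found."""
    total = faces_blurred + faces_missed
    if total == 0:
        raise ValueError("audit sample contains no faces")
    return faces_blurred / total


def meets_target(faces_blurred: int, faces_missed: int, target: float = 0.99) -> bool:
    """Compare measured recall against a program-defined target (assumed value)."""
    return redaction_recall(faces_blurred, faces_missed) >= target


if __name__ == "__main__":
    # A reviewer audits a sample and finds 2 unblurred faces among 100.
    print(redaction_recall(98, 2))
    print(meets_target(98, 2))
```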
Filter for quality with hard thresholds. Drop clips that fail a measurable bar: excessive motion blur (low Laplacian variance), hands out of frame for too much of the clip, or mean brightness pinned near black or white. Automated filters do the first pass and cut manual review sharply, then a human reviews the borderline cases.
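The three checks above can be sketched without any imaging library. This is a minimal illustration on a grayscale frame held as a list of rows; the function names and every threshold value are assumptions to tune on your own footage, and a production filter would more likely use an optimized image library.

```python
from statistics import mean, pvariance


def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian over interior pixels.

    Low values mean few sharp edges, which is a common proxy for blur.
    """
    h, w = len(gray), len(gray[0])
    if h < 3 or w < 3:
        raise ValueError("frame must be at least 3x3")
    responses = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
        - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    return pvariance(responses)


def quality_faults(gray, hands_visible_ratio,
                   min_sharpness=100.0,       # assumed threshold
                   min_hands_ratio=0.8,       # assumed threshold
                   brightness_range=(20.0, 235.0)):  # assumed, 0-255 scale
    """Return the list of failed checks; an empty list means the frame passes."""
    faults = []
    if laplacian_variance(gray) < min_sharpness:
        faults.append("motion_blur")
    if hands_visible_ratio < min_hands_ratio:
        faults.append("hands_out_of_frame")
    brightness = mean(v for row in gray for v in row)
    if not brightness_range[0] <= brightness <= brightness_range[1]:
        faults.append("brightness")
    return faults


if __name__ == "__main__":
    flat = [[128] * 5 for _ in range(5)]  # no edges at all
    checker = [[255 if (x + y) % 2 else 0 for x in range(5)] for y in range(5)]
    print(quality_faults(flat, hands_visible_ratio=0.5))
    print(quality_faults(checker, hands_visible_ratio=0.9))
```

Returning the list of failed checks, rather than a single pass/fail flag, makes it easy to route borderline clips to the human reviewer with the reason attached.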
Transcode for delivery. Convert the device-side H.265 to H.264 MP4 at 1080p/30fps so every annotation and training tool can decode it without friction, and keep the original raw files in cold storage in case you need to re-process.
The order here is deliberate: redact first, then filter, then transcode. If you transcode before redacting, you have to re-encode a second time after the blur pass, which doubles compute and adds a generation of compression artifacts. If you filter before redacting, you waste a privacy pass on clips you're about to discard. Process privacy on the footage you intend to keep, in that sequence, and the pipeline stays cheap.
Annotate and package for robot training
Segment continuous video into sub-actions. A single manipulation breaks into a predictable arc: reach, pre-grasp, contact, lift, transport, release. Use the collector's narration timestamps as first-pass boundaries, then refine them where the hand motion starts and stops. Automated action-boundary detection gives you a rough first cut, but treat machine boundaries as a draft a human tightens rather than a finished label. Label each segment with the same verb-noun vocabulary you defined in the taxonomy, and add per-joint hand pose and contact points where your downstream model consumes them. EgoDex demonstrates the high end of this: 829 hours across 194 tabletop tasks with paired 3D hand and per-joint finger tracking, captured on Apple Vision Pro with on-device SLAM, which Apple calls "the largest and most diverse dataset of dexterous human manipulation to date" [9].
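Turning narration cues into first-pass segments is mostly bookkeeping. The sketch below assumes each cue has already been reduced to a (start time, verb, noun) tuple, uses the example verb set from the taxonomy section as the closed vocabulary, and ends each segment where the next cue begins. The names are hypothetical, and the boundaries it produces are a draft for a human to tighten.

```python
# Closed verb set; nouns stay open. These six verbs are the example set
# from the taxonomy discussion, not a complete vocabulary.
VERBS = {"take", "put", "open", "close", "pour", "cut"}


def segments_from_narration(cues, clip_end_s):
    """Build draft segments from (start_s, verb, noun) narration cues.

    Each segment runs from its cue to the next cue; the last one runs
    to the end of the clip.
    """
    ordered = sorted(cues)
    segments = []
    for i, (start_s, verb, noun) in enumerate(ordered):
        if verb not in VERBS:
            raise ValueError(f"verb {verb!r} is not in the taxonomy")
        end_s = ordered[i + 1][0] if i + 1 < len(ordered) else clip_end_s
        segments.append(
            {"start_s": start_s, "end_s": end_s, "verb": verb, "noun": noun}
        )
    return segments


if __name__ == "__main__":
    cues = [(12.0, "pour", "kettle"), (3.5, "open", "drawer")]
    for seg in segments_from_narration(cues, clip_end_s=20.0):
        print(seg)
```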
Multi-sensor capture has to be synchronized to tight tolerances, or the labels are wrong. A common operational target keeps RGB and depth within one frame of each other and IMU and body-skeleton streams within about 20ms [21]. Check the sync during QA, not after annotation.
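A sync check during QA can compare each RGB timestamp to the nearest sample in every other stream. The sketch below assumes all streams share one clock and report timestamps in seconds; the function names are hypothetical, and the tolerances mirror the operational targets quoted above (one frame for depth, about 20ms for IMU).

```python
from bisect import bisect_left


def worst_offset_s(reference_ts, other_ts):
    """Largest gap between any reference timestamp and its nearest neighbour."""
    other = sorted(other_ts)
    if not other:
        raise ValueError("other stream is empty")
    worst = 0.0
    for t in reference_ts:
        i = bisect_left(other, t)
        neighbours = other[max(0, i - 1): i + 1]  # sample before and after t
        worst = max(worst, min(abs(t - n) for n in neighbours))
    return worst


def check_sync(rgb_ts, depth_ts, imu_ts, fps=30.0, imu_tol_s=0.020):
    """Pass/fail per stream pair against the tolerances above."""
    return {
        "rgb_depth": worst_offset_s(rgb_ts, depth_ts) <= 1.0 / fps,
        "rgb_imu": worst_offset_s(rgb_ts, imu_ts) <= imu_tol_s,
    }


if __name__ == "__main__":
    rgb = [0.0, 1 / 30, 2 / 30]
    depth = [0.005, 0.038, 0.071]
    imu = [i / 200 for i in range(20)]  # 200 Hz IMU, hypothetical rate
    print(check_sync(rgb, depth, imu))
```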
Export to a robot-ready format so buyers ingest without bespoke ETL. RLDS stores episodes as TFRecord files with nested trajectory structures [22]; LeRobot uses Parquet with referenced video and is the native format for the Hugging Face robot-learning ecosystem [23]. Both carry arbitrary metadata fields, which is where your provenance (collector ID, environment ID, consent reference, collection date) rides along with the data. That last point matters more than format preference, because data that ships its own provenance is data a buyer can actually license.
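Before export, it is worth gating every episode on complete provenance. The sketch below is format-agnostic and does not call the RLDS or LeRobot libraries: it assumes each episode's metadata is available as a plain dictionary, and the key names are hypothetical labels for the four provenance fields listed above.

```python
REQUIRED_PROVENANCE = (
    "collector_id",
    "environment_id",
    "consent_ref",
    "collection_date",
)


def missing_provenance(episode_metadata: dict) -> list:
    """Return the required provenance keys that are absent or empty."""
    return [key for key in REQUIRED_PROVENANCE if not episode_metadata.get(key)]


def exportable(episodes: list) -> list:
    """Keep only episodes whose provenance is complete."""
    return [ep for ep in episodes if not missing_provenance(ep)]


if __name__ == "__main__":
    episode = {
        "collector_id": "c-017",
        "environment_id": "kitchen-04",
        "consent_ref": "",
        "collection_date": "2026-03-02",
    }
    print(missing_provenance(episode))
```

Running this check ahead of the format conversion keeps unlicensable clips out of the delivered batch instead of discovering them at buyer review.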
What it really costs (transparent economics)
Most guides quote an hourly rate and stop. The full line-item picture for a custom collection is below, so you can budget before you start.
The headline number is the capture rate: roughly $10 to $30 per usable hour for human egocentric footage, versus $50 to $200 per hour for robot teleoperation [2]. "Usable" carries the weight there. If real-time QA rejects a meaningful slice of raw footage, your cost per accepted hour climbs, which is the reason to gate quality on the spot rather than pay for footage you later discard.
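The effect of rejection on cost is easy to quantify. The sketch below assumes collectors are paid per raw recorded hour and that a fixed fraction of raw footage is rejected at QA; both are simplifying assumptions, and the function name is hypothetical.

```python
def cost_per_accepted_hour(rate_per_raw_hour: float, rejection_rate: float) -> float:
    """Effective cost of one accepted hour when some raw footage is rejected.

    rejection_rate is the fraction of raw hours that fail QA (0 <= r < 1).
    """
    if not 0.0 <= rejection_rate < 1.0:
        raise ValueError("rejection_rate must be in [0, 1)")
    return rate_per_raw_hour / (1.0 - rejection_rate)


if __name__ == "__main__":
    # Illustrative inputs: a $20 raw-hour rate at several rejection rates.
    for rejected in (0.0, 0.2, 0.5):
        print(rejected, round(cost_per_accepted_hour(20.0, rejected), 2))
```

Under these assumptions a 20% rejection rate turns a $20 raw hour into a $25 accepted hour, and a 50% rejection rate doubles it.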
The cost stack below the capture rate is what people forget. Hardware is a one-time line (action cameras are cheap; sensor-rich glasses and RGB-D rigs aren't). Supervision and onboarding time is real labor. Storage is small but recurring, across both raw and transcoded copies (plan roughly 1 to 5 TB per 100 hours, per the recording-settings math). The privacy pass (automated redaction plus a human spot-check) is per-hour labor. Annotation is usually the largest variable line: dense per-segment labeling with hand pose costs far more than coarse verb-noun tagging, which is why narration-as-weak-supervision during capture cuts downstream annotation time and cost.
The practical takeaway: hardware is the small line, and recruitment reach, QA, privacy, and annotation are the ones that decide whether a custom program stays affordable. Capturing dirty footage and fixing it later costs more than gating quality at the source.
| Line item | What drives it | Cost behavior |
|---|---|---|
| Capture (collector pay) | Usable hours of human footage; ~$10–$30/usable hr vs $50–$200/hr teleop [2] | Largest direct line; rises with QA rejection rate |
| Hardware | Camera tier: ~$400 action cam vs sensor-rich glasses / RGB-D rigs | One-time; amortized across many sessions |
| Supervision + onboarding | Live QA staffing, collector training, test takes | Per-program labor; front-loaded |
| Storage | ~1–5 TB per 100 hrs (raw + transcoded); ~8–12 GB/hr at 1080p H.265 | Small but recurring |
| Privacy pass | Frame-by-frame face/PII redaction + human spot-check | Per-hour labor on kept footage only |
| Annotation | Segment boundaries, verb-noun labels, hand pose, contact points | Largest variable line; dense labeling costs most |
Public datasets vs custom collection, with a 2026 dataset reference
Public egocentric datasets are excellent references and weak products. They're research-licensed or shaped to someone else's task, so they rarely match the embodiment, sensor stack, and environment of what you're shipping. Use them to benchmark and to bootstrap, then collect custom for the deployment-specific distribution. The current, correctly-attributed reference set is below.
Ego4D is the breadth anchor: over 3,670 hours of daily-life activity, 923 unique participants, across 9 countries and 74 worldwide locations [16]. Watch one citation trap: some provider pages quote stale, lower figures like 855 participants or 3,000 hours. The primary source ego4d-data.org states 3,670 hours and 923 participants, and that's what to cite.
Ego-Exo4D is the "largest public dataset of time-synchronized first- and third-person video," with more than 1,400 hours and over 800 skilled participants across six countries [24]. EPIC-KITCHENS-100 is the fine-grained kitchen anchor: 100 hours, 45 kitchens, about 90,000 action segments [15]. HD-EPIC adds dense annotation to in-the-wild kitchen capture: 41 hours across 9 kitchens, presented at CVPR 2025 [25].
The 2025 to 2026 wave is where the field is now. EgoDex (Apple) is 829 hours, 194 tabletop tasks, with per-joint hand tracking on Apple Vision Pro [9]. EgoVid-5M is 5 million egocentric clips with kinematic control and textual descriptions, curated from 1080p footage for egocentric video generation and accepted at NeurIPS 2025 [14]. Egocentric-10K (Build AI) is 10,000 hours and 1.08 billion frames at 1080p/30fps, released under Apache-2.0 [13]. EgoLive is 1,680 hours of stereo video at 60fps across 65,866 episodes and 346 real-world tasks, captured on a custom JoyEgoCam head-mounted device with a 130°×130° field of view at 2160×2160 per camera [12]. The downstream consumers are the VLA models: OpenVLA is a 7B-parameter open model pretrained on 970k real-world robot demonstrations from Open X-Embodiment [26], which itself aggregates over a million trajectories across 22 embodiments [27].
Across all of them, the newest and most useful datasets are custom-collected against a specific embodiment and task distribution rather than scraped, which is the gap a custom program fills.
| Dataset | Scale | Capture device | License / use |
|---|---|---|---|
| Ego4D | 3,670 hrs, 923 participants, 74 locations | Mixed head-mounted | Research access |
| Ego-Exo4D | >1,400 hrs, >800 participants, 6 countries | Synced ego + exo | Research access |
| EPIC-KITCHENS-100 | 100 hrs, 45 kitchens, ~90K segments | Head-mounted GoPro | Research license |
| HD-EPIC | 41 hrs, 9 kitchens (CVPR 2025) | Head-mounted | Research access |
| EgoDex (Apple) | 829 hrs, 194 tabletop tasks | Apple Vision Pro | Research release |
| EgoVid-5M | 5M clips, 1080p source (NeurIPS 2025) | Mixed (curated) | Generation research |
| Egocentric-10K | 10,000 hrs, 1.08B frames, 1080p/30fps | Head-mounted (factory) | Apache-2.0 |
| EgoLive | 1,680 hrs stereo @ 60fps, 346 tasks | JoyEgoCam (custom) | Research |
Best egocentric video data providers, compared
If you'd rather source egocentric data than run a capture program yourself, the providers split into three kinds: managed collection vendors that run their own crews, annotation-and-collection shops that deliver labeled or raw footage, and a marketplace model that coordinates a distributed collector network against your spec. The named comparison below maps the live field as of 2026; each entry is described from its own public positioning.
Managed single-vendor networks own their crews and infrastructure. Claru, for instance, positions itself as a global egocentric collection network with trained collectors and built-in privacy and annotation pipelines [28], which gives you tight control of one protocol but ties diversity to that one vendor's footprint. Annotation-led shops such as iMerit collect with a global workforce of full-time employees and deliver footage "raw or annotated, ready for model training or evaluation workflows" [29]; Unidata runs paid collection on rigs like the Pico 4 Ultra with multimodal sync and GDPR consent protocols [10]; Labellerr is an annotation-first platform (with some capture) layered on top of robotics datasets [30]. A marketplace plus collector network, the truelabel model, treats geographic and demographic diversity as a sourcing parameter rather than a single crew's reach.
Be honest about where each route wins. Public datasets lead on cost and documentation. A managed single vendor leads on hands-on control of one protocol. An annotation-led shop leads when you want collection and labeling in one contract. A marketplace earns the top spot when you need custom capture shaped to your spec, per-clip provenance and consent, and robotics-ready delivery from a network broad enough to vary geography and demographics on demand. If your bottleneck is none of those, one of the other routes may fit better, and the table below says so plainly.
| Provider / model | Type | Best for | Watch out for |
|---|---|---|---|
| truelabel | Marketplace + global collector network | Custom, geo-diverse, provenance-tracked supply with RLDS/LeRobot delivery | Newer model; first-party scale figures still being published |
| Claru | Managed single-vendor collection network | Tight control of one protocol; built-in privacy + annotation pipelines | Diversity tied to one vendor's collector footprint |
| iMerit | Collection + annotation workforce | Collect and label in one contract; raw or annotated delivery | Vendor-managed crews rather than open-network diversity |
| Unidata | Collection shop (multimodal rigs) | Multimodal sync, scenario scripting, GDPR consent on headset rigs | Fixed rig roster; you scope the protocol |
| Labellerr | Annotation-first platform (with some capture) | Labeling and curating robotics datasets you already have | Annotation-led; capture is not its core |
| Public datasets (Ego4D, EgoDex, Egocentric-10K) | Open / research datasets | Benchmarking and pretraining bootstraps at low cost | Research-licensed or off-spec; rarely matches your embodiment |
How a marketplace scales egocentric capture across geographies
A single shoot gets you one place, one crew, one set of environments. The hard part of egocentric data collection isn't the first hour. It's making capture repeatable across geographies, demographics, and environments while keeping every clip's consent and provenance intact. That is a network problem, and a single vendor's footprint can only stretch so far across it.
This is where truelabel fits. The physical-AI data marketplace connects robotics buyers with a global collector network, so geographic, demographic, and environmental diversity becomes a sourcing parameter instead of a logistics project. You post a spec, review matched samples, a first-batch evaluation runs against your rubric, and scale-up is gated on acceptance. Marketplace quality comes from rejecting bad batches before they reach the buyer, so the goal is acceptance rate per spec, not raw collector headcount.
Every session carries its own provenance. The consent artifact, location release, and per-session metadata travel with the clip, which is what makes the data licensable for commercial training and auditable against EU AI Act Article 10 documentation requirements [19]. Delivery lands in RLDS or LeRobot [22] [23], so buyers ingest robotics-ready episodes without building bespoke ETL. Truelabel's network spans more than 20,000 collectors across Asia, Africa, and the Americas, with deepening density across North, Latin, and South America, and has delivered over 100,000 hours of egocentric footage to date. A calibration pilot typically returns its first batch within about a week, and scale-up is gated on a first-review acceptance rate that runs around 97% once that pilot is complete.
Where public datasets give you research-licensed references and a single vendor gives you one crew's footage, a marketplace turns a one-off shoot into a provenance-tracked, geo-diverse supply line you can scale on demand and license cleanly.
External references and source context
- GoPro Enduro Battery
GoPro Enduro battery (HERO13 Black) delivers over 2.5 hours of continuous recording at 1080p30 (gopro.com Enduro battery product page, verified 2026-06-10).
gopro.com
- Best Egocentric Video Data Providers for Robotics AI (2026)
Industry estimate: human egocentric capture costs about $10 to $30 per hour versus $50 to $200 per hour for robot teleoperation.
claru.ai
- R3M: A Universal Visual Representation for Robot Manipulation
R3M (Nair et al., arXiv 2203.12601): a visual representation pretrained on Ego4D human video improves downstream manipulation success by over 20% vs training from scratch and over 10% vs CLIP/MoCo, and enables a real Franka arm to learn tasks from 20 demonstrations.
arXiv
- Project Go-Big
Figure Project Go-Big trained Helix using 100% egocentric human video collected passively in real Brookfield homes, requiring no robot demonstrations.
Figure
- EgoMimic: Scaling Imitation Learning via Egocentric Video
EgoMimic (Georgia Tech, 2024): a policy trained on 2 hours robot data plus 1 hour hand data outperforms a baseline trained on 3 hours of robot data.
EgoMimic (Georgia Tech)
- EgoMimic uses Project Aria for robotics imitation learning
Meta's official EgoMimic blog reports the approach delivering a 400% increase in the robot's performance across various tasks with just 90 minutes of Project Aria recordings; presented at ICRA 2025 (Atlanta), attributed to Georgia Tech.
Meta AI
- GoPro HERO13 Black
GoPro Hero 13 Black weighs 157g; the Ultra Wide Lens Mod extends the field of view to 177 degrees; launch MSRP around $400.
gopro.com
- Project Aria Gen 2 research glasses
Project Aria Gen 2 weighs about 75g, runs 6 to 8 hours, with RGB, 6DOF SLAM, eye tracking, on-device hand tracking, a PPG heart-rate sensor, and a contact microphone.
Meta
- EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video
EgoDex: 829 hours, 194 tabletop tasks, per-joint hand tracking on Apple Vision Pro; Apple's largest dataset of dexterous human manipulation to date.
arXiv
- Egocentric Data Collection for Robot Training
Pico 4 Ultra weighs 580g with about 105 degrees per eye and on-device hand tracking, used by Unidata as a baseline paid-collection rig with multimodal sync and GDPR consent protocols.
unidata.pro
- Intel RealSense Depth Camera D455
Intel RealSense Depth Camera D455 (Intel product spec, SKU 205847): depth FOV 86° × 57° (±3°), depth stream 1280x720 up to 90fps, global-shutter RGB matched to the depth FOV and 1080p-capable, depth error under 2% at 4m, ideal range 0.6m to 6m. Re-confirm on intelrealsense.com primary page before FINAL (page refused connection 2026-06-10).
Intel RealSense ↩ - EgoLive: A Large-Scale Egocentric Dataset from Real-World Human Tasks
EgoLive: 1,680 hours stereo at 60fps, 65,866 episodes, 346 tasks, JoyEgoCam head-mounted device at 130x130 degree FOV, 2160x2160 per camera.
arXiv ↩ - Egocentric-10K
Egocentric-10K: 10,000 hours, 1.08 billion frames, 1080p/30fps, Apache-2.0; collected across 85 factories (HF card). The launch announcement reports roughly 2,100 factory workers (2,138 per HF/Rohan Paul; rounded to ~2,100).
Hugging Face ↩ - EgoVid-5M
EgoVid-5M: 5 million egocentric clips with kinematic control and textual descriptions, curated from 1080p source footage for egocentric video generation; clip count from egovid.github.io + arXiv 2411.08380, NeurIPS 2025 acceptance per the GitHub repo tag.
EgoVid-5M project ↩ - EPIC-KITCHENS project site
EPIC-KITCHENS-100: 100 hours across 45 kitchens, about 90,000 verb-noun action segments.
epic-kitchens.github.io ↩ - Egocentric video remains useful but incomplete for robot data buyers
Ego4D: over 3,670 hours, 923 unique participants, 9 countries, 74 worldwide locations.
ego4d-data.org ↩ - Project site
DROID was collected by 50 data collectors across North America, Asia, and Europe over the course of 12 months.
droid-dataset.github.io ↩ - GDPR Article 7 — Conditions for consent
GDPR Article 6 governs lawfulness of processing (legal basis); Article 7 governs the conditions for consent.
GDPR-Info.eu ↩ - EU AI Act, Article 10: Data and data governance
EU AI Act Article 10 (Data and data governance) requires representative, documented, bias-examined training data; high-risk obligations enforceable August 2, 2026.
artificialintelligenceact.eu ↩ - RetinaFace: Single-stage Dense Face Localisation in the Wild
RetinaFace reports 91.4% AP on the WIDER FACE Hard test set (arXiv 1905.00641 abstract); there is no published >98% recall figure.
arXiv ↩ - Egocentric Data Collection for Robot Training
Operational multi-sensor sync target: RGB and depth within one frame, IMU and body skeleton within about 20ms.
unidata.pro ↩ - RLDS with TensorFlow Datasets
RLDS stores robot episodes as TFRecord files with nested trajectory structures.
TensorFlow ↩ - LeRobot documentation
LeRobot uses Parquet with referenced video and is the native robot-learning dataset format in the Hugging Face ecosystem.
Hugging Face ↩ - Ego-Exo4D project site
Ego-Exo4D is the largest public dataset of time-synchronized first- and third-person video, more than 1,400 hours and over 800 participants across six countries.
ego-exo4d-data.org ↩ - HD-EPIC: A Highly-Detailed Egocentric Video Dataset
HD-EPIC: 41 hours across 9 kitchens, a highly-detailed egocentric video dataset (CVPR 2025).
arXiv ↩ - OpenVLA project
OpenVLA is a 7B-parameter open VLA pretrained on 970k real-world robot demonstrations from Open X-Embodiment.
openvla.github.io ↩ - truelabel Open X-Embodiment glossary
Open X-Embodiment aggregates over one million real robot trajectories across 22 embodiments.
truelabel.ai ↩ - Best Egocentric Video Data Providers for Robotics AI (2026)
Claru positions itself as a managed global egocentric collection network with trained collectors and built-in privacy and annotation pipelines.
claru.ai ↩ - Egocentric Video Data Collection Services
iMerit collects egocentric video with a global workforce of full-time employees and Scholars and delivers footage raw or annotated, ready for model training or evaluation workflows.
imerit.ai ↩ - 7 Top Egocentric Data Service Providers for Robotics 2026
Labellerr operates primarily as an annotation-first data-labeling and curation platform for robotics datasets (with some capture offered), rather than a primary capture network.
labellerr.com ↩
FAQ
What is egocentric video data collection?
Egocentric video data collection is the process of recording first-person footage from a camera worn on a person's head, chest, or glasses while they perform real tasks. It captures hand-object interaction, gaze direction, and scene context that third-person cameras miss. The footage pretrains robot vision and vision-language-action models, then gets redacted for privacy and exported to a robot-ready format like RLDS or LeRobot.
What is an egocentric view in video?
An egocentric view is the first-person perspective: the scene as the person doing the task sees it, captured by a camera mounted on their head, chest, or glasses. It shows the hands, the manipulated objects, and where attention is directed. This contrasts with an exocentric (third-person) view, where a fixed external camera observes the person. Robots learn manipulation more directly from egocentric video because the viewpoint matches a robot's own onboard cameras.
What is a human egocentric video annotation workflow?
An egocentric video annotation workflow segments continuous first-person footage into labeled sub-actions. The typical steps: redact faces and PII, segment the video at action boundaries (reach, grasp, lift, release) using narration and hand-motion cues, label each segment with verb-noun actions, add hand pose and contact points where needed, verify multi-sensor sync, and export to RLDS or LeRobot with provenance metadata attached to every episode.
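As a minimal sketch of the segmentation step, the snippet below models labeled sub-actions and checks that they are ordered, non-overlapping, and fully labeled before export. The `Segment` class, its field names, and `validate_segments` are illustrative assumptions, not part of RLDS, LeRobot, or any annotation tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One labeled sub-action (hypothetical schema, times in seconds)."""
    start_s: float
    end_s: float
    verb: str
    noun: str


def validate_segments(segments, clip_duration_s):
    """Return a list of human-readable problems; an empty list means the track is clean."""
    problems = []
    prev_end = 0.0
    for i, seg in enumerate(segments):
        if seg.end_s <= seg.start_s:
            problems.append(f"segment {i}: end is not after start")
        if seg.start_s < prev_end:
            problems.append(f"segment {i}: overlaps the previous segment")
        if seg.end_s > clip_duration_s:
            problems.append(f"segment {i}: runs past the end of the clip")
        if not seg.verb or not seg.noun:
            problems.append(f"segment {i}: missing verb or noun label")
        prev_end = max(prev_end, seg.end_s)
    return problems


# Example track using the action boundaries named in the text.
segments = [
    Segment(0.0, 1.2, "reach", "mug"),
    Segment(1.2, 1.9, "grasp", "mug"),
    Segment(1.9, 3.4, "lift", "mug"),
    Segment(3.4, 4.0, "release", "mug"),
]
print(validate_segments(segments, clip_duration_s=4.0))  # prints []
```

Running a check like this before export catches boundary mistakes while they are still cheap to fix in the annotation tool.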
What camera should I use to collect egocentric video?
For most programs, a GoPro Hero 13 Black (157g, up to a 177° FOV with the Ultra Wide Lens Mod) is the workhorse: rugged, cheap, and familiar to collectors. For lightweight, sensor-rich capture, Meta's Project Aria Gen 2 (~75g, with eye tracking and on-device hand tracking) or Apple Vision Pro (per-joint hand tracking, used to build Apple's EgoDex dataset) are stronger. Add an Intel RealSense D455 only when you specifically need synchronized RGB-D depth.
What resolution and frame rate should egocentric video be recorded at?
Record at 1080p and 30fps for standard manipulation. That's the de-facto standard, corroborated by major datasets: Egocentric-10K is 1080p/30fps and EgoVid-5M is curated from 1080p footage. Use 60fps only for fast actions like throwing or rapid tool changes. Going beyond that mostly wastes storage: most robot controllers run at 10 to 30 Hz, so 30fps video already supplies at least one frame per control step.
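The frame-rate argument is simple arithmetic, sketched below: divide the video frame rate by the controller rate to see how many frames arrive per control step. The function name is illustrative, and the 10 and 30 Hz values are the controller range quoted above, not measurements of any specific robot.

```python
def frames_per_control_step(video_fps, control_hz):
    """How many video frames arrive during one controller tick."""
    return video_fps / control_hz


for fps in (30, 60):
    for hz in (10, 30):
        ratio = frames_per_control_step(fps, hz)
        print(f"{fps} fps video, {hz} Hz controller: {ratio:.1f} frames per control step")
```

At 30fps a 10 Hz controller sees three frames per step and a 30 Hz controller sees one, so 60fps only pays off when the motion itself is faster than the control loop.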
How do I make an egocentric dataset diverse enough?
Spread your capture wide rather than deep, and make that a measurable target. Build a coverage matrix with environments as rows and activities as columns, and aim for at least three different participants per environment-activity cell so no single person's habits dominate. Vary lighting, object arrangement, clutter, and time of day across the matrix. Ego4D's 74 locations and DROID's three-continent collection show the payoff of spreading capture across scenes rather than stacking hours in one lab.
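One way to make the coverage target measurable is to count distinct participants per environment-activity cell and list the cells that fall short. The sketch below assumes a hypothetical session log of plain dicts; the field names, environment labels, and `coverage_gaps` helper are illustrative only.

```python
from collections import defaultdict

MIN_PARTICIPANTS = 3  # target from the text: at least three people per cell


def coverage_gaps(sessions, environments, activities, min_participants=MIN_PARTICIPANTS):
    """Return {(environment, activity): distinct participant count} for under-covered cells."""
    seen = defaultdict(set)
    for s in sessions:
        seen[(s["environment"], s["activity"])].add(s["participant_id"])
    gaps = {}
    for env in environments:
        for act in activities:
            n = len(seen[(env, act)])
            if n < min_participants:
                gaps[(env, act)] = n
    return gaps


# Hypothetical session log.
sessions = [
    {"environment": "home_kitchen", "activity": "load_dishwasher", "participant_id": "p01"},
    {"environment": "home_kitchen", "activity": "load_dishwasher", "participant_id": "p02"},
    {"environment": "home_kitchen", "activity": "load_dishwasher", "participant_id": "p03"},
    {"environment": "home_kitchen", "activity": "fold_laundry", "participant_id": "p01"},
    {"environment": "workshop", "activity": "fold_laundry", "participant_id": "p04"},
]
gaps = coverage_gaps(
    sessions,
    environments=["home_kitchen", "workshop"],
    activities=["load_dishwasher", "fold_laundry"],
)
for cell, count in sorted(gaps.items()):
    print(cell, f"{count}/{MIN_PARTICIPANTS} participants")
```

Counting distinct participants rather than sessions matters here: ten sessions from one person still leave the cell dominated by a single person's habits.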
How much egocentric video do I need to pretrain a robot policy?
Common industry guidance is on the order of 1,000 to 10,000 hours across diverse environments to pretrain a vision encoder, and 100 to 500 hours of in-domain footage to fine-tune for a specific task. But diversity beats raw hours: Figure trained Helix on 100% egocentric human video with no robot demonstrations, and Georgia Tech's EgoMimic found that a policy trained on 2 hours of robot data plus 1 hour of human hand data outperformed a baseline trained on 3 hours of robot data. Start small and in-domain, then scale.
How do I handle privacy and consent in egocentric video?
Establish a lawful basis under GDPR Article 6 and meet the consent conditions in Article 7 before recording. The consent form must state that the data trains AI models, that faces and PII are blurred before distribution, and that collectors can withdraw. Redact faces frame-by-frame (including reflections), scrub readable text and audio PII, and log consent per session. For high-risk systems, document the dataset to meet EU AI Act Article 10 requirements, enforceable from August 2, 2026.
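The per-session checks above can be expressed as a simple release gate. This is a sketch under stated assumptions: `SessionPrivacyRecord`, its fields, and `release_blockers` are hypothetical names for illustration, and passing the gate is not a substitute for legal review.

```python
from dataclasses import dataclass


@dataclass
class SessionPrivacyRecord:
    """Hypothetical per-session privacy record."""
    session_id: str
    consent_reference: str  # ID of the signed consent form, "" if none is logged
    consent_withdrawn: bool
    faces_redacted: bool
    text_and_audio_scrubbed: bool


def release_blockers(record):
    """Return the reasons a session must be held back; an empty list means it can ship."""
    blockers = []
    if not record.consent_reference:
        blockers.append("no consent logged for this session")
    if record.consent_withdrawn:
        blockers.append("collector withdrew consent")
    if not record.faces_redacted:
        blockers.append("faces not redacted")
    if not record.text_and_audio_scrubbed:
        blockers.append("readable text or audio PII not scrubbed")
    return blockers


ready = SessionPrivacyRecord("s-001", "consent-0042", False, True, True)
held = SessionPrivacyRecord("s-002", "consent-0043", True, True, False)
for record in (ready, held):
    blockers = release_blockers(record)
    status = "release" if not blockers else "hold: " + "; ".join(blockers)
    print(record.session_id, status)
```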
How do I get paid to collect egocentric video data?
Egocentric video collection is a real paid job. Programs recruit collectors to wear a camera rig and record consented everyday or occupational tasks, then pay for footage that passes a QA bar. Build AI's Egocentric-10K was collected across 85 factories, with its launch announcement reporting roughly 2,100 factory workers, and DROID was built by 50 data collectors across three continents. Human egocentric capture is typically valued at $10 to $30 per usable hour, with specialized skills paying more. Marketplaces and microtask platforms route this work to vetted collectors.
How is egocentric video different from teleoperation data for robot training?
Egocentric video is a person performing a task in first person, with no robot involved; teleoperation data is a human driving an actual robot through the task, capturing exact joint and action labels. Egocentric video is cheaper (roughly $10 to $30 per hour versus $50 to $200 for teleop) and scales faster, but it lacks ground-truth robot actions. Many modern pipelines co-train on both: egocentric video for broad visual priors, teleoperation for action-grounded fine-tuning.
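To see what the rate gap means for a budget, the sketch below multiplies a program size by the per-hour ranges quoted above. The 500-hour program is a hypothetical figure chosen for illustration, and the rates are the industry estimates from the text, not quotes.

```python
HUMAN_RATE_USD = (10, 30)    # per hour, estimate quoted in the text
TELEOP_RATE_USD = (50, 200)  # per hour, estimate quoted in the text


def cost_range(hours, rate_range):
    """Return the (low, high) cost in USD for a given number of hours."""
    low, high = rate_range
    return hours * low, hours * high


hours = 500  # hypothetical program size
for label, rates in (("egocentric", HUMAN_RATE_USD), ("teleoperation", TELEOP_RATE_USD)):
    low, high = cost_range(hours, rates)
    print(f"{label}: ${low:,} to ${high:,} for {hours} hours")
```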
What annotation format should egocentric video use for VLA training?
Use RLDS or LeRobot. RLDS stores episodes as TFRecord files with nested trajectory structures and is common in the Open X-Embodiment ecosystem, the corpus that VLA models like OpenVLA are pretrained on. LeRobot uses Parquet with referenced video and is native to the Hugging Face robot-learning stack. Both carry arbitrary metadata, so per-session provenance (collector ID, environment, consent reference) ships inside the dataset. Avoid proprietary annotation-tool formats unless the dataset stays internal.
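A format-neutral way to think about per-session provenance is a small record that is validated once and then attached to every episode. The sketch below uses only the Python standard library; the field names and `build_provenance` helper are assumptions for illustration and do not reflect the RLDS or LeRobot schemas, which define their own metadata layouts.

```python
import json


def build_provenance(session_id, collector_id, environment, consent_reference):
    """Assemble per-session provenance as a plain dict (hypothetical field names)."""
    record = {
        "session_id": session_id,
        "collector_id": collector_id,
        "environment": environment,
        "consent_reference": consent_reference,
    }
    missing = [key for key, value in record.items() if not value]
    if missing:
        raise ValueError(f"provenance is missing: {', '.join(missing)}")
    return record


provenance = build_provenance(
    session_id="s-001",
    collector_id="collector-17",
    environment="home_kitchen",
    consent_reference="consent-0042",
)
# Attach the record to an episode's metadata and confirm it survives serialization.
episode_metadata = {"episode_index": 0, "provenance": provenance}
encoded = json.dumps(episode_metadata, sort_keys=True)
assert json.loads(encoded)["provenance"] == provenance
print(encoded)
```

Rejecting incomplete records at build time keeps an episode from shipping without the consent reference that makes it licensable.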
Looking for how to collect egocentric video data?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners and helps scope consent artifacts and commercial licensing requirements before delivery.