Physical AI data engineering
How to Build an Egocentric Data Pipeline for Physical AI
An egocentric data pipeline turns raw first-person video from wearable cameras into robot-ready training data. It runs seven stages: ingest and validate clips, extract frames at 10 to 30 FPS, blur faces and PII, enrich with depth, segmentation, hand, and caption models, sync the sensor streams to the video clock, package trajectories as RLDS, LeRobot, or HDF5, then validate quality before delivery. The hard parts are sensor-to-video sync, camera-geometry matching, and provenance, not the vision models, which are commodities.
Quick facts
- Input: Wearable-camera clips (GoPro HERO13, Project Aria, Apple Vision Pro, RealSense D455)
- Extraction: 10 to 30 FPS, blur-gated, deduplicated
- Enrichment: Depth Anything V2, SAM 2, hand-object, caption (CLIP-gated)
- Sync: RGB to IMU/proprioception aligned to the video clock
- Delivery: RLDS, LeRobot (Parquet + MP4), HDF5, WebDataset
- Provenance: Consent artifact, license, model versions, lineage per clip
Comparison
| Dataset | Verified scale | Modality / intent | License |
|---|---|---|---|
| Ego4D | 3,670 hours; 923 participants; 74 locations, 9 countries | First-person daily-life video; perception benchmark | Ego4D license agreement |
| EgoDex (Apple) | 829 hours; 194 tabletop tasks | Egocentric video + paired 3D hand/finger tracking | CC BY-NC-ND (dataset) |
| EgoVid-5M | 5M+ clips at 1080P (7.1 TB) or 540P (3.5 TB) | Kinematic + text annotations; video generation | Per project terms |
| Egocentric-10K (Build AI) | 10,000 hours; 1.08B frames; 192,900 clips; 16.4 TB | Factory-worker first-person video | Apache-2.0 |
| Open X-Embodiment | 1M+ trajectories; 22 embodiments; 527 skills | Robot-side action data; cross-embodiment training | Mixed per source dataset |
What is an egocentric data pipeline?
An egocentric data pipeline is the set of stages that turn raw first-person video from a wearable camera into training data a robot policy can ingest without bespoke ETL. Egocentric data is first-person video or sensor data captured from the perspective of a person performing real tasks [1]. The pipeline takes the messy output of that capture (motion blur, faces, drift, variable lighting) and produces clean, labeled, rights-cleared episodes in a standard format.
Why first-person instead of a fixed camera? Because it's the viewpoint the policy will actually see at inference. Figure trains its Helix VLA on 100% egocentric human video collected passively as people do tasks in real Brookfield homes [2]. EgoMimic co-trains manipulation policies from human egocentric video plus teleoperated robot data, and reports that one hour of extra hand data is worth more than one hour of extra robot data [3]. So the first-person frame lets you bridge cheap human demonstration and expensive robot teleoperation.
This guide walks seven stages: ingest and validate, extract and gate frames, filter PII, enrich with vision models, sync sensors and match geometry, package, then validate. Two of them decide whether the data is usable, and almost nobody documents them: sensor sync with camera-geometry matching (Stage 5), and provenance (the licensing section). Read those twice.
Pipeline architecture at a glance
An egocentric data pipeline is a one-way conveyor: a clip enters as an immutable raw file and leaves as one or more episodes in a delivery bucket, with a row in a metadata database tracing every transformation in between. Each stage reads from the previous stage's output prefix, writes to its own, and records which model version it ran. Nothing overwrites the raw archive.
You make one batch-versus-streaming decision up front. For under roughly 500 clips per week, a single Python process with checkpointing to Redis or a status column will do. Above that, parallelize frame extraction and GPU inference across workers, with Apache Airflow or Prefect for orchestration and Celery plus Redis (or AWS Batch) for the task queue. Keep all metadata (clip_id, source_camera, recording_date, processing_status, model versions) in PostgreSQL or DuckDB so any episode replays from its source clip.
The table below maps the seven stages to outputs and tools, plus a rough difficulty and effort read so you can plan staffing. The two starred stages, sync and validation, are where most pipelines lose transferability without anyone noticing until a buyer's policy fails.
| Stage | Output | Key tools | Difficulty | Build effort |
|---|---|---|---|---|
| 1. Ingest + validate | Validated, archived clips + provenance row | ffprobe, S3, PostgreSQL | Low | ~1 day |
| 2. Extract + gate | Frame sequences, blur-gated, deduplicated | FFmpeg, OpenCV, pHash | Low | ~1 day |
| 3. Filter PII | Face/PII-blurred frames + audit status | RetinaFace, PaddleOCR, human review | Medium | ~3 days |
| 4. Enrich | Depth, masks, hand poses, captions | Depth Anything V2, SAM 2, EgoHOS, CLIP | Medium | ~1 week |
| 5. Sync + geometry ★ | Sensor streams aligned to the video clock | PTP/NTP, GPIO trigger, calibration | High | ~2 weeks |
| 6. Package | RLDS / LeRobot / HDF5 episodes | RLDS, LeRobot, MCAP | Medium | ~3 days |
| 7. Validate ★ | Accepted episodes + audit trail + dashboard | Streamlit/Gradio, metric gates | Medium | ~1 week |
Batch vs streaming: which architecture to run
The architecture question is settled by how data arrives and how fast you need it back. Most egocentric capture is batch: collectors record on wearable cameras through the day and upload chunks, so a clip is hours old by the time it lands. For that, a batch pipeline is correct. An orchestrator (Airflow or Prefect) schedules runs, fans frame extraction and GPU inference out across workers, and writes episodes when a batch completes. You optimize for throughput, not latency, and you size the GPU pool to clear the backlog within your delivery SLA.
A streaming pipeline is only worth its extra complexity when capture and consumption are coupled: a live teleoperation session where an operator needs near-real-time feedback, or an active-learning loop that re-prioritizes capture based on what the model is still weak on. There the stages run as always-on workers reading from a queue (Kafka or Kinesis), and SAM 2's streaming memory for real-time video [4] makes per-frame enrichment feasible online. The cost is operational: backpressure handling, partial-failure recovery, and exactly-once semantics you don't need in batch.
For a data marketplace or a research dataset, default to batch. Reach for streaming only when a human or a robot is waiting on the output, and keep the two paths separate rather than bolting real-time constraints onto a batch system that doesn't need them.
A concrete way to size the batch path: pick a target delivery SLA (say, clips processed within 48 hours of upload), measure your per-clip wall-clock through the heaviest stage, and provision GPU workers to clear the expected daily upload inside that window with headroom for a backlog day. If a collector network uploads 200 hours of footage on a Friday, you don't want Monday's batch to still be running on Wednesday. Idempotency is the other batch property worth designing in early: key every stage's output by clip_id plus stage plus model-version, so re-running a failed batch overwrites cleanly instead of producing duplicate episodes. That same key is what lets you re-run a single stage later when you swap an enrichment model, which is the replay path Stage 7 and the monitoring section depend on.
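The sizing and idempotency ideas above can be sketched in a few lines. This is a minimal illustration, not a capacity planner: `gpu_hours_per_video_hour` is a number you would measure on your own heaviest stage, `headroom` is an assumed safety factor, and the `processed/...` key layout is a hypothetical convention. The function sizes for a single upload burst and does not model steady-state arrival.

```python
import math


def workers_for_burst(burst_video_hours, gpu_hours_per_video_hour,
                      sla_hours=48.0, headroom=1.5):
    """GPU workers needed to clear one upload burst inside the SLA window.

    gpu_hours_per_video_hour is measured on your heaviest stage;
    headroom > 1 leaves slack for a backlog day (both are assumptions).
    """
    gpu_hours = burst_video_hours * gpu_hours_per_video_hour * headroom
    return max(1, math.ceil(gpu_hours / sla_hours))


def stage_output_key(clip_id, stage, model_version):
    """Deterministic output prefix for one stage of one clip.

    The same (clip_id, stage, model_version) always maps to the same key,
    so re-running a failed batch overwrites instead of duplicating.
    """
    return f"processed/{clip_id}/{stage}/{model_version}/"


# 200 hours uploaded on a Friday, 2 GPU-hours per video hour (assumed figure)
print(workers_for_burst(200, 2.0))
print(stage_output_key("clip-uuid", "depth", "depth_anything_v2_vitl"))
```

Because the model version is part of the key, swapping an enrichment model writes to a new prefix and leaves the previous outputs available for comparison.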
Stage 1: Ingest and validate raw wearable-camera footage
Difficulty: low. Wearable cameras (GoPro HERO13, Project Aria, Apple Vision Pro, Intel RealSense D455) output H.264 or H.265 at 1080p to 5.3K and 30 to 120 FPS. Ingestion validates the file before it touches a GPU. Probe it with ffprobe, which ships with FFmpeg:
`ffprobe -v error -show_entries format=duration,bit_rate -show_entries stream=codec_name,width,height,r_frame_rate -of json input.mp4`
Reject clips with corrupted headers, zero duration, or a resolution that doesn't match the capture spec. Then assign a UUID and write a provenance row: collector_id (anonymized), recording_date, location_type, activity_category, source_camera with firmware, and the consent-artifact reference. Apple captured EgoDex with paired 3D hand and finger tracking recorded at the time of capture using Vision Pro and on-device SLAM [5]; if your camera emits a parallel pose or IMU stream, capture its file path and clock offset now, because you can't reconstruct it later.
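The rejection rules above can be sketched as a small check over the probe's JSON output. This is an illustration under assumptions: `validate_probe` and `expected_sizes` are hypothetical names, and the reason strings are placeholders for whatever your status column uses.

```python
import json
from fractions import Fraction


def validate_probe(probe_json, expected_sizes):
    """Return a list of rejection reasons for one clip (empty list = accept).

    probe_json is the JSON text printed by ffprobe with -of json;
    expected_sizes is a set of (width, height) tuples from the capture spec.
    """
    try:
        probe = json.loads(probe_json)
    except json.JSONDecodeError:
        return ["unparseable probe output"]

    reasons = []
    try:
        duration = float(probe.get("format", {}).get("duration", 0))
    except (TypeError, ValueError):
        duration = 0.0
    if duration <= 0:
        reasons.append("zero duration")

    # Only video streams carry width and height.
    video = [s for s in probe.get("streams", []) if "width" in s and "height" in s]
    if not video:
        reasons.append("no video stream")
        return reasons

    stream = video[0]
    if (stream["width"], stream["height"]) not in expected_sizes:
        reasons.append("resolution outside capture spec")

    # r_frame_rate is a rational string such as "30000/1001".
    try:
        fps = Fraction(stream.get("r_frame_rate", "0/1"))
    except (ValueError, ZeroDivisionError):
        fps = Fraction(0)
    if fps <= 0:
        reasons.append("invalid frame rate")
    return reasons
```

Returning reasons rather than a boolean keeps the rejection cause available for the provenance row.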
Make the provenance row a real table, not a JSON blob, so every field is queryable at delivery time. At minimum store clip_id (UUID), raw_s3_key, collector_id (anonymized), recording_date, location_type, activity_category, source_camera, camera_firmware, consent_artifact_ref, consent_scope (research vs commercial-training), and processing_status. The consent_scope column is the one that decides commercial usability per clip, and the camera columns are what Stage 5's geometry filtering reads later, so capturing them at ingest costs nothing and not capturing them is unrecoverable.
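One way to make those fields a real table is sketched below. It uses Python's built-in `sqlite3` only so the example is self-contained; the guide's own choice is PostgreSQL or DuckDB, and the table name, the `'ingested'` default status, and the helper names are assumptions for illustration.

```python
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE IF NOT EXISTS clip_provenance (
    clip_id              TEXT PRIMARY KEY,
    raw_s3_key           TEXT NOT NULL,
    collector_id         TEXT NOT NULL,
    recording_date       TEXT NOT NULL,
    location_type        TEXT,
    activity_category    TEXT,
    source_camera        TEXT NOT NULL,
    camera_firmware      TEXT,
    consent_artifact_ref TEXT NOT NULL,
    consent_scope        TEXT NOT NULL
        CHECK (consent_scope IN ('research', 'commercial-training')),
    processing_status    TEXT NOT NULL DEFAULT 'ingested'
)
"""

COLUMNS = (
    "raw_s3_key", "collector_id", "recording_date", "location_type",
    "activity_category", "source_camera", "camera_firmware",
    "consent_artifact_ref", "consent_scope",
)


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def register_clip(conn, row):
    """Insert one provenance row and return the generated clip_id (UUID)."""
    clip_id = str(uuid.uuid4())
    names = ", ".join(("clip_id",) + COLUMNS)
    marks = ", ".join("?" for _ in range(len(COLUMNS) + 1))
    conn.execute(
        f"INSERT INTO clip_provenance ({names}) VALUES ({marks})",
        [clip_id] + [row.get(name) for name in COLUMNS],
    )
    conn.commit()
    return clip_id


def commercially_usable(conn):
    """Clip ids whose consent scope allows commercial training."""
    rows = conn.execute(
        "SELECT clip_id FROM clip_provenance "
        "WHERE consent_scope = 'commercial-training'"
    )
    return [r[0] for r in rows]
```

The `CHECK` constraint rejects an unknown consent scope at insert time, so a mistyped value cannot surface later as an unusable clip.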
Archive raw clips to object storage under an immutable prefix, organized by date and collector: `s3://bucket/raw/2026-06-10/collector-a3f2/clip-uuid.mp4`. Turn on versioning and a lifecycle rule that moves cold raw clips to Glacier Instant Retrieval after 90 days. Never overwrite or delete a source file once ingested. Every enriched artifact stores its raw_s3_key, so any episode traces back to the exact clip and consent record it came from.
Choosing the capture camera: a spec comparison
The camera decides what signal the pipeline even has to work with, so pick it before you write a line of code. The deciding factor is matching the camera's geometry to your robot's deployment view; the spec table below covers the common rigs, with every figure from the vendor's own spec page.
A GoPro HERO13 Black gives you cheap, high-resolution RGB: 5.3K60 video (and 4K120, 2.7K240) with HyperSmooth 6.0 stabilization on a 1/1.9-inch sensor, plus a 156-degree native field of view that the HB-Series Ultra Wide Lens Mod widens to 177 degrees [6]. That wide FOV is useful precisely because you can crop it down to a target deployment view later. Project Aria Gen 1 trades resolution for synchronized sensors: an RGB camera at 2880x2880 (downsampled to 1408x1408), two mono scene (SLAM) cameras at 640x480, and two IMUs at 1000 Hz and 800 Hz [7], which is why research capture (Ego4D, EgoMimic) leans on Aria. The Intel RealSense D455 adds metric depth: stereo depth at 1280x720 up to 90 FPS, an 86-by-57-degree depth FOV, an ideal range of 0.6 to 6 m, and under 2 percent depth error at 4 m [8]. Apple Vision Pro is the dexterity rig, since EgoDex's paired 3D hand and finger tracking comes from its on-device SLAM [5].
| Camera | RGB / video | Depth or pose | Field of view | Best for |
|---|---|---|---|---|
| GoPro HERO13 Black | 5.3K60, 4K120, 2.7K240; HyperSmooth 6.0 | None (RGB only) | 156 deg native, 177 deg with lens mod | Cheap high-res RGB at scale |
| Project Aria (Gen 1) | RGB 2880x2880 down to 1408x1408 | 2x mono scene/SLAM 640x480 + 2 IMUs (1000/800 Hz) | Wide multi-camera | Synchronized research capture |
| Intel RealSense D455 | RGB + stereo depth 1280x720 @90 FPS | Metric depth, 0.6-6 m, <2% error @4 m | 86 x 57 deg depth FOV | Metric depth out of the box |
| Apple Vision Pro | Egocentric video | On-device SLAM + 3D hand/finger pose | Headset-native | Dexterous hand-tracking capture |
Stage 2: Frame extraction and quality gating
Difficulty: low. Extract frames at the rate the model needs, not the camera's native rate. Egocentric manipulation rarely needs 120 FPS; 10 to 30 FPS captures human hand cadence while cutting storage and compute. Match the rate to the control frequency of your target robot: a policy that acts at 20 Hz wants frames near 20 FPS.
`ffmpeg -i input.mp4 -vf fps=10 -q:v 2 frames/%06d.jpg`
That extracts 10 FPS at high JPEG quality. Then gate on quality before you spend GPU time. Score blur with Laplacian variance:
`cv2.Laplacian(gray, cv2.CV_64F).var()`
Low variance means a flat, blurry frame. Calibrate the threshold per camera rather than copying a magic number; head-turn motion blur is the dominant failure mode in first-person capture, so a clip can lose 30 to 50 percent of frames here and that's fine. Deduplicate near-identical frames with a perceptual hash (pHash) and drop any frame within a few bits of Hamming distance of its predecessor. That removes the dead segments where the wearer paused. Write surviving frames to `processed/clip-uuid/frames/` and record frames_extracted_count and the blur distribution so later stages can skip or flag weak frames.
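The deduplication step can be sketched without any imaging library once each frame has a 64-bit perceptual hash. This sketch assumes the hashes are already computed as integers. It compares each frame with the last kept frame rather than its immediate predecessor, a deliberate variation so slow drift cannot slip through one bit at a time. The `max_distance` default is an assumed starting point to calibrate.

```python
def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def dedupe_by_phash(hashes, max_distance=4):
    """Return the indices of frames to keep, in frame order.

    hashes: perceptual hashes as ints, one per extracted frame.
    A frame is dropped when it is within max_distance bits of the
    most recently kept frame.
    """
    kept = []
    last_kept = None
    for index, value in enumerate(hashes):
        if last_kept is None or hamming(value, last_kept) > max_distance:
            kept.append(index)
            last_kept = value
    return kept


# The wearer pauses (frames 0-1), moves (frame 2), then pauses again (frame 3).
print(dedupe_by_phash([0b00000000, 0b00000001, 0b11111111, 0b11111110]))
```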
Stage 3: Privacy and PII filtering
Difficulty: medium. Egocentric video shot in homes and workplaces captures faces, screens, documents, and license plates. This stage isn't optional, and it's a real reason public corpora carry restrictive licenses: EPIC-KITCHENS, for example, states plainly that "you may not use the material for commercial purposes" [9]. Filtering PII is what clears your data for commercial sale.
Run a face detector (RetinaFace or MTCNN) on every frame, expand each box outward so the blur covers hairline and jaw, and apply a strong Gaussian blur. Detect on-screen text with OCR (PaddleOCR or EasyOCR) and blur regions that match PII patterns: `\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b` for emails, `\b\d{3}-\d{2}-\d{4}\b` for US SSNs. Run it as two passes: automated detection plus blur, then human spot-review of a sample and of any frame the detector flagged as low-confidence. Record privacy_status (pending, blurred, reviewed) per clip.
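The pattern matching and the box expansion described above can be sketched with the standard library alone. The detector and OCR calls are left out, and `pii_kinds`, `expand_box`, and the 25 percent margin are hypothetical names and values for illustration.

```python
import re

# Patterns for OCR'd text; extend with whatever your jurisdiction requires.
EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def pii_kinds(text):
    """Return which PII patterns appear in one OCR text region."""
    kinds = []
    if EMAIL.search(text):
        kinds.append("email")
    if US_SSN.search(text):
        kinds.append("us_ssn")
    return kinds


def expand_box(box, frame_w, frame_h, margin=0.25):
    """Grow a detector box (x0, y0, x1, y1) by a margin, clamped to the frame,
    so the blur covers hairline and jaw rather than just the face centre."""
    x0, y0, x1, y1 = box
    dx = (x1 - x0) * margin
    dy = (y1 - y0) * margin
    return (
        max(0, int(x0 - dx)),
        max(0, int(y0 - dy)),
        min(frame_w, int(x1 + dx)),
        min(frame_h, int(y1 + dy)),
    )


print(pii_kinds("contact jane.doe@example.com"))
print(expand_box((100, 100, 200, 200), 1920, 1080))
```

Regex matching only catches well-formed strings, which is one reason the human spot-review pass still matters.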
The EPIC-KITCHENS sentence quoted above, "you may not use the material for commercial purposes", separates a research benchmark from a commercial training-data license. PII blur is only half the obligation. If any frame can identify a person and you collect in the EU or from EU residents, your lawful basis is usually consent under GDPR Article 6(1)(a). Article 4(11) defines valid consent as freely given, specific, informed, and unambiguous, and Article 7 adds that you must be able to demonstrate it and that the person can withdraw it as easily as they gave it. So the consent artifact you reference in the Stage 1 provenance row has to be specific to AI-training use and revocable, and your pipeline needs a way to pull a withdrawn participant's clips back out of every downstream dataset. Treat consent and PII handling as a pipeline stage with an audit trail and a privacy_status you can query, not a manual checkbox.
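Pulling a withdrawn participant's clips back out is a join between the provenance rows and whatever index maps episodes to source clips. A minimal sketch follows, assuming provenance rows are dicts with `clip_id` and `collector_id`, and that `episode_index` (a hypothetical structure) maps each episode id to its source clip ids.

```python
def episodes_to_purge(withdrawn_collector_id, provenance_rows, episode_index):
    """Return the episode ids that contain any clip from a withdrawn participant.

    provenance_rows: iterable of dicts with 'clip_id' and 'collector_id'.
    episode_index: dict mapping episode_id -> list of source clip_ids.
    """
    withdrawn_clips = {
        row["clip_id"]
        for row in provenance_rows
        if row["collector_id"] == withdrawn_collector_id
    }
    return sorted(
        episode_id
        for episode_id, clip_ids in episode_index.items()
        if withdrawn_clips.intersection(clip_ids)
    )


rows = [
    {"clip_id": "clip-1", "collector_id": "a3f2"},
    {"clip_id": "clip-2", "collector_id": "b771"},
]
index = {"ep-001": ["clip-1"], "ep-002": ["clip-2"], "ep-003": ["clip-1", "clip-2"]}
print(episodes_to_purge("a3f2", rows, index))
```

The lookup only works if every episode records its source clips, which is the lineage requirement the rest of the pipeline depends on.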
Stage 4: Multi-model enrichment (depth, segmentation, hands, captions)
Difficulty: medium. Enrichment is the stage most guides stop at, because the models are commodities now. Depth Anything V2 produces monocular depth, trains on 595K synthetic labeled plus 62M+ real unlabeled images, ships from 25M to 1.3B parameters, and runs about 10x faster than diffusion-based depth models [10]. Pick the ViT-Large checkpoint for quality or a smaller one for throughput, and save depth as 16-bit PNG keyed to the source frame.
For segmentation, SAM 2 is a foundation model for promptable segmentation in images and video with streaming memory for real-time use. On the SA-V test set its large model runs 39.5 FPS and its tiny model 91.2 FPS on an A100 [4], so your speed-versus-quality call is a checkpoint choice, not an architecture rewrite. Pair class-agnostic masks with a classifier (CLIP) to attach labels. For hands, run an egocentric hand-object model rather than a generic body-pose model, because first-person hands enter from the bottom of the frame and generic pose estimators miss them. EgoHOS gives you egocentric hand and interacted-object masks; the 100DOH hand-object detector adds a per-hand contact state, classifying each detected hand into no contact, self contact, other-person contact, portable-object contact, or stationary-object (furniture) contact [11]. That contact label is worth keeping as a queryable field, because a buyer training a grasp policy wants the frames where a hand is actually in portable-object contact, not the frames where it's resting on a table.
Gate captioning hard. Vision-language models hallucinate objects that aren't in frame, so generate the caption, verify it with a CLIP similarity check against the actual frames, and drop captions below threshold. Store everything keyed to the frame index with the model name and version (`depth_model: depth_anything_v2_vitl`) so the dataset stays reproducible and a buyer can filter by enrichment version.
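The caption gate reduces to a similarity score between one caption embedding and the embeddings of the frames it describes. The sketch below assumes those embeddings have already been produced by a CLIP-style model and are passed in as plain lists of floats. The model call is omitted, and the threshold is a parameter to calibrate on your own data rather than a recommended value.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def gate_caption(caption_embedding, frame_embeddings, threshold):
    """Return (keep, score), where score is the mean caption-to-frame similarity.

    A caption with no frames to check against is dropped.
    """
    if not frame_embeddings:
        return False, 0.0
    score = sum(cosine(caption_embedding, f) for f in frame_embeddings) / len(frame_embeddings)
    return score >= threshold, score


keep, score = gate_caption([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], threshold=0.7)
print(keep, score)
```

Store the score alongside the caption so a buyer can apply a stricter threshold without re-running the model.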
Stage 5: Sync sensors and match robot geometry
Difficulty: high, and this is the stage most guides, including Claru's, skip. It's where egocentric data fails to transfer to a robot even though the frames look clean. Video alone isn't enough. Unidata makes the point that video captures what the hand looks like but cannot capture how hard a grasp is or when a slip is imminent [12]. If your capture rig has a wrist IMU, force/torque sensor, or glove, those streams carry the signal the camera can't see, and you have to align them to the video clock.
Sync tolerances are tight. A wrist IMU should be aligned to within about 20 ms of the video [13]; beyond that the kinematics and the pixels disagree and the policy learns noise. Use hardware-triggered capture or a shared clock (PTP/NTP or a GPIO pulse), and store the per-stream clock offset alongside each clip so a later run can re-align if a device drifts. The DROID robot dataset is the reference for what disciplined multi-stream capture records: per timestep, at 15 Hz, three stereo RGB streams at 1280x720 (two exterior ZED 2 plus a wrist-mounted ZED Mini), joint positions and velocities (7D), end-effector pose and velocity (6D), and gripper position and velocity (1D) [14], over 76k trajectories with 1,417 calibrated viewpoints [15]. An egocentric rig won't have a robot's joint encoders, but it borrows the same calibration discipline: one clock, recorded offsets, and a fixed extrinsic between every sensor.
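A cheap sanity check after applying the stored clock offset is to confirm that every video frame has a sensor sample close to it in time. The sketch below does only that. It flags dropouts and gross misalignment but does not estimate the offset itself, which still comes from the hardware trigger or shared clock. The function names and the 20 ms default mirror the tolerance discussed above and are otherwise assumptions.

```python
from bisect import bisect_left


def nearest_gap_ms(frame_ts, sensor_ts, clock_offset_s=0.0):
    """Gap in milliseconds from each frame to its nearest sensor sample.

    frame_ts: frame timestamps in seconds on the video clock.
    sensor_ts: sensor timestamps in seconds, sorted ascending, on the sensor clock.
    clock_offset_s: stored offset added to sensor time to reach video time.
    """
    if not sensor_ts:
        raise ValueError("sensor stream is empty")
    shifted = [t + clock_offset_s for t in sensor_ts]
    gaps = []
    for t in frame_ts:
        i = bisect_left(shifted, t)
        neighbours = shifted[max(0, i - 1): i + 1]  # samples on either side of t
        gaps.append(min(abs(t - s) for s in neighbours) * 1000.0)
    return gaps


def frames_out_of_tolerance(frame_ts, sensor_ts, clock_offset_s=0.0, tolerance_ms=20.0):
    """Indices of frames with no sensor sample within tolerance_ms."""
    gaps = nearest_gap_ms(frame_ts, sensor_ts, clock_offset_s)
    return [i for i, gap in enumerate(gaps) if gap > tolerance_ms]


# The third frame falls in a sensor dropout.
print(frames_out_of_tolerance([0.0, 0.1, 0.2], [0.0, 0.105, 0.5]))
```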
Camera geometry is the other failure mode, and it stays hidden because the frames look fine. A 15-degree field-of-view mismatch changes the apparent size and position of grasped objects enough to matter for precise manipulation [16]. A head-mounted camera sees hands enter from below; a wrist-mounted end-effector camera sees the workspace from 15 to 30 cm above the object, looking roughly downward [17]. Train a policy on head footage, deploy on a wrist camera, and the distributions won't match, so the policy degrades on hardware whose view looks nothing like its training data. You have two ways out. Either decide the deployment camera up front and capture to match it, or capture a wide fisheye at head level (204 degrees horizontal by 137 degrees vertical) that contains the full hand-workspace geometry, then crop to the target view later [18]. If you need metric depth at the head position to recover that geometry, a stereo head like the ZED X Mini gives depth from 0.1 to 8 m [19]. Whichever you pick, write the camera model, intrinsics, and extrinsic into the clip's metadata, because a buyer training a wrist-camera policy has to filter your corpus down to wrist-compatible geometry, and they can only do that if the geometry is a queryable field rather than something they have to eyeball.
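Once intrinsics are stored per clip, geometry filtering becomes a small computation. The sketch below derives horizontal field of view from the focal length under a pinhole model, so it does not apply to fisheye lenses, which need their own projection model. It then filters clips by mount and FOV. The field names (`mount`, `width`, `fx`) and the 5-degree default tolerance are assumptions for illustration.

```python
import math


def horizontal_fov_deg(image_width_px, fx_px):
    """Horizontal field of view in degrees for a pinhole camera.

    fx_px is the focal length in pixels from the intrinsics matrix.
    """
    return math.degrees(2.0 * math.atan(image_width_px / (2.0 * fx_px)))


def geometry_compatible(clips, target_mount, target_fov_deg, tolerance_deg=5.0):
    """Return clip ids whose mount matches and whose FOV is within tolerance.

    clips: iterable of dicts with 'clip_id', 'mount', 'width', and 'fx'.
    """
    keep = []
    for clip in clips:
        fov = horizontal_fov_deg(clip["width"], clip["fx"])
        if clip["mount"] == target_mount and abs(fov - target_fov_deg) <= tolerance_deg:
            keep.append(clip["clip_id"])
    return keep


clips = [
    {"clip_id": "a", "mount": "wrist", "width": 1280, "fx": 640.0},
    {"clip_id": "b", "mount": "head", "width": 1280, "fx": 640.0},
    {"clip_id": "c", "mount": "wrist", "width": 1280, "fx": 320.0},
]
print(geometry_compatible(clips, "wrist", 90.0))
```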
Stage 6: Package as robot-ready datasets
Difficulty: medium. Package in the format the buyer's training stack ingests directly. RLDS stores episodic data as Episodes of Steps, where each Step carries mandatory is_first and is_last flags plus optional observation, action, reward, discount, and metadata fields [20]. It's the schema behind Open X-Embodiment, which aggregates 1M+ trajectories across 22 embodiments and 527 skills [21], so RLDS-formatted data drops into cross-embodiment training. Serialize it with the RLDS tooling, which targets the TensorFlow Datasets ecosystem [22].
LeRobot is the other standard worth supporting: a LeRobotDataset stores synchronized MP4 videos for vision and Parquet files for state and action data [23]. The Parquet-plus-MP4 layout streams and filters without loading whole episodes into memory, which matters when an episode is a five-minute task. For custom multi-modal schemas, HDF5 and WebDataset work fine; for the tabular metadata sidecar, use Parquet. You won't write any of these serializers from scratch: the building blocks of an egocentric video data pipeline (the RLDS tooling, the LeRobot repo, SAM 2, Depth Anything V2) are all open source on GitHub, so packaging is wiring those libraries together rather than authoring a format.
Two packaging rules prevent silent leakage. Split by episode, never by frame, or temporal information bleeds across train and test. And group frames into episodes by task boundary (one "make coffee" attempt from start to finish), not by file. If you're ingesting robot logs, read them from MCAP or ROS bags and align the topics to the video before you write Steps.
Decide the schema by who consumes the data. RLDS is the right default if the buyer trains in the TensorFlow or JAX ecosystem or co-trains against Open X-Embodiment, because the corpus already speaks that schema [21]. LeRobot is the right default for the PyTorch and Hugging Face side, where the Parquet-plus-MP4 layout lets a dataloader stream frames lazily instead of decoding whole episodes [23]. When you genuinely don't know the buyer's stack, ship RLDS as the interchange format and offer a LeRobot conversion, since both can represent the same Episode-of-Steps structure. Whatever you write, store the observation keys consistently across every episode (for example, always `observation/image`, `observation/depth`, `observation/state`), because a downstream training script keys on those field names and a single inconsistent episode breaks the dataloader for the whole shard. Version the schema itself in the dataset card so a buyer who ingested v1 knows what changed in v2 before they re-pull terabytes.
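The consistent-keys rule is easy to enforce before a shard is written. The sketch below assumes episodes are held in memory as lists of step dicts, each with an `observation` dict. That structure and the name `find_schema_drift` are illustrative, not part of the RLDS or LeRobot APIs.

```python
def find_schema_drift(episodes, expected_keys):
    """Report the first step in each episode whose observation keys differ.

    episodes: dict mapping episode_id -> list of step dicts with 'observation'.
    expected_keys: the observation keys every step must carry.
    Returns a dict of episode_id -> details; empty means the shard is consistent.
    """
    expected = frozenset(expected_keys)
    drift = {}
    for episode_id, steps in episodes.items():
        for index, step in enumerate(steps):
            keys = frozenset(step["observation"])
            if keys != expected:
                drift[episode_id] = {
                    "step": index,
                    "missing": sorted(expected - keys),
                    "unexpected": sorted(keys - expected),
                }
                break
    return drift


episodes = {
    "ep-001": [{"observation": {"image": "f0", "depth": "d0", "state": [0.0]}}],
    "ep-002": [{"observation": {"image": "f0", "rgb_depth": "d0", "state": [0.0]}}],
}
print(find_schema_drift(episodes, ["image", "depth", "state"]))
```

Running this as a packaging gate turns a dataloader crash on the buyer's side into a failed batch on yours.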
Stage 7: Quality validation and human review loops
Difficulty: medium. Automated checks catch pipeline bugs before bad data reaches training. Compute per-clip metrics and gate on them rather than eyeballing output. Flag any clip that fails a threshold for human review, and build a simple dashboard (Streamlit or Gradio) that shows the RGB frame, depth, mask overlay, and caption side by side so a reviewer can accept, reject, or send back for reprocessing in seconds.
Sample-audit accepted clips too, because a detector that regresses will pass its own checks without anyone noticing. The marketplace framing matters here. Value comes from rejecting bad batches before they reach the buyer, not from raw clip count. Track ingestion volume, per-stage error rate, and processing latency over time so a regression shows up as a trend, not as a buyer complaint three weeks later.
Set the human-review sample rate by how much you trust the automated gates. A new capture campaign or a freshly swapped enrichment model warrants reviewing 10 to 20 percent of accepted clips until the metrics stabilize; a mature pipeline running known-good models can drop to a 1 to 5 percent audit on accepted clips while still reviewing 100 percent of clips the gates flagged. Make the review decision durable: store accept, reject, and reprocess as a status on the clip with the reviewer id and a timestamp, so the same clip never lands in two reviewers' queues and a rejected clip can't silently re-enter the delivery set. The reject reasons themselves are signal. If a reviewer keeps rejecting clips for "depth holes on reflective surfaces," that's a capture-environment problem to fix upstream, not a per-clip fix, and you only see it if reject reasons are a structured field you can aggregate rather than free-text notes.
| Metric | Definition | Flag if |
|---|---|---|
| Frame yield | Extracted frames / expected at target FPS | Below 95% |
| Blur ratio | Blurred frames / total | Above 40% |
| Depth coverage | Valid-depth pixels / total | Below 80% |
| Mask coverage | Non-background pixels / total | Below 10% |
| Caption confidence | Mean CLIP-gate score | Below 0.7 |
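The gates in the table above can be expressed as data so thresholds change without touching code. This is a sketch: the metric names are hypothetical identifiers for the five rows, and treating a missing metric as a failure is a design assumption.

```python
# (direction, limit): "min" flags values below the limit, "max" flags values above.
GATES = {
    "frame_yield": ("min", 0.95),
    "blur_ratio": ("max", 0.40),
    "depth_coverage": ("min", 0.80),
    "mask_coverage": ("min", 0.10),
    "caption_confidence": ("min", 0.70),
}


def failed_gates(metrics, gates=GATES):
    """Return the names of gates a clip fails; an empty list means accept.

    A metric that was never computed counts as a failure, so a silently
    skipped stage cannot pass review.
    """
    failures = []
    for name, (direction, limit) in gates.items():
        value = metrics.get(name)
        if value is None:
            failures.append(name)
        elif direction == "min" and value < limit:
            failures.append(name)
        elif direction == "max" and value > limit:
            failures.append(name)
    return failures


clip_metrics = {
    "frame_yield": 0.97,
    "blur_ratio": 0.52,
    "depth_coverage": 0.91,
    "mask_coverage": 0.34,
    "caption_confidence": 0.74,
}
print(failed_gates(clip_metrics))
```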
Choosing enrichment models: speed versus accuracy
Every enrichment stage is a throughput-versus-quality dial, and the right setting depends on volume. For the two heaviest passes, the trade-off is a checkpoint choice, not a research project.
Depth: Depth Anything V2 ships in scales from 25M to 1.3B parameters [10], so you pick the ViT-Small or Base checkpoint for a high-volume pipeline and the ViT-Large checkpoint when depth quality drives a manipulation policy. Because it runs about 10x faster than diffusion-based depth models [10], even the large checkpoint stays tractable at scale; run it at reduced input resolution before you reach for a smaller model.
Segmentation: SAM 2 spans the same kind of range. On the SA-V test set the large model runs 39.5 FPS at 79.5 J&F while the tiny model runs 91.2 FPS at 76.5 J&F on an A100 [4]. That's a roughly 2.3x speedup for a few points of mask quality, which is the right trade for a 10,000-clip month and the wrong trade for a small high-precision dataset. Hands and captions are lighter: run the egocentric hand-object model on every frame, but caption on a sparser cadence (one caption per second of video) and carry each caption across the frames in its span, since a caption describes an action span rather than a single frame.
Rule of thumb: extract at the lowest FPS the policy tolerates, pick the smallest checkpoint that clears your quality gate, and spend the saved compute on running more clips, not on a marginally sharper mask.
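The trade-off is easier to reason about as arithmetic. The sketch below converts footage hours, extraction rate, and a model's frames-per-second figure into GPU-hours. It treats the published benchmark FPS as sustained single-GPU throughput, which is an optimistic assumption, so measure your own pipeline before budgeting.

```python
def gpu_hours(video_hours, extract_fps, model_fps):
    """GPU-hours for one enrichment pass over a campaign.

    video_hours: hours of source footage.
    extract_fps: frames kept per second of video after extraction.
    model_fps: frames per second the model sustains on one GPU (assumed).
    """
    frames = video_hours * 3600 * extract_fps
    return frames / model_fps / 3600


# 100 hours of footage at 10 FPS, using the two SAM 2 speed figures quoted above.
large = gpu_hours(100, 10, 39.5)
tiny = gpu_hours(100, 10, 91.2)
print(round(large, 1), round(tiny, 1), round(large / tiny, 2))
```

Under these assumptions, extraction rate and checkpoint choice multiply: dropping from 30 to 10 FPS and from the large to the tiny checkpoint together cut the pass by roughly a factor of seven.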
Integrating proprioception: turning video into visuomotor trajectories
Egocentric video becomes a robot trajectory only when it carries the action and state signals a policy regresses against. On the robot side, DROID shows the target shape: 76k teleoperation trajectories over 350 hours [15], recording per timestep at 15 Hz three stereo RGB streams (1280x720), joint positions and velocities (7D), end-effector pose and velocity (6D), and gripper position and velocity (1D) [14]. An egocentric pipeline aimed at robot learning has to produce an analog of that structure from human capture.
The mechanics: read the sensor log (joint states, gripper, IMU, or estimated hand pose) and the video, then interpolate the sensor stream onto the video frame timestamps so every frame gets a paired state and action vector. Write the result as RLDS Steps, where each Step's observation holds the image plus depth plus pose, and the action holds the delta to the next state [20]. If you only have human hand pose and no robot action (a pure demonstration dataset), store the hand pose as the observation and leave the action for downstream relabeling. EgoMimic shows this human-side data is valuable enough that one hour of hand data can beat one hour of robot data [3].
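The mechanics above can be sketched with plain lists. The example linearly interpolates a state vector onto the frame timestamps, then emits Step-like dicts whose action is the delta to the next state. Two caveats are built into the sketch as assumptions: linear interpolation is only appropriate for components such as positions or gripper opening (rotations need proper interpolation, for example on quaternions), and the dicts imitate the Step fields rather than calling the RLDS tooling.

```python
from bisect import bisect_right


def interpolate_state(sensor_ts, sensor_values, t):
    """Linearly interpolate a state vector at time t, clamping at the ends.

    sensor_ts: ascending timestamps (seconds, already on the video clock).
    sensor_values: one state vector (list of floats) per timestamp.
    """
    if t <= sensor_ts[0]:
        return list(sensor_values[0])
    if t >= sensor_ts[-1]:
        return list(sensor_values[-1])
    hi = bisect_right(sensor_ts, t)
    lo = hi - 1
    w = (t - sensor_ts[lo]) / (sensor_ts[hi] - sensor_ts[lo])
    return [a + w * (b - a) for a, b in zip(sensor_values[lo], sensor_values[hi])]


def build_steps(frame_ts, frame_refs, sensor_ts, sensor_values):
    """Pair each frame with an interpolated state and a delta-to-next-state action."""
    states = [interpolate_state(sensor_ts, sensor_values, t) for t in frame_ts]
    last = len(states) - 1
    steps = []
    for i, (ref, state) in enumerate(zip(frame_refs, states)):
        next_state = states[min(i + 1, last)]  # final step gets a zero action
        steps.append({
            "is_first": i == 0,
            "is_last": i == last,
            "observation": {"image": ref, "state": state},
            "action": [b - a for a, b in zip(state, next_state)],
        })
    return steps


steps = build_steps([0.0, 0.5, 1.0], ["f0", "f1", "f2"], [0.0, 1.0], [[0.0], [10.0]])
print([s["action"] for s in steps])
```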
For cross-embodiment training, normalize action spaces. Open X-Embodiment aggregates 22 different embodiments into one corpus [21], which only works because the contributing datasets map their varied arms and grippers onto a shared action representation. If your data will be co-trained with public corpora, normalize to a common end-effector action (translation, rotation, gripper) and document the mapping in the dataset card rather than shipping a bespoke schema nobody else can ingest.
Delivering datasets: formats, splits, and documentation
A good pipeline earns trust at delivery. Ship the format the buyer's stack reads: RLDS for TensorFlow/JAX robot-learning pipelines [20], LeRobot's Parquet-plus-MP4 for the Hugging Face ecosystem [23], HDF5 or WebDataset for custom loaders. Don't invent a schema; the value of a standard format is that buyers already have loaders for it, and Open X-Embodiment alone pools 60 datasets from 34 labs in RLDS [21].
The payoff of shipping a standard format is that the buyer queries it with an off-the-shelf SDK instead of writing a parser. A LeRobotDataset loads by repo id and exposes episodes and per-frame tensors directly:
```python
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset

# Load by repo id (the id below is a placeholder).
ds = LeRobotDataset("your-org/egocentric-coffee-v2")
print(ds.num_episodes, ds.num_frames, ds.fps)

frame = ds[0]  # observation.images.<cam>, observation.state, action
ep0 = ds.episode_data_index  # episode -> frame-range index
```
For a WebDataset delivery, the queryable surface is the directory layout itself: shard tar files plus a sidecar manifest, so a dataloader streams shards and a buyer filters by the metadata columns you wrote (`activity`, `collector_id`, `min_quality`, camera geometry) before decoding any video:
```
dataset/
  metadata.parquet            # one row per episode: activity, collector_id, quality, camera_intrinsics
  shards/episodes-000000.tar  # observation/image, observation/depth, observation/state, action
  shards/episodes-000001.tar
  dataset_card.md
```
Whatever the container, the rule is the same: the fields a buyer filters on (activity, quality, collector, camera geometry) have to be queryable columns in a manifest, not something they reconstruct by opening clips.
Define splits by episode and by collector, never by frame. Frame-level splits leak temporal information across train and test and inflate your reported metrics. EPIC-KITCHENS-100, with its 100 hours across 45 kitchens [24], models environment-disjoint splits: hold out whole kitchens so the test set measures generalization to a new environment rather than memorization of a seen one. For egocentric data, split by collector_id or location so the eval stays honest about distribution shift.
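A collector-disjoint split can be made deterministic by hashing the collector id, so the assignment survives re-runs and new uploads. This is a sketch under assumptions: the `salt` string and function names are hypothetical, and a hash-based split only approximates the requested fraction, so check the resulting sizes when the number of collectors is small.

```python
import hashlib


def assign_split(collector_id, test_fraction=0.1, salt="split-v1"):
    """Map a collector to 'train' or 'test' deterministically.

    Every episode from the same collector lands in the same split.
    """
    digest = hashlib.sha256(f"{salt}:{collector_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "test" if bucket < test_fraction else "train"


def split_episodes(episodes, test_fraction=0.1):
    """Group episode ids by split.

    episodes: iterable of dicts with 'episode_id' and 'collector_id'.
    """
    splits = {"train": [], "test": []}
    for episode in episodes:
        split = assign_split(episode["collector_id"], test_fraction)
        splits[split].append(episode["episode_id"])
    return splits
```

Changing the salt reshuffles the assignment, so record it in the dataset card alongside the split definition.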
Then document it. A dataset card in the Datasheets for Datasets style records collection method, consent, PII handling, enrichment model versions, known biases, and prohibited uses. Ship a small public preview (five to ten fully enriched episodes) so a buyer can validate fitness before committing. A marketplace makes this non-negotiable: the buyer's first-batch eval runs against real samples, and scale-up is gated on that eval passing.
Monitoring, lineage, and replay
A production pipeline fails in quiet, specific ways. FFmpeg chokes on a truncated clip, a depth model emits NaNs on an overexposed frame, a caption model starts hallucinating after a checkpoint swap. Instrument every stage with structured logs keyed to clip_id, stage, status, and error, and aggregate them so you can answer "which stage dropped these 40 clips" in one query.
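Structured logging of this kind needs little more than one JSON object per line. The sketch below writes events with the four fields named above and answers the "which stage dropped these clips" question with a counter. The `"failed"` status value and the function names are assumptions for illustration.

```python
import json
from collections import Counter


def log_event(clip_id, stage, status, error=None):
    """Serialize one pipeline event as a JSON line."""
    return json.dumps(
        {"clip_id": clip_id, "stage": stage, "status": status, "error": error},
        sort_keys=True,
    )


def drops_by_stage(log_lines, clip_ids=None):
    """Count failed events per stage, optionally restricted to some clips."""
    wanted = set(clip_ids) if clip_ids is not None else None
    counts = Counter()
    for line in log_lines:
        event = json.loads(line)
        if event["status"] != "failed":
            continue
        if wanted is not None and event["clip_id"] not in wanted:
            continue
        counts[event["stage"]] += 1
    return dict(counts)


lines = [
    log_event("clip-1", "extract", "ok"),
    log_event("clip-2", "extract", "failed", "truncated file"),
    log_event("clip-3", "depth", "failed", "NaN in output"),
    log_event("clip-4", "depth", "failed", "NaN in output"),
]
print(drops_by_stage(lines))
```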
Alert on the symptoms that matter: ingestion volume falling below baseline (a capture or upload break), frame-extraction failures above a few percent (a codec or FFmpeg-version issue), inference latency spiking (GPU throttling or out-of-memory), or PII detection firing on an implausible share of frames (a misconfigured model or a distribution shift). Each of those is a real failure mode, not a hypothetical.
Lineage closes the loop. Record which raw clip and which model versions produced each episode (OpenLineage is one way; an append-only log is another), so that when a buyer flags a bad episode you trace it to its source clip, the exact enrichment versions, and the human-review decision, then replay just the affected stage with an updated model. Without lineage, a single bad batch means re-running everything; with it, you fix the one clip and re-ship. Build the pipeline so every delivered episode can be audited back to a consented source, because that audit trail is what lets a buyer deploy the data without re-clearing it.
Storage and compute: what a 100-hour campaign actually costs
Cost depends on bitrate and which models you run, so here's the math instead of a single headline number. The figures below are our own derived estimate from public bitrates and on-demand GPU pricing, not a cited primary stat; plug in your real numbers.
Raw storage scales with the camera's encode. 4K H.265 at 30 FPS lands near 45 to 60 Mbps, which is roughly 20 to 27 GB per hour, so 100 hours runs on the order of 2 to 3 TB raw before extraction. Extracted frames, depth, and masks add several times that; budget low tens of TB for a 100-hour processed campaign and use S3 Intelligent-Tiering after 30 days. For a sanity check at scale, Egocentric-10K's 10,000 hours at 1080p/30fps land at 16.4 TB of source video across 192,900 clips [25], which works out to roughly 0.16 TB (about 164 GB) per 100 hours of 1080p capture, an average near 3.6 Mbps. Compressed 1080p source is therefore far smaller than the 4K estimate above, which supports the point that the heavy storage cost is the enrichment artifacts, not the source.
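The bitrate-to-storage conversion is worth keeping as a function so you can plug in your own encode settings. The sketch uses decimal units (1 GB = 1000 MB), which is an assumption. Storage vendors and operating systems sometimes report binary units instead.

```python
def gb_per_hour(bitrate_mbps):
    """Gigabytes of video per hour at a given average bitrate in megabits/s."""
    megabytes_per_second = bitrate_mbps / 8
    return megabytes_per_second * 3600 / 1000


def campaign_raw_tb(hours, bitrate_mbps):
    """Terabytes of raw source video for a capture campaign."""
    return hours * gb_per_hour(bitrate_mbps) / 1000


# 4K H.265 at the low and high ends of the range quoted above.
print(gb_per_hour(45), gb_per_hour(60))
print(campaign_raw_tb(100, 45), campaign_raw_tb(100, 60))
```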
GPU compute usually dominates. Depth and segmentation are the heavy passes; both scale linearly with frame count, so extracting at 10 FPS instead of 30 cuts that bill to roughly a third. The levers that move cost: extract at the lowest FPS the policy tolerates, run inference at reduced resolution and in FP16, batch frames to saturate the GPU, and use spot instances with checkpointing. Cache enrichment outputs so a privacy-filter rerun doesn't re-pay for depth and masks.
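To see the linear scaling, here's a rough estimator for one enrichment pass. The 40 frames-per-second model throughput is an assumed placeholder, not a benchmark; measure your own sustained rate on your hardware and plug it in.

```python
def frame_count(hours, extract_fps):
    """Number of frames extracted from a campaign at a given rate."""
    return int(hours * 3600 * extract_fps)

def gpu_hours(hours, extract_fps, model_fps):
    """GPU-hours for one pass, given the model's sustained frames per second."""
    return frame_count(hours, extract_fps) / model_fps / 3600

ASSUMED_MODEL_FPS = 40  # hypothetical sustained throughput for one model on one GPU
for fps in (30, 10):
    print(f"{fps} FPS: {frame_count(100, fps):,} frames, "
          f"{gpu_hours(100, fps, ASSUMED_MODEL_FPS):.1f} GPU-hours per pass")
```

Multiply the result by the number of models you run and by your hourly GPU price to get a campaign estimate.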
Licensing, consent, and provenance
A clip you can't prove you have the rights to is a legal liability rather than usable training data. Public datasets make this concrete: EPIC-KITCHENS publishes a non-commercial license that bars commercial use of the material [9], which is exactly why a research benchmark can't be repackaged as commercial supply. EgoDex sits under CC BY-NC-ND terms for the same reason [26]. Commercial egocentric data needs consent artifacts, location releases, and explicit commercial training rights attached at capture, not bolted on after [27].
Pick the license deliberately. CC BY 4.0 permits commercial use with attribution; CC BY-NC restricts use to non-commercial purposes; CC BY-NC-ND adds a no-derivatives clause; many public egocentric corpora sit under custom research licenses that prohibit redistribution. If you collect from EU residents, your lawful basis is normally consent under GDPR Article 6(1)(a). Article 4(11) defines valid consent as freely given, specific, informed, and unambiguous, and Article 7(3) requires that the participant can withdraw it as easily as they gave it. Capture that consent as a specific, revocable AI-training permission and store the form reference in the clip's provenance row.
Document it so a buyer can audit it. Write a dataset card in the Datasheets for Datasets style covering collection method, consent, PII filtering, enrichment model versions, and prohibited uses. For high-stakes deployments, C2PA manifests cryptographically attest capture device and chain of custody, and OpenLineage records which raw clip and which model versions produced each episode. Run provenance as a stage that emits queryable outputs (a license field, a consent reference, a lineage record per clip), so the rights status of any episode is a database lookup rather than an email thread.
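A minimal sketch of "rights status as a database lookup", using SQLite from the Python standard library. The table layout, column names, and reference strings are hypothetical; the three fields (license, consent reference, lineage record) are the ones named above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE provenance (
        clip_id         TEXT PRIMARY KEY,
        license         TEXT NOT NULL,
        consent_ref     TEXT NOT NULL,
        consent_revoked INTEGER NOT NULL DEFAULT 0,
        lineage_ref     TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO provenance VALUES (?, ?, ?, ?, ?)",
    [
        ("clip_0001", "commercial-training", "consent/form-0001", 0, "lineage/ep_001"),
        ("clip_0002", "commercial-training", "consent/form-0002", 1, "lineage/ep_002"),
    ],
)

def rights_status(conn, clip_id):
    """Return a one-line rights answer for a clip, or 'unknown' if no row exists."""
    row = conn.execute(
        "SELECT license, consent_ref, consent_revoked FROM provenance WHERE clip_id = ?",
        (clip_id,),
    ).fetchone()
    if row is None:
        return "unknown"
    license_name, consent_ref, revoked = row
    if revoked:
        return "blocked: consent withdrawn"
    return f"cleared: {license_name} ({consent_ref})"

print(rights_status(conn, "clip_0001"))
print(rights_status(conn, "clip_0002"))
```

Treating a missing row as "unknown" rather than "cleared" keeps unprovenanced clips out of a delivery by default.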
Real-world egocentric datasets, compared
No competitor gives a clean, verified cross-dataset table at this depth, so here's one. Every figure below traces to the dataset's own page. Use it to decide what to reference, what to license, and which gap your own capture has to fill.
Ego4D sets the breadth baseline: over 3,670 hours of daily-life video from 923 participants across 74 locations in 9 countries, with 3D scans, gaze, stereo, multi-camera, and narrations [28]. For dexterity, EgoDex captures 829 hours over 194 tabletop tasks with paired 3D hand and finger tracking from Apple Vision Pro [5], though its dataset is released under CC BY-NC-ND terms [26], so it's a research reference, not commercial supply. EgoVid-5M targets video generation, with 5M+ clips carrying kinematic and text annotations [29], distributed at 1080P (7.1 TB) or 540P (3.5 TB) [30], and presented at NeurIPS 2025 [31]. For raw scale, Egocentric-10K reaches 10,000 hours, 1.08 billion frames, 192,900 clips, 16.4 TB, under Apache-2.0 [25]. HD-EPIC pushes annotation density instead: 41 hours in 9 kitchens at 263 annotations per minute, all grounded in 3D [32]. Ego-Exo4D pairs the views, the largest public time-synchronized first- and third-person dataset, with 800+ participants [33]. EgoLive is the freshest entrant, billed as the largest open-source annotated egocentric dataset for real-world task routines [34].
Read the table by intent, not by size. If you need a perception benchmark, Ego4D and HD-EPIC are the reference points; if you need dexterous manipulation, EgoDex's paired hand tracking is the closest public analog to what a robot policy regresses against; if you need generation priors, EgoVid-5M's kinematic-plus-text annotation is the fit; and if you need sheer volume to pretrain on, Egocentric-10K's billion-frame factory corpus under Apache-2.0 is the only fully open option at that scale. The gap none of them fill is your deployment-specific geometry and rights, which is the case for capturing your own or sourcing to spec.
For the robot side of the join, Open X-Embodiment (1M+ trajectories, 22 embodiments, 527 skills) [21] and OpenVLA (a 7B model trained on 970k Open-X episodes that outperforms the 55B closed RT-2-X) [35] show what the human egocentric data gets co-trained against.
What pipelines look like at real scale
The architecture above holds whether you process 100 clips or a million; what changes is the engineering around it. The public datasets are the existence proof. Ego4D ran a globally distributed capture-and-processing effort to reach over 3,670 hours from 923 participants across 74 locations in 9 countries, with synchronized multi-camera, gaze, audio, 3D scans, and narrations [28]. That modality count is what makes scale hard: the pipeline isn't one video stream but several synchronized streams that all have to land on the same clock, which is the reason Stage 5 takes real engineering.
Egocentric-10K pushes the volume axis: 10,000 hours of factory-worker first-person video, 1.08 billion frames across 192,900 clips, 16.4 TB, released under Apache-2.0 [25]. At a billion frames, the cost levers from the storage-and-compute section stop being optimizations and become the difference between a feasible and an infeasible run; extracting at 10 FPS instead of 30, quantizing, and batching are what make a billion-frame enrichment pass affordable. Ego-Exo4D adds the paired-view dimension, the largest public time-synchronized first- and third-person dataset with 800+ participants [33], which means the pipeline carries two camera rigs per session and a calibration between them.
The pattern across all three: the model calls are commodity, and the engineering that scales is synchronization, calibration, storage discipline, and provenance. Get those right at 100 clips and the same pipeline runs at a million; skip them and the corpus looks fine until a buyer's policy fails to transfer.
Buy vs build: run your own egocentric data pipeline or source from a marketplace
Build the pipeline yourself when capture is your moat: a proprietary embodiment, a task distribution no public corpus covers, or a deployment camera you control end to end. The runbook above is the whole job, and the stages that take real engineering are sync, geometry, and provenance, not the model calls.
Source from a marketplace when you need rights-cleared data shaped to a deployment context faster than you can recruit, equip, and QA collectors. Public datasets like Ego4D, DROID, and Open X-Embodiment make useful baselines, but production teams typically need data matched on embodiment, sensor stack, task distribution, and environment, with commercial rights and consent attached. That's the gap a marketplace fills: spec to sample to gated first-batch eval to scale-up, delivered in RLDS or LeRobot with provenance.
Weigh the real costs, not just the per-clip price. Building means standing up the seven stages above, recruiting and equipping a collector network, handling consent and PII at the legal bar your jurisdiction sets, and carrying the fixed cost of GPU infrastructure whether or not you're capturing this month. The payoff is a moat: data nobody else can buy. Buying means a variable cost that tracks your actual need, rights and provenance handled by the supplier, and a first-batch eval you can fail cheaply before committing to scale, at the cost of the data not being exclusive unless you contract for it. A useful tie-breaker is the licensing reality from the comparison table: the open corpora that exist (Ego4D, EgoDex, Egocentric-10K) are research or volume datasets, not deployment-specific manipulation data with commercial training rights, so for a production robot program the build-or-source choice is rarely "use a free dataset" and almost always "capture it yourself or have it captured to spec."
Truelabel is that sourcing layer: a network of more than 20,000 collectors across Asia, Africa, and the Americas that has delivered over 100,000 hours of egocentric footage, shipping each delivery with consent artifacts, license, and lineage in robot-ready format. The decision is concrete: if you're inventing the capture rig, build it. If you need rights-cleared, fit-to-spec egocentric data without standing up a pipeline, source it.
External references and source context
[1] truelabel egocentric data glossary, truelabel.ai. Egocentric data is first-person video or sensor data captured from the perspective of a person performing real tasks.
[2] Project Go-Big, Figure. Figure's Project Go-Big trains the Helix VLA on 100% egocentric human video data collected passively as people perform behaviors in real Brookfield homes.
[3] EgoMimic: Scaling Imitation Learning via Egocentric Video, EgoMimic (Georgia Tech). EgoMimic co-trains manipulation policies from human egocentric video and teleoperated robot data using Project Aria, and finds that one hour of additional hand data is significantly more valuable than one hour of additional robot data.
[4] SAM 2: Segment Anything in Images and Videos, GitHub (Meta / facebookresearch). SAM 2 is a foundation model for promptable visual segmentation in images and video with streaming memory for real-time processing; on SA-V test the large model runs 39.5 FPS and the tiny model 91.2 FPS on an A100.
[5] EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video, arXiv. EgoDex (Apple) is 829 hours of egocentric video over 194 tabletop tasks with paired 3D hand and finger tracking captured at record time using Apple Vision Pro and on-device SLAM.
[6] GoPro HERO13 Black, gopro.com. The GoPro HERO13 Black records 5.3K60 video (also 4K120 and 2.7K240) with HyperSmooth 6.0 stabilization and a 1/1.9-inch sensor, a 156-degree native field of view that the HB-Series Ultra Wide Lens Mod widens to 177 degrees.
[7] Project Aria Hardware Specifications, Meta / Project Aria. Project Aria Gen 1 glasses carry an RGB camera at 2880x2880 (downsampled to 1408x1408), two mono scene (SLAM) cameras at 640x480, and two IMUs operating at 1000 Hz and 800 Hz.
[8] Intel RealSense Depth Camera D455, Intel RealSense. The Intel RealSense D455 captures stereo depth at 1280x720 up to 90 FPS with an 86-by-57-degree depth field of view, an ideal range of 0.6 to 6 m, and less than 2 percent depth error at 4 m.
[9] EPIC-KITCHENS-100 dataset page, epic-kitchens.github.io. EPIC-KITCHENS publishes a non-commercial license that prohibits using the material for commercial purposes.
[10] Depth Anything V2, Depth Anything V2 project. Depth Anything V2 is trained on 595K synthetic labeled images plus 62M+ real unlabeled images, ships at scales from 25M to 1.3B parameters, and runs about 10x faster than diffusion-based depth models.
[11] Understanding Human Hands in Contact at Internet Scale (hand_object_detector), GitHub (ddshan). The 100DOH hand-object detector (Understanding Human Hands in Contact at Internet Scale, CVPR 2020) predicts, per detected hand, a contact state of no contact, self contact, other-person contact, portable-object contact, or stationary-object (furniture) contact.
[12] Egocentric Data Collection for Robot Training, unidata.pro. Video captures what the hand looks like. It cannot capture how hard a grasp is or when slip is imminent.
[13] Egocentric Data Collection for Robot Training, unidata.pro. A wrist IMU should be synced to within about 20 ms of the video stream.
[14] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset, arXiv. DROID records per timestep, at 15 Hz, three stereo RGB camera streams at 1280x720 (two exterior ZED 2 plus a wrist-mounted ZED Mini), robot joint positions and velocities (7D), end-effector pose and velocity in the base frame (6D), and gripper position and velocity (1D).
[15] DROID project site, droid-dataset.github.io. DROID is a robot-teleoperation dataset of 76k demonstration trajectories (350 hours) across 564 scenes and 86 tasks, captured by 50 collectors at 13 institutions with two ZED 2 stereo cameras plus a wrist-mounted ZED Mini and 1,417 calibrated camera viewpoints.
[16] Egocentric Data Collection for Robot Training, unidata.pro. A 15-degree field-of-view mismatch changes the apparent size and position of grasped objects enough to matter for precise manipulation.
[17] Egocentric Data Collection for Robot Training, unidata.pro. A policy trained on head-mounted footage sees hands entering from below, while a wrist-mounted end-effector camera sees the workspace from 15 to 30 cm above the object looking roughly downward.
[18] Egocentric Data Collection for Robot Training, unidata.pro. A fisheye camera at head level (204 degrees horizontal by 137 degrees vertical) captures the full hand-workspace geometry.
[19] Egocentric Data Collection for Robot Training, unidata.pro. A ZED X Mini stereo head provides depth from 0.1 to 8 m at the head position.
[20] RLDS GitHub repository, GitHub. RLDS stores episodic data as Episodes of Steps, where each Step carries mandatory is_first/is_last flags plus optional observation, action, reward, discount, and metadata fields.
[21] truelabel Open X-Embodiment glossary, truelabel.ai. Open X-Embodiment aggregates 1M+ real robot trajectories across 22 embodiments from 60 datasets and 34 labs, covering 527 skills.
[22] RLDS GitHub repository, GitHub. The RLDS ecosystem provides tooling to serialize episodes into a standard episodic schema for storage, retrieval, and TFDS consumption.
[23] LeRobot documentation, Hugging Face. LeRobotDataset stores synchronized MP4 videos for vision and Parquet files for state and action data.
[24] EPIC-KITCHENS project site, epic-kitchens.github.io. EPIC-KITCHENS-100 contains 100 hours of Full-HD first-person kitchen video with about 90K action segments across 97 verb classes and 300 noun classes, recorded in 45 kitchens across 4 cities.
[25] Egocentric-10K, Hugging Face. Egocentric-10K (Build AI) is 10,000 hours of factory-worker first-person video totaling 1.08 billion frames across 192,900 clips at 1080p/30fps (16.4 TB), released under Apache-2.0.
[26] EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video, arXiv. The EgoDex dataset is released under CC-BY-NC-ND terms.
[27] truelabel egocentric data licensing hub, truelabel.ai. Commercial egocentric supply requires consent artifacts, location releases, and explicit commercial training rights, not just a research-license corpus.
[28] Egocentric video remains useful but incomplete for robot data buyers, ego4d-data.org. Ego4D provides over 3,670 hours of daily-life first-person video from 923 unique participants across 74 worldwide locations in 9 countries, with 3D scans, audio, gaze, stereo, multiple synchronized wearable cameras, and textual narrations.
[29] EgoVid-5M, EgoVid-5M project. EgoVid-5M provides over 5 million egocentric clips annotated with low-level kinematic control (ego-view translation and rotation) and high-level text, built for egocentric video generation.
[30] EgoVid-5M, EgoVid-5M project. EgoVid-5M is distributed at 1080P (7.1 TB) and 540P (3.5 TB) resolution.
[31] EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation, arXiv. EgoVid-5M was presented as a NeurIPS 2025 poster.
[32] HD-EPIC: A Highly-Detailed Egocentric Video Dataset, arXiv. HD-EPIC is a highly-detailed egocentric kitchen dataset of 41 hours of video in 9 kitchens with 263 annotations per minute, all grounded in 3D, presented at CVPR 2025.
[33] Ego-Exo4D project site, ego-exo4d-data.org. Ego-Exo4D is the largest public dataset of time-synchronized first- and third-person video, with more than 800 skilled participants across six countries and seven US states.
[34] EgoLive: A Large-Scale Egocentric Dataset from Real-World Human Tasks, arXiv. EgoLive is described as the largest open-source annotated egocentric dataset focused on real-world task-oriented human routines to date.
[35] OpenVLA project, openvla.github.io. OpenVLA is a 7B-parameter open-source vision-language-action model trained on 970k robot episodes from Open X-Embodiment, and it outperforms the 55B-parameter closed RT-2-X model.
FAQ
What is an egocentric data pipeline?
An egocentric data pipeline is the set of stages that turn raw first-person video from a wearable camera into robot-ready training data. It runs seven stages: ingest and validate clips, extract frames at 10 to 30 FPS, blur faces and PII, enrich with depth, segmentation, hand, and caption models, sync the sensor streams to the video clock, package trajectories as RLDS, LeRobot, or HDF5, then validate quality before delivery. The hard parts are sensor-to-video sync and camera-geometry matching, not the vision models.
What is egocentric video data collection?
Egocentric video data collection is the practice of recording first-person video from a wearable camera worn by a person performing real tasks, so models see hands, tools, objects, and the environment from the viewpoint a robot or AR agent will use. It's the capture step that feeds an egocentric data pipeline, and for commercial use it pairs each session with consent and a license, not just raw footage.
What is an egocentric view in video?
An egocentric view in video is the first-person perspective, recorded from a camera mounted on the head, chest, or wrist of the person doing the action, rather than a fixed third-person camera watching them. It shows the world the way the actor sees it, with hands and tools entering the frame, which is why it transfers well to robot policies that act from a similar on-body viewpoint.
What is a human egocentric video annotation workflow?
A human egocentric video annotation workflow labels first-person clips with what a model needs: action segments (verb-noun pairs), object masks, hand poses, and captions, usually with automated models producing a first pass and human reviewers correcting it. EPIC-KITCHENS-100, for example, annotated about 90K action segments across 97 verbs and 300 nouns over 100 hours of kitchen video [24].
What cameras work best for egocentric video data collection?
Pick the camera by what signal you need. The GoPro HERO13 Black gives high-resolution RGB (5.3K60, HyperSmooth 6.0, 156-degree FOV widening to 177 with the lens mod) and is cheap to scale [6]. Project Aria adds synchronized SLAM cameras and two IMUs (1000/800 Hz) [7], and Apple captured EgoDex (829 hours, paired 3D hand tracking) on Vision Pro [5]. The Intel RealSense D455 adds metric depth (1280x720, 0.6 to 6 m) [8]. The deciding factor is matching the camera's field of view to your robot's deployment camera, because mismatched geometry breaks transfer.
How much storage and compute do I need for an egocentric video pipeline?
Raw 4K H.265 at 30 FPS runs roughly 20 to 27 GB per hour, so 100 hours is about 2 to 3 TB raw before extraction; extracted frames, depth, and masks add several times that. For scale, Egocentric-10K's 10,000 hours at 1080p/30fps is 16.4 TB of source video [25]. GPU cost is dominated by depth and segmentation, both linear in frame count, so extracting at 10 FPS instead of 30 cuts that bill to roughly a third. Run inference in FP16, batch frames, and use spot instances with checkpointing to cut it further. (Storage and GPU figures are derived estimates, not cited figures.)
How do I handle privacy and PII in egocentric video data?
Run a two-stage automated filter, then a human check. First, automated face detection (RetinaFace or MTCNN) plus a strong Gaussian blur over each detected face, on every frame. Second, OCR (PaddleOCR or EasyOCR) with regex to catch emails, SSNs, and other on-screen text. Follow both with human spot-review of low-confidence frames, and record a privacy_status per clip. This isn't optional: public corpora like EPIC-KITCHENS carry non-commercial licenses partly because of identifiable content, and consent plus PII handling is what makes egocentric data commercially usable [9].
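The regex half of that filter can be sketched on its own, assuming the OCR step has already produced a text string per frame. The two patterns are deliberately simple illustrations (a loose email match and the US SSN digit layout), and the status labels are hypothetical; neither pattern is exhaustive.

```python
import re

# Illustrative patterns only; real deployments need broader, locale-aware rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def find_pii(ocr_text):
    """Return (kind, matched_text) pairs found in one frame's OCR output."""
    hits = []
    for kind, pattern in (("email", EMAIL_RE), ("ssn", SSN_RE)):
        hits.extend((kind, m.group()) for m in pattern.finditer(ocr_text))
    return hits

def clip_privacy_status(frame_texts):
    """Flag the clip for review if any frame has a hit; labels are placeholders."""
    return "needs_review" if any(find_pii(t) for t in frame_texts) else "clean"

sample = "Contact jane.doe@example.com, SSN 123-45-6789"
print(find_pii(sample))
print(clip_privacy_status(["EXIT", "Aisle 4"]))
```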
What enrichment models should I run on egocentric video data?
For robot learning, depth and segmentation are the priority. Depth Anything V2 gives monocular depth and runs about 10x faster than diffusion-based models, with checkpoints from 25M to 1.3B parameters [10]. SAM 2 handles promptable video segmentation, running 39.5 FPS (large) to 91.2 FPS (tiny) on an A100 [4]. Add an egocentric hand-object model for hands and a caption model gated by CLIP similarity to suppress hallucinated objects.
How do I synchronize egocentric video with robot proprioception?
Use a shared clock or hardware trigger, not post-hoc guessing. Capture video and sensor streams against PTP/NTP time or a GPIO pulse, store each stream's clock offset, then interpolate the sensor stream onto the video frame timestamps. Tolerances are tight: a wrist IMU should be aligned to within about 20 ms of the video, or the kinematics and pixels disagree and the policy learns noise [13]. DROID is the reference for what to record per timestep: stereo RGB, joint positions and velocities (7D), end-effector pose (6D), and gripper state (1D) at 15 Hz [14]. For robot logs, read MCAP or ROS bags and align topics before writing RLDS Steps.
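The interpolation step can be sketched in plain Python, assuming a scalar sensor channel, a known clock offset, and sorted timestamps. The `max_gap_s` guard is our own addition (it refuses to interpolate across a sensor dropout), and the toy timestamps are chosen for exact arithmetic, not realism.

```python
from bisect import bisect_left

def interpolate_to_frames(sensor_t, sensor_v, frame_t, clock_offset_s=0.0, max_gap_s=0.05):
    """Linearly interpolate a scalar sensor stream onto video frame timestamps.

    sensor_t: sorted sensor timestamps in the sensor's clock (seconds).
    clock_offset_s: value added to sensor_t to express it in the video clock.
    Returns one value per frame, or None where the frame falls outside the
    sensor range or between two samples more than max_gap_s apart.
    """
    shifted = [t + clock_offset_s for t in sensor_t]
    out = []
    for ft in frame_t:
        i = bisect_left(shifted, ft)
        if i < len(shifted) and shifted[i] == ft:
            out.append(sensor_v[i])          # exact timestamp match
        elif i == 0 or i == len(shifted):
            out.append(None)                 # frame outside sensor coverage
        elif shifted[i] - shifted[i - 1] > max_gap_s:
            out.append(None)                 # dropout: do not invent data
        else:
            t0, t1 = shifted[i - 1], shifted[i]
            w = (ft - t0) / (t1 - t0)
            out.append(sensor_v[i - 1] + w * (sensor_v[i] - sensor_v[i - 1]))
    return out

# Toy example: sensor clock runs 10 s ahead of the video clock.
sensor_t = [10.0, 10.5, 11.0, 13.0]
sensor_v = [0.0, 1.0, 2.0, 6.0]
frames = [0.25, 0.5, 2.0, 4.0]
aligned = interpolate_to_frames(sensor_t, sensor_v, frames, clock_offset_s=-10.0, max_gap_s=1.0)
print(aligned)
```

Returning None instead of extrapolating makes gaps visible to the quality gate rather than hiding them in smooth-looking values.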
What frame rate should I extract from egocentric video for robot learning?
Extract 10 to 30 FPS for manipulation and match the rate to the robot's control frequency: a policy acting at 20 Hz wants frames near 20 FPS. The camera's native 120 FPS is overkill and balloons storage and compute. Higher rates only help for fast motions like catching. Set the FPS in the FFmpeg extract step (`-vf fps=20`) and record it in metadata so downstream users know the temporal resolution.
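As a sketch, the extract step can be built as an argument list and paired with a metadata record. The `fps` filter is the one named above; the JPEG quality flag, output filename pattern, and metadata field names are assumptions you'd adapt to your own layout. The code only builds the command, it doesn't run FFmpeg.

```python
import json

def extraction_fps(control_hz, lo=10, hi=30):
    """Match the extraction rate to the control rate, clamped to the 10-30 FPS band."""
    return max(lo, min(hi, control_hz))

def ffmpeg_extract_args(src, out_dir, fps):
    """Build an FFmpeg argument list that samples frames at a fixed rate."""
    return [
        "ffmpeg", "-i", src,
        "-vf", f"fps={fps}",              # resample to the target frame rate
        "-q:v", "2",                      # assumed JPEG quality setting
        f"{out_dir}/frame_%06d.jpg",      # hypothetical output naming scheme
    ]

def extraction_metadata(clip_id, fps):
    """Record the temporal resolution so downstream users can see it."""
    return json.dumps({"clip_id": clip_id, "extract_fps": fps}, sort_keys=True)

fps = extraction_fps(control_hz=20)
args = ffmpeg_extract_args("clip_0001.mp4", "frames/clip_0001", fps)
print(" ".join(args))
print(extraction_metadata("clip_0001", fps))
```

To execute it, pass `args` to `subprocess.run(args, check=True)` on a machine with FFmpeg installed.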
How do I validate pipeline output quality at scale?
Gate every clip on automated per-clip metrics: frame yield above 95%, blur ratio under 40%, depth coverage above 80%, mask coverage above 10%, and caption confidence above 0.7. Flag failures into a review dashboard that shows RGB, depth, mask, and caption side by side for an accept/reject/reprocess decision. Sample-audit accepted clips too, and track per-stage error rate over time so a model regression shows up as a trend before it reaches a buyer.
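Those five gates translate directly into a small function. The thresholds are the ones listed above; the metric key names and the "accept"/"review" labels are hypothetical, and a missing metric is treated as a failure.

```python
# (kind, threshold): "min" means the value must exceed it, "max" means stay under it.
GATES = {
    "frame_yield": ("min", 0.95),
    "blur_ratio": ("max", 0.40),
    "depth_coverage": ("min", 0.80),
    "mask_coverage": ("min", 0.10),
    "caption_confidence": ("min", 0.70),
}

def gate_clip(metrics):
    """Return (decision, failed_metric_names) for one clip's metrics dict."""
    failures = []
    for name, (kind, threshold) in GATES.items():
        value = metrics.get(name)
        if value is None:
            failures.append(name)                      # missing metric fails closed
        elif kind == "min" and not value > threshold:
            failures.append(name)
        elif kind == "max" and not value < threshold:
            failures.append(name)
    return ("accept" if not failures else "review", failures)

good = {"frame_yield": 0.98, "blur_ratio": 0.12, "depth_coverage": 0.91,
        "mask_coverage": 0.25, "caption_confidence": 0.83}
bad = dict(good, blur_ratio=0.55, caption_confidence=0.60)
print(gate_clip(good))
print(gate_clip(bad))
```

Returning the list of failed metrics gives the review dashboard a reason to show next to each flagged clip.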
Looking for an egocentric data pipeline?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners and helps scope consent artifacts and commercial licensing requirements before delivery.