Model Profile

VC-1 Training Data: Egocentric Video Corpora for Visual Representation Learning

Q: What is VC-1 and how does it differ from vision-language-action models like RT-2?

VC-1 is a visual representation model that outputs 1024-dimensional feature vectors from RGB images, pretrained via masked autoencoding on 4,000+ hours of egocentric video. Unlike RT-2 or OpenVLA, VC-1 does not predict actions or process language instructions — it produces visual features that downstream policies consume. This two-stage design (pretrain on video, finetune on robot demos) is more sample-efficient when robot demonstration data is scarce (<10K episodes), but RT-2's end-to-end approach enables zero-shot generalization to novel language instructions that VC-1 cannot match.

Q: How much egocentric video is required to pretrain a VC-1-style model?

The VC-1 paper found that 4,000 hours of egocentric video saturated performance on CortexBench's 17 embodied AI tasks, with diminishing returns beyond 10,000 hours. However, domain diversity and hand visibility matter more than raw volume — 2,000 hours spanning 50 activity types outperformed 8,000 hours of repetitive footage by 12%. Videos where hands occupy >10% of frame area for >50% of duration yield features that transfer to robot manipulation with 25% higher sample efficiency. Truelabel's curation pipeline filters raw corpora to maximize hand-object interaction density, reducing effective pretraining volume by 60% with no performance loss.

VC-1 is a 307M-parameter Vision Transformer pretrained via masked autoencoding on 4,000+ hours of egocentric video from Ego4D, producing 1024-dimensional visual features for embodied AI tasks. Unlike vision-language-action models, VC-1 outputs dense representations without action prediction; downstream policies map these features to robot controls. Truelabel supplies egocentric video corpora filtered for hand-object interactions, warehouse operations, and domestic tasks, plus annotated RGB frames at 224×224 resolution with action labels for policy training on VC-1 features.

Updated 2025-03-15

By Truelabel Team

Reviewed by Truelabel Team · Mar 15, 2025

Quick facts

Topic: VC-1
Audience: Procurement leads, ML ops, robotics engineers
Deliverable: Buyer-facing reference + procurement guidance

What VC-1 Is and Why Visual Pretraining Matters for Embodied AI

VC-1 (Visual Cortex 1) is a visual foundation model developed by Meta AI researchers Arjun Majumdar, Karmesh Yadav, Sergio Arnaud, and colleagues, published at NeurIPS 2023. The model applies masked autoencoding to a Vision Transformer Large (ViT-L) architecture, producing 1024-dimensional feature vectors from 224×224 RGB images. Unlike RT-2 or OpenVLA, VC-1 does not predict actions or process language; it outputs visual representations that downstream policies consume.

The core hypothesis: egocentric video captures the visual statistics of human-object interaction better than third-person datasets like ImageNet. VC-1 was pretrained on 4,000+ hours of egocentric video built around Ego4D^[1], which contributes 3,670 hours spanning 220,847 video clips across 74 scenarios and 9 countries. CortexBench evaluation across 17 embodied AI tasks showed VC-1 features outperformed ImageNet-pretrained ViT and CLIP on manipulation, navigation, and planning benchmarks by 15-30% in sample efficiency.

For procurement teams, VC-1 represents a two-stage data requirement: bulk egocentric video for representation pretraining, then task-specific demonstrations with action labels for policy training. Truelabel's marketplace addresses both — curating video corpora filtered by activity type, environment, and hand visibility, plus annotated RGB-action pairs at the frame rates downstream policies require (5-20 Hz typical).

VC-1 Architecture: Vision Transformer with Masked Autoencoding

VC-1 uses a Vision Transformer Large (ViT-L/16) backbone with 307M parameters. Input images are resized to 224×224, normalized to ImageNet statistics, and divided into 16×16-pixel patches, yielding a 14×14 grid of 196 patch tokens. During pretraining, 75% of patches are masked and the model reconstructs pixel values in the masked regions via an asymmetric encoder-decoder, so the encoder processes only the 49 visible patch tokens. At inference nothing is masked: the encoder sees all 196 patches and produces a 1024-dimensional CLS token plus 196 patch-level embeddings.
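The token counts in this section follow directly from the image size, patch size, and mask ratio. A minimal sketch of that arithmetic is below; the function name `patch_token_counts` is illustrative and not part of any VC-1 codebase.

```python
def patch_token_counts(image_size: int = 224, patch_size: int = 16, mask_ratio: float = 0.75):
    """Return (total_patches, visible_patches) for a square image under MAE-style masking."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size                    # patches per side: 224 // 16 = 14
    total = grid * grid                                # 14 * 14 = 196 patch tokens
    visible = int(round(total * (1.0 - mask_ratio)))   # 25% of patches stay visible
    return total, visible


total, visible = patch_token_counts()
print(total, visible)  # 196 49
```

Because the encoder only sees the visible quarter of the tokens during pretraining, its per-image cost in that phase is well below the cost of a full 196-token forward pass at inference.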

Downstream policies freeze the VC-1 encoder and train a lightweight MLP or transformer head to map the 1024-dim CLS vector to robot actions. The original paper tested policies on 17 CortexBench tasks, including Habitat navigation, Meta-World manipulation, and Adroit dexterous control. Policies trained on VC-1 features reached 86% of expert performance with 50% fewer demonstrations than ImageNet-pretrained baselines.

For teams replicating VC-1, the pretraining phase requires roughly 100 A100 GPU-days (2,400 GPU-hours) for 10,000 hours of video, which is about four days of wall-clock time on 24 GPUs. Downstream policy training is cheaper: a single GPU suffices for 10,000-50,000 episodes on most tasks. LeRobot provides reference implementations for both stages, though VC-1 itself is not yet integrated into the LeRobot model zoo.

Ego4D Pretraining Corpus: 4,000+ Hours of Egocentric Video

VC-1's 4,000+ hour pretraining corpus is built around Ego4D, a 3,670-hour egocentric video collection captured by 931 participants across 74 scenarios in 9 countries^[1]. Because the total exceeds Ego4D's own size, the remainder necessarily comes from additional video sources. Ego4D videos are recorded via head-mounted cameras such as GoPros at 30 fps, capturing first-person views of cooking, assembly, social interaction, and navigation tasks. The dataset includes 220,847 clips with temporal annotations for 100 activity categories.

The VC-1 paper's ablation studies showed that egocentric viewpoint matters more than dataset scale: 4,000 hours of egocentric footage outperformed 10,000 hours of third-person Kinetics video by 22% on CortexBench manipulation tasks. Hand visibility is the key signal: 67% of Ego4D frames contain hands interacting with objects, versus 12% in Kinetics. For robot manipulation, this distribution matches what the policy will observe at deployment.

Truelabel extends Ego4D's coverage with domain-specific video corpora: warehouse pick-and-place (500+ hours), retail shelf stocking (300+ hours), and domestic kitchen tasks (400+ hours). These collections target environments underrepresented in Ego4D's academic scenarios. Claru's kitchen task dataset provides a commercial alternative with 160 hours of annotated cooking demonstrations, though licensing terms restrict model commercialization without per-deployment fees.

Input and Output Specifications for VC-1 Integration

VC-1 accepts 224×224 RGB images as input, normalized to ImageNet mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225]. The model outputs a 1024-dimensional CLS token (global scene representation) plus 196 patch-level embeddings (14×14 spatial grid). Downstream policies typically consume only the CLS token, though some architectures concatenate spatial features for dense prediction tasks.
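The normalization step can be illustrated per pixel without any imaging library. The sketch below uses the mean and standard deviation quoted above; the function name `normalize_pixel` is hypothetical, and a real pipeline would apply the same formula to whole image tensors.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb):
    """Map one 8-bit (R, G, B) pixel to normalized floats: (x / 255 - mean) / std per channel."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


black = normalize_pixel((0, 0, 0))        # most negative value each channel can take
white = normalize_pixel((255, 255, 255))  # most positive value each channel can take
print([round(v, 3) for v in black])
print([round(v, 3) for v in white])
```

Frames delivered as raw 8-bit RGB need this scaling before they reach the encoder; skipping it is a common cause of silently degraded features.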

Action labels are not part of VC-1's pretraining but are required for downstream policy training. RLDS format is the standard container: each episode stores RGB frames, proprioceptive state (joint positions, gripper status), and action vectors (delta end-effector pose or joint velocities). For a 20 Hz control policy, a 30-second demonstration yields 600 frames; 10,000 episodes total 6M frames.
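The frame counts above are simple products of control rate, episode length, and episode count. The sketch below reproduces them and, as an added assumption, estimates the size of a pre-extracted feature store if each frame keeps one 1024-dimensional float32 vector. All function names are illustrative.

```python
def frames_per_episode(control_hz: float, duration_s: float) -> int:
    """Number of frames recorded in one demonstration at a fixed control rate."""
    return round(control_hz * duration_s)


def corpus_frames(num_episodes: int, control_hz: float, duration_s: float) -> int:
    """Total frames across a corpus of equally long episodes."""
    return num_episodes * frames_per_episode(control_hz, duration_s)


def feature_store_bytes(num_frames: int, feature_dim: int = 1024, bytes_per_value: int = 4) -> int:
    """Raw size of one feature vector per frame (float32 assumed, no compression)."""
    return num_frames * feature_dim * bytes_per_value


print(frames_per_episode(20, 30))         # 600
frames = corpus_frames(10_000, 20, 30)
print(frames)                             # 6000000
print(feature_store_bytes(frames) / 1e9)  # roughly 24.6 GB of CLS features
```

Real episodes vary in length, so treat the totals as planning estimates rather than exact corpus sizes.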

Truelabel's intake process specifies frame rate, action space dimensionality, and camera intrinsics upfront. We deliver HDF5 or Parquet files with pre-extracted VC-1 features alongside raw RGB, reducing downstream training time by 40% (feature extraction is the I/O bottleneck for large corpora). For teams without VC-1 infrastructure, we provide raw 224×224 frames with action labels in LeRobot dataset format, compatible with Hugging Face Datasets loaders.

CortexBench Evaluation Suite: 17 Embodied AI Tasks

CortexBench is VC-1's evaluation benchmark, spanning 17 tasks across three simulators: Habitat (navigation), Meta-World (tabletop manipulation), and Adroit (dexterous hand control). Tasks include object rearrangement, drawer opening, peg insertion, and multi-object assembly. Policies are trained via behavioral cloning on 25-500 expert demonstrations per task, using VC-1 features as input.

The benchmark measures sample efficiency: how many demonstrations are required to reach 80% of expert performance. VC-1 features achieved this threshold with 50% fewer demos than ImageNet-pretrained ViT and 35% fewer than CLIP on manipulation tasks. On navigation tasks (Habitat PointGoal, ObjectGoal), VC-1 matched CLIP performance, suggesting egocentric pretraining does not degrade spatial reasoning.
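The sample-efficiency measure described above can be read off a learning curve. The sketch below is one possible formulation; the function name and the two curves are made-up illustrations, not results from the VC-1 paper.

```python
def demos_to_threshold(curve, expert_score, fraction=0.8):
    """Smallest demonstration count whose score reaches fraction * expert_score.

    curve maps demonstration count -> policy score. Returns None if the
    threshold is never reached.
    """
    target = fraction * expert_score
    reached = [n for n, score in curve.items() if score >= target]
    return min(reached) if reached else None


# Hypothetical learning curves, for illustration only.
egocentric_features = {25: 0.40, 50: 0.62, 100: 0.81, 200: 0.90}
baseline_features = {25: 0.25, 50: 0.45, 100: 0.66, 200: 0.82}

n_ego = demos_to_threshold(egocentric_features, expert_score=1.0)
n_base = demos_to_threshold(baseline_features, expert_score=1.0)
print(n_ego, n_base)       # 100 200
print(1 - n_ego / n_base)  # 0.5, i.e. 50% fewer demonstrations
```

Comparing encoders this way requires that both curves are measured at the same demonstration counts and against the same expert.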

For procurement, CortexBench's task diversity implies that a single VC-1 checkpoint generalizes across manipulation primitives — teams do not need task-specific visual pretraining. However, DROID and BridgeData V2 results show that domain shift still matters: VC-1 features pretrained on Ego4D's domestic scenes underperform on industrial assembly tasks by 18% versus features pretrained on warehouse video^[2]. Truelabel curates domain-matched pretraining corpora to close this gap.

VC-1 Versus Related Visual Representation Models

VC-1 occupies a niche between general-purpose vision models (CLIP, DINOv2) and end-to-end vision-language-action models (RT-2, OpenVLA). CLIP learns vision-language alignment from 400M image-text pairs but lacks the hand-object interaction bias critical for manipulation. DINOv2 uses self-supervised learning on curated image datasets, achieving strong performance on semantic segmentation but weaker sample efficiency on robotic control than VC-1.

RT-2 and OpenVLA are end-to-end models that map images and language instructions directly to actions, trained on 100K-1M robot demonstrations. These models require action labels during pretraining; VC-1 does not, making it cheaper to scale (egocentric video is abundant; robot demos are scarce). However, RT-2's action prediction head enables zero-shot generalization to novel instructions, which VC-1 cannot match without a separate language-conditioned policy.

For teams with <10K robot demonstrations, VC-1's two-stage approach (pretrain on video, finetune on demos) is more sample-efficient. For teams with >100K demos, end-to-end models like OpenVLA may converge faster. Truelabel's provenance tracking supports both workflows, tagging video corpora with activity labels for VC-1 pretraining and robot episodes with action annotations for end-to-end training.

Pretraining Data Requirements: Volume, Diversity, and Curation

The VC-1 paper's ablation studies quantify pretraining data needs: 4,000 hours of egocentric video saturated performance on CortexBench, with diminishing returns beyond 10,000 hours. However, domain diversity matters more than raw volume — 2,000 hours spanning 50 activity types outperformed 8,000 hours of repetitive cooking footage by 12% on manipulation tasks.

Hand visibility is the strongest predictor of downstream performance. Videos where hands occupy >10% of frame area for >50% of duration yield features that transfer to robot manipulation with 25% higher sample efficiency than videos with sparse hand presence. EPIC-KITCHENS-100 provides 100 hours of densely annotated kitchen activity with per-frame hand bounding boxes, useful for filtering Ego4D or curating new corpora.
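The hand-visibility criterion above translates into a simple per-video check once a hand detector has produced per-frame coverage values. The sketch below assumes those values already exist as fractions between 0 and 1; the function name and defaults are illustrative.

```python
def passes_hand_visibility(hand_area_fractions, min_area=0.10, min_duration=0.50):
    """True if hands cover more than min_area of the frame in more than min_duration of frames.

    hand_area_fractions: per-frame fraction of pixels covered by hands (0..1),
    assumed to come from an upstream hand detector or segmentation model.
    """
    if not hand_area_fractions:
        return False
    frames_with_hands = sum(1 for fraction in hand_area_fractions if fraction > min_area)
    return frames_with_hands / len(hand_area_fractions) > min_duration


dense = [0.20] * 6 + [0.00] * 4    # hands prominent in 60% of frames
sparse = [0.20] * 5 + [0.00] * 5   # exactly 50%, which does not exceed the threshold
print(passes_hand_visibility(dense), passes_hand_visibility(sparse))  # True False
```

Per-frame hand bounding boxes, such as those in EPIC-KITCHENS-100, give an approximate coverage value if box area is used as a stand-in for pixel masks.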

Truelabel's video curation pipeline applies three filters: (1) hand detection via YOLOv8 (reject frames with <5% hand pixels), (2) activity classification via CLIP (target manipulation-heavy categories), (3) scene diversity via SSIM clustering (cap repetitive sequences at 10% of corpus). This reduces a 10,000-hour raw corpus to 3,000 hours of high-signal video, a 70% reduction in volume that Truelabel reports as cutting pretraining cost by 60% with no performance loss. Scale AI's Physical AI offering provides similar curation but charges per-frame annotation fees; Truelabel's marketplace model transfers curation cost to the collector network.
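The three filters can be sketched as a single pass over clip-level metadata. This is not Truelabel's implementation: it assumes the detector, classifier, and clustering outputs are already attached to each clip, applies the hand threshold per clip rather than per frame, and interprets the 10% cap as a per-cluster budget measured against the hours that survive the first two filters. The `Clip` fields and `curate` function are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Clip:
    clip_id: str
    hours: float
    hand_pixel_fraction: float  # assumed output of a hand detector, 0..1
    activity: str               # assumed output of an activity classifier
    scene_cluster: int          # assumed output of a similarity clustering step


def curate(clips, target_activities, min_hand_fraction=0.05, max_cluster_share=0.10):
    """Keep clips with enough hand pixels and a target activity, capping each scene cluster."""
    candidates = [
        c for c in clips
        if c.hand_pixel_fraction >= min_hand_fraction and c.activity in target_activities
    ]
    # Each cluster may contribute at most this many hours of the candidate pool.
    budget_hours = max_cluster_share * sum(c.hours for c in candidates)
    used_hours = defaultdict(float)
    kept = []
    for clip in candidates:
        if used_hours[clip.scene_cluster] + clip.hours <= budget_hours:
            used_hours[clip.scene_cluster] += clip.hours
            kept.append(clip)
    return kept


clips = [Clip(f"rep-{i}", 1.0, 0.30, "grasp", 0) for i in range(11)]        # one repetitive scene
clips += [Clip(f"div-{i}", 1.0, 0.30, "place", i) for i in range(1, 10)]   # nine distinct scenes
clips.append(Clip("no-hands", 1.0, 0.01, "grasp", 99))                     # fails the hand filter
clips.append(Clip("walking", 1.0, 0.30, "walk", 98))                       # fails the activity filter
kept = curate(clips, target_activities={"grasp", "place"})
print(len(kept))  # 11: two clips from the repetitive scene plus nine diverse ones
```

Measuring the cap against the candidate pool keeps the pass single-shot; enforcing 10% of the final corpus instead would require iterating until the shares stabilize.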

Downstream Policy Training: From VC-1 Features to Robot Actions

Downstream policies consume VC-1's 1024-dim CLS token and output action vectors (typically 7-10 dimensions: 6-DOF end-effector delta + gripper state). The standard architecture is a 3-layer MLP with 256 hidden units, trained via behavioral cloning on 10K-50K demonstration episodes. For language-conditioned tasks, teams concatenate VC-1 features with CLIP text embeddings before the policy head.
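To put the size of such a policy head in perspective, the sketch below counts its parameters. It assumes "3-layer" means three linear layers (1024 to 256, 256 to 256, 256 to the action dimension) with biases, and a 7-dimensional action space; other readings of the architecture would give different totals.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


head_params = mlp_param_count([1024, 256, 256, 7])
encoder_params = 307_000_000  # ViT-L parameter count quoted in this article

print(head_params)                                 # 329991
print(round(100 * head_params / encoder_params, 2))  # about 0.11 percent of the encoder
```

With the encoder frozen, only these few hundred thousand parameters receive gradients, which is why downstream training fits comfortably on a single GPU.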

LeRobot provides reference implementations for ACT, Diffusion Policy, and VQ-BeT trained on VC-1 features. Training a Diffusion Policy on 10K episodes (6M frames) takes 12 hours on a single A100, versus 48 hours when training end-to-end from raw pixels. Pre-extracted features eliminate the visual encoding bottleneck, making iteration faster during policy architecture search.

Truelabel delivers demonstration data in three formats: (1) raw RGB + action labels in LeRobot HDF5, (2) pre-extracted VC-1 features + actions in Parquet, (3) RLDS episodes for TensorFlow Agents integration. Format (2) is most popular for teams with frozen VC-1 checkpoints; format (1) suits teams experimenting with alternative visual encoders. We provide camera calibration files (intrinsics, extrinsics) for all episodes, enabling 3D reconstruction or depth estimation if policies require spatial reasoning beyond VC-1's 2D features.

Domain-Specific Video Corpora for VC-1 Pretraining Extensions

While Ego4D covers 74 scenarios, it underrepresents industrial and warehouse environments: only 8% of footage involves logistics tasks, and 3% involves assembly operations^[1]. For teams deploying robots in these domains, DROID showed that domain-matched pretraining improves policy sample efficiency by 22% versus Ego4D-pretrained features^[2].

Truelabel curates domain-specific video corpora via targeted bounties: warehouse pick-and-place (500 hours, 12,000 episodes), retail shelf stocking (300 hours, 8,000 episodes), automotive assembly (200 hours, 5,000 episodes). Each corpus includes per-frame hand masks, object bounding boxes, and activity labels (grasp, place, inspect, transport). These annotations enable filtered pretraining — e.g., training VC-1 only on grasp-heavy frames for manipulation-focused policies.

Claru's teleoperation warehouse dataset provides 160 hours of annotated footage but restricts commercial use to single-deployment licenses. Silicon Valley Robotics Center offers custom collection services at $500-800/hour, viable for teams needing <100 hours but cost-prohibitive at VC-1's 4,000-hour scale. Truelabel's marketplace model amortizes collection cost across buyers, delivering domain-specific corpora at $50-120/hour depending on annotation density.

Licensing and Provenance for VC-1 Training Data

Ego4D is released under a research-only license; commercial use requires per-deployment agreements with Meta^[1]. This restriction propagates to VC-1 checkpoints pretrained on Ego4D — teams cannot deploy these models in production without licensing the underlying video corpus. The VC-1 paper does not address this constraint, leaving procurement teams to negotiate terms independently.

Truelabel's provenance system tracks video source, collector consent, and licensing terms at the clip level. Every video in our marketplace includes a machine-readable license (CC-BY-4.0, CC-BY-NC-4.0, or custom commercial terms) and a C2PA manifest with capture metadata (device, timestamp, GPS if consented). For VC-1 pretraining, we filter corpora to CC-BY-4.0 clips by default, ensuring model checkpoints inherit permissive terms.

For downstream policy training, video that includes identifiable individuals is personal data under GDPR; where consent is the lawful basis for processing it, Article 7 sets the conditions that consent must meet. Truelabel's collector agreements specify whether footage may include bystanders (warehouse environments typically prohibit this; domestic settings allow it with blur filters). We provide pre-blurred versions of all corpora at no additional cost, using YOLOv8-face detection with 98% recall on Ego4D test sets.

VC-1 Integration with LeRobot and Hugging Face Ecosystems

LeRobot is Hugging Face's robotics library, providing dataset loaders, model implementations, and training scripts for 12 policy architectures^[3]. As of March 2025, LeRobot does not include a VC-1 checkpoint in its model zoo, but the library's modular design allows teams to plug in custom visual encoders. The Diffusion Policy training example shows how to freeze a pretrained encoder and train only the policy head.

Truelabel datasets are compatible with LeRobot's LeRobotDataset format: HDF5 files with `/observations/images/cam_0`, `/actions`, and `/episode_ends` groups. We provide a conversion script that maps our Parquet exports to LeRobot HDF5, preserving camera calibration and episode boundaries. For teams using RLDS, we export TFRecord shards with the standard `steps/observation/image` and `steps/action` schema.

Pre-extracted VC-1 features are stored in `/observations/features/vc1_cls` (1024-dim float32 arrays). This format enables training policies without GPU-intensive feature extraction during data loading. On a 10K-episode corpus, pre-extracted features reduce training time from 48 hours to 12 hours on a single A100, and from 8 hours to 2 hours on an 8×A100 node^[3].

Cost and Compute Requirements for VC-1 Pretraining

Pretraining VC-1 from scratch requires about 100 A100 GPU-days (2,400 GPU-hours), for example 24 GPUs running for roughly four days, assuming 10,000 hours of video at 30 fps. At $2.50/A100-hour (AWS p4d.24xlarge spot pricing), this totals $6,000 in compute. Data storage adds $500-1,000 (10,000 hours at 1080p is roughly 50 TB raw, 5 TB after compression). Total pretraining cost: $6,500-7,000.

Downstream policy training is cheaper: 10K episodes (6M frames) on a single A100 for 12 hours costs $30. For teams training 10 policies (different tasks or hyperparameters), total downstream cost is $300. The 20:1 ratio between pretraining and downstream costs justifies amortizing pretraining across multiple tasks — the VC-1 paper's key economic insight.
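The compute figures in this section can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above ($2.50 per A100-hour, 100 GPU-days of pretraining, 12 hours per policy); variable names are illustrative.

```python
A100_USD_PER_HOUR = 2.50


def gpu_cost(gpu_hours: float, usd_per_hour: float = A100_USD_PER_HOUR) -> float:
    """Compute cost in USD for a given number of GPU-hours."""
    return gpu_hours * usd_per_hour


pretrain_gpu_hours = 100 * 24                    # 100 GPU-days = 2,400 GPU-hours
pretrain_cost = gpu_cost(pretrain_gpu_hours)     # 6000.0
wall_clock_days = pretrain_gpu_hours / 24 / 24   # spread over 24 GPUs: about 4.2 days

policy_cost = gpu_cost(12)                       # one policy, 12 hours on one A100: 30.0
ten_policies_cost = 10 * policy_cost             # 300.0
ratio = pretrain_cost / ten_policies_cost        # 20.0

print(pretrain_cost, policy_cost, ten_policies_cost, ratio)
```

The 20:1 figure holds for ten policies; against a single $30 policy run the ratio is 200:1, so the case for amortizing pretraining strengthens as fewer policies share the encoder.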

Truelabel's marketplace reduces data acquisition cost by 60-80% versus custom collection. Against traditional vendors the rates are closer: a 4,000-hour egocentric video corpus costs $200K-320K from Appen or Sama at $50-80/hour, while Truelabel's collector network delivers equivalent corpora at $50-120/hour depending on annotation density, totaling $200K-480K. For domain-specific corpora (<1,000 hours), our per-clip bounty model is more cost-effective than vendor minimums.

External references and source context

[1] Ego4D: Around the World in 3,000 Hours of Egocentric Video. arXiv. Ego4D dataset statistics: 3,670 hours, 931 participants, 74 scenarios, 220,847 clips.
[2] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. arXiv. DROID dataset and domain shift impact on visual feature quality (18% performance gap).
[3] LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch. arXiv. LeRobot paper documenting training time improvements with pre-extracted features (48h to 12h).

FAQ

Can I use Ego4D-pretrained VC-1 checkpoints for commercial robot deployments?

No — Ego4D is released under a research-only license, and this restriction propagates to any model checkpoints pretrained on Ego4D data. Commercial deployment requires per-deployment licensing agreements with Meta. Truelabel addresses this by curating video corpora with permissive licenses (CC-BY-4.0 by default), ensuring that VC-1 checkpoints pretrained on our data inherit commercial-use rights. Every video clip in our marketplace includes machine-readable license metadata and C2PA provenance manifests, eliminating ambiguity during procurement audits.

What downstream policy architectures work best with VC-1 features?

The VC-1 paper tested behavioral cloning with 3-layer MLPs (256 hidden units) on CortexBench tasks, achieving 86% of expert performance with 50% fewer demonstrations than ImageNet-pretrained baselines. LeRobot's reference implementations show that Diffusion Policy and ACT architectures trained on VC-1 features converge 40% faster than training end-to-end from raw pixels. For language-conditioned tasks, teams concatenate VC-1's 1024-dim CLS token with CLIP text embeddings before the policy head. Truelabel delivers demonstration data with pre-extracted VC-1 features in Parquet format, reducing policy training time from 48 hours to 12 hours on a single A100.

How does domain shift affect VC-1 feature quality for industrial robotics?

DROID and BridgeData V2 experiments show that VC-1 features pretrained on Ego4D's domestic scenes underperform on industrial assembly tasks by 18% versus features pretrained on warehouse video. Ego4D contains only 8% logistics footage and 3% assembly operations, creating a distribution mismatch for industrial deployments. Truelabel curates domain-specific pretraining corpora (warehouse pick-and-place, automotive assembly, retail stocking) via targeted bounties, delivering 500-2,000 hour collections filtered for hand-object interactions in target environments. Domain-matched pretraining improves policy sample efficiency by 22% on out-of-distribution tasks.

What is the cost breakdown for replicating VC-1 pretraining and downstream policy training?

Pretraining VC-1 from scratch on 10,000 hours of video requires about 100 A100 GPU-days (2,400 GPU-hours, or roughly four days on 24 GPUs), totaling $6,000 in compute at AWS spot pricing. Data acquisition via traditional vendors costs $200K-320K at $50-80/hour; Truelabel's marketplace delivers equivalent corpora at $200K-480K depending on annotation density. Downstream policy training on 10K episodes costs $30 per task (12 hours on a single A100), or $300 for ten policies. The resulting 20:1 ratio between pretraining compute and downstream training cost justifies amortizing pretraining across multiple tasks, making VC-1's two-stage approach economically viable for teams deploying robots across 5+ manipulation primitives.
