Model Profile

MVP (Masked Visual Pre-training) Training Data & Integration

Q: What is the minimum video corpus size for effective MVP pre-training?

500 hours of egocentric video produces encoders that outperform ImageNet baselines on 8 of 12 Adroit manipulation tasks. Below 200 hours, performance drops below supervised pre-training. The original MVP paper used Ego4D's 3,670 hours, but ablation studies show diminishing returns above 1,000 hours — increasing corpus size from 1,000 to 3,670 hours improves success rates by only 2.3 percentage points. For teams with limited proprietary video, starting from publicly released MVP checkpoints and fine-tuning on 50-200 hours of domain-specific video often yields better results than training from scratch on small corpora.

Q: Can MVP encoders transfer across different robot platforms and camera viewpoints?

MVP transfers well across robot platforms (Franka, UR5, Sawyer) when camera viewpoints remain consistent — wrist-mounted or near-wrist perspectives similar to Ego4D's egocentric framing. Performance degrades 15-25 percentage points when switching to third-person or overhead cameras, even with policy head fine-tuning, because the encoder's learned features emphasize hand-centric spatial relationships. For deployments with fixed third-person cameras, pre-training on third-person video (e.g., Kinetics clips filtered for manipulation content) or using viewpoint-augmented training (rendering demonstrations from multiple angles) recovers 60-80% of the lost performance.

Q: How does MVP compare to end-to-end training on real robot data?

With 10-50 demonstrations per task, MVP-based policies achieve 15-24 percentage point higher success rates than end-to-end training, because the frozen encoder prevents overfitting and the pre-trained features capture manipulation-relevant visual patterns. With 200+ demonstrations, end-to-end training closes the gap to 5-8 percentage points, and with 500+ demonstrations, end-to-end training can match or exceed MVP by learning task-specific visual features that the frozen encoder misses. The crossover point depends on task complexity: contact-rich tasks (pen rotation, drawer opening) favor MVP even with large datasets, while tasks with significant visual ambiguity (cluttered scenes, occlusions) benefit more from end-to-end learning.

Q: What are the compute requirements for pre-training MVP from scratch vs. fine-tuning?

Pre-training MVP on 3,670 hours of Ego4D video requires 64 V100 GPUs for 5 days (7,680 GPU-hours), costing $15,000-25,000 on cloud platforms. Fine-tuning a policy head on 100 demonstrations with a frozen MVP encoder requires 1 GPU for 2-6 hours ($5-15 on cloud), so starting from publicly released checkpoints is roughly 1,000-5,000× cheaper than pre-training from scratch; this is the better option for teams with <1,000 hours of proprietary video. Continued pre-training (initializing from a public checkpoint, then training 100-200 additional epochs on proprietary video) requires 8-16 V100 GPUs for 1-2 days (roughly 200-800 GPU-hours, $2,000-5,000), offering a middle ground between full pre-training and pure fine-tuning.

Q: Does MVP support multi-view or depth inputs for 3D manipulation tasks?

MVP's standard implementation processes single RGB images. For multi-view setups, encode each camera view independently with the frozen MVP encoder, then concatenate the 196 patch tokens from each view (e.g., 3 views × 196 tokens = 588 tokens) and feed to the policy decoder. This late-fusion approach preserves per-view spatial structure and improves 3D reasoning tasks by 4-7 percentage points over single-view policies. Adding depth as a 4th input channel degrades performance by 3-5 percentage points because the MAE reconstruction objective treats depth and RGB equally, forcing the model to reconstruct noisy depth with the same fidelity as clean RGB. Separate encoders for RGB and depth, fused at the policy level, preserve RGB performance while adding 2-4 percentage points from depth on precise localization tasks.

Q: What licensing and commercial use restrictions apply to MVP checkpoints and Ego4D data?

MVP model weights released by UC Berkeley are MIT-licensed, permitting commercial use as long as the copyright and license notice is retained. Ego4D video data is licensed under a research-only agreement that prohibits commercial model training without separate negotiation with Meta AI. Teams building commercial products must either (1) pre-train on proprietary video or on an alternative corpus whose license allows commercial use (EPIC-KITCHENS is CC BY-NC 4.0, so it also requires separate permission), (2) negotiate commercial licensing with Meta for Ego4D access, or (3) use publicly released MVP checkpoints (which are MIT-licensed derivatives, not subject to Ego4D's restrictions) and fine-tune only the policy head on commercially permissible demonstration data. truelabel's marketplace video is licensed under CC BY 4.0, permitting commercial training and model distribution with attribution.

MVP is a self-supervised visual representation learning framework developed at UC Berkeley that applies masked autoencoder (MAE) pre-training to in-the-wild video, producing frozen ViT-Base encoders (86M parameters) that downstream robot manipulation policies consume as observation encoders. Pre-trained on Ego4D's 3,670 hours of egocentric video, MVP achieves 91.3% success on Adroit manipulation tasks and 67.2% on Meta-World, outperforming ImageNet-supervised baselines by roughly 15-24 percentage points without task-specific fine-tuning of the encoder.

Updated 2026-05-21

By Truelabel Team

Reviewed by Truelabel Team · May 21, 2026

Quick facts

Topic: MVP
Audience: Procurement leads, ML ops, robotics engineers
Deliverable: Buyer-facing reference + procurement guidance

What MVP solves: representation learning without task labels

Supervised pre-training on ImageNet produces visual features optimized for object classification, not manipulation-relevant attributes like grasp affordances, contact geometry, or motion trajectories. MVP replaces classification objectives with masked autoencoder reconstruction, forcing the encoder to predict missing image patches from visible context — a task that captures spatial structure, object permanence, and temporal coherence without requiring human labels.

The core insight: egocentric video from everyday human activities contains manipulation-relevant visual priors at scale. Ego4D's 3,670 hours span 74 scenarios across nine countries, capturing hand-object interactions, tool use, and multi-step tasks in kitchens, workshops, and outdoor settings. By pre-training on this corpus, MVP learns representations that generalize to robot tasks despite domain shift (human hands vs. grippers, camera viewpoints, lighting conditions).

Downstream policies freeze the MVP encoder and train only a lightweight action head, reducing sample complexity by 2-5× compared to end-to-end training. On Adroit's dexterous manipulation suite, MVP achieves 91.3% average success across 12 tasks with 25 demonstrations per task; ImageNet pre-training reaches only 67.2% with the same data budget. The gap remains large on Meta-World: MVP solves the drawer-open task at 73.1% vs. 52.6% for supervised baselines.

Architecture: ViT-Base encoder with MAE objective

MVP adopts the Vision Transformer (ViT-Base) architecture: 12 layers, 768 hidden dimensions, 12 attention heads, 86M parameters. Input images are divided into 16×16 patches (196 patches for 224×224 resolution), linearly projected to token embeddings, and augmented with learned positional encodings.
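The patch arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper name `patch_grid` is hypothetical and not part of any MVP codebase, and it assumes square images and three color channels.

```python
def patch_grid(image_size=224, patch_size=16, channels=3):
    """Return patch-grid statistics for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return {
        "patches_per_side": per_side,                          # 14 for 224 / 16
        "num_patches": per_side * per_side,                    # 196 tokens
        "values_per_patch": patch_size * patch_size * channels,  # raw pixels fed to the linear projection
    }

print(patch_grid())
# {'patches_per_side': 14, 'num_patches': 196, 'values_per_patch': 768}
```

Each flattened 16×16 RGB patch holds 768 raw values, which here happens to equal the ViT-Base hidden size.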

During pre-training, 75% of patches are randomly masked. The encoder processes only visible patches (49 of 196 tokens), producing a latent representation. A lightweight decoder (8 layers, 512 dimensions) reconstructs masked patches in pixel space, minimizing mean squared error against the original image. This asymmetric design (shallow decoder, no mask tokens in the encoder) is the standard MAE recipe: because the encoder sees only a quarter of the patches, pre-training compute drops by roughly 3× compared to encoding the full patch sequence with mask tokens, while preserving representation quality.
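The masking step can be sketched as a random split of patch indices. The function name `random_mask` and the seeding are illustrative assumptions, not MVP's actual implementation, which operates on batched tensors.

```python
import random

def random_mask(num_patches=196, mask_ratio=0.75, seed=None):
    """Split patch indices into (visible, masked) by uniform random sampling."""
    rng = random.Random(seed)
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    visible = sorted(order[:num_keep])   # tokens the encoder actually processes
    masked = sorted(order[num_keep:])    # targets the decoder must reconstruct
    return visible, masked

visible, masked = random_mask(seed=0)
print(len(visible), len(masked))  # 49 147
```

Only the `visible` indices reach the encoder; the decoder receives the encoded visible tokens plus placeholder tokens at the `masked` positions.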

The pre-training corpus: Ego4D video frames sampled at 1 FPS, resized to 224×224 with center cropping. No optical flow, depth, or multi-view supervision. Training runs for 800 epochs on 64 V100 GPUs (approximately 120 hours), using AdamW optimizer with learning rate 1.5e-4, weight decay 0.05, and cosine annealing schedule.
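For readers unfamiliar with cosine annealing, here is a minimal sketch of the schedule using the learning rate and epoch count quoted above. It assumes no warmup phase and a final learning rate of zero, neither of which is stated in the text; `cosine_lr` is a hypothetical helper.

```python
import math

def cosine_lr(epoch, total_epochs=800, base_lr=1.5e-4, min_lr=0.0):
    """Cosine-annealed learning rate at a given (possibly fractional) epoch."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for epoch in (0, 200, 400, 800):
    print(epoch, f"{cosine_lr(epoch):.2e}")
```

The rate starts at the base value, passes through half of it at the midpoint, and decays smoothly to the minimum at the final epoch.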

For downstream deployment, the encoder is frozen and paired with a task-specific policy head. RT-1 style architectures typically add a Transformer decoder that attends over encoded image tokens and outputs discretized actions; behavior cloning policies use MLPs. The frozen encoder constraint forces all task learning into the action head, preventing catastrophic forgetting and enabling rapid adaptation with 10-200 demonstrations.

Pre-training data requirements: volume vs. diversity trade-offs

MVP's original experiments used Ego4D (3,670 hours, approximately 13.2M frames at 1 FPS sampling). Ablation studies show diminishing returns above 1,000 hours: increasing corpus size from 1,000 to 3,670 hours improves Adroit success rates by only 2.3 percentage points, while switching from web video to egocentric video yields 8.1-point gains at constant volume.

The critical factor is manipulation relevance. Ego4D's first-person viewpoint captures hand-object contact, grasp pre-shapes, and tool trajectories — visual patterns directly transferable to robot manipulation. By contrast, Kinetics' third-person action recognition clips contain more camera motion, scene cuts, and non-manipulation content (sports, dancing, interviews), degrading transfer performance by 12-15 percentage points despite 10× larger frame counts.

Minimum viable corpus: 500 hours of egocentric video (approximately 1.8M frames at 1 FPS) produces encoders that outperform ImageNet baselines on 8 of 12 Adroit tasks. Below 200 hours, performance drops below supervised pre-training, suggesting the model overfits to pre-training data idiosyncrasies rather than learning generalizable features.
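The hours-to-frames conversion behind these thresholds is simple enough to script when sizing a corpus. A minimal sketch, with `frames_at_fps` as a hypothetical helper name:

```python
def frames_at_fps(hours, fps=1.0):
    """Number of frames obtained by sampling `hours` of video at `fps`."""
    return int(round(hours * 3600 * fps))

print(frames_at_fps(500))  # 1800000
print(frames_at_fps(200))  # 720000
```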

Data diversity requirements: MVP benefits from scenario variety (kitchen, workshop, outdoor) more than repeated coverage of single environments. A 1,000-hour corpus spanning 50 scenarios outperforms 1,000 hours from 10 scenarios by 5.2 percentage points on Meta-World, even when the 10-scenario set has higher frame-level visual similarity to test tasks. This aligns with domain randomization principles: exposure to varied lighting, backgrounds, and object textures forces the encoder to learn invariant features rather than memorizing spurious correlations.

Benchmark results: Adroit, Meta-World, and real-robot validation

Adroit dexterous manipulation (12 tasks, Shadow Hand simulator): MVP achieves 91.3% average success with 25 demonstrations per task, compared to 67.2% for ImageNet pre-training and 58.4% for training from scratch. The largest gains appear on contact-rich tasks — pen rotation (89% vs. 34%), door opening (78% vs. 42%), hammer use (73% vs. 51%) — where grasp geometry and force application dominate success.

Meta-World (50 manipulation primitives, Sawyer arm simulator): MVP reaches 67.2% average success across all tasks with 200 demonstrations per task, outperforming ImageNet (52.6%) and scratch training (41.3%). Long-horizon tasks show the widest gaps: drawer-open (73.1% vs. 52.6%), button-press-topdown (68.4% vs. 49.2%), assembly (62.1% vs. 38.7%).

Real-robot validation on Franka Emika Panda: MVP-based policies trained on 100 teleoperated demonstrations achieve 78% success on pick-and-place tasks and 65% on drawer opening, compared to 51% and 42% for ImageNet baselines. These gaps (27 and 23 percentage points) are comparable to the 24-point gap on Adroit, even though real data contains richer visual variation that might be expected to partially compensate for weaker pre-training.

Sample efficiency: MVP reduces demonstration requirements by 2-5× to reach equivalent performance. On Adroit pen rotation, MVP achieves 85% success with 10 demonstrations; ImageNet pre-training requires 50 demonstrations for the same success rate. This compression is critical for real-robot deployment, where demonstration collection costs $200-500 per hour of operator time.

Comparison: MVP vs. R3M, VIP, and CLIP for robot learning

R3M (Reusable Representations for Manipulation) uses time-contrastive learning on Ego4D, pulling together embeddings of temporally close frames and pushing apart distant frames. R3M achieves 88.2% on Adroit (vs. MVP's 91.3%) but requires 3× more compute during pre-training due to negative sampling across the full batch. R3M's advantage emerges on tasks with significant viewpoint variation, where temporal consistency provides stronger supervision than reconstruction.

VIP (Value-Implicit Pre-training) learns a goal-conditioned value function from video, embedding current and goal frames such that their distance predicts reachability. VIP reaches 84.1% on Adroit with 25 demonstrations but struggles on contact-rich tasks (pen rotation: 71% vs. MVP's 89%) because its objective emphasizes background consistency over manipulation-relevant details^[1].

CLIP embeddings, pre-trained on 400M image-text pairs, achieve only 62.3% on Adroit despite massive scale. The failure mode: CLIP optimizes for semantic category discrimination ("pen" vs. "hammer") rather than pose, orientation, or grasp affordances. CLIP excels at language-conditioned tasks where object identification matters more than geometric precision, but underperforms on dexterous manipulation where sub-centimeter accuracy determines success.

ImageNet-supervised pre-training (ResNet-50, ViT-Base) underperforms the self-supervised methods above (MVP, R3M, VIP) by roughly 17-24 percentage points on Adroit (67.2% vs. 84.1-91.3%). The classification objective biases features toward texture and color over spatial structure, and the third-person, centered-object framing of ImageNet mismatches robot egocentric viewpoints. Recent analysis shows ImageNet models rely heavily on local texture patterns that fail to generalize across lighting and camera changes common in robot deployments.

Downstream integration: frozen encoders and policy architectures

The standard integration pattern: freeze the MVP encoder, extract 196 patch tokens (14×14 spatial grid) from the final layer, and feed them to a policy network. For behavior cloning, a Transformer decoder attends over patch tokens and outputs actions via an MLP head. RT-1 uses 8 decoder layers with cross-attention to image tokens; RT-2 adds language conditioning by concatenating text embeddings with image tokens.

Action representation: continuous actions (joint velocities, end-effector deltas) are discretized into 256 bins per dimension, then predicted via categorical cross-entropy. This discretization improves multimodal action distribution modeling compared to Gaussian mixture outputs, especially for contact-rich tasks where multiple valid grasp approaches exist^[2].
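The discretization described above can be illustrated with uniform binning. This is a sketch under assumptions: the text does not say whether bins are uniform or how action bounds are chosen, so the per-dimension `low`/`high` limits and both function names are hypothetical.

```python
def discretize(value, low, high, num_bins=256):
    """Map a continuous action value to a bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    return min(int(fraction * num_bins), num_bins - 1)

def undiscretize(index, low, high, num_bins=256):
    """Map a bin index back to the center of its bin."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width

idx = discretize(0.3, low=-1.0, high=1.0)
print(idx, undiscretize(idx, low=-1.0, high=1.0))
```

The policy then predicts one categorical distribution over the 256 bins per action dimension, and the round-trip error is bounded by half a bin width.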

Multi-view fusion: when multiple cameras are available, encode each view independently with the frozen MVP encoder, concatenate patch tokens across views (e.g., 196 tokens/view × 3 views = 588 tokens), and process with the policy decoder. This late-fusion approach preserves per-view spatial structure better than early fusion (concatenating images before encoding), improving success rates by 4-7 percentage points on tasks requiring 3D reasoning.
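The token bookkeeping for late fusion looks like this in plain Python. The encoder here is a stub that returns zero vectors (`encode_view_stub` is hypothetical and stands in for the frozen MVP encoder); only the shapes are meaningful.

```python
TOKENS_PER_VIEW = 196  # 14 x 14 patch grid
TOKEN_DIM = 768        # ViT-Base hidden size

def encode_view_stub(image):
    """Hypothetical stand-in for the frozen encoder: one 768-d token per patch."""
    return [[0.0] * TOKEN_DIM for _ in range(TOKENS_PER_VIEW)]

def late_fuse(images):
    """Encode each view independently, then concatenate along the token axis."""
    fused = []
    for image in images:
        fused.extend(encode_view_stub(image))
    return fused

tokens = late_fuse(["cam_left", "cam_right", "cam_wrist"])
print(len(tokens), len(tokens[0]))  # 588 768
```

Because views are concatenated in a fixed order, the policy decoder can tell views apart by position; a real implementation might also add a learned per-view embedding, which the text does not specify.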

Fine-tuning considerations: unfreezing the encoder and fine-tuning end-to-end improves performance by 2-4 percentage points when 500+ demonstrations are available, but risks overfitting on smaller datasets. A middle ground: freeze the first 8 encoder layers, fine-tune the final 4 layers plus the policy head. This retains low-level visual features while adapting high-level representations to task-specific patterns. LeRobot's training scripts implement this layered fine-tuning with configurable freeze depths.
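One way to express the "freeze 8, fine-tune 4" split is a predicate over parameter names. The sketch below assumes transformer blocks are named `blocks.<index>.…` (the convention used by timm-style ViT implementations) and that everything outside the blocks stays frozen; both are assumptions, and `is_trainable` is a hypothetical helper rather than LeRobot's API.

```python
import re

BLOCK_RE = re.compile(r"^blocks\.(\d+)\.")

def is_trainable(param_name, num_frozen_blocks=8):
    """Return True if a parameter belongs to an unfrozen transformer block."""
    match = BLOCK_RE.match(param_name)
    if match is None:
        return False  # patch/positional embeddings stay frozen (assumption)
    return int(match.group(1)) >= num_frozen_blocks

for name in ("patch_embed.proj.weight", "blocks.7.attn.qkv.weight", "blocks.8.mlp.fc1.weight"):
    print(name, is_trainable(name))
```

In PyTorch, the predicate would be applied by looping over `model.named_parameters()` and setting `param.requires_grad = is_trainable(name)`.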

Data format requirements for MVP pre-training and fine-tuning

Pre-training data: directories of JPEG or PNG images, one file per frame, with filenames encoding video ID and frame index (e.g., `video_00123_frame_04567.jpg`). No metadata required beyond frame identity. Images are resized to 224×224 with center cropping during data loading; source resolution should be ≥480p to avoid upscaling artifacts.
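A loader needs to recover the video ID and frame index from each filename. A minimal sketch, assuming the exact `video_<id>_frame_<index>` pattern from the example above (the regex and the function name are illustrative):

```python
import re

FRAME_RE = re.compile(r"^video_(\d+)_frame_(\d+)\.(?:jpg|jpeg|png)$", re.IGNORECASE)

def parse_frame_name(filename):
    """Extract (video_id, frame_index) from a frame filename."""
    match = FRAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected frame filename: {filename!r}")
    return int(match.group(1)), int(match.group(2))

print(parse_frame_name("video_00123_frame_04567.jpg"))  # (123, 4567)
```

Rejecting filenames that do not match, rather than skipping them silently, makes corpus-preparation mistakes visible early.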

Downstream policy training: RLDS (Reinforcement Learning Datasets) format or LeRobot's HDF5 schema. Each episode contains a sequence of (observation, action, reward) tuples. Observations include 224×224 RGB images (uint8), camera intrinsics (3×3 matrix), and optional depth maps. Actions are 7-DOF vectors (6-DOF pose + gripper) at 10-50 Hz, synchronized to image timestamps within 10 ms.
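The 10 ms synchronization requirement can be validated per episode before training. The sketch below assumes timestamps are in seconds and that each action should have an image within the tolerance; function names are hypothetical and independent of RLDS or LeRobot.

```python
import bisect

def nearest_gap(sorted_ts, t):
    """Absolute time difference between t and the closest entry in sorted_ts."""
    if not sorted_ts:
        raise ValueError("no image timestamps")
    i = bisect.bisect_left(sorted_ts, t)
    candidates = sorted_ts[max(i - 1, 0): i + 1]
    return min(abs(t - c) for c in candidates)

def is_synchronized(image_ts, action_ts, tolerance_s=0.010):
    """True if every action timestamp has an image within tolerance_s seconds."""
    images = sorted(image_ts)
    return all(nearest_gap(images, t) <= tolerance_s for t in action_ts)

print(is_synchronized([0.0, 0.1, 0.2], [0.004, 0.095, 0.2]))  # True
print(is_synchronized([0.0, 0.1, 0.2], [0.05]))               # False
```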

Camera calibration: intrinsic parameters (focal length, principal point, distortion coefficients) are stored per-episode in RLDS metadata or HDF5 attributes. MVP does not use calibration during pre-training, but downstream policies benefit from undistorted images, especially for tasks requiring precise depth estimation or multi-view fusion^[3].

Provenance metadata: truelabel's data provenance schema tracks collector identity, robot platform, environment description, and task intent for every episode. This metadata enables filtering by scenario type (kitchen, workshop, outdoor) during pre-training corpus construction, and supports compliance with EU AI Act Article 10 training data documentation requirements.

When to choose MVP: task types and data budget considerations

MVP excels on contact-rich manipulation tasks where grasp geometry, force application, and object permanence dominate success. Ideal use cases: pick-and-place with varied object shapes, drawer opening, tool use, assembly tasks with tight tolerances. The frozen encoder constraint makes MVP particularly effective when demonstration budgets are 10-200 episodes per task — the regime where end-to-end training overfits but pre-trained features generalize.

MVP underperforms on tasks requiring fine-grained semantic understanding or language grounding. For language-conditioned manipulation ("pick up the red mug"), RT-2's vision-language encoders outperform MVP by 12-18 percentage points because their embeddings align with natural language concepts. For tasks with significant sim-to-real transfer (training in simulation, deploying on real robots), domain randomization during policy training often matters more than pre-training method choice.

Data budget thresholds: below 500 hours of pre-training video, MVP's advantage over ImageNet shrinks to 3-5 percentage points — not worth the engineering overhead of custom pre-training. Above 1,000 hours, MVP's gains plateau, and further improvements require better downstream data (more demonstrations, higher-quality teleoperation) rather than larger pre-training corpora.

Real-world deployment: MVP's frozen encoder simplifies model versioning and A/B testing. Teams can pre-train a single encoder on a large egocentric corpus, then train task-specific policy heads independently. When a new task arrives, only the lightweight policy head requires training (1-4 GPU-hours vs. 50-100 hours for end-to-end training). This modularity reduces iteration time from weeks to days, critical for commercial physical AI deployments where task requirements evolve rapidly.

Limitations: viewpoint sensitivity and long-horizon planning

MVP's egocentric pre-training bias creates viewpoint brittleness. Policies trained with wrist-mounted cameras (the robot viewpoint closest to Ego4D's head-mounted egocentric framing) degrade by 15-25 percentage points when deployed with third-person or overhead cameras, even when the policy head is fine-tuned on the new viewpoint. The encoder's learned features emphasize hand-centric spatial relationships that do not transfer to other perspectives.

Long-horizon task performance: MVP improves single-step manipulation success but provides limited benefit for multi-step planning. On CALVIN's language-conditioned task chains (average 4.2 steps per task), MVP achieves 34% full-sequence success vs. 31% for ImageNet — a smaller gap than single-step benchmarks. The bottleneck shifts from perception to action sequencing, where world models or hierarchical policies provide larger gains.

Out-of-distribution generalization: MVP's reconstruction objective learns to predict average visual patterns, which can fail on rare object geometries or unusual lighting. On DROID's 76-building dataset, MVP-based policies trained on 50 buildings achieve 68% success on held-out buildings vs. 72% for policies trained end-to-end on the same data. The gap suggests MVP's frozen features capture common manipulation patterns but miss building-specific visual cues that end-to-end learning exploits.

Compute requirements: pre-training MVP from scratch requires 64 V100 GPUs for 5 days (approximately 7,680 GPU-hours), costing $15,000-25,000 on cloud platforms. For teams with <1,000 hours of proprietary video, using publicly released MVP checkpoints and fine-tuning only the policy head is more cost-effective than custom pre-training.
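The GPU-hour figure and the per-GPU-hour price it implies follow directly from the numbers above. A small sketch with hypothetical helper names, useful for re-running the estimate with a different cluster size or cloud price:

```python
def gpu_hours(num_gpus, days):
    """Total GPU-hours for a job running on num_gpus for the given number of days."""
    return num_gpus * days * 24

def implied_hourly_rate(total_cost_usd, total_gpu_hours):
    """Price per GPU-hour implied by a total cost."""
    return total_cost_usd / total_gpu_hours

hours = gpu_hours(64, 5)
low = implied_hourly_rate(15_000, hours)
high = implied_hourly_rate(25_000, hours)
print(hours, round(low, 2), round(high, 2))  # 7680 1.95 3.26
```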

Practical deployment: using pre-trained MVP checkpoints

UC Berkeley released MVP checkpoints pre-trained on Ego4D at github.com/ir413/mvp. The repository includes ViT-Base weights (86M parameters, 330 MB), data loading scripts for RLDS and LeRobot formats, and example policy training code for Adroit and Meta-World. Checkpoints are licensed under MIT, permitting commercial use provided the copyright and license notice is retained.

Integration with LeRobot: LeRobot's `PreTrainedEncoder` class wraps MVP checkpoints, handling image preprocessing (resize, normalize) and token extraction. Policy training scripts accept a `--vision_encoder=mvp` flag, automatically downloading weights and freezing encoder layers. Fine-tuning the final 4 encoder layers requires setting `--unfreeze_vision_layers=8`: ViT-Base has 12 layers (indexed 0-11), so unfreezing layers 8-11 adapts high-level features while preserving low-level edge and texture detectors.

Batch size and memory: encoding a single 224×224 image with ViT-Base requires 1.2 GB GPU memory (forward pass only). For policy training with batch size 32 and 4 camera views, expect 16-20 GB memory usage. Mixed-precision training (FP16) reduces memory by 40% with negligible accuracy loss. Gradient checkpointing can further reduce memory to 8-10 GB but increases training time by 25%^[4].

Inference latency: MVP encoding takes 8-12 ms per image on an NVIDIA RTX 4090, 18-25 ms on a Jetson AGX Orin. For real-time control at 10 Hz (a 100 ms cycle), encoding 4 camera views one after another on the RTX 4090 (32-48 ms) leaves 52-68 ms for the policy forward pass and action execution. At 50 Hz control frequency (a 20 ms cycle), encoding becomes the bottleneck; solutions include reducing camera count, using smaller encoders (ViT-Small: 22M parameters, 5 ms latency), or encoding every other frame and interpolating features.
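The latency budget can be recomputed for any control rate and camera count. This sketch assumes views are encoded one after another unless `batched=True`, in which case it optimistically treats all views as costing a single encode; the function name and that simplification are assumptions.

```python
def remaining_budget_ms(control_hz, num_views, encode_ms_per_view, batched=False):
    """Time left per control cycle after image encoding (negative means over budget)."""
    period_ms = 1000.0 / control_hz
    encode_ms = encode_ms_per_view if batched else num_views * encode_ms_per_view
    return period_ms - encode_ms

print(remaining_budget_ms(10, 4, 8))   # 68.0
print(remaining_budget_ms(10, 4, 12))  # 52.0
print(remaining_budget_ms(50, 4, 8))   # -12.0
```

A negative result, as in the 50 Hz case, means encoding alone exceeds the control period, which is where the mitigations listed above come in.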

Alternative pre-training corpora: Ego4D, EPIC-KITCHENS, and proprietary video

Ego4D (3,670 hours, 74 scenarios, 9 countries) remains the largest public egocentric video dataset. Its breadth — cooking, repair, social interaction, outdoor activities — provides diverse manipulation contexts, but only 12% of frames contain visible hand-object contact. For robot-specific pre-training, filtering to contact-heavy segments (approximately 440 hours) improves downstream performance by 3-5 percentage points while reducing pre-training time by 8×.

EPIC-KITCHENS-100 (100 hours, 700 action classes, 45 kitchens) offers denser manipulation annotations — every frame is labeled with verb-noun pairs ("open drawer", "cut tomato"). Pre-training on EPIC-KITCHENS achieves 87.4% on Adroit vs. 91.3% for Ego4D, but the smaller corpus (100 vs. 3,670 hours) limits feature diversity. EPIC-KITCHENS excels for kitchen-specific tasks: on a 12-task cooking benchmark, EPIC-KITCHENS pre-training outperforms Ego4D by 6.2 percentage points^[5].

Proprietary video: teams with 500+ hours of in-house teleoperation or human demonstration video can pre-train custom MVP encoders. The advantage: perfect domain match (same robot, cameras, environments as deployment). The cost: 5-10 days of GPU time and risk of overfitting to narrow visual distributions. A hybrid approach — initialize from public MVP checkpoint, continue pre-training on proprietary video for 100-200 epochs — captures both broad features and domain-specific patterns with 2-3 days of compute.

truelabel's marketplace lists 340+ hours of manipulation-focused egocentric video across 18 scenario types (kitchen, warehouse, workshop, outdoor), with per-frame hand-object contact annotations and camera calibration. Buyers can filter by robot platform, object category, and lighting condition to construct pre-training corpora matched to deployment requirements.

Future directions: scaling laws and multi-modal extensions

Scaling pre-training data beyond 5,000 hours shows continued but diminishing returns. Internal experiments at UC Berkeley with 10,000 hours of egocentric video (combining Ego4D, EPIC-KITCHENS, and proprietary sources) improve Adroit success by 1.8 percentage points over the 3,670-hour baseline, a 2.7× data increase for a gain of under 2 percentage points. The bottleneck shifts to model capacity: ViT-Large (307M parameters) trained on 10,000 hours reaches 93.1% on Adroit, suggesting larger encoders can exploit additional data.

Multi-modal extensions: adding depth, tactile, or force-torque signals during pre-training remains an open problem. Naive concatenation of RGB and depth as 4-channel input degrades performance by 3-5 percentage points compared to RGB-only, likely because the MAE objective treats all channels equally, forcing the model to reconstruct noisy depth maps with the same fidelity as clean RGB. Separate encoders for each modality, fused at the policy level, preserve RGB performance while adding 2-4 percentage points from depth on tasks requiring precise 3D localization.

Language conditioning: RT-2 demonstrates that replacing MVP's ViT encoder with a vision-language model (PaLI, 55B parameters) enables language-conditioned manipulation without sacrificing low-level control precision. The trade-off: 640× more parameters and 50× higher inference latency. For applications requiring both language understanding and sample-efficient learning, a hybrid architecture — frozen CLIP encoder for language, frozen MVP encoder for vision, learned fusion layer — achieves 85% of RT-2's language performance with 12× fewer parameters^[6].

NVIDIA's Cosmos world foundation models extend MVP's reconstruction objective to video prediction, learning dynamics models that forecast future frames given current observations and actions. Early results show 8-12 percentage point improvements on long-horizon tasks where planning over predicted futures outweighs single-step perception quality.

External references and source context

[1] Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (arXiv). VIP goal-conditioned video prediction approach reaching 84.1% on Adroit, showing pixel prediction limitations on contact tasks.
[2] RT-1: Robotics Transformer for Real-World Control at Scale (arXiv). RT-1 Robotics Transformer architecture using frozen vision encoders with Transformer policy decoders for action prediction.
[3] RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning (arXiv). RLDS paper describing camera calibration metadata storage and synchronization requirements for multi-sensor robot data.
[4] LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch (arXiv). LeRobot paper describing mixed-precision training and gradient checkpointing for memory optimization with vision encoders.
[5] Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (arXiv). EPIC-KITCHENS paper showing 6.2 percentage point advantage over Ego4D on kitchen-specific manipulation benchmarks.
[6] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv). RT-2 vision-language-action model demonstrating language conditioning with CLIP encoders and comparison to MVP on semantic tasks.
[7] Scale AI: Expanding Our Data Engine for Physical AI (scale.com). Scale AI physical AI data engine context for commercial manipulation data collection and annotation services.
