
Glossary

Depth Anything V2

Depth Anything V2 is a monocular depth estimation model that predicts dense per-pixel depth maps from single RGB images using a DINOv2 Vision Transformer encoder and Dense Prediction Transformer decoder. Released in June 2024, it was trained on 595,000 labeled images plus 62 million pseudo-labeled frames, achieving zero-shot generalization across indoor, outdoor, and egocentric domains without domain-specific fine-tuning.

Updated 2025-06-15
By truelabel
Reviewed by truelabel

Quick facts

Term: Depth Anything V2
Domain: Robotics and physical AI
Last reviewed: 2025-06-15

Architecture and Model Variants

Depth Anything V2 implements a two-stage encoder-decoder architecture. The encoder is a frozen or fine-tuned DINOv2 Vision Transformer available in four scales: Small (24.8M parameters), Base (97.5M), Large (335.3M), and Giant (1.3B). Each variant extracts multi-scale feature maps at four spatial resolutions through intermediate transformer blocks.

The decoder follows the Dense Prediction Transformer (DPT) design, fusing encoder features via progressive upsampling modules. Each fusion stage concatenates features from corresponding encoder layers, applies convolutional refinement, and upsamples to the next resolution. The final prediction head outputs a single-channel depth map matching input image dimensions.

Training uses scale-and-shift-invariant loss, enabling the model to handle heterogeneous depth ranges across datasets without per-dataset normalization. This loss function computes affine-invariant error between predicted and ground-truth depth, making the model robust to absolute scale variations in training data. Inference runs at 50 FPS for 518×518 images on NVIDIA RTX 3090 GPUs for the Small variant.
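
For reference, a common affine-invariant formulation (the MiDaS-style scale-and-shift-invariant loss that Depth Anything builds on; shown here as a sketch rather than the paper's exact objective) normalizes each depth map by its median and mean absolute deviation before comparing prediction and ground truth:

    \hat{d}_i = \frac{d_i - t(d)}{s(d)}, \quad
    t(d) = \operatorname{median}(d), \quad
    s(d) = \frac{1}{M} \sum_{i=1}^{M} \lvert d_i - t(d) \rvert, \qquad
    \mathcal{L}_{\mathrm{ssi}} = \frac{1}{M} \sum_{i=1}^{M}
    \bigl\lvert \hat{d}^{\,\mathrm{pred}}_i - \hat{d}^{\,\mathrm{gt}}_i \bigr\rvert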

The Giant variant achieves state-of-the-art zero-shot performance on KITTI (δ₁ accuracy 98.2%) and NYUv2 (δ₁ 99.1%) benchmarks without domain-specific fine-tuning, demonstrating strong cross-domain generalization from the training mixture.
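
δ₁ is the standard threshold-accuracy metric behind these benchmark numbers. A minimal sketch of how it is computed, assuming predictions have already been aligned to the ground-truth scale:

    import numpy as np

    def delta1(pred: np.ndarray, gt: np.ndarray, valid: np.ndarray) -> float:
        """Fraction of valid pixels where max(pred/gt, gt/pred) < 1.25."""
        ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
        return float((ratio < 1.25).mean())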

Training Data Strategy and Pseudo-Labeling Pipeline

The model was trained on a two-tier dataset: 595,000 images with ground-truth depth labels from six public datasets (Hypersim, Virtual KITTI, DIML Indoor, HRWSI, BlendedMVS, IRS) plus 62 million unlabeled images pseudo-labeled by the model itself during self-training.

Pseudo-labeling follows a teacher-student framework. An initial model trained on labeled data generates depth predictions for unlabeled images from SA-1B, LSUN, and web-scraped video frames. High-confidence predictions (filtered by prediction consistency across augmentations) become pseudo-labels for the next training iteration. This cycle repeats three times, progressively expanding the effective training set.
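
A minimal sketch of one such round; the augmentation-consistency filter and its threshold are illustrative assumptions, and the teacher object and downstream training step are hypothetical stand-ins for the actual models and loop:

    import numpy as np

    def pseudo_label_round(teacher, unlabeled_images, threshold=0.05):
        pseudo_pairs = []
        for img in unlabeled_images:
            views = [img, np.fliplr(img)]                 # simple augmentation pair
            preds = [teacher.predict(v) for v in views]   # hypothetical predict() API
            preds[1] = np.fliplr(preds[1])                # undo the flip for comparison
            spread = np.mean(np.abs(preds[0] - preds[1])) # cross-view disagreement
            if spread < threshold:                        # keep confident frames only
                pseudo_pairs.append((img, np.mean(preds, axis=0)))
        return pseudo_pairs

    # next iteration (hypothetical): train_student(labeled_pairs + pseudo_label_round(teacher, unlabeled))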

The unlabeled corpus emphasizes diversity: SA-1B contributes 11 million high-resolution natural images, LSUN adds 10 million indoor/outdoor scenes, and 41 million frames come from egocentric video datasets including Ego4D. This mixture ensures coverage of camera viewpoints, lighting conditions, and scene types underrepresented in labeled depth datasets.

Physical AI data marketplaces now offer depth-annotated video at scale, but procurement teams must verify annotation protocols. LiDAR-derived ground truth differs systematically from stereo-derived or structure-from-motion depth; mixing these sources without calibration introduces scale inconsistencies that degrade model performance on absolute depth tasks.

Robotics and Embodied AI Applications

Monocular depth estimation enables robots to infer 3D scene geometry from single-camera observations, critical for manipulation in unstructured environments. RT-1 and RT-2 vision-language-action models consume RGB images but benefit from depth-augmented training data that encodes object proximity and surface orientation.

Depth Anything V2 processes egocentric video at inference time to generate depth pseudo-labels for DROID-scale teleoperation datasets. These pseudo-labels serve as auxiliary supervision signals during policy training, improving grasp success rates by 12-18% in clutter scenarios where RGB alone provides insufficient spatial cues[1].

OpenVLA incorporates depth maps as additional input channels alongside RGB, enabling the model to distinguish overlapping objects at different distances. This depth-conditioned attention mechanism reduces collision rates in dense pick-and-place tasks by 23% compared to RGB-only baselines[2].

Navigation policies for mobile manipulators use monocular depth to construct local occupancy grids without LiDAR. NVIDIA Cosmos world foundation models generate synthetic depth-annotated video for pre-training, but sim-to-real transfer requires validation on real-world depth distributions. Depth Anything V2's zero-shot generalization reduces the need for domain-specific fine-tuning when deploying policies trained on mixed real-synthetic data.

Historical Context and Model Evolution

Monocular depth estimation emerged as a supervised learning problem in 2005 with Markov Random Field approaches. Convolutional networks (Eigen et al., 2014) introduced end-to-end learning from RGB-depth pairs, but generalization remained limited to training domains.

MiDaS (Ranftl et al., 2020) pioneered zero-shot depth by training on five diverse datasets with scale-invariant loss, achieving cross-dataset transfer without fine-tuning. ZoeDepth (Bhat et al., 2023) added metric depth prediction via domain-specific heads, but required per-domain calibration.

Depth Anything V1 (Yang et al., January 2024) introduced large-scale pseudo-labeling, combining roughly 1.5 million labeled images with a 62-million-image unlabeled corpus and demonstrating that self-training on diverse unlabeled data improves generalization more than simply expanding labeled datasets. V2 (June 2024) kept the DINOv2 Vision Transformer encoder family but retrained the pipeline with a stronger DINOv2-Giant teacher and four student scales, improving δ₁ accuracy by 8-15% across benchmarks.

The shift from supervised-only training to semi-supervised pseudo-labeling mirrors trends in physical AI data engines, where human-labeled seed data bootstraps model-generated annotations at scale. This approach reduces per-image labeling cost from $2-5 (manual depth annotation) to $0.02-0.08 (pseudo-label verification).

Integration with Physical AI Data Pipelines

Depth Anything V2 serves as a preprocessing module in LeRobot and RLDS data pipelines, generating depth channels for RGB-only teleoperation recordings. The pipeline stores predictions as 16-bit PNG depth maps normalized to [0, 65535], which compress to roughly a quarter of the size of raw float32 arrays while preserving millimeter-scale precision for manipulation tasks.
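
A minimal sketch of this preprocessing step using the Hugging Face depth-estimation pipeline; the checkpoint id depth-anything/Depth-Anything-V2-Small-hf, the per-frame min-max normalization, and the file names are assumptions, not a fixed LeRobot or RLDS convention:

    import cv2
    import numpy as np
    import torch
    from PIL import Image
    from transformers import pipeline

    pipe = pipeline("depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")

    frame = Image.open("frame_000000.jpg")              # RGB teleoperation frame
    pred = pipe(frame)["predicted_depth"]               # relative depth, torch.Tensor
    if pred.dim() == 3:                                  # drop batch dimension if present
        pred = pred[0]

    # Resize the prediction back to the frame resolution.
    pred = torch.nn.functional.interpolate(
        pred[None, None], size=frame.size[::-1], mode="bicubic", align_corners=False
    )[0, 0].cpu().numpy()

    # Per-frame affine normalization to [0, 65535]; values remain relative depth,
    # so absolute scale must be recovered separately if metric depth is needed.
    pred = (pred - pred.min()) / (pred.max() - pred.min() + 1e-8)
    depth_u16 = (pred * 65535.0).astype(np.uint16)
    cv2.imwrite("frame_000000_depth.png", depth_u16)     # OpenCV writes 16-bit PNGs directly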

BridgeData V2 includes 60,000 trajectories with RGB-D from RealSense cameras, but 80% of contributed datasets lack depth sensors. Applying Depth Anything V2 to these RGB-only trajectories creates a unified depth-augmented corpus without re-collection. Validation against held-out RealSense ground truth shows mean absolute error of 4.2 cm at 1-meter distance, sufficient for tabletop manipulation[3].

Procurement teams evaluating depth-annotated datasets must distinguish three depth modalities: (1) sensor-captured (LiDAR, stereo, ToF), (2) structure-from-motion reconstruction, (3) monocular model inference. Each has different error characteristics. Data provenance metadata should specify depth source and validation protocol to prevent training on inconsistent depth representations.
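
An illustrative, hypothetical per-trajectory provenance record along these lines (field names are assumptions, not a truelabel or RLDS schema):

    from dataclasses import dataclass
    from typing import Literal, Optional

    @dataclass
    class DepthProvenance:
        source: Literal["sensor", "sfm", "monocular_inference"]
        sensor_model: Optional[str]        # e.g. "RealSense D435i"; None for inferred depth
        inference_model: Optional[str]     # e.g. "Depth Anything V2 Base", with version pin
        calibration_ref: str               # pointer to intrinsics/extrinsics record
        validation_protocol: str           # e.g. "5% sample vs. RealSense, MAE + delta1"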

MCAP and HDF5 storage formats support multi-modal trajectories with per-frame depth maps. A 10-minute trajectory at 30 Hz with 640×480 depth adds 1.2 GB compressed storage, making depth augmentation feasible for datasets under 10 TB total size.
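
A minimal sketch of appending per-frame uint16 depth maps to an HDF5 trajectory file with gzip compression; the dataset path and the depth_frames iterator are illustrative assumptions:

    import h5py
    import numpy as np

    n_frames = 30 * 60 * 10                              # 10 minutes at 30 Hz
    with h5py.File("trajectory_0001.h5", "a") as f:
        dset = f.create_dataset(
            "observations/depth",                        # illustrative dataset path
            shape=(n_frames, 480, 640),
            dtype=np.uint16,
            chunks=(1, 480, 640),                        # one frame per chunk
            compression="gzip",
            compression_opts=4,
        )
        for i, depth_u16 in enumerate(depth_frames):     # depth_frames: hypothetical HxW uint16 maps
            dset[i] = depth_u16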

Zero-Shot Generalization and Domain Coverage

Depth Anything V2's training mixture spans indoor (Hypersim, DIML), outdoor (Virtual KITTI, HRWSI), and egocentric (Ego4D frames) domains, enabling zero-shot transfer to robotics scenarios without fine-tuning. The model achieves δ₁ accuracy above 95% on EPIC-KITCHENS egocentric frames despite no kitchen-specific training data.

However, zero-shot performance degrades on three edge cases: (1) transparent objects (glass, acrylic), where RGB provides insufficient texture cues; (2) specular surfaces (polished metal, mirrors), which violate Lambertian reflectance assumptions; (3) extreme close-ups under 10 cm, outside the training distribution's depth range.

Robotics datasets targeting manipulation of transparent containers or reflective tools require sensor-captured depth rather than monocular inference. Dex-YCB includes 582,000 frames with Azure Kinect depth for transparent object grasping, demonstrating 34% higher grasp success than monocular-depth-trained policies[4].

Open X-Embodiment aggregates 1 million trajectories across 22 robot embodiments, but depth modality coverage is inconsistent. Only 18% of trajectories include sensor depth, 31% have monocular depth pseudo-labels, and 51% are RGB-only. Buyers training depth-conditioned policies must filter by depth availability or budget for post-hoc depth annotation.

Computational Requirements and Deployment Constraints

Inference latency varies by model scale and hardware. The Small variant (24.8M parameters) runs at 50 FPS on NVIDIA RTX 3090 for 518×518 inputs, suitable for real-time robot control at 20 Hz. The Giant variant (1.3B parameters) achieves 8 FPS on the same hardware, limiting deployment to offline dataset preprocessing.

Edge deployment on robot compute modules (NVIDIA Jetson Orin, Qualcomm RB5) requires quantization. INT8 quantization of the Small variant reduces model size from 99 MB to 26 MB and increases throughput to 18 FPS on Jetson Orin NX, with δ₁ accuracy drop of 2.1% on indoor scenes.
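
A rough sketch of post-training INT8 quantization of the transformer's Linear layers with PyTorch dynamic quantization; the model loader is hypothetical, and production Jetson deployments would more typically export to ONNX and build a TensorRT INT8 engine with a calibration set:

    import torch
    from torch.ao.quantization import quantize_dynamic

    model = load_depth_anything_v2_small()               # hypothetical model loader
    model_int8 = quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
    torch.save(model_int8.state_dict(), "dav2_small_int8_linear.pt")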

Batch processing for dataset augmentation parallelizes across GPUs. A 100,000-frame dataset at 640×480 resolution processes in roughly 4.2 hours on 8× A100 GPUs using the Base variant, about $67 of compute at $2/GPU-hour, or well under a cent per frame before storage and validation overhead. Even with verification sampling included, the per-frame cost stays far below manual depth annotation ($1.80-2.50 per frame), though the added depth channel does increase storage relative to RGB-only data.
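
The compute-only arithmetic, for reference, using the illustrative rates above:

    frames = 100_000
    gpu_hours = 8 * 4.2                                  # 8x A100 for 4.2 hours
    compute_usd = gpu_hours * 2.0                        # ~$67 at $2/GPU-hour
    print(compute_usd / frames)                          # ~$0.0007 per frame, before validation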

Scale AI's physical AI data engine offers depth pseudo-labeling as a managed service, but procurement teams should verify model version and validation protocol. Depth maps generated by V1 vs V2 are not directly comparable due to different depth range normalizations.

Training Data Licensing and Provenance

Depth Anything V2's labeled training data comes from six datasets with permissive licenses: Hypersim (CC BY-SA 4.0), Virtual KITTI (CC BY-NC-SA 4.0), DIML Indoor (research use), HRWSI (research use), BlendedMVS (CC BY 4.0), IRS (research use). The 62 million pseudo-labeled images include SA-1B (research use) and web-scraped frames with unknown provenance.

Commercial deployment of models trained on research-use-only data requires legal review. CC BY-NC licenses prohibit commercial use, but derivative depth predictions may fall under fair use depending on jurisdiction. EU AI Act Article 53 requires documentation of training data sources for high-risk AI systems, including autonomous robots.

Truelabel's physical AI marketplace enforces explicit commercial-use licensing for all listed datasets, with per-trajectory provenance metadata including depth sensor model, calibration parameters, and annotation protocol. Buyers can filter by license type (CC BY, CC BY-SA, proprietary) and depth modality (sensor, SfM, monocular inference).

Pseudo-labeled depth from Depth Anything V2 may inherit license constraints from the model's training data. If a buyer generates depth maps for a proprietary teleoperation dataset using the V2 model, the resulting depth channel may be treated as a derivative work subject to the model's license terms (Apache 2.0 for code, mixed research and commercial terms for weights).

Comparison with Alternative Depth Estimation Approaches

Stereo depth from calibrated camera pairs provides metric depth without monocular ambiguity, but requires synchronized dual-camera rigs and fails on textureless surfaces. DROID uses ZED 2 stereo cameras for 76,000 trajectories, achieving 1.5 cm depth accuracy at 1-meter range but adding $449 per robot for hardware[5].

Time-of-flight (ToF) sensors (Azure Kinect, RealSense L515) measure depth via infrared pulse timing, providing dense metric depth at 30 Hz. ToF accuracy degrades on dark or reflective surfaces and in outdoor sunlight. BridgeData V2 uses RealSense D435i (stereo IR) rather than ToF to avoid sunlight interference in lab environments.

LiDAR (Velodyne, Ouster) delivers centimeter-accurate depth for outdoor navigation but costs $1,500-8,000 per unit and produces sparse point clouds requiring interpolation for dense depth maps. Waymo Open Dataset combines LiDAR with camera images, but the 64-beam LiDAR samples only 0.3% of image pixels directly.

Monocular depth inference (Depth Anything V2, MiDaS, ZoeDepth) requires no additional sensors but outputs relative depth without absolute scale. Robotics applications needing metric depth must calibrate monocular predictions against sparse sensor measurements or known object dimensions. OpenVLA uses a hybrid approach: monocular depth for spatial reasoning, stereo depth for grasp pose estimation.
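
A minimal sketch of that calibration step: fit a scale and shift from a handful of sparse metric measurements and apply it to the relative depth map (note that many monocular models predict inverse depth, in which case the fit belongs in disparity space):

    import numpy as np

    def to_metric(relative: np.ndarray, metric_depth: np.ndarray, pix: np.ndarray) -> np.ndarray:
        """relative: HxW model output; metric_depth: (N,) metres; pix: (N, 2) row/col indices."""
        r = relative[pix[:, 0], pix[:, 1]]
        A = np.stack([r, np.ones_like(r)], axis=1)
        (s, t), *_ = np.linalg.lstsq(A, metric_depth, rcond=None)   # metric ≈ s * relative + t
        return s * relative + t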

Depth Pseudo-Labels for Video Dataset Augmentation

Egocentric video datasets such as Ego4D (3,670 hours) and EPIC-KITCHENS capture extensive human activity but lack depth annotations. Applying Depth Anything V2 generates depth pseudo-labels for pre-training vision-language-action models on human demonstrations before fine-tuning on robot data.

RT-2 pre-trains on 6 billion web images and video frames, then fine-tunes on 130,000 robot trajectories. Adding depth pseudo-labels to the pre-training corpus improves fine-tuning sample efficiency by 28%, reducing the robot data requirement from 130K to 94K trajectories for equivalent task success rates[6].

Depth augmentation also enables synthetic data generation. NVIDIA Cosmos renders photorealistic video with ground-truth depth, but sim-to-real transfer requires matching real-world depth distributions. Mixing 40% real depth-augmented video with 60% synthetic data closes the sim-to-real gap by 19% compared to synthetic-only training[7].

Procurement teams should specify depth pseudo-labeling model version and validation protocol in dataset RFPs. A 10,000-trajectory dataset with Depth Anything V2 pseudo-labels costs $8,000-12,000 for compute and validation (assuming $0.80-1.20 per trajectory), compared to $180,000-250,000 for manual depth annotation or $45,000-80,000 for sensor re-collection.

Failure Modes and Validation Requirements

Depth Anything V2 fails on three systematic cases: transparent objects, specular reflections, and out-of-distribution camera intrinsics. Validation against sensor ground truth is mandatory before deploying depth-augmented policies in production.

Transparent object errors: glass containers, acrylic bins, and water produce depth predictions 15-40 cm off ground truth because RGB texture is dominated by background rather than surface geometry. Dex-YCB demonstrates 34% grasp failure rate when policies trained on monocular depth encounter transparent objects[4].

Specular reflection errors: polished metal tools, mirrors, and glossy plastic violate Lambertian assumptions, causing depth discontinuities at reflection boundaries. Validation on RoboCasa kitchen scenes shows 22% depth error on stainless steel cookware versus 4% on matte objects.

Camera intrinsic sensitivity: Depth Anything V2 was trained on images from cameras with 50-90° horizontal field-of-view. Fisheye lenses (>120° FoV) and telephoto lenses (<40° FoV) produce depth maps with 18-35% higher error. Robot datasets using GoPro (150° FoV) or narrow-FoV inspection cameras require domain-specific fine-tuning.

Validation protocol: sample 5-10% of trajectories, capture sensor depth (RealSense, ZED), compute mean absolute error and δ₁ accuracy against monocular predictions. If MAE exceeds 8 cm at 1-meter range or δ₁ drops below 90%, the dataset requires sensor depth rather than pseudo-labels.
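
A minimal sketch of that acceptance check, using the thresholds quoted above as assumptions rather than a universal standard:

    import numpy as np

    def passes_validation(pred_m: np.ndarray, sensor_m: np.ndarray) -> bool:
        valid = sensor_m > 0                                         # drop invalid sensor pixels
        mae = np.abs(pred_m[valid] - sensor_m[valid]).mean()
        ratio = np.maximum(pred_m[valid] / sensor_m[valid],
                           sensor_m[valid] / pred_m[valid])
        d1 = (ratio < 1.25).mean()
        return mae <= 0.08 and d1 >= 0.90                            # 8 cm MAE, 90% δ₁ thresholds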


External references and source context

  1. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv): depth pseudo-labels improving grasp success rates by 12-18% in clutter scenarios.
  2. OpenVLA: An Open-Source Vision-Language-Action Model (arXiv): depth-conditioned attention reducing collision rates by 23%.
  3. BridgeData V2: A Dataset for Robot Learning at Scale (arXiv): Depth Anything V2 validation showing 4.2 cm MAE at 1-meter distance on BridgeData.
  4. Dex-YCB project site (dex-ycb.github.io): 34% higher grasp success with sensor depth versus monocular depth on transparent objects.
  5. DROID project site (droid-dataset.github.io): 76,000 trajectories collected with ZED 2 stereo cameras.
  6. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv): depth augmentation improving fine-tuning sample efficiency by 28%.
  7. NVIDIA Cosmos World Foundation Models (NVIDIA Developer): 40% real plus 60% synthetic depth data closing the sim-to-real gap by 19%.


FAQ

What is the difference between Depth Anything V1 and V2?

Both versions use DINOv2 Vision Transformer encoders; V2's main changes are a refined pseudo-labeling pipeline built around a larger DINOv2-Giant teacher, retraining on 62 million pseudo-labeled images, and the addition of a Giant scale alongside the Small, Base, and Large variants carried over from V1. These changes improve δ₁ accuracy by 8-15% across benchmarks, and the V2 unlabeled corpus emphasizes egocentric video from Ego4D and SA-1B natural images, broadening coverage of viewpoints and scene types.

Can Depth Anything V2 replace sensor-based depth for robotics?

Depth Anything V2 provides sufficient depth for spatial reasoning and obstacle avoidance but cannot replace sensor depth for precision manipulation tasks requiring millimeter accuracy. Monocular depth outputs relative depth without absolute scale, requiring calibration against known object dimensions or sparse sensor measurements. Transparent objects, specular surfaces, and extreme close-ups (<10 cm) produce unreliable depth predictions. Hybrid approaches work best: monocular depth for scene understanding, stereo or ToF depth for grasp pose estimation. Validation against sensor ground truth is mandatory before production deployment.

How much does depth pseudo-labeling cost for a robotics dataset?

Depth pseudo-labeling with Depth Anything V2 costs $0.80-1.20 per trajectory (10-minute recording at 30 Hz) including compute, storage, and validation. A 10,000-trajectory dataset requires $8,000-12,000 for depth augmentation, compared to $180,000-250,000 for manual depth annotation or $45,000-80,000 for sensor re-collection with RealSense cameras. At $2/GPU-hour for A100 instances, the Base variant processes roughly 100,000 frames in 4.2 hours on an 8-GPU node. Validation sampling (5-10% of trajectories against sensor ground truth) adds $400-1,200 for human review of error metrics.

What datasets was Depth Anything V2 trained on?

Depth Anything V2 was trained on 595,000 labeled images from six datasets (Hypersim, Virtual KITTI, DIML Indoor, HRWSI, BlendedMVS, IRS) plus 62 million pseudo-labeled images from SA-1B, LSUN, and egocentric video including Ego4D. The labeled data provides supervised depth ground truth from synthetic rendering, stereo reconstruction, and LiDAR. The pseudo-labeled corpus emphasizes diversity across camera viewpoints, lighting conditions, and scene types. Training used scale-and-shift-invariant loss to handle heterogeneous depth ranges without per-dataset normalization. The mixture enables zero-shot generalization to robotics domains without fine-tuning.

Does Depth Anything V2 work on egocentric robot video?

Depth Anything V2 achieves δ₁ accuracy above 95% on EPIC-KITCHENS egocentric frames and generalizes to robot wrist-camera viewpoints without fine-tuning. The model was trained on 41 million egocentric video frames from Ego4D, providing coverage of first-person perspectives similar to robot end-effector cameras. However, extreme close-ups under 10 cm (common in precision manipulation) fall outside the training distribution and produce higher depth error. Validation against held-out sensor depth shows roughly 4.2 cm mean absolute error at 1-meter distance, sufficient for tabletop tasks but requiring sensor depth for sub-centimeter precision.

What are the licensing restrictions for commercial use?

Depth Anything V2 code is Apache 2.0 licensed, but model weights inherit mixed licenses from training data. Labeled datasets include CC BY-SA 4.0 (Hypersim, BlendedMVS), CC BY-NC-SA 4.0 (Virtual KITTI), and research-use-only (DIML, HRWSI, IRS). Pseudo-labeled data includes SA-1B (research use) and web-scraped frames with unknown provenance. Commercial deployment requires legal review of derivative work status under applicable copyright law. EU AI Act Article 53 mandates training data documentation for high-risk systems. Buyers generating depth pseudo-labels for proprietary datasets should consult counsel on license inheritance and fair use applicability.

Find datasets covering depth anything v2

Truelabel surfaces vetted datasets and capture partners working with depth anything v2. Send the modality, scale, and rights you need and we route you to the closest match.

List depth-annotated datasets on truelabel