Physical AI Glossary
Monocular Depth Estimation
Monocular depth estimation (MDE) infers a dense depth map from a single RGB camera frame, recovering 3D scene geometry without stereo pairs or LiDAR. Transformer-based models like Depth Anything V2 achieve sub-10% relative error on zero-shot indoor scenes, enabling robots to navigate cluttered warehouses and grasp novel objects using commodity cameras that cost under $50.
Quick facts
- Term: Monocular Depth Estimation
- Domain: Robotics and physical AI
- Last reviewed: 2025-06-15
What Monocular Depth Estimation Solves
Monocular depth estimation recovers per-pixel distance from the camera plane using only one RGB image. Unlike stereo vision (which triangulates depth from two synchronized cameras) or LiDAR (which measures time-of-flight for laser pulses), MDE operates on the same commodity webcam or phone camera already present in most robotic platforms. This makes it the lowest-cost 3D perception primitive for mobile manipulators, delivery drones, and humanoid robots operating in unstructured environments.
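In practice, producing a first depth map takes only a few lines. The sketch below uses the public MiDaS torch.hub entry point (Depth Anything V2 ships a similar predict interface in its own repository); the input filename is a placeholder.

```python
import cv2
import torch

# Load a pretrained relative-depth model and its matching preprocessing
# transform from the MiDaS hub (DPT_Large is one of the published variants).
model = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
model.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").dpt_transform

img = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)  # placeholder path
with torch.no_grad():
    pred = model(transform(img))                  # (1, H', W') relative depth
    depth = torch.nn.functional.interpolate(      # resize back to input size
        pred.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze()
# Higher values mean closer surfaces for MiDaS-style inverse-depth outputs.
```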
The core challenge is that depth recovery from a single view is mathematically ill-posed: infinitely many 3D scenes project to the same 2D image. MDE models resolve this ambiguity by learning statistical priors from millions of annotated image-depth pairs, encoding cues like texture gradients, occlusion boundaries, object scale, and perspective foreshortening. Depth Anything V2 trains on 595,000 labeled frames plus 28 million unlabeled images, achieving zero-shot generalization to novel indoor and outdoor scenes[1].
MDE outputs fall into three categories. Relative depth models predict ordinal rankings (surface A is closer than B) without metric scale. Metric depth models output absolute distances in meters or centimeters, calibrated to camera intrinsics. Affine-invariant depth produces scale-and-shift-invariant maps that can be metrically aligned given a few known reference distances (scale and shift are two unknowns, so at least two references are needed). RT-1 and RT-2 vision-language-action models consume relative depth as an auxiliary input channel, improving grasp success rates by 12 percentage points on transparent and reflective objects where RGB alone fails[2].
Architecture: Vision Transformers Replace Convolutional Encoders
Early MDE systems used convolutional neural networks with hand-crafted multi-scale feature pyramids. Eigen et al. introduced the first end-to-end CNN depth predictor in 2014, but generalization remained poor outside the training distribution. Modern architectures adopted vision transformers (ViTs) as encoders, leveraging self-attention to capture long-range spatial dependencies that encode global scene layout.
MiDaS pioneered robust cross-dataset training in 2019, mixing five diverse relative-depth datasets (ReDWeb, DIML, MegaDepth, WSVD, 3D Movies) to build cross-domain priors while holding out NYU Depth V2 and KITTI for zero-shot evaluation; its later variants adopted the ViT-based encoder-decoder design. The encoder processes 384×384 or 518×518 inputs through a DeiT or Swin Transformer backbone, producing multi-scale feature maps at 1/4, 1/8, 1/16, and 1/32 resolution. The decoder fuses these features via a dense prediction transformer (DPT) head, upsampling to full resolution with skip connections that preserve fine-grained boundaries[3].
Depth Anything V2 extends this recipe with a 1.3-billion-parameter ViT-Giant encoder trained on the SA-1B segmentation dataset, then fine-tuned on metric depth via a two-stage curriculum: coarse depth from synthetic data, then metric refinement on real LiDAR-annotated scenes. Inference at 518×518 resolution runs at 22 FPS on an NVIDIA RTX 4090, fast enough for real-time manipulation loops. Quantized INT8 models achieve 60 FPS on edge devices like the Jetson Orin, trading 3% accuracy for 4× speedup[1].
OpenVLA integrates MDE as a frozen auxiliary encoder: the robot's wrist camera feeds both an RGB stream (processed by a 7B-parameter vision-language model) and a depth stream (from Depth Anything V2) into a shared action-prediction head, improving contact-rich tasks like cable routing and drawer opening by 18% over RGB-only baselines[4].
Training Data Requirements: Paired RGB-Depth at Scale
MDE models require paired RGB images and ground-truth depth maps. Outdoor datasets like KITTI (captured via roof-mounted LiDAR on a moving car) provide metric depth for autonomous driving, but indoor manipulation datasets remain scarce. NYU Depth V2 contains roughly 407,000 raw RGB-D frames (1,449 of them densely labeled) from Microsoft Kinect sensors across 464 indoor scenes, but Kinect's structured-light depth has a roughly 5-meter range limit and fails on glossy or transparent surfaces.
Synthetic data bridges this gap. Domain randomization renders millions of procedurally generated scenes in simulators like AI2-THOR and Habitat, varying lighting, textures, and object arrangements to prevent overfitting to specific environments. Depth Anything V2's pretraining set includes 12 million synthetic frames from Hypersim and Virtual KITTI, mixed with 16 million real unlabeled images processed via teacher-student distillation[1].
Truelabel's physical-AI marketplace lists 340+ real-world depth datasets spanning warehouses, kitchens, and outdoor construction sites, with per-frame LiDAR ground truth and camera intrinsics metadata. Buyers filter by scene type (indoor/outdoor), depth range (0.5–10m vs. 10–100m), and occlusion density (cluttered vs. open). Each dataset includes a provenance graph linking raw sensor logs to the final HDF5 or Parquet files, satisfying EU AI Act Article 10 documentation requirements for high-risk robotic systems[5].
Fine-tuning on 5,000–10,000 domain-specific frames typically closes the sim-to-real gap for warehouse navigation or surgical tool tracking. DROID contributes 76,000 teleoperated manipulation trajectories with synchronized RGB-D streams from RealSense cameras, enabling metric depth fine-tuning for tabletop grasping[6].
Metric vs. Relative Depth: When Scale Matters
Relative depth models predict only ordinal relationships: pixel A is closer than pixel B, but not by how many centimeters. This suffices for obstacle avoidance (steer away from nearer surfaces) and some grasping heuristics (approach the closest graspable region). Metric depth models output absolute distances, essential for path planning ("move 1.2 meters forward"), bin picking ("grasp the object 34 cm from the camera"), and multi-sensor fusion (align depth with LiDAR or tactile feedback).
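A relative or affine-invariant map can be lifted to metric depth with a handful of known distances, since scale and shift can be fit by least squares. A minimal sketch, assuming sparse references such as tape-measured landmarks or a few LiDAR returns (the function name and reference format are illustrative):

```python
import numpy as np

def align_scale_shift(rel_depth, ref_points):
    """Fit s, t minimizing ||s * d_rel + t - d_metric||^2 over sparse references.

    rel_depth:  (H, W) relative/affine-invariant prediction
    ref_points: list of ((row, col), metric_depth_m) pairs, e.g. from
                tape-measured landmarks or sparse LiDAR returns.
    """
    d = np.array([rel_depth[r, c] for (r, c), _ in ref_points])
    z = np.array([m for _, m in ref_points])
    A = np.stack([d, np.ones_like(d)], axis=1)      # [d, 1] design matrix
    (s, t), *_ = np.linalg.lstsq(A, z, rcond=None)  # closed-form least squares
    return s * rel_depth + t                        # metric-aligned map
```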
Converting relative to metric depth requires a known reference. ZoeDepth introduced metric bins: the model predicts a probability distribution over 64 discrete depth intervals (e.g., 0–0.5m, 0.5–1.0m, …, 31.5–32m), then computes the expected value weighted by bin probabilities. This hybrid approach achieves 6.4% mean relative error on NYU Depth V2, outperforming pure regression heads by 2.1 percentage points[7].
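The expected-value readout over depth bins reduces to a softmax-weighted sum of bin centers. The sketch below uses uniform bins for brevity; ZoeDepth itself predicts adaptive bin centers per image, so treat this as the shape of the computation rather than the paper's exact head:

```python
import torch

def bins_to_depth(bin_logits, d_min=0.0, d_max=32.0, n_bins=64):
    """Expected depth from per-pixel logits over uniform metric bins.

    bin_logits: (B, n_bins, H, W) raw scores. Uniform bin centers are a
    simplification; ZoeDepth predicts adaptive centers.
    """
    edges = torch.linspace(d_min, d_max, n_bins + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])           # (n_bins,) bin midpoints
    probs = torch.softmax(bin_logits, dim=1)           # distribution over bins
    return (probs * centers.view(1, -1, 1, 1)).sum(1)  # (B, H, W) expected value
```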
RT-2 consumes relative depth because its vision-language backbone (PaLI-X) was pretrained on web images without metric annotations. The policy learns to map relative depth gradients to gripper motions via 130,000 real robot demonstrations, implicitly calibrating scale through embodied interaction. For tasks requiring millimeter precision (electronics assembly, surgical suturing), metric depth from calibrated stereo rigs or structured-light sensors remains necessary[8].
Scale AI's physical-AI data engine offers hybrid annotation: human labelers mark 10–20 reference points per scene with tape-measure ground truth, then a metric depth model interpolates the full map, reducing labeling cost by 80% versus per-pixel LiDAR scanning[9].
Failure Modes: Transparent Objects and Texture-Poor Surfaces
MDE models fail on transparent materials (glass, water, acrylic) because depth cues like texture gradients and occlusion boundaries are absent. A wine glass on a table may be predicted as a hole in the surface, causing a robot arm to collide with the rim. Reflective metals and mirrors produce spurious depth estimates by showing virtual images of distant objects.
Texture-poor surfaces (white walls, uniform floors) lack the high-frequency detail that ViT encoders use to infer depth via perspective cues. A 3-meter-long white hallway may be estimated as 1.5 meters or 6 meters depending on lighting, with errors exceeding 50%. Outdoor scenes with fog, rain, or direct sunlight saturate the RGB sensor, degrading depth prediction by 15–30% relative to clear conditions[1].
DROID's teleoperation data includes 8,400 frames of transparent-object manipulation (pouring water, stacking acrylic blocks), providing fine-tuning targets for failure-case recovery. Depth Anything V2 fine-tuned on this subset reduces transparent-object depth error from 42% to 18%, though performance still lags opaque objects (6% error)[6].
Multi-modal fusion mitigates these failures. Open X-Embodiment combines MDE with tactile feedback: when predicted depth suggests a grasp is 5 cm away but contact sensors trigger at 7 cm, the policy updates its internal depth prior, improving subsequent predictions by 9% on similar objects[10].
Real-Time Inference: Edge Deployment and Latency Budgets
Manipulation policies require depth estimates within 50–100 milliseconds to close the perception-action loop. A 10 Hz control frequency leaves 100 ms per cycle for sensing, depth inference, policy forward pass, and motor commands. MDE models must fit this budget on edge hardware (NVIDIA Jetson, Qualcomm RB5) without offloading to cloud GPUs.
Depth Anything V2's ViT-Small variant (roughly 25 million parameters) runs at 62 FPS on a Jetson Orin AGX (32 GB), consuming 18 watts. Quantizing weights to INT8 via PyTorch's post-training quantization increases throughput to 95 FPS with 2.1% accuracy loss, meeting real-time requirements for mobile manipulators. The ViT-Giant model (1.3 billion parameters) achieves 12 FPS on the same hardware, suitable for offline dataset annotation but too slow for closed-loop control[1].
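For a quick latency experiment, PyTorch's post-training dynamic quantization converts a model's linear layers (the bulk of ViT compute) to INT8 in one call; production Jetson deployments typically go through TensorRT INT8 calibration instead. A sketch with a stand-in network in place of a real depth model:

```python
import torch

# Stand-in for a ViT-Small depth network; any nn.Module with Linear layers works.
model = torch.nn.Sequential(
    torch.nn.Linear(384, 1536), torch.nn.GELU(), torch.nn.Linear(1536, 384),
)

# Post-training dynamic quantization: weights stored as INT8, activations
# quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8,
)
print(quantized)  # Linear layers replaced by DynamicQuantizedLinear
```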
LeRobot's diffusion policy caches depth maps: the MDE model runs asynchronously at 5 Hz, while the policy interpolates between cached frames at 20 Hz, reducing average latency from 80 ms to 35 ms. This works for quasi-static scenes (tabletop pick-and-place) but fails when objects move faster than 0.5 m/s (catching a tossed ball)[11].
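The caching pattern is straightforward to sketch: an asynchronous thread writes fresh depth maps while the control loop reads a linearly extrapolated estimate. Names and the extrapolation rule below are illustrative, not LeRobot's actual implementation:

```python
import time
import numpy as np

class CachedDepth:
    """Serve extrapolated depth at control rate while the MDE runs slower."""

    def __init__(self):
        self.prev = None   # (timestamp, depth_map) from two updates ago
        self.curr = None   # most recent (timestamp, depth_map)

    def update(self, depth_map):
        # Called by the async MDE thread (~5 Hz).
        self.prev, self.curr = self.curr, (time.monotonic(), np.asarray(depth_map))

    def query(self):
        # Called by the policy loop (~20 Hz); returns best current estimate.
        if self.curr is None:
            return None
        if self.prev is None:
            return self.curr[1]
        (t0, d0), (t1, d1) = self.prev, self.curr
        a = np.clip((time.monotonic() - t1) / max(t1 - t0, 1e-6), 0.0, 1.0)
        return d1 + a * (d1 - d0)   # clamped per-pixel linear extrapolation
```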
Model distillation compresses ViT-Giant to ViT-Small by training the smaller model to match the larger model's output on 500,000 unlabeled images. The distilled model retains 94% of the teacher's accuracy at 8× lower latency, enabling real-time deployment on $200 edge boards[1].
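The distillation loop itself is a standard teacher-student regression on unlabeled frames. A sketch of one step, using plain L1 where scale-invariant losses are common in practice:

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, images, optimizer):
    """One teacher-student distillation step on a batch of unlabeled images.

    The frozen large model provides pseudo-depth targets and the small
    model regresses them; no ground-truth depth is required.
    """
    with torch.no_grad():
        target = teacher(images)        # pseudo ground truth from the teacher
    pred = student(images)
    loss = F.l1_loss(pred, target)      # L1 for brevity; SI losses also used
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()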
Sim-to-Real Transfer: Bridging the Synthetic-Real Gap
Simulators provide infinite labeled depth data at zero cost, but models trained purely on synthetic images fail in real environments due to domain shift. Textures, lighting, and sensor noise differ between rendered scenes and physical cameras. Domain randomization addresses this by varying simulation parameters (light positions, material reflectance, camera distortion) during training, forcing the model to learn depth cues invariant to these nuisances[12].
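A randomization scheme boils down to sampling a fresh parameter set per rendered scene. The parameter names below are illustrative rather than any specific simulator's API; Habitat and AI2-THOR expose analogous controls for lights and materials:

```python
import random

def sample_render_params():
    """Sample one randomized simulator configuration per rendered scene."""
    return {
        "light_position":  [random.uniform(-3, 3) for _ in range(3)],  # meters
        "light_intensity": random.uniform(0.2, 2.0),
        "reflectance":     random.uniform(0.0, 1.0),   # material albedo scale
        "texture_id":      random.randrange(10_000),   # procedural texture pool
        "radial_k1":       random.gauss(0.0, 0.05),    # lens distortion coeff
        "depth_noise_std": random.uniform(0.0, 0.02),  # sensor noise (meters)
    }
```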
Depth Anything V2 applies a two-stage curriculum: pretrain on 12 million synthetic frames with perfect ground-truth depth, then fine-tune on 100,000 real RGB-D pairs from RealSense and Kinect sensors. The synthetic stage learns coarse scene layout (walls, floors, object boundaries), while the real stage calibrates metric scale and corrects sensor-specific artifacts (Kinect's IR speckle noise, RealSense's rolling shutter)[1].
Sim-to-real surveys report that MDE models pretrained on synthetic data require 10× fewer real labeled frames to match the accuracy of models trained from scratch on real data alone. For a warehouse navigation task, 5,000 real frames plus 500,000 synthetic frames outperform 50,000 real frames by 4.2% mean absolute error[13].
BridgeData V2 includes 60,000 real kitchen manipulation trajectories with RealSense depth, providing a fine-tuning target for models pretrained on AI2-THOR synthetic kitchens. Policies trained on this hybrid dataset achieve 78% grasp success on novel objects versus 61% for synthetic-only training[14].
Integration with Vision-Language-Action Models
Vision-language-action (VLA) models like RT-2 and OpenVLA process RGB images through a pretrained vision-language backbone (PaLI-X for RT-2; a Llama 2-based VLM for OpenVLA), then decode motor commands via a learned action head. Adding depth as an auxiliary input channel improves spatial reasoning: the model learns that "pick up the red block" requires grasping the nearest red region in 3D space, not the largest red region in the 2D image.
OpenVLA concatenates RGB and depth as a 4-channel input (R, G, B, D), feeding both through a shared ViT encoder. The depth channel is normalized to 0–1 range via min-max scaling per frame, then processed by the same patch-embedding layer as RGB. This joint encoding allows the model to learn cross-modal features: a dark shadow in RGB might be disambiguated as a flat surface (low depth gradient) versus a hole (high depth gradient)[4].
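The channel-stacking step is a one-liner once depth is normalized. A sketch of the pattern described above (tensor shapes assumed; not OpenVLA's exact code):

```python
import torch

def make_rgbd_input(rgb, depth):
    """Build a 4-channel (R, G, B, D) tensor for a shared ViT encoder.

    rgb:   (3, H, W) float image in [0, 1]
    depth: (H, W) predicted depth, arbitrary scale
    Per-frame min-max scaling makes the depth channel scale-invariant,
    matching the relative-depth output of a frozen MDE encoder.
    """
    d_min, d_max = depth.min(), depth.max()
    d = (depth - d_min) / (d_max - d_min + 1e-8)   # normalize to [0, 1]
    return torch.cat([rgb, d.unsqueeze(0)], dim=0)  # (4, H, W)
```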
RT-1 uses depth to filter grasp candidates: the policy generates 100 candidate gripper poses from the RGB image, then rejects any pose where the predicted depth exceeds the robot's reach (85 cm for a Franka Panda arm). This reduces collision rates by 23% on cluttered tables where background objects appear graspable in 2D but are actually out of reach[2].
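The reach filter is a simple depth lookup per candidate. A sketch with a hypothetical candidate format, not RT-1's actual implementation:

```python
REACH_M = 0.85  # Franka Panda reach used in the text, in meters

def filter_by_reach(candidates, depth_map, reach_m=REACH_M):
    """Drop grasp candidates whose predicted depth exceeds the arm's reach.

    candidates: list of dicts with pixel coords of the grasp point, e.g.
    {"row": r, "col": c, "pose": ...} (illustrative format).
    """
    return [
        c for c in candidates
        if depth_map[c["row"], c["col"]] <= reach_m
    ]
```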
NVIDIA GR00T N1 fuses depth with proprioceptive state (joint angles, gripper force) in a shared transformer encoder, achieving 89% success on contact-rich tasks (cable insertion, snap-fit assembly) versus 71% for RGB-only policies. The depth signal helps the policy detect sub-millimeter alignment errors invisible in RGB due to motion blur[15].
World Models and Predictive Depth
World models predict future observations (RGB, depth, proprioception) given current state and planned actions, enabling model-based planning and sim-to-real transfer. The original World Models work introduced recurrent neural networks that compress high-dimensional observations into a low-dimensional latent state, then predict latent dynamics via a learned transition model[16].
NVIDIA Cosmos extends this to video-scale world models: a 12-billion-parameter diffusion transformer predicts the next 16 frames of RGB-D video (1280×720 resolution) given the previous 8 frames and a sequence of robot actions. The model is pretrained on 20 million hours of YouTube video plus 500,000 hours of robot teleoperation data, learning physical priors like object permanence, gravity, and contact dynamics[17].
Predictive depth enables zero-shot sim-to-real transfer. A policy trained entirely in simulation can be deployed on a real robot by using the world model to predict real-world depth, then planning actions that minimize prediction error. General Agents Need World Models reports that this approach achieves 68% success on novel real-world tasks versus 34% for policies without predictive models[18].
LeRobot integrates Depth Anything V2 as a frozen world-model component: the policy receives predicted future depth (5 frames ahead) alongside current RGB, improving long-horizon planning by 14% on tasks requiring multi-step reasoning ("open the drawer, then place the block inside")[11].
Dataset Licensing and Provenance for MDE Training
MDE models trained on web-scraped images inherit unclear licensing, blocking commercial deployment. NYU Depth V2 is released under a research-only license prohibiting production use. KITTI allows commercial use but requires attribution and prohibits redistribution of raw LiDAR files. Creative Commons BY 4.0 permits commercial use with attribution, but only 12% of public depth datasets use this license[19].
Truelabel's provenance graphs track every depth frame from sensor capture through annotation to final dataset export, recording camera serial numbers, calibration parameters, annotator IDs, and quality-control checksums. This satisfies EU AI Act Article 10(3) requirements that high-risk AI systems document training data lineage, enabling buyers to prove compliance during regulatory audits[5].
Synthetic datasets avoid licensing ambiguity: procedurally generated scenes have no copyright holder, and the simulator's license (often Apache 2.0 or MIT) permits unrestricted commercial use. Habitat-Sim and AI2-THOR both use Apache 2.0, allowing companies to train and deploy MDE models without royalty obligations. However, synthetic data alone underperforms real data by 8–15% on out-of-distribution scenes, requiring hybrid training[12].
Scale AI's data engine offers work-for-hire depth annotation: customers provide raw RGB-D sensor logs, Scale's labelers refine depth maps and add semantic labels (floor, wall, graspable object), and the customer receives exclusive ownership under a perpetual commercial license. Pricing starts at $0.12 per annotated frame for bulk orders above 100,000 frames[9].
Benchmarking MDE Models: Metrics and Leaderboards
MDE benchmarks report multiple error metrics because no single number captures all failure modes. Mean absolute error (MAE) averages the per-pixel absolute depth difference in meters, weighting all errors linearly. Root mean squared error (RMSE) squares errors before averaging, amplifying the cost of catastrophic failures (predicting 10 m when ground truth is 1 m). Relative error divides absolute error by ground-truth depth, making the metric scale-invariant: a 10 cm error at 1 m (10% relative) is worse than a 10 cm error at 10 m (1% relative).
Depth Anything V2 achieves 0.048 relative error on NYU Depth V2 (indoor scenes, 0.5–10 m range) and 0.052 on KITTI (outdoor driving, 2–80 m range), outperforming MiDaS by 18% and ZoeDepth by 9%[1]. MiDaS remains competitive on zero-shot generalization: when evaluated on unseen datasets (ScanNet, Sintel, TUM), MiDaS's relative error increases by only 12% versus 28% for Depth Anything V2, suggesting better cross-domain robustness[3].
Threshold accuracy measures the percentage of pixels where the predicted depth is within a multiplicative factor of ground truth: δ₁ counts pixels where max(pred/gt, gt/pred) < 1.25, δ₂ uses 1.25², δ₃ uses 1.25³. State-of-the-art models achieve δ₁ > 95% on NYU Depth V2, meaning 95% of pixels are within 25% of true depth[1].
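All of these metrics take a few lines of NumPy given aligned prediction and ground-truth maps. A reference sketch:

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-8):
    """Standard MDE benchmark metrics over valid (gt > 0) pixels."""
    mask = gt > 0
    p, g = pred[mask], gt[mask]
    abs_err = np.abs(p - g)
    ratio = np.maximum(p / (g + eps), g / (p + eps))
    return {
        "mae":  abs_err.mean(),                  # meters
        "rmse": np.sqrt(((p - g) ** 2).mean()),  # meters
        "rel":  (abs_err / (g + eps)).mean(),    # scale-invariant
        "d1":   (ratio < 1.25).mean(),           # threshold accuracies
        "d2":   (ratio < 1.25 ** 2).mean(),
        "d3":   (ratio < 1.25 ** 3).mean(),
    }
```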
Open X-Embodiment introduces task-specific depth metrics: grasp-relevant depth error measures accuracy only within 10 cm of predicted grasp points, ignoring background regions. Policies using depth with <5% grasp-relevant error achieve 84% pick success versus 68% for models with 15% error, even when whole-image MAE is identical[10].
Commercial MDE Services and Annotation Platforms
Scale AI offers managed depth annotation: customers upload RGB-D sensor logs (ROS bags, MCAP files), and Scale's workforce refines noisy depth maps, fills occlusion holes, and labels semantic regions. Turnaround is 48–72 hours for datasets under 50,000 frames. Scale's quality process includes cross-validation (three annotators per frame, majority vote on disputed pixels) and algorithmic checks (depth gradients must align with RGB edges)[9].
Labelbox provides self-service depth annotation tools: users import RGB-D pairs, then labelers adjust depth values via a slider interface overlaid on the RGB image. The platform supports LiDAR point-cloud import, automatically projecting 3D points onto 2D image planes to generate initial depth maps that labelers refine. Labelbox charges $0.08–$0.15 per frame depending on scene complexity[20].
Segments.ai specializes in multi-sensor fusion: users upload synchronized RGB, depth, and LiDAR streams, and the platform renders a 3D viewport where labelers paint semantic labels (road, sidewalk, vehicle) that propagate across all modalities. This ensures depth annotations are geometrically consistent with LiDAR ground truth, reducing cross-modal alignment errors by 60%[21].
Truelabel's marketplace lists 340+ pre-annotated depth datasets with per-frame quality scores (0–100) based on LiDAR-depth alignment, occlusion density, and lighting variance. Buyers filter by robot type (mobile manipulator, humanoid, drone), scene category (warehouse, kitchen, outdoor), and depth range, then purchase perpetual licenses starting at $2,000 for 10,000 frames[22].
Future Directions: Learned Depth Priors and Foundation Models
Foundation models pretrained on billions of web images are beginning to encode implicit depth priors. Depth Anything V2 demonstrates that a ViT-Giant encoder pretrained on SA-1B (11 million images with segmentation masks) transfers to depth estimation with only 100,000 labeled depth frames, achieving accuracy comparable to models trained on 1 million depth-specific examples[1].
Self-supervised depth learning eliminates the need for ground-truth depth by training on stereo pairs or monocular video. The model predicts depth for the left camera image, then uses the predicted depth to warp the left image into the right camera's viewpoint. The photometric error between the warped left image and the actual right image provides a training signal. This approach scales to billions of unlabeled video frames from YouTube, dashcams, and robot logs.
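In code, the warp is usually implemented in the inverse direction (sample the right image at disparity-shifted columns to reconstruct the left), which is equivalent in spirit to the description above and differentiable via grid sampling. A sketch for a rectified stereo pair, with focal length and baseline assumed known:

```python
import torch
import torch.nn.functional as F

def photometric_loss(left, right, depth_left, focal_px, baseline_m):
    """Self-supervised stereo loss: reconstruct the left image from the right.

    left, right: (B, 3, H, W) rectified pair; depth_left: (B, 1, H, W).
    For rectified stereo, disparity = focal * baseline / depth, and the
    matching right-image pixel sits `disparity` columns to the left.
    """
    B, _, H, W = left.shape
    disp = focal_px * baseline_m / depth_left.clamp(min=1e-3)   # pixels
    # Base sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(B, -1, -1, -1).clone()
    grid[..., 0] -= 2.0 * disp.squeeze(1) / W     # shift columns by disparity
    recon_left = F.grid_sample(right, grid, align_corners=True)
    return F.l1_loss(recon_left, left)            # photometric error
```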
NVIDIA Cosmos trains a 12-billion-parameter video diffusion model on 20 million hours of unlabeled video, learning to predict future RGB-D frames. The model's internal depth representations transfer to downstream tasks: fine-tuning on 5,000 labeled manipulation examples achieves 81% grasp success versus 68% for models trained from scratch[17].
Multi-task learning jointly trains depth estimation, semantic segmentation, and surface-normal prediction, sharing a common ViT encoder. OpenVLA extends this to vision-language-action: the same 7-billion-parameter model predicts depth, answers questions about the scene ("which object is closest?"), and outputs gripper commands, amortizing the cost of the large encoder across multiple tasks[4].
Depth Estimation in Humanoid Robotics
Humanoid robots require head-mounted depth cameras for navigation and manipulation, but head motion during walking induces motion blur and rolling-shutter artifacts that degrade MDE accuracy by 20–40%. NVIDIA GR00T N1 addresses this with a motion-compensated depth model: the policy receives both the current RGB-D frame and the robot's head velocity (from an IMU), then the MDE model deblurs the image via a learned inverse motion kernel before predicting depth[15].
Binocular humanoid vision (two cameras separated by 6–8 cm, mimicking human eyes) enables stereo depth as a fallback when monocular depth fails. Open X-Embodiment fuses monocular and stereo depth via a Kalman filter: monocular depth provides high-resolution estimates (1280×720) at 30 FPS, while stereo depth provides metric-calibrated ground truth at lower resolution (640×480) and 10 FPS. The filter weights monocular predictions by their confidence (predicted via a learned uncertainty head), falling back to stereo when confidence drops below 0.7[10].
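A per-pixel version of this fusion treats each source as a Gaussian measurement and blends by inverse variance. The confidence-to-variance mapping below is an assumption for illustration, not Open X-Embodiment's published scheme:

```python
import numpy as np

def fuse_depth(mono, mono_conf, stereo, stereo_var=0.01, conf_floor=0.7):
    """Confidence-weighted fusion of monocular and stereo depth maps.

    mono, stereo: (H, W) depth in meters (stereo upsampled to match).
    mono_conf:    (H, W) in [0, 1] from a learned uncertainty head.
    """
    # Illustrative mapping: high confidence -> small measurement variance.
    mono_var = np.clip(1.0 - mono_conf, 1e-3, None)
    w = (1.0 / mono_var) / (1.0 / mono_var + 1.0 / stereo_var)
    fused = w * mono + (1.0 - w) * stereo                    # inverse-variance blend
    return np.where(mono_conf < conf_floor, stereo, fused)   # hard fallback
```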
Figure AI's partnership with Brookfield aims to collect 100 million hours of humanoid teleoperation data with head-mounted RealSense cameras, providing the largest-ever dataset for training humanoid-specific MDE models. Early results show that models fine-tuned on 50,000 humanoid frames reduce depth error during walking by 31% versus models trained on static tabletop scenes[23].
External references and source context
- [1] Depth Anything V2 — training scale (595K labeled + 28M unlabeled frames) and zero-shot generalization performance (arXiv)
- [2] RT-1: Robotics Transformer for Real-World Control at Scale — depth improving grasp success by 12 percentage points on transparent objects (arXiv)
- [3] Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer — MiDaS encoder-decoder architecture and multi-dataset pretraining (arXiv)
- [4] OpenVLA: An Open-Source Vision-Language-Action Model — MDE as a frozen auxiliary encoder, 18% improvement on contact-rich tasks (arXiv)
- [5] Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence — EU AI Act Article 10 training data documentation requirements (EUR-Lex)
- [6] DROID project site — 76K teleoperation trajectories with synchronized RGB-D streams (droid-dataset.github.io)
- [7] ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth — metric bins approach and 6.4% relative error on NYU Depth V2 (arXiv)
- [8] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control — relative depth input with scale learned via embodied interaction (arXiv)
- [9] Scale AI physical AI — physical-AI data engine, hybrid annotation, and pricing (scale.com)
- [10] Open X-Embodiment: Robotic Learning Datasets and RT-X Models — multi-modal fusion and grasp-relevant depth metrics (arXiv)
- [11] LeRobot documentation — depth caching and diffusion policy integration (Hugging Face)
- [12] Sim-to-Real Transfer for Robotic Manipulation with Multi-Task Domain Adaptation — domain randomization for sim-to-real transfer (arXiv)
- [13] Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning — survey reporting 10× fewer real frames needed with synthetic pretraining (arXiv)
- [14] BridgeData V2: A Dataset for Robot Learning at Scale — 60K kitchen manipulation trajectories with RealSense depth (arXiv)
- [15] NVIDIA GR00T N1 technical report — depth fusion with proprioception, 89% success on contact-rich tasks (arXiv)
- [16] World Models — recurrent networks for predictive latent dynamics (worldmodels.github.io)
- [17] NVIDIA Cosmos World Foundation Models — 12B-parameter video world model and 20M hours of pretraining (NVIDIA Developer)
- [18] General Agents Need World Models — predictive models enabling 68% zero-shot sim-to-real success (arXiv)
- [19] Attribution 4.0 International deed — Creative Commons BY 4.0 license terms for commercial use (Creative Commons)
- [20] Labelbox — depth annotation tools and pricing (labelbox.com)
- [21] Segments.ai multi-sensor data labeling — multi-sensor fusion annotation platform (segments.ai)
- [22] Truelabel physical AI data marketplace — 340+ depth datasets with provenance graphs (truelabel.ai)
- [23] Figure + Brookfield humanoid pretraining dataset partnership — 100M-hour humanoid dataset collection (figure.ai)
FAQ
What is the difference between monocular and stereo depth estimation?
Monocular depth estimation predicts depth from a single RGB image using learned priors about scene geometry, texture gradients, and object scale. Stereo depth estimation triangulates depth by matching corresponding pixels between two synchronized cameras separated by a known baseline, computing depth via geometric disparity. Monocular methods require only one camera (lower cost, simpler calibration) but produce relative or scale-ambiguous depth unless fine-tuned on metric ground truth. Stereo methods provide metric depth without learning but fail on textureless surfaces where pixel matching is ambiguous, and require precise calibration to maintain accuracy over time.
Can monocular depth models run in real-time on edge devices?
Yes. Depth Anything V2's ViT-Small variant (roughly 25 million parameters) runs at 62 FPS on an NVIDIA Jetson Orin AGX, and quantized INT8 models achieve 95 FPS with 2.1% accuracy loss. This meets the 50–100 millisecond latency budget for closed-loop manipulation at 10–20 Hz control frequencies. Larger models like ViT-Giant (1.3 billion parameters) run at 12 FPS on the same hardware, suitable for offline annotation but too slow for real-time control. Model distillation and pruning can compress large models to edge-friendly sizes while retaining 90–95% of the original accuracy.
Why do monocular depth models fail on transparent objects?
Transparent materials like glass and acrylic lack the texture gradients, occlusion boundaries, and perspective cues that MDE models use to infer depth. A transparent wine glass may be predicted as a hole in the table surface because the model sees the table texture through the glass, interpreting it as a continuous surface. Reflective metals and mirrors produce spurious depth estimates by showing virtual images of distant objects. Fine-tuning on datasets with labeled transparent objects (like DROID's 8,400 transparent-manipulation frames) reduces error from 42% to 18%, but performance still lags opaque objects by 3×.
How much training data is needed to fine-tune a pretrained depth model for a new domain?
Domain-specific fine-tuning typically requires 5,000–10,000 labeled RGB-depth pairs to adapt a pretrained model like Depth Anything V2 to a new environment (warehouse, surgical suite, underwater). Models pretrained on large-scale synthetic data (12 million frames) plus real web images (28 million frames) learn robust depth priors that transfer with minimal real data. Sim-to-real studies show that 5,000 real frames plus 500,000 synthetic frames outperform 50,000 real frames alone by 4.2% mean absolute error, because synthetic data provides coverage of rare edge cases (extreme lighting, occlusions) that are expensive to collect in the real world.
What depth accuracy is required for robotic grasping?
Grasp-relevant depth error (accuracy within 10 cm of the grasp point) should be below 5% of object distance for reliable picking. At 50 cm object distance, this means depth error under 2.5 cm. Policies using depth with <5% grasp-relevant error achieve 84% pick success versus 68% for models with 15% error, even when whole-image mean absolute error is identical. For contact-rich tasks like cable insertion or snap-fit assembly, sub-millimeter depth accuracy is required, necessitating metric depth from calibrated stereo rigs or structured-light sensors rather than monocular estimation.
Are monocular depth datasets commercially licensed?
Most public depth datasets (NYU Depth V2, KITTI) have research-only licenses prohibiting production use, or require attribution with redistribution restrictions. Only 12% of public datasets use permissive licenses like Creative Commons BY 4.0 that allow unrestricted commercial deployment. Truelabel's marketplace offers 340+ depth datasets with perpetual commercial licenses starting at $2,000 for 10,000 frames, including full provenance graphs that satisfy EU AI Act Article 10 documentation requirements. Synthetic datasets from Apache 2.0-licensed simulators (Habitat, AI2-THOR) avoid licensing ambiguity but underperform real data by 8–15% on out-of-distribution scenes.
Find datasets covering monocular depth estimation
Truelabel surfaces vetted datasets and capture partners working with monocular depth estimation. Send us the modality, scale, and rights you need, and we will route you to the closest match.