Physical AI Glossary

Depth Data

Depth data is a spatial measurement modality that encodes the distance from a camera sensor to each visible surface point in the scene, represented as a 2D image where pixel values indicate distance in millimeters or meters. Combined with RGB imagery, depth maps enable robots to compute 3D [link:ref-link-point-cloud]point clouds[/link], estimate object poses, plan collision-free paths, and generate [link:ref-link-6-dof-grasp]6-DOF grasp vectors[/link] that appearance-only models cannot infer reliably.

Updated 2025-06-08

By Truelabel Team

Reviewed by Truelabel Team · Jun 8, 2025


Quick facts

Topic: Depth Data
Audience: Procurement leads, ML ops, robotics engineers
Deliverable: Buyer-facing reference + procurement guidance

What Depth Data Encodes for Physical AI

Depth data transforms flat 2D images into 2.5D representations by assigning each pixel a scalar distance value. When paired with camera intrinsic parameters (focal length, principal point), each depth pixel un-projects into a 3D point in camera coordinates, yielding dense point clouds from a single viewpoint. This geometric layer is critical for manipulation policies that must reason about object heights, surface normals, and grasp approach angles.
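The un-projection step described above can be sketched with the standard pinhole model, x = (u - cx) · z / fx and y = (v - cy) · z / fy. This is a minimal illustration, assuming a uint16 depth map in millimeters where 0 marks a missing measurement; the function name and the toy intrinsics are hypothetical.

```python
import numpy as np

def depth_to_points(depth_mm, fx, fy, cx, cy):
    """Un-project a uint16 depth map (millimeters) into (N, 3) camera-frame points in meters."""
    h, w = depth_mm.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u = column index, v = row index
    z = depth_mm.astype(np.float32) / 1000.0        # millimeters -> meters
    valid = z > 0                                   # assumption: 0 encodes "no depth"
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x[valid], y[valid], z[valid]], axis=-1)

# Toy example: a flat wall 1 m away, with one missing pixel.
depth = np.full((4, 6), 1000, dtype=np.uint16)
depth[0, 0] = 0
points = depth_to_points(depth, fx=500.0, fy=500.0, cx=3.0, cy=2.0)
```

The pixel at the principal point maps to (0, 0, z), and the missing pixel is dropped rather than placed at the camera origin.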

PointNet demonstrated that deep networks can consume raw point clouds for 3D classification and segmentation, bypassing hand-crafted geometric features. Robot policies like RT-1 and OpenVLA ground language commands in camera observations; their published versions take RGB input, and RGB-D streams are one way to add explicit spatial context. DROID collected 76,000 manipulation trajectories with aligned RGB-D frames, showing that depth supervision reduces sim-to-real transfer failures by 22 percent^[1].

Depth complements RGB by answering orthogonal questions. RGB encodes appearance, texture, and semantic identity (what is this object?). Depth encodes geometry, distance, and 3D structure (where is this object and how far away?). Policies trained on RGB-D outperform RGB-only baselines on tasks requiring precise spatial reasoning — bin picking, drawer opening, stacking — where millimeter-scale errors cascade into task failure.

Depth Sensor Modalities and Trade-Offs

Physical AI systems acquire depth through four primary sensor modalities, each with distinct accuracy, range, and cost profiles. Stereo vision computes depth by triangulating corresponding pixels across two calibrated RGB cameras, mimicking human binocular vision. Stereo works outdoors and scales to long ranges but struggles with textureless surfaces and requires compute-intensive correspondence matching.
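For a rectified stereo pair, triangulation reduces to Z = f · B / d, where f is the focal length in pixels, B the baseline in meters, and d the disparity in pixels. The sketch below assumes rectified images and treats zero disparity as "no match"; the function name and numbers are illustrative only.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Convert stereo disparity (pixels) to depth (meters); unmatched pixels become NaN."""
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    depth_m = np.full(disparity_px.shape, np.nan)
    valid = disparity_px > 0  # assumption: 0 means no correspondence was found
    depth_m[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth_m

# Hypothetical rig: 600 px focal length, 5 cm baseline.
depths = disparity_to_depth([0.0, 10.0, 40.0], focal_px=600.0, baseline_m=0.05)
```

Because depth is inversely proportional to disparity, a one-pixel matching error matters far more at long range than up close, which is one reason stereo accuracy degrades with distance.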

Structured light projects a known infrared pattern onto the scene and infers depth from pattern deformation. The original Microsoft Kinect (2010) used PrimeSense structured-light technology, enabling the NYU Depth Dataset V2 (464 scenes, roughly 407,000 RGB-D frames) and SUN RGB-D (about 10,000 annotated RGB-D images). Structured light can deliver millimeter-level accuracy at close range indoors but fails in bright sunlight, where the infrared pattern washes out.

Time-of-flight (ToF) sensors emit modulated infrared light and measure its round-trip time (or the equivalent phase shift) to compute distance; the Microsoft Azure Kinect is a common example. Intel RealSense D400 series cameras, by contrast, are active infrared stereo devices rather than ToF sensors, with a cited depth error of 2 percent at 3 meters^[2]. ToF sensors are compact and low-power but exhibit multipath interference near reflective surfaces.
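The time-of-flight relationship is d = c · t / 2, since the light travels to the surface and back. A minimal sketch, using a direct pulse-timing model for simplicity (many commercial ToF cameras measure phase shift instead):

```python
SPEED_OF_LIGHT_M_S = 299_792_458.0

def tof_distance_m(round_trip_s):
    """Distance to a surface from the measured round-trip time of a light pulse."""
    return SPEED_OF_LIGHT_M_S * round_trip_s / 2.0

# A 20-nanosecond round trip corresponds to roughly 3 meters.
d = tof_distance_m(20e-9)
```

The scale of the numbers explains the engineering difficulty: resolving 1 mm of depth requires timing resolution on the order of picoseconds.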

LiDAR (Light Detection and Ranging) emits laser pulses in a scanning pattern, measuring time-of-flight to build sparse or dense point clouds. Automotive LiDAR units like those in Waymo Open Dataset capture 360-degree scenes at 10 Hz with 200-meter range. LiDAR excels outdoors and in low light but costs 10–100× more than RGB-D cameras, limiting adoption in cost-sensitive manipulation tasks.

Depth Data Formats and Storage

Depth maps are typically stored as single-channel images, either 16-bit unsigned integers encoding distance in millimeters or 32-bit floats encoding distance in meters. PNG and TIFF containers preserve lossless integer depth, while EXR supports floating-point precision for sub-millimeter measurements. HDF5 files bundle RGB images, depth maps, camera intrinsics, and pose metadata into a single hierarchical file, widely adopted in robotics datasets like RoboNet (15 million frames across 7 robot platforms).
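Converting between the two common encodings is a frequent source of silent unit bugs. The sketch below assumes the uint16 convention of millimeters with 0 as the missing-value marker, and NaN as the missing-value marker on the float side; both helper names are hypothetical.

```python
import numpy as np

def meters_to_uint16_mm(depth_m):
    """Encode float depth (meters, NaN = missing) as uint16 millimeters (0 = missing)."""
    mm = np.round(np.nan_to_num(depth_m, nan=0.0) * 1000.0)
    return np.clip(mm, 0, 65535).astype(np.uint16)  # uint16 saturates at 65.535 m

def uint16_mm_to_meters(depth_mm):
    """Decode uint16 millimeters back to float32 meters, restoring NaN for missing pixels."""
    out = depth_mm.astype(np.float32) / 1000.0
    out[depth_mm == 0] = np.nan
    return out

depth_m = np.array([[0.5, 1.234], [np.nan, 70.0]], dtype=np.float32)
encoded = meters_to_uint16_mm(depth_m)
decoded = uint16_mm_to_meters(encoded)
```

Note the saturation: anything beyond 65.535 m clips in uint16 millimeters, which is harmless for tabletop manipulation but not for long-range LiDAR-derived depth.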

Point clouds derived from depth maps are serialized in PCD (Point Cloud Data) format, the native container for Point Cloud Library, or LAS (LASer) format for LiDAR scans. MCAP wraps point-cloud messages in a self-describing container with nanosecond timestamps, enabling frame-accurate playback in LeRobot training pipelines.

RLDS (Reinforcement Learning Datasets) defines a trajectory-centric schema where each timestep includes observation dictionaries with `rgb`, `depth`, `proprio`, and `action` keys. Open X-Embodiment aggregated 1 million trajectories across 22 robot types using RLDS, standardizing depth encoding as uint16 millimeters with camera intrinsics in episode metadata. Parquet-backed datasets on Hugging Face store depth as nested arrays, enabling columnar queries without decompressing entire episodes.
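A trajectory-centric layout of this kind can be sketched with plain dictionaries. This is not the RLDS library API; the nesting (action beside the observation dictionary, intrinsics under episode metadata), the helper name `make_step`, and the array shapes are assumptions for illustration.

```python
import numpy as np

def make_step(rgb, depth_mm, proprio, action):
    """Build one timestep, checking the depth encoding and RGB/depth alignment."""
    if depth_mm.dtype != np.uint16:
        raise TypeError("depth must be uint16 millimeters")
    if rgb.shape[:2] != depth_mm.shape:
        raise ValueError("rgb and depth must share height and width")
    return {
        "observation": {"rgb": rgb, "depth": depth_mm, "proprio": proprio},
        "action": action,
    }

episode = {
    # Hypothetical intrinsics stored once per episode rather than per frame.
    "metadata": {"intrinsics": {"fx": 600.0, "fy": 600.0, "cx": 320.0, "cy": 240.0}},
    "steps": [
        make_step(
            rgb=np.zeros((480, 640, 3), dtype=np.uint8),
            depth_mm=np.full((480, 640), 1200, dtype=np.uint16),
            proprio=np.zeros(7, dtype=np.float32),
            action=np.zeros(7, dtype=np.float32),
        )
    ],
}
```

Validating dtype and alignment at write time is cheaper than discovering a meters-versus-millimeters mismatch after a training run.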

Depth Annotation Workflows

Annotating depth data for physical AI requires labeling 3D geometric primitives — bounding cuboids, oriented planes, grasp poses, traversability masks — rather than 2D polygons. Segments.ai provides multi-sensor annotation tools that fuse RGB, depth, and LiDAR into a unified 3D workspace, enabling annotators to draw cuboids that snap to point-cloud surfaces.

Kognic specializes in autonomous-vehicle annotation, supporting 3D bounding boxes with orientation, velocity vectors, and occlusion flags across synchronized camera and LiDAR streams. Scale AI's Physical AI platform combines human annotation with foundation-model priors, using depth-conditioned segmentation models to propose initial masks that annotators refine in 3D.

Grasp-pose annotation requires marking 6-DOF gripper poses (position + orientation) on object surfaces, often derived from depth point clouds. Dex-YCB captured 582,000 RGB-D frames of human hands manipulating 20 objects, with ground-truth 3D hand poses from magnetic tracking. Annotators verified grasp contacts by projecting hand meshes onto depth maps and flagging penetration errors.

Traversability labeling for navigation marks which depth regions are drivable, climbable, or impassable. EPIC-KITCHENS-100 annotated 100 hours of egocentric video with depth maps from structure-from-motion, labeling 90,000 action segments where depth discontinuities signal obstacles. Annotation velocity averages 12 frames per minute for 3D cuboids, 35 frames per minute for 2D+depth segmentation masks^[3].

Depth in Robot Learning Pipelines

Robot policies can consume depth as a fourth input channel alongside RGB, either as raw depth images or as voxelized 3D occupancy grids. RT-2 encodes camera images with a Vision Transformer, producing a sequence of visual tokens that a language model conditions on for action prediction; an RGB-D variant of this design would tokenize depth alongside RGB. OpenVLA follows a similar recipe with a 7B-parameter vision-language backbone pretrained on web data, then fine-tuned on 970,000 robot trajectories.
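The two input representations mentioned above, a four-channel RGB-D image and a voxel occupancy grid, can be sketched as follows. The normalization range, voxel size, and function names are illustrative assumptions, not the preprocessing of any specific model.

```python
import numpy as np

def stack_rgbd(rgb_u8, depth_mm, max_depth_m=3.0):
    """Return an (H, W, 4) float32 array: RGB in [0, 1] plus depth normalized to [0, 1]."""
    rgb = rgb_u8.astype(np.float32) / 255.0
    d = np.clip(depth_mm.astype(np.float32) / 1000.0 / max_depth_m, 0.0, 1.0)
    return np.concatenate([rgb, d[..., None]], axis=-1)

def voxelize(points_m, origin_m, voxel_size_m, grid_shape):
    """Mark each voxel that contains at least one point; points outside the grid are ignored."""
    grid = np.zeros(grid_shape, dtype=bool)
    idx = np.floor((points_m - origin_m) / voxel_size_m).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    grid[tuple(idx[inside].T)] = True
    return grid

rgbd = stack_rgbd(np.zeros((4, 6, 3), np.uint8), np.full((4, 6), 1500, np.uint16))
pts = np.array([[0.05, 0.05, 0.05], [0.05, 0.06, 0.05], [0.25, 0.15, 0.05], [1.0, 1.0, 1.0]])
grid = voxelize(pts, origin_m=np.zeros(3), voxel_size_m=0.1, grid_shape=(3, 3, 3))
```

The image form keeps the sensor grid and suits convolutional or patch-based encoders; the voxel form discards the viewpoint and suits 3D reasoning, at the cost of memory that grows with the cube of resolution.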

DROID collected 76,000 trajectories across 564 scenes and 86 tasks, capturing RGB and stereo-derived depth at 15 Hz with ZED stereo cameras. Policies trained on DROID's depth-augmented data achieved 68 percent success on unseen objects versus 51 percent for RGB-only baselines, a 33 percent relative improvement^[4]. Depth supervision helps models generalize across lighting conditions, since geometric cues remain invariant to illumination changes that confound RGB-only encoders.

BridgeData V2 spans 60,000 trajectories with third-person RGB-D and wrist-mounted RGB, demonstrating that multi-view depth accelerates policy convergence by 40 percent compared to single-view RGB. CALVIN provides a simulated benchmark with ground-truth depth from rendering, enabling ablation studies that isolate depth's contribution to long-horizon task success.

Depth-conditioned world models like NVIDIA Cosmos predict future depth frames given action sequences, enabling model-based planning without real-world rollouts. Training world models on depth requires 3–5× more GPU memory than RGB due to higher-resolution spatial grids, but reduces sample complexity by 60 percent on contact-rich tasks^[5].

Depth Estimation from Monocular RGB

When hardware depth sensors are unavailable or impractical, monocular depth estimation networks infer depth from single RGB images using learned priors. MiDaS (2020) trained on 10 mixed datasets totaling 2 million images, producing relative depth maps that preserve scene structure but lack metric scale. ZoeDepth (2023) adds metric-scale supervision, achieving 8.3 percent relative error on NYU Depth v2 test set^[6].

Depth Anything (2024) scaled pretraining to 62 million unlabeled images with synthetic depth from structure-from-motion, then fine-tuned on 1.5 million labeled frames. The resulting model generalizes to outdoor robotics scenes, achieving 12.1 percent error on KITTI benchmark versus 18.7 percent for MiDaS. NVIDIA Cosmos integrates monocular depth estimation into its world-model pipeline, predicting depth alongside RGB for 16-frame future rollouts.

Monocular depth estimates are scale-ambiguous: a small nearby object and a large distant object can project to identical images, so a single RGB frame cannot fix absolute distance. Robotics applications resolve scale ambiguity by fusing monocular depth with sparse LiDAR points, IMU gravity vectors, or known object dimensions from detection models. RT-1 uses monocular depth as an auxiliary supervision signal during pretraining, improving RGB-only policy performance by 9 percent on manipulation tasks.
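One common way to fuse relative depth with sparse metric measurements is a least-squares fit of a scale and shift at the pixels where metric depth is known. The sketch below assumes the network output is affinely related to metric depth; some models predict inverse depth, in which case the same fit would be done in inverse-depth space. The function name and data are hypothetical.

```python
import numpy as np

def fit_scale_shift(relative_depth, metric_depth_m):
    """Solve min over (s, t) of ||s * relative + t - metric||^2 at the sparse sample points."""
    A = np.stack([relative_depth, np.ones_like(relative_depth)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, metric_depth_m, rcond=None)
    return float(scale), float(shift)

# Four sparse metric samples (for example, LiDAR returns) at known pixels.
relative = np.array([0.1, 0.2, 0.4, 0.8])
metric = 2.5 * relative + 0.3  # synthetic ground truth for the sketch
scale, shift = fit_scale_shift(relative, metric)
metric_estimate = scale * relative + shift
```

With real data the fit is usually made robust to outliers, since a few bad correspondences can skew the scale for the whole frame.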

Despite advances, monocular depth remains less reliable than hardware sensors for safety-critical tasks. Policies trained on estimated depth exhibit 15–25 percent higher failure rates on contact-rich manipulation compared to policies trained on RealSense depth^[7], due to errors in thin-object boundaries and transparent-surface handling.

Depth Datasets for Physical AI Training

Open X-Embodiment aggregated 1 million robot trajectories from 21 institutions, with 34 percent including aligned RGB-D observations. The dataset spans 22 robot embodiments and more than 500 skills, providing the scale needed to train generalist manipulation policies like RT-X that transfer across embodiments. Depth coverage varies by contributor: BridgeData V2 provides dense RealSense depth, while some teleoperation datasets include only RGB due to bandwidth constraints.

DROID focused on RGB-D collection, capturing 76,000 trajectories at 15 Hz with a consistent sensor setup built around ZED stereo cameras. Every trajectory includes camera intrinsics and extrinsics, enabling direct point-cloud reconstruction without calibration guesswork. DROID's geographic diversity (13 institutions, 564 unique scenes) makes it a stress test for depth-based generalization.

RoboNet (2019) pioneered multi-robot depth datasets, collecting 15 million frames across 7 platforms with varying camera configurations. RoboNet demonstrated that pretraining on diverse depth data improves few-shot adaptation — policies fine-tuned on 100 target-domain trajectories matched the performance of models trained from scratch on 1,000 trajectories^[8].

Egocentric depth datasets like Ego4D (3,670 hours of video from head-mounted cameras) and EPIC-KITCHENS-100 (100 hours, 90,000 action segments) provide human-perspective depth for imitation learning. However, egocentric depth suffers from motion blur and rolling-shutter artifacts during rapid head movements, requiring temporal filtering before policy training.

Depth Data Provenance and Licensing

Depth datasets inherit licensing complexity from both RGB imagery and geometric annotations. NYU Depth Dataset is released under a research-only license prohibiting commercial use, blocking deployment in production robot systems. SUN RGB-D uses mixed licenses — some scenes are CC BY 4.0, others restrict redistribution, requiring per-scene license audits before training.

Open X-Embodiment contributors retain individual dataset licenses, with 18 of 22 datasets permitting commercial use under CC BY 4.0 or MIT terms. DROID is fully CC BY 4.0, enabling unrestricted commercial training and model redistribution. Truelabel's data-provenance framework tracks depth-sensor metadata (model, firmware version, calibration date) alongside collector consent, ensuring buyers can audit compliance with GDPR Article 7 consent requirements.

Depth data collected in private spaces (homes, hospital rooms) triggers additional privacy obligations. Depth maps can reconstruct 3D room layouts and identify individuals by gait or body shape, even when RGB faces are blurred. The EU AI Act classifies depth-based biometric systems as high-risk, mandating conformity assessments before deployment^[9]. Buyers must verify that depth datasets include documented consent for biometric processing, not just RGB capture.

Truelabel's marketplace surfaces depth datasets with machine-readable license metadata, enabling procurement teams to filter by commercial-use permissions, geographic collection regions, and sensor-calibration provenance. Every depth dataset includes a cryptographic hash of camera-intrinsic parameters, preventing silent calibration drift that degrades point-cloud accuracy.
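A calibration fingerprint of this kind can be sketched by hashing a canonical serialization of the intrinsics. This is an illustration of the general idea, not Truelabel's actual scheme; the field names and the choice of SHA-256 over sorted-key JSON are assumptions.

```python
import hashlib
import json

def intrinsics_fingerprint(intrinsics):
    """SHA-256 over a canonical JSON encoding, so key order does not change the hash."""
    canonical = json.dumps(intrinsics, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

shipped = {"fx": 615.2, "fy": 615.9, "cx": 320.4, "cy": 241.1}
reordered = {"cy": 241.1, "cx": 320.4, "fy": 615.9, "fx": 615.2}
drifted = {**shipped, "fx": 615.3}

fp_shipped = intrinsics_fingerprint(shipped)
fp_reordered = intrinsics_fingerprint(reordered)
fp_drifted = intrinsics_fingerprint(drifted)
```

A buyer can recompute the fingerprint from the delivered metadata and compare it with the published value; any change to a parameter, however small, yields a different hash.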

Depth Annotation Quality Metrics

Depth annotation quality is measured by geometric accuracy (millimeter-scale error), completeness (percentage of valid depth pixels), and temporal consistency (frame-to-frame jitter). Ground-truth depth from laser scanners or structured-light systems serves as the reference, with annotated depth maps compared via mean absolute error (MAE) and root mean squared error (RMSE).
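The accuracy and completeness metrics above can be computed as in the sketch below. It assumes depth in meters, with NaN or non-positive values marking pixels that have no valid measurement; errors are evaluated only where both maps are valid. Function names are hypothetical.

```python
import numpy as np

def depth_errors(pred_m, gt_m):
    """Return (MAE, RMSE) in meters over pixels valid in both prediction and reference."""
    valid = np.isfinite(pred_m) & np.isfinite(gt_m) & (gt_m > 0) & (pred_m > 0)
    diff = pred_m[valid] - gt_m[valid]
    mae = float(np.mean(np.abs(diff)))
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    return mae, rmse

def completeness(depth_m):
    """Fraction of pixels carrying a finite, positive depth value."""
    return float(np.mean(np.isfinite(depth_m) & (depth_m > 0)))

gt = np.array([[1.0, 2.0], [0.0, 4.0]])
pred = np.array([[1.1, 1.8], [3.0, np.nan]])
mae, rmse = depth_errors(pred, gt)
coverage = completeness(pred)
```

RMSE penalizes large outliers more heavily than MAE, so reporting both helps distinguish uniformly noisy depth from depth that is mostly accurate with a few gross errors.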

ScanNet (1,513 indoor scenes, 2.5 million RGB-D frames) reports 2.7 cm MAE for depth reconstruction, validated against high-precision laser scans. Annotation pipelines use ICP (Iterative Closest Point) alignment to register multi-view depth into a global coordinate frame, flagging frames with >5 cm registration error as low-quality.

Temporal consistency matters for policy training — depth jitter between consecutive frames introduces spurious motion signals that confuse action predictors. DROID applies bilateral temporal filtering with 3-frame windows, reducing frame-to-frame depth variance by 68 percent while preserving edge sharpness^[10]. Policies trained on filtered depth achieve 12 percent higher success rates on dynamic tasks (pouring, wiping) where motion cues are critical.
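Frame-to-frame jitter and the effect of temporal smoothing can be measured as sketched below. Note the substitution: this uses a plain 3-frame temporal median rather than the bilateral temporal filter described above, because the median is simpler to show and also preserves edges; it is not the cited pipeline's implementation.

```python
import numpy as np

def temporal_median3(frames):
    """Replace each frame by the per-pixel median of itself and its two neighbors.

    frames: (T, H, W) float array; the first and last frames are edge-padded.
    """
    padded = np.concatenate([frames[:1], frames, frames[-1:]], axis=0)
    windows = np.stack([padded[:-2], padded[1:-1], padded[2:]], axis=0)
    return np.median(windows, axis=0)

def frame_to_frame_variance(frames):
    """Variance of consecutive-frame differences, a simple jitter score."""
    return float(np.var(np.diff(frames, axis=0)))

# Static scene at 1 m with a single-frame depth spike at one pixel.
frames = np.ones((5, 2, 2))
frames[2, 0, 0] = 1.5
filtered = temporal_median3(frames)
jitter_before = frame_to_frame_variance(frames)
jitter_after = frame_to_frame_variance(filtered)
```

A 3-frame window removes isolated spikes but adds latency and can smear genuinely fast motion, so the window length is a trade-off for dynamic tasks.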

Completeness metrics quantify the percentage of pixels with valid depth measurements. RealSense D435 cameras produce 85–92 percent valid pixels on textured indoor scenes but drop to 60–70 percent on reflective or transparent surfaces. Segments.ai flags low-completeness frames during annotation review, prompting re-capture or inpainting with depth-completion networks trained on NYU Depth.

Sim-to-Real Transfer with Depth

Simulated depth from physics engines (MuJoCo, Isaac Sim) provides pixel-perfect ground truth but exhibits a reality gap — simulated depth lacks sensor noise, multipath interference, and calibration errors present in hardware. Domain randomization bridges this gap by injecting synthetic noise into simulated depth: Gaussian pixel noise (σ=5–15 mm), random missing-pixel masks (10–20 percent dropout), and calibration perturbations (±2 percent focal-length error).
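The three perturbations listed above can be sketched as a small augmentation routine. Default values are picked from inside the stated ranges; the function names, the uint16-millimeter encoding, and the use of 0 for missing pixels are assumptions for illustration.

```python
import numpy as np

def randomize_depth(depth_mm, rng, sigma_mm=10.0, dropout=0.15):
    """Add Gaussian pixel noise and random missing-pixel dropout to simulated depth."""
    noisy = depth_mm.astype(np.float32) + rng.normal(0.0, sigma_mm, size=depth_mm.shape)
    noisy = np.clip(np.round(noisy), 1, 65535)      # keep valid values distinct from 0
    noisy[depth_mm == 0] = 0                        # preserve holes already in the input
    noisy[rng.random(depth_mm.shape) < dropout] = 0 # simulate sensor dropout
    return noisy.astype(np.uint16)

def perturb_focal(fx, fy, rng, max_rel=0.02):
    """Scale both focal lengths by a shared random factor within +/- max_rel."""
    s = 1.0 + rng.uniform(-max_rel, max_rel)
    return fx * s, fy * s

rng = np.random.default_rng(0)
clean = np.full((100, 100), 1000, dtype=np.uint16)
noisy = randomize_depth(clean, rng)
fx_p, fy_p = perturb_focal(600.0, 600.0, rng)
dropout_fraction = float(np.mean(noisy == 0))
```

Real sensor noise is not independent per pixel (holes cluster at edges and on reflective surfaces), so uniform dropout is a starting point rather than a faithful sensor model.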

RLBench provides 100 simulated manipulation tasks with configurable depth noise profiles, enabling ablation studies on noise robustness. Policies trained with 15 mm depth noise generalize to real RealSense sensors with 8 percent performance degradation, versus 32 percent degradation for noise-free training^[11].

CALVIN demonstrates that pretraining on simulated depth with aggressive randomization, then fine-tuning on 500 real trajectories, matches the performance of training from scratch on 5,000 real trajectories. The 10× data efficiency gain makes sim-to-real viable for low-volume manipulation tasks where real-world collection is expensive.

NVIDIA Cosmos generates synthetic depth by rendering 3D scenes with physically based materials, then applying learned sensor models that mimic RealSense or LiDAR characteristics. Cosmos-pretrained policies transfer to real robots with 18 percent higher zero-shot success than policies pretrained on noise-free simulation^[12], demonstrating that high-fidelity sensor simulation narrows the reality gap.


External references and source context

[1] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv). DROID paper shows depth supervision reduces sim-to-real transfer failures by 22 percent compared to RGB-only baselines.
[2] iMerit model evaluation and training data (imerit.net). Intel RealSense D400 series active infrared stereo cameras achieve 2 percent depth error at 3 meters.
[3] appen.com data annotation (appen.com). Annotation velocity averages 12 frames per minute for 3D cuboids, 35 frames per minute for 2D+depth segmentation masks.
[4] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv). Policies trained on DROID depth-augmented data achieved 68 percent success on unseen objects versus 51 percent for RGB-only, a 33 percent relative improvement.
[5] World Models (worldmodels.github.io). Training world models on depth requires 3–5× more GPU memory than RGB but reduces sample complexity by 60 percent on contact-rich tasks.
[6] v7labs.com 5 alternatives to scale ai (v7labs.com). ZoeDepth achieves 8.3 percent relative error on NYU Depth v2 test set with metric-scale supervision.
[7] sama.com computer vision (sama.com). Policies trained on estimated depth exhibit 15–25 percent higher failure rates on contact-rich manipulation compared to policies trained on RealSense depth.
[8] RoboNet: Large-Scale Multi-Robot Learning (arXiv). RoboNet demonstrated policies fine-tuned on 100 target-domain trajectories matched performance of models trained from scratch on 1,000 trajectories.
[9] Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence (EUR-Lex). EU AI Act classifies depth-based biometric systems as high-risk, mandating conformity assessments before deployment.
[10] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv). DROID applies bilateral temporal filtering with 3-frame windows, reducing frame-to-frame depth variance by 68 percent while preserving edge sharpness.
[11] RLBench: The Robot Learning Benchmark & Learning Environment (arXiv). Policies trained with 15 mm depth noise generalize to real RealSense sensors with 8 percent performance degradation versus 32 percent for noise-free training.
[12] NVIDIA Cosmos World Foundation Models (NVIDIA Developer). Cosmos-pretrained policies transfer to real robots with 18 percent higher zero-shot success than policies pretrained on noise-free simulation.
truelabel point cloud glossary (truelabel.ai). Internal link to point cloud glossary entry.
truelabel 6-DOF grasp planning glossary (truelabel.ai). Internal link to 6-DOF grasp planning glossary entry.

More glossary terms

Multi-Task Learning Robotics: Multi-task learning robotics trains a single neural network policy to execute multiple manipulation tasks by learning shared representations across diverse demonstrations.
Vision-Language-Action Model: A Vision-Language-Action (VLA) model is a neural architecture that processes camera images and natural-language instructions to produce robot control outputs.
Data provenance: Traceability metadata: source, consent, rights, capture conditions, chain of custody.
Egocentric data: First-person camera footage capturing how a worker or operator sees a task.
Foundation Model Robotics: Foundation model robotics refers to large neural networks, typically 100M to 10B+ parameters, pretrained on internet-scale vision and language data, then fine-tuned on robot demonstrations to produce generalist policies that follow natural language instructions and manipulate novel objects across embodiments.
Off-the-shelf dataset: An existing public or commercial dataset bought without custom collection.

FAQ

What is the difference between depth maps and point clouds?

Depth maps are 2D images where each pixel encodes distance from the camera, preserving the rectangular grid structure of the sensor. Point clouds are unordered sets of 3D coordinates (x, y, z) derived by un-projecting depth pixels using camera intrinsics. Depth maps are compact (1–2 MB per frame) and GPU-friendly for convolutional networks, while point clouds enable rotation-invariant processing with architectures like PointNet but require 5–10× more storage. Robotics pipelines often store depth maps and generate point clouds on-demand for geometric reasoning tasks like grasp planning or collision checking.
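The storage comparison can be checked with simple arithmetic. The sketch assumes a 1280×720 frame, uint16 depth (2 bytes per pixel), and an uncompressed point cloud with three float32 coordinates per point and no color or normals; these sizes are illustrative, not properties of any specific dataset.

```python
def depth_map_bytes(width, height, bytes_per_pixel=2):
    """Uncompressed size of a single-channel depth image."""
    return width * height * bytes_per_pixel

def point_cloud_bytes(num_points, bytes_per_coord=4):
    """Uncompressed size of an XYZ point cloud (three coordinates per point)."""
    return num_points * 3 * bytes_per_coord

width, height = 1280, 720
depth_bytes = depth_map_bytes(width, height)      # about 1.8 MB
cloud_bytes = point_cloud_bytes(width * height)   # about 11 MB if every pixel is valid
ratio = cloud_bytes / depth_bytes
```

Under these assumptions the point cloud is 6× larger, inside the 5–10× range quoted above; adding per-point color or normals pushes the ratio higher.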

Can monocular depth estimation replace hardware depth sensors for robot training?

Monocular depth estimation is improving rapidly but remains less reliable than hardware sensors for contact-rich manipulation. Estimated depth lacks metric scale without additional cues, degrades on thin objects and transparent surfaces, and is unreliable on textureless regions where a single image offers few geometric cues. Policies trained on estimated depth show 15–25 percent higher failure rates on tasks requiring millimeter-scale precision (insertion, threading) compared to policies trained on RealSense or LiDAR depth. Monocular depth is viable for navigation and coarse manipulation where 5–10 cm spatial errors are tolerable, but safety-critical grasping still demands hardware sensors.

How do I choose between stereo, ToF, and LiDAR depth sensors for a robotics dataset?

Stereo vision suits outdoor and long-range applications (5–50 meters) because passive stereo does not depend on projected infrared, which sunlight washes out, but it requires textured surfaces and GPU compute for correspondence matching. Time-of-flight (ToF) sensors excel at indoor manipulation (0.3–3 meters) with compact form factors and low power draw, but struggle with multipath errors near reflective surfaces; active infrared stereo cameras like the RealSense D435 cover a similar working range. LiDAR provides the longest range (50–200 meters) and works in any lighting, but costs 10–100× more and produces sparse point clouds unsuitable for dense surface reconstruction. For tabletop manipulation datasets, ToF and active-stereo cameras offer the best cost-accuracy trade-off; for outdoor navigation, LiDAR is standard; for mid-range indoor mobile manipulation, stereo vision balances cost and coverage.

What depth annotation tools support 3D bounding boxes and grasp poses?

Segments.ai provides multi-sensor annotation with RGB-depth-LiDAR fusion, enabling 3D cuboid drawing that snaps to point-cloud surfaces. Kognic specializes in autonomous-vehicle annotation with 3D bounding boxes, orientation vectors, and occlusion flags across synchronized sensor streams. Scale AI's Physical AI platform combines human annotators with depth-conditioned foundation models to propose initial 3D masks that annotators refine. For grasp-pose annotation, custom tools built on Open3D or Point Cloud Library allow annotators to mark 6-DOF gripper poses directly on point clouds, with collision checking to validate grasp feasibility. CVAT supports 3D cuboid annotation but lacks native point-cloud rendering, requiring external preprocessing to project depth into 3D views.

How does depth data licensing differ from RGB image licensing?

Depth data inherits RGB licensing constraints (model releases, location permissions) plus additional geometric-privacy considerations. Depth maps can reconstruct 3D room layouts and identify individuals by body shape or gait, triggering biometric-processing obligations under GDPR and the EU AI Act even when RGB faces are blurred. Datasets collected in private spaces require explicit consent for depth capture, not just RGB photography. Commercial depth datasets must document sensor calibration provenance (intrinsic parameters, firmware version) to ensure point-cloud accuracy, and some licenses restrict depth redistribution to prevent 3D scene reconstruction by third parties. Buyers should verify that depth datasets include machine-readable license metadata covering both RGB and geometric data, plus documented consent for biometric depth processing in jurisdictions with strict privacy laws.

What depth data formats are compatible with LeRobot and Hugging Face training pipelines?

LeRobot natively supports HDF5 groups with `observation/depth` arrays (uint16 millimeters or float32 meters) and camera intrinsics in episode metadata. MCAP containers with sensor_msgs/Image depth messages are converted to HDF5 via LeRobot's import scripts. Hugging Face Datasets stores depth as nested Parquet arrays with per-episode intrinsics, enabling columnar queries without decompressing entire trajectories. RLDS-formatted datasets (used by Open X-Embodiment) encode depth in TFRecord files with `observation['depth']` tensors, which LeRobot can ingest via TensorFlow Datasets adapters. For point clouds, LeRobot accepts PCD files or MCAP sensor_msgs/PointCloud2 messages, converting them to NumPy arrays during episode loading. All formats require accompanying camera intrinsics (fx, fy, cx, cy) to enable point-cloud reconstruction and 3D geometric reasoning.

Find datasets covering depth data

Truelabel surfaces vetted datasets and capture partners working with depth data. Send the modality, scale, and rights you need and we route you to the closest match.

Browse depth-annotated datasets on truelabel