Depth Sensing Training Data for Physical AI Systems

Depth sensing training data comprises RGB-D image pairs, stereo disparity maps, LiDAR point clouds, and time-of-flight range measurements annotated for 3D scene understanding. Production robot systems require 50,000+ diverse depth samples covering transparent objects, specular surfaces, outdoor lighting, and sensor-specific noise profiles, conditions that open benchmarks like NYU Depth V2 (464 scenes) and ScanNet (1,513 scans) systematically underrepresent. That gap drives teams toward custom collection, or toward hybrid approaches that blend synthetic domain randomization with real-world edge-case capture.

Updated 2025-06-15 · By truelabel · Reviewed by truelabel

Quick facts

Use case: depth sensing training data
Audience: Robotics and physical AI teams
Last reviewed: 2025-06-15

Why depth perception remains the spatial reasoning bottleneck

Robot manipulation, autonomous navigation, and bin-picking all depend on accurate 3D scene reconstruction from depth sensors, yet monocular depth estimation models trained on existing benchmarks exhibit systematic failure modes on transparent bottles, reflective packaging, and thin-walled containers that warehouse and logistics deployments encounter daily. The PointNet architecture demonstrated that deep learning on raw point clouds could achieve 89.2% classification accuracy on ModelNet40, but real-world robot grasping requires per-pixel depth accuracy under 5mm for objects with complex geometry — a threshold that synthetic training alone cannot meet[1].
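
To make the 5mm threshold concrete, here is a minimal NumPy sketch (function name and synthetic data are illustrative, not from any cited benchmark) that scores a predicted depth map against ground truth by the fraction of valid pixels within tolerance:

```python
import numpy as np

def depth_accuracy(pred_m: np.ndarray, gt_m: np.ndarray, thresh_m: float = 0.005):
    """Fraction of valid pixels with absolute depth error under thresh_m.

    pred_m, gt_m: HxW depth maps in meters; zeros in gt_m mark invalid
    pixels (a common convention for missing sensor returns).
    """
    valid = gt_m > 0
    err = np.abs(pred_m[valid] - gt_m[valid])
    return float((err < thresh_m).mean()), float(err.mean())

# Toy example: a model passing the 5 mm bar on most valid pixels.
gt = np.random.uniform(0.3, 2.0, size=(480, 640))
pred = gt + np.random.normal(0.0, 0.003, size=gt.shape)
frac_ok, mae = depth_accuracy(pred, gt)
print(f"within 5 mm: {frac_ok:.1%}, MAE: {mae * 1000:.1f} mm")
```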

Stereo vision systems trained on ScanNet's 1,513 indoor scans generalize poorly to outdoor construction sites, where direct sunlight creates overexposed regions and cast shadows that violate the Lambertian surface assumptions baked into most stereo matching algorithms. Time-of-flight sensors like Microsoft Azure Kinect produce multipath interference artifacts on glossy surfaces, yet fewer than 8% of open depth datasets include annotated examples of these sensor-specific failure modes[2]. Scale AI's Physical AI data engine addresses this gap by pairing teleoperation collection with multi-sensor depth capture, but procurement teams still face a build-versus-buy decision when open benchmarks cover 60-70% of target scenarios and the remaining 30-40% drives the majority of production failures.

The EPIC-KITCHENS-100 dataset captured 100 hours of egocentric video across 45 kitchens but included depth maps for only a subset of frames, illustrating how even large-scale video corpora deprioritize depth annotation due to cost and tooling complexity. DROID's 76,000 manipulation trajectories included RGB-D streams from RealSense cameras, yet the dataset documentation acknowledges that depth quality degrades beyond 1.5 meters — a critical limitation for mobile manipulation in home and office environments where navigation and grasping must operate across 3-5 meter ranges[3].

Sensor modality tradeoffs: RGB-D vs. stereo vs. LiDAR vs. ToF

RGB-D cameras like Intel RealSense D435 combine structured light or active stereo with color imaging to produce aligned depth maps at 30-90 FPS, but structured light fails outdoors under direct sunlight and active stereo requires textured surfaces to compute disparity. Passive stereo systems avoid active illumination but demand careful calibration and suffer from correspondence ambiguity in textureless regions like white walls or uniform flooring — scenarios that NYU Depth V2's 464 indoor scenes captured but did not systematically annotate for failure-mode analysis.
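
The stereo failure mode follows directly from the triangulation formula Z = f·B/d: where correspondence fails, the disparity d is zero or unreliable and depth is undefined. A small NumPy sketch with illustrative D435-class parameters (focal length and baseline are assumptions, not quoted specs):

```python
import numpy as np

def disparity_to_depth(disp_px, focal_px, baseline_m, min_disp=0.5):
    """Convert a disparity map (pixels) to depth (meters): Z = f * B / d.

    Pixels with disparity below min_disp (textureless regions, failed
    matches) become NaN holes rather than implausibly large depths.
    """
    depth = np.full_like(disp_px, np.nan, dtype=np.float64)
    ok = disp_px >= min_disp
    depth[ok] = focal_px * baseline_m / disp_px[ok]
    return depth

# Assumed parameters: ~640 px focal length, 50 mm stereo baseline.
disp = np.array([[32.0, 0.0], [8.0, 64.0]])
print(disparity_to_depth(disp, focal_px=640.0, baseline_m=0.05))
# 32 px -> 1.0 m; 0 px (no texture) -> NaN; 8 px -> 4.0 m; 64 px -> 0.5 m
```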

LiDAR sensors provide millimeter-range accuracy across 10-100 meter distances and operate reliably in outdoor lighting, making them the standard for autonomous vehicle perception, where the Waymo Open Dataset includes 1,000+ driving segments of synchronized LiDAR and camera data. However, LiDAR point clouds are sparse (64-128 beams for automotive-grade units) compared to the 640×480 or 1280×720 dense depth maps that RGB-D cameras produce, requiring interpolation or learned completion networks that introduce their own artifacts. Point Cloud Library provides standard algorithms for downsampling, filtering, and surface reconstruction, but these preprocessing steps are dataset-specific and rarely documented in model cards[4].
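
As a sketch of the preprocessing that should be documented in a dataset card, the following uses Open3D (chosen for brevity; PCL exposes equivalent filters) to voxel-downsample a sweep and drop statistical outliers. The voxel size and outlier thresholds are illustrative choices that materially change the resulting dataset:

```python
import numpy as np
import open3d as o3d

# Synthetic stand-in for one LiDAR sweep; a real pipeline loads a PCD/LAS file.
pts = np.random.uniform(-50, 50, size=(100_000, 3))
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(pts)

# Preprocessing steps worth versioning alongside the raw capture:
down = pcd.voxel_down_sample(voxel_size=0.10)        # 10 cm voxel grid
clean, kept_idx = down.remove_statistical_outlier(   # drop spurious returns
    nb_neighbors=20, std_ratio=2.0
)
print(len(pcd.points), "->", len(down.points), "->", len(clean.points))
```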

Time-of-flight cameras measure round-trip photon travel time to compute per-pixel depth at high frame rates (60+ FPS) with minimal motion blur, but multipath reflections in corners and near edges create systematic depth errors that point cloud labeling tools struggle to visualize during annotation. The Open X-Embodiment dataset aggregated 1 million robot trajectories from 22 robot embodiments across 21 institutions but used heterogeneous depth sensors (RealSense, Kinect, ZED) without cross-sensor calibration, meaning models trained on this data inherit sensor-specific biases that degrade when deployed on different hardware[5]. Procurement teams must decide whether to collect depth data with the exact sensor their production system will use, adding cost and lead time, or accept a 10-15% accuracy penalty from sensor domain shift.

Open depth benchmarks: coverage gaps and licensing constraints

NYU Depth V2 remains the most-cited indoor depth benchmark, with 464 indoor scenes and 1,449 densely labeled frames captured via Microsoft Kinect, but its 2012 release date means it predates the transparent acrylic furniture, glossy smartphone surfaces, and thin-profile electronics that dominate modern indoor environments. The dataset is licensed under a research-only restriction that prohibits commercial model training without explicit permission, forcing companies to either negotiate custom licenses or exclude it from production pipelines entirely[6].

ScanNet expanded coverage to 1,513 indoor scans across homes and offices with per-frame depth, surface normals, and semantic segmentation, but 88% of scenes are residential spaces with limited representation of industrial settings, retail environments, or outdoor-indoor transitions like loading docks. The dataset uses a Creative Commons BY-NC 4.0 license that permits research use but requires commercial users to contact the authors for separate terms — a friction point that delays procurement by 4-8 weeks in enterprise settings[7].

The Dex-YCB dataset provides 582,000 RGB-D frames of human hands manipulating 20 YCB objects with ground-truth 6-DoF poses, but its focus on tabletop grasping excludes the clutter, occlusion, and variable lighting that warehouse bin-picking systems encounter. RoboNet aggregated 15 million video frames from 7 robot platforms but included depth data for only a subset of trajectories, and its MIT license with dataset-specific terms requires attribution that some legal teams interpret as incompatible with proprietary model deployment[8]. Truelabel's physical AI marketplace addresses licensing friction by requiring all uploaded datasets to declare commercial-use terms upfront, but buyers still face the coverage question: does an open benchmark's 60% scenario match justify the 40% gap that custom collection must fill?

Synthetic depth generation: domain randomization and sim-to-real transfer

Domain randomization trains depth estimation models on synthetic scenes with randomized textures, lighting, and camera parameters to improve generalization to real-world variability, but the technique assumes that the distribution of synthetic variations spans the real-world distribution — an assumption that breaks down for rare materials like frosted glass or anodized aluminum that exhibit non-Lambertian reflectance. The RT-1 Robotics Transformer used 130,000 real robot demonstrations rather than synthetic pre-training because sim-to-real gaps in contact dynamics and sensor noise outweighed the data-efficiency gains from simulation[9].

NVIDIA's Cosmos World Foundation Models generate synthetic depth maps from text prompts and 2D images using diffusion-based 3D scene completion, enabling rapid dataset expansion for underrepresented scenarios like outdoor night scenes or foggy conditions. However, the generated depth maps lack ground-truth validation, meaning they are useful for pre-training or data augmentation but cannot replace real-world test sets for safety-critical applications like autonomous forklifts or surgical robots[10].

Sim-to-real transfer research shows that models trained on 100,000 synthetic depth images plus 5,000 real-world examples outperform models trained on 50,000 real examples alone, but only when the synthetic data includes sensor noise models, motion blur, and lens distortion that match the target hardware. RLBench provides a simulation benchmark with 100+ manipulation tasks and configurable depth sensor parameters, yet fewer than 12% of papers citing RLBench report real-world deployment results, indicating a persistent sim-to-real gap that synthetic data alone does not close[11]. The LeRobot framework supports hybrid training pipelines that blend synthetic pre-training with real-world fine-tuning, but this approach requires access to both synthetic generation tools and real robot hardware — a capability gap that drives demand for turnkey depth datasets on marketplaces.
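
A minimal sketch of what "sensor noise models that match the target hardware" can mean in practice: depth-dependent Gaussian noise plus random dropout applied to clean synthetic depth. The coefficients below are illustrative placeholders that would need to be fit to the production sensor, not measured values:

```python
import numpy as np

def add_sensor_noise(depth_m, rng, sigma_base=0.001, sigma_quad=0.0019,
                     dropout_p=0.02):
    """Corrupt clean synthetic depth with a stereo-style noise model.

    Noise std grows quadratically with range (a common approximation for
    active-stereo sensors); dropout_p simulates missing returns on dark or
    specular surfaces. All coefficients are illustrative, not measured.
    """
    sigma = sigma_base + sigma_quad * depth_m ** 2
    noisy = depth_m + rng.normal(0.0, sigma)
    noisy[rng.random(depth_m.shape) < dropout_p] = 0.0  # 0 marks invalid pixels
    return noisy

rng = np.random.default_rng(0)
clean = np.full((480, 640), 2.0)                        # flat wall at 2 m
noisy = add_sensor_noise(clean, rng)
print(f"{noisy[noisy > 0].std() * 1000:.1f} mm std at 2 m")
```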

Custom depth collection: sensor selection, annotation tooling, and cost structure

Custom depth data collection begins with sensor selection: Intel RealSense D455 offers 1280×720 depth (up to 90 FPS at reduced resolution) with a 6-meter range for $329, while Ouster OS1-64 LiDAR provides 64-beam 120-meter range for $18,000, a roughly 55× price difference that shapes dataset economics. Teams building indoor mobile manipulators typically collect 20,000-50,000 RGB-D pairs across 100-200 scenes to cover lighting variation, clutter density, and object diversity, with per-frame annotation costs of $0.15-$0.80 depending on whether the task requires 2D bounding boxes, 3D cuboids, or dense semantic segmentation[12].
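
A back-of-envelope budget using the per-frame ranges above (the $0.45 cuboid rate is an interpolated assumption, not a quoted figure):

```python
# Back-of-envelope annotation budget from the per-frame ranges quoted above.
# The $0.45 cuboid rate is an interpolated assumption, not a quoted figure.
frames = 50_000
rates_usd = {"2D boxes": 0.15, "3D cuboids": 0.45, "dense segmentation": 0.80}
for task, per_frame in rates_usd.items():
    print(f"{task}: ${frames * per_frame:,.0f}")
# -> 2D boxes: $7,500 | 3D cuboids: $22,500 | dense segmentation: $40,000
```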

Segments.ai's multi-sensor labeling platform supports synchronized RGB-D annotation with point cloud visualization, but manual 3D cuboid placement on 50,000 frames requires 400-600 annotator-hours at $25-$40/hour for trained labelers, totaling $10,000-$24,000 before quality assurance. Encord's active learning pipeline reduces annotation volume by 40-60% by prioritizing frames where model uncertainty is highest, but this requires an initial model trained on 5,000-10,000 labeled examples — a cold-start problem that delays projects by 2-4 weeks[13].
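
The selection step in such an active-learning pipeline can be as simple as ranking frames by model disagreement. A hedged sketch using ensemble variance as the uncertainty proxy (production platforms use richer signals; shapes and names are illustrative):

```python
import numpy as np

def select_frames_for_labeling(ensemble_preds, budget):
    """Pick the frames where an ensemble of depth models disagrees most.

    ensemble_preds: (n_models, n_frames, H, W) depth predictions in meters.
    Returns indices of the `budget` highest-uncertainty frames.
    """
    per_pixel_var = ensemble_preds.var(axis=0)          # disagreement per pixel
    frame_uncertainty = per_pixel_var.mean(axis=(1, 2))
    return np.argsort(frame_uncertainty)[::-1][:budget]

preds = np.random.rand(5, 1000, 48, 64)                 # 5 models, 1,000 frames
to_label = select_frames_for_labeling(preds, budget=100)
print(to_label[:10])
```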

Depth sensor calibration adds another cost layer: stereo camera rigs require checkerboard calibration every 50-100 hours of operation to maintain sub-pixel disparity accuracy, and LiDAR-camera extrinsic calibration demands specialized targets and tooling that Kognic's annotation platform automates but still requires 2-4 hours per sensor suite. Claru's kitchen task datasets include pre-calibrated RGB-D streams from RealSense cameras with per-frame depth quality metrics, reducing buyer integration time from weeks to days, but the datasets cover only 12 object categories and 8 kitchen layouts — insufficient for general-purpose manipulation[14]. The data provenance requirements that enterprise buyers demand (sensor metadata, calibration logs, annotator inter-rater agreement) add 15-25% to collection costs but are non-negotiable for safety-critical applications where depth estimation errors can cause collisions or product damage.
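
For reference, the periodic checkerboard recalibration mentioned above reduces to a few OpenCV calls. Board geometry and file paths below are placeholders, and the reprojection RMS is the number worth logging alongside the dataset:

```python
import glob
import cv2
import numpy as np

# 9x6 inner-corner checkerboard with 25 mm squares (placeholder geometry).
pattern, square_m = (9, 6), 0.025
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * square_m

obj_pts, img_pts, size = [], [], None
for path in glob.glob("calib/*.png"):                   # placeholder capture dir
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        obj_pts.append(objp)
        img_pts.append(corners)
        size = gray.shape[::-1]

rms, K, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts, size, None, None)
print(f"reprojection RMS: {rms:.3f} px")                # log this with the dataset
```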

Depth data formats: HDF5, MCAP, Parquet, and point cloud serialization

Robot depth datasets use heterogeneous serialization formats that complicate cross-dataset training: HDF5 stores multi-dimensional arrays with hierarchical metadata and is a common container for teleoperation episode logs, but HDF5 streams poorly over network storage and typical episode loaders read whole trajectories into memory, a bottleneck for datasets exceeding 500GB. MCAP is a container format designed for timestamped robotics logs that supports random access and incremental writes, making it the preferred format for DROID's 76,000 trajectories, but MCAP tooling is less mature than HDF5 and lacks native support in TensorFlow Datasets[15].
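
The in-memory complaint can be partially mitigated with chunked HDF5 layouts, which permit per-frame reads even though whole-file streaming over object storage remains awkward. An illustrative episode layout using h5py (this is a sketch, not RLDS's or DROID's actual schema):

```python
import h5py
import numpy as np

depth = np.random.rand(300, 480, 640).astype(np.float32)  # 300-frame episode

with h5py.File("episode_0001.h5", "w") as f:
    d = f.create_dataset("depth", data=depth, chunks=(1, 480, 640),
                         compression="gzip")              # one frame per chunk
    d.attrs["sensor"] = "RealSense D455"
    d.attrs["units"] = "meters"

with h5py.File("episode_0001.h5", "r") as f:
    frame_42 = f["depth"][42]      # chunked layout: reads one frame, not all 300
    print(frame_42.shape, f["depth"].attrs["sensor"])
```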

Point cloud data from LiDAR sensors is typically stored in PCD (Point Cloud Data) or LAS formats, but these are optimized for static scenes rather than time-series robot trajectories. The Open X-Embodiment dataset converted all depth data to RLDS format for consistency, but this required custom preprocessing scripts for each source dataset and introduced a 6-month integration delay[5]. Apache Parquet offers columnar compression and fast filtering for tabular data, and Hugging Face Datasets uses Parquet as its default backend, but Parquet does not natively support 3D point clouds or multi-channel depth images without flattening them into byte arrays that lose spatial indexing.
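
What "flattening into byte arrays" looks like in practice with PyArrow: the depth frame survives, but shape and dtype must travel as sidecar columns, and any spatial query now requires full deserialization. An illustrative round trip (column names are assumptions):

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

frames = [np.random.rand(480, 640).astype(np.float32) for _ in range(10)]

table = pa.table({
    "frame_idx": pa.array(range(len(frames)), type=pa.int32()),
    "depth_bytes": pa.array([f.tobytes() for f in frames], type=pa.binary()),
    "height": pa.array([f.shape[0] for f in frames], type=pa.int32()),
    "width": pa.array([f.shape[1] for f in frames], type=pa.int32()),
    "dtype": pa.array([str(f.dtype) for f in frames]),
})
pq.write_table(table, "depth_episode.parquet")

# Round trip: the consumer must rebuild the array from shape + dtype columns.
row = pq.read_table("depth_episode.parquet").slice(3, 1).to_pylist()[0]
depth = np.frombuffer(row["depth_bytes"], dtype=row["dtype"]).reshape(
    row["height"], row["width"])
print(depth.shape)
```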

LeRobot's dataset format wraps HDF5 episodes with JSON metadata files that declare sensor types, frame rates, and coordinate systems, enabling cross-dataset training without manual schema alignment. However, LeRobot's format assumes synchronized RGB-D streams at fixed frame rates, which excludes event-based depth cameras and variable-rate LiDAR that some mobile robots use[16]. The Safetensors format provides fast, safe tensor serialization for model weights but is not designed for raw sensor data, meaning depth datasets require a two-stage pipeline: raw data in HDF5/MCAP, processed tensors in Safetensors for training. This format fragmentation increases storage costs by 30-50% due to duplication and complicates data provenance tracking when preprocessing steps are not versioned alongside raw captures.
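
An illustrative sidecar metadata file in the spirit of that approach (this is not LeRobot's exact schema): roughly the minimum a cross-dataset trainer needs to align episodes without manual inspection. All field names and values below are assumptions for illustration:

```python
import json

# Illustrative sidecar metadata (NOT LeRobot's exact schema): the fields a
# cross-dataset trainer needs to align episodes without manual inspection.
meta = {
    "episode": "episode_0001.h5",
    "fps": 30,
    "coordinate_frame": "camera_optical",   # +Z forward, +X right, +Y down
    "sensors": [
        {"name": "wrist_rgbd", "model": "RealSense D455",
         "depth_units": "meters", "resolution": [640, 480],
         "intrinsics": {"fx": 631.2, "fy": 631.0, "cx": 320.5, "cy": 239.8}},
    ],
    "calibration_log": "calib/2025-06-01_rig3.json",
}
with open("episode_0001.meta.json", "w") as fh:
    json.dump(meta, fh, indent=2)
```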

Annotation quality: ground-truth depth vs. pseudo-labels vs. self-supervision

Ground-truth depth annotation requires either structured light scanners (sub-millimeter accuracy, $50,000-$200,000 hardware cost) or manual measurement with calipers for sparse keypoints — both prohibitively expensive for datasets exceeding 10,000 frames. NYU Depth V2 used Kinect's structured light depth as ground truth but acknowledged that Kinect depth has 1-3cm error at 3-meter range and fails entirely on black or reflective surfaces, meaning the "ground truth" itself contains systematic noise[6].

Pseudo-labeling generates depth maps from pre-trained monocular depth models like Depth Anything V2 and uses them as training labels for downstream tasks, reducing annotation cost to near-zero but introducing model bias. The Open X-Embodiment dataset used pseudo-labels for 30% of its depth data to expand coverage, but models trained on this data exhibited 12-18% higher error on transparent objects compared to models trained on sensor-captured depth — a gap that persists even after fine-tuning[5].
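
Generating such pseudo-labels is a few lines with the Hugging Face depth-estimation pipeline. The checkpoint id below is one published Depth Anything V2 variant and should be treated as an assumption, and the output is relative rather than metric depth:

```python
import numpy as np
from PIL import Image
from transformers import pipeline

# Assumed checkpoint: a small published Depth Anything V2 variant on the Hub.
estimator = pipeline("depth-estimation",
                     model="depth-anything/Depth-Anything-V2-Small-hf")

img = Image.open("frame_000123.png")                # placeholder RGB frame
out = estimator(img)
pseudo = np.array(out["depth"], dtype=np.float32)   # relative, not metric, depth

# Caveat from the text: pseudo-labels inherit the teacher's bias, so flag
# pseudo-labeled frames in metadata rather than mixing them in silently.
np.save("frame_000123.pseudo_depth.npy", pseudo)
```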

Self-supervised depth learning from stereo pairs or monocular video sequences eliminates annotation cost entirely by using photometric consistency as a training signal, but this approach assumes static scenes and Lambertian surfaces — assumptions violated by moving objects, specular reflections, and transparent materials. RT-2's vision-language-action model transferred web-scale vision-language pre-training (billions of image-text pairs) to robotic control before fine-tuning on 130,000 robot demonstrations, achieving 62% success on unseen tasks, but depth estimation quality remained the primary reported failure mode for tasks involving transparent containers[17].
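
A minimal PyTorch sketch of the photometric signal for a rectified stereo pair: warp the right image into the left view using predicted disparity and penalize the reconstruction error. The loss is blind precisely where the Lambertian and static-scene assumptions fail (function names and the toy data are illustrative):

```python
import torch
import torch.nn.functional as F

def photometric_loss(left, right, disp):
    """L1 photometric error after warping `right` into the left view.

    left, right: (B,3,H,W) rectified pair; disp: (B,1,H,W) disparity in px.
    Assumes static, Lambertian scenes -- exactly the assumptions that
    transparent and specular objects violate.
    """
    b, _, h, w = right.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    xs = xs.expand(b, -1, -1) - disp.squeeze(1)      # shift by disparity
    ys = ys.expand(b, -1, -1)
    grid = torch.stack((2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1), dim=-1)
    recon = F.grid_sample(right, grid, align_corners=True)
    return (left - recon).abs().mean()

left = torch.rand(2, 3, 48, 64)
right = torch.roll(left, shifts=-4, dims=3)          # toy 4 px disparity
disp = torch.full((2, 1, 48, 64), 4.0)
print(photometric_loss(left, right, disp).item())    # near zero away from borders
```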

Dataloop's annotation platform supports hybrid workflows where annotators correct pseudo-labels rather than labeling from scratch, reducing per-frame cost from $0.80 to $0.25 while maintaining 95%+ accuracy on opaque objects. However, transparent and specular objects still require full manual annotation, and Kognic's quality metrics show that annotator agreement on depth boundaries for glass objects is only 78% even with expert labelers, compared to 96% agreement for opaque objects[18]. This quality ceiling means that safety-critical applications like surgical robotics or pharmaceutical handling cannot rely on pseudo-labeled depth data and must budget for full ground-truth collection at 3-5× the cost of standard datasets.

Depth dataset procurement: licensing, provenance, and commercial-use terms

Depth dataset licensing is fragmented: NYU Depth V2 prohibits commercial use without permission, ScanNet requires case-by-case negotiation for commercial licenses, and RoboNet's MIT license includes dataset-specific attribution terms that legal teams interpret inconsistently[8]. The Creative Commons BY 4.0 license permits commercial use with attribution, but fewer than 15% of open depth datasets use CC-BY, and those that do often lack the sensor metadata and calibration logs that enterprise buyers require for production deployment[19].

Data provenance tracking is critical for depth datasets because sensor calibration drift, firmware updates, and environmental conditions (temperature, humidity) affect depth accuracy in ways that are invisible in the raw data. The C2PA (Coalition for Content Provenance and Authenticity) standard embeds cryptographic metadata in media files to track capture device, timestamp, and processing history, but C2PA adoption in robotics datasets is under 5% as of 2025, meaning most depth data lacks verifiable provenance[20].

Government procurement rules add another layer: FAR Subpart 27.4 requires U.S. federal contractors to deliver "unlimited rights" data or negotiate restricted-use terms upfront, but most open depth datasets do not specify government-use terms, forcing contractors to exclude them from bids or spend 6-12 weeks negotiating custom licenses. GDPR Article 7 requires explicit consent for personal data collection, and depth cameras in homes or offices may capture identifiable individuals in RGB streams even if depth maps are anonymized — a compliance gap that EPIC-KITCHENS addressed by blurring faces but that most robotics datasets ignore[21].

Truelabel's marketplace requires sellers to declare commercial-use terms, sensor provenance, and GDPR compliance status for every dataset, reducing buyer due diligence time from weeks to hours. However, the marketplace does not yet support escrow for custom collection contracts or milestone-based payments, meaning buyers commissioning 50,000+ frame depth datasets still face counterparty risk that traditional data vendors mitigate through performance bonds and service-level agreements[22].


External references and source context

  1. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World

    Domain randomization improves sim-to-real transfer for depth estimation models

    arXiv
  2. ScanNet project site

    Fewer than 8% of open depth datasets include annotated sensor-specific failure modes

    scan-net.org
  3. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID depth quality degrades beyond 1.5 meters, limiting mobile manipulation range

    arXiv
  4. Model Cards for Model Reporting

    Model cards for model reporting provide transparency but are rarely used for datasets

    arXiv
  5. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment aggregated 1 million trajectories from 22 robot embodiments across 21 institutions with heterogeneous sensors

    arXiv
  6. Indoor Segmentation and Support Inference from RGBD Images (NYU Depth V2)

    NYU Depth V2 contains 464 indoor scenes captured via Microsoft Kinect

    arXiv
  7. ScanNet project site

    ScanNet provides 1,513 indoor scans with depth, surface normals, and semantic segmentation

    scan-net.org
  8. RoboNet dataset license

    RoboNet MIT license includes dataset-specific attribution terms

    GitHub raw content
  9. RT-1: Robotics Transformer for Real-World Control at Scale

    RT-1 used 130,000 real robot demonstrations rather than synthetic pre-training

    arXiv
  10. NVIDIA Cosmos World Foundation Models

    NVIDIA Cosmos generates synthetic depth maps from text prompts and 2D images

    NVIDIA Developer
  11. RLBench: The Robot Learning Benchmark & Learning Environment

    RLBench provides simulation benchmark with 100+ manipulation tasks

    arXiv
  12. Scale AI Physical AI data engine

    Scale AI's Physical AI data engine pairs teleoperation with multi-sensor depth capture

    scale.com
  13. Encord active learning

    Active learning requires initial model trained on 5,000-10,000 labeled examples

    encord.com
  14. Kitchen Task Training Data for Robotics

    Claru kitchen datasets include pre-calibrated RGB-D streams with quality metrics

    claru.ai
  15. MCAP file format

    MCAP is a columnar container format designed for robotics logs

    mcap.dev
  16. LeRobot dataset documentation

    LeRobot dataset format wraps HDF5 episodes with JSON metadata

    Hugging Face
  17. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 transfers web-scale vision-language pre-training to robotic control

    arXiv
  18. Kognic annotation platform

    Kognic automates LiDAR-camera extrinsic calibration

    kognic.com
  19. Attribution 4.0 International deed

    Creative Commons BY 4.0 permits commercial use with attribution

    Creative Commons
  20. C2PA Technical Specification

    C2PA standard embeds cryptographic metadata for content provenance

    C2PA
  21. GDPR Article 7 — Conditions for consent

    GDPR Article 7 requires explicit consent for personal data collection

    GDPR-Info.eu
  22. truelabel physical AI data marketplace bounty intake

    Truelabel marketplace requires upfront commercial-use term declarations

    truelabel.ai

FAQ

What is the minimum depth dataset size for training a production manipulation model?

Production manipulation models require 20,000-50,000 RGB-D pairs covering 100-200 distinct scenes to generalize across lighting variation, clutter density, and object diversity. Models trained on fewer than 10,000 examples exhibit 25-40% higher failure rates on transparent objects and specular surfaces that open benchmarks underrepresent. Active learning pipelines can reduce annotation volume by 40-60% by prioritizing high-uncertainty frames, but this requires an initial model trained on 5,000-10,000 labeled examples.

How do RGB-D cameras compare to LiDAR for robot depth perception?

RGB-D cameras like Intel RealSense D455 provide dense 1280×720 depth maps (up to 90 FPS at reduced resolution) for $329 but fail outdoors under direct sunlight and have limited range (1-6 meters). LiDAR sensors like Ouster OS1-64 offer 64-beam 120-meter range and millimeter accuracy for $18,000 but produce sparse point clouds that require interpolation or learned completion. Indoor mobile manipulators typically use RGB-D for cost and density; outdoor autonomous vehicles use LiDAR for range and reliability.

What licensing restrictions affect commercial use of open depth datasets?

NYU Depth V2 prohibits commercial use without explicit permission; ScanNet uses Creative Commons BY-NC 4.0 requiring case-by-case negotiation; RoboNet's MIT license includes dataset-specific attribution terms that some legal teams interpret as incompatible with proprietary deployment. Fewer than 15% of open depth datasets use permissive licenses like CC-BY 4.0 that allow unrestricted commercial use with attribution. Government contractors face additional restrictions under FAR Subpart 27.4 requiring unlimited-rights data or pre-negotiated terms.

How does synthetic depth data compare to real-world collection for training?

Synthetic depth data from domain randomization or diffusion-based generation enables rapid dataset expansion but exhibits sim-to-real gaps in sensor noise, motion blur, and material reflectance. Research shows models trained on 100,000 synthetic images plus 5,000 real examples outperform models trained on 50,000 real examples alone, but only when synthetic data includes sensor-specific noise models. Safety-critical applications like surgical robotics cannot rely on synthetic test sets and require real-world ground truth for validation.

What annotation quality is required for depth datasets in safety-critical applications?

Safety-critical applications like autonomous forklifts or surgical robots require ground-truth depth with sub-5mm accuracy, achievable only with structured light scanners ($50,000-$200,000) or manual measurement. Pseudo-labels from pre-trained models introduce 12-18% higher error on transparent objects; self-supervised methods assume static scenes and fail on moving objects. Annotator agreement on depth boundaries for glass objects is only 78% even with expert labelers, compared to 96% for opaque objects, meaning transparent-object datasets require full manual annotation at 3-5× standard cost.

What data formats are used for robot depth datasets and why does it matter?

HDF5 stores multi-dimensional arrays with metadata and is the default for RLDS but requires loading entire episodes into memory. MCAP supports random access and incremental writes for large datasets but has less mature tooling. PCD and LAS formats optimize static point clouds but not time-series trajectories. Apache Parquet offers columnar compression but does not natively support 3D point clouds. Format fragmentation increases storage costs by 30-50% due to duplication and complicates provenance tracking when preprocessing steps are not versioned alongside raw captures.

Looking for depth sensing training data?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

Post a depth data request