Physical AI Data Glossary
RGB-D Data
RGB-D data combines a standard RGB color image with a spatially aligned depth map, where each pixel stores metric distance from the camera to the surface. This multimodal format enables robots to perceive both visual appearance and 3D geometry simultaneously, making it the dominant modality for indoor manipulation, navigation, and scene understanding in physical AI systems.
Quick facts
- Term: RGB-D Data
- Domain: Robotics and physical AI
- Last reviewed: 2025-05-15
What RGB-D Data Represents in Physical AI Systems
RGB-D data encodes both photometric and geometric information in a single synchronized capture. The RGB channels store standard 8-bit color values (red, green, blue) per pixel, while the depth channel stores a floating-point or 16-bit integer distance measurement in millimeters or meters[1]. Given camera intrinsic parameters (focal length, principal point, distortion coefficients), each depth pixel can be backprojected into 3D space, transforming the 2D image into a metric point cloud.
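A minimal backprojection sketch makes this concrete. It assumes an undistorted pinhole model with intrinsics fx, fy, cx, cy and a depth map already in meters, which simplifies a full calibration pipeline (real sensors require undistortion first).

```python
import numpy as np

def backproject(depth_m, fx, fy, cx, cy):
    """Lift a depth map (meters) to a 3D point cloud in the camera frame."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1)        # (H, W, 3) in camera coordinates
    valid = np.isfinite(z) & (z > 0)             # drop NaN / zero (invalid) pixels
    return points[valid]                         # (N, 3) metric point cloud

# Example: a flat wall 1.5 m away seen by a camera with a 615 px focal length.
depth = np.full((480, 640), 1.5, dtype=np.float32)
cloud = backproject(depth, fx=615.0, fy=615.0, cx=320.0, cy=240.0)
```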
This dual representation solves a fundamental limitation of monocular vision: scale ambiguity. A small object close to the camera produces the same 2D projection as a large object far away. Depth measurements break this degeneracy, enabling robots to compute object dimensions, plan collision-free paths, and position end-effectors at precise 3D coordinates. Scale AI's physical AI platform processes RGB-D streams for manipulation policy training, while NVIDIA Cosmos world foundation models use depth as a geometric prior for scene generation[2].
The spatial alignment requirement is critical: RGB and depth frames must be captured simultaneously and registered to the same coordinate frame. Misalignment of even 2-3 pixels degrades downstream tasks like grasp detection and semantic segmentation. Modern sensors achieve sub-pixel registration through factory calibration, but multi-sensor annotation platforms still require manual verification of alignment quality during dataset curation.
Depth Acquisition Technologies and Trade-offs
Three primary technologies generate depth maps: structured light, time-of-flight (ToF), and stereo triangulation. Structured light projectors (used in the early Microsoft Kinect and the Structure Sensor) emit infrared patterns and compute depth from pattern deformation. ToF sensors (Azure Kinect, Helios2) measure the round-trip time of modulated light pulses. Stereo systems (ZED cameras, OAK-D, and the Intel RealSense D400 series, which adds an infrared projector for active stereo) compute disparity between two synchronized cameras.
Each technology exhibits distinct failure modes. Structured light fails on transparent, reflective, or black surfaces that absorb infrared. ToF suffers from multipath interference in corners and exhibits range-dependent noise (±10mm at 1m, ±40mm at 4m for Azure Kinect). Stereo requires textured surfaces and fails in low-light or repetitive-pattern environments. DROID's 350-hour manipulation dataset documents per-scene depth quality metrics, showing 18% invalid pixel rates in kitchen environments with glossy countertops[3].
Outdoor robotics typically uses LiDAR instead of RGB-D due to range limitations (most RGB-D sensors cap at 3-6 meters). Waymo Open Dataset combines 64-beam LiDAR with five RGB cameras but does not provide dense RGB-D alignment. Indoor datasets like ScanNet use Structure Sensor or RealSense for room-scale reconstruction, achieving 5cm mesh accuracy across 1,513 scenes.
Standard File Formats and Storage Architectures
RGB-D datasets use three dominant storage patterns. ROS bag files store synchronized RGB and depth topics as compressed image messages; they are widely used in robotics research but require ROS tooling for access. HDF5 hierarchical containers group RGB arrays, depth arrays, and metadata in a single file with chunked compression, enabling random access without loading full trajectories. MCAP is a newer container format designed for multi-sensor logs, offering 40% smaller files than ROS bags and native Parquet export[4].
Depth encoding varies by precision requirements. Millimeter-precision datasets store depth as uint16 (0-65,535mm range), while research datasets often use float32 for sub-millimeter accuracy or to encode invalid pixels as NaN. EPIC-KITCHENS-100 stores depth as 16-bit PNG with a 0.1mm scale factor, achieving 10:1 compression over raw float arrays. LeRobot's dataset schema uses Parquet for tabular metadata and separate HDF5 files for image tensors, balancing query performance with storage density.
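The uint16 pattern is straightforward to implement. The sketch below assumes OpenCV for PNG I/O and a 0.1 mm scale factor; the scale factor is a dataset convention that must be recorded in metadata, because nothing in the file itself encodes it.

```python
import numpy as np
import cv2

DEPTH_SCALE_M = 0.0001   # assumed convention: one uint16 step = 0.1 mm

def save_depth_png(depth_m, path):
    # Map invalid pixels (NaN or <= 0) to 0, the conventional "no measurement" value.
    counts = np.nan_to_num(depth_m, nan=0.0) / DEPTH_SCALE_M
    cv2.imwrite(path, np.clip(np.round(counts), 0, 65535).astype(np.uint16))  # lossless 16-bit PNG

def load_depth_png(path):
    counts = cv2.imread(path, cv2.IMREAD_UNCHANGED).astype(np.float32)
    depth_m = counts * DEPTH_SCALE_M
    depth_m[counts == 0] = np.nan        # restore invalid pixels
    return depth_m
```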
Point cloud representations (PCD, LAS, PLY) are common for scene reconstruction but rare in manipulation datasets, where per-pixel correspondence to RGB is essential. PointNet architectures can consume raw point clouds, but most manipulation policies require dense depth maps to preserve spatial locality for convolutional encoders.
Annotation Workflows for Manipulation Datasets
RGB-D annotation combines 2D image labeling with 3D geometric constraints. Semantic segmentation masks must align precisely with depth discontinuities — a 2-pixel mask error at an object boundary translates to 5-10cm position error in 3D. Segments.ai's multi-sensor platform overlays 2D polygons on RGB while displaying the corresponding 3D point cloud, enabling annotators to verify mask boundaries against geometric edges.
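As a rough sanity check on those numbers: the in-plane 3D error from a boundary error of a few pixels scales with depth over focal length, and sampling depth on the wrong side of a discontinuity adds the full foreground-to-background gap. The sketch below assumes a pinhole model and illustrative values.

```python
import math

def mask_error_to_3d(boundary_error_px, depth_m, fx_px, depth_jump_m=0.0):
    """Approximate 3D error caused by a 2D mask boundary error (pinhole model)."""
    lateral_m = boundary_error_px * depth_m / fx_px     # in-plane error
    return math.hypot(lateral_m, depth_jump_m)          # add the depth jump if the error crosses an edge

# 2 px at 1 m with fx = 600 px is only ~3 mm in-plane ...
print(mask_error_to_3d(2, 1.0, 600.0))          # ~0.003 m
# ... but crossing an 8 cm foreground/background gap dominates the total error.
print(mask_error_to_3d(2, 1.0, 600.0, 0.08))    # ~0.08 m
```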
Grasp pose annotation requires 6-DOF labels (3D position + 3-axis rotation) that respect depth geometry. Annotators place virtual gripper models in the 3D scene, with the platform automatically snapping to valid surface points using depth data. DexYCB dataset provides 582,000 grasp annotations with sub-centimeter accuracy by using depth-based collision checking during labeling. Invalid grasps (gripper intersecting object mesh) are flagged automatically, reducing annotation error rates from 12% to 3%[5].
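A simplified version of the surface-snapping step might look like the following sketch. It assumes a pinhole camera and a depth map in meters, and uses a local median as a stand-in for whatever surface fitting a production annotation tool applies.

```python
import numpy as np

def snap_to_surface(grasp_xyz, depth_m, fx, fy, cx, cy, window=5):
    """Replace a candidate grasp position with the nearest measured surface point."""
    x, y, z = grasp_xyz
    u = int(round(fx * x / z + cx))                      # project into the depth image
    v = int(round(fy * y / z + cy))
    patch = depth_m[max(v - window, 0): v + window + 1,
                    max(u - window, 0): u + window + 1]
    valid = patch[np.isfinite(patch) & (patch > 0)]
    if valid.size == 0:
        return None                                      # no surface measurement nearby
    z_surf = float(np.median(valid))                     # robust local surface depth
    return np.array([(u - cx) * z_surf / fx,             # backproject at the snapped depth
                     (v - cy) * z_surf / fy,
                     z_surf])
```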
Kognic's annotation platform supports multi-frame RGB-D sequences, tracking object identities across frames while maintaining 3D consistency. Annotators label keyframes, and the system propagates masks using optical flow and depth-based motion estimation. This semi-automated workflow achieves 4x throughput compared to per-frame manual annotation, critical for the 100+ hour datasets required by modern manipulation policies.
RGB-D in Reinforcement Learning and Imitation Learning
Manipulation policies consume RGB-D data in three primary forms: raw image pairs, voxel grids, or learned embeddings. RT-1 (Robotics Transformer) processes RGB and depth as separate 6-channel inputs to a Vision Transformer, achieving 97% success on 700+ tasks across 13 robots[6]. OpenVLA concatenates RGB and depth into a 4-channel tensor (RGB + D) before passing to a pretrained vision encoder, reducing parameter count by 30% compared to dual-encoder architectures.
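The 4-channel fusion pattern is easy to sketch generically. The PyTorch module below is illustrative only: the layer sizes and the 4 m depth normalization are assumptions, not any published model's architecture.

```python
import torch
import torch.nn as nn

class RGBDEncoder(nn.Module):
    """Toy encoder over a fused 4-channel (RGB + depth) observation."""
    def __init__(self, embed_dim=512, max_depth_m=4.0):
        super().__init__()
        self.max_depth_m = max_depth_m
        self.backbone = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=7, stride=2, padding=3),   # 4 input channels: R, G, B, D
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, rgb, depth):
        # rgb: (B, 3, H, W) in [0, 1]; depth: (B, 1, H, W) in meters.
        depth = torch.clamp(depth, 0.0, self.max_depth_m) / self.max_depth_m
        return self.backbone(torch.cat([rgb, depth], dim=1))

emb = RGBDEncoder()(torch.rand(2, 3, 224, 224), torch.rand(2, 1, 224, 224) * 3.0)
```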
Depth provides critical inductive bias for sim-to-real transfer. Domain randomization varies RGB appearance (lighting, textures) but preserves geometric structure in depth, enabling policies to generalize across visual domains while maintaining spatial reasoning. CALVIN benchmark demonstrates 40% higher sim-to-real success rates when policies train on RGB-D versus RGB-only, particularly for tasks requiring precise 3D positioning like drawer opening.
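In code, the asymmetry is simply that photometric augmentations touch the RGB channels while depth passes through unchanged. The snippet below assumes torchvision-style transforms; the jitter strengths are illustrative.

```python
import torch
from torchvision.transforms import ColorJitter

color_jitter = ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1)

def randomize_observation(rgb, depth):
    # rgb: (3, H, W) float tensor in [0, 1]; depth: (1, H, W) in meters.
    return color_jitter(rgb), depth          # photometric noise on RGB only; geometry untouched

rgb_aug, depth_same = randomize_observation(torch.rand(3, 224, 224), torch.rand(1, 224, 224) * 3.0)
```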
RLDS (Reinforcement Learning Datasets) standardizes RGB-D storage for policy training, defining schemas for synchronized observation spaces and action trajectories. Open X-Embodiment aggregates datasets spanning 22 robot embodiments (1M+ episodes) in RLDS format, with 18 of the constituent datasets including depth channels. This standardization enables cross-dataset pretraining, where policies learn geometric priors from diverse RGB-D sources before fine-tuning on target tasks.
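An episode in this style can be pictured as nested observation/action steps. The field names below are illustrative assumptions rather than the exact RLDS or Open X-Embodiment schema.

```python
import numpy as np

rgb_frame = np.zeros((480, 640, 3), dtype=np.uint8)
depth_frame = np.zeros((480, 640), dtype=np.uint16)       # depth in millimeters

episode = {
    "steps": [
        {
            "observation": {
                "image": rgb_frame,
                "depth": depth_frame,
                "camera_intrinsics": [615.0, 615.0, 320.0, 240.0],   # fx, fy, cx, cy
            },
            "action": np.zeros(7, dtype=np.float32),      # e.g. end-effector delta pose + gripper
            "is_terminal": False,
        },
        # ... one entry per timestep
    ],
    "episode_metadata": {"robot": "franka", "task": "open_drawer"},
}
```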
Procurement Considerations for RGB-D Training Data
RGB-D dataset procurement requires validating four technical dimensions: sensor specifications, calibration quality, annotation density, and format compatibility. Sensor specs must match deployment hardware — training on RealSense D435 data (87° H-FOV, 1280×720 depth) then deploying on Azure Kinect (75° H-FOV, 640×576 depth) introduces distribution shift that degrades policy performance by 15-25%[7].
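Resolution mismatch can be partially mitigated by resampling depth and rescaling intrinsics together, though a genuine field-of-view difference (87° versus 75°) cannot be resized away. The sketch below assumes a simple resize with nearest-neighbor interpolation, which avoids blending foreground and background depths at object edges.

```python
import numpy as np
import cv2

def resize_depth_with_intrinsics(depth, fx, fy, cx, cy, new_w, new_h):
    """Resample a depth map and rescale intrinsics consistently."""
    h, w = depth.shape
    sx, sy = new_w / w, new_h / h
    resized = cv2.resize(depth, (new_w, new_h), interpolation=cv2.INTER_NEAREST)
    return resized, (fx * sx, fy * sy, cx * sx, cy * sy)

depth_720p = (np.random.rand(720, 1280).astype(np.float32) * 3.0)
depth_small, intrinsics = resize_depth_with_intrinsics(depth_720p, 640.0, 640.0, 640.0, 360.0, 640, 360)
```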
Calibration artifacts are common in commercial RGB-D datasets. Depth-RGB misalignment exceeding 3 pixels occurs in 8-12% of frames in datasets collected without per-session calibration verification. Truelabel's data provenance system tracks calibration timestamps, sensor serial numbers, and alignment validation metrics, enabling buyers to filter low-quality captures before training. Datasets lacking this metadata require manual quality audits, adding 20-40 hours per 1,000 episodes.
Annotation density varies widely: BridgeData V2 provides sparse keyframe labels (1 per 10 seconds), while DROID offers dense per-frame segmentation masks. Sparse labels suffice for imitation learning with temporal smoothing, but dense labels are mandatory for semantic SLAM and scene understanding tasks. Buyers should specify required annotation frequency in procurement contracts to avoid scope gaps.
Truelabel's physical AI marketplace indexes RGB-D datasets by sensor type, annotation schema, and task domain, with per-episode pricing starting at $0.80 for teleoperation data and $4.50 for densely annotated manipulation sequences. License terms must address derivative model rights — some academic datasets prohibit commercial use of policies trained on their RGB-D data.
External references and source context
- [1] Point Cloud Library documentation. Defines standard 3D data structures and backprojection math. Source: Point Cloud Library.
- [2] NVIDIA GR00T N1 technical report. Details depth integration in world foundation models. Source: arXiv.
- [3] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. Reports 18% invalid depth pixel rates in kitchen scenes. Source: arXiv.
- [4] MCAP specification. Details 40% file size reduction versus ROS bags. Source: MCAP.
- [5] DexYCB project site. Reports 3% annotation error rate with depth collision checking. Source: dex-ycb.github.io.
- [6] RT-1: Robotics Transformer for Real-World Control at Scale. Reports 97% success on 700+ tasks across 13 robots. Source: arXiv.
- [7] Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning. Documents 15-25% performance degradation from sensor distribution shift. Source: arXiv.
FAQ
What is the difference between RGB-D data and point clouds?
RGB-D data is a dense 2D grid where each pixel has color (RGB) and depth (D) values, maintaining the image structure and pixel correspondence. Point clouds are unordered 3D coordinate sets (X, Y, Z) with optional color attributes, generated by backprojecting RGB-D pixels into 3D space. RGB-D preserves spatial locality for convolutional networks, while point clouds enable rotation-invariant processing with architectures like PointNet. Most manipulation datasets store RGB-D because policies need dense per-pixel features; point clouds are used primarily for scene reconstruction and registration tasks.
Can RGB-D datasets be used for outdoor robotics applications?
RGB-D sensors have limited outdoor utility due to 3-6 meter range constraints and infrared interference from sunlight. Structured light and ToF depth sensors fail in bright outdoor conditions because ambient infrared overwhelms the projected signal. Outdoor robotics relies on LiDAR (50-200m range) combined with separate RGB cameras, not fused RGB-D. However, RGB-D remains viable for outdoor manipulation tasks within arm's reach (bin picking, agricultural harvesting) where the robot operates in the sensor's 0.5-3m sweet spot. Datasets like Waymo Open provide LiDAR + RGB but not aligned RGB-D.
How do I convert RGB-D data between ROS bags and HDF5 formats?
Use the rosbag2 Python API to extract synchronized RGB and depth topics, then write the arrays to HDF5 with h5py. Key steps: read the /camera/color/image_raw and /camera/depth/image_rect topics from the bag, verify that timestamps match within 10ms, decode the image messages to NumPy arrays, and store them in HDF5 datasets with chunking (e.g., chunks=(1, 480, 640) for per-frame access). Preserve camera intrinsics (fx, fy, cx, cy) as HDF5 attributes. For MCAP conversion, use the mcap Python library to write Image and CameraInfo messages. LeRobot provides conversion scripts for common formats in its GitHub repository.
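A hedged end-to-end sketch of those steps, using rosbag2_py and h5py, might look like the following. The topic names, encodings, 10 ms tolerance, and intrinsics are assumptions to adapt to your sensor and driver.

```python
import numpy as np
import h5py
import rosbag2_py
from rclpy.serialization import deserialize_message
from sensor_msgs.msg import Image

def read_topic(bag_path, topic):
    """Return a list of (timestamp_ns, image_array) for one image topic in a ROS 2 bag."""
    reader = rosbag2_py.SequentialReader()
    reader.open(
        rosbag2_py.StorageOptions(uri=bag_path, storage_id="sqlite3"),
        rosbag2_py.ConverterOptions(input_serialization_format="cdr",
                                    output_serialization_format="cdr"),
    )
    frames = []
    while reader.has_next():
        name, raw, stamp_ns = reader.read_next()
        if name != topic:
            continue
        msg = deserialize_message(raw, Image)
        depth_like = msg.encoding == "16UC1"              # assumes rgb8 or 16UC1, no row padding
        dtype, channels = (np.uint16, 1) if depth_like else (np.uint8, 3)
        img = np.frombuffer(bytes(msg.data), dtype=dtype).reshape(msg.height, msg.width, channels)
        frames.append((stamp_ns, img.squeeze()))
    return frames

def convert(bag_path, out_path, tol_ns=10_000_000):        # 10 ms pairing tolerance
    color = read_topic(bag_path, "/camera/color/image_raw")
    depth = read_topic(bag_path, "/camera/depth/image_rect")
    pairs = []
    for tc, c in color:                                    # nearest-timestamp pairing
        td, d = min(depth, key=lambda item: abs(item[0] - tc))
        if abs(td - tc) <= tol_ns:
            pairs.append((c, d))
    rgb = np.stack([c for c, _ in pairs])
    dep = np.stack([d for _, d in pairs])
    with h5py.File(out_path, "w") as f:
        f.create_dataset("rgb", data=rgb, chunks=(1, *rgb.shape[1:]), compression="gzip")
        f.create_dataset("depth", data=dep, chunks=(1, *dep.shape[1:]), compression="gzip")
        f.attrs["intrinsics"] = [615.0, 615.0, 320.0, 240.0]   # fx, fy, cx, cy (example values)
```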
What annotation tools support RGB-D data labeling?
Segments.ai, CVAT, Labelbox, and V7 Darwin support RGB-D annotation with 3D visualization. Segments.ai overlays 2D masks on RGB while displaying the corresponding point cloud, enabling annotators to verify boundaries against depth discontinuities. CVAT supports depth as a separate layer but lacks native 3D viewers. Labelbox and V7 offer 3D cuboid tools that snap to depth surfaces. For manipulation-specific workflows (grasp poses, 6-DOF annotations), specialized tools like Kognic or custom Unity-based annotators are common. Open-source options include Label Studio with custom 3D plugins, though setup requires engineering effort.
How much RGB-D training data is required for manipulation policies?
Modern manipulation policies require 50-200 hours of RGB-D teleoperation data for single-task learning, and 500-2,000 hours for multi-task generalist models. RT-1 trained on 130,000 episodes (≈700 hours) across 700 tasks. OpenVLA used 970,000 trajectories from Open X-Embodiment. Data efficiency improves with pretraining: policies initialized on large RGB-D datasets can fine-tune on 10-50 task-specific demonstrations. Quality matters more than quantity — 20 hours of clean, diverse RGB-D data outperforms 100 hours of repetitive or low-variation captures. Budget $3,000-$8,000 per task for custom RGB-D collection and annotation.
What are common failure modes in RGB-D datasets that affect policy training?
Five critical failure modes: (1) depth-RGB misalignment from poor calibration, causing 3D position errors; (2) invalid depth pixels (NaN or zero values) on reflective/transparent surfaces, creating holes in point clouds; (3) temporal desynchronization where RGB and depth frames are captured 30-50ms apart, introducing motion artifacts; (4) depth noise at range limits (>3m) where sensor precision degrades; (5) compression artifacts in depth maps stored as lossy JPEG instead of lossless PNG. Policies trained on corrupted RGB-D data exhibit 20-40% lower success rates. Truelabel's marketplace filters datasets by calibration quality and invalid-pixel percentage to surface clean training data.
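A lightweight per-frame check for several of these failure modes can be scripted directly; the thresholds below are illustrative assumptions rather than standard values.

```python
import numpy as np

def depth_frame_report(depth_m, rgb_stamp_ns, depth_stamp_ns, max_range_m=3.0):
    """Flag invalid pixels, out-of-range depth, and RGB/depth desynchronization for one frame."""
    invalid = ~np.isfinite(depth_m) | (depth_m <= 0)
    report = {
        "invalid_pixel_fraction": float(invalid.mean()),
        "beyond_range_fraction": float((depth_m > max_range_m).mean()),
        "rgb_depth_desync_ms": abs(rgb_stamp_ns - depth_stamp_ns) / 1e6,
    }
    report["pass"] = (report["invalid_pixel_fraction"] < 0.15
                      and report["rgb_depth_desync_ms"] < 10.0)
    return report
```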
Find datasets covering RGB-D data
Truelabel surfaces vetted datasets and capture partners working with RGB-D data. Send us the modality, scale, and rights you need, and we'll route you to the closest match.
Browse RGB-D Datasets on Truelabel