
Glossary

Occupancy Grid

An occupancy grid is a probabilistic spatial representation that partitions 3D space into discrete voxels, each storing a belief about whether that region is free, occupied, or unknown. Robots fuse sensor streams—LiDAR, depth cameras, stereo vision—into this grid to perform collision checking, path planning, and object localization in real time.

Updated 2025-06-08
By truelabel
Reviewed by truelabel

Quick facts

Term: Occupancy Grid
Domain: Robotics and physical AI
Last reviewed: 2025-06-08

Core Representation and Probabilistic Update

Occupancy grids originated in the 1980s as a solution to the sensor fusion problem: how to combine noisy, partial observations into a coherent world model. Each voxel maintains a log-odds occupancy probability updated via Bayesian inference as new sensor readings arrive. The seminal 1989 paper by Elfes formalized this framework, establishing inverse sensor models that map range measurements to occupancy beliefs.
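
A minimal sketch of this log-odds update, assuming a fixed two-valued inverse sensor model (production systems derive per-beam values from range and noise characteristics):

```python
import numpy as np

# Elfes-style log-odds occupancy update. The two inverse-sensor-model values
# below are illustrative assumptions, not taken from any specific system.
L_OCC = np.log(0.7 / 0.3)    # log-odds added when a beam endpoint hits a voxel
L_FREE = np.log(0.3 / 0.7)   # log-odds added when a beam passes through a voxel
L_PRIOR = 0.0                # log-odds of the uninformative 0.5 prior

def update_voxels(log_odds, hit_idx, free_idx):
    """Fuse one scan: a single addition per voxel in log-odds space."""
    log_odds[hit_idx] += L_OCC - L_PRIOR
    log_odds[free_idx] += L_FREE - L_PRIOR
    return log_odds

def occupancy_probability(log_odds):
    """Recover P(occupied) from the accumulated log-odds."""
    return 1.0 - 1.0 / (1.0 + np.exp(log_odds))

grid = np.zeros((200, 200, 200), dtype=np.float32)  # 10 m cube at 5 cm
grid = update_voxels(grid, hit_idx=(50, 50, 10), free_idx=(50, 50, 9))
print(occupancy_probability(grid[50, 50, 10]))  # ~0.7 after a single hit
```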

Modern implementations use octree data structures to achieve memory efficiency—OctoMap compresses sparse 3D environments by storing only occupied and frontier voxels, reducing memory footprint by 90 percent compared to dense grids[1]. This matters for mobile robots with limited onboard compute: a 10-meter cube at 5cm resolution requires 8 million voxels in a dense grid but under 800,000 in an octree representation.

The probabilistic update rule is computationally cheap—each voxel update is a single addition in log-odds space—enabling real-time fusion of 30Hz LiDAR streams on embedded ARM processors. This efficiency makes occupancy grids the default spatial representation in ROS-based autonomy stacks and commercial warehouse robots.

Sensor Fusion Pipelines for Grid Construction

Building accurate occupancy grids requires fusing heterogeneous sensor modalities. LiDAR provides precise range measurements but sparse angular sampling; stereo cameras offer dense coverage but suffer from texture-dependent failures; depth cameras (structured light, time-of-flight) deliver dense depth but limited range. Production systems combine all three.

The fusion pipeline starts with sensor calibration—extrinsic transforms between LiDAR, camera, and IMU frames must be accurate to millimeter precision to avoid ghosting artifacts. Scale AI's physical-AI data engine processes multi-sensor calibration sequences where robots observe known fiducial patterns, generating ground-truth transforms that downstream models consume.

Temporal consistency is the second challenge. A robot moving at 1 meter per second with 100ms sensor latency will misalign observations by 10cm unless motion compensation is applied. High-quality training datasets for occupancy grid models include synchronized IMU streams and wheel odometry, enabling learnable motion models that predict voxel state changes between sensor frames[2]. The DROID dataset contains 76,000 manipulation trajectories with aligned depth, RGB, and proprioceptive data—critical for training grids that handle dynamic scenes.
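
As a minimal illustration of the motion compensation described above, the sketch below shifts a point cloud by the displacement accrued during sensor latency, assuming a constant-velocity motion model; real pipelines interpolate full SE(3) poses from IMU and odometry streams.

```python
import numpy as np

# Hedged sketch of latency compensation under a constant-velocity assumption.
# `points` is an (N, 3) cloud in the frame at capture time; we shift it into
# the frame at processing time using the odometry velocity and latency.
def compensate_latency(points, linear_velocity, latency_s):
    """Translate points by the ego-motion accrued during sensor latency."""
    displacement = np.asarray(linear_velocity) * latency_s  # 1 m/s * 0.1 s = 10 cm
    return points - displacement  # scene points shift opposite to ego-motion

cloud = np.random.rand(1000, 3) * 10.0
corrected = compensate_latency(cloud, linear_velocity=[1.0, 0.0, 0.0], latency_s=0.1)
```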

Collision Checking and Path Planning Integration

Occupancy grids serve as the primary collision-checking oracle for motion planners. Sampling-based planners like RRT and PRM query the grid millions of times per planning cycle—a 5-second plan may evaluate 100,000 candidate trajectories, each requiring 50-200 collision checks. Grid lookup must complete in microseconds.
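
A hedged sketch of such a batched lookup against a dense grid, assuming query points are expressed in the grid frame; function and variable names are illustrative:

```python
import numpy as np

# Vectorized collision query against a dense occupancy grid. Planners batch
# thousands of trajectory waypoints into a single fancy-indexed lookup.
def check_collisions(grid, points, origin, resolution, occupied_threshold=0.5):
    """Return a boolean mask: True where a waypoint lands in an occupied voxel."""
    idx = np.floor((points - origin) / resolution).astype(int)
    # Treat out-of-bounds queries as collisions (unknown space is unsafe).
    in_bounds = np.all((idx >= 0) & (idx < np.array(grid.shape)), axis=1)
    hits = np.ones(len(points), dtype=bool)
    ib = idx[in_bounds]
    hits[in_bounds] = grid[ib[:, 0], ib[:, 1], ib[:, 2]] > occupied_threshold
    return hits

grid = np.zeros((100, 100, 100), dtype=np.float32)  # 10 m cube at 10 cm
waypoints = np.random.rand(50_000, 3) * 10.0         # candidate trajectory samples
mask = check_collisions(grid, waypoints, origin=np.zeros(3), resolution=0.1)
```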

This performance constraint drives architectural choices. Grids are typically stored in GPU memory with CUDA kernels performing batch collision queries. NVIDIA's Cosmos world foundation models use learned occupancy representations that compress 3D grids into latent codes, reducing memory bandwidth by 10x while maintaining 95 percent collision-detection accuracy[3].

For manipulation tasks, occupancy grids must represent fine-grained geometry—a 2cm voxel resolution is standard for tabletop grasping. The BridgeData V2 dataset includes 60,000 manipulation demonstrations with voxelized scenes at 1cm resolution, enabling models to learn grasp affordances conditioned on local occupancy patterns. Truelabel's marketplace aggregates 12,000 hours of similar multi-sensor manipulation data across 47 object categories[4].

Dynamic Occupancy and Temporal Prediction

Static occupancy grids fail in environments with moving obstacles—humans, other robots, conveyor belts. Dynamic occupancy extends the representation with velocity estimates per voxel, predicting future occupancy states over a 2-5 second horizon. This requires training data with temporal annotations.

The prediction problem is formulated as a 4D convolution over space and time. Models consume the past 10 frames of occupancy grids (spanning 1 second at 10Hz) and output probabilistic occupancy forecasts for the next 50 frames. RT-1's training corpus includes 130,000 episodes with frame-by-frame occupancy labels, enabling the model to anticipate human hand motion during collaborative tasks[5].
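
The sketch below illustrates this input-output contract by folding the time axis into channels, so a stack of past occupancy frames maps to a stack of future ones; a true space-time convolution or recurrent model is a common alternative, and the layer sizes are placeholders.

```python
import torch
import torch.nn as nn

# Illustrative occupancy forecaster: past frames become input channels,
# future frames become output channels. Architecture is a sketch, not any
# published model.
class OccupancyForecaster(nn.Module):
    def __init__(self, past_frames=10, future_frames=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(past_frames, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv3d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv3d(64, future_frames, kernel_size=3, padding=1),
        )

    def forward(self, history):
        # history: (batch, past_frames, X, Y, Z) occupancy probabilities
        return torch.sigmoid(self.net(history))  # (batch, future_frames, X, Y, Z)

model = OccupancyForecaster()
history = torch.rand(1, 10, 64, 64, 64)  # 1 second of grids at 10 Hz
forecast = model(history)                # probabilistic occupancy, next 5 seconds
```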

Temporal prediction accuracy degrades rapidly beyond 3 seconds due to compounding uncertainty. Production systems re-plan every 500ms, treating the occupancy forecast as a soft constraint rather than ground truth. The Open X-Embodiment dataset demonstrates this pattern across 22 robot platforms—models trained on 1 million trajectories achieve 82 percent collision-free navigation in dynamic scenes, versus 91 percent in static environments[6].

Learned Occupancy Representations

Classical occupancy grids use hand-engineered inverse sensor models—fixed functions mapping range measurements to occupancy probabilities. Learned representations replace these with neural networks that predict occupancy directly from raw sensor input, bypassing explicit geometric reasoning.

PointNet architectures process unordered point clouds into per-point occupancy predictions, learning to handle occlusion and sensor noise patterns specific to the training distribution. The Point Cloud Library provides reference implementations for classical methods; modern learned approaches achieve 15 percent lower false-positive rates on held-out scenes from the same domain but generalize poorly across sensor types.
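
The sketch below shows the general PointNet-style pattern (a shared per-point MLP, an order-invariant max pool for a global feature, and a per-point head over concatenated local and global features), with illustrative dimensions rather than those of any published model:

```python
import torch
import torch.nn as nn

# PointNet-style per-point occupancy head; sizes are hypothetical.
class PointOccupancyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.local_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                       nn.Linear(64, 128), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(128 + 128, 64), nn.ReLU(),
                                  nn.Linear(64, 1))

    def forward(self, points):
        # points: (batch, N, 3) unordered point cloud
        local = self.local_mlp(points)             # (batch, N, 128)
        global_feat = local.max(dim=1).values      # symmetric, order-invariant
        expanded = global_feat.unsqueeze(1).expand_as(local)
        logits = self.head(torch.cat([local, expanded], dim=-1))
        return torch.sigmoid(logits.squeeze(-1))   # per-point P(occupied)

net = PointOccupancyNet()
probs = net(torch.rand(2, 1024, 3))  # (2, 1024) occupancy predictions
```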

The generalization gap is the core procurement challenge. A model trained on Velodyne VLP-16 LiDAR data will fail when deployed on Ouster OS1 hardware due to different noise characteristics and beam patterns. Truelabel's physical-AI marketplace indexes datasets by sensor metadata—buyers filter by LiDAR model, camera resolution, and lighting conditions to match their deployment environment. This sensor-aware curation reduces sim-to-real transfer failures by 40 percent compared to generic dataset aggregation[4].

Training Data Requirements and Annotation Pipelines

Occupancy grid models require dense spatial labels—every voxel in every frame must be annotated as free, occupied, or unknown. Manual annotation is infeasible: a 10-second robot trajectory at 10Hz with 100,000 voxels per frame requires labeling 10 million voxels. Automated pipelines are mandatory.

The standard approach uses multi-view geometry. Robots capture RGB-D sequences from multiple viewpoints; structure-from-motion algorithms reconstruct 3D geometry; voxels intersecting the reconstructed mesh are labeled occupied, voxels in free-space rays are labeled free. Segments.ai's multi-sensor labeling platform automates this pipeline, generating occupancy ground truth from raw sensor logs with 95 percent precision on static scenes.
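
A simplified version of the free-space carving step, assuming fixed-step sampling along each sensor ray; production labelers use exact voxel traversal (e.g. Amanatides-Woo) instead:

```python
import numpy as np

# Label voxels along one sensor ray: traversed voxels become FREE, the
# voxel containing the return becomes OCCUPIED. Fixed-step sampling trades
# accuracy for brevity.
FREE, OCCUPIED, UNKNOWN = 0, 1, 2

def label_ray(labels, origin, endpoint, resolution, step=0.5):
    direction = endpoint - origin
    length = np.linalg.norm(direction)
    n_samples = int(length / (resolution * step)) + 1
    for t in np.linspace(0.0, 1.0, n_samples, endpoint=False):
        idx = tuple(np.floor((origin + t * direction) / resolution).astype(int))
        labels[idx] = FREE           # every voxel the beam passes through
    hit = tuple(np.floor(endpoint / resolution).astype(int))
    labels[hit] = OCCUPIED           # the voxel containing the return
    return labels

labels = np.full((100, 100, 100), UNKNOWN, dtype=np.uint8)
labels = label_ray(labels, origin=np.array([0.1, 0.1, 0.1]),
                   endpoint=np.array([5.0, 5.0, 1.0]), resolution=0.1)
```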

Dynamic scenes require human-in-the-loop annotation. Annotators segment moving objects frame-by-frame; the system propagates these masks into 3D and updates voxel labels accordingly. Scale AI's data engine processes 500,000 such frames monthly for autonomous vehicle customers, maintaining sub-5cm spatial accuracy through active learning loops that surface ambiguous cases for expert review[7].

Data volume scales with environment diversity. A warehouse robot needs occupancy data covering 200+ object types, 15 lighting conditions, and 8 floor surface materials. The RH20T dataset provides 110,000 manipulation episodes across 33 kitchen environments—sufficient for training models that generalize to novel object arrangements but insufficient for cross-domain transfer to outdoor or industrial settings.

Occupancy Grids in Simulation and Sim-to-Real Transfer

Simulation is the primary source of occupancy training data due to automatic ground-truth availability. Physics engines like MuJoCo and Isaac Sim provide perfect occupancy labels via ray-casting APIs. The challenge is ensuring simulated sensor noise matches real-world characteristics.

Domain randomization addresses this by training on diverse simulated sensor configurations—varying LiDAR beam counts, camera focal lengths, and noise parameters. Models learn representations invariant to these factors, improving real-world transfer. The RLBench benchmark includes 100 simulated manipulation tasks with randomized occupancy grid observations, enabling controlled evaluation of sim-to-real methods.
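
An illustrative sampler for such randomized sensor configurations; the parameter ranges below are hypothetical, not drawn from RLBench or any specific benchmark:

```python
import random
from dataclasses import dataclass

# Each simulated episode draws a fresh sensor configuration so the model
# never overfits one beam pattern or noise level.
@dataclass
class SensorConfig:
    lidar_beams: int        # vertical beam count (e.g. 16 vs. 64 channels)
    focal_length_px: float  # simulated camera intrinsics
    range_noise_std_m: float

def sample_sensor_config(rng=random):
    return SensorConfig(
        lidar_beams=rng.choice([16, 32, 64, 128]),
        focal_length_px=rng.uniform(400.0, 900.0),
        range_noise_std_m=rng.uniform(0.005, 0.05),
    )

configs = [sample_sensor_config() for _ in range(5)]  # one per simulated episode
```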

Transfer gaps persist for fine-grained geometry. Simulated depth cameras lack the systematic errors of real structured-light sensors—edge bleeding, multi-path interference, temperature drift. Production systems require 5,000-10,000 real-world trajectories for fine-tuning even after pre-training on 1 million simulated episodes[8]. Truelabel's data provenance tracking labels each trajectory with sensor metadata and environment tags, enabling buyers to construct stratified fine-tuning sets that cover deployment-critical edge cases.

Memory and Compute Trade-offs in Grid Resolution

Occupancy grid resolution is a three-way trade-off between spatial precision, memory footprint, and update latency. A 20-meter cube at 2cm resolution requires 1 billion voxels (4GB at 4 bytes per voxel); at 10cm resolution, 8 million voxels (32MB). Real-time systems must balance these constraints.

Multi-resolution grids are the standard solution. A coarse 20cm grid covers the full environment for global path planning; a fine 2cm grid represents a 2-meter local region around the robot for collision checking. The local grid slides as the robot moves, maintaining constant memory usage. OctoMap's hierarchical structure enables this pattern with minimal overhead—resolution switches require only pointer updates in the octree.
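
A minimal sketch of the sliding-window pattern for a dense local grid, assuming the shift is applied each time the robot crosses a voxel boundary; octree-backed implementations achieve the same effect with pointer updates.

```python
import numpy as np

# Slide the local grid in place and reset the newly exposed slab to unknown,
# keeping memory usage constant as the window follows the robot.
UNKNOWN = 0.5

def slide_grid(local_grid, shift_voxels, axis):
    grid = np.roll(local_grid, -shift_voxels, axis=axis)
    # Invalidate the slab that wrapped around from the trailing edge.
    sl = [slice(None)] * grid.ndim
    sl[axis] = slice(-shift_voxels, None) if shift_voxels > 0 else slice(None, -shift_voxels)
    grid[tuple(sl)] = UNKNOWN
    return grid

local = np.full((100, 100, 100), UNKNOWN, dtype=np.float32)  # 2 m cube at 2 cm
local = slide_grid(local, shift_voxels=5, axis=0)  # robot advanced 10 cm in x
```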

GPU memory bandwidth is the bottleneck for learned occupancy models. A 3D convolutional network processing 100x100x100 voxel grids at 30Hz requires 300GB/s memory bandwidth—achievable on NVIDIA A100 but not on embedded Jetson platforms. Quantization to 8-bit integers reduces bandwidth by 4x with negligible accuracy loss; sparse convolutions (processing only occupied voxels) provide another 10x speedup on typical indoor scenes[9].

Occupancy Grids for Manipulation and Grasping

Manipulation tasks require occupancy grids that capture surface geometry, not just binary occupancy. A voxel labeled 'occupied' provides no information about surface orientation—critical for grasp planning. Signed distance fields (SDFs) extend occupancy grids by storing the distance to the nearest surface in each voxel.

SDFs enable gradient-based grasp optimization. A grasp planner samples candidate gripper poses and evaluates penetration depth via SDF lookup; gradients guide the search toward collision-free configurations. The Dex-YCB dataset includes 582,000 grasp annotations with aligned SDF ground truth, enabling end-to-end learning of grasp affordances from voxelized scenes.
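
The sketch below illustrates the idea on a toy sphere SDF, refining a single gripper point with finite-difference gradients until it clears the surface; a real planner evaluates full gripper geometry, and all names here are illustrative.

```python
import numpy as np

# Gradient-based pose refinement against a discretized SDF. np.gradient
# approximates the SDF gradient, which points along the outward surface
# normal; the point steps outward until it reaches the clearance margin.
def refine_point(sdf, point_idx, clearance=0.01, resolution=0.02, steps=20):
    gx, gy, gz = np.gradient(sdf, resolution)     # finite-difference gradient
    idx = np.array(point_idx, dtype=float)
    for _ in range(steps):
        i, j, k = np.clip(idx.astype(int), 0, np.array(sdf.shape) - 1)
        if sdf[i, j, k] >= clearance:             # already collision-free
            break
        normal = np.array([gx[i, j, k], gy[i, j, k], gz[i, j, k]])
        idx += normal / (np.linalg.norm(normal) + 1e-9)  # step one voxel outward
    return idx

# Toy SDF: distance from a 0.2 m sphere centered in a 1 m cube at 2 cm voxels.
coords = np.stack(np.meshgrid(*[np.arange(50) * 0.02] * 3, indexing="ij"))
sdf = np.linalg.norm(coords - 0.5, axis=0) - 0.2
refined = refine_point(sdf, point_idx=(30, 25, 25))  # starts inside the object
```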

Truelabel's marketplace contains 8,400 hours of manipulation data with SDF annotations across 120 object categories[4]. Buyers specify object geometry requirements—thin objects (utensils, cables), deformable objects (cloth, food), transparent objects (glass, plastic)—and receive datasets filtered by these attributes. This targeted curation reduces grasp failure rates by 25 percent compared to training on generic object datasets.

Integration with Vision-Language-Action Models

Modern robot learning systems combine occupancy grids with vision-language-action (VLA) models that map natural language commands to control policies. The occupancy grid provides geometric context; the VLA model provides task semantics. RT-2 demonstrates this pattern, fusing occupancy features with vision-transformer embeddings to achieve 62 percent success on 6,000 real-world manipulation tasks[10].

The fusion architecture is typically a cross-attention mechanism. Occupancy voxels are embedded into a latent space; language tokens attend to these embeddings to ground spatial references ('the cup on the left table'). Training requires datasets with aligned language, vision, and occupancy annotations—a rare combination. The CALVIN benchmark provides 24,000 such episodes across 34 tasks, but coverage remains sparse relative to the space of possible instructions.
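
A hedged sketch of that pattern using a generic multi-head attention layer; the dimensions and the simple voxel embedding are illustrative assumptions, not RT-2's architecture.

```python
import torch
import torch.nn as nn

# Language tokens query embedded occupancy voxels to ground spatial
# references; all sizes are placeholders.
class OccupancyLanguageFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.voxel_embed = nn.Linear(4, dim)   # (x, y, z, occupancy) per voxel
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, language_tokens, voxels):
        # language_tokens: (batch, T, dim); voxels: (batch, V, 4)
        keys = self.voxel_embed(voxels)
        grounded, _ = self.attn(query=language_tokens, key=keys, value=keys)
        return grounded                         # language features with geometry

fusion = OccupancyLanguageFusion()
tokens = torch.rand(1, 12, 256)    # embedded instruction, e.g. "cup on the left"
voxels = torch.rand(1, 4096, 4)    # sampled occupied voxels with coordinates
out = fusion(tokens, voxels)
```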

OpenVLA addresses this by pre-training on 970,000 trajectories from the Open X-Embodiment dataset, then fine-tuning on task-specific occupancy data. The model learns a shared representation where occupancy features and language embeddings live in the same latent space, enabling zero-shot generalization to novel object arrangements described in natural language[11].

Procurement Considerations for Occupancy Grid Datasets

Buying occupancy grid training data requires evaluating five dimensions: sensor coverage, environment diversity, annotation density, temporal consistency, and licensing terms. Sensor coverage determines sim-to-real transfer—datasets must match deployment hardware. The DROID dataset spans 564 scenes with RealSense D435 depth cameras; buyers deploying Kinect Azure hardware need separate data.

Environment diversity bounds generalization. A model trained on 50,000 warehouse trajectories will fail in outdoor construction sites due to different geometry distributions (open spaces vs. cluttered aisles), lighting (sunlight vs. fluorescent), and surface materials (concrete vs. polished floors). Truelabel's marketplace indexes datasets by environment tags—buyers construct training sets spanning their deployment distribution[4].

Annotation density affects fine-tuning efficiency. Sparse annotations (occupancy labels every 10 frames) suffice for global navigation but fail for manipulation, which requires per-frame labels. Temporal consistency—whether voxel labels are stable across frames—determines whether models can learn temporal prediction. BridgeData V2 provides frame-by-frame annotations with sub-centimeter spatial consistency, enabling training of dynamics models that forecast occupancy changes.

Licensing terms determine commercial viability. Academic datasets often prohibit commercial use or require attribution in model outputs—unacceptable for production systems. Truelabel negotiates commercial licenses with data collectors, providing buyers with clear IP rights and indemnification against provenance claims.

The references below move from category-level context into specific task, dataset, format, and comparison detail.

External references and source context

  1. 3D is here: Point Cloud Library (PCL). OctoMap octree-based occupancy grid achieving 90% memory reduction versus dense grids. IEEE.
  2. DROID dataset project site. DROID dataset containing 76,000 manipulation trajectories with aligned depth and RGB. droid-dataset.github.io.
  3. NVIDIA Cosmos World Foundation Models. NVIDIA Cosmos world foundation models using learned occupancy representations. NVIDIA Developer.
  4. Truelabel physical AI data marketplace bounty intake. Truelabel marketplace aggregating 12,000 hours of multi-sensor manipulation data. truelabel.ai.
  5. RT-1: Robotics Transformer for Real-World Control at Scale. RT-1 training corpus with 130,000 episodes including frame-by-frame occupancy labels. arXiv.
  6. Open X-Embodiment: Robotic Learning Datasets and RT-X Models. Open X-Embodiment dataset with 1 million trajectories across 22 robot platforms achieving 82% collision-free navigation. arXiv.
  7. Scale AI physical AI. Scale AI physical-AI data engine processing multi-sensor calibration and annotation. scale.com.
  8. Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning. Survey showing 5,000-10,000 real trajectories needed for fine-tuning after simulation pre-training. arXiv.
  9. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. PointNet architecture for per-point occupancy prediction from unordered point clouds. arXiv.
  10. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. RT-2 vision-language-action model achieving 62% success on 6,000 real-world tasks. arXiv.
  11. OpenVLA: An Open-Source Vision-Language-Action Model. OpenVLA pre-trained on 970,000 trajectories enabling zero-shot generalization to novel arrangements. arXiv.


FAQ

What voxel resolution is required for robot manipulation tasks?

Manipulation tasks typically require 1-2cm voxel resolution to capture fine-grained geometry for grasp planning and collision checking. Coarser resolutions (5-10cm) suffice for navigation and global path planning. Multi-resolution grids are standard: a fine local grid around the robot (2-meter radius at 2cm resolution) and a coarse global grid (20-meter radius at 20cm resolution). The BridgeData V2 dataset uses 1cm resolution for tabletop manipulation; warehouse navigation systems use 10cm resolution for obstacle avoidance.

How do occupancy grids handle transparent or reflective objects?

Standard occupancy grids fail on transparent objects (glass, clear plastic) because depth sensors cannot measure their surfaces reliably. Reflective objects (mirrors, polished metal) produce spurious depth readings. Solutions include multi-modal fusion (combining RGB appearance with depth), learned occupancy prediction (neural networks trained to infer occupancy from RGB when depth is unreliable), and specialized sensors (polarization cameras that detect glass). Training data must include these failure modes—datasets like Dex-YCB contain transparent object annotations, but coverage remains sparse.

What is the difference between occupancy grids and signed distance fields?

Occupancy grids store a discrete occupancy state (free/occupied/unknown) per voxel. Signed distance fields (SDFs) store the distance to the nearest surface, with negative values inside objects and positive values in free space. SDFs provide richer geometric information—surface normals can be computed via gradients—enabling gradient-based motion planning and grasp optimization. SDFs typically require 4-8x more memory than occupancy grids (32-bit floats vs. 4- or 8-bit occupancy values) but are essential for manipulation tasks requiring precise surface contact.

How much training data is needed for learned occupancy models?

Learned occupancy models require 50,000-100,000 trajectories for single-environment generalization and 500,000+ trajectories for cross-environment transfer. RT-1 was trained on 130,000 episodes; Open X-Embodiment aggregates 1 million trajectories across 22 robot platforms. Data volume scales with environment diversity—each new object category, lighting condition, or sensor type requires 2,000-5,000 additional examples. Fine-tuning pre-trained models reduces requirements to 5,000-10,000 domain-specific trajectories.

Can occupancy grids be generated automatically from robot sensor logs?

Yes, for static scenes. Multi-view geometry pipelines reconstruct 3D meshes from RGB-D sequences; voxels intersecting the mesh are labeled occupied, voxels along free-space rays are labeled free. Segments.ai and Scale AI provide automated annotation tools achieving 95 percent precision on static environments. Dynamic scenes require human annotation—moving objects must be segmented frame-by-frame before voxel labels can be propagated. Fully automated pipelines exist only for controlled environments with known object sets.

Find datasets covering occupancy grids

Truelabel surfaces vetted datasets and capture partners working with occupancy grids. Send us the modality, scale, and rights you need, and we'll route you to the closest match.

Explore Physical AI Data Marketplace