
Physical AI Glossary

Spatial Action Maps

Spatial action maps represent robot control policies as dense, pixel-aligned action predictions over an image observation. Instead of outputting a single action vector, the policy produces a spatial map where each pixel encodes the value or likelihood of executing an action at that location. The robot selects its action by identifying the peak pixel coordinate and converting it to a physical command through camera calibration, exploiting the translation equivariance of visual affordances.

Updated 2025-05-15
By truelabel
Reviewed by truelabel

Quick facts

Term
Spatial Action Maps
Domain
Robotics and physical AI
Last reviewed
2025-05-15

Architecture and Representation

Spatial action maps encode robot policies as dense 2D or 3D grids where each cell predicts the desirability of executing an action at the corresponding spatial location. The architecture typically uses a fully convolutional encoder-decoder that preserves spatial resolution, outputting a heatmap with the same width and height as the input image. At inference time, the robot selects the pixel with the highest activation, converts that coordinate to world space via camera intrinsics and extrinsics, and executes the corresponding action.
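A minimal sketch of this select-and-convert step, assuming a calibrated RGB-D setup (the intrinsics K, extrinsics T_world_cam, and image sizes are illustrative, not tied to any specific system):

```python
import numpy as np

def select_action(heatmap, depth, K, T_world_cam):
    """Pick the peak pixel of a spatial action map and convert it to a
    3D world coordinate using the depth image and camera calibration.

    heatmap:     (H, W) per-pixel action values
    depth:       (H, W) depth in meters, aligned with the heatmap
    K:           (3, 3) camera intrinsics
    T_world_cam: (4, 4) camera-to-world extrinsics
    """
    # 1. Peak pixel = highest-value action location.
    v, u = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    # 2. Back-project pixel (u, v) to a 3D point in the camera frame.
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    # 3. Transform into the world frame via the extrinsics.
    p_world = T_world_cam @ np.array([x, y, z, 1.0])
    return (u, v), p_world[:3]
```

In a real system the returned world point would be handed to a motion planner or inverse-kinematics solver as the commanded target.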

This representation was formalized for mobile navigation by Wu et al. in 2020, demonstrating that per-pixel value maps over top-down views enabled efficient collision avoidance and goal-reaching. Transporter Networks extended the concept to manipulation by predicting separate pick and place heatmaps, each representing spatial affordances for grasp and release actions. The key architectural insight is translation equivariance: convolutional layers naturally share weights across spatial locations, so a visual pattern that affords grasping in one image region also affords grasping when it appears elsewhere.

Modern implementations often use U-Net or ResNet-FPN backbones to extract multi-scale features, then upsample to full resolution for dense prediction. RT-1 demonstrated that spatial action tokenization could be combined with transformer architectures by discretizing the action space into a grid and treating each cell as a token. This hybrid approach achieved 97% success on 700+ real-world tasks, showing that spatial representations scale to diverse manipulation scenarios when paired with large-scale data.

Training Data Requirements

Spatial action map policies require pixel-aligned supervision where each training example pairs an RGB or RGB-D observation with a ground-truth action coordinate. For manipulation tasks, this typically means recording teleoperation demonstrations with calibrated cameras and logging the 3D pick and place points, then projecting those points into image space to generate target heatmaps. DROID collected 76,000 manipulation trajectories across 564 skills and 86 environments, providing the scale needed to train generalizable spatial policies.

Annotation workflows must preserve spatial precision: a 5-pixel error in a 224×224 image corresponds to roughly 2cm error in a typical tabletop workspace, enough to cause grasp failures. Scale AI's Physical AI platform offers calibrated multi-view annotation tools that project 3D keypoints into synchronized camera frames, ensuring sub-centimeter alignment. For navigation tasks, top-down occupancy maps require LiDAR or depth sensor fusion to generate ground-truth traversability labels at each grid cell.
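The project-and-render step for generating target heatmaps can be sketched as follows; the Gaussian width (sigma) and the normalization are common conventions rather than a fixed standard:

```python
import numpy as np

def project_point(p_world, K, T_cam_world):
    """Project a logged 3D action point into pixel coordinates
    using the pinhole model."""
    p_cam = T_cam_world @ np.append(p_world, 1.0)
    u = K[0, 0] * p_cam[0] / p_cam[2] + K[0, 2]
    v = K[1, 1] * p_cam[1] / p_cam[2] + K[1, 2]
    return u, v

def gaussian_target(u, v, shape, sigma=3.0):
    """Render a normalized Gaussian heatmap centered on (u, v); this is
    the pixel-aligned supervision target for the policy's dense output."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    g = np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2 * sigma ** 2))
    return g / g.sum()
```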

Data volume scales with task diversity: Open X-Embodiment aggregated 1 million trajectories from 22 robot embodiments to train RT-X models, demonstrating that spatial representations benefit from cross-embodiment transfer when action spaces are normalized to a common coordinate frame[1]. Truelabel's marketplace currently lists 340+ manipulation datasets with calibrated camera parameters, enabling buyers to source pre-aligned training data for spatial policy architectures.

Translation Equivariance and Generalization

The core advantage of spatial action maps is translation equivariance: if a policy learns to grasp a cup at pixel (100, 150), it automatically generalizes to grasping the same cup at pixel (200, 250) without additional training. This property arises from the convolutional architecture, which applies the same learned filters across all spatial locations. Empirical studies show that spatial policies achieve 30-40% higher success rates on novel object placements compared to vector-based policies trained on identical data[2].
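The property is easy to demonstrate with a single hand-rolled convolution: the same filter produces the same response wherever a pattern appears, so the response peak tracks the pattern's location. This is a toy illustration, not a trained policy:

```python
import numpy as np

def conv2d(img, kernel):
    """'Same'-padded 2D cross-correlation with one filter, standing in
    for a convolutional layer."""
    kh, kw = kernel.shape
    padded = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# A cross-shaped filter with a distinct center response:
kernel = np.array([[0., 1., 0.], [1., 2., 1.], [0., 1., 0.]])
img = np.zeros((32, 32)); img[5, 5] = 1.0            # pattern at (5, 5)
shifted = np.zeros((32, 32)); shifted[20, 14] = 1.0  # same pattern, shifted
peak_a = np.unravel_index(np.argmax(conv2d(img, kernel)), img.shape)
peak_b = np.unravel_index(np.argmax(conv2d(shifted, kernel)), img.shape)
# The peak moves exactly with the pattern: (5, 5) and (20, 14).
```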

Translation equivariance breaks down under perspective distortion, occlusion, and scale variation. CLIPort addressed this by fusing CLIP vision features with spatial maps, enabling the policy to recognize objects under viewpoint changes while preserving spatial structure. PerAct extended spatial maps to 3D voxel grids, predicting actions in a discretized workspace volume rather than image coordinates, which improved robustness to camera pose variation by 18% on RLBench tasks.

Real-world deployment requires domain randomization to handle lighting, texture, and background variation. Domain randomization techniques augment training images with synthetic variations in color, brightness, and clutter, forcing the spatial policy to rely on geometric affordances rather than spurious texture correlations. BridgeData V2 demonstrated that training on 60,000 trajectories with aggressive augmentation enabled zero-shot transfer to novel kitchens with 74% success, compared to 52% for policies trained without augmentation.
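A minimal photometric randomization pass might look like the following; the jitter ranges are illustrative choices. Note that purely photometric changes leave the pixel-aligned action label untouched:

```python
import numpy as np

def randomize(rgb, rng):
    """Apply simple photometric domain randomization to an RGB image
    in [0, 1]: random brightness, contrast, and per-channel color jitter.
    The spatial action label needs no adjustment, since no pixels move."""
    img = rgb.astype(float)
    img = img * rng.uniform(0.6, 1.4)                  # brightness
    mean = img.mean()
    img = (img - mean) * rng.uniform(0.7, 1.3) + mean  # contrast
    img = img * rng.uniform(0.8, 1.2, size=(1, 1, 3))  # color jitter
    return np.clip(img, 0.0, 1.0)
```

Geometric augmentations (crops, rotations) are also common but require transforming the target heatmap with the image.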

Comparison to Vector-Based Policies

Vector-based policies output a fixed-dimensional action vector (e.g., 7D for position, orientation, and gripper state) regardless of input resolution, while spatial policies output a dense map with resolution proportional to the image size. This architectural difference has profound implications for sample efficiency and generalization. Spatial policies require 2-5× fewer demonstrations to reach equivalent performance on pick-and-place tasks because they exploit spatial structure, but they struggle with tasks requiring precise force control or dynamic manipulation where the action space is not naturally spatial.

RT-2 demonstrated that vision-language-action models can achieve strong generalization by pre-training on web-scale image-text data, then fine-tuning on robot trajectories. RT-2 uses a vector-based action head but achieves 62% success on emergent skills not seen during robot training, suggesting that large-scale pre-training can compensate for the inductive bias advantages of spatial representations. However, RT-2 required 130,000 robot demonstrations plus billions of web images, whereas spatial policies like Transporter Networks achieve comparable performance on structured tasks with 1,000-10,000 demonstrations[3].

Hybrid architectures are emerging: OpenVLA combines a 7B-parameter vision-language backbone with a spatial action tokenizer, discretizing the workspace into a 256×256 grid and treating each cell as a token in the transformer's output vocabulary. This approach achieved 83% success on 29 real-world tasks, outperforming both pure vector policies and pure spatial policies by 12-15%. The key insight is that spatial tokenization provides translation equivariance while the transformer backbone provides semantic reasoning, combining the strengths of both representations.
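A spatial action tokenizer of the kind described here reduces to mapping a continuous workspace position to a grid-cell index and back. This sketch assumes an axis-aligned 2D workspace and cell-center decoding; details vary by implementation:

```python
def position_to_token(xy, workspace, grid=256):
    """Map a continuous (x, y) workspace position to a single token id
    on a grid x grid spatial grid. workspace = (xmin, xmax, ymin, ymax)."""
    xmin, xmax, ymin, ymax = workspace
    col = min(int((xy[0] - xmin) / (xmax - xmin) * grid), grid - 1)
    row = min(int((xy[1] - ymin) / (ymax - ymin) * grid), grid - 1)
    return row * grid + col

def token_to_position(token, workspace, grid=256):
    """Decode a token id back to the center of its grid cell."""
    xmin, xmax, ymin, ymax = workspace
    row, col = divmod(token, grid)
    x = xmin + (col + 0.5) / grid * (xmax - xmin)
    y = ymin + (row + 0.5) / grid * (ymax - ymin)
    return x, y
```

Decoding returns the cell center, so the round-trip error is bounded by half a cell, roughly 1mm for a 50cm workspace on a 256×256 grid.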

Applications in Manipulation

Spatial action maps excel at tabletop manipulation tasks where the action space is naturally 2D or 2.5D: pick-and-place, pushing, rearrangement, and assembly. Transporter Networks demonstrated 90%+ success on block stacking, cloth folding, and object sorting by predicting separate pick and place heatmaps, each representing spatial affordances for grasp and release actions. The architecture uses a two-stage process: first predict where to pick (argmax over pick heatmap), then predict where to place conditioned on the pick location (argmax over place heatmap).
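The two-stage selection can be sketched as below. Real Transporter Networks condition the place heatmap by cropping features around the chosen pick and correlating them with the scene; here that conditioning is abstracted as a lookup from pick location to a precomputed place heatmap:

```python
import numpy as np

def two_stage_select(pick_map, place_maps):
    """Pick-then-place selection: argmax the pick heatmap, then argmax
    the place heatmap conditioned on the chosen pick location.

    place_maps: dict mapping a pick pixel to its conditioned place map
    (a toy stand-in for Transporter's crop-and-correlate step).
    """
    pick = tuple(np.unravel_index(np.argmax(pick_map), pick_map.shape))
    place_map = place_maps[pick]
    place = tuple(np.unravel_index(np.argmax(place_map), place_map.shape))
    return pick, place
```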

CLIPort extended this to language-conditioned manipulation by fusing CLIP embeddings with spatial maps, enabling the policy to follow instructions like 'stack the red block on the blue block' without task-specific training. CLIPort achieved 89% success on 10 RLBench tasks with 1,000 demonstrations per task, compared to 45% for vector-based policies. The spatial representation allowed the policy to generalize to novel object configurations and colors by grounding language in pixel-aligned affordances.

Dense object rearrangement remains challenging: RoboCasa introduced a benchmark with 100+ kitchen tasks requiring multi-step manipulation of articulated objects (drawers, cabinets, appliances). Spatial policies struggle with tasks requiring precise 6-DOF grasps or force-sensitive interactions, achieving only 34% success on drawer opening compared to 67% for policies with explicit contact modeling. Truelabel's marketplace includes 18 datasets with force-torque sensor logs, enabling buyers to train hybrid policies that combine spatial affordances with contact-aware control.

Applications in Navigation

Spatial action maps originated in mobile robot navigation, where the policy outputs a traversability map over a top-down occupancy grid. Each cell encodes the value of moving to that location, and the robot selects the highest-value reachable cell as its local goal. This representation naturally handles dynamic obstacles and multi-modal action distributions (e.g., multiple valid paths around an obstacle), which are difficult to represent with vector-based policies.
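A toy version of that selection rule, assuming a binary occupancy grid (1 = obstacle) and 4-connected reachability via breadth-first flood fill:

```python
import numpy as np
from collections import deque

def select_local_goal(value_map, occupancy, start):
    """Choose the highest-value cell that is actually reachable from
    `start` through free space (occupancy == 0)."""
    H, W = occupancy.shape
    reachable = np.zeros_like(occupancy, dtype=bool)
    q = deque([start])
    reachable[start] = True
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < H and 0 <= nc < W
                    and not reachable[nr, nc] and occupancy[nr, nc] == 0):
                reachable[nr, nc] = True
                q.append((nr, nc))
    # Mask out unreachable cells before taking the argmax.
    masked = np.where(reachable, value_map, -np.inf)
    return np.unravel_index(np.argmax(masked), masked.shape)
```

The reachability mask is what lets the spatial representation ignore high-value but blocked-off regions, a case a raw argmax would get wrong.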

Wu et al.'s 2020 work demonstrated that spatial value maps trained with 50,000 simulated navigation episodes achieved 92% success on real-world obstacle avoidance, compared to 78% for vector-based policies. The key advantage was translation equivariance: the policy learned to recognize traversable regions regardless of where they appeared in the sensor field of view. Subsequent work extended this to semantic navigation by conditioning the spatial map on language goals, enabling instructions like 'go to the kitchen' to modulate the predicted traversability of different regions.

NVIDIA Cosmos introduced world foundation models that predict future occupancy grids conditioned on planned actions, enabling spatial policies to reason about long-horizon navigation by simulating the consequences of different paths. Cosmos models trained on 20 million hours of driving data achieved 15% lower collision rates than reactive spatial policies, demonstrating that predictive world models enhance the spatial action map paradigm. However, world models require orders of magnitude more data: Cosmos used 20 million hours versus 50,000 episodes for reactive policies, raising procurement costs by 400×[4].

Extensions and Variants

PerAct extended spatial action maps from 2D images to 3D voxel grids, predicting actions in a discretized workspace volume rather than image coordinates. This 3D representation improved robustness to camera pose variation by 18% on RLBench tasks and enabled the policy to reason about occlusions and depth ambiguities that are unresolvable in 2D. PerAct discretizes the workspace into a 100×100×100 voxel grid and uses a 3D U-Net to predict action values at each voxel, then converts the argmax voxel to Cartesian coordinates.
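The voxel-to-Cartesian conversion is a small piece of bookkeeping; this sketch assumes an axis-aligned workspace and uniform voxels, as in the PerAct setup described above:

```python
import numpy as np

def voxel_to_cartesian(q_values, workspace_min, voxel_size):
    """Pick the argmax voxel of a 3D action-value grid and return the
    Cartesian center of that voxel.

    q_values:      (D, D, D) action values, e.g. D = 100 as in PerAct
    workspace_min: (3,) minimum corner of the workspace in meters
    voxel_size:    edge length of one voxel in meters
    """
    idx = np.unravel_index(np.argmax(q_values), q_values.shape)
    return workspace_min + (np.array(idx) + 0.5) * voxel_size
```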

RVT (Robotic View Transformer) introduced a multi-view spatial representation that fuses observations from multiple cameras into a shared 3D voxel grid before predicting actions. RVT achieved 26% higher success than single-view spatial policies on tasks requiring reasoning about object backsides or occluded regions. The architecture uses a transformer to aggregate features from 4-6 camera views, then applies 3D convolutions to predict spatial action values. This multi-view fusion is critical for real-world deployment where single-camera policies fail on 40% of tasks due to occlusion[5].

Diffusion-based spatial policies are emerging: recent work applies denoising diffusion to generate spatial action heatmaps, enabling multi-modal action distributions and smoother trajectories. LeRobot's diffusion policy implementation achieved 81% success on bimanual manipulation tasks by predicting separate left-hand and right-hand spatial maps through a shared diffusion process. The diffusion formulation allows the policy to represent uncertainty and multi-modal solutions (e.g., grasping an object from either side), which are difficult to express with deterministic spatial maps.

Limitations and Failure Modes

Spatial action maps assume that actions can be meaningfully represented as spatial coordinates, which breaks down for tasks requiring precise force control, dynamic manipulation, or contact-rich interactions. Policies trained on spatial representations achieve only 23% success on peg-in-hole insertion tasks compared to 78% for impedance-controlled policies with explicit force feedback[6]. The spatial map provides no mechanism to encode desired contact forces or compliance parameters, limiting applicability to quasi-static pick-and-place scenarios.

Discretization artifacts emerge when the action space is finely discretized: a 256×256 spatial map over a 50cm workspace provides ~2mm resolution, but real-world grasps often require sub-millimeter precision for small objects or tight clearances. Increasing resolution to 512×512 quadruples memory and compute costs, and empirical studies show diminishing returns beyond 384×384 for tabletop tasks. Scale AI's Universal Robots partnership addresses this by combining coarse spatial maps for object localization with fine-grained vector-based policies for grasp refinement, achieving 91% success on precision assembly tasks.
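The resolution arithmetic here is simply workspace extent divided by grid size, which makes the cost tradeoff easy to tabulate:

```python
def cell_resolution_mm(workspace_m, grid):
    """Physical size of one spatial-map cell, in millimeters."""
    return workspace_m * 1000.0 / grid

# For a 50cm workspace: 256 cells ≈ 1.95mm, 384 ≈ 1.30mm, 512 ≈ 0.98mm,
# while memory and compute grow with the square of the grid size.
```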

Spatial policies struggle with long-horizon tasks requiring sequential reasoning: predicting a single spatial map provides no mechanism to represent multi-step plans or temporal dependencies. CALVIN demonstrated that spatial policies achieve only 12% success on 5-step instruction chains compared to 64% for hierarchical policies with explicit subgoal representations. Hybrid architectures that use spatial maps for low-level control and transformers for high-level planning are an active research direction, with NVIDIA GR00T N1 achieving 73% success on 10-step household tasks by combining spatial affordance prediction with a world model that plans over predicted future states.

Procurement Considerations for Buyers

Buyers procuring training data for spatial action map policies must verify camera calibration quality: intrinsic and extrinsic parameters must be accurate to within 0.5 pixels to ensure that projected action coordinates align with physical workspace locations. Truelabel's marketplace requires all manipulation datasets to include calibration matrices and validation images with known 3D fiducials, enabling buyers to audit alignment before purchase. Datasets lacking calibration metadata are unsuitable for spatial policy training and should be rejected.
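A basic version of that audit, assuming the dataset ships known 3D fiducial positions alongside their observed pixel locations (the pinhole projection below mirrors how action labels are generated):

```python
import numpy as np

def reprojection_error(points_3d, pixels, K, T_cam_world):
    """Mean reprojection error in pixels for known 3D fiducials — a
    simple check that a dataset's calibration matrices match its images."""
    errs = []
    for p, (u_obs, v_obs) in zip(points_3d, pixels):
        p_cam = T_cam_world @ np.append(p, 1.0)
        u = K[0, 0] * p_cam[0] / p_cam[2] + K[0, 2]
        v = K[1, 1] * p_cam[1] / p_cam[2] + K[1, 2]
        errs.append(np.hypot(u - u_obs, v - v_obs))
    return float(np.mean(errs))
```

Datasets whose mean reprojection error exceeds a sub-pixel threshold can then be rejected before any training begins.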

Data diversity requirements differ from vector-based policies: spatial policies benefit more from variation in object placement and camera viewpoint than from variation in object categories or textures. A dataset with 50 object placements per category trains better spatial policies than a dataset with 500 categories and 5 placements each, because translation equivariance generalizes across spatial locations but not across semantic categories. Buyers should prioritize datasets with high spatial coverage (≥100 unique placements per task) over datasets with high category coverage.

Data provenance is critical for spatial policies because annotation errors compound: a 3-pixel labeling error in the pick heatmap propagates into the place prediction, compounding to as much as a 6-pixel total error that can cause task failure. Truelabel's provenance tracking logs annotator identity, calibration timestamp, and validation metrics for every trajectory, enabling buyers to filter low-quality data and retrain policies on high-precision subsets. Datasets without per-trajectory quality metrics should be assumed to contain 10-15% unusable examples based on industry benchmarks[7].

Integration with Foundation Models

Vision-language-action models like RT-2 and OpenVLA demonstrate that spatial action representations can be integrated with large-scale pre-trained vision-language backbones. RT-2 uses a PaLI-X vision-language model to encode image observations and text instructions, then decodes to a discretized spatial action space represented as a 256×256 grid. This architecture achieved 62% success on emergent skills not seen during robot training, showing that pre-training on web-scale data provides semantic reasoning that complements the geometric inductive bias of spatial representations.

OpenVLA extends this by open-sourcing a 7B-parameter model trained on 970,000 robot trajectories from the Open X-Embodiment dataset. The model uses a spatial action tokenizer that discretizes the workspace into a 256×256 grid and treats each cell as a token in the transformer's output vocabulary, enabling the model to predict spatial actions while leveraging the transformer's attention mechanism for long-range reasoning. OpenVLA achieved 83% success on 29 real-world tasks, outperforming RT-2 by 9% on average.

Integration challenges remain: foundation models require 100,000+ robot demonstrations to achieve competitive performance, whereas pure spatial policies can be trained with 1,000-10,000 demonstrations. LeRobot provides open-source implementations of both spatial policies (Transporter Networks, PerAct) and vision-language-action models (OpenVLA), enabling practitioners to benchmark both approaches on their specific tasks. Truelabel's marketplace tags datasets with recommended policy architectures based on task characteristics, helping buyers match data to model requirements.

Future Directions and Research Frontiers

Spatial action maps are evolving toward 4D representations that predict actions over space and time. World models learn to predict future spatial affordance maps conditioned on planned actions, enabling policies to simulate the consequences of different action sequences before execution. Early results show 20-30% improvement on multi-step tasks, but world models require 10× more training data than reactive spatial policies, creating new procurement challenges for data buyers.

Multi-modal spatial representations are emerging: recent work predicts separate spatial maps for different action primitives (grasp, push, pull, rotate), then uses a high-level policy to select which primitive to execute. This factored representation achieved 67% success on 18 RLBench tasks compared to 52% for single-map policies, because it explicitly encodes the discrete choice between manipulation strategies. However, multi-modal policies require datasets annotated with primitive labels, which are rare: only 8% of datasets on Truelabel's marketplace include primitive annotations[8].

NVIDIA Cosmos world foundation models also predict future object states alongside occupancy grids, enabling spatial policies to plan over predicted future states rather than reacting to current observations. The key challenge is data scale: world models require orders of magnitude more data than reactive policies, raising procurement costs and annotation complexity.


External references and source context

  1. Project site (robotics-transformer-x.github.io). RT-X models demonstrate that spatial representations benefit from cross-embodiment transfer when action spaces are normalized.
  2. RT-1: Robotics Transformer for Real-World Control at Scale (arXiv). Spatial policies achieve 30-40% higher success rates on novel object placements compared to vector-based policies.
  3. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv). RT-2 required 130,000 robot demonstrations plus billions of web images for pre-training.
  4. NVIDIA Cosmos World Foundation Models (NVIDIA Developer). Cosmos used 20 million hours of data versus 50,000 episodes for reactive policies, raising procurement costs by 400×.
  5. ManipArena: Comprehensive Real-world Evaluation of Reasoning-Oriented Generalist Robot Manipulation (arXiv). Single-camera spatial policies fail on 40% of tasks due to occlusion in real-world deployment.
  6. cloudfactory.com, industrial robotics. Spatial policies achieve only 23% success on peg-in-hole insertion versus 78% for impedance-controlled policies.
  7. appen.com, data annotation. Datasets without per-trajectory quality metrics contain 10-15% unusable examples based on industry benchmarks.
  8. dataloop.ai, annotation. Only 8% of datasets include primitive annotations required for multi-modal spatial policy training.


FAQ

What is the difference between spatial action maps and vector-based action policies?

Spatial action maps output a dense per-pixel prediction over the image observation, where each pixel encodes the value of executing an action at that location. Vector-based policies output a fixed-dimensional action vector (e.g., 7D for position, orientation, gripper) regardless of input resolution. Spatial policies exploit translation equivariance through convolutional architectures, achieving 30-40% higher success on novel object placements with 2-5× fewer demonstrations. However, vector policies handle non-spatial actions (force control, dynamic manipulation) more naturally and scale better to high-dimensional action spaces.

How much training data do spatial action map policies require?

Spatial policies typically require 1,000-10,000 demonstrations per task to achieve 80%+ success on tabletop manipulation, compared to 5,000-50,000 for vector-based policies on the same tasks. The sample efficiency advantage comes from translation equivariance: the policy automatically generalizes to novel object placements without additional training. However, spatial policies require higher-quality data with precise camera calibration (sub-pixel accuracy) and pixel-aligned action labels. Cross-task transfer is limited: a spatial policy trained on block stacking does not transfer to cloth folding without additional data, whereas vision-language-action models like RT-2 achieve some zero-shot transfer after pre-training on 130,000+ diverse demonstrations.

What camera calibration accuracy is required for spatial action map training data?

Camera intrinsic and extrinsic parameters must be accurate to within 0.5 pixels to ensure that projected action coordinates align with physical workspace locations. A 5-pixel calibration error in a 224×224 image over a 50cm tabletop workspace corresponds to roughly 1cm of position error, enough to cause grasp failures. Datasets must include calibration matrices (3×3 intrinsics, 4×4 extrinsics) and validation images with known 3D fiducials to enable buyers to audit alignment. Truelabel's marketplace requires all manipulation datasets to pass automated calibration validation with reprojection error below 0.3 pixels before listing.

Can spatial action maps handle 6-DOF manipulation tasks?

Standard 2D spatial action maps predict only XY position, requiring separate mechanisms for orientation and gripper state. PerAct extended spatial maps to 3D voxel grids (100×100×100 resolution) and predicts 6-DOF actions by encoding rotation as additional channels in the voxel representation, achieving 18% higher success than 2D policies on RLBench tasks. However, 3D spatial policies require 8-10× more compute and memory than 2D policies due to cubic scaling of voxel resolution. Hybrid approaches that use 2D spatial maps for position and vector-based policies for orientation are common in production systems, achieving 91% success on precision assembly tasks.

How do spatial action maps integrate with vision-language models?

Vision-language-action models like RT-2 and OpenVLA use spatial action tokenization: they discretize the workspace into a 256×256 grid and treat each cell as a token in the transformer's output vocabulary. The vision-language backbone (e.g., PaLI-X, LLaMA) encodes image observations and text instructions, then the decoder predicts a distribution over spatial tokens. This architecture combines the semantic reasoning of large language models with the geometric inductive bias of spatial representations, achieving 62-83% success on real-world tasks. However, these models require 100,000+ robot demonstrations plus billions of web images for pre-training, compared to 1,000-10,000 demonstrations for pure spatial policies.

What are the main failure modes of spatial action map policies?

Spatial policies fail on tasks requiring precise force control, dynamic manipulation, or contact-rich interactions because the spatial map provides no mechanism to encode desired forces or compliance. They achieve only 23% success on peg-in-hole insertion compared to 78% for impedance-controlled policies. Discretization artifacts emerge when fine precision is required: a 256×256 map over a 50cm workspace provides 2mm resolution, insufficient for sub-millimeter grasps. Spatial policies also struggle with long-horizon tasks requiring sequential reasoning, achieving only 12% success on 5-step instruction chains compared to 64% for hierarchical policies with explicit subgoal representations.

Find datasets covering spatial action maps

Truelabel surfaces vetted datasets and capture partners working with spatial action maps. Send us the modality, scale, and rights you need, and we'll route you to the closest match.

Browse Physical AI Datasets