Physical AI Glossary
Action Space: How Representation Design Shapes Robot Learning Data
Action space defines the complete set of commands a robot can execute at each control timestep—joint angles, Cartesian poses, velocity targets, or gripper states. The choice between joint-space and Cartesian actions, absolute and relative commands, and continuous versus discrete representations determines how much demonstration data a policy needs, how well it transfers across embodiments, and what tasks it can perform.
Quick facts
- Topic: Action Space
- Audience: Procurement leads, ML ops, robotics engineers
- Deliverable: Buyer-facing reference + procurement guidance
What Is Action Space in Robot Learning?
Action space defines the mathematical domain from which a robot policy selects commands at each timestep. Formally, at time t a policy π selects action a_t from action space A, which may be continuous (A ⊆ ℝⁿ), discrete (A is a finite set), or hybrid. The dimensionality, coordinate frame, and parameterization of A are among the most consequential design decisions in a robot learning pipeline—they directly determine data volume requirements, transfer performance, and task feasibility.
For manipulation, the most common action representations are joint position targets (6-7D vectors of desired joint angles), joint velocity commands (angular velocity targets per joint), Cartesian position targets (6D end-effector pose in world or body frame), and Cartesian delta commands (incremental pose changes). RT-1 uses 7D Cartesian deltas (3D position, 3D rotation, 1D gripper) at 3 Hz, while ACT predicts 14D joint-position chunks (7 joints × 2 arms) at 50 Hz. Each representation implies different data collection requirements: joint-space data is recorded directly from encoders, while Cartesian data requires forward kinematics computation from joint readings.
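The point about Cartesian data requiring forward kinematics can be illustrated with a minimal sketch. It assumes a hypothetical 2-link planar arm (link lengths `l1`, `l2` are made-up values), not any of the robots named above; a real pipeline would use the robot's URDF and a kinematics library.

```python
import math

def planar_fk(q1, q2, l1=0.3, l2=0.25):
    """Forward kinematics of a hypothetical 2-link planar arm (metres, radians)."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y

# Joint-space log, as it would come straight from the encoders.
joint_log = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.4)]

# The Cartesian log has to be computed from the joint log.
cartesian_log = [planar_fk(q1, q2) for q1, q2 in joint_log]
```

The joint log is the primary record; the Cartesian log is derived data and is only as accurate as the kinematic model used to compute it.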
Action space design also encompasses temporal chunking. Rather than predicting a single action per timestep, modern architectures like Diffusion Policy predict sequences of 16-100 future actions, trading off reactivity for smoother trajectories and reduced compounding error. The Open X-Embodiment dataset standardizes on 7D Cartesian deltas to enable cross-robot transfer, but this choice sacrifices joint-level precision for tasks requiring coordinated multi-joint motion.
Joint Space vs. Cartesian Space Representations
Joint space actions specify desired angles or velocities for each actuator. A 7-DOF arm has a 7D joint-space action vector; a bimanual system like ALOHA uses 14D (7 per arm). Joint-space control is computationally cheap—no inverse kinematics required—and offers precise repeatability for tasks like bin picking or assembly. However, joint-space policies are embodiment-specific: a policy trained on a Franka Emika Panda (7-DOF) cannot directly transfer to a UR5e (6-DOF) without retraining.
Cartesian space actions specify desired end-effector poses in world or body coordinates, typically 6D (position + orientation) or 7D (adding gripper state). RT-2 and most vision-language-action models use 7D Cartesian deltas because they generalize better across embodiments with different kinematic chains. A policy that learns "move gripper 5 cm forward" can transfer to any arm with sufficient reach, whereas "set joint 3 to 1.2 radians" is meaningless on a different robot.
The tradeoff is computational: Cartesian control requires inverse kinematics (IK) at every timestep to map desired end-effector poses to joint commands. For high-DOF systems like dexterous hands (the Allegro Hand has 16 DOF[1]), IK becomes expensive and may have multiple solutions. RoboNet collected 15 million frames across 7 robot platforms using joint-space actions, then post-processed to Cartesian for cross-embodiment experiments—demonstrating that action-space conversion is possible but lossy.
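The non-uniqueness of IK is easy to see even on a toy arm. The sketch below assumes a hypothetical 2-link planar arm with made-up link lengths and solves IK in closed form; it returns two distinct joint configurations that reach the same end-effector position.

```python
import math

def planar_fk(q1, q2, l1=0.3, l2=0.25):
    """Forward kinematics of a hypothetical 2-link planar arm."""
    return (l1 * math.cos(q1) + l2 * math.cos(q1 + q2),
            l1 * math.sin(q1) + l2 * math.sin(q1 + q2))

def planar_ik(x, y, l1=0.3, l2=0.25):
    """Return both closed-form joint solutions for a reachable target (x, y)."""
    c2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    if abs(c2) > 1.0:
        raise ValueError("target out of reach")
    solutions = []
    for sign in (1.0, -1.0):  # the two elbow configurations
        q2 = sign * math.acos(c2)
        q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
        solutions.append((q1, q2))
    return solutions

solutions = planar_ik(0.4, 0.1)
```

A Cartesian action of "go to (0.4, 0.1)" therefore does not determine the joint command by itself; the controller has to pick a branch, and different IK solvers may pick differently.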
Absolute vs. Relative Action Commands
Absolute actions specify target states in a global coordinate frame: "move gripper to (x=0.5, y=0.3, z=0.2) in world frame." Relative (delta) actions specify incremental changes: "move gripper +0.05 m along current x-axis." The choice profoundly affects data requirements and generalization.
Absolute actions require precise world-frame localization. If a policy trained on a table at height 0.75 m deploys on a table at 0.80 m, absolute commands will miss by 5 cm. BridgeData V2 uses absolute Cartesian targets but calibrates table height per scene, adding data collection overhead. Absolute actions also struggle with dynamic environments: a policy that learns "grasp object at (0.4, 0.2, 0.1)" fails if the object moves.
Relative actions are invariant to global pose shifts. A policy that learns "move 5 cm toward the object" works regardless of table height or object position, as long as the relative geometry is preserved. DROID collected 76,000 trajectories using relative Cartesian deltas, enabling zero-shot transfer across 21 institutions without per-site calibration[2]. The cost is accumulated drift: small errors compound over long horizons, requiring periodic re-centering or closed-loop correction.
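The invariance and the drift described above can both be shown in a few lines. This is a minimal sketch that tracks position only (rotation and gripper are ignored) and uses made-up waypoints; the helper names `to_deltas` and `integrate` are illustrative, not from any dataset library.

```python
def to_deltas(positions):
    """Difference consecutive absolute positions into relative actions."""
    return [tuple(b - a for a, b in zip(p0, p1))
            for p0, p1 in zip(positions, positions[1:])]

def integrate(start, deltas):
    """Replay relative actions from a given starting position."""
    pose, path = list(start), [tuple(start)]
    for d in deltas:
        pose = [p + di for p, di in zip(pose, d)]
        path.append(tuple(pose))
    return path

absolute = [(0.40, 0.20, 0.10), (0.45, 0.20, 0.10), (0.50, 0.22, 0.12)]
deltas = to_deltas(absolute)

# Same deltas replayed from a start that is 5 cm higher: the motion shifts with it.
shifted = integrate((0.40, 0.20, 0.15), deltas)

# A constant 1 mm execution bias per step accumulates instead of cancelling.
biased = integrate(absolute[0], [(dx + 0.001, dy, dz) for dx, dy, dz in deltas])
```

After two steps the biased replay is already 2 mm off in x, which is the drift that periodic re-centering or closed-loop correction has to absorb.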
Most modern datasets use relative actions. The LeRobot standard specifies 7D deltas (dx, dy, dz, droll, dpitch, dyaw, dgripper) normalized to [-1, 1] per dimension, with normalization statistics stored per dataset to enable denormalization at inference time.
Continuous, Discrete, and Hybrid Action Spaces
Continuous action spaces (A ⊆ ℝⁿ) are standard for manipulation. A 7D Cartesian delta policy outputs real-valued vectors; the robot controller interpolates to smooth trajectories. Continuous actions enable fine-grained control but require careful normalization: without it, a policy may learn that "gripper=0.9" means "closed" on one robot but "half-open" on another.
Discrete action spaces (A is a finite set) are common in navigation and high-level planning. A mobile robot might choose from {forward, backward, left, right, stop}; a manipulation policy might choose from {grasp, place, push, pull}. Discrete actions simplify exploration and credit assignment but sacrifice precision. SayCan uses a discrete high-level action space ("pick up the sponge") with continuous low-level controllers, demonstrating that hybrid approaches can combine the strengths of both.
Hybrid action spaces mix continuous and discrete dimensions. A bimanual policy might output 14D continuous joint positions plus 2D discrete gripper states (open/closed per hand). The CALVIN benchmark uses 7D continuous actions plus a binary gripper, totaling 8D. Hybrid spaces complicate policy architectures: continuous dimensions typically use Gaussian or diffusion heads, while discrete dimensions use softmax classifiers.
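As a rough illustration of a hybrid output, the sketch below splits an 8-value network output into 7 continuous dimensions and one binary gripper decision. The tanh squashing and the 0.5 threshold are assumptions chosen for illustration; they are not taken from CALVIN or any specific policy.

```python
import math

def decode_hybrid(raw_output):
    """Split 8 raw values into a bounded 7D continuous action and a binary gripper."""
    *continuous_raw, gripper_logit = raw_output
    continuous = [math.tanh(v) for v in continuous_raw]   # squash to (-1, 1)
    p_closed = 1.0 / (1.0 + math.exp(-gripper_logit))      # sigmoid on the logit
    gripper_closed = int(p_closed >= 0.5)                  # discrete decision
    return continuous, gripper_closed

arm_action, gripper = decode_hybrid([0.2, -0.1, 0.0, 0.0, 0.0, 0.3, 0.0, 2.0])
```

The two parts would normally be trained with different losses (for example a regression or diffusion loss on the continuous part and cross-entropy on the gripper), which is the architectural complication the paragraph refers to.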
Action discretization is an active research area. Some teams quantize continuous actions into bins (e.g., 256 bins per dimension) to enable autoregressive generation, trading off precision for the ability to use large language model architectures. RT-1 discretizes each of 7 action dimensions into 256 bins, yielding a vocabulary of 1,792 tokens (256 × 7), then trains a Transformer to predict token sequences.
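A minimal sketch of uniform binning, assuming actions are already normalized to [-1, 1] and that decoding uses bin centers. The function names are illustrative and the bin layout is an assumption, not RT-1's exact implementation.

```python
NUM_BINS = 256
LOW, HIGH = -1.0, 1.0
BIN_WIDTH = (HIGH - LOW) / NUM_BINS

def to_bin(value):
    """Map a continuous value in [LOW, HIGH] to a bin index in [0, NUM_BINS - 1]."""
    clipped = min(max(value, LOW), HIGH)
    index = int((clipped - LOW) / BIN_WIDTH)
    return min(index, NUM_BINS - 1)  # value == HIGH falls into the last bin

def from_bin(index):
    """Decode a bin index back to the center of its bin."""
    return LOW + (index + 0.5) * BIN_WIDTH

action = [0.05, -0.30, 0.0, 0.0, 0.0, 0.10, 1.0]
tokens = [to_bin(a) for a in action]
decoded = [from_bin(t) for t in tokens]
max_error = max(abs(a - d) for a, d in zip(action, decoded))
```

With this layout the round-trip error is at most half a bin width, which is the precision that discretization gives up.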
Action Chunking and Temporal Horizons
Action chunking predicts sequences of future actions rather than single timesteps. Instead of outputting a_t, a policy outputs (a_t, a_{t+1}, …, a_{t+k}). In receding-horizon execution, the robot executes only the first part of the chunk (in the limit, just the first action), then re-queries the policy, which predicts a fresh sequence from the new observation. Even when only a prefix is executed, the policy plans over a horizon longer than a single step, which reduces compounding error.
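The execution loop can be sketched as follows. `toy_policy` and `toy_env` are hypothetical stand-ins for a learned policy and a robot; the parameter `execute_k` controls how much of each predicted chunk is executed before the policy is queried again.

```python
def run_receding_horizon(policy, step_env, obs, total_steps, execute_k):
    """Execute the first `execute_k` actions of each chunk, then re-query."""
    executed, queries = [], 0
    while len(executed) < total_steps:
        chunk = policy(obs)
        queries += 1
        for action in chunk[:execute_k]:
            if len(executed) == total_steps:
                break
            obs = step_env(obs, action)
            executed.append(action)
    return executed, queries

def toy_policy(obs, horizon=16):
    """Hypothetical policy: each step covers 10% of the remaining distance to 1.0."""
    plan, x = [], obs
    for _ in range(horizon):
        a = 0.1 * (1.0 - x)
        plan.append(a)
        x += a
    return plan

def toy_env(obs, action):
    """Hypothetical 1D environment: the state is the sum of executed actions."""
    return obs + action

_, queries_k1 = run_receding_horizon(toy_policy, toy_env, 0.0, 32, execute_k=1)
_, queries_k8 = run_receding_horizon(toy_policy, toy_env, 0.0, 32, execute_k=8)
```

Setting `execute_k=1` gives maximum reactivity at the cost of one inference call per control step; larger values cut inference calls but delay the policy's response to new observations.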
Diffusion Policy predicts 16-step action chunks using a denoising diffusion model, achieving 90% success on long-horizon tasks where single-step policies plateau at 60%[3]. The tradeoff is latency: generating a 16-step diffusion sample takes 10-50 ms depending on hardware, so inference alone bounds the re-query rate at 20-100 Hz, and end-to-end control typically runs at 10-20 Hz. For tasks requiring fast reactions (catching, contact-rich assembly), single-step policies at 50-100 Hz may be preferable.
Chunk length interacts with action representation. Joint-space chunks are smooth by construction (joint velocities are continuous), but Cartesian chunks can exhibit discontinuities if the policy predicts incompatible poses. ACT uses a CVAE to predict 100-step joint-position chunks, ensuring kinematic feasibility by construction. Cartesian chunking requires post-processing: some teams apply Savitzky-Golay filters or spline interpolation to smooth predicted trajectories.
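To illustrate the post-processing step, here is a minimal sketch that smooths a predicted Cartesian chunk with a centered moving average. This is a deliberately simpler stand-in for the Savitzky-Golay or spline smoothing mentioned above, and the chunk values are made up.

```python
def moving_average(chunk, window=3):
    """Smooth each action dimension with a centered window (shrunk at the edges)."""
    half = window // 2
    smoothed = []
    for i in range(len(chunk)):
        rows = chunk[max(0, i - half):min(len(chunk), i + half + 1)]
        smoothed.append([sum(col) / len(rows) for col in zip(*rows)])
    return smoothed

def max_jump(chunk, dim=0):
    """Largest step-to-step change along one dimension."""
    return max(abs(b[dim] - a[dim]) for a, b in zip(chunk, chunk[1:]))

# Hypothetical (x, y) chunk with an outlier at step 2.
chunk = [[0.00, 0.0], [0.01, 0.0], [0.08, 0.0], [0.03, 0.0], [0.04, 0.0]]
smooth_chunk = moving_average(chunk)
```

Any smoothing filter trades fidelity for continuity: it also attenuates fast motions the policy intended, so window size is a tuning choice.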
The RLDS format stores action chunks as nested arrays, enabling policies to train on variable-length horizons. A dataset might store 1 Hz high-level plans alongside 50 Hz low-level controls, and a policy can sample either depending on its architecture.
Action Space Normalization and Scaling
Normalization maps raw action values to a standard range, typically [-1, 1] or [0, 1]. Without normalization, a policy trained on a Franka arm (joint limits ±2.9 rad) will output nonsensical commands on a UR5e (joint limits ±6.3 rad). Normalization is mandatory for cross-embodiment transfer and stable training.
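A common approach is z-score standardization: for each action dimension i, compute mean μᵢ and standard deviation σᵢ across the training dataset, then normalize via aᵢ' = (aᵢ - μᵢ) / σᵢ. At inference, denormalize via aᵢ = aᵢ' × σᵢ + μᵢ. This yields zero-mean, unit-variance values rather than a strictly bounded range; mapping into [-1, 1] instead uses per-dimension minimum and maximum (or quantile) bounds. Open X-Embodiment stores normalization statistics per dataset in a JSON sidecar, enabling policies to denormalize correctly even when trained on mixed data.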
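The mean and standard-deviation recipe can be sketched with the standard library alone. The dictionary layout for the statistics and the tiny three-dimensional dataset are assumptions for illustration, not the format any particular dataset uses.

```python
import statistics

def fit_stats(actions):
    """Compute per-dimension mean and (population) standard deviation."""
    columns = list(zip(*actions))
    return {
        "mean": [statistics.fmean(c) for c in columns],
        # Fall back to 1.0 so constant dimensions do not divide by zero.
        "std": [statistics.pstdev(c) or 1.0 for c in columns],
    }

def normalize(action, stats):
    return [(a - m) / s for a, m, s in zip(action, stats["mean"], stats["std"])]

def denormalize(action, stats):
    return [a * s + m for a, m, s in zip(action, stats["mean"], stats["std"])]

data = [[0.01, -0.02, 1.0], [0.03, 0.00, 0.0], [0.05, 0.02, 1.0]]
stats = fit_stats(data)
restored = denormalize(normalize(data[0], stats), stats)
```

The statistics must travel with the dataset: a policy output is only meaningful once it is denormalized with the same `mean` and `std` that were used during training.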
Rotation representations complicate normalization. Euler angles (roll, pitch, yaw) have discontinuities at ±π; quaternions live on a 4D unit sphere; axis-angle representations have variable magnitude. RT-2 uses 6D rotation representations (two orthonormal 3D vectors) to avoid gimbal lock and enable smooth interpolation. The LeRobot library defaults to axis-angle for compactness but provides converters for quaternions and 6D.
Gripper normalization is often binary (0=open, 1=closed) but some datasets use continuous values (0-1 = partial closure). DROID records raw gripper width in millimeters, then normalizes to [0, 1] per robot model. Policies trained on mixed data must learn that "gripper=0.5" means different physical widths on different grippers—a source of transfer failure if not carefully managed.
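The ambiguity of a normalized gripper value is easy to demonstrate. In this sketch the gripper names and maximum widths are hypothetical, and the normalized value is assumed to be the fraction of maximum opening; real datasets may use the opposite convention.

```python
# Hypothetical maximum opening widths in millimetres.
MAX_WIDTH_MM = {"gripper_a": 85.0, "gripper_b": 110.0}

def normalize_width(width_mm, model):
    """Map a raw width to [0, 1] as a fraction of that gripper's maximum opening."""
    return min(max(width_mm / MAX_WIDTH_MM[model], 0.0), 1.0)

def denormalize_width(value, model):
    """Map a normalized value back to millimetres for a specific gripper."""
    return value * MAX_WIDTH_MM[model]

width_a = denormalize_width(0.5, "gripper_a")
width_b = denormalize_width(0.5, "gripper_b")
```

The same normalized command produces different physical openings, so the per-model scale has to be stored with the data and applied at deployment.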
Action Space in Multi-Modal and Dexterous Systems
Dexterous hands expand action dimensionality dramatically. The Shadow Hand has 24 DOF; the Allegro Hand has 16 DOF. A bimanual dexterous system can exceed 40D action space before adding wrist or arm DOF. High dimensionality sharply increases data requirements: a policy that needs 1,000 demonstrations for a 7D arm may need 10,000+ for a 24D hand.
Dex-YCB provides 582,000 frames of human hand grasps with full 21-joint pose annotations, but human hands have different kinematics than robot hands—direct imitation fails. RoboCat trained on 253 tasks across 6 embodiments, including a 3-finger gripper and a parallel-jaw gripper, by learning a shared latent action space that maps embodiment-specific commands to task-relevant effects. The latent space is 32D, compressed from raw action spaces ranging from 7D to 24D.
Multi-modal action spaces combine manipulation and mobility. A mobile manipulator might output 7D arm actions + 3D base velocity (x, y, θ). DROID includes 12 mobile manipulation trajectories where the action space is 10D (7D arm + 3D base), but most datasets separate navigation and manipulation into distinct episodes to simplify learning.
Hierarchical action spaces decompose tasks into high-level discrete choices and low-level continuous execution. SayCan uses a language model to select skills ("pick up the sponge"), then executes each skill with a continuous policy. The high-level action space is the vocabulary of skill names; the low-level space is 7D Cartesian deltas. This decomposition reduces the effective action space size and enables compositional generalization.
Action Space Design for Sim-to-Real Transfer
Simulation training requires action spaces that bridge the reality gap. Joint-space actions transfer poorly: simulated joint friction, backlash, and compliance rarely match real hardware. Cartesian actions transfer better because they abstract over joint-level dynamics, but IK solvers differ between sim and real, causing subtle mismatches.
Domain randomization varies action execution noise during training. A policy trained with ±5% Gaussian noise on each action dimension learns robust strategies that tolerate real-world actuation errors. RLBench applies action noise, gravity randomization, and visual randomization simultaneously, achieving 70% sim-to-real success on pick-and-place tasks without real-world fine-tuning[4].
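A minimal sketch of action-noise randomization. It interprets "±5% Gaussian noise" as multiplicative noise with a relative standard deviation of 0.05, which is an assumption; additive noise scaled to the action range is an equally plausible reading. This is not RLBench's API.

```python
import random

def add_action_noise(action, rng, rel_sigma=0.05):
    """Scale each dimension by (1 + eps) with eps drawn from N(0, rel_sigma)."""
    return [a * (1.0 + rng.gauss(0.0, rel_sigma)) for a in action]

rng = random.Random(0)  # seeded so the sketch is reproducible
action = [0.05, -0.02, 0.0, 0.0, 0.0, 0.10, 1.0]
noisy_action = add_action_noise(action, rng)
```

Note that multiplicative noise leaves zero-valued dimensions untouched, which may or may not be desirable depending on how the real actuation error behaves.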
Action delays are a major sim-to-real failure mode. Simulated policies assume zero latency between action output and execution; real robots have 10-50 ms delays from network transmission, controller processing, and actuator response. Policies trained with simulated delays (randomly sampled from 10-50 ms) show 30% higher real-world success than zero-delay policies[5].
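Delay can be injected into a simulator with a simple FIFO buffer between the policy and the actuator. The class below is a hypothetical sketch: the delay is expressed in whole control steps, so a 40 ms delay at 50 Hz (20 ms per step) corresponds to `delay_steps=2`.

```python
from collections import deque

class DelayedActuator:
    """Return the action commanded `delay_steps` control steps earlier."""

    def __init__(self, delay_steps, idle_action):
        # Pre-fill so the first outputs are the idle action.
        self._queue = deque([idle_action] * delay_steps)

    def step(self, commanded_action):
        self._queue.append(commanded_action)
        return self._queue.popleft()

control_hz = 50
delay_ms = 40
delay_steps = round(delay_ms / (1000 / control_hz))

actuator = DelayedActuator(delay_steps, idle_action=0.0)
applied = [actuator.step(a) for a in [0.1, 0.2, 0.3, 0.4]]
```

Sampling `delay_steps` per episode during training is one way to expose the policy to the range of latencies mentioned above.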
RoboSuite provides a unified action space API across 9 simulated robots, enabling researchers to train policies on one embodiment and test on others without rewriting action-space code. The API supports joint, Cartesian, and delta modes, with automatic normalization and IK fallback. Real-world deployment still requires per-robot calibration, but the sim-to-sim transfer validates that the action space design is embodiment-agnostic.
Action Space Standardization Across Datasets
Cross-dataset training requires standardized action spaces. Open X-Embodiment aggregated 1 million trajectories from 22 robot embodiments by converting all actions to 7D Cartesian deltas (dx, dy, dz, droll, dpitch, dyaw, dgripper)[6]. Datasets that originally used joint space (e.g., RoboNet) were post-processed via forward kinematics; datasets with absolute Cartesian targets were differenced to produce deltas.
The conversion is lossy. Joint-space data loses precision when mapped to Cartesian: a 0.1° joint error may translate to 1 mm end-effector error, but the reverse mapping (Cartesian to joint) is non-unique for redundant manipulators. BridgeData V2 stores both joint and Cartesian actions to enable researchers to choose the representation that fits their architecture.
LeRobot defines a standard action schema: a 1D array of floats with metadata specifying dimensionality, coordinate frame, normalization statistics, and control frequency. Datasets that conform to the schema can be loaded with a single API call; non-conforming datasets require custom loaders. As of 2024, 18 datasets in the LeRobot hub use the standard schema, covering 120,000 trajectories[7].
Action space metadata is often missing from public datasets. A 2021 survey found that 60% of robotics datasets on Hugging Face lack normalization statistics, and 40% do not specify whether actions are absolute or relative[8]. Truelabel's data provenance framework requires action-space metadata (dimensionality, frame, normalization, frequency) as a mandatory field, ensuring that buyers know exactly what they are purchasing.
Action Space and Data Efficiency
Action space design directly affects sample efficiency. Policies trained on joint-space actions typically require 30-50% fewer demonstrations than Cartesian policies for the same task, because joint commands map directly onto the actuators and avoid IK errors. However, joint-space policies do not transfer across embodiments, so the total data cost (including retraining per robot) is higher.
Relative actions improve data efficiency by 2-3× compared to absolute actions for tasks with positional variation. DROID reports that policies trained on relative deltas achieve 80% success with 500 demonstrations, while absolute-action policies require 1,500 demonstrations for the same performance[9]. The gap widens for long-horizon tasks: relative actions enable compositional generalization ("move toward object" works regardless of object position), while absolute actions require exhaustive coverage of the state space.
Action chunking reduces data requirements by 40-60% for long-horizon tasks. Diffusion Policy achieves 90% success on a 10-step assembly task with 200 demonstrations, while single-step policies require 500+ demonstrations[10]. The benefit comes from reduced compounding error: when a 16-step chunk is executed in full before the policy is re-queried, the policy makes one prediction per 16 control steps instead of 16, so there are fewer points at which prediction errors can feed back into later predictions.
Action discretization can improve or hurt efficiency depending on task structure. For tasks with natural discrete modes (grasp vs. place), discrete actions enable faster exploration and clearer credit assignment. For tasks requiring fine-grained control (peg insertion, contact-rich manipulation), discretization loses precision and increases data needs. RT-1 uses 256 bins per dimension, a compromise that works for coarse manipulation but struggles with sub-millimeter precision tasks.
Action Space in Vision-Language-Action Models
Vision-language-action (VLA) models like RT-2 and OpenVLA use language to condition action predictions. The action space remains 7D Cartesian deltas, but the policy architecture is a Transformer that jointly processes vision tokens, language tokens, and action tokens. Language provides task specification ("pick up the red block"), while vision provides state, and actions provide control.
Autoregressive VLA models typically tokenize actions. RT-1 discretizes each action dimension into 256 bins, then treats each bin as a token in a 1,792-token vocabulary (256 bins × 7 dimensions). The policy predicts a sequence of 7 tokens per timestep, which are decoded back to continuous actions via bin centers. This approach enables autoregressive generation but loses precision: 256 bins over a ±1 range (total width 2) gives a bin width of 2/256 ≈ 0.008 per dimension.
OpenVLA follows the same discretization recipe: each of its 7 action dimensions is quantized into 256 bins, and the bin indices are mapped onto tokens in the language model's vocabulary. The alternative is a continuous action head, in which the Transformer outputs 7 real values that are passed through a tanh activation to produce actions in [-1, 1]. This preserves precision but complicates training: continuous outputs require careful loss weighting to balance position, rotation, and gripper dimensions.
Language conditioning changes action space semantics. A policy conditioned on "pick up the apple" learns a different action distribution than one conditioned on "grasp the fruit," even though the physical task is identical. RT-2 trains on 6 billion web images with captions to learn robust language-action grounding, enabling zero-shot generalization to novel instructions. The action space is unchanged (7D deltas), but the policy's internal representation is richer.
Action Space Metadata in Dataset Procurement
Buyers of robot demonstration data must verify action space metadata before purchase. Critical fields include dimensionality (how many DOF), coordinate frame (joint, world, body, end-effector), representation (absolute, relative, velocity), normalization (range, statistics), and control frequency (Hz). Missing or incorrect metadata renders datasets unusable for cross-embodiment training.
Truelabel's physical AI marketplace requires sellers to declare action space metadata in a structured schema: a JSON object with fields for `action_dim`, `action_frame`, `action_type`, `normalization_stats`, and `control_hz`. Buyers can filter datasets by action space compatibility, ensuring that purchased data matches their policy architecture.
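A metadata object with the five fields named above might look like the following. The field names come from the text; the example values, the expected types, and the `invalid_fields` helper are assumptions for illustration and not Truelabel's actual schema or tooling.

```python
import json

# Assumed types for each required field.
REQUIRED_FIELDS = {
    "action_dim": int,
    "action_frame": str,
    "action_type": str,
    "normalization_stats": dict,
    "control_hz": (int, float),
}

def invalid_fields(meta):
    """Return the sorted names of fields that are missing or of the wrong type."""
    return sorted(name for name, expected in REQUIRED_FIELDS.items()
                  if not isinstance(meta.get(name), expected))

example = json.loads("""
{
  "action_dim": 7,
  "action_frame": "end_effector",
  "action_type": "relative",
  "normalization_stats": {"mean": [0, 0, 0, 0, 0, 0, 0], "std": [1, 1, 1, 1, 1, 1, 1]},
  "control_hz": 10
}
""")

problems = invalid_fields(example)
```

A check of this kind can run before purchase or ingestion, so that a dataset with undeclared frame or frequency is rejected early rather than discovered during training.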
Action space mismatches are a common procurement failure. A team training a 7D Cartesian policy cannot use a dataset with 14D joint actions without conversion, which requires forward kinematics and embodiment-specific URDF files. If the dataset does not include joint limits or DH parameters, conversion is impossible. Open X-Embodiment provides conversion scripts for 22 embodiments, but custom robots require manual calibration.
Action frequency mismatches cause temporal aliasing. A policy designed for 10 Hz control expects actions spaced 100 ms apart; if it is trained on data recorded at 50 Hz without resampling, it sees five recorded actions per intended control step, causing distribution shift. Resampling is possible but lossy: downsampling averages actions (blurring fast motions), while upsampling interpolates (inventing data). Best practice: match training and deployment frequencies exactly, or train on mixed-frequency data with frequency as a conditioning variable.
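How to downsample depends on the action type. The sketch below handles relative (delta) actions, where consecutive steps are summed rather than averaged so that total displacement is preserved; this is an assumption about the desired behavior, and absolute targets would instead be subsampled or averaged. The function name is hypothetical.

```python
def downsample_deltas(deltas, factor):
    """Merge every `factor` consecutive delta actions into one by summing them."""
    merged = []
    for start in range(0, len(deltas) - factor + 1, factor):
        group = deltas[start:start + factor]
        merged.append([sum(column) for column in zip(*group)])
    return merged

# Ten 50 Hz delta steps of 1 cm in x become two 10 Hz steps of 5 cm.
deltas_50hz = [[0.01, 0.0] for _ in range(10)]
deltas_10hz = downsample_deltas(deltas_50hz, factor=5)
```

Summing keeps the end-effector's net motion intact but still discards the within-window trajectory shape, which is the loss the paragraph describes. Trailing steps that do not fill a complete group are dropped in this sketch.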
Future Directions in Action Space Design
Learned action spaces are an emerging research direction. Instead of hand-designing joint or Cartesian representations, policies learn latent action embeddings from data. RoboCat learns a 32D latent action space shared across 6 embodiments, enabling zero-shot transfer to new robots by training a small embodiment-specific encoder/decoder. The latent space captures task-relevant effects ("move gripper toward object") while abstracting over embodiment-specific kinematics.
Hierarchical action spaces decompose tasks into multi-level plans. A mobile manipulator might output high-level waypoints ("navigate to table"), mid-level skills ("grasp object"), and low-level controls (7D Cartesian deltas). SayCan demonstrates this approach for language-conditioned tasks, but most datasets still provide only low-level actions. Future datasets may include hierarchical annotations, enabling policies to learn at multiple levels of abstraction.
Action space adaptation is critical for real-world deployment. A policy trained on a 7-DOF arm must adapt to a 6-DOF arm without full retraining. Techniques include action-space projection (map 7D to 6D by dropping redundant DOF), action-space augmentation (train on mixed 6D and 7D data), and meta-learning (learn to adapt action spaces from few demonstrations). Open X-Embodiment shows that policies trained on 22 embodiments generalize to a 23rd with 10-50 demonstrations, but the adaptation mechanism is still poorly understood.
Action space standardization efforts are accelerating. LeRobot proposes a universal action schema; RLDS provides a storage format; Open X-Embodiment defines a cross-embodiment convention. As physical AI scales, action space interoperability will become as critical as image format standardization was for computer vision. Buyers should prioritize datasets that conform to emerging standards to future-proof their data pipelines.
External references and source context
[1] Project site (dex-ycb.github.io). Dex-YCB provides 582,000 frames of human hand grasps with 21-joint pose annotations.
[2] Project site (droid-dataset.github.io). DROID spans 21 institutions and includes 12 mobile manipulation trajectories with 10D action space.
[3] Diffusion Policy training example (GitHub). Diffusion Policy achieves 90% success vs 60% for single-step policies on long-horizon tasks.
[4] Project site (sites.google.com). RLBench provides 100 simulated tasks with standardized action space API.
[5] Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning (arXiv). Policies trained with simulated 10-50 ms delays show 30% higher real-world success than zero-delay policies.
[6] Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). Open X-Embodiment aggregated 1 million trajectories from 22 embodiments via 7D Cartesian delta conversion.
[7] LeRobot GitHub repository (GitHub). 18 datasets in LeRobot hub use standard action schema covering 120,000 trajectories as of 2024.
[8] Data and its (dis)contents: A survey of dataset development and use in machine learning research (Patterns). 60% of robotics datasets lack normalization statistics and 40% do not specify absolute vs relative actions.
[9] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv). DROID policies trained on relative deltas achieve 80% success with 500 demos vs 1,500 for absolute actions.
[10] Diffusion Policy training example (GitHub). Diffusion Policy achieves 90% success with 200 demos vs 500+ for single-step policies on 10-step assembly.
FAQ
What is the difference between joint space and Cartesian space actions?
Joint space actions specify desired angles or velocities for each robot actuator (e.g., 7D for a 7-DOF arm), offering precise repeatability but no cross-embodiment transfer. Cartesian space actions specify desired end-effector poses in world or body coordinates (typically 6D position+orientation or 7D with gripper), enabling transfer across robots with different kinematic chains but requiring inverse kinematics computation at every timestep.
Why do most vision-language-action models use relative Cartesian deltas?
Relative Cartesian deltas (incremental changes to end-effector pose) are invariant to global position shifts, enabling policies to generalize across environments without per-site calibration. RT-1, RT-2, and OpenVLA all use 7D deltas (dx, dy, dz, droll, dpitch, dyaw, dgripper) because a policy that learns "move 5 cm toward object" works regardless of table height or object position, while absolute commands require exhaustive coverage of the state space.
How does action chunking improve data efficiency?
Action chunking predicts sequences of 16-100 future actions rather than single timesteps, reducing compounding error by executing multiple actions before re-querying the policy. Diffusion Policy achieves 90% success on long-horizon tasks with 200 demonstrations using 16-step chunks, while single-step policies require 500+ demonstrations for the same performance—a 60% reduction in data requirements.
What action space metadata must be verified before purchasing robot demonstration data?
Critical metadata includes dimensionality (number of DOF), coordinate frame (joint/world/body/end-effector), representation (absolute/relative/velocity), normalization statistics (mean and standard deviation per dimension), and control frequency (Hz). Missing or incorrect metadata renders datasets unusable for cross-embodiment training; conversion requires forward kinematics, embodiment-specific URDF files, and joint limits.
Why is action space normalization mandatory for cross-embodiment transfer?
Without normalization, a policy trained on one robot outputs nonsensical commands on another. A Franka arm has joint limits ±2.9 rad; a UR5e has ±6.3 rad. Normalization rescales each dimension using statistics computed across the training dataset (per-dimension mean and standard deviation for z-scoring, or minimum and maximum bounds for a fixed range such as [-1, 1]), and the stored statistics are used to denormalize at inference. Open X-Embodiment stores normalization metadata per dataset to enable correct denormalization across 22 embodiments.
How do action delays affect sim-to-real transfer?
Simulated policies assume zero latency between action output and execution; real robots have 10-50 ms delays from network transmission, controller processing, and actuator response. Policies trained with simulated delays (randomly sampled from 10-50 ms) show 30% higher real-world success than zero-delay policies. Action delays cause the robot to execute stale commands, leading to overshoot, oscillation, or task failure on contact-rich tasks.