
Physical AI Glossary

Joint-Space Control

Joint-space control commands a robot's internal degrees of freedom—joint angles, velocities, or torques—rather than end-effector Cartesian poses. For an n-joint manipulator, the control input is an n-dimensional vector specifying per-joint targets at each timestep. Position control (target joint angles tracked by PID) dominates learned manipulation because it is stable, unambiguous, and directly executable. Velocity and torque modes offer finer control over dynamics but require more sophisticated controllers. Joint-space actions are embodiment-specific: a policy trained on 7-DOF Franka Panda joint vectors cannot transfer to a 6-DOF UR5 without retraining, unlike task-space policies that generalize across kinematic chains.

Updated 2025-06-08
By truelabel
Reviewed by truelabel

Quick facts

Term
Joint-Space Control
Domain
Robotics and physical AI
Last reviewed
2025-06-08

What Joint-Space Control Is and Why It Matters for Learned Policies

Joint-space control operates on a robot's internal coordinate system—the joint angles, velocities, or torques of each actuator—rather than the end-effector's Cartesian position and orientation. For a 7-DOF arm like the Franka FR3, a joint-space position command is a 7-element vector specifying target angles for shoulder, elbow, and wrist joints. The robot's low-level servo controller tracks these targets using PID loops, making position control the most stable and widely deployed mode.
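To make the tracking loop concrete, here is a toy per-joint P-D step for a 7-element position command. The gains, the target vector, and the function name are illustrative; real firmware runs full PID with feedforward terms at kilohertz rates.

```python
# Per-joint P-D tracking of a joint-space position command (illustrative
# gains; real firmware runs full PID with feedforward at kHz rates).
def pd_torques(q_target, q_current, q_dot, kp=60.0, kd=4.0):
    """torque_j = kp * (target_j - current_j) - kd * velocity_j"""
    return [kp * (qt - qc) - kd * qd
            for qt, qc, qd in zip(q_target, q_current, q_dot)]

q_target = [0.0, -0.785, 0.0, -2.356, 0.0, 1.571, 0.785]  # 7-element command (rad)
tau = pd_torques(q_target, [0.0] * 7, [0.0] * 7)
```

Each joint gets a torque proportional to its own angle error, which is why an n-joint robot needs exactly an n-element command per timestep.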

Learned manipulation policies favor joint-space actions because they eliminate inverse-kinematics ambiguity. A Cartesian target (x, y, z, roll, pitch, yaw) can correspond to multiple joint configurations for redundant arms; joint-space commands are unambiguous and directly executable. ACT, for example, outputs joint-space position targets, trained via behavioral cloning on teleoperation demonstrations. The policy learns smooth joint trajectories that respect the robot's dynamics, avoiding the singularities and discontinuities that plague naïve Cartesian controllers.

The tradeoff is embodiment specificity. A policy trained on 7-DOF Franka joint vectors cannot run on a 6-DOF UR5 without retraining, because the action dimensionality and kinematic structure differ. Task-space policies that output end-effector deltas can generalize across morphologies, but they sacrifice single-embodiment performance. Open X-Embodiment demonstrates that joint-space specialists outperform task-space generalists by 12–18% on manipulation benchmarks when evaluated on the training embodiment[1].

Position, Velocity, and Torque Control Modes

Joint-space control has three primary modes, each exposing a different level of the control hierarchy. Position control commands target joint angles; the robot's firmware PID controller generates torques to track the setpoint. This is the default mode for most commercial arms and the easiest to work with—policies output joint angles, and the hardware handles dynamics. Scale AI's physical-AI data engine collects teleoperation demonstrations in position mode because it produces stable, reproducible trajectories suitable for behavioral cloning.

Velocity control commands target joint velocities rather than positions. The low-level controller integrates velocity commands into position targets, allowing smoother motion profiles and better handling of contact. Velocity mode is common in mobile manipulation and compliant tasks where the robot must adapt to external forces. However, velocity policies require careful tuning to avoid drift and overshoot, making them less popular for imitation learning.
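The integration step a velocity-mode controller performs can be sketched in a few lines; the tick rate and command values here are illustrative, not a vendor implementation.

```python
# Integrate commanded joint velocities into position setpoints each tick.
def integrate_velocity(q, q_dot_cmd, dt):
    return [qi + vi * dt for qi, vi in zip(q, q_dot_cmd)]

q = [0.0] * 6
for _ in range(100):                      # 100 ticks at 100 Hz = 1 s
    q = integrate_velocity(q, [0.1] * 6, dt=0.01)
# each joint advances roughly 0.1 rad over the second
```

The drift risk mentioned above is visible here: any bias in the commanded velocities accumulates directly into the position setpoint.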

Torque control commands joint torques directly, bypassing position and velocity loops. This exposes the full dynamics and enables force-sensitive behaviors—insertion, wiping, contact-rich assembly—but demands sophisticated policies that model inertia, friction, and external disturbances. DROID records joint torques alongside positions and velocities across its 76,000 trajectories, but torque policies remain harder to train than position policies[2]. Most learned manipulation systems use position control for 80–90% of tasks and reserve torque control for specialized contact scenarios.

Joint-Space Actions in Behavioral Cloning and Imitation Learning

Behavioral cloning trains policies to mimic expert demonstrations by supervised learning on (observation, action) pairs. When actions are joint-space positions, the policy learns a mapping from camera images or proprioceptive state to target joint angles. Diffusion Policy models joint-space action sequences as conditional diffusion processes, generating smooth multi-step trajectories that avoid the compounding errors of single-step predictors. On the robomimic benchmark, diffusion-based joint-space policies achieve 89% success on pick-and-place tasks versus 67% for single-step regression[3].
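The supervised core of behavioral cloning can be sketched with a toy linear policy fit by MSE gradient descent. The demonstrations and model here are stand-ins; real policies map camera images to joint actions with deep networks.

```python
# Toy behavioral cloning: regress a 2-joint action from a scalar observation
# by minimizing mean-squared error over (observation, action) pairs.
def bc_step(w, demos, lr=0.1):
    grad = [0.0] * len(w)
    for x, action in demos:
        for j, wj in enumerate(w):
            grad[j] += 2 * (wj * x - action[j]) * x / len(demos)
    return [wj - lr * gj for wj, gj in zip(w, grad)]

demos = [(1.0, [0.5, -0.2]), (2.0, [1.0, -0.4])]  # expert: action = [0.5x, -0.2x]
w = [0.0, 0.0]
for _ in range(200):
    w = bc_step(w, demos)
# w converges toward the expert mapping [0.5, -0.2]
```

Everything else in the pipeline—action chunking, diffusion heads, image encoders—elaborates this same regression objective.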

The key advantage of joint-space actions for imitation learning is trajectory smoothness. Human teleoperators naturally produce smooth joint motions; recording these as joint angles preserves the temporal structure. Policies trained on joint-space data inherit this smoothness, reducing jerk and improving physical plausibility. ALOHA collects bimanual teleoperation at 50 Hz in joint space, enabling policies to learn fine-grained coordination between left and right arms[4].

The disadvantage is embodiment lock-in. A policy trained on Franka Panda 7-DOF joint data cannot transfer to a different arm without retraining, because the action space is robot-specific. RT-2 addresses this by pretraining on web-scale vision-language data and fine-tuning on joint-space robot data, but the fine-tuning step is still embodiment-specific. Cross-embodiment generalization requires task-space actions or learned kinematic adapters, both of which sacrifice single-robot performance.

Joint-Space vs. Task-Space Control: When to Use Each

Task-space control commands end-effector poses (position and orientation in Cartesian coordinates) rather than joint angles. The robot's inverse-kinematics solver computes the joint configuration needed to achieve the target pose. Task-space actions are intuitive for humans—"move the gripper 5 cm forward"—and generalize across robot morphologies, because the action space is defined in world coordinates rather than joint coordinates.
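The joint-to-task mapping is easiest to see with planar two-link forward kinematics; the link lengths are arbitrary, and real arms use full 3-D kinematic chains.

```python
import math

# Planar 2-link forward kinematics: joint angles -> end-effector (x, y).
def fk_2link(q1, q2, l1=0.3, l2=0.25):
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y

x, y = fk_2link(0.0, 0.0)   # fully extended along x: (0.55, 0.0)
```

Inverting this mapping is where ambiguity enters: most reachable (x, y) targets admit two joint solutions (elbow-up and elbow-down), the same redundancy problem the surrounding text describes for arms with more than six DOF.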

Joint-space control is preferred when single-embodiment performance is the priority. Policies that output joint angles avoid IK solver latency and singularities, and they can exploit the robot's full kinematic redundancy. OpenVLA uses joint-space actions for manipulation tasks and achieves 12% higher success rates than task-space baselines on the same hardware[5]. Joint-space policies also handle joint limits and self-collision constraints more naturally, because the policy learns these constraints implicitly from demonstration data.

Task-space control is preferred when cross-embodiment generalization is required. A policy that outputs end-effector deltas can run on any arm with sufficient reach and dexterity, because the action space is robot-agnostic. RT-X trains a single task-space policy on data from 22 robot embodiments and deploys it on new morphologies without retraining. The cost is a 10–15% performance penalty on each individual robot compared to joint-space specialists[6].

In practice, many systems use hybrid approaches: joint-space control for high-precision manipulation and task-space control for coarse positioning. CALVIN uses task-space actions for navigation and joint-space actions for grasping, switching modes based on the task phase.

Training Data Requirements for Joint-Space Policies

Joint-space policies require teleoperation demonstrations that record joint angles, velocities, and torques at each timestep, synchronized with camera observations and proprioceptive state. The LeRobot dataset format stores joint-space trajectories in HDF5 or Parquet with per-joint columns for position, velocity, and effort. A typical manipulation dataset contains 500–2,000 demonstrations per task, each 10–60 seconds long, yielding 50,000–200,000 timesteps of joint-space data[7].
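A sketch of one flattened timestep with per-joint columns follows; the column names are assumptions for illustration, not the exact LeRobot schema.

```python
# Build one flattened timestep row with per-joint columns, as a stand-in
# for a Parquet/HDF5 row (column names illustrative).
def make_step(t, pos, vel, eff):
    row = {"timestamp": t}
    for j, (p, v, e) in enumerate(zip(pos, vel, eff)):
        row[f"joint_{j}.position"] = p
        row[f"joint_{j}.velocity"] = v
        row[f"joint_{j}.effort"] = e
    return row

step = make_step(0.05, pos=[0.1] * 7, vel=[0.0] * 7, eff=[0.2] * 7)
# 1 timestamp column + 7 joints x 3 fields = 22 columns
```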

Teleoperation quality is critical. Jerky or inconsistent joint motions degrade policy performance, because the policy learns to mimic the noise. UMI uses a custom gripper with force feedback to enable smooth teleoperation, reducing demonstration variance by 40% compared to keyboard control[8]. Truelabel's physical-AI marketplace vets collectors on teleoperation smoothness and task completion rate before accepting joint-space datasets.

Embodiment metadata must accompany joint-space data: robot model, joint limits, link lengths, and DH parameters. Without this metadata, downstream users cannot verify kinematic feasibility or retarget trajectories to similar robots. RLDS defines a standard schema for joint-space episodes, but adoption remains incomplete—only 30% of robotics datasets on Hugging Face include full kinematic specifications[9].
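A minimal completeness check over the fields just listed might look like this; the field names are assumptions, not the RLDS schema.

```python
# Reject joint-space episodes whose embodiment metadata is incomplete.
REQUIRED_FIELDS = {"robot_model", "joint_limits", "link_lengths", "dh_parameters"}

def missing_fields(metadata):
    return sorted(REQUIRED_FIELDS - metadata.keys())

meta = {"robot_model": "franka_fr3", "joint_limits": [(-2.9, 2.9)] * 7}
# missing_fields(meta) reports the absent fields in sorted order
```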

Action frequency matters. Policies trained on 10 Hz joint-space data exhibit choppier motion than policies trained on 50 Hz data, because low-frequency actions undersample the robot's dynamics. DROID collects at 20 Hz, a compromise between data volume and motion fidelity. Higher frequencies (50–100 Hz) are necessary for contact-rich tasks where force transients occur on millisecond timescales.
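Linear interpolation can upsample a low-frequency trajectory for smoother playback, though, as noted above, it cannot recover dynamics that were never captured. A sketch with a factor-of-5 upsample from 10 Hz to 50 Hz:

```python
# Upsample a joint trajectory by linear interpolation between recorded steps.
def upsample(traj, factor):
    out = []
    for a, b in zip(traj[:-1], traj[1:]):
        for k in range(factor):
            t = k / factor
            out.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    out.append(traj[-1])
    return out

traj_10hz = [[0.0], [0.5], [1.0]]          # 3 samples at 10 Hz
traj_50hz = upsample(traj_10hz, 5)         # 11 samples at 50 Hz
```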

Embodiment Specificity and the Cross-Robot Transfer Problem

Joint-space actions are embodiment-specific by construction. A 7-DOF Franka Panda policy outputs 7-element joint vectors; a 6-DOF UR5 policy outputs 6-element vectors. The action spaces are incompatible, so a policy trained on one robot cannot run on the other without modification. This is the central challenge for joint-space control in multi-robot settings.

Kinematic retargeting attempts to map joint-space actions from one robot to another by solving for equivalent end-effector poses. Given a Franka joint configuration, compute the end-effector pose, then solve IK for the UR5 to achieve the same pose. This works for simple reaching tasks but fails for contact-rich manipulation, where joint-space dynamics (inertia, friction, torque limits) differ between robots. Sim-to-real transfer studies show that kinematic retargeting preserves only 60–70% of task success when transferring between dissimilar arms[10].
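The retargeting recipe, forward kinematics on the source arm followed by IK on the target arm, can be sketched for planar arms with different link lengths. The gradient-descent IK and all dimensions here are toy choices, not a production solver.

```python
import math

# FK for an n-link planar arm: joint angles -> end-effector (x, y).
def fk(q, links):
    x = y = a = 0.0
    for qi, li in zip(q, links):
        a += qi
        x += li * math.cos(a)
        y += li * math.sin(a)
    return x, y

# Gradient-descent IK on squared position error via finite differences
# (toy solver; real systems use analytic or optimization-based IK).
def ik(target, links, q0, iters=4000, step=1.0, eps=1e-5):
    q = list(q0)
    for _ in range(iters):
        x, y = fk(q, links)
        ex, ey = target[0] - x, target[1] - y
        dq = []
        for j in range(len(q)):
            qp = q[:]
            qp[j] += eps
            xp, yp = fk(qp, links)
            dq.append((ex * (xp - x) + ey * (yp - y)) / eps)
        q = [qi + step * di for qi, di in zip(q, dq)]
    return q

pose = fk([0.3, 1.0], links=[0.3, 0.25])            # source arm's pose
q_new = ik(pose, links=[0.35, 0.2], q0=[0.3, 0.8])  # retargeted joint angles
```

This preserves the end-effector pose but not the joint-space dynamics, which is exactly why the text reports retargeting degrading on contact-rich tasks.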

Action abstraction layers convert joint-space actions to task-space deltas or skill primitives, enabling cross-robot deployment. RT-2 uses a learned action encoder that maps robot-specific joint vectors to a shared latent space, then decodes to the target robot's joint space. This adds 15–20 ms of inference latency and reduces success rates by 8–12% compared to native joint-space policies[11].

Multi-embodiment datasets like Open X-Embodiment train joint-space policies separately for each robot, then ensemble predictions at test time. The policy selects the robot-specific head based on embodiment ID, avoiding the performance penalty of shared action spaces. This approach scales to 10–20 robots but requires separate data collection for each embodiment, multiplying dataset costs.

Joint-Space Control in Simulation and Sim-to-Real Transfer

Simulation environments for manipulation—robosuite, ManiSkill, RLBench—expose joint-space control APIs that mirror real-robot interfaces. Policies trained in simulation output joint-space actions that transfer to hardware with minimal adaptation, provided the simulator accurately models joint dynamics (friction, backlash, gear ratios). Domain randomization varies joint parameters during training to improve sim-to-real robustness, reducing the reality gap by 30–40%[12].

Physics fidelity is critical for joint-space sim-to-real transfer. Position-controlled policies are forgiving—small errors in joint friction or damping have limited impact on task success. Torque-controlled policies are brittle—a 10% error in link inertia can cause instability or task failure. Dynamics randomization trains torque policies on a distribution of simulated dynamics, enabling zero-shot transfer to real hardware in 70% of contact-rich tasks[13].
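The randomization step described above can be sketched as resampling per-joint parameters at each episode reset; the parameter names and ranges are illustrative.

```python
import random

# Sample per-joint friction and damping for one simulated episode.
def sample_dynamics(rng, n_joints=7):
    return [{"friction": rng.uniform(0.05, 0.4),
             "damping": rng.uniform(0.5, 2.0)}
            for _ in range(n_joints)]

rng = random.Random(0)                 # seeded for reproducibility
episode_params = sample_dynamics(rng)  # resample at every episode reset
```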

Action scaling and clipping are necessary when transferring joint-space policies from simulation to hardware. Simulated joint limits are often looser than real limits; policies that command near-limit angles in simulation may violate hardware constraints. LeRobot includes action normalization utilities that map policy outputs to safe joint ranges, preventing hardware damage during deployment.
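A common convention for the scaling-and-clipping step maps normalized policy outputs in [-1, 1] onto per-joint limits. This is a sketch of the pattern, not LeRobot's exact utility.

```python
# Clip normalized actions and map them onto safe per-joint ranges.
def denormalize(action, limits):
    cmds = []
    for a, (lo, hi) in zip(action, limits):
        a = max(-1.0, min(1.0, a))               # clip out-of-range outputs
        cmds.append(lo + (a + 1.0) * 0.5 * (hi - lo))
    return cmds

limits = [(-2.9, 2.9)] * 7                        # symmetric limits (rad)
cmd = denormalize([0.0, 1.5, -1.0, 0.5, 0.0, 0.0, 0.0], limits)
# cmd[1] saturates at the upper limit; cmd[2] maps to the lower limit
```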

Latency compensation is required for high-frequency joint-space control. Real robots have 5–20 ms servo latency; policies trained in zero-latency simulation exhibit oscillations or overshoot when deployed. RT-1 uses action chunking—predicting 8-step joint-space sequences—to smooth out latency-induced jitter, improving real-world success rates by 15%[14].
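Action chunking can be sketched as executing each predicted multi-step sequence open-loop before re-querying the policy; the toy policy and the re-observation stand-in are assumptions for illustration.

```python
# Execute predicted joint-target chunks open-loop, re-querying every 8 steps.
def run_chunked(policy, obs, n_steps, chunk=8):
    executed = []
    while len(executed) < n_steps:
        seq = policy(obs)[:chunk]      # one chunk of joint targets
        executed.extend(seq)
        obs = seq[-1]                  # stand-in for re-observing robot state
    return executed[:n_steps]

# Toy policy: ramp each joint 0.01 rad per step from the current observation.
toy_policy = lambda obs: [[oi + 0.01 * k for oi in obs] for k in range(1, 9)]
traj = run_chunked(toy_policy, [0.0] * 7, n_steps=20)
```

Because only three policy queries produce the 20-step trajectory, per-query latency is amortized over the chunk instead of injecting jitter at every control tick.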

Datasets and Benchmarks for Joint-Space Manipulation

DROID is the largest open joint-space manipulation dataset, with 76,000 trajectories across 564 tasks collected on Franka Panda arms. Each trajectory includes 7-DOF joint positions, velocities, and torques at 20 Hz, plus wrist and third-person camera streams. DROID enables training of joint-space policies that generalize across object categories and task variations within the Franka embodiment[15].

ALOHA provides 1,200 bimanual teleoperation demonstrations for fine-grained manipulation tasks (cable routing, battery insertion, food transfer). Joint-space actions are recorded at 50 Hz for both arms, capturing the temporal coordination required for two-handed tasks. Policies trained on ALOHA data achieve 85% success on unseen object instances, demonstrating that joint-space imitation learning can generalize within an embodiment[16].

BridgeData V2 contains 60,000 trajectories collected on WidowX 250 6-DOF arms, covering kitchen and tabletop tasks. The dataset uses joint-space position control and includes per-task success labels, enabling evaluation of policy performance across task difficulty levels. BridgeData V2 is the standard benchmark for comparing joint-space imitation learning algorithms.

robomimic provides a suite of simulated manipulation tasks with ground-truth joint-space demonstrations, used to benchmark behavioral cloning, offline RL, and diffusion-based policies. The benchmark reports success rates, action smoothness, and sample efficiency, making it the primary evaluation framework for joint-space policy research.

Commercial Platforms and Tooling for Joint-Space Data Collection

Scale AI's physical-AI platform offers managed teleoperation services for joint-space data collection, with trained operators producing 200–500 demonstrations per day per robot. Scale handles hardware setup, task design, and quality control, delivering joint-space trajectories in RLDS or LeRobot format. Pricing starts at $150 per hour of teleoperation, with volume discounts for datasets exceeding 10,000 trajectories[17].

CloudFactory's industrial robotics solution provides joint-space annotation for manipulation datasets, labeling grasp quality, contact events, and failure modes in recorded trajectories. Annotators review joint-space playback videos and mark timesteps where the policy should have taken corrective action, generating preference data for reinforcement learning from human feedback.

Truelabel's physical-AI marketplace connects buyers with collectors who own manipulation hardware and can produce custom joint-space datasets. Collectors submit embodiment specifications (robot model, joint limits, workspace dimensions) and task proposals; buyers fund data collection via requests. Truelabel verifies kinematic metadata and trajectory smoothness before releasing payment, ensuring dataset quality.

LeRobot is an open-source toolkit for joint-space data collection and policy training. It provides ROS drivers for Franka, UR, and Kinova arms, teleoperation interfaces with force feedback, and dataset loaders compatible with Hugging Face. LeRobot reduces the engineering overhead for joint-space imitation learning, enabling academic labs to collect production-quality datasets without custom infrastructure.

Future Directions: Learned Kinematic Models and Action Abstraction

Current joint-space policies are embodiment-locked: trained on one robot, they cannot transfer to another. Future systems will learn kinematic adapters that map between joint spaces, enabling cross-robot deployment without full retraining. RT-2's action encoder is an early example, but it sacrifices 10–15% performance compared to native joint-space policies[18].

Learned forward kinematics could replace hand-coded DH parameters, enabling policies to generalize across robots with similar but not identical kinematic chains. A policy trained on 7-DOF Franka data could adapt to a 7-DOF Kinova arm by learning the mapping from joint angles to end-effector poses, then inverting this mapping for the new robot. OpenVLA experiments with learned kinematics show 5–8% success-rate improvements over fixed IK solvers on out-of-distribution embodiments[19].

Hierarchical action spaces decompose manipulation into task-space waypoints and joint-space motion primitives. A high-level policy outputs Cartesian subgoals; a low-level joint-space controller executes smooth trajectories between subgoals. This hybrid approach combines the generalization of task-space planning with the precision of joint-space control. CALVIN uses this architecture for long-horizon tasks, achieving 78% success on 5-step manipulation chains[20].

Foundation models for manipulation like NVIDIA Cosmos will likely use joint-space actions for embodiment-specific fine-tuning, even if the pretrained model operates in task space. Joint-space fine-tuning preserves the smoothness and dynamics learned from teleoperation data, while task-space pretraining enables zero-shot generalization to new objects and scenes.

Procurement Considerations for Joint-Space Training Data

Buyers procuring joint-space manipulation datasets must verify embodiment metadata completeness. The dataset should include robot model, joint limits, link lengths, DH parameters, and controller firmware version. Without this metadata, trajectories cannot be validated for kinematic feasibility or retargeted to similar robots. Truelabel's data provenance framework requires collectors to submit embodiment specifications before data collection begins, ensuring metadata completeness.

Action frequency and latency must match deployment requirements. A policy trained on 10 Hz joint-space data will exhibit choppy motion if deployed on a 50 Hz control loop. Buyers should specify target control frequency in the dataset RFP and verify that teleoperation was recorded at that frequency. LeRobot supports variable-frequency playback, but policies trained on low-frequency data cannot recover high-frequency dynamics.

Trajectory smoothness metrics should be part of dataset acceptance criteria. Jerky teleoperation produces noisy joint-space data that degrades policy performance. Buyers can compute per-joint acceleration variance and reject demonstrations that exceed a smoothness threshold. UMI reports 40% lower acceleration variance than keyboard teleoperation, making it a quality benchmark for joint-space data[21].
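The acceptance check described here can be sketched as a per-joint finite-difference acceleration variance; the second-difference scheme and time step are illustrative choices.

```python
# Second-difference acceleration variance as a trajectory smoothness metric.
def accel_variance(positions, dt):
    acc = [(p2 - 2 * p1 + p0) / dt ** 2
           for p0, p1, p2 in zip(positions, positions[1:], positions[2:])]
    mean = sum(acc) / len(acc)
    return sum((a - mean) ** 2 for a in acc) / len(acc)

smooth = [0.01 * t for t in range(50)]                         # constant velocity
jerky = [0.01 * t + (0.05 if t % 2 else 0.0) for t in range(50)]
# the jerky trace shows orders-of-magnitude higher acceleration variance
```

A buyer could evaluate this per joint over each demonstration and reject episodes above an agreed threshold.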

Licensing and commercialization rights are critical for joint-space datasets used in product development. Many academic datasets (DROID, ALOHA, BridgeData) use CC-BY-NC licenses that prohibit commercial use. Buyers building commercial manipulation systems must procure datasets with permissive licenses or negotiate custom terms. Truelabel's marketplace offers CC-BY-4.0 and custom commercial licenses for all joint-space datasets, with transparent pricing and usage rights.


External references and source context

  1. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Joint-space policies achieve 12–18% higher success than task-space on single embodiments

    arXiv
  2. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID includes 76,000 trajectories with torque control data

    arXiv
  3. CALVIN paper

    Diffusion-based joint-space policies achieve 89% success versus 67% for single-step regression

    arXiv
  4. Teleoperation datasets are becoming the highest-intent physical AI content category

    ALOHA records bimanual joint-space data at 50 Hz for coordination learning

    tonyzhaozh.github.io
  5. OpenVLA: An Open-Source Vision-Language-Action Model

    OpenVLA joint-space policies achieve 12% higher success rates than task-space

    arXiv
  6. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    RT-X task-space policies have 10–15% performance penalty versus joint-space specialists

    arXiv
  7. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch

    Typical manipulation datasets contain 500–2,000 demonstrations yielding 50,000–200,000 timesteps

    arXiv
  8. Project site

    UMI reduces demonstration variance by 40% compared to keyboard control

    umi-gripper.github.io
  9. Robotics datasets on Hugging Face need a buyer-readiness layer

    Only 30% of robotics datasets on Hugging Face include full kinematic specifications

    Hugging Face
  10. Sim-to-Real Transfer for Robotic Manipulation with Multi-Task Domain Adaptation

    Kinematic retargeting preserves 60–70% of task success between dissimilar arms

    arXiv
  11. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 action encoder adds 15–20 ms latency and reduces success by 8–12%

    arXiv
  12. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World

    Domain randomization reduces reality gap by 30–40% for joint-space policies

    arXiv
  13. Sim-to-Real Transfer of Robotic Control with Dynamics Randomization

    Dynamics randomization enables zero-shot transfer in 70% of contact-rich tasks

    arXiv
  14. RT-1: Robotics Transformer for Real-World Control at Scale

    RT-1 uses action chunking to smooth latency-induced jitter, improving success by 15%

    arXiv
  15. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID contains 76,000 trajectories across 564 tasks on Franka Panda

    arXiv
  16. Teleoperation datasets are becoming the highest-intent physical AI content category

    ALOHA policies achieve 85% success on unseen object instances

    tonyzhaozh.github.io
  17. scale.com physical ai

    Scale AI teleoperation pricing starts at $150 per hour with volume discounts

    scale.com
  18. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 action encoder sacrifices 10–15% performance versus native joint-space policies

    arXiv
  19. OpenVLA: An Open-Source Vision-Language-Action Model

    OpenVLA learned kinematics show 5–8% improvement over fixed IK solvers

    arXiv
  20. CALVIN paper

    CALVIN hierarchical architecture achieves 78% success on 5-step manipulation chains

    arXiv
  21. Project site

    UMI reports 40% lower acceleration variance than keyboard teleoperation

    umi-gripper.github.io


FAQ

Can a joint-space policy trained on one robot run on a different robot?

No, not without modification. Joint-space actions are embodiment-specific—a 7-DOF Franka policy outputs 7-element joint vectors that cannot run on a 6-DOF UR5. Kinematic retargeting (mapping joint angles to equivalent end-effector poses, then solving IK for the new robot) works for simple reaching but fails for contact-rich tasks where dynamics differ. Cross-robot deployment requires task-space actions, learned kinematic adapters, or separate policies per embodiment.

Why do most learned manipulation policies use position control instead of torque control?

Position control is more stable and easier to train. The policy outputs target joint angles; the robot's firmware PID controller handles dynamics and disturbance rejection. Torque control exposes the full dynamics, enabling force-sensitive behaviors, but requires policies that model inertia, friction, and external forces—much harder to learn from demonstrations. DROID records joint torques alongside positions across its 76,000 trajectories, but 80–90% of manipulation tasks use position control because it is more robust and sample-efficient.

How many demonstrations are needed to train a joint-space manipulation policy?

500–2,000 demonstrations per task for behavioral cloning, depending on task complexity and policy architecture. Simple pick-and-place tasks require 500–800 demonstrations; bimanual or contact-rich tasks require 1,500–2,000. Diffusion-based policies are more sample-efficient than single-step regression, achieving 89% success with 1,000 demonstrations versus 67% for regression on the same data. Data quality (teleoperation smoothness, task success rate) matters more than raw quantity—1,000 clean demonstrations outperform 2,000 noisy ones.

What metadata must accompany a joint-space manipulation dataset?

Robot model, joint limits, link lengths, DH parameters, controller firmware version, and action frequency. Without this metadata, trajectories cannot be validated for kinematic feasibility or retargeted to similar robots. Only 30% of robotics datasets on Hugging Face include full kinematic specifications, making metadata completeness a key procurement criterion. Truelabel's provenance framework requires collectors to submit embodiment specifications before data collection, ensuring metadata completeness.

How does joint-space control handle redundant manipulators with more than 6 DOF?

Joint-space control exploits redundancy naturally—the policy learns to use extra DOF for obstacle avoidance, singularity avoidance, or joint-limit avoidance from demonstration data. A 7-DOF Franka policy can learn to keep the elbow away from the table while reaching, using the redundant DOF for self-collision avoidance. Task-space control requires explicit redundancy resolution (null-space projection or optimization), which adds complexity. Joint-space policies handle redundancy implicitly by imitating human teleoperators who naturally exploit the full kinematic space.

What is the difference between joint-space actions and joint-space observations?

Joint-space observations are proprioceptive state—current joint angles, velocities, and torques measured by encoders and force sensors. Joint-space actions are control commands—target joint angles, velocities, or torques sent to the robot's servo controller. Policies use joint-space observations as input (along with camera images) and produce joint-space actions as output. Both are n-dimensional vectors for an n-joint robot, but observations are measured state and actions are commanded state.

Find datasets covering joint-space control

Truelabel surfaces vetted datasets and capture partners working with joint-space control. Send us the modality, scale, and rights you need, and we'll route you to the closest match.

Browse Joint-Space Teleoperation Datasets