Physical AI Glossary
Trajectory Optimization
Trajectory optimization finds robot motion plans that minimize a cost function—energy, time, smoothness, collision risk—subject to physical constraints like joint limits and obstacle avoidance. Unlike sampling-based planners that return any feasible path, trajectory optimizers solve a constrained optimization problem to produce locally optimal, dynamically smooth trajectories that respect actuator limits and task requirements.
Quick facts
- Term: Trajectory Optimization
- Domain: Robotics and physical AI
- Last reviewed: 2025-05-15
What Trajectory Optimization Solves
Trajectory optimization addresses the problem of computing a time-parameterized sequence of robot states—positions, velocities, accelerations—that minimizes a scalar cost functional while satisfying equality and inequality constraints[1]. The cost functional typically aggregates multiple objectives: smoothness penalties on jerk or acceleration to reduce wear on actuators, path-length terms to minimize travel distance in configuration space, collision costs derived from signed distance fields, and task-specific terms like maintaining upright orientation or reaching a goal pose within a time budget.
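In discrete time, the standard formulation is a constrained program over states x_t and controls u_t, with the objectives above appearing as summands of the running cost ℓ (generic notation, not tied to any one solver):

```latex
\begin{aligned}
\min_{x_{0:T},\; u_{0:T-1}} \quad & \ell_T(x_T) + \sum_{t=0}^{T-1} \ell(x_t, u_t) \\
\text{s.t.} \quad & x_{t+1} = f(x_t, u_t) && \text{(dynamics)} \\
& g(x_t, u_t) \le 0 && \text{(joint limits, collision clearance)} \\
& h(x_t, u_t) = 0 && \text{(task equality constraints)}
\end{aligned}
```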
The optimizer searches over the space of feasible trajectories using gradient-based methods (sequential quadratic programming, interior-point solvers) or sampling-based approaches (cross-entropy method, CMA-ES). Gradient-based methods require differentiable cost and constraint functions, which modern autodiff frameworks provide for neural network policies and learned dynamics models. CHOMP (Covariant Hamiltonian Optimization for Motion Planning) introduced functional gradient descent over trajectory space, enabling real-time replanning in cluttered environments. TrajOpt formulated the problem as sequential convex optimization, linearizing nonlinear constraints at each iteration within a trust region to make reliable progress toward a locally optimal solution.
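A minimal sketch of the gradient-based family, assuming PyTorch and a toy analytic obstacle penalty standing in for a real signed distance field (illustrative structure, not any specific planner's API):

```python
import torch

# Toy first-order trajectory optimization over waypoints, in the spirit of
# CHOMP-style functional gradient descent. The obstacle term is a placeholder;
# a real system would query a signed distance field of the workspace.
T, dof = 20, 7
start, goal = torch.zeros(dof), torch.ones(dof)

# Initialize with straight-line interpolation between start and goal.
alphas = torch.linspace(0.0, 1.0, T).unsqueeze(1)            # (T, 1)
traj = (start + alphas * (goal - start)).clone().requires_grad_(True)
opt = torch.optim.Adam([traj], lr=1e-2)

def obstacle_cost(q):
    # Penalize waypoints within 0.2 of a hypothetical obstacle center.
    center = torch.full((dof,), 0.5)
    return torch.relu(0.2 - torch.norm(q - center, dim=-1)).sum()

for _ in range(200):
    opt.zero_grad()
    smoothness = ((traj[1:] - traj[:-1]) ** 2).sum()         # squared velocity
    endpoints = ((traj[0] - start) ** 2).sum() + ((traj[-1] - goal) ** 2).sum()
    cost = smoothness + 10.0 * obstacle_cost(traj) + 100.0 * endpoints
    cost.backward()
    opt.step()
```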
Trajectory optimization differs from sampling-based planners like RRT or PRM in two ways. First, it produces smooth, dynamically feasible trajectories rather than piecewise-linear paths that require post-processing. Second, it encodes task preferences directly in the cost function rather than treating all collision-free paths as equally valid. This makes trajectory optimization the natural choice for manipulation tasks where smoothness, energy efficiency, and predictability matter—pouring liquids, placing fragile objects, collaborative assembly with humans.
Cost Function Design and Demonstration Data
Designing a cost function that captures task requirements is the central challenge in trajectory optimization. Hand-engineered costs work for well-understood tasks—minimizing joint torques for energy efficiency, penalizing large accelerations for smoothness—but fail to capture nuanced human preferences like natural-looking motion or context-dependent obstacle avoidance strategies. Inverse reinforcement learning addresses this by inferring a cost function from expert demonstrations, treating the observed trajectories as samples from an optimal policy under an unknown reward.
The DROID dataset contains 76,000 manipulation trajectories collected via teleoperation across 564 scenes and 86 tasks, providing the scale needed to train cost functions that generalize across object categories and workspace layouts[2]. Each trajectory includes RGB-D observations, proprioceptive state, and action sequences at 10 Hz, enabling supervised learning of cost-to-go functions or value networks that predict cumulative cost from any state. BridgeData V2 contributes 60,000 trajectories spanning kitchen manipulation tasks, with rich semantic annotations that support learning of task-conditioned cost functions.
Diffusion policies, introduced in Chi et al. 2023, model the distribution of expert trajectories directly rather than extracting a cost function, sidestepping the inverse RL credit assignment problem. Training requires 50–200 demonstrations per task to capture multimodal solution strategies—grasping a mug by the handle versus the body, approaching an object from above versus the side[3]. The LeRobot framework provides reference implementations for training diffusion policies on teleoperation datasets, with data loaders for RLDS, HDF5, and MCAP formats.
Model Predictive Control and Receding Horizon Optimization
Model predictive control (MPC) applies trajectory optimization in a receding horizon framework: at each control cycle, the optimizer solves for an optimal trajectory over a finite time window, executes the first action, observes the resulting state, and replans. This closed-loop strategy provides robustness to model errors and disturbances that would cause open-loop trajectory execution to fail. MPC is the dominant control architecture for legged locomotion, where ground contact dynamics are difficult to model accurately and terrain variations require continuous replanning.
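The receding-horizon pattern is compact enough to show directly. A toy sketch on a 1-D double integrator, with random shooting standing in for the inner optimizer (dynamics, costs, and sample counts here are illustrative assumptions):

```python
import numpy as np

# Receding-horizon (MPC) loop on a 1-D double integrator: optimize a finite
# window, execute only the first action, observe, replan from the new state.
dt, horizon, n_samples = 0.1, 10, 256
rng = np.random.default_rng(0)

def window_cost(x, v, u_seq):
    cost = 0.0
    for u in u_seq:
        v += u * dt
        x += v * dt
        cost += x**2 + 0.01 * u**2    # drive position to zero, penalize effort
    return cost

x, v = 1.0, 0.0
for _ in range(50):
    candidates = rng.uniform(-1.0, 1.0, size=(n_samples, horizon))
    costs = [window_cost(x, v, u_seq) for u_seq in candidates]
    best = candidates[int(np.argmin(costs))]
    v += best[0] * dt                 # execute only the first action...
    x += v * dt                       # ...then the loop replans from (x, v)
```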
RT-1 (Robotics Transformer) combines vision-language-action modeling with MPC-style replanning, using a transformer to predict action distributions conditioned on natural language instructions and visual observations. The model was trained on 130,000 episodes from 700 tasks, demonstrating that large-scale demonstration data enables generalization to novel objects and instructions[4]. RT-2 extended this by pretraining on web-scale vision-language data, transferring semantic understanding from internet images to robotic control without additional demonstration data for new object categories.
The computational cost of MPC depends on the optimization horizon, state dimensionality, and cost function complexity. For a 7-DOF manipulator with a 1-second horizon discretized at 10 Hz, sequential quadratic programming solvers require 10–50 milliseconds per replan on modern CPUs, meeting real-time control requirements. GPU-accelerated trajectory optimization using PyTorch autodiff enables parallelization over multiple candidate trajectories, reducing latency to 5–10 milliseconds for high-frequency control loops.
Collision Avoidance and Signed Distance Fields
Collision avoidance is encoded as inequality constraints or soft penalties in the cost function. Signed distance fields (SDFs) provide a differentiable representation of obstacle geometry: the SDF value at a point gives the distance to the nearest obstacle surface, with negative values inside obstacles and positive values in free space. The gradient of the SDF points away from the nearest obstacle surface, giving gradient-based optimizers the direction in which to push trajectories out of collision.
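A sketch of this construction with an analytic sphere SDF in place of a voxelized or neural field (the obstacle geometry and margin are assumptions; the hinge penalty structure is the point):

```python
import torch

# Differentiable collision penalty on top of an analytic sphere SDF.
# Negative SDF values mean the point is inside the obstacle.
center = torch.tensor([0.4, 0.0, 0.3])      # hypothetical obstacle center
radius, margin = 0.10, 0.05

def sdf(points):                             # (N, 3) -> (N,)
    return torch.norm(points - center, dim=-1) - radius

def collision_cost(points):
    # Zero beyond the safety margin; grows as points approach or penetrate.
    return torch.relu(margin - sdf(points)).sum()

pts = torch.rand(20, 3, requires_grad=True)
collision_cost(pts).backward()               # stepping against this gradient
                                             # pushes points out of collision
```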
Computing SDFs for complex environments requires discretizing space into voxel grids or using neural implicit representations. Voxel grids at 1 cm resolution are standard for tabletop manipulation, requiring 100³ = 1 million cells for a 1 m³ workspace. Neural SDFs, trained on point cloud data from depth sensors, provide continuous representations that generalize to unseen obstacle configurations. The PointNet architecture processes raw point clouds directly, learning permutation-invariant features that support SDF prediction without voxelization.
Open X-Embodiment aggregates 1 million trajectories from 22 robot embodiments, including RGB-D observations and point cloud data that support training of neural collision predictors[5]. Each trajectory includes obstacle annotations in the form of 3D bounding boxes or segmentation masks, enabling supervised learning of collision cost functions. The dataset's multi-embodiment coverage is critical: collision geometry depends on robot morphology (link reach, gripper width), so cost functions trained on single-robot data fail to transfer.
Smoothness Constraints and Jerk Minimization
Smoothness constraints penalize high-frequency components in the trajectory, reducing actuator wear and improving motion predictability for human collaborators. Jerk—the time derivative of acceleration—is the standard smoothness metric, with cost functions that integrate squared jerk over the trajectory duration. Minimizing jerk produces trajectories with gradual velocity changes, avoiding the abrupt starts and stops that characterize minimum-time solutions.
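Discretized over sampled waypoints, the integral of squared jerk reduces to third-order finite differences. A minimal NumPy sketch:

```python
import numpy as np

# Discrete squared-jerk cost for a sampled trajectory q of shape (T, dof),
# where dt is the sample period.
def jerk_cost(q, dt):
    jerk = np.diff(q, n=3, axis=0) / dt**3   # (T - 3, dof)
    return dt * np.sum(jerk**2)

q = np.cumsum(np.random.randn(50, 7), axis=0) * 0.01   # example trajectory
print(jerk_cost(q, dt=0.1))
```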
The tradeoff between smoothness and task completion time is application-dependent. High-speed pick-and-place in warehouse automation prioritizes cycle time over smoothness, accepting higher jerk to minimize travel duration. Collaborative assembly tasks prioritize smoothness to maintain human comfort and safety, even if task completion takes longer. Universal Robots' UR series implements configurable jerk limits in the motion planner, allowing operators to tune the smoothness-speed tradeoff for each application.
RLDS (Reinforcement Learning Datasets) defines a standard schema for trajectory data that includes acceleration and jerk fields, enabling direct supervision of smoothness objectives during policy training. The LeRobot dataset format extends RLDS with metadata for embodiment-specific kinematic limits, ensuring that learned policies respect actuator constraints. Training on 10,000 smooth human demonstrations produces policies that generalize smoothness preferences to novel tasks without explicit jerk penalties in the cost function[6].
Sampling-Based Trajectory Optimization
Sampling-based methods explore the trajectory space by generating candidate solutions from a proposal distribution, evaluating their costs, and iteratively refining the distribution toward low-cost regions. The cross-entropy method (CEM) is widely used: sample N trajectories from a Gaussian distribution, select the top K by cost, fit a new Gaussian to the elite set, and repeat. CEM requires no gradient information, making it applicable to non-differentiable cost functions like collision checkers based on discrete geometry queries.
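That loop maps directly to code. A compact sketch with illustrative hyperparameters (the defaults below are assumptions, not values from a specific system):

```python
import numpy as np

# Cross-entropy method over open-loop action sequences: sample, rank by
# cost, refit a Gaussian to the elites, repeat. cost_fn may be
# non-differentiable, e.g. a discrete collision checker.
def cem(cost_fn, horizon, dim, iters=10, n=500, k=50, seed=0):
    rng = np.random.default_rng(seed)
    mean, std = np.zeros((horizon, dim)), np.ones((horizon, dim))
    for _ in range(iters):
        samples = mean + std * rng.standard_normal((n, horizon, dim))
        costs = np.array([cost_fn(s) for s in samples])
        elites = samples[np.argsort(costs)[:k]]      # top-K lowest cost
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean

plan = cem(lambda s: float(np.sum(s**2)), horizon=10, dim=7)
```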
Model Predictive Path Integral (MPPI) control is a sampling-based MPC variant that weights trajectory samples by their exponentiated negative cost, producing a control distribution that concentrates probability mass on low-cost actions. MPPI has been applied to aggressive autonomous driving, where tire slip and aerodynamic effects make gradient-based optimization unreliable. The method requires 1,000–10,000 samples per control cycle to achieve good coverage of the action space, necessitating GPU parallelization for real-time performance.
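The core MPPI update is a single weighted average over samples; a sketch, where `lam` is the temperature (an illustrative parameter, not a value from a specific implementation):

```python
import numpy as np

# MPPI-style update: weight sampled control sequences by exponentiated
# negative cost and average. samples: (N, H, dim); costs: (N,). Smaller
# lam concentrates probability mass on the lowest-cost samples.
def mppi_update(samples, costs, lam=1.0):
    w = np.exp(-(costs - costs.min()) / lam)     # subtract min for stability
    w /= w.sum()
    return np.einsum('n,nhd->hd', w, samples)    # weighted mean control plan
```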
RoboNet provides 15 million video frames from 7 robot platforms, supporting training of video prediction models that serve as forward dynamics models for sampling-based planning[7]. Predicting future RGB frames from action sequences enables cost evaluation in pixel space—detecting collisions from predicted depth maps, assessing task success from predicted object poses—without requiring explicit state estimation. The World Models architecture combines variational autoencoders for visual encoding with recurrent dynamics models, enabling trajectory optimization in learned latent spaces.
Trajectory Optimization in Locomotion
Legged locomotion poses unique trajectory optimization challenges: hybrid dynamics with discrete contact events, underactuation during flight phases, and high-dimensional state spaces (18+ DOF for quadrupeds, 30+ for humanoids). Contact-implicit optimization formulates the problem without explicitly modeling contact switching, instead using complementarity constraints that enforce zero contact force when the foot is off the ground and zero penetration when in contact. This avoids the combinatorial explosion of enumerating contact sequences.
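For a single contact point, the complementarity conditions couple the gap function φ(q), the signed foot-to-ground distance, with the normal contact force λ:

```latex
\phi(q_t) \ge 0, \qquad \lambda_t \ge 0, \qquad \phi(q_t)\,\lambda_t = 0
```

Either the foot is airborne (φ > 0) and the force is zero, or it is in contact (φ = 0) and the force may be positive; the optimizer never enumerates contact sequences explicitly.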
The NVIDIA Cosmos World Foundation Models include physics simulators trained on 20 million hours of synthetic locomotion data, providing differentiable dynamics models for gradient-based trajectory optimization. The models predict contact forces, joint torques, and center-of-mass trajectories from high-level commands (desired velocity, turning rate), enabling real-time MPC for quadruped and humanoid robots. Training data includes domain-randomized terrain (stairs, slopes, obstacles) and actuator noise, ensuring that optimized trajectories are robust to real-world variations.
NVIDIA GR00T demonstrates foundation model pretraining for humanoid control, with policies trained on 1 billion synthetic trajectories covering locomotion, manipulation, and whole-body coordination tasks[8]. The model uses trajectory optimization during data generation: a privileged MPC controller with access to ground-truth state and terrain geometry produces expert demonstrations, which are then distilled into a vision-based policy. This two-stage approach—optimize with full information, distill to partial observability—is now standard in sim-to-real transfer for locomotion.
Integration with Imitation Learning Pipelines
Trajectory optimization serves two roles in imitation learning: as a data generation tool for creating expert demonstrations in simulation, and as a policy architecture that directly outputs optimized trajectories at test time. The first approach is dominant in sim-to-real transfer, where trajectory optimization with known dynamics produces training data that a neural policy learns to imitate using only sensor observations. The second approach is used in model-based RL, where a learned dynamics model enables online trajectory optimization without requiring a pre-trained policy.
CALVIN provides 24,000 long-horizon manipulation trajectories with language annotations, supporting training of policies that chain multiple skills to complete complex tasks. Each trajectory is the output of a hierarchical planner: a high-level task planner selects skill sequences, and a low-level trajectory optimizer computes smooth motions for each skill. This structure mirrors the SayCan architecture, which grounds language instructions in affordance models learned from demonstration data.
The truelabel physical AI marketplace aggregates teleoperation datasets from 12,000 collectors, providing the scale needed to train trajectory optimizers that generalize across embodiments and tasks[9]. Buyers specify task requirements (object categories, workspace constraints, success criteria), and the platform matches them with collectors who have the necessary hardware and environment setup. This marketplace model addresses the long tail of manipulation tasks that lack public datasets—industrial assembly, medical device handling, agricultural sorting—where trajectory optimization must be trained on task-specific demonstrations.
Computational Complexity and Real-Time Performance
The computational cost of trajectory optimization scales with the optimization horizon length, state dimensionality, and number of constraints. For a 7-DOF manipulator with a 1-second horizon discretized at 10 Hz, the optimization problem has 70 decision variables (7 joints × 10 timesteps) and hundreds of constraints (joint limits, collision avoidance, smoothness). Sequential quadratic programming solvers require 10–100 iterations to converge, with each iteration solving a quadratic program that costs O(n³) in the number of variables.
GPU acceleration reduces latency by parallelizing constraint evaluation and gradient computation. PyTorch's autodiff engine computes gradients of neural network cost functions in 1–5 milliseconds on modern GPUs, enabling real-time MPC at 100 Hz control rates. Batch trajectory optimization—solving for multiple candidate trajectories in parallel—further improves robustness by selecting the lowest-cost trajectory from an ensemble, hedging against local minima in non-convex cost landscapes.
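A sketch of the batched pattern in PyTorch, refining an ensemble of candidates in parallel and keeping the cheapest (the cost terms are illustrative stand-ins):

```python
import torch

# Batch trajectory optimization: refine B candidate trajectories in parallel,
# then select the lowest-cost one, hedging against local minima.
B, T, dof = 64, 20, 7
trajs = torch.randn(B, T, dof, requires_grad=True)
opt = torch.optim.Adam([trajs], lr=1e-2)

def batch_cost(x):                                   # (B, T, dof) -> (B,)
    smoothness = ((x[:, 1:] - x[:, :-1]) ** 2).sum(dim=(1, 2))
    goal = ((x[:, -1] - 1.0) ** 2).sum(dim=1)
    return smoothness + goal

for _ in range(100):
    opt.zero_grad()
    batch_cost(trajs).sum().backward()    # costs are independent, so the sum
    opt.step()                            # decouples into per-candidate grads

best = trajs[batch_cost(trajs).argmin()].detach()
```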
OpenVLA combines vision-language-action modeling with GPU-accelerated trajectory optimization, achieving 20 Hz replanning rates for 7-DOF manipulation. The model was trained on 970,000 trajectories from the Open X-Embodiment dataset, demonstrating that large-scale pretraining enables zero-shot generalization to novel objects and instructions without task-specific trajectory optimization[10]. This suggests a future where foundation models replace hand-tuned cost functions for most manipulation tasks, with trajectory optimization reserved for safety-critical applications that require formal guarantees.
Trajectory Optimization for Multi-Robot Coordination
Multi-robot trajectory optimization extends the single-robot problem by adding coupling constraints that prevent inter-robot collisions and coordinate task execution. Centralized optimization solves for all robot trajectories jointly, ensuring global optimality but scaling poorly beyond 3–5 robots due to the exponential growth in decision variables. Decentralized methods assign each robot a local optimizer that treats other robots as dynamic obstacles, iterating until convergence to a Nash equilibrium.
Priority-based planning is a practical middle ground: robots are assigned priorities, and each robot optimizes its trajectory while treating higher-priority robots as fixed obstacles. This approach is common in warehouse automation, where hundreds of mobile robots must coordinate to avoid deadlocks. The priority assignment can be static (based on robot ID) or dynamic (based on task urgency or proximity to goal), with dynamic priorities reducing average task completion time by 15–30% in simulation studies[11].
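The scheme reduces to a loop over robots in priority order; a sketch, where `plan_single` is a hypothetical single-robot trajectory optimizer supplied by the caller:

```python
# Priority-based multi-robot planning: plan in descending priority order,
# treating already-planned robots as time-varying obstacles.
def priority_plan(robots, plan_single):
    planned = {}                                     # robot id -> trajectory
    for robot in sorted(robots, key=lambda r: r.priority, reverse=True):
        obstacles = list(planned.values())           # higher-priority plans
        planned[robot.id] = plan_single(robot, dynamic_obstacles=obstacles)
    return planned
```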
RoboNet's multi-robot trajectories include synchronized observations from multiple viewpoints, supporting training of coordination policies that predict other robots' future actions from visual observations. The dataset contains 50,000 multi-robot episodes across pick-and-place and object handoff tasks, with ground-truth trajectories for all robots in the scene. Training on this data enables learned predictors that anticipate other robots' motions, reducing the conservatism of treating them as static obstacles.
Open Problems and Research Frontiers
Contact-rich manipulation remains challenging for trajectory optimization due to the combinatorial complexity of contact sequences and the difficulty of modeling friction and impact dynamics. Hybrid trajectory optimization methods that explicitly enumerate contact modes scale poorly beyond 2–3 contact switches, limiting their applicability to tasks like multi-finger grasping or tool use. Contact-implicit methods avoid mode enumeration but require careful initialization to avoid local minima corresponding to infeasible contact configurations.
Long-horizon tasks that require chaining 10+ skills expose the limitations of fixed-horizon trajectory optimization. Hierarchical methods that decompose tasks into skill sequences and optimize each skill independently can fail when skills have tight coupling—a grasp configuration that enables one manipulation but precludes the next. End-to-end optimization over the full task horizon is computationally prohibitive for horizons beyond 5–10 seconds, creating a gap between what trajectory optimizers can solve and what real-world tasks require.
Data provenance tracking is critical for trajectory optimization in regulated domains like medical robotics and autonomous vehicles, where demonstrating that training data meets safety and quality standards is a regulatory requirement. The C2PA standard provides cryptographic provenance for media assets, but extending it to trajectory data requires new metadata schemas that capture embodiment specifications, task success criteria, and annotator qualifications. The truelabel marketplace implements provenance tracking at ingestion, recording collector identity, hardware configuration, and task completion metrics for every trajectory.
External references and source context
- [1] TF-Agents Trajectory API (TensorFlow). Defines the standard schema for robot trajectory data in reinforcement learning.
- [2] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv). 76,000 manipulation trajectories across 564 scenes and 86 tasks.
- [3] Diffusion Policy training example (GitHub). Diffusion policy training requires 50–200 demonstrations per task to capture multimodal strategies.
- [4] Google Research blog (robotics-transformer1.github.io). RT-1 was trained on 130,000 episodes from 700 tasks, demonstrating large-scale generalization.
- [5] Open X-Embodiment project site (robotics-transformer-x.github.io). Dataset scale enables training of multi-embodiment collision predictors.
- [6] LeRobot: State-of-the-art Machine Learning for Real-World Robotics in PyTorch (arXiv). Training on 10,000 smooth demonstrations enables smoothness preference generalization to novel tasks.
- [7] RoboNet: Large-Scale Multi-Robot Learning (arXiv). 15 million video frames from 7 robot platforms for dynamics model training.
- [8] NVIDIA GR00T N1 technical report (arXiv). GR00T was trained on 1 billion synthetic trajectories covering locomotion and manipulation tasks.
- [9] truelabel physical AI data marketplace bounty intake (truelabel.ai). 12,000 collectors providing task-specific demonstration data.
- [10] OpenVLA project (openvla.github.io). Trained on 970,000 trajectories from the Open X-Embodiment dataset.
- [11] RLBench: The Robot Learning Benchmark & Learning Environment (arXiv). Priority-based multi-robot planning reduces task completion time by 15–30% in simulation studies.
FAQ
What is the difference between trajectory optimization and motion planning?
Motion planning finds any collision-free path between start and goal configurations, typically using sampling-based algorithms like RRT or PRM that explore the configuration space. Trajectory optimization finds a path that minimizes a cost function—energy, time, smoothness, or task-specific objectives—subject to physical constraints like joint limits and actuator torques. Motion planning produces feasible solutions quickly but without optimality guarantees; trajectory optimization produces locally optimal solutions but requires more computation and careful cost function design. In practice, motion planning often generates an initial path that trajectory optimization then refines into a smooth, dynamically feasible trajectory.
How much demonstration data is needed to train a trajectory optimizer?
The data requirement depends on task complexity and the representation used. Diffusion policies for single-skill manipulation tasks require 50–200 demonstrations to capture multimodal solution strategies. Multi-task policies that generalize across object categories need 10,000–100,000 trajectories, as demonstrated by RT-1's training on 130,000 episodes. Foundation models like OpenVLA that transfer across embodiments require 1 million+ trajectories from diverse robot platforms. Inverse reinforcement learning methods that infer cost functions from demonstrations are more data-efficient, often succeeding with 10–50 expert trajectories, but require careful feature engineering to capture task-relevant state information.
Can trajectory optimization work with learned dynamics models?
Yes, model-based reinforcement learning combines learned dynamics models with trajectory optimization for online planning. The dynamics model predicts next states from current state and action, enabling the optimizer to evaluate trajectory costs without executing them on the real robot. Neural network dynamics models trained on 10,000–100,000 state transitions achieve prediction accuracy sufficient for 1–2 second planning horizons in manipulation tasks. Longer horizons suffer from compounding prediction errors, limiting model-based methods to short-horizon tasks or requiring periodic replanning with updated state observations. Ensemble dynamics models that average predictions from multiple networks improve robustness to model errors, extending effective planning horizons to 3–5 seconds.
What file formats are used for trajectory datasets?
RLDS (Reinforcement Learning Datasets) is the standard schema for trajectory data in robotics, storing episodes as sequences of (observation, action, reward) tuples in TensorFlow Datasets format. HDF5 is widely used for large-scale datasets like RoboNet and DROID, providing hierarchical organization and compression for multi-modal data (RGB, depth, proprioception). MCAP is emerging as the preferred format for real-time logging, with native support in ROS 2 and efficient random access for large files. LeRobot defines a unified schema that supports all three formats, enabling dataset interoperability across training frameworks. Parquet is used for tabular trajectory metadata (task labels, success flags, collector IDs) due to its columnar storage and query performance.
How does trajectory optimization handle dynamic obstacles?
Dynamic obstacles require predictive models of obstacle motion, which are integrated into the optimization problem as time-varying constraints. For known obstacle trajectories (e.g., other robots following planned paths), the optimizer adds collision avoidance constraints at each timestep that enforce minimum separation distance. For uncertain obstacle motion, robust trajectory optimization formulates the problem as a chance-constrained program that ensures collision probability stays below a threshold (e.g., 1%) across all plausible obstacle trajectories. Model predictive control addresses dynamic obstacles through replanning: the optimizer solves for a trajectory assuming current obstacle velocities, executes the first action, observes updated obstacle positions, and replans with the new information.
What role does trajectory optimization play in sim-to-real transfer?
Trajectory optimization generates expert demonstrations in simulation that serve as training data for policies deployed on real robots. The optimizer has access to privileged information—ground-truth object poses, contact forces, terrain geometry—that is unavailable to the real-world policy, which must rely on noisy sensor observations. Domain randomization during trajectory optimization (varying object masses, friction coefficients, actuator delays) produces demonstrations that are robust to sim-to-real gaps. The policy learns to imitate the optimized trajectories using only visual and proprioceptive observations, effectively distilling the optimizer's privileged knowledge into a sensor-based controller. This approach has enabled successful sim-to-real transfer for legged locomotion, manipulation, and aerial navigation tasks.
Find datasets covering trajectory optimization
Truelabel surfaces vetted datasets and capture partners working with trajectory optimization. Send the modality, scale, and rights you need, and we'll route you to the closest match.
Browse Physical AI Datasets