Physical AI Glossary

Trajectory Prediction

Q: What is the difference between trajectory prediction and motion planning?

Trajectory prediction forecasts where agents will move based on observed behavior. Motion planning computes where the robot should move to achieve a goal. Prediction is a perception task (estimating others' future states); planning is a control task (choosing the robot's actions). Modern systems use predicted trajectories as inputs to motion planners — the planner selects robot actions that avoid predicted collision zones.

Q: How much trajectory data is needed to train a production model?

Minimum viable models train on 5,000–10,000 trajectories (50–100 hours of interaction data). Production models for autonomous vehicles require 50,000–100,000 trajectories (500–1,000 hours). Cross-embodiment models like RT-2 and OpenVLA train on 1M+ trajectories aggregated from 20+ robot platforms. Data diversity matters more than volume — 10,000 diverse trajectories outperform 50,000 narrow trajectories.

Q: Can trajectory models trained in simulation transfer to real robots?

Yes, with domain randomization and real-data fine-tuning. Models trained purely on synthetic data achieve 60–80% real-world accuracy. Combining 10,000 real trajectories with 100,000 synthetic trajectories matches the performance of 50,000 real trajectories, offering a 5× cost reduction. Physics accuracy matters — simulators must model friction, contact dynamics, and sensor noise to produce realistic trajectories.

Q: What annotation tools support multi-agent trajectory labeling?

Segments.ai supports 3D bounding box tracking across LiDAR and camera streams with automatic ID propagation. Kognic specializes in autonomous vehicle trajectory annotation with occlusion handling. CVAT provides 2D polygon tracking for top-down warehouse views. Labelbox offers video object tracking with temporal interpolation. All tools export to standard formats (COCO, KITTI, RLDS) for model training.

Q: How do trajectory models handle occlusions and missing data?

Models use temporal context to infer occluded positions. If an agent disappears behind a pillar for 2 seconds, the model predicts its trajectory based on velocity before occlusion and reappearance location. Training data must include occlusion examples — models cannot generalize to occlusions unseen during training. Annotation pipelines must maintain agent IDs through occlusions to provide supervision for these cases.

Q: What is the typical latency budget for real-time trajectory prediction?

Warehouse robots require <100 ms latency (10 Hz update rate). Autonomous vehicles require <50 ms latency (20 Hz update rate). Manipulation robots require <30 ms latency (30 Hz update rate). 95th-percentile latency matters more than mean latency — a single 500 ms stall can cause a collision. Models must run on edge hardware (NVIDIA Jetson, Qualcomm RB5) without cloud round-trips.

Trajectory prediction forecasts the future spatial positions and velocities of agents (humans, robots, vehicles) and objects over time horizons of 1–10 seconds. Physical AI systems use trajectory models to anticipate collisions, plan safe paths, and coordinate multi-agent interactions in warehouses, kitchens, and autonomous vehicle fleets.

Updated 2025-06-15 · 13 min read

By Truelabel Team

Reviewed by Truelabel Team · Jun 15, 2025

Quick facts

Topic: Trajectory Prediction
Audience: Procurement leads, ML ops, robotics engineers
Deliverable: Buyer-facing reference + procurement guidance

What trajectory prediction solves in physical AI

Trajectory prediction addresses the core challenge of anticipating where moving entities will be 1–10 seconds into the future. A warehouse robot must predict human worker paths to avoid collisions (see CloudFactory's industrial robotics solutions). An autonomous vehicle must forecast pedestrian crossings 3 seconds ahead to brake safely. A kitchen robot must predict where a human hand will move to avoid interfering with meal preparation (see Claru's kitchen task training data).

Without trajectory prediction, physical AI systems operate reactively — they respond to the current state but cannot plan proactively. This reactive mode fails in dynamic environments where agents move faster than control loops can respond. Modern systems require predictive models that output probability distributions over future trajectories, not single deterministic paths. The RT-1 Robotics Transformer demonstrated that large-scale robot datasets enable generalist policies, but those policies still depend on accurate motion forecasting to execute safely in human-shared spaces.

Trajectory prediction sits between perception and planning in the autonomy stack. Perception modules extract agent positions and velocities from sensor streams (cameras, LiDAR, radar). Trajectory models consume these state estimates and output multi-modal future distributions. Planning modules then select actions that maximize task success while respecting predicted collision probabilities. This three-stage pipeline appears in Waymo's autonomous vehicle stack and in manipulation systems trained on DROID's 76,000 manipulation trajectories.

The data requirements are substantial. A production trajectory model for warehouse navigation might train on 50,000+ hours of multi-agent interaction data, annotated with 2D or 3D bounding boxes, agent IDs, and semantic labels (worker, forklift, pallet). Scale AI's Physical AI platform reports that trajectory annotation costs $0.80–$3.50 per frame depending on scene complexity and annotation density^[1].

Core architectures and model families

Trajectory prediction models fall into three architectural families: recurrent encoders (LSTMs, GRUs), graph neural networks (GNNs), and transformer-based architectures. Recurrent models encode agent history as fixed-dimensional hidden states and decode future positions autoregressively. GNNs model multi-agent interactions as spatial graphs where edges represent influence relationships. Transformers treat agent trajectories as sequences and apply self-attention to capture long-range dependencies.

The RT-2 vision-language-action model uses a transformer backbone pretrained on web data, then fine-tuned on robot trajectories. This transfer learning approach reduces the robot-specific data requirement from 100,000+ demonstrations to 10,000–30,000 episodes. OpenVLA extends this paradigm with a 7B-parameter vision-language-action model trained on the Open X-Embodiment dataset's 1M+ trajectories^[2].

Multi-modal prediction is critical for real-world deployment. A pedestrian at an intersection might cross, wait, or turn — a single deterministic forecast cannot capture this uncertainty. Modern architectures output mixture-of-Gaussians or learned latent distributions. The DROID dataset paper shows that models trained on diverse trajectory distributions generalize better to novel environments than models trained on narrow, expert-only data.

Computational cost scales with prediction horizon and agent count. A 5-second forecast at 10 Hz requires 50 timestep predictions per agent. A warehouse scene with 20 agents demands 1,000 position estimates per forward pass. NVIDIA's Cosmos world foundation models address this with hierarchical prediction: coarse 10-second forecasts guide fine 1-second rollouts, reducing compute by 60% while maintaining accuracy^[3].
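The output-size arithmetic in this paragraph can be sketched as a one-line helper. The function name is hypothetical and the sketch assumes every agent is forecast at every timestep, which is a simplification of what a production model does.

```python
def position_estimates_per_pass(horizon_s: float, rate_hz: float, num_agents: int) -> int:
    """Number of future positions one forward pass must emit."""
    timesteps = round(horizon_s * rate_hz)  # e.g. 5 s at 10 Hz -> 50 steps
    return timesteps * num_agents


print(position_estimates_per_pass(5, 10, 20))  # 50 steps x 20 agents = 1000
```

Doubling either the horizon or the agent count doubles the output size, which is why both appear in latency budgets.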

Dataset requirements and annotation standards

Trajectory datasets must capture spatial positions, velocities, agent types, and interaction context over time. Minimum viable datasets contain 5,000+ trajectories spanning 10+ hours of interaction. Production-grade datasets exceed 50,000 trajectories and 500+ hours. The BridgeData V2 dataset provides 60,000 robot manipulation trajectories across 24 environments, establishing a benchmark for manipulation-specific trajectory data^[4].

Annotation pipelines must handle occlusions, identity switches, and multi-agent tracking. Segments.ai's multi-sensor labeling platform supports 3D bounding box tracking across LiDAR and camera streams, maintaining agent IDs through occlusions. Kognic's annotation platform specializes in autonomous vehicle trajectory data, processing 2M+ frames per month for tier-1 OEMs.

Data format standards vary by domain. Autonomous vehicle datasets use MCAP or ROS bag formats to store synchronized sensor streams. Robot manipulation datasets increasingly adopt RLDS (Reinforcement Learning Datasets), which wraps trajectories in a standardized schema compatible with TensorFlow Datasets and PyTorch. The LeRobot dataset format extends RLDS with metadata for embodiment, control frequency, and action spaces^[5].

Quality control requires per-trajectory validation. Annotators must verify that bounding boxes track the same physical entity across frames, velocities are physically plausible (no instantaneous 10 m/s jumps), and semantic labels are consistent. Truelabel's physical AI marketplace enforces schema validation at upload time, rejecting trajectories with missing timestamps, invalid coordinate frames, or broken agent IDs.
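A per-trajectory plausibility check of the kind described here might look like the sketch below. The `(t, x, y)` tuple layout, the function name, and the 10 m/s default are illustrative assumptions; this is not Truelabel's validation schema.

```python
import math


def validate_trajectory(points, max_speed_mps=10.0):
    """Check one agent's track, given as a list of (t, x, y) tuples.

    Returns a list of human-readable problems; an empty list means the
    track passed. The speed limit is an assumed threshold, tune per domain.
    """
    problems = []
    for (t0, x0, y0), (t1, x1, y1) in zip(points, points[1:]):
        dt = t1 - t0
        if dt <= 0:
            problems.append(f"non-increasing timestamp at t={t1}")
            continue
        speed = math.hypot(x1 - x0, y1 - y0) / dt
        if speed > max_speed_mps:
            problems.append(f"implausible speed {speed:.1f} m/s at t={t1}")
    return problems


walking = [(0.0, 0.0, 0.0), (0.1, 0.1, 0.0), (0.2, 0.2, 0.0)]   # 1 m/s
teleport = [(0.0, 0.0, 0.0), (0.1, 5.0, 0.0)]                   # 50 m/s jump
print(validate_trajectory(walking), validate_trajectory(teleport))
```

A jump like the second track usually indicates an identity switch between two agents rather than real motion, so it is worth flagging for re-annotation instead of silently clipping.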

Training pipelines and data augmentation

Trajectory prediction models train on sequences of (observation, action, next_observation) tuples. A typical training batch contains 32–128 trajectories, each 10–50 timesteps long. Models optimize a negative log-likelihood loss over predicted future positions, weighted by temporal distance (near-term predictions matter more than far-future forecasts).
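The temporally weighted loss can be illustrated with a minimal sketch. It assumes each predicted position is an isotropic 2D Gaussian and that the weights decay geometrically with the timestep; both choices, along with the `decay` parameter, are assumptions for illustration and not a specific model's loss.

```python
import math


def weighted_gaussian_nll(pred_means, pred_sigmas, targets, decay=0.9):
    """Time-weighted negative log-likelihood for 2D positions.

    pred_means, targets: lists of (x, y); pred_sigmas: std devs in metres.
    Step k gets weight decay**k, so near-term errors count more.
    """
    total, weight_sum = 0.0, 0.0
    for step, ((mx, my), sigma, (tx, ty)) in enumerate(
        zip(pred_means, pred_sigmas, targets)
    ):
        sq_err = (tx - mx) ** 2 + (ty - my) ** 2
        # NLL of an isotropic 2D Gaussian: log(2*pi*sigma^2) + r^2 / (2*sigma^2)
        nll = math.log(2 * math.pi * sigma ** 2) + sq_err / (2 * sigma ** 2)
        weight = decay ** step
        total += weight * nll
        weight_sum += weight
    return total / weight_sum


means = [(0.0, 0.0), (0.0, 0.0)]
early_miss = weighted_gaussian_nll(means, [1.0, 1.0], [(1.0, 0.0), (0.0, 0.0)])
late_miss = weighted_gaussian_nll(means, [1.0, 1.0], [(0.0, 0.0), (1.0, 0.0)])
print(early_miss > late_miss)  # the same 1 m error costs more at the first step
```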

Data augmentation is essential for generalization. Spatial augmentations include random rotations (±15°), translations (±0.5 m), and scaling (0.9–1.1×). Temporal augmentations include random subsampling (training on 5 Hz data when deployment is 10 Hz) and trajectory truncation (training on partial histories). Domain randomization adds sensor noise, lighting variation, and background clutter to reduce sim-to-real gaps.
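The spatial augmentations listed above can be sketched for a 2D track as follows. The function name and the order of operations (rotate and scale about the origin, then translate) are assumptions; the ranges are the ones quoted in the paragraph.

```python
import math
import random


def augment_trajectory(points, rng, max_rot_deg=15.0, max_shift_m=0.5,
                       scale_range=(0.9, 1.1)):
    """Apply one random rotation, scale, and translation to every (x, y) point.

    The same transform is used for the whole trajectory so its shape and
    timing are preserved.
    """
    theta = math.radians(rng.uniform(-max_rot_deg, max_rot_deg))
    scale = rng.uniform(*scale_range)
    dx = rng.uniform(-max_shift_m, max_shift_m)
    dy = rng.uniform(-max_shift_m, max_shift_m)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [
        (scale * (cos_t * x - sin_t * y) + dx,
         scale * (sin_t * x + cos_t * y) + dy)
        for x, y in points
    ]


rng = random.Random(0)  # seeded so augmentation runs are reproducible
augmented = augment_trajectory([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)], rng)
```

In a multi-agent scene the same transform would normally be applied to every agent in the scene, otherwise the relative geometry that interaction models rely on is destroyed.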

The CALVIN benchmark demonstrates that models trained on 10,000 diverse trajectories outperform models trained on 50,000 narrow trajectories. Diversity matters more than volume. RoboNet's 15M frames from 7 robot platforms enabled the first cross-embodiment trajectory models, showing that data from Franka arms transfers to UR5 arms when trajectories share task structure^[6].

Hyperparameter tuning focuses on prediction horizon (1–10 seconds), sampling frequency (1–30 Hz), and loss weighting (how much to penalize far-future errors). LeRobot's diffusion policy training example shows that 3-second horizons at 10 Hz work well for tabletop manipulation, while warehouse navigation requires 10-second horizons at 5 Hz to handle long-range planning.

Evaluation metrics and benchmark datasets

Trajectory prediction models are evaluated on Average Displacement Error (ADE), Final Displacement Error (FDE), and collision rate. ADE measures mean Euclidean distance between predicted and ground-truth positions across all timesteps. FDE measures distance at the final timestep only. Collision rate counts predicted trajectories that intersect obstacles or other agents.
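ADE and FDE as defined above reduce to a few lines. The sketch assumes a single predicted trajectory and 2D positions; benchmark scripts typically extend this to the best of K predicted modes.

```python
import math


def displacement_errors(predicted, ground_truth):
    """Return (ADE, FDE) for two equal-length lists of (x, y) positions."""
    if len(predicted) != len(ground_truth) or not predicted:
        raise ValueError("trajectories must be non-empty and equal length")
    dists = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    return sum(dists) / len(dists), dists[-1]


pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(pred, truth)
print(ade, fde)  # per-step errors are 0, 1, 2 m, so ADE = 1.0 and FDE = 2.0
```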

Benchmark datasets include Waymo Open Dataset (autonomous vehicles, 1,000+ hours), EPIC-KITCHENS-100 (egocentric kitchen activities, 100 hours), and DROID (robot manipulation, 76,000 trajectories). These datasets provide train/val/test splits and standardized evaluation scripts. The Open X-Embodiment dataset aggregates 22 robot datasets into a unified benchmark, enabling cross-dataset evaluation^[2].

Real-world deployment requires online metrics beyond offline ADE/FDE. Systems must track prediction latency (time from sensor input to trajectory output), update frequency (how often predictions refresh), and calibration error (whether predicted probabilities match observed frequencies). RT-1's deployment on 13 robots showed that 95th-percentile latency matters more than mean latency — a single 500 ms stall can cause a collision even if average latency is 50 ms^[7].
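The gap between mean and tail latency is easy to see with a small sketch. The sample values are invented for illustration, and the percentile uses the nearest-rank definition, which is one of several common conventions.

```python
def latency_summary(latencies_ms):
    """Return (mean, p95) using the nearest-rank percentile definition."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    rank = (95 * n + 99) // 100  # integer ceil(0.95 * n), avoids float rounding
    return sum(ordered) / n, ordered[rank - 1]


# Hypothetical log: 90 fast frames and 10 stalls.
samples = [40.0] * 90 + [500.0] * 10
mean_ms, p95_ms = latency_summary(samples)
print(mean_ms, p95_ms)  # mean is 86.0 ms, but the 95th percentile is 500.0 ms
```

A mean of 86 ms would pass a 100 ms budget, while the tail shows that one frame in ten misses it by a factor of five.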

Failure mode analysis is critical. Models fail on out-of-distribution agent types (a forklift when trained only on humans), rare interactions (two agents reaching for the same object), and sensor degradation (camera lens fog, LiDAR rain noise). THE COLOSSEUM benchmark systematically tests generalization across 20 object categories and 12 distractor conditions, revealing that models trained on 50,000 trajectories still fail on 15% of novel object-distractor pairs^[8].

Multi-agent interaction and social prediction

Multi-agent trajectory prediction must model how agents influence each other. A warehouse robot slows when a human approaches; a pedestrian waits when a car signals a turn. These interactions are not independent — joint prediction over all agents outperforms per-agent prediction.

Graph neural networks encode interactions as edges between agent nodes. Edge weights represent influence strength, learned from data. The RT-2 model uses cross-attention between robot and human trajectories to predict cooperative handovers. RoboCat extends this to 6-agent scenarios, showing that models trained on 2-agent data generalize to 4-agent scenes with 20% accuracy loss^[9].

Social conventions complicate prediction. Humans follow implicit rules (walk on the right, yield to faster agents) that vary by culture and context. Models must learn these conventions from data, not hard-coded rules. The EPIC-KITCHENS-100 dataset captures 100 hours of egocentric human kitchen activity, providing ground truth for social prediction in shared workspaces.

Computational cost grows quadratically with agent count in naive implementations (every agent attends to every other agent). Sparse attention mechanisms reduce this to linear cost by attending only to nearby agents within a 5 m radius. NVIDIA Cosmos uses hierarchical graphs: local 5 m neighborhoods at 10 Hz, global 50 m context at 1 Hz, reducing compute by 80% with <5% accuracy loss^[3].
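The radius-limited neighborhood idea can be sketched with a uniform grid, so that each agent is compared only with agents in adjacent cells. This is a generic spatial-hashing sketch under the assumption of 2D positions; it is not the NVIDIA Cosmos implementation, and the function name is hypothetical.

```python
import math
from collections import defaultdict


def neighbor_pairs(positions, radius_m=5.0):
    """Return sorted index pairs (i, j), i < j, of agents within radius_m.

    Agents are bucketed into cells of side radius_m, so any pair within the
    radius must lie in the same or an adjacent cell.
    """
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(positions):
        grid[(math.floor(x / radius_m), math.floor(y / radius_m))].append(idx)

    pairs = set()
    for (cx, cy), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i < j and math.dist(positions[i], positions[j]) <= radius_m:
                            pairs.add((i, j))
    return sorted(pairs)


agents = [(0.0, 0.0), (3.0, 0.0), (20.0, 20.0), (22.0, 21.0)]
print(neighbor_pairs(agents))  # two local pairs instead of all six agent pairs
```

The attention mask would then be built from these pairs. Cost stays close to linear only while the number of agents per cell is bounded; a dense crowd in one cell degrades back toward quadratic.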

Sim-to-real transfer and synthetic data

Synthetic trajectory data from simulators reduces annotation cost but introduces a sim-to-real gap. Simulators produce perfect ground truth (exact positions, velocities, agent IDs) but lack real-world noise, occlusions, and edge cases. Domain randomization bridges this gap by training on diverse simulated conditions (lighting, textures, physics parameters) so models learn robust features.

The RLBench benchmark provides 100 simulated manipulation tasks with procedurally generated variations. Models trained on RLBench transfer to real robots with 60–80% success rates, compared to 90–95% for real-data-trained models. A 2021 survey found that combining 10,000 real trajectories with 100,000 synthetic trajectories matches the performance of 50,000 real trajectories alone, offering a 5× cost reduction^[10].

Physics accuracy matters. Simulators must model friction, contact dynamics, and deformable objects to produce realistic trajectories. ManiSkill uses GPU-accelerated physics to simulate 1,000 parallel environments at 100 Hz, generating 100,000 trajectories per hour. RoboCasa adds procedural scene generation, producing 10M+ kitchen trajectories with randomized object placements and agent behaviors.

Validation requires real-world testing. A model that achieves 95% accuracy on synthetic data might drop to 70% on real data due to unmodeled effects (sensor lag, actuator backlash, cable drag). CALVIN's real-robot benchmark provides a standardized test bed, showing that models must train on ≥20% real data to maintain >85% real-world accuracy^[11].

Temporal resolution and prediction horizons

Prediction horizon (how far into the future) and temporal resolution (how often to predict) are coupled design choices. Autonomous vehicles predict 5–10 seconds ahead at 10 Hz to handle highway speeds. Warehouse robots predict 3–5 seconds at 5 Hz for human avoidance. Manipulation robots predict 1–2 seconds at 30 Hz for reactive grasping.

Longer horizons enable proactive planning but accumulate error. Position uncertainty grows with the horizon: if error accumulates linearly, a 0.1 m error at 1 second becomes 1.0 m at 10 seconds. RT-1 uses a 2-second horizon for tabletop tasks, refreshing predictions every 0.1 seconds to limit error accumulation^[7].
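The 0.1 m to 1.0 m example corresponds to a constant velocity error, which the sketch below reproduces with a constant-velocity extrapolator. The predictor and the 0.1 m/s bias are assumptions chosen to match those numbers; learned models accumulate error in less regular ways.

```python
def constant_velocity_forecast(position, velocity, horizon_s, rate_hz):
    """Extrapolate (x, y) at a fixed velocity; returns one point per timestep."""
    steps = round(horizon_s * rate_hz)
    dt = 1.0 / rate_hz
    return [
        (position[0] + velocity[0] * k * dt, position[1] + velocity[1] * k * dt)
        for k in range(1, steps + 1)
    ]


# True agent speed is 1.0 m/s; the estimate is biased by 0.1 m/s.
true_path = constant_velocity_forecast((0.0, 0.0), (1.0, 0.0), horizon_s=10, rate_hz=1)
biased_path = constant_velocity_forecast((0.0, 0.0), (1.1, 0.0), horizon_s=10, rate_hz=1)
errors = [abs(b[0] - t[0]) for t, b in zip(true_path, biased_path)]
print(errors[0], errors[-1])  # roughly 0.1 m after 1 s and 1.0 m after 10 s
```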

Hierarchical prediction reduces computational cost. Coarse 10-second forecasts at 1 Hz guide fine 1-second rollouts at 10 Hz. The coarse model runs a cheap linear predictor; the fine model runs an expensive transformer. NVIDIA Cosmos reports 3× speedup with this approach, maintaining 95% of single-resolution accuracy.

Adaptive horizons adjust based on scene complexity. In sparse environments (empty warehouse aisle), a 10-second horizon suffices. In dense environments (crowded cafeteria), a 3-second horizon with 10 Hz updates handles rapid changes. Waymo's planner dynamically adjusts the horizon between 3 and 10 seconds based on detected agent density and velocity variance.

Coordinate frames and spatial representations

Trajectory prediction requires consistent coordinate frames across sensors and time. Common choices include ego-centric (robot-centered), world-centric (fixed global frame), and object-centric (relative to a target object). Ego-centric frames simplify control (actions are relative to the robot) but complicate multi-agent reasoning (other agents move in the robot's frame). World-centric frames simplify multi-agent reasoning but require accurate localization.
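Converting between world-centric and ego-centric frames is a rigid transform. The sketch below covers the planar case and assumes the robot frame has x pointing forward and y pointing left, with yaw measured counter-clockwise from the world x axis; those conventions are assumptions and differ between datasets.

```python
import math


def world_to_ego(point_xy, ego_xy, ego_yaw_rad):
    """Express a world-frame point in the robot's frame (x forward, y left)."""
    dx, dy = point_xy[0] - ego_xy[0], point_xy[1] - ego_xy[1]
    cos_y, sin_y = math.cos(ego_yaw_rad), math.sin(ego_yaw_rad)
    # Rotate by -yaw: the inverse of the ego-to-world rotation.
    return (cos_y * dx + sin_y * dy, -sin_y * dx + cos_y * dy)


def ego_to_world(point_xy, ego_xy, ego_yaw_rad):
    """Inverse of world_to_ego: rotate by +yaw, then add the robot position."""
    cos_y, sin_y = math.cos(ego_yaw_rad), math.sin(ego_yaw_rad)
    return (ego_xy[0] + cos_y * point_xy[0] - sin_y * point_xy[1],
            ego_xy[1] + sin_y * point_xy[0] + cos_y * point_xy[1])


# Robot at (1, 1) facing +y; a pedestrian at (1, 3) is 2 m straight ahead.
ahead = world_to_ego((1.0, 3.0), (1.0, 1.0), math.pi / 2)
print(ahead)  # approximately (2.0, 0.0)
```

Because the ego frame moves with the robot, a stationary pedestrian traces a path in ego coordinates whenever the robot moves, which is the multi-agent complication the paragraph refers to.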

The RLDS format stores trajectories in world coordinates with per-timestep transformations to ego frames. This enables both global planning (in world frame) and local control (in ego frame) without re-projection overhead. LeRobot datasets include camera extrinsics and robot base poses, allowing downstream users to choose their preferred frame^[5].

Spatial representations include 2D bounding boxes (cheap, works for top-down views), 3D bounding boxes (handles occlusions, requires depth), and point clouds (high fidelity, expensive). PointNet processes raw point clouds for 3D trajectory prediction, achieving 0.15 m ADE on indoor navigation tasks. Segments.ai's point cloud labeling tools support LiDAR annotation at $2–$5 per frame, 3–5× cheaper than manual 3D box drawing.

Map representations provide context. Autonomous vehicles use HD maps with lane geometry and traffic rules. Warehouse robots use occupancy grids with static obstacle locations. Kitchen robots use semantic maps with object affordances ("drawer opens left", "stove is hot"). Waymo Open Dataset includes HD map data for 1,000+ km of roads, enabling map-conditioned trajectory prediction.

Uncertainty quantification and risk-aware planning

Trajectory predictions must quantify uncertainty to enable safe planning. A deterministic forecast ("the pedestrian will be at position X") provides no information about confidence. A probabilistic forecast ("80% probability within 0.5 m of X") enables risk-aware decisions.

Mixture-of-Gaussians models output K Gaussian distributions, each representing a plausible future mode (cross the street, wait, turn around). The RT-2 model uses K=5 modes for manipulation tasks, capturing grasp approach variations. RoboCat uses K=10 for navigation, handling diverse pedestrian behaviors^[9].

Calibration measures whether predicted probabilities match observed frequencies. A well-calibrated model that predicts 80% confidence should be correct 80% of the time. Waymo's safety validation requires <5% calibration error across all confidence bins. Poorly calibrated models are unsafe — overconfident predictions cause collisions, underconfident predictions cause excessive braking.
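One common way to measure this is expected calibration error over equal-width confidence bins, sketched below. The bin count and function name are assumptions, and this is a generic metric sketch rather than Waymo's validation procedure.

```python
def expected_calibration_error(confidences, outcomes, num_bins=10):
    """Weighted mean gap between predicted confidence and observed frequency.

    confidences: predicted probabilities in [0, 1].
    outcomes: 1 if the predicted event occurred, else 0.
    """
    bins = [[] for _ in range(num_bins)]
    for conf, hit in zip(confidences, outcomes):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, hit))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        hit_rate = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - hit_rate)
    return ece


# Ten forecasts at 80% confidence, eight of which came true: well calibrated.
print(expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2))
# Ten forecasts at 90% confidence, five of which came true: overconfident.
print(expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5))
```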

Risk-aware planning uses predicted distributions to compute collision probabilities. A planner might reject a path with 1% collision risk even if the expected trajectory is collision-free. RT-1's deployment uses a 0.1% collision threshold for human-shared spaces, requiring 1,000× safety margin over expected trajectories^[7].
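A minimal sketch of this check, assuming the predictor returns a small set of weighted modes with time-aligned waypoints. The data layout, the 0.5 m safety radius, and the function names are illustrative assumptions; the 1% threshold is the one used in the paragraph's example.

```python
import math


def collision_probability(robot_path, agent_modes, safety_radius_m=0.5):
    """Sum the probability of every agent mode that comes too close.

    agent_modes: list of (probability, path); all paths are lists of (x, y)
    sampled at the same timesteps as robot_path.
    """
    risk = 0.0
    for prob, agent_path in agent_modes:
        if any(math.dist(r, a) < safety_radius_m
               for r, a in zip(robot_path, agent_path)):
            risk += prob
    return risk


def is_path_acceptable(robot_path, agent_modes, max_risk=0.01):
    """Reject a path whose total collision probability exceeds max_risk."""
    return collision_probability(robot_path, agent_modes) <= max_risk


robot = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
modes = [
    (0.97, [(0.0, 3.0), (1.0, 3.0), (2.0, 3.0)]),  # likely: pedestrian keeps walking
    (0.03, [(0.0, 3.0), (1.0, 1.5), (2.0, 0.2)]),  # unlikely: pedestrian cuts across
]
print(collision_probability(robot, modes), is_path_acceptable(robot, modes))
```

The most likely mode is collision-free, yet the path is rejected because the 3% mode alone exceeds the 1% budget.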

Procurement considerations for trajectory datasets

Buyers must specify prediction horizon, temporal resolution, agent types, environment types, and annotation schema. A warehouse navigation dataset requires 5-second horizons, 5 Hz resolution, human and forklift agents, and 3D bounding boxes. A manipulation dataset requires 2-second horizons, 30 Hz resolution, human hand and robot gripper agents, and 6-DOF poses.

Licensing matters. RoboNet's dataset license permits academic use but restricts commercial deployment without separate agreement. EPIC-KITCHENS annotations use a custom license requiring attribution and prohibiting redistribution. Truelabel's marketplace provides commercial-use licenses with explicit IP warranties and indemnification.

Data provenance affects model risk. Datasets scraped from YouTube lack consent and may contain biased or adversarial examples. Datasets collected via paid teleoperation have clear provenance and consent chains. Truelabel's data provenance glossary explains why procurement teams must audit data lineage before production deployment.

Cost scales with annotation complexity. 2D bounding box tracking costs $0.50–$1.50 per frame. 3D bounding box tracking costs $2–$5 per frame. Full 6-DOF pose tracking costs $5–$15 per frame. Scale AI's Physical AI platform reports that trajectory annotation represents 40–60% of total dataset cost for autonomous vehicle projects^[1].
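For budgeting, the per-frame rates above translate into a cost range once a frame count is fixed. The sketch below assumes every frame is annotated with no keyframe interpolation, which overstates cost for most real pipelines; the dictionary keys and function name are hypothetical.

```python
# Per-frame rate ranges in USD, taken from the figures quoted above.
RATES_USD = {
    "2d_box": (0.50, 1.50),
    "3d_box": (2.00, 5.00),
    "6dof_pose": (5.00, 15.00),
}


def annotation_cost_range(hours, rate_hz, annotation_type):
    """Return (low, high) USD cost if every frame is annotated."""
    frames = round(hours * 3600 * rate_hz)
    low, high = RATES_USD[annotation_type]
    return frames * low, frames * high


# One hour of 10 Hz data is 36,000 frames.
print(annotation_cost_range(1, 10, "3d_box"))  # (72000.0, 180000.0)
```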

External references and source context

[1] Scale AI physical AI page. Scale AI reports trajectory annotation costs and physical AI data platform capabilities. (scale.com)
[2] Open X-Embodiment: Robotic Learning Datasets and RT-X Models. Open X-Embodiment aggregates 22 robot datasets with 1M+ trajectories for cross-embodiment learning. (arXiv)
[3] NVIDIA Cosmos World Foundation Models. NVIDIA Cosmos world foundation models use hierarchical prediction to reduce compute by 60%. (NVIDIA Developer)
[4] BridgeData V2: A Dataset for Robot Learning at Scale. BridgeData V2 provides 60,000 robot manipulation trajectories across 24 environments. (arXiv)
[5] LeRobot dataset documentation. LeRobot dataset format extends RLDS with embodiment and control frequency metadata. (Hugging Face)
[6] RoboNet: Large-Scale Multi-Robot Learning. RoboNet contains 15M frames from 7 robot platforms enabling cross-embodiment learning. (arXiv)
[7] RT-1: Robotics Transformer for Real-World Control at Scale. RT-1 demonstrated large-scale robot learning with 130,000 demonstrations across 700 tasks. (arXiv)
[8] THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation. THE COLOSSEUM benchmark tests generalization across 20 object categories and 12 distractor conditions. (arXiv)
[9] RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation. RoboCat demonstrates self-improving generalist manipulation across multiple embodiments. (arXiv)
[10] Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning. 2021 survey quantifies sim-to-real transfer success rates and data mixing ratios. (arXiv)
[11] CALVIN paper. CALVIN benchmark shows that 10,000 diverse trajectories outperform 50,000 narrow trajectories. (arXiv)
