Reference
Physical AI data glossary
Plain-English definitions of the terms buyers and suppliers use when scoping physical AI data bounties — modalities, capture rigs, formats, and metadata. 103 terms covered.
How to use this hub
Start here when you know the broad category but haven't nailed the exact bounty spec yet. Each linked page narrows the request into a concrete data shape: modality, task, environment, metadata, rights, consent, delivery format, and sample QA. That structure is what turns a vague physical AI data need into something a supplier can prove or reject with evidence.
The hub isn't meant to be the last page you read. It should hand off to a detail page where the specific intent is answered with sample specs, comparison tables, proof requirements, and external source context.
6-DOF Grasp Planning
Physical AI Glossary
6-DOF grasp planning computes a six-dimensional gripper pose—three translational coordinates (x, y, z) and three rotational angles (roll, pitch, yaw)—that enables a robot to approach, contact, and close around an object from any direction in SE(3) space. Unlike planar top-down methods, 6-DOF planning handles arbitrary orientations essential for bin-picking, shelf manipulation, and cluttered scenes, using point-cloud perception and learned grasp-quality networks trained on datasets containing tens of thousands of annotated grasp attempts.
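As a rough illustration of what a six-dimensional gripper pose encodes, the sketch below packs (x, y, z, roll, pitch, yaw) into a 4x4 homogeneous transform. The function name and the rotation order (yaw about Z, then pitch about Y, then roll about X, angles in radians) are assumptions for this example; real datasets may use a different convention, so a bounty spec should state which one applies.

```python
import math

def grasp_pose_to_matrix(x, y, z, roll, pitch, yaw):
    """Return a 4x4 homogeneous transform (nested lists) for a gripper pose.

    Assumption: angles are in radians and R = Rz(yaw) * Ry(pitch) * Rx(roll).
    """
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr, x],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr, y],
        [-sp, cp * sr, cp * cr, z],
        [0.0, 0.0, 0.0, 1.0],
    ]
```

With all angles at zero the rotation block is the identity, and the last column carries the translation.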
Action Chunking
Physical AI Glossary
Action chunking is a robot policy technique that predicts sequences of K future actions (typically 8-32 timesteps) at each decision point instead of single-step outputs. By committing to coherent multi-step plans, chunking reduces compounding errors in behavioral cloning—where small prediction mistakes cascade into trajectory drift—and produces smoother, more temporally consistent motions. Google's RT-1 uses 15-action chunks[ref:ref-rt1-paper], ACT defaults to 100-step sequences[ref:ref-act-paper], and Diffusion Policy generates 16-action horizons[ref:ref-diffusion-policy]. The approach is now standard in manipulation policies trained on datasets like DROID (76,000 trajectories)[ref:ref-droid-paper] and Open X-Embodiment (1 million+ episodes)[ref:ref-open-x-embodiment].
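The control loop behind chunking can be sketched as follows. This is a minimal illustration, not the loop used by any of the cited systems: `predict_chunk`, `step_env`, and `execute_k` are hypothetical names, and the policy is assumed to return a list of future actions each time it is queried.

```python
def rollout_with_chunks(predict_chunk, step_env, obs, total_steps, execute_k):
    """Run a policy that predicts action chunks.

    The policy is queried only when the current chunk is used up, and only
    the first `execute_k` actions of each predicted chunk are executed.
    """
    executed = []
    policy_calls = 0
    queue = []
    for _ in range(total_steps):
        if not queue:
            queue = list(predict_chunk(obs))[:execute_k]
            policy_calls += 1
        action = queue.pop(0)
        obs = step_env(obs, action)
        executed.append(action)
    return obs, executed, policy_calls
```

Executing fewer actions than the policy predicts (for example 8 of 16) is one common way to trade smoothness against responsiveness to new observations.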
Action Segmentation
Glossary
Action segmentation partitions untrimmed video or sensor streams into non-overlapping temporal segments, each labeled with a discrete action class. In robotics, this technique decomposes continuous demonstrations into discrete skills—enabling modular policy learning, task decomposition, and hierarchical planning. Modern architectures like MS-TCN and ASFormer achieve frame-level accuracy by modeling long-range temporal dependencies across variable-duration actions.
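To make the output format concrete, the sketch below collapses per-frame labels into non-overlapping segments. The function name and the half-open (start, end, label) tuple layout are assumptions chosen for this example, not a standard schema.

```python
from itertools import groupby

def frames_to_segments(frame_labels):
    """Collapse per-frame labels into (start_frame, end_frame, label) runs.

    end_frame is exclusive, so consecutive segments never overlap.
    """
    segments = []
    start = 0
    for label, run in groupby(frame_labels):
        length = len(list(run))
        segments.append((start, start + length, label))
        start += length
    return segments
```

A six-frame stream labeled reach, reach, grasp, lift, lift, lift becomes three segments covering frames 0-2, 2-3, and 3-6.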
Action Space: How Representation Design Shapes Robot Learning Data
Physical AI Glossary
Action space defines the complete set of commands a robot can execute at each control timestep—joint angles, Cartesian poses, velocity targets, or gripper states. The choice between joint-space and Cartesian actions, absolute and relative commands, and continuous versus discrete representations determines how much demonstration data a policy needs, how well it transfers across embodiments, and what tasks it can perform.
Active Learning
Glossary
Active learning is a machine-learning paradigm in which the model selects which unlabeled samples to annotate next, querying a human oracle only for the most informative examples. By prioritizing uncertain, diverse, or boundary-case data points, active learning reduces annotation costs by 40–80% compared to random sampling while maintaining equivalent model performance—critical for physical-AI domains where per-frame labeling can cost $0.50–$5.00.
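One simple selection rule is uncertainty sampling by predictive entropy, sketched below. The function names and the dictionary layout (sample id mapped to class probabilities) are assumptions for illustration; production pipelines often combine uncertainty with diversity criteria.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a class-probability list."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(predictions, budget):
    """Return the `budget` sample ids whose predictions are most uncertain.

    predictions: dict mapping sample id -> list of class probabilities.
    """
    ranked = sorted(predictions, key=lambda sid: entropy(predictions[sid]), reverse=True)
    return ranked[:budget]
```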
Activity Annotation
Glossary
Activity annotation assigns semantic labels and precise temporal boundaries to human actions in video, producing structured timelines of start/end timestamps, action classes, and object tags. Granularity ranges from atomic motor primitives (grasp, release) to multi-minute tasks (prepare meal). The EPIC-KITCHENS-100 dataset contains 90,000 action segments across 100 hours of egocentric kitchen video[ref:ref-epic-100], while DROID captures 76,000 manipulation trajectories across 564 scenes[ref:ref-droid-paper].
Affordance Prediction
Physical AI Glossary
Affordance prediction is a computer vision task that identifies actionable regions on objects: where a robot can grasp, push, pull, or manipulate. Modern systems use vision-language-action models trained on teleoperation datasets containing RGB-D images, point clouds, and human demonstration trajectories. Google's RT-2 achieved 62% success on novel objects by grounding language instructions in affordance heatmaps, while OpenVLA reports a 16.5% absolute improvement in task success over RT-2-X when trained on 970K trajectories from the Open X-Embodiment dataset.
Behavioral Cloning
Physical AI Glossary
Behavioral cloning (BC) is a supervised imitation learning method that trains robot policies to replicate expert demonstrations by minimizing prediction error between observed states and recorded actions. Unlike reinforcement learning, BC requires no reward function or environment simulator—just paired (observation, action) tuples from teleoperation or scripted trajectories. Modern BC architectures like ACT and Diffusion Policy use transformers and generative models to handle multimodal action distributions, addressing the compounding-error problem that plagued early feedforward approaches.
Benchmark Curation
Glossary
Benchmark curation is the systematic process of designing, assembling, annotating, and validating evaluation datasets that measure whether AI models possess specific capabilities under controlled conditions. Unlike training data curation, which maximizes learning signal, benchmark curation prioritizes measurement integrity: test sets must produce scores that reliably reflect real-world performance, discriminate between capability levels, and remain stable across evaluation runs.
Bounding Box Annotation
Computer Vision Glossary
Bounding box annotation is the process of drawing axis-aligned rectangular labels around objects in images or video frames, defined by corner coordinates (x_min, y_min, x_max, y_max) and a class label. It is the dominant annotation primitive for training object detection models because it balances localization precision with annotation speed—2-7 seconds per instance versus 30-90 seconds for pixel-level masks—enabling datasets like COCO (860,000+ boxes) and Objects365 (10 million+ boxes) at economically viable scale.
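Because boxes are stored as (x_min, y_min, x_max, y_max), comparing two annotations reduces to an Intersection over Union (IoU) calculation. The helper below is a minimal sketch; the function name is an assumption, and it expects x_min <= x_max and y_min <= y_max.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Overlap width and height are clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, giving an IoU of 1/7.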
Collision Avoidance for Physical AI Systems
Glossary
Collision avoidance is a real-time safety mechanism that prevents robots from striking obstacles, people, or themselves during motion by fusing sensor data (LiDAR, depth cameras, force-torque sensors) with learned or geometric models to predict and halt unsafe trajectories before contact occurs.
Consent artifact
Glossary
A consent artifact is a record showing that a contributor or site granted permission for data capture and downstream use. The term matters because it turns a procurement concept (permission) into a concrete requirement you can evaluate samples against.
Contact-Rich Manipulation
Physical AI Glossary
Contact-rich manipulation encompasses robot tasks where sustained, precisely modulated physical contact drives success: peg-in-hole insertion, surface wiping, gear meshing, cable routing, snap-fit assembly. These tasks demand multi-modal sensing (vision + force/torque + tactile) and training data that captures force dynamics invisible to RGB cameras alone.
Cross-Embodiment Data
Physical AI Glossary
Cross-embodiment data aggregates robot demonstrations from multiple hardware platforms (Franka Panda, WidowX, KUKA, Sawyer) into unified schemas like RLDS or LeRobot format. The Open X-Embodiment dataset combines 1M+ trajectories across 22 embodiments; models trained on it, such as RT-1-X, reported roughly 50% higher average success rates than single-robot baselines, a gain attributed to learning embodiment-invariant manipulation skills.
Cross-Embodiment Transfer
Physical AI Glossary
Cross-embodiment transfer is the ability of a robot policy to operate on a different physical platform than the one it was trained on—for example, a manipulation policy trained on a Franka Panda arm executing tasks on a Universal Robots UR5. This capability decouples data collection from deployment hardware, enabling teams to pool demonstrations across labs and embodiments into shared datasets that improve generalization.
Data Deduplication
Glossary
Data deduplication identifies and removes duplicate or near-duplicate samples from training datasets to prevent overfitting, reduce storage costs, and improve model generalization. In physical AI, deduplication operates at three levels: exact (byte-identical copies), near-duplicate (minor compression or crop differences), and semantic (functionally equivalent trajectories). Effective deduplication can reduce dataset size by 15-40% while maintaining or improving model performance, as demonstrated in large-scale robot learning datasets like Open X-Embodiment and DROID.
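The exact level is the simplest of the three to illustrate: hash each payload and keep the first occurrence. The sketch below assumes samples arrive as (sample_id, bytes) pairs, which is a layout chosen for this example; near-duplicate and semantic deduplication need perceptual hashes or learned embeddings and are not shown.

```python
import hashlib

def drop_exact_duplicates(samples):
    """Keep the first sample id for each byte-identical payload.

    samples: iterable of (sample_id, payload_bytes) pairs.
    """
    seen = set()
    kept = []
    for sample_id, payload in samples:
        digest = hashlib.sha256(payload).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(sample_id)
    return kept
```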
Data Enrichment
Glossary
Data enrichment transforms raw sensor captures into training-ready datasets by layering annotations, metadata, and derived features onto each sample. For physical AI, enrichment adds bounding boxes, segmentation masks, depth maps, language captions, quality scores, and embedding vectors to raw RGB-D video, point clouds, and telemetry streams—turning unstructured captures into structured training inputs that enable vision-language-action models to learn manipulation policies at scale.
Data Flywheel
Glossary
A data flywheel is a self-reinforcing cycle in which deploying a trained model generates new data, especially from failure cases, that is used to retrain and improve the model, which then generates even more useful data on its next deployment. In physical AI, each flywheel turn requires real-world data collection: human operators, physical environments, and specialized hardware. Companies that build effective data flywheels compound their advantage with every deployment cycle, while those relying on static datasets risk falling behind.
Data provenance
Glossary
Data provenance is the record of where data came from, how it was collected, what rights apply, and how it changed before delivery. The term matters because it turns a procurement concept (traceability) into concrete requirements you can evaluate samples against.
Data Quality Scoring
Glossary
Data quality scoring assigns numeric ratings to individual training samples and datasets across measurable dimensions—technical capture quality (resolution, depth completeness, motion blur), annotation accuracy (bounding box IoU, label correctness, inter-annotator agreement), and task relevance (demonstration diversity, failure mode coverage). Automated scoring pipelines combine signal processing algorithms, reference-based metrics, and learned quality models to produce continuous values that enable fine-grained curation: training on top-N percentiles, weighting samples during optimization, or targeting collection efforts toward underrepresented scenarios.
Dataset Diversity
Glossary
Dataset diversity measures how broadly a training set spans the scenarios, objects, environments, and conditions a model will encounter in deployment. High diversity enables generalization; low diversity confines models to narrow slices of the operational distribution, causing brittle performance on out-of-distribution inputs.
Deformable Object Manipulation
Physical AI Glossary
Deformable object manipulation is the robotic task of handling materials—cloth, rope, cables, dough, soft packaging—that change shape under contact forces. Unlike rigid-body manipulation, deformable tasks require models that predict continuous shape evolution across contact sequences, typically using vision transformers or graph neural networks trained on 10,000–100,000+ teleoperation trajectories capturing state transitions under varied grasp points, pull directions, and material properties.
Depth Anything V2
Glossary
Depth Anything V2 is a monocular depth estimation model that predicts dense per-pixel depth maps from single RGB images using a DINOv2 Vision Transformer encoder and Dense Prediction Transformer decoder. Released in June 2024, it was trained on 595,000 labeled images plus 62 million pseudo-labeled frames, achieving zero-shot generalization across indoor, outdoor, and egocentric domains without domain-specific fine-tuning.
Depth Data
Physical AI Glossary
Depth data is a spatial measurement modality that encodes the distance from a camera sensor to each visible surface point in the scene, represented as a 2D image where pixel values indicate distance in millimeters or meters. Combined with RGB imagery, depth maps enable robots to compute 3D [link:ref-link-point-cloud]point clouds[/link], estimate object poses, plan collision-free paths, and generate [link:ref-link-6-dof-grasp]6-DOF grasp vectors[/link] that appearance-only models cannot infer reliably.
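Turning a depth pixel into a 3D point uses the pinhole camera model, sketched below. The function name is an assumption, and the intrinsics (fx, fy, cx, cy, in pixels) must come from the capture rig's calibration; depth is assumed to be in meters along the optical axis.

```python
def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth in meters to a camera-frame point.

    Uses the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)
```

Applying this to every valid pixel of a depth map yields a point cloud in the camera frame, which is why depth deliveries need the intrinsics recorded alongside the frames.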
Dexterous Manipulation
Physical AI Glossary
Dexterous manipulation is the use of multi-finger robot hands (typically 3-5 fingers, 12-24 degrees of freedom) to perform fine motor tasks requiring in-hand object rotation, force modulation across multiple contact points, and dynamic regrasping. Unlike parallel-jaw grippers, dexterous hands coordinate independent joint angles to achieve human-like manipulation. The high-dimensional action space makes this one of the hardest data collection challenges in robot learning: each timestep requires 12-24 joint position labels, contact state annotations, and force/torque measurements across multiple fingertips.
Diffusion Policy in Robotics
Physical AI Glossary
Diffusion Policy is a robot learning architecture that generates action sequences by iteratively denoising random Gaussian noise conditioned on visual observations. Introduced by Chi et al. in 2023, it frames visuomotor control as a conditional denoising diffusion process rather than direct regression, enabling the policy to represent multimodal action distributions where multiple valid responses exist for a single observation.
Domain Randomization
Physical AI Glossary
Domain randomization trains robot policies in simulation by randomly varying visual parameters (textures, lighting, camera angles) and physical parameters (mass, friction, actuator delays) across episodes. This forces the policy to learn robust features that generalize across a wide distribution of environments, making the real world just one more sample point rather than an out-of-distribution domain. First demonstrated by Tobin et al. in 2017 for object localization and extended by OpenAI in 2019 for the Rubik's Cube-solving Dactyl system, DR now underpins sim-to-real pipelines at [link:ref-scale-physical-ai]Scale AI[/link], [link:ref-nvidia-cosmos]NVIDIA Cosmos[/link], and open projects like [link:ref-rlbench]RLBench[/link].
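In practice, the per-episode variation is often just a draw from configured ranges. The sketch below shows that idea with a hypothetical `RANGES` table; the parameter names and bounds are invented for illustration and do not come from any of the cited systems or simulators.

```python
import random

# Hypothetical randomization ranges: (low, high) per parameter.
RANGES = {
    "light_intensity": (0.3, 1.5),
    "object_mass_kg": (0.05, 1.0),
    "friction_coeff": (0.4, 1.2),
    "actuator_delay_s": (0.0, 0.05),
}

def sample_episode_params(rng):
    """Draw one uniformly random value per parameter for a new episode."""
    return {name: rng.uniform(low, high) for name, (low, high) in RANGES.items()}
```

Passing a seeded `random.Random` instance keeps the sampled episodes reproducible, which helps when a supplier has to show how a synthetic batch was generated.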
Egocentric data
Glossary
Egocentric data is first-person video or sensor data captured from the perspective of a person or embodied actor. The term matters because it turns a modeling concept (first-person viewpoint) into a concrete capture requirement you can evaluate samples against.
Few-Shot Imitation Learning
Physical AI Glossary
Few-shot imitation learning trains a robot policy to perform novel manipulation tasks from 1–10 human demonstrations, compared to hundreds required by standard behavioral cloning. The technique relies on pretraining across diverse multi-task datasets—such as Open X-Embodiment's 1 million trajectories or DROID's 76,000 episodes—so the model learns reusable manipulation primitives and task-inference mechanisms that generalize to unseen skills with minimal new data.
Force-Torque Sensing
Physical AI Data Glossary
Force-torque (F/T) sensing measures six-dimensional interaction vectors—three linear forces (Fx, Fy, Fz) and three rotational torques (Tx, Ty, Tz)—at a robot's joints or end-effector. Dedicated wrist-mounted strain-gauge sensors and integrated joint-torque arrays both capture contact dynamics invisible to cameras, enabling policy learning for insertion, assembly, and compliant manipulation tasks where force feedback is the primary control signal.
Foundation Model Robotics
Glossary
Foundation model robotics refers to large neural networks—typically 100M to 10B+ parameters—pretrained on internet-scale vision and language data, then fine-tuned on robot demonstrations to produce generalist policies that follow natural language instructions and manipulate novel objects across embodiments. The architecture pattern is Vision-Language-Action (VLA): a vision encoder processes camera frames, a language model backbone integrates visual features with text instructions, and an action head outputs robot-executable commands.
GR00T N1: NVIDIA's Humanoid Foundation Model
Physical AI Glossary
GR00T N1 (Generalist Robot 00 Technology N1) is NVIDIA's open-weight foundation model for humanoid robot control, released March 2025. It implements a dual-system architecture: System 2 is a vision-language model that interprets the scene and the language instruction at a lower rate (about 10 Hz in NVIDIA's report), while System 1 is a diffusion-transformer action model that turns that output into closed-loop motor commands at a higher rate (about 120 Hz in the same report). The two modules are trained jointly end to end.
Grasp Planning
Physical AI Glossary
Grasp planning is the computational process of determining a gripper pose (6-DoF position and orientation, plus finger configuration) that achieves stable contact with a target object. Modern approaches use neural networks trained on millions of labeled grasp attempts to predict grasp quality directly from RGB-D images or point clouds, replacing analytical force-closure methods that require exact CAD models.
Grasping Dataset
Physical AI Glossary
A grasping dataset is a labeled collection pairing object observations—RGB-D images, point clouds, or meshes—with gripper poses and binary success outcomes. Modern datasets range from 885 image-rectangle pairs in Cornell (2011) to 17.7 million 6-DOF poses in ACRONYM and over one billion grasp candidates in GraspNet-1Billion, enabling supervised learning of grasp affordances across parallel-jaw, suction, and multi-finger end-effectors.
Gripper Design
Physical AI Glossary
Gripper design is the engineering discipline that selects, configures, and optimizes end-effectors—the physical interfaces between a robot arm and objects in its workspace. Design choices (parallel-jaw vs. suction vs. soft vs. multi-fingered) directly constrain what objects a robot can grasp, how reliably, and under what conditions. Modern physical AI systems treat gripper design as a co-optimization problem: hardware geometry, sensor placement, and training data must align to produce robust learned policies that generalize across object shapes, materials, and poses.
Hand-Object Interaction
Physical AI Glossary
Hand-object interaction (HOI) research studies how human hands contact, grasp, manipulate, and release objects across reach, grip, in-hand adjustment, and release phases. HOI datasets provide demonstration data that teaches dexterous robots to replicate human manipulation skills in unstructured environments. Leading benchmarks include EPIC-KITCHENS (100 hours of egocentric kitchen tasks), DexYCB (582,000 RGB-D frames with 3D hand pose and object pose), and DROID (76,000 trajectories across 564 scenes). Procurement requires verifying 3D hand pose accuracy, contact annotation density, object diversity, and licensing terms for commercial model training.
Haptic Feedback in Physical AI
Glossary
Haptic feedback refers to force, torque, and tactile sensor signals that enable robots to perceive contact dynamics during manipulation. Unlike vision-only systems, haptic modalities capture slip detection, surface texture, and grasp stability — critical for contact-rich tasks like assembly, insertion, and deformable object handling where visual occlusion limits camera-based perception.
Human Intent Prediction
Physical AI Glossary
Human intent prediction infers what a person will do next from sensor observations—gaze direction, hand trajectory, object proximity—so collaborative robots can assist proactively rather than react after the fact. Production systems combine vision transformers pretrained on egocentric video with domain-specific teleoperation datasets annotated for grasp intent, handover timing, and task-phase transitions. Performance depends on training data coverage: models fail on operator poses, object categories, or lighting conditions absent from the training distribution.
Humanoid Robot
Physical AI Glossary
A humanoid robot is a bipedal machine with human-like morphology—two legs, two arms, torso, and head—designed to operate in environments built for human dimensions without modification. Training humanoid policies requires whole-body coordination data spanning locomotion, manipulation, and balance, captured via teleoperation, motion capture, or egocentric video, then formatted as multi-modal trajectories pairing joint states, camera feeds, and force-torque readings across 20–50 degrees of freedom.
Imitation Learning
Physical AI Glossary
Imitation learning trains robot control policies by observing expert demonstrations rather than through trial-and-error reinforcement learning. The expert—human teleoperator or scripted controller—provides examples of correct task execution, and the learning algorithm extracts a policy that reproduces that behavior. Behavioral cloning treats demonstrations as supervised learning; DAgger iteratively collects on-policy corrections; inverse RL infers the expert's reward function; generative models like diffusion policies and ACT capture multimodal action distributions for contact-rich manipulation.
Instance Segmentation
Computer Vision Glossary
Instance segmentation detects every object in an image and produces a pixel-precise mask for each individual instance, distinguishing separate objects of the same class. Unlike bounding-box detection, it delineates exact spatial boundaries; unlike semantic segmentation, it assigns unique identities to each object. Modern methods like Mask R-CNN and Mask2Former power robotic manipulation by enabling per-object grasping, collision avoidance, and task planning in cluttered environments.
Inter-Annotator Agreement
Glossary
Inter-annotator agreement (IAA) measures how consistently multiple human annotators assign the same labels to identical data. It is the primary statistical signal distinguishing reliable training labels from annotation noise. Cohen's kappa corrects for chance agreement in two-annotator scenarios; Krippendorff's alpha generalizes to any number of raters and missing data. For spatial tasks like bounding boxes or segmentation masks, Intersection over Union (IoU) thresholds (typically 0.5–0.75) define agreement. IAA sets the performance ceiling for any model trained on those labels—if annotators disagree, the model learns contradictory signals and cannot exceed human-level consistency.
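For the two-annotator case, Cohen's kappa is kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The sketch below is a minimal implementation for categorical labels; the function name is an assumption.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both annotators must label the same non-empty item set")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Raw agreement of 75% on a balanced two-class task corresponds to a kappa of only 0.5, which is why a spec should name the statistic and not just a percentage.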
Inverse Kinematics
Physical AI Glossary
Inverse kinematics (IK) solves for the joint configuration that places a robot's end-effector at a specified Cartesian pose. For a 6-DOF arm, IK inverts the forward-kinematics function FK(q)=T to find q given T_target. Analytical solvers exploit geometric structure for closed-form solutions; numerical methods iteratively minimize pose error via Jacobian descent. Redundant arms (n>6) yield infinite solutions, requiring null-space optimization. IK underpins every manipulation trajectory—teleoperation datasets capture human-demonstrated end-effector paths that policies must reproduce via IK at inference.
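A closed-form solver is easiest to see on a planar two-link arm, a deliberate simplification of the 6-DOF case described above. The sketch below returns one of the two solutions (the one with q2 >= 0) and includes forward kinematics so the result can be checked; the function names and link-length parameters are assumptions for this example.

```python
import math

def two_link_fk(q1, q2, l1, l2):
    """Forward kinematics: joint angles (radians) to end-effector (x, y)."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y

def two_link_ik(x, y, l1, l2):
    """Analytical IK for a planar two-link arm.

    Returns (q1, q2) with q2 >= 0, or None if the target is out of reach.
    """
    cos_q2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(cos_q2) > 1.0:
        return None
    q2 = math.acos(cos_q2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2
```

Feeding the IK result back through `two_link_fk` should reproduce the target, which is the same round-trip check a buyer can run on end-effector and joint data delivered together.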
Joint-Space Control
Physical AI Glossary
Joint-space control commands a robot's internal degrees of freedom—joint angles, velocities, or torques—rather than end-effector Cartesian poses. For an n-joint manipulator, the control input is an n-dimensional vector specifying per-joint targets at each timestep. Position control (target joint angles tracked by PID) dominates learned manipulation because it is stable, unambiguous, and directly executable. Velocity and torque modes offer finer dynamics but require more sophisticated controllers. Joint-space actions are embodiment-specific: a policy trained on Franka Panda 7-DOF joint vectors cannot transfer to UR5 6-DOF without retraining, unlike task-space policies that generalize across kinematic chains.
Keypoint Annotation
Glossary
Keypoint annotation marks semantically meaningful landmark points—joint centers, fingertips, object corners—on images or video frames as (x, y) coordinates with visibility flags. These sparse spatial annotations train pose estimation models that give robots spatial awareness of bodies, hands, and objects for manipulation tasks.
Language-Conditioned Policy
Physical AI Glossary
A language-conditioned policy is a robot control model that accepts both sensory observations (camera images, depth maps, proprioception) and a natural language instruction as input, then outputs motor actions that execute the described task. The language instruction serves as a task specification, enabling a single policy to perform many different tasks depending on what it is told to do, rather than requiring a separate policy per task.
Manipulation Trajectory
Physical AI Glossary
A manipulation trajectory is a time-ordered sequence of (observation, action, state) tuples recorded during a single robot manipulation episode. Each trajectory captures synchronized sensor streams (RGB-D video, joint positions, gripper state, force/torque readings) paired with the action commands (Cartesian deltas, joint velocities, or gripper open/close signals) executed at each timestep. Trajectories are the atomic training unit for imitation learning: datasets like DROID contain 76,000 trajectories across 564 scenes, while Open X-Embodiment aggregates 1M+ trajectories from 22 robot embodiments to train generalist policies like RT-X.
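A trajectory record can be sketched as a small data structure. The field names below are hypothetical and far simpler than real schemas such as RLDS; camera frames are omitted, and the check only covers the "time-ordered" part of the definition.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    timestamp_s: float            # capture time of this step, in seconds
    joint_positions: List[float]  # proprioceptive state
    gripper_open: bool
    action: List[float]           # command executed at this step

@dataclass
class Trajectory:
    episode_id: str
    steps: List[Step] = field(default_factory=list)

    def is_time_ordered(self) -> bool:
        """True if timestamps strictly increase from step to step."""
        times = [s.timestamp_s for s in self.steps]
        return all(a < b for a, b in zip(times, times[1:]))
```

A check like `is_time_ordered` is the kind of sample QA a buyer can run before accepting a delivery.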
Monocular Depth Estimation
Physical AI Glossary
Monocular depth estimation (MDE) infers a dense depth map from a single RGB camera frame, recovering 3D scene geometry without stereo pairs or LiDAR. Transformer-based models like Depth Anything V2 achieve sub-10% relative error on zero-shot indoor scenes, enabling robots to navigate cluttered warehouses and grasp novel objects using commodity cameras that cost under $50.
Motion Planning
Physical AI Glossary
Motion planning computes a continuous, collision-free path from a robot's current configuration to a goal configuration by searching the configuration space (C-space) — the manifold of all possible joint angles or poses. Classical sampling-based algorithms like RRT and PRM build graphs of collision-free waypoints; optimization-based methods like CHOMP and TrajOpt refine trajectories by minimizing cost functionals; learned planners trained on millions of solved problems accelerate inference by predicting heuristics or entire paths.
Multi-Task Learning Robotics
Physical AI Glossary
Multi-task learning robotics trains a single neural network policy to execute multiple manipulation tasks by learning shared representations across diverse demonstrations. Unlike single-task policies that overfit to narrow scenarios, multi-task architectures extract transferable features from heterogeneous training data—enabling robots to generalize across object categories, environmental contexts, and task variations with 40–60% fewer parameters than ensemble approaches.
Multimodal Foundation Model
Physical AI Glossary
A multimodal foundation model is a large-scale transformer pretrained on text, images, video, audio, and action data that learns cross-modal representations transferable to downstream tasks. For physical AI, these models bridge internet-scale knowledge and embodied robot behavior by processing sensor streams and language instructions in a unified architecture.
Neural Radiance Field (NeRF)
Glossary
A neural radiance field (NeRF) is a continuous volumetric scene representation encoded by a multilayer perceptron that maps 5D coordinates (spatial location x,y,z plus viewing direction θ,φ) to volume density and view-dependent RGB color. Introduced in 2020, NeRF synthesizes photorealistic novel views by integrating color and density along camera rays via differentiable volumetric rendering, enabling 3D reconstruction from as few as 20–100 posed 2D images without explicit geometry.
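The ray integration step is usually approximated with a discrete sum: each sample contributes its color weighted by alpha_i = 1 - exp(-sigma_i * delta_i) and by the transmittance accumulated before it. The sketch below assumes the densities and colors have already been produced by the network; the function name is an assumption.

```python
import math

def render_ray(densities, colors, deltas):
    """Composite samples along one ray into an RGB value.

    densities: volume density sigma_i at each sample
    colors:    (r, g, b) at each sample
    deltas:    distance between adjacent samples
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    rgb_out = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        rgb_out = [acc + weight * c for acc, c in zip(rgb_out, rgb)]
        transmittance *= 1.0 - alpha
    return rgb_out
```

Empty space (zero density) contributes nothing, while a very dense first sample hides everything behind it.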
Object Pose Estimation
Physical AI Glossary
Object pose estimation computes the six-degree-of-freedom (6-DoF) position and orientation of objects in 3D space from sensor data. Modern systems fuse RGB images, depth maps, and point clouds through learned representations—typically vision transformers or convolutional networks pretrained on large-scale datasets and fine-tuned on domain-specific robot data. Performance is bounded by training data quality: systematic gaps in data coverage produce systematic deployment failures, making data collection and curation the primary engineering challenge for production pose estimation systems.
Occupancy Grid
Glossary
An occupancy grid is a probabilistic spatial representation that partitions 3D space into discrete voxels, each storing a belief about whether that region is free, occupied, or unknown. Robots fuse sensor streams—LiDAR, depth cameras, stereo vision—into this grid to perform collision checking, path planning, and object localization in real time.
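A common way to maintain the per-voxel belief is a log-odds update, sketched below. The class name, the sparse dictionary storage, and the sensor-model probabilities (0.7 for a hit, 0.4 for a miss) are assumptions chosen for illustration.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

class OccupancyGrid:
    """Sparse voxel grid storing occupancy beliefs as log-odds."""

    def __init__(self, p_hit=0.7, p_miss=0.4):
        self.log_odds = {}  # voxel index -> log-odds; missing means unknown
        self.l_hit = logit(p_hit)
        self.l_miss = logit(p_miss)

    def update(self, voxel, hit):
        """Fuse one observation: hit=True if the sensor saw a surface here."""
        step = self.l_hit if hit else self.l_miss
        self.log_odds[voxel] = self.log_odds.get(voxel, 0.0) + step

    def probability(self, voxel):
        """Occupancy probability; unknown voxels return 0.5."""
        value = self.log_odds.get(voxel, 0.0)
        return 1.0 - 1.0 / (1.0 + math.exp(value))
```

Working in log-odds turns each sensor fusion step into an addition, which is what makes real-time updates practical.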
Off-the-shelf dataset
Glossary
An off-the-shelf dataset is an existing dataset a supplier can license without running a new capture program. The term matters in scoping because it separates requests a supplier can fill from existing inventory from those that need new capture.
Open X-Embodiment
Physical AI Glossary
Open X-Embodiment (OXE) is a collaborative robot learning dataset released in October 2023 by Google DeepMind and 20 academic institutions, aggregating over 1 million robot trajectories from 22 different embodiments across 527 skills and 160,266 tasks[ref:ref-oxe-paper]. The dataset demonstrated that training on diverse cross-embodiment data produces 50% better emergent skill generalization than single-robot datasets[ref:ref-oxe-paper], establishing the principle that exposure to varied kinematics and action spaces teaches transferable manipulation primitives applicable across robot platforms.
Optical Flow
Computer Vision Glossary
Optical flow is a dense 2D vector field that estimates the apparent motion of every pixel between consecutive video frames, encoding horizontal and vertical displacement (u, v) for each spatial location. Physical AI systems use optical flow to decompose camera ego-motion from independent object motion, enabling real-time obstacle avoidance, visual odometry, and action recognition without explicit depth sensors.
Physical AI
Glossary
Physical AI refers to artificial intelligence systems that perceive, reason about, and act within three-dimensional physical environments—encompassing robot manipulation policies, world foundation models, autonomous vehicle stacks, and physics-aware video generators. Unlike digital AI operating on text or static images, physical AI must respect real-world constraints: collision dynamics, material properties, temporal causality, and sensor noise across modalities (RGB-D cameras, LiDAR, tactile arrays, proprioception).
Physical AI training data
Glossary
Physical AI training data is data that teaches models to perceive, reason about, and act in real or simulated physical environments. The term matters because it turns a modeling need into concrete data requirements you can evaluate samples against.
Point Cloud
Physical AI Glossary
A point cloud is an unordered set of 3D coordinates (X, Y, Z) representing sampled surface locations in physical space, captured by LiDAR, depth cameras, or stereo vision systems. Each point may carry attributes like RGB color, intensity, or surface normals. Unlike meshes or voxels, point clouds preserve raw sensor geometry without imposing topology, making them the primary 3D perception modality for robot manipulation, autonomous navigation, and scene understanding tasks.
Policy Distillation
Glossary
Policy distillation compresses a large teacher policy—trained on millions of demonstrations—into a smaller student network that runs on edge hardware. The student mimics teacher outputs via supervised learning on state-action pairs, achieving 70–90% of teacher performance at 5–20× lower inference cost. Critical for deploying vision-language-action models like RT-2 or OpenVLA onto robots with limited compute budgets.
Pose Estimation
Physical AI Glossary
Pose estimation is the computational task of determining the position and orientation of an entity—human body, rigid object, or robot—from sensor data. In physical AI, it spans 2D keypoint detection for imitation learning, 6-DoF object pose for grasping, and proprioceptive state estimation for closed-loop control. Modern vision-language-action models like RT-1 and RT-2 rely on pose-annotated demonstration data to map human actions onto robot joint commands.
Preference Annotation
Glossary
Preference annotation is the systematic collection of human comparative judgments between AI-generated outputs, forming the training signal for reward models in reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). Annotators evaluate pairs or ranked sets of model responses, selecting which better satisfies criteria like helpfulness, safety, or task success, enabling AI systems to learn latent quality functions that align behavior with human values without requiring absolute scoring rubrics.
Proprioceptive Data
Physical AI Glossary
Proprioceptive data records a robot's internal physical state—joint angles, velocities, torques, end-effector poses, and contact forces—providing the body awareness that learned policies require for precise manipulation. Modern datasets like [link:ref-droid]DROID[/link] and [link:ref-open-x]Open X-Embodiment[/link] pair RGB-D video with 7–14 DoF proprioceptive vectors at 10–30 Hz, enabling vision-language-action models to ground language commands in force-reactive control loops that adapt to contact dynamics invisible to cameras alone.
RAFT (Recurrent All-Pairs Field Transforms)
Glossary
RAFT is a convolutional recurrent architecture for dense optical flow estimation introduced by Teed and Deng in 2020. It builds a 4D correlation volume from all pairs of pixel features in two consecutive frames, pools it into a multi-scale pyramid, and iteratively refines the flow estimate with a ConvGRU update operator that looks up correlation values at each scale. At publication it reported state-of-the-art results on the Sintel final pass (2.855 px end-point error) and KITTI 2015 (5.10% F1-all outlier rate), while processing 1088×436 video at roughly 10 frames per second on a single GTX 1080 Ti.
Reinforcement Learning Robotics
Physical AI Glossary
Reinforcement learning robotics trains robot policies by maximizing cumulative reward through trial-and-error interaction with physical or simulated environments. Unlike imitation learning (which clones expert demonstrations), RL algorithms explore action spaces autonomously, discovering strategies that may exceed human performance. Modern systems increasingly start from models pretrained on web-scale data and large demonstration corpora, which RL can then fine-tune: Google's RT-1 was trained by imitation on 130K episodes across 700 tasks, RT-2 built on vision-language models with up to 55B parameters, and the Open X-Embodiment dataset aggregated 1M+ trajectories from 22 robot embodiments to enable cross-platform generalization.
Reward Model
Physical AI Glossary
A reward model is a neural network trained on human preference annotations to predict scalar quality scores for robot trajectories or AI outputs. In physical AI, reward models convert pairwise human judgments—'trajectory A handles the object more carefully than B'—into continuous reward signals that guide reinforcement learning policies toward safer, smoother, and more task-aligned behavior without hand-crafted reward functions.
Reward Shaping for Physical AI
Glossary
Reward shaping augments sparse task rewards with intermediate feedback signals that guide reinforcement learning agents toward desired behaviors. In robotics, shaped rewards reduce sample complexity by 40–70% compared to sparse-only formulations, enabling faster skill acquisition on manipulation tasks where end-task success occurs infrequently. The technique requires careful design: arbitrary shaping terms can change which policy is optimal and so bias the agent away from the true task optimum, whereas potential-based shaping, which adds γΦ(s′) − Φ(s) for a potential function Φ over states, provably leaves the optimal policy unchanged.
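A minimal sketch of potential-based shaping, where the bonus added to the sparse reward is γΦ(s′) − Φ(s). The distance-to-goal potential and the function names are hypothetical choices for illustration only.

```python
def potential(distance_to_goal):
    # Hypothetical potential: being closer to the goal means higher potential.
    return -distance_to_goal

def shaped_reward(r, phi_s, phi_next, gamma=0.99):
    """Sparse reward r plus the potential-based term gamma*phi(s') - phi(s)."""
    return r + gamma * phi_next - phi_s

def total_shaping(potentials, gamma=1.0):
    """Sum of shaping terms along a trajectory of state potentials."""
    return sum(gamma * b - a for a, b in zip(potentials, potentials[1:]))

# Moving the gripper from 0.5 m to 0.3 m from the goal, with no task reward yet.
bonus = shaped_reward(0.0, potential(0.5), potential(0.3))
```

With `gamma=1.0` the shaping terms telescope to Φ(last) − Φ(first), so the bonus depends only on where the trajectory starts and ends rather than on the route taken. That is the intuition behind why this form of shaping does not change which policy is optimal.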
RGB-D Data
Physical AI Data Glossary
RGB-D data combines a standard RGB color image with a spatially aligned depth map, where each pixel stores metric distance from the camera to the surface. This multimodal format enables robots to perceive both visual appearance and 3D geometry simultaneously, making it the dominant modality for indoor manipulation, navigation, and scene understanding in physical AI systems.
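Because each depth pixel stores metric distance, an aligned depth map can be back-projected into 3D with the pinhole camera model. The sketch below assumes known intrinsics (`fx`, `fy`, `cx`, `cy`) and treats a depth of zero as a missing reading; both are conventions chosen for this example.

```python
import numpy as np

def depth_to_points(depth_m, fx, fy, cx, cy):
    """Back-project an (H, W) metric depth map to an (N, 3) point cloud."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column, pixel row
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]  # drop pixels with no depth reading

# Toy 2x2 depth map at 2 m with one missing pixel.
depth = np.full((2, 2), 2.0)
depth[0, 0] = 0.0
points = depth_to_points(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

The same pixel indices can be used to attach RGB values to each 3D point, which is what makes the spatial alignment between color and depth valuable.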
RLDS: Reinforcement Learning Dataset Standard
Glossary
RLDS (Reinforcement Learning Datasets) is an episode-based data specification and storage format developed by Google DeepMind for sequential decision-making datasets in robotics and reinforcement learning. Built on TensorFlow Datasets infrastructure, RLDS structures robot interaction data as collections of episodes—ordered sequences of timesteps containing observations, actions, rewards, discount factors, and metadata—enabling standardized sharing and consumption across heterogeneous robot platforms and research groups.
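To make the episode structure concrete, here is a plain-Python sketch of one episode using the RLDS step field names (`observation`, `action`, `reward`, `discount`, `is_first`, `is_last`, `is_terminal`). It does not use TensorFlow Datasets, and the observation keys, metadata, and values are made up for illustration.

```python
def make_step(obs, action, reward, first=False, last=False, terminal=False,
              discount=1.0):
    """Build one timestep as a dict keyed by the RLDS step field names."""
    return {
        "observation": obs,
        "action": action,
        "reward": reward,
        "discount": discount,
        "is_first": first,
        "is_last": last,
        "is_terminal": terminal,
    }

# Hypothetical three-step episode: 7-D joint state in, 7-D action out.
episode = {
    "episode_metadata": {"robot": "example_arm"},  # illustrative metadata only
    "steps": [
        make_step({"state": [0.0] * 7}, [0.1] * 7, 0.0, first=True),
        make_step({"state": [0.1] * 7}, [0.1] * 7, 0.0),
        make_step({"state": [0.2] * 7}, [0.0] * 7, 1.0, last=True, terminal=True),
    ],
}

def episode_return(ep):
    """Undiscounted sum of per-step rewards."""
    return sum(step["reward"] for step in ep["steps"])
```

A sample QA check can then assert simple invariants, such as exactly one `is_first` and one `is_last` flag per episode and a consistent action dimension across steps.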
RLHF: Reinforcement Learning from Human Feedback
Glossary
RLHF is a three-stage training paradigm that aligns AI models with human preferences through pairwise comparison annotations, reward model training, and policy optimization. Annotators compare two candidate outputs and select the preferred one; these preferences train a reward model that scores outputs; reinforcement learning then fine-tunes the base model to maximize reward while maintaining proximity to the original policy via KL-divergence constraints.
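Two pieces of this pipeline reduce to short formulas. The sketch below shows a Bradley-Terry style pairwise loss, a common choice for training the reward model, and a reward with a KL-style penalty that discourages drifting from the reference policy. The function names and the `beta` value are illustrative assumptions, not a specific library's API.

```python
import math

def pairwise_loss(r_chosen, r_rejected):
    """Negative log-likelihood that the chosen output beats the rejected one.

    Equals -log(sigmoid(r_chosen - r_rejected)).
    """
    return math.log1p(math.exp(-(r_chosen - r_rejected)))

def kl_penalized_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Reward-model score minus a per-sample estimate of the KL penalty."""
    return reward - beta * (logp_policy - logp_reference)

# The loss is small when the reward model ranks the pair the way annotators did.
agree = pairwise_loss(2.0, 0.0)
disagree = pairwise_loss(0.0, 2.0)
```

The penalty term grows when the fine-tuned policy assigns much higher log-probability to an output than the reference policy does, which is how proximity to the original model is maintained.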
Robot demonstrations
Glossary
Robot demonstrations are task examples in which a robot or human demonstrator completes a behavior that a model should learn or be evaluated on. The term matters because it turns a model or procurement concept into concrete data requirements you can evaluate samples against.
Robot Learning
Physical AI Glossary
Robot learning is the field of acquiring robot capabilities (manipulation skills, navigation strategies, perception abilities, and decision-making policies) from data rather than manual programming. Systems learn from human demonstrations (imitation learning), trial-and-error experience (reinforcement learning), or simulated environments, then generalize across objects, scenes, and tasks. Modern approaches train vision-language-action models on large demonstration datasets: RT-2 learned from about 130,000 demonstrations collected on 13 robots, OpenVLA trained on 970,000 trajectories from the Open X-Embodiment corpus, and NVIDIA's GR00T N1 ingested 1.5 million teleoperation episodes to achieve 85% success on unseen manipulation tasks.
Safety Constraint Learning
Physical AI Glossary
Safety constraint learning trains robots to infer and respect operational boundaries from demonstration data, enabling deployment in human-shared environments without exhaustive rule specification. The approach combines inverse reinforcement learning with constraint inference: given expert trajectories that avoid collisions, stay within force limits, and respect workspace bounds, algorithms recover implicit cost functions that penalize unsafe states. Modern implementations use neural networks to parameterize constraint functions over high-dimensional sensor inputs, then integrate the learned constraints into model-predictive control or policy optimization loops to enforce safe behavior during execution.
SAM (Segment Anything Model)
Glossary
SAM (Segment Anything Model) is a foundation model released by Meta AI in April 2023 that performs zero-shot, promptable image segmentation from point, box, or mask prompts (text prompts were explored in the paper but are not part of the released model). Trained on SA-1B (1.1 billion masks across 11 million images), SAM uses a Vision Transformer image encoder (632M parameters) with a lightweight prompt encoder and mask decoder (about 4M parameters) that return pixel-precise masks in roughly 50 ms per prompt once the image embedding has been computed, making it a core perception primitive for robotics annotation, scene understanding, and physical AI data pipelines.
Scene Understanding
Physical AI Glossary
Scene understanding is the computational process of parsing multi-modal sensor streams into structured spatial representations that encode object identity, geometry, material properties, spatial relationships, and affordances. Unlike isolated vision tasks, scene understanding synthesizes segmentation, depth estimation, object detection, and relationship inference into a unified model queryable by planning modules—typically a 3D semantic map, scene graph, or neural radiance field.
Self-Supervised Learning Robotics
Physical AI Glossary
Self-supervised learning for robotics trains neural networks to extract task-relevant features from unlabeled sensor streams (RGB-D video, proprioception, tactile) by solving pretext tasks like temporal ordering, masked prediction, or contrastive pairing. This approach reduces human annotation costs by 60–80% compared to fully supervised pipelines while enabling cross-embodiment transfer. Modern implementations leverage vision transformers pretrained on internet-scale datasets, then fine-tuned on robot teleoperation or simulation data to ground visual semantics in action affordances.
Sensor Fusion for Physical AI
Glossary
Sensor fusion merges data from heterogeneous sensors—RGB cameras, depth sensors, LiDAR, force-torque transducers, IMUs, proprioceptive encoders—into a single spatiotemporally aligned representation that robot policies consume. Modern implementations use learned feature extractors (vision transformers, PointNet architectures) trained on synchronized multi-modal datasets where each sensor stream is timestamped, calibrated, and annotated with task-relevant labels. Performance depends on training data coverage: a policy trained on 10,000 RGB-D grasps will fail on force-sensitive assembly tasks unless the dataset includes synchronized wrench measurements and contact labels.
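Temporal alignment is usually the first thing to verify in a multi-sensor sample. The sketch below matches each camera frame to the nearest force-torque reading and rejects matches that are too far apart. The timestamps, the 20 ms tolerance, and the function name are assumptions for illustration; the function expects sorted timestamps and at least two readings in the second stream.

```python
import numpy as np

def align_nearest(ref_ts, other_ts, max_gap_s):
    """For each reference timestamp, return the index of the nearest sample
    in other_ts, or -1 if the nearest sample is more than max_gap_s away."""
    ref_ts = np.asarray(ref_ts, dtype=float)
    other_ts = np.asarray(other_ts, dtype=float)
    idx = np.clip(np.searchsorted(other_ts, ref_ts), 1, len(other_ts) - 1)
    left, right = other_ts[idx - 1], other_ts[idx]
    nearest = np.where(ref_ts - left <= right - ref_ts, idx - 1, idx)
    gap = np.abs(other_ts[nearest] - ref_ts)
    return np.where(gap <= max_gap_s, nearest, -1)

camera_ts = [0.0, 0.1, 0.2, 0.35]           # seconds
wrench_ts = [0.00, 0.03, 0.09, 0.21, 0.50]  # seconds
matches = align_nearest(camera_ts, wrench_ts, max_gap_s=0.02)
```

Frames that come back as `-1` have no usable wrench measurement and are the kind of gap a sample QA report should surface before scale-up.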
Sim-to-real gap
Glossary
Sim-to-real gap means the performance gap between behavior learned in simulation and behavior deployed in real physical environments. The term matters because it turns a model or procurement concept into concrete data requirements you can evaluate samples against.
Spatial Action Maps
Physical AI Glossary
Spatial action maps represent robot control policies as dense, pixel-aligned action predictions over an image observation. Instead of outputting a single action vector, the policy produces a spatial map where each pixel encodes the value or likelihood of executing an action at that location. The robot selects its action by identifying the peak pixel coordinate and converting it to a physical command through camera calibration, exploiting translation-equivariance in visual affordances.
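The selection step can be sketched in a few lines. This example assumes an overhead camera whose pixels map to workspace coordinates through a single meters-per-pixel scale and an origin offset, which is a deliberately simplified stand-in for a full camera calibration.

```python
import numpy as np

def select_action(action_map, meters_per_pixel, origin_xy):
    """Pick the peak pixel and convert it to workspace (x, y) in meters."""
    row, col = np.unravel_index(np.argmax(action_map), action_map.shape)
    x = origin_xy[0] + col * meters_per_pixel
    y = origin_xy[1] + row * meters_per_pixel
    return (int(row), int(col)), (float(x), float(y))

# Toy 4x5 map with a single high-value location.
action_map = np.zeros((4, 5))
action_map[2, 3] = 1.0
pixel, target_xy = select_action(action_map, meters_per_pixel=0.01,
                                 origin_xy=(0.0, 0.0))
```

Because the output is tied to image coordinates, shifting the object in the image shifts the predicted peak by the same amount, which is the translation-equivariance the definition refers to.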
Spatial Intelligence
Physical AI Glossary
Spatial intelligence is an AI system's ability to perceive 3D geometry, reason about object affordances, and plan actions in physical environments. Unlike 2D computer vision, spatial intelligence reconstructs volumetric scenes from multi-sensor input—RGB-D cameras, LiDAR, IMUs—to enable navigation, manipulation, and collision-free path planning in robotics and autonomous systems.
Synthetic Data for Physical AI
Glossary
Synthetic data for physical AI refers to training examples generated procedurally in physics simulation rather than collected from real robots. Simulators render camera images, compute object poses and contact forces, and record state-action trajectories of scripted or learned policies performing tasks in virtual environments. This approach reduces data collection costs by four to five orders of magnitude—one hour of real teleoperation costs $50–200 in operator time, while simulated data costs fractions of a cent in cloud compute—but the sim-to-real gap means simulation cannot fully replace real-world demonstrations for production deployment.
Task and Motion Planning (TAMP)
Physical AI Glossary
Task and motion planning (TAMP) is a computational framework that integrates symbolic task-level reasoning (deciding which actions to perform) with continuous motion-level planning (computing collision-free trajectories). TAMP systems solve long-horizon manipulation problems by iteratively proposing symbolic action sequences—pick, place, open, pour—and verifying geometric feasibility through motion planners that respect kinematic constraints, collision avoidance, and grasp stability.
Task Decomposition
Physical AI Glossary
Task decomposition partitions multi-step robot manipulation into discrete sub-goals that vision-language-action models can execute sequentially. Google's RT-2 demonstrated 62% success on 6,000 real-world trials by decomposing instructions like "bring me the Coke" into perceive-grasp-navigate-place primitives. Training requires teleoperation datasets annotated with sub-task boundaries—typically 10,000–100,000 trajectories per domain. Truelabel's marketplace aggregates decomposed teleoperation data from 20,000 collectors across warehouse, kitchen, and assembly environments.
Teleoperation data
Glossary
Teleoperation data means robot observations, state, and action traces recorded while a human remotely controls the robot. The term matters because it turns a model or procurement concept into concrete data requirements you can evaluate samples against.
Temporal Annotation
Glossary
Temporal annotation assigns time-aligned semantic labels to video segments by marking start timestamps, end timestamps, and event descriptions. Unlike spatial bounding boxes that label where objects appear in frames, temporal annotation labels when actions, states, or events occur across time. Egocentric video datasets like EPIC-KITCHENS-100 contain 90,000 action segments with annotated start and end times; manipulation datasets require sub-second precision for grasp-to-release transitions that inform visuomotor policies.
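A temporal label can be represented as a start time, an end time, and a description. The sketch below uses a hypothetical `Segment` record, converts it to frame indices at a given frame rate, and checks two segments for overlap; the field names are illustrative rather than taken from any dataset's schema.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start_s: float  # start timestamp in seconds
    end_s: float    # end timestamp in seconds
    label: str      # event description

def to_frames(seg, fps):
    """Convert a segment's timestamps to (start_frame, end_frame)."""
    return round(seg.start_s * fps), round(seg.end_s * fps)

def overlaps(a, b):
    """True if the two segments share any interval of time."""
    return a.start_s < b.end_s and b.start_s < a.end_s

grasp = Segment(1.5, 2.0, "grasp mug")
lift = Segment(2.0, 3.0, "lift mug")
frames = to_frames(grasp, fps=30)
```

At 30 fps one frame spans about 33 ms, so the frame rate of the source video sets a floor on how precisely a grasp-to-release boundary can be marked.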
Trajectory Optimization
Physical AI Glossary
Trajectory optimization finds robot motion plans that minimize a cost function—energy, time, smoothness, collision risk—subject to physical constraints like joint limits and obstacle avoidance. Unlike sampling-based planners that return any feasible path, trajectory optimizers solve a constrained optimization problem to produce locally optimal, dynamically smooth trajectories that respect actuator limits and task requirements.
Trajectory Prediction
Physical AI Glossary
Trajectory prediction forecasts the future spatial positions and velocities of agents (humans, robots, vehicles) and objects over time horizons of 1–10 seconds. Physical AI systems use trajectory models to anticipate collisions, plan safe paths, and coordinate multi-agent interactions in warehouses, kitchens, and autonomous vehicle fleets.
Transfer Learning Robotics
Physical AI Glossary
Transfer learning robotics applies knowledge from a source domain—simulation, multi-robot datasets, internet vision corpora—to a target robot task, reducing target data requirements by 60–80% versus training from scratch. The pretrain-finetune recipe dominates: models learn general representations on abundant source data, then adapt via limited target demonstrations.
Video Prediction
Physical AI Glossary
Video prediction generates future video frames from past observations and optional action inputs, serving as a learned world model for robot planning. Unlike classical physics simulators requiring explicit geometry and dynamics, video prediction models learn visual dynamics directly from data—predicting pixel-level consequences of actions in unstructured environments where analytic models fail.
Vision Transformer (ViT)
Glossary
Vision Transformer (ViT) splits images into fixed-size patches (typically 16×16 pixels), embeds each patch as a token, and processes the sequence through multi-head self-attention layers. Introduced by Dosovitskiy et al. in 2020, ViT eliminates convolutional layers entirely, treating visual recognition as a sequence modeling task. When pretrained on datasets exceeding 14 million images, ViT matches or surpasses CNN accuracy on ImageNet while scaling more efficiently to billion-parameter regimes, making it the default visual encoder for robotics foundation models like RT-2, OpenVLA, and NVIDIA GR00T.
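The patch-splitting step can be sketched with array reshapes alone. This example only produces the flattened patches; the learned linear embedding, position embeddings, and attention layers are omitted.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * c)   # one row per patch

image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image)
```

A 224×224 RGB image yields 14 × 14 = 196 patches of 16 × 16 × 3 = 768 values each, which is the sequence the transformer then processes.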
Vision-Language-Action Model
Physical AI Glossary
A Vision-Language-Action (VLA) model is a neural architecture that processes camera images and natural-language instructions to produce robot control outputs. VLA models pretrain on internet-scale vision-language pairs (e.g., CLIP, SigLIP embeddings) then fine-tune on robot demonstration datasets to ground semantic concepts in continuous action spaces. Google's RT-2 co-fine-tuned vision-language models on web data and on the RT-1 robot dataset (roughly 130,000 demonstrations collected on 13 robots), achieving 62% success on unseen scenarios versus 32% for RT-1.
Visual Grounding
Glossary
Visual grounding is the task of localizing objects or regions in an image given a natural language description. In robotics, it enables language-conditioned manipulation by mapping instructions like 'pick up the red mug' to pixel coordinates or 3D bounding boxes. Modern systems use vision-language models pretrained on web-scale image-text pairs, then fine-tuned on robot-specific datasets with spatial annotations. Performance depends on training data diversity: models fail on object categories, viewpoints, or lighting conditions absent from the training distribution.
Visual Servoing
Glossary
Visual servoing is a closed-loop control technique that uses real-time camera feedback to guide robot end-effector motion toward a target pose or trajectory. Unlike open-loop systems that execute pre-programmed paths, visual servoing continuously compares observed image features (edges, keypoints, object centroids) against desired features and computes corrective motor commands. Modern implementations leverage vision-language-action models trained on 350K+ teleoperation trajectories to map pixel observations directly to joint velocities or Cartesian displacements, enabling adaptive manipulation in unstructured environments.
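The feedback loop in the classical formulation can be sketched as a proportional controller on image-feature error. This toy example assumes the commanded correction moves the observed feature by exactly that amount in the image (a unit image Jacobian), which real systems replace with an estimated interaction matrix.

```python
import numpy as np

def servo_step(observed, desired, gain=0.5):
    """One control step: command proportional to the image-feature error."""
    error = observed - desired
    return -gain * error

feature = np.array([120.0, 80.0])   # observed keypoint, in pixels
target = np.array([160.0, 120.0])   # desired keypoint, in pixels

errors = []
for _ in range(20):
    # Simplifying assumption: the command shifts the feature one-to-one.
    feature = feature + servo_step(feature, target)
    errors.append(float(np.linalg.norm(feature - target)))
```

With a gain of 0.5 the error halves on every iteration, which illustrates how continuous comparison against the desired features corrects for disturbances that an open-loop path would not see.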
Visuomotor Policy
Physical AI Glossary
A visuomotor policy is a neural network that accepts raw camera images as input and outputs robot motor commands (joint positions, velocities, or torques) as a single differentiable function, learning the entire perception-to-action pipeline end-to-end from demonstration or interaction data. Unlike modular robotics architectures that separate object detection, trajectory planning, and low-level control into discrete subsystems, visuomotor policies collapse this stack into one learned mapping.
VLA (Vision-Language-Action Model)
Physical AI Glossary
A Vision-Language-Action (VLA) model is a neural architecture that ingests camera images and natural language instructions, then outputs continuous robot control signals. VLAs merge a vision encoder (typically a Vision Transformer), a pretrained language model backbone, and an action decoder head. By leveraging internet-scale vision-language pretraining, VLAs transfer semantic understanding of objects, spatial relationships, and task verbs directly into physical manipulation policies—eliminating the need for separate perception, planning, and control modules.
VLA model
Glossary
VLA model means a vision-language-action model that connects visual observations and language context to physical actions. The term matters because it turns a model or procurement concept into concrete data requirements you can evaluate samples against.
Workspace Mapping
Physical AI Glossary
Workspace mapping constructs 3D spatial representations of a robot's operating environment from sensor streams—RGB-D cameras, LiDAR, tactile arrays—to enable collision-free motion planning, grasp pose synthesis, and dynamic obstacle avoidance. Modern systems fuse point clouds, voxel grids, and learned geometric priors into unified scene models that update at 10–30 Hz during task execution.
World Model
Physical AI Glossary
A world model is a neural network that learns to predict future environment states given current observations and proposed actions, enabling agents to plan by simulating action sequences internally before physical execution. Training world models requires diverse real-world video that captures causal structure: robotics teams use teleoperation datasets like DROID (76,000 trajectories across 564 scenes) and BridgeData V2 (60,000 demonstrations) to teach models how objects respond to manipulation.
World model AI
Glossary
World model AI means a model that learns predictive structure about environments, objects, motion, and consequences. The term matters because it turns a model or procurement concept into concrete data requirements you can evaluate samples against.
Zero-Shot Generalization
Physical AI Glossary
Zero-shot generalization is a robot's ability to perform tasks, manipulate objects, or operate in environments absent from its training data—without fine-tuning or additional demonstrations. Unlike few-shot adaptation (which requires new examples) or domain randomization (which simulates variance), zero-shot transfer tests whether a policy learned from dataset A succeeds on dataset B with no overlap in objects, scenes, or instructions.
Zero-Shot Manipulation
Physical AI Glossary
Zero-shot manipulation is a robot's ability to grasp, move, or interact with objects it has never encountered during training. Unlike task-specific controllers trained on fixed object sets, zero-shot policies generalize from diverse training data to novel instances by learning transferable representations of shape, affordances, and physical dynamics rather than memorizing object identities.
π₀ (pi-zero): Physical Intelligence's Vision-Language-Action Model
Glossary
π₀ (pi-zero) is a Vision-Language-Action model released by Physical Intelligence in October 2024 that unifies pretrained vision-language understanding with flow matching action generation to control robots across 68 manipulation tasks—including folding laundry, busing tables, and assembling boxes—at 50 Hz with bimanual dexterity previously requiring task-specific controllers.
Procurement questions before posting a bounty
- What exact model behavior or evaluation question should this data improve?
- Which modality, camera viewpoint, robot state, or metadata stream is required?
- What evidence proves the supplier has rights, consent, and provenance?
- Which delivery format must the sample open in before scale-up?
- What specific failure reasons should cause sample rejection?
Quality gate before a page becomes a deal spec
A page in this hub should not be treated as a finished procurement document by itself. It is a starting point for a bounty. Before a buyer funds capture or licenses off-the-shelf data, the page needs to become a short operating spec: accepted examples, rejected examples, file format, metadata fields, consent requirements, delivery location, and a named reviewer who can approve the sample.
The practical test is simple: if two suppliers read the same detail record, would they submit comparable samples? If not, the buyer needs to narrow the research into a more specific bounty. The strongest truelabel references help with that narrowing by linking from broad hubs into task pages, dataset profiles, format guides, glossary definitions, and public dataset alternatives.
| Gate | Question | Pass signal |
|---|---|---|
| Intent | What model behavior does the data improve? | The objective is tied to a task, benchmark, or evaluation gap. |
| Evidence | What proves a supplier can deliver? | A sample package includes files, manifest, rights, and QA notes. |
| Ingestion | Can the buyer load the sample? | The sample opens in the expected format or converter. |
Hub FAQ
How should buyers use the Physical AI data glossary hub?
Use the Physical AI data glossary hub to move from a broad physical AI data need into a concrete page with modality, sample, QA, format, rights, and supplier-evidence requirements.
Are these pages public datasets?
No. These pages are sourcing and specification guides for posting bounties. They help buyers define what a supplier must prove before data is accepted.
Why does this hub link to so many detail pages?
Each detail page handles one specific task, dataset, comparison, definition, or format. The hub is the index that helps a buyer pick the right one for the bounty they want to post.
What makes a page ready for a bounty?
A page is ready when it names a model objective, concrete files, metadata requirements, rights and consent expectations, sample QA checks, and a delivery format.
External source context
- Scale AI physical AI data engine
Shows enterprise demand for custom physical AI collection and enrichment programs.
- NVIDIA Physical AI Data Factory Blueprint
Frames physical AI data as an end-to-end factory problem spanning curation, generation, evaluation, and delivery.
- Open X-Embodiment
Baseline open robotics data entity for cross-embodiment tasks and VLA pretraining discussions.
- Ego4D dataset
Canonical egocentric video benchmark for first-person physical-world capture and limitations.