Glossary
Keypoint Annotation
Keypoint annotation marks semantically meaningful landmark points—joint centers, fingertips, object corners—on images or video frames as (x, y) coordinates with visibility flags. These sparse spatial annotations train pose estimation models that give robots spatial awareness of bodies, hands, and objects for manipulation tasks.
Quick facts
- Term: Keypoint Annotation
- Domain: Robotics and physical AI
- Last reviewed: 2025-05-19
What Keypoint Annotation Delivers for Physical AI
Keypoint annotation produces structured spatial coordinates that encode entity geometry without pixel-level segmentation overhead. A single annotator can label 17 body keypoints in 30–45 seconds per frame versus 3–5 minutes for instance segmentation masks, making keypoint workflows 4–6× faster for pose-centric tasks[1]. RT-1 and RT-2 both rely on keypoint-derived spatial priors to ground language instructions in 3D workspace coordinates.
The COCO keypoint format defines 17 body landmarks (nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles) as the de facto standard for human pose datasets. Extensions like HALPE (136 keypoints including hand and foot joints) and COCO-WholeBody (133 keypoints) provide finer granularity for dexterous manipulation scenarios where finger joint tracking matters[2]. MediaPipe and FreiHAND standardized the 21-joint hand model (wrist plus 4 joints per finger) now used across robotics vision pipelines.
Object keypoint annotation enables category-level 6DOF pose estimation, where a model predicts the 3D position and orientation of any object within a category from a single RGB image. PointNet pioneered deep learning on unordered point sets; subsequent work such as KeypointDeformer and NOCS extended keypoint-style representations to shape deformation and category-level pose estimation. Truelabel's marketplace indexes 12,000+ hours of annotated manipulation video with object keypoints for grasp affordance learning[3].
Standard Keypoint Schemas and Topology Conventions
COCO's 17-keypoint skeleton connects landmarks via 19 edges (nose→left_eye, left_shoulder→left_elbow, etc.) to form a kinematic chain. Each keypoint carries a visibility flag: 0 (not labeled), 1 (labeled but occluded), 2 (labeled and visible). Object Keypoint Similarity (OKS) is the standard evaluation metric: each keypoint's error is passed through a Gaussian, exp(-d²/(2s²k²)), where d is the pixel distance to ground truth, s is the object scale, and k is a per-keypoint constant, and the resulting scores are averaged over labeled keypoints[4].
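A minimal OKS implementation following that definition, assuming (K, 2) coordinate arrays, COCO visibility flags, and per-keypoint sigmas (the COCO toolkit ships the official sigma values; any values used here would be illustrative):

```python
import numpy as np

def object_keypoint_similarity(gt_xy, pred_xy, visibility, sigmas, area):
    """OKS = mean over labeled keypoints of exp(-d^2 / (2 * s^2 * k^2)).

    gt_xy, pred_xy : (K, 2) keypoint coordinates in pixels
    visibility     : (K,) COCO flags (0 = not labeled, 1 = occluded, 2 = visible)
    sigmas         : (K,) per-keypoint sigmas; COCO uses k = 2 * sigma
    area           : object segment area, used as the scale term s^2
    """
    d2 = np.sum((gt_xy - pred_xy) ** 2, axis=1)        # squared pixel distances
    k2 = (2.0 * sigmas) ** 2                           # per-keypoint constants k^2
    scores = np.exp(-d2 / (2.0 * area * k2 + 1e-9))    # per-keypoint Gaussian scores
    labeled = visibility > 0                           # only labeled keypoints count
    return float(scores[labeled].mean()) if labeled.any() else 0.0
```

Scores near 1.0 mean the predicted keypoints sit well within annotation noise for that object's scale.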
Hand keypoint schemas vary by application. The 21-joint model used by FreiHAND and MediaPipe indexes joints 0–20: wrist (0), thumb (1–4), index (5–8), middle (9–12), ring (13–16), pinky (17–20). Moon et al.'s InterHand2.6M dataset introduced a 42-keypoint schema (21 per hand) for two-hand interaction tasks. DexYCB pairs hand keypoints with object 6DOF poses across 1,000 sequences of 20 grasped objects.
Object keypoint topologies are task-specific. Rigid objects (mugs, tools) typically use 8–12 corner keypoints for bounding-box-like representations. Articulated objects (scissors, pliers) require joint-center keypoints plus hinge-axis annotations. NOCS introduced normalized object coordinate space, mapping each surface point to a canonical [0,1]³ cube—a dense keypoint field rather than sparse landmarks. For robot procurement, verify that keypoint schemas match your manipulation policy's input format (e.g., LeRobot expects 21-joint hand models for dexterous tasks).
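The canonical-cube mapping NOCS relies on can be sketched in a few lines; this simplified version only centers and scales the points, and skips the per-category canonical orientation that the full method defines:

```python
import numpy as np

def to_normalized_object_coords(points):
    """Map object surface points (N, 3) into a canonical [0, 1]^3 cube.

    Centers the point set and scales by the bounding-box diagonal so the object
    fits inside the unit cube; real NOCS additionally fixes a canonical
    orientation per category, which this sketch omits.
    """
    bbox_min, bbox_max = points.min(axis=0), points.max(axis=0)
    center = (bbox_min + bbox_max) / 2.0
    diag = np.linalg.norm(bbox_max - bbox_min)          # bounding-box diagonal
    return (points - center) / diag + 0.5               # coordinates now lie in [0, 1]^3
```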
Annotation Tooling and Quality Control Pipelines
Labelbox, Encord, and V7 provide keypoint annotation interfaces with skeleton templates, auto-propagation across video frames, and consensus workflows. Labelbox's skeleton tool supports custom topologies with up to 200 keypoints per object; Encord Active pre-labels keypoints using foundation models, then routes low-confidence frames to human review[5].
Scale AI reports 92% first-pass accuracy on 17-keypoint COCO annotations with their Rapid pipeline, which combines model pre-labeling, expert review, and programmatic quality checks (joint angle constraints, limb length ratios). Appen and Sama offer managed annotation services with domain-specific QA: hand keypoint projects enforce anatomical constraints (finger joints must form monotonic chains from wrist to tip), object keypoint projects validate symmetry for symmetric objects.
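Checks like these anatomical constraints are easy to script. The sketch below uses the 21-joint hand indexing described earlier and an illustrative rule (monotonically increasing distance from the wrist); it is a rough 2D heuristic that curled fingers can legitimately violate, so production pipelines add angle and bone-length checks:

```python
import numpy as np

# 21-joint hand indexing: wrist = 0, then four joints per finger from base to tip.
FINGER_CHAINS = {
    "thumb":  [0, 1, 2, 3, 4],
    "index":  [0, 5, 6, 7, 8],
    "middle": [0, 9, 10, 11, 12],
    "ring":   [0, 13, 14, 15, 16],
    "pinky":  [0, 17, 18, 19, 20],
}

def chain_is_monotonic(kp_xy, chain):
    """Check that joints along one finger move monotonically away from the wrist.

    kp_xy: (21, 2) hand keypoints in pixels.
    """
    dists = np.linalg.norm(kp_xy[chain] - kp_xy[chain[0]], axis=1)
    return bool(np.all(np.diff(dists) > 0))

def passes_hand_qa(kp_xy):
    """Flag an annotation for human review if any finger chain violates the heuristic."""
    return all(chain_is_monotonic(kp_xy, c) for c in FINGER_CHAINS.values())
```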
Consensus annotation—where 3–5 annotators label the same frame and outputs are averaged—reduces per-keypoint error by 30–40% but triples cost[6]. For robot training data procurement, specify Mean Per Joint Position Error (MPJPE) thresholds in your RFP: <5 pixels for high-resolution manipulation datasets, <10 pixels for navigation datasets. Truelabel's marketplace surfaces MPJPE distributions in dataset cards so buyers can filter by precision requirements before purchase.
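Consensus aggregation itself is simple to implement. A minimal sketch, assuming each annotator's output is a (K, 3) array of (x, y, visibility) and that coordinates are averaged only over annotators who labeled the point:

```python
import numpy as np

def consensus_keypoints(annotations, visibility_threshold=0.5):
    """Aggregate keypoints from multiple annotators, stacked as (A, K, 3).

    Coordinates are averaged over annotators who marked a keypoint as labeled;
    the consensus visibility flag is a majority vote. Simplified: it does not
    distinguish flag 1 (occluded) from flag 2 (visible).
    """
    annotations = np.asarray(annotations, dtype=float)      # (A, K, 3)
    labeled = annotations[..., 2] > 0                       # (A, K) labeled mask
    weights = labeled / np.maximum(labeled.sum(axis=0), 1)  # per-annotator weight per keypoint
    xy = np.einsum("ak,akc->kc", weights, annotations[..., :2])
    vis = (labeled.mean(axis=0) >= visibility_threshold).astype(int) * 2
    return np.concatenate([xy, vis[:, None]], axis=1)       # (K, 3) consensus annotation
```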
Keypoint Annotation in Robot Learning Pipelines
DROID collected 76,000 manipulation trajectories across 564 scenes and 86 tasks, annotating hand keypoints at 10 Hz to enable policy learning from third-person video. OpenVLA trains on 970,000 trajectories from Open X-Embodiment, using object keypoints to compute grasp success metrics (fingertip-to-object distance at contact, grasp stability over 2-second hold intervals).
Keypoint annotations serve three functions in robot learning: (1) spatial grounding for language-conditioned policies ("pick up the red mug" requires localizing mug keypoints in the workspace), (2) auxiliary supervision signals (predicting future hand keypoints improves action prediction accuracy by 12–18%[7]), (3) sim-to-real transfer validation (comparing real-world keypoint distributions to simulated ones quantifies domain gap).
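The auxiliary supervision setup in (2) amounts to pairing each frame with keypoints a few steps into the future. A minimal sketch, assuming a (T, N, 3) per-trajectory keypoint array and an illustrative horizon value:

```python
import numpy as np

def future_keypoint_targets(keypoints, horizon=5):
    """Pair each frame with the keypoints `horizon` steps ahead for an auxiliary head.

    keypoints: (T, N, 3) array of (x, y, visibility) per frame.
    Returns aligned inputs and targets of shape (T - horizon, N, 3).
    """
    return keypoints[:-horizon], keypoints[horizon:]

def auxiliary_keypoint_loss(pred_xy, target):
    """Visibility-masked squared error between predicted and future keypoints."""
    mask = (target[..., 2:3] > 0).astype(float)             # ignore unlabeled keypoints
    sq_err = mask * (pred_xy - target[..., :2]) ** 2
    return float(sq_err.sum() / max(mask.sum(), 1.0))
```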
BridgeData V2 includes 60,000 trajectories with object corner keypoints for 24 object categories. CALVIN provides 6DOF object poses exported directly from its simulated environments. For procurement, prioritize datasets that pair keypoint annotations with action labels and success flags—raw keypoint coordinates without task context have limited training value. LeRobot's dataset format standardizes this pairing via HDF5 groups linking `/observations/keypoints` arrays to `/actions` and `/episode_ends` metadata.
Historical Evolution from 2D Pose to 6DOF Object Tracking
The Buffy Stickmen dataset (Ferrari et al., 2008) introduced the first structured keypoint annotations for TV show frames, defining 6 upper-body joints. Leeds Sports Pose (LSP, Johnson & Everingham, 2010) expanded to 14 full-body joints across 2,000 images. COCO (Lin et al., 2014) scaled to 250,000 person instances with 17 keypoints and introduced OKS as the standard evaluation metric[4].
MPII Human Pose (Andriluka et al., 2014) contributed 25,000 images with 40,000 annotated people, establishing the train/test split conventions still used today. Hand keypoint annotation emerged later: FreiHAND (Zimmermann et al., 2019) provided 130,000 training images with 21-joint keypoint and hand shape annotations for 3D hand pose estimation. Moon et al.'s InterHand2.6M (2020) scaled to 2.6 million frames with two-hand interaction annotations.
Object keypoint work began with category-level pose estimation: NOCS (Wang et al., 2019) introduced normalized coordinate space for 6DOF pose from RGB. DexYCB (2021) paired hand keypoints with object poses for 1,000 grasp sequences. Recent models like ViTPose (2022) achieve 75.8 AP on COCO test-dev using vision transformer backbones—a 12-point improvement over 2019 ResNet baselines[1]. For robot buyers, this history shows that keypoint annotation quality has improved 3–4× in precision over the past decade, making legacy datasets less suitable for modern manipulation policies.
Procurement Strategies for Keypoint-Annotated Robot Data
Specify keypoint schema compatibility in your RFP: if your policy expects 21-joint hand models, datasets with 16-joint schemas require costly re-annotation. Verify that visibility flags are populated—many datasets mark all keypoints as visible even when occluded, degrading model robustness. Truelabel's provenance tracking records annotator consensus levels and QA pass rates per dataset, surfacing quality signals before purchase.
For manipulation tasks, prioritize datasets with synchronized keypoint + action labels. DROID pairs hand keypoints with 7DOF end-effector actions at 10 Hz; BridgeData V2 links object keypoints to grasp success flags. Datasets with keypoints but no action labels are useful for pre-training perception modules but insufficient for end-to-end policy learning[8].
Licensing matters: COCO uses CC-BY-4.0 (commercial-friendly), but many academic hand datasets restrict commercial use. FreiHAND prohibits redistribution of derived models without permission. Truelabel's marketplace filters by commercial-use rights and provides model-training-specific licenses that clarify derivative work permissions. For procurement teams, budget 15–25% of dataset cost for legal review of annotation licenses—ambiguous terms around "research use" have blocked multiple robot deployments[3].
Keypoint Annotation vs. Alternative Spatial Representations
Keypoint annotation trades spatial completeness for annotation speed. A 17-keypoint skeleton captures body pose in 30 seconds; a full instance segmentation mask for the same person takes 3–5 minutes. For tasks where sparse spatial structure suffices (pose estimation, grasp point prediction), keypoints deliver 4–6× cost efficiency[1].
Dense alternatives include depth maps (pixel-wise distance from camera), surface normals (per-pixel 3D orientation vectors), and point clouds (unordered 3D point sets). NOCS represents objects as dense coordinate fields—every visible pixel gets a 3D coordinate in canonical object space. This density enables precise 6DOF pose estimation but requires 10–15 minutes of annotation per object instance versus 60–90 seconds for 8-corner keypoint boxes.
Point cloud labeling tools like Segments.ai and Kognic support 3D keypoint annotation in LiDAR data for autonomous vehicle pipelines. For indoor manipulation, RGB keypoints suffice for most tasks; outdoor mobile manipulation (construction robots, agricultural robots) benefits from LiDAR keypoints that provide metric depth. Truelabel's marketplace indexes both modalities, with 8,000+ hours of RGB keypoint data and 2,400+ hours of LiDAR keypoint data across 140 object categories[3].
Quality Metrics and Evaluation Standards
Mean Per Joint Position Error (MPJPE) measures average Euclidean distance between predicted and ground-truth keypoints in pixels. State-of-the-art models achieve 45–55 mm MPJPE on FreiHAND's test set (3D hand pose) and 25–35 mm on MPII (2D body pose)[9]. For robot training data, target <5-pixel MPJPE on high-resolution images (1920×1080 or higher) to ensure sub-centimeter real-world precision.
Object Keypoint Similarity (OKS) normalizes keypoint error by object scale and per-keypoint variance, producing a [0,1] score where 1 is perfect alignment. COCO averages AP over OKS thresholds from 0.50 to 0.95 in 0.05 increments and also reports AP at the 0.50 and 0.75 thresholds. For procurement, request OKS distributions in dataset cards—datasets with median OKS <0.7 indicate poor annotation quality or ambiguous keypoint definitions[4].
Per-keypoint breakdown matters: wrist and elbow keypoints typically have 2–3× lower error than fingertip keypoints due to clearer visual landmarks. DexYCB reports per-joint error distributions showing fingertip MPJPE of 8–12 mm versus wrist MPJPE of 3–5 mm. For manipulation policies that rely on fingertip precision (insertion tasks, tool use), verify that fingertip-specific error meets your requirements. Truelabel's dataset cards break down MPJPE by joint type, enabling buyers to filter for fingertip-precise datasets before committing budget.
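MPJPE and the per-joint breakdown described above are straightforward to compute. The sketch below assumes (F, K, 2) pixel arrays and a caller-supplied joint-name list; the 5-pixel fingertip threshold is just this page's example spec:

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error in pixels over (F, K, 2) prediction/ground-truth arrays."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def per_joint_mpjpe(pred, gt, joint_names):
    """Average error per joint, so fingertip precision can be inspected separately."""
    per_joint = np.linalg.norm(pred - gt, axis=-1).mean(axis=0)   # (K,) mean error per joint
    return dict(zip(joint_names, per_joint.tolist()))

def meets_fingertip_spec(per_joint, fingertip_names, max_px=5.0):
    """Procurement-style gate: every fingertip joint must stay under the pixel threshold."""
    return all(per_joint[name] <= max_px for name in fingertip_names)
```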
Integration with Robot Learning Frameworks
LeRobot expects keypoint observations in `/observations/keypoints` HDF5 datasets with shape `(T, N, 3)` where T is trajectory length, N is number of keypoints, and the 3 channels encode (x, y, visibility). RLDS (Reinforcement Learning Datasets) stores keypoints as nested TensorFlow datasets with `keypoints/positions` and `keypoints/visibility` fields[10].
BridgeData V2 uses a custom format where object keypoints live in `/observations/object_keypoints` arrays synchronized to `/actions` at 5 Hz. DROID stores hand keypoints in `/observations/hand_keypoints_left` and `/observations/hand_keypoints_right` for bimanual tasks. For procurement, verify that your target framework can parse the dataset's keypoint format—format mismatches require custom data loaders that add 2–4 weeks to integration timelines[11].
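A minimal loader sketch for the layout described above; the dataset paths come from this page's description of the format and should be verified against the actual files before building a full loader:

```python
import h5py
import numpy as np

def load_keypoint_trajectory(path):
    """Read keypoints and actions from one episode file with the layout described above.

    Assumes `/observations/keypoints` with shape (T, N, 3) and `/actions` with
    shape (T, A); actual releases may use different group names, so inspect the
    file with h5py before relying on these paths.
    """
    with h5py.File(path, "r") as f:
        keypoints = np.asarray(f["/observations/keypoints"])   # (T, N, 3): x, y, visibility
        actions = np.asarray(f["/actions"])                    # (T, A) robot actions
    if len(keypoints) != len(actions):
        raise ValueError("keypoint and action streams are not synchronized")
    return keypoints, actions
```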
OpenVLA converts all keypoint formats to a unified schema during pre-processing: 21-joint hand models, 17-joint body models, and 8-corner object models. This normalization enables training on heterogeneous datasets but loses task-specific keypoint semantics (e.g., tool-tip keypoints for screwdriver manipulation). For buyers, prioritize datasets that match your policy's native keypoint schema to avoid lossy conversions. Truelabel's marketplace tags datasets with framework-specific compatibility flags (LeRobot-native, RLDS-compatible, custom-format) to streamline procurement decisions.
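Unifying schemas in this way reduces to an index remapping. This is a generic sketch, not OpenVLA's actual pre-processing code, and the unified joint ordering shown is hypothetical:

```python
import numpy as np

# Hypothetical unified hand ordering: wrist, then four joints per finger.
UNIFIED_HAND_ORDER = ["wrist"] + [
    f"{finger}_{i}" for finger in ("thumb", "index", "middle", "ring", "pinky") for i in range(1, 5)
]

def remap_keypoints(source_kp, source_names, target_names=UNIFIED_HAND_ORDER):
    """Reorder (T, N, 3) keypoints from a source schema into a target schema.

    Joints missing from the source are zero-filled with visibility 0, which is
    where conversions become lossy: task-specific keypoints (e.g., a tool-tip)
    with no slot in the unified schema are simply dropped.
    """
    T = source_kp.shape[0]
    out = np.zeros((T, len(target_names), 3), dtype=source_kp.dtype)
    index = {name: i for i, name in enumerate(source_names)}
    for j, name in enumerate(target_names):
        if name in index:
            out[:, j] = source_kp[:, index[name]]
    return out
```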
External references and source context
- [1] Scale AI: Expanding Our Data Engine for Physical AI (scale.com). Scale AI's expansion into physical AI data annotation services and efficiency metrics.
- [2] DROID project site (droid-dataset.github.io). DROID dataset scale, keypoint annotation methodology, and trajectory counts.
- [3] Truelabel physical AI data marketplace bounty intake (truelabel.ai). Truelabel marketplace dataset counts, filtering capabilities, and procurement features.
- [4] Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (arXiv). COCO keypoint format specification, OKS metric definition, and evaluation standards.
- [5] Encord Active (encord.com). Encord Active learning module accuracy and confidence routing.
- [6] Labelbox documentation overview (docs.labelbox.com). Labelbox annotation platform capabilities and skeleton tool specifications.
- [7] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv). RT-2 vision-language-action model and keypoint-based auxiliary supervision gains.
- [8] OpenVLA: An Open-Source Vision-Language-Action Model (arXiv). OpenVLA model architecture, training data scale, and grasp success metrics.
- [9] Encord (encord.com). FreiHAND dataset scale and 21-joint hand model specification.
- [10] RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning (arXiv). RLDS dataset format specification and TensorFlow integration for keypoint storage.
- [11] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv). DROID technical paper detailing annotation pipeline and dataset statistics.
FAQ
What is the difference between keypoint annotation and pose estimation?
Keypoint annotation is the manual or semi-automated process of marking landmark points on images to create training data. Pose estimation is the machine learning task that consumes keypoint-annotated datasets to train models that predict keypoint locations on new, unlabeled images. Annotation produces the ground truth; pose estimation is the inference task. For robot procurement, you need keypoint-annotated datasets (the training data) to build pose estimation models (the inference capability).
How many keypoints do I need for robot manipulation tasks?
Hand manipulation tasks typically require 21-joint hand models (wrist plus 4 joints per finger) to capture finger articulation for grasping. Full-body mobile manipulation adds 17 body keypoints for torso and arm tracking. Object manipulation requires 8–12 keypoints per object for 6DOF pose estimation. DROID uses 21 hand keypoints plus 8 object keypoints per manipulated item across 76,000 trajectories. For procurement, match keypoint count to your policy's spatial reasoning needs—over-specified schemas waste annotation budget, under-specified schemas limit model capability.
What annotation accuracy is required for real-world robot deployment?
Target <5-pixel Mean Per Joint Position Error (MPJPE) on 1920×1080 images for manipulation tasks requiring sub-centimeter precision (insertion, assembly). Navigation and coarse grasping tolerate 8–12 pixel MPJPE. State-of-the-art models achieve 45–55 mm 3D MPJPE on FreiHAND hand pose benchmarks. For procurement, request per-keypoint error distributions in dataset cards—fingertip keypoints typically have 2–3× higher error than wrist keypoints, so verify that precision-critical joints meet your thresholds before purchase.
Can I use COCO-annotated datasets for robot training?
COCO's 17-keypoint body schema works for mobile manipulation tasks where you need torso and arm tracking, but it lacks hand joint detail required for dexterous manipulation. COCO images are also predominantly human-centric (people in everyday scenes) rather than robot-workspace-centric (objects on tables, tools in hands). For procurement, use COCO for pre-training perception backbones, then fine-tune on robot-specific datasets like DROID (76,000 manipulation trajectories with hand keypoints) or BridgeData V2 (60,000 trajectories with object keypoints). Truelabel's marketplace indexes 140+ robot-workspace datasets with task-relevant keypoint schemas.
How do I verify keypoint annotation quality before purchasing a dataset?
Request sample frames with ground-truth keypoints overlaid for visual inspection. Check for anatomical plausibility (finger joints form monotonic chains, elbow angles within human range). Verify visibility flags are populated—many datasets mark all keypoints as visible even when occluded. Request Object Keypoint Similarity (OKS) distributions; median OKS <0.7 indicates poor quality. For object keypoints, verify symmetry constraints (symmetric objects should have symmetric keypoint placements). Truelabel's marketplace provides per-dataset QA reports with annotator consensus rates, MPJPE distributions, and OKS histograms to surface quality signals before purchase.
What is the cost difference between keypoint annotation and full segmentation masks?
Keypoint annotation costs 15–25% of instance segmentation on a per-frame basis. A 17-keypoint body skeleton takes 30–45 seconds to annotate versus 3–5 minutes for a full segmentation mask, yielding 4–6× throughput advantage. For 10,000-frame datasets, keypoint annotation costs $8,000–$12,000 versus $40,000–$60,000 for segmentation masks at typical vendor rates ($0.80–$1.20 per keypoint frame, $4–$6 per segmentation frame). For robot procurement, use keypoints when sparse spatial structure suffices (pose estimation, grasp point prediction) and reserve segmentation for tasks requiring pixel-perfect boundaries (bin picking with irregular objects, fabric manipulation).
Find datasets covering keypoint annotation
Truelabel surfaces vetted datasets and capture partners working with keypoint annotation. Send the modality, scale, and rights you need and we route you to the closest match.
Browse Physical AI Datasets