Physical AI Glossary
Hand-Object Interaction
Hand-object interaction (HOI) research studies how human hands contact, grasp, manipulate, and release objects across reach, grip, in-hand adjustment, and release phases. HOI datasets provide demonstration data that teaches dexterous robots to replicate human manipulation skills in unstructured environments. Leading benchmarks include EPIC-KITCHENS (100 hours of egocentric kitchen tasks), DexYCB (582,000 RGB-D frames with 3D hand pose and object pose), and DROID (76,000 trajectories across 564 scenes). Procurement requires verifying 3D hand pose accuracy, contact annotation density, object diversity, and licensing terms for commercial model training.
Quick facts
- Term: Hand-Object Interaction
- Domain: Robotics and physical AI
- Last reviewed: 2025-06-08
What Hand-Object Interaction Data Captures
Hand-object interaction datasets record spatial, temporal, and force relationships between hand surfaces and object surfaces during manipulation tasks. A complete HOI dataset includes RGB-D video streams, 3D hand pose (21-joint skeleton or parametric MANO mesh with 45 degrees of freedom), 6-DOF object pose, contact maps indicating which hand vertices touch which object surfaces, and task metadata (object identity, action label, success flag).
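The sketch below shows how one such annotated frame might be laid out in Python. Field names and array shapes are illustrative, not taken from any specific dataset:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HOIFrame:
    """One annotated frame of a hand-object interaction sequence."""
    rgb: np.ndarray           # (H, W, 3) uint8 color image
    depth: np.ndarray         # (H, W) float32 depth in meters
    hand_joints: np.ndarray   # (21, 3) joint positions in the camera frame
    mano_pose: np.ndarray     # (48,) axis-angle: 3 global rotation + 45 joint angles
    mano_shape: np.ndarray    # (10,) MANO shape coefficients
    object_pose: np.ndarray   # (4, 4) homogeneous 6-DOF object pose
    contact_mask: np.ndarray  # (778,) bool, True where a MANO vertex touches the object
    action_label: str         # e.g. "pour water"
    success: bool             # task success flag
```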
DexYCB provides 582,000 RGB-D frames of 10 subjects manipulating 20 YCB objects, with ground-truth 3D hand pose and object pose captured in a calibrated multi-camera rig[1]. EPIC-KITCHENS-100 contains 100 hours of egocentric video across 700 variable-length videos in 45 kitchens, comprising roughly 90,000 action segments annotated with 97 verb classes and 300 noun classes[2]. DROID contributes 76,000 teleoperation trajectories (350 hours) across 564 real-world scenes, collected on Franka Panda arms[3].
Egocentric video datasets like EPIC-KITCHENS excel at capturing natural task context and object affordances but lack dense 3D hand pose. Lab-captured datasets like DexYCB provide millimeter-accurate hand and object tracking but limited task diversity. Teleoperation datasets like DROID bridge this gap with diverse real-world scenes and end-effector trajectories suitable for imitation learning, though hand pose is often unavailable when operators use joystick or VR controllers rather than motion-capture gloves.
Detection, Recognition, and Reconstruction Tasks
HOI research spans three core computer vision tasks. Detection identifies interacting hand-object pairs in images and localizes them with bounding boxes. Recognition classifies the interaction type (grasp, pour, cut, twist). Reconstruction recovers full 3D hand and object geometry with contact surfaces.
HICO-DET contains 47,774 images with 150,000 annotated human-object pairs spanning 600 interaction categories, establishing the standard detection benchmark[4]. GRAB provides motion-capture quality 3D hand and body meshes during natural grasping of 51 everyday objects from 10 subjects, totaling 1,200 sequences. HOI4D extends this with 2.4 million RGB-D frames, 4D hand-object models, and category-level object pose for 800 object instances across 16 categories.
Reconstruction methods require contact annotation — binary masks or vertex-level labels indicating which hand surface points touch which object surface points. ContactPose contributes contact maps for 2,800 grasps of 25 objects, captured with thermal cameras that detect heat transfer at contact regions. OakInk provides 50,000 sequences with intent-driven manipulation annotations, linking grasp type to task goal (e.g., precision pinch for small-part assembly vs. power grasp for tool use).
Egocentric vs. Exocentric Capture Modalities
Egocentric datasets mount cameras on the operator's head, capturing the first-person view that matches a robot's onboard camera perspective. Exocentric datasets use fixed third-person cameras, providing full-body context but requiring viewpoint transfer during deployment.
Ego4D contains 3,670 hours of egocentric video from 74 worldwide locations, with 5.6 million annotated hand-object interactions across daily activities[5]. EPIC-KITCHENS focuses specifically on kitchen tasks, providing verb-noun action labels (e.g., 'take plate', 'pour water') that map directly to robotic task specifications. Both datasets lack 3D hand pose, limiting their use for grasp synthesis but providing rich semantic context for affordance learning.
Lab-captured datasets like H2O use multiple calibrated cameras to triangulate 3D hand pose, while First-Person Hand Action instead relies on magnetic sensors attached to the hand. These setups enable accurate joint angle estimation but require controlled lab environments. Assembly101 bridges modalities with synchronized egocentric and exocentric streams during 4,000 toy-vehicle assembly procedures, enabling cross-view consistency checks during annotation.
For robot policy training, egocentric data narrows the visual domain gap when the robot's onboard camera matches the human head-mounted camera's field of view. RT-1, trained on 130,000 teleoperation episodes, applies the same viewpoint-matching principle: demonstrations were recorded from the robot's own camera, so training and deployment views coincide.
Teleoperation Data for Imitation Learning
Teleoperation datasets record human operators controlling robot end-effectors to complete manipulation tasks, capturing state-action trajectories suitable for behavior cloning and inverse reinforcement learning. Unlike pure vision datasets, teleoperation data includes robot joint positions, gripper states, and task success labels.
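Converted to code, behavior cloning consumes such a trajectory as (observation, action) pairs. The sketch below uses illustrative field names ("steps", "camera_rgb", "commanded_action"); every real dataset defines its own schema:

```python
def trajectory_to_bc_pairs(episode):
    """Extract (observation, action) training pairs from one
    teleoperation episode for behavior cloning.
    Field names are illustrative, not a real dataset schema."""
    pairs = []
    for step in episode["steps"]:
        obs = {
            "image": step["camera_rgb"],           # onboard camera frame
            "joint_positions": step["joint_pos"],  # arm configuration
            "gripper_state": step["gripper"],      # binary or continuous
        }
        pairs.append((obs, step["commanded_action"]))  # operator command
    return pairs
```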
BridgeData V2 provides 60,000 teleoperation trajectories across 24 manipulation skills in kitchen environments, collected with a WidowX 250 6-DOF arm[6]. DROID scales this to 76,000 trajectories collected on Franka Panda arms, covering 564 diverse real-world scenes including homes, offices, and labs. ALOHA contributes 1,000 bimanual mobile manipulation demonstrations for tasks requiring whole-body coordination (e.g., opening drawers while stabilizing objects).
Teleoperation interfaces range from joystick control (low bandwidth, easy to deploy) to motion-capture gloves (high fidelity, expensive setup). UMI uses a handheld gripper with built-in cameras, enabling operators to demonstrate tasks in situ without wearing sensors. This approach collected 3,200 in-the-wild demonstrations across 50 real-world tasks, with 92% task success when policies were trained via diffusion models.
Open X-Embodiment aggregates 1 million trajectories from 22 robot embodiments, establishing cross-embodiment transfer benchmarks[7]. Truelabel's marketplace indexes teleoperation datasets by robot morphology, control interface, and task domain, enabling buyers to filter for embodiment-matched training data.
3D Hand Pose Estimation and Parametric Models
Accurate 3D hand pose is the foundation for grasp synthesis and contact prediction. Hand pose representations include skeletal models (21 joints in a kinematic tree) and parametric mesh models like MANO, which represents a hand with 10 shape coefficients, 3 global-rotation values, 45 joint-angle values, and 3 translation values (61 parameters in total).
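A flattened parameter vector makes this layout concrete. The slice order below is an assumption for illustration; datasets store these blocks in different orders:

```python
import numpy as np

theta = np.zeros(61, dtype=np.float32)  # one full MANO parameter vector

shape       = theta[0:10]   # beta: PCA shape coefficients
global_rot  = theta[10:13]  # root orientation, axis-angle
joint_pose  = theta[13:58]  # 15 finger joints x 3 axis-angle values = 45
translation = theta[58:61]  # wrist translation in meters
```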
DexYCB provides ground-truth hand pose at 30 Hz, derived from its calibrated multi-view capture rig. FreiHAND contributes 130,000 training images with MANO fits, enabling learning-based pose estimators to generalize across hand shapes. InterHand2.6M scales to 2.6 million frames with two-hand interactions, critical for bimanual manipulation tasks.
Contact annotation requires aligning hand mesh vertices with object surfaces. Thermal imaging (ContactPose) detects contact via heat transfer but requires specialized cameras. Pressure sensors embedded in objects provide ground-truth contact forces but limit object diversity. TACTO simulates vision-based tactile sensors such as DIGIT, rendering contact geometry at 0.1 mm resolution, though integration into existing datasets remains limited.
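When neither thermal nor tactile ground truth is available, a common bootstrap heuristic labels contact by hand-to-object proximity. The sketch below shows that generic heuristic; it is not the method of any dataset named above:

```python
import numpy as np
from scipy.spatial import cKDTree

def proximity_contact_mask(hand_vertices, object_points, threshold_m=0.005):
    """Mark a hand vertex as 'in contact' when it lies within
    threshold_m (here 5 mm) of the object surface. Useful as a
    first pass before manual correction."""
    tree = cKDTree(object_points)        # (M, 3) object surface points
    dist, _ = tree.query(hand_vertices)  # nearest object point per hand vertex
    return dist < threshold_m            # (N,) boolean contact mask
```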
LeRobot standardizes hand pose storage in HDF5 format with MANO parameters, joint positions, and contact masks as separate datasets within each episode file[8]. This schema enables efficient random access during policy training and simplifies cross-dataset aggregation.
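Reading one episode under such a layout with h5py might look like the sketch below. The dataset paths are assumed names for illustration; verify them against the actual files:

```python
import h5py

with h5py.File("episode_0001.hdf5", "r") as f:
    mano_params = f["hand/mano_params"][:]  # (T, 61) per-frame MANO vectors
    joints = f["hand/joint_positions"][:]   # (T, 21, 3) skeletal joints
    contact = f["hand/contact_mask"][:]     # (T, 778) per-vertex contact
    # HDF5 slicing reads only the requested frames from disk,
    # which is what makes random access during training cheap.
    window = f["hand/joint_positions"][100:132]
```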
Annotation Workflows and Quality Control
HOI annotation pipelines combine automated tracking, manual correction, and multi-stage verification. Initial hand pose estimates come from off-the-shelf detectors (MediaPipe, FrankMocap), then human annotators correct joint positions in 3D using multi-view consistency checks. Object pose is bootstrapped from CAD model alignment or learned category-level pose estimators, then refined manually.
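Bootstrapping hand landmarks with MediaPipe, for instance, takes only a few lines (a minimal sketch; confidence thresholds and post-processing omitted):

```python
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=2)

image = cv2.imread("frame_000123.jpg")
results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

if results.multi_hand_landmarks:
    for hand in results.multi_hand_landmarks:
        # 21 landmarks per hand: normalized x, y plus relative depth z,
        # used as the initial estimate that annotators then correct
        coords = [(lm.x, lm.y, lm.z) for lm in hand.landmark]
```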
CVAT supports 3D cuboid annotation for object pose and skeleton annotation for hand joints, with interpolation between keyframes to reduce labeling effort. Labelbox provides consensus workflows where three annotators label the same frame and disagreements trigger expert review. Scale AI's physical-AI annotation service reports 98.5% joint-position accuracy within 5-pixel tolerance for hand pose tasks.
Contact annotation remains labor-intensive. Annotators paint vertex-level contact masks on hand meshes, which takes 15-20 minutes per grasp at the fidelity of datasets like ContactPose. Segments.ai accelerates this with SAM-based segmentation propagation, reducing annotation time to 3-5 minutes per frame for binary contact masks.
Quality metrics include joint-position error (Euclidean distance between predicted and ground-truth joints), contact precision/recall (vertex-level agreement), and task success rate when policies trained on the data are deployed. Truelabel's data provenance system tracks annotation tool versions, annotator IDs, and quality scores, enabling buyers to filter datasets by annotation fidelity.
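The first two metrics reduce to a few lines of NumPy. A minimal sketch, assuming predictions and ground truth share shape and units:

```python
import numpy as np

def mpjpe_mm(pred, gt):
    """Mean per-joint position error in millimeters.
    Both arrays: (frames, 21, 3) joint positions in meters."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean() * 1000)

def contact_precision_recall(pred_mask, gt_mask):
    """Vertex-level agreement between boolean contact masks."""
    tp = np.logical_and(pred_mask, gt_mask).sum()
    precision = tp / max(pred_mask.sum(), 1)
    recall = tp / max(gt_mask.sum(), 1)
    return precision, recall
```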
Simulation vs. Real-World HOI Data
Synthetic HOI data from simulators like Isaac Gym and MuJoCo offers infinite scalability and perfect ground truth but suffers from the sim-to-real gap in contact physics, object deformability, and visual appearance. Real-world datasets provide authentic contact dynamics but require expensive capture infrastructure and manual annotation.
DexArt generates 1.2 million synthetic grasps across 1,800 articulated objects (doors, drawers, scissors) with contact forces computed from rigid-body physics. Policies trained purely on DexArt achieve 34% success on real-world articulated object manipulation, rising to 67% when fine-tuned on 500 real demonstrations. Domain randomization — varying object textures, lighting, and camera poses during simulation — narrows this gap but cannot fully replicate real-world contact uncertainty.
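In practice, domain randomization amounts to sampling a fresh configuration per simulated episode. The parameter ranges below are illustrative and not drawn from DexArt or any other named benchmark:

```python
import random

def sample_domain_randomization():
    """One randomized configuration per simulated episode."""
    return {
        "texture_id": random.randrange(500),             # swap object textures
        "light_intensity": random.uniform(0.3, 1.5),     # vary illumination
        "camera_jitter_deg": random.uniform(-5.0, 5.0),  # perturb viewpoint
        "friction": random.uniform(0.4, 1.2),            # vary contact physics
    }
```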
RoboNet combines 15 million real-world frames from 7 robot platforms with procedurally generated simulation data, using the real data to calibrate simulator friction and compliance parameters[9]. CALVIN provides long-horizon, language-conditioned manipulation tasks in simulation; studies that deploy simulation-trained policies on physical hardware typically report 15-30% task-success degradation.
Buyers should verify whether datasets include real-world validation splits. With fewer than 500 real-world test episodes, overfitting to simulation artifacts often goes undetected, producing policies that fail on physical robots despite high simulated performance.
Licensing and Commercial Use Rights
HOI datasets carry heterogeneous licenses that restrict commercial model training. EPIC-KITCHENS annotations are MIT-licensed, but the underlying video requires separate consent from participants under GDPR Article 7. DexYCB is CC BY 4.0, permitting commercial use with attribution. GRAB is CC BY-NC 4.0, prohibiting commercial use without negotiation.
Teleoperation datasets often inherit restrictive licenses from simulation environments. RoboNet's real-world subset is BSD-licensed, but simulated episodes use MuJoCo assets under a non-commercial research license. Open X-Embodiment aggregates datasets with mixed licenses; buyers must verify each constituent dataset's terms before commercial training.
Truelabel's marketplace surfaces licensing metadata in dataset cards, flagging non-commercial restrictions and providing contact paths for commercial licensing negotiation. 68% of robotics datasets on Hugging Face lack explicit commercial-use clauses, creating procurement risk for model builders[10]. Buyers should budget 12-18 months for license negotiation when aggregating multi-source HOI datasets for foundation model training.
Procurement Checklist for HOI Datasets
When evaluating HOI datasets for manipulation policy training, verify: (1) 3D hand pose accuracy — joint position error <10mm for grasp synthesis tasks; (2) contact annotation density — vertex-level contact maps for at least 20% of frames; (3) object diversity — ≥50 object instances spanning rigid, articulated, and deformable categories; (4) task coverage — demonstrations of target skills (pick-place, in-hand reorientation, tool use); (5) embodiment match — end-effector geometry and DOF compatible with deployment robot; (6) licensing clarity — explicit commercial-use permission or negotiation path.
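These criteria can be encoded directly against a dataset card. A minimal sketch with illustrative key names, using the thresholds from the checklist above:

```python
def passes_procurement_checks(card):
    """Evaluate a dataset card (a plain dict) against the checklist.
    Key names are illustrative, not a standard card schema."""
    return all([
        card["joint_error_mm"] < 10,             # (1) hand pose accuracy
        card["contact_frame_fraction"] >= 0.20,  # (2) contact density
        card["num_object_instances"] >= 50,      # (3) object diversity
        card["covers_target_skills"],            # (4) task coverage
        card["embodiment_match"],                # (5) end-effector compatibility
        card["commercial_license_clear"],        # (6) licensing clarity
    ])
```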
Dataset volume matters less than diversity. RT-2 achieved state-of-the-art manipulation performance with 130,000 diverse episodes, outperforming models trained on 1 million narrow-domain episodes[11]. Prioritize datasets with ≥100 object instances and ≥10 task categories over single-task datasets with millions of repetitions.
For egocentric datasets, confirm camera intrinsics (focal length, distortion coefficients) match your robot's onboard camera. For teleoperation datasets, verify control frequency (≥10 Hz for contact-rich tasks) and gripper state encoding (binary open/close vs. continuous position). RLDS format standardizes these metadata fields, simplifying cross-dataset comparison.
Truelabel's intake process validates 23 metadata fields including annotation tool versions, inter-annotator agreement scores, and test-set contamination checks, reducing procurement due diligence from weeks to days.
External references and source context
[1] DexYCB project site. Dataset statistics: 582,000 RGB-D frames with ground-truth 3D hand pose and object pose. dex-ycb.github.io
[2] Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100. 100 hours, 90,000 action segments in 700 variable-length videos, 97 verb classes, 300 noun classes. arXiv
[3] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. 76,000 trajectories, 350 hours, 564 scenes. arXiv
[4] Learning to Detect Human-Object Interactions. HICO-DET statistics: 47,774 images, 150,000 human-object pairs, 600 interaction categories. arXiv
[5] Ego4D: Around the World in 3,000 Hours of Egocentric Video. 5.6M annotated hand-object interactions across daily activities. arXiv
[6] BridgeData V2: A Dataset for Robot Learning at Scale. 60,000 trajectories, 24 skills, WidowX 250 arm. arXiv
[7] Open X-Embodiment: Robotic Learning Datasets and RT-X Models. 1M trajectories from 22 robot embodiments. arXiv
[8] LeRobot documentation. HDF5 schema for hand pose, MANO parameters, contact masks. Hugging Face
[9] RoboNet: Large-Scale Multi-Robot Learning. 15M frames from 7 robot platforms with simulation calibration. arXiv
[10] Robotics datasets on Hugging Face need a buyer-readiness layer. 68% of Hugging Face robotics datasets lack explicit commercial-use licensing. Hugging Face
[11] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. 130,000 diverse episodes outperform 1M narrow-domain episodes. arXiv
FAQ
What is the difference between hand pose estimation and hand-object interaction datasets?
Hand pose estimation datasets (e.g., FreiHAND, InterHand2.6M) focus solely on recovering 3D hand joint positions or mesh parameters from images, without object context. HOI datasets additionally capture object pose, contact surfaces, and task semantics (action labels, success flags). Hand pose datasets train pose estimators; HOI datasets train manipulation policies. DexYCB is an HOI dataset because it includes both hand pose and object pose with contact annotations. FreiHAND is a hand pose dataset because objects are present but not tracked or annotated.
Can I train a dexterous manipulation policy using only egocentric video datasets like EPIC-KITCHENS?
Egocentric video datasets provide rich semantic context (what objects are used, in what order) but typically lack 3D hand pose and end-effector trajectories required for imitation learning. You can train affordance models or task planners from egocentric video, then combine them with teleoperation datasets (DROID, BridgeData V2) that provide state-action pairs for low-level control. RT-2 demonstrates this approach, using web video for semantic grounding and robot teleoperation data for action execution. Pure egocentric video is insufficient for grasp synthesis without additional 3D pose annotation.
How do I verify contact annotation quality in an HOI dataset?
Request sample frames with contact masks overlaid on hand and object meshes. Check for: (1) vertex-level precision (contact labels at mesh resolution, not bounding-box approximations); (2) temporal consistency (contact regions should not flicker between adjacent frames); (3) physical plausibility (contact normals should oppose each other, indicating force closure). ContactPose provides thermal validation images showing heat transfer at contact points. For datasets without thermal validation, compute contact precision/recall against a held-out test set annotated by domain experts. Inter-annotator agreement (Dice coefficient) should exceed 0.85 for binary contact masks.
What is the typical cost to annotate 1,000 frames of HOI data with 3D hand pose and contact maps?
Annotation costs vary by fidelity. Skeletal hand pose (21 joints) with semi-automated tracking and manual correction: $2-5 per frame. MANO mesh fitting with contact masks: $8-15 per frame. Full scene reconstruction with object pose and contact forces: $20-40 per frame. A 1,000-frame dataset with MANO pose and binary contact masks costs $8,000-15,000 at commercial annotation vendors (Scale AI, Labelbox). Academic datasets often use research assistants at lower hourly rates but longer timelines. Truelabel's collector network includes annotation specialists who price per-dataset based on task complexity, typically 20-30% below vendor rates for equivalent quality.
Which HOI datasets support bimanual manipulation tasks?
Bimanual datasets include: GRAB (two-hand grasping of large objects), InterHand2.6M (two-hand interactions with 2.6M frames), Assembly101 (two-hand toy assembly with egocentric and exocentric views), and ALOHA (1,000 bimanual mobile manipulation demonstrations). For policy training, verify that both hands are tracked simultaneously with synchronized timestamps. Some datasets (e.g., H2O) track hands independently, requiring post-processing to align left/right hand trajectories. ALOHA provides joint-space trajectories for both arms, simplifying behavior cloning.
How do I combine multiple HOI datasets with different annotation schemas for foundation model training?
Use RLDS (Reinforcement Learning Datasets) format to standardize heterogeneous datasets into a common schema with observation, action, reward, and metadata fields. LeRobot provides conversion scripts for 15+ robotics datasets, mapping dataset-specific fields (e.g., DexYCB's magnetic tracker data, EPIC-KITCHENS' verb-noun labels) into RLDS episodes. For hand pose, convert all representations to MANO parameters using off-the-shelf fitting tools (FrankMocap, HARP). For contact, binarize vertex-level masks into gripper-state proxies (contact detected → gripper closed). Truelabel's dataset cards expose schema mappings, enabling automated RLDS conversion for 80% of indexed HOI datasets.
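Each source dataset then needs one small conversion function. The sketch below shows the general shape of such a mapping with illustrative field names; it is not LeRobot's actual converter:

```python
def to_rlds_episode(raw_episode):
    """Map a dataset-specific episode into RLDS-style steps with
    observation/action/reward fields. Input field names are illustrative."""
    frames = raw_episode["frames"]
    steps = []
    for t, frame in enumerate(frames):
        steps.append({
            "observation": {
                "image": frame["rgb"],
                "mano_pose": frame["mano_pose"],  # unified hand representation
            },
            "action": frame.get("action"),        # None for vision-only sources
            "reward": 0.0,                        # demonstrations carry no reward
            "is_last": t == len(frames) - 1,
        })
    return {"steps": steps, "metadata": raw_episode.get("metadata", {})}
```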
Find datasets covering hand-object interaction
Truelabel surfaces vetted datasets and capture partners working with hand-object interaction. Send the modality, scale, and rights you need and we route you to the closest match.
Browse HOI datasets on truelabel