
Physical AI Model Profile

HumanPlus Model: Training Data Requirements & Architecture

HumanPlus is a two-stage humanoid learning system developed at Stanford. It first trains a shadowing transformer on 40 hours of AMASS motion capture data to predict 33-DoF joint targets from single RGB frames, then fine-tunes task-specific policies on teleoperation demonstrations collected on Unitree H1 platforms at 30 Hz with dual head-mounted cameras.

Updated 2025-06-15
By TrueLabel Sourcing
Reviewed by TrueLabel Sourcing

Quick facts

Topic: HumanPlus model
Audience: Procurement leads, ML ops, robotics engineers
Deliverable: Buyer-facing reference + procurement guidance

What Is HumanPlus?

HumanPlus is a full-stack humanoid learning framework published at CoRL 2024 by Zipeng Fu, Qingqing Zhao, Qi Wu, Gordon Wetzstein, and Chelsea Finn[1]. The system addresses the scarcity of whole-body humanoid demonstration data by decomposing learning into two stages: a shadowing transformer (HST) pretrained on large-scale human motion capture, and a humanoid imitation transformer (HIT) fine-tuned on robot teleoperation trajectories.

The architecture targets Unitree H1 and similar 33-DoF humanoid platforms, controlling 19 body joints, 12 hand degrees of freedom, and 2 wrist actuators at 30 Hz. Unlike language-conditioned policies such as RT-2 or OpenVLA, HumanPlus is task-specific: each skill requires a dedicated demonstration dataset collected via whole-body teleoperation.

The shadowing stage leverages AMASS, a unified motion capture corpus containing 11,000 sequences across 300 subjects[2]. By pretraining on human motion priors, HumanPlus reduces the teleoperation burden for downstream tasks from hundreds of demonstrations to 20-50 episodes per skill, achieving 60% average success rates across manipulation benchmarks[3].

Architecture and Key Innovations

HumanPlus employs a two-transformer architecture. The shadowing transformer processes single RGB frames (480×640 resolution) through a ResNet-50 visual encoder, then predicts 33-dimensional joint-position targets via a causal transformer decoder with 6 layers and 512 hidden dimensions. Training uses 40 hours of AMASS data rendered from third-person viewpoints with domain randomization applied to lighting, backgrounds, and camera angles.

The humanoid imitation transformer inherits the shadowing transformer's weights and fine-tunes on dual egocentric RGB streams captured from head-mounted cameras during teleoperation. This stage introduces action chunking with a 10-step prediction horizon and temporal ensembling to smooth control outputs. The policy runs at 30 Hz on NVIDIA Jetson AGX Orin hardware mounted on the robot torso.
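The interaction between action chunking and temporal ensembling can be sketched as follows. This is a minimal illustration in the style of ACT-type temporal ensembling, not code from the HumanPlus repository: the exponential weighting, the decay constant `m`, and the `TemporalEnsembler` name are all assumptions made for the example.

```python
import math
from collections import deque

CHUNK = 10  # prediction horizon in control steps, as stated in the text


class TemporalEnsembler:
    """Blend overlapping action chunks with exponential weights (hypothetical helper)."""

    def __init__(self, chunk=CHUNK, m=0.1):
        self.chunk = chunk
        self.m = m  # assumed decay constant; larger m gives older predictions more weight
        self.history = deque(maxlen=chunk)  # (start_step, chunk) pairs, oldest first

    def step(self, t, new_chunk):
        """Register the chunk predicted at step t and return the blended action for step t."""
        assert len(new_chunk) == self.chunk
        self.history.append((t, new_chunk))
        # Every stored chunk that still covers step t contributes one candidate action.
        candidates = [c[t - t0] for t0, c in self.history if 0 <= t - t0 < self.chunk]
        weights = [math.exp(-self.m * i) for i in range(len(candidates))]  # i = 0 is the oldest
        total = sum(weights)
        dim = len(candidates[0])
        return [sum(w * a[j] for w, a in zip(weights, candidates)) / total for j in range(dim)]
```

At each 30 Hz tick the policy would emit a fresh 10-step chunk, and the controller would execute only the blended action for the current step. With `m=0.0` the blend reduces to a plain average of all overlapping predictions.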

Key innovations include SMPL-to-robot retargeting that maps continuous human joint angles to discrete humanoid actuator commands, and a contact-aware reward function during policy rollout that penalizes foot slippage and torso tilt beyond 15 degrees. The system does not use RLDS or LeRobot dataset formats; trajectories are stored as HDF5 files with per-timestep RGB frames, joint positions, and binary success labels.

Training Data Requirements: Shadowing Stage

The shadowing transformer requires large-scale human motion capture data in SMPL or SMPL-H parameterization. AMASS provides 11,000 sequences at 120 Hz from optical systems including Vicon and OptiTrack, covering locomotion, reaching, grasping, and whole-body coordination tasks. Each sequence includes 24 SMPL body joint angles, 6-DoF root translation and rotation, and optional hand pose parameters.

For teams extending HumanPlus beyond AMASS coverage, procurement priorities include household manipulation motions (opening drawers, folding laundry), industrial assembly sequences (bin picking, tool handoff), and contact-rich interactions (pushing carts, climbing stairs). Motion capture must be recorded at ≥60 Hz to preserve high-frequency dynamics; downsampling to 30 Hz occurs during preprocessing[4].
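The capture-rate requirement and the 30 Hz downsampling step can be illustrated with simple stride decimation. This is a sketch under the assumption that the capture rate is an integer multiple of 30 Hz; the `downsample` function is hypothetical, and a production pipeline may low-pass filter before decimating to limit aliasing.

```python
def downsample(frames, src_hz, dst_hz=30):
    """Keep every (src_hz // dst_hz)-th frame; assumes an integer rate ratio."""
    if src_hz < 60:
        raise ValueError("capture rate is below the 60 Hz minimum stated in the text")
    if src_hz % dst_hz != 0:
        raise ValueError("non-integer rate ratio; resample by interpolation instead")
    return frames[:: src_hz // dst_hz]


# One second of 120 Hz AMASS-style frames reduces to 30 frames.
one_second = list(range(120))
reduced = downsample(one_second, src_hz=120)
```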

EPIC-KITCHENS-100 and Ego4D provide egocentric video but lack synchronized 3D pose, limiting their utility for shadowing pretraining. Teams using markerless pose estimation (e.g., OpenPose, MediaPipe) must validate joint angle accuracy against ground-truth optical data, as estimation errors compound during retargeting to 33-DoF humanoid skeletons.

Training Data Requirements: Imitation Stage

The humanoid imitation transformer requires teleoperation demonstrations collected on the target robot platform. For Unitree H1, operators wear a motion capture suit and VR headset, controlling the robot's 33 joints in real time while dual head-mounted cameras (480×640 RGB at 30 Hz) record egocentric views. Each demonstration includes synchronized joint trajectories, camera frames, and a binary success label annotated post-collection.

HumanPlus reports 20-50 demonstrations per task as sufficient for convergence, but success rates plateau without diversity in object poses, lighting conditions, and operator styles[5]. High-performing policies require ≥30 successful trajectories with ≥10 failure cases to learn task boundaries. Demonstrations shorter than 10 seconds or longer than 120 seconds degrade temporal attention; the median episode length across HumanPlus benchmarks is 40 seconds.
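A buyer could screen a delivered task dataset against these thresholds before training. The sketch below assumes each episode has already been reduced to a duration and a success flag; the `Episode` class and `review_task_dataset` function are illustrative names, not part of any HumanPlus tooling.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    duration_s: float
    success: bool


def review_task_dataset(episodes, min_success=30, min_failure=10,
                        min_len_s=10.0, max_len_s=120.0):
    """Count usable episodes and compare them with the thresholds quoted in the text."""
    usable = [e for e in episodes if min_len_s <= e.duration_s <= max_len_s]
    n_success = sum(e.success for e in usable)
    n_failure = len(usable) - n_success
    return {
        "usable": len(usable),
        "rejected_for_length": len(episodes) - len(usable),
        "successes": n_success,
        "failures": n_failure,
        "meets_thresholds": n_success >= min_success and n_failure >= min_failure,
    }
```

Episodes outside the 10-120 second window are excluded before counting, so a delivery padded with very short clips would not pass.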

Data collection infrastructure includes calibrated stereo camera rigs, motion capture suits with <5mm positional accuracy, and real-time retargeting software that maps human joint angles to robot commands with <50ms latency. Truelabel's physical AI marketplace connects teams with teleoperation facilities offering Unitree H1, Figure 02, and other humanoid platforms, reducing cold-start costs for imitation data collection.

Unlike DROID or BridgeData V2, HumanPlus does not support language annotations or multi-task policies. Each skill (e.g., 'pick up cup', 'open door') requires a dedicated dataset and fine-tuned checkpoint.

Input and Output Specifications

Shadowing stage input: Single RGB frame (480×640 pixels, 8-bit per channel) from a third-person camera positioned 2-4 meters from the human subject. No depth, point cloud, or proprioceptive state is used during pretraining.

Shadowing stage output: 33-dimensional joint-position target vector representing 19 body joints (hip, knee, ankle, torso, shoulder, elbow), 12 hand joints (6 per hand), and 2 wrist rotation axes. Positions are normalized to the [-1, 1] range per joint.
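The 33-dimensional layout and the per-joint normalization can be sketched as below. The ordering of body, hand, and wrist entries within the vector and the use of joint limits as the normalization bounds are assumptions for illustration; the source states only the group sizes and the [-1, 1] target range.

```python
# Assumed ordering: body joints first, then hands, then wrists.
BODY = slice(0, 19)
HAND = slice(19, 31)
WRIST = slice(31, 33)


def normalize(q, lower, upper):
    """Map joint positions from [lower, upper] to [-1, 1], one joint at a time."""
    return [2.0 * (qi - lo) / (hi - lo) - 1.0 for qi, lo, hi in zip(q, lower, upper)]


def denormalize(x, lower, upper):
    """Inverse of normalize: map [-1, 1] back to joint positions."""
    return [lo + (xi + 1.0) * (hi - lo) / 2.0 for xi, lo, hi in zip(x, lower, upper)]
```

A robot-side consumer would apply `denormalize` with the target platform's own joint limits before sending commands.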

Imitation stage input: Dual egocentric RGB frames (480×640 pixels each) from head-mounted cameras with 90-degree horizontal field of view, plus proprioceptive state vector containing current joint positions, velocities, and torso IMU readings (6-axis accelerometer + gyroscope).

Imitation stage output: 33-dimensional joint-position command vector with 10-step action chunking. Commands are executed via PD controllers with gains tuned per joint (Kp=100-200, Kd=5-10 for body joints; Kp=50-80, Kd=2-5 for hands).
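A PD controller of the kind mentioned here turns each position command into a joint torque. The sketch below uses the textbook form with a zero target velocity; the specific gain values are picked from inside the quoted ranges purely for illustration, and `pd_torque` is a hypothetical name.

```python
def pd_torque(q_target, q, qd, kp, kd):
    """tau = Kp * (q_target - q) - Kd * qd, assuming a zero target velocity."""
    return kp * (q_target - q) - kd * qd


# Illustrative gains inside the ranges quoted in the text.
BODY_GAINS = {"kp": 150.0, "kd": 7.5}
HAND_GAINS = {"kp": 60.0, "kd": 3.0}

# A body joint 0.2 rad short of its target, moving at 0.1 rad/s.
tau_body = pd_torque(0.5, 0.3, 0.1, **BODY_GAINS)
```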

Control frequency is fixed at 30 Hz for both stages. The system does not support variable-rate control or asynchronous action execution. Camera timestamps must be synchronized to joint encoder readings within ±5ms to prevent temporal misalignment during training.
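The ±5 ms synchronization requirement can be checked by matching each camera timestamp to its nearest joint encoder timestamp. This is a sketch that assumes both timestamp lists are sorted and expressed in seconds; `max_sync_error_s` is an illustrative helper.

```python
from bisect import bisect_left


def max_sync_error_s(camera_ts, encoder_ts):
    """Largest gap between any camera timestamp and its nearest encoder timestamp."""
    worst = 0.0
    for t in camera_ts:
        i = bisect_left(encoder_ts, t)
        # The nearest encoder reading is either just before or just after t.
        neighbors = encoder_ts[max(i - 1, 0): i + 1]
        worst = max(worst, min(abs(t - e) for e in neighbors))
    return worst


encoder = [0.0, 1 / 30, 2 / 30]
camera = [0.002, 1 / 30 + 0.004, 2 / 30 - 0.001]
within_tolerance = max_sync_error_s(camera, encoder) <= 0.005
```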

Comparison with Related Humanoid Models

HumanPlus differs from RT-1 and RT-2 by targeting whole-body humanoid control rather than tabletop manipulation with 6-7 DoF arms. RT-1 uses 130,000 demonstrations across 700 tasks; HumanPlus achieves comparable per-task performance with 20-50 demonstrations by leveraging motion capture pretraining.

RoboCat and Open X-Embodiment aggregate data across robot morphologies but do not address humanoid-specific challenges like balance control, foot contact modeling, or whole-body retargeting. Open X-Embodiment's 1 million trajectories include zero humanoid episodes as of its October 2023 release[6].

NVIDIA's GR00T N1 and Figure's proprietary humanoid policies use similar two-stage architectures but train on 10-100× more teleoperation data collected via custom motion capture studios. GR00T N1 reports 500,000 humanoid trajectories across 40 tasks; HumanPlus demonstrates that motion capture pretraining reduces this requirement by an order of magnitude for single-task policies.

Unlike CALVIN or LIBERO, which operate in simulation with privileged state access, HumanPlus trains and evaluates on physical hardware without sim-to-real transfer. This eliminates reality gap issues but increases data collection costs and limits iteration speed.

Dataset Formats and Storage

HumanPlus stores trajectories as HDF5 files with hierarchical groups: `/observations/rgb_left`, `/observations/rgb_right`, `/observations/joint_pos`, `/actions`, and `/metadata`. RGB frames are stored as uint8 arrays with JPEG compression (quality=90) to reduce file size; a 40-second episode at 30 Hz with dual 480×640 cameras occupies ~120 MB.
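The storage figure can be sanity-checked with a little arithmetic. The sketch below takes the quoted episode length, frame rate, resolution, and ~120 MB size as given, and derives the implied per-frame JPEG size and compression ratio; it assumes 3 bytes per pixel for raw RGB and 1 MB = 1,000,000 bytes.

```python
def episode_frame_stats(duration_s=40, hz=30, cameras=2,
                        height=480, width=640, stored_mb=120):
    """Derive frame count, raw size, and implied compression from the quoted figures."""
    frames = duration_s * hz * cameras
    raw_bytes = frames * height * width * 3  # uncompressed 8-bit RGB
    stored_bytes = stored_mb * 1_000_000
    return {
        "frames": frames,
        "raw_gb": raw_bytes / 1e9,
        "kb_per_frame": stored_bytes / frames / 1000,
        "compression_ratio": raw_bytes / stored_bytes,
    }


stats = episode_frame_stats()
```

Under these assumptions the episode holds 2,400 frames, about 2.2 GB uncompressed, so ~120 MB implies roughly 50 KB per JPEG frame and a compression ratio near 18:1.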

Joint positions and actions are stored as float32 arrays with shape `(T, 33)` where T is episode length. Metadata includes camera intrinsics (focal length, principal point), extrinsics (rotation matrix, translation vector), and per-episode success labels. The format does not conform to RLDS or LeRobot dataset v3 schemas, requiring custom data loaders for integration with other frameworks.
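A delivered episode can be checked against this layout before it enters a training pipeline. The sketch below validates a mapping of dataset paths to shapes, and assumes the RGB datasets carry the episode length `T` as their first dimension. `check_shapes` and `shapes_from_hdf5` are hypothetical helpers; the latter relies on the third-party `h5py` package, imported lazily so the validator itself has no dependencies.

```python
DATASETS = (
    "observations/rgb_left",
    "observations/rgb_right",
    "observations/joint_pos",
    "actions",
)


def check_shapes(shapes, dof=33):
    """Return a list of problems for a {dataset_path: shape} mapping; empty means consistent."""
    problems = [f"missing {name}" for name in DATASETS if name not in shapes]
    if problems:
        return problems
    lengths = {name: shapes[name][0] for name in DATASETS}
    if len(set(lengths.values())) != 1:
        problems.append(f"episode lengths differ: {lengths}")
    for name in ("observations/joint_pos", "actions"):
        if tuple(shapes[name][1:]) != (dof,):
            problems.append(f"{name} has shape {shapes[name]}, expected (T, {dof})")
    return problems


def shapes_from_hdf5(path):
    """Collect dataset shapes from an episode file using h5py (third-party)."""
    import h5py

    with h5py.File(path, "r") as f:
        return {name: f[name].shape for name in DATASETS if name in f}
```

Typical use would be `check_shapes(shapes_from_hdf5("episode_0.hdf5"))`, where the file name is a placeholder.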

AMASS data is distributed as `.npz` files containing SMPL parameters (pose, shape, translation) and metadata (frame rate, subject ID, sequence name). Preprocessing scripts in the HumanPlus codebase render SMPL meshes to RGB frames using Blender with randomized lighting and camera poses, then save rendered frames as HDF5 datasets.

For procurement, buyers should specify HDF5 delivery with per-episode success labels, camera calibration files in OpenCV format, and SMPL parameter files for any human motion capture data. Data provenance documentation must include motion capture system specifications, operator demographics, and task success criteria to enable reproducibility audits.

Procurement Strategies for HumanPlus Data

Teams building HumanPlus-compatible datasets face three procurement paths. Path 1: License existing motion capture corpora (AMASS, CMU MoCap, Human3.6M) for shadowing pretraining, then collect 20-50 teleoperation demonstrations per task in-house. This minimizes upfront cost but requires motion capture infrastructure and trained operators.

Path 2: Commission custom motion capture data covering task-specific motions not present in AMASS (e.g., industrial assembly, warehouse logistics). Vendors like Claru and Silicon Valley Robotics Center offer motion capture services with SMPL output, but per-hour rates ($500-2,000) make large-scale collection expensive.

Path 3: Use truelabel's physical AI data marketplace to post bounties for teleoperation demonstrations on specific humanoid platforms. Collectors with Unitree H1 or Figure 02 access submit episodes meeting success criteria; buyers pay per accepted trajectory ($50-200 per episode depending on task complexity). This model scales to 100+ demonstrations without capital investment in robots or motion capture studios.
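The per-unit prices quoted for Paths 2 and 3 translate into budget ranges with simple multiplication. The sketch below is only that arithmetic; `cost_range` is an illustrative helper, and the quantities are examples rather than recommendations.

```python
def cost_range(units, low_per_unit, high_per_unit):
    """Return the (low, high) total cost in dollars for a number of billable units."""
    return units * low_per_unit, units * high_per_unit


# Path 3: 100 accepted episodes at $50-200 each.
bounty_100 = cost_range(100, 50, 200)
# Path 3: the 20-50 episodes per task that HumanPlus reports as sufficient.
bounty_min = cost_range(20, 50, 200)
bounty_max = cost_range(50, 50, 200)
# Path 2: 20 hours of commissioned motion capture at $500-2,000 per hour.
mocap_20h = cost_range(20, 500, 2000)
```

With these inputs, 100 bounty episodes cost $5,000-20,000, and 20 hours of commissioned motion capture cost $10,000-40,000.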

All paths require validation against HumanPlus's input specifications: 480×640 RGB at 30 Hz, 33-DoF joint trajectories, and synchronized timestamps within ±5ms. Buyers should request sample episodes before committing to large orders, verifying camera calibration accuracy and joint angle ranges match target hardware.

Limitations and Open Challenges

HumanPlus does not generalize across tasks; each skill requires dedicated fine-tuning on 20-50 demonstrations. This contrasts with language-conditioned policies like RT-2 or OpenVLA, which amortize data collection across hundreds of tasks via natural language instructions.

The system assumes static environments with fixed object poses. Dynamic scenes (moving obstacles, deformable objects, multi-agent coordination) are not addressed in the CoRL 2024 paper. Success rates drop below 40% when object positions vary by >10cm from training demonstrations[7].

Shadowing pretraining on AMASS improves sample efficiency but introduces human motion biases. Humanoid robots have different kinematic constraints (joint limits, actuator torque bounds) than human bodies; retargeting errors accumulate in contact-rich tasks like climbing or lifting heavy objects. Teams targeting industrial applications should collect robot-native teleoperation data rather than relying solely on human motion capture.

The dual egocentric camera setup increases hardware cost and calibration complexity. Single-camera variants sacrifice depth perception, reducing success rates by 15-20% on tasks requiring precise 3D localization (e.g., peg insertion, tool grasping)[8]. Scale AI's physical AI data engine and other vendors are developing single-camera policies with monocular depth estimation, but these remain research prototypes as of 2025.

Integration with Existing Robotics Stacks

HumanPlus policies run as standalone ROS 2 nodes, subscribing to `/camera/left/image_raw` and `/camera/right/image_raw` topics and publishing joint commands to `/joint_group_position_controller/command`. The codebase includes launch files for Unitree H1 but requires adaptation for other platforms (Figure 02, Tesla Optimus, Agility Digit).

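Integration with LeRobot or robomimic requires converting HumanPlus HDF5 trajectories into each framework's own layout: LeRobot datasets store tabular data as Parquet with video files for camera streams, and robomimic expects its own HDF5 structure. The HumanPlus repository does not provide conversion scripts; teams must implement custom data loaders mapping `/observations` and `/actions` groups to target schemas.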

For deployment, the policy checkpoint (PyTorch `.pth` file, ~200 MB) and visual encoder weights (ResNet-50, ~100 MB) must be loaded onto edge compute hardware. NVIDIA Jetson AGX Orin achieves 30 Hz inference with TensorRT optimization; lower-power platforms (Jetson Nano, Raspberry Pi) cannot meet real-time requirements.

MCAP files, including rosbag2 recordings that use MCAP storage, are not natively supported. Teams logging teleoperation data via ROS 2 bags must convert to HDF5 post-collection, preserving timestamp synchronization and camera calibration metadata.

Future Directions and Research Opportunities

The HumanPlus paper identifies three high-priority research directions. Multi-task policies: Extending the imitation transformer to handle language-conditioned or goal-image-conditioned control, enabling a single checkpoint to execute 10-50 skills. This requires collecting paired (demonstration, language) data and modifying the transformer architecture to accept text embeddings.

Sim-to-real transfer: Pretraining shadowing transformers on synthetic human motion from physics simulators (MuJoCo, Isaac Gym) rather than optical motion capture. This reduces data collection costs but introduces sim-to-real gaps in contact dynamics and visual appearance.

Online learning: Updating policies via on-robot experience collection, using success/failure labels from autonomous rollouts to fine-tune without additional teleoperation. This requires robust failure detection and safety constraints to prevent hardware damage during exploration.

NVIDIA's Cosmos world foundation models and similar video prediction systems may enable self-supervised pretraining on unlabeled egocentric video, reducing reliance on expensive motion capture data. However, as of March 2025, no published work demonstrates humanoid policy learning from world models without ground-truth action labels.

Procurement Checklist for Buyers

Teams procuring HumanPlus training data should verify:

  1. Motion capture data includes SMPL/SMPL-H parameters at ≥60 Hz with <5mm positional accuracy.
  2. Teleoperation demonstrations include dual egocentric RGB (480×640 at 30 Hz), 33-DoF joint trajectories, and binary success labels.
  3. Camera calibration files (intrinsics, extrinsics) are provided in OpenCV format.
  4. Episode lengths fall within 10-120 seconds; median length is 30-50 seconds.
  5. Datasets include ≥30 successful trajectories and ≥10 failure cases per task.
  6. HDF5 files follow HumanPlus schema: `/observations/rgb_left`, `/observations/rgb_right`, `/observations/joint_pos`, `/actions`, `/metadata`.
  7. Timestamps are synchronized within ±5ms across cameras and joint encoders.
  8. Data provenance documentation includes motion capture system specs, operator demographics, task success criteria, and licensing terms (commercial use, derivative works, redistribution).
  9. Sample episodes are provided for validation before full dataset delivery.
  10. Vendor offers format conversion support for integration with LeRobot, RLDS, or custom data loaders.

Use truelabel's marketplace to post bounties specifying these requirements and receive bids from qualified collectors within 48 hours.


External references and source context

  1. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv. Cited for: HumanPlus CoRL 2024 publication by Fu, Zhao, Wu, Wetzstein, and Finn.
  2. RoboNet: Large-Scale Multi-Robot Learning. arXiv. Cited for: AMASS contains 11,000 sequences across 300 subjects.
  3. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv. Cited for: 60% average success rate across HumanPlus benchmarks.
  4. RoboNet: Large-Scale Multi-Robot Learning. arXiv. Cited for: motion capture recorded at ≥60 Hz for dynamics preservation.
  5. BridgeData V2: A Dataset for Robot Learning at Scale. arXiv. Cited for: demonstration diversity requirements for policy convergence.
  6. Open X-Embodiment: Robotic Learning Datasets and RT-X Models. arXiv. Cited for: Open X-Embodiment contains 1 million trajectories as of October 2023.
  7. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv. Cited for: HumanPlus success rates drop below 40% with >10cm object position variation.
  8. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv. Cited for: single-camera variants reduce success rates by 15-20% on 3D localization tasks.

FAQ

How much motion capture data does HumanPlus require for shadowing pretraining?

The original HumanPlus paper uses 40 hours of AMASS motion capture data, corresponding to approximately 11,000 sequences across 300 subjects recorded at 120 Hz. Teams extending beyond AMASS coverage should target ≥20 hours of task-relevant human motion (household manipulation, industrial assembly, locomotion) to achieve comparable pretraining performance. Motion capture must be recorded at ≥60 Hz with <5mm positional accuracy and delivered in SMPL or SMPL-H parameterization. Markerless pose estimation systems (OpenPose, MediaPipe) are not recommended due to joint angle errors that compound during retargeting to 33-DoF humanoid skeletons.

Can HumanPlus policies generalize across multiple tasks without retraining?

No. HumanPlus is a single-task learning system; each skill (e.g., 'pick up cup', 'open door', 'fold towel') requires dedicated fine-tuning on 20-50 teleoperation demonstrations collected for that specific task. The system does not support language conditioning or goal-image conditioning, unlike RT-2 or OpenVLA. Multi-task generalization remains an open research problem identified in the CoRL 2024 paper. Teams requiring policies that execute 10+ skills from a single checkpoint should consider language-conditioned alternatives or collect paired (demonstration, language) data for future multi-task extensions.

What hardware is required to collect HumanPlus teleoperation data?

Teleoperation data collection requires: (1) a Unitree H1 or compatible 33-DoF humanoid robot, (2) dual head-mounted RGB cameras (480×640 resolution, 30 Hz, 90-degree FOV), (3) a full-body motion capture suit with <5mm positional accuracy (optical systems like Vicon or OptiTrack, or inertial systems like Xsens), (4) real-time retargeting software mapping human joint angles to robot commands with <50ms latency, and (5) edge compute hardware (NVIDIA Jetson AGX Orin or equivalent) for policy inference during validation. Total capital cost ranges from $150,000 (used Unitree H1 + inertial mocap) to $500,000+ (new robot + optical mocap studio). Truelabel's marketplace connects buyers with facilities offering pay-per-episode access to this infrastructure, eliminating upfront capital requirements.

How does HumanPlus compare to NVIDIA GR00T for humanoid control?

Both HumanPlus and NVIDIA GR00T N1 use two-stage architectures (motion prior pretraining + task-specific fine-tuning), but GR00T trains on 10-100× more teleoperation data. GR00T N1 reports 500,000 humanoid trajectories across 40 tasks collected via custom motion capture studios, while HumanPlus demonstrates that motion capture pretraining reduces per-task requirements to 20-50 demonstrations. NVIDIA released GR00T N1 as an open model with downloadable weights, and HumanPlus code and model weights are open-source on GitHub. For teams with limited data budgets (<1,000 demonstrations), HumanPlus offers a more accessible entry point. For production deployments requiring multi-task generalization and robustness to distribution shift, GR00T's scale advantages become critical.

What dataset formats does HumanPlus support?

HumanPlus uses custom HDF5 files with hierarchical groups for observations, actions, and metadata. The format does not conform to RLDS, LeRobot dataset v3, or other standardized schemas. RGB frames are stored as uint8 arrays with JPEG compression (quality=90); joint positions and actions are float32 arrays with shape (T, 33). Integration with LeRobot or robomimic requires custom data loaders mapping HumanPlus HDF5 groups to target schemas. The codebase does not support MCAP, ROS bags, or Parquet formats. Teams collecting data via ROS 2 must convert rosbag2 recordings to HDF5 post-collection, preserving timestamp synchronization and camera calibration metadata.

Where can I procure HumanPlus-compatible training data?

Three procurement options exist: (1) License existing motion capture corpora (AMASS, CMU MoCap) for shadowing pretraining, then collect 20-50 teleoperation demonstrations per task in-house. (2) Commission custom motion capture data from vendors like Claru or Silicon Valley Robotics Center, specifying SMPL output and task-relevant motions (per-hour rates: $500-2,000). (3) Post bounties on truelabel's physical AI data marketplace for teleoperation demonstrations on Unitree H1 or other humanoid platforms; collectors submit episodes meeting success criteria, and buyers pay per accepted trajectory ($50-200 per episode). All paths require validation against HumanPlus input specs: 480×640 RGB at 30 Hz, 33-DoF joint trajectories, synchronized timestamps within ±5ms, and camera calibration files in OpenCV format.
