Model Profile
HPT Training Data: Heterogeneous Pre-trained Transformers for Robot Learning
HPT is a modular transformer architecture pretrained on 52 heterogeneous robot datasets totaling 800,000 episodes and 93 million transitions. It uses embodiment-specific stems to tokenize diverse sensor inputs (RGB, depth, point clouds, proprioception) into a fixed 32-token sequence, processes them through a shared ViT trunk, and decodes actions via embodiment-specific heads. Pretraining on this mixture enables zero-shot transfer and sample-efficient fine-tuning across platforms with different action spaces and control frequencies.
Quick facts
- Model class: Model Profile
- Primary focus: HPT training data
- Last reviewed: 2025-05-15
What Is HPT and Why Heterogeneous Pretraining Matters
HPT (Heterogeneous Pre-trained Transformers) is a modular architecture developed by MIT CSAIL and Meta FAIR whose shared trunk is pretrained on 52 robot datasets spanning 800,000 episodes[1]. Unlike prior approaches that required identical observation and action spaces, HPT uses embodiment-specific stems to tokenize heterogeneous inputs—RGB images, depth maps, point clouds, proprioceptive state—into a uniform 32-token sequence fed to a shared Vision Transformer trunk. Embodiment-specific heads then decode actions for each platform's native control interface.
This design solves a critical data fragmentation problem in physical AI: most robot datasets are siloed by platform, sensor suite, and task domain. Open X-Embodiment demonstrated that pooling diverse datasets improves generalization, but required manual alignment of observation and action spaces. HPT automates that alignment through learned stems and heads, enabling pretraining on datasets as varied as BridgeData V2 (single-arm manipulation with wrist camera) and DROID (in-the-wild single-arm teleoperation with multiple camera viewpoints). The pretrained trunk captures cross-embodiment priors—object affordances, contact dynamics, spatial reasoning—that transfer to new platforms with minimal fine-tuning data[1].
For procurement teams, HPT's architecture means you can leverage a pretrained trunk and fine-tune on 50–2,000 demonstrations from your target robot, rather than collecting 10,000+ episodes from scratch. The key requirement: your fine-tuning data must include the metadata (URDF, sensor calibration, action space definition) needed to configure new stems and heads.
HPT Architecture: Stems, Trunk, and Heads
HPT's three-stage pipeline mirrors the tokenization-processing-decoding structure of large language models but adapts it for heterogeneous robot data. Stems are embodiment-specific encoders: a pretrained ViT-B/16 processes RGB images into 16 tokens, a smaller ViT handles depth, a PointNet-style encoder tokenizes point clouds, and an MLP maps proprioceptive state (joint positions, velocities, forces) to tokens. Each stem outputs a fixed number of tokens regardless of input resolution or DoF count, ensuring the trunk sees a consistent 32-token sequence per timestep[1].
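To make the stem contract concrete, here is a minimal PyTorch sketch of a proprioception stem (hypothetical class and dimension names, not the authors' implementation): the input width varies with the robot's DoF count, but the output is always a fixed number of tokens.

```python
import torch
import torch.nn as nn

class ProprioStem(nn.Module):
    """Maps a robot-specific proprioceptive vector to a fixed token count,
    so the shared trunk always sees the same sequence shape."""
    def __init__(self, num_dofs: int, num_tokens: int = 4, token_dim: int = 768):
        super().__init__()
        self.num_tokens, self.token_dim = num_tokens, token_dim
        self.mlp = nn.Sequential(
            nn.Linear(num_dofs, 256),
            nn.GELU(),
            nn.Linear(256, num_tokens * token_dim),
        )

    def forward(self, proprio: torch.Tensor) -> torch.Tensor:
        # proprio: (batch, num_dofs) -> tokens: (batch, num_tokens, token_dim)
        return self.mlp(proprio).view(-1, self.num_tokens, self.token_dim)

# A 7-DoF arm (positions + velocities) and a 23-DoF hand both emit 4 tokens:
arm_tokens = ProprioStem(num_dofs=14)(torch.randn(1, 14))    # (1, 4, 768)
hand_tokens = ProprioStem(num_dofs=46)(torch.randn(1, 46))   # (1, 4, 768)
```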
The trunk is a standard Vision Transformer with 12–24 layers (scaling experiments tested 10M to 1B parameters). It processes the concatenated stem tokens through self-attention, learning cross-modal and cross-embodiment representations. Critically, the trunk is frozen after pretraining—fine-tuning only updates the new stems and heads for your target robot, preserving the pretrained priors and preventing catastrophic forgetting.
Heads are embodiment-specific decoders that map trunk outputs to action spaces. A 7-DoF arm uses a 7-dimensional continuous head; a mobile manipulator adds a 3-DoF base velocity head; a gripper uses a binary classification head. HPT supports variable control frequencies by training heads to predict actions at each dataset's native rate (10 Hz for some teleoperation datasets, 50 Hz for others). During fine-tuning, you instantiate new stems and heads for your robot's sensor suite and action space, then train them on 50–2,000 episodes while keeping the trunk fixed. This modular design is why HPT achieves 20% higher success rates than training from scratch on low-data regimes[1].
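The freeze-and-swap workflow is simple to express in code. A minimal sketch, assuming your new stems and heads are ordinary PyTorch modules (function and argument names are illustrative):

```python
import torch

def prepare_for_finetuning(trunk, new_stems, new_heads, lr=1e-4):
    """Freeze the pretrained trunk; optimize only the new stems and heads.

    Gradients still flow *through* the frozen trunk to reach the stems,
    but its weights never change, preserving the pretrained priors."""
    for param in trunk.parameters():
        param.requires_grad = False
    trunk.eval()

    trainable = [p for module in (*new_stems, *new_heads)
                 for p in module.parameters()]
    return torch.optim.AdamW(trainable, lr=lr)
```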
Pretraining Dataset Mixture: 52 Datasets and 93 Million Transitions
HPT's pretrained trunk was trained on a mixture of 52 datasets totaling 800,000 episodes and 93 million state-action transitions[1]. The mixture includes single-arm datasets like BridgeData V2 (60,000 episodes of kitchen manipulation), bimanual datasets like ALOHA (650 episodes of high-precision assembly), in-the-wild manipulation from DROID (76,000 episodes across 564 scenes), and simulated datasets from RLBench (100 tasks with procedural variation). The mixture also incorporates egocentric video datasets like EPIC-KITCHENS-100 (90,000 action segments) to provide human priors for object interactions, though these contribute visual tokens only (no action labels).
Dataset weighting follows a square-root sampling strategy: each dataset contributes in proportion to the square root of its episode count, preventing large datasets from dominating the mixture while ensuring small high-quality datasets (like expert teleoperation) remain visible during training. This balances breadth (many embodiments) with depth (sufficient examples per embodiment to learn robust stems). The authors found that pretraining on 52 datasets outperformed pretraining on subsets of 10 or 25 datasets by 12–18% on held-out transfer tasks[1].
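Square-root sampling is easy to reproduce. The sketch below computes mixture weights for two of the datasets named above; the numbers show how the strategy compresses a roughly 92:1 episode-count ratio down to about 9.6:1.

```python
import math

def sqrt_sampling_weights(episode_counts: dict[str, int]) -> dict[str, float]:
    """Sampling probability proportional to sqrt(episode count)."""
    roots = {name: math.sqrt(n) for name, n in episode_counts.items()}
    total = sum(roots.values())
    return {name: r / total for name, r in roots.items()}

weights = sqrt_sampling_weights({"bridge_v2": 60_000, "aloha": 650})
# {'bridge_v2': ~0.906, 'aloha': ~0.094}, versus ~0.989/0.011 under
# proportional sampling, which would nearly drown out ALOHA.
```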
For buyers, this mixture composition has two implications. First, contributing a new dataset to an HPT pretraining run is most valuable when it adds a novel embodiment-sensor combination not well-represented in the existing 52. A quadruped with lidar and IMU, for instance, would expand the trunk's coverage more than another single-arm RGB dataset. Second, fine-tuning data quality matters more than quantity: 500 expert teleoperation episodes with clean action labels outperform 5,000 noisy autonomous rollouts.
Input and Output Specifications for HPT Fine-Tuning
Fine-tuning HPT on a new robot requires three data components: observations, actions, and metadata. Observations must include at least one visual modality (RGB, depth, or point cloud) and proprioceptive state (joint positions and velocities for all actuated DoFs). Multi-view RGB is common: a third-person camera captures workspace context, a wrist-mounted camera provides end-effector detail. Depth and point clouds are optional but improve performance on contact-rich tasks. All sensors must be time-synchronized to within 10 ms; MCAP or RLDS formats handle this natively.
Actions are recorded at the robot's native control frequency (10–50 Hz for most arms). HPT supports continuous action spaces (joint positions, joint velocities, end-effector deltas in SE(3)) and discrete spaces (gripper open/close, mode switches). Action labels must be causal: the action at timestep t is the command sent to the robot after observing state t, not the state achieved at t+1. This distinction matters for teleoperation data, where human operators react to visual feedback with 100–300 ms latency—naive timestamp alignment introduces a half-step lag that degrades policy performance by 8–15%[2].
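One way to enforce causal alignment during preprocessing, sketched under the assumption that observations and commands each carry their own timestamp arrays (function and variable names are illustrative):

```python
import numpy as np

def causal_align(obs_ts: np.ndarray, cmd_ts: np.ndarray,
                 commands: np.ndarray) -> np.ndarray:
    """Pair each observation with the first command issued at or after it,
    i.e. the command sent in response to that state, rather than the
    nearest-in-time command, which introduces a half-step lag."""
    idx = np.searchsorted(cmd_ts, obs_ts, side="left")
    idx = np.clip(idx, 0, len(cmd_ts) - 1)
    return commands[idx]
```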
Metadata includes the robot's URDF (for forward kinematics), camera intrinsics and extrinsics (for spatial reasoning), and action space bounds (for normalization). HPT's stems and heads are initialized using this metadata: the proprioception stem's input dimension matches the DoF count, the action head's output dimension matches the action space. Without accurate metadata, the policy will hallucinate invalid joint configurations or clip actions to incorrect bounds. Truelabel's marketplace enforces metadata schemas for all listed datasets, ensuring HPT-compatible formatting.
Scaling Laws: How Much Pretraining Data Does HPT Need?
HPT exhibits log-linear scaling: doubling the pretraining dataset count improves downstream task success rates by 3–5 percentage points, with diminishing returns above 40 datasets[1]. The authors tested trunks pretrained on 10, 25, and 52 datasets, then fine-tuned each on five held-out manipulation tasks with 500 demonstrations per task. The 52-dataset trunk achieved 78% average success; the 25-dataset trunk reached 72%; the 10-dataset trunk managed 65%. Extrapolating the curve suggests 100 datasets would yield 82% success, but the marginal gain per dataset drops below 0.5 points.
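The extrapolation above follows from fitting a log-linear curve to the three reported points; a quick check using only the success rates quoted in this section:

```python
import numpy as np

datasets = np.array([10, 25, 52])
success = np.array([65.0, 72.0, 78.0])   # reported fine-tuned success rates

# Fit success ~ a + b * log2(datasets); b is the gain per doubling.
b, a = np.polyfit(np.log2(datasets), success, deg=1)
print(f"{b:.1f} points per doubling")                   # ~5.5 over this range
print(f"at 100 datasets: {a + b * np.log2(100):.0f}%")  # ~83%, close to the
                                                        # ~82% quoted above
```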
Parameter count also scales predictably. HPT trunks range from 10M parameters (ViT-Tiny with 6 layers) to 1B parameters (ViT-Huge with 24 layers). On a fixed 52-dataset mixture, the 1B trunk outperformed the 10M trunk by 12 points on average, but required 8× more GPU-hours to pretrain and 4× more VRAM during fine-tuning. For most buyers, the 200M parameter ViT-Base trunk offers the best performance-per-dollar: it fine-tunes on a single A100 in 2–6 hours and matches the 1B trunk's performance on tasks with fewer than 1,000 fine-tuning episodes[1].
Data diversity matters more than volume. Adding a 53rd dataset with a novel sensor modality (thermal camera, tactile skin) improves generalization more than adding 10,000 episodes to an existing dataset. This is why Open X-Embodiment prioritized breadth over depth, collecting 1,000–5,000 episodes per embodiment across 22 robots rather than 50,000 episodes on a single platform. For procurement, this means: if you're contributing data to an HPT pretraining consortium, prioritize unique embodiment-sensor pairs over episode count.
Comparison with RT-1, RT-2, and OpenVLA
HPT, RT-1, RT-2, and OpenVLA all use transformer trunks for robot learning, but differ in how they handle heterogeneity. RT-1 trained on 130,000 episodes from a single robot platform (a mobile manipulator with fixed RGB cameras), achieving 97% success on its training distribution but failing at zero-shot transfer to new robots. RT-2 added vision-language pretraining by initializing the trunk with PaLI-X weights, enabling semantic generalization ("pick up the extinct animal"), but it still required embodiment-specific data collection—no cross-robot transfer.
OpenVLA pretrained on Open X-Embodiment's 970,000 episodes across 22 robots, using a shared action tokenizer that discretized all action spaces into 256 bins. This enabled zero-shot transfer to new robots if their action spaces fit the 256-bin vocabulary, but failed on high-DoF systems (humanoids, dexterous hands) where discretization loses precision. OpenVLA also required all datasets to use RGB images at 224×224 resolution—depth and point clouds were discarded.
HPT's modular stems and heads remove these constraints. It handles variable action spaces (7-DoF arms, 23-DoF hands, 3-DoF mobile bases) without discretization, processes multi-modal observations (RGB + depth + proprioception) natively, and supports variable control frequencies (10 Hz teleoperation, 50 Hz torque control). The trade-off: HPT requires implementing new stems and heads for each embodiment, whereas OpenVLA's shared tokenizer is plug-and-play if your robot fits the 256-bin action vocabulary. For buyers with custom hardware or high-precision tasks, HPT's flexibility justifies the integration cost.
Fine-Tuning Data Requirements: 50 to 2,000 Episodes
HPT's pretrained trunk enables sample-efficient fine-tuning, but the exact episode count depends on task complexity and data quality. The authors report that 50 expert teleoperation episodes suffice for simple pick-and-place tasks (single object, fixed start pose), 200–500 episodes handle moderate variation (multiple objects, randomized poses), and 1,000–2,000 episodes are needed for contact-rich assembly or long-horizon tasks[1]. These numbers assume expert demonstrations—human teleoperation or scripted controllers that succeed >90% of the time. Autonomous rollouts with 50–70% success rates require 3–5× more episodes to achieve the same policy performance.
Data quality trumps quantity. A 200-episode dataset with clean action labels, accurate timestamps, and minimal occlusions outperforms a 1,000-episode dataset with label noise, timestamp jitter, or motion blur. ALOHA demonstrated this with bimanual insertion tasks: 650 high-quality episodes (recorded at 50 Hz with sub-millimeter position accuracy) achieved 85% success, while a baseline dataset of 2,000 episodes recorded at 10 Hz with ±2mm noise reached only 68%[2].
For procurement, this means: budget for expert data collection (human teleoperators with task-specific training) rather than autonomous exploration. Truelabel's marketplace lists teleoperation datasets with per-episode success labels, action-label accuracy metrics, and timestamp precision guarantees. Filtering for >90% success rate and <5ms timestamp jitter reduces fine-tuning data needs by 40–60% compared to unvetted datasets.
Multi-Modal Sensor Requirements: RGB, Depth, and Proprioception
HPT's stems support four sensor modalities: RGB images, depth maps, point clouds, and proprioceptive state. At minimum, fine-tuning requires RGB (single or multi-view) plus proprioception (joint positions and velocities). Depth and point clouds are optional but improve performance on tasks requiring precise 3D reasoning—insertion, stacking, deformable object manipulation.
RGB images should be 224×224 or 256×256 pixels (matching the pretrained ViT's input size) at 10–30 Hz. Multi-view setups are common: a third-person camera at 1–2 meters captures workspace context, a wrist-mounted camera provides end-effector detail. The pretrained ViT stem processes each view independently, then concatenates the token sequences. Lighting must be consistent across episodes; domain randomization (varying brightness, contrast, hue) during fine-tuning improves robustness but requires 20–30% more episodes to converge[3].
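A typical photometric randomization pipeline for fine-tuning, using torchvision; the jitter ranges are illustrative, not values from the HPT paper:

```python
import torchvision.transforms as T

randomize = T.Compose([
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2, hue=0.05),
    T.Resize((224, 224)),  # match the pretrained ViT input resolution
])

# augmented = randomize(frame)  # frame: PIL Image or (C, H, W) tensor
```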
Depth maps (from RealSense, Kinect, or stereo rigs) are encoded by a smaller ViT stem into 8 tokens. Point clouds (from lidar or reconstructed from depth) use a PointNet-style encoder that samples 1,024–2,048 points per frame. Proprioceptive state includes joint positions, velocities, and optionally torques/forces for all actuated DoFs. An MLP stem maps this vector to 4–8 tokens. All modalities must be time-synchronized to within 10 ms—MCAP and RLDS formats handle this via shared timestamps.
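A simple pre-ingestion check for the 10 ms budget, assuming each modality exposes a per-frame timestamp array of equal length (names illustrative):

```python
import numpy as np

def max_sync_jitter_ms(streams: dict[str, np.ndarray]) -> float:
    """Worst-case cross-modality timestamp spread at any frame index."""
    stamps = np.stack(list(streams.values()))         # (n_streams, n_frames)
    spread = stamps.max(axis=0) - stamps.min(axis=0)  # seconds, per frame
    return float(spread.max() * 1000.0)

# Flag episodes that exceed the budget before they reach training:
# max_sync_jitter_ms({"rgb": rgb_ts, "depth": depth_ts, "proprio": joint_ts}) > 10.0
```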
For buyers, the sensor trade-off is cost versus performance. RGB-only setups (two cameras, $400 total) achieve 70–80% success on tabletop manipulation. Adding depth ($200 RealSense) lifts success to 80–88% on contact-rich tasks. Adding lidar ($2,000–$8,000) is justified only for navigation or outdoor manipulation where depth cameras fail in sunlight.
Action Space Encoding: Continuous, Discrete, and Hybrid
HPT's action heads decode trunk outputs into embodiment-specific action spaces. Continuous spaces (joint positions, joint velocities, end-effector deltas) use a linear layer followed by tanh activation to bound outputs. For a 7-DoF arm, the head outputs a 7-dimensional vector; for a mobile manipulator, it outputs 10 dimensions (7 arm + 3 base). Actions are normalized to [-1, 1] during training using the bounds from the robot's URDF, then denormalized at inference.
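A minimal sketch of such a bounded continuous head (hypothetical names; the bounds would come from the robot's URDF):

```python
import torch
import torch.nn as nn

class ContinuousHead(nn.Module):
    """Linear projection, tanh-bounded to [-1, 1], denormalized to limits."""
    def __init__(self, trunk_dim: int, low: torch.Tensor, high: torch.Tensor):
        super().__init__()
        self.proj = nn.Linear(trunk_dim, low.numel())
        self.register_buffer("low", low)
        self.register_buffer("high", high)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        normed = torch.tanh(self.proj(features))                  # [-1, 1]
        return self.low + (normed + 1.0) * 0.5 * (self.high - self.low)

# 7-DoF head with illustrative symmetric joint limits:
head = ContinuousHead(768, low=torch.full((7,), -3.0),
                      high=torch.full((7,), 3.0))
```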
Discrete spaces (gripper open/close, mode switches) use a classification head with softmax. A binary gripper uses a 2-class head; a three-finger gripper with independent control uses three 2-class heads. Hybrid spaces combine continuous and discrete: a 7-DoF arm with gripper uses a 7-dimensional continuous head plus a 2-class discrete head, trained with separate loss terms (MSE for continuous, cross-entropy for discrete).
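The two heads are trained jointly with separate loss terms; a sketch (the mixing weight is a hyperparameter, not a value from the paper):

```python
import torch.nn.functional as F

def hybrid_loss(arm_pred, arm_target, gripper_logits, gripper_label,
                discrete_weight: float = 0.1):
    """MSE for the continuous arm head, cross-entropy for the gripper."""
    continuous = F.mse_loss(arm_pred, arm_target)
    discrete = F.cross_entropy(gripper_logits, gripper_label)
    return continuous + discrete_weight * discrete
```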
Control frequency varies by dataset. ALOHA records actions at 50 Hz for high-precision insertion; BridgeData V2 uses 10 Hz for tabletop manipulation. HPT trains separate heads for each frequency rather than resampling data, preserving the temporal resolution of fast datasets. At inference, you run the policy at your robot's native control rate—if your hardware supports 50 Hz but your fine-tuning data was 10 Hz, the policy will output the same action for five consecutive timesteps (zero-order hold).
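The zero-order hold amounts to a one-line upsampling step:

```python
import numpy as np

def zero_order_hold(actions: np.ndarray, source_hz: int, target_hz: int) -> np.ndarray:
    """Repeat each action so a low-rate policy drives a faster control loop
    (assumes target_hz is an integer multiple of source_hz)."""
    assert target_hz % source_hz == 0
    return np.repeat(actions, target_hz // source_hz, axis=0)

actions_10hz = np.zeros((100, 7))  # 10 seconds of 7-DoF actions at 10 Hz
actions_50hz = zero_order_hold(actions_10hz, source_hz=10, target_hz=50)
```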
For procurement, ensure your dataset's action labels match the control interface you'll deploy. If you're buying teleoperation data recorded in joint-velocity mode but plan to deploy a position-controlled policy, you'll need to integrate the velocities during preprocessing—a step that introduces numerical drift if not done carefully. Truelabel's marketplace tags datasets by control mode (position, velocity, torque, end-effector) to avoid this mismatch.
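If you must do that conversion, the naive approach is Euler integration; a sketch showing where drift enters (helper names are hypothetical):

```python
import numpy as np

def velocities_to_positions(q0: np.ndarray, qdot: np.ndarray,
                            dt: float) -> np.ndarray:
    """Euler-integrate joint-velocity commands into position targets.
    Integration error accumulates over the trajectory, so re-anchor to the
    recorded joint positions periodically if the dataset includes them."""
    return q0 + np.cumsum(qdot * dt, axis=0)
```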
Metadata Requirements: URDF, Calibration, and Action Bounds
HPT's stems and heads are initialized using robot-specific metadata. The URDF (Unified Robot Description Format) defines the kinematic chain: link lengths, joint axes, mass properties. HPT uses the URDF to initialize the proprioception stem's input dimension (number of joints) and to compute forward kinematics for spatial reasoning tasks. If your URDF is inaccurate—wrong link lengths, missing collision geometry—the policy will plan motions that collide with the environment or reach for objects at incorrect positions.
Camera calibration provides intrinsics (focal length, principal point, distortion coefficients) and extrinsics (rotation and translation from robot base to camera frame). HPT's vision stem uses intrinsics to undistort images before feeding them to the ViT; extrinsics enable the trunk to reason about 3D spatial relationships between the end-effector and observed objects. Calibration errors of ±5 pixels degrade pick success rates by 10–15% on tasks requiring sub-centimeter precision[2].
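A typical undistortion step with OpenCV; the file and node names below follow common calibration-export conventions and are assumptions, not a fixed HPT interface:

```python
import cv2

# Load intrinsics and distortion coefficients from an OpenCV YAML export.
fs = cv2.FileStorage("camera_calib.yaml", cv2.FILE_STORAGE_READ)
K = fs.getNode("camera_matrix").mat()
dist = fs.getNode("distortion_coefficients").mat()
fs.release()

image = cv2.imread("frame_000123.png")
undistorted = cv2.undistort(image, K, dist)  # input to the vision stem
```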
Action bounds define the valid range for each DoF: joint limits from the URDF, velocity limits from motor specs, torque limits from safety constraints. HPT normalizes actions to [-1, 1] during training using these bounds, then denormalizes at inference. If you provide incorrect bounds—e.g., listing a joint limit as ±180° when the physical range is ±90°—the policy will command out-of-range actions that the robot's safety controller will clip, causing jerky motion and task failures.
For procurement, request metadata alongside trajectory data. Truelabel's marketplace enforces a metadata schema: every dataset includes a URDF, camera calibration files (OpenCV YAML or ROS CameraInfo), and action-space JSON defining bounds and control mode. Datasets missing metadata are flagged as incomplete and excluded from HPT-compatible listings.
Pretraining Consortium Model: Pooling Datasets Across Buyers
HPT's architecture enables a pretraining consortium model: multiple buyers contribute datasets to a shared pretraining run, then each fine-tunes the resulting trunk on their proprietary data. This amortizes the cost of pretraining (200–800 GPU-days for a 200M parameter trunk on 52 datasets) across participants while preserving competitive differentiation—your fine-tuning data and task-specific heads remain private.
The consortium model works because HPT's trunk learns cross-embodiment priors (object affordances, contact dynamics, spatial reasoning) that transfer across tasks, while task-specific knowledge resides in the heads. A trunk pretrained on kitchen manipulation, warehouse picking, and assembly tasks will improve performance on all three domains, even though no single buyer contributed data for all three. The key requirement: dataset diversity. A consortium of five buyers each contributing 10,000 episodes from the same robot platform yields minimal benefit; a consortium of five buyers each contributing 2,000 episodes from different platforms (single-arm, bimanual, mobile manipulator, quadruped, humanoid) produces a trunk that generalizes better than any single-buyer dataset.
Scale AI and NVIDIA Cosmos are piloting consortium models for physical AI pretraining, pooling datasets from automotive OEMs, warehouse automation vendors, and research labs. Participants contribute data under NDA, receive access to the pretrained trunk, and retain exclusive rights to their fine-tuning data. For buyers, the trade-off is coordination overhead (aligning data formats, metadata schemas, legal terms) versus the 10–50× cost reduction compared to solo pretraining.
Licensing and Provenance for HPT Training Data
HPT pretraining mixes academic datasets (released under permissive licenses like CC-BY-4.0), proprietary datasets (contributed under consortium agreements), and synthetic data (generated in simulation with no real-world PII). Fine-tuning data is typically proprietary—collected by the buyer for their specific robot and task. Licensing clarity matters because model weights inherit dataset licenses: if you pretrain on a dataset with a non-commercial restriction, your fine-tuned policy may be non-commercial too, blocking deployment in production systems.
Academic datasets like BridgeData V2 and DROID use CC-BY-4.0, which permits commercial use with attribution. EPIC-KITCHENS-100 uses a custom license that allows research use but requires separate negotiation for commercial deployment. Synthetic datasets from RLBench or robosuite are typically unrestricted, but check whether the simulation assets (3D models, textures) have their own licenses—some CAD models prohibit redistribution even if the trajectory data is open.
Data provenance tracking is critical for consortium models. Each dataset in the pretraining mixture should include a manifest listing: source organization, collection date, sensor specs, annotator pool (human teleoperators, autonomous agents, simulation), and any known biases (e.g., all episodes collected in the same lab lighting). Truelabel's marketplace enforces provenance schemas and flags datasets with unclear licensing, reducing legal risk for buyers building commercial systems.
Benchmarking HPT: Success Rates and Compute Costs
HPT's authors benchmarked the pretrained trunk on five manipulation tasks: pick-and-place, stacking, insertion, drawer opening, and cloth folding. Fine-tuning on 500 demonstrations per task, the 200M parameter trunk achieved 78% average success rate, compared to 58% for a trunk trained from scratch and 65% for a trunk pretrained on Open X-Embodiment alone[1]. The performance gap widened on low-data regimes: with only 50 demonstrations, HPT reached 62% success versus 38% from scratch.
Compute costs for pretraining scale with trunk size and dataset count. A 200M parameter trunk on 52 datasets (800,000 episodes) required 400 A100-days, costing $80,000–$120,000 at cloud rates. A 1B parameter trunk required 1,200 A100-days ($240,000–$360,000). Fine-tuning is far cheaper: 500 episodes on a single task took 4–8 A100-hours ($35–$100 at the same rates), and the trunk remained frozen so VRAM usage stayed under 40 GB (fitting on a single A100).
For buyers, the cost-benefit calculation depends on your data budget. If you can collect 10,000+ episodes for your task, training from scratch may be cheaper than licensing a pretrained trunk. If you're limited to 500–2,000 episodes (common for complex tasks or expensive hardware), HPT's pretraining amortizes the compute cost across all downstream tasks. A consortium of 10 buyers splitting the $100,000 pretraining cost pays $10,000 each—far less than the $50,000–$200,000 cost of collecting 10,000 proprietary episodes.
Integration with LeRobot and Hugging Face Ecosystem
HPT is compatible with the LeRobot framework, which provides dataset loaders, training scripts, and evaluation tools for robot learning. LeRobot's dataset format is a superset of RLDS, storing episodes as Parquet files with HDF5 chunks for images and point clouds. To use HPT with LeRobot, you implement custom stems and heads as PyTorch modules, then register them in LeRobot's model zoo. The pretrained trunk weights are loaded from Hugging Face Hub, and fine-tuning uses LeRobot's standard training loop with embodiment-specific data loaders.
LeRobot's dataset catalog includes 15+ robot datasets (ALOHA, BridgeData V2, DROID, RoboNet) with standardized metadata: URDF files, camera calibration, action-space definitions. Each dataset has a Hugging Face dataset card listing episode count, success rate, sensor specs, and licensing terms. For HPT users, this means you can prototype fine-tuning on public datasets before collecting proprietary data—validate that your stems and heads work, tune hyperparameters, estimate episode requirements.
The Hugging Face ecosystem also simplifies model sharing. After fine-tuning HPT on your proprietary data, you can upload the new stems and heads to a private Hugging Face repo (keeping the trunk weights as a reference to the public pretrained checkpoint). Your deployment pipeline then pulls the stems/heads from your private repo and the trunk from the public repo, avoiding the need to store or transfer the full 200M–1B parameter trunk. For buyers with air-gapped production environments, this reduces the artifact size from 800 MB (full model) to 50 MB (stems + heads only).
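In practice the split looks like two hub pulls at deployment time; the repo and file names below are placeholders, not published checkpoint paths:

```python
import torch
from huggingface_hub import hf_hub_download

# Shared trunk from the public pretrained checkpoint (placeholder repo id):
trunk_path = hf_hub_download(repo_id="public-org/hpt-base",
                             filename="trunk.pth")
trunk_state = torch.load(trunk_path, map_location="cpu")

# Fine-tuned stems and heads (~50 MB) from your private repo:
adapters_path = hf_hub_download(repo_id="your-org/hpt-myrobot-adapters",
                                filename="stems_heads.pth")
adapter_state = torch.load(adapters_path, map_location="cpu")
```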
Sourcing HPT-Compatible Datasets on Truelabel
Truelabel's physical AI data marketplace lists 200+ robot datasets, with 40+ tagged as HPT-compatible. Compatibility criteria include: multi-modal observations (RGB + proprioception minimum), time-synchronized sensor streams (<10 ms jitter), action labels at native control frequency, and complete metadata (URDF, calibration, action bounds). Each dataset page shows episode count, success rate distribution, sensor specs, and licensing terms.
For pretraining, prioritize datasets with novel embodiment-sensor combinations. If the existing 52-dataset HPT mixture includes five single-arm datasets with wrist cameras, adding a sixth single-arm wrist-camera dataset yields minimal benefit. Instead, look for: bimanual systems, mobile manipulators, quadrupeds, humanoids, or single-arm systems with unusual sensors (thermal, tactile, force/torque). Truelabel's search filters let you query by robot type, sensor modality, and task domain.
For fine-tuning, prioritize high-quality teleoperation data with >90% success rate. Truelabel's dataset pages include per-episode success labels (verified by the collector) and action-label accuracy metrics (measured by replaying actions in simulation). Datasets with <5 ms timestamp jitter and <1% action-label error reduce fine-tuning episode requirements by 40–60% compared to unvetted datasets. You can request custom data collection for your specific robot and task—typical lead time is 4–8 weeks for 500–2,000 episodes, with pricing starting at $50–$150 per episode depending on task complexity.
External references and source context
- [1] Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers (arXiv): HPT architecture, pretraining on 52 datasets with 800,000 episodes and 93 million transitions, scaling laws, and fine-tuning episode requirements.
- [2] Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ALOHA; tonyzhaozh.github.io): teleoperation dataset with 650 high-precision bimanual episodes recorded at 50 Hz.
- [3] Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World (arXiv): domain randomization techniques for sim-to-real transfer requiring 20–30% more training data.
FAQ
How many episodes do I need to fine-tune HPT on a new robot?
50–200 episodes suffice for simple pick-and-place tasks with expert teleoperation data (>90% success rate). Moderate-variation tasks (multiple objects, randomized poses) require 200–500 episodes. Contact-rich assembly or long-horizon tasks need 1,000–2,000 episodes. These numbers assume clean action labels, accurate timestamps (<5 ms jitter), and multi-modal observations (RGB + proprioception minimum). Autonomous rollouts with 50–70% success rates require 3–5× more episodes to achieve equivalent policy performance.
Can I use HPT with depth cameras or point clouds, or only RGB?
HPT natively supports RGB images, depth maps, point clouds, and proprioceptive state through embodiment-specific stems. At minimum, you need RGB (single or multi-view) plus proprioception. Adding depth improves performance on contact-rich tasks by 8–12 percentage points. Point clouds (from lidar or reconstructed depth) are useful for navigation and outdoor manipulation. All modalities must be time-synchronized to within 10 ms—MCAP and RLDS formats handle this via shared timestamps.
What metadata do I need to provide alongside trajectory data for HPT?
You must provide: (1) URDF defining the robot's kinematic chain, joint limits, and mass properties; (2) camera calibration files with intrinsics (focal length, distortion) and extrinsics (pose relative to robot base); (3) action-space definition listing bounds, control mode (position/velocity/torque), and dimensionality. HPT uses the URDF to initialize the proprioception stem, calibration to undistort images and reason about 3D space, and action bounds to normalize outputs. Inaccurate metadata causes the policy to plan invalid motions or reach for objects at wrong positions.
How does HPT compare to OpenVLA for cross-robot transfer?
HPT uses modular stems and heads to handle heterogeneous action spaces, sensor modalities, and control frequencies without discretization. OpenVLA uses a shared 256-bin action tokenizer, enabling plug-and-play transfer if your robot fits the vocabulary but losing precision on high-DoF systems (humanoids, dexterous hands). HPT handles variable action spaces (7-DoF arms, 23-DoF hands, mobile bases) and multi-modal observations (RGB + depth + proprioception) natively, but requires implementing new stems and heads for each embodiment. For custom hardware or high-precision tasks, HPT's flexibility justifies the integration cost.
Can I pretrain HPT on my own dataset mixture, or must I use the published 52-dataset checkpoint?
You can pretrain from scratch on a custom mixture if you have 20+ diverse datasets (different robots, sensors, tasks) and 200–800 GPU-days of compute budget. The published 52-dataset checkpoint is a starting point—you can continue pretraining by adding new datasets to the mixture and resuming training for 50,000–100,000 additional steps. For most buyers, fine-tuning the published checkpoint is more cost-effective than solo pretraining unless you have proprietary data that covers embodiment-sensor combinations absent from the original 52.
What licensing terms apply to models fine-tuned on HPT's pretrained trunk?
Model weights inherit the licenses of all datasets in the pretraining mixture. HPT's published checkpoint was pretrained on a mix of CC-BY-4.0 academic datasets (BridgeData V2, DROID), custom-license datasets (EPIC-KITCHENS-100 requires separate commercial negotiation), and unrestricted synthetic data (RLBench). If you fine-tune on proprietary data, your stems and heads are yours, but the trunk weights carry the upstream licenses. For commercial deployment, audit the 52-dataset mixture for non-commercial restrictions and negotiate licenses where needed. Truelabel's marketplace flags datasets with unclear licensing to reduce legal risk.
Looking for HPT training data?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.
Browse HPT-Compatible Datasets