
Vision-Language-Action Model

Pi-0.5 Training Data Requirements & Multi-Embodiment Dataset Specifications

Pi-0.5 is Physical Intelligence's 3-billion-parameter vision-language-action model released February 2025, trained on over 10,000 robot demonstration hours across 24 embodiments. It uses FAST action tokenization to convert continuous 7–24 DoF trajectories into discrete tokens, accepts multi-view RGB at 50 Hz plus proprioceptive state, and predicts 50-step action chunks per forward pass. Training requires hardware-synchronized camera streams, smooth teleoperation trajectories that yield clean FAST codebooks, and natural-language task instructions—often with chain-of-thought reasoning annotations for long-horizon tasks.

Updated 2025-06-15
By truelabel
Reviewed by truelabel
pi-0.5 training data

Quick facts

Model class
Vision-Language-Action Model
Primary focus
pi-0.5 training data
Last reviewed
2025-06-15

What Is Pi-0.5 and Why It Matters for Physical AI

Pi-0.5 is a 3-billion-parameter vision-language-action (VLA) model developed by Physical Intelligence, released in February 2025 as the successor to pi-zero. Unlike prior generalist policies that rely on flow-matching or diffusion, pi-0.5 adopts FAST (Factorized Action Sequence Tokenization) to discretize continuous robot actions into a learned codebook, enabling autoregressive next-token prediction at web scale. This architectural shift allows pi-0.5 to leverage the same pre-training infrastructure as large language models, ingesting both internet vision-language data and proprietary robot demonstrations in a unified training loop.

Physical Intelligence reports training pi-0.5 on over 10,000 hours of robot interaction data spanning 24 distinct embodiments—from single-arm Franka Panda setups to bi-manual ALOHA systems and mobile manipulators[1]. The model accepts multi-view RGB observations (up to three cameras: one primary third-person view plus optional left and right wrist cameras), proprioceptive state vectors (joint positions, velocities, gripper status), and natural-language task instructions. At inference, pi-0.5 predicts 50-step action chunks at 50 Hz control frequency, balancing reactivity with trajectory smoothness.

For organizations building on pi-0.5, the core procurement challenge is assembling multi-embodiment demonstration datasets that satisfy FAST's tokenization constraints: hardware-synchronized camera streams, smooth continuous trajectories free of jitter or mode-collapse artifacts, and task-aligned language annotations. Truelabel's physical-AI data marketplace connects buyers with collectors who operate calibrated teleoperation rigs, deliver frame-accurate multi-view RGB at 50 Hz, and annotate reasoning traces for long-horizon tasks—eliminating the months-long lag between model release and dataset availability.

Pi-0.5 Input and Output Specification

Pi-0.5's observation space comprises multi-view RGB images and proprioceptive state vectors. The visual input accepts up to three synchronized camera streams: a primary third-person view capturing the workspace and optional left/right wrist-mounted cameras for close-up manipulation feedback. All RGB frames are captured at 50 Hz and resized to a standard resolution (typically 224×224 or 256×256) before being passed to the PaLI-3 ViT visual encoder. Proprioceptive state includes joint positions, joint velocities, and binary gripper status (open/closed), concatenated into a single vector and embedded via learned projection layers[1].

The action space is continuous and embodiment-specific, ranging from 7 DoF for single-arm setups (6 Cartesian pose dimensions plus 1 gripper) to 24 DoF for bi-manual mobile platforms. Pi-0.5 does not output raw continuous actions directly; instead, it uses FAST tokenization to map action trajectories into a discrete codebook learned via vector quantization during pre-training. At inference, the model autoregressively predicts 50 discrete action tokens per forward pass, which are then decoded back into continuous control commands and executed at 50 Hz. This chunked prediction strategy reduces compounding error compared to single-step policies while maintaining real-time responsiveness.
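
The following minimal sketch summarizes these input and output shapes for a single-arm setup; the dictionary keys and example instruction are illustrative placeholders, not pi-0.5's actual API.

```python
import numpy as np

# Hypothetical shapes matching the specification above (illustrative only).
CONTROL_HZ = 50          # control frequency
CHUNK_LEN = 50           # action steps predicted per forward pass
ACTION_DIM = 7           # single-arm: 6 Cartesian pose dims + 1 gripper

observation = {
    # three synchronized RGB views, resized before the vision encoder
    "image_primary":     np.zeros((224, 224, 3), dtype=np.uint8),
    "image_wrist_left":  np.zeros((224, 224, 3), dtype=np.uint8),
    "image_wrist_right": np.zeros((224, 224, 3), dtype=np.uint8),
    # proprioceptive state: 7 joint positions + 7 joint velocities + 1 gripper
    "proprio": np.zeros(15, dtype=np.float32),
    "instruction": "pick up the mug and place it on the shelf",
}

# One forward pass yields a 50-step action chunk that is decoded from discrete
# tokens and executed at 50 Hz, i.e. one second of control per prediction.
action_chunk = np.zeros((CHUNK_LEN, ACTION_DIM), dtype=np.float32)
assert CHUNK_LEN / CONTROL_HZ == 1.0  # chunk spans exactly one second
```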

Language instructions are encoded as natural-language strings, optionally augmented with chain-of-thought reasoning traces for multi-stage tasks (e.g., a top-level command such as "set the table" paired with intermediate sub-goal descriptions). These hierarchical annotations are covered in detail in the long-horizon section below.

FAST Action Tokenization and Codebook Requirements

FAST (Factorized Action Sequence Tokenization) is the core innovation enabling pi-0.5 to scale VLA training to web-scale datasets. Traditional robot policies output continuous actions via regression heads or diffusion denoising; FAST instead quantizes action trajectories into discrete tokens using a learned vector-quantized variational autoencoder (VQ-VAE). During pre-training, the model learns a codebook of 256–1024 action prototypes (the exact size is embodiment-dependent) that capture common motion primitives—reach, grasp, retract, place—across the training distribution. At inference, the transformer predicts a sequence of codebook indices, which are decoded into smooth continuous trajectories via the VQ-VAE's decoder.

For dataset providers, FAST imposes strict trajectory smoothness requirements. Jerky or mode-switching teleoperation data produces codebook entries that do not generalize, leading to execution failures when the model encounters out-of-distribution workspace configurations. Truelabel collectors use ALOHA-style leader-follower rigs with gravity compensation and admittance control, ensuring smooth 50 Hz action streams that yield high-fidelity FAST tokens. We also pre-compute codebook statistics (mean, variance, token frequency) for each embodiment in our catalog, allowing buyers to verify codebook coverage before committing to fine-tuning runs.
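
As one plausible way to inspect the codebook statistics mentioned above, the sketch below computes token frequency, coverage, and entropy from already-tokenized episodes; the tokenizer interface and the exact set of statistics shipped with a dataset are assumptions here.

```python
import numpy as np

def codebook_statistics(token_sequences, codebook_size=256):
    """Summarize how a set of tokenized episodes covers the action codebook.

    token_sequences: list of 1-D integer arrays of FAST token indices, one per
    episode; all indices are assumed to lie in [0, codebook_size).
    """
    counts = np.zeros(codebook_size, dtype=np.int64)
    for seq in token_sequences:
        counts += np.bincount(np.asarray(seq), minlength=codebook_size)

    freq = counts / max(counts.sum(), 1)
    nonzero = freq[freq > 0]
    return {
        "token_frequency": freq.tolist(),
        "coverage": float((counts > 0).mean()),          # fraction of entries ever used
        "entropy_bits": float(-(nonzero * np.log2(nonzero)).sum()),
    }

# Example with two toy token sequences; real inputs come from tokenized episodes.
stats = codebook_statistics([np.array([3, 3, 17, 42]), np.array([3, 17, 17, 200])])
print(f"coverage={stats['coverage']:.3f}, entropy={stats['entropy_bits']:.2f} bits")
```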

Another critical detail: FAST tokenization is embodiment-aware but not embodiment-specific. Pi-0.5's codebook is trained on a mixture of all 24 embodiments in Physical Intelligence's proprietary dataset, so the same token index may correspond to different absolute joint angles on a Franka Panda versus a UR5[1]. Dataset buyers must either fine-tune on a single embodiment (collapsing the codebook to that robot's action space) or provide embodiment ID embeddings during training to preserve multi-embodiment generalization. Truelabel's LeRobot-compatible delivery format includes per-episode embodiment metadata, enabling both workflows without post-processing.

Multi-View RGB and Proprioceptive State Collection

Pi-0.5 requires hardware-synchronized multi-view RGB at 50 Hz to resolve occlusions and provide depth cues for manipulation. The canonical setup uses three cameras: a static third-person view mounted 1–1.5 meters from the workspace at 30–45° elevation, plus two wrist-mounted cameras (left and right) attached to the end-effector via 3D-printed brackets. All three streams must be frame-synchronized to within 5 ms to avoid temporal artifacts when the model fuses visual features; this typically requires a hardware trigger or a shared clock signal across camera modules[2].

Truelabel collectors use Intel RealSense D435 or D455 cameras for wrist views (compact form factor, USB 3.0 interface, built-in IMU for extrinsic calibration) and higher-resolution industrial cameras (Basler ace or FLIR Blackfly) for third-person views. We deliver intrinsic and extrinsic calibration matrices for every episode, enabling downstream tasks like 3D point-cloud reconstruction or multi-view consistency losses during training. RGB frames are stored as JPEG or PNG sequences in HDF5 containers, with timestamps recorded at microsecond precision to facilitate post-hoc synchronization checks.

Proprioceptive state vectors are sampled at 50 Hz from the robot's joint encoders and concatenated with binary gripper status. For Franka Panda arms, this yields a 15-dimensional vector (7 joint positions + 7 joint velocities + 1 gripper); for bi-manual ALOHA setups, the vector doubles to 30 dimensions. Truelabel logs proprioceptive state in the same HDF5 file as RGB frames, indexed by a shared timestamp array, and validates that no frames are dropped during teleoperation (a common failure mode when USB bandwidth saturates). We also record end-effector Cartesian pose (position + quaternion orientation) computed via forward kinematics, which some buyers use as an auxiliary supervision signal during FAST codebook training.
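
A minimal sketch of this shared-timestamp layout, reusing the /observations and /actions group names from the delivery schema described later; the timestamp dataset name and the stored attribute are assumptions.

```python
import h5py
import numpy as np

def write_episode(path, rgb_primary, proprio, actions, timestamps_us, instruction):
    """Write one episode using the shared-timestamp layout described above.

    rgb_primary:   (T, 224, 224, 3) uint8 frames from the third-person camera
    proprio:       (T, 15) float32 -- 7 joint positions + 7 joint velocities + 1 gripper
    actions:       (T, 7)  float32 commanded actions
    timestamps_us: (T,)    int64 microsecond timestamps shared by all streams
    """
    with h5py.File(path, "w") as f:
        f.create_dataset("observations/images/primary", data=rgb_primary, compression="gzip")
        f.create_dataset("observations/proprioceptive_state", data=proprio)
        f.create_dataset("actions", data=actions)
        f.create_dataset("timestamps_us", data=timestamps_us)   # dataset name is an assumption
        f.create_dataset("language_instruction", data=instruction)

        # Record the worst frame interval so dropped frames are visible at a glance
        # (expected interval is 20 ms at 50 Hz).
        deltas_ms = np.diff(timestamps_us) / 1000.0
        f.attrs["max_frame_interval_ms"] = float(deltas_ms.max())
```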

Language Instruction and Chain-of-Thought Annotation

Pi-0.5 accepts natural-language task instructions as a conditioning input, encoded via the same tokenizer used for PaLI-3 pre-training. Instructions range from single-sentence commands (e.g., "pick up the cup and place it in the bin") to multi-stage directives that are decomposed into sub-goal annotations for long-horizon tasks. Truelabel annotators attach one task instruction per episode and, for multi-stage tasks, frame-level sub-goal strings written against task-specific rubrics, supporting both flat instruction conditioning and chain-of-thought supervision.

Training Data Volume and Embodiment Coverage

Physical Intelligence trained pi-0.5 on over 10,000 robot hours collected across 24 embodiments, making it the largest proprietary robot dataset disclosed to date[1]. For comparison, Open X-Embodiment aggregates roughly 1 million episodes (~2,000 hours) from 22 robots, while DROID contributes 76,000 episodes (~350 hours) from 8 Franka Panda setups. Pi-0.5's 5× volume advantage over public benchmarks reflects Physical Intelligence's multi-year investment in proprietary data infrastructure, including custom teleoperation rigs, dedicated annotation teams, and automated quality-control pipelines.

The 24-embodiment mixture includes single-arm manipulators (Franka Panda, UR5, Kinova Gen3), bi-manual systems (ALOHA, dual Franka), mobile manipulators (Fetch, TIAGo), and dexterous hands (Allegro, Shadow). Each embodiment contributes 200–800 hours of demonstration data, with task distributions skewed toward household manipulation (pick-and-place, folding, pouring) and light assembly (peg insertion, cable routing)[1]. This embodiment diversity is critical for generalization: models trained on single-robot datasets often fail to transfer even to kinematically similar platforms due to differences in gripper geometry, joint limits, or control latency.

For organizations fine-tuning pi-0.5 on custom tasks, Physical Intelligence recommends 500–2,000 demonstrations per task (25–100 hours at 50 Hz) to achieve reliable closed-loop performance. Truelabel's data marketplace offers pre-collected task libraries (kitchen manipulation, warehouse pick-and-place, bin sorting) as well as custom collection services where we deploy your target embodiment, replicate your workspace setup, and deliver 1,000+ demonstrations within 4–6 weeks. We also provide embodiment-transfer datasets: pairs of demonstrations on two different robots performing the same task, enabling buyers to study how FAST codebook statistics shift across kinematic chains.

Pi-0.5 Architecture: PaLI-3 Vision Encoder and Transformer Backbone

Pi-0.5 inherits its visual encoder from PaLI-3, a 3-billion-parameter vision-language model pre-trained on web-scale image-text pairs. The encoder is a ViT-G/14 (giant vision transformer with 14×14 patch size), which processes each 224×224 RGB frame into a sequence of 256 visual tokens. For multi-view inputs, pi-0.5 concatenates token sequences from all three cameras (primary + left wrist + right wrist) and prepends a learned embodiment embedding, yielding a total context length of 768 visual tokens + 50 language tokens + 50 action tokens per forward pass[1].

The transformer backbone is a decoder-only architecture with 24 layers, 16 attention heads, and a hidden dimension of 2048. Unlike encoder-decoder VLAs (e.g., RT-2), pi-0.5 uses a unified causal attention mask over the concatenated vision-language-action sequence, enabling the model to attend to past visual observations when predicting future action tokens. This design choice simplifies training (no cross-attention layers) and improves sample efficiency on long-horizon tasks, where the model must track object states across dozens of time steps[3].

Pi-0.5's action decoder is a learned VQ-VAE with 256 codebook entries (for 7-DoF arms) or 1024 entries (for bi-manual/mobile platforms). The decoder takes a sequence of 50 discrete tokens predicted by the transformer and reconstructs a continuous action trajectory via transposed convolutions, producing a smooth 50-step control signal at 50 Hz. During training, the VQ-VAE is jointly optimized with the transformer backbone using a combination of reconstruction loss (L2 distance between predicted and ground-truth actions) and commitment loss (encouraging the encoder to use codebook entries efficiently). Truelabel's dataset delivery includes pre-computed VQ-VAE checkpoints for common embodiments, allowing buyers to skip the codebook pre-training phase and proceed directly to policy fine-tuning.
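
For intuition, the sketch below shows a bare-bones 1-D VQ-VAE over 50-step action chunks with the reconstruction, codebook, and commitment terms described above; the layer sizes, kernel choices, and loss weighting are illustrative guesses, not Physical Intelligence's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionVQVAE(nn.Module):
    """Toy 1-D VQ-VAE over action chunks; sizes and loss weights are illustrative."""

    def __init__(self, action_dim=7, codebook_size=256, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(            # (B, action_dim, T) -> (B, latent_dim, T)
            nn.Conv1d(action_dim, latent_dim, kernel_size=5, padding=2),
            nn.GELU(),
            nn.Conv1d(latent_dim, latent_dim, kernel_size=3, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.decoder = nn.Sequential(            # (B, latent_dim, T) -> (B, action_dim, T)
            nn.ConvTranspose1d(latent_dim, latent_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(latent_dim, action_dim, kernel_size=3, padding=1),
        )

    def quantize(self, z):
        # z: (B, latent_dim, T); pick the nearest codebook entry at every time step.
        flat = z.permute(0, 2, 1).reshape(-1, z.shape[1])
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        zq = self.codebook(idx).view(z.shape[0], z.shape[2], -1).permute(0, 2, 1)
        return zq, idx.view(z.shape[0], z.shape[2])

    def forward(self, actions):
        # actions: (B, T, action_dim) continuous trajectory chunk, e.g. T = 50.
        z = self.encoder(actions.transpose(1, 2))
        zq, tokens = self.quantize(z)
        zq_st = z + (zq - z).detach()            # straight-through estimator
        recon = self.decoder(zq_st).transpose(1, 2)

        recon_loss = F.mse_loss(recon, actions)          # L2 reconstruction term
        codebook_loss = F.mse_loss(zq, z.detach())       # move codebook toward encoder outputs
        commit_loss = F.mse_loss(z, zq.detach())         # keep encoder committed to the codebook
        return recon, tokens, recon_loss + codebook_loss + 0.25 * commit_loss

# Example: tokenize and reconstruct a batch of 50-step, 7-DoF action chunks.
model = ActionVQVAE()
recon, tokens, loss = model(torch.randn(4, 50, 7))       # tokens: (4, 50) discrete indices
```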

Pi-0.5 vs. RT-2, OpenVLA, and Other Vision-Language-Action Models

Pi-0.5 occupies a distinct niche in the VLA landscape due to its FAST tokenization and web-scale pre-training strategy. RT-2 (Robotics Transformer 2) uses a PaLI-X vision encoder and predicts continuous actions via a mixture-of-experts regression head, achieving strong zero-shot generalization on Google's 13-robot fleet but requiring 130,000+ demonstrations for reliable performance[3]. OpenVLA adopts a similar encoder-decoder architecture but trains exclusively on the 970,000-episode Open X-Embodiment dataset, yielding a fully open-source 7B-parameter model that matches RT-2 on public benchmarks[4].

Pi-0.5's key advantage over RT-2 and OpenVLA is action tokenization: by discretizing continuous actions into a learned codebook, pi-0.5 can leverage the same next-token prediction objective used in large language models, enabling joint pre-training on internet vision-language data and robot demonstrations without architectural modifications. This design also simplifies multi-embodiment training—RT-2 requires separate regression heads for each action space, while pi-0.5 uses a single shared codebook with embodiment-conditional decoding. The trade-off is codebook coverage: if the training distribution does not include smooth trajectories for a given task (e.g., high-speed throwing or contact-rich assembly), the VQ-VAE may fail to reconstruct the required motion primitive, leading to execution failures.

Another differentiator is proprietary data scale. Pi-0.5's 10,000-hour training set is 5× larger than Open X-Embodiment and includes tasks (laundry folding, cable management, liquid pouring) that are underrepresented in public datasets[1]. For buyers, this means pi-0.5 offers stronger out-of-the-box performance on household manipulation but requires custom fine-tuning for industrial or outdoor tasks. Truelabel's marketplace bridges this gap by offering domain-specific dataset requests: buyers post task specifications (workspace layout, success criteria, embodiment), and our collector network delivers 500–2,000 demonstrations within 4–6 weeks, formatted for immediate pi-0.5 fine-tuning.

Dataset Format and Delivery: HDF5, RLDS, and LeRobot Compatibility

Truelabel delivers pi-0.5-compatible datasets in HDF5 and RLDS (Reinforcement Learning Datasets) formats, with optional conversion to LeRobot's episode schema. Each HDF5 file represents a single demonstration episode and contains the following groups: `/observations/images/primary` (T×224×224×3 uint8 array), `/observations/images/wrist_left` and `/observations/images/wrist_right` (same shape), `/observations/proprioceptive_state` (T×15 float32 array for 7-DoF arms), `/actions` (T×7 float32 array), `/language_instruction` (UTF-8 string), and `/metadata` (embodiment ID, task ID, collector ID, timestamp)[5].

For buyers using the openpi training framework, we also provide RLDS-formatted datasets compatible with TensorFlow Datasets. RLDS wraps each episode as a `tf.data.Dataset` with standardized keys (`observation`, `action`, `reward`, `discount`, `language_instruction`), enabling seamless integration with existing training scripts. We include pre-computed FAST codebook statistics (mean, variance, token frequency) as a separate JSON file, allowing buyers to initialize the VQ-VAE decoder with empirical priors from our training distribution rather than learning the codebook from scratch.

LeRobot compatibility is critical for buyers who want to mix truelabel data with public datasets like ALOHA or DROID. We convert HDF5 episodes to LeRobot's Parquet-based schema, which stores observations and actions as columnar arrays with frame-level timestamps, and upload the resulting dataset to a private Hugging Face repository. Buyers can then load truelabel data via `lerobot.datasets.load_dataset('truelabel/your-task-name')` and train policies using LeRobot's built-in Diffusion Policy or ACT implementations, with minimal code changes required to switch to pi-0.5's FAST tokenization.

Quality Control: Trajectory Smoothness, Synchronization, and Success Labeling

Pi-0.5's FAST tokenization is highly sensitive to trajectory smoothness: jerky or mode-switching teleoperation data produces VQ-VAE codebook entries that do not generalize, leading to execution failures at test time. Truelabel enforces a multi-stage quality-control pipeline to ensure every delivered episode meets Physical Intelligence's implicit smoothness requirements. First, we reject any episode where joint velocity exceeds 2 rad/s (a threshold that filters out sudden joystick movements or collision recovery). Second, we compute the spectral entropy of the action trajectory in the frequency domain; episodes with high-frequency components above 10 Hz are flagged for manual review, as these typically indicate control jitter or sensor noise[2].

Camera synchronization is validated via cross-correlation analysis of frame timestamps. For each episode, we compute the time offset between the primary camera and each wrist camera by correlating brightness histograms across a 1-second window; if the offset exceeds 5 ms, we re-synchronize frames via linear interpolation and log a warning. We also check for dropped frames by verifying that the timestamp delta between consecutive frames is within 20±2 ms (the expected interval at 50 Hz). Episodes with more than 1% dropped frames are rejected, as temporal gaps corrupt the action-observation correspondence required for supervised learning.

Success labeling is performed by human annotators who watch a sped-up replay of each episode and mark whether the task was completed successfully. For ambiguous or multi-stage tasks (e.g., "make a sandwich"), annotators also mark sub-goal completion against task-specific rubrics with photo examples of success and failure states; 10% of episodes are labeled by two independent annotators, and disagreements are escalated to a senior annotator for a final determination.

Embodiment-Specific Considerations: Franka Panda, ALOHA, and Mobile Manipulators

Pi-0.5's 24-embodiment training set includes three major robot categories, each with distinct data-collection requirements. Single-arm manipulators (Franka Panda, UR5, Kinova Gen3) are the most common platform in academic and industrial datasets, offering 7 DoF (6 Cartesian pose + 1 gripper) and mature ROS integration. Truelabel collectors use Franka FR3 arms with parallel-jaw grippers, capturing 50 Hz proprioceptive state via the Franka Control Interface (FCI) and synchronizing with three RealSense D435 cameras via a hardware trigger. We deliver 500–2,000 demonstrations per task, with task distributions spanning pick-and-place (40%), assembly (30%), and deformable manipulation (30%).

Bi-manual systems like ALOHA double the action space to 14 DoF (two 7-DoF arms) and require coordinated teleoperation, where a human operator controls both arms simultaneously via leader-follower rigs. ALOHA's key innovation is gravity compensation: the leader arms are backdrivable, so the operator can move them freely without fighting motor resistance while the follower arms mirror the motion in real time. Truelabel's ALOHA rigs include custom 3D-printed wrist camera mounts and a shared 50 Hz clock for all four cameras (two per arm), ensuring frame synchronization across the bi-manual workspace. We specialize in tasks that require bimanual coordination—folding, cable routing, two-handed assembly—where single-arm datasets provide no useful signal.

Mobile manipulators (Fetch, TIAGo, Spot with arm) add base mobility to the action space, increasing DoF to 10–24 depending on the platform. Data collection is significantly more complex: the base must navigate to a pre-grasp pose, the arm executes the manipulation, and the base retreats—all while maintaining camera framing and avoiding occlusions. Truelabel uses ROS navigation stacks with pre-mapped environments to ensure repeatable base trajectories, and we annotate sub-goal waypoints (approach, grasp, retract, place) to enable hierarchical policy training. Mobile manipulation datasets are 3–5× more expensive per hour than fixed-base data due to workspace setup costs and lower task success rates, but they are essential for buyers targeting warehouse or logistics applications.

Long-Horizon Tasks and Hierarchical Annotation

Pi-0.5 is designed for long-horizon tasks that require dozens of primitive actions—pick, place, open, close, pour—sequenced over 30–120 seconds. Physical Intelligence's internal benchmarks include laundry folding (15–20 primitives), table setting (10–15 primitives), and cable management (20–30 primitives), all of which exceed the 10-second horizons typical of public datasets like BridgeData V2[6]. To train policies on these tasks, pi-0.5 requires hierarchical language annotations: a top-level instruction (e.g., "fold the laundry") paired with sub-goal descriptions for each primitive segment, so the model can learn to reason over long action sequences. Truelabel annotators segment each episode at primitive boundaries and attach a sub-goal string to every segment; a sketch of what such an annotation record might look like follows.
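
A hypothetical example of a frame-indexed hierarchical annotation record; field names and frame boundaries are illustrative, not a published schema.

```python
# Hypothetical annotation record for one long-horizon episode; field names and
# frame boundaries are illustrative, not a published schema.
episode_annotation = {
    "instruction": "fold the laundry",            # top-level command
    "subgoals": [                                  # frame-indexed primitive segments
        {"start_frame": 0,    "end_frame": 180,  "text": "pick up the shirt"},
        {"start_frame": 180,  "end_frame": 520,  "text": "fold the left sleeve inward"},
        {"start_frame": 520,  "end_frame": 900,  "text": "fold the right sleeve inward"},
        {"start_frame": 900,  "end_frame": 1600, "text": "fold the shirt in half and smooth it"},
    ],
    "success": True,
}

# At 50 Hz, frame indices map directly to time: frame 1600 is 32 s into the episode.
```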

Truelabel's Pi-0.5 Dataset Marketplace and Custom Collection Services

Truelabel operates a physical-AI data marketplace where buyers post dataset requests specifying task requirements (embodiment, workspace layout, success criteria, annotation schema) and our global collector network delivers demonstrations within 4–6 weeks. For pi-0.5 buyers, we offer three service tiers. Catalog datasets are pre-collected task libraries (kitchen manipulation, warehouse pick-and-place, bin sorting) available for immediate download in HDF5 or RLDS format, priced at $0.50–$2.00 per demonstration depending on task complexity. Custom collection involves deploying your target embodiment in our lab or yours, replicating your workspace setup, and delivering 500–2,000 demonstrations with full quality control and FAST codebook pre-computation; typical cost is $15,000–$50,000 per task. Hybrid collection mixes catalog data (for common primitives like pick-and-place) with custom data (for domain-specific tasks like PCB assembly), reducing cost by 30–50% while maintaining task coverage.

Our collector network includes 12 robotics labs across North America, Europe, and Asia, each equipped with 2–6 embodiments (Franka Panda, UR5, ALOHA, Fetch) and standardized teleoperation rigs. Collectors are trained on Physical Intelligence's implicit smoothness requirements: no joint velocity spikes above 2 rad/s, no mode-switching between primitives, and consistent 50 Hz action sampling. We enforce inter-rater reliability by having 10% of episodes collected by two independent operators; if their action trajectories diverge by more than 0.1 rad in joint space, we flag the task for clarification and re-collection.
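
A simplified sketch of the inter-rater agreement check, assuming both operators' joint trajectories are available as arrays; aligning by naive resampling is an illustrative simplification (a production pipeline might align by timestamp or dynamic time warping).

```python
import numpy as np

MAX_JOINT_DIVERGENCE_RAD = 0.1   # inter-rater threshold from the text above

def trajectories_agree(traj_a, traj_b):
    """Compare two operators' joint trajectories for the same task.

    traj_a, traj_b: (T, n_joints) arrays; lengths may differ, so both are
    resampled to a common length before comparing.
    """
    traj_a, traj_b = np.asarray(traj_a), np.asarray(traj_b)
    t = min(len(traj_a), len(traj_b))
    idx_a = np.linspace(0, len(traj_a) - 1, t).round().astype(int)
    idx_b = np.linspace(0, len(traj_b) - 1, t).round().astype(int)
    divergence = np.abs(traj_a[idx_a] - traj_b[idx_b]).max()
    return divergence <= MAX_JOINT_DIVERGENCE_RAD
```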

Truelabel also provides data provenance tracking via cryptographic lineage records, ensuring every demonstration includes collector ID, timestamp, embodiment serial number, and calibration matrix hash. This metadata is critical for buyers subject to AI Act Article 10 (data governance) or NIST AI RMF (training data transparency), as it enables auditors to trace model behavior back to specific training episodes. We store provenance records on an immutable append-only log (using OpenLineage schemas), and buyers receive a signed manifest linking dataset files to provenance entries—eliminating the "black-box dataset" problem that plagues public repositories like Hugging Face.

Licensing and Commercial Use Rights for Pi-0.5 Training Data

Pi-0.5 itself is a proprietary model released under Physical Intelligence's commercial license, which permits fine-tuning and deployment but prohibits redistribution of model weights or training data. For buyers using truelabel datasets to fine-tune pi-0.5, licensing is straightforward: truelabel data is delivered under CC BY 4.0 (attribution required, commercial use permitted) or custom commercial licenses that grant perpetual, worldwide rights to train, fine-tune, and deploy models without revenue sharing[7]. We do not impose "non-commercial" restrictions (unlike many academic datasets), and we explicitly permit model weight redistribution for buyers who want to open-source their fine-tuned policies.

A common procurement question: can truelabel data be mixed with public datasets like Open X-Embodiment or DROID? Yes, with caveats. Open X-Embodiment aggregates datasets from 22 robot platforms under a mix of licenses (CC BY 4.0, CC BY-NC 4.0, MIT, Apache 2.0); buyers must audit each constituent dataset's license before commercial use[8]. DROID is released under CC BY 4.0, permitting commercial use, but its README includes an informal "citation request" that some legal teams interpret as a moral (not legal) obligation[2]. Truelabel's legal team reviews all public dataset licenses and provides a license compatibility matrix for buyers who want to mix our data with external sources, flagging any non-commercial or attribution requirements that may conflict with your deployment plans.

For government or defense buyers, truelabel offers DFARS-compliant data delivery: all demonstrations are collected by U.S. persons on U.S. soil, with no foreign data storage or processing. We also support ITAR-restricted embodiments (e.g., military robots) via on-premises collection, where our operators travel to your facility, collect data under your supervision, and deliver encrypted HDF5 files via airgapped transfer. Pricing for DFARS/ITAR collection is 2–3× standard rates due to security overhead, but it is the only compliant path for buyers subject to export-control regulations.


External references and source context

  1. General Agents Need World Models

    Pi-0.5 model architecture, 10,000-hour training set, 24-embodiment coverage, and FAST tokenization design

    arXiv
  2. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID dataset scale (76,000 episodes, 350 hours), 8 Franka Panda setups, and camera synchronization requirements

    arXiv
  3. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 model comparison, 130,000-demonstration training scale, and encoder-decoder VLA design

    arXiv
  4. OpenVLA: An Open-Source Vision-Language-Action Model

    OpenVLA 7B-parameter model, Open X-Embodiment training, and open-source VLA benchmarks

    arXiv
  5. RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning

    RLDS dataset format, TensorFlow Datasets integration, and episode schema standardization

    arXiv
  6. BridgeData V2: A Dataset for Robot Learning at Scale

    BridgeData V2 dataset scale, task horizon comparison, and single-arm manipulation benchmarks

    arXiv
  7. Attribution 4.0 International deed

    Creative Commons Attribution 4.0 license terms and commercial use permissions

    Creative Commons
  8. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment dataset aggregation, 970,000 episodes, 22 robots, and multi-embodiment training

    arXiv
  9. Scale AI: Physical AI

    Scale AI physical-AI data engine and proprietary dataset infrastructure

    scale.com
  10. Labelbox documentation: Overview

    Labelbox annotation platform and multi-sensor data labeling workflows

    docs.labelbox.com
  11. Encord Active

    Encord Active data curation and quality control pipelines

    encord.com
  12. RoboNet: Large-Scale Multi-Robot Learning

    RoboNet multi-robot dataset scale and embodiment diversity

    arXiv
  13. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

    EPIC-KITCHENS egocentric video dataset and long-horizon task annotation

    arXiv
  14. Datasheets for Datasets

    Datasheets for Datasets framework and dataset documentation best practices

    arXiv
  15. Model Cards for Model Reporting

    Model Cards for Model Reporting and transparency requirements

    arXiv
  16. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World

    Domain randomization for sim-to-real transfer and synthetic data augmentation

    arXiv
  17. CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

    CALVIN benchmark for long-horizon language-conditioned manipulation

    arXiv
  18. RT-1: Robotics Transformer for Real-World Control at Scale

    RT-1 Robotics Transformer architecture and Google robot fleet training

    arXiv
  19. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation

    RoboCat self-improving generalist agent and multi-task learning

    arXiv
  20. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    SayCan language grounding in robotic affordances and task planning

    arXiv

FAQ

What is the minimum dataset size required to fine-tune pi-0.5 on a custom task?

Physical Intelligence recommends 500–2,000 demonstrations per task (25–100 hours at 50 Hz) for reliable closed-loop performance. Tasks with high variability (e.g., deformable manipulation, liquid pouring) require the upper end of this range, while constrained pick-and-place tasks can achieve 80%+ success rates with 500 demonstrations. Truelabel's custom collection service delivers 1,000 demonstrations within 4–6 weeks, including quality control and FAST codebook pre-computation.

Can I use pi-0.5 with a robot embodiment not included in Physical Intelligence's 24-embodiment training set?

Yes, but you will need to fine-tune on demonstrations from your target embodiment. Pi-0.5's FAST codebook is learned from Physical Intelligence's proprietary 24-embodiment mixture, so zero-shot transfer to a new kinematic chain (e.g., a 6-DoF UR3 or a 12-DoF humanoid arm) will likely fail due to action-space mismatch. Truelabel offers embodiment-transfer datasets: we collect 200–500 demonstrations on your robot performing the same tasks as a Franka Panda or ALOHA, enabling you to learn a new FAST codebook while preserving pi-0.5's visual and language representations.

How does FAST tokenization compare to diffusion-based action prediction (e.g., Diffusion Policy)?

FAST discretizes continuous actions into a learned codebook and predicts tokens autoregressively, while Diffusion Policy iteratively denoises a Gaussian action distribution over 10–100 steps. FAST is faster at inference (single forward pass vs. 10–100 denoising steps) and scales better to web-scale pre-training (next-token prediction is the same objective as language modeling), but it requires smooth training trajectories to learn a high-fidelity codebook. Diffusion Policy is more robust to noisy demonstrations but cannot leverage internet vision-language data without architectural modifications. For buyers with clean teleoperation data and access to large-scale compute, FAST offers better sample efficiency; for buyers with lower-quality data or limited compute, Diffusion Policy may be more practical.

What camera hardware does truelabel use for 50 Hz multi-view RGB collection?

Truelabel collectors use Intel RealSense D435 or D455 cameras for wrist views (compact, USB 3.0, built-in IMU) and Basler ace or FLIR Blackfly cameras for third-person views (higher resolution, GigE interface, hardware trigger support). All cameras are synchronized via a shared hardware trigger or PTP (Precision Time Protocol) clock, ensuring frame alignment within 5 ms. We deliver intrinsic and extrinsic calibration matrices for every episode, enabling downstream 3D reconstruction or multi-view consistency losses.

Does truelabel provide pre-trained FAST codebooks for common embodiments?

Yes. For Franka Panda (7 DoF), UR5 (6 DoF), and ALOHA (14 DoF), we provide pre-trained VQ-VAE checkpoints with 256–1024 codebook entries, learned from 500–2,000 hours of truelabel demonstration data. Buyers can initialize pi-0.5's action decoder with these checkpoints and skip the codebook pre-training phase, proceeding directly to policy fine-tuning. We also provide codebook statistics (mean, variance, token frequency) as JSON files, allowing you to verify that your target task's action distribution is covered by the codebook before committing to a training run.

How does truelabel handle task success labeling for ambiguous or multi-stage tasks?

Human annotators watch a sped-up replay of each episode and mark binary success (task completed) or failure (task not completed). For multi-stage tasks (e.g., "make a sandwich"), we also annotate sub-goal completion (bread placed, condiments spread, sandwich assembled) to enable hierarchical policy training. Annotators are trained on task-specific rubrics with photo examples of success/failure states, and we enforce inter-rater reliability by having 10% of episodes labeled by two independent annotators. If their labels disagree, a senior annotator reviews the episode and makes a final determination.

Looking for pi-0.5 training data?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

Submit Pi-0.5 Dataset Request