Model profiles

Models that train on physical AI data

Profiles of vision-language-action models, foundation models for robotics, and benchmark models — what they were trained on, and what data they need to deploy. 22 models covered.

How to use this hub

Start here when you know the broad category but haven't nailed the exact bounty spec yet. Each linked page narrows the request into a concrete data shape: modality, task, environment, metadata, rights, consent, delivery format, and sample QA. That structure is what turns a vague physical AI data need into a request that a supplier can satisfy, or a buyer can reject, with evidence.

The hub isn't meant to be the last page you read. It should hand off to a detail page where the specific intent is answered with sample specs, comparison tables, proof requirements, and external source context.

BridgeData V2 Model: Real-World Manipulation Dataset for Generalist Policies

Physical AI Training Data

BridgeData V2 is a 60,096-demonstration robot manipulation dataset collected on WidowX 250 arms across 24 real-world kitchen environments at UC Berkeley RAIL. Released in August 2023, it provides 7-DoF end-effector delta actions at 5 Hz with 256×256 RGB observations and natural-language task descriptions in RLDS format, and it is one of the datasets in the Open X-Embodiment mixture used to pretrain Octo, RT-X, and other vision-language-action models targeting tabletop pick-and-place scenarios.

Robot manipulation dataset
RLDS format

CrossFormer: Cross-Embodiment Transformer for Multi-Robot Policy Transfer

Model Profile

CrossFormer is a cross-embodiment transformer policy developed by UC Berkeley and Carnegie Mellon that addresses the heterogeneity problem in multi-robot learning through embodiment-specific tokenization and detokenization layers. Pretrained on 900,000+ episodes from the Open X-Embodiment dataset spanning 20+ robot platforms, CrossFormer achieves zero-shot transfer to new embodiments and fine-tunes effectively with 50–500 demonstrations per task, making it a practical foundation model for organizations deploying diverse robot fleets.

Cross-embodiment learning
Robot policy transfer

Diffusion Policy Model: Training Data Requirements & Integration

Physical AI Model Profile

Diffusion Policy treats robot action generation as a conditional denoising diffusion probabilistic model (DDPM), predicting temporally-correlated 16-step action chunks by iteratively refining Gaussian noise conditioned on visual observations. Introduced by Chi et al. at Columbia/MIT/TRI in 2023, it achieves 80-90% success rates on manipulation benchmarks from 100-200 teleoperation demonstrations per task, outperforming behavior cloning and implicit policies by modeling multi-modal action distributions without mode collapse.

Diffusion policy training data
Denoising diffusion robot control
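
The 16-step action chunks described in this profile imply that each demonstration is cut into fixed-length windows before training. The block below is a minimal sketch of that slicing, not the authors' pipeline: `make_action_chunks`, the `stride` parameter, and the toy 20-step demonstration are hypothetical names and values chosen for illustration, and trajectories shorter than one chunk are skipped rather than padded.

```python
from typing import List, Sequence, Tuple

Action = Tuple[float, ...]


def make_action_chunks(
    actions: Sequence[Action],
    horizon: int = 16,  # chunk length named in the profile
    stride: int = 1,    # assumption: overlapping windows, one step apart
) -> List[Tuple[int, List[Action]]]:
    """Return (start_index, chunk) pairs of `horizon` consecutive actions."""
    if horizon <= 0 or stride <= 0:
        raise ValueError("horizon and stride must be positive")
    chunks = []
    # Stop early enough that every chunk is full length; no padding is applied.
    for start in range(0, len(actions) - horizon + 1, stride):
        chunks.append((start, list(actions[start:start + horizon])))
    return chunks


# Hypothetical 20-step demonstration with 7-DoF actions.
demo = [tuple(float(t) for _ in range(7)) for t in range(20)]
windows = make_action_chunks(demo)
print(len(windows))  # 20 - 16 + 1 = 5 windows
```

With a stride of 1, a 20-step demonstration yields five overlapping windows. The same arithmetic gives a rough estimate of how many training windows a batch of demonstrations will produce.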

Gato Training Data: Multi-Task Tokenization for Generalist Agents

Model Profile

Gato is DeepMind's 1.2B-parameter transformer trained on 604 distinct tasks spanning Atari games, image captioning, dialogue, and real-world robot manipulation. It tokenizes all modalities—RGB images, proprioceptive state, continuous actions, and text—into a unified sequence using mu-law encoding and 16×16 patch embeddings, demonstrating that a single set of weights can perform both digital and physical tasks at variable control frequencies from 5 Hz to 60 Hz.

Gato model
Generalist agent training
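
The mu-law step for continuous values (proprioception, actions) can be sketched as companding followed by uniform binning. This is an illustrative sketch only: the constants `mu=100`, `m=256`, and `bins=1024` are assumptions for the example rather than values stated on this page, and the function names are hypothetical.

```python
import math
from typing import Iterable, List


def mu_law_encode(x: float, mu: float = 100.0, m: float = 256.0) -> float:
    """Compand x with a mu-law curve, then clip the result to [-1, 1]."""
    y = math.copysign(math.log(abs(x) * mu + 1.0) / math.log(m * mu + 1.0), x)
    return max(-1.0, min(1.0, y))


def discretize(y: float, bins: int = 1024) -> int:
    """Map y in [-1, 1] to an integer bin index in [0, bins - 1]."""
    idx = int((y + 1.0) / 2.0 * bins)
    return min(bins - 1, max(0, idx))


def tokenize_continuous(
    values: Iterable[float],
    mu: float = 100.0,   # assumed companding constant
    m: float = 256.0,    # assumed input scale
    bins: int = 1024,    # assumed vocabulary size for continuous values
) -> List[int]:
    """Turn continuous readings into discrete token indices."""
    return [discretize(mu_law_encode(v, mu, m), bins) for v in values]


print(tokenize_continuous([-256.0, 0.0, 256.0]))  # [0, 512, 1023]
```

Companding spends more bins on small magnitudes, which is why smooth, low-amplitude joint or gripper signals keep their resolution after tokenization.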

GENIMA Training Data Requirements & Integration

Vision-Language-Action Model

GENIMA (Generative Image as Action Models) fine-tunes Stable Diffusion to produce colored-sphere affordance images that an ACT controller decodes into 7-DoF joint trajectories. Training requires time-synchronized RGB streams (128×128 sim, 480×640 real), joint-position recordings at 10–30 Hz, gripper states, and calibrated camera parameters (intrinsics/extrinsics). Published by Dyson Robot Learning Lab in July 2024, GENIMA achieved 64% success on 18 RLBench tasks with 100 demonstrations per task.

Affordance image generation
Stable Diffusion robotics

GR-2: ByteDance's Video-Language-Action Model for Robot Manipulation

Physical AI Model Profile

GR-2 is a generative video-language-action transformer developed by ByteDance Research that autoregressively predicts interleaved video tokens and 6-DoF action deltas for robot manipulation. The model is pretrained on 38 million video clips totaling 50 billion tokens from Something-Something-v2, EPIC-KITCHENS, and Kinetics-400, then fine-tuned on 800,000 robot manipulation episodes across 100 tasks, achieving 97.7% success on the CALVIN benchmark and 94.9% on real-robot evaluations[ref:ref-gr2-paper]. Unlike vision-language-action models that encode video into fixed embeddings, GR-2 treats video frames as discrete VQGAN tokens in a unified autoregressive sequence, enabling the model to leverage large-scale video pretraining for manipulation policy learning.

Video-language-action model
VQGAN tokenization robotics

GR00T N1: NVIDIA's Dual-System Humanoid Foundation Model

Physical AI Model

GR00T N1 is NVIDIA's humanoid foundation model released March 2025, with roughly 2 billion parameters in the open GR00T-N1-2B checkpoint, combining Eagle-2 vision-language reasoning (System 2, 10 Hz) with a diffusion transformer motor controller (System 1, 50+ Hz). Trained on 50,000+ robot trajectories, 3 million human egocentric videos, and synthetic Isaac Sim data, it processes multi-view RGB plus proprioceptive state to output continuous joint-position targets for variable-DoF humanoid embodiments.

NVIDIA humanoid model
Eagle-2 VLM

HPT Training Data: Heterogeneous Pre-trained Transformers for Robot Learning

Model Profile

HPT is a modular transformer architecture pretrained on 52 heterogeneous robot datasets totaling 800,000 episodes and 93 million transitions. It uses embodiment-specific stems to tokenize diverse sensor inputs (RGB, depth, point clouds, proprioception) into a fixed 32-token sequence, processes them through a shared ViT trunk, and decodes actions via embodiment-specific heads. Pretraining on this mixture enables zero-shot transfer and sample-efficient fine-tuning across platforms with different action spaces and control frequencies.

Heterogeneous pretraining
Robot transformer datasets

HumanPlus Model: Training Data Requirements & Architecture

Physical AI Model Profile

HumanPlus is a two-stage humanoid learning system developed at Stanford that trains a shadowing transformer on 40 hours of AMASS motion capture data to predict 33-DoF joint targets from single RGB frames, then fine-tunes task-specific policies via teleoperation demonstrations collected on Unitree H1 platforms at 30 Hz with dual head-mounted cameras.

Humanoid robot training data
AMASS motion capture

MVP (Masked Visual Pre-training) Training Data & Integration

Model Profile

MVP is a self-supervised visual representation learning framework developed at UC Berkeley that applies masked autoencoder (MAE) pre-training to in-the-wild video, producing frozen ViT-Base encoders (86M parameters) that downstream robot manipulation policies consume as observation encoders. Pre-trained on Ego4D's 3,670 hours of egocentric video, MVP achieves 91.3% success on Adroit manipulation tasks and 67.2% on Meta-World, outperforming ImageNet-supervised baselines by 16-24 percentage points without task-specific fine-tuning.

Masked autoencoder robotics
Visual representation learning
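
Masked autoencoder pre-training hides most image patches and trains the encoder on the rest. The sketch below shows only the patch-selection step, under stated assumptions: a 224×224 input, 16×16 patches, and a 75% mask ratio are illustrative defaults not given in this profile, and `random_patch_mask` is a hypothetical helper rather than part of any released MVP code.

```python
import random
from typing import List, Tuple


def random_patch_mask(
    image_size: int = 224,     # assumed input resolution
    patch_size: int = 16,      # assumed ViT patch size
    mask_ratio: float = 0.75,  # assumed fraction of patches hidden
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Return (visible, masked) patch indices for one image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    random.Random(seed).shuffle(order)  # seeded for reproducibility
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


visible, masked = random_patch_mask()
print(len(visible), len(masked))  # 49 visible, 147 masked out of 196
```

Because the encoder only ever sees the visible subset, video frames with large static or blank regions contribute little signal, which is one reason frame selection matters when sourcing pre-training video.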

Octo Model: Open-Source Generalist Robot Policy & Training Data

Physical AI Model

Octo is the first fully open-source generalist robot manipulation policy released by UC Berkeley in May 2024, pre-trained on 800,000 trajectories from 25 datasets in the Open X-Embodiment collection[ref:ref-oxe-paper]. It accepts 256×256 RGB observations (primary + wrist views), outputs 7-DoF end-effector delta actions via a diffusion head, and supports language or goal-image conditioning through a T5-Base encoder at ~10 Hz control frequency.

Generalist robot policy
Open X-Embodiment

PaLM-E: Embodied Multimodal Language Model

Model Profile

PaLM-E is a 562-billion-parameter vision-language-action model from Google DeepMind that grounds natural language reasoning in real-world sensor data[ref:ref-palm-e-paper]. Released March 2023, it interleaves PaLM's 540B-parameter language backbone with a 22B-parameter vision transformer to generate high-level action plans from RGB observations and task instructions. Unlike motor-level policies, PaLM-E outputs step-by-step natural language plans executed by downstream controllers like RT-1[ref:ref-rt1-paper], achieving 84% success on long-horizon mobile manipulation tasks across three robot platforms.

Embodied multimodal LLM
Vision-language-action model

Pi-0.5 Training Data Requirements & Multi-Embodiment Dataset Specifications

Vision-Language-Action Model

Pi-0.5 is Physical Intelligence's 3-billion-parameter vision-language-action model released February 2025, trained on over 10,000 robot demonstration hours across 24 embodiments. It uses FAST action tokenization to convert continuous 7–24 DoF trajectories into discrete tokens, accepts multi-view RGB at 50 Hz plus proprioceptive state, and predicts 50-step action chunks per forward pass. Training requires hardware-synchronized camera streams, smooth teleoperation trajectories that yield clean FAST codebooks, and natural-language task instructions—often with chain-of-thought reasoning annotations for long-horizon tasks.

Physical Intelligence Pi-0.5
FAST action tokenization
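
The requirement for hardware-synchronized camera streams can be turned into a sample QA check. The sketch below assumes each stream is delivered as a list of per-frame timestamps in seconds; the function names, the stream names, and the tolerance of 10% of the frame interval are hypothetical choices for illustration, not a published acceptance threshold.

```python
from typing import Dict, List


def max_skew_ms(streams: Dict[str, List[float]]) -> float:
    """Largest per-frame timestamp spread across streams, in milliseconds."""
    lengths = {len(ts) for ts in streams.values()}
    if len(lengths) != 1:
        raise ValueError("all streams must have the same number of frames")
    num_frames = lengths.pop()
    worst = 0.0
    for i in range(num_frames):
        stamps = [ts[i] for ts in streams.values()]
        worst = max(worst, (max(stamps) - min(stamps)) * 1000.0)
    return worst


def is_synchronized(
    streams: Dict[str, List[float]],
    rate_hz: float = 50.0,            # capture rate named in the profile
    tolerance_fraction: float = 0.1,  # assumption: 10% of a frame interval
) -> bool:
    """True when the worst skew fits inside the allowed budget."""
    budget_ms = 1000.0 / rate_hz * tolerance_fraction
    return max_skew_ms(streams) <= budget_ms


good = {"head": [0.00, 0.02, 0.04], "wrist": [0.001, 0.021, 0.041]}
drifting = {"head": [0.00, 0.02, 0.04], "wrist": [0.00, 0.02, 0.05]}
print(is_synchronized(good), is_synchronized(drifting))
```

At 50 Hz a frame lasts 20 ms, so the assumed tolerance allows about 2 ms of skew. A buyer would replace that fraction with whatever the downstream training code can tolerate.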

R3M Training Data Requirements & Egocentric Video Datasets

Model Profile

R3M (Reusable Representations for Robotic Manipulation) is a visual representation model pretrained on 670 hours of egocentric video from Ego4D, producing 2048-dimensional embeddings that reduce downstream robot manipulation demonstration requirements by 20-40%. The model requires 224×224 RGB frames paired with timestamped natural language narrations for time-contrastive learning, then transfers to robot tasks via frozen or fine-tuned feature extraction.

Egocentric video datasets
Visual representation learning

RoboCat Training Data: Cross-Embodiment Datasets for Self-Improving Agents

Foundation Model

RoboCat is DeepMind's self-improving generalist agent that learns manipulation across multiple robot embodiments by bootstrapping from seed demonstrations and iteratively refining its policy through self-generated rollouts. The model requires 253+ task demonstrations per embodiment, multi-view RGB video at 5-10 Hz, 6-DoF end-effector actions discretized into 1024 bins, and binary success labels for 500+ rollout episodes per task. Truelabel's marketplace supplies the high-quality seed data—100% successful completions, diverse object configurations, hardware-synchronized multi-camera recording—that RoboCat's self-improvement loop cannot generate autonomously.

Cross-embodiment robotics datasets
Self-improving robot agents

RoboFlamingo Training Data: VLM-Compatible Datasets for Robot Manipulation

Vision-Language-Action Model

RoboFlamingo is a vision-language-action model published by ByteDance Research in 2023 that adapts OpenFlamingo, an open-source reproduction of DeepMind's Flamingo VLM, for robot manipulation by freezing the visual encoder and language model while training a lightweight policy head on 7-DoF continuous actions. It achieved 88.9% success on CALVIN's long-horizon benchmark using 10M robot demonstration frames, demonstrating that pre-trained vision-language models transfer effectively to embodied control when paired with task-specific action data in RLDS or HDF5 formats with multi-frame observation windows.

Vision-language-action model
Flamingo VLM robot control

RT-1 Training Data: Architecture, Dataset Requirements & Integration

Model Profile

RT-1 (Robotics Transformer 1) is Google's vision-language-action model trained on 130,000 demonstrations spanning 744 tasks collected over 17 months. It processes 300×300 RGB images with 6-frame history, discretizes 7-DoF end-effector deltas into 256 bins per dimension, and conditions on natural-language instructions via Universal Sentence Encoder embeddings injected through FiLM layers[ref:ref-rt1-paper].

Robotics transformer dataset
RLDS format
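
Discretizing each action dimension into 256 bins can be sketched as clip, normalize, and floor. This is a minimal illustration under stated assumptions: the per-dimension bounds of -1 to 1 are placeholders, since real bounds depend on the robot and dataset, and `discretize_action` and `undiscretize_action` are hypothetical helper names.

```python
from typing import List, Sequence


def discretize_action(
    action: Sequence[float],
    low: Sequence[float],
    high: Sequence[float],
    bins: int = 256,  # bin count named in the profile
) -> List[int]:
    """Clip each dimension to [low, high] and map it to a bin index."""
    if not (len(action) == len(low) == len(high)):
        raise ValueError("action, low, and high must have equal length")
    tokens = []
    for a, lo, hi in zip(action, low, high):
        if hi <= lo:
            raise ValueError("each high bound must exceed its low bound")
        clipped = min(max(a, lo), hi)
        frac = (clipped - lo) / (hi - lo)
        tokens.append(min(bins - 1, int(frac * bins)))
    return tokens


def undiscretize_action(
    tokens: Sequence[int],
    low: Sequence[float],
    high: Sequence[float],
    bins: int = 256,
) -> List[float]:
    """Map bin indices back to bin-center values."""
    return [lo + (t + 0.5) / bins * (hi - lo)
            for t, lo, hi in zip(tokens, low, high)]


# Assumed bounds for a 7-DoF end-effector delta (placeholder values).
LOW, HIGH = [-1.0] * 7, [1.0] * 7
tokens = discretize_action([-1.0, 0.0, 1.0, 2.0, -2.0, 0.5, -0.5], LOW, HIGH)
print(tokens)  # [0, 128, 255, 255, 0, 192, 64]
```

Out-of-range values are clipped rather than rejected here, so a delivery check may also want to count how often clipping occurs, because frequent clipping suggests the bounds do not match the recorded data.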

RT-2 Training Data Requirements & VLA Dataset Specifications

Vision-Language-Action Model

RT-2 requires 320×320 RGB observations paired with 7-DoF end-effector actions discretized into 256 text tokens and free-form language instructions. Google DeepMind trained the original model on 130,000 real-robot episodes collected at 3 Hz control frequency from a single kitchen environment. The architecture co-trains a PaLI-X vision-language backbone on internet image-text pairs and robot demonstrations simultaneously, producing emergent generalization capabilities not present in either data source alone.

Vision-language-action model
Robot demonstration dataset

SuSIE: Hierarchical Robot Manipulation via Subgoal Image Editing

Model Profile

SuSIE (Subgoal Synthesis via Image Editing) is a two-tier manipulation architecture from UC Berkeley that fine-tunes InstructPix2Pix on 220,847 video instruction pairs to generate 256×256 subgoal images, then trains a goal-conditioned policy on 60,000 robot demonstrations to execute low-level motor commands at 5–10 Hz.

Hierarchical robot control
InstructPix2Pix fine-tuning

Theia Vision Model: Training Data Requirements & Architecture

Physical AI Model

Theia is an 86-million-parameter vision foundation model developed by the Boston Dynamics AI Institute that compresses four teacher models—DINOv2, CLIP, SAM, and Depth Anything—into a single ViT-Base encoder trained on 1.2 million images in 150 GPU hours. It produces 224×224 RGB feature maps for downstream manipulation policies, achieving 75% success on CortexBench with 50–200 demonstrations per task while requiring 4× less compute than training separate teacher models.

Theia robotics
Vision foundation model robotics

VC-1 Training Data: Egocentric Video Corpora for Visual Representation Learning

Model Profile

VC-1 is a 307M-parameter Vision Transformer pretrained via masked autoencoding on 4,000+ hours of egocentric video from Ego4D, producing 1024-dimensional visual features for embodied AI tasks. Unlike vision-language-action models, VC-1 outputs dense representations without action prediction; downstream policies map these features to robot controls. Truelabel supplies egocentric video corpora filtered for hand-object interactions, warehouse operations, and domestic tasks, plus annotated RGB frames at 224×224 resolution with action labels for policy training on VC-1 features.

Egocentric video dataset
Visual representation learning

Voltron: Language-Driven Visual Pretraining for Robot Manipulation

Vision-Language-Action Model

Voltron is a visual representation learning framework developed at Stanford that combines masked autoencoding with language-conditioned objectives to pretrain Vision Transformer encoders for robot manipulation. Published at RSS 2023, it trains on 220,847 video clips from Something-Something-v2 and produces 384-dimensional (ViT-Small) or 768-dimensional (ViT-Base) feature embeddings that downstream policies consume for control tasks, achieving 25% higher success rates than MAE or CLIP baselines on real-world WidowX manipulation benchmarks.

Language-driven visual pretraining
Something-Something-v2 dataset

Procurement questions before posting a bounty

What exact model behavior or evaluation question should this data improve?
Which modality, camera viewpoint, robot state, or metadata stream is required?
What evidence proves the supplier has rights, consent, and provenance?
Which delivery format must the sample open in before scale-up?
What specific failure reasons should cause sample rejection?

Quality gate before a page becomes a deal spec

A page in this hub should not be treated as a finished procurement document by itself. It is a starting point for a bounty. Before a buyer funds capture or licenses off-the-shelf data, the page needs to become a short operating spec: accepted examples, rejected examples, file format, metadata fields, consent requirements, delivery location, and a named reviewer who can approve the sample.

The practical test is simple: if two suppliers read the same detail record, would they submit comparable samples? If not, the buyer needs to narrow the research into a more specific bounty. The strongest Truelabel references help with that narrowing by linking from broad hubs into task pages, dataset profiles, format guides, glossary definitions, and public dataset alternatives.

Gate	Question	Pass signal
Intent	What model behavior does the data improve?	The objective is tied to a task, benchmark, or evaluation gap.
Evidence	What proves a supplier can deliver?	A sample package includes files, manifest, rights, and QA notes.
Ingestion	Can the buyer load the sample?	The sample opens in the expected format or converter.
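
The three gates in the table can be expressed as a small checklist over a sample package. The sketch below is one possible encoding, not a Truelabel schema: the `SamplePackage` fields and the `evaluate_gates` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SamplePackage:
    """Hypothetical summary of what a supplier submitted with a sample."""
    objective: str = ""                    # task, benchmark, or evaluation gap
    files: List[str] = field(default_factory=list)
    has_manifest: bool = False
    has_rights_evidence: bool = False
    has_qa_notes: bool = False
    opens_in_expected_format: bool = False


def evaluate_gates(pkg: SamplePackage) -> Dict[str, bool]:
    """Return a pass/fail result for each gate in the table."""
    return {
        "Intent": bool(pkg.objective.strip()),
        "Evidence": bool(pkg.files)
        and pkg.has_manifest
        and pkg.has_rights_evidence
        and pkg.has_qa_notes,
        "Ingestion": pkg.opens_in_expected_format,
    }


sample = SamplePackage(
    objective="Improve drawer-opening success on a held-out kitchen",
    files=["episode_0001.hdf5"],
    has_manifest=True,
    has_rights_evidence=False,  # missing consent records
    has_qa_notes=True,
    opens_in_expected_format=True,
)
print(evaluate_gates(sample))
```

Keeping the result per gate, rather than a single pass/fail flag, tells the supplier which part of the package to fix before resubmitting.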

Hub FAQ

How should buyers use the "Models that train on physical AI data" hub?

Use the "Models that train on physical AI data" hub to move from a broad physical AI data need into a concrete page with modality, sample, QA, format, rights, and supplier-evidence requirements.

Are these pages public datasets?

No. These pages are sourcing and specification guides for posting bounties. They help buyers define what a supplier must prove before data is accepted.

Why does this hub link to so many detail pages?

Each detail page handles one specific task, dataset, comparison, definition, or format. The hub is the index that helps a buyer pick the right one for the bounty they want to post.

What makes a page ready for a bounty?

A page is ready when it names a model objective, concrete files, metadata requirements, rights and consent expectations, sample QA checks, and a delivery format.

External source context

Scale AI physical AI data engine
Shows enterprise demand for custom physical AI collection and enrichment programs.
NVIDIA Physical AI Data Factory Blueprint
Frames physical AI data as an end-to-end factory problem spanning curation, generation, evaluation, and delivery.
Open X-Embodiment
Baseline open robotics data entity for cross-embodiment tasks and VLA pretraining discussions.
Ego4D dataset
Canonical egocentric video benchmark for first-person physical-world capture and limitations.