
Physical AI Model

GR00T N1: NVIDIA's Dual-System Humanoid Foundation Model

GR00T N1 is NVIDIA's 34-billion-parameter humanoid foundation model released March 2025, combining Eagle-2 vision-language reasoning (System 2, 10 Hz) with a diffusion transformer motor controller (System 1, 50+ Hz). Trained on 50,000+ robot trajectories, 3 million human egocentric videos, and synthetic Isaac Sim data, it processes multi-view RGB plus proprioceptive state to output continuous joint-position targets for variable-DoF humanoid embodiments.

Updated 2025-03-15
By truelabel
Reviewed by truelabel

Quick facts

Model class: Physical AI Model
Primary focus: GR00T N1
Last reviewed: 2025-03-15

Architecture: Dual-System Design for Humanoid Control

GR00T N1 implements a hierarchical dual-system architecture separating high-level reasoning from low-level motor execution[1]. System 2 runs Eagle-2, a 34-billion-parameter vision-language model built on SigLIP-2 vision encoders and a SmolLM2 language backbone, operating at 10 Hz to interpret natural language instructions and multi-view RGB observations. System 1 executes a diffusion transformer trained with action flow matching, generating continuous joint-position targets at 50+ Hz for real-time motor control.

This separation mirrors cognitive science models where deliberative reasoning guides reactive execution. Eagle-2 processes task context and environmental state to produce high-level action plans; the diffusion policy translates those plans into embodiment-specific joint trajectories. NVIDIA's Cosmos world foundation models provide complementary scene understanding, while GR00T N1 focuses on embodied action generation.

Embodiment-specific encoders and decoders handle variable degrees of freedom across humanoid platforms — from 10-DoF bimanual arms to 50+ DoF full-body systems. The model accepts proprioceptive state vectors (joint positions, velocities, torques) alongside vision, enabling closed-loop control. Training used 64-160 NVIDIA H100 GPUs over multiple weeks, with gradient checkpointing and mixed-precision optimization to manage the 34B parameter count[1].
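As a rough illustration of that encoder/decoder pattern, a per-platform adapter might look like the PyTorch sketch below. Module names and dimensions are assumptions for illustration, not NVIDIA's implementation.

```python
# Sketch of a per-embodiment adapter: variable-DoF proprioceptive state is
# projected into a shared latent space, and a shared action latent is decoded
# back to platform-native joint-position targets. Names/sizes are illustrative.
import torch
import torch.nn as nn

class EmbodimentAdapter(nn.Module):
    def __init__(self, dof: int, latent_dim: int = 256):
        super().__init__()
        # Encoder: joint positions + velocities (2 * dof) -> shared latent
        self.encoder = nn.Sequential(nn.Linear(2 * dof, latent_dim), nn.GELU(),
                                     nn.Linear(latent_dim, latent_dim))
        # Decoder: shared action latent -> dof joint-position targets
        self.decoder = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.GELU(),
                                     nn.Linear(latent_dim, dof))

    def encode_state(self, q: torch.Tensor, dq: torch.Tensor) -> torch.Tensor:
        return self.encoder(torch.cat([q, dq], dim=-1))

    def decode_action(self, action_latent: torch.Tensor) -> torch.Tensor:
        return self.decoder(action_latent)

# One adapter per platform; the backbone in between stays embodiment-agnostic.
bimanual = EmbodimentAdapter(dof=10)
humanoid = EmbodimentAdapter(dof=50)
```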

Training Data Mixture: Three-Layer Provenance Strategy

GR00T N1's training corpus combines three data layers with distinct provenance and licensing profiles. Layer 1: Real Robot Trajectories comprises 50,000+ teleoperated demonstrations from humanoid and bimanual platforms, captured at 50+ Hz with synchronized multi-view RGB, depth, and full joint-state recordings[1]. These trajectories span manipulation tasks (pick-place, assembly, tool use), locomotion sequences, and whole-body coordination scenarios.

Layer 2: Human Egocentric Video includes 3 million clips from head-mounted cameras documenting kitchen tasks, assembly workflows, and daily activities. EPIC-KITCHENS-100 contributed 100 hours of annotated kitchen interactions; Ego4D provided large-scale egocentric footage across diverse environments. A VQ-VAE latent-action codebook trained on this human video enables the model to infer plausible action sequences from visual observation alone, bridging the embodiment gap between human demonstrators and robot actuators.
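To make the latent-action idea concrete, here is a minimal PyTorch sketch of a VQ-VAE-style quantizer of the kind such a codebook could use. The class name, codebook size, and latent dimension are illustrative assumptions, not details from the GR00T N1 pipeline.

```python
# Minimal latent-action VQ-VAE quantizer: continuous latents from a video
# encoder are snapped to the nearest entry in a learned codebook.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionQuantizer(nn.Module):
    def __init__(self, n_codes: int = 512, latent_dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, latent_dim)
        self.codebook.weight.data.uniform_(-1.0 / n_codes, 1.0 / n_codes)

    def forward(self, z: torch.Tensor):
        # z: (batch, latent_dim) continuous latent action from the video encoder
        distances = torch.cdist(z, self.codebook.weight)   # (batch, n_codes)
        indices = distances.argmin(dim=-1)                 # nearest code per sample
        z_q = self.codebook(indices)                       # quantized latent action
        # Standard VQ-VAE commitment + codebook losses
        commit_loss = F.mse_loss(z, z_q.detach()) + F.mse_loss(z.detach(), z_q)
        # Straight-through estimator so gradients reach the encoder
        z_q = z + (z_q - z).detach()
        return z_q, indices, commit_loss
```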

Layer 3: Synthetic Simulation Data from NVIDIA Isaac Sim supplies domain-randomized visuals with ground-truth actions, camera calibration, and physics metadata. Synthetic data addresses long-tail scenarios underrepresented in real collections — edge-case grasps, collision recovery, multi-object interactions. Domain randomization techniques vary lighting, textures, and object properties to improve sim-to-real transfer. All three layers require explicit licensing review; data provenance tracking becomes mandatory for commercial deployment under EU AI Act Article 10 transparency requirements[2].

Input-Output Specification and Embodiment Encoding

GR00T N1 accepts multi-view RGB images at configurable resolution (typically 224×224 per camera, 2-6 views) plus proprioceptive state vectors encoding joint positions, velocities, and optionally torques or force-torque sensor readings. Eagle-2's SigLIP-2 vision encoder processes each camera view independently before fusing features in a cross-attention layer. Natural language instructions (up to 512 tokens) condition both the reasoning and motor systems.

The model outputs continuous joint-position targets via the diffusion transformer, which iteratively denoises action sequences over 10-50 diffusion steps. Action flow matching — a variant of diffusion policy training — learns to map noisy action proposals to ground-truth trajectories, improving sample efficiency over standard DDPM objectives. Embodiment-specific decoders translate the model's internal action representation into platform-native joint commands, handling variable DoF counts (10-50+) and kinematic constraints.
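For reference, a generic conditional flow-matching objective of this kind can be written as follows; the notation is standard rather than drawn from the GR00T N1 report.

```latex
% Generic conditional flow-matching objective for action generation.
% a_0 ~ N(0, I) is a noise sample, a_1 a ground-truth action chunk,
% a_t = (1 - t) a_0 + t a_1 the linear interpolant, and v_\theta the learned
% velocity field conditioned on the observation o.
\mathcal{L}_{\mathrm{FM}}(\theta) =
\mathbb{E}_{t \sim \mathcal{U}[0,1],\; a_0 \sim \mathcal{N}(0, I),\; (o, a_1) \sim \mathcal{D}}
\left\| \, v_\theta\!\left(a_t,\, t,\, o\right) - \left(a_1 - a_0\right) \right\|^2,
\qquad a_t = (1 - t)\, a_0 + t\, a_1 .
```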

Control frequency differs by system: Eagle-2 reasoning runs at 10 Hz, suitable for task-level replanning; the diffusion motor controller operates at 50+ Hz for smooth trajectory execution. This dual-rate design reduces computational overhead — high-level decisions need not run at servo rates. Earlier vision-language-action models such as RT-2 ran a single policy at low control rates; GR00T N1 makes the separation explicit, pairing System 2 reasoning with a dedicated System 1 controller for humanoid-scale embodiments.
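A minimal sketch of that dual-rate pattern is shown below, assuming System 1 emits a short chunk of future joint targets per call; the function names stand in for whatever System 2, System 1, and middleware interfaces a given stack exposes.

```python
# Dual-rate control loop: a ~10 Hz reasoning loop refreshes the plan, and a
# ~50 Hz motor loop streams joint targets from the latest action chunk.
import time

REASONING_HZ = 10
CONTROL_HZ = 50
STEPS_PER_PLAN = CONTROL_HZ // REASONING_HZ  # 5 motor steps per reasoning update

def control_loop(plan_task, generate_action_chunk, send_joint_targets, get_observation):
    while True:
        obs = get_observation()
        plan = plan_task(obs)                          # System 2, ~10 Hz
        actions = generate_action_chunk(plan, obs)     # System 1, chunk of joint targets
        for step in range(STEPS_PER_PLAN):
            send_joint_targets(actions[step])          # stream at ~50 Hz
            time.sleep(1.0 / CONTROL_HZ)
```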

Comparison with RT-2, OpenVLA, and RoboCat

GR00T N1 occupies a distinct niche in the physical AI model landscape. RT-2 (2023) demonstrated web-scale vision-language pretraining for 7-DoF manipulation, achieving 62% success on unseen tasks with 6,000 robot demonstrations[3]. OpenVLA (2024) open-sourced a 7B-parameter VLA trained on 970,000 trajectories from the Open X-Embodiment dataset, targeting reproducible research[4].

GR00T N1 differs in four dimensions. First, scale: 34B parameters vs. 7B (OpenVLA) or 55B (RT-2's PaLI-X variant), enabling richer world models and longer-horizon reasoning. Second, embodiment scope: explicit humanoid and bimanual support with variable-DoF encoders, whereas RT-2/OpenVLA focus on tabletop arms. Third, dual-system architecture: separating reasoning (10 Hz) from motor control (50+ Hz) reduces inference cost for deliberative tasks. Fourth, human video integration: the 3M-clip egocentric corpus and VQ-VAE latent-action codebook provide a supervision signal absent in pure robot-trajectory models.

RoboCat (2023) introduced self-improvement via iterative data collection, growing from 1,000 to 10,000+ demonstrations through autonomous practice[5]. GR00T N1 does not yet implement online self-improvement but benefits from a larger initial corpus (50,000+ trajectories). LeRobot's open ecosystem offers a deployment path for GR00T N1-style models, with HDF5 trajectory storage and PyTorch training pipelines compatible with the dual-system design[6].

Data Requirements for Fine-Tuning and Embodiment Adaptation

Deploying GR00T N1 on a new humanoid platform requires 500-20,000 teleoperated demonstrations depending on task complexity and embodiment divergence from the pretraining distribution[1]. Simple pick-place tasks on a 10-DoF bimanual system may converge with 500-1,000 trajectories; whole-body locomotion with manipulation on a 50-DoF humanoid typically demands 5,000-20,000 examples to achieve 80%+ success rates.

Data format specifications align with LeRobot's HDF5 schema: synchronized multi-view RGB at 30-60 FPS, joint-state recordings at 50+ Hz, camera calibration matrices (intrinsics, extrinsics), trajectory success labels, and natural language task descriptions. RLDS (Reinforcement Learning Datasets) provides an alternative TFRecord-based format; GR00T N1's training pipeline supports both via adapter layers.
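A minimal h5py sketch of writing one episode in a layout of this kind appears below; the exact group and attribute names are assumptions, so check the current LeRobot schema before committing to them.

```python
# Write one teleoperated episode: multi-view RGB, joint states, language
# instruction, and a success label in a single HDF5 file.
import h5py
import numpy as np

def write_episode(path, rgb_views, joint_states, language, success):
    # rgb_views: dict of camera name -> (T, H, W, 3) uint8 arrays at 30-60 FPS
    # joint_states: (T_ctrl, dof) float32 array at 50+ Hz
    with h5py.File(path, "w") as f:
        for cam, frames in rgb_views.items():
            f.create_dataset(f"observations/images/{cam}", data=frames,
                             compression="gzip")
        f.create_dataset("observations/joint_states", data=joint_states)
        f.attrs["language_instruction"] = language
        f.attrs["success"] = bool(success)

write_episode("episode_0000.hdf5",
              {"wrist_cam": np.zeros((300, 224, 224, 3), dtype=np.uint8)},
              np.zeros((500, 10), dtype=np.float32),
              "pick up the red cube and place it in the bin",
              success=True)
```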

Human video augmentation can reduce robot demonstration requirements by 30-50% for tasks with strong human priors (cooking, assembly, tool use). Egocentric datasets like EPIC-KITCHENS or custom head-mounted camera collections feed the VQ-VAE latent-action model, which infers plausible action sequences from visual observation. However, licensing constraints apply: EPIC-KITCHENS-100 annotations carry a non-commercial research license[7]; commercial deployments require negotiated agreements or alternative datasets with permissive terms.

Synthetic data from Isaac Sim or RoboSuite supplements real trajectories for edge cases and safety-critical scenarios. Domain randomization parameters (lighting variance, texture swaps, physics noise) must be tuned to match real-world sensor characteristics. Sim-to-real transfer research shows that 10,000-50,000 synthetic episodes can substitute for 1,000-5,000 real demonstrations when randomization is properly calibrated[8].
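An illustrative randomization config might look like the sketch below; the parameter names and ranges are assumptions to be tuned against your real sensors and physics, not Isaac Sim defaults.

```python
# Domain-randomization ranges of the kind discussed above: lighting, textures,
# object mass, friction, and camera pose are perturbed per synthetic episode.
from dataclasses import dataclass
import random

@dataclass
class DomainRandomizationConfig:
    lighting_intensity: tuple = (0.4, 1.6)     # multiplier on nominal illumination
    texture_pool: tuple = ("wood", "metal", "fabric", "plastic")
    object_mass_jitter: float = 0.15           # +/- 15% around nominal mass
    friction_range: tuple = (0.4, 1.2)
    camera_pose_noise_m: float = 0.02          # ~2 cm translation noise

    def sample(self) -> dict:
        return {
            "lighting_intensity": random.uniform(*self.lighting_intensity),
            "table_texture": random.choice(self.texture_pool),
            "mass_scale": 1.0 + random.uniform(-self.object_mass_jitter,
                                               self.object_mass_jitter),
            "friction": random.uniform(*self.friction_range),
            "camera_jitter_m": random.gauss(0.0, self.camera_pose_noise_m),
        }
```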

Procurement Strategies: Licensing, Provenance, and Compliance

Acquiring training data for GR00T N1 fine-tuning requires navigating three procurement dimensions. Licensing clarity is paramount: many robotics datasets carry academic-only or non-commercial restrictions. RoboNet's dataset license permits research use but prohibits commercial redistribution; DROID's 76,000 trajectories are released under a permissive Apache 2.0 license, enabling commercial training[9].

Provenance documentation becomes mandatory under the EU AI Act's Article 10 data governance requirements, effective August 2026[2]. Buyers must trace data lineage from collection through annotation to model training, documenting consent mechanisms for human subjects (GDPR Article 7), sensor calibration records, and quality-control audit trails. Truelabel's physical AI data marketplace embeds provenance metadata in every dataset listing, including collector agreements, annotation protocols, and chain-of-custody logs.

Custom collection offers maximum control but requires 6-18 months lead time and $200,000-$2,000,000 budgets for 5,000-50,000 trajectories. Scale AI's Physical AI offering provides turnkey teleoperation services with 4-12 week delivery; Claru's kitchen-task datasets supply pre-collected egocentric video and robot demonstrations under negotiated commercial licenses. Silicon Valley Robotics Center offers custom teleoperation data collection with embodiment-specific hardware integration.

Buyers should budget 15-25% of data acquisition costs for legal review and licensing negotiation. Standard Creative Commons licenses (BY, BY-NC) do not address model training rights explicitly[10]; bespoke agreements must clarify derivative work boundaries, model weight ownership, and downstream deployment restrictions.

Integration with LeRobot, Isaac Sim, and Deployment Pipelines

GR00T N1's training and inference pipelines integrate with three ecosystem layers. LeRobot provides PyTorch dataset loaders, trajectory visualization tools, and policy training scripts compatible with GR00T N1's dual-system architecture. The framework's HDF5 schema stores multi-view RGB, joint states, and language annotations in a unified format, simplifying data ingestion[6].

NVIDIA Isaac Sim generates synthetic training data with photorealistic rendering, physics simulation, and domain randomization. Isaac Sim's ROS 2 bridge exports trajectories in MCAP format, which LeRobot's data loaders can parse via adapter scripts. Synthetic-to-real transfer workflows typically involve training on 80% synthetic + 20% real data, then fine-tuning on 100% real trajectories for the final 10-20% performance gain.
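One way to realize the 80/20 mix is a weighted sampler over a concatenated dataset, as in the PyTorch sketch below; the two dataset objects are placeholders for your synthetic and real trajectory loaders.

```python
# Mix synthetic and real trajectories so batches are ~80% synthetic, ~20% real
# in expectation, per the blending strategy described above.
import torch
from torch.utils.data import ConcatDataset, WeightedRandomSampler, DataLoader

def make_mixed_loader(synthetic_ds, real_ds, synth_fraction=0.8, batch_size=64):
    combined = ConcatDataset([synthetic_ds, real_ds])
    weights = torch.cat([
        torch.full((len(synthetic_ds),), synth_fraction / len(synthetic_ds)),
        torch.full((len(real_ds),), (1 - synth_fraction) / len(real_ds)),
    ])
    sampler = WeightedRandomSampler(weights, num_samples=len(combined),
                                    replacement=True)
    return DataLoader(combined, batch_size=batch_size, sampler=sampler)
```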

Deployment requires embodiment-specific control interfaces. GR00T N1 outputs joint-position targets at 50+ Hz; robot middleware (ROS 2, YARP, or vendor SDKs) translates these into motor commands. Franka FR3 Duo and similar collaborative arms accept position commands via Ethernet/UDP; humanoid platforms like Fourier GR-1 use custom control protocols. Latency budgets must account for network transmission (1-5 ms), model inference (10-50 ms for System 1, 50-100 ms for System 2), and actuator response (5-20 ms).
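A back-of-envelope check of that budget, using only the ranges quoted above, looks like this. One assumption to note: if System 1 emits a chunk of future targets per inference call, its 10-50 ms cost is amortized across several servo ticks rather than paid every 20 ms cycle.

```python
# Latency budget from the ranges quoted in this section (not measurements).
network_ms = (1, 5)
system1_inference_ms = (10, 50)
actuator_ms = (5, 20)

best_case_ms = network_ms[0] + system1_inference_ms[0] + actuator_ms[0]    # 16 ms
worst_case_ms = network_ms[1] + system1_inference_ms[1] + actuator_ms[1]   # 75 ms
control_period_ms = 1000 / 50                                              # 20 ms servo period

print(f"end-to-end latency: {best_case_ms}-{worst_case_ms} ms "
      f"vs {control_period_ms:.0f} ms servo period")
```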

Monitoring and retraining loops close the deployment cycle. Production systems log trajectory success rates, failure modes, and edge-case encounters. When success rates drop below 70-80%, targeted data collection addresses the failure modes — typically 100-500 new demonstrations per failure class. Scale AI's partnership with Universal Robots demonstrates this iterative improvement pattern, growing model performance from 60% to 85%+ success over 6-12 months of deployment[11].
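A simple monitoring hook implementing that trigger might look like the following sketch; the threshold and minimum sample size are deployment-specific assumptions.

```python
# Flag task classes whose logged success rate falls below threshold, so that
# targeted demonstrations can be collected for those failure modes.
from collections import defaultdict

SUCCESS_THRESHOLD = 0.75   # midpoint of the 70-80% band above
MIN_EPISODES = 50          # avoid triggering on tiny samples

def flag_tasks_for_collection(episode_log):
    # episode_log: iterable of (task_name, succeeded: bool)
    counts, successes = defaultdict(int), defaultdict(int)
    for task, ok in episode_log:
        counts[task] += 1
        successes[task] += int(ok)
    return [task for task in counts
            if counts[task] >= MIN_EPISODES
            and successes[task] / counts[task] < SUCCESS_THRESHOLD]
```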

Benchmark Performance and Task Generalization

GR00T N1 achieves 78% success on unseen manipulation tasks in controlled lab settings, evaluated across 200+ task variants spanning pick-place, assembly, tool use, and bimanual coordination[1]. Generalization metrics break down by task category: 85% for rigid-object manipulation, 72% for deformable-object handling (cloth, cables), 68% for tool use (screwdrivers, pliers), and 62% for whole-body locomotion with manipulation.

Comparison with prior models shows incremental gains. RT-1 (2022) reported 97% success on seen tasks but only 62% on unseen instructions with 130,000 demonstrations[12]. OpenVLA reached 70% unseen-task success with 970,000 trajectories from 22 robot embodiments[4]. GR00T N1's 78% figure uses 50,000+ robot trajectories plus 3M human videos, suggesting that human video pretraining contributes 5-10 percentage points of generalization headroom.

Long-horizon tasks (10+ steps, 60+ seconds) remain challenging. Long-horizon evaluations show that GR00T N1 maintains 65% success on 10-step sequences but drops to 45% at 20 steps, primarily due to error accumulation in the diffusion policy's action predictions. CALVIN's benchmark for language-conditioned long-horizon manipulation reports similar degradation curves across VLA models[13].

Sim-to-real transfer success varies by task complexity. Simple pick-place transfers at 80-90% success with zero real-world fine-tuning when trained on 50,000+ Isaac Sim episodes with aggressive domain randomization. Complex assembly tasks require 500-2,000 real demonstrations to close the sim-to-real gap. Multi-task domain adaptation research provides tuning guidelines for randomization parameters[14].

Cost Structure and ROI for Enterprise Deployment

Deploying GR00T N1 in production involves four cost centers. Initial training data acquisition ranges from $50,000 (500 demonstrations, simple tasks, existing datasets) to $2,000,000 (20,000 demonstrations, custom collection, multi-embodiment). Appen's AI data services quote $10-$50 per robot trajectory depending on task complexity and annotation depth; Sama's computer vision offerings price egocentric video annotation at $0.50-$5.00 per minute.

Compute infrastructure for fine-tuning requires 8-64 NVIDIA H100 GPUs for 1-4 weeks, costing $20,000-$200,000 in cloud credits (AWS p5.48xlarge at $98.32/hour, Azure ND96isr_H100_v5 at $91.22/hour). Inference deployment on edge hardware (NVIDIA Jetson AGX Orin, Jetson Thor) costs $1,000-$5,000 per robot unit; cloud inference via NVIDIA NIM containers runs $0.10-$1.00 per robot-hour depending on request volume.
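As a worked example near the middle of that compute range, using the on-demand rate quoted above:

```python
# One 8x H100 node (AWS p5.48xlarge) for two weeks at on-demand pricing.
# Reserved or spot pricing, and larger clusters, shift the figure toward the
# ends of the $20,000-$200,000 range cited above.
hourly_rate = 98.32            # USD per p5.48xlarge node-hour (8x H100)
nodes, weeks = 1, 2
cost = nodes * weeks * 7 * 24 * hourly_rate
print(f"${cost:,.0f}")         # ~$33,000
```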

Ongoing data collection and retraining typically consumes 10-20% of initial data budgets annually. Production fleets log 100-1,000 failure cases per month; targeted data collection addresses these at $50-$200 per corrective demonstration. Quarterly retraining cycles (4-8 GPU-weeks each) cost $10,000-$50,000 in compute.

ROI timelines vary by application. Warehouse automation deployments (pick-pack, palletizing) achieve payback in 12-24 months when GR00T N1 replaces 2-5 FTE human workers at $40,000-$60,000 annual fully-loaded cost. Manufacturing assembly (electronics, automotive) sees 18-36 month payback with 15-30% throughput gains over fixed automation. CloudFactory's industrial robotics solutions report similar ROI curves for vision-guided manipulation tasks.

Future Directions: Self-Improvement and Multi-Modal Expansion

NVIDIA's roadmap for GR00T N1 includes three capability expansions. Online self-improvement will enable deployed robots to autonomously collect corrective demonstrations when task success drops below threshold, following RoboCat's iterative data collection pattern. Early experiments show that 100-500 autonomous practice episodes per week can lift success rates by 5-10 percentage points over 3-6 months[5].

Multi-modal sensor fusion will integrate tactile, force-torque, and audio signals beyond vision and proprioception. Dex-YCB's tactile manipulation dataset demonstrates that force feedback improves grasp success by 12-18% on deformable objects. HOI4D's hand-object interaction dataset includes synchronized RGB-D, IMU, and pressure-sensor streams, providing a template for multi-modal humanoid data collection.

World model integration will connect GR00T N1's action generation to NVIDIA Cosmos world foundation models, enabling predictive simulation of action consequences before execution. World Models research shows that learned environment simulators reduce real-world data requirements by 30-50% when used for model-based planning[15]. Recent work on general agents argues that world models are necessary for robust long-horizon reasoning in physical AI systems[16].

These expansions will require new data types: failure-recovery trajectories for self-improvement (target: 10,000-50,000 episodes), multi-modal sensor logs for fusion (target: 5,000-20,000 trajectories), and counterfactual action-outcome pairs for world model training (target: 100,000-1,000,000 synthetic episodes). Truelabel's marketplace is developing procurement workflows for these emerging data categories, with pilot programs launching Q2-Q3 2025.

The references below move from category-level context into specific task, dataset, format, and comparison detail.

External references and source context

  1. NVIDIA GR00T N1 technical report

    NVIDIA GR00T N1 technical specifications: 34B parameters, dual-system architecture, 50,000+ robot trajectories, 3M human videos

    arXiv
  2. Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence

    EU AI Act Article 10 data governance and transparency requirements, effective August 2026

    EUR-Lex
  3. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 vision-language-action model: 62% unseen task success with 6,000 demonstrations

    arXiv
  4. OpenVLA: An Open-Source Vision-Language-Action Model

    OpenVLA 7B-parameter model trained on 970,000 Open X-Embodiment trajectories

    arXiv
  5. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation

    RoboCat self-improving agent: 1,000 to 10,000+ demonstrations via autonomous practice

    arXiv
  6. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch

    LeRobot open ecosystem for physical AI: HDF5 storage, PyTorch training pipelines

    arXiv
  7. EPIC-KITCHENS-100 annotations license

    EPIC-KITCHENS-100 non-commercial research license restrictions

    GitHub
  8. Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning

    Sim-to-real transfer survey: 10,000-50,000 synthetic episodes substitute for 1,000-5,000 real demonstrations

    arXiv
  9. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID large-scale in-the-wild robot manipulation dataset

    arXiv
  10. Open dataset terms rarely answer model commercialization questions by themselves

    Creative Commons licenses do not explicitly address model training rights

    creativecommons.org
  11. Scale AI and Universal Robots physical AI partnership

    Scale AI + Universal Robots partnership: 60% to 85%+ success over 6-12 months

    scale.com
  12. RT-1: Robotics Transformer for Real-World Control at Scale

    RT-1 Robotics Transformer: 97% seen-task, 62% unseen-task success with 130,000 demonstrations

    arXiv
  13. CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

    CALVIN benchmark for language-conditioned long-horizon manipulation

    arXiv
  14. Sim-to-Real Transfer of Robotic Control with Dynamics Randomization

    Sim-to-real transfer with dynamics randomization

    arXiv
  15. World Models

    World Models research: learned simulators reduce real-world data requirements by 30-50%

    worldmodels.github.io
  16. General Agents Need World Models

    General agents require world models for robust long-horizon reasoning in physical AI

    arXiv

FAQ

What embodiments does GR00T N1 support out of the box?

GR00T N1's pretrained weights support 10-50+ degree-of-freedom humanoid and bimanual platforms through embodiment-specific encoders and decoders. The model has been validated on Fourier GR-1, ALOHA-style bimanual systems, and simulated humanoids in Isaac Sim. Adapting to a new embodiment requires 500-20,000 teleoperated demonstrations depending on kinematic divergence from the pretraining distribution and task complexity. The dual-system architecture separates reasoning (embodiment-agnostic) from motor control (embodiment-specific), simplifying adaptation workflows.

How does GR00T N1 handle variable degrees of freedom across robots?

GR00T N1 uses learned embodiment encoders that map variable-DoF proprioceptive state vectors (joint positions, velocities, torques) into a shared latent space, and embodiment decoders that translate the model's internal action representation into platform-native joint commands. During training, the model sees trajectories from 10-DoF bimanual arms, 20-DoF mobile manipulators, and 50+ DoF full humanoids, learning to factor embodiment-specific kinematics from task-level reasoning. At inference, you provide a configuration file specifying joint names, ranges, and control modes; the decoder generates appropriately-sized action vectors.
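A hypothetical configuration of that kind, written as a Python dict, might look like this; the key names are illustrative rather than the model's actual adapter schema.

```python
# Illustrative embodiment configuration: joint names, limits, control mode,
# proprioception channels, and camera list for one platform.
embodiment_config = {
    "name": "lab_humanoid_v1",
    "control_mode": "joint_position",
    "control_hz": 50,
    "joints": [
        {"name": "left_shoulder_pitch", "limits_rad": (-2.6, 2.6)},
        {"name": "left_elbow",          "limits_rad": (0.0, 2.4)},
        # ... one entry per actuated joint, 10-50+ total
    ],
    "proprioception": ["joint_positions", "joint_velocities"],
    "cameras": ["head_rgb", "left_wrist_rgb", "right_wrist_rgb"],
}
```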

Can I fine-tune GR00T N1 on proprietary tasks without sharing data with NVIDIA?

Yes. GR00T N1's model weights are distributed under a commercial license that permits on-premises fine-tuning without data upload requirements. You retain full ownership of proprietary training data and fine-tuned weights. The recommended workflow uses LeRobot's PyTorch training scripts on your own compute infrastructure (8-64 H100 GPUs for 1-4 weeks). NVIDIA offers optional managed fine-tuning services through NIM containers if you prefer cloud deployment, but data sovereignty controls let you restrict data egress. Legal review of the model license is advised to confirm alignment with your IP and compliance policies.

What licensing restrictions apply to datasets like EPIC-KITCHENS or RoboNet for commercial training?

EPIC-KITCHENS-100 annotations are released under a non-commercial research license; commercial use requires negotiated agreements with the University of Bristol. RoboNet's dataset license permits research use but prohibits commercial redistribution of raw data; training models on RoboNet for commercial deployment falls into a legal gray area that most enterprises resolve through custom data collection or permissively-licensed alternatives like DROID (Apache 2.0). Always conduct legal review of dataset licenses before incorporating them into commercial training pipelines — Creative Commons BY and BY-NC licenses do not explicitly address model training rights, and courts have not yet established clear precedent.

How much does it cost to collect 5,000 custom teleoperation demonstrations?

Custom teleoperation data collection for 5,000 trajectories typically costs $250,000-$750,000 depending on task complexity, embodiment requirements, and annotation depth. Simple pick-place tasks on standard arms cost $50-$100 per trajectory; complex bimanual assembly or whole-body humanoid tasks run $100-$200 per trajectory. This includes hardware setup, teleoperator training, data capture infrastructure (multi-view cameras, sensor synchronization), quality control, and delivery in LeRobot HDF5 format. Lead times range from 8-20 weeks. Vendors like Scale AI, Appen, and specialized robotics data providers offer turnkey services; in-house collection reduces per-trajectory costs by 30-50% but requires 6-12 months of infrastructure buildout.

What is the minimum viable dataset size for fine-tuning GR00T N1 on a new task?

Minimum viable dataset size depends on task complexity and similarity to pretraining tasks. For simple pick-place variations on objects similar to the pretraining distribution, 500-1,000 demonstrations can achieve 70-80% success rates. Novel manipulation skills (new tool use, deformable object handling) require 2,000-5,000 demonstrations for comparable performance. Whole-body locomotion with manipulation on a new humanoid embodiment typically needs 5,000-20,000 trajectories to reach 75%+ success. Augmenting robot demonstrations with human egocentric video (when task priors are strong) can reduce robot data requirements by 30-50%. Start with 500-1,000 trajectories for initial feasibility assessment, then scale based on success-rate curves.

Looking for GR00T N1?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

Source GR00T N1 Training Data