Physical AI Glossary

GR00T N1: NVIDIA's Humanoid Foundation Model

GR00T N1 (Generalist Robot 00 Technology N1) is NVIDIA's open-weight foundation model for humanoid robot control, released in March 2025. It implements a dual-system architecture: System 1 runs reactive visuomotor policies at 30+ Hz for balance and manipulation, while System 2 executes vision-language reasoning at 1-5 Hz for task planning and natural language grounding; the two systems communicate through shared goal representations.

Updated 2025-03-15
By truelabel
Reviewed by truelabel

Quick facts

Term: GR00T N1: NVIDIA's Humanoid Foundation Model
Domain: Robotics and physical AI
Last reviewed: 2025-03-15

What GR00T N1 Is and Why It Matters

GR00T N1 targets the humanoid form factor — bipedal robots with dual-arm manipulation — as the platform for general-purpose physical AI. NVIDIA CEO Jensen Huang positioned humanoids as the next major AI platform at GTC 2025, arguing that human-designed environments demand human-shaped agents[1]. The model's dual-system design separates concerns: fast reactive control handles the continuous sensorimotor loop (balance, posture, trajectory tracking), while slow deliberative reasoning interprets instructions and plans multi-step sequences.

This architecture addresses a core tension in vision-language-action models. Language models require 100-500ms inference latency, but balance corrections on a bipedal platform demand sub-30ms response times. By decoupling these timescales, GR00T N1 enables natural language task specification without sacrificing real-time control stability. System 2 produces goal representations that condition System 1's reactive behavior, creating a hierarchical control stack where high-level intent flows down to low-level motor commands.
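
The sketch below illustrates this decoupling in Python with hypothetical System1Policy and System2Planner interfaces and a shared goal buffer. GR00T N1's actual runtime is not published at this level of detail, so the structure, rates, and dimensions are illustrative assumptions.

    import time
    import threading

    import numpy as np

    # Hypothetical interfaces -- the real GR00T N1 components are not exposed like this.
    class System2Planner:
        def plan(self, instruction: str, image: np.ndarray) -> np.ndarray:
            """Return a goal embedding for the current instruction and view."""
            return np.zeros(256, dtype=np.float32)  # placeholder embedding

    class System1Policy:
        def act(self, observation: np.ndarray, goal: np.ndarray) -> np.ndarray:
            """Return joint commands conditioned on the latest goal embedding."""
            return np.zeros(28, dtype=np.float32)   # placeholder joint targets

    goal_lock = threading.Lock()
    latest_goal = np.zeros(256, dtype=np.float32)

    def reasoning_loop(planner, get_image, instruction, rate_hz=2.0):
        """System 2: refresh the goal embedding at roughly 1-5 Hz."""
        global latest_goal
        while True:
            goal = planner.plan(instruction, get_image())
            with goal_lock:
                latest_goal = goal
            time.sleep(1.0 / rate_hz)

    def control_loop(policy, get_observation, send_command, rate_hz=30.0):
        """System 1: react at 30+ Hz using whatever goal is currently available."""
        while True:
            with goal_lock:
                goal = latest_goal.copy()
            send_command(policy.act(get_observation(), goal))
            time.sleep(1.0 / rate_hz)

    # In deployment the two loops would run as separate threads or processes, e.g.:
    #   threading.Thread(target=reasoning_loop, args=(...), daemon=True).start()
    #   control_loop(...)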

The open-weight release strategy mirrors OpenVLA's approach to accelerating research through shared pretrained checkpoints. NVIDIA provides base weights trained on simulation data from Isaac Sim and Isaac Lab, expecting robotics teams to fine-tune on proprietary teleoperation datasets for deployment-specific tasks. This positions GR00T N1 as infrastructure rather than a product — a foundation that reduces the cold-start problem for humanoid manipulation projects.

NVIDIA's investment in humanoid-specific tooling (Isaac Gym for locomotion, Cosmos for world models, Omniverse for scene synthesis) creates a vertically integrated stack from simulation to deployment[2]. The foundation model sits at the top of this stack, consuming synthetic training data at scale while providing a common interface for downstream task adaptation.

Dual-System Architecture: Fast Reactive Control and Slow Deliberation

System 1 implements a visuomotor policy running at 30+ Hz, processing RGB-D camera streams and proprioceptive sensor data (joint angles, torques, IMU readings) to output joint velocity or torque commands. The policy architecture likely builds on RT-1's Transformer-based design, adapted for continuous control rather than discrete action spaces. Training uses reinforcement learning in simulation with domain randomization to bridge the sim-to-real gap[3].

System 2 operates at 1-5 Hz, running a vision-language model that accepts natural language instructions and egocentric video to produce task plans and goal states. This component shares design principles with RT-2, which demonstrated that pretraining on web-scale vision-language data transfers to robotic control. The slower update rate reflects the computational cost of large language model inference and the timescale of human task instructions (seconds to minutes, not milliseconds).

The shared representation space between systems is the critical interface. System 2 outputs goal embeddings — learned vector representations of desired end states — that System 1 consumes as conditioning inputs. This design avoids brittle symbolic interfaces (predefined action primitives, motion planners) in favor of learned representations that can capture nuanced task variations. The goal embedding might encode "grasp the red mug" as a continuous vector that System 1 interprets through its visuomotor policy.
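
A minimal PyTorch sketch of this conditioning pattern appears below. The fusion-by-concatenation design, the layer sizes, and the 256-dimensional goal vector are assumptions for illustration, not GR00T N1's published architecture.

    import torch
    import torch.nn as nn

    class GoalConditionedPolicy(nn.Module):
        """Illustrative System 1 head: fuse an observation encoding with a goal
        embedding from System 2 and regress continuous joint commands."""

        def __init__(self, obs_dim=512, goal_dim=256, action_dim=28):
            super().__init__()
            self.fuse = nn.Sequential(
                nn.Linear(obs_dim + goal_dim, 512),
                nn.ReLU(),
                nn.Linear(512, 256),
                nn.ReLU(),
                nn.Linear(256, action_dim),  # joint velocity or torque targets
            )

        def forward(self, obs_encoding, goal_embedding):
            return self.fuse(torch.cat([obs_encoding, goal_embedding], dim=-1))

    # "grasp the red mug" arrives as a continuous goal vector, not a symbol
    policy = GoalConditionedPolicy()
    action = policy(torch.randn(1, 512), torch.randn(1, 256))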

Training the dual-system architecture requires datasets with both low-level teleoperation trajectories (for System 1's reactive policy) and high-level language annotations (for System 2's reasoning). DROID's 76,000 teleoperation trajectories exemplify the System 1 training regime, while Open X-Embodiment's language-annotated episodes provide System 2 supervision. The challenge is acquiring datasets that pair both modalities at humanoid scale — most existing robot datasets target tabletop manipulation, not bipedal whole-body control.

Training Data Requirements for Humanoid Foundation Models

Humanoid foundation models demand three data categories: simulation rollouts for locomotion and balance, teleoperation demonstrations for manipulation skills, and language-annotated episodes for task grounding. NVIDIA's Isaac Sim generates synthetic locomotion data at scale — the technical report mentions 386,000 simulation hours for GR00T N1 pretraining[4]. Simulation handles the high-volume, low-diversity regime: walking gaits, stair climbing, obstacle avoidance, recovery from perturbations.

Teleoperation data captures human manipulation strategies that are difficult to specify through reward engineering. ALOHA demonstrated that 50-100 teleoperation episodes per task suffice for imitation learning on bimanual manipulation, but humanoid platforms introduce whole-body coordination challenges (maintaining balance while reaching, coordinating arm motion with torso lean) that increase data requirements. Existing humanoid teleoperation datasets remain scarce — most public robot datasets target fixed-base arms or mobile manipulators, not bipedal platforms.

Language annotations provide the grounding layer for natural language control. BridgeData V2 pioneered large-scale language annotation for robot data (60,000 trajectories with natural language descriptions), but its tabletop manipulation focus limits transfer to humanoid whole-body tasks. The annotation challenge scales with task complexity: "pick up the mug" is straightforward, but "clear the table while avoiding the laptop" requires compositional understanding of constraints and subgoals.

Data provenance becomes critical when mixing simulation, teleoperation, and language sources. Truelabel's data provenance framework tracks lineage from raw sensor streams through annotation pipelines to training batches, enabling ablation studies that isolate the contribution of each data source. For procurement, buyers need transparency into simulation parameters (physics engine, domain randomization ranges), teleoperation hardware (controller type, force feedback), and annotation protocols (crowdsourced vs expert, single-pass vs iterative refinement).
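
The snippet below sketches the kind of per-shard provenance record a buyer might request; the field names are illustrative examples rather than Truelabel's actual schema.

    # Illustrative provenance record for one training shard; field names are
    # examples of what to request from a vendor, not a fixed Truelabel schema.
    provenance_record = {
        "shard_id": "humanoid_teleop_2025_03_batch_017",
        "source": "teleoperation",            # "simulation" | "teleoperation"
        "simulation": None,                   # or {"engine": "...", "randomization_ranges": {...}}
        "teleoperation": {
            "controller": "bilateral_leader_follower",
            "force_feedback": True,
            "median_latency_ms": 18,
            "operator_training_protocol": "expert_2h_onboarding",
        },
        "annotation": {
            "protocol": "expert_single_pass",
            "instruction_language": "en",
            "reviewed": True,
        },
        "license": "commercial_training_use",
    }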

Deployment Contexts and Fine-Tuning Strategies

GR00T N1's open-weight release targets research labs and robotics startups building humanoid platforms. The base model provides locomotion and basic manipulation capabilities out of the box, but deployment-specific tasks (warehouse picking, household assistance, manufacturing assembly) require fine-tuning on proprietary datasets. NVIDIA's partnership with Figure AI, Agility Robotics, Apptronik, and Sanctuary AI suggests a co-development model where hardware partners contribute teleoperation data in exchange for early model access[2].

Fine-tuning strategies split between System 1 and System 2. For System 1, teams collect teleoperation demonstrations of target tasks using their specific hardware (different actuators, sensor suites, kinematic chains). LeRobot's training pipelines support this workflow, ingesting teleoperation data in standardized formats (RLDS, HDF5, MCAP) and fine-tuning pretrained visuomotor policies through behavior cloning or offline RL. The data volume depends on task complexity — simple pick-and-place might need 50 episodes, while dexterous bimanual assembly could require 500+.
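
A minimal behavior cloning sketch of this fine-tuning step is shown below, using stand-in tensors for observation encodings and expert actions; the real workflow loads a pretrained GR00T N1 checkpoint and teleoperation demos recorded on the target hardware.

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    # Stand-ins for a pretrained System 1 policy and a converted teleoperation
    # dataset; in practice these come from the pretrained checkpoint and from
    # demonstrations collected on the deployment platform.
    policy = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 28))
    demos = TensorDataset(torch.randn(5000, 512), torch.randn(5000, 28))  # (obs, action)
    loader = DataLoader(demos, batch_size=64, shuffle=True)

    optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-4)

    for epoch in range(10):
        for obs, expert_action in loader:
            loss = nn.functional.mse_loss(policy(obs), expert_action)  # behavior cloning objective
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()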

For System 2, fine-tuning adapts the vision-language model to domain-specific vocabulary and task structures. A warehouse deployment needs to understand SKU codes, bin locations, and inventory terminology; a household assistant needs to recognize furniture, appliances, and spatial prepositions ("on the counter," "inside the drawer"). This fine-tuning uses language-annotated episodes from the target environment, often starting with 100-500 examples and expanding through active learning as the system encounters edge cases.

The dual-system interface — the shared goal representation space — also requires domain adaptation. Pretrained goal embeddings might not capture task-specific constraints ("grasp gently" for fragile objects, "maintain upright orientation" for liquids). Fine-tuning this interface involves end-to-end training where System 2's goal outputs are evaluated by System 1's task success, creating a feedback loop that aligns high-level intent with low-level execution. Truelabel's marketplace connects teams needing domain-specific humanoid data with collectors who can capture teleoperation demonstrations on target hardware in target environments.

Comparison to Alternative Vision-Language-Action Architectures

GR00T N1's dual-system design contrasts with end-to-end vision-language-action models like RT-2 and OpenVLA, which collapse reasoning and control into a single Transformer that outputs actions directly from language and vision inputs. The end-to-end approach simplifies training (one loss function, one model) but struggles with the timescale mismatch: language model inference at 2-10 Hz is too slow for reactive control tasks that demand 30+ Hz update rates.

RT-1 and RT-2 addressed this by operating at 3 Hz and relying on low-level controllers (inverse kinematics, motion primitives) to interpolate between high-level actions. This works for tabletop manipulation where the base is fixed and balance is not a concern, but bipedal humanoids require continuous torque control to maintain stability. A 3 Hz policy cannot react to unexpected perturbations (a push, a slippery floor) fast enough to prevent falls.

The dual-system architecture also appears in RoboCasa's hierarchical policies and Google's SayCan framework, which pair language models with low-level skills. GR00T N1 differs by learning the interface between systems (goal embeddings) rather than using predefined action primitives. This enables more flexible task composition — System 2 can specify novel goals that System 1 generalizes to, rather than being limited to a fixed skill library.

Alternative architectures include world models like Ha and Schmidhuber's 2018 framework, which learn forward dynamics models to simulate action consequences before execution. NVIDIA's Cosmos project explores this direction for physical AI, but GR00T N1's technical report does not mention explicit world model components. The dual-system design may implicitly learn forward models within System 2's planning process, but this remains an open research question.

Data Format and Infrastructure Requirements

GR00T N1 training consumes data in RLDS (Reinforcement Learning Datasets) format, the standard introduced by Google Research in 2021 for sharing robot learning datasets. RLDS wraps trajectories as TensorFlow Datasets with standardized schemas: observations (images, proprioception), actions (joint commands), rewards, episode boundaries. This format enables dataset mixing — combining simulation rollouts from Isaac Sim with real teleoperation data from multiple hardware platforms.
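
A short sketch of loading and mixing RLDS-formatted episodes with TensorFlow Datasets follows. The builder names are placeholders and GR00T N1's actual data loader is not public, but the nested episode/steps structure is the standard RLDS pattern.

    import tensorflow_datasets as tfds

    # Builder names are placeholders; any RLDS-formatted dataset exposes episodes
    # containing a nested "steps" dataset with observation/action/reward fields.
    sim_ds = tfds.load("my_isaac_sim_rollouts", split="train")    # hypothetical builder
    teleop_ds = tfds.load("my_humanoid_teleop", split="train")    # hypothetical builder

    mixed = sim_ds.concatenate(teleop_ds).shuffle(1000)

    for episode in mixed.take(1):
        for step in episode["steps"]:
            obs = step["observation"]     # e.g. images + proprioception
            action = step["action"]       # e.g. joint commands
            done = step["is_last"]        # episode boundary flag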

For teleoperation data collection, teams typically use MCAP or ROS bag formats to record raw sensor streams, then convert to RLDS for training. MCAP's columnar storage handles high-frequency multimodal data (60 Hz RGB-D video, 100 Hz joint states, 1 kHz force-torque sensors) more efficiently than ROS bags, reducing storage costs for large-scale humanoid datasets. A single hour of humanoid teleoperation at full sensor resolution can generate 50-100 GB of raw data.
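
The sketch below shows the shape of the raw capture step using the mcap Python package's writer, assuming JSON-encoded messages for brevity; production pipelines typically record ROS 2 messages directly and convert to RLDS offline.

    import json
    import time

    from mcap.writer import Writer

    # Minimal sketch: log one second of 100 Hz joint states as JSON-encoded MCAP
    # messages. This illustrates only the raw capture step, not a full pipeline.
    with open("teleop_session.mcap", "wb") as f:
        writer = Writer(f)
        writer.start()
        schema_id = writer.register_schema(
            name="JointState",
            encoding="jsonschema",
            data=json.dumps({"type": "object"}).encode(),
        )
        channel_id = writer.register_channel(
            topic="/joint_states",
            message_encoding="json",
            schema_id=schema_id,
        )
        for _ in range(100):                  # one second at 100 Hz
            t_ns = time.time_ns()
            msg = {"positions": [0.0] * 28, "torques": [0.0] * 28}
            writer.add_message(
                channel_id=channel_id,
                log_time=t_ns,
                data=json.dumps(msg).encode(),
                publish_time=t_ns,
            )
        writer.finish()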

Language annotations are stored as episode-level metadata in RLDS, following the schema established by Open X-Embodiment: a natural language instruction string, optional step-by-step subgoal descriptions, and semantic tags (object categories, task types). Annotation quality varies widely — crowdsourced labels from platforms like Scale AI's data engine provide coverage at lower cost, while expert annotations from roboticists capture nuanced task constraints but scale poorly.
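
An illustrative episode-level annotation record is shown below; the field names follow the spirit of the Open X-Embodiment metadata rather than a normative schema.

    # Illustrative episode-level annotation; field names are examples, not a
    # normative Open X-Embodiment schema.
    episode_metadata = {
        "language_instruction": "clear the table while avoiding the laptop",
        "subgoals": [
            "pick up the mug and place it in the bin",
            "pick up the plate and place it in the bin",
            "do not touch the laptop",
        ],
        "object_categories": ["mug", "plate", "laptop", "bin"],
        "task_type": "tabletop_clearing",
        "annotation_source": "expert",   # vs "crowdsourced"
    }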

Infrastructure requirements for GR00T N1 training mirror other large vision-language models: multi-GPU clusters (NVIDIA DGX systems with A100 or H100 GPUs), distributed data loading pipelines to saturate GPU utilization, and experiment tracking to manage hyperparameter sweeps across System 1 and System 2 components. The dual-system architecture allows independent scaling — System 1's visuomotor policy is smaller and faster to train than System 2's vision-language model, enabling rapid iteration on reactive control while the reasoning component trains in parallel.

Evaluation Benchmarks and Generalization Metrics

Evaluating humanoid foundation models requires benchmarks that test both locomotion stability and manipulation dexterity across diverse environments. Existing benchmarks like RLBench and ManiSkill focus on tabletop manipulation with fixed-base arms, missing the whole-body coordination challenges of bipedal platforms. NVIDIA's technical report introduces new evaluation tasks: navigating cluttered environments while carrying objects, bimanual assembly on unstable surfaces, and recovering from external perturbations during manipulation.

Generalization metrics track zero-shot transfer to novel objects, environments, and task variations. The COLOSSEUM benchmark measures this for tabletop manipulation by testing policies on unseen object geometries and material properties. For humanoids, generalization extends to terrain variations (stairs, ramps, uneven ground), lighting conditions (indoor/outdoor, day/night), and distractor objects (navigating around obstacles while maintaining task focus). GR00T N1's simulation pretraining uses domain randomization to improve these generalization axes[3].
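
A generic domain randomization sketch follows: per-episode sampling of physics and visual parameters from configured ranges. The parameter names and ranges are illustrative assumptions, not values from GR00T N1's training configuration.

    import random

    # Illustrative randomization ranges; the actual Isaac Sim configuration used
    # for GR00T N1 is not published at this level of detail.
    RANDOMIZATION_RANGES = {
        "ground_friction": (0.4, 1.2),
        "payload_mass_kg": (0.0, 3.0),
        "motor_strength_scale": (0.8, 1.2),
        "sensor_latency_ms": (0.0, 25.0),
        "light_intensity_scale": (0.5, 1.5),
    }

    def sample_episode_params(ranges=RANDOMIZATION_RANGES):
        """Draw one set of environment parameters for a simulation episode."""
        return {name: random.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

    params = sample_episode_params()  # applied to the simulator before each rollout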

Language grounding evaluation tests whether System 2 correctly interprets instructions with spatial references ("the mug on the left"), temporal constraints ("before picking up the plate"), and compositional structure ("stack the blocks in order of size"). The CALVIN benchmark pioneered this for tabletop tasks with a test suite of 1,000 language-annotated episodes. Humanoid-scale language grounding remains an open challenge — most existing datasets lack the linguistic diversity and compositional complexity needed to stress-test vision-language reasoning.

Sim-to-real transfer metrics quantify the performance gap between simulation training and real-world deployment. Zhao et al.'s 2021 survey identifies key factors: physics simulation fidelity, sensor noise modeling, actuator dynamics, and contact modeling. GR00T N1's reliance on Isaac Sim for pretraining inherits NVIDIA's physics engine capabilities, but real-world validation on diverse humanoid platforms (different actuators, sensor suites, mass distributions) remains necessary to validate transfer claims.

Procurement Considerations for Humanoid Training Data

Buyers procuring humanoid training data face three sourcing decisions: simulation vs real-world data, teleoperation vs autonomous collection, and internal vs marketplace acquisition. Simulation data from Isaac Sim scales cheaply (marginal cost near zero after infrastructure setup) but introduces sim-to-real gaps that degrade deployment performance. Real-world teleoperation data captures true environment complexity but costs $500-2,000 per hour when accounting for hardware, operator time, and annotation labor[5].
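
The back-of-envelope sketch below applies the per-hour range above to a hypothetical collection campaign; the episode count and duration are assumptions chosen for illustration.

    def teleop_budget(episodes, minutes_per_episode, cost_per_hour_usd):
        """Rough teleoperation cost (operator + hardware + annotation), using the
        $500-2,000/hour range cited above. Episode length is an assumption."""
        hours = episodes * minutes_per_episode / 60.0
        return hours, hours * cost_per_hour_usd

    # Example: 500 demonstrations of a 3-minute bimanual assembly task
    hours, low = teleop_budget(500, 3, cost_per_hour_usd=500)
    _, high = teleop_budget(500, 3, cost_per_hour_usd=2000)
    print(f"{hours:.0f} hours, ${low:,.0f}-${high:,.0f}")   # 25 hours, $12,500-$50,000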

Teleoperation quality depends on controller design and operator skill. High-fidelity systems like ALOHA's bilateral teleoperation provide force feedback and low-latency control, enabling smooth manipulation demonstrations. Lower-cost alternatives (joystick control, VR interfaces) introduce jitter and suboptimal trajectories that hurt imitation learning performance. Buyers should specify controller type, latency bounds, and operator training protocols in procurement contracts.

Marketplace acquisition through platforms like Truelabel enables access to diverse hardware platforms and environments without capital investment in robots and facilities. Collectors list datasets with metadata on robot model, sensor suite, task categories, and environment types. Buyers filter by deployment-relevant criteria (indoor/outdoor, object categories, lighting conditions) and purchase licenses for training use. This model works when off-the-shelf tasks (pick-and-place, navigation, simple assembly) suffice, but custom tasks require dedicated data collection campaigns.

Licensing terms for humanoid datasets vary widely. Academic datasets like DROID use permissive licenses (MIT, Apache 2.0) that allow commercial use, while vendor-collected data from Scale AI or Appen typically restricts use to the purchasing organization and prohibits redistribution. Buyers training foundation models for resale need explicit commercial rights, ideally with indemnification against third-party IP claims on the training data.

Integration with NVIDIA's Physical AI Ecosystem

GR00T N1 sits at the center of NVIDIA's physical AI stack, consuming data from Isaac Sim and Isaac Lab while deploying on Jetson Orin edge devices for real-time inference. NVIDIA's Physical AI Data Factory blueprint describes this end-to-end pipeline: synthetic data generation in Omniverse, model training on DGX clusters, sim-to-real validation in Isaac Lab, and edge deployment on Jetson hardware.

Isaac Sim provides the simulation environment for locomotion pretraining, with physics-based rendering (ray tracing, material properties) and domain randomization (lighting, textures, object placements) to improve sim-to-real transfer. The platform supports parallel simulation across thousands of environments, generating millions of locomotion episodes per day. This volume is critical for pretraining System 1's reactive policies, which require diverse perturbations (slips, pushes, uneven terrain) to learn robust balance control.

Cosmos, NVIDIA's world foundation model project, aims to provide learned environment dynamics that complement GR00T N1's control policies. Cosmos models predict future video frames given current observations and proposed actions, enabling model-predictive control where System 2 simulates action consequences before committing. This integration remains experimental — the GR00T N1 technical report does not specify whether Cosmos components are used in the released model.

Deployment on Jetson Orin enables real-time inference at the edge, avoiding cloud latency and bandwidth costs. System 1's visuomotor policy runs at 30+ Hz on Jetson AGX Orin (275 TOPS INT8 performance), while System 2's vision-language model may require model compression (quantization, pruning, distillation) to fit within Jetson's memory and compute budget. NVIDIA provides TensorRT optimization tools to accelerate Transformer inference on Jetson, but large language models (7B+ parameters) remain challenging for edge deployment without quality degradation.
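
The quick estimate below shows why quantization matters at this scale: it counts weight storage only, ignoring activations, KV cache, and runtime overhead, so real footprints are higher.

    def weight_memory_gb(params_billion, bits_per_param):
        """Approximate weight storage only; activations and runtime overhead
        are ignored, so actual memory footprints are higher."""
        return params_billion * 1e9 * bits_per_param / 8 / 1e9

    for bits in (16, 8, 4):
        print(f"7B params @ {bits}-bit: {weight_memory_gb(7, bits):.1f} GB")
    # 7B params @ 16-bit: 14.0 GB
    # 7B params @ 8-bit:   7.0 GB
    # 7B params @ 4-bit:   3.5 GB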

Open Research Questions and Future Directions

GR00T N1's open-weight release accelerates research but leaves key questions unresolved. The technical report does not specify System 2's architecture in detail — whether it uses a pretrained vision-language model (CLIP, LLaVA) or a custom design, and how the goal embedding space is learned. Understanding this interface is critical for teams fine-tuning the model, as it determines what task variations System 2 can express and System 1 can execute.

Sim-to-real transfer for bipedal locomotion remains an active research area. While domain randomization improves robustness, real-world humanoid platforms encounter edge cases (wet floors, soft surfaces, unexpected contacts) that simulation struggles to model. The gap between Isaac Sim's 386,000 training hours and real-world deployment performance is not quantified in the technical report, leaving uncertainty about how much real-world fine-tuning is required.

Data efficiency for humanoid manipulation is another open question. Tabletop manipulation policies achieve strong performance with 50-500 demonstrations per task, but whole-body coordination may require more data. DROID's 76,000 trajectories provide a lower bound for diverse manipulation skills, but humanoid-specific datasets at this scale do not yet exist in the public domain. The community needs benchmarks that quantify data requirements as a function of task complexity and hardware platform.

Long-horizon task execution — chaining multiple skills over minutes to hours — tests whether the dual-system architecture scales beyond short manipulation episodes. LongBench introduced evaluation protocols for multi-step tasks, but humanoid platforms add failure modes (balance loss, battery depletion, sensor occlusion) that complicate long-horizon planning. Future work must address error recovery, replanning under uncertainty, and human intervention protocols when the system encounters unrecoverable failures.


External references and source context

  1. NVIDIA Cosmos World Foundation Models. NVIDIA Developer.
  2. Scale AI: Expanding Our Data Engine for Physical AI. scale.com.
  3. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. arXiv.
  4. NVIDIA GR00T N1 technical report. arXiv.
  5. Truelabel physical AI data marketplace bounty intake. truelabel.ai.

FAQ

What hardware platforms does GR00T N1 support?

GR00T N1 targets bipedal humanoid robots with dual-arm manipulation capability. NVIDIA has announced partnerships with Figure AI, Agility Robotics, Apptronik, and Sanctuary AI, suggesting the model is designed for their respective humanoid platforms (Figure 02, Digit, Apollo, Phoenix). The base model provides locomotion and manipulation primitives that generalize across platforms, but deployment requires fine-tuning on platform-specific teleoperation data to account for differences in actuators, sensor suites, and kinematic chains. Teams using custom humanoid designs can adapt GR00T N1 by collecting teleoperation demonstrations on their hardware and fine-tuning System 1's visuomotor policy through behavior cloning or offline reinforcement learning.

How much training data does GR00T N1 require for a new task?

Data requirements depend on task complexity and similarity to pretraining tasks. For simple manipulation tasks similar to those in the pretraining distribution (pick-and-place, object handoff), 50-100 teleoperation demonstrations may suffice for fine-tuning System 1. More complex tasks requiring precise bimanual coordination or whole-body motion (assembly, tool use, dynamic manipulation) typically need 200-500 demonstrations. System 2 fine-tuning for domain-specific language grounding starts with 100-500 language-annotated episodes and expands through active learning. These estimates assume high-quality teleoperation data with smooth trajectories and accurate language annotations — noisy data or inconsistent demonstrations increase requirements by 2-5×.

What data formats does GR00T N1 training support?

GR00T N1 training pipelines consume data in RLDS (Reinforcement Learning Datasets) format, which wraps trajectories as TensorFlow Datasets with standardized schemas for observations, actions, rewards, and episode boundaries. Teams typically collect raw teleoperation data in MCAP or ROS bag formats, then convert to RLDS using tools from the Google Research RLDS repository. Language annotations are stored as episode-level metadata following the Open X-Embodiment schema. For simulation data, Isaac Sim exports directly to RLDS format. The standardized format enables dataset mixing — combining simulation rollouts with real teleoperation data from multiple hardware platforms in a single training run.

How does GR00T N1 compare to OpenVLA for humanoid control?

OpenVLA is an end-to-end vision-language-action model that outputs actions directly from language and vision inputs at 3-10 Hz, suitable for tabletop manipulation where the robot base is fixed. GR00T N1's dual-system architecture separates reactive control (System 1 at 30+ Hz) from deliberative reasoning (System 2 at 1-5 Hz), addressing the timescale mismatch for bipedal platforms that require continuous high-frequency torque control to maintain balance. OpenVLA's single-model design simplifies training but cannot achieve the update rates needed for humanoid locomotion. For fixed-base manipulation tasks, OpenVLA may be simpler to deploy; for bipedal whole-body control, GR00T N1's architecture is better suited to the control requirements.

What are the licensing terms for GR00T N1 model weights?

NVIDIA released GR00T N1 as an open-weight model, but the specific license terms (MIT, Apache 2.0, custom NVIDIA license) are not detailed in the March 2025 technical report. Open-weight releases typically allow research and commercial use with attribution requirements, but may restrict redistribution of fine-tuned weights or impose usage reporting obligations. Teams planning commercial deployment should review the license carefully for restrictions on model derivatives, liability disclaimers, and patent grants. For training data procurement, buyers need separate licenses for datasets used in fine-tuning — the model license does not convey rights to training data, which may carry restrictive terms from data vendors or collectors.

Find datasets covering NVIDIA GR00T N1 model

Truelabel surfaces vetted datasets and capture partners working with NVIDIA GR00T N1 model. Send the modality, scale, and rights you need and we route you to the closest match.
