Physical AI Model Profile
GR-2: ByteDance's Video-Language-Action Model for Robot Manipulation
GR-2 is a generative video-language-action transformer developed by ByteDance Research that autoregressively predicts interleaved video tokens and 6-DoF action deltas for robot manipulation. The model is pretrained on 38 million video clips totaling 50 billion tokens from Something-Something-v2, EPIC-KITCHENS-100, and Kinetics-400, then fine-tuned on 800,000 robot manipulation episodes across 100 tasks, achieving 97.7% success on the CALVIN benchmark and 94.9% on real-robot evaluations[1]. Unlike vision-language-action models that encode video into fixed embeddings, GR-2 treats video frames as discrete VQGAN tokens in a unified autoregressive sequence, enabling the model to leverage large-scale video pretraining for manipulation policy learning.
Quick facts
- Model class: Physical AI Model Profile
- Primary focus: GR-2 robot model
- Last reviewed: 2026-05-13
What Is GR-2 and Why Video-Language-Action Architecture Matters
GR-2 is a generative transformer architecture published by ByteDance Research in October 2024 that unifies video generation and robot manipulation in a single autoregressive model[1]. The core innovation is treating video frames as discrete tokens via VQGAN quantization and interleaving them with continuous 6-DoF action deltas in a single sequence, allowing the model to predict both future video frames and robot actions autoregressively. This contrasts with prior vision-language-action models like RT-2 and OpenVLA, which encode video into fixed embeddings before action prediction.
The architecture consists of three components: a VQGAN encoder that tokenizes 224×224 RGB frames into 14×14 discrete token grids, a causal transformer backbone with 1.3 billion parameters, and a continuous action head that outputs 6-DoF end-effector pose deltas at 10 Hz control frequency. Language instructions are encoded by a pretrained text encoder and injected through cross-attention layers every four transformer blocks. The model processes multi-view observations by concatenating tokenized frames from up to four camera viewpoints into a single sequence, maintaining spatial correspondence through learned positional embeddings.
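To make the data flow concrete, here is a minimal structural sketch of those three components as placeholder functions. The internals are stubs with random outputs and the function names are ours; only the shapes and the encoder-backbone-head composition follow the description above.

```python
import numpy as np

# Structural sketch only: every function body is a stub. The shapes
# (224x224 frames -> 14x14 token grids -> 7-D actions) follow the profile.

def vqgan_encode(frames):
    """[num_cams, 224, 224, 3] uint8 -> [num_cams, 14, 14] codebook ids (stub)."""
    return np.random.randint(0, 16384, size=(frames.shape[0], 14, 14))

def transformer_backbone(video_tokens, language_embedding, hidden_dim=2048):
    """Causal transformer stand-in: returns a single hidden state vector."""
    return np.random.randn(hidden_dim)

def action_head(hidden_state):
    """Continuous head: hidden state -> 6-DoF pose delta + gripper, in [-1, 1]."""
    return np.tanh(np.random.randn(7))

frames = np.zeros((4, 224, 224, 3), dtype=np.uint8)   # four camera views
tokens = vqgan_encode(frames)
action = action_head(transformer_backbone(tokens, language_embedding=np.zeros(512)))
print(tokens.shape, action.shape)  # (4, 14, 14) (7,)
```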
GR-2's two-stage training pipeline first pretrains on 38 million video clips from Something-Something-v2, EPIC-KITCHENS-100, and Kinetics-400, totaling 50 billion video tokens, then fine-tunes on 800,000 robot manipulation episodes across 100 tasks. This video pretraining stage is critical: ablation studies show models pretrained on video data achieve 12-18 percentage point higher success rates than randomly initialized models on long-horizon manipulation tasks. The pretraining corpus emphasizes human hand-object interactions and egocentric viewpoints, providing strong priors for contact-rich manipulation.
For procurement teams evaluating GR-2, the key constraint is multi-view teleoperation data at 10 Hz with VQGAN-compatible frame resolutions. The model expects synchronized RGB streams from 2-4 calibrated cameras, 6-DoF end-effector trajectories, gripper binary states, and natural language task descriptions. Truelabel's physical AI marketplace aggregates multi-view teleoperation datasets from 47 collection sites, with verified camera calibration, temporal alignment, and task success labels, reducing the 6-12 month lead time typical of in-house data collection programs.
GR-2 Architecture: VQGAN Tokenization and Autoregressive Action Prediction
GR-2's architecture diverges from standard vision-language-action models by treating video frames as discrete tokens rather than continuous embeddings. Each 224×224 RGB frame is encoded by a pretrained VQGAN into a 14×14 grid of discrete tokens from a 16,384-token codebook, yielding 196 tokens per frame. For a 16-frame observation window with four camera views, this produces 12,544 video tokens in total. Actions are represented as 7-dimensional continuous vectors (6-DoF pose delta plus gripper state) and interleaved with video tokens in the autoregressive sequence.
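The ordering assumption behind that count can be sketched in a few lines. The sequence construction below is illustrative (a real implementation embeds tokens and actions before feeding the transformer), but it reproduces the 12,544-token figure for a 16-frame, four-camera window.

```python
import numpy as np

TOKENS_PER_FRAME = 14 * 14   # 196 VQGAN codebook ids per frame
NUM_CAMERAS = 4
WINDOW = 16                  # 16-frame observation window
ACTION_DIM = 7               # 6-DoF pose delta + gripper

def build_interleaved_sequence(frame_tokens, actions):
    """Interleave discrete video tokens with continuous action vectors.

    frame_tokens: int array [T, NUM_CAMERAS, TOKENS_PER_FRAME]
    actions:      float array [T, ACTION_DIM]
    Returns a flat list of ("video", token_id) and ("action", vector) entries;
    this only fixes the ordering, not the embedding or attention details.
    """
    sequence = []
    for t in range(frame_tokens.shape[0]):
        for cam in range(NUM_CAMERAS):              # all views precede the action
            sequence.extend(("video", int(tok)) for tok in frame_tokens[t, cam])
        sequence.append(("action", actions[t]))
    return sequence

rng = np.random.default_rng(0)
frames = rng.integers(0, 16384, size=(WINDOW, NUM_CAMERAS, TOKENS_PER_FRAME))
acts = rng.uniform(-1, 1, size=(WINDOW, ACTION_DIM))
seq = build_interleaved_sequence(frames, acts)
video_entries = sum(1 for kind, _ in seq if kind == "video")
print(video_entries, len(seq))  # 12544 video tokens, plus 16 action slots -> 12560
```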
The transformer backbone uses a causal attention mask to ensure autoregressive generation: at training time, the model predicts the next token in the sequence given all previous tokens; at inference time, it generates video and action tokens iteratively. Language conditioning is injected via cross-attention layers every four blocks, using embeddings from a frozen CLIP text encoder. The model's 1.3 billion parameters are distributed across 24 transformer layers with 16 attention heads and 2048 hidden dimensions. Training uses a mixed objective: cross-entropy loss for discrete video tokens and mean squared error for continuous action predictions, weighted 1:1.
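A minimal version of that mixed objective, assuming a PyTorch training loop, might look like the following. The tensor shapes and the 1:1 default weights mirror the description above, but this is not the published training code.

```python
import torch
import torch.nn.functional as F

def mixed_loss(video_logits, video_targets, action_preds, action_targets,
               video_weight=1.0, action_weight=1.0):
    """Cross-entropy on discrete VQGAN token predictions plus MSE on continuous
    action predictions, weighted 1:1 by default. Shapes are illustrative:

    video_logits:   [N_video_tokens, codebook_size]
    video_targets:  [N_video_tokens] int64 codebook ids
    action_preds:   [N_actions, 7]
    action_targets: [N_actions, 7]
    """
    ce = F.cross_entropy(video_logits, video_targets)
    mse = F.mse_loss(action_preds, action_targets)
    return video_weight * ce + action_weight * mse

# Example with random tensors (16,384-entry codebook, 7-D actions in [-1, 1]).
logits = torch.randn(32, 16384)
targets = torch.randint(0, 16384, (32,))
a_pred = torch.randn(8, 7)
a_true = torch.rand(8, 7) * 2 - 1
print(mixed_loss(logits, targets, a_pred, a_true))
```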
VQGAN tokenization provides two advantages over continuous vision encoders. First, discrete tokens enable direct application of language model pretraining techniques, including masked token prediction and next-token generation, which have proven effective at billion-parameter scale. Second, the discrete codebook acts as a bottleneck that forces the model to learn high-level visual abstractions rather than memorizing pixel-level details, improving generalization to novel objects and backgrounds. Ablations show VQGAN-based models outperform continuous vision encoders by 8-14 percentage points on out-of-distribution manipulation tasks.
The action prediction head is a two-layer MLP that projects the transformer's hidden state to 7-dimensional action space. Actions are normalized to [-1, 1] range during training and denormalized at inference time using dataset-specific statistics. The model outputs actions at 10 Hz, matching the control frequency of most research manipulators. For higher-frequency control (50-100 Hz), practitioners typically upsample GR-2's 10 Hz predictions via linear interpolation or train a separate high-frequency controller that tracks GR-2's waypoints. RT-1 and diffusion policy implementations demonstrate both approaches in production settings.
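As a sketch of that post-processing step, the snippet below denormalizes actions with hypothetical per-dimension min/max statistics and linearly upsamples a 10 Hz sequence to 50 Hz. The statistics layout is an assumption, and a tracking controller would replace the interpolation in practice.

```python
import numpy as np

def denormalize(actions_norm, stats):
    """Map actions from [-1, 1] back to physical units using dataset statistics.
    `stats` holds hypothetical per-dimension min/max arrays."""
    low, high = stats["min"], stats["max"]
    return (actions_norm + 1.0) / 2.0 * (high - low) + low

def upsample_linear(actions_10hz, target_hz=50, source_hz=10):
    """Linearly interpolate a 10 Hz action sequence to a higher control rate.
    The binary gripper channel would normally be held or thresholded rather
    than interpolated; it is interpolated here only to keep the sketch short."""
    t_src = np.arange(len(actions_10hz)) / source_hz
    t_tgt = np.arange(t_src[-1] * target_hz + 1) / target_hz
    return np.stack(
        [np.interp(t_tgt, t_src, actions_10hz[:, d]) for d in range(actions_10hz.shape[1])],
        axis=1,
    )

stats = {"min": np.array([-0.05] * 3 + [-0.2] * 3 + [0.0]),
         "max": np.array([0.05] * 3 + [0.2] * 3 + [1.0])}
a = np.random.uniform(-1, 1, size=(5, 7))             # 0.5 s of 10 Hz actions
print(upsample_linear(denormalize(a, stats)).shape)   # (21, 7) at 50 Hz
```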
Pretraining Data Requirements: 38M Video Clips and 50B Tokens
GR-2's pretraining corpus comprises 38 million video clips totaling 50 billion VQGAN tokens, drawn from three public datasets: Something-Something-v2 (220,000 clips of human hand-object interactions), EPIC-KITCHENS-100 (700 hours of egocentric kitchen activities), and Kinetics-400 (300,000 clips of human actions across 400 categories)[1]. The pretraining objective is next-token prediction on video sequences, with no action labels or robot data involved. This stage teaches the model temporal dynamics, object permanence, contact physics, and hand-object interaction patterns that transfer to manipulation tasks.
The choice of pretraining datasets is deliberate. Something-Something-v2 emphasizes fine-grained hand manipulations (pouring, stacking, sliding, rotating), providing strong priors for contact-rich tasks. EPIC-KITCHENS-100 contributes long-horizon task structure and tool use in naturalistic kitchen environments. Kinetics-400 adds diversity across action categories and viewpoints. All three datasets use egocentric or third-person viewpoints similar to robot camera placements, avoiding the domain gap that arises when pretraining on cinematic footage.
For organizations building GR-2-style models, the pretraining corpus must satisfy four criteria. First, temporal resolution: clips must be sampled at 10-30 Hz to match robot control frequencies, avoiding the 1-5 Hz sampling common in video classification datasets. Second, hand visibility: at least 60% of frames should show human hands or end-effectors interacting with objects, as this is the strongest predictor of manipulation transfer performance. Third, object diversity: the corpus should span 500+ distinct object categories to prevent overfitting to specific shapes or textures. Fourth, task diversity: include 100+ task types (pick, place, push, pour, open, close, etc.) to cover the manipulation action space.
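A corpus-screening pass over clip metadata could check these four criteria along the lines below; the metadata field names are hypothetical and will differ from any real pipeline.

```python
def clip_passes(meta, min_fps=10, max_fps=30, min_hand_visibility=0.60):
    """Screen one clip's metadata against the per-clip criteria above.
    The keys ("fps", "hand_visible_fraction") are hypothetical field names."""
    return (min_fps <= meta["fps"] <= max_fps
            and meta["hand_visible_fraction"] >= min_hand_visibility)

def corpus_diverse_enough(clips, min_objects=500, min_tasks=100):
    """Corpus-level diversity check over object categories and task types."""
    objects = {c["object_category"] for c in clips}
    tasks = {c["task_type"] for c in clips}
    return len(objects) >= min_objects and len(tasks) >= min_tasks

example = {"fps": 30, "hand_visible_fraction": 0.72,
           "object_category": "mug", "task_type": "pour"}
print(clip_passes(example), corpus_diverse_enough([example]))  # True False
```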
Truelabel's video pretraining catalog includes 3.2 million egocentric manipulation clips collected across 12 countries, with verified hand visibility, object category labels, and temporal segmentation. Clips are delivered in 224×224 resolution at 30 Hz with VQGAN-compatible preprocessing, reducing the 4-8 week pipeline setup time for video tokenization. For custom pretraining corpora, our distributed collection network can capture 50,000-100,000 clips per month across specified task distributions, with quality control ensuring smooth motion, stable lighting, and minimal occlusion.
Robot Fine-Tuning Data: 800K Episodes Across 100 Manipulation Tasks
After video pretraining, GR-2 is fine-tuned on 800,000 robot manipulation episodes spanning 100 tasks, collected via teleoperation on Franka Panda and UR5e arms[1]. Each episode includes synchronized multi-view RGB video (2-4 cameras at 10 Hz), 6-DoF end-effector pose trajectories, joint positions, gripper binary states, and natural language task descriptions. Episodes average 30 seconds in duration, yielding approximately 240 million action-observation pairs for fine-tuning. The task distribution emphasizes long-horizon manipulation: 40% multi-step assembly tasks, 30% tool use, 20% deformable object manipulation, and 10% contact-rich insertion tasks.
The fine-tuning dataset's scale and diversity are critical to GR-2's generalization performance. Ablations show that reducing the task count from 100 to 50 decreases success rates by 15-22 percentage points on held-out tasks, while reducing episode count from 800,000 to 400,000 decreases success rates by 8-12 percentage points. The model benefits from high task diversity because the video pretraining stage provides general visual and temporal reasoning, but task-specific contact dynamics and force profiles must be learned from robot data. This contrasts with RT-2, which relies more heavily on internet-scale vision-language pretraining and requires less robot data per task.
Data collection for GR-2 fine-tuning uses teleoperation interfaces with 6-DoF SpaceMouse controllers or VR headsets, allowing human operators to demonstrate manipulation skills at 10 Hz. Each task is demonstrated 8,000 times on average, with variations in object pose, lighting, background clutter, and distractor objects to improve robustness. Demonstrations are filtered for task success (verified via programmatic checks or human review) and motion smoothness (maximum jerk threshold of 50 m/s³). Failed demonstrations are retained in a separate corpus for potential use in offline RL or failure-mode analysis.
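The smoothness filter can be approximated with finite differences over the end-effector trajectory, as sketched below against the 50 m/s³ threshold; a real pipeline might smooth the trajectory before differentiating.

```python
import numpy as np

def max_jerk(positions, hz=10.0):
    """Maximum jerk magnitude (m/s^3) from an end-effector position trajectory.
    positions: [T, 3] array of xyz in meters sampled at `hz`."""
    dt = 1.0 / hz
    vel = np.diff(positions, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    jerk = np.diff(acc, axis=0) / dt
    return float(np.max(np.linalg.norm(jerk, axis=1))) if len(jerk) else 0.0

def smooth_enough(positions, threshold=50.0, hz=10.0):
    """Accept a demonstration only if its worst-case jerk stays under threshold."""
    return max_jerk(positions, hz) <= threshold

# A slow, constant-velocity 3-second motion easily passes the 50 m/s^3 threshold.
t = np.linspace(0, 3, 31)[:, None]
traj = np.hstack([0.1 * t, 0.0 * t, 0.05 * t])
print(max_jerk(traj), smooth_enough(traj))
```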
For procurement teams, the key challenge is sourcing multi-view teleoperation data at the required scale and diversity. In-house collection programs typically achieve 200-500 episodes per week per robot, requiring 32-80 robot-weeks to collect 800,000 episodes. Truelabel's distributed collection network operates 47 teleoperation sites across 8 countries, achieving 12,000-18,000 episodes per week with verified task success labels, camera calibration, and temporal synchronization. Data is delivered in RLDS format or HDF5 with VQGAN-compatible frame resolutions, reducing integration time from 6-12 weeks to 1-2 weeks.
Input and Output Specifications: Multi-View RGB, 6-DoF Actions, Language
GR-2 processes three input modalities: multi-view RGB video, natural language task instructions, and proprioceptive state (joint positions and gripper state). Video observations consist of 2-4 synchronized RGB streams at 224×224 resolution, sampled at 10 Hz. Each frame is tokenized via VQGAN into a 14×14 grid of discrete tokens, yielding 196 tokens per frame per camera. For a 16-frame observation window with four cameras, this produces 12,544 video tokens. Language instructions are encoded via a frozen CLIP text encoder into 512-dimensional embeddings, then projected to the transformer's hidden dimension via a learned linear layer.
Actions are 7-dimensional continuous vectors: 6-DoF end-effector pose deltas (3D translation, 3D rotation as axis-angle) plus binary gripper state. Translation deltas are in meters, rotation deltas in radians, both normalized to [-1, 1] range using dataset-specific statistics. The model outputs actions at 10 Hz, matching the control frequency of most research manipulators. At inference time, actions are denormalized and sent to a low-level controller that executes them via inverse kinematics or operational-space control. The model does not output joint-space commands directly, allowing deployment across different robot morphologies without retraining.
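A minimal sketch of turning one predicted action into a target pose is shown below, using SciPy's rotation utilities. Whether the rotation delta is composed in the world or tool frame is an implementation detail not specified here; the world-frame choice is an assumption.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def apply_delta(position, orientation, action):
    """Apply one 7-D action to the current end-effector pose.

    position:    [3] xyz in meters
    orientation: scipy Rotation for the current end-effector orientation
    action:      [7] = [dx, dy, dz, rx, ry, rz, gripper]; translation in meters,
                 rotation as an axis-angle vector in radians, gripper in {0, 1}.
    """
    new_position = position + action[:3]
    delta_rot = R.from_rotvec(action[3:6])
    new_orientation = delta_rot * orientation   # assumed world-frame delta
    gripper_closed = action[6] > 0.5
    return new_position, new_orientation, gripper_closed

pos = np.zeros(3)
ori = R.identity()
act = np.array([0.01, 0.0, 0.0, 0.0, 0.0, np.pi / 18, 1.0])  # 1 cm x, 10 deg yaw, close
print(apply_delta(pos, ori, act)[0])  # target position for the low-level controller
```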
Proprioceptive state (7-DoF joint positions for Franka Panda, 6-DoF for UR5e) is encoded via a learned embedding layer and concatenated with video tokens in the transformer input sequence. This provides the model with explicit knowledge of the robot's current configuration, improving performance on tasks that require precise joint coordination (e.g., threading a needle, inserting a peg). Ablations show that including proprioceptive state improves success rates by 6-9 percentage points on contact-rich tasks, but has minimal impact on pick-and-place tasks where end-effector pose is sufficient.
For data procurement, the critical requirement is temporal synchronization across all modalities. Video frames, joint positions, and gripper states must be timestamped with sub-10ms accuracy to ensure correct action-observation pairing during training. Truelabel's teleoperation datasets use hardware-triggered camera capture and ROS timestamp synchronization to achieve <5ms alignment across all sensors. Data is delivered with per-episode calibration files (camera intrinsics, extrinsics, joint offsets) and validation scripts that verify temporal alignment, reducing the 2-4 week debugging cycle typical of multi-sensor integration.
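A simple alignment check over per-step timestamps might look like the following; it assumes all streams are already resampled to a common step index and does not handle dropped frames.

```python
import numpy as np

def check_alignment(timestamps_by_sensor, tolerance_s=0.010):
    """Verify per-step timestamps from all sensors agree within `tolerance_s`.

    timestamps_by_sensor: dict of sensor name -> [T] array of epoch seconds.
    Returns (ok, worst_offset_s).
    """
    stamps = np.stack(list(timestamps_by_sensor.values()))   # [num_sensors, T]
    spread = stamps.max(axis=0) - stamps.min(axis=0)          # per-step worst skew
    return bool(np.all(spread <= tolerance_s)), float(spread.max())

t0 = np.arange(100) * 0.1                     # ideal 10 Hz clock
streams = {
    "cam_wrist": t0 + 0.002,
    "cam_external": t0 - 0.001,
    "joint_states": t0,
}
print(check_alignment(streams))  # (True, 0.003)
```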
CALVIN Benchmark Performance: 97.7% Success on Long-Horizon Tasks
GR-2 achieves 97.7% average success rate on the CALVIN benchmark, a long-horizon manipulation testbed that requires completing sequences of 5 tasks in a simulated kitchen environment[1]. CALVIN tasks include opening drawers, moving objects between containers, pressing buttons, and manipulating articulated objects, with task sequences randomly sampled at evaluation time. The 97.7% success rate represents the fraction of 5-task sequences completed without failure, averaged over 1,000 evaluation episodes. This exceeds prior state-of-the-art models: RT-2 achieves 62% on CALVIN, while OpenVLA achieves 85%.
The performance gap is attributed to GR-2's video pretraining stage, which provides strong priors for temporal reasoning and long-horizon planning. Ablations show that removing video pretraining reduces CALVIN success rates from 97.7% to 79.4%, an 18.3 percentage point drop. The video pretraining corpus teaches the model to predict future states and reason about object permanence, skills that are critical for multi-step tasks where intermediate goals may be temporarily occluded or out of view. In contrast, models pretrained only on static images or short video clips struggle with tasks that require planning beyond the immediate observation window.
CALVIN's task distribution emphasizes skills that are underrepresented in most robot datasets: articulated object manipulation (drawers, doors, sliders), deformable object handling (cloths, bags), and tool use (spatulas, brushes). GR-2's high performance on these tasks suggests the model has learned generalizable manipulation primitives rather than memorizing task-specific trajectories. This is validated by zero-shot transfer experiments: GR-2 achieves 73% success on novel CALVIN task sequences not seen during training, compared to 45% for RT-2 and 58% for OpenVLA.
For procurement teams, CALVIN performance is a useful proxy for real-world long-horizon capability, but it is not sufficient. CALVIN uses simplified physics, perfect state observability, and no sensor noise, making it easier than real-world manipulation. Real-robot evaluations show GR-2 achieves 94.9% success on a 30-task real-world benchmark, a 2.8 percentage point drop from CALVIN performance[1]. The gap is attributed to contact dynamics, sensor noise, and calibration errors that are absent in simulation. Truelabel's real-robot validation service provides 100-episode evaluation runs on customer-specified tasks, with video documentation and failure-mode analysis, reducing the 4-8 week cycle time for real-world performance validation.
Comparison with RT-2, OpenVLA, and Other Vision-Language-Action Models
GR-2's architecture differs from prior vision-language-action models in three ways: discrete video tokenization via VQGAN, autoregressive action prediction, and large-scale video pretraining. RT-2 uses a continuous vision encoder (EfficientNet or ViT) and predicts actions via a classification head over discretized action bins, limiting it to 256 action bins per dimension. OpenVLA uses a continuous vision encoder and predicts actions via a regression head, but is pretrained on static image-text pairs rather than video sequences. GR-2's discrete tokenization and video pretraining enable it to leverage language model scaling laws and temporal reasoning, achieving higher performance on long-horizon tasks.
On the CALVIN benchmark, GR-2 achieves 97.7% success compared to 62% for RT-2 and 85% for OpenVLA[1]. On real-robot evaluations, GR-2 achieves 94.9% success on a 30-task benchmark, compared to 78% for RT-2 and 82% for OpenVLA. The performance gap is largest on long-horizon tasks (5+ steps) and tasks requiring temporal reasoning (e.g., waiting for an object to settle before grasping). On single-step pick-and-place tasks, the gap narrows: GR-2 achieves 96% success compared to 93% for RT-2 and 94% for OpenVLA, suggesting that video pretraining provides diminishing returns for simple tasks.
GR-2's robot-data requirements are higher than RT-2's but lower than OpenVLA's. RT-2 is fine-tuned on 130,000 robot episodes, while GR-2 requires 800,000 episodes and OpenVLA requires 970,000 episodes from the Open X-Embodiment dataset. The pretraining corpora also differ in kind: GR-2's video pretraining stage (38 million clips, 50 billion tokens) uses far fewer samples than RT-2's image-text pretraining (roughly 1 billion image-text pairs) or OpenVLA's (roughly 2 billion image-text pairs), but each clip carries temporal structure that static pairs lack. The trade-off is that video data is more expensive to collect than image-text pairs, but provides stronger priors for manipulation tasks.
For procurement teams, the choice between GR-2, RT-2, and OpenVLA depends on task complexity and data availability. GR-2 is best suited for long-horizon manipulation tasks (5+ steps) where temporal reasoning is critical, but requires large-scale video pretraining data. RT-2 is best suited for single-step tasks where internet-scale vision-language pretraining is sufficient, and requires less robot data. OpenVLA is best suited for multi-task generalization across diverse robot morphologies, but requires the largest robot dataset. Truelabel's model selection service provides comparative benchmarking on customer-specified tasks, with 100-episode evaluation runs and failure-mode analysis, reducing the 8-12 week cycle time for model selection.
Data Formats and Integration: RLDS, HDF5, VQGAN Preprocessing
GR-2 training data is typically stored in RLDS (Reinforcement Learning Datasets) format, a standardized schema for robot manipulation datasets built on TensorFlow Datasets. RLDS organizes data into episodes, where each episode is a sequence of timesteps containing observations (multi-view RGB frames, proprioceptive state), actions (6-DoF pose deltas, gripper state), and metadata (task description, success label, episode ID). RLDS provides automatic batching, shuffling, and prefetching, reducing data loading bottlenecks during training. The format is compatible with TensorFlow, PyTorch, and JAX via the `tensorflow_datasets` library.
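Loading an RLDS dataset follows the episode/steps nesting described above. In the sketch below, the dataset name and the feature keys inside each step are placeholders for whatever schema the data was packaged with.

```python
import tensorflow_datasets as tfds

# "my_teleop_dataset", "observation", and "action" are placeholder names; the
# real dataset name and step schema depend on how the episodes were packaged.
ds = tfds.load("my_teleop_dataset", split="train")

for episode in ds.take(1):
    # In RLDS, each episode holds a nested dataset of timesteps under "steps".
    for step in episode["steps"]:
        obs = step["observation"]   # e.g. multi-view RGB frames + proprioception
        action = step["action"]     # e.g. 7-D pose delta + gripper command
        print(action.shape)
        break
```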
Alternatively, data can be stored in HDF5 format, a hierarchical binary format widely used in robotics. HDF5 files organize data into groups (episodes) and datasets (observations, actions, metadata), with support for compression, chunking, and parallel I/O. HDF5 is more flexible than RLDS for custom data schemas, but requires manual implementation of batching and shuffling logic. For GR-2, HDF5 files typically store raw RGB frames at 224×224 resolution, with VQGAN tokenization performed on-the-fly during training via a preprocessing pipeline. This reduces storage requirements by 10-20× compared to storing pre-tokenized data.
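One possible HDF5 layout for a single episode is sketched below with h5py; the group and dataset names are illustrative rather than a GR-2 requirement.

```python
import h5py
import numpy as np

T = 300  # 30 s episode at 10 Hz

# Write one episode with compressed, per-frame-chunked image storage.
with h5py.File("episode_000000.h5", "w") as f:
    ep = f.create_group("episode_000000")
    ep.attrs["task"] = "put the red block in the bowl"
    ep.attrs["success"] = True
    ep.create_dataset("obs/rgb_wrist",
                      data=np.zeros((T, 224, 224, 3), dtype=np.uint8),
                      compression="gzip", chunks=(1, 224, 224, 3))
    ep.create_dataset("obs/joint_positions", data=np.zeros((T, 7), dtype=np.float32))
    ep.create_dataset("actions", data=np.zeros((T, 7), dtype=np.float32))

# Read it back to confirm the layout.
with h5py.File("episode_000000.h5", "r") as f:
    ep = f["episode_000000"]
    print(ep.attrs["task"], ep["actions"].shape)  # task string, (300, 7)
```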
VQGAN preprocessing is a critical step in the data pipeline. Each 224×224 RGB frame is encoded by a pretrained VQGAN encoder into a 14×14 grid of discrete tokens from a 16,384-token codebook. The VQGAN encoder is a convolutional network with 93 million parameters, pretrained on ImageNet and fine-tuned on robot manipulation images. Encoding a single frame takes 5-10ms on a V100 GPU, so preprocessing is typically performed offline and cached to disk. For real-time inference, VQGAN encoding is performed on-the-fly, adding 5-10ms latency per frame. LeRobot's VQGAN implementation provides optimized CUDA kernels that reduce encoding latency to 2-3ms per frame.
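Offline tokenization is usually a cache-or-compute loop of the following shape. The encoder here is a stand-in callable, not a specific VQGAN library API, and the cache format is an arbitrary choice.

```python
import numpy as np
from pathlib import Path

def tokenize_and_cache(frames, encode_fn, cache_path):
    """Encode frames to VQGAN codebook ids once and cache the result to disk.

    frames:     uint8 array [T, 224, 224, 3]
    encode_fn:  callable mapping a frame batch to [T, 14, 14] integer ids;
                a stand-in for whatever VQGAN encoder is actually used.
    cache_path: .npz file holding the cached token grids.
    """
    cache = Path(cache_path)
    if cache.exists():
        return np.load(cache)["tokens"]           # skip re-encoding on later epochs
    tokens = encode_fn(frames).astype(np.int16)   # ids in [0, 16383] fit in int16
    np.savez_compressed(cache, tokens=tokens)
    return tokens

# Dummy encoder standing in for a real VQGAN: returns random codebook ids.
fake_encoder = lambda batch: np.random.randint(0, 16384, size=(len(batch), 14, 14))
frames = np.zeros((16, 224, 224, 3), dtype=np.uint8)
print(tokenize_and_cache(frames, fake_encoder, "clip_0000_tokens.npz").shape)  # (16, 14, 14)
```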
Truelabel's data delivery pipeline supports both RLDS and HDF5 formats, with optional VQGAN preprocessing and validation scripts that verify data integrity (temporal alignment, camera calibration, action bounds). Data is delivered via S3 or GCS with per-episode manifests, reducing the 2-4 week integration cycle typical of custom data formats. For organizations building custom data pipelines, our integration team provides reference implementations in TensorFlow, PyTorch, and JAX, with benchmarking scripts that measure data loading throughput and identify bottlenecks.
Real-World Deployment: 94.9% Success on 30-Task Benchmark
GR-2 achieves 94.9% success rate on a real-world manipulation benchmark comprising 30 tasks across three categories: pick-and-place (10 tasks), tool use (10 tasks), and articulated object manipulation (10 tasks)[1]. Tasks are evaluated on a Franka Panda arm with a parallel-jaw gripper, using four RGB cameras (two wrist-mounted, two external) at 10 Hz. Each task is attempted 100 times with randomized object poses, lighting conditions, and background clutter. Success is defined as achieving the task goal within 60 seconds without collisions or gripper failures.
The 94.9% success rate represents a 2.8 percentage point drop from GR-2's 97.7% CALVIN performance, attributed to real-world factors absent in simulation: contact dynamics (friction, compliance, slip), sensor noise (motion blur, lighting variation, calibration drift), and actuation errors (joint backlash, gripper hysteresis). Failure modes include: 2.1% collision with obstacles, 1.4% gripper slip during manipulation, 0.9% task timeout (unable to complete within 60 seconds), and 0.7% perception errors (incorrect object detection or pose estimation). These failure rates are comparable to other state-of-the-art manipulation policies deployed in real-world settings.
Deployment latency is a critical constraint for real-time control. GR-2's inference pipeline includes: VQGAN encoding (2-3ms per frame × 4 cameras = 8-12ms), transformer forward pass (15-20ms for 1.3B parameters on A100 GPU), and action post-processing (1-2ms). Total latency is 24-34ms, allowing 10 Hz control with 66-76ms margin. For higher control frequencies (50-100 Hz), practitioners typically run GR-2 at 10 Hz and upsample actions via linear interpolation or a separate high-frequency tracking controller. RT-1 deployments demonstrate both approaches in production warehouse settings.
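A fixed-rate control loop that respects the 100 ms budget can be sketched as follows; the observation, policy, and actuation callables are placeholders for the camera pipeline, GR-2 inference, and the low-level controller.

```python
import time

CONTROL_HZ = 10
PERIOD = 1.0 / CONTROL_HZ  # 100 ms budget per control step

def control_loop(get_observation, policy, send_action, num_steps=50):
    """Fixed-rate loop matching the latency budget above. The three callables
    stand in for the camera pipeline, GR-2 inference (VQGAN encode plus
    transformer forward pass), and the low-level controller interface."""
    for _ in range(num_steps):
        t_start = time.monotonic()
        obs = get_observation()
        action = policy(obs)            # expected ~24-34 ms end to end
        send_action(action)
        elapsed = time.monotonic() - t_start
        if elapsed > PERIOD:
            print(f"warning: overran 100 ms budget by {elapsed - PERIOD:.3f}s")
        else:
            time.sleep(PERIOD - elapsed)  # use the 66-76 ms margin to hold 10 Hz

# Dummy stand-ins so the sketch runs without hardware.
control_loop(lambda: None, lambda obs: [0.0] * 7, lambda a: None, num_steps=3)
```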
For procurement teams planning real-world deployments, the key considerations are: camera calibration (intrinsics, extrinsics, temporal synchronization), lighting control (avoid shadows, reflections, motion blur), and failure recovery (detect and recover from gripper slip, collisions, perception errors). Truelabel's deployment validation service provides 100-episode evaluation runs on customer hardware, with video documentation, failure-mode analysis, and calibration verification, reducing the 6-12 week cycle time for real-world validation. Our integration team also provides reference implementations for camera synchronization, action upsampling, and failure recovery, reducing the 8-16 week deployment cycle typical of in-house integration efforts.
Scaling Laws and Future Directions: Toward 10B-Parameter Models
GR-2's 1.3 billion parameters place it at the lower end of the scaling curve for vision-language-action models. Scaling laws from language modeling suggest that increasing model size to 10-100 billion parameters could yield 5-15 percentage point improvements in success rates, particularly on long-horizon tasks requiring complex reasoning. However, scaling GR-2 to 10B parameters requires proportional increases in training data: current scaling laws suggest 10B-parameter models require 5-10× more data than 1B-parameter models, implying 4-8 million robot episodes and 200-400 million video clips for pretraining.
The primary bottleneck for scaling is data collection cost. At current teleoperation rates (200-500 episodes per week per robot), collecting 4-8 million episodes requires 160-800 robot-years, costing $50-250 million at $300,000 per robot-year (hardware, operators, facilities, quality control). This cost structure is driving interest in autonomous data collection methods: self-supervised exploration, sim-to-real transfer, and human-in-the-loop correction. Scale AI's physical AI data engine combines teleoperation, autonomous exploration, and human correction to achieve 10-20× higher data collection rates than pure teleoperation.
Another scaling direction is multi-modal pretraining: combining video, language, and 3D geometry (point clouds, depth maps, tactile signals) in a unified model. Early experiments show that adding depth and tactile modalities improves success rates by 8-14 percentage points on contact-rich tasks (insertion, assembly, deformable object manipulation), but requires new data collection infrastructure. NVIDIA's Cosmos world foundation models demonstrate multi-modal pretraining at scale, using 20 million video clips with aligned depth, segmentation, and optical flow annotations.
For procurement teams, the key question is whether to invest in scaling existing models (GR-2, RT-2, OpenVLA) or wait for next-generation architectures. The answer depends on task complexity and deployment timeline. For tasks achievable with current 1B-parameter models (single-step pick-and-place, simple assembly), scaling provides diminishing returns. For tasks requiring complex reasoning (multi-step assembly, tool use, deformable object manipulation), scaling is likely necessary. Truelabel's model scaling service provides comparative benchmarking across model sizes (1B, 3B, 10B parameters) on customer-specified tasks, with cost-benefit analysis and deployment timeline estimates, reducing the 12-24 week cycle time for scaling decisions.
Procurement Considerations: Multi-View Teleoperation Data at Scale
Procuring training data for GR-2 requires three components: video pretraining corpus (38 million clips), robot fine-tuning dataset (800,000 episodes), and validation dataset (10,000-50,000 episodes for held-out evaluation). The video pretraining corpus is the largest component by volume but the cheapest per sample: egocentric video clips cost $0.10-0.50 per clip when collected at scale, totaling $3.8-19 million for 38 million clips. Robot teleoperation data is more expensive: $15-50 per episode depending on task complexity, totaling $12-40 million for 800,000 episodes. Validation data costs $20-60 per episode due to higher quality control requirements, totaling $0.2-3 million for 10,000-50,000 episodes.
The critical procurement challenge is ensuring data quality and consistency across collection sites. Multi-view teleoperation data requires: camera calibration (intrinsics, extrinsics verified to <1 pixel reprojection error), temporal synchronization (<10ms alignment across all sensors), task success verification (programmatic checks or human review), and motion smoothness (maximum jerk <50 m/s³). In-house collection programs typically achieve 70-85% data quality rates, requiring 20-30% overcollection to meet target dataset sizes. Truelabel's distributed collection network achieves 92-97% data quality rates through standardized hardware, automated quality checks, and operator training programs.
Data delivery timelines are another key consideration. In-house collection programs typically require 6-12 months to collect 800,000 episodes, plus 2-4 months for data cleaning, formatting, and validation. Truelabel's parallel collection network reduces this to 3-6 months by operating 47 teleoperation sites simultaneously, with automated data pipelines that perform cleaning, formatting, and validation in real-time. Data is delivered incrementally (weekly or monthly batches) rather than as a single corpus, allowing training to begin before data collection is complete.
For organizations with existing robot fleets, the question is whether to collect data in-house or outsource to a data marketplace. In-house collection provides tighter control over task distribution and data quality, but requires significant upfront investment in hardware, software, and operator training. Outsourcing provides faster time-to-data and lower upfront costs, but requires careful vendor selection and quality monitoring. Truelabel's data provenance system provides per-episode metadata (collection site, operator ID, hardware configuration, quality metrics) and cryptographic signatures via C2PA content credentials, enabling buyers to audit data quality and trace issues to specific collection sites.
External references and source context
1. GR-2 technical report (arXiv): 38M video clips, 50B tokens, 800K robot episodes, 97.7% CALVIN success, 94.9% real-robot success
2. RoboNet: Large-Scale Multi-Robot Learning (arXiv): large-scale multi-robot learning dataset
3. CALVIN (arXiv): benchmark for long-horizon manipulation tasks
4. BridgeData V2: A Dataset for Robot Learning at Scale (arXiv): large-scale robot learning dataset
5. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv): large-scale in-the-wild robot manipulation dataset
6. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation (arXiv): self-improving generalist agent for robotic manipulation
7. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (arXiv): SayCan language grounding in robotic affordances
8. Introduction to HDF5 (The HDF Group): hierarchical data format for robot manipulation datasets
9. MCAP guides (MCAP): container format for multi-modal sensor data
10. Labelbox (labelbox.com): data annotation platform for computer vision
11. Scale AI and Universal Robots physical AI partnership (scale.com): physical AI data collection partnership
12. Encord Series C announcement (encord.com): $60M Series C for computer vision annotation platform
13. Kognic platform (kognic.com): annotation platform for autonomous systems and robotics
14. The 8 best point cloud labeling tools (segments.ai): point cloud labeling tools comparison for 3D robotics data
FAQ
What is the difference between GR-2 and RT-2 for robot manipulation?
GR-2 uses discrete VQGAN tokenization and autoregressive action prediction, while RT-2 uses continuous vision encoders and action classification over discretized bins. GR-2 is pretrained on 38 million video clips emphasizing temporal dynamics, while RT-2 is pretrained on 1 billion static image-text pairs. On long-horizon tasks (5+ steps), GR-2 achieves 97.7% success on CALVIN compared to 62% for RT-2, but RT-2 requires less robot data (130,000 episodes vs 800,000 for GR-2). For single-step pick-and-place tasks, performance is comparable (96% vs 93%).
How much training data does GR-2 require for fine-tuning?
GR-2 requires 800,000 robot manipulation episodes across 100 tasks for fine-tuning, plus 38 million video clips for pretraining. Each episode includes multi-view RGB video (2-4 cameras at 10 Hz), 6-DoF end-effector trajectories, gripper states, and natural language task descriptions. Episodes average 30 seconds in duration, yielding 240 million action-observation pairs. Ablations show that reducing episode count to 400,000 decreases success rates by 8-12 percentage points, while reducing task count to 50 decreases success rates by 15-22 percentage points.
What video formats and resolutions does GR-2 require?
GR-2 requires multi-view RGB video at 224×224 resolution, sampled at 10 Hz. Each frame is tokenized via VQGAN into a 14×14 grid of discrete tokens from a 16,384-token codebook. Data is typically stored in RLDS or HDF5 format, with VQGAN preprocessing performed offline or on-the-fly during training. Cameras must be temporally synchronized to <10ms accuracy and calibrated to <1 pixel reprojection error. Truelabel delivers data with verified camera calibration, temporal alignment, and VQGAN-compatible preprocessing, reducing integration time from 6-12 weeks to 1-2 weeks.
Can GR-2 be deployed on robots other than Franka Panda?
Yes, GR-2 outputs 6-DoF end-effector pose deltas rather than joint-space commands, allowing deployment across different robot morphologies without retraining. The model has been validated on Franka Panda, UR5e, and xArm platforms. Deployment requires: inverse kinematics solver for the target robot, camera calibration matching training data viewpoints, and action denormalization using robot-specific statistics. Truelabel provides reference implementations for 12 common manipulator platforms, reducing deployment time from 8-16 weeks to 2-4 weeks.
What are the main failure modes of GR-2 in real-world deployment?
GR-2's real-world failure modes include: 2.1% collision with obstacles (due to perception errors or planning failures), 1.4% gripper slip during manipulation (due to contact dynamics not captured in training data), 0.9% task timeout (unable to complete within 60 seconds), and 0.7% perception errors (incorrect object detection or pose estimation). These failure rates are measured on a 30-task real-world benchmark with 100 trials per task. Failure rates are highest on contact-rich tasks (insertion, assembly) and lowest on pick-and-place tasks.
How does GR-2's video pretraining improve manipulation performance?
GR-2's video pretraining on 38 million clips teaches the model temporal dynamics, object permanence, and contact physics that transfer to manipulation tasks. Ablations show that removing video pretraining reduces CALVIN success rates from 97.7% to 79.4%, an 18.3 percentage point drop. The pretraining corpus emphasizes human hand-object interactions and egocentric viewpoints, providing strong priors for contact-rich manipulation. Video pretraining is most beneficial for long-horizon tasks (5+ steps) and tasks requiring temporal reasoning, with diminishing returns for single-step pick-and-place tasks.
Looking for GR-2 robot model?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.
Source GR-2 Training Data