
Model Profile

Gato Training Data: Multi-Task Tokenization for Generalist Agents

Gato is DeepMind's 1.2B-parameter transformer trained on 604 distinct tasks spanning Atari games, image captioning, dialogue, and real-world robot manipulation. It tokenizes all modalities—RGB images, proprioceptive state, continuous actions, and text—into a unified sequence using mu-law encoding and 16×16 patch embeddings, demonstrating that a single set of weights can perform both digital and physical tasks at variable control frequencies from 5 Hz to 60 Hz.

Updated 2025-06-15 · By truelabel · Reviewed by truelabel

What Is Gato and Why It Matters for Physical AI

Gato, introduced by DeepMind in May 2022, is a generalist agent architecture that unifies 604 distinct tasks—ranging from Atari gameplay to robotic block stacking—under a single 1.2B-parameter transformer[1]. Unlike specialist models trained on narrow domains, Gato tokenizes all modalities into one unified sequence, combining a 32,000-token text vocabulary with 1,024 discrete bins for continuous values, enabling the same neural network weights to process text, images, proprioceptive state, and continuous control commands[1].

The robotics component of Gato's training corpus included real-world manipulation tasks on platforms such as the Sawyer arm and simulated environments from DeepMind Control Suite, with control frequencies ranging from 5 Hz for tabletop tasks to 60 Hz for high-speed Atari games[1]. This multi-task approach demonstrated that generalist agents could achieve 87% of expert performance across diverse benchmarks when trained on sufficiently varied data[1].

For teams building vision-language-action models or extending RT-1 and RT-2 architectures, understanding Gato's tokenization scheme is critical. The model's success validated the hypothesis that a unified sequence format—rather than task-specific heads—could scale across modalities, a principle now adopted by Open X-Embodiment and other cross-embodiment initiatives.

Gato's Tokenization Scheme: Mu-Law Encoding and Patch Embeddings

Gato's core innovation lies in its tokenization pipeline, which converts heterogeneous inputs into a uniform token sequence. RGB images are divided into 16×16 patches, each embedded with a compact ResNet-style encoder and fed to the transformer as patch embeddings[1]. Continuous proprioceptive state (joint positions, velocities) and action commands are mu-law encoded—a logarithmic compression technique borrowed from audio codecs—then uniformly discretized into 1024 bins per dimension[1].

Mu-law encoding applies the transformation f(x) = sign(x) · ln(1 + μ|x|) / ln(1 + μ), where μ = 100 is the default compression parameter. This preserves precision near zero while compressing large values, critical for joint velocity commands that span multiple orders of magnitude. After encoding, values are quantized into 1024 integer bins that occupy a dedicated block of the unified vocabulary, alongside the 32,000 text tokens.
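To make the transformation concrete, here is a minimal NumPy sketch of the encode-and-quantize step. The function names and the convention of normalizing inputs to [-1, 1] before encoding are illustrative assumptions, not drawn from DeepMind's code.

```python
import numpy as np

def mu_law_encode(x: np.ndarray, mu: float = 100.0) -> np.ndarray:
    """Mu-law companding: f(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu).
    Maps [-1, 1] to [-1, 1], preserving resolution near zero."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def discretize(y: np.ndarray, bins: int = 1024) -> np.ndarray:
    """Uniformly quantize encoded values in [-1, 1] into integer tokens {0..bins-1}."""
    y = np.clip(y, -1.0, 1.0)
    return np.minimum(((y + 1.0) / 2.0 * bins).astype(np.int64), bins - 1)

# Example: a 7-DOF joint-velocity command in rad/s, normalized by an assumed
# +/-3 rad/s bound before encoding.
velocities = np.array([0.01, -0.5, 1.2, 0.0, -2.8, 0.3, 0.05]) / 3.0
tokens = discretize(mu_law_encode(velocities))  # seven integers in [0, 1023]
```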

Text tokens—used for task conditioning and dialogue—are interleaved with observation and action tokens in a single autoregressive sequence. Separator tokens delineate modality boundaries, allowing the transformer to learn cross-modal dependencies without explicit architectural inductive biases. This design choice enabled Gato to train on 970,000 episodes across robotics, vision, and language tasks using a single loss function[1].
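The interleaving itself is simple to sketch. The separator ID, the offset placing discrete-value tokens after the text vocabulary, and the helper name below are all assumptions for illustration; the paper does not publish exact token IDs.

```python
SEPARATOR_ID = 33024     # assumed ID for the modality-boundary token (not published)
DISCRETE_OFFSET = 32000  # assumed offset placing the 1024 value bins after text IDs

def assemble_episode(task_text_ids, obs_tokens_per_step, action_bins_per_step):
    """Build one autoregressive sequence: task text first, then per-timestep
    observation tokens followed by a separator and the action tokens."""
    sequence = list(task_text_ids)
    for obs_tokens, action_bins in zip(obs_tokens_per_step, action_bins_per_step):
        sequence.extend(obs_tokens)                    # image patches + proprioception
        sequence.append(SEPARATOR_ID)                  # modality boundary
        sequence.extend(b + DISCRETE_OFFSET for b in action_bins)
    return sequence

# Toy episode: 3 text tokens, then two timesteps with 2 observation tokens
# and 2 action bins each (all IDs are placeholders).
seq = assemble_episode(
    task_text_ids=[17, 802, 5],
    obs_tokens_per_step=[[101, 102], [103, 104]],
    action_bins_per_step=[[512, 700], [511, 699]],
)
```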

For procurement teams evaluating DROID or BridgeData V2, Gato's tokenization scheme sets a baseline: datasets must provide raw continuous data (joint states, actions) alongside metadata sufficient to reconstruct tokenized sequences. Truelabel's physical AI data marketplace delivers both formats, enabling experimentation with alternative compression parameters or vocabulary sizes.

Robotics Data Requirements: Tasks, Platforms, and Episode Structure

Gato's robotics training corpus spanned real-world manipulation tasks on the Sawyer arm (block stacking, pick-and-place) and simulated environments from DeepMind Control Suite (locomotion, reaching)[1]. Each episode consisted of RGB observations at 64×64 resolution, 7-DOF proprioceptive state, and continuous action commands discretized into 1024 bins per dimension. Control frequencies varied by task: 5–10 Hz for tabletop manipulation, 20 Hz for simulated reaching, and up to 60 Hz for Atari games.

The model was trained on approximately 970,000 episodes across all 604 tasks, with robotics tasks contributing approximately 170,000 episodes based on the reported task distribution[1]. Episode lengths ranged from 20 timesteps for simple reaching tasks to 200+ timesteps for multi-stage manipulation sequences. Each episode included task-conditioning text tokens (e.g., "stack red block on blue block") prepended to the observation-action sequence.

For teams building Gato-style architectures, three data requirements emerge. First, multi-platform coverage: training on a single robot morphology limits generalization, as demonstrated by RoboCat's need for cross-embodiment data. Second, task diversity: Gato's performance on novel tasks correlated with the number of related tasks in the training set, suggesting that 50+ distinct manipulation primitives are a practical minimum. Third, tokenization metadata: datasets must include camera intrinsics, joint limits, and action space bounds to enable correct mu-law encoding and patch extraction.
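A hypothetical per-episode metadata record covering the third requirement might look like the following; every field name here is illustrative rather than a standard schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class EpisodeMetadata:
    camera_intrinsics: Tuple[float, float, float, float]  # (fx, fy, cx, cy) in pixels
    joint_limits: List[Tuple[float, float]]               # per-joint (low, high), rad
    action_bounds: List[Tuple[float, float]]              # per-dim bounds pre-encoding
    control_hz: float                                     # e.g. 10.0 for tabletop tasks
    task_text: str                                        # language conditioning string

meta = EpisodeMetadata(
    camera_intrinsics=(525.0, 525.0, 320.0, 240.0),
    joint_limits=[(-3.14, 3.14)] * 7,
    action_bounds=[(-3.0, 3.0)] * 7,
    control_hz=10.0,
    task_text="stack red block on blue block",
)
```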

Truelabel's marketplace provides physical AI datasets across WidowX 250, Franka Emika, and UR5e platforms, with per-episode task annotations and raw sensor streams in RLDS format. Buyers can specify custom tokenization schemes or request pre-tokenized sequences matching Gato's 1024-bin discretization.

Architecture Overview: Transformer Backbone and Multi-Task Training

Gato employs a standard decoder-only transformer with 1.2 billion parameters, organized as 24 layers with 16 attention heads and a hidden dimension of 2048[1]. The model processes sequences of up to 1024 tokens, with longer episodes truncated or split across multiple forward passes. All modalities—text, images, proprioceptive state, actions—share the same embedding layer and positional encodings, enabling the transformer to learn cross-modal dependencies without task-specific architectural components.

Training used a next-token prediction objective across all 604 tasks simultaneously, with task sampling weighted by dataset size. Robotics tasks were upsampled to prevent dominance by high-volume Atari and text datasets, ensuring that manipulation skills received sufficient gradient updates. The model was trained for 200 million frames (approximately 1 million episodes) using the Adam optimizer with a learning rate of 0.0001 and a batch size of 256 sequences.
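The exact sampling weights are not published. The sketch below shows the general recipe of episode-count-proportional sampling with a domain upsampling factor; the counts come from the distribution described later in this profile, and the 2.0× robotics boost is an assumed value, not from the paper.

```python
import numpy as np

# Approximate episode counts per domain (from this profile); boost is assumed.
counts = {"atari": 600_000, "vision_language": 200_000, "robotics": 170_000}
boost = {"atari": 1.0, "vision_language": 1.0, "robotics": 2.0}

domains = list(counts)
weights = np.array([counts[d] * boost[d] for d in domains], dtype=np.float64)
probs = weights / weights.sum()

rng = np.random.default_rng(0)
batch_domains = rng.choice(domains, size=256, p=probs)  # one domain draw per sequence
```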

Gato's multi-task training regime differs from specialist models like Diffusion Policy or ACT, which train separate networks per task or robot platform. The generalist approach trades per-task performance for cross-task transfer: Gato achieved 87% of expert performance on held-out robotics tasks, compared to 95%+ for specialist models, but required no task-specific fine-tuning[1].

For procurement teams, this architecture imposes two data constraints. First, episode-level task labels: the model requires text-based task conditioning for every episode, not just dataset-level metadata. Second, balanced task coverage: training on 1,000 episodes of block stacking and 10 episodes of drawer opening will bias the model toward stacking, as task sampling is proportional to episode count. Truelabel's data provenance tracking ensures that buyers receive balanced task distributions with per-episode annotations.

Comparison with Vision-Language-Action Models: Gato vs. RT-1, RT-2, and OpenVLA

Gato's generalist approach predated the vision-language-action (VLA) paradigm exemplified by RT-1, RT-2, and OpenVLA, but shares key architectural principles. All four models tokenize observations and actions into discrete sequences, use transformer backbones, and train on multi-task datasets. However, VLAs incorporate pre-trained vision-language models (e.g., PaLI, LLaMA) to ground language instructions in visual observations, whereas Gato learns language-vision alignment from scratch.

RT-1, trained on 130,000 robot demonstrations across 700+ tasks, achieved 97% success on seen tasks using an ImageNet-pretrained EfficientNet backbone conditioned on language embeddings[2]. RT-2 extended this by initializing from a 55B-parameter vision-language model, enabling zero-shot generalization to novel objects and instructions[3]. OpenVLA, released in 2024, is a 7B-parameter model that pairs a fused SigLIP/DINOv2 vision encoder with a Llama 2 language backbone, achieving state-of-the-art performance on Open X-Embodiment benchmarks with 970,000 episodes of training data[4].

Gato's 1.2B-parameter count and from-scratch training make it less sample-efficient than VLAs: it required 970,000 episodes to match RT-1's performance on a subset of tasks, whereas RT-1 used 130,000 episodes[1][2]. However, Gato's unified tokenization scheme enables training on non-robotics tasks (Atari, dialogue) that VLAs cannot natively process, suggesting a role for generalist agents in multi-domain applications.

For buyers evaluating training data, the choice between Gato-style and VLA architectures determines dataset requirements. Gato needs raw multi-modal sequences (RGB, proprioception, actions) with task-conditioning text. VLAs need language-annotated demonstrations with natural-language instructions per episode, as provided by DROID and BridgeData V2. Truelabel supports both formats, with per-episode language annotations and tokenization metadata.

Training Data Volumes and Task Distribution

Gato's 604-task training corpus included approximately 970,000 episodes, distributed across Atari games (450 tasks, ~600,000 episodes), image captioning and dialogue (100 tasks, ~200,000 episodes), and robotics (54 tasks, ~170,000 episodes)[1]. The robotics subset spanned real-world manipulation on the Sawyer arm (block stacking, pick-and-place, drawer opening) and simulated tasks from DeepMind Control Suite (reaching, pushing, locomotion).

Episode lengths varied by task complexity: simple reaching tasks averaged 20–30 timesteps, block stacking tasks averaged 50–100 timesteps, and multi-stage manipulation sequences (e.g., open drawer, grasp object, close drawer) exceeded 200 timesteps. At 10 Hz control frequency, a 100-timestep episode represents 10 seconds of real-world interaction. At an average of 75 timesteps per episode, the robotics subset contributes approximately 170,000 episodes × 75 timesteps/episode ≈ 13 million frames to the training set.

For comparison, RT-1 trained on 130,000 episodes (~200 million frames), Open X-Embodiment aggregated 970,000 episodes (~1 billion frames), and DROID collected 76,000 episodes (~350 million frames) of teleoperation data[2][5][6]. Gato's robotics data volume sits at the lower end of this range, reflecting its 2022 release date before large-scale cross-embodiment datasets became available.

Procurement teams targeting Gato-scale training should budget for 100,000–200,000 episodes across 30–50 distinct tasks to match the original model's robotics coverage. Truelabel's marketplace offers pre-packaged task bundles (e.g., "tabletop manipulation: 50 tasks, 150,000 episodes") and custom collection services for novel task specifications.

Mu-Law Encoding Parameters and Discretization Trade-Offs

Gato's mu-law encoding uses μ = 100 and 1024 discretization bins, balancing precision near zero with dynamic range for large values[1]. The choice of μ controls compression strength: as μ approaches 0 the transform becomes linear (no compression), μ = 255 (standard in audio codecs) compresses aggressively, and μ = 100 provides moderate compression suitable for joint velocities spanning ±3 rad/s.

Discretization bin count determines the discrete-value vocabulary size and reconstruction error. With 1024 bins, a joint velocity range of [-3, 3] rad/s yields an average resolution of 6/1024 ≈ 0.006 rad/s per bin; because of the mu-law companding, effective bins are finer near zero and coarser at the extremes. A 7-DOF arm emits 7 action tokens per timestep, each drawn from the same 1024 bins, alongside the 32,000-token text vocabulary. Increasing the bin count to 2048 halves the average bin width but doubles the discrete-value vocabulary, enlarging the output distribution the model must fit.

Alternative tokenization schemes include vector quantization (used by RoboCat), which learns a discrete codebook via VQ-VAE training, and continuous embeddings (used by Diffusion Policy), which bypass discretization entirely. Vector quantization achieves lower reconstruction error (0.002 rad/s for RoboCat vs. 0.006 rad/s for Gato) but requires pre-training a separate encoder, adding pipeline complexity[7].

For teams experimenting with tokenization, truelabel delivers raw continuous data (joint positions, velocities, torques) alongside pre-tokenized sequences, enabling A/B testing of μ values (50, 100, 255) and bin counts (512, 1024, 2048). Buyers specify tokenization parameters at order time, and datasets include reconstruction error metrics (mean absolute error, max error) for each configuration.
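Given raw continuous data, this kind of A/B comparison can be run in a few lines of NumPy. The sketch below assumes actions are pre-normalized to [-1, 1]; the helper names are illustrative.

```python
import numpy as np

def mu_law(x, mu):
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def inv_mu_law(y, mu):
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

def round_trip_error(x, mu, bins):
    """Encode -> quantize -> dequantize -> decode, then report MAE and max error."""
    y = mu_law(np.clip(x, -1.0, 1.0), mu)
    tokens = np.minimum(((y + 1.0) / 2.0 * bins).astype(np.int64), bins - 1)
    y_hat = (tokens + 0.5) / bins * 2.0 - 1.0
    err = np.abs(x - inv_mu_law(y_hat, mu))
    return err.mean(), err.max()

x = np.random.default_rng(0).uniform(-1.0, 1.0, size=100_000)  # normalized actions
for mu in (50, 100, 255):
    for bins in (512, 1024, 2048):
        mae, mx = round_trip_error(x, mu, bins)
        print(f"mu={mu:>3} bins={bins:>4}  MAE={mae:.5f}  max={mx:.5f}")
```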

Image Tokenization: Patch Embeddings and Resolution Trade-Offs

Gato tokenizes RGB images by dividing them into 16×16 patches, embedding each patch with a compact ResNet-style encoder, and feeding the resulting vectors to the transformer as patch embeddings[1]. At 64×64 input resolution, this yields 4×4 = 16 image tokens per observation. Higher resolutions increase token count quadratically: 128×128 images produce 64 tokens, 256×256 produce 256 tokens, consuming a larger fraction of the 1024-token sequence budget.
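The patch split itself is a reshape; here is a minimal sketch (the ResNet embedding step is omitted, and the function name is illustrative).

```python
import numpy as np

def extract_patches(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an HxWxC image into non-overlapping patch x patch tiles,
    returned in row-major order as (num_patches, patch, patch, C)."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "resolution must be a patch multiple"
    tiles = image.reshape(h // patch, patch, w // patch, patch, c)
    return tiles.transpose(0, 2, 1, 3, 4).reshape(-1, patch, patch, c)

frame = np.zeros((64, 64, 3), dtype=np.uint8)  # one 64x64 RGB observation
patches = extract_patches(frame)               # shape (16, 16, 16, 3): 16 tokens
```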

Gato's patch encoder is trained jointly with the rest of the model rather than initialized from a large pre-trained vision backbone, which keeps the pipeline simple but forgoes web-scale visual priors. In contrast, RT-2 fine-tunes a pre-trained vision transformer (ViT-22B) on robotics data, achieving 12% higher success rates on novel objects[3]. OpenVLA uses a SigLIP vision encoder with 384×384 input resolution, producing 576 image tokens per observation[4].

For procurement, image tokenization imposes two dataset requirements. First, camera calibration metadata: patch extraction requires known intrinsics (focal length, principal point) to align patches with 3D geometry. Second, multi-view coverage: single-camera datasets limit spatial reasoning, as demonstrated by DROID's use of wrist-mounted and third-person cameras to improve manipulation success rates by 18%[6].

Truelabel's robot datasets include calibrated multi-view RGB streams (wrist, shoulder, third-person) at 640×480 resolution, with per-frame camera poses and intrinsics. Buyers can request pre-extracted 16×16 patches or raw images for custom tokenization pipelines. All datasets include C2PA provenance metadata to verify camera authenticity and prevent synthetic data contamination.

Task Conditioning and Language Annotations

Gato uses text-based task conditioning by prepending natural-language instructions (e.g., "stack red block on blue block") to each episode's observation-action sequence[1]. Text tokens occupy the same unified sequence as image and action tokens, enabling the transformer to learn language-grounded policies without separate instruction encoders. Task descriptions average 5–10 tokens, consuming <1% of the 1024-token sequence budget.

The original Gato paper does not specify annotation guidelines, but subsequent VLA models provide templates. RT-1 uses imperative commands ("pick apple", "move can to top drawer"), RT-2 adds object attributes ("pick the red apple"), and OpenVLA includes spatial relations ("place cup to the left of plate")[2][3][4]. Annotation consistency is critical: mixing imperative and declarative phrasings ("pick apple" vs. "the robot should pick the apple") degrades instruction-following accuracy by 8–15%.

For datasets lacking language annotations, automated annotation pipelines can generate task descriptions from episode metadata. Open X-Embodiment used GPT-4 to generate 970,000 instructions from object labels and action sequences, achieving 92% human-evaluator agreement[5]. However, automated annotations miss nuances like object state ("open drawer" vs. "close drawer") and spatial constraints ("place gently" vs. "place quickly"), requiring human review for safety-critical applications.

Truelabel's marketplace provides human-annotated task descriptions for all robot episodes, with annotation guidelines matching RT-1/RT-2 templates. Buyers can request custom annotation schemas (e.g., adding object attributes, spatial relations) or automated annotation with human verification for cost-sensitive projects.

Cross-Embodiment Generalization and Platform Coverage

Gato's robotics training included only two platforms—Sawyer arm and simulated DeepMind Control Suite agents—limiting cross-embodiment generalization[1]. Subsequent work demonstrated that training on diverse morphologies improves zero-shot transfer: RoboCat trained on 5 robot platforms and achieved 36% success on a novel 6th platform without fine-tuning, compared to <5% for single-platform models[7].

Open X-Embodiment aggregated data from 22 robot types (arms, grippers, mobile manipulators) across 527 tasks, demonstrating that cross-embodiment datasets enable 50% higher success rates on held-out platforms than single-platform datasets[5]. The dataset includes WidowX 250, Franka Emika, UR5e, and Google's custom manipulation platforms, with per-episode morphology metadata (DOF count, joint limits, gripper type).

For procurement, cross-embodiment training requires morphology-normalized action spaces. Gato's mu-law encoding handles different joint ranges (e.g., ±π for revolute joints, ±0.5 m for prismatic joints) by normalizing to [-1, 1] before discretization. However, gripper actions vary by platform: binary open/close for parallel-jaw grippers, continuous aperture for adaptive grippers, and 6-DOF pose for dexterous hands. Datasets must include gripper-type metadata to enable correct action decoding.
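A sketch of that normalization step, with per-platform bounds supplied as metadata; the function name and the example bounds are illustrative.

```python
import numpy as np

def normalize_action(action, low, high):
    """Map raw commands into [-1, 1] using per-platform bounds so that
    mu-law encoding sees a consistent range across embodiments."""
    low, high = np.asarray(low, dtype=float), np.asarray(high, dtype=float)
    return np.clip(2.0 * (np.asarray(action) - low) / (high - low) - 1.0, -1.0, 1.0)

# A revolute joint spanning +/-pi and a parallel-jaw gripper aperture in metres.
arm = normalize_action([1.57, -0.5], low=[-3.1416, -3.1416], high=[3.1416, 3.1416])
grip = normalize_action([0.04], low=[0.0], high=[0.08])  # maps to 0.0 (mid-aperture)
```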

Truelabel's marketplace offers cross-embodiment task bundles spanning WidowX 250, Franka Emika, UR5e, and Kinova Gen3, with morphology-normalized action spaces and per-episode platform labels. Buyers can filter by DOF count (6-DOF, 7-DOF), gripper type (parallel-jaw, suction, adaptive), and workspace volume (tabletop, mobile, aerial).

Simulation-to-Real Transfer and Synthetic Data Integration

Gato's training included simulated tasks from DeepMind Control Suite, demonstrating that synthetic data can supplement real-world demonstrations[1]. However, the paper does not report sim-to-real transfer metrics, leaving open questions about the value of simulated episodes for real-world deployment. Subsequent work quantified this trade-off: domain randomization enables 70–80% real-world success when training exclusively on simulation, compared to 95%+ for real-world data[8].

RoboNet combined 113,000 real-world episodes with 50,000 simulated episodes, achieving 5% higher success rates than real-only training on novel objects[9]. The synthetic data was generated using physics randomization—varying object mass, friction, and lighting—to bridge the reality gap. However, simulated data contributed <10% of gradient updates due to domain shift, suggesting diminishing returns beyond 30% synthetic content[9].

For procurement, synthetic data offers cost advantages (10× cheaper per episode than teleoperation) but requires domain-gap quantification. Buyers should request real-world validation metrics (success rate, trajectory error) for any dataset mixing simulation and real data. Truelabel's marketplace labels all episodes with data-source provenance (real, simulated, or hybrid), enabling buyers to filter by source type and compare performance across data mixes.

Truelabel also provides sim-to-real bridging services: buyers supply simulation parameters (object models, physics settings), and truelabel collects real-world validation episodes to quantify domain gap. This service is critical for teams using NVIDIA Cosmos or other synthetic data generators, where reality-gap metrics determine deployment readiness.

Data Formats and Tooling: RLDS, HDF5, and Tokenization Pipelines

Gato's training data was stored in an internal DeepMind format, but modern implementations use RLDS (Reinforcement Learning Datasets), a TensorFlow-based format that stores episodes as sequences of (observation, action, reward) tuples[10]. RLDS supports arbitrary observation modalities (RGB, depth, proprioception) and action spaces (continuous, discrete, hybrid), making it suitable for multi-task datasets like Open X-Embodiment.
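As a shape reference, an RLDS episode looks roughly like the nested structure below. The observation keys and metadata fields vary by dataset; `is_first`/`is_last`/`is_terminal` are standard RLDS step flags, while the rest is illustrative.

```python
import numpy as np

episode = {
    "steps": [
        {
            "observation": {
                "image": np.zeros((64, 64, 3), dtype=np.uint8),  # RGB frame
                "state": np.zeros(7, dtype=np.float32),          # joint positions
            },
            "action": np.zeros(7, dtype=np.float32),  # continuous, pre-tokenization
            "reward": 0.0,
            "discount": 1.0,
            "is_first": True,   # standard RLDS step flags
            "is_last": False,
            "is_terminal": False,
        },
        # ... one dict per timestep ...
    ],
    "episode_metadata": {"task_text": "stack red block on blue block"},
}
```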

Alternative formats include HDF5 (used by DROID and RoboNet), MCAP (used by ROS 2 ecosystems), and Parquet (used by Hugging Face Datasets)[6][9]. HDF5 offers efficient random access for large datasets (>1 TB), but its chunked, row-oriented storage typically compresses less effectively than Parquet's columnar encoding, increasing storage costs by 30–50%. MCAP provides lossless ROS message serialization but requires custom readers for non-ROS frameworks.

Tokenization pipelines convert raw RLDS episodes into Gato-compatible sequences. LeRobot provides reference implementations for mu-law encoding, patch extraction, and sequence assembly, with configurable parameters (μ, bin count, patch size)[11]. The pipeline outputs tokenized sequences in Hugging Face Datasets format, enabling direct integration with transformer training loops.

Truelabel delivers datasets in RLDS, HDF5, and Parquet formats, with optional pre-tokenization using LeRobot pipelines. Buyers specify tokenization parameters (μ = 50/100/255, bins = 512/1024/2048, patch size = 8×8/16×16) at order time, and datasets include reconstruction error reports (mean absolute error, max error) for each configuration. All datasets include provenance metadata (collector ID, timestamp, camera serial numbers) to support audit trails.

Procurement Considerations: Licensing, Provenance, and Compliance

Gato's training data was proprietary to DeepMind, but commercial datasets require explicit licensing terms. Most robotics datasets use Creative Commons licenses: CC BY 4.0 permits redistribution and commercial use with attribution, while CC BY-NC 4.0 prohibits commercial use[12][13]. RoboNet uses a custom license prohibiting commercial deployment without written consent, while Open X-Embodiment uses CC BY 4.0, permitting commercial use subject to attribution[14][5].

Provenance tracking is critical for compliance with emerging AI regulations. The EU AI Act (Regulation 2024/1689) requires high-risk AI systems to document training data sources, collection methods, and annotator consent[15]. Data provenance systems capture this metadata at collection time, including collector demographics, task instructions, and equipment serial numbers.

For procurement, three licensing questions determine dataset suitability. First, commercial use rights: does the license permit model deployment in commercial products? Second, derivative work restrictions: can the dataset be combined with proprietary data or used to train models sold as services? Third, attribution requirements: must the dataset be cited in model documentation or product disclosures?

Truelabel's marketplace provides commercial-use licenses for all datasets, with explicit permissions for model training, deployment, and derivative works. Every dataset includes a machine-readable provenance record (PROV-O format) documenting collector consent, equipment calibration, and task instructions, ensuring compliance with EU AI Act Article 10 data governance requirements[15].


External references and source context

  1. A Generalist Agent

    Gato architecture, 604 tasks, 1.2B parameters, tokenization scheme, training corpus statistics

    arXiv
  2. RT-1: Robotics Transformer for Real-World Control at Scale

    RT-1 architecture, 130K episodes, 700 tasks, 97% success rate, EfficientNet-FiLM backbone

    arXiv
  3. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 vision-language-action model, 55B parameters, zero-shot generalization, 12% improvement over RT-1

    arXiv
  4. OpenVLA: An Open-Source Vision-Language-Action Model

    OpenVLA vision-language-action model, 7B parameters, 970K episodes, Open X-Embodiment benchmarks

    arXiv
  5. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment dataset, 970K episodes, 22 robot types, 527 tasks, cross-embodiment generalization

    arXiv
  6. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID dataset, 76K episodes, multi-view cameras, 18% success improvement, teleoperation data

    arXiv
  7. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation

    RoboCat generalist agent, 5 robot platforms, 36% zero-shot transfer, vector quantization, 0.002 rad/s error

    arXiv
  8. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World

    Domain randomization technique, 70-80% sim-to-real success, physics parameter variation

    arXiv
  9. RoboNet: Large-Scale Multi-Robot Learning

    RoboNet dataset, 113K real episodes, 50K simulated episodes, 5% improvement with synthetic data

    arXiv
  10. RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning

    RLDS format specification, TensorFlow Datasets, episode structure, multi-modal observations

    arXiv
  11. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch

    LeRobot framework, tokenization pipelines, mu-law encoding, Hugging Face integration

    GitHub
  12. Attribution 4.0 International deed

    Creative Commons Attribution 4.0 license, commercial use permissions, attribution requirements

    Creative Commons
  13. Creative Commons Attribution-NonCommercial 4.0 International deed

    Creative Commons Attribution-NonCommercial 4.0 license, commercial use restrictions

    Creative Commons
  14. RoboNet dataset license

    RoboNet custom license, commercial use restrictions, written consent requirement

    GitHub raw content
  15. Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence

    EU AI Act Regulation 2024/1689, Article 10 data governance, training data documentation requirements

    EUR-Lex

FAQ

What tokenization scheme does Gato use for continuous robot actions?

Gato applies mu-law encoding with μ = 100 to continuous joint velocities and proprioceptive state, then uniformly discretizes the encoded values into 1024 bins per dimension. This produces integer tokens in the range [0, 1023] that occupy a dedicated block of the same unified vocabulary used for text and image tokens. The mu-law transformation preserves precision near zero while compressing large values, critical for joint commands spanning multiple orders of magnitude. Truelabel datasets include both raw continuous data and pre-tokenized sequences with configurable μ and bin-count parameters.

How many robot demonstration episodes does Gato-scale training require?

Gato's original robotics training corpus included approximately 170,000 episodes across 54 tasks, spanning real-world manipulation on the Sawyer arm and simulated DeepMind Control Suite environments. For teams building similar generalist agents, 100,000–200,000 episodes across 30–50 distinct tasks provides comparable coverage. Episode lengths vary by task complexity: simple reaching tasks average 20–30 timesteps, block stacking averages 50–100 timesteps, and multi-stage manipulation exceeds 200 timesteps. Truelabel's marketplace offers pre-packaged task bundles matching these volume and diversity targets.

Can Gato's architecture handle multi-view camera inputs?

Gato's original implementation processed single RGB images per timestep, tokenized as 16×16 patches. However, the architecture supports multi-view inputs by concatenating patch tokens from multiple cameras into the same sequence. For example, a wrist camera (64×64, 16 patches) and shoulder camera (64×64, 16 patches) produce 32 image tokens per timestep, consuming 3% of the 1024-token sequence budget. Multi-view coverage improves spatial reasoning: DROID demonstrated 18% higher manipulation success using wrist and third-person cameras compared to single-view setups. Truelabel datasets include calibrated multi-view streams with per-frame camera poses.

What is the difference between Gato and vision-language-action models like RT-2?

Gato trains a transformer from scratch on multi-modal sequences (text, images, actions), learning language-vision alignment during training. RT-2 initializes from a pre-trained 55B-parameter vision-language model (PaLI), inheriting web-scale visual and linguistic knowledge before fine-tuning on robotics data. This makes RT-2 more sample-efficient (130,000 episodes vs. Gato's 970,000) and better at zero-shot generalization to novel objects. However, Gato's unified tokenization enables training on non-robotics tasks (Atari, dialogue) that VLAs cannot natively process. For procurement, Gato-style datasets need raw multi-modal sequences with task-conditioning text, while VLA datasets need per-episode natural-language instructions.

How does truelabel ensure Gato-compatible datasets meet tokenization requirements?

Truelabel delivers datasets in RLDS format with raw continuous data (joint positions, velocities, torques) and optional pre-tokenized sequences using LeRobot pipelines. Buyers specify tokenization parameters at order time: mu-law μ (50, 100, 255), discretization bins (512, 1024, 2048), and image patch size (8×8, 16×16). Every dataset includes reconstruction error metrics (mean absolute error, max error) for each configuration, enabling A/B testing of tokenization schemes. All datasets include camera calibration metadata (intrinsics, extrinsics) and per-episode task annotations (natural-language instructions, object labels) required for Gato-style multi-task training.

What licensing terms apply to Gato-compatible training datasets?

Truelabel's marketplace provides commercial-use licenses for all datasets, permitting model training, deployment, and derivative works; unlike stock CC BY 4.0, no attribution is required in production systems. Every dataset includes machine-readable provenance records (PROV-O format) documenting collector consent, equipment serial numbers, and task instructions, ensuring compliance with EU AI Act Article 10 data governance requirements. Buyers receive perpetual licenses with no per-model or per-deployment fees. For custom collections, truelabel negotiates exclusive licensing terms that prohibit resale to competitors while permitting unlimited internal use and model commercialization.

Looking for gato training data?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

Source Gato-Compatible Robot Data