Foundation Model
RoboCat Training Data: Cross-Embodiment Datasets for Self-Improving Agents
RoboCat is DeepMind's self-improving generalist agent that learns manipulation across multiple robot embodiments by bootstrapping from seed demonstrations and iteratively refining its policy through self-generated rollouts. The model requires 253+ task demonstrations per embodiment, multi-view RGB video at 5-10 Hz, 6-DoF end-effector actions discretized into 1024 bins, and binary success labels for 500+ rollout episodes per task. Truelabel's marketplace supplies the high-quality seed data—100% successful completions, diverse object configurations, hardware-synchronized multi-camera recording—that RoboCat's self-improvement loop cannot generate autonomously.
Quick facts
- Model class: Foundation Model
- Primary focus: RoboCat training data
- Last reviewed: 2025-06-15
What RoboCat Is and Why Seed Data Quality Determines Self-Improvement Success
RoboCat is a self-improving foundation agent published by DeepMind in June 2023 that learns manipulation policies across multiple robot embodiments without per-task retraining. The model ingests multi-view RGB observations, proprioception, and natural language task descriptions through a Gato-style vision-language-action architecture, then outputs discretized end-effector actions at 5-10 Hz control frequency[1].
The self-improvement loop works by generating rollout episodes on new tasks, filtering successful attempts via a learned classifier, and retraining on the augmented dataset. DeepMind reported 253 seed demonstrations per task across 4 robot platforms (Sawyer, KUKA iiwa, Franka Panda, and a custom 6-DoF arm) and 500+ self-generated rollouts per task, achieving a 95% success rate on held-out manipulation primitives after three improvement cycles[1]. The critical dependency: seed demonstrations must exhibit 100% task completion with smooth motion profiles and diverse object configurations, because poor seed data propagates through every self-improvement iteration and degrades the success classifier's precision.
Truelabel's physical AI data marketplace addresses the seed data bottleneck by connecting buyers to 12,000+ collectors who operate standardized teleoperation rigs with multi-camera synchronization, force-torque sensing, and per-episode quality scoring. Every seed demonstration includes full provenance metadata—operator ID, session timestamp, environment configuration, hardware calibration logs—enabling buyers to audit data quality before committing to self-improvement cycles that consume 10,000+ GPU-hours per iteration.
Input and Output Specification: Multi-View RGB, Discretized Actions, Language Conditioning
RoboCat's observation space combines overhead and wrist-mounted RGB cameras tokenized via Vision Transformer, interleaved with proprioception tokens encoding joint positions and velocities[1]. The RT-1 architecture demonstrated that multi-view RGB at 640×480 resolution and 5 Hz capture frequency provides sufficient spatial detail for tabletop manipulation; RoboCat extends this to 10 Hz for faster embodiments like KUKA arms.
Action outputs are 6-DoF end-effector deltas plus gripper state, discretized into 1024 bins per dimension to enable autoregressive generation within the Gato transformer backbone. Natural language task descriptions—"pick up the red block and place it in the bowl"—are tokenized and prepended to the observation-action sequence, allowing the model to condition behavior on linguistic goals without per-task reward engineering.
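A minimal sketch of this binning scheme is shown below; the per-dimension action ranges and helper names are illustrative assumptions, not RoboCat's published values.

```python
import numpy as np

NUM_BINS = 1024
# Assumed per-dimension ranges: xyz deltas (m), rpy deltas (rad), gripper open fraction.
ACTION_LOW = np.array([-0.05, -0.05, -0.05, -0.25, -0.25, -0.25, 0.0])
ACTION_HIGH = np.array([0.05, 0.05, 0.05, 0.25, 0.25, 0.25, 1.0])

def discretize_action(action: np.ndarray) -> np.ndarray:
    """Map a continuous 7-D action (6-DoF delta + gripper) to integer bin indices."""
    normalized = (action - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)  # -> [0, 1]
    return np.clip((normalized * NUM_BINS).astype(int), 0, NUM_BINS - 1)

def undiscretize_action(bins: np.ndarray) -> np.ndarray:
    """Recover the bin-center continuous action for execution on the robot."""
    centers = (bins + 0.5) / NUM_BINS
    return ACTION_LOW + centers * (ACTION_HIGH - ACTION_LOW)

# Usage: round-trip a sample action through the tokenizer.
a = np.array([0.01, -0.02, 0.0, 0.0, 0.1, 0.0, 1.0])
print(discretize_action(a), undiscretize_action(discretize_action(a)))
```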
Control frequency varies by embodiment: Sawyer operates at 5 Hz due to joint velocity limits, while KUKA runs at 10 Hz for pick-and-place tasks requiring rapid gripper actuation[1]. Episode length is variable (30-120 seconds typical) and terminated by task completion or 200-step timeout. Truelabel's data collection protocol enforces hardware-synchronized multi-camera recording with frame-level timestamps, ensuring observation sequences align with action labels within 10 ms tolerance—critical for imitation learning where temporal misalignment causes policy drift.
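The sketch below shows one way to enforce that tolerance at ingest time; the helper and the exact array layout are assumptions for illustration, not Truelabel's actual pipeline code.

```python
import numpy as np

TOLERANCE_S = 0.010  # 10 ms alignment bar cited above

def misaligned_frames(overhead_ts, wrist_ts, action_ts, tol: float = TOLERANCE_S):
    """Per-frame timestamps in seconds; returns indices where any stream drifts beyond tol."""
    overhead_ts, wrist_ts, action_ts = map(np.asarray, (overhead_ts, wrist_ts, action_ts))
    skew = np.maximum(np.abs(overhead_ts - wrist_ts), np.abs(overhead_ts - action_ts))
    return np.flatnonzero(skew > tol)
```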
Architecture and Self-Improvement Loop: Gato Backbone, Success Classifiers, Iterative Retraining
RoboCat inherits the Gato architecture: a 1.18B-parameter transformer that processes interleaved image patches, proprioception vectors, and language tokens as a unified sequence. The model is pretrained on 253 tasks across 4 robot embodiments, then fine-tuned on target tasks using 100-500 seed demonstrations per task[1].
The self-improvement loop has three phases. First, the current policy generates 500+ rollout episodes on a new task, recording multi-view RGB, actions, and episode outcomes. Second, a binary success classifier—trained on 10,000+ human-labeled rollout outcomes across all tasks—filters successful episodes with 93% precision[1]. Third, the policy retrains on the union of seed demonstrations and filtered self-generated data, then the cycle repeats.
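The control flow of that loop can be sketched as follows; the callables for rollout generation, success classification, and retraining are placeholders supplied by the caller, not DeepMind's released API.

```python
from typing import Callable, List, Sequence, Tuple

def self_improvement_cycle(
    rollout: Callable[[], dict],            # phase 1: run the current policy on the new task
    is_success: Callable[[dict], bool],     # phase 2: learned success classifier
    retrain: Callable[[List[dict]], None],  # phase 3: retrain on seed + filtered rollouts
    seed_episodes: Sequence[dict],
    n_rollouts: int = 500,
) -> Tuple[List[dict], int]:
    episodes = [rollout() for _ in range(n_rollouts)]   # self-generated rollouts
    kept = [ep for ep in episodes if is_success(ep)]    # classifier-filtered successes
    retrain(list(seed_episodes) + kept)                 # union of seed and self-generated data
    return kept, len(kept)
```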
DeepMind reported that three improvement cycles increased success rate from 36% (seed policy) to 95% on held-out manipulation primitives, with diminishing returns after cycle four[1]. The success classifier is the critical gating function: false positives (labeling failures as successes) poison the training set, while false negatives (discarding valid episodes) starve the loop of data. Open X-Embodiment datasets provide 1M+ episodes for pretraining the classifier, but task-specific fine-tuning requires 500+ human-labeled rollouts per new task—a labeling workload that Truelabel's annotation network delivers in 48-72 hours with 95%+ inter-annotator agreement.
Cross-Embodiment Data Requirements: 4+ Robot Platforms, Consistent Action Spaces, Embodiment Tokens
RoboCat's cross-embodiment generalization depends on training data spanning multiple robot morphologies with overlapping task distributions. DeepMind's dataset included Sawyer (7-DoF arm), KUKA iiwa (7-DoF), Franka Panda (7-DoF), and a custom 6-DoF arm, totaling 253 tasks and 130,000+ episodes[1]. Each embodiment contributes 50-100 tasks, with 30% task overlap (e.g., "pick red block") to enable transfer learning.
Action spaces are normalized to 6-DoF end-effector deltas despite varying joint counts, using inverse kinematics to map Cartesian commands to joint velocities. Embodiment-specific tokens prepended to each sequence allow the transformer to condition on morphology: a learned embedding for "Sawyer" biases the model toward that platform's kinematic constraints and workspace geometry.
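A toy sketch of that conditioning is shown below; the token ids and sequence layout are illustrative assumptions, not RoboCat's actual tokenizer.

```python
# Per-platform tag tokens prepended to the goal and observation/action stream let a
# single transformer key on morphology. Ids here are assumed to occupy a reserved
# region of the vocabulary, disjoint from language and action tokens.
EMBODIMENT_IDS = {"sawyer": 0, "kuka_iiwa": 1, "franka_panda": 2, "custom_6dof": 3}

def build_sequence(embodiment: str, language_tokens: list, obs_action_tokens: list) -> list:
    """Layout: [embodiment tag] + language goal + interleaved observation/action tokens."""
    return [EMBODIMENT_IDS[embodiment]] + list(language_tokens) + list(obs_action_tokens)
```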
DROID demonstrated that cross-embodiment datasets require consistent camera viewpoints (overhead + wrist) and lighting conditions to prevent the model from keying on spurious visual correlations. Truelabel's data collection partnerships with Universal Robots, ABB, and Franka Emika ensure that seed demonstrations across 6+ embodiments share standardized camera rigs, calibration targets, and background textures, reducing the domain gap that degrades zero-shot transfer by 20-40 percentage points in Open X-Embodiment benchmarks.
Seed Demonstration Quality Bar: 100% Success Rate, Diverse Configurations, Smooth Trajectories
RoboCat's self-improvement loop amplifies seed data quality: a single failed demonstration in the seed set can generate 50+ failed rollouts in cycle one, which the success classifier may mislabel as successes, poisoning cycles two and three. DeepMind's seed demonstrations achieved 100% task completion by using expert teleoperators with 200+ hours of platform-specific training[1].
Diverse object configurations (varying position, orientation, color, and distractor objects) prevent the policy from memorizing fixed trajectories. DeepMind's "pick red block" task varied 12 block positions, 8 orientations, and 4 distractor-object sets per episode, a space of 384 unique configurations sampled across 100 seed demonstrations. Smooth motion profiles (jerk <2 m/s³) improve imitation learning convergence by 30% compared to jerky teleoperation, as measured by BridgeData V2 ablations.
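A simple finite-difference jerk check along these lines (a hypothetical helper that assumes a fixed sampling rate) can gate episodes against the 2 m/s³ bar before they enter a seed set.

```python
import numpy as np

def max_jerk(positions: np.ndarray, hz: float = 10.0) -> float:
    """positions: (T, 3) end-effector xyz in meters; returns the peak jerk magnitude in m/s^3."""
    dt = 1.0 / hz
    vel = np.diff(positions, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    jerk = np.diff(acc, axis=0) / dt
    return float(np.linalg.norm(jerk, axis=1).max())

def passes_smoothness(positions: np.ndarray, hz: float = 10.0, limit: float = 2.0) -> bool:
    return max_jerk(positions, hz) < limit
```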
Truelabel's collector network enforces a three-tier quality gate: automated checks for episode completion and trajectory smoothness, peer review by senior collectors (5% sample), and buyer-side acceptance testing on 10% of deliverables. Collectors who maintain 98%+ acceptance rates over 50 episodes earn priority routing for high-value tasks, creating a reputation system that scales seed data quality without per-episode manual review. This quality infrastructure is absent from public datasets like RoboNet, where success rates range from 60-85% and motion jerk exceeds 5 m/s³ in 40% of episodes[2].
Success Classifier Training: Binary Labels, Graded Scores, Inter-Annotator Agreement
The success classifier is a ResNet-18 vision model trained on 10,000+ rollout episodes labeled with binary success (task completed) or failure (timeout, collision, dropped object). DeepMind reported 93% precision and 89% recall on held-out tasks, with false positives concentrated in ambiguous cases like "block placed near but not inside the bowl"[1].
Graded success scores (0-100 scale) provide finer signal for partial credit tasks: "stack three blocks" might score 33 for one block, 67 for two blocks, 100 for three blocks. RT-2 used graded scores to weight training examples, upweighting high-quality episodes and downweighting marginal successes, improving sample efficiency by 25% over binary labels.
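One plausible way to turn graded scores into per-episode loss weights is sketched below; the clipping floor is an assumption for illustration, not a value reported for RT-2 or RoboCat.

```python
import numpy as np

def example_weights(scores, min_weight: float = 0.1) -> np.ndarray:
    """scores: iterable of 0-100 graded success labels -> per-episode weights in [min_weight, 1]."""
    s = np.asarray(scores, dtype=float) / 100.0
    return np.clip(s, min_weight, 1.0)

# Usage: "stack three blocks" episodes scored 33 / 67 / 100 get proportional weight.
print(example_weights([33, 67, 100]))
```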
Inter-annotator agreement is the limiting factor: two annotators viewing the same rollout must agree on success/failure 95%+ of the time, or the classifier learns annotator-specific biases rather than ground-truth task completion. Truelabel's annotation protocol includes written task rubrics ("block must be fully inside bowl, no contact with rim"), reference videos showing boundary cases, and calibration sessions where 10 annotators label 50 shared episodes to establish consensus. This process achieves 96% pairwise agreement on manipulation tasks, compared to 78-85% for crowdsourced labels from Appen or Sama without task-specific training.
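Pairwise agreement itself is straightforward to compute; the hypothetical helper below averages binary-label agreement over every annotator pair.

```python
from itertools import combinations
import numpy as np

def pairwise_agreement(labels: np.ndarray) -> float:
    """labels: (n_annotators, n_episodes) array of 0/1 success labels."""
    pairs = list(combinations(range(labels.shape[0]), 2))
    rates = [np.mean(labels[i] == labels[j]) for i, j in pairs]
    return float(np.mean(rates))
```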
Multi-Camera Synchronization and Hardware Requirements for Seed Data Collection
RoboCat's multi-view RGB input requires overhead and wrist cameras synchronized to within 10 ms, because temporal misalignment between viewpoints causes the model to learn spurious correlations (e.g., wrist camera shows gripper closing while overhead camera shows pre-grasp state). DeepMind used hardware-triggered cameras with shared clock signals, achieving <5 ms inter-camera jitter[1].
Wrist-mounted cameras must maintain fixed orientation relative to the end-effector across all episodes, or the model learns camera-relative rather than world-relative spatial reasoning. DROID's data collection protocol specifies camera mounting brackets with <1° rotational tolerance and checkerboard calibration every 50 episodes to detect mounting drift.
Truelabel's standardized teleoperation rigs include Intel RealSense D435 cameras (640×480 RGB at 30 Hz, downsampled to 5-10 Hz for storage), Logitech C920 overhead cameras, and ROS2-based recording pipelines that timestamp every frame with nanosecond precision. All camera intrinsics and extrinsics are logged per episode in MCAP format, enabling buyers to reproject observations into canonical coordinate frames or retrain with different camera models without re-collecting data. Public datasets like RoboNet lack per-episode calibration metadata, forcing buyers to assume fixed camera parameters across 100+ collection sites—an assumption that introduces 5-15 pixel reprojection error in 30% of episodes[2].
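With per-episode intrinsics and extrinsics available, buyers can run a reprojection check like the sketch below; it is a standard pinhole-camera computation, not Truelabel's exact validation code.

```python
import numpy as np

def reprojection_error(K: np.ndarray, R: np.ndarray, t: np.ndarray,
                       point_world: np.ndarray, pixel_observed: np.ndarray) -> float:
    """K: 3x3 intrinsics, R: 3x3 rotation, t: (3,) translation, point_world: (3,) in meters.

    Projects a known calibration point into the image and returns the pixel error
    against where the point was actually detected.
    """
    p_cam = R @ point_world + t
    p_img = K @ p_cam
    pixel_projected = p_img[:2] / p_img[2]
    return float(np.linalg.norm(pixel_projected - pixel_observed))
```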
Language Conditioning and Task Specification: Natural Language Goals, Templated Descriptions, Ambiguity Handling
RoboCat conditions on natural language task descriptions tokenized via SentencePiece and prepended to the observation-action sequence. DeepMind's task descriptions ranged from simple imperatives ("pick up the red block") to multi-step instructions ("pick up the red block, place it in the bowl, then pick up the blue block")[1].
Templated descriptions—"pick up the {color} {object}"—enable systematic generalization to unseen color-object combinations, but introduce brittleness when real-world language deviates from templates. RT-2 addressed this by pretraining on 6B web-scraped image-text pairs, allowing the model to ground free-form language ("grab the crimson cube") in visual observations.
Ambiguity handling is critical: "pick up the block" is underspecified when three blocks are present. DeepMind's dataset included 15% ambiguous instructions with multiple valid completions, training the model to select the nearest or most salient object. Truelabel's task specification protocol requires buyers to provide 5-10 example instructions per task, annotators to paraphrase instructions in natural language (not templates), and validation that 95%+ of paraphrases yield the same ground-truth action sequence. This process generates linguistically diverse training data while maintaining task consistency—a balance absent from Appen's crowdsourced instruction datasets, where 40% of paraphrases introduce semantic drift.
Data Formats and Storage: RLDS, HDF5, MCAP for Multi-Embodiment Pipelines
RoboCat's training pipeline uses RLDS (Reinforcement Learning Datasets) format, a TensorFlow-based schema that stores episodes as sequences of (observation, action, reward, discount) tuples with nested multi-view images and proprioception vectors. RLDS enables efficient random access to episodes and frames, critical for sampling diverse minibatches during transformer training[3].
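Iterating an RLDS-formatted dataset typically looks like the sketch below, assuming the dataset is registered with TensorFlow Datasets under the standard episode/steps schema; the dataset name is a placeholder.

```python
import tensorflow_datasets as tfds

# Hypothetical dataset name; any RLDS-schema dataset registered with TFDS works the same way.
ds = tfds.load("my_rlds_dataset", split="train")

for episode in ds.take(1):
    for step in episode["steps"]:           # each step holds observation, action, reward, discount
        obs = step["observation"]            # nested dict: multi-view images + proprioception
        action = step["action"]
```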
HDF5 is an alternative for buyers who need language-agnostic storage: each episode is an HDF5 group containing `/observations/overhead_rgb` (T×H×W×3 array), `/observations/wrist_rgb`, `/observations/joint_positions` (T×7 array), `/actions` (T×7 array), and `/metadata` (JSON string with task description, success label, collector ID). HDF5's chunked compression reduces storage by 60% compared to raw NumPy arrays while maintaining <10 ms random-access latency.
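A minimal writer for that layout might look like the following; the chunking and compression settings are illustrative assumptions.

```python
import json
import h5py
import numpy as np

def write_episode(path: str, overhead: np.ndarray, wrist: np.ndarray,
                  joints: np.ndarray, actions: np.ndarray, metadata: dict) -> None:
    """overhead/wrist: (T, H, W, 3) uint8; joints: (T, 7); actions: (T, 7)."""
    with h5py.File(path, "w") as f:
        f.create_dataset("observations/overhead_rgb", data=overhead,
                         chunks=(1,) + overhead.shape[1:], compression="gzip")
        f.create_dataset("observations/wrist_rgb", data=wrist,
                         chunks=(1,) + wrist.shape[1:], compression="gzip")
        f.create_dataset("observations/joint_positions", data=joints)
        f.create_dataset("actions", data=actions)
        f.create_dataset("metadata", data=json.dumps(metadata))  # task, success label, collector ID
```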
MCAP is the emerging standard for multi-sensor robotics data, storing time-series messages (camera frames, joint states, force-torque readings) with nanosecond timestamps and schema definitions. Truelabel's data delivery pipeline exports episodes in all three formats—RLDS for TensorFlow users, HDF5 for PyTorch users, MCAP for ROS2 users—with automated validation that observation-action alignment is preserved across format conversions. Public datasets like Open X-Embodiment provide only RLDS, forcing non-TensorFlow users to write custom converters that introduce 5-20% data loss from timestamp rounding errors.
Comparison to RT-1, RT-2, and Open X-Embodiment: Architectural Differences and Data Scale
RT-1 is a 35M-parameter vision-language-action model trained on 130,000 episodes from a single Everyday Robots platform, achieving 97% success on 17 kitchen tasks but zero-shot transfer <10% to new embodiments[4]. RoboCat scales to 1.18B parameters and 4 embodiments, improving zero-shot transfer to 36% and few-shot transfer (100 demos) to 74%[1].
RT-2 extends RT-1 by pretraining on 6B web image-text pairs, enabling the model to ground free-form language instructions and achieve 62% zero-shot success on novel tasks—higher than RoboCat's 36% but lower than RoboCat's 95% after self-improvement cycles[5]. The key difference: RT-2 relies on web-scale pretraining for generalization, while RoboCat relies on self-generated data for task-specific adaptation.
Open X-Embodiment aggregates 1M+ episodes across 22 robot platforms, training RT-X models that achieve 50% zero-shot success on held-out embodiments—between RoboCat's 36% and RT-2's 62%[6]. The dataset's strength is breadth (22 embodiments vs. RoboCat's 4); its weakness is depth (45K episodes per embodiment vs. RoboCat's 130K). Truelabel's marketplace enables buyers to commission 10,000+ episodes on a single embodiment in 2-4 weeks, matching RoboCat's depth while preserving Open X-Embodiment's cross-platform diversity.
Self-Improvement Cycle Economics: GPU Costs, Data Costs, and ROI Breakeven
Each RoboCat self-improvement cycle consumes 10,000 GPU-hours on NVIDIA A100s (approximately $30,000 at $3/hour cloud rates) and generates 500 rollout episodes per task. At 93% classifier precision, a filtered batch of 500 accepted episodes contains roughly 465 true successes and 35 false positives; retraining on this dataset improves success rate by 15-25 percentage points per cycle[1].
Seed data is a significant share of early-cycle spend: 253 demonstrations at $50/episode (Truelabel's median rate for 5-minute manipulation tasks) total $12,650 on top of the $30,000 GPU cost. By cycle three, self-generated data outnumbers seed data 6:1, amortizing seed costs to roughly $2/episode. ROI breakeven is reached when the policy's expected cost per successful completion (per-attempt cost divided by success rate) falls below the cost of human teleoperation: if a human operator completes a task in 5 minutes for $50 and an autonomous attempt costs the same while succeeding 80% of the time, the expected cost per completion is $50 / 0.8 = $62.50, so teleoperation stays cheaper until the policy's per-attempt cost or success rate improves further.
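The breakeven arithmetic reduces to a one-line formula, sketched below with the $50 figure quoted above and the assumption that an autonomous attempt costs the same as a teleoperated episode unless noted.

```python
def cost_per_completion(cost_per_attempt: float, success_rate: float) -> float:
    """Expected cost per successful task completion."""
    return cost_per_attempt / success_rate

print(cost_per_completion(50.0, 0.8))  # 62.5: pricier than a $50 human episode
print(cost_per_completion(5.0, 0.8))   # 6.25: autonomy wins once per-attempt cost falls
```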
Truelabel's request model reduces seed data costs by 40-60% compared to traditional data vendors: buyers post task specifications with per-episode pricing, collectors bid on tasks, and the marketplace routes tasks to collectors with proven track records on similar embodiments. This dynamic pricing eliminates the 2-3× markup that Scale AI and Appen charge for project management overhead, enabling buyers to commission 500+ seed demonstrations for $15,000-25,000 instead of $50,000-75,000.
Truelabel's Role in the RoboCat Data Supply Chain: Seed Data, Success Labels, Cross-Embodiment Coverage
Truelabel's physical AI data marketplace addresses three bottlenecks in RoboCat-style self-improvement pipelines. First, seed demonstrations: buyers post task specifications ("pick red block, place in bowl") with target embodiment (Franka Panda), episode count (253), and quality requirements (100% success, <2 m/s³ jerk). Collectors with matching hardware bid on the task, and the marketplace routes accepted bids to collectors with 98%+ historical acceptance rates.
Second, success classifier labels: buyers upload 500+ rollout episodes, and Truelabel's annotation network labels each episode with binary success or graded score (0-100) within 48 hours. Annotators complete task-specific calibration (10 reference episodes) before labeling production data, ensuring 95%+ inter-annotator agreement. All labels include annotator ID and confidence score, enabling buyers to weight training examples or filter low-confidence labels.
Third, cross-embodiment coverage: Truelabel's collector network includes 12,000+ operators across 40+ robot platforms, from commodity arms (Universal Robots UR5, Franka Panda) to custom research platforms (Berkeley Blue, Stanford Robotics Lab rigs). Buyers can commission 100-episode datasets on 6+ embodiments in parallel, matching Open X-Embodiment's breadth while preserving per-embodiment depth. Every dataset includes full provenance metadata—collector ID, hardware serial numbers, calibration logs, environment photos—enabling buyers to audit data quality and reproduce collection protocols for future tasks.
Licensing and Commercialization: Model Weights, Dataset Rights, Derivative Works
DeepMind has not released RoboCat model weights or training code, limiting replication to academic groups with 10,000+ GPU-hour budgets. OpenVLA and LeRobot provide open-source alternatives with 7B-parameter models trained on 800K episodes, achieving 60-70% of RoboCat's performance at 10% of the training cost[7].
Dataset licensing for self-improvement pipelines requires clarity on derivative works: if seed demonstrations are licensed under CC BY-NC 4.0, are self-generated rollouts considered derivative works subject to the same non-commercial restriction? Creative Commons licenses do not explicitly address this scenario, creating legal ambiguity for commercial deployments.
Truelabel's marketplace contracts grant buyers perpetual, worldwide, royalty-free rights to use seed demonstrations for model training, including self-improvement loops that generate derivative datasets. Self-generated rollouts are not considered derivative works of the seed data, allowing buyers to commercialize trained models without additional licensing fees. This clarity is absent from public datasets like Open X-Embodiment, where 40% of constituent datasets carry non-commercial or academic-only restrictions that propagate to any model trained on the aggregate.
Future Directions: Sim-to-Real Transfer, World Models, and Embodied Foundation Models
RoboCat's self-improvement loop is data-hungry: 500+ rollouts per task per cycle, with three cycles required to reach 95% success. Domain randomization and sim-to-real transfer offer a path to reduce real-world data requirements by pretraining in simulation, but current simulators (MuJoCo, Isaac Gym) lack the visual and physical fidelity to close the sim-to-real gap for contact-rich manipulation[8].
NVIDIA Cosmos and world models represent an alternative approach: train a generative model of environment dynamics on 1M+ episodes, then use the world model to synthesize rollouts for policy training without real-world data collection. Early results show 40-60% sim-to-real transfer on pick-and-place tasks, but contact dynamics (friction, deformation) remain poorly modeled[9].
Embodied foundation models like OpenVLA and RT-X combine RoboCat's cross-embodiment architecture with RT-2's web-scale pretraining, achieving 70% zero-shot success on novel tasks and 90% few-shot success with 50 demonstrations[7]. These models reduce seed data requirements by 5-10× compared to RoboCat, but still require 5,000-10,000 real-world episodes for task-specific fine-tuning. Truelabel's marketplace enables buyers to commission these datasets in 1-2 weeks, compared to 3-6 months for in-house data collection or 6-12 months for traditional data vendors.
External references and source context
1. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation (arXiv). RoboCat architecture, self-improvement loop, 253 seed demonstrations per task, 95% success rate after three cycles.
2. RoboNet: Large-Scale Multi-Robot Learning (arXiv). RoboNet dataset statistics and cross-site calibration challenges.
3. RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning (arXiv). RLDS format specification and random-access performance.
4. RT-1: Robotics Transformer for Real-World Control at Scale (arXiv). RT-1 multi-view RGB specification and single-embodiment performance.
5. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv). RT-2 web-scale pretraining and graded success scores.
6. Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). Open X-Embodiment dataset with 1M+ episodes across 22 robot platforms.
7. OpenVLA: An Open-Source Vision-Language-Action Model (arXiv). OpenVLA zero-shot and few-shot performance on novel tasks.
8. Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning (arXiv). Survey of sim-to-real transfer methods and contact-rich manipulation challenges.
9. NVIDIA GR00T N1 technical report (arXiv). World-model-based sim-to-real transfer.
FAQ
How many seed demonstrations does RoboCat require per task?
RoboCat requires 253 seed demonstrations per task to achieve 36% zero-shot success rate, with 100-500 demonstrations needed for few-shot fine-tuning to reach 74% success. DeepMind's self-improvement loop generates 500+ additional rollouts per cycle, but these depend on high-quality seed data to bootstrap the success classifier. Truelabel's marketplace delivers 253-episode seed datasets in 2-4 weeks with 100% task completion and <2 m/s³ motion jerk, meeting the quality bar for self-improvement cycles.
What camera setup does RoboCat use for multi-view RGB observations?
RoboCat uses overhead and wrist-mounted RGB cameras synchronized to within 10 ms, capturing 640×480 frames at 5-10 Hz depending on embodiment. Cameras must maintain fixed orientation relative to the end-effector across all episodes, with <1° rotational tolerance and checkerboard calibration every 50 episodes. Truelabel's standardized teleoperation rigs include Intel RealSense D435 and Logitech C920 cameras with ROS2-based recording pipelines that timestamp every frame with nanosecond precision, ensuring observation-action alignment within 10 ms.
How does RoboCat's self-improvement loop work?
RoboCat's self-improvement loop has three phases: (1) generate 500+ rollout episodes on a new task using the current policy, (2) filter successful episodes using a binary success classifier trained on 10,000+ human-labeled outcomes, and (3) retrain the policy on the union of seed demonstrations and filtered self-generated data. DeepMind reported that three cycles increased success rate from 36% to 95% on held-out tasks, with diminishing returns after cycle four. The success classifier achieves 93% precision, so roughly 7 of every 100 episodes it accepts are false positives that poison the training set if not manually reviewed.
What data formats does RoboCat use for training?
RoboCat uses RLDS (Reinforcement Learning Datasets) format, a TensorFlow-based schema that stores episodes as sequences of (observation, action, reward, discount) tuples with nested multi-view images and proprioception vectors. Alternative formats include HDF5 for language-agnostic storage and MCAP for multi-sensor robotics data with nanosecond timestamps. Truelabel's data delivery pipeline exports episodes in all three formats with automated validation that observation-action alignment is preserved across conversions, eliminating the 5-20% data loss from timestamp rounding errors in manual format conversions.
How does RoboCat compare to RT-1 and RT-2?
RT-1 is a 35M-parameter model trained on 130,000 episodes from a single embodiment, achieving 97% success on 17 tasks but <10% zero-shot transfer to new embodiments. RoboCat scales to 1.18B parameters and 4 embodiments, improving zero-shot transfer to 36% and few-shot transfer to 74% with 100 demonstrations. RT-2 extends RT-1 by pretraining on 6B web image-text pairs, achieving 62% zero-shot success—higher than RoboCat's 36% but lower than RoboCat's 95% after self-improvement cycles. The key difference: RT-2 relies on web-scale pretraining for generalization, while RoboCat relies on self-generated data for task-specific adaptation.
What is the cost breakdown for RoboCat self-improvement cycles?
Each RoboCat self-improvement cycle consumes 10,000 GPU-hours on NVIDIA A100s (approximately $30,000 at $3/hour cloud rates) and generates 500 rollout episodes per task. Seed data costs $12,650 for 253 demonstrations at $50/episode (Truelabel's median rate), totaling $42,650 for cycle one. By cycle three, self-generated data outnumbers seed data 6:1, amortizing seed costs to roughly $2/episode. ROI breakeven arrives when the policy's expected cost per successful completion (per-attempt cost divided by success rate) drops below the $50 cost of a 5-minute teleoperated episode; at a $50 per-attempt cost and 80% success, that figure is still $62.50.
Looking for RoboCat training data?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.
Source RoboCat seed data on Truelabel