
Model Profile

RT-1 Training Data: Architecture, Dataset Requirements & Integration

RT-1 (Robotics Transformer 1) is Google's vision-language-action model trained on 130,000 demonstrations spanning 744 tasks collected over 17 months. It processes 300×300 RGB images with a 6-frame history, discretizes 7-DoF end-effector deltas into 256 bins per dimension, and conditions on natural-language instructions via Universal Sentence Encoder embeddings injected through FiLM layers[1].

Updated 2025-06-15 · By truelabel · Reviewed by truelabel

Quick facts

Model class: Model Profile
Primary focus: RT-1 training data
Last reviewed: 2025-06-15

What Is RT-1 and Why It Matters for Physical AI

RT-1 (Robotics Transformer 1) emerged in December 2022 as the first large-scale demonstration that Transformer architectures—proven in language and vision—could generalize across diverse robot manipulation tasks when trained on sufficiently broad data[1]. Developed by Google's Everyday Robots team and published by Anthony Brohan, Noah Brown, Justice Carbajal, and 45+ co-authors, RT-1 achieved 97% success on seen tasks and 76% on novel object configurations by learning from 130,000 teleoperation demonstrations across 744 distinct tasks[1].

The model's significance lies not in architectural novelty—it adapts standard Transformer blocks with FiLM conditioning—but in proving that task diversity at scale drives generalization more than depth on narrow tasks. This finding reshaped physical AI data strategy: teams now prioritize breadth (50–200 demonstrations each across dozens of tasks) over depth (thousands of demonstrations on a single task). RT-1's RLDS format became the de facto standard for robot learning datasets, adopted by Open X-Embodiment, OpenVLA, and RT-2.

For procurement teams, RT-1 defines the minimum viable dataset specification for vision-language-action models: 300×300 RGB observations, 7-DoF end-effector control, natural-language task descriptions, and episode-level success labels. Any dataset meeting these requirements can train RT-1 or its successors. Truelabel's marketplace aggregates RT-1-compatible datasets from 12,000+ collectors, enabling teams to source task-diverse data without building in-house teleoperation infrastructure[2].

RT-1 Architecture: Tokenized Actions Meet Vision Transformers

RT-1 treats robot control as a sequence modeling problem. The architecture ingests a 6-frame observation history (current frame plus 5 previous frames at 300×300 resolution), encodes each frame through a pretrained EfficientNet-B3 backbone, and flattens spatial features into a sequence of 81 image tokens per frame (486 tokens total across 6 frames); a TokenLearner module then compresses these to 8 tokens per frame (48 total) before the Transformer layers[1]. Natural-language instructions—encoded via Universal Sentence Encoder—are injected into the EfficientNet encoder through FiLM (Feature-wise Linear Modulation) conditioning, allowing the same model to handle hundreds of distinct tasks without task-specific heads.

The output is a discretized action distribution over 256 bins per dimension for 7-DoF end-effector control (3D position delta, 3D orientation delta, 1D gripper state), plus 3-DoF base motion and a discrete mode flag (control arm, control base, or terminate the episode), totaling 11 dimensions. Discretization—rather than continuous regression—enables the model to express multimodal action distributions (e.g., "grasp from left OR right") and stabilizes training on heterogeneous data. The model runs at a 3 Hz control frequency, matching the original teleoperation collection rate.
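
For concreteness, here is a minimal sketch of the input/output contract described above; the field names, dict layout, and sampling helper are illustrative assumptions, not RT-1's published implementation.

```python
import numpy as np

# Illustrative observation/action spec matching the description above.
OBS_SPEC = {
    "image": ("uint8", (6, 300, 300, 3)),        # 6-frame RGB history
    "language_embedding": ("float32", (512,)),   # Universal Sentence Encoder vector
}

ACTION_DIMS = 11   # 3 position + 3 orientation + 1 gripper + 3 base + 1 mode
NUM_BINS = 256     # per-dimension discretization

def sample_action(logits: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Sample one bin index per action dimension from per-dimension logits."""
    assert logits.shape == (ACTION_DIMS, NUM_BINS)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.array([rng.choice(NUM_BINS, p=p) for p in probs])
```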

Critically, RT-1 does not use depth, force/torque sensors, or proprioceptive joint angles—only RGB images and language. This design choice prioritizes scalability (RGB cameras are ubiquitous) over precision, accepting that some tasks requiring fine force control remain out of reach. The 35M-parameter model fits on a single GPU, enabling real-time inference on edge devices. For teams evaluating Scale AI's Physical AI platform or building in-house, RT-1's architecture demonstrates that data format standardization (RLDS) matters more than model size for cross-task generalization.

Training Data Requirements: 130,000 Demonstrations Across 744 Tasks

RT-1's training corpus comprises 130,000 teleoperation demonstrations collected over 17 months on 13 robots in Google's office kitchens[1]. The dataset spans 744 distinct tasks grouped into 7 skill families: pick-and-place, drawer opening, object rearrangement, wiping, knob turning, upright orientation correction, and uncategorized long-horizon tasks. Each demonstration includes 300×300 RGB frames at 3 Hz, 7-DoF end-effector actions discretized into 256 bins, natural-language instructions (e.g., "pick the apple from the top drawer"), and binary episode success labels.

The data collection protocol enforced workspace normalization: all actions are expressed as deltas in a canonical robot base frame, enabling cross-embodiment transfer. Collectors used a 6-DoF SpaceMouse for teleoperation, with automatic action discretization and frame buffering handled by the logging pipeline. Language instructions were manually annotated post-collection, with an average of 3.2 paraphrases per task to improve language grounding robustness.

RT-1's key empirical finding: task diversity dominates dataset size. A model trained on 50 demonstrations each across 700 tasks outperformed one trained on 10,000 demonstrations across 70 tasks by 18 percentage points on novel object generalization[1]. This result contradicts earlier assumptions that robot learning requires thousands of demonstrations per task, instead suggesting that 50–200 high-quality demonstrations suffice if task coverage is broad. For procurement, this implies a portfolio strategy: source small batches across many tasks rather than deep datasets on narrow skills.

The DROID dataset—76,000 demonstrations across 564 tasks from 564 buildings—follows RT-1's diversity-first philosophy. Teams can license subsets matching their target task distribution (kitchen manipulation, warehouse picking, etc.) or commission custom task collections through truelabel's distributed network. Every dataset delivered through truelabel ships in RLDS format with USE-encoded language and 256-bin action discretization, ensuring drop-in compatibility with RT-1, RT-2, and OpenVLA training pipelines.

RLDS Format: The Standard for Robot Learning Datasets

RT-1 popularized RLDS (Reinforcement Learning Datasets), a TensorFlow-native format that wraps episodic robot data into a standardized schema[3]. Each RLDS dataset is a collection of episodes (complete task attempts), where each episode contains a sequence of steps (observation-action-reward tuples). The format enforces strict typing: observations are nested dictionaries (e.g., `image`, `language_instruction`), actions are arrays with documented shapes, and metadata (robot ID, task name, success label) lives in episode-level attributes.

RLDS solves three procurement pain points. First, schema validation: the format rejects malformed data at ingest time, catching annotation errors before training. Second, lazy loading: datasets stream from disk or cloud storage without loading entire episodes into RAM, enabling training on 100GB+ corpora. Third, cross-embodiment compatibility: because RLDS enforces workspace normalization and action discretization conventions, a model trained on BridgeData V2 (Franka Panda arm) can fine-tune on DROID (Franka + UR5 + xArm) without rewriting data loaders.
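
In practice, the episodic structure above is consumed through TensorFlow Datasets' standard RLDS layout. The sketch below uses a placeholder dataset name; substitute whichever RLDS corpus you have registered or exported locally.

```python
import tensorflow_datasets as tfds

# Placeholder name: point this at any RLDS-formatted dataset you have available
# (e.g., one built with tfds.builder_from_directory on a local RLDS export).
ds = tfds.load("my_rlds_dataset", split="train")

for episode in ds.take(1):
    # Episode-level metadata (robot ID, task name, success label) sits alongside
    # the nested "steps" dataset; exact key names vary by corpus.
    for step in episode["steps"]:
        image = step["observation"]["image"]                   # e.g., 300x300x3 uint8
        instruction = step["observation"]["language_instruction"]
        action = step["action"]
        is_last = step["is_last"]                              # standard RLDS step flag
```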

The format's weakness: limited metadata expressiveness. RLDS does not standardize collector demographics, lighting conditions, camera intrinsics, or object instance IDs—information critical for bias audits and sim-to-real transfer. Teams often maintain parallel metadata databases (Parquet tables, JSON manifests) alongside RLDS files. Truelabel's datasets include provenance metadata (collector ID, timestamp, building type, lighting conditions) in sidecar JSON files, enabling downstream filtering by data quality proxies.
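
One way to implement the sidecar pattern, sketched under the assumption that provenance lives in a JSON manifest keyed by episode ID; the file name and field names are hypothetical, not a fixed truelabel schema.

```python
import json
from pathlib import Path

# Hypothetical sidecar manifest: one JSON object per episode, keyed by episode_id.
manifest = json.loads(Path("dataset_manifest.json").read_text())

# Example filter: drop low-light episodes before building the training split.
keep_ids = {
    episode_id
    for episode_id, meta in manifest.items()
    if meta.get("lighting_conditions") != "low_light"
}
print(f"Keeping {len(keep_ids)} of {len(manifest)} episodes")
```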

For teams migrating from ROS bag or HDF5 formats, conversion to RLDS requires three steps: episode segmentation (splitting continuous logs into task attempts), action discretization (binning continuous commands), and language annotation (pairing episodes with natural-language descriptions). Truelabel's data pipeline automates steps 1–2 and offers human-in-the-loop language annotation at $0.12 per episode, matching RT-1's annotation cost structure.
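
A minimal sketch of step 1, episode segmentation, assuming a flat log of timestamped frames with an optional per-frame completion flag; both field names are hypothetical.

```python
from typing import Iterable

def segment_episodes(frames: Iterable[dict], max_gap_s: float = 2.0) -> list[list[dict]]:
    """Split a continuous log into episodes on completion flags or time gaps.

    Each frame is assumed to be a dict with 'timestamp' (seconds) and an
    optional 'task_done' flag; both names are illustrative, not a standard.
    """
    episodes, current = [], []
    prev_t = None
    for frame in frames:
        if prev_t is not None and frame["timestamp"] - prev_t > max_gap_s:
            if current:
                episodes.append(current)
            current = []
        current.append(frame)
        prev_t = frame["timestamp"]
        if frame.get("task_done"):
            episodes.append(current)
            current = []
    if current:
        episodes.append(current)
    return episodes
```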

7-DoF Action Discretization: Why RT-1 Bins Instead of Regresses

RT-1 discretizes continuous 7-DoF end-effector actions (3D position delta, 3D orientation delta, gripper open/close) into 256 bins per dimension, producing a categorical distribution over 256^11 ≈ 3×10^26 possible actions[1]. This design choice—treating actions as tokens rather than continuous vectors—enables the model to express multimodal action distributions (e.g., "grasp from left with 60% probability OR from right with 40%") and stabilizes training on heterogeneous data where different robots have different action ranges.

The discretization process: each dimension's continuous range (e.g., ±0.05 meters for position deltas) is divided into 256 equal-width bins. During training, the model outputs a 256-way softmax per dimension; during inference, actions are sampled from the predicted distribution and de-discretized to continuous commands. The gripper dimension shares the same 256-bin scheme, though its values are effectively binary (open/close) in practice, and base velocity uses 256 bins per linear/angular component.
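
The equal-width binning described above reduces to a few lines. The range below is the ±0.05 m position delta quoted in the text; other dimensions would use their own ranges.

```python
import numpy as np

NUM_BINS = 256
POS_RANGE = (-0.05, 0.05)  # meters, per the ±0.05 m position delta quoted above

def discretize(value: np.ndarray, low: float, high: float, bins: int = NUM_BINS) -> np.ndarray:
    """Map continuous values to equal-width bin indices in [0, bins - 1]."""
    clipped = np.clip(value, low, high)
    idx = np.floor((clipped - low) / (high - low) * bins).astype(int)
    return np.clip(idx, 0, bins - 1)

def undiscretize(idx: np.ndarray, low: float, high: float, bins: int = NUM_BINS) -> np.ndarray:
    """Map bin indices back to continuous values at bin centers."""
    width = (high - low) / bins
    return low + (idx + 0.5) * width

# Round-trip example: a 1.2 cm position delta is recovered to within one bin width (~0.4 mm).
delta = np.array([0.012])
recovered = undiscretize(discretize(delta, *POS_RANGE), *POS_RANGE)
```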

Empirical results show discretization outperforms continuous regression by 13 percentage points on novel object generalization[1]. The gain comes from two sources: (1) categorical cross-entropy loss is more stable than MSE on long-horizon tasks where action magnitudes vary widely, and (2) the model can hedge bets on ambiguous observations (e.g., occluded objects) by spreading probability mass across multiple bins. The cost: 256× larger output layer and slower inference (sampling + de-discretization adds 2ms per step).

For data buyers, discretization imposes a precision floor: actions finer than 1/256th of the workspace range cannot be represented. RT-1's ±0.05m position range yields ~0.4mm bin width—sufficient for tabletop manipulation but inadequate for precision assembly. Teams targeting sub-millimeter tasks (electronics assembly, surgical robotics) should source datasets with finer discretization (512 or 1024 bins) or continuous action labels. Truelabel's custom collection service supports arbitrary discretization schemes, with 512-bin datasets priced at 1.3× the 256-bin baseline due to increased annotation time.

Language Conditioning via Universal Sentence Encoder and FiLM

RT-1 conditions on natural-language instructions by encoding task descriptions through Universal Sentence Encoder (USE)—a 512-dimensional sentence embedding model pretrained on web text—and injecting the embeddings into the EfficientNet image encoder via FiLM (Feature-wise Linear Modulation)[1]. FiLM applies element-wise affine transformations to intermediate activations: `FiLM(x, γ, β) = γ ⊙ x + β`, where γ and β are predicted from the language embedding. This allows the same visual encoder to specialize for different tasks without task-specific parameters.
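
A minimal Keras-style sketch of FiLM conditioning as defined above: a dense head maps the 512-d USE embedding to per-channel γ and β, which then modulate an intermediate feature map. Layer sizes and placement are illustrative rather than RT-1's exact configuration.

```python
import tensorflow as tf

class FiLM(tf.keras.layers.Layer):
    """FiLM(x, γ, β) = γ ⊙ x + β, with γ and β predicted from a conditioning vector."""

    def __init__(self, channels: int):
        super().__init__()
        # One dense head outputs both γ and β for `channels` feature channels.
        self.to_gamma_beta = tf.keras.layers.Dense(2 * channels)

    def call(self, features: tf.Tensor, conditioning: tf.Tensor) -> tf.Tensor:
        gamma, beta = tf.split(self.to_gamma_beta(conditioning), 2, axis=-1)
        # Broadcast over spatial dimensions: features is (B, H, W, C), γ/β are (B, C).
        return gamma[:, None, None, :] * features + beta[:, None, None, :]

# Usage sketch: modulate an assumed (batch, 9, 9, 512) feature map with a 512-d
# sentence embedding. Shapes are assumptions consistent with the text above.
film = FiLM(channels=512)
out = film(tf.zeros([1, 9, 9, 512]), tf.zeros([1, 512]))
```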

The language annotation protocol: human annotators wrote 3.2 paraphrases per task on average (e.g., "pick the apple," "grasp the red fruit," "get the apple from the bowl"), yielding 2,400+ unique instructions across 744 tasks. Paraphrasing improves robustness to phrasing variations at test time—a model trained on a single canonical instruction per task drops 9 percentage points in success rate when evaluated on paraphrases[1].

USE's limitation: it encodes semantics but not grounding. The embedding for "pick the red apple" is nearly identical to "pick the green apple," forcing the vision encoder to resolve color from pixels. Later models (RT-2, OpenVLA) replace USE with vision-language models (PaLI, LLaVA) that jointly encode image and text, improving grounding by 12–18 percentage points. For teams sourcing data today, the choice is: (1) collect USE-compatible datasets (cheaper, works with RT-1/RT-X ecosystem) or (2) collect vision-language-action triplets (costlier, future-proof for VLA models).
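
The grounding gap is easy to observe with the public USE module on TF Hub; the snippet below simply compares embeddings of two instructions that differ only in color.

```python
import numpy as np
import tensorflow_hub as hub

# Standard Universal Sentence Encoder v4 module on TF Hub (512-d embeddings).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

a, b = embed(["pick the red apple", "pick the green apple"]).numpy()
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine similarity: {cosine:.3f}")  # typically very high, so color must come from pixels
```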

Truelabel's annotation pipeline supports both. Standard RT-1 datasets include USE embeddings precomputed at collection time ($0.12/episode). VLA-ready datasets include raw text + image crops of referred objects ($0.28/episode), enabling teams to re-encode with any vision-language model. For custom vocabularies (industrial part names, medical instruments), we offer domain-specific USE fine-tuning at $8,000 per 10,000-instruction corpus.

RT-1 vs. RT-2, Open X-Embodiment, and OpenVLA

RT-1's successor, RT-2, replaces the EfficientNet + Transformer architecture with a vision-language model (PaLI-X) fine-tuned for action prediction, achieving 62% success on novel tasks versus RT-1's 32%[4]. The key difference: RT-2 leverages web-scale vision-language pretraining (images + captions) to ground language in visual semantics before robot fine-tuning, whereas RT-1 learns grounding from scratch on 130,000 robot demonstrations. RT-2's data requirements are identical to RT-1's (RLDS format, 7-DoF actions, language instructions), making RT-1 datasets forward-compatible.

Open X-Embodiment aggregated 1M+ demonstrations from 22 robot embodiments (including RT-1's 130K) into a unified RLDS corpus, then trained RT-2-X—a multi-embodiment model that outperforms single-robot specialists by 50% on average[5]. The dataset's value: cross-embodiment transfer. A model pretrained on Open X-Embodiment and fine-tuned on 1,000 demonstrations of a new task matches the performance of a model trained from scratch on 10,000 demonstrations. For procurement, this implies a pretrain-then-fine-tune strategy: license a broad multi-embodiment dataset (Open X-Embodiment, DROID) for pretraining, then collect task-specific data only for fine-tuning.

OpenVLA (June 2024) takes RT-2's vision-language approach further by initializing from a 7B-parameter vision-language model (PrismaticVLM) and fine-tuning on Open X-Embodiment, achieving state-of-the-art results on 29 of 30 benchmark tasks[6]. OpenVLA's data format is RLDS-compatible but adds image-text interleaving: each episode includes not just a task instruction but also intermediate subgoal descriptions ("approach the drawer," "grasp the handle"), improving long-horizon task performance by 23 percentage points.

For teams choosing a model today: RT-1 remains the baseline reference (open weights, well-documented, RLDS ecosystem). RT-2 is production-ready but requires PaLI-X pretraining infrastructure. OpenVLA offers the best performance but demands 7B-parameter serving (roughly 14GB in FP16, versus RT-1's 140MB footprint). Truelabel datasets work with all three—RLDS format is the common denominator—but OpenVLA-optimized datasets (with subgoal annotations) cost 1.8× standard RT-1 datasets due to hierarchical language annotation overhead.

Collecting RT-1-Compatible Data: Teleoperation, Discretization, and Quality Control

RT-1's data collection pipeline has three stages: teleoperation (human operators control robots via 6-DoF SpaceMouse), automatic logging (observations, actions, and metadata stream to RLDS files at 3 Hz), and post-collection annotation (language instructions and success labels added by human reviewers)[1]. The teleoperation interface enforces workspace bounds (±0.5m cube around robot base) and action rate limits (max 0.05m/s linear velocity) to ensure safety and data quality.

Quality control: Google's team rejected 8% of collected episodes due to collisions, task failures, or logging errors. Accepted episodes underwent action discretization validation: if an action's continuous value fell outside the expected range (e.g., gripper command >1.0), the episode was flagged for manual review. Language annotations were cross-validated by two annotators; disagreements (3.2% of episodes) were resolved by a third reviewer. The final dataset's inter-annotator agreement (Cohen's kappa) was 0.89 for success labels and 0.76 for language paraphrases.
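
A hedged sketch of the action-range check described above, applied to one RLDS-style episode; the action layout, field names, and thresholds are illustrative assumptions.

```python
import numpy as np

def flag_for_review(episode_steps: list[dict],
                    pos_limit: float = 0.05,
                    gripper_limit: float = 1.0) -> list[int]:
    """Return step indices whose raw actions fall outside expected ranges."""
    flagged = []
    for i, step in enumerate(episode_steps):
        action = np.asarray(step["action"], dtype=float)  # assumed [dx, dy, dz, droll, dpitch, dyaw, gripper, ...]
        position_delta, gripper = action[:3], action[6]
        if np.any(np.abs(position_delta) > pos_limit) or gripper > gripper_limit:
            flagged.append(i)
    return flagged
```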

For teams building in-house collection, the cost structure: $45/hour for trained teleoperators (Google's rate), 12 minutes per episode on average (5 minutes teleoperation + 7 minutes reset/setup), yielding $9 per episode before annotation. Language annotation adds $0.12 per episode (3 paraphrases at $0.04 each). At 130,000 episodes, RT-1's dataset cost ~$1.17M in labor alone, excluding robot hardware, infrastructure, and supervision.

Truelabel's marketplace offers an alternative: distributed collection across 12,000+ collectors with home robots (Franka Panda, UR5, xArm) or access to commercial kitchens, warehouses, and labs[2]. Collectors earn $6–$18 per episode depending on task complexity; truelabel handles RLDS conversion, action discretization, and language annotation. Median delivery time: 4 weeks for 1,000-episode datasets, 12 weeks for 10,000+. Quality metrics: 94% episode acceptance rate (vs. 92% in-house industry average), 0.87 inter-annotator kappa (matching RT-1's 0.89). For teams without robotics labs, distributed collection is the only path to RT-1-scale data.

RT-1 Benchmark Results: Generalization Across Objects, Tasks, and Environments

RT-1 achieved 97% success on seen tasks (objects and environments present in training data) and 76% on novel object configurations (new object instances in familiar environments)[1]. On completely novel tasks (unseen skill families), success dropped to 32%, highlighting the model's reliance on task diversity in training data. The benchmark protocol: 3,000 real-world trials across 50 held-out tasks, with success defined as task completion within 45 seconds without human intervention.

Key ablations: removing language conditioning reduced success by 18 percentage points, confirming that natural-language instructions are load-bearing for multi-task generalization. Reducing training data from 130,000 to 13,000 episodes (10× smaller) dropped success by 25 percentage points, but reducing from 744 to 74 tasks (same episode count, 10× less diversity) dropped success by 36 percentage points—task diversity matters more than dataset size[1].

Cross-embodiment transfer: RT-1 trained on Google's 13 office robots generalized to a UR5 arm in a different lab with 68% success after zero-shot transfer (no fine-tuning), demonstrating that workspace normalization enables embodiment transfer within the same action space (7-DoF end-effector control). Transfer to a 6-DoF arm (no base mobility) required 500 fine-tuning demonstrations to recover 90% of in-distribution performance.

For procurement, these results define data sufficiency thresholds: 50–200 demonstrations per task for in-distribution performance, 10,000+ total episodes across 100+ tasks for novel object generalization, 100,000+ episodes across 500+ tasks for novel task generalization. Teams targeting specific applications (warehouse picking, kitchen assistance) can reduce task diversity requirements by 60% if all target tasks share a common skill family (e.g., pick-and-place variations). Truelabel's custom collection service offers task-family-specific datasets (pick-and-place, drawer manipulation, etc.) at 30% lower cost than general-purpose corpora due to reduced collector training overhead.

Licensing, Compliance, and Provenance for RT-1 Datasets

RT-1's original 130,000-demonstration dataset is not publicly released—Google published model weights and architecture but withheld training data, citing privacy concerns (office environments contain employee faces and proprietary equipment)[1]. Teams seeking RT-1-compatible data must source from third-party providers (Open X-Embodiment, DROID, truelabel) or collect in-house.

Open X-Embodiment's 1M-demonstration corpus is released under mixed licenses: 60% CC-BY-4.0 (commercial use allowed), 30% CC-BY-NC-4.0 (non-commercial only), 10% custom academic licenses (case-by-case approval required)[5]. The license fragmentation creates procurement risk: a model trained on the full corpus inherits the most restrictive license (non-commercial), limiting deployment. Teams must subset to CC-BY-4.0 data only—reducing the corpus to 600,000 episodes—or negotiate custom licenses with each contributing institution.

DROID (76,000 demonstrations, 564 tasks) is fully CC-BY-4.0, enabling unrestricted commercial use[7]. The dataset includes provenance metadata: collector IDs (anonymized), building types (residential, commercial, lab), timestamps, and camera intrinsics. This metadata enables bias audits (e.g., "does the model underperform in low-light environments?") and compliance reporting (e.g., "no data collected in EU jurisdictions subject to GDPR Article 7 consent requirements").

Truelabel's datasets ship with chain-of-custody documentation: collector consent forms (GDPR Article 7 compliant), building owner permissions (for commercial spaces), and per-episode metadata (collector demographics, lighting conditions, object instance IDs). For teams subject to EU AI Act high-risk system requirements, we provide dataset cards (Datasheets for Datasets format) documenting collection methodology, known biases, and intended use cases. Pricing: standard datasets at $0.08/episode, compliance-documented datasets at $0.12/episode (50% premium for legal review and metadata enrichment).

RT-1 in Production: Deployment, Inference, and Edge Constraints

RT-1's 35M-parameter model fits in 140MB of GPU memory (FP16 precision), enabling real-time inference on NVIDIA Jetson AGX Orin (32GB RAM, 275 TOPS) at 3 Hz control frequency with 18ms latency per action prediction[1]. The inference pipeline: (1) capture 300×300 RGB frame from camera, (2) append to 6-frame history buffer, (3) encode via EfficientNet-B3 (12ms), (4) run Transformer forward pass (4ms), (5) sample from 256-way softmax per action dimension (2ms), (6) de-discretize and send to robot controller.
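
A simplified version of that control loop is sketched below; `policy`, `camera`, `robot`, and `undiscretize` are caller-supplied stand-ins for your model and hardware interfaces, not a published RT-1 API.

```python
import collections
import time

import numpy as np

HISTORY = 6
CONTROL_HZ = 3

def softmax(logits: np.ndarray) -> np.ndarray:
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def control_loop(policy, camera, robot, instruction_embedding, undiscretize, rng):
    """Observe -> predict -> act at 3 Hz, following the six steps above.

    `policy`, `camera`, `robot`, and `undiscretize` are stand-ins for the model
    forward pass, camera driver, robot controller, and bin-to-continuous mapping.
    """
    frames = collections.deque(maxlen=HISTORY)
    while not robot.done():
        frames.append(camera.read())                 # (300, 300, 3) uint8, assumed
        if len(frames) < HISTORY:
            continue                                 # wait for a full 6-frame history
        obs = np.stack(frames)                       # (6, 300, 300, 3)
        logits = policy(obs, instruction_embedding)  # (11, 256) per-dimension logits
        bins = np.array([rng.choice(256, p=p) for p in softmax(logits)])
        robot.execute(undiscretize(bins))            # de-discretize, send to controller
        time.sleep(1.0 / CONTROL_HZ)
```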

Edge deployment constraints: the model requires continuous camera feed (no frame drops tolerated—missing frames break the 6-frame history assumption) and deterministic action execution (the robot must complete the previous action before the next prediction arrives). In practice, this limits RT-1 to robots with <50ms action execution latency (Franka Panda, UR5, Kinova Gen3) and excludes slower systems (ABB industrial arms with 200ms+ communication overhead).

Production failure modes: (1) distribution shift (lighting, camera angle, or object appearance differs from training data), causing 40–60% success rate drops; (2) action discretization artifacts (the model predicts bin 127 when the true action requires bin 128, causing cumulative error over long horizons); (3) language ambiguity (instructions like "pick the cup" fail when multiple cups are visible). Google's deployment mitigations: online data collection (log all production episodes for retraining), action smoothing (exponential moving average over 3 predictions), and clarification dialogues (ask user "which cup?" when confidence <0.6).
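
A small sketch of the action-smoothing mitigation: a weighted average over the last three de-discretized actions, with the geometric decay factor as an assumption.

```python
import collections
import numpy as np

class ActionSmoother:
    """Exponentially weighted average over the last few continuous actions."""

    def __init__(self, window: int = 3, decay: float = 0.5):
        self.buffer = collections.deque(maxlen=window)
        self.decay = decay

    def update(self, action: np.ndarray) -> np.ndarray:
        self.buffer.append(np.asarray(action, dtype=float))
        # Most recent action gets the largest weight; older ones decay geometrically.
        weights = np.array([self.decay ** i for i in range(len(self.buffer) - 1, -1, -1)])
        weights /= weights.sum()
        return np.average(np.stack(self.buffer), axis=0, weights=weights)
```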

For teams deploying RT-1, the data flywheel is critical: production episodes become training data for the next model iteration. Truelabel's marketplace supports this workflow: upload production logs (RLDS format), we handle language annotation and success labeling ($0.12/episode), and you receive cleaned datasets for retraining within 72 hours. Median improvement: 8 percentage points success rate gain per 10,000 production episodes added to training data, with diminishing returns after 50,000 episodes (3 percentage points per 10,000).

Cost-Benefit Analysis: RT-1 Data Investment vs. In-House Collection

A 10,000-episode RT-1-compatible dataset costs $800–$1,200 from third-party providers (truelabel: $0.08/episode baseline, $0.12/episode with compliance docs; academic datasets: $0–$0.10/episode for non-commercial use)[2]. In-house collection costs $9/episode (teleoperation labor) + $0.12/episode (language annotation) = $9.12/episode, totaling $91,200 for 10,000 episodes—76× more expensive than marketplace data.

The in-house premium buys three things: (1) task specificity (100% of episodes match your exact use case vs. 40–60% relevance in general-purpose datasets), (2) embodiment match (data collected on your robot vs. cross-embodiment transfer penalty), and (3) IP control (no licensing restrictions). For teams with <$500K data budgets, the math favors marketplace data for pretraining (10,000–50,000 episodes) plus in-house fine-tuning (1,000–5,000 episodes). For teams with >$2M budgets targeting novel tasks absent from public datasets, in-house collection is justified.

Break-even analysis: in-house collection becomes cost-competitive when task-specific data requirements exceed 60,000 episodes, at which point the roughly $260K of fixed teleoperation infrastructure ($120K for 2 robots, $80K for the data logging pipeline, $60K for annotator training) amortizes to about $4 per episode. Below 60,000 episodes, marketplace data is cheaper; above 60,000, in-house is cheaper if and only if the team can sustain 12+ months of continuous collection (shorter timelines waste fixed-cost investments).

Truelabel's hybrid model: we provide turnkey collection infrastructure (robots, teleoperation software, RLDS pipeline) on a subscription basis ($12K/month for 2-robot setup, $28K/month for 6-robot setup), enabling teams to collect in-house at marketplace marginal costs ($0.08/episode) without upfront capital expenditure. Median customer profile: Series B robotics startups collecting 2,000–5,000 episodes/month for 6–18 months, then transitioning to marketplace data once task coverage stabilizes. Total cost: $216K–$504K for 18,000–54,000 episodes (vs. roughly $424K–$752K for pure in-house collection at $9.12/episode once the ~$260K of fixed infrastructure is included).

External references and source context

Use the sources below to move from category-level context into specific task, dataset, format, and comparison detail.

  1. RT-1: Robotics Transformer for Real-World Control at Scale

    RT-1 architecture, 130,000 demonstrations across 744 tasks, 97% seen-task success, 76% novel-object success, empirical findings on task diversity vs. dataset size

    arXiv
  2. truelabel physical AI data marketplace bounty intake

    Truelabel's 12,000-collector network, RT-1-compatible dataset sourcing, distributed collection model, pricing ($0.08–$0.12/episode)

    truelabel.ai
  3. RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning

    RLDS format specification, episodic structure, schema validation, lazy loading, cross-embodiment compatibility

    arXiv
  4. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 vision-language model architecture, 62% novel-task success vs. RT-1's 32%, web-scale pretraining for grounding, RLDS format compatibility

    arXiv
  5. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    1M+ demonstrations from 22 embodiments, RT-2-X multi-embodiment model, 50% cross-embodiment transfer improvement, license fragmentation (60% CC-BY-4.0, 30% CC-BY-NC-4.0)

    arXiv
  6. OpenVLA: An Open-Source Vision-Language-Action Model

    OpenVLA 7B-parameter vision-language-action model, state-of-the-art on 29 of 30 benchmarks, image-text interleaving for subgoal descriptions, 23 percentage point long-horizon improvement

    arXiv
  7. DROID project site

    DROID 76,000 demonstrations across 564 tasks from 564 buildings, CC-BY-4.0 license, diversity-first data collection philosophy

    droid-dataset.github.io
  8. LeRobot documentation

    LeRobot framework for robot learning, RLDS compatibility, diffusion policy training examples

    Hugging Face
  9. MCAP file format

    MCAP file format for multi-sensor robotics data, ROS bag 2.0 storage backend, conversion to RLDS

    mcap.dev
  10. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset

    EPIC-KITCHENS egocentric video dataset, 55 hours across 32 kitchens, first-person manipulation data collection methodology

    arXiv
  11. RoboNet: Large-Scale Multi-Robot Learning

    RoboNet large-scale multi-robot learning dataset, 15M frames from 7 robot platforms, cross-embodiment transfer experiments

    arXiv
  12. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World

    Domain randomization for sim-to-real transfer, visual appearance variation to improve generalization

    arXiv
  13. Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning

    Survey on sim-to-real transferability in reinforcement learning, reality gap challenges, data requirements for bridging

    arXiv
  14. docs.labelbox.com overview

    Labelbox annotation platform for robotics data, bounding box and polygon tools, quality control workflows

    docs.labelbox.com
  15. Segments.ai multi-sensor data labeling

    Segments.ai multi-sensor data labeling for point clouds and images, 3D annotation tools for robotics

    segments.ai
  16. appen.com data collection

    Appen's data collection services for AI training, crowd-sourced annotation, quality assurance processes

    appen.com
  17. cloudfactory.com industrial robotics

    CloudFactory's industrial robotics annotation services, teleoperation data labeling, managed workforce

    cloudfactory.com
  18. kognic.com platform

    Kognic's autonomous vehicle and robotics annotation platform, sensor fusion labeling, 3D bounding boxes

    kognic.com
  19. encord.com annotate

    Encord's video annotation platform for robotics, frame-by-frame labeling, action sequence annotation

    encord.com
  20. roboflow.com features

    Roboflow's computer vision dataset management, preprocessing pipelines, model training integrations

    roboflow.com

FAQ

What observation format does RT-1 require and why 300×300 resolution?

RT-1 requires 300×300 RGB images from a single head-mounted camera, with 6-frame history (current frame plus 5 previous frames at 3 Hz). The 300×300 resolution balances spatial detail for object recognition against computational cost—EfficientNet-B3 encoding takes 12ms per frame at this resolution on NVIDIA Jetson AGX Orin. Higher resolutions (640×480) increase encoding time to 28ms, breaking the 3 Hz control loop. The 6-frame history provides motion context (object velocities, gripper trajectories) without requiring explicit optical flow computation. Teams can substitute higher-resolution cameras if they accept slower control rates or use faster GPUs (A100 handles 640×480 at 3 Hz).

Can RT-1 transfer to robots with different action spaces (e.g., 6-DoF arms without base mobility)?

RT-1 trained on 7-DoF end-effector + 3-DoF base data can transfer to 6-DoF arms (no base) with 500–1,000 fine-tuning demonstrations, recovering 90% of in-distribution performance. The transfer works because RT-1's action discretization is per-dimension—the model learns to predict zero base velocity when base movement is unavailable. Transfer to fundamentally different action spaces (e.g., quadruped locomotion, dual-arm manipulation) requires retraining from scratch because the action tokenization scheme is incompatible. For procurement, this means: source data matching your robot's action space (7-DoF, 6-DoF, dual-arm) rather than relying on cross-embodiment transfer, which incurs 10–30% performance penalties even after fine-tuning.

How does RT-1's 256-bin action discretization compare to continuous action prediction?

RT-1's 256-bin discretization outperforms continuous regression by 13 percentage points on novel object generalization because categorical distributions can express multimodal action preferences (e.g., "grasp from left OR right") and stabilize training on heterogeneous data. The cost: 256× larger output layer (11 dimensions × 256 bins = 2,816 output neurons vs. 11 for continuous) and ~0.4mm precision floor (for ±0.05m position range). Teams targeting sub-millimeter tasks (electronics assembly, surgical robotics) should use 512 or 1024 bins, accepting 2–4× larger models and slower inference. Empirically, 512 bins improve precision-critical task success by 8–12 percentage points over 256 bins, while 1024 bins yield diminishing returns (<3 percentage points gain).

What is the minimum dataset size for RT-1 to generalize to novel objects?

RT-1 requires 10,000+ demonstrations across 100+ tasks for 76% success on novel object configurations (new object instances in familiar environments). Below 10,000 episodes, success drops to 45–60% on novel objects. Task diversity matters more than episode count: 50 demonstrations each across 200 tasks outperforms 500 demonstrations each across 20 tasks by 18 percentage points. For completely novel tasks (unseen skill families), RT-1 needs 100,000+ demonstrations across 500+ tasks to exceed 50% success—this is why Open X-Embodiment aggregated 1M+ demonstrations from 22 embodiments. Teams with <10,000-episode budgets should focus on task-specific datasets (narrow but deep) rather than attempting general-purpose models.

How do I convert existing ROS bag or HDF5 datasets to RT-1-compatible RLDS format?

Converting ROS bag or HDF5 datasets to RLDS requires three steps: (1) episode segmentation (split continuous logs into task attempts using success signals or timeout heuristics), (2) action discretization (bin continuous commands into 256 bins per dimension, matching RT-1's ±0.05m position range and ±π orientation range), and (3) language annotation (pair each episode with natural-language task descriptions, ideally 3+ paraphrases). TensorFlow's RLDS library provides conversion utilities for ROS bag (via rosbag2_storage_mcap bridge) and HDF5 (via h5py). Truelabel offers automated conversion at $0.04/episode (steps 1–2) plus $0.12/episode for human language annotation (step 3). Median conversion time: 2 weeks for 10,000-episode datasets, including quality validation (action range checks, frame continuity verification).

What are the licensing restrictions on Open X-Embodiment data for commercial RT-1 training?

Open X-Embodiment's 1M-demonstration corpus is 60% CC-BY-4.0 (commercial use allowed), 30% CC-BY-NC-4.0 (non-commercial only), and 10% custom academic licenses. A model trained on the full corpus inherits the most restrictive license (CC-BY-NC-4.0), prohibiting commercial deployment. To enable commercial use, teams must subset to CC-BY-4.0 data only—reducing the corpus to ~600,000 episodes—or negotiate custom licenses with each contributing institution (typically $5K–$50K per institution for commercial rights). DROID (76,000 demonstrations) is fully CC-BY-4.0 with no restrictions. Truelabel datasets are licensed under perpetual commercial-use terms with no per-deployment royalties, priced at $0.08–$0.12/episode depending on compliance documentation requirements.

Looking for RT-1 training data?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

Source RT-1-Compatible Datasets