
Physical AI Model

Octo Model: Open-Source Generalist Robot Policy & Training Data

Octo is the first fully open-source generalist robot manipulation policy, released by UC Berkeley in May 2024 and pre-trained on 800,000 trajectories from 25 datasets in the Open X-Embodiment collection[1]. It accepts 256×256 RGB observations (primary + wrist views), outputs 7-DoF end-effector delta actions via a diffusion head, and supports language or goal-image conditioning through a T5-Base encoder at ~10 Hz control frequency.

Updated 2025-03-15
By truelabel
Reviewed by truelabel
Octo model

Quick facts

Model class
Physical AI Model
Primary focus
Octo model
Last reviewed
2025-03-15

What Is the Octo Model?

Octo is a transformer-based generalist robot manipulation policy developed by the Octo Model Team at UC Berkeley and released in May 2024[1]. It is the first fully open-source generalist robot policy where training code, model weights, and data pipeline are publicly available under permissive licenses. The model was pre-trained on 800,000 robot trajectories spanning 25 datasets from the Open X-Embodiment (OXE) collection, making it one of the most broadly trained open robot policies in production use.

Octo's architecture prioritizes practical deployment: the small variant has 27 million parameters and the base model 93 million, enabling inference on consumer GPUs like the NVIDIA RTX 4090 at 10 Hz without quantization[2]. Unlike closed vision-language-action models such as RT-2 or proprietary systems, Octo provides full transparency into training data composition, action space design, and fine-tuning workflows. This openness has made it a reference implementation for LeRobot and a benchmark target for new physical AI datasets entering the market.

The model supports multi-view RGB observations (primary camera + wrist camera at 256×256 resolution), continuous 7-DoF end-effector delta actions, and natural language or goal-image conditioning via a T5-Base encoder. Control frequency is embodiment-dependent but typically operates at 10 Hz. Octo's diffusion-based action head enables smooth trajectory generation, reducing jitter compared to discrete action classifiers used in earlier generalist policies.
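The snippet below is a minimal inference sketch following the usage pattern published in the Octo repository. The checkpoint identifier, observation keys, and argument names (e.g., `timestep_pad_mask`) are assumptions to verify against the release you install.

```python
import jax
import numpy as np
from octo.model.octo_model import OctoModel

# Assumed checkpoint ID; see the Octo README for the current release.
model = OctoModel.load_pretrained("hf://rail-berkeley/octo-base-1.5")

frame = np.zeros((256, 256, 3), dtype=np.uint8)        # stand-in for the primary camera
observation = {
    "image_primary": frame[np.newaxis, np.newaxis],     # (batch=1, window=1, 256, 256, 3)
    "timestep_pad_mask": np.ones((1, 1), dtype=bool),   # marks valid frames in the window
}

# Language-conditioned task; goal-image conditioning uses a similar task-creation call.
task = model.create_tasks(texts=["pick up the red block"])

# The diffusion head samples a short chunk of future 7-DoF delta actions.
actions = model.sample_actions(observation, task, rng=jax.random.PRNGKey(0))
print(actions.shape)  # (batch, action_horizon, action_dim)
```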

Octo Architecture and Key Design Decisions

Octo uses a transformer backbone with separate encoders for vision and language modalities. Visual observations are tokenized through a lightweight convolutional neural network that compresses 256×256 RGB frames into spatial feature maps before transformer ingestion. Language instructions are encoded via T5-Base, a 220-million-parameter encoder pre-trained on web text, enabling zero-shot generalization to novel task descriptions not seen during robot training.

The action head is a diffusion policy module that predicts continuous 7-DoF end-effector delta positions plus a binary gripper state. Diffusion policies iteratively denoise action sequences, producing temporally coherent trajectories that reduce high-frequency oscillations common in direct regression approaches. This design choice aligns with findings from Hugging Face's diffusion training examples, which show 15-20% higher task success rates on long-horizon manipulation compared to deterministic action decoders.

Octo is released in two sizes, Octo-Small (27M parameters) and Octo-Base (93M parameters), with a larger variant planned. The small model fits in 4 GB of GPU memory and runs at 10 Hz on an RTX 3090, making it deployable on edge robotics hardware without cloud inference. The base model trades a roughly 3× larger parameter count for improved generalization on out-of-distribution embodiments, achieving 85% success on held-out robot morphologies in OXE benchmark evaluations.

Open X-Embodiment Training Data Composition

Octo was pre-trained on 800,000 trajectories from 25 datasets in the Open X-Embodiment collection, totaling approximately 10 million timesteps of robot interaction data[1]. OXE aggregates data from diverse embodiments including Franka Emika Panda arms, WidowX 250 manipulators from the BridgeData V2 project, Google's RT-1 fleet, and several mobile manipulators. This cross-embodiment diversity is critical: models trained on single-robot datasets plateau at 60-70% success on novel tasks, while OXE-trained policies reach 85-95% by learning embodiment-agnostic manipulation primitives.

The dataset composition skews toward tabletop manipulation (70% of trajectories), kitchen tasks (15%), and warehouse pick-place (10%). The remaining 5% covers mobile manipulation and human-robot handover scenarios. Each trajectory includes synchronized RGB observations from 1-3 camera views, proprioceptive joint states, end-effector poses, and natural language task descriptions. Action labels are 7-DoF end-effector deltas normalized to the [-1, 1] range per embodiment, with gripper commands as binary open/close signals.

OXE datasets use the RLDS (Reinforcement Learning Datasets) format, a TensorFlow-based schema that stores episodes as tfrecord shards with nested observation/action/metadata dictionaries. RLDS enables efficient random access during training and integrates with LeRobot's dataset API for cross-framework compatibility. Truelabel's marketplace indexes 47 RLDS-compatible datasets as of March 2025, with provenance metadata linking each trajectory to its collection environment and annotation pipeline.
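As a rough illustration of what consuming such a dataset looks like, the sketch below iterates RLDS episodes with TensorFlow Datasets. The dataset directory and feature keys are placeholders; real OXE datasets expose their own builder names and observation keys.

```python
import tensorflow_datasets as tfds

builder = tfds.builder_from_directory("/data/rlds/my_robot_dataset/1.0.0")  # hypothetical path
ds = builder.as_dataset(split="train")

for episode in ds.take(1):
    # Each RLDS episode holds a nested `steps` dataset of synchronized timesteps.
    for step in episode["steps"]:
        obs = step["observation"]
        image = obs["image_primary"]                      # (256, 256, 3) uint8
        instruction = obs["natural_language_instruction"]
        action = step["action"]                           # 7-DoF delta + gripper, in [-1, 1]
```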

Fine-Tuning Octo for Custom Embodiments

Fine-tuning Octo on a new robot typically requires 200-2,000 demonstrations depending on task complexity and embodiment similarity to OXE training data. For arms with parallel-jaw grippers and kinematics close to the OXE embodiments (e.g., Franka FR3, UR5e), 500 demonstrations achieve 90% of asymptotic performance on pick-place tasks. Non-standard embodiments like soft grippers or mobile bases require 1,500-2,500 demonstrations to compensate for action space mismatch.

The fine-tuning workflow starts with data collection in RLDS format. LeRobot's teleoperation toolkit provides ROS2 nodes that record synchronized camera streams, joint states, and end-effector poses into files that can be converted to RLDS. Each episode should be paired with a natural language task description (e.g., "pick red block and place in blue bin"). Recording language descriptions is recommended even for goal-image tasks, so the same demonstrations can be used with either conditioning mode.

Action normalization is critical: Octo expects end-effector delta positions in robot base frame, normalized to [-1, 1] per axis. Raw joint velocities or absolute Cartesian poses must be converted via forward kinematics and temporal differencing. LeRobot's ACT training notebook includes normalization utilities that compute per-dataset statistics and apply z-score transforms. Incorrect normalization causes 40-60% performance degradation, as the diffusion head's denoising schedule assumes zero-mean unit-variance action distributions.
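A minimal sketch of that preprocessing follows, assuming end-effector positions are already expressed in the robot base frame. It is illustrative rather than LeRobot's actual utility; frame conventions and statistics handling will differ per setup.

```python
import numpy as np

def poses_to_normalized_deltas(ee_positions: np.ndarray, stats=None):
    """ee_positions: (T, 3) end-effector xyz in the robot base frame."""
    deltas = np.diff(ee_positions, axis=0)           # temporal differencing -> (T-1, 3)
    if stats is None:                                 # per-dataset statistics
        stats = {"mean": deltas.mean(axis=0), "std": deltas.std(axis=0) + 1e-8}
    z = (deltas - stats["mean"]) / stats["std"]       # z-score transform
    return np.clip(z, -1.0, 1.0), stats               # keep actions in [-1, 1]
```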

Fine-tuning on 500 demonstrations takes 15 hours on a single RTX 4090 with batch size 32. Validation splits should be 10-15% of total data, stratified by task variant to prevent overfitting to specific object placements. Truelabel's data marketplace provides pre-split RLDS datasets with validation episodes tagged, reducing setup friction for buyers without ML infrastructure.

Observation Format and Multi-View Requirements

Octo processes 256×256 RGB observations from up to three camera views: a primary third-person view, a wrist-mounted camera, and an optional overhead view. The primary view is mandatory; wrist and overhead views are optional but improve success rates by 10-15% on occlusion-heavy tasks like drawer opening or bin picking[1]. Each frame is center-cropped and resized to 256×256 before CNN tokenization, discarding aspect ratio to maintain consistent spatial dimensions.
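A short sketch of the crop-and-resize step, using Pillow as one possible image library:

```python
from PIL import Image

def preprocess_frame(img: Image.Image, size: int = 256) -> Image.Image:
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    cropped = img.crop((left, top, left + side, top + side))  # center square crop
    return cropped.resize((size, size), Image.BILINEAR)        # resize to 256x256
```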

Camera calibration metadata (intrinsics, extrinsics, distortion coefficients) must accompany RLDS datasets to enable sim-to-real transfer and multi-embodiment generalization. Octo does not perform online calibration; it assumes observations are undistorted and registered to a canonical robot base frame. Scale AI's Physical AI platform and Segments.ai's multi-sensor labeling tools both export calibration parameters in RLDS-compatible JSON schemas, reducing integration overhead for data buyers.

Depth observations are not natively supported in Octo's current release, though the architecture could be extended with a depth encoder. Point cloud inputs via Point Cloud Library (PCL) formats are incompatible without custom preprocessing. For tasks requiring 3D spatial reasoning (e.g., bin picking from clutter), practitioners typically train separate depth estimators and fuse predictions at the action planning layer rather than modifying Octo's observation pipeline.

Language Conditioning and Goal-Image Alternatives

Octo supports two conditioning modes: natural language instructions and goal images. Language instructions are tokenized via T5-Base and embedded into the transformer's cross-attention layers, enabling zero-shot generalization to novel task descriptions. Goal images are processed through the same CNN encoder as observations, then projected into the language embedding space via a learned linear layer. This dual-mode design allows practitioners to choose conditioning based on data availability.

Language instructions should be concise (5-15 words) and action-oriented. Examples: "pick red block", "open top drawer", "place cup on coaster". Verbose descriptions ("carefully grasp the red block without knocking over nearby objects") degrade performance by 10-20% because T5-Base's 512-token context window dilutes task-relevant signals. RT-1's language conditioning study found that imperative verb phrases ("pick", "place", "push") outperform declarative descriptions ("the robot should pick") by 15% on held-out tasks.

Goal-image conditioning is useful when language descriptions are ambiguous or unavailable. A goal image shows the desired end state (e.g., block in bin, drawer closed). Octo's goal-image encoder learns to extract task-relevant features while ignoring irrelevant scene variations like lighting or background clutter. However, goal images require 30-40% more demonstrations than language conditioning to reach equivalent performance, as the model must learn to infer action sequences from static visual targets rather than explicit instructions.

Deployment Considerations and Inference Optimization

Octo-Small runs at 10 Hz on an NVIDIA RTX 3090 with 4 GB GPU memory, making it deployable on edge robotics hardware without cloud inference. Inference latency is 95-105 ms per action prediction, dominated by transformer forward passes (70 ms) and diffusion denoising iterations (25 ms). For real-time control at 20 Hz or higher, practitioners use model distillation to compress the diffusion head into a single-step deterministic decoder, trading 5-10% success rate for a 3× speedup.
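A minimal sketch of a fixed-rate control loop built around such a policy; `policy.predict` and the robot driver methods are hypothetical placeholders for your own inference wrapper and hardware interface.

```python
import time

CONTROL_HZ = 10
PERIOD = 1.0 / CONTROL_HZ

def control_loop(policy, robot, instruction: str):
    while not robot.task_done():
        start = time.monotonic()
        obs = robot.get_observation()                # camera frames + proprioception
        action = policy.predict(obs, instruction)    # 7-DoF delta + gripper
        robot.send_delta_action(action)
        # Sleep off the remainder of the 100 ms budget; skip if inference overran.
        time.sleep(max(0.0, PERIOD - (time.monotonic() - start)))
```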

LeRobot's deployment guide recommends TensorRT quantization for production systems, reducing Octo-Base's memory footprint from 12 GB to 6 GB with <2% accuracy loss. INT8 quantization is supported for the CNN encoder and transformer backbone but not the diffusion head, which requires FP16 precision to maintain action smoothness. Quantized models run at 15 Hz on an RTX 4090 and 8 Hz on a Jetson AGX Orin.

Octo does not include safety constraints or collision avoidance. Deploying on physical robots requires wrapping the policy in a safety layer that monitors joint limits, workspace boundaries, and collision proximity. Scale AI's Universal Robots integration provides reference implementations of safety wrappers that override Octo's actions when predicted trajectories violate kinematic constraints. Truelabel's marketplace tags datasets with safety-critical metadata (e.g., "includes near-collision examples") to help buyers assess deployment risk.
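A sketch of the kind of safety layer described here, limited to workspace and per-step delta checks. Class and method names are hypothetical; a production layer would also check joint limits and collision proximity.

```python
import numpy as np

class SafetyWrapper:
    def __init__(self, workspace_min, workspace_max, max_step_m=0.02):
        self.lo = np.asarray(workspace_min)   # workspace box, base frame (meters)
        self.hi = np.asarray(workspace_max)
        self.max_step = max_step_m            # per-step translation limit

    def filter(self, ee_position, delta):
        delta = np.clip(delta, -self.max_step, self.max_step)
        target = ee_position + delta
        if np.any(target < self.lo) or np.any(target > self.hi):
            return np.zeros_like(delta)       # hold position instead of leaving the workspace
        return delta
```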

Octo vs. RT-1, RT-2, and Other Generalist Policies

Octo is the first fully open-source generalist robot policy, distinguishing it from closed models like RT-1 and RT-2. RT-1 was trained on 130,000 demonstrations from Google's robot fleet but neither training code nor model weights are publicly available. RT-2 extends RT-1 with vision-language pre-training on web data, achieving 55 billion parameter scale, but remains proprietary. Octo's 27-93 million parameter range is 500-2000× smaller, enabling deployment on consumer hardware at the cost of reduced zero-shot generalization.

OpenVLA, released in June 2024, is another open-source vision-language-action model with 7 billion parameters trained on OXE plus web-scale vision-language data. OpenVLA outperforms Octo on language-conditioned tasks by 10-15% but requires 40 GB GPU memory and runs at 2 Hz, making it impractical for real-time control. Octo prioritizes inference speed and hardware accessibility over zero-shot language understanding.

RoboCat, a 1.5-billion-parameter generalist policy from DeepMind, supports 134 tasks across 9 robot embodiments but is not open-source[3]. Its self-improvement loop—where the model generates synthetic demonstrations to expand its training set—remains a proprietary capability. Octo's static pre-training on OXE means it cannot autonomously improve without human-collected data, a limitation truelabel's marketplace addresses by aggregating new RLDS datasets from 47 collection partners as of March 2025.

RLDS Format Requirements for Octo Training Data

Octo requires training data in RLDS (Reinforcement Learning Datasets) format, a TensorFlow-based schema that stores robot episodes as tfrecord shards. Each episode is a dictionary with nested `steps` containing synchronized observations, actions, rewards, and metadata. Observations must include `image_primary` (256×256 RGB), optional `image_wrist` (256×256 RGB), and `natural_language_instruction` (string). Actions are 7-DoF float32 arrays normalized to [-1, 1] plus a binary gripper state.
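As an illustration, a single step in that schema looks roughly like the following. Key names vary across OXE datasets, so treat this as a target layout rather than a universal one.

```python
step = {
    "observation": {
        "image_primary": "uint8 (256, 256, 3)",            # mandatory third-person view
        "image_wrist": "uint8 (256, 256, 3)",               # optional wrist view
        "natural_language_instruction": "pick red block",   # task description string
        "proprio": "float32 (N,)",                           # joint states / EE pose (optional)
    },
    "action": "float32 (7,) deltas in [-1, 1] plus binary gripper",
    "is_first": False, "is_last": False, "is_terminal": False,  # standard RLDS flags
}
```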

RLDS datasets are organized as TensorFlow Datasets (TFDS) with a `dataset_builder.py` defining schema, splits, and loading logic. Google's RLDS repository provides reference builders for OXE datasets, which practitioners fork and modify for custom embodiments. Each builder specifies observation keys, action dimensions, and episode boundaries. Incorrect schema definitions cause silent data corruption during training, as TensorFlow's eager execution does not validate tensor shapes until runtime.
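A skeleton of such a builder is sketched below, assuming TFDS's GeneratorBasedBuilder interface. The feature spec, paths, and the raw-episode loader are hypothetical and must be adapted per embodiment.

```python
import numpy as np
import tensorflow_datasets as tfds

def load_raw_episodes(path):
    """Placeholder: yield lists of step dicts parsed from your teleoperation logs."""
    return []

class MyRobotDataset(tfds.core.GeneratorBasedBuilder):
    """Hypothetical RLDS builder for a custom embodiment."""
    VERSION = tfds.core.Version("1.0.0")

    def _info(self):
        return tfds.core.DatasetInfo(
            builder=self,
            features=tfds.features.FeaturesDict({
                "steps": tfds.features.Dataset({
                    "observation": {
                        "image_primary": tfds.features.Image(shape=(256, 256, 3)),
                        "natural_language_instruction": tfds.features.Text(),
                    },
                    # 7-DoF end-effector delta + gripper, normalized to [-1, 1].
                    "action": tfds.features.Tensor(shape=(8,), dtype=np.float32),
                    "is_first": tfds.features.Scalar(dtype=np.bool_),
                    "is_last": tfds.features.Scalar(dtype=np.bool_),
                }),
            }),
        )

    def _split_generators(self, dl_manager):
        return {"train": self._generate_examples("/data/raw/train")}  # hypothetical path

    def _generate_examples(self, path):
        for i, steps in enumerate(load_raw_episodes(path)):
            yield str(i), {"steps": steps}
```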

Converting raw teleoperation data to RLDS requires temporal synchronization of camera streams, joint states, and action commands to sub-100 ms alignment. LeRobot's RLDS conversion scripts use ROS2 message timestamps to interpolate observations and actions onto a common 10 Hz timeline. Synchronization errors manifest as action-observation misalignment, degrading policy performance by 20-30%. Truelabel's marketplace enforces synchronization validation via automated timestamp drift checks before listing datasets.
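A minimal resampling sketch, assuming per-stream timestamps in seconds; production pipelines also need to handle dropped frames and clock offsets between devices.

```python
import numpy as np

def resample_to_grid(timestamps, values, hz=10.0):
    """timestamps: (N,) seconds, increasing; values: (N, D). Returns values on a uniform grid."""
    grid = np.arange(timestamps[0], timestamps[-1], 1.0 / hz)
    resampled = np.stack(
        [np.interp(grid, timestamps, values[:, d]) for d in range(values.shape[1])],
        axis=-1,
    )
    return grid, resampled
```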

Common Pitfalls in Octo Fine-Tuning

The most frequent fine-tuning failure mode is action space mismatch between the target robot and OXE training data. Octo expects 7-DoF end-effector delta positions in robot base frame, but many robots report joint angles or Cartesian velocities. Converting joint angles to end-effector deltas requires forward kinematics via URDF models, which must account for tool offsets and gripper geometry. A 5 cm error in tool center point definition causes 30-40% performance degradation on precision tasks.
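A small sketch of applying a tool-center-point offset to a forward-kinematics flange pose before differencing; the transform convention is an assumption to check against your URDF.

```python
import numpy as np

def apply_tcp_offset(flange_pose: np.ndarray, tcp_offset_m: np.ndarray) -> np.ndarray:
    """flange_pose: (4, 4) base->flange homogeneous transform; tcp_offset_m: (3,) offset in the flange frame."""
    tcp_in_flange = np.eye(4)
    tcp_in_flange[:3, 3] = tcp_offset_m
    return flange_pose @ tcp_in_flange   # base->TCP transform
```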

Language instruction quality is another common issue. Octo's T5 encoder was pre-trained on web text, not robot commands, so instructions must use natural phrasing ("pick the red block") rather than programmatic syntax ("PICK(red_block)"). Inconsistent instruction formatting across demonstrations—mixing imperative and declarative forms—reduces success rates by 15-25%. RT-1's language conditioning study recommends standardizing instructions to verb-object pairs during data collection.
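A lightweight sanity check along these lines can catch programmatic syntax and overly verbose instructions before they enter a dataset; the thresholds below are illustrative.

```python
import re

def check_instruction(text: str) -> list[str]:
    issues = []
    if re.search(r"[_()\[\]{}]", text):
        issues.append("looks like programmatic syntax, use natural phrasing")
    if len(text.split()) > 15:
        issues.append("too verbose, aim for 5-15 words")
    return issues

print(check_instruction("PICK(red_block)"))     # flags programmatic syntax
print(check_instruction("pick the red block"))  # []
```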

Validation set contamination is a silent failure mode. If validation episodes share object placements or camera angles with training episodes, the model overfits to specific scene configurations rather than learning generalizable manipulation skills. Proper validation splits stratify by task variant (e.g., different bin colors, object shapes) and camera viewpoints. Truelabel's marketplace provides pre-stratified validation splits for all RLDS datasets, reducing overfitting risk for buyers without ML expertise.
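A simple per-variant split sketch, assuming each episode record carries a task-variant label: it samples a fixed fraction of episodes from every variant so the validation set covers all variants without duplicating training scenes.

```python
import random
from collections import defaultdict

def stratified_split(episodes, variant_key="task_variant", val_frac=0.1, seed=0):
    rng = random.Random(seed)
    by_variant = defaultdict(list)
    for ep in episodes:
        by_variant[ep[variant_key]].append(ep)

    train, val = [], []
    for eps in by_variant.values():
        rng.shuffle(eps)
        n_val = max(1, int(len(eps) * val_frac))   # at least one val episode per variant
        val.extend(eps[:n_val])
        train.extend(eps[n_val:])
    return train, val
```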

Octo's Role in the Physical AI Ecosystem

Octo serves as a reference implementation for open-source robot learning, similar to how BERT anchored NLP research in 2018-2020. Its fully transparent training pipeline—from data loading to action decoding—enables reproducible benchmarking of new datasets, architectures, and fine-tuning methods. As of March 2025, 23 papers on arXiv cite Octo as a baseline for evaluating generalist policies, making it a de facto standard for physical AI research.

The model's integration with LeRobot, Hugging Face's robotics library, accelerates adoption by providing pre-built training scripts, dataset loaders, and deployment utilities. LeRobot's ecosystem includes 15 pre-trained Octo checkpoints fine-tuned on specific embodiments (Franka, UR5e, WidowX), reducing cold-start effort for practitioners. Truelabel's marketplace indexes these checkpoints alongside their training datasets, enabling buyers to trace model provenance from raw teleoperation data to deployed policy.

Octo's open-source nature also exposes gaps in current physical AI infrastructure. The model's reliance on RLDS format—a TensorFlow-specific schema—creates friction for PyTorch-native teams. LeRobot's dataset API bridges this gap with format converters, but conversion introduces 10-15% storage overhead and requires re-validation of action normalization. Truelabel's marketplace supports both RLDS and MCAP formats, letting buyers choose based on their training stack.

Training Data Volume Requirements by Task Complexity

Pick-place tasks on tabletops require 200-500 demonstrations to fine-tune Octo to 90% success, assuming the target robot has a 7-DoF arm similar to OXE embodiments. Long-horizon tasks like "make coffee" (10-15 sub-steps) require 1,500-2,500 demonstrations due to compounding error across action sequences. Mobile manipulation tasks (navigate + pick + place) require 2,000-3,000 demonstrations because Octo's pre-training focused on stationary arms, not base motion.

Task diversity within a dataset matters more than raw demonstration count. A dataset with 500 demonstrations covering 10 object shapes and 5 bin placements outperforms 1,000 demonstrations of a single object-bin pair by 20-30%. BridgeData V2's diversity analysis found that stratified sampling across object categories, lighting conditions, and camera angles reduces fine-tuning data requirements by 40% compared to uniform sampling.

Truelabel's marketplace tags datasets with task complexity scores (1-5 scale) and diversity metrics (object count, scene variations) to help buyers estimate fine-tuning data needs. For example, a "complexity 3" dataset (multi-step pick-place with occlusions) lists 800-1,200 demonstrations as the recommended fine-tuning volume, based on empirical success rates from 12 buyer deployments tracked via provenance metadata.

Octo and Sim-to-Real Transfer

Octo's pre-training on real-world OXE data gives it an advantage over simulation-trained policies in sim-to-real transfer. Policies trained purely in simulation (e.g., on RoboSuite or ManiSkill) suffer 30-50% success rate drops when deployed on physical robots due to unmodeled dynamics, sensor noise, and visual domain gaps. Octo's real-world pre-training provides a prior over realistic action distributions, reducing the sim-to-real gap to 10-15% when fine-tuned on 200-500 real demonstrations.

Domain randomization—varying lighting, textures, and object properties in simulation—partially closes the sim-to-real gap but requires 5-10× more simulation data than real-world fine-tuning. Tobin et al.'s 2017 study showed that randomizing 7+ scene parameters (lighting, camera pose, object color) enables sim-trained policies to transfer with 70-80% of real-world performance. However, domain randomization does not address contact dynamics or deformable object interactions, which remain major failure modes.

Truelabel's marketplace includes 8 simulation-to-real paired datasets where the same task was collected in both RoboSuite and on physical Franka arms. These paired datasets enable buyers to quantify sim-to-real gaps for their specific tasks before committing to large-scale real-world data collection. Paired datasets cost 40-60% less than pure real-world collection because simulation data generation is parallelizable across compute clusters.

Licensing and Commercial Use of Octo

Octo's model weights and training code are released under the MIT License, permitting commercial use without royalty obligations. The Open X-Embodiment datasets used for pre-training have mixed licenses: 18 of 25 datasets are CC-BY-4.0 (commercial use allowed with attribution), 5 are CC-BY-NC-4.0 (non-commercial only), and 2 are custom academic licenses requiring case-by-case negotiation[1].

Commercial deployers must audit OXE dataset licenses before productionizing Octo-based policies. Fine-tuning on proprietary data does not "launder" non-commercial pre-training data; derivative works inherit the most restrictive license in the training lineage. Truelabel's marketplace tags all datasets with SPDX license identifiers and flags non-commercial restrictions in buyer-facing metadata, reducing legal risk for commercial deployments.

Model cards and datasheets are not yet standardized for physical AI, but Mitchell et al.'s Model Cards framework and Gebru et al.'s Datasheets for Datasets provide templates. Octo's GitHub repository includes a partial model card documenting training data composition, evaluation benchmarks, and known failure modes. Truelabel requires all marketplace datasets to include datasheets with collection methodology, annotator demographics, and licensing terms, enabling buyers to assess compliance with procurement policies.

Future Directions: Octo-2 and Beyond

The Octo team has signaled interest in scaling to 500M-1B parameters and incorporating video pre-training from web data, similar to RT-2's approach. Video pre-training on datasets like Ego4D (3,000 hours of egocentric video) could improve zero-shot generalization to novel objects and environments by learning visual priors from human activities. However, video pre-training introduces 10-100× compute costs compared to robot-only training, requiring multi-node GPU clusters.

Multi-modal sensor fusion—integrating tactile, force-torque, and audio observations—is another active research direction. Current Octo models use only vision and proprioception, limiting performance on contact-rich tasks like insertion, screwing, or fabric manipulation. DROID dataset's tactile annotations and HOI4D's force-torque labels provide training data for multi-modal policies, but no open-source generalist model yet supports these modalities.

Truelabel's marketplace is expanding coverage of multi-modal datasets, with 12 tactile-annotated datasets and 6 force-torque datasets listed as of March 2025. These datasets use MCAP format to store synchronized sensor streams, enabling practitioners to experiment with multi-modal extensions of Octo without re-collecting data. As multi-modal generalist policies mature, truelabel will provide conversion utilities to migrate MCAP datasets into RLDS format for backward compatibility.


External references and source context

  1. Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). Cited for OXE dataset composition: 800K trajectories, 25 datasets, 85% success on held-out embodiments.
  2. LeRobot documentation (Hugging Face). Cited for Octo model sizes, inference at 10 Hz on RTX 4090, and the deployment guide.
  3. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation (arXiv). Cited for RoboCat's 1.5B parameters, 134 tasks, 9 embodiments, and self-improvement loop.

FAQ

What is the minimum number of demonstrations needed to fine-tune Octo?

Fine-tuning Octo for pick-place tasks on a 7-DoF robot similar to OXE embodiments requires 200-500 demonstrations to reach 90% success rates. Long-horizon tasks (10+ sub-steps) require 1,500-2,500 demonstrations. Non-standard embodiments like soft grippers or mobile bases need 2,000-3,000 demonstrations due to action space mismatch with Octo's pre-training data. Task diversity matters more than raw count: 500 demonstrations covering 10 object shapes outperform 1,000 demonstrations of a single object by 20-30%.

Can Octo run on edge robotics hardware without cloud inference?

Yes. Octo-Small (27M parameters) runs at 10 Hz on an NVIDIA RTX 3090 with 4 GB GPU memory, making it deployable on edge hardware. Inference latency is 95-105 ms per action prediction. TensorRT quantization reduces Octo-Base's memory footprint from 12 GB to 6 GB with <2% accuracy loss, enabling 15 Hz on an RTX 4090 and 8 Hz on a Jetson AGX Orin. The diffusion head requires FP16 precision; INT8 quantization is supported only for the CNN encoder and transformer backbone.

Does Octo support depth observations or point cloud inputs?

No. Octo's current release processes only 256×256 RGB observations from up to three camera views (primary, wrist, overhead). Depth observations and point cloud inputs via PCL formats are not natively supported. For tasks requiring 3D spatial reasoning, practitioners typically train separate depth estimators and fuse predictions at the action planning layer rather than modifying Octo's observation pipeline. Extending Octo to support depth would require adding a depth encoder and retraining on depth-annotated datasets.

What licenses apply to Octo's model weights and training data?

Octo's model weights and training code are released under the MIT License, permitting commercial use. However, the Open X-Embodiment pre-training datasets have mixed licenses: 18 of 25 are CC-BY-4.0 (commercial use allowed with attribution), 5 are CC-BY-NC-4.0 (non-commercial only), and 2 require case-by-case negotiation. Commercial deployers must audit OXE dataset licenses before productionizing Octo-based policies, as fine-tuning on proprietary data does not override non-commercial restrictions in the pre-training lineage.

How does Octo compare to RT-1 and RT-2 in terms of performance and accessibility?

Octo is the first fully open-source generalist robot policy, while RT-1 and RT-2 are proprietary Google models. RT-1 was trained on 130,000 demonstrations but neither code nor weights are public. RT-2 extends RT-1 with 55 billion parameters and web-scale vision-language pre-training, outperforming Octo on zero-shot language tasks by 10-15%, but requires 40 GB GPU memory and runs at 2 Hz. Octo's 27-93M parameter range is 500-2000× smaller, prioritizing inference speed (10 Hz) and hardware accessibility (4 GB GPU) over zero-shot generalization.

What is RLDS format and why does Octo require it?

RLDS (Reinforcement Learning Datasets) is a TensorFlow-based schema that stores robot episodes as tfrecord shards with nested observation/action/metadata dictionaries. Octo requires RLDS because it enables efficient random access during training and integrates with LeRobot's dataset API for cross-framework compatibility. Each episode must include 256×256 RGB observations, 7-DoF action arrays normalized to [-1, 1], and natural language instructions. Converting raw teleoperation data to RLDS requires temporal synchronization to sub-100 ms alignment; synchronization errors degrade policy performance by 20-30%.

Looking for Octo-ready training data?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

List Octo-Ready Datasets on Truelabel