Dataset Profile
Open X-Embodiment Dataset: Multi-Robot Learning at Scale
Open X-Embodiment is a 1M+ trajectory dataset spanning 22 robot embodiments—from single arms to bi-manual systems and quadrupeds—released in October 2023 to train cross-embodiment policies like RT-X. It unifies 60+ source datasets into RLDS format under mixed Creative Commons licenses, enabling imitation learning and vision-language-action model fine-tuning but requiring per-subset license review for commercial deployment.
Dataset Composition and Scale
Open X-Embodiment aggregates 60+ constituent datasets into a single namespace, yielding approximately 1 million real-world robot trajectories[1]. The collection spans 22 distinct embodiments: 7-DOF single arms (Franka Emika Panda, Kuka IIWA), bi-manual systems, mobile manipulators, and Boston Dynamics Spot quadrupeds[1]. Each trajectory records RGB observations, proprioceptive state, and discrete or continuous actions at 5–30 Hz depending on the source dataset.
Major contributors include BridgeData V2 (60,096 trajectories across kitchen tasks), DROID (76,000 trajectories from 564 scenes), and the original RT-1 dataset (130,000 episodes). Smaller subsets cover cable routing, door opening, and tabletop rearrangement. The RLDS specification provides the common schema: each episode is an HDF5 group containing `steps` (observations, actions, rewards) and metadata fields for embodiment ID, scene description, and language annotations.
Trajectory lengths vary from 10 steps (pick-place) to 500+ steps (long-horizon navigation). Observation modalities include wrist-mounted RGB (640×480 or 224×224), third-person RGB, and proprioceptive joint angles. Action spaces range from 7-DOF end-effector deltas to 12-DOF bi-manual commands. This heterogeneity makes Open X-Embodiment a stress test for cross-embodiment transfer algorithms but complicates direct policy deployment without embodiment-specific adapters.
Licensing and Commercial Use Constraints
Open X-Embodiment operates under a patchwork of Creative Commons licenses, primarily CC BY 4.0 and CC BY-NC 4.0, inherited from constituent datasets[1]. The aggregate repository does not impose a single umbrella license; buyers must audit each subset individually. For example, BridgeData V2 is CC BY 4.0 (commercial-friendly), while several academic subsets carry CC BY-NC 4.0 (non-commercial) restrictions (see the Creative Commons NonCommercial deed).
Commercial use of models trained on NC-licensed subsets remains legally ambiguous. The NC clause prohibits uses "primarily intended for or directed toward commercial advantage," but case law has not clarified whether deploying a foundation model fine-tuned on NC data in a SaaS product constitutes infringement. Risk-averse enterprises typically exclude NC subsets or seek explicit waivers from original dataset authors.
No consent artifacts (signed releases from human demonstrators) are bundled with Open X-Embodiment. This absence is standard for academic robotics datasets but creates compliance gaps under GDPR Article 7 (conditions for consent when processing identifiable data) and emerging AI Act transparency requirements. Procurement teams should request demonstrator consent records and data-use agreements from original dataset maintainers before large-scale commercial training.
RLDS Format and Interoperability
Open X-Embodiment adopts the Reinforcement Learning Datasets (RLDS) specification, a TensorFlow Datasets extension designed for episodic RL and imitation learning. Each dataset is a collection of episodes; each episode is a sequence of steps containing `observation`, `action`, `reward`, `discount`, and `is_terminal` fields. Each observation is a nested dictionary that can hold RGB images, depth maps, proprioceptive vectors, and language instructions.
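To make the step layout concrete, the sketch below models one episode as plain Python dictionaries. The top-level step fields (`observation`, `action`, `reward`, `discount`, `is_terminal`) follow the description above; the inner observation keys (`image`, `state`, `language_instruction`) and the helpers `make_step` and `summarize_episode` are illustrative assumptions, not the schema of any particular subset.

```python
from typing import Any, Dict, List

Step = Dict[str, Any]


def make_step(image: Any, state: List[float], action: List[float],
              reward: float = 0.0, discount: float = 1.0,
              is_terminal: bool = False, instruction: str = "") -> Step:
    """Build one RLDS-style step; the inner observation keys are hypothetical."""
    return {
        "observation": {
            "image": image,
            "state": state,
            "language_instruction": instruction,
        },
        "action": action,
        "reward": reward,
        "discount": discount,
        "is_terminal": is_terminal,
    }


def summarize_episode(episode: Dict[str, Any]) -> Dict[str, Any]:
    """Report length, summed reward, and whether the last step is terminal."""
    steps = episode["steps"]
    return {
        "num_steps": len(steps),
        "total_reward": sum(step["reward"] for step in steps),
        "ends_terminal": bool(steps) and steps[-1]["is_terminal"],
    }


# A two-step toy episode; image payloads are omitted (None) for brevity.
episode = {
    "steps": [
        make_step(None, [0.0] * 7, [0.0] * 7, instruction="pick up the red block"),
        make_step(None, [0.1] * 7, [0.0] * 7, reward=1.0, is_terminal=True,
                  instruction="pick up the red block"),
    ],
}
summary = summarize_episode(episode)
```

Real subsets differ in which observation keys they populate, so it is worth inspecting one episode per subset before writing a shared loader.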
The RLDS schema enables zero-copy slicing via TensorFlow's `tf.data` API and integrates with LeRobot through RLDS-to-LeRobot converters. However, RLDS files are HDF5-backed, which imposes single-writer constraints and complicates distributed writes. Teams scaling beyond 100k episodes often migrate to Apache Parquet or MCAP for parallel ingestion and cloud-native storage.
Metadata fields in RLDS episodes include `episode_metadata/file_path` (original source), `episode_metadata/embodiment` (robot model string), and optional `language_instruction` strings. Language annotations are sparse: fewer than 30 percent of trajectories include natural-language task descriptions[1]. This limits direct applicability to vision-language-action (VLA) models like RT-2, which require dense language supervision. Buyers targeting VLA fine-tuning should prioritize subsets with >80 percent language coverage (e.g., BridgeData V2, Language Table).
Cross-Embodiment Transfer and RT-X Results
The RT-X family—RT-1, RT-2, and subsequent variants—demonstrates that pre-training on Open X-Embodiment improves zero-shot and few-shot performance on unseen embodiments by 50–70 percent over single-robot baselines[2]. RT-1 achieved 97 percent success on 6 kitchen tasks after training on 130k episodes; RT-X extended this to 22 embodiments with minimal per-robot fine-tuning.
Cross-embodiment gains stem from shared visual priors (grasping affordances, object permanence) and action-space normalization. Open X-Embodiment trajectories are re-scaled to a canonical 7-DOF action space (3-DOF translation, 3-DOF rotation, 1-DOF gripper) via embodiment-specific inverse kinematics. This normalization is lossy: bi-manual tasks collapse to single-arm actions, and mobile-base commands are discarded. Buyers training policies for non-standard embodiments (humanoids, soft grippers) must re-derive normalization mappings or exclude incompatible subsets.
OpenVLA, a 7B-parameter VLA model, fine-tunes on Open X-Embodiment subsets and achieves 82 percent success on unseen tabletop tasks. However, OpenVLA's performance degrades to 34 percent on long-horizon tasks (>100 steps) due to compounding error and sparse language supervision[3]. For production deployments, teams typically fine-tune on 5–10k domain-specific trajectories after Open X-Embodiment pre-training rather than relying on zero-shot transfer.
Procurement Considerations for Physical AI Teams
Open X-Embodiment is free to download but expensive to curate, validate, and adapt. The 1M-trajectory corpus occupies 4.2 TB in compressed RLDS format; decompressed HDF5 files exceed 12 TB. Cloud egress costs from Hugging Face or Google Cloud Storage run $400–$1,200 for a full download. Teams should budget 2–4 engineer-weeks for initial data profiling: identifying corrupted episodes, verifying action-space ranges, and filtering subsets by task relevance.
License heterogeneity requires legal review. Licensing briefings from truelabel map each subset to its source license and flag NC-restricted subsets. Enterprises training foundation models for commercial SaaS products should exclude CC BY-NC subsets (approximately 18 of 60 datasets) or negotiate waivers. Academic teams face no restrictions but must attribute original dataset authors per CC BY 4.0 terms (Creative Commons Attribution 4.0).
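A first-pass license triage can be automated before legal review. The sketch below assumes a hand-built manifest mapping subset names to license strings; the manifest, the placeholder subset names, and the helpers `classify_license` and `partition_subsets` are hypothetical. String matching only sorts subsets into buckets and does not replace reading each source license.

```python
import re
from typing import Dict, List


def classify_license(license_id: str) -> str:
    """Bucket a Creative Commons license string by its NC/ND elements."""
    tokens = re.split(r"[\s\-]+", license_id.strip().upper())
    if not tokens or tokens[0] != "CC":
        return "review"  # missing or non-CC license: needs a human decision
    if "NC" in tokens:
        return "non_commercial"
    if "ND" in tokens:
        return "review"  # NoDerivatives terms are routed to review in this sketch
    return "commercial_ok"


def partition_subsets(manifest: Dict[str, str]) -> Dict[str, List[str]]:
    """Group subset names by license bucket, in alphabetical order."""
    buckets: Dict[str, List[str]] = {
        "commercial_ok": [], "non_commercial": [], "review": [],
    }
    for name, license_id in sorted(manifest.items()):
        buckets[classify_license(license_id)].append(name)
    return buckets


manifest = {
    "bridge_v2": "CC BY 4.0",                   # license as stated in this profile
    "hypothetical_lab_subset": "CC BY-NC 4.0",  # placeholder entry
    "hypothetical_unlabeled_subset": "",        # placeholder: no license recorded
}
buckets = partition_subsets(manifest)
```

Routing unknown or empty license strings to `review` rather than `commercial_ok` keeps the default conservative.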
Data quality varies widely. BridgeData V2 and DROID include per-episode success labels and scene metadata; older subsets (RoboTurk, Franka Play) lack success annotations and contain 10–15 percent failed grasps or collisions[4]. Truelabel's marketplace offers pre-filtered subsets with verified success labels, embodiment-specific action normalization, and consent artifacts for demonstrator-contributed data. Custom procurement requests can target specific embodiments (e.g., "Franka Panda kitchen tasks with language annotations") or exclude NC-licensed subsets entirely.
Integration with LeRobot and Diffusion Policy
LeRobot provides native RLDS loaders for Open X-Embodiment subsets, enabling one-line dataset instantiation: `dataset = LeRobotDataset("openx/bridge_v2")`. LeRobot converts RLDS episodes to PyTorch tensors, applies embodiment-specific action normalization, and chunks trajectories into fixed-length windows (typically 10–50 steps) for transformer-based policies.
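The windowing step can be illustrated independently of any library. This is a minimal sketch of fixed-length chunking, not LeRobot's implementation; the function name `chunk_trajectory` and its `stride` and `drop_last` parameters are assumptions made for the example.

```python
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def chunk_trajectory(steps: Sequence[T], window: int,
                     stride: Optional[int] = None,
                     drop_last: bool = True) -> List[List[T]]:
    """Split a trajectory into windows of `window` steps.

    `stride` defaults to `window` (non-overlapping chunks). With
    `drop_last=True`, chunking stops at the first window that would be
    shorter than `window`.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    stride = window if stride is None else stride
    if stride <= 0:
        raise ValueError("stride must be positive")

    chunks: List[List[T]] = []
    for start in range(0, len(steps), stride):
        chunk = list(steps[start:start + window])
        if len(chunk) < window and drop_last:
            break
        chunks.append(chunk)
    return chunks


# A 25-step trajectory, represented here by step indices only.
trajectory = list(range(25))
non_overlapping = chunk_trajectory(trajectory, window=10)
overlapping = chunk_trajectory(trajectory, window=10, stride=5)
```

Overlapping windows (`stride < window`) yield more training samples per episode at the cost of correlated batches.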
Diffusion Policy, a state-of-the-art imitation learning architecture, trains on Open X-Embodiment via LeRobot's diffusion training scripts (see the Diffusion Policy training example). The policy denoises action sequences conditioned on visual observations and language instructions, achieving 91 percent success on BridgeData V2 validation tasks after 200k gradient steps. Training a 50M-parameter Diffusion Policy model on 60k BridgeData V2 episodes requires 8× A100 GPUs for 12 hours; scaling to the full 1M-trajectory corpus demands 64–128 GPUs and multi-node data parallelism.
LeRobot's RLDS integration handles embodiment ID mapping, camera calibration metadata, and action-space clipping. However, it does not resolve license conflicts or filter low-quality episodes. Teams should pre-process Open X-Embodiment subsets with data provenance checks—verifying episode metadata, detecting duplicate trajectories, and validating action bounds—before initiating large-scale training runs.
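Two of these provenance checks, duplicate detection and action-bound validation, can be sketched with the standard library. The approach below hashes rounded action sequences, which is one possible fingerprinting scheme and an assumption of this example; the helper names and the default bounds of [-1, 1] are likewise hypothetical.

```python
import hashlib
import json
from typing import Dict, List, Sequence, Tuple

Actions = Sequence[Sequence[float]]


def trajectory_fingerprint(actions: Actions, decimals: int = 4) -> str:
    """Hash an action sequence after rounding, so float noise does not matter."""
    # Adding 0.0 turns -0.0 into 0.0 so both serialize identically.
    rounded = [[round(float(v), decimals) + 0.0 for v in step] for step in actions]
    payload = json.dumps(rounded, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def find_duplicates(episodes: Dict[str, Actions]) -> List[Tuple[str, str]]:
    """Return (first_seen_id, duplicate_id) pairs with identical fingerprints."""
    seen: Dict[str, str] = {}
    duplicates: List[Tuple[str, str]] = []
    for episode_id, actions in episodes.items():
        fingerprint = trajectory_fingerprint(actions)
        if fingerprint in seen:
            duplicates.append((seen[fingerprint], episode_id))
        else:
            seen[fingerprint] = episode_id
    return duplicates


def out_of_bounds_steps(actions: Actions, low: float = -1.0,
                        high: float = 1.0) -> List[int]:
    """Indices of steps where any action component falls outside [low, high]."""
    return [i for i, step in enumerate(actions)
            if any(v < low or v > high for v in step)]


episodes = {
    "ep_a": [[0.1, 0.2], [0.0, 0.3]],
    "ep_b": [[0.1, 0.2], [0.0, 0.3]],   # exact copy of ep_a
    "ep_c": [[0.3, 0.2], [1.5, 0.0]],   # second step exceeds the upper bound
}
duplicate_pairs = find_duplicates(episodes)
bad_steps = out_of_bounds_steps(episodes["ep_c"])
```

Hashing actions alone will miss near-duplicates recorded with different noise; fingerprinting downsampled frames is a common complement.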
Comparison with DROID and BridgeData V2
Open X-Embodiment is a meta-dataset; its largest contributors—DROID and BridgeData V2—are also distributed independently. DROID contributes 76,000 trajectories from 564 real-world scenes, emphasizing scene diversity over task diversity. BridgeData V2 provides 60,096 trajectories across 13 kitchen tasks with dense language annotations (95 percent coverage) and per-episode success labels.
Buyers prioritizing language-conditioned policies should source BridgeData V2 directly rather than via Open X-Embodiment, as the RLDS conversion occasionally strips language metadata due to schema mismatches. DROID's strength is visual diversity—564 unique kitchens, living rooms, and offices—but action labels are noisier (12 percent failed grasps) than BridgeData V2's curated demonstrations[5].
For teams training imitation learning policies on a single embodiment (e.g., Franka Panda), BridgeData V2 alone (60k episodes) often outperforms the full Open X-Embodiment corpus (1M episodes) due to reduced embodiment mismatch and higher annotation quality. Cross-embodiment pre-training on Open X-Embodiment becomes advantageous only when target-domain data falls below 5k episodes or when zero-shot transfer to unseen embodiments is required.
Teleoperation Data and Demonstrator Diversity
Approximately 70 percent of Open X-Embodiment trajectories are teleoperation data—human operators controlling robots via VR controllers, keyboard interfaces, or kinesthetic teaching. Teleoperation introduces systematic biases: operators favor conservative grasps, avoid dynamic motions, and exhibit inter-operator variance in action smoothness. Studies show that policies trained on single-operator data generalize poorly to multi-operator test distributions, with success rates dropping 15–25 percent[4].
Demonstrator diversity metadata is absent from Open X-Embodiment. Subsets like BridgeData V2 and DROID aggregate demonstrations from 10–20 operators, but operator IDs are not recorded in RLDS episodes. This prevents stratified train-test splits by operator and complicates fairness audits. Consent artifacts for demonstrators—signed agreements permitting data use in commercial models—are not bundled with the dataset, creating GDPR and AI Act compliance gaps.
For production deployments, teams should supplement Open X-Embodiment with domain-specific teleoperation data collected under controlled conditions: single-operator baselines, multi-operator validation sets, and consent-backed demonstrator agreements. Truelabel's sourcing service coordinates custom teleoperation collection with embodiment-matched hardware, per-episode success labeling, and demonstrator consent artifacts included in delivery.
Embodiment-Specific Subsets and Action Normalization
Open X-Embodiment's 22 embodiments span 7-DOF arms (Franka Panda, Kuka IIWA, UR5), bi-manual systems (Franka Duo), mobile manipulators (TIAGo), and quadrupeds (Spot). Each embodiment's action space is normalized to a canonical 7-DOF format: `[dx, dy, dz, droll, dpitch, dyaw, gripper]`. This normalization is embodiment-specific and lossy; bi-manual actions collapse to single-arm commands, and mobile-base velocities are discarded.
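The canonical vector can be given named fields so that bound checks read clearly. The sketch below is illustrative: the class name `CanonicalAction` is hypothetical, and the default ranges ([-1, 1] for deltas, [0, 1] for the gripper) are typical values that should be verified per subset.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class CanonicalAction:
    """Named view of [dx, dy, dz, droll, dpitch, dyaw, gripper]."""
    dx: float
    dy: float
    dz: float
    droll: float
    dpitch: float
    dyaw: float
    gripper: float

    @classmethod
    def from_vector(cls, vector: Sequence[float]) -> "CanonicalAction":
        if len(vector) != 7:
            raise ValueError(f"expected 7 components, got {len(vector)}")
        return cls(*(float(v) for v in vector))

    def violations(self,
                   delta_range: Tuple[float, float] = (-1.0, 1.0),
                   gripper_range: Tuple[float, float] = (0.0, 1.0)) -> List[str]:
        """Names of components outside their assumed ranges."""
        bad: List[str] = []
        for name in ("dx", "dy", "dz", "droll", "dpitch", "dyaw"):
            value = getattr(self, name)
            if not delta_range[0] <= value <= delta_range[1]:
                bad.append(name)
        if not gripper_range[0] <= self.gripper <= gripper_range[1]:
            bad.append("gripper")
        return bad


ok_action = CanonicalAction.from_vector([0.1, 0.0, -0.2, 0.0, 0.0, 0.05, 1.0])
bad_action = CanonicalAction.from_vector([1.5, 0.0, 0.0, 0.0, 0.0, 0.0, -0.2])
```

A 14-component bi-manual action would fail `from_vector` outright, which makes the lossy collapse to a single arm an explicit decision rather than a silent truncation.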
Action normalization mappings are defined in the RT-X codebase but not documented in RLDS metadata. Buyers training policies for non-standard embodiments (e.g., Franka Duo bi-manual tasks) must reverse-engineer normalization logic from source code or exclude incompatible subsets. The LeRobot dataset documentation provides embodiment-to-action-space mappings for 15 of 22 embodiments; the remaining 7 require manual inspection of RLDS `action` tensors.
Embodiment diversity is a double-edged sword. Cross-embodiment pre-training improves zero-shot transfer but increases training instability due to action-space mismatch. Teams targeting a single embodiment should filter Open X-Embodiment to embodiment-matched subsets (e.g., `embodiment == 'franka_panda'`) and verify action bounds before training. Truelabel's dataset format guide documents action-space conventions for 30+ robot platforms, including Open X-Embodiment embodiments.
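The embodiment filter can be prototyped on plain dictionaries before wiring it into a data pipeline. This sketch assumes each episode carries an `episode_metadata` dictionary with an `embodiment` string, as described earlier; the helper names and the toy episodes are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Iterator

Episode = Dict[str, Any]


def embodiment_of(episode: Episode) -> str:
    """Read the embodiment string, falling back to 'unknown' when absent."""
    return episode.get("episode_metadata", {}).get("embodiment", "unknown")


def count_embodiments(episodes: Iterable[Episode]) -> Counter:
    """Tally episodes per embodiment to see what a subset actually contains."""
    return Counter(embodiment_of(episode) for episode in episodes)


def filter_by_embodiment(episodes: Iterable[Episode],
                         embodiment: str) -> Iterator[Episode]:
    """Yield only the episodes recorded on the requested embodiment."""
    for episode in episodes:
        if embodiment_of(episode) == embodiment:
            yield episode


episodes = [
    {"id": 0, "episode_metadata": {"embodiment": "franka_panda"}},
    {"id": 1, "episode_metadata": {"embodiment": "kuka_iiwa"}},
    {"id": 2, "episode_metadata": {"embodiment": "franka_panda"}},
    {"id": 3, "episode_metadata": {}},  # metadata present but field missing
]
counts = count_embodiments(episodes)
panda_ids = [ep["id"] for ep in filter_by_embodiment(episodes, "franka_panda")]
```

Counting first surfaces spelling variants of the same robot (and missing fields) that an exact string match would silently drop.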
Language Annotations and VLA Fine-Tuning
Language annotations in Open X-Embodiment are sparse and inconsistent. Only 28 percent of trajectories include natural-language task descriptions; coverage ranges from 95 percent (BridgeData V2) to 0 percent (RoboTurk, Franka Play)[1]. Language strings vary in granularity: some are high-level goals ("pick up the red block"), others are step-by-step instructions ("move gripper to block, close gripper, lift").
Vision-language-action models like RT-2 and OpenVLA require dense language supervision—ideally one instruction per episode. Training on Open X-Embodiment's sparse annotations degrades VLA performance by 20–30 percent compared to language-dense datasets like BridgeData V2[6]. Teams fine-tuning VLA models should pre-filter Open X-Embodiment to subsets with >80 percent language coverage or augment missing annotations via LLM-based captioning (e.g., prompting GPT-4 with episode frames to generate task descriptions).
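The coverage pre-filter is straightforward to compute once each episode is tagged with its subset. In the sketch below, the `subset` and `language_instruction` keys on each episode record, the helper names, and the toy subset names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def language_coverage(episodes: Iterable[Dict[str, Any]]) -> Dict[str, float]:
    """Fraction of episodes per subset with a non-empty instruction."""
    totals: Dict[str, int] = defaultdict(int)
    annotated: Dict[str, int] = defaultdict(int)
    for episode in episodes:
        subset = episode["subset"]
        totals[subset] += 1
        if (episode.get("language_instruction") or "").strip():
            annotated[subset] += 1
    return {subset: annotated[subset] / totals[subset] for subset in totals}


def subsets_above(coverage: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Subset names whose coverage strictly exceeds the threshold."""
    return sorted(name for name, value in coverage.items() if value > threshold)


# Toy data: subset_dense is fully annotated, subset_sparse has 2 of 5 annotated.
episodes = (
    [{"subset": "subset_dense", "language_instruction": "pick up the red block"}] * 5
    + [{"subset": "subset_sparse", "language_instruction": "open the drawer"}] * 2
    + [{"subset": "subset_sparse", "language_instruction": ""}] * 2
    + [{"subset": "subset_sparse"}]
)
coverage = language_coverage(episodes)
selected = subsets_above(coverage)
```

Whitespace-only strings are treated as missing, since some converters emit empty placeholders instead of omitting the field.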
Language instruction quality also varies. Some subsets use templated strings ("pick {object}"), others use free-form descriptions. Templated instructions improve policy generalization to unseen objects but reduce linguistic diversity. Free-form instructions capture task nuance but introduce annotation noise. Truelabel's VLA fine-tuning guide recommends stratified sampling: 60 percent templated, 40 percent free-form, with manual review of outlier instructions.
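A fixed templated-to-free-form ratio can be enforced at sampling time. The following sketch draws a 60/40 mix from two pools; the function name, the seed handling, and the example instructions are assumptions, and the manual review step is left out.

```python
import random
from typing import List, Sequence


def stratified_sample(templated: Sequence[str], free_form: Sequence[str],
                      n: int, templated_share: float = 0.6,
                      seed: int = 0) -> List[str]:
    """Sample n instructions with a fixed share drawn from the templated pool."""
    n_templated = round(n * templated_share)
    n_free = n - n_templated
    if n_templated > len(templated) or n_free > len(free_form):
        raise ValueError("not enough instructions in one of the pools")

    rng = random.Random(seed)  # local generator keeps the draw reproducible
    sample = rng.sample(list(templated), n_templated)
    sample += rng.sample(list(free_form), n_free)
    rng.shuffle(sample)
    return sample


templated_pool = [f"pick {obj}" for obj in
                  ("cup", "block", "spoon", "plate", "sponge",
                   "bowl", "lid", "can", "fork", "towel")]
free_form_pool = [f"free-form instruction {i}" for i in range(10)]
batch = stratified_sample(templated_pool, free_form_pool, n=10)
n_templated_in_batch = sum(text in set(templated_pool) for text in batch)
```

Sampling without replacement raises an error when a pool is too small, which is preferable to silently skewing the ratio.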
Data Quality and Episode Filtering
Open X-Embodiment includes 10–15 percent low-quality episodes: failed grasps, collisions, incomplete trajectories, and mislabeled actions[4]. Quality varies by subset: BridgeData V2 and DROID include per-episode success labels (binary flags indicating task completion), while older subsets (RoboTurk, Berkeley Cable Routing) lack success annotations and require heuristic filtering.
Common quality issues include: (1) action saturation—gripper commands clipped to [0,1] range, losing fine motor control; (2) observation corruption—frames with lens flare, motion blur, or occlusion; (3) temporal misalignment—action timestamps offset from observation timestamps by 50–100 ms. These issues degrade policy performance by 8–12 percent on held-out tasks[5].
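Temporal misalignment can be screened with a simple offset statistic when per-step timestamps are available. The sketch below assumes paired observation and action timestamps in seconds, which not every subset records; the function names and the 50 ms flagging threshold (the lower end of the range cited above) are choices made for this example.

```python
import statistics
from typing import Sequence


def median_offset_ms(observation_ts: Sequence[float],
                     action_ts: Sequence[float]) -> float:
    """Median of (action time - observation time) per step, in milliseconds."""
    if not observation_ts or len(observation_ts) != len(action_ts):
        raise ValueError("timestamp sequences must be non-empty and equal length")
    return statistics.median(
        (a - o) * 1000.0 for o, a in zip(observation_ts, action_ts)
    )


def is_misaligned(observation_ts: Sequence[float], action_ts: Sequence[float],
                  threshold_ms: float = 50.0) -> bool:
    """Flag an episode whose median offset magnitude reaches the threshold."""
    return abs(median_offset_ms(observation_ts, action_ts)) >= threshold_ms


# Toy episode at 10 Hz where actions lag observations by about 60 ms.
obs_ts = [0.0, 0.1, 0.2, 0.3]
act_ts = [0.06, 0.16, 0.26, 0.36]
offset = median_offset_ms(obs_ts, act_ts)
flagged = is_misaligned(obs_ts, act_ts)
```

The median is used instead of the mean so that a few dropped or delayed frames do not dominate the estimate.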
Teams should implement multi-stage filtering: (1) remove episodes with `is_terminal == False` at final step (incomplete trajectories); (2) clip action outliers beyond 3 standard deviations; (3) discard episodes with >20 percent saturated actions; (4) filter frames with brightness <10 or >245 (under/overexposed). Truelabel's marketplace offers pre-filtered Open X-Embodiment subsets with verified success labels, action-bound validation, and per-episode quality scores (0–100 scale based on action smoothness, observation clarity, and task completion).
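The four filtering stages can be sketched with the standard library as follows. Several details are assumptions of this example rather than part of the dataset: each step stores a `frame` as a flat list of 0 to 255 intensities, an action counts as saturated when any component reaches ±1, and outlier clipping is exposed as a separate helper to be run on pooled actions.

```python
import statistics
from typing import Any, Dict, List, Sequence

Episode = Dict[str, Any]


def ends_terminal(episode: Episode) -> bool:
    """Stage 1: the final step must be marked terminal."""
    steps = episode["steps"]
    return bool(steps) and bool(steps[-1]["is_terminal"])


def clip_outliers(actions: Sequence[Sequence[float]],
                  n_sigma: float = 3.0) -> List[List[float]]:
    """Stage 2: clip each action dimension to mean +/- n_sigma * std."""
    bounds = []
    for column in zip(*actions):
        mean, std = statistics.fmean(column), statistics.pstdev(column)
        bounds.append((mean - n_sigma * std, mean + n_sigma * std))
    return [[min(max(v, lo), hi) for v, (lo, hi) in zip(row, bounds)]
            for row in actions]


def saturated_fraction(episode: Episode, limit: float = 1.0) -> float:
    """Stage 3: share of steps with any component at or beyond +/- limit."""
    steps = episode["steps"]
    if not steps:
        return 0.0
    hits = sum(1 for s in steps if any(abs(v) >= limit for v in s["action"]))
    return hits / len(steps)


def well_exposed(frame: Sequence[float], low: float = 10.0,
                 high: float = 245.0) -> bool:
    """Stage 4: keep frames whose mean brightness lies in [low, high]."""
    return low <= statistics.fmean(frame) <= high


def filter_episodes(episodes: Sequence[Episode],
                    max_saturated: float = 0.2) -> List[Episode]:
    """Apply stages 1, 3, and 4; drop episodes left with no usable steps."""
    kept = []
    for episode in episodes:
        if not ends_terminal(episode):
            continue
        if saturated_fraction(episode) > max_saturated:
            continue
        steps = [s for s in episode["steps"] if well_exposed(s["frame"])]
        if steps:
            kept.append({**episode, "steps": steps})
    return kept


def make_episode(ep_id: str, actions: Sequence[Sequence[float]],
                 terminal: bool = True, brightness: float = 128.0) -> Episode:
    """Build a toy episode with uniform 4-pixel frames (illustration only)."""
    steps = [{"action": list(a), "is_terminal": False, "frame": [brightness] * 4}
             for a in actions]
    steps[-1]["is_terminal"] = terminal
    return {"id": ep_id, "steps": steps}


episodes = [
    make_episode("good", [[0.1, 0.2]] * 5),
    make_episode("incomplete", [[0.1, 0.2]] * 5, terminal=False),
    make_episode("saturated", [[1.0, 0.0]] * 5),
    make_episode("dark", [[0.1, 0.2]] * 5, brightness=2.0),
]
kept_ids = [ep["id"] for ep in filter_episodes(episodes)]
clipped = clip_outliers([[0.0]] * 20 + [[100.0]])
```

Saturation is measured on raw actions before any clipping, so the two stages do not interfere. Subsets with binary grippers would need the gripper dimension excluded from the saturation check.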
Storage Formats and Cloud Deployment
Open X-Embodiment is distributed as RLDS-formatted HDF5 files, totaling 4.2 TB compressed (12 TB uncompressed). HDF5's single-writer constraint complicates distributed training: multiple workers cannot write to the same file concurrently. Teams scaling beyond 8 GPUs typically convert RLDS to Apache Parquet or MCAP for parallel reads and cloud-native storage.
Parquet conversion reduces storage by 30–40 percent via columnar compression and enables predicate pushdown (filtering episodes by embodiment or task without loading full files). Hugging Face Datasets supports RLDS-to-Parquet conversion via `dataset.to_parquet()`, preserving nested observation dictionaries as struct columns. MCAP, a ROS 2-native format, is preferred for teams integrating with ROS bag workflows but requires custom RLDS-to-MCAP converters.
Cloud deployment costs: storing 12 TB on AWS S3 Standard costs $276/month; egress to on-premises clusters costs $1,080 per full download. Teams training on cloud VMs should use S3 Intelligent-Tiering (auto-migrates cold data to cheaper tiers) and enable S3 Transfer Acceleration (50 percent faster downloads for $0.04/GB). Google Cloud Storage offers similar pricing; Hugging Face Datasets Hub provides free hosting but throttles downloads to 10 MB/s for non-Pro users.
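The cost figures above follow from simple per-gigabyte arithmetic. The sketch below reproduces them using unit rates back-derived from those figures ($0.023 per GB-month for storage and $0.09 per GB for egress, with 1 TB taken as 1,000 GB); these rates are assumptions for illustration and actual pricing varies by region, tier, and volume.

```python
GB_PER_TB = 1000  # decimal terabytes, as assumed in this sketch

# Assumed unit rates, implied by the $276/month and $1,080 figures for 12 TB.
STORAGE_USD_PER_GB_MONTH = 0.023
EGRESS_USD_PER_GB = 0.09
# Transfer Acceleration surcharge quoted in the text.
ACCELERATION_USD_PER_GB = 0.04


def monthly_storage_cost(terabytes: float) -> float:
    """Monthly storage cost in USD at the assumed standard-tier rate."""
    return terabytes * GB_PER_TB * STORAGE_USD_PER_GB_MONTH


def egress_cost(terabytes: float, accelerated: bool = False) -> float:
    """One-time download cost in USD, optionally with the acceleration fee."""
    rate = EGRESS_USD_PER_GB + (ACCELERATION_USD_PER_GB if accelerated else 0.0)
    return terabytes * GB_PER_TB * rate


storage_12tb = round(monthly_storage_cost(12), 2)
egress_12tb = round(egress_cost(12), 2)
egress_12tb_accelerated = round(egress_cost(12, accelerated=True), 2)
egress_compressed = round(egress_cost(4.2), 2)
```

Downloading the 4.2 TB compressed archive instead of the uncompressed data cuts the egress estimate to roughly a third, which is why decompression is usually done after transfer.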
Sim-to-Real Gap and Synthetic Data Augmentation
Open X-Embodiment contains zero synthetic trajectories; all 1M episodes are real-world demonstrations. This contrasts with simulation-heavy datasets like RLBench (100k synthetic episodes) and ManiSkill (500k procedurally generated tasks). Real-world data eliminates the sim-to-real gap—the performance drop when deploying simulation-trained policies on physical robots—but limits task diversity and scalability.
Synthetic augmentation is common: teams pre-train on simulated grasping (e.g., 100k Isaac Gym episodes), then fine-tune on 5–10k Open X-Embodiment trajectories. This hybrid approach achieves 85–90 percent of pure real-world performance at 10× lower data-collection cost[7]. However, simulation requires high-fidelity physics (contact dynamics, friction models) and photorealistic rendering to minimize the reality gap.
NVIDIA Cosmos and Isaac Sim generate synthetic robot trajectories with domain randomization—varying lighting, textures, and object poses—to improve real-world transfer. Cosmos-generated data is not included in Open X-Embodiment but can be mixed during training. Teams should validate sim-to-real transfer on 500–1,000 real-world test episodes before production deployment; success rates below 70 percent indicate insufficient domain randomization or physics fidelity.
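When comparing a measured success rate against the 70 percent threshold, the size of the test set determines how much the estimate can be trusted. The sketch below uses a Wilson score interval, a standard technique introduced here for illustration and not something the dataset or its authors prescribe; requiring the lower bound to clear the threshold is a deliberately conservative choice of this example.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    ) / denom
    return center - half_width, center + half_width


def clears_threshold(successes: int, trials: int,
                     threshold: float = 0.70) -> bool:
    """True when even the interval's lower bound is at or above the threshold."""
    lower, _ = wilson_interval(successes, trials)
    return lower >= threshold


# 350 of 500 successes is exactly 70 percent, but the interval dips below it.
low_500, high_500 = wilson_interval(350, 500)
borderline = clears_threshold(350, 500)
comfortable = clears_threshold(400, 500)
```

With 500 episodes the interval is about four percentage points wide on each side, which illustrates why smaller test sets make a 70 percent cutoff hard to interpret.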
Alternatives and Complementary Datasets
Open X-Embodiment is the largest open-source robot dataset but not the only option. DROID (76k trajectories, 564 scenes) emphasizes visual diversity; BridgeData V2 (60k trajectories, 13 tasks) prioritizes language annotations and success labels. RoboNet (15 million frames, 7 robots) predates Open X-Embodiment and uses a proprietary format incompatible with RLDS.
For egocentric manipulation, EPIC-KITCHENS-100 provides 100 hours of first-person kitchen activity but lacks robot actions (human-only demonstrations). Ego4D extends this to 3,000 hours across diverse environments but remains action-free. Teams training egocentric policies must pair these datasets with inverse-dynamics models to infer actions from human hand trajectories.
Proprietary datasets from Scale AI and Figure AI (1M+ humanoid trajectories) offer higher quality and tighter licensing but cost $50k–$500k per dataset. Truelabel's marketplace bridges this gap: curated subsets of Open X-Embodiment (filtered for quality, annotated with success labels, bundled with consent artifacts) at $5k–$20k per embodiment-task pair, plus custom teleoperation collection for $200–$500 per trajectory.
External references and source context
[1] Open X-Embodiment: Robotic Learning Datasets and RT-X Models. arXiv. Open X-Embodiment paper documents 1M+ trajectories, 22 embodiments, mixed CC licenses, and 28% language annotation coverage.
[2] Open X-Embodiment: Robotic Learning Datasets and RT-X Models. arXiv. RT-X paper reports 50-70% zero-shot improvement from cross-embodiment pre-training.
[3] OpenVLA: An Open-Source Vision-Language-Action Model. arXiv. OpenVLA performance metrics on Open X-Embodiment subsets.
[4] BridgeData V2: A Dataset for Robot Learning at Scale. arXiv. BridgeData V2 contributes 60,096 trajectories with 95% language coverage and per-episode success labels.
[5] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. arXiv. DROID paper reports 12% failed-grasp rate and action-label noise.
[6] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv. RT-2 paper quantifies 20-30% performance drop from sparse language annotations.
[7] Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning. arXiv. Sim-to-real survey reports 85-90% real-world performance with hybrid training.
FAQ
What license governs commercial use of models trained on Open X-Embodiment?
Open X-Embodiment operates under mixed Creative Commons licenses inherited from constituent datasets—primarily CC BY 4.0 (commercial-friendly) and CC BY-NC 4.0 (non-commercial). Approximately 18 of 60 subsets carry NC restrictions, prohibiting commercial model training without explicit waivers. Buyers must audit each subset individually; truelabel's licensing briefings map subsets to source licenses and flag NC-restricted data. Models trained on NC subsets face legal ambiguity if deployed in commercial SaaS products, as case law has not clarified whether fine-tuning constitutes "commercial advantage."
How does Open X-Embodiment compare to DROID and BridgeData V2 for VLA fine-tuning?
BridgeData V2 (60k trajectories, 95 percent language coverage) outperforms Open X-Embodiment for single-embodiment VLA fine-tuning due to dense language annotations and verified success labels. DROID (76k trajectories, 564 scenes) excels in visual diversity but has sparser language (40 percent coverage) and noisier action labels (12 percent failed grasps). Open X-Embodiment's advantage is cross-embodiment pre-training: models pre-trained on 1M trajectories achieve 50–70 percent higher zero-shot success on unseen embodiments than single-dataset baselines. For production VLA deployments, teams typically pre-train on Open X-Embodiment, then fine-tune on 5–10k domain-specific BridgeData V2 episodes.
What are the storage and compute requirements for training on Open X-Embodiment?
The full 1M-trajectory corpus occupies 4.2 TB compressed (12 TB uncompressed HDF5). Training a 50M-parameter Diffusion Policy on 60k BridgeData V2 episodes requires 8× A100 GPUs for 12 hours; scaling to 1M episodes demands 64–128 GPUs and multi-node data parallelism. Cloud storage costs $276/month on AWS S3 Standard; egress costs $1,080 per full download. Teams should convert RLDS to Parquet (30–40 percent smaller) for distributed training and use S3 Intelligent-Tiering to reduce storage costs. Budget 2–4 engineer-weeks for data profiling, license audits, and action-space validation before initiating training.
Does Open X-Embodiment include consent artifacts for demonstrators?
No. Open X-Embodiment does not bundle signed consent releases from human demonstrators who contributed teleoperation data. This creates compliance gaps under GDPR Article 7 (conditions for consent when processing identifiable data) and emerging AI Act transparency requirements. Approximately 70 percent of trajectories are teleoperation data, but demonstrator IDs and consent records are not included in RLDS metadata. Enterprises deploying commercial models should request consent artifacts from original dataset maintainers or source teleoperation data from vendors like truelabel that include consent documentation in delivery.
How do I filter Open X-Embodiment for a specific robot embodiment?
RLDS episodes include an `episode_metadata/embodiment` string field (e.g., 'franka_panda', 'kuka_iiwa'). Use TensorFlow Datasets filtering: `ds.filter(lambda x: x['episode_metadata']['embodiment'] == 'franka_panda')`. LeRobot provides embodiment-specific loaders: `LeRobotDataset('openx/bridge_v2')` auto-filters to Franka Panda trajectories. Action normalization mappings are embodiment-specific; verify action bounds (typically [-1, 1] for position deltas, [0, 1] for gripper) before training. Truelabel's dataset format guide documents action-space conventions for 30+ embodiments, including all 22 Open X-Embodiment platforms.
What percentage of Open X-Embodiment trajectories include language annotations?
Only 28 percent of the 1M trajectories include natural-language task descriptions. Coverage varies by subset: BridgeData V2 (95 percent), Language Table (88 percent), DROID (40 percent), RoboTurk (0 percent). Vision-language-action models like RT-2 require dense language supervision; training on sparse annotations degrades VLA performance by 20–30 percent. Teams fine-tuning VLA models should pre-filter to subsets with >80 percent language coverage or augment missing annotations via LLM-based captioning (e.g., GPT-4 prompted with episode frames). Truelabel's VLA fine-tuning guide recommends 60 percent templated instructions, 40 percent free-form for optimal generalization.