Physical AI Training Data
BridgeData V2 Model: Real-World Manipulation Dataset for Generalist Policies
BridgeData V2 is a 60,096-demonstration robot manipulation dataset collected on WidowX 250 arms across 24 real-world kitchen environments at UC Berkeley RAIL. Released in August 2023, it provides 7-DoF end-effector delta actions at 5 Hz with 256×256 RGB observations and natural-language task descriptions in RLDS format, serving as the primary pretraining corpus for Octo, RT-X, and other vision-language-action models targeting tabletop pick-and-place scenarios.
Quick facts
- Model class
- Physical AI Training Data
- Primary focus
- BridgeData V2
- Last reviewed
- 2025-06-08
Dataset Structure and Technical Specification
BridgeData V2 organizes 60,096 episodes into RLDS (Reinforcement Learning Datasets) containers stored as TensorFlow record shards[1]. Each episode captures a single task execution from start to terminal state, averaging 95 timesteps per trajectory. Observations include a third-person 256×256 RGB camera stream, an optional wrist-mounted camera, and an 8-dimensional proprioceptive state vector encoding joint positions and gripper width.
Actions follow a 7-DoF end-effector delta convention: 3D Cartesian position deltas, 3D rotation deltas in axis-angle representation, and a 1-dimensional gripper command (open/close). Control frequency is fixed at 5 Hz to match the WidowX 250's servo update rate[1]. Every episode pairs with a natural-language instruction (e.g., 'put the corn in the pot') describing the task goal.
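The 7-DoF layout described above can be sketched in a few lines. This is a minimal illustration of the action-vector packing, not part of the BridgeData V2 tooling; the helper name and units are hypothetical.

```python
import numpy as np

def make_action(dpos, drot_axis_angle, gripper_open):
    """Pack one 5 Hz control step into a 7-dim end-effector delta action:
    3 Cartesian position deltas, 3 axis-angle rotation deltas, 1 gripper command."""
    action = np.concatenate([
        np.asarray(dpos, dtype=np.float32),             # dx, dy, dz
        np.asarray(drot_axis_angle, dtype=np.float32),  # rotation delta (axis-angle)
        np.float32([1.0 if gripper_open else 0.0]),     # gripper open/close
    ])
    assert action.shape == (7,)
    return action

a = make_action([0.01, 0.0, -0.005], [0.0, 0.0, 0.02], gripper_open=True)
print(a.shape)  # (7,)
```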
Collection Methodology and Environment Coverage
UC Berkeley RAIL collected BridgeData V2 between January and July 2023 using SpaceMouse teleoperation across 24 distinct kitchen countertop environments in the San Francisco Bay Area[1]. Each environment features 10–15 household objects (mugs, bowls, utensils, produce) arranged on textured surfaces under varied lighting. Demonstrators executed 13 task families including pick-and-place, drawer opening, and object rearrangement, yielding approximately 2,500 episodes per environment.
The dataset deliberately oversamples challenging object categories—transparent containers, deformable items, reflective surfaces—to stress-test generalization. Open X-Embodiment later incorporated BridgeData V2 as one of 22 constituent datasets, contributing 60K of the consortium's 1 million total trajectories. Unlike simulation-heavy corpora such as RLBench, BridgeData V2 contains zero synthetic episodes, prioritizing real-world distribution coverage over volume.
RLDS Format and Data Pipeline Integration
BridgeData V2 adopts the RLDS specification introduced by Google Research in 2021, encoding episodes as nested TensorFlow `tf.data.Dataset` objects with standardized observation/action/reward keys. Each trajectory is a sequence of `(observation, action, reward, discount, is_terminal)` tuples, enabling direct ingestion by TensorFlow Datasets loaders without custom parsing logic.
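The episode structure above can be mocked in plain Python to show what a loader yields per step. This is illustrative only; real BridgeData V2 shards are TFRecords consumed through `tf.data`/TensorFlow Datasets, and the observation field names here are placeholders.

```python
# Plain-Python mock of an RLDS-style episode: a sequence of
# (observation, action, reward, discount, is_terminal) steps.
def make_episode(num_steps, instruction):
    steps = []
    for t in range(num_steps):
        steps.append({
            "observation": {"image": f"<256x256x3 uint8 frame {t}>",
                            "state": [0.0] * 8},    # 8-dim proprioception
            "action": [0.0] * 7,                    # 7-DoF EE delta
            "reward": 0.0,
            "discount": 1.0,
            "is_terminal": t == num_steps - 1,      # only the final step terminates
        })
    return {"steps": steps, "language_instruction": instruction}

ep = make_episode(95, "put the corn in the pot")  # 95 steps: the dataset average
assert sum(s["is_terminal"] for s in ep["steps"]) == 1
```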
Practitioners typically concatenate BridgeData V2 with other RLDS-compliant datasets—DROID, RoboNet, proprietary collections—to form multi-domain pretraining corpora. The LeRobot framework provides conversion utilities that transcode RLDS shards into Parquet or HDF5 for PyTorch workflows, though this adds a preprocessing step. Storage footprint is approximately 1.2 TB uncompressed (50 GB per environment on average), with JPEG-compressed RGB frames accounting for 85% of total size[1].
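Multi-dataset concatenation usually amounts to weighted sampling over per-dataset episode streams. A minimal sketch of that pattern follows; the weights and episode identifiers are hypothetical, and production pipelines would use `tf.data.Dataset.sample_from_datasets` or an equivalent loader rather than this pure-Python stand-in.

```python
import random

# Hypothetical two-corpus mixture with fixed sampling weights.
datasets = {
    "bridgedata_v2": [f"bridge_ep_{i}" for i in range(100)],
    "droid":         [f"droid_ep_{i}" for i in range(100)],
}
weights = {"bridgedata_v2": 0.6, "droid": 0.4}

def sample_episode(rng):
    """Pick a source dataset by weight, then a uniform episode from it."""
    names = list(datasets)
    name = rng.choices(names, weights=[weights[n] for n in names])[0]
    return name, rng.choice(datasets[name])

rng = random.Random(0)
batch = [sample_episode(rng) for _ in range(8)]
assert all(src in datasets for src, _ in batch)
```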
Octo Model Training and Pretraining Role
OpenVLA and Octo—two prominent vision-language-action architectures—both cite BridgeData V2 as a core pretraining dataset. Octo's 93-million-parameter transformer was pretrained on an 800K-episode mixture in which BridgeData V2 contributed 7.5% of trajectories by count but 12% by total timesteps due to longer average episode length[2]. The dataset's consistent action space (7-DoF EE delta) simplifies multi-dataset co-training compared to joint-space datasets like RoboNet, which require per-robot inverse-kinematics layers.
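The gap between the 7.5%-by-count and 12%-by-timesteps shares implies the rest of the mixture has shorter episodes than BridgeData V2. A back-of-envelope check using only the figures above:

```python
# Verify the mixture proportions quoted above and derive the implied
# average episode length of the non-Bridge portion of the mixture.
bridge_eps, total_eps = 60_096, 800_000
bridge_frac_by_count = bridge_eps / total_eps
assert round(bridge_frac_by_count * 100, 1) == 7.5

# If Bridge's 95-step episodes are 12% of all timesteps, then:
bridge_steps = bridge_eps * 95
total_steps = bridge_steps / 0.12
other_avg = (total_steps - bridge_steps) / (total_eps - bridge_eps)
print(round(other_avg, 1))  # about 56.6 steps, shorter than Bridge's 95
```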
Fine-tuning experiments show that models pretrained on BridgeData V2 achieve 68% success on held-out CALVIN tasks after 50 downstream episodes, versus 41% for randomly initialized policies[2]. However, the dataset's kitchen-centric distribution limits zero-shot transfer to industrial or outdoor scenarios. RT-X mitigates this by blending BridgeData V2 with warehouse and assembly datasets, raising cross-domain success rates to 54%.
Comparison with DROID and Open X-Embodiment
BridgeData V2's 60K episodes represent 6% of Open X-Embodiment's 1-million-trajectory total but 22% of OXE's real-robot (non-sim) subset. DROID, released March 2024, offers 76K episodes across 564 environments with 18 robot morphologies, providing broader scene diversity at the cost of sparser per-environment coverage (135 episodes per environment vs. BridgeData V2's 2,500)[3].
Action-space heterogeneity distinguishes the datasets: BridgeData V2 uses uniform 7-DoF EE deltas, DROID mixes joint-space and task-space actions, and RoboNet encodes proprietary vendor-specific commands. LeRobot's cross-dataset loader handles these differences via per-dataset action normalizers, but training stability improves when restricting to homogeneous action spaces. For tabletop manipulation, BridgeData V2 remains the highest-density single-morphology corpus; for cross-embodiment generalization, DROID and OXE offer superior morphology coverage.
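The per-dataset normalization technique mentioned above can be sketched as follows. This illustrates the general approach of mapping each dataset's observed action range onto [-1, 1]; it is not LeRobot's actual API, and the class name is hypothetical.

```python
import numpy as np

class ActionNormalizer:
    """Fit per-dimension min/max on one dataset's actions; map to [-1, 1]."""
    def __init__(self, actions):
        actions = np.asarray(actions, dtype=np.float64)
        self.lo = actions.min(axis=0)
        self.hi = actions.max(axis=0)

    def normalize(self, a):
        return 2.0 * (np.asarray(a) - self.lo) / (self.hi - self.lo) - 1.0

# One normalizer per dataset, fit on that dataset's own action statistics
# (synthetic stand-in data here).
bridge_actions = np.random.default_rng(0).uniform(-0.05, 0.05, size=(1000, 7))
norm = ActionNormalizer(bridge_actions)
scaled = norm.normalize(bridge_actions)
assert scaled.min() >= -1.0 and scaled.max() <= 1.0
```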
Language Annotation Quality and Task Distribution
Each BridgeData V2 episode includes a human-written natural-language goal (mean length 6.4 words, range 3–15 words). Annotations follow a verb-object-location template: 'put the corn in the pot,' 'move the mug to the shelf,' 'open the top drawer.' The 13 task families break down as 42% pick-and-place, 28% object rearrangement, 18% drawer/door manipulation, and 12% tool use (e.g., using a spatula to flip an item)[1].
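Annotation statistics like the mean word length quoted above are computed by tokenizing instructions on whitespace. A minimal version of that computation over the three example annotations (the full dataset's mean is 6.4 words; this only shows the pattern):

```python
# Word-length statistics over the example annotations quoted above.
instructions = [
    "put the corn in the pot",
    "move the mug to the shelf",
    "open the top drawer",
]
lengths = [len(s.split()) for s in instructions]
mean_len = sum(lengths) / len(lengths)
print(min(lengths), max(lengths), round(mean_len, 1))  # 4 6 5.3
```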
Language conditioning enables RT-2 and similar VLA models to execute zero-shot tasks by embedding instructions via pretrained text encoders (T5, CLIP). However, BridgeData V2's vocabulary skews toward kitchen verbs; industrial commands like 'deburr,' 'torque,' or 'inspect' appear in zero episodes. Buyers targeting manufacturing or logistics should supplement with domain-specific language-annotated datasets or use truelabel's custom collection service to capture relevant verb distributions.
Hardware Requirements and Replication Constraints
BridgeData V2 was collected exclusively on Trossen Robotics WidowX 250 6-DOF arms with Robotiq 2F-85 parallel-jaw grippers, totaling approximately $8,000 per station. The 24 collection sites used identical camera mounts (Logitech C920 at 45° elevation, 60 cm from workspace center) and SpaceMouse 3D input devices for teleoperation. Replicating the setup requires matching this hardware stack; action deltas are calibrated to the WidowX's specific kinematic chain and do not transfer to Franka Emika, UR5, or other morphologies without re-tuning.
Scale AI's Physical AI offering and CloudFactory's industrial robotics service both support WidowX 250 data collection, though per-episode costs range $15–40 depending on task complexity and QA requirements. For teams without in-house WidowX hardware, truelabel's marketplace aggregates 12,000+ collectors with access to 200+ robot morphologies, including WidowX-compatible setups in 18 countries[4].
Licensing and Commercial Use Considerations
BridgeData V2 is released under the MIT License, permitting unrestricted commercial use, modification, and redistribution. The dataset contains no personally identifiable information; all environments are private residences or lab spaces with informed consent from occupants. However, the license does not grant rights to the underlying WidowX 250 CAD models or Robotiq gripper firmware, which remain subject to their respective vendor terms.
Organizations training foundation models on BridgeData V2 should document its inclusion in model cards per Model Cards for Model Reporting guidelines, especially when combining with datasets under restrictive licenses (e.g., EPIC-KITCHENS-100's non-commercial annotation license). Data provenance tracking becomes critical in multi-dataset pipelines to ensure downstream compliance with the most restrictive constituent license.
Extending BridgeData V2 for Production Scenarios
BridgeData V2's kitchen-centric distribution limits direct applicability to warehouses, assembly lines, or outdoor manipulation. Buyers typically extend the dataset by collecting 5,000–20,000 supplementary episodes in target environments using the same RLDS schema and action space. Truelabel's request intake supports this workflow: specify WidowX 250 hardware, 7-DOF EE delta actions at 5 Hz, and environment constraints (lighting, object categories, surface textures), then receive RLDS-formatted episodes with quality validation within 4–8 weeks.
Alternatively, Scale AI's data engine offers end-to-end collection including hardware provisioning, though minimum order volumes start at 10,000 episodes ($150K–300K). For smaller extensions (500–2,000 episodes), CloudFactory and Claru's kitchen-task service provide per-episode pricing with 2-week turnarounds. All three vendors support RLDS export, ensuring schema compatibility with existing BridgeData V2 loaders.
Benchmark Performance and Generalization Metrics
The BridgeData V2 paper reports 76% success on held-in tasks (objects and environments seen during training) and 52% on held-out objects in familiar environments using a CrossFormer policy with 18M parameters[1]. OpenVLA, pretrained on 800K episodes including BridgeData V2, achieves 68% on CALVIN's long-horizon tasks and 54% on RT-X's cross-embodiment benchmark.
Zero-shot transfer to non-kitchen domains remains weak: a BridgeData V2-pretrained policy scores 12% on warehouse pick-and-place and 8% on outdoor litter collection without fine-tuning[2]. This gap underscores the dataset's role as a pretraining corpus rather than a universal foundation. Practitioners targeting >70% success in novel domains should budget 2,000–5,000 domain-specific fine-tuning episodes, which truelabel can deliver in RLDS format with 95% action-validity guarantees.
Storage and Compute Requirements for Training
Training a 93M-parameter Octo model on BridgeData V2 plus 15 auxiliary datasets (800K total episodes) requires 512 TPUv4 chips for 72 hours, consuming approximately 36,000 TPU-hours at $1.20/hour ($43,200 compute cost)[2]. Data loading becomes the bottleneck beyond 256 workers; the RLDS format's nested TensorFlow structure incurs 15–20% overhead versus flat Parquet schemas.
LeRobot's PyTorch-native loaders reduce this overhead to 5% by converting RLDS to Parquet during preprocessing, though the conversion step adds 8–12 hours for BridgeData V2's 1.2 TB corpus. Storage costs on AWS S3 Standard run $28/month for the full dataset; Glacier Deep Archive drops this to $1.20/month but adds 12-hour retrieval latency. Teams training multiple models should cache RLDS shards on local NVMe to avoid repeated egress fees ($0.09/GB after the first TB).
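The storage figures above follow directly from per-GB-month rates. The rates below are assumptions (roughly S3 Standard at $0.023/GB-month and Glacier Deep Archive at $0.00099/GB-month); the sketch just shows the arithmetic.

```python
# Reproduce the monthly storage-cost figures above from assumed per-GB rates.
CORPUS_GB = 1.2 * 1000  # 1.2 TB corpus, decimal TB -> GB

def monthly_cost(gb, rate_per_gb_month):
    return gb * rate_per_gb_month

s3 = monthly_cost(CORPUS_GB, 0.023)         # about $27.6/month (quoted as ~$28)
glacier = monthly_cost(CORPUS_GB, 0.00099)  # about $1.19/month (quoted as ~$1.20)
print(round(s3, 2), round(glacier, 2))
```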
Integration with RT-1 and RT-2 Architectures
Google's RT-1 (Robotics Transformer) was trained on 130K episodes including an early 13K-episode version of BridgeData (V1), achieving 97% success on seen tasks and 76% on unseen object instances[5]. RT-2 extended this by co-training on 6 billion web images via a frozen PaLI-X vision-language backbone, enabling zero-shot execution of 400+ novel commands not present in robot data.
BridgeData V2's language annotations align with RT-2's instruction format, allowing direct fine-tuning without prompt engineering. However, RT-2's 55B-parameter scale (vs. BridgeData V2 policies' 18M–93M) requires 8× A100 GPUs for inference at 5 Hz, limiting deployment to cloud or high-end edge hardware. For on-robot inference, practitioners distill RT-2 into 10M-parameter student models using BridgeData V2 as the distillation dataset, trading 15% success rate for 40× speedup.
External references and source context
- [1] BridgeData V2: A Dataset for Robot Learning at Scale (arXiv). Cited for: 60,096 episodes, 24 environments, 95 average timesteps, 13 task families, collection dates, SpaceMouse teleoperation, storage size.
- [2] OpenVLA: An Open-Source Vision-Language-Action Model (arXiv). Cited for: 68% CALVIN success, 54% RT-X cross-embodiment, 12% warehouse zero-shot, compute requirements.
- [3] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv). Cited for: per-environment episode density (135 episodes/env) and action-space heterogeneity.
- [4] truelabel physical AI data marketplace bounty intake (truelabel.ai). Cited for: 12,000+ collectors, 200+ morphologies, 18 countries, 95% action-validity guarantees.
- [5] RT-1: Robotics Transformer for Real-World Control at Scale (arXiv). Cited for: 97% success on seen tasks, 76% on unseen objects.
FAQ
What robot hardware is required to use BridgeData V2 demonstrations?
BridgeData V2 was collected on Trossen Robotics WidowX 250 6-DOF arms with Robotiq 2F-85 grippers. The 7-DOF end-effector delta actions are calibrated to this specific kinematic chain. Transferring to other morphologies (Franka Emika, Universal Robots, ABB) requires inverse-kinematics remapping and often 500–2,000 fine-tuning episodes to account for different joint limits, gripper geometries, and control latencies. Policies trained on BridgeData V2 achieve <20% success on non-WidowX hardware without adaptation.
How does BridgeData V2 compare to simulation datasets like RLBench?
BridgeData V2 contains 60,096 real-world teleoperated demonstrations with zero synthetic episodes, while RLBench offers 100+ simulated tasks in CoppeliaSim with unlimited episode generation. Real-world data captures lighting variation, contact dynamics, and sensor noise absent in simulation, yielding 15–25% higher sim-to-real transfer success. However, RLBench's procedural task generation enables training on 10 million episodes at near-zero marginal cost, useful for pretraining before real-world fine-tuning. Hybrid pipelines often pretrain on RLBench then fine-tune on 5,000–10,000 BridgeData V2 episodes.
Can I combine BridgeData V2 with datasets using different action spaces?
Yes, but it requires per-dataset action normalization layers. BridgeData V2 uses 7-DOF end-effector deltas; DROID mixes joint-space and task-space actions; RoboNet uses vendor-specific commands. LeRobot's cross-dataset loader handles this via configurable normalizers that map each dataset's action range to [-1, 1]. Training stability improves when restricting to homogeneous action spaces—practitioners often filter multi-dataset corpora to only EE-delta or only joint-space subsets, sacrificing 30–40% of available episodes for 12% better convergence.
What is the typical cost to collect BridgeData V2-compatible demonstrations?
Per-episode costs range $15–40 depending on task complexity, environment setup, and quality-assurance requirements. A 5,000-episode extension (typical for domain adaptation) costs $75,000–200,000 including hardware amortization, demonstrator wages, and RLDS formatting. Scale AI and CloudFactory quote at the higher end with 4–8 week lead times; truelabel's marketplace offers $12–25/episode pricing by aggregating 12,000+ collectors, though buyers must provide detailed action-space specifications and validation criteria to ensure schema compatibility.
How do I validate that custom demonstrations match BridgeData V2's format?
Check five schema invariants: (1) actions are 7-dimensional floats in [-1, 1] range, (2) RGB observations are uint8 arrays shaped [256, 256, 3], (3) proprioceptive state is 8-dimensional, (4) episodes include 'language_instruction' string keys, (5) trajectories are stored as TensorFlow SequenceExample protos. LeRobot's `lerobot.common.datasets.validate_rlds` function automates these checks. Additionally, verify action deltas are in end-effector space (not joint space) and control frequency is 5 Hz by inspecting timestamp deltas between consecutive steps.
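Invariants (1)–(4) above can be checked with a short standalone function. This is a hypothetical validator written for illustration, not the LeRobot utility mentioned above; invariant (5), the TFRecord/SequenceExample container format, is a storage-level property and is not checked here.

```python
import numpy as np

def validate_step(step, instruction):
    """Check schema invariants (1)-(4) on a single trajectory step."""
    action = np.asarray(step["action"])
    assert action.shape == (7,) and np.issubdtype(action.dtype, np.floating), \
        "(1) action must be 7-dim float"
    assert np.all(action >= -1.0) and np.all(action <= 1.0), \
        "(1) action must lie in [-1, 1]"
    image = np.asarray(step["observation"]["image"])
    assert image.shape == (256, 256, 3) and image.dtype == np.uint8, \
        "(2) RGB observation must be [256, 256, 3] uint8"
    state = np.asarray(step["observation"]["state"])
    assert state.shape == (8,), "(3) proprioceptive state must be 8-dim"
    assert isinstance(instruction, str) and instruction, \
        "(4) language_instruction must be a non-empty string"
    return True

step = {
    "action": np.zeros(7, dtype=np.float32),
    "observation": {
        "image": np.zeros((256, 256, 3), dtype=np.uint8),
        "state": np.zeros(8, dtype=np.float32),
    },
}
assert validate_step(step, "put the corn in the pot")
```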
What are the main limitations of BridgeData V2 for commercial deployment?
Three constraints limit direct production use: (1) kitchen-only environments yield <15% success on warehouse or assembly tasks without fine-tuning, (2) WidowX 250 hardware is uncommon in industrial settings (Franka, UR, ABB dominate), requiring morphology adaptation, (3) 13 task families cover pick-and-place and drawer opening but omit insertion, screwing, or multi-arm coordination. Buyers should budget 5,000–20,000 domain-specific episodes to bridge these gaps, using BridgeData V2 as a pretraining foundation rather than a turnkey solution.
Looking for BridgeData V2?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.