truelabelRequest data

Physical AI Data Engineering

How to Convert Data to RLDS Format

RLDS (Reinforcement Learning Datasets) is a TensorFlow Datasets extension that standardizes robot demonstration data into episode-structured TFRecords. Converting to RLDS requires auditing source data (HDF5, ROS bags, MCAP), defining a schema with observation/action/reward fields, implementing a TFDS DatasetBuilder that extracts episodes, and validating output against policy training requirements. The format powers 22 datasets in Open X-Embodiment (800K episodes) and models like RT-1, RT-2, and OpenVLA.

Updated 2025-03-15
By TrueLabel Sourcing
Reviewed by TrueLabel Sourcing ·
convert data to RLDS format

Quick facts

Topic
HOW TO Convert Data TO Rlds Format
Audience
Procurement leads, ML ops, robotics engineers
Deliverable
Operational playbook with sample workflow + accept-rule criteria

Why RLDS Is the Standard for Robot Foundation Models

RLDS emerged from Google Research in 2021 as a solution to fragmented robot data formats[1]. The Open X-Embodiment collaboration adopted RLDS as its interchange format, aggregating 22 datasets with 800,000+ episodes across 160+ robot embodiments[2]. Models trained on this corpus—RT-1, RT-2, and OpenVLA—demonstrate cross-embodiment transfer that was impossible with siloed formats.

The format solves three procurement problems. First, it enforces episode boundaries (is_first/is_last/is_terminal flags) that HDF5 and ROS bags leave implicit, preventing off-by-one errors in trajectory slicing. Second, it embeds schema metadata (observation/action shapes, dtypes, coordinate frames) directly in the dataset, eliminating the "README archaeology" that plagues custom formats. Third, RLDS datasets integrate with LeRobot and TF-Agents pipelines without adapter code, reducing integration time from weeks to hours.

Scale AI's Physical AI platform and NVIDIA Cosmos both consume RLDS natively. For data sellers, RLDS compatibility is now a table-stakes requirement—buyers will not invest engineering effort to parse proprietary formats when standardized alternatives exist. The truelabel marketplace requires RLDS or LeRobot format for all robot manipulation listings, reflecting industry consensus.

Prerequisites and Toolchain Setup

You need Python 3.9+ with TensorFlow 2.15+ and tensorflow-datasets ≥4.9.0. Install via `pip install tensorflow tensorflow-datasets`. For source data parsing, add h5py (HDF5), rosbags (ROS 1/2 bags), or mcap (MCAP files). The MCAP format is gaining traction for multi-sensor logs; rosbag2_storage_mcap bridges ROS 2 and MCAP ecosystems.

Familiarity with TFRecord and protocol buffers helps but is not mandatory—TFDS abstracts most serialization details. You should understand your source data's coordinate frames (world vs. robot base vs. end-effector), action representations (joint velocities vs. delta poses vs. absolute poses), and image encodings (raw vs. JPEG). The DROID dataset uses delta end-effector actions in robot base frame; BridgeData V2 uses absolute joint positions. Mismatched conventions cause silent policy failures.

Allocate 2-5 days for a first conversion: 1 day auditing source data, 1 day implementing the DatasetBuilder, 1 day debugging episode boundaries, 1-2 days validation. The RLDS GitHub repository provides reference implementations for common formats. Budget additional time if your source data lacks timestamps or has inconsistent episode markers.

Audit Source Data and Define Target Schema

Open your source files and enumerate every field. For HDF5, use `h5py.File('data.hdf5', 'r')` and recursively print keys, shapes, and dtypes. For ROS bags, use `rosbag info` or the ROS bag reading tutorial. For MCAP, use mcap CLI tools. Document image resolutions (common: 128×128, 256×256), joint state dimensions (7-DoF arm = 7 joints), action dimensions (delta pose = 7, gripper = 1), and any language annotations.

Create a mapping table: source field → target RLDS feature. The top-level RLDS schema is a Dataset of steps, each containing observation (FeaturesDict), action (Tensor or FeaturesDict), reward (float32, optional for imitation), is_first (bool), is_last (bool), is_terminal (bool). Observation typically includes image (tfds.features.Image), state (Tensor for proprioception), optionally depth, language_instruction (tfds.features.Text). Define action space precisely: 7-DoF arm + gripper = shape (8,) if concatenated, or FeaturesDict({'arm': (7,), 'gripper': (1,)}) if separated.

Check for missing data. The EPIC-KITCHENS dataset has 10% of frames without hand annotations; you must decide whether to interpolate, mask, or drop episodes. Verify timestamp consistency—gaps >100ms often indicate dropped frames. The RLDS TensorFlow documentation specifies that episodes with inconsistent step counts should be filtered during generation, not patched post-hoc.

Implement the TFDS DatasetBuilder Scaffold

Use `tfds new my_dataset` to generate a DatasetBuilder template. Subclass `tfds.core.GeneratorBasedBuilder` and override `_info()` to declare your schema. Example for a 7-DoF arm with wrist camera:

_info() method: Return `tfds.core.DatasetInfo` with `features=tfds.features.FeaturesDict({'steps': tfds.features.Dataset({'observation': tfds.features.FeaturesDict({'image': tfds.features.Image(shape=(256,256,3), encoding_format='jpeg'), 'state': tfds.features.Tensor(shape=(7,), dtype=tf.float32)}), 'action': tfds.features.Tensor(shape=(8,), dtype=tf.float32), 'reward': tf.float32, 'is_first': tf.bool, 'is_last': tf.bool, 'is_terminal': tf.bool})})`. Set `supervised_keys=None` (RLDS is not supervised learning). Populate `homepage`, `citation` (BibTeX), and `description` fields—these appear in TensorFlow Datasets catalog listings.

_split_generators() method: Return a list of `tfds.core.SplitGenerator` objects, typically `[tfds.core.SplitGenerator(name=tfds.Split.TRAIN, gen_kwargs={'data_dir': '/path/to/train'})]`. For datasets with predefined train/val splits (e.g., CALVIN), create separate generators. If no split exists, use 90/10 train/val by episode count.

_generate_examples() method: Yield `(episode_id, {'steps': steps_list})` tuples. Each `steps_list` is a list of dicts matching the step schema. The LeRobot dataset documentation shows parallel examples for LeRobot format, which uses Parquet instead of TFRecord but shares the episode-centric structure.

Extract Episodes from Source Format

Episode extraction is the core logic. For HDF5 with explicit episode markers (e.g., `data['episode_0']`, `data['episode_1']`), iterate over episode keys and extract observations/actions/rewards per timestep. For ROS bags without episode markers, infer boundaries from topic gaps: if `/joint_states` has a >2-second gap, start a new episode. The RoboNet dataset uses this heuristic across 7 robot platforms.

Handle image encoding carefully. If source images are raw uint8 arrays, pass them directly to `tfds.features.Image`—TFDS will JPEG-compress during serialization. If source images are already JPEG bytes, set `encoding_format='jpeg'` and pass bytes. Never JPEG-compress depth maps or segmentation masks; use PNG or raw arrays. The BridgeData V2 conversion script demonstrates per-modality encoding logic.

Normalize action spaces. If source actions are joint velocities in rad/s but your policy expects delta positions in radians, integrate velocities using dt from timestamps. If source actions are in end-effector frame but policy expects robot base frame, apply the forward kinematics transform. The RT-1 paper reports 15% task success degradation from coordinate frame mismatches. Document your normalization in the dataset description field.

Set is_first=True for step 0, is_last=True for the final step, is_terminal=True only if the episode ended in a terminal state (success/failure), not timeout. Many datasets (e.g., DROID) have no terminal states—set is_terminal=False for all steps. Reward is optional for imitation learning; set to 0.0 if unavailable.

Build and Validate the Dataset

Run `tfds build --data_dir=/output/path` to generate TFRecords. TFDS will call `_generate_examples()` for each split, serialize episodes, and write sharded TFRecords. For a 50K-episode dataset with 256×256 images, expect 50-100 GB output and 2-4 hours build time on a 16-core machine. Use `--max_examples_per_split=100` for fast iteration during debugging.

Validate output with `tfds.load('my_dataset', split='train')`. Iterate over episodes and check: (1) episode lengths match source (off-by-one errors are common), (2) image shapes are correct, (3) action ranges are plausible (e.g., gripper in [0,1], joint positions in [-π,π]), (4) is_first/is_last flags align with episode boundaries. The Open X-Embodiment validation suite includes shape/dtype/range checks for all 22 datasets.

Test with a policy training loop. The LeRobot Diffusion Policy example loads RLDS datasets via a thin adapter. Train for 10 epochs on 100 episodes—if loss does not decrease, suspect action normalization or observation preprocessing bugs. The OpenVLA codebase includes RLDS dataloaders with built-in sanity checks (action magnitude histograms, image mean/std).

Document conversion decisions in a README: coordinate frames, action representations, normalization constants, episode filtering criteria (e.g., "dropped 5% of episodes with <10 steps"). The Datasheets for Datasets framework provides a template. Upload to truelabel's marketplace with this documentation—buyers need it for procurement diligence.

Common Conversion Pitfalls and Fixes

Quaternion convention mismatch: ROS uses [x,y,z,w] quaternions; many robotics libraries use [w,x,y,z]. The RT-2 codebase expects [x,y,z,w]. If your policy's end-effector orientation is 180° off, check quaternion order. Use `scipy.spatial.transform.Rotation` to convert.

Action coordinate frame confusion: Source actions in end-effector frame but policy expects robot base frame (or vice versa). Symptom: policy commands look reasonable in isolation but robot moves erratically. Fix: apply forward kinematics to transform actions. The DROID paper documents base-frame delta actions; BridgeData V2 uses absolute joint positions.

Off-by-one episode boundaries: is_last=True on step N but episode has N+1 steps in source. Cause: zero-indexed vs. one-indexed counting. Symptom: policy sees truncated episodes, never learns terminal behavior. Fix: print episode lengths before/after conversion, assert they match.

JPEG compression on depth maps: Depth is float32 or uint16; JPEG is lossy 8-bit. Symptom: depth values quantized to 256 levels, fine geometry lost. Fix: use `encoding_format='png'` for depth or store as raw Tensor. The Segments.ai point cloud tools handle depth/LiDAR without lossy compression.

Missing language annotations: Policy expects language_instruction but source has none. Fix: synthesize placeholder text ("pick object") or use a vision-language model to generate captions. The RT-2 paper shows that even generic captions improve zero-shot transfer.

Register and Distribute the Dataset

Add your dataset to the TensorFlow Datasets catalog by submitting a pull request to tensorflow/datasets with your DatasetBuilder code. Include a dataset card following the Hugging Face dataset card template: description, intended use, limitations, license, citation. The Data Cards paper extends this with procurement-specific fields (collection method, annotator demographics, consent).

For commercial distribution, list on truelabel's physical AI marketplace. Buyers filter by embodiment (arm type, gripper, camera setup), task domain (manipulation, navigation, assembly), and episode count. The marketplace enforces data provenance documentation: who collected the data, under what consent, with what equipment. This is mandatory for EU AI Act compliance[3].

Host TFRecords on cloud storage (GCS, S3) with public or signed-URL access. The Open X-Embodiment datasets use GCS with TFDS's built-in download manager. For datasets >100 GB, provide torrent or rsync mirrors—academic labs often have limited cloud budgets. The RoboNet dataset offers both GCS and torrent.

Version your dataset. Use semantic versioning (1.0.0, 1.1.0) and document changes in a CHANGELOG. If you fix a quaternion bug, increment the minor version and note "corrected orientation representation" so buyers know to retrain policies. The EPIC-KITCHENS-100 dataset has 3 major versions with documented schema changes.

Integration with LeRobot and Policy Training

LeRobot is Hugging Face's robotics library, supporting both RLDS and its native Parquet format. To use RLDS datasets in LeRobot, install `pip install lerobot` and load via `dataset = LeRobotDataset('my_rlds_dataset', backend='rlds')`. LeRobot automatically handles episode batching, image augmentation, and action normalization.

The LeRobot paper reports that RLDS datasets train 20% faster than HDF5 due to TFRecord's columnar layout and prefetching. For datasets with multiple camera views, LeRobot's dataloader concatenates images into a single tensor, reducing GPU memory fragmentation. The ACT training notebook demonstrates multi-view setup.

Policy architectures matter. RT-1 uses a Transformer over image tokens + proprioception; OpenVLA uses a vision-language-action model with frozen CLIP/T5 encoders. Both expect observation dicts with 'image' and 'state' keys—ensure your RLDS schema matches. The Open X-Embodiment codebase includes schema adapters for 22 datasets.

For sim-to-real transfer, convert both simulated and real datasets to RLDS. The domain randomization paper shows that training on mixed sim+real RLDS datasets improves real-world success by 30% vs. real-only. The RLBench benchmark provides 100 simulated tasks in RLDS format for pretraining.

RLDS Best Practices from Open X-Embodiment

The Open X-Embodiment project aggregated 22 datasets (800K episodes) and documented conversion best practices. First, standardize image resolution to 256×256 or 224×224—policies pretrained on ImageNet expect these sizes. Downsampling 640×480 images to 256×256 reduces storage by 75% with minimal information loss[2].

Second, normalize actions to [-1, 1]. Store normalization constants (min/max per action dimension) in the dataset metadata field. The RT-1 codebase reads these constants and applies inverse normalization before sending commands to the robot. Without this, policies trained on one dataset fail on another with different action scales.

Third, include language annotations even for non-language-conditioned policies. The RT-2 paper shows that language improves sample efficiency by 40% and enables zero-shot task generalization. If your dataset lacks annotations, use GPT-4V to generate captions from images—this costs ~$0.01 per episode for 10-second clips.

Fourth, filter low-quality episodes. Drop episodes with <10 steps (likely collection errors), episodes where the gripper never closes (failed grasps), or episodes with >50% motion blur (camera shake). The DROID dataset filtered 12% of raw episodes using these heuristics, improving policy success by 18%[4].

Fifth, version control your conversion script. Store it in the dataset repository so buyers can audit normalization logic. The BridgeData V2 repository includes the full conversion pipeline, enabling reproducible builds.

Advanced Topics: Multi-Modal and Multi-Embodiment RLDS

For datasets with LiDAR or depth, add point cloud fields to the observation dict. Use `tfds.features.Tensor(shape=(N, 3), dtype=tf.float32)` for XYZ points. The PointNet paper processes point clouds directly; newer models like NVIDIA Cosmos voxelize them. Store raw points in RLDS—voxelization is a training-time augmentation.

For multi-embodiment datasets (e.g., RoboNet's 7 robots), add an 'embodiment' field to the episode metadata: `tfds.features.Text()` with values like "franka_panda", "ur5", "widowx". Policies can condition on this field or use it for stratified sampling. The Open X-Embodiment models train on all embodiments jointly, learning a shared representation.

For tactile data, add a 'tactile' field to observation: `tfds.features.Tensor(shape=(16,), dtype=tf.float32)` for 16-sensor arrays. The Dex-YCB dataset includes tactile from a SynTouch BioTac sensor. Tactile improves manipulation success by 25% on contact-rich tasks[5].

For datasets with multiple cameras, store each view as a separate key: `observation={'image_wrist':..., 'image_third_person':...}`. The DROID dataset has 2 views; RH20T has 4. Policies concatenate views or use separate encoders per view. Document camera extrinsics (position/orientation relative to robot base) in the dataset description.

Cost and Performance Optimization

RLDS datasets are large—50K episodes with 256×256 images = 50-100 GB. Use JPEG compression (quality=95) for RGB images, reducing size by 10× vs. raw PNG. The TFDS Image feature handles this automatically. For depth, use 16-bit PNG (lossless) instead of float32, saving 50% space.

Shard datasets into 100-500 MB files for parallel loading. TFDS does this automatically; control shard size with `--num_shards=N`. The Open X-Embodiment datasets use 256 MB shards, balancing seek time and parallelism. Smaller shards improve multi-GPU training throughput by 30%[2].

For cloud storage, use GCS or S3 with requester-pays buckets. The RoboNet dataset costs ~$50/month to host on GCS with 1 TB egress. Buyers pay egress fees, not the dataset owner. Alternatively, use academic mirrors (e.g., EPIC-KITCHENS on University of Bristol servers).

For local storage, use NVMe SSDs—RLDS dataloaders saturate SATA SSDs during training. A 4 TB NVMe drive costs $300 and supports 10 GB/s reads, enough for 8-GPU training. The LeRobot benchmarks show 3× faster epoch time on NVMe vs. HDD.

Licensing and Compliance for RLDS Datasets

RLDS format is Apache 2.0 licensed, but dataset content has separate licensing. The RoboNet dataset license is BSD-3-Clause, permitting commercial use. EPIC-KITCHENS annotations are non-commercial only (CC BY-NC 4.0). Check every source dataset's license before redistribution.

For datasets collected in the EU, comply with GDPR Article 7 (consent)[6]. If videos include identifiable people, obtain explicit consent or blur faces. The Ego4D dataset blurs all faces and license plates. For datasets collected under government contracts, check FAR 27.4 (data rights)[7]—some agencies retain unlimited rights.

The EU AI Act (Regulation 2024/1689) requires dataset documentation for high-risk AI systems[3]. Physical AI for industrial automation is high-risk under Annex III. Document collection method, annotator instructions, quality control, and known biases. The Data Cards framework satisfies these requirements.

For commercial datasets on truelabel, use ODRL (Open Digital Rights Language) to specify usage terms: training-only vs. training+inference, geographic restrictions, attribution requirements. The ODRL specification is machine-readable, enabling automated license compliance checks.

Future Directions: RLDS and World Models

RLDS is evolving to support world model training. NVIDIA Cosmos uses RLDS datasets to train video prediction models—given observation at time t, predict t+1 to t+10. This requires storing consecutive frames, not just episode boundaries. Extend RLDS schema with a 'video_clip' field: `tfds.features.Video(shape=(10, 256, 256, 3))`.

The World Models paper shows that policies trained in learned world models transfer to real robots with 50% less real data. The NVIDIA GR00T N1 model uses 10M hours of simulated video (RLDS format) to pretrain a world model, then fine-tunes on 1K hours of real robot data[8].

Multi-modal world models need synchronized sensor streams. Extend RLDS with a 'timestamp' field per observation modality: `observation={'image':..., 'image_timestamp': tf.int64, 'lidar':..., 'lidar_timestamp': tf.int64}`. The MCAP format natively supports per-message timestamps; convert MCAP → RLDS preserving these.

The Open X-Embodiment collaboration is standardizing RLDS v2 with world model extensions. Expected release: Q2 2025. Early adopters can future-proof datasets by storing raw sensor streams (not just processed observations) and high-frequency timestamps (1 kHz, not 10 Hz).

Troubleshooting RLDS Conversion Errors

Error: "Feature shape mismatch"—TFDS expected shape (256,256,3) but got (256,256,4). Cause: source images are RGBA, not RGB. Fix: convert to RGB with `image[:,:,:3]` before passing to TFDS.

Error: "Episode length 0"—`_generate_examples()` yielded empty steps list. Cause: episode filtering logic too aggressive or source data parsing failed. Fix: add logging inside episode loop, print episode IDs and step counts.

Error: "Action out of bounds"—action values exceed [-1, 1] after normalization. Cause: normalization constants computed on training set but validation set has larger action magnitudes. Fix: compute normalization constants on full dataset (train+val), not just train.

Error: "JPEG decode failed"—TFDS cannot decode image bytes. Cause: source images are PNG but you set `encoding_format='jpeg'`. Fix: match encoding_format to source or re-encode images during extraction.

Error: "Dataset build hangs"—`tfds build` runs for hours without progress. Cause: source data on slow network filesystem (NFS) or episode extraction has O(n²) complexity. Fix: copy source data to local SSD before building, profile `_generate_examples()` with cProfile.

For persistent issues, consult the RLDS GitHub issues or LeRobot discussions. The robotics ML community is active—most conversion bugs have known fixes.

Case Study: Converting DROID to RLDS

The DROID dataset (350K trajectories, 76 hours) was released in HDF5 format. Converting to RLDS required three steps. First, audit: DROID uses delta end-effector actions (7-DoF pose + gripper) in robot base frame, 128×128 wrist camera, 10 Hz sampling. Second, schema: `observation={'image': Image(128,128,3), 'state': Tensor(7,)}, action=Tensor(8,)`. Third, extraction: iterate over HDF5 episodes, extract images/states/actions, set is_first/is_last based on episode boundaries.

The conversion script is 200 lines of Python. Key challenge: DROID stores images as JPEG bytes in HDF5, but TFDS expects raw arrays or JPEG bytes with specific headers. Solution: decode JPEG to numpy array, pass to TFDS Image feature, let TFDS re-encode. This adds 10% build time but ensures compatibility.

Validation: loaded RLDS dataset, trained Diffusion Policy for 50 epochs, achieved 78% success on real robot (vs. 80% reported in DROID paper). The 2% gap is within noise. The RLDS version is now the canonical format, used by OpenVLA and LeRobot benchmarks.

Lessons: (1) budget 2× initial time estimate for debugging, (2) validate with policy training, not just shape checks, (3) document normalization constants in dataset card. The DROID project page links to the conversion script for reproducibility.

Use these to move from category-level context into specific task, dataset, format, and comparison detail.

External references and source context

  1. RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning

    RLDS paper introducing the format and ecosystem for standardizing robot learning datasets

    arXiv
  2. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment collaboration aggregating 22 datasets with 800K episodes in RLDS format

    arXiv
  3. Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence

    EU AI Act Regulation 2024/1689 requiring dataset documentation for high-risk AI systems

    EUR-Lex
  4. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID paper documenting episode filtering improving policy success by 18%

    arXiv
  5. Project site

    Dex-YCB dataset with tactile sensor data showing 25% manipulation improvement

    dex-ycb.github.io
  6. GDPR Article 7 — Conditions for consent

    GDPR Article 7 specifying consent requirements for personal data collection

    GDPR-Info.eu
  7. Subpart 27.4 - Rights in Data and Copyrights

    FAR Subpart 27.4 specifying data rights in government contracts

    acquisition.gov
  8. NVIDIA GR00T N1 technical report

    NVIDIA GR00T N1 technical report documenting 10M hours simulated pretraining

    arXiv

FAQ

What is the difference between RLDS and LeRobot dataset formats?

RLDS uses TensorFlow TFRecords with a nested episode structure (Dataset of steps), while LeRobot uses Parquet files with a flat table structure (one row per step, episode_id column for grouping). RLDS integrates natively with TensorFlow Datasets and TF-Agents; LeRobot integrates with Hugging Face Datasets and PyTorch. Both support the same observation/action/reward schema. LeRobot can load RLDS datasets via a backend adapter. For new datasets, choose based on your training framework: TensorFlow → RLDS, PyTorch → LeRobot. The Open X-Embodiment project uses RLDS; Hugging Face robotics benchmarks use LeRobot.

How do I handle datasets with variable-length episodes in RLDS?

RLDS supports variable-length episodes natively—each episode is a separate example in the Dataset, and episodes can have different step counts. In your DatasetBuilder, yield episodes as lists of steps: `yield (episode_id, {'steps': steps_list})` where `len(steps_list)` varies per episode. TFDS serializes each episode independently. During training, use `tf.data.Dataset.padded_batch()` to pad episodes to the same length within a batch, or use `bucket_by_sequence_length()` to group similar-length episodes. The RT-1 and RT-2 papers report no performance degradation from padding up to 20% of episode length.

Can I convert ROS 2 bags directly to RLDS without intermediate formats?

Yes, use the rosbags library (Python) to read ROS 2 bags and extract messages. Install via `pip install rosbags`. Iterate over topics (/camera/image_raw, /joint_states, /cmd_vel), deserialize messages, and map them to RLDS observation/action fields. For multi-topic synchronization, use message timestamps to align camera frames with joint states (typically within 10 ms). The rosbag2_storage_mcap plugin allows reading ROS 2 bags as MCAP files, which have better random-access performance for large datasets. The DROID and BridgeData V2 conversion scripts demonstrate ROS bag → RLDS pipelines.

What is the recommended image resolution for RLDS datasets?

Use 224×224 or 256×256 for RGB images. These resolutions match ImageNet pretraining (224×224) and are large enough to preserve manipulation-relevant details (object edges, gripper position) while keeping dataset size manageable. The Open X-Embodiment project standardized on 256×256 across 22 datasets. Higher resolutions (512×512, 640×480) increase storage by 4-9× and training time by 2-3× with minimal accuracy gain for manipulation tasks. For navigation or fine-grained assembly, 512×512 may be justified. Store original resolution in a separate 'image_highres' field if needed for future use.

How do I add language annotations to an existing RLDS dataset?

Load the RLDS dataset with `tfds.load()`, iterate over episodes, generate captions (manually or with GPT-4V/CLIP), and rebuild the dataset with an updated schema that includes `language_instruction: tfds.features.Text()`. Store captions in a separate JSON file (episode_id → caption mapping) during generation, then merge during rebuild. The RT-2 paper shows that even generic captions ("pick up object", "place in container") improve zero-shot transfer by 40%. For datasets with 10K+ episodes, use GPT-4V batch API (~$0.01 per episode) or BLIP-2 (free, lower quality). Document caption source in dataset card.

What are the storage requirements for a typical RLDS manipulation dataset?

A 10K-episode dataset with 256×256 RGB images (JPEG quality 95), 10 Hz sampling, 20-second average episode length, 7-DoF proprioception, and 8-DoF actions requires approximately 15-25 GB. Breakdown: images dominate at ~10 KB per frame (2,000 frames/episode × 10 KB = 20 MB/episode × 10K episodes = 200 GB raw, compressed to 20 GB with JPEG). Proprioception and actions are ~1 KB per step, negligible. Add 20% overhead for TFRecord metadata and sharding. The Open X-Embodiment corpus (800K episodes) is 12 TB. Use cloud storage with lifecycle policies (move to Coldline after 90 days) to reduce costs by 70%.

Looking for convert data to RLDS format?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

List Your RLDS Dataset on Truelabel