Format reference
Robot data format guides
Eight formats cover almost every robotics dataset shipped today. The matrix below compares them on streaming, schema preservation, language SDK coverage, compression, and license. Each row links to a deeper format-specific page with verified scale facts and when-to-use guidance.
Feature matrix
| Format | Primary use | Streaming | Schema preservation | Reader SDKs | Compression | License | Since |
|---|---|---|---|---|---|---|---|
| LeRobot | Robot-learning datasets + policy training | Partial (HF Datasets streaming) | Strong — LeRobotDataset v2.x pinned | Python (lerobot, datasets) | MP4 (AV1) + Parquet | Apache-2.0 | 2024 |
| MCAP | Time-synchronized robotics logs | Yes (chunk-indexed) | Self-describing (protobuf, ROS, JSON, FlatBuffer) | Rust, Python, C++, Go, TypeScript | lz4, zstd | MIT | 2022 |
| RLDS | RL / robotics episodes (obs, action, reward, metadata) | Yes (TFDS-based) | Strong — episode + step shape pinned | Python (TFDS, NumPy) | Inherits TFDS | Apache-2.0 | 2021 |
| Parquet | Tabular state/action streams + frame tables | Columnar partial reads | Strong — Arrow schema | C/C++, Java, Python, R, Rust, Go, JS | snappy, gzip, zstd, brotli | Apache-2.0 | 2013 |
| ROS bag | ROS-native robot fleet logs (legacy) | ROS 1 yes; ROS 2 / SQLite yes | ROS message types only | C++, Python (via ROS) | lz4, bz2 | BSD-style (ROS) | 2007 |
| HDF5 | Trajectories, pose, sensor streams | Partial reads supported | Schema-rich (groups + attributes) | C/C++, Python, R, Java, MATLAB | Native (gzip, szip); zstd via filter plugin | BSD-style | 1998 |
| Pickle | Python-first benchmark releases | No | Schema-free (any Python object) | Python only | External (gzip wrapper) | PSF (Python) | 1994 |
| Point cloud | 3D scene geometry, LiDAR, depth | Format-dependent | PCD / PLY / LAS / USDZ headers | C++, Python, USD/Pixar tools | LAZ, draco, ZSTD | BSD / open | 1994 |
8 format guides
HDF5 format for robot training data
Delivery format
HDF5 is useful for large structured arrays: trajectories, pose, sensor streams, and compact robot episodes. Define the schema version, the groups for observations and actions, timestamps, task labels, and metadata attributes before reviewing samples so you can verify that the delivery matches the training pipeline.
LeRobot format for robot training data
Delivery format
The LeRobot format is useful for developer-friendly robot-learning datasets and policy training pipelines. Define episode metadata, observation tensors, action tensors, timestamps, and a repo-compatible manifest before reviewing samples so you can verify that the delivery matches the training pipeline.
MCAP format for robot training data
Delivery format
MCAP is useful for time-synchronized robotics logs, compressed video topics, IMU, state, and action messages. Define topic schema, timestamp domain, compression settings, camera topic, state topic, and manifest before reviewing samples so you can verify that delivery matches the training pipeline.
Parquet format for robot training data
Delivery format
Parquet is useful for large robotics dataset hubs, frame tables, episode metadata, sharded time-series records, and LeRobot-compatible distribution. Define the episode index, frame offsets, task descriptions, feature schema, video references, statistics, and split metadata before reviewing samples so you can verify that the delivery matches the training pipeline.
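The episode index and frame offsets named above can be derived from per-episode frame counts. The following is a minimal sketch in plain Python, not a Parquet writer: the episode lengths are made up, and the field names (`episode_index`, `frame_start`, `frame_end`) are illustrative assumptions rather than names from any specific Parquet or LeRobot schema.

```python
from itertools import accumulate


def build_episode_index(episode_lengths):
    """Return one row per episode with a half-open frame range [start, end)."""
    # Running totals give each episode's end offset; shift by one for the starts.
    ends = list(accumulate(episode_lengths))
    starts = [0] + ends[:-1]
    return [
        {"episode_index": i, "frame_start": s, "frame_end": e}
        for i, (s, e) in enumerate(zip(starts, ends))
    ]


def episode_for_frame(index, frame):
    """Look up which episode a global frame number belongs to."""
    for row in index:
        if row["frame_start"] <= frame < row["frame_end"]:
            return row["episode_index"]
    raise IndexError(f"frame {frame} is outside the dataset")


index = build_episode_index([120, 95, 210])  # hypothetical frame counts
```

Half-open ranges make it easy to confirm that offsets tile the frame table with no gaps or overlaps: each episode's `frame_end` should equal the next episode's `frame_start`.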
Pickle format for robot training data
Delivery format
Pickle is useful for Python-first benchmark releases that package demonstrations, robot state dictionaries, observations, and task metadata. Define schema documentation, Python version notes, object keys, observation/action fields, a conversion script, and a checksum manifest before reviewing samples so you can verify that the delivery matches the training pipeline.
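A checksum manifest and a documented key list can be enforced at load time. The sketch below uses only the Python standard library; the required keys (`observations`, `actions`, `task`) and the file name are hypothetical stand-ins for whatever the release actually documents.

```python
import gzip
import hashlib
import pickle
import tempfile
from pathlib import Path

REQUIRED_KEYS = {"observations", "actions", "task"}  # hypothetical schema


def write_release(demo, path):
    """Pickle a demonstration dict behind a gzip wrapper and return its SHA-256."""
    with gzip.open(path, "wb") as f:
        pickle.dump(demo, f, protocol=4)  # protocol 4 loads on Python 3.4+
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def load_release(path, expected_sha256):
    """Verify the checksum first, then unpickle and check the documented keys."""
    actual = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    if actual != expected_sha256:
        raise ValueError("checksum mismatch; refusing to unpickle")
    with gzip.open(path, "rb") as f:
        demo = pickle.load(f)  # only unpickle files from a trusted source
    missing = REQUIRED_KEYS - set(demo)
    if missing:
        raise KeyError(f"missing keys: {sorted(missing)}")
    return demo


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.pkl.gz"  # hypothetical file name
    digest = write_release(
        {"observations": [[0.0, 0.1]], "actions": [[1.0]], "task": "pick"}, path
    )
    loaded = load_release(path, digest)
```

Checking the digest before calling `pickle.load` matters because unpickling can execute arbitrary code; the checksum only proves the file matches the manifest, so the manifest itself still has to come from a trusted source.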
Point cloud format for robot training data
Delivery format
Point cloud is useful for 3D scene geometry, object reconstruction, LiDAR/depth capture, navigation perception, and manipulation planning. Define coordinate frame, units, sensor intrinsics and extrinsics, timestamps, segmentation or object labels where available, and source RGB/depth references before reviewing samples so you can verify that delivery matches the training pipeline.
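One way to keep the coordinate frame and units attached to the geometry is to record them in the file header. The sketch below writes and reads an ASCII PLY header using only the standard library. The `comment frame_id ...` and `comment units ...` lines are a hypothetical convention: PLY allows free-form comments but does not define these keys.

```python
def write_ply(points, frame_id, units):
    """Serialize XYZ points as ASCII PLY, recording frame and units as comments."""
    lines = [
        "ply",
        "format ascii 1.0",
        f"comment frame_id {frame_id}",  # hypothetical convention, not part of PLY
        f"comment units {units}",  # hypothetical convention, not part of PLY
        f"element vertex {len(points)}",
        "property float x",
        "property float y",
        "property float z",
        "end_header",
    ]
    lines += [f"{x} {y} {z}" for x, y, z in points]
    return "\n".join(lines) + "\n"


def read_ply_metadata(text):
    """Return the comment key/value pairs and the vertex count from a PLY header."""
    meta = {}
    for line in text.splitlines():
        if line == "end_header":
            break
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "comment" and len(parts) >= 3:
            meta[parts[1]] = " ".join(parts[2:])
        elif parts[:2] == ["element", "vertex"]:
            meta["vertex_count"] = int(parts[2])
    return meta


ply_text = write_ply([(0.0, 0.0, 1.5), (0.2, -0.1, 1.4)], "base_link", "meters")
meta = read_ply_metadata(ply_text)
```

A reviewer can then reject any sample whose header lacks a frame or units entry before looking at the points themselves.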
RLDS format for robot training data
Delivery format
RLDS is useful for reinforcement learning and robotics episodes with observations, actions, rewards, and metadata. Define episode ID, observation stream, action stream, timestamps, task label, and success flag before reviewing samples so you can verify that delivery matches the training pipeline.
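The episode and step shape can be checked before any training code runs. The sketch below validates plain Python dictionaries rather than calling the TFDS or RLDS APIs. The step fields follow the commonly documented RLDS step layout (`discount` is omitted here for brevity), while the episode-level keys `episode_id` and `success` are hypothetical names for the ID and success flag mentioned above.

```python
STEP_FIELDS = ("observation", "action", "reward", "is_first", "is_last", "is_terminal")


def validate_episode(episode):
    """Return a list of problems found in one episode; an empty list means it passed."""
    steps = episode.get("steps", [])
    if not steps:
        return ["episode has no steps"]
    problems = []
    for i, step in enumerate(steps):
        for field in STEP_FIELDS:
            if field not in step:
                problems.append(f"step {i}: missing {field}")
        # Only the first step may carry is_first, and only the last may carry is_last.
        if step.get("is_first", False) != (i == 0):
            problems.append(f"step {i}: wrong is_first flag")
        if step.get("is_last", False) != (i == len(steps) - 1):
            problems.append(f"step {i}: wrong is_last flag")
    return problems


def make_step(first, last):
    """Build a tiny placeholder step for the example below."""
    return {
        "observation": [0.0], "action": [0.0], "reward": 0.0,
        "is_first": first, "is_last": last, "is_terminal": last,
    }


good = {"episode_id": "ep-0", "success": True,
        "steps": [make_step(True, False), make_step(False, True)]}
problems = validate_episode(good)
```

Running the same check over every delivered episode gives a quick pass/fail list before any tensors are loaded.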
ROS bag format for robot training data
Delivery format
ROS bag is useful for robot-native data collection where buyers need ROS topics preserved for replay or conversion. Define topic list, message types, timestamps, sensor calibration, and conversion notes before reviewing samples so you can verify that delivery matches the training pipeline.
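For ROS 2 bags stored as SQLite, the topic list and timestamp range can be inspected with the standard library alone. This is a sketch under an assumption: it expects `topics` and `messages` tables with the columns used below, which matches the rosbag2 SQLite storage layout as commonly described but should be checked against the rosbag2 version in use. The in-memory database is a stand-in so the example runs without a real bag.

```python
import sqlite3


def summarize_bag(conn):
    """Return {topic name: (message type, message count, first ts, last ts)}."""
    rows = conn.execute(
        """
        SELECT t.name, t.type, COUNT(m.id), MIN(m.timestamp), MAX(m.timestamp)
        FROM topics AS t
        LEFT JOIN messages AS m ON m.topic_id = t.id
        GROUP BY t.id
        ORDER BY t.name
        """
    ).fetchall()
    return {
        name: (msg_type, count, first, last)
        for name, msg_type, count, first, last in rows
    }


# Stand-in database with only the columns this sketch reads.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE topics(id INTEGER PRIMARY KEY, name TEXT, type TEXT);
    CREATE TABLE messages(id INTEGER PRIMARY KEY, topic_id INTEGER,
                          timestamp INTEGER, data BLOB);
    INSERT INTO topics VALUES (1, '/joint_states', 'sensor_msgs/msg/JointState');
    INSERT INTO topics VALUES (2, '/camera/image_raw', 'sensor_msgs/msg/Image');
    INSERT INTO messages VALUES (1, 1, 1000, x'00');
    INSERT INTO messages VALUES (2, 1, 2000, x'00');
    """
)
summary = summarize_bag(conn)
```

A topic that is declared but has zero messages, like the camera topic in this stand-in data, is worth flagging before conversion.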
Picking a format — decision rule
- Training a learning policy on the buyer’s pipeline → LeRobot or RLDS.
- Multi-robot fleet logs with synchronized topics → MCAP (replaces ROS bag).
- Compact structured arrays for trajectories, pose, sensor streams → HDF5.
- Tabular state/action + frame tables, columnar reads → Parquet.
- 3D scene geometry, LiDAR, depth → Point cloud.
- Inheriting a Python research release → convert pickle to one of the above before production ingest.
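The rules above can be written down as a small lookup so that intake tooling applies them consistently. This is an illustrative sketch: the use-case keys and the `pick_format` helper are hypothetical names, and the mapping only restates the list above.

```python
# Keys are hypothetical use-case labels; values restate the decision rules above.
RECOMMENDED = {
    "policy_training": ["LeRobot", "RLDS"],
    "fleet_logs": ["MCAP"],
    "structured_arrays": ["HDF5"],
    "tabular_streams": ["Parquet"],
    "scene_geometry": ["Point cloud"],
}


def pick_format(use_case, source_format=None):
    """Return (candidate formats, whether the source needs converting first)."""
    if use_case not in RECOMMENDED:
        raise ValueError(f"unknown use case: {use_case!r}")
    candidates = RECOMMENDED[use_case]
    # A source such as Pickle or ROS bag is never a target, so it always converts.
    needs_conversion = source_format is not None and source_format not in candidates
    return candidates, needs_conversion


candidates, needs_conversion = pick_format("policy_training", source_format="Pickle")
```

Here a pickle release destined for policy training comes back with LeRobot or RLDS as candidates and a flag that conversion is needed before production ingest.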