Physical AI Data Engineering
How to Implement Data Versioning for Robotics
Data versioning for robotics requires tracking both raw sensor streams (camera frames, joint states, force-torque readings) and derived artifacts (annotations, model checkpoints, evaluation metrics) across collection cycles. Use Git for metadata and code, DVC or Git LFS for large binary files, and structured formats like HDF5, MCAP, or RLDS for episode storage. Embed provenance metadata (collector ID, robot serial, calibration version) in every episode file. Maintain a dataset registry mapping version tags to training runs, enabling reproducible experiments and rollback when model performance degrades. The Open X-Embodiment dataset aggregates 1M+ trajectories from 22 robot embodiments using this approach.
Quick facts
- Difficulty: Intermediate
- Audience: Physical AI data engineers
- Last reviewed: 2025-01-15
Why Data Versioning Matters for Robot Learning
Robot learning datasets differ from static computer vision benchmarks in three critical ways: they grow incrementally as you collect new episodes, they contain tightly coupled multimodal streams (RGB-D video, proprioception, actions, language annotations), and they require expensive physical infrastructure to reproduce. A single annotation error or calibration drift can invalidate weeks of training. Open X-Embodiment aggregates 1M+ trajectories from 22 robot embodiments[1], demonstrating that cross-embodiment generalization depends on rigorous versioning to isolate embodiment-specific biases from task semantics.
Without versioning, teams face three failure modes. First, training irreproducibility: a model trained on "the dataset" in January cannot be rebuilt in March because intermediate processing scripts changed. Second, silent data corruption: a misconfigured camera driver writes malformed timestamps for 200 episodes before anyone notices, contaminating downstream training. Third, collaboration friction: distributed teams overwrite each other's annotations or retrain on stale data because no single source of truth exists. DROID collected 76K trajectories across 564 scenes and 86 tasks[2], requiring Git-based metadata versioning and cloud storage with immutable object versions to coordinate 12 collection sites.
Versioning also enables dataset ablation studies. When RT-1 achieved 97% success on 700+ tasks, the team attributed gains to specific data augmentation strategies by retraining on versioned subsets[3]. Similarly, BridgeData V2 expanded from 60K to 100K demonstrations, and versioning let researchers quantify the marginal value of each collection wave[4]. Physical AI procurement increasingly demands this audit trail: buyers want to know which episodes came from which robot, which annotator labeled them, and which calibration parameters were active.
Choose Storage Formats That Embed Provenance
Robot episodes are not flat image-label pairs. A single episode contains synchronized RGB-D video (30-60 FPS), proprioceptive state (joint angles, velocities, torques at 100-500 Hz), end-effector poses (position, orientation, gripper width), and action commands. Storing these as loose files (one PNG per frame, one JSON per timestep) creates filesystem bottlenecks and makes atomic versioning impossible. Use hierarchical formats that bundle all modalities into a single file with embedded metadata.
HDF5 is the most common choice for offline datasets. Each episode becomes an HDF5 file with groups for observations, actions, and metadata. RLDS (Reinforcement Learning Datasets) standardizes this structure: every episode has a `steps` array where each step contains `observation`, `action`, `reward`, `is_terminal`, and `is_first` fields[5]. LeRobot extends RLDS with video compression (H.264 in HDF5) and Parquet sidecar files for fast metadata queries[6]. For online logging during teleoperation, MCAP offers millisecond-precision timestamping and schema evolution, making it the preferred format for ROS 2 bag files[7].
Embed provenance metadata in every file header: `collector_id`, `robot_serial`, `calibration_version`, `camera_intrinsics_hash`, `annotation_tool_version`. This turns each episode into a self-describing artifact. When EPIC-KITCHENS-100 released 100 hours of egocentric video across 45 kitchens, every video file included participant consent timestamps and camera calibration matrices[8]. For procurement, this metadata answers "Can I retrain on a subset collected after calibration fix X?" without re-downloading terabytes.
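The pattern above can be sketched with h5py. This is a minimal sketch, assuming illustrative field names, shapes, and provenance keys rather than a fixed schema; production pipelines should follow the exact RLDS or LeRobot layout.

```python
import h5py
import numpy as np

def write_episode(path, rgb, joint_states, actions, provenance):
    """Bundle one episode's modalities and provenance into a single HDF5 file."""
    with h5py.File(path, "w") as f:
        obs = f.create_group("observations")
        obs.create_dataset("rgb", data=rgb, compression="gzip")  # (T, H, W, 3) uint8
        obs.create_dataset("joint_states", data=joint_states)    # (T, n_joints) float32
        f.create_dataset("actions", data=actions)                 # (T, action_dim) float32
        # Provenance lives in file-level attributes, making the episode self-describing.
        for key, value in provenance.items():
            f.attrs[key] = value

write_episode(
    "ep_00001.hdf5",
    rgb=np.zeros((10, 480, 640, 3), dtype=np.uint8),
    joint_states=np.zeros((10, 7), dtype=np.float32),
    actions=np.zeros((10, 7), dtype=np.float32),
    provenance={
        "collector_id": "op_07",                      # hypothetical values
        "robot_serial": "franka-0231",
        "calibration_version": "cal-2024-03-02",
        "camera_intrinsics_hash": "placeholder-hash",
        "annotation_tool_version": "labeler-1.4",
    },
)
```

Any downstream tool can then read `f.attrs` to answer provenance questions without consulting an external database.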
Version Control for Code, Metadata, and Large Files
Use Git for everything except large binary blobs. Store dataset schemas (Protobuf definitions, JSON schemas), processing scripts (calibration, synchronization, augmentation), annotation guidelines, and the dataset registry (a YAML or JSON file mapping version tags to episode lists) in a Git repository. This makes every transformation auditable. When a preprocessing bug surfaces, `git blame` reveals which commit introduced it, and `git revert` rolls back the change across the entire team.
For large files (HDF5 episodes, video, point clouds), Git alone fails: a 10 GB episode file becomes a 10 GB blob in the repository history, bloating every clone. Use DVC (Data Version Control) or Git LFS. DVC stores a lightweight pointer file (`.dvc`) in Git and pushes the actual data to S3, GCS, or Azure Blob. Running `dvc pull` fetches the data corresponding to the current Git commit, ensuring code and data stay synchronized. LeRobot uses DVC to version 50+ datasets totaling 2 TB, with each dataset tagged by collection date and robot platform[9].
Maintain a dataset registry as a versioned YAML file. Each entry maps a semantic version (e.g., `bridge-v2.1`) to a list of episode IDs, a data URL, and a Git commit hash. Example structure: `{version: bridge-v2.1, episodes: [ep_001, ep_002,...], storage_url: s3://bucket/bridge-v2.1/, code_commit: a3f9c8d, collection_date: 2024-03-15}`. Training scripts read this registry to fetch the exact dataset version, eliminating "works on my machine" issues. Open X-Embodiment publishes a registry mapping 22 datasets to Hugging Face dataset IDs, enabling reproducible multi-dataset training[1].
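A minimal sketch of the registry lookup a training script might perform, assuming a YAML file with a top-level `datasets` list whose entries use the fields shown above (`version`, `storage_url`, `code_commit`); the layout is illustrative, not a standard.

```python
import yaml

def resolve_dataset(registry_path: str, version: str) -> dict:
    """Return the registry entry for a dataset version, failing loudly if it is missing."""
    with open(registry_path) as f:
        registry = yaml.safe_load(f)
    for entry in registry["datasets"]:
        if entry["version"] == version:
            return entry
    raise KeyError(f"Dataset version {version!r} not found in {registry_path}")

# A training script resolves the version tag instead of hard-coding storage paths.
entry = resolve_dataset("registry.yaml", "bridge-v2.1")
print(entry["storage_url"], entry["code_commit"])
```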
Track Dataset Lineage Across Collection and Annotation Cycles
Robot datasets evolve through multiple stages: raw collection, quality filtering, annotation, augmentation, and train/val splits. Each stage produces a derived dataset, and losing the transformation chain breaks reproducibility. Use lineage tracking to record parent-child relationships between dataset versions. OpenLineage provides a standard schema for capturing dataset transformations as directed acyclic graphs (DAGs)[10].
Implement lineage tracking with a simple JSON log. When you filter raw episodes to remove failures, write: `{derived_version: bridge-v2.1-filtered, parent_version: bridge-v2.1-raw, transform: remove_episodes_with_collision, removed_count: 342, timestamp: 2024-03-20T10:15:00Z}`. When you add language annotations, write: `{derived_version: bridge-v2.1-annotated, parent_version: bridge-v2.1-filtered, transform: add_language_labels, annotator_ids: [ann_01, ann_02], tool_version: labelbox-v3.2}`. Store these logs in Git alongside the dataset registry. This creates an audit trail: given a trained model, you can trace back through every transformation to the raw sensor data.
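A minimal sketch of that append-only lineage log in JSON Lines form; the field names mirror the records above, and the helper name (`log_transform`) is hypothetical.

```python
import json
from datetime import datetime, timezone

def log_transform(log_path, derived_version, parent_version, transform, **details):
    """Append one lineage record describing how a derived dataset was produced."""
    record = {
        "derived_version": derived_version,
        "parent_version": parent_version,
        "transform": transform,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **details,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_transform(
    "lineage.jsonl",
    derived_version="bridge-v2.1-filtered",
    parent_version="bridge-v2.1-raw",
    transform="remove_episodes_with_collision",
    removed_count=342,
)
```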
DROID collected 76K trajectories but published only 60K after filtering for task success and camera occlusions[2]. The lineage log documents which 16K episodes were removed and why, enabling researchers to re-include them for failure-mode analysis. Truelabel's data provenance glossary defines the metadata fields buyers expect: collection timestamp, robot embodiment, task category, success label, annotator ID, and license terms[11]. Embedding this in lineage logs turns datasets into procurement-ready assets.
Ensure Training Reproducibility with Locked Dependencies
Reproducibility requires locking not just the dataset version but also the training code, model architecture, hyperparameters, and software dependencies. Use containerization (Docker) and dependency pinning (pip freeze, conda env export) to freeze the entire training environment. A reproducible training run should be a single command: `docker run --gpus all training-image:v1.2.3 --dataset bridge-v2.1 --config configs/rt1.yaml`.
Store training configs in Git with semantic versioning. Each config file specifies the dataset version, model checkpoint (if fine-tuning), optimizer settings, and random seeds. LeRobot's training examples use Hydra configs that reference dataset versions by name, ensuring every experiment is reproducible from a Git commit hash[12]. When RT-2 fine-tuned a vision-language model on robot data, the team published Docker images with pinned PyTorch, TensorFlow, and JAX versions[13].
Log training provenance to a metadata file written at the start of each run: `{run_id: rt1-bridge-v2.1-run-042, dataset_version: bridge-v2.1, code_commit: a3f9c8d, config_file: configs/rt1.yaml, start_time: 2024-04-01T08:00:00Z, gpu_type: A100-80GB, framework_versions: {torch: 2.0.1, transformers: 4.28.0}}`. Store this alongside model checkpoints. When a model underperforms in production, this log reveals whether the issue stems from data drift, code changes, or hyperparameter tuning. Safetensors embeds metadata directly in model checkpoint files, enabling self-describing artifacts[14].
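A sketch of writing that provenance record at the top of a training script, assuming the run executes inside a Git checkout; the file layout and field names are illustrative.

```python
import json
import platform
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def write_run_provenance(path, run_id, dataset_version, config_file):
    """Record what went into this training run before the first gradient step."""
    commit = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    provenance = {
        "run_id": run_id,
        "dataset_version": dataset_version,
        "code_commit": commit,
        "config_file": config_file,
        "start_time": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
    }
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    with open(path, "w") as f:
        json.dump(provenance, f, indent=2)
    return provenance

write_run_provenance("runs/run-042/provenance.json",
                     "rt1-bridge-v2.1-run-042", "bridge-v2.1", "configs/rt1.yaml")
```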
Implement Automated Quality Checks at Ingestion
Catch data corruption early by running automated quality checks when episodes are ingested into the versioned dataset. Define checks as code (Python scripts, Great Expectations suites) and version them in Git. Checks should validate: (1) structural integrity (all required fields present, correct dtypes, no NaNs in critical fields), (2) physical plausibility (joint angles within URDF limits, gripper commands in [0,1], camera timestamps monotonically increasing), and (3) statistical consistency (action magnitudes within 3 standard deviations of historical mean, frame rates within 5% of nominal).
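A sketch of the structural and physical-plausibility checks, reusing the illustrative HDF5 layout from earlier; the dataset paths and joint limits are assumptions, not a standard suite.

```python
import h5py
import numpy as np

def validate_episode(path, joint_limits=(-2.9, 2.9)):
    """Return a list of failed check names for one episode file (empty list = pass)."""
    failures = []
    with h5py.File(path, "r") as f:
        for field in ("observations/rgb", "observations/joint_states", "actions"):
            if field not in f:
                failures.append(f"missing:{field}")
        if not failures:
            joints = f["observations/joint_states"][:]
            actions = f["actions"][:]
            if np.isnan(joints).any() or np.isnan(actions).any():
                failures.append("nan_values")
            low, high = joint_limits
            if (joints < low).any() or (joints > high).any():
                failures.append("joint_limits_exceeded")
            if len(joints) != len(actions):
                failures.append("obs_action_length_mismatch")
    return failures

failures = validate_episode("ep_00001.hdf5")
print("QC passed" if not failures else f"QC failed: {failures}")
```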
Run checks in a CI/CD pipeline triggered by new data uploads. If an episode fails validation, quarantine it in a `failed_qc/` directory and log the failure reason. BridgeData V2 filtered 100K demonstrations down to 60K by rejecting episodes with camera occlusions, robot collisions, or incomplete task execution[4]. Automating these checks prevents bad data from contaminating training sets. For teleoperation datasets, run checks in real-time during collection: if the camera framerate drops below 25 FPS, alert the operator and discard the episode.
Use RLDS's episode validation utilities to check that every step has matching observation and action dimensions[5]. For point cloud data, verify that PointNet-compatible formats have consistent point counts and valid XYZ coordinates[15]. Store validation results as metadata: `{episode_id: ep_12345, qc_passed: true, qc_checks: [structure_ok, physics_ok, stats_ok], qc_timestamp: 2024-04-15T14:30:00Z}`. This metadata becomes part of the dataset provenance, enabling buyers to filter for high-confidence episodes.
Manage Multi-Robot and Multi-Site Collection
Physical AI datasets increasingly aggregate data from multiple robot platforms and collection sites. Open X-Embodiment combines 22 datasets spanning 7-DoF arms, mobile manipulators, and quadrupeds[1]. DROID collected across 12 sites with different robot configurations[2]. Versioning must handle embodiment heterogeneity and distributed coordination.
Use a hierarchical versioning scheme: `{dataset_name}-{embodiment}-{site}-{collection_wave}`. Example: `bridge-franka-berkeley-2024q1`. Each site maintains a local dataset registry and pushes to a central registry on merge. The central registry enforces schema compatibility: all episodes must conform to the same RLDS schema, even if observation spaces differ (7-DoF vs. 6-DoF arms). LeRobot defines a common schema with optional fields, allowing datasets to omit modalities (e.g., no depth camera) without breaking compatibility[6].
For cross-embodiment training, store embodiment-specific metadata in every episode: `robot_type`, `action_space_dim`, `control_frequency`, `end_effector_type`. This enables filtering and normalization. RT-1 trained on 130K episodes from 13 robots by normalizing actions to a canonical 7-DoF space and embedding robot type as a conditioning variable[3]. Scale AI's Physical AI platform provides multi-embodiment data pipelines with automatic action space alignment[16]. Versioning these normalization transforms (stored as code in Git) ensures that adding a new robot embodiment does not break existing training pipelines.
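A sketch of filtering an episode-level metadata index by embodiment fields before training; the Parquet file and column names are hypothetical and would come from your own index.

```python
import json
import pandas as pd

# One row per episode; columns follow the embodiment metadata listed above.
meta = pd.read_parquet("episodes_metadata.parquet")

# Select only 7-DoF Franka episodes for a single-embodiment fine-tuning run.
subset = meta[
    (meta["robot_type"] == "franka")
    & (meta["action_space_dim"] == 7)
    & (meta["control_frequency"] >= 15)
]
print(f"{len(subset)} / {len(meta)} episodes selected")

with open("franka_subset_episode_ids.json", "w") as f:
    json.dump(sorted(subset["episode_id"].tolist()), f, indent=2)
```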
Enable Collaboration with Access Control and Audit Logs
Multi-team collaboration requires role-based access control and audit logging. Not every team member needs write access to production datasets. Use cloud storage IAM policies (AWS S3 bucket policies, GCS IAM) to enforce: (1) read-only access for training engineers, (2) write access for data collection operators, (3) admin access for dataset maintainers. Truelabel's marketplace implements fine-grained access control, allowing buyers to preview datasets before purchasing full access[17].
Log every dataset modification: who uploaded which episodes, who annotated which frames, who changed the dataset registry. Store logs in an append-only format (e.g., AWS CloudTrail, GCS audit logs). When a model trained on `bridge-v2.1` underperforms, audit logs reveal that 500 episodes were re-annotated between training runs, explaining the performance delta. EPIC-KITCHENS-100 publishes annotation provenance, listing which annotators labeled which video segments[8].
For annotation workflows, integrate versioning with annotation tools. Labelbox and Encord support dataset versioning natively: each annotation task references a specific dataset version, and completed annotations create a new derived version[18]. This prevents annotators from working on stale data. Segments.ai versions point cloud annotations, enabling rollback when annotation guidelines change mid-project[19].
Version Models and Evaluation Metrics Alongside Data
Data versioning is incomplete without model versioning and evaluation versioning. A trained model is an artifact derived from a specific dataset version, and its performance metrics are only meaningful relative to a specific test set version. Store models, datasets, and evaluation results in a unified registry. Example entry: `{model_id: rt1-v3.2, dataset_version: bridge-v2.1, test_set_version: bridge-v2.1-test, accuracy: 0.94, checkpoint_url: s3://models/rt1-v3.2.safetensors, training_commit: a3f9c8d}`.
Use Safetensors to store model checkpoints with embedded metadata: dataset version, training hyperparameters, evaluation metrics[14]. This makes checkpoints self-describing. OpenVLA publishes model checkpoints with dataset provenance, enabling researchers to reproduce training from scratch[20]. For evaluation, version test sets separately from training sets. DROID splits 76K trajectories into 60K train and 16K test, with the split frozen and versioned[2].
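A minimal sketch of embedding provenance in a checkpoint with the safetensors `save_file` API, which accepts a string-to-string metadata dict; the tensors and metadata values are placeholders.

```python
import torch
from safetensors.torch import save_file

# Toy state dict standing in for a real policy checkpoint.
state_dict = {"policy.linear.weight": torch.zeros(8, 16)}

# Safetensors metadata values must be strings, so serialize anything structured.
save_file(
    state_dict,
    "rt1-v3.2.safetensors",
    metadata={
        "dataset_version": "bridge-v2.1",
        "test_set_version": "bridge-v2.1-test",
        "training_commit": "a3f9c8d",
        "accuracy": "0.94",
    },
)
```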
Track evaluation lineage: which model version was evaluated on which test set version, using which evaluation script version. Store results in a structured format (CSV, Parquet) with columns: `model_version`, `test_set_version`, `metric_name`, `metric_value`, `eval_timestamp`. This enables longitudinal analysis: how does model performance change as the dataset grows? RT-2 reported that scaling from 130K to 1M episodes improved success rates by 15 percentage points[13], a claim verifiable only with versioned datasets and evaluation scripts.
Handle Dataset Splits and Stratification
Train/val/test splits must be deterministic and versioned. A common mistake: generating splits on-the-fly with a random seed, then changing the seed between experiments. This makes results incomparable. Instead, generate splits once, store the episode IDs in a JSON file, and version that file in Git. Example: `{train: [ep_001, ep_002,...], val: [ep_500, ep_501,...], test: [ep_800, ep_801,...], split_strategy: stratified_by_task, random_seed: 42}`.
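A sketch of generating task-stratified splits once with a fixed seed and freezing the result as JSON; the episode fields and split fractions are illustrative.

```python
import json
import random
from collections import defaultdict

def make_splits(episodes, seed=42, val_frac=0.1, test_frac=0.1):
    """Stratify by task, shuffle deterministically, and return frozen split lists."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for ep in episodes:
        by_task[ep["task"]].append(ep["episode_id"])

    splits = {"train": [], "val": [], "test": [],
              "split_strategy": "stratified_by_task", "random_seed": seed}
    for task, ids in sorted(by_task.items()):
        rng.shuffle(ids)
        n_val = max(1, int(len(ids) * val_frac))
        n_test = max(1, int(len(ids) * test_frac))
        splits["val"] += ids[:n_val]
        splits["test"] += ids[n_val:n_val + n_test]
        splits["train"] += ids[n_val + n_test:]
    return splits

episodes = [{"episode_id": f"ep_{i:03d}", "task": "pick_apple" if i % 2 else "push_block"}
            for i in range(100)]
with open("splits.json", "w") as f:
    json.dump(make_splits(episodes), f, indent=2)  # commit splits.json to Git
```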
For robot datasets, stratify by task, embodiment, and collection site to ensure test sets are representative. Open X-Embodiment stratifies by robot type and task category, preventing models from overfitting to a single embodiment[1]. BridgeData V2 stratifies by object category (rigid vs. deformable) and manipulation primitive (pick, place, push)[4]. Store stratification metadata in the split file: `{episode_id: ep_001, task: pick_apple, embodiment: franka, site: berkeley}`.
When adding new episodes, append to existing splits rather than regenerating. If `bridge-v2.1` has 60K episodes and you collect 10K more, create `bridge-v2.2` by appending the new episodes to the train split (or a separate `train_new` split for ablation studies). This preserves test set integrity: models trained on v2.1 and v2.2 are evaluated on the same test set, making performance comparisons valid. LeRobot uses this append-only strategy, with each dataset version adding episodes without modifying existing splits[6].
Implement Rollback and Disaster Recovery
Versioning enables rollback: if a new dataset version degrades model performance, revert to the previous version and retrain. This requires immutable storage: once a dataset version is published, its contents never change. Use cloud storage versioning (S3 object versioning, GCS object versioning) to make deletions and overwrites recoverable. If someone accidentally deletes `bridge-v2.1/`, restore it from the version history.
Maintain backup copies in geographically distributed locations. Store the primary dataset in one cloud region (e.g., us-west-2) and replicate to another region (e.g., eu-west-1). DROID stores 76K trajectories (≈5 TB) in Google Cloud Storage with cross-region replication[2]. For high-value datasets, use cold storage (AWS Glacier, GCS Nearline) for long-term archival of raw data, keeping only processed versions in hot storage.
Test disaster recovery by simulating data loss. Delete a dataset version from staging storage and verify that you can restore it from backups within your recovery time objective (RTO). Document the recovery procedure in a runbook stored in Git. EPIC-KITCHENS-100 publishes checksums (MD5, SHA256) for every video file, enabling integrity verification after download[8]. Include checksums in your dataset registry and verify them during `dvc pull` or `aws s3 sync`.
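A sketch of checksum verification against a manifest committed alongside the registry; the manifest filename and layout are assumptions.

```python
import hashlib
import json

def sha256sum(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so multi-gigabyte episodes never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest_path):
    """Compare local files against the checksums recorded in the dataset registry."""
    with open(manifest_path) as f:
        manifest = json.load(f)  # e.g. {"ep_00001.hdf5": "<sha256 hex>", ...}
    return {p: h for p, h in manifest.items() if sha256sum(p) != h}

bad = verify_manifest("checksums.json")
print("All files verified" if not bad else f"Corrupted or modified: {sorted(bad)}")
```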
Optimize Storage Costs with Compression and Deduplication
Robot datasets are storage-intensive: a single hour of RGB-D video at 30 FPS with 1920x1080 resolution generates ≈200 GB uncompressed. Use lossy compression for images and video (H.264, H.265) and lossless compression for proprioceptive data (gzip, zstd). LeRobot stores video as H.264 in HDF5, reducing storage by 20x with negligible quality loss[6]. For point clouds, use PointNet-compatible formats with octree compression[15].
Deduplicate episodes with identical content. If a robot repeats the same pick-and-place task 100 times with minimal variation, store the unique episodes and reference them with metadata tags. BridgeData V2 collected 100K demonstrations but found that 15% were near-duplicates, reducing effective diversity[4]. Use perceptual hashing (pHash, dHash) to detect duplicate frames and episode-level similarity metrics (action sequence edit distance) to detect duplicate trajectories.
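A sketch of frame-level near-duplicate detection using the `imagehash` and Pillow libraries; the sampling rate and Hamming-distance threshold are assumptions to tune per dataset.

```python
from PIL import Image
import imagehash

def episode_phash(frame_paths, every_n=30):
    """Perceptually hash a sparse sample of frames from one episode."""
    return [imagehash.phash(Image.open(p)) for p in frame_paths[::every_n]]

def is_near_duplicate(hashes_a, hashes_b, max_distance=6):
    """Two episodes are near-duplicates if sampled frames stay within a Hamming threshold."""
    if len(hashes_a) != len(hashes_b):
        return False
    return all(a - b <= max_distance for a, b in zip(hashes_a, hashes_b))
```

Hashes for retained episodes can be stored in the metadata index so new uploads are compared against existing content at ingestion time.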
Use tiered storage: keep recent datasets (last 3 months) in hot storage (S3 Standard, GCS Standard) for fast access, and move older versions to cold storage (S3 Glacier, GCS Nearline) with 1-12 hour retrieval latency. Open X-Embodiment stores 1M+ trajectories with tiered storage, reducing costs by 70% while maintaining access to all historical versions[1]. Automate tiering with lifecycle policies: after 90 days, transition objects to cold storage; after 1 year, archive to deep cold storage.
Integrate with Continuous Training Pipelines
Modern robot learning uses continuous training: as new episodes arrive, retrain models incrementally and deploy updated policies. Versioning must integrate with CI/CD pipelines. Use event-driven triggers: when a new dataset version is published (detected via S3 event notifications or Git webhooks), automatically launch a training job. LeRobot's training scripts read dataset versions from a config file, enabling automated retraining when the config is updated[12].
Implement dataset drift detection: compare the distribution of actions, observations, and task success rates between consecutive dataset versions. If the mean action magnitude shifts by >10% or the success rate drops by >5%, flag the new version for manual review before training. RT-1 used statistical process control to detect when new data collection introduced distribution shifts[3]. Store drift metrics in the dataset registry: `{version: bridge-v2.2, parent_version: bridge-v2.1, action_mean_delta: 0.03, success_rate_delta: -0.02, drift_flag: false}`.
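A sketch of that comparison, assuming summary statistics (`action_mean`, `success_rate`) are computed offline for each version; the thresholds mirror the ones suggested above.

```python
def detect_drift(prev_stats, new_stats,
                 action_delta_threshold=0.10, success_delta_threshold=0.05):
    """Flag a new dataset version whose summary statistics shift past the thresholds."""
    action_delta = (abs(new_stats["action_mean"] - prev_stats["action_mean"])
                    / abs(prev_stats["action_mean"]))
    success_delta = new_stats["success_rate"] - prev_stats["success_rate"]
    return {
        "action_mean_delta": round(action_delta, 4),
        "success_rate_delta": round(success_delta, 4),
        "drift_flag": action_delta > action_delta_threshold
                      or success_delta < -success_delta_threshold,
    }

print(detect_drift({"action_mean": 0.31, "success_rate": 0.88},
                   {"action_mean": 0.32, "success_rate": 0.86}))
```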
For online learning, version datasets at episode granularity. Each episode gets a unique ID and timestamp, and training scripts query for episodes collected after a cutoff timestamp. Scale AI's Physical AI platform supports streaming data ingestion, where new episodes are versioned and indexed in real-time[16]. This enables training on the freshest data without waiting for batch uploads. Store episode-level metadata in a fast queryable format (Parquet, DuckDB) to enable efficient filtering by timestamp, task, or embodiment.
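A sketch of such a query with DuckDB reading the Parquet index directly; the file, columns, and cutoff timestamp are hypothetical.

```python
import duckdb

fresh = duckdb.query("""
    SELECT episode_id
    FROM read_parquet('episodes_metadata.parquet')
    WHERE collected_at > TIMESTAMP '2024-04-01 00:00:00'
      AND task = 'pick_apple'
      AND qc_passed
    ORDER BY collected_at
""").to_df()
print(f"{len(fresh)} fresh episodes available for the next training cycle")
```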
Document Versioning Policies and Train Your Team
Versioning is a sociotechnical system: tools alone do not ensure compliance. Document your versioning policies in a data governance handbook stored in Git. Policies should cover: (1) naming conventions (semantic versioning for datasets, ISO 8601 timestamps for episodes), (2) access control (who can publish new versions, who can modify the registry), (3) quality gates (all episodes must pass automated checks before versioning), (4) retention policies (keep all versions for 2 years, archive older versions to cold storage), and (5) incident response (how to handle accidental deletions or data corruption).
Train your team on versioning workflows. Run onboarding sessions for new data collectors, annotators, and ML engineers. Provide runbooks for common tasks: "How to upload a new dataset version," "How to roll back to a previous version," "How to query the dataset registry." Truelabel's marketplace provides buyer-facing documentation for every dataset, including versioning history and provenance metadata[17].
Conduct quarterly audits of the dataset registry. Verify that all versioned datasets have corresponding Git commits, that all episodes pass quality checks, and that backup copies exist in cold storage. Use automated scripts to detect orphaned files (episodes not referenced in any version) and stale versions (not accessed in 6+ months). EPIC-KITCHENS-100 publishes annual dataset reports documenting collection statistics, annotation quality, and usage metrics[8]. Adopt this practice internally to maintain institutional knowledge as team members rotate.
External references and source context
- [1] Open X-Embodiment: Robotic Learning Datasets and RT-X Models — aggregates 1M+ trajectories from 22 robot embodiments, demonstrating cross-embodiment generalization and versioning requirements. (arXiv)
- [2] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset — 76K trajectories across 564 scenes and 86 tasks, requiring Git-based metadata versioning and cloud storage coordination. (arXiv)
- [3] RT-1: Robotics Transformer for Real-World Control at Scale — 97% success on 700+ tasks, with gains attributed to specific data augmentation strategies via versioned dataset ablation studies. (arXiv)
- [4] BridgeData V2: A Dataset for Robot Learning at Scale — expanded from 60K to 100K demonstrations, with versioning enabling quantification of the marginal value of each collection wave. (arXiv)
- [5] RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning — standardizes episode structure with a steps array containing observation, action, reward, is_terminal, and is_first fields. (arXiv)
- [6] LeRobot dataset documentation — extends RLDS with video compression and Parquet sidecar files for fast metadata queries. (Hugging Face)
- [7] MCAP specification — millisecond-precision timestamping and schema evolution for online teleoperation logging. (MCAP)
- [8] Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100 — 100 hours of egocentric video with embedded participant consent timestamps and camera calibration matrices. (arXiv)
- [9] LeRobot GitHub repository — uses DVC to version 50+ datasets totaling 2 TB, with each dataset tagged by collection date and robot platform. (GitHub)
- [10] OpenLineage Object Model — standard schema for capturing dataset transformations as directed acyclic graphs. (OpenLineage)
- [11] truelabel data provenance glossary — defines the metadata fields buyers expect: collection timestamp, robot embodiment, task category, success label, annotator ID, license terms. (truelabel.ai)
- [12] Diffusion Policy training example — LeRobot training examples use Hydra configs that reference dataset versions by name for reproducible experiments. (GitHub)
- [13] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control — fine-tuned a vision-language model on robot data, with published Docker images containing pinned framework versions. (arXiv)
- [14] Safetensors documentation — embeds metadata directly in model checkpoint files, enabling self-describing artifacts with dataset provenance. (Hugging Face)
- [15] PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation — PointNet-compatible formats require consistent point counts and valid XYZ coordinates. (arXiv)
- [16] Scale AI Physical AI — multi-embodiment data pipelines with automatic action space alignment and streaming data ingestion. (scale.com)
- [17] truelabel physical AI data marketplace bounty intake — fine-grained access control and buyer-facing documentation with versioning history. (truelabel.ai)
- [18] Labelbox documentation overview — Labelbox supports dataset versioning natively, with annotation tasks referencing specific dataset versions. (docs.labelbox.com)
- [19] Segments.ai — versions point cloud annotations, enabling rollback when annotation guidelines change. (segments.ai)
- [20] OpenVLA: An Open-Source Vision-Language-Action Model — publishes model checkpoints with dataset provenance, enabling reproducible training from scratch. (arXiv)
FAQ
What is the difference between data versioning and model versioning in robotics?
Data versioning tracks changes to training datasets (episodes, annotations, splits), while model versioning tracks trained model checkpoints. Both are necessary for reproducibility. A model checkpoint is an artifact derived from a specific dataset version, and its performance metrics are only meaningful relative to a specific test set version. Store models, datasets, and evaluation results in a unified registry that links model IDs to dataset versions, training commits, and evaluation metrics. Use Safetensors to embed dataset provenance directly in model checkpoint files.
How do I version datasets that are too large for Git?
Use Git for metadata (schemas, processing scripts, dataset registry) and DVC or Git LFS for large binary files (HDF5 episodes, video, point clouds). DVC stores a lightweight pointer file in Git and pushes the actual data to S3, GCS, or Azure Blob. Running `dvc pull` fetches the data corresponding to the current Git commit, ensuring code and data stay synchronized. LeRobot uses DVC to version 50+ datasets totaling 2 TB, with each dataset tagged by collection date and robot platform. Alternatively, use cloud storage versioning (S3 object versioning) and store version manifests (lists of object keys and checksums) in Git.
What metadata should I embed in every robot episode file?
Embed provenance metadata in every episode file header: collector ID, robot serial number, calibration version, camera intrinsics hash, annotation tool version, collection timestamp, task category, and success label. This turns each episode into a self-describing artifact. For multi-robot datasets, include robot type, action space dimensionality, control frequency, and end-effector type. Store metadata as JSON or YAML in the HDF5 file attributes or as a sidecar file. EPIC-KITCHENS-100 includes participant consent timestamps and camera calibration matrices in every video file. Truelabel's data provenance glossary defines the metadata fields buyers expect for procurement-ready datasets.
How do I handle dataset versioning when collecting data from multiple robots and sites?
Use a hierarchical versioning scheme: `{dataset_name}-{embodiment}-{site}-{collection_wave}`. Example: `bridge-franka-berkeley-2024q1`. Each site maintains a local dataset registry and pushes to a central registry on merge. The central registry enforces schema compatibility: all episodes must conform to the same RLDS schema, even if observation spaces differ. Store embodiment-specific metadata in every episode (robot type, action space dimension, control frequency) to enable filtering and normalization. Open X-Embodiment combines 22 datasets spanning 7-DoF arms, mobile manipulators, and quadrupeds using this approach. DROID collected across 12 sites with different robot configurations, requiring Git-based metadata versioning and cloud storage with immutable object versions.
What is the best format for storing robot episodes: HDF5, MCAP, or Parquet?
HDF5 is best for offline datasets with synchronized multimodal streams (RGB-D video, proprioception, actions). RLDS standardizes HDF5 structure for reinforcement learning. MCAP is best for online logging during teleoperation, offering millisecond-precision timestamping and schema evolution; it is the preferred format for ROS 2 bag files. Parquet is best for tabular metadata (episode IDs, task labels, success rates) that needs fast querying. LeRobot uses HDF5 for episode data with Parquet sidecar files for metadata queries. For point clouds, use HDF5 with PointNet-compatible formats or PCD files. Choose based on your access patterns: HDF5 for sequential reads, MCAP for streaming writes, Parquet for random access queries.
How do I ensure that train/val/test splits are reproducible across dataset versions?
Generate splits once, store the episode IDs in a JSON file, and version that file in Git. Example: `{train: [ep_001, ep_002,...], val: [ep_500, ep_501,...], test: [ep_800, ep_801,...], split_strategy: stratified_by_task, random_seed: 42}`. When adding new episodes, append to existing splits rather than regenerating. If `bridge-v2.1` has 60K episodes and you collect 10K more, create `bridge-v2.2` by appending the new episodes to the train split. This preserves test set integrity: models trained on v2.1 and v2.2 are evaluated on the same test set, making performance comparisons valid. Stratify by task, embodiment, and collection site to ensure test sets are representative. LeRobot uses this append-only strategy, with each dataset version adding episodes without modifying existing splits.
Looking for data versioning for robotics?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.
Browse Physical AI Datasets