
Dataset profile

cadene/droid_1.0.1 Franka robot dataset for VLA training

cadene/droid_1.0.1 is a 95,600-episode Franka robot dataset in the LeRobot v2.1 format [2], distributing 27.6M frames at 15 fps under Apache-2.0 for VLA training, manipulation, and teleoperation research [3]. The DROID team describes the corpus as "a large-scale in-the-wild robot manipulation dataset" [3] covering scenes captured by Franka arms across diverse tabletop tasks, stored as parquet episodes [1] with synchronized video. With over 429,000 downloads on Hugging Face, it gives procurement teams a widely used Apache-2.0 LeRobot dataset for behavior cloning, world-model training, and Franka arm transfer learning.

Updated 2026-05-26
By TrueLabel Sourcing
Reviewed by TrueLabel Sourcing

Quick facts

Scale: 95,600 episodes, 27.6M frames
License: Apache-2.0
Robot platform: Franka
Format: Parquet + video (15 fps)
Commercial use: Permitted under Apache-2.0
Framework: LeRobot v2.1

Dataset composition and structure

cadene/droid_1.0.1 stores its 95,600 robot episodes in chunks of up to 1,000 episodes each. The chunk count is reported as 95, although 95,600 episodes at 1,000 per chunk would fill 95 chunks and leave 600 episodes for a 96th, so confirm the directory count after download. Every episode is a parquet file [1] accompanied by synchronized video recorded at 15 frames per second. The dataset captures Franka robot arm behavior across diverse manipulation scenarios, yielding 27.6 million frames and 286,800 video files (three per episode) that document state trajectories, action sequences, and visual observations. Each episode follows the schema defined by LeRobot codebase version 2.1 [2], so training pipelines built on LeRobot can ingest observation tensors, action vectors, and metadata without custom preprocessing. The chunked layout lets teams load subsets of episodes for distributed training runs, and the columnar parquet format gives efficient access to time-series features such as joint positions and gripper states, with camera frames delivered through the synchronized video files. All 95,600 episodes are assigned to the training split, giving procurement teams a large single-morphology corpus of state-action pairs for supervised learning experiments.

  • Chunks of up to 1,000 episodes for scalable data loading
  • Parquet episodes with synchronized 15 fps video streams
  • 27.6M frames capturing Franka arm state and visual observations
  • LeRobot v2.1 schema for plug-and-play pipeline integration
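
The headline figures can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted in this profile; the 27.6M frame count is rounded, so the derived values are approximate, and the variable names are illustrative.

```python
import math

# Figures quoted in this profile; the frame count is rounded to 27.6M.
EPISODES = 95_600
FRAMES = 27_600_000
VIDEO_FILES = 286_800
FPS = 15
CHUNK_SIZE = 1_000  # episodes per chunk, as described above

frames_per_episode = FRAMES / EPISODES            # average episode length in frames
seconds_per_episode = frames_per_episode / FPS    # average episode duration
total_hours = FRAMES / FPS / 3600                 # recorded time per camera stream
videos_per_episode = VIDEO_FILES / EPISODES       # video files per episode
chunks_needed = math.ceil(EPISODES / CHUNK_SIZE)  # directories needed at 1,000 per chunk

print(f"{frames_per_episode:.0f} frames/episode, {seconds_per_episode:.1f} s/episode")
print(f"{total_hours:.0f} hours of footage per stream")
print(f"{videos_per_episode:.0f} videos/episode, {chunks_needed} chunks needed")
```

On these figures an average episode runs for roughly 19 seconds, and 1,000 episodes per chunk implies 96 chunk directories, so the chunk count is worth confirming against the downloaded copy.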

Apache-2.0 licensing and commercial deployment

The dataset ships under the Apache-2.0 license, which grants broad rights to use, modify, and distribute the data in commercial robotic products without royalty obligations or copyleft restrictions [3]. This permissive model reduces legal friction for startups building manipulation products, research labs publishing derivative datasets, and enterprises training proprietary VLA models for warehouses, healthcare facilities, or manufacturing lines. Unlike datasets with non-commercial clauses, Apache-2.0 allows models trained on cadene/droid_1.0.1 to be embedded in closed-source systems, fine-tuned with private data, and sold as part of integrated robotics solutions. Its main conditions apply to redistribution: include a copy of the license and preserve existing copyright and attribution notices. Legal teams benefit from the mature, widely understood license text, while engineering teams are free to preprocess, augment, or merge the data with proprietary teleoperation logs. The download count of 429,086 points to broad interest in the dataset, though downloads alone do not show that it has been vetted for production use, so teams should still run their own license and quality review.

Procurement considerations for robotics teams

Robotics teams evaluating cadene/droid_1.0.1 should first weigh its single-platform focus on Franka, which simplifies transfer learning for labs already operating Franka arms but may limit generalization to other morphologies such as UR5, xArm, or proprietary manipulators. The 15 fps frame rate balances temporal resolution against storage cost, capturing manipulation dynamics for tasks like pick-and-place, assembly, and tabletop rearrangement without inflating compute budgets during training. Procurement leads should verify that their data infrastructure can handle the chunked directory structure and that their training framework supports parquet ingestion; the LeRobot loaders target PyTorch, so TensorFlow or JAX workflows may need adapter code. The metadata does not expose explicit task labels, so teams pursuing task-conditioned policies will need to annotate episodes after download or rely on self-supervised objectives that do not require semantic task categories. Because the dataset ships only a training split, validation and test partitions must be carved out manually, for example by withholding whole episode chunks, and data engineering and ML research teams should agree on the partition up front to avoid leakage. Integration with the LeRobot ecosystem also helps long-term maintainability, since updates to the codebase can bring improved data loaders, augmentation utilities, and benchmarking scripts that reduce the engineering burden for adopting teams.
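
One way to carve out validation and test partitions is to assign whole chunks to each split. The following is a minimal sketch and not part of LeRobot: the function names, the split fractions, and the seed are hypothetical choices. It assumes chunks are numbered from 0 with 1,000 episodes per chunk, and uses 96 chunks because that is the count implied by 95,600 episodes at 1,000 per chunk; substitute the number of chunk directories actually present.

```python
import random


def split_chunks(num_chunks, val_frac=0.05, test_frac=0.05, seed=0):
    """Assign whole chunk indices to train/val/test so no chunk spans two splits."""
    chunk_ids = list(range(num_chunks))
    random.Random(seed).shuffle(chunk_ids)  # seeded, so the split is reproducible
    n_val = max(1, round(num_chunks * val_frac))
    n_test = max(1, round(num_chunks * test_frac))
    return {
        "val": sorted(chunk_ids[:n_val]),
        "test": sorted(chunk_ids[n_val:n_val + n_test]),
        "train": sorted(chunk_ids[n_val + n_test:]),
    }


def split_of_episode(episode_index, splits, chunk_size=1000):
    """Look up which split an episode falls in, based on its chunk index."""
    chunk = episode_index // chunk_size
    for name, chunk_ids in splits.items():
        if chunk in chunk_ids:
            return name
    raise ValueError(f"episode {episode_index} is outside the known chunks")


splits = split_chunks(num_chunks=96)
print({name: len(ids) for name, ids in splits.items()})
```

Splitting by chunk keeps every episode of a chunk in one partition, but it does not by itself guarantee that scenes or tasks are disjoint across splits; that depends on how episodes were ordered, which this profile does not state.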

Known limitations and alternative sourcing

cadene/droid_1.0.1 does not disclose the task distribution or scene diversity within its 95,600 episodes, making it difficult for procurement teams to assess whether the data covers edge cases relevant to their deployment environments. The Hugging Face metadata also leaves the modality breakdown unspecified, so teams targeting multi-modal fusion architectures must inspect the parquet schemas to confirm whether depth maps, tactile readings, or force-torque signals are present alongside the RGB video streams. Because the dataset is a LeRobot export, teams unfamiliar with that framework may encounter friction when adapting data loaders or interpreting framework-specific metadata fields, a consideration for organizations standardizing on alternative stacks such as ROS 2-native pipelines or custom teleoperation formats. The dataset does not include pre-trained model checkpoints or baseline performance metrics, so teams must budget compute for initial training runs to establish whether the data quality meets their accuracy and generalization requirements. Organizations that need sensor modalities beyond visual observations or demonstrations from heterogeneous robot fleets should evaluate complementary datasets such as Open X-Embodiment or BridgeData V2. Open X-Embodiment in particular offers broader morphological coverage, at the cost of more complex licensing and a larger storage footprint.
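
The modality check can start from the dataset's metadata before opening any parquet files. LeRobot v2.1 datasets carry a `meta/info.json` file with a `features` mapping, and the sketch below groups feature names by their declared `dtype`. The sample dictionary is a made-up stand-in, not the real contents of cadene/droid_1.0.1, and the helper name `group_features_by_dtype` is hypothetical.

```python
import json
from collections import defaultdict


def group_features_by_dtype(info):
    """Group feature names from an info.json-style dict by their declared dtype."""
    groups = defaultdict(list)
    for name, spec in info.get("features", {}).items():
        groups[spec.get("dtype", "unknown")].append(name)
    return {dtype: sorted(names) for dtype, names in groups.items()}


# Made-up sample for illustration only; after downloading, load the real
# file from <dataset_root>/meta/info.json with json.load instead.
sample_info = json.loads("""
{
  "fps": 15,
  "features": {
    "observation.images.example_camera": {"dtype": "video", "shape": [180, 320, 3]},
    "observation.state": {"dtype": "float32", "shape": [8]},
    "action": {"dtype": "float32", "shape": [8]}
  }
}
""")

groups = group_features_by_dtype(sample_info)
for dtype, names in groups.items():
    print(dtype, names)
```

Features declared with the `video` dtype correspond to camera streams. If the real file lists no depth, tactile, or force-torque entries, the dataset most likely does not carry those signals, though the parquet schema remains the final check.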


External references and source context

  1. Apache Parquet file format. Columnar file format used for the episode storage layout. Source: Apache Parquet.
  2. LeRobot GitHub repository. LeRobot v2.1 framework: codebase, schema, and PyTorch data loaders the dataset is built on. Source: GitHub.
  3. DROID project site. Original DROID project site documenting the dataset's scale, scope, and Franka manipulation focus. Source: droid-dataset.github.io.

FAQ

What is cadene/droid_1.0.1 and what type of robotics data does it contain?

cadene/droid_1.0.1 is a large-scale robotics demonstration dataset containing 95,600 episodes of Franka robot arm behavior, totaling 27.6 million frames captured at 15 frames per second. It was built with LeRobot version 2.1 and stores each episode as a parquet file alongside synchronized video recordings, providing state trajectories, action sequences, and visual observations suitable for training manipulation policies, vision-language-action models, and world models. The data is organized into chunks of up to 1,000 episodes each, enabling scalable data loading for distributed training workflows, and the collection is hosted on Hugging Face with over 429,000 downloads to date. The dataset focuses exclusively on the Franka platform, making it particularly valuable for teams already deploying Franka arms in research labs or production environments, where transfer learning from the same morphology can accelerate policy development.

What are the exact license terms and can I use this dataset in commercial robotics products?

cadene/droid_1.0.1 is released under the Apache License 2.0, a permissive open-source license that explicitly permits commercial use, modification, and distribution without royalty payments or copyleft obligations. This means robotics companies can train proprietary models on the dataset, deploy those models in commercial products, and sell integrated systems without needing to open-source their derivatives or seek additional permissions from the dataset authors. The Apache-2.0 license requires only that you include a copy of the license text and preserve any copyright notices when redistributing the dataset itself, but these requirements do not extend to trained model weights or robotic systems that merely use the data during development. Legal teams appreciate this license for its clarity and industry-standard language, which simplifies IP audits and reduces procurement friction when building supply chains around open datasets.

Which robotics teams should prioritize cadene/droid_1.0.1 for their training pipelines?

Teams operating Franka robot arms and building manipulation policies for tabletop tasks should prioritize this dataset, especially those seeking large-scale demonstration data under a permissive license that allows commercial deployment. The dataset is particularly well-suited for research labs training vision-language-action models that require millions of state-action pairs from a consistent robot morphology, as well as startups developing behavior cloning systems where data volume and licensing clarity are critical procurement criteria. Organizations already integrated with the LeRobot ecosystem will benefit from plug-and-play compatibility with existing data loaders and training scripts, reducing engineering overhead during the data ingestion phase. The 15 fps frame rate makes the dataset appropriate for manipulation tasks that do not require high-frequency control, such as pick-and-place operations, assembly sequences, and object rearrangement scenarios where human-scale motion dynamics are sufficient for policy learning.

When is cadene/droid_1.0.1 NOT the right dataset choice for a robotics project?

cadene/droid_1.0.1 is not suitable for teams requiring multi-robot generalization, as the dataset contains only Franka arm demonstrations and will not provide the morphological diversity needed to train policies that transfer across UR5, xArm, or custom manipulator designs. Organizations building high-frequency control systems for tasks like dynamic manipulation, throwing, or contact-rich assembly may find the 15 fps capture rate insufficient to model the temporal dynamics of rapid motions or impact events. Teams that need explicit task labels or semantic annotations for task-conditioned policies will face additional annotation costs, since the dataset metadata does not include task categories or goal descriptions for the 95,600 episodes. If your deployment environment involves sensor modalities beyond RGB video—such as depth cameras, tactile arrays, or force-torque sensors—you should verify that the parquet schema includes those signals before committing to this dataset, as the Hugging Face metadata does not specify the full modality breakdown. Finally, organizations standardizing on non-PyTorch training frameworks may encounter integration friction with the LeRobot data format and should budget engineering time for building custom data loaders.

How do I access and load cadene/droid_1.0.1 into my training pipeline?

You can download cadene/droid_1.0.1 directly from Hugging Face using the datasets library or the Hugging Face CLI. The data is organized into chunk directories of up to 1,000 episodes each, in parquet format plus synchronized video files. The LeRobot codebase provides native data loaders that parse the parquet schema and yield batches of observations, actions, and metadata for PyTorch training loops, minimizing the need for custom preprocessing code if you already use that framework. Each episode is stored under a path of the form data/chunk-000/episode_000000.parquet, so you can load subsets of episodes for experimentation or distribute chunk loading across multiple GPU nodes for large-scale training runs. If you are using TensorFlow, JAX, or a custom training stack, you will need to write an adapter that reads the parquet files with a library such as PyArrow or Pandas, extracts the observation tensors and action vectors according to the LeRobot schema, and feeds them into your model's input pipeline.
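
For adapter code, the chunk directory and file path of an episode can be derived from its index. This is a sketch under assumptions: the helper `episode_parquet_path` is hypothetical, and the default template is an assumed form of the path pattern. LeRobot v2.1 datasets record the authoritative template under `data_path` in `meta/info.json`, so it is safer to read it from there than to hard-code it.

```python
from pathlib import PurePosixPath

# Assumed template; prefer the `data_path` value from the dataset's meta/info.json.
ASSUMED_TEMPLATE = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"


def episode_parquet_path(episode_index, chunk_size=1000, template=ASSUMED_TEMPLATE):
    """Return the repo-relative parquet path for an episode index."""
    if episode_index < 0:
        raise ValueError("episode_index must be non-negative")
    episode_chunk = episode_index // chunk_size  # 1,000 episodes per chunk
    return PurePosixPath(
        template.format(episode_chunk=episode_chunk, episode_index=episode_index)
    )


print(episode_parquet_path(0))       # first episode of the first chunk
print(episode_parquet_path(95_599))  # last episode of a 95,600-episode dataset
```

The resulting path can be joined to the local download directory and opened with PyArrow or Pandas in a custom loader.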

What quality assurance and validation steps should I perform after downloading this dataset?

After downloading cadene/droid_1.0.1, your data engineering team should first verify the integrity of every chunk directory and confirm that episode counts sum to the documented 95,600, checking for corrupted parquet files or missing video streams that could crash training. Inspect a sample of episodes to understand the observation space dimensions, action vector structure, and any metadata fields specific to the LeRobot schema, and confirm they match your model architecture's input requirements. Because the dataset does not include pre-defined validation or test splits, you must partition episodes into training, validation, and test sets yourself, withholding entire chunks to prevent data leakage and agreeing the evaluation protocol with your ML research team. Run exploratory data analysis on action distributions, episode lengths, and visual scene characteristics to identify dataset biases or coverage gaps relative to your deployment environment, and consider visualizing trajectory rollouts to detect labeling errors or low-quality demonstrations that may degrade policy performance. Finally, execute a small-scale training run on a subset of episodes to benchmark data loading throughput and confirm that your infrastructure can sustain the I/O demands of streaming 27.6 million frames during full-scale experiments.
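
The episode-count check can be scripted against a local copy. The sketch below counts parquet files per chunk directory using only the standard library. The function names, the `data/chunk-*` layout, and the tiny fake directory tree are assumptions for illustration, and the script does not validate file contents or video streams.

```python
import tempfile
from pathlib import Path


def count_episodes(dataset_root):
    """Count *.parquet files in each data/chunk-* directory under dataset_root."""
    counts = {}
    for chunk_dir in sorted(Path(dataset_root, "data").glob("chunk-*")):
        if chunk_dir.is_dir():
            counts[chunk_dir.name] = sum(1 for _ in chunk_dir.glob("*.parquet"))
    return counts


def check_total(counts, expected_total):
    """Return (ok, found) comparing the summed count with the documented total."""
    found = sum(counts.values())
    return found == expected_total, found


# Tiny fake layout so the sketch runs without the real download.
with tempfile.TemporaryDirectory() as root:
    for chunk, n_files in [("chunk-000", 3), ("chunk-001", 2)]:
        chunk_dir = Path(root, "data", chunk)
        chunk_dir.mkdir(parents=True)
        for i in range(n_files):
            (chunk_dir / f"episode_{i:06d}.parquet").touch()
    demo_counts = count_episodes(root)
    demo_ok, demo_found = check_total(demo_counts, expected_total=5)

print(demo_counts, demo_ok, demo_found)
```

On the real dataset, point `count_episodes` at the download directory and compare against `expected_total=95_600`; a chunk with an unexpected count is a natural place to start looking for missing or corrupted files.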
