IPEC-COMMUNITY/language_table_lerobot Dataset Profile
IPEC-COMMUNITY/language_table_lerobot is an Apache-2.0 licensed robotics dataset containing 442,226 episodes and 7,045,476 frames of xArm manipulation data recorded at 10 FPS. Created using the LeRobot framework, this dataset structures robot trajectories in Parquet format with synchronized video recordings, designed specifically for language-conditioned manipulation tasks. Robotics teams building vision-language-action models or training teleoperation policies can leverage this large-scale embodiment data for xArm platforms, with the permissive Apache-2.0 license enabling both research and commercial deployment without restrictive encumbrances.
Quick facts
- Scale: 442,226 episodes / 7M+ frames
- License: Apache-2.0
- Format: Parquet + Video (LeRobot v2.0)
- Robot Type: xArm manipulator
- Frame Rate: 10 FPS
- Commercial Use: Permitted under Apache-2.0
Dataset Composition and Structure
The dataset organizes 442,226 manipulation episodes captured from xArm robotic systems, totaling 7,045,476 frames sampled at 10 frames per second. Each episode is stored as a Parquet file within a chunked directory structure of 443 chunks holding approximately 1,000 episodes each, which enables efficient streaming and parallel loading during training. Video recordings accompany every episode, synchronized frame-by-frame with robot state and action data. The dataset derives from the Language Table benchmark environment and has been converted to the LeRobot v2.0 format, which standardizes trajectory representation for cross-embodiment learning. Metadata confirms that the training split encompasses all 442,226 episodes, and the dataset references 127,605 distinct task specifications, indicating substantial linguistic diversity in the instruction set. This scale and structure make it suitable for large-batch pretraining of transformer-based policies that map natural language commands to continuous control actions on the xArm.
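The chunk arithmetic described above can be sketched in a few lines. This is a minimal illustration, not the dataset's own loader: the path template is an assumption modeled on the layout that LeRobot v2.0 datasets typically record in meta/info.json, and the helper name episode_path is hypothetical. Confirm both against the dataset's actual metadata before relying on them.

```python
import math

TOTAL_EPISODES = 442_226   # from the profile
TOTAL_FRAMES = 7_045_476   # from the profile
CHUNK_SIZE = 1_000         # approximate chunk size stated in the profile
FPS = 10                   # from the profile

# Assumed path template; verify against the dataset's meta/info.json.
DATA_PATH = "data/chunk-{chunk:03d}/episode_{episode:06d}.parquet"


def episode_path(episode_index: int) -> str:
    """Map an episode index to its (assumed) Parquet path."""
    if not 0 <= episode_index < TOTAL_EPISODES:
        raise IndexError(f"episode {episode_index} out of range")
    chunk = episode_index // CHUNK_SIZE
    return DATA_PATH.format(chunk=chunk, episode=episode_index)


# Number of chunks needed to hold every episode.
num_chunks = math.ceil(TOTAL_EPISODES / CHUNK_SIZE)

# Average episode length in frames and in seconds.
mean_frames = TOTAL_FRAMES / TOTAL_EPISODES
mean_seconds = mean_frames / FPS

print(num_chunks)
print(round(mean_frames, 2), round(mean_seconds, 2))
print(episode_path(0))
print(episode_path(TOTAL_EPISODES - 1))
```

On the stated totals, the mean episode works out to roughly 16 frames, or about 1.6 seconds at 10 FPS, which is worth keeping in mind when choosing context lengths or action-chunk sizes.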
Licensing and Commercial Deployment Rights
The dataset is released under the Apache License 2.0, a permissive open-source license that grants users the right to use, modify, and distribute the data for any purpose, including commercial applications, without royalty obligations. Unlike copyleft licenses, Apache-2.0 does not require derivative works to be open-sourced, so teams incorporating this dataset into proprietary VLA training pipelines retain full ownership of their models and weights. The license does require preservation of copyright notices and disclaimers, and it includes an express patent grant from contributors, reducing IP risk for organizations deploying models trained on this data in production robotics systems. Procurement teams should note that while the dataset license is permissive, any third-party assets referenced in task specifications or environment textures may carry separate terms; however, the core trajectory and video data are unencumbered. This clarity accelerates legal review cycles for enterprise buyers who need auditable provenance before committing GPU budgets to training runs.
Integration with LeRobot and OpenX Ecosystems
Built using the LeRobot codebase version 2.0, this dataset adheres to a standardized schema shared across multiple embodiments in the Open X-Embodiment initiative, simplifying the process of co-training on heterogeneous robot data. The Parquet-based storage format supports columnar access patterns that reduce I/O overhead when loading high-dimensional observation and action tensors, and the chunked episode structure allows dataset workers to parallelize reads across distributed training nodes. Each episode file includes metadata fields for episode index, chunk assignment, and FPS, enabling precise replay and temporal alignment during imitation learning. The dataset tags indicate compatibility with RLDS (Reinforcement Learning Datasets) tooling, so teams already using TensorFlow Datasets or PyTorch data loaders can adapt existing pipelines with minimal refactoring. The 10 FPS capture rate balances temporal resolution with storage efficiency, providing sufficient granularity for smooth trajectory interpolation while keeping video file sizes manageable across more than 442,000 episodes. Teams sourcing data for RT-1, RT-2, or Octo-style policies may find the format and scale applicable to their training harnesses, subject to the modality checks noted under Limitations.
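To make the temporal-alignment and interpolation point concrete, here is a minimal sketch of resampling a 10 FPS trajectory at an arbitrary time. The function names and the two-dimensional sample states are hypothetical and purely illustrative; the dataset's actual state and action dimensions should be read from its metadata.

```python
from __future__ import annotations

from bisect import bisect_right

FPS = 10  # capture rate from the profile


def frame_timestamps(num_frames: int, fps: int = FPS) -> list[float]:
    """Timestamps in seconds for frames sampled at a fixed rate."""
    return [i / fps for i in range(num_frames)]


def interpolate(times: list[float], values: list[list[float]], t: float) -> list[float]:
    """Linearly interpolate a per-frame vector sequence at time t (seconds).

    Times outside the recorded range are clamped to the first or last frame.
    """
    if not times or len(times) != len(values):
        raise ValueError("times and values must be non-empty and equal length")
    if t <= times[0]:
        return list(values[0])
    if t >= times[-1]:
        return list(values[-1])
    hi = bisect_right(times, t)   # first frame strictly after t
    lo = hi - 1                   # last frame at or before t
    w = (t - times[lo]) / (times[hi] - times[lo])
    return [a + w * (b - a) for a, b in zip(values[lo], values[hi])]


# Hypothetical 2-D states for a three-frame episode (illustrative values only).
states = [[0.0, 0.0], [1.0, 2.0], [2.0, 2.0]]
times = frame_timestamps(len(states))

midpoint = interpolate(times, states, 0.05)  # halfway between frames 0 and 1
print(midpoint)
```

Linear interpolation is the simplest choice; contact-rich motions may need a smoother scheme or a higher native capture rate.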
Limitations and Procurement Considerations
While the dataset offers substantial scale, buyers should recognize that it represents a single embodiment, the xArm manipulator, which limits direct transfer to robots with different kinematic chains, gripper configurations, or sensorimotor latencies. The metadata does not specify camera viewpoints, resolution, or whether depth channels are included, so teams requiring stereo or RGB-D inputs should verify modality details before committing to integration work. The 127,605 task count suggests linguistic diversity, but without access to the instruction text or task distribution statistics, it is difficult to assess coverage of edge-case commands or compositional generalization scenarios. Additionally, the dataset description truncates before detailing a validation split or episode success rates, and the metadata assigns all 442,226 episodes to the training split, so buyers must define their own held-out set and cannot pre-filter for high-quality demonstrations without inspecting the data directly. The 193,268 download count suggests community interest, yet no benchmark results or baseline policy performances are documented, so teams must budget for their own evaluation runs. Organizations requiring multi-robot or sim-to-real transfer should plan to combine this dataset with complementary embodiments rather than relying on it as a sole data source for cross-platform policies.
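Because task distribution statistics are not published, a first inspection pass might look like the sketch below. It assumes per-episode metadata is available as JSON Lines with the fields episode_index, tasks, and length, which is how LeRobot v2.0 datasets commonly lay out meta/episodes.jsonl; the field names, the function name summarize, and the three sample records are all hypothetical stand-ins, not data from this dataset.

```python
import json
from collections import Counter

# Hypothetical sample records; the real file and its field names must be
# checked against the dataset itself.
SAMPLE = """\
{"episode_index": 0, "tasks": ["push the red star to the blue cube"], "length": 14}
{"episode_index": 1, "tasks": ["move the green circle left"], "length": 18}
{"episode_index": 2, "tasks": ["push the red star to the blue cube"], "length": 16}
"""


def summarize(lines):
    """Count episodes per instruction and collect episode lengths."""
    task_counts = Counter()
    lengths = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        task_counts.update(record["tasks"])
        lengths.append(record["length"])
    return task_counts, lengths


task_counts, lengths = summarize(SAMPLE.splitlines())

print("distinct tasks:", len(task_counts))
print("most common:", task_counts.most_common(1))
print("length min/max/mean:", min(lengths), max(lengths), sum(lengths) / len(lengths))
```

Running the same summary over the full metadata file would show how heavily the instruction set is skewed toward a few tasks and whether very short episodes should be filtered out.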
FAQ
What is the IPEC-COMMUNITY/language_table_lerobot dataset?
IPEC-COMMUNITY/language_table_lerobot is a large-scale robotics dataset containing 442,226 episodes of xArm manipulator trajectories paired with language task specifications, totaling over 7 million frames captured at 10 frames per second. The dataset is formatted using the LeRobot v2.0 schema, storing each episode as a Parquet file with synchronized video recordings, and is designed for training language-conditioned manipulation policies such as vision-language-action models. It originates from the Language Table benchmark environment and has been structured for compatibility with the Open X-Embodiment ecosystem, enabling researchers and engineers to pretrain or fine-tune transformer-based robot policies on a substantial corpus of single-arm manipulation data.
What license governs this dataset and can I use it commercially?
The dataset is distributed under the Apache License 2.0, a permissive open-source license that explicitly permits commercial use, modification, and redistribution without requiring derivative works to be open-sourced. You may train proprietary models on this data, deploy those models in commercial robotic products, and retain full ownership of your trained weights and inference code. The license includes an express patent grant, reducing intellectual property risk, and requires only that you preserve copyright notices and include a copy of the license text with any distributions. Because Apache-2.0 does not impose copyleft obligations, it is particularly well-suited for enterprise procurement where legal teams need clear rights to incorporate open data into closed-source training pipelines and production systems.
Who should use this dataset and for what applications?
This dataset is best suited to robotics teams building language-conditioned manipulation policies on xArm platforms, including researchers developing vision-language-action models, engineers fine-tuning teleoperation assistants, and machine learning teams pretraining large transformer policies for embodied AI. The scale of over 400,000 episodes supports data-hungry architectures like RT-2 or Octo, and the LeRobot format is compatible with modern imitation learning codebases. Teams working on instruction-following tasks, zero-shot generalization from linguistic commands, or cross-task transfer within tabletop manipulation will find the 127,605 distinct task specifications valuable for training models that generalize across diverse natural language prompts. The permissive license and high download count make it a low-risk choice for both academic benchmarking and commercial product development on xArm embodiments.
When is this dataset NOT the right choice for my project?
This dataset may not suit your needs if you are training policies for robots with significantly different morphologies, such as humanoid hands, mobile manipulators, or dual-arm systems, since it captures only xArm single-arm trajectories and transfer to other kinematic structures typically requires additional domain adaptation. If your application demands high-frequency control, the 10 FPS capture rate may lack the temporal resolution needed for contact-rich or high-speed tasks. The metadata does not confirm the presence of depth channels, force-torque sensors, or tactile feedback, so projects requiring multimodal sensory input beyond RGB video should verify modality coverage before integration. Additionally, without documented train-validation splits or episode success labels, teams needing curated subsets of high-quality demonstrations will need to implement their own filtering and quality-assurance steps, adding overhead to the data preparation pipeline.
How is the dataset structured and what format does it use?
The dataset employs the LeRobot v2.0 format, organizing episodes into 443 chunks of approximately 1,000 episodes each, with each episode stored as a Parquet file alongside a corresponding video recording. The Parquet schema captures robot state observations, action sequences, and metadata in columnar format, enabling efficient parallel reads and low-latency streaming during distributed training. Video files are synchronized frame-by-frame with trajectory data at 10 FPS, and the directory structure follows a predictable pattern that simplifies dataset sharding and episode indexing. Metadata in meta/info.json specifies the robot type, total frame count, chunk size, and split definitions, allowing loaders to preallocate buffers and schedule data workers appropriately. This structure is compatible with RLDS tooling and standard PyTorch or TensorFlow data pipelines, minimizing integration friction for teams already using LeRobot or Open X-Embodiment datasets.
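As an illustration of how a loader might use that metadata to schedule data workers, the sketch below parses a split definition and divides the episode range into contiguous shards. The JSON excerpt is hypothetical: its key names and the "start:stop" split syntax are assumptions modeled on LeRobot v2.0 metadata, and the helpers split_range and shard are illustrative names rather than part of any library.

```python
import json

# Hypothetical excerpt of meta/info.json; key names and values are
# assumptions and should be checked against the real file.
INFO_JSON = """
{
  "robot_type": "xarm",
  "total_episodes": 442226,
  "total_frames": 7045476,
  "chunks_size": 1000,
  "fps": 10,
  "splits": {"train": "0:442226"}
}
"""


def split_range(spec: str) -> range:
    """Parse an assumed 'start:stop' split spec into an episode range."""
    start, stop = (int(part) for part in spec.split(":"))
    return range(start, stop)


def shard(episodes: range, num_workers: int, worker_id: int) -> range:
    """Give each worker a contiguous, near-equal slice of the episode range."""
    if not 0 <= worker_id < num_workers:
        raise ValueError("worker_id must be in [0, num_workers)")
    n = len(episodes)
    lo = episodes.start + worker_id * n // num_workers
    hi = episodes.start + (worker_id + 1) * n // num_workers
    return range(lo, hi)


info = json.loads(INFO_JSON)
train = split_range(info["splits"]["train"])
shards = [shard(train, 8, w) for w in range(8)]

print(len(train))
print([len(s) for s in shards])
```

Contiguous shards keep each worker's reads within a small set of chunk directories; a shuffled assignment would trade that locality for better mixing.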
What should I know before procuring or integrating this dataset into my training pipeline?
Before procurement, verify that your compute infrastructure can handle the storage footprint of 7 million frames plus accompanying video, and confirm that your data loaders support Parquet and the LeRobot schema to avoid costly conversion work. Check whether your policy architecture requires modalities not explicitly listed, such as depth maps or proprioceptive force readings, since the metadata does not enumerate all sensor channels. Budget time to inspect a sample of episodes for task distribution, success rates, and annotation quality, as the dataset description truncates before providing these details. If you plan to combine this dataset with other embodiments for cross-robot pretraining, ensure your training harness can handle heterogeneous action spaces and observation dimensions. Finally, coordinate with your legal team to document the Apache-2.0 license in your data provenance records, confirm that no additional third-party assets introduce conflicting terms, and establish a process for preserving required copyright notices in your deployment artifacts.
Access dataset on Hugging Face