How to Create Action-Chunked Datasets for Robot Policy Training

Action chunking transforms sequential robot demonstrations into fixed-length temporal windows that policy models consume during training. You audit source trajectories for temporal consistency, select chunk size and horizon parameters based on your target architecture (ACT uses 100-step chunks, Diffusion Policy uses 16-step), implement sliding-window extraction with proper padding, compute per-dimension action normalization statistics, serialize to RLDS or LeRobot format, and validate end-to-end with a training smoke test.

Updated 2026-01-15

By Truelabel Team

Reviewed by Truelabel Team · Jan 15, 2026

Quick facts

Topic: How to create an action-chunked dataset
Audience: Procurement leads, ML ops, robotics engineers
Deliverable: Operational playbook with sample workflow + accept-rule criteria

Why Action Chunking Transforms Robot Policy Learning

Action chunking emerged as the dominant temporal representation for imitation learning after ACT and Diffusion Policy demonstrated that predicting multi-step action sequences outperforms single-step policies on long-horizon manipulation tasks. Instead of outputting one action per observation, chunked policies predict 10-100 future actions in a single forward pass, which amortizes perception cost and enables temporal consistency across control cycles.

The Open X-Embodiment dataset contains more than 1 million trajectories from 22 robot embodiments, all structured with action chunks^[1]. DROID provides 76,000 trajectories across 564 scenes with 16-step action chunks^[2]. BridgeData V2 ships 60,000 demonstrations with configurable chunk sizes for both ACT and Diffusion Policy training^[3]. These datasets show that action chunking is not an architectural detail but a data-structuring requirement for modern robot learning.

Chunking also solves the temporal credit assignment problem. Single-step policies struggle to attribute success or failure across multi-second maneuvers because each action sees only local feedback. Chunked policies learn temporal dependencies explicitly — the model observes that gripper closure at t+5 depends on approach trajectory from t to t+4. This temporal binding is why RT-2 achieves 62% success on unseen tasks compared to 32% for single-step baselines^[4].

Prerequisites: Source Data Requirements

You need robot demonstration trajectories recorded at consistent control frequency with synchronized observations and actions. Minimum viable input: 50 episodes of 10-30 seconds each, recorded at 5-30 Hz, with joint positions or end-effector poses as actions and RGB images or proprioceptive state as observations. LeRobot supports datasets as small as 10 episodes for prototyping, but production models require 500-5,000 episodes depending on task complexity^[5].

Temporal consistency is non-negotiable. Every timestep in a chunk must represent the same control interval — if your robot runs at 10 Hz, every frame must be exactly 0.1 seconds apart. USB bandwidth contention, ROS message queue drops, and compute load spikes during recording introduce timing jitter that breaks chunking assumptions. The RLDS paper reports that 15-30% of raw teleoperation data contains temporal irregularities requiring interpolation or rejection^[6].

You also need action and observation schemas locked before chunking. Changing action dimensions mid-dataset (e.g., adding a wrist rotation DOF) invalidates normalization statistics and forces re-chunking. EPIC-KITCHENS-100 maintains strict schema versioning across 100 hours of egocentric video precisely to avoid this problem^[7]. Lock your action space, observation modalities, and control frequency before collecting the first trajectory.

Audit Source Trajectories for Temporal Consistency

Load timestamps from every episode and compute inter-frame intervals with numpy.diff(timestamps). For a 10 Hz dataset, expected interval is 0.1 seconds. Compute mean, standard deviation, minimum, and maximum interval across all episodes. Flag episodes where standard deviation exceeds 10% of mean interval, where any single gap exceeds 2× expected interval, or where total frame count deviates by more than 5% from expected count.
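The three flagging rules above can be sketched as a single function. This is a minimal, dependency-free version: the list comprehension stands in for numpy.diff(timestamps), and the function name and reason strings are hypothetical, chosen only for illustration.

```python
from statistics import mean, pstdev

def audit_episode(timestamps, expected_hz):
    """Return a list of reasons to flag this episode (empty list = passes)."""
    expected_dt = 1.0 / expected_hz
    # Inter-frame intervals: the pure-Python equivalent of numpy.diff(timestamps).
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not intervals:
        return ["fewer than two frames"]
    reasons = []
    if pstdev(intervals) > 0.10 * mean(intervals):
        reasons.append("interval std exceeds 10% of mean")
    if max(intervals) > 2.0 * expected_dt:
        reasons.append("gap exceeds 2x expected interval")
    # Expected frame count, derived from the span between first and last stamp.
    duration = timestamps[-1] - timestamps[0]
    expected_frames = duration * expected_hz + 1
    if abs(len(timestamps) - expected_frames) > 0.05 * expected_frames:
        reasons.append("frame count deviates >5% from expected")
    return reasons

clean = [i * 0.1 for i in range(100)]                   # ideal 10 Hz recording
stalled = clean[:50] + [t + 0.3 for t in clean[50:]]    # simulated 0.3 s stall

print(audit_episode(clean, expected_hz=10))
print(audit_episode(stalled, expected_hz=10))
```

The thresholds are the ones stated in the text; tune them to your hardware before using the result to reject data.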

RoboNet contains 15 million frames from 7 robot platforms, and the authors report rejecting 12% of raw episodes due to timing inconsistencies^[8]. Systematic timing error accumulates: a clock that runs 1 ms long on every frame is 100 ms off after 100 frames, which misaligns actions with observations and corrupts the policy's temporal model. Zero-mean jitter does not accumulate this way, but it still blurs the pairing of actions and observations inside each chunk. The DROID dataset enforces sub-millisecond timestamp precision by synchronizing all sensors to a hardware clock^[2].

For episodes with 1-2 dropped frames, interpolate missing timesteps using cubic spline interpolation on action and proprioceptive channels, and repeat the nearest image frame for vision channels. For episodes with systematic timing issues (recording drifted from 10 Hz to 8 Hz), reject the entire episode. CALVIN provides a temporal validation script that flags 90% of problematic episodes automatically^[9]. Run this audit before investing time in chunking — fixing temporal issues post-chunking requires re-extracting every chunk.

Select Chunk Size and Temporal Horizon Parameters

Chunk size is the number of future actions your policy predicts per forward pass, and the time a chunk covers is chunk size divided by control frequency. ACT uses 100-step chunks for fine-grained bimanual tasks; at the 50 Hz control rate of the ALOHA hardware it was developed on, that is 2 seconds of actions^[10]. Diffusion Policy uses 16-step chunks for reactive manipulation, predicting 0.8 seconds at 20 Hz^[11]. RT-1 uses 8-step chunks for mobile manipulation at 3 Hz, predicting 2.7 seconds per inference^[12].

Longer chunks capture extended temporal dependencies but increase memory cost and training time. A 100-step chunk with 7-DOF actions and 224×224 RGB observations consumes about 1.2 MB per sample (a figure consistent with two float32 frames per sample). A batch of 32 is therefore about 38 MB of input tensors; model parameters, activations, and optimizer state come on top of that and usually dominate GPU memory. BridgeData V2 provides both 16-step and 100-step variants to support different architectures^[3].
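The per-sample arithmetic is easy to check directly. The sketch below assumes float32 storage and two image frames per sample; neither is stated in the text, and both are chosen here only because they reproduce the 1.2 MB figure.

```python
def sample_bytes(chunk_size, action_dim, image_hw, n_frames, bytes_per_value=4):
    """Raw size of one training sample: action chunk plus observation frames."""
    height, width = image_hw
    action_bytes = chunk_size * action_dim * bytes_per_value
    image_bytes = n_frames * height * width * 3 * bytes_per_value  # 3 = RGB
    return action_bytes + image_bytes

per_sample = sample_bytes(chunk_size=100, action_dim=7, image_hw=(224, 224), n_frames=2)
per_batch = 32 * per_sample
print(f"{per_sample / 1e6:.2f} MB per sample, {per_batch / 1e6:.1f} MB per batch of 32")
```

Under these assumptions the images account for almost all of the sample size, so chunk length affects input memory far less than image resolution or history length does.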

Temporal horizon is the observation history window fed to the policy. RT-2 uses a 6-frame history (0.5 seconds at 12 Hz) to capture object motion^[4]. OpenVLA uses a 1-frame history because the vision-language backbone encodes sufficient temporal context from static images^[13]. Start with chunk_size = 16 and history_length = 1 for reactive tasks, chunk_size = 100 and history_length = 3 for long-horizon tasks. The LeRobot documentation provides architecture-specific defaults.

Implement Sliding-Window Chunk Extraction

Sliding-window extraction generates overlapping chunks from continuous trajectories. For an episode with T timesteps and chunk size C, you extract T - C + 1 full-length chunks (more if you pad the tail of the episode). Chunk i contains observations from timestep i - history_length + 1 to i and actions from timestep i to i + C - 1, so the policy only ever conditions on present and past observations. The overlap is intentional: it increases sample count and exposes the policy to every phase of the trajectory as a possible starting point.
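A minimal sketch of the index arithmetic follows. It assumes the observation history ends at the current timestep and looks backward, and it yields only full-length chunks without padding; the function name is hypothetical.

```python
def chunk_indices(num_steps, chunk_size, history_length=1):
    """Yield (obs_indices, action_indices) for every full-length chunk."""
    # Anchor t is the current timestep: history looks back, actions look forward.
    first_anchor = history_length - 1
    last_anchor = num_steps - chunk_size
    for t in range(first_anchor, last_anchor + 1):
        obs = list(range(t - history_length + 1, t + 1))
        actions = list(range(t, t + chunk_size))
        yield obs, actions

windows = list(chunk_indices(num_steps=10, chunk_size=4, history_length=2))
print(len(windows), windows[0])
```

With history_length = 1 this produces exactly T - C + 1 windows. A longer history drops the first history_length - 1 anchors unless you also pad the start of the episode.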

RLDS provides a reference implementation in TensorFlow that handles edge cases: episodes shorter than chunk size get zero-padded, and the final chunk in each episode gets truncated if necessary^[6]. LeRobot implements the same logic in PyTorch with configurable padding strategies. Both libraries store a chunk_mask tensor that marks valid timesteps, allowing the loss function to ignore padded regions.
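The padding-plus-mask idea can be illustrated without either library. The sketch below is not the RLDS or LeRobot implementation; it is a plain-Python illustration with hypothetical names, using zero padding and a 0/1 mask.

```python
def padded_chunk(actions, start, chunk_size, pad_value=0.0):
    """Return (chunk, mask) where mask[k] is 1 for real steps and 0 for padding."""
    action_dim = len(actions[0])
    real = actions[start:start + chunk_size]        # may run off the episode end
    n_pad = chunk_size - len(real)
    chunk = [list(a) for a in real] + [[pad_value] * action_dim for _ in range(n_pad)]
    mask = [1] * len(real) + [0] * n_pad
    return chunk, mask

episode = [[0.1 * t, -0.1 * t] for t in range(5)]   # 5 steps, 2-DOF actions
chunk, mask = padded_chunk(episode, start=3, chunk_size=4)
print(mask)
```

Some pipelines repeat the last real action instead of padding with zeros. Either choice works as long as the mask is stored and the loss respects it.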

Handle episode boundaries carefully. Some implementations extract chunks that span episode boundaries (cross-episode chunking), which teaches the policy to reset between tasks. Open X-Embodiment uses cross-episode chunking to train RT-X models that generalize across 22 robot embodiments^[1]. Other implementations reject boundary-spanning chunks to maintain temporal causality. BridgeData V2 provides both variants as separate dataset splits. Document your choice in the dataset card — this affects how users must structure their training loops.
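For the variant that rejects boundary-spanning chunks, one simple approach is to build a flat index of valid (episode, start) pairs up front and sample from that. This is a sketch under that assumption, with a hypothetical function name.

```python
def build_chunk_index(episode_lengths, chunk_size):
    """List (episode_id, start) pairs for chunks that stay inside one episode."""
    index = []
    for ep_id, length in enumerate(episode_lengths):
        # range() is empty when the episode is shorter than one chunk.
        for start in range(length - chunk_size + 1):
            index.append((ep_id, start))
    return index

index = build_chunk_index(episode_lengths=[5, 3, 6], chunk_size=4)
print(index)
```

Because every start offset is bounded by its own episode length, no chunk can read actions from the next episode. The 3-step episode contributes nothing here; pad it or drop it explicitly.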

Compute and Apply Action Normalization Statistics

Action normalization rescales each action dimension to zero mean and unit variance, which stabilizes gradient flow during policy training. Compute per-dimension mean and standard deviation across all actions in all episodes before chunking. Store these statistics in a JSON file alongside the dataset — users must apply identical normalization at inference time or the policy outputs garbage.
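A dependency-free sketch of the statistics step is shown below. The helper names and the JSON layout are illustrative assumptions, not the schema of any particular library; the small epsilon floor on the standard deviation is also an addition, there to protect constant dimensions from division by zero.

```python
import json
from statistics import mean, pstdev

def compute_action_stats(episodes, eps=1e-8):
    """Per-dimension mean/std over every action in every episode."""
    all_actions = [a for ep in episodes for a in ep]
    dims = list(zip(*all_actions))                    # one tuple of values per dimension
    return {
        "mean": [mean(d) for d in dims],
        "std": [max(pstdev(d), eps) for d in dims],   # floor avoids divide-by-zero
    }

def normalize(action, stats):
    return [(x - m) / s for x, m, s in zip(action, stats["mean"], stats["std"])]

def denormalize(action, stats):
    return [x * s + m for x, m, s in zip(action, stats["mean"], stats["std"])]

episodes = [[[0.0, 10.0], [1.0, 20.0]], [[2.0, 30.0], [3.0, 40.0]]]
stats = compute_action_stats(episodes)
stats_json = json.dumps(stats)    # persist this string alongside the dataset
print(stats_json)
```

At inference time, load the same JSON and pass the policy output through denormalize before sending commands to the robot.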

LeRobot datasets store normalization statistics in a stats.json file with keys for each action dimension^[5]. RLDS embeds statistics in the TFRecord metadata. DROID provides per-robot normalization statistics because the 7 robot embodiments have different joint ranges^[2]. If your dataset spans multiple robots, compute per-robot statistics and store them in separate files.

Some architectures require observation normalization as well. Diffusion Policy normalizes proprioceptive observations (joint positions, velocities) but not images^[11]. RT-2 normalizes images to [-1, 1] using ImageNet statistics because the vision backbone was pretrained on ImageNet^[4]. Document your normalization scheme in the dataset README — this is the most common source of train-test mismatch bugs. The OpenVLA codebase includes a normalization validator that checks statistics at load time.

Serialize to RLDS or LeRobot Format

RLDS is the TensorFlow-native format used by Google Research for RT-1, RT-2, and RT-X models. It stores episodes as TFRecord files with a standardized schema: each episode contains a steps array, and each step contains observation and action dictionaries^[6]. RLDS supports arbitrary observation modalities (RGB, depth, point clouds, proprioception) and action spaces (joint positions, end-effector poses, gripper commands). The TensorFlow RLDS documentation provides conversion scripts for common formats.

LeRobot is the PyTorch-native format used by Hugging Face for ACT, Diffusion Policy, and OpenVLA models. It stores episodes as Parquet files with a columnar schema optimized for fast random access during training^[5]. LeRobot datasets integrate with Hugging Face Hub for one-line loading and automatic caching. The LeRobot dataset documentation provides a conversion API that accepts NumPy arrays, HDF5 files, or ROS bags as input.

Both formats support multi-modal observations and variable-length episodes. RLDS has better TensorFlow ecosystem integration (TensorBoard, TFX pipelines). LeRobot has better PyTorch ecosystem integration (Weights & Biases, Lightning). Open X-Embodiment ships in RLDS format with 1 million trajectories^[1]. DROID ships in LeRobot format with 76,000 trajectories^[2]. Choose based on your training framework — converting between formats is possible but adds friction.

Validate End-to-End with a Training Smoke Test

Run a 10-episode training loop with your chunked dataset to catch serialization bugs, normalization errors, and shape mismatches before scaling to full training. Load 10 episodes, extract chunks, normalize actions, and run 100 training steps with a small policy model. Check that loss trends downward (mini-batch loss is noisy, so compare a moving average rather than expecting a monotonic decrease) and that predicted actions stay within roughly 3 standard deviations of zero in normalized units, that is, in [-3, 3].
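The two pass/fail checks can be written as small helpers that run on whatever your training loop logs. Both functions are hypothetical sketches; the window of 10 steps is an assumption, and the synthetic loss curve exists only to exercise them.

```python
def loss_trends_down(losses, window=10):
    """Compare the mean of the first and last `window` loss values."""
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    head = sum(losses[:window]) / window
    tail = sum(losses[-window:]) / window
    return tail < head

def fraction_out_of_range(normalized_actions, limit=3.0):
    """Share of action values whose magnitude exceeds `limit` standard deviations."""
    values = [x for action in normalized_actions for x in action]
    return sum(abs(x) > limit for x in values) / len(values)

# Synthetic noisy-but-decreasing loss curve, standing in for real training logs.
losses = [1.0 / (1 + 0.05 * step) + (0.02 if step % 2 else -0.02) for step in range(100)]
print(loss_trends_down(losses))
print(fraction_out_of_range([[0.5, -3.5], [2.9, 0.0]]))
```

Comparing window averages tolerates the step-to-step noise of mini-batch training, which a strict "every step lower than the last" check would not.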

LeRobot provides a smoke-test script that validates dataset format, normalization statistics, and dataloader configuration in under 5 minutes^[5]. RLDS provides a similar validator in the rlds/examples directory. Both scripts check for common errors: missing chunk masks, incorrect observation shapes, NaN values in actions, and mismatched episode lengths.

Visualize a random sample of chunks to verify temporal alignment. Plot the first 3 chunks from episode 0: overlay the observation images and plot the action trajectories as line graphs. Verify that actions are smooth (no discontinuous jumps), that observations change gradually across the history window, and that chunk boundaries do not introduce artifacts. The BridgeData project page provides visualization notebooks for both RLDS and LeRobot formats. Catch alignment bugs now — they are invisible in aggregate training metrics but destroy policy performance on deployment.

Advanced Chunking Strategies for Multi-Task Datasets

Multi-task datasets require task-conditional chunking where chunk size varies by task complexity. Open X-Embodiment uses 8-step chunks for pick-and-place tasks and 32-step chunks for bimanual assembly tasks within the same dataset^[1]. The dataloader samples chunks according to a task distribution, and the policy receives a task embedding alongside observations.

RT-2 implements language-conditioned chunking where the chunk size is determined by the language instruction length^[4]. Short instructions like 'pick apple' use 8-step chunks. Long instructions like 'open the drawer, pick the red block, and place it in the bin' use 64-step chunks. This adaptive chunking reduces memory cost for simple tasks while preserving temporal context for complex tasks.

Hierarchical chunking splits long trajectories into coarse chunks (10-second segments) and fine chunks (1-second segments). The policy first predicts a coarse action sequence, then refines each coarse action into a fine action sequence. RoboCat uses two-level hierarchical chunking to scale to 100-second manipulation tasks^[14]. The CALVIN dataset provides hierarchical chunk annotations for 34 long-horizon tasks^[9]. Hierarchical chunking is complex to implement but necessary for tasks longer than 30 seconds.

Common Chunking Pitfalls and How to Avoid Them

Inconsistent control frequency across episodes breaks temporal assumptions. If episode 1 runs at 10 Hz and episode 2 runs at 12 Hz, a 16-step chunk represents 1.6 seconds in episode 1 but 1.3 seconds in episode 2. The policy learns inconsistent temporal dynamics and fails at test time. Solution: reject episodes with frequency drift exceeding 5%, or resample all episodes to a common frequency using interpolation. RoboNet resamples all 7 robot platforms to 4 Hz to ensure consistency^[8].
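Resampling to a common frequency can be sketched with linear interpolation on one channel at a time. The function below is a minimal illustration with a hypothetical name; it assumes strictly increasing timestamps and at least two samples.

```python
def resample_linear(timestamps, values, target_hz):
    """Linearly interpolate a 1-D signal onto a uniform grid at target_hz."""
    dt = 1.0 / target_hz
    # Small epsilon guards against float round-off at the final grid point.
    n_out = int((timestamps[-1] - timestamps[0]) / dt + 1e-9) + 1
    out, j = [], 0
    for k in range(n_out):
        t = timestamps[0] + k * dt
        # Advance to the source segment [timestamps[j], timestamps[j + 1]] containing t.
        while j < len(timestamps) - 2 and timestamps[j + 1] < t:
            j += 1
        t0, t1 = timestamps[j], timestamps[j + 1]
        w = (t - t0) / (t1 - t0)
        out.append(values[j] + w * (values[j + 1] - values[j]))
    return out

# A 12 Hz ramp resampled to 10 Hz: the signal equals time, so the output should too.
src_t = [i / 12 for i in range(25)]            # 0 to 2 s at 12 Hz
resampled = resample_linear(src_t, src_t, target_hz=10)
print(len(resampled))
```

Linear interpolation suits joint positions and Cartesian translations. It is not appropriate for quaternion orientations or binary gripper commands, which need spherical interpolation and nearest-neighbor selection respectively.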

Wrong observation-action alignment occurs when observations at timestep t are paired with actions from timestep t+1. This introduces a one-step lag that the policy cannot recover from. Solution: verify alignment by plotting observation timestamps against action timestamps for 10 random episodes. The DROID dataset includes alignment validation code that checks for sub-millisecond synchronization^[2].
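Alongside the plot, a numeric check makes the lag easy to catch in bulk. The sketch below is hypothetical code, not the DROID validation script; the pass threshold of half a control period is an assumption chosen so that a full one-step lag always fails.

```python
def max_alignment_offset(obs_timestamps, action_timestamps):
    """Largest |obs_t - action_t| over paired steps, in seconds."""
    if len(obs_timestamps) != len(action_timestamps):
        raise ValueError("observation and action streams differ in length")
    return max(abs(o - a) for o, a in zip(obs_timestamps, action_timestamps))

control_hz = 10
threshold = 0.5 / control_hz                   # half a control period (assumption)

obs_t = [i * 0.1 for i in range(50)]
aligned = [t + 0.002 for t in obs_t]           # 2 ms latency: acceptable
lagged = [t + 0.1 for t in obs_t]              # actions shifted by one full step

print(max_alignment_offset(obs_t, aligned) < threshold)
print(max_alignment_offset(obs_t, lagged) < threshold)
```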

Per-chunk normalization instead of per-dataset normalization computes mean and standard deviation within each chunk rather than across the entire dataset. This makes normalization statistics non-stationary and prevents the policy from learning absolute action magnitudes. Solution: compute normalization statistics once across all episodes before chunking, then apply those fixed statistics to every chunk. LeRobot enforces this by storing global statistics in stats.json^[5].

Ignoring chunk masks during loss computation treats padded timesteps as valid data, which injects noise into the gradient. Solution: multiply the loss by the chunk mask before averaging. RLDS provides a masked_loss utility function that handles this automatically^[6].
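One detail is easy to get wrong: after masking, the sum must be divided by the number of valid values, not by the full chunk length. The plain-Python sketch below shows the computation on nested lists; the function name is hypothetical, and a real training loop would do the same thing on tensors.

```python
def masked_mse(pred, target, mask):
    """Mean squared error over valid timesteps only.

    pred, target: [chunk_size][action_dim]; mask: [chunk_size] of 0/1.
    """
    total, count = 0.0, 0
    for p_step, t_step, m in zip(pred, target, mask):
        if m:
            total += sum((p - t) ** 2 for p, t in zip(p_step, t_step))
            count += len(p_step)
    return total / count if count else 0.0

pred   = [[1.0, 1.0], [2.0, 2.0], [9.0, 9.0]]
target = [[1.0, 0.0], [2.0, 2.0], [0.0, 0.0]]   # last step is zero padding
mask   = [1, 1, 0]
print(masked_mse(pred, target, mask))
```

In this example the padded step would contribute a squared error of 162 if it were counted, swamping the real error of 1 on the valid steps.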

Tooling and Infrastructure for Production Chunking

LeRobot provides a command-line tool that converts raw trajectories to chunked datasets in one command: lerobot-convert --input-dir ./raw --output-dir ./chunked --chunk-size 16 --format lerobot^[5]. It handles temporal validation, normalization, and serialization automatically. The tool supports HDF5, ROS bags, and NumPy arrays as input formats.

RLDS provides a similar tool for TensorFlow users: rlds_convert --input-format rosbag --output-format tfrecord --chunk-size 100. It integrates with TensorFlow Datasets for automatic caching and distributed loading. The RLDS documentation includes examples for converting EPIC-KITCHENS, RoboNet, and custom datasets.

Scale AI's Physical AI platform offers managed chunking as part of their data pipeline service^[15]. You upload raw teleoperation data, specify chunk parameters, and receive a production-ready dataset in RLDS or LeRobot format within 48 hours. This is cost-effective for teams without ML infrastructure — Scale handles temporal validation, normalization, and format conversion at $0.50 per episode.

For teams building custom pipelines, truelabel's physical AI data marketplace connects buyers with collectors who provide pre-chunked datasets in standardized formats. Every dataset includes temporal validation reports, normalization statistics, and smoke-test results. Buyers can filter by chunk size, control frequency, and action space to find datasets compatible with their target architecture.

Dataset Documentation and Metadata Standards

Every action-chunked dataset must include a README with chunk parameters (chunk_size, history_length, control_frequency), normalization statistics (per-dimension mean and standard deviation), temporal validation results (percentage of episodes rejected, mean inter-frame interval), and example loading code. Datasheets for Datasets provides a 50-question template covering motivation, composition, collection process, and recommended uses^[16].
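The required fields also lend themselves to a machine-readable record that can be validated before release. The layout and field names below are illustrative assumptions, not a published schema, and the statistics are placeholders.

```python
import json

dataset_card = {
    "chunk_size": 16,
    "history_length": 1,
    "control_frequency_hz": 10,
    "action_stats": {"mean": [0.0] * 7, "std": [1.0] * 7},     # placeholder values
    "temporal_validation": {"episodes_rejected_pct": None, "mean_interval_s": None},
}

REQUIRED = {"chunk_size", "history_length", "control_frequency_hz",
            "action_stats", "temporal_validation"}

def missing_fields(card):
    """Return the required keys that the card does not define, sorted by name."""
    return sorted(REQUIRED - set(card))

print(missing_fields(dataset_card))
print(json.dumps(dataset_card, indent=2)[:60])
```

Running a check like this in CI keeps the README and the shipped data from drifting apart.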

LeRobot datasets include a dataset_info.json file with machine-readable metadata: action space schema, observation modalities, episode count, total timesteps, and chunk extraction parameters^[5]. This metadata enables automatic validation and compatibility checking. The Hugging Face Datasets library parses this metadata to generate dataset cards automatically.

Open X-Embodiment publishes a 40-page dataset paper describing collection protocols, annotation procedures, and quality control measures for all 22 robot embodiments^[1]. This level of documentation is necessary for datasets used in production systems — buyers need to understand data provenance, licensing terms, and known failure modes. The truelabel data provenance glossary defines the minimum metadata required for physical AI datasets in regulated industries.

Licensing and Commercial Use Considerations

Action-chunked datasets derived from open-source trajectories inherit the source license. BridgeData V2 is released under MIT license, permitting commercial use without restrictions^[3]. DROID is released under CC BY 4.0, requiring attribution but permitting commercial use^[2]. EPIC-KITCHENS-100 is released under a custom non-commercial license that prohibits training models for commercial deployment^[17].

If you collect proprietary teleoperation data and chunk it for internal use, you own the chunked dataset and can license it however you choose. If you hire annotators to collect data, ensure your service agreement assigns IP rights to you — Appen and Scale AI both provide work-for-hire agreements that transfer dataset ownership to the buyer.

If you plan to sell chunked datasets on truelabel's marketplace, you must provide a license file (MIT, Apache 2.0, CC BY 4.0, or custom commercial license) and a provenance statement documenting data sources, collection dates, and annotator consent. Buyers in regulated industries (automotive, medical robotics) require full provenance chains to satisfy audit requirements. The truelabel provenance glossary defines the metadata fields required for marketplace listings.

External references and source context

1. Open X-Embodiment: Robotic Learning Datasets and RT-X Models. Open X-Embodiment contains 1 million trajectories from 22 robot embodiments with action chunking. arXiv.
2. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. DROID provides 76,000 trajectories with 16-step action chunks and sub-millisecond timestamp precision. arXiv.
3. BridgeData V2: A Dataset for Robot Learning at Scale. BridgeData V2 provides both 16-step and 100-step variants to support different architectures. arXiv.
4. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. RT-2 achieves 62% success on unseen tasks compared to 32% for single-step baselines; uses 6-frame history and language-conditioned chunking. arXiv.
5. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch. LeRobot supports datasets as small as 10 episodes for prototyping; stores normalization statistics in stats.json; provides smoke-test validation. arXiv.
6. RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning. RLDS paper reports that 15-30% of raw teleoperation data contains temporal irregularities; provides reference implementation for chunk extraction. arXiv.
7. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100. EPIC-KITCHENS-100 maintains strict schema versioning to avoid mid-dataset changes. arXiv.
8. RoboNet: Large-Scale Multi-Robot Learning. RoboNet contains 15 million frames from 7 robot platforms; authors rejected 12% of raw episodes due to timing inconsistencies; resamples all platforms to 4 Hz. arXiv.
9. CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks. CALVIN provides temporal validation script that flags 90% of problematic episodes automatically; provides hierarchical chunk annotations for 34 long-horizon tasks. arXiv.
10. Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware (ACT). ACT uses 100-step chunks for bimanual tasks, predicting 2 seconds of actions at 50 Hz. arXiv.
11. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. Diffusion Policy uses 16-step chunks predicting 0.8 seconds at 20 Hz. arXiv.
12. RT-1: Robotics Transformer for Real-World Control at Scale. RT-1 demonstrated that predicting multi-step action sequences outperforms single-step policies on long-horizon manipulation tasks. arXiv.
13. OpenVLA: An Open-Source Vision-Language-Action Model. OpenVLA uses 1-frame history because vision-language backbone encodes sufficient temporal context; codebase includes normalization validator. arXiv.
14. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation. RoboCat uses two-level hierarchical chunking to scale to 100-second manipulation tasks. arXiv.
15. Scale AI: Expanding Our Data Engine for Physical AI. Scale AI blog post on expanding data engine for physical AI. scale.com.
16. Datasheets for Datasets. Datasheets for Datasets provides 50-question template for dataset documentation. arXiv.
17. EPIC-KITCHENS-100 annotations license. EPIC-KITCHENS-100 annotations license prohibits commercial use. GitHub.
18. kognic.com platform. Kognic platform provides annotation tools for autonomous systems. kognic.com.
19. docs.labelbox.com overview. Labelbox documentation overview describes annotation platform capabilities. docs.labelbox.com.
20. encord.com annotate. Encord annotation platform supports multi-modal robot data. encord.com.
21. dataloop.ai annotation. Dataloop annotation platform provides tools for robot learning datasets. dataloop.ai.
22. v7darwin.com data annotation. V7 Darwin data annotation platform supports physical AI workflows. v7darwin.com.
23. roboflow.com annotate. Roboflow annotation tools support computer vision for robotics. roboflow.com.
24. Segments.ai multi-sensor data labeling. Segments.ai multi-sensor data labeling supports point cloud and image annotation. segments.ai.
25. sama.com computer vision. Sama computer vision solutions provide annotation services for robot learning. sama.com.

FAQ

What chunk size should I use for my robot learning task?

Use 16-step chunks for reactive manipulation tasks (pick-and-place, pushing, grasping) where the policy must respond to dynamic environments within 1 second. Use 100-step chunks for long-horizon tasks (bimanual assembly, multi-stage cooking, tool use) where the policy must plan several seconds ahead; the time covered is chunk size divided by control frequency, so 100 steps spans 10 seconds at 10 Hz but only 2 seconds at 50 Hz. ACT uses 100-step chunks for bimanual tasks at a 50 Hz control rate. Diffusion Policy uses 16-step chunks at 20 Hz for reactive tasks. RT-1 uses 8-step chunks at 3 Hz for mobile manipulation. Start with your target architecture's default and adjust based on task duration.

How do I handle episodes shorter than my chunk size?

Zero-pad short episodes to reach the minimum chunk size and store a chunk_mask tensor that marks valid timesteps. During training, multiply the loss by the chunk mask before averaging to ignore padded regions. RLDS and LeRobot both implement this automatically. Alternatively, reject episodes shorter than chunk_size during preprocessing if they represent incomplete demonstrations. BridgeData V2 rejects episodes shorter than 10 timesteps (0.5 seconds at 20 Hz) because they lack sufficient context for policy learning.

Can I use the same chunked dataset for ACT and Diffusion Policy?

No, ACT and Diffusion Policy require different chunk sizes and temporal horizons. ACT uses 100-step action chunks with 1-frame observation history. Diffusion Policy uses 16-step action chunks with 2-frame observation history. You must create separate chunked datasets for each architecture. BridgeData V2 provides both variants as separate dataset splits. Alternatively, store raw trajectories and chunk on-the-fly during training using architecture-specific dataloaders — this increases training time but avoids storing duplicate data.

How do I verify that my action normalization is correct?

Load 100 random chunks from your dataset and compute per-dimension mean and standard deviation of the normalized actions. The mean should be within 0.01 of zero and the standard deviation within 0.05 of one for every action dimension. A small sample of overlapping chunks can miss these tolerances by chance, so rerun the check on the full dataset before concluding that the statistics are wrong. If a dimension still deviates, recompute normalization statistics and re-normalize the dataset. Also check how many normalized actions fall within [-3, 3]: about 99.7% would for Gaussian-distributed actions, and a much lower share points to outliers that should be clipped or rejected. LeRobot provides a normalization validator that runs these checks automatically.
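The mean and standard deviation check can be sketched in a few lines. This is a hypothetical helper, not the LeRobot validator; it takes a flat list of normalized action vectors and reports which dimensions fall outside the tolerances given above.

```python
from statistics import mean, pstdev

def check_normalized(actions, mean_tol=0.01, std_tol=0.05):
    """Return indices of action dimensions whose stats are out of tolerance."""
    bad = []
    for d, values in enumerate(zip(*actions)):
        if abs(mean(values)) > mean_tol or abs(pstdev(values) - 1.0) > std_tol:
            bad.append(d)
    return bad

# Dimension 0 is standardized; dimension 1 was left in raw units by mistake.
actions = [[-1.0, 10.0], [1.0, 30.0], [-1.0, 10.0], [1.0, 30.0]]
print(check_normalized(actions))
```

A dimension that was never normalized, as in this example, is the failure this check catches most often.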

What temporal validation checks should I run before chunking?

Compute inter-frame intervals for every episode using numpy.diff(timestamps). Flag episodes where standard deviation exceeds 10% of mean interval, where any single gap exceeds 2× expected interval, or where total frame count deviates by more than 5% from expected count. Reject episodes with systematic timing drift (e.g., recording frequency drifted from 10 Hz to 8 Hz). Interpolate episodes with 1-2 dropped frames using cubic spline interpolation. RLDS reports that 15-30% of raw teleoperation data contains temporal irregularities requiring interpolation or rejection. Run these checks before investing time in chunking — fixing temporal issues post-chunking requires re-extracting every chunk.

Where can I find pre-chunked datasets for robot learning?

Open X-Embodiment provides 1 million trajectories with 8-32 step chunks in RLDS format. DROID provides 76,000 trajectories with 16-step chunks in LeRobot format. BridgeData V2 provides 60,000 demonstrations with both 16-step and 100-step variants. All three datasets are available on Hugging Face Hub with one-line loading. For custom tasks, truelabel's physical AI data marketplace connects buyers with collectors who provide pre-chunked datasets in standardized formats with temporal validation reports and normalization statistics.

Looking for an action-chunked dataset?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners and helps scope consent artifacts and commercial licensing requirements before delivery.