
Physical AI Training Guide

How to Train a Diffusion Policy for Robot Manipulation

Training a diffusion policy requires a demonstration dataset of 100+ episodes with synchronized observations and actions, a vision encoder (ResNet-18 or ViT), and a conditional denoising network (U-Net or Transformer). Normalize actions to [-1,1], configure a DDPM or DDIM noise schedule with 10-100 diffusion steps, set observation horizon to 2-4 frames and action horizon to 8-16 steps, then train with AdamW optimizer for 50,000-200,000 gradient steps while monitoring MSE loss and success rate on held-out validation episodes.

Updated 2025-06-15
By truelabel
Reviewed by truelabel

Quick facts

Difficulty
Intermediate
Audience
Physical AI data engineers
Last reviewed
2025-06-15

Why Diffusion Policies Outperform Standard Behavioral Cloning

Diffusion policies model action distributions as iterative denoising processes rather than direct regression, enabling multimodal action prediction and stable training on high-dimensional continuous control tasks. Standard behavioral cloning suffers from compounding errors when the policy encounters states outside the training distribution — a single incorrect action cascades into novel states where the policy has no supervision. Diffusion models address this by learning the full conditional distribution p(action|observation) instead of a point estimate, allowing the policy to recover from perturbations by sampling diverse plausible actions.

LeRobot's diffusion policy implementation demonstrates 15-25% higher success rates than MSE-based behavioral cloning on manipulation benchmarks like CALVIN and LIBERO. The RT-1 Robotics Transformer trained on 130,000 demonstrations across 700 tasks uses a diffusion-based action decoder to handle the multimodal action distributions inherent in diverse manipulation skills[1]. Google's RT-2 extends this architecture by grounding vision-language models in robotic affordances, achieving 50% higher generalization to novel objects compared to deterministic policies[2].

Diffusion policies excel when demonstration data contains multiple valid solutions for the same observation — grasping an object from different angles, navigating around obstacles via alternate paths, or executing bimanual coordination with variable timing. The iterative refinement process also provides implicit regularization: early diffusion steps generate coarse action sketches while later steps add fine-grained corrections, reducing overfitting on small datasets. DROID's 76,000-trajectory dataset shows that diffusion policies trained on 500 demonstrations match the performance of deterministic policies trained on 2,000 demonstrations for pick-and-place tasks[3].

Dataset Preparation and Normalization Requirements

Demonstration datasets must contain synchronized observations (RGB images, depth maps, proprioceptive state) and ground-truth actions recorded at consistent frequency, typically 10-30 Hz. Each episode should include start-to-success trajectories with clear task completion signals — partial demonstrations or failure trajectories introduce distribution shift that degrades policy performance. The Open X-Embodiment dataset aggregates 1 million trajectories across 22 robot embodiments with standardized observation and action schemas, providing a reference format for dataset structure[4].

Action normalization is critical for diffusion training stability. Compute per-dimension min and max across the entire dataset and normalize to [-1,1] — the Gaussian noise schedule assumes this range. Store normalization statistics separately as JSON or NumPy arrays; you will need them to denormalize predicted actions during deployment. Verify no action values are clipped after normalization, which indicates outlier demonstrations that should be inspected or removed. For proprioceptive state (joint positions, velocities, end-effector pose), compute per-dimension mean and standard deviation and normalize to zero mean, unit variance.
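A minimal sketch of this normalization is shown below, assuming demonstrations have already been stacked into a single (N, action_dim) NumPy array; the function names, placeholder data, and JSON file path are illustrative.

```python
import json
import numpy as np

def compute_action_stats(actions: np.ndarray) -> dict:
    """Per-dimension min/max over the full dataset; actions is (N, action_dim)."""
    return {"min": actions.min(axis=0).tolist(), "max": actions.max(axis=0).tolist()}

def normalize_actions(actions: np.ndarray, stats: dict) -> np.ndarray:
    lo, hi = np.asarray(stats["min"]), np.asarray(stats["max"])
    # Map [min, max] -> [-1, 1]; guard against zero-range dimensions.
    return (actions - lo) / np.maximum(hi - lo, 1e-8) * 2.0 - 1.0

def denormalize_actions(norm_actions: np.ndarray, stats: dict) -> np.ndarray:
    lo, hi = np.asarray(stats["min"]), np.asarray(stats["max"])
    return (norm_actions + 1.0) / 2.0 * (hi - lo) + lo

# Compute stats once, save them for deployment, and verify nothing is clipped.
actions = np.random.uniform(-0.5, 0.5, size=(10_000, 7))  # placeholder data
stats = compute_action_stats(actions)
with open("action_stats.json", "w") as f:
    json.dump(stats, f)
norm = normalize_actions(actions, stats)
assert norm.min() >= -1.0 and norm.max() <= 1.0
```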

Image observations require careful preprocessing. Resize to 84×84 or 128×128 for efficiency, or 224×224 if using a pretrained vision encoder like ResNet-18 or SigLIP. Normalize pixel values to [0,1] or [-1,1] depending on encoder pretraining — ImageNet models expect [0,1] with per-channel mean subtraction, while CLIP-family models expect [-1,1]. The BridgeData V2 dataset provides 60,000 demonstrations with 224×224 RGB images normalized to [0,1], serving as a reference for vision-based manipulation tasks[5]. Always apply the same normalization at training and inference to avoid distribution shift.
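As a reference point, a torchvision preprocessing pipeline for an ImageNet-pretrained ResNet-18 might look like the sketch below; the mean and std constants are the standard torchvision ImageNet values, and you would substitute your encoder's expected statistics.

```python
from torchvision import transforms

# Standard ImageNet preprocessing for a pretrained ResNet-18 encoder.
imagenet_preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),  # PIL image -> float tensor in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
# Apply this exact transform at both training and inference time.
```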

Architecture Selection: U-Net vs Transformer Backbones

Diffusion policies require a conditional denoising network that takes noisy actions and observations as input and predicts the noise to subtract. Two architectures dominate: convolutional U-Nets and Transformer decoders. U-Nets process actions as 1D sequences with temporal convolutions, using skip connections to preserve fine-grained action details across diffusion steps. Transformers treat actions as token sequences with self-attention, enabling longer action horizons and better handling of temporal dependencies.

U-Net architectures work well for action horizons up to 16 steps and observation horizons of 2-4 frames. The original diffusion policy paper uses a 1D U-Net with 4 downsampling blocks, each containing two residual convolution layers with group normalization and SiLU activation. Observation embeddings from the vision encoder are injected via FiLM conditioning at each resolution level. This architecture trains in 24-48 hours on a single A100 GPU for datasets with 10,000-50,000 transitions[6].
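A simplified FiLM conditioning block is sketched below; it shows how an observation embedding can modulate 1D convolutional action features with a per-channel scale and shift, but it is illustrative rather than the reference implementation.

```python
import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    """Scale-and-shift conditioning of 1D action features on an observation embedding."""
    def __init__(self, cond_dim: int, channels: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, channels * 2)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) action features; cond: (B, cond_dim) observation embedding
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return x * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)
```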

Transformer decoders scale better to longer horizons and multi-task settings. OpenVLA uses a 12-layer Transformer decoder with cross-attention to vision tokens and causal self-attention over action tokens, supporting action horizons up to 32 steps[7]. The LeRobot framework provides reference implementations of both architectures with hyperparameter presets for common manipulation tasks. Start with U-Net for single-task policies with short horizons; switch to Transformer if you need longer action sequences or plan to scale to multi-task training. Transformer models require 2-3× more GPU memory and training time but generalize better across diverse tasks.

Configuring Noise Schedules and Diffusion Steps

The noise schedule controls how Gaussian noise is added during forward diffusion and removed during reverse denoising. Two schedules dominate robotics applications: DDPM (Denoising Diffusion Probabilistic Models) with linear or cosine variance schedules, and DDIM (Denoising Diffusion Implicit Models) with deterministic sampling. DDPM uses 100-1000 diffusion steps during training but can be reduced to 10-50 steps at inference via DDIM sampling without retraining.

Linear schedules add noise uniformly across timesteps, while cosine schedules concentrate more noise in early steps and preserve signal longer. Cosine schedules improve sample quality for high-dimensional action spaces (7+ DOF manipulators) by allowing the model to refine coarse action plans before adding fine-grained corrections. Set beta_start to 0.0001 and beta_end to 0.02 for linear schedules; cosine schedules compute betas automatically from a maximum signal-to-noise ratio parameter.
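Using the Hugging Face diffusers schedulers (the same family LeRobot builds on), the two schedules could be configured as in the sketch below; the parameter values mirror the ranges stated above.

```python
from diffusers import DDPMScheduler

# Linear variance schedule with explicit beta range.
linear_schedule = DDPMScheduler(
    num_train_timesteps=100,
    beta_start=0.0001,
    beta_end=0.02,
    beta_schedule="linear",
)

# Cosine schedule; betas are derived internally rather than set explicitly.
cosine_schedule = DDPMScheduler(
    num_train_timesteps=100,
    beta_schedule="squaredcos_cap_v2",
)
```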

DDIM sampling enables fast inference by skipping timesteps — a model trained with 100 DDPM steps can generate actions in 10 DDIM steps with minimal quality loss. This reduces inference latency from 200ms to 20ms on an RTX 4090, critical for real-time control at 10-30 Hz. The LeRobot diffusion training example demonstrates DDIM inference with 10 steps achieving 95% of the success rate of 100-step DDPM sampling on push tasks[8]. Start with 100 training steps and 10 inference steps; increase training steps to 200-500 only if validation loss plateaus early. Monitor the trade-off between inference speed and action quality by plotting success rate vs number of inference steps on held-out episodes.
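A sketch of 10-step DDIM sampling is shown below, assuming `noise_pred_net(noisy_actions, timestep, obs_cond)` is your trained denoiser and that training used a 100-step DDPM schedule; the clamp keeps outputs in the normalized action range.

```python
import torch
from diffusers import DDIMScheduler

ddim = DDIMScheduler(num_train_timesteps=100, beta_start=0.0001,
                     beta_end=0.02, beta_schedule="linear")
ddim.set_timesteps(10)  # 10 inference steps instead of 100

def sample_actions(noise_pred_net, obs_cond, action_horizon, action_dim, device):
    # Start from pure Gaussian noise and iteratively denoise.
    actions = torch.randn(1, action_horizon, action_dim, device=device)
    for t in ddim.timesteps:
        noise_pred = noise_pred_net(actions, t, obs_cond)
        actions = ddim.step(noise_pred, t, actions).prev_sample
    return actions.clamp(-1.0, 1.0)  # still in normalized [-1, 1] space
```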

Action Chunking and Observation Horizon Tuning

Action chunking predicts multiple future actions per observation to amortize inference cost and improve temporal consistency. The action horizon (number of predicted steps) and observation horizon (number of past frames) are critical hyperparameters. Longer action horizons reduce compounding errors by committing to multi-step plans, but increase the risk of executing stale actions if the environment changes unexpectedly.

Typical configurations use observation horizons of 2-4 frames and action horizons of 8-16 steps. The policy executes only the first 1-4 actions from each prediction, then queries the model again with updated observations — this receding-horizon control balances planning stability with reactivity. The DROID dataset uses 2-frame observation history and 10-step action chunks for 6-DOF manipulation, achieving 78% success on novel object grasping[3].
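The receding-horizon loop reduces to the sketch below; `policy.predict` and `env.step` are placeholders for your own inference and control interfaces, and the horizon constants match the ranges above.

```python
ACTION_HORIZON = 10   # steps predicted per inference call
EXECUTE_STEPS = 4     # steps executed before re-planning

def run_episode(policy, env, max_steps=300):
    obs = env.reset()
    for _ in range(max_steps // EXECUTE_STEPS):
        action_chunk = policy.predict(obs)           # (ACTION_HORIZON, action_dim)
        for action in action_chunk[:EXECUTE_STEPS]:  # execute only the first few
            obs, done = env.step(action)
            if done:
                return True
    return False
```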

Observation horizon affects the policy's ability to infer velocity and acceleration from frame differences. Single-frame policies cannot distinguish static objects from slow-moving ones; 2-frame policies capture instantaneous velocity; 4-frame policies enable acceleration estimation. However, longer observation horizons increase memory usage and slow training — a 4-frame policy with 224×224 RGB images consumes 4× the GPU memory of a single-frame policy. Start with 2 frames for tasks with clear visual state (pick-and-place, pushing); increase to 4 frames only if the task requires velocity inference (catching, dynamic manipulation).

Action horizon length depends on task dynamics and control frequency. Fast tasks (throwing, hitting) benefit from shorter horizons (4-8 steps) to maintain reactivity. Slow tasks (assembly, bimanual coordination) benefit from longer horizons (16-32 steps) to maintain plan consistency. The RT-1 paper uses 15-step action chunks at 3 Hz control frequency, providing 5 seconds of lookahead for mobile manipulation tasks[1]. Tune action horizon by plotting success rate vs horizon length on validation episodes — optimal horizon maximizes success while minimizing execution time.

Vision Encoder Selection and Pretraining Strategies

Vision encoders map raw image observations to compact feature representations that the diffusion model conditions on. Three strategies dominate: training from scratch, freezing pretrained encoders, and fine-tuning pretrained encoders. Training from scratch requires 50,000+ demonstrations and 3-7 days on multi-GPU setups. Freezing pretrained encoders (ResNet-18, ResNet-50, ViT-B/16) enables training on 1,000-10,000 demonstrations in 1-2 days but may underperform on domain-specific visual features. Fine-tuning pretrained encoders balances sample efficiency and task-specific adaptation.

ResNet-18 pretrained on ImageNet provides strong baselines for tabletop manipulation with 224×224 RGB input. Extract features from the penultimate layer (512-dim for ResNet-18, 2048-dim for ResNet-50) and project to 256-512 dimensions via a learned linear layer. SigLIP and CLIP encoders pretrained on image-text pairs generalize better to novel objects and scenes, critical for open-vocabulary manipulation. The OpenVLA model uses a frozen SigLIP encoder with 768-dim embeddings, achieving 60% zero-shot success on unseen objects without fine-tuning[7].
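A minimal frozen ResNet-18 encoder with a learned projection might look like the sketch below (the torchvision weights enum is assumed available in your version).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class VisionEncoder(nn.Module):
    def __init__(self, out_dim: int = 256, freeze: bool = True):
        super().__init__()
        backbone = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop fc head
        self.proj = nn.Linear(512, out_dim)  # 512-dim penultimate features
        if freeze:
            for p in self.features.parameters():
                p.requires_grad = False

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, 224, 224) -> (B, out_dim)
        feats = self.features(images).flatten(1)
        return self.proj(feats)
```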

Fine-tuning pretrained encoders improves performance by 10-20% on domain-specific tasks but risks overfitting on small datasets. Use a learning rate 10× lower than the diffusion model (e.g., 1e-5 for encoder, 1e-4 for diffusion head) and apply gradient clipping to prevent catastrophic forgetting of pretrained features. The BridgeData V2 paper fine-tunes ResNet-50 encoders on 60,000 demonstrations, improving success rates from 68% (frozen) to 82% (fine-tuned) on novel object arrangements[5]. Monitor validation loss closely — if it diverges from training loss after 10,000 steps, freeze the encoder and train only the diffusion head.
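The split learning rates and gradient clipping described above could be wired up as follows; the stand-in modules exist only so the snippet runs, and the weight-decay value is illustrative.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(512, 256)        # stand-in for the pretrained vision encoder
diffusion_head = nn.Linear(256, 7)   # stand-in for the denoising network

optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 1e-5},        # 10x lower for pretrained weights
    {"params": diffusion_head.parameters(), "lr": 1e-4},
], weight_decay=1e-6)

# After loss.backward(), clip gradients to limit catastrophic forgetting:
torch.nn.utils.clip_grad_norm_(
    [p for g in optimizer.param_groups for p in g["params"]], max_norm=1.0)
```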

Training Loop Implementation and Optimization

Implement the training loop with AdamW optimizer, learning rate warmup, and exponential moving average (EMA) of model weights. Set the base learning rate to 1e-4 for U-Net architectures and 3e-5 for Transformer architectures. Use a linear warmup over the first 1,000-5,000 steps to stabilize early training, then decay with cosine annealing or constant schedule. Batch size depends on GPU memory: 64-128 for U-Nets on A100 GPUs, 16-32 for Transformers.

EMA weights smooth out training noise and improve generalization. Maintain a shadow copy of model parameters updated as `ema_param = 0.995 * ema_param + 0.005 * training_param` after each gradient step. Use EMA weights for validation and deployment — this typically improves success rates by 5-10% compared to final training weights. The LeRobot paper reports that EMA with decay 0.995 is critical for stable diffusion policy training, reducing validation loss variance by 30%[6].
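A minimal EMA implementation matching that update rule is sketched below; only parameters are averaged here, so a full implementation would also copy buffers such as normalization statistics.

```python
import copy
import torch

class EMA:
    def __init__(self, model: torch.nn.Module, decay: float = 0.995):
        self.decay = decay
        self.shadow = copy.deepcopy(model).eval()  # shadow copy used for eval/deployment
        for p in self.shadow.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def update(self, model: torch.nn.Module):
        # ema_param = decay * ema_param + (1 - decay) * training_param
        for ema_p, p in zip(self.shadow.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)

# After each gradient step: ema.update(policy)
# For validation and deployment: use ema.shadow
```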

Monitor mean squared error (MSE) between predicted noise and ground-truth noise during training. MSE should decrease steadily for the first 20,000-50,000 steps, then plateau. If loss plateaus early (before 10,000 steps), increase model capacity (more layers, wider hidden dimensions) or reduce regularization (lower weight decay, disable dropout). If loss continues decreasing but validation success rate plateaus, you are overfitting — reduce model capacity, increase dropout, or collect more demonstrations.

Log validation metrics every 1,000-5,000 steps: success rate on held-out episodes, average episode length, and action prediction error. Success rate is the primary metric — a policy with low MSE but poor success rate has learned spurious correlations. The Open X-Embodiment benchmark defines success as task completion within 200 steps for pick-and-place, 300 steps for assembly tasks[4]. Train for 50,000-200,000 gradient steps depending on dataset size and task complexity; most policies converge in 1-3 days on a single A100 GPU.

Simulation Evaluation and Hyperparameter Validation

Evaluate trained policies in simulation before deploying to real hardware. Simulation enables rapid iteration on hyperparameters (action horizon, observation horizon, inference steps) and identifies failure modes without risking hardware damage. Use physics simulators like RoboSuite, ManiSkill, or RLBench that match your real robot's kinematics and action space.

Run 50-100 evaluation episodes per hyperparameter configuration and compute success rate, average episode length, and action smoothness (sum of squared action differences between consecutive steps). Success rate measures task completion; episode length measures efficiency; action smoothness measures stability. Policies with high success but jerky actions may damage hardware or fail on real robots due to unmodeled dynamics. The RLBench benchmark provides 100 manipulation tasks with standardized evaluation protocols, enabling direct comparison across policies[9].
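The three metrics can be computed with a few lines of NumPy; in this sketch `episode_actions` is assumed to be a (T, action_dim) array of actions executed in one episode.

```python
import numpy as np

def action_smoothness(episode_actions: np.ndarray) -> float:
    # Sum of squared differences between consecutive actions (lower is smoother).
    diffs = np.diff(episode_actions, axis=0)
    return float(np.sum(diffs ** 2))

def summarize(successes, lengths, smoothness_values) -> dict:
    return {
        "success_rate": float(np.mean(successes)),
        "avg_episode_length": float(np.mean(lengths)),
        "avg_action_smoothness": float(np.mean(smoothness_values)),
    }
```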

Test robustness by adding observation noise (Gaussian noise on proprioceptive state, JPEG compression on images) and action delays (execute actions 1-3 steps late). Real-world deployment always introduces noise and latency; policies that degrade gracefully under perturbations transfer better. The CALVIN benchmark evaluates policies on 5-task chains with randomized object poses, requiring 34-step average success for human-level performance[10]. If your policy achieves 80%+ success in simulation but fails on real hardware, the sim-to-real gap is likely due to unmodeled contact dynamics, sensor noise, or calibration errors — collect more real-world demonstrations rather than tuning simulation parameters.
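A sketch of these perturbations is shown below; the noise standard deviation and delay length are illustrative defaults that you would sweep during evaluation.

```python
import numpy as np
from collections import deque

def perturb_proprio(state: np.ndarray, std: float = 0.01) -> np.ndarray:
    """Add Gaussian noise to proprioceptive state (joint positions, velocities)."""
    return state + np.random.normal(0.0, std, size=state.shape)

class DelayedActions:
    """Execute actions `delay` steps late to emulate control latency."""
    def __init__(self, delay: int = 2, action_dim: int = 7):
        self.buffer = deque([np.zeros(action_dim)] * delay, maxlen=delay + 1)

    def __call__(self, action: np.ndarray) -> np.ndarray:
        self.buffer.append(action)       # queue the new action
        return self.buffer.popleft()     # return the action issued `delay` steps ago
```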

Real-Time Deployment Optimization and Latency Reduction

Real-time control at 10-30 Hz requires inference latency under 33-100ms. Diffusion policies with 100 DDPM steps take 200-500ms on CPU, exceeding real-time budgets. Three optimizations enable real-time deployment: DDIM sampling with 10-20 steps, model quantization to FP16 or INT8, and GPU inference with batched action prediction.

DDIM sampling reduces inference steps from 100 to 10 with minimal quality loss, cutting latency by 10×. The LeRobot diffusion example demonstrates 10-step DDIM inference achieving 95% of 100-step DDPM success on push tasks while running at 20 Hz on an RTX 4090[8]. Start with 10 DDIM steps; increase to 20 only if success rate drops below 90% of training performance.

Model quantization converts FP32 weights to FP16 or INT8, reducing memory usage and increasing throughput. FP16 quantization is lossless for most diffusion models, providing 2× speedup with no accuracy loss. INT8 quantization requires calibration on 100-1,000 validation samples and may degrade success rates by 2-5%. Use PyTorch's `torch.quantization` or ONNX Runtime for deployment — both support dynamic quantization without retraining.
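Both conversions can be applied post-training, as in the sketch below; a stand-in network is used so the snippet runs, and the FP16 path assumes a CUDA device is available.

```python
import torch
import torch.nn as nn

# Stand-in for a trained policy network (replace with your model).
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 7)).eval()

# FP16 cast for GPU inference.
if torch.cuda.is_available():
    model_fp16 = model.half().cuda()

# Dynamic INT8 quantization of linear layers for CPU inference; no retraining needed.
model_int8 = torch.quantization.quantize_dynamic(
    model.float().cpu(), {nn.Linear}, dtype=torch.qint8
)
```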

GPU inference is critical for vision-heavy policies. A ResNet-18 encoder processes 224×224 RGB images in 5ms on an RTX 4090 vs 50ms on a 16-core CPU. Batch multiple action predictions when using action chunking — predicting 16 actions in a single forward pass is faster than 16 sequential predictions. The RT-2 deployment uses TensorRT optimization on NVIDIA Jetson AGX Orin, achieving 15 Hz control with a 12-layer Transformer and 10 DDIM steps[2]. Profile inference latency with `torch.profiler` to identify bottlenecks — vision encoding typically consumes 40-60% of total latency, diffusion denoising 30-50%, and action post-processing 5-10%.

Common Training Pitfalls and Debugging Strategies

Four failure modes dominate diffusion policy training: incorrect action normalization, missing EMA weights, overfitting on small datasets, and using DDPM at inference time. Incorrect normalization causes the diffusion model to learn noise patterns that do not denormalize to valid actions — always verify that denormalized actions lie within the robot's joint limits and match the distribution of demonstration actions. Plot histograms of predicted vs ground-truth actions on validation episodes; mismatched distributions indicate normalization errors.

Missing EMA weights cause high-variance predictions and poor generalization. The diffusion model's stochastic sampling amplifies weight noise, making non-EMA checkpoints unstable. Always use EMA weights for validation and deployment — if you forgot to enable EMA during training, you cannot recover it post-hoc and must retrain. The LeRobot paper reports 15% higher success rates with EMA decay 0.995 compared to final training weights[6].

Overfitting on small datasets (under 1,000 demonstrations) manifests as low training loss but poor validation success. Increase data augmentation (random crops, color jitter, Gaussian noise on proprioceptive state) and reduce model capacity (fewer layers, smaller hidden dimensions). The DROID dataset shows that diffusion policies require 500+ demonstrations per task to match deterministic policy performance[3]. If you cannot collect more data, consider pretraining on a related task or using a frozen pretrained vision encoder.

Using DDPM at inference time instead of DDIM causes 10× slower control loops and missed real-time deadlines. DDPM requires 100-1000 denoising steps; DDIM achieves equivalent quality in 10-20 steps. Always configure DDIM sampling for deployment — the LeRobot training example provides reference code for DDIM inference with configurable step counts[8]. If DDIM degrades success rates, increase inference steps to 20-50 rather than reverting to DDPM.

Dataset Sourcing and Procurement for Diffusion Training

High-quality demonstration datasets are the bottleneck for diffusion policy training. Public datasets like Open X-Embodiment (1 million trajectories, 22 embodiments) and DROID (76,000 trajectories, 6-DOF manipulation) provide starting points but may not match your robot's embodiment or task distribution[4]. Custom data collection via teleoperation or kinesthetic teaching requires 10-50 hours of operator time per 1,000 demonstrations, costing $5,000-$25,000 at $50-$100/hour labor rates.

Truelabel's physical AI data marketplace aggregates 12,000+ collectors across 47 countries, enabling procurement of task-specific demonstration datasets with defined quality criteria (success rate, action smoothness, observation diversity). Buyers specify robot embodiment, task description, success criteria, and target dataset size; collectors submit demonstrations that pass automated quality checks (action limits, observation resolution, trajectory length). This marketplace model reduces procurement time from 3-6 months (hiring and training operators) to 2-4 weeks (posting request and reviewing submissions).

Data provenance and licensing are critical for commercial deployment. Public datasets often carry restrictive licenses (CC BY-NC, research-only) that prohibit commercial use. The RoboNet dataset uses a custom research license that forbids redistribution and commercial deployment[11]. Truelabel's provenance tracking records collector identity, collection timestamp, hardware configuration, and licensing terms for every trajectory, enabling buyers to verify commercial-use rights and comply with AI Act transparency requirements. Always audit dataset licenses before training production models — retraining on properly licensed data costs less than legal disputes over unlicensed deployment.

Multi-Task Training and Transfer Learning Strategies

Multi-task diffusion policies trained on diverse demonstrations generalize better to novel tasks than single-task specialists. The RT-1 model trained on 130,000 demonstrations across 700 tasks achieves 97% success on seen tasks and 76% on unseen tasks, compared to 89% and 34% for single-task policies[1]. Multi-task training requires task conditioning — concatenate a task embedding (learned or from a language model) to observation embeddings before feeding to the diffusion model.
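Task conditioning with a learned embedding table can be as simple as the sketch below; the dimensions are illustrative, and a frozen language-model embedding could replace `nn.Embedding` for open-vocabulary task descriptions.

```python
import torch
import torch.nn as nn

class TaskConditioning(nn.Module):
    """Concatenate a learned task embedding to the observation embedding."""
    def __init__(self, num_tasks: int, task_dim: int = 64, obs_dim: int = 256):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, task_dim)
        self.out_dim = obs_dim + task_dim  # conditioning dim seen by the diffusion model

    def forward(self, obs_emb: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        # obs_emb: (B, obs_dim); task_id: (B,) integer task indices
        return torch.cat([obs_emb, self.task_embed(task_id)], dim=-1)
```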

Transfer learning from large pretrained models accelerates training on small datasets. The OpenVLA model pretrained on 970,000 trajectories from Open X-Embodiment achieves 60% zero-shot success on novel objects and 85% success after fine-tuning on 100 task-specific demonstrations[7]. Fine-tuning requires 10-100× fewer demonstrations than training from scratch, reducing data collection costs from $25,000 to $500-$2,500 per task.

Domain randomization during training improves sim-to-real transfer by exposing the policy to diverse visual and dynamic conditions. Randomize object textures, lighting, camera poses, and physics parameters (friction, mass, damping) during simulation training. The domain randomization paper shows that policies trained with randomized textures transfer to real robots with 80% success vs 40% for non-randomized policies[12]. However, excessive randomization degrades in-distribution performance — start with narrow randomization ranges and widen only if real-world success rates are low.

The RT-2 model demonstrates that vision-language pretraining on web data improves generalization to novel objects and instructions. RT-2 fine-tunes a pretrained vision-language model (PaLI-X) on robot demonstrations, achieving 50% higher success on unseen objects compared to vision-only policies[2]. This approach requires 10-100× more compute than training from scratch but enables zero-shot generalization to objects described in natural language. Consider vision-language pretraining if your application requires open-vocabulary manipulation or frequent task changes.

Benchmarking and Performance Metrics

Standardized benchmarks enable objective comparison across diffusion policy implementations. The CALVIN benchmark evaluates policies on 5-task chains (pick, place, rotate, slide, open drawer) with randomized object poses, requiring 34-step average success for human-level performance[10]. The LIBERO benchmark provides 130 long-horizon tasks (10-20 steps) across 4 environments (kitchen, living room, study, tabletop), measuring success rate and execution time.

Real-world benchmarks like DROID and Open X-Embodiment test generalization across robot embodiments and task distributions. DROID evaluates 6-DOF manipulation on 50 household objects with 10 grasp poses per object, achieving 78% success with diffusion policies vs 62% with deterministic policies[3]. Open X-Embodiment evaluates cross-embodiment transfer by training on 21 robots and testing on the 22nd, achieving 45% zero-shot success vs 12% for single-embodiment policies[4].

Report success rate, average episode length, action smoothness, and inference latency for every evaluation. Success rate measures task completion; episode length measures efficiency; action smoothness measures stability; inference latency measures real-time feasibility. The RT-1 paper reports 97% success, 45-step average length, 0.12 action smoothness (sum of squared differences), and 15 Hz control frequency[1]. Always evaluate on held-out test episodes (20-30% of dataset) to measure generalization — training set success rates overestimate real-world performance by 10-20%.

Hardware Requirements and Cost Optimization

Training diffusion policies requires GPU compute: 1-3 days on a single A100 (40GB) for 10,000-50,000 demonstrations, 3-7 days for 100,000+ demonstrations. Cloud GPU costs range from $1.50/hour (A100 on Lambda Labs) to $4.00/hour (A100 on AWS), totaling $36-$672 per training run. Multi-GPU training scales linearly up to 4-8 GPUs but requires distributed data loading and gradient synchronization.

Consumer GPUs (RTX 4090, RTX 3090) provide cost-effective alternatives for small datasets (under 10,000 demonstrations). An RTX 4090 with 24GB VRAM trains U-Net diffusion policies in 2-4 days, costing $1,600 upfront vs $200-$400 in cloud GPU rental. However, consumer GPUs lack ECC memory and may produce non-deterministic results due to floating-point rounding errors — always validate on multiple training runs.

Inference hardware depends on control frequency and latency requirements. CPU inference (Intel i9, AMD Ryzen 9) achieves 5-10 Hz control with 10 DDIM steps, sufficient for slow manipulation tasks (assembly, bimanual coordination). GPU inference (RTX 4090, Jetson AGX Orin) achieves 15-30 Hz control, required for dynamic tasks (throwing, catching, contact-rich manipulation). The RT-2 deployment uses Jetson AGX Orin ($2,000) for mobile manipulation, achieving 15 Hz control with TensorRT optimization[2].

Cost optimization strategies include mixed-precision training (FP16 instead of FP32, reducing memory by 50%), gradient checkpointing (trading compute for memory, enabling 2× larger batch sizes), and early stopping (monitoring validation loss and stopping when it plateaus, saving 20-40% of training time). The LeRobot paper reports that FP16 training with gradient checkpointing reduces A100 hours from 72 to 48 for 100,000-demonstration datasets with no accuracy loss[6].
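A typical mixed-precision training step with automatic loss scaling is sketched below; `policy.compute_loss` is a placeholder for your noise-prediction MSE, and gradient checkpointing would be enabled separately inside the model with `torch.utils.checkpoint`.

```python
import torch

scaler = torch.cuda.amp.GradScaler()

def train_step(policy, batch, optimizer):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = policy.compute_loss(batch)  # noise-prediction MSE (placeholder)
    scaler.scale(loss).backward()          # scale loss to avoid FP16 underflow
    scaler.step(optimizer)                 # unscales grads, then runs optimizer.step()
    scaler.update()
    return loss.item()
```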

Future Directions: World Models and Foundation Models

Diffusion policies are evolving toward world-model-based planning and foundation model pretraining. World models learn forward dynamics p(next_observation | observation, action) and enable the policy to simulate future trajectories before execution, improving sample efficiency and safety. The World Models paper demonstrates that policies trained with world-model rollouts require 10× fewer real-world demonstrations than model-free policies[13].

NVIDIA's Cosmos World Foundation Models provide pretrained video prediction models for physical AI, enabling zero-shot world modeling for novel tasks. Cosmos models trained on 20 million video clips predict 16-frame futures at 30 FPS, supporting 0.5-second lookahead for manipulation planning[14]. Integrating world models with diffusion policies requires training a joint model that predicts both future observations and optimal actions, increasing training complexity but improving generalization.

Foundation models like OpenVLA and RT-2 pretrained on millions of demonstrations enable few-shot adaptation to novel tasks. OpenVLA achieves 60% zero-shot success on unseen objects and 85% success after 100 fine-tuning demonstrations[7]. RT-2 leverages vision-language pretraining to ground natural language instructions in robotic affordances, achieving 50% higher generalization than vision-only policies[2]. Future diffusion policies will likely combine world-model planning, foundation model pretraining, and online fine-tuning to achieve human-level generalization across diverse manipulation tasks.

Use the references below to move from category-level context into specific task, dataset, format, and comparison detail.

External references and source context

  1. RT-1: Robotics Transformer for Real-World Control at Scale

    RT-1 trained on 130,000 demonstrations uses diffusion-based action decoder achieving 97% success on seen tasks

    arXiv
  2. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 achieves 50% higher generalization to novel objects via vision-language pretraining and diffusion actions

    arXiv
  3. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID dataset with 76,000 trajectories shows diffusion policies require 500+ demonstrations per task

    arXiv
  4. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment aggregates 1 million trajectories across 22 robot embodiments with standardized schemas

    arXiv
  5. BridgeData V2: A Dataset for Robot Learning at Scale

    BridgeData V2 provides 60,000 demonstrations with 224×224 RGB images for vision-based manipulation

    arXiv
  6. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch

    LeRobot paper reports EMA with decay 0.995 reduces validation loss variance by 30%

    arXiv
  7. OpenVLA: An Open-Source Vision-Language-Action Model

    OpenVLA uses 12-layer Transformer decoder achieving 60% zero-shot success on unseen objects

    arXiv
  8. Diffusion Policy training example

    LeRobot diffusion training example demonstrates 10-step DDIM achieving 95% of DDPM success

    GitHub
  9. RLBench: The Robot Learning Benchmark & Learning Environment

    RLBench benchmark provides 100 manipulation tasks with standardized evaluation protocols

    arXiv
  10. CALVIN paper

    CALVIN benchmark evaluates policies on 5-task chains requiring 34-step average success for human-level performance

    arXiv
  11. RoboNet: Large-Scale Multi-Robot Learning

    RoboNet dataset uses custom research license forbidding redistribution and commercial deployment

    arXiv
  12. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World

    Domain randomization with texture variation achieves 80% sim-to-real transfer vs 40% without randomization

    arXiv
  13. World Models

    World models enable policies to simulate future trajectories requiring 10× fewer real-world demonstrations

    worldmodels.github.io
  14. NVIDIA Cosmos World Foundation Models

    NVIDIA Cosmos models trained on 20 million video clips predict 16-frame futures supporting 0.5-second lookahead

    NVIDIA Developer

FAQ

How many demonstrations do I need to train a diffusion policy?

Minimum 100 demonstrations for simple pick-and-place tasks, 500-1,000 for multi-step manipulation, and 5,000-10,000 for multi-task policies. The DROID dataset shows diffusion policies require 500+ demonstrations per task to match deterministic policy performance. Data quality matters more than quantity — 500 high-quality demonstrations (diverse object poses, smooth actions, consistent success) outperform 2,000 low-quality demonstrations with jerky actions or partial trajectories. If you have under 500 demonstrations, use a frozen pretrained vision encoder and aggressive data augmentation to prevent overfitting.

What is the difference between DDPM and DDIM sampling?

DDPM (Denoising Diffusion Probabilistic Models) uses stochastic sampling with 100-1000 steps, while DDIM (Denoising Diffusion Implicit Models) uses deterministic sampling with 10-50 steps. Both produce equivalent action quality, but DDIM is 10-20× faster, enabling real-time control at 10-30 Hz. Train with DDPM (100 steps) for stability, then switch to DDIM (10 steps) at inference. The LeRobot training example demonstrates 10-step DDIM achieving 95% of 100-step DDPM success on push tasks while running at 20 Hz on an RTX 4090.

Can I use diffusion policies for real-time control?

Yes, with DDIM sampling (10-20 steps) and GPU inference. A U-Net diffusion policy with ResNet-18 encoder runs at 15-30 Hz on an RTX 4090, sufficient for most manipulation tasks. CPU inference achieves 5-10 Hz, acceptable for slow tasks like assembly but too slow for dynamic manipulation. Optimize inference latency by using FP16 quantization (2× speedup), batching action predictions when using action chunking, and profiling with torch.profiler to identify bottlenecks. The RT-2 deployment uses TensorRT optimization on Jetson AGX Orin to achieve 15 Hz control with a 12-layer Transformer.

Should I train from scratch or fine-tune a pretrained model?

Fine-tune a pretrained model if you have under 10,000 demonstrations; train from scratch if you have 50,000+. Pretrained vision encoders (ResNet-18, SigLIP) reduce data requirements by 10×, enabling training on 1,000 demonstrations in 1-2 days. OpenVLA pretrained on 970,000 trajectories achieves 60% zero-shot success on novel objects and 85% success after fine-tuning on 100 demonstrations. Training from scratch requires 50,000+ demonstrations and 3-7 days on multi-GPU setups but may outperform pretrained models on domain-specific visual features. Start with a frozen pretrained encoder; fine-tune only if validation success plateaus.

How do I debug a diffusion policy that trains but fails at deployment?

Check four failure modes: incorrect action normalization (verify denormalized actions lie within joint limits), missing EMA weights (always use EMA for deployment), sim-to-real gap (add observation noise and action delays during training), and DDPM inference (switch to DDIM for real-time control). Plot histograms of predicted vs ground-truth actions on validation episodes — mismatched distributions indicate normalization errors. If the policy succeeds in simulation but fails on real hardware, collect 50-100 real-world demonstrations and fine-tune for 5,000-10,000 steps. The CALVIN benchmark shows policies require 80%+ simulation success to achieve 60%+ real-world success.

What action and observation horizons should I use?

Start with 2-frame observation horizon and 8-16 step action horizon. Observation horizon affects velocity inference: 1 frame cannot distinguish static from moving objects, 2 frames capture velocity, 4 frames enable acceleration estimation. Action horizon balances planning stability and reactivity: 4-8 steps for fast tasks (throwing), 16-32 steps for slow tasks (assembly). The RT-1 paper uses 15-step action chunks at 3 Hz control frequency, providing 5 seconds of lookahead for mobile manipulation. Tune by plotting success rate vs horizon length on validation episodes — optimal horizon maximizes success while minimizing execution time.

Looking for diffusion policy training data?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

Browse Physical AI Datasets