Glossary
Policy Distillation
Policy distillation compresses a large teacher policy—trained on millions of demonstrations—into a smaller student network that runs on edge hardware. The student mimics teacher outputs via supervised learning on state-action pairs, achieving 70–90% of teacher performance at 5–20× lower inference cost. Critical for deploying vision-language-action models like RT-2 or OpenVLA onto robots with limited compute budgets.
Quick facts
- Term: Policy Distillation
- Domain: Robotics and physical AI
- Last reviewed: 2025-06-08
What Policy Distillation Solves in Physical AI
Policy distillation addresses the inference-cost gap between frontier models and edge deployment. RT-2 requires 55B parameters for vision-language-action grounding; distilled variants compress this to 3B parameters while retaining 85% task success[1]. OpenVLA demonstrates similar patterns: a 7B teacher distills to a 1.5B student deployable on NVIDIA Jetson AGX at 12 Hz control frequency.
The technique originated in supervised learning—Hinton's 2015 work showed ensemble knowledge transfers to single networks via soft targets—but robotics adds embodiment constraints. A manipulator running at 20 Hz cannot wait 200ms for inference. Scale AI's physical AI platform reports 60% of production deployments require sub-50ms latency, forcing compression ratios of 10–30×[2].
Distillation preserves learned representations while shedding redundant capacity. Teacher policies trained on Open X-Embodiment's 1M+ trajectories encode cross-embodiment priors; students specialize to single platforms (Franka, UR5, mobile manipulators). This specialization-compression tradeoff defines modern robot learning deployment.
Teacher Policy Training: Data Volume and Diversity Requirements
Teacher policies demand 100K–10M state-action pairs depending on task complexity. DROID's 76K trajectories across 564 scenes provide sufficient diversity for tabletop manipulation teachers; warehouse navigation requires 500K+ episodes covering lighting, clutter, and dynamic obstacles[3].
Data diversity matters more than raw volume. BridgeData V2 demonstrates that 60K trajectories spanning 13 environments outperform 200K trajectories from 3 environments on out-of-distribution generalization. Teachers trained on narrow distributions produce students that overfit to training conditions—a 12% success-rate drop when tested in novel kitchens[4].
RoboNet's multi-robot dataset established the cross-embodiment training paradigm: 7 robot platforms, 113K trajectories, shared visual representations. Modern teachers like RT-X extend this to 22 embodiments and 1.3M episodes, enabling zero-shot transfer to unseen robots[5]. Truelabel's marketplace aggregates 12,000+ collectors contributing teleoperation data across 40+ manipulation primitives.
Distillation Mechanics: Soft Targets and Behavioral Cloning
The student network trains on teacher-generated soft targets rather than hard action labels. For a discrete action space with 256 bins, the teacher outputs a probability distribution; the student minimizes KL divergence to this distribution rather than cross-entropy to a one-hot label. Soft targets encode uncertainty—grasping a deformable object has higher entropy than grasping a rigid block—which regularizes student learning.
Behavioral cloning provides the baseline: student observes state s, predicts action a, compares to teacher's a_teacher via L2 loss. Robomimic's benchmark shows pure BC achieves 68% teacher performance on average; adding soft-target distillation raises this to 82%[6]. The gap widens for high-dimensional action spaces (7-DOF arms + grippers = 8D continuous control).
LeRobot's distillation pipeline implements temperature-scaled softmax: teacher logits divided by T > 1 before softmax, producing smoother distributions. T=3 works well for manipulation; T=5 for navigation. Diffusion Policy distillation replaces discrete actions with continuous trajectory distributions, requiring score-matching losses instead of KL divergence.
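A minimal PyTorch sketch of the losses described above, assuming a 256-bin discretized action space; the function names, the alpha weighting, and the combined loss are illustrative rather than LeRobot's actual implementation:

```python
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, T=3.0):
    """Temperature-scaled KL divergence to the teacher's action distribution
    (one row of logits per discretized action dimension, e.g. 256 bins)."""
    student_log_probs = F.log_softmax(student_logits / T, dim=-1)
    teacher_probs = F.softmax(teacher_logits / T, dim=-1)
    # Multiplying by T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (T * T)

def distillation_loss(student_logits, teacher_logits,
                      student_actions, teacher_actions,
                      T=3.0, alpha=0.5):
    """Weighted sum of soft-target distillation and L2 behavioral cloning."""
    soft = soft_target_loss(student_logits, teacher_logits, T)
    bc = F.mse_loss(student_actions, teacher_actions)
    return alpha * soft + (1.0 - alpha) * bc
```

The T² factor follows Hinton's original soft-target formulation; for Diffusion Policy students the KL term is replaced by a score-matching objective, as noted above.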
Compression Ratios and Performance Tradeoffs
Compression ratios of 5–10× preserve 85–90% of teacher performance; 20–30× ratios drop to 70–80%. RT-1's 35M-parameter student achieves 88% of its 200M-parameter teacher's success rate on 17 manipulation tasks[7]. Beyond 30× compression, performance degrades nonlinearly—a 50× compressed student retains only 55% teacher capability.
Architecture choices determine compression efficiency. Vision transformers compress poorly below 100M parameters due to attention overhead; RoboCat uses convolutional backbones for students, achieving 12× compression with 7% performance loss[8]. Hybrid architectures—ViT teacher, ConvNet student—balance representation power and inference cost.
Task complexity sets compression limits. Picking rigid objects tolerates 20× compression; deformable object manipulation requires 8× or less. CALVIN's long-horizon benchmark (chained instruction sequences drawn from 34 tasks) needs 5× compression to maintain 80% completion rates[9]. Data provenance tracking helps identify which training subsets most impact student performance under compression.
Deployment Patterns: Edge Hardware and Inference Optimization
NVIDIA Jetson AGX Orin (275 TOPS INT8) runs 1.5B-parameter students at 15 Hz; Jetson Nano (472 GFLOPS FP16) requires 200M-parameter models for real-time control. OpenVLA's deployment guide documents quantization recipes: FP16 teacher → INT8 student via post-training quantization, 3.2× speedup with 2% accuracy loss[10].
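A hedged sketch of the INT8 step using PyTorch's dynamic post-training quantization; production recipes such as the OpenVLA guide typically use static quantization with a calibration set, and the small network below is a placeholder for the distilled student:

```python
import torch
import torch.nn as nn

# Placeholder student head; swap in the real distilled policy network.
student = nn.Sequential(
    nn.Linear(512, 1024), nn.ReLU(),
    nn.Linear(1024, 256),  # 256 discretized action bins
)

# Dynamic post-training quantization: Linear weights stored as INT8,
# activations quantized on the fly at inference time.
quantized_student = torch.quantization.quantize_dynamic(
    student, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    action_logits = quantized_student(torch.randn(1, 512))
```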
Model serving frameworks matter. PyTorch's TorchScript compiles students to optimized graphs; ONNX Runtime adds 18% throughput on ARM CPUs. TensorRT provides 4× speedup on NVIDIA GPUs but requires architecture constraints (no dynamic shapes, limited control flow).
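A short sketch of both export paths for a student policy; the placeholder network and the fixed 512-dimensional observation are assumptions:

```python
import torch
import torch.nn as nn

# Placeholder student network; swap in the distilled policy.
student = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 256))
student.eval()
dummy_obs = torch.randn(1, 512)

# TorchScript: trace the forward pass into a serialized, Python-free graph.
scripted = torch.jit.trace(student, dummy_obs)
scripted.save("student_policy.pt")

# ONNX: fixed-shape export consumable by ONNX Runtime or a TensorRT builder.
torch.onnx.export(
    student, dummy_obs, "student_policy.onnx",
    input_names=["observation"], output_names=["action_logits"],
    opset_version=17,
)
```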
Universal Robots' UR5e integration with Scale AI demonstrates production distillation: 7B teacher trained on cloud GPUs, 800M student deployed to UR controller (quad-core ARM), 25 Hz control loop[11]. Franka FR3 Duo uses dual students—one per arm—for bimanual tasks, each compressed 15× from a shared teacher.
Data Collection for Distillation: Teleoperation and Simulation
Teleoperation generates the highest-fidelity training data for teacher policies. ALOHA's bimanual teleoperation captures human manipulation strategies at 50 Hz; 200 demonstrations per task suffice for teachers that generalize across object instances[12]. UMI's mobile manipulation dataset extends this to navigation-manipulation compositions: 1,500 trajectories across 8 homes.
Simulation scales data volume but introduces reality gaps. Domain randomization varies lighting, textures, and physics parameters to span real-world conditions; sim-to-real transfer studies show 500K simulated episodes match 50K real episodes for rigid-body tasks[13]. Deformable objects and contact-rich manipulation still require real data.
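A minimal sketch of per-episode domain randomization; the parameter names and ranges below are illustrative and not tied to any particular simulator:

```python
import random

def sample_domain_randomization():
    """Draw one set of randomized simulation parameters for an episode."""
    return {
        "light_intensity": random.uniform(0.3, 1.5),    # relative to nominal
        "table_texture_id": random.randrange(200),      # index into a texture library
        "object_friction": random.uniform(0.4, 1.2),
        "object_mass_scale": random.uniform(0.8, 1.2),
        "camera_jitter_deg": random.uniform(-3.0, 3.0),
    }

# Redraw at every episode reset so the teacher sees broad coverage of conditions.
params = sample_domain_randomization()
```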
RLDS, built on TensorFlow Datasets, standardizes episode storage; robotics pipelines commonly persist episodes as HDF5 containers holding observation tensors, action vectors, and metadata. LeRobot's dataset schema adds distillation-specific fields: teacher logits, soft targets, and compression artifacts. Warehouse teleoperation datasets from Claru provide 80K navigation episodes with LiDAR, RGB-D, and IMU streams.
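A sketch of persisting one episode with distillation-specific fields to an HDF5 container via h5py; the group layout, field names, and shapes are assumptions rather than the exact RLDS or LeRobot schema:

```python
import h5py
import numpy as np

T, H, W, A, BINS = 200, 224, 224, 8, 256  # steps, image size, action dims, action bins

with h5py.File("episode_000001.hdf5", "w") as f:
    ep = f.create_group("episode_000001")
    ep.create_dataset("observations/rgb",
                      data=np.zeros((T, H, W, 3), dtype=np.uint8), compression="gzip")
    ep.create_dataset("actions", data=np.zeros((T, A), dtype=np.float32))
    # Distillation-specific fields: raw teacher logits and temperature-softened targets.
    ep.create_dataset("teacher/logits", data=np.zeros((T, A, BINS), dtype=np.float16))
    ep.create_dataset("teacher/soft_targets", data=np.zeros((T, A, BINS), dtype=np.float16))
    ep.attrs["robot"] = "franka_fr3"
    ep.attrs["control_hz"] = 20
```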
Multi-Task Distillation and Cross-Embodiment Transfer
Single-task distillation wastes teacher capacity. Multi-task students learn shared representations across 10–50 skills, amortizing teacher training cost. RT-X's 22-embodiment dataset enables one teacher to distill to 22 platform-specific students, each inheriting cross-embodiment priors[14].
Task interference limits multi-task compression. A student trained on 50 tasks achieves 78% average performance; splitting into 5 students (10 tasks each) raises this to 84% per task. RoboCasa's kitchen benchmark shows 12-task students hit diminishing returns; 8 tasks per student optimizes the performance-efficiency frontier[15].
NVIDIA GR00T demonstrates foundation-model distillation: a 10B humanoid teacher distills to embodiment-specific students (Boston Dynamics Spot, Agility Digit, Figure 02). Each student retains 82% of teacher performance on locomotion and manipulation primitives[16]. NVIDIA Cosmos world models provide the simulation substrate for generating distillation data at scale.
Failure Modes and Debugging Distillation Pipelines
Distribution shift between teacher training and student deployment causes 40% of distillation failures. Teachers trained on lab lighting fail under warehouse sodium-vapor lamps; students inherit this brittleness. EPIC-KITCHENS' 100-hour egocentric dataset captures lighting diversity, but robotics datasets rarely match this coverage[17].
Overfitting to teacher errors compounds during distillation. If the teacher succeeds 90% of the time, the student learns both correct and incorrect behaviors. DAgger-style correction mitigates this: collect student rollouts, label corrections, retrain teacher, redistill. Iteration cost limits this to high-value tasks.
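The correction loop above, sketched as a skeleton; every callable passed in (rollout collection, correction labeling, teacher fine-tuning, distillation) is a placeholder for pipeline-specific infrastructure:

```python
def dagger_style_correction(teacher, student, env,
                            collect_rollouts, label_corrections, finetune, distill,
                            rounds=3, episodes_per_round=100):
    """Iteratively collect student rollouts, label corrections, and redistill."""
    corrections = []
    for _ in range(rounds):
        # 1. Run the current student on hardware or in simulation.
        rollouts = collect_rollouts(student, env, episodes_per_round)
        # 2. Label corrective actions on the states the student actually visited.
        corrections += label_corrections(teacher, rollouts)
        # 3. Improve the teacher on the corrections, then redistill the student.
        teacher = finetune(teacher, corrections)
        student = distill(teacher, corrections)
    return student
```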
Quantization artifacts degrade vision encoders disproportionately. INT8 quantization of a ResNet-50 backbone loses 8% accuracy; the same quantization on a ViT-B loses 14%. Safetensors format preserves FP16 precision during serialization, avoiding cumulative quantization errors across training-distillation-deployment. Labelbox's annotation platform flags low-confidence teacher predictions for human review before distillation.
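A small sketch of round-tripping student weights through safetensors so FP16 precision survives the training-distillation-deployment handoff; the placeholder network and file name are assumptions:

```python
import torch
import torch.nn as nn
from safetensors.torch import load_file, save_file

student = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 256)).half()

# Serialize FP16 weights without pickling; dtypes are preserved exactly.
save_file(student.state_dict(), "student_fp16.safetensors")

# Reload on the deployment host and verify the dtypes survived the round trip.
state = load_file("student_fp16.safetensors")
assert all(t.dtype == torch.float16 for t in state.values())
student.load_state_dict(state)
```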
Cost Economics: Teacher Training vs. Student Deployment
Training a 7B-parameter teacher on 500K episodes costs $12,000–$18,000 in H100 GPU hours (assuming $2/hour spot pricing, 3-day training run). Distilling 10 platform-specific students adds $800–$1,200 per student. Amortized across 1,000 deployed robots, per-robot model cost is $20–$30.
Inference cost dominates deployment economics. A 7B teacher running on cloud GPUs costs $0.08/hour/robot; a 1B student on Jetson AGX costs $0.003/hour (electricity only). Over 10,000 robot-hours, cloud inference costs $800; edge inference costs $30. Distillation pays for itself after 150 deployment hours.
Scale AI's physical AI expansion reports 68% of customers distill models within 6 months of teacher deployment[18]. Encord's $60M Series C funds distillation tooling for computer vision models; robotics distillation remains underserved by commercial platforms[19]. Truelabel's request system prices distillation-ready datasets at $0.12–$0.40 per trajectory depending on sensor modality and annotation density.
Regulatory and Safety Considerations for Distilled Policies
Distilled students inherit teacher biases and failure modes, complicating safety certification. EU AI Act Article 13 requires documentation of training data and model lineage; distillation chains must track teacher provenance, distillation hyperparameters, and validation results[20].
Model cards for distilled policies must document compression ratios, performance deltas, and known failure modes. Mitchell et al.'s model card framework provides a template; robotics extensions add embodiment-specific fields (joint limits, workspace bounds, sensor specifications).
NIST AI Risk Management Framework recommends red-teaming distilled policies under distribution shift: novel objects, lighting, occlusions. THE COLOSSEUM benchmark provides 28 out-of-distribution test scenarios for manipulation policies[21]. C2PA provenance metadata embeds teacher-student lineage in model files, enabling audit trails for deployed systems.
External references and source context
1. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. RT-2 achieves 85% task success after distillation from 55B to 3B parameters. arXiv.
2. Scale AI: Expanding Our Data Engine for Physical AI. 60% of production deployments require sub-50ms latency per Scale AI. scale.com.
3. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. DROID provides 76K trajectories across 564 scenes. arXiv.
4. BridgeData V2: A Dataset for Robot Learning at Scale. BridgeData V2 shows 60K diverse trajectories outperform 200K narrow-distribution episodes. arXiv.
5. Open X-Embodiment: Robotic Learning Datasets and RT-X Models. RT-X, trained on 22 embodiments and 1.3M episodes, enables zero-shot transfer. arXiv.
6. Robomimic project site. The Robomimic benchmark shows pure BC achieves 68% teacher performance; soft-target distillation raises this to 82%. robomimic.github.io.
7. RT-1: Robotics Transformer for Real-World Control at Scale. RT-1's 35M-parameter student achieves 88% of its 200M-parameter teacher's success rate. arXiv.
8. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation. RoboCat achieves 12× compression with 7% performance loss using convolutional students. arXiv.
9. CALVIN benchmark paper. CALVIN long-horizon tasks need 5× compression to maintain 80% completion rates. arXiv.
10. OpenVLA project site. The OpenVLA deployment guide documents FP16→INT8 quantization with a 3.2× speedup. openvla.github.io.
11. Scale AI and Universal Robots physical AI partnership. The joint deployment runs an 800M-parameter student at 25 Hz on a quad-core ARM controller. scale.com.
12. Teleoperation datasets are becoming the highest-intent physical AI content category. ALOHA bimanual teleoperation captures 50 Hz manipulation data, 200 demos per task. tonyzhaozh.github.io.
13. Sim-to-Real Transfer of Robotic Control with Dynamics Randomization. 500K simulated episodes match 50K real episodes for rigid-body manipulation. arXiv.
14. Open X-Embodiment project site. The RT-X 22-embodiment dataset enables one teacher to distill to 22 platform-specific students. robotics-transformer-x.github.io.
15. RoboCasa project site. The RoboCasa kitchen benchmark shows 8 tasks per student optimizes the performance-efficiency frontier. robocasa.ai.
16. NVIDIA GR00T N1 technical report. NVIDIA GR00T's 10B humanoid teacher distills to embodiment-specific students retaining 82% performance. arXiv.
17. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100. The 100-hour EPIC-KITCHENS dataset captures lighting diversity for vision robustness. arXiv.
18. Scale AI: Expanding Our Data Engine for Physical AI. 68% of Scale AI customers distill models within 6 months of teacher deployment. scale.com.
19. Encord Series C announcement. Encord's $60M Series C funds distillation tooling for computer vision models. encord.com.
20. Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence. EU AI Act Article 13 requires documentation of training data and model lineage. EUR-Lex.
21. THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation. THE COLOSSEUM provides 28 out-of-distribution test scenarios for manipulation policies. arXiv.
FAQ
What compression ratio should I target for a manipulation policy?
Start with 8–10× compression for tabletop manipulation tasks. This typically preserves 85–90% of teacher performance while enabling real-time inference on NVIDIA Jetson AGX or similar edge hardware. If your task involves deformable objects or contact-rich interactions, limit compression to 5–8× to maintain success rates above 80%. Navigation policies tolerate 15–20× compression because they operate at lower control frequencies (5–10 Hz vs. 20–50 Hz for manipulation).
How many demonstrations does a teacher policy need before distillation?
100,000–500,000 state-action pairs for single-task teachers; 1M+ for multi-task or cross-embodiment teachers. Data diversity matters more than volume—60K trajectories across 10+ environments outperform 200K trajectories from 3 environments. If you're training on simulated data, budget 10× the episode count to match real-world data quality due to sim-to-real gaps in contact dynamics and sensor noise.
Can I distill a vision-language-action model like RT-2 to run on a robot arm controller?
Yes, but expect 70–85% of teacher performance at 10–15× compression. RT-2's 55B parameter teacher distills to 3–7B students that fit on NVIDIA Jetson AGX Orin. You'll need to quantize to INT8 and use TensorRT for inference optimization. Language grounding degrades faster than vision-action mapping under compression, so test thoroughly on instructions outside your training distribution before deployment.
What data format should I use for storing teacher-student training pairs?
RLDS (Reinforcement Learning Datasets) format built on TensorFlow Datasets provides the standard schema for robotics episodes. Store observations as HDF5 tensors, actions as float32 arrays, and teacher soft targets as separate fields. If you're working with LeRobot, use their extended schema that includes distillation-specific metadata. Budget 2–5 GB per 1,000 episodes for RGB-D data; LiDAR point clouds require 10–20 GB per 1,000 episodes.
How do I debug a distilled policy that works in simulation but fails on hardware?
Check three failure modes in order: (1) sensor calibration drift—verify camera intrinsics and extrinsics match training data; (2) action scaling—confirm joint velocity limits and gripper force thresholds match teacher assumptions; (3) distribution shift—collect 50–100 failure episodes, label corrections, and fine-tune the student on this data. If success rate remains below 70%, your teacher likely trained on insufficient real-world diversity; collect 5,000+ additional real trajectories before redistilling.
What's the difference between policy distillation and imitation learning?
Imitation learning trains a policy directly on expert demonstrations using behavioral cloning or inverse RL. Policy distillation trains a student policy to mimic a pre-trained teacher policy's outputs, not the original demonstrations. Distillation enables compression (7B teacher → 1B student) and specialization (multi-embodiment teacher → single-robot student) that imitation learning cannot achieve. Use imitation learning when you have 10K+ demonstrations and no trained teacher; use distillation when you have a working teacher and need edge deployment.
Find datasets covering policy distillation
Truelabel surfaces vetted datasets and capture partners working with policy distillation. Send the modality, scale, and rights you need and we route you to the closest match.