Physical AI Implementation Guide
How to Fine-Tune a Vision-Language-Action Model on Custom Robot Data
Fine-tuning a vision-language-action model involves four stages: convert your robot demonstrations to RLDS format with 256×256 RGB observations and normalized 7-DoF actions; configure LoRA adapters with rank 32–64 to cut per-GPU VRAM from 80GB to under 24GB; train for 5,000–15,000 steps on 4–8 A100 GPUs over 12–48 hours; and validate success rates above 70% on held-out tasks before deploying the policy to your physical robot at 10–30Hz control rates.
Quick facts
- Difficulty: Intermediate
- Audience: Physical AI data engineers
- Last reviewed: 2025-06-15
Why Fine-Tuning VLA Models Outperforms Training From Scratch
Vision-language-action models like OpenVLA and RT-2 arrive pretrained on millions of robot trajectories from datasets including Open X-Embodiment, which aggregates 1 million episodes across 22 robot embodiments[1]. Training a comparable foundation model from scratch demands 500+ A100-days and access to multi-embodiment data pipelines that few organizations possess. Fine-tuning instead transfers this learned manipulation prior onto your specific tasks in 12–48 hours on 4–8 GPUs.
Task-specific adaptation preserves the model's general understanding of object affordances, spatial reasoning, and language grounding while specializing action distributions to your robot's kinematics and workspace constraints. RT-1 demonstrated 97% success on seen tasks and 76% on novel instructions after fine-tuning on 130,000 demonstrations[2], compared to 34% for policies trained only on target-domain data. The pretrained visual encoder already recognizes 10,000+ object categories from web-scale vision-language pretraining, eliminating the need to relearn basic perception.
Compute efficiency becomes critical when iteration speed determines product velocity. Full fine-tuning of OpenVLA's 7B parameters requires 8×A100 80GB GPUs and consumes 640 GPU-hours per training run. Parameter-efficient methods like LoRA reduce memory to 24GB per GPU and cut training time to 50 GPU-hours while retaining 95% of full fine-tuning performance, enabling rapid experimentation on physical AI data collected from your production environment.
Prerequisites: Hardware, Software, and Data Requirements
Compute infrastructure determines feasibility before you begin. OpenVLA fine-tuning with LoRA rank-64 adapters requires 4×A100 40GB GPUs minimum; full fine-tuning demands 8×A100 80GB. LeRobot reports that training Diffusion Policy on 200 episodes completes in 6 hours on 4×A100s[3]. Cloud options include AWS p4d.24xlarge instances at $32.77/hour or Lambda Labs A100 clusters at $1.10/GPU-hour. Budget 12–48 hours of wall-clock time for hyperparameter sweeps across learning rates, batch sizes, and LoRA ranks.
Software dependencies span the robotics and ML stacks. Install Python 3.10+, PyTorch 2.1+ with CUDA 12.1, and LeRobot or the OpenVLA training repository. For RLDS dataset manipulation you need TensorFlow Datasets and the RLDS library. Octo models require JAX 0.4+ and Flax. Verify your environment supports mixed-precision training (bfloat16) and distributed data parallelism across multiple GPUs to avoid bottlenecks during the 10,000+ training steps typical for convergence.
Dataset requirements set the lower bound on fine-tuning success. Collect 50–200 teleoperation episodes per task, each containing 20–100 timesteps of RGB observations, language instructions, and 7-DoF end-effector actions. DROID provides 76,000 trajectories across 564 scenes as a reference scale[4]. Store demonstrations in RLDS format with episode boundaries, or convert from HDF5/ROS bags using the LeRobot data pipeline. Split 80% train, 10% validation, 10% test; never evaluate on training tasks to avoid overfitting to specific object placements or lighting conditions.
Step 1: Convert Your Robot Data to RLDS Format
RLDS (Reinforcement Learning Datasets) structures episodes as TensorFlow Datasets with nested dictionaries containing observations, actions, rewards, and metadata. Each episode is a sequence of steps; each step holds an `observation` dict (with `image` and `state` tensors), an `action` vector, a `language_instruction` string, and boolean flags like `is_first` and `is_last`. The RLDS repository provides builder templates for custom datasets.
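As a concrete sketch, the per-step feature spec for a custom RLDS builder might look like the following. The shapes and dtypes are assumptions for a single 256×256 camera and a 7-DoF arm; adapt them to your robot.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Per-step feature spec for a custom RLDS dataset builder. Shapes and dtypes
# here are assumptions for a single-camera, 7-DoF setup.
STEP_SPEC = tfds.features.FeaturesDict({
    "observation": tfds.features.FeaturesDict({
        "image": tfds.features.Image(shape=(256, 256, 3), dtype=tf.uint8),
        "state": tfds.features.Tensor(shape=(7,), dtype=tf.float32),  # proprioception
    }),
    "action": tfds.features.Tensor(shape=(7,), dtype=tf.float32),  # delta EE pose + gripper
    "language_instruction": tfds.features.Text(),
    "is_first": tfds.features.Scalar(dtype=tf.bool),
    "is_last": tfds.features.Scalar(dtype=tf.bool),
    "is_terminal": tfds.features.Scalar(dtype=tf.bool),
})
```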
Image preprocessing standardizes visual inputs to the model's expected resolution. OpenVLA uses 256×256 RGB images; resize your camera feeds with bilinear interpolation and center-crop to preserve aspect ratio. Normalize pixel values to [0, 1] by dividing by 255. If your robot has multiple cameras, concatenate views along the channel dimension or encode them as separate observation keys. BridgeData V2 demonstrates multi-view encoding with wrist and third-person cameras[5].
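A minimal preprocessing helper along these lines, assuming HWC uint8 frames from your camera driver (crop to a square before resizing so the aspect ratio is preserved):

```python
import torch
import torchvision.transforms.functional as TF

def preprocess_frame(frame: torch.Tensor) -> torch.Tensor:
    """Convert an HWC uint8 camera frame to a 256x256 CHW float tensor in [0, 1]."""
    chw = frame.permute(2, 0, 1)                          # HWC -> CHW
    short_side = min(chw.shape[1], chw.shape[2])
    chw = TF.center_crop(chw, [short_side, short_side])   # square crop preserves aspect
    chw = TF.resize(chw, [256, 256], antialias=True)      # bilinear interpolation
    return chw.float() / 255.0                            # normalize to [0, 1]
```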
Action normalization prevents gradient explosion during training. Compute per-dimension mean and standard deviation across your entire training set, then apply z-score normalization: `action_normalized = (action - mean) / std`. Clip normalized actions to [-1, 1] to match the model's output tanh activation. For 7-DoF delta poses, typical std values are 0.02–0.05m for position deltas and 0.1–0.3 radians for rotation deltas. Store normalization statistics in a JSON file; you will need them at inference time to denormalize the model's predictions back to robot-executable commands.
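A sketch of this pipeline, assuming `episodes` is a list of dicts with an `actions` array per episode; the stats file name is arbitrary:

```python
import json
import numpy as np

all_actions = np.concatenate([ep["actions"] for ep in episodes])  # shape (N, 7)
mean, std = all_actions.mean(axis=0), all_actions.std(axis=0)

def normalize_action(action: np.ndarray) -> np.ndarray:
    z = (action - mean) / (std + 1e-8)      # z-score per dimension
    return np.clip(z, -1.0, 1.0)            # match the model's tanh output range

# Persist the statistics: you need them at inference time to denormalize.
with open("action_stats.json", "w") as f:
    json.dump({"mean": mean.tolist(), "std": std.tolist()}, f)
```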
Language instruction annotation grounds each episode in natural language. Write task descriptions like "pick the blue block and place it in the bin" rather than generic labels like "task_17". RT-2 shows that instruction diversity improves generalization: paraphrase the same task multiple ways ("grab the blue cube", "grasp the azure block") to teach the model semantic equivalence[6]. If you lack annotations, use an LLM to generate instructions from video frames, but manually verify 10% for quality. Store instructions as UTF-8 strings in the `language_instruction` field of each episode's first step.
Step 2: Configure LoRA Adapters to Reduce Memory Footprint
Low-Rank Adaptation (LoRA) injects trainable rank-decomposition matrices into transformer attention layers, freezing the pretrained weights and updating only 0.1–1% of total parameters. For OpenVLA's 7B-parameter model, LoRA rank-32 adds 67M trainable parameters and reduces VRAM from 80GB to 22GB per GPU, enabling fine-tuning on 4×A100 40GB cards. Rank-64 doubles adapter capacity to 134M parameters at 28GB VRAM, improving task performance by 3–8% on complex manipulation benchmarks.
Hyperparameter selection balances expressivity and overfitting risk. Start with rank r=32, alpha=64 (scaling factor), and dropout=0.05. Apply LoRA to query and value projection matrices in all transformer blocks; some practitioners also adapt the feedforward layers for an additional 2–4% success rate gain. LeRobot's training scripts default to rank-32 for Diffusion Policy and rank-64 for ACT models. If validation loss plateaus above training loss after 5,000 steps, increase rank to 64 or 128; if training loss diverges, reduce learning rate or increase dropout to 0.1.
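With Hugging Face's peft library, the starting configuration above looks roughly like this. The `openvla/openvla-7b` checkpoint id matches the public release; the `q_proj`/`v_proj` module names fit Llama-style backbones but should be verified against your model:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b", trust_remote_code=True
)
lora_cfg = LoraConfig(
    r=32,                                  # rank
    lora_alpha=64,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # query/value projections in each block
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()         # expect well under 1% of 7B trainable
```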
Mixed-precision training accelerates throughput by 2–3× without accuracy loss. Enable bfloat16 automatic mixed precision in PyTorch via `torch.cuda.amp.autocast()`. This reduces activation memory by 50% and allows larger batch sizes: 32 sequences per GPU with LoRA rank-32, compared to 8 sequences for full fine-tuning. Monitor gradient norms; if they exceed 10.0, apply gradient clipping at 1.0 to stabilize training. OpenVLA's paper reports bfloat16 training matches float32 final performance while cutting wall-clock time from 48 to 18 hours on 8×A100s[7].
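One training step under bfloat16 autocast with the gradient clipping described above might look like this; `model`, `batch`, `loss_fn`, and `optimizer` are placeholders from your training loop:

```python
import torch

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    predictions = model(batch["observations"], batch["instructions"])
    loss = loss_fn(predictions, batch["actions"])

loss.backward()  # no GradScaler needed for bfloat16, unlike float16
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```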
Step 3: Set Learning Rate, Batch Size, and Training Duration
Learning rate scheduling determines convergence speed and final performance. Use a linear warmup from 0 to peak learning rate over the first 500–1,000 steps, then cosine decay to 10% of peak by the end of training. For LoRA fine-tuning, start with peak LR=3e-4; for full fine-tuning, use 1e-5. RT-1 training used 1e-4 with 10,000-step warmup across 200,000 total steps[8]. If validation loss oscillates, halve the learning rate; if it plateaus early, double it and restart from the best checkpoint.
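A self-contained warmup-plus-cosine schedule matching this recipe (linear warmup, cosine decay to 10% of peak), written against PyTorch's `LambdaLR`; the optimizer is assumed to be created with the peak learning rate:

```python
import math
from torch.optim.lr_scheduler import LambdaLR

def warmup_cosine(warmup_steps: int, total_steps: int, floor: float = 0.1):
    """Linear warmup to peak LR, then cosine decay down to floor * peak."""
    def schedule(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return floor + (1.0 - floor) * 0.5 * (1.0 + math.cos(math.pi * progress))
    return schedule

scheduler = LambdaLR(optimizer, lr_lambda=warmup_cosine(1_000, 10_000))
```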
Batch size scaling trades memory for gradient stability. Effective batch size (global batch across all GPUs) should be 128–512 sequences for VLA fine-tuning. With 4 GPUs and per-device batch size of 8, your effective batch is 32; use gradient accumulation over 4 steps to reach effective batch 128. Larger batches (256–512) smooth gradients and improve success rates by 2–5% but require proportionally more VRAM. LeRobot's Diffusion Policy example uses batch size 64 on 4 GPUs with 2-step accumulation for effective batch 128.
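Gradient accumulation in the inner loop is a few lines; `compute_loss` and the dataloader are placeholders for your pipeline:

```python
import torch

ACCUM_STEPS = 4  # 4 GPUs x per-device batch 8 x accumulation 4 = effective batch 128

for step, batch in enumerate(dataloader):
    loss = compute_loss(model, batch) / ACCUM_STEPS  # scale so gradients average
    loss.backward()
    if (step + 1) % ACCUM_STEPS == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```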
Training duration depends on dataset size and task complexity. For 100–200 episodes, train for 10,000–15,000 steps (approximately 50–75 epochs with batch size 128). Monitor validation success rate every 500 steps; stop when it plateaus for 2,000 consecutive steps or begins decreasing (overfitting signal). Simple pick-and-place tasks converge in 5,000 steps; multi-stage assembly tasks may require 20,000+ steps. Budget 12 hours on 4×A100s for 10,000 steps with LoRA rank-32, or 36 hours for full fine-tuning on 8×A100s.
Step 4: Train the Model and Monitor for Common Failure Modes
Launch distributed training with PyTorch DDP or DeepSpeed for multi-GPU parallelism. Use `torchrun --nproc_per_node=4` to spawn 4 processes, one per GPU. Set `CUDA_VISIBLE_DEVICES` to isolate GPUs if sharing a node. Enable TensorBoard logging for loss curves, gradient norms, and learning rate schedules. LeRobot's ACT training notebook demonstrates end-to-end setup with Weights & Biases integration for experiment tracking.
Gradient explosion manifests as NaN losses after 100–500 steps, typically caused by excessive learning rates or unnormalized actions. If training diverges, reduce peak learning rate by 50%, verify action normalization statistics are correct (mean near 0, std near 1), and enable gradient clipping at norm 1.0. Check that your RLDS dataset does not contain outlier actions (e.g., a single timestep with 10× larger delta than the rest); filter or clip these during preprocessing.
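A simple preprocessing filter for such outliers, flagging timesteps more than k standard deviations from the per-dimension mean (k=6 is an assumption; tune it against your data):

```python
import numpy as np

def outlier_mask(actions: np.ndarray, k: float = 6.0) -> np.ndarray:
    """Return a boolean mask of timesteps to keep (True) vs. outliers (False)."""
    mu = actions.mean(axis=0)
    sigma = actions.std(axis=0) + 1e-8
    return np.all(np.abs(actions - mu) <= k * sigma, axis=1)

# Usage: actions = actions[outlier_mask(actions)]  # or clip instead of dropping
```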
Overfitting indicators include training loss below 0.01 while validation loss remains above 0.1, or 95%+ training success with 40% validation success. Increase LoRA dropout from 0.05 to 0.1, apply weight decay (1e-4 to 1e-3), or collect more diverse demonstrations. CALVIN shows that task diversity matters more than raw episode count: 50 episodes across 10 object configurations outperform 200 episodes with identical setups[9]. Augment images with random crops, color jitter (±10% brightness/contrast), and horizontal flips to artificially expand the training distribution.
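A torchvision augmentation pipeline along these lines, assuming CHW float tensors in [0, 1]; note that horizontal flips mirror the scene relative to the action frame, so some practitioners omit them for manipulation data:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomCrop(224),                            # from 256x256 inputs
    transforms.ColorJitter(brightness=0.1, contrast=0.1),  # +/- 10%
    transforms.RandomHorizontalFlip(p=0.5),                # see caveat above
])
```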
Step 5: Evaluate Fine-Tuned Policy on Held-Out Tasks
Simulation rollouts provide fast iteration before risking hardware. Load your fine-tuned checkpoint into RoboSuite or ManiSkill and execute 50–100 episodes on test tasks. Measure success rate (task completion within episode horizon), average episode length, and action smoothness (sum of squared action deltas). A well-tuned policy achieves 70–90% success on seen task variations and 40–60% on novel object shapes or positions. If success drops below 50%, retrain with more data or higher LoRA rank.
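The three metrics reduce to a few lines of NumPy, assuming each rollout is recorded as a dict with a `success` flag and a T×7 `actions` array:

```python
import numpy as np

def rollout_metrics(episodes: list[dict]) -> dict:
    success = float(np.mean([ep["success"] for ep in episodes]))
    length = float(np.mean([len(ep["actions"]) for ep in episodes]))
    smoothness = float(np.mean(
        [np.sum(np.diff(ep["actions"], axis=0) ** 2) for ep in episodes]
    ))
    return {"success_rate": success, "avg_length": length, "smoothness": smoothness}
```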
Real-world deployment requires action post-processing for safety and smoothness. Denormalize model outputs using the stored mean/std statistics, then apply a moving-average filter (window size 3–5) to reduce high-frequency jitter. Clip position deltas to ±5cm and rotation deltas to ±15° per timestep to prevent collisions. Run the policy at 10–30Hz depending on your robot's control loop; RT-2 operates at 3Hz for high-level planning with a lower-level impedance controller at 100Hz[10].
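A sketch of this post-processing chain as a small stateful class; the mean/std come from the JSON file saved in Step 1, and the 7-D layout (xyz position, rpy rotation, gripper) is an assumption:

```python
import collections
import numpy as np

class ActionPostProcessor:
    """Denormalize, smooth, and clip model outputs before execution."""

    def __init__(self, mean, std, window: int = 3):
        self.mean, self.std = np.asarray(mean), np.asarray(std)
        self.history = collections.deque(maxlen=window)

    def __call__(self, normalized_action: np.ndarray) -> np.ndarray:
        a = normalized_action * self.std + self.mean               # undo z-score
        self.history.append(a)
        a = np.mean(self.history, axis=0)                          # moving average
        a[:3] = np.clip(a[:3], -0.05, 0.05)                        # +/- 5 cm
        a[3:6] = np.clip(a[3:6], -np.deg2rad(15), np.deg2rad(15))  # +/- 15 deg
        return a
```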
Failure analysis identifies systematic gaps in the training data. Record all failed episodes with timestamps and failure modes (e.g., "gripper missed object by 3cm", "knocked over adjacent item"). If 60% of failures occur during grasp approach, collect 20 additional teleoperation episodes focusing on diverse grasp angles and object orientations. If the policy succeeds on blue objects but fails on red, your dataset likely undersamples red instances; commission targeted data collection to balance the distribution. Retrain with the augmented dataset and re-evaluate until success rate exceeds your deployment threshold.
Step 6: Optimize Inference Latency for Real-Time Control
Model quantization reduces checkpoint size and speeds up forward passes. Convert the fine-tuned model to int8 precision using PyTorch's `torch.quantization` API or bitsandbytes library. This cuts VRAM from 14GB to 4GB and reduces inference latency from 80ms to 30ms per action on a single A100, enabling deployment on edge GPUs like NVIDIA Jetson Orin. Quantization typically costs 1–3% success rate; validate on 50 test episodes before deploying to production robots.
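With a Hugging Face-format checkpoint, bitsandbytes int8 loading is one way to do this (the checkpoint path is hypothetical); CPU-side alternatives use `torch.quantization.quantize_dynamic`:

```python
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

model = AutoModelForVision2Seq.from_pretrained(
    "path/to/finetuned-openvla",                          # your merged checkpoint
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
    trust_remote_code=True,
)
```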
Batch inference amortizes encoder overhead when controlling multiple robots. If you operate 4 identical arms, stack their observations into a batch-4 tensor and run a single forward pass, reducing per-robot latency from 80ms to 25ms. This requires synchronized observation capture across robots; use hardware-triggered cameras or NTP time synchronization to align frames within 5ms. Scale AI's Universal Robots partnership demonstrates fleet-scale inference optimization for physical AI deployments[11].
Fallback policies handle out-of-distribution states where the VLA's confidence drops below a threshold. Compute action entropy at each timestep; if it exceeds 2.5 nats, trigger a scripted recovery behavior (e.g., return to home position, request human teleoperation). RoboCat uses a separate "meta-controller" that switches between the learned policy and hard-coded primitives based on predicted success probability[12]. Log all fallback triggers to identify dataset gaps; if 15% of episodes require fallback during grasping, collect more grasp demonstrations with varied object poses and lighting conditions.
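An entropy gate over the policy's action-token logits might look like this; the 2.5-nat threshold follows the text and should be calibrated on your validation rollouts:

```python
import torch

ENTROPY_THRESHOLD = 2.5  # nats; calibrate on validation rollouts

def gated_action(logits: torch.Tensor):
    """Return (token_ids, fallback) from action-token logits of shape (T, V)."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1).mean()
    if entropy.item() > ENTROPY_THRESHOLD:
        return None, True   # caller runs scripted recovery or requests teleop
    return probs.argmax(dim=-1), False
```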
Advanced Techniques: Multi-Task Fine-Tuning and Continual Learning
Multi-task batching trains a single policy across 5–20 related tasks by sampling episodes uniformly from each task's dataset. This improves generalization: a policy trained on "pick red block" and "pick blue block" transfers better to "pick green block" than a policy trained only on red. Use task-conditioned language instructions to disambiguate behaviors; the model learns to attend to color words in the instruction embedding. Open X-Embodiment shows that multi-task pretraining on 22 robots improves single-task fine-tuning success by 12–18% compared to single-embodiment pretraining[13].
Continual learning updates the policy as you collect new demonstrations without catastrophic forgetting of old tasks. Allocate 20% of each training batch to replay buffer samples from previous tasks, maintaining performance on "pick red block" while learning "stack two blocks". Alternatively, use elastic weight consolidation (EWC) to penalize changes to parameters important for old tasks. RT-1's data engine continuously ingests new teleoperation data and retrains policies weekly, accumulating 130,000 episodes over 17 months[14].
Sim-to-real transfer bridges the reality gap when physical data is expensive. Pretrain on 10,000 simulated episodes in RoboSuite with domain randomization (varying object textures, lighting, camera poses), then fine-tune on 50–100 real episodes. This hybrid approach achieves 85% real-world success compared to 65% for simulation-only policies. Domain randomization forces the model to learn robust features invariant to superficial visual variation, improving transfer across the sim-to-real distribution shift[15].
Troubleshooting: Common Fine-Tuning Pitfalls and Solutions
Action distribution mismatch occurs when your robot's action space differs from the pretraining data. OpenVLA expects 7-DoF delta end-effector poses; if your robot uses 6-DoF joint positions, you must either retarget actions to end-effector space using inverse kinematics or train an action-space adapter layer. The adapter is a 2-layer MLP that maps the VLA's 7D output to your 6D joint space, trained jointly during fine-tuning. Freeze the VLA backbone for the first 1,000 steps while the adapter learns the mapping, then unfreeze LoRA adapters for end-to-end fine-tuning.
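The adapter itself is tiny; this sketch uses a hidden width of 64, which is an assumption:

```python
import torch.nn as nn

class ActionSpaceAdapter(nn.Module):
    """2-layer MLP mapping the VLA's 7-D output to a 6-D joint-space command."""

    def __init__(self, in_dim: int = 7, hidden: int = 64, out_dim: int = 6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, vla_action):
        return self.net(vla_action)
```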
Language instruction ambiguity degrades performance when multiple tasks share similar descriptions. "Pick the block" is ambiguous if your workspace contains red, blue, and green blocks. Augment instructions with spatial or attribute details: "pick the red block on the left", "grasp the small blue cube near the bin". If your dataset lacks detailed annotations, use an LLM to generate richer descriptions from video frames, then manually verify 20% for correctness. RT-2's vision-language pretraining enables zero-shot understanding of 500+ object categories and spatial relations, reducing annotation burden[16].
Insufficient data diversity limits generalization to novel object poses, lighting, or backgrounds. If your 100 episodes all use the same table texture and overhead lighting, the policy will fail under side lighting or on a metal surface. Collect demonstrations across 3–5 workspace configurations, 2–3 lighting conditions, and 10+ object instances per category. Alternatively, apply aggressive image augmentation during training: random crops (224×224 from 256×256), color jitter (±20% brightness/saturation), and Gaussian blur (σ=0.5–2.0 pixels). This artificially expands the visual distribution and improves robustness by 8–15% on out-of-distribution test scenes.
Deployment Checklist: From Checkpoint to Production Robot
Model export converts the PyTorch checkpoint to an inference-optimized format. Use `torch.jit.trace()` to compile the model into TorchScript, eliminating Python overhead and enabling C++ deployment. For edge devices, export to ONNX and run with TensorRT for 2–4× speedup. Verify that the exported model produces identical outputs (within 1e-5 absolute error) to the original checkpoint on 100 test observations before deploying to hardware.
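A sketch of trace-and-verify for a vision-only policy head; full VLAs take image and text inputs, so real exports typically trace a wrapper with multiple example inputs:

```python
import torch

model.eval()
example = torch.randn(1, 3, 256, 256)             # dummy observation; adapt shapes
scripted = torch.jit.trace(model, example)        # tracing assumes no data-dependent branching
scripted.save("policy_traced.pt")

with torch.no_grad():
    err = (model(example) - scripted(example)).abs().max().item()
assert err < 1e-5, f"export mismatch: {err:.2e}"  # verify before deployment
```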
Safety interlocks prevent collisions and workspace violations. Implement joint limit checks (halt if any joint exceeds ±170° for a 180° range), workspace boundary enforcement (stop if end-effector exits a predefined 3D bounding box), and force-torque thresholds (e-stop if wrist sensor detects >20N unexpected force). Run the policy in a "shadow mode" for 50 episodes where it predicts actions but a human teleoperator executes them, logging discrepancies to identify unsafe predictions before autonomous operation.
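The interlocks reduce to a per-timestep predicate; the workspace bounds below are placeholder values you must measure for your own cell:

```python
import numpy as np

JOINT_LIMIT = np.deg2rad(170.0)        # for a +/- 180 deg joint range
WS_MIN = np.array([0.2, -0.4, 0.0])    # workspace box, meters (placeholder)
WS_MAX = np.array([0.8, 0.4, 0.5])
FORCE_LIMIT = 20.0                     # newtons

def safe_to_execute(joints, ee_pos, wrist_force) -> bool:
    """Return False (halt / e-stop) if any interlock fires."""
    if np.any(np.abs(joints) > JOINT_LIMIT):
        return False
    if np.any(ee_pos < WS_MIN) or np.any(ee_pos > WS_MAX):
        return False
    if np.linalg.norm(wrist_force) > FORCE_LIMIT:
        return False
    return True
```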
Monitoring and logging enable post-deployment debugging. Record every episode's observations, actions, language instruction, and success label to a database. Compute per-episode metrics: task success, episode length, action smoothness (sum of squared deltas), and maximum joint velocity. Set alerts for anomalies: success rate drops below 60% over 20 episodes, average episode length exceeds 150% of training mean, or action smoothness degrades by 50%. Data provenance tracking links each deployed policy version to its training dataset, enabling rapid root-cause analysis when performance degrades[17].
External references and source context
1. Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). Aggregates 1 million episodes across 22 robot embodiments for VLA pretraining.
2. RT-1: Robotics Transformer for Real-World Control at Scale (arXiv). RT-1 achieved 97% success on seen tasks after training on 130,000 demonstrations.
3. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch (arXiv). Reports that Diffusion Policy training on 200 episodes completes in 6 hours on 4×A100s.
4. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv). Contains 76,000 trajectories collected across 564 scenes in diverse environments.
5. BridgeData V2: A Dataset for Robot Learning at Scale (arXiv). Demonstrates multi-view camera encoding with wrist and third-person observations.
6. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv). Shows that instruction diversity improves generalization to novel language commands.
7. OpenVLA: An Open-Source Vision-Language-Action Model (arXiv). Reports that bfloat16 training matches float32 performance while cutting time from 48 to 18 hours.
8. RT-1: Robotics Transformer for Real-World Control at Scale (arXiv). RT-1 used learning rate 1e-4 with 10,000-step warmup across 200,000 total training steps.
9. CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks (arXiv). Shows that task diversity matters more than episode count for generalization.
10. RT-2 project site (robotics-transformer2.github.io). RT-2 operates at 3Hz for high-level planning with 100Hz impedance control.
11. Scale AI and Universal Robots partnership (scale.com). Demonstrates fleet-scale inference optimization for physical AI deployments.
12. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation (arXiv). Uses a meta-controller to switch between learned policies and scripted primitives.
13. Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). Multi-task pretraining improves single-task fine-tuning by 12–18%.
14. RT-1 project site (robotics-transformer1.github.io). RT-1's data engine continuously ingests teleoperation data and retrains policies weekly.
15. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World (arXiv). Domain randomization forces models to learn features invariant to superficial visual variation.
16. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv). Vision-language pretraining enables zero-shot understanding of 500+ object categories.
17. Truelabel data provenance glossary (truelabel.ai). Data provenance tracking links deployed policies to training datasets for root-cause analysis.
FAQ
How many demonstration episodes do I need to fine-tune a VLA model effectively?
For simple pick-and-place tasks, 50–100 teleoperation episodes suffice to achieve 70–80% success rates after fine-tuning OpenVLA or RT-2 with LoRA adapters. Complex multi-stage tasks like assembly or bimanual manipulation require 200–500 episodes. BridgeData V2 collected 60,000 episodes across 24 environments to train a generalist policy with 85% average success[5]. Prioritize diversity over raw count: 100 episodes spanning 10 object configurations and 3 lighting conditions outperform 300 episodes in a single setup. If your budget is constrained, start with 50 episodes, fine-tune, evaluate on held-out tasks, then collect 20–50 additional episodes targeting the failure modes you observe.
Can I fine-tune a VLA model on a single GPU, or do I need a multi-GPU cluster?
LoRA fine-tuning of OpenVLA with rank-32 adapters fits on a single A100 40GB GPU with batch size 4 and gradient accumulation over 8 steps, yielding effective batch size 32. Training 10,000 steps takes approximately 48 hours on one GPU versus 12 hours on 4 GPUs. Full fine-tuning requires 8×A100 80GB GPUs due to the 7B parameter count and 256×256 image inputs. If you lack multi-GPU access, use cloud providers like Lambda Labs (A100 at $1.10/hour) or AWS p4d instances. Alternatively, fine-tune smaller VLA variants: Octo-Base (93M parameters) trains on 2×A100 40GB in 8 hours for 10,000 steps with full fine-tuning, achieving 90% of Octo-Large performance on single-task benchmarks.
What is the difference between fine-tuning OpenVLA versus RT-2, and which should I choose?
OpenVLA is a 7B-parameter open-source VLA built on Llama-2 and pretrained on 970,000 robot trajectories from Open X-Embodiment, released in June 2024[7]. RT-2 is Google's 55B-parameter proprietary model that combines PaLI-X vision-language pretraining with 130,000 RT-1 robot demonstrations, published in July 2023[6]. OpenVLA offers full code access, LoRA fine-tuning scripts, and LeRobot integration, making it the default choice for academic and startup use. RT-2 achieves 5–10% higher success on novel instructions due to its larger scale and web-knowledge transfer, but requires licensing from Google and lacks public fine-tuning tools. Choose OpenVLA for rapid iteration and full control; pursue RT-2 if you have a Google partnership and need state-of-the-art generalization.
How do I handle multi-camera observations when fine-tuning a VLA model?
VLA models typically expect a single 256×256 RGB image per timestep. For multi-camera setups (e.g., wrist camera + third-person camera), concatenate views along the channel dimension to create a 256×256×6 tensor (2 cameras × 3 RGB channels), then add a learned projection layer that maps 6 channels to 3 before feeding into the pretrained image encoder. Alternatively, encode each camera separately through the frozen vision encoder, concatenate the resulting feature vectors, and train a fusion MLP during fine-tuning. BridgeData V2 uses the concatenation approach for wrist + external cameras and reports 8% higher success on manipulation tasks compared to single-camera policies[5]. If your cameras have different resolutions, resize both to 256×256 and apply per-camera normalization (compute separate mean/std for each view) to prevent one camera from dominating the gradient signal.
What are the signs that my fine-tuned VLA model is overfitting, and how do I fix it?
Overfitting manifests as training success rate above 90% while validation success remains below 50%, or training loss below 0.01 with validation loss above 0.15. Check if validation loss stops decreasing after 3,000 steps while training loss continues dropping—this indicates the model is memorizing training episodes rather than learning generalizable manipulation skills. To mitigate overfitting: (1) increase LoRA dropout from 0.05 to 0.1–0.15, (2) apply weight decay of 1e-4 to 1e-3 on adapter parameters, (3) collect 20–50 additional demonstrations with varied object poses and lighting, (4) enable aggressive image augmentation (random crops, color jitter ±20%, Gaussian blur), and (5) reduce LoRA rank from 64 to 32 to limit adapter capacity. If overfitting persists, your dataset likely lacks diversity; CALVIN demonstrates that 50 episodes across 10 object configurations outperform 200 episodes in a single setup by 25% on novel test scenes[9].
How do I deploy a fine-tuned VLA model to run at 10–30Hz on a physical robot?
Real-time deployment requires inference latency under 33ms for 30Hz control. Start by quantizing the fine-tuned checkpoint to int8 using PyTorch's `torch.quantization.quantize_dynamic()` or bitsandbytes, reducing VRAM from 14GB to 4GB and latency from 80ms to 25–35ms on an A100. Export the quantized model to TorchScript via `torch.jit.trace()` for C++ integration with your robot's control stack. If deploying on an edge GPU like NVIDIA Jetson Orin, convert to ONNX and optimize with TensorRT, achieving 15–20ms latency. Implement a double-buffered observation queue: while the policy computes action_t, your robot captures observation_t+1 in parallel, hiding 10–15ms of camera latency. Apply a 3-point moving-average filter to model outputs to smooth high-frequency jitter, and clip position deltas to ±5cm per timestep to prevent collisions. RT-2 operates at 3Hz for high-level action selection with a 100Hz impedance controller executing the trajectory, demonstrating that VLAs need not run at full control frequency if paired with a lower-level stabilizer[10].
Looking to fine-tune a VLA model?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.
Browse VLA Training Datasets