
Glossary

RLHF: Reinforcement Learning from Human Feedback

RLHF is a three-stage training paradigm that aligns AI models with human preferences through pairwise comparison annotations, reward model training, and policy optimization. Annotators compare two candidate outputs and select the preferred one; these preferences train a reward model that scores outputs; reinforcement learning then fine-tunes the base model to maximize reward while maintaining proximity to the original policy via KL-divergence constraints.

Updated 2025-06-08
By truelabel
Reviewed by truelabel

Quick facts

Term: RLHF (Reinforcement Learning from Human Feedback)
Domain: Robotics and physical AI
Last reviewed: 2025-06-08

What RLHF Is and Why It Matters

RLHF emerged as the dominant alignment technique after OpenAI's InstructGPT demonstrated that human preference feedback could steer large language models toward helpful, harmless outputs without retraining from scratch. The method has since expanded into physical AI: RT-2 and OpenVLA apply RLHF principles to vision-language-action models, using human annotators to rank robot trajectory videos by task success, safety, and naturalness. The core insight is that humans are better at comparative judgments than absolute scoring—asking "which grasp is safer?" yields more consistent labels than "rate this grasp 1-10."

The three-stage pipeline begins with a supervised base model trained on demonstrations. For language models this is instruction-following data; for robots it is behavioral cloning from teleoperation or scripted trajectories. Stage two collects pairwise preferences: annotators see two outputs for the same input and pick the better one. These (input, chosen, rejected) triples train a reward model—a neural network that predicts a scalar score for any (input, output) pair. Stage three uses reinforcement learning (typically Proximal Policy Optimization) to fine-tune the base model, maximizing the reward model's score while a KL-divergence penalty prevents the policy from drifting too far from the original distribution.

Physical AI introduces unique challenges. Video-based preference annotation is slower and more expensive than text comparison; a single robot trajectory judgment can take 30-90 seconds versus 5-10 seconds for text[1]. Annotators must evaluate multi-modal signals—gripper pose, object contact dynamics, collision avoidance—that require domain expertise. Truelabel's marketplace connects robotics teams with annotators who have mechanical engineering or manipulation experience, reducing label noise in safety-critical preference datasets.

The Three-Stage RLHF Pipeline in Detail

Stage 1: Supervised Fine-Tuning (SFT). The base model is trained on high-quality demonstrations using standard supervised learning. For language models this is instruction-response pairs; for robots it is state-action trajectories from human teleoperation or expert policies. SFT establishes a reasonable behavioral prior—the model learns to follow instructions or complete tasks, but may produce outputs that are technically correct yet misaligned with human preferences (verbose text, jerky robot motions, unsafe grasps). The SFT checkpoint becomes the reference policy for stage three's KL constraint.

Stage 2: Reward Model Training. Human annotators compare pairs of outputs generated by the SFT model. Each comparison produces a preference label: given input x, annotators choose between outputs y₁ and y₂. The preference dataset D = {(x, y_win, y_lose)} is used to train a reward model r(x, y) that predicts a scalar score. The loss function is derived from the Bradley-Terry model: the probability that y₁ is preferred over y₂ is σ(r(x, y₁) - r(x, y₂)), where σ is the sigmoid function. The reward model is typically initialized from the SFT checkpoint and fine-tuned on preference data, learning to assign higher scores to human-preferred outputs.
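
As an illustration, here is a minimal sketch of the Bradley-Terry preference loss described above. The reward_model call signature (scoring a batch of (input, output) pairs to a vector of scalars) is an assumption for the example, not a fixed API.

    import torch.nn.functional as F

    def bradley_terry_loss(reward_model, x, y_win, y_lose):
        # r(x, y) scores for the chosen and rejected outputs; assumed shape (batch,)
        r_win = reward_model(x, y_win)
        r_lose = reward_model(x, y_lose)
        # -log sigmoid(r_win - r_lose), averaged over the batch;
        # logsigmoid is numerically stable when the score gap is large
        return -F.logsigmoid(r_win - r_lose).mean()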

Stage 3: Reinforcement Learning Fine-Tuning. The SFT policy is fine-tuned using PPO to maximize expected reward from the trained reward model, subject to a KL-divergence penalty that keeps the policy close to the SFT reference. The objective is E[r(x, y)] - β·KL(π || π_ref), where π is the learned policy, π_ref is the SFT policy, and β controls the strength of the KL constraint. This prevents reward hacking—the policy exploiting spurious correlations in the reward model to achieve high scores on out-of-distribution outputs that humans would reject. InstructGPT used β=0.02; physical AI applications often require higher β (0.05-0.1) because reward models trained on limited robot preference data are more prone to overfitting[2].
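
A minimal sketch of the KL-penalized objective, assuming per-example rewards and log-probabilities are already available as tensors. The default beta=0.02 follows the InstructGPT setting cited above, with 0.05-0.1 more typical for physical AI.

    def kl_regularized_reward(reward, logprob_policy, logprob_ref, beta=0.02):
        # Monte Carlo estimate of KL(pi || pi_ref) from log-probability differences
        kl = logprob_policy - logprob_ref
        # Shaped reward r(x, y) - beta * KL that the RL update maximizes
        return reward - beta * kl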

Pairwise Preference Annotation: Format and Quality Control

The pairwise comparison format is psychologically grounded: humans exhibit higher inter-annotator agreement on relative judgments than absolute ratings. Presenting two robot videos side-by-side and asking "which grasp is safer?" yields 15-20% higher Fleiss' kappa than asking annotators to rate each video independently on a 1-5 safety scale[3]. This reliability advantage compounds across thousands of comparisons, directly improving reward model accuracy.

Annotation interfaces must support frame-by-frame video scrubbing, slow-motion playback, and multi-angle views for manipulation tasks. Scale AI's physical AI platform provides synchronized multi-camera timelines; Appen's annotation tools support custom rubrics that break complex preferences into sub-questions ("which trajectory avoids obstacles better?" then "which completes the task faster?"). Decomposed preferences reduce cognitive load and improve label consistency, especially for novice annotators.

Quality control requires gold-standard test cases with known ground truth preferences, inserted at 10-15% frequency to measure annotator accuracy. Annotators below 80% agreement on test cases receive retraining or are removed from the pool. For high-stakes applications—surgical robotics, autonomous vehicles—double-annotation with adjudication is standard, increasing cost by 1.8-2.2× but reducing label error rates from 12-18% to 3-5%[4]. Truelabel's provenance tracking records annotator IDs, timestamps, and confidence scores for every preference label, enabling downstream audits of reward model training data.
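
A sketch of the gold-standard check described above, assuming each label is an (annotator_id, pair_id, choice) record and gold maps pair IDs to known-correct choices; the 80% threshold mirrors the retraining cutoff.

    from collections import defaultdict

    def annotator_gold_accuracy(labels, gold, threshold=0.80):
        correct, total = defaultdict(int), defaultdict(int)
        for annotator_id, pair_id, choice in labels:
            if pair_id in gold:  # only score embedded gold-standard comparisons
                total[annotator_id] += 1
                correct[annotator_id] += int(choice == gold[pair_id])
        accuracy = {a: correct[a] / total[a] for a in total}
        flagged = [a for a, acc in accuracy.items() if acc < threshold]
        return accuracy, flagged  # flagged annotators go to retraining or removal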

Reward Model Architecture and Training Dynamics

The reward model is a neural network r(x, y) → ℝ that scores (input, output) pairs. For language models, r is typically initialized from the SFT checkpoint with the final layer replaced by a scalar head; for vision-language-action models like RT-2, the reward model takes tokenized image observations and action sequences as input. Training uses the Bradley-Terry loss: -log σ(r(x, y_win) - r(x, y_lose)), summed over all preference pairs in the dataset.
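
A sketch of the scalar-head pattern, assuming a backbone (for example, the SFT checkpoint's transformer) that returns hidden states of shape (batch, seq_len, hidden_dim); the backbone interface here is an assumption, not any specific library's API.

    import torch.nn as nn

    class ScalarHeadRewardModel(nn.Module):
        def __init__(self, backbone, hidden_dim):
            super().__init__()
            self.backbone = backbone          # typically initialized from the SFT checkpoint
            self.reward_head = nn.Linear(hidden_dim, 1)  # replaces the original output layer

        def forward(self, inputs):
            hidden = self.backbone(inputs)    # assumed shape: (batch, seq_len, hidden_dim)
            # Read the score from the final position; one scalar per example
            return self.reward_head(hidden[:, -1, :]).squeeze(-1)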

Reward model capacity must match the complexity of the preference signal. Underparameterized models fail to capture nuanced human preferences ("grasp is safe but inefficient"); overparameterized models overfit to spurious correlations in small datasets, assigning high scores to out-of-distribution outputs that exploit annotation artifacts. A common heuristic is to use a reward model with 10-30% of the SFT policy's parameter count—a 7B-parameter policy pairs with a 1-2B reward model.

Ensemble reward models mitigate overfitting and reward hacking. Training 3-5 reward models with different random seeds and averaging their predictions reduces variance in the RL fine-tuning objective. Anthropic's Constitutional AI uses an ensemble of reward models trained on different preference subsets (helpfulness, harmlessness, honesty), then combines them via weighted sum. For physical AI, ensemble methods are critical because preference datasets are smaller (5,000-15,000 comparisons for a manipulation task versus 50,000-100,000 for language models) and reward model generalization is weaker[5].
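
A minimal sketch of ensemble scoring; each model call is assumed to return per-example scalar scores, and the standard deviation doubles as the disagreement signal used later for active learning.

    import torch

    def ensemble_reward(reward_models, x, y):
        # Stack scores from independently seeded reward models: (n_models, batch)
        scores = torch.stack([rm(x, y) for rm in reward_models])
        # Mean drives the RL objective; std flags out-of-distribution or contested outputs
        return scores.mean(dim=0), scores.std(dim=0)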

Proximal Policy Optimization for RLHF Fine-Tuning

PPO is the dominant RL algorithm for RLHF because it balances sample efficiency, stability, and ease of implementation. The policy π is fine-tuned to maximize J(π) = E[r(x, y)] - β·KL(π || π_ref), where r is the reward model, π_ref is the SFT reference policy, and β is the KL penalty coefficient. The KL term prevents the policy from drifting into regions where the reward model is poorly calibrated, which would cause reward hacking—generating outputs that score high on r but are nonsensical or unsafe.

PPO updates the policy using clipped surrogate objectives that limit the size of each gradient step, preventing catastrophic policy collapse. The clipping hyperparameter ε (typically 0.1-0.2) bounds the ratio of new to old policy probabilities, ensuring that updates are conservative. For physical AI, smaller ε values (0.05-0.1) are common because robot policies are more sensitive to distribution shift—a language model can recover from generating a few bad tokens, but a robot executing an unsafe action can cause hardware damage.
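
The clipped surrogate at the heart of PPO, sketched for per-example log-probabilities and advantage estimates; the default eps=0.1 matches the range quoted above.

    import torch

    def ppo_clipped_surrogate(logprob_new, logprob_old, advantages, eps=0.1):
        ratio = torch.exp(logprob_new - logprob_old)   # pi_new / pi_old
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
        # PPO maximizes the pessimistic (minimum) term; negate to use as a loss
        return -torch.min(unclipped, clipped).mean()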

LeRobot's RLHF implementation provides reference code for PPO fine-tuning of vision-language-action models, including KL penalty scheduling (starting at β=0.02 and annealing to 0.001 over training) and adaptive batch sizing to maintain stable gradient estimates. The framework supports distributed training across multiple robots, aggregating rollouts from parallel environments to improve sample efficiency. For tasks requiring 10,000-50,000 environment interactions, distributed PPO reduces wall-clock training time from days to hours.

RLHF in Physical AI: Challenges and Adaptations

Applying RLHF to physical AI introduces three major challenges: annotation cost, reward model sample efficiency, and sim-to-real transfer. Video-based preference annotation costs $0.50-$2.00 per comparison versus $0.05-$0.15 for text, limiting dataset sizes to 5,000-20,000 pairs. Reward models trained on these smaller datasets exhibit higher variance and poorer generalization, requiring ensemble methods and conservative KL penalties during RL fine-tuning.

DROID demonstrates a hybrid approach: collect a small high-quality preference dataset (3,000 comparisons) from expert annotators, then use the trained reward model to pseudo-label a larger set of unlabeled trajectories, filtering the top and bottom 20% by predicted reward to create additional preference pairs. This semi-supervised method increases effective dataset size by 2-3× without proportional annotation cost, though pseudo-labels introduce noise that must be managed via confidence weighting.
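
A sketch of the top/bottom filtering idea, assuming the trained reward model can score a single trajectory to a float; the names and pairing scheme are illustrative rather than DROID's exact procedure.

    def pseudo_label_pairs(trajectories, score_fn, fraction=0.2):
        # Sort unlabeled trajectories by predicted reward (ascending)
        ranked = sorted(trajectories, key=score_fn)
        k = max(1, int(len(ranked) * fraction))
        losers, winners = ranked[:k], ranked[-k:]
        # Pair high-scoring pseudo-winners with low-scoring pseudo-losers;
        # downstream training should down-weight these noisier labels
        return list(zip(winners, losers))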

Sim-to-real transfer complicates RLHF for robotics. Preferences collected on simulated trajectories may not transfer to real hardware due to visual domain gap and dynamics mismatch. Domain randomization during SFT training improves reward model robustness, but the gold standard is to collect preferences on real robot videos—expensive but necessary for safety-critical applications. Truelabel's marketplace offers tiered annotation: low-cost sim-based preferences for rapid iteration, high-cost real-robot preferences for final validation, with provenance tracking linking each preference label to its data source (sim vs. real, camera angles, lighting conditions).

Reward Hacking and Mitigation Strategies

Reward hacking occurs when the RL policy exploits flaws in the reward model to achieve high scores on outputs that humans would reject. Classic examples include language models generating verbose responses because the reward model spuriously correlates length with quality, or robots executing fast but jerky motions because the reward model was trained on preferences that emphasized task completion over smoothness.

The KL-divergence penalty is the first line of defense: by constraining the policy to stay close to the SFT reference, it prevents exploration into regions where the reward model is poorly calibrated. However, KL penalties alone are insufficient if the reward model has systematic biases. Red-teaming—manually searching for inputs that cause reward hacking—identifies failure modes that can be added to the preference dataset as negative examples.

Constitutional AI introduces a second reward model trained on rule-based preferences ("responses must not contain profanity," "grasps must not exceed 50N force") that acts as a hard constraint during RL fine-tuning. For physical AI, rule-based constraints are often derived from safety specifications: collision avoidance, joint torque limits, workspace boundaries. Combining learned reward models with rule-based constraints reduces reward hacking incidents by 60-80% in manipulation tasks[2], though at the cost of reduced policy flexibility.

Scaling Laws for RLHF Preference Data

Reward model accuracy scales predictably with preference dataset size, following a power law: test accuracy ∝ N^α, where N is the number of preference pairs and α ≈ 0.3-0.4 for language models, 0.2-0.3 for vision-language-action models. Doubling the preference dataset from 10,000 to 20,000 pairs improves reward model accuracy by 8-12%, which translates to 5-8% improvement in downstream policy performance after RL fine-tuning.

Diminishing returns set in around 50,000-100,000 preference pairs for language models; for physical AI the saturation point is lower (15,000-30,000 pairs) due to higher input dimensionality and sparser reward signals. Beyond saturation, additional preference data yields minimal accuracy gains unless it covers new regions of the input distribution—novel tasks, edge cases, failure modes. Active learning strategies that prioritize annotation of high-uncertainty pairs (where the reward model's ensemble variance is highest) improve sample efficiency by 30-50%, reaching target accuracy with 40-60% fewer labels[5].

Open X-Embodiment aggregates preference data across 22 robot embodiments and 160+ tasks, demonstrating that cross-task transfer improves reward model generalization. A reward model pretrained on 50,000 diverse preference pairs and fine-tuned on 2,000 task-specific pairs outperforms a model trained on 10,000 task-specific pairs alone, reducing annotation cost by 80% for new tasks. This finding motivates shared preference datasets as a public good—Truelabel's marketplace enables teams to contribute anonymized preference data to a commons pool, earning credits toward future annotation requests.

Alternative Alignment Methods: DPO and RLAIF

Direct Preference Optimization (DPO) eliminates the reward model and RL fine-tuning stages, directly optimizing the policy on preference data using a closed-form loss derived from the Bradley-Terry model. DPO is simpler to implement and more stable than PPO-based RLHF, but sacrifices flexibility—it cannot incorporate rule-based constraints or ensemble reward models. For physical AI, DPO is effective for tasks with dense, unambiguous preferences ("pick up the red cube") but struggles with sparse, multi-objective preferences ("grasp safely, quickly, and smoothly").

Reinforcement Learning from AI Feedback (RLAIF) replaces human annotators with a large language model that generates preference labels by comparing outputs against a rubric. Anthropic's Constitutional AI uses Claude to label preferences for harmlessness, reducing human annotation cost by 90%. For physical AI, RLAIF is less mature—vision-language models like GPT-4V can compare robot videos for obvious failures (collisions, dropped objects) but struggle with nuanced judgments (grasp stability, motion efficiency) that require physical intuition.

Hybrid approaches combine human and AI feedback: use RLAIF to generate a large noisy preference dataset (50,000 pairs), then have humans annotate a smaller high-quality subset (5,000 pairs) and train the reward model on both with confidence weighting. This reduces annotation cost by 60-70% while maintaining reward model accuracy within 3-5% of fully human-labeled baselines[3]. Truelabel's platform supports hybrid workflows, routing easy comparisons to AI labelers and hard cases to human experts based on ensemble reward model uncertainty.
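
A sketch of confidence-weighted training on mixed human and AI labels; the weighting convention (for example, 1.0 for human pairs, lower for RLAIF pairs) is illustrative, not a standard from the cited work.

    import torch.nn.functional as F

    def weighted_preference_loss(r_win, r_lose, weights):
        # Per-pair Bradley-Terry loss, then a weighted average so noisy
        # AI-generated labels contribute less than human labels
        per_pair = -F.logsigmoid(r_win - r_lose)
        return (weights * per_pair).sum() / weights.sum()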

RLHF for Multi-Task and Generalist Policies

Training a single policy across multiple tasks via RLHF requires preference datasets that span the task distribution. RT-2 collected 50,000 preference pairs across 6,000 tasks, with annotators comparing trajectories for task success, safety, and naturalness. The reward model learns a shared representation of "good behavior" that transfers across tasks—a policy fine-tuned on RT-2's reward model generalizes to novel tasks with 40% higher success rate than task-specific RLHF policies.

Task conditioning is critical: the reward model must take task descriptions as input (text instructions, goal images) to distinguish preferences that are task-dependent. "Move quickly" is preferred for pick-and-place but not for delicate assembly; the reward model must learn this context-dependence from the preference dataset. OpenVLA uses a vision-language encoder to embed task instructions and visual observations into a shared latent space, enabling the reward model to generalize across 1,000+ tasks with 12,000 preference pairs—10× fewer than task-specific baselines.

Open X-Embodiment's cross-embodiment preference data demonstrates that reward models can transfer across robot morphologies. A reward model trained on preferences from Franka Panda trajectories generalizes to UR5 and Kinova Gen3 arms with 15-25% accuracy degradation, which is recovered by fine-tuning on 500-1,000 embodiment-specific preference pairs. This finding suggests that large-scale preference datasets can amortize annotation cost across the robotics community, similar to how ImageNet pretraining benefits all computer vision tasks.

Annotation Workforce and Quality Assurance

RLHF annotation requires domain expertise that varies by application. Language model preferences can be labeled by generalist crowdworkers with 1-2 hours of training; physical AI preferences require annotators with robotics or mechanical engineering backgrounds who understand manipulation constraints, collision dynamics, and task success criteria. Scale AI maintains a vetted pool of 2,000+ robotics-trained annotators; Appen offers domain-specific recruitment for autonomous vehicles and industrial automation.

Annotator training includes calibration sessions where workers label gold-standard examples with known ground truth, receive feedback on errors, and iterate until they achieve 85%+ agreement with expert labels. Training duration ranges from 2 hours for simple pick-and-place preferences to 8 hours for complex assembly tasks with multi-step success criteria. Sama reports that structured training reduces inter-annotator disagreement from 25-30% to 8-12%, directly improving reward model accuracy.

Ongoing quality assurance uses a mix of gold-standard test cases (10-15% of annotations), peer review (5% of annotations cross-checked by senior annotators), and statistical outlier detection (flagging annotators whose label distributions deviate >2σ from the pool mean). Truelabel's marketplace surfaces annotator performance metrics—accuracy on test cases, average labeling time, inter-annotator agreement—enabling buyers to select high-quality workers and adjust pricing based on task difficulty. Transparent quality metrics reduce the need for over-annotation (labeling each pair 2-3× for consensus), cutting costs by 40-60% while maintaining label reliability.
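
A sketch of the >2σ outlier flag, using the rate at which each annotator picks the first-presented output as one simple distributional statistic; the choice of statistic is illustrative.

    import statistics

    def flag_outlier_annotators(first_choice_rate, threshold=2.0):
        # first_choice_rate maps annotator_id -> fraction of pairs where the
        # first-presented output was chosen (a positional-bias signal)
        rates = list(first_choice_rate.values())
        mean, stdev = statistics.mean(rates), statistics.pstdev(rates)
        if stdev == 0:
            return []
        return [a for a, r in first_choice_rate.items()
                if abs(r - mean) > threshold * stdev]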

Cost Structure and Budget Planning

RLHF annotation costs vary by modality and task complexity. Text preference pairs cost $0.05-$0.15 per comparison for generalist crowdworkers, $0.20-$0.50 for domain experts (legal, medical). Video-based robot preferences cost $0.50-$2.00 per comparison depending on video length (10-60 seconds), number of camera angles (1-4), and required expertise (novice vs. robotics engineer). A typical manipulation task requires 5,000-15,000 preference pairs for reward model training, totaling $2,500-$30,000 in annotation cost.

Reward model training and RL fine-tuning add compute costs. Training a 1B-parameter reward model on 10,000 preference pairs takes 4-8 GPU-hours on A100s ($20-$40 at cloud rates); PPO fine-tuning for 20,000 environment steps takes 50-200 GPU-hours ($250-$1,000) depending on policy size and parallelization. Total RLHF cost for a single manipulation task is $3,000-$35,000, dominated by annotation for video-based preferences and compute for large policies.

Batch processing and task reuse reduce marginal costs. Collecting preferences for 10 related tasks (different objects, same manipulation primitive) amortizes annotator training and reward model pretraining, reducing per-task cost by 40-60%. Truelabel's request system enables teams to post annotation requests and receive bids from multiple annotation providers, introducing price competition that reduces costs by 15-30% versus direct vendor contracts. Provenance tracking ensures that cost savings do not compromise label quality—every preference label includes annotator credentials, training completion status, and test-case accuracy.

Future Directions: Online RLHF and Active Learning

Online RLHF collects preference data during RL fine-tuning rather than in a separate stage, enabling the reward model to adapt as the policy explores new regions of the output distribution. The policy generates trajectory pairs, annotators label preferences in real-time, and the reward model is incrementally updated. Online RLHF reduces reward hacking by continuously refining the reward model on the policy's current output distribution, but requires low-latency annotation infrastructure—preferences must be labeled within minutes to keep pace with RL training.

LeRobot's online RLHF module queues trajectory pairs for annotation, trains the reward model on a sliding window of the 5,000 most recent preferences, and updates the RL objective every 500 policy steps. This approach reduces reward hacking incidents by 50% versus offline RLHF, at the cost of 2-3× higher annotation throughput requirements. Truelabel's real-time annotation API supports online workflows, routing preference requests to available annotators and returning labels within 2-5 minutes.
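
A minimal sketch of the sliding-window pattern described above (not LeRobot's actual interface): keep only the most recent labeled pairs and signal a reward model refresh every fixed number of policy steps.

    from collections import deque

    class OnlinePreferenceBuffer:
        def __init__(self, maxlen=5000, update_every=500):
            self.pairs = deque(maxlen=maxlen)   # oldest labels fall out automatically
            self.update_every = update_every
            self.steps = 0

        def add(self, preference_pair):
            self.pairs.append(preference_pair)

        def should_retrain(self, policy_steps=1):
            # Trigger a reward model update every `update_every` policy steps
            self.steps += policy_steps
            if self.steps >= self.update_every:
                self.steps = 0
                return True
            return False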

Active learning prioritizes annotation of high-uncertainty trajectory pairs where the reward model's predictions are least confident. Uncertainty is estimated via ensemble disagreement: if five reward models assign widely varying scores to a trajectory pair, that pair is informative and should be labeled by humans. Active learning reduces the number of preference labels needed to reach target reward model accuracy by 30-50%, cutting annotation cost proportionally. For physical AI, active learning is especially valuable because video annotation is expensive—spending $10,000 on 5,000 actively selected pairs outperforms $15,000 on 10,000 random pairs.
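
A sketch of uncertainty-driven selection, assuming each candidate pair is (x, y1, y2) and each ensemble member scores an (input, output) pair to a scalar tensor; the pairs with the largest disagreement on the score difference are routed to human annotators.

    import torch

    def select_pairs_for_annotation(candidate_pairs, reward_models, budget=100):
        def disagreement(pair):
            x, y1, y2 = pair
            # Score difference r(x, y1) - r(x, y2) under each ensemble member
            diffs = torch.stack([rm(x, y1) - rm(x, y2) for rm in reward_models])
            return diffs.std().item()
        # Annotate the highest-uncertainty pairs first
        ranked = sorted(candidate_pairs, key=disagreement, reverse=True)
        return ranked[:budget]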

RLHF Tooling and Infrastructure

Production RLHF pipelines require integrated tooling for preference collection, reward model training, and RL fine-tuning. LeRobot provides an end-to-end framework for physical AI RLHF, including video annotation interfaces, reward model architectures for vision-language-action inputs, and distributed PPO implementations. The framework supports multiple robot embodiments (Franka, UR5, Kinova) and integrates with Hugging Face Datasets for preference data versioning.

Scale AI's Rapid platform offers managed RLHF services: teams upload robot videos, Scale handles annotation workforce management and quality control, and delivers labeled preference datasets in RLDS or LeRobot format. Managed services reduce engineering overhead but cost 20-40% more than self-service platforms. Truelabel's marketplace occupies a middle ground—teams post requests specifying preference annotation requirements, vetted annotators bid on tasks, and Truelabel provides quality assurance and payment escrow, reducing overhead versus fully managed services while maintaining quality standards.

Data versioning and provenance tracking are critical for reproducibility and compliance. Every preference label should include metadata: annotator ID, timestamp, annotation interface version, video source (sim vs. real), and test-case accuracy. Truelabel's provenance system records this metadata in an immutable ledger, enabling downstream audits of reward model training data and compliance with AI Act transparency requirements. Versioned preference datasets support A/B testing of reward model architectures and hyperparameters, accelerating iteration cycles from weeks to days.

The external references below move from category-level context into specific task, dataset, format, and comparison detail.

External references and source context

  1. Scale AI: Expanding Our Data Engine for Physical AI. Scale AI's physical AI data engine and annotation cost benchmarks for video-based robot preferences. scale.com
  2. RT-1: Robotics Transformer for Real-World Control at Scale. RT-1 Robotics Transformer and reward hacking mitigation via rule-based constraints in manipulation tasks. arXiv
  3. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset. DROID's hybrid RLHF approach using expert preferences and pseudo-labeling, plus inter-annotator agreement metrics. arXiv
  4. Sama resources. Sama's quality control methods, annotator training protocols, and label error rate benchmarks. sama.com
  5. BridgeData V2: A Dataset for Robot Learning at Scale. BridgeData V2 scaling laws for reward model accuracy and active learning sample efficiency gains. arXiv


FAQ

How many preference pairs are needed to train an effective reward model for a manipulation task?

5,000-15,000 preference pairs are typical for single-task manipulation policies, with diminishing returns beyond 20,000 pairs. Multi-task reward models benefit from 30,000-50,000 pairs spanning diverse tasks and embodiments. Active learning can reduce these requirements by 30-50% by prioritizing annotation of high-uncertainty trajectory pairs where the reward model's ensemble predictions disagree.

Can RLHF reward models transfer across different robot embodiments?

Yes, with 15-25% accuracy degradation. A reward model trained on Franka Panda preferences generalizes to UR5 and Kinova Gen3 arms, and fine-tuning on 500-1,000 embodiment-specific preference pairs recovers most of the lost accuracy. Cross-embodiment transfer is most effective when the reward model uses task-conditioned representations that abstract over morphology differences, as demonstrated by Open X-Embodiment's 22-robot dataset.

What is the cost difference between text-based and video-based preference annotation?

Video-based robot preference annotation costs $0.50-$2.00 per comparison versus $0.05-$0.15 for text, a 10-20× difference driven by longer annotation time (30-90 seconds vs. 5-10 seconds), required domain expertise (robotics knowledge vs. general language fluency), and interface complexity (multi-camera video scrubbing vs. text comparison). Hybrid workflows using AI pre-labeling for easy cases and human annotation for hard cases reduce video annotation costs by 60-70%.

How does the KL-divergence penalty prevent reward hacking in RLHF?

The KL penalty constrains the RL-fine-tuned policy to stay close to the supervised fine-tuning reference policy, preventing exploration into regions where the reward model is poorly calibrated. Typical β values are 0.02 for language models and 0.05-0.1 for physical AI, where higher penalties are needed because robot reward models are trained on smaller preference datasets and exhibit weaker generalization. The KL term is balanced against expected reward in the RL objective, trading off alignment with the reward model against policy stability.

What are the main failure modes of RLHF in physical AI applications?

Three dominant failure modes: reward hacking (policy exploits reward model flaws to achieve high scores on unsafe or nonsensical outputs), poor sim-to-real transfer (preferences collected on simulated trajectories do not reflect real-world human preferences due to visual and dynamics gaps), and annotation noise (video-based preferences have higher inter-annotator disagreement than text, especially for nuanced judgments like grasp stability or motion smoothness). Mitigation strategies include ensemble reward models, real-robot preference collection for final validation, and structured annotation rubrics that decompose complex preferences into sub-questions.

Find datasets covering RLHF

Truelabel surfaces vetted datasets and capture partners working with RLHF. Send the modality, scale, and rights you need and we route you to the closest match.

Post Your RLHF Annotation Request