Physical AI Glossary
Reward Model
A reward model is a neural network trained on human preference annotations to predict scalar quality scores for robot trajectories or AI outputs. In physical AI, reward models convert pairwise human judgments—'trajectory A handles the object more carefully than B'—into continuous reward signals that guide reinforcement learning policies toward safer, smoother, and more task-aligned behavior without hand-crafted reward functions.
Quick facts
- Term: Reward Model
- Domain: Robotics and physical AI
- Last reviewed: 2025-06-15
What Is a Reward Model in Physical AI?
A reward model is a learned function r(s, a, τ) that maps robot states, actions, or full trajectories to scalar quality scores reflecting human preferences. Unlike classical reinforcement learning, where engineers manually specify reward functions (e.g., +1 for grasping success, -0.1 per timestep), reward models infer preferences from annotated comparisons. An annotator reviews two robot execution videos and selects the better one; the model learns to assign higher scores to preferred behaviors across dimensions like motion smoothness, collision avoidance, task completion speed, and object handling care.
The Deep RL from Human Preferences paper demonstrated that 700 binary preference labels could train Atari and MuJoCo policies competitive with hand-tuned reward functions. In robotics, RT-1 and RT-2 leverage large-scale trajectory datasets annotated with success labels, but reward models extend this by capturing nuanced quality gradients—distinguishing 'acceptable' from 'excellent' execution within successful attempts.
Reward models are central to InstructGPT and subsequent LLM alignment pipelines (GPT-4, Claude, Gemini), where they score text completions. Physical AI applies the same principle to continuous control: human annotators compare 10–30 second robot clips, and the trained reward model generalizes to score millions of unseen trajectories during policy optimization. Truelabel's marketplace connects teams building reward models with 12,000+ collectors who capture diverse manipulation, navigation, and teleoperation preference data[1].
How Reward Models Enable RLHF for Robotics
Reinforcement Learning from Human Feedback (RLHF) replaces hand-engineered reward functions with learned reward models. The three-stage pipeline: (1) collect a seed dataset of robot trajectories via teleoperation or scripted policies; (2) annotators compare trajectory pairs, selecting which better satisfies task goals; (3) train a reward model on these preferences using the Bradley-Terry model, where P(τ_A ≻ τ_B) = σ(r(τ_A) - r(τ_B)). The trained reward model then scores new trajectories during policy gradient optimization, guiding the robot toward human-preferred behaviors.
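As a concrete illustration, the Bradley-Terry objective reduces to a binary cross-entropy over score differences. The sketch below assumes a reward_model module that maps a batch of trajectory features to scalar scores; the names and shapes are illustrative, not a prescribed API.

```python
import torch.nn.functional as F

def bradley_terry_loss(reward_model, traj_a, traj_b, preferred_a):
    """Pairwise preference loss: P(A > B) = sigmoid(r(A) - r(B)).

    preferred_a holds 1.0 where annotators chose trajectory A, 0.0 for B.
    """
    r_a = reward_model(traj_a)   # (batch,) scalar trajectory scores
    r_b = reward_model(traj_b)
    logits = r_a - r_b           # log-odds that A is preferred
    return F.binary_cross_entropy_with_logits(logits, preferred_a)
```

Minimizing this loss pushes preferred trajectories toward higher scores without ever specifying reward magnitudes by hand.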
Open X-Embodiment aggregated 1 million+ robot trajectories across 22 embodiments, but raw trajectory success labels (binary 0/1) provide coarse training signals[2]. Reward models add resolution: annotators might prefer trajectory A because it avoids a near-collision (safety), uses smoother joint velocities (efficiency), or repositions the gripper mid-grasp (adaptability)—all qualities invisible to a binary success flag. The reward model learns these implicit criteria from comparative data, producing dense per-timestep or per-trajectory scores.
DROID collected 76,000 real-world manipulation trajectories but lacks preference annotations[3]. Teams training reward models on DROID must retrospectively annotate subsets or use proxy signals (task completion time, human intervention rate). Truelabel's request system enables targeted preference collection: specify embodiment, task family (pick-place, bimanual assembly), and quality dimensions (speed vs. caution), and collectors submit annotated comparison sets within 2–4 weeks.
Preference Annotation Mechanics and Quality Control
Preference annotation asks: 'Which trajectory better accomplishes the task?' Annotators watch two 10–30 second robot execution videos side-by-side and select A, B, or 'tie.' High-quality annotations require: (1) clear task definitions ('grasp the mug by the handle, not the body'); (2) consistent annotator training (calibration sessions with gold-standard examples); (3) inter-annotator agreement checks (Cohen's kappa ≥0.65 indicates acceptable consistency); (4) diverse comparison sampling (avoid showing only near-identical pairs or trivially different failures).
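A minimal consistency check along the lines of point (3), using scikit-learn's cohen_kappa_score; the 0.65 threshold mirrors the guideline above, and the label arrays are placeholder data.

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators' verdicts ("A", "B", or "tie") over the same comparison pairs.
annotator_1 = ["A", "B", "A", "tie", "B", "A", "B", "A"]
annotator_2 = ["A", "B", "tie", "tie", "B", "A", "A", "A"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
if kappa < 0.65:
    print(f"kappa={kappa:.2f}: below threshold, rerun calibration sessions")
else:
    print(f"kappa={kappa:.2f}: acceptable inter-annotator consistency")
```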
BridgeData V2 contains 60,000 trajectories but no preference labels[4]. Retrofitting preferences requires sampling trajectory pairs stratified by success rate—comparing two failures teaches the model what 'less bad' looks like, while comparing two successes teaches quality gradients within competent execution. A 10,000-trajectory dataset might yield 15,000–25,000 pairwise comparisons if each trajectory appears in 3–5 pairs on average.
Annotator disagreement is signal, not noise. If 40% prefer trajectory A and 60% prefer B, the reward model should be trained toward P(B ≻ A) ≈ 0.6 rather than a hard label, capturing the inherent ambiguity in 'careful handling' or 'natural motion.' EPIC-KITCHENS-100 demonstrated that egocentric video annotations benefit from multiple annotators per clip to model human judgment variance[5]. Physical AI reward models inherit this principle: 2–3 annotators per comparison pair improve model calibration and reduce overfitting to individual annotator biases.
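One way to operationalize this, sketched below under the assumption of 'A'/'B'/'tie' votes: aggregate multiple annotators' votes into a soft target, which can feed the same cross-entropy loss shown earlier (PyTorch's binary_cross_entropy_with_logits accepts fractional targets).

```python
from collections import Counter

def soft_label(votes):
    """Convert raw annotator votes into a soft training target P(A > B).

    Ties split their probability mass evenly between A and B.
    """
    counts = Counter(votes)                 # e.g. {"B": 2, "A": 1}
    return (counts["A"] + 0.5 * counts["tie"]) / len(votes)

# Three annotators reviewed the same clip pair:
print(soft_label(["B", "A", "B"]))          # 0.33..., train toward P(A > B) = 1/3
```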
Reward Model Architectures for Trajectory Scoring
Reward models for robotics typically process trajectory data as sequences of (observation, action, next_observation) tuples. Common architectures: (1) Transformer encoders that attend over full trajectories (10–100 timesteps), producing a single scalar score via a learned aggregation head; (2) Recurrent networks (LSTM, GRU) that process trajectories sequentially, outputting per-timestep rewards summed or averaged into a trajectory score; (3) Vision-language models (e.g., CLIP-based) that encode video observations and task descriptions, then score alignment via a learned reward head.
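A minimal version of option (1), assuming trajectories arrive as fixed-length sequences of flattened (observation, action) features; the dimensions and mean-pooling aggregation are illustrative choices, not a reference architecture.

```python
import torch.nn as nn

class TrajectoryRewardModel(nn.Module):
    """Transformer encoder over trajectory steps -> one scalar quality score."""

    def __init__(self, step_dim, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Linear(step_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)            # learned aggregation head

    def forward(self, steps):
        # steps: (batch, T, step_dim) flattened (observation, action) features
        h = self.encoder(self.embed(steps))          # (batch, T, d_model)
        return self.head(h.mean(dim=1)).squeeze(-1)  # mean-pool over time, score
```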
RT-2 uses a vision-language-action (VLA) backbone pretrained on web data, fine-tuned on robot trajectories[6]. Reward models can leverage the same pretrained representations: freeze the VLA encoder, add a 2-layer MLP reward head, and train only the head on preference data. This transfer learning reduces annotation requirements—1,000–3,000 preference pairs suffice when the encoder already captures object semantics and motion priors from Open X-Embodiment pretraining.
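The transfer-learning recipe is a few lines in practice. This sketch assumes you already have a pretrained encoder that exposes a feature vector per trajectory; only the small head is optimized on preference pairs.

```python
import torch.nn as nn

def attach_reward_head(encoder, feature_dim, hidden=512):
    """Freeze a pretrained encoder; train only a 2-layer MLP reward head."""
    for p in encoder.parameters():
        p.requires_grad = False                  # keep pretrained features fixed
    head = nn.Sequential(
        nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
    )
    return head  # pass head.parameters() to the optimizer
```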
For point-cloud or multi-sensor inputs (LiDAR + RGB + proprioception), reward models must fuse modalities before scoring. Segments.ai and Kognic provide multi-sensor annotation tools, but reward model architectures need custom fusion layers—e.g., PointNet++ for LiDAR, ResNet for RGB, concatenated with joint-position embeddings, then passed through a shared Transformer. Training such models requires 5,000–15,000 annotated trajectory pairs to learn robust cross-modal quality signals.
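A skeletal fusion layer consistent with that description; lidar_enc and rgb_enc stand in for pretrained per-modality encoders (PointNet++-style and ResNet-style), and the concatenate-then-project design is one simple choice—a shared Transformer trunk could replace the MLP head.

```python
import torch
import torch.nn as nn

class FusedRewardModel(nn.Module):
    """Concatenate per-modality embeddings, then score with a shared head."""

    def __init__(self, lidar_enc, rgb_enc, lidar_dim, rgb_dim, proprio_dim, d=256):
        super().__init__()
        self.lidar_enc, self.rgb_enc = lidar_enc, rgb_enc
        self.proj = nn.Linear(lidar_dim + rgb_dim + proprio_dim, d)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(d, 1))

    def forward(self, lidar, rgb, proprio):
        z = torch.cat([self.lidar_enc(lidar), self.rgb_enc(rgb), proprio], dim=-1)
        return self.head(self.proj(z)).squeeze(-1)
```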
Scaling Reward Models with Synthetic and Bootstrapped Data
Collecting 10,000+ human preference annotations is expensive ($0.50–$2.00 per comparison pair, $5,000–$20,000 total). Teams bootstrap reward models by: (1) training an initial model on 1,000–2,000 high-quality human labels; (2) using the model to score a larger unlabeled trajectory pool; (3) sampling high-disagreement pairs (where the model is uncertain) for human review; (4) retraining on the expanded dataset. This active learning loop reduces annotation costs by 40–60% while maintaining model accuracy[7].
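Step (3) of the loop, sketched with a small ensemble: disagreement over P(A ≻ B) identifies the pairs most worth sending for human labels. The ensemble size and selection count here are illustrative assumptions.

```python
import torch

def select_pairs_for_review(models, traj_a, traj_b, k=500):
    """Route the k most uncertain unlabeled pairs to human annotators.

    models: reward models trained on bootstrap samples of the labeled data.
    Uncertainty = variance of predicted P(A > B) across the ensemble.
    """
    with torch.no_grad():
        probs = torch.stack(
            [torch.sigmoid(m(traj_a) - m(traj_b)) for m in models]
        )                                        # (n_models, n_pairs)
    return probs.var(dim=0).topk(k).indices     # indices to send for labeling
```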
Synthetic preferences from simulation accelerate early-stage development. Domain randomization generates diverse simulated trajectories; rule-based heuristics (collision count, task completion time, energy expenditure) produce noisy preference labels. A reward model pretrained on 50,000 synthetic pairs, then fine-tuned on 2,000 real human labels, often outperforms a model trained on 5,000 real labels alone—the synthetic data teaches coarse quality distinctions, and real data calibrates fine-grained human preferences.
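The heuristic labeler might look like the sketch below: rollout statistics from simulation produce cheap, noisy preferences for pretraining. The field names and weights are placeholders, not validated values.

```python
def heuristic_preference(stats_a, stats_b):
    """Noisy synthetic label: 1.0 if trajectory A 'wins' under simple rules.

    stats_* are per-rollout dicts, e.g. {"collisions": 0, "duration_s": 12.4,
    "energy_j": 310.0}; the weights below are illustrative only.
    """
    def score(s):
        return -5.0 * s["collisions"] - 0.1 * s["duration_s"] - 0.01 * s["energy_j"]

    return 1.0 if score(stats_a) > score(stats_b) else 0.0
```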
NVIDIA Cosmos world foundation models generate photorealistic robot simulation videos[8]. Future reward model pipelines may train on Cosmos-rendered trajectories annotated by human reviewers who cannot distinguish synthetic from real footage, then transfer to physical robots. Truelabel's marketplace supports hybrid collection: collectors submit real teleoperation clips, and buyers augment with sim-rendered variants, creating 3:1 synthetic-to-real preference datasets that cost 50–70% less than pure real-world collection.
Reward Model Generalization and Distribution Shift
Reward models trained on narrow task distributions (e.g., pick-place with rigid objects on flat surfaces) often fail when deployed on novel objects, backgrounds, or manipulation strategies. Generalization requires: (1) diverse training data spanning object categories, lighting conditions, and gripper approaches; (2) regularization techniques (dropout, weight decay) to prevent overfitting to annotator-specific quirks; (3) evaluation on held-out embodiments or tasks to detect distribution shift before deployment.
Open X-Embodiment showed that policies trained on multi-embodiment data transfer better to new robots[2]. Reward models inherit this principle: a model trained on preference data from 5+ robot arms (Franka, UR5, Kinova) generalizes better to a 6th unseen arm than a model trained on a single embodiment. DROID's 76,000 trajectories span 564 scenes and 86 tasks, providing the diversity needed for robust reward model training—but only if preference annotations cover that diversity.
Out-of-distribution detection is critical. If a reward model assigns high scores to a trajectory that violates safety constraints the training data never exhibited (e.g., high-speed motion near a human), the policy will exploit this blind spot. Ensemble reward models (train 3–5 models on bootstrap samples of the preference data, flag trajectories where ensemble variance is high) reduce this risk. Truelabel's data provenance tracking logs embodiment, scene, and task metadata for every trajectory, enabling stratified evaluation: measure reward model accuracy separately on indoor vs. outdoor scenes, rigid vs. deformable objects, and single-arm vs. bimanual tasks.
Reward Models vs. Inverse Reinforcement Learning
Inverse Reinforcement Learning (IRL) infers a reward function from expert demonstrations by assuming the expert is (approximately) optimal under some unknown reward. Reward models, by contrast, learn from explicit human preferences without assuming optimality. IRL requires solving the forward RL problem repeatedly during training (computationally expensive), while reward models train via supervised learning on preference pairs (orders of magnitude faster).
Russell's 1998 IRL framework assumed access to an MDP and expert trajectories that maximize an unknown reward function. Physical AI rarely satisfies these assumptions: human teleoperation is noisy, suboptimal, and inconsistent across operators. Christiano et al. (2017) demonstrated that preference-based reward learning outperforms IRL when expert demonstrations are imperfect—a common scenario in robotics, where even skilled operators exhibit 15–25% task failure rates on complex manipulation.
Reward models also handle multi-objective trade-offs more naturally. An annotator comparing two trajectories implicitly balances speed, safety, and smoothness; the learned reward model captures this composite preference. IRL, by contrast, typically assumes a single scalar reward function, requiring manual weight tuning (e.g., r = 0.6·speed + 0.3·safety + 0.1·smoothness) that reward models learn automatically from data. For physical AI teams, reward models offer a pragmatic path to alignment: collect 2,000–5,000 preference pairs, train a model in 4–12 GPU-hours, and deploy—no MDP solver required.
Reward Hacking and Adversarial Robustness
Reward hacking occurs when a policy exploits reward model errors to achieve high scores without satisfying true task goals. Example: a reward model trained only on successful grasps might assign high scores to a robot that knocks objects off the table (removing them from view) because the training data never included 'object disappeared' scenarios. The policy learns to maximize the flawed reward signal rather than accomplish the intended task.
Adversarial robustness techniques mitigate reward hacking: (1) red-teaming the reward model by generating edge-case trajectories (via RL or human adversaries) and collecting preference labels on failures; (2) uncertainty-aware scoring where the model outputs (mean, variance) and the policy is penalized for high-variance predictions; (3) ensemble disagreement as a proxy for model confidence—if 5 reward models disagree on a trajectory's score, flag it for human review before policy training.
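Technique (2) can be implemented as a lower-confidence-bound score: subtract an uncertainty penalty estimated from ensemble spread, as in the sketch below. The beta coefficient is an assumed hyperparameter to tune.

```python
import torch

def pessimistic_reward(models, trajectories, beta=1.0):
    """Uncertainty-penalized score: ensemble mean minus beta * ensemble std.

    Penalizing high-variance predictions discourages the policy from
    exploiting regions where the reward model is unreliable.
    """
    with torch.no_grad():
        scores = torch.stack([m(trajectories) for m in models])  # (n_models, N)
    return scores.mean(dim=0) - beta * scores.std(dim=0)
```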
InstructGPT mitigated reward hacking in LLMs by iteratively collecting preference data on model-generated outputs, retraining the reward model, and repeating. Physical AI teams apply the same loop: deploy a policy trained on reward model v1, collect human feedback on its failures, retrain reward model v2 on the expanded dataset, and retrain the policy. Truelabel's marketplace supports iterative collection: buyers post 'failure mode requests' specifying undesired behaviors (e.g., 'robot arm moves too fast near obstacles'), and collectors submit trajectory pairs demonstrating safe vs. unsafe execution.
Reward Models in Multi-Task and Long-Horizon Settings
Multi-task reward models score trajectories across diverse tasks (pick-place, drawer opening, bimanual assembly) using a shared representation. Task-conditioning via language embeddings (e.g., CLIP text encoder) allows a single model to handle 50–200 tasks: the input is (trajectory, task_description), and the output is a task-specific quality score. This amortizes annotation costs—1,000 preference pairs per task × 100 tasks = 100,000 pairs, but a multi-task model trained on 20,000 pairs (200 per task) often achieves 80–90% of the per-task model's accuracy.
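A minimal task-conditioned scorer consistent with that setup: trajectory features and a text embedding of the task description are projected, concatenated, and scored. text_dim would match whichever text encoder you use (512 for CLIP ViT-B/32, for instance); all names are illustrative.

```python
import torch
import torch.nn as nn

class TaskConditionedRewardModel(nn.Module):
    """Score (trajectory, task_description) pairs with one shared model."""

    def __init__(self, traj_dim, text_dim, d=256):
        super().__init__()
        self.traj_proj = nn.Linear(traj_dim, d)
        self.text_proj = nn.Linear(text_dim, d)
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(2 * d, 1))

    def forward(self, traj_feat, task_emb):
        z = torch.cat([self.traj_proj(traj_feat), self.text_proj(task_emb)], dim=-1)
        return self.head(z).squeeze(-1)      # task-specific quality score
```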
RT-1 trained on 130,000 trajectories across 700 tasks, but lacked preference annotations[9]. A multi-task reward model retrofitted onto RT-1 data would require sampling 15,000–30,000 trajectory pairs stratified by task family (manipulation, navigation, mobile manipulation) and difficulty (easy/medium/hard based on success rate). OpenVLA provides a 7B-parameter VLA backbone suitable for multi-task reward model fine-tuning—freeze the vision-language encoder, add a task-conditioned reward head, and train on preference data.
Long-horizon tasks (e.g., 'prepare a meal': 12 subtasks, 3–5 minutes) require reward models that score both subtask execution and task sequencing. Hierarchical reward models decompose long trajectories into subtask segments, score each segment, and aggregate via learned weights. CALVIN demonstrated long-horizon language-conditioned manipulation across 34 tasks[10]; a hierarchical reward model for CALVIN would need 5,000–10,000 preference pairs covering subtask transitions (e.g., 'grasp object' → 'place in drawer' vs. 'grasp object' → 'place on shelf').
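One way to realize the learned aggregation, assuming trajectories are pre-segmented into subtask windows and a per-segment scorer already exists; the softmax weighting is an illustrative choice.

```python
import torch
import torch.nn as nn

class HierarchicalRewardModel(nn.Module):
    """Score each subtask segment, then aggregate with learned weights."""

    def __init__(self, segment_scorer, n_subtasks):
        super().__init__()
        self.segment_scorer = segment_scorer           # any per-segment model
        self.logits = nn.Parameter(torch.zeros(n_subtasks))

    def forward(self, segments):
        # segments: list of n_subtasks batched tensors, one per subtask window
        scores = torch.stack([self.segment_scorer(s) for s in segments], dim=-1)
        return (scores * torch.softmax(self.logits, dim=0)).sum(dim=-1)
```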
Reward Models for Sim-to-Real Transfer
Reward models trained on simulated trajectories often misalign with real-world human preferences due to visual and dynamics gaps. Sim-to-real transfer strategies: (1) train on sim data, fine-tune on 500–1,500 real preference pairs; (2) use domain-invariant features (proprioception, object poses) rather than raw pixels; (3) apply domain randomization during sim data generation to reduce visual overfitting[11].
RLBench provides 100+ simulated manipulation tasks with procedurally generated scenes[12]. A reward model trained on RLBench preference data (annotators compare sim videos) transfers to real robots if: (1) the sim renderer matches real camera characteristics (resolution, field-of-view, lens distortion); (2) object textures and lighting are randomized during training; (3) the model is fine-tuned on 1,000–2,000 real-world preference pairs post-deployment. Without fine-tuning, sim-trained reward models exhibit 20–40% accuracy drops on real data.
NVIDIA Cosmos generates photorealistic sim data that narrows the visual gap[8]. Future pipelines may train reward models entirely on Cosmos-rendered trajectories annotated by humans who treat them as real footage, then deploy to physical robots with minimal fine-tuning. Truelabel's marketplace supports sim-to-real workflows: buyers specify 'sim-rendered trajectories acceptable' in request requirements, and collectors submit preference annotations on mixed sim/real datasets, enabling cost-effective reward model pretraining before real-world deployment.
Integrating Reward Models into Policy Training Pipelines
Reward models integrate into policy training via three methods: (1) offline RL where the reward model scores a fixed dataset of trajectories, and the policy is trained to maximize expected reward without environment interaction; (2) online RL where the policy generates new trajectories, the reward model scores them, and policy gradients are computed from reward model outputs; (3) iterative RLHF where the policy is trained, deployed to collect new data, human preferences are collected on the new data, the reward model is retrained, and the cycle repeats.
LeRobot provides a unified interface for training policies on Open X-Embodiment datasets[13]. Integrating a reward model requires: (1) adding a reward_model.score(trajectory) call during data loading; (2) modifying the loss function to weight trajectory samples by predicted reward; (3) logging reward model scores alongside policy metrics (success rate, episode length) to detect reward hacking. A reward model can score a 10,000-trajectory dataset in 2–5 minutes on a single GPU, enabling rapid policy iteration.
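A generic offline scoring pass is sketched below; the dataloader unpacking is an assumption to adapt to your dataset format (RLDS, LeRobot, or custom), since this glossary does not prescribe a specific library API.

```python
import torch

def score_dataset(reward_model, dataloader, device="cuda"):
    """Attach reward-model scores to every trajectory in a fixed dataset."""
    reward_model.eval().to(device)
    scores = []
    with torch.no_grad():
        for batch in dataloader:                 # batch: trajectory features
            scores.append(reward_model(batch.to(device)).cpu())
    return torch.cat(scores)                     # one scalar per trajectory
```

Logged alongside success rate and episode length, these scores make reward hacking visible as a divergence between predicted reward and actual task outcomes.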
Online RL with reward models is sample-efficient but risks reward hacking. Christiano et al. used online preference collection: the policy generates trajectory pairs, humans label preferences, the reward model is retrained, and the policy is updated—all within a single training run. Physical AI teams rarely afford continuous human-in-the-loop annotation; instead, they collect 5,000–10,000 preference pairs upfront, train a reward model, run online RL for 50,000–200,000 environment steps, then collect another 2,000–5,000 preference pairs on policy-generated trajectories to retrain the reward model and continue training.
Commercial Reward Model Training Services and Platforms
Scale AI offers end-to-end reward model training: customers upload robot trajectories, Scale's annotation workforce collects preference labels, and Scale delivers a trained reward model as a PyTorch checkpoint[14]. Pricing: $15,000–$50,000 for 5,000–15,000 preference pairs plus model training. Turnaround: 3–6 weeks for data collection, 1–2 weeks for model training. Scale's annotators are trained on robotics-specific quality criteria (collision avoidance, smooth motion, task completion), but customers cannot audit individual annotations or retrain models on custom architectures.
Labelbox and Encord provide annotation platforms where customers manage their own annotator workforce (internal teams or third-party vendors). Customers upload trajectory videos, configure pairwise comparison interfaces, and export preference labels in JSON format for custom reward model training. Cost: $0.30–$1.20 per comparison pair (annotator labor only; platform fees separate). Flexibility: full control over annotation guidelines, quality checks, and model architecture, but requires in-house ML expertise to train and deploy reward models.
Truelabel's marketplace offers a hybrid model: buyers post preference annotation requests specifying task, embodiment, and quality dimensions; collectors (robotics labs, teleoperation providers) submit annotated trajectory pairs; Truelabel validates data quality via inter-annotator agreement checks and provenance tracking; buyers receive preference datasets in RLDS or LeRobot format within 2–4 weeks. Pricing: $0.40–$1.00 per comparison pair, 30–50% below Scale AI for equivalent quality. Buyers train reward models using open-source tools (LeRobot, Stable Baselines3) or commercial platforms (Weights & Biases, Comet).
Future Directions: Foundation Reward Models and Zero-Shot Transfer
Foundation reward models pretrained on millions of preference pairs across diverse tasks, embodiments, and domains could enable zero-shot transfer: a model trained on manipulation data scores navigation trajectories without task-specific fine-tuning. Analogous to RT-2's vision-language pretraining, foundation reward models would learn general notions of 'safe motion,' 'efficient execution,' and 'task completion' transferable across physical AI applications[6].
Pretraining requires 500,000–2,000,000 preference pairs—10–100× larger than current datasets. Open X-Embodiment contains 1 million trajectories but lacks preference annotations[2]. Retrofitting preferences via crowdsourced annotation ($0.50 per pair × 1 million = $500,000) or semi-automated labeling (rule-based heuristics + human validation on 10% of pairs) could create the first foundation reward model training set. NVIDIA Cosmos may accelerate this by generating synthetic preference data at scale[8].
Zero-shot reward models would transform physical AI procurement: instead of collecting task-specific preference data, teams fine-tune a foundation model on 200–500 examples and deploy. Early experiments show 40–60% accuracy on novel tasks without fine-tuning, rising to 75–85% with 500 task-specific pairs—a 5–10× reduction in annotation requirements compared to training from scratch. Truelabel's marketplace positions to aggregate the diverse preference data needed for foundation model pretraining, offering buyers access to pretrained checkpoints fine-tunable on proprietary task data within days rather than months.
External references and source context
- [1] truelabel physical AI data marketplace bounty intake (truelabel.ai). Truelabel's marketplace connects 12,000+ collectors with buyers needing preference annotation datasets for reward model training, offering 2–4 week turnaround.
- [2] Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). Aggregated 1 million+ robot trajectories across 22 embodiments, demonstrating that multi-embodiment training improves policy generalization.
- [3] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv). Collected 76,000 real-world manipulation trajectories across 564 scenes and 86 tasks, providing diversity for reward model training but lacking preference annotations.
- [4] BridgeData V2: A Dataset for Robot Learning at Scale (arXiv). Contains 60,000 trajectories but no preference labels, requiring retrospective annotation for reward model training.
- [5] Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100 (arXiv). Demonstrated that multiple annotators per video clip improve annotation quality by modeling human judgment variance.
- [6] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv). Uses vision-language-action pretraining on web data, providing a backbone architecture suitable for reward model fine-tuning on robot preference data.
- [7] Large image datasets: A pyrrhic win for computer vision? (arXiv). Active learning loops that sample high-disagreement examples for human review reduce annotation costs by 40–60% while maintaining model accuracy.
- [8] NVIDIA Cosmos World Foundation Models (NVIDIA Developer). World foundation models that generate photorealistic robot simulation videos, enabling large-scale synthetic preference data collection.
- [9] RT-1: Robotics Transformer for Real-World Control at Scale (arXiv). Trained on 130,000 robot trajectories across 700 tasks, demonstrating large-scale multi-task policy learning but lacking preference annotations for reward modeling.
- [10] CALVIN (arXiv). Demonstrated long-horizon language-conditioned manipulation across 34 tasks, motivating hierarchical reward models that score subtask execution and sequencing.
- [11] Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World (arXiv). Generates diverse simulated trajectories, enabling synthetic preference data for reward model pretraining.
- [12] RLBench: The Robot Learning Benchmark & Learning Environment (arXiv). Provides 100+ simulated manipulation tasks with procedurally generated scenes, suitable for sim-to-real reward model pretraining.
- [13] LeRobot documentation (Hugging Face). A unified interface for training policies on Open X-Embodiment datasets, simplifying reward model integration into policy training pipelines.
- [14] Scale AI physical AI (scale.com). Offers end-to-end reward model training services, collecting 5,000–15,000 preference pairs and delivering trained models for $15,000–$50,000.
FAQ
How many preference annotations are needed to train a production-grade reward model for a single manipulation task?
2,000–5,000 pairwise preference comparisons typically suffice for a single task (e.g., pick-place with rigid objects). This assumes diverse trajectory sampling: 30–40% of comparisons between two successful executions (teaching quality gradients), 30–40% between success and failure (teaching task completion), and 20–30% between two failures (teaching 'less bad' distinctions). If leveraging a pretrained vision-language backbone (e.g., CLIP, RT-2 encoder), 1,000–2,000 pairs may suffice. Multi-task models require 200–500 pairs per task across 10–50 tasks (2,000–25,000 total) to learn shared quality representations.
Can reward models trained on human teleoperation data evaluate autonomous policy outputs?
Yes, but with caveats. Reward models learn quality criteria (smoothness, safety, efficiency) that transfer from teleoperation to autonomous execution. However, autonomous policies may exhibit failure modes absent in teleoperation data (e.g., repetitive oscillations, reward hacking behaviors). Best practice: train the reward model on 70–80% teleoperation data, 20–30% early autonomous policy rollouts, and iteratively collect preference labels on policy-generated trajectories to retrain the model. Ensemble reward models (3–5 models trained on bootstrap samples) help detect out-of-distribution autonomous behaviors via high ensemble disagreement.
What is the cost difference between collecting preference annotations via crowdsourcing vs. expert roboticists?
Crowdsourced preference annotation (e.g., Amazon Mechanical Turk, Scale AI workforce) costs $0.30–$0.80 per comparison pair, with 2–5 day turnaround for 5,000 pairs. Expert roboticist annotation (PhD students, research engineers) costs $1.50–$3.00 per pair but provides higher inter-annotator agreement (Cohen's kappa 0.75–0.85 vs. 0.55–0.70 for crowdsourced). For safety-critical applications (surgical robotics, human-robot collaboration), expert annotation is justified. For general manipulation tasks, crowdsourced annotation with 2–3 labels per pair and majority-vote aggregation achieves acceptable quality at 40–60% lower cost. Truelabel's marketplace offers a middle ground: vetted collectors (robotics labs, teleoperation providers) at $0.40–$1.00 per pair, with domain expertise above typical crowdsourced workforces at costs below expert roboticists.
How do reward models handle multi-modal preferences where annotators disagree on which trajectory is better?
Annotator disagreement is informative signal. If 60% prefer trajectory A and 40% prefer B, the training target should be P(A ≻ B) ≈ 0.6 rather than a hard label, capturing inherent preference ambiguity. Training uses soft labels: instead of binary 'A wins,' the model predicts P(A ≻ B) = 0.6 and is trained via cross-entropy loss against the empirical preference distribution. This prevents overconfident predictions on ambiguous comparisons. Collecting 2–3 annotations per trajectory pair (rather than 1) improves calibration: the model learns that 'careful handling' is subjective, and different annotators weight speed vs. caution differently. Ensemble reward models further capture this uncertainty by training separate models on different annotator subsets.
Can reward models be fine-tuned for new tasks without collecting additional preference data?
Limited zero-shot transfer is possible if the new task shares quality dimensions with training tasks. A reward model trained on pick-place (valuing smooth motion, collision avoidance) will partially transfer to drawer-opening (same motion qualities apply). However, task-specific criteria (e.g., 'drawer must close fully' for drawer-opening) require 200–500 new preference pairs for fine-tuning. Foundation reward models pretrained on 500,000+ diverse preference pairs across 100+ tasks show 40–60% accuracy on novel tasks zero-shot, rising to 75–85% with 500 task-specific pairs—a 5–10× reduction vs. training from scratch. Current best practice: collect 500–1,000 preference pairs for each new task family (manipulation, navigation, mobile manipulation) and fine-tune a multi-task base model rather than training per-task models from scratch.
Find datasets covering reward models
Truelabel surfaces vetted datasets and capture partners working with reward models. Send us the modality, scale, and rights you need, and we'll route you to the closest match.
List Your Preference Dataset