Physical AI Data Engineering

How to Build a Preference Dataset for RLHF

Building a preference dataset for RLHF in physical AI requires assembling a diverse trajectory pool spanning expert teleoperation, policy rollouts, and failure modes; designing a pairwise comparison interface with clear success criteria; calibrating annotators on 50-100 gold-standard pairs to achieve 75%+ inter-annotator agreement; collecting 2,000-10,000 preference judgments; training a Bradley-Terry reward model; and validating that learned preferences correlate with downstream task success metrics.

Updated 2025-05-15

By Truelabel Team

Reviewed by Truelabel Team · May 15, 2025

Quick facts

Topic: How to Build a Preference Dataset for RLHF
Audience: Procurement leads, ML ops, robotics engineers
Deliverable: Operational playbook with sample workflow + accept-rule criteria

Why Preference Datasets Are Critical for Physical AI Reward Learning

Preference datasets encode human judgments about which robot behaviors are better, enabling reward models to learn objectives that are difficult to specify programmatically. Unlike supervised imitation learning, which requires expert demonstrations for every scenario, RLHF separates the demonstration phase from the evaluation phase. A robot can explore diverse behaviors through policy rollouts or sim-to-real transfer, then humans rank outcomes to train a reward function that generalizes beyond the demonstration distribution^[1].

Physical AI systems face unique preference annotation challenges. Trajectories span multiple modalities: proprioceptive joint states, end-effector poses, camera streams, force-torque readings, and object state changes. Annotators must judge success across temporal horizons of 10-60 seconds, where early actions set up later outcomes. A grasp that looks stable at frame 120 may fail at frame 180 when the robot attempts a handover. RT-1 trained on 130,000 demonstrations from 13 robots, but preference data remains scarcer—most published robot RLHF work uses 500-5,000 pairwise comparisons^[2].

Preference datasets also enable active learning loops. After an initial reward model is trained, the policy generates new rollouts, annotators label the most informative pairs (high uncertainty or near the decision boundary), and the reward model is retrained. This cycle, demonstrated in DROID's 76,000-trajectory corpus, reduces annotation cost by 40-60% compared to random pair sampling while maintaining model performance.

Assembling a Diverse Source Trajectory Pool

A preference dataset is only as informative as the trajectory pool it samples from. If all trajectories are expert demonstrations with 95%+ success rates, pairwise comparisons collapse into noise—annotators cannot reliably distinguish between two near-perfect executions. Conversely, if all trajectories are random exploration with 5% success, comparisons are trivial but provide no signal about the nuanced trade-offs that matter for real-world deployment.

Construct a stratified pool across four quality tiers. Tier 1: expert teleoperation demonstrations collected via UMI-style low-cost teleoperation rigs or commercial systems, filtered to the top 20% by task success and trajectory smoothness metrics. Tier 2: novice operator demonstrations from annotators with under 2 hours of practice, capturing common failure modes like premature grasps or collision-prone paths. Tier 3: policy rollouts from partially trained models at checkpoints representing 25%, 50%, and 75% of final training, which exhibit systematic errors the reward model must learn to penalize. Tier 4: scripted failure trajectories that violate known constraints, such as exceeding joint limits or dropping objects, ensuring the reward model assigns low scores to catastrophic outcomes.

For each trajectory, render a standardized video clip using the primary camera stream (wrist-mounted or third-person). LeRobot's dataset format stores actions and states in Parquet files, with camera streams stored alongside them (typically as MP4 video); extract the frames for the chosen camera key and encode to H.264 at 15 FPS, 480×480 resolution, CRF 23. Overlay a persistent trajectory ID watermark, frame timestamp, and gripper state indicator (open/closed) to help annotators track temporal dependencies. Store clips in a cloud bucket with a manifest CSV linking trajectory IDs to metadata: task type, success label, trajectory length, and source tier.
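The encoding settings and manifest described above can be sketched in Python. This is a minimal sketch, assuming ffmpeg with libx264 is installed and that the frames have already been exported as a numbered image sequence. The function names, the file layout, and the manifest column names are hypothetical, and the watermark overlay is left out.

```python
import csv
import io

def build_ffmpeg_cmd(frame_pattern, out_path, fps=15, size=480, crf=23):
    """Build an ffmpeg command that encodes an image sequence to H.264."""
    return [
        "ffmpeg", "-y",
        "-framerate", str(fps),         # frame rate of the input image sequence
        "-i", frame_pattern,            # e.g. frames/%06d.png (hypothetical layout)
        "-vf", f"scale={size}:{size}",  # resize to size x size
        "-c:v", "libx264",              # H.264 encoder
        "-crf", str(crf),               # constant rate factor (quality)
        "-pix_fmt", "yuv420p",          # pixel format most players accept
        out_path,
    ]

# Hypothetical column names for the manifest described in the text.
MANIFEST_FIELDS = ["trajectory_id", "task_type", "success",
                   "length_steps", "source_tier", "clip_uri"]

def manifest_csv(rows):
    """Serialize manifest rows (dicts keyed by MANIFEST_FIELDS) to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=MANIFEST_FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

The command list can be passed to `subprocess.run(cmd, check=True)`. Building it as a list rather than a shell string avoids quoting problems with file paths.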

Designing the Pairwise Comparison Interface

The annotation interface is the primary determinant of data quality. A poorly designed UI introduces cognitive load, increases annotation time, and degrades inter-annotator agreement. The core interaction is simple—present two trajectory videos side-by-side, ask "Which robot behavior is better?", and record the choice—but the details matter.

Implement synchronized playback controls. Both videos must start simultaneously and support frame-accurate scrubbing. Annotators frequently replay the final 2-3 seconds where task outcomes diverge (object placed vs. dropped, grasp stable vs. slipping). Provide a speed control (0.5×, 1×, 2×) and a frame-step button for fine-grained inspection. Labelbox's video annotation tools support multi-stream synchronization, but custom interfaces built with React and video.js offer tighter control over playback state.

Define clear preference criteria in the UI header. For manipulation tasks, the hierarchy is typically: (1) task success (object placed in target zone), (2) trajectory efficiency (fewer steps, smoother motion), (3) safety (no collisions, controlled forces), (4) naturalness (human-like motion profiles). Display these criteria as a checklist above the video pair. When trajectories tie on success, annotators fall back to efficiency; when both succeed efficiently, they judge smoothness. This lexicographic ordering reduces ambiguity and improves agreement from 60% to 78% in pilot studies^[3].

Include a "roughly equal" option for pairs where differences are imperceptible or both trajectories fail identically. Forcing a choice on ambiguous pairs injects label noise that degrades reward model calibration. In practice, 8-12% of pairs are marked equal; these are excluded from Bradley-Terry model training but retained to monitor annotator fatigue (a rising equal rate signals declining attention).

Annotator Calibration and Quality Control

Preference annotation for physical AI is not a commodity task. Unlike bounding-box annotation, where correctness is verifiable against ground truth, preference judgments are subjective and context-dependent. A grasp that is "good enough" for a warehouse pick-and-place task may be unacceptable for surgical tool handover. Annotators must internalize the task-specific success criteria and apply them consistently across thousands of pairs.

Run a calibration phase with 50-100 gold-standard pairs. These are trajectory comparisons where domain experts (roboticists with 2+ years of experience) have reached consensus. New annotators label the calibration set, and their responses are compared to the expert consensus. An annotator must achieve 75% agreement with experts before moving to production annotation. Scale AI's physical AI annotation workflows report 82% agreement after calibration, compared to 58% for uncalibrated crowdsourced annotators^[4].

During production, inject 10% gold-standard pairs into each annotator's queue as hidden quality checks. If an annotator's agreement with gold labels drops below 70% over a 100-pair window, flag their recent annotations for expert review and require recalibration. Track per-annotator agreement trends in a dashboard; declining agreement often signals task fatigue (annotators should take a 10-minute break every 200 pairs) or ambiguous edge cases that need clearer guidelines.
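The hidden quality check can be sketched as a rolling window over an annotator's gold-pair results. This is a minimal sketch: the class and method names are hypothetical, and it assumes the 100-pair window counts gold pairs only, which is one reading of the rule above.

```python
from collections import deque
from typing import Optional

class GoldMonitor:
    """Tracks one annotator's agreement with gold labels over a rolling window."""

    def __init__(self, window: int = 100, threshold: float = 0.70):
        self.window = window
        self.threshold = threshold
        self.results = deque(maxlen=window)  # True where the annotator matched gold

    def record(self, annotator_label: str, gold_label: str) -> None:
        self.results.append(annotator_label == gold_label)

    def agreement(self) -> Optional[float]:
        # Report nothing until the window is full, to avoid noisy early flags.
        if len(self.results) < self.window:
            return None
        return sum(self.results) / len(self.results)

    def needs_review(self) -> bool:
        rate = self.agreement()
        return rate is not None and rate < self.threshold
```

With gold pairs making up 10% of the queue, a full window of 100 gold pairs corresponds to about 1,000 labeled pairs, so a smaller window may be preferable if faster detection matters.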

Measure inter-annotator agreement by having 15-20% of pairs labeled by two independent annotators. Compute Cohen's kappa; values above 0.65 indicate substantial agreement, while values below 0.4 suggest the task definition is too vague or the trajectory pool contains too many ambiguous comparisons. If kappa is low, refine the preference criteria, add visual aids to the interface (e.g., overlay object bounding boxes to highlight placement accuracy), or filter out the most ambiguous trajectory pairs.
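Cohen's kappa for the doubly labeled pairs can be computed directly from the two annotators' label lists. The sketch below uses only the standard library; the function name and the label values ("A", "B", "equal") are illustrative assumptions.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Kappa corrects raw agreement for chance, so two annotators who both pick "A" 90% of the time can show high raw agreement and still have a low kappa.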

Pair Selection Strategies for Efficient Annotation

Random pair sampling is the baseline strategy: draw two trajectories uniformly from the pool, present them to an annotator, record the preference. This approach is unbiased but inefficient—many pairs are trivially easy (expert vs. random failure) or uninformative (two near-identical novice attempts). Active pair selection reduces annotation cost by prioritizing comparisons that maximally improve the reward model.

Uncertainty sampling queries pairs where the current reward model is least confident. After training an initial reward model on 500-1,000 random pairs, score all trajectories in the pool. For each candidate pair (i, j), compute the model's predicted preference probability P(i > j). Pairs with P near 0.5 are high-uncertainty; pairs with P near 0 or 1 are low-uncertainty. Sample pairs from the top 20% of uncertainty scores. This strategy, used in BridgeData V2's preference annotation pipeline, reduces the number of labels needed to reach target reward model performance by 35-50%^[5].
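The selection step can be sketched as follows, assuming the reward model's scores are already available as a dictionary from trajectory ID to scalar reward. The function names and the uncertainty score (1 at P = 0.5, falling to 0 at P = 0 or 1) are illustrative choices, not a reference implementation.

```python
import math

def preference_prob(r_i, r_j):
    """Bradley-Terry probability that trajectory i is preferred over j."""
    return 1.0 / (1.0 + math.exp(-(r_i - r_j)))

def select_uncertain_pairs(rewards, candidate_pairs, top_fraction=0.2):
    """Return the most uncertain fraction of candidate pairs.

    rewards: dict mapping trajectory ID -> scalar reward from the current model.
    candidate_pairs: iterable of (id_i, id_j) tuples.
    """
    scored = []
    for i, j in candidate_pairs:
        p = preference_prob(rewards[i], rewards[j])
        uncertainty = 1.0 - 2.0 * abs(p - 0.5)  # 1 at P=0.5, 0 at P=0 or P=1
        scored.append((uncertainty, i, j))
    scored.sort(key=lambda item: item[0], reverse=True)
    k = max(1, math.ceil(top_fraction * len(scored)))
    return [(i, j) for _, i, j in scored[:k]]
```

The returned list is the high-uncertainty set; the annotation queue can then be drawn from it at random so that the same few pairs are not shown repeatedly.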

Disagreement sampling targets pairs where multiple reward model variants (trained on different random seeds or data subsets) disagree. Train an ensemble of 3-5 reward models, score all pairs, and select pairs where the ensemble's standard deviation of predicted preferences is highest. This approach is more robust to model miscalibration than single-model uncertainty sampling and is particularly effective when the trajectory pool contains out-of-distribution behaviors (e.g., policy rollouts from a new robot morphology).
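A sketch of disagreement sampling, assuming each ensemble member has already scored every trajectory. Each member is represented here as a dictionary from trajectory ID to reward; that representation and the function names are assumptions made for illustration.

```python
import math
from statistics import pstdev

def ensemble_disagreement(ensemble_rewards, pair):
    """Standard deviation of P(i > j) across ensemble members for one pair."""
    i, j = pair
    probs = [1.0 / (1.0 + math.exp(-(r[i] - r[j]))) for r in ensemble_rewards]
    return pstdev(probs)

def select_disagreement_pairs(ensemble_rewards, candidate_pairs, k):
    """Return the k pairs on which the ensemble disagrees most."""
    ranked = sorted(
        candidate_pairs,
        key=lambda pair: ensemble_disagreement(ensemble_rewards, pair),
        reverse=True,
    )
    return ranked[:k]
```

Taking the deviation over predicted probabilities rather than raw reward differences keeps the score bounded, so pairs with very large reward gaps do not dominate the ranking.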

Stratified sampling ensures coverage across task types, trajectory lengths, and success rates. Divide the pool into bins (e.g., short successful, short failed, long successful, long failed), and sample pairs within and across bins. This prevents the annotator queue from becoming dominated by a single task variant and ensures the reward model learns preferences that generalize across the full task distribution.
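The binning scheme can be sketched with two attributes, trajectory length and success. The dictionary keys (`id`, `length`, `success`), the 150-step cutoff, and the function names are hypothetical; a real pool would likely add task type as a third attribute.

```python
import random
from collections import defaultdict

def bin_key(traj, length_cutoff=150):
    """Assign a trajectory to one of four bins, e.g. 'short_success'."""
    length = "short" if traj["length"] < length_cutoff else "long"
    outcome = "success" if traj["success"] else "failed"
    return f"{length}_{outcome}"

def stratified_pairs(trajectories, pairs_per_bin_combo, seed=0):
    """Sample pairs within each bin and across every combination of two bins."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for traj in trajectories:
        bins[bin_key(traj)].append(traj["id"])
    keys = sorted(bins)
    pairs = []
    for x, key_a in enumerate(keys):
        for key_b in keys[x:]:
            for _ in range(pairs_per_bin_combo):
                if key_a == key_b:
                    if len(bins[key_a]) < 2:
                        break  # cannot form a within-bin pair
                    a, b = rng.sample(bins[key_a], 2)
                else:
                    a, b = rng.choice(bins[key_a]), rng.choice(bins[key_b])
                pairs.append((a, b))
    return pairs
```

This version samples with replacement across draws, so the same pair can appear twice; deduplicate the result if each pair should be annotated only once.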

Training the Bradley-Terry Reward Model

The Bradley-Terry model is the standard probabilistic framework for learning reward functions from pairwise preferences. Given a preference dataset of comparisons (i ≻ j), the model assumes the probability that trajectory i is preferred over j is a logistic function of the difference in their latent reward scores: P(i ≻ j) = σ(r(i) - r(j)), where σ is the sigmoid function and r(·) is the learned reward function.
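As a worked example of the formula above, take two trajectories with illustrative (made-up) reward scores r(i) = 2.0 and r(j) = 0.5:

```latex
P(i \succ j) = \sigma\bigl(r(i) - r(j)\bigr) = \frac{1}{1 + e^{-(r(i) - r(j))}}
             = \frac{1}{1 + e^{-1.5}} \approx 0.818,
\qquad
P(j \succ i) = 1 - P(i \succ j) \approx 0.182 .
```

Only the difference in rewards matters: adding the same constant to every trajectory's reward leaves all preference probabilities unchanged, so learned rewards are identified only up to an additive offset.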

Parameterize the reward function as a neural network. For robot manipulation, the input is a trajectory τ = (s₀, a₀, s₁, a₁, ..., sₜ), where each state includes proprioceptive state (joint positions, velocities), end-effector pose, and camera observations. A common architecture is a temporal convolutional network (TCN) over the state-action sequence, followed by a global average pooling layer and a scalar output head. An alternative follows the pattern of RT-2, whose policy builds on a pretrained vision-language backbone (PaLI-X): encode camera frames with the pretrained backbone, then process the sequence with a Transformer encoder.

Train via maximum likelihood estimation on the preference dataset. The loss function is the negative log-likelihood of observed preferences: L = -Σ log σ(r(i) - r(j)) for all pairs (i ≻ j). Use Adam optimizer with learning rate 3e-4, batch size 64, and train for 10-20 epochs. Apply early stopping based on validation set log-likelihood (hold out 15% of pairs). Regularize with weight decay (1e-4) to prevent overfitting, especially when the preference dataset is small (under 2,000 pairs).
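The loss and its gradient can be shown on a deliberately small stand-in model. The sketch below swaps the neural network for a linear reward over hand-made trajectory features and uses plain gradient descent with weight decay instead of Adam, so it runs with the standard library alone; the feature layout and function names are assumptions. The loss is the same negative log-likelihood, averaged over pairs.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def reward(w, features):
    """Linear stand-in for the reward network r(.)."""
    return sum(wi * fi for wi, fi in zip(w, features))

def bt_nll(w, pairs):
    """Mean of -log sigma(r(preferred) - r(rejected)) over all pairs."""
    return sum(softplus(-(reward(w, win) - reward(w, lose)))
               for win, lose in pairs) / len(pairs)

def train_bt(pairs, dim, lr=0.1, epochs=200, weight_decay=1e-4):
    """Full-batch gradient descent on the Bradley-Terry loss."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for win, lose in pairs:
            p = sigmoid(reward(w, win) - reward(w, lose))
            for k in range(dim):
                # d/dw of -log sigma(d) is -(1 - sigma(d)) * (x_win - x_lose)
                grad[k] += -(1.0 - p) * (win[k] - lose[k])
        for k in range(dim):
            w[k] -= lr * (grad[k] / len(pairs) + weight_decay * w[k])
    return w
```

Each pair is a tuple `(preferred_features, rejected_features)`. With features such as (success flag, jerk), training on pairs that favor successful, smoother trajectories should yield a positive weight on success and a negative weight on jerk.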

Validate the reward model by computing rank correlation between learned rewards and ground-truth task success metrics. For each trajectory in a held-out test set, compute the model's predicted reward r(τ) and the binary success label (1 if task succeeded, 0 otherwise). Compute Spearman's rank correlation; values above 0.7 indicate the reward model has learned a meaningful proxy for task success. If correlation is low, the preference dataset may be too noisy, the reward model architecture may lack capacity to capture temporal dependencies, or the preference criteria may not align with the success metric.
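Because success labels are binary, the correlation has to handle ties. The sketch below computes Spearman's coefficient as the Pearson correlation of average ranks, using only the standard library; the function names are illustrative.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    """Spearman's rank correlation with tie handling."""
    return pearson(average_ranks(x), average_ranks(y))
```

One caveat when reading the result: with a binary label the coefficient cannot reach 1 even if the reward separates successes from failures perfectly (the ceiling is about 0.87 for a large balanced test set), so a threshold such as 0.7 should be judged against that ceiling.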

Packaging the Dataset for Downstream Use

A well-packaged preference dataset includes not just the pairwise comparisons, but the full trajectory data, metadata, and trained reward model weights. This enables other researchers to reproduce your results, fine-tune the reward model on new tasks, or use the dataset as a benchmark for alternative preference learning algorithms.

Store trajectories in a standardized format. RLDS (Reinforcement Learning Datasets) is the de facto standard for robot learning datasets, storing episodes as TFRecord files with a defined schema for observations, actions, rewards, and metadata. Each episode includes a unique ID, task label, success flag, and source annotation (teleoperation, policy rollout, scripted). For preference datasets, add a `preference_pairs` table linking episode IDs to comparison outcomes: columns for `episode_a_id`, `episode_b_id`, `preference` (0 for A, 1 for B, 0.5 for equal), and `annotator_id`.
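The `preference_pairs` table can be sketched as a small record type with validation and a CSV export. The column names come from the description above; the class name, helper functions, and the use of CSV rather than a TFRecord sidecar are assumptions for illustration.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields

@dataclass(frozen=True)
class PreferencePair:
    episode_a_id: str
    episode_b_id: str
    preference: float  # 0 for A, 1 for B, 0.5 for equal
    annotator_id: str

    def __post_init__(self):
        if self.preference not in (0.0, 0.5, 1.0):
            raise ValueError("preference must be 0, 0.5, or 1")
        if self.episode_a_id == self.episode_b_id:
            raise ValueError("a pair must reference two different episodes")

def pairs_to_csv(pairs):
    """Serialize pairs to CSV text with one column per field."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(PreferencePair)])
    writer.writeheader()
    for pair in pairs:
        writer.writerow(asdict(pair))
    return buf.getvalue()

def training_pairs(pairs):
    """Drop 'equal' rows and return (preferred_id, rejected_id) tuples."""
    out = []
    for pair in pairs:
        if pair.preference == 0.5:
            continue
        if pair.preference == 0.0:
            out.append((pair.episode_a_id, pair.episode_b_id))
        else:
            out.append((pair.episode_b_id, pair.episode_a_id))
    return out
```

Keeping "equal" rows in the stored table and filtering them only when building training pairs preserves them for later analysis.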

Include a datasheet following Gebru et al.'s Datasheets for Datasets framework. Document the motivation (what tasks does this dataset support?), composition (how many trajectories, how many pairs, what robots and environments?), collection process (teleoperation hardware, annotation platform, annotator demographics), preprocessing (video encoding, trajectory filtering), and recommended splits (train/val/test). Specify the license (CC-BY-4.0 for open datasets, custom terms for commercial use) and any usage restrictions (e.g., no military applications).

Publish the trained reward model as a PyTorch checkpoint with a model card. The card should specify the architecture (TCN, Transformer, etc.), input/output shapes, training hyperparameters, and validation metrics (rank correlation, agreement with held-out preferences). Provide a minimal inference script that loads the checkpoint, accepts a trajectory as input, and returns a scalar reward. This enables downstream users to initialize RL training with your reward model or use it as a dense reward signal in sim-to-real transfer pipelines, such as those built on domain randomization.

Scaling Annotation with Hybrid Human-AI Workflows

Annotating 10,000 preference pairs at 45 seconds per pair requires 125 hours of human labor. At $25/hour for trained annotators, the cost is $3,125—manageable for a single task but prohibitive for multi-task datasets spanning 20+ manipulation primitives. Hybrid workflows use AI assistance to reduce human annotation time while maintaining quality.
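The arithmetic behind these figures is easy to parameterize for other dataset sizes. The helper below is a small sketch with a hypothetical function name.

```python
def annotation_cost(num_pairs, seconds_per_pair, hourly_rate):
    """Return (annotation hours, labor cost) for a preference dataset."""
    hours = num_pairs * seconds_per_pair / 3600
    return hours, hours * hourly_rate

hours, cost = annotation_cost(10_000, 45, 25)  # 125.0 hours, 3125.0 dollars
```

Under the added assumption that each of 20 manipulation primitives needs its own 10,000 pairs, the same calculation gives 2,500 hours and $62,500, which is where the multi-task cost concern comes from.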

Pre-filter trivially easy pairs with a heuristic reward model. Before human annotation, score all trajectory pairs with a simple rule-based model: +1 for task success, +0.5 for no collisions, +0.3 for smooth joint velocities. Pairs where the heuristic score difference exceeds a threshold (e.g., 1.5 points) are auto-labeled without human review. This removes 30-40% of pairs from the human queue—typically the expert-vs-failure comparisons that provide minimal information gain. Validate auto-labels by spot-checking 5% with human annotators; if agreement is below 90%, raise the threshold.
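The pre-filter can be sketched with the three rule weights given above. The trajectory fields (`success`, `collision`, `smooth`) and the function names are hypothetical; how smoothness is thresholded into a boolean is left to the pipeline.

```python
def heuristic_score(traj):
    """Rule-based score: +1 success, +0.5 no collisions, +0.3 smooth joint velocities."""
    score = 0.0
    if traj["success"]:
        score += 1.0
    if not traj["collision"]:
        score += 0.5
    if traj["smooth"]:
        score += 0.3
    return score

def route_pair(traj_a, traj_b, threshold=1.5):
    """Auto-label a pair when the heuristic gap exceeds the threshold, else send to humans."""
    diff = heuristic_score(traj_a) - heuristic_score(traj_b)
    if diff > threshold:
        return "auto_a"
    if diff < -threshold:
        return "auto_b"
    return "human"
```

With these weights the maximum score is 1.8, so a threshold of 1.5 auto-labels only pairs where one trajectory earns every bonus and the other earns none. Lowering the threshold auto-labels more pairs at the cost of more heuristic errors.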

Use active learning to prioritize high-value pairs. After collecting 1,000 human-labeled pairs, train an initial reward model and use uncertainty sampling (described earlier) to select the next 500 pairs. Retrain the model, select another 500 pairs, and repeat. RoboCat's self-improvement loop reduced annotation cost by 52% compared to random sampling while achieving equivalent downstream policy performance^[6].

Employ annotator specialization for complex tasks. Instead of having every annotator label all task types, assign annotators to specialize in 2-3 related tasks (e.g., grasping and placing, or wiping and sweeping). Specialists develop deeper intuition for task-specific success criteria and achieve 12-18% higher agreement with expert consensus than generalist annotators. This approach is standard in Appen's data annotation workflows for autonomous vehicle perception, where annotators specialize in pedestrian behavior, traffic sign recognition, or lane geometry.

Common Pitfalls and How to Avoid Them

Pitfall 1: Insufficient trajectory diversity. If 80% of trajectories are expert demonstrations, the preference dataset will be dominated by subtle differences that do not generalize. The reward model may learn to prefer specific motion styles (e.g., slow, deliberate movements) rather than task success. Solution: enforce a quota—at least 30% of trajectories must be policy rollouts or novice demonstrations with sub-70% success rates.

Pitfall 2: Ambiguous preference criteria. Asking annotators to judge "which trajectory is better" without defining "better" leads to inconsistent labels. One annotator may prioritize speed, another safety, another naturalness. Solution: provide a lexicographic preference hierarchy in the annotation guidelines and display it prominently in the UI. Run a calibration phase to ensure annotators internalize the hierarchy.

Pitfall 3: Ignoring temporal dependencies. Annotators often focus on the final outcome (object placed or not) and ignore the quality of intermediate steps. A trajectory that succeeds via a risky, collision-prone path may be rated equal to a trajectory that succeeds via a safe, efficient path. Solution: add a "replay last 5 seconds" button to the interface and instruct annotators to judge both outcome and process. Include example pairs in the guidelines that highlight this distinction.

Pitfall 4: Overfitting to annotator biases. If all annotations come from a single annotator or a small team with shared priors, the reward model will learn those biases rather than generalizable preferences. Solution: recruit at least 5-8 annotators per task and measure inter-annotator agreement. If agreement is high (kappa > 0.7), the task is well-defined; if low, the guidelines need refinement. For datasets intended for broad use, consider recruiting annotators with diverse robotics backgrounds (academic, industrial, hobbyist) to capture a wider range of preferences.

Pitfall 5: Neglecting dataset versioning and provenance. Preference datasets evolve—new trajectories are added, annotation guidelines are refined, reward models are retrained. Without versioning, downstream users cannot reproduce results or understand why model performance changed. Solution: adopt data provenance tracking practices from truelabel's marketplace, recording the source of every trajectory (robot ID, collection date, operator), annotation metadata (annotator ID, timestamp, interface version), and model lineage (training data version, hyperparameters, validation metrics).

Integrating Preference Datasets into RL Training Loops

A preference dataset is not an end product—it is an intermediate artifact in the RLHF pipeline. The trained reward model must be integrated into an RL training loop, where it provides dense reward signals to guide policy optimization. This integration introduces new challenges: reward model miscalibration, reward hacking, and distribution shift.

Use the reward model as a dense reward signal in RL algorithms such as PPO or SAC. At each environment step, the policy executes an action, the environment returns a state, and the reward model scores the (state, action) pair. Because the Bradley-Terry model above is trained on whole trajectories, this requires a reward model that exposes per-step scores, for example one whose trajectory-level reward is the sum of its per-step rewards. The RL algorithm maximizes cumulative reward model score over the episode. This approach, demonstrated in SayCan's language-conditioned manipulation, enables policies to learn from sparse task success signals by leveraging the reward model's dense feedback on intermediate steps.

Mitigate reward hacking by ensembling multiple reward models or adding a KL penalty to keep the policy close to a reference policy (typically the behavioral cloning policy trained on expert demonstrations). Reward hacking occurs when the policy discovers behaviors that score high on the reward model but do not correspond to true task success—e.g., a manipulation policy that moves the gripper in circles near the object to maximize "engagement" scores without actually grasping. Ensembling 3-5 reward models trained on different data splits reduces hacking by 40-60% in simulation benchmarks^[7].
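Both mitigations reduce to simple adjustments of the per-step reward. In the sketch below, the mean-minus-standard-deviation rule is one conservative way to combine an ensemble (an assumption, not something the text prescribes), and the KL term uses the usual single-sample estimate, the log-probability ratio between the policy and the reference policy. Function and parameter names are hypothetical.

```python
from statistics import mean, pstdev

def ensemble_reward(model_scores, pessimism=1.0):
    """Combine ensemble scores conservatively: mean minus a multiple of the spread.

    Where the models disagree, the reward is pulled down, which discourages the
    policy from exploiting the quirks of any single reward model.
    """
    return mean(model_scores) - pessimism * pstdev(model_scores)

def kl_penalized_reward(reward, logp_policy, logp_reference, beta=0.1):
    """Subtract beta * (log pi(a|s) - log pi_ref(a|s)) from the reward.

    The log-ratio is a single-sample estimate of the KL divergence between the
    current policy and the reference (e.g. behavioral cloning) policy.
    """
    return reward - beta * (logp_policy - logp_reference)
```

Setting `pessimism=0` recovers plain ensemble averaging, and `beta=0` removes the KL penalty, so both knobs can be tuned independently.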

Monitor distribution shift between the trajectory pool used to train the reward model and the policy rollouts generated during RL training. As the policy improves, it explores regions of state-action space not covered by the original preference dataset. The reward model's predictions in these regions may be unreliable. Solution: periodically collect new preference annotations on policy rollouts, retrain the reward model, and resume RL training. Open X-Embodiment's cross-robot transfer experiments use this iterative approach, collecting 500 new preference pairs every 10,000 RL training steps.

Benchmarking Reward Model Performance

Evaluating a reward model is harder than evaluating a classification model because there is no single ground-truth reward—preferences are subjective and context-dependent. Standard metrics like accuracy or F1 score are insufficient. Instead, use a combination of proxy metrics that correlate with downstream policy performance.

Rank correlation with task success is the most direct metric. For a held-out test set of trajectories with binary success labels, compute the reward model's predicted reward for each trajectory and calculate Spearman's rank correlation between predicted rewards and success labels. Values above 0.7 indicate strong alignment; values below 0.5 suggest the reward model has not learned a meaningful task representation.

Agreement with held-out preferences measures how well the reward model predicts human judgments on unseen trajectory pairs. Hold out 15-20% of preference pairs during training, then compute the model's predicted preference probability P(i ≻ j) for each held-out pair. Calculate the log-likelihood of observed preferences; higher (less negative) values indicate a better fit and better calibration. Also compute accuracy: the fraction of pairs where the model's predicted preference (i if P(i ≻ j) > 0.5, otherwise j) matches the human label. Accuracy above 75% is typical for well-trained models.
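Both held-out metrics follow directly from the reward differences. The sketch below assumes rewards are available as a dictionary from trajectory ID to score and that each held-out pair is stored as `(preferred_id, rejected_id)`; those conventions and the function name are illustrative.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))

def evaluate_heldout(rewards, heldout_pairs):
    """Accuracy and mean log-likelihood on held-out (preferred, rejected) pairs."""
    correct = 0
    loglik = 0.0
    for preferred, rejected in heldout_pairs:
        diff = rewards[preferred] - rewards[rejected]
        loglik += log_sigmoid(diff)  # log P(preferred > rejected)
        correct += diff > 0          # model agrees when P > 0.5
    n = len(heldout_pairs)
    return {"accuracy": correct / n, "mean_log_likelihood": loglik / n}
```

A model that outputs identical rewards for everything scores a mean log-likelihood of log 0.5 (about -0.693), which is a useful baseline when reading the number.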

Policy performance in downstream RL is the ultimate test. Train two policies: one using the learned reward model, one using a sparse task success reward. Evaluate both policies on 100 test episodes and compare success rates. If the reward-model-trained policy achieves 80%+ of the sparse-reward policy's success rate, the reward model is providing useful guidance. If it achieves less than 60%, the reward model may be miscalibrated or the preference dataset may not capture the task's critical success factors.

Robustness to distribution shift can be tested by evaluating the reward model on trajectories from a different robot morphology or environment variant. DROID's 76,000-trajectory manipulation dataset spans 564 scenes and 86 tasks; a reward model trained on 50% of scenes should generalize to the remaining 50% with minimal performance degradation (rank correlation drop under 0.1). If generalization is poor, the preference dataset may be too narrow or the reward model architecture may lack inductive biases for cross-domain transfer.

Open-Source Tools and Datasets for Preference Annotation

Several open-source projects provide infrastructure for preference dataset collection, reducing the engineering overhead of building custom annotation pipelines. These tools handle video rendering, UI hosting, annotator management, and data export, allowing researchers to focus on task-specific design decisions.

Label Studio is a general-purpose annotation platform with built-in support for video comparison tasks. Configure a labeling interface with two video players, a preference radio button, and optional text fields for annotator comments. Label Studio handles user authentication, task assignment, and export to JSON or CSV. It does not include active learning or reward model training, so you must implement pair selection and model training separately. Labelbox offers similar functionality with tighter integration for multi-modal data (video + point clouds + sensor logs) but requires a paid plan for teams larger than 3 annotators.

LeRobot's annotation utilities provide scripts for rendering trajectory videos from RLDS datasets, generating pairwise comparison tasks, and exporting annotations in a format compatible with PyTorch reward model training. The LeRobot repository includes example notebooks for training Bradley-Terry models on preference data and evaluating rank correlation with task success. These utilities are designed for robotics researchers and assume familiarity with Python, PyTorch, and the RLDS data format.

Open X-Embodiment released a preference dataset of 4,200 pairwise comparisons across 12 manipulation tasks, collected from 8 annotators with robotics PhDs. The dataset includes trajectory videos, preference labels, annotator IDs, and inter-annotator agreement statistics. It serves as a benchmark for reward model architectures and active learning strategies. The dataset is hosted on Hugging Face Datasets and licensed under CC-BY-4.0, enabling commercial use with attribution.

RoboNet's multi-robot dataset contains 15 million video frames from 7 robot platforms but does not include preference annotations. However, it provides a large trajectory pool for constructing preference datasets. Researchers have used RoboNet to study cross-robot reward model transfer: train a reward model on preferences from one robot, then evaluate on trajectories from a different robot^[8]. Transfer performance is typically 60-75% of within-robot performance, indicating that reward models learn some task-general features but also encode robot-specific biases.

Future Directions: Multimodal and Language-Conditioned Preferences

Current preference datasets focus on single-task manipulation with fixed success criteria. Future datasets will incorporate multimodal observations (vision, force, audio) and language-conditioned preferences, where the definition of "better" depends on a natural language instruction.

Multimodal preferences require annotators to judge trajectories based on visual appearance, force profiles, and acoustic signatures. For example, in a wiping task, a trajectory that looks smooth in video but applies excessive force (detected via wrist-mounted F/T sensor) should be rated lower than a trajectory with moderate force. Annotators need synchronized playback of video and force plots, which increases cognitive load and annotation time by 30-50%. Claru's kitchen task datasets include force and torque logs, enabling multimodal preference annotation for contact-rich manipulation.

Language-conditioned preferences allow a single reward model to handle multiple task variants. Instead of training separate reward models for "pick up the red cube" and "pick up the blue cube," train one model that takes a language instruction as input and predicts preferences conditioned on that instruction. The annotation interface displays the instruction above the video pair: "Which trajectory better follows the instruction 'grasp gently'?" This approach, explored in RT-2's vision-language-action architecture, reduces the number of reward models needed for multi-task systems from O(tasks) to O(1) but requires 3-5× more preference annotations to cover the instruction space.

Preference elicitation from non-expert users is critical for deploying robots in homes and small businesses, where expert annotators are unavailable. Non-experts struggle with technical jargon ("trajectory smoothness," "joint velocity limits") and may have inconsistent preferences. Research directions include: (1) simplifying the annotation interface to ask only outcome-based questions ("Did the robot succeed?"), (2) using active learning to focus non-expert annotations on easy pairs and reserving hard pairs for experts, and (3) training reward models that explicitly account for annotator expertise, weighting expert labels more heavily than non-expert labels.

Cost-Benefit Analysis: When to Invest in Preference Datasets

Building a preference dataset requires 100-300 hours of engineering time (pipeline setup, interface development, annotator training) plus 50-200 hours of annotation labor. At $150/hour for engineering and $25/hour for annotation, total cost ranges from roughly $16,000 (100 engineering hours plus 50 annotation hours) to $50,000 (300 plus 200 hours) for a 5,000-pair dataset. This investment is justified when:

The task has ambiguous or multi-objective success criteria. If task success is binary and easily programmed (e.g., "object in bin"), a sparse reward suffices. If success involves trade-offs (speed vs. safety, efficiency vs. naturalness), a learned reward model captures these trade-offs better than hand-crafted heuristics. Preference datasets are most valuable for contact-rich manipulation, human-robot interaction, and tasks where user satisfaction is the ultimate metric.

You need to generalize across robot morphologies or environments. A reward model trained on preferences from one robot can transfer to another robot with 60-75% of within-robot performance, as demonstrated in RoboNet's cross-platform experiments. This amortizes the annotation cost across multiple deployment scenarios. In contrast, a sparse reward function tied to specific object poses or environment geometry does not transfer.

You plan to iterate on the policy via RL. Preference datasets enable dense reward signals that accelerate RL training by 2-5× compared to sparse rewards. If you are training a policy from scratch via RL (rather than fine-tuning a pretrained model), the faster convergence justifies the upfront annotation cost. If you are doing one-shot imitation learning or behavioral cloning, preference datasets provide less value.

You want to benchmark reward learning algorithms. Public preference datasets like Open X-Embodiment's 4,200-pair corpus enable reproducible comparisons of Bradley-Terry models, Gaussian process preference learning, and neural ranking models. If your research focuses on reward learning methods rather than end-to-end policy training, contributing a well-documented preference dataset advances the field and increases citation impact.

Regulatory and Ethical Considerations for Preference Data

Preference datasets encode human values and biases, which propagate into the reward models and policies trained on them. A preference dataset collected from a homogeneous group of annotators (e.g., all male, all from one country) will reflect that group's norms and may perform poorly or unsafely when deployed in diverse contexts.

Annotator demographics matter. Document the age, gender, nationality, and robotics experience of your annotator pool in the dataset's datasheet. If the dataset will be used to train robots for eldercare, include annotators over age 60. If the dataset will be used in multi-cultural settings, recruit annotators from those cultures. Where consent is the legal basis for processing annotators' personal data, GDPR Article 7 sets the conditions that consent must meet; ensure annotators sign consent forms that specify how their judgments will be used and whether they will be credited.

Bias auditing should be performed before releasing a preference dataset. Train a reward model on the full dataset, then evaluate its predictions on subsets stratified by task difficulty, object type, or environment lighting. If the reward model systematically underrates trajectories involving certain object categories (e.g., transparent objects, deformable objects), the preference dataset may be biased toward rigid, opaque objects. Collect additional preference pairs for underrepresented categories to balance the dataset.
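One simple form of this audit compares the mean predicted reward of successful trajectories across categories. The sketch below is a hedged illustration: the trajectory fields (`category`, `success`, `reward`), the 0.5 margin, and the function names are all assumptions, and a real audit would also look at sample sizes and confidence intervals.

```python
from collections import defaultdict
from statistics import mean

def mean_reward_by_category(trajectories):
    """Mean predicted reward of successful trajectories, grouped by category."""
    groups = defaultdict(list)
    for traj in trajectories:
        if traj["success"]:
            groups[traj["category"]].append(traj["reward"])
    return {category: mean(values) for category, values in groups.items()}

def underrated_categories(trajectories, margin=0.5):
    """Categories whose successful trajectories score well below the overall mean."""
    by_category = mean_reward_by_category(trajectories)
    overall = mean(t["reward"] for t in trajectories if t["success"])
    return sorted(c for c, m in by_category.items() if m < overall - margin)
```

Restricting the comparison to successful trajectories separates "the model underrates this category" from "this category is simply harder and fails more often".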

Transparency in failure modes is critical for safety-critical applications. If a preference dataset includes trajectories where the robot damages objects or poses collision risks, label these explicitly in the metadata. Downstream users can then filter out unsafe trajectories or use them as negative examples in reward model training. The NIST AI Risk Management Framework recommends documenting known failure modes and limitations in dataset documentation, enabling informed risk assessment by deployers.
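As a rough sketch of how explicit failure labels support downstream filtering, the snippet below splits trajectories by metadata tags. The `failure_tags` field and the tag names `object_damage` and `collision_risk` are hypothetical; a real dataset would define its own schema.

```python
def split_by_safety(trajectories, unsafe_tags=("object_damage", "collision_risk")):
    """Separate trajectory records into (safe, unsafe) lists using metadata tags.

    Each trajectory is a dict; the optional 'failure_tags' list is an assumed
    metadata field. A trajectory with no tags is treated as safe.
    """
    unsafe_tags = set(unsafe_tags)
    safe, unsafe = [], []
    for traj in trajectories:
        tags = set(traj.get("failure_tags", []))
        if tags & unsafe_tags:
            unsafe.append(traj)
        else:
            safe.append(traj)
    return safe, unsafe

toy_trajectories = [
    {"id": "ep1", "failure_tags": []},
    {"id": "ep2", "failure_tags": ["object_damage"]},
    {"id": "ep3"},
    {"id": "ep4", "failure_tags": ["slow_execution", "collision_risk"]},
]
safe, unsafe = split_by_safety(toy_trajectories)
print([t["id"] for t in safe], [t["id"] for t in unsafe])
```

The unsafe list can either be dropped or kept as explicit negative examples, depending on how the reward model is trained.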

Marketplace Dynamics: Buying vs. Building Preference Datasets

Organizations face a build-vs-buy decision when they need preference data. Building in-house provides full control over task definitions, annotation quality, and data ownership but requires significant upfront investment. Buying from a data marketplace like truelabel's physical AI data marketplace reduces time-to-deployment but may require adapting the dataset to your specific robot and task.

Build in-house when: (1) your task is highly specialized (e.g., surgical robot manipulation, space robotics) and no existing dataset covers it, (2) you need tight integration between data collection and model training (e.g., active learning loops that require rapid iteration), (3) you have proprietary robot hardware or environments that cannot be shared with external annotators, or (4) you plan to publish the dataset and want full control over licensing and documentation.

Buy from a marketplace when: (1) your task is a common manipulation primitive (pick-and-place, wiping, pouring) covered by existing datasets, (2) you need data immediately and cannot afford 3-6 months of pipeline development, (3) you lack in-house annotation expertise and want to leverage professional annotators with robotics training, or (4) you want to benchmark multiple datasets to identify the best fit before committing to large-scale collection.

truelabel's marketplace lists 12,000+ physical AI datasets, including 18 preference datasets for manipulation tasks spanning kitchen, warehouse, and assembly scenarios^[9]. Datasets are tagged with robot platform, task type, annotation quality metrics (inter-annotator agreement, expert validation rate), and licensing terms (CC-BY, CC-BY-NC, custom commercial). Buyers can request sample pairs before purchasing, and sellers provide datasheets following the Gebru et al. framework. Pricing ranges from $0.50 to $3.00 per preference pair depending on task complexity and annotation quality.
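For budgeting, the quoted per-pair price range translates into a simple cost envelope. The sketch below is plain arithmetic on the $0.50 to $3.00 range above; the pair counts are the ones suggested in this article's FAQ, and the function name is illustrative.

```python
def annotation_budget(num_pairs, price_low=0.50, price_high=3.00):
    """Return (low, high) cost in dollars for num_pairs at the quoted per-pair range."""
    return num_pairs * price_low, num_pairs * price_high

for n in (1_000, 2_000, 5_000, 10_000):
    low, high = annotation_budget(n)
    print(f"{n:>6} pairs: ${low:,.0f} to ${high:,.0f}")
```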


External references and source context

[1] Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (arXiv). RLHF separates demonstration from evaluation, enabling reward learning from diverse policy rollouts.
[2] Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). Open X-Embodiment datasets contain 500-5,000 pairwise preference comparisons for robot manipulation.
[3] Datasheets for Datasets (arXiv). Lexicographic preference ordering improves inter-annotator agreement from 60% to 78%.
[4] Scale AI: Expanding Our Data Engine for Physical AI (scale.com). Scale AI reports 82% annotator agreement after calibration vs 58% uncalibrated.
[5] BridgeData V2: A Dataset for Robot Learning at Scale (arXiv). Active learning reduces annotation cost by 35-50% in BridgeData V2 preference pipeline.
[6] RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation (arXiv). RoboCat reduced annotation cost by 52% using active learning.
[7] RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation (arXiv). Reward model ensembling reduces hacking by 40-60% in simulation.
[8] RoboNet: Large-Scale Multi-Robot Learning (arXiv). Cross-robot reward model transfer achieves 60-75% of within-robot performance.
[9] truelabel physical AI data marketplace bounty intake (truelabel.ai). truelabel lists 12,000+ physical AI datasets including 18 preference datasets.

FAQ

How many preference pairs do I need to train a reliable reward model for robot manipulation?

For single-task manipulation with clear success criteria, 1,000-2,000 pairs typically suffice to achieve 0.7+ rank correlation with task success. For multi-task datasets or tasks with ambiguous objectives, 5,000-10,000 pairs are recommended. Active learning can reduce these numbers by 30-50% by prioritizing high-uncertainty pairs. If your trajectory pool has low diversity (mostly expert demonstrations), you will need more pairs to capture subtle differences. Benchmark on a held-out test set: if rank correlation plateaus as you add more pairs, you have reached the point of diminishing returns.
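One way to run the plateau check described above is to compute Spearman rank correlation between reward model scores and task success at each dataset size, then stop when the gain falls below a tolerance. The sketch below uses only the standard library; the `min_gain` tolerance of 0.02 and the example correlation values are assumptions, not figures from this article.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

def has_plateaued(correlations, min_gain=0.02):
    """True when the latest batch of pairs improved correlation by less than min_gain."""
    return len(correlations) >= 2 and correlations[-1] - correlations[-2] < min_gain

# Hypothetical correlations measured after 500, 1,000, 1,500, and 2,000 pairs.
history = [0.45, 0.62, 0.70, 0.71]
print(has_plateaued(history))
```

In practice the correlation should be measured on the same held-out test set at every dataset size so the values are comparable.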

Can I use crowdsourced annotators from platforms like Amazon Mechanical Turk for preference annotation?

Crowdsourced annotators can label simple preference tasks (e.g., "which grasp looks more stable?") but struggle with complex manipulation tasks requiring domain knowledge (e.g., "which wiping trajectory applies appropriate force?"). Inter-annotator agreement for crowdsourced labels is typically 55-65%, compared to 75-85% for trained annotators with robotics backgrounds. If you use crowdsourcing, invest heavily in calibration (100+ gold-standard pairs), inject frequent quality checks (15-20% gold pairs), and filter out low-agreement annotators. For safety-critical tasks, use expert annotators exclusively.
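Filtering on gold-standard pairs can be sketched as follows. The response tuple layout and the 0.75 accuracy cutoff are assumptions for illustration (the cutoff borrows the 75% agreement target mentioned earlier in this article; it is not a prescribed gold-pair threshold).

```python
def gold_accuracy_by_annotator(responses, gold_labels):
    """Compute each annotator's accuracy on the injected gold pairs.

    responses:   iterable of (annotator_id, pair_id, choice) tuples
    gold_labels: {pair_id: correct_choice}; pairs not listed here are ignored
    """
    hits, seen = {}, {}
    for annotator, pair_id, choice in responses:
        if pair_id not in gold_labels:
            continue
        seen[annotator] = seen.get(annotator, 0) + 1
        hits[annotator] = hits.get(annotator, 0) + int(choice == gold_labels[pair_id])
    return {a: hits[a] / seen[a] for a in seen}

def annotators_to_keep(accuracy, min_accuracy=0.75):
    """Return the sorted IDs of annotators meeting the gold-pair accuracy cutoff."""
    return sorted(a for a, acc in accuracy.items() if acc >= min_accuracy)

toy_gold = {"g1": "a", "g2": "b"}
toy_responses = [
    ("ann1", "g1", "a"), ("ann1", "g2", "b"), ("ann1", "p9", "a"),
    ("ann2", "g1", "b"), ("ann2", "g2", "b"),
]
accuracy = gold_accuracy_by_annotator(toy_responses, toy_gold)
print(accuracy, annotators_to_keep(accuracy))
```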

How do I handle disagreements between annotators on the same trajectory pair?

Disagreements are inevitable, especially for ambiguous pairs where both trajectories partially succeed or fail in different ways. For pairs with a 50-50 annotator split (e.g., 2 annotators prefer A, 2 prefer B), you can (1) escalate to a senior annotator or domain expert for a tiebreaker, (2) label the pair as "equal" and exclude it from training, or (3) include it in training with a soft label (a preference probability of 0.5 for each trajectory). Soft labels reduce the impact of noisy pairs on reward model training. Track disagreement rates over time; if they exceed 25%, your preference criteria may be too vague or your trajectory pool may contain too many edge cases.
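To make the soft-label option concrete, the sketch below turns annotator votes into a target probability and evaluates the Bradley-Terry cross-entropy loss against it. Under the Bradley-Terry model, the probability that A is preferred is the sigmoid of the reward difference. The function names are illustrative, and the plain `math.exp` form is only suitable for moderate reward gaps; a training framework's numerically stable loss would be the usual choice.

```python
import math

def soft_label(votes_for_a, votes_for_b):
    """Fraction of annotators preferring A; an even split yields 0.5."""
    return votes_for_a / (votes_for_a + votes_for_b)

def bradley_terry_loss(reward_a, reward_b, label_a):
    """Cross-entropy between a soft label and P(A preferred) = sigmoid(r_a - r_b)."""
    diff = reward_a - reward_b
    log_p_a = -math.log1p(math.exp(-diff))  # log sigmoid(diff)
    log_p_b = -math.log1p(math.exp(diff))   # log (1 - sigmoid(diff))
    return -(label_a * log_p_a + (1.0 - label_a) * log_p_b)

split = soft_label(2, 2)
# With a 0.5 label, the loss is lowest when the model scores both trajectories equally.
print(bradley_terry_loss(0.0, 0.0, split), bradley_terry_loss(1.0, 0.0, split))
```

A pair with a 0.5 label pulls the two rewards toward each other instead of forcing an arbitrary ordering, which is why it dampens the effect of noisy pairs.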

What is the difference between preference datasets and demonstration datasets for robot learning?

Demonstration datasets contain expert trajectories that the robot imitates via behavioral cloning or inverse reinforcement learning. Preference datasets contain pairwise comparisons of trajectories (which may include expert demos, policy rollouts, and failures) that train a reward model to score arbitrary behaviors. Demonstrations are easier to collect (no pairwise annotation needed) but require expert operators and do not generalize well to out-of-distribution scenarios. Preferences enable learning from suboptimal data and support active learning loops where the policy explores and humans provide feedback. Many projects use both: demonstrations for initial policy training, preferences for reward model training and RL fine-tuning.

How do I validate that my reward model generalizes to new environments or robot platforms?

Collect a small test set of trajectories (50-100 episodes) from the new environment or robot, have annotators label pairwise preferences, and compute the reward model's rank correlation with these held-out preferences. Correlation above 0.6 indicates reasonable generalization; below 0.4 suggests the reward model has overfit to the training environment. You can also evaluate the reward model's predictions on simulation rollouts in the new environment and compare to ground-truth task success. If generalization is poor, fine-tune the reward model on a small set of preferences from the new domain (100-500 pairs) or use domain adaptation techniques like adversarial training to align feature representations across domains.
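The held-out check can be sketched as a pairwise rank correlation: concordant pairs minus discordant pairs, divided by the total. This Kendall-style statistic is one reasonable reading of "rank correlation" for pairwise labels, not necessarily the exact metric intended above. The 0.6 and 0.4 thresholds come from the text; the "borderline" label for values in between is an added assumption.

```python
def pairwise_rank_correlation(pairs, reward_fn):
    """Return (concordant - discordant) / total over held-out pairs, in [-1, 1].

    pairs: list of (traj_a, traj_b, preferred) with preferred in {'a', 'b'}.
    Pairs the model scores identically count toward the total only.
    """
    concordant = discordant = 0
    for traj_a, traj_b, preferred in pairs:
        diff = reward_fn(traj_a) - reward_fn(traj_b)
        if diff == 0:
            continue
        model_prefers = "a" if diff > 0 else "b"
        if model_prefers == preferred:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / len(pairs)

def generalization_verdict(corr, good=0.6, poor=0.4):
    """Map a correlation onto the thresholds quoted in the text."""
    if corr > good:
        return "reasonable generalization"
    if corr < poor:
        return "likely overfit to training environment"
    return "borderline"

# Toy scores standing in for a reward model evaluated in the new environment.
toy_scores = {"x": 1.0, "y": 0.0, "z": 0.5}
held_out = [("x", "y", "a"), ("z", "y", "a"), ("y", "x", "b"), ("z", "x", "a")]
corr = pairwise_rank_correlation(held_out, toy_scores.get)
print(corr, generalization_verdict(corr))
```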

What are the most common failure modes when training reward models from preference data?

The most common failures are: (1) reward hacking, where the policy exploits reward model errors to achieve high scores without task success (mitigate with ensembling or KL penalties), (2) miscalibration, where the reward model is overconfident on out-of-distribution states (mitigate with uncertainty estimation or Bayesian neural networks), (3) annotator bias, where the reward model learns annotator-specific preferences rather than task-general success criteria (mitigate by recruiting diverse annotators and measuring inter-annotator agreement), and (4) insufficient trajectory diversity, where the reward model cannot distinguish between subtle differences because the training pool lacks failure modes (mitigate by including policy rollouts and scripted failures in the trajectory pool).
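As an illustration of the ensembling mitigation for reward hacking, the sketch below scores a trajectory with several reward models and subtracts a penalty proportional to their disagreement. Mean-minus-standard-deviation is one common conservative aggregation, used here as an assumption; the text does not specify how the ensemble should be combined, and the `penalty` weight is a hypothetical tuning parameter.

```python
from statistics import mean, pstdev

def conservative_reward(trajectory, reward_models, penalty=1.0):
    """Ensemble mean minus penalty * standard deviation across models.

    High disagreement between models often signals an out-of-distribution
    trajectory, so the policy earns less for exploiting it.
    """
    scores = [model(trajectory) for model in reward_models]
    return mean(scores) - penalty * pstdev(scores)

# Toy "models": lookup tables standing in for independently trained reward networks.
model_1 = {"familiar": 0.8, "exploit": 0.9}.get
model_2 = {"familiar": 0.8, "exploit": 0.1}.get
model_3 = {"familiar": 0.8, "exploit": 0.5}.get
ensemble = [model_1, model_2, model_3]

familiar_score = conservative_reward("familiar", ensemble)
exploit_score = conservative_reward("exploit", ensemble)
print(familiar_score, exploit_score)
```

In this toy case one model rates the exploit trajectory highest of all, yet the ensemble's disagreement pulls its conservative score well below the familiar trajectory's.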
