
Glossary

Data Deduplication

Data deduplication identifies and removes duplicate or near-duplicate samples from training datasets to prevent overfitting, reduce storage costs, and improve model generalization. In physical AI, deduplication operates at three levels: exact (byte-identical copies), near-duplicate (minor compression or crop differences), and semantic (functionally equivalent trajectories). Effective deduplication can reduce dataset size by 15-40% while maintaining or improving model performance, as demonstrated in large-scale robot learning datasets like Open X-Embodiment and DROID.

Updated 2025-05-15
By truelabel
Reviewed by truelabel

Quick facts

Term
Data Deduplication
Domain
Robotics and physical AI
Last reviewed
2025-05-15

What Data Deduplication Means for Physical AI

Data deduplication is the systematic process of detecting and removing duplicate or near-duplicate samples from training datasets to ensure each example provides unique learning signal. In physical AI contexts, duplication arises from multiple sources: repeated teleoperation runs over identical object configurations, multi-camera capture of the same manipulation event, or aggregation of datasets that share common source trajectories.

The Open X-Embodiment dataset aggregates 60 robot embodiments across 22 institutions, creating substantial risk of semantic overlap where different labs capture functionally identical pick-place sequences[1]. Similarly, the DROID dataset collected 76,000 trajectories across 564 scenes but required aggressive deduplication to remove repeated failure modes and redundant success cases[2].

Deduplication operates along a spectrum of similarity thresholds. Exact deduplication removes byte-identical copies using cryptographic hashes. Near-duplicate detection removes samples differing only in compression artifacts, minor crops, or resolution changes. Semantic deduplication removes trajectories that convey identical task solutions despite visual or kinematic differences. Each level trades precision for recall: exact methods miss 95% of duplicates in real-world robot datasets, while semantic methods risk removing valid task diversity.

Why Deduplication Matters: Overfitting and Storage Economics

Duplicate samples distort the empirical distribution models learn from, biasing policies toward over-represented scenarios. If a particular grasp configuration appears 100 times while another appears 10 times, the learned policy will favor the former not because it generalizes better but because dataset collection accidentally captured it more often. This differs from intentional oversampling of hard negatives, where difficult cases are deliberately repeated to improve performance on underrepresented failure modes.

The BridgeData V2 dataset reduced 60,096 raw trajectories to 53,896 after deduplication, removing 10.3% of samples while improving downstream task success rates by 4-7% on held-out object configurations[3]. Storage savings compound at scale: the RoboNet dataset originally consumed 1.2TB for 15 million frames, but near-duplicate removal reduced this to 890GB without performance degradation[4].

Computational costs scale linearly with dataset size. Training RT-1 on 130,000 trajectories required 3,000 TPU-hours; a 20% deduplication would save 600 TPU-hours per training run[5]. For organizations running iterative model development, these savings accumulate across dozens of experiments. Deduplication also reduces annotation costs when human labelers must verify or correct robot trajectories, as duplicate samples waste labeler time on redundant examples.

Exact Deduplication: Hash-Based Methods

Exact deduplication detects byte-identical copies using cryptographic hash functions like SHA-256 or MD5. For image datasets, this approach works well when frames are stored losslessly. For robot trajectories stored in MCAP or HDF5 formats, exact hashing applies to serialized observation-action tuples.
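
The exact-hash approach can be sketched in a few lines. The episode layout below (plain lists of observations and actions serialized to canonical JSON) is a simplification for illustration, not an actual MCAP or HDF5 record format:

```python
import hashlib
import json

def episode_fingerprint(observations, actions):
    """SHA-256 over a canonical JSON serialization of the (observation, action)
    stream. Any byte-level change -- recompression, resampling, re-encoding --
    yields a different digest, which is exactly the high-precision, low-recall
    profile of exact deduplication."""
    payload = json.dumps(
        {"observations": observations, "actions": actions},
        sort_keys=True,
        separators=(",", ":"),
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def dedup_exact(episodes):
    """Keep the first episode seen for each fingerprint; drop exact copies."""
    seen, kept = set(), []
    for ep in episodes:
        digest = episode_fingerprint(ep["observations"], ep["actions"])
        if digest not in seen:
            seen.add(digest)
            kept.append(ep)
    return kept
```

Canonical serialization (sorted keys, fixed separators) matters: two semantically identical episodes serialized with different key orders would otherwise hash differently.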

The EPIC-KITCHENS-100 dataset used SHA-256 hashing on raw RGB frames to remove 3,847 exact duplicates from 90,000 egocentric video segments, reducing storage by 4.3%[6]. However, exact methods fail when datasets undergo lossy compression, resolution changes, or format conversions. A trajectory recorded at 30fps and downsampled to 10fps will not match its original hash despite containing identical semantic content.

For multi-modal robot data, exact deduplication must account for sensor synchronization drift. Two MCAP files capturing the same manipulation event from different ROS nodes may have microsecond timestamp differences, causing hash mismatches. The RLDS format addresses this by normalizing timestamps to episode-relative offsets before hashing, improving exact-match recall from 12% to 67% in multi-camera datasets[7].
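
The normalization idea can be sketched as rebasing plus quantization; the message schema and the one-millisecond quantum below are illustrative assumptions, not the RLDS implementation:

```python
def normalize_timestamps(messages, quantum_us=1000):
    """Rebase message timestamps to episode-relative offsets (first message
    becomes t=0) and quantize to `quantum_us` microseconds, so that wall-clock
    start differences and sub-quantum clock jitter between recording nodes no
    longer change the bytes being hashed."""
    t0 = messages[0]["t_us"]
    return [
        {**m, "t_us": round((m["t_us"] - t0) / quantum_us) * quantum_us}
        for m in messages
    ]
```

After normalization, two captures of the same episode that started at different wall-clock times (and drifted by less than half the quantum) serialize identically and hash to the same digest.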

Near-Duplicate Detection: Perceptual Hashing and Embeddings

Near-duplicate detection identifies samples that differ only in superficial transformations: JPEG compression, minor crops, brightness adjustments, or resolution scaling. Perceptual hashing algorithms like pHash or dHash generate compact fingerprints robust to these changes, enabling Hamming-distance comparisons at scale.
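
A minimal difference-hash (dHash) sketch in pure Python shows the mechanics; a production pipeline would use a library such as imagehash or OpenCV rather than this illustration:

```python
def dhash(gray, hash_size=8):
    """Difference hash over a grayscale image given as a 2-D list of pixel
    values: downsample to (hash_size + 1) x hash_size by nearest-neighbor
    sampling, then emit one bit per horizontally adjacent pixel pair.
    Uniform brightness shifts preserve pixel ordering, so duplicates land
    within a small Hamming distance of each other."""
    rows, cols = len(gray), len(gray[0])
    small = [
        [gray[r * rows // hash_size][c * cols // (hash_size + 1)]
         for c in range(hash_size + 1)]
        for r in range(hash_size)
    ]
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (left > right)
    return bits

def hamming(a, b):
    """Bit-level Hamming distance between two hashes."""
    return bin(a ^ b).count("1")
```

Because each bit encodes only the sign of a local gradient, a uniform exposure change leaves the hash untouched, which is the property that defeats exact hashing in the auto-exposure case described above.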

For robot vision datasets, perceptual hashing operates on observation frames. The RoboNet dataset applied 256-bit pHash to 15 million RGB frames, clustering samples within Hamming distance 8 to remove near-duplicates caused by camera auto-exposure and compression artifacts[8]. This reduced dataset size by 18% while preserving task diversity across 7 robot platforms.

Learned embeddings offer higher semantic sensitivity. The SemDeDup method uses CLIP embeddings to compute cosine similarity between image pairs, removing samples above a 0.95 threshold. Applied to LAION-2B, SemDeDup removed 21% of images while maintaining downstream zero-shot classification accuracy within 1% of the full dataset. For robot datasets, RT-2 used Vision Transformer embeddings to deduplicate 6,000 hours of teleoperation data, removing 14% of trajectories where consecutive episodes differed only in object placement within 5cm[9].

Scalability requires approximate nearest-neighbor search. FAISS IVF indexes enable sub-linear search over millions of embeddings, making semantic deduplication feasible for datasets exceeding 100,000 trajectories. The DROID dataset used FAISS with 1024-dimensional embeddings to deduplicate 76,000 trajectories in 4 hours on a single GPU[2].
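
The embedding-threshold idea can be sketched as a greedy brute-force pass. At the dataset sizes discussed here the inner similarity search would be replaced by a FAISS index; the two-dimensional vectors below are toy stand-ins for CLIP or ViT embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def dedup_by_embedding(embeddings, threshold=0.95):
    """Greedy SemDeDup-style pruning: keep a sample unless it lies within
    `threshold` cosine similarity of an already-kept sample. Brute force for
    clarity; in practice the inner loop is an approximate nearest-neighbor
    query against an index built over the kept set."""
    kept = []
    for i, v in enumerate(embeddings):
        if all(cosine(v, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

Note the greedy pass is order-dependent: whichever duplicate appears first is the one retained, which is the hook that license-aware retention (discussed later) exploits by pre-sorting candidates.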

Semantic Deduplication for Robot Trajectories

Semantic deduplication removes trajectories that achieve identical task outcomes through functionally equivalent action sequences, even when observation streams differ. This is critical for robot datasets where the same pick-place task may be executed with minor gripper orientation variations, different approach angles, or alternative grasp points that yield the same success.

Trajectory-level deduplication requires encoding both observations and actions. The RLDS format supports this by storing episodes as sequences of (observation, action, reward) tuples, enabling comparison of full state-action trajectories rather than isolated frames[7]. A common approach computes dynamic time warping (DTW) distance between action sequences, clustering trajectories with DTW distance below a threshold.
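
A self-contained DTW sketch for action sequences, assuming fixed-dimension action vectors per timestep and a Euclidean per-step cost:

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping distance between two action sequences (lists of
    equal-dimension vectors). The warping path absorbs timing differences, so
    a trajectory and a time-stretched replay of it score as identical;
    trajectories below a chosen DTW threshold are treated as semantic
    duplicates."""
    n, m = len(a), len(b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

This quadratic-time formulation is fine for pairwise checks within candidate clusters; exhaustive all-pairs DTW over a full dataset is avoided by first narrowing candidates with the cheaper stages described earlier.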

The Open X-Embodiment dataset faced semantic duplication across 22 contributing institutions, where different labs independently collected pick-place demonstrations on similar objects. Researchers applied action-space clustering with a DTW threshold of 0.15, removing 8,200 trajectories (12% of the dataset) that duplicated task solutions already present from other sources[1]. Downstream policy training showed no performance degradation on held-out tasks, confirming the removed trajectories provided redundant signal.

Semantic deduplication risks over-pruning when task diversity is subtle. A grasp that approaches from 45° versus 90° may appear similar in action space but encode important generalization information for contact-rich manipulation. The BridgeData V2 dataset addressed this by clustering only within-object-category trajectories, preserving cross-category diversity while removing intra-category redundancy[3].

Deduplication in Multi-Modal Robot Datasets

Multi-modal robot datasets combine RGB, depth, point clouds, proprioceptive state, and force-torque readings, requiring deduplication strategies that account for sensor heterogeneity. A trajectory may be unique in RGB space but duplicated in depth or point-cloud representations, or vice versa.

The DROID dataset stores observations in MCAP format with synchronized RGB-D streams, wrist camera feeds, and joint positions. Deduplication operates hierarchically: first removing exact duplicates via MCAP message hashing, then applying perceptual hashing to RGB frames, and finally clustering point clouds using PointNet embeddings[2]. This three-stage pipeline removed 19% of trajectories while preserving point-cloud diversity critical for 3D manipulation tasks.

Egocentric video datasets like Ego4D face temporal duplication where consecutive frames differ minimally. Frame-level deduplication would remove most of a video, destroying temporal coherence. Instead, Ego4D applies deduplication at the clip level, computing CLIP embeddings over 5-second windows and removing clips with cosine similarity above 0.92[10]. This preserves within-clip motion while removing redundant activity segments.

For datasets stored in HDF5 or Parquet formats, deduplication must respect hierarchical structure. The LeRobot dataset format organizes episodes as nested HDF5 groups; deduplication operates on episode-level metadata (task, object, success) before comparing observation arrays, reducing computational cost by 10x compared to frame-level hashing[11].
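
The hierarchical strategy can be sketched as a metadata-first bucketing pass; the field names below are illustrative, not the actual LeRobot schema:

```python
import hashlib
from collections import defaultdict

def dedup_hierarchical(episodes):
    """Two-stage dedup: bucket episodes by cheap episode-level metadata
    (task, object, success) first, then hash the full observation arrays
    only within each bucket. Episodes that differ in metadata are never
    compared at the array level, which is where the large cost saving over
    frame-level hashing comes from."""
    buckets = defaultdict(list)
    for ep in episodes:
        buckets[(ep["task"], ep["object"], ep["success"])].append(ep)
    kept = []
    for group in buckets.values():
        seen = set()
        for ep in group:
            digest = hashlib.sha256(repr(ep["observations"]).encode()).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(ep)
    return kept
```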

Deduplication Trade-Offs: Precision, Recall, and Task Diversity

Deduplication involves a precision-recall trade-off. Aggressive thresholds (high recall) remove more duplicates but risk pruning valid task diversity. Conservative thresholds (high precision) retain diversity but leave redundant samples that waste storage and training time.

The SemDeDup paper quantified this trade-off on LAION-2B: a cosine similarity threshold of 0.90 removed 35% of images with 2.3% downstream accuracy loss, while a 0.95 threshold removed 21% with 0.8% loss. For robot datasets, the optimal threshold depends on task complexity. Simple pick-place tasks tolerate aggressive deduplication (0.90 threshold), while contact-rich assembly tasks require conservative thresholds (0.97) to preserve subtle grasp variations.

Task diversity metrics help calibrate thresholds. The Open X-Embodiment dataset measures diversity as the number of unique (object, scene, robot) tuples; deduplication stops when diversity drops below 90% of the original count[1]. The BridgeData V2 dataset uses action-space entropy, removing trajectories until entropy decreases by more than 5%[3].
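
The diversity-guarded stopping rule might be sketched as follows; both the metric and the calibration loop are illustrative assumptions in the spirit of the procedures described above, not the published implementations:

```python
def diversity(episodes):
    """Task diversity as the number of unique (object, scene, robot) tuples."""
    return len({(e["object"], e["scene"], e["robot"]) for e in episodes})

def calibrate_threshold(episodes, dedup_at, thresholds, floor=0.90):
    """Return the most aggressive threshold whose surviving set retains at
    least `floor` of the original diversity, plus the survivors. `dedup_at`
    is any deduplicator with signature (episodes, threshold) -> kept episodes;
    thresholds should be ordered most to least aggressive by the caller."""
    base = diversity(episodes)
    for t in thresholds:
        kept = dedup_at(episodes, t)
        if diversity(kept) >= floor * base:
            return t, kept
    return None, episodes  # no threshold satisfied the floor; keep everything
```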

Human-in-the-loop validation provides ground truth. The DROID dataset sampled 500 trajectory pairs flagged as duplicates and had annotators label them as true duplicates, near-duplicates, or distinct. This calibration set tuned the DTW threshold to achieve 92% precision and 87% recall, balancing storage savings against task coverage[2].

Deduplication Pipelines and Tooling

Production deduplication pipelines integrate multiple methods in sequence: exact hashing for byte-identical copies, perceptual hashing for near-duplicates, and embedding-based clustering for semantic duplicates. Each stage operates on progressively smaller candidate sets, reducing computational cost.
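
The staged structure reduces to a small driver that chains stages and records an audit report; this is a generic sketch in the spirit of the pipelines described here, not a reproduction of any one of them:

```python
def run_dedup_pipeline(samples, stages):
    """Chain deduplication stages so each stage sees only the survivors of
    the previous one -- cheap exact hashing first, expensive embedding
    clustering last. Each stage is a (name, callable) pair where the callable
    maps a sample list to its kept subset. Returns the survivors and a
    per-stage removal report for auditing."""
    report = []
    for name, stage in stages:
        before = len(samples)
        samples = stage(samples)
        report.append({"stage": name, "removed": before - len(samples)})
    return samples, report
```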

The LeRobot library provides a reference pipeline: SHA-256 hashing on episode metadata, pHash on RGB observations, and FAISS clustering on ResNet-50 embeddings. For a 10,000-episode dataset, this pipeline completes in 2 hours on a single GPU, removing 12-18% of trajectories depending on threshold settings[11]. The pipeline outputs a deduplication report listing removed episodes, similarity scores, and cluster assignments for audit purposes.

For datasets exceeding 100,000 episodes, distributed deduplication is necessary. The Open X-Embodiment dataset used Apache Spark to parallelize embedding computation across 50 nodes, processing 1 million trajectories in 6 hours. FAISS sharding distributes the index across multiple GPUs, enabling billion-scale nearest-neighbor search[1].

Deduplication must preserve data provenance to maintain audit trails. The RLDS format stores deduplication metadata in episode-level attributes, recording which trajectories were removed, similarity scores, and the deduplication method version. This enables reproducibility and allows downstream users to re-apply deduplication with different thresholds if task requirements change[7].

When Not to Deduplicate: Intentional Repetition and Hard Negatives

Not all duplication is harmful. Intentional oversampling of rare events, failure modes, or hard negatives improves model robustness on underrepresented cases. Deduplication pipelines must distinguish accidental duplication from deliberate repetition.

The RT-1 dataset intentionally oversampled grasp failures by 3x to improve contact-rich manipulation performance. Deduplication would remove these critical examples, degrading model performance on precisely the cases the oversampling was designed to address[5]. The solution: tag oversampled episodes with metadata flags that exempt them from deduplication.
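
The exemption mechanism reduces to a pre-filter; the flag name below is hypothetical, not RT-1's actual metadata field:

```python
def dedup_with_exemptions(episodes, dedup, exempt_flag="intentional_oversample"):
    """Run `dedup` only over episodes not tagged as intentional repetition;
    flagged episodes (oversampled hard negatives, domain-randomized variants)
    pass through untouched."""
    exempt = [e for e in episodes if e.get(exempt_flag)]
    candidates = [e for e in episodes if not e.get(exempt_flag)]
    return dedup(candidates) + exempt
```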

Sim-to-real datasets use domain randomization to generate synthetic variations of the same task, intentionally creating near-duplicates with different lighting, textures, or object poses. The RLBench dataset generates 100 variations per task, each differing only in randomized parameters. Deduplication would collapse these variations, destroying the diversity needed for sim-to-real transfer[12].

Multi-view datasets capture the same manipulation event from multiple camera angles, creating semantic duplicates in observation space but providing complementary 3D information. The DROID dataset uses 3 cameras per scene; deduplication operates only within-camera streams, preserving cross-camera redundancy that enables depth estimation and occlusion reasoning[2].

Deduplication and Dataset Licensing

Deduplication interacts with dataset licensing when aggregating data from multiple sources. If a trajectory appears in both a CC-BY-4.0 dataset and a proprietary dataset, deduplication may remove the open-source copy, reducing the usable open-data fraction.

The Open X-Embodiment dataset aggregates 22 datasets with heterogeneous licenses: 14 under CC-BY-4.0, 5 under CC-BY-NC, and 3 proprietary. Deduplication removed 8,200 trajectories, disproportionately affecting CC-BY-4.0 sources because they overlapped with larger proprietary datasets. The final dataset retained only 62% open-source content, down from 71% pre-deduplication[1].

To preserve open-data availability, deduplication can prioritize retention of permissively licensed samples. The LeRobot library supports license-aware deduplication: when two trajectories are duplicates, the pipeline retains the one with the most permissive license (CC-BY-4.0 > CC-BY-NC > proprietary). This ensures open-source users have maximum access to deduplicated data[11].
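
License-aware retention reduces to a ranked tie-break within each duplicate group; the ranking table is an illustration of the policy described above, not LeRobot's implementation:

```python
# Lower rank = more permissive; this ordering encodes CC-BY-4.0 > CC-BY-NC > proprietary.
LICENSE_RANK = {"CC-BY-4.0": 0, "CC-BY-NC": 1, "proprietary": 2}

def retain_most_permissive(duplicate_group):
    """Given a group of mutually duplicate trajectories, keep the copy with
    the most permissive license so deduplication does not erode the
    open-data fraction of the aggregated dataset."""
    return min(duplicate_group, key=lambda ep: LICENSE_RANK[ep["license"]])
```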

Provenance tracking is critical for compliance. The RLDS format stores source dataset IDs and license metadata in episode attributes, enabling downstream users to filter by license after deduplication. This supports use cases where commercial users access the full deduplicated dataset while academic users access only the open-source subset[7].


External references and source context

  1. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment aggregates 60 robot embodiments across 22 institutions with semantic overlap risks

    arXiv
  2. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID dataset collected 76,000 trajectories requiring aggressive deduplication of repeated failure modes

    arXiv
  3. BridgeData V2: A Dataset for Robot Learning at Scale

    BridgeData V2 reduced 60,096 trajectories to 53,896 after deduplication with 4-7% task success improvement

    arXiv
  4. RoboNet: Large-Scale Multi-Robot Learning

    RoboNet dataset storage reduced from 1.2TB to 890GB via near-duplicate removal

    arXiv
  5. RT-1: Robotics Transformer for Real-World Control at Scale

    RT-1 training on 130,000 trajectories required 3,000 TPU-hours

    arXiv
  6. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100

    EPIC-KITCHENS-100 used SHA-256 hashing to remove 3,847 exact duplicates from 90,000 segments

    arXiv
  7. RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning

    RLDS format normalizes timestamps to episode-relative offsets improving exact-match recall from 12% to 67%

    arXiv
  8. RoboNet: Large-Scale Multi-Robot Learning

    RoboNet applied 256-bit pHash to 15 million RGB frames reducing dataset size by 18%

    arXiv
  9. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 used Vision Transformer embeddings to deduplicate 6,000 hours removing 14% of trajectories

    arXiv
  10. Ego4D: Around the World in 3,000 Hours of Egocentric Video

    Ego4D applies clip-level deduplication with CLIP embeddings over 5-second windows

    arXiv
  11. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch

    LeRobot dataset format organizes episodes as nested HDF5 groups with episode-level deduplication

    arXiv
  12. RLBench: The Robot Learning Benchmark & Learning Environment

    RLBench generates 100 variations per task with randomized parameters

    arXiv

FAQ

What percentage of robot training datasets are typically duplicates?

Duplicate rates vary by collection method. Teleoperation datasets like DROID show 10-15% exact and near-duplicates due to repeated failure recovery attempts. Aggregated datasets like Open X-Embodiment exhibit 12-20% semantic duplication when multiple institutions collect similar tasks. Egocentric video datasets like Ego4D have 25-35% near-duplicate frames due to temporal redundancy in continuous recording. Exact deduplication alone removes 3-5% of samples, while combined exact, near-duplicate, and semantic deduplication removes 15-25% on average.

Does deduplication hurt model performance on downstream tasks?

Conservative deduplication (removing only high-confidence duplicates above 0.95 similarity) typically maintains or improves performance by reducing overfitting. BridgeData V2 showed 4-7% task success improvement after removing 10% of trajectories. Aggressive deduplication (0.90 threshold) can degrade performance by 2-5% if it removes valid task diversity. The optimal threshold depends on task complexity: simple pick-place tolerates aggressive deduplication, while contact-rich assembly requires conservative thresholds to preserve grasp variations.

How does deduplication work for multi-camera robot datasets?

Multi-camera datasets require view-aware deduplication. Within-camera deduplication removes temporal redundancy in each video stream using perceptual hashing or frame embeddings. Cross-camera deduplication is typically disabled because different viewpoints provide complementary 3D information needed for depth estimation and occlusion reasoning. DROID uses this approach: deduplicating within each of 3 camera streams but preserving cross-camera redundancy. For datasets where cameras capture truly redundant views, cross-camera deduplication uses 3D point cloud embeddings rather than 2D image features.

What tools exist for deduplicating robot datasets at scale?

LeRobot provides an integrated pipeline for HDF5 and Parquet datasets, supporting exact hashing, perceptual hashing, and FAISS-based embedding clustering. RLDS offers deduplication utilities for TensorFlow Datasets with support for distributed processing via Apache Beam. For custom formats, FAISS handles billion-scale nearest-neighbor search for embedding-based deduplication, while imagededup provides perceptual hashing for image datasets. Production pipelines typically combine these tools: exact hashing with SHA-256, near-duplicate detection with pHash or CLIP embeddings, and semantic clustering with FAISS IVF indexes.

Should deduplication happen before or after data annotation?

Deduplication before annotation saves labeling costs by avoiding redundant human effort on duplicate samples. For datasets requiring human verification or correction, pre-deduplication can reduce annotation workload by 15-25%. However, pre-deduplication risks removing samples that appear duplicate in observation space but require different labels (e.g., similar grasps with different success outcomes). Post-deduplication preserves label diversity but wastes annotation budget. A hybrid approach deduplicates exact and high-confidence near-duplicates before annotation, then applies semantic deduplication after labels are available to account for label-based distinctions.

How does deduplication interact with data augmentation?

Data augmentation intentionally creates synthetic variations of training samples through transformations like rotation, cropping, or color jittering. Deduplication must occur before augmentation to avoid removing the seed samples that augmentation will transform. If deduplication runs after augmentation, it will incorrectly flag augmented variants as duplicates and remove them, negating the augmentation benefit. For datasets combining real and augmented data, metadata tags distinguish original samples from augmented variants, exempting augmented data from deduplication. Domain randomization in sim-to-real datasets follows the same principle: randomized variations are tagged to prevent deduplication from collapsing intentional diversity.

Find datasets covering data deduplication

Truelabel surfaces vetted datasets and capture partners working with data deduplication. Send the modality, scale, and rights you need and we route you to the closest match.

Browse Physical AI Datasets