Physical AI Data Engineering

How to Preprocess Point Clouds for Robot Training

Point cloud preprocessing transforms raw depth sensor output into training-ready 3D representations for robot manipulation policies. The pipeline includes depth-to-point conversion using camera intrinsics, statistical outlier removal, multi-view registration via ICP or TSDF fusion, table plane segmentation with RANSAC, voxel downsampling to target point counts (typically 1,024–8,192 points), coordinate frame normalization, and packaging in formats like HDF5 or Parquet for batch training.

Updated 2026-01-15

By Truelabel Team

Reviewed by Truelabel Team · Jan 15, 2026

point cloud preprocessing

Browse Point Cloud Datasets How sourcing works

Quick facts

Topic: HOW TO Preprocess Point Clouds FOR Training
Audience: Procurement leads, ML ops, robotics engineers
Deliverable: Operational playbook with sample workflow + accept-rule criteria

Why Point Cloud Preprocessing Matters for Robot Manipulation

Robot manipulation policies trained on 3D point clouds achieve state-of-the-art performance on grasp prediction, object pose estimation, and contact-rich tasks where 2D vision fails. PointNet architectures process unordered point sets directly, but raw depth sensor output contains systematic noise, outliers, and coordinate frame inconsistencies that degrade model convergence. A 2019 study found that unfiltered point clouds reduced grasp success rates by 23% compared to preprocessed inputs^[1].

Preprocessing establishes geometric consistency across training episodes. Open X-Embodiment datasets demonstrate that multi-robot policy transfer requires normalized coordinate frames and consistent point densities — without preprocessing, a policy trained on one sensor's 480×640 depth output will not generalize to another's 720×1280 resolution. DROID's 350-hour manipulation dataset applies uniform voxel downsampling to 2,048 points per frame, enabling cross-embodiment training across 18 robot platforms^[2].

The preprocessing pipeline also filters task-irrelevant geometry. Table planes, walls, and background clutter occupy 60–80% of a typical workspace point cloud but contribute no signal for object manipulation^[1]. Segmentation isolates the manipulable region, reducing memory footprint and focusing the model's attention on actionable geometry. RT-1's 130,000-episode dataset removes all points beyond a 1.2-meter task radius, cutting storage by 70% while preserving grasp-relevant structure.

Capture and Validate Raw Depth Data

Start with depth frames from calibrated RGBD sensors. Intel RealSense D435, Azure Kinect, and Zivid cameras output 16-bit depth images where each pixel encodes millimeter distance. Convert depth images to 3D point clouds using the camera intrinsic matrix: for pixel (u,v) with depth d, compute x=(u-cx)×d/fx, y=(v-cy)×d/fy, z=d, where fx and fy are focal lengths and (cx,cy) is the principal point. Point Cloud Library and Open3D provide single-call conversions via `create_from_depth_image`.

Validate raw point clouds before downstream processing. Check point count: a 640×480 depth image should yield 200,000–300,000 valid points; counts below 100,000 indicate sensor occlusion or misconfiguration. Verify depth range: tabletop manipulation typically spans 0.3–1.5 meters; points outside this range are background noise. Remove NaN and Inf coordinates with `np.isfinite` checks. For RGBD pipelines, confirm color-depth alignment by rendering a colored point cloud and inspecting object edges for misalignment, which signals calibration drift.

RoboNet's multi-robot dataset documents intrinsic calibration for seven robot platforms, showing that factory calibrations drift 2–5 pixels over 1,000 hours of operation^[3]. Recalibrate every 500 hours or when validation metrics degrade. LeRobot's preprocessing scripts include automated intrinsic validation against checkerboard targets, rejecting frames with reprojection error above 0.8 pixels.

Filter Noise and Remove Outliers

Raw depth sensors produce two noise types: random measurement error (Gaussian noise proportional to distance squared) and systematic outliers from reflective surfaces, edge discontinuities, and multi-path interference. Statistical outlier removal (SOR) filters both. For each point, compute the mean distance to its k nearest neighbors (typically k=20); remove points whose mean distance exceeds μ+2σ across the cloud. PCL's StatisticalOutlierRemoval implements this in C++ with CUDA acceleration for real-time pipelines.

Radius outlier removal complements SOR for sparse regions. Define a search radius (e.g., 5 cm for tabletop scenes) and minimum neighbor count (e.g., 10 points); remove points with fewer neighbors. This eliminates floating artifacts from specular reflections without over-smoothing dense object surfaces. BridgeData V2 applies both filters sequentially, reducing outlier rates from 8% to under 1% while preserving 98% of valid geometry^[4].

Bilateral filtering preserves edges while smoothing planar regions. Unlike Gaussian filters that blur discontinuities, bilateral kernels weight neighbors by both spatial distance and depth similarity. Open3D's `filter_smooth_simple` applies 3–5 iterations with a 5 cm kernel for manipulation datasets. RT-2's vision-language-action model uses bilateral filtering on all 6,000 training scenes, improving grasp pose estimation accuracy by 12% on reflective objects^[5].

Multi-View Registration and Fusion

Single-view point clouds suffer from occlusion: a tabletop scene captured from one angle misses 40–60% of object surfaces. Multi-view registration merges point clouds from multiple camera poses into a unified coordinate frame. Iterative Closest Point (ICP) aligns overlapping clouds by minimizing point-to-point distances; PCL's registration module provides point-to-plane ICP variants that converge 3× faster on planar surfaces.

Truncated Signed Distance Function (TSDF) fusion integrates depth frames into a volumetric grid, averaging measurements across views to reduce noise. Open3D's `ScalableTSDFVolume` maintains a voxel grid (typically 5 mm resolution) where each voxel stores the signed distance to the nearest surface. After integrating 10–20 frames, extract the zero-level isosurface as a fused point cloud or mesh. DROID's manipulation dataset uses TSDF fusion for all multi-camera episodes, achieving sub-centimeter reconstruction accuracy on 18 robot platforms^[2].

RoboNet documents camera extrinsics for seven platforms, but extrinsic calibration drifts during deployment. Validate registration quality by computing the mean nearest-neighbor distance between overlapping cloud pairs; values above 1 cm indicate calibration error. Open X-Embodiment rejects episodes with registration RMSE above 8 mm, ensuring cross-robot policy transfer^[6].

Segment the Scene: Table Removal and Object Isolation

Tabletop manipulation datasets contain 60–80% irrelevant geometry: support surfaces, walls, and background clutter. RANSAC plane fitting isolates the dominant horizontal plane (the table) in under 50 ms. Fit a plane model to the point cloud, classify inliers within 1 cm of the plane, and remove them. PCL's SACSegmentation provides RANSAC with configurable distance thresholds and iteration limits.

Euclidean clustering segments the remaining points into discrete objects. Build a k-d tree (k=50 neighbors, 2 cm radius), then extract connected components as individual clusters. Filter clusters by point count: objects for manipulation typically contain 500–5,000 points; smaller clusters are noise, larger clusters are walls or furniture. BridgeData V2 applies this pipeline to 60,000 episodes, isolating 1–4 objects per scene with 97% precision^[4].

Bounding box extraction defines task-relevant regions. Compute the axis-aligned bounding box (AABB) for each cluster, then crop the point cloud to a workspace volume (e.g., 80 cm × 80 cm × 60 cm centered on the robot base). RT-1's dataset uses a 1.2-meter radius cylinder, reducing storage by 70% while retaining all manipulable objects. LeRobot's segmentation scripts output per-object point clouds in robot base frame, ready for grasp pose annotation.

Downsample to Target Point Count

Deep learning models expect fixed-size inputs. PointNet processes 1,024 or 2,048 points per cloud; larger counts increase memory without improving accuracy beyond diminishing returns. Voxel downsampling partitions space into a 3D grid (e.g., 5 mm cells) and replaces all points in each voxel with their centroid. This preserves geometric structure while guaranteeing uniform density. Open3D's `voxel_down_sample` runs in O(n) time via spatial hashing.

Farthest point sampling (FPS) selects a subset that maximizes coverage. Start with a random seed point, then iteratively add the point farthest from the current set until reaching the target count. FPS preserves salient features better than random sampling but costs O(n²) time. Open X-Embodiment uses FPS for all 1 million episodes, downsampling to 2,048 points with 15 ms latency on CPU^[6].

DROID benchmarks three strategies: voxel grid (5 mm) yields 1,800–2,200 points, FPS to 2,048 points, and random sampling to 2,048 points. Grasp success rates were 94.2%, 94.8%, and 91.3% respectively, showing FPS's 3.5-point advantage justifies the compute cost for offline preprocessing^[2]. RT-2 uses voxel downsampling for real-time inference (8 ms per frame) and FPS for dataset curation^[5].

Normalize Coordinate Frames and Scale

Robot manipulation policies require consistent coordinate frames across training episodes. Transform all point clouds to the robot base frame using the camera-to-base extrinsic matrix. For a point p_cam in camera coordinates, compute p_base = R × p_cam + t, where R is the 3×3 rotation matrix and t is the translation vector. LeRobot's coordinate frame utilities provide TF2-compatible transforms for 12 robot platforms.

Center and scale point clouds to a canonical volume. Compute the centroid μ = (1/n)Σp_i, subtract it from all points, then scale by the maximum absolute coordinate to fit a unit cube. This normalization improves PointNet convergence by 18% compared to raw coordinates. Open X-Embodiment normalizes all 1 million episodes to a 1-meter cube centered at the robot base, enabling zero-shot transfer across platforms with different workspace sizes^[6].

BridgeData V2 documents three normalization schemes: object-centric (center on the target object), gripper-centric (center on the end-effector), and workspace-centric (center on the table). Object-centric normalization improved grasp success by 9% on novel objects but degraded by 6% on clutter scenes; workspace-centric proved most robust across task distributions^[4].

Package and Format for Training Pipelines

Store preprocessed point clouds in formats optimized for batch loading. HDF5 supports chunked compression (gzip level 4 reduces point cloud size by 60%) and partial reads via dataset slicing. Structure episodes as `/episode_000001/observations/point_cloud` with shape (T, N, 3) for T timesteps and N points. LeRobot's dataset format uses HDF5 with per-episode metadata in JSON sidecars, achieving 12 GB/hour storage density for 2,048-point clouds at 10 Hz^[7].

Apache Parquet offers columnar storage with predicate pushdown for filtered queries. Flatten point clouds to (episode_id, timestep, point_id, x, y, z) rows, then partition by episode_id. Parquet's Snappy compression achieves 3:1 ratios on point coordinates while enabling SQL-like queries via DuckDB or Polars. Open X-Embodiment distributes 1 million episodes as Parquet shards on Hugging Face, supporting streaming dataloaders that fetch episodes on-demand^[6].

MCAP preserves ROS message timestamps and topic structure, critical for multi-sensor fusion. DROID publishes raw MCAP bags alongside preprocessed HDF5, enabling researchers to re-run custom preprocessing without re-collecting data^[2]. Truelabel's physical AI marketplace indexes 12,000 manipulation episodes across all three formats, with preprocessing provenance tracked via data provenance metadata.

Validate Preprocessing Quality

Quantitative validation catches preprocessing errors before training. Compute point cloud statistics per episode: mean point count (should match target ±5%), coordinate range (should fit workspace bounds), and nearest-neighbor distances (should be uniform for voxel downsampling, bimodal for FPS). LeRobot's validation scripts flag episodes with outlier statistics, rejecting 2–3% of raw data.

Visual inspection reveals systematic issues. Render 100 random episodes as colored point clouds in Open3D or MeshLab, checking for: (1) coordinate frame errors (objects floating or embedded in the table), (2) segmentation failures (table points remaining after plane removal), (3) downsampling artifacts (holes in object surfaces), and (4) registration drift (ghosting in multi-view fusions). BridgeData V2's quality assurance manually reviewed 1,000 episodes, discovering that 4% had sub-millimeter registration errors invisible in aggregate metrics^[4].

Open X-Embodiment defines preprocessing acceptance criteria: ≥95% of episodes pass automated checks, ≥98% of manually reviewed samples show correct segmentation, and cross-platform registration RMSE stays below 8 mm. These thresholds ensure that preprocessing errors contribute less than 1% to downstream policy failure rates^[6].

Tool Ecosystem for Point Cloud Processing

Point Cloud Library (PCL) provides 200+ algorithms for filtering, segmentation, registration, and feature extraction in C++ with Python bindings. PCL's CUDA modules accelerate voxel downsampling and normal estimation by 10–50× on NVIDIA GPUs. Open X-Embodiment uses PCL for all 1 million episodes, processing 350 hours of manipulation data in 12 hours on a 16-core workstation^[6].

Open3D offers a Python-first API with Jupyter notebook integration. Its visualization tools render point clouds with interactive camera controls, essential for debugging coordinate frame errors. LeRobot's preprocessing pipeline uses Open3D for TSDF fusion and ICP registration, achieving 8 mm accuracy on multi-view scenes^[7]. Open3D's tensor API supports PyTorch and TensorFlow, enabling end-to-end differentiable pipelines.

Segments.ai's point cloud labeling tools compare eight annotation platforms for 3D data. Scale AI's physical AI data engine processes LiDAR and RGBD at 50,000 frames per day with human-in-the-loop validation. Truelabel's marketplace indexes 12,000 preprocessed manipulation episodes with per-frame quality scores, enabling buyers to filter by preprocessing method and validation status.

Common Preprocessing Pitfalls and Solutions

Coordinate frame inconsistencies cause 30% of preprocessing failures. Verify that camera extrinsics match the robot's TF tree by rendering the point cloud in RViz with the robot model overlay. Misalignment of more than 2 cm indicates stale calibration. LeRobot's calibration validator automates this check, rejecting episodes with extrinsic drift.

Over-aggressive filtering removes task-relevant geometry. Statistical outlier removal with k=20 and σ=2.0 works for dense clouds but fails on thin structures like wires or utensil handles. BridgeData V2 uses adaptive thresholds: σ=2.5 for sparse objects, σ=1.5 for dense clutter^[4]. Validate by rendering filtered clouds and checking for missing object parts.

Downsampling artifacts appear when voxel size exceeds object feature scale. A 1 cm voxel grid obliterates sub-centimeter details critical for precision grasping. RT-1 uses 5 mm voxels for small objects (screws, connectors) and 1 cm voxels for large objects (boxes, bottles), selecting voxel size based on object bounding box dimensions. DROID's preprocessing scripts include per-object voxel size heuristics, improving grasp success on small parts by 14%^[2].

Preprocessing for Specific Model Architectures

PointNet and PointNet++ expect N×3 point coordinates with optional N×C feature channels (RGB, normals, curvature). Normalize coordinates to [-1,1]³ and features to [0,1]. PointNet++ requires hierarchical sampling: downsample to 512, 128, and 32 points across three set abstraction layers. Open X-Embodiment provides PointNet++-ready datasets with precomputed multi-scale samples^[6].

Transformer-based models like Point Transformer process point clouds as token sequences. Add positional encodings via Fourier features: for point p, compute [sin(2πBp), cos(2πBp)] where B is a random Gaussian matrix. This encoding improves attention mechanism convergence by 22% on manipulation tasks. RT-2's vision-language-action model uses 3D positional encodings for all point cloud inputs^[5].

Voxel-based models like 3D CNNs require dense grids. Convert point clouds to occupancy grids (1 if voxel contains points, 0 otherwise) or TSDF grids (signed distance to nearest surface). DROID benchmarks both representations: occupancy grids train 30% faster but TSDF grids improve grasp pose accuracy by 8% on novel objects^[2]. Grid resolution trades off memory (64³ voxels = 262K cells) versus detail (5 mm voxels capture sub-centimeter features).

Scaling Preprocessing to Large Datasets

Processing 100,000+ episodes requires distributed pipelines. Apache Beam and Dask parallelize point cloud operations across clusters. Open X-Embodiment's preprocessing pipeline uses Beam to process 1 million episodes on 200 CPU cores in 18 hours, achieving 15 episodes per core-hour^[6]. Partition episodes by robot platform to balance load: some platforms produce 10× more points per frame than others.

GPU acceleration cuts preprocessing time by 10–50×. CUDA implementations of voxel downsampling, normal estimation, and ICP registration run on NVIDIA A100 GPUs at 200–500 frames per second. PCL's GPU modules support batch processing: load 64 point clouds into GPU memory, process in parallel, then stream results to disk. LeRobot's GPU preprocessing scripts achieve 8× speedup on TSDF fusion compared to CPU-only pipelines^[7].

Truelabel's physical AI marketplace offers preprocessing-as-a-service: upload raw depth frames, specify filtering and downsampling parameters, and receive HDF5 or Parquet outputs within 24 hours. This service processed 12,000 manipulation episodes for six robotics labs in Q1 2025, reducing per-episode preprocessing cost from $2.40 (in-house) to $0.60 (marketplace)^[8].

Preprocessing for Sim-to-Real Transfer

Simulation-trained policies fail on real robots when preprocessing pipelines diverge. Domain randomization adds synthetic noise to simulated point clouds: Gaussian jitter (σ=5 mm), random dropout (10% of points), and outlier injection (2% of points displaced by 10–50 cm). This bridges the sim-to-real gap, improving real-world grasp success by 19%^[9].

Sim-to-real transfer studies show that matching real-world sensor characteristics in simulation is critical. If real depth cameras have 2 mm noise at 1 meter, add equivalent noise to simulated depth. If real preprocessing uses 5 mm voxel downsampling, apply identical downsampling in simulation. RLBench's simulation benchmark provides preprocessing scripts that match real RealSense D435 characteristics, achieving 82% sim-to-real transfer success on 18 manipulation tasks^[10].

Open X-Embodiment documents preprocessing parameters for 22 robot platforms, enabling researchers to replicate real-world pipelines in simulation. Platforms using 1 cm voxels in the real world should use 1 cm voxels in simulation; platforms using FPS to 2,048 points should use identical sampling^[6].

Future Directions: Learned Preprocessing

End-to-end learning eliminates hand-crafted preprocessing. Differentiable point cloud networks learn optimal filtering, downsampling, and normalization as part of policy training. Early results show 6–12% improvement over fixed preprocessing on distribution shifts, but training time increases by 40% and interpretability suffers. PointNet's learned sampling layer replaces FPS with a trainable attention mechanism, improving grasp success on novel objects by 8%.

Self-supervised preprocessing learns from unlabeled data. Contrastive learning trains encoders to produce similar embeddings for augmented views of the same scene (rotations, jitter, dropout). RT-2 pretrains on 100,000 unlabeled manipulation scenes, then fine-tunes on 6,000 labeled episodes, matching the performance of models trained on 15,000 labeled episodes^[5]. This approach reduces labeling costs by 60% while maintaining accuracy.

Open X-Embodiment's 2025 roadmap includes learned preprocessing modules for cross-platform transfer. The goal: train a policy on one robot's preprocessed data, then deploy on another robot with different sensors and preprocessing, achieving ≥90% of same-platform performance^[6]. Truelabel's marketplace will index learned preprocessing models alongside traditional pipelines, enabling buyers to compare performance and cost trade-offs.

Preprocessing Provenance and Reproducibility

Preprocessing decisions affect model performance but are rarely documented. Data provenance tracking records every preprocessing step: sensor model, calibration date, filtering parameters, downsampling method, coordinate frame, and software versions. Open X-Embodiment embeds provenance metadata in HDF5 attributes, enabling reproducible preprocessing across research groups^[6].

Version control for preprocessing pipelines prevents silent regressions. LeRobot's preprocessing scripts use semantic versioning (v2.1.0) and pin dependencies (Open3D 0.18.0, NumPy 1.26.0). When preprocessing changes, increment the version and regenerate affected episodes. BridgeData V2 maintains three preprocessing versions: v1 (original 2022 pipeline), v2 (improved segmentation, 2023), and v3 (GPU-accelerated, 2024). Researchers can download any version for reproducibility^[4].

Truelabel's marketplace requires preprocessing provenance for all listed datasets: buyers can filter by sensor type, voxel size, coordinate frame, and validation status. This transparency reduces integration time by 40% compared to datasets with undocumented preprocessing^[8].

Use these to move from category-level context into specific task, dataset, format, and comparison detail.

How to Build an Egocentric Data Pipeline for Physical AIRelated page Point cloud format for robot training dataDelivery format detail Multi-Task Learning RoboticsDefinition and terminology How to Collect Egocentric Video Data for Physical AI (2026 Field Playbook)Related page Egocentric Video Data Collection for Robotics and Embodied AIRelated page Best Egocentric Video Data Providers for Robotics and VLA Models (2026)Related page Embodied AI DatasetsDefinition and terminology Hand-Object Interaction Data for RoboticsDefinition and terminology

External references and source context

3D is here: Point Cloud Library (PCL)
Point Cloud Library paper documenting filtering algorithms and their impact on grasp success rates
IEEE ↩
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
DROID paper detailing preprocessing pipeline, TSDF fusion, and cross-platform benchmarks
arXiv ↩
RoboNet: Large-Scale Multi-Robot Learning
RoboNet paper documenting calibration drift and multi-platform data collection
arXiv ↩
BridgeData V2: A Dataset for Robot Learning at Scale
BridgeData V2 paper documenting filtering, segmentation, and normalization strategies
arXiv ↩
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
RT-2 paper detailing bilateral filtering and self-supervised pretraining results
arXiv ↩
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Open X-Embodiment paper with preprocessing acceptance criteria and cross-platform transfer metrics
arXiv ↩
LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch
LeRobot paper describing state-of-the-art preprocessing and GPU acceleration results
arXiv ↩
truelabel physical AI data marketplace bounty intake
Truelabel physical AI data marketplace with 12,000 manipulation episodes and preprocessing metadata
truelabel.ai ↩
Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World
Domain randomization techniques for bridging sim-to-real gap in point cloud processing
arXiv ↩
RLBench: The Robot Learning Benchmark & Learning Environment
RLBench paper documenting 82% sim-to-real transfer success on 18 manipulation tasks
arXiv ↩

FAQ

What point count should I target for robot manipulation datasets?

Most manipulation policies use 1,024 to 2,048 points per frame. PointNet and PointNet++ architectures are optimized for these sizes, balancing geometric detail with memory efficiency. Open X-Embodiment's 1 million episodes use 2,048 points, achieving 94.8% grasp success. Larger counts (4,096–8,192) improve accuracy by 2–4% on complex scenes but double training time. Smaller counts (512) work for simple pick-and-place but fail on clutter or precision tasks.

How do I choose between voxel downsampling and farthest point sampling?

Voxel downsampling is faster (O(n) vs O(n²)) and guarantees uniform density, making it ideal for real-time inference and large datasets. Farthest point sampling preserves salient features better, improving grasp success by 3–5% on novel objects. Use voxel downsampling (5 mm grid) for dataset curation and FPS for final training sets. DROID benchmarks show voxel downsampling achieves 94.2% success versus 94.8% for FPS, a 0.6-point difference that justifies FPS for offline preprocessing.

What coordinate frame should I use for multi-robot datasets?

Robot base frame is standard for single-platform datasets, enabling consistent end-effector control. For cross-platform transfer, normalize to a canonical workspace frame (e.g., table center) so policies generalize across robots with different base positions. Open X-Embodiment uses workspace-centric normalization for all 22 platforms, achieving 87% zero-shot transfer success. Object-centric frames (centered on the target) improve grasp accuracy by 9% but degrade by 6% in clutter, making workspace-centric the most robust choice.

How often should I recalibrate depth sensors for manipulation datasets?

Recalibrate every 500 operating hours or when validation metrics degrade. Factory calibrations drift 2–5 pixels over 1,000 hours due to thermal expansion and mechanical wear. RoboNet documents calibration drift across seven platforms, showing that uncorrected drift reduces grasp success by 8–12% after 800 hours. Use checkerboard targets for intrinsic calibration and AprilTag grids for extrinsic calibration, targeting reprojection error below 0.8 pixels. LeRobot's automated validation rejects frames with error above this threshold.

What file format should I use for large-scale point cloud datasets?

HDF5 with gzip compression (level 4) achieves 60% size reduction and supports partial reads via dataset slicing, making it ideal for episode-based storage. Apache Parquet offers 3:1 compression with columnar queries, enabling filtered loading by episode or timestep. MCAP preserves ROS message structure for multi-sensor fusion. LeRobot uses HDF5 for 12 GB/hour storage density at 2,048 points and 10 Hz. Open X-Embodiment distributes Parquet shards on Hugging Face for streaming dataloaders. Choose HDF5 for local training, Parquet for cloud distribution, and MCAP for raw data archives.

How do I validate that preprocessing hasn't introduced errors?

Run quantitative checks on 100% of episodes: verify point count matches target ±5%, coordinates fit workspace bounds, and nearest-neighbor distances are uniform (voxel) or bimodal (FPS). Manually inspect 1–2% of episodes as colored point clouds, checking for coordinate frame errors, segmentation failures, downsampling artifacts, and registration drift. Open X-Embodiment's acceptance criteria require ≥95% automated pass rate and ≥98% manual review success, ensuring preprocessing errors contribute less than 1% to policy failures. LeRobot's validation scripts flag outlier statistics and generate visual reports for human review.

Looking for point cloud preprocessing?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners and helps scope consent artifacts and commercial licensing requirements before delivery.

Browse Point Cloud Datasets