Physical AI Infrastructure
How to Set Up a Mobile Manipulation Rig for Physical AI Data Collection
A mobile manipulation rig combines a wheeled base with a mounted robotic arm to collect navigation and manipulation data simultaneously. Core steps: select a mobile platform (differential-drive or omnidirectional), mount a 6-7 DoF arm with end-effector cameras, synchronize all sensors to a shared clock, configure ROS2 or MCAP recording pipelines, and collect teleoperated demonstrations across varied environments to generate training data for vision-language-action models.
Quick facts
- Difficulty: Intermediate
- Audience: Physical AI data engineers
- Last reviewed: 2026-01-15
Why Mobile Manipulation Rigs Are Critical for Physical AI
Mobile manipulation rigs generate the highest-value training data for embodied AI because they capture both navigation and manipulation skills in a single trajectory. Open X-Embodiment aggregated 1 million trajectories from 22 robot types, yet mobile manipulation sequences remain scarce—less than 8% of the corpus[1]. Foundation models like RT-2 and OpenVLA require diverse mobile manipulation data to generalize across tasks like fetch-and-place, room-to-room delivery, and dynamic obstacle avoidance.
The data scarcity stems from hardware complexity. Unlike fixed-arm setups, mobile rigs demand real-time coordination between base odometry, arm kinematics, and multi-camera streams. Scale AI's Physical AI platform reports that mobile manipulation datasets command 3-5× higher per-trajectory pricing than static manipulation due to collection overhead[2]. Buyers prioritize rigs that produce RLDS-compatible or MCAP outputs with synchronized timestamps, calibrated extrinsics, and action labels at 10-30 Hz.
DROID demonstrated that 76,000 mobile manipulation trajectories across 564 scenes enable zero-shot transfer to unseen environments[3]. The rig design matters: wrist-mounted cameras, force-torque sensors, and proprioceptive encoders are non-negotiable for modern physical AI data marketplaces. If your rig cannot produce multi-modal, time-aligned data at scale, you are locked out of the highest-value procurement contracts.
Selecting a Mobile Base Platform
Choose between differential-drive, Ackermann-steering, or omnidirectional bases depending on workspace constraints and maneuverability requirements. Differential-drive platforms (two independently driven wheels) are simplest to control and dominate academic datasets—RoboNet used TurtleBot 2 bases for 15 million frames across 7 institutions[4]. Omnidirectional bases (mecanum or swerve-drive) enable lateral motion critical for tight indoor navigation but add mechanical complexity and cost.
Key specifications: payload capacity must exceed arm weight plus sensors plus safety margin (typically 15-25 kg total), battery runtime should support 2-4 hour collection sessions without recharge, and wheel encoders must provide odometry at ≥50 Hz for accurate base-to-world transforms. LeRobot recommends bases with ROS2 navigation stack compatibility to simplify SLAM integration and path planning during autonomous data collection phases.
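As a quick pre-purchase check, the payload arithmetic above is easy to script. A minimal sketch — the function name and the example masses are illustrative placeholders, not vendor specifications:

```python
def check_base_payload(base_rating_kg, arm_kg, sensors_kg, mounting_kg, margin=0.25):
    """Return (ok, total_kg): does the base rating cover the mounted load
    plus a fractional safety margin (default 25%)?"""
    total = arm_kg + sensors_kg + mounting_kg
    required = total * (1 + margin)
    return base_rating_kg >= required, total

# Illustrative masses: 20.6 kg arm, 3 kg sensors, 2.5 kg mounting hardware
ok, total = check_base_payload(base_rating_kg=35.0, arm_kg=20.6,
                               sensors_kg=3.0, mounting_kg=2.5)
```

With these numbers the load totals about 26 kg, so a 35 kg-rated base clears the 25% margin; a 30 kg rating would be marginal.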
Avoid platforms without standardized mounting interfaces. Universal Robots' UR5e and Franka Emika arms expect ISO 9409-1-50-4-M6 flanges; custom adapter plates introduce calibration drift[5]. For outdoor or warehouse scenarios, consider ruggedized bases with suspension—Claru's warehouse teleoperation dataset used Clearpath Ridgeback platforms to handle floor transitions and ramps. Budget $8,000-$25,000 for research-grade mobile bases; industrial AGVs start at $40,000 but offer superior reliability for multi-year data collection programs.
Mounting and Integrating the Manipulator Arm
Mount the arm to maximize workspace overlap with base-mounted cameras while maintaining a low center of gravity to prevent tip-over during high-speed maneuvers. The mounting plate must rigidly couple the arm base to the mobile platform—any flex introduces unmodeled dynamics that corrupt inverse kinematics solutions. Franka's FR3 Duo dual-arm configuration requires custom torque-tube mounts to handle 14 kg cantilevered loads during bimanual tasks[6].
Calibrate the arm-to-base transform using a checkerboard target visible to both wrist cameras and base-mounted RGB-D sensors. Record 20-30 poses spanning the workspace, then solve the hand-eye calibration problem using OpenCV's `calibrateHandEye` or ROS's `easy_handeye` package. Residual errors below 3 mm are acceptable for manipulation; navigation-only tasks tolerate 10 mm. BridgeData V2 achieved 1.8 mm calibration accuracy using AprilTag fiducials and nonlinear least-squares refinement[7].
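The hand-eye solve itself belongs to `calibrateHandEye` or `easy_handeye`; the residual check against the 3 mm threshold is simple enough to sketch in plain Python. The function name and pose data here are illustrative stand-ins:

```python
import math

def rms_residual_mm(predicted, measured):
    """RMS Euclidean distance (mm) between predicted and measured 3D points,
    e.g. target positions reprojected through the solved arm-to-base transform."""
    assert len(predicted) == len(measured)
    sq_errors = [sum((p - m) ** 2 for p, m in zip(pp, mm))
                 for pp, mm in zip(predicted, measured)]
    return math.sqrt(sum(sq_errors) / len(sq_errors))

# Illustrative residuals from 3 calibration poses (positions in mm)
pred = [(100.0, 50.0, 200.0), (120.0, 55.0, 210.0), (90.0, 40.0, 190.0)]
meas = [(101.0, 50.5, 199.0), (121.5, 54.0, 211.0), (89.0, 41.0, 188.5)]
err = rms_residual_mm(pred, meas)
acceptable_for_manipulation = err < 3.0
```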
Cable management is non-negotiable. Route arm power and Ethernet through cable carriers or drag chains to prevent snagging during base rotation. RT-1's data collection rig used spiral-wrap sleeving and strain relief at both arm base and mobile platform to survive 130,000 trajectories without cable failure[8]. Test full 360° base rotation at maximum arm extension before collecting production data—cable binding is the leading cause of mid-trajectory aborts. For 7-DoF arms, verify that null-space motions do not create unreachable base-relative configurations.
Sensor Suite Configuration and Synchronization
Equip the rig with wrist-mounted RGB cameras (for end-effector viewpoint), base-mounted RGB-D cameras (for third-person scene context), joint encoders (for proprioception), and optionally force-torque sensors (for contact-rich tasks). OpenVLA training requires at minimum two camera views per trajectory; single-view data limits spatial reasoning and occlusion handling[9].
Synchronize all sensors to a hardware-triggered clock or use software timestamps with NTP-disciplined system time. Timestamp drift above 50 ms between camera frames and joint states breaks action-observation correspondence in imitation learning. MCAP's message-level timestamps enable post-hoc synchronization, but hardware triggering eliminates the problem at the source. ROS2 bag recording defaults to receive-time stamping; prefer sensor-native header timestamps and align streams with `message_filters::TimeSynchronizer` (or its approximate-time variant) before recording.
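A post-hoc drift check along these lines can be run on recorded timestamps before accepting a session. A minimal sketch, assuming timestamps in seconds and nearest-neighbor pairing (the function name and sample rates are illustrative):

```python
import bisect

def max_sync_offset_ms(camera_ts, joint_ts):
    """For each camera timestamp, find the nearest joint-state timestamp
    and return the worst-case offset in milliseconds."""
    joint_ts = sorted(joint_ts)
    worst = 0.0
    for t in camera_ts:
        i = bisect.bisect_left(joint_ts, t)
        neighbors = joint_ts[max(i - 1, 0):i + 1]   # candidates on either side
        worst = max(worst, min(abs(t - j) for j in neighbors) * 1000.0)
    return worst

# 30 Hz camera stream vs 100 Hz joint states with a small constant skew
cam = [i / 30.0 + 0.004 for i in range(30)]
joints = [i / 100.0 for i in range(100)]
drift = max_sync_offset_ms(cam, joints)
ok = drift <= 50.0   # the 50 ms threshold from the text
```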
Calibrate camera intrinsics and extrinsics before each collection session. Vibration from mobile base motion shifts lens elements; recalibrate weekly or after any mechanical adjustment. DROID's collection protocol includes automated checkerboard detection at session start to flag calibration drift[10]. Store calibration matrices in the dataset metadata—LeRobot's dataset schema reserves `camera_params` fields for intrinsic K matrices and distortion coefficients. Buyers reject datasets with missing or placeholder calibration data.
Recording Pipeline Architecture
Build a recording pipeline that captures raw sensor streams, computes derived quantities (velocities, accelerations), and writes time-aligned data to disk without frame drops. MCAP is the preferred format for mobile manipulation because it supports arbitrary message schemas, handles multi-gigabyte files efficiently, and integrates with Foxglove Studio for real-time visualization during collection.
Structure your pipeline in three stages: acquisition (sensor drivers publish to topics), processing (synchronization nodes align timestamps and compute actions), and serialization (MCAP writer or ROS2 bag recorder persists messages). Run acquisition and processing on separate CPU cores to prevent I/O blocking. LeRobot's data collection scripts use Python's `multiprocessing` to isolate camera capture from disk writes, achieving 30 Hz recording with four cameras on a 12-core workstation[11].
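The three stages above can be sketched with queues. This simplified single-process version uses threads for brevity where a production pipeline would use separate processes as described; all names and message fields are illustrative:

```python
import queue
import threading

def run_pipeline(n_frames=10):
    raw_q, out_q = queue.Queue(), queue.Queue()
    records = []

    def acquire():                        # stage 1: sensor-driver stand-in
        for i in range(n_frames):
            raw_q.put({"seq": i, "stamp": i / 30.0})
        raw_q.put(None)                   # end-of-stream sentinel

    def process():                        # stage 2: derive action labels
        while (msg := raw_q.get()) is not None:
            msg["action"] = [0.0] * 9     # placeholder 9-dim action
            out_q.put(msg)
        out_q.put(None)

    def serialize():                      # stage 3: MCAP/bag-writer stand-in
        while (msg := out_q.get()) is not None:
            records.append(msg)

    threads = [threading.Thread(target=f) for f in (acquire, process, serialize)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return records

records = run_pipeline()
```

Bounded queues (`queue.Queue(maxsize=...)`) would add backpressure so a slow disk stalls acquisition visibly rather than silently dropping frames.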
Compute action labels in real-time rather than post-hoc. For teleoperation, record both the operator's commanded actions and the robot's executed joint positions—the delta reveals compliance, latency, and tracking error. RLDS format stores actions as `(batch, time, action_dim)` tensors; mobile manipulation typically uses 9-12 action dimensions (6-7 for arm joints, 2-3 for base velocity, 1 for gripper)[12]. Validate action ranges during recording: out-of-bounds values indicate sensor failure or kinematic singularities. Truelabel's data provenance standards require per-trajectory metadata documenting action space definitions and units.
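A per-message bounds check along these lines might look as follows, assuming a hypothetical 10-dimensional action space (7 arm joints, 2 base velocities, 1 gripper) with illustrative limits:

```python
# Illustrative limits for a 10-dim action: 7 arm joints (rad),
# base linear and angular velocity (m/s, rad/s), gripper command (0-1).
LIMITS = [(-2.9, 2.9)] * 7 + [(-1.0, 1.0), (-1.5, 1.5), (0.0, 1.0)]

def validate_action(action, limits=LIMITS):
    """Return indices of out-of-bounds dimensions (empty list = valid)."""
    if len(action) != len(limits):
        raise ValueError(f"expected {len(limits)} dims, got {len(action)}")
    return [i for i, (a, (lo, hi)) in enumerate(zip(action, limits))
            if not lo <= a <= hi]

good = [0.1] * 7 + [0.3, 0.0, 0.5]
bad = [0.1] * 7 + [2.0, 0.0, 0.5]   # base linear velocity out of range
```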
Teleoperation Interface Design
Design a teleoperation interface that minimizes operator cognitive load while maximizing data diversity. The interface must display multiple camera feeds simultaneously, provide haptic or visual feedback for collision proximity, and allow rapid task resets without manual robot repositioning. ALOHA's bilateral teleoperation setup uses leader-follower arms for intuitive 6-DoF control, enabling non-expert operators to collect 50 trajectories per hour[13].
For mobile manipulation, decouple base and arm control. Simultaneous base motion and manipulation is cognitively demanding; most operators prefer sequential control (navigate to target, then manipulate). Claru's kitchen task dataset used a gamepad for base velocity commands and a SpaceMouse for arm Cartesian control, achieving 85% task success rate across 12 operators[14]. Record operator identity and experience level in trajectory metadata—skill variance is a critical covariate for policy training.
Implement automatic safety stops: e-stop on base-arm collision, workspace boundary violations, or excessive joint torques. RT-1's data collection used virtual safety zones defined in the base frame; arm motions that would intersect the mobile platform triggered automatic retracts[15]. Log all safety events with timestamps—they reveal task difficulty and inform dataset filtering. Provide operators with a "mark bad trajectory" button to flag collection errors in real-time rather than relying on post-hoc review.
Environment Design and Task Diversity
Collect data across diverse environments to maximize policy generalization. Open X-Embodiment analysis shows that models trained on 10+ distinct environments achieve 40% higher zero-shot success rates than single-environment models[16]. Vary lighting conditions, floor textures, object arrangements, and background clutter—domain randomization in data collection is more effective than synthetic domain randomization for mobile manipulation.
Define a task taxonomy before collection. DROID's taxonomy includes 564 scenes across 86 buildings, with tasks categorized by manipulation primitive (pick, place, push, pull) and navigation complexity (straight-line, doorway traversal, multi-room)[17]. Balance task distribution: over-sampling easy tasks inflates success metrics without improving model robustness. Target 60% success rate during collection—higher rates indicate insufficient task diversity, lower rates suggest operator training issues.
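Monitoring the per-task success distribution against the 60% target can be scripted. In this sketch the 50-70% acceptance band and the task names are illustrative choices, not fixed thresholds:

```python
from collections import defaultdict

def flag_tasks(episodes, low=0.50, high=0.70):
    """episodes: list of (task_name, success_bool) tuples.
    Return {task: success_rate} for tasks outside the [low, high] band."""
    counts = defaultdict(lambda: [0, 0])          # task -> [successes, total]
    for task, success in episodes:
        counts[task][0] += int(success)
        counts[task][1] += 1
    return {t: s / n for t, (s, n) in counts.items()
            if not low <= s / n <= high}

episodes = ([("pick", True)] * 9 + [("pick", False)]              # 90%: too easy
            + [("doorway", True)] * 6 + [("doorway", False)] * 4)  # 60%: on target
flags = flag_tasks(episodes)
```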
Document environment metadata: floor plan dimensions, obstacle density, lighting spectrum (fluorescent vs. LED vs. natural), and acoustic properties if using audio sensors. EPIC-KITCHENS provides per-scene metadata including camera mount heights and participant demographics; this granularity enables dataset slicing for ablation studies[18]. For warehouse or industrial settings, record ambient temperature and humidity—these affect motor performance and sensor noise. Truelabel's marketplace intake requires environment documentation for all mobile manipulation datasets.
Data Validation and Quality Assurance
Implement automated quality checks during and after collection. Real-time checks: verify timestamp monotonicity (no backwards jumps), action bounds (all commands within joint limits), and camera frame completeness (no dropped frames). Post-collection checks: compute trajectory statistics (episode length distribution, action variance, camera motion blur metrics) and flag outliers for manual review.
BridgeData V2's validation pipeline rejects trajectories with >5% frame drops, >100 ms timestamp gaps, or action sequences that violate kinematic constraints[19]. Build similar filters using `numpy` and `scipy`: detect action discontinuities via `np.diff`, measure camera blur with Laplacian variance, and estimate camera-to-encoder timestamp offset by cross-correlating motion signals from the two streams. Store validation results in a per-trajectory manifest—buyers use these to assess dataset quality before purchase.
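A stdlib-only sketch of two of these checks (timestamp gaps and action discontinuities); the thresholds and names are illustrative, and a production filter would use `numpy` as described above:

```python
def check_trajectory(timestamps, actions, max_gap_s=0.1, max_jump=0.5):
    """Return a list of human-readable issues; an empty list means the
    trajectory passes these two checks."""
    issues = []
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        if dt <= 0:
            issues.append(f"non-monotonic timestamp at index {i}")
        elif dt > max_gap_s:
            issues.append(f"{dt * 1000:.0f} ms gap at index {i}")
    for i in range(1, len(actions)):
        jump = max(abs(a - b) for a, b in zip(actions[i], actions[i - 1]))
        if jump > max_jump:
            issues.append(f"action discontinuity {jump:.2f} at index {i}")
    return issues

ts = [0.0, 0.033, 0.066, 0.30]                            # 234 ms gap at the end
acts = [[0.0, 0.0], [0.1, 0.0], [0.1, 0.1], [0.9, 0.1]]   # 0.8 jump at the end
issues = check_trajectory(ts, acts)
```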
Manually review 10-20% of trajectories, stratified by task type and operator. RoboCat's data curation used expert reviewers to label trajectory quality on a 1-5 scale; only trajectories rated 4 or 5 entered the training set[20]. Common failure modes: operator hesitation (long pauses mid-task), collision recovery (arm retracts then resumes), and task abandonment (trajectory ends without goal achievement). Tag these in metadata rather than discarding—they provide negative examples for safety training. LeRobot's ACT training notebook demonstrates filtering low-quality trajectories using success labels and episode length thresholds.
Data Format Conversion and Distribution
Convert raw recordings to training-ready formats. RLDS is the de facto standard for reinforcement learning datasets, storing trajectories as TensorFlow `tf.data.Dataset` objects with standardized observation and action keys. LeRobot's dataset format uses Parquet for tabular data (joint states, actions) and separate image directories for camera frames, enabling efficient random access during training[21].
Preserve raw data alongside processed versions. HDF5 supports hierarchical storage: `/raw/camera0`, `/raw/joint_states`, `/processed/actions`, `/processed/observations`. This enables buyers to apply custom preprocessing without re-collecting. Parquet's columnar layout accelerates filtering and slicing—critical for large datasets where training uses only a subset of trajectories.
Document the conversion pipeline in a reproducible script. LeRobot's diffusion training example includes a `convert_dataset.py` that transforms MCAP recordings to LeRobot format with configurable action spaces and observation keys[22]. Provide a dataset card following Hugging Face conventions: describe collection methodology, task distribution, known limitations, and licensing terms. Truelabel's marketplace requires dataset cards for all listings; incomplete documentation reduces buyer trust and sale prices by 30-50%.
Scaling Collection Operations
Scale from pilot collection (50-100 trajectories) to production (10,000+ trajectories) by parallelizing operators, automating resets, and optimizing task scheduling. RT-1 collected 130,000 trajectories over 17 months using 13 mobile manipulation rigs operated in parallel across three buildings[23]. Each rig collected 250-400 trajectories per week; operator training and task design consumed 40% of total program time.
Automate environment resets between trajectories. Manual resets (returning objects to start positions, clearing clutter) consume 30-50% of operator time. RoboCasa's simulation environment demonstrates automated resets for kitchen tasks; adapt these principles to physical rigs using overhead cameras and object detection to verify reset completeness. For warehouse tasks, use the mobile base to navigate between pre-staged task locations rather than resetting a single workspace.
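A reset-completeness check of the kind described might compare detected object positions against staged start poses. A minimal sketch with illustrative object names, positions, and tolerance:

```python
def reset_complete(start_positions, detected_positions, tol_m=0.05):
    """True if every staged object has a detection within tol_m of its
    start pose. Positions are (x, y) in metres in the overhead-camera frame."""
    for name, (sx, sy) in start_positions.items():
        det = detected_positions.get(name)
        if det is None:                       # object missing from the scene
            return False
        dx, dy = det[0] - sx, det[1] - sy
        if (dx * dx + dy * dy) ** 0.5 > tol_m:
            return False
    return True

start = {"mug": (0.40, 0.10), "plate": (0.55, -0.05)}
detected_ok = {"mug": (0.41, 0.11), "plate": (0.55, -0.04)}
detected_bad = {"mug": (0.60, 0.10), "plate": (0.55, -0.05)}   # mug displaced
```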
Track collection metrics: trajectories per operator-hour, task success rate by environment, and data quality scores. Scale AI's Universal Robots partnership reports 3× productivity gains from operator specialization—assigning operators to task categories matching their skill profiles[24]. Use these metrics to identify bottlenecks: if success rates drop below 50%, simplify tasks or provide additional operator training. If quality scores decline over time, investigate hardware degradation (worn cables, loose mounts, sensor drift). Truelabel's request system rewards high-quality, high-volume datasets; optimize your collection pipeline to meet marketplace quality thresholds.
Cost Analysis and ROI Considerations
Mobile manipulation rig setup costs range from $35,000 (academic-grade) to $150,000 (industrial-grade). Budget breakdown: mobile base ($8,000-$40,000), manipulator arm ($15,000-$60,000), sensors ($5,000-$20,000 for cameras, RGB-D, F/T), compute ($3,000-$10,000 for recording workstation), and integration labor (200-500 hours at $75-$150/hour). RoboNet's multi-institution setup amortized costs across 7 labs, achieving $12,000 per-rig average by sharing sensor procurement and software development[25].
Data collection operating costs: operator wages ($25-$50/hour), facility access, equipment maintenance, and cloud storage. At 300 trajectories per operator-week, a 10,000-trajectory dataset requires 33 operator-weeks plus 20% overhead for training, calibration, and quality review—total labor cost $40,000-$80,000. Scale AI's managed collection service charges $150-$400 per trajectory for mobile manipulation, reflecting these underlying costs plus margin[26].
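The labor estimate above can be reproduced in a few lines; the defaults mirror the figures in the text and are assumptions, not fixed rates:

```python
def labor_cost(trajectories, per_operator_week=300, overhead=0.20,
               hours_per_week=40, wage_low=25, wage_high=50):
    """Rough operator-labor cost range (USD) for a collection program:
    operator-weeks plus fractional overhead, priced at an hourly wage band."""
    weeks = trajectories / per_operator_week * (1 + overhead)
    hours = weeks * hours_per_week
    return hours * wage_low, hours * wage_high

low, high = labor_cost(10_000)   # ~33 operator-weeks + 20% overhead
```

For 10,000 trajectories this reproduces the $40,000-$80,000 labor range cited above.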
Revenue potential: Truelabel marketplace listings show mobile manipulation datasets selling for $8-$25 per trajectory depending on task diversity, environment count, and data quality. A 10,000-trajectory dataset with strong documentation and multi-environment coverage commands $120,000-$180,000. ROI timeline: 12-18 months for academic labs selling surplus data, 6-9 months for commercial data collection operations. Figure AI's partnership with Brookfield demonstrates enterprise demand for large-scale mobile manipulation data to pretrain humanoid foundation models[27].
External references and source context
[1] Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv) — aggregated 1M trajectories from 22 robot types; mobile manipulation sequences remain scarce at <8% of the corpus.
[2] Scale AI: Expanding Our Data Engine for Physical AI (scale.com) — Scale AI Physical AI platform; mobile manipulation datasets command 3-5× higher per-trajectory pricing.
[3] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv) — 76,000 mobile manipulation trajectories across 564 scenes enable zero-shot transfer.
[4] RoboNet: Large-Scale Multi-Robot Learning (arXiv) — 15 million frames across 7 institutions using TurtleBot 2 differential-drive bases.
[5] Scale AI and Universal Robots physical AI partnership (scale.com) — Universal Robots UR5e partnership; ISO 9409-1-50-4-M6 flange standards.
[6] FR3 Duo (franka.de) — Franka FR3 Duo dual-arm configuration requires custom torque-tube mounts for 14 kg loads.
[7] BridgeData V2: A Dataset for Robot Learning at Scale (arXiv) — achieved 1.8 mm hand-eye calibration accuracy using AprilTag fiducials.
[8] RT-1: Robotics Transformer for Real-World Control at Scale (arXiv) — data collection rig survived 130,000 trajectories with proper cable management.
[9] OpenVLA project (openvla.github.io) — OpenVLA training requires a minimum of two camera views per trajectory for spatial reasoning.
[10] DROID project site (droid-dataset.github.io) — collection protocol includes automated checkerboard detection for calibration drift.
[11] LeRobot GitHub repository (GitHub) — data collection scripts achieve 30 Hz recording with four cameras using multiprocessing.
[12] RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning (arXiv) — RLDS stores actions as (batch, time, action_dim) tensors; mobile manipulation uses 9-12 dimensions.
[13] ALOHA project site (tonyzhaozh.github.io) — ALOHA bilateral teleoperation enables non-expert operators to collect 50 trajectories per hour.
[14] Kitchen Task Training Data for Robotics (claru.ai) — Claru kitchen task dataset used gamepad and SpaceMouse control, achieving an 85% success rate.
[15] RT-1 project site (robotics-transformer1.github.io) — RT-1 data collection used virtual safety zones to prevent base-arm collisions.
[16] Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv) — models trained on 10+ distinct environments achieve 40% higher zero-shot success rates.
[17] DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv) — taxonomy includes 564 scenes across 86 buildings with manipulation and navigation tasks.
[18] Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (arXiv) — provides per-scene metadata including camera mount heights and participant demographics.
[19] BridgeData V2: A Dataset for Robot Learning at Scale (arXiv) — validation pipeline rejects trajectories with >5% frame drops or >100 ms timestamp gaps.
[20] RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation (arXiv) — data curation used expert reviewers to label trajectory quality on a 1-5 scale.
[21] LeRobot dataset documentation (Hugging Face) — LeRobot format uses Parquet for tabular data, enabling efficient random access.
[22] Diffusion Policy training example (GitHub) — LeRobot diffusion training example includes convert_dataset.py for MCAP-to-LeRobot conversion.
[23] RT-1: Robotics Transformer for Real-World Control at Scale (arXiv) — collected 130,000 trajectories over 17 months using 13 rigs operated in parallel.
[24] Scale AI and Universal Robots physical AI partnership (scale.com) — partnership reports 3× productivity gains from operator specialization.
[25] RoboNet project site (robonet.wiki) — multi-institution setup amortized costs across 7 labs, achieving a $12,000 per-rig average.
[26] Scale AI Physical AI platform (scale.com) — managed collection service charges $150-$400 per trajectory for mobile manipulation.
[27] Figure + Brookfield humanoid pretraining dataset partnership (figure.ai) — demonstrates enterprise demand for mobile manipulation pretraining data.
FAQ
What mobile base payload capacity do I need for a 6-DoF arm and sensor suite?
Budget 15-25 kg total payload: a UR5e arm weighs 20.6 kg, wrist cameras add 0.5-1 kg, base-mounted RGB-D sensors add 1-2 kg, and mounting hardware adds 2-3 kg. Choose a base rated for 30-40 kg to maintain safety margin and accommodate future sensor additions. Clearpath Ridgeback (125 kg capacity) and Fetch Mobile Manipulator (25 kg arm payload) are common research platforms. Verify that payload is centered over the wheelbase to prevent tip-over during high-acceleration maneuvers.
How do I synchronize camera frames with joint encoder data in ROS2?
Use ROS2's `message_filters::ApproximateTimeSynchronizer` to align messages with timestamps within a tolerance window (typically 20-50 ms). Configure camera drivers to publish hardware timestamps rather than receive timestamps—most industrial cameras support PTP (Precision Time Protocol) or GPIO triggering. For MCAP recording, enable per-message timestamps and use Foxglove's timeline view to verify synchronization post-collection. If timestamp drift exceeds 50 ms, check NTP daemon status and consider hardware trigger lines between camera and robot controller.
What is the minimum trajectory count for training a mobile manipulation policy?
Minimum viable dataset: 1,000-2,000 trajectories for single-task imitation learning, 10,000+ for multi-task policies, 50,000+ for foundation model fine-tuning. RT-1 used 130,000 trajectories across 700 tasks; OpenVLA fine-tuning requires 5,000-10,000 trajectories per new robot morphology. Quality matters more than quantity—1,000 diverse, high-success-rate trajectories outperform 10,000 repetitive or low-quality trajectories. Start with 50-100 trajectories to validate your pipeline, then scale based on policy performance plateaus.
Should I use MCAP or ROS2 bag format for mobile manipulation data?
Use MCAP for new projects—it offers better compression (30-50% smaller files than ROS2 bags), faster random access, and vendor-neutral schema definitions. MCAP integrates with Foxglove Studio for visualization and supports non-ROS ecosystems (Python-only pipelines, custom message types). ROS2 bags remain necessary if your training pipeline depends on ROS2-specific tooling. Many teams record in MCAP and convert to RLDS or LeRobot format for training. Both formats support the same message types; choose based on your downstream toolchain.
How do I price mobile manipulation data for marketplace sale?
Pricing factors: trajectory count, task diversity (number of distinct manipulation primitives and environments), data quality (success rate, calibration completeness, timestamp accuracy), and format (RLDS/LeRobot-ready data commands a premium over raw bags). Market rates: $8-$15 per trajectory for single-environment data, $15-$25 for multi-environment with strong documentation, $25-$40 for contact-rich tasks with force-torque data. Truelabel's marketplace shows 10,000-trajectory datasets selling for $120,000-$250,000 depending on these factors. Provide sample trajectories and detailed dataset cards to justify premium pricing.
What are the most common mobile manipulation rig failure modes?
Top five failure modes: cable snagging during base rotation (40% of aborted trajectories), camera-robot timestamp desynchronization from NTP drift (25%), arm-base collision from incorrect workspace limits (15%), battery depletion mid-trajectory (10%), and sensor calibration drift from vibration (10%). Mitigations: use cable carriers and strain relief, implement hardware-triggered timestamps, define conservative virtual safety zones, monitor battery voltage and auto-pause collection below thresholds, and recalibrate weekly or after any mechanical adjustment. Log all failures with timestamps to identify patterns and prioritize fixes.
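The battery mitigation can be sketched as a simple hysteresis check so collection does not flap around a single threshold; the nominal voltage and thresholds here are illustrative:

```python
def battery_action(voltage, nominal=24.0, pause_frac=0.87,
                   resume_frac=0.92, paused=False):
    """Hysteresis policy: pause collection below the pause threshold and,
    once paused, resume only after voltage recovers past the resume threshold."""
    if voltage < nominal * pause_frac:       # below 20.88 V with defaults
        return "pause"
    if paused and voltage < nominal * resume_frac:   # below 22.08 V
        return "stay_paused"
    return "collect"
```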
Looking for a mobile manipulation rig?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.
List Your Mobile Manipulation Dataset