Physical AI Glossary
Deformable Object Manipulation
Deformable object manipulation is the robotic task of handling materials—cloth, rope, cables, dough, soft packaging—that change shape under contact forces. Unlike rigid-body manipulation, these tasks require models that predict continuous shape evolution across contact sequences, typically vision transformers or graph neural networks trained on 10,000–100,000+ teleoperation trajectories capturing state transitions under varied grasp points, pull directions, and material properties.
Quick facts
- Term: Deformable Object Manipulation
- Domain: Robotics and physical AI
- Last reviewed: 2025-06-08
What Deformable Object Manipulation Means for Physical AI Systems
Deformable object manipulation addresses a class of robotic tasks where the target object's geometry changes continuously during interaction. A robot folding a towel must predict how fabric drapes under gravity and bunches at grasp points. A system routing cables through conduits must model friction, bending stiffness, and self-collision as the cable snakes through tight spaces. These tasks differ fundamentally from rigid-body pick-and-place because the object's state space is infinite-dimensional—every point on a cloth surface can move independently.
The RT-1 Robotics Transformer demonstrated that large transformer-based robot policies can generalize across 700+ manipulation tasks, but deformable scenarios remain underrepresented in public datasets[1]. DROID's 76,000-trajectory dataset includes cable manipulation and fabric smoothing, yet these constitute under 8% of total episodes. Production systems handling laundry, food prep, or wire harness assembly require dedicated deformable datasets with 50,000+ trajectories per task family to achieve 85%+ success rates in variable lighting and clutter.
Deformable manipulation sits at the intersection of computer vision, contact mechanics, and imitation learning. Models must extract shape representations from RGB-D streams, predict contact outcomes under partial observability, and generate action sequences that achieve goal configurations despite material variance. Scale AI's physical AI data engine now offers deformable object annotation, but most teams still collect proprietary teleoperation data using LeRobot-compatible hardware and custom simulation environments.
Core Data Requirements: Trajectories, Annotations, and Modalities
Deformable object manipulation models consume multi-modal trajectories: RGB-D video at 10–30 Hz, proprioceptive joint states, gripper force/torque readings, and optionally tactile sensor arrays. Each trajectory records a complete task episode—initial object configuration, action sequence, final state—with frame-level annotations for grasp points, contact regions, and intermediate shape keypoints. A single folding task may span 200–400 frames; a cable routing episode 600–1,200 frames.
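As a concrete reference point, the sketch below shows one way to represent such an episode in memory and check that all modalities share a time axis. The field names, shapes, and `fold_towel` task label are illustrative assumptions, not a published schema.

```python
from dataclasses import dataclass

import numpy as np


# Hypothetical per-episode record mirroring the modalities described above.
# Shapes and field names are illustrative, not a standardized format.
@dataclass
class DeformableEpisode:
    rgb: np.ndarray           # (T, H, W, 3) uint8 frames at 10-30 Hz
    depth: np.ndarray         # (T, H, W) float32 depth in meters
    qpos: np.ndarray          # (T, n_joints) proprioceptive joint positions
    gripper: np.ndarray       # (T,) gripper open/close state
    actions: np.ndarray       # (T, action_dim) commanded actions
    grasp_points: np.ndarray  # (T, K, 2) annotated pixel keypoints
    task: str = "fold_towel"
    success: bool = False


def validate(ep: DeformableEpisode) -> None:
    # Every modality must be time-aligned with the RGB stream.
    T = ep.rgb.shape[0]
    for name in ("depth", "qpos", "gripper", "actions", "grasp_points"):
        assert getattr(ep, name).shape[0] == T, f"{name} misaligned with rgb"
```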
BridgeData V2 established the multi-task teleoperation format: 60,000+ trajectories across 24 environments, stored as RLDS-compatible datasets with per-frame action labels and language task descriptions. For deformable tasks, annotations expand to include mesh vertices (cloth), centerline splines (rope), or voxel occupancy grids (dough). EPIC-KITCHENS-100 provides 100 hours of egocentric kitchen video with object interaction labels, useful for pretraining vision backbones before fine-tuning on robot data[2].
Data volume requirements scale with task diversity and material variance. A single-task policy (folding one towel type) may converge on 5,000–10,000 demonstrations. Multi-task generalists targeting 20+ deformable scenarios need 100,000+ trajectories. RoboNet aggregated 15 million frames from seven robot platforms, but deformable tasks were sparse; modern efforts like Open X-Embodiment now explicitly solicit cloth, rope, and soft-object contributions to fill this gap[1].
Model Architectures: From Graph Networks to Vision Transformers
Early deformable manipulation systems used particle-based simulators and model-predictive control, requiring hand-tuned physics parameters for each material. Modern approaches replace explicit simulation with learned dynamics models. Graph neural networks represent cloth as meshes—nodes for vertices, edges for connectivity—and predict vertex positions after contact. Vision transformers encode RGB-D observations into latent shape representations, then decode action sequences via behavior cloning or inverse reinforcement learning.
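The graph construction itself is straightforward. The NumPy sketch below builds a grid mesh for a cloth patch and runs one hand-rolled message-passing step; real systems learn the update function with a GNN library, and the grid resolution and linear update here are assumptions for illustration only.

```python
import numpy as np

# Cloth patch as a grid mesh: nodes are vertices, edges connect neighbors.
H, W = 16, 16  # assumed vertex grid resolution
verts = np.stack(
    np.meshgrid(np.linspace(0, 1, W), np.linspace(0, 1, H)), axis=-1
).reshape(-1, 2)


def grid_edges(h: int, w: int) -> np.ndarray:
    # Undirected edges between horizontally and vertically adjacent vertices.
    idx = np.arange(h * w).reshape(h, w)
    horiz = np.stack([idx[:, :-1].ravel(), idx[:, 1:].ravel()], axis=1)
    vert = np.stack([idx[:-1, :].ravel(), idx[1:, :].ravel()], axis=1)
    return np.concatenate([horiz, vert])  # (E, 2)


def message_passing(x: np.ndarray, edges: np.ndarray, weight: np.ndarray) -> np.ndarray:
    # Sum neighbor features along mesh edges, then apply a linear update.
    # A learned GNN would replace `weight` with trained layers.
    agg = np.zeros_like(x)
    np.add.at(agg, edges[:, 0], x[edges[:, 1]])
    np.add.at(agg, edges[:, 1], x[edges[:, 0]])
    return x + agg @ weight  # predicted next vertex positions


rng = np.random.default_rng(0)
next_verts = message_passing(verts, grid_edges(H, W), rng.normal(0, 0.01, (2, 2)))
```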
RT-2 showed that vision-language models pretrained on web data transfer to robotic control when fine-tuned on 6,000+ robot trajectories. For deformable tasks, this architecture extends to predict not just gripper actions but also intermediate shape waypoints. A cloth-folding policy might output a sequence of grasp-lift-place actions plus expected fabric configurations at each step, enabling closed-loop replanning when the cloth slips or bunches unexpectedly.
Diffusion policies have emerged as the dominant architecture for deformable manipulation. LeRobot's diffusion training pipeline treats action sequences as images, applying denoising diffusion to generate smooth, collision-free trajectories. On cable routing benchmarks, diffusion policies achieve 78% success versus 52% for standard behavior cloning[3]. The tradeoff: diffusion models require 3–10× more compute per inference step, limiting real-time control to 5–10 Hz unless quantized or distilled.
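The sketch below shows the generic DDPM-style denoising loop such a policy runs at inference time. `eps_model` stands in for the learned noise predictor, and the horizon, action dimension, and noise schedule are illustrative defaults rather than LeRobot's actual configuration.

```python
import numpy as np


def denoise_action_chunk(eps_model, obs_embedding, horizon=16, action_dim=7, steps=50):
    """Schematic DDPM sampling loop producing a (horizon, action_dim) chunk."""
    betas = np.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    a = np.random.randn(horizon, action_dim)  # start from pure noise
    for t in reversed(range(steps)):
        # The network predicts the noise that was injected at step t,
        # conditioned on the observation embedding.
        eps = eps_model(a, t, obs_embedding)
        a = (a - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            a += np.sqrt(betas[t]) * np.random.randn(*a.shape)
    return a


# Usage with a dummy noise predictor, just to show the call shape:
actions = denoise_action_chunk(lambda a, t, e: np.zeros_like(a), obs_embedding=None)
```

Each call runs `steps` network evaluations, which is the per-inference compute overhead noted above; distillation and step-count reduction are the usual ways to recover real-time rates.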
Simulation Environments and Sim-to-Real Transfer Challenges
Deformable object simulation remains computationally expensive. Finite-element methods (FEM) model cloth and soft bodies with high fidelity but run 10–100× slower than real-time. Position-based dynamics (PBD) trades accuracy for speed, enabling interactive simulation at 30–60 Hz. NVIDIA Cosmos world foundation models now generate synthetic deformable scenarios at scale, but sim-to-real transfer still requires domain randomization over material stiffness, friction coefficients, and visual textures.
Domain randomization trains policies on wide distributions of simulated parameters so they remain robust in the real world. For cloth manipulation, this means varying fabric weight (50–500 g/m²), bending stiffness (0.01–1.0 N·m), and surface friction (μ = 0.2–0.8). Sim-to-real transfer studies show that policies trained on 100,000 randomized simulation episodes plus 2,000 real-world demonstrations outperform 20,000 real-only demonstrations, reducing data collection costs by 60%[4].
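A randomization loop over these ranges is only a few lines. The sketch below samples the material parameters quoted above once per episode; the `env.reset` interface is a hypothetical simulator call, and stiffness is sampled log-uniformly because its range spans two orders of magnitude.

```python
import numpy as np

rng = np.random.default_rng()


def sample_cloth_params() -> dict:
    # Ranges match the ones quoted in the text above.
    return {
        "fabric_weight_gsm": rng.uniform(50.0, 500.0),     # g/m^2
        "bending_stiffness": 10 ** rng.uniform(-2.0, 0.0),  # 0.01-1.0 N*m, log-uniform
        "friction_mu": rng.uniform(0.2, 0.8),
    }


# for episode in range(100_000):
#     env.reset(**sample_cloth_params())  # hypothetical simulator API
```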
Public simulation benchmarks include RoboSuite for rigid manipulation, ManiSkill for dexterous tasks, and SoftGym for deformable scenarios. SoftGym provides cloth folding, rope manipulation, and liquid pouring environments with standardized evaluation protocols. However, most production teams build custom simulators tuned to their target materials—automotive wire harnesses, food packaging films, surgical sutures—because off-the-shelf physics engines lack the fidelity needed for zero-shot transfer.
Teleoperation Data Collection: Hardware and Annotation Pipelines
High-quality deformable manipulation datasets come from human teleoperation, not scripted demonstrations. Operators use VR controllers, haptic devices, or leader-follower robot arms to perform tasks while the system records RGB-D video, joint states, and gripper commands at 10–30 Hz. ALOHA's bimanual teleoperation setup costs under $20,000 and has generated 10,000+ cloth and cable manipulation trajectories for academic labs.
Annotation pipelines add semantic labels post-collection. For cloth folding, annotators mark grasp points, fold lines, and final configurations using tools like CVAT's polygon annotation interface. For cable routing, they trace centerline splines and label contact points with fixtures. Encord's active learning platform reduces annotation time by 40% by precomputing keypoint proposals from pretrained vision models, then routing edge cases to human reviewers[5].
Truelabel's physical AI data marketplace now accepts deformable manipulation requests: buyers specify task, object types, success criteria, and trajectory count; collectors bid on data collection using standardized teleoperation rigs. A 5,000-trajectory cloth-folding dataset with per-frame grasp annotations costs $15,000–$40,000 depending on material diversity and lighting conditions. This is 70% cheaper than in-house collection for teams without existing teleoperation infrastructure.
Dataset Formats: RLDS, HDF5, and MCAP for Multi-Modal Trajectories
Deformable manipulation datasets use hierarchical formats that bundle video, proprioception, and annotations into single files. RLDS (Reinforcement Learning Datasets) wraps TensorFlow Datasets with trajectory semantics: each episode is a sequence of (observation, action, reward) tuples, stored as TFRecord shards for efficient streaming during training. Google's RLDS repository provides conversion scripts for ROS bags, HDF5 archives, and raw video folders.
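Reading an RLDS dataset follows the standard TensorFlow Datasets pattern, sketched below with a hypothetical dataset name; each episode exposes a nested `steps` dataset of per-frame dictionaries.

```python
import tensorflow_datasets as tfds

# "my_deformable_dataset" is a placeholder; substitute a registered RLDS
# dataset name or a local builder directory.
ds = tfds.load("my_deformable_dataset", split="train")

for episode in ds.take(1):
    # Each RLDS episode wraps a nested dataset of time steps.
    for step in episode["steps"]:
        obs = step["observation"]   # dict of image/proprioception tensors
        action = step["action"]
        done = bool(step["is_last"])
```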
HDF5 remains popular for offline datasets because it supports random access to multi-GB files without loading everything into RAM. A typical deformable manipulation HDF5 file contains groups for RGB frames (`/rgb`), depth maps (`/depth`), joint positions (`/qpos`), gripper states (`/gripper`), and annotations (`/labels/grasp_points`). LeRobot's dataset format extends HDF5 with metadata fields for camera calibration, task descriptions, and success flags, enabling cross-dataset training without manual preprocessing.
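The layout above maps directly onto h5py calls. The sketch below writes one episode with illustrative shapes and chunk sizes; chunking along the time axis is what keeps single-frame random access cheap.

```python
import h5py
import numpy as np

T = 300  # illustrative episode length in frames

with h5py.File("episode_0001.hdf5", "w") as f:
    # Chunking one frame at a time allows reading single frames from disk.
    f.create_dataset("rgb", data=np.zeros((T, 480, 640, 3), np.uint8),
                     chunks=(1, 480, 640, 3), compression="gzip")
    f.create_dataset("depth", data=np.zeros((T, 480, 640), np.float32),
                     chunks=(1, 480, 640), compression="gzip")
    f.create_dataset("qpos", data=np.zeros((T, 7), np.float32))
    f.create_dataset("gripper", data=np.zeros((T,), np.float32))
    f.create_dataset("labels/grasp_points", data=np.zeros((T, 4, 2), np.float32))
    f.attrs["task"] = "fold_towel"
    f.attrs["success"] = True

# Random access later, without loading the whole file into RAM:
with h5py.File("episode_0001.hdf5", "r") as f:
    frame_150 = f["rgb"][150]  # reads a single chunk from disk
```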
MCAP is the emerging standard for real-time data logging. Unlike ROS bags, MCAP supports schema evolution, compression, and indexed seeking, making it suitable for multi-year dataset archives. ROS 2's rosbag2_storage_mcap plugin writes MCAP natively, and Foxglove's MCAP tooling provides Python/C++ readers for training pipelines. A 10,000-trajectory deformable dataset in MCAP format typically consumes 500 GB–2 TB depending on RGB resolution and compression settings.
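Reading such a file from a training pipeline can use the `mcap` Python package, as sketched below. The topic names are assumptions, and decoding `message.data` depends on how the messages were encoded (for ROS 2 logs, typically CDR via the companion ROS 2 reader).

```python
from mcap.reader import make_reader

# Iterate raw messages from selected topics; topic names are illustrative.
with open("episode_0001.mcap", "rb") as f:
    reader = make_reader(f)
    for schema, channel, message in reader.iter_messages(
        topics=["/camera/color/image_raw", "/joint_states"]
    ):
        # message.data holds the encoded payload; message.log_time is in ns.
        print(channel.topic, schema.name if schema else None, message.log_time)
```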
Evaluation Metrics: Success Rate, Shape Error, and Temporal Consistency
Deformable manipulation policies are evaluated on task success rate (did the cloth reach the target fold?), shape error (L2 distance between predicted and ground-truth vertex positions), and temporal consistency (do predicted trajectories avoid self-collisions and sudden jumps?). Success rate is binary but sensitive to threshold choice—a towel folded to within 5 cm of the goal may count as success in simulation but fail in production.
Shape error metrics require ground-truth meshes or point clouds. For cloth, this means tracking 500–2,000 vertices per frame using motion capture markers or depth-based reconstruction. For rope, centerline error measures the average distance between predicted and actual spline control points. EPIC-KITCHENS annotations include hand-object contact labels but not full mesh tracking, limiting their use for deformable benchmarks.
Temporal consistency penalizes action sequences that cause the robot to jerk or the object to tear. Metrics include action smoothness (sum of squared action differences between timesteps) and collision rate (percentage of frames where the gripper intersects the object mesh). CALVIN's long-horizon benchmark evaluates multi-step tasks over 1,000+ frames, exposing policies that succeed on short episodes but accumulate errors over extended interactions. Production systems targeting 95%+ reliability require datasets with 20,000+ trajectories and evaluation protocols that test edge cases—wrinkled fabric, tangled cables, partial occlusions.
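These metrics are simple to compute once predictions and ground truth are time-aligned. The sketch below gives minimal NumPy implementations, with the 5 cm success threshold from above exposed as an explicit, tunable parameter.

```python
import numpy as np


def shape_error(pred_verts: np.ndarray, gt_verts: np.ndarray) -> float:
    # pred_verts, gt_verts: (T, V, 3) predicted vs. ground-truth positions.
    # Mean L2 distance over all vertices and frames.
    return float(np.linalg.norm(pred_verts - gt_verts, axis=-1).mean())


def action_smoothness(actions: np.ndarray) -> float:
    # actions: (T, action_dim). Sum of squared first differences; lower is smoother.
    diffs = np.diff(actions, axis=0)
    return float((diffs ** 2).sum())


def success_rate(final_errors, threshold_m: float = 0.05) -> float:
    # Binary success against an explicit threshold (5 cm as quoted above).
    return float((np.asarray(final_errors) < threshold_m).mean())
```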
Procurement Strategies: Build, Buy, or Crowdsource Deformable Datasets
Teams building deformable manipulation systems face a build-versus-buy decision for training data. In-house collection offers full control over task definitions, object types, and annotation schemas but requires $50,000–$200,000 in teleoperation hardware plus 6–12 months of engineering time to build data pipelines. Buying datasets from vendors like Scale AI or Claru delivers 10,000+ trajectories in 4–8 weeks but limits customization to predefined task templates.
Crowdsourced data collection via truelabel's data marketplace splits the difference: buyers post task specifications (fold a towel, route a cable through a fixture) with success criteria and per-trajectory pricing; collectors with teleoperation rigs bid on the work. A 5,000-trajectory dataset costs $12,000–$35,000 and delivers in 6–10 weeks, 60% faster than in-house collection[6]. Quality control uses automated checks (trajectory length, action smoothness, success flag) plus human review of 10% of episodes.
Public datasets like Open X-Embodiment and DROID provide free baselines but lack task diversity for production use. DROID's 76,000 trajectories span 86 tasks across 564 scenes, but only 6,000 involve deformable objects. Teams targeting specific applications—automotive wire harness assembly, surgical suture tying, food packaging—need proprietary datasets with 50,000+ trajectories per task family. The procurement decision hinges on time-to-deployment: research teams use public data; production teams buy or crowdsource.
Licensing and Provenance: Compliance for Commercial Deformable AI Models
Deformable manipulation datasets carry licensing constraints that affect model commercialization. EPIC-KITCHENS-100's annotations are CC BY-NC 4.0, prohibiting commercial use without a separate agreement. RoboNet's dataset license allows commercial training but requires attribution and prohibits redistribution. Open X-Embodiment aggregates 22 datasets with heterogeneous licenses—some permissive (MIT, Apache 2.0), others restrictive (CC BY-NC, custom academic-only terms).
Data provenance tracking is mandatory for EU AI Act compliance and increasingly required by enterprise buyers. Provenance records must document data source, collection method, annotator consent, and license terms for every trajectory. C2PA content credentials provide cryptographic provenance for media files, but adoption in robotics datasets remains under 5%. Most teams use internal metadata databases that map dataset IDs to collection sessions, annotator pools, and license agreements.
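A minimal provenance record covering the fields listed above might look like the sketch below; the field names and SPDX-style license identifier are illustrative rather than a mandated schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceRecord:
    # One record per trajectory, mapping data back to its collection context.
    trajectory_id: str
    source: str              # e.g. "teleop_session_2025_05_12" (illustrative)
    collection_method: str   # "vr_teleop" | "leader_follower" | "simulation"
    annotator_consent: bool  # documented consent from the teleoperator/annotator
    license: str             # SPDX-style identifier, e.g. "CC-BY-4.0"
```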
Procurement contracts should specify data ownership, derivative work rights, and indemnification for IP claims. U.S. Federal Acquisition Regulation Subpart 27.4 governs data rights in government contracts; commercial buyers typically negotiate perpetual, worldwide licenses with sublicensing rights for model training. Truelabel's marketplace standardizes licensing via a buyer-friendly template: datasets are licensed under CC BY 4.0 by default, with optional commercial-exclusive terms for 2× the base price.
Integration with Robot Learning Frameworks: LeRobot, RoboMimic, and Custom Pipelines
Deformable manipulation datasets plug into robot learning frameworks via standardized loaders. LeRobot provides a unified interface for 25+ public datasets, including BridgeData V2, DROID, and custom HDF5 archives. Datasets are lazy-loaded during training to avoid RAM overflow on multi-TB collections. LeRobot's ACT training notebook demonstrates end-to-end fine-tuning on a 5,000-trajectory cloth-folding dataset in under 200 lines of Python.
RoboMimic targets offline imitation learning with support for behavior cloning, inverse RL, and offline RL algorithms. It expects datasets in HDF5 format with specific key names (`/observations/rgb`, `/actions`, `/rewards`). Converting RLDS or MCAP datasets to RoboMimic format requires writing custom scripts that remap field names and resample timestamps to uniform intervals. Most teams maintain conversion pipelines in internal repositories rather than contributing them upstream.
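A conversion script of this kind is mostly key remapping, as the hedged sketch below shows. It assumes the source file also stores an `/actions` dataset, and it follows the key names quoted above; real RoboMimic configs may expect slightly different keys (e.g. `obs/...`), so treat the mapping table as an assumption to adapt.

```python
import h5py

# Source-key -> RoboMimic-style destination key. Illustrative mapping only.
KEY_MAP = {
    "rgb": "observations/rgb",
    "qpos": "observations/qpos",
    "actions": "actions",
}


def convert(src_path: str, dst_path: str, demo_name: str = "demo_0") -> None:
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
        grp = dst.create_group(f"data/{demo_name}")
        for src_key, dst_key in KEY_MAP.items():
            grp.create_dataset(dst_key, data=src[src_key][:])
        # RoboMimic-style episode length attribute.
        grp.attrs["num_samples"] = src["actions"].shape[0]


convert("episode_0001.hdf5", "robomimic_episodes.hdf5")
```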
Custom pipelines dominate production systems because public frameworks lack domain-specific preprocessing. A cable routing pipeline might apply centerline extraction, contact-point detection, and action smoothing before feeding data to the model. A cloth folding pipeline might compute mesh normals, detect grasp affordances, and filter trajectories where the fabric slips. These steps are task-specific and rarely generalize, making dataset format standardization a persistent challenge despite efforts like RLDS and LeRobot's schema.
External references and source context
1. Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). Cited for Open X-Embodiment dataset statistics and deformable task coverage gaps.
2. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100 (arXiv). Cited for EPIC-KITCHENS-100 dataset scale and annotation methodology.
3. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch (arXiv). Cited for diffusion policy performance benchmarks on cable routing tasks.
4. Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning (arXiv). Cited for survey data on simulation-plus-real versus real-only training efficiency.
5. Encord Series C announcement (encord.com). Cited for Encord platform adoption and funding validation.
6. truelabel physical AI data marketplace bounty intake (truelabel.ai). Cited for marketplace pricing and delivery timelines for deformable datasets.
FAQ
What sensor modalities are required for deformable object manipulation datasets?
Minimum viable datasets include RGB-D video at 10–30 Hz, robot joint positions, and gripper open/close states. High-fidelity datasets add force/torque sensors, tactile arrays, and motion capture for ground-truth mesh tracking. Depth cameras (Intel RealSense, Azure Kinect) cost $200–$400; tactile sensors (DIGIT, GelSight) add $500–$2,000 per gripper. Most academic datasets use RGB-D only; production systems targeting 90%+ success rates add tactile feedback for contact-rich tasks like cable insertion.
How many trajectories are needed to train a deformable manipulation policy?
Single-task policies (one object type, one task) converge on 5,000–10,000 demonstrations. Multi-task generalists require 50,000–100,000+ trajectories across diverse objects and scenarios. Diffusion policies need 2–3× more data than behavior cloning to achieve equivalent performance. Sim-to-real transfer reduces real-world data requirements by 40–60% when combined with domain randomization over material properties and lighting conditions.
What is the difference between RLDS and HDF5 for storing robot datasets?
RLDS is a TensorFlow Datasets wrapper that adds trajectory semantics (episodes, steps, rewards) and supports streaming from cloud storage. HDF5 is a binary format with hierarchical groups and random access, suitable for offline datasets. RLDS integrates natively with TensorFlow training loops; HDF5 requires custom data loaders but works with PyTorch, JAX, and other frameworks. Most teams use HDF5 for archival storage and convert to RLDS or custom formats for training.
Can deformable manipulation models trained on simulation transfer to real robots?
Sim-to-real transfer works when simulation includes domain randomization over material properties (stiffness, friction, damping) and visual appearance (lighting, textures, backgrounds). Policies trained on 100,000 randomized simulation episodes plus 2,000 real demonstrations outperform 20,000 real-only demonstrations. Zero-shot transfer (simulation only, no real data) achieves under 30% success on deformable tasks due to simulation-reality gaps in contact dynamics and material behavior.
What licensing terms allow commercial use of deformable manipulation datasets?
Permissive licenses (MIT, Apache 2.0, CC BY 4.0) allow commercial model training and deployment without restrictions. CC BY-NC prohibits commercial use; academic-only licenses require separate agreements for production systems. Open X-Embodiment aggregates datasets with mixed licenses—check per-dataset terms before training. Proprietary datasets from Scale AI, Claru, or truelabel's marketplace include commercial licenses by default, with pricing 1.5–2× higher than academic-only alternatives.
Find datasets covering deformable object manipulation
Truelabel surfaces vetted datasets and capture partners working on deformable object manipulation. Send us the modality, scale, and rights you need, and we'll route you to the closest match.
Post a Deformable Manipulation Data Request