
Physical AI Glossary

Affordance Prediction

Affordance prediction is a computer vision task that identifies actionable regions on objects—where a robot can grasp, push, pull, or manipulate. Modern systems use vision-language-action models trained on teleoperation datasets containing RGB-D images, point clouds, and human demonstration trajectories. Google's RT-2 achieved 62% success on novel objects by grounding language instructions in affordance heatmaps, while OpenVLA reports 29.4% absolute improvement over prior methods when trained on 970K trajectories from the Open X-Embodiment dataset.

Updated 2025-06-15
By truelabel
Reviewed by truelabel

Quick facts

Term
Affordance Prediction
Domain
Robotics and physical AI
Last reviewed
2025-06-15

What Affordance Prediction Solves in Robot Manipulation

Affordance prediction answers the question: given an object and a task instruction, where should the robot touch it? A coffee mug affords grasping by the handle for pouring, but affords pushing from the side for sliding across a table. The same geometry supports multiple interaction modes depending on intent.

Traditional grasp detection outputs a single 6-DOF pose. Affordance prediction outputs a spatial probability distribution over the object surface, conditioned on task semantics. RT-2 demonstrates this by mapping natural language commands like "pick up the apple" to pixel-level action heatmaps, achieving 62% success on objects unseen during training[1]. The model learns that "pick up" activates top-surface affordances while "push" activates side-surface affordances.
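
As a concrete sketch of this output structure, the toy PyTorch module below fuses a visual feature map with a pooled language embedding and decodes a per-pixel affordance probability map. The architecture and all names are illustrative, not RT-2's actual design; it only shows how a task-conditioned heatmap differs from a single-pose output.

```python
import torch
import torch.nn as nn

class AffordanceHead(nn.Module):
    """Illustrative task-conditioned affordance head (not RT-2's architecture)."""

    def __init__(self, feat_dim: int = 256, text_dim: int = 512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, feat_dim)  # project instruction embedding
        self.decoder = nn.Sequential(
            nn.Conv2d(feat_dim, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=1),  # one logit per pixel
        )

    def forward(self, feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) visual features; text_emb: (B, text_dim)
        cond = self.text_proj(text_emb)[:, :, None, None]  # (B, C, 1, 1)
        fused = feats * torch.sigmoid(cond)                # FiLM-like gating on the instruction
        return torch.sigmoid(self.decoder(fused))          # (B, 1, H, W) affordance in [0, 1]

# "pick up" and "push" yield different text embeddings, so the same image
# produces different heatmaps depending on the instruction.
head = AffordanceHead()
heatmap = head(torch.randn(1, 256, 32, 32), torch.randn(1, 512))
```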

Data requirements differ sharply from object detection. Bounding boxes are insufficient—you need dense spatial annotations (segmentation masks, keypoint heatmaps, or 3D contact regions) paired with action labels. DROID collected 76K trajectories across 564 scenes by having operators teleoperate robots while recording RGB-D streams, proprioceptive state, and gripper commands at 10 Hz[2]. Each trajectory implicitly labels affordances through demonstrated contact points.

The Open X-Embodiment dataset aggregates 970K episodes from 22 robot embodiments, providing the scale needed for cross-embodiment transfer[3]. Models trained on this corpus generalize to new robot morphologies because affordances are object-centric, not robot-centric—a door handle affords pulling regardless of whether a Franka arm or UR5 executes the motion.

Annotation Workflows for Affordance Ground Truth

Teleoperation is the dominant data collection paradigm. Human operators control robots through VR interfaces, joysticks, or motion-capture rigs while cameras record the interaction. The robot's end-effector trajectory becomes the affordance label—contact points mark high-affordance regions, approach vectors encode interaction geometry.

BridgeData V2 used this method to gather 60K trajectories in kitchen environments, with operators performing tasks like "open the top drawer" or "pick up the sponge"[4]. Each trajectory contains 50-200 timesteps of RGB images, robot joint angles, and gripper state. Post-processing extracts contact frames where gripper closure coincides with object interaction, then projects 3D contact points back to 2D image coordinates to generate pixel-level affordance masks.
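
A minimal sketch of that post-processing step, assuming a pinhole camera model and a simple gripper-closure heuristic; the threshold, array layouts, and function name are illustrative rather than BridgeData V2's actual pipeline.

```python
import numpy as np

def extract_affordance_pixels(gripper_width, contact_pts_world, T_world_cam, K,
                              closed_thresh=0.01):
    """Project 3D contact points from gripper-closure frames into the image.

    gripper_width:      (T,) gripper opening per timestep, meters
    contact_pts_world:  (T, 3) end-effector position per timestep
    T_world_cam:        (4, 4) world-to-camera extrinsics
    K:                  (3, 3) camera intrinsics
    """
    contact = gripper_width < closed_thresh                # heuristic contact frames
    pts = contact_pts_world[contact]                       # (N, 3)
    pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
    pts_cam = (T_world_cam @ pts_h.T).T[:, :3]             # transform to camera frame
    uv = (K @ pts_cam.T).T                                 # pinhole projection
    return uv[:, :2] / uv[:, 2:3]                          # (N, 2) high-affordance pixels
```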

Active learning reduces annotation cost. Encord Active identifies high-uncertainty regions where the model disagrees with itself across augmented views, prioritizing those frames for human review. One physical AI team reported 40% reduction in labeling hours by focusing on failure modes—scenes where the robot dropped objects or missed grasps—rather than uniformly sampling successful demonstrations.

Point cloud annotation is critical for 3D affordances. Segments.ai's point cloud tools let annotators paint affordance regions directly on LiDAR scans, preserving depth information lost in 2D projections. The Dex-YCB dataset provides 582K frames of hand-object interaction with per-vertex affordance labels on 20 YCB objects, enabling models to learn that a hammer affords grasping near the handle (low-curvature cylindrical region) but striking with the head (high-mass distal region).

Vision-Language-Action Models and Affordance Grounding

Vision-language-action (VLA) models unify perception, language understanding, and motor control in a single transformer architecture. RT-1 pioneered this by tokenizing images, language instructions, and robot actions into a shared sequence, then training with imitation learning on 130K demonstrations[5]. The model learns to ground verbs like "pick" or "place" in visual affordance patterns without explicit affordance supervision.
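
RT-1 discretizes each continuous action dimension into 256 bins so that actions share a vocabulary with image and text tokens. A sketch of that binning, with illustrative action bounds:

```python
import numpy as np

def tokenize_action(action, low, high, n_bins=256):
    """Map a continuous action vector to per-dimension integer tokens."""
    frac = (np.clip(action, low, high) - low) / (high - low)   # normalize to [0, 1]
    return np.minimum((frac * n_bins).astype(int), n_bins - 1)

def detokenize_action(tokens, low, high, n_bins=256):
    """Recover bin-center actions for execution on the robot."""
    return low + (tokens + 0.5) / n_bins * (high - low)

# Example 7-DOF action: xyz delta, rpy delta, gripper (bounds are illustrative).
low, high = np.full(7, -1.0), np.full(7, 1.0)
tokens = tokenize_action(np.array([0.1, -0.2, 0.0, 0.3, 0.0, 0.0, 1.0]), low, high)
```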

OpenVLA extends this to 970K trajectories from Open X-Embodiment, using a 7B-parameter vision-language backbone pretrained on web data[6]. The key insight: internet-scale image-text pairs (a dog sitting, a person opening a door) provide weak affordance supervision. Fine-tuning on robot data then specializes these priors to manipulation-relevant affordances. OpenVLA achieves 29.4% absolute improvement over prior methods on unseen tasks, demonstrating that language grounding transfers across embodiments.

Data mixing ratios matter. RT-2 found that 50% web data + 50% robot data outperformed 100% robot data, because web images provide greater visual diversity (10M images vs. 100K robot trajectories)[7]. However, web data lacks action labels, so the model learns affordances only through language co-occurrence ("grasp the handle" appears near images of hands touching handles). Robot data provides the action supervision needed to convert semantic affordances into motor commands.
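
A minimal sketch of ratio-based mixing, with a plain-Python sampler standing in for a real tf.data or PyTorch pipeline:

```python
import random

def mixed_batches(web_examples, robot_examples, web_ratio=0.5, batch_size=32):
    """Yield batches drawn from two corpora at a fixed mixing ratio.

    Web examples carry only language supervision; robot examples carry
    action labels. web_ratio=0.5 mirrors the RT-2 finding above.
    """
    while True:
        batch = [
            random.choice(web_examples if random.random() < web_ratio else robot_examples)
            for _ in range(batch_size)
        ]
        yield batch
```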

The truelabel marketplace indexes teleoperation datasets by task taxonomy (pick-and-place, drawer opening, cloth folding), robot embodiment (Franka, UR5, Fetch), and scene diversity (kitchen, warehouse, lab bench). Buyers filter by these attributes to find data matching their deployment environment, avoiding the sim-to-real gap that plagued earlier approaches.

Benchmark Datasets and Evaluation Protocols

CALVIN is a long-horizon benchmark requiring models to complete 5-step instruction chains like "open the drawer, pick up the block, place it in the drawer, close the drawer, turn on the light"[8]. Success requires maintaining affordance predictions across state changes—the drawer's interior affords placement only after the drawer affords opening. Top models achieve 45% success on the full chain, versus 8% for single-task baselines.

LIBERO tests generalization across 4 axes: new objects (novel shapes), new scenes (different kitchens), new tasks (unseen instruction combinations), and new embodiments (different robot arms). The dataset provides 2,800 demonstrations across 130 tasks, with held-out test splits for each generalization axis. Models trained on 80% of objects achieve 72% success on held-out objects, but only 34% on held-out scenes, revealing that current methods overfit to background context rather than learning object-centric affordances.

DROID emphasizes real-world diversity: 76K trajectories from 564 scenes across 17 institutions, with 86 object categories and 58 task types[9]. The data includes systematic lighting variation (dawn, noon, dusk, artificial), occlusion (clutter, partial views), and distractor objects (non-target items in the workspace). Models trained on DROID show 23% better performance on held-out scenes compared to models trained on single-lab datasets, because the training distribution better covers real-world variation.

Evaluation should go beyond raw success rate to cover grasp stability (does the object slip?), approach efficiency (path length to contact), and damage avoidance (collision-free trajectories). ManiSkill provides a simulation benchmark with these metrics instrumented, allowing rapid iteration before real-robot validation. However, sim-to-real transfer remains challenging—models achieving 90% success in ManiSkill often drop to 60% on physical hardware due to unmodeled contact dynamics and sensor noise.

Data Formats and Storage for Affordance Datasets

RLDS (Reinforcement Learning Datasets) is the de facto standard for robot learning data, storing trajectories as TFRecord files with nested episode/step structure[10]. Each step contains observations (RGB images, depth maps, proprioception), actions (joint velocities, gripper commands), and metadata (task ID, success flag). RLDS supports arbitrary observation spaces, making it compatible with multi-modal sensors (cameras, LiDAR, tactile arrays).
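
Loading RLDS data follows the standard TFDS pattern shown below; the dataset name is a placeholder for any RLDS dataset registered with TFDS, such as the Open X-Embodiment subsets.

```python
import tensorflow_datasets as tfds

# Placeholder name: substitute any TFDS-registered RLDS dataset.
ds = tfds.load("bridge_dataset", split="train")

for episode in ds.take(1):
    # Each episode holds metadata plus a nested dataset of steps.
    for step in episode["steps"]:
        obs = step["observation"]   # e.g. RGB image, depth, proprioception
        action = step["action"]     # e.g. joint velocities, gripper command
        last = step["is_last"]      # episode-boundary flag
```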

LeRobot's dataset format uses Parquet for tabular data (timestamps, joint angles) and separate image directories for visual observations, reducing storage overhead by 3× compared to uncompressed video[11]. The format includes a JSON manifest mapping episode IDs to file paths, enabling lazy loading—models stream data from disk during training rather than loading entire datasets into RAM.

MCAP is gaining adoption for real-time teleoperation recording. The format supports timestamped message streams from heterogeneous sources (ROS topics, camera APIs, motion-capture systems), with microsecond-precision synchronization. rosbag2_storage_mcap provides ROS 2 integration, letting teams record affordance data directly from their teleoperation stack without format conversion.
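
A minimal recording sketch using the `mcap` Python package; the topic name, schema, and JSON payload are illustrative, and a real teleoperation stack would register one channel per sensor stream.

```python
import json
import time

from mcap.writer import Writer

with open("teleop_session.mcap", "wb") as f:
    writer = Writer(f)
    writer.start()
    # Register an illustrative schema and channel for gripper state.
    schema_id = writer.register_schema(
        name="GripperState",
        encoding="jsonschema",
        data=json.dumps({"type": "object"}).encode(),
    )
    channel_id = writer.register_channel(
        topic="/gripper/state",
        message_encoding="json",
        schema_id=schema_id,
    )
    # Log one message with nanosecond timestamps.
    now = time.time_ns()
    writer.add_message(
        channel_id,
        log_time=now,
        publish_time=now,
        data=json.dumps({"width": 0.042, "force": 3.1}).encode(),
    )
    writer.finish()
```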

Point cloud data uses PCD (Point Cloud Data) or LAS formats, storing XYZ coordinates plus optional RGB, intensity, and semantic labels. Affordance annotations are typically stored as per-point scalar fields (0.0 = no affordance, 1.0 = high affordance) or as separate segmentation masks. PointNet-style architectures consume these raw point sets directly, learning affordances from 3D geometry without 2D projection artifacts.
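
A sketch of writing that per-point scalar field into an ASCII PCD file, which accepts arbitrary extra FIELDS alongside x, y, z:

```python
import numpy as np

def write_pcd_with_affordance(path, xyz, affordance):
    """Write an ASCII PCD file with a per-point affordance field in [0, 1]."""
    n = len(xyz)
    header = "\n".join([
        "# .PCD v0.7 - Point Cloud Data file format",
        "VERSION 0.7",
        "FIELDS x y z affordance",   # affordance travels with the geometry
        "SIZE 4 4 4 4",
        "TYPE F F F F",
        "COUNT 1 1 1 1",
        f"WIDTH {n}",
        "HEIGHT 1",
        "VIEWPOINT 0 0 0 1 0 0 0",
        f"POINTS {n}",
        "DATA ascii",
    ])
    with open(path, "w") as f:
        f.write(header + "\n")
        np.savetxt(f, np.column_stack([xyz, affordance]), fmt="%.6f")

write_pcd_with_affordance("hammer.pcd", np.random.rand(100, 3), np.random.rand(100))
```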

Sim-to-Real Transfer and Domain Randomization

Simulation generates infinite affordance data at zero marginal cost, but sim-to-real transfer remains the bottleneck. Domain randomization addresses this by training on diverse simulated environments—varying lighting, textures, object shapes, and camera parameters—so the model learns affordances invariant to these factors[12]. Models trained with randomization achieve 78% real-world success versus 52% for models trained on fixed simulation parameters.
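
A minimal sketch of a randomization sampler; the parameters and ranges are illustrative and would be tuned to bracket the target deployment conditions.

```python
import random

def sample_randomized_scene():
    """Draw one simulation configuration; train across many such draws."""
    return {
        "light_intensity": random.uniform(0.3, 1.5),
        "light_azimuth_deg": random.uniform(0.0, 360.0),
        "table_texture": random.choice(["wood", "metal", "marble", "noise"]),
        "object_scale": random.uniform(0.8, 1.2),
        "camera_fov_deg": random.uniform(50.0, 70.0),
        "camera_jitter_m": [random.gauss(0.0, 0.02) for _ in range(3)],
    }

configs = [sample_randomized_scene() for _ in range(10_000)]  # one per episode
```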

RLBench provides 100 simulated manipulation tasks in CoppeliaSim, with procedural generation of object poses, colors, and distractor placement. The benchmark includes affordance-relevant tasks like "open drawer" (handle affordance), "stack blocks" (top-surface affordance), and "insert peg" (hole affordance). Top methods combine 10K simulated demonstrations with 100 real demonstrations, using the real data to fine-tune the final policy layers while keeping the affordance encoder frozen.

Photorealistic rendering narrows the reality gap. NVIDIA Cosmos uses physically-based rendering and neural radiance fields to generate synthetic RGB-D data indistinguishable from real sensor output. Early results show models trained purely on Cosmos data achieve 68% real-world success on pick-and-place tasks, versus 45% for models trained on traditional simulation.

However, contact dynamics remain difficult to simulate. Grasping a deformable object (sponge, cloth, food) requires modeling material compliance, friction, and slip—physics engines approximate these with simplified models that diverge from reality. The solution: hybrid datasets mixing simulated visual data (cheap, diverse) with real contact data (expensive, accurate). Models learn affordance localization from simulation, then learn grasp force control from real demonstrations.

Active Learning and Data Efficiency

Affordance prediction models require 10K-100K labeled trajectories to reach production performance, but teleoperation costs $50-200 per trajectory depending on task complexity. Active learning reduces this by identifying high-value examples—data points that maximally reduce model uncertainty.

Uncertainty sampling queries the model on unlabeled data, then prioritizes examples where the model's affordance heatmap has high entropy (flat probability distribution across the object surface). One team reduced labeling requirements by 60% by focusing on ambiguous objects—a cylindrical container could afford grasping from the side (stable) or top (unstable), and the model needed examples of both to learn the task-dependent preference.
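
A minimal sketch of entropy-based frame selection over predicted heatmaps; the budget and normalization are illustrative:

```python
import numpy as np

def heatmap_entropy(heatmap, eps=1e-8):
    """Shannon entropy of a predicted affordance heatmap.

    A flat distribution over the surface (high entropy) marks an ambiguous
    example worth routing to a human annotator.
    """
    p = heatmap / (heatmap.sum() + eps)          # normalize to a distribution
    return float(-(p * np.log(p + eps)).sum())

def select_for_labeling(heatmaps, budget=100):
    """Return indices of the most ambiguous frames, highest entropy first."""
    scores = np.array([heatmap_entropy(h) for h in heatmaps])
    return np.argsort(scores)[::-1][:budget]
```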

Failure-driven collection is more efficient than random sampling. After deploying a model, log all failed grasps (object dropped, collision, timeout), then have operators demonstrate the correct affordance for those specific failures. RoboCat uses this loop: deploy model, collect failures, retrain on failures, redeploy[13]. Each iteration improves success rate by 8-12% with only 500 new demonstrations, versus 2-3% improvement from 5,000 random demonstrations.

Scale AI's Physical AI platform automates this workflow: models flag low-confidence predictions, human annotators correct them, and the system retrains overnight. The platform reports 4× faster iteration cycles compared to manual data pipelines, because the active learning loop runs continuously rather than in quarterly batches.

Data augmentation extends limited real data. Geometric augmentations (rotation, scaling, cropping) are standard, but affordance-specific augmentations include: (1) color jittering to simulate lighting changes, (2) depth noise injection to match sensor characteristics, (3) object pose perturbation to cover approach angles. Models trained with these augmentations achieve 15% better generalization on held-out scenes.
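
A sketch of those three augmentations, using torchvision for the color jitter and NumPy for the depth and pose perturbations; all magnitudes are illustrative and should be matched to the deployed sensor.

```python
import numpy as np
import torch
from torchvision import transforms

# (1) Color jitter approximates lighting changes on the RGB stream.
color_jitter = transforms.ColorJitter(brightness=0.4, contrast=0.4,
                                      saturation=0.3, hue=0.05)

def augment_depth(depth, sigma=0.01, dropout_p=0.02):
    """(2) Gaussian noise plus random dropout to mimic depth-sensor artifacts."""
    noisy = depth + np.random.normal(0.0, sigma, depth.shape)
    mask = np.random.rand(*depth.shape) < dropout_p
    noisy[mask] = 0.0                   # dropped returns read as zero depth
    return noisy

def perturb_pose(pose_xyz, max_offset=0.02):
    """(3) Jitter the labeled object pose to cover nearby approach angles."""
    return pose_xyz + np.random.uniform(-max_offset, max_offset, size=3)

rgb = color_jitter(torch.rand(3, 224, 224))
depth = augment_depth(np.random.rand(224, 224))
```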

Multi-Modal Affordance Representations

RGB images alone are insufficient for robust affordance prediction—depth disambiguates foreground from background, tactile sensors detect contact, and proprioception tracks gripper state. EPIC-KITCHENS-100 provides egocentric video with audio, demonstrating that sound (drawer sliding, object clinking) provides affordance cues invisible to vision[14].

RGB-D fusion is the minimum viable sensor suite. Depth resolves ambiguities where 2D appearance misleads—a flat image of a box could be a real box (affords grasping) or a printed photo (affords nothing). The HOI4D dataset pairs RGB-D with hand pose tracking, showing that 3D hand-object spatial relationships improve affordance prediction by 18% over RGB-only models.

Point clouds preserve 3D geometry without perspective distortion. Point Cloud Library provides tools for segmenting objects from background clutter, estimating surface normals (perpendicular to graspable surfaces), and computing curvature (high curvature = edges, low curvature = flat graspable regions). Affordance models operating on point clouds achieve 12% better performance on transparent and reflective objects (glass, metal) where RGB-D depth estimation fails.
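
A sketch of the classic PCA-on-neighborhood estimator behind those tools: the eigenvector with the smallest eigenvalue approximates the surface normal, and that eigenvalue's share of the total is a standard curvature proxy. Brute-force neighbor search keeps the example short.

```python
import numpy as np

def normals_and_curvature(points, k=16):
    """Estimate per-point normals and curvature from raw 3D geometry."""
    normals = np.zeros_like(points)
    curvature = np.zeros(len(points))
    for i, p in enumerate(points):
        dists = np.linalg.norm(points - p, axis=1)
        nbrs = points[np.argsort(dists)[:k]]           # k nearest neighbors
        evals, evecs = np.linalg.eigh(np.cov(nbrs.T))  # ascending eigenvalues
        normals[i] = evecs[:, 0]                 # smallest-eigenvalue direction
        curvature[i] = evals[0] / evals.sum()    # low = flat, graspable region
    return normals, curvature

normals, curv = normals_and_curvature(np.random.rand(500, 3))
```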

Tactile sensing closes the loop. Vision predicts where to grasp; tactile confirms contact and measures grip force. The Dex-YCB dataset includes tactile sensor readings synchronized with visual observations, enabling models to learn that successful grasps correlate with specific tactile pressure patterns. This multi-modal approach reduces grasp failures from 22% to 8% on deformable objects.

Language Grounding and Task-Conditional Affordances

The same object affords different interactions depending on task intent. A bottle affords grasping by the neck for pouring, but grasping by the body for handing to a person. Language instructions disambiguate intent, making affordance prediction a vision-language problem.

Do As I Can, Not As I Say grounds language in affordances by scoring candidate actions against both language likelihood (does "pick up the apple" match this action?) and affordance likelihood (is this pixel a valid grasp point?)[15]. The model achieves 74% success on long-horizon tasks by chaining affordance predictions conditioned on sequential language instructions.
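
A minimal sketch of that scoring rule; both scoring functions are placeholders for a language-model head and an affordance value function.

```python
import numpy as np

def select_action(instruction, candidates, language_score, affordance_score):
    """SayCan-style selection: combine semantic fit with physical feasibility.

    The product rules out actions that are semantically right but physically
    infeasible, and vice versa.
    """
    combined = [language_score(instruction, a) * affordance_score(a)
                for a in candidates]
    return candidates[int(np.argmax(combined))]

# Toy usage with hand-written placeholder scores.
best = select_action(
    "pick up the apple",
    ["grasp_apple_top", "push_apple_side"],
    language_score=lambda instr, a: 0.9 if "grasp" in a else 0.2,
    affordance_score=lambda a: 0.8,
)
```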

Pretrained vision-language models transfer affordance knowledge from web data. CLIP embeddings encode that "handle" co-occurs with cylindrical protrusions on objects, "button" with small circular regions, "lid" with top surfaces. Fine-tuning on robot data specializes these semantic affordances to actionable affordances—not just where the handle is, but how to approach it (grasp axis, finger placement).

Zero-shot generalization emerges from language grounding. RT-2 successfully executes instructions like "pick up the extinct animal" (toy dinosaur) despite never seeing that phrase during training, because the vision-language backbone maps "extinct animal" to visual features (scaly texture, reptilian shape) that activate learned affordance patterns[16]. This demonstrates that affordance prediction benefits from the same scaling laws as language models—more diverse pretraining data enables broader generalization.

However, language ambiguity remains challenging. "Pick up the cup" could mean grasp the handle, grasp the rim, or grasp the body—human intent depends on downstream task (drinking vs. moving vs. washing). Current models resolve this by learning task-specific priors from demonstration data, but this requires separate datasets for each task context.

Commercial Affordance Datasets and Licensing

Most academic affordance datasets prohibit commercial use. EPIC-KITCHENS-100's license restricts use to non-commercial research, blocking deployment in production robots. RoboNet's license similarly limits commercial applications, creating a procurement gap for companies building physical AI products.

Truelabel's data provenance system tracks commercial-use rights through the entire collection pipeline—from operator consent forms to sensor calibration records to annotation workflows. Every dataset on the marketplace includes a machine-readable license specifying permitted use cases (research, commercial, government), geographic restrictions, and derivative work terms.

Custom data collection fills gaps in public datasets. Silicon Valley Robotics Center's custom collection service deploys teleoperation rigs in client facilities, gathering affordance data on proprietary objects (manufacturing parts, warehouse inventory, medical devices) under work-for-hire agreements that grant full commercial rights. Typical engagements collect 5K-20K trajectories over 2-4 weeks at $80K-200K total cost.

Claru's kitchen task dataset provides 12K trajectories of manipulation tasks (opening containers, pouring liquids, cutting food) with CC-BY-4.0 licensing, permitting commercial use with attribution. The dataset includes affordance annotations for 47 object categories, with 3D bounding boxes, grasp keypoints, and contact force measurements.

Licensing complexity increases with multi-institutional datasets. Open X-Embodiment aggregates data from 22 labs, each with different institutional review board approvals and data sharing agreements. The consortium negotiated a unified license permitting research use, but commercial deployment requires separate agreements with each contributing institution—a process taking 6-12 months for legal review.

Affordance Prediction in Production Systems

Deploying affordance models in production requires engineering beyond model accuracy. Latency budgets constrain architecture choices—a robot operating at 10 Hz control frequency has 100ms per inference, including sensor acquisition, preprocessing, model forward pass, and action decoding. RT-1's deployment uses TensorRT quantization to reduce inference time from 180ms to 45ms on NVIDIA Jetson, enabling real-time control.

Failure detection prevents catastrophic errors. Production systems monitor affordance heatmap entropy (high entropy = model uncertainty), prediction consistency across frames (sudden changes = sensor noise or occlusion), and action feasibility (does the predicted grasp exceed joint limits?). When any check fails, the system halts and requests human intervention rather than executing a low-confidence action.
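
A sketch of those three monitors combined into a single pre-execution gate; all thresholds are illustrative and must be tuned per deployment.

```python
import numpy as np

def safe_to_execute(heatmap, prev_heatmap, grasp_joints, joint_limits,
                    entropy_max=4.0, drift_max=0.3):
    """Return (ok, reason) before allowing the robot to act."""
    p = heatmap / heatmap.sum()
    entropy = float(-(p * np.log(p + 1e-8)).sum())
    if entropy > entropy_max:                          # model is uncertain
        return False, "high heatmap entropy"
    if np.abs(heatmap - prev_heatmap).mean() > drift_max:
        return False, "inconsistent predictions across frames"
    low, high = joint_limits
    if np.any(grasp_joints < low) or np.any(grasp_joints > high):
        return False, "predicted grasp exceeds joint limits"
    return True, "ok"
```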

Scale AI's partnership with Universal Robots demonstrates production-scale affordance prediction: 50+ UR10 arms in warehouse environments, picking 10K objects per day across 200 SKUs[17]. The system maintains 94% grasp success by continuously logging failures and retraining weekly on the previous week's error cases.

Data drift monitoring tracks distribution shift between training and deployment. If the model encounters object categories, lighting conditions, or clutter levels absent from training data, performance degrades silently. Production systems log input statistics (color histograms, depth distributions, object counts) and alert when deployment data diverges from training data by more than 2 standard deviations.
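
A minimal sketch of that 2-standard-deviation rule as a running monitor over any scalar input statistic:

```python
import numpy as np

class DriftMonitor:
    """Flag deployment inputs that drift beyond n_sigma of training stats."""

    def __init__(self, train_values, n_sigma=2.0):
        self.mu = float(np.mean(train_values))
        self.sigma = float(np.std(train_values))
        self.n_sigma = n_sigma

    def check(self, value):
        z = abs(value - self.mu) / (self.sigma + 1e-8)
        return z > self.n_sigma          # True = alert, distribution shift

# Fit on a training statistic (e.g. mean scene brightness), check live frames.
monitor = DriftMonitor(np.random.normal(0.5, 0.05, 10_000))
alert = monitor.check(0.72)
```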

Human-in-the-loop correction accelerates improvement. When the robot fails, an operator demonstrates the correct affordance via teleoperation, and that trajectory immediately enters the retraining queue. Figure AI's partnership with Brookfield uses this approach to gather 100K+ correction demonstrations from warehouse operators, creating a flywheel where deployment generates training data that improves deployment performance.

Future Directions and Open Problems

Foundation models for affordances remain elusive. Vision-language models like CLIP provide semantic understanding but lack the action grounding needed for manipulation. A true affordance foundation model would predict interaction possibilities across arbitrary objects, tasks, and embodiments—trained on millions of diverse demonstrations, then fine-tuned for specific applications with hundreds of examples.

NVIDIA's GR00T N1 represents progress toward this goal: a 2B-parameter model pretrained on 1M+ robot trajectories, achieving 68% zero-shot success on novel objects[18]. However, the model still requires 500-1,000 task-specific demonstrations for production-level performance, indicating that current pretraining data lacks sufficient diversity.

Multi-task learning trades off specialization versus generalization. A model trained on 100 tasks achieves 70% average success, while 100 task-specific models achieve 85% average success but require 100× more training data. The optimal strategy depends on task similarity—closely related tasks (different drawer-opening variations) benefit from shared representations, while dissimilar tasks (grasping vs. pouring) may require separate models.

Sim-to-real transfer for deformable objects remains unsolved. Cloth, food, and biological tissue exhibit complex contact dynamics that current simulators cannot model accurately. UMI's gripper dataset provides 3K real-world demonstrations of cloth folding and food manipulation, but this scale is insufficient for learning generalizable affordances. The field needs either better simulators or 10× more real data.

Explainability and safety certification will become critical as robots enter human environments. Regulators will demand interpretable affordance predictions—not just "grasp here" but "grasp here because this region has low curvature, high friction, and sufficient clearance for the gripper." Current end-to-end models lack this transparency, creating a gap between research capabilities and deployment requirements.


External references and source context

  1. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 performance metrics on unseen object categories

    robotics-transformer2.github.io
  2. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID contains 76K trajectories across 564 scenes with teleoperation data

    arXiv
  3. Open X-Embodiment: Robotic Learning Datasets and RT-X Models

    Open X-Embodiment dataset contains 970K episodes from 22 robot embodiments

    arXiv
  4. BridgeData V2: A Dataset for Robot Learning at Scale

    BridgeData V2 contains 60K trajectories in kitchen environments

    arXiv
  5. RT-1: Robotics Transformer for Real-World Control at Scale

    RT-1 trained on 130K demonstrations across manipulation tasks

    arXiv
  6. OpenVLA: An Open-Source Vision-Language-Action Model

    OpenVLA uses 7B-parameter backbone and achieves 29.4% improvement

    arXiv
  7. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 training methodology with 50% web data and 50% robot data

    robotics-transformer2.github.io
  8. CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks

    CALVIN requires 5-step instruction chains with 45% success rate

    arXiv
  9. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    DROID contains 76K trajectories from 564 scenes across 17 institutions

    arXiv
  10. RLDS: an Ecosystem to Generate, Share and Use Datasets in Reinforcement Learning

    RLDS ecosystem paper describing TFRecord format and episode structure

    arXiv
  11. LeRobot: State-of-the-art Machine Learning for Real-World Robotics in Pytorch

    LeRobot storage efficiency and lazy loading implementation

    arXiv
  12. Sim-to-Real Transfer of Robotic Control with Dynamics Randomization

    Domain randomization achieves 78% real-world success versus 52% baseline

    arXiv
  13. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation

    RoboCat achieves 8-12% improvement per iteration with 500 demonstrations

    arXiv
  14. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100

    EPIC-KITCHENS-100 contains egocentric video with multi-modal annotations

    arXiv
  15. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

    SayCan achieves 74% success on long-horizon tasks with language grounding

    arXiv
  16. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 demonstrates zero-shot generalization to novel language instructions

    arXiv
  17. Scale AI and Universal Robots: physical AI deployment

    Scale + UR deployment picks 10K objects per day with 94% success

    scale.com
  18. NVIDIA GR00T N1 technical report

    GR00T N1 is a 2B-parameter model pretrained on 1M+ trajectories

    arXiv


FAQ

What is the difference between affordance prediction and grasp detection?

Grasp detection outputs a single 6-DOF pose (position + orientation) for the robot gripper, optimized for stable grasping. Affordance prediction outputs a spatial probability distribution over the entire object surface, indicating multiple possible interaction regions conditioned on task intent. A coffee mug affords grasping by the handle for pouring, pushing from the side for sliding, or grasping by the rim for drinking—affordance prediction captures all three, while grasp detection returns only the most stable grip. Modern vision-language-action models like RT-2 use affordance prediction because it supports task-conditional manipulation: the same object affords different interactions depending on the language instruction.

How much training data do affordance prediction models require?

Production-quality affordance models require 10K-100K labeled trajectories depending on task diversity and generalization requirements. RT-1 used 130K demonstrations across 700 tasks to achieve 97% success on trained tasks and 76% on novel task variations. OpenVLA trained on 970K trajectories from Open X-Embodiment to enable cross-embodiment transfer. However, active learning and pretrained vision-language models reduce this requirement—RoboCat achieved 80% success with only 500 task-specific demonstrations by leveraging a foundation model pretrained on 250K diverse trajectories. The key factor is coverage: data must span the object categories, scene conditions, and task variations expected at deployment.

Can affordance models trained in simulation transfer to real robots?

Sim-to-real transfer works for rigid objects with domain randomization but remains challenging for deformable objects and contact-rich tasks. Models trained on RLBench with randomized lighting, textures, and object poses achieve 78% real-world success on pick-and-place tasks. However, grasping cloth, food, or compliant objects requires accurate contact dynamics that current simulators cannot model—these tasks need real demonstration data. The best approach is hybrid: train affordance localization (where to interact) in simulation with 10K diverse scenes, then fine-tune grasp execution (how to interact) on 100-500 real demonstrations. NVIDIA Cosmos's photorealistic rendering narrows the visual reality gap, but tactile and force feedback still require real-world data.

What sensors are required for affordance prediction?

RGB-D cameras are the minimum viable sensor suite, providing color images for semantic understanding and depth maps for 3D geometry. Most production systems use Intel RealSense or Azure Kinect cameras at 30 Hz, synchronized with robot proprioception (joint angles, gripper state). Point cloud sensors (LiDAR, structured light) improve performance on transparent and reflective objects where RGB-D depth estimation fails. Tactile sensors (force-torque sensors, tactile arrays) close the loop by confirming contact and measuring grip force—the Dex-YCB dataset shows tactile feedback reduces grasp failures from 22% to 8% on deformable objects. Egocentric cameras mounted on the robot wrist provide better viewpoints for fine manipulation compared to fixed external cameras.

How do vision-language-action models ground language in affordances?

VLA models like RT-2 tokenize images, language instructions, and robot actions into a shared sequence, then train with imitation learning on human demonstrations. The model learns to map verbs (pick, place, push, pull) to visual affordance patterns by observing which image regions correlate with which actions across thousands of demonstrations. Pretrained vision-language backbones like CLIP provide semantic priors—"handle" co-occurs with cylindrical protrusions, "button" with small circular regions—which fine-tuning specializes to actionable affordances. At inference, the language instruction conditions the affordance heatmap: "pick up the apple" activates top-surface affordances, while "push the apple" activates side-surface affordances. This enables zero-shot generalization to novel instructions that compose known concepts.

What are the main failure modes of affordance prediction systems?

Occlusion is the most common failure—if the target affordance region is hidden behind another object, the model cannot predict it from visual input alone. Transparent and reflective objects (glass, polished metal) cause depth sensor failures, producing incorrect 3D geometry. Lighting variation degrades RGB-based models trained on narrow illumination conditions—models trained in lab lighting fail under warehouse fluorescent lights or outdoor sunlight. Novel object categories absent from training data trigger out-of-distribution failures where the model halts or predicts nonsensical affordances. Deformable objects (cloth, food, soft plastics) exhibit state-dependent affordances that change during interaction, requiring closed-loop replanning rather than open-loop execution of a predicted affordance.

Find datasets covering affordance prediction

Truelabel surfaces vetted datasets and capture partners working with affordance prediction. Send us the modality, scale, and rights you need, and we'll route you to the closest match.

Browse Physical AI Datasets