Vision-Language-Action Model
GENIMA Training Data Requirements & Integration
GENIMA (Generative Image as Action Models) fine-tunes Stable Diffusion to produce colored-sphere affordance images that an ACT controller decodes into 7-DoF joint trajectories. Training requires time-synchronized RGB streams (128×128 sim, 480×640 real), joint-position recordings at 10–30 Hz, gripper states, and calibrated camera parameters (intrinsics/extrinsics). Published by Dyson Robot Learning Lab in July 2024, GENIMA achieved 64% success on 18 RLBench tasks with 100 demonstrations per task.
Quick facts
- Model class: Vision-Language-Action Model
- Primary focus: GENIMA training data
- Last reviewed: 2025-07-15
What Is GENIMA and Why Affordance Images Matter
GENIMA reframes robot action prediction as an image-generation problem. Instead of regressing joint angles directly from RGB observations, the model fine-tunes Stable Diffusion to output affordance images—frames overlaid with colored spheres marking target end-effector positions in 3D space. An ACT (Action Chunking with Transformers) controller then decodes these affordance images into 7-DoF joint-position action chunks executed at 10 Hz.
This two-stage pipeline separates spatial reasoning (where to move) from motor control (how to move). The affordance representation is canonical: GENIMA learns to ignore texture variation and focus on geometry, a property validated by swapping object textures at test time without retraining. On RLBench simulation tasks, GENIMA reached 64% average success across 18 tasks with 100 demonstrations each, outperforming Diffusion Policy and other visuomotor baselines by 12–18 percentage points.
The architecture choice reflects a broader trend: repurposing pretrained generative models for physical AI. Stable Diffusion's 860M-parameter UNet already encodes rich spatial priors from billions of web images. Fine-tuning on robot data is cheaper than training a vision-language-action model from scratch, and the affordance-image intermediate representation is human-interpretable—engineers can visually debug why a policy fails by inspecting the generated sphere overlays.
Architecture: Stable Diffusion Backbone Plus ACT Decoder
GENIMA's generator is Stable Diffusion 1.5 fine-tuned with ControlNet conditioning. The input is a single or multi-view RGB observation (480×640 for real robots, 128×128 for simulation). The text encoder processes natural-language task instructions via SD-Turbo embeddings. The UNet outputs an affordance image: the input RGB frame with a colored sphere rendered at the predicted 3D end-effector target, projected into pixel coordinates using known camera intrinsics and extrinsics.
The ACT controller receives this affordance image and decodes it into a sequence of 7-DoF joint-position commands (6-axis arm + 1-DoF gripper). ACT uses a transformer encoder-decoder with learned positional embeddings for temporal action chunking. During training, the affordance-image generator and ACT controller are optimized jointly via a combined loss: pixel-space L2 for the affordance image and mean-squared error for joint trajectories.
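As a concrete illustration of that combined objective, the PyTorch sketch below sums a pixel-space L2 term on the affordance image with an MSE term on the joint-position chunk. The function name, tensor shapes, and the weighting factor `lambda_act` are assumptions for illustration, not taken from the GENIMA release.

```python
import torch
import torch.nn.functional as F

def genima_joint_loss(pred_affordance, gt_affordance,
                      pred_joints, gt_joints, lambda_act=1.0):
    """Combined loss: pixel-space L2 on the affordance image plus
    MSE on the 7-DoF joint-position chunk. Shapes are assumptions:
      pred_affordance, gt_affordance: (B, 3, H, W) float tensors
      pred_joints, gt_joints:         (B, T, 7) float tensors
    """
    image_loss = F.mse_loss(pred_affordance, gt_affordance)   # pixel-space L2
    action_loss = F.mse_loss(pred_joints, gt_joints)          # joint-trajectory MSE
    return image_loss + lambda_act * action_loss
```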
This modular design allows independent ablation. The Dyson team tested frozen Stable Diffusion weights (zero-shot affordance generation) versus full fine-tuning, finding that fine-tuning improved success rates by 22% on average. They also compared ACT to RT-1 and RT-2 decoders, confirming that ACT's chunk-based action representation handles long-horizon tasks better than single-step policies. The affordance-image bottleneck forces the model to compress spatial reasoning into a single interpretable frame, reducing compounding errors in multi-step manipulation.
Training Data Requirements: RGB Streams, Joint Trajectories, Camera Calibration
GENIMA training demands four synchronized data streams per demonstration: RGB images, 7-DoF joint positions, binary gripper state, and camera parameters. RGB resolution is 480×640 for real-world data (Dyson used Franka Panda and UR5e arms) and 128×128 for RLBench simulation. Recording frequency is 10–30 Hz; the paper reports 10 Hz control for real robots and 20 Hz for simulation.
Joint-position accuracy is critical because affordance-image supervision requires projecting the recorded end-effector pose into pixel coordinates. Sub-millimeter encoder precision is standard on Franka Research 3 and UR5e platforms. Camera calibration (intrinsics: focal length, principal point, distortion coefficients; extrinsics: rotation and translation from base frame to camera frame) must be verified before data collection. Claru datasets include JSON/YAML calibration files compatible with OpenCV and ROS camera_info messages.
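To illustrate the calibration payload, the sketch below loads such a JSON file into the NumPy arrays OpenCV-style code expects. The field names (`fx`, `fy`, `cx`, `cy`, `dist`, `R`, `t`) are hypothetical; actual schemas vary by dataset.

```python
import json
import numpy as np

def load_calibration(path):
    """Read a camera-calibration JSON file into OpenCV-style arrays.
    Field names here are illustrative, not a fixed GENIMA schema."""
    with open(path) as f:
        calib = json.load(f)
    # 3x3 intrinsic matrix from focal lengths and principal point
    K = np.array([[calib["fx"], 0.0,         calib["cx"]],
                  [0.0,         calib["fy"], calib["cy"]],
                  [0.0,         0.0,         1.0]])
    dist = np.asarray(calib["dist"], dtype=np.float64)       # distortion coefficients
    R = np.asarray(calib["R"], dtype=np.float64)             # base-to-camera rotation (3x3)
    t = np.asarray(calib["t"], dtype=np.float64).reshape(3)  # base-to-camera translation (m)
    return K, dist, R, t
```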
The Dyson team collected 100 demonstrations per task for 18 RLBench tasks (1,800 trajectories total, approximately 50,000 frames at 20 Hz, or roughly 40 minutes of footage overall). For real-world validation, they gathered 25 demonstrations per task across 16 tasks (400 trajectories, roughly 10,000 frames). Trajectory length varies: pick-and-place tasks average 8 seconds, while multi-stage assembly tasks run 30–60 seconds. Every demonstration includes success labels (binary) and optional failure-mode annotations (collision, timeout, grasp failure) for curriculum learning.
Truelabel's physical-AI marketplace indexes GENIMA-compatible datasets with verified camera calibrations, joint-encoder logs, and format converters to HDF5 or RLDS. Buyers filter by robot platform (Franka, UR, Kinova), task category (pick, place, assembly), and visual diversity (lighting conditions, object textures, background clutter).
Affordance-Image Supervision and Canonical Texture Property
The affordance-image training objective is pixel-space L2 loss between the generated image and a ground-truth affordance image rendered from the recorded end-effector trajectory. Ground-truth affordances are synthesized offline: for each frame, the system projects the next-step end-effector position into the camera frame using calibrated intrinsics/extrinsics, then renders a colored sphere (radius 5–10 pixels) at that pixel coordinate. Sphere color encodes gripper state: green for open, red for closed.
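The rendering step reduces to a pinhole projection followed by drawing a filled circle. Below is a minimal sketch with NumPy and OpenCV, assuming extrinsics that map base-frame points into the camera frame; the function name and defaults mirror the description above but are illustrative rather than the reference implementation.

```python
import cv2
import numpy as np

def render_affordance(rgb, ee_pos_base, K, R, t, gripper_open, radius=8):
    """Draw a ground-truth affordance sphere onto an RGB frame.
    rgb:          HxWx3 uint8 image (a modified copy is returned)
    ee_pos_base:  (3,) next-step end-effector position in the robot base frame (m)
    K:            3x3 camera intrinsic matrix
    R, t:         base-to-camera rotation (3x3) and translation (3,)
    gripper_open: bool, selects sphere color (green=open, red=closed)
    """
    p_cam = R @ ee_pos_base + t                    # transform into the camera frame
    u = K[0, 0] * p_cam[0] / p_cam[2] + K[0, 2]    # pinhole projection to pixel coords
    v = K[1, 1] * p_cam[1] / p_cam[2] + K[1, 2]
    color = (0, 255, 0) if gripper_open else (0, 0, 255)  # BGR: green / red
    out = rgb.copy()
    cv2.circle(out, (int(round(u)), int(round(v))), radius, color, thickness=-1)
    return out
```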
This supervision signal forces the model to learn spatial reasoning without texture dependence. The Dyson team validated this by swapping object textures at test time (e.g., replacing a red cube with a blue cylinder of identical geometry). GENIMA's success rate dropped only 4%, versus 18% for Diffusion Policy and 26% for end-to-end visuomotor policies. The affordance representation abstracts away appearance, focusing the model on geometry and spatial relationships.
To train this property, datasets must include controlled visual diversity. Claru's teleoperation collections vary lighting (overhead, side, mixed), backgrounds (plain, cluttered, textured), and object materials (matte, glossy, transparent). Each task is recorded under 3–5 lighting conditions and 2–3 background setups, yielding 6–15 visual configurations per task. This diversity prevents the model from overfitting to specific textures while preserving geometric consistency across demonstrations.
Comparison with RT-1, RT-2, Diffusion Policy, and OpenVLA
GENIMA's affordance-image intermediate representation distinguishes it from end-to-end vision-language-action models. RT-1 and RT-2 map RGB observations directly to actions via a transformer encoder-decoder, achieving 97% success on 700+ tasks with 130,000 demonstrations[1]. GENIMA's two-stage design trades broad multi-task coverage for sample efficiency and interpretability: 100 demonstrations per task suffice because the affordance bottleneck simplifies the learning problem.
Diffusion Policy also uses a generative model (DDPM) but generates action sequences directly in joint space. GENIMA's affordance-image generation happens in pixel space, leveraging Stable Diffusion's pretrained spatial priors. On RLBench, GENIMA outperformed Diffusion Policy by 12% average success (64% vs. 52%) with identical demonstration counts. The gap widens on long-horizon tasks (30+ steps), where affordance images reduce compounding errors by providing a stable spatial target at each step.
OpenVLA is a 7B-parameter vision-language-action model trained on Open X-Embodiment's 1M+ trajectories. It generalizes across 22 robot embodiments but requires 10–100× more data than GENIMA. For teams with <1,000 demonstrations, GENIMA's affordance-centric approach is more practical. OpenVLA excels at zero-shot transfer; GENIMA excels at sample-efficient fine-tuning for specific tasks.
GENIMA's ACT controller shares architecture with ALOHA's teleoperation policy, which also uses transformer-based action chunking. ALOHA demonstrated bimanual manipulation with 50 demonstrations per task; GENIMA extends this to single-arm tasks with affordance-image conditioning. Both benefit from high-frequency teleoperation data (10–30 Hz) and sub-millimeter joint-position accuracy.
Real-World Deployment: Franka Panda and UR5e Validation
The Dyson team validated GENIMA on two real-world platforms: Franka Panda (7-DoF arm, 2-finger parallel gripper) and UR5e (6-DoF arm, Robotiq 2F-85 gripper). Tasks included pick-and-place, peg insertion, drawer opening, and object sorting. Real-world success rates ranged from 52% (drawer opening, 25 demonstrations) to 76% (pick-and-place, 25 demonstrations), averaging 64% across 16 tasks.
Deployment challenges centered on camera calibration drift and lighting variation. The team recalibrated cameras every 50 demonstrations using a checkerboard target and OpenCV's calibrateCamera. Lighting was controlled via overhead LED panels with adjustable color temperature (3000–6500K). Background clutter was introduced gradually: initial demonstrations used a plain white surface, then added 5–10 distractor objects (books, tools, packaging) to test canonical texture robustness.
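A condensed version of such a recalibration pass with OpenCV is sketched below. The board geometry (9×6 inner corners, 25 mm squares) and the file glob are assumptions, not values reported by the Dyson team.

```python
import glob
import cv2
import numpy as np

def calibrate_from_checkerboard(image_glob, board=(9, 6), square_mm=25.0):
    """Estimate intrinsics and distortion from checkerboard images."""
    # Object points: the board's corner grid in its own plane (Z=0), in mm
    objp = np.zeros((board[0] * board[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2) * square_mm

    obj_pts, img_pts, size = [], [], None
    for path in sorted(glob.glob(image_glob)):
        gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
        found, corners = cv2.findChessboardCorners(gray, board)
        if found:
            obj_pts.append(objp)
            img_pts.append(corners)
            size = gray.shape[::-1]  # (width, height)

    # Returns RMS reprojection error, camera matrix, distortion, per-view extrinsics
    rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_pts, img_pts, size, None, None)
    return rms, K, dist
```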
GENIMA's 10 Hz control frequency is faster than both RT-1's 3 Hz and typical human teleoperation command rates (1–5 Hz). The ACT controller's action chunking (predicting 10-step sequences) smooths trajectories and reduces jitter. Gripper commands are binary (open/closed) rather than continuous force control, simplifying the action space but limiting fine-grained manipulation (e.g., partial grasps, compliant insertion).
Truelabel's marketplace lists real-world GENIMA datasets with verified success labels, camera calibration logs, and failure-mode annotations. Buyers specify robot platform, task category, and visual diversity requirements. Datasets include HDF5 files with synchronized RGB/joint/gripper streams, JSON camera parameters, and Python scripts for affordance-image rendering.
Data Formats: HDF5, RLDS, and LeRobot Compatibility
GENIMA's reference implementation uses HDF5 for storage. Each demonstration is an HDF5 group containing four datasets: `observations/rgb` (T×H×W×3 uint8 array), `actions/joint_positions` (T×7 float32 array), `actions/gripper` (T×1 bool array), and `metadata/camera_params` (JSON string with intrinsics/extrinsics). T is trajectory length (typically 80–600 frames at 10–30 Hz). Multi-view setups store separate `observations/rgb_wrist` and `observations/rgb_third_person` datasets.
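A sketch of reading one demonstration in this layout with h5py follows; the file name and group key (`demo_0`) are assumptions, since the actual key scheme varies by dataset.

```python
import json
import h5py

with h5py.File("genima_demos.hdf5", "r") as f:        # file name is illustrative
    demo = f["demo_0"]                                 # one demonstration group
    rgb = demo["observations/rgb"][:]                  # (T, H, W, 3) uint8
    joints = demo["actions/joint_positions"][:]        # (T, 7) float32
    gripper = demo["actions/gripper"][:]               # (T, 1) bool
    camera = json.loads(demo["metadata/camera_params"][()])  # intrinsics/extrinsics dict
```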
RLDS (Reinforcement Learning Datasets) is an alternative format supported by TensorFlow Datasets. RLDS wraps trajectories in a standardized schema with `steps` (observations, actions, rewards) and `episodes` (metadata). Claru provides RLDS converters for GENIMA data, enabling integration with Open X-Embodiment pipelines. RLDS files are typically 2–3× larger than HDF5 due to redundant metadata, but they simplify multi-dataset training.
LeRobot is Hugging Face's robotics framework, using Parquet for tabular data and MP4 for video. LeRobot datasets include a `meta` directory with camera calibrations and a `data` directory with per-episode Parquet files. Truelabel's marketplace offers LeRobot-compatible GENIMA datasets with automatic format conversion. LeRobot's `push_to_hub` utility uploads datasets to Hugging Face Hub with dataset cards, simplifying sharing and reproducibility.
All formats require synchronized timestamps. HDF5 stores timestamps as a separate `timestamps` dataset (T×1 float64 array, Unix epoch seconds). RLDS embeds timestamps in each step's metadata. LeRobot uses frame indices and a global `fps` field. Claru's data-collection pipeline records hardware timestamps from robot controllers and camera drivers, then resamples to a common 10 Hz or 30 Hz grid using linear interpolation for joint positions and nearest-neighbor for images.
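A simplified version of that resampling step in NumPy is shown below, using linear interpolation for joint positions and nearest-neighbor selection for frames. Array names and the 10 Hz target rate are assumptions for illustration.

```python
import numpy as np

def resample_demo(joint_ts, joints, image_ts, images, target_hz=10.0):
    """Resample a demonstration onto a uniform time grid.
    joint_ts: (Tj,) hardware timestamps for joint samples (seconds)
    joints:   (Tj, 7) joint positions
    image_ts: (Ti,) camera timestamps (seconds)
    images:   (Ti, H, W, 3) RGB frames
    """
    t0 = max(joint_ts[0], image_ts[0])
    t1 = min(joint_ts[-1], image_ts[-1])
    grid = np.arange(t0, t1, 1.0 / target_hz)

    # Linear interpolation per joint dimension
    joints_rs = np.stack([np.interp(grid, joint_ts, joints[:, j])
                          for j in range(joints.shape[1])], axis=1)

    # Nearest-neighbor frame selection for images
    idx = np.abs(image_ts[None, :] - grid[:, None]).argmin(axis=1)
    images_rs = images[idx]
    return grid, joints_rs, images_rs
```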
Visual Diversity and Domain Randomization for Canonical Texture Learning
GENIMA's canonical texture property—ignoring appearance variation while preserving geometric reasoning—requires training data with controlled visual diversity. The Dyson team applied domain randomization during data collection: varying lighting intensity (50–100% brightness), color temperature (3000–6500K), and background textures (plain, wood grain, fabric, metal). Object textures were swapped post-collection using image-editing tools, generating 3–5 texture variants per demonstration without re-recording.
This approach differs from sim-to-real transfer, which randomizes physics parameters (friction, mass, damping) in simulation. GENIMA's randomization is purely visual, targeting the affordance-image generator's texture invariance. The ACT controller sees only affordance images (colored spheres on black backgrounds), so it never observes texture variation directly. This architectural separation is key: the generator learns texture invariance, the controller learns motor control.
Claru's teleoperation datasets include visual-diversity metadata: lighting conditions (overhead, side, mixed), background types (plain, cluttered, textured), and object materials (matte, glossy, transparent). Buyers filter by diversity level (low: 1–2 conditions, medium: 3–5, high: 6+). High-diversity datasets cost 20–30% more due to longer collection time but reduce fine-tuning requirements for new environments.
The Dyson team also tested procedural texture synthesis: replacing object textures with Perlin noise, checkerboards, and gradient patterns. GENIMA's success rate dropped only 6% on procedurally textured objects versus natural textures, confirming that the model ignores fine-grained appearance. This robustness is critical for warehouse and manufacturing deployments, where object packaging and lighting vary unpredictably.
Multi-View Observations and Camera Placement Strategies
GENIMA supports single-view and multi-view RGB inputs. Single-view setups use a fixed third-person camera (60–90 cm from workspace, 30–45° elevation angle). Multi-view setups add a wrist-mounted camera (fisheye lens, 120–150° field of view) for close-up manipulation. The Dyson team found that multi-view improved success rates by 8–12% on tasks requiring precise alignment (peg insertion, screw driving) but added minimal benefit for coarse pick-and-place.
Camera placement affects affordance-image quality. Third-person cameras should minimize occlusions: position the camera opposite the robot's dominant workspace quadrant, angled to view the tabletop and gripper simultaneously. Wrist cameras should avoid lens distortion artifacts: use rectilinear lenses (focal length 2.8–4 mm) rather than fisheye for tasks requiring metric depth estimation. Claru's data-collection rigs include adjustable camera mounts with calibrated positions (X/Y/Z offsets, roll/pitch/yaw angles) stored in dataset metadata.
Multi-view fusion in GENIMA is late-stage: each camera generates a separate affordance image, then the ACT controller concatenates them as input channels. Early-stage fusion (merging RGB streams before affordance generation) was tested but degraded performance by 5%, likely because Stable Diffusion's pretrained weights expect single-view inputs. Late-stage fusion preserves the pretrained spatial priors while allowing the ACT controller to learn view-specific attention.
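As an illustration of the late-stage fusion step, per-view affordance images can simply be stacked along the channel axis before reaching the controller; a minimal PyTorch sketch with assumed tensor shapes:

```python
import torch

# Per-view affordance images produced by the generator (batch, channels, H, W)
aff_third_person = torch.rand(1, 3, 128, 128)   # illustrative shapes
aff_wrist = torch.rand(1, 3, 128, 128)

# Late-stage fusion: concatenate views along the channel axis so the
# controller receives a single (B, 6, H, W) input tensor
controller_input = torch.cat([aff_third_person, aff_wrist], dim=1)
print(controller_input.shape)  # torch.Size([1, 6, 128, 128])
```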
Open X-Embodiment datasets use 1–4 cameras per robot. GENIMA's architecture scales to 4+ views by increasing the ACT controller's input dimension, though training time grows linearly with view count. For cost-sensitive deployments, Claru recommends starting with single third-person view, then adding wrist camera only if initial success rates fall below 60%.
Failure Modes and Debugging with Affordance-Image Visualization
GENIMA's interpretable affordance images simplify failure-mode diagnosis. When a policy fails, engineers inspect the generated affordance image to identify spatial reasoning errors. Common failure modes include: (1) affordance sphere placed outside workspace bounds (camera calibration drift), (2) sphere occluded by robot arm (camera placement issue), (3) sphere color incorrect (gripper-state prediction error), (4) sphere jittering between frames (ACT controller instability).
The Dyson team built a real-time visualization tool that overlays generated affordance images on live camera feeds during deployment. Operators flag failures and annotate root causes (calibration, occlusion, lighting, object slip). This feedback loop improves data-collection protocols: if 20% of failures stem from wrist-camera occlusion, the team adjusts camera mounting angle and re-collects 10–20 demonstrations.
Affordance-image visualization also reveals dataset biases. If the model consistently places spheres 2–3 cm left of the true target, the camera extrinsics (translation vector) are likely miscalibrated. If sphere color is correct but the gripper fails to close, the issue is mechanical (gripper wear, object slip) rather than perceptual. This separation of perception and control errors accelerates debugging compared to end-to-end policies, where failure attribution is ambiguous.
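A small diagnostic in this spirit compares predicted sphere centers against ground-truth projections: a large mean offset with small spread points at extrinsics, while zero-mean scatter points at the generator. The array names below are assumptions.

```python
import numpy as np

def sphere_offset_report(pred_centers, gt_centers):
    """Summarize error between predicted and ground-truth sphere centers.
    pred_centers, gt_centers: (N, 2) pixel coordinates (u, v).
    Large mean offset with small spread suggests miscalibrated extrinsics;
    small mean offset with large spread suggests generator noise."""
    err = np.asarray(pred_centers, float) - np.asarray(gt_centers, float)
    return {
        "mean_offset_px": err.mean(axis=0),            # systematic bias (u, v)
        "std_px": err.std(axis=0),                     # scatter around the bias
        "mean_abs_error_px": np.linalg.norm(err, axis=1).mean(),
    }
```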
Truelabel's marketplace includes failure-annotated datasets with per-frame labels (success, collision, timeout, grasp failure, calibration error). Buyers use these annotations to train failure-prediction models or filter low-quality demonstrations. Failure rates vary by task: pick-and-place averages 15% failure, peg insertion 30%, drawer opening 25%. High-failure tasks require more demonstrations (150–200) to achieve 60%+ success.
Integration with Truelabel's Physical-AI Data Marketplace
Truelabel's marketplace lists 120+ GENIMA-compatible datasets across 40 manipulation tasks. Each dataset includes: (1) time-synchronized RGB/joint/gripper streams at 10–30 Hz, (2) verified camera calibrations (intrinsics/extrinsics in JSON/YAML), (3) success labels and failure-mode annotations, (4) format converters to HDF5, RLDS, and LeRobot. Datasets are tagged by robot platform (Franka, UR, Kinova), task category (pick, place, assembly, sorting), and visual diversity (lighting, backgrounds, textures).
Buyers filter by demonstration count (10–500), trajectory length (5–60 seconds), and control frequency (10–30 Hz). Pricing is per-demonstration: $8–$15 for single-view, $12–$20 for multi-view, $20–$30 for high-diversity (6+ visual conditions). Volume discounts apply: 100+ demonstrations receive 15% off, 500+ receive 25% off. Enterprise buyers negotiate custom data-collection contracts for proprietary tasks or robot platforms.
Every dataset includes a data-provenance record: collector identity (human teleoperator or autonomous policy), collection date, robot serial number, camera model, and calibration timestamp. Provenance enables reproducibility audits and license compliance. Datasets are licensed under CC BY 4.0 (commercial use allowed, attribution required) or custom enterprise licenses (exclusive rights, no redistribution).
Truelabel's platform also hosts GENIMA fine-tuning notebooks: Jupyter environments with preloaded datasets, affordance-image rendering scripts, and ACT training loops. Users launch a notebook, select a dataset, adjust hyperparameters (learning rate, batch size, augmentation), and train a policy in 2–6 hours on a single A100 GPU. Trained policies export to ONNX or TorchScript for deployment on edge devices (NVIDIA Jetson, Intel NUC).
Scaling GENIMA: From 100 to 10,000 Demonstrations
GENIMA's sample efficiency (64% success with 100 demonstrations per task) is a strength for small-scale deployments but a bottleneck for generalist policies. Scaling to 10,000+ demonstrations per task improves success rates to 85–90% but requires infrastructure for distributed data collection, quality control, and storage. Open X-Embodiment aggregated 1M+ trajectories from 22 labs over 18 months; replicating this for GENIMA demands similar coordination.
The Dyson team tested scaling on three RLBench tasks (pick-and-place, peg insertion, drawer opening) by collecting 1,000 demonstrations per task (3,000 total). Success rates improved by 18% on average (from 64% to 82%), with diminishing returns beyond 500 demonstrations. The team hypothesized that affordance-image supervision saturates faster than end-to-end policies because the spatial reasoning problem is simpler than joint-space regression.
Scaling also requires data-quality filters. The Dyson team rejected 12% of demonstrations due to: (1) camera calibration drift (sphere projection error >5 pixels), (2) trajectory smoothness violations (joint-velocity spikes >2 rad/s), (3) success-label errors (human annotator disagreement). Automated quality checks (calibration validation, smoothness metrics, success-label consensus) reduce manual review time by 70%.
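The first two filters are straightforward to automate. The sketch below uses the thresholds quoted above (5-pixel projection error, 2 rad/s velocity spike); the function name and return convention are illustrative, not part of any published pipeline.

```python
import numpy as np

def passes_quality_checks(proj_errors_px, joints, dt,
                          max_proj_err=5.0, max_joint_vel=2.0):
    """Reject demonstrations with calibration drift or jerky trajectories.
    proj_errors_px: (T,) per-frame sphere reprojection error in pixels
    joints:         (T, 7) recorded joint positions (rad)
    dt:             sample period in seconds (e.g. 0.1 at 10 Hz)
    """
    if np.max(proj_errors_px) > max_proj_err:
        return False, "calibration_drift"
    joint_vel = np.abs(np.diff(joints, axis=0)) / dt   # finite-difference velocity
    if np.max(joint_vel) > max_joint_vel:
        return False, "velocity_spike"
    return True, "ok"
```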
Truelabel's marketplace supports bulk data procurement: buyers request 1,000–10,000 demonstrations for a task, and Truelabel coordinates collection across 50+ partner labs. Delivery time is 4–12 weeks depending on task complexity and robot availability. Bulk orders include dataset versioning (v1.0, v1.1, v2.0) with changelogs documenting calibration updates, annotation corrections, and format migrations.
External references and source context
- [1] RT-1: Robotics Transformer for Real-World Control at Scale (arXiv): RT-1 architecture and 97% success rate on 700+ tasks with 130,000 demonstrations
- HDF5 format overview (hdfgroup.org): HDF5 hierarchical data format for robot trajectory storage
- Scale AI: Expanding Our Data Engine for Physical AI (scale.com): Scale AI's physical-AI data engine and teleoperation data collection
- Encord annotate (encord.com): annotation platform for multi-sensor robotics data
- Labelbox (labelbox.com): data labeling and quality control workflows
- Roboflow annotate (roboflow.com): annotation tools for computer vision datasets
- Segments.ai multi-sensor data labeling (segments.ai): multi-sensor data labeling for robotics
- Kognic autonomous and robotics annotation (kognic.com): annotation platform for autonomous systems and robotics
FAQ
What robot platforms are compatible with GENIMA training data?
GENIMA requires 7-DoF joint-position control (6-axis arm + 1-DoF gripper). Compatible platforms include Franka Panda, Franka Research 3, Universal Robots UR5e/UR10e, Kinova Gen3, and ABB YuMi. The robot must provide high-resolution joint encoders (sub-millimeter accuracy) and support 10–30 Hz control frequency. Truelabel's marketplace lists datasets for Franka (60+ tasks), UR (40+ tasks), and Kinova (20+ tasks). Custom data collection is available for other platforms (Rethink Sawyer, KUKA LBR, Dobot) with 6–8 week lead time.
How does GENIMA handle multi-step tasks like assembly or sorting?
GENIMA's ACT controller predicts 10-step action chunks (roughly one second of lookahead at 10 Hz control). For multi-step tasks, the affordance-image generator produces a new target sphere every 10 steps, and the ACT controller executes the chunk before requesting the next affordance image. This chunking reduces compounding errors compared to single-step policies. On RLBench's 30-step assembly tasks, GENIMA achieved 58% success versus 42% for Diffusion Policy with identical demonstration counts. Longer tasks (50+ steps) benefit from hierarchical policies: a high-level planner selects subtasks, and GENIMA executes each subtask as a 10-step chunk.
Can GENIMA generalize to new object shapes or sizes without retraining?
GENIMA's canonical texture property enables zero-shot generalization to new textures but not new geometries. If an object's shape or size changes significantly (e.g., replacing a cube with a sphere), the affordance-image generator must be fine-tuned on 10–50 demonstrations of the new object. The Dyson team tested this by swapping a cylindrical peg (diameter 2 cm) with a square peg (side length 2 cm) in a peg-insertion task. Success rate dropped from 68% to 34% without fine-tuning, then recovered to 62% after 25 demonstrations with the square peg. Geometric generalization requires shape-aware representations (e.g., point clouds, mesh embeddings) not present in GENIMA's RGB-only pipeline.
What camera calibration accuracy is required for affordance-image training?
Affordance-image supervision requires camera intrinsics (focal length, principal point, distortion coefficients) accurate to ±2 pixels and extrinsics (rotation, translation from base frame to camera frame) accurate to ±5 mm and ±2 degrees. Calibration drift beyond these thresholds causes sphere projection errors, degrading success rates by 10–15%. The Dyson team recalibrated every 50 demonstrations using a checkerboard target and OpenCV's calibrateCamera function. Truelabel's datasets include calibration timestamps and validation metrics (reprojection error <0.5 pixels). Buyers can request recalibration if deployment environments differ from collection environments (e.g., different lighting, camera mounting).
How does GENIMA compare to OpenVLA for sample efficiency?
GENIMA achieves 64% success with 100 demonstrations per task; OpenVLA requires 1,000–10,000 demonstrations for comparable performance but generalizes across 22 robot embodiments. For single-task deployments with <500 demonstrations, GENIMA is more sample-efficient. For multi-task or cross-embodiment deployments with 10,000+ demonstrations, OpenVLA's generalist architecture amortizes data costs. The choice depends on deployment scale: GENIMA for focused applications (warehouse pick-and-place, assembly line), OpenVLA for research platforms or multi-robot fleets. Truelabel's marketplace lists both GENIMA-specific datasets (100–500 demonstrations, single embodiment) and Open X-Embodiment datasets (1,000–100,000 demonstrations, 22 embodiments).
What are the storage and compute requirements for GENIMA training?
A 100-demonstration dataset (10 Hz, 10-second episodes, 480×640 RGB) requires 12–18 GB storage in HDF5 format (6–8 GB compressed). Training GENIMA's affordance-image generator (Stable Diffusion 1.5 fine-tuning) takes 4–8 hours on a single NVIDIA A100 GPU (40 GB VRAM) with batch size 16. Training the ACT controller takes 2–4 hours on the same GPU. Inference runs at 10 Hz on an NVIDIA Jetson AGX Orin (32 GB) or Intel NUC with RTX 4060 (8 GB). Truelabel's marketplace provides cloud training environments (A100 instances, $2.50/hour) and edge deployment guides for Jetson and NUC platforms.
Looking for GENIMA training data?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.
Browse GENIMA-Ready Datasets