Model Profile
SuSIE: Hierarchical Robot Manipulation via Subgoal Image Editing
SuSIE (Subgoal Synthesis via Image Editing) is a two-tier manipulation architecture from UC Berkeley that fine-tunes InstructPix2Pix on 220,847 video instruction pairs to generate 256×256 subgoal images, then trains a goal-conditioned policy on 60,000 robot demonstrations to execute low-level motor commands at 5–10 Hz.
Quick facts
- Model class: Model Profile
- Primary focus: SuSIE robot manipulation
- Last reviewed: 2026-05-13
What Is SuSIE?
SuSIE is a hierarchical manipulation framework published at ICLR 2024 by Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine from UC Berkeley. The architecture splits robotic control into two stages: a high-level subgoal synthesizer that generates intermediate visual waypoints by fine-tuning InstructPix2Pix on 220,847 video instruction pairs, and a low-level goal-conditioned policy trained on 60,000 robot demonstrations from BridgeData V2.
The high-level model consumes a 256×256 RGB observation and a natural-language instruction, then outputs a 256×256 subgoal image representing the desired future state. The low-level policy receives the current observation and the synthesized subgoal, producing continuous end-effector deltas at 5–10 Hz. This decomposition enables zero-shot generalization to novel objects and environments by leveraging the semantic priors learned during InstructPix2Pix pretraining on web-scale image-editing datasets.
SuSIE demonstrated a 55 percent success rate on real-world WidowX manipulation tasks, outperforming RT-2-X's 20 percent baseline on the same benchmark. The model's hierarchical structure mirrors classical planning but replaces symbolic state representations with learned visual embeddings; unlike RT-1 and RT-2, which map observations directly to actions end-to-end, SuSIE supervises an explicit visual subgoal between perception and control.
Architecture and Key Innovations
SuSIE's high-level subgoal synthesizer is a fine-tuned InstructPix2Pix diffusion model conditioned on language instructions and current observations. Fine-tuning occurs on 220,847 video pairs from Something-Something-v2, where each pair consists of a start frame, an end frame, and a natural-language description of the action. The model learns to predict plausible intermediate states that satisfy the instruction, effectively compressing long-horizon tasks into a sequence of visual subgoals.
The low-level policy is a goal-conditioned behavioral cloning network trained on 60,000 teleoperated demonstrations from BridgeData V2. Each demonstration includes synchronized 256×256 RGB streams from a third-person camera, 7-DoF end-effector poses, and gripper states. The policy network is a ResNet-18 encoder followed by a three-layer MLP that outputs continuous action deltas. Training uses mean-squared-error loss on action residuals, with data augmentation including random crops, color jitter, and temporal subsampling to improve robustness.
The hierarchical decomposition reduces the effective horizon length for the low-level policy, mitigating compounding error in long-horizon tasks. By generating subgoals at approximately 2 Hz and executing motor commands at 5–10 Hz, SuSIE achieves a 10× reduction in planning depth compared to end-to-end approaches like RT-2. This design choice trades off real-time reactivity for improved sample efficiency and generalization, a tradeoff validated by the model's zero-shot performance on novel object categories not present in the training distribution.
Training Data Requirements
SuSIE requires two distinct data streams. The high-level subgoal model consumes 220,847 video instruction triples from Something-Something-v2, a human-activity dataset with 174 action classes and dense language annotations. Each triple consists of a 256×256 start frame, a 256×256 end frame 1–3 seconds later, and a free-form instruction describing the state transition. The dataset's diversity across object categories and manipulation primitives is critical for learning generalizable subgoal priors.
The low-level policy trains on 60,000 robot demonstrations from BridgeData V2, collected via teleoperation on a WidowX 250 6-DoF arm with a parallel-jaw gripper. Each demonstration includes synchronized 256×256 RGB observations from a fixed third-person camera, 7-DoF end-effector poses at 10 Hz, binary gripper states, and task-level language annotations. Camera calibration must remain consistent across demonstrations to avoid distribution shift during goal-conditioned training.
Data preprocessing includes frame resizing to 256×256 via center cropping, temporal alignment of observation and action streams, and removal of demonstrations with occlusions or motion blur exceeding a 10-pixel threshold. The RLDS format is used for serialization, with each episode stored as a sequence of (observation, action, reward, discount) tuples. Augmentation during training includes random horizontal flips, color jitter with ±0.2 brightness and ±0.1 saturation, and temporal subsampling at 0.5–2× speed to improve robustness to execution-time latency[1].
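The sketch below illustrates the frame resizing and blur-based episode filtering described above, assuming Python with OpenCV; the variance-of-Laplacian sharpness check is an illustrative stand-in for the 10-pixel motion-blur criterion, and the function names are not from the SuSIE codebase.

```python
import numpy as np
import cv2  # OpenCV for resizing and a simple sharpness estimate

def center_crop_resize(frame: np.ndarray, size: int = 256) -> np.ndarray:
    """Center-crop to a square, then resize to size x size."""
    h, w = frame.shape[:2]
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    crop = frame[top:top + side, left:left + side]
    return cv2.resize(crop, (size, size), interpolation=cv2.INTER_AREA)

def is_too_blurry(frame: np.ndarray, threshold: float = 100.0) -> bool:
    """Variance-of-Laplacian sharpness proxy; stands in for the motion-blur check."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() < threshold

def preprocess_episode(frames: list[np.ndarray]) -> list[np.ndarray] | None:
    """Resize all frames; drop the episode entirely if any frame fails the blur check."""
    processed = [center_crop_resize(f) for f in frames]
    if any(is_too_blurry(f) for f in processed):
        return None  # episode removed from the training set
    return [f.astype(np.float32) / 255.0 for f in processed]  # normalize to [0, 1]
```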
Input and Output Specifications
The high-level subgoal synthesizer accepts a 256×256 RGB observation tensor and a variable-length natural-language instruction encoded via CLIP text embeddings. The observation is normalized to [0,1] and passed through the InstructPix2Pix U-Net encoder, while the instruction embedding conditions the diffusion process via cross-attention layers. The model outputs a 256×256 RGB subgoal image representing the desired future state, generated via 50-step DDIM sampling with classifier-free guidance at scale 7.5.
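For illustration, subgoal synthesis can be sketched with the Hugging Face diffusers InstructPix2Pix pipeline; the checkpoint path is hypothetical and the image_guidance_scale value is an assumption, since SuSIE's released weights and exact sampler configuration may differ.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline, DDIMScheduler

# Hypothetical path to a SuSIE-style fine-tuned InstructPix2Pix checkpoint.
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "path/to/susie-finetuned-instructpix2pix", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # DDIM sampling, as described above

observation = Image.open("current_observation.png").convert("RGB").resize((256, 256))

subgoal = pipe(
    prompt="put the spoon in the pot",   # natural-language instruction
    image=observation,                   # current 256x256 RGB observation
    num_inference_steps=50,
    guidance_scale=7.5,                  # classifier-free guidance on the text condition
    image_guidance_scale=1.5,            # image-conditioning guidance (assumed value)
).images[0]

subgoal.save("subgoal.png")              # 256x256 subgoal handed to the low-level policy
```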
The low-level policy receives a 256×256 current observation and a 256×256 subgoal image, both normalized to [0,1]. The ResNet-18 encoder processes each image independently, producing 512-dimensional feature vectors that are concatenated and passed to a three-layer MLP with hidden dimensions [512, 256, 128]. The output is a 7-dimensional continuous action vector: 3D end-effector position deltas in meters, 3D orientation deltas as axis-angle rotations, and a binary gripper command. Position deltas are clipped to ±0.05 m per timestep, and orientation deltas are clipped to ±0.2 radians.
Control frequency is decoupled across the hierarchy. The high-level model generates subgoals at approximately 2 Hz, recomputing the target image every 0.5 seconds or when the low-level policy signals subgoal achievement via a learned termination classifier. The low-level policy executes at 5–10 Hz, issuing motor commands at the robot's native control rate. This decoupling allows the high-level planner to operate on a coarser timescale while the low-level controller handles fine-grained reactive adjustments, similar in spirit to the action-chunking schemes used by some end-to-end policies.
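A minimal control-loop sketch of this decoupling follows; robot, synthesize_subgoal, policy, and subgoal_reached are placeholder interfaces rather than SuSIE's actual API, and the clipping bounds follow the values stated above.

```python
import time
import numpy as np

SUBGOAL_PERIOD_S = 0.5   # high-level model at ~2 Hz
CONTROL_PERIOD_S = 0.1   # low-level policy at ~10 Hz

def run_episode(robot, synthesize_subgoal, policy, subgoal_reached, instruction, horizon_s=60.0):
    """Refresh the subgoal every 0.5 s (or on the termination signal);
    issue clipped end-effector deltas at 10 Hz in between."""
    start = time.time()
    obs = robot.get_observation()                      # 256x256 RGB
    subgoal = synthesize_subgoal(obs, instruction)     # diffusion model, ~2 Hz
    last_subgoal_t = time.time()

    while time.time() - start < horizon_s:
        obs = robot.get_observation()

        # Recompute the subgoal on a timer or when the termination classifier fires.
        if time.time() - last_subgoal_t > SUBGOAL_PERIOD_S or subgoal_reached(obs, subgoal):
            subgoal = synthesize_subgoal(obs, instruction)
            last_subgoal_t = time.time()

        action = policy(obs, subgoal)                  # 7-D: xyz, axis-angle, gripper
        action[:3] = np.clip(action[:3], -0.05, 0.05)  # position deltas in meters
        action[3:6] = np.clip(action[3:6], -0.2, 0.2)  # orientation deltas in radians
        robot.apply_action(action)

        time.sleep(CONTROL_PERIOD_S)
```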
SuSIE vs. End-to-End Vision-Language-Action Models
SuSIE's hierarchical structure contrasts with end-to-end VLA models like RT-2 and OpenVLA, which map language instructions and observations directly to low-level actions via a single transformer. End-to-end models achieve higher control frequencies (up to 50 Hz) and tighter sensorimotor coupling, but require 10–100× more robot demonstrations to learn comparable task coverage. SuSIE's explicit subgoal supervision reduces the effective horizon length, enabling sample-efficient training on 60,000 demonstrations versus RT-2's 130,000-episode dataset[2].
The tradeoff is reduced reactivity to dynamic perturbations. Because SuSIE recomputes subgoals at 2 Hz, the system cannot respond to sudden object motion or contact forces within a 0.5-second window. End-to-end models like RT-1 process observations at 3 Hz and issue actions at 5 Hz, halving the reaction latency. For tasks requiring tight closed-loop control—such as peg insertion or deformable object manipulation—end-to-end architectures maintain an advantage despite their higher data requirements.
SuSIE's zero-shot generalization to novel objects stems from the semantic priors learned during InstructPix2Pix pretraining on web-scale image-editing datasets. The high-level model has seen millions of object categories during pretraining, enabling it to synthesize plausible subgoals for objects absent from the robot demonstration set. This capability is unavailable to end-to-end models trained exclusively on robot data, which exhibit catastrophic performance drops when encountering out-of-distribution objects. The Open X-Embodiment dataset attempts to address this limitation via cross-embodiment transfer, but still requires robot-specific fine-tuning for each new platform.
Comparison to Classical Hierarchical Planning
SuSIE's two-tier architecture resembles classical hierarchical task networks (HTNs) but replaces symbolic state representations with learned visual embeddings. Traditional HTN planners decompose tasks into abstract operators and concrete actions, using hand-coded preconditions and effects to ensure plan validity. SuSIE learns this decomposition end-to-end from data, with the high-level model implicitly encoding task structure via the distribution of subgoal images and the low-level policy learning action feasibility via behavioral cloning.
The advantage of learned hierarchies is robustness to perceptual ambiguity and partial observability. Classical planners require complete state estimation—object poses, contact states, occlusion masks—which is brittle in cluttered real-world scenes. SuSIE operates directly on RGB pixels, using the diffusion model's generative capacity to hallucinate occluded regions and infer object affordances from visual context. This approach mirrors recent work in world models, which use generative video prediction to plan in latent space without explicit state estimation.
The disadvantage is lack of interpretability and formal guarantees. Classical planners produce human-readable action sequences with provable correctness under specified preconditions. SuSIE's subgoal images are opaque to human inspection—a synthesized image may appear plausible but encode an infeasible state transition that causes the low-level policy to fail. The model provides no confidence estimates or failure-mode diagnostics, complicating deployment in safety-critical applications. Hybrid approaches that combine learned perception with symbolic planning—such as SayCan—offer a middle ground, but require task-specific ontology engineering that SuSIE avoids.
Dataset Formats and Preprocessing Pipelines
SuSIE's high-level training data uses the Something-Something-v2 format: a JSON manifest mapping video IDs to start frames, end frames, and instruction strings, with frames stored as 256×256 JPEG files. Preprocessing extracts frame pairs separated by 1–3 seconds, applies center cropping to 256×256, and normalizes pixel values to [0,1]. Instructions are tokenized via CLIP's byte-pair encoding and embedded into 512-dimensional vectors. The resulting triples are serialized into TFRecord shards for efficient distributed training.
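An illustrative manifest entry and loader for the format described above; the exact field names used in SuSIE's preprocessing scripts may differ.

```python
import json
from PIL import Image

# One entry of the (assumed) manifest schema described above.
manifest = [
    {
        "video_id": "74225",
        "start_frame": "frames/74225/000001.jpg",
        "end_frame": "frames/74225/000032.jpg",   # 1-3 seconds after the start frame
        "instruction": "pushing a cup from left to right",
    },
]

def load_triple(entry: dict):
    """Return (start frame, end frame, instruction) ready for training."""
    start = Image.open(entry["start_frame"]).convert("RGB").resize((256, 256))
    end = Image.open(entry["end_frame"]).convert("RGB").resize((256, 256))
    return start, end, entry["instruction"]

with open("ssv2_manifest.json", "w") as f:
    json.dump(manifest, f, indent=2)
```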
The low-level policy consumes BridgeData V2 in RLDS format, a standardized schema for episodic robot data built on TensorFlow Datasets. Each episode is a sequence of steps, where each step contains a 256×256 RGB observation, a 7-DoF action vector, a scalar reward, and a discount factor. Camera intrinsics and extrinsics are stored in episode-level metadata to enable geometric augmentation. The dataset includes 60,000 episodes totaling 1.2 million timesteps, which works out to an average episode length of roughly 20 timesteps (about 2 seconds at the 10 Hz control frequency)[1].
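A minimal sketch of iterating RLDS episodes with TensorFlow Datasets; the directory path and observation key are assumptions that depend on how the BridgeData V2 build is named locally.

```python
import tensorflow_datasets as tfds

# Directory and field keys are illustrative; they depend on the local RLDS build.
builder = tfds.builder_from_directory("/data/bridge_dataset/1.0.0")
ds = builder.as_dataset(split="train")

for episode in ds.take(1):
    for step in episode["steps"]:
        obs = step["observation"]["image"]   # 256x256x3 uint8 RGB frame
        action = step["action"]              # 7-DoF end-effector delta + gripper
        reward = step["reward"]
        discount = step["discount"]
```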
Data augmentation is critical for generalization. The high-level model applies random horizontal flips (50 percent probability), color jitter (±0.2 brightness, ±0.1 saturation, ±0.1 hue), and Gaussian noise (σ=0.02) to both start and end frames. The low-level policy uses the same augmentation pipeline plus temporal subsampling: episodes are replayed at 0.5–2× speed by dropping or duplicating frames, simulating execution-time latency and improving robustness to variable control frequencies. Augmentation parameters are tuned via ablation on a held-out validation set, with final hyperparameters yielding a 12 percent improvement in zero-shot success rate over unaugmented baselines.
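A sketch of these augmentations using torchvision transforms on tensor-valued frames, plus a simple episode-level temporal subsampler; the exact implementation in the SuSIE training code may differ.

```python
import random
import torch
from torchvision import transforms

# Frame-level augmentation matching the parameters reported above.
frame_aug = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, saturation=0.1, hue=0.1),
    transforms.Lambda(lambda x: x + 0.02 * torch.randn_like(x)),  # Gaussian noise, sigma = 0.02
])

def temporal_subsample(frames, actions, speed=None):
    """Replay an episode at 0.5-2x speed by dropping or duplicating frames."""
    speed = speed or random.uniform(0.5, 2.0)
    indices = [min(int(round(i * speed)), len(frames) - 1)
               for i in range(int(len(frames) / speed))]
    return [frames[i] for i in indices], [actions[i] for i in indices]
```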
Fine-Tuning InstructPix2Pix for Robotic Subgoals
InstructPix2Pix is a text-conditioned image-editing diffusion model pretrained on 450,000 synthetic image-instruction-edit triples generated by combining GPT-3 captions with Stable Diffusion edits. SuSIE fine-tunes this model on 220,847 video pairs from Something-Something-v2, replacing the synthetic edits with real state transitions observed in human manipulation videos. Fine-tuning uses a learning rate of 1e-5, batch size 64, and 50,000 gradient steps, taking approximately 48 hours on 8 A100 GPUs.
The fine-tuning objective is a weighted combination of denoising score matching loss and CLIP-based perceptual loss. The denoising loss ensures the model learns the conditional distribution of future frames given current frames and instructions. The perceptual loss penalizes semantic drift by comparing CLIP embeddings of the generated subgoal and the ground-truth future frame, with a weight of 0.1 relative to the denoising loss. This combination prevents mode collapse and encourages the model to generate subgoals that are both visually plausible and semantically aligned with the instruction.
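Schematically, the objective combines a standard epsilon-prediction denoising loss with a CLIP feature-space penalty weighted at 0.1; the sketch below assumes a diffusers-style U-Net and a frozen CLIP image encoder, and all helper arguments are placeholders rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

PERCEPTUAL_WEIGHT = 0.1  # weight of the CLIP term relative to the denoising term

def susie_finetune_loss(unet, clip_image_encoder, noisy_latent, timestep, noise,
                        cond_embeddings, generated_subgoal, ground_truth_future):
    """Denoising score matching + CLIP perceptual loss (illustrative only)."""
    # Standard epsilon-prediction denoising loss.
    pred_noise = unet(noisy_latent, timestep, encoder_hidden_states=cond_embeddings).sample
    denoising_loss = F.mse_loss(pred_noise, noise)

    # CLIP perceptual loss: penalize semantic drift between the generated subgoal and the
    # ground-truth future frame. generated_subgoal is assumed to come from a differentiable
    # decode of the model's prediction so the gradient can reach the U-Net.
    with torch.no_grad():
        target_feat = clip_image_encoder(ground_truth_future)
    gen_feat = clip_image_encoder(generated_subgoal)
    perceptual_loss = 1.0 - F.cosine_similarity(gen_feat, target_feat, dim=-1).mean()

    return denoising_loss + PERCEPTUAL_WEIGHT * perceptual_loss
```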
Inference uses 50-step DDIM sampling with classifier-free guidance at scale 7.5. Classifier-free guidance interpolates between conditional and unconditional predictions, amplifying the influence of the instruction embedding on the generated image. The guidance scale is tuned via grid search on a validation set, with higher scales (>10) producing sharper images but increasing the risk of out-of-distribution artifacts. The final scale of 7.5 balances image quality and task success rate, yielding a 55 percent zero-shot success rate on real-world WidowX tasks. Alternative diffusion samplers like DPM-Solver++ reduce inference time to 20 steps but degrade success rate by 8 percent, making DDIM the preferred choice for real-time deployment.
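The guidance step itself is a one-line extrapolation from the unconditional toward the conditional noise prediction:

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale=7.5):
    """Push the unconditional prediction toward the conditional one by the guidance scale."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```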
Goal-Conditioned Policy Training on BridgeData V2
The low-level policy is a goal-conditioned behavioral cloning network trained on 60,000 demonstrations from BridgeData V2. Each demonstration is a sequence of (observation, goal, action) tuples, where the goal is a future observation sampled 1–5 seconds ahead in the same episode. This temporal goal relabeling increases data efficiency by generating multiple training examples per demonstration: an episode with 200 timesteps could in principle be paired into up to 19,900 unique (observation, goal, action) triples by matching each timestep with every later one, and even the restricted 1–5 second goal window yields several thousand triples per episode.
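A sketch of this temporal goal relabeling, sampling goals 10 to 50 steps ahead (1 to 5 seconds at 10 Hz); the episode representation is illustrative.

```python
import random

def relabel_episode(observations, actions, control_hz=10, min_ahead_s=1.0, max_ahead_s=5.0):
    """Yield (observation, goal, action) triples with goals drawn from future frames."""
    min_k = int(min_ahead_s * control_hz)   # 10 steps ahead
    max_k = int(max_ahead_s * control_hz)   # 50 steps ahead
    triples = []
    for t in range(len(observations) - min_k):
        k = random.randint(min_k, min(max_k, len(observations) - 1 - t))
        goal = observations[t + k]           # future frame stands in for the subgoal image
        triples.append((observations[t], goal, actions[t]))
    return triples
```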
The policy network is a ResNet-18 encoder pretrained on ImageNet, followed by a three-layer MLP with hidden dimensions [512, 256, 128] and ReLU activations. The encoder processes the current observation and the goal image independently, producing 512-dimensional feature vectors that are concatenated and passed to the MLP. The output is a 7-dimensional action vector: 3D position deltas, 3D orientation deltas as axis-angle rotations, and a binary gripper command. Training uses mean-squared-error loss on continuous actions and binary cross-entropy loss on gripper commands, with a learning rate of 3e-4 and batch size 256.
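A PyTorch sketch of this policy network: an ImageNet-pretrained ResNet-18 shared between the observation and goal images, feeding an MLP with the stated hidden sizes. Architectural details beyond what the text specifies (for example, sharing the encoder weights) are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

class GoalConditionedPolicy(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
        backbone.fc = nn.Identity()          # keep the 512-d pooled features
        self.encoder = backbone              # shared between observation and goal (assumed)
        self.mlp = nn.Sequential(
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 7),               # 3 position + 3 axis-angle + 1 gripper logit
        )

    def forward(self, obs: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        obs_feat = self.encoder(obs)         # (B, 512)
        goal_feat = self.encoder(goal)       # (B, 512)
        return self.mlp(torch.cat([obs_feat, goal_feat], dim=-1))
```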
Data augmentation includes random crops (224×224 from 256×256), horizontal flips (50 percent probability), and color jitter (±0.2 brightness, ±0.1 saturation). Temporal subsampling replays episodes at 0.5–2× speed by dropping or duplicating frames, simulating variable control frequencies and improving robustness to execution-time latency. The model is trained for 200,000 gradient steps, taking approximately 24 hours on 4 A100 GPUs. Final validation performance is 78 percent success rate on in-distribution tasks and 55 percent on zero-shot tasks when paired with SuSIE's subgoal synthesizer.
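The corresponding loss, as a sketch, applies mean squared error to the six continuous action dimensions and binary cross-entropy to the gripper logit; equal weighting of the two terms is an assumption.

```python
import torch
import torch.nn.functional as F

def behavior_cloning_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred/target: (B, 7) = 3 position deltas, 3 axis-angle deltas, 1 gripper value in {0, 1}."""
    continuous_loss = F.mse_loss(pred[:, :6], target[:, :6])
    gripper_loss = F.binary_cross_entropy_with_logits(pred[:, 6], target[:, 6])
    return continuous_loss + gripper_loss  # equal weighting assumed
```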
Zero-Shot Generalization to Novel Objects
As noted above, SuSIE's zero-shot generalization is rooted in the semantic priors acquired during InstructPix2Pix's pretraining on web-scale image-editing data: having been exposed to a vast range of object categories, the high-level model can synthesize plausible subgoals for objects that never appear in the robot demonstration set. For example, when instructed to "pick up the toy dinosaur," the model generates a subgoal image showing the gripper grasping a dinosaur-shaped object, even though no dinosaurs appear in BridgeData V2.
This capability is validated on a held-out test set of 40 novel object categories, including kitchen utensils, office supplies, and children's toys. The model achieves a 55 percent success rate on these tasks, compared to 20 percent for RT-2 and 12 percent for a baseline goal-conditioned policy without subgoal synthesis. The performance gap is largest for objects with complex geometry or articulated parts, where the diffusion model's generative capacity enables it to hallucinate plausible intermediate states that guide the low-level policy toward successful grasps.
Failure modes include semantic drift and infeasible subgoals. Semantic drift occurs when the generated subgoal image depicts a plausible scene that does not satisfy the instruction—for example, generating an image of the gripper near the object but not grasping it. Infeasible subgoals arise when the diffusion model hallucinates object poses or gripper configurations that violate kinematic constraints, causing the low-level policy to fail. These failure modes are more common for instructions with ambiguous semantics or objects with high visual similarity to training-set categories. Future work on world models and physics-informed diffusion priors may mitigate these issues by incorporating geometric and dynamic constraints into the generative process.
Deployment Considerations and Real-Time Performance
SuSIE's hierarchical structure introduces latency at two levels. The high-level subgoal synthesizer requires 50-step DDIM sampling, taking approximately 0.5 seconds per subgoal on an A100 GPU. The low-level policy executes at 5–10 Hz, issuing motor commands every 0.1–0.2 seconds. Total system latency from observation to action is 0.6–0.7 seconds, which is acceptable for quasi-static manipulation tasks but prohibitive for dynamic tasks requiring sub-100ms reaction times.
Real-time deployment requires GPU acceleration for both models. The subgoal synthesizer is deployed on an NVIDIA A100 with TensorRT optimization, reducing inference time from 0.5 seconds to 0.3 seconds via mixed-precision inference and kernel fusion. The low-level policy runs on the same GPU, with inference time under 10ms per action. Camera capture and preprocessing add 20–30ms of latency, yielding a total system latency of 0.35–0.4 seconds. This is sufficient for tabletop manipulation but marginal for tasks requiring tight closed-loop control, such as contact-rich assembly or dynamic object tracking.
Failure recovery is handled via a learned termination classifier that detects when the low-level policy has achieved the current subgoal or encountered an unrecoverable failure. The classifier is a binary ResNet-18 trained on 10,000 labeled (observation, subgoal, success) triples, with 92 percent accuracy on a held-out validation set. When the classifier signals subgoal achievement, the high-level model generates a new subgoal; when it signals failure, the system resets to the initial state and requests a new instruction. This recovery mechanism improves task completion rate by 18 percent compared to a fixed-duration subgoal schedule. Alternative approaches using world models for predictive failure detection remain an open research direction.
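A hedged sketch of such a termination classifier: a ResNet-18 over the observation and subgoal stacked along the channel dimension. The channel-stacking choice is an assumption about how the image pair is encoded, not a detail confirmed by the source.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SubgoalTerminationClassifier(nn.Module):
    """Predicts whether the current observation has reached the current subgoal."""
    def __init__(self):
        super().__init__()
        self.backbone = resnet18(num_classes=1)
        # Accept a 6-channel input: observation and subgoal stacked along channels (assumed design).
        self.backbone.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)

    def forward(self, obs: torch.Tensor, subgoal: torch.Tensor) -> torch.Tensor:
        x = torch.cat([obs, subgoal], dim=1)       # (B, 6, 256, 256)
        return torch.sigmoid(self.backbone(x))     # probability of subgoal achievement
```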
Sourcing Training Data for SuSIE-Style Hierarchies
Replicating SuSIE's performance requires two data streams: 200,000+ video instruction pairs for high-level subgoal synthesis and 50,000+ robot demonstrations for low-level policy training. Video data must include diverse object categories, manipulation primitives, and environmental contexts to enable zero-shot generalization. Something-Something-v2 provides 220,847 labeled pairs, but custom datasets may be required for domain-specific tasks such as warehouse logistics or surgical manipulation.
Robot demonstration data must be collected on the target hardware with consistent camera calibration and lighting conditions. BridgeData V2 provides 60,000 WidowX demonstrations, but transferring to a different robot platform—such as a Franka Emika Panda or Universal Robots UR5—requires recollecting the entire low-level dataset. Cross-embodiment transfer via Open X-Embodiment reduces this burden but still requires 5,000–10,000 platform-specific demonstrations for fine-tuning.
Data quality is critical for both streams. Video pairs must have consistent frame resolution (256×256), temporal alignment within 0.1 seconds, and language annotations that precisely describe the state transition. Robot demonstrations must have synchronized observation and action streams, camera calibration metadata, and removal of episodes with occlusions or motion blur. The truelabel marketplace provides quality-checked datasets for both streams, with per-episode provenance tracking and format validation to ensure compatibility with SuSIE's training pipeline. Custom data collection services are available for domain-specific tasks, with turnaround times of 4–8 weeks for 10,000-episode datasets.
External references and source context
- [1] BridgeData V2: A Dataset for Robot Learning at Scale (arXiv). BridgeData V2 contains 60,000 episodes totaling 1.2M timesteps at 10 Hz control frequency.
- [2] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control (arXiv). RT-2 trains on 130,000 robot episodes, roughly 10× more than SuSIE's low-level policy.
- Scale AI physical AI (scale.com). Scale AI's physical AI data engine provides annotation and collection services for robot datasets.
- Labelbox (labelbox.com). Labelbox offers a data annotation platform for computer vision and robotics applications.
- Encord (encord.com). Encord provides annotation tools for multi-sensor robotics data including point clouds.
- Segments.ai (segments.ai). Segments.ai specializes in multi-sensor data labeling for autonomous systems.
- Roboflow Annotate (roboflow.com). Roboflow provides annotation tools and dataset management for computer vision models.
- LeRobot documentation (Hugging Face). LeRobot is an open-source robotics framework with standardized dataset formats.
- Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (arXiv). EPIC-KITCHENS is a large-scale egocentric video dataset with 100 hours of kitchen activities.
- RoboNet: Large-Scale Multi-Robot Learning (arXiv). RoboNet is a multi-robot dataset with 15M frames from 7 robot platforms.
- CALVIN (arXiv). CALVIN is a benchmark for language-conditioned policy learning with long-horizon tasks.
- DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv). DROID provides 76K real-world manipulation trajectories across 564 scenes and 86 tasks.
- RLBench: The Robot Learning Benchmark & Learning Environment (arXiv). RLBench is a simulation benchmark with 100 tasks for evaluating robot learning algorithms.
- Introduction to HDF5 (The HDF Group). HDF5 is a hierarchical data format commonly used for storing large robotics datasets.
- MCAP guides (MCAP). MCAP is a container format for multi-modal time-series data used in robotics.
- Apache Arrow Parquet files (Apache Arrow). Apache Parquet is a columnar storage format used for efficient dataset serialization.
- Safetensors documentation (Hugging Face). Safetensors is a secure format for storing model weights without arbitrary code execution.
- truelabel data provenance glossary (truelabel.ai). Data provenance tracking ensures dataset lineage and licensing compliance for model training.
FAQ
What robot platforms does SuSIE support?
SuSIE was developed and validated on the WidowX 250 6-DoF arm with a parallel-jaw gripper. The low-level policy is hardware-specific and requires retraining for different robot platforms. Transferring to a Franka Emika Panda or Universal Robots UR5 requires collecting 50,000+ demonstrations on the target hardware with consistent camera calibration. The high-level subgoal synthesizer is hardware-agnostic and can be reused across platforms without retraining, provided the camera viewpoint and resolution remain consistent.
How does SuSIE handle dynamic objects or moving obstacles?
SuSIE's 2 Hz subgoal generation frequency limits reactivity to dynamic perturbations. The system cannot respond to sudden object motion or contact forces within a 0.5-second window. For tasks requiring tight closed-loop control—such as catching a thrown object or tracking a moving target—end-to-end models like RT-1 or RT-2 are better suited. SuSIE is optimized for quasi-static manipulation tasks where objects remain stationary between subgoal updates, such as pick-and-place, rearrangement, or assembly with fixed components.
Can SuSIE be fine-tuned on custom video datasets?
Yes. The high-level subgoal synthesizer can be fine-tuned on custom video instruction pairs to improve performance on domain-specific tasks. Fine-tuning requires 10,000+ video pairs with consistent frame resolution (256×256), temporal alignment, and language annotations. The process takes 12–24 hours on 8 A100 GPUs and improves zero-shot success rate by 8–15 percent on in-domain tasks. Custom datasets must follow the Something-Something-v2 format: a JSON manifest mapping video IDs to start frames, end frames, and instruction strings, with frames stored as JPEG files.
What is the minimum dataset size for training a SuSIE-style hierarchy?
The high-level subgoal model requires at least 100,000 video instruction pairs to achieve competitive zero-shot generalization. Training on fewer than 50,000 pairs results in semantic drift and infeasible subgoals. The low-level policy requires at least 30,000 robot demonstrations to learn robust goal-conditioned control. Training on fewer than 10,000 demonstrations yields policies that overfit to training-set object poses and fail on novel configurations. Data augmentation can reduce these requirements by 20–30 percent but cannot fully compensate for insufficient dataset diversity.
How does SuSIE compare to OpenVLA on sample efficiency?
SuSIE achieves comparable task success rates to OpenVLA using 5× fewer robot demonstrations (60,000 vs. 300,000+) by leveraging web-scale pretraining on image-editing datasets. However, SuSIE requires an additional 220,000 video instruction pairs for high-level subgoal synthesis, which OpenVLA does not need. Total data requirements are similar, but SuSIE's video data is cheaper to collect than robot demonstrations because it does not require hardware access or teleoperation labor. For organizations with existing video datasets, SuSIE offers better sample efficiency; for organizations starting from scratch, OpenVLA's end-to-end training pipeline may be simpler to deploy.
What file formats does SuSIE use for training data?
The high-level model consumes video pairs in JPEG format with a JSON manifest mapping video IDs to frame paths and instruction strings. The low-level policy uses RLDS format, a TensorFlow Datasets schema for episodic robot data. Each episode is serialized as a TFRecord containing sequences of (observation, action, reward, discount) tuples, with camera intrinsics and extrinsics stored in episode-level metadata. Alternative formats like HDF5 or MCAP can be converted to RLDS via the rlds_dataset_builder utility, which handles frame extraction, temporal alignment, and metadata serialization.
Looking for SuSIE robot manipulation?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.
Source SuSIE Training Data