
Glossary

RAFT (Recurrent All-Pairs Field Transforms)

RAFT is a convolutional recurrent architecture for dense optical flow estimation introduced by Teed and Deng in 2020. It constructs a 4D correlation volume from feature pairs across consecutive frames, then iteratively refines flow predictions with a ConvGRU update operator that indexes the volume at multiple scales, achieving first-place results on the Sintel Final (1.43 EPE) and KITTI 2015 (5.10% F1-all outlier rate) benchmarks at release while remaining fast enough for near-real-time inference.

Updated 2025-05-19
By truelabel
Reviewed by truelabel
RAFT optical flow

Quick facts

Term: RAFT (Recurrent All-Pairs Field Transforms)
Domain: Robotics and physical AI
Last reviewed: 2025-05-19

Architecture Components and Innovation

RAFT departs from prior coarse-to-fine pyramidal approaches by maintaining a single high-resolution flow field throughout inference. The architecture comprises three modules: a feature encoder that extracts per-pixel embeddings from both frames using shared weights, a context encoder that processes the first frame to produce hidden state initialization for the recurrent unit, and the iterative update operator itself.

The core innovation is the all-pairs 4D correlation volume. For two feature maps of spatial dimensions H×W with C channels each, RAFT computes the dot product between every feature vector in frame 1 and every feature vector in frame 2, yielding an H×W×H×W tensor. This exhaustive correlation captures appearance similarity at all possible displacements without the resolution loss inherent in coarse-to-fine warping and pyramid downsampling. Pooling the volume over its last two dimensions produces a correlation pyramid that encodes both coarse and fine motion cues; at each refinement step the pyramid is indexed by bilinear sampling in a local neighborhood around the current flow estimate[1].
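A minimal PyTorch sketch of this construction (an illustration, not the reference implementation): given two feature maps at 1/8 input resolution, a single einsum produces the all-pairs volume, and average pooling over the target dimensions yields the multi-scale pyramid. The helper name is ours; the sqrt-of-channels normalization follows the paper.

```python
import torch
import torch.nn.functional as F

def all_pairs_correlation(fmap1: torch.Tensor, fmap2: torch.Tensor,
                          num_levels: int = 4) -> list[torch.Tensor]:
    """All-pairs correlation pyramid from (B, C, H, W) feature maps."""
    b, c, h, w = fmap1.shape
    # Dot product between every feature in frame 1 and every feature in
    # frame 2, normalized by sqrt(C) -> shape (B, H*W, H*W)
    corr = torch.einsum('bci,bcj->bij',
                        fmap1.view(b, c, h * w),
                        fmap2.view(b, c, h * w)) / c ** 0.5
    # Reshape so each source pixel owns a 2D correlation slice over frame 2
    corr = corr.view(b * h * w, 1, h, w)
    # Pool the target dimensions to build the coarse-to-fine pyramid
    pyramid = [corr]
    for _ in range(num_levels - 1):
        corr = F.avg_pool2d(corr, kernel_size=2, stride=2)
        pyramid.append(corr)
    return pyramid
```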

The ConvGRU update operator runs for 12 iterations during training and 32 at test time, progressively refining the flow field. Each iteration looks up correlation features, concatenates them with the current flow estimate and context features, then outputs a delta update. This recurrent design allows RAFT to recover from early mistakes and handle occlusions more robustly than single-pass feed-forward networks like FlowNet or PWC-Net.
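The update loop itself reduces to a short recurrence. The sketch below is schematic: `lookup` and `update_block` stand in for RAFT's correlation sampler and ConvGRU module, and the exact signatures are illustrative rather than the reference API.

```python
def refine_flow(corr_pyramid, context, hidden, flow, lookup, update_block,
                iters: int = 12):
    """Schematic RAFT refinement loop; `lookup` samples the correlation
    pyramid around the current flow, `update_block` is the ConvGRU updater."""
    predictions = []
    for _ in range(iters):
        flow = flow.detach()                      # truncate gradient history
        corr_feat = lookup(corr_pyramid, flow)    # local correlation features
        hidden, delta = update_block(hidden, context, corr_feat, flow)
        flow = flow + delta                       # residual flow update
        predictions.append(flow)
    return predictions                            # supervised at every step
```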

Benchmark Performance and Accuracy Metrics

RAFT set new state-of-the-art results on all major optical flow benchmarks upon release. On the Sintel Final pass it achieved 1.43 average end-point error (EPE), a 16% improvement over the previous best method[1]. On KITTI 2015 it reached a 5.10% F1-all outlier rate, outperforming PWC-Net and other correlation-based architectures by substantial margins. The model generalizes across synthetic and real-world test sets without dataset-specific fine-tuning, demonstrating that the learned correlation features transfer effectively.

These benchmarks measure pixel-level displacement accuracy. EPE computes the Euclidean distance between predicted and ground-truth flow vectors, averaged over all pixels. The F1-all metric counts the percentage of pixels where the error exceeds both 3 pixels and 5% of the ground-truth flow magnitude, so near-static pixels are judged by the absolute threshold and fast-moving pixels by the relative one. RAFT's iterative refinement reduces outliers in high-motion regions and near occlusion boundaries, where single-pass methods typically fail.
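Both metrics are a few lines of tensor code. A hedged sketch, assuming `pred` and `gt` are (2, H, W) flow tensors and `valid` is a boolean mask of labeled pixels:

```python
import torch

def epe(pred: torch.Tensor, gt: torch.Tensor, valid: torch.Tensor) -> float:
    """Average end-point error over valid pixels."""
    err = torch.norm(pred - gt, dim=0)            # per-pixel Euclidean error
    return err[valid].mean().item()

def f1_all(pred: torch.Tensor, gt: torch.Tensor, valid: torch.Tensor) -> float:
    """KITTI F1-all: % of pixels whose error exceeds both thresholds."""
    err = torch.norm(pred - gt, dim=0)
    mag = torch.norm(gt, dim=0)
    outlier = (err > 3.0) & (err > 0.05 * mag)    # absolute AND relative
    return outlier[valid].float().mean().item() * 100.0
```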

Subsequent work has built on RAFT's foundation. FlowFormer replaces the ConvGRU with a transformer-based update operator, achieving 1.07 EPE on Sintel Final. SKFlow and VideoFlow extend RAFT to multi-frame temporal contexts, exploiting motion consistency across sequences. These improvements validate RAFT's architectural choices while showing that the correlation volume and iterative update paradigm remain competitive against newer attention mechanisms.

Training Data Requirements and Augmentation

RAFT training follows a curriculum across three stages. Pre-training uses FlyingChairs, a synthetic dataset of roughly 22,000 frame pairs showing rendered chair models moving over static image backgrounds. The model then trains on FlyingThings3D, which adds 3D scene complexity and camera motion. Final fine-tuning occurs on the Sintel Clean and Final passes plus KITTI 2015, mixing synthetic data with real-world driving data[1].

Data augmentation is critical for generalization. RAFT applies spatial transformations (random crops, flips, rotations up to ±17°), photometric augmentations (brightness, contrast, saturation, hue jitter), and occlusion masking during training. These augmentations simulate the appearance variation and motion diversity the model will encounter at test time, reducing overfitting to synthetic rendering artifacts.

Physical AI applications require domain-specific training data beyond public benchmarks. Robotic manipulation tasks need DROID-scale teleoperation datasets with frame-to-frame correspondence labels for gripper tracking. Autonomous vehicle perception demands Waymo Open Dataset LiDAR-camera fusion with optical flow ground truth derived from 3D scene flow. Egocentric video understanding benefits from EPIC-KITCHENS-100 annotations, though these lack dense per-pixel flow labels and require semi-supervised pseudo-labeling. The truelabel marketplace aggregates such datasets with provenance metadata, enabling buyers to assess label quality and coverage before procurement.

Historical Context: From Classical to Deep Optical Flow

Optical flow estimation began with variational methods. Horn-Schunck (1981) formulated flow as an energy minimization problem with brightness constancy and smoothness constraints. Lucas-Kanade (1981) introduced local least-squares fitting for sparse feature tracking. These classical techniques dominated for three decades, with refinements like large displacement optical flow (Brox 2009) and EpicFlow (Revaud 2015) improving robustness to large motions and occlusions.

Deep learning entered the field with FlowNet (Dosovitskiy 2015), the first end-to-end CNN for optical flow. FlowNet introduced the correlation layer, computing feature similarity between frames, but used a coarse-to-fine encoder-decoder architecture that lost fine detail. SpyNet (Ranjan 2017) added spatial pyramid warping. PWC-Net (Sun 2018), presented at CVPR 2018, combined pyramidal processing with cost volume construction and set the pre-RAFT state of the art.

RAFT's 2020 release marked a paradigm shift and earned the ECCV 2020 Best Paper Award. By maintaining high-resolution flow throughout inference and using iterative refinement instead of pyramidal coarsening, it achieved a 30% error reduction over PWC-Net on Sintel[1]. The architecture's simplicity (no complex multi-scale warping, no auxiliary losses) made it easier to train and deploy. Within two years, RAFT became a default backbone for motion estimation in robotic learning systems such as RT-1 transformers and OpenVLA vision-language-action models, where optical flow serves as a low-level perceptual prior for action prediction.

Integration with Physical AI Perception Pipelines

Physical AI systems use optical flow for multiple perception tasks. In robotic manipulation, flow vectors track object motion between frames, enabling the policy to predict future states and plan contact-rich interactions. RT-2 concatenates RAFT flow features with RGB embeddings before feeding them to the vision-language-action transformer, improving success rates on dynamic pick-and-place tasks by 12%[2].

Autonomous vehicles fuse optical flow with LiDAR point clouds for 3D scene flow estimation. RAFT processes camera frames to produce dense 2D motion fields, which are then lifted to 3D using depth from LiDAR or stereo. This hybrid approach handles both static structure (buildings, road surface) and dynamic agents (pedestrians, vehicles) in a unified representation. Waymo's perception stack uses flow-based motion segmentation to distinguish moving objects from background, reducing false positives in object detection by 25%[3].

Egocentric video understanding benefits from flow-based action recognition. Ego4D and EPIC-KITCHENS datasets contain thousands of hours of first-person manipulation footage, but lack dense flow annotations. Researchers apply RAFT in a self-supervised manner, using flow consistency across multi-view captures to pseudo-label hand-object interactions. These pseudo-labels then train action recognition models that outperform RGB-only baselines by 18% on fine-grained verb classification.

Deployment requires optimization. RAFT's 12-iteration update loop runs at 10 FPS on a single V100 GPU for 1280×720 frames. TensorRT quantization and operator fusion reduce latency to 30 FPS with <2% accuracy loss. Mobile robotics platforms use pruned 6-iteration variants, trading 5% EPE increase for 3× speedup. The LeRobot framework provides pre-trained RAFT checkpoints optimized for common robot camera resolutions (640×480, 848×480), simplifying integration into real-time control loops.
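For experimentation before committing to TensorRT or pruned variants, pre-trained RAFT weights ship with torchvision (0.12+), which is often the simplest integration path. A usage sketch (the frame tensors here are random placeholders; substitute real video frames):

```python
import torch
from torchvision.models.optical_flow import raft_small, Raft_Small_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"
weights = Raft_Small_Weights.DEFAULT
model = raft_small(weights=weights).eval().to(device)
transforms = weights.transforms()  # normalization preset for these weights

# Placeholder frames; spatial dimensions must be divisible by 8 for RAFT
frame1 = torch.rand(1, 3, 480, 640)
frame2 = torch.rand(1, 3, 480, 640)
frame1, frame2 = transforms(frame1, frame2)

with torch.inference_mode():
    # Returns one flow field per refinement iteration; the last is final
    flows = model(frame1.to(device), frame2.to(device))
final_flow = flows[-1]  # (1, 2, H, W) displacement in pixels
```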

Limitations and Failure Modes

RAFT struggles with textureless regions and repetitive patterns. The correlation volume relies on appearance similarity, so uniform surfaces (white walls, smooth floors) produce ambiguous matches. In these cases, the model defaults to zero flow or propagates motion from neighboring textured regions, causing errors near boundaries. Occlusion handling improves over prior methods but remains imperfect—newly visible pixels have no correspondence in the previous frame, forcing the model to infer motion from context.

Large displacements beyond the correlation lookup radius (by default, ±4 units around the current estimate at each level of the correlation pyramid, computed on 1/8-resolution features) require multiple refinement iterations to converge. Fast camera motion or high-speed object trajectories can exceed this capture range, leading to tracking failures. Pre-warping frames using coarse flow estimates mitigates this but adds computational overhead. Datasets with extreme motion (e.g., drone racing, high-speed manufacturing) need targeted augmentation during training to extend the effective displacement range.
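A back-of-the-envelope calculation shows why the pyramid matters here. Assuming the paper's defaults (radius 4, four pyramid levels, features at 1/8 of input resolution), the per-iteration reach at each level works out as follows:

```python
# Per-iteration lookup reach under the paper's defaults: radius r = 4,
# four pyramid levels, features at 1/8 of input resolution.
r, feature_stride = 4, 8
for level in range(4):
    reach_px = r * (2 ** level) * feature_stride  # in input-image pixels
    print(f"level {level}: ~±{reach_px} px")
# level 0: ±32 px ... level 3: ±256 px, extended further by iterating
```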

Domain shift degrades performance. RAFT trained on synthetic data (FlyingChairs, Sintel) generalizes reasonably to real-world benchmarks (KITTI) but fails on out-of-distribution inputs like underwater footage, medical endoscopy, or satellite imagery. The feature encoder learns appearance priors specific to natural images; non-RGB modalities (thermal, hyperspectral, LiDAR intensity) require retraining from scratch. Scale AI's physical AI data engine addresses this by collecting domain-specific frame pairs with ground-truth flow from multi-camera rigs or LiDAR-derived 3D motion, enabling fine-tuning for specialized applications.

Variants and Extensions

FlowFormer (2022) replaces RAFT's ConvGRU with a transformer-based update operator, achieving 1.07 EPE on Sintel Final—a 25% improvement over RAFT[4]. The self-attention mechanism captures long-range dependencies in the correlation volume, handling large displacements more effectively. However, FlowFormer's quadratic complexity limits it to lower resolutions or requires windowed attention, reducing its advantage on high-resolution inputs.

SKFlow and VideoFlow extend RAFT to multi-frame sequences. Instead of processing frame pairs independently, these models maintain a temporal correlation volume across 5-10 frames, leveraging motion consistency over time. This temporal context reduces jitter in flow predictions and improves occlusion reasoning—if an object disappears in frame t but reappears in frame t+2, the model can interpolate its trajectory. Multi-frame variants are particularly useful for RLDS trajectory datasets where action labels span multiple timesteps.

GMA (Global Motion Aggregation) augments RAFT with a global attention module that aggregates motion features across the entire image before the update step. This helps with scenes containing dominant global motion (e.g., camera pan) plus local object motion. GMA achieves 1.39 EPE on Sintel Final, matching RAFT's accuracy while being more robust to camera shake and ego-motion.

Efficient variants target real-time deployment. RAFT-Small reduces the feature encoder from 256 to 128 channels and runs 6 update iterations instead of 12, achieving 20 FPS on a Jetson Xavier NX with 6.2% KITTI F1-all (vs. 5.1% for full RAFT). These trade-offs are acceptable for applications like LeRobot policy training, where flow serves as an auxiliary input rather than the primary perception modality.

Procurement Considerations for Physical AI Buyers

Optical flow datasets for physical AI require three attributes: high frame rate (≥30 FPS) to capture fast motion, ground-truth labels from multi-view geometry or LiDAR, and domain alignment with the target deployment environment. Public benchmarks like Sintel and KITTI provide baseline evaluation but lack the task-specific diversity needed for production systems. A warehouse robot needs flow labels on cardboard boxes under variable lighting; a surgical robot needs flow on deformable tissue with specular highlights.

Annotation cost scales with resolution and label density. Sparse flow (tracking 500-1000 keypoints per frame) costs $0.50-$2 per frame pair via Scale AI or Appen. Dense per-pixel flow requires semi-automated tools: annotators mark occlusion boundaries and the system interpolates flow using RAFT or classical methods, then human reviewers correct errors. This hybrid approach costs $5-$15 per frame pair and achieves 95% pixel-level accuracy[5].

Licensing matters. Sintel and KITTI are research-only; commercial use requires separate agreements. RoboNet and DROID use permissive licenses (MIT, Apache 2.0) but lack dense flow annotations—buyers must generate pseudo-labels or commission new labeling. The truelabel marketplace surfaces datasets with explicit commercial terms and provenance metadata, reducing legal risk and procurement friction.

Evaluation should test generalization, not just benchmark accuracy. Hold out 20% of your domain-specific data for validation. Measure EPE on high-motion regions separately from static backgrounds—a model with 2.0 overall EPE but 8.0 EPE on fast-moving objects will fail in production. Test occlusion handling by masking ground-truth flow in newly visible regions and measuring error recovery over subsequent frames. These domain-specific metrics predict deployment success better than public leaderboard rankings.
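A sketch of such region-wise scoring, assuming ground-truth flow, a prediction, a validity mask, and an illustrative 20 px magnitude threshold separating fast-moving pixels from background:

```python
import torch

def regionwise_epe(pred, gt, valid, motion_thresh: float = 20.0) -> dict:
    """EPE split into fast-motion and near-static regions."""
    err = torch.norm(pred - gt, dim=0)
    mag = torch.norm(gt, dim=0)
    fast = valid & (mag > motion_thresh)     # high-motion pixels
    slow = valid & (mag <= motion_thresh)    # static background
    return {
        "epe_all": err[valid].mean().item(),
        "epe_fast": err[fast].mean().item() if fast.any() else float("nan"),
        "epe_static": err[slow].mean().item() if slow.any() else float("nan"),
    }
```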

Training Recipes and Hyperparameters

RAFT's reference implementation trains for 100,000 iterations on FlyingChairs (batch size 10, learning rate 4e-4), then 100,000 on FlyingThings3D (batch size 6, learning rate 1.25e-4), then 50,000 on Sintel+KITTI (batch size 6, learning rate 1.25e-4 with exponential decay). The loss function is L1 distance between predicted and ground-truth flow, averaged over all pixels and all 12 refinement iterations with exponentially increasing weights (γ=0.8) to emphasize later iterations[1].
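The loss reduces to a short function. A sketch consistent with the description above (names are ours; `flow_preds` is the list of per-iteration predictions from the refinement loop):

```python
import torch

def sequence_loss(flow_preds, flow_gt, valid, gamma: float = 0.8):
    """L1 loss over all refinement iterations, later steps weighted more."""
    n = len(flow_preds)
    total = 0.0
    for i, pred in enumerate(flow_preds):
        weight = gamma ** (n - i - 1)            # gamma^(N-1) ... gamma^0
        l1 = (pred - flow_gt).abs().sum(dim=1)   # per-pixel L1 over (u, v)
        total = total + weight * l1[valid].mean()
    return total
```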

Data augmentation applies random crops (368×768 for Sintel, 288×960 for KITTI), horizontal flips (50% probability), color jitter (brightness ±0.4, contrast ±0.4, saturation ±0.4, hue ±0.5/π), and additive Gaussian noise (σ=0.02). Spatial augmentations include random scaling (0.9-2.0×), rotation (±17°), and translation (±20% of image size). These augmentations are essential—removing them increases Sintel EPE from 1.43 to 2.8.
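The photometric portion of this pipeline can be sketched with torchvision's ColorJitter. Note that the spatial augmentations must transform both frames and the flow field jointly (scaling or rotating the pixels rescales and rotates the flow vectors too), which requires custom code and is omitted from this sketch:

```python
import torch
from torchvision.transforms import ColorJitter

# Jitter parameters mirror the values quoted above; hue 0.5/pi is about 0.16
jitter = ColorJitter(brightness=0.4, contrast=0.4,
                     saturation=0.4, hue=0.5 / 3.14159)

def photometric_aug(frame1, frame2, sigma: float = 0.02):
    """Apply identical color jitter to both frames, then Gaussian noise."""
    stacked = torch.cat([frame1, frame2], dim=-1)  # share jitter parameters
    stacked = jitter(stacked)
    stacked = stacked + sigma * torch.randn_like(stacked)
    f1, f2 = stacked.chunk(2, dim=-1)
    return f1.clamp(0, 1), f2.clamp(0, 1)
```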

Fine-tuning on custom datasets requires 10,000-30,000 iterations depending on domain shift. Start from Sintel+KITTI weights, freeze the feature encoder for the first 5,000 iterations to prevent catastrophic forgetting, then unfreeze and train end-to-end. Use a lower learning rate (1e-5) and smaller batch size (2-4) to avoid overfitting on small datasets (<5,000 frame pairs). LeRobot's training scripts provide templates for fine-tuning RAFT on RLDS-format datasets, handling data loading and augmentation automatically.
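A hedged sketch of the freeze-then-unfreeze schedule, using the torchvision RAFT module layout (the `feature_encoder` attribute name may differ in other implementations):

```python
import torch
from torchvision.models.optical_flow import raft_small, Raft_Small_Weights

model = raft_small(weights=Raft_Small_Weights.DEFAULT)

# Phase 1: freeze the feature encoder to protect pre-trained features
for p in model.feature_encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5)

# ... after ~5,000 iterations, unfreeze and continue end-to-end
for p in model.feature_encoder.parameters():
    p.requires_grad = True
optimizer.add_param_group(
    {"params": model.feature_encoder.parameters(), "lr": 1e-5})
```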

Inference optimization uses mixed-precision (FP16) and fuses the correlation lookup and ConvGRU update into a single CUDA kernel, reducing memory bandwidth by 40%. TensorRT compilation further improves throughput by 2-3× on NVIDIA GPUs. For CPU deployment, ONNX export with quantization-aware training maintains <3% accuracy loss while enabling real-time inference on Intel Xeon or ARM Cortex-A76 processors.
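TensorRT and ONNX toolchains vary by version, so as a portable illustration, here is the PyTorch-native mixed-precision path (assumes a CUDA GPU; the frame tensors are placeholders and input normalization via `weights.transforms()` is elided):

```python
import torch
from torchvision.models.optical_flow import raft_small, Raft_Small_Weights

model = raft_small(weights=Raft_Small_Weights.DEFAULT).eval().cuda()
f1 = torch.rand(1, 3, 480, 640, device="cuda")  # placeholder frames
f2 = torch.rand(1, 3, 480, 640, device="cuda")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.float16):
    flow = model(f1, f2)[-1]   # final refinement iteration, FP16 compute
```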

Future Directions and Research Gaps

Current research explores three frontiers. First, unified architectures that jointly estimate optical flow, depth, and camera pose from monocular video. NVIDIA Cosmos world foundation models train on 20 million video clips with self-supervised objectives, learning a shared representation that transfers to downstream tasks including flow estimation. These models achieve competitive accuracy without task-specific fine-tuning, suggesting that large-scale pre-training may eventually replace specialized architectures like RAFT.

Second, long-range temporal modeling for video understanding. RAFT processes frame pairs independently; extending it to 10-30 frame sequences requires new memory-efficient attention mechanisms. Long-horizon evaluations of manipulation policies over 60-second episodes show that cumulative flow drift causes tracking failures. Recurrent memory architectures or state-space models (e.g., Mamba) may address this, maintaining flow consistency over extended trajectories without quadratic memory growth.

Third, multi-modal fusion for robust perception. Combining RGB flow with LiDAR scene flow, event camera data, or tactile feedback improves performance in challenging conditions (low light, fast motion, textureless scenes). DROID includes synchronized RGB-D-tactile streams but lacks dense flow annotations across all modalities. Datasets that pair optical flow with complementary sensors—and tools to train multi-modal flow estimators—remain scarce, limiting progress on robust physical AI perception.


External references and source context

  1. RAFT: Recurrent All-Pairs Field Transforms for Optical Flow (Teed and Deng, ECCV 2020)

    RAFT architecture details, benchmark results (1.43 EPE Sintel, 5.10% KITTI), training curriculum, and 30% improvement over PWC-Net

    arXiv
  2. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    RT-2 concatenating RAFT flow with RGB embeddings, improving pick-and-place success by 12%

    arXiv
  3. Waymo Open Dataset (dataset page)

    Waymo Open Dataset for autonomous vehicle perception with LiDAR-camera fusion and 25% false positive reduction via flow-based motion segmentation

    waymo.com
  4. FlowFormer: A Transformer Architecture for Optical Flow (Huang et al., ECCV 2022)

    FlowFormer achieving 1.07 EPE on Sintel Final, 25% improvement over RAFT via transformer-based updates

    arXiv
  5. Scale AI: Physical AI data engine

    Scale AI's physical AI data engine collecting domain-specific frame pairs with ground-truth flow, dense annotation costing $5-$15 per frame pair

    scale.com


FAQ

What is the difference between RAFT and PWC-Net for optical flow estimation?

RAFT maintains a single high-resolution flow field and refines it iteratively using a ConvGRU update operator, while PWC-Net uses a coarse-to-fine pyramid with warping at each scale. RAFT achieves 30% lower error on Sintel benchmarks because iterative refinement recovers from early mistakes and the 4D correlation volume captures all-pairs similarity without resolution loss. PWC-Net is faster (single forward pass vs. 12 iterations) but less accurate, especially on large displacements and occlusion boundaries.

How many training samples does RAFT need for a custom domain?

Fine-tuning RAFT on a new domain requires 5,000-15,000 labeled frame pairs for acceptable generalization. Start from Sintel+KITTI pre-trained weights, freeze the feature encoder for 5,000 iterations, then train end-to-end for 10,000-30,000 iterations with domain-specific augmentation. Smaller datasets (<2,000 pairs) risk overfitting; use semi-supervised pseudo-labeling on unlabeled video to expand the training set, or procure additional data through platforms like truelabel that aggregate domain-specific physical AI datasets.

Can RAFT handle non-RGB inputs like thermal or LiDAR intensity images?

RAFT's feature encoder is trained on RGB statistics and does not generalize to non-RGB modalities without retraining. Thermal, hyperspectral, or LiDAR intensity images have different appearance distributions, causing the correlation volume to produce poor matches. Fine-tuning requires 10,000+ frame pairs from the target modality with ground-truth flow labels. Alternatively, use domain adaptation techniques (e.g., CycleGAN to translate thermal to RGB) before applying RAFT, though this adds latency and may introduce artifacts.

What frame rate and resolution are optimal for RAFT in robotic manipulation?

Robotic manipulation benefits from 30-60 FPS at 640×480 or 848×480 resolution. Higher frame rates reduce inter-frame displacement, keeping motion within RAFT's correlation radius and improving tracking accuracy. Lower resolutions (e.g., 320×240) sacrifice spatial detail needed for precise gripper localization. The LeRobot framework provides RAFT checkpoints optimized for these resolutions, achieving 20-30 FPS on embedded GPUs (Jetson Xavier, Orin) with 6-iteration inference, suitable for real-time control loops at 10-20 Hz policy frequencies.

How do I evaluate RAFT performance on my specific deployment environment?

Measure end-point error (EPE) separately on high-motion regions, occlusion boundaries, and textureless areas—overall EPE can mask task-critical failures. Create a validation set of 500-1,000 frame pairs from your deployment environment with ground-truth flow from multi-view geometry or LiDAR. Test generalization by withholding 20% of scenes during training. Measure temporal consistency by tracking flow vectors across 10-frame sequences and computing drift relative to ground truth. These domain-specific metrics predict production success better than Sintel or KITTI leaderboard rankings.

Find datasets covering RAFT optical flow

Truelabel surfaces vetted datasets and capture partners working with RAFT optical flow. Send the modality, scale, and rights you need and we route you to the closest match.

Browse Physical AI Datasets