Model Profile
CrossFormer: Cross-Embodiment Transformer for Multi-Robot Policy Transfer
CrossFormer is a cross-embodiment transformer policy developed by UC Berkeley and Carnegie Mellon that addresses the heterogeneity problem in multi-robot learning through embodiment-specific tokenization and detokenization layers. Pretrained on 900,000+ episodes from the Open X-Embodiment dataset spanning 20+ robot platforms, CrossFormer transfers zero-shot to embodiments represented in its pretraining set and fine-tunes to new platforms with 50–500 demonstrations per task, making it a practical foundation model for organizations deploying diverse robot fleets.
Quick facts
- Model class: Model Profile
- Primary focus: CrossFormer
- Last reviewed: 2024-12-18
What CrossFormer Solves: The Embodiment Heterogeneity Problem
Most robot learning models assume a fixed observation space and action dimension, forcing practitioners to train separate policies for each platform. CrossFormer eliminates this constraint by introducing embodiment-specific tokenization layers that map heterogeneous inputs—1 to 4 camera views at 224×224 RGB plus variable-dimension proprioceptive state—into a shared latent space processed by a unified transformer backbone (see the CrossFormer architecture paper). The detokenization layer then projects the shared representation back into platform-specific action dimensions, enabling a single model to control WidowX 250, Franka Emika, UR5e, and ALOHA configurations without architectural changes.
This design directly addresses the data fragmentation problem documented in Open X-Embodiment: most robot datasets contain 5,000–50,000 episodes for a single platform, insufficient for training generalizable policies. By pretraining on 900,000+ episodes across 20+ embodiments, CrossFormer amortizes the cost of large-scale data collection across platforms, then fine-tunes to new embodiments with 50–500 demonstrations—a 10–100× reduction in per-platform data requirements compared to training from scratch.
The model uses RLDS format for all training data, ensuring consistent episode structure and metadata across embodiments. Each episode includes embodiment name, action dimension, camera configuration, and control frequency (2–50 Hz), enabling the tokenizer to apply correct normalization and chunking. For procurement teams, this means CrossFormer-compatible datasets must ship with validated RLDS metadata and per-embodiment action statistics—requirements that truelabel's physical AI marketplace enforces at intake.
Architecture and Training Methodology
CrossFormer's backbone is a 130-million-parameter transformer with 14 layers, 16 attention heads, and 768-dimensional embeddings. The tokenization layer processes each embodiment's observation via learned projection matrices: RGB images pass through a ResNet-18 encoder to 256-dimensional feature vectors, proprioceptive state (joint positions, velocities, gripper state) maps to 64 dimensions, and optional language instructions encode via frozen T5-base to 768 dimensions. These embeddings concatenate into a sequence of 12–20 tokens depending on camera count, then feed the transformer.
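To make the token pipeline concrete, here is a minimal PyTorch sketch of the per-embodiment tokenization pattern described above. The module name, attribute layout, and projection details are illustrative assumptions, not the released CrossFormer code; the dimensions follow the figures in this section.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class EmbodimentTokenizer(nn.Module):
    """Hypothetical sketch: maps one embodiment's observations to shared tokens."""

    def __init__(self, proprio_dim: int, d_model: int = 768):
        super().__init__()
        # ResNet-18 image encoder; replace the classifier with a 256-d projection.
        backbone = resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, 256)
        self.image_encoder = backbone
        self.image_proj = nn.Linear(256, d_model)    # 256-d feature -> shared token
        self.proprio_proj = nn.Linear(proprio_dim, 64)
        self.proprio_token = nn.Linear(64, d_model)  # 64-d state -> shared token

    def forward(self, images: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        # images: (B, num_cams, 3, 224, 224); proprio: (B, proprio_dim)
        b, n, c, h, w = images.shape
        feats = self.image_encoder(images.flatten(0, 1))    # (B*n, 256)
        img_tokens = self.image_proj(feats).view(b, n, -1)  # (B, n, 768)
        prop_token = self.proprio_token(self.proprio_proj(proprio)).unsqueeze(1)
        # Language tokens from a frozen T5-base encoder would be appended here.
        return torch.cat([img_tokens, prop_token], dim=1)   # (B, n+1, 768)
```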
The model predicts actions as 4-step diffusion chunks using the same denoising objective as Diffusion Policy, which improves multimodal action distribution modeling over deterministic regression. During inference, the detokenizer runs 10 denoising steps to sample a 4-step action trajectory; the controller executes the first action, then re-queries the model—a receding-horizon control pattern that balances reactivity and trajectory coherence. Control frequencies vary by embodiment: WidowX runs at 5 Hz, Franka at 20 Hz, ALOHA at 50 Hz; the detokenizer scales action magnitudes accordingly using per-embodiment statistics computed during pretraining.
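The receding-horizon pattern is simple to express in code. This sketch assumes hypothetical `policy` and `robot` objects with `sample_action_chunk`, `get_observation`, and `apply_action` methods; it is illustrative, not the LeRobot API.

```python
import time

CHUNK_LEN = 4    # actions per predicted chunk
CONTROL_HZ = 5   # WidowX example; ALOHA would run at 50 Hz

def receding_horizon_loop(policy, robot, num_steps=1000):
    """Predict a 4-step chunk (10 denoising steps inside the policy),
    execute the first action, then re-query with a fresh observation."""
    for _ in range(num_steps):
        obs = robot.get_observation()             # images + proprioceptive state
        chunk = policy.sample_action_chunk(obs)   # shape (CHUNK_LEN, action_dim)
        assert len(chunk) == CHUNK_LEN
        robot.apply_action(chunk[0])              # execute only the first action
        time.sleep(1.0 / CONTROL_HZ)              # hold to the control period
```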
Pretraining uses a curriculum that first trains tokenizers on single-embodiment subsets (100,000 episodes each for the 8 most-represented platforms), then jointly trains the full model on the mixed 900K-episode corpus for 200,000 gradient steps. This two-phase approach prevents the transformer from collapsing to a lowest-common-denominator policy that ignores embodiment-specific affordances. Fine-tuning freezes the transformer backbone and only updates the tokenizer/detokenizer for the target embodiment, requiring 500–2,000 gradient steps on 50–500 demonstrations depending on task complexity.
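A sketch of the fine-tuning freeze, assuming the model exposes per-embodiment tokenizer/detokenizer modules in dictionaries (the attribute names are hypothetical):

```python
import torch

def prepare_for_finetuning(model, target: str, lr: float = 3e-4):
    """Freeze the shared transformer backbone; train only the target
    embodiment's tokenizer and detokenizer layers."""
    for p in model.transformer.parameters():
        p.requires_grad = False
    trainable = list(model.tokenizers[target].parameters()) + \
                list(model.detokenizers[target].parameters())
    return torch.optim.AdamW(trainable, lr=lr)  # run 500-2,000 gradient steps
```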
Pretraining Data Composition and Sourcing
CrossFormer's 900,000-episode pretraining corpus draws from Open X-Embodiment, which aggregates 22 datasets spanning kitchen manipulation, tabletop rearrangement, and mobile manipulation tasks. The largest contributors are BridgeData V2 (60,000 episodes, WidowX 250), DROID (76,000 episodes, Franka Emika), and RT-1 (130,000 episodes, Google's custom 7-DoF arm)[1]. Smaller datasets like ALOHA (1,200 episodes) and RoboSet (800 episodes) provide coverage of dual-arm and mobile platforms but contribute <5% of total training steps due to their size.
This distribution creates a pretraining bias toward single-arm tabletop tasks with wrist-mounted cameras—the most common configuration in academic labs. Organizations deploying overhead cameras, mobile bases, or multi-arm setups face a domain gap that requires additional fine-tuning data. Truelabel's marketplace addresses this by offering custom teleoperation collection for underrepresented embodiments: 200–1,000 episodes across 5+ tasks, sufficient to train effective tokenizer layers for new platforms.
All pretraining data uses RLDS format with standardized episode structure: a sequence of (observation, action, reward, discount) tuples plus metadata fields for embodiment name, camera intrinsics, and action bounds. Language annotations appear in 60% of episodes, primarily from BridgeData V2 and RT-1; the remaining 40% train without language conditioning, relying solely on visual and proprioceptive inputs. This mixed-modality pretraining enables CrossFormer to handle both language-conditioned and goal-image-conditioned tasks at inference time.
Fine-Tuning Requirements for New Embodiments
Adapting CrossFormer to a new embodiment requires two data collection phases. First, a diverse multi-task corpus of 200–1,000 episodes across 5+ tasks trains the tokenizer and detokenizer to map the new platform's observation and action spaces into the shared latent representation. Tasks should span the robot's workspace and action repertoire: pick-and-place, push, pull, open/close, and bimanual coordination if applicable. This phase does NOT require task success—the goal is coverage of the embodiment's state-action distribution, not policy performance.
Second, task-specific fine-tuning collects 50–500 demonstrations per target task depending on complexity. Simple pick-and-place tasks fine-tune effectively with 50–100 demos; multi-step assembly or deformable object manipulation may require 250–500 demos to achieve >80% success rates. Demonstrations must match the pretraining data format: 224×224 RGB images at the same camera angles used during tokenizer training, proprioceptive state sampled at the embodiment's native control frequency, and optional language instructions if the task requires them.
LeRobot provides reference implementations for collecting CrossFormer-compatible data via teleoperation or kinesthetic teaching. The framework handles RLDS serialization, action normalization, and metadata validation automatically, reducing integration overhead for new embodiments. For organizations without in-house robotics expertise, truelabel's data marketplace offers turnkey collection services: we deploy to your facility with compatible hardware, collect the required episode count under your task specifications, and deliver validated RLDS datasets with per-embodiment statistics and camera calibration files.
Comparison with RT-1, RT-2, and Octo
CrossFormer occupies a middle ground between single-embodiment specialists like RT-1 and fully generalist models like RT-2. RT-1 trains exclusively on Google's 7-DoF arm with 130,000 episodes, achieving 97% success on in-distribution kitchen tasks but zero-shot transferring poorly to new embodiments. RT-2 adds vision-language pretraining from web data, improving language grounding and object recognition but still requiring embodiment-specific fine-tuning for action space adaptation. CrossFormer explicitly models embodiment heterogeneity in its architecture, enabling zero-shot transfer to platforms present in the pretraining set and few-shot adaptation to new platforms.
Octo, released by the same Berkeley team in 2023, shares CrossFormer's cross-embodiment goal but uses a different approach: a single unified action space with learned action tokenization rather than per-embodiment detokenizers. Octo pretrained on 800,000 episodes from Open X-Embodiment and fine-tunes with 25–100 demonstrations per task[2]. Empirical comparisons show CrossFormer outperforms Octo by 15–30% on embodiments with <10,000 pretraining episodes (e.g., ALOHA, UR5e) but underperforms by 5–15% on heavily-represented platforms like WidowX. The architectural trade-off: per-embodiment layers increase parameter count but improve sample efficiency for underrepresented platforms.
OpenVLA extends the RT-2 approach with a 7-billion-parameter vision-language-action model pretrained on web data plus 970,000 robot episodes. OpenVLA achieves state-of-the-art language grounding but requires 200–500 fine-tuning demonstrations per embodiment due to its large parameter count—5–10× more than CrossFormer. For organizations with limited data budgets, CrossFormer's 50-demo fine-tuning threshold makes it the most accessible cross-embodiment foundation model as of 2024.
Data Format and Metadata Requirements
CrossFormer consumes RLDS datasets with strict metadata requirements enforced at load time. Each episode must include an `embodiment_name` field matching one of the 20+ platforms in the pretraining set or a new identifier for custom embodiments. The `action_dim` field specifies the robot's action space dimensionality (typically 7 for single-arm: 6-DoF end-effector pose + gripper, or 14 for dual-arm). Camera configuration metadata lists the number of views (1–4), resolution (must be 224×224 or resizable to that), and mounting positions (wrist, overhead, third-person).
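The following sketch shows the kind of episode-level metadata described above, with a minimal validation check. Field names are illustrative; the authoritative schema is the RLDS spec shipped with the model.

```python
EXAMPLE_METADATA = {
    "embodiment_name": "widowx_250",   # pretraining platform or new identifier
    "action_dim": 7,                   # 6-DoF end-effector pose + gripper
    "cameras": {
        "num_views": 2,                          # 1-4 views supported
        "resolution": [224, 224],                # or resizable to 224x224
        "mounting": ["wrist", "third_person"],
    },
}

def validate_metadata(meta: dict) -> None:
    """Reject episodes missing fields that are enforced at load time."""
    for key in ("embodiment_name", "action_dim", "cameras"):
        if key not in meta:
            raise ValueError(f"missing required metadata field: {key}")
    if not 1 <= meta["cameras"]["num_views"] <= 4:
        raise ValueError("camera count must be between 1 and 4")
```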
Action normalization statistics—mean and standard deviation per action dimension—must be precomputed on the training set and stored in the dataset metadata. CrossFormer's tokenizer applies z-score normalization using these statistics before feeding actions to the transformer, ensuring consistent gradient magnitudes across embodiments with different joint ranges. For example, WidowX joint velocities range ±2 rad/s while Franka joints range ±0.5 rad/s; without normalization, the model would overfit to the higher-magnitude platform.
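Z-score normalization itself is a few lines; the operational detail that matters is that the statistics are computed once over the training set and shipped with the dataset metadata:

```python
import numpy as np

def compute_action_stats(actions: np.ndarray):
    """actions: (num_steps, action_dim), stacked across the training set."""
    mean = actions.mean(axis=0)
    std = actions.std(axis=0) + 1e-8   # epsilon guards constant dimensions
    return mean, std

def normalize_action(a, mean, std):
    return (a - mean) / std            # applied by the tokenizer before training

def denormalize_action(a, mean, std):
    return a * std + mean              # inverted by the detokenizer at inference
```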
RLDS format stores episodes as TFRecord files with nested protocol buffers, enabling efficient random access during training. Each timestep contains an `observation` dict (RGB images as uint8 arrays, proprioceptive state as float32), an `action` array (float32), a `reward` scalar (float32, often zero for imitation learning), and a `discount` scalar (0.0 for terminal states, 1.0 otherwise). Language instructions, if present, appear in the `observation` dict as a `language_instruction` string field. TensorFlow Datasets provides loaders that handle batching, shuffling, and prefetching automatically, reducing data pipeline overhead during training.
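A hedged sketch of loading an RLDS dataset with TensorFlow Datasets; the directory path is a placeholder, and the exact observation keys depend on the dataset:

```python
import tensorflow_datasets as tfds

# Placeholder path to a directory holding RLDS TFRecords plus dataset metadata.
builder = tfds.builder_from_directory("/data/my_embodiment_rlds")
ds = builder.as_dataset(split="train", shuffle_files=True)

for episode in ds.take(1):
    # Each RLDS episode nests its timesteps under a "steps" sub-dataset.
    for step in episode["steps"]:
        obs = step["observation"]      # dict: uint8 RGB arrays, float32 proprio
        action = step["action"]        # float32, shape (action_dim,)
        reward = step["reward"]        # often 0.0 for imitation-learning data
        discount = step["discount"]    # 0.0 marks terminal states
```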
Deployment Considerations and Inference Latency
At 130 million parameters, CrossFormer supports real-time inference on edge GPUs: 18 ms per forward pass on an NVIDIA Jetson AGX Orin, 8 ms on an RTX 4090. The diffusion action decoder adds 10 denoising steps, increasing total latency to 80–180 ms depending on hardware. For embodiments with control frequencies ≤10 Hz (WidowX, UR5e), this latency is acceptable; for high-frequency platforms like ALOHA (50 Hz), the model must predict 4-step action chunks and execute them open-loop between queries to maintain real-time performance.
Model quantization to INT8 reduces inference time by 40% with <2% task success degradation, making CrossFormer deployable on lower-power edge devices. The LeRobot framework includes quantization scripts and benchmarking tools for measuring latency-accuracy trade-offs on target hardware. For safety-critical applications, the diffusion decoder's stochastic sampling can be replaced with deterministic mean prediction, eliminating action variance at the cost of 5–10% success rate on multimodal tasks.
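As one common quantization recipe (the LeRobot scripts may use a different pipeline), PyTorch's dynamic quantization converts the linear layers to INT8 in a single call:

```python
import torch
import torch.nn as nn

def quantize_policy(model: nn.Module) -> nn.Module:
    """Dynamic INT8 quantization of linear layers: weights are stored in
    INT8 and activations are quantized on the fly during inference."""
    return torch.quantization.quantize_dynamic(
        model.eval(), {nn.Linear}, dtype=torch.qint8
    )
```

Measured speedups depend on the hardware backend, so benchmark on the target device before committing to a deployment configuration.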
Memory footprint is 520 MB for the full-precision model, 140 MB for INT8-quantized weights. This fits comfortably in the 8 GB VRAM of entry-level edge GPUs, leaving headroom for camera drivers, ROS nodes, and other runtime components. For multi-robot deployments, a single GPU can serve 4–6 robots concurrently by batching inference requests, reducing per-robot hardware costs. Scale AI's physical AI platform uses this architecture to serve CrossFormer policies across 12 WidowX arms in their data collection facility[3].
Language Conditioning and Task Specification
CrossFormer accepts natural language instructions via a frozen T5-base encoder that maps text to 768-dimensional embeddings. During pretraining, 60% of episodes include language annotations like "pick up the red block" or "open the top drawer"; the remaining 40% train without language, using only visual observations. This mixed-modality pretraining enables the model to handle both language-conditioned tasks (where the instruction specifies the goal) and goal-image-conditioned tasks (where a target image specifies the desired end state).
Language instructions must be concise and object-centric—CrossFormer's pretraining data contains primarily imperative commands ("grasp", "place", "push") rather than descriptive sentences. Instructions longer than 20 tokens are truncated, and abstract or multi-step commands ("tidy the table") perform poorly unless decomposed into atomic actions. For tasks requiring complex reasoning, RT-2's vision-language pretraining provides stronger language grounding at the cost of higher fine-tuning data requirements.
Goal-image conditioning replaces the language instruction with a target RGB image showing the desired final state. The model concatenates the current observation and goal image as additional input tokens, then predicts actions that minimize the visual distance between current and goal states. This mode is particularly effective for rearrangement tasks where the goal is easier to show than describe. BridgeData V2 includes 15,000 goal-image-conditioned episodes that contribute to CrossFormer's pretraining, enabling zero-shot goal-image transfer to new embodiments[4].
Active Research Directions and Model Limitations
CrossFormer's primary limitation is its reliance on embodiment-specific tokenization layers, which increase parameter count linearly with the number of supported platforms. As of 2024, the model supports 20+ embodiments with 2.6 million parameters per tokenizer/detokenizer pair, totaling 52 million parameters for embodiment-specific layers alone. Scaling to 100+ embodiments would require 260 million parameters just for tokenization, approaching the size of the transformer backbone itself. Active research explores learned action space clustering to group similar embodiments and share tokenizers, potentially reducing per-embodiment overhead by 60–80%.
The model also struggles with long-horizon tasks requiring memory beyond the 4-step action chunk window. Tasks like "make a sandwich" (20+ steps) or "clean the kitchen" (50+ steps) require explicit task decomposition or hierarchical policies that CrossFormer does not provide. CALVIN and other long-horizon benchmarks show CrossFormer achieving 40–60% success on 5-step tasks but <20% on 10+ step tasks without external task planning.
Finally, CrossFormer's pretraining data skews heavily toward tabletop manipulation in structured lab environments. Mobile manipulation, outdoor navigation, and contact-rich assembly tasks are underrepresented, limiting zero-shot transfer to these domains. Organizations deploying CrossFormer for warehouse automation, agricultural robotics, or construction must budget for 500–2,000 fine-tuning demonstrations to bridge the domain gap. Truelabel's marketplace offers custom collection for these underrepresented domains, with delivery timelines of 4–8 weeks depending on task complexity and embodiment availability.
Integration with Existing Robot Stacks
CrossFormer policies deploy as ROS 2 nodes that subscribe to camera and joint state topics, then publish action commands at the embodiment's native control frequency. The LeRobot repository includes reference ROS 2 wrappers for WidowX, Franka, and UR5e that handle topic remapping, coordinate frame transforms, and action scaling automatically. For custom embodiments, integrators must implement a thin adapter layer that maps the robot's native state representation (joint angles, Cartesian pose, gripper state) to CrossFormer's expected input format and converts predicted actions back to the robot's command interface.
The model expects images in RGB format at 224×224 resolution with pixel values normalized to [0, 1]. Camera drivers must publish `sensor_msgs/Image` messages at ≥10 Hz; the ROS wrapper buffers the most recent frame and resizes/normalizes it before inference. For embodiments with multiple cameras, the wrapper concatenates images along the channel dimension, producing a single 224×224×(3×num_cameras) tensor. Proprioceptive state (joint positions, velocities, gripper state) publishes as `sensor_msgs/JointState` messages; the wrapper extracts the relevant fields and concatenates them into a fixed-length vector matching the embodiment's expected state dimension.
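The preprocessing contract in this paragraph reduces to a short function. This sketch uses OpenCV for resizing; the camera concatenation order must match the order used during tokenizer training.

```python
import numpy as np
import cv2

def preprocess_frames(frames: list[np.ndarray]) -> np.ndarray:
    """frames: HxWx3 uint8 RGB images, one per camera, in a fixed order.
    Returns a 224x224x(3*num_cameras) float32 array with values in [0, 1]."""
    resized = [
        cv2.resize(f, (224, 224), interpolation=cv2.INTER_AREA) for f in frames
    ]
    stacked = np.concatenate(resized, axis=-1)  # concatenate along channels
    return stacked.astype(np.float32) / 255.0
```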
Action commands publish as `trajectory_msgs/JointTrajectory` messages with 4-step trajectories and timestamps spaced at the control frequency. The robot's low-level controller interpolates between waypoints and executes the trajectory open-loop while the policy computes the next action chunk. This receding-horizon pattern balances reactivity (the policy re-queries every 4 steps) and smoothness (the controller interpolates between waypoints). For safety, the ROS wrapper includes configurable joint limits, velocity limits, and workspace boundaries that clip predicted actions before execution.
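A minimal rclpy sketch of the publishing side of such a wrapper. The topic name, joint names, and timing constants are placeholders; safety clipping and the policy inference call are omitted for brevity.

```python
from rclpy.node import Node
from trajectory_msgs.msg import JointTrajectory, JointTrajectoryPoint
from builtin_interfaces.msg import Duration

CONTROL_PERIOD_NS = int(1e9 / 5)  # 5 Hz example; use the embodiment's native rate

class ChunkPublisher(Node):
    """Publishes a predicted 4-step action chunk as a JointTrajectory."""

    def __init__(self):
        super().__init__("crossformer_chunk_publisher")
        self.pub = self.create_publisher(
            JointTrajectory, "/arm_controller/command", 10
        )

    def publish_chunk(self, chunk) -> None:
        """chunk: iterable of 4 action vectors (joint targets)."""
        msg = JointTrajectory()
        msg.joint_names = [f"joint_{i}" for i in range(len(chunk[0]))]
        for i, action in enumerate(chunk):
            point = JointTrajectoryPoint()
            point.positions = [float(a) for a in action]
            t_ns = (i + 1) * CONTROL_PERIOD_NS  # waypoints at the control period
            point.time_from_start = Duration(
                sec=t_ns // 1_000_000_000, nanosec=t_ns % 1_000_000_000
            )
            msg.points.append(point)
        self.pub.publish(msg)
```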
Cost and Data Budget Planning
Training CrossFormer from scratch requires 900,000 episodes across 20+ embodiments, representing approximately 45,000 hours of robot operation time (assuming 3-minute average episode length). At $150/hour for teleoperation labor plus $50/hour for robot amortization and facility costs, pretraining data collection alone comes to roughly $9 million—prohibitive for most organizations. The pretrained model's value proposition is amortizing this cost across all users: fine-tuning to a new task requires only 50–500 demonstrations, or roughly $1,500–$15,000 in collection costs once task setup, resets, and validation overhead are included.
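The headline figure decomposes as straightforward arithmetic:

```python
episodes = 900_000
minutes_per_episode = 3                       # assumed average episode length
hours = episodes * minutes_per_episode / 60   # 45,000 robot-hours
cost = hours * (150 + 50)                     # teleop labor + amortization, $/hour
print(f"{hours:,.0f} robot-hours, ~${cost:,.0f}")  # 45,000 robot-hours, ~$9,000,000
```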
For new embodiments not present in the pretraining set, the two-phase fine-tuning process requires 200–1,000 episodes for tokenizer training ($30,000–$150,000) plus 50–500 episodes per task ($1,500–$15,000 per task). Organizations deploying CrossFormer across 5–10 tasks on a new embodiment should budget $40,000–$200,000 for data collection, plus 2–4 weeks for annotation, validation, and RLDS conversion. Truelabel's marketplace offers volume discounts for multi-task collections and can parallelize data collection across multiple robots to compress timelines.
Inference costs depend on deployment scale and hardware. A single NVIDIA Jetson AGX Orin ($1,200) supports 1–2 robots at 10 Hz control frequency; an RTX 4090 ($1,600) supports 4–6 robots. For fleets of 20+ robots, centralized inference on a server-grade GPU (A100, $10,000) reduces per-robot hardware costs to $500–$800. Cloud inference via Scale AI's API costs $0.02–$0.05 per action prediction, or $7,200–$18,000 per robot-year at 10 Hz continuous operation—economical only for prototyping or low-duty-cycle applications[5].
Regulatory and Compliance Considerations
CrossFormer's pretraining data draws from Open X-Embodiment, which aggregates datasets with heterogeneous licenses ranging from MIT to CC BY-NC 4.0. The most restrictive license in the corpus is CC BY-NC 4.0 (applied to BridgeData V2 and DROID), which prohibits commercial use without explicit permission from the dataset authors (see the CC BY-NC 4.0 deed). Organizations deploying CrossFormer in commercial products must audit the pretraining data provenance and obtain commercial licenses for NC-restricted datasets, or retrain the model on a fully permissive subset—a process that reduces pretraining data to ~400,000 episodes and degrades task success rates by 10–20%.
The EU AI Act classifies robot manipulation systems as high-risk AI when deployed in industrial or healthcare settings, triggering requirements for dataset documentation, bias testing, and human oversight (EU AI Act, Regulation 2024/1689). CrossFormer's pretraining data lacks the demographic diversity and failure-mode annotations required for high-risk compliance; organizations must collect supplementary validation datasets that test performance across edge cases, lighting conditions, and object variations. Truelabel's data provenance tracking provides the audit trail and metadata required for regulatory submissions, including collector demographics, annotation protocols, and inter-annotator agreement statistics.
For U.S. federal procurement, datasets must comply with FAR 52.227-14 data rights clauses, which require unlimited rights for government-funded data and restricted rights for commercial data (FAR Subpart 27.4). CrossFormer's pretraining data includes government-funded datasets (DROID, RoboNet) that grant unlimited rights, but also commercial datasets (RT-1, BridgeData V2) that restrict redistribution. Contractors must negotiate data rights with dataset authors before incorporating CrossFormer into government deliverables, or retrain on a fully government-licensed subset.
Future Model Versions and Roadmap
The CrossFormer team has announced plans for a 2025 release (CrossFormer-2) that scales the pretraining corpus to 2 million episodes and adds support for 40+ embodiments, including mobile manipulators, quadrupeds, and humanoid robots. The updated architecture will use hierarchical tokenization—shared low-level tokenizers for similar embodiments (e.g., all 6-DoF arms) and embodiment-specific high-level tokenizers for platform-specific features (gripper type, mobile base kinematics). This design reduces per-embodiment parameter overhead from 2.6 million to 800,000 while maintaining fine-tuning sample efficiency.
CrossFormer-2 will also integrate world model pretraining, following the approach pioneered by World Models and recently scaled by NVIDIA Cosmos. The model will predict future observations in addition to actions, enabling model-based planning and sim-to-real transfer via learned dynamics. Pretraining will use 5 million episodes of unlabeled robot video (no action annotations required), reducing data collection costs by 60–80% compared to demonstration-only pretraining. Early results show world model pretraining improves zero-shot transfer by 20–35% on embodiments with <5,000 labeled episodes[6].
The team is also developing a commercial licensing program that allows organizations to pretrain custom CrossFormer variants on proprietary datasets without open-sourcing the resulting models. This addresses the IP concerns of companies with large internal robot fleets (automotive manufacturers, logistics providers) who want foundation model benefits without exposing their data. Pricing is expected at $500,000–$2 million depending on pretraining data volume and embodiment count, with delivery timelines of 6–12 months.
External references and source context
1. Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). Cited for: BridgeData V2 60,000 episodes, DROID 76,000 episodes, RT-1 130,000 episodes in OXE.
2. Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). Cited for: Octo pretrained on 800,000 episodes, fine-tunes with 25–100 demonstrations.
3. Scale AI: Expanding Our Data Engine for Physical AI (scale.com). Cited for: Scale AI serves CrossFormer policies across 12 WidowX arms in its data collection facility.
4. BridgeData V2: A Dataset for Robot Learning at Scale (arXiv). Cited for: goal-image conditioning enables zero-shot transfer to new embodiments.
5. Scale AI: Physical AI (scale.com). Cited for: cloud inference costs $0.02–$0.05 per action prediction.
6. General Agents Need World Models (arXiv). Cited for: world model pretraining improves zero-shot transfer by 20–35% on low-data embodiments.
FAQ
What is the minimum number of demonstrations required to fine-tune CrossFormer to a new task on a supported embodiment?
CrossFormer requires 50–500 demonstrations per task depending on complexity. Simple pick-and-place tasks fine-tune effectively with 50–100 demos, achieving >80% success rates after 500–1,000 gradient steps. Multi-step assembly, deformable object manipulation, or contact-rich tasks may require 250–500 demonstrations to reach comparable performance. These numbers assume the embodiment is already present in CrossFormer's pretraining set; new embodiments require an additional 200–1,000 episodes for tokenizer training before task-specific fine-tuning begins.
Can CrossFormer handle embodiments with different action spaces than those in the pretraining set?
Yes, but it requires training new tokenizer and detokenizer layers for the target embodiment. The process involves collecting 200–1,000 diverse episodes across 5+ tasks to learn the mapping between the new embodiment's observation/action spaces and CrossFormer's shared latent representation. Once the tokenizer is trained, the model can fine-tune to specific tasks with 50–500 demonstrations per task. Embodiments with radically different morphologies (e.g., quadrupeds, humanoids) may require 1,000–2,000 tokenizer training episodes to achieve performance comparable to pretraining-set embodiments.
What data format does CrossFormer require and how do I convert existing robot datasets?
CrossFormer requires RLDS format with specific metadata fields: embodiment_name, action_dim, camera configuration (number of views, resolution, mounting positions), and per-dimension action normalization statistics (mean and standard deviation). Each episode must be a sequence of (observation, action, reward, discount) tuples stored as TFRecord files. The LeRobot framework provides conversion scripts for common formats (ROS bags, HDF5, MCAP) that handle RLDS serialization, image resizing, and metadata validation automatically. For custom formats, you must implement a conversion pipeline that maps your data schema to RLDS's nested protocol buffer structure.
How does CrossFormer compare to Octo and OpenVLA for cross-embodiment learning?
CrossFormer uses per-embodiment tokenization layers, Octo uses learned unified action tokenization, and OpenVLA uses vision-language pretraining with embodiment-agnostic action heads. CrossFormer outperforms Octo by 15–30% on embodiments with <10,000 pretraining episodes but underperforms by 5–15% on heavily-represented platforms. OpenVLA achieves the best language grounding due to its 7-billion-parameter vision-language backbone but requires 200–500 fine-tuning demonstrations per embodiment versus CrossFormer's 50–500. For organizations with limited data budgets, CrossFormer offers the best sample efficiency; for language-heavy applications, OpenVLA is preferable despite higher data costs.
What are the licensing restrictions on CrossFormer's pretraining data?
CrossFormer's pretraining data includes datasets licensed under CC BY-NC 4.0 (BridgeData V2, DROID), which prohibits commercial use without explicit permission from the dataset authors. Organizations deploying CrossFormer in commercial products must obtain commercial licenses for these datasets or retrain on a permissive subset, which reduces pretraining data to ~400,000 episodes and degrades performance by 10–20%. The model weights themselves are released under Apache 2.0, but the pretraining data's restrictive licenses create downstream compliance obligations that buyers must audit before deployment.
Can CrossFormer run in real-time on edge hardware for high-frequency control?
CrossFormer achieves 18 ms inference latency on NVIDIA Jetson AGX Orin and 8 ms on RTX 4090 for a single forward pass. The diffusion action decoder adds 10 denoising steps, increasing total latency to 80–180 ms. For embodiments with control frequencies ≤10 Hz (WidowX, UR5e), this is acceptable. For high-frequency platforms like ALOHA (50 Hz), the model predicts 4-step action chunks and executes them open-loop between queries to maintain real-time performance. INT8 quantization reduces latency by 40% with <2% task success degradation, making real-time control feasible on lower-power edge devices.
Looking for CrossFormer?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.
Source CrossFormer Training Data