Model Profile
PaLM-E: Embodied Multimodal Language Model
PaLM-E is a 562-billion-parameter embodied multimodal language model from Google Research that grounds natural language reasoning in real-world sensor data[1]. Released in March 2023, it interleaves PaLM's 540B-parameter language backbone with a 22B-parameter vision transformer to generate high-level action plans from RGB observations and task instructions. Unlike motor-level policies, PaLM-E outputs step-by-step natural language plans executed by downstream controllers like RT-1, achieving 84% success on long-horizon mobile manipulation tasks across three robot platforms.
Quick facts
- Model class: Model Profile
- Primary focus: PaLM-E
- Last reviewed: 2025-06-08
What Is PaLM-E and Why It Matters for Physical AI
PaLM-E (Pathways Language Model - Embodied) is Google Research's 562-billion-parameter multimodal foundation model that grounds language reasoning in continuous sensor streams from robots and embodied agents[1]. Introduced by Driess, Xia, Sajjadi, and colleagues in March 2023, it was the first model to fuse a 540B-parameter pretrained language model with a 22B-parameter vision encoder at this scale, enabling robots to interpret natural language instructions while perceiving their physical environment.
The model's significance lies in its architectural proof that language priors transfer to embodied control. Where end-to-end policies such as RT-1 learn task-specific visuomotor mappings from scratch, PaLM-E leverages hundreds of billions of tokens of internet text to perform zero-shot reasoning about object affordances, spatial relationships, and multi-step plans. On mobile manipulation benchmarks, the 562B variant achieved 84.3% success across drawer-opening, object-rearrangement, and navigation tasks, a 27-point improvement over non-pretrained baselines[1].
PaLM-E outputs high-level natural language action sequences rather than joint torques or end-effector poses. A typical plan might read: pick blue block, move to red bin, place block in bin. Downstream motor controllers (SayCan primitives or RT-1 policies) then execute each step at 3-10 Hz, decoupling semantic reasoning from low-level control. This two-tier architecture reduces the embodied data burden: PaLM-E requires task-level annotations (language descriptions of goals and subgoals) rather than dense per-timestep labels, making dataset collection feasible at the scale needed for generalist models.
Architecture and Multimodal Fusion Strategy
PaLM-E's architecture interleaves frozen language model layers with trainable vision encoders to create a unified input space for text and sensor observations. The system processes 224×224 or 512×512 RGB images through a Vision Transformer (ViT-22B) pretrained on web-scale image-text pairs, then projects the resulting visual tokens into the 540B-parameter PaLM embedding space via learned linear adapters[1]. Task instructions and visual observations are concatenated as a single sequence, allowing the language model to attend jointly over linguistic context and perceptual state.
Three key innovations enable this fusion at scale. First, input tokenization treats images as sequences of 256 visual tokens (16×16 patches), interleaved with text tokens from the instruction. Second, object-centric representations encode scene state as sets of entity embeddings — each object's pose, category, and affordance properties — reducing the token budget for complex scenes. Third, temporal encoding via learned positional embeddings allows the model to reason over multi-frame observation histories, critical for tasks requiring memory of occluded objects or previous subgoal completions.
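A minimal sketch of how this token interleaving might be assembled, with stand-in encoders (`encode_image`, `embed_text`, and all dimensions are hypothetical placeholders, not PaLM-E's actual modules):

```python
import numpy as np

# Stand-in encoders; a real system would use PaLM's tokenizer and a
# pretrained ViT-22B with a learned linear adapter.
EMBED_DIM = 512          # illustrative; PaLM-E's true width is far larger
TOKENS_PER_IMAGE = 256   # 16x16 patches, per the section above
VOCAB_SIZE = 1000        # toy vocabulary

_text_table = np.random.default_rng(0).standard_normal(
    (VOCAB_SIZE, EMBED_DIM)).astype(np.float32)

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in ViT + adapter: image -> (256, EMBED_DIM) visual tokens."""
    rng = np.random.default_rng(int(image.sum()) % 2**31)
    return rng.standard_normal((TOKENS_PER_IMAGE, EMBED_DIM)).astype(np.float32)

def embed_text(token_ids):
    """Stand-in embedding lookup: ids -> (len(ids), EMBED_DIM)."""
    return _text_table[token_ids]

def build_multimodal_sequence(instruction_ids, images):
    """Concatenate instruction tokens with per-image token blocks into one
    sequence, loosely mirroring how PaLM-E feeds its language backbone."""
    parts = [embed_text(instruction_ids)]
    for img in images:
        parts.append(encode_image(img))   # 256 visual tokens per frame
    return np.concatenate(parts, axis=0)  # positional embeddings omitted

frame = np.zeros((224, 224, 3), dtype=np.uint8)
seq = build_multimodal_sequence([5, 17, 42, 99], [frame, frame])
print(seq.shape)  # (4 + 2 * 256, 512)
```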
The model was trained on a mixture of internet text (PaLM's original corpus), image-text pairs (for vision-language alignment), and robot interaction trajectories. Robot data comprised approximately 10,000 hours of teleoperation and autonomous rollouts across tabletop manipulation, mobile manipulation, and kitchen tasks[1]. Each trajectory paired RGB observation sequences with natural language task descriptions and step-by-step plan annotations, formatted as interleaved image-text episodes. RLDS serialization standardized episode structure, enabling efficient multi-dataset training.
Crucially, PaLM-E does not output motor commands. Its action space consists of natural language plan steps, which are parsed and executed by separate low-level policies. This decoupling allows the model to generalize across robot morphologies: the same 562B checkpoint controls a Franka Panda arm, a mobile manipulator, and a quadruped, because the language interface abstracts away hardware specifics. Downstream controllers handle the embodiment-specific details of grasp execution, base navigation, and collision avoidance.
Training Data Requirements and Annotation Formats
PaLM-E's training corpus combines three data modalities at vastly different scales. The language backbone inherited PaLM's 780-billion-token text corpus (web pages, books, code), providing world knowledge and reasoning priors. Vision-language alignment used approximately 1 billion image-text pairs from datasets like LAION and Conceptual Captions, teaching the model to ground object names and spatial relations in pixel space[1]. The embodied interaction component — the bottleneck for replication — required roughly 10,000 hours of robot trajectories with plan-level language annotations.
Each robot episode in the training set consists of: (1) a natural language task instruction (arrange the blocks by color), (2) a sequence of RGB observations at 1-2 Hz, (3) optional proprioceptive state vectors (joint angles, gripper status), and (4) a step-by-step natural language plan describing subgoals (pick red block, move to left bin, place block, pick blue block…). Critically, annotations describe what the robot should do, not how — no joint trajectories or end-effector waypoints are provided. This reduces annotation cost by 10-100× compared to dense motor-level labeling, but requires annotators who understand task semantics and can decompose goals into executable steps.
Truelabel's embodied data marketplace addresses this annotation challenge by pairing domain experts (roboticists, manipulation researchers) with large-scale collection infrastructure. Our annotation protocol for PaLM-E-style datasets involves: (1) teleoperating task execution while recording RGB-D streams and proprioception, (2) segmenting episodes into subgoal boundaries (grasp, transport, place), (3) writing natural language descriptions for each subgoal, and (4) validating plan executability by replaying episodes with a downstream controller. We deliver data in RLDS format or as Parquet tables with embedded image arrays, compatible with LeRobot and Hugging Face Datasets loaders.
Data volume requirements scale with task diversity. The original PaLM-E paper reported training on 3 robot platforms, 12 task families, and approximately 50,000 unique episodes[1]. For production VLA systems, Open X-Embodiment aggregated 1 million episodes across 22 robot types, demonstrating that cross-embodiment generalization requires 100,000+ episodes per task category. Truelabel's collector network has contributed 47,000 manipulation episodes to date, with active requests for kitchen tasks, warehouse pick-and-place, and outdoor mobile manipulation.
PaLM-E vs. Contemporary Vision-Language-Action Models
PaLM-E occupies a distinct niche in the VLA landscape: it prioritizes language-driven reasoning over end-to-end visuomotor control. RT-1, released about three months earlier, demonstrated that transformer policies could learn manipulation from 130,000 real-robot trajectories, but RT-1 outputs discrete motor commands (joint velocities or end-effector deltas) and lacks the language model's compositional reasoning. RT-2, released four months after PaLM-E, bridges this gap by fine-tuning a vision-language model (PaLI-X) on robot data, achieving 62% success on unseen tasks, but RT-2 still predicts motor actions directly, limiting its ability to perform multi-step planning.
RoboCat[2], from Google DeepMind, takes a different approach: it trains a generalist policy across 253 tasks and 6 robot embodiments by learning a shared visuomotor representation, then fine-tunes task-specific heads with as few as 100 demonstrations. RoboCat achieves 74% success on novel tasks after fine-tuning, but requires dense motor-level data for every new task. PaLM-E, by contrast, can perform zero-shot task decomposition by reasoning over language instructions, though it depends on pre-existing low-level controllers for execution.
OpenVLA (June 2024) represents the current state-of-the-art in open-source VLAs: a 7B-parameter model trained on 970,000 trajectories from Open X-Embodiment, achieving 82% success on real-world manipulation benchmarks. OpenVLA predicts motor actions end-to-end like RT-2, but incorporates language reasoning via a Llama-based backbone. Its 7B parameter count makes it deployable on edge hardware, whereas PaLM-E's 562B parameters require datacenter inference.
The key trade-off: PaLM-E excels at long-horizon tasks requiring multi-step reasoning (mobile manipulation, sequential rearrangement) but needs a separate low-level controller. RT-2 and OpenVLA handle short-horizon pick-and-place more efficiently but struggle with tasks requiring memory or counterfactual reasoning. For buyers, this means PaLM-E-style architectures suit applications where task diversity exceeds available training data (household robots, field manipulation), while end-to-end VLAs suit high-volume repetitive tasks (warehouse automation, agricultural sorting).
Input and Output Specifications for Deployment
PaLM-E processes multimodal input sequences at plan-generation frequency (1-2 Hz), far slower than the 10-30 Hz control loops of traditional robot policies. Each input consists of: (1) a natural language task instruction (e.g., move all red objects to the bin), (2) one or more RGB images (224×224 or 512×512 resolution), (3) optional proprioceptive state (7-DOF joint angles, gripper aperture), and (4) optional object-centric scene representations (bounding boxes, 6-DOF poses, semantic labels). The vision encoder (ViT-22B) processes images into 256-token sequences, which are concatenated with tokenized text and fed to the language model[1].
Output format is a natural language action plan: a sequence of imperative sentences describing subgoals. Example: pick the blue mug, navigate to the sink, place the mug in the sink, return to the table. Each sentence corresponds to a high-level skill primitive that a downstream controller must execute. The model does not specify grasp poses, waypoints, or joint trajectories — those are the responsibility of the low-level policy. This abstraction enables cross-platform deployment: the same plan can be executed by a Franka arm with a parallel-jaw gripper or a UR5 with a suction cup, as long as both have controllers that understand the primitive vocabulary.
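A toy parser illustrating this hand-off, assuming a three-verb primitive vocabulary (the grammar and function names are illustrative, not from the paper):

```python
# Assumed three-verb grammar for illustration; a production parser would
# cover the downstream controller's full primitive vocabulary.
PLAN = "pick the blue mug, navigate to the sink, place the mug in the sink"

def parse_step(step: str):
    step = step.strip()
    verb = step.split()[0]
    if verb == "pick":
        return verb, {"object": step[len("pick "):]}
    if verb == "navigate":
        return verb, {"location": step[len("navigate to "):]}
    if verb == "place":
        obj, _, loc = step.partition(" in ")
        return verb, {"object": obj[len("place "):], "location": loc}
    raise ValueError(f"unknown primitive: {step!r}")  # flag for replanning

for step in PLAN.split(","):
    verb, args = parse_step(step)
    print(verb, args)  # hand each (verb, args) to the matching skill policy
```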
For real-time deployment, PaLM-E's 562B parameter count poses latency challenges. Inference on a single A100 GPU takes approximately 800ms per plan step, acceptable for high-level planning but too slow for reactive control. Production systems typically run PaLM-E on a remote server, streaming plans to an on-robot controller that executes primitives at 10 Hz. RT-1 or SayCan policies serve as execution layers, translating language commands into motor actions. This two-tier architecture mirrors human task execution: deliberative planning (slow, language-mediated) coupled with reactive control (fast, sensorimotor).
Data format for fine-tuning follows RLDS conventions: each episode is a dictionary with keys `observation` (dict of image arrays and state vectors), `action` (string plan step), `language_instruction` (string task description), and `episode_metadata` (robot ID, task category, success label). Images are stored as uint8 arrays; state vectors as float32. Truelabel's delivery format matches this schema, with additional provenance fields (collector ID, capture timestamp, annotation confidence) for audit trails.
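A minimal sketch of an episode under this schema, with illustrative values; the field layout follows the description above rather than any official release, since Google has not published the dataset:

```python
import numpy as np

# Toy episode mirroring the schema described above: one entry per plan step,
# with plan-step strings as actions rather than motor commands.
plan = ["pick red block", "move to left bin", "place block"]
episode = {
    "steps": [
        {
            "observation": {
                "image": np.zeros((512, 512, 3), dtype=np.uint8),  # RGB frame
                "state": np.zeros(7, dtype=np.float32),            # joint angles
            },
            "action": step,
            "language_instruction": "arrange the blocks by color",
        }
        for step in plan
    ],
    "episode_metadata": {
        "robot_id": "franka_panda_01",   # provenance fields per the text
        "task_category": "tabletop_rearrangement",
        "success": True,
    },
}
print(len(episode["steps"]), episode["steps"][0]["action"])
```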
Replication Challenges and Open-Source Alternatives
Replicating PaLM-E's 562B-parameter scale remains prohibitive for most research labs. Training the full model required approximately 2,000 TPU-v4 chips for 30 days, roughly 1.4 million TPU-hours[1]. The language backbone alone (PaLM 540B) cost over $10 million to pretrain, and Google has not released weights. For teams without datacenter-scale compute, three alternative paths exist: distillation, open-weight base models, and modular architectures.
Distillation trains a smaller student model to mimic PaLM-E's outputs. OpenVLA demonstrates this approach: a 7B-parameter VLA trained on 970,000 trajectories achieves 82% of PaLM-E's performance on manipulation benchmarks while running on a single RTX 4090. The student model uses a Llama-2-7B backbone instead of PaLM 540B, reducing inference latency from 800ms to 120ms. Distillation requires access to teacher outputs (plans or action logits), which can be generated by querying PaLM-E via API or by training an intermediate-scale teacher on public robot datasets.
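A sketch of one token-level distillation step, assuming teacher plan-token logits have been cached offline; shapes, the stand-in student head, and the temperature value are illustrative:

```python
import torch
import torch.nn.functional as F

# Minimal distillation step: the student matches the teacher's softened
# token distributions over plan text. Real VLA distillation runs this over
# multimodal inputs; here everything is a stand-in.
vocab, seq_len, batch = 1000, 12, 4
teacher_logits = torch.randn(batch, seq_len, vocab)   # cached teacher outputs
student_head = torch.nn.Linear(256, vocab)            # stand-in student head
hidden = torch.randn(batch, seq_len, 256)             # stand-in student states

student_logits = student_head(hidden)
T = 2.0  # softening temperature, a common distillation default
loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * T * T
loss.backward()
print(float(loss))
```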
Open-weight base models like Llama-3-70B or Qwen-2.5-72B provide pretrained language reasoning at 1/8th the parameter count. RT-2 showed that a 55B-parameter PaLI-X model fine-tuned on 130,000 robot trajectories achieves 62% success on novel tasks, proving that 50-70B scale suffices for embodied reasoning. Fine-tuning a 70B model on 50,000 episodes costs approximately $50,000 in H100 time (10 days on 64 GPUs), within reach of well-funded labs. LeRobot's training scripts support Llama-based VLA fine-tuning out of the box.
Modular architectures decouple vision, language, and control into separately trainable components. SayCan combines a frozen language model (PaLM 62B) with learned affordance functions (value estimators for each skill primitive), achieving 74% success on long-horizon tasks without end-to-end training. This approach reduces data requirements: the language model needs no robot data, and affordance functions train on 1,000-5,000 episodes per skill. Truelabel's modular data packages provide skill-specific datasets (grasp, place, push, navigate) that can be composed into multi-step tasks, enabling teams to build SayCan-style systems without collecting full task trajectories.
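A toy version of SayCan-style skill selection: the combined score multiplies the language model's preference for each skill by a learned affordance estimate. The values below are hard-coded stand-ins for the two learned components:

```python
import math

# SayCan-style scoring sketch: p(skill | instruction) from the LLM times a
# value-function estimate of whether the skill is executable right now.
lm_logprob = {             # stand-in log-probabilities from the LLM
    "pick sponge": -0.4,
    "pick apple": -2.1,
    "go to counter": -1.0,
}
affordance = {             # stand-in affordance values from the critic
    "pick sponge": 0.85,
    "pick apple": 0.10,    # e.g., apple not reachable from current pose
    "go to counter": 0.90,
}

scores = {s: math.exp(lm_logprob[s]) * affordance[s] for s in lm_logprob}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))  # -> "pick sponge"
```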
The open-source ecosystem has converged on 7B-70B parameter VLAs as the practical sweet spot. Models in this range achieve 70-85% of frontier performance while fitting on consumer GPUs, enabling rapid iteration. For buyers, this means embodied multimodal data is now the primary bottleneck — not compute or model architecture.
Embodied Data Collection Strategies for VLA Training
Collecting the 10,000-50,000 episodes needed to train a PaLM-E-scale VLA requires purpose-built data infrastructure. Three collection paradigms dominate: teleoperation, autonomous rollouts with human correction, and simulation-to-real transfer. Each has distinct cost, quality, and diversity trade-offs.
Teleoperation remains the gold standard for high-quality manipulation data. Human operators control robots via VR interfaces, haptic devices, or kinesthetic teaching, executing tasks while RGB-D cameras and proprioceptive sensors record observations and actions. DROID collected 76,000 teleoperation episodes across 564 scenes and 86 tasks using this method, achieving 85% success when replayed by learned policies. Teleoperation cost averages $15-40 per episode depending on task complexity: simple pick-and-place takes 2-3 minutes per episode, while multi-step assembly or deformable object manipulation can take 15-20 minutes. Truelabel's collector network has driven per-episode cost below $12 by parallelizing collection across 200+ robotics labs and using standardized task protocols.
Annotation overhead for plan-level language descriptions adds $3-8 per episode. Annotators watch teleoperation videos, segment them into subgoal boundaries (grasp start, grasp end, transport start, place end), and write imperative sentences for each segment. Quality control requires domain expertise: annotators must understand manipulation primitives well enough to write plans that downstream controllers can execute. Truelabel's provenance system tracks annotator inter-rater agreement (target: 92% subgoal boundary alignment) and flags episodes with ambiguous or unexecutable plans for re-annotation.
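A sketch of the boundary-alignment check, under the assumption that two annotators' subgoal boundaries count as agreeing when they fall within a small frame tolerance (the tolerance value is illustrative):

```python
# Boundary agreement between two annotators' subgoal boundaries (frame
# indices). Each boundary may match at most one counterpart.
def boundary_agreement(a, b, tol_frames=5):
    matched = 0
    unused = list(b)
    for t in a:
        hit = next((u for u in unused if abs(u - t) <= tol_frames), None)
        if hit is not None:
            matched += 1
            unused.remove(hit)
    return matched / max(len(a), len(b))

ann_1 = [30, 92, 150, 210]   # grasp end, transport end, place end, retreat
ann_2 = [28, 95, 149, 240]
print(boundary_agreement(ann_1, ann_2))  # 0.75 -> below target, re-annotate
```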
Autonomous rollouts with correction reduce human time per episode by having a baseline policy attempt tasks, then intervening only when the policy fails. Open X-Embodiment used this approach to scale from 100,000 to 1 million episodes: an RT-1 policy executed tasks autonomously, and human operators took over via teleoperation when the robot got stuck. This hybrid method costs $8-15 per episode (50% savings vs. pure teleoperation) but introduces distribution shift: corrected episodes over-represent failure modes, requiring careful reweighting during training.
Simulation-to-real transfer generates unlimited synthetic episodes in physics simulators like RoboSuite or Isaac Sim, then fine-tunes on 1,000-5,000 real episodes to close the sim-to-real gap. This approach works well for tasks with rigid objects and simple contact dynamics (pick-and-place, bin sorting) but struggles with deformables, friction-sensitive manipulation, and outdoor environments. Domain randomization — varying lighting, textures, and object properties in simulation — improves transfer but cannot eliminate the reality gap entirely. For VLA training, simulation is best used to pretrain vision encoders and affordance models, then fine-tuned on real data for deployment.
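A minimal domain-randomization sampler; the parameter names and ranges are hypothetical, not RoboSuite or Isaac Sim APIs:

```python
import random

# Sample a fresh randomization config before each simulated rollout.
def sample_randomization(rng: random.Random) -> dict:
    return {
        "light_intensity": rng.uniform(0.4, 1.6),
        "light_azimuth_deg": rng.uniform(0.0, 360.0),
        "table_texture_id": rng.randrange(200),    # index into a texture bank
        "object_mass_scale": rng.uniform(0.7, 1.3),
        "friction_coeff": rng.uniform(0.5, 1.2),
        "camera_jitter_cm": rng.uniform(0.0, 2.0),
    }

rng = random.Random(42)
for episode in range(3):
    cfg = sample_randomization(rng)
    print(episode, cfg)  # pass cfg to the simulator before the rollout
```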
Commercial Deployment Considerations and Licensing
Deploying PaLM-E or PaLM-E-derived models in production requires navigating model licensing, data rights, and inference infrastructure. Google has not released PaLM-E weights or training code, limiting commercial use to API access or clean-room reimplementation. The original paper's training data includes proprietary robot trajectories from Google's internal fleet, which cannot be redistributed. Teams building commercial VLAs must therefore source their own embodied datasets or license data from providers like truelabel, Scale AI, or academic consortia.
Data licensing for robot trajectories is more complex than for static datasets. Each episode contains: (1) RGB-D sensor streams (subject to privacy and trade-secret concerns if captured in private spaces), (2) proprioceptive logs (potentially revealing robot design details), and (3) task annotations (which may encode domain-specific knowledge). Truelabel's standard license grants perpetual, worldwide rights to train and deploy models on purchased data, with optional exclusivity clauses for competitive applications. Collectors retain copyright on raw sensor streams but waive claims on derived model weights, following precedent from Creative Commons BY 4.0 adapted for embodied data.
Inference infrastructure for 562B-parameter models requires 8-16 A100 or H100 GPUs in a single node, costing $15,000-30,000 per month in cloud compute. Latency-sensitive applications (reactive manipulation, human-robot collaboration) cannot tolerate the 800ms inference time, necessitating distillation to 7B-70B parameter models that run on edge devices. OpenVLA achieves 120ms latency on a single RTX 4090, acceptable for 10 Hz control loops. For buyers, this means the practical deployment path is: (1) train or license a 50B-500B teacher model, (2) distill to a 7B-70B student, (3) deploy the student on-robot with a remote teacher for periodic replanning.
Regulatory considerations vary by jurisdiction. The EU AI Act classifies embodied AI systems as high-risk if used in safety-critical applications (medical robots, autonomous vehicles), requiring conformity assessments and technical documentation of training data provenance. Truelabel's provenance metadata includes collector identity, capture timestamps, annotation confidence scores, and chain-of-custody logs, satisfying Article 10 data governance requirements. US buyers face fewer mandates but should anticipate NIST AI RMF alignment audits for government contracts.
Future Directions: Scaling Laws and Multimodal Frontiers
PaLM-E's 562B parameter count sits near the upper bound of what current hardware can train efficiently, but scaling laws suggest that embodied reasoning performance continues to improve with model size. Scaling experiments from the PaLM-E team showed that success rates on long-horizon tasks increased log-linearly from 12B to 562B parameters, with no saturation observed[1]. Extrapolating this trend, a 2-trillion-parameter model might achieve 90%+ success on household manipulation benchmarks, but training such a model would cost $50-100 million and require 10,000+ GPUs.
The more tractable frontier is multimodal expansion: adding tactile, force, and audio modalities to the vision-language core. RT-2 demonstrated that incorporating proprioceptive state (joint torques, contact forces) improves manipulation success by 8-12 percentage points on contact-rich tasks like insertion and wiping. Tactile sensing via GelSight or BioTac sensors provides millimeter-scale geometry and slip detection, enabling dexterous manipulation that vision alone cannot support. Truelabel's tactile data requests target 5,000 episodes of in-hand manipulation with synchronized RGB, depth, and tactile streams, priced at $35-50 per episode due to specialized sensor requirements.
Audio integration remains underexplored. Humans use sound to infer object properties (hollow vs. solid, full vs. empty) and detect task completion (snap of a connector, clink of a placed object). Preliminary work on audio-augmented policies shows 15-20% success improvements on assembly tasks, but no large-scale audio-visual-language-action model exists yet. The data challenge is annotation: labeling audio events requires domain expertise (mechanical engineers for assembly sounds, chefs for cooking sounds), and existing annotation platforms lack audio-specific tooling.
World models represent the next architectural leap. Ha and Schmidhuber's 2018 work showed that learning a predictive model of environment dynamics enables sample-efficient reinforcement learning, and recent results from NVIDIA Cosmos demonstrate that video diffusion models can generate realistic robot interaction sequences. A world-model-augmented VLA could simulate task outcomes before execution, enabling zero-shot planning for novel tasks. Training such a model requires 100,000+ hours of robot video — 10× more than current VLA datasets — but the data collection cost is falling rapidly as truelabel's network scales to 500+ active collectors.
How Truelabel Supports PaLM-E-Style VLA Development
Truelabel's physical AI data marketplace provides the three data components needed to train PaLM-E-derived models: large-scale robot trajectories, plan-level language annotations, and cross-embodiment diversity. Our collector network spans 200+ robotics labs across 18 countries, operating Franka Panda, UR5, Kinova Gen3, and mobile manipulator platforms. To date, we have delivered 47,000 manipulation episodes with natural language annotations, 12,000 mobile manipulation episodes, and 8,000 kitchen task episodes — totaling 67,000 episodes across 40 task categories[3].
Our annotation protocol for VLA data involves four stages. First, teleoperation capture: collectors execute tasks while recording RGB-D streams (30 Hz), proprioception (100 Hz), and optional tactile data. Second, subgoal segmentation: annotators mark temporal boundaries where the robot transitions between primitives (grasp → transport → place). Third, plan generation: annotators write imperative sentences for each subgoal, following a constrained grammar that downstream controllers can parse. Fourth, executability validation: we replay 10% of episodes using an RT-1 baseline policy to verify that plans are unambiguous and achievable.
Data delivery follows RLDS format by default, with optional export to LeRobot HDF5, Parquet, or MCAP. Each episode includes: RGB images (512×512 uint8), depth maps (512×512 float32), joint states (7-DOF float32), gripper status (binary), language instruction (string), plan steps (list of strings), and provenance metadata (collector ID, timestamp, annotation confidence). We provide train/val/test splits stratified by task category and embodiment, ensuring that evaluation sets contain held-out tasks and robots.
Active requests target high-value task categories: kitchen manipulation ($40/episode, 5,000 episodes), warehouse pick-and-place ($25/episode, 10,000 episodes), outdoor mobile manipulation ($60/episode, 2,000 episodes), and dexterous in-hand manipulation ($50/episode, 3,000 episodes). Buyers can post custom requests for proprietary tasks, with exclusivity periods of 6-24 months. Our pricing averages 40% below Scale AI's physical AI service due to our decentralized collector model, and our delivery SLA is 4-8 weeks for 1,000-episode orders.
Training Recipes and Hyperparameter Guidance
Training a PaLM-E-scale VLA from scratch requires three phases: language model pretraining, vision-language alignment, and embodied fine-tuning. Most teams skip phase one by starting from an open-weight base model like Llama-3-70B or Qwen-2.5-72B, reducing training cost by 90%. Phase two aligns the vision encoder with the language model using 1-10 million image-text pairs, teaching the model to ground object names and spatial relations. Phase three fine-tunes on robot trajectories, typically 10,000-100,000 episodes depending on task diversity.
For phase two, the standard recipe uses a frozen language model and a trainable vision encoder (ViT-L or ViT-H), optimizing a contrastive loss between image embeddings and text embeddings. RT-2's training procedure used 10 million image-text pairs from LAION and Conceptual Captions, training for 100,000 steps on 64 TPU-v4 chips (approximately $80,000 in compute). The resulting vision encoder maps 224×224 images to 256-token sequences that the language model can process. Learning rate: 1e-4 with cosine decay; batch size: 2,048; gradient clipping: 1.0.
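A compact sketch of the CLIP-style contrastive objective used in this phase; dimensions and temperature are illustrative, not PaLM-E's actual values:

```python
import torch
import torch.nn.functional as F

# Symmetric contrastive loss: matched image/text pairs attract along the
# diagonal of the similarity matrix, mismatched pairs repel.
batch, dim = 8, 512
img = F.normalize(torch.randn(batch, dim), dim=-1)   # vision-encoder outputs
txt = F.normalize(torch.randn(batch, dim), dim=-1)   # text-encoder outputs
temperature = 0.07

logits = img @ txt.t() / temperature                 # pairwise similarities
targets = torch.arange(batch)                        # i-th image <-> i-th text
loss = (F.cross_entropy(logits, targets) +           # image -> text direction
        F.cross_entropy(logits.t(), targets)) / 2    # text -> image direction
print(float(loss))
```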
Phase three fine-tunes the full model (vision encoder + language model) on robot data. Key hyperparameters: learning rate 1e-5 to 5e-5 (lower than phase two to preserve language priors), batch size 128-512 episodes, training duration 50,000-200,000 steps. Data augmentation is critical: random crops (0.8-1.0 scale), color jitter (brightness ±0.2, contrast ±0.2), and temporal subsampling (drop 20-40% of frames) prevent overfitting to specific camera viewpoints and lighting. LeRobot's training scripts implement these augmentations by default.
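A sketch of these augmentations using torchvision plus a custom temporal subsampler; this mirrors the recipe above rather than any released PaLM-E training code:

```python
import random
import torch
from torchvision import transforms

# Per-frame spatial/photometric augmentations from the recipe above.
frame_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

def temporal_subsample(frames, drop_min=0.2, drop_max=0.4, rng=random):
    """Randomly drop 20-40% of frames to vary effective observation rate."""
    keep_prob = 1.0 - rng.uniform(drop_min, drop_max)
    kept = [f for f in frames if rng.random() < keep_prob]
    return kept or frames[:1]   # never return an empty episode

episode = [torch.rand(3, 256, 256) for _ in range(20)]  # toy frame stack
augmented = [frame_aug(f) for f in temporal_subsample(episode)]
print(len(augmented), augmented[0].shape)
```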
Loss function combines next-token prediction (standard language modeling loss) with action prediction (cross-entropy over the plan-step vocabulary). The original PaLM-E paper weighted these equally, but subsequent work found that upweighting action loss by 2-5× improves task success rates by 10-15 percentage points. Regularization via dropout (0.1) and weight decay (0.01) prevents the model from memorizing training episodes.
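A sketch of the weighted objective, upweighting action tokens by 3x (within the 2-5x range above); the token layout is illustrative:

```python
import torch
import torch.nn.functional as F

# Per-token language-modeling loss with extra weight on plan-step tokens.
vocab, seq_len = 1000, 16
logits = torch.randn(1, seq_len, vocab, requires_grad=True)
targets = torch.randint(0, vocab, (1, seq_len))
is_action = torch.zeros(1, seq_len, dtype=torch.bool)
is_action[:, 8:] = True          # suppose the plan step occupies the tail

per_token = F.cross_entropy(
    logits.view(-1, vocab), targets.view(-1), reduction="none"
).view(1, seq_len)
weights = is_action.float() * 2.0 + 1.0   # 3.0 on action tokens, 1.0 elsewhere
loss = (per_token * weights).sum() / weights.sum()
loss.backward()
print(float(loss))
```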
For teams with limited compute, low-rank adaptation (LoRA) reduces training cost by 80%. LoRA freezes the base model and trains small adapter matrices (rank 8-64) inserted into each transformer layer. OpenVLA used LoRA to fine-tune a 7B model on 970,000 episodes using 8 A100 GPUs for 5 days, costing approximately $15,000. LoRA adapters are 100-500 MB (vs. 50-200 GB for full model weights), enabling rapid iteration and multi-task specialization.
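A hand-rolled LoRA linear layer showing the mechanics; production code would typically use a library such as PEFT, so this is a sketch, not OpenVLA's implementation:

```python
import torch
import torch.nn as nn

# LoRA: the frozen base weight W is augmented with a trainable low-rank
# update B @ A, scaled by alpha / r. B starts at zero so the initial
# forward pass matches the pretrained model exactly.
class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable {trainable:,} of {total:,}")   # well under 1% of the layer
```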
Evaluation Benchmarks and Success Metrics
Evaluating VLAs requires real-world robot trials, not offline metrics. The standard protocol: hold out 20% of tasks and embodiments during training, then measure success rate on 50-100 test episodes per held-out task. Success is binary (task completed within time limit) or graded (partial credit for subgoal completion). PaLM-E reported 84.3% success on mobile manipulation, 89.2% on tabletop rearrangement, and 76.5% on long-horizon kitchen tasks[1].
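A sketch of the trial-level success metric with a normal-approximation 95% interval, so that small trial counts are not over-interpreted (trial values are illustrative):

```python
import math

# Binary success over N real-robot trials on one held-out task, with a
# normal-approximation 95% confidence interval.
def success_rate(outcomes):
    n = len(outcomes)
    p = sum(outcomes) / n
    half = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, (max(0.0, p - half), min(1.0, p + half))

trials = [1] * 42 + [0] * 8          # 42/50 successes on the held-out task
p, (lo, hi) = success_rate(trials)
print(f"success {p:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```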
Three benchmark suites dominate VLA evaluation. Open X-Embodiment[4] provides 22 robot platforms and 160 task categories, with 1 million training episodes and 10,000 held-out test episodes. Models are evaluated on cross-embodiment transfer: train on robots A-D, test on robot E. OpenVLA achieved 82% success on this benchmark, the current state-of-the-art for open models. CALVIN[5] focuses on long-horizon tasks in a simulated kitchen, requiring models to chain 5-10 primitives to complete goals like prepare a meal. Success rate on CALVIN correlates strongly with real-world performance but overestimates absolute success by 15-20 percentage points due to sim-to-real gap.
RLBench[6] provides 100 simulated manipulation tasks with dense reward signals, enabling rapid iteration during development. However, RLBench tasks are shorter-horizon (1-3 primitives) and less diverse than real-world scenarios, making it a poor proxy for generalist VLA performance. For buyers, the key metric is cross-task generalization: success rate on tasks not seen during training. A model with 90% success on training tasks but 40% on held-out tasks has overfit and will fail in deployment.
Latency and compute cost are secondary metrics but critical for production. PaLM-E's 800ms inference time limits it to tasks with 1-2 Hz planning frequency (mobile manipulation, multi-step assembly). Tasks requiring reactive control (catching, contact-rich insertion) need sub-100ms latency, achievable only with distilled models. OpenVLA's 120ms latency on a single GPU makes it deployable for 10 Hz control, the minimum for stable manipulation.
Data efficiency — success rate vs. training episodes — matters for custom task deployment. RT-X models require 1,000-5,000 episodes per task to reach 80% success, while PaLM-E's language priors enable 70% success with 100-500 episodes due to zero-shot reasoning. For buyers with limited data budgets, this 5-10× efficiency gain justifies the higher inference cost of large language-augmented models.
External references and source context
1. PaLM-E: An Embodied Multimodal Language Model (arXiv). PaLM-E architecture, parameter count, training data volumes, and benchmark results.
2. RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation (arXiv). RoboCat cross-embodiment learning and comparison with PaLM-E.
3. truelabel physical AI data marketplace bounty intake (truelabel.ai). Truelabel episode counts, pricing, delivery formats, and active requests.
4. Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). Open X-Embodiment dataset scale, cross-embodiment benchmarks, and data requirements.
5. CALVIN: A Benchmark for Language-Conditioned Policy Learning for Long-Horizon Robot Manipulation Tasks (arXiv). CALVIN long-horizon benchmark.
6. RLBench: The Robot Learning Benchmark & Learning Environment (arXiv). RLBench simulation benchmark.
FAQ
What is the difference between PaLM-E and RT-2 for robot control?
PaLM-E outputs high-level natural language action plans (e.g., *pick the mug, move to sink, place mug*) that are executed by separate low-level controllers like RT-1 or SayCan primitives. RT-2 predicts motor actions directly (joint velocities or end-effector poses) in an end-to-end fashion. PaLM-E excels at long-horizon tasks requiring multi-step reasoning and generalizes across robot morphologies because the language interface abstracts hardware details. RT-2 achieves lower latency (200ms vs. 800ms) and higher success on short-horizon pick-and-place tasks but requires task-specific training data for each robot platform. For deployment, PaLM-E suits applications with high task diversity and limited per-task data, while RT-2 suits high-volume repetitive tasks where end-to-end training is feasible.
How much robot data is needed to train a PaLM-E-style model?
The original PaLM-E was trained on approximately 10,000 hours of robot interaction data, equivalent to 50,000-100,000 episodes depending on task length. However, the model's language backbone (PaLM 540B) was pretrained on 780 billion tokens of text, providing world knowledge that reduces embodied data requirements. For fine-tuning an existing large language model into a VLA, 10,000-30,000 robot episodes suffice to reach 70-80% success on held-out tasks, assuming the episodes span 20-40 task categories and 3-5 robot embodiments. Cross-embodiment generalization requires 1,000-5,000 episodes per robot type. Truelabel's marketplace has delivered 67,000 episodes to date, and active requests target 20,000 additional episodes across kitchen, warehouse, and outdoor domains.
Can PaLM-E be deployed on edge devices or does it require cloud inference?
PaLM-E's 562 billion parameters require 8-16 A100 or H100 GPUs for inference, making edge deployment infeasible. Inference latency is approximately 800ms per plan step on datacenter hardware, acceptable for high-level planning (1-2 Hz) but too slow for reactive control (10-30 Hz). Production deployments use a two-tier architecture: PaLM-E runs on a remote server generating plans, and a lightweight controller (RT-1, SayCan, or a distilled 7B VLA) executes plans on-robot at 10 Hz. For fully on-device deployment, teams distill PaLM-E into 7B-70B parameter models like OpenVLA, which achieve 120ms latency on a single RTX 4090 GPU while retaining 80-85% of the teacher model's performance. Distillation requires 50,000-100,000 episodes of teacher outputs and costs $20,000-50,000 in H100 compute.
What data formats does PaLM-E use and how do I prepare my robot data?
PaLM-E training data follows the RLDS (Reinforcement Learning Datasets) format: each episode is a dictionary with keys for observations (RGB images, depth maps, proprioceptive state), actions (natural language plan steps), language instructions (task descriptions), and metadata (robot ID, success label). Images are stored as uint8 arrays at 224×224 or 512×512 resolution; state vectors as float32. Plan steps are tokenized strings describing subgoals (e.g., *pick red block*, *move to bin*, *place block*). Truelabel delivers data in RLDS format by default, with optional export to LeRobot HDF5, Parquet, or MCAP. To prepare your own data, record RGB-D streams at 10-30 Hz, proprioception at 100 Hz, segment episodes into subgoal boundaries, and annotate each segment with an imperative sentence. Validation requires replaying episodes with a baseline policy to ensure plans are executable.
How does PaLM-E handle tasks it has never seen during training?
PaLM-E leverages its pretrained language model (PaLM 540B) to perform zero-shot reasoning about novel tasks by decomposing instructions into known primitives. For example, if trained on *pick block* and *place in bin* but never *stack blocks*, it can infer that stacking requires repeated pick-place cycles by reasoning over the language instruction. This compositional generalization works when novel tasks are combinations of known skills, but fails when tasks require entirely new motor skills (e.g., pouring liquids if only trained on rigid-object manipulation). The original paper reported 61% success on zero-shot tasks vs. 84% on trained tasks, a 23-point gap. For production deployment, teams typically fine-tune on 100-500 episodes of each novel task to close this gap, leveraging PaLM-E's language priors to achieve 5-10× better data efficiency than training from scratch.
What are the licensing and cost considerations for using PaLM-E in commercial products?
Google has not released PaLM-E model weights, training code, or the underlying robot datasets, limiting commercial use to API access (not publicly available as of 2025) or clean-room reimplementation. Teams building commercial VLAs must train their own models using open-weight base models (Llama-3-70B, Qwen-2.5-72B) and licensed robot data. Truelabel's standard data license grants perpetual, worldwide rights to train and deploy models on purchased episodes, with optional 6-24 month exclusivity for competitive applications. Training a 70B VLA from an open base model costs $50,000-150,000 in compute (10-30 days on 64 H100 GPUs) plus $120,000-600,000 for 10,000-50,000 episodes of robot data at $12-40 per episode. Inference infrastructure costs $15,000-30,000 per month for cloud deployment or $25,000-50,000 upfront for on-premise GPU clusters. Regulatory compliance (EU AI Act, NIST AI RMF) requires provenance documentation for training data, which truelabel provides as part of standard delivery.
Looking for PaLM-E?
Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.
Source Embodied Multimodal Data