
Physical AI Data Collection

How to Collect Kitchen Activity Data for Robotics AI Training

Kitchen activity data collection requires synchronized multi-view RGB-D cameras (fixed overhead plus wrist-mounted egocentric), temporal annotation of verb-noun action pairs at 1-2 second granularity, and export to RLDS or HDF5 formats compatible with imitation learning pipelines. A production dataset needs 50-200 hours of annotated video across 15-30 recipes, captured in 3-6 distinct kitchen environments to ensure cross-domain generalization for manipulation policies.

Updated 2026-01-10
By truelabel
Reviewed by truelabel
kitchen activity data collection

Quick facts

Difficulty: Intermediate
Audience: Physical AI data engineers
Last reviewed: 2026-01-10

Why Kitchen Environments Are Critical for Embodied AI

Kitchen tasks represent the highest-complexity manipulation domain in domestic robotics. A single meal preparation sequence combines dexterous grasping (eggs, knives, fragile produce), contact-rich interactions (stirring, spreading, pouring), deformable object handling (dough, vegetables), and long-horizon planning across 20-80 discrete actions[1]. Google's RT-1 Robotics Transformer was trained on 130,000 demonstrations yet achieved only 62% success on novel kitchen tasks, exposing the generalization gap between curated lab data and real-world variability[2].

The EPIC-KITCHENS-100 dataset established the benchmark taxonomy: 97 verbs and 300 nouns covering 90,000 action segments across 700 hours of egocentric video[3]. This verb-noun factorization enables compositional generalization — a policy trained on 'crack egg' and 'pour milk' can interpolate to 'crack egg into bowl' without explicit demonstration. Open X-Embodiment aggregated data from 22 robot embodiments, but kitchen tasks remain underrepresented at 8% of total episodes, creating procurement urgency for teams building home-assistant robots[4].

DROID's 76,000 manipulation trajectories included kitchen scenes but lacked the temporal density required for activity recognition — most episodes captured single pick-place actions rather than multi-step recipes. Buyers need datasets that bridge the gap between coarse teleoperation logs and fine-grained activity annotations, enabling both policy learning and progress monitoring during execution.

Sensor Configuration and Kitchen Instrumentation

Production kitchen datasets require synchronized capture from three viewpoints: overhead scene camera (captures spatial layout and object trajectories), egocentric head-mounted camera (captures hand-object interactions and visual attention), and optional wrist-mounted camera (captures grasp pre-contact geometry). The EPIC-KITCHENS protocol used GoPro Hero cameras at 1920×1080 60fps with wide-angle lenses, but 2026 best practice favors 4K 30fps to balance file size against spatial resolution for small-object detection.

Depth sensing improves annotation efficiency by roughly 40% by enabling automatic hand segmentation and 6-DOF pose estimation for rigid objects. Intel RealSense D455 cameras provide aligned RGB-D streams at 1280×720 30fps with 6-meter range, sufficient for countertop work zones. Mount overhead cameras 2.2-2.8 meters above the primary work surface at a 35-45 degree tilt to minimize occlusion from the participant's torso. Egocentric cameras should use a 120-140 degree horizontal FOV to capture peripheral hand motion during reaching tasks.

Synchronization tolerance must stay below 33ms (one frame at 30fps) to prevent temporal misalignment during fast motions like whisking or chopping. Hardware-triggered capture via GPIO is ideal but impractical for mobile egocentric rigs. ROS bag recording with NTP-synchronized clocks achieves 10-20ms jitter on local networks. For multi-kitchen deployments, embed hardware timecode generators (Tentacle Sync or Atomos UltraSync) in each camera rig and post-align via audio waveform correlation.
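
A minimal post-alignment sketch, assuming each camera's audio has already been extracted to a mono WAV track at a common sample rate (for example via ffmpeg); file names are illustrative. The peak of the cross-correlation gives the offset to apply to one stream's frame timestamps:

```python
# Minimal sketch: estimate the offset between two camera audio tracks by
# cross-correlation, then shift that stream's frame timestamps by the result.
# Assumes mono WAV tracks extracted at a common sample rate; names are illustrative.
import numpy as np
from scipy.io import wavfile
from scipy.signal import correlate

def estimate_offset_seconds(ref_wav: str, other_wav: str) -> float:
    sr_ref, ref = wavfile.read(ref_wav)
    sr_other, other = wavfile.read(other_wav)
    assert sr_ref == sr_other, "resample both tracks to a common rate first"
    ref = ref.astype(np.float32) - ref.mean()
    other = other.astype(np.float32) - other.mean()
    corr = correlate(other, ref, mode="full")
    lag = corr.argmax() - (len(ref) - 1)  # samples by which 'other' trails 'ref'
    return lag / sr_ref

offset = estimate_offset_seconds("overhead.wav", "egocentric.wav")
print(f"egocentric stream offset: {offset * 1000:.1f} ms")
```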

MCAP container format supports heterogeneous sensor streams (RGB, depth, IMU, audio) with microsecond timestamps and zero-copy playback, making it the preferred alternative to ROS bag format for datasets exceeding 500GB. Store raw camera feeds at native resolution and framerate — downstream users will apply their own compression and cropping for specific model architectures.
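
A minimal sketch using the mcap Python package, assuming JSON-encoded per-frame messages and an illustrative topic name; a production rig would register one channel per sensor stream and log compressed image payloads rather than frame references:

```python
# Minimal sketch: log a per-frame reference stream for one camera into an MCAP
# file with microsecond-or-better timestamps. Topic, schema, and payload shape
# are illustrative assumptions.
import json
import time

from mcap.writer import Writer

def write_frame_index(path: str, frame_paths, fps: float = 30.0):
    with open(path, "wb") as f:
        writer = Writer(f)
        writer.start()
        schema_id = writer.register_schema(
            name="kitchen.FrameRef",
            encoding="jsonschema",
            data=json.dumps({"type": "object"}).encode(),
        )
        channel_id = writer.register_channel(
            topic="/camera/overhead/rgb",
            message_encoding="json",
            schema_id=schema_id,
        )
        t0 = time.time_ns()
        for i, frame_path in enumerate(frame_paths):
            ts = t0 + int(i * 1e9 / fps)  # nanosecond timestamps, one per frame
            writer.add_message(
                channel_id=channel_id,
                log_time=ts,
                publish_time=ts,
                data=json.dumps({"frame": i, "path": frame_path}).encode(),
            )
        writer.finish()
```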

Activity Taxonomy Design and Annotation Protocol

Kitchen activity annotation operates at three granularities: coarse (meal-level, 5-15 minute segments), medium (recipe-level, 2-8 minute segments), and fine (action-level, 1-3 second segments). Embodied AI applications require fine-grained verb-noun pairs to train manipulation policies that execute atomic actions. Adopt the EPIC-KITCHENS taxonomy as your baseline: 97 verbs (take, put, open, close, wash, cut, mix, pour, etc.) and 300 nouns (knife, bowl, egg, tap, pan, etc.) cover 94% of observed kitchen actions across diverse cuisines[5].

Annotation begins with temporal boundary marking: identify start and end frames for each action segment using frame-by-frame video review in CVAT or ELAN. Inter-annotator agreement on temporal boundaries averages 0.73 IoU for 2-second actions, degrading to 0.58 for sub-second actions like 'crack egg' where motion is ballistic. Require two independent annotators per video and resolve disagreements via majority vote or expert adjudication.
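
A small helper, in plain Python, for computing temporal IoU between two annotators' boundaries for the same segment; pairing of segments across annotators is assumed to be done upstream:

```python
# Minimal helpers: temporal IoU between two annotators' boundaries for the same
# segment, and the mean over matched pairs. Segments are (start_frame, end_frame).
def temporal_iou(a, b):
    intersection = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return intersection / union if union > 0 else 0.0

def mean_agreement(segments_a, segments_b):
    ious = [temporal_iou(a, b) for a, b in zip(segments_a, segments_b)]
    return sum(ious) / len(ious) if ious else 0.0

# A 2-second action at 30fps with a 10-frame boundary disagreement:
print(temporal_iou((120, 180), (130, 190)))  # ~0.71
```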

Verb-noun labeling follows temporal segmentation. Provide annotators with a decision tree: identify the hand motion pattern first (grasp, release, rotate, translate, apply-force), then identify the primary object of interaction, then select the most specific verb that combines both. 'Take knife' is preferred over generic 'grasp object'. Ambiguous cases like 'move bowl' vs 'slide bowl' should default to the verb that best predicts the required end-effector trajectory — 'slide' implies contact maintenance while 'move' allows free-space transport.

Datasheets for Datasets recommends documenting annotation guidelines, inter-rater reliability scores, and edge-case resolution rules in a structured metadata file. Include example video clips for each verb-noun pair to reduce annotator drift over multi-week labeling campaigns. Budget 15-25 hours of annotation labor per hour of source video for fine-grained temporal segmentation with verb-noun labels.
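
One possible shape for that metadata file, sketched as JSON; the field names and example values are illustrative rather than a fixed schema:

```python
# Sketch: a structured annotation-guidelines record serialized next to the
# dataset. Field names and example values are illustrative, not a fixed schema.
import json

guidelines = {
    "taxonomy": {"verbs": 97, "nouns": 300, "source": "EPIC-KITCHENS-100"},
    "temporal_granularity_seconds": [1, 3],
    "inter_rater_reliability": {"boundary_iou": 0.73, "verb_noun_kappa": 0.82},
    "edge_case_rules": [
        "move vs slide: use 'slide' when surface contact is maintained",
        "crack egg: segment starts at first hand contact with the egg",
    ],
    "example_clips": {"take knife": "examples/take_knife.mp4"},
}

with open("annotation_guidelines.json", "w") as f:
    json.dump(guidelines, f, indent=2)
```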

Participant Recruitment and Naturalistic Capture

Naturalistic kitchen behavior requires participants who cook regularly without following explicit instructions. Scripted recipe execution produces unnaturally slow, deliberate movements that fail to capture the variability and error-recovery patterns robots encounter in deployment. Recruit 8-15 participants per dataset, targeting diversity in cooking skill (novice to expert), hand dominance (10% left-handed minimum), and recipe familiarity (each participant should cook 40% familiar and 60% novel recipes).

Session structure: 5-minute equipment familiarization, 2-3 recipe executions per session (15-45 minutes each), 5-minute post-recording interview to capture verbal descriptions of challenging steps. Provide ingredient lists and basic equipment but allow participants to choose their own techniques — one person may dice an onion with a chef's knife while another uses a food processor, and both trajectories are valuable training data. Avoid intervening during recording unless safety concerns arise.

GDPR requires explicit, informed consent for video recording of identifiable individuals (Article 7 sets the conditions for valid consent), with separate consent for public dataset release vs internal research use. Institutional review boards typically classify kitchen recording as minimal-risk research but require consent forms that specify data retention duration, de-identification procedures, and participant rights to withdraw. Face blurring is insufficient for de-identification when hands, voice, and kitchen layout remain visible — consider these datasets inherently identifiable.

Capture sessions across 3-6 distinct kitchens to ensure layout generalization. Kitchen geometry varies along four axes: counter height (850-950mm), work triangle dimensions (distance between sink, stove, refrigerator), storage accessibility (overhead cabinets vs open shelving), and lighting conditions (natural window light vs artificial overhead). Domain randomization in simulation cannot fully compensate for real-world kitchen diversity, making multi-environment capture essential for policies that deploy beyond the training kitchen[6].

Video Processing and Feature Extraction Pipeline

Raw kitchen video requires preprocessing before annotation or model training. Start with temporal alignment: synchronize all camera streams to a common timeline using hardware timecode or audio cross-correlation. Extract frames at the native capture rate (30fps) and store as lossless PNG or JPEG quality 95+ to preserve fine details for hand pose estimation and object detection.

Hand segmentation is the highest-value preprocessing step for manipulation datasets. EPIC-KITCHENS provides pre-computed hand bounding boxes and segmentation masks for all frames, reducing annotation burden by 60%[7]. Fine-tune a Mask R-CNN or Segment Anything model on 500-1,000 manually annotated frames, then apply it to the full dataset with human-in-the-loop correction for false positives. Export masks as run-length encoded JSON or binary PNG overlays.

Object detection and tracking enable automatic noun candidate generation. Fine-tune a YOLOv8 or Grounding DINO model on kitchen-specific object classes (utensils, ingredients, appliances, containers). Track detected objects across frames using ByteTrack or DeepSORT to maintain consistent IDs through occlusions. When a hand interacts with a tracked object, the object ID becomes a strong prior for noun annotation — 'take' actions almost always involve the nearest object to the hand centroid.
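
A sketch using the ultralytics package with ByteTrack, assuming a hypothetical checkpoint fine-tuned on kitchen object classes; it emits per-frame (frame, track_id, class, box) records that later serve as noun-annotation priors:

```python
# Sketch: detection plus ByteTrack tracking over a session video, emitting
# per-frame object records. Assumes the `ultralytics` package; the checkpoint
# name is a hypothetical model fine-tuned on kitchen object classes.
from ultralytics import YOLO

model = YOLO("kitchen_yolov8m.pt")  # hypothetical fine-tuned weights
results = model.track("session_001_overhead.mp4",
                      tracker="bytetrack.yaml", persist=True, stream=True)

records = []
for frame_idx, r in enumerate(results):
    if r.boxes is None or r.boxes.id is None:
        continue  # no tracked objects in this frame
    for track_id, cls_id, box in zip(r.boxes.id.tolist(),
                                     r.boxes.cls.tolist(),
                                     r.boxes.xywh.tolist()):
        cx, cy, w, h = box  # box center and size in pixels
        records.append((frame_idx, int(track_id), int(cls_id), cx, cy, w, h))
```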

HDF5 hierarchical storage is the standard container for processed kitchen datasets. Structure: `/session_001/camera_overhead/rgb` (NxHxWx3 uint8 array), `/session_001/camera_overhead/depth` (NxHxW float32 array), `/session_001/annotations/actions` (Nx3 array of [start_frame, end_frame, action_id]), `/session_001/annotations/objects` (Nx6 array of [frame, object_id, x, y, w, h]). This schema enables random access to arbitrary frame ranges without loading the entire video into memory, critical for training on datasets exceeding 1TB.
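
A minimal h5py sketch that writes one session in this layout; shapes, chunking, and the placeholder annotation rows are illustrative:

```python
# Minimal h5py sketch of the schema above for one session.
import h5py
import numpy as np

N, H, W = 300, 720, 1280  # 10 seconds at 30fps, 1280x720 capture

with h5py.File("session_001.h5", "w") as f:
    cam = f.create_group("session_001/camera_overhead")
    cam.create_dataset("rgb", shape=(N, H, W, 3), dtype="uint8",
                       chunks=(1, H, W, 3), compression="gzip")
    cam.create_dataset("depth", shape=(N, H, W), dtype="float32",
                       chunks=(1, H, W), compression="gzip")
    ann = f.create_group("session_001/annotations")
    # rows of [start_frame, end_frame, action_id]
    ann.create_dataset("actions",
                       data=np.array([[0, 60, 17], [61, 150, 42]], dtype="int32"))
    # rows of [frame, object_id, x, y, w, h]
    ann.create_dataset("objects", data=np.zeros((0, 6), dtype="float32"))

# Chunked layout allows reading an arbitrary frame range without loading the file:
with h5py.File("session_001.h5", "r") as f:
    clip = f["session_001/camera_overhead/rgb"][61:151]  # one action segment
```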

RLDS Formatting for Imitation Learning Pipelines

Reinforcement Learning Datasets (RLDS) is the emerging standard for robot learning data interchange, adopted by Open X-Embodiment, DROID, and LeRobot. RLDS structures episodes as sequences of steps, where each step contains observations (images, proprioception), actions (joint positions, gripper state), and metadata (timestamps, success labels). Kitchen activity datasets map naturally to this schema: each recipe execution is an episode, each annotated action segment is a step.

Conversion workflow: parse HDF5 annotations into episode boundaries (start of first action to end of last action per recipe), extract observation frames at action boundaries plus intermediate frames at 3-5 Hz, synthesize action labels from verb-noun pairs (map 'take knife' to a discrete action ID or continuous end-effector delta). RLDS builder scripts automate TFRecord generation with automatic sharding for datasets exceeding 100GB.
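
A sketch of the first step, assembling RLDS-style episode and step dictionaries from the HDF5 annotations before handing them to an RLDS/TFDS builder; the discrete action encoding and one-frame-per-segment sampling are simplifying assumptions:

```python
# Sketch: build RLDS-style episode/step dictionaries from the HDF5 annotations.
import h5py

def hdf5_to_episode(path: str, session: str = "session_001"):
    with h5py.File(path, "r") as f:
        actions = f[f"{session}/annotations/actions"][:]  # [start, end, action_id]
        rgb = f[f"{session}/camera_overhead/rgb"]
        steps = []
        for i, (start, end, action_id) in enumerate(actions):
            steps.append({
                "observation": {"image": rgb[start]},  # frame at segment start
                "action": int(action_id),              # discrete verb-noun id
                "is_first": i == 0,
                "is_last": i == len(actions) - 1,
                "is_terminal": i == len(actions) - 1,
            })
    return {"steps": steps, "episode_metadata": {"session_id": session}}
```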

Action space representation is the critical design choice. Discrete action spaces (one-hot encoded verb-noun pairs) work for activity recognition but lack the geometric precision needed for manipulation. Continuous action spaces require 6-DOF end-effector poses or 7-DOF joint angles, which kitchen video datasets rarely provide. Hybrid approaches: use verb-noun labels as high-level action primitives, then train low-level controllers to execute each primitive via sim-to-real transfer or teleoperation fine-tuning.

LeRobot dataset format extends RLDS with episode-level metadata (success rate, environment ID, data collector ID) and step-level metadata (action duration, contact forces, failure modes). This metadata enables curriculum learning (train on successful episodes first) and failure analysis (identify systematic errors in specific verb-noun combinations). Export kitchen datasets in both RLDS and LeRobot formats to maximize compatibility with existing training pipelines.

Cross-Kitchen Generalization and Domain Gaps

Policies trained on kitchen data from a single environment achieve 75-85% success on in-distribution test recipes but degrade to 35-50% when deployed in novel kitchens with different layouts, lighting, or object sets[8]. Open X-Embodiment demonstrated that aggregating data from 22 robot embodiments improved generalization by 40% compared to single-robot training, but kitchen-specific multi-environment datasets remain scarce[9].

Layout invariance requires capturing the same recipe in kitchens with different spatial configurations. A 'make sandwich' episode in a galley kitchen (linear layout, 2-meter counter) produces different reaching trajectories than the same recipe in an L-shaped kitchen (corner sink, 4-meter counter). Collect 20-30% of recipes in at least three distinct layouts to learn layout-agnostic action representations.

Object variation is equally critical. The same 'cut tomato' action uses different knife types (chef's knife, serrated knife, paring knife), cutting surfaces (wood board, plastic board, ceramic plate), and tomato varieties (beefsteak, roma, cherry). DROID addressed object diversity by collecting across 564 scenes with 18,000 unique objects, but kitchen-specific object catalogs remain limited to 300-500 items in existing datasets[10].

Lighting robustness: capture sessions at different times of day (morning natural light, afternoon direct sun, evening artificial light) and weather conditions (overcast, sunny, rainy). Train models with photometric augmentation (brightness, contrast, hue jitter) but recognize that synthetic augmentation cannot fully replicate the specular reflections and shadows present in real kitchens. Budget 15-20% of capture sessions for lighting variation.

Dataset Packaging, Licensing, and Marketplace Distribution

Kitchen activity datasets require structured metadata for procurement and compliance. Datasheets for Datasets provides a 50-question template covering data composition (number of participants, recipes, environments), collection process (sensor specs, annotation protocol), preprocessing steps, and intended use cases. Data Cards extend this with model-specific guidance: which architectures were tested, what accuracy was achieved, which failure modes were observed.

Licensing determines commercial viability. Creative Commons BY 4.0 permits commercial use with attribution, suitable for datasets intended for broad adoption. CC BY-NC 4.0 restricts commercial use, appropriate for academic research datasets. Custom licenses can specify model training rights, deployment restrictions, and revenue-sharing terms — truelabel's physical AI marketplace supports structured licensing with per-episode pricing and usage telemetry.

File organization: package datasets as sharded archives (10-50GB per shard) with manifest files listing episode IDs, frame counts, and checksums. Provide sample episodes (5-10% of total data) as free downloads to enable buyer evaluation before purchase. Include quickstart notebooks demonstrating data loading, visualization, and baseline model training in LeRobot or PyTorch.

Data provenance tracking is mandatory for EU AI Act compliance and increasingly required by US government procurement. Embed provenance metadata in HDF5 attributes: capture timestamps, sensor serial numbers, annotator IDs, software versions, and chain-of-custody records. C2PA content credentials provide cryptographic provenance for individual video files, enabling buyers to verify data authenticity and detect synthetic augmentation.
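
A minimal sketch of embedding provenance as HDF5 attributes on the session group created earlier; attribute names and values are placeholders to be aligned with your compliance schema:

```python
# Sketch: embed provenance metadata as attributes on the session group
# created in the HDF5 example above. Names and values are placeholders.
from datetime import datetime, timezone

import h5py

with h5py.File("session_001.h5", "a") as f:
    g = f["session_001"]
    g.attrs["capture_start_utc"] = datetime.now(timezone.utc).isoformat()
    g.attrs["overhead_camera_serial"] = "D455-0000000"  # placeholder serial
    g.attrs["annotator_ids"] = "ann_03,ann_07"          # comma-separated IDs
    g.attrs["pipeline_version"] = "preproc-1.4.2"       # software version tag
```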

Procurement Strategies for Kitchen Dataset Buyers

Building a kitchen activity dataset in-house costs $80,000-$250,000 for 100 hours of annotated multi-view video (equipment $15,000, participant compensation $8,000, annotation labor $50,000, engineering overhead $7,000+). Procurement from specialized vendors or data marketplaces reduces time-to-model from 6-9 months to 2-4 weeks but requires rigorous quality assessment.

Evaluation criteria: temporal annotation density (target 1 action segment per 2-3 seconds of video), inter-annotator agreement (IoU >0.70 for temporal boundaries, Cohen's kappa >0.80 for verb-noun labels), sensor synchronization error (<50ms across all streams), and environment diversity (minimum 3 distinct kitchens). Request sample episodes with ground-truth annotations and run your own evaluation pipeline before committing to full dataset purchase.
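
A sketch of an acceptance check against these thresholds, assuming paired annotator labels and the temporal IoU helper shown earlier; scikit-learn's cohen_kappa_score handles label agreement:

```python
# Sketch: acceptance check against the procurement thresholds above.
# verbs_a / verbs_b are the two annotators' verb labels for the same segments;
# boundary_ious come from the temporal_iou helper shown earlier.
from sklearn.metrics import cohen_kappa_score

def accept_sample(verbs_a, verbs_b, boundary_ious, sync_error_ms):
    kappa = cohen_kappa_score(verbs_a, verbs_b)
    mean_iou = sum(boundary_ious) / len(boundary_ious)
    checks = {
        "label kappa >= 0.80": kappa >= 0.80,
        "boundary IoU >= 0.70": mean_iou >= 0.70,
        "sync error < 50 ms": sync_error_ms < 50,
    }
    return all(checks.values()), checks
```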

EPIC-KITCHENS-100 is freely available for research but its non-commercial license prohibits use in commercial robot training without separate negotiation. DROID uses CC BY 4.0, permitting commercial use, but its kitchen coverage is limited to 8% of total episodes. Claru's kitchen task datasets target commercial buyers with pre-cleared licenses and RLDS formatting, but pricing is opaque and minimum purchases start at $25,000.

Hybrid strategies: license a foundation dataset (50-100 hours) for initial model training, then collect targeted data (10-20 hours) in your deployment environment to fine-tune for layout-specific and object-specific variations. This approach reduces total cost by 40-60% compared to full in-house collection while maintaining deployment performance.

Privacy, Consent, and Ethical Considerations

Kitchen video captures intimate domestic behavior, raising privacy concerns beyond typical robotics datasets. Participants may inadvertently reveal personal information through visible mail, family photos, medication bottles, or overheard conversations. GDPR Article 7 requires that consent be freely given, specific, informed, and unambiguous — blanket consent for 'research purposes' is insufficient when data may be used for commercial model training or public dataset release.

De-identification techniques: face blurring (Mask R-CNN or MediaPipe), voice distortion (pitch shifting, noise injection), and background object removal (segment and inpaint). However, re-identification remains possible through gait analysis, hand biometrics, and kitchen layout fingerprinting. Treat kitchen datasets as inherently identifiable and apply corresponding data protection measures: encrypted storage, access logging, and automatic deletion after retention period.

Compensation equity: pay participants $25-$50 per hour of recording time, comparable to skilled labor rates in your region. Avoid exploitative practices like unpaid 'volunteer' data collection or compensation-only-upon-publication models. Provide participants with copies of their own data and the option to withdraw consent up to 30 days post-recording.

Bias and representation: kitchen datasets skewed toward Western cuisines and able-bodied participants produce models that fail on diverse cooking styles and accessibility adaptations. Recruit participants across age ranges (18-75), physical abilities (include wheelchair users, limited hand mobility), and cultural backgrounds (target 30%+ non-Western cuisines). Document demographic distributions in dataset cards and acknowledge representation gaps in limitations sections.

Quality Validation and Dataset Acceptance Testing

Kitchen dataset quality degrades silently through sensor drift, annotation errors, and file corruption. Implement automated validation pipelines that run on every dataset delivery: frame integrity checks (verify all frames decode without errors, no duplicate timestamps), annotation consistency checks (verify all action segments have valid start<end timestamps, all verb-noun pairs exist in taxonomy), and sensor alignment checks (verify RGB-depth correspondence via edge alignment metrics).
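
A minimal consistency-check sketch in plain Python; the annotation record structure is illustrative:

```python
# Sketch: per-delivery annotation consistency checks. Each record is an
# illustrative dict: {"start": int, "end": int, "verb": str, "noun": str}.
def check_annotations(actions, num_frames, taxonomy_verbs, taxonomy_nouns):
    errors = []
    for i, a in enumerate(actions):
        if not (0 <= a["start"] < a["end"] <= num_frames):
            errors.append(f"segment {i}: bad boundaries {a['start']}..{a['end']}")
        if a["verb"] not in taxonomy_verbs:
            errors.append(f"segment {i}: verb '{a['verb']}' not in taxonomy")
        if a["noun"] not in taxonomy_nouns:
            errors.append(f"segment {i}: noun '{a['noun']}' not in taxonomy")
    return errors
```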

Statistical validation: compute action duration distributions and flag outliers (actions shorter than 0.5 seconds or longer than 30 seconds likely indicate annotation errors), verify action transition probabilities match expected patterns ('take knife' should precede 'cut' with >80% probability), and check object co-occurrence matrices (eggs and bowls should co-occur more frequently than eggs and screwdrivers).
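
A sketch of two of these statistical checks, duration outlier flagging and verb transition probability, using the same illustrative annotation records:

```python
# Sketch: duration outlier flagging and verb transition probability
# (30fps assumed; record structure as in the consistency-check sketch).
from collections import Counter

def duration_outliers(actions, fps=30, min_s=0.5, max_s=30.0):
    return [a for a in actions
            if not (min_s <= (a["end"] - a["start"]) / fps <= max_s)]

def transition_prob(actions, prev_verb="take", next_verb="cut"):
    transitions = Counter((a["verb"], b["verb"]) for a, b in zip(actions, actions[1:]))
    from_prev = sum(c for (v, _), c in transitions.items() if v == prev_verb)
    return transitions[(prev_verb, next_verb)] / from_prev if from_prev else 0.0
```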

Human-in-the-loop validation: sample 5-10% of episodes for manual review by domain experts. Flag episodes with excessive occlusion (hand visible <60% of frames), poor lighting (mean pixel intensity <30 or >220), or unnatural behavior (participant looking at camera, excessive pauses). Re-record flagged episodes or exclude from training set.

Baseline model training: train a simple action recognition model (3D CNN or temporal transformer) on the dataset and measure top-1 accuracy on held-out test episodes. Kitchen datasets should achieve >70% top-1 accuracy for verb classification and >60% for noun classification when trained on 80% of data and tested on 20%. Lower accuracy indicates insufficient data volume, excessive label noise, or poor temporal annotation quality.

Emerging Formats and Future-Proofing Strategies

Kitchen dataset formats are converging toward RLDS for episodic structure and MCAP for raw sensor streams, but format fragmentation remains a procurement risk. LeRobot introduced episode-level success labels and failure mode annotations in 2024, which Open X-Embodiment does not support. HDF5 remains dominant for static datasets but lacks streaming support for real-time data collection.

Apache Parquet is gaining adoption for tabular metadata (action labels, object bounding boxes, sensor calibration parameters) due to columnar compression and native support in Hugging Face Datasets. Hybrid architectures: store video frames in MCAP or HDF5, store annotations in Parquet, link via episode IDs. This separation enables independent updates to annotations without re-encoding video.
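
A minimal sketch of the annotation side using pandas (with pyarrow installed for Parquet support); column names are illustrative:

```python
# Sketch: action annotations as a Parquet table keyed by episode_id,
# while video stays in HDF5/MCAP. Requires pyarrow for Parquet I/O.
import pandas as pd

actions = pd.DataFrame({
    "episode_id": ["session_001", "session_001"],
    "start_frame": [0, 61],
    "end_frame": [60, 150],
    "verb": ["take", "cut"],
    "noun": ["knife", "tomato"],
})
actions.to_parquet("actions.parquet", index=False)

# Columnar reads stay cheap even at millions of rows:
verbs = pd.read_parquet("actions.parquet", columns=["episode_id", "verb"])
```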

Future-proofing: export datasets in multiple formats (RLDS, HDF5, MCAP) to maximize compatibility with evolving training frameworks. Include format conversion scripts in dataset repositories — a 50-line Python script that converts HDF5 to RLDS prevents vendor lock-in and extends dataset lifespan by 3-5 years. Embed format version numbers in file headers and maintain backward-compatible readers.

NVIDIA Cosmos world foundation models and GR00T humanoid policies are driving demand for multi-modal kitchen datasets that combine RGB video, depth, audio, and language annotations. Budget 20-30% overhead for multi-modal capture and annotation when planning 2026-2027 data collection campaigns. Language annotations (natural language descriptions of each action) enable vision-language-action models but add $15-$25 per minute of annotation cost.

Cost-Benefit Analysis and ROI Modeling

Kitchen dataset ROI depends on deployment scale and model performance gains. A manipulation policy trained on 50 hours of kitchen data achieves 55-65% success on novel recipes, while 200 hours improves success to 75-85%[11]. Marginal gains diminish above 300 hours unless environment diversity increases proportionally. For a commercial kitchen robot with 10,000 unit deployment target, a 10-percentage-point improvement in success rate translates to $2-5 million in reduced support costs and warranty claims.

In-house collection: $80,000-$250,000 for 100 hours (equipment, participants, annotation, engineering). Vendor procurement: $50,000-$150,000 for equivalent data volume with 2-4 week delivery. Marketplace licensing: $20,000-$80,000 for non-exclusive access to existing datasets, with 1-week delivery but limited customization. Hybrid approach: license 50 hours foundation data ($15,000), collect 30 hours deployment-specific data ($60,000), total $75,000 with 6-week timeline.

Hidden costs: data storage ($500-2000/year for 5-10TB), compute for preprocessing and training ($5,000-20,000 for 100-hour dataset), and ongoing annotation updates as taxonomy evolves ($10,000-30,000/year). Budget 25-40% of initial dataset cost for three-year maintenance and expansion.

Break-even analysis: if in-house collection costs $150,000 and vendor procurement costs $80,000, the $70,000 savings justify procurement if time-to-market value exceeds $70,000 (typically true for startups racing to Series A milestones or established companies with quarterly product cycles). If your team needs custom taxonomy, rare recipes, or specific kitchen layouts, in-house collection becomes cost-competitive despite higher absolute cost.

The references below move from category-level context into specific task, dataset, format, and comparison detail.

External references and source context

  1. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset (arXiv). Meal preparation sequences combine 20-80 discrete actions per episode.
  2. RT-1: Robotics Transformer for Real-World Control at Scale (arXiv). Trained on 130,000 demonstrations; achieved 62% success on novel kitchen tasks.
  3. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100 (arXiv). 90,000 action segments across 700 hours with 97 verbs and 300 nouns.
  4. Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). Aggregated data from 22 robot embodiments; kitchen tasks represent only 8% of episodes.
  5. EPIC-KITCHENS-100 dataset page (epic-kitchens.github.io). The verb-noun taxonomy covers 94% of observed kitchen actions across diverse cuisines.
  6. Crossing the Reality Gap: A Survey on Sim-to-Real Transferability of Robot Controllers in Reinforcement Learning (arXiv). Domain randomization in simulation cannot fully compensate for real-world kitchen diversity.
  7. Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100 (arXiv). Pre-computed hand segmentation masks reduce annotation burden by 60%.
  8. THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation (arXiv). Kitchen policies achieve 75-85% in-distribution success but degrade to 35-50% in novel environments.
  9. Open X-Embodiment: Robotic Learning Datasets and RT-X Models (arXiv). Multi-embodiment training improved generalization by 40%.
  10. DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv). Collected across 564 scenes with 18,000 unique objects for manipulation diversity.
  11. BridgeData V2: A Dataset for Robot Learning at Scale (arXiv). 50 hours achieves 55-65% success; 200 hours improves to 75-85%.

FAQ

What camera resolution and framerate are required for kitchen activity datasets used in robot training?

4K 30fps is the 2026 standard for overhead and egocentric cameras, balancing file size against spatial resolution for small-object detection. Depth cameras should provide aligned RGB-D at 1280×720 30fps minimum. Higher framerates (60fps) are unnecessary for most manipulation tasks where action durations exceed 1 second, but essential for fast motions like whisking or chopping if you plan to train high-frequency control policies. Synchronization tolerance must stay below 33ms across all cameras to prevent temporal misalignment during annotation.

How many hours of annotated video are needed to train a kitchen manipulation policy?

50-100 hours of fine-grained annotated video (verb-noun action segments at 1-2 second granularity) provides baseline performance of 55-65% success on novel recipes. 200 hours improves success to 75-85%, with diminishing returns above 300 hours unless environment diversity increases. These figures assume multi-view capture (overhead plus egocentric), 15-30 distinct recipes, and 3-6 kitchen environments. Single-environment datasets require 40-60% more data to achieve equivalent generalization.

What is the difference between RLDS and HDF5 for kitchen dataset storage?

HDF5 is a hierarchical file format optimized for random access to large arrays (video frames, depth maps, annotations), widely used for static datasets. RLDS is an episodic data schema built on TensorFlow TFRecord format, designed for reinforcement learning pipelines with standardized observation-action-reward structure. RLDS enables direct integration with imitation learning frameworks like LeRobot and RT-X, while HDF5 requires custom data loaders. Best practice: store raw data in HDF5 for flexibility, export to RLDS for training pipeline compatibility.

Can I use EPIC-KITCHENS-100 data to train a commercial kitchen robot?

EPIC-KITCHENS-100 annotations are released under a non-commercial license that prohibits use in commercial products without separate negotiation with the University of Bristol. The underlying video may have additional restrictions depending on participant consent forms. If you need commercial rights, contact the dataset authors for licensing terms or use datasets with permissive licenses like CC BY 4.0 (DROID, RoboNet) or procure custom data through vendors like Claru or truelabel's marketplace with pre-cleared commercial terms.

What annotation tools are best for temporal segmentation of kitchen activities?

CVAT supports frame-by-frame video annotation with temporal boundary marking and custom taxonomies, suitable for teams with existing annotation infrastructure. ELAN is preferred by activity recognition researchers for its precise timeline editing and multi-tier annotation support (action segments, object tracks, language descriptions). Label Studio offers a web-based interface with video annotation plugins and built-in quality control workflows. All three export to JSON or CSV formats that convert easily to RLDS or HDF5. Budget 15-25 hours of annotation labor per hour of source video for fine-grained verb-noun labeling.

How do I ensure my kitchen dataset generalizes to new environments?

Capture the same recipes in at least 3 distinct kitchens with different layouts (galley, L-shaped, U-shaped), lighting conditions (natural, artificial, mixed), and object sets (different knife types, cutting boards, cookware brands). Collect 20-30% of total data in secondary environments to learn layout-agnostic representations. Test generalization by training on Kitchen A+B and evaluating on Kitchen C — success rate should stay within 15 percentage points of in-distribution performance. If the gap exceeds 20 points, increase environment diversity or apply domain randomization during training.

Looking for kitchen activity data collection?

Specify modality, task, environment, rights, and delivery format. Truelabel matches you with vetted capture partners — every delivery includes consent artifacts and commercial licensing by default.

List Your Kitchen Dataset on Truelabel