feat(ci): benchmark smoke tests with isolated Docker images (LIBERO + MetaWorld)#3319
Conversation
…chmark docs Add a comprehensive guide for adding new benchmarks to LeRobot, and refactor the existing LIBERO and Meta-World docs to follow the new standardized template. Made-with: Cursor
…asses Replace hardcoded if/elif chains in factory.py with create_envs() and get_env_processors() methods on EnvConfig. New benchmarks now only need to register a config subclass — no factory.py edits required. Net -23 lines: factory.py shrinks from ~200 to ~70 lines of logic. Made-with: Cursor
Rewrite for simpler language, better structure, and easier navigation. Move quick-reference table to the top, fold eval explanation into architecture section, condense the doc template to a bulleted outline. Made-with: Cursor
- Thread camera_name_mapping from LiberoEnv config through to gym envs
- Sync features_map with camera_name_mapping in LiberoEnv.__post_init__
- Fix render() to use first available camera instead of hardcoded "image"
- Handle non-dict final_info in rollout by falling back to info["is_success"]
- Add use_peft legacy field to SmolVLAConfig for checkpoint compat
- Add defaults to GR00TN15Config init=False fields for transformers 5.3

Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
- Revert GR00T N1.5 default_factory/default changes (transformers compat)
- Revert SmolVLA use_peft legacy field
- Apply ruff formatting fixes
- camera_name_mapping stays entirely in env/eval layer (no policy changes)

Made-with: Cursor
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co> Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
LiberoEnv and MetaworldEnv previously allocated GPU resources (EGL context,
OpenGL framebuffer) in __init__, before AsyncVectorEnv's fork(). Worker
processes inherited stale GPU handles, causing EGL_BAD_CONTEXT crashes on
first render.
Fix: defer OffScreenRenderEnv / MT1 construction to _ensure_env(), called on
first reset() or step() inside the worker subprocess. Each worker creates its
own clean context after fork().
Also fixes lerobot_eval.py:170 (add_envs_task TODO): replace with
env.call("task") which works with both SyncVectorEnv and AsyncVectorEnv.
AsyncVectorEnv is now the default for n_envs > 1; it is auto-downgraded to
SyncVectorEnv when n_envs=1, where async parallelism brings no benefit and
the sync path has less overhead.
Expected speedup: ~15-20x for LIBERO Spatial with batch_size=50.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
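The deferred-construction pattern described above can be sketched as follows. This is a minimal illustration, not the actual LiberoEnv/MetaworldEnv code; `LazyEnv` and `make_backend` are hypothetical stand-ins for the real `_ensure_env()` machinery:

```python
class LazyEnv:
    """Defer heavy (GPU-backed) construction until first use.

    The factory is only invoked inside _ensure_env(), which first runs on
    reset()/step() -- i.e. after the vector-env worker has forked, so each
    worker builds its own fresh rendering context instead of inheriting a
    stale one from the parent process.
    """

    def __init__(self, make_backend):
        self._make_backend = make_backend  # store the factory, not the env
        self._env = None                   # nothing GPU-related allocated yet

    def _ensure_env(self):
        if self._env is None:
            self._env = self._make_backend()  # fresh context in this process
        return self._env

    def reset(self, **kwargs):
        return self._ensure_env().reset(**kwargs)

    def step(self, action):
        return self._ensure_env().step(action)
```

The key property is that `__init__` touches no GPU state at all, so forking after construction is safe.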
eval_policy_all never closed environments after each task completed, causing AsyncVectorEnv worker processes to accumulate (N_tasks × n_envs). This led to OOM, BrokenPipeError and EOFError on multi-task benchmarks.

Also fixes:
- AsyncVectorEnv compat in envs/utils.py (use get_attr/call instead of .envs)
- Tuple task handling in tokenizer_processor and lerobot_eval
- _LazyAsyncVectorEnv for deferred worker spawning in LIBERO

Made-with: Cursor
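The shape of the fix can be sketched as below (names are illustrative, not the actual lerobot API): each task gets its own vector env, closed before the next task's env is created, so worker processes never accumulate:

```python
def eval_policy_all(tasks, make_env, eval_one):
    """Evaluate each task with its own vector env, closing it afterwards.

    Without the close(), an AsyncVectorEnv's worker subprocesses survive
    until interpreter exit, accumulating N_tasks x n_envs processes.
    """
    results = {}
    for task in tasks:
        env = make_env(task)
        try:
            results[task] = eval_one(env)
        finally:
            env.close()  # terminate this task's workers before the next task
    return results
```

The `try/finally` matters: workers are reaped even when an eval raises mid-task.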
…ning
env.call("task") returns the LIBERO task name with underscores
(e.g. "pick_up_the_black_bowl_...") instead of the natural language
description ("pick up the black bowl ..."). The VLM tokenizes these
completely differently, causing 0.0 reward across all episodes.
Made-with: Cursor
- Replace add_envs_task reference with env.call("task_description")
- Update use_async_envs default to True
- Add note about lazy GPU init for AsyncVectorEnv compatibility
Made-with: Cursor
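For illustration, the underscore-vs-natural-language mismatch comes down to a tiny transformation. This helper is hypothetical — the actual fix queries `env.call("task_description")` rather than converting names — but it mirrors the MetaWorld-style fallback mentioned later in this PR:

```python
def humanize_task_name(name: str) -> str:
    """Turn an underscored task id into a rough natural-language string.

    e.g. "pick_up_the_black_bowl" -> "pick up the black bowl".
    The VLM tokenizes the two forms completely differently, which is why
    feeding it the raw task id produced 0.0 reward.
    """
    return name.replace("_", " ").strip()
```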
- batch_size=0 (default) auto-tunes based on CPU cores, capped by n_episodes and 64. Removes the need for users to guess the right value. The old batch_size > n_episodes error is replaced by silently clamping to n_episodes.
- _LazyAsyncVectorEnv accepts pre-computed spaces so only one temp env is created per suite (not per task). For libero_spatial (10 tasks) this avoids 9 redundant LiberoEnv instantiations during env setup.

Made-with: Cursor
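The auto-tuning rule described above amounts to a couple of `min()` calls. A sketch under the stated caps — the function name and exact clamping order are assumptions, not the real implementation:

```python
import os


def resolve_batch_size(batch_size: int, n_episodes: int, cap: int = 64) -> int:
    """batch_size=0 -> derive from CPU count, capped by n_episodes and 64.

    Explicit values larger than n_episodes are silently clamped instead of
    raising, matching the behavior change described above.
    """
    if batch_size == 0:
        batch_size = min(os.cpu_count() or 1, n_episodes, cap)
    return min(batch_size, n_episodes)
```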
- New docs/source/evaluation.mdx covering lerobot-eval usage, batch_size auto-tuning, AsyncVectorEnv performance, tuning tips, output format, multi-task evaluation, and programmatic usage.
- Add evaluation page to _toctree.yml under Benchmarks section.
- Update adding_benchmarks.mdx to reference batch_size auto default and link to the evaluation guide.

Made-with: Cursor
Made-with: Cursor
- AsyncVectorEnv now uses shared_memory=True for zero-copy observation transfer
- LiberoEnvConfig.gym_kwargs passes observation_height/width to the env
- eval_policy_all prefetches next task's workers while current task runs

Made-with: Cursor
Made-with: Cursor
Made-with: Cursor
Each benchmark gets its own Docker image (lerobot[libero] / lerobot[metaworld] only) so incompatible dep trees cannot collide. A 1-episode smoke eval runs per benchmark on GPU runners. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
libero/__init__.py calls input() to ask about a custom dataset path, which raises EOFError when stdin is closed inside Docker. Setting LIBERO_DATA_FOLDER skips the prompt entirely. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
libero/__init__.py calls input() when ~/.libero/config.yaml is missing. We write the config at image build time (without importing libero) so the prompt never fires at runtime. Also trigger CI on pyproject.toml changes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n -c The multiline RUN python -c "..." was being parsed as Dockerfile instructions. Use printf to write ~/.libero/config.yaml directly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The config was pointing to /tmp/libero_init which doesn't exist. Use importlib.util.find_spec to locate the hf-libero package directory and write paths to the actual bundled bddl_files/init_files/assets. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
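The locate-without-import trick can be sketched like this (`locate_package_dir` is a hypothetical helper; the real Dockerfile writes the resulting bddl_files/init_files/assets paths into ~/.libero/config.yaml):

```python
import importlib.util
from pathlib import Path


def locate_package_dir(pkg: str):
    """Find an installed package's directory without importing it.

    Importing libero would trigger its interactive input() prompt, so we
    resolve the path via the import machinery's metadata instead. Returns
    None when the package is missing or has no file origin.
    """
    spec = importlib.util.find_spec(pkg)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent
```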
num2words (required by SmolVLM processor) is declared in lerobot[smolvla], not lerobot[libero/metaworld]. Install both extras together. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Runs accelerate launch --num_processes=1 lerobot-train with:
- steps=1, batch_size=1, dataset.episodes=[0] (episode 0 only)
- eval_freq=1 so the training loop triggers eval after step 1
- eval.n_episodes=1, eval.use_async_envs=false

Tests the full train→eval-within-training pipeline in the existing lerobot-benchmark-libero:ci image (no extra Docker build cost). Uploads eval video from /tmp/train-smoke/eval/ as libero-train-smoke-video.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Resolves conflict in lerobot_eval.py by taking explicit (AttributeError, NotImplementedError) catches from main (#3274). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add scripts/ci/extract_task_descriptions.py: runs inside the benchmark Docker container (LIBERO/MetaWorld installed) after lerobot-eval and writes task_descriptions.json mapping task keys to NL instructions. LIBERO: uses libero.libero.benchmark to get suite.get_task(i).language. MetaWorld: formats task name as a human-readable label.
- Call extraction at the end of each eval bash -c (|| true so it is never fatal).
- parse_eval_metrics.py reads task_descriptions.json and includes it in metrics.json so the health dashboard Space can label videos by task.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
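The contract between the two scripts is just a small JSON file. A sketch of the writer side — the helper name is an assumption, and the schema beyond "task key → NL instruction" is not specified in this PR:

```python
import json
from pathlib import Path


def write_task_descriptions(descriptions: dict, artifacts_dir: str) -> Path:
    """Write the task-key -> natural-language-instruction mapping that
    parse_eval_metrics.py later folds into metrics.json.
    """
    out = Path(artifacts_dir) / "task_descriptions.json"
    out.write_text(json.dumps(descriptions, indent=2))
    return out
```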
src/lerobot/envs/factory.py (outdated diff)

```diff
     cfg: EnvConfig | str,
     n_envs: int = 1,
-    use_async_envs: bool = False,
+    use_async_envs: bool = True,
```
The task descriptions were never populated in metrics.json because extract_task_descriptions.py was never invoked. The script exists and parse_eval_metrics.py already looks for its output — the call was simply missing from the workflow. Appends the extraction step to the existing bash -c block (runs inside the container where libero/metaworld is installed) so task_descriptions.json is written to the eval-artifacts dir before docker cp copies it out. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AsyncVectorEnv spawns new subprocesses that do not inherit the in-process gym registration created by the test. Pass use_async_envs=False since this test validates dispatch logic, not async parallelism. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…builds The dep-install layer (uv sync) now only depends on pyproject.toml, uv.lock, and a minimal package stub — not the full src/ tree. Source code changes only rebuild the final COPY layer (seconds, not minutes). Also switch from type=local cache (lost on ephemeral runners) to type=gha (persisted in GitHub Actions cache, shared across all runs).

Before: every src/ change → full uv sync rebuild (~8-10 min)
After: src/-only change → cached dep layer, ~30s source copy

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
```diff
@@ -0,0 +1,310 @@
+# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
```
Claude finished @pkooij's task in 1m 25s
PR Review: Benchmark Smoke Tests with Isolated Docker Images
Overall this is well-structured CI work — isolated Docker images per benchmark is the right approach for incompatible dep trees, and the dep-layer / source-layer split in the Dockerfiles is good for cache efficiency. Here's my detailed feedback:
Security Issue
Token passed via CLI argument in "Login to Hugging Face" step (benchmark_tests.yml:106)
```bash
bash -c "hf auth login --token '$HF_USER_TOKEN' --add-to-git-credential && hf auth whoami"
```

The token is interpolated into the shell string via `'$HF_USER_TOKEN'` (single quotes in the outer `bash -c`). This exposes it in `docker inspect`, `/proc/*/cmdline`, and process listings. The eval steps correctly pass it as `-e HF_USER_TOKEN` and read it inside the container with `\$HF_USER_TOKEN`, but this login step embeds the literal value. Consider passing it as an env var here too:

```bash
docker run --rm \
  -e HF_HOME=/tmp/hf \
  -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
  lerobot-benchmark-libero:ci \
  bash -c 'hf auth login --token "$HF_USER_TOKEN" --add-to-git-credential && hf auth whoami'
```

Additionally, this login step (lines 100–106) appears to be a no-op — the container it runs in is ephemeral (`--rm`), so the credential is immediately discarded. The actual eval step (line 119) re-authenticates independently. Consider removing this step entirely unless it's intended as a pre-flight check.
Bugs / Functional Issues

- `parse_eval_metrics.py` runs on the host, not in Docker (benchmark_tests.yml:147-151)

  ```bash
  python3 scripts/ci/parse_eval_metrics.py \
    --artifacts-dir /tmp/libero-artifacts \
  ```

  This runs `python3` on the bare runner (Ubuntu), which won't have the project dependencies installed (no `uv sync` on the host). It only uses stdlib (`json`, `argparse`, `pathlib`, `math`), so it works today — but this is fragile. If any future import is added, it'll break silently. Worth adding a comment noting the stdlib-only constraint, or running it inside the container.

- `feat/benchmark-ci` in push trigger paths (benchmark_tests.yml:34)

  ```yaml
  push:
    branches:
      - feat/benchmark-ci
      - main
  ```

  This feature branch trigger should be removed before merging to main — it'll cause the workflow to fire on pushes to a branch that won't exist post-merge.
Dockerfile Concerns

- Unpinned `uv` version in Dockerfiles (Dockerfile.benchmark.libero:46, Dockerfile.benchmark.metaworld:46)

  ```bash
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```

  The workflow defines `UV_VERSION: "0.8.0"` but the Dockerfiles install whatever version `uv` ships by default. For reproducible builds, pin the version:

  ```bash
  curl -LsSf https://astral.sh/uv/0.8.0/install.sh | sh
  ```

- Duplicated system deps block — the `apt-get` block in both Dockerfiles is identical (lines 35-50). Not a blocker, but if you anticipate adding more benchmarks, consider a shared base image or a multi-stage pattern to reduce drift between them.

- `chmod +x ptxas` (Dockerfile.benchmark.libero:87, Dockerfile.benchmark.metaworld:72)

  ```dockerfile
  RUN chmod +x .../triton/backends/nvidia/bin/ptxas
  ```

  This is a workaround for a known Triton packaging issue. Worth a comment explaining why, so future maintainers don't remove it.
Source Code Changes

- Default `use_async_envs` flip from `False` → `True` (configs.py, factory.py)

  This changes the public API default for `EnvConfig.create_envs()` and `make_env()`. The actual behavior is guarded (`use_async and n_envs > 1`), so with `n_envs=1` it's effectively a no-op. But for callers passing `n_envs > 1` without specifying `use_async_envs`, this silently switches them from `SyncVectorEnv` to `AsyncVectorEnv`. This could surface latent serialization/fork issues.

  The test update (test_dispatch.py:90) correctly pins the test to `use_async_envs=False`, which confirms this default change is intentional. Just want to flag that this is a behavioral change beyond CI — it affects all library consumers. The PR description doesn't highlight this.
CI Scripts

- `extract_task_descriptions.py` — clean implementation. The MetaWorld fallback (using the cleaned task name as the description) is reasonable.

- `parse_eval_metrics.py` — handles both `aggregated` and `overall` output shapes well. One minor note: the `NaN` check on `pc_success` (line 57) is good defensive coding, but `avg_sum_reward` and `eval_s` aren't checked for NaN — worth being consistent.
Minor / Style

- The workflow has `permissions: contents: read` — good, minimal permissions.
- The concurrency group setup is correct for cancelling in-flight runs.
- Artifact uploads use `if: always()` — good, ensures artifacts are captured even on failure.
Summary

The main items to address:

- Remove `feat/benchmark-ci` from the push branch trigger before merge
- Fix the token exposure in the "Login to Hugging Face" step (or remove it entirely since it's a no-op)
- Pin the `uv` version in both Dockerfiles for reproducibility
- Document the `use_async_envs` default change in the PR description since it's a public API change

The rest are suggestions for robustness. Solid work overall — the isolated Docker approach and cache layering strategy are well thought out.
Anonymous pulls from Docker Hub are rate-limited to 100/6h, which fails when multiple benchmark jobs pull nvidia/cuda in parallel. Add docker/login-action step (conditional on DOCKERHUB_USERNAME var) to authenticate and get 200 pulls/6h. Setup: add DOCKERHUB_USERNAME as a repository variable and DOCKERHUB_TOKEN as a repository secret in GitHub Settings. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Step-level 'if' cannot reference 'secrets' directly. Expose the secret via an env var and check that instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drop the conditional guard — other workflows (docker_publish, full_tests) call docker/login-action unconditionally. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GHA cache is capped at 10GB per repo — a single CUDA + PyTorch +
benchmark image is ~8GB so the cache evicts before it's reused.
Switch to type=registry which pushes cache layers to Docker Hub
(huggingface/lerobot-benchmark-cache:{libero,metaworld}). No size
limit, layers persist until explicitly deleted, and shared across
all runners and branches.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Docker Hub CI token can't push to new repos. GHCR works out of the box — GITHUB_TOKEN has automatic packages:write for the repo owner.

- Add GHCR login step (github.actor + GITHUB_TOKEN)
- Switch cache refs to ghcr.io/huggingface/lerobot/cache-benchmark
- Add packages:write at job level (not workflow, per zizmor)
- Keep Docker Hub login for pulling nvidia/cuda base image

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The huggingface org restricts GHCR package creation via GITHUB_TOKEN, causing 403 on cache export. Remove all registry caching and GHCR login. The Dockerfile layer split (deps vs source) still helps when the runner has a warm Docker daemon. Also fix the metaworld job which had a stale conditional Docker Hub login and was missing the GHCR login entirely. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Security:
- Remove "Login to Hugging Face" step — it was a no-op (ephemeral --rm container) that exposed the HF token via CLI argument in docker inspect / /proc/*/cmdline. The eval step already re-authenticates via env var.

Functional:
- Remove feat/benchmark-ci from push trigger branches (won't exist post-merge).

Dockerfiles:
- Pin uv to 0.8.0 (was unpinned, fetching whatever latest ships).
- Add comment explaining the chmod +x ptxas workaround (Triton packaging bug — ships ptxas without execute bit).

Scripts:
- parse_eval_metrics.py: add note that it runs on the bare host and must stay stdlib-only.
- parse_eval_metrics.py: add NaN guard for avg_sum_reward and eval_s (was only guarding pc_success).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
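The consistent NaN guard described here can be as small as a single helper, applied to every numeric field. A sketch — the function name is an assumption; the field names come from the review above:

```python
import math


def sanitize_metric(value):
    """Map non-finite floats (NaN, +/-inf) to None so metrics.json stays
    strictly valid JSON for the dashboard.

    Applied uniformly to pc_success, avg_sum_reward and eval_s rather than
    guarding only pc_success.
    """
    if isinstance(value, float) and not math.isfinite(value):
        return None
    return value
```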
Benchmark PRs (robomme, libero-plus, robocerebra, robotwin) target feat/benchmark-ci, not main. Without this, the workflow never runs on those PRs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ict) uv sync --locked validates the entire lockfile across all extras. Since robomme depends on mani-skill which pins numpy<2.0, and the base project requires numpy>=2.0, the full lockfile is unsatisfiable. Switch to uv pip install -e ".[libero,smolvla]" which only resolves the requested extras for the current Python version and platform, avoiding the cross-extra numpy conflict entirely. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These use_async_envs default changes belong to the async-vector-env PR (#3274), not this CI PR. Restore to match origin/main. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@claude Review this PR
Claude finished @imstevenpmwork's task in 2m 2s

PR Review: Benchmark Smoke Tests with Isolated Docker Images

Overall this is solid CI work. Isolated Docker images per benchmark is the right architectural choice for incompatible dep trees, and the dep-layer / source-layer split in the Dockerfiles is good for Docker cache efficiency. The CI scripts are clean stdlib-only Python. Here's detailed feedback:

Must-fix before merge

1. …

   ```yaml
   pull_request:
     branches:
       - main
       - feat/benchmark-ci  # ← remove
   ```

   This feature branch won't exist post-merge, so this line becomes dead config. It also means PRs targeting …

Security

2. Token handling in … — The token is passed as … Note: the previous Claude review raised this as an issue, but examining the actual code, the token is properly passed as an env var to the container and dereferenced inside it. The escaped …

Bugs / Correctness

3. …

   ```bash
   python3 scripts/ci/parse_eval_metrics.py \
     --artifacts-dir /tmp/libero-artifacts \
   ```

   This runs on the host, not in Docker. The script header already documents this constraint (line 28: …).

4. Container cleanup on failure (…) — The "Copy artifacts" steps have …

5. No container cleanup for … — If the "Run Libero train+eval smoke" step fails, …

Dockerfiles

6. Both Dockerfiles pin …

   ```bash
   curl -LsSf https://astral.sh/uv/0.8.0/install.sh | sh
   ```

   This matches the workflow env var and is better than what …

7. Broken Triton issue URL (…)

   ```dockerfile
   # See: https://github.com/triton-lang/triton/issues/2due
   ```

   The URL …

8. Duplicated system deps block is acceptable — Both benchmark Dockerfiles duplicate the identical …

9. Both Dockerfiles …

CI Workflow Design

10. Jobs run independently (no matrix) — Each benchmark is a separate job, not a matrix strategy. This is the right call: the Docker images, eval commands, and artifact names all differ significantly between benchmarks, so a matrix would require extensive …

11. Concurrency group is correct (…)

    ```yaml
    group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
    cancel-in-progress: true
    ```

    Same pattern as other workflows. Good.

12. The …

13. No … — Other workflows (e.g., …):

    ```yaml
    if: env.HF_USER_TOKEN != ''
    ```

    This PR's eval steps will fail hard if the secret isn't configured (e.g., on forks). Since the entire job depends on HF Hub access (model download), this might be intentional — there's no point running the eval without the token. But it would produce a confusing error message. Consider either adding the guard or adding a clear early-fail step that checks for the secret.

Python Scripts

14. Stdlib-only, proper error handling, graceful fallback for unknown envs. The MetaWorld description extraction (cleaned task name) is reasonable given MetaWorld doesn't expose NL descriptions.

15. The helper …

16. …

Summary

The architecture and approach are well thought out. The isolated Docker image pattern, dep/source layer split, and stdlib-only metrics parser are all good engineering decisions.
Type / Scope
docker/, .github/workflows/, docs/

Summary / Motivation
Adds isolated CI smoke tests for LIBERO and MetaWorld, stacked on top of #3274. Each benchmark gets its own Docker image (`lerobot[<benchmark>,smolvla]`) so incompatible dependency trees (e.g. `hf-libero` vs `metaworld==3.0.0`) can never collide. A 1-episode eval runs on GPU runners to catch install-time regressions (broken deps, import errors, interactive prompts) before they reach users.

Related issues
What changed
- `docker/Dockerfile.benchmark.libero` — isolated LIBERO image; pre-creates `~/.libero/config.yaml` at build time to bypass the interactive stdin prompt on import
- `docker/Dockerfile.benchmark.metaworld` — isolated MetaWorld image
- `.github/workflows/benchmark_tests.yml` — one job per benchmark; triggers on `envs/**`, `lerobot_eval.py`, Dockerfiles, `pyproject.toml`; uploads rollout videos and `metrics.json` artifacts
- `docs/source/evaluation.mdx` — new lerobot-eval user guide
- `docs/source/adding_benchmarks.mdx` — step 7: CI smoke test instructions for new benchmarks

How was this tested (or how to run locally)
Build and run manually:
```bash
docker build -f docker/Dockerfile.benchmark.libero -t lerobot-benchmark-libero .
docker run --rm --gpus all --shm-size=4g lerobot-benchmark-libero \
  lerobot-eval --policy.path=pepijn223/smolvla_libero \
  --env.type=libero --env.task=libero_spatial \
  --eval.batch_size=1 --eval.n_episodes=1 --eval.use_async_envs=false \
  --policy.device=cuda
```

Checklist (required before merge)
- … (`pre-commit run -a`)
- … (`pytest`)

TODO (post-merge validation)
- `parse_eval_metrics.py` writes correct `metrics.json` after a libero/metaworld eval
- `libero-metrics` / `metaworld-metrics` artifacts appear in the Actions UI
- … (`GITHUB_RO_TOKEN` Space secret to be set)

Reviewer notes