feat(ci): benchmark smoke tests with isolated Docker images (LIBERO + MetaWorld) #3319

Open
pkooij wants to merge 67 commits into main from feat/benchmark-ci

Conversation


pkooij (Member) commented Apr 8, 2026

Type / Scope

  • Type: CI
  • Scope: docker/, .github/workflows/, docs/

Summary / Motivation

Adds isolated CI smoke tests for LIBERO and MetaWorld, stacked on top of #3274. Each benchmark gets its own Docker image (lerobot[<benchmark>,smolvla]) so incompatible dependency trees (e.g. hf-libero vs metaworld==3.0.0) can never collide. A 1-episode eval runs on GPU runners to catch install-time regressions (broken deps, import errors, interactive prompts) before they reach users.

Related issues

What changed

  • docker/Dockerfile.benchmark.libero — isolated LIBERO image; pre-creates ~/.libero/config.yaml at build time to bypass the interactive stdin prompt on import
  • docker/Dockerfile.benchmark.metaworld — isolated MetaWorld image
  • .github/workflows/benchmark_tests.yml — one job per benchmark; triggers on envs/**, lerobot_eval.py, Dockerfiles, pyproject.toml; uploads rollout videos and metrics.json artifacts
  • docs/source/evaluation.mdx — new lerobot-eval user guide
  • docs/source/adding_benchmarks.mdx — step 7: CI smoke test instructions for new benchmarks

How was this tested (or how to run locally)

Build and run manually:

docker build -f docker/Dockerfile.benchmark.libero -t lerobot-benchmark-libero .
docker run --rm --gpus all --shm-size=4g lerobot-benchmark-libero \
  lerobot-eval --policy.path=pepijn223/smolvla_libero \
    --env.type=libero --env.task=libero_spatial \
    --eval.batch_size=1 --eval.n_episodes=1 --eval.use_async_envs=false \
    --policy.device=cuda

Checklist (required before merge)

  • Linting/formatting run (pre-commit run -a)
  • All tests pass locally (pytest)
  • Documentation updated
  • CI is green

TODO (post-merge validation)

  • Verify parse_eval_metrics.py writes correct metrics.json after a libero/metaworld eval
  • Verify libero-metrics / metaworld-metrics artifacts appear in the Actions UI
  • Open lerobot/health-dashboard — confirm status table, charts, and videos load (requires GITHUB_RO_TOKEN Space secret to be set)

Reviewer notes

pkooij and others added 29 commits April 8, 2026 18:28
…chmark docs

Add a comprehensive guide for adding new benchmarks to LeRobot, and
refactor the existing LIBERO and Meta-World docs to follow the new
standardized template.

Made-with: Cursor
…asses

Replace hardcoded if/elif chains in factory.py with create_envs() and
get_env_processors() methods on EnvConfig. New benchmarks now only need
to register a config subclass — no factory.py edits required.

Net -23 lines: factory.py shrinks from ~200 to ~70 lines of logic.

Made-with: Cursor
Rewrite for simpler language, better structure, and easier navigation.
Move quick-reference table to the top, fold eval explanation into
architecture section, condense the doc template to a bulleted outline.

Made-with: Cursor
- Thread camera_name_mapping from LiberoEnv config through to gym envs
- Sync features_map with camera_name_mapping in LiberoEnv.__post_init__
- Fix render() to use first available camera instead of hardcoded "image"
- Handle non-dict final_info in rollout by falling back to info["is_success"]
- Add use_peft legacy field to SmolVLAConfig for checkpoint compat
- Add defaults to GR00TN15Config init=False fields for transformers 5.3

Made-with: Cursor
- Revert GR00T N1.5 default_factory/default changes (transformers compat)
- Revert SmolVLA use_peft legacy field
- Apply ruff formatting fixes
- camera_name_mapping stays entirely in env/eval layer (no policy changes)

Made-with: Cursor
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
LiberoEnv and MetaworldEnv previously allocated GPU resources (EGL context,
OpenGL framebuffer) in __init__, before AsyncVectorEnv's fork(). Worker
processes inherited stale GPU handles, causing EGL_BAD_CONTEXT crashes on
first render.

Fix: defer OffScreenRenderEnv / MT1 construction to _ensure_env(), called on
first reset() or step() inside the worker subprocess. Each worker creates its
own clean context after fork().

Also fixes lerobot_eval.py:170 (add_envs_task TODO): replace with
env.call("task") which works with both SyncVectorEnv and AsyncVectorEnv.

AsyncVectorEnv is now the default for n_envs > 1; auto-downgraded to
SyncVectorEnv when n_envs=1 (no benefit, less overhead).

Expected speedup: ~15-20x for LIBERO Spatial with batch_size=50.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
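The deferred-construction pattern described in this commit can be sketched as follows (class and method names here are illustrative, not lerobot's actual wrapper API):

```python
class LazyRenderEnv:
    """Defer GPU resource allocation (EGL context, framebuffer) until first use."""

    def __init__(self, make_backend):
        # make_backend would wrap e.g. OffScreenRenderEnv(...) in the real code
        self._make_backend = make_backend
        self._backend = None

    def _ensure_env(self):
        # Runs on first reset()/step(), i.e. inside the AsyncVectorEnv worker
        # after fork(), so each worker builds its own clean GPU context.
        if self._backend is None:
            self._backend = self._make_backend()
        return self._backend

    def reset(self, **kwargs):
        return self._ensure_env().reset(**kwargs)

    def step(self, action):
        return self._ensure_env().step(action)
```

Because `__init__` allocates nothing, the wrapper can be pickled and forked safely; the expensive backend only ever exists in the worker process that uses it.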
eval_policy_all never closed environments after each task completed,
causing AsyncVectorEnv worker processes to accumulate (N_tasks × n_envs).
This led to OOM, BrokenPipeError and EOFError on multi-task benchmarks.

Also fixes:
- AsyncVectorEnv compat in envs/utils.py (use get_attr/call instead of .envs)
- Tuple task handling in tokenizer_processor and lerobot_eval
- _LazyAsyncVectorEnv for deferred worker spawning in LIBERO

Made-with: Cursor
…ning

env.call("task") returns the LIBERO task name with underscores
(e.g. "pick_up_the_black_bowl_...") instead of the natural language
description ("pick up the black bowl ..."). The VLM tokenizes these
completely differently, causing 0.0 reward across all episodes.

Made-with: Cursor
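A minimal illustration of the mismatch (example strings only; the real names come from the LIBERO suite):

```python
# env.call("task") returns the registry-style name; the VLM was trained on
# natural-language instructions, so the two tokenize very differently.
task_name = "pick_up_the_black_bowl_between_the_plate_and_the_ramekin"
description = task_name.replace("_", " ")
print(description)  # pick up the black bowl between the plate and the ramekin
```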
- Replace add_envs_task reference with env.call("task_description")
- Update use_async_envs default to True
- Add note about lazy GPU init for AsyncVectorEnv compatibility

Made-with: Cursor
- batch_size=0 (default) auto-tunes based on CPU cores, capped by
  n_episodes and 64. Removes the need for users to guess the right
  value. The old batch_size > n_episodes error is replaced by silently
  clamping to n_episodes.
- _LazyAsyncVectorEnv accepts pre-computed spaces so only one temp env
  is created per suite (not per task). For libero_spatial (10 tasks)
  this avoids 9 redundant LiberoEnv instantiations during env setup.

Made-with: Cursor
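The auto-tuning rule can be sketched as follows (logic reconstructed from the commit message above, not the actual implementation):

```python
import os

def auto_batch_size(n_episodes: int, requested: int = 0, cap: int = 64) -> int:
    """batch_size=0 picks a value from the CPU core count, capped by
    n_episodes and 64; an explicit request larger than n_episodes is
    silently clamped instead of raising an error."""
    if requested <= 0:
        requested = os.cpu_count() or 1
    return max(1, min(requested, n_episodes, cap))
```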
- New docs/source/evaluation.mdx covering lerobot-eval usage, batch_size
  auto-tuning, AsyncVectorEnv performance, tuning tips, output format,
  multi-task evaluation, and programmatic usage.
- Add evaluation page to _toctree.yml under Benchmarks section.
- Update adding_benchmarks.mdx to reference batch_size auto default and
  link to the evaluation guide.

Made-with: Cursor
- AsyncVectorEnv now uses shared_memory=True for zero-copy observation transfer
- LiberoEnvConfig.gym_kwargs passes observation_height/width to the env
- eval_policy_all prefetches next task's workers while current task runs

Made-with: Cursor
Made-with: Cursor
Each benchmark gets its own Docker image (lerobot[libero] / lerobot[metaworld]
only) so incompatible dep trees cannot collide. A 1-episode smoke eval runs
per benchmark on GPU runners.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
libero/__init__.py calls input() to ask about a custom dataset path,
which raises EOFError when stdin is closed inside Docker. Setting
LIBERO_DATA_FOLDER skips the prompt entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
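The guard being relied on looks roughly like this (a sketch of libero's described behavior; the function name and prompt text are assumptions):

```python
import os

def resolve_data_folder() -> str:
    # With LIBERO_DATA_FOLDER set, input() is never reached, so a closed
    # stdin inside Docker can no longer raise EOFError.
    folder = os.environ.get("LIBERO_DATA_FOLDER")
    if folder:
        return folder
    return input("Enter a custom dataset path: ")
```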
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
libero/__init__.py calls input() when ~/.libero/config.yaml is missing.
We write the config at image build time (without importing libero) so
the prompt never fires at runtime. Also trigger CI on pyproject.toml changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n -c

The multiline RUN python -c "..." was being parsed as Dockerfile
instructions. Use printf to write ~/.libero/config.yaml directly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The config was pointing to /tmp/libero_init which doesn't exist.
Use importlib.util.find_spec to locate the hf-libero package directory
and write paths to the actual bundled bddl_files/init_files/assets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
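The locate-the-package approach can be sketched like this (subdirectory names follow the commit message; the helper name is hypothetical, and since `libero` may not be installed the usage example below points it at a stdlib package):

```python
import importlib.util
from pathlib import Path

def package_data_paths(package: str, subdirs=("bddl_files", "init_files", "assets")):
    """Resolve an installed package's root via find_spec and return config
    paths pointing at its bundled data directories."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(package)
    root = Path(spec.origin).parent
    return {name: str(root / name) for name in subdirs}
```

Writing these resolved paths into `~/.libero/config.yaml` at build time means the config always matches wherever pip/uv actually installed the package.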
num2words (required by SmolVLM processor) is declared in lerobot[smolvla],
not lerobot[libero/metaworld]. Install both extras together.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Runs accelerate launch --num_processes=1 lerobot-train with:
- steps=1, batch_size=1, dataset.episodes=[0] (episode 0 only)
- eval_freq=1 so the training loop triggers eval after step 1
- eval.n_episodes=1, eval.use_async_envs=false

Tests the full train→eval-within-training pipeline in the existing
libero-benchmark-libero:ci image (no extra Docker build cost).
Uploads eval video from /tmp/train-smoke/eval/ as libero-train-smoke-video.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Base automatically changed from feat/async-vector-env to main April 9, 2026 08:29
Resolves conflict in lerobot_eval.py by taking explicit
(AttributeError, NotImplementedError) catches from main (#3274).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions bot added the CI, evaluation, and github_actions labels Apr 9, 2026
- Add scripts/ci/extract_task_descriptions.py: runs inside the benchmark
  Docker container (LIBERO/MetaWorld installed) after lerobot-eval and
  writes task_descriptions.json mapping task keys to NL instructions.
  LIBERO: uses libero.libero.benchmark to get suite.get_task(i).language.
  MetaWorld: formats task name as human-readable label.
- Call extraction at the end of each eval bash-c (|| true so never fatal).
- parse_eval_metrics.py reads task_descriptions.json and includes it in
  metrics.json so the health dashboard Space can label videos by task.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
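The artifact this step produces is just a JSON mapping from task keys to natural-language instructions; a minimal sketch of writing it (file name per the commit above, helper name hypothetical):

```python
import json
from pathlib import Path

def write_task_descriptions(artifacts_dir: str, mapping: dict) -> Path:
    # parse_eval_metrics.py later reads this file and folds it into
    # metrics.json so the dashboard can label videos by task.
    out = Path(artifacts_dir) / "task_descriptions.json"
    out.write_text(json.dumps(mapping, indent=2))
    return out
```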
cfg: EnvConfig | str,
n_envs: int = 1,
- use_async_envs: bool = False,
+ use_async_envs: bool = True,
Collaborator

@claude what is this doing?

pkooij and others added 3 commits April 9, 2026 14:53
The task descriptions were never populated in metrics.json because
extract_task_descriptions.py was never invoked. The script exists and
parse_eval_metrics.py already looks for its output — the call was
simply missing from the workflow.

Appends the extraction step to the existing bash -c block (runs inside
the container where libero/metaworld is installed) so task_descriptions.json
is written to the eval-artifacts dir before docker cp copies it out.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AsyncVectorEnv spawns new subprocesses that do not inherit the
in-process gym registration created by the test. Pass
use_async_envs=False since this test validates dispatch logic,
not async parallelism.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions bot added the tests label Apr 9, 2026
…builds

The dep-install layer (uv sync) now only depends on pyproject.toml,
uv.lock, and a minimal package stub — not the full src/ tree. Source
code changes only rebuild the final COPY layer (seconds, not minutes).

Also switch from type=local cache (lost on ephemeral runners) to
type=gha (persisted in GitHub Actions cache, shared across all runs).

Before: every src/ change → full uv sync rebuild (~8-10 min)
After:  src/-only change → cached dep layer, ~30s source copy

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Member Author

@claude review this pr

claude bot commented Apr 9, 2026

Claude finished @pkooij's task in 1m 25s.


PR Review: Benchmark Smoke Tests with Isolated Docker Images

Overall this is well-structured CI work — isolated Docker images per benchmark is the right approach for incompatible dep trees, and the dep-layer / source-layer split in the Dockerfiles is good for cache efficiency. Here's my detailed feedback:


Security Issue

Token passed via CLI argument in "Login to Hugging Face" step (benchmark_tests.yml:106)

bash -c "hf auth login --token '$HF_USER_TOKEN' --add-to-git-credential && hf auth whoami"

The token is interpolated into the shell string via '$HF_USER_TOKEN' (single quotes in the outer bash -c). This exposes it in docker inspect, /proc/*/cmdline, and process listings. The eval steps correctly pass it as -e HF_USER_TOKEN and read it inside the container with \$HF_USER_TOKEN, but this login step embeds the literal value. Consider passing it as an env var here too:

docker run --rm \
  -e HF_HOME=/tmp/hf \
  -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
  lerobot-benchmark-libero:ci \
  bash -c 'hf auth login --token "$HF_USER_TOKEN" --add-to-git-credential && hf auth whoami'

Additionally, this login step (lines 100–106) appears to be a no-op — the container it runs is ephemeral (--rm), so the credential is immediately discarded. The actual eval step (line 119) re-authenticates independently. Consider removing this step entirely unless it's intended as a pre-flight check.


Bugs / Functional Issues

  1. parse_eval_metrics.py runs on the host, not in Docker (benchmark_tests.yml:147-151)

    python3 scripts/ci/parse_eval_metrics.py \
      --artifacts-dir /tmp/libero-artifacts \

    This runs python3 on the bare runner (Ubuntu), which won't have the project dependencies installed (no uv sync on the host). It only uses stdlib (json, argparse, pathlib, math), so it works today — but this is fragile. If any future import is added, it'll break silently. Worth adding a comment noting the stdlib-only constraint, or running it inside the container.

  2. feat/benchmark-ci in push trigger paths (benchmark_tests.yml:34)

    push:
      branches:
        - feat/benchmark-ci
        - main

    This feature branch trigger should be removed before merging to main — it'll cause the workflow to fire on pushes to a branch that won't exist post-merge.


Dockerfile Concerns

  1. Unpinned uv version in Dockerfiles (Dockerfile.benchmark.libero:46, Dockerfile.benchmark.metaworld:46)

    curl -LsSf https://astral.sh/uv/install.sh | sh

    The workflow defines UV_VERSION: "0.8.0" but the Dockerfiles install whatever version uv ships by default. For reproducible builds, pin the version:

    curl -LsSf https://astral.sh/uv/0.8.0/install.sh | sh
  2. Duplicated system deps block — The apt-get block in both Dockerfiles is identical (lines 35-50). Not a blocker, but if you anticipate adding more benchmarks, consider a shared base image or a multi-stage pattern to reduce drift between them.

  3. chmod +x ptxas (Dockerfile.benchmark.libero:87, Dockerfile.benchmark.metaworld:72)

    RUN chmod +x .../triton/backends/nvidia/bin/ptxas

    This is a workaround for a known Triton packaging issue. Worth a comment explaining why so future maintainers don't remove it.


Source Code Changes

  1. Default use_async_envs flip from False → True (configs.py, factory.py)

    This changes the public API default for EnvConfig.create_envs() and make_env(). The actual behavior is guarded (use_async and n_envs > 1), so with n_envs=1 it's effectively a no-op. But for callers passing n_envs > 1 without specifying use_async_envs, this silently switches them from SyncVectorEnv to AsyncVectorEnv. This could surface latent serialization/fork issues.

    The test update (test_dispatch.py:90) correctly pins the test to use_async_envs=False, which confirms this default change is intentional. Just want to flag that this is a behavioral change beyond CI — it affects all library consumers. The PR description doesn't highlight this.


CI Scripts

  1. extract_task_descriptions.py — Clean implementation. The MetaWorld fallback (using cleaned task name as description) is reasonable.

  2. parse_eval_metrics.py — Handles both aggregated and overall output shapes well. One minor note: the NaN check on pc_success (line 57) is good defensive coding, but avg_sum_reward and eval_s aren't checked for NaN — worth being consistent.


Minor / Style

  1. The workflow has permissions: contents: read — good, minimal permissions.
  2. The concurrency group setup is correct for cancelling in-flight runs.
  3. Artifact uploads use if: always() — good, ensures artifacts are captured even on failure.

Summary

The main items to address:

  • Remove feat/benchmark-ci from the push branch trigger before merge
  • Fix the token exposure in the "Login to Hugging Face" step (or remove it entirely since it's a no-op)
  • Pin the uv version in both Dockerfiles for reproducibility
  • Document the use_async_envs default change in the PR description since it's a public API change

The rest are suggestions for robustness. Solid work overall — the isolated Docker approach and cache layering strategy are well thought out.

pkooij and others added 12 commits April 9, 2026 15:17
Anonymous pulls from Docker Hub are rate-limited to 100/6h, which
fails when multiple benchmark jobs pull nvidia/cuda in parallel.
Add docker/login-action step (conditional on DOCKERHUB_USERNAME var)
to authenticate and get 200 pulls/6h.

Setup: add DOCKERHUB_USERNAME as a repository variable and
DOCKERHUB_TOKEN as a repository secret in GitHub Settings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Step-level 'if' cannot reference 'secrets' directly. Expose the
secret via an env var and check that instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drop the conditional guard — other workflows (docker_publish,
full_tests) call docker/login-action unconditionally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GHA cache is capped at 10GB per repo — a single CUDA + PyTorch +
benchmark image is ~8GB so the cache evicts before it's reused.

Switch to type=registry which pushes cache layers to Docker Hub
(huggingface/lerobot-benchmark-cache:{libero,metaworld}). No size
limit, layers persist until explicitly deleted, and shared across
all runners and branches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Docker Hub CI token can't push to new repos. GHCR works out of the
box — GITHUB_TOKEN has automatic packages:write for the repo owner.

- Add GHCR login step (github.actor + GITHUB_TOKEN)
- Switch cache refs to ghcr.io/huggingface/lerobot/cache-benchmark
- Add packages:write at job level (not workflow, per zizmor)
- Keep Docker Hub login for pulling nvidia/cuda base image

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The huggingface org restricts GHCR package creation via GITHUB_TOKEN,
causing 403 on cache export. Remove all registry caching and GHCR
login. The Dockerfile layer split (deps vs source) still helps when
the runner has a warm Docker daemon.

Also fix the metaworld job which had a stale conditional Docker Hub
login and was missing the GHCR login entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Security:
- Remove "Login to Hugging Face" step — it was a no-op (ephemeral
  --rm container) that exposed the HF token via CLI argument in
  docker inspect / /proc/*/cmdline. The eval step already
  re-authenticates via env var.

Functional:
- Remove feat/benchmark-ci from push trigger branches (won't exist
  post-merge).

Dockerfiles:
- Pin uv to 0.8.0 (was unpinned, fetching whatever latest ships).
- Add comment explaining the chmod +x ptxas workaround (Triton
  packaging bug — ships ptxas without execute bit).

Scripts:
- parse_eval_metrics.py: add note that it runs on bare host and must
  stay stdlib-only.
- parse_eval_metrics.py: add NaN guard for avg_sum_reward and eval_s
  (was only guarding pc_success).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benchmark PRs (robomme, libero-plus, robocerebra, robotwin) target
feat/benchmark-ci, not main. Without this, the workflow never runs
on those PRs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ict)

uv sync --locked validates the entire lockfile across all extras.
Since robomme depends on mani-skill which pins numpy<2.0, and the
base project requires numpy>=2.0, the full lockfile is unsatisfiable.

Switch to uv pip install -e ".[libero,smolvla]" which only resolves
the requested extras for the current Python version and platform,
avoiding the cross-extra numpy conflict entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These use_async_envs default changes belong to the async-vector-env
PR (#3274), not this CI PR. Restore to match origin/main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions bot removed the tests and evaluation labels Apr 10, 2026
imstevenpmwork (Collaborator)

@claude Review this PR

claude bot commented Apr 10, 2026

Claude finished @imstevenpmwork's task in 2m 2s.


PR Review: Benchmark Smoke Tests with Isolated Docker Images

  • Read all changed files
  • Review CI workflow (benchmark_tests.yml)
  • Review Dockerfiles (LIBERO + MetaWorld)
  • Review CI scripts (extract_task_descriptions.py, parse_eval_metrics.py)
  • Post review findings

Overall this is solid CI work. Isolated Docker images per benchmark is the right architectural choice for incompatible dep trees, and the dep-layer / source-layer split in the Dockerfiles is good for Docker cache efficiency. The CI scripts are clean stdlib-only Python. Here's detailed feedback:


Must-fix before merge

1. feat/benchmark-ci in pull_request trigger — remove before merge (benchmark_tests.yml:44-45)

pull_request:
  branches:
    - main
    - feat/benchmark-ci  # ← remove

This feature branch won't exist post-merge, so this line becomes dead config. It also means PRs targeting feat/benchmark-ci trigger the workflow, which is only useful during development of this PR.



Security

2. Token handling in docker run is correct (benchmark_tests.yml:112-116)

The token is passed as -e HF_USER_TOKEN="${HF_USER_TOKEN}" and then referenced inside the container as \"\$HF_USER_TOKEN\" — this avoids embedding the literal secret value in the shell command visible to process listings. This pattern is sound. Good.

Note: the previous Claude review raised this as an issue, but examining the actual code, the token is properly passed as an env var to the container and dereferenced inside it. The escaped \$ ensures the shell expansion happens inside the container where the env var is set, not on the host.


Bugs / Correctness

3. parse_eval_metrics.py runs on the bare runner (benchmark_tests.yml:142-147)

python3 scripts/ci/parse_eval_metrics.py \
  --artifacts-dir /tmp/libero-artifacts \

This runs on the host, not in Docker. The script header already documents this constraint (line 28: "NOTE: This script runs on the bare CI runner (not inside Docker), so it must use only Python stdlib modules."), and the implementation is indeed stdlib-only (json, argparse, pathlib, math). This is fine as-is, but the constraint is fragile — a future contributor could easily add an import without realizing it. Consider a lightweight CI check or a more prominent warning. Low priority.

4. Container cleanup on failure (benchmark_tests.yml:133-138, 280-285)

The "Copy artifacts" steps have if: always(), which is correct. However, the main docker run step (e.g., line 109) uses a named container (--name libero-eval) without --rm. If the eval fails, the container persists until the copy step removes it (docker rm -f). This is intentional and correct for artifact extraction — just noting the design is sound.

5. No container cleanup for libero-train-smoke on build failure (benchmark_tests.yml:169-209)

If the "Run Libero train+eval smoke" step fails, libero-train-smoke container is cleaned up in the "Copy Libero train-smoke artifacts" step (line 204-209) which has if: always(). Good.


Dockerfiles

6. uv version is properly pinned (Dockerfile.benchmark.libero:46, Dockerfile.benchmark.metaworld:46)

Both Dockerfiles pin uv to 0.8.0:

curl -LsSf https://astral.sh/uv/0.8.0/install.sh | sh

This matches the workflow env var and is better than what Dockerfile.internal does (unpinned uv). Good.

7. Broken Triton issue URL (Dockerfile.benchmark.libero:93, Dockerfile.benchmark.metaworld:76)

# See: https://github.com/triton-lang/triton/issues/2due

The URL issues/2due is clearly truncated/corrupted — it should be a numeric issue ID. Note that Dockerfile.internal:81 has the same chmod workaround but no URL at all. Consider finding the correct issue link or removing the URL.

8. Duplicated system deps block is acceptable

Both benchmark Dockerfiles duplicate the identical apt-get block from Dockerfile.internal. For 2 benchmarks this is fine. If you anticipate scaling to many more, a shared base image or build arg pattern would reduce drift, but that's a future concern.

9. COPY setup.py — setup.py must exist (Dockerfile.benchmark.libero:71, Dockerfile.benchmark.metaworld:69)

Both Dockerfiles COPY setup.py. Verified that setup.py and MANIFEST.in exist at the repo root. Good.


CI Workflow Design

10. Jobs run independently (no matrix) — Each benchmark is a separate job, not a matrix strategy. This is the right call: the Docker images, eval commands, and artifact names all differ significantly between benchmarks, so a matrix would require extensive if/include overrides and be harder to read.

11. Concurrency group is correct (benchmark_tests.yml:62-63)

group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true

Same pattern as other workflows. Good.

12. upload-artifact@v4 actions are unpinned (benchmark_tests.yml:151, 159, etc.)

The actions/checkout is pinned to a SHA (@de0fac2e...), but actions/upload-artifact@v4 uses a tag. The other unpinned actions (docker/setup-buildx-action@v3, etc.) have # zizmor: ignore[unpinned-uses] comments, suggesting this is a deliberate policy. upload-artifact lacks this annotation — add the zizmor ignore comment for consistency, or pin it.

13. No if: env.HF_USER_TOKEN != '' guard (benchmark_tests.yml:109, 256)

Other workflows (e.g., full_tests.yml:93, fast_tests.yml:96) guard the HF login with:

if: env.HF_USER_TOKEN != ''

This PR's eval steps will fail hard if the secret isn't configured (e.g., on forks). Since the entire job depends on HF Hub access (model download), this might be intentional — there's no point running the eval without the token. But it would produce a confusing error message. Consider either adding the guard or adding a clear early-fail step that checks for the secret.


Python Scripts

14. extract_task_descriptions.py — Clean implementation

Stdlib-only, proper error handling, graceful fallback for unknown envs. The MetaWorld description extraction (cleaned task name) is reasonable given MetaWorld doesn't expose NL descriptions.

15. parse_eval_metrics.py — _safe_float defined inside function body (parse_eval_metrics.py:61-65)

The helper _safe_float is defined inside _extract_metrics on every call. This is fine functionally but slightly unusual — it could be a module-level helper. Very minor, not worth changing.

16. parse_eval_metrics.py — inconsistent NaN handling (parse_eval_metrics.py:67-73)

pc_success is checked for NaN explicitly (line 67), while avg_sum_reward and eval_s go through _safe_float which also handles NaN (line 65). The n_episodes field is not NaN-checked at all (an int shouldn't be NaN, but json.loads could return a float for it). The code is correct but the two different NaN-checking paths for different fields is slightly confusing. Consider running all four fields through _safe_float for consistency.
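For reference, a module-level helper with uniform NaN handling across all fields might look like this (semantics inferred from the review notes above; not the script's actual code):

```python
import math

def safe_float(value, default=0.0):
    """Coerce a metric to float; fall back to `default` for None,
    unparseable strings, and NaN, so every field shares one code path."""
    try:
        result = float(value)
    except (TypeError, ValueError):
        return default
    return default if math.isnan(result) else result
```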


Summary

Priority    Item  Description
Must-fix    #1    Remove feat/benchmark-ci from PR branch trigger
Suggestion  #7    Fix broken Triton issue URL in both Dockerfiles
Suggestion  #12   Add zizmor ignore comment to upload-artifact or pin it
Suggestion  #13   Add HF_USER_TOKEN guard or early-fail check for fork compatibility
Nit         #16   Consistent NaN handling across all metric fields

The architecture and approach are well thought out. The isolated Docker image pattern, dep/source layer split, and stdlib-only metrics parser are all good engineering decisions.

