feat(ci): benchmark smoke tests with isolated Docker images (LIBERO + MetaWorld) #3319

Open
pkooij wants to merge 67 commits into main from feat/benchmark-ci

Conversation


pkooij (Member) commented Apr 8, 2026

Type / Scope

  • Type: CI
  • Scope: docker/, .github/workflows/, docs/

Summary / Motivation

Adds isolated CI smoke tests for LIBERO and MetaWorld, stacked on top of #3274. Each benchmark gets its own Docker image (lerobot[<benchmark>,smolvla]) so incompatible dependency trees (e.g. hf-libero vs metaworld==3.0.0) can never collide. A 1-episode eval runs on GPU runners to catch install-time regressions (broken deps, import errors, interactive prompts) before they reach users.

Related issues

What changed

  • docker/Dockerfile.benchmark.libero — isolated LIBERO image; pre-creates ~/.libero/config.yaml at build time to bypass the interactive stdin prompt on import
  • docker/Dockerfile.benchmark.metaworld — isolated MetaWorld image
  • .github/workflows/benchmark_tests.yml — one job per benchmark; triggers on envs/**, lerobot_eval.py, Dockerfiles, pyproject.toml; uploads rollout videos and metrics.json artifacts
  • docs/source/evaluation.mdx — new lerobot-eval user guide
  • docs/source/adding_benchmarks.mdx — step 7: CI smoke test instructions for new benchmarks

How was this tested (or how to run locally)

Build and run manually:

docker build -f docker/Dockerfile.benchmark.libero -t lerobot-benchmark-libero .
docker run --rm --gpus all --shm-size=4g lerobot-benchmark-libero \
  lerobot-eval --policy.path=pepijn223/smolvla_libero \
    --env.type=libero --env.task=libero_spatial \
    --eval.batch_size=1 --eval.n_episodes=1 --eval.use_async_envs=false \
    --policy.device=cuda

Checklist (required before merge)

  • Linting/formatting run (pre-commit run -a)
  • All tests pass locally (pytest)
  • Documentation updated
  • CI is green

TODO (post-merge validation)

  • Verify parse_eval_metrics.py writes correct metrics.json after a libero/metaworld eval
  • Verify libero-metrics / metaworld-metrics artifacts appear in the Actions UI
  • Open lerobot/health-dashboard — confirm status table, charts, and videos load (requires GITHUB_RO_TOKEN Space secret to be set)

Reviewer notes

pkooij and others added 29 commits April 8, 2026 18:28
…chmark docs

Add a comprehensive guide for adding new benchmarks to LeRobot, and
refactor the existing LIBERO and Meta-World docs to follow the new
standardized template.

Made-with: Cursor
…asses

Replace hardcoded if/elif chains in factory.py with create_envs() and
get_env_processors() methods on EnvConfig. New benchmarks now only need
to register a config subclass — no factory.py edits required.

Net -23 lines: factory.py shrinks from ~200 to ~70 lines of logic.

Made-with: Cursor
Rewrite for simpler language, better structure, and easier navigation.
Move quick-reference table to the top, fold eval explanation into
architecture section, condense the doc template to a bulleted outline.

Made-with: Cursor
- Thread camera_name_mapping from LiberoEnv config through to gym envs
- Sync features_map with camera_name_mapping in LiberoEnv.__post_init__
- Fix render() to use first available camera instead of hardcoded "image"
- Handle non-dict final_info in rollout by falling back to info["is_success"]
- Add use_peft legacy field to SmolVLAConfig for checkpoint compat
- Add defaults to GR00TN15Config init=False fields for transformers 5.3

Made-with: Cursor
- Revert GR00T N1.5 default_factory/default changes (transformers compat)
- Revert SmolVLA use_peft legacy field
- Apply ruff formatting fixes
- camera_name_mapping stays entirely in env/eval layer (no policy changes)

Made-with: Cursor
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
LiberoEnv and MetaworldEnv previously allocated GPU resources (EGL context,
OpenGL framebuffer) in __init__, before AsyncVectorEnv's fork(). Worker
processes inherited stale GPU handles, causing EGL_BAD_CONTEXT crashes on
first render.

Fix: defer OffScreenRenderEnv / MT1 construction to _ensure_env(), called on
first reset() or step() inside the worker subprocess. Each worker creates its
own clean context after fork().

Also fixes lerobot_eval.py:170 (add_envs_task TODO): replace with
env.call("task") which works with both SyncVectorEnv and AsyncVectorEnv.

AsyncVectorEnv is now the default for n_envs > 1; auto-downgraded to
SyncVectorEnv when n_envs=1 (no benefit, less overhead).

Expected speedup: ~15-20x for LIBERO Spatial with batch_size=50.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
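The deferred-construction pattern described in this commit can be sketched as follows (class and method names here are illustrative, not lerobot's actual wrapper API):

```python
class LazyRenderEnv:
    """Defer GPU resource allocation (EGL context, framebuffer) until first use."""

    def __init__(self, make_backend):
        # make_backend would wrap e.g. OffScreenRenderEnv(...) in the real code
        self._make_backend = make_backend
        self._backend = None

    def _ensure_env(self):
        # Runs on first reset()/step(), i.e. inside the AsyncVectorEnv worker
        # after fork(), so each worker builds its own clean GPU context.
        if self._backend is None:
            self._backend = self._make_backend()
        return self._backend

    def reset(self, **kwargs):
        return self._ensure_env().reset(**kwargs)

    def step(self, action):
        return self._ensure_env().step(action)
```

Because `__init__` allocates nothing, the wrapper can be pickled and forked safely; the expensive backend only ever exists in the worker process that uses it.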
eval_policy_all never closed environments after each task completed,
causing AsyncVectorEnv worker processes to accumulate (N_tasks × n_envs).
This led to OOM, BrokenPipeError and EOFError on multi-task benchmarks.

Also fixes:
- AsyncVectorEnv compat in envs/utils.py (use get_attr/call instead of .envs)
- Tuple task handling in tokenizer_processor and lerobot_eval
- _LazyAsyncVectorEnv for deferred worker spawning in LIBERO

Made-with: Cursor
…ning

env.call("task") returns the LIBERO task name with underscores
(e.g. "pick_up_the_black_bowl_...") instead of the natural language
description ("pick up the black bowl ..."). The VLM tokenizes these
completely differently, causing 0.0 reward across all episodes.

Made-with: Cursor
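A minimal illustration of the mismatch (example strings only; the real names come from the LIBERO suite):

```python
# env.call("task") returns the registry-style name; the VLM was trained on
# natural-language instructions, so the two tokenize very differently.
task_name = "pick_up_the_black_bowl_between_the_plate_and_the_ramekin"
description = task_name.replace("_", " ")
print(description)  # pick up the black bowl between the plate and the ramekin
```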
- Replace add_envs_task reference with env.call("task_description")
- Update use_async_envs default to True
- Add note about lazy GPU init for AsyncVectorEnv compatibility

Made-with: Cursor
- batch_size=0 (default) auto-tunes based on CPU cores, capped by
  n_episodes and 64. Removes the need for users to guess the right
  value. The old batch_size > n_episodes error is replaced by silently
  clamping to n_episodes.
- _LazyAsyncVectorEnv accepts pre-computed spaces so only one temp env
  is created per suite (not per task). For libero_spatial (10 tasks)
  this avoids 9 redundant LiberoEnv instantiations during env setup.

Made-with: Cursor
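The auto-tuning rule can be sketched as follows (logic reconstructed from the commit message above, not the actual implementation):

```python
import os

def auto_batch_size(n_episodes: int, requested: int = 0, cap: int = 64) -> int:
    """batch_size=0 picks a value from the CPU core count, capped by
    n_episodes and 64; an explicit request larger than n_episodes is
    silently clamped instead of raising an error."""
    if requested <= 0:
        requested = os.cpu_count() or 1
    return max(1, min(requested, n_episodes, cap))
```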
- New docs/source/evaluation.mdx covering lerobot-eval usage, batch_size
  auto-tuning, AsyncVectorEnv performance, tuning tips, output format,
  multi-task evaluation, and programmatic usage.
- Add evaluation page to _toctree.yml under Benchmarks section.
- Update adding_benchmarks.mdx to reference batch_size auto default and
  link to the evaluation guide.

Made-with: Cursor
- AsyncVectorEnv now uses shared_memory=True for zero-copy observation transfer
- LiberoEnvConfig.gym_kwargs passes observation_height/width to the env
- eval_policy_all prefetches next task's workers while current task runs

Made-with: Cursor
Made-with: Cursor
Each benchmark gets its own Docker image (lerobot[libero] / lerobot[metaworld]
only) so incompatible dep trees cannot collide. A 1-episode smoke eval runs
per benchmark on GPU runners.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
libero/__init__.py calls input() to ask about a custom dataset path,
which raises EOFError when stdin is closed inside Docker. Setting
LIBERO_DATA_FOLDER skips the prompt entirely.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
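The guard being relied on looks roughly like this (a sketch of libero's described behavior; the function name and prompt text are assumptions):

```python
import os

def resolve_data_folder() -> str:
    # With LIBERO_DATA_FOLDER set, input() is never reached, so a closed
    # stdin inside Docker can no longer raise EOFError.
    folder = os.environ.get("LIBERO_DATA_FOLDER")
    if folder:
        return folder
    return input("Enter a custom dataset path: ")
```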
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
libero/__init__.py calls input() when ~/.libero/config.yaml is missing.
We write the config at image build time (without importing libero) so
the prompt never fires at runtime. Also trigger CI on pyproject.toml changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n -c

The multiline RUN python -c "..." was being parsed as Dockerfile
instructions. Use printf to write ~/.libero/config.yaml directly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The config was pointing to /tmp/libero_init which doesn't exist.
Use importlib.util.find_spec to locate the hf-libero package directory
and write paths to the actual bundled bddl_files/init_files/assets.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
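The locate-the-package approach can be sketched like this (subdirectory names follow the commit message; the helper name is hypothetical, and since `libero` may not be installed the usage example below points it at a stdlib package):

```python
import importlib.util
from pathlib import Path

def package_data_paths(package: str, subdirs=("bddl_files", "init_files", "assets")):
    """Resolve an installed package's root via find_spec and return config
    paths pointing at its bundled data directories."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise ModuleNotFoundError(package)
    root = Path(spec.origin).parent
    return {name: str(root / name) for name in subdirs}
```

Writing these resolved paths into `~/.libero/config.yaml` at build time means the config always matches wherever pip/uv actually installed the package.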
num2words (required by SmolVLM processor) is declared in lerobot[smolvla],
not lerobot[libero/metaworld]. Install both extras together.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Runs accelerate launch --num_processes=1 lerobot-train with:
- steps=1, batch_size=1, dataset.episodes=[0] (episode 0 only)
- eval_freq=1 so the training loop triggers eval after step 1
- eval.n_episodes=1, eval.use_async_envs=false

Tests the full train→eval-within-training pipeline in the existing
libero-benchmark-libero:ci image (no extra Docker build cost).
Uploads eval video from /tmp/train-smoke/eval/ as libero-train-smoke-video.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Base automatically changed from feat/async-vector-env to main April 9, 2026 08:29
Resolves conflict in lerobot_eval.py by taking explicit
(AttributeError, NotImplementedError) catches from main (#3274).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions bot added the CI, evaluation, and github_actions labels Apr 9, 2026
- Add scripts/ci/extract_task_descriptions.py: runs inside the benchmark
  Docker container (LIBERO/MetaWorld installed) after lerobot-eval and
  writes task_descriptions.json mapping task keys to NL instructions.
  LIBERO: uses libero.libero.benchmark to get suite.get_task(i).language.
  MetaWorld: formats task name as human-readable label.
- Call extraction at the end of each eval bash-c (|| true so never fatal).
- parse_eval_metrics.py reads task_descriptions.json and includes it in
  metrics.json so the health dashboard Space can label videos by task.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
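The artifact this step produces is just a JSON mapping from task keys to natural-language instructions; a minimal sketch of writing it (file name per the commit above, helper name hypothetical):

```python
import json
from pathlib import Path

def write_task_descriptions(artifacts_dir: str, mapping: dict) -> Path:
    # parse_eval_metrics.py later reads this file and folds it into
    # metrics.json so the dashboard can label videos by task.
    out = Path(artifacts_dir) / "task_descriptions.json"
    out.write_text(json.dumps(mapping, indent=2))
    return out
```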
cfg: EnvConfig | str,
n_envs: int = 1,
- use_async_envs: bool = False,
+ use_async_envs: bool = True,
Collaborator

@claude what is this doing?

pkooij and others added 3 commits April 9, 2026 14:53
The task descriptions were never populated in metrics.json because
extract_task_descriptions.py was never invoked. The script exists and
parse_eval_metrics.py already looks for its output — the call was
simply missing from the workflow.

Appends the extraction step to the existing bash -c block (runs inside
the container where libero/metaworld is installed) so task_descriptions.json
is written to the eval-artifacts dir before docker cp copies it out.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AsyncVectorEnv spawns new subprocesses that do not inherit the
in-process gym registration created by the test. Pass
use_async_envs=False since this test validates dispatch logic,
not async parallelism.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions bot added the tests label Apr 9, 2026
…builds

The dep-install layer (uv sync) now only depends on pyproject.toml,
uv.lock, and a minimal package stub — not the full src/ tree. Source
code changes only rebuild the final COPY layer (seconds, not minutes).

Also switch from type=local cache (lost on ephemeral runners) to
type=gha (persisted in GitHub Actions cache, shared across all runs).

Before: every src/ change → full uv sync rebuild (~8-10 min)
After:  src/-only change → cached dep layer, ~30s source copy

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Member Author

@claude review this pr

claude bot commented Apr 9, 2026

Claude finished @pkooij's task in 1m 25s.


PR Review: Benchmark Smoke Tests with Isolated Docker Images

Overall this is well-structured CI work — isolated Docker images per benchmark is the right approach for incompatible dep trees, and the dep-layer / source-layer split in the Dockerfiles is good for cache efficiency. Here's my detailed feedback:


Security Issue

Token passed via CLI argument in "Login to Hugging Face" step (benchmark_tests.yml:106)

bash -c "hf auth login --token '$HF_USER_TOKEN' --add-to-git-credential && hf auth whoami"

The token is interpolated into the shell string via '$HF_USER_TOKEN' (single quotes in the outer bash -c). This exposes it in docker inspect, /proc/*/cmdline, and process listings. The eval steps correctly pass it as -e HF_USER_TOKEN and read it inside the container with \$HF_USER_TOKEN, but this login step embeds the literal value. Consider passing it as an env var here too:

docker run --rm \
  -e HF_HOME=/tmp/hf \
  -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
  lerobot-benchmark-libero:ci \
  bash -c 'hf auth login --token "$HF_USER_TOKEN" --add-to-git-credential && hf auth whoami'

Additionally, this login step (lines 100–106) appears to be a no-op — the container it runs is ephemeral (--rm), so the credential is immediately discarded. The actual eval step (line 119) re-authenticates independently. Consider removing this step entirely unless it's intended as a pre-flight check.


Bugs / Functional Issues

  1. parse_eval_metrics.py runs on the host, not in Docker (benchmark_tests.yml:147-151)

    python3 scripts/ci/parse_eval_metrics.py \
      --artifacts-dir /tmp/libero-artifacts \

    This runs python3 on the bare runner (Ubuntu), which won't have the project dependencies installed (no uv sync on the host). It only uses stdlib (json, argparse, pathlib, math), so it works today — but this is fragile. If any future import is added, it'll break silently. Worth adding a comment noting the stdlib-only constraint, or running it inside the container.

  2. feat/benchmark-ci in push trigger paths (benchmark_tests.yml:34)

    push:
      branches:
        - feat/benchmark-ci
        - main

    This feature branch trigger should be removed before merging to main — it'll cause the workflow to fire on pushes to a branch that won't exist post-merge.


Dockerfile Concerns

  1. Unpinned uv version in Dockerfiles (Dockerfile.benchmark.libero:46, Dockerfile.benchmark.metaworld:46)

    curl -LsSf https://astral.sh/uv/install.sh | sh

    The workflow defines UV_VERSION: "0.8.0" but the Dockerfiles install whatever version uv ships by default. For reproducible builds, pin the version:

    curl -LsSf https://astral.sh/uv/0.8.0/install.sh | sh
  2. Duplicated system deps block — The apt-get block in both Dockerfiles is identical (lines 35-50). Not a blocker, but if you anticipate adding more benchmarks, consider a shared base image or a multi-stage pattern to reduce drift between them.

  3. chmod +x ptxas (Dockerfile.benchmark.libero:87, Dockerfile.benchmark.metaworld:72)

    RUN chmod +x .../triton/backends/nvidia/bin/ptxas

    This is a workaround for a known Triton packaging issue. Worth a comment explaining why so future maintainers don't remove it.


Source Code Changes

  1. Default use_async_envs flip from False → True (configs.py, factory.py)

    This changes the public API default for EnvConfig.create_envs() and make_env(). The actual behavior is guarded (use_async and n_envs > 1), so with n_envs=1 it's effectively a no-op. But for callers passing n_envs > 1 without specifying use_async_envs, this silently switches them from SyncVectorEnv to AsyncVectorEnv. This could surface latent serialization/fork issues.

    The test update (test_dispatch.py:90) correctly pins the test to use_async_envs=False, which confirms this default change is intentional. Just want to flag that this is a behavioral change beyond CI — it affects all library consumers. The PR description doesn't highlight this.


CI Scripts

  1. extract_task_descriptions.py — Clean implementation. The MetaWorld fallback (using cleaned task name as description) is reasonable.

  2. parse_eval_metrics.py — Handles both aggregated and overall output shapes well. One minor note: the NaN check on pc_success (line 57) is good defensive coding, but avg_sum_reward and eval_s aren't checked for NaN — worth being consistent.


Minor / Style

  1. The workflow has permissions: contents: read — good, minimal permissions.
  2. The concurrency group setup is correct for cancelling in-flight runs.
  3. Artifact uploads use if: always() — good, ensures artifacts are captured even on failure.

Summary

The main items to address:

  • Remove feat/benchmark-ci from the push branch trigger before merge
  • Fix the token exposure in the "Login to Hugging Face" step (or remove it entirely since it's a no-op)
  • Pin the uv version in both Dockerfiles for reproducibility
  • Document the use_async_envs default change in the PR description since it's a public API change

The rest are suggestions for robustness. Solid work overall — the isolated Docker approach and cache layering strategy are well thought out.

pkooij and others added 12 commits April 9, 2026 15:17
Anonymous pulls from Docker Hub are rate-limited to 100/6h, which
fails when multiple benchmark jobs pull nvidia/cuda in parallel.
Add docker/login-action step (conditional on DOCKERHUB_USERNAME var)
to authenticate and get 200 pulls/6h.

Setup: add DOCKERHUB_USERNAME as a repository variable and
DOCKERHUB_TOKEN as a repository secret in GitHub Settings.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Step-level 'if' cannot reference 'secrets' directly. Expose the
secret via an env var and check that instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Drop the conditional guard — other workflows (docker_publish,
full_tests) call docker/login-action unconditionally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GHA cache is capped at 10GB per repo — a single CUDA + PyTorch +
benchmark image is ~8GB so the cache evicts before it's reused.

Switch to type=registry which pushes cache layers to Docker Hub
(huggingface/lerobot-benchmark-cache:{libero,metaworld}). No size
limit, layers persist until explicitly deleted, and shared across
all runners and branches.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Docker Hub CI token can't push to new repos. GHCR works out of the
box — GITHUB_TOKEN has automatic packages:write for the repo owner.

- Add GHCR login step (github.actor + GITHUB_TOKEN)
- Switch cache refs to ghcr.io/huggingface/lerobot/cache-benchmark
- Add packages:write at job level (not workflow, per zizmor)
- Keep Docker Hub login for pulling nvidia/cuda base image

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The huggingface org restricts GHCR package creation via GITHUB_TOKEN,
causing 403 on cache export. Remove all registry caching and GHCR
login. The Dockerfile layer split (deps vs source) still helps when
the runner has a warm Docker daemon.

Also fix the metaworld job which had a stale conditional Docker Hub
login and was missing the GHCR login entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Security:
- Remove "Login to Hugging Face" step — it was a no-op (ephemeral
  --rm container) that exposed the HF token via CLI argument in
  docker inspect / /proc/*/cmdline. The eval step already
  re-authenticates via env var.

Functional:
- Remove feat/benchmark-ci from push trigger branches (won't exist
  post-merge).

Dockerfiles:
- Pin uv to 0.8.0 (was unpinned, fetching whatever latest ships).
- Add comment explaining the chmod +x ptxas workaround (Triton
  packaging bug — ships ptxas without execute bit).

Scripts:
- parse_eval_metrics.py: add note that it runs on bare host and must
  stay stdlib-only.
- parse_eval_metrics.py: add NaN guard for avg_sum_reward and eval_s
  (was only guarding pc_success).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Benchmark PRs (robomme, libero-plus, robocerebra, robotwin) target
feat/benchmark-ci, not main. Without this, the workflow never runs
on those PRs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ict)

uv sync --locked validates the entire lockfile across all extras.
Since robomme depends on mani-skill which pins numpy<2.0, and the
base project requires numpy>=2.0, the full lockfile is unsatisfiable.

Switch to uv pip install -e ".[libero,smolvla]" which only resolves
the requested extras for the current Python version and platform,
avoiding the cross-extra numpy conflict entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
These use_async_envs default changes belong to the async-vector-env
PR (#3274), not this CI PR. Restore to match origin/main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions bot removed the tests and evaluation labels Apr 10, 2026
imstevenpmwork (Collaborator)

@claude Review this PR

claude bot commented Apr 10, 2026

Claude finished @imstevenpmwork's task in 2m 2s.


PR Review: Benchmark Smoke Tests with Isolated Docker Images

  • Read all changed files
  • Review CI workflow (benchmark_tests.yml)
  • Review Dockerfiles (LIBERO + MetaWorld)
  • Review CI scripts (extract_task_descriptions.py, parse_eval_metrics.py)
  • Post review findings

Overall this is solid CI work. Isolated Docker images per benchmark is the right architectural choice for incompatible dep trees, and the dep-layer / source-layer split in the Dockerfiles is good for Docker cache efficiency. The CI scripts are clean stdlib-only Python. Here's detailed feedback:


Must-fix before merge

1. feat/benchmark-ci in pull_request trigger — remove before merge (benchmark_tests.yml:44-45)

pull_request:
  branches:
    - main
    - feat/benchmark-ci  # ← remove

This feature branch won't exist post-merge, so this line becomes dead config. It also means PRs targeting feat/benchmark-ci trigger the workflow, which is only useful during development of this PR.



Security

2. Token handling in docker run is correct (benchmark_tests.yml:112-116)

The token is passed as -e HF_USER_TOKEN="${HF_USER_TOKEN}" and then referenced inside the container as \"\$HF_USER_TOKEN\" — this avoids embedding the literal secret value in the shell command visible to process listings. This pattern is sound. Good.

Note: the previous Claude review raised this as an issue, but examining the actual code, the token is properly passed as an env var to the container and dereferenced inside it. The escaped \$ ensures the shell expansion happens inside the container where the env var is set, not on the host.


Bugs / Correctness

3. parse_eval_metrics.py runs on the bare runner (benchmark_tests.yml:142-147)

python3 scripts/ci/parse_eval_metrics.py \
  --artifacts-dir /tmp/libero-artifacts \

This runs on the host, not in Docker. The script header already documents this constraint (line 28: "NOTE: This script runs on the bare CI runner (not inside Docker), so it must use only Python stdlib modules."), and the implementation is indeed stdlib-only (json, argparse, pathlib, math). This is fine as-is, but the constraint is fragile — a future contributor could easily add an import without realizing it. Consider a lightweight CI check or a more prominent warning. Low priority.

4. Container cleanup on failure (benchmark_tests.yml:133-138, 280-285)

The "Copy artifacts" steps have if: always(), which is correct. However, the main docker run step (e.g., line 109) uses a named container (--name libero-eval) without --rm. If the eval fails, the container persists until the copy step removes it (docker rm -f). This is intentional and correct for artifact extraction — just noting the design is sound.

5. No container cleanup for libero-train-smoke on build failure (benchmark_tests.yml:169-209)

If the "Run Libero train+eval smoke" step fails, libero-train-smoke container is cleaned up in the "Copy Libero train-smoke artifacts" step (line 204-209) which has if: always(). Good.


Dockerfiles

6. uv version is properly pinned (Dockerfile.benchmark.libero:46, Dockerfile.benchmark.metaworld:46)

Both Dockerfiles pin uv to 0.8.0:

curl -LsSf https://astral.sh/uv/0.8.0/install.sh | sh

This matches the workflow env var and is better than what Dockerfile.internal does (unpinned uv). Good.

7. Broken Triton issue URL (Dockerfile.benchmark.libero:93, Dockerfile.benchmark.metaworld:76)

# See: https://github.com/triton-lang/triton/issues/2due

The URL issues/2due is clearly truncated/corrupted — it should be a numeric issue ID. Note that Dockerfile.internal:81 has the same chmod workaround but no URL at all. Consider finding the correct issue link or removing the URL.

8. Duplicated system deps block is acceptable

Both benchmark Dockerfiles duplicate the identical apt-get block from Dockerfile.internal. For 2 benchmarks this is fine. If you anticipate scaling to many more, a shared base image or build arg pattern would reduce drift, but that's a future concern.

9. COPY setup.py — setup.py must exist (Dockerfile.benchmark.libero:71, Dockerfile.benchmark.metaworld:69)

Both Dockerfiles COPY setup.py. Verified that setup.py and MANIFEST.in exist at the repo root. Good.


CI Workflow Design

10. Jobs run independently (no matrix) — Each benchmark is a separate job, not a matrix strategy. This is the right call: the Docker images, eval commands, and artifact names all differ significantly between benchmarks, so a matrix would require extensive if/include overrides and be harder to read.

11. Concurrency group is correct (benchmark_tests.yml:62-63)

group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
cancel-in-progress: true

Same pattern as other workflows. Good.

12. upload-artifact@v4 actions are unpinned (benchmark_tests.yml:151, 159, etc.)

The actions/checkout is pinned to a SHA (@de0fac2e...), but actions/upload-artifact@v4 uses a tag. The other unpinned actions (docker/setup-buildx-action@v3, etc.) have # zizmor: ignore[unpinned-uses] comments, suggesting this is a deliberate policy. upload-artifact lacks this annotation — add the zizmor ignore comment for consistency, or pin it.

13. No if: env.HF_USER_TOKEN != '' guard (benchmark_tests.yml:109, 256)

Other workflows (e.g., full_tests.yml:93, fast_tests.yml:96) guard the HF login with:

if: env.HF_USER_TOKEN != ''

This PR's eval steps will fail hard if the secret isn't configured (e.g., on forks). Since the entire job depends on HF Hub access (model download), this might be intentional — there's no point running the eval without the token. But it would produce a confusing error message. Consider either adding the guard or adding a clear early-fail step that checks for the secret.


Python Scripts

14. extract_task_descriptions.py — Clean implementation

Stdlib-only, proper error handling, graceful fallback for unknown envs. The MetaWorld description extraction (cleaned task name) is reasonable given MetaWorld doesn't expose NL descriptions.

15. parse_eval_metrics.py — _safe_float defined inside function body (parse_eval_metrics.py:61-65)

The helper _safe_float is defined inside _extract_metrics on every call. This is fine functionally but slightly unusual — it could be a module-level helper. Very minor, not worth changing.

16. parse_eval_metrics.py — inconsistent NaN handling (parse_eval_metrics.py:67-73)

pc_success is checked for NaN explicitly (line 67), while avg_sum_reward and eval_s go through _safe_float which also handles NaN (line 65). The n_episodes field is not NaN-checked at all (an int shouldn't be NaN, but json.loads could return a float for it). The code is correct but the two different NaN-checking paths for different fields is slightly confusing. Consider running all four fields through _safe_float for consistency.
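For reference, a module-level helper with uniform NaN handling across all fields might look like this (semantics inferred from the review notes above; not the script's actual code):

```python
import math

def safe_float(value, default=0.0):
    """Coerce a metric to float; fall back to `default` for None,
    unparseable strings, and NaN, so every field shares one code path."""
    try:
        result = float(value)
    except (TypeError, ValueError):
        return default
    return default if math.isnan(result) else result
```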


Summary

Priority    Item  Description
Must-fix    #1    Remove feat/benchmark-ci from PR branch trigger
Suggestion  #7    Fix broken Triton issue URL in both Dockerfiles
Suggestion  #12   Add zizmor ignore comment to upload-artifact or pin it
Suggestion  #13   Add HF_USER_TOKEN guard or early-fail check for fork compatibility
Nit         #16   Consistent NaN handling across all metric fields

The architecture and approach are well thought out. The isolated Docker image pattern, dep/source layer split, and stdlib-only metrics parser are all good engineering decisions.

