feat(ci): benchmark smoke tests with isolated Docker images (LIBERO + MetaWorld) #3319

Open
pkooij wants to merge 6 commits into feat/async-vector-env from feat/benchmark-ci

Conversation


@pkooij pkooij commented Apr 8, 2026

Type / Scope

  • Type: CI
  • Scope: docker/, .github/workflows/, docs/

Summary / Motivation

Adds isolated CI smoke tests for LIBERO and MetaWorld, stacked on top of #3274. Each benchmark gets its own Docker image (lerobot[<benchmark>,smolvla]) so incompatible dependency trees (e.g. hf-libero vs metaworld==3.0.0) can never collide. A 1-episode eval runs on GPU runners to catch install-time regressions (broken deps, import errors, interactive prompts) before they reach users.

Related issues

What changed

  • docker/Dockerfile.benchmark.libero — isolated LIBERO image; pre-creates ~/.libero/config.yaml at build time to bypass the interactive stdin prompt on import
  • docker/Dockerfile.benchmark.metaworld — isolated MetaWorld image
  • .github/workflows/benchmark_tests.yml — one job per benchmark; triggers on envs/**, lerobot_eval.py, Dockerfiles, pyproject.toml; uploads rollout videos and metrics.json artifacts
  • docs/source/evaluation.mdx — new lerobot-eval user guide
  • docs/source/adding_benchmarks.mdx — step 7: CI smoke test instructions for new benchmarks

How was this tested (or how to run locally)

Build and run manually:

docker build -f docker/Dockerfile.benchmark.libero -t lerobot-benchmark-libero .
docker run --rm --gpus all --shm-size=4g lerobot-benchmark-libero \
  lerobot-eval --policy.path=pepijn223/smolvla_libero \
    --env.type=libero --env.task=libero_spatial \
    --eval.batch_size=1 --eval.n_episodes=1 --eval.use_async_envs=false \
    --policy.device=cuda
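In CI, the same eval is driven by the workflow instead of a manual `docker run`. A hypothetical excerpt of what one job in `.github/workflows/benchmark_tests.yml` could look like; the trigger paths, artifact names, and `docker cp` source path come from this PR's description and commit messages, while the runner label, action versions, and exact path globs are assumptions:

```yaml
# Hypothetical excerpt -- one job per benchmark; details are illustrative.
on:
  pull_request:
    paths:
      - "**/envs/**"
      - "**/lerobot_eval.py"
      - "docker/Dockerfile.benchmark.*"
      - "pyproject.toml"
  workflow_dispatch:

jobs:
  libero-smoke:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - run: docker build -f docker/Dockerfile.benchmark.libero -t lerobot-benchmark-libero .
      - run: |
          docker run --name libero-eval --gpus all --shm-size=4g \
            lerobot-benchmark-libero lerobot-eval \
            --policy.path=pepijn223/smolvla_libero \
            --env.type=libero --env.task=libero_spatial \
            --eval.batch_size=1 --eval.n_episodes=1 --policy.device=cuda
      # Bind mounts don't surface container-written files on these runners,
      # so copy through the daemon with a named container + docker cp
      - run: docker cp libero-eval:/tmp/eval-artifacts ./eval-artifacts
      - uses: actions/upload-artifact@v4
        with:
          name: libero-metrics
          path: eval-artifacts
```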

Checklist (required before merge)

  • Linting/formatting run (pre-commit run -a)
  • All tests pass locally (pytest)
  • Documentation updated
  • CI is green

TODO (post-merge validation)

  • Verify parse_eval_metrics.py writes correct metrics.json after a libero/metaworld eval
  • Verify libero-metrics / metaworld-metrics artifacts appear in the Actions UI
  • Open lerobot/health-dashboard — confirm status table, charts, and videos load (requires GITHUB_RO_TOKEN Space secret to be set)
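As a rough illustration of what the metrics-parsing step above needs to produce: a minimal sketch, assuming the eval writes an info JSON with an `aggregated` section. The function name matches `parse_eval_metrics.py` from this PR, but the input schema and output field names here are hypothetical.

```python
import json
from pathlib import Path


def parse_eval_metrics(eval_info_path: str, out_path: str) -> dict:
    """Reduce a lerobot-eval output JSON to a small metrics.json for the
    dashboard. Key names are assumed, not the real schema."""
    info = json.loads(Path(eval_info_path).read_text())
    agg = info.get("aggregated", {})
    metrics = {
        "success_rate": agg.get("pc_success", 0.0),
        "avg_sum_reward": agg.get("avg_sum_reward", 0.0),
        "n_episodes": agg.get("n_episodes", 0),
    }
    # Write the reduced metrics next to the other eval artifacts
    Path(out_path).write_text(json.dumps(metrics, indent=2))
    return metrics
```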

Reviewer notes

@pkooij force-pushed the feat/async-vector-env branch 2 times, most recently from 35f18d4 to 566a77b on April 8, 2026 at 17:05
…dirs

Running chmod on the host doesn't propagate into Docker due to UID/SELinux
mismatch. Instead, spin up the image as root to mkdir+chmod from inside
the container before the eval run mounts the same path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@pkooij force-pushed the feat/benchmark-ci branch from e89e6d9 to 927118e on April 8, 2026 at 17:22
pkooij and others added 5 commits April 8, 2026 19:59
Files created by user_lerobot inside the eval container inherit a
restrictive umask, making them unreadable by the runner after the
container exits. Add a post-eval 'docker run --user root' chmod step
so upload-artifact can find the video files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Runs on the 1st of every month at 02:00 UTC in addition to the
existing push/PR and manual dispatch triggers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bind mounts on these runners don't surface container-written files on
the host path (likely DinD/socket-mount setup). Switch to named
containers + docker cp, which copies directly through the daemon and
lands files in the runner's accessible filesystem.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
user_lerobot cannot create /artifacts at the container root.
Use /tmp/eval-artifacts (always writable) then docker cp it out.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>