feat(ci): benchmark smoke tests with isolated Docker images (LIBERO + MetaWorld) #3319

Open
pkooij wants to merge 6 commits into feat/async-vector-env from feat/benchmark-ci

Conversation


@pkooij pkooij commented Apr 8, 2026

Type / Scope

  • Type: CI
  • Scope: docker/, .github/workflows/, docs/

Summary / Motivation

Adds isolated CI smoke tests for LIBERO and MetaWorld, stacked on top of #3274. Each benchmark gets its own Docker image (lerobot[<benchmark>,smolvla]) so incompatible dependency trees (e.g. hf-libero vs metaworld==3.0.0) can never collide. A 1-episode eval runs on GPU runners to catch install-time regressions (broken deps, import errors, interactive prompts) before they reach users.

Related issues

What changed

  • docker/Dockerfile.benchmark.libero — isolated LIBERO image; pre-creates ~/.libero/config.yaml at build time to bypass the interactive stdin prompt on import
  • docker/Dockerfile.benchmark.metaworld — isolated MetaWorld image
  • .github/workflows/benchmark_tests.yml — one job per benchmark; triggers on envs/**, lerobot_eval.py, Dockerfiles, pyproject.toml; uploads rollout videos and metrics.json artifacts
  • docs/source/evaluation.mdx — new lerobot-eval user guide
  • docs/source/adding_benchmarks.mdx — step 7: CI smoke test instructions for new benchmarks

How was this tested (or how to run locally)

Build and run manually:

docker build -f docker/Dockerfile.benchmark.libero -t lerobot-benchmark-libero .
docker run --rm --gpus all --shm-size=4g lerobot-benchmark-libero \
  lerobot-eval --policy.path=pepijn223/smolvla_libero \
    --env.type=libero --env.task=libero_spatial \
    --eval.batch_size=1 --eval.n_episodes=1 --eval.use_async_envs=false \
    --policy.device=cuda
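In CI, the same eval is driven by the workflow instead of a manual `docker run`. A hypothetical excerpt of what one job in `.github/workflows/benchmark_tests.yml` could look like; the trigger paths, artifact names, and `docker cp` source path come from this PR's description and commit messages, while the runner label, action versions, and exact path globs are assumptions:

```yaml
# Hypothetical excerpt -- one job per benchmark; details are illustrative.
on:
  pull_request:
    paths:
      - "**/envs/**"
      - "**/lerobot_eval.py"
      - "docker/Dockerfile.benchmark.*"
      - "pyproject.toml"
  workflow_dispatch:

jobs:
  libero-smoke:
    runs-on: [self-hosted, gpu]
    steps:
      - uses: actions/checkout@v4
      - run: docker build -f docker/Dockerfile.benchmark.libero -t lerobot-benchmark-libero .
      - run: |
          docker run --name libero-eval --gpus all --shm-size=4g \
            lerobot-benchmark-libero lerobot-eval \
            --policy.path=pepijn223/smolvla_libero \
            --env.type=libero --env.task=libero_spatial \
            --eval.batch_size=1 --eval.n_episodes=1 --policy.device=cuda
      # Bind mounts don't surface container-written files on these runners,
      # so copy through the daemon with a named container + docker cp
      - run: docker cp libero-eval:/tmp/eval-artifacts ./eval-artifacts
      - uses: actions/upload-artifact@v4
        with:
          name: libero-metrics
          path: eval-artifacts
```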

Checklist (required before merge)

  • Linting/formatting run (pre-commit run -a)
  • All tests pass locally (pytest)
  • Documentation updated
  • CI is green

TODO (post-merge validation)

  • Verify parse_eval_metrics.py writes correct metrics.json after a libero/metaworld eval
  • Verify libero-metrics / metaworld-metrics artifacts appear in the Actions UI
  • Open lerobot/health-dashboard — confirm status table, charts, and videos load (requires GITHUB_RO_TOKEN Space secret to be set)
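As a rough illustration of what the metrics-parsing step above needs to produce: a minimal sketch, assuming the eval writes an info JSON with an `aggregated` section. The function name matches `parse_eval_metrics.py` from this PR, but the input schema and output field names here are hypothetical.

```python
import json
from pathlib import Path


def parse_eval_metrics(eval_info_path: str, out_path: str) -> dict:
    """Reduce a lerobot-eval output JSON to a small metrics.json for the
    dashboard. Key names are assumed, not the real schema."""
    info = json.loads(Path(eval_info_path).read_text())
    agg = info.get("aggregated", {})
    metrics = {
        "success_rate": agg.get("pc_success", 0.0),
        "avg_sum_reward": agg.get("avg_sum_reward", 0.0),
        "n_episodes": agg.get("n_episodes", 0),
    }
    # Write the reduced metrics next to the other eval artifacts
    Path(out_path).write_text(json.dumps(metrics, indent=2))
    return metrics
```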

Reviewer notes

@pkooij force-pushed the feat/async-vector-env branch 2 times, most recently from 35f18d4 to 566a77b on April 8, 2026 at 17:05
…dirs

Running chmod on the host doesn't propagate into Docker due to UID/SELinux
mismatch. Instead, spin up the image as root to mkdir+chmod from inside
the container before the eval run mounts the same path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@pkooij force-pushed the feat/benchmark-ci branch from e89e6d9 to 927118e on April 8, 2026 at 17:22
pkooij and others added 5 commits April 8, 2026 19:59
Files created by user_lerobot inside the eval container inherit a
restrictive umask, making them unreadable by the runner after the
container exits. Add a post-eval 'docker run --user root' chmod step
so upload-artifact can find the video files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Runs on the 1st of every month at 02:00 UTC in addition to the
existing push/PR and manual dispatch triggers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bind mounts on these runners don't surface container-written files on
the host path (likely DinD/socket-mount setup). Switch to named
containers + docker cp, which copies directly through the daemon and
lands files in the runner's accessible filesystem.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
user_lerobot cannot create /artifacts at the container root.
Use /tmp/eval-artifacts (always writable) then docker cp it out.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>