feat(ci): benchmark smoke tests with isolated Docker images (LIBERO + MetaWorld) #3319
Open
pkooij wants to merge 6 commits into feat/async-vector-env from
Conversation
pkooij force-pushed from 35f18d4 to 566a77b
…dirs Running chmod on the host doesn't propagate into Docker due to UID/SELinux mismatch. Instead, spin up the image as root to mkdir+chmod from inside the container before the eval run mounts the same path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
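A sketch of the workflow step this commit describes (a CI fragment, not a standalone script; the image name and host path are illustrative):

```shell
# Pre-create the output dir from inside the container as root, so the
# permissions are applied under the container's own UID/SELinux context;
# a chmod on the host side does not carry over into the container.
docker run --rm --user root \
  -v "$GITHUB_WORKSPACE/eval-output:/tmp/eval-artifacts" \
  lerobot-benchmark-libero \
  sh -c 'mkdir -p /tmp/eval-artifacts && chmod -R 777 /tmp/eval-artifacts'
```

The eval run that follows mounts the same path and can then write into it.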
pkooij force-pushed from e89e6d9 to 927118e
Files created by user_lerobot inside the eval container inherit a restrictive umask, making them unreadable by the runner after the container exits. Add a post-eval 'docker run --user root' chmod step so upload-artifact can find the video files. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
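The failure mode is reproducible without Docker: under a restrictive umask, newly created files carry no group/other read bits, which is exactly what a post-eval `chmod` undoes (paths below are illustrative):

```shell
# A restrictive umask strips group/other permission bits from new files.
umask 077
tmpfile="/tmp/umask_demo_$$"
touch "$tmpfile"
ls -l "$tmpfile"    # -rw-------: unreadable to any other user (e.g. the runner)
# The post-eval fix: re-grant read access so upload-artifact can see the files.
chmod a+r "$tmpfile"
ls -l "$tmpfile"    # -rw-r--r--
rm -f "$tmpfile"
```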
Runs on the 1st of every month at 02:00 UTC in addition to the existing push/PR and manual dispatch triggers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
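The corresponding trigger in the workflow file would look roughly like this (a hedged sketch; the existing push/PR triggers and their path filters are abbreviated):

```yaml
on:
  schedule:
    - cron: "0 2 1 * *"   # 02:00 UTC on the 1st of every month
  workflow_dispatch:
  # push / pull_request triggers with path filters omitted here
```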
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bind mounts on these runners don't surface container-written files on the host path (likely DinD/socket-mount setup). Switch to named containers + docker cp, which copies directly through the daemon and lands files in the runner's accessible filesystem. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
user_lerobot cannot create /artifacts at the container root. Use /tmp/eval-artifacts (always writable) then docker cp it out. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
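Taken together, the last two fixes amount to a pattern like the following (a sketch with assumed names: `eval-run` is an arbitrary container name, the `lerobot-eval` flags mirror the manual run in the PR description, and how the eval's output directory is configured is omitted):

```shell
# Write artifacts to /tmp/eval-artifacts (always writable by user_lerobot,
# unlike /artifacts at the container root), then copy them out through the
# daemon with docker cp, which works even when bind mounts don't surface
# container-written files on these runners.
docker run --name eval-run --gpus all --shm-size=4g \
  lerobot-benchmark-libero \
  lerobot-eval --policy.path=pepijn223/smolvla_libero \
  --env.type=libero --env.task=libero_spatial \
  --eval.batch_size=1 --eval.n_episodes=1 \
  --policy.device=cuda
docker cp eval-run:/tmp/eval-artifacts ./libero-artifacts
docker rm eval-run
```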
Type / Scope
`docker/`, `.github/workflows/`, `docs/`

Summary / Motivation
Adds isolated CI smoke tests for LIBERO and MetaWorld, stacked on top of #3274. Each benchmark gets its own Docker image (`lerobot[<benchmark>,smolvla]`) so incompatible dependency trees (e.g. `hf-libero` vs `metaworld==3.0.0`) can never collide. A 1-episode eval runs on GPU runners to catch install-time regressions (broken deps, import errors, interactive prompts) before they reach users.

Related issues
What changed
- `docker/Dockerfile.benchmark.libero`: isolated LIBERO image; pre-creates `~/.libero/config.yaml` at build time to bypass the interactive stdin prompt on import
- `docker/Dockerfile.benchmark.metaworld`: isolated MetaWorld image
- `.github/workflows/benchmark_tests.yml`: one job per benchmark; triggers on `envs/**`, `lerobot_eval.py`, the Dockerfiles, and `pyproject.toml`; uploads rollout videos and `metrics.json` artifacts
- `docs/source/evaluation.mdx`: new `lerobot-eval` user guide
- `docs/source/adding_benchmarks.mdx`: step 7, CI smoke-test instructions for new benchmarks

How was this tested (or how to run locally)
Build and run manually:
```shell
docker build -f docker/Dockerfile.benchmark.libero -t lerobot-benchmark-libero .
docker run --rm --gpus all --shm-size=4g lerobot-benchmark-libero \
  lerobot-eval --policy.path=pepijn223/smolvla_libero \
  --env.type=libero --env.task=libero_spatial \
  --eval.batch_size=1 --eval.n_episodes=1 --eval.use_async_envs=false \
  --policy.device=cuda
```

Checklist (required before merge)
- Pre-commit checks pass (`pre-commit run -a`)
- Test suite passes (`pytest`)

TODO (post-merge validation)
- `parse_eval_metrics.py` writes correct `metrics.json` after a libero/metaworld eval
- `libero-metrics` / `metaworld-metrics` artifacts appear in the Actions UI
- … (`GITHUB_RO_TOKEN` Space secret to be set)

Reviewer notes