feat(ci): benchmark smoke tests with isolated Docker images (LIBERO + MetaWorld) #3319
Open: pkooij wants to merge 67 commits into main from feat/benchmark-ci
Commits (67):
- 69eec9c docs(benchmarks): add benchmark integration guide and standardize ben…
- 5ad4c8f refactor(envs): move dispatch logic from factory into EnvConfig subcl…
- bfa0a0f docs(benchmarks): clean up adding-benchmarks guide for clarity
- 75d5e5b fix link
- 7abe5f7 fix task count
- 1fad71c fix: enable SmolVLA eval on LIBERO with custom camera mappings
- d8e0eaa fix: use direct AutoresetMode import for gymnasium compat
- 0ea6aac fix: handle gymnasium < 1.0 without AutoresetMode
- 27bbb6b refactor: revert policy changes, keep env-only camera mapping fixes
- 8e07cab Update docs/source/env_processor.mdx
- fd99209 feat(envs): lazy env init + AsyncVectorEnv as default for n_envs > 1
- dbc8c2e fix: close envs between tasks to prevent worker process accumulation
- aebc5e2 fix(eval): use task_description instead of task for language conditio…
- 8a778c0 docs: update adding_benchmarks for async env changes
- 5ec6119 feat(eval): batch_size=auto + faster env loading
- 2c32c04 docs: add evaluation guide and update benchmarks doc
- 43abbcc docs(evaluation): remove benchmark table, rename section header
- 03e1901 perf(eval): shared memory, observation passthrough, task prefetch
- 12023f4 style: ruff format
- 9a6ab6a chore: revert env_processor.mdx changes (not part of this PR)
- 6e6f76d ci(benchmarks): add isolated integration tests for libero and metaworld
- 61e2be8 ci(benchmarks): pin action hashes and use uv sync --locked
- 07350f9 ci(benchmarks): trigger only on envs/ or lerobot_eval.py changes
- dfd09c0 fix(ci): set LIBERO_DATA_FOLDER to bypass interactive stdin prompt
- 42ef36e docs(benchmarks): add CI smoke test step to adding_benchmarks guide
- 841cbb0 fix(ci): pre-create libero config in Dockerfile to bypass stdin prompt
- c24687d fix(ci): use shell to create libero config instead of multiline pytho…
- 2420d20 fix(ci): point libero config to bundled package init_files
- 58a5bcb fix(ci): add smolvla extra to benchmark Dockerfiles
- f3853c9 fix(eval): render_frame covers _LazyAsyncVectorEnv
- e35b485 refactor(envs): remove unused _get_sub_env_attr helper
- 28d353e chore: apply prettier formatting to docs
- 527463c docs(env_processor): remove deprecated add_envs_task from pipeline ex…
- 606ed97 refactor(envs): remove __del__ from _LazyAsyncVectorEnv
- 93b99e4 fix(eval): prefetch next task's workers after close to avoid GPU memo…
- fe05e50 refactor(envs): move _LazyAsyncVectorEnv to utils and apply to metaworld
- c8c2e88 chore: remove out-of-scope benchmark/CI/docs files from PR
- f4bc9b5 chore: restore adding_benchmarks + test_dispatch, drop env_processor …
- 5bc90c7 docs(adding_benchmarks): remove CI smoke test step (coming in separat…
- 566a77b refactor(envs): remove unused add_envs_task
- 973bb7c style: fix prettier formatting in env_processor.mdx
- 927118e fix(ci): use root container chmod to fix PermissionError on artifact …
- a16f00c fix(ci): re-chmod artifacts after eval to fix unreadable files
- d8305ab feat(ci): add monthly schedule trigger for benchmark tests
- e8d029e fix(ci): change benchmark schedule from monthly to weekly (every Monday)
- 936b42e fix(ci): use docker cp instead of bind mounts for artifacts
- 0dd0a8f fix(ci): write eval output to /tmp inside container
- 3534331 feat(ci): add parse_eval_metrics step to benchmark workflow
- 17a5431 feat(ci): add Libero train+eval smoke test (1 step, eval_freq=1)
- d39a621 chore: merge main into feat/benchmark-ci-clean
- 192a53d feat(ci): extract task descriptions and embed in metrics artifact
- 9a9bc3b fix(ci): call extract_task_descriptions.py after eval in benchmark jobs
- c454d29 Merge branch 'main' into feat/benchmark-ci
- 415c504 fix(test): use SyncVectorEnv in test_base_create_envs
- 9a84ae7 perf(ci): split Dockerfile dep-install from source-copy for faster re…
- c713c7f fix(ci): add Docker Hub login to avoid pull rate limits
- 14f1e09 fix(ci): use existing DOCKERHUB_LEROBOT_USERNAME/PASSWORD secrets
- e72b168 fix(ci): use env context for secrets check in step if-condition
- 0490e97 fix(ci): simplify Docker Hub login to match existing workflows
- a8b6ecd fix(ci): switch Docker cache from type=gha to type=registry
- c3429aa fix(ci): use GHCR for Docker layer cache (Docker Hub push denied)
- 86c51a5 fix(ci): remove GHCR cache (org blocks GITHUB_TOKEN package writes)
- 58d4ecd Merge branch 'main' into feat/benchmark-ci
- c505a71 fix(ci): address PR review feedback for benchmark smoke tests
- 183fdb7 ci(benchmarks): trigger on PRs targeting feat/benchmark-ci
- dd84819 fix(docker): use uv pip install instead of uv sync (cross-extra confl…
- 9702f58 chore: revert configs.py, factory.py, test_dispatch.py to main
pkooij File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
`.github/workflows/benchmark_tests.yml` (new file, 310 lines):

```yaml
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Integration tests: build an isolated Docker image per benchmark and run a
# 1-episode smoke eval. Each benchmark gets its own image so incompatible
# dependency trees (e.g. hf-libero vs metaworld==3.0.0) can never collide.
#
# To add a new benchmark:
#   1. Add docker/Dockerfile.benchmark.<name> (install only lerobot[<name>])
#   2. Copy one of the jobs below and adjust the image name and eval command.
name: Benchmark Integration Tests

on:
  # Run manually from the Actions tab
  workflow_dispatch:

  # Run every Monday at 02:00 UTC.
  schedule:
    - cron: "0 2 * * 1"

  push:
    branches:
      - main
    paths:
      - "src/lerobot/envs/**"
      - "src/lerobot/scripts/lerobot_eval.py"
      - "docker/Dockerfile.benchmark.*"
      - ".github/workflows/benchmark_tests.yml"
      - "pyproject.toml"

  pull_request:
    branches:
      - main
      - feat/benchmark-ci
    paths:
      - "src/lerobot/envs/**"
      - "src/lerobot/scripts/lerobot_eval.py"
      - "docker/Dockerfile.benchmark.*"
      - ".github/workflows/benchmark_tests.yml"
      - "pyproject.toml"

permissions:
  contents: read

env:
  UV_VERSION: "0.8.0"
  PYTHON_VERSION: "3.12"

# Cancel in-flight runs for the same branch/PR.
concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  # ── LIBERO ──────────────────────────────────────────────────────────────
  # Isolated image: lerobot[libero] only (hf-libero, dm-control, mujoco chain)
  libero-integration-test:
    name: Libero — build image + 1-episode eval
    runs-on:
      group: aws-g6-4xlarge-plus
    env:
      HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}

    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
          persist-credentials: false
          lfs: true

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
        with:
          cache-binary: false

      - name: Login to Docker Hub
        uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
        with:
          username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
          password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}

      # Build the benchmark-specific image. The Dockerfile separates dep-install
      # from source-copy, so code-only changes skip the slow uv-sync layer
      # when the runner has a warm Docker daemon cache.
      - name: Build Libero benchmark image
        uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
        with:
          context: .
          file: docker/Dockerfile.benchmark.libero
          push: false
          load: true
          tags: lerobot-benchmark-libero:ci

      - name: Run Libero smoke eval (1 episode)
        run: |
          # Named container (no --rm) so we can docker cp artifacts out.
          # Output to /tmp inside the container — /artifacts doesn't exist
          # and user_lerobot cannot create root-level dirs.
          docker run --name libero-eval --gpus all \
            --shm-size=4g \
            -e HF_HOME=/tmp/hf \
            -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
            -e HF_HUB_DOWNLOAD_TIMEOUT=300 \
            lerobot-benchmark-libero:ci \
            bash -c "
              hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
              lerobot-eval \
                --policy.path=pepijn223/smolvla_libero \
                --env.type=libero \
                --env.task=libero_spatial \
                --eval.batch_size=1 \
                --eval.n_episodes=1 \
                --eval.use_async_envs=false \
                --policy.device=cuda \
                '--env.camera_name_mapping={\"agentview_image\": \"camera1\", \"robot0_eye_in_hand_image\": \"camera2\"}' \
                --policy.empty_cameras=1 \
                --output_dir=/tmp/eval-artifacts
              python scripts/ci/extract_task_descriptions.py \
                --env libero --task libero_spatial \
                --output /tmp/eval-artifacts/task_descriptions.json
            "

      - name: Copy Libero artifacts from container
        if: always()
        run: |
          mkdir -p /tmp/libero-artifacts
          docker cp libero-eval:/tmp/eval-artifacts/. /tmp/libero-artifacts/ 2>/dev/null || true
          docker rm -f libero-eval || true

      - name: Parse Libero eval metrics
        if: always()
        run: |
          python3 scripts/ci/parse_eval_metrics.py \
            --artifacts-dir /tmp/libero-artifacts \
            --env libero \
            --task libero_spatial \
            --policy pepijn223/smolvla_libero

      - name: Upload Libero rollout video
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: libero-rollout-video
          path: /tmp/libero-artifacts/videos/
          if-no-files-found: warn

      - name: Upload Libero eval metrics
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: libero-metrics
          path: /tmp/libero-artifacts/metrics.json
          if-no-files-found: warn

      # ── LIBERO TRAIN+EVAL SMOKE ───────────────────────────────────────
      # Train SmolVLA for 1 step (batch_size=1, dataset episode 0 only) then
      # immediately run eval inside the training loop (eval_freq=1, 1 episode).
      # Tests the full train→eval-within-training pipeline end-to-end.
      - name: Run Libero train+eval smoke (1 step, eval_freq=1)
        run: |
          docker run --name libero-train-smoke --gpus all \
            --shm-size=4g \
            -e HF_HOME=/tmp/hf \
            -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
            -e HF_HUB_DOWNLOAD_TIMEOUT=300 \
            lerobot-benchmark-libero:ci \
            bash -c "
              hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
              accelerate launch --num_processes=1 \$(which lerobot-train) \
                --policy.path=lerobot/smolvla_base \
                --policy.load_vlm_weights=true \
                --policy.scheduler_decay_steps=25000 \
                --policy.freeze_vision_encoder=false \
                --policy.train_expert_only=false \
                --dataset.repo_id=lerobot/libero \
                --dataset.episodes=[0] \
                --dataset.use_imagenet_stats=false \
                --env.type=libero \
                --env.task=libero_spatial \
                '--env.camera_name_mapping={\"agentview_image\": \"camera1\", \"robot0_eye_in_hand_image\": \"camera2\"}' \
                --policy.empty_cameras=1 \
                --output_dir=/tmp/train-smoke \
                --steps=1 \
                --batch_size=1 \
                --eval_freq=1 \
                --eval.n_episodes=1 \
                --eval.batch_size=1 \
                --eval.use_async_envs=false \
                --save_freq=1 \
                --policy.push_to_hub=false \
                '--rename_map={\"observation.images.image\": \"observation.images.camera1\", \"observation.images.image2\": \"observation.images.camera2\"}'
            "

      - name: Copy Libero train-smoke artifacts from container
        if: always()
        run: |
          mkdir -p /tmp/libero-train-smoke-artifacts
          docker cp libero-train-smoke:/tmp/train-smoke/. /tmp/libero-train-smoke-artifacts/ 2>/dev/null || true
          docker rm -f libero-train-smoke || true

      - name: Upload Libero train-smoke eval video
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: libero-train-smoke-video
          path: /tmp/libero-train-smoke-artifacts/eval/
          if-no-files-found: warn

  # ── METAWORLD ───────────────────────────────────────────────────────────
  # Isolated image: lerobot[metaworld] only (metaworld==3.0.0, mujoco>=3 chain)
  metaworld-integration-test:
    name: MetaWorld — build image + 1-episode eval
    runs-on:
      group: aws-g6-4xlarge-plus
    env:
      HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}

    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
          persist-credentials: false
          lfs: true

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
        with:
          cache-binary: false

      - name: Login to Docker Hub
        uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
        with:
          username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
          password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}

      - name: Build MetaWorld benchmark image
        uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
        with:
          context: .
          file: docker/Dockerfile.benchmark.metaworld
          push: false
          load: true
          tags: lerobot-benchmark-metaworld:ci

      - name: Run MetaWorld smoke eval (1 episode)
        run: |
          docker run --name metaworld-eval --gpus all \
            --shm-size=4g \
            -e HF_HOME=/tmp/hf \
            -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
            -e HF_HUB_DOWNLOAD_TIMEOUT=300 \
            lerobot-benchmark-metaworld:ci \
            bash -c "
              hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
              lerobot-eval \
                --policy.path=pepijn223/smolvla_metaworld \
                --env.type=metaworld \
                --env.task=metaworld-push-v3 \
                --eval.batch_size=1 \
                --eval.n_episodes=1 \
                --eval.use_async_envs=false \
                --policy.device=cuda \
                '--rename_map={\"observation.image\": \"observation.images.camera1\"}' \
                --policy.empty_cameras=2 \
                --output_dir=/tmp/eval-artifacts
              python scripts/ci/extract_task_descriptions.py \
                --env metaworld --task metaworld-push-v3 \
                --output /tmp/eval-artifacts/task_descriptions.json
            "

      - name: Copy MetaWorld artifacts from container
        if: always()
        run: |
          mkdir -p /tmp/metaworld-artifacts
          docker cp metaworld-eval:/tmp/eval-artifacts/. /tmp/metaworld-artifacts/ 2>/dev/null || true
          docker rm -f metaworld-eval || true

      - name: Parse MetaWorld eval metrics
        if: always()
        run: |
          python3 scripts/ci/parse_eval_metrics.py \
            --artifacts-dir /tmp/metaworld-artifacts \
            --env metaworld \
            --task metaworld-push-v3 \
            --policy pepijn223/smolvla_metaworld

      - name: Upload MetaWorld rollout video
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: metaworld-rollout-video
          path: /tmp/metaworld-artifacts/videos/
          if-no-files-found: warn

      - name: Upload MetaWorld eval metrics
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: metaworld-metrics
          path: /tmp/metaworld-artifacts/metrics.json
          if-no-files-found: warn
```
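Both eval jobs above call `scripts/ci/extract_task_descriptions.py`, which is not part of this file. As a rough, hypothetical sketch of the interface the workflow assumes (env/task in, a `task_descriptions.json` out, with the task-name fallback the review mentions), using only the stdlib:

```python
# Hypothetical stand-in for scripts/ci/extract_task_descriptions.py (the real
# script lives elsewhere in this PR); it only mirrors the contract the
# workflow relies on. Names and JSON layout here are illustrative.
import json
from pathlib import Path


def fallback_description(task: str) -> str:
    """Derive a readable description from a task id, e.g. 'metaworld-push-v3' -> 'push'."""
    name = task
    for prefix in ("metaworld-", "libero_"):
        if name.startswith(prefix):
            name = name[len(prefix):]
    # Strip a trailing version suffix like '-v3'.
    stem, sep, version = name.rpartition("-v")
    if sep and version.isdigit():
        name = stem
    return name.replace("-", " ").replace("_", " ")


def write_task_descriptions(env: str, task: str, output: str) -> dict:
    """Embed the (possibly fallback) description in a JSON artifact."""
    payload = {
        "env": env,
        "task": task,
        "descriptions": [fallback_description(task)],
    }
    Path(output).write_text(json.dumps(payload, indent=2))
    return payload
```

For example, `write_task_descriptions("metaworld", "metaworld-push-v3", "/tmp/task_descriptions.json")` would record "push" as the description.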
@claude review this pr
Claude finished @pkooij's task in 1m 25s
PR Review: Benchmark Smoke Tests with Isolated Docker Images
Overall this is well-structured CI work — isolated Docker images per benchmark is the right approach for incompatible dep trees, and the dep-layer / source-layer split in the Dockerfiles is good for cache efficiency. Here's my detailed feedback:
Security Issue

**Token passed via CLI argument in the "Login to Hugging Face" step** (`benchmark_tests.yml:106`)

```
bash -c "hf auth login --token '$HF_USER_TOKEN' --add-to-git-credential && hf auth whoami"
```

The token is interpolated into the shell string via `'$HF_USER_TOKEN'` (single quotes in the outer `bash -c`). This exposes it in `docker inspect`, `/proc/*/cmdline`, and process listings. The eval steps correctly pass it as `-e HF_USER_TOKEN` and read it inside the container with `\$HF_USER_TOKEN`, but this login step embeds the literal value. Consider passing it as an env var here too.

Additionally, this login step (lines 100–106) appears to be a no-op — the container it runs in is ephemeral (`--rm`), so the credential is immediately discarded. The actual eval step (line 119) re-authenticates independently. Consider removing this step entirely unless it's intended as a pre-flight check.

Bugs / Functional Issues

**`parse_eval_metrics.py` runs on the host, not in Docker** (`benchmark_tests.yml:147-151`)

This runs `python3` on the bare runner (Ubuntu), which won't have the project dependencies installed (no `uv sync` on the host). It only uses the stdlib (`json`, `argparse`, `pathlib`, `math`), so it works today — but this is fragile: if any future import is added, it'll break silently. Worth adding a comment noting the stdlib-only constraint, or running it inside the container.

**`feat/benchmark-ci` in the `pull_request` trigger branches** (`benchmark_tests.yml:34`)

This feature-branch trigger should be removed before merging to main — it'll leave the workflow configured to fire for a branch that won't exist post-merge.
Dockerfile Concerns

**Unpinned `uv` version in Dockerfiles** (`Dockerfile.benchmark.libero:46`, `Dockerfile.benchmark.metaworld:46`)

```
curl -LsSf https://astral.sh/uv/install.sh | sh
```

The workflow defines `UV_VERSION: "0.8.0"`, but the Dockerfiles install whatever version `uv` ships by default. For reproducible builds, pin the version:

```
curl -LsSf https://astral.sh/uv/0.8.0/install.sh | sh
```

**Duplicated system-deps block** — the `apt-get` block in both Dockerfiles is identical (lines 35-50). Not a blocker, but if you anticipate adding more benchmarks, consider a shared base image or a multi-stage pattern to reduce drift between them.

**`chmod +x ptxas`** (`Dockerfile.benchmark.libero:87`, `Dockerfile.benchmark.metaworld:72`)

This is a workaround for a known Triton packaging issue. Worth a comment explaining why, so future maintainers don't remove it.

Source Code Changes

**Default `use_async_envs` flip from `False` to `True`** (`configs.py`, `factory.py`)

This changes the public API default for `EnvConfig.create_envs()` and `make_env()`. The actual behavior is guarded (`use_async and n_envs > 1`), so with `n_envs=1` it's effectively a no-op. But for callers passing `n_envs > 1` without specifying `use_async_envs`, this silently switches them from `SyncVectorEnv` to `AsyncVectorEnv`, which could surface latent serialization/fork issues.

The test update (`test_dispatch.py:90`) correctly pins the test to `use_async_envs=False`, which confirms the default change is intentional. Just to flag: this is a behavioral change beyond CI — it affects all library consumers — and the PR description doesn't highlight it.

CI Scripts

- `extract_task_descriptions.py` — clean implementation. The MetaWorld fallback (using the cleaned task name as the description) is reasonable.
- `parse_eval_metrics.py` — handles both the `aggregated` and `overall` output shapes well. One minor note: the `NaN` check on `pc_success` (line 57) is good defensive coding, but `avg_sum_reward` and `eval_s` aren't checked for NaN — worth being consistent.

Minor / Style

- `permissions: contents: read` — good, minimal permissions.
- `if: always()` on the artifact steps — good, ensures artifacts are captured even on failure.

The main items to address:

- Remove `feat/benchmark-ci` from the workflow's trigger branches before merge.
- Pin the `uv` version in both Dockerfiles for reproducibility.
- Flag the `use_async_envs` default change in the PR description, since it's a public API change.

The rest are suggestions for robustness. Solid work overall — the isolated Docker approach and cache-layering strategy are well thought out.
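On the `use_async_envs` point: the guard described in the review reduces to a small dispatch decision, which a self-contained sketch makes concrete (the function and return values here are illustrative, not LeRobot's actual API — the real `create_envs()`/`make_env()` construct gymnasium vector envs rather than returning names):

```python
# Illustrative model of the guarded async-env dispatch discussed above.

def vector_env_kind(n_envs: int, use_async_envs: bool = True) -> str:
    """Async vectorization only kicks in when there is more than one env,
    so flipping the default to True is a no-op for n_envs=1 but silently
    switches n_envs > 1 callers from SyncVectorEnv to AsyncVectorEnv."""
    if use_async_envs and n_envs > 1:
        return "AsyncVectorEnv"
    return "SyncVectorEnv"
```

This is why the CI eval commands pin `--eval.use_async_envs=false`: they opt out of the new default explicitly instead of relying on it.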