feat(ci): benchmark smoke tests with isolated Docker images (LIBERO + MetaWorld) #3319
base: main
Changes from 55 commits
The new workflow, `.github/workflows/benchmark_tests.yml` (new file, +310 lines):

```yaml
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Integration tests: build an isolated Docker image per benchmark and run a
# 1-episode smoke eval. Each benchmark gets its own image so incompatible
# dependency trees (e.g. hf-libero vs metaworld==3.0.0) can never collide.
#
# To add a new benchmark:
#   1. Add docker/Dockerfile.benchmark.<name> (install only lerobot[<name>])
#   2. Copy one of the jobs below and adjust the image name and eval command.
name: Benchmark Integration Tests

on:
  # Run manually from the Actions tab
  workflow_dispatch:

  # Run every Monday at 02:00 UTC.
  schedule:
    - cron: "0 2 * * 1"

  push:
    branches:
      - feat/benchmark-ci
      - main
    paths:
      - "src/lerobot/envs/**"
      - "src/lerobot/scripts/lerobot_eval.py"
      - "docker/Dockerfile.benchmark.*"
      - ".github/workflows/benchmark_tests.yml"
      - "pyproject.toml"

  pull_request:
    branches:
      - main
    paths:
      - "src/lerobot/envs/**"
      - "src/lerobot/scripts/lerobot_eval.py"
      - "docker/Dockerfile.benchmark.*"
      - ".github/workflows/benchmark_tests.yml"
      - "pyproject.toml"

permissions:
  contents: read

env:
  UV_VERSION: "0.8.0"
  PYTHON_VERSION: "3.12"

# Cancel in-flight runs for the same branch/PR.
concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  # ── LIBERO ──────────────────────────────────────────────────────────────
  # Isolated image: lerobot[libero] only (hf-libero, dm-control, mujoco chain)
  libero-integration-test:
    name: Libero — build image + 1-episode eval
    runs-on:
      group: aws-g6-4xlarge-plus
    env:
      HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}

    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
          persist-credentials: false
          lfs: true

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
        with:
          cache-binary: false

      # Build the benchmark-specific image. Layer cache uses GHA cache (persists
      # across runners). The Dockerfile separates dep-install from source-copy,
      # so code-only changes skip the slow uv-sync layer entirely.
      - name: Build Libero benchmark image
        uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
        with:
          context: .
          file: docker/Dockerfile.benchmark.libero
          push: false
          load: true
          tags: lerobot-benchmark-libero:ci
          cache-from: type=gha,scope=benchmark-libero
          cache-to: type=gha,scope=benchmark-libero,mode=max

      - name: Login to Hugging Face
        if: env.HF_USER_TOKEN != ''
        run: |
          docker run --rm \
            -e HF_HOME=/tmp/hf \
            lerobot-benchmark-libero:ci \
            bash -c "hf auth login --token '$HF_USER_TOKEN' --add-to-git-credential && hf auth whoami"

      - name: Run Libero smoke eval (1 episode)
        run: |
```
Member comment: Add |
```yaml
          # Named container (no --rm) so we can docker cp artifacts out.
          # Output to /tmp inside the container — /artifacts doesn't exist
          # and user_lerobot cannot create root-level dirs.
          docker run --name libero-eval --gpus all \
            --shm-size=4g \
            -e HF_HOME=/tmp/hf \
            -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
            -e HF_HUB_DOWNLOAD_TIMEOUT=300 \
            lerobot-benchmark-libero:ci \
            bash -c "
              hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
              lerobot-eval \
                --policy.path=pepijn223/smolvla_libero \
                --env.type=libero \
                --env.task=libero_spatial \
                --eval.batch_size=1 \
                --eval.n_episodes=1 \
                --eval.use_async_envs=false \
                --policy.device=cuda \
                '--env.camera_name_mapping={\"agentview_image\": \"camera1\", \"robot0_eye_in_hand_image\": \"camera2\"}' \
                --policy.empty_cameras=1 \
                --output_dir=/tmp/eval-artifacts
              python scripts/ci/extract_task_descriptions.py \
                --env libero --task libero_spatial \
                --output /tmp/eval-artifacts/task_descriptions.json
            "
```
```yaml
      - name: Copy Libero artifacts from container
        if: always()
        run: |
          mkdir -p /tmp/libero-artifacts
          docker cp libero-eval:/tmp/eval-artifacts/. /tmp/libero-artifacts/ 2>/dev/null || true
          docker rm -f libero-eval || true

      - name: Parse Libero eval metrics
        if: always()
        run: |
          python3 scripts/ci/parse_eval_metrics.py \
            --artifacts-dir /tmp/libero-artifacts \
            --env libero \
            --task libero_spatial \
            --policy pepijn223/smolvla_libero

      - name: Upload Libero rollout video
        if: always()
        uses: actions/upload-artifact@v4
```
Member comment: Use tags with a `# zizmor: ignore[unpinned-uses]` comment.
```yaml
        with:
          name: libero-rollout-video
          path: /tmp/libero-artifacts/videos/
          if-no-files-found: warn

      - name: Upload Libero eval metrics
        if: always()
        uses: actions/upload-artifact@v4
```
Member comment: Use tags with a `# zizmor: ignore[unpinned-uses]` comment.
```yaml
        with:
          name: libero-metrics
          path: /tmp/libero-artifacts/metrics.json
          if-no-files-found: warn

      # ── LIBERO TRAIN+EVAL SMOKE ─────────────────────────────────────────
      # Train SmolVLA for 1 step (batch_size=1, dataset episode 0 only), then
      # immediately run eval inside the training loop (eval_freq=1, 1 episode).
      # Tests the full train→eval-within-training pipeline end-to-end.
      - name: Run Libero train+eval smoke (1 step, eval_freq=1)
        run: |
          docker run --name libero-train-smoke --gpus all \
            --shm-size=4g \
            -e HF_HOME=/tmp/hf \
            -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
            -e HF_HUB_DOWNLOAD_TIMEOUT=300 \
            lerobot-benchmark-libero:ci \
            bash -c "
              hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
              accelerate launch --num_processes=1 \$(which lerobot-train) \
                --policy.path=lerobot/smolvla_base \
                --policy.load_vlm_weights=true \
                --policy.scheduler_decay_steps=25000 \
                --policy.freeze_vision_encoder=false \
                --policy.train_expert_only=false \
                --dataset.repo_id=lerobot/libero \
                --dataset.episodes=[0] \
                --dataset.use_imagenet_stats=false \
                --env.type=libero \
                --env.task=libero_spatial \
                '--env.camera_name_mapping={\"agentview_image\": \"camera1\", \"robot0_eye_in_hand_image\": \"camera2\"}' \
                --policy.empty_cameras=1 \
                --output_dir=/tmp/train-smoke \
                --steps=1 \
                --batch_size=1 \
                --eval_freq=1 \
                --eval.n_episodes=1 \
                --eval.batch_size=1 \
                --eval.use_async_envs=false \
                --save_freq=1 \
                --policy.push_to_hub=false \
                '--rename_map={\"observation.images.image\": \"observation.images.camera1\", \"observation.images.image2\": \"observation.images.camera2\"}'
            "

      - name: Copy Libero train-smoke artifacts from container
        if: always()
        run: |
          mkdir -p /tmp/libero-train-smoke-artifacts
          docker cp libero-train-smoke:/tmp/train-smoke/. /tmp/libero-train-smoke-artifacts/ 2>/dev/null || true
          docker rm -f libero-train-smoke || true

      - name: Upload Libero train-smoke eval video
        if: always()
        uses: actions/upload-artifact@v4
```
Member comment: Use tags with a `# zizmor: ignore[unpinned-uses]` comment.
```yaml
        with:
          name: libero-train-smoke-video
          path: /tmp/libero-train-smoke-artifacts/eval/
          if-no-files-found: warn

  # ── METAWORLD ───────────────────────────────────────────────────────────
  # Isolated image: lerobot[metaworld] only (metaworld==3.0.0, mujoco>=3 chain)
  metaworld-integration-test:
    name: MetaWorld — build image + 1-episode eval
    runs-on:
      group: aws-g6-4xlarge-plus
    env:
      HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}

    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
          persist-credentials: false
          lfs: true

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
        with:
          cache-binary: false

      - name: Build MetaWorld benchmark image
        uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
        with:
          context: .
          file: docker/Dockerfile.benchmark.metaworld
          push: false
          load: true
          tags: lerobot-benchmark-metaworld:ci
          cache-from: type=gha,scope=benchmark-metaworld
          cache-to: type=gha,scope=benchmark-metaworld,mode=max

      - name: Run MetaWorld smoke eval (1 episode)
        run: |
          docker run --name metaworld-eval --gpus all \
            --shm-size=4g \
            -e HF_HOME=/tmp/hf \
            -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
            -e HF_HUB_DOWNLOAD_TIMEOUT=300 \
            lerobot-benchmark-metaworld:ci \
            bash -c "
              hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
              lerobot-eval \
                --policy.path=pepijn223/smolvla_metaworld \
                --env.type=metaworld \
                --env.task=metaworld-push-v3 \
                --eval.batch_size=1 \
                --eval.n_episodes=1 \
                --eval.use_async_envs=false \
                --policy.device=cuda \
                '--rename_map={\"observation.image\": \"observation.images.camera1\"}' \
                --policy.empty_cameras=2 \
                --output_dir=/tmp/eval-artifacts
              python scripts/ci/extract_task_descriptions.py \
                --env metaworld --task metaworld-push-v3 \
                --output /tmp/eval-artifacts/task_descriptions.json
            "

      - name: Copy MetaWorld artifacts from container
        if: always()
        run: |
          mkdir -p /tmp/metaworld-artifacts
          docker cp metaworld-eval:/tmp/eval-artifacts/. /tmp/metaworld-artifacts/ 2>/dev/null || true
          docker rm -f metaworld-eval || true

      - name: Parse MetaWorld eval metrics
        if: always()
        run: |
          python3 scripts/ci/parse_eval_metrics.py \
            --artifacts-dir /tmp/metaworld-artifacts \
            --env metaworld \
            --task metaworld-push-v3 \
            --policy pepijn223/smolvla_metaworld

      - name: Upload MetaWorld rollout video
        if: always()
        uses: actions/upload-artifact@v4
```
Member comment: Use tags with a `# zizmor: ignore[unpinned-uses]` comment.
```yaml
        with:
          name: metaworld-rollout-video
          path: /tmp/metaworld-artifacts/videos/
          if-no-files-found: warn

      - name: Upload MetaWorld eval metrics
        if: always()
        uses: actions/upload-artifact@v4
```
Member comment: Use tags with a `# zizmor: ignore[unpinned-uses]` comment.
```yaml
        with:
          name: metaworld-metrics
          path: /tmp/metaworld-artifacts/metrics.json
          if-no-files-found: warn
```
@claude review this pr
Claude finished @pkooij's task in 1m 25s
PR Review: Benchmark Smoke Tests with Isolated Docker Images
Overall this is well-structured CI work — isolated Docker images per benchmark is the right approach for incompatible dep trees, and the dep-layer / source-layer split in the Dockerfiles is good for cache efficiency. Here's my detailed feedback:
Security Issue
Token passed via CLI argument in the "Login to Hugging Face" step (benchmark_tests.yml:106): `bash -c "hf auth login --token '$HF_USER_TOKEN' --add-to-git-credential && hf auth whoami"`. The token is interpolated into the shell string via `'$HF_USER_TOKEN'` (single quotes in the outer `bash -c`). This exposes it in `docker inspect`, `/proc/*/cmdline`, and process listings. The eval steps correctly pass it as `-e HF_USER_TOKEN` and read it inside the container with `\$HF_USER_TOKEN`, but this login step embeds the literal value. Consider passing it as an env var here too.

Additionally, this login step (lines 100–106) appears to be a no-op — the container it runs is ephemeral (`--rm`), so the credential is immediately discarded. The actual eval step (line 119) re-authenticates independently. Consider removing this step entirely unless it's intended as a pre-flight check.
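A sketch of the env-var variant the review suggests, assuming the step otherwise stays the same. Docker's value-less `-e NAME` form copies the variable from the step environment, so the token never appears in the `docker run` argv:

```yaml
- name: Login to Hugging Face
  if: env.HF_USER_TOKEN != ''
  run: |
    docker run --rm \
      -e HF_HOME=/tmp/hf \
      -e HF_USER_TOKEN \
      lerobot-benchmark-libero:ci \
      bash -c 'hf auth login --token "$HF_USER_TOKEN" --add-to-git-credential && hf auth whoami'
```

The single-quoted outer string keeps `$HF_USER_TOKEN` unexpanded until the shell inside the container runs it.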
Bugs / Functional Issues

`parse_eval_metrics.py` runs on the host, not in Docker (benchmark_tests.yml:147-151)

This runs `python3` on the bare runner (Ubuntu), which won't have the project dependencies installed (no `uv sync` on the host). It only uses stdlib (`json`, `argparse`, `pathlib`, `math`), so it works today — but this is fragile. If any future import is added, it'll break silently. Worth adding a comment noting the stdlib-only constraint, or running it inside the container.

`feat/benchmark-ci` in push trigger paths (benchmark_tests.yml:34)

This feature branch trigger should be removed before merging to main — it'll cause the workflow to fire on pushes to a branch that won't exist post-merge.
Dockerfile Concerns

Unpinned `uv` version in Dockerfiles (Dockerfile.benchmark.libero:46, Dockerfile.benchmark.metaworld:46)

`curl -LsSf https://astral.sh/uv/install.sh | sh`

The workflow defines `UV_VERSION: "0.8.0"` but the Dockerfiles install whatever version `uv` ships by default. For reproducible builds, pin the version: `curl -LsSf https://astral.sh/uv/0.8.0/install.sh | sh`

Duplicated system deps block — the `apt-get` block in both Dockerfiles is identical (lines 35-50). Not a blocker, but if you anticipate adding more benchmarks, consider a shared base image or a multi-stage pattern to reduce drift between them.

`chmod +x ptxas` (Dockerfile.benchmark.libero:87, Dockerfile.benchmark.metaworld:72)

This is a workaround for a known Triton packaging issue. Worth a comment explaining why so future maintainers don't remove it.

Source Code Changes

Default `use_async_envs` flip from `False` → `True` (configs.py, factory.py)

This changes the public API default for `EnvConfig.create_envs()` and `make_env()`. The actual behavior is guarded (`use_async and n_envs > 1`), so with `n_envs=1` it's effectively a no-op. But for callers passing `n_envs > 1` without specifying `use_async_envs`, this silently switches them from `SyncVectorEnv` to `AsyncVectorEnv`. This could surface latent serialization/fork issues.

The test update (test_dispatch.py:90) correctly pins the test to `use_async_envs=False`, which confirms this default change is intentional. Just want to flag that this is a behavioral change beyond CI — it affects all library consumers. The PR description doesn't highlight this.
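The guard described above can be sketched in isolation. The helper below is hypothetical (the real `EnvConfig.create_envs()`/`make_env()` bodies are not in this diff); it only models the selection logic the review describes:

```python
def pick_vector_env_cls(use_async_envs: bool = True, n_envs: int = 1) -> str:
    """Model the guarded default: async vectorization is chosen only when
    BOTH the flag is set and more than one environment is requested."""
    if use_async_envs and n_envs > 1:
        return "AsyncVectorEnv"  # subprocess-based; can surface pickling/fork issues
    return "SyncVectorEnv"


# With the new default, single-env callers (like the CI smoke evals) are
# unaffected, but multi-env callers silently switch wrappers:
assert pick_vector_env_cls(n_envs=1) == "SyncVectorEnv"
assert pick_vector_env_cls(n_envs=4) == "AsyncVectorEnv"
assert pick_vector_env_cls(use_async_envs=False, n_envs=4) == "SyncVectorEnv"
```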
CI Scripts

`extract_task_descriptions.py` — Clean implementation. The MetaWorld fallback (using the cleaned task name as the description) is reasonable.

`parse_eval_metrics.py` — Handles both `aggregated` and `overall` output shapes well. One minor note: the NaN check on `pc_success` (line 57) is good defensive coding, but `avg_sum_reward` and `eval_s` aren't checked for NaN — worth being consistent.
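A consistent version of that guard might treat all three fields identically (illustrative only; the field names come from the review, not from the script itself):

```python
import math


def sanitize_metrics(raw: dict) -> dict:
    """Apply the same NaN guard to every numeric metric, not just pc_success."""
    out = dict(raw)
    for key in ("pc_success", "avg_sum_reward", "eval_s"):
        value = out.get(key)
        if isinstance(value, float) and math.isnan(value):
            out[key] = None  # surface as missing rather than propagating NaN
    return out
```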
Minor / Style

`permissions: contents: read` — good, minimal permissions. `if: always()` — good, ensures artifacts are captured even on failure.

Summary
The main items to address:
- Remove `feat/benchmark-ci` from the push branch trigger before merge
- Pin the `uv` version in both Dockerfiles for reproducibility
- Highlight the `use_async_envs` default change in the PR description since it's a public API change

The rest are suggestions for robustness. Solid work overall — the isolated Docker approach and cache layering strategy are well thought out.