67 commits
- 69eec9c docs(benchmarks): add benchmark integration guide and standardize ben… (pkooij, Apr 2, 2026)
- 5ad4c8f refactor(envs): move dispatch logic from factory into EnvConfig subcl… (pkooij, Apr 3, 2026)
- bfa0a0f docs(benchmarks): clean up adding-benchmarks guide for clarity (pkooij, Apr 3, 2026)
- 75d5e5b fix link (pkooij, Apr 3, 2026)
- 7abe5f7 fix task count (pkooij, Apr 3, 2026)
- 1fad71c fix: enable SmolVLA eval on LIBERO with custom camera mappings (pkooij, Apr 7, 2026)
- d8e0eaa fix: use direct AutoresetMode import for gymnasium compat (pkooij, Apr 7, 2026)
- 0ea6aac fix: handle gymnasium < 1.0 without AutoresetMode (pkooij, Apr 7, 2026)
- 27bbb6b refactor: revert policy changes, keep env-only camera mapping fixes (pkooij, Apr 7, 2026)
- 8e07cab Update docs/source/env_processor.mdx (pkooij, Apr 7, 2026)
- fd99209 feat(envs): lazy env init + AsyncVectorEnv as default for n_envs > 1 (pkooij, Apr 3, 2026)
- dbc8c2e fix: close envs between tasks to prevent worker process accumulation (pkooij, Apr 7, 2026)
- aebc5e2 fix(eval): use task_description instead of task for language conditio… (pkooij, Apr 7, 2026)
- 8a778c0 docs: update adding_benchmarks for async env changes (pkooij, Apr 7, 2026)
- 5ec6119 feat(eval): batch_size=auto + faster env loading (pkooij, Apr 7, 2026)
- 2c32c04 docs: add evaluation guide and update benchmarks doc (pkooij, Apr 7, 2026)
- 43abbcc docs(evaluation): remove benchmark table, rename section header (pkooij, Apr 7, 2026)
- 03e1901 perf(eval): shared memory, observation passthrough, task prefetch (pkooij, Apr 7, 2026)
- 12023f4 style: ruff format (pkooij, Apr 7, 2026)
- 9a6ab6a chore: revert env_processor.mdx changes (not part of this PR) (pkooij, Apr 7, 2026)
- 6e6f76d ci(benchmarks): add isolated integration tests for libero and metaworld (pkooij, Apr 7, 2026)
- 61e2be8 ci(benchmarks): pin action hashes and use uv sync --locked (pkooij, Apr 7, 2026)
- 07350f9 ci(benchmarks): trigger only on envs/ or lerobot_eval.py changes (pkooij, Apr 7, 2026)
- dfd09c0 fix(ci): set LIBERO_DATA_FOLDER to bypass interactive stdin prompt (pkooij, Apr 8, 2026)
- 42ef36e docs(benchmarks): add CI smoke test step to adding_benchmarks guide (pkooij, Apr 8, 2026)
- 841cbb0 fix(ci): pre-create libero config in Dockerfile to bypass stdin prompt (pkooij, Apr 8, 2026)
- c24687d fix(ci): use shell to create libero config instead of multiline pytho… (pkooij, Apr 8, 2026)
- 2420d20 fix(ci): point libero config to bundled package init_files (pkooij, Apr 8, 2026)
- 58a5bcb fix(ci): add smolvla extra to benchmark Dockerfiles (pkooij, Apr 8, 2026)
- f3853c9 fix(eval): render_frame covers _LazyAsyncVectorEnv (pkooij, Apr 8, 2026)
- e35b485 refactor(envs): remove unused _get_sub_env_attr helper (pkooij, Apr 8, 2026)
- 28d353e chore: apply prettier formatting to docs (pkooij, Apr 8, 2026)
- 527463c docs(env_processor): remove deprecated add_envs_task from pipeline ex… (pkooij, Apr 8, 2026)
- 606ed97 refactor(envs): remove __del__ from _LazyAsyncVectorEnv (pkooij, Apr 8, 2026)
- 93b99e4 fix(eval): prefetch next task's workers after close to avoid GPU memo… (pkooij, Apr 8, 2026)
- fe05e50 refactor(envs): move _LazyAsyncVectorEnv to utils and apply to metaworld (pkooij, Apr 8, 2026)
- c8c2e88 chore: remove out-of-scope benchmark/CI/docs files from PR (pkooij, Apr 8, 2026)
- f4bc9b5 chore: restore adding_benchmarks + test_dispatch, drop env_processor … (pkooij, Apr 8, 2026)
- 5bc90c7 docs(adding_benchmarks): remove CI smoke test step (coming in separat… (pkooij, Apr 8, 2026)
- 566a77b refactor(envs): remove unused add_envs_task (pkooij, Apr 8, 2026)
- 973bb7c style: fix prettier formatting in env_processor.mdx (pkooij, Apr 8, 2026)
- 927118e fix(ci): use root container chmod to fix PermissionError on artifact … (pkooij, Apr 8, 2026)
- a16f00c fix(ci): re-chmod artifacts after eval to fix unreadable files (pkooij, Apr 8, 2026)
- d8305ab feat(ci): add monthly schedule trigger for benchmark tests (pkooij, Apr 8, 2026)
- e8d029e fix(ci): change benchmark schedule from monthly to weekly (every Monday) (pkooij, Apr 8, 2026)
- 936b42e fix(ci): use docker cp instead of bind mounts for artifacts (pkooij, Apr 8, 2026)
- 0dd0a8f fix(ci): write eval output to /tmp inside container (pkooij, Apr 8, 2026)
- 3534331 feat(ci): add parse_eval_metrics step to benchmark workflow (pkooij, Apr 9, 2026)
- 17a5431 feat(ci): add Libero train+eval smoke test (1 step, eval_freq=1) (pkooij, Apr 9, 2026)
- d39a621 chore: merge main into feat/benchmark-ci-clean (pkooij, Apr 9, 2026)
- 192a53d feat(ci): extract task descriptions and embed in metrics artifact (pkooij, Apr 9, 2026)
- 9a9bc3b fix(ci): call extract_task_descriptions.py after eval in benchmark jobs (pkooij, Apr 9, 2026)
- c454d29 Merge branch 'main' into feat/benchmark-ci (pkooij, Apr 9, 2026)
- 415c504 fix(test): use SyncVectorEnv in test_base_create_envs (pkooij, Apr 9, 2026)
- 9a84ae7 perf(ci): split Dockerfile dep-install from source-copy for faster re… (pkooij, Apr 9, 2026)
- c713c7f fix(ci): add Docker Hub login to avoid pull rate limits (pkooij, Apr 9, 2026)
- 14f1e09 fix(ci): use existing DOCKERHUB_LEROBOT_USERNAME/PASSWORD secrets (pkooij, Apr 9, 2026)
- e72b168 fix(ci): use env context for secrets check in step if-condition (pkooij, Apr 9, 2026)
- 0490e97 fix(ci): simplify Docker Hub login to match existing workflows (pkooij, Apr 9, 2026)
- a8b6ecd fix(ci): switch Docker cache from type=gha to type=registry (pkooij, Apr 9, 2026)
- c3429aa fix(ci): use GHCR for Docker layer cache (Docker Hub push denied) (pkooij, Apr 9, 2026)
- 86c51a5 fix(ci): remove GHCR cache (org blocks GITHUB_TOKEN package writes) (pkooij, Apr 10, 2026)
- 58d4ecd Merge branch 'main' into feat/benchmark-ci (pkooij, Apr 10, 2026)
- c505a71 fix(ci): address PR review feedback for benchmark smoke tests (pkooij, Apr 10, 2026)
- 183fdb7 ci(benchmarks): trigger on PRs targeting feat/benchmark-ci (pkooij, Apr 10, 2026)
- dd84819 fix(docker): use uv pip install instead of uv sync (cross-extra confl… (pkooij, Apr 10, 2026)
- 9702f58 chore: revert configs.py, factory.py, test_dispatch.py to main (pkooij, Apr 10, 2026)
310 changes: 310 additions & 0 deletions .github/workflows/benchmark_tests.yml
@@ -0,0 +1,310 @@
# Copyright 2025 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
pkooij (Member, Author) commented:

@claude review this pr
claude bot commented on Apr 9, 2026:
Claude finished @pkooij's task in 1m 25s.


PR Review: Benchmark Smoke Tests with Isolated Docker Images

Overall this is well-structured CI work — isolated Docker images per benchmark is the right approach for incompatible dep trees, and the dep-layer / source-layer split in the Dockerfiles is good for cache efficiency. Here's my detailed feedback:


Security Issue

Token passed via CLI argument in "Login to Hugging Face" step (benchmark_tests.yml:106)

bash -c "hf auth login --token '$HF_USER_TOKEN' --add-to-git-credential && hf auth whoami"

The token is interpolated into the shell string via '$HF_USER_TOKEN' (single quotes in the outer bash -c). This exposes it in docker inspect, /proc/*/cmdline, and process listings. The eval steps correctly pass it as -e HF_USER_TOKEN and read it inside the container with \$HF_USER_TOKEN, but this login step embeds the literal value. Consider passing it as an env var here too:

docker run --rm \
  -e HF_HOME=/tmp/hf \
  -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
  lerobot-benchmark-libero:ci \
  bash -c 'hf auth login --token "$HF_USER_TOKEN" --add-to-git-credential && hf auth whoami'

Additionally, this login step (lines 100–106) appears to be a no-op — the container it runs is ephemeral (--rm), so the credential is immediately discarded. The actual eval step (line 119) re-authenticates independently. Consider removing this step entirely unless it's intended as a pre-flight check.


Bugs / Functional Issues

  1. parse_eval_metrics.py runs on the host, not in Docker (benchmark_tests.yml:147-151)

    python3 scripts/ci/parse_eval_metrics.py \
      --artifacts-dir /tmp/libero-artifacts \

    This runs python3 on the bare runner (Ubuntu), which won't have the project dependencies installed (no uv sync on the host). It only uses stdlib (json, argparse, pathlib, math), so it works today — but this is fragile: any future non-stdlib import will break the step. Worth adding a comment noting the stdlib-only constraint, or running the script inside the container.

  2. feat/benchmark-ci in push trigger paths (benchmark_tests.yml:34)

    push:
      branches:
        - feat/benchmark-ci
        - main

    This feature branch trigger should be removed before merging to main — it'll cause the workflow to fire on pushes to a branch that won't exist post-merge.
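To make the stdlib-only constraint concrete, here is a minimal parser sketch in the spirit of parse_eval_metrics.py. It is illustrative only: the `eval_info.json` filename and the `aggregated`/`overall` key names are assumptions, not the script's actual interface.

```python
# Hypothetical sketch (NOT the real parse_eval_metrics.py). Every import
# ships with CPython, so the script keeps working on the bare CI runner
# where no `uv sync` has been run.
import json
from pathlib import Path


def load_metrics(artifacts_dir: str) -> dict:
    """Read the eval output JSON, accepting either output shape."""
    data = json.loads((Path(artifacts_dir) / "eval_info.json").read_text())
    # Accept both an "aggregated" and an "overall" top-level shape;
    # fall back to the raw dict if neither key is present.
    return data.get("aggregated", data.get("overall", data))
```

A stdlib-only script like this can safely be invoked with the host `python3`; the moment a third-party import appears, the step has to move inside the Docker image.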


Dockerfile Concerns

  1. Unpinned uv version in Dockerfiles (Dockerfile.benchmark.libero:46, Dockerfile.benchmark.metaworld:46)

    curl -LsSf https://astral.sh/uv/install.sh | sh

    The workflow defines UV_VERSION: "0.8.0" but the Dockerfiles install whatever version uv ships by default. For reproducible builds, pin the version:

    curl -LsSf https://astral.sh/uv/0.8.0/install.sh | sh
  2. Duplicated system deps block — The apt-get block in both Dockerfiles is identical (lines 35-50). Not a blocker, but if you anticipate adding more benchmarks, consider a shared base image or a multi-stage pattern to reduce drift between them.

  3. chmod +x ptxas (Dockerfile.benchmark.libero:87, Dockerfile.benchmark.metaworld:72)

    RUN chmod +x .../triton/backends/nvidia/bin/ptxas

    This is a workaround for a known Triton packaging issue. Worth a comment explaining why so future maintainers don't remove it.


Source Code Changes

  1. Default use_async_envs flip from False → True (configs.py, factory.py)

    This changes the public API default for EnvConfig.create_envs() and make_env(). The actual behavior is guarded (use_async and n_envs > 1), so with n_envs=1 it's effectively a no-op. But for callers passing n_envs > 1 without specifying use_async_envs, this silently switches them from SyncVectorEnv to AsyncVectorEnv. This could surface latent serialization/fork issues.

    The test update (test_dispatch.py:90) correctly pins the test to use_async_envs=False, which confirms this default change is intentional. Just want to flag that this is a behavioral change beyond CI — it affects all library consumers. The PR description doesn't highlight this.


CI Scripts

  1. extract_task_descriptions.py — Clean implementation. The MetaWorld fallback (using cleaned task name as description) is reasonable.

  2. parse_eval_metrics.py — Handles both aggregated and overall output shapes well. One minor note: the NaN check on pc_success (line 57) is good defensive coding, but avg_sum_reward and eval_s aren't checked for NaN — worth being consistent.
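The consistency suggestion for the NaN checks could look like the following sketch; `sanitize` is a hypothetical helper and the metric dict is illustrative, not the script's actual output.

```python
import math


def sanitize(value):
    # Hypothetical helper: map NaN/inf to None so the JSON artifact stays
    # valid and downstream tooling sees an explicit "missing" marker
    # (json.dumps would otherwise emit non-standard NaN literals).
    if isinstance(value, float) and not math.isfinite(value):
        return None
    return value


# Apply the same check to every metric, not just pc_success.
metrics = {"pc_success": float("nan"), "avg_sum_reward": 12.5, "eval_s": 3.2}
clean = {name: sanitize(v) for name, v in metrics.items()}
# clean["pc_success"] is None; finite values pass through unchanged.
```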


Minor / Style

  1. The workflow has permissions: contents: read — good, minimal permissions.
  2. The concurrency group setup is correct for cancelling in-flight runs.
  3. Artifact uploads use if: always() — good, ensures artifacts are captured even on failure.

Summary

The main items to address:

  • Remove feat/benchmark-ci from the push branch trigger before merge
  • Fix the token exposure in the "Login to Hugging Face" step (or remove it entirely since it's a no-op)
  • Pin the uv version in both Dockerfiles for reproducibility
  • Document the use_async_envs default change in the PR description since it's a public API change

The rest are suggestions for robustness. Solid work overall — the isolated Docker approach and cache layering strategy are well thought out.

# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Integration tests: build an isolated Docker image per benchmark and run a
# 1-episode smoke eval. Each benchmark gets its own image so incompatible
# dependency trees (e.g. hf-libero vs metaworld==3.0.0) can never collide.
#
# To add a new benchmark:
# 1. Add docker/Dockerfile.benchmark.<name> (install only lerobot[<name>])
# 2. Copy one of the jobs below and adjust the image name and eval command.
name: Benchmark Integration Tests

on:
  # Run manually from the Actions tab
  workflow_dispatch:

  # Run every Monday at 02:00 UTC.
  schedule:
    - cron: "0 2 * * 1"

  push:
    branches:
      - main
    paths:
      - "src/lerobot/envs/**"
      - "src/lerobot/scripts/lerobot_eval.py"
      - "docker/Dockerfile.benchmark.*"
      - ".github/workflows/benchmark_tests.yml"
      - "pyproject.toml"

  pull_request:
    branches:
      - main
      - feat/benchmark-ci
    paths:
      - "src/lerobot/envs/**"
      - "src/lerobot/scripts/lerobot_eval.py"
      - "docker/Dockerfile.benchmark.*"
      - ".github/workflows/benchmark_tests.yml"
      - "pyproject.toml"

permissions:
  contents: read

env:
  UV_VERSION: "0.8.0"
  PYTHON_VERSION: "3.12"

# Cancel in-flight runs for the same branch/PR.
concurrency:
  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
  cancel-in-progress: true

jobs:
  # ── LIBERO ────────────────────────────────────────────────────────────────
  # Isolated image: lerobot[libero] only (hf-libero, dm-control, mujoco chain)
  libero-integration-test:
    name: Libero — build image + 1-episode eval
    runs-on:
      group: aws-g6-4xlarge-plus
    env:
      HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}

    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
          persist-credentials: false
          lfs: true

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
        with:
          cache-binary: false

      - name: Login to Docker Hub
        uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
        with:
          username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
          password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}

      # Build the benchmark-specific image. The Dockerfile separates dep-install
      # from source-copy, so code-only changes skip the slow uv-sync layer
      # when the runner has a warm Docker daemon cache.
      - name: Build Libero benchmark image
        uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
        with:
          context: .
          file: docker/Dockerfile.benchmark.libero
          push: false
          load: true
          tags: lerobot-benchmark-libero:ci

      - name: Run Libero smoke eval (1 episode)
        run: |
          # Named container (no --rm) so we can docker cp artifacts out.
          # Output to /tmp inside the container — /artifacts doesn't exist
          # and user_lerobot cannot create root-level dirs.
          docker run --name libero-eval --gpus all \
            --shm-size=4g \
            -e HF_HOME=/tmp/hf \
            -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
            -e HF_HUB_DOWNLOAD_TIMEOUT=300 \
            lerobot-benchmark-libero:ci \
            bash -c "
              hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
              lerobot-eval \
                --policy.path=pepijn223/smolvla_libero \
                --env.type=libero \
                --env.task=libero_spatial \
                --eval.batch_size=1 \
                --eval.n_episodes=1 \
                --eval.use_async_envs=false \
                --policy.device=cuda \
                '--env.camera_name_mapping={\"agentview_image\": \"camera1\", \"robot0_eye_in_hand_image\": \"camera2\"}' \
                --policy.empty_cameras=1 \
                --output_dir=/tmp/eval-artifacts
              python scripts/ci/extract_task_descriptions.py \
                --env libero --task libero_spatial \
                --output /tmp/eval-artifacts/task_descriptions.json
            "

      - name: Copy Libero artifacts from container
        if: always()
        run: |
          mkdir -p /tmp/libero-artifacts
          docker cp libero-eval:/tmp/eval-artifacts/. /tmp/libero-artifacts/ 2>/dev/null || true
          docker rm -f libero-eval || true

      - name: Parse Libero eval metrics
        if: always()
        run: |
          python3 scripts/ci/parse_eval_metrics.py \
            --artifacts-dir /tmp/libero-artifacts \
            --env libero \
            --task libero_spatial \
            --policy pepijn223/smolvla_libero

      - name: Upload Libero rollout video
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: libero-rollout-video
          path: /tmp/libero-artifacts/videos/
          if-no-files-found: warn

      - name: Upload Libero eval metrics
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: libero-metrics
          path: /tmp/libero-artifacts/metrics.json
          if-no-files-found: warn

      # ── LIBERO TRAIN+EVAL SMOKE ──────────────────────────────────────────────
      # Train SmolVLA for 1 step (batch_size=1, dataset episode 0 only), then
      # immediately run eval inside the training loop (eval_freq=1, 1 episode).
      # Tests the full train→eval-within-training pipeline end-to-end.
      - name: Run Libero train+eval smoke (1 step, eval_freq=1)
        run: |
          docker run --name libero-train-smoke --gpus all \
            --shm-size=4g \
            -e HF_HOME=/tmp/hf \
            -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
            -e HF_HUB_DOWNLOAD_TIMEOUT=300 \
            lerobot-benchmark-libero:ci \
            bash -c "
              hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
              accelerate launch --num_processes=1 \$(which lerobot-train) \
                --policy.path=lerobot/smolvla_base \
                --policy.load_vlm_weights=true \
                --policy.scheduler_decay_steps=25000 \
                --policy.freeze_vision_encoder=false \
                --policy.train_expert_only=false \
                --dataset.repo_id=lerobot/libero \
                --dataset.episodes=[0] \
                --dataset.use_imagenet_stats=false \
                --env.type=libero \
                --env.task=libero_spatial \
                '--env.camera_name_mapping={\"agentview_image\": \"camera1\", \"robot0_eye_in_hand_image\": \"camera2\"}' \
                --policy.empty_cameras=1 \
                --output_dir=/tmp/train-smoke \
                --steps=1 \
                --batch_size=1 \
                --eval_freq=1 \
                --eval.n_episodes=1 \
                --eval.batch_size=1 \
                --eval.use_async_envs=false \
                --save_freq=1 \
                --policy.push_to_hub=false \
                '--rename_map={\"observation.images.image\": \"observation.images.camera1\", \"observation.images.image2\": \"observation.images.camera2\"}'
            "

      - name: Copy Libero train-smoke artifacts from container
        if: always()
        run: |
          mkdir -p /tmp/libero-train-smoke-artifacts
          docker cp libero-train-smoke:/tmp/train-smoke/. /tmp/libero-train-smoke-artifacts/ 2>/dev/null || true
          docker rm -f libero-train-smoke || true

      - name: Upload Libero train-smoke eval video
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: libero-train-smoke-video
          path: /tmp/libero-train-smoke-artifacts/eval/
          if-no-files-found: warn

  # ── METAWORLD ─────────────────────────────────────────────────────────────
  # Isolated image: lerobot[metaworld] only (metaworld==3.0.0, mujoco>=3 chain)
  metaworld-integration-test:
    name: MetaWorld — build image + 1-episode eval
    runs-on:
      group: aws-g6-4xlarge-plus
    env:
      HF_USER_TOKEN: ${{ secrets.LEROBOT_HF_USER }}

    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
        with:
          persist-credentials: false
          lfs: true

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3 # zizmor: ignore[unpinned-uses]
        with:
          cache-binary: false

      - name: Login to Docker Hub
        uses: docker/login-action@v3 # zizmor: ignore[unpinned-uses]
        with:
          username: ${{ secrets.DOCKERHUB_LEROBOT_USERNAME }}
          password: ${{ secrets.DOCKERHUB_LEROBOT_PASSWORD }}

      - name: Build MetaWorld benchmark image
        uses: docker/build-push-action@v6 # zizmor: ignore[unpinned-uses]
        with:
          context: .
          file: docker/Dockerfile.benchmark.metaworld
          push: false
          load: true
          tags: lerobot-benchmark-metaworld:ci

      - name: Run MetaWorld smoke eval (1 episode)
        run: |
          docker run --name metaworld-eval --gpus all \
            --shm-size=4g \
            -e HF_HOME=/tmp/hf \
            -e HF_USER_TOKEN="${HF_USER_TOKEN}" \
            -e HF_HUB_DOWNLOAD_TIMEOUT=300 \
            lerobot-benchmark-metaworld:ci \
            bash -c "
              hf auth login --token \"\$HF_USER_TOKEN\" --add-to-git-credential 2>/dev/null || true
              lerobot-eval \
                --policy.path=pepijn223/smolvla_metaworld \
                --env.type=metaworld \
                --env.task=metaworld-push-v3 \
                --eval.batch_size=1 \
                --eval.n_episodes=1 \
                --eval.use_async_envs=false \
                --policy.device=cuda \
                '--rename_map={\"observation.image\": \"observation.images.camera1\"}' \
                --policy.empty_cameras=2 \
                --output_dir=/tmp/eval-artifacts
              python scripts/ci/extract_task_descriptions.py \
                --env metaworld --task metaworld-push-v3 \
                --output /tmp/eval-artifacts/task_descriptions.json
            "

      - name: Copy MetaWorld artifacts from container
        if: always()
        run: |
          mkdir -p /tmp/metaworld-artifacts
          docker cp metaworld-eval:/tmp/eval-artifacts/. /tmp/metaworld-artifacts/ 2>/dev/null || true
          docker rm -f metaworld-eval || true

      - name: Parse MetaWorld eval metrics
        if: always()
        run: |
          python3 scripts/ci/parse_eval_metrics.py \
            --artifacts-dir /tmp/metaworld-artifacts \
            --env metaworld \
            --task metaworld-push-v3 \
            --policy pepijn223/smolvla_metaworld

      - name: Upload MetaWorld rollout video
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: metaworld-rollout-video
          path: /tmp/metaworld-artifacts/videos/
          if-no-files-found: warn

      - name: Upload MetaWorld eval metrics
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: metaworld-metrics
          path: /tmp/metaworld-artifacts/metrics.json
          if-no-files-found: warn