
feat(envs): lazy env init + AsyncVectorEnv as default for n_envs > 1 (#3274)

Merged
pkooij merged 44 commits into main from feat/async-vector-env on Apr 9, 2026

Conversation

pkooij (Member) commented Apr 3, 2026

Summary

LiberoEnv and MetaWorldEnv eagerly allocated GPU EGL contexts in __init__, making AsyncVectorEnv unusable (child processes inherit stale GPU handles → EGL_BAD_CONTEXT). All environments were also created upfront, causing OOM on multi-suite evaluations.

This PR:

  1. Defers GPU allocation to _ensure_env(), called on first reset()/step() inside worker subprocesses
  2. Adds _LazyAsyncVectorEnv — only one task's workers are alive at a time, preventing OOM
  3. Switches default to AsyncVectorEnv for parallel env stepping
  4. Fixes task descriptions for VLM policies (env.call("task_description") instead of broken add_envs_task)
  5. Auto-tunes batch_size based on available CPU cores (batch_size=0)
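The lazy-init pattern in (1) can be sketched as follows. This is a minimal illustration, not the actual LiberoEnv code; the class and field names here are hypothetical:

```python
class LazyRenderEnv:
    """Minimal sketch of deferred GPU allocation: __init__ stores only
    configuration, and the heavyweight renderer is built on first use,
    inside the worker process that will actually render."""

    def __init__(self, task_description: str):
        self.task_description = task_description
        self._env = None  # no EGL context / renderer allocated yet

    def _ensure_env(self):
        # In the real code this is where the rendering env would be
        # constructed, after AsyncVectorEnv has forked the worker.
        if self._env is None:
            self._env = {"task": self.task_description}
        return self._env

    def reset(self):
        return self._ensure_env()["task"]
```

Because nothing GPU-related happens in `__init__`, forked workers never inherit a stale context from the parent process.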

Benchmarks

All runs with pepijn223/smolvla_libero, single GPU.

libero_spatial (10 tasks, batch_size=10, n_episodes=10 → 100 rollouts)

| Branch | Wall time | GPU util | GPU mem |
|---|---|---|---|
| refactor/benchmark-dispatch | 396 s | 0–8% | ~2 GB |
| feat/async-vector-env | 189 s | 0–99% | ~10 GB |

→ 2.1× speedup

Full LIBERO (4 suites, 40 tasks, n_episodes=10 → 400 rollouts)

| Branch | batch_size | Wall time | GPU mem |
|---|---|---|---|
| refactor/benchmark-dispatch | 1 (10 OOMs) | 1475 s | ~22 GB |
| feat/async-vector-env | 10 | 996 s | ~10 GB |

→ 1.5× faster, half the GPU memory

Related

What changed

  • libero.py: lazy _ensure_env() + _LazyAsyncVectorEnv wrapper
  • metaworld.py: same lazy init pattern
  • configs.py: default use_async_envs=True, auto-downgrade to sync when n_envs=1
  • default.py: batch_size=0 (auto-tune), use_async_envs=True
  • lerobot_eval.py: env.call("task_description") fix, env.close() between tasks
  • utils.py: _get_sub_env_attr / _sub_env_has_attr for async-compatible attribute access

Tests

  • test_libero_lazy_init / test_metaworld_lazy_init
  • test_async_vector_env_libero / test_async_vector_env_metaworld
  • test_add_envs_task_async
  • test_single_env_uses_sync

s1lent4gnt (Member) previously approved these changes Apr 8, 2026

LGTM!

Base automatically changed from refactor/benchmark-dispatch to main April 8, 2026 15:49
@pkooij pkooij dismissed s1lent4gnt’s stale review April 8, 2026 15:49

The base branch was changed.

pkooij and others added 15 commits April 8, 2026 18:28
…chmark docs

Add a comprehensive guide for adding new benchmarks to LeRobot, and
refactor the existing LIBERO and Meta-World docs to follow the new
standardized template.

Made-with: Cursor
…asses

Replace hardcoded if/elif chains in factory.py with create_envs() and
get_env_processors() methods on EnvConfig. New benchmarks now only need
to register a config subclass — no factory.py edits required.

Net -23 lines: factory.py shrinks from ~200 to ~70 lines of logic.

Made-with: Cursor
Rewrite for simpler language, better structure, and easier navigation.
Move quick-reference table to the top, fold eval explanation into
architecture section, condense the doc template to a bulleted outline.

Made-with: Cursor
- Thread camera_name_mapping from LiberoEnv config through to gym envs
- Sync features_map with camera_name_mapping in LiberoEnv.__post_init__
- Fix render() to use first available camera instead of hardcoded "image"
- Handle non-dict final_info in rollout by falling back to info["is_success"]
- Add use_peft legacy field to SmolVLAConfig for checkpoint compat
- Add defaults to GR00TN15Config init=False fields for transformers 5.3

Made-with: Cursor
- Revert GR00T N1.5 default_factory/default changes (transformers compat)
- Revert SmolVLA use_peft legacy field
- Apply ruff formatting fixes
- camera_name_mapping stays entirely in env/eval layer (no policy changes)

Made-with: Cursor
Co-authored-by: Khalil Meftah <khalil.meftah@huggingface.co>
Signed-off-by: Pepijn <138571049+pkooij@users.noreply.github.com>
LiberoEnv and MetaworldEnv previously allocated GPU resources (EGL context,
OpenGL framebuffer) in __init__, before AsyncVectorEnv's fork(). Worker
processes inherited stale GPU handles, causing EGL_BAD_CONTEXT crashes on
first render.

Fix: defer OffScreenRenderEnv / MT1 construction to _ensure_env(), called on
first reset() or step() inside the worker subprocess. Each worker creates its
own clean context after fork().

Also fixes lerobot_eval.py:170 (add_envs_task TODO): replace with
env.call("task") which works with both SyncVectorEnv and AsyncVectorEnv.

AsyncVectorEnv is now the default for n_envs > 1; auto-downgraded to
SyncVectorEnv when n_envs=1 (no benefit, less overhead).

Expected speedup: ~15-20x for LIBERO Spatial with batch_size=50.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
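The default-selection rule described in this commit (AsyncVectorEnv for n_envs > 1, auto-downgrade to SyncVectorEnv when n_envs=1) amounts to something like the following; the function name is hypothetical:

```python
def choose_vectorizer(n_envs: int, use_async: bool = True) -> str:
    """Pick the vector env class: async stepping only pays off with
    multiple envs, so a single env always runs sync (less overhead,
    no worker-process machinery)."""
    if n_envs == 1 or not use_async:
        return "sync"
    return "async"
```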
eval_policy_all never closed environments after each task completed,
causing AsyncVectorEnv worker processes to accumulate (N_tasks × n_envs).
This led to OOM, BrokenPipeError and EOFError on multi-task benchmarks.

Also fixes:
- AsyncVectorEnv compat in envs/utils.py (use get_attr/call instead of .envs)
- Tuple task handling in tokenizer_processor and lerobot_eval
- _LazyAsyncVectorEnv for deferred worker spawning in LIBERO

Made-with: Cursor
…ning

env.call("task") returns the LIBERO task name with underscores
(e.g. "pick_up_the_black_bowl_...") instead of the natural language
description ("pick up the black bowl ..."). The VLM tokenizes these
completely differently, causing 0.0 reward across all episodes.

Made-with: Cursor
- Replace add_envs_task reference with env.call("task_description")
- Update use_async_envs default to True
- Add note about lazy GPU init for AsyncVectorEnv compatibility

Made-with: Cursor
- batch_size=0 (default) auto-tunes based on CPU cores, capped by
  n_episodes and 64. Removes the need for users to guess the right
  value. The old batch_size > n_episodes error is replaced by silently
  clamping to n_episodes.
- _LazyAsyncVectorEnv accepts pre-computed spaces so only one temp env
  is created per suite (not per task). For libero_spatial (10 tasks)
  this avoids 9 redundant LiberoEnv instantiations during env setup.

Made-with: Cursor
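The auto-tune rule described above can be sketched as follows (function name hypothetical; the caps come from the commit message):

```python
import os

def resolve_batch_size(batch_size: int, n_episodes: int, cap: int = 64) -> int:
    """batch_size=0 means auto-tune: use the CPU core count, capped by
    n_episodes and a hard ceiling of 64. Explicit values larger than
    n_episodes are silently clamped rather than raising an error."""
    if batch_size == 0:
        batch_size = min(os.cpu_count() or 1, n_episodes, cap)
    return min(batch_size, n_episodes)
```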
pkooij and others added 7 commits April 8, 2026 18:29
__del__ is unreliable as a cleanup mechanism. close() is already called
explicitly in the eval loop's finally block, so the finalizer is redundant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ry overlap

Previously, next task's AsyncVectorEnv workers were spawned while the
current task was still running, causing both tasks' GPU contexts to coexist.
Moving the prefetch start into the finally block (after env.close()) ensures
workers for task N+1 only spin up once task N has released GPU memory.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
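The close-before-prefetch ordering can be illustrated with a skeleton of the eval loop. Names are hypothetical and the real code prefetches workers asynchronously; this sketch only shows the ordering guarantee:

```python
def eval_tasks(tasks, make_env, run_task):
    """Run each task with its own vector env, ensuring task N's env is
    closed (GPU memory released) before task N+1's workers are spawned."""
    results = []
    if not tasks:
        return results
    env = make_env(tasks[0])
    for i, task in enumerate(tasks):
        try:
            results.append(run_task(env, task))
        finally:
            env.close()  # release task N's GPU contexts first...
            if i + 1 < len(tasks):
                env = make_env(tasks[i + 1])  # ...then start task N+1
    return results
```

At no point do two tasks' worker sets hold GPU contexts simultaneously, which is what caused the memory overlap this commit fixes.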
_LazyAsyncVectorEnv lived in libero.py but metaworld had the same OOM
problem: all tasks' AsyncVectorEnv workers were spawned eagerly, wasting
GPU memory for tasks not yet running.

Move the class to envs/utils.py so both environments share it, then apply
the same is_async + lazy wrapping pattern in create_metaworld_envs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Benchmark CI workflow, Dockerfiles, benchmark docs, evaluation smoke-test
doc, and dispatch tests belong in a separate PR. Scope this PR to the
async env init changes only.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…changes

- Restore docs/source/adding_benchmarks.mdx (belongs in this PR)
- Restore tests/envs/test_dispatch.py (belongs in this PR)
- Revert docs/source/env_processor.mdx to main (out of scope for this PR)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e PR)

Step 7 (Dockerfile + benchmark_tests.yml CI job) and its table rows are
out of scope for this PR. The CI infrastructure will be added on top in a
follow-up PR.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaced by env.call("task_description") in lerobot_eval.py. No callers
remain in the codebase.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@pkooij pkooij force-pushed the feat/async-vector-env branch from 2433f10 to 35f18d4 Compare April 8, 2026 17:04
github-actions bot added labels: documentation, tests, configuration, evaluation (Apr 8, 2026)
@pkooij pkooij force-pushed the feat/async-vector-env branch from 35f18d4 to 566a77b Compare April 8, 2026 17:05
@github-actions github-actions bot added the processor Issue related to processor label Apr 8, 2026
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r task description

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
s1lent4gnt (Member) previously approved these changes Apr 8, 2026

CHAD, LGTM!

…eadlock

AsyncVectorEnv with default fork context leaks worker processes between
test_policy parametrized cases; subsequent env creation deadlocks because
new forked workers inherit stale pipe FDs from previous test's leaked workers.

- configs.py: pass context="forkserver" to AsyncVectorEnv (matches _LazyAsyncVectorEnv)
- test_policies.py: call close_envs(envs) at end of test_policy to clean up workers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
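A hedged sketch of the context choice: gymnasium's AsyncVectorEnv accepts a context argument, and "forkserver" forks each worker from a long-lived, freshly started server process rather than from the (possibly FD-polluted) test process:

```python
import multiprocessing as mp

def make_worker_context():
    # "forkserver" workers cannot inherit stale pipe FDs or GPU handles
    # left over from a previous test's leaked workers, avoiding the
    # deadlock described in this commit. (Not available on Windows.)
    return mp.get_context("forkserver")

# Passed to the vector env roughly as:
#   gym.vector.AsyncVectorEnv(env_fns, context="forkserver")
```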
s1lent4gnt (Member) previously approved these changes Apr 8, 2026

LGTM!

Tests that call make_env(n_envs=2) without passing use_async_envs were
getting AsyncVectorEnv, whose forked workers can't resolve gym namespaces
registered at runtime. Default to False (sync) so existing tests pass.

lerobot_eval.py explicitly passes cfg.eval.use_async_envs, so the CLI
async behaviour (controlled by EvalConfig.use_async_envs) is unchanged.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@pkooij pkooij merged commit 919184d into main Apr 9, 2026
13 checks passed
@pkooij pkooij deleted the feat/async-vector-env branch April 9, 2026 08:29
pkooij added a commit that referenced this pull request Apr 9, 2026
Resolves conflict in lerobot_eval.py by taking explicit
(AttributeError, NotImplementedError) catches from main (#3274).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>