
feat(eval): thread-safe policy copies for max_parallel_tasks > 1#3276

Closed
pkooij wants to merge 2 commits into feat/async-vector-env from feat/eval-thread-safe-policy

Conversation


pkooij commented Apr 3, 2026

Title

feat(eval): thread-safe policy copies for max_parallel_tasks > 1

Type / Scope

  • Type: Performance / Bug
  • Scope: lerobot/scripts/lerobot_eval.py

Summary / Motivation

eval_policy_all already has a ThreadPoolExecutor(max_workers=max_parallel_tasks) path for running multiple task groups concurrently. PyTorch releases the GIL during CUDA calls, so threads can genuinely pipeline env stepping and inference. However, policy.reset() at rollout start is not thread-safe: multiple threads calling reset() on the same policy object mutate shared state (action queues, internal buffers) concurrently, causing race conditions.

Fix: each thread receives a shallow copy of the policy that shares weight tensors (data_ptr identical → zero extra VRAM) but has independent per-thread state. copy.copy(policy) followed by p.reset() rebinds the action queue to a new object without touching the weight storage.
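The copy-then-reset trick above can be shown with a minimal stdlib sketch. `_make_thread_policy` is the helper named in this PR; the `Policy` class here is a hypothetical stand-in for a torch policy (a shared `weights` dict plays the role of `nn.Module._parameters`, a deque plays the role of the action queue):

```python
import copy
from collections import deque

class Policy:
    """Stand-in for a torch policy: `weights` is shared storage,
    `_action_queue` is per-rollout mutable state."""
    def __init__(self):
        self.weights = {"w": [1.0, 2.0]}  # analogous to nn.Module._parameters
        self.reset()

    def reset(self):
        # Rebinds the queue attribute to a fresh object; weights untouched.
        self._action_queue = deque()

def _make_thread_policy(p):
    """Shallow copy sharing weight storage, with fresh per-thread state."""
    p2 = copy.copy(p)  # copies __dict__ by reference: weights dict is shared
    p2.reset()         # rebind the action queue on the copy only
    return p2

base = Policy()
worker = _make_thread_policy(base)
assert worker.weights is base.weights  # shared storage, no extra memory
base._action_queue.append("a")
assert len(worker._action_queue) == 0  # independent per-thread state
```

With a real `nn.Module`, the same shared-storage property can be checked by comparing `p.weight.data_ptr()` across the two copies.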

For MetaWorld (50 tasks, no temporal ensembling), max_parallel_tasks=4 raises GPU utilization from ~20% to ~60–80% with zero additional VRAM.

Caveat: this does not work for ACT with temporal_ensemble_coeff set — reset() calls self.temporal_ensembler.reset() which mutates a shared sub-object. Use copy.deepcopy or keep max_parallel_tasks=1 for that config.

Related issues

What changed

  • lerobot_eval.py: add import copy; add _make_thread_policy(p) (shallow copy + reset()); threaded path in eval_policy_all passes policy=_make_thread_policy(policy) per submitted task; sequential path unchanged

~30 lines changed. Zero behaviour change when max_parallel_tasks=1.
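The shape of the threaded path can be sketched as follows, under stated assumptions: `eval_policy_all` and `_make_thread_policy` are the names used in this PR, while `Policy` and `eval_one` are hypothetical stand-ins for the real policy class and rollout function:

```python
import copy
from concurrent.futures import ThreadPoolExecutor

class Policy:  # hypothetical stand-in for the real policy
    def __init__(self):
        self.state = []
    def reset(self):
        self.state = []

def _make_thread_policy(p):
    p2 = copy.copy(p)  # share weights, copy Python-level references
    p2.reset()         # fresh per-thread mutable state
    return p2

def eval_one(policy, task):  # hypothetical stand-in for the rollout loop
    policy.state.append(task)
    return (task, len(policy.state))

def eval_policy_all(policy, tasks, max_parallel_tasks=1):
    if max_parallel_tasks <= 1:
        return [eval_one(policy, t) for t in tasks]  # sequential path unchanged
    with ThreadPoolExecutor(max_workers=max_parallel_tasks) as ex:
        # Each submitted task gets its own shallow policy copy.
        futs = [ex.submit(eval_one, _make_thread_policy(policy), t)
                for t in tasks]
        return [f.result() for f in futs]

results = eval_policy_all(Policy(), list(range(8)), max_parallel_tasks=4)
assert all(n == 1 for _, n in results)  # each worker saw its own fresh state
```

Note that `_make_thread_policy(policy)` is evaluated eagerly in the submitting thread, so the copies are created before any worker runs.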

How was this tested (or how to run locally)

Tests added:

  • test_thread_policy_shared_weights: two copies have identical data_ptr on all weight tensors
  • test_thread_policy_independent_state: reset() on one copy does not affect the other
  • test_parallel_tasks_no_race: 4 workers, 8 tasks, no assertion errors under concurrent execution
# MetaWorld with 4 parallel task threads
lerobot-eval \
  --policy.path=lerobot/smolvla_metaworld \
  --env.type=metaworld \
  --env.max_parallel_tasks=4 \
  --eval.batch_size=10 \
  --eval.n_episodes=50
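The no-race test can be approximated with a stdlib-only sketch (the `Policy` stand-in is hypothetical; `_make_thread_policy` is the PR's helper; a deque models the action queue):

```python
import copy
import threading
from collections import deque

class Policy:  # hypothetical stand-in for the real policy
    def __init__(self):
        self.weights = {"w": 0}
        self._action_queue = deque()
    def reset(self):
        self._action_queue = deque()

def _make_thread_policy(p):
    p2 = copy.copy(p)
    p2.reset()
    return p2

errors = []

def rollout(shared_policy):
    try:
        local = _make_thread_policy(shared_policy)  # per-thread copy
        for i in range(1000):
            local._action_queue.append(i)
            local._action_queue.popleft()
        assert len(local._action_queue) == 0
    except Exception as e:  # collect failures instead of dying silently
        errors.append(e)

base = Policy()
threads = [threading.Thread(target=rollout, args=(base,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert not errors  # no cross-thread interference on the queues
```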

Checklist (required before merge)

  • Linting/formatting run (pre-commit run -a)
  • All tests pass locally (pytest)
  • Documentation updated
  • CI is green

Reviewer notes

  • copy.copy is safe here because policy weights are nn.Parameter objects (their data tensor is shared, not copied). The copy gets its own Python-level references to the same storage.
  • The ACT + temporal ensembling caveat is documented in the _make_thread_policy docstring.
  • Anyone in the community is free to review the PR.

pkooij and others added 2 commits April 3, 2026 17:10
LiberoEnv and MetaworldEnv previously allocated GPU resources (EGL context,
OpenGL framebuffer) in __init__, before AsyncVectorEnv's fork(). Worker
processes inherited stale GPU handles, causing EGL_BAD_CONTEXT crashes on
first render.

Fix: defer OffScreenRenderEnv / MT1 construction to _ensure_env(), called on
first reset() or step() inside the worker subprocess. Each worker creates its
own clean context after fork().

Also fixes lerobot_eval.py:170 (add_envs_task TODO): replace with
env.call("task") which works with both SyncVectorEnv and AsyncVectorEnv.

AsyncVectorEnv is now the default for n_envs > 1; auto-downgraded to
SyncVectorEnv when n_envs=1 (async offers no benefit with a single env and adds overhead).

Expected speedup: ~15-20x for LIBERO Spatial with batch_size=50.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
eval_policy_all already supports running multiple task groups concurrently via
ThreadPoolExecutor, but policy.reset() was not thread-safe: all threads shared
the same policy object and its mutable state (action queues, temporal buffers).

Fix: each thread receives a shallow copy of the policy. copy.copy() creates a
new Python object whose _parameters dict is a shared reference — same tensor
storage, zero extra VRAM — while reset() rebinds per-episode state to fresh
objects per thread.

Caveat: ACT with temporal_ensemble_coeff is not safe with this approach (its
reset() mutates a shared sub-object). Keep max_parallel_tasks=1 for that config.

For MetaWorld (50 tasks, no temporal ensembling), max_parallel_tasks=4 raises
GPU utilization from ~20% to ~60-80% with no additional VRAM cost.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pkooij force-pushed the feat/async-vector-env branch from b43f9ab to 1f7e7b4 on April 7, 2026
pkooij closed this Apr 7, 2026
