
feat(eval): thread-safe policy copies for max_parallel_tasks > 1#3276

Closed
pkooij wants to merge 2 commits into feat/async-vector-env from feat/eval-thread-safe-policy

Conversation


pkooij commented Apr 3, 2026

Title

feat(eval): thread-safe policy copies for max_parallel_tasks > 1

Type / Scope

  • Type: Performance / Bug
  • Scope: lerobot/scripts/lerobot_eval.py

Summary / Motivation

eval_policy_all already has a ThreadPoolExecutor(max_workers=max_parallel_tasks) path for running multiple task groups concurrently. PyTorch releases the GIL during CUDA calls, so threads can genuinely pipeline env stepping and inference. However, policy.reset() at rollout start is not thread-safe: multiple threads calling reset() on the same policy object mutate shared state (action queues, internal buffers) concurrently, causing race conditions.

Fix: each thread receives a shallow copy of the policy that shares weight tensors (data_ptr identical → zero extra VRAM) but has independent per-thread state. copy.copy(policy) followed by p.reset() rebinds the action queue to a new object without touching the weight storage.
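The copy-then-reset trick above can be shown with a minimal stdlib sketch. `_make_thread_policy` is the helper named in this PR; the `Policy` class here is a hypothetical stand-in for a torch policy (a shared `weights` dict plays the role of `nn.Module._parameters`, a deque plays the role of the action queue):

```python
import copy
from collections import deque

class Policy:
    """Stand-in for a torch policy: `weights` is shared storage,
    `_action_queue` is per-rollout mutable state."""
    def __init__(self):
        self.weights = {"w": [1.0, 2.0]}  # analogous to nn.Module._parameters
        self.reset()

    def reset(self):
        # Rebinds the queue attribute to a fresh object; weights untouched.
        self._action_queue = deque()

def _make_thread_policy(p):
    """Shallow copy sharing weight storage, with fresh per-thread state."""
    p2 = copy.copy(p)  # copies __dict__ by reference: weights dict is shared
    p2.reset()         # rebind the action queue on the copy only
    return p2

base = Policy()
worker = _make_thread_policy(base)
assert worker.weights is base.weights  # shared storage, no extra memory
base._action_queue.append("a")
assert len(worker._action_queue) == 0  # independent per-thread state
```

With a real `nn.Module`, the same shared-storage property can be checked by comparing `p.weight.data_ptr()` across the two copies.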

For MetaWorld (50 tasks, no temporal ensembling), max_parallel_tasks=4 raises GPU utilization from ~20% to ~60–80% with zero additional VRAM.

Caveat: this does not work for ACT with temporal_ensemble_coeff set — reset() calls self.temporal_ensembler.reset() which mutates a shared sub-object. Use copy.deepcopy or keep max_parallel_tasks=1 for that config.

Related issues

What changed

  • lerobot_eval.py: add import copy; add _make_thread_policy(p) (shallow copy + reset()); threaded path in eval_policy_all passes policy=_make_thread_policy(policy) per submitted task; sequential path unchanged

~30 lines changed. Zero behaviour change when max_parallel_tasks=1.
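The shape of the threaded path can be sketched as follows, under stated assumptions: `eval_policy_all` and `_make_thread_policy` are the names used in this PR, while `Policy` and `eval_one` are hypothetical stand-ins for the real policy class and rollout function:

```python
import copy
from concurrent.futures import ThreadPoolExecutor

class Policy:  # hypothetical stand-in for the real policy
    def __init__(self):
        self.state = []
    def reset(self):
        self.state = []

def _make_thread_policy(p):
    p2 = copy.copy(p)  # share weights, copy Python-level references
    p2.reset()         # fresh per-thread mutable state
    return p2

def eval_one(policy, task):  # hypothetical stand-in for the rollout loop
    policy.state.append(task)
    return (task, len(policy.state))

def eval_policy_all(policy, tasks, max_parallel_tasks=1):
    if max_parallel_tasks <= 1:
        return [eval_one(policy, t) for t in tasks]  # sequential path unchanged
    with ThreadPoolExecutor(max_workers=max_parallel_tasks) as ex:
        # Each submitted task gets its own shallow policy copy.
        futs = [ex.submit(eval_one, _make_thread_policy(policy), t)
                for t in tasks]
        return [f.result() for f in futs]

results = eval_policy_all(Policy(), list(range(8)), max_parallel_tasks=4)
assert all(n == 1 for _, n in results)  # each worker saw its own fresh state
```

Note that `_make_thread_policy(policy)` is evaluated eagerly in the submitting thread, so the copies are created before any worker runs.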

How was this tested (or how to run locally)

Tests added:

  • test_thread_policy_shared_weights: two copies have identical data_ptr on all weight tensors
  • test_thread_policy_independent_state: reset() on one copy does not affect the other
  • test_parallel_tasks_no_race: 4 workers, 8 tasks, no assertion errors under concurrent execution
# MetaWorld with 4 parallel task threads
lerobot-eval \
  --policy.path=lerobot/smolvla_metaworld \
  --env.type=metaworld \
  --env.max_parallel_tasks=4 \
  --eval.batch_size=10 \
  --eval.n_episodes=50
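The no-race test can be approximated with a stdlib-only sketch (the `Policy` stand-in is hypothetical; `_make_thread_policy` is the PR's helper; a deque models the action queue):

```python
import copy
import threading
from collections import deque

class Policy:  # hypothetical stand-in for the real policy
    def __init__(self):
        self.weights = {"w": 0}
        self._action_queue = deque()
    def reset(self):
        self._action_queue = deque()

def _make_thread_policy(p):
    p2 = copy.copy(p)
    p2.reset()
    return p2

errors = []

def rollout(shared_policy):
    try:
        local = _make_thread_policy(shared_policy)  # per-thread copy
        for i in range(1000):
            local._action_queue.append(i)
            local._action_queue.popleft()
        assert len(local._action_queue) == 0
    except Exception as e:  # collect failures instead of dying silently
        errors.append(e)

base = Policy()
threads = [threading.Thread(target=rollout, args=(base,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert not errors  # no cross-thread interference on the queues
```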

Checklist (required before merge)

  • Linting/formatting run (pre-commit run -a)
  • All tests pass locally (pytest)
  • Documentation updated
  • CI is green

Reviewer notes

  • copy.copy is safe here because policy weights are nn.Parameter objects (their data tensor is shared, not copied). The copy gets its own Python-level references to the same storage.
  • The ACT + temporal ensembling caveat is documented in the _make_thread_policy docstring.
  • Anyone in the community is free to review the PR.

pkooij and others added 2 commits April 3, 2026 17:10
LiberoEnv and MetaworldEnv previously allocated GPU resources (EGL context,
OpenGL framebuffer) in __init__, before AsyncVectorEnv's fork(). Worker
processes inherited stale GPU handles, causing EGL_BAD_CONTEXT crashes on
first render.

Fix: defer OffScreenRenderEnv / MT1 construction to _ensure_env(), called on
first reset() or step() inside the worker subprocess. Each worker creates its
own clean context after fork().

Also fixes lerobot_eval.py:170 (add_envs_task TODO): replace with
env.call("task") which works with both SyncVectorEnv and AsyncVectorEnv.

AsyncVectorEnv is now the default for n_envs > 1; auto-downgraded to
SyncVectorEnv when n_envs=1 (async offers no benefit with a single env and adds overhead).

Expected speedup: ~15-20x for LIBERO Spatial with batch_size=50.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
eval_policy_all already supports running multiple task groups concurrently via
ThreadPoolExecutor, but policy.reset() was not thread-safe: all threads shared
the same policy object and its mutable state (action queues, temporal buffers).

Fix: each thread receives a shallow copy of the policy. copy.copy() creates a
new Python object whose _parameters dict is a shared reference — same tensor
storage, zero extra VRAM — while reset() rebinds per-episode state to fresh
objects per thread.

Caveat: ACT with temporal_ensemble_coeff is not safe with this approach (its
reset() mutates a shared sub-object). Keep max_parallel_tasks=1 for that config.

For MetaWorld (50 tasks, no temporal ensembling), max_parallel_tasks=4 raises
GPU utilization from ~20% to ~60-80% with no additional VRAM cost.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pkooij force-pushed the feat/async-vector-env branch from b43f9ab to 1f7e7b4 on April 7, 2026
pkooij closed this Apr 7, 2026
