Skip to content

fix(classes): make RemoteClassWrapper and LRUCache picklable#306

Open
deanq wants to merge 3 commits intomainfrom
fix/AE-2745-remote-class-pickle
Open

fix(classes): make RemoteClassWrapper and LRUCache picklable#306
deanq wants to merge 3 commits intomainfrom
fix/AE-2745-remote-class-pickle

Conversation

@deanq
Copy link
Copy Markdown
Member

@deanq deanq commented Apr 10, 2026

Summary

  • asyncio.Lock in RemoteClassWrapper._init_lock and threading.RLock in LRUCache._lock are not picklable by cloudpickle
  • When ResourceManager._save_resources() calls cloudpickle.dump(), serialization raises, preventing the state file from being updated
  • On next flash deploy, ResourceManager has no record of the previous endpoint and creates a duplicate, orphaning the old one
  • Added __getstate__/__setstate__ to both RemoteClassWrapper and LRUCache to exclude unpicklable fields and recreate them on restore
  • RemoteClassWrapper also resets _initialized to False on unpickle so the instance re-initializes on next use
  • LRUCache.__getstate__ acquires _lock for thread-safe pickling

Test plan

  • test_pickle_roundtrip -- full cloudpickle dumps/loads roundtrip preserving all wrapper state
  • test_pickle_excludes_lock_and_stub -- verifies _init_lock and _stub absent from __getstate__ output
  • test_pickle_resets_initialized_flag -- unpickled instance has _initialized=False and no _stub
  • test_pickle_inside_tuple_like_save_resources -- simulates exact (resources, configs) tuple pattern from _save_resources
  • test_pickle_preserves_cache_key_and_class_data -- verifies cache_key and class metadata survive roundtrip
  • test_pickle_roundtrip (LRUCache) -- roundtrip preserving cache contents and max_size

asyncio.Lock and threading.RLock are not picklable. When
ResourceManager._save_resources() calls cloudpickle.dump(), the
serialization fails silently, causing duplicate endpoints on re-deploy.

- Add __getstate__/__setstate__ to RemoteClassWrapper excluding
  _init_lock and _stub, resetting _initialized on restore
- Add __getstate__/__setstate__ to LRUCache excluding _lock
- Add 5 tests covering pickle roundtrip and state exclusion
@deanq deanq requested a review from Copilot April 10, 2026 19:33
@deanq deanq changed the title fix(AE-2745): make RemoteClassWrapper and LRUCache picklable fix(classes): make RemoteClassWrapper and LRUCache picklable Apr 10, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to make two runtime objects (RemoteClassWrapper and LRUCache) serializable with cloudpickle so ResourceManager can persist state reliably and avoid creating duplicate/orphaned endpoints across deploys.

Changes:

  • Added RemoteClassWrapper.__getstate__/__setstate__ to exclude unpicklable runtime fields (asyncio.Lock, _stub) and force re-initialization after unpickle.
  • Added LRUCache.__getstate__/__setstate__ to exclude and recreate the unpicklable threading.RLock.
  • Added unit tests covering cloudpickle roundtrips and state exclusion/recreation for both classes.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File Description
src/runpod_flash/execute_class.py Adds pickle hooks to RemoteClassWrapper to drop lock/stub and recreate lock on restore.
src/runpod_flash/core/utils/lru_cache.py Adds pickle hooks to LRUCache to drop/recreate the internal RLock.
tests/unit/test_execute_class.py Adds coverage for RemoteClassWrapper cloudpickle roundtrip and state expectations.
tests/unit/core/utils/test_lru_cache.py Adds coverage for LRUCache cloudpickle roundtrip and lock recreation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@deanq deanq requested review from KAJdev and runpod-Henrik April 10, 2026 19:46
- __setstate__ now calls get_or_cache_class_data() to repopulate the
  serialization cache after unpickle, preventing method_proxy from
  crashing on cache miss in a new process
- LRUCache.__getstate__ uses pop() instead of del for consistency
  with RemoteClassWrapper's defensive style
- Add test_pickle_repopulates_serialization_cache covering the
  cache miss scenario after cloudpickle roundtrip
@deanq
Copy link
Copy Markdown
Member Author

deanq commented Apr 10, 2026

bugbot run

- Revert __setstate__ cache repopulation: adding _SERIALIZED_CLASS_CACHE
  as a direct global reference forced cloudpickle to serialize the
  LRUCache instance as part of class globals, hitting a weakref on 3.11.
  Not needed: cloudpickle preserves the cache in the class globals copy.
- LRUCache.__getstate__ acquires _lock for thread-safe pickling
- Use _class_type.__name__ instead of identity checks in tests since
  cloudpickle rehydrates non-importable classes by value
- Replace cache repopulation test with metadata preservation test
Copy link
Copy Markdown
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #306 — fix(classes): make RemoteClassWrapper and LRUCache picklable

Repo: runpod-inc/flash
Severity: MEDIUM
Key findings: The fix is correct and prevents orphaned endpoints on redeploy, but __setstate__ does not re-warm the in-memory class cache, so a restored wrapper will crash on first method call in any process that didn't originally create it.


1. Core fix — RemoteClassWrapper pickle support

What it fixes: When a user ran flash deploy a second time, ResourceManager._save_resources() would raise during cloudpickle.dump() because asyncio.Lock is not picklable. The state file was left stale, so on the next deploy the manager had no record of the previous endpoint and provisioned a duplicate — leaving the user with a silently orphaned, billing-active endpoint they never deprovisioned.

The __getstate__/__setstate__ pair correctly strips _init_lock and _stub, resets _initialized=False, and recreates the lock on restore. This is the right approach.

Issue (high confidence) — restored wrapper crashes on first method call in a new process:

After unpickling, method_proxy (in __getattr__) reads:

cached_data = _SERIALIZED_CLASS_CACHE.get(self._cache_key)
# cached_data is None in a fresh process
cached_data["constructor_args"]  # TypeError: 'NoneType' is not subscriptable

_SERIALIZED_CLASS_CACHE is a module-level LRUCache that is never pickled. In any scenario where state is loaded from disk and the wrapper's method is then called — which is exactly the scenario this PR is meant to enable — the cache is cold and method_proxy will raise TypeError before the remote call is made. The user's class method call silently fails with an unrelated-looking error rather than a useful message.

The instance does preserve _clean_class_code through the pickle roundtrip. __setstate__ could call get_or_cache_class_data (or at minimum populate _SERIALIZED_CLASS_CACHE directly) to re-warm the cache and prevent the crash.

Question (medium — depends on runtime) — _resource_config preservation:

_resource_config (a ServerlessResource Pydantic model) is pickled as part of the wrapper state. After unpickling, _ensure_initialized calls resource_manager.get_or_deploy_resource(self._resource_config). If the ServerlessResource Pydantic model has validators or computed fields that behave differently when loaded from pickle vs. constructed fresh, drift detection could misfire and trigger a re-deploy instead of reusing the existing endpoint. This is low probability given Pydantic v2 behavior, but no test covers the _ensure_initialized path on a restored wrapper.


2. LRUCache pickle support

The implementation is straightforward and correct. __getstate__ acquires the lock before copying __dict__, preventing a torn read under concurrent access. __setstate__ restores a fresh RLock. Cache contents (OrderedDict) survive the roundtrip intact.

No issues.


3. Test coverage

The unit tests cover the stated scenarios well (roundtrip, state exclusion, reset, tuple pattern, cache key preservation).

Gap — no test for the cold-cache crash path: None of the five TestRemoteClassWrapperPickle tests call a method on the restored wrapper. The test_pickle_inside_tuple_like_save_resources test confirms the pickle pattern matches _save_resources() but stops at verifying _initialized=False. A test that unpickles and then invokes a method proxy would catch the TypeError described above.

Gap — no test verifying create_new_instance behavior after restore: The create_new_instance=not hasattr(self, "_stub") logic is critical for avoiding duplicate remote instances. No test asserts that a freshly-unpickled wrapper sets create_new_instance=True on first use and False on subsequent uses.


Nits

  • _UNPICKLABLE_ATTRS is defined as a class variable inside class RemoteClassWrapper (which itself is defined inside create_remote_class). The frozenset is recreated on every call to create_remote_class. It could be a module-level constant, but the impact is negligible.
  • __setstate__ docstring is missing; other dunder methods in the file have them.
  • The LRUCache test asserts type(restored._lock) is type(threading.RLock()) — this works but is comparing types of instances, not types directly. isinstance(restored._lock, type(threading.RLock())) would be cleaner, though the current form is not wrong.

Verdict

NEEDS WORK — The core orphaned-endpoint bug is correctly fixed, but the __setstate__ implementation leaves restored wrappers in a state that crashes on first method call due to a cold in-memory cache. This defeats the purpose of loading from persisted state and should be addressed before merge.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants