Revert "Add LRU capacity to ValidatingCache, remove sentinel pattern, add storage Update" #4730
Conversation
This reverts commit 1a6287b.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
Codecov Report

❌ Patch coverage report — additional details and impacted files:

@@ Coverage Diff @@
##             main    #4730      +/-   ##
==========================================
+ Coverage   68.68%   68.74%   +0.06%
==========================================
  Files         518      518
  Lines       54826    54826
==========================================
+ Hits        37657    37690      +33
+ Misses      14274    14242      -32
+ Partials     2895     2894       -1
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

Large PR justification has been provided. Thank you!
```diff
 	ctx, cancel := context.WithTimeout(context.Background(), notifyBackendExpiredTimeout)
 	defer cancel()

-	// Update only succeeds if the key still exists. A concurrent Delete (same
-	// pod or cross-pod) returns (false, nil), and we bail without resurrecting.
-	updated, err := sm.storage.Update(ctx, sessionID, metadata)
-	if err != nil {
-		return err
-	}
-	if !updated {
-		return nil // session was terminated; nothing to update
-	}
+	// Cross-pod guard: re-check that the storage record still exists before
+	// upserting. If another pod terminated the session (deleting the key) after
+	// NotifyBackendExpired's initial Load, we must not recreate the record.
+	if _, err := sm.storage.Load(ctx, sessionID); err != nil {
+		if errors.Is(err, transportsession.ErrSessionNotFound) {
+			return nil // session was terminated elsewhere; nothing to update
+		}
+		return err
+	}
+	if err := sm.storage.Upsert(ctx, sessionID, metadata); err != nil {
+		return err
+	}
 	// The cache self-heals lazily: on the next GetMultiSession, checkSession detects
 	// either the absent storage key or stale MetadataKeyBackendIDs and evicts the
 	// entry, triggering a fresh RestoreSession.
 	sm.sessions.Delete(sessionID)
```
🔴 In updateMetadata, a single context.WithTimeout(context.Background(), notifyBackendExpiredTimeout) context is created at line 614 and shared between both the cross-pod guard storage.Load call and the subsequent storage.Upsert call. If the Load consumes most of the 5-second budget (e.g., slow Redis or network jitter), the Upsert fails with a deadline-exceeded error even though the guard check succeeded and the write was safe to perform. The fix is to create two separate contexts—one for Load and one for Upsert—matching the documented intent of the notifyBackendExpiredTimeout constant, which explicitly states each storage operation is 'capped independently' and the worst-case wall-clock is '2 × 5 s = 10 s'.
Extended reasoning...
What the bug is: updateMetadata (session_manager.go lines 614–630) creates a single context with a 5-second timeout and passes it to two sequential storage operations: a cross-pod guard storage.Load (line 620) and a subsequent storage.Upsert (line 627). These two operations share one deadline, so the 5-second budget is split between them rather than each getting its own independent 5 seconds.
The specific code path: Inside updateMetadata, the code is:
```go
ctx, cancel := context.WithTimeout(context.Background(), notifyBackendExpiredTimeout)
defer cancel()
if _, err := sm.storage.Load(ctx, sessionID); err != nil { ... }      // consumes from shared ctx
if err := sm.storage.Upsert(ctx, sessionID, metadata); err != nil { ... } // residual budget only
```
If the Load call takes, say, 4.9 seconds (slow Redis or network jitter), the Upsert starts with only 0.1 seconds remaining and fails immediately with a context deadline exceeded error.
Why existing code doesn't prevent it: The constant notifyBackendExpiredTimeout has an explicit comment that reads: 'bounds each individual storage operation inside NotifyBackendExpired() — one Load and one Upsert, each capped independently. Each is a single-key Redis operation, so 5 s per call is consistent with terminateTimeout and decorateTimeout. Worst-case wall-clock for the function is 2 × 5 s = 10 s.' The implementation directly contradicts this documented intent. The old code (before the revert) used a single storage.Update atomic operation, so there was only one operation per context and this sharing problem did not exist. The revert introduced two sequential storage operations under one shared context.
What the impact would be: When the Upsert fails due to the exhausted context, updateMetadata returns an error, NotifyBackendExpired logs a warning, the session is NOT evicted from the node-local cache (only evicted on success), and the backend-expiry metadata is silently dropped. The next GetMultiSession call will still use the stale in-memory session holding references to the now-expired backend. This is a subtle correctness issue that only manifests under degraded network conditions (slow Redis, network jitter) where the re-check Load takes close to 5 seconds.
How to fix it: Create two separate contexts, one for the Load and one for the Upsert:
```go
loadCtx, loadCancel := context.WithTimeout(context.Background(), notifyBackendExpiredTimeout)
defer loadCancel()
if _, err := sm.storage.Load(loadCtx, sessionID); err != nil { ... }

upsertCtx, upsertCancel := context.WithTimeout(context.Background(), notifyBackendExpiredTimeout)
defer upsertCancel()
if err := sm.storage.Upsert(upsertCtx, sessionID, metadata); err != nil { ... }
```
This matches the documented intent and ensures both operations receive their full independent 5-second budget.
Step-by-step proof: (1) Backend workload-A expires; NotifyBackendExpired is called. (2) Initial storage.Load in NotifyBackendExpired at line 540 succeeds in ~0.1s (its own independent context). (3) updateMetadata is called with the updated metadata. (4) The shared context is created with a 5-second deadline at line 614. (5) The cross-pod guard storage.Load at line 620 takes 4.9s due to Redis latency—context still has 0.1s remaining. (6) The storage.Upsert at line 627 starts with 0.1s remaining, Redis round-trip exceeds this, and Upsert returns context deadline exceeded. (7) updateMetadata returns an error; sm.sessions.Delete(sessionID) is never called. (8) NotifyBackendExpired logs a warning and returns. (9) The node-local cache still holds the stale session; subsequent GetMultiSession returns stale backend metadata until the session is next evicted by checkSession detecting drift.
```diff
 	if !sm.sessions.CompareAndSwap(sessionID, sess, decorated) {
 		return fmt.Errorf("DecorateSession: session %q was terminated or concurrently modified during decoration", sessionID)
 	}
 	// Persist updated metadata to storage. On failure, attempt to rollback
 	// the local-map entry so the caller can retry. If Terminate() has since
 	// replaced the decorated entry with a sentinel, the rollback CAS returns
 	// false and we leave the sentinel in place.
 	decorateCtx, decorateCancel := context.WithTimeout(context.Background(), decorateTimeout)
 	defer decorateCancel()
-	updated, err := sm.storage.Update(decorateCtx, sessionID, decorated.GetMetadata())
-	if err != nil {
+	if err := sm.storage.Upsert(decorateCtx, sessionID, decorated.GetMetadata()); err != nil {
 		_ = sm.sessions.CompareAndSwap(sessionID, decorated, sess)
 		return fmt.Errorf("DecorateSession: failed to store decorated session metadata: %w", err)
 	}
-	if !updated {
-		// Session was deleted (by Terminate or TTL) between Get and Update.
-		// Evict the stale cache entry so onEvict closes backend connections.
-		sm.sessions.Remove(sessionID)
-		return fmt.Errorf("DecorateSession: session %q was deleted during decoration", sessionID)
-	}
-	sm.sessions.Set(sessionID, decorated)
 	return nil
```
🔴 The revert replaces storage.Update (SET XX / conditional write) with unconditional storage.Upsert in DecorateSession, creating a window where a concurrent Terminate() can delete the storage key and then DecorateSession's Upsert silently recreates it. On the same pod the consequence is bounded (the decorator wraps an already-closed session), but cross-pod RestoreSession calls will find the resurrected record in Redis and open fresh backend connections for a session that has been terminated.
Extended reasoning...
The race window
After DecorateSession's CompareAndSwap(sessionID, sess, decorated) succeeds at line 775, the local cache holds decorated as a live vmcpsession.MultiSession. At this point there is a concurrent execution window before line 784's storage.Upsert runs. During that window Terminate() can execute fully: (1) Peek sees decorated (a valid MultiSession, not a sentinel), (2) Store(terminatedSentinel{}) overwrites the cache entry unconditionally, (3) storage.Delete removes the Redis/local key, (4) sessions.Delete removes the sentinel, (5) multiSess.Close() releases backend connections. At this point both the cache and storage are clean — the session is fully terminated.
Why Upsert causes resurrection
DecorateSession then reaches line 784 and calls storage.Upsert(decorateCtx, sessionID, decorated.GetMetadata()). Because Upsert is unconditional (Redis SET with no XX flag), it recreates the storage key even though Terminate just deleted it. The rollback CompareAndSwap(sessionID, decorated, sess) at line 785 only fires on a storage error, not on this success path — so the local cache remains empty but storage now holds a non-terminated metadata record.
Same-pod impact is bounded; cross-pod impact is not
The code comment at lines 757–760 claims "its consequence is bounded: Terminate() already called Close() on the underlying MultiSession before deleting it, so any resurrected decorator wraps an already-closed session and will fail on first use". This is only true for same-pod access via the in-memory RestorableCache. Any other pod performing GetMultiSession → loadSession → factory.RestoreSession will find the resurrected storage record, see it is not marked terminated, and call factory.RestoreSession to open fresh backend connections for a session ID that was explicitly terminated. The closed local session object is never involved; the other pod constructs a brand-new MultiSession from the resurrected metadata.
Concrete proof of the race
- Pod A calls `DecorateSession("sess-1", fn)`. `GetMultiSession` returns `sess`; `fn` runs and returns `decorated`. CAS succeeds — cache: `{"sess-1": decorated}`.
- Pod A's `Terminate("sess-1")` runs on a concurrent goroutine: `Peek` sees `decorated`, stores `terminatedSentinel`, calls `storage.Delete`, calls `sessions.Delete`, calls `decorated.Close()`. Cache: empty. Storage: key absent.
- `DecorateSession` resumes at line 784: `storage.Upsert(ctx, "sess-1", decorated.GetMetadata())` — storage key recreated with non-terminated metadata. Cache: empty. Storage: `{"sess-1": metadata without terminated flag}`.
- Pod B calls `GetMultiSession("sess-1")` → cache miss → `loadSession` → `storage.Load` returns the resurrected record → metadata has no `terminated=true` flag and has `MetadataKeyTokenHash` set → `factory.RestoreSession` opens new backend connections.
The fix that was reverted
The previous code (PR #4669) used storage.Update (Redis SET XX) which is an atomic conditional write: it returns (false, nil) when the key does not exist. After Terminate's storage.Delete, Update would find the key absent and return (false, nil), and DecorateSession would bail without recreating the record. The revert replaced Update with Upsert, removing this protection entirely.
```go
// Guard against a sentinel being stored between load() completing and
// this Store call (Terminate() running concurrently). LoadOrStore is
// atomic: if a sentinel got in, we discard the freshly loaded value
// via onEvict rather than silently overwriting the sentinel.
if _, loaded := c.m.LoadOrStore(key, v); loaded {
	if c.onEvict != nil {
		c.onEvict(key, v)
	}
	return nil, errSentinelFound
}
```
🟡 In RestorableCache.Get (cache.go lines 126–130), after load() completes, LoadOrStore returns loaded=true for any pre-existing value, but the code unconditionally calls onEvict(key, v) and returns errSentinelFound without checking whether the pre-existing value is a sentinel or a legitimate V. If a concurrent CreateSession stores a valid MultiSession between storage.Upsert (line 348) and sessions.Store (line 355), a simultaneous GetMultiSession → loadSession → RestoreSession call will find that V via LoadOrStore, close the freshly-restored session's backend connections via onEvict, and return (nil, false) to all singleflight waiters even though the cache holds a valid session. The old ValidatingCache handled this correctly by reading back the winner via ContainsOrAdd+Get and calling onEvict only on the loser; the new code always treats any concurrent store as a sentinel, and the test TestValidatingCache_Singleflight_EvictsLoserWhenLoadRacesWriter that verified the correct behavior was removed in this PR.
Extended reasoning...
The logical flaw: At cache.go lines 126–130, LoadOrStore(key, v) returns (_, loaded=true) whenever anything is already stored under key. The code comment says "if a sentinel got in" but the code cannot distinguish a terminatedSentinel{} from a legitimate vmcpsession.MultiSession. Both cause onEvict(key, v) to fire on the freshly-loaded v (closing its backend connections) and errSentinelFound to be returned, causing all singleflight waiters to receive (zero, false).
The code path that triggers it: In CreateSession (session_manager.go lines 348–355), storage.Upsert writes the session metadata including MetadataKeyTokenHash to Redis, making the session restorable, and then sessions.Store(sessionID, sess) stores the live MultiSession in the node-local cache. These two operations are not atomic. During the window where metadata is in Redis but the V is not yet in the cache: a concurrent GetMultiSession(sessionID) sees a cache miss, enters the singleflight group, calls loadSession, finds MetadataKeyTokenHash in storage, and invokes factory.RestoreSession (which opens HTTP backend connections — typically 100–500 ms). Meanwhile CreateSession completes its sessions.Store. When RestoreSession finally returns and tries LoadOrStore, it finds the V from CreateSession and incorrectly treats it as a sentinel.
Why existing code does not prevent it: The singleflight re-check (lines 110–116) inspects the cache before load() is called and correctly returns the live value or sentinel found there. The bug is in the post-load LoadOrStore guard, which has no type assertion — it cannot tell a just-stored MultiSession from a terminatedSentinel.
Addressing the refutations: Five verifiers argue the trigger is implausible because the window between storage.Upsert and sessions.Store is nanoseconds while RestoreSession takes hundreds of milliseconds. This timing argument is sound: for the race to fire, the concurrent loadSession goroutine must start its storage.Load call after storage.Upsert runs, then RestoreSession must take long enough that sessions.Store completes before LoadOrStore is reached. In a single-pod scenario this requires either a very unlucky scheduler interleaving (cache miss seen between Upsert and Store in the same process) or a cross-pod client known to the pod before the session is registered with the SDK — both extremely unlikely. The refutations are persuasive that the race is near-impossible with real HTTP backends.
What makes it worth noting regardless: The code carries an incorrect invariant comment ("if a sentinel got in") and has silently dropped the correctness guarantee that the previous ValidatingCache implementation provided via ContainsOrAdd+Get. The removed test TestValidatingCache_Singleflight_EvictsLoserWhenLoadRacesWriter explicitly verified that a concurrent valid-V store was handled correctly (winner's value returned, loser evicted). The new TestRestorableCache_Sentinel_StoredDuringLoad only tests the sentinel case. This is a latent correctness regression: if RestoreSession ever becomes faster (e.g., in tests or with connection-pool prewarming) or a new caller path that calls GetMultiSession for an in-progress session emerges, the bug will manifest silently.
Step-by-step proof (theoretical, requires unlucky scheduling):
1. `sm.Generate()` stores an empty placeholder for `sessionID`; no `MetadataKeyTokenHash` yet.
2. `sm.CreateSession()` calls `factory.MakeSessionWithID()` (slow — opens backends).
3. `CreateSession` calls `sm.storage.Upsert(sessionID, sess.GetMetadata())` — `MetadataKeyTokenHash` now in Redis.
4. Race window begins: `sessions.Store` has not yet executed.
5. Concurrently, another goroutine calls `sm.GetMultiSession(sessionID)` — cache miss.
6. `loadSession` calls `storage.Load` — sees `MetadataKeyTokenHash`, session is restorable.
7. `loadSession` calls `factory.RestoreSession` — slow (opens HTTP connections).
8. `CreateSession` executes `sessions.Store(sessionID, sess)` — valid `V` now in cache.
9. `RestoreSession` completes; `LoadOrStore(key, restoredSess)` finds the `V` from step 8.
10. Code calls `onEvict(key, restoredSess)` → closes the freshly-restored session's backend connections.
11. Returns `errSentinelFound` → all singleflight waiters receive `(nil, false)`.
Fix: Check whether the pre-existing value is actually a sentinel before calling onEvict and returning errSentinelFound. If loaded=true and the existing value is a V, call onEvict on the freshly-loaded v (as before) but return the winner's V to callers, mirroring the old ContainsOrAdd+Get logic.
Reverts #4669
Large PR Justification
This is a revert of a PR that passed with breaking tests.