Route MCP sessions to the originating backend pod using httptrace#4673
Conversation
Codecov Report
❌ Patch coverage is
Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #4673      +/-   ##
==========================================
+ Coverage   68.66%   68.69%   +0.02%
==========================================
  Files         515      516       +1
  Lines       53580    53948     +368
==========================================
+ Hits        36793    37058     +265
- Misses      13947    14039      +92
- Partials     2840     2851      +11

☔ View full report in Codecov by Sentry.
Pull request overview
This PR fixes MCP session misrouting after a proxy runner restart in multi-backend Kubernetes deployments by recording the actual backend pod address used during initialize and adding a transparent backend re-initialization/replay mechanism when a pod becomes unreachable or loses in-memory session state.
Changes:
- Capture the selected backend pod address on `initialize` via `httptrace.GotConn` and persist it as `backend_url` for stable follow-up routing after restarts.
- Persist the raw `initialize` request body in session metadata, and transparently re-initialize and replay requests on backend 404s (non-DELETE) and TCP dial errors.
- Add unit tests for storing `init_body` and for the 404/dial-error transparent re-init + replay behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/transport/proxy/transparent/transparent_proxy.go | Adds httptrace-based pod pinning, stores init body, and implements transparent re-init + replay with session ID rewriting. |
| pkg/transport/proxy/transparent/backend_routing_test.go | Adds tests covering init-body persistence and transparent re-init behavior on 404 and dial errors. |
jerm-dro
left a comment
suggestion: PR #4574 adds E2E acceptance tests that demonstrate this exact bug (proxy runner restart with backendReplicas > 1 → session not found). Consider merging that branch into this one so the failing tests land alongside the fix — proving the fix end-to-end in CI rather than relying solely on the unit tests here. The acceptance tests are the strongest evidence that the bug is resolved across all three K8s versions.
To avoid problems with rebases, I included the E2E tests you created as part of this PR itself.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.
Large PR justification has been provided. Thank you!
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Extract re-initialization logic from tracingTransport into a dedicated backendRecovery type backed by a narrow recoverySessionStore interface (Get + UpsertSession) and a forward func. tracingTransport now owns only the request lifecycle (session guard, initialize detection, httptrace, session creation) and delegates all forwarding and recovery to backendRecovery.

This makes reinitializeAndReplay and podBackendURL testable without standing up a full TransparentProxy: backend_recovery_test.go covers all recovery paths (no session, no init body, happy path, forward error, missing new session ID) using a stubSessionStore and inline httptest servers.

tracingTransport.forward() and the base field are removed; all network I/O goes through recovery.forward, a single source of truth for the underlying transport.

Also integrates the E2E acceptance test from #4574 that exercises backendReplicas=2 plus a proxy runner restart, verifying that sessions are routed to the correct backend pod after re-initialization.
Clicked the merge button to ensure this gets out in the very next release 😃
Summary
When a proxy runner pod restarts it recovers sessions from Redis but backend_url stored the ClusterIP, so kube-proxy could send follow-up requests to a different backend pod that never handled initialize — causing JSON-RPC -32001 "session not found" errors on the first request.
Use net/http/httptrace.GotConn to capture the actual backend pod IP after kube-proxy DNAT on every initialize request, and store that as backend_url instead of the ClusterIP URL. The existing Rewrite closure already reads backend_url and pins routing to the correct pod; no changes to that path are needed.
When the backend pod is later replaced (rescheduled to a new IP, or restarted in place and having lost in-memory session state), the proxy now re-initializes the backend session transparently rather than returning 404 to the client:
- Dial error (pod IP unreachable): re-init triggers on TCP failure
- Backend 404 (session lost, same IP): re-init triggers on response
In both cases the proxy replays the stored initialize body against the ClusterIP, captures the new pod IP via GotConn, stores the new backend session ID, rewrites outbound Mcp-Session-Id headers, and replays the original client request; the client sees no error.
DELETE responses are excluded from the 404 re-init path since the session is intentionally torn down in that case.
Fixes #4575
Type of change
Test plan
- Tests pass (`task test`)
- E2E tests pass (`task test-e2e`)
- Lint passes (`task lint-fix`)
Changes
Does this introduce a user-facing change?
Special notes for reviewers
Large PR Justification
This is a complete PR covering a single bugfix with comprehensive tests; it cannot be meaningfully split.