Skip to content

Route MCP sessions to the originating backend pod using httptrace#4673

Merged
jerm-dro merged 2 commits intomainfrom
issue-4575-v1
Apr 9, 2026
Merged

Route MCP sessions to the originating backend pod using httptrace#4673
jerm-dro merged 2 commits intomainfrom
issue-4575-v1

Conversation

@yrobla
Copy link
Copy Markdown
Contributor

@yrobla yrobla commented Apr 8, 2026

Summary

When a proxy runner pod restarts it recovers sessions from Redis but backend_url stored the ClusterIP, so kube-proxy could send follow-up requests to a different backend pod that never handled initialize — causing JSON-RPC -32001 "session not found" errors on the first request.

Use net/http/httptrace.GotConn to capture the actual backend pod IP after kube-proxy DNAT on every initialize request, and store that as backend_url instead of the ClusterIP URL. The existing Rewrite closure already reads backend_url and pins routing to the correct pod; no changes to that path are needed.

When the backend pod is later replaced (rescheduled to a new IP or restarted in place and lost in-memory session state), the proxy now re-initializes the backend session transparently rather than returning 404 to the client:

  • Dial error (pod IP unreachable): re-init triggers on TCP failure
  • Backend 404 (session lost, same IP): re-init triggers on response

In both cases the proxy replays the stored initialize body against the ClusterIP, captures the new pod IP via GotConn, stores the new backend session ID, rewrites outbound Mcp-Session-Id headers, and replays the original client request — the client sees no error.

DELETE responses are excluded from the 404 re-init path since the session is intentionally torn down in that case.

Fixes #4575

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (task test)
  • E2E tests (task test-e2e)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Changes

File Change

Does this introduce a user-facing change?

Special notes for reviewers

Large PR Justification

This is a complete PR that covers a single bugfix. It includes comprehensive testing. Cannot be split.

@github-actions github-actions bot added the size/M Medium PR: 300-599 lines changed label Apr 8, 2026
@yrobla yrobla requested a review from Copilot April 8, 2026 14:19
@codecov
Copy link
Copy Markdown

codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 89.83051% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.69%. Comparing base (033e051) to head (1e7b6fb).
⚠️ Report is 16 commits behind head on main.

Files with missing lines Patch % Lines
...g/transport/proxy/transparent/transparent_proxy.go 89.83% 6 Missing and 6 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4673      +/-   ##
==========================================
+ Coverage   68.66%   68.69%   +0.02%     
==========================================
  Files         515      516       +1     
  Lines       53580    53948     +368     
==========================================
+ Hits        36793    37058     +265     
- Misses      13947    14039      +92     
- Partials     2840     2851      +11     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes MCP session misrouting after a proxy runner restart in multi-backend Kubernetes deployments by recording the actual backend pod address used during initialize and adding a transparent backend re-initialization/replay mechanism when a pod becomes unreachable or loses in-memory session state.

Changes:

  • Capture the selected backend pod address on initialize via httptrace.GotConn and persist it as backend_url for stable follow-up routing after restarts.
  • Persist the raw initialize request body in session metadata and transparently re-initialize + replay requests on backend 404s (non-DELETE) and TCP dial errors.
  • Add unit tests for storing init_body and for the 404/dial-error transparent re-init + replay behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
pkg/transport/proxy/transparent/transparent_proxy.go Adds httptrace-based pod pinning, stores init body, and implements transparent re-init + replay with session ID rewriting.
pkg/transport/proxy/transparent/backend_routing_test.go Adds tests covering init-body persistence and transparent re-init behavior on 404 and dial errors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Apr 8, 2026
Copy link
Copy Markdown
Contributor

@jerm-dro jerm-dro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: PR #4574 adds E2E acceptance tests that demonstrate this exact bug (proxy runner restart with backendReplicas > 1 → session not found). Consider merging that branch into this one so the failing tests land alongside the fix — proving the fix end-to-end in CI rather than relying solely on the unit tests here. The acceptance tests are the strongest evidence that the bug is resolved across all three K8s versions.

jerm-dro
jerm-dro previously approved these changes Apr 8, 2026
Copy link
Copy Markdown
Contributor

@jerm-dro jerm-dro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please confirm the E2E acceptance test from #4574 passes with this fix before closing #4575.

When a proxy runner pod restarts it recovers sessions from Redis but
backend_url stored the ClusterIP, so kube-proxy could send follow-up
requests to a different backend pod that never handled initialize —
causing JSON-RPC -32001 "session not found" errors on the first request.

Use net/http/httptrace.GotConn to capture the actual backend pod IP
after kube-proxy DNAT on every initialize request, and store that as
backend_url instead of the ClusterIP URL. The existing Rewrite closure
already reads backend_url and pins routing to the correct pod; no
changes to that path are needed.

When the backend pod is later replaced (rescheduled to a new IP or
restarted in place and lost in-memory session state), the proxy now
re-initializes the backend session transparently rather than returning
404 to the client:

- Dial error (pod IP unreachable): re-init triggers on TCP failure
- Backend 404 (session lost, same IP): re-init triggers on response

In both cases the proxy replays the stored initialize body against the
ClusterIP, captures the new pod IP via GotConn, stores the new backend
session ID, rewrites outbound Mcp-Session-Id headers, and replays the
original client request — the client sees no error.

DELETE responses are excluded from the 404 re-init path since the
session is intentionally torn down in that case.

Closes #4575
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Apr 9, 2026
@yrobla
Copy link
Copy Markdown
Contributor Author

yrobla commented Apr 9, 2026

to avoid problems with rebases, i included the e2e tests you created as part of this pr itself

@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/M Medium PR: 300-599 lines changed labels Apr 9, 2026
@yrobla yrobla requested review from Copilot and jerm-dro April 9, 2026 13:09
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Apr 9, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 9, 2026

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

@github-actions github-actions bot dismissed their stale review April 9, 2026 13:26

Large PR justification has been provided. Thank you!

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 9, 2026
@yrobla yrobla requested a review from Copilot April 9, 2026 13:26
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 9, 2026
@yrobla yrobla requested a review from Copilot April 9, 2026 13:42
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 9, 2026
@yrobla yrobla requested a review from Copilot April 9, 2026 14:36
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 9, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Extract re-initialization logic from tracingTransport into a dedicated
backendRecovery type backed by a narrow recoverySessionStore interface
(Get + UpsertSession) and a forward func. tracingTransport now owns only
the request lifecycle (session guard, initialize detection, httptrace,
session creation) and delegates all forwarding and recovery to
backendRecovery.

This makes reinitializeAndReplay and podBackendURL testable without
standing up a full TransparentProxy: backend_recovery_test.go covers
all recovery paths (no session, no init body, happy path, forward
error, missing new session ID) using a stubSessionStore and inline
httptest servers.

tracingTransport.forward() and the base field are removed; all network
I/O goes through recovery.forward — a single source of truth for the
underlying transport.

Also integrates the E2E acceptance test from #4574 that exercises
backendReplicas=2 + proxy runner restart, verifying that sessions are
routed to the correct backend pod after re-initialization.
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 9, 2026
@jerm-dro jerm-dro merged commit d851c69 into main Apr 9, 2026
40 checks passed
@jerm-dro jerm-dro deleted the issue-4575-v1 branch April 9, 2026 20:39
@jerm-dro
Copy link
Copy Markdown
Contributor

jerm-dro commented Apr 9, 2026

Clicked the merge button to ensure this gets out in the very next release 😃

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Extra large PR: 1000+ lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Backend routing breaks after proxy runner restart with backendReplicas > 1

4 participants