Route MCP sessions to the originating backend pod using httptrace by yrobla · Pull Request #4673 · stacklok/toolhive

yrobla · 2026-04-08T14:17:38Z

Summary

When a proxy runner pod restarts, it recovers sessions from Redis, but backend_url stored the ClusterIP, so kube-proxy could send follow-up requests to a different backend pod that never handled initialize. This caused JSON-RPC -32001 "session not found" errors on the first request.

Use net/http/httptrace.GotConn to capture the actual backend pod IP after kube-proxy DNAT on every initialize request, and store that as backend_url instead of the ClusterIP URL. The existing Rewrite closure already reads backend_url and pins routing to the correct pod; no changes to that path are needed.
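As a rough illustration of the mechanism described above, the following sketch attaches an httptrace.ClientTrace to a request and records the remote address of the connection the transport actually used. It is not the PR's code: capturePeerAddr and demo are hypothetical names, and a local httptest server stands in for a backend pod, so kube-proxy behavior is not reproduced here.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
)

// capturePeerAddr (hypothetical helper) issues a GET and returns the remote
// address of the connection the transport used, as reported by GotConn.
func capturePeerAddr(client *http.Client, url string) (string, error) {
	var remote string
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			remote = info.Conn.RemoteAddr().String()
		},
	}
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	// The trace travels with the request context.
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	return remote, nil
}

// demo starts a local test server (an assumed stand-in for a backend pod) and
// returns the captured peer address next to the server's listening address.
func demo() (captured, listening string, err error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()
	captured, err = capturePeerAddr(srv.Client(), srv.URL)
	return captured, srv.Listener.Addr().String(), err
}

func main() {
	captured, listening, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(captured == listening, captured)
}
```

In a proxy, the captured host:port would presumably be turned into a URL and stored as backend_url; that step is left out of the sketch.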

When the backend pod is later replaced (rescheduled to a new IP, or restarted in place with its in-memory session state lost), the proxy now re-initializes the backend session transparently rather than returning 404 to the client:

- Dial error (pod IP unreachable): re-init triggers on TCP failure
- Backend 404 (session lost, same IP): re-init triggers on response

In both cases the proxy replays the stored initialize body against the ClusterIP, captures the new pod IP via GotConn, stores the new backend session ID, rewrites outbound Mcp-Session-Id headers, and replays the original client request — the client sees no error.

DELETE responses are excluded from the 404 re-init path since the session is intentionally torn down in that case.
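The 404 branch of this flow can be sketched as a wrapping http.RoundTripper. This is a simplified illustration under stated assumptions, not the PR's implementation: reinitTransport, its fields, and demo are hypothetical, the fake backend treats any request without an Mcp-Session-Id header as initialize, and the dial-error branch (which would also need to retarget the request URL) is omitted.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

const sessionHeader = "Mcp-Session-Id"

// reinitTransport is a hypothetical wrapper (not the PR's type) that re-initializes on a non-DELETE 404.
type reinitTransport struct {
	base     http.RoundTripper
	initURL  string // stand-in for the ClusterIP URL
	initBody []byte // stored initialize request body
}

func (t *reinitTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	var body []byte
	if req.Body != nil {
		defer req.Body.Close()
		var err error
		if body, err = io.ReadAll(req.Body); err != nil {
			return nil, err
		}
	}
	// send forwards a fresh copy of the request, optionally with a new session ID.
	send := func(sessionID string) (*http.Response, error) {
		r := req.Clone(req.Context())
		r.Body = io.NopCloser(bytes.NewReader(body))
		if sessionID != "" {
			r.Header.Set(sessionHeader, sessionID)
		}
		return t.base.RoundTrip(r)
	}
	resp, err := send("")
	if err != nil || resp.StatusCode != http.StatusNotFound || req.Method == http.MethodDelete {
		return resp, err
	}
	resp.Body.Close()
	// Replay the stored initialize body to obtain a fresh backend session ID.
	initReq, err := http.NewRequestWithContext(req.Context(), http.MethodPost, t.initURL, bytes.NewReader(t.initBody))
	if err != nil {
		return nil, err
	}
	initResp, err := t.base.RoundTrip(initReq)
	if err != nil {
		return nil, err
	}
	initResp.Body.Close()
	newID := initResp.Header.Get(sessionHeader)
	if newID == "" {
		return nil, errors.New("re-initialize returned no session ID")
	}
	return send(newID)
}

// demo sends a POST and a DELETE with a stale session ID to a fake backend and returns both status codes.
func demo() (postStatus, deleteStatus int, err error) {
	backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch r.Header.Get(sessionHeader) {
		case "": // no session header: treated as initialize in this sketch
			w.Header().Set(sessionHeader, "fresh")
		case "fresh": // known session: implicit 200
		default:
			w.WriteHeader(http.StatusNotFound)
		}
	}))
	defer backend.Close()
	rt := &reinitTransport{base: http.DefaultTransport, initURL: backend.URL, initBody: []byte(`{"method":"initialize"}`)}
	client := &http.Client{Transport: rt}
	do := func(method string) (int, error) {
		req, err := http.NewRequest(method, backend.URL, bytes.NewReader([]byte(`{}`)))
		if err != nil {
			return 0, err
		}
		req.Header.Set(sessionHeader, "stale")
		resp, err := client.Do(req)
		if err != nil {
			return 0, err
		}
		resp.Body.Close()
		return resp.StatusCode, nil
	}
	if postStatus, err = do(http.MethodPost); err != nil {
		return 0, 0, err
	}
	deleteStatus, err = do(http.MethodDelete)
	return postStatus, deleteStatus, err
}

func main() {
	fmt.Println(demo())
}
```

Buffering the request body up front is what makes the replay possible; a real proxy would likely want a size limit on that buffer, which this sketch does not include.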

Fixes #4575

Test plan

Unit tests (task test)
E2E tests (task test-e2e)
Linting (task lint-fix)
Manual testing (describe below)

Large PR Justification

This is a complete PR that covers a single bugfix and includes comprehensive testing. It cannot be split.

codecov · 2026-04-08T14:24:28Z

Codecov Report

❌ Patch coverage is 89.83051% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.69%. Comparing base (033e051) to head (1e7b6fb).
⚠️ Report is 16 commits behind head on main.

Files with missing lines	Patch %	Lines
...g/transport/proxy/transparent/transparent_proxy.go	89.83%	6 Missing and 6 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4673      +/-   ##
==========================================
+ Coverage   68.66%   68.69%   +0.02%     
==========================================
  Files         515      516       +1     
  Lines       53580    53948     +368     
==========================================
+ Hits        36793    37058     +265     
- Misses      13947    14039      +92     
- Partials     2840     2851      +11

Copilot

Pull request overview

This PR fixes MCP session misrouting after a proxy runner restart in multi-backend Kubernetes deployments by recording the actual backend pod address used during initialize and adding a transparent backend re-initialization/replay mechanism when a pod becomes unreachable or loses in-memory session state.

Changes:

- Capture the selected backend pod address on initialize via httptrace.GotConn and persist it as backend_url for stable follow-up routing after restarts.
- Persist the raw initialize request body in session metadata and transparently re-initialize + replay requests on backend 404s (non-DELETE) and TCP dial errors.
- Add unit tests for storing init_body and for the 404/dial-error transparent re-init + replay behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
pkg/transport/proxy/transparent/transparent_proxy.go	Adds httptrace-based pod pinning, stores init body, and implements transparent re-init + replay with session ID rewriting.
pkg/transport/proxy/transparent/backend_routing_test.go	Adds tests covering init-body persistence and transparent re-init behavior on 404 and dial errors.

jerm-dro

suggestion: PR #4574 adds E2E acceptance tests that demonstrate this exact bug (proxy runner restart with backendReplicas > 1 → session not found). Consider merging that branch into this one so the failing tests land alongside the fix — proving the fix end-to-end in CI rather than relying solely on the unit tests here. The acceptance tests are the strongest evidence that the bug is resolved across all three K8s versions.

pkg/transport/proxy/transparent/transparent_proxy.go

jerm-dro

Please confirm the E2E acceptance test from #4574 passes with this fix before closing #4575.

yrobla · 2026-04-09T12:55:03Z

To avoid problems with rebases, I included the E2E tests you created as part of this PR itself.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

github-actions · 2026-04-09T13:26:12Z

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Extract re-initialization logic from tracingTransport into a dedicated backendRecovery type backed by a narrow recoverySessionStore interface (Get + UpsertSession) and a forward func. tracingTransport now owns only the request lifecycle (session guard, initialize detection, httptrace, session creation) and delegates all forwarding and recovery to backendRecovery.

This makes reinitializeAndReplay and podBackendURL testable without standing up a full TransparentProxy: backend_recovery_test.go covers all recovery paths (no session, no init body, happy path, forward error, missing new session ID) using a stubSessionStore and inline httptest servers.

tracingTransport.forward() and the base field are removed; all network I/O goes through recovery.forward, a single source of truth for the underlying transport.

Also integrates the E2E acceptance test from #4574 that exercises backendReplicas=2 + proxy runner restart, verifying that sessions are routed to the correct backend pod after re-initialization.
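To illustrate why a narrow store interface plus an injected forward function makes the recovery paths easy to unit test, here is a minimal sketch. Every name and signature in it (sessionRecord, recoveryStore, forwardFunc, recovery, stubStore, runCases) is an assumption made for illustration; the PR's actual backendRecovery and recoverySessionStore definitions are not shown in this page and may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// sessionRecord is a hypothetical stand-in for stored session metadata.
type sessionRecord struct {
	ID        string // client-facing session ID
	BackendID string // session ID issued by the backend
	InitBody  []byte // raw initialize request body
}

// recoveryStore is a hypothetical two-method interface; real signatures may differ.
type recoveryStore interface {
	Get(id string) (sessionRecord, bool)
	UpsertSession(rec sessionRecord) error
}

// forwardFunc abstracts the network call: it sends an initialize body and
// returns the backend's new session ID.
type forwardFunc func(initBody []byte) (string, error)

type recovery struct {
	store   recoveryStore
	forward forwardFunc
}

// reinitialize replays the stored initialize body and records the new backend session ID.
func (r *recovery) reinitialize(id string) (string, error) {
	rec, ok := r.store.Get(id)
	if !ok {
		return "", errors.New("no session")
	}
	if len(rec.InitBody) == 0 {
		return "", errors.New("no init body")
	}
	newID, err := r.forward(rec.InitBody)
	if err != nil {
		return "", fmt.Errorf("forward: %w", err)
	}
	if newID == "" {
		return "", errors.New("missing new session ID")
	}
	rec.BackendID = newID
	return newID, r.store.UpsertSession(rec)
}

// stubStore is an in-memory recoveryStore for tests.
type stubStore map[string]sessionRecord

func (s stubStore) Get(id string) (sessionRecord, bool) {
	rec, ok := s[id]
	return rec, ok
}

func (s stubStore) UpsertSession(rec sessionRecord) error {
	s[rec.ID] = rec
	return nil
}

// runCases exercises each recovery path and returns "ok:<id>" or "err:<msg>" per case.
func runCases() map[string]string {
	store := stubStore{
		"good":   {ID: "good", BackendID: "old", InitBody: []byte(`{"method":"initialize"}`)},
		"nobody": {ID: "nobody"},
	}
	okForward := func([]byte) (string, error) { return "backend-2", nil }
	run := func(id string, fwd forwardFunc) string {
		got, err := (&recovery{store: store, forward: fwd}).reinitialize(id)
		if err != nil {
			return "err:" + err.Error()
		}
		return "ok:" + got
	}
	return map[string]string{
		"no session":    run("missing", okForward),
		"no init body":  run("nobody", okForward),
		"forward error": run("good", func([]byte) (string, error) { return "", errors.New("dial tcp: refused") }),
		"missing ID":    run("good", func([]byte) (string, error) { return "", nil }),
		"happy path":    run("good", okForward),
	}
}

func main() {
	for name, result := range runCases() {
		fmt.Printf("%s => %s\n", name, result)
	}
}
```

Because the recovery logic depends only on two store methods and one function value, each path can be driven with a map-backed stub and a closure, with no proxy or network involved.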

jerm-dro · 2026-04-09T20:39:42Z

Clicked the merge button to ensure this gets out in the very next release 😃

yrobla requested review from ChrisJBurns, JAORMX, amirejaz, blkt and jhrozek as code owners April 8, 2026 14:17

github-actions bot added the size/M Medium PR: 300-599 lines changed label Apr 8, 2026

yrobla requested a review from Copilot April 8, 2026 14:19

Copilot started reviewing on behalf of yrobla April 8, 2026 14:20

Copilot AI reviewed Apr 8, 2026

yrobla force-pushed the issue-4575-v1 branch from 6b5cb78 to cac1257 April 8, 2026 14:54

github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Apr 8, 2026

jerm-dro reviewed Apr 8, 2026

jerm-dro previously approved these changes Apr 8, 2026

yrobla force-pushed the issue-4575-v1 branch from cac1257 to 32960b1 April 9, 2026 12:51

github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Apr 9, 2026

yrobla dismissed jerm-dro’s stale review via b0cf663 April 9, 2026 13:08

github-actions bot added size/L Large PR: 600-999 lines changed and removed size/M Medium PR: 300-599 lines changed labels Apr 9, 2026

yrobla requested review from Copilot and jerm-dro April 9, 2026 13:09

yrobla force-pushed the issue-4575-v1 branch from b0cf663 to 8dda48d April 9, 2026 13:13

github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Apr 9, 2026

Copilot AI reviewed Apr 9, 2026

yrobla force-pushed the issue-4575-v1 branch from 8dda48d to 1087c52 April 9, 2026 13:23

github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 9, 2026

yrobla requested a review from Copilot April 9, 2026 13:26

Copilot started reviewing on behalf of yrobla April 9, 2026 13:26

Copilot AI reviewed Apr 9, 2026

yrobla force-pushed the issue-4575-v1 branch from 1087c52 to ce9d3ca April 9, 2026 13:41

github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 9, 2026

yrobla requested a review from Copilot April 9, 2026 13:42

Copilot started reviewing on behalf of yrobla April 9, 2026 13:43

Copilot AI reviewed Apr 9, 2026

yrobla force-pushed the issue-4575-v1 branch from ce9d3ca to 1ef137e April 9, 2026 14:26

github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 9, 2026

yrobla force-pushed the issue-4575-v1 branch from 1ef137e to ec33d36 April 9, 2026 14:36

yrobla requested a review from Copilot April 9, 2026 14:36

Copilot started reviewing on behalf of yrobla April 9, 2026 14:37

github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 9, 2026

Copilot AI reviewed Apr 9, 2026

yrobla force-pushed the issue-4575-v1 branch from ec33d36 to 1e7b6fb April 9, 2026 14:50

github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/XL Extra large PR: 1000+ lines changed labels Apr 9, 2026

jerm-dro approved these changes Apr 9, 2026

jerm-dro merged commit d851c69 into main Apr 9, 2026
40 checks passed

jerm-dro deleted the issue-4575-v1 branch April 9, 2026 20:39
