
Add E2E acceptance test for backend routing after proxy restart#4574

Open
jerm-dro wants to merge 6 commits into main from e2e-backend-routing-acceptance-test
Conversation


@jerm-dro jerm-dro commented Apr 6, 2026

Summary

  • When backendReplicas > 1 and the proxy runner restarts, the recovered session's backend_url points to the ClusterIP service. Kube-proxy may route to a backend pod that never handled the session's initialize, causing HTTP 404 / JSON-RPC -32001 "session not found" errors.
  • This PR adds an E2E acceptance test that demonstrates the bug: it creates an MCPServer with backendReplicas=2, establishes a session, deletes the proxy runner pod, and sends 5 requests with the same session ID. With 2 backends and random routing, P(all 5 hit correct pod) ≈ 3%, making the failure reliably detectable.
  • This test is expected to fail until pod-specific headless DNS routing is implemented.

Type of change

  • New feature

Test plan

  • E2E tests (task test-e2e) — this PR adds the test; it is expected to fail on CI, proving the bug exists
  • Unit tests (task test) — no production code changed, existing tests unaffected

Does this introduce a user-facing change?

No

Special notes for reviewers

This test is intentionally expected to fail on CI. It demonstrates the backend routing bug that will be fixed in a follow-up PR implementing pod-specific headless DNS routing. The test context "backendReplicas=2 and proxy runner restarts" will pass once the fix lands.

Generated with Claude Code

jerm-dro and others added 5 commits April 3, 2026 08:52
MCPServer supports horizontal scaling with Redis session storage, but
there was no E2E test verifying that a session established on one pod
is accessible from a different pod. This test deploys an MCPServer with
replicas=2 and Redis session storage, initializes an MCP session, then
sends raw JSON-RPC requests directly to each pod IP using the same
Mcp-Session-Id header to prove sessions are shared via Redis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pod IPs are not reachable from the CI runner host in Kind clusters.
Replace direct pod IP HTTP calls with kubectl port-forward to each
pod, which tunnels through the Kind node's network.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
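The port-forward workaround can be sketched like this. The helper only assembles the `kubectl port-forward` invocation; pod and namespace names are placeholders, and in the real test the command would be started, requests sent to localhost, and the process killed afterwards.

```go
// Sketch of the Kind workaround: pod IPs are unreachable from the CI
// runner host, so the test tunnels through `kubectl port-forward` and
// talks to 127.0.0.1 instead of the pod IP.
package main

import (
	"fmt"
	"os/exec"
)

// portForwardCmd builds: kubectl port-forward pod/<name> <local>:<remote> -n <ns>
func portForwardCmd(pod, namespace string, localPort, podPort int) *exec.Cmd {
	return exec.Command("kubectl",
		"port-forward", "pod/"+pod,
		fmt.Sprintf("%d:%d", localPort, podPort),
		"-n", namespace)
}

func main() {
	cmd := portForwardCmd("mcp-backend-0", "default", 18080, 8080)
	fmt.Println(cmd.Args)
	// In the E2E test: cmd.Start(), send requests to
	// http://127.0.0.1:18080, then cmd.Process.Kill().
}
```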
The MCPServer CRD's sessionStorage config was populated by the operator
into RunConfig but the proxy runner never read it — sessions always used
in-memory LocalStorage, making cross-replica routing non-functional.

Add WithSessionStorage transport option and wire ScalingConfig.SessionRedis
from RunConfig into the transport layer so both StdioTransport and
HTTPTransport (transparent proxy) use Redis-backed session storage when
configured.

Rewrite the E2E test to use mcp-go clients throughout, including
transport.WithSession to create a client on pod B that reuses the
session established on pod A.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
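The `WithSessionStorage` wiring described above follows the standard Go functional-option pattern. A minimal sketch, with simplified stand-in types rather than the actual toolhive transport API (note a later commit in this PR moves this to a `types.Config` field instead):

```go
// Sketch of a WithSessionStorage-style functional option: callers
// inject the storage backend instead of the transport always
// defaulting to in-memory storage.
package main

import "fmt"

// SessionStorage is a stand-in for the project's storage interface.
type SessionStorage interface{ Name() string }

type redisStorage struct{}

func (redisStorage) Name() string { return "redis" }

type Transport struct{ storage SessionStorage }

type Option func(*Transport)

// WithSessionStorage sets the session store used by the transport.
func WithSessionStorage(s SessionStorage) Option {
	return func(t *Transport) { t.storage = s }
}

func NewTransport(opts ...Option) *Transport {
	t := &Transport{}
	for _, opt := range opts {
		opt(t)
	}
	return t
}

func main() {
	t := NewTransport(WithSessionStorage(redisStorage{}))
	fmt.Println("session storage:", t.storage.Name())
}
```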
Pass SessionStorage through types.Config instead of a factory option
with interface assertion. The factory now sets the field directly on
each transport type during construction.

Add clientA to codespell ignore list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When backendReplicas > 1 and the proxy runner restarts, the recovered
session's backend_url points to the ClusterIP service. Kube-proxy may
route to a backend pod that never handled the session's initialize,
causing HTTP 404 / JSON-RPC -32001 "session not found" errors.

This test creates an MCPServer with backendReplicas=2, establishes a
session, deletes the proxy runner pod, and sends 5 requests with the
same session ID. With 2 backends and random routing, P(all 5 hit the
correct pod) ≈ 3%, making the failure reliably detectable.

The test is expected to fail until pod-specific headless DNS routing
is implemented in the proxy runner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the size/M Medium PR: 300-599 lines changed label Apr 6, 2026

codecov bot commented Apr 6, 2026

Codecov Report

❌ Patch coverage is 0% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.84%. Comparing base (2e8e487) to head (c26ef73).
⚠️ Report is 18 commits behind head on main.

Files with missing lines Patch % Lines
pkg/runner/runner.go 0.00% 19 Missing ⚠️
pkg/transport/factory.go 0.00% 11 Missing ⚠️
pkg/transport/http.go 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4574      +/-   ##
==========================================
- Coverage   69.06%   68.84%   -0.23%     
==========================================
  Files         502      505       +3     
  Lines       51997    52408     +411     
==========================================
+ Hits        35913    36078     +165     
- Misses      13300    13542     +242     
- Partials     2784     2788       +4     


The namespace parameter is always defaultNamespace in current tests
but is kept for reusability across future test contexts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Apr 6, 2026
yrobla pushed a commit that referenced this pull request Apr 9, 2026
Extract re-initialization logic from tracingTransport into a dedicated
backendRecovery type backed by a narrow recoverySessionStore interface
(Get + UpsertSession) and a forward func. tracingTransport now owns only
the request lifecycle (session guard, initialize detection, httptrace,
session creation) and delegates all forwarding and recovery to
backendRecovery.

This makes reinitializeAndReplay and podBackendURL testable without
standing up a full TransparentProxy: backend_recovery_test.go covers
all recovery paths (no session, no init body, happy path, forward
error, missing new session ID) using a stubSessionStore and inline
httptest servers.

tracingTransport.forward() and the base field are removed; all network
I/O goes through recovery.forward — a single source of truth for the
underlying transport.

Also integrates the E2E acceptance test from #4574 that exercises
backendReplicas=2 + proxy runner restart, verifying that sessions are
routed to the correct backend pod after re-initialization.
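The narrow seam this commit describes can be sketched as follows: recovery logic depends only on a two-method store interface, so tests can substitute an in-memory stub for the real session store. Types here are illustrative stand-ins, not the actual `backendRecovery` code.

```go
// Sketch of the recoverySessionStore seam (Get + UpsertSession) and
// the kind of stub that lets recovery paths be tested without a full
// TransparentProxy.
package main

import "fmt"

type Session struct {
	ID         string
	BackendURL string
}

// recoverySessionStore mirrors the narrow interface described in the
// commit message.
type recoverySessionStore interface {
	Get(id string) (*Session, bool)
	UpsertSession(s *Session)
}

// stubSessionStore is an in-memory fake in the spirit of the one
// backend_recovery_test.go uses.
type stubSessionStore struct{ sessions map[string]*Session }

func (s *stubSessionStore) Get(id string) (*Session, bool) {
	sess, ok := s.sessions[id]
	return sess, ok
}

func (s *stubSessionStore) UpsertSession(sess *Session) {
	s.sessions[sess.ID] = sess
}

func main() {
	var store recoverySessionStore = &stubSessionStore{sessions: map[string]*Session{}}
	store.UpsertSession(&Session{ID: "s1", BackendURL: "http://10.0.0.7:8080"})
	got, ok := store.Get("s1")
	fmt.Println(ok, got.BackendURL)
}
```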
jerm-dro pushed a commit that referenced this pull request Apr 9, 2026

* Route MCP sessions to the originating backend pod using httptrace

When a proxy runner pod restarts it recovers sessions from Redis but
backend_url stored the ClusterIP, so kube-proxy could send follow-up
requests to a different backend pod that never handled initialize —
causing JSON-RPC -32001 "session not found" errors on the first request.

Use net/http/httptrace.GotConn to capture the actual backend pod IP
after kube-proxy DNAT on every initialize request, and store that as
backend_url instead of the ClusterIP URL. The existing Rewrite closure
already reads backend_url and pins routing to the correct pod; no
changes to that path are needed.

When the backend pod is later replaced (rescheduled to a new IP or
restarted in place and lost in-memory session state), the proxy now
re-initializes the backend session transparently rather than returning
404 to the client:

- Dial error (pod IP unreachable): re-init triggers on TCP failure
- Backend 404 (session lost, same IP): re-init triggers on response

In both cases the proxy replays the stored initialize body against the
ClusterIP, captures the new pod IP via GotConn, stores the new backend
session ID, rewrites outbound Mcp-Session-Id headers, and replays the
original client request — the client sees no error.

DELETE responses are excluded from the 404 re-init path since the
session is intentionally torn down in that case.

Closes #4575

* Reimplement tracingTransport logic + add e2e tests

Extract re-initialization logic from tracingTransport into a dedicated
backendRecovery type backed by a narrow recoverySessionStore interface
(Get + UpsertSession) and a forward func. tracingTransport now owns only
the request lifecycle (session guard, initialize detection, httptrace,
session creation) and delegates all forwarding and recovery to
backendRecovery.

This makes reinitializeAndReplay and podBackendURL testable without
standing up a full TransparentProxy: backend_recovery_test.go covers
all recovery paths (no session, no init body, happy path, forward
error, missing new session ID) using a stubSessionStore and inline
httptest servers.

tracingTransport.forward() and the base field are removed; all network
I/O goes through recovery.forward — a single source of truth for the
underlying transport.

Also integrates the E2E acceptance test from #4574 that exercises
backendReplicas=2 + proxy runner restart, verifying that sessions are
routed to the correct backend pod after re-initialization.

---------

Co-authored-by: taskbot <taskbot@users.noreply.github.com>

Labels

size/M Medium PR: 300-599 lines changed
