
Add E2E acceptance test for backend routing after proxy restart#4574

Open
jerm-dro wants to merge 6 commits into main from e2e-backend-routing-acceptance-test
Conversation


@jerm-dro jerm-dro commented Apr 6, 2026

Summary

  • When backendReplicas > 1 and the proxy runner restarts, the recovered session's backend_url points to the ClusterIP service. Kube-proxy may route to a backend pod that never handled the session's initialize, causing HTTP 404 / JSON-RPC -32001 "session not found" errors.
  • This PR adds an E2E acceptance test that demonstrates the bug: it creates an MCPServer with backendReplicas=2, establishes a session, deletes the proxy runner pod, and sends 5 requests with the same session ID. With 2 backends and random routing, P(all 5 hit correct pod) ≈ 3%, making the failure reliably detectable.
  • This test is expected to fail until pod-specific headless DNS routing is implemented.

Type of change

  • New feature

Test plan

  • E2E tests (task test-e2e) — this PR adds the test; it is expected to fail on CI, proving the bug exists
  • Unit tests (task test) — no production code changed, existing tests unaffected

Does this introduce a user-facing change?

No

Special notes for reviewers

This test is intentionally expected to fail on CI. It demonstrates the backend routing bug that will be fixed in a follow-up PR implementing pod-specific headless DNS routing. The test context "backendReplicas=2 and proxy runner restarts" will pass once the fix lands.

Generated with Claude Code

jerm-dro and others added 5 commits April 3, 2026 08:52
MCPServer supports horizontal scaling with Redis session storage, but
there was no E2E test verifying that a session established on one pod
is accessible from a different pod. This test deploys an MCPServer with
replicas=2 and Redis session storage, initializes an MCP session, then
sends raw JSON-RPC requests directly to each pod IP using the same
Mcp-Session-Id header to prove sessions are shared via Redis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pod IPs are not reachable from the CI runner host in Kind clusters.
Replace direct pod IP HTTP calls with kubectl port-forward to each
pod, which tunnels through the Kind node's network.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
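The port-forward workaround can be sketched like this. The helper only assembles the `kubectl port-forward` invocation; pod and namespace names are placeholders, and in the real test the command would be started, requests sent to localhost, and the process killed afterwards.

```go
// Sketch of the Kind workaround: pod IPs are unreachable from the CI
// runner host, so the test tunnels through `kubectl port-forward` and
// talks to 127.0.0.1 instead of the pod IP.
package main

import (
	"fmt"
	"os/exec"
)

// portForwardCmd builds: kubectl port-forward pod/<name> <local>:<remote> -n <ns>
func portForwardCmd(pod, namespace string, localPort, podPort int) *exec.Cmd {
	return exec.Command("kubectl",
		"port-forward", "pod/"+pod,
		fmt.Sprintf("%d:%d", localPort, podPort),
		"-n", namespace)
}

func main() {
	cmd := portForwardCmd("mcp-backend-0", "default", 18080, 8080)
	fmt.Println(cmd.Args)
	// In the E2E test: cmd.Start(), send requests to
	// http://127.0.0.1:18080, then cmd.Process.Kill().
}
```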
The MCPServer CRD's sessionStorage config was populated by the operator
into RunConfig but the proxy runner never read it — sessions always used
in-memory LocalStorage, making cross-replica routing non-functional.

Add WithSessionStorage transport option and wire ScalingConfig.SessionRedis
from RunConfig into the transport layer so both StdioTransport and
HTTPTransport (transparent proxy) use Redis-backed session storage when
configured.

Rewrite the E2E test to use mcp-go clients throughout, including
transport.WithSession to create a client on pod B that reuses the
session established on pod A.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
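The `WithSessionStorage` wiring described above follows the standard Go functional-option pattern. A minimal sketch, with simplified stand-in types rather than the actual toolhive transport API (note a later commit in this PR moves this to a `types.Config` field instead):

```go
// Sketch of a WithSessionStorage-style functional option: callers
// inject the storage backend instead of the transport always
// defaulting to in-memory storage.
package main

import "fmt"

// SessionStorage is a stand-in for the project's storage interface.
type SessionStorage interface{ Name() string }

type redisStorage struct{}

func (redisStorage) Name() string { return "redis" }

type Transport struct{ storage SessionStorage }

type Option func(*Transport)

// WithSessionStorage sets the session store used by the transport.
func WithSessionStorage(s SessionStorage) Option {
	return func(t *Transport) { t.storage = s }
}

func NewTransport(opts ...Option) *Transport {
	t := &Transport{}
	for _, opt := range opts {
		opt(t)
	}
	return t
}

func main() {
	t := NewTransport(WithSessionStorage(redisStorage{}))
	fmt.Println("session storage:", t.storage.Name())
}
```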
Pass SessionStorage through types.Config instead of a factory option
with interface assertion. The factory now sets the field directly on
each transport type during construction.

Add clientA to codespell ignore list.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When backendReplicas > 1 and the proxy runner restarts, the recovered
session's backend_url points to the ClusterIP service. Kube-proxy may
route to a backend pod that never handled the session's initialize,
causing HTTP 404 / JSON-RPC -32001 "session not found" errors.

This test creates an MCPServer with backendReplicas=2, establishes a
session, deletes the proxy runner pod, and sends 5 requests with the
same session ID. With 2 backends and random routing, P(all 5 hit the
correct pod) ≈ 3%, making the failure reliably detectable.

The test is expected to fail until pod-specific headless DNS routing
is implemented in the proxy runner.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the size/M Medium PR: 300-599 lines changed label Apr 6, 2026

codecov bot commented Apr 6, 2026

Codecov Report

❌ Patch coverage is 0% with 32 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.84%. Comparing base (2e8e487) to head (c26ef73).
⚠️ Report is 18 commits behind head on main.

Files with missing lines Patch % Lines
pkg/runner/runner.go 0.00% 19 Missing ⚠️
pkg/transport/factory.go 0.00% 11 Missing ⚠️
pkg/transport/http.go 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4574      +/-   ##
==========================================
- Coverage   69.06%   68.84%   -0.23%     
==========================================
  Files         502      505       +3     
  Lines       51997    52408     +411     
==========================================
+ Hits        35913    36078     +165     
- Misses      13300    13542     +242     
- Partials     2784     2788       +4     


The namespace parameter is always defaultNamespace in current tests
but is kept for reusability across future test contexts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Apr 6, 2026
yrobla pushed a commit that referenced this pull request Apr 9, 2026
Extract re-initialization logic from tracingTransport into a dedicated
backendRecovery type backed by a narrow recoverySessionStore interface
(Get + UpsertSession) and a forward func. tracingTransport now owns only
the request lifecycle (session guard, initialize detection, httptrace,
session creation) and delegates all forwarding and recovery to
backendRecovery.

This makes reinitializeAndReplay and podBackendURL testable without
standing up a full TransparentProxy: backend_recovery_test.go covers
all recovery paths (no session, no init body, happy path, forward
error, missing new session ID) using a stubSessionStore and inline
httptest servers.

tracingTransport.forward() and the base field are removed; all network
I/O goes through recovery.forward — a single source of truth for the
underlying transport.

Also integrates the E2E acceptance test from #4574 that exercises
backendReplicas=2 + proxy runner restart, verifying that sessions are
routed to the correct backend pod after re-initialization.
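The narrow seam this commit describes can be sketched as follows: recovery logic depends only on a two-method store interface, so tests can substitute an in-memory stub for the real session store. Types here are illustrative stand-ins, not the actual `backendRecovery` code.

```go
// Sketch of the recoverySessionStore seam (Get + UpsertSession) and
// the kind of stub that lets recovery paths be tested without a full
// TransparentProxy.
package main

import "fmt"

type Session struct {
	ID         string
	BackendURL string
}

// recoverySessionStore mirrors the narrow interface described in the
// commit message.
type recoverySessionStore interface {
	Get(id string) (*Session, bool)
	UpsertSession(s *Session)
}

// stubSessionStore is an in-memory fake in the spirit of the one
// backend_recovery_test.go uses.
type stubSessionStore struct{ sessions map[string]*Session }

func (s *stubSessionStore) Get(id string) (*Session, bool) {
	sess, ok := s.sessions[id]
	return sess, ok
}

func (s *stubSessionStore) UpsertSession(sess *Session) {
	s.sessions[sess.ID] = sess
}

func main() {
	var store recoverySessionStore = &stubSessionStore{sessions: map[string]*Session{}}
	store.UpsertSession(&Session{ID: "s1", BackendURL: "http://10.0.0.7:8080"})
	got, ok := store.Get("s1")
	fmt.Println(ok, got.BackendURL)
}
```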
jerm-dro pushed a commit that referenced this pull request Apr 9, 2026

* Route MCP sessions to the originating backend pod using httptrace

When a proxy runner pod restarts it recovers sessions from Redis but
backend_url stored the ClusterIP, so kube-proxy could send follow-up
requests to a different backend pod that never handled initialize —
causing JSON-RPC -32001 "session not found" errors on the first request.

Use net/http/httptrace.GotConn to capture the actual backend pod IP
after kube-proxy DNAT on every initialize request, and store that as
backend_url instead of the ClusterIP URL. The existing Rewrite closure
already reads backend_url and pins routing to the correct pod; no
changes to that path are needed.

When the backend pod is later replaced (rescheduled to a new IP or
restarted in place and lost in-memory session state), the proxy now
re-initializes the backend session transparently rather than returning
404 to the client:

- Dial error (pod IP unreachable): re-init triggers on TCP failure
- Backend 404 (session lost, same IP): re-init triggers on response

In both cases the proxy replays the stored initialize body against the
ClusterIP, captures the new pod IP via GotConn, stores the new backend
session ID, rewrites outbound Mcp-Session-Id headers, and replays the
original client request — the client sees no error.

DELETE responses are excluded from the 404 re-init path since the
session is intentionally torn down in that case.

Closes #4575

* Reimplement tracingTransport logic + add e2e tests

Extract re-initialization logic from tracingTransport into a dedicated
backendRecovery type backed by a narrow recoverySessionStore interface
(Get + UpsertSession) and a forward func. tracingTransport now owns only
the request lifecycle (session guard, initialize detection, httptrace,
session creation) and delegates all forwarding and recovery to
backendRecovery.

This makes reinitializeAndReplay and podBackendURL testable without
standing up a full TransparentProxy: backend_recovery_test.go covers
all recovery paths (no session, no init body, happy path, forward
error, missing new session ID) using a stubSessionStore and inline
httptest servers.

tracingTransport.forward() and the base field are removed; all network
I/O goes through recovery.forward — a single source of truth for the
underlying transport.

Also integrates the E2E acceptance test from #4574 that exercises
backendReplicas=2 + proxy runner restart, verifying that sessions are
routed to the correct backend pod after re-initialization.

---------

Co-authored-by: taskbot <taskbot@users.noreply.github.com>

Labels

size/M Medium PR: 300-599 lines changed
