Add per-user rate limit types and limiter support#4692
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4692 +/- ##
==========================================
- Coverage 68.66% 68.63% -0.03%
==========================================
Files 509 516 +7
Lines 52987 54192 +1205
==========================================
+ Hits 36384 37196 +812
- Misses 13782 14129 +347
- Partials 2821 2867 +46 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Add PerUser field to RateLimitConfig and ToolRateLimitConfig so administrators can configure per-user token bucket rate limits on MCPServer. Make ToolRateLimitConfig.Shared optional since a tool entry may now have only a perUser limit. CEL admission validation enforces that perUser rate limiting requires authentication (oidcConfig, oidcConfigRef, or externalAuthConfigRef) at both server-level and per-tool level. The existing "at least one scope" rule is updated to include perUser alongside shared and tools. Add RateLimitConfigValid condition type and reason constants for use in the operator reconciler (wired in a following commit). Part of #4550 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Validate that per-user rate limiting has authentication enabled at reconciliation time (defense-in-depth alongside CEL admission). Set RateLimitConfigValid condition with appropriate reason: - RateLimitConfigValid when configuration is valid - PerUserRequiresAuth when perUser is set without auth - RateLimitNotApplicable when rate limiting is not configured Part of #4550 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Extend the limiter to create per-user token buckets keyed by userID.
Per-user buckets are stored as deferred specs (bucketSpec) at
construction time and materialized into TokenBucket structs at Allow()
time since the userID is request-scoped. bucket.New() only allocates
a struct (no I/O), so per-request creation is cheap.
All applicable buckets (shared server, shared per-tool, per-user
server, per-user per-tool) are checked atomically via ConsumeAll.
The Lua script's two-phase check-then-consume ensures a per-user
rejection does not drain the shared bucket.
Redis keys follow the RFC format:
- Server per-user: thv:rl:{ns:name}:user:{userID}
- Tool per-user: thv:rl:{ns:name}:user:{userID}:tool:{toolName}
Part of #4550
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use pointer for optional perUserSpec (clearer than bool+value) - Use distinct key prefix "user-tool:" for per-tool per-user buckets to prevent key collisions when userID contains delimiter characters - Extract shared validateBucketCRD helper to deduplicate validation between newBucket and newBucketSpec Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1067400 to
21d2210
Compare
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JAORMX
left a comment
There was a problem hiding this comment.
OK so... I went through this pretty carefully. The algorithm is solid, the split into PR 1/2 is clean, and the atomic guarantee test (TestLimiter_PerUserRejectionDoesNotDrainShared) is exactly the kind of test I want to see. The CEL + controller defense-in-depth is the right pattern. Good stuff.
That said, there are a few things to fix before this is ready.
Must fix
1. t.Context() instead of context.Background() in limiter tests
pkg/ratelimit/limiter_test.go - all 7 new tests use context.Background(). Our convention is t.Context(). Since you're already touching this file, please fix the existing tests too (they have the same issue from before this PR).
2. Swagger description for RateLimitBucket is wrong
Check docs/server/docs.go, swagger.json, and swagger.yaml. The RateLimitBucket type description now reads:
"PerUser defines a token bucket applied independently to each authenticated user for this specific tool..."
But RateLimitBucket is a shared type used by both Shared and PerUser fields. The swagger generator is picking up whichever field's doc comment appears last. Fix: add a proper doc comment on the RateLimitBucket type itself so the generator picks that up instead of a field-level comment.
3. RFC key format deviation needs a comment
The RFC says per-user per-tool keys should be thv:rl:{ns}:{server}:user:{userId}:tool:{toolName}, but the implementation uses user-tool:{toolName}:{userID}. The PR notes explain why (prevents collision when userID contains :tool:), and I think that's the right call. But... we should document this. A code comment in Allow() explaining the deviation, and we should update the RFC to match what we actually ship. Otherwise someone reads the RFC and builds the wrong mental model.
Should fix
4. Condition semantics when rate limiting is not configured
When rateLimiting is nil, you set RateLimitConfigValid to False with reason NotApplicable. The problem is that RateLimitConfigValid = False reads as "the config is invalid"... which isn't what's happening. Rate limiting just isn't configured, that's all.
Two options that feel better:
- Skip setting the condition entirely when the feature is off
- Set it to
Truewith reasonNotApplicable(trivially valid, nothing to validate)
The second approach follows what we do with ImageValidated / ImageValidationSkipped.
5. validateRateLimitConfig doesn't halt reconciliation
When per-user is configured without auth, the condition gets set to False but reconciliation keeps going. Compare with handleOIDCConfig / handleExternalAuthConfig which return errors to stop things. Now, CEL should block this at admission time anyway, and the runtime if userID != "" guard means per-user buckets are silently skipped (graceful degradation)... so the blast radius is low. But it would be good to document this intentional behavior in a comment. Something like "defense-in-depth only; CEL is the primary gate."
6. Redis memory sizing
Each unique userID creates Redis keys that expire after 2x refillPeriod. That's fine for normal usage. But with OIDC providers that issue pairwise or per-session sub claims, you could get a lot of keys in the TTL window. Worth adding a note in the CRD docs or a comment about the memory formula: max_keys ~= unique_users_per_TTL_window * (1 + num_tools_with_per_user_limits). Not a blocker but good to have.
Nits
- Condition name: issue #4550 says
RateLimitingConfigValid, PR hasRateLimitConfigValid. Intentional? I actually prefer the shorter name (matches the type), but let's update the issue so we don't confuse anyone later. //nolint:lllonToolRateLimitConfig: the CEL rulehas(self.shared) || has(self.perUser)is short. The nolint looks copy-pasted fromRateLimitConfigwhere it's actually needed.- Test object names have spaces:
TestRateLimitConfigValidationbuilds names likerl-perUser with auth. K8s names can't have spaces. The fake client doesn't care but the names are unrealistic. Use slugs likerl-peruser-with-auth. bucketSpeccomment: the existing comment explains the deferred creation well. Would be nice to add a line saying state lives in Redis, which is why creating a newTokenBucketper request is safe. Saves the next reader from wondering about it.
What's good
The algorithm implementation is correct. The bucketSpec pattern for deferred per-user bucket creation is the right design given that userID is request-scoped and bucket.New() is just a struct allocation. The ConsumeAll Lua script handles the new buckets naturally since they're just additional TokenBucket instances in the same atomic check. The ToolRateLimitConfig.Shared change from required to optional is backwards-compatible (existing resources have shared set and pass the new CEL rule). Security-wise, the user: vs user-tool: prefix ordering prevents cross-type key collision, the perUserTools map acts as an allowlist for toolName, and the if userID != "" guard is a good fail-safe. Clean work overall.
Must fix: - Replace context.Background() with t.Context() in all limiter tests (new and pre-existing) - Fix RateLimitBucket swagger description by trimming field-level comments so the shared type gets a neutral description - Add comment in Allow() documenting RFC key format deviation and why "user-tool:" prefix prevents cross-type key collisions Should fix: - Change condition to ConditionTrue with NotApplicable when rate limiting is not configured (matches ImageValidated/Skipped pattern) - Add defense-in-depth comment explaining reconciliation continues intentionally (CEL is the primary gate) - Add Redis memory sizing note on PerUser CRD field Nits: - Fix test object names to use K8s-valid slugs (no spaces) - Keep nolint:lll on ToolRateLimitConfig (kubebuilder marker is 146 chars, exceeds the 130 limit) - Improve bucketSpec comment noting state lives in Redis Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Thanks for the thorough review @JAORMX! All feedback addressed in 415bd1c. Must fix
Should fix
Nits
|
Per-user per-operation Redis keys now use distinct prefixes
(user-tool:, user-prompt:, user-resource:) instead of nesting
under user:{userId}:tool:... to prevent key collisions when a
userId contains delimiter characters like ":tool:".
The operation name precedes the userId so that the variable-length
userId is always the terminal key component.
Matches the implementation shipped in stacklok/toolhive#4692.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Per-user per-operation Redis keys now use distinct prefixes
(user-tool:, user-prompt:, user-resource:) instead of nesting
under user:{userId}:tool:... to prevent key collisions when a
userId contains delimiter characters like ":tool:".
The operation name precedes the userId so that the variable-length
userId is always the terminal key component.
Matches the implementation shipped in stacklok/toolhive#4692.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Cluster admins need per-user rate limiting so no single user can monopolize an MCPServer's tools. RFC THV-0057 defines per-user token buckets keyed by authenticated user identity.
perUserfield toRateLimitConfigandToolRateLimitConfigCRD types so admins can configure per-user token bucket limits at both server and per-tool levelperUserwhen authentication is not enabled (oidcConfig, oidcConfigRef, or externalAuthConfigRef required)RateLimitingConfigValidstatus condition set during reconciliation as defense-in-depth alongside CELAllow()time using raw userID in Redis keys (thv:rl:{ns:name}:user:{userID})ToolRateLimitConfig.Sharedoptional since a tool entry may now have only aperUserlimitThis is PR 1 of 2 for #4550. The middleware wiring (extracting identity from context and passing userID to the limiter) will follow in PR 2 after #4652 merges.
Part of #4550
Type of change
Test plan
task test)task lint-fix)Changes
cmd/thv-operator/api/v1alpha1/mcpserver_types.goPerUsertoRateLimitConfigandToolRateLimitConfig, condition constants, CEL rulescmd/thv-operator/api/v1alpha1/zz_generated.deepcopy.gocmd/thv-operator/controllers/mcpserver_controller.govalidateRateLimitConfigandsetRateLimitConfigConditioncmd/thv-operator/controllers/mcpserver_replicas_test.gopkg/ratelimit/limiter.gobucketSpec, per-user bucket creation inNewLimiter/Allowpkg/ratelimit/limiter_test.godeploy/charts/operator-crds/...Does this introduce a user-facing change?
Yes. Admins can now configure
rateLimiting.perUserandrateLimiting.tools[].perUseron MCPServer to set per-user token bucket rate limits. The limits are enforced once the middleware wiring lands in PR 2.Special notes for reviewers
TokenBucketstructs are created per-request inAllow()—bucket.New()only allocates a struct (no I/O), so this is cheap. The alternative (caching per-user buckets) would require eviction logic for unbounded user sets.TestLimiter_PerUserRejectionDoesNotDrainSharedtest verifies the atomic guarantee: when a per-user bucket rejects, the shared server bucket is NOT decremented. This is critical for preventing noisy users from affecting global capacity.{ns:name}closes before the userID appears, so no characters can cause key injection or slot routing issues.ToolRateLimitConfig.Sharedchanged from required to optional — backwards compatible since existing resources withsharedset still pass the new CEL rule (has(self.shared) || has(self.perUser)).Generated with Claude Code