
Add per-user rate limit types and limiter support#4692

Merged
jerm-dro merged 8 commits into main from jerm-dro/per-user-rate-limits
Apr 9, 2026
@jerm-dro
Contributor

@jerm-dro jerm-dro commented Apr 8, 2026

Summary

Cluster admins need per-user rate limiting so no single user can monopolize an MCPServer's tools. RFC THV-0057 defines per-user token buckets keyed by authenticated user identity.

  • Add perUser field to RateLimitConfig and ToolRateLimitConfig CRD types so admins can configure per-user token bucket limits at both server and per-tool level
  • Add CEL admission validation rejecting perUser when authentication is not enabled (oidcConfig, oidcConfigRef, or externalAuthConfigRef required)
  • Add RateLimitConfigValid status condition set during reconciliation as defense-in-depth alongside CEL
  • Extend the limiter to create per-user buckets dynamically at Allow() time using raw userID in Redis keys (thv:rl:{ns:name}:user:{userID})
  • Make ToolRateLimitConfig.Shared optional since a tool entry may now have only a perUser limit
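
The two admission rules in the list above (perUser requires authentication, and a tool entry needs at least one of shared or perUser) can be sketched in plain Go. This is an illustrative sketch only: the struct fields `MaxTokens` and `RefillSeconds`, the `validate` function, and the error text are assumptions made for the example and are not taken from the real CRD types or CEL rules.

```go
package main

import (
	"errors"
	"fmt"
)

// RateLimitBucket is a hypothetical stand-in for the CRD bucket type.
// The field names are assumptions, not the real schema.
type RateLimitBucket struct {
	MaxTokens     int
	RefillSeconds int
}

// ToolRateLimitConfig models a tool entry where Shared is optional.
type ToolRateLimitConfig struct {
	Name    string
	Shared  *RateLimitBucket
	PerUser *RateLimitBucket
}

// RateLimitConfig models the server-level block.
type RateLimitConfig struct {
	Shared  *RateLimitBucket
	PerUser *RateLimitBucket
	Tools   []ToolRateLimitConfig
}

var errPerUserRequiresAuth = errors.New("perUser rate limiting requires authentication")

// validate mirrors the two rules in ordinary Go (hypothetical helper).
func validate(cfg *RateLimitConfig, authEnabled bool) error {
	if cfg == nil {
		return nil // rate limiting not configured, nothing to check
	}
	if cfg.PerUser != nil && !authEnabled {
		return errPerUserRequiresAuth
	}
	for _, t := range cfg.Tools {
		if t.Shared == nil && t.PerUser == nil {
			return fmt.Errorf("tool %q: one of shared or perUser is required", t.Name)
		}
		if t.PerUser != nil && !authEnabled {
			return errPerUserRequiresAuth
		}
	}
	return nil
}

func main() {
	cfg := &RateLimitConfig{
		Tools: []ToolRateLimitConfig{
			{Name: "search", PerUser: &RateLimitBucket{MaxTokens: 10, RefillSeconds: 60}},
		},
	}
	fmt.Println(validate(cfg, false)) // rejected: no authentication
	fmt.Println(validate(cfg, true))  // accepted
}
```

A tool entry with only `PerUser` set passes once authentication is enabled, which is the case the optional `Shared` field is meant to allow.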

This is PR 1 of 2 for #4550. The middleware wiring (extracting identity from context and passing userID to the limiter) will follow in PR 2 after #4652 merges.

Part of #4550

Type of change

  • New feature

Test plan

  • Unit tests (task test)
  • Linting (task lint-fix)

Changes

| File | Change |
| --- | --- |
| cmd/thv-operator/api/v1alpha1/mcpserver_types.go | Add PerUser to RateLimitConfig and ToolRateLimitConfig, condition constants, CEL rules |
| cmd/thv-operator/api/v1alpha1/zz_generated.deepcopy.go | Regenerated deepcopy for new fields |
| cmd/thv-operator/controllers/mcpserver_controller.go | Add validateRateLimitConfig and setRateLimitConfigCondition |
| cmd/thv-operator/controllers/mcpserver_replicas_test.go | Table-driven tests for rate limit config validation (5 cases) |
| pkg/ratelimit/limiter.go | Add bucketSpec, per-user bucket creation in NewLimiter/Allow |
| pkg/ratelimit/limiter_test.go | 7 new tests covering per-user scenarios, atomic guarantee, validation |
| deploy/charts/operator-crds/... | Regenerated CRD YAML |

Does this introduce a user-facing change?

Yes. Admins can now configure rateLimiting.perUser and rateLimiting.tools[].perUser on MCPServer to set per-user token bucket rate limits. The limits are enforced once the middleware wiring lands in PR 2.

Special notes for reviewers

  • Per-user TokenBucket structs are created per-request in Allow(). bucket.New() only allocates a struct (no I/O), so this is cheap. The alternative (caching per-user buckets) would require eviction logic for unbounded user sets.
  • The TestLimiter_PerUserRejectionDoesNotDrainShared test verifies the atomic guarantee: when a per-user bucket rejects, the shared server bucket is NOT decremented. This is critical for preventing noisy users from affecting global capacity.
  • Raw userID is used in Redis keys. Redis keys are binary-safe and the hash tag {ns:name} closes before the userID appears, so no characters can cause key injection or slot routing issues.
  • ToolRateLimitConfig.Shared changed from required to optional — backwards compatible since existing resources with shared set still pass the new CEL rule (has(self.shared) || has(self.perUser)).
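
As a rough illustration of the first two notes, the sketch below models deferred per-user bucket creation and the check-then-consume guarantee entirely in memory. Everything in it is hypothetical: the `store` map stands in for Redis, `consumeAll` stands in for the Lua script, the `shared` key suffix is invented, and token refill is left out. It is not the code in pkg/ratelimit.

```go
package main

import "fmt"

// bucketSpec holds only static parameters; it is turned into a
// tokenBucket once the request's userID is known.
type bucketSpec struct {
	capacity int
}

// tokenBucket is a cheap handle. The token count lives in the store,
// so building a new handle per request loses no state.
type tokenBucket struct {
	key      string
	capacity int
}

// store stands in for Redis: key -> remaining tokens.
type store map[string]int

// consumeAll checks every bucket first and only then consumes, so a
// rejection by one bucket takes no token from the others.
func (s store) consumeAll(buckets []tokenBucket) bool {
	for _, b := range buckets {
		if _, ok := s[b.key]; !ok {
			s[b.key] = b.capacity // first use: bucket starts full
		}
		if s[b.key] < 1 {
			return false // reject before any token is consumed
		}
	}
	for _, b := range buckets {
		s[b.key]--
	}
	return true
}

type limiter struct {
	prefix  string
	shared  *tokenBucket
	perUser *bucketSpec // nil when no per-user limit is configured
	st      store
}

func (l *limiter) allow(userID string) bool {
	var buckets []tokenBucket
	if l.shared != nil {
		buckets = append(buckets, *l.shared)
	}
	// Without an identity the per-user bucket is skipped.
	if l.perUser != nil && userID != "" {
		buckets = append(buckets, tokenBucket{
			key:      l.prefix + "user:" + userID,
			capacity: l.perUser.capacity,
		})
	}
	return l.st.consumeAll(buckets)
}

func newDemoLimiter(sharedCap, perUserCap int) *limiter {
	return &limiter{
		prefix:  "thv:rl:{ns:name}:",
		shared:  &tokenBucket{key: "thv:rl:{ns:name}:shared", capacity: sharedCap},
		perUser: &bucketSpec{capacity: perUserCap},
		st:      store{},
	}
}

func main() {
	l := newDemoLimiter(5, 1)
	fmt.Println(l.allow("alice"))     // true
	fmt.Println(l.allow("alice"))     // false: alice's bucket is empty
	fmt.Println(l.st[l.shared.key])   // 4: the rejection did not drain shared
}
```

The second call is rejected by alice's bucket alone, and the shared count stays at 4, which is the property the TestLimiter_PerUserRejectionDoesNotDrainShared note describes.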

Generated with Claude Code

@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Apr 8, 2026
@codecov

codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 93.97590% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.63%. Comparing base (3c5da31) to head (415bd1c).
⚠️ Report is 24 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...d/thv-operator/controllers/mcpserver_controller.go | 86.11% | 3 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4692      +/-   ##
==========================================
- Coverage   68.66%   68.63%   -0.03%     
==========================================
  Files         509      516       +7     
  Lines       52987    54192    +1205     
==========================================
+ Hits        36384    37196     +812     
- Misses      13782    14129     +347     
- Partials     2821     2867      +46     


jerm-dro and others added 6 commits April 8, 2026 16:36
Add PerUser field to RateLimitConfig and ToolRateLimitConfig so
administrators can configure per-user token bucket rate limits on
MCPServer. Make ToolRateLimitConfig.Shared optional since a tool
entry may now have only a perUser limit.

CEL admission validation enforces that perUser rate limiting
requires authentication (oidcConfig, oidcConfigRef, or
externalAuthConfigRef) at both server-level and per-tool level.
The existing "at least one scope" rule is updated to include
perUser alongside shared and tools.

Add RateLimitConfigValid condition type and reason constants for
use in the operator reconciler (wired in a following commit).

Part of #4550

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Validate that per-user rate limiting has authentication enabled at
reconciliation time (defense-in-depth alongside CEL admission).
Set RateLimitConfigValid condition with appropriate reason:
- RateLimitConfigValid when configuration is valid
- PerUserRequiresAuth when perUser is set without auth
- RateLimitNotApplicable when rate limiting is not configured

Part of #4550

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Extend the limiter to create per-user token buckets keyed by userID.
Per-user buckets are stored as deferred specs (bucketSpec) at
construction time and materialized into TokenBucket structs at Allow()
time since the userID is request-scoped. bucket.New() only allocates
a struct (no I/O), so per-request creation is cheap.

All applicable buckets (shared server, shared per-tool, per-user
server, per-user per-tool) are checked atomically via ConsumeAll.
The Lua script's two-phase check-then-consume ensures a per-user
rejection does not drain the shared bucket.

Redis keys follow the RFC format:
- Server per-user: thv:rl:{ns:name}:user:{userID}
- Tool per-user: thv:rl:{ns:name}:user:{userID}:tool:{toolName}

Part of #4550

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Use pointer for optional perUserSpec (clearer than bool+value)
- Use distinct key prefix "user-tool:" for per-tool per-user buckets
  to prevent key collisions when userID contains delimiter characters
- Extract shared validateBucketCRD helper to deduplicate validation
  between newBucket and newBucketSpec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jerm-dro jerm-dro force-pushed the jerm-dro/per-user-rate-limits branch from 1067400 to 21d2210 Compare April 8, 2026 23:37
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jerm-dro jerm-dro marked this pull request as ready for review April 8, 2026 23:51
Collaborator

@JAORMX JAORMX left a comment

OK so... I went through this pretty carefully. The algorithm is solid, the split into PR 1/2 is clean, and the atomic guarantee test (TestLimiter_PerUserRejectionDoesNotDrainShared) is exactly the kind of test I want to see. The CEL + controller defense-in-depth is the right pattern. Good stuff.

That said, there are a few things to fix before this is ready.

Must fix

1. t.Context() instead of context.Background() in limiter tests

pkg/ratelimit/limiter_test.go - all 7 new tests use context.Background(). Our convention is t.Context(). Since you're already touching this file, please fix the existing tests too (they have the same issue from before this PR).

2. Swagger description for RateLimitBucket is wrong

Check docs/server/docs.go, swagger.json, and swagger.yaml. The RateLimitBucket type description now reads:

"PerUser defines a token bucket applied independently to each authenticated user for this specific tool..."

But RateLimitBucket is a shared type used by both Shared and PerUser fields. The swagger generator is picking up whichever field's doc comment appears last. Fix: add a proper doc comment on the RateLimitBucket type itself so the generator picks that up instead of a field-level comment.

3. RFC key format deviation needs a comment

The RFC says per-user per-tool keys should be thv:rl:{ns}:{server}:user:{userId}:tool:{toolName}, but the implementation uses user-tool:{toolName}:{userID}. The PR notes explain why (prevents collision when userID contains :tool:), and I think that's the right call. But... we should document this. A code comment in Allow() explaining the deviation, and we should update the RFC to match what we actually ship. Otherwise someone reads the RFC and builds the wrong mental model.
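
The collision can be shown with a few lines of Go. This is a sketch under assumptions: the helper names are hypothetical, the prefix is treated as a fixed string using the `thv:rl:{ns:name}:` form from the commit message, and the `user-tool:{toolName}:{userID}` layout is the one described in this comment rather than copied from the code.

```go
package main

import "fmt"

// Fixed prefix for the example; a real key would substitute the
// namespace and server name inside the hash tag.
const prefix = "thv:rl:{ns:name}:"

// serverUserKey is the server-level per-user key.
func serverUserKey(userID string) string {
	return prefix + "user:" + userID
}

// nestedToolKey nests the tool under the user, as in the RFC layout.
func nestedToolKey(userID, tool string) string {
	return prefix + "user:" + userID + ":tool:" + tool
}

// prefixedToolKey uses a distinct prefix and puts the userID last.
func prefixedToolKey(userID, tool string) string {
	return prefix + "user-tool:" + tool + ":" + userID
}

func main() {
	// A userID containing ":tool:" makes a server-level key identical
	// to another user's per-tool key under the nested layout.
	a := serverUserKey("alice:tool:search")
	b := nestedToolKey("alice", "search")
	fmt.Println(a == b) // true: the two buckets collide

	// With the distinct prefix, no "user:" key can equal a "user-tool:" key.
	c := prefixedToolKey("alice", "search")
	fmt.Println(a == c) // false
}
```

Placing the userID last means the only variable-length, user-controlled component sits at the end of the key, and the tool name before it comes from configuration.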

Should fix

4. Condition semantics when rate limiting is not configured

When rateLimiting is nil, you set RateLimitConfigValid to False with reason NotApplicable. The problem is that RateLimitConfigValid = False reads as "the config is invalid"... which isn't what's happening. Rate limiting just isn't configured, that's all.

Two options that feel better:

  • Skip setting the condition entirely when the feature is off
  • Set it to True with reason NotApplicable (trivially valid, nothing to validate)

The second approach follows what we do with ImageValidated / ImageValidationSkipped.
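
A minimal sketch of that second option, using a local `condition` struct instead of the real Kubernetes condition type so the example stays self-contained. The function name and its boolean parameters are hypothetical; the type and reason strings are the ones named in the commit message.

```go
package main

import "fmt"

// condition is a simplified stand-in for a status condition.
type condition struct {
	Type   string
	Status string
	Reason string
}

// rateLimitCondition maps the three cases onto a status and reason.
func rateLimitCondition(configured, perUser, authEnabled bool) condition {
	c := condition{Type: "RateLimitConfigValid"}
	switch {
	case !configured:
		// Feature is off: trivially valid, nothing to validate.
		c.Status, c.Reason = "True", "RateLimitNotApplicable"
	case perUser && !authEnabled:
		c.Status, c.Reason = "False", "PerUserRequiresAuth"
	default:
		c.Status, c.Reason = "True", "RateLimitConfigValid"
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", rateLimitCondition(false, false, false))
	fmt.Printf("%+v\n", rateLimitCondition(true, true, false))
	fmt.Printf("%+v\n", rateLimitCondition(true, true, true))
}
```

With this mapping, `False` is reserved for a configuration that is actually invalid.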

5. validateRateLimitConfig doesn't halt reconciliation

When per-user is configured without auth, the condition gets set to False but reconciliation keeps going. Compare with handleOIDCConfig / handleExternalAuthConfig which return errors to stop things. Now, CEL should block this at admission time anyway, and the runtime if userID != "" guard means per-user buckets are silently skipped (graceful degradation)... so the blast radius is low. But it would be good to document this intentional behavior in a comment. Something like "defense-in-depth only; CEL is the primary gate."

6. Redis memory sizing

Each unique userID creates Redis keys that expire after 2x refillPeriod. That's fine for normal usage. But with OIDC providers that issue pairwise or per-session sub claims, you could get a lot of keys in the TTL window. Worth adding a note in the CRD docs or a comment about the memory formula: max_keys ~= unique_users_per_TTL_window * (1 + num_tools_with_per_user_limits). Not a blocker but good to have.

Nits

  • Condition name: issue #4550 says RateLimitingConfigValid, PR has RateLimitConfigValid. Intentional? I actually prefer the shorter name (matches the type), but let's update the issue so we don't confuse anyone later.
  • //nolint:lll on ToolRateLimitConfig: the CEL rule has(self.shared) || has(self.perUser) is short. The nolint looks copy-pasted from RateLimitConfig where it's actually needed.
  • Test object names have spaces: TestRateLimitConfigValidation builds names like rl-perUser with auth. K8s names can't have spaces. The fake client doesn't care but the names are unrealistic. Use slugs like rl-peruser-with-auth.
  • bucketSpec comment: the existing comment explains the deferred creation well. Would be nice to add a line saying state lives in Redis, which is why creating a new TokenBucket per request is safe. Saves the next reader from wondering about it.

What's good

The algorithm implementation is correct. The bucketSpec pattern for deferred per-user bucket creation is the right design given that userID is request-scoped and bucket.New() is just a struct allocation. The ConsumeAll Lua script handles the new buckets naturally since they're just additional TokenBucket instances in the same atomic check. The ToolRateLimitConfig.Shared change from required to optional is backwards-compatible (existing resources have shared set and pass the new CEL rule). Security-wise, the user: vs user-tool: prefix ordering prevents cross-type key collision, the perUserTools map acts as an allowlist for toolName, and the if userID != "" guard is a good fail-safe. Clean work overall.

Must fix:
- Replace context.Background() with t.Context() in all limiter tests
  (new and pre-existing)
- Fix RateLimitBucket swagger description by trimming field-level
  comments so the shared type gets a neutral description
- Add comment in Allow() documenting RFC key format deviation and why
  "user-tool:" prefix prevents cross-type key collisions

Should fix:
- Change condition to ConditionTrue with NotApplicable when rate
  limiting is not configured (matches ImageValidated/Skipped pattern)
- Add defense-in-depth comment explaining reconciliation continues
  intentionally (CEL is the primary gate)
- Add Redis memory sizing note on PerUser CRD field

Nits:
- Fix test object names to use K8s-valid slugs (no spaces)
- Keep nolint:lll on ToolRateLimitConfig (kubebuilder marker is 146
  chars, exceeds the 130 limit)
- Improve bucketSpec comment noting state lives in Redis

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jerm-dro
Contributor Author

jerm-dro commented Apr 9, 2026

Thanks for the thorough review @JAORMX! All feedback addressed in 415bd1c.

Must fix

  1. t.Context() in limiter tests — Fixed all 14 tests (new + pre-existing). Removed the "context" import entirely.

  2. Swagger description for RateLimitBucket — Swag picks up the last field-level comment for transitively-discovered types (via parseDependencyLevel). The @Description annotation doesn't override this for transitive types. Fixed by trimming field-level comments to be type-neutral (e.g., "PerUser token bucket configuration for this tool") so the swagger description reads correctly for the shared type.

  3. RFC key format deviation comment — Added a block comment in Allow() explaining the user: vs user-tool: prefix split and why it deviates from the RFC. Will update the RFC separately to match what ships.

Should fix

  4. Condition semantics: Changed the nil case to ConditionTrue with NotApplicable, matching the ImageValidated/ImageValidationSkipped pattern. Updated the test expectation.

  5. Defense-in-depth comment: Expanded the validateRateLimitConfig doc comment to explain that reconciliation continues intentionally (CEL is the primary gate, and per-user buckets are silently skipped when userID is empty).

  6. Redis memory note: Added a formula comment on the PerUser CRD field: unique_users_per_TTL_window * (1 + num_tools_with_per_user_limits) keys.

Nits

  • Condition name: Intentional — RateLimitConfigValid matches the Go type name. Will update the issue text to align.
  • //nolint:lll: Kept — the kubebuilder marker is 146 chars (limit is 130). The rule text is short but the +kubebuilder:validation:XValidation:rule= prefix + message= push it over.
  • Test names: Fixed to K8s-valid slugs (peruser-with-auth, per-tool-peruser-without-auth, etc.).
  • bucketSpec comment: Added note that state lives in Redis, so per-request TokenBucket creation is safe.

@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Apr 9, 2026
jerm-dro added a commit to stacklok/toolhive-rfcs that referenced this pull request Apr 9, 2026
Per-user per-operation Redis keys now use distinct prefixes
(user-tool:, user-prompt:, user-resource:) instead of nesting
under user:{userId}:tool:... to prevent key collisions when a
userId contains delimiter characters like ":tool:".

The operation name precedes the userId so that the variable-length
userId is always the terminal key component.

Matches the implementation shipped in stacklok/toolhive#4692.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jerm-dro jerm-dro merged commit 65a78f4 into main Apr 9, 2026
40 checks passed
@jerm-dro jerm-dro deleted the jerm-dro/per-user-rate-limits branch April 9, 2026 20:38
jerm-dro added a commit to stacklok/toolhive-rfcs that referenced this pull request Apr 10, 2026