Add per-user rate limit types and limiter support by jerm-dro · Pull Request #4692 · stacklok/toolhive

jerm-dro · 2026-04-08T23:18:58Z

Summary

Cluster admins need per-user rate limiting so no single user can monopolize an MCPServer's tools. RFC THV-0057 defines per-user token buckets keyed by authenticated user identity.

Add perUser field to RateLimitConfig and ToolRateLimitConfig CRD types so admins can configure per-user token bucket limits at both server and per-tool level
Add CEL admission validation rejecting perUser when authentication is not enabled (oidcConfig, oidcConfigRef, or externalAuthConfigRef required)
Add RateLimitConfigValid status condition set during reconciliation as defense-in-depth alongside CEL
Extend the limiter to create per-user buckets dynamically at Allow() time using raw userID in Redis keys (thv:rl:{ns:name}:user:{userID})
Make ToolRateLimitConfig.Shared optional since a tool entry may now have only a perUser limit
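As a rough illustration of the type shape described above, the following Go sketch shows optional `shared` and `perUser` pointers on both config levels, plus a plain-Go mirror of the "at least one of shared or perUser" rule for tool entries. The `RateLimitBucket` fields, the `Name` field, and the helper `toolEntryValid` are assumptions made for the example; the real CRD types and kubebuilder markers may differ.

```go
package main

import "fmt"

// RateLimitBucket is a stand-in for the shared bucket type.
// Field names here are assumptions, not the real CRD schema.
type RateLimitBucket struct {
	MaxTokens           int32 `json:"maxTokens"`
	RefillPeriodSeconds int32 `json:"refillPeriodSeconds"`
}

// ToolRateLimitConfig carries per-tool limits. Shared is a pointer so that
// an entry may carry only a perUser limit.
type ToolRateLimitConfig struct {
	Name    string           `json:"name"` // assumed field name
	Shared  *RateLimitBucket `json:"shared,omitempty"`
	PerUser *RateLimitBucket `json:"perUser,omitempty"`
}

// RateLimitConfig carries server-level limits and the per-tool entries.
type RateLimitConfig struct {
	Shared  *RateLimitBucket      `json:"shared,omitempty"`
	PerUser *RateLimitBucket      `json:"perUser,omitempty"`
	Tools   []ToolRateLimitConfig `json:"tools,omitempty"`
}

// toolEntryValid mirrors the CEL rule has(self.shared) || has(self.perUser)
// in plain Go (hypothetical helper, for illustration only).
func toolEntryValid(t ToolRateLimitConfig) bool {
	return t.Shared != nil || t.PerUser != nil
}

func main() {
	perUserOnly := ToolRateLimitConfig{
		Name:    "search",
		PerUser: &RateLimitBucket{MaxTokens: 5, RefillPeriodSeconds: 60},
	}
	empty := ToolRateLimitConfig{Name: "search"}
	fmt.Println(toolEntryValid(perUserOnly), toolEntryValid(empty))
}
```

Using pointers for both buckets keeps "unset" distinguishable from a zero-valued bucket, which is what lets an entry with only `perUser` pass validation.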

This is PR 1 of 2 for #4550. The middleware wiring (extracting identity from context and passing userID to the limiter) will follow in PR 2 after #4652 merges.

Part of #4550

Type of change

New feature

Test plan

Unit tests (task test)
Linting (task lint-fix)

Changes

| File | Change |
| --- | --- |
| `cmd/thv-operator/api/v1alpha1/mcpserver_types.go` | Add `PerUser` to `RateLimitConfig` and `ToolRateLimitConfig`, condition constants, CEL rules |
| `cmd/thv-operator/api/v1alpha1/zz_generated.deepcopy.go` | Regenerated deepcopy for new fields |
| `cmd/thv-operator/controllers/mcpserver_controller.go` | Add `validateRateLimitConfig` and `setRateLimitConfigCondition` |
| `cmd/thv-operator/controllers/mcpserver_replicas_test.go` | Table-driven tests for rate limit config validation (5 cases) |
| `pkg/ratelimit/limiter.go` | Add `bucketSpec`, per-user bucket creation in `NewLimiter`/`Allow` |
| `pkg/ratelimit/limiter_test.go` | 7 new tests covering per-user scenarios, atomic guarantee, validation |
| `deploy/charts/operator-crds/...` | Regenerated CRD YAML |

Does this introduce a user-facing change?

Yes. Admins can now configure rateLimiting.perUser and rateLimiting.tools[].perUser on MCPServer to set per-user token bucket rate limits. The limits are enforced once the middleware wiring lands in PR 2.

Special notes for reviewers

Per-user TokenBucket structs are created per-request in Allow() — bucket.New() only allocates a struct (no I/O), so this is cheap. The alternative (caching per-user buckets) would require eviction logic for unbounded user sets.
The TestLimiter_PerUserRejectionDoesNotDrainShared test verifies the atomic guarantee: when a per-user bucket rejects, the shared server bucket is NOT decremented. This is critical for preventing noisy users from affecting global capacity.
Raw userID is used in Redis keys. Redis keys are binary-safe and the hash tag {ns:name} closes before the userID appears, so no characters can cause key injection or slot routing issues.
ToolRateLimitConfig.Shared changed from required to optional — backwards compatible since existing resources with shared set still pass the new CEL rule (has(self.shared) || has(self.perUser)).

Generated with Claude Code

codecov · 2026-04-08T23:31:54Z

Codecov Report

❌ Patch coverage is 93.97590% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.63%. Comparing base (3c5da31) to head (415bd1c).
⚠️ Report is 24 commits behind head on main.

Files with missing lines	Patch %	Lines
...d/thv-operator/controllers/mcpserver_controller.go	86.11%	3 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4692      +/-   ##
==========================================
- Coverage   68.66%   68.63%   -0.03%     
==========================================
  Files         509      516       +7     
  Lines       52987    54192    +1205     
==========================================
+ Hits        36384    37196     +812     
- Misses      13782    14129     +347     
- Partials     2821     2867      +46


Add PerUser field to RateLimitConfig and ToolRateLimitConfig so administrators can configure per-user token bucket rate limits on MCPServer. Make ToolRateLimitConfig.Shared optional since a tool entry may now have only a perUser limit.

CEL admission validation enforces that perUser rate limiting requires authentication (oidcConfig, oidcConfigRef, or externalAuthConfigRef) at both server-level and per-tool level. The existing "at least one scope" rule is updated to include perUser alongside shared and tools.

Add RateLimitConfigValid condition type and reason constants for use in the operator reconciler (wired in a following commit).

Part of #4550

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Validate that per-user rate limiting has authentication enabled at reconciliation time (defense-in-depth alongside CEL admission).

Set RateLimitConfigValid condition with appropriate reason:
- RateLimitConfigValid when configuration is valid
- PerUserRequiresAuth when perUser is set without auth
- RateLimitNotApplicable when rate limiting is not configured

Part of #4550

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Extend the limiter to create per-user token buckets keyed by userID. Per-user buckets are stored as deferred specs (bucketSpec) at construction time and materialized into TokenBucket structs at Allow() time since the userID is request-scoped. bucket.New() only allocates a struct (no I/O), so per-request creation is cheap.

All applicable buckets (shared server, shared per-tool, per-user server, per-user per-tool) are checked atomically via ConsumeAll. The Lua script's two-phase check-then-consume ensures a per-user rejection does not drain the shared bucket.

Redis keys follow the RFC format:
- Server per-user: thv:rl:{ns:name}:user:{userID}
- Tool per-user: thv:rl:{ns:name}:user:{userID}:tool:{toolName}

Part of #4550

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
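The two-phase check-then-consume idea in this commit message can be illustrated with an in-memory Go sketch. This is not the Lua script or the real ConsumeAll; the `bucket` type and `consumeAll` function are hypothetical, refill is omitted, and the atomicity that Redis provides by running a script as one unit is simply assumed here.

```go
package main

import "fmt"

// bucket is a simplified in-memory token bucket (no refill logic).
type bucket struct {
	name   string
	tokens int
}

// consumeAll takes one token from every bucket, or from none of them.
// It returns whether the request is allowed and, on rejection, the name
// of the first bucket that had no tokens left.
func consumeAll(buckets ...*bucket) (bool, string) {
	// Phase 1: check every bucket without modifying any of them.
	for _, b := range buckets {
		if b.tokens < 1 {
			return false, b.name
		}
	}
	// Phase 2: all checks passed, so decrement every bucket.
	for _, b := range buckets {
		b.tokens--
	}
	return true, ""
}

func main() {
	shared := &bucket{name: "shared", tokens: 5}
	alice := &bucket{name: "user:alice", tokens: 1}

	ok1, _ := consumeAll(shared, alice)          // allowed: shared 5 -> 4, alice 1 -> 0
	ok2, rejectedBy := consumeAll(shared, alice) // rejected by alice's bucket

	// The shared bucket still holds 4 tokens after the rejected call.
	fmt.Println(ok1, ok2, rejectedBy, shared.tokens) // true false user:alice 4
}
```

Separating the check from the decrement is what keeps a rejected user from consuming shared capacity.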

- Use pointer for optional perUserSpec (clearer than bool+value)
- Use distinct key prefix "user-tool:" for per-tool per-user buckets to prevent key collisions when userID contains delimiter characters
- Extract shared validateBucketCRD helper to deduplicate validation between newBucket and newBucketSpec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

JAORMX

OK so... I went through this pretty carefully. The algorithm is solid, the split into PR 1/2 is clean, and the atomic guarantee test (TestLimiter_PerUserRejectionDoesNotDrainShared) is exactly the kind of test I want to see. The CEL + controller defense-in-depth is the right pattern. Good stuff.

That said, there are a few things to fix before this is ready.

Must fix

1. `t.Context()` instead of `context.Background()` in limiter tests

pkg/ratelimit/limiter_test.go - all 7 new tests use context.Background(). Our convention is t.Context(). Since you're already touching this file, please fix the existing tests too (they have the same issue from before this PR).

2. Swagger description for `RateLimitBucket` is wrong

Check docs/server/docs.go, swagger.json, and swagger.yaml. The RateLimitBucket type description now reads:

"PerUser defines a token bucket applied independently to each authenticated user for this specific tool..."

But RateLimitBucket is a shared type used by both Shared and PerUser fields. The swagger generator is picking up whichever field's doc comment appears last. Fix: add a proper doc comment on the RateLimitBucket type itself so the generator picks that up instead of a field-level comment.

3. RFC key format deviation needs a comment

The RFC says per-user per-tool keys should be thv:rl:{ns}:{server}:user:{userId}:tool:{toolName}, but the implementation uses user-tool:{toolName}:{userID}. The PR notes explain why (prevents collision when userID contains :tool:), and I think that's the right call. But... we should document this. A code comment in Allow() explaining the deviation, and we should update the RFC to match what we actually ship. Otherwise someone reads the RFC and builds the wrong mental model.
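To make the collision concrete, here is a small Go sketch comparing the two layouts. The key-building functions and the fixed `prefix` value are written for this example and are not the limiter's actual helpers; the layouts follow the formats quoted above.

```go
package main

import "fmt"

// prefix stands in for the per-server key prefix, including the hash tag.
const prefix = "thv:rl:{ns:name}:"

// serverUserKey is the server-level per-user key.
func serverUserKey(userID string) string {
	return prefix + "user:" + userID
}

// rfcToolUserKey nests the tool under the user, as in the original RFC text.
func rfcToolUserKey(userID, toolName string) string {
	return prefix + "user:" + userID + ":tool:" + toolName
}

// shippedToolUserKey uses the distinct user-tool: prefix with the userID last.
func shippedToolUserKey(userID, toolName string) string {
	return prefix + "user-tool:" + toolName + ":" + userID
}

func main() {
	// A userID that happens to contain the ":tool:" delimiter.
	crafted := "alice:tool:search"

	// RFC layout: this user's server-level key equals alice's per-tool key.
	fmt.Println(serverUserKey(crafted) == rfcToolUserKey("alice", "search")) // true

	// Shipped layout: the differing prefix keeps the two key types apart.
	fmt.Println(serverUserKey(crafted) == shippedToolUserKey("alice", "search")) // false
}
```

With the RFC layout, one user's server-level bucket and another user's per-tool bucket would share state in Redis; the separate prefix rules that out across the two key types.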

Should fix

4. Condition semantics when rate limiting is not configured

When rateLimiting is nil, you set RateLimitConfigValid to False with reason NotApplicable. The problem is that RateLimitConfigValid = False reads as "the config is invalid"... which isn't what's happening. Rate limiting just isn't configured, that's all.

Two options that feel better:

Skip setting the condition entirely when the feature is off
Set it to True with reason NotApplicable (trivially valid, nothing to validate)

The second approach follows what we do with ImageValidated / ImageValidationSkipped.
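A minimal sketch of the second option, using local stand-in types rather than `metav1.Condition`, might look like this. The `rateLimitSpec` struct is a hypothetical flattened view of the MCPServer spec, and the reason strings are taken from the commit message earlier in this thread; the exact constant names and values in the operator are assumptions.

```go
package main

import "fmt"

// condition is a local stand-in for a Kubernetes status condition.
type condition struct {
	Type   string
	Status string // "True" or "False"
	Reason string
}

// rateLimitSpec is a hypothetical flattened view of the relevant spec fields.
type rateLimitSpec struct {
	configured  bool // rateLimiting is non-nil
	usesPerUser bool // any server-level or per-tool perUser bucket is set
	authEnabled bool // oidcConfig, oidcConfigRef, or externalAuthConfigRef is set
}

// rateLimitCondition decides the RateLimitConfigValid condition.
func rateLimitCondition(s rateLimitSpec) condition {
	c := condition{Type: "RateLimitConfigValid"}
	switch {
	case !s.configured:
		// Nothing to validate, so the config is trivially valid.
		c.Status, c.Reason = "True", "RateLimitNotApplicable"
	case s.usesPerUser && !s.authEnabled:
		c.Status, c.Reason = "False", "PerUserRequiresAuth"
	default:
		c.Status, c.Reason = "True", "RateLimitConfigValid"
	}
	return c
}

func main() {
	fmt.Println(rateLimitCondition(rateLimitSpec{}))
	fmt.Println(rateLimitCondition(rateLimitSpec{configured: true, usesPerUser: true}))
	fmt.Println(rateLimitCondition(rateLimitSpec{configured: true, usesPerUser: true, authEnabled: true}))
}
```

With this shape, `False` is reserved for a config that is actually invalid, so a dashboard or alert keyed on the status alone does not fire for servers that simply have no rate limiting.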

5. `validateRateLimitConfig` doesn't halt reconciliation

When per-user is configured without auth, the condition gets set to False but reconciliation keeps going. Compare with handleOIDCConfig / handleExternalAuthConfig which return errors to stop things. Now, CEL should block this at admission time anyway, and the runtime if userID != "" guard means per-user buckets are silently skipped (graceful degradation)... so the blast radius is low. But it would be good to document this intentional behavior in a comment. Something like "defense-in-depth only; CEL is the primary gate."

6. Redis memory sizing

Each unique userID creates Redis keys that expire after 2x refillPeriod. That's fine for normal usage. But with OIDC providers that issue pairwise or per-session sub claims, you could get a lot of keys in the TTL window. Worth adding a note in the CRD docs or a comment about the memory formula: max_keys ~= unique_users_per_TTL_window * (1 + num_tools_with_per_user_limits). Not a blocker but good to have.
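The formula is easy to turn into a quick estimate. The function below is a hypothetical helper for back-of-the-envelope sizing, not part of the operator, and it assumes a server-level per-user limit is configured (the `1 +` term) alongside the per-tool ones.

```go
package main

import "fmt"

// estimateMaxKeys applies the sizing formula from the review:
// unique users seen within one TTL window, times one server-level key
// plus one key per tool that has a per-user limit.
func estimateMaxKeys(uniqueUsersPerTTLWindow, toolsWithPerUserLimits int) int {
	return uniqueUsersPerTTLWindow * (1 + toolsWithPerUserLimits)
}

func main() {
	// Example: 10,000 distinct sub claims in the window, 3 tools with per-user limits.
	fmt.Println(estimateMaxKeys(10000, 3)) // 40000
}
```

The number of distinct identities in the window is the input to watch: providers that issue pairwise or per-session sub claims inflate it well beyond the count of real users.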

Nits

Condition name: issue #4550 says RateLimitingConfigValid, PR has RateLimitConfigValid. Intentional? I actually prefer the shorter name (matches the type), but let's update the issue so we don't confuse anyone later.
//nolint:lll on ToolRateLimitConfig: the CEL rule has(self.shared) || has(self.perUser) is short. The nolint looks copy-pasted from RateLimitConfig where it's actually needed.
Test object names have spaces: TestRateLimitConfigValidation builds names like rl-perUser with auth. K8s names can't have spaces. The fake client doesn't care but the names are unrealistic. Use slugs like rl-peruser-with-auth.
bucketSpec comment: the existing comment explains the deferred creation well. Would be nice to add a line saying state lives in Redis, which is why creating a new TokenBucket per request is safe. Saves the next reader from wondering about it.

What's good

The algorithm implementation is correct. The bucketSpec pattern for deferred per-user bucket creation is the right design given that userID is request-scoped and bucket.New() is just a struct allocation. The ConsumeAll Lua script handles the new buckets naturally since they're just additional TokenBucket instances in the same atomic check. The ToolRateLimitConfig.Shared change from required to optional is backwards-compatible (existing resources have shared set and pass the new CEL rule). Security-wise, the user: vs user-tool: prefix ordering prevents cross-type key collision, the perUserTools map acts as an allowlist for toolName, and the if userID != "" guard is a good fail-safe. Clean work overall.

Must fix:
- Replace context.Background() with t.Context() in all limiter tests (new and pre-existing)
- Fix RateLimitBucket swagger description by trimming field-level comments so the shared type gets a neutral description
- Add comment in Allow() documenting RFC key format deviation and why "user-tool:" prefix prevents cross-type key collisions

Should fix:
- Change condition to ConditionTrue with NotApplicable when rate limiting is not configured (matches ImageValidated/Skipped pattern)
- Add defense-in-depth comment explaining reconciliation continues intentionally (CEL is the primary gate)
- Add Redis memory sizing note on PerUser CRD field

Nits:
- Fix test object names to use K8s-valid slugs (no spaces)
- Keep nolint:lll on ToolRateLimitConfig (kubebuilder marker is 146 chars, exceeds the 130 limit)
- Improve bucketSpec comment noting state lives in Redis

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jerm-dro · 2026-04-09T16:32:35Z

Thanks for the thorough review @JAORMX! All feedback addressed in 415bd1c.

Must fix

t.Context() in limiter tests — Fixed all 14 tests (new + pre-existing). Removed the "context" import entirely.
Swagger description for RateLimitBucket — Swag picks up the last field-level comment for transitively-discovered types (via parseDependencyLevel). The @Description annotation doesn't override this for transitive types. Fixed by trimming field-level comments to be type-neutral (e.g., "PerUser token bucket configuration for this tool") so the swagger description reads correctly for the shared type.
RFC key format deviation comment — Added a block comment in Allow() explaining the user: vs user-tool: prefix split and why it deviates from the RFC. Will update the RFC separately to match what ships.

Should fix

Condition semantics — Changed nil case to ConditionTrue with NotApplicable, matching the ImageValidated/ImageValidationSkipped pattern. Updated test expectation.
Defense-in-depth comment — Expanded the validateRateLimitConfig doc comment to explain reconciliation continues intentionally (CEL is primary gate, per-user buckets silently skip when userID is empty).
Redis memory note — Added formula comment on the PerUser CRD field: unique_users_per_TTL_window * (1 + num_tools_with_per_user_limits) keys.

Nits

Condition name: Intentional — RateLimitConfigValid matches the Go type name. Will update the issue text to align.
//nolint:lll: Kept — the kubebuilder marker is 146 chars (limit is 130). The rule text is short but the +kubebuilder:validation:XValidation:rule= prefix + message= push it over.
Test names: Fixed to K8s-valid slugs (peruser-with-auth, per-tool-peruser-without-auth, etc.).
bucketSpec comment: Added note that state lives in Redis, so per-request TokenBucket creation is safe.

Per-user per-operation Redis keys now use distinct prefixes (user-tool:, user-prompt:, user-resource:) instead of nesting under user:{userId}:tool:... to prevent key collisions when a userId contains delimiter characters like ":tool:". The operation name precedes the userId so that the variable-length userId is always the terminal key component. Matches the implementation shipped in stacklok/toolhive#4692. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Apr 8, 2026

jerm-dro and others added 6 commits April 8, 2026 16:36

Fix struct field alignment from linter

46f1552

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Regenerate swagger docs for perUser rate limit fields

21d2210

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jerm-dro force-pushed the jerm-dro/per-user-rate-limits branch from 1067400 to 21d2210 Compare April 8, 2026 23:37

Regenerate CRD API reference docs for perUser fields

49c183f

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jerm-dro marked this pull request as ready for review April 8, 2026 23:51

jerm-dro requested review from ChrisJBurns, JAORMX, amirejaz, jhrozek and yrobla as code owners April 8, 2026 23:51

JAORMX requested changes Apr 9, 2026

JAORMX approved these changes Apr 9, 2026

jerm-dro mentioned this pull request Apr 9, 2026

Update THV-0057 key format to prevent cross-type collisions stacklok/toolhive-rfcs#69

Merged

jerm-dro merged commit 65a78f4 into main Apr 9, 2026
40 checks passed

jerm-dro deleted the jerm-dro/per-user-rate-limits branch April 9, 2026 20:38

jerm-dro merged 8 commits into main from jerm-dro/per-user-rate-limits

2 participants
