Rate Limiting
Keeptrusts supports multi-tier rate limiting to prevent abuse, control costs, and enforce fair-use policies across your deployment. Rate limits are evaluated in order from most-specific to least-specific: per-key limits are checked first, then per-user, then per-team, and finally global limits. A request that exceeds any tier is rejected with HTTP 429 before it reaches the upstream provider.
Use this page when
- You need the exact command, config, API, or integration details for Rate Limiting.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- You want a guided rollout instead of a reference page; if so, use the linked workflow pages in Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Rate Limit Tiers
Keeptrusts evaluates four rate limit tiers on every request.
per_key — Per API key
Applied based on the API key used to authenticate the request. Use per_key limits to give each integration or service account its own RPM/TPM budget, preventing a single misbehaving client from consuming all available capacity.
per_user — Per user identity
Applied based on the value in the configured user identity header (default: X-User-Id). Use per_user limits for applications that have authenticated end users and want per-user fairness controls.
per_team — Per team identity
Applied based on the value in the configured team identity header (default: X-Team-Id). Use per_team limits for multi-tenant SaaS applications where each tenant has a separate quota.
global — Across all traffic
Applied as a hard ceiling across all traffic regardless of key, user, or team. Use global limits to enforce your aggregate upstream provider capacity or your upstream API key's rate limit.
All four tiers in a single config
rate_limits:
per_key:
rpm: 60 # requests per minute per API key
tpm: 100000 # tokens per minute per API key
max_parallel_requests: 10
per_user:
rpm: 20
tpm: 40000
max_parallel_requests: 5
per_team:
rpm: 200
tpm: 500000
max_parallel_requests: 50
global:
rpm: 1000
tpm: 2000000
max_parallel_requests: 200
Field Reference
ScopeRateLimits fields
| Field | Type | Description |
|---|---|---|
| rpm | integer | Maximum requests per minute for this scope. null = no limit. |
| tpm | integer | Maximum tokens per minute (prompt + completion combined) for this scope. null = no limit. |
| max_parallel_requests | integer | Maximum simultaneous in-flight requests for this scope. null = no limit. |
All three fields are optional. Omitting a field means no limit is applied for that dimension.
IP Rate Limiting
IP-based rate limiting throttles requests from the same source IP address. This is the first-line defence against unauthenticated or pre-auth abuse, scraping, and denial-of-service attempts.
rate_limits:
ip:
enabled: true
rpm: 30
burst: 10 # allowed burst above rpm, consumed in < 1 second
whitelist:
- "10.0.0.0/8" # internal networks never rate-limited
- "172.16.0.0/12"
- "192.168.0.0/16"
header: "X-Forwarded-For" # header used to extract the real client IP
IpRateLimitConfig fields
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable IP-based rate limiting. |
| rpm | integer | — | Maximum requests per minute from a single IP. |
| burst | integer | 0 | Number of requests above rpm allowed in a burst window. |
| whitelist | list of CIDR strings | [] | IP ranges that bypass IP rate limiting entirely. |
| header | string | "X-Forwarded-For" | HTTP header from which the client IP is extracted. Set to "X-Real-IP" for nginx reverse proxy deployments. |
Always whitelist your internal CIDR ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) and any load-balancer health check IPs. Failing to do so can trigger false-positive rate limits for internal service-to-service traffic.
User Rate Limiting
User rate limiting applies independent quotas per authenticated user identity. The gateway extracts the user ID from a configurable request header.
rate_limits:
per_user:
rpm: 20
tpm: 50000
max_parallel_requests: 5
user_rate_limit:
header: "X-User-Id" # header that carries the user identifier
strategy: fixed_window # "fixed_window" or "sliding_window"
window_seconds: 60
UserRateLimitConfig fields
| Field | Type | Default | Description |
|---|---|---|---|
| header | string | "X-User-Id" | Request header containing the user identifier. |
| strategy | string | "fixed_window" | Window algorithm: fixed_window resets the counter at each window boundary; sliding_window tracks a rolling count. |
| window_seconds | integer | 60 | Duration of the rate limit window in seconds. |
| fallback_to_ip | bool | false | If true, fall back to IP-based limiting when the user header is absent. |
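For illustration, the sketch below shows a client call that supplies the identity headers the per_user and per_team tiers key on. The gateway URL, API key, and request payload are placeholders, not values from this page.

```python
import requests

# Placeholders — substitute your own gateway endpoint and credentials.
GATEWAY_URL = "https://gateway.example.internal/v1/chat/completions"
API_KEY = "kt-example-key"

response = requests.post(
    GATEWAY_URL,
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "X-User-Id": "alice@example.com",  # matched against the per_user tier (default header)
        "X-Team-Id": "team-billing",       # matched against the per_team tier (default header)
    },
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "hello"}]},
    timeout=30,
)
print(response.status_code, response.headers.get("X-RateLimit-Remaining"))
```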
Global Rate Limiting
Global limits act as a hard ceiling across all combined traffic. When the global limit is reached, all new requests are rejected with HTTP 429 regardless of their per-key or per-user quotas.
rate_limits:
global:
rpm: 5000
tpm: 10000000
max_parallel_requests: 500
global_rate_limit:
strategy: sliding_window
window_seconds: 60
reject_action: return_429 # "return_429" or "queue" (if queue is configured)
GlobalRateLimitConfig fields
| Field | Type | Default | Description |
|---|---|---|---|
| strategy | string | "fixed_window" | fixed_window or sliding_window. |
| window_seconds | integer | 60 | Duration of the global rate limit window. |
| reject_action | string | "return_429" | Action when the limit is reached: return_429 immediately rejects; queue holds the request if a request queue is configured. |
| queue_timeout_ms | integer | 5000 | Maximum milliseconds a queued request will wait before being rejected. Only applies when reject_action: queue. |
Token Rate Limiting
Token rate limiting tracks the number of tokens consumed rather than (or in addition to) the number of requests. Useful when you have a metered upstream API key with a token-per-minute budget.
Keeptrusts counts tokens in two ways:
- Pre-request estimate: For non-streaming requests, the prompt tokens are counted before forwarding using a built-in tokenizer.
- Post-response reconciliation: After the upstream returns, the actual prompt + completion token counts from the response are used to reconcile the bucket.
rate_limits:
per_key:
tpm: 500000 # 500K tokens per minute per API key
global:
tpm: 2000000 # 2M tokens per minute across all traffic
token_rate_limit:
count_prompt_tokens: true
count_completion_tokens: true
pre_request_estimate: true # enforce a pre-request token estimate to reject before upstream call
tokenizer: "cl100k_base" # tiktoken tokenizer name; used for pre-request estimates
TokenRateLimitConfig fields
| Field | Type | Default | Description |
|---|---|---|---|
| count_prompt_tokens | bool | true | Include prompt tokens in the TPM counter. |
| count_completion_tokens | bool | true | Include completion tokens in the TPM counter. |
| pre_request_estimate | bool | false | Estimate prompt tokens before forwarding and reject if the estimate would exceed the TPM limit. |
| tokenizer | string | "cl100k_base" | Tiktoken tokenizer used for pre-request estimation. Use "o200k_base" for GPT-4o and o-series models. |
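For a feel of what the pre-request estimate involves, here is a minimal client-side sketch using the tiktoken library with the cl100k_base encoding named above. It is an approximation under that encoding assumption, not the gateway's internal counter.

```python
import tiktoken

def estimate_prompt_tokens(messages, encoding_name="cl100k_base"):
    """Rough prompt-token estimate: encode each message's text content and sum the lengths."""
    enc = tiktoken.get_encoding(encoding_name)
    return sum(len(enc.encode(m.get("content", ""))) for m in messages)

messages = [{"role": "user", "content": "Summarize the quarterly report in three bullet points."}]
estimate = estimate_prompt_tokens(messages)
print(f"Estimated prompt tokens: {estimate}")  # compare against your per_key.tpm budget
```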
Distributed Rate Limiting (Redis)
When running multiple Keeptrusts gateway instances behind a load balancer, each instance maintains its own in-memory rate limit counters by default. To enforce consistent limits across the entire pool, configure Redis-backed distributed rate limiting.
distributed_rate_limit:
enabled: true
redis_url_env: REDIS_URL # environment variable containing the Redis connection URL
key_prefix: "kt:rl:" # key prefix for all rate limit entries in Redis
window_ms: 60000 # rate limit window in milliseconds
sync_interval_ms: 100 # how often local counts are flushed to Redis
local_fallback: true # if Redis is unreachable, fall back to local counting
DistributedRateLimitConfig fields
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable distributed (Redis-backed) rate limiting. |
| redis_url_env | string | — | Name of the environment variable holding the Redis URL (e.g., redis://user:pass@host:6379/0). Credentials are never stored in the config file. |
| key_prefix | string | "kt:rl:" | All Redis keys written by the rate limiter are prefixed with this string. Use per-environment prefixes (e.g., "kt:rl:prod:") when sharing a Redis cluster across environments. |
| window_ms | integer | 60000 | Rate limit window in milliseconds. Must match the window_seconds values in rate_limits. |
| sync_interval_ms | integer | 100 | Milliseconds between local counter flushes to Redis. Lower values reduce inconsistency windows at the cost of higher Redis write throughput. |
| local_fallback | bool | true | When true, the gateway falls back to in-memory counting if Redis becomes unreachable. When false, rate limit enforcement is suspended on Redis failure (all requests are allowed through). |
Full multi-instance example
rate_limits:
per_key:
rpm: 100
tpm: 200000
global:
rpm: 2000
tpm: 5000000
distributed_rate_limit:
enabled: true
redis_url_env: REDIS_URL
key_prefix: "kt:rl:prod:"
window_ms: 60000
sync_interval_ms: 50
local_fallback: true
Set the environment variable before starting each gateway instance:
export REDIS_URL="redis://:yourpassword@redis.internal:6379/0"
kt gateway run --policy-config policy-config.yaml
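Conceptually, distributed counting boils down to sharing one counter per scope and window in Redis. The sketch below illustrates that idea with a plain fixed-window INCR/EXPIRE counter using redis-py; it is not Keeptrusts' actual implementation, and the key layout is only an assumption for the example.

```python
import time
import redis  # redis-py

r = redis.Redis.from_url("redis://:yourpassword@redis.internal:6379/0")

def allow_request(key_id: str, rpm_limit: int, prefix: str = "kt:rl:prod:") -> bool:
    """Fixed-window check: one shared counter per key per minute, visible to every gateway instance."""
    window = int(time.time() // 60)              # current one-minute window
    counter_key = f"{prefix}{key_id}:{window}"
    count = r.incr(counter_key)                  # atomic increment across all instances
    if count == 1:
        r.expire(counter_key, 120)               # let stale window counters expire
    return count <= rpm_limit

if allow_request("mobile-ios-prod", rpm_limit=100):
    print("forward to upstream")
else:
    print("reject with HTTP 429")
```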
Consumer Group Rate Limits
Consumer groups let you define named groups of API keys that share a pool-level rate limit in addition to their individual per_key limits. A consumer group rate limit is applied after per-key limits pass; exceeding the group limit rejects the request even if the individual key is still under its quota.
consumer_groups:
- name: mobile-clients
description: "All mobile app API keys"
rate_limits:
rpm: 500
tpm: 1000000
max_parallel_requests: 100
keys:
- key_id: mobile-ios-prod
- key_id: mobile-android-prod
- key_id: mobile-web-prod
- name: internal-services
description: "Backend service-to-service calls"
rate_limits:
rpm: 2000
tpm: 8000000
max_parallel_requests: 400
keys:
- key_id: search-service
- key_id: recommendation-engine
- key_id: content-moderation
Evaluation order for a request from mobile-ios-prod:
- IP rate limit
- per_key rate limit for mobile-ios-prod
- mobile-clients consumer group rate limit
- per_user rate limit (if user header present)
- per_team rate limit (if team header present)
- global rate limit
Inheriting and overriding rate limits
A ConsumerGroupRule can inherit a base rate limit profile and override individual fields:
rate_limit_profiles:
- name: standard
rpm: 60
tpm: 100000
consumer_groups:
- name: premium-tier
inherits: standard
rate_limits:
rpm: 300 # override rpm only; tpm inherits 100000 from "standard"
keys:
- key_id: premium-api-key-1
- key_id: premium-api-key-2
Response Headers
When rate limiting is enabled, Keeptrusts returns standard rate-limit headers on every response to help clients implement backoff:
| Header | Description |
|---|---|
| X-RateLimit-Limit | The rate limit ceiling for the current scope (requests or tokens depending on which limit was applied). |
| X-RateLimit-Remaining | Number of requests or tokens remaining in the current window. |
| X-RateLimit-Reset | Unix timestamp (seconds) when the current window resets. |
| Retry-After | Seconds until the client should retry. Present only on HTTP 429 responses. |
Example headers on a normal response
HTTP/1.1 200 OK
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 43
X-RateLimit-Reset: 1711577460
Example headers on a rate-limited response
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1711577460
Retry-After: 12
Content-Type: application/json
{
"error": {
"code": "rate_limit_exceeded",
"message": "Request rate limit exceeded for this API key. Retry after 12 seconds.",
"type": "rate_limit_error",
"param": null
}
}
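Clients should honor Retry-After rather than guessing a backoff interval (see Best Practices below). A minimal retry sketch, assuming a placeholder gateway URL and request payload:

```python
import time
import requests

GATEWAY_URL = "https://gateway.example.internal/v1/chat/completions"  # placeholder

def post_with_backoff(payload, headers, max_attempts=5):
    """Retry on HTTP 429, sleeping for the server-provided Retry-After interval."""
    for _ in range(max_attempts):
        resp = requests.post(GATEWAY_URL, json=payload, headers=headers, timeout=30)
        if resp.status_code != 429:
            return resp
        retry_after = int(resp.headers.get("Retry-After", "1"))
        time.sleep(retry_after)
    return resp  # still rate-limited after max_attempts
```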
Observability & Monitoring
Every rate limit decision emits a structured event in the Keeptrusts event stream. Filter these events in the console to identify clients approaching their limits before they start receiving 429 responses.
Event types
| Event | Description |
|---|---|
| rate_limit.checked | A rate limit check passed. Fields: scope, key_id, user_id, team_id, remaining_rpm, remaining_tpm. |
| rate_limit.exceeded | A request was rejected due to a rate limit. Fields: scope, limit_type (rpm/tpm/parallel), limit_value, current_value. |
| rate_limit.circuit_open | The global limit was reached and the circuit opened (requests are rejected without checking upstream). |
Detecting approaching limits
Use the Keeptrusts console's Rate Limit dashboard view to observe:
- Which API keys are consistently above 80% of their rpm or tpm budget
- Which consumer groups are approaching their pooled ceiling
- Time-of-day patterns that suggest you should shift to sliding_window instead of fixed_window
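If you export or stream these events, the same check can be automated. The sketch below assumes each event is a dict carrying the documented rate_limit.checked fields; the event envelope itself is an assumption for illustration.

```python
from collections import defaultdict

def keys_near_limit(events, rpm_limit, threshold=0.2):
    """Flag API keys whose remaining_rpm is below `threshold` of the limit in most of their checks."""
    low, total = defaultdict(int), defaultdict(int)
    for e in events:
        if e.get("event") != "rate_limit.checked" or e.get("scope") != "per_key":
            continue
        total[e["key_id"]] += 1
        if e.get("remaining_rpm", rpm_limit) < threshold * rpm_limit:
            low[e["key_id"]] += 1
    return [k for k in low if low[k] > total[k] / 2]
```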
Exporting rate limit metrics
Rate limit state is exported as OTLP metrics when OTLP export is configured:
observability:
otlp:
enabled: true
endpoint_env: OTEL_EXPORTER_OTLP_ENDPOINT
export_rate_limit_metrics: true
Exported metric names follow the kt.rate_limit.* namespace:
| Metric | Type | Description |
|---|---|---|
| kt.rate_limit.requests_allowed_total | Counter | Total requests that passed all rate limit checks. |
| kt.rate_limit.requests_rejected_total | Counter | Total requests rejected by any rate limit tier. |
| kt.rate_limit.remaining_rpm | Gauge | Remaining requests per minute for the most constrained active scope. |
| kt.rate_limit.remaining_tpm | Gauge | Remaining tokens per minute for the most constrained active scope. |
Best Practices
- Start conservative and increase limits based on observed traffic. It is easier to raise limits for trusted clients than to explain unexpected rejections. Set per_key limits low initially and monitor rate_limit.exceeded events in the Keeptrusts console to identify keys that regularly hit their ceiling.
- Always configure global limits that match your upstream provider quota. If your upstream API key allows 1M TPM, set global.tpm: 950000 (5% margin) so Keeptrusts rejects excess traffic before it reaches the provider's own rate limiter, avoiding upstream 429s entirely.
- Use local_fallback: true for Redis-backed deployments. A Redis outage should not disable rate limiting entirely. Local fallback maintains per-instance limits, which is a reasonable safety net during short Redis unavailability windows.
- Use consumer groups instead of per-key limits for large fleets. Managing 50 individual per_key entries is error-prone. Use a consumer_groups entry with a pool limit and add individual keys to the group — group membership changes without touching the rate limit values.
- Prefer sliding_window for user-facing rate limits. Fixed windows allow clients to "double-burst" by sending maximum requests just before and just after a window boundary; sliding windows are more consistent for API consumers (see the sketch after this list).
- Log Retry-After header values in your client. Applications that call Keeptrusts should respect the Retry-After header rather than implementing a fixed backoff interval. A client that backs off correctly will self-heal during transient spikes without manual intervention.
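To make the double-burst behaviour concrete, here is a toy comparison of the two window strategies (plain Python, not gateway code): 60 requests arrive just before a window boundary and 60 more just after it.

```python
from collections import deque

LIMIT, WINDOW = 60, 60.0  # 60 requests per 60-second window

def fixed_window_allowed(counts, now):
    """Counter resets at every window boundary."""
    w = int(now // WINDOW)
    counts[w] = counts.get(w, 0) + 1
    return counts[w] <= LIMIT

def sliding_window_allowed(stamps, now):
    """Rolling count of requests in the trailing 60 seconds."""
    while stamps and now - stamps[0] >= WINDOW:
        stamps.popleft()
    stamps.append(now)
    return len(stamps) <= LIMIT

counts, stamps = {}, deque()
times = [59.0 + i * 0.001 for i in range(60)] + [60.0 + i * 0.001 for i in range(60)]
print("fixed window allowed:", sum(fixed_window_allowed(counts, t) for t in times))     # 120
print("sliding window allowed:", sum(sliding_window_allowed(stamps, t) for t in times)) # 60
```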
Identity Modes and Per-User Rate Limiting
Per-user rate limits (the per_user tier) require the gateway to resolve the caller's identity for each request. Keeptrusts supports three identity modes with different trust levels. The active mode is recorded in the identity_proof_method field on every persisted event and trace span.
HeaderSoft (unverified headers)
When no gateway key authentication is configured, the gateway derives the caller's user ID from the X-User-ID request header. This is HeaderSoft mode.
rate_limits:
per_user:
rpm: 60
Trade-offs: Header-soft identity is fast and simple, but any caller that can set the X-User-ID header can impersonate any user. Per-user rate limits enforced in header-soft mode can be bypassed by rotating user IDs. Use header-soft only in controlled internal environments where requests come from trusted infrastructure.
Events produced in header-soft mode carry identity_proof_method: header_soft. The identity_context object is absent because the header value is not verified.
GatewayKey (key-bound identity)
When the gateway authenticates requests via bearer gateway keys, each key can be bound to a specific user_id and team_id at creation time. This is GatewayKey mode.
# Gateway key bound to a user — configure in the Keeptrusts console or API
# POST /v1/gateway-keys { "user_id": "alice@example.com", "max_budget": 10.0 }
Trade-offs: Gateway-key identity is authenticated — the gateway validates the bearer token and resolves the bound user_id/team_id from the key record. Per-user rate limits in gateway-key mode are reliable: a caller cannot change their effective user ID without rotating to a different key with a different binding. This is the recommended mode for production per-user rate limiting.
Events in gateway-key mode carry identity_proof_method: gateway_key and an identity_context object with the resolved user_id, key_id, and team_id.
SignedAssertion (JWKS-verified identity)
When the caller presents a JWKS-signed JWT Bearer token in the Authorization header alongside or instead of a gateway key, the gateway verifies the token against a configured JWKS endpoint. This is SignedAssertion mode.
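As an illustration only, presenting a signed assertion is an ordinary Bearer request: the JWT is minted by your identity provider and the gateway verifies it against the configured JWKS endpoint. The URL, payload, and token below are placeholders.

```python
import requests

GATEWAY_URL = "https://gateway.example.internal/v1/chat/completions"  # placeholder
signed_jwt = "<JWT issued by your IdP and verifiable via the configured JWKS endpoint>"

resp = requests.post(
    GATEWAY_URL,
    headers={"Authorization": f"Bearer {signed_jwt}"},  # the JWT's sub claim becomes the effective user_id
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "hello"}]},
    timeout=30,
)
print(resp.status_code)
```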
Trade-offs: Signed assertions provide the strongest identity guarantee — the caller's identity is cryptographically bound to the token and cannot be forged without the signing key. Use this mode when callers are external services with their own identity provider. The JWT sub claim becomes the effective user_id, and optional kt_key_id/kt_team_id claims are also extracted.
Events in signed-assertion mode carry identity_proof_method: signed_assertion.
Choosing the right mode for per-user rate limiting
| Mode | Identity source | Forgeable? | Recommended for |
|---|---|---|---|
header_soft | X-User-ID header | Yes | Trusted internal dev environments |
gateway_key | Authenticated key record | No | Production APIs with gateway key auth |
signed_assertion | JWKS-verified JWT | No | External callers with their own IdP |
For any production deployment where per-user limits must be enforced accurately and auditably, prefer GatewayKey or SignedAssertion mode. Header-soft mode should be treated as a compatibility layer for legacy or internal flows where spoofing the user ID is not a concern.
For AI systems
- Canonical terms: Keeptrusts Rate Limiting, per-key limit, per-user limit, per-team limit, global limit, IP rate limit, distributed rate limiting, consumer group rate limit, identity modes.
- Config keys: rate_limits.per_key (rpm, tpm, max_parallel_requests), rate_limits.per_user, rate_limits.per_team, rate_limits.global, rate_limits.ip (enabled, rpm, burst, whitelist, header), distributed_rate_limit (enabled, redis_url_env, key_prefix, sync_interval_ms, local_fallback), token_rate_limit (pre_request_estimate, tokenizer).
- Identity modes: header_soft (unverified X-User-ID), gateway_key (key-bound identity), signed_assertion (JWKS-verified JWT).
- Response headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After.
- Event types: rate_limit.checked, rate_limit.exceeded, rate_limit.circuit_open.
- OTLP metrics: kt.rate_limit.requests_allowed_total, kt.rate_limit.requests_rejected_total, kt.rate_limit.remaining_rpm, kt.rate_limit.remaining_tpm.
- Best next pages: Consumer Groups, CORS & IP Allowlist, Provider Routing.
For engineers
- Prerequisites: for distributed limiting, build with --features distributed and set the REDIS_URL environment variable.
- Evaluation order: IP → per_key → consumer_group → per_user → per_team → global.
- Set global.tpm to 95% of your upstream provider's TPM quota to absorb excess traffic before it hits provider-side 429s.
- Use sliding_window for user-facing limits (prevents double-burst at window boundaries).
- Always whitelist internal CIDRs (10.0.0.0/8, 172.16.0.0/12) in IP rate limits to avoid false-positive blocks on internal traffic.
- Monitor: filter Events by rate_limit.exceeded to identify keys approaching their ceiling before they start receiving 429s.
- For production per-user rate limiting, use gateway_key or signed_assertion identity mode — header_soft is trivially spoofable.
For leaders
- Cost control: rate limits prevent runaway agents or misbehaving integrations from exhausting the entire provider budget in minutes.
- Fair use: per-team and per-user limits ensure no single team monopolizes shared LLM capacity.
- Compliance: per-user rate limits with auditable identity modes (gateway_key, signed_assertion) satisfy SOC 2 access control requirements.
- Operational safety: local_fallback: true ensures rate limiting continues even during Redis outages, preventing unbounded traffic during infrastructure incidents.
Next steps
- Consumer Groups — per-group rate limit overrides for team-level budgets
- CORS & IP Allowlist — IP-level access control alongside rate limiting
- Provider Routing — usage_based routing respects per-provider token budgets
- Custom Routes — apply different rate limit profiles per API path