
Rate Limiting

Keeptrusts supports multi-tier rate limiting to prevent abuse, control costs, and enforce fair-use policies across your deployment. Rate limits are evaluated in order from most-specific to least-specific: per-key limits are checked first, then per-user, then per-team, and finally global limits. A request that exceeds any tier is rejected with HTTP 429 before it reaches the upstream provider.

Use this page when

  • You need the exact command, config, API, or integration details for Rate Limiting.
  • You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
  • You want a guided rollout instead of a reference page: use the linked workflow pages in Next steps.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Rate Limit Tiers

Keeptrusts evaluates four rate limit tiers on every request.

per_key — Per API key

Applied based on the API key used to authenticate the request. Use per_key limits to give each integration or service account its own RPM/TPM budget, preventing a single badly-behaving client from consuming all available capacity.

per_user — Per user identity

Applied based on the value in the configured user identity header (default: X-User-Id). Use per_user limits for applications that have authenticated end users and want per-user fairness controls.

per_team — Per team identity

Applied based on the value in the configured team identity header (default: X-Team-Id). Use per_team limits for multi-tenant SaaS applications where each tenant has a separate quota.

global — Across all traffic

Applied as a hard ceiling across all traffic regardless of key, user, or team. Use global limits to enforce your aggregate upstream provider capacity or your upstream API key's rate limit.

All four tiers in a single config

rate_limits:
  per_key:
    rpm: 60                    # requests per minute per API key
    tpm: 100000                # tokens per minute per API key
    max_parallel_requests: 10

  per_user:
    rpm: 20
    tpm: 40000
    max_parallel_requests: 5

  per_team:
    rpm: 200
    tpm: 500000
    max_parallel_requests: 50

  global:
    rpm: 1000
    tpm: 2000000
    max_parallel_requests: 200

Field Reference

ScopeRateLimits fields

| Field | Type | Description |
|---|---|---|
| rpm | integer | Maximum requests per minute for this scope. null = no limit. |
| tpm | integer | Maximum tokens per minute (prompt + completion combined) for this scope. null = no limit. |
| max_parallel_requests | integer | Maximum simultaneous in-flight requests for this scope. null = no limit. |

All three fields are optional. Omitting a field means no limit is applied for that dimension.
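The per-scope check behaves like a windowed counter: each scope tracks how many requests (or tokens) it has admitted in the current window and rejects anything beyond the ceiling. A minimal fixed-window sketch of that idea, purely illustrative and not the Keeptrusts implementation:

```python
import time

class FixedWindowLimiter:
    """Hypothetical sketch of one scope's rpm check (not the real gateway code)."""

    def __init__(self, rpm, window_seconds=60):
        self.rpm = rpm
        self.window_seconds = window_seconds
        self.window_start = 0.0
        self.count = 0

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Reset the counter at each window boundary.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.count = 0
        if self.rpm is not None and self.count >= self.rpm:
            return False  # over the ceiling: reject with HTTP 429
        self.count += 1
        return True
```

With `rpm: 2`, the first two requests in a window pass and the third is rejected until the window resets.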


IP Rate Limiting

IP-based rate limiting throttles requests from the same source IP address. This is the first-line defence against unauthenticated or pre-auth abuse, scraping, and denial-of-service attempts.

rate_limits:
  ip:
    enabled: true
    rpm: 30
    burst: 10                    # allowed burst above rpm, consumed in < 1 second
    whitelist:
      - "10.0.0.0/8"             # internal networks never rate-limited
      - "172.16.0.0/12"
      - "192.168.0.0/16"
    header: "X-Forwarded-For"    # header used to extract the real client IP

IpRateLimitConfig fields

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable IP-based rate limiting. |
| rpm | integer | (none) | Maximum requests per minute from a single IP. |
| burst | integer | 0 | Number of requests above rpm allowed in a burst window. |
| whitelist | list of CIDR strings | [] | IP ranges that bypass IP rate limiting entirely. |
| header | string | "X-Forwarded-For" | HTTP header from which the client IP is extracted. Set to "X-Real-IP" for nginx reverse proxy deployments. |

Always whitelist your internal network ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) and any load-balancer health check IPs. Failing to do so can trigger false-positive rate limits for internal service-to-service traffic.
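The whitelist bypass above can be sketched with Python's stdlib ipaddress module. This is an illustration only (extracting the client IP from X-Forwarded-For and the actual gateway logic are out of scope); the CIDR ranges come from the example config:

```python
import ipaddress

# CIDR ranges from the example whitelist above.
WHITELIST = [ipaddress.ip_network(c) for c in
             ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def is_whitelisted(client_ip: str) -> bool:
    """Return True if the client IP falls inside any whitelisted range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in WHITELIST)
```

A whitelisted IP skips the per-IP counter entirely; everything else proceeds to the rpm/burst check.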

User Rate Limiting

User rate limiting applies independent quotas per authenticated user identity. The gateway extracts the user ID from a configurable request header.

rate_limits:
  per_user:
    rpm: 20
    tpm: 50000
    max_parallel_requests: 5

user_rate_limit:
  header: "X-User-Id"        # header that carries the user identifier
  strategy: fixed_window     # "fixed_window" or "sliding_window"
  window_seconds: 60

UserRateLimitConfig fields

| Field | Type | Default | Description |
|---|---|---|---|
| header | string | "X-User-Id" | Request header containing the user identifier. |
| strategy | string | "fixed_window" | Window algorithm: fixed_window resets the counter at each window boundary; sliding_window tracks a rolling count. |
| window_seconds | integer | 60 | Duration of the rate limit window in seconds. |
| fallback_to_ip | bool | false | If true, fall back to IP-based limiting when the user header is absent. |
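The sliding_window strategy keeps a rolling count of recent request timestamps instead of resetting at a boundary, which prevents the "double-burst" a fixed window allows. A minimal sketch, illustrative only:

```python
from collections import deque

class SlidingWindowLimiter:
    """Hypothetical sketch of the sliding_window strategy (not gateway code)."""

    def __init__(self, rpm, window_seconds=60):
        self.rpm = rpm
        self.window_seconds = window_seconds
        self.timestamps = deque()  # times of admitted requests

    def allow(self, now):
        # Drop timestamps that have aged out of the rolling window.
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.rpm:
            return False  # rolling count is at the ceiling
        self.timestamps.append(now)
        return True
```

Unlike the fixed window, a request admitted at t=1 still counts against the budget at t=59, so bursts cannot straddle a boundary.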

Global Rate Limiting

Global limits act as a hard ceiling across all combined traffic. When the global limit is reached, all new requests are rejected with HTTP 429 regardless of their per-key or per-user quotas.

rate_limits:
  global:
    rpm: 5000
    tpm: 10000000
    max_parallel_requests: 500

global_rate_limit:
  strategy: sliding_window
  window_seconds: 60
  reject_action: return_429  # "return_429" or "queue" (if a queue is configured)

GlobalRateLimitConfig fields

| Field | Type | Default | Description |
|---|---|---|---|
| strategy | string | "fixed_window" | fixed_window or sliding_window. |
| window_seconds | integer | 60 | Duration of the global rate limit window. |
| reject_action | string | "return_429" | Action when the limit is reached: return_429 immediately rejects; queue holds the request if a request queue is configured. |
| queue_timeout_ms | integer | 5000 | Maximum milliseconds a queued request will wait before being rejected. Applies only when reject_action: queue. |
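The reject_action: queue semantics amount to "wait for capacity up to queue_timeout_ms, then behave like return_429". A sketch of that admission step, assuming an asyncio semaphore stands in for the parallel-request budget (the admit helper is hypothetical, not a gateway API):

```python
import asyncio

async def admit(semaphore: asyncio.Semaphore, queue_timeout_ms: int = 5000) -> bool:
    """Wait up to queue_timeout_ms for a free slot; False means HTTP 429."""
    try:
        await asyncio.wait_for(semaphore.acquire(),
                               timeout=queue_timeout_ms / 1000)
        return True   # slot acquired; the request proceeds upstream
    except asyncio.TimeoutError:
        return False  # timed out waiting in the queue
```

The caller must release the semaphore when the request completes, which is what frees queued requests before their timeout expires.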

Token Rate Limiting

Token rate limiting tracks the number of tokens consumed rather than (or in addition to) the number of requests. Useful when you have a metered upstream API key with a token-per-minute budget.

Keeptrusts counts tokens in two ways:

  • Pre-request estimate: For non-streaming requests, the prompt tokens are counted before forwarding using a built-in tokenizer.
  • Post-response reconciliation: After the upstream returns, the actual prompt + completion token counts from the response are used to reconcile the bucket.

rate_limits:
  per_key:
    tpm: 500000    # 500K tokens per minute per API key

  global:
    tpm: 2000000   # 2M tokens per minute across all traffic

token_rate_limit:
  count_prompt_tokens: true
  count_completion_tokens: true
  pre_request_estimate: true   # reject on a pre-request token estimate before the upstream call
  tokenizer: "cl100k_base"     # tiktoken tokenizer name; used for pre-request estimates

TokenRateLimitConfig fields

| Field | Type | Default | Description |
|---|---|---|---|
| count_prompt_tokens | bool | true | Include prompt tokens in the TPM counter. |
| count_completion_tokens | bool | true | Include completion tokens in the TPM counter. |
| pre_request_estimate | bool | false | Estimate prompt tokens before forwarding and reject if the estimate would exceed the TPM limit. |
| tokenizer | string | "cl100k_base" | Tiktoken tokenizer used for pre-request estimation. Use "o200k_base" for GPT-4o and o-series models. |
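The two-phase counting described above (charge an estimate up front, then reconcile against the upstream's reported usage) can be sketched as follows. This is an illustration only: estimate_tokens is a crude stand-in for a real tokenizer such as tiktoken's cl100k_base encoding, and the class is not a gateway API:

```python
class TokenBucketTPM:
    """Hypothetical sketch of pre-estimate + post-response reconciliation."""

    def __init__(self, tpm: int):
        self.tpm = tpm
        self.used = 0  # tokens charged in the current window

    def estimate_tokens(self, prompt: str) -> int:
        # Rough heuristic stand-in for a real tokenizer (~4 chars/token).
        return max(1, len(prompt) // 4)

    def pre_check(self, prompt: str) -> bool:
        est = self.estimate_tokens(prompt)
        if self.used + est > self.tpm:
            return False           # reject before the upstream call
        self.used += est           # provisionally charge the estimate
        self._last_estimate = est
        return True

    def reconcile(self, actual_prompt_tokens: int, completion_tokens: int):
        # Swap the provisional estimate for the upstream-reported counts.
        self.used += (actual_prompt_tokens + completion_tokens) - self._last_estimate
```

Reconciliation matters because completion tokens are unknowable before the call: the pre-check only bounds the prompt side, and the post-response adjustment keeps the bucket honest.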

Distributed Rate Limiting (Redis)

When running multiple Keeptrusts gateway instances behind a load balancer, each instance maintains its own in-memory rate limit counters by default. To enforce consistent limits across the entire pool, configure Redis-backed distributed rate limiting.

distributed_rate_limit:
  enabled: true
  redis_url_env: REDIS_URL   # environment variable containing the Redis connection URL
  key_prefix: "kt:rl:"       # key prefix for all rate limit entries in Redis
  window_ms: 60000           # rate limit window in milliseconds
  sync_interval_ms: 100      # how often local counts are flushed to Redis
  local_fallback: true       # if Redis is unreachable, fall back to local counting

DistributedRateLimitConfig fields

| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable distributed (Redis-backed) rate limiting. |
| redis_url_env | string | (none) | Name of the environment variable holding the Redis URL (e.g., redis://user:pass@host:6379/0). Credentials are never stored in the config file. |
| key_prefix | string | "kt:rl:" | All Redis keys written by the rate limiter are prefixed with this string. Use per-environment prefixes (e.g., "kt:rl:prod:") when sharing a Redis cluster across environments. |
| window_ms | integer | 60000 | Rate limit window in milliseconds. Must match the window_seconds values in rate_limits. |
| sync_interval_ms | integer | 100 | Milliseconds between local counter flushes to Redis. Lower values reduce inconsistency windows at the cost of higher Redis write throughput. |
| local_fallback | bool | true | When true, the gateway falls back to in-memory counting if Redis becomes unreachable. When false, rate limit enforcement is suspended on Redis failure (all requests are allowed through). |
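The local-count / periodic-flush pattern implied by sync_interval_ms can be sketched as below. A plain dict stands in for Redis here (a real implementation would use atomic INCRBY against keys under key_prefix); the class and method names are hypothetical:

```python
import collections

class DistributedCounter:
    """Sketch: per-instance counts flushed periodically to a shared store."""

    def __init__(self, shared_store: dict, key_prefix: str = "kt:rl:"):
        self.shared = shared_store            # stand-in for Redis
        self.key_prefix = key_prefix
        self.local = collections.Counter()    # unflushed local deltas

    def record(self, scope: str, n: int = 1):
        self.local[scope] += n

    def flush(self):
        # Runs every sync_interval_ms: push local deltas to the shared store.
        for scope, n in self.local.items():
            key = self.key_prefix + scope
            self.shared[key] = self.shared.get(key, 0) + n
        self.local.clear()

    def total(self, scope: str) -> int:
        # Pool-wide count = shared value plus this instance's unflushed delta.
        return self.shared.get(self.key_prefix + scope, 0) + self.local[scope]
```

Between flushes, each instance can undercount traffic admitted by its peers; that inconsistency window is exactly what a smaller sync_interval_ms shrinks.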

Full multi-instance example

rate_limits:
  per_key:
    rpm: 100
    tpm: 200000
  global:
    rpm: 2000
    tpm: 5000000

distributed_rate_limit:
  enabled: true
  redis_url_env: REDIS_URL
  key_prefix: "kt:rl:prod:"
  window_ms: 60000
  sync_interval_ms: 50
  local_fallback: true

Set the environment variable before starting each gateway instance:

export REDIS_URL="redis://:yourpassword@redis.internal:6379/0"
kt gateway run --policy-config policy-config.yaml

Consumer Group Rate Limits

Consumer groups let you define named groups of API keys that share a pool-level rate limit in addition to their individual per_key limits. A consumer group rate limit is applied after per-key limits pass; exceeding the group limit rejects the request even if the individual key is still under its quota.

consumer_groups:
  - name: mobile-clients
    description: "All mobile app API keys"
    rate_limits:
      rpm: 500
      tpm: 1000000
      max_parallel_requests: 100
    keys:
      - key_id: mobile-ios-prod
      - key_id: mobile-android-prod
      - key_id: mobile-web-prod

  - name: internal-services
    description: "Backend service-to-service calls"
    rate_limits:
      rpm: 2000
      tpm: 8000000
      max_parallel_requests: 400
    keys:
      - key_id: search-service
      - key_id: recommendation-engine
      - key_id: content-moderation

Evaluation order for a request from mobile-ios-prod:

  1. IP rate limit
  2. per_key rate limit for mobile-ios-prod
  3. mobile-clients consumer group rate limit
  4. per_user rate limit (if user header present)
  5. per_team rate limit (if team header present)
  6. global rate limit
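The evaluation order above is a short-circuiting chain: the first tier that fails rejects the request, and later tiers are never consulted. A sketch of that dispatch (the check callables here are hypothetical placeholders):

```python
def evaluate(request, checks):
    """Run tier checks in order; the first failing tier wins."""
    for name, check in checks:
        if not check(request):
            return (429, name)   # rejected by this tier
    return (200, None)           # all tiers passed

# Placeholder checks in the documented order; only the consumer-group
# check inspects the request in this toy example.
checks = [
    ("ip",             lambda r: True),
    ("per_key",        lambda r: True),
    ("consumer_group", lambda r: r["group_remaining"] > 0),
    ("per_user",       lambda r: True),
    ("per_team",       lambda r: True),
    ("global",         lambda r: True),
]
```

Because the chain short-circuits, a key that exhausts its group pool is rejected at step 3 even though its per-user, per-team, and global budgets were never touched.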

Inheriting and overriding rate limits

A ConsumerGroupRule can inherit a base rate limit profile and override individual fields:

rate_limit_profiles:
  - name: standard
    rpm: 60
    tpm: 100000

consumer_groups:
  - name: premium-tier
    inherits: standard
    rate_limits:
      rpm: 300   # override rpm only; tpm inherits 100000 from "standard"
    keys:
      - key_id: premium-api-key-1
      - key_id: premium-api-key-2
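The inherit-then-override semantics amount to a shallow merge: start from the inherited profile's fields, then lay the group's own rate_limits on top. A sketch, using the example values above (resolve_limits is a hypothetical helper, not a gateway API):

```python
def resolve_limits(profiles: dict, group: dict) -> dict:
    """Shallow-merge an inherited profile with a group's own overrides."""
    base = {}
    if "inherits" in group:
        base.update(profiles[group["inherits"]])   # inherited fields first
    base.update(group.get("rate_limits", {}))      # group fields win
    return base

profiles = {"standard": {"rpm": 60, "tpm": 100000}}
premium = {"name": "premium-tier", "inherits": "standard",
           "rate_limits": {"rpm": 300}}
```

Resolving premium-tier yields rpm 300 (overridden) with tpm 100000 carried over from the standard profile.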

Response Headers

When rate limiting applies, Keeptrusts returns standard rate-limit headers on every response to help clients implement backoff:

| Header | Description |
|---|---|
| X-RateLimit-Limit | The rate limit ceiling for the current scope (requests or tokens depending on which limit was applied). |
| X-RateLimit-Remaining | Number of requests or tokens remaining in the current window. |
| X-RateLimit-Reset | Unix timestamp (seconds) when the current window resets. |
| Retry-After | Seconds until the client should retry. Present only on HTTP 429 responses. |

Example headers on a normal response

HTTP/1.1 200 OK
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 43
X-RateLimit-Reset: 1711577460

Example headers on a rate-limited response

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1711577460
Retry-After: 12
Content-Type: application/json

{
  "error": {
    "code": "rate_limit_exceeded",
    "message": "Request rate limit exceeded for this API key. Retry after 12 seconds.",
    "type": "rate_limit_error",
    "param": null
  }
}

Observability & Monitoring

Every rate limit decision emits a structured event in the Keeptrusts event stream. Filter these events in the console to identify clients approaching their limits before they start receiving 429 responses.

Event types

| Event | Description |
|---|---|
| rate_limit.checked | A rate limit check passed. Fields: scope, key_id, user_id, team_id, remaining_rpm, remaining_tpm. |
| rate_limit.exceeded | A request was rejected due to a rate limit. Fields: scope, limit_type (rpm/tpm/parallel), limit_value, current_value. |
| rate_limit.circuit_open | The global limit was reached and the circuit opened (requests are rejected without checking upstream). |

Detecting approaching limits

Use the Keeptrusts console's Rate Limit dashboard view to observe:

  • Which API keys are consistently above 80% of their rpm or tpm budget
  • Which consumer groups are approaching their pooled ceiling
  • Time-of-day patterns that suggest you should shift to sliding_window instead of fixed_window

Exporting rate limit metrics

Rate limit state is exported as OTLP metrics when OTLP export is configured:

observability:
  otlp:
    enabled: true
    endpoint_env: OTEL_EXPORTER_OTLP_ENDPOINT
    export_rate_limit_metrics: true

Exported metric names follow the kt.rate_limit.* namespace:

| Metric | Type | Description |
|---|---|---|
| kt.rate_limit.requests_allowed_total | Counter | Total requests that passed all rate limit checks. |
| kt.rate_limit.requests_rejected_total | Counter | Total requests rejected by any rate limit tier. |
| kt.rate_limit.remaining_rpm | Gauge | Remaining requests per minute for the most constrained active scope. |
| kt.rate_limit.remaining_tpm | Gauge | Remaining tokens per minute for the most constrained active scope. |

Best Practices

  1. Start conservative and increase limits based on observed traffic. It is easier to raise limits for trusted clients than to explain unexpected rejections. Set per_key limits low initially and monitor rate_limit_exceeded events in the Keeptrusts console to identify keys that regularly hit their ceiling.

  2. Always configure global limits that match your upstream provider quota. If your upstream API key allows 1M TPM, set global.tpm: 950000 (5% margin) so Keeptrusts rejects excess traffic before it reaches the provider's own rate limiter, avoiding upstream 429s entirely.

  3. Use local_fallback: true for Redis-backed deployments. A Redis outage should not disable rate limiting entirely. Local fallback maintains per-instance limits, which is a reasonable safety net during short Redis unavailability windows.

  4. Use consumer groups instead of per-key limits for large fleets. Managing 50 individual per_key entries is error-prone. Use a consumer_groups entry with a pool limit and add individual keys to the group — group membership changes without touching the rate limit values.

  5. Prefer sliding_window for user-facing rate limits. Fixed windows allow clients to "double-burst" by sending maximum requests just before and just after a window boundary. Sliding windows are more consistent for API consumers.

  6. Log Retry-After header values in your client. Applications that call Keeptrusts should respect the Retry-After header rather than implementing a fixed backoff interval. A client that backs off correctly will self-heal during transient spikes without manual intervention.
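Best practice 6 can be sketched as a small client-side retry loop that prefers the server's Retry-After hint and only falls back to exponential backoff when the header is absent. The call_with_backoff helper is hypothetical; `send` is any callable returning a (status, headers) pair:

```python
import time

def call_with_backoff(send, max_attempts: int = 5) -> int:
    """Retry on 429, honoring Retry-After when the server provides it."""
    for attempt in range(max_attempts):
        status, headers = send()
        if status != 429:
            return status
        # Prefer the server's hint; fall back to exponential backoff.
        delay = float(headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    return 429  # still rate-limited after max_attempts
```

A client built this way self-heals during transient spikes: the wait time tracks the actual window reset instead of a guessed fixed interval.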


Identity Modes and Per-User Rate Limiting

Per-user rate limits (the per_user tier) require the gateway to resolve the caller's identity for each request. Keeptrusts supports three identity modes with different trust levels. The active mode is recorded in the identity_proof_method field on every persisted event and trace span.

HeaderSoft (unverified headers)

When no gateway key authentication is configured, the gateway derives the caller's user ID from the X-User-ID request header. This is HeaderSoft mode.

rate_limits:
  per_user:
    rpm: 60

Trade-offs: Header-soft identity is fast and simple, but any caller that can set the X-User-ID header can impersonate any user. Per-user rate limits enforced in header-soft mode can be bypassed by rotating user IDs. Use header-soft only in controlled internal environments where requests come from trusted infrastructure.

Events produced in header-soft mode carry identity_proof_method: header_soft. The identity_context object is absent because the header value is not verified.

GatewayKey (key-bound identity)

When the gateway authenticates requests via bearer gateway keys, each key can be bound to a specific user_id and team_id at creation time. This is GatewayKey mode.

# Gateway key bound to a user — configure in the Keeptrusts console or API
# POST /v1/gateway-keys { "user_id": "alice@example.com", "max_budget": 10.0 }

Trade-offs: Gateway-key identity is authenticated: the gateway validates the bearer token and resolves the bound user_id/team_id from the key record. Per-user rate limits in gateway-key mode are reliable: a caller cannot change their effective user ID without rotating to a different key with a different binding. This is the recommended mode for production per-user rate limiting.

Events in gateway-key mode carry identity_proof_method: gateway_key and an identity_context object with the resolved user_id, key_id, and team_id.

SignedAssertion (JWKS-verified identity)

When the caller presents a JWKS-signed JWT Bearer token in the Authorization header alongside or instead of a gateway key, the gateway verifies the token against a configured JWKS endpoint. This is SignedAssertion mode.

pack:
  name: rate-limiting-providers-13
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: openai
        provider:
          policies:
            chain:
              - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

Trade-offs: Signed assertions provide the strongest identity guarantee — the caller's identity is cryptographically bound to the token and cannot be forged without the signing key. Use this mode when callers are external services with their own identity provider. The JWT sub claim becomes the effective user_id, and optional kt_key_id/kt_team_id claims are also extracted.

Events in signed-assertion mode carry identity_proof_method: signed_assertion.

Choosing the right mode for per-user rate limiting

| Mode | Identity source | Forgeable? | Recommended for |
|---|---|---|---|
| header_soft | X-User-ID header | Yes | Trusted internal dev environments |
| gateway_key | Authenticated key record | No | Production APIs with gateway key auth |
| signed_assertion | JWKS-verified JWT | No | External callers with their own IdP |
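The table above can be read as a dispatch on the identity mode: each mode draws the effective user ID from a different source. A sketch of that resolution step (the function and its parameters are hypothetical illustrations, not a gateway API):

```python
def resolve_user_id(mode: str, headers: dict,
                    key_record: dict = None, jwt_claims: dict = None) -> str:
    """Resolve the effective user ID per identity mode (illustrative only)."""
    if mode == "header_soft":
        return headers.get("X-User-ID")   # unverified: trivially forgeable
    if mode == "gateway_key":
        return key_record["user_id"]      # bound to the key at creation time
    if mode == "signed_assertion":
        return jwt_claims["sub"]          # subject of the verified JWT
    raise ValueError(f"unknown identity mode: {mode}")
```

Only the header_soft branch trusts caller-supplied data, which is why its per-user limits can be bypassed by rotating user IDs.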

For any production deployment where per-user limits must be enforced accurately and auditably, prefer GatewayKey or SignedAssertion mode. Header-soft mode should be treated as a compatibility layer for legacy or internal flows where spoofing the user ID is not a concern.

For AI systems

  • Canonical terms: Keeptrusts Rate Limiting, per-key limit, per-user limit, per-team limit, global limit, IP rate limit, distributed rate limiting, consumer group rate limit, identity modes.
  • Config keys: rate_limits.per_key (rpm, tpm, max_parallel_requests), rate_limits.per_user, rate_limits.per_team, rate_limits.global, rate_limits.ip (enabled, rpm, burst, whitelist, header), distributed_rate_limit (enabled, redis_url_env, key_prefix, sync_interval_ms, local_fallback), token_rate_limit (pre_request_estimate, tokenizer).
  • Identity modes: header_soft (unverified X-User-ID), gateway_key (key-bound identity), signed_assertion (JWKS-verified JWT).
  • Response headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After.
  • Event types: rate_limit.checked, rate_limit.exceeded, rate_limit.circuit_open.
  • OTLP metrics: kt.rate_limit.requests_allowed_total, kt.rate_limit.requests_rejected_total, kt.rate_limit.remaining_rpm, kt.rate_limit.remaining_tpm.
  • Best next pages: Consumer Groups, CORS & IP Allowlist, Provider Routing.

For engineers

  • Prerequisites: for distributed limiting, build with --features distributed and set REDIS_URL environment variable.
  • Evaluation order: IP → per_key → consumer_group → per_user → per_team → global.
  • Set global.tpm to 95% of your upstream provider’s TPM quota to absorb excess traffic before it hits provider-side 429s.
  • Use sliding_window for user-facing limits (prevents double-burst at window boundaries).
  • Always whitelist internal CIDRs (10.0.0.0/8, 172.16.0.0/12) in IP rate limits to avoid false-positive blocks on internal traffic.
  • Monitor: filter Events by rate_limit.exceeded to identify keys approaching their ceiling before they start receiving 429s.
  • For production per-user rate limiting, use gateway_key or signed_assertion identity mode — header_soft is trivially spoofable.

For leaders

  • Cost control: rate limits prevent runaway agents or misbehaving integrations from exhausting the entire provider budget in minutes.
  • Fair use: per-team and per-user limits ensure no single team monopolizes shared LLM capacity.
  • Compliance: per-user rate limits with auditable identity modes (gateway_key, signed_assertion) satisfy SOC 2 access control requirements.
  • Operational safety: local_fallback: true ensures rate limiting continues even during Redis outages, preventing unbounded traffic during infrastructure incidents.

Next steps