Rate Limiting
Keeptrusts supports multi-tier rate limiting to prevent abuse, control costs, and enforce fair-use policies across your deployment. Rate limits are evaluated in order from most-specific to least-specific: per-key limits are checked first, then per-user, then per-team, and finally global limits. A request that exceeds any tier is rejected with HTTP 429 before it reaches the upstream provider.
Use this page when
- You need the exact command, config, API, or integration details for Rate Limiting.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- You want a guided rollout instead of a reference page; if so, use the linked workflow pages in Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Rate Limit Tiers
Keeptrusts evaluates four rate limit tiers on every request.
per_key — Per API key
Applied based on the API key used to authenticate the request. Use per_key limits to give each integration or service account its own RPM/TPM budget, preventing a single misbehaving client from consuming all available capacity.
per_user — Per user identity
Applied based on the value in the configured user identity header (default: X-User-Id). Use per_user limits for applications that have authenticated end users and want per-user fairness controls.
per_team — Per team identity
Applied based on the value in the configured team identity header (default: X-Team-Id). Use per_team limits for multi-tenant SaaS applications where each tenant has a separate quota.
global — Across all traffic
Applied as a hard ceiling across all traffic regardless of key, user, or team. Use global limits to enforce your aggregate upstream provider capacity or your upstream API key's rate limit.
All four tiers in a single config
rate_limits:
per_key:
rpm: 60 # requests per minute per API key
tpm: 100000 # tokens per minute per API key
max_parallel_requests: 10
per_user:
rpm: 20
tpm: 40000
max_parallel_requests: 5
per_team:
rpm: 200
tpm: 500000
max_parallel_requests: 50
global:
rpm: 1000
tpm: 2000000
max_parallel_requests: 200
Field Reference
ScopeRateLimits fields
| Field | Type | Description |
|---|---|---|
| rpm | integer | Maximum requests per minute for this scope. null = no limit. |
| tpm | integer | Maximum tokens per minute (prompt + completion combined) for this scope. null = no limit. |
| max_parallel_requests | integer | Maximum simultaneous in-flight requests for this scope. null = no limit. |
All three fields are optional. Omitting a field means no limit is applied for that dimension.
IP Rate Limiting
IP-based rate limiting throttles requests from the same source IP address. This is the first-line defence against unauthenticated or pre-auth abuse, scraping, and denial-of-service attempts.
rate_limits:
ip:
enabled: true
rpm: 30
burst: 10 # allowed burst above rpm, consumed in < 1 second
whitelist:
- "10.0.0.0/8" # internal networks never rate-limited
- "172.16.0.0/12"
- "192.168.0.0/16"
header: "X-Forwarded-For" # header used to extract the real client IP
IpRateLimitConfig fields
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable IP-based rate limiting. |
| rpm | integer | — | Maximum requests per minute from a single IP. |
| burst | integer | 0 | Number of requests above rpm allowed in a burst window. |
| whitelist | list of CIDR strings | [] | IP ranges that bypass IP rate limiting entirely. |
| header | string | "X-Forwarded-For" | HTTP header from which the client IP is extracted. Set to "X-Real-IP" for nginx reverse proxy deployments. |
Always whitelist your internal CIDR ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) and any load-balancer health check IPs. Failing to do so can trigger false-positive rate limits for internal service-to-service traffic.
User Rate Limiting
User rate limiting applies independent quotas per authenticated user identity. The gateway extracts the user ID from a configurable request header.
rate_limits:
per_user:
rpm: 20
tpm: 50000
max_parallel_requests: 5
user_rate_limit:
header: "X-User-Id" # header that carries the user identifier
strategy: fixed_window # "fixed_window" or "sliding_window"
window_seconds: 60
UserRateLimitConfig fields
| Field | Type | Default | Description |
|---|---|---|---|
| header | string | "X-User-Id" | Request header containing the user identifier. |
| strategy | string | "fixed_window" | Window algorithm: fixed_window resets the counter at each window boundary; sliding_window tracks a rolling count. |
| window_seconds | integer | 60 | Duration of the rate limit window in seconds. |
| fallback_to_ip | bool | false | If true, fall back to IP-based limiting when the user header is absent. |
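For illustration, the sketch below shows a client call that supplies the identity headers the per_user and per_team tiers key on. The gateway URL, API key, and request payload are placeholders, not values from this page.

```python
import requests

# Placeholders — substitute your own gateway endpoint and credentials.
GATEWAY_URL = "https://gateway.example.internal/v1/chat/completions"
API_KEY = "kt-example-key"

response = requests.post(
    GATEWAY_URL,
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "X-User-Id": "alice@example.com",  # matched against the per_user tier (default header)
        "X-Team-Id": "team-billing",       # matched against the per_team tier (default header)
    },
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "hello"}]},
    timeout=30,
)
print(response.status_code, response.headers.get("X-RateLimit-Remaining"))
```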
Global Rate Limiting
Global limits act as a hard ceiling across all combined traffic. When the global limit is reached, all new requests are rejected with HTTP 429 regardless of their per-key or per-user quotas.
rate_limits:
global:
rpm: 5000
tpm: 10000000
max_parallel_requests: 500
global_rate_limit:
strategy: sliding_window
window_seconds: 60
reject_action: return_429 # "return_429" or "queue" (if queue is configured)
GlobalRateLimitConfig fields
| Field | Type | Default | Description |
|---|---|---|---|
| strategy | string | "fixed_window" | fixed_window or sliding_window. |
| window_seconds | integer | 60 | Duration of the global rate limit window. |
| reject_action | string | "return_429" | Action when the limit is reached: return_429 immediately rejects; queue holds the request if a request queue is configured. |
| queue_timeout_ms | integer | 5000 | Maximum milliseconds a queued request will wait before being rejected. Only applies when reject_action: queue. |
Token Rate Limiting
Token rate limiting tracks the number of tokens consumed rather than (or in addition to) the number of requests. Useful when you have a metered upstream API key with a token-per-minute budget.
Keeptrusts counts tokens in two ways:
- Pre-request estimate: For non-streaming requests, the prompt tokens are counted before forwarding using a built-in tokenizer.
- Post-response reconciliation: After the upstream returns, the actual prompt + completion token counts from the response are used to reconcile the bucket.
rate_limits:
per_key:
tpm: 500000 # 500K tokens per minute per API key
global:
tpm: 2000000 # 2M tokens per minute across all traffic
token_rate_limit:
count_prompt_tokens: true
count_completion_tokens: true
pre_request_estimate: true # enforce a pre-request token estimate to reject before upstream call
tokenizer: "cl100k_base" # tiktoken tokenizer name; used for pre-request estimates
TokenRateLimitConfig fields
| Field | Type | Default | Description |
|---|---|---|---|
| count_prompt_tokens | bool | true | Include prompt tokens in the TPM counter. |
| count_completion_tokens | bool | true | Include completion tokens in the TPM counter. |
| pre_request_estimate | bool | false | Estimate prompt tokens before forwarding and reject if the estimate would exceed the TPM limit. |
| tokenizer | string | "cl100k_base" | Tiktoken tokenizer used for pre-request estimation. Use "o200k_base" for GPT-4o and o-series models. |
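For a feel of what the pre-request estimate involves, here is a minimal client-side sketch using the tiktoken library with the cl100k_base encoding named above. It is an approximation under that encoding assumption, not the gateway's internal counter.

```python
import tiktoken

def estimate_prompt_tokens(messages, encoding_name="cl100k_base"):
    """Rough prompt-token estimate: encode each message's text content and sum the lengths."""
    enc = tiktoken.get_encoding(encoding_name)
    return sum(len(enc.encode(m.get("content", ""))) for m in messages)

messages = [{"role": "user", "content": "Summarize the quarterly report in three bullet points."}]
estimate = estimate_prompt_tokens(messages)
print(f"Estimated prompt tokens: {estimate}")  # compare against your per_key.tpm budget
```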
Distributed Rate Limiting (Redis)
When running multiple Keeptrusts gateway instances behind a load balancer, each instance maintains its own in-memory rate limit counters by default. To enforce consistent limits across the entire pool, configure Redis-backed distributed rate limiting.
distributed_rate_limit:
enabled: true
redis_url_env: REDIS_URL # environment variable containing the Redis connection URL
key_prefix: "kt:rl:" # key prefix for all rate limit entries in Redis
window_ms: 60000 # rate limit window in milliseconds
sync_interval_ms: 100 # how often local counts are flushed to Redis
local_fallback: true # if Redis is unreachable, fall back to local counting
DistributedRateLimitConfig fields
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable distributed (Redis-backed) rate limiting. |
| redis_url_env | string | — | Name of the environment variable holding the Redis URL (e.g., redis://user:pass@host:6379/0). Credentials are never stored in the config file. |
| key_prefix | string | "kt:rl:" | All Redis keys written by the rate limiter are prefixed with this string. Use per-environment prefixes (e.g., "kt:rl:prod:") when sharing a Redis cluster across environments. |
| window_ms | integer | 60000 | Rate limit window in milliseconds. Must match the window_seconds values in rate_limits. |
| sync_interval_ms | integer | 100 | Milliseconds between local counter flushes to Redis. Lower values reduce inconsistency windows at the cost of higher Redis write throughput. |
| local_fallback | bool | true | When true, the gateway falls back to in-memory counting if Redis becomes unreachable. When false, rate limit enforcement is suspended on Redis failure (all requests are allowed through). |
Full multi-instance example
rate_limits:
per_key:
rpm: 100
tpm: 200000
global:
rpm: 2000
tpm: 5000000
distributed_rate_limit:
enabled: true
redis_url_env: REDIS_URL
key_prefix: "kt:rl:prod:"
window_ms: 60000
sync_interval_ms: 50
local_fallback: true
Set the environment variable before starting each gateway instance:
export REDIS_URL="redis://:yourpassword@redis.internal:6379/0"
kt gateway run --policy-config policy-config.yaml
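Conceptually, distributed counting boils down to sharing one counter per scope and window in Redis. The sketch below illustrates that idea with a plain fixed-window INCR/EXPIRE counter using redis-py; it is not Keeptrusts' actual implementation, and the key layout is only an assumption for the example.

```python
import time
import redis  # redis-py

r = redis.Redis.from_url("redis://:yourpassword@redis.internal:6379/0")

def allow_request(key_id: str, rpm_limit: int, prefix: str = "kt:rl:prod:") -> bool:
    """Fixed-window check: one shared counter per key per minute, visible to every gateway instance."""
    window = int(time.time() // 60)              # current one-minute window
    counter_key = f"{prefix}{key_id}:{window}"
    count = r.incr(counter_key)                  # atomic increment across all instances
    if count == 1:
        r.expire(counter_key, 120)               # let stale window counters expire
    return count <= rpm_limit

if allow_request("mobile-ios-prod", rpm_limit=100):
    print("forward to upstream")
else:
    print("reject with HTTP 429")
```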
Consumer Group Rate Limits
Consumer groups let you define named groups of API keys that share a pool-level rate limit in addition to their individual per_key limits. A consumer group rate limit is applied after per-key limits pass; exceeding the group limit rejects the request even if the individual key is still under its quota.
consumer_groups:
- name: mobile-clients
description: "All mobile app API keys"
rate_limits:
rpm: 500
tpm: 1000000
max_parallel_requests: 100
keys:
- key_id: mobile-ios-prod
- key_id: mobile-android-prod
- key_id: mobile-web-prod
- name: internal-services
description: "Backend service-to-service calls"
rate_limits:
rpm: 2000
tpm: 8000000
max_parallel_requests: 400
keys:
- key_id: search-service
- key_id: recommendation-engine
- key_id: content-moderation
Evaluation order for a request from mobile-ios-prod:
- IP rate limit
- per_key rate limit for mobile-ios-prod
- mobile-clients consumer group rate limit
- per_user rate limit (if user header present)
- per_team rate limit (if team header present)
- global rate limit
Inheriting and overriding rate limits
A ConsumerGroupRule can inherit a base rate limit profile and override individual fields:
rate_limit_profiles:
- name: standard
rpm: 60
tpm: 100000
consumer_groups:
- name: premium-tier
inherits: standard
rate_limits:
rpm: 300 # override rpm only; tpm inherits 100000 from "standard"
keys:
- key_id: premium-api-key-1
- key_id: premium-api-key-2
Response Headers
When rate limiting is enabled, Keeptrusts returns standard rate-limit headers on every response to help clients implement backoff:
| Header | Description |
|---|---|
| X-RateLimit-Limit | The rate limit ceiling for the current scope (requests or tokens depending on which limit was applied). |
| X-RateLimit-Remaining | Number of requests or tokens remaining in the current window. |
| X-RateLimit-Reset | Unix timestamp (seconds) when the current window resets. |
| Retry-After | Seconds until the client should retry. Present only on HTTP 429 responses. |
Example headers on a normal response
HTTP/1.1 200 OK
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 43
X-RateLimit-Reset: 1711577460
Example headers on a rate-limited response
HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 60
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1711577460
Retry-After: 12
Content-Type: application/json
{
"error": {
"code": "rate_limit_exceeded",
"message": "Request rate limit exceeded for this API key. Retry after 12 seconds.",
"type": "rate_limit_error",
"param": null
}
}
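Clients should honor Retry-After rather than guessing a backoff interval (see Best Practices below). A minimal retry sketch, assuming a placeholder gateway URL and request payload:

```python
import time
import requests

GATEWAY_URL = "https://gateway.example.internal/v1/chat/completions"  # placeholder

def post_with_backoff(payload, headers, max_attempts=5):
    """Retry on HTTP 429, sleeping for the server-provided Retry-After interval."""
    for _ in range(max_attempts):
        resp = requests.post(GATEWAY_URL, json=payload, headers=headers, timeout=30)
        if resp.status_code != 429:
            return resp
        retry_after = int(resp.headers.get("Retry-After", "1"))
        time.sleep(retry_after)
    return resp  # still rate-limited after max_attempts
```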
Observability & Monitoring
Every rate limit decision emits a structured event in the Keeptrusts event stream. Filter these events in the console to identify clients approaching their limits before they start receiving 429 responses.
Event types
| Event | Description |
|---|---|
| rate_limit.checked | A rate limit check passed. Fields: scope, key_id, user_id, team_id, remaining_rpm, remaining_tpm. |
| rate_limit.exceeded | A request was rejected due to a rate limit. Fields: scope, limit_type (rpm/tpm/parallel), limit_value, current_value. |
| rate_limit.circuit_open | The global limit was reached and the circuit opened (requests are rejected without checking upstream). |
Detecting approaching limits
Use the Keeptrusts console's Rate Limit dashboard view to observe:
- Which API keys are consistently above 80% of their rpm or tpm budget
- Which consumer groups are approaching their pooled ceiling
- Time-of-day patterns that suggest you should shift to sliding_window instead of fixed_window
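If you export or stream these events, the same check can be automated. The sketch below assumes each event is a dict carrying the documented rate_limit.checked fields; the event envelope itself is an assumption for illustration.

```python
from collections import defaultdict

def keys_near_limit(events, rpm_limit, threshold=0.2):
    """Flag API keys whose remaining_rpm is below `threshold` of the limit in most of their checks."""
    low, total = defaultdict(int), defaultdict(int)
    for e in events:
        if e.get("event") != "rate_limit.checked" or e.get("scope") != "per_key":
            continue
        total[e["key_id"]] += 1
        if e.get("remaining_rpm", rpm_limit) < threshold * rpm_limit:
            low[e["key_id"]] += 1
    return [k for k in low if low[k] > total[k] / 2]
```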
Exporting rate limit metrics
Rate limit state is exported as OTLP metrics when OTLP export is configured:
observability:
otlp:
enabled: true
endpoint_env: OTEL_EXPORTER_OTLP_ENDPOINT
export_rate_limit_metrics: true
Exported metric names follow the kt.rate_limit.* namespace:
| Metric | Type | Description |
|---|---|---|
| kt.rate_limit.requests_allowed_total | Counter | Total requests that passed all rate limit checks. |
| kt.rate_limit.requests_rejected_total | Counter | Total requests rejected by any rate limit tier. |
| kt.rate_limit.remaining_rpm | Gauge | Remaining requests per minute for the most constrained active scope. |
| kt.rate_limit.remaining_tpm | Gauge | Remaining tokens per minute for the most constrained active scope. |
Best Practices
- Start conservative and increase limits based on observed traffic. It is easier to raise limits for trusted clients than to explain unexpected rejections. Set per_key limits low initially and monitor rate_limit.exceeded events in the Keeptrusts console to identify keys that regularly hit their ceiling.
- Always configure global limits that match your upstream provider quota. If your upstream API key allows 1M TPM, set global.tpm: 950000 (5% margin) so Keeptrusts rejects excess traffic before it reaches the provider's own rate limiter, avoiding upstream 429s entirely.
- Use local_fallback: true for Redis-backed deployments. A Redis outage should not disable rate limiting entirely. Local fallback maintains per-instance limits, which is a reasonable safety net during short Redis unavailability windows.
- Use consumer groups instead of per-key limits for large fleets. Managing 50 individual per_key entries is error-prone. Use a consumer_groups entry with a pool limit and add individual keys to the group — group membership changes without touching the rate limit values.
- Prefer sliding_window for user-facing rate limits. Fixed windows allow clients to "double-burst" by sending maximum requests just before and just after a window boundary; sliding windows are more consistent for API consumers (see the sketch after this list).
- Log Retry-After header values in your client. Applications that call Keeptrusts should respect the Retry-After header rather than implementing a fixed backoff interval. A client that backs off correctly will self-heal during transient spikes without manual intervention.
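To make the double-burst behaviour concrete, here is a toy comparison of the two window strategies (plain Python, not gateway code): 60 requests arrive just before a window boundary and 60 more just after it.

```python
from collections import deque

LIMIT, WINDOW = 60, 60.0  # 60 requests per 60-second window

def fixed_window_allowed(counts, now):
    """Counter resets at every window boundary."""
    w = int(now // WINDOW)
    counts[w] = counts.get(w, 0) + 1
    return counts[w] <= LIMIT

def sliding_window_allowed(stamps, now):
    """Rolling count of requests in the trailing 60 seconds."""
    while stamps and now - stamps[0] >= WINDOW:
        stamps.popleft()
    stamps.append(now)
    return len(stamps) <= LIMIT

counts, stamps = {}, deque()
times = [59.0 + i * 0.001 for i in range(60)] + [60.0 + i * 0.001 for i in range(60)]
print("fixed window allowed:", sum(fixed_window_allowed(counts, t) for t in times))     # 120
print("sliding window allowed:", sum(sliding_window_allowed(stamps, t) for t in times)) # 60
```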
Identity Modes and Per-User Rate Limiting
Per-user rate limits (the per_user tier) require the gateway to resolve the caller's identity for each request. Keeptrusts supports three identity modes with different trust levels. The active mode is recorded in the identity_proof_method field on every persisted event and trace span.
HeaderSoft (unverified headers)
When no gateway key authentication is configured, the gateway derives the caller's user ID from the X-User-ID request header. This is HeaderSoft mode.
rate_limits:
per_user:
rpm: 60
Trade-offs: Header-soft identity is fast and simple, but any caller that can set the X-User-ID header can impersonate any user. Per-user rate limits enforced in header-soft mode can be bypassed by rotating user IDs. Use header-soft only in controlled internal environments where requests come from trusted infrastructure.
Events produced in header-soft mode carry identity_proof_method: header_soft. The identity_context object is absent because the header value is not verified.
GatewayKey (key-bound identity)
When the gateway authenticates requests via bearer gateway keys, each key can be bound to a specific user_id and team_id at creation time. This is GatewayKey mode.
# Gateway key bound to a user — configure in the Keeptrusts console or API
# POST /v1/gateway-keys { "user_id": "alice@example.com", "max_budget": 10.0 }
Trade-offs: Gateway-key identity is authenticated — the gateway validates the bearer token and resolves the bound user_id/team_id from the key record. Per-user rate limits in gateway-key mode are reliable: a caller cannot change their effective user ID without rotating to a different key with a different binding. This is the recommended mode for production per-user rate limiting.
Events in gateway-key mode carry identity_proof_method: gateway_key and an identity_context object with the resolved user_id, key_id, and team_id.
SignedAssertion (JWKS-verified identity)
When the caller presents a JWKS-signed JWT Bearer token in the Authorization header alongside or instead of a gateway key, the gateway verifies the token against a configured JWKS endpoint. This is SignedAssertion mode.
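As an illustration only, presenting a signed assertion is an ordinary Bearer request: the JWT is minted by your identity provider and the gateway verifies it against the configured JWKS endpoint. The URL, payload, and token below are placeholders.

```python
import requests

GATEWAY_URL = "https://gateway.example.internal/v1/chat/completions"  # placeholder
signed_jwt = "<JWT issued by your IdP and verifiable via the configured JWKS endpoint>"

resp = requests.post(
    GATEWAY_URL,
    headers={"Authorization": f"Bearer {signed_jwt}"},  # the JWT's sub claim becomes the effective user_id
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "hello"}]},
    timeout=30,
)
print(resp.status_code)
```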
Trade-offs: Signed assertions provide the strongest identity guarantee — the caller's identity is cryptographically bound to the token and cannot be forged without the signing key. Use this mode when callers are external services with their own identity provider. The JWT sub claim becomes the effective user_id, and optional kt_key_id/kt_team_id claims are also extracted.
Events in signed-assertion mode carry identity_proof_method: signed_assertion.
Choosing the right mode for per-user rate limiting
| Mode | Identity source | Forgeable? | Recommended for |
|---|---|---|---|
header_soft | X-User-ID header | Yes | Trusted internal dev environments |
gateway_key | Authenticated key record | No | Production APIs with gateway key auth |
signed_assertion | JWKS-verified JWT | No | External callers with their own IdP |
For any production deployment where per-user limits must be enforced accurately and auditably, prefer GatewayKey or SignedAssertion mode. Header-soft mode should be treated as a compatibility layer for legacy or internal flows where spoofing the user ID is not a concern.
For AI systems
- Canonical terms: Keeptrusts Rate Limiting, per-key limit, per-user limit, per-team limit, global limit, IP rate limit, distributed rate limiting, consumer group rate limit, identity modes.
- Config keys: rate_limits.per_key (rpm, tpm, max_parallel_requests), rate_limits.per_user, rate_limits.per_team, rate_limits.global, rate_limits.ip (enabled, rpm, burst, whitelist, header), distributed_rate_limit (enabled, redis_url_env, key_prefix, sync_interval_ms, local_fallback), token_rate_limit (pre_request_estimate, tokenizer).
- Identity modes: header_soft (unverified X-User-ID), gateway_key (key-bound identity), signed_assertion (JWKS-verified JWT).
- Response headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After.
- Event types: rate_limit.checked, rate_limit.exceeded, rate_limit.circuit_open.
- OTLP metrics: kt.rate_limit.requests_allowed_total, kt.rate_limit.requests_rejected_total, kt.rate_limit.remaining_rpm, kt.rate_limit.remaining_tpm.
- Best next pages: Consumer Groups, CORS & IP Allowlist, Provider Routing.
For engineers
- Prerequisites: for distributed limiting, build with --features distributed and set the REDIS_URL environment variable.
- Evaluation order: IP → per_key → consumer_group → per_user → per_team → global.
- Set global.tpm to 95% of your upstream provider's TPM quota to absorb excess traffic before it hits provider-side 429s.
- Use sliding_window for user-facing limits (prevents double-burst at window boundaries).
- Always whitelist internal CIDRs (10.0.0.0/8, 172.16.0.0/12) in IP rate limits to avoid false-positive blocks on internal traffic.
- Monitor: filter Events by rate_limit.exceeded to identify keys approaching their ceiling before they start receiving 429s.
- For production per-user rate limiting, use gateway_key or signed_assertion identity mode — header_soft is trivially spoofable.
For leaders
- Cost control: rate limits prevent runaway agents or misbehaving integrations from exhausting the entire provider budget in minutes.
- Fair use: per-team and per-user limits ensure no single team monopolizes shared LLM capacity.
- Compliance: per-user rate limits with auditable identity modes (gateway_key, signed_assertion) satisfy SOC 2 access control requirements.
- Operational safety: local_fallback: true ensures rate limiting continues even during Redis outages, preventing unbounded traffic during infrastructure incidents.
Next steps
- Consumer Groups — per-group rate limit overrides for team-level budgets
- CORS & IP Allowlist — IP-level access control alongside rate limiting
- Provider Routing — usage_based routing respects per-provider token budgets
- Custom Routes — apply different rate limit profiles per API path