Rate Limiting AI Usage Per User, Team, and Organization
Keeptrusts rate-limits AI usage by applying independent quotas at multiple scopes. In practice, that means you can enforce per-user fairness, per-team isolation, and an organization-wide ceiling with the global tier. If you also need hard cost control, you pair those limits with wallets and budgets so traffic is constrained by both capacity and money.
Use this page when
- You need to stop one user, one team, or one integration from consuming all available LLM capacity.
- You want a clean explanation of how user, team, and organization-wide throttling fit together.
- You are designing rate limits that work with spend controls instead of fighting them.
Primary audience
- Primary: Technical Engineers
- Secondary: Technical Leaders, AI Agents
The problem
Provider-side rate limits are not enough. If your upstream key allows a certain number of requests per minute (RPM) or tokens per minute (TPM), the provider only tells you that you are over the line after the request has already reached its edge. That is late, noisy, and usually disconnected from your own identity model. It also gives you no clean way to decide whether the problem came from one user, one team, or everyone combined.
Inside an organization, those are different control problems. A per-user limit is about fairness and abuse prevention. A per-team limit is about tenant isolation and predictable service. An organization-wide limit is about protecting the total capacity you are willing to allocate to a gateway or upstream provider account.
The other challenge is identity. A user-scoped limit only means anything if the gateway can trust who the user is. The docs are explicit about this: header-only identity is easy to spoof, while API-token-bound identity or signed assertions provide a stronger basis for production enforcement.
Finally, rate limiting is not the same as cost control. RPM and TPM ceilings constrain traffic, but they do not replace wallets. If you only set quotas and ignore budgets, you can still burn money efficiently inside those allowed limits.
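As a rough illustration of why quotas alone do not cap spend, the sketch below computes what sustained traffic at a TPM ceiling could cost. The price per million tokens is a made-up placeholder, not a real provider rate, and the 2,000,000 TPM figure is the global value from the example config later on this page.

```python
def max_hourly_cost(tpm_limit: int, usd_per_million_tokens: float) -> float:
    """Upper bound on hourly spend if traffic sits at the TPM ceiling all hour."""
    tokens_per_hour = tpm_limit * 60
    return tokens_per_hour / 1_000_000 * usd_per_million_tokens


# ASSUMPTION: hypothetical blended price; substitute your provider's real rate.
ASSUMED_USD_PER_MILLION_TOKENS = 1.00

hourly = max_hourly_cost(2_000_000, ASSUMED_USD_PER_MILLION_TOKENS)
print(f"Worst case at the ceiling: ${hourly:,.2f}/hour, ${hourly * 24:,.2f}/day")
```

Under that assumed price, traffic that never trips a single rate limit could still cost $120 per hour, which is the gap a wallet or budget is meant to close.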
The solution
Keeptrusts uses a tiered model. The advanced rate-limiting docs describe the evaluation order as IP, per key, consumer group, per user, per team, and finally global. For the question this page addresses, the three tiers that matter are per_user, per_team, and global.
per_user protects against a single caller monopolizing resources. per_team protects a tenant or department boundary. global is the organization-wide hard ceiling across all traffic handled by the gateway. In other words, the “organization” layer in practical operations is the global tier.
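The tiered idea can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the gateway's implementation: it covers only the three tiers discussed here, it ignores window resets, and it assumes a rejected request is not counted against any tier. The class and method names are hypothetical.

```python
from collections import defaultdict

# Tier order follows the evaluation order described above, reduced to the
# three tiers this page focuses on. Limits mirror the example RPM values.
TIERS = [("per_user", 20), ("per_team", 200), ("global", 1000)]


class TieredCounter:
    """Counts requests per tier within one window (window reset omitted)."""

    def __init__(self, tiers):
        self.tiers = tiers
        self.counts = defaultdict(int)  # (tier, bucket) -> admitted requests

    def check(self, user: str, team: str):
        """Return (allowed, rejecting_tier). Only admitted requests are counted."""
        buckets = {"per_user": user, "per_team": team, "global": "*"}
        # First pass: the first tier that is already full rejects the request.
        for tier, limit in self.tiers:
            if self.counts[(tier, buckets[tier])] >= limit:
                return False, tier
        # Second pass: every tier passed, so charge the request to all of them.
        for tier, _ in self.tiers:
            self.counts[(tier, buckets[tier])] += 1
        return True, None


limiter = TieredCounter(TIERS)
results = [limiter.check("alice", "engineering") for _ in range(21)]
print(results[-1])                          # (False, 'per_user')
print(limiter.check("bob", "engineering"))  # (True, None)
```

Alice's 21st request is rejected at the per_user tier, while Bob, on the same team, is still admitted because the team and global buckets have headroom.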
That is only part of the picture. Consumer groups are useful when you want pooled limits for sets of API keys, and provider-level rate limits are useful when you want to cap consumption against a specific upstream target. Wallets and budgets then cover the spend side: reserves and settlements enforce money, while rate limits enforce capacity.
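To make the reserve-and-settle idea concrete, here is a minimal sketch of how such a wallet could behave. The `Wallet` class and its `reserve` and `settle` methods are illustrative names, not the Keeptrusts API, and amounts are integer cents to avoid floating-point drift.

```python
class Wallet:
    """Toy wallet in integer cents: reserve before a call, settle after."""

    def __init__(self, balance_cents: int):
        self.balance = balance_cents
        self.reserved = 0

    def reserve(self, estimate_cents: int) -> bool:
        """Hold funds for an in-flight request; refuse if not affordable."""
        if self.balance - self.reserved < estimate_cents:
            return False
        self.reserved += estimate_cents
        return True

    def settle(self, estimate_cents: int, actual_cents: int) -> None:
        """Release the hold and charge what the request really cost."""
        self.reserved -= estimate_cents
        self.balance -= actual_cents


wallet = Wallet(balance_cents=100)
first_hold = wallet.reserve(60)         # holds 60 cents for request one
print(first_hold, wallet.reserve(60))   # True False
wallet.settle(60, actual_cents=45)      # request one actually cost 45 cents
print(wallet.balance, wallet.reserved)  # 55 0
print(wallet.reserve(50))               # True
```

The second reservation fails even though no money has been spent yet, because the first hold already claims most of the balance. That is the sense in which a wallet enforces money independently of any RPM or TPM counter.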
Implementation
For a first rollout, keep the example explicit: configure user, team, and global limits, choose a window strategy, and wire distributed coordination if you run multiple gateway instances.
rate_limits:
  per_user:
    rpm: 20
    tpm: 40000
    max_parallel_requests: 5
  per_team:
    rpm: 200
    tpm: 500000
    max_parallel_requests: 50
  global:
    rpm: 1000
    tpm: 2000000
    max_parallel_requests: 200
user_rate_limit:
  header: "X-User-Id"
  strategy: sliding_window
  window_seconds: 60
global_rate_limit:
  strategy: sliding_window
  window_seconds: 60
  reject_action: return_429
distributed_rate_limit:
  enabled: true
  redis_url_env: REDIS_URL
  key_prefix: "kt:rl:prod:"
  window_ms: 60000
  sync_interval_ms: 50
  local_fallback: true
providers:
  targets:
    - id: openai-primary
      provider: openai
      model: gpt-5.4-mini-mini
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      retention_days: 90
That config answers three operational questions directly.
- How much can one user do? per_user defines that.
- How much can one team do in aggregate? per_team defines that.
- What is the absolute top line for the whole gateway? global defines that.
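The config selects a sliding_window strategy with a 60-second window. The source does not say how the gateway implements it, so the following is only one common interpretation: a sliding-window log that remembers the timestamps of admitted requests. The class name and the injected `now` parameter are assumptions made so the example can run deterministically.

```python
from collections import deque


class SlidingWindowLimiter:
    """Sliding-window log: allow at most `limit` requests in any window."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window = window_seconds
        self.stamps = deque()  # timestamps of admitted requests, oldest first

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the window.
        while self.stamps and now - self.stamps[0] >= self.window:
            self.stamps.popleft()
        if len(self.stamps) >= self.limit:
            return False
        self.stamps.append(now)
        return True

    def retry_after(self, now: float) -> float:
        """Seconds until the oldest admitted request leaves the window."""
        if len(self.stamps) < self.limit:
            return 0.0
        return max(0.0, self.stamps[0] + self.window - now)


# Mirrors per_user rpm: 20 with window_seconds: 60.
user_limiter = SlidingWindowLimiter(limit=20, window_seconds=60)
burst = [user_limiter.allow(now=float(t)) for t in range(20)]  # t = 0..19
print(all(burst))                      # True
print(user_limiter.allow(now=30.0))    # False
print(user_limiter.retry_after(30.0))  # 30.0
print(user_limiter.allow(now=60.0))    # True
```

Unlike a fixed window that resets on the minute boundary, this approach frees capacity one request at a time as old timestamps expire, which avoids a double burst across the boundary.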
Run it with a distributed backend if you have more than one gateway instance:
export REDIS_URL="redis://:yourpassword@redis.internal:6379/0"
kt policy lint --file policy-config.yaml
kt gateway run --policy-config policy-config.yaml --listen 0.0.0.0:41002
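The reason a shared backend matters can be shown with a small simulation. It is a sketch only: `SharedStore` is an in-memory stand-in for something like Redis, the `Gateway` class is hypothetical, and the model deliberately ignores sync intervals and local fallback.

```python
class SharedStore:
    """In-memory stand-in for a shared counter backend such as Redis."""

    def __init__(self):
        self.count = 0

    def incr(self) -> int:
        self.count += 1
        return self.count


class Gateway:
    """One gateway instance enforcing a global request limit."""

    def __init__(self, limit: int, store: SharedStore = None):
        self.limit = limit
        self.store = store      # None means local-only counting
        self.local_count = 0

    def allow(self) -> bool:
        if self.store is not None:
            return self.store.incr() <= self.limit
        self.local_count += 1
        return self.local_count <= self.limit


def admitted(gateways, requests_per_gateway: int) -> int:
    """Total requests admitted when each instance receives the same load."""
    return sum(g.allow() for g in gateways for _ in range(requests_per_gateway))


local_only = [Gateway(limit=1000) for _ in range(3)]
store = SharedStore()
coordinated = [Gateway(limit=1000, store=store) for _ in range(3)]

print(admitted(local_only, 1000))   # 3000
print(admitted(coordinated, 1000))  # 1000
```

With three uncoordinated instances, a global limit of 1000 silently becomes 3000. Sharing the counter restores the intended ceiling.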
Then send a request with user and team headers so the gateway can bucket the usage correctly:
curl -s -D- http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-User-Id: alice@example.com" \
  -H "X-Team-Id: engineering" \
  -d '{
    "model": "gpt-5.4-mini-mini",
    "messages": [
      {"role": "user", "content": "Summarize the deployment checklist in five bullets."}
    ]
  }' | grep -iE "^(HTTP|X-RateLimit|Retry-After)"
In production, use the stronger identity modes described in the advanced rate-limiting docs. Header-only identity is fine for trusted internal development environments, but auditable per-user throttling should use API-token-bound identity or signed assertions. Otherwise users can simply rotate or spoof header values and defeat the point of the limit.
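To show why a signed assertion is harder to spoof than a bare header, here is a minimal HMAC-based sketch. The message layout, function names, and shared secret are all assumptions for illustration; the actual assertion format Keeptrusts expects is defined in its own docs and is not reproduced here.

```python
import hashlib
import hmac


def sign_identity(user_id: str, team_id: str, secret: bytes) -> str:
    """Issue a hex HMAC-SHA256 signature over the claimed identity."""
    message = f"{user_id}\n{team_id}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_identity(user_id: str, team_id: str, signature: str, secret: bytes) -> bool:
    """Accept the identity only if the signature matches (constant-time compare)."""
    expected = sign_identity(user_id, team_id, secret)
    return hmac.compare_digest(expected, signature)


SECRET = b"demo-secret-do-not-use"  # hypothetical shared secret
sig = sign_identity("alice@example.com", "engineering", SECRET)

print(verify_identity("alice@example.com", "engineering", sig, SECRET))  # True
print(verify_identity("bob@example.com", "engineering", sig, SECRET))    # False
```

A caller who changes the user ID without knowing the secret cannot produce a matching signature, so the limit stays attached to the real identity. A production scheme would also need an expiry and replay protection, which this sketch omits.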
If you also want to align throttling with spend enforcement, connect the rate-limit rollout to Spend & Wallets. Rate limits keep traffic inside capacity. Wallets keep it inside budget. Those are complementary controls, not substitutes.
Results and impact
The first result is fairer capacity distribution. A single user can no longer saturate your upstream allowance just because they have a fast loop or a badly behaved client. A single team can no longer crowd out every other tenant on a shared gateway.
The second result is cleaner failure modes. Instead of random provider-side 429s, the gateway becomes the place where limits are enforced and explained. Clients can see Retry-After and rate-limit headers and back off in a predictable way.
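On the client side, predictable backoff might look like the sketch below. It assumes Retry-After carries a number of seconds and treats anything else as unknown. The `send` callable is a hypothetical stand-in for whatever HTTP client you use, returning a (status, headers, body) tuple.

```python
import time


def call_with_backoff(send, max_attempts=5, default_wait=1.0, sleep=time.sleep):
    """Retry on 429, waiting Retry-After seconds when the header is present."""
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts:
            break
        # Header names are case-insensitive, so normalize before lookup.
        lowered = {k.lower(): v for k, v in headers.items()}
        try:
            wait = float(lowered.get("retry-after", default_wait))
        except ValueError:
            wait = default_wait  # e.g. an HTTP-date value, not handled here
        sleep(wait)
    return 429, None


# Simulated gateway: one throttled response, then success.
responses = iter([(429, {"Retry-After": "2"}, None), (200, {}, "ok")])
waits = []
print(call_with_backoff(lambda: next(responses), sleep=waits.append))  # (200, 'ok')
print(waits)                                                           # [2.0]
```

Injecting `sleep` keeps the example fast to run and makes the wait times observable; in real use the default `time.sleep` applies.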
The third result is easier financial control when paired with wallets. A team can stay under its team-level RPM limit but still run out of budget, or it can stay under budget but spike request volume. That is normal. Capacity and cost are different axes, and Keeptrusts gives you controls for both.
Key takeaways
- Per-user limits protect fairness, per-team limits protect shared tenancy, and the global tier is the organization-wide ceiling.
- Strong identity matters: production per-user limits should use authenticated identity, not just free-form headers.
- Distributed rate limiting is necessary when several gateway instances share the same quota pool.
- Rate limits and wallets solve different problems and work best together.
- Enforcing limits at the gateway is cleaner than relying on provider-side throttling alone.