Rate Limiting AI Usage Per User, Team, and Organization

Keeptrusts rate-limits AI usage by applying independent quotas at multiple scopes. In practice, that means you can enforce per-user fairness, per-team isolation, and an organization-wide ceiling with the global tier. If you also need hard cost control, you pair those limits with wallets and budgets so traffic is constrained by both capacity and money.

Use this page when

You need to stop one user, one team, or one integration from consuming all available LLM capacity.
You want a clean explanation of how user, team, and organization-wide throttling fit together.
You are designing rate limits that work with spend controls instead of fighting them.

Primary audience

Primary: Technical Engineers
Secondary: Technical Leaders, AI Agents

The problem

Provider-side rate limits are not enough. If your upstream key allows a certain tokens-per-minute (TPM) or requests-per-minute (RPM) budget, the provider only tells you that you are over the line after the request has already reached its edge. That is late, noisy, and usually disconnected from your own identity model. It also gives you no clean way to tell whether the overage came from one user, one team, or everyone combined.

Inside an organization, those are different control problems. A per-user limit is about fairness and abuse prevention. A per-team limit is about tenant isolation and predictable service. An organization-wide limit is about protecting the total capacity you are willing to allocate to a gateway or upstream provider account.

The other challenge is identity. A user-scoped limit only means anything if the gateway can trust who the user is. The docs are explicit about this: header-only identity is easy to spoof, while API-token-bound identity or signed assertions provide a stronger basis for production enforcement.

Finally, rate limiting is not the same as cost control. RPM and TPM ceilings constrain traffic, but they do not replace wallets. If you only set quotas and ignore budgets, you can still burn money efficiently inside those allowed limits.

The solution

Keeptrusts uses a tiered model. The advanced rate-limiting docs describe the evaluation order as IP, per key, consumer group, per user, per team, and finally global. For the question in this page's title, the three tiers that matter are per_user, per_team, and global.

per_user protects against a single caller monopolizing resources. per_team protects a tenant or department boundary. global is the organization-wide hard ceiling across all traffic handled by the gateway. In other words, the “organization” layer in practical operations is the global tier.
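The way the three tiers combine can be sketched in a few lines of Python. This is an illustrative model only, not Keeptrusts code: the Tier class and the admit function are hypothetical names, and the sketch assumes a request is admitted only when every tier still has room.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    """Hypothetical counter for one scope (one user, one team, or global)."""
    name: str
    rpm: int       # requests allowed in the current window
    used: int = 0  # requests already counted in the current window

    def has_room(self) -> bool:
        return self.used < self.rpm


def admit(tiers):
    """Admit a request only if every tier has room.

    Returns (True, None) on success, or (False, name) where name is the
    first tier that rejected. Counters are only incremented when all
    tiers agree, so a rejected request consumes no quota anywhere.
    """
    for tier in tiers:
        if not tier.has_room():
            return False, tier.name
    for tier in tiers:
        tier.used += 1
    return True, None


# Tiers in the order the request passes through them.
user = Tier("per_user", rpm=2)
team = Tier("per_team", rpm=3)
org = Tier("global", rpm=10)

results = [admit([user, team, org]) for _ in range(3)]
```

In this toy run the third request is rejected by per_user even though the team and global tiers still have room, which is the fairness property the per-user tier exists to provide.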

That is only part of the picture. Consumer groups are useful when you want pooled limits for sets of API keys, and provider-level rate limits are useful when you want to cap consumption against a specific upstream target. Wallets and budgets then cover the spend side: reserves and settlements enforce money, while rate limits enforce capacity.

Implementation

For a first rollout, keep the example explicit: configure user, team, and global limits, choose a window strategy, and wire distributed coordination if you run multiple gateway instances.

rate_limits:
  per_user:
    rpm: 20
    tpm: 40000
    max_parallel_requests: 5

  per_team:
    rpm: 200
    tpm: 500000
    max_parallel_requests: 50

  global:
    rpm: 1000
    tpm: 2000000
    max_parallel_requests: 200

user_rate_limit:
  header: "X-User-Id"
  strategy: sliding_window
  window_seconds: 60

global_rate_limit:
  strategy: sliding_window
  window_seconds: 60
  reject_action: return_429

distributed_rate_limit:
  enabled: true
  redis_url_env: REDIS_URL
  key_prefix: "kt:rl:prod:"
  window_ms: 60000
  sync_interval_ms: 50
  local_fallback: true

providers:
  targets:
    - id: openai-primary
      provider: openai
      model: gpt-5.4-mini-mini
      secret_key_ref:
        env: OPENAI_API_KEY

policies:
  chain:
    - audit-logger

policy:
  audit-logger:
    retention_days: 90

That config answers three operational questions directly.

How much can one user do? per_user defines that. How much can one team do in aggregate? per_team defines that. What is the absolute top line for the whole gateway? global defines that.
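For intuition about what a sliding_window strategy with window_seconds: 60 means, here is a minimal Python sketch of one common approach, a sliding-window log. It is an assumption about the general technique, not a description of how Keeptrusts implements it, and the SlidingWindowLimiter class is a hypothetical name.

```python
from collections import deque


class SlidingWindowLimiter:
    """Sliding-window log: allow at most `limit` requests in any
    trailing `window_seconds` interval."""

    def __init__(self, limit, window_seconds=60.0):
        self.limit = limit
        self.window = window_seconds
        self.stamps = deque()  # timestamps of admitted requests, oldest first

    def allow(self, now):
        # Forget requests that have aged out of the trailing window.
        while self.stamps and self.stamps[0] <= now - self.window:
            self.stamps.popleft()
        if len(self.stamps) >= self.limit:
            return False
        self.stamps.append(now)
        return True


# Two requests per 60 seconds, with timestamps passed in explicitly.
limiter = SlidingWindowLimiter(limit=2, window_seconds=60.0)
decisions = [limiter.allow(t) for t in (0.0, 10.0, 20.0, 60.0, 61.0)]
```

Unlike a fixed window that resets on the minute boundary, the sliding window frees capacity only as individual requests age out, so a client cannot double its burst by straddling a reset.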

Run it with a distributed backend if you have more than one gateway instance:

export REDIS_URL="redis://:yourpassword@redis.internal:6379/0"
kt policy lint --file policy-config.yaml
kt gateway run --policy-config policy-config.yaml --listen 0.0.0.0:41002

Then send a request with user and team headers so the gateway can bucket the usage correctly:

curl -s -D- http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-User-Id: alice@example.com" \
  -H "X-Team-Id: engineering" \
  -d '{
    "model": "gpt-5.4-mini-mini",
    "messages": [
      {"role": "user", "content": "Summarize the deployment checklist in five bullets."}
    ]
  }' | grep -E "^(HTTP|X-RateLimit|Retry-After)"

In production, use the stronger identity modes described in the advanced rate-limiting docs. Header-only identity is fine for trusted internal development environments, but auditable per-user throttling should use API-token-bound identity or signed assertions. Otherwise users can simply rotate or spoof header values and defeat the point of the limit.
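To see why a signed assertion resists spoofing where a bare header does not, consider the sketch below. The format is an assumption made for illustration (an HMAC-SHA256 over the user ID with a shared secret); the actual assertion format Keeptrusts accepts is defined in the advanced rate-limiting docs, and sign_user and verified_user are hypothetical helpers.

```python
import hashlib
import hmac


def sign_user(user_id: str, secret: bytes) -> str:
    """Issuer side: produce a hex signature binding the user ID to the secret."""
    return hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def verified_user(user_id: str, signature: str, secret: bytes):
    """Gateway side: return the user ID only if the signature matches.

    compare_digest runs in constant time, which avoids leaking how many
    leading characters of a forged signature were correct.
    """
    expected = sign_user(user_id, secret)
    if hmac.compare_digest(expected.encode("ascii"), signature.encode("utf-8")):
        return user_id
    return None


secret = b"demo-secret-for-illustration-only"
alice_sig = sign_user("alice@example.com", secret)

genuine = verified_user("alice@example.com", alice_sig, secret)
spoofed = verified_user("bob@example.com", alice_sig, secret)
```

A caller who changes the user ID without knowing the secret cannot produce a matching signature, so the quota stays attached to the real identity.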

If you also want to align throttling with spend enforcement, connect the rate-limit rollout to Spend & Wallets. Rate limits keep traffic inside capacity. Wallets keep it inside budget. Those are complementary controls, not substitutes.

Results and impact

The first result is fairer capacity distribution. A single user can no longer saturate your upstream allowance just because they have a fast loop or a badly behaved client. A single team can no longer crowd out every other tenant on a shared gateway.

The second result is cleaner failure modes. Instead of random provider-side 429s, the gateway becomes the place where limits are enforced and explained. Clients can see Retry-After and rate-limit headers and back off in a predictable way.
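On the client side, predictable backoff can be sketched as follows. This is a minimal example under stated assumptions: send is a hypothetical callable returning (status, headers, body), headers is a plain dict with the exact key "Retry-After", and the header carries a number of seconds. When the header is missing or not numeric (HTTP also permits a date there), the sketch falls back to exponential backoff.

```python
import time


def call_with_backoff(send, max_attempts=4, default_wait=1.0, sleep=time.sleep):
    """Call send() until it returns something other than 429.

    Honors a numeric Retry-After header when present; otherwise waits
    default_wait * 2**attempt seconds. Returns (status, body), or
    (429, None) if every attempt was rate limited.
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, body
        if attempt == max_attempts - 1:
            break  # no point sleeping after the final attempt
        try:
            wait = float(headers.get("Retry-After"))
        except (TypeError, ValueError):
            wait = default_wait * (2 ** attempt)
        sleep(max(wait, 0.0))
    return 429, None


# Simulated gateway: two 429s, then success. Sleeps are recorded, not performed.
responses = iter([
    (429, {"Retry-After": "2"}, None),
    (429, {}, None),
    (200, {}, "ok"),
])
waits = []
outcome = call_with_backoff(lambda: next(responses), sleep=waits.append)
```

Injecting sleep as a parameter keeps the retry policy testable without real delays.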

The third result is easier financial control when paired with wallets. A team can stay under its team-level RPM limit but still run out of budget, or it can stay under budget but spike request volume. That is normal. Capacity and cost are different axes, and Keeptrusts gives you controls for both.
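The two axes can be modeled as two independent checks. The sketch below is purely illustrative: TeamState and decide are hypothetical names, the cost estimate is assumed to be known up front, and real wallets work through reserves and settlements rather than a single subtraction.

```python
from dataclasses import dataclass


@dataclass
class TeamState:
    requests_this_minute: int  # capacity axis
    rpm_limit: int
    wallet_balance: float      # cost axis, in the wallet's own currency unit


def decide(team: TeamState, estimated_cost: float) -> str:
    """Apply the capacity check and the budget check independently."""
    if team.requests_this_minute >= team.rpm_limit:
        return "reject: rate limit"
    if estimated_cost > team.wallet_balance:
        return "reject: budget"
    # Both checks passed: count the request and charge the wallet.
    team.requests_this_minute += 1
    team.wallet_balance -= estimated_cost
    return "allow"


# Well under the RPM limit, but the wallet is nearly empty.
low_budget = decide(TeamState(5, 200, 0.01), estimated_cost=0.05)

# Plenty of budget, but the request count has hit the limit.
at_capacity = decide(TeamState(200, 200, 100.0), estimated_cost=0.05)
```

Either check can reject on its own, which is why neither control can stand in for the other.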

Key takeaways

Per-user limits protect fairness, per-team limits protect shared tenancy, and the global tier is the organization-wide ceiling.
Strong identity matters: production per-user limits should use authenticated identity, not just free-form headers.
Distributed rate limiting is necessary when several gateway instances share the same quota pool.
Rate limits and wallets solve different problems and work best together.
Enforcing limits at the gateway is cleaner than relying on provider-side throttling alone.
