Prevent Runaway AI Costs with Smart Rate Limiting

A single developer running a buggy loop, an agent without session limits, or a sudden spike in customer traffic can burn through your entire AI budget in hours. Keeptrusts enforces rate limits, per-user quotas, and wallet reserves at the gateway, stopping runaway spending before it accumulates.

Use this page when

You need to prevent runaway AI costs from buggy loops, uncontrolled agents, or traffic spikes.
You are configuring per-user, per-team, or global rate limits at the gateway.
You want wallet reserves and cost tickets to control spending without hard-blocking all requests.

Primary audience

Primary: Technical Leaders
Secondary: Technical Engineers, AI Agents

What you'll achieve

Per-user and per-team rate limits that prevent any single actor from monopolizing resources
Burst control that handles traffic spikes without blocking legitimate requests
Wallet reserves that pause requests when budgets are exhausted
Cost tickets that queue over-budget requests for admin approval
Real-time spend alerting so you see cost anomalies before they become emergencies

Rate limiting: control request volume

Per-user rate limits

Prevent any individual user from sending too many requests:

policies:
  chain:
    - agent-firewall
    - audit-logger

policy:
  agent-firewall:
    rate_limit:
      per_user:
        requests_per_minute: 60
        requests_per_hour: 500
        requests_per_day: 5000
      on_exceed: block
      retry_after_seconds: 30

When a user exceeds their rate limit, the gateway returns a 429 Too Many Requests response with a Retry-After header.
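To see how the three windows interact, here is a minimal Python sketch of a per-user limiter. It assumes each window is counted independently as a sliding log and that a request must fit within every configured window; the class name, the injectable clock, and the use of `retry_after_seconds` as the Retry-After value are illustrative assumptions, not Keeptrusts internals.

```python
import time
from collections import defaultdict, deque

# Window lengths in seconds, keyed by the config field names used above.
WINDOWS = {
    "requests_per_minute": 60,
    "requests_per_hour": 3600,
    "requests_per_day": 86400,
}


class PerUserLimiter:
    """Hypothetical sketch: a request is allowed only if every configured window has room."""

    def __init__(self, limits, retry_after_seconds=30, clock=time.monotonic):
        self.limits = limits  # e.g. {"requests_per_minute": 60, "requests_per_hour": 500}
        self.retry_after = retry_after_seconds
        self.clock = clock
        self.history = defaultdict(deque)  # user_id -> timestamps of accepted requests

    def check(self, user_id):
        now = self.clock()
        log = self.history[user_id]
        longest = max(WINDOWS[name] for name in self.limits)
        # Timestamps older than the longest window can no longer count; drop them.
        while log and now - log[0] >= longest:
            log.popleft()
        for name, limit in self.limits.items():
            recent = sum(1 for t in log if now - t < WINDOWS[name])
            if recent >= limit:
                # Rejected requests are not recorded, so they use no quota.
                return 429, {"Retry-After": str(self.retry_after)}
        log.append(now)
        return 200, {}
```

A user who stays under the per-minute limit can still be blocked by the hourly or daily limit, which is why the longer windows act as a cap on sustained usage.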

Per-team rate limits

Set aggregate limits for an entire team:

pack:
  name: rate-limiting-cost-control-example-2
  version: 1.0.0
  enabled: true
policies:
  chain:
  - agent-firewall
policy:
  agent-firewall:
    rate_limit:
      per_team:
        requests_per_minute: 300
        requests_per_hour: 5000
      per_user:
        requests_per_minute: 60
        requests_per_hour: 500
      on_exceed: block

Team limits prevent a single team's traffic from crowding out other teams, even if individual user limits aren't reached.
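A minimal sketch of how the two scopes combine, assuming fixed one-minute windows and that a request is admitted only when both the user and the team still have room. The class and its parameters are hypothetical:

```python
import time
from collections import defaultdict


class TeamAwareLimiter:
    """Hypothetical fixed-window sketch: a request must fit both the user and the team budget."""

    def __init__(self, user_rpm, team_rpm, clock=time.monotonic):
        self.user_rpm = user_rpm
        self.team_rpm = team_rpm
        self.clock = clock
        # (id, minute_index) -> count; old windows are never evicted in this sketch.
        self.user_counts = defaultdict(int)
        self.team_counts = defaultdict(int)

    def allow(self, user_id, team_id):
        minute = int(self.clock() // 60)
        user_key, team_key = (user_id, minute), (team_id, minute)
        # Check both scopes before counting, so a rejected request uses no quota.
        if self.user_counts[user_key] >= self.user_rpm:
            return False
        if self.team_counts[team_key] >= self.team_rpm:
            return False
        self.user_counts[user_key] += 1
        self.team_counts[team_key] += 1
        return True
```

With `user_rpm=2` and `team_rpm=3`, a second team member can be rejected after a single request because a teammate already consumed most of the team's budget.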

Global rate limits

Set a ceiling for your entire organization:

pack:
  name: rate-limiting-cost-control-example-3
  version: 1.0.0
  enabled: true
policies:
  chain:
  - agent-firewall
policy:
  agent-firewall:
    rate_limit:
      global:
        requests_per_minute: 1000
        requests_per_hour: 20000
      on_exceed: queue

With on_exceed: queue, requests that hit the global limit are queued and retried rather than rejected — smoothing traffic spikes without errors.
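A single-threaded sketch of queue semantics, assuming a fixed one-minute global window. The class name and the injectable clock and sleep functions are hypothetical. Instead of returning 429, a caller that hits the limit waits for the next window:

```python
import time


class QueueingGlobalLimiter:
    """Hypothetical sketch of on_exceed: queue. Callers wait instead of failing."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self.rpm = rpm
        self.clock = clock
        self.sleep = sleep
        self.window_start = None
        self.count = 0

    def acquire(self):
        while True:
            now = self.clock()
            if self.window_start is None or now - self.window_start >= 60:
                # Start a fresh 60-second window.
                self.window_start = now
                self.count = 0
            if self.count < self.rpm:
                self.count += 1
                return
            # Limit reached: wait until the current window ends, then try again.
            self.sleep(self.window_start + 60 - now)
```

The trade-off is latency: in this sketch a queued caller can wait up to a full window before it is forwarded, so queueing suits spiky batch traffic better than interactive requests.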

Burst control: handle spikes gracefully

Token bucket rate limiting allows short bursts while maintaining long-term limits:

pack:
  name: rate-limiting-cost-control-example-4
  version: 1.0.0
  enabled: true
policies:
  chain:
  - agent-firewall
policy:
  agent-firewall:
    rate_limit:
      per_user:
        requests_per_minute: 60
        burst_size: 20
      on_exceed: block

| Parameter | Behavior |
|---|---|
| `requests_per_minute` | Sustained rate: tokens replenish at this rate |
| `burst_size` | Maximum requests allowed in a single burst |
| `on_exceed: block` | Reject requests beyond the burst capacity |
| `on_exceed: queue` | Queue excess requests for retry |

A user with requests_per_minute: 60 and burst_size: 20 can send 20 requests immediately, then 1 per second after that. This handles natural usage patterns (batch job startup, page load with multiple requests) without triggering rate limit errors.
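The behavior described above is a classic token bucket. A minimal sketch using the same numbers, with an injectable clock so it can be tested without waiting (the class name is hypothetical, and the gateway's own implementation may differ):

```python
import time


class TokenBucket:
    """Holds up to burst_size tokens, refilled at requests_per_minute / 60 tokens per second."""

    def __init__(self, requests_per_minute, burst_size, clock=time.monotonic):
        self.rate = requests_per_minute / 60.0  # tokens added per second
        self.capacity = float(burst_size)
        self.tokens = float(burst_size)  # start full so an initial burst is allowed
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        now = self.clock()
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # with on_exceed: block, this is where a 429 would be returned
```

After an idle period the bucket refills to `burst_size` and no further, so a user cannot bank unused capacity into an arbitrarily large burst.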

Wallet reserves: enforce hard budgets

Rate limits control request volume. Wallets control cost. Together, they prevent both volume-based and cost-based runaway spending.

How wallet reserves work

A request arrives at the gateway
The gateway estimates the request cost based on model pricing and token count
The estimated cost is reserved against the user's effective wallet (user → team → org cascade)
The request is forwarded to the provider
On response, the reservation is settled to the actual cost
The wallet balance is updated with the actual charge

# Set a team wallet to $500
curl -X POST https://api.keeptrusts.com/v1/wallets/allocate \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "engineering-team-id",
    "amount": 500.00,
    "currency": "USD"
  }'
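The reserve-and-settle steps listed above can be sketched as follows. This assumes the reservation lands on the first wallet in the user → team → org cascade with enough available balance; the `Wallet` class and the function names are hypothetical, and `Decimal` avoids floating-point rounding on currency.

```python
from decimal import Decimal


class Wallet:
    """Hypothetical wallet: balance minus outstanding reservations is what can be spent."""

    def __init__(self, name, balance):
        self.name = name
        self.balance = Decimal(balance)
        self.reserved = Decimal("0")

    @property
    def available(self):
        return self.balance - self.reserved


def reserve(cascade, estimate):
    """Hold `estimate` on the first wallet (user -> team -> org) that can cover it."""
    for wallet in cascade:
        if wallet.available >= estimate:
            wallet.reserved += estimate
            return wallet
    return None  # no wallet can cover it: this is where a cost ticket would be created


def settle(wallet, estimate, actual):
    """Release the hold and charge what the provider call actually cost."""
    wallet.reserved -= estimate
    wallet.balance -= actual
```

Reserving before forwarding means concurrent requests cannot all pass a balance check against the same funds; settling afterwards corrects the estimate to the real charge.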

When the wallet runs dry

When no wallet in the cascade has sufficient balance:

A cost ticket is created with the estimated cost
The request is queued — not rejected
An admin is notified of the pending cost ticket
The admin can approve (replenish the wallet) or deny (reject the request)
No charges are incurred until the ticket is resolved

# Check team wallet balance
curl -X GET https://api.keeptrusts.com/v1/wallets/balance \
  -H "Authorization: Bearer $API_TOKEN" \
  -G -d "team_id=engineering-team-id"
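The cost-ticket flow above can be modeled as a small state machine. This is a hypothetical sketch: the `CostTicket` type, its statuses, and the `notify` callback stand in for whatever the gateway and console actually use.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum


class TicketStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"


@dataclass
class CostTicket:
    """Hypothetical model of a cost ticket for one queued request."""

    request_id: str
    estimated_cost: Decimal
    status: TicketStatus = TicketStatus.PENDING

    def resolve(self, approve: bool) -> TicketStatus:
        # A ticket is resolved exactly once: approved (wallet replenished) or denied.
        if self.status is not TicketStatus.PENDING:
            raise ValueError(f"ticket {self.request_id} is already {self.status.value}")
        self.status = TicketStatus.APPROVED if approve else TicketStatus.DENIED
        return self.status


def on_insufficient_balance(request_id, estimate, queue, notify):
    """Create a ticket, park the request, and alert an admin; nothing is charged yet."""
    ticket = CostTicket(request_id, estimate)
    queue.append(ticket)
    notify(f"Cost ticket for {request_id}: estimated ${estimate}")
    return ticket
```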

Combining rate limits and wallets

The most effective cost control combines both:

pack:
  name: cost-controlled-gateway
  version: '1.0'
policies:
  chain:
  - rbac
  - agent-firewall
  - audit-logger
policy:
  rbac:
    require_auth: true
    deny_if_missing:
    - role
    - team
  agent-firewall:
    rate_limit:
      per_user:
        requests_per_minute: 60
        requests_per_hour: 500
        burst_size: 15
      per_team:
        requests_per_minute: 300
        requests_per_hour: 5000
      on_exceed: block
    max_actions_per_session: 50
    max_cost_per_session_usd: 10.0
  audit-logger:
    retention_days: 90
providers:
  targets:
  - id: openai-gpt4o
    provider: openai
    model: gpt-4o
    secret_key_ref:
      env: OPENAI_API_KEY

This configuration enforces:

Volume limits: per-user and per-team request caps with burst allowance
Session limits: max 50 actions and $10 per agent session
Budget limits: wallet reserves with cost tickets on exhaustion (wallet balances are allocated through the Wallets API, not in this file)
Identity requirement: only authenticated requests that carry both a role and a team are accepted
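The session limits in this configuration can be pictured as a per-session guard. A minimal sketch, assuming both ceilings are checked against running totals before each action and that reaching exactly $10.00 is still allowed; the `SessionGuard` name and its fields are hypothetical:

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class SessionGuard:
    """Hypothetical tracker for max_actions_per_session and max_cost_per_session_usd."""

    max_actions: int = 50
    max_cost_usd: Decimal = Decimal("10.0")
    actions: int = 0
    cost_usd: Decimal = Decimal("0")

    def admit(self, estimated_cost_usd: Decimal) -> bool:
        # Reject the action if it would push the session past either ceiling.
        if self.actions + 1 > self.max_actions:
            return False
        if self.cost_usd + estimated_cost_usd > self.max_cost_usd:
            return False
        self.actions += 1
        self.cost_usd += estimated_cost_usd
        return True
```

An agent stuck in a loop of cheap calls hits the action cap, while one making a few expensive calls hits the cost cap; either way the session stops instead of running indefinitely.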

Monitoring rate limit and cost events

Track rate limit and cost events in the console, or query them from the command line with the kt CLI:

# List rate-limited requests
kt events list \
  --filter "action:rate_limited" \
  --from "2025-04-15" \
  --limit 50

# List cost ticket events
kt events list \
  --filter "cost_ticket" \
  --from "2025-04-15" \
  --limit 20

| Event type | What it tells you |
|---|---|
| Rate-limited (429) | Which users or teams are hitting limits most often |
| Cost ticket created | Which teams are exhausting budgets |
| Cost ticket resolved | How long it takes to replenish budgets |
| Wallet balance alerts | When balances fall below warning thresholds |

Quick wins

Set per-user rate limits — prevent any single user from monopolizing gateway resources
Allocate a wallet to your highest-spending team — cap exposure immediately
Add max_cost_per_session_usd to agent configs — prevent runaway agent loops
Review rate-limited events — identify users or applications that need higher limits or optimization

For AI systems

Canonical terms: rate limiting, per-user quota, per-team quota, burst control, wallet reserve, cost ticket, agent-firewall rate_limit.
Config keys: policy.agent-firewall.rate_limit.per_user, policy.agent-firewall.rate_limit.per_team, policy.agent-firewall.rate_limit.global, on_exceed (block/queue).
Wallet API: POST /v1/wallets/allocate, GET /v1/wallets/balance.
Best next pages: Reduce AI Spend, Govern AI Agents, Wallets, Team-Based Governance.

For engineers

Prerequisites: gateway running with agent-firewall in the policy chain.
Set rate_limit.per_user.requests_per_minute and per_team limits in the agent-firewall config.
Choose on_exceed: block (returns 429) or on_exceed: queue (smooths spikes without errors).
Configure wallet allocations via POST /v1/wallets/allocate with team_id and amount.
Validate: exceed a rate limit and confirm the gateway returns 429 with a Retry-After header.
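For the validation step, a small script can send requests until the gateway answers 429 and then report the Retry-After header. In this sketch the gateway URL, token, and request payload passed to the hypothetical `make_sender` helper are placeholders you supply; `find_rate_limit` takes any sender, so it can also be exercised without a network.

```python
import json
import urllib.error
import urllib.request


def make_sender(url, token, payload):
    """Build a zero-argument function that POSTs `payload` and returns (status, headers)."""
    body = json.dumps(payload).encode("utf-8")

    def send():
        request = urllib.request.Request(
            url,
            data=body,
            method="POST",
            headers={
                "Authorization": f"Bearer {token}",
                "Content-Type": "application/json",
            },
        )
        try:
            with urllib.request.urlopen(request) as response:
                return response.status, response.headers
        except urllib.error.HTTPError as err:
            # 4xx/5xx responses, including 429, arrive as HTTPError.
            return err.code, err.headers

    return send


def find_rate_limit(send, max_requests=200):
    """Call `send` until it returns 429; return (attempt number, Retry-After value)."""
    for attempt in range(1, max_requests + 1):
        status, headers = send()
        if status == 429:
            return attempt, headers.get("Retry-After")
    return None, None  # limit never reached within max_requests
```

Compare the returned attempt count with your configured `requests_per_minute` and `burst_size`, and treat a `None` Retry-After on a 429 as a failed check.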

For leaders

Without rate limits, a single runaway agent or buggy loop can burn through an entire monthly AI budget in hours.
Per-team limits prevent any one department from monopolizing shared AI resources.
Wallet reserves enforce hard budget caps — when exhausted, requests are paused or queued for admin approval.
Cost tickets provide a controlled escalation path instead of hard-blocking production traffic.

Next steps

Reduce AI Spend by 40% — combine rate limiting with provider routing and caching
Wallets — full wallet cascade, top-up, and cost ticket documentation
Team-Based Governance — per-team budget allocation and controls
Cost and Spend — real-time spend tracking dashboards
Govern AI Agents — agent-specific session and cost limits
