Prevent Runaway AI Costs with Smart Rate Limiting
A single developer running a buggy loop, an agent without session limits, or a sudden spike in customer traffic can burn through your entire AI budget in hours. Keeptrusts enforces rate limits, per-user quotas, and wallet reserves at the gateway — stopping runaway costs before they happen.
Use this page when
- You need to prevent runaway AI costs from buggy loops, uncontrolled agents, or traffic spikes.
- You are configuring per-user, per-team, or global rate limits at the gateway.
- You want wallet reserves and cost tickets to control spending without hard-blocking all requests.
Primary audience
- Primary: Technical Leaders
- Secondary: Technical Engineers, AI Agents
What you'll achieve
- Per-user and per-team rate limits that prevent any single actor from monopolizing resources
- Burst control that handles traffic spikes without blocking legitimate requests
- Wallet reserves that pause requests when budgets are exhausted
- Cost tickets that queue over-budget requests for admin approval
- Real-time spend alerting so you see cost anomalies before they become emergencies
Rate limiting: control request volume
Per-user rate limits
Prevent any individual user from sending too many requests:
```yaml
policies:
  chain:
    - agent-firewall
    - audit-logger
policy:
  agent-firewall:
    rate_limit:
      per_user:
        requests_per_minute: 60
        requests_per_hour: 500
        requests_per_day: 5000
      on_exceed: block
      retry_after_seconds: 30
```
When a user exceeds their rate limit, the gateway returns a 429 Too Many Requests response with a Retry-After header.
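On the client side, a 429 means "wait", not "retry immediately". Below is a minimal Python sketch of a retry loop that honors Retry-After. The send callable and the (status, headers, body) tuple are hypothetical stand-ins for your HTTP client. The sketch assumes Retry-After carries a number of seconds, as with retry_after_seconds above.

```python
import time
from typing import Callable, Mapping, Tuple

# Hypothetical response shape: (status_code, headers, body).
Response = Tuple[int, Mapping[str, str], str]


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 5,
    default_wait: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send(); on 429, wait for Retry-After seconds and try again."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        # Return on success, on any non-429 error, or when attempts run out.
        if status != 429 or attempt == max_attempts:
            return status, headers, body
        # Case-insensitive header lookup; assume delta-seconds such as "30"
        # and fall back to default_wait if the header is missing or malformed.
        lowered = {k.lower(): v for k, v in headers.items()}
        raw = lowered.get("retry-after")
        try:
            wait = float(raw) if raw is not None else default_wait
        except ValueError:
            wait = default_wait
        sleep(wait)
    raise AssertionError("unreachable")
```

A production client would usually also cap the total wait time and add jitter. Both are omitted here to keep the sketch short.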
Per-team rate limits
Set aggregate limits for an entire team:
```yaml
policy:
  agent-firewall:
    rate_limit:
      per_team:
        requests_per_minute: 300
        requests_per_hour: 5000
      per_user:
        requests_per_minute: 60
        requests_per_hour: 500
      on_exceed: block
pack:
  name: rate-limiting-cost-control-example-2
  version: 1.0.0
  enabled: true
policies:
  chain:
    - agent-firewall
```
Team limits prevent a single team's traffic from crowding out other teams, even if individual user limits aren't reached.
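The following minimal Python sketch shows how the two scopes interact: a request is admitted only if both the user's counters and the team's counters have room. It assumes fixed, clock-aligned windows, because this page does not describe the gateway's actual windowing algorithm. LayeredLimiter and WINDOW_SECONDS are hypothetical names.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Tuple

# Hypothetical mapping from config key to window length in seconds.
WINDOW_SECONDS = {
    "requests_per_minute": 60,
    "requests_per_hour": 3600,
    "requests_per_day": 86400,
}

Key = Tuple[str, str, str, int]  # (scope, id, limit name, window index)


class LayeredLimiter:
    """Admit a request only if every per-user and per-team limit has room."""

    def __init__(self, per_user: Dict[str, int], per_team: Dict[str, int]):
        self.per_user = per_user
        self.per_team = per_team
        self.counts: DefaultDict[Key, int] = defaultdict(int)

    def allow(self, user: str, team: str, now: float) -> bool:
        keys: List[Key] = []
        for scope, ident, limits in (
            ("user", user, self.per_user),
            ("team", team, self.per_team),
        ):
            for name, limit in limits.items():
                key = (scope, ident, name, int(now // WINDOW_SECONDS[name]))
                if self.counts[key] >= limit:
                    return False  # any exhausted limit rejects the request
                keys.append(key)
        for key in keys:  # count only requests that were admitted
            self.counts[key] += 1
        return True
```

In the usage below, a team cap of 3 per minute blocks a third user even though that user has sent nothing yet.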
Global rate limits
Set a ceiling for your entire organization:
```yaml
policy:
  agent-firewall:
    rate_limit:
      global:
        requests_per_minute: 1000
        requests_per_hour: 20000
      on_exceed: queue
pack:
  name: rate-limiting-cost-control-example-3
  version: 1.0.0
  enabled: true
policies:
  chain:
    - agent-firewall
```
With on_exceed: queue, requests that hit the global limit are queued and retried rather than rejected — smoothing traffic spikes without errors.
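A small simulation illustrates the difference between block and queue. This Python example makes several assumptions: fixed 60-second windows, and an unbounded queue in which each excess request waits for the next window with spare capacity. The gateway's real queue depth, timeout and ordering are not documented here, and admit is a hypothetical helper.

```python
from typing import Dict, List, Optional


def admit(arrivals: List[float], limit_per_minute: int, on_exceed: str) -> List[Optional[float]]:
    """Return the admission time for each arrival, or None if it is rejected.

    Uses fixed 60-second windows; arrivals must be sorted ascending.
    """
    used: Dict[int, int] = {}  # window index -> requests admitted in it
    result: List[Optional[float]] = []
    for t in arrivals:
        window = int(t // 60)
        if used.get(window, 0) < limit_per_minute:
            used[window] = used.get(window, 0) + 1
            result.append(t)
        elif on_exceed == "block":
            result.append(None)  # the caller would receive a 429
        else:
            # "queue": defer to the first later window with spare capacity.
            window += 1
            while used.get(window, 0) >= limit_per_minute:
                window += 1
            used[window] = used.get(window, 0) + 1
            result.append(window * 60.0)
    return result
```

With a limit of 2 and five requests arriving in the first few seconds, block rejects three of them. Queue admits them at the start of the next two windows.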
Burst control: handle spikes gracefully
Token bucket rate limiting allows short bursts while maintaining long-term limits:
```yaml
policy:
  agent-firewall:
    rate_limit:
      per_user:
        requests_per_minute: 60
        burst_size: 20
      on_exceed: block
pack:
  name: rate-limiting-cost-control-example-4
  version: 1.0.0
  enabled: true
policies:
  chain:
    - agent-firewall
```
| Parameter | Behavior |
|---|---|
| requests_per_minute | Sustained rate: tokens replenish at this rate |
| burst_size | Maximum requests allowed in a single burst |
| on_exceed: block | Reject requests beyond the burst capacity |
| on_exceed: queue | Queue excess requests for retry |
A user with requests_per_minute: 60 and burst_size: 20 can send 20 requests immediately, then 1 per second after that. This handles natural usage patterns (batch job startup, page load with multiple requests) without triggering rate limit errors.
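The arithmetic above follows the standard token bucket algorithm. Here is a minimal Python sketch of it. It assumes the bucket starts full and refills continuously at requests_per_minute / 60 tokens per second; the gateway may refill in discrete steps instead. TokenBucket is a hypothetical name.

```python
class TokenBucket:
    """Token bucket: holds up to burst_size tokens, refills at
    requests_per_minute / 60 tokens per second, one token per request."""

    def __init__(self, requests_per_minute: float, burst_size: int, now: float = 0.0):
        self.rate_per_second = requests_per_minute / 60.0
        self.capacity = float(burst_size)
        self.tokens = float(burst_size)  # start full so an initial burst is allowed
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_second)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With requests_per_minute=60 and burst_size=20, the first 20 calls at time 0 succeed and the 21st fails. After that, one call per elapsed second succeeds, which matches the example in the paragraph above.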
Wallet reserves: enforce hard budgets
Rate limits control request volume. Wallets control cost. Together, they prevent both volume-based and cost-based runaway spending.
How wallet reserves work
- A request arrives at the gateway
- The gateway estimates the request cost based on model pricing and token count
- The estimated cost is reserved against the user's effective wallet (user → team → org cascade)
- The request is forwarded to the provider
- On response, the reservation is settled to the actual cost
- The wallet balance is updated with the actual charge
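The reserve-then-settle steps can be sketched as an in-memory ledger. This Python example is illustrative only. WalletLedger and the wallet identifiers are hypothetical, and so is the selection rule: "hold the estimate on the first wallet in the cascade that can cover it" is an assumption, not the gateway's documented behavior. Decimal keeps the currency arithmetic exact.

```python
from decimal import Decimal
from typing import Dict, List, Optional, Tuple


class WalletLedger:
    """Hypothetical in-memory ledger illustrating reserve -> settle."""

    def __init__(self, balances: Dict[str, Decimal]):
        self.balance = dict(balances)  # settled balance per wallet
        self.held: Dict[str, Decimal] = {}  # open reservations per wallet
        self.reservations: Dict[int, Tuple[str, Decimal]] = {}
        self._next_id = 1

    def available(self, wallet: str) -> Decimal:
        return self.balance.get(wallet, Decimal("0")) - self.held.get(wallet, Decimal("0"))

    def reserve(self, cascade: List[str], estimate: Decimal) -> Optional[int]:
        # Walk user -> team -> org and hold the estimate on the first wallet
        # that can cover it. None means no wallet can (cost-ticket territory).
        for wallet in cascade:
            if self.available(wallet) >= estimate:
                self.held[wallet] = self.held.get(wallet, Decimal("0")) + estimate
                rid = self._next_id
                self._next_id += 1
                self.reservations[rid] = (wallet, estimate)
                return rid
        return None

    def settle(self, rid: int, actual: Decimal) -> str:
        # Release the hold and charge the actual cost reported on response.
        wallet, estimate = self.reservations.pop(rid)
        self.held[wallet] -= estimate
        self.balance[wallet] -= actual
        return wallet
```

In the usage below, the user wallet is too small for the estimate, so the reservation falls through to the team wallet. The team wallet is then charged the smaller actual cost.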
```bash
# Set a team wallet to $500
curl -X POST https://api.keeptrusts.com/v1/wallets/allocate \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "engineering-team-id",
    "amount": 500.00,
    "currency": "USD"
  }'
```
When the wallet runs dry
When no wallet in the cascade has sufficient balance:
- A cost ticket is created with the estimated cost
- The request is queued — not rejected
- An admin is notified of the pending cost ticket
- The admin can approve (replenish the wallet) or deny (reject the request)
- No charges are incurred until the ticket is resolved
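The cost-ticket flow above behaves like a small state machine, sketched here in Python. CostTicket, TicketStatus and the top-up semantics of approve are hypothetical. A ticket starts pending, and the wallet stays untouched while it waits. Exactly one of approve or deny can resolve it.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum
from typing import Dict


class TicketStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"


@dataclass
class CostTicket:
    """Hypothetical model of a cost ticket for one queued request."""

    team: str
    estimated_cost: Decimal
    status: TicketStatus = TicketStatus.PENDING

    def approve(self, wallets: Dict[str, Decimal], top_up: Decimal) -> bool:
        self._require_pending()
        # Approval replenishes the wallet; report whether it now covers the request.
        wallets[self.team] = wallets.get(self.team, Decimal("0")) + top_up
        self.status = TicketStatus.APPROVED
        return wallets[self.team] >= self.estimated_cost

    def deny(self) -> None:
        self._require_pending()
        # Denial rejects the queued request; nothing is charged.
        self.status = TicketStatus.DENIED

    def _require_pending(self) -> None:
        if self.status is not TicketStatus.PENDING:
            raise ValueError(f"ticket already {self.status.value}")
```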
```bash
# Check team wallet balance
curl -G https://api.keeptrusts.com/v1/wallets/balance \
  -H "Authorization: Bearer $API_TOKEN" \
  -d "team_id=engineering-team-id"
```
Combining rate limits and wallets
The most effective cost control combines both:
```yaml
pack:
  name: cost-controlled-gateway
  version: '1.0'
policies:
  chain:
    - rbac
    - agent-firewall
    - audit-logger
policy:
  rbac:
    require_auth: true
    deny_if_missing:
      - role
      - team
  agent-firewall:
    rate_limit:
      per_user:
        requests_per_minute: 60
        requests_per_hour: 500
        burst_size: 15
      per_team:
        requests_per_minute: 300
        requests_per_hour: 5000
      on_exceed: block
    max_actions_per_session: 50
    max_cost_per_session_usd: 10.0
  audit-logger:
    retention_days: 90
providers:
  targets:
    - id: openai-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
```
This configuration enforces:
- Volume limits: per-user and per-team request caps with burst allowance
- Session limits: max 50 actions and $10 per agent session
- Budget limits: wallet reserves (allocated through the wallet API rather than in this file) with cost tickets on exhaustion
- Identity requirement: only authenticated requests that carry a role and team are accepted
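The session limits in this configuration work like two counters per agent session. Here is a minimal Python sketch. It assumes an action is refused once it would push the session past either max_actions_per_session or max_cost_per_session_usd. The gateway's exact check, such as whether it uses estimated or settled cost, is not specified here, and SessionGuard is a hypothetical name.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class SessionGuard:
    """Per-session caps mirroring max_actions_per_session / max_cost_per_session_usd."""

    max_actions: int = 50
    max_cost_usd: Decimal = Decimal("10.0")
    actions: int = 0
    cost_usd: Decimal = Decimal("0")

    def try_action(self, estimated_cost_usd: Decimal) -> bool:
        # Refuse if this action would exceed either cap; otherwise record it.
        if self.actions + 1 > self.max_actions:
            return False
        if self.cost_usd + estimated_cost_usd > self.max_cost_usd:
            return False
        self.actions += 1
        self.cost_usd += estimated_cost_usd
        return True
```

A looping agent that spends $0.25 per action is stopped after 40 actions by the cost cap. A cheap agent is stopped after 50 actions by the action cap.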
Monitoring rate limit and cost events
Track rate limit and cost events in the console, or query them from the command line with the kt CLI:
```bash
# List rate-limited requests
kt events list \
  --filter "action:rate_limited" \
  --from "2025-04-15" \
  --limit 50

# List cost ticket events
kt events list \
  --filter "cost_ticket" \
  --from "2025-04-15" \
  --limit 20
```
| Event type | What it tells you |
|---|---|
| Rate-limited (429) | Which users or teams are hitting limits most often |
| Cost ticket created | Which teams are exhausting budgets |
| Cost ticket resolved | How long it takes to replenish budgets |
| Wallet balance alerts | When balances fall below warning thresholds |
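To answer the first question in the table (who hits limits most often) from exported events, a small aggregation is enough. This Python sketch assumes each event has already been parsed into a dict with an action field plus user and team fields. That record shape and the top_rate_limited helper are hypothetical; neither is the documented kt output format.

```python
from collections import Counter
from typing import Iterable, List, Mapping, Tuple


def top_rate_limited(
    events: Iterable[Mapping[str, str]], key: str = "user", n: int = 5
) -> List[Tuple[str, int]]:
    """Count rate-limited events per key ("user" or "team"), most frequent first."""
    counts = Counter(
        e[key] for e in events if e.get("action") == "rate_limited" and key in e
    )
    return counts.most_common(n)
```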
Quick wins
- Set per-user rate limits: prevent any single user from monopolizing gateway resources
- Allocate a wallet to your highest-spending team: cap exposure immediately
- Add max_cost_per_session_usd to agent configs: prevent runaway agent loops
- Review rate-limited events: identify users or applications that need higher limits or optimization
For AI systems
- Canonical terms: rate limiting, per-user quota, per-team quota, burst control, wallet reserve, cost ticket, agent-firewall rate_limit.
- Config keys: policy.agent-firewall.rate_limit.per_user, policy.agent-firewall.rate_limit.per_team, policy.agent-firewall.rate_limit.global, on_exceed (block / queue).
- Wallet API: POST /v1/wallets/allocate, GET /v1/wallets/balance.
- Best next pages: Reduce AI Spend, Govern AI Agents, Wallets, Team-Based Governance.
For engineers
- Prerequisites: gateway running with agent-firewall in the policy chain.
- Set rate_limit.per_user.requests_per_minute and per_team limits in the agent-firewall config.
- Choose on_exceed: block (returns 429) or on_exceed: queue (smooths spikes without errors).
- Configure wallet allocations via POST /v1/wallets/allocate with team_id and amount.
- Validate: exceed a rate limit and confirm the gateway returns 429 with a Retry-After header.
For leaders
- A single runaway agent or buggy loop can burn through an entire monthly AI budget in hours without rate limits.
- Per-team limits prevent any one department from monopolizing shared AI resources.
- Wallet reserves enforce hard budget caps — when exhausted, requests are paused or queued for admin approval.
- Cost tickets provide a controlled escalation path instead of hard-blocking production traffic.
Next steps
- Reduce AI Spend by 40% — combine rate limiting with provider routing and caching
- Wallets — full wallet cascade, top-up, and cost ticket documentation
- Team-Based Governance — per-team budget allocation and controls
- Cost and Spend — real-time spend tracking dashboards
- Govern AI Agents — agent-specific session and cost limits