Prevent Runaway AI Costs with Smart Rate Limiting

A single developer running a buggy loop, an agent without session limits, or a sudden spike in customer traffic can burn through your entire AI budget in hours. Keeptrusts enforces rate limits, per-user quotas, and wallet reserves at the gateway — stopping runaway costs before they happen.

Use this page when

  • You need to prevent runaway AI costs from buggy loops, uncontrolled agents, or traffic spikes.
  • You are configuring per-user, per-team, or global rate limits at the gateway.
  • You want wallet reserves and cost tickets to control spending without hard-blocking all requests.

Primary audience

  • Primary: Technical Leaders
  • Secondary: Technical Engineers, AI Agents

What you'll achieve

  • Per-user and per-team rate limits that prevent any single actor from monopolizing resources
  • Burst control that handles traffic spikes without blocking legitimate requests
  • Wallet reserves that pause requests when budgets are exhausted
  • Cost tickets that queue over-budget requests for admin approval
  • Real-time spend alerting so you see cost anomalies before they become emergencies

Rate limiting: control request volume

Per-user rate limits

Prevent any individual user from sending too many requests:

policies:
  chain:
    - agent-firewall
    - audit-logger

policy:
  agent-firewall:
    rate_limit:
      per_user:
        requests_per_minute: 60
        requests_per_hour: 500
        requests_per_day: 5000
      on_exceed: block
      retry_after_seconds: 30

When a user exceeds their rate limit, the gateway returns a 429 Too Many Requests response with a Retry-After header.
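
On the client side, a caller can honor that header instead of retrying immediately. The following is a minimal sketch, not part of any Keeptrusts SDK: `retry_delay` and `call_with_retry` are hypothetical helpers, the response is modeled as a plain `(status, headers)` tuple, and the 30-second fallback simply mirrors `retry_after_seconds` from the example above.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]  # (status code, headers); an illustrative shape


def retry_delay(headers: Mapping[str, str], default: float = 30.0) -> float:
    """Return the wait in seconds from a Retry-After header given as delta-seconds."""
    # Assumes the exact key "Retry-After"; real HTTP clients match headers case-insensitively.
    value = headers.get("Retry-After", "")
    try:
        return max(0.0, float(value))
    except ValueError:
        # Missing or non-numeric (for example an HTTP date): fall back to the default.
        return default


def call_with_retry(send: Callable[[], Response], max_attempts: int = 3,
                    sleep: Callable[[float], None] = time.sleep) -> Response:
    """Call send() and wait out 429 responses, up to max_attempts tries in total."""
    response = send()
    for _ in range(max_attempts - 1):
        status, headers = response
        if status != 429:
            break
        sleep(retry_delay(headers))
        response = send()
    return response
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.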

Per-team rate limits

Set aggregate limits for an entire team:

policy:
  agent-firewall:
    rate_limit:
      per_team:
        requests_per_minute: 300
        requests_per_hour: 5000
      per_user:
        requests_per_minute: 60
        requests_per_hour: 500
      on_exceed: block
pack:
  name: rate-limiting-cost-control-example-2
  version: 1.0.0
  enabled: true
policies:
  chain:
    - agent-firewall

Team limits prevent a single team's traffic from crowding out other teams, even if individual user limits aren't reached.
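
To see how the two levels interact, here is a rough sketch that counts requests within a single one-minute window. It illustrates the idea only and is not the gateway's implementation: `WindowCounter` and `admit` are hypothetical names, and the window reset is left out for brevity.

```python
from collections import defaultdict


class WindowCounter:
    """Counts requests per key within one fixed window (for example one minute)."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.counts = defaultdict(int)

    def has_room(self, key: str) -> bool:
        return self.counts[key] < self.limit

    def record(self, key: str) -> None:
        self.counts[key] += 1


def admit(user: str, team: str, per_user: WindowCounter, per_team: WindowCounter) -> bool:
    """Admit a request only if both the user and the team are under their limits."""
    if not (per_user.has_room(user) and per_team.has_room(team)):
        return False
    per_user.record(user)
    per_team.record(team)
    return True


# Limits from the example above: 60 per user and 300 per team, per minute.
per_user = WindowCounter(60)
per_team = WindowCounter(300)

# Six users on one team each send 55 requests in the same minute.
admitted = sum(
    admit(f"user-{u}", "engineering", per_user, per_team)
    for u in range(6)
    for _ in range(55)
)
print(admitted)  # 300: the team cap applies although no user reached 60
```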

Global rate limits

Set a ceiling for your entire organization:

policy:
  agent-firewall:
    rate_limit:
      global:
        requests_per_minute: 1000
        requests_per_hour: 20000
      on_exceed: queue
pack:
  name: rate-limiting-cost-control-example-3
  version: 1.0.0
  enabled: true
policies:
  chain:
    - agent-firewall

With on_exceed: queue, requests that hit the global limit are queued and retried rather than rejected — smoothing traffic spikes without errors.


Burst control: handle spikes gracefully

Token bucket rate limiting allows short bursts while maintaining long-term limits:

policy:
  agent-firewall:
    rate_limit:
      per_user:
        requests_per_minute: 60
        burst_size: 20
      on_exceed: block
pack:
  name: rate-limiting-cost-control-example-4
  version: 1.0.0
  enabled: true
policies:
  chain:
    - agent-firewall

| Parameter | Behavior |
|---|---|
| requests_per_minute | Sustained rate: tokens replenish at this rate |
| burst_size | Maximum requests allowed in a single burst |
| on_exceed: block | Reject requests beyond the burst capacity |
| on_exceed: queue | Queue excess requests for retry |

A user with requests_per_minute: 60 and burst_size: 20 can send 20 requests immediately, then 1 per second after that. This handles natural usage patterns (batch job startup, page load with multiple requests) without triggering rate limit errors.
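
The arithmetic can be checked with a small token bucket simulation. This is a generic textbook token bucket written for illustration, assuming the bucket starts full and that `burst_size` is its capacity; the `TokenBucket` class is hypothetical and the gateway's internal algorithm may differ in detail.

```python
class TokenBucket:
    """Generic token bucket: refills at a steady rate, holds at most burst_size tokens."""

    def __init__(self, requests_per_minute: float, burst_size: int) -> None:
        self.rate = requests_per_minute / 60.0  # tokens added per second
        self.capacity = float(burst_size)
        self.tokens = float(burst_size)         # assumption: the bucket starts full
        self.last = 0.0                         # time of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if one is available."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(requests_per_minute=60, burst_size=20)

# 25 requests arrive at the same instant: only the burst of 20 gets through.
burst = sum(bucket.allow(now=0.0) for _ in range(25))
print(burst)                  # 20

# Half a second later no full token has accumulated yet.
print(bucket.allow(now=0.5))  # False

# After a full second one token is available again.
print(bucket.allow(now=1.0))  # True
```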


Wallet reserves: enforce hard budgets

Rate limits control request volume. Wallets control cost. Together, they prevent both volume-based and cost-based runaway spending.

How wallet reserves work

  1. A request arrives at the gateway
  2. The gateway estimates the request cost based on model pricing and token count
  3. The estimated cost is reserved against the user's effective wallet (user → team → org cascade)
  4. The request is forwarded to the provider
  5. On response, the reservation is settled to the actual cost
  6. The wallet balance is updated with the actual charge

# Set a team wallet to $500
curl -X POST https://api.keeptrusts.com/v1/wallets/allocate \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "engineering-team-id",
    "amount": 500.00,
    "currency": "USD"
  }'
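
The reserve-and-settle steps listed above can be sketched in a few lines. This is an illustrative model only: the `Wallet` class and the `reserve` and `settle` functions are hypothetical, and the cascade is assumed to pick the first wallet (user, then team, then org) whose available balance covers the estimate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Wallet:
    name: str
    balance: float          # funds allocated to the wallet
    reserved: float = 0.0   # funds held for in-flight requests

    @property
    def available(self) -> float:
        return self.balance - self.reserved


def reserve(cascade: List[Wallet], estimate: float) -> Optional[Wallet]:
    """Hold the estimate on the first wallet in the cascade that can cover it."""
    for wallet in cascade:
        if wallet.available >= estimate:
            wallet.reserved += estimate
            return wallet
    return None  # no wallet in the cascade can cover the request


def settle(wallet: Wallet, estimate: float, actual: float) -> None:
    """Release the hold and charge the actual cost reported after the response."""
    wallet.reserved -= estimate
    wallet.balance -= actual


user = Wallet("user", balance=0.0)
team = Wallet("team", balance=500.0)
org = Wallet("org", balance=10_000.0)

# The user wallet is empty, so the reservation falls through to the team wallet.
chosen = reserve([user, team, org], estimate=0.5)
assert chosen is not None
settle(chosen, estimate=0.5, actual=0.25)
print(chosen.name, chosen.balance, chosen.reserved)  # team 499.75 0.0
```

Reserving before forwarding means concurrent requests cannot jointly overspend a wallet while their real costs are still unknown.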

When the wallet runs dry

When no wallet in the cascade has sufficient balance:

  1. A cost ticket is created with the estimated cost
  2. The request is queued — not rejected
  3. An admin is notified of the pending cost ticket
  4. The admin can approve (replenish the wallet) or deny (reject the request)
  5. No charges are incurred until the ticket is resolved

# Check team wallet balance
curl -X GET https://api.keeptrusts.com/v1/wallets/balance \
  -H "Authorization: Bearer $API_TOKEN" \
  -G -d "team_id=engineering-team-id"
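
The cost ticket flow described in this section can be modeled as a small state machine. The sketch below is a simplification under stated assumptions: `CostTicket` and `TicketQueue` are hypothetical names, a single wallet stands in for the whole cascade, and requests are charged at their estimate rather than reserved and settled.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CostTicket:
    request_id: str
    estimated_cost: float
    status: str = "pending"  # pending, then approved or denied


@dataclass
class TicketQueue:
    balance: float
    tickets: Dict[str, CostTicket] = field(default_factory=dict)
    forwarded: List[str] = field(default_factory=list)

    def submit(self, request_id: str, estimated_cost: float) -> str:
        """Forward the request if the wallet covers it, otherwise open a cost ticket."""
        if self.balance >= estimated_cost:
            self.balance -= estimated_cost
            self.forwarded.append(request_id)
            return "forwarded"
        self.tickets[request_id] = CostTicket(request_id, estimated_cost)
        return "queued"  # queued, not rejected, and nothing is charged yet

    def approve(self, request_id: str, top_up: float) -> str:
        """Admin approves: replenish the wallet, then release the queued request."""
        ticket = self.tickets[request_id]
        self.balance += top_up
        if self.balance < ticket.estimated_cost:
            return "queued"  # top-up too small, the ticket stays pending
        self.balance -= ticket.estimated_cost
        ticket.status = "approved"
        self.forwarded.append(request_id)
        return "forwarded"

    def deny(self, request_id: str) -> str:
        """Admin denies: the request is rejected and the wallet is untouched."""
        self.tickets[request_id].status = "denied"
        return "rejected"


queue = TicketQueue(balance=1.0)
print(queue.submit("req-1", 0.75))            # forwarded
print(queue.submit("req-2", 0.75))            # queued
print(queue.approve("req-2", top_up=100.0))   # forwarded
print(queue.submit("req-3", 500.0))           # queued
print(queue.deny("req-3"))                    # rejected
print(queue.balance)                          # 99.5
```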

Combining rate limits and wallets

The most effective cost control combines both:

pack:
  name: cost-controlled-gateway
  version: '1.0'
policies:
  chain:
    - rbac
    - agent-firewall
    - audit-logger
policy:
  rbac:
    require_auth: true
    deny_if_missing:
      - role
      - team
  agent-firewall:
    rate_limit:
      per_user:
        requests_per_minute: 60
        requests_per_hour: 500
        burst_size: 15
      per_team:
        requests_per_minute: 300
        requests_per_hour: 5000
      on_exceed: block
    max_actions_per_session: 50
    max_cost_per_session_usd: 10.0
  audit-logger:
    retention_days: 90
providers:
  targets:
    - id: openai-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY

Combined with a wallet allocation, this configuration enforces:

  • Volume limits — per-user and per-team request caps with burst allowance
  • Session limits — max 50 actions and $10 per agent session
  • Budget limits — wallet reserves with cost tickets on exhaustion
  • Identity requirement — only authenticated team members can send requests
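
The two session caps act independently, and whichever is reached first ends the session. A minimal sketch of that behavior, assuming the cap is checked before each action and using a hypothetical `SessionBudget` class rather than the gateway's actual logic:

```python
from dataclasses import dataclass


@dataclass
class SessionBudget:
    max_actions: int = 50        # mirrors max_actions_per_session
    max_cost_usd: float = 10.0   # mirrors max_cost_per_session_usd
    actions: int = 0
    cost_usd: float = 0.0

    def try_action(self, cost_usd: float) -> bool:
        """Record the action and return True only if it fits both session caps."""
        if self.actions + 1 > self.max_actions:
            return False
        if self.cost_usd + cost_usd > self.max_cost_usd:
            return False
        self.actions += 1
        self.cost_usd += cost_usd
        return True


# A looping agent that spends $0.50 per action is stopped by the cost cap.
session = SessionBudget()
completed = 0
while session.try_action(0.5):
    completed += 1
print(completed)  # 20

# Free actions run into the action cap instead.
cheap = SessionBudget()
print(sum(cheap.try_action(0.0) for _ in range(60)))  # 50
```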

Monitoring rate limit and cost events

Track rate limit and cost events with the kt CLI:

# List rate-limited requests
kt events list \
  --filter "action:rate_limited" \
  --from "2025-04-15" \
  --limit 50

# List cost ticket events
kt events list \
  --filter "cost_ticket" \
  --from "2025-04-15" \
  --limit 20

| Event type | What it tells you |
|---|---|
| Rate-limited (429) | Which users or teams are hitting limits most often |
| Cost ticket created | Which teams are exhausting budgets |
| Cost ticket resolved | How long it takes to replenish budgets |
| Wallet balance alerts | When balances fall below warning thresholds |

Quick wins

  1. Set per-user rate limits — prevent any single user from monopolizing gateway resources
  2. Allocate a wallet to your highest-spending team — cap exposure immediately
  3. Add max_cost_per_session_usd to agent configs — prevent runaway agent loops
  4. Review rate-limited events — identify users or applications that need higher limits or optimization

For AI systems

  • Canonical terms: rate limiting, per-user quota, per-team quota, burst control, wallet reserve, cost ticket, agent-firewall rate_limit.
  • Config keys: policy.agent-firewall.rate_limit.per_user, policy.agent-firewall.rate_limit.per_team, policy.agent-firewall.rate_limit.global, on_exceed (block/queue).
  • Wallet API: POST /v1/wallets/allocate, GET /v1/wallets/balance.
  • Best next pages: Reduce AI Spend, Govern AI Agents, Wallets, Team-Based Governance.

For engineers

  • Prerequisites: gateway running with agent-firewall in the policy chain.
  • Set rate_limit.per_user.requests_per_minute and per_team limits in the agent-firewall config.
  • Choose on_exceed: block (returns 429) or on_exceed: queue (smooths spikes without errors).
  • Configure wallet allocations via POST /v1/wallets/allocate with team_id and amount.
  • Validate: exceed a rate limit and confirm the gateway returns 429 with a Retry-After header.

For leaders

  • Without rate limits, a single runaway agent or buggy loop can burn through an entire monthly AI budget in hours.
  • Per-team limits prevent any one department from monopolizing shared AI resources.
  • Wallet reserves enforce hard budget caps: when a budget is exhausted, requests are paused or queued for admin approval.
  • Cost tickets provide a controlled escalation path instead of hard-blocking production traffic.
