Token Budget Enforcement: Preventing Cost Overruns Per Request

Keeptrusts prevents cost overruns per request by estimating the request cost before it leaves the gateway, reserving that amount against the effective wallet, and stopping the request if the budget is not available. In practical terms, token budget enforcement is not an after-the-fact dashboard; it is a runtime control that combines model pricing, team budgets, and gateway decisions before the provider bill is created.

Use this page when

You need to stop one request, one team, or one agent loop from burning through budget before anyone notices.
You want a practical explanation of how Keeptrusts wallets, model pricing, and budget alerts work together.
You are rolling out spend controls and need a concrete starting config instead of a finance-only reporting view.

Primary audience

Primary: Technical Engineers
Secondary: Technical Leaders, FinOps owners

The problem

Most AI cost problems start before the invoice but are only discovered after it. A team ships a feature against a shared provider key, prompt sizes drift upward, a new model is selected for quality reasons, or an agent gets stuck in a retry loop. By the time finance sees the bill, the expensive requests have already happened.

Per-request cost control is hard when the application does not know three things at decision time: which model price applies, which team budget should absorb the charge, and whether enough budget remains. Application code usually knows only the prompt and the provider endpoint. It does not know the governance rules around that request.

This gets worse in shared environments. One gateway may serve engineering, support, and marketing at the same time. Without a wallet or consumer-group boundary, all traffic looks like one blended stream, and an expensive request from one team draws down the same upstream balance that every other team depends on.

There is also a sequencing problem. If you try to do spend control after the provider call, you can report accurately, but you cannot prevent the charge. Runtime enforcement has to happen before the request is forwarded. That is the difference between visibility and control.

The solution

Keeptrusts splits the problem into four parts.

First, the platform needs model pricing. The gateway cannot reserve spend unless the control plane knows how to price a request for the selected provider and model.

Second, traffic needs an owner. In the documented budget workflow, consumer groups map incoming traffic to a team wallet through wallet_team_id. That is what turns an abstract request into an accountable request.

Third, the gateway needs permission to enforce. With cost_tracking.wallet_enforcement: true, Keeptrusts reserves the estimated cost before sending the request upstream, then settles to the actual cost once the provider responds. If balance is insufficient, the request is rejected before the charge occurs.

Fourth, you need operating thresholds, not just a hard stop. Budget alerts let you notify at 50%, 80%, or 95% so teams have time to act before the blocking threshold hits.
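
The reserve-then-settle step in the third part can be sketched as follows. This is a minimal in-memory illustration, not Keeptrusts code: the `Wallet` class, its method names, and the use of integer micro-dollars are all assumptions made for the example.

```python
from dataclasses import dataclass, field


class InsufficientBudget(Exception):
    """Raised when a reservation would exceed the available balance."""


@dataclass
class Wallet:
    # Hypothetical stand-in for a team wallet. Amounts are integer
    # micro-dollars so that repeated arithmetic does not drift.
    balance: int
    reserved: dict = field(default_factory=dict)

    def available(self) -> int:
        # Balance that is not already held by in-flight requests.
        return self.balance - sum(self.reserved.values())

    def reserve(self, request_id: str, estimated_cost: int) -> None:
        # Runs before the request is forwarded upstream.
        if estimated_cost > self.available():
            raise InsufficientBudget(request_id)
        self.reserved[request_id] = estimated_cost

    def settle(self, request_id: str, actual_cost: int) -> None:
        # Runs after the provider responds: drop the hold and charge
        # the actual cost instead of the estimate.
        self.reserved.pop(request_id)
        self.balance -= actual_cost
```

The property that matters is that `reserve` fails before any upstream call is made, so a rejected request never produces a provider charge.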

This matters because rate limits and budgets solve different problems. Rate limits stop volume spikes. Wallets stop overspend. In mature deployments you use both. A request can be small enough to pass a rate limit but still too expensive for the remaining wallet balance. Keeptrusts handles that with separate controls instead of forcing one system to do both jobs badly.
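
To make the distinction concrete, here is a small sketch of two independent admission checks. The function name, its parameters, and the order of the checks are hypothetical; the source does not describe how the gateway sequences them.

```python
def admit(requests_in_window: int, rate_limit: int,
          estimated_cost: int, available_budget: int) -> str:
    """Return an illustrative gateway decision for one request."""
    # Volume control: how many requests were seen in the current window.
    if requests_in_window >= rate_limit:
        return "reject: rate limit"
    # Spend control: whether the estimated cost fits the remaining wallet.
    if estimated_cost > available_budget:
        return "reject: budget"
    return "forward"
```

For example, `admit(3, 100, 2_000_000, 500_000)` is well under the rate limit but is still rejected on budget, which is the case a rate limiter alone would miss.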

Implementation

Start by loading model pricing and allocating a team wallet. Then bind gateway traffic to that wallet through a consumer group.

export KEEPTRUSTS_API_URL="http://localhost:41002"
export KEEPTRUSTS_API_TOKEN="your-admin-token"

curl -s -X POST "$KEEPTRUSTS_API_URL/v1/model-pricing" \
  -H "Authorization: Bearer $KEEPTRUSTS_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.4-mini",
    "provider": "openai",
    "input_cost_per_1k_tokens": 0.00015,
    "output_cost_per_1k_tokens": 0.0006
  }'

curl -s -X POST "$KEEPTRUSTS_API_URL/v1/wallets/allocate" \
  -H "Authorization: Bearer $KEEPTRUSTS_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "team_engineering",
    "amount": 500.00,
    "currency": "USD",
    "description": "Monthly LLM budget"
  }'
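
As a rough worked example, the per-1k-token rates loaded above can be turned into a request estimate like this. The formula and the choice to price the output side at a maximum output length are assumptions for illustration; the source does not state how Keeptrusts estimates output tokens before the response exists.

```python
from decimal import Decimal

# Rates copied from the model-pricing payload above (USD per 1,000 tokens).
INPUT_COST_PER_1K = Decimal("0.00015")
OUTPUT_COST_PER_1K = Decimal("0.0006")


def estimate_cost(input_tokens: int, max_output_tokens: int) -> Decimal:
    """Worst-case USD estimate for one request (hypothetical helper)."""
    input_cost = Decimal(input_tokens) * INPUT_COST_PER_1K
    output_cost = Decimal(max_output_tokens) * OUTPUT_COST_PER_1K
    return (input_cost + output_cost) / 1000


def requests_covered(budget_usd: Decimal, per_request: Decimal) -> int:
    """How many whole requests of this size a budget can absorb."""
    return int(budget_usd / per_request)
```

With these rates, a request with 2,000 input tokens and up to 500 output tokens is estimated at $0.0006, so a $500.00 allocation covers roughly 833,000 requests of that size.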

Then wire the gateway to use that budget.

providers:
  targets:
    - id: openai
      provider: openai
      secret_key_ref:
        env: OPENAI_API_KEY

consumer_groups:
  - name: engineering
    api_key: kt_cg_engineering_abc123
    wallet_team_id: team_engineering

cost_tracking:
  enabled: true
  wallet_enforcement: true
  budget_alerts:
    - threshold_percent: 50
      action: notify
    - threshold_percent: 80
      action: notify
    - threshold_percent: 95
      action: notify
    - threshold_percent: 100
      action: block

api:
  url: http://localhost:41002
  token_env: KEEPTRUSTS_API_TOKEN

Validate and run it the normal way:

kt policy lint --file policy-config.yaml
kt gateway run --policy-config policy-config.yaml --listen 0.0.0.0:41002
kt spend --team engineering

That is the minimum viable control loop. The gateway now knows what the request costs, which wallet should pay for it, and whether the team still has budget available.
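
One way to read the `budget_alerts` list is as a set of thresholds that fire when cumulative spend crosses them. The sketch below assumes that interpretation; the function name and the crossing semantics are illustrative and are not taken from the Keeptrusts implementation.

```python
# Mirrors the budget_alerts entries in the config above.
BUDGET_ALERTS = [(50, "notify"), (80, "notify"), (95, "notify"), (100, "block")]


def crossed_alerts(prev_spent: int, new_spent: int, budget: int,
                   alerts=BUDGET_ALERTS):
    """Return (threshold_percent, action) pairs crossed by a spend update.

    Amounts are integers (for example, cents) so the percentage
    comparison stays exact.
    """
    crossed = []
    for threshold, action in alerts:
        # Compare spent * 100 with threshold * budget to avoid division.
        limit = threshold * budget
        if prev_spent * 100 < limit <= new_spent * 100:
            crossed.append((threshold, action))
    return crossed
```

With a budget of 50,000 cents ($500.00), moving from 39,000 to 48,000 cents spent crosses the 80% and 95% thresholds in one step, so both notifications would fire before the 100% block is reached.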

Two details matter in production.

The first is identity. If every caller uses the same upstream key without consumer-group separation, you only get a blended budget. If you want spend accountability per team, or different limits for different apps, you need the consumer-group boundary described in Consumer Group Isolation.

The second is rollout order. Do not start with a single hard block at 100% and call the work done. Start with pricing and alerts, confirm the numbers make sense in Spend & Wallets, then enable wallet enforcement for the teams that need hard caps first. Teams usually accept enforcement faster when they already trust the measurements.

If you also have runaway request volume, pair this with Rate Limiting or the broader Runaway Cost Control guidance. Budget control without throughput control still leaves room for noisy behavior. Throughput control without wallet enforcement still leaves room for expensive requests.

Results and impact

The immediate result is that overspend becomes a prevented event instead of a reported event. A request that would exceed the team budget does not quietly reach the provider and show up later as invoice shock.

The second result is accountable ownership. Because consumer groups and wallets tie requests to teams, you can answer basic operational questions quickly: who spent the budget, which model consumed it, and whether the pattern was normal traffic or a fault.

The third result is cleaner budgeting conversations. Teams stop arguing about blended platform invoices and start working with their own measured spend, thresholds, and alerts. That turns cost control into an engineering control, not just a finance exception process.

There is also a quality benefit. When engineers know that expensive models and large prompts are budgeted explicitly, they are more likely to test smaller models, caching, or context compression before defaulting to the most expensive option.

Key takeaways

Per-request budget enforcement requires a pre-dispatch decision, not a monthly report.
Keeptrusts reserves estimated cost before provider dispatch and settles to actual cost after the response.
wallet_team_id is what connects a request stream to an accountable team budget.
Budget alerts help teams adapt before hard blocking at 100%.
Rate limits and wallet enforcement work best together because they control different failure modes.
