# Reduce AI Spend by 40% with Gateway Controls
Most organizations overspend on AI because every team picks its own provider, nobody tracks per-request costs, and there are no guardrails to prevent runaway usage. Keeptrusts fixes all three problems at the gateway layer.
## Use this page when
- You want to cut AI costs through provider routing, response caching, model downgrading, and wallet enforcement.
- You need to understand the five cost-reduction levers and their typical savings percentages.
- You are setting up cost-optimized routing across multiple providers with a quality floor.
## Audience
- Primary: Technical Leaders
- Secondary: Technical Engineers, AI Agents
## What you'll achieve
- 40% average cost reduction through intelligent provider routing and caching
- Per-team budget enforcement with wallet controls that prevent overspend before it happens
- Real-time spend visibility across every team, user, and model
- Automatic model selection that routes to the cheapest capable model for each request type
## The five levers of cost reduction
### 1. Provider routing — stop overpaying for the same capability
Different providers charge different rates for equivalent models. Keeptrusts lets you route traffic to the cheapest option automatically.
```yaml
pack:
  name: reduce-ai-spend-providers-1
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: azure-gpt4o
      provider: azure-openai
      model: gpt-4o
      base_url: https://my-resource.openai.azure.com
      secret_key_ref:
        env: AZURE_OPENAI_KEY
    - id: openai-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: anthropic-sonnet
      provider: anthropic
      model: claude-sonnet-4-20250514
      secret_key_ref:
        env: ANTHROPIC_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
With pricing configured on each target and cost-optimized routing enabled, Keeptrusts automatically routes to the cheapest provider (Azure in this example), failing over to the others only when needed.
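To make the routing decision concrete, here is a minimal sketch of "cheapest target above a quality floor" selection. The per-1K-token rates and quality scores are invented for illustration, not actual provider pricing or Keeptrusts internals.

```python
# Illustrative sketch of cost-optimized routing with a quality floor.
# Rates and quality scores below are made-up example values.

def pick_target(targets, quality_floor):
    """Return the cheapest target whose quality score meets the floor."""
    eligible = [t for t in targets if t["quality"] >= quality_floor]
    if not eligible:
        raise ValueError("no target meets the quality floor")
    return min(eligible, key=lambda t: t["cost_per_1k_tokens"])

targets = [
    {"id": "azure-gpt4o", "cost_per_1k_tokens": 0.0045, "quality": 0.90},
    {"id": "openai-gpt4o", "cost_per_1k_tokens": 0.0050, "quality": 0.90},
    {"id": "anthropic-sonnet", "cost_per_1k_tokens": 0.0060, "quality": 0.92},
]

print(pick_target(targets, quality_floor=0.7)["id"])  # azure-gpt4o
```

Raising the floor changes the answer: with `quality_floor: 0.91`, only the Anthropic target qualifies, so cost becomes a tiebreaker only among capable providers.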
### 2. Response caching — stop paying twice for the same answer
Enable caching to serve repeated or similar requests from a local cache instead of making a new upstream call.
```yaml
caching:
  enabled: true
  ttl_seconds: 3600
  max_entries: 10000
  match_strategy: exact
```
Typical savings: 15–25% reduction in upstream API calls for workloads with repeated queries (customer support, FAQ bots, code completion).
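The behavior behind `match_strategy: exact` can be sketched as a TTL-bounded map from prompt to response; this is an illustrative model, not the gateway's actual implementation.

```python
# Minimal sketch of exact-match response caching with a TTL and an
# entry cap, mirroring the config above. Illustrative only.
import time

class ResponseCache:
    def __init__(self, ttl_seconds=3600, max_entries=10000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._store = {}  # prompt -> (expires_at, response)

    def get(self, prompt):
        entry = self._store.get(prompt)
        if entry and entry[0] > time.time():
            return entry[1]  # cache hit: no upstream call, no cost
        self._store.pop(prompt, None)  # expired or missing
        return None

    def put(self, prompt, response):
        if len(self._store) >= self.max_entries:
            # evict the oldest insertion to stay under max_entries
            self._store.pop(next(iter(self._store)))
        self._store[prompt] = (time.time() + self.ttl, response)

cache = ResponseCache(ttl_seconds=3600)
cache.put("What are your support hours?", "9am-5pm ET")
print(cache.get("What are your support hours?"))  # 9am-5pm ET
```

Every hit replaces a full upstream call, which is why repetitive workloads see the largest savings.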
### 3. Model selection — use the right model for the job
Not every request needs GPT-4o. Route simple classification and extraction tasks to cheaper models.
```yaml
providers:
  model_groups:
    - name: fast-cheap
      models:
        - provider: openai
          model: gpt-4o-mini
        - provider: anthropic
          model: claude-haiku
      routing: lowest_latency
    - name: high-quality
      models:
        - provider: openai
          model: gpt-4o
        - provider: anthropic
          model: claude-sonnet-4-20250514
      routing: cost_optimized
```
Applications can target `fast-cheap` for simple tasks and `high-quality` for complex reasoning — cutting costs without sacrificing quality where it matters.
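On the application side, the split can be as simple as a task-type lookup. The mapping below is a hypothetical example of caller-side logic, not a Keeptrusts feature; only the group names (`fast-cheap`, `high-quality`) come from the config above.

```python
# Hypothetical application-side routing between the two model groups
# defined in the config. The task-type list is an assumption.

SIMPLE_TASKS = {"classification", "extraction", "sentiment", "routing"}

def choose_model_group(task_type: str) -> str:
    """Send simple, well-bounded tasks to the cheap group."""
    return "fast-cheap" if task_type in SIMPLE_TASKS else "high-quality"

print(choose_model_group("classification"))        # fast-cheap
print(choose_model_group("multi-step-reasoning"))  # high-quality
```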
### 4. Wallet limits — enforce budgets before they're exceeded
Keeptrusts wallets enforce hard spend limits at every scope: organization, team, and individual user.
```bash
# Allocate $500 to the engineering team
curl -X POST https://api.keeptrusts.com/v1/wallets/allocate \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "eng-team-id",
    "amount": 500.00,
    "currency": "USD"
  }'
```
When a wallet runs dry:
- The gateway creates a cost ticket instead of forwarding the request
- The request is queued until an admin replenishes the wallet or approves the ticket
- No surprise bills — ever
See Wallets for the full cascade logic (user → team → org).
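The cascade can be pictured as "charge the first scope with enough balance, else open a ticket". The sketch below is an assumption-laden illustration of that logic — field names and exact semantics are invented; the Wallets page documents the real behavior.

```python
# Illustrative model of the user -> team -> org wallet cascade.
# Balances and cascade semantics here are assumptions for illustration.

def charge(wallets, scopes, amount):
    """Deduct `amount` from the first scope with enough balance.
    Returns the scope charged, or None (a cost ticket would be opened
    and the request queued instead of forwarded)."""
    for scope in scopes:
        if wallets.get(scope, 0) >= amount:
            wallets[scope] -= amount
            return scope
    return None  # all wallets dry: no surprise bill

wallets = {"user": 0.50, "team": 500.00, "org": 10000.00}
print(charge(wallets, ("user", "team", "org"), 2.25))   # team
print(charge(wallets, ("user", "team", "org"), 25000))  # None
```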
### 5. Spend dashboards — see where the money goes
The console Spend page gives you real-time visibility into:
- Cost per team — identify which teams are the biggest spenders
- Cost per model — see which models are driving costs
- Cost per request — drill into individual expensive requests
- Trend analysis — spot cost spikes before they become budget problems
## Quick wins
- Add pricing blocks to your provider targets — this alone gives you per-request cost tracking
- Enable caching for any workload with repeated queries — immediate 15–25% savings
- Set wallet limits on your top-spending team — prevent the next cost surprise
- Switch simple workloads to `gpt-4o-mini` or equivalent — 80% cheaper for classification and extraction
## Measuring your savings
Track these metrics in the Spend page before and after implementing controls:
| Metric | Where to find it | Target improvement |
|---|---|---|
| Total monthly spend | Spend → Overview | 30–40% reduction |
| Cache hit rate | Gateway metrics | > 20% for repetitive workloads |
| Cost per 1K tokens (blended) | Spend → Model breakdown | Lower blended rate |
| Budget utilization per team | Wallets → Team view | < 90% of allocation |
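Two of these metrics reduce to simple arithmetic. The sketch below computes a blended per-1K-token rate and the resulting savings; token volumes and rates are invented for illustration, not real pricing.

```python
# Back-of-the-envelope math for "cost per 1K tokens (blended)" and the
# savings from shifting simple traffic to a cheaper model. All numbers
# are illustrative.

def blended_rate(usage):
    """Blended cost per 1K tokens: total spend / total kilotokens.
    `usage` is a list of (tokens, cost_per_1k_tokens) pairs."""
    total_cost = sum(tokens / 1000 * rate for tokens, rate in usage)
    total_ktok = sum(tokens for tokens, _ in usage) / 1000
    return total_cost / total_ktok

before = [(9_000_000, 0.005)]                       # everything on one model
after = [(3_000_000, 0.00015), (6_000_000, 0.005)]  # simple tasks moved to mini

savings = 1 - blended_rate(after) / blended_rate(before)
print(f"{savings:.0%}")  # 32%
```

A lower blended rate with unchanged traffic is direct evidence the routing and downgrading levers are working.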
## Example: full cost-optimized config
```yaml
pack:
  name: cost-optimized-gateway
  version: '1.0'
provider_routing:
  strategy: cost_optimized
  cost_optimization:
    prefer_lowest_cost: true
    quality_floor: 0.7
caching:
  enabled: true
  ttl_seconds: 3600
  max_entries: 10000
providers:
  targets:
    - id: azure-gpt4o-mini
      provider: azure-openai
      model: gpt-4o-mini
      base_url: https://my-resource.openai.azure.com
      secret_key_ref:
        env: AZURE_OPENAI_KEY
    - id: openai-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      retention_days: 90
```
## For AI systems
- Canonical terms: `cost_optimized` routing, response caching, model selection, wallet limits, spend dashboard.
- Config keys: `provider_routing.strategy: cost_optimized`, `caching.enabled`, `caching.ttl_seconds`, `providers.targets[].pricing`, wallet API.
- Five levers: provider routing (15–25%), model downgrading (20–40%), caching (15–25%), wallet limits, spend visibility.
- Best next pages: ROI Calculator, Rate Limiting, Wallets, Provider Routing.
## For engineers
- Prerequisites: gateway running with at least two provider targets with `pricing` defined.
- Set `provider_routing.strategy: cost_optimized` with `quality_floor: 0.7` to route to the cheapest capable provider.
- Enable `caching.enabled: true` with `ttl_seconds: 3600` for repetitive workloads (FAQ bots, code completion).
- Configure wallet allocations per team via the Wallets API to enforce hard budget caps.
- Validate: compare the Spend page totals before and after enabling routing/caching over a 1-week period.
## For leaders
- Organizations typically achieve 30–50% cost reduction by combining routing, caching, and model selection.
- At $50K/month spend, that’s $180K–$300K/year in savings — payback period is typically under one month.
- Wallet limits eliminate surprise invoices: no team or user can exceed their allocation without admin approval.
- Real-time spend visibility enables proactive budget forecasting instead of reactive invoice shock.
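The annual-savings range above is straightforward arithmetic on the page's own numbers ($50K/month baseline, 30–50% reduction):

```python
# Annualized savings at a $50K/month baseline with 30-50% reduction,
# matching the figures quoted above.
monthly_spend = 50_000
low, high = 0.30, 0.50

annual_low = monthly_spend * 12 * low    # $180,000/year
annual_high = monthly_spend * 12 * high  # $300,000/year
print(annual_low, annual_high)  # 180000.0 300000.0
```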
## Next steps
- Calculate Your AI Governance ROI — build a business case with hard numbers
- Centralize AI Observability — get visibility before optimizing further
- Cost and Spend — deep dive into spend tracking features
- Wallets — learn the full wallet cascade and allocation model
- Provider Routing — explore all 11 routing strategies