# Reduce AI Spend by 40% with Gateway Controls
Most organizations overspend on AI because every team picks its own provider, nobody tracks per-request costs, and there are no guardrails to prevent runaway usage. Keeptrusts fixes all three problems at the gateway layer.
## Use this page when
- You want to cut AI costs through provider routing, response caching, model downgrading, and wallet enforcement.
- You need to understand the five cost-reduction levers and their typical savings percentages.
- You are setting up cost-optimized routing across multiple providers with a quality floor.
## Audience
- Primary: Technical Leaders
- Secondary: Technical Engineers, AI Agents
## What you'll achieve
- 40% average cost reduction through intelligent provider routing and caching
- Per-team budget enforcement with wallet controls that prevent overspend before it happens
- Real-time spend visibility across every team, user, and model
- Automatic model selection that routes to the cheapest capable model for each request type
## The five levers of cost reduction
### 1. Provider routing — stop overpaying for the same capability
Different providers charge different rates for equivalent models. Keeptrusts lets you route traffic to the cheapest option automatically.
```yaml
pack:
  name: reduce-ai-spend-providers-1
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: azure-gpt4o
      provider: azure-openai
      model: gpt-4o
      base_url: https://my-resource.openai.azure.com
      secret_key_ref:
        env: AZURE_OPENAI_KEY
    - id: openai-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: anthropic-sonnet
      provider: anthropic
      model: claude-sonnet-4-20250514
      secret_key_ref:
        env: ANTHROPIC_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
With pricing configured on each target and cost-optimized routing enabled, Keeptrusts automatically routes to the cheapest provider (Azure in this example), failing over to the others only when needed.
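To make the routing decision concrete, here is a minimal sketch of "cheapest target above a quality floor" selection. The per-1K-token rates and quality scores are invented for illustration, not actual provider pricing or Keeptrusts internals.

```python
# Illustrative sketch of cost-optimized routing with a quality floor.
# Rates and quality scores below are made-up example values.

def pick_target(targets, quality_floor):
    """Return the cheapest target whose quality score meets the floor."""
    eligible = [t for t in targets if t["quality"] >= quality_floor]
    if not eligible:
        raise ValueError("no target meets the quality floor")
    return min(eligible, key=lambda t: t["cost_per_1k_tokens"])

targets = [
    {"id": "azure-gpt4o", "cost_per_1k_tokens": 0.0045, "quality": 0.90},
    {"id": "openai-gpt4o", "cost_per_1k_tokens": 0.0050, "quality": 0.90},
    {"id": "anthropic-sonnet", "cost_per_1k_tokens": 0.0060, "quality": 0.92},
]

print(pick_target(targets, quality_floor=0.7)["id"])  # azure-gpt4o
```

Raising the floor changes the answer: with `quality_floor: 0.91`, only the Anthropic target qualifies, so cost becomes a tiebreaker only among capable providers.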
### 2. Response caching — stop paying twice for the same answer
Enable caching to serve repeated or similar requests from a local cache instead of making a new upstream call.
```yaml
caching:
  enabled: true
  ttl_seconds: 3600
  max_entries: 10000
  match_strategy: exact
```
Typical savings: 15–25% reduction in upstream API calls for workloads with repeated queries (customer support, FAQ bots, code completion).
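The behavior behind `match_strategy: exact` can be sketched as a TTL-bounded map from prompt to response; this is an illustrative model, not the gateway's actual implementation.

```python
# Minimal sketch of exact-match response caching with a TTL and an
# entry cap, mirroring the config above. Illustrative only.
import time

class ResponseCache:
    def __init__(self, ttl_seconds=3600, max_entries=10000):
        self.ttl = ttl_seconds
        self.max_entries = max_entries
        self._store = {}  # prompt -> (expires_at, response)

    def get(self, prompt):
        entry = self._store.get(prompt)
        if entry and entry[0] > time.time():
            return entry[1]  # cache hit: no upstream call, no cost
        self._store.pop(prompt, None)  # expired or missing
        return None

    def put(self, prompt, response):
        if len(self._store) >= self.max_entries:
            # evict the oldest insertion to stay under max_entries
            self._store.pop(next(iter(self._store)))
        self._store[prompt] = (time.time() + self.ttl, response)

cache = ResponseCache(ttl_seconds=3600)
cache.put("What are your support hours?", "9am-5pm ET")
print(cache.get("What are your support hours?"))  # 9am-5pm ET
```

Every hit replaces a full upstream call, which is why repetitive workloads see the largest savings.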
### 3. Model selection — use the right model for the job
Not every request needs GPT-4o. Route simple classification and extraction tasks to cheaper models.
```yaml
providers:
  model_groups:
    - name: fast-cheap
      models:
        - provider: openai
          model: gpt-4o-mini
        - provider: anthropic
          model: claude-haiku
      routing: lowest_latency
    - name: high-quality
      models:
        - provider: openai
          model: gpt-4o
        - provider: anthropic
          model: claude-sonnet-4-20250514
      routing: cost_optimized
```
Applications can target `fast-cheap` for simple tasks and `high-quality` for complex reasoning — cutting costs without sacrificing quality where it matters.
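On the application side, the split can be as simple as a task-type lookup. The mapping below is a hypothetical example of caller-side logic, not a Keeptrusts feature; only the group names (`fast-cheap`, `high-quality`) come from the config above.

```python
# Hypothetical application-side routing between the two model groups
# defined in the config. The task-type list is an assumption.

SIMPLE_TASKS = {"classification", "extraction", "sentiment", "routing"}

def choose_model_group(task_type: str) -> str:
    """Send simple, well-bounded tasks to the cheap group."""
    return "fast-cheap" if task_type in SIMPLE_TASKS else "high-quality"

print(choose_model_group("classification"))        # fast-cheap
print(choose_model_group("multi-step-reasoning"))  # high-quality
```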
### 4. Wallet limits — enforce budgets before they're exceeded
Keeptrusts wallets enforce hard spend limits at every scope: organization, team, and individual user.
```bash
# Allocate $500 to the engineering team
curl -X POST https://api.keeptrusts.com/v1/wallets/allocate \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "eng-team-id",
    "amount": 500.00,
    "currency": "USD"
  }'
```
When a wallet runs dry:
- The gateway creates a cost ticket instead of forwarding the request
- The request is queued until an admin replenishes the wallet or approves the ticket
- No surprise bills — ever
See Wallets for the full cascade logic (user → team → org).
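The cascade can be pictured as "charge the first scope with enough balance, else open a ticket". The sketch below is an assumption-laden illustration of that logic — field names and exact semantics are invented; the Wallets page documents the real behavior.

```python
# Illustrative model of the user -> team -> org wallet cascade.
# Balances and cascade semantics here are assumptions for illustration.

def charge(wallets, scopes, amount):
    """Deduct `amount` from the first scope with enough balance.
    Returns the scope charged, or None (a cost ticket would be opened
    and the request queued instead of forwarded)."""
    for scope in scopes:
        if wallets.get(scope, 0) >= amount:
            wallets[scope] -= amount
            return scope
    return None  # all wallets dry: no surprise bill

wallets = {"user": 0.50, "team": 500.00, "org": 10000.00}
print(charge(wallets, ("user", "team", "org"), 2.25))   # team
print(charge(wallets, ("user", "team", "org"), 25000))  # None
```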
### 5. Spend dashboards — see where the money goes
The console Spend page gives you real-time visibility into:
- Cost per team — identify which teams are the biggest spenders
- Cost per model — see which models are driving costs
- Cost per request — drill into individual expensive requests
- Trend analysis — spot cost spikes before they become budget problems
## Quick wins
- Add pricing blocks to your provider targets — this alone gives you per-request cost tracking
- Enable caching for any workload with repeated queries — immediate 15–25% savings
- Set wallet limits on your top-spending team — prevent the next cost surprise
- Switch simple workloads to `gpt-4o-mini` or equivalent — 80% cheaper for classification and extraction
## Measuring your savings
Track these metrics in the Spend page before and after implementing controls:
| Metric | Where to find it | Target improvement |
|---|---|---|
| Total monthly spend | Spend → Overview | 30–40% reduction |
| Cache hit rate | Gateway metrics | > 20% for repetitive workloads |
| Cost per 1K tokens (blended) | Spend → Model breakdown | Lower blended rate |
| Budget utilization per team | Wallets → Team view | < 90% of allocation |
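Two of these metrics reduce to simple arithmetic. The sketch below computes a blended per-1K-token rate and the resulting savings; token volumes and rates are invented for illustration, not real pricing.

```python
# Back-of-the-envelope math for "cost per 1K tokens (blended)" and the
# savings from shifting simple traffic to a cheaper model. All numbers
# are illustrative.

def blended_rate(usage):
    """Blended cost per 1K tokens: total spend / total kilotokens.
    `usage` is a list of (tokens, cost_per_1k_tokens) pairs."""
    total_cost = sum(tokens / 1000 * rate for tokens, rate in usage)
    total_ktok = sum(tokens for tokens, _ in usage) / 1000
    return total_cost / total_ktok

before = [(9_000_000, 0.005)]                       # everything on one model
after = [(3_000_000, 0.00015), (6_000_000, 0.005)]  # simple tasks moved to mini

savings = 1 - blended_rate(after) / blended_rate(before)
print(f"{savings:.0%}")  # 32%
```

A lower blended rate with unchanged traffic is direct evidence the routing and downgrading levers are working.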
## Example: full cost-optimized config
```yaml
pack:
  name: cost-optimized-gateway
  version: '1.0'
provider_routing:
  strategy: cost_optimized
  cost_optimization:
    prefer_lowest_cost: true
    quality_floor: 0.7
caching:
  enabled: true
  ttl_seconds: 3600
  max_entries: 10000
providers:
  targets:
    - id: azure-gpt4o-mini
      provider: azure-openai
      model: gpt-4o-mini
      base_url: https://my-resource.openai.azure.com
      secret_key_ref:
        env: AZURE_OPENAI_KEY
    - id: openai-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      retention_days: 90
```
## For AI systems
- Canonical terms: `cost_optimized` routing, response caching, model selection, wallet limits, spend dashboard.
- Config keys: `provider_routing.strategy: cost_optimized`, `caching.enabled`, `caching.ttl_seconds`, `providers.targets[].pricing`, wallet API.
- Five levers: provider routing (15–25%), model downgrading (20–40%), caching (15–25%), wallet limits, spend visibility.
- Best next pages: ROI Calculator, Rate Limiting, Wallets, Provider Routing.
## For engineers
- Prerequisites: gateway running with at least two provider targets with `pricing` defined.
- Set `provider_routing.strategy: cost_optimized` with `quality_floor: 0.7` to route to the cheapest capable provider.
- Enable `caching.enabled: true` with `ttl_seconds: 3600` for repetitive workloads (FAQ bots, code completion).
- Configure wallet allocations per team via the Wallets API to enforce hard budget caps.
- Validate: compare the Spend page totals before and after enabling routing/caching over a 1-week period.
## For leaders
- Organizations typically achieve 30–50% cost reduction by combining routing, caching, and model selection.
- At $50K/month spend, that’s $180K–$300K/year in savings — payback period is typically under one month.
- Wallet limits eliminate surprise invoices: no team or user can exceed their allocation without admin approval.
- Real-time spend visibility enables proactive budget forecasting instead of reactive invoice shock.
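The annual-savings range above is straightforward arithmetic on the page's own numbers ($50K/month baseline, 30–50% reduction):

```python
# Annualized savings at a $50K/month baseline with 30-50% reduction,
# matching the figures quoted above.
monthly_spend = 50_000
low, high = 0.30, 0.50

annual_low = monthly_spend * 12 * low    # $180,000/year
annual_high = monthly_spend * 12 * high  # $300,000/year
print(annual_low, annual_high)  # 180000.0 300000.0
```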
## Next steps
- Calculate Your AI Governance ROI — build a business case with hard numbers
- Centralize AI Observability — get visibility before optimizing further
- Cost and Spend — deep dive into spend tracking features
- Wallets — learn the full wallet cascade and allocation model
- Provider Routing — explore all 11 routing strategies