CIO Guide: Cutting AI Infrastructure Costs by 40%

AI infrastructure costs are among the fastest-growing line items in many enterprise IT budgets. Without centralized controls, teams over-provision expensive models for simple tasks, duplicate API keys across projects, and lack visibility into who is spending what.

Use this page when

You are setting up wallet-based budget allocation and per-team spend caps
You need to understand the reserve/settle flow for gateway-level cost enforcement
You want to implement model routing optimization (expensive models for complex tasks, cheap models for simple tasks)
You are building chargeback reports from console usage and wallet reporting for finance/procurement

Keeptrusts gives you console-level cost controls, wallet-based budget allocation, intelligent model routing, and API-driven analytics to cut AI spend by 40% or more, with a full audit trail.

Primary audience

Primary: Technical Leaders
Secondary: Technical Engineers, AI Agents

Usage & Wallets Deep-Dive

The console Usage and Wallets pages give you a complete picture of AI spend. They provide:

Real-time spend tracking per team, project, and individual user
Budget allocation with hard and soft caps
Trend analysis with daily, weekly, and monthly views
Anomaly detection when spend deviates from historical patterns
Export capabilities for finance and procurement workflows

Screenshot reference: Console Usage showing team-level spend breakdown, budget utilization bars, and monthly trend chart.

Navigating Usage and Wallets

Surface	What It Shows	Action
Usage	Total spend, top consumers, and budget health	Identify over-spenders
Wallets	Per-team allocation and current balances	Adjust allocations
Events API	Spend per model across all teams	Identify optimization targets
Cost and Spend	Historical spend with forecasting inputs	Plan budget cycles
Alerts	Budget threshold notifications	Configure warning levels

Wallet Allocation Strategy

Wallets are the budget enforcement primitive. Each wallet maps to a team or project and has a configurable balance that decrements with every LLM request.

Allocation Hierarchy

Organization Wallet ($50,000/month)
├── Engineering ($25,000)
│   ├── Search Team ($10,000)
│   ├── Platform Team ($8,000)
│   └── ML Team ($7,000)
├── Product ($15,000)
│   ├── Customer Support ($8,000)
│   └── Analytics ($7,000)
└── Operations ($10,000)
    ├── DevOps ($5,000)
    └── Security ($5,000)

# Allocate budget to the search team
curl -X POST https://api.keeptrusts.com/v1/wallets/allocate \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "search",
    "amount": 10000,
    "currency": "USD",
    "period": "monthly"
  }'
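The hierarchy only holds together if no parent is allocated less than the sum of its children. Below is a minimal Python sketch of that sanity check, using the figures from the tree above. The `Allocation` type and `overallocated` function are hypothetical helpers and are not part of the Keeptrusts API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Allocation:
    name: str
    amount: int  # USD per month
    children: List["Allocation"] = field(default_factory=list)


def overallocated(node: Allocation) -> List[str]:
    """Return the names of nodes whose children together exceed the node's own amount."""
    problems = []
    if node.children and sum(child.amount for child in node.children) > node.amount:
        problems.append(node.name)
    for child in node.children:
        problems.extend(overallocated(child))  # check every level of the tree
    return problems


# The example hierarchy from this page.
org = Allocation("Organization", 50_000, [
    Allocation("Engineering", 25_000, [
        Allocation("Search Team", 10_000),
        Allocation("Platform Team", 8_000),
        Allocation("ML Team", 7_000),
    ]),
    Allocation("Product", 15_000, [
        Allocation("Customer Support", 8_000),
        Allocation("Analytics", 7_000),
    ]),
    Allocation("Operations", 10_000, [
        Allocation("DevOps", 5_000),
        Allocation("Security", 5_000),
    ]),
])

print(overallocated(org))
```

For the example hierarchy the check prints an empty list, because every parent equals the sum of its children. A check like this could run before any allocate call is made.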

Reserve/Settle Flow

When the gateway processes an LLM request:

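Reserve: the estimated cost of the request is held against the team's wallet before the request is forwarded to the provider
Forward: the request goes to the upstream LLM provider
Settle: the actual cost (based on token usage) replaces the reservation, so the wallet is debited only for what the request actually consumed
Block: if the wallet's available balance cannot cover the reservation, the request is not forwarded; it is held until funds are available or an admin approves it

Every in-flight request holds its estimated cost before the upstream call. As a result, concurrent requests cannot together commit more than the wallet's available balance. This prevents overspend even under concurrent high-volume usage.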
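The reserve/settle pattern can be illustrated with an in-memory wallet. This is a minimal sketch, not the gateway's implementation. Several parts are assumptions: the `Wallet` class, the amounts in integer cents, and the choice to release the hold when the upstream call fails. A lock makes the check-and-hold step atomic, and that atomicity is what stops concurrent requests from over-committing the balance.

```python
import threading
from dataclasses import dataclass, field
from typing import Callable


class InsufficientBalance(Exception):
    """Raised when a reservation would exceed the wallet's available balance."""


@dataclass
class Wallet:
    # Hypothetical in-memory wallet; amounts are integer USD cents to avoid float drift.
    balance: int
    reserved: int = 0
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def reserve(self, estimate: int) -> int:
        """Reserve: hold the estimated cost, or refuse if it exceeds what is available."""
        with self._lock:
            if estimate > self.balance - self.reserved:
                raise InsufficientBalance(f"cannot reserve {estimate} cents")
            self.reserved += estimate
            return estimate

    def settle(self, hold: int, actual: int) -> None:
        """Settle: drop the hold and debit the actual cost instead."""
        with self._lock:
            self.reserved -= hold
            self.balance -= actual


def handle_request(wallet: Wallet, estimate: int, call_provider: Callable[[], int]) -> int:
    """Reserve, forward, then settle. call_provider returns the actual cost in cents."""
    hold = wallet.reserve(estimate)  # raises InsufficientBalance (the Block case)
    try:
        actual = call_provider()  # Forward to the upstream provider
    except Exception:
        wallet.settle(hold, 0)  # assumption: a failed upstream call costs nothing
        raise
    wallet.settle(hold, actual)
    return actual
```

Consider a wallet with a 1,000-cent balance that receives twenty concurrent 100-cent reservations. Exactly ten succeed and the rest raise `InsufficientBalance`, whatever the thread scheduling.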

Per-Team Spend Caps

Configure hard caps to prevent runaway costs and soft caps for early warnings.

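# Policy config with per-team spend controls (caps match the wallet allocations above)
spend_controls:
  - team: search
    hard_cap: 10000      # equals the Search Team's $10,000 monthly allocation
    soft_cap: 8000       # 80% of the hard cap
    period: monthly
    on_soft_cap: notify  # warn, keep serving requests
    on_hard_cap: block   # stop forwarding requests once the cap is reached
  - team: customer-support
    hard_cap: 8000       # equals the Customer Support $8,000 monthly allocation
    soft_cap: 6000       # 75% of the hard cap
    period: monthly
    on_soft_cap: notify
    on_hard_cap: escalate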
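The following is a minimal sketch of how the caps above could be evaluated against a team's spend for the current period. Three parts are assumptions for illustration: the `SpendControl` type, the `"allow"` result, and the inclusive (`>=`) thresholds. This page does not specify the gateway's exact comparison rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpendControl:
    team: str
    hard_cap: int  # USD per period
    soft_cap: int  # USD per period, below hard_cap
    on_soft_cap: str = "notify"
    on_hard_cap: str = "block"


def action_for(control: SpendControl, period_spend: int) -> str:
    """Return the action for a team's spend so far this period (thresholds are inclusive)."""
    if period_spend >= control.hard_cap:
        return control.on_hard_cap
    if period_spend >= control.soft_cap:
        return control.on_soft_cap
    return "allow"  # hypothetical label for "under both caps"


search = SpendControl("search", hard_cap=10_000, soft_cap=8_000)
support = SpendControl("customer-support", hard_cap=8_000, soft_cap=6_000, on_hard_cap="escalate")

print(action_for(search, 7_500))   # allow
print(action_for(search, 8_200))   # notify
print(action_for(support, 8_000))  # escalate
```

The hard cap is checked first, so a team that is over both thresholds gets the stricter action.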

Console checkpoint: Usage and wallet reporting show which teams are approaching their soft cap and which have been blocked by hard cap enforcement.

Model Routing for Cost Optimization

The biggest cost lever is routing each request to the most cost-effective model that meets its quality requirements. Many teams default to the most expensive model out of habit rather than necessity.

Cost Comparison Matrix

Model	Input (per 1M tokens)	Output (per 1M tokens)	Best For
GPT-4o	$2.50	$10.00	Complex reasoning
GPT-4o mini	$0.15	$0.60	Simple classification
Claude Sonnet	$3.00	$15.00	Long-form content
Claude 3 Haiku	$0.25	$1.25	Summarization
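To compare models for a specific workload, multiply the token counts by the per-million prices in the matrix. Below is a short Python sketch that does this. The dictionary keys are illustrative labels and are not necessarily exact provider model IDs. The prices are the ones listed in the table.

```python
from decimal import Decimal

# USD per 1M tokens as (input, output), taken from the cost comparison matrix.
# Keys are illustrative labels, not necessarily exact provider model IDs.
PRICES_PER_1M = {
    "gpt-4o": (Decimal("2.50"), Decimal("10.00")),
    "gpt-4o-mini": (Decimal("0.15"), Decimal("0.60")),
    "claude-sonnet": (Decimal("3.00"), Decimal("15.00")),
    "claude-haiku": (Decimal("0.25"), Decimal("1.25")),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Cost in USD of one request, given its input and output token counts."""
    input_price, output_price = PRICES_PER_1M[model]
    return (input_price * input_tokens + output_price * output_tokens) / Decimal(1_000_000)


for model in ("gpt-4o", "gpt-4o-mini"):
    print(model, request_cost(model, input_tokens=1_000, output_tokens=500))
```

Take a request with 1,000 input and 500 output tokens. It costs $0.0075 on GPT-4o and $0.00045 on GPT-4o mini, a difference of more than 16x per request.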

Smart Routing Configuration

# Route simple tasks to cheaper models automatically
model_groups:
  - name: cost-optimized
    routing: cost-priority
    models:
      - provider: openai
        model: gpt-4o-mini
        max_tokens: 1000
        priority: 1  # Try cheapest first
      - provider: openai
        model: gpt-4o
        priority: 2  # Fallback for complex requests
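The priority list above can be read as "try the cheapest model first, and fall back when it cannot serve the request." This page does not describe the gateway's actual rule for deciding that a request is complex. This Python sketch therefore assumes one: a request falls through to the next model when its estimated output exceeds that model's `max_tokens`. `Route` and `pick_route` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Route:
    provider: str
    model: str
    priority: int                     # lower number = tried first
    max_tokens: Optional[int] = None  # None means no limit in this sketch


def pick_route(routes: List[Route], estimated_output_tokens: int) -> Route:
    """Return the first route, in priority order, whose max_tokens fits the request."""
    for route in sorted(routes, key=lambda r: r.priority):
        if route.max_tokens is None or estimated_output_tokens <= route.max_tokens:
            return route
    raise ValueError("no configured route can serve this request")


# Mirrors the cost-optimized model group above.
COST_OPTIMIZED = [
    Route("openai", "gpt-4o-mini", priority=1, max_tokens=1000),
    Route("openai", "gpt-4o", priority=2),
]

print(pick_route(COST_OPTIMIZED, 300).model)   # gpt-4o-mini
print(pick_route(COST_OPTIMIZED, 2500).model)  # gpt-4o
```

Sorting by `priority` means the list order in the config does not matter; only the priority numbers do.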

Typical savings: Organizations report 30–50% cost reduction by routing classification, summarization, and extraction tasks to smaller models.

Caching ROI

Enable response caching for deterministic queries to avoid paying for identical requests.

cache:
  enabled: true
  ttl: 3600  # 1 hour
  scope: team  # Cache is team-scoped for isolation
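To make the `ttl` and `scope: team` settings concrete, here is a minimal Python sketch of a team-scoped TTL cache. It is not the gateway's cache. The class name `TeamScopedCache`, the SHA-256 fingerprint over canonical JSON, and the injectable clock are all assumptions. Putting the team in the cache key is what provides isolation: an identical request from another team is a miss.

```python
import hashlib
import json
import time
from typing import Callable, Dict, Optional, Tuple


class TeamScopedCache:
    """Minimal TTL cache keyed by (team, request fingerprint)."""

    def __init__(self, ttl_seconds: float = 3600, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[Tuple[str, str], Tuple[float, str]] = {}

    @staticmethod
    def fingerprint(request: dict) -> str:
        # Canonical JSON (sorted keys, fixed separators) so key order does not matter.
        canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def get(self, team: str, request: dict) -> Optional[str]:
        key = (team, self.fingerprint(request))
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # expired: evict and report a miss
            return None
        return response

    def put(self, team: str, request: dict, response: str) -> None:
        key = (team, self.fingerprint(request))
        self._entries[key] = (self.clock() + self.ttl_seconds, response)
```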

Impact metrics:

Metric	Without Caching	With Caching	Savings
Daily requests (search team)	50,000	50,000	—
Requests forwarded to provider	50,000	32,000	36% fewer
Responses served from cache	0	18,000	36% of requests
Daily cost	$125	$80	$45/day
Monthly cost (30 days)	$3,750	$2,400	$1,350/month
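The figures in this table rest on one assumption: cost scales linearly with the number of requests forwarded to the provider, so cache hits are treated as free. The hypothetical helper below, meant for spreadsheet-style estimates, reproduces the table from its inputs.

```python
def caching_savings(total_requests: int, cache_hits: int, daily_cost_uncached: float, days: int = 30) -> dict:
    """Estimate caching savings, assuming cost is proportional to requests forwarded."""
    forwarded = total_requests - cache_hits
    daily_cost_cached = daily_cost_uncached * forwarded / total_requests
    daily_savings = daily_cost_uncached - daily_cost_cached
    return {
        "hit_rate": cache_hits / total_requests,
        "daily_cost_cached": daily_cost_cached,
        "daily_savings": daily_savings,
        "monthly_savings": daily_savings * days,
    }


print(caching_savings(total_requests=50_000, cache_hits=18_000, daily_cost_uncached=125))
```

The search team's inputs are 50,000 requests, 18,000 cache hits, and $125/day uncached. With those, the helper returns a 0.36 hit rate, an $80/day cached cost, $45/day in savings, and $1,350 over 30 days.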

API Spend Analytics Endpoints

Build automated cost reports and alerting using the control-plane API:

# Get current wallet balance for a team
curl "https://api.keeptrusts.com/v1/wallets/balance?team_id=search" \
  -H "Authorization: Bearer $API_TOKEN"

# Query spend events with aggregation
curl "https://api.keeptrusts.com/v1/events?team_id=search&since=30d&aggregate=daily" \
  -H "Authorization: Bearer $API_TOKEN"
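The same calls can be made from a script. Below is a minimal Python sketch that uses only the standard library to build the authenticated requests for the two endpoints above. `build_request` is a hypothetical helper. This page does not document the response body format, so the sketch stops short of parsing it.

```python
import os
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.keeptrusts.com/v1"


def build_request(path: str, token: str, **params: str) -> Request:
    """Build (but do not send) an authenticated GET request to the control-plane API."""
    query = urlencode(params)  # percent-encodes values and joins them with "&"
    url = f"{API_BASE}{path}?{query}" if query else f"{API_BASE}{path}"
    return Request(url, headers={"Authorization": f"Bearer {token}"}, method="GET")


token = os.environ.get("API_TOKEN", "")
balance_req = build_request("/wallets/balance", token, team_id="search")
events_req = build_request("/events", token, team_id="search", since="30d", aggregate="daily")

# To send one: urllib.request.urlopen(balance_req) returns a response whose
# .read() gives the raw body; its schema is not documented on this page.
print(events_req.full_url)
```

Building the query with `urlencode` sidesteps the shell-quoting pitfalls of `?` and `&` in raw curl URLs.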

Finance Integration Example

# Monthly chargeback report via CLI
kt events list \
  --since 30d \
  --format csv \
  --fields team,provider,model,cost,timestamp \
  > monthly-chargeback-$(date +%Y%m).csv
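Once the CSV is exported, finance usually wants one total per team. The Python sketch below sums the `cost` column by `team`. It assumes the export has a header row using the field names passed to `--fields`, and that `cost` is a plain decimal amount in USD. `chargeback_by_team` is a hypothetical helper name, and the sample rows are illustrative.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal
from typing import Dict, TextIO


def chargeback_by_team(csv_file: TextIO) -> Dict[str, Decimal]:
    """Sum the cost column per team; Decimal avoids float rounding across many rows."""
    totals: Dict[str, Decimal] = defaultdict(Decimal)
    for row in csv.DictReader(csv_file):
        totals[row["team"]] += Decimal(row["cost"])
    return dict(totals)


# Illustrative rows in the shape produced by --fields team,provider,model,cost,timestamp.
sample = io.StringIO(
    "team,provider,model,cost,timestamp\n"
    "search,openai,gpt-4o,0.0075,2024-04-01T09:00:00Z\n"
    "search,openai,gpt-4o-mini,0.0005,2024-04-01T09:05:00Z\n"
    "analytics,openai,gpt-4o-mini,0.00045,2024-04-02T10:00:00Z\n"
)
print(chargeback_by_team(sample))
```

For a real export, open the file with `open(path, newline="")` and pass it to the same function, as the `csv` module recommends.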

Budget Forecasting with /v1/wallets

Use wallet balance trends to forecast future spend and proactively adjust allocations:

# Get wallet transaction history for forecasting
curl "https://api.keeptrusts.com/v1/wallets/transactions?team_id=search&since=90d" \
  -H "Authorization: Bearer $API_TOKEN"

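Build forecasting into your BI pipeline. The table below is an illustrative example, with variance calculated as (actual - forecast) / forecast: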

Month	Actual Spend	Forecast	Variance
Jan	$22,400	—	—
Feb	$24,100	$23,500	+2.6%
Mar	$21,800	$25,200	-13.5%
Apr (forecast)	—	$22,800	—
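In this table, variance is actual spend minus forecast, expressed as a percentage of the forecast. The short Python check below reproduces the Feb and Mar rows.

```python
def variance_pct(actual: float, forecast: float) -> float:
    """Variance of actual spend against forecast, as a percentage of the forecast."""
    return (actual - forecast) / forecast * 100


for month, actual, forecast in [("Feb", 24_100, 23_500), ("Mar", 21_800, 25_200)]:
    print(f"{month}: {variance_pct(actual, forecast):+.1f}%")
# Feb: +2.6%
# Mar: -13.5%
```

A positive variance means the team spent more than forecast. A large negative variance, like March's, may indicate that the allocation can be reduced next cycle.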

ROI Summary

Optimization	Savings	Implementation Effort
Model routing (expensive → optimized)	30–50%	Policy config change
Response caching	20–35%	Policy config change
Budget caps (eliminate overspend)	10–15%	Wallet allocation
Shadow AI elimination	Variable	Gateway + network policy
Combined	40–60%	2–4 weeks

Next steps

Open the console Usage and Wallets pages and review current team-level spend
Identify the top 3 teams by AI spend and audit their model usage
Implement model routing for simple tasks (classification, extraction)
Set soft caps at 80% of current spend to establish baselines
Schedule monthly cost review using automated CSV exports

For AI systems

Canonical terms: console usage, wallets, POST /v1/wallets/allocate, reserve/settle flow, spend_controls, hard_cap, soft_cap, model routing, model_allowlist, budget allocation hierarchy, cost per token, team spend breakdown
Key console pages: Usage, Wallets, Cost and Spend
Best next pages: CTO: Multi-Provider Strategy, CIO: AI Governance Framework, CTO: API Economy

For engineers

Allocate budget: POST /v1/wallets/allocate with team_id, amount, currency: USD, period: monthly
Reserve/settle: gateway reserves estimated cost before upstream call, settles to actual cost on response; insufficient balance = request held
Spend caps: set hard_cap (blocks at limit) and soft_cap (alerts at threshold) per team in policy-config.yaml
Model routing: use model_allowlist: [gpt-4o-mini] for cost-sensitive workloads to prevent teams from using expensive models
Console checkpoint: Usage and wallet reporting show per-team allocation vs actual spend and identify optimization targets

For leaders

Wallet-based budgets enforce cost controls in real time at the request level — not after the invoice arrives
The reserve/settle pattern prevents overspend under concurrent high-volume usage, even with multiple services sharing a budget
Model routing optimization (directing simple tasks to gpt-4o-mini instead of gpt-4o) typically reduces costs by 30–50% with minimal quality impact
Usage analytics and wallet reporting enable show-back and charge-back models that align AI spend with business unit accountability
