The Hidden Cost of Ungoverned AI: Why Teams Waste 40% of LLM Spend
Ungoverned AI rarely looks expensive at the moment a team ships it. One assistant works, another proves value, a third team copies the pattern, and the bill still seems manageable because usage is distributed across products, sandboxes, and provider accounts. The problem shows up later: expensive models become the default for routine work, duplicate prompts are re-billed every day, and budget owners discover overspend only after the request has already left the gateway. Keeptrusts changes that pattern by putting wallets, model pricing, provider routing, caching, and reporting directly in the execution path.
Use this page when
- You are trying to explain why AI costs keep climbing faster than business value.
- You need a product-level view of how Keeptrusts turns spend control into a runtime discipline.
- You want a practical way to connect wallets, routing, caching, and reporting into one cost-management story.
Primary audience
- Primary: Technical Leaders
- Secondary: FinOps owners, Technical Engineers
The problem
Teams do not usually waste 40 percent of LLM spend because one provider invoice is wrong. They waste it because cost governance is missing from the operating model.
The first leak is model sprawl. Without a routing layer, teams pin every workflow to the model that felt safest during the first prototype. That premium default then gets reused for summarization, extraction, triage, and FAQ traffic that could run on a cheaper capable model. The expensive part is not the provider itself. It is the absence of a system that makes the cheaper option the default.
The second leak is duplicate work. Support assistants, internal policy bots, and enablement workflows answer the same or nearly the same questions over and over. If every repeated request is sent upstream again, the organization is paying for work it already bought. Local application caches help a little, but they rarely cover every app, every team, or every gateway node. The result is hidden duplication across the estate.
The third leak is inaccurate accounting. If model pricing is missing or stale, estimates are wrong before the request even starts. That means wallets reserve the wrong amount, dashboards blur the real cost by provider, and finance cannot trust trend lines enough to make planning decisions. Bad pricing metadata turns every downstream control into a guess.
The fourth leak is delayed enforcement. Many organizations have monthly budget reviews, but the gateway still forwards requests after a team has effectively spent its allowance. By the time someone notices, the money is already committed. Alerts alone do not solve that problem if they are not paired with a hard control that can actually stop a request.
The fifth leak is weak attribution. Shared provider accounts create a single large invoice with no obvious owner. Engineering blames support. Support blames experimentation. Finance sees the total, but not which team, agent, or workflow created it. If you cannot attribute spend cleanly, you cannot correct it cleanly.
The solution
Keeptrusts closes those leaks by moving cost control into the governed request path.
Wallets create a hard spending boundary. The gateway estimates the cost before dispatch, reserves that amount against the effective wallet scope, and settles to the actual cost after the provider responds. If no eligible wallet has enough balance, the request is held and a cost ticket is created instead of allowing overspend to continue.
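The reserve-and-settle flow described above can be sketched in a few lines. This is a minimal illustration, not the Keeptrusts implementation: the `Wallet` class, the `governed_call` helper, and the `"held"` status string are hypothetical names chosen for the example, and a real gateway would also create the cost ticket where the comment indicates.

```python
from dataclasses import dataclass


class InsufficientBalance(Exception):
    """Raised when a wallet cannot cover the estimated cost of a request."""


@dataclass
class Wallet:
    balance: float          # funds not yet reserved or spent
    reserved: float = 0.0   # funds held for in-flight requests

    def reserve(self, estimate: float) -> None:
        # Refuse before dispatch if the estimate does not fit in the balance.
        if estimate > self.balance:
            raise InsufficientBalance(
                f"need {estimate:.4f}, have {self.balance:.4f}"
            )
        self.balance -= estimate
        self.reserved += estimate

    def settle(self, estimate: float, actual: float) -> None:
        # Release the hold, then charge what the provider actually billed.
        self.reserved -= estimate
        self.balance += estimate - actual


def governed_call(wallet: Wallet, estimate: float, call_provider):
    """Reserve, dispatch, settle. Returns (status, actual_cost)."""
    try:
        wallet.reserve(estimate)
    except InsufficientBalance:
        # Hypothetical: a real gateway would also open a cost ticket here.
        return "held", 0.0
    actual = call_provider()  # returns the actual cost of the upstream call
    wallet.settle(estimate, actual)
    return "completed", actual
```

The point of the design is ordering: the balance check happens before the provider is called, so an exhausted wallet stops the request rather than producing an alert after the money is spent.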
Provider routing reduces premium-model waste. You can declare multiple targets and let the gateway pick among them using documented routing strategies. That does not mean choosing the cheapest model blindly. It means creating a governed path where simple workloads land on lower-cost targets and premium capacity is preserved for premium tasks.
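One way to picture "cheaper by default, but not blindly" is a selector that filters targets by capability first and only then minimizes cost. The sketch below is an assumption for illustration: the `tier` scale, the `cost_per_1k_tokens` field, and `pick_target` are hypothetical, and it is not the logic behind any documented Keeptrusts routing strategy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    id: str
    tier: int                  # hypothetical scale: 1 = routine, 2 = premium
    cost_per_1k_tokens: float  # illustrative blended rate in USD


def pick_target(targets, required_tier: int) -> Target:
    """Return the cheapest target that is capable enough for the task."""
    eligible = [t for t in targets if t.tier >= required_tier]
    if not eligible:
        raise ValueError(f"no target satisfies tier {required_tier}")
    return min(eligible, key=lambda t: t.cost_per_1k_tokens)


# Illustrative targets; the rates are made up for the example.
TARGETS = [
    Target(id="openai-mini", tier=1, cost_per_1k_tokens=0.001),
    Target(id="openai-premium", tier=2, cost_per_1k_tokens=0.010),
]
```

With these inputs, `pick_target(TARGETS, 1)` selects `openai-mini`, while `pick_target(TARGETS, 2)` falls through to `openai-premium` because the cheaper target is not eligible.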
Caching removes duplicate upstream calls. When Keeptrusts finds an exact or semantic cache hit, the provider is not called, wallet reserve and settle are skipped, and dashboards can record the avoided cost separately. That turns repeated work into measurable savings instead of a vague performance claim.
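To make the exact-match case concrete, here is a small sketch of a TTL cache that records avoided cost on every hit. It covers exact matching only (semantic matching would need an embedding lookup), and the `ExactCache` class, its key layout, and the `avoided_cost` counter are assumptions made for the example rather than the gateway's actual internals.

```python
import hashlib
import time


class ExactCache:
    """Exact-match response cache with a TTL and an avoided-cost counter."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested
        self.entries = {}       # key -> (expires_at, response, cost)
        self.avoided_cost = 0.0

    @staticmethod
    def key(namespace: str, model: str, prompt: str) -> str:
        # A separator byte keeps ("ab", "c") distinct from ("a", "bc").
        raw = "\x1f".join((namespace, model, prompt))
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def get(self, key: str):
        entry = self.entries.get(key)
        if entry is None:
            return None
        expires_at, response, cost = entry
        if self.clock() >= expires_at:
            del self.entries[key]   # expired: treat as a miss
            return None
        self.avoided_cost += cost   # the upstream call was not made
        return response

    def put(self, key: str, response, cost: float) -> None:
        self.entries[key] = (self.clock() + self.ttl, response, cost)
```

On a hit the caller returns the stored response and skips the wallet reserve entirely; on a miss it runs the normal governed path and calls `put` with the settled cost so that later hits can be reported as savings.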
Model-pricing records make the rest of the controls trustworthy. Accurate provider-specific input and output rates are what make reserve and settle reliable, make routing comparisons meaningful, and make exports usable for finance review.
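The arithmetic behind an estimate is simple, which is exactly why stale rates are so damaging: every downstream number inherits the error. The sketch below assumes pricing is stored as USD per million input and output tokens; the `ModelPrice` record, `estimate_cost`, and the rates shown are illustrative, not real provider prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    input_per_million: float   # USD per 1M input tokens (illustrative)
    output_per_million: float  # USD per 1M output tokens (illustrative)


def estimate_cost(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the given token counts."""
    return (
        input_tokens * price.input_per_million
        + output_tokens * price.output_per_million
    ) / 1_000_000


# A stale record versus the current one for the same hypothetical model.
stale = ModelPrice(input_per_million=1.0, output_per_million=4.0)
current = ModelPrice(input_per_million=2.0, output_per_million=8.0)

# The same request is under-reserved by half when the stale record is used.
under_reserved = estimate_cost(current, 1000, 500) - estimate_cost(stale, 1000, 500)
```

In this example the stale record reserves half of what the request will actually cost, so the wallet check passes on requests it should have held.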
Finally, dashboards and exports make the savings operational. Leaders can see which teams are burning budget, which models dominate spend, and which optimization levers are producing real reduction.
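Attribution is ultimately a grouping problem over settled request records. As a rough sketch, assuming each record carries a team, a model, and a settled cost (the field names `team`, `model`, and `cost_usd` are hypothetical, not a Keeptrusts export schema), the same helper can answer both "which team" and "which model" questions:

```python
from collections import defaultdict


def spend_by(records, *keys):
    """Sum cost_usd grouped by the given record fields."""
    totals = defaultdict(float)
    for record in records:
        group = tuple(record[k] for k in keys)
        totals[group] += record["cost_usd"]
    return dict(totals)


# Illustrative settled-request records.
RECORDS = [
    {"team": "team_support", "model": "openai-mini", "cost_usd": 0.25},
    {"team": "team_support", "model": "openai-premium", "cost_usd": 0.50},
    {"team": "team_operations", "model": "openai-mini", "cost_usd": 0.25},
]

by_team = spend_by(RECORDS, "team")
by_team_and_model = spend_by(RECORDS, "team", "model")
```

Grouping by team shows who is spending; adding the model dimension shows whether that spend is concentrated on a premium target that routing could relieve.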
Implementation
The practical rollout is to combine team wallet mapping, cost tracking, response caching, and a non-trivial routing policy in one configuration.
pack:
  name: governed-cost-control
  version: 1.0.0
  enabled: true
cache:
  enabled: true
  mode: exact
  ttl_seconds: 3600
  max_entries: 20000
  namespace: support-and-ops
providers:
  routing:
    strategy: usage_based
  targets:
    - id: openai-mini
      provider: openai:chat:gpt-5.4-mini-mini
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: openai-premium
      provider: openai:chat:gpt-5.4-mini
      secret_key_ref:
        env: OPENAI_API_KEY
consumer_groups:
  - name: support
    api_key: kt_cg_support_prod
    wallet_team_id: team_support
  - name: operations
    api_key: kt_cg_operations_prod
    wallet_team_id: team_operations
cost_tracking:
  enabled: true
  wallet_enforcement: true
  budget_alerts:
    - threshold_percent: 80
      action: notify
    - threshold_percent: 95
      action: notify
    - threshold_percent: 100
      action: block
This configuration does four useful things at once. It ties traffic to team wallets through wallet_team_id, so attribution is clear. It enables reserve-and-settle enforcement, so budgets matter before a request leaves the gateway. It adds an exact-match cache for repetitive workloads, so exact repeats are not re-billed. And it gives routing a chance to place traffic on a lower-cost target rather than a premium default. The budget alerts add early warning on top: teams are notified at 80 and 95 percent of budget, and requests are blocked at 100 percent.
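The threshold ladder in the configuration (notify at 80 and 95 percent, block at 100) can be read as "apply the action of the highest threshold reached". The sketch below encodes that reading; it is an assumption about how such a ladder might be evaluated, and `budget_action` is a hypothetical helper rather than a Keeptrusts function.

```python
# Mirrors the budget_alerts entries in the configuration above.
ALERTS = [
    {"threshold_percent": 80, "action": "notify"},
    {"threshold_percent": 95, "action": "notify"},
    {"threshold_percent": 100, "action": "block"},
]


def budget_action(spent: float, budget: float, alerts=ALERTS):
    """Return the action of the highest threshold reached, or None."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used_percent = 100.0 * spent / budget
    reached = [a for a in alerts if used_percent >= a["threshold_percent"]]
    if not reached:
        return None
    return max(reached, key=lambda a: a["threshold_percent"])["action"]
```

A team at half of its budget gets no action, a team at 85 percent gets a notification, and a team at or above 100 percent is blocked.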
The operational follow-through is just as important as the YAML. Seed model pricing so the gateway can estimate and reconcile costs accurately. Review wallet balances regularly. Export monthly data for finance instead of waiting for a provider invoice to become the first source of truth.
Results and impact
Consider an organization spending $60,000 per month across support, engineering, and internal operations.
If 25 percent of that traffic is repetitive and exact caching prevents even half of those upstream calls, the organization avoids roughly $7,500 per month in provider charges (assuming spend is proportional to request volume) while preserving the same wallet capacity for uncached work. If another 30 percent of requests are simple enough for a cheaper capable target, routing trims a second layer of waste that had previously been hidden inside one premium default.
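The scenario above can be worked through as a back-of-the-envelope calculation. Two things here are assumptions added for the example and do not come from measured data: spend is treated as proportional to request volume, and the cheaper target is assumed to cost 20 percent of the premium one (`cheap_cost_ratio`). The routable share is applied to the spend that remains after caching.

```python
def estimate_savings(
    monthly_spend: float,
    repetitive_share: float,
    cache_hit_rate: float,
    routable_share: float,
    cheap_cost_ratio: float,
):
    """Return (cache_savings, routing_savings) in the same currency."""
    # Spend avoided because cached repeats never reach the provider.
    cache_savings = monthly_spend * repetitive_share * cache_hit_rate
    # Routing applies to the spend that is still sent upstream.
    remaining = monthly_spend - cache_savings
    routing_savings = remaining * routable_share * (1.0 - cheap_cost_ratio)
    return cache_savings, routing_savings


cache_savings, routing_savings = estimate_savings(
    monthly_spend=60_000.0,
    repetitive_share=0.25,   # from the scenario
    cache_hit_rate=0.50,     # from the scenario
    routable_share=0.30,     # from the scenario
    cheap_cost_ratio=0.20,   # hypothetical price ratio, not from the source
)
```

Under these assumptions caching avoids $7,500 per month and routing about $12,600, which together come to roughly a third of the original $60,000. The routing figure is sensitive to the assumed price ratio, so treat it as an order-of-magnitude guide.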
Reserve and settle then removes the last category of surprise. Instead of discovering overrun during a monthly review, the gateway holds the request at the moment balance is insufficient. That changes the budget conversation from reactive blame to controlled decision-making.
The net effect is not just a smaller invoice. It is better operating behavior. Teams stop treating AI cost as somebody else's problem because they can see their wallet, their usage, and their tradeoffs clearly. Finance gains clean attribution. Leadership gains confidence that AI adoption can expand without runaway spend.
Key takeaways
- The biggest source of LLM overspend is usually missing runtime control, not just expensive token rates.
- Wallet enforcement makes cost governance operational because requests are checked before dispatch.
- Provider routing and caching attack different failure modes: one improves model choice, the other removes duplicate work.
- Accurate model pricing is the backbone for trustworthy estimates, clean settlements, and credible reporting.
- Team-level attribution is what turns optimization from theory into action.