Cost-Aware Routing: Automatically Choose the Cheapest Capable Model

The most expensive model in your stack is rarely too expensive for the tasks that genuinely need it. It becomes too expensive when large volumes of routine tasks land on it by default. Cost-aware routing fixes that by making model choice a governed gateway concern instead of an application hardcode. In Keeptrusts, you can declare multiple provider targets, group them behind stable logical names, and combine a routing strategy with cost filters to keep ordinary work off premium lanes while preserving fallback and resilience.

Use this page when

  • You want to reduce model spend without forcing application teams to rewrite their integrations.
  • You need a practical explanation of how routing and cost controls work together in Keeptrusts.
  • You are comparing premium, standard, and economy model lanes for different workloads.

Primary audience

  • Primary: Technical Engineers
  • Secondary: Technical Leaders, FinOps owners

The problem

Most AI cost waste starts with a hardcoded default model.

Teams prototype on a premium model because it is safe and familiar. The prototype succeeds. The same model is kept for extraction, summarization, classification, and FAQ traffic because changing it later requires application work or creates perceived quality risk. Over time, the expensive lane becomes the universal lane.

Even teams that add a second provider often do so only for resilience, not for cost optimization. They have failover, but no governed way to prefer cheaper capable options when the task does not need premium reasoning.

The result is an awkward operating model: engineers know some requests are overpowered, finance knows the blended cost is too high, and nobody wants to change application code across multiple products just to rebalance the model fleet.

The solution

Keeptrusts puts that decision in the routing layer.

Provider routing lets you declare multiple targets and choose how the gateway selects among them. Model groups give applications stable logical names instead of vendor-specific strings. Cost-aware filters such as max_price let you exclude targets that exceed an acceptable cost ceiling for a lane.
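To make the relationship between model groups and a price ceiling concrete, here is a minimal Python sketch. It is not Keeptrusts code: the `Target` class, the `MODEL_GROUPS` mapping, the group name `standard-chat`, and the `eligible_targets` helper are all hypothetical. It also assumes a target passes the ceiling only when both its input and output price per million tokens are at or below `max_price`; how the gateway actually evaluates `max_price` may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    id: str
    input_price_per_million: float
    output_price_per_million: float


# Hypothetical model group: one stable logical name mapped to several targets.
MODEL_GROUPS = {
    "standard-chat": [
        Target("groq-llama", 0.59, 0.79),
        Target("openai-mini", 0.15, 0.60),
        Target("openai-premium", 2.50, 10.00),
    ],
}


def eligible_targets(group: str, max_price: float) -> list[Target]:
    """Return the targets in a group that fit under the price ceiling.

    Assumption (not confirmed by the source): a target passes only if both
    its input and output price per million tokens are <= max_price.
    """
    return [
        t
        for t in MODEL_GROUPS[group]
        if max(t.input_price_per_million, t.output_price_per_million) <= max_price
    ]


print([t.id for t in eligible_targets("standard-chat", 2.50)])
# ['groq-llama', 'openai-mini']
```

The application only ever refers to the logical name. Which concrete targets sit behind it, and which of them survive the ceiling, is decided in one place.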

This is how teams get to the cheapest capable model in practice. They define the acceptable capability lane for a workflow, then let Keeptrusts optimize within that lane using routing strategy, provider health, and cost constraints. The gateway is not guessing blindly about quality. It is enforcing a designed operating range.

That distinction matters. Cost-aware routing is not about choosing the cheapest model in the catalog. It is about choosing the cheapest model that still belongs in the approved lane for the task.
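The difference between "cheapest capable" and "cheapest possible" can be sketched in a few lines. The lane names and the `APPROVED_LANES` mapping below are hypothetical and only illustrate the idea; the output prices are the ones used in the example configuration under Implementation.

```python
# Hypothetical lane definitions: which targets are approved for which workload.
APPROVED_LANES = {
    "classification": {"groq-llama", "openai-mini"},
    "complex-reasoning": {"openai-premium"},
}

# Output price per million tokens, from the example configuration on this page.
OUTPUT_PRICE = {
    "groq-llama": 0.79,
    "openai-mini": 0.60,
    "openai-premium": 10.00,
}


def cheapest_capable(lane: str) -> str:
    """Cheapest target among those approved for the lane."""
    return min(APPROVED_LANES[lane], key=lambda target_id: OUTPUT_PRICE[target_id])


def cheapest_possible() -> str:
    """Cheapest target in the whole catalog, ignoring lanes."""
    return min(OUTPUT_PRICE, key=lambda target_id: OUTPUT_PRICE[target_id])


print(cheapest_capable("classification"))     # openai-mini
print(cheapest_capable("complex-reasoning"))  # openai-premium
print(cheapest_possible())                    # openai-mini
```

For the complex-reasoning lane the two answers diverge: the catalog-wide minimum is not in the approved set, so it is never a candidate.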

Implementation

One practical pattern is to define cost-sensitive routing with a hard price ceiling and a fallback chain that still protects availability.

pack:
  name: cost-optimized-routing
  version: 1.0.0
  enabled: true

providers:
  routing:
    strategy: lowest_latency
    max_price: 2.50
    window_seconds: 180
    min_sample_count: 8
    fallback:
      enabled: true
  targets:
    - id: groq-llama
      provider: groq:chat:llama-3.3-70b-versatile
      secret_key_ref:
        env: GROQ_API_KEY
      pricing:
        input_price_per_million: 0.59
        output_price_per_million: 0.79
    - id: openai-mini
      provider: openai:chat:gpt-5.4-mini-mini
      secret_key_ref:
        env: OPENAI_API_KEY
      pricing:
        input_price_per_million: 0.15
        output_price_per_million: 0.60
    - id: openai-premium
      provider: openai:chat:gpt-5.4-mini
      secret_key_ref:
        env: OPENAI_API_KEY
      pricing:
        input_price_per_million: 2.50
        output_price_per_million: 10.00

In this pattern, the gateway prefers healthy low-latency providers but excludes anything above the configured price ceiling for the lane. If you need a premium route for more complex work, give that traffic a different model group or gateway profile instead of letting every request inherit the expensive path.
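The selection behavior described above can be approximated with the following Python sketch. It is an illustration under stated assumptions, not the gateway's implementation: `TargetState` and `rank_targets` are hypothetical names, the ceiling is assumed to apply to the higher of a target's input and output price, and targets with fewer than `min_sample_count` latency samples inside the window are assumed to rank after measured targets, in declared order.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class TargetState:
    id: str
    max_unit_price: float  # higher of input/output price per million (assumption)
    healthy: bool = True
    # (timestamp_seconds, latency_ms) samples observed for this target
    samples: list[tuple[float, float]] = field(default_factory=list)


def rank_targets(targets, now, max_price=2.50, window_seconds=180, min_sample_count=8):
    """Return candidate target ids, best first; later entries act as fallbacks."""
    ranked = []
    for position, t in enumerate(targets):
        # Unhealthy targets and targets above the price ceiling are never candidates.
        if not t.healthy or t.max_unit_price > max_price:
            continue
        recent = [ms for ts, ms in t.samples if now - ts <= window_seconds]
        if len(recent) >= min_sample_count:
            key = (0, mean(recent), position)  # enough data: order by mean latency
        else:
            key = (1, 0.0, position)           # too little data: keep declared order
        ranked.append((key, t.id))
    return [target_id for _, target_id in sorted(ranked)]


now = 1000.0
groq = TargetState("groq-llama", 0.79, samples=[(now - i, 120.0) for i in range(10)])
mini = TargetState("openai-mini", 0.60, samples=[(now - i, 300.0) for i in range(10)])
premium = TargetState("openai-premium", 10.00, samples=[(now - i, 90.0) for i in range(10)])

print(rank_targets([groq, mini, premium], now))  # ['groq-llama', 'openai-mini']
```

In this run the premium target has the lowest latency but is still excluded, because the price filter is applied before latency is considered. If `groq.healthy` is set to `False`, the ranking collapses to `['openai-mini']`, which is the fallback path.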

The operational win is that application code can keep calling the gateway while platform owners refine the provider mix behind the scenes. Routing strategy changes, price ceilings, and target order are all controlled centrally.

Results and impact

Cost-aware routing usually creates savings in three stages.

First, it stops the easiest waste: simple tasks landing on premium defaults. That alone can materially reduce the blended cost per request. Second, it improves resilience because cheaper lanes can still be multi-provider and fall back safely instead of depending on one bargain target. Third, it shortens optimization cycles because platform teams can change routing behavior without waiting for application releases.
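As a rough worked example of the first stage, the sketch below computes a blended cost per request using the per-million-token prices from the example configuration. The request size (1,000 input tokens, 300 output tokens) and the traffic split (all traffic on premium before, 20% after) are invented purely for illustration and are not measured results.

```python
# Prices per million tokens, copied from the example configuration on this page.
PRICING = {
    "openai-mini": {"input": 0.15, "output": 0.60},
    "openai-premium": {"input": 2.50, "output": 10.00},
}


def cost_per_request(target: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request on the given target, in the pricing units."""
    p = PRICING[target]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def blended_cost(share_on_premium: float, input_tokens: int, output_tokens: int) -> float:
    """Average cost per request when traffic is split between premium and mini."""
    premium = cost_per_request("openai-premium", input_tokens, output_tokens)
    economy = cost_per_request("openai-mini", input_tokens, output_tokens)
    return share_on_premium * premium + (1 - share_on_premium) * economy


# Hypothetical workload: 1,000 input tokens and 300 output tokens per request.
before = blended_cost(1.0, 1_000, 300)  # everything on the premium default
after = blended_cost(0.2, 1_000, 300)   # only 20% of traffic stays on premium
print(f"before: {before:.6f}  after: {after:.6f}  reduction: {1 - after / before:.0%}")
```

Under these made-up numbers the blended cost per request falls by about three quarters. Real savings depend entirely on the actual token counts and on how much traffic can safely move to the cheaper lane.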

The financial benefit is usually clearest in steady high-volume workloads such as classification, support assistance, and internal knowledge retrieval. These are the places where premium reasoning is often overused simply because nobody encoded a cheaper governed path.

There is also a governance benefit. Once teams see that the gateway can reduce cost without reducing control, they are more willing to standardize on central routing. That improves observability and makes later wallet and chargeback work easier.

Key takeaways

  • Cost-aware routing is about cheapest capable, not cheapest possible.
  • Model groups and routing strategy let platform teams optimize centrally without app rewrites.
  • Cost filters such as max_price help keep a routing lane inside approved spend boundaries.
  • Resilience and cost optimization should be designed together, not treated as separate concerns.
  • Routing is one of the fastest ways to remove premium-default waste across multiple teams.

Next steps