Provider Routing
Keeptrusts supports 11 routing strategies for distributing LLM requests across providers. These strategies let you optimize for latency, cost, resilience, or domain-specific accuracy — without changing a line of application code. All routing is configured in your policy config YAML under the provider_routing key.
Use this page when
- You need the exact command, config, API, or integration details for Provider Routing.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Routing Strategies
ordered
Try each provider in the configured sequence. Move to the next provider only when the current one fails (error, timeout, or rate-limit). The simplest and most predictable strategy.
Best for: Production workloads that have a clear preferred provider and need deterministic fallback behavior.
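The ordered semantics can be sketched in a few lines of Python. This is an illustrative model of the behaviour described above, not the gateway's implementation; `ProviderError` and `fake_send` are stand-ins for real transport errors and calls.

```python
# Illustrative sketch of `ordered` semantics: try providers in the
# configured sequence, advancing only when the current one fails.

class ProviderError(Exception):
    pass

def route_ordered(providers, send):
    """Try each provider in order; return the first successful response."""
    last_error = None
    for provider_id in providers:
        try:
            return send(provider_id)
        except ProviderError as exc:  # error, timeout, or rate-limit
            last_error = exc
    # Every provider failed (or the list was empty).
    raise last_error or ProviderError("no providers configured")

# Usage: the second provider answers after the first one is rate-limited.
def fake_send(pid):
    if pid == "primary-openai":
        raise ProviderError("rate-limited")
    return f"response from {pid}"

result = route_ordered(["primary-openai", "fallback-azure"], fake_send)
```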
lowest_latency
Route to the provider that has demonstrated the lowest average response latency over a rolling measurement window. Keeptrusts tracks a per-provider latency histogram and refreshes rankings after min_sample_count successful responses have been recorded within window_seconds.
Best for: Latency-sensitive user-facing applications. Especially useful for chatbots and copilots where p95 tail latency directly impacts UX.
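The ranking rule can be sketched as follows. This is a simplified model of the documented behaviour (mean latency over a rolling window, providers unranked until `min_sample_count` is reached), not the gateway's histogram-based implementation.

```python
# Illustrative sketch of lowest_latency ranking: a provider is only
# ranked once it has at least `min_sample_count` samples recorded
# within the rolling window.
from statistics import mean

def rank_by_latency(samples, min_sample_count=10):
    """samples: {provider_id: [latency_ms, ...]} within window_seconds."""
    eligible = {p: mean(xs) for p, xs in samples.items()
                if len(xs) >= min_sample_count}
    if not eligible:
        return None  # cold start: caller falls back to another strategy
    return min(eligible, key=eligible.get)

samples = {
    "openai-primary": [120, 140, 110, 130, 125, 118, 122, 135, 128, 119],
    "azure-backup":   [90, 95],  # too few samples, not ranked yet
}
best = rank_by_latency(samples, min_sample_count=10)
```

Note that `azure-backup` is faster on average but is not yet eligible, which is exactly why the best practices below recommend keeping `min_sample_count` at 10 or higher before trusting rankings.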
highest_throughput
Route to the provider that is processing the most tokens per second across its recent request history. Useful when you have token-volume SLOs rather than per-request latency SLOs.
Best for: Batch inference pipelines where aggregate throughput matters more than individual response time.
round_robin
Evenly rotate request assignment across all healthy providers in the list. Each provider receives approximately the same number of requests over time, regardless of capacity or current load.
Best for: Spreading load uniformly to avoid hitting any single provider's rate limits.
weighted_round_robin
Proportional rotation where each provider receives a fraction of requests equal to its weight divided by the sum of all weights. A provider with weight: 3 receives three times the traffic of a provider with weight: 1.
Best for: Hybrid deployments (cloud + self-hosted), or when one provider has a higher rate limit tier than others.
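The proportionality guarantee can be sketched with smooth weighted round-robin, one common way to implement this strategy (the source does not specify the gateway's exact algorithm, only the resulting traffic ratio):

```python
# Illustrative smooth weighted round-robin: over time each provider's
# share of picks equals weight / sum(weights).
from collections import Counter

def weighted_round_robin(weights, n):
    """Yield n provider picks following smooth weighted round-robin."""
    current = {p: 0 for p in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        for p in current:
            current[p] += weights[p]
        best = max(current, key=current.get)
        current[best] -= total
        picks.append(best)
    return picks

# weight 3 vs weight 1 over 8 requests: a 6-to-2 split.
counts = Counter(weighted_round_robin({"cloud": 3, "self-hosted": 1}, 8))
```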
least_connections
Route to the provider with the fewest in-flight requests at the moment the request arrives. Keeptrusts tracks an atomic in-flight counter per provider and decrements it when the upstream response is fully received.
Best for: Streaming workloads where requests have highly variable durations and queueing behind a slow request matters.
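The counter mechanics can be sketched like this; it is a simplified single-process model of the documented behaviour (atomic per-provider in-flight counters, decremented when the response completes):

```python
# Illustrative least_connections balancer: pick the provider with the
# fewest in-flight requests; increment on dispatch, decrement on completion.
import threading

class LeastConnections:
    def __init__(self, provider_ids):
        self._lock = threading.Lock()
        self._in_flight = {p: 0 for p in provider_ids}

    def acquire(self):
        with self._lock:
            best = min(self._in_flight, key=self._in_flight.get)
            self._in_flight[best] += 1
            return best

    def release(self, provider_id):
        with self._lock:
            self._in_flight[provider_id] -= 1

lb = LeastConnections(["a", "b"])
first, second = lb.acquire(), lb.acquire()  # one request to each provider
lb.release(first)                           # first request completes
third = lb.acquire()                        # routes back to the freed provider
```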
random
Select a random provider from the healthy list on each request. No state is maintained between requests.
Best for: Rapid load spreading in development or CI environments where even distribution over any small window is not required.
simple_shuffle
Shuffle the provider list once at config load time, then use that shuffled ordered list (like ordered but with randomized initial order). The order persists until the gateway restarts or the config is reloaded.
Best for: Distributing initial load across a fixed pool while preserving ordered-fallback behaviour within a session window.
least_busy
A composite metric that combines queue depth (count of waiting requests) and current in-flight connections into a single score. The provider with the lowest combined score receives the next request.
Best for: Large provider pools with heterogeneous capacity where some providers may be temporarily overloaded.
semantic
Route requests based on the semantic similarity of the incoming message to configured example prompts. Each provider target can declare example topics or prompt snippets. The router embeds the incoming message using semantic_embedding_provider and picks the target whose examples have the highest cosine similarity above semantic_similarity_threshold. When no target clears the threshold, the gateway falls back to ordered.
Best for: Multi-model deployments where different models are specialized for different domains (e.g., code generation vs. medical Q&A vs. legal document review).
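The selection rule can be sketched as follows. The vectors here are hand-made stand-ins for real embedding output; the logic mirrors the documented behaviour (highest cosine similarity above the threshold, otherwise fall back).

```python
# Illustrative semantic routing decision: compare the message embedding
# against each target's example embeddings and pick the best match
# above the similarity threshold.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def pick_target(message_vec, targets, threshold=0.75):
    """targets: {target_id: [example_vec, ...]}. Returns target_id or None."""
    best_id, best_sim = None, threshold
    for target_id, examples in targets.items():
        sim = max(cosine(message_vec, e) for e in examples)
        if sim > best_sim:
            best_id, best_sim = target_id, sim
    return best_id  # None -> caller falls back to the `ordered` strategy

targets = {
    "code-specialist":    [[1.0, 0.1, 0.0]],
    "medical-specialist": [[0.0, 1.0, 0.2]],
}
choice = pick_target([0.9, 0.2, 0.0], targets)
```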
usage_based
Route by historical token usage. Tracks cumulative token consumption per provider over a rolling window and directs new requests to the provider with the most available headroom relative to its configured limits.
Best for: Organizations managing strict per-provider token budgets or billing tiers.
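One plausible reading of "most available headroom" is fractional headroom against each provider's configured limit; the source does not define the exact formula, so treat this as an assumption:

```python
# Illustrative usage_based selection: route to the provider with the
# largest fraction of its token budget still unused in the window.
def pick_by_headroom(usage, limits):
    """usage/limits: {provider_id: tokens}. Highest fractional headroom wins."""
    headroom = {p: 1.0 - usage.get(p, 0) / limits[p] for p in limits}
    return max(headroom, key=headroom.get)

usage = {"openai": 800_000, "azure": 200_000}    # tokens consumed in window
limits = {"openai": 1_000_000, "azure": 500_000}  # configured budgets
choice = pick_by_headroom(usage, limits)  # azure has 60% headroom vs 20%
```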
Configuration Reference
The provider_routing object is placed at the top level of your gateway config, or inside a named pack definition.
pack:
  name: production-api
  version: "1.0.0"
  provider_routing:
    strategy: ordered                  # one of the 11 strategies listed above
    fallback_enabled: true             # attempt next provider on failure

    # Measurement window for latency / throughput strategies
    window_seconds: 300                # rolling 5-minute window
    min_sample_count: 10               # minimum samples before a provider is ranked

    # For bandit-style exploration in latency/throughput modes
    exploration_ratio: 0.05            # 5% of requests probe non-optimal providers

    # Ordered list for `ordered` and `simple_shuffle` strategies
    order:
      - id: primary-openai
      - id: fallback-azure

    # Provider filtering
    allow_fallbacks: true
    only: []                           # if non-empty, restrict to these provider IDs
    ignore: []                         # exclude these provider IDs from routing

    # Smart routing filters (per-request SLO constraints)
    max_price: null                    # max USD per 1M tokens (null = no limit)
    preferred_max_latency: null        # milliseconds, soft preference
    preferred_min_throughput: null     # tokens/second, soft preference
    require_region: []                 # e.g. ["us-east-1", "eu-west-1"]
    require_quantizations: []          # e.g. ["fp16", "bf16"]

    # Semantic routing
    semantic_embedding_provider: null  # provider ID used to embed the prompt
    semantic_similarity_threshold: 0.75

    # Pre-call provider health check before routing
    enable_pre_call_checks: false
Example 1: Ordered fallback
pack:
  name: provider-routing-providers-2
  version: 1.0.0
  enabled: true
  provider_routing:
    strategy: ordered
    fallback_enabled: true
    order:
      - id: openai-primary
      - id: azure-backup
      - id: anthropic-last-resort
  providers:
    targets:
      - id: openai-primary
        provider: openai:chat:gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: azure-backup
        provider: azure:chat:gpt-4o
        base_url: https://myorg.openai.azure.com/openai/deployments/gpt-4o
        secret_key_ref:
          env: AZURE_OPENAI_API_KEY
      - id: anthropic-last-resort
        provider: anthropic:chat:claude-3-5-sonnet-20241022
        secret_key_ref:
          env: ANTHROPIC_API_KEY
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true
Example 2: Weighted round-robin
pack:
  name: provider-routing-providers-3
  version: 1.0.0
  enabled: true
  provider_routing:
    strategy: weighted_round_robin
  providers:
    targets:
      - id: openai-main
        provider: openai:chat:gpt-4o
        weight: 3   # receives 3x the traffic of azure-secondary
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: azure-secondary
        provider: azure:chat:gpt-4o-mini
        weight: 1
        secret_key_ref:
          env: AZURE_OPENAI_API_KEY
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true
Example 3: Semantic routing
pack:
  name: provider-routing-providers-4
  version: 1.0.0
  enabled: true
  provider_routing:
    strategy: semantic
    semantic_embedding_provider: embedding-router
    semantic_similarity_threshold: 0.72
    fallback_enabled: true
  providers:
    targets:
      - id: embedding-router
        provider: openai:embedding:text-embedding-3-small
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: code-specialist
        provider: openai:chat:gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
        semantic_examples:
          - "Write a Python function to deduplicate a list"
          - "Why does this goroutine deadlock?"
      - id: medical-specialist
        provider: anthropic:chat:claude-3-5-sonnet-20241022
        secret_key_ref:
          env: ANTHROPIC_API_KEY
        semantic_examples:
          - "What are the contraindications for metformin?"
          - "Summarize this radiology report"
      - id: general-fallback
        provider: openai:chat:gpt-4o-mini
        secret_key_ref:
          env: OPENAI_API_KEY
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true
When a coding question arrives, the gateway embeds it and finds that code-specialist has the highest similarity. Medical queries route to medical-specialist. Anything below the 0.72 threshold falls back to general-fallback via the ordered fallback chain.
Example 4: Cost-optimized routing
pack:
  name: provider-routing-providers-5
  version: 1.0.0
  enabled: true
  provider_routing:
    strategy: lowest_latency
    max_price: 2.50
    fallback_enabled: true
  providers:
    targets:
      - id: groq-llama
        provider: groq:chat:llama-3.3-70b-versatile
        secret_key_ref:
          env: GROQ_API_KEY
      - id: cerebras-llama
        provider: cerebras:chat:llama3.1-70b
        secret_key_ref:
          env: CEREBRAS_API_KEY
      - id: openai-gpt4o-mini
        provider: openai:chat:gpt-4o-mini
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: openai-gpt4o
        provider: openai:chat:gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
        pricing:
          output_price_per_million: 10.00   # exceeds max_price: 2.50, so excluded
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true
In this example, openai-gpt4o is excluded from routing because its completion token price exceeds max_price: 2.50. The gateway measures actual latency across the remaining three providers and routes to the fastest one.
Smart Routing Filters
Smart routing filters are declared on provider_routing and act as per-request eligibility constraints. A provider that does not satisfy a filter is excluded from the routing decision for that request; if all providers are excluded, the gateway returns a 503 with a clear no_eligible_provider error code.
max_price
Maximum cost in USD per 1 million tokens (prompt + completion combined). Keeptrusts evaluates this against the pricing block on each provider target. Providers without a pricing block are always eligible.
provider_routing:
  max_price: 5.00   # USD per 1M tokens
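The eligibility rule, including the documented exception for unpriced targets, can be sketched like this (an illustrative model, not the gateway's code; the single-number price per target is a simplification of the pricing block):

```python
# Illustrative max_price filter. Per the documented rule, targets with
# no pricing block are always eligible.
def eligible_by_price(targets, max_price):
    """targets: {id: usd_per_million_tokens or None if no pricing block}."""
    return [t for t, price in targets.items()
            if price is None or price <= max_price]

targets = {"groq-llama": 0.79, "openai-gpt4o": 12.50, "unpriced": None}
allowed = eligible_by_price(targets, max_price=5.00)
# openai-gpt4o is excluded; the unpriced target slips through the filter,
# which is why the best practices recommend pricing blocks on every target.
```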
preferred_max_latency
Soft upper bound on mean response latency in milliseconds. Providers currently measuring above this threshold are deprioritized but not excluded when no alternative is available.
provider_routing:
  preferred_max_latency: 600   # ms
preferred_min_throughput
Soft lower bound on tokens per second. Providers currently measuring below this threshold are deprioritized but remain eligible as fallbacks.
provider_routing:
  preferred_min_throughput: 200   # tokens/second
require_region
Only route to providers whose declared region field matches one of the listed values. Useful for data residency requirements.
pack:
  name: provider-routing-providers-9
  version: 1.0.0
  enabled: true
  provider_routing:
    require_region:
      - eu-west-1
  providers:
    targets:
      - id: azure-eu
        provider: azure:chat:gpt-4o
        region: eu-west-1
        secret_key_ref:
          env: AZURE_EU_KEY
      - id: openai-us
        provider: openai:chat:gpt-4o
        region: us-east-1
        secret_key_ref:
          env: OPENAI_KEY
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true
require_quantizations
Only route to providers whose quantization field matches one of the listed values. Common values: fp32, fp16, bf16, int8, int4, awq, gptq.
provider_routing:
  require_quantizations:
    - fp16
    - bf16
Semantic Routing
Semantic routing enables content-aware model selection without hardcoded if/else dispatch logic in your application. Configure domain-specific example prompts on each target; the gateway embeds the incoming message and picks the best-matching target.
Full domain-routing example
pack:
  name: domain-router
  version: "1.0.0"
  provider_routing:
    strategy: semantic
    semantic_embedding_provider: embedding-provider
    semantic_similarity_threshold: 0.70
    fallback_enabled: true
  policies:
    - name: pii-redaction
      rules:
        - type: redact
          pattern: '\b[A-Z]{2}\d{6}\b'   # passport numbers
          replacement: "[PASSPORT]"
  providers:
    targets:
      - id: embedding-provider
        provider: "openai:embedding:text-embedding-3-small"
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: code-model
        provider: "openai:chat:gpt-4o"
        secret_key_ref:
          env: OPENAI_API_KEY
        semantic_examples:
          - "Implement a binary search tree in Go"
          - "Refactor this React component to use hooks"
          - "Write a Terraform module for an S3 bucket"
          - "Fix the memory leak in this C++ destructor"
          - "Generate unit tests for this Python service"
      - id: medical-model
        provider: "anthropic:chat:claude-3-5-sonnet-20241022"
        secret_key_ref:
          env: ANTHROPIC_API_KEY
        semantic_examples:
          - "What is the first-line treatment for hypertension?"
          - "Summarize this discharge summary for the patient"
          - "Explain the pharmacokinetics of warfarin"
          - "Review this clinical note for ICD-10 coding accuracy"
          - "What are signs of acute kidney injury?"
      - id: legal-model
        provider: "anthropic:chat:claude-3-5-sonnet-20241022"
        secret_key_ref:
          env: ANTHROPIC_API_KEY
        semantic_examples:
          - "Review this NDA for unusual indemnification clauses"
          - "What is the statute of limitations for breach of contract?"
          - "Summarize the key obligations in this SaaS agreement"
          - "Compare EU GDPR and California CCPA data subject rights"
      - id: general-model
        provider: "openai:chat:gpt-4o-mini"
        secret_key_ref:
          env: OPENAI_API_KEY
How it works:
- The incoming message is sent to `embedding-provider` to produce a vector.
- Keeptrusts computes cosine similarity between the message vector and the pre-embedded `semantic_examples` for each non-embedding target.
- The target with the highest similarity above `0.70` is selected.
- If no target clears the threshold, the request falls through to `general-model` (the last target in the ordered list, used as the default fallback).
Cost-Optimized Routing
Combine strategy: lowest_latency (or round_robin) with max_price and per-provider pricing blocks to ensure expensive models are only used when cheaper alternatives are unavailable.
pack:
  name: cost-aware-chatbot
  version: 1.0.0
  provider_routing:
    strategy: lowest_latency
    max_price: 3.0
    preferred_max_latency: 1000
    window_seconds: 180
    min_sample_count: 8
    fallback_enabled: true
  providers:
    targets:
      - id: groq-fast
        provider: groq:chat:llama-3.3-70b-versatile
        secret_key_ref:
          env: GROQ_API_KEY
      - id: together-mixtral
        provider: togetherai:chat:mistralai/Mixtral-8x7B-Instruct-v0.1
        secret_key_ref:
          env: TOGETHERAI_API_KEY
      - id: openai-mini
        provider: openai:chat:gpt-4o-mini
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: anthropic-sonnet
        provider: anthropic:chat:claude-3-5-sonnet-20241022
        secret_key_ref:
          env: ANTHROPIC_API_KEY
        pricing:
          output_price_per_million: 15.00   # exceeds max_price: 3.0, so excluded
The anthropic-sonnet target is dynamically excluded because its completion token price of $15.00 per 1M tokens exceeds the max_price: 3.00 ceiling. The gateway measures latency across the remaining three providers and picks the fastest one.
Provider and Model Pinning
Override the routing strategy on a per-request basis using HTTP headers:
- `X-Keeptrusts-Provider: <target-id>` — Route the request to a specific provider target.
- `X-Keeptrusts-Model: <model-id>` — Override the model within the selected provider target.
# Pin to a specific provider
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "X-Keeptrusts-Provider: openai-mini" \
-d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]}'
# Pin both provider and model
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-H "X-Keeptrusts-Provider: openai-mini" \
-H "X-Keeptrusts-Model: gpt-4o" \
-d '{"messages": [{"role": "user", "content": "Hello"}]}'
When a pinning header is present the routing strategy is bypassed entirely and the request goes directly to the named target. Both headers are validated against the active config — unknown or unauthorized values return 400 Bad Request. Pinning does not bypass policy evaluation, rate limits, or budget enforcement.
Detailed Provider Pricing
Provider targets support granular pricing fields for accurate spend tracking:
pack:
  name: provider-routing-providers-13
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: openai-prod
        provider: openai:chat:gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
        pricing:
          input_price_per_million: 2.50
          cached_input_price_per_million: 1.25
          output_price_per_million: 10.00
          input_multiplier: 1.0
          cached_input_multiplier: 1.0
          output_multiplier: 1.0
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true
| Field | Description |
|---|---|
| `input_price_per_million` | Cost per million input tokens |
| `cached_input_price_per_million` | Cost per million cached input tokens |
| `output_price_per_million` | Cost per million output tokens |
| `input_multiplier` | Multiplier applied to input token count (default 1.0) |
| `cached_input_multiplier` | Multiplier applied to cached input token count (default 1.0) |
| `output_multiplier` | Multiplier applied to output token count (default 1.0) |
Legacy prompt_token_cost_per_million and completion_token_cost_per_million fields continue to work and map to input_price_per_million and output_price_per_million respectively.
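The spend calculation implied by these fields can be sketched as follows. This is an illustrative reading of the table above (token counts scaled by their multipliers, then priced per million tokens), not a guaranteed reproduction of Keeptrusts' internal accounting.

```python
# Illustrative per-request cost from the pricing fields above.
def request_cost(tokens, pricing):
    """tokens: {'input': n, 'cached_input': n, 'output': n}."""
    cost = 0.0
    for kind in ("input", "cached_input", "output"):
        price = pricing.get(f"{kind}_price_per_million", 0.0)
        multiplier = pricing.get(f"{kind}_multiplier", 1.0)
        cost += tokens.get(kind, 0) * multiplier * price / 1_000_000
    return cost

pricing = {"input_price_per_million": 2.50, "output_price_per_million": 10.00}
# 1,000 input tokens at $2.50/1M plus 500 output tokens at $10.00/1M.
cost = request_cost({"input": 1_000, "output": 500}, pricing)
```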
Best Practices
- Start with `ordered` and add measured routing (latency, throughput) only when you have real traffic data. Premature optimization with an empty latency window causes random selection.
- Set `min_sample_count` ≥ 10 before trusting latency rankings. The default of 10 samples prevents cold-start routing errors on new deployments.
- Always configure `fallback_enabled: true` in production. A routing strategy that narrows the eligible provider set (e.g., `require_region`) should always have a fallback to avoid hard failures.
- Use `max_price` with explicit `pricing` blocks on every target. A provider without a `pricing` block is always treated as eligible — leaving out pricing data defeats cost filtering.
- Pre-embed semantic examples at startup. Semantic routing embeds the `semantic_examples` strings once when the config is loaded. Keep examples concise (1–2 sentences) and representative of actual prompt patterns.
- Monitor routing decisions via the Keeptrusts event stream. Every request emits a `routing.decision` event field indicating which strategy was applied and which provider was selected. Use this to validate routing behaviour before going to production.
For AI systems
- Canonical terms: Keeptrusts Provider Routing, routing strategy, semantic routing, cost-based routing, provider pinning.
- Strategies (11 total): `ordered`, `lowest_latency`, `highest_throughput`, `round_robin`, `weighted_round_robin`, `least_connections`, `random`, `simple_shuffle`, `least_busy`, `semantic`, `usage_based`.
- Config keys: `provider_routing.strategy`, `provider_routing.fallback_enabled`, `provider_routing.window_seconds`, `provider_routing.min_sample_count`, `provider_routing.exploration_ratio`, `provider_routing.max_price`, `provider_routing.preferred_max_latency`, `provider_routing.require_region`, `provider_routing.require_quantizations`, `provider_routing.semantic_embedding_provider`, `provider_routing.semantic_similarity_threshold`.
- Per-target pricing: `pricing.input_price_per_million`, `pricing.cached_input_price_per_million`, `pricing.output_price_per_million`.
- Pinning headers: `X-Keeptrusts-Provider`, `X-Keeptrusts-Model`.
- Best next pages: Model Groups, Provider Fallback, Circuit Breakers & Retry.
For engineers
- Prerequisites: at least two provider targets for routing to be meaningful; set `min_sample_count` ≥ 10 before trusting latency/throughput rankings.
- Start with the `ordered` strategy and add latency/throughput-based routing only after collecting real traffic data.
- Always set `fallback_enabled: true` in production to prevent hard failures when routing narrows the eligible set.
- Semantic routing: keep `semantic_examples` concise (1–2 sentences); examples are embedded once at config load.
- Cost filtering: add `pricing` blocks on every target — targets without pricing are always eligible and defeat `max_price` filtering.
- Pin requests for debugging: `X-Keeptrusts-Provider: <target-id>` bypasses routing but not policy/rate-limits.
- Monitor: every request emits a `routing.decision` event field showing which strategy and provider were selected.
For leaders
- Cost optimization: `max_price` and `usage_based` routing enforce per-provider spend ceilings without manual intervention.
- Data residency: `require_region` ensures requests never leave approved geographic zones for compliance (GDPR, data sovereignty).
- Quality routing: the `semantic` strategy automatically routes domain-specific queries to specialized models (code, medical, legal) without application code changes.
- Vendor diversity: weighted or latency-based routing across multiple providers reduces concentration risk and improves negotiating leverage.
Next steps
- Model Groups — define named model pools with per-group routing overrides
- Provider Fallback — what happens when routing selects a failing provider
- Circuit Breakers & Retry — per-provider resilience layers
- Consumer Groups — per-group upstream overrides that interact with routing