Provider Routing

Keeptrusts supports 11 routing strategies for distributing LLM requests across providers. These strategies let you optimize for latency, cost, resilience, or domain-specific accuracy without changing a line of application code. All routing is configured in your gateway config YAML under the provider_routing key.

Use this page when

  • You need the exact command, config, API, or integration details for Provider Routing.
  • You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
  • You want a guided rollout instead of a reference page: use the linked workflow pages in Next steps.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Routing Strategies

ordered

Try each provider in the configured sequence. Move to the next provider only when the current one fails (error, timeout, or rate-limit). The simplest and most predictable strategy.

Best for: Production workloads that have a clear preferred provider and need deterministic fallback behavior.
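
The advance-on-failure loop can be sketched in a few lines of Python. The provider call stubs below are hypothetical stand-ins; only the failover behaviour mirrors what the strategy does:

```python
class ProviderError(Exception):
    """Stands in for an upstream error, timeout, or rate-limit."""

def route_ordered(providers, request):
    # Try each provider in the configured sequence; stop at the first success.
    last_err = None
    for call in providers:
        try:
            return call(request)
        except ProviderError as err:
            last_err = err  # advance to the next provider in the list
    raise last_err  # every provider failed

def rate_limited(request):
    raise ProviderError("429 from primary")

def healthy(request):
    return f"ok:{request}"

print(route_ordered([rate_limited, healthy], "hello"))  # → ok:hello
```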

lowest_latency

Route to the provider that has demonstrated the lowest average response latency over a rolling measurement window. Keeptrusts tracks a per-provider latency histogram and refreshes rankings after min_sample_count successful responses have been recorded within window_seconds.

Best for: Latency-sensitive user-facing applications. Especially useful for chatbots and copilots where p95 tail latency directly impacts UX.
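
A rolling-window ranking can be sketched as follows. This is illustrative only: the class name is invented, and eviction here is by sample count rather than by window_seconds as in the real gateway:

```python
from collections import defaultdict, deque

class LatencyRanker:
    def __init__(self, min_sample_count=10, window=100):
        # Per-provider rolling buffer of recent latencies (ms).
        self.samples = defaultdict(lambda: deque(maxlen=window))
        self.min_sample_count = min_sample_count

    def record(self, provider_id, latency_ms):
        self.samples[provider_id].append(latency_ms)

    def best(self):
        # Rank only providers with enough samples; pick the lowest mean.
        ranked = {
            pid: sum(s) / len(s)
            for pid, s in self.samples.items()
            if len(s) >= self.min_sample_count
        }
        return min(ranked, key=ranked.get) if ranked else None

ranker = LatencyRanker(min_sample_count=3)
for ms in (420, 450, 430):
    ranker.record("openai-primary", ms)
for ms in (180, 210, 190):
    ranker.record("groq-fast", ms)
print(ranker.best())  # → groq-fast
```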

highest_throughput

Route to the provider that is processing the most tokens per second across its recent request history. Useful when you have token-volume SLOs rather than per-request latency SLOs.

Best for: Batch inference pipelines where aggregate throughput matters more than individual response time.

round_robin

Evenly rotate request assignment across all healthy providers in the list. Each provider receives approximately the same number of requests over time, regardless of capacity or current load.

Best for: Spreading load uniformly to avoid hitting any single provider's rate limits.

weighted_round_robin

Proportional rotation where each provider receives a fraction of requests equal to its weight divided by the sum of all weights. A provider with weight: 3 receives three times the traffic of a provider with weight: 1.

Best for: Hybrid deployments (cloud + self-hosted), or when one provider has a higher rate limit tier than others.
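
The weight-proportional rotation can be approximated with a simple expanded cycle (a sketch of the arithmetic, not the gateway's actual interleaving):

```python
import itertools

def weighted_cycle(targets):
    # Each provider ID appears `weight` times per full rotation, so over any
    # full window it receives weight / sum(weights) of the traffic.
    expanded = [pid for pid, weight in targets for _ in range(weight)]
    return itertools.cycle(expanded)

picks = weighted_cycle([("openai-main", 3), ("azure-secondary", 1)])
window = [next(picks) for _ in range(8)]
# Over 8 requests: openai-main gets 6 (3/4), azure-secondary gets 2 (1/4).
print(window.count("openai-main"), window.count("azure-secondary"))  # → 6 2
```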

least_connections

Route to the provider with the fewest in-flight requests at the moment the request arrives. Keeptrusts tracks an atomic in-flight counter per provider and decrements it when the upstream response is fully received.

Best for: Streaming workloads where requests have highly variable durations and queueing behind a slow request matters.

random

Select a random provider from the healthy list on each request. No state is maintained between requests.

Best for: Rapid load spreading in development or CI environments where even distribution over any small window is not required.

simple_shuffle

Shuffle the provider list once at config load time, then use that shuffled ordered list (like ordered but with randomized initial order). The order persists until the gateway restarts or the config is reloaded.

Best for: Distributing initial load across a fixed pool while preserving ordered-fallback behaviour within a session window.
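
The load-time shuffle is equivalent to the following sketch (the function name is illustrative):

```python
import random

def load_shuffled_order(order, seed=None):
    # Shuffle once at config load; the result then behaves like `ordered`
    # until the gateway restarts or the config is reloaded.
    rng = random.Random(seed)
    shuffled = list(order)
    rng.shuffle(shuffled)
    return shuffled

order = load_shuffled_order(["openai-primary", "azure-backup", "anthropic-last-resort"])
# Same providers, in a randomized but fixed order for this config load.
print(order)
```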

least_busy

A composite metric that combines queue depth (count of waiting requests) and current in-flight connections into a single score. The provider with the lowest combined score receives the next request.

Best for: Large provider pools with heterogeneous capacity where some providers may be temporarily overloaded.
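
The composite score can be sketched as a minimum over queue depth plus in-flight count (field names are illustrative; the real composite weighting is internal to the gateway):

```python
def least_busy(stats):
    # Lowest combined (queued + in-flight) score wins the next request.
    return min(stats, key=lambda pid: stats[pid]["queued"] + stats[pid]["in_flight"])

stats = {
    "openai-primary": {"queued": 4, "in_flight": 12},  # score 16
    "azure-backup": {"queued": 1, "in_flight": 3},     # score 4
}
print(least_busy(stats))  # → azure-backup
```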

semantic

Route requests based on the semantic similarity of the incoming message to configured example prompts. Each provider target can declare example topics or prompt snippets. The router embeds the incoming message using semantic_embedding_provider and picks the target whose examples have the highest cosine similarity above semantic_similarity_threshold. When no target clears the threshold, the gateway falls back to ordered.

Best for: Multi-model deployments where different models are specialized for different domains (e.g., code generation vs. medical Q&A vs. legal document review).

usage_based

Route by historical token usage. Tracks cumulative token consumption per provider over a rolling window and directs new requests to the provider with the most available headroom relative to its configured limits.

Best for: Organizations managing strict per-provider token budgets or billing tiers.
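
The headroom calculation can be sketched like this (the dict layout and field names are illustrative, not the gateway's internals):

```python
def pick_by_headroom(usage):
    # Route to the provider with the most remaining tokens in its window.
    return max(usage, key=lambda pid: usage[pid]["limit"] - usage[pid]["used"])

usage = {
    "openai-main": {"limit": 1_000_000, "used": 900_000},   # 100k headroom
    "azure-secondary": {"limit": 500_000, "used": 100_000},  # 400k headroom
}
print(pick_by_headroom(usage))  # → azure-secondary
```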


Configuration Reference

The provider_routing object is placed at the top level of your gateway config, or inside a named pack definition.

pack:
  name: production-api
  version: "1.0.0"

  provider_routing:
    strategy: ordered                  # one of the 11 strategies listed above
    fallback_enabled: true             # attempt next provider on failure

    # Measurement window for latency / throughput strategies
    window_seconds: 300                # rolling 5-minute window
    min_sample_count: 10               # minimum samples before a provider is ranked

    # For bandit-style exploration in latency/throughput modes
    exploration_ratio: 0.05            # 5% of requests probe non-optimal providers

    # Ordered list for `ordered` and `simple_shuffle` strategies
    order:
      - id: primary-openai
      - id: fallback-azure

    # Provider filtering
    allow_fallbacks: true
    only: []                           # if non-empty, restrict to these provider IDs
    ignore: []                         # exclude these provider IDs from routing

    # Smart routing filters (per-request SLO constraints)
    max_price: null                    # max USD per 1M tokens (null = no limit)
    preferred_max_latency: null        # milliseconds, soft preference
    preferred_min_throughput: null     # tokens/second, soft preference
    require_region: []                 # e.g. ["us-east-1", "eu-west-1"]
    require_quantizations: []          # e.g. ["fp16", "bf16"]

    # Semantic routing
    semantic_embedding_provider: null  # provider ID used to embed the prompt
    semantic_similarity_threshold: 0.75

    # Pre-call provider health check before routing
    enable_pre_call_checks: false

Example 1: Ordered fallback

pack:
  name: provider-routing-providers-2
  version: 1.0.0
  enabled: true
  provider_routing:
    strategy: ordered
    fallback_enabled: true
    order:
      - id: openai-primary
      - id: azure-backup
      - id: anthropic-last-resort
  providers:
    targets:
      - id: openai-primary
        provider: openai:chat:gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: azure-backup
        provider: azure:chat:gpt-4o
        base_url: https://myorg.openai.azure.com/openai/deployments/gpt-4o
        secret_key_ref:
          env: AZURE_OPENAI_API_KEY
      - id: anthropic-last-resort
        provider: anthropic:chat:claude-3-5-sonnet-20241022
        secret_key_ref:
          env: ANTHROPIC_API_KEY
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true

Example 2: Weighted round-robin

pack:
  name: provider-routing-providers-3
  version: 1.0.0
  enabled: true
  provider_routing:
    strategy: weighted_round_robin
  providers:
    targets:
      - id: openai-main
        provider: openai:chat:gpt-4o
        weight: 3                      # illustrative weights: receives 3/4 of traffic
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: azure-secondary
        provider: azure:chat:gpt-4o-mini
        weight: 1                      # receives 1/4 of traffic
        secret_key_ref:
          env: AZURE_OPENAI_API_KEY
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true

Example 3: Semantic routing

pack:
  name: provider-routing-providers-4
  version: 1.0.0
  enabled: true
  provider_routing:
    strategy: semantic
    semantic_embedding_provider: embedding-router
    semantic_similarity_threshold: 0.72
    fallback_enabled: true
  providers:
    targets:
      - id: embedding-router
        provider: openai:embedding:text-embedding-3-small
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: code-specialist
        provider: openai:chat:gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
        semantic_examples:             # illustrative example prompts
          - "Write a Python function to parse JSON"
          - "Debug this segmentation fault"
      - id: medical-specialist
        provider: anthropic:chat:claude-3-5-sonnet-20241022
        secret_key_ref:
          env: ANTHROPIC_API_KEY
        semantic_examples:
          - "What are the side effects of metformin?"
          - "Summarize this radiology report"
      - id: general-fallback
        provider: openai:chat:gpt-4o-mini
        secret_key_ref:
          env: OPENAI_API_KEY
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true

When a coding question arrives, the gateway embeds it and finds that code-specialist has the highest similarity. Medical queries route to medical-specialist. Anything below the 0.72 threshold falls back to general-fallback via the ordered fallback chain.

Example 4: Cost-optimized routing

pack:
  name: provider-routing-providers-5
  version: 1.0.0
  enabled: true
  provider_routing:
    strategy: lowest_latency
    max_price: 2.50
    fallback_enabled: true
  providers:
    targets:
      - id: groq-llama
        provider: groq:chat:llama-3.3-70b-versatile
        secret_key_ref:
          env: GROQ_API_KEY
      - id: cerebras-llama
        provider: cerebras:chat:llama3.1-70b
        secret_key_ref:
          env: CEREBRAS_API_KEY
      - id: openai-gpt4o-mini
        provider: openai:chat:gpt-4o-mini
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: openai-gpt4o
        provider: openai:chat:gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
        pricing:
          input_price_per_million: 2.50    # illustrative list prices
          output_price_per_million: 10.00  # exceeds max_price, so excluded
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true

In this example, openai-gpt4o is excluded from routing because its completion token price exceeds max_price: 2.50. The gateway measures actual latency across the remaining three providers and routes to the fastest one.


Smart Routing Filters

Smart routing filters are declared on provider_routing and act as per-request eligibility constraints. A provider that does not satisfy a filter is excluded from the routing decision for that request; if all providers are excluded, the gateway returns a 503 with a clear no_eligible_provider error code.
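
The eligibility pass described above can be sketched like this (field names on the provider dicts are illustrative):

```python
def eligible_providers(providers, max_price=None, require_region=None):
    out = []
    for p in providers:
        price = p.get("price_per_million")
        # A provider without a pricing block is always eligible (documented
        # max_price behaviour), so only filter when a price is declared.
        if max_price is not None and price is not None and price > max_price:
            continue
        if require_region and p.get("region") not in require_region:
            continue
        out.append(p)
    return out  # empty list → the gateway would return 503 no_eligible_provider

pool = [
    {"id": "openai-us", "price_per_million": 2.50, "region": "us-east-1"},
    {"id": "azure-eu", "price_per_million": 6.00, "region": "eu-west-1"},
]
print([p["id"] for p in eligible_providers(pool, max_price=5.00)])  # → ['openai-us']
```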

max_price

Maximum cost in USD per 1 million tokens (prompt + completion combined). Keeptrusts evaluates this against the pricing block on each provider target. Providers without a pricing block are always eligible.

provider_routing:
  max_price: 5.00 # USD per 1M tokens

preferred_max_latency

Soft upper bound on mean response latency in milliseconds. Providers currently measuring above this threshold are deprioritized but not excluded when no alternative is available.

provider_routing:
  preferred_max_latency: 600 # ms

preferred_min_throughput

Soft lower bound on tokens per second. Providers currently measuring below this threshold are deprioritized but remain eligible as fallbacks.

provider_routing:
  preferred_min_throughput: 200 # tokens/second

require_region

Only route to providers whose declared region field matches one of the listed values. Useful for data residency requirements.

pack:
  name: provider-routing-providers-9
  version: 1.0.0
  enabled: true
  provider_routing:
    require_region:
      - eu-west-1
  providers:
    targets:
      - id: azure-eu
        provider: azure:chat:gpt-4o
        region: eu-west-1              # only this target satisfies require_region
        secret_key_ref:
          env: AZURE_EU_KEY
      - id: openai-us
        provider: openai:chat:gpt-4o
        region: us-east-1
        secret_key_ref:
          env: OPENAI_KEY
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true

require_quantizations

Only route to providers whose quantization field matches one of the listed values. Common values: fp32, fp16, bf16, int8, int4, awq, gptq.

provider_routing:
  require_quantizations:
    - fp16
    - bf16

Semantic Routing

Semantic routing enables content-aware model selection without hardcoded if/else dispatch logic in your application. Configure domain-specific example prompts on each target; the gateway embeds the incoming message and picks the best-matching target.

Full domain-routing example

pack:
  name: domain-router
  version: "1.0.0"

  provider_routing:
    strategy: semantic
    semantic_embedding_provider: embedding-provider
    semantic_similarity_threshold: 0.70
    fallback_enabled: true

  policies:
    - name: pii-redaction
      rules:
        - type: redact
          pattern: '\b[A-Z]{2}\d{6}\b'  # passport numbers (single-quoted so YAML keeps the backslashes)
          replacement: "[PASSPORT]"

  providers:
    targets:
      - id: embedding-provider
        provider: "openai:embedding:text-embedding-3-small"
        secret_key_ref:
          env: OPENAI_API_KEY

      - id: code-model
        provider: "openai:chat:gpt-4o"
        secret_key_ref:
          env: OPENAI_API_KEY
        semantic_examples:
          - "Implement a binary search tree in Go"
          - "Refactor this React component to use hooks"
          - "Write a Terraform module for an S3 bucket"
          - "Fix the memory leak in this C++ destructor"
          - "Generate unit tests for this Python service"

      - id: medical-model
        provider: "anthropic:chat:claude-3-5-sonnet-20241022"
        secret_key_ref:
          env: ANTHROPIC_API_KEY
        semantic_examples:
          - "What is the first-line treatment for hypertension?"
          - "Summarize this discharge summary for the patient"
          - "Explain the pharmacokinetics of warfarin"
          - "Review this clinical note for ICD-10 coding accuracy"
          - "What are signs of acute kidney injury?"

      - id: legal-model
        provider: "anthropic:chat:claude-3-5-sonnet-20241022"
        secret_key_ref:
          env: ANTHROPIC_API_KEY
        semantic_examples:
          - "Review this NDA for unusual indemnification clauses"
          - "What is the statute of limitations for breach of contract?"
          - "Summarize the key obligations in this SaaS agreement"
          - "Compare EU GDPR and California CCPA data subject rights"

      - id: general-model
        provider: "openai:chat:gpt-4o-mini"
        secret_key_ref:
          env: OPENAI_API_KEY

How it works:

  1. The incoming message is sent to embedding-provider to produce a vector.
  2. Keeptrusts computes cosine similarity between the message vector and the pre-embedded semantic_examples for each non-embedding target.
  3. The target with the highest similarity above 0.70 is selected.
  4. If no target clears the threshold, the request falls through to general-model (the last target in the ordered list, used as the default fallback).
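
The steps above can be sketched with toy three-dimensional vectors standing in for real embeddings (the function names are illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def route_semantic(message_vec, targets, threshold=0.70, fallback="general-model"):
    # targets: {target_id: [pre-embedded example vectors]}; pick the target
    # whose best example similarity clears the threshold, else fall back.
    best_id, best_sim = fallback, threshold
    for tid, examples in targets.items():
        sim = max(cosine(message_vec, e) for e in examples)
        if sim > best_sim:
            best_id, best_sim = tid, sim
    return best_id

targets = {
    "code-model": [[0.9, 0.1, 0.0]],
    "medical-model": [[0.0, 0.1, 0.9]],
}
print(route_semantic([1.0, 0.0, 0.0], targets))  # → code-model
print(route_semantic([0.0, 1.0, 0.0], targets))  # → general-model (below threshold)
```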

Cost-Optimized Routing

Combine strategy: lowest_latency (or round_robin) with max_price and per-provider pricing blocks to ensure expensive models are only used when cheaper alternatives are unavailable.

pack:
  name: cost-aware-chatbot
  version: 1.0.0
  provider_routing:
    strategy: lowest_latency
    max_price: 3.00
    preferred_max_latency: 1000
    window_seconds: 180
    min_sample_count: 8
    fallback_enabled: true
  providers:
    targets:
      - id: groq-fast
        provider: groq:chat:llama-3.3-70b-versatile
        secret_key_ref:
          env: GROQ_API_KEY
      - id: together-mixtral
        provider: togetherai:chat:mistralai/Mixtral-8x7B-Instruct-v0.1
        secret_key_ref:
          env: TOGETHERAI_API_KEY
      - id: openai-mini
        provider: openai:chat:gpt-4o-mini
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: anthropic-sonnet
        provider: anthropic:chat:claude-3-5-sonnet-20241022
        secret_key_ref:
          env: ANTHROPIC_API_KEY
        pricing:
          input_price_per_million: 3.00    # illustrative list prices
          output_price_per_million: 15.00  # exceeds max_price, so excluded

The anthropic-sonnet target is dynamically excluded because its completion token price of $15.00 per 1M tokens exceeds the max_price: 3.00 ceiling. The gateway measures latency across the remaining three providers and picks the fastest one.


Provider and Model Pinning

Override the routing strategy on a per-request basis using HTTP headers:

  • X-Keeptrusts-Provider: <target-id> — Route the request to a specific provider target.
  • X-Keeptrusts-Model: <model-id> — Override the model within the selected provider target.

# Pin to a specific provider
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Keeptrusts-Provider: openai-mini" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]}'

# Pin both provider and model
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Keeptrusts-Provider: openai-mini" \
  -H "X-Keeptrusts-Model: gpt-4o" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'

When a pinning header is present the routing strategy is bypassed entirely and the request goes directly to the named target. Both headers are validated against the active config — unknown or unauthorized values return 400 Bad Request. Pinning does not bypass policy evaluation, rate limits, or budget enforcement.


Detailed Provider Pricing

Provider targets support granular pricing fields for accurate spend tracking:

pack:
  name: provider-routing-providers-13
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: openai-prod
        provider: openai:chat:gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
        pricing:
          input_price_per_million: 2.50          # illustrative values
          cached_input_price_per_million: 1.25
          output_price_per_million: 10.00
          input_multiplier: 1.0
          output_multiplier: 1.0
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true

Field                           Description
input_price_per_million         Cost per million input tokens
cached_input_price_per_million  Cost per million cached input tokens
output_price_per_million        Cost per million output tokens
input_multiplier                Multiplier applied to input token count (default 1.0)
cached_input_multiplier         Multiplier applied to cached input token count (default 1.0)
output_multiplier               Multiplier applied to output token count (default 1.0)

Legacy prompt_token_cost_per_million and completion_token_cost_per_million fields continue to work and map to input_price_per_million and output_price_per_million respectively.
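
Given these fields, per-request spend follows from the table above; a sketch of the implied arithmetic, with multipliers defaulting to 1.0 as documented:

```python
def request_cost(pricing, input_tokens, cached_tokens, output_tokens):
    # USD cost for one request: tokens * multiplier * price-per-million / 1M,
    # summed across the input, cached-input, and output buckets.
    return (
        input_tokens * pricing.get("input_multiplier", 1.0)
        * pricing["input_price_per_million"] / 1_000_000
        + cached_tokens * pricing.get("cached_input_multiplier", 1.0)
        * pricing.get("cached_input_price_per_million", 0.0) / 1_000_000
        + output_tokens * pricing.get("output_multiplier", 1.0)
        * pricing["output_price_per_million"] / 1_000_000
    )

pricing = {"input_price_per_million": 2.50, "output_price_per_million": 10.00}
# 1000 input tokens → $0.0025, 500 output tokens → $0.0050
print(request_cost(pricing, 1000, 0, 500))  # → 0.0075
```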


Best Practices

  1. Start with ordered and add measured routing (latency, throughput) only when you have real traffic data. Premature optimization with an empty latency window causes random selection.

  2. Set min_sample_count ≥ 10 before trusting latency rankings. The default of 10 samples prevents cold-start routing errors on new deployments.

  3. Always configure fallback_enabled: true in production. A routing strategy that narrows the eligible provider set (e.g., require_region) should always have a fallback to avoid hard failures.

  4. Use max_price with explicit pricing blocks on every target. A provider without a pricing block is always treated as eligible — leaving out pricing data defeats cost filtering.

  5. Pre-embed semantic examples at startup. Semantic routing embeds the semantic_examples strings once when the config is loaded. Keep examples concise (1–2 sentences) and representative of actual prompt patterns.

  6. Monitor routing decisions via the Keeptrusts event stream. Every request emits a routing.decision event indicating which strategy was applied and which provider was selected. Use this to validate routing behaviour before going to production.

For AI systems

  • Canonical terms: Keeptrusts Provider Routing, routing strategy, semantic routing, cost-based routing, provider pinning.
  • Strategies (11 total): ordered, lowest_latency, highest_throughput, round_robin, weighted_round_robin, least_connections, random, simple_shuffle, least_busy, semantic, usage_based.
  • Config keys: provider_routing.strategy, provider_routing.fallback_enabled, provider_routing.window_seconds, provider_routing.min_sample_count, provider_routing.exploration_ratio, provider_routing.max_price, provider_routing.preferred_max_latency, provider_routing.require_region, provider_routing.require_quantizations, provider_routing.semantic_embedding_provider, provider_routing.semantic_similarity_threshold.
  • Per-target pricing: pricing.input_price_per_million, pricing.cached_input_price_per_million, pricing.output_price_per_million.
  • Pinning headers: X-Keeptrusts-Provider, X-Keeptrusts-Model.
  • Best next pages: Model Groups, Provider Fallback, Circuit Breakers & Retry.

For engineers

  • Prerequisites: at least two provider targets for routing to be meaningful; set min_sample_count ≥ 10 before trusting latency/throughput rankings.
  • Start with ordered strategy and add latency/throughput-based routing only after collecting real traffic data.
  • Always set fallback_enabled: true in production to prevent hard failures when routing narrows the eligible set.
  • Semantic routing: keep semantic_examples concise (1–2 sentences); examples are embedded once at config load.
  • Cost filtering: add pricing blocks on every target — targets without pricing are always eligible and defeat max_price filtering.
  • Pin requests for debugging: X-Keeptrusts-Provider: <target-id> bypasses routing but not policy/rate-limits.
  • Monitor: every request emits a routing.decision event showing which strategy and provider were selected.

For leaders

  • Cost optimization: max_price and usage_based routing enforce per-provider spend ceilings without manual intervention.
  • Data residency: require_region ensures requests never leave approved geographic zones for compliance (GDPR, data sovereignty).
  • Quality routing: semantic strategy automatically routes domain-specific queries to specialized models (code, medical, legal) without application code changes.
  • Vendor diversity: weighted or latency-based routing across multiple providers reduces concentration risk and improves negotiating leverage.

Next steps