Provider Routing

Keeptrusts supports 11 routing strategies for distributing LLM requests across providers. These strategies let you optimize for latency, cost, resilience, or domain-specific accuracy without changing a line of application code. All routing is configured in your gateway config YAML under the provider_routing key.

Use this page when

  • You need the exact command, config, API, or integration details for Provider Routing.
  • You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
  • You want a guided rollout instead of a reference page: use the linked workflow pages in Next steps.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Routing Strategies

ordered

Try each provider in the configured sequence. Move to the next provider only when the current one fails (error, timeout, or rate-limit). The simplest and most predictable strategy.

Best for: Production workloads that have a clear preferred provider and need deterministic fallback behavior.
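
The advance-on-failure loop can be sketched in a few lines of Python. The provider call stubs below are hypothetical stand-ins; only the failover behaviour mirrors what the strategy does:

```python
class ProviderError(Exception):
    """Stands in for an upstream error, timeout, or rate-limit."""

def route_ordered(providers, request):
    # Try each provider in the configured sequence; stop at the first success.
    last_err = None
    for call in providers:
        try:
            return call(request)
        except ProviderError as err:
            last_err = err  # advance to the next provider in the list
    raise last_err  # every provider failed

def rate_limited(request):
    raise ProviderError("429 from primary")

def healthy(request):
    return f"ok:{request}"

print(route_ordered([rate_limited, healthy], "hello"))  # → ok:hello
```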

lowest_latency

Route to the provider that has demonstrated the lowest average response latency over a rolling measurement window. Keeptrusts tracks a per-provider latency histogram and refreshes rankings after min_sample_count successful responses have been recorded within window_seconds.

Best for: Latency-sensitive user-facing applications. Especially useful for chatbots and copilots where p95 tail latency directly impacts UX.
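
A rolling-window ranking can be sketched as follows. This is illustrative only: the class name is invented, and eviction here is by sample count rather than by window_seconds as in the real gateway:

```python
from collections import defaultdict, deque

class LatencyRanker:
    def __init__(self, min_sample_count=10, window=100):
        # Per-provider rolling buffer of recent latencies (ms).
        self.samples = defaultdict(lambda: deque(maxlen=window))
        self.min_sample_count = min_sample_count

    def record(self, provider_id, latency_ms):
        self.samples[provider_id].append(latency_ms)

    def best(self):
        # Rank only providers with enough samples; pick the lowest mean.
        ranked = {
            pid: sum(s) / len(s)
            for pid, s in self.samples.items()
            if len(s) >= self.min_sample_count
        }
        return min(ranked, key=ranked.get) if ranked else None

ranker = LatencyRanker(min_sample_count=3)
for ms in (420, 450, 430):
    ranker.record("openai-primary", ms)
for ms in (180, 210, 190):
    ranker.record("groq-fast", ms)
print(ranker.best())  # → groq-fast
```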

highest_throughput

Route to the provider that is processing the most tokens per second across its recent request history. Useful when you have token-volume SLOs rather than per-request latency SLOs.

Best for: Batch inference pipelines where aggregate throughput matters more than individual response time.

round_robin

Evenly rotate request assignment across all healthy providers in the list. Each provider receives approximately the same number of requests over time, regardless of capacity or current load.

Best for: Spreading load uniformly to avoid hitting any single provider's rate limits.

weighted_round_robin

Proportional rotation where each provider receives a fraction of requests equal to its weight divided by the sum of all weights. A provider with weight: 3 receives three times the traffic of a provider with weight: 1.

Best for: Hybrid deployments (cloud + self-hosted), or when one provider has a higher rate limit tier than others.
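
The weight-proportional rotation can be approximated with a simple expanded cycle (a sketch of the arithmetic, not the gateway's actual interleaving):

```python
import itertools

def weighted_cycle(targets):
    # Each provider ID appears `weight` times per full rotation, so over any
    # full window it receives weight / sum(weights) of the traffic.
    expanded = [pid for pid, weight in targets for _ in range(weight)]
    return itertools.cycle(expanded)

picks = weighted_cycle([("openai-main", 3), ("azure-secondary", 1)])
window = [next(picks) for _ in range(8)]
# Over 8 requests: openai-main gets 6 (3/4), azure-secondary gets 2 (1/4).
print(window.count("openai-main"), window.count("azure-secondary"))  # → 6 2
```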

least_connections

Route to the provider with the fewest in-flight requests at the moment the request arrives. Keeptrusts tracks an atomic in-flight counter per provider and decrements it when the upstream response is fully received.

Best for: Streaming workloads where requests have highly variable durations and queueing behind a slow request matters.

random

Select a random provider from the healthy list on each request. No state is maintained between requests.

Best for: Rapid load spreading in development or CI environments where even distribution over any small window is not required.

simple_shuffle

Shuffle the provider list once at config load time, then use that shuffled ordered list (like ordered but with randomized initial order). The order persists until the gateway restarts or the config is reloaded.

Best for: Distributing initial load across a fixed pool while preserving ordered-fallback behaviour within a session window.
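
The load-time shuffle is equivalent to the following sketch (the function name is illustrative):

```python
import random

def load_shuffled_order(order, seed=None):
    # Shuffle once at config load; the result then behaves like `ordered`
    # until the gateway restarts or the config is reloaded.
    rng = random.Random(seed)
    shuffled = list(order)
    rng.shuffle(shuffled)
    return shuffled

order = load_shuffled_order(["openai-primary", "azure-backup", "anthropic-last-resort"])
# Same providers, in a randomized but fixed order for this config load.
print(order)
```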

least_busy

A composite metric that combines queue depth (count of waiting requests) and current in-flight connections into a single score. The provider with the lowest combined score receives the next request.

Best for: Large provider pools with heterogeneous capacity where some providers may be temporarily overloaded.
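
The composite score can be sketched as a minimum over queue depth plus in-flight count (field names are illustrative; the real composite weighting is internal to the gateway):

```python
def least_busy(stats):
    # Lowest combined (queued + in-flight) score wins the next request.
    return min(stats, key=lambda pid: stats[pid]["queued"] + stats[pid]["in_flight"])

stats = {
    "openai-primary": {"queued": 4, "in_flight": 12},  # score 16
    "azure-backup": {"queued": 1, "in_flight": 3},     # score 4
}
print(least_busy(stats))  # → azure-backup
```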

semantic

Route requests based on the semantic similarity of the incoming message to configured example prompts. Each provider target can declare example topics or prompt snippets. The router embeds the incoming message using semantic_embedding_provider and picks the target whose examples have the highest cosine similarity above semantic_similarity_threshold. When no target clears the threshold, the gateway falls back to ordered.

Best for: Multi-model deployments where different models are specialized for different domains (e.g., code generation vs. medical Q&A vs. legal document review).

usage_based

Route by historical token usage. Tracks cumulative token consumption per provider over a rolling window and directs new requests to the provider with the most available headroom relative to its configured limits.

Best for: Organizations managing strict per-provider token budgets or billing tiers.
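
The headroom calculation can be sketched like this (the dict layout and field names are illustrative, not the gateway's internals):

```python
def pick_by_headroom(usage):
    # Route to the provider with the most remaining tokens in its window.
    return max(usage, key=lambda pid: usage[pid]["limit"] - usage[pid]["used"])

usage = {
    "openai-main": {"limit": 1_000_000, "used": 900_000},   # 100k headroom
    "azure-secondary": {"limit": 500_000, "used": 100_000},  # 400k headroom
}
print(pick_by_headroom(usage))  # → azure-secondary
```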


Configuration Reference

The provider_routing object is placed at the top level of your gateway config, or inside a named pack definition.

pack:
  name: production-api
  version: "1.0.0"

  provider_routing:
    strategy: ordered                  # one of the 11 strategies listed above
    fallback_enabled: true             # attempt next provider on failure

    # Measurement window for latency / throughput strategies
    window_seconds: 300                # rolling 5-minute window
    min_sample_count: 10               # minimum samples before a provider is ranked

    # For bandit-style exploration in latency/throughput modes
    exploration_ratio: 0.05            # 5% of requests probe non-optimal providers

    # Ordered list for `ordered` and `simple_shuffle` strategies
    order:
      - id: primary-openai
      - id: fallback-azure

    # Provider filtering
    allow_fallbacks: true
    only: []                           # if non-empty, restrict to these provider IDs
    ignore: []                         # exclude these provider IDs from routing

    # Smart routing filters (per-request SLO constraints)
    max_price: null                    # max USD per 1M tokens (null = no limit)
    preferred_max_latency: null        # milliseconds, soft preference
    preferred_min_throughput: null     # tokens/second, soft preference
    require_region: []                 # e.g. ["us-east-1", "eu-west-1"]
    require_quantizations: []          # e.g. ["fp16", "bf16"]

    # Semantic routing
    semantic_embedding_provider: null  # provider ID used to embed the prompt
    semantic_similarity_threshold: 0.75

    # Pre-call provider health check before routing
    enable_pre_call_checks: false

Example 1: Ordered fallback

pack:
  name: provider-routing-providers-2
  version: 1.0.0
  enabled: true
  provider_routing:
    strategy: ordered
    fallback_enabled: true
    order:
      - id: openai-primary
      - id: azure-backup
      - id: anthropic-last-resort
  providers:
    targets:
      - id: openai-primary
        provider: openai:chat:gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: azure-backup
        provider: azure:chat:gpt-4o
        base_url: https://myorg.openai.azure.com/openai/deployments/gpt-4o
        secret_key_ref:
          env: AZURE_OPENAI_API_KEY
      - id: anthropic-last-resort
        provider: anthropic:chat:claude-3-5-sonnet-20241022
        secret_key_ref:
          env: ANTHROPIC_API_KEY
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true

Example 2: Weighted round-robin

pack:
  name: provider-routing-providers-3
  version: 1.0.0
  enabled: true
  provider_routing:
    strategy: weighted_round_robin
  providers:
    targets:
      - id: openai-main
        provider: openai:chat:gpt-4o
        weight: 3                      # illustrative weights: receives 3/4 of traffic
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: azure-secondary
        provider: azure:chat:gpt-4o-mini
        weight: 1                      # receives 1/4 of traffic
        secret_key_ref:
          env: AZURE_OPENAI_API_KEY
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true

Example 3: Semantic routing

pack:
  name: provider-routing-providers-4
  version: 1.0.0
  enabled: true
  provider_routing:
    strategy: semantic
    semantic_embedding_provider: embedding-router
    semantic_similarity_threshold: 0.72
    fallback_enabled: true
  providers:
    targets:
      - id: embedding-router
        provider: openai:embedding:text-embedding-3-small
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: code-specialist
        provider: openai:chat:gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
        semantic_examples:             # illustrative example prompts
          - "Write a Python function to parse JSON"
          - "Debug this segmentation fault"
      - id: medical-specialist
        provider: anthropic:chat:claude-3-5-sonnet-20241022
        secret_key_ref:
          env: ANTHROPIC_API_KEY
        semantic_examples:
          - "What are the side effects of metformin?"
          - "Summarize this radiology report"
      - id: general-fallback
        provider: openai:chat:gpt-4o-mini
        secret_key_ref:
          env: OPENAI_API_KEY
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true

When a coding question arrives, the gateway embeds it and finds that code-specialist has the highest similarity. Medical queries route to medical-specialist. Anything below the 0.72 threshold falls back to general-fallback via the ordered fallback chain.

Example 4: Cost-optimized routing

pack:
  name: provider-routing-providers-5
  version: 1.0.0
  enabled: true
  provider_routing:
    strategy: lowest_latency
    max_price: 2.50
    fallback_enabled: true
  providers:
    targets:
      - id: groq-llama
        provider: groq:chat:llama-3.3-70b-versatile
        secret_key_ref:
          env: GROQ_API_KEY
      - id: cerebras-llama
        provider: cerebras:chat:llama3.1-70b
        secret_key_ref:
          env: CEREBRAS_API_KEY
      - id: openai-gpt4o-mini
        provider: openai:chat:gpt-4o-mini
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: openai-gpt4o
        provider: openai:chat:gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
        pricing:
          input_price_per_million: 2.50    # illustrative list prices
          output_price_per_million: 10.00  # exceeds max_price, so excluded
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true

In this example, openai-gpt4o is excluded from routing because its completion token price exceeds max_price: 2.50. The gateway measures actual latency across the remaining three providers and routes to the fastest one.


Smart Routing Filters

Smart routing filters are declared on provider_routing and act as per-request eligibility constraints. A provider that does not satisfy a filter is excluded from the routing decision for that request; if all providers are excluded, the gateway returns a 503 with a clear no_eligible_provider error code.
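
The eligibility pass described above can be sketched like this (field names on the provider dicts are illustrative):

```python
def eligible_providers(providers, max_price=None, require_region=None):
    out = []
    for p in providers:
        price = p.get("price_per_million")
        # A provider without a pricing block is always eligible (documented
        # max_price behaviour), so only filter when a price is declared.
        if max_price is not None and price is not None and price > max_price:
            continue
        if require_region and p.get("region") not in require_region:
            continue
        out.append(p)
    return out  # empty list → the gateway would return 503 no_eligible_provider

pool = [
    {"id": "openai-us", "price_per_million": 2.50, "region": "us-east-1"},
    {"id": "azure-eu", "price_per_million": 6.00, "region": "eu-west-1"},
]
print([p["id"] for p in eligible_providers(pool, max_price=5.00)])  # → ['openai-us']
```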

max_price

Maximum cost in USD per 1 million tokens (prompt + completion combined). Keeptrusts evaluates this against the pricing block on each provider target. Providers without a pricing block are always eligible.

provider_routing:
  max_price: 5.00 # USD per 1M tokens

preferred_max_latency

Soft upper bound on mean response latency in milliseconds. Providers currently measuring above this threshold are deprioritized but not excluded when no alternative is available.

provider_routing:
  preferred_max_latency: 600 # ms

preferred_min_throughput

Soft lower bound on tokens per second. Providers currently measuring below this threshold are deprioritized but remain eligible as fallbacks.

provider_routing:
  preferred_min_throughput: 200 # tokens/second

require_region

Only route to providers whose declared region field matches one of the listed values. Useful for data residency requirements.

pack:
  name: provider-routing-providers-9
  version: 1.0.0
  enabled: true
  provider_routing:
    require_region:
      - eu-west-1
  providers:
    targets:
      - id: azure-eu
        provider: azure:chat:gpt-4o
        region: eu-west-1              # only this target satisfies require_region
        secret_key_ref:
          env: AZURE_EU_KEY
      - id: openai-us
        provider: openai:chat:gpt-4o
        region: us-east-1
        secret_key_ref:
          env: OPENAI_KEY
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true

require_quantizations

Only route to providers whose quantization field matches one of the listed values. Common values: fp32, fp16, bf16, int8, int4, awq, gptq.

provider_routing:
  require_quantizations:
    - fp16
    - bf16

Semantic Routing

Semantic routing enables content-aware model selection without hardcoded if/else dispatch logic in your application. Configure domain-specific example prompts on each target; the gateway embeds the incoming message and picks the best-matching target.

Full domain-routing example

pack:
  name: domain-router
  version: "1.0.0"

  provider_routing:
    strategy: semantic
    semantic_embedding_provider: embedding-provider
    semantic_similarity_threshold: 0.70
    fallback_enabled: true

  policies:
    - name: pii-redaction
      rules:
        - type: redact
          pattern: '\b[A-Z]{2}\d{6}\b'  # passport numbers (single-quoted so YAML keeps the backslashes)
          replacement: "[PASSPORT]"

  providers:
    targets:
      - id: embedding-provider
        provider: "openai:embedding:text-embedding-3-small"
        secret_key_ref:
          env: OPENAI_API_KEY

      - id: code-model
        provider: "openai:chat:gpt-4o"
        secret_key_ref:
          env: OPENAI_API_KEY
        semantic_examples:
          - "Implement a binary search tree in Go"
          - "Refactor this React component to use hooks"
          - "Write a Terraform module for an S3 bucket"
          - "Fix the memory leak in this C++ destructor"
          - "Generate unit tests for this Python service"

      - id: medical-model
        provider: "anthropic:chat:claude-3-5-sonnet-20241022"
        secret_key_ref:
          env: ANTHROPIC_API_KEY
        semantic_examples:
          - "What is the first-line treatment for hypertension?"
          - "Summarize this discharge summary for the patient"
          - "Explain the pharmacokinetics of warfarin"
          - "Review this clinical note for ICD-10 coding accuracy"
          - "What are signs of acute kidney injury?"

      - id: legal-model
        provider: "anthropic:chat:claude-3-5-sonnet-20241022"
        secret_key_ref:
          env: ANTHROPIC_API_KEY
        semantic_examples:
          - "Review this NDA for unusual indemnification clauses"
          - "What is the statute of limitations for breach of contract?"
          - "Summarize the key obligations in this SaaS agreement"
          - "Compare EU GDPR and California CCPA data subject rights"

      - id: general-model
        provider: "openai:chat:gpt-4o-mini"
        secret_key_ref:
          env: OPENAI_API_KEY

How it works:

  1. The incoming message is sent to embedding-provider to produce a vector.
  2. Keeptrusts computes cosine similarity between the message vector and the pre-embedded semantic_examples for each non-embedding target.
  3. The target with the highest similarity above 0.70 is selected.
  4. If no target clears the threshold, the request falls through to general-model (the last target in the ordered list, used as the default fallback).
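
The steps above can be sketched with toy three-dimensional vectors standing in for real embeddings (the function names are illustrative):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def route_semantic(message_vec, targets, threshold=0.70, fallback="general-model"):
    # targets: {target_id: [pre-embedded example vectors]}; pick the target
    # whose best example similarity clears the threshold, else fall back.
    best_id, best_sim = fallback, threshold
    for tid, examples in targets.items():
        sim = max(cosine(message_vec, e) for e in examples)
        if sim > best_sim:
            best_id, best_sim = tid, sim
    return best_id

targets = {
    "code-model": [[0.9, 0.1, 0.0]],
    "medical-model": [[0.0, 0.1, 0.9]],
}
print(route_semantic([1.0, 0.0, 0.0], targets))  # → code-model
print(route_semantic([0.0, 1.0, 0.0], targets))  # → general-model (below threshold)
```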

Cost-Optimized Routing

Combine strategy: lowest_latency (or round_robin) with max_price and per-provider pricing blocks to ensure expensive models are only used when cheaper alternatives are unavailable.

pack:
  name: cost-aware-chatbot
  version: 1.0.0
  provider_routing:
    strategy: lowest_latency
    max_price: 3.00
    preferred_max_latency: 1000
    window_seconds: 180
    min_sample_count: 8
    fallback_enabled: true
  providers:
    targets:
      - id: groq-fast
        provider: groq:chat:llama-3.3-70b-versatile
        secret_key_ref:
          env: GROQ_API_KEY
      - id: together-mixtral
        provider: togetherai:chat:mistralai/Mixtral-8x7B-Instruct-v0.1
        secret_key_ref:
          env: TOGETHERAI_API_KEY
      - id: openai-mini
        provider: openai:chat:gpt-4o-mini
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: anthropic-sonnet
        provider: anthropic:chat:claude-3-5-sonnet-20241022
        secret_key_ref:
          env: ANTHROPIC_API_KEY
        pricing:
          input_price_per_million: 3.00    # illustrative list prices
          output_price_per_million: 15.00  # exceeds max_price, so excluded

The anthropic-sonnet target is dynamically excluded because its completion token price of $15.00 per 1M tokens exceeds the max_price: 3.00 ceiling. The gateway measures latency across the remaining three providers and picks the fastest one.


Provider and Model Pinning

Override the routing strategy on a per-request basis using HTTP headers:

  • X-Keeptrusts-Provider: <target-id> — Route the request to a specific provider target.
  • X-Keeptrusts-Model: <model-id> — Override the model within the selected provider target.

# Pin to a specific provider
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Keeptrusts-Provider: openai-mini" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]}'

# Pin both provider and model
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Keeptrusts-Provider: openai-mini" \
  -H "X-Keeptrusts-Model: gpt-4o" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'

When a pinning header is present the routing strategy is bypassed entirely and the request goes directly to the named target. Both headers are validated against the active config — unknown or unauthorized values return 400 Bad Request. Pinning does not bypass policy evaluation, rate limits, or budget enforcement.


Detailed Provider Pricing

Provider targets support granular pricing fields for accurate spend tracking:

pack:
  name: provider-routing-providers-13
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: openai-prod
        provider: openai:chat:gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
        pricing:
          input_price_per_million: 2.50          # illustrative values
          cached_input_price_per_million: 1.25
          output_price_per_million: 10.00
          input_multiplier: 1.0
          output_multiplier: 1.0
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true

Field                           Description
input_price_per_million         Cost per million input tokens
cached_input_price_per_million  Cost per million cached input tokens
output_price_per_million        Cost per million output tokens
input_multiplier                Multiplier applied to input token count (default 1.0)
cached_input_multiplier         Multiplier applied to cached input token count (default 1.0)
output_multiplier               Multiplier applied to output token count (default 1.0)

Legacy prompt_token_cost_per_million and completion_token_cost_per_million fields continue to work and map to input_price_per_million and output_price_per_million respectively.
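
Given these fields, per-request spend follows from the table above; a sketch of the implied arithmetic, with multipliers defaulting to 1.0 as documented:

```python
def request_cost(pricing, input_tokens, cached_tokens, output_tokens):
    # USD cost for one request: tokens * multiplier * price-per-million / 1M,
    # summed across the input, cached-input, and output buckets.
    return (
        input_tokens * pricing.get("input_multiplier", 1.0)
        * pricing["input_price_per_million"] / 1_000_000
        + cached_tokens * pricing.get("cached_input_multiplier", 1.0)
        * pricing.get("cached_input_price_per_million", 0.0) / 1_000_000
        + output_tokens * pricing.get("output_multiplier", 1.0)
        * pricing["output_price_per_million"] / 1_000_000
    )

pricing = {"input_price_per_million": 2.50, "output_price_per_million": 10.00}
# 1000 input tokens → $0.0025, 500 output tokens → $0.0050
print(request_cost(pricing, 1000, 0, 500))  # → 0.0075
```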


Best Practices

  1. Start with ordered and add measured routing (latency, throughput) only when you have real traffic data. Premature optimization with an empty latency window causes random selection.

  2. Set min_sample_count ≥ 10 before trusting latency rankings. The default of 10 samples prevents cold-start routing errors on new deployments.

  3. Always configure fallback_enabled: true in production. A routing strategy that narrows the eligible provider set (e.g., require_region) should always have a fallback to avoid hard failures.

  4. Use max_price with explicit pricing blocks on every target. A provider without a pricing block is always treated as eligible — leaving out pricing data defeats cost filtering.

  5. Pre-embed semantic examples at startup. Semantic routing embeds the semantic_examples strings once when the config is loaded. Keep examples concise (1–2 sentences) and representative of actual prompt patterns.

  6. Monitor routing decisions via the Keeptrusts event stream. Every request emits a routing.decision event indicating which strategy was applied and which provider was selected. Use this to validate routing behaviour before going to production.

For AI systems

  • Canonical terms: Keeptrusts Provider Routing, routing strategy, semantic routing, cost-based routing, provider pinning.
  • Strategies (11 total): ordered, lowest_latency, highest_throughput, round_robin, weighted_round_robin, least_connections, random, simple_shuffle, least_busy, semantic, usage_based.
  • Config keys: provider_routing.strategy, provider_routing.fallback_enabled, provider_routing.window_seconds, provider_routing.min_sample_count, provider_routing.exploration_ratio, provider_routing.max_price, provider_routing.preferred_max_latency, provider_routing.require_region, provider_routing.require_quantizations, provider_routing.semantic_embedding_provider, provider_routing.semantic_similarity_threshold.
  • Per-target pricing: pricing.input_price_per_million, pricing.cached_input_price_per_million, pricing.output_price_per_million.
  • Pinning headers: X-Keeptrusts-Provider, X-Keeptrusts-Model.
  • Best next pages: Model Groups, Provider Fallback, Circuit Breakers & Retry.

For engineers

  • Prerequisites: at least two provider targets for routing to be meaningful; set min_sample_count ≥ 10 before trusting latency/throughput rankings.
  • Start with ordered strategy and add latency/throughput-based routing only after collecting real traffic data.
  • Always set fallback_enabled: true in production to prevent hard failures when routing narrows the eligible set.
  • Semantic routing: keep semantic_examples concise (1–2 sentences); examples are embedded once at config load.
  • Cost filtering: add pricing blocks on every target — targets without pricing are always eligible and defeat max_price filtering.
  • Pin requests for debugging: X-Keeptrusts-Provider: <target-id> bypasses routing but not policy/rate-limits.
  • Monitor: every request emits a routing.decision event showing which strategy and provider were selected.

For leaders

  • Cost optimization: max_price and usage_based routing enforce per-provider spend ceilings without manual intervention.
  • Data residency: require_region ensures requests never leave approved geographic zones for compliance (GDPR, data sovereignty).
  • Quality routing: semantic strategy automatically routes domain-specific queries to specialized models (code, medical, legal) without application code changes.
  • Vendor diversity: weighted or latency-based routing across multiple providers reduces concentration risk and improves negotiating leverage.

Next steps