
Configuring Provider Failover and Multi-Model Routing

Configuring provider failover and multi-model routing in Keeptrusts comes down to five steps: declare more than one providers.targets entry, choose a routing strategy, enable fallback, group targets behind stable model names, and filter those targets with data-routing-policy when not every provider is allowed to process the same traffic.

Use this page when

  • You need a practical way to set up provider failover without changing application code every time a model or vendor changes.
  • You want to route one logical model name to several upstream targets with clear fallback behavior.
  • You need to combine resilience with retention, training-opt-out, or regulated-routing constraints.

Primary audience

  • Primary: Technical Engineers and Platform Operators
  • Secondary: Technical Leaders and Governance reviewers

The problem

Most teams start with one upstream model and a hardcoded provider name in the client. That is fast to ship, but it creates three operational problems.

The first problem is outage risk. If the only configured provider starts timing out or rate-limiting, the application has nowhere else to go.

The second problem is coupling. If your client sends model: gpt-5.4-mini directly to one provider contract, changing vendors or adding a cheaper fallback often turns into an application release rather than a gateway config change.

The third problem is policy mismatch. Some providers may be acceptable for low-risk content, while others should only receive sanitized or zero-retention traffic. If you treat all providers as interchangeable, you eventually route the wrong class of request to a target that does not meet the data-handling rule for that route.

Keeptrusts solves those problems at the gateway layer. The relevant pieces are already documented in the declarative config: providers.targets, providers.routing, providers.fallback, providers.circuit_breaker, providers.model_groups, and data-routing-policy.

What Keeptrusts actually controls

providers.targets is the physical inventory. Each target identifies one upstream model and the credentials needed to reach it.

providers.routing.strategy decides how a request chooses from those targets. For most first rollouts, ordered is the right starting point because it is deterministic. You know which target is tried first and which target comes next.

providers.fallback decides which failures should move a request to the next target. This is where rate limits, timeouts, server errors, and similar failures become a routing concern instead of an application incident.
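To make the idea concrete, here is a minimal Python sketch of ordered selection with trigger-based fallback. It is an illustration of the concept, not the Keeptrusts implementation: UpstreamError, call_with_fallback, and the send callback are hypothetical names, and the reading of max_fallback_attempts as "extra targets tried after the first" is an assumption.

```python
from typing import Callable, Sequence, Tuple

# Failure kinds that should move a request to the next target.
# These mirror the trigger names used under fallback.triggers.
FALLBACK_TRIGGERS = frozenset({"rate_limit", "server_error", "timeout"})


class UpstreamError(Exception):
    """Hypothetical error type carrying a failure kind such as 'timeout'."""

    def __init__(self, kind: str) -> None:
        super().__init__(kind)
        self.kind = kind


def call_with_fallback(
    targets: Sequence[str],
    send: Callable[[str], str],
    max_fallback_attempts: int = 2,
) -> Tuple[str, str]:
    """Try targets in config order and return (target_id, response).

    Assumption: max_fallback_attempts counts the extra targets tried
    after the first one fails.
    """
    if not targets:
        raise ValueError("no targets configured")
    last_error = None
    for target_id in targets[: max_fallback_attempts + 1]:
        try:
            return target_id, send(target_id)
        except UpstreamError as err:
            if err.kind not in FALLBACK_TRIGGERS:
                raise  # non-trigger failures surface to the caller
            last_error = err
    # Every attempted target failed with a trigger; surface the last failure.
    raise last_error
```

Because the loop walks the list in order, the failure path can be read straight off the target order, which is the property that makes ordered a comfortable first strategy.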

providers.circuit_breaker prevents repeated traffic to a degraded target. Retries help with brief blips. Circuit breakers help with sustained failure.
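The difference between a retry and a circuit breaker is easier to see as a small state machine. The sketch below is a generic closed / open / half-open breaker in Python, assuming the three settings behave the way their names suggest (a consecutive-failure threshold, a cooldown in seconds, and a number of successes needed to close again). The class and method names are hypothetical and do not describe Keeptrusts internals.

```python
import time
from typing import Callable


class CircuitBreaker:
    """Generic circuit breaker sketch for a single upstream target."""

    def __init__(
        self,
        failure_threshold: int = 5,
        cooldown_seconds: float = 30.0,
        half_open_successes: int = 1,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.half_open_successes = half_open_successes
        self.clock = clock
        self.state = "closed"
        self.failures = 0    # consecutive failures while closed
        self.successes = 0   # successes seen while half-open
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        """Return True if traffic may be sent to this target right now."""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.cooldown_seconds:
                # Cooldown elapsed: let probe traffic through.
                self.state = "half_open"
                self.successes = 0
                return True
            return False
        return True

    def record_success(self) -> None:
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.half_open_successes:
                self.state = "closed"
                self.failures = 0
        else:
            self.failures = 0  # a success resets the consecutive count

    def record_failure(self) -> None:
        if self.state == "half_open":
            self._open()  # a failed probe reopens immediately
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self) -> None:
        self.state = "open"
        self.opened_at = self.clock()
        self.failures = 0
```

The clock parameter is injectable so that a test can advance time without sleeping through the cooldown.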

providers.model_groups is the abstraction layer. Your application can ask for production-chat while the gateway decides whether that means OpenAI first, Anthropic second, or a cheaper fallback pool.

data-routing-policy is the policy boundary that stops routing from becoming too permissive. If the request requires zero-data-retention or no-training guarantees, the gateway can remove non-compliant targets before normal selection begins.

That last point is important: failover should increase resilience, not quietly bypass your governance rules.
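As a rough illustration of that filtering step, the Python sketch below removes targets whose declared data-handling metadata does not satisfy the route's requirements. The field names follow the data_policy and data-routing-policy keys used in the example config on this page, but the mapping between them (for example, require_no_training checked against training_opt_out) and the helper names are assumptions made for illustration.

```python
from typing import Any, Dict, List


class NoCompliantProvider(Exception):
    """Raised when filtering leaves no target and the rule says to block."""


def is_compliant(data_policy: Dict[str, Any], rules: Dict[str, Any]) -> bool:
    """Check one target's declared metadata against the routing rules.

    Missing metadata is treated as non-compliant rather than assumed safe.
    """
    if rules.get("require_zero_data_retention") and not data_policy.get("zero_data_retention", False):
        return False
    if rules.get("require_no_training") and not data_policy.get("training_opt_out", False):
        return False
    max_days = rules.get("max_retention_days")
    if max_days is not None and data_policy.get("retention_days", float("inf")) > max_days:
        return False
    return True


def filter_targets(targets: List[Dict[str, Any]], rules: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return only compliant targets, preserving config order."""
    compliant = [t for t in targets if is_compliant(t.get("data_policy", {}), rules)]
    if not compliant and rules.get("on_no_compliant_provider") == "block":
        raise NoCompliantProvider("no target satisfies the data-routing rules")
    return compliant
```

Filtering before selection means a non-compliant target never appears in the list that fallback walks, so it cannot become a backup by accident.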

Example: one logical model, three concrete targets

The example below uses a stable group name, deterministic fallback, and provider metadata that the data-routing-policy can enforce.

pack:
  name: provider-failover-routing
  version: 1.0.0
  enabled: true

providers:
  routing:
    strategy: ordered
  fallback:
    enabled: true
    triggers:
      - rate_limit
      - server_error
      - timeout
    max_fallback_attempts: 2
  circuit_breaker:
    enabled: true
    consecutive_failure_threshold: 5
    cooldown_seconds: 30
    half_open_successes: 1
  targets:
    - id: openai-primary
      provider: openai
      model: gpt-5.4-mini
      secret_key_ref:
        env: OPENAI_API_KEY
      data_policy:
        zero_data_retention: true
        training_opt_out: true
        retention_days: 0
        allow_internet_egress: false
        local_only_processing: true
    - id: anthropic-backup
      provider: anthropic
      model: claude-sonnet-4-20250514
      secret_key_ref:
        env: ANTHROPIC_API_KEY
      data_policy:
        zero_data_retention: true
        training_opt_out: true
        retention_days: 0
        allow_internet_egress: false
        local_only_processing: true
    - id: openai-economy
      provider: openai
      model: gpt-5.4-mini-mini
      secret_key_ref:
        env: OPENAI_API_KEY
      data_policy:
        zero_data_retention: true
        training_opt_out: true
        retention_days: 0
        allow_internet_egress: false
        local_only_processing: true
  model_groups:
    - name: production-chat
      aliases: [gpt-5.4-mini]
      targets: [openai-primary, anthropic-backup]
      fallback_group: economy-chat
    - name: economy-chat
      targets: [openai-economy]

policies:
  chain:
    - data-routing-policy
    - audit-logger

policy:
  data-routing-policy:
    require_zero_data_retention: true
    require_no_training: true
    max_retention_days: 0
    allow_internet_egress: false
    local_only_processing: true
    on_no_compliant_provider: block
  audit-logger:
    retention_days: 365

This config does a few useful things at once.

The client can request production-chat instead of binding itself to a single vendor. With routing.strategy: ordered, the gateway prefers openai-primary and then moves to anthropic-backup when fallback triggers fire. If the primary group is exhausted or unavailable, fallback_group: economy-chat gives you a lower-cost recovery lane instead of a hard failure.
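The resolution order described above can be sketched in a few lines of Python. This is an illustrative model only: the MODEL_GROUPS data copies the model_groups section of the example, while resolve_targets and the rule that a fallback group's targets are appended after the primary group's targets are assumptions about how such a chain could be walked.

```python
from typing import Any, Dict, List

# Mirrors the model_groups section of the example config.
MODEL_GROUPS: List[Dict[str, Any]] = [
    {
        "name": "production-chat",
        "aliases": ["gpt-5.4-mini"],
        "targets": ["openai-primary", "anthropic-backup"],
        "fallback_group": "economy-chat",
    },
    {"name": "economy-chat", "targets": ["openai-economy"]},
]


def resolve_targets(requested_model: str, groups: List[Dict[str, Any]]) -> List[str]:
    """Map a client-facing model name to an ordered list of target ids."""
    by_name = {g["name"]: g for g in groups}
    group = by_name.get(requested_model)
    if group is None:
        # Fall back to alias lookup so older client names keep working.
        group = next((g for g in groups if requested_model in g.get("aliases", [])), None)
    if group is None:
        raise KeyError(f"unknown model or group: {requested_model}")

    ordered: List[str] = []
    visited = set()
    # Walk the fallback_group chain, guarding against accidental cycles.
    while group is not None and group["name"] not in visited:
        visited.add(group["name"])
        ordered.extend(t for t in group["targets"] if t not in ordered)
        group = by_name.get(group.get("fallback_group"))
    return ordered
```

Under these assumptions, a request for production-chat and a request for the alias gpt-5.4-mini resolve to the same candidate list, so the upstream mix can change without touching the client.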

Just as important, data-routing-policy filters the target list before selection begins. If one target loses the metadata needed for zero-retention or no-training flows, that target is excluded instead of silently becoming the backup.

How to roll this out without surprising yourself

Start with ordered routing even if you eventually want lowest_latency. Deterministic routing is easier to reason about during the first rollout because every failure path is obvious from the config order.

Once the ordered chain is stable, you can decide whether a latency-driven strategy makes sense for user-facing traffic. That is usually the point where the multi-provider resilience guidance becomes more useful than simple primary-backup thinking.

Model groups are the part that saves the most maintenance over time. They let you change the provider fleet behind one logical name. That matters when a team standardizes on production-chat, economy-chat, or another internal alias and does not want client releases every time the upstream mix changes.

Fallback also needs clear boundaries. It only helps after provider selection begins. It does not override policy decisions. If prompt injection detection blocks the request before an upstream call, there is nothing to fail over. The same is true when data-routing-policy removes every target and on_no_compliant_provider is set to block.

That is the right behavior. Resilience should never bypass enforcement.

Validation checklist

Before treating the route as production-ready, validate the config and then exercise a few concrete failure paths.

kt policy lint --file policy-config.yaml

kt gateway run \
  --listen 0.0.0.0:41002 \
  --policy-config policy-config.yaml \
  --fail-mode block

Then test these cases deliberately:

  1. Normal request to confirm the primary target serves traffic.
  2. Simulated timeout or rate-limit path to confirm fallback reaches the next target.
  3. Provider outage long enough to open the circuit breaker.
  4. Data-policy mismatch to confirm non-compliant targets are excluded instead of used as backups.
  5. Client request using the group alias rather than a vendor-specific model name.
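Before running those cases, it can help to confirm that the config is internally consistent. The Python sketch below is a hypothetical pre-flight check, separate from kt policy lint and not a description of what that command validates. It assumes the providers section has already been loaded into a plain dictionary with the same keys as the example, and it only checks cross-references between groups and targets.

```python
from typing import Any, Dict, List


def check_provider_config(providers: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: List[str] = []

    target_ids = [t["id"] for t in providers.get("targets", [])]
    for dup in sorted({i for i in target_ids if target_ids.count(i) > 1}):
        problems.append(f"duplicate target id: {dup}")

    groups = providers.get("model_groups", [])
    group_names = {g["name"] for g in groups}
    for group in groups:
        # Every target a group lists must exist in the physical inventory.
        for target_id in group.get("targets", []):
            if target_id not in target_ids:
                problems.append(
                    f"group {group['name']} references unknown target {target_id}"
                )
        # A fallback_group must name another declared group.
        fallback = group.get("fallback_group")
        if fallback is not None and fallback not in group_names:
            problems.append(
                f"group {group['name']} references unknown fallback_group {fallback}"
            )
    return problems
```

A dangling reference found this way is cheaper to fix than one discovered during a simulated outage, when it would look like a fallback failure rather than a typo.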

If the route is high risk, add output controls after routing stabilizes. Human Oversight is the cleanest review stop when a resilient route still needs a human decision before release.

What teams usually get wrong

The most common mistake is treating provider failover as a pure infrastructure topic. In Keeptrusts it is partly infrastructure, but it is also policy design. A backup target is only valid if it satisfies the same governance requirements as the primary path.

The second common mistake is exposing raw provider names to applications. That makes every future routing improvement harder than it needs to be.

The third mistake is turning on fallback but not observing it. Failover that is never tested is just an assumption written in YAML.

Key takeaways

  • Use providers.targets for concrete upstreams and providers.model_groups for stable application-facing model names.
  • Start with providers.routing.strategy: ordered so fallback behavior is predictable.
  • Add providers.fallback and providers.circuit_breaker together; retries alone do not protect you from persistent degradation.
  • Put data-routing-policy in the chain when not every provider is allowed to handle the same class of data.
  • Treat provider failover as a governance-aware routing decision, not just an uptime feature.

Next steps