Tutorial: Configuring Multi-Provider Failover
This tutorial walks you through setting up the Keeptrusts gateway with multiple LLM providers and automatic failover, so your application stays available even when a provider goes down.
Use this page when
- You need the gateway to automatically route traffic to a fallback provider when the primary goes down.
- You are configuring provider health checks with interval, timeout, and threshold settings.
- You want to verify failover behavior by simulating a provider outage.
- You are designing a high-availability LLM architecture with multiple providers.
Primary audience
- Primary: Platform engineers building high-availability AI systems
- Secondary: SRE teams defining provider SLOs; technical leaders assessing resilience trade-offs
Prerequisites
- kt CLI installed (first-run tutorial)
- API keys for at least two LLM providers (e.g., OpenAI and Anthropic)
- curl and jq installed
How Failover Works
The gateway continuously monitors provider health. When the primary provider fails, traffic automatically routes to the fallback. Once the primary recovers, traffic shifts back.
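The routing rule described above can be sketched in a few lines of Python. This is only an illustrative model, not the gateway's implementation: the `Provider` class and `pick_provider` function are hypothetical names, and the sketch assumes that "priority" means the healthy provider with the lowest priority number wins.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    """Hypothetical stand-in for one configured provider target."""
    id: str
    priority: int          # lower number = preferred
    healthy: bool = True


def pick_provider(providers):
    """Return the healthy provider with the lowest priority number, or None."""
    candidates = [p for p in providers if p.healthy]
    return min(candidates, key=lambda p: p.priority, default=None)


providers = [
    Provider("openai-primary", priority=1),
    Provider("anthropic-fallback", priority=2),
]

print(pick_provider(providers).id)   # openai-primary

providers[0].healthy = False         # primary is marked unhealthy
print(pick_provider(providers).id)   # anthropic-fallback

providers[0].healthy = True          # primary recovers, traffic shifts back
print(pick_provider(providers).id)   # openai-primary
```

If every provider is unhealthy, `pick_provider` returns `None`; in this model that corresponds to the case where no fallback is available.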
Step 1: Set Provider API Keys
Export both provider API keys:
export OPENAI_API_KEY="sk-your-openai-key"
export ANTHROPIC_API_KEY="sk-ant-your-anthropic-key"
Step 2: Create the Multi-Provider Configuration
Create policy-config.yaml:
version: '1'
providers:
  targets:
    - id: openai-primary
      provider: openai
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: anthropic-fallback
      provider: anthropic
      secret_key_ref:
        env: ANTHROPIC_API_KEY
failover:
  enabled: true
  strategy: priority
  model_mapping:
    gpt-4o-mini: claude-sonnet-4-20250514
    gpt-4o: claude-sonnet-4-20250514
  retry:
    max_attempts: 1
    timeout_seconds: 30
policies:
  - name: content-filter
    type: content_filter
    action: flag
    config:
      categories:
        - hate
        - violence
Key configuration fields
| Field | Purpose |
|---|---|
| priority | Lower number = preferred provider. Primary gets 1, fallback gets 2. |
| health_check.interval_seconds | How often to probe provider health |
| health_check.unhealthy_threshold | Consecutive failures before marking unhealthy |
| health_check.healthy_threshold | Consecutive successes before restoring |
| failover.model_mapping | Maps primary models to equivalent fallback models |
| failover.retry.max_attempts | How many fallback providers to try |
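To see how the two thresholds interact, here is a minimal Python sketch of threshold-based health tracking. The `HealthTracker` class is hypothetical and is not the gateway's code; the thresholds of 3 and 2 are illustrative values chosen to match the sample outputs later in this tutorial.

```python
class HealthTracker:
    """Hypothetical model of threshold-based health state for one provider."""

    def __init__(self, unhealthy_threshold=3, healthy_threshold=2):
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.status = "healthy"
        self.consecutive_failures = 0
        self.consecutive_successes = 0

    def record(self, probe_ok):
        """Update the counters from one probe result and return the status."""
        if probe_ok:
            self.consecutive_successes += 1
            self.consecutive_failures = 0
            if (self.status == "unhealthy"
                    and self.consecutive_successes >= self.healthy_threshold):
                self.status = "healthy"
        else:
            self.consecutive_failures += 1
            self.consecutive_successes = 0
            if (self.status == "healthy"
                    and self.consecutive_failures >= self.unhealthy_threshold):
                self.status = "unhealthy"
        return self.status


tracker = HealthTracker()
# Three failed probes, then two successful ones.
history = [tracker.record(ok) for ok in [False, False, False, True, True]]
print(history)
# ['healthy', 'healthy', 'unhealthy', 'unhealthy', 'healthy']
```

Requiring several consecutive results in each direction keeps a single slow or failed probe from flipping traffic back and forth between providers.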
Step 3: Validate and Start the Gateway
kt policy lint --file policy-config.yaml
kt gateway run --policy-config policy-config.yaml --port 41002
Expected startup output:
INFO keeptrusts::gateway Loaded 2 provider(s), 1 policy(ies)
INFO keeptrusts::gateway Provider openai-primary: priority=1, health_check=enabled
INFO keeptrusts::gateway Provider anthropic-fallback: priority=2, health_check=enabled
INFO keeptrusts::gateway Failover: strategy=priority, model_mappings=2
INFO keeptrusts::gateway Gateway ready
Step 4: Send a Request Under Normal Conditions
curl -s http://localhost:41002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}' | jq '{model: .model, provider: .keeptrusts_metadata.provider}'
Expected output:
{
"model": "gpt-4o-mini",
"provider": "openai-primary"
}
Under normal conditions, the request routes to the primary provider.
Step 5: Simulate Primary Provider Down
To simulate a primary provider failure, stop the gateway, replace the OpenAI API key with an invalid one, and restart the local gateway on the same port:
unset OPENAI_API_KEY
export OPENAI_API_KEY="sk-invalid-key-for-testing"
kt gateway run --policy-config policy-config.yaml --port 41002
Now send another request:
curl -s http://localhost:41002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}' | jq '{model: .model, provider: .keeptrusts_metadata.provider}'
Expected output:
{
"model": "claude-sonnet-4-20250514",
"provider": "anthropic-fallback"
}
The gateway detected the primary failure and routed to the fallback using the model_mapping.
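As a rough illustration of what the model_mapping lookup does to a request, consider the following Python sketch. The `rewrite_for_fallback` function is a hypothetical name, and raising an error for an unmapped model is a choice made for this sketch only; how the gateway itself handles a missing mapping is not specified here. The sketch covers only the model name, not any translation between provider request formats.

```python
# Mapping copied from failover.model_mapping in policy-config.yaml.
MODEL_MAPPING = {
    "gpt-4o-mini": "claude-sonnet-4-20250514",
    "gpt-4o": "claude-sonnet-4-20250514",
}


def rewrite_for_fallback(request, mapping):
    """Return a copy of the request with the model swapped for its fallback."""
    model = request["model"]
    if model not in mapping:
        # Assumption for this sketch: an unmapped model is treated as an error.
        raise KeyError(f"no fallback mapping for model {model!r}")
    return {**request, "model": mapping[model]}


original = {
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
}
fallback = rewrite_for_fallback(original, MODEL_MAPPING)

print(fallback["model"])   # claude-sonnet-4-20250514
print(original["model"])   # gpt-4o-mini (the original request is left unchanged)
```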
Step 6: Check Provider Health Status
View current provider health from the health endpoint:
curl -s http://localhost:41002/health | jq '.providers'
Expected output during failover:
{
"openai-primary": {
"status": "unhealthy",
"consecutive_failures": 3,
"last_check": "2026-04-23T10:25:30Z"
},
"anthropic-fallback": {
"status": "healthy",
"consecutive_failures": 0,
"last_check": "2026-04-23T10:25:30Z"
}
}
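If you want to check this payload from a script rather than by eye, a small Python helper along these lines could work. It assumes the response shape shown above (a mapping of provider id to an object with a status field); `unhealthy_providers` is a hypothetical helper name, and the sample payload is copied from the expected output.

```python
import json

# Sample payload copied from the expected output above.
providers_json = """
{
  "openai-primary": {
    "status": "unhealthy",
    "consecutive_failures": 3,
    "last_check": "2026-04-23T10:25:30Z"
  },
  "anthropic-fallback": {
    "status": "healthy",
    "consecutive_failures": 0,
    "last_check": "2026-04-23T10:25:30Z"
  }
}
"""


def unhealthy_providers(providers):
    """Return the sorted ids of providers whose status is not "healthy"."""
    return sorted(
        provider_id
        for provider_id, info in providers.items()
        if info.get("status") != "healthy"
    )


providers = json.loads(providers_json)
print(unhealthy_providers(providers))   # ['openai-primary']
```

An empty list means every provider reported healthy, which is what you should see again after Step 7.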
Step 7: Restore the Primary Provider
Restore the valid API key and restart the local gateway:
export OPENAI_API_KEY="sk-your-real-openai-key"
kt gateway run --policy-config policy-config.yaml --port 41002
After healthy_threshold (2) consecutive successful health checks (~60 seconds), the primary is restored:
curl -s http://localhost:41002/health | jq '.providers'
{
"openai-primary": {
"status": "healthy",
"consecutive_failures": 0,
"last_check": "2026-04-23T10:27:00Z"
},
"anthropic-fallback": {
"status": "healthy",
"consecutive_failures": 0,
"last_check": "2026-04-23T10:27:00Z"
}
}
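The "~60 seconds" figure is consistent with healthy_threshold multiplied by the probe interval if that interval is 30 seconds. The interval is an inference, since health_check.interval_seconds does not appear in the configuration used in this tutorial. Under that assumption, and assuming probes fire at a fixed interval, the recovery window can be estimated as follows (`recovery_window_seconds` is a hypothetical helper):

```python
def recovery_window_seconds(healthy_threshold, interval_seconds):
    """Return (best_case, worst_case) seconds until the primary is restored.

    Assumes probes fire at a fixed interval and that the provider recovers at
    an arbitrary moment between two probes. In the best case the first
    successful probe lands immediately after recovery; in the worst case a
    full interval passes before the first successful probe.
    """
    best = (healthy_threshold - 1) * interval_seconds
    worst = healthy_threshold * interval_seconds
    return best, worst


print(recovery_window_seconds(2, 30))   # (30, 60)
print(recovery_window_seconds(1, 30))   # (0, 30)
```

The second call shows why lowering healthy_threshold to 1 shortens recovery: the primary is restored on the first successful probe.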
Step 8: Monitor Failover Latency
Failover adds latency from the failed primary attempt plus the fallback request. Track this with event tailing:
kt events tail --last 5 --format json | jq '.[] | {id, provider: .provider, latency_ms, failover: .failover}'
Expected output during failover:
{
"id": "evt_abc123",
"provider": "anthropic-fallback",
"latency_ms": 1850,
"failover": {
"triggered": true,
"original_provider": "openai-primary",
"reason": "upstream_timeout"
}
}
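To compare failover latency against direct latency across several events, you could summarize the event list with a short Python script. This sketch assumes each event has the fields shown above; the first sample event is copied from the expected output, while the other two events, their ids, and the `failover: None` shape for a non-failover event are hypothetical. `summarize` is a hypothetical helper name.

```python
from statistics import mean

events = [
    # Copied from the expected output above.
    {"id": "evt_abc123", "provider": "anthropic-fallback", "latency_ms": 1850,
     "failover": {"triggered": True, "original_provider": "openai-primary",
                  "reason": "upstream_timeout"}},
    # Hypothetical events, added only to make the summary meaningful.
    {"id": "evt_example_direct", "provider": "openai-primary",
     "latency_ms": 620, "failover": None},
    {"id": "evt_example_failover", "provider": "anthropic-fallback",
     "latency_ms": 2150,
     "failover": {"triggered": True, "original_provider": "openai-primary",
                  "reason": "upstream_timeout"}},
]


def summarize(events):
    """Split events into failover and direct requests and average each group."""
    def failed_over(event):
        # Treat a missing or null failover field as "no failover".
        return bool((event.get("failover") or {}).get("triggered"))

    failover_ms = [e["latency_ms"] for e in events if failed_over(e)]
    direct_ms = [e["latency_ms"] for e in events if not failed_over(e)]
    return {
        "total": len(events),
        "failovers": len(failover_ms),
        "mean_failover_ms": mean(failover_ms) if failover_ms else None,
        "mean_direct_ms": mean(direct_ms) if direct_ms else None,
    }


print(summarize(events))
```

The gap between the two means approximates the extra latency that a failed primary attempt adds to each request.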
For AI systems
- Canonical terms: Keeptrusts gateway, failover, multi-provider, health check, priority routing, fallback provider.
- Config fields: providers[].priority, providers[].health_check.enabled, health_check.interval_seconds, health_check.timeout_seconds, health_check.unhealthy_threshold, health_check.healthy_threshold, failover.enabled, failover.strategy: priority.
- CLI commands: kt gateway run, kt policy lint, kt events tail, curl http://localhost:41002/health.
- Best next pages: Circuit Breaker Config, Gateway Health Monitoring, Model Routing A/B Test.
For engineers
- Prerequisites: kt CLI, API keys for two+ providers, curl and jq.
- Validate: kt policy lint confirms provider priorities and health check settings.
- Simulate failover: set primary base_url to an unreachable host, watch health checks mark it unhealthy, verify requests route to fallback.
- Monitor: /health endpoint shows per-provider reachability; kt events tail shows which provider handled each request.
- Recovery: when primary health checks pass healthy_threshold, traffic auto-shifts back.
For leaders
- Multi-provider failover protects against single-vendor outages — no manual intervention required.
- Health checks detect issues before users are impacted; recovery is automatic.
- Budget consideration: fallback providers may have different pricing; model the cost of running on secondary during outages.
- Provider diversification reduces single-vendor lock-in risk.
Next steps
- Set up rate limits per team to control usage across providers
- Configure cost tracking to monitor spend across providers
- Tail events to debug failover decisions
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Failover not triggering | Health check not enabled | Set health_check.enabled: true on the primary |
| 503 returned despite fallback | Fallback also unhealthy | Check fallback API key and health status |
| Model mismatch on failover | Missing model mapping | Add entry to failover.model_mapping |
| Slow recovery after primary restores | High healthy_threshold | Lower to 1 or reduce interval_seconds |