Tutorial: Configuring Multi-Provider Failover

This tutorial walks you through setting up the Keeptrusts gateway with multiple LLM providers and automatic failover, so your application stays available even when a provider goes down.

Use this page when

  • You need the gateway to automatically route traffic to a fallback provider when the primary goes down.
  • You are configuring provider health checks with interval, timeout, and threshold settings.
  • You want to verify failover behavior by simulating a provider outage.
  • You are designing a high-availability LLM architecture with multiple providers.

Primary audience

  • Primary: Platform engineers building high-availability AI systems
  • Secondary: SRE teams defining provider SLOs; technical leaders assessing resilience trade-offs

Prerequisites

  • kt CLI installed (first-run tutorial)
  • API keys for at least two LLM providers (e.g., OpenAI and Anthropic)
  • curl and jq installed

How Failover Works

The gateway continuously monitors provider health. When the primary provider fails, traffic automatically routes to the fallback. Once the primary recovers, traffic shifts back.
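
The routing rule described above can be sketched as a small selection function. This is a minimal illustration under stated assumptions, not the gateway's implementation: `Provider` and `select_provider` are hypothetical names, and the model assumes the priority convention used in this tutorial (lower number = preferred).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Provider:
    """Hypothetical in-memory view of one configured provider."""
    id: str
    priority: int          # lower number = preferred
    healthy: bool = True


def select_provider(providers: List[Provider]) -> Optional[Provider]:
    """Return the healthy provider with the lowest priority number, or None."""
    healthy = [p for p in providers if p.healthy]
    return min(healthy, key=lambda p: p.priority, default=None)


providers = [
    Provider("openai-primary", priority=1),
    Provider("anthropic-fallback", priority=2),
]

print(select_provider(providers).id)   # openai-primary
providers[0].healthy = False           # primary goes down
print(select_provider(providers).id)   # anthropic-fallback
providers[0].healthy = True            # primary recovers
print(select_provider(providers).id)   # openai-primary
```

If no provider is healthy the function returns `None`, which corresponds to the case where the gateway has nothing left to route to.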

Step 1: Set Provider API Keys

Export both provider API keys:

export OPENAI_API_KEY="sk-your-openai-key"
export ANTHROPIC_API_KEY="sk-ant-your-anthropic-key"

Step 2: Create the Multi-Provider Configuration

Create policy-config.yaml:

version: '1'
providers:
  targets:
    - id: openai-primary
      provider: openai
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: anthropic-fallback
      provider: anthropic
      secret_key_ref:
        env: ANTHROPIC_API_KEY
failover:
  enabled: true
  strategy: priority
  model_mapping:
    gpt-4o-mini: claude-sonnet-4-20250514
    gpt-4o: claude-sonnet-4-20250514
  retry:
    max_attempts: 1
    timeout_seconds: 30
policies:
  - name: content-filter
    type: content_filter
    action: flag
    config:
      categories:
        - hate
        - violence

Key configuration fields

| Field | Purpose |
| --- | --- |
| priority | Lower number = preferred provider. Primary gets 1, fallback gets 2. |
| health_check.interval_seconds | How often to probe provider health |
| health_check.unhealthy_threshold | Consecutive failures before marking unhealthy |
| health_check.healthy_threshold | Consecutive successes before restoring |
| failover.model_mapping | Maps primary models to equivalent fallback models |
| failover.retry.max_attempts | How many fallback providers to try |
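
To make the threshold fields concrete, the following sketch models how consecutive probe results could flip a provider's status. It is an illustration under stated assumptions rather than the gateway's actual logic: `HealthTracker` is a hypothetical name, and the default thresholds (3 and 2) are inferred from the example outputs in Steps 6 and 7, not from documented defaults.

```python
class HealthTracker:
    """Hypothetical model of threshold-based health state for one provider."""

    def __init__(self, unhealthy_threshold: int = 3, healthy_threshold: int = 2):
        # Assumed values; see the lead-in for where they come from.
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.status = "healthy"
        self.consecutive_failures = 0
        self.consecutive_successes = 0

    def record(self, ok: bool) -> str:
        """Record one probe result and return the resulting status."""
        if ok:
            self.consecutive_successes += 1
            self.consecutive_failures = 0
            if (self.status == "unhealthy"
                    and self.consecutive_successes >= self.healthy_threshold):
                self.status = "healthy"
        else:
            self.consecutive_failures += 1
            self.consecutive_successes = 0
            if (self.status == "healthy"
                    and self.consecutive_failures >= self.unhealthy_threshold):
                self.status = "unhealthy"
        return self.status


tracker = HealthTracker()
probes = [False, False, False, True, True]
history = [tracker.record(ok) for ok in probes]
print(history)
# ['healthy', 'healthy', 'unhealthy', 'unhealthy', 'healthy']
```

Requiring several consecutive results in each direction keeps a single slow or failed probe from flapping traffic between providers, at the cost of slower detection and recovery.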

Step 3: Validate and Start the Gateway

kt policy lint --file policy-config.yaml
kt gateway run --policy-config policy-config.yaml --port 41002

Expected startup output:

INFO keeptrusts::gateway Loaded 2 provider(s), 1 policy(ies)
INFO keeptrusts::gateway Provider openai-primary: priority=1, health_check=enabled
INFO keeptrusts::gateway Provider anthropic-fallback: priority=2, health_check=enabled
INFO keeptrusts::gateway Failover: strategy=priority, model_mappings=2
INFO keeptrusts::gateway Gateway ready

Step 4: Send a Request Under Normal Conditions

curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }' | jq '{model: .model, provider: .keeptrusts_metadata.provider}'

Expected output:

{
  "model": "gpt-4o-mini",
  "provider": "openai-primary"
}

Under normal conditions, the request routes to the primary provider.
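
If you prefer to run the same check from application code, the curl and jq pipeline can be approximated with the Python standard library. This is a sketch, not an official client: the function names are hypothetical, and it assumes the response carries the `model` and `keeptrusts_metadata.provider` fields that the jq filter reads.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:41002/v1/chat/completions"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        GATEWAY_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def routing_summary(response: dict) -> dict:
    """Mirror the jq filter: keep the model and the serving provider."""
    metadata = response.get("keeptrusts_metadata") or {}
    return {"model": response.get("model"), "provider": metadata.get("provider")}


def ask(model: str, prompt: str, timeout: float = 30.0) -> dict:
    """Send one chat completion through the gateway and summarise routing."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return routing_summary(json.load(resp))
```

With the gateway running, calling `ask("gpt-4o-mini", "What is the capital of France?")` should return the same two fields as the jq output above.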

Step 5: Simulate Primary Provider Down

To simulate a primary provider failure, stop the gateway, replace the OpenAI API key with an invalid value, and restart the local gateway on the same port:

export OPENAI_API_KEY="sk-invalid-key-for-testing"
kt gateway run --policy-config policy-config.yaml --port 41002

Now send another request:

curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }' | jq '{model: .model, provider: .keeptrusts_metadata.provider}'

Expected output:

{
  "model": "claude-sonnet-4-20250514",
  "provider": "anthropic-fallback"
}

The gateway detected the primary failure and routed the request to the fallback, translating gpt-4o-mini to claude-sonnet-4-20250514 through the failover.model_mapping entry.
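
The model translation step can be pictured as a dictionary lookup. The sketch below uses the two mappings from Step 2. `map_to_fallback` is a hypothetical helper, and raising an error for an unmapped model is an assumption: the tutorial does not specify how the gateway itself reacts to a missing entry.

```python
# Mappings copied from failover.model_mapping in policy-config.yaml.
MODEL_MAPPING = {
    "gpt-4o-mini": "claude-sonnet-4-20250514",
    "gpt-4o": "claude-sonnet-4-20250514",
}


def map_to_fallback(requested_model: str, mapping: dict = MODEL_MAPPING) -> str:
    """Return the fallback model for a requested primary model.

    Raises LookupError when no mapping exists (assumed behaviour,
    chosen here only to make a missing entry visible).
    """
    try:
        return mapping[requested_model]
    except KeyError:
        raise LookupError(
            f"no failover.model_mapping entry for {requested_model!r}"
        ) from None


print(map_to_fallback("gpt-4o-mini"))  # claude-sonnet-4-20250514
```

Checking every model your application requests against the mapping before an outage is a cheap way to avoid the "Model mismatch on failover" symptom listed under Troubleshooting.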

Step 6: Check Provider Health Status

View current provider health from the health endpoint:

curl -s http://localhost:41002/health | jq '.providers'

Expected output during failover:

{
  "openai-primary": {
    "status": "unhealthy",
    "consecutive_failures": 3,
    "last_check": "2026-04-23T10:25:30Z"
  },
  "anthropic-fallback": {
    "status": "healthy",
    "consecutive_failures": 0,
    "last_check": "2026-04-23T10:25:30Z"
  }
}
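
When scripting a failover drill, it can help to poll the health endpoint until a provider reaches the state you expect. The following is a minimal sketch using only the standard library. The helper names are hypothetical, and it assumes the `/health` payload has the `providers` object shown above.

```python
import json
import time
import urllib.request

HEALTH_URL = "http://localhost:41002/health"


def provider_statuses(health: dict) -> dict:
    """Reduce the /health payload to {provider_id: status}."""
    providers = health.get("providers", {})
    return {pid: info.get("status") for pid, info in providers.items()}


def fetch_health(url: str = HEALTH_URL, timeout: float = 5.0) -> dict:
    """GET the health endpoint and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def wait_for_status(provider_id: str, expected: str, fetch=fetch_health,
                    poll_seconds: float = 5.0, max_wait_seconds: float = 120.0,
                    sleep=time.sleep) -> bool:
    """Poll until provider_id reports the expected status or time runs out."""
    waited = 0.0
    while True:
        if provider_statuses(fetch()).get(provider_id) == expected:
            return True
        if waited >= max_wait_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
```

For example, `wait_for_status("openai-primary", "unhealthy")` blocks until the primary is marked unhealthy or two minutes have passed. The `fetch` and `sleep` parameters exist so the loop can be exercised without a running gateway.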

Step 7: Restore the Primary Provider

Restore the valid API key and restart the local gateway:

export OPENAI_API_KEY="sk-your-real-openai-key"
kt gateway run --policy-config policy-config.yaml --port 41002

After healthy_threshold (2) consecutive successful health checks (~60 seconds), the primary is restored:

curl -s http://localhost:41002/health | jq '.providers'

Expected output after recovery:

{
  "openai-primary": {
    "status": "healthy",
    "consecutive_failures": 0,
    "last_check": "2026-04-23T10:27:00Z"
  },
  "anthropic-fallback": {
    "status": "healthy",
    "consecutive_failures": 0,
    "last_check": "2026-04-23T10:27:00Z"
  }
}
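
The "~60 seconds" figure is consistent with a healthy_threshold of 2 and a 30-second probe interval. The interval value is an assumption here, because the Step 2 configuration does not show health_check.interval_seconds. A rough estimate, ignoring probe duration and where in the interval the provider actually recovers:

```python
def recovery_delay_seconds(healthy_threshold: int, interval_seconds: float) -> float:
    """Approximate time for a recovered provider to be marked healthy again."""
    return healthy_threshold * interval_seconds


print(recovery_delay_seconds(2, 30))  # 60
print(recovery_delay_seconds(1, 30))  # 30
```

The same arithmetic explains the two fixes suggested for slow recovery under Troubleshooting: lowering the threshold or shortening the interval both shrink the product.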

Step 8: Monitor Failover Latency

Failover adds latency from the failed primary attempt plus the fallback request. Track this with event tailing:

kt events tail --last 5 --format json | jq '.[] | {id, provider: .provider, latency_ms, failover: .failover}'

Expected output during failover:

{
  "id": "evt_abc123",
  "provider": "anthropic-fallback",
  "latency_ms": 1850,
  "failover": {
    "triggered": true,
    "original_provider": "openai-primary",
    "reason": "upstream_timeout"
  }
}
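
To compare failover latency against direct requests, the event records can be aggregated in a few lines. This sketch assumes each event has the `latency_ms` and `failover.triggered` fields shown above, and that events served directly either omit `failover` or set `triggered` to false. `summarize_failover` and the second and third sample events are hypothetical.

```python
from statistics import mean


def summarize_failover(events: list) -> dict:
    """Split events by whether failover was triggered and average latency."""
    def triggered(event: dict) -> bool:
        return bool((event.get("failover") or {}).get("triggered"))

    failover = [e["latency_ms"] for e in events if triggered(e)]
    direct = [e["latency_ms"] for e in events if not triggered(e)]
    return {
        "total": len(events),
        "failover_count": len(failover),
        "failover_mean_ms": mean(failover) if failover else None,
        "direct_mean_ms": mean(direct) if direct else None,
    }


events = [
    # First event copied from the expected output above.
    {"id": "evt_abc123", "provider": "anthropic-fallback", "latency_ms": 1850,
     "failover": {"triggered": True, "original_provider": "openai-primary",
                  "reason": "upstream_timeout"}},
    # Hypothetical direct requests, for illustration only.
    {"id": "evt_def456", "provider": "openai-primary", "latency_ms": 620,
     "failover": {"triggered": False}},
    {"id": "evt_ghi789", "provider": "openai-primary", "latency_ms": 580},
]

summary = summarize_failover(events)
print(summary["failover_count"], summary["failover_mean_ms"], summary["direct_mean_ms"])
```

To feed it real data, load the array printed by `kt events tail --last 5 --format json` with `json.load` and pass the result to `summarize_failover`.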

For AI systems

  • Canonical terms: Keeptrusts gateway, failover, multi-provider, health check, priority routing, fallback provider.
  • Config fields: providers[].priority, providers[].health_check.enabled, health_check.interval_seconds, health_check.timeout_seconds, health_check.unhealthy_threshold, health_check.healthy_threshold, failover.enabled, failover.strategy: priority.
  • CLI commands: kt gateway run, kt policy lint, kt events tail, curl http://localhost:41002/health.
  • Best next pages: Circuit Breaker Config, Gateway Health Monitoring, Model Routing A/B Test.

For engineers

  • Prerequisites: kt CLI, API keys for two+ providers, curl and jq.
  • Validate: kt policy lint confirms provider priorities and health check settings.
  • Simulate failover: give the primary an invalid API key (as in Step 5) or point its base_url at an unreachable host, watch health checks mark it unhealthy, then verify requests route to the fallback.
  • Monitor: /health endpoint shows per-provider reachability; kt events tail shows which provider handled each request.
  • Recovery: when primary health checks pass healthy_threshold, traffic auto-shifts back.

For leaders

  • Multi-provider failover protects against single-vendor outages without manual intervention.
  • Health checks detect issues before users are impacted; recovery is automatic.
  • Budget consideration: fallback providers may have different pricing; model the cost of running on secondary during outages.
  • Provider diversification reduces single-vendor lock-in risk.

Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| Failover not triggering | Health check not enabled | Set health_check.enabled: true on the primary |
| 503 returned despite fallback | Fallback also unhealthy | Check fallback API key and health status |
| Model mismatch on failover | Missing model mapping | Add entry to failover.model_mapping |
| Slow recovery after primary restores | High healthy_threshold | Lower to 1 or reduce interval_seconds |