Tutorial: Configuring Multi-Provider Failover

This tutorial walks you through setting up the Keeptrusts gateway with multiple LLM providers and automatic failover, so your application stays available even when a provider goes down.

Use this page when

You need the gateway to automatically route traffic to a fallback provider when the primary goes down.
You are configuring provider health checks with interval, timeout, and threshold settings.
You want to verify failover behavior by simulating a provider outage.
You are designing a high-availability LLM architecture with multiple providers.

Primary audience

Primary: Platform engineers building high-availability AI systems
Secondary: SRE teams defining provider SLOs; technical leaders assessing resilience trade-offs

Prerequisites

kt CLI installed (see the first-run tutorial)
API keys for at least two LLM providers (e.g., OpenAI and Anthropic)
curl and jq installed

How Failover Works

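The gateway probes each provider's health at a fixed interval. After a configured number of consecutive failed probes, it marks the primary unhealthy and routes traffic to the fallback. After a configured number of consecutive successful probes, it marks the primary healthy again and traffic shifts back.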
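
One way to picture this monitoring is a small state machine driven by consecutive-failure and consecutive-success counts (the unhealthy_threshold and healthy_threshold settings used in this tutorial). The sketch below is illustrative only: the class name HealthTracker is hypothetical, the default thresholds (3 and 2) are taken from values that appear in later steps, and the gateway's real implementation may differ.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Illustrative threshold-based health state (not the gateway's actual code)."""

    unhealthy_threshold: int = 3  # consecutive failures before marking unhealthy
    healthy_threshold: int = 2    # consecutive successes before restoring
    healthy: bool = True
    consecutive_failures: int = 0
    consecutive_successes: int = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if probe_ok:
            self.consecutive_successes += 1
            self.consecutive_failures = 0
            if not self.healthy and self.consecutive_successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self.consecutive_failures += 1
            self.consecutive_successes = 0
            if self.healthy and self.consecutive_failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy


# Three failed probes take the provider out of rotation; two successes restore it.
tracker = HealthTracker()
states = [tracker.record(ok) for ok in [False, False, False, True, True]]
print(states)  # [True, True, False, False, True]
```

Requiring several consecutive results in each direction keeps a single slow or dropped probe from flipping traffic back and forth.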

Step 1: Set Provider API Keys

Export both provider API keys:

export OPENAI_API_KEY="sk-your-openai-key"
export ANTHROPIC_API_KEY="sk-ant-your-anthropic-key"

Step 2: Create the Multi-Provider Configuration

Create policy-config.yaml:

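# NOTE: the priority and health_check entries are inferred from the field table
# and the outputs shown later in this tutorial (priority 1/2, healthy_threshold 2,
# about 60 seconds to recover, 3 consecutive failures). Their placement under
# each target is an assumption; confirm with kt policy lint.
version: '1'
providers:
  targets:
  - id: openai-primary
    provider: openai
    priority: 1
    secret_key_ref:
      env: OPENAI_API_KEY
    health_check:
      enabled: true
      interval_seconds: 30
      unhealthy_threshold: 3
      healthy_threshold: 2
  - id: anthropic-fallback
    provider: anthropic
    priority: 2
    secret_key_ref:
      env: ANTHROPIC_API_KEY
    health_check:
      enabled: true
      interval_seconds: 30
      unhealthy_threshold: 3
      healthy_threshold: 2
failover:
  enabled: true
  strategy: priority
  model_mapping:
    gpt-4o-mini: claude-sonnet-4-20250514
    gpt-4o: claude-sonnet-4-20250514
  retry:
    max_attempts: 1
    timeout_seconds: 30
policies:
- name: content-filter
  type: content_filter
  action: flag
  config:
    categories:
    - hate
    - violence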

Key configuration fields

Field	Purpose
`priority`	Lower number = preferred provider. Primary gets `1`, fallback gets `2`.
`health_check.interval_seconds`	How often to probe provider health
`health_check.unhealthy_threshold`	Consecutive failures before marking unhealthy
`health_check.healthy_threshold`	Consecutive successes before restoring
`failover.model_mapping`	Maps primary models to equivalent fallback models
`failover.retry.max_attempts`	How many fallback providers to try
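
To see how priority and failover.model_mapping interact, the following sketch picks a route for a request. It is a simplified model, not the gateway's code: the function select_route and the in-memory PROVIDERS list are hypothetical, and it assumes the mapping applies only when a request leaves the primary and that an unmapped model name is passed through unchanged.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory view of the configuration from this tutorial.
PROVIDERS: List[dict] = [
    {"id": "openai-primary", "priority": 1},
    {"id": "anthropic-fallback", "priority": 2},
]
MODEL_MAPPING: Dict[str, str] = {
    "gpt-4o-mini": "claude-sonnet-4-20250514",
    "gpt-4o": "claude-sonnet-4-20250514",
}


def select_route(
    model: str,
    providers: List[dict],
    healthy: Dict[str, bool],
    model_mapping: Dict[str, str],
) -> Optional[Tuple[str, str]]:
    """Return (provider_id, model) for the healthy provider with the lowest priority number."""
    candidates = sorted(
        (p for p in providers if healthy.get(p["id"], False)),
        key=lambda p: p["priority"],
    )
    if not candidates:
        return None  # no healthy provider is available
    chosen = candidates[0]
    primary_id = min(providers, key=lambda p: p["priority"])["id"]
    if chosen["id"] == primary_id:
        return chosen["id"], model
    # Assumption: translate the model name only when falling back.
    return chosen["id"], model_mapping.get(model, model)


all_up = {"openai-primary": True, "anthropic-fallback": True}
primary_down = {"openai-primary": False, "anthropic-fallback": True}
print(select_route("gpt-4o-mini", PROVIDERS, all_up, MODEL_MAPPING))
print(select_route("gpt-4o-mini", PROVIDERS, primary_down, MODEL_MAPPING))
```

The first call keeps the request on openai-primary with the original model; the second moves it to anthropic-fallback with the mapped model.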

Step 3: Validate and Start the Gateway

kt policy lint --file policy-config.yaml
kt gateway run --policy-config policy-config.yaml --port 41002

Expected startup output:

INFO  keeptrusts::gateway Loaded 2 provider(s), 1 policy(ies)
INFO  keeptrusts::gateway Provider openai-primary: priority=1, health_check=enabled
INFO  keeptrusts::gateway Provider anthropic-fallback: priority=2, health_check=enabled
INFO  keeptrusts::gateway Failover: strategy=priority, model_mappings=2
INFO  keeptrusts::gateway Gateway ready

Step 4: Send a Request Under Normal Conditions

curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }' | jq '{model: .model, provider: .keeptrusts_metadata.provider}'

Expected output:

{
  "model": "gpt-4o-mini",
  "provider": "openai-primary"
}

Under normal conditions, the request routes to the primary provider.
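
If your application calls the gateway from code instead of curl, the same check could look like the sketch below. It uses only the Python standard library and mirrors the curl and jq commands above; the helper names build_request, summarize, and ask are hypothetical, and the keeptrusts_metadata.provider field is assumed to be present as in the expected output.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:41002/v1/chat/completions"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        GATEWAY_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(response_body: dict) -> dict:
    """Extract the fields the jq filter selects: model and serving provider."""
    metadata = response_body.get("keeptrusts_metadata") or {}
    return {"model": response_body.get("model"), "provider": metadata.get("provider")}


def ask(model: str, prompt: str, timeout: float = 30.0) -> dict:
    """Send one chat completion through the gateway (requires a running gateway)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return summarize(json.load(resp))


# Usage, with the gateway from Step 3 running:
#   print(ask("gpt-4o-mini", "What is the capital of France?"))
```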

Step 5: Simulate Primary Provider Down

To simulate a primary provider failure, stop the running gateway, replace the OpenAI API key with an invalid one, and restart the gateway on the same port:

export OPENAI_API_KEY="sk-invalid-key-for-testing"
kt gateway run --policy-config policy-config.yaml --port 41002

Now send another request:

curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }' | jq '{model: .model, provider: .keeptrusts_metadata.provider}'

Expected output:

{
  "model": "claude-sonnet-4-20250514",
  "provider": "anthropic-fallback"
}

The gateway detected the primary failure and routed the request to the fallback, translating gpt-4o-mini to claude-sonnet-4-20250514 through model_mapping.
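
The request-level behavior in this step can be sketched as a try-primary-then-fallback loop. This is a conceptual model under stated assumptions, not the gateway's implementation: call_with_failover, ProviderError, and the injected send callable are hypothetical names, and max_attempts is read as the number of fallback providers to try, as described in the field table.

```python
from typing import Callable, Dict, List


class ProviderError(Exception):
    """Raised by the hypothetical send callable when a provider request fails."""


def call_with_failover(
    model: str,
    providers_by_priority: List[str],
    model_mapping: Dict[str, str],
    send: Callable[[str, str], dict],
    max_attempts: int = 1,
) -> dict:
    """Try the primary first, then up to max_attempts fallbacks with the mapped model."""
    primary, fallbacks = providers_by_priority[0], providers_by_priority[1:]
    try:
        return send(primary, model)
    except ProviderError as exc:
        last_error = exc
    fallback_model = model_mapping.get(model, model)
    for provider_id in fallbacks[:max_attempts]:
        try:
            return send(provider_id, fallback_model)
        except ProviderError as exc:
            last_error = exc
    raise last_error  # every attempted provider failed


def fake_send(provider_id: str, model: str) -> dict:
    """Stand-in for a real provider call: the primary rejects the invalid key."""
    if provider_id == "openai-primary":
        raise ProviderError("invalid API key")
    return {"model": model, "provider": provider_id}


result = call_with_failover(
    "gpt-4o-mini",
    ["openai-primary", "anthropic-fallback"],
    {"gpt-4o-mini": "claude-sonnet-4-20250514"},
    fake_send,
)
print(result)  # {'model': 'claude-sonnet-4-20250514', 'provider': 'anthropic-fallback'}
```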

Step 6: Check Provider Health Status

View current provider health from the health endpoint:

curl -s http://localhost:41002/health | jq '.providers'

Expected output during failover:

{
  "openai-primary": {
    "status": "unhealthy",
    "consecutive_failures": 3,
    "last_check": "2026-04-23T10:25:30Z"
  },
  "anthropic-fallback": {
    "status": "healthy",
    "consecutive_failures": 0,
    "last_check": "2026-04-23T10:25:30Z"
  }
}
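
For scripted checks, for example in a smoke test, the providers object can be reduced to a list of unhealthy provider IDs. The sketch below works on the sample output shown above; the function name unhealthy_providers is hypothetical, and it assumes every provider entry carries a status field.

```python
import json

# Sample copied from the expected output during failover.
SAMPLE_HEALTH = """
{
  "openai-primary": {
    "status": "unhealthy",
    "consecutive_failures": 3,
    "last_check": "2026-04-23T10:25:30Z"
  },
  "anthropic-fallback": {
    "status": "healthy",
    "consecutive_failures": 0,
    "last_check": "2026-04-23T10:25:30Z"
  }
}
"""


def unhealthy_providers(providers: dict) -> list:
    """Return the sorted IDs of providers whose status is not 'healthy'."""
    return sorted(
        name for name, info in providers.items() if info.get("status") != "healthy"
    )


print(unhealthy_providers(json.loads(SAMPLE_HEALTH)))  # ['openai-primary']
```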

Step 7: Restore the Primary Provider

Stop the gateway, restore the valid API key, and restart the gateway on the same port:

export OPENAI_API_KEY="sk-your-openai-key"
kt gateway run --policy-config policy-config.yaml --port 41002

After healthy_threshold (2) consecutive successful health checks (about 60 seconds with interval_seconds set to 30), the primary is marked healthy again. Confirm with the health endpoint:

curl -s http://localhost:41002/health | jq '.providers'

Expected output after recovery:

{
  "openai-primary": {
    "status": "healthy",
    "consecutive_failures": 0,
    "last_check": "2026-04-23T10:27:00Z"
  },
  "anthropic-fallback": {
    "status": "healthy",
    "consecutive_failures": 0,
    "last_check": "2026-04-23T10:27:00Z"
  }
}
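
Rather than re-running curl by hand, you can poll the health endpoint until the primary recovers. The following is a minimal sketch: wait_for_healthy and fetch_providers_from_gateway are hypothetical helpers, the polling interval and maximum wait are arbitrary choices, and the fetch function is injected so the loop can be exercised without a running gateway.

```python
import json
import time
import urllib.request
from typing import Callable


def fetch_providers_from_gateway(url: str = "http://localhost:41002/health") -> dict:
    """Read the providers object from the health endpoint (requires a running gateway)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp).get("providers", {})


def wait_for_healthy(
    provider_id: str,
    fetch_providers: Callable[[], dict],
    poll_seconds: float = 10.0,
    max_wait_seconds: float = 180.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until provider_id reports 'healthy'; return False once max_wait_seconds has elapsed."""
    waited = 0.0
    while True:
        status = fetch_providers().get(provider_id, {}).get("status")
        if status == "healthy":
            return True
        if waited >= max_wait_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds


# Usage, with the gateway running:
#   wait_for_healthy("openai-primary", fetch_providers_from_gateway)
```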

Step 8: Monitor Failover Latency

Failover adds latency from the failed primary attempt plus the fallback request. Track this with event tailing:

kt events tail --last 5 --format json | jq '.[] | {id, provider: .provider, latency_ms, failover: .failover}'

Expected output during failover:

{
  "id": "evt_abc123",
  "provider": "anthropic-fallback",
  "latency_ms": 1850,
  "failover": {
    "triggered": true,
    "original_provider": "openai-primary",
    "reason": "upstream_timeout"
  }
}
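
To turn individual events into a latency picture, you can aggregate them by whether failover was triggered. The sketch below assumes events shaped like the expected output above; summarize_failover is a hypothetical helper, and the second sample event (evt_example_direct, 620 ms) is invented purely for illustration.

```python
from statistics import mean
from typing import List, Optional


def _avg_latency(events: List[dict]) -> Optional[float]:
    """Average latency_ms over the events, or None when the list is empty."""
    return mean(e["latency_ms"] for e in events) if events else None


def summarize_failover(events: List[dict]) -> dict:
    """Split events by failover.triggered and compare their average latency."""
    failed_over, direct = [], []
    for event in events:
        triggered = (event.get("failover") or {}).get("triggered", False)
        (failed_over if triggered else direct).append(event)
    total = len(events)
    return {
        "total": total,
        "failover_count": len(failed_over),
        "failover_rate": len(failed_over) / total if total else 0.0,
        "avg_latency_failover_ms": _avg_latency(failed_over),
        "avg_latency_direct_ms": _avg_latency(direct),
    }


sample_events = [
    {   # event from the expected output above
        "id": "evt_abc123",
        "provider": "anthropic-fallback",
        "latency_ms": 1850,
        "failover": {
            "triggered": True,
            "original_provider": "openai-primary",
            "reason": "upstream_timeout",
        },
    },
    {   # hypothetical direct request, values invented for illustration
        "id": "evt_example_direct",
        "provider": "openai-primary",
        "latency_ms": 620,
        "failover": None,
    },
]
summary = summarize_failover(sample_events)
print(summary)
```

Comparing the two averages shows how much extra latency the failed primary attempt adds on top of the fallback request.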

For AI systems

Canonical terms: Keeptrusts gateway, failover, multi-provider, health check, priority routing, fallback provider.
Config fields: providers[].priority, providers[].health_check.enabled, health_check.interval_seconds, health_check.timeout_seconds, health_check.unhealthy_threshold, health_check.healthy_threshold, failover.enabled, failover.strategy: priority.
CLI commands: kt gateway run, kt policy lint, kt events tail, curl http://localhost:41002/health.
Best next pages: Circuit Breaker Config, Gateway Health Monitoring, Model Routing A/B Test.

For engineers

Prerequisites: kt CLI, API keys for two+ providers, curl and jq.
Validate: kt policy lint confirms provider priorities and health check settings.
Simulate failover: give the primary an invalid API key and restart the gateway (Step 5), then confirm through /health and a test request that traffic routes to the fallback.
Monitor: /health endpoint shows per-provider reachability; kt events tail shows which provider handled each request.
Recovery: when primary health checks pass healthy_threshold, traffic auto-shifts back.

For leaders

Multi-provider failover protects against single-vendor outages — no manual intervention required.
Health checks detect issues before users are impacted; recovery is automatic.
Budget consideration: fallback providers may have different pricing; model the cost of running on secondary during outages.
Provider diversification reduces single-vendor lock-in risk.

Next steps

Set up rate limits per team to control usage across providers
Configure cost tracking to monitor spend across providers
Tail events to debug failover decisions

Troubleshooting

Symptom	Cause	Fix
Failover not triggering	Health check not enabled	Set `health_check.enabled: true` on the primary
503 returned despite fallback	Fallback also unhealthy	Check fallback API key and health status
Model mismatch on failover	Missing model mapping	Add entry to `failover.model_mapping`
Slow recovery after primary restores	High `healthy_threshold`	Lower to `1` or reduce `interval_seconds`
