Tutorial: Configuring Circuit Breakers for Provider Outages
This tutorial walks you through configuring circuit breakers in the Keeptrusts gateway to automatically detect provider failures, stop sending traffic to unhealthy providers, and gradually recover when the provider comes back online.
Use this page when
- You need the gateway to stop routing to an unhealthy LLM provider automatically.
- You are configuring failure thresholds, recovery timeouts, and half-open probe logic.
- You want automatic failover to a secondary provider when the primary circuit opens.
- You are hardening a production gateway against intermittent provider outages.
Primary audience
- Primary: Platform engineers building resilient AI gateway deployments
- Secondary: SRE teams defining provider SLOs; technical leaders assessing uptime guarantees for LLM-powered features
Prerequisites
- kt CLI installed (first-run tutorial)
- An OpenAI-compatible API key exported as OPENAI_API_KEY
- A secondary provider key (e.g., ANTHROPIC_API_KEY) for failover testing
- curl and jq installed
How Circuit Breakers Work
A circuit breaker tracks provider health through three states:
[Closed] ──failures exceed threshold──▶ [Open]
                                           │
                                   recovery_timeout
                                           │
                                           ▼
                                      [Half-Open]
                                       │       │
                               success ▼       ▼ failure
                                  [Closed]   [Open]
| State | Behavior |
|---|---|
| Closed | Normal operation — requests flow to the provider |
| Open | Provider marked unhealthy — requests fail fast or route to fallback |
| Half-Open | A limited number of test requests are sent to check recovery |
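The state table above can be sketched as a small state machine. The following Python is a minimal illustration, not the gateway's implementation: the class name CircuitBreaker, its method names, and the injectable clock are hypothetical, and it assumes that a single successful probe is enough to close the circuit. The three constructor parameters borrow the field names used later in this tutorial.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Illustrative three-state breaker (hypothetical, not the gateway's code)."""

    def __init__(self, failure_threshold=5, recovery_timeout_seconds=30.0,
                 half_open_max_requests=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_seconds = recovery_timeout_seconds
        self.half_open_max_requests = half_open_max_requests
        self.clock = clock      # injectable so the timeout can be simulated
        self.state = State.CLOSED
        self.failures = 0       # consecutive failures while closed
        self.opened_at = 0.0    # clock reading when the circuit last opened
        self.probes = 0         # test requests admitted while half-open

    def allow_request(self):
        """Return True if the next request may go to the primary provider."""
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.recovery_timeout_seconds:
                return False
            # Recovery timeout elapsed: start probing the provider.
            self.state = State.HALF_OPEN
            self.probes = 0
        if self.state is State.HALF_OPEN:
            if self.probes >= self.half_open_max_requests:
                return False
            self.probes += 1
        return True

    def record_success(self):
        # Assumption: one successful probe closes the circuit.
        self.state = State.CLOSED
        self.failures = 0

    def record_failure(self):
        if self.state is State.OPEN:
            return  # already open; nothing to count
        if self.state is State.HALF_OPEN:
            self._open()  # a failed probe reopens the circuit
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
```

Passing a fake clock makes it possible to step through the Closed, Open, and Half-Open transitions without waiting for the real recovery timeout.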
Step 1: Create the Circuit Breaker Configuration
Create policy-config.yaml with circuit breaker settings:
version: '1'
providers:
  targets:
    - id: openai
      provider: openai
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: anthropic
      provider: anthropic
      secret_key_ref:
        env: ANTHROPIC_API_KEY
fallback:
  provider: anthropic
  model: claude-sonnet-4-20250514
policies:
  - name: content-filter
    type: content_filter
    action: flag
Configuration breakdown
| Field | Purpose |
|---|---|
| failure_threshold | Number of consecutive failures before the circuit opens |
| recovery_timeout_seconds | Seconds to wait in open state before transitioning to half-open |
| half_open_max_requests | Number of test requests allowed in half-open state |
| failure_types | Which error types count toward the failure threshold |
| on_open | Action when the circuit opens: fallback (route to backup) or reject (fail fast) |
Step 2: Validate and Start the Gateway
kt policy lint --file policy-config.yaml
kt gateway run --policy-config policy-config.yaml --port 41002
Expected output:
INFO keeptrusts::gateway Loaded 2 provider(s), 1 policy(ies)
INFO keeptrusts::gateway Circuit breaker: openai threshold=5, recovery=30s, half_open=3
INFO keeptrusts::gateway Fallback provider: anthropic (claude-sonnet-4-20250514)
INFO keeptrusts::gateway Gateway ready
Step 3: Verify Normal Operation (Closed State)
Send a request while the provider is healthy:
curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Say hello"}]
  }' | jq '{model: .model, provider: .provider}'
Expected output:
{
  "model": "gpt-4o-mini",
  "provider": "openai"
}
Check the circuit state:
kt gateway status --format json | jq '.providers[] | {name, circuit_state}'
{
  "name": "openai",
  "circuit_state": "closed"
}
Step 4: Simulate Provider Failure
To test the circuit breaker, temporarily point the provider at a non-existent endpoint. Create policy-config-broken.yaml:
pack:
  name: circuit-breaker-config-providers-2
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai
      provider: openai
      base_url: http://localhost:19999
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: anthropic
      provider: anthropic
      secret_key_ref:
        env: ANTHROPIC_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Start the gateway with the broken config:
kt gateway run --policy-config policy-config-broken.yaml --port 41002
Send requests to trigger failures:
for i in $(seq 1 5); do
  RESULT=$(curl -s http://localhost:41002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"ping"}]}')
  PROVIDER=$(echo "$RESULT" | jq -r '.provider // .error.code')
  echo "Request $i: $PROVIDER"
done
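Expected output (this assumes the test config sets failure_threshold to 3, consistent with the failure_count of 3 reported in Step 5):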
Request 1: connection_error
Request 2: connection_error
Request 3: connection_error
Request 4: anthropic ← circuit opened, routed to fallback
Request 5: anthropic
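To see why the fourth request is the first one served by the fallback, the routing decision can be modeled in a few lines of Python. This is a simplified sketch under stated assumptions: the function route_requests is hypothetical, it counts only consecutive failures, it ignores the recovery timeout, and it hard-codes the provider names and error codes that appear in this tutorial.

```python
def route_requests(primary_up, failure_threshold=3, on_open="fallback"):
    """Return the label each request would print in the loop above.

    primary_up is a sequence of booleans: True if the primary provider
    would answer that request, False if the connection fails.
    """
    if on_open not in ("fallback", "reject"):
        raise ValueError("on_open must be 'fallback' or 'reject'")

    results = []
    consecutive_failures = 0
    circuit_open = False
    for up in primary_up:
        if circuit_open:
            # Circuit is open: the primary is not contacted at all.
            if on_open == "fallback":
                results.append("anthropic")
            else:
                results.append("provider_unavailable")
        elif up:
            consecutive_failures = 0
            results.append("openai")
        else:
            consecutive_failures += 1
            results.append("connection_error")
            if consecutive_failures >= failure_threshold:
                circuit_open = True
    return results


print(route_requests([False] * 5))
# ['connection_error', 'connection_error', 'connection_error', 'anthropic', 'anthropic']
```

The request that reaches the threshold still fails, because the circuit opens only after that failure is recorded. Every later request skips the primary.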
Step 5: Monitor State Transitions
Watch the circuit breaker state in real time:
kt gateway status --watch --interval 5 --format json | jq '.providers[] | {name, circuit_state, failure_count, last_failure}'
You will see transitions:
{"name": "openai", "circuit_state": "open", "failure_count": 3, "last_failure": "2026-04-23T10:45:12Z"}
After recovery_timeout_seconds (15s in the test config):
{"name": "openai", "circuit_state": "half_open", "failure_count": 3, "last_failure": "2026-04-23T10:45:12Z"}
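If you want to log only the moments when the state changes, a small filter over the status stream is enough. The sketch below is hypothetical: the function name transitions is an illustration, and it assumes each input line is one compact JSON object shaped like the samples above (for example, by adding jq's -c option to the command).

```python
import json


def transitions(lines, provider="openai"):
    """Yield (old_state, new_state) each time circuit_state changes."""
    previous = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between snapshots
        snapshot = json.loads(line)
        if snapshot.get("name") != provider:
            continue  # ignore other providers in the stream
        state = snapshot.get("circuit_state")
        if previous is not None and state != previous:
            yield previous, state
        previous = state


sample = [
    '{"name": "openai", "circuit_state": "open", "failure_count": 3}',
    '{"name": "anthropic", "circuit_state": "closed", "failure_count": 0}',
    '{"name": "openai", "circuit_state": "open", "failure_count": 3}',
    '{"name": "openai", "circuit_state": "half_open", "failure_count": 3}',
]
for old, new in transitions(sample):
    print(f"{old} -> {new}")
# open -> half_open
```

To use it on live output, pass sys.stdin as the lines argument and pipe the status command into the script.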
Step 6: Observe Half-Open Recovery
Once the circuit enters half-open state, the gateway sends a limited number of test requests to the primary provider. If you restore the provider (stop the broken config, start with the real one), the circuit closes:
# After restoring the real provider and waiting for half-open:
curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"ping"}]}' \
  | jq '{provider: .provider}'
{
  "provider": "openai"
}
Check the circuit is now closed:
kt gateway status --format json | jq '.providers[] | select(.name == "openai") | .circuit_state'
"closed"
Step 7: Use Reject Mode Instead of Fallback
If you prefer fast failure over fallback routing, set on_open: reject:
pack:
  name: circuit-breaker-config-providers-3
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai
      provider: openai
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
When the circuit opens, requests return 503 Service Unavailable:
{
  "error": {
    "message": "Circuit breaker open for provider: openai",
    "type": "circuit_breaker_open",
    "code": "provider_unavailable",
    "retry_after_seconds": 30
  }
}
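A client that receives this response can use retry_after_seconds to decide when to try again. The helper below is a minimal sketch, assuming the error body has exactly the shape shown above. The function name retry_delay and the default_seconds parameter are hypothetical and not part of any Keeptrusts SDK.

```python
import json


def retry_delay(status_code, body, default_seconds=5.0):
    """Return seconds to wait before retrying, or None if the response
    is not a circuit-breaker rejection."""
    if status_code != 503:
        return None
    try:
        error = json.loads(body)["error"]
        if error["type"] != "circuit_breaker_open":
            return None
        # Fall back to a default if the hint is missing from the body.
        return float(error.get("retry_after_seconds", default_seconds))
    except (ValueError, KeyError, TypeError, AttributeError):
        # Malformed or unexpected body: treat as an ordinary 503.
        return None


body = """{
  "error": {
    "message": "Circuit breaker open for provider: openai",
    "type": "circuit_breaker_open",
    "code": "provider_unavailable",
    "retry_after_seconds": 30
  }
}"""
print(retry_delay(503, body))
# 30.0
```

Returning None for anything that is not a breaker rejection lets the caller keep its existing handling for other 503 responses.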
For AI systems
- Canonical terms: Keeptrusts gateway, circuit breaker, half-open state, failure threshold, recovery timeout, failover.
- Config fields: providers[].circuit_breaker.failure_threshold, recovery_timeout_seconds, half_open_max_requests, failure_types, on_open, fallback.provider.
- CLI commands: kt gateway run, kt policy lint, kt events tail --status blocked.
- Best next pages: Multi-Provider Failover, Gateway Health Monitoring, Traffic Mirroring.
For engineers
- Prerequisites: kt CLI, API keys for primary + fallback providers, curl and jq.
- Validate: kt policy lint --file policy-config.yaml confirms circuit breaker fields parse correctly.
- Simulate outage: stop the primary provider or point at an invalid base URL, then watch kt events tail for fallback routing.
- Tune thresholds: start with failure_threshold: 5 and recovery_timeout_seconds: 30; lower the threshold for faster detection, raise it for transient-error tolerance.
For leaders
- Circuit breakers prevent cascading failures when an LLM provider has an outage, protecting end-user experience.
- Automatic failover means no human intervention is needed during short provider incidents.
- Cost implication: fallback providers may have different per-token pricing — budget for brief spikes on the secondary provider.
- Recovery timeout controls how aggressively traffic returns to the primary; shorter timeouts reduce fallback duration but risk flapping.
Next steps
- Multi-Provider Failover: advanced priority-based routing with health checks
- Gateway Health Monitoring: Prometheus metrics and kt doctor alongside circuit breaker state
- Rate Limiting: prevent overloading recovering providers after circuit recovery