Tutorial: Configuring Circuit Breakers for Provider Outages

This tutorial walks you through configuring circuit breakers in the Keeptrusts gateway to automatically detect provider failures, stop sending traffic to unhealthy providers, and gradually recover when the provider comes back online.

Use this page when

  • You need the gateway to stop routing to an unhealthy LLM provider automatically.
  • You are configuring failure thresholds, recovery timeouts, and half-open probe logic.
  • You want automatic failover to a secondary provider when the primary circuit opens.
  • You are hardening a production gateway against intermittent provider outages.

Primary audience

  • Primary: Platform engineers building resilient AI gateway deployments
  • Secondary: SRE teams defining provider SLOs; technical leaders assessing uptime guarantees for LLM-powered features

Prerequisites

  • kt CLI installed (first-run tutorial)
  • An OpenAI-compatible API key exported as OPENAI_API_KEY
  • A secondary provider key (e.g., ANTHROPIC_API_KEY) for failover testing
  • curl and jq installed

How Circuit Breakers Work

A circuit breaker tracks provider health through three states:

[Closed] ──failures exceed threshold──▶ [Open]
                                          │
                                   recovery_timeout
                                          │
                                          ▼
                                     [Half-Open]
                                      │       │
                              success ▼       ▼ failure
                                  [Closed]   [Open]

State       Behavior
Closed      Normal operation: requests flow to the provider
Open        Provider marked unhealthy: requests fail fast or route to fallback
Half-Open   A limited number of test requests are sent to check recovery
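
The three states above can be sketched as a small state machine. This is a minimal illustration, not the gateway's actual implementation: the class and method names are hypothetical, and it assumes that a single successful probe in the half-open state closes the circuit and that any failed probe reopens it.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Illustrative three-state breaker (hypothetical, not Keeptrusts code)."""

    def __init__(self, failure_threshold=5, recovery_timeout_seconds=30,
                 half_open_max_requests=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_seconds = recovery_timeout_seconds
        self.half_open_max_requests = half_open_max_requests
        self.clock = clock          # injectable so the timeout can be tested
        self.state = State.CLOSED
        self.failures = 0           # consecutive failures while closed
        self.opened_at = None
        self.probes = 0             # probes admitted while half-open

    def allow_request(self):
        """Return True if a request may be sent to the provider."""
        if self.state is State.OPEN:
            if self.clock() - self.opened_at < self.recovery_timeout_seconds:
                return False        # still open: fail fast or use fallback
            self.state = State.HALF_OPEN
            self.probes = 0
        if self.state is State.HALF_OPEN:
            if self.probes >= self.half_open_max_requests:
                return False        # probe budget used up
            self.probes += 1
        return True

    def record_success(self):
        # Assumption: one successful probe is enough to close the circuit.
        self.state = State.CLOSED
        self.failures = 0

    def record_failure(self):
        if self.state is State.HALF_OPEN:
            self._open()            # a failed probe reopens the circuit
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self):
        self.state = State.OPEN
        self.opened_at = self.clock()
```

A caller would check allow_request() before each provider call and report the outcome with record_success() or record_failure(). When allow_request() returns False, the caller either rejects the request or routes it to the fallback.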

Step 1: Create the Circuit Breaker Configuration

Create policy-config.yaml with circuit breaker settings:

version: '1'
providers:
  targets:
    - id: openai
      provider: openai
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: anthropic
      provider: anthropic
      secret_key_ref:
        env: ANTHROPIC_API_KEY
fallback:
  provider: anthropic
  model: claude-sonnet-4-20250514
policies:
  - name: content-filter
    type: content_filter
    action: flag

Configuration breakdown

Field                      Purpose
failure_threshold          Number of consecutive failures before the circuit opens
recovery_timeout_seconds   Seconds to wait in open state before transitioning to half-open
half_open_max_requests     Number of test requests allowed in half-open state
failure_types              Which error types count toward the failure threshold
on_open                    Action when circuit opens: fallback (route to backup) or reject (fail fast)
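
To make the field semantics concrete, the sketch below models the five settings as a validated Python object. It is only an illustration: the class name, the validation rules, and the default failure_types value are assumptions. The numeric defaults mirror the values in the startup log in Step 2, and connection_error is the only error code this tutorial shows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CircuitBreakerSettings:
    """Hypothetical in-memory view of the circuit breaker fields."""

    failure_threshold: int = 5
    recovery_timeout_seconds: int = 30
    half_open_max_requests: int = 3
    # Assumed value: the only error code that appears in this tutorial.
    failure_types: tuple = ("connection_error",)
    on_open: str = "fallback"

    def __post_init__(self):
        # Assumed rule: each numeric setting must be a positive count.
        for name in ("failure_threshold", "recovery_timeout_seconds",
                     "half_open_max_requests"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be at least 1")
        if self.on_open not in ("fallback", "reject"):
            raise ValueError("on_open must be 'fallback' or 'reject'")

    def counts_toward_threshold(self, error_code):
        """Only errors listed in failure_types move the circuit toward open."""
        return error_code in self.failure_types
```

With these defaults, a connection_error counts toward the threshold, while an error code that is not listed in failure_types leaves the failure count unchanged.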

Step 2: Validate and Start the Gateway

kt policy lint --file policy-config.yaml
kt gateway run --policy-config policy-config.yaml --port 41002

Expected output:

INFO keeptrusts::gateway Loaded 2 provider(s), 1 policy(ies)
INFO keeptrusts::gateway Circuit breaker: openai threshold=5, recovery=30s, half_open=3
INFO keeptrusts::gateway Fallback provider: anthropic (claude-sonnet-4-20250514)
INFO keeptrusts::gateway Gateway ready

Step 3: Verify Normal Operation (Closed State)

Send a request while the provider is healthy:

curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role": "user", "content": "Say hello"}]
  }' | jq '{model: .model, provider: .provider}'

Expected output:

{
  "model": "gpt-4o-mini",
  "provider": "openai"
}

Check the circuit state:

kt gateway status --format json | jq '.providers[] | {name, circuit_state}'

Expected output:

{
  "name": "openai",
  "circuit_state": "closed"
}

Step 4: Simulate Provider Failure

To test the circuit breaker, temporarily point the provider at a non-existent endpoint. Create policy-config-broken.yaml:

pack:
  name: circuit-breaker-config-providers-2
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai
      provider: openai
      base_url: http://localhost:19999
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: anthropic
      provider: anthropic
      secret_key_ref:
        env: ANTHROPIC_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

Start the gateway with the broken config:

kt gateway run --policy-config policy-config-broken.yaml --port 41002

Send requests to trigger failures:

for i in $(seq 1 5); do
  RESULT=$(curl -s http://localhost:41002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"ping"}]}')
  PROVIDER=$(echo "$RESULT" | jq -r '.provider // .error.code')
  echo "Request $i: $PROVIDER"
done

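Expected output, assuming the test config sets failure_threshold to 3 (the circuit opens after the third consecutive failure):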

Request 1: connection_error
Request 2: connection_error
Request 3: connection_error
Request 4: anthropic ← circuit opened, routed to fallback
Request 5: anthropic

Step 5: Monitor State Transitions

Watch the circuit breaker state in real time:

kt gateway status --watch --interval 5 --format json | jq '.providers[] | {name, circuit_state, failure_count, last_failure}'

You will see transitions:

{"name": "openai", "circuit_state": "open", "failure_count": 3, "last_failure": "2026-04-23T10:45:12Z"}

After recovery_timeout_seconds (15s in the test config):

{"name": "openai", "circuit_state": "half_open", "failure_count": 3, "last_failure": "2026-04-23T10:45:12Z"}
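
If you would rather log only the changes than read every snapshot, the status lines above can be reduced to a list of transitions. The helper below is a sketch under assumptions: the function name is hypothetical, and it expects one JSON object per line with the name and circuit_state keys shown above (for example, output produced with jq's compact -c option).

```python
import json


def transitions(lines):
    """Yield (provider, old_state, new_state) whenever circuit_state changes.

    `lines` is any iterable of strings, each holding one JSON status object.
    """
    last_state = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between snapshots
        snapshot = json.loads(line)
        name = snapshot["name"]
        state = snapshot["circuit_state"]
        previous = last_state.get(name)
        if previous is not None and previous != state:
            yield name, previous, state
        last_state[name] = state
```

Feeding it the two snapshots shown in this step yields a single transition for openai, from open to half_open.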

Step 6: Observe Half-Open Recovery

Once the circuit enters the half-open state, the gateway sends a limited number of test requests (at most half_open_max_requests) to the primary provider. If you have restored the provider (stop the gateway that uses the broken config and start it again with the real one), the test requests succeed and the circuit closes:

# After restoring the real provider and waiting for half-open:
curl -s http://localhost:41002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"ping"}]}' \
| jq '{provider: .provider}'

Expected output:

{
  "provider": "openai"
}

Check the circuit is now closed:

kt gateway status --format json | jq '.providers[] | select(.name == "openai") | .circuit_state'
"closed"

Step 7: Use Reject Mode Instead of Fallback

If you prefer fast failure over fallback routing, set on_open: reject:

pack:
  name: circuit-breaker-config-providers-3
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai
      provider: openai
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

When the circuit opens, requests return 503 Service Unavailable:

{
  "error": {
    "message": "Circuit breaker open for provider: openai",
    "type": "circuit_breaker_open",
    "code": "provider_unavailable",
    "retry_after_seconds": 30
  }
}
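
A client that receives this response can use retry_after_seconds to decide when to try again rather than retrying immediately. The function below is a minimal sketch of that decision. Its name and the default_delay parameter are hypothetical, and it assumes the error body has exactly the shape shown above.

```python
import json


def retry_delay(status_code, body, default_delay=30.0):
    """Return seconds to wait before retrying, or None if the response
    is not a circuit-breaker rejection."""
    if status_code != 503:
        return None
    try:
        payload = json.loads(body)
    except ValueError:
        return None  # body is not JSON, so not the documented error shape
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        return None
    if error.get("type") != "circuit_breaker_open":
        return None
    # Fall back to default_delay if the field is missing (assumed behavior).
    return float(error.get("retry_after_seconds", default_delay))
```

Returning None for anything other than a circuit-breaker rejection lets the caller keep its usual error handling for other failures.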

For AI systems

  • Canonical terms: Keeptrusts gateway, circuit breaker, half-open state, failure threshold, recovery timeout, failover.
  • Config fields: providers[].circuit_breaker.failure_threshold, recovery_timeout_seconds, half_open_max_requests, failure_types, on_open, fallback.provider.
  • CLI commands: kt gateway run, kt policy lint, kt events tail --status blocked.
  • Best next pages: Multi-Provider Failover, Gateway Health Monitoring, Traffic Mirroring.

For engineers

  • Prerequisites: kt CLI, API keys for primary + fallback providers, curl and jq.
  • Validate: kt policy lint --file policy-config.yaml confirms circuit breaker fields parse correctly.
  • Simulate outage: stop the primary provider or point at an invalid base URL, then watch kt events tail for fallback routing.
  • Tune thresholds: start with failure_threshold: 5 and recovery_timeout_seconds: 30; lower the threshold for faster detection, raise for transient-error tolerance.
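
One way to reason about the threshold trade-off is to replay a recorded sequence of request outcomes and check whether a given failure_threshold would have opened the circuit. The sketch below does this with hypothetical helper names, using the table's definition that only consecutive failures count.

```python
def longest_failure_run(outcomes):
    """Length of the longest streak of failed requests.

    `outcomes` is an iterable of booleans: True for success, False for failure.
    """
    longest = current = 0
    for ok in outcomes:
        current = 0 if ok else current + 1  # a success resets the streak
        longest = max(longest, current)
    return longest


def would_open(outcomes, failure_threshold):
    """True if the circuit would have opened at some point in the sequence."""
    return longest_failure_run(outcomes) >= failure_threshold
```

For a history with a burst of three transient errors, a threshold of 3 opens the circuit while a threshold of 5 rides the burst out.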

For leaders

  • Circuit breakers prevent cascading failures when an LLM provider has an outage, protecting end-user experience.
  • Automatic failover means no human intervention is needed during short provider incidents.
  • Cost implication: fallback providers may have different per-token pricing — budget for brief spikes on the secondary provider.
  • Recovery timeout controls how aggressively traffic returns to the primary; shorter timeouts reduce fallback duration but risk flapping.

Next steps