# Resilience Engineering for AI Services
LLM providers experience rate limits, outages, and latency spikes. This guide covers resilience patterns you can implement through the Keeptrusts gateway and at the application layer to keep your AI-powered services available.
## Use this page when
- You are configuring multi-provider failover chains in the gateway
- You need retry strategies with exponential backoff and jitter for LLM requests
- You are implementing circuit breakers to isolate provider failures
- You want graceful degradation patterns when all providers are unavailable
## Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
## Provider Failover

### Multi-Provider Configuration
Configure multiple providers with failover priority (`priority: 1` is tried first):

```yaml
pack:
  name: resilience-patterns-providers-1
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: openai-primary
        priority: 1
        provider:
          base_url: https://api.openai.com/v1
          secret_key_ref:
            env: OPENAI_API_KEY
      - id: azure-openai-fallback
        priority: 2
        provider:
          base_url: https://myorg.openai.azure.com/openai/deployments
          secret_key_ref:
            env: AZURE_OPENAI_KEY
      - id: anthropic-fallback
        priority: 3
        provider:
          base_url: https://api.anthropic.com/v1
          secret_key_ref:
            env: ANTHROPIC_API_KEY
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true
```
### Failover Flow
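In outline, the gateway tries each target in priority order and moves to the next only when the failure is retryable. A minimal application-level sketch of that flow, where `call_provider` and `ProviderError` are illustrative stand-ins rather than gateway APIs:

```python
# Sketch of the failover flow: try each target in priority order and
# move on only when the failure is retryable. `call_provider` and
# `ProviderError` are illustrative stand-ins, not gateway APIs.
RETRYABLE = {429, 500, 502, 503}

class ProviderError(Exception):
    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status

def call_with_failover(targets, request, call_provider):
    """Try each provider target in order; re-raise non-retryable errors."""
    last_error = None
    for target in targets:
        try:
            return call_provider(target, request)
        except ProviderError as e:
            if e.status not in RETRYABLE:
                raise  # client errors must not trigger failover
            last_error = e  # retryable: fall through to the next target
    if last_error is None:
        raise RuntimeError("no provider targets configured")
    raise last_error  # every target failed
```

Note that non-retryable statuses (for example a `409` policy block) propagate immediately rather than triggering failover.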
### Model Mapping for Failover

When failing over between providers, map equivalent models:
```yaml
model_mapping:
  gpt-4o:
    openai-primary: gpt-4o
    azure-openai-fallback: gpt-4o
    anthropic-fallback: claude-sonnet-4-20250514
  gpt-4o-mini:
    openai-primary: gpt-4o-mini
    azure-openai-fallback: gpt-4o-mini
    anthropic-fallback: claude-haiku-4-20250414
```
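In application terms the mapping is a two-level lookup: requested model, then active failover target, to the provider-specific model name. A sketch mirroring the YAML above (the `resolve_model` helper is illustrative, not a gateway API):

```python
# Mirror of the model_mapping YAML: requested model -> failover target
# -> provider-specific model name. `resolve_model` is illustrative.
MODEL_MAPPING = {
    "gpt-4o": {
        "openai-primary": "gpt-4o",
        "azure-openai-fallback": "gpt-4o",
        "anthropic-fallback": "claude-sonnet-4-20250514",
    },
    "gpt-4o-mini": {
        "openai-primary": "gpt-4o-mini",
        "azure-openai-fallback": "gpt-4o-mini",
        "anthropic-fallback": "claude-haiku-4-20250414",
    },
}

def resolve_model(requested: str, target: str) -> str:
    """Return the equivalent model for a target, or fail loudly if unmapped."""
    try:
        return MODEL_MAPPING[requested][target]
    except KeyError:
        raise ValueError(f"no mapping for {requested!r} on target {target!r}")
```

Failing loudly on an unmapped pair is deliberate: silently sending an unknown model name to a fallback provider would turn a failover into a different error.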
## Retry Strategies

### Exponential Backoff with Jitter

Never retry on a fixed interval; use exponential backoff with jitter so that synchronized clients do not create a thundering herd:
```yaml
gateway:
  retry:
    max_attempts: 3
    initial_delay: 500ms
    max_delay: 10s
    backoff_multiplier: 2.0
    jitter: true
    retryable_codes: [429, 500, 502, 503]
```
### Retry Timing Visualization
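With the settings above (500 ms initial delay, 2.0 multiplier, 10 s cap), the base delay before each retry is deterministic and jitter adds a random amount on top. A small sketch of the timing (helper names are illustrative):

```python
import random

def retry_delays(initial=0.5, multiplier=2.0, max_delay=10.0, attempts=3):
    """Base delays in seconds before each retry, mirroring gateway.retry."""
    return [min(initial * multiplier ** i, max_delay) for i in range(attempts)]

def with_jitter(delay: float) -> float:
    """Add up to 50% random jitter so synchronized clients spread out."""
    return delay + random.uniform(0, delay * 0.5)
```

`retry_delays()` returns `[0.5, 1.0, 2.0]` for the default three attempts; the 10 s cap only takes effect from the sixth attempt onward.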
### Application-Level Retry

For fine-grained control, implement retries in your application:
```python
import httpx
import random
import asyncio


async def call_with_retry(
    messages: list,
    max_attempts: int = 3,
    base_delay: float = 0.5,
):
    for attempt in range(max_attempts):
        try:
            async with httpx.AsyncClient() as client:
                response = await client.post(
                    "http://localhost:41002/v1/chat/completions",
                    json={"model": "gpt-4o", "messages": messages},
                    timeout=120.0,
                )
                response.raise_for_status()
                return response.json()
        except httpx.HTTPStatusError as e:
            if e.response.status_code not in (429, 500, 502, 503):
                raise  # Non-retryable error
            if attempt == max_attempts - 1:
                raise  # Final attempt failed
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, delay * 0.5)
            await asyncio.sleep(delay + jitter)
```
## Circuit Breaker Patterns

### Three-State Circuit Breaker
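A circuit breaker moves between three states: closed (traffic flows normally), open (requests fail fast after too many consecutive failures), and half-open (one probe is allowed after a cooldown). A minimal sketch of the state machine, with an assumed 5-failure threshold and 30 s cooldown; this is not the gateway's internal implementation:

```python
import time

class CircuitBreaker:
    """Minimal three-state circuit breaker: closed -> open -> half-open."""

    def __init__(self, failure_threshold: int = 5, cooldown: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.cooldown:
                self.state = "half-open"  # let one probe through
                return True
            return False  # fail fast while cooling down
        return True  # closed or half-open

    def record_success(self) -> None:
        self.failures = 0
        self.state = "closed"  # probe succeeded, resume normal traffic

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "half-open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()
```

A failed probe in the half-open state reopens the breaker immediately, which matches the close-on-successful-probe behavior described under "For engineers" below.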
### Per-Provider Circuit Breakers

Each provider gets an independent circuit breaker:
```yaml
pack:
  name: resilience-patterns-providers-4
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: openai
        provider:
          base_url: https://api.openai.com/v1
          secret_key_ref:
            env: OPENAI_API_KEY
      - id: anthropic
        provider:
          base_url: https://api.anthropic.com/v1
          secret_key_ref:
            env: ANTHROPIC_API_KEY
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true
```
## Bulkheading

Isolate provider connections so a misbehaving provider cannot exhaust resources for others:
```yaml
pack:
  name: resilience-patterns-providers-5
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: openai
        provider:
          base_url: https://api.openai.com/v1
          secret_key_ref:
            env: OPENAI_API_KEY
      - id: anthropic
        provider:
          base_url: https://api.anthropic.com/v1
          secret_key_ref:
            env: ANTHROPIC_API_KEY
      - id: local-llm
        provider:
          base_url: http://localhost:8080/v1
          secret_key_ref:
            env: LOCAL_KEY
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true
```
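The same isolation can be applied in application code with per-provider semaphores, so a hanging provider can exhaust at most its own slots rather than the whole connection pool. A sketch with hypothetical concurrency limits:

```python
import asyncio

# Per-provider concurrency caps (hypothetical limits): a slow provider
# blocks at most its own slots, never the whole pool.
BULKHEADS = {
    "openai": asyncio.Semaphore(50),
    "anthropic": asyncio.Semaphore(50),
    "local-llm": asyncio.Semaphore(10),
}

async def call_bulkheaded(provider_id: str, call):
    """Run `call` inside the named provider's bulkhead."""
    async with BULKHEADS[provider_id]:
        return await call()
```

Requires Python 3.10+ so that semaphores created at import time bind to the running event loop lazily.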
## Graceful Degradation

### When the Gateway Is Down

Design your application to handle gateway unavailability:
```typescript
// Minimal message shape for this example
type Message = { role: string; content: string };

async function getCompletion(messages: Message[]): Promise<string> {
  try {
    // Primary path: through the governed gateway
    const response = await fetch('http://kt-gateway:41002/v1/chat/completions', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model: 'gpt-4o', messages }),
      signal: AbortSignal.timeout(5000),
    });
    if (!response.ok) {
      throw new Error(`Gateway returned ${response.status}`);
    }
    return (await response.json()).choices[0].message.content;
  } catch {
    // Degraded path: return a safe fallback
    console.error('[AI] Gateway unreachable, returning fallback');
    return 'I am temporarily unable to process your request. Please try again shortly.';
  }
}
```
### Degradation Tiers
| Tier | Condition | Behavior |
|---|---|---|
| Full | Gateway + provider healthy | Normal AI responses with full governance |
| Reduced | Primary provider down | Failover to secondary provider, governance intact |
| Minimal | Gateway overloaded | Shed non-critical requests, prioritize critical paths |
| Offline | Gateway unreachable | Static fallback responses, queue requests for replay |
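The table above can be encoded as a small decision function over health signals (the signal names here are illustrative, not a gateway API):

```python
def degradation_tier(gateway_up: bool, primary_up: bool, overloaded: bool) -> str:
    """Map health signals to the degradation tiers in the table (illustrative)."""
    if not gateway_up:
        return "offline"   # static fallbacks, queue for replay
    if overloaded:
        return "minimal"   # shed non-critical requests
    if not primary_up:
        return "reduced"   # failover provider, governance intact
    return "full"          # normal AI responses
```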
### Request Priority and Shedding

```yaml
gateway:
  load_shedding:
    # Start shedding when concurrent requests exceed this
    max_concurrent: 500
    # Priority header for request classification
    priority_header: X-Request-Priority
    # Shed low-priority first
    shed_order: [low, medium, high, critical]
```

```shell
# Critical requests are the last to be shed
curl -H "X-Request-Priority: critical" \
  http://kt-gateway:41002/v1/chat/completions \
  -d '{"model":"gpt-4o","messages":[...]}'
```
## Health Monitoring

### Gateway Health Check

```shell
# Check gateway health
kt health

# Detailed status including provider circuit breaker states
kt health --verbose
```
### Provider Health Dashboard

Monitor provider health via the console dashboard:

```shell
# Tail real-time events to see failures
kt events tail --filter "status=error"

# Check event counts by provider
kt events stats --group-by provider --last 1h
```
## Resilience Testing

### Chaos Engineering with the Gateway

Test resilience by simulating failures:

```shell
# Simulate provider timeout
kt gateway run --test-mode \
  --inject-fault openai:timeout:5s

# Simulate rate limiting
kt gateway run --test-mode \
  --inject-fault openai:rate-limit:80%

# Simulate intermittent errors
kt gateway run --test-mode \
  --inject-fault anthropic:error:503:30%
```
## Next steps
- Performance Engineering the AI Gateway — optimize throughput
- Observability for AI-Governed Systems — monitor resilience metrics
- Capacity Planning for AI Workloads — size for failure scenarios
## For AI systems

- Canonical terms: provider failover, `providers[].priority`, `model_mapping`, `gateway.retry`, exponential backoff with jitter, circuit breaker, bulkhead, `retryable_codes: [429, 500, 502, 503]`, graceful degradation
- Key configuration: `providers[].priority` (1 = primary, 2 = fallback), `gateway.retry.max_attempts: 3`, `gateway.retry.initial_delay: 500ms`, `gateway.retry.backoff_multiplier: 2.0`
- Best next pages: Performance Engineering, Capacity Planning, Architecture Patterns
## For engineers

- Configure failover: assign `priority: 1` (primary), `priority: 2` (secondary), `priority: 3` (tertiary) across providers
- Model mapping: map `gpt-4o` to equivalent models across providers (e.g., `claude-sonnet-4-20250514` for Anthropic)
- Retry only transient, retryable failures: `[429, 500, 502, 503]`; never retry `409` policy blocks or other `4xx` client errors
- Circuit breaker: open after N consecutive failures, half-open after cooldown, close on a successful probe
- Bulkhead: isolate traffic per consumer group so one team's failure doesn't cascade to others
## For leaders

- Multi-provider failover removes single-vendor dependency, so provider outages become transparent to applications
- Retry and circuit breaker patterns reduce user-visible errors during provider degradation without manual intervention
- Cost implication: failover traffic may route to more expensive providers, so monitor cost trends during and after incidents