Circuit Breakers & Retry
Keeptrusts includes built-in circuit breakers and retry policies that protect your application against upstream LLM provider failures. Together they form a two-layer resilience system: retries absorb transient errors at the individual request level, while circuit breakers stop the system from wasting time and tokens on a provider that is persistently degraded.
Use this page when
- You need the exact command, config, API, or integration details for Circuit Breakers & Retry.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- You want a guided rollout instead of a reference page; use the linked workflow pages in Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Circuit Breaker
A circuit breaker wraps each provider target and tracks its recent failure history. When failures exceed a threshold, the circuit "opens" and the gateway immediately routes to the next available provider without waiting for a timeout. After a cooldown period the circuit enters the "half-open" state and probes the provider with a limited number of real requests; if they succeed, the circuit closes again.
States
CLOSED ──(consecutive failures ≥ threshold)──► OPEN (reject fast)
OPEN ──(cooldown_seconds elapsed)────────────► HALF-OPEN (limited probes)
HALF-OPEN ──(all probes succeed)─────────────► CLOSED
HALF-OPEN ──(any probe fails)────────────────► OPEN
| State | Behaviour |
|---|---|
| Closed | Normal operation. Failures are counted. |
| Open | All requests to this provider are immediately rejected without making an upstream call. |
| Half-Open | A limited number of probe requests are forwarded. Success → Closed; failure → Open again. |
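The transitions in the table above can be sketched as a small state machine. This is illustrative Python only, not the gateway's implementation; cooldown timing is driven externally via `cooldown_elapsed()`:

```python
class CircuitBreaker:
    """Minimal sketch of the Closed / Open / Half-Open state machine."""

    def __init__(self, failure_threshold=5, half_open_successes=2):
        self.failure_threshold = failure_threshold
        self.half_open_successes = half_open_successes
        self.state = "closed"
        self.failures = 0
        self.probe_successes = 0

    def record_success(self):
        if self.state == "half-open":
            self.probe_successes += 1
            if self.probe_successes >= self.half_open_successes:
                self.state = "closed"        # all probes succeeded
                self.failures = 0
        else:
            self.failures = 0                # success resets the failure count

    def record_failure(self):
        if self.state == "half-open":
            self.state = "open"              # any probe failure reopens
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = "open"              # consecutive failures ≥ threshold

    def cooldown_elapsed(self):
        if self.state == "open":
            self.state = "half-open"         # cooldown_seconds expired; start probing
            self.probe_successes = 0
```

A caller would invoke `record_success`/`record_failure` after each upstream call and `cooldown_elapsed` from a timer.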
Configuration fields
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable the circuit breaker for this target or globally. |
| consecutive_failure_threshold | integer | 5 | Number of consecutive failures before the circuit opens. |
| cooldown_seconds | integer | 60 | Seconds to wait in the Open state before entering Half-Open. |
| half_open_successes | integer | 2 | Number of consecutive successes required in Half-Open to close the circuit. |
Per-target configuration
Each target can declare its own circuit_breaker block, which overrides any global defaults:

pack:
  name: circuit-breaker-retry-providers-1
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai-primary
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
      circuit_breaker:
        enabled: true
        consecutive_failure_threshold: 5
        cooldown_seconds: 60
        half_open_successes: 2
    - id: azure-backup
      provider: azure:chat:gpt-4o
      secret_key_ref:
        env: AZURE_OPENAI_API_KEY
      circuit_breaker:
        enabled: true
        consecutive_failure_threshold: 3
        cooldown_seconds: 30
        half_open_successes: 1
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Global circuit breaker defaults
Set global defaults that apply to all targets that do not declare their own circuit_breaker block:
pack:
  name: circuit-breaker-retry-providers-2
  version: 1.0.0
  enabled: true
circuit_breaker_defaults:
  enabled: true
  consecutive_failure_threshold: 5
  cooldown_seconds: 60
  half_open_successes: 2
providers:
  targets:
    - id: openai-primary
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: groq-fast
      provider: groq:chat:llama-3.3-70b-versatile
      secret_key_ref:
        env: GROQ_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Retry Policy
The retry policy controls how many times the gateway attempts a request before declaring failure, which error conditions trigger retries, and how long to wait between attempts.
Configuration fields
| Field | Type | Default | Description |
|---|---|---|---|
| max_retries | integer | 2 | Total retry attempts across all triggers. |
| per_trigger | map | {} | Override max_retries for specific error types. |
| backoff.strategy | string | exponential | Backoff timing: fixed, linear, or exponential. |
| backoff.base_ms | integer | 200 | Starting delay in milliseconds (linear and exponential). |
| backoff.delay_ms | integer | 500 | Increment per attempt for linear; the constant delay for fixed. |
| backoff.max_ms | integer | 10000 | Maximum delay cap regardless of strategy. |
| jitter | bool | true | Add ±20% random jitter to backoff delays to avoid thundering herds. |
Error triggers
| Trigger | Condition |
|---|---|
| rate_limit | Provider returns HTTP 429. |
| timeout | No response within the configured request timeout. |
| service_unavailable | Provider returns HTTP 503 or HTTP 502. |
| context_window_exceeded | Provider returns a context-length error (HTTP 400 with a context-window error code). |
| server_error | Any 5xx response not matched by a more specific trigger. |
| empty_response | Provider returns HTTP 200 but with zero content in the completion. |
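The trigger taxonomy above can be expressed as a classification function. This is a sketch only; `context_window_exceeded` is omitted because detecting it requires inspecting the provider-specific error code in the 400 response body:

```python
def classify_trigger(status=None, timed_out=False, content="ok"):
    """Map an upstream outcome to a retry trigger name, or None (sketch)."""
    if timed_out:
        return "timeout"                  # no response within the request timeout
    if status == 429:
        return "rate_limit"
    if status in (502, 503):
        return "service_unavailable"
    if status is not None and 500 <= status <= 599:
        return "server_error"             # any other 5xx
    if status == 200 and not content.strip():
        return "empty_response"           # HTTP 200 with zero completion content
    return None                           # success: nothing to retry
```

Note the ordering: the specific 502/503 check must run before the generic 5xx fallthrough.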
Full retry configuration example
pack:
  name: circuit-breaker-retry-providers-3
  version: 1.0.0
  enabled: true
retry:
  max_retries: 2
  per_trigger:
    rate_limit: 5
    service_unavailable: 3
    context_window_exceeded: 0
  backoff:
    strategy: exponential
    base_ms: 200
    max_ms: 10000
  jitter: true
providers:
  targets:
    - id: openai-primary
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Backoff strategies
Fixed
Every retry waits the same delay_ms regardless of attempt number.

retry:
  max_retries: 3
  backoff:
    strategy: fixed
    delay_ms: 1000 # always wait 1 second

Delays: 1000ms → 1000ms → 1000ms
Linear
Each retry adds another delay_ms to the previous delay, starting from base_ms.

retry:
  max_retries: 4
  backoff:
    strategy: linear
    base_ms: 200
    delay_ms: 300

Delays: 200ms → 500ms → 800ms → 1100ms
Exponential
Delay doubles on each retry, starting from base_ms and capped at max_ms.

retry:
  max_retries: 5
  backoff:
    strategy: exponential
    base_ms: 250
    max_ms: 8000
  jitter: true

Delays: ~250ms → ~500ms → ~1000ms → ~2000ms → ~4000ms (with jitter applied)
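The three delay schedules can be reproduced with a small helper. This is illustrative only; the parameter names mirror the config fields above, and jitter is off by default so the sequences are deterministic:

```python
import random

def backoff_delays(strategy, retries, base_ms=200, delay_ms=500,
                   max_ms=10_000, jitter=False):
    """Return the list of delays (ms) preceding each retry attempt."""
    delays = []
    for attempt in range(retries):
        if strategy == "fixed":
            delay = delay_ms                      # same delay every attempt
        elif strategy == "linear":
            delay = base_ms + attempt * delay_ms  # base_ms, then +delay_ms per attempt
        elif strategy == "exponential":
            delay = base_ms * 2 ** attempt        # doubles each attempt
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        delay = min(delay, max_ms)                # max_ms caps every strategy
        if jitter:
            delay = int(delay * random.uniform(0.8, 1.2))  # ±20% jitter
        delays.append(delay)
    return delays
```

For example, `backoff_delays("linear", 4, base_ms=200, delay_ms=300)` yields the 200 → 500 → 800 → 1100 ms schedule shown above.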
Combining Circuit Breaker + Retry with Fallbacks
The full resilience system layers retry, circuit breaker, and group fallback into a single decision pipeline:
Request
│
▼
Retry attempt 1 → upstream call
│ fails (timeout)
▼
Retry attempt 2 → upstream call
│ fails (5xx)
▼
Retry attempt 3 → upstream call
│ fails (5xx)
│ consecutive_failure_threshold reached → circuit opens
▼
Route to fallback provider (circuit breaker short-circuits this target)
│
▼
Response to client
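The pipeline above can be sketched end to end. Illustrative Python only: `call`, `breakers`, and the threshold handling are simplified stand-ins for the gateway internals:

```python
def send_with_resilience(request, targets, call, breakers,
                         threshold=4, max_retries=2):
    """Sketch of the retry → circuit breaker → fallback pipeline.

    `call(target, request)` stands in for the upstream call and raises on
    failure; `breakers` maps each target id to a mutable dict like
    {"state": "closed", "failures": 0}.
    """
    last_error = None
    for target in targets:                        # ordered fallback chain
        cb = breakers[target]
        if cb["state"] == "open":
            continue                              # short-circuit an open target
        for _attempt in range(1 + max_retries):   # first try + retries
            try:
                response = call(target, request)
                cb["failures"] = 0                # success resets the count
                return response
            except Exception as err:
                last_error = err
                cb["failures"] += 1
                if cb["failures"] >= threshold:
                    cb["state"] = "open"          # circuit opens; stop retrying
                    break
    raise RuntimeError(f"all targets failed: {last_error!r}")
```

After the primary's circuit opens, subsequent calls skip it entirely and go straight to the fallback target.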
Complete example
pack:
  name: resilient-chat
  version: 1.0.0
provider_routing:
  strategy: ordered
  fallback_enabled: true
circuit_breaker_defaults:
  enabled: true
  consecutive_failure_threshold: 4
  cooldown_seconds: 60
  half_open_successes: 2
retry:
  max_retries: 3
  per_trigger:
    service_unavailable: 3
  backoff:
    strategy: exponential
    base_ms: 250
    max_ms: 8000
  jitter: true
model_groups:
  - name: primary-chat
    fallback_group: backup-chat
    targets:
      - id: openai-primary
        weight: 1
  - name: backup-chat
    targets:
      - id: anthropic-backup
        weight: 1
providers:
  targets:
    - id: openai-primary
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: anthropic-backup
      provider: anthropic:chat:claude-3-5-sonnet-20241022
      secret_key_ref:
        env: ANTHROPIC_API_KEY
What happens when OpenAI degrades:
- Request arrives and is forwarded to `openai-primary`.
- The upstream returns 503. The retry policy retries `openai-primary` up to 3 more times (the `service_unavailable` trigger), each time with exponential backoff.
- After 4 consecutive failures, the circuit breaker opens. Subsequent requests to `openai-primary` are immediately rejected without any upstream calls.
- The fallback group `backup-chat` is activated. Requests are now forwarded to `anthropic-backup`.
- After 60 seconds, the `openai-primary` circuit enters Half-Open. Two consecutive probe requests succeed, and the circuit closes. Traffic shifts back to `openai-primary`.
Zero Completion Insurance
Zero Completion Insurance (ZCI) is an additional retry layer that activates when a provider returns a technically successful HTTP 200 response but with no usable completion content. This happens when providers stream an empty `choices[0].message.content`, return a stop reason of `length` with zero tokens, or produce a low-quality output that fails a configured assertion.
Configuration fields
| Field | Type | Description |
|---|---|---|
| enabled | bool | Enable ZCI for this target or globally. |
| conditions | list | One or more conditions that trigger ZCI. |
| action | string | What to do when a condition fires: retry_same, retry_fallback, or return_error. |
| retry_with_fallback | bool | If true, retry on the next available provider rather than the same one. |
| max_zci_retries | integer | Maximum ZCI-specific retry attempts (default: 2). |
Conditions
| Condition | Description |
|---|---|
| empty_response | The response body contains no completion tokens. |
| low_quality_score | A configured quality scorer rates the response below threshold. |
| failed_assertion | A post-processing policy assertion is not satisfied. |
| stop_reason_length | The model stopped generating due to the token limit (truncated output). |
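Two of these conditions can be checked directly on the response. The sketch below assumes an OpenAI-style chat completion shape; `low_quality_score` and `failed_assertion` depend on configured scorers and policy assertions, so they are omitted:

```python
def zci_condition(response, conditions):
    """Return the first ZCI condition the response triggers, else None (sketch)."""
    choice = response["choices"][0]
    content = choice["message"].get("content") or ""
    if "empty_response" in conditions and not content.strip():
        return "empty_response"          # HTTP 200 but no completion tokens
    if "stop_reason_length" in conditions and choice.get("finish_reason") == "length":
        return "stop_reason_length"      # generation truncated at the token limit
    return None
```

If this returns a condition name, the configured action (retry_same, retry_fallback, or return_error) would fire.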
ZCI configuration example
pack:
  name: circuit-breaker-retry-providers-8
  version: 1.0.0
  enabled: true
zero_completion_insurance:
  enabled: true
  conditions:
    - empty_response
    - stop_reason_length
  action: retry_fallback
  retry_with_fallback: true
  max_zci_retries: 2
providers:
  targets:
    - id: openai-primary
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: anthropic-backup
      provider: anthropic:chat:claude-3-5-sonnet-20241022
      secret_key_ref:
        env: ANTHROPIC_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
When `openai-primary` returns an empty response, ZCI fires with `action: retry_fallback`, and the gateway immediately retries the request on `anthropic-backup` rather than returning the empty response to the client.
Observability
Every circuit breaker state change and every retry attempt emits a structured event in the Keeptrusts event stream, giving full visibility into resilience behaviour in production.
Circuit breaker events
| Event | Fields | Description |
|---|---|---|
| circuit_breaker.opened | target_id, failure_count, threshold | Circuit transitioned from Closed → Open. |
| circuit_breaker.half_opened | target_id, cooldown_elapsed_ms | Cooldown expired; circuit entered Half-Open probe mode. |
| circuit_breaker.closed | target_id, probe_successes | All probes succeeded; circuit closed and normal routing resumed. |
| circuit_breaker.rejected | target_id | A request was rejected because the circuit is currently Open. |
Retry events
| Event | Fields | Description |
|---|---|---|
| retry.attempt | target_id, attempt_number, trigger, backoff_ms | A retry was scheduled. |
| retry.exhausted | target_id, total_attempts, last_trigger | All retry attempts were consumed; the request will fail or be routed to fallback. |
| zci.triggered | target_id, condition, action | Zero Completion Insurance activated on a successful-status but empty or low-quality response. |
Example: alert on circuit opening
Use the Keeptrusts console event rule engine to create an alert when any circuit opens in production:
alert_rules:
  - name: circuit-breaker-opened
    event_type: circuit_breaker.opened
    severity: high
    channels:
      - pagerduty
      - slack-ops
    message: "Circuit breaker opened for provider {{ target_id }} after {{ failure_count }} failures."
Best Practices
- Set `context_window_exceeded` retries to `0`. Retrying a context-length error on the same provider always fails: the model cannot process a prompt that exceeds its window. Either truncate the prompt or route to a provider with a larger context window.
- Keep `consecutive_failure_threshold` low for user-facing paths (3–5) and higher for batch paths (8–10). Low thresholds protect real-time UX from slow provider degradation; higher thresholds tolerate normal variance in batch workloads.
- Always set `max_ms` on exponential backoff. Without a cap, exponential backoff can produce delays of tens of seconds by attempt 6+, turning a transient error into an apparent hang.
- Enable `jitter: true` in multi-instance deployments. Without jitter, all gateway instances follow the same retry schedule and retry simultaneously, creating thundering-herd traffic spikes against a recovering upstream.
- Use per-trigger `rate_limit` retries generously. Rate-limit responses (HTTP 429) are expected under normal conditions. Setting `per_trigger.rate_limit: 5` with exponential backoff gracefully absorbs token-bucket refill cycles without surfacing 429s to clients.
- Monitor circuit breaker open/close events. Every circuit state change emits a structured event in the Keeptrusts event stream (`circuit_breaker.opened`, `circuit_breaker.half_opened`, `circuit_breaker.closed`). Alert on `circuit_breaker.opened` for any production provider to detect upstream degradation before it impacts SLOs.
For AI systems
- Canonical terms: Keeptrusts Circuit Breaker, retry policy, Zero Completion Insurance (ZCI), backoff strategy.
- Config keys: `circuit_breaker.enabled`, `circuit_breaker.consecutive_failure_threshold`, `circuit_breaker.cooldown_seconds`, `circuit_breaker.half_open_successes`, `circuit_breaker_defaults`, `retry.max_retries`, `retry.per_trigger`, `retry.backoff.strategy` (fixed | linear | exponential), `retry.backoff.base_ms`, `retry.backoff.max_ms`, `retry.jitter`, `zero_completion_insurance`.
- Circuit states: Closed → Open → Half-Open → Closed.
- Retry triggers: `rate_limit`, `timeout`, `service_unavailable`, `context_window_exceeded`, `server_error`, `empty_response`.
- ZCI conditions: `empty_response`, `low_quality_score`, `failed_assertion`, `stop_reason_length`.
- Event types: `circuit_breaker.opened`, `circuit_breaker.half_opened`, `circuit_breaker.closed`, `circuit_breaker.rejected`, `retry.attempt`, `retry.exhausted`, `zci.triggered`.
- Best next pages: Provider Fallback, Model Groups, Provider Routing.
For engineers
- Prerequisites: At least two provider targets configured for fallback to be useful alongside circuit breakers.
- Set `context_window_exceeded` retries to `0`; retrying on the same provider always fails for context errors.
- Always set `backoff.max_ms` to cap exponential backoff (recommended: 8000–15000 ms).
- Enable `jitter: true` in multi-instance deployments to prevent thundering-herd retry storms.
- Monitor: filter Events by `event_type: circuit_breaker.opened` and alert on it to detect upstream degradation early.
- Test: temporarily reduce `consecutive_failure_threshold` to 1 and cause a single failure to verify that the circuit opens and the fallback activates.
For leaders
- Availability impact: Circuit breakers with fallback providers can achieve 99.9%+ effective uptime even when individual providers experience outages.
- Cost trade-off: Retry policies consume additional tokens on retry attempts; set `per_trigger` budgets per error type to control wasted spend.
- SLO alignment: Set `consecutive_failure_threshold` low (3–5) for user-facing endpoints and higher (8–10) for batch workloads.
- Zero Completion Insurance prevents silent quality degradation by retrying empty or truncated responses on a backup provider.
Next steps
- Provider Fallback — configure multi-provider fallback chains
- Model Groups — define fallback groups with cascading tiers
- Provider Routing — routing strategies that complement circuit breakers
- Rate Limiting — prevent upstream rate limits from triggering unnecessary retries