# Incident Response for AI System Failures
AI-governed systems introduce failure modes beyond traditional services — policy misconfigurations, provider outages, escalation backlogs, and unexpected block-rate spikes. This guide covers detection, response, and post-mortem workflows using Keeptrusts tooling.
## Use this page when
- You are setting up alerting on gateway health, block-rate spikes, or escalation queue growth
- You need to diagnose a policy misconfiguration that is blocking legitimate traffic
- You are building runbook automation for AI system incident response
- You want to conduct a post-mortem analysis using decision event data from the console
## Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
## Gateway Health Checks

The gateway exposes a `/health` endpoint for liveness and readiness probes:

```bash
# Basic health check
curl -f http://localhost:41002/health
# Response: {"status":"healthy","uptime_seconds":86400}
```
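In Kubernetes, the same endpoint can back both probe types. A minimal sketch; the probe timings and thresholds here are assumptions, not product defaults:

```yaml
# Probe fragment for the gateway container (timings are assumptions)
livenessProbe:
  httpGet:
    path: /health
    port: 41002
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 41002
  periodSeconds: 5
```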
## Monitoring Configuration

Set up alerting thresholds on gateway health:

```yaml
# prometheus/alerts.yml
groups:
  - name: keeptrusts-gateway
    rules:
      - alert: GatewayUnhealthy
        expr: up{job="kt-gateway"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Keeptrusts gateway is down"
          runbook: "https://runbooks.internal/keeptrusts/gateway-down"
      - alert: HighBlockRate
        expr: |
          rate(keeptrusts_decisions_total{decision="blocked"}[5m])
            / rate(keeptrusts_decisions_total[5m]) > 0.3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Block rate exceeds 30% over 5 minutes"
          runbook: "https://runbooks.internal/keeptrusts/high-block-rate"
      - alert: HighGatewayLatency
        expr: |
          histogram_quantile(0.99,
            rate(keeptrusts_request_duration_seconds_bucket[5m])
          ) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway p99 latency exceeds 5 seconds"
```
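The severity labels above can drive paging in Alertmanager. A routing sketch in which the receiver names are placeholders for your own integrations:

```yaml
# alertmanager.yml routing fragment (receiver names are placeholders)
route:
  receiver: default
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall
    - match:
        severity: warning
      receiver: slack-platform-alerts
```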
## Incident Detection Signals
| Signal | Source | Severity | Likely Cause |
|---|---|---|---|
| Gateway health check failing | /health probe | Critical | Process crash, resource exhaustion |
| Block rate spike | Decision events | High | Policy misconfiguration, attack |
| Escalation queue growing | Escalation API | High | Reviewers unavailable, policy too strict |
| Provider error rate spike | Gateway logs | High | Upstream provider outage |
| Latency p99 > 5s | Gateway metrics | Medium | Provider slowdown, policy chain bottleneck |
| Event ingestion lag | API metrics | Medium | Database performance, network issue |
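For a quick spot check outside Prometheus, the block-rate signal can be computed from a window of decision events with jq. A sketch over toy data; the event shape is an assumption, and in practice the window would come from the events API:

```bash
# Five sample decision events (shape assumed; in practice pull a window from the API)
EVENTS='[{"decision":"blocked"},{"decision":"allowed"},{"decision":"blocked"},{"decision":"allowed"},{"decision":"blocked"}]'

# Percentage of blocked decisions in the window
RATE=$(echo "$EVENTS" | jq '100 * (map(select(.decision == "blocked")) | length) / length')
echo "block rate: ${RATE}%"   # 60% here, well above the 30% alert threshold
```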
## Escalation Workflows

When the gateway escalates a decision, it creates an escalation record that requires human review.
### Monitoring the Escalation Queue

```bash
# Check pending escalations
curl -s "https://api.keeptrusts.example/v1/escalations?status=pending" \
  -H "Authorization: Bearer $API_TOKEN" | jq '{
    pending_count: .meta.total,
    oldest: .data[-1].created_at,
    by_policy: [.data[] | .policy_name] | group_by(.) | map({(.[0]): length}) | add
  }'
```
Alert when the escalation backlog exceeds thresholds:
```bash
#!/bin/bash
# scripts/check-escalation-backlog.sh
set -euo pipefail

PENDING=$(curl -s "${KEEPTRUSTS_API_URL}/v1/escalations?status=pending&limit=0" \
  -H "Authorization: Bearer ${API_TOKEN}" | jq '.meta.total')

if [[ "$PENDING" -gt 50 ]]; then
  echo "CRITICAL: ${PENDING} pending escalations"
  # Send alert to PagerDuty / Slack
  exit 2
elif [[ "$PENDING" -gt 20 ]]; then
  echo "WARNING: ${PENDING} pending escalations"
  exit 1
else
  echo "OK: ${PENDING} pending escalations"
  exit 0
fi
```
## Console Incident View
The Keeptrusts console provides real-time visibility during incidents:
### Events Dashboard

Filter events by decision type, time range, and gateway to understand the blast radius:

- Events page → Filter by `decision: blocked` and the incident time window
- Gateway filter → Isolate the affected gateway instance
- User filter → Determine which users are impacted
### Escalations Page
Review and resolve pending escalations during an incident:
- Sort by oldest first to clear the backlog in order
- Bulk-resolve escalations caused by a known policy misconfiguration
- Add resolution notes linking to the incident ticket
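The oldest-first and per-policy views can be reproduced locally with jq over the escalation payload. A sketch on sample data; the field names are assumed to match the queue-monitoring query earlier on this page:

```bash
# Sample escalation payload (field names assumed to match the escalations API)
ESCALATIONS='{"data":[
  {"id":"esc_3","policy_name":"content_safety","created_at":"2026-04-23T14:20:00Z"},
  {"id":"esc_1","policy_name":"content_safety","created_at":"2026-04-23T14:05:00Z"},
  {"id":"esc_2","policy_name":"pii_filter","created_at":"2026-04-23T14:10:00Z"}]}'

# Oldest first, so reviewers clear the backlog in order
echo "$ESCALATIONS" | jq -r '.data | sort_by(.created_at) | .[].id'

# Group by policy to spot a misconfigured policy driving the queue
echo "$ESCALATIONS" | jq -c '.data | group_by(.policy_name) | map({(.[0].policy_name): length}) | add'
```

If one policy dominates the grouping, that points at a policy misconfiguration rather than reviewer capacity.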
## Runbook Automation

### Gateway Restart

```bash
#!/bin/bash
# runbooks/gateway-restart.sh
set -euo pipefail

GATEWAY_ID="${1:?Usage: gateway-restart.sh <gateway-id>}"

echo "Restarting gateway ${GATEWAY_ID}..."

# Kubernetes
kubectl rollout restart deployment/kt-gateway -n platform
kubectl rollout status deployment/kt-gateway -n platform --timeout=120s

# Verify health
sleep 5
if curl -sf "http://kt-gateway.platform.svc:41002/health" > /dev/null; then
  echo "Gateway ${GATEWAY_ID} is healthy after restart."
else
  echo "CRITICAL: Gateway ${GATEWAY_ID} failed health check after restart."
  exit 1
fi
```
### Policy Rollback

Roll back to a previous known-good configuration:

```bash
#!/bin/bash
# runbooks/policy-rollback.sh
set -euo pipefail

CONFIG_ID="${1:?Usage: policy-rollback.sh <config-id> <version>}"
TARGET_VERSION="${2:?Usage: policy-rollback.sh <config-id> <version>}"

echo "Rolling back config ${CONFIG_ID} to version ${TARGET_VERSION}..."

# Fetch the target version content
CONTENT=$(curl -s "${KEEPTRUSTS_API_URL}/v1/configurations/${CONFIG_ID}/versions/${TARGET_VERSION}" \
  -H "Authorization: Bearer ${API_TOKEN}" | jq -r '.data.content')

# Validate before applying
echo "$CONTENT" > /tmp/rollback-config.yaml
kt policy lint --file /tmp/rollback-config.yaml

# Apply the rollback
curl -X PUT "${KEEPTRUSTS_API_URL}/v1/configurations/${CONFIG_ID}" \
  -H "Authorization: Bearer ${API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"content\": $(echo "$CONTENT" | jq -Rs .)}"

echo "Rollback complete. Gateways will pick up the new config on next sync."
rm -f /tmp/rollback-config.yaml
```
### Emergency Policy Bypass

For critical production incidents where governance policies are incorrectly blocking legitimate traffic:

```yaml
# emergency-passthrough.yaml — use ONLY during declared incidents
gateway:
  port: 41002
  secret_key_ref:
    env: OPENAI_API_KEY
policies:
  - name: emergency-passthrough
    input:
      - type: content_safety
        action: flag  # Log but don't block
    output:
      - type: content_safety
        action: flag
```
## Post-Mortem with Event Data

### Gathering Evidence

Export events from the incident window:

```bash
# Export incident events
curl -X POST "${KEEPTRUSTS_API_URL}/v1/exports" \
  -H "Authorization: Bearer ${API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "events",
    "format": "jsonl",
    "filter": {
      "start_time": "2026-04-23T14:00:00Z",
      "end_time": "2026-04-23T16:00:00Z"
    }
  }'
```
### Analysis Queries

```sql
-- Timeline of the incident
SELECT
  date_trunc('minute', timestamp) AS minute,
  decision,
  COUNT(*) AS count
FROM ai_decision_events
WHERE timestamp BETWEEN '2026-04-23 14:00:00Z' AND '2026-04-23 16:00:00Z'
GROUP BY minute, decision
ORDER BY minute;

-- Which policy caused the most blocks during the incident?
SELECT
  pe.policy_name,
  COUNT(*) AS block_count
FROM policy_evaluations pe
JOIN ai_decision_events e ON pe.event_id = e.id
WHERE e.timestamp BETWEEN '2026-04-23 14:00:00Z' AND '2026-04-23 16:00:00Z'
  AND pe.result = 'block'
GROUP BY pe.policy_name
ORDER BY block_count DESC;

-- Impacted users
SELECT user_id, COUNT(*) AS blocked_requests
FROM ai_decision_events
WHERE timestamp BETWEEN '2026-04-23 14:00:00Z' AND '2026-04-23 16:00:00Z'
  AND decision = 'blocked'
GROUP BY user_id
ORDER BY blocked_requests DESC
LIMIT 20;
```
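When the export lands as JSONL, the timeline query above can be approximated with jq before loading anything into a database. A sketch over toy data; the field names are assumed to match the export schema:

```bash
# Toy JSONL export: one decision event per line (field names assumed)
cat > /tmp/events.jsonl <<'EOF'
{"timestamp":"2026-04-23T14:05:12Z","decision":"blocked","user_id":"u1"}
{"timestamp":"2026-04-23T14:05:40Z","decision":"allowed","user_id":"u2"}
{"timestamp":"2026-04-23T14:06:03Z","decision":"blocked","user_id":"u1"}
EOF

# Per-minute decision counts, mirroring the SQL timeline query:
# truncate each timestamp to the minute, group, and count
jq -s -r 'group_by(.timestamp[:16] + " " + .decision)
  | .[] | "\(.[0].timestamp[:16]) \(.[0].decision) \(length)"' /tmp/events.jsonl
```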
### Post-Mortem Template

```markdown
## Incident: [Title]

**Date:** 2026-04-23 | **Duration:** 2 hours | **Severity:** P1

### Summary

[What happened and who was impacted]

### Timeline

- 14:00 — Policy config deployed via GitOps sync
- 14:05 — Block rate spiked from 2% to 45%
- 14:12 — Alert fired: HighBlockRate
- 14:20 — Root cause identified: overly broad content filter
- 14:25 — Emergency bypass applied
- 14:30 — Policy rolled back to previous version
- 16:00 — Normal block rate restored, bypass removed

### Root Cause

[Policy misconfiguration / provider outage / etc.]

### Governance Event Data

- Total events during window: 12,450
- Incorrectly blocked: 5,602 (45%)
- Affected users: 234
- Affected teams: 8

### Action Items

- [ ] Add policy behavior tests for edge cases
- [ ] Implement canary policy deployment
- [ ] Add block-rate circuit breaker to deployment pipeline
```
## Key Takeaways

- Monitor gateway `/health`, block rate, escalation queue depth, and provider error rates
- Use the console Events page to scope blast radius during incidents
- Maintain runbooks for gateway restart, policy rollback, and emergency bypass
- Export and query decision events to build evidence-based post-mortems
- Every incident is an opportunity to add policy behavior tests that prevent recurrence
## For AI systems

- Canonical terms: `/health` endpoint, `keeptrusts_decisions_total`, `keeptrusts_request_duration_seconds_bucket`, escalation workflow, block rate, `POST /v1/events`, `GET /v1/escalations`, Prometheus alerting rules, runbook automation
- Key detection signals: gateway health probe, block rate > 30% over 5 min, escalation queue growing, provider error rate spike, p99 > 5s
- Best next pages: Observability Patterns, Resilience Engineering, Event-Driven Architecture
## For engineers

- Health check: `curl -f http://localhost:41002/health` — integrate as Kubernetes liveness and readiness probe
- Set Prometheus alerts: `GatewayUnhealthy` (up == 0 for 30s), `HighBlockRate` (> 30% over 5 min), `HighGatewayLatency` (p99 > 5s)
- Escalation workflow: gateway escalates → API creates record → webhook fires → reviewer acts in console
- Post-mortem data: query `GET /v1/events?outcome=blocked&since=<incident_start>` to reconstruct timeline
- Common root causes: policy misconfiguration (sudden block spike), provider outage (5xx spike), stale config (drift from Git)
## For leaders
- AI system incidents include novel failure modes — a single policy typo can block all legitimate traffic across the organization
- Escalation queue growth is a leading indicator of reviewer capacity problems or overly strict policies
- Mean time to detection (MTTD) and mean time to resolution (MTTR) for AI incidents should be tracked as governance KPIs
## Next steps
- Observability for AI-Governed Systems — build the monitoring foundation
- Resilience Engineering for AI Services — prevent incidents through failover
- CI/CD Pipeline Integration — catch policy misconfigurations before they reach production
- Event-Driven AI Architecture — real-time alerting on governance events