Incident Response for AI System Failures

AI-governed systems introduce failure modes beyond traditional services — policy misconfigurations, provider outages, escalation backlogs, and unexpected block-rate spikes. This guide covers detection, response, and post-mortem workflows using Keeptrusts tooling.

Use this page when

  • You are setting up alerting on gateway health, block-rate spikes, or escalation queue growth
  • You need to diagnose a policy misconfiguration that is blocking legitimate traffic
  • You are building runbook automation for AI system incident response
  • You want to conduct a post-mortem analysis using decision event data from the console

Primary audience

  • Primary: Technical Engineers
  • Secondary: AI Agents, Technical Leaders

Gateway Health Checks

The gateway exposes a /health endpoint for liveness and readiness probes:

# Basic health check
curl -f http://localhost:41002/health
# Response: {"status":"healthy","uptime_seconds":86400}

Monitoring Configuration

Set up alerting thresholds on gateway health:

# prometheus/alerts.yml
groups:
  - name: keeptrusts-gateway
    rules:
      - alert: GatewayUnhealthy
        expr: up{job="kt-gateway"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Keeptrusts gateway is down"
          runbook: "https://runbooks.internal/keeptrusts/gateway-down"

      - alert: HighBlockRate
        expr: |
          rate(keeptrusts_decisions_total{decision="blocked"}[5m])
            / rate(keeptrusts_decisions_total[5m]) > 0.3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Block rate exceeds 30% over 5 minutes"
          runbook: "https://runbooks.internal/keeptrusts/high-block-rate"

      - alert: HighGatewayLatency
        expr: |
          histogram_quantile(0.99,
            rate(keeptrusts_request_duration_seconds_bucket[5m])
          ) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway p99 latency exceeds 5 seconds"

Incident Detection Signals

| Signal | Source | Severity | Likely Cause |
| --- | --- | --- | --- |
| Gateway health check failing | /health probe | Critical | Process crash, resource exhaustion |
| Block rate spike | Decision events | High | Policy misconfiguration, attack |
| Escalation queue growing | Escalation API | High | Reviewers unavailable, policy too strict |
| Provider error rate spike | Gateway logs | High | Upstream provider outage |
| Latency p99 > 5s | Gateway metrics | Medium | Provider slowdown, policy chain bottleneck |
| Event ingestion lag | API metrics | Medium | Database performance, network issue |

Escalation Workflows

When the gateway escalates a decision, it creates an escalation record requiring human review.
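A representative record is shown below; the field names are inferred from the queries later in this guide rather than a canonical schema:

# Illustrative escalation record (fields inferred, not a canonical schema)
{
  "id": "esc_example",
  "status": "pending",
  "policy_name": "content-safety-default",
  "event_id": "evt_example",
  "created_at": "2026-04-23T14:07:12Z"
}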

Monitoring the Escalation Queue

# Check pending escalations
curl -s "https://api.keeptrusts.example/v1/escalations?status=pending" \
-H "Authorization: Bearer $API_TOKEN" | jq '{
pending_count: .meta.total,
oldest: .data[-1].created_at,
by_policy: [.data[] | .policy_name] | group_by(.) | map({(.[0]): length}) | add
}'

Alert when the escalation backlog exceeds thresholds:

#!/bin/bash
# scripts/check-escalation-backlog.sh
set -euo pipefail

PENDING=$(curl -s "${KEEPTRUSTS_API_URL}/v1/escalations?status=pending&limit=0" \
  -H "Authorization: Bearer ${API_TOKEN}" | jq '.meta.total')

if [[ "$PENDING" -gt 50 ]]; then
  echo "CRITICAL: ${PENDING} pending escalations"
  # Send alert to PagerDuty / Slack
  exit 2
elif [[ "$PENDING" -gt 20 ]]; then
  echo "WARNING: ${PENDING} pending escalations"
  exit 1
else
  echo "OK: ${PENDING} pending escalations"
  exit 0
fi
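The exit codes follow the common monitoring convention (0 = OK, 1 = warning, 2 = critical), so the script slots into most check schedulers. For example, a cron entry running it every five minutes — the install path is illustrative:

*/5 * * * * /opt/keeptrusts/scripts/check-escalation-backlog.sh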

Console Incident View

The Keeptrusts console provides real-time visibility during incidents:

Events Dashboard

Filter events by decision type, time range, and gateway to understand the blast radius:

  • Events page → Filter by decision: blocked and the incident time window
  • Gateway filter → Isolate the affected gateway instance
  • User filter → Determine which users are impacted

Escalations Page

Review and resolve pending escalations during an incident:

  • Sort by oldest first to clear the backlog in order
  • Bulk-resolve escalations caused by a known policy misconfiguration (a scripted sketch follows this list)
  • Add resolution notes linking to the incident ticket
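Bulk resolution can also be scripted against the API. The sketch below assumes a POST /v1/escalations/{id}/resolve endpoint accepting a resolution and a note — that endpoint shape is an assumption, not documented API, so confirm it against the API Reference before use:

#!/bin/bash
# Hypothetical bulk-resolve sketch — the /resolve endpoint and payload are assumptions
set -euo pipefail

POLICY="${1:?Usage: bulk-resolve.sh <policy-name> <ticket-url>}"
TICKET="${2:?Usage: bulk-resolve.sh <policy-name> <ticket-url>}"

# Resolve every pending escalation created by the misconfigured policy
curl -s "${KEEPTRUSTS_API_URL}/v1/escalations?status=pending" \
  -H "Authorization: Bearer ${API_TOKEN}" |
  jq -r --arg p "$POLICY" '.data[] | select(.policy_name == $p) | .id' |
  while read -r id; do
    curl -s -X POST "${KEEPTRUSTS_API_URL}/v1/escalations/${id}/resolve" \
      -H "Authorization: Bearer ${API_TOKEN}" \
      -H "Content-Type: application/json" \
      -d "{\"resolution\": \"false_positive\", \"note\": \"${TICKET}\"}"
  done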

Runbook Automation

Gateway Restart

#!/bin/bash
# runbooks/gateway-restart.sh
set -euo pipefail

GATEWAY_ID="${1:?Usage: gateway-restart.sh <gateway-id>}"
echo "Restarting gateway ${GATEWAY_ID}..."

# Kubernetes: the gateway runs as the shared kt-gateway deployment;
# GATEWAY_ID is used for logging and health reporting below
kubectl rollout restart deployment/kt-gateway -n platform
kubectl rollout status deployment/kt-gateway -n platform --timeout=120s

# Verify health
sleep 5
if curl -sf "http://kt-gateway.platform.svc:41002/health" > /dev/null; then
  echo "Gateway ${GATEWAY_ID} is healthy after restart."
else
  echo "CRITICAL: Gateway ${GATEWAY_ID} failed health check after restart."
  exit 1
fi

Policy Rollback

Roll back to a previous known-good configuration:

#!/bin/bash
# runbooks/policy-rollback.sh
set -euo pipefail

CONFIG_ID="${1:?Usage: policy-rollback.sh <config-id>}"
TARGET_VERSION="${2:?Usage: policy-rollback.sh <config-id> <version>}"

echo "Rolling back config ${CONFIG_ID} to version ${TARGET_VERSION}..."

# Fetch the target version content
CONTENT=$(curl -s "${KEEPTRUSTS_API_URL}/v1/configurations/${CONFIG_ID}/versions/${TARGET_VERSION}" \
  -H "Authorization: Bearer ${API_TOKEN}" | jq -r '.data.content')

# Validate before applying
echo "$CONTENT" > /tmp/rollback-config.yaml
kt policy lint --file /tmp/rollback-config.yaml

# Apply the rollback
curl -X PUT "${KEEPTRUSTS_API_URL}/v1/configurations/${CONFIG_ID}" \
  -H "Authorization: Bearer ${API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"content\": $(echo "$CONTENT" | jq -Rs .)}"

echo "Rollback complete. Gateways will pick up the new config on next sync."
rm -f /tmp/rollback-config.yaml
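For example, rolling configuration cfg_abc123 back to version 7 (both IDs are placeholders):

./runbooks/policy-rollback.sh cfg_abc123 7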

Emergency Policy Bypass

For critical production incidents where governance policies are incorrectly blocking legitimate traffic:

# emergency-passthrough.yaml — use ONLY during declared incidents
gateway:
  port: 41002
  secret_key_ref:
    env: OPENAI_API_KEY

policies:
  - name: emergency-passthrough
    input:
      - type: content_safety
        action: flag  # Log but don't block
    output:
      - type: content_safety
        action: flag

Emergency bypass should only be applied during a declared incident with approval from the governance team. All requests during bypass are still logged as decision events for post-incident review.
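Applying the bypass can reuse the same configurations endpoint shown in the rollback runbook. A sketch, assuming the affected config ID is already known:

# Apply the bypass via the configurations API (CONFIG_ID is a placeholder)
curl -X PUT "${KEEPTRUSTS_API_URL}/v1/configurations/${CONFIG_ID}" \
  -H "Authorization: Bearer ${API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"content\": $(jq -Rs . < emergency-passthrough.yaml)}"

Keep a copy of the replaced configuration so the bypass can be reverted the same way once the incident is resolved.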

Post-Mortem with Event Data

Gathering Evidence

Export events from the incident window:

# Export incident events
curl -X POST "${KEEPTRUSTS_API_URL}/v1/exports" \
-H "Authorization: Bearer ${API_TOKEN}" \
-H "Content-Type: application/json" \
-d '{
"type": "events",
"format": "jsonl",
"filter": {
"start_time": "2026-04-23T14:00:00Z",
"end_time": "2026-04-23T16:00:00Z"
}
}'
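Once the export completes, the JSONL can be sliced locally before loading it anywhere. For example, per-minute block counts — this assumes each line carries timestamp and decision fields (consistent with the SQL schema below) and that the export was saved as incident-events.jsonl:

# Per-minute block counts from the exported JSONL
jq -r 'select(.decision == "blocked") | .timestamp[:16]' incident-events.jsonl |
  sort | uniq -c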

Analysis Queries

-- Timeline of the incident
SELECT
  date_trunc('minute', timestamp) AS minute,
  decision,
  COUNT(*) AS count
FROM ai_decision_events
WHERE timestamp BETWEEN '2026-04-23 14:00:00Z' AND '2026-04-23 16:00:00Z'
GROUP BY minute, decision
ORDER BY minute;

-- Which policy caused the most blocks during the incident?
SELECT
  pe.policy_name,
  COUNT(*) AS block_count
FROM policy_evaluations pe
JOIN ai_decision_events e ON pe.event_id = e.id
WHERE e.timestamp BETWEEN '2026-04-23 14:00:00Z' AND '2026-04-23 16:00:00Z'
  AND pe.result = 'block'
GROUP BY pe.policy_name
ORDER BY block_count DESC;

-- Impacted users (DISTINCT is redundant with GROUP BY)
SELECT user_id, COUNT(*) AS blocked_requests
FROM ai_decision_events
WHERE timestamp BETWEEN '2026-04-23 14:00:00Z' AND '2026-04-23 16:00:00Z'
  AND decision = 'blocked'
GROUP BY user_id
ORDER BY blocked_requests DESC
LIMIT 20;

Post-Mortem Template

## Incident: [Title]
**Date:** 2026-04-23 | **Duration:** 2 hours | **Severity:** P1

### Summary
[What happened and who was impacted]

### Timeline
- 14:00 — Policy config deployed via GitOps sync
- 14:05 — Block rate spiked from 2% to 45%
- 14:12 — Alert fired: HighBlockRate
- 14:20 — Root cause identified: overly broad content filter
- 14:25 — Emergency bypass applied
- 14:30 — Policy rolled back to previous version
- 16:00 — Normal block rate restored, bypass removed

### Root Cause
[Policy misconfiguration / provider outage / etc.]

### Governance Event Data
- Total events during window: 12,450
- Incorrectly blocked: 5,602 (45%)
- Affected users: 234
- Affected teams: 8

### Action Items
- [ ] Add policy behavior tests for edge cases
- [ ] Implement canary policy deployment
- [ ] Add block-rate circuit breaker to deployment pipeline

Key Takeaways

  • Monitor gateway /health, block rate, escalation queue depth, and provider error rates
  • Use the console Events page to scope blast radius during incidents
  • Maintain runbooks for gateway restart, policy rollback, and emergency bypass
  • Export and query decision events to build evidence-based post-mortems
  • Every incident is an opportunity to add policy behavior tests that prevent recurrence

For AI systems

  • Canonical terms: /health endpoint, keeptrusts_decisions_total, keeptrusts_request_duration_seconds_bucket, escalation workflow, block rate, POST /v1/events, GET /v1/escalations, Prometheus alerting rules, runbook automation
  • Key detection signals: gateway health probe, block rate > 30% over 5 min, escalation queue growing, provider error rate spike, P99 > 5s
  • Best next pages: Observability Patterns, Resilience Engineering, Event-Driven Architecture

For engineers

  • Health check: curl -f http://localhost:41002/health — integrate as Kubernetes liveness and readiness probe
  • Set Prometheus alerts: GatewayUnhealthy (up == 0 for 30s), HighBlockRate (> 30% over 5 min), HighGatewayLatency (P99 > 5s)
  • Escalation workflow: gateway escalates → API creates record → webhook fires → reviewer acts in console
  • Post-mortem data: query GET /v1/events?decision=blocked&since=<incident_start> to reconstruct timeline
  • Common root causes: policy misconfiguration (sudden block spike), provider outage (5xx spike), stale config (drift from Git)

For leaders

  • AI system incidents include novel failure modes — a single policy typo can block all legitimate traffic across the organization
  • Escalation queue growth is a leading indicator of reviewer capacity problems or overly strict policies
  • Mean time to detection (MTTD) and mean time to resolution (MTTR) for AI incidents should be tracked as governance KPIs
