# Incident Response for AI System Failures
AI-governed systems introduce failure modes beyond traditional services — policy misconfigurations, provider outages, escalation backlogs, and unexpected block-rate spikes. This guide covers detection, response, and post-mortem workflows using Keeptrusts tooling.
## Use this page when
- You are setting up alerting on gateway health, block-rate spikes, or escalation queue growth
- You need to diagnose a policy misconfiguration that is blocking legitimate traffic
- You are building runbook automation for AI system incident response
- You want to conduct a post-mortem analysis using decision event data from the console
## Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
## Gateway Health Checks

The gateway exposes a `/health` endpoint for liveness and readiness probes:

```bash
# Basic health check
curl -f http://localhost:41002/health
# Response: {"status":"healthy","uptime_seconds":86400}
```
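In Kubernetes, the same endpoint can back both probe types. A minimal sketch; the probe timings and thresholds here are assumptions, not product defaults:

```yaml
# Probe fragment for the gateway container (timings are assumptions)
livenessProbe:
  httpGet:
    path: /health
    port: 41002
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 41002
  periodSeconds: 5
```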
## Monitoring Configuration

Set up alerting thresholds on gateway health:

```yaml
# prometheus/alerts.yml
groups:
  - name: keeptrusts-gateway
    rules:
      - alert: GatewayUnhealthy
        expr: up{job="kt-gateway"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Keeptrusts gateway is down"
          runbook: "https://runbooks.internal/keeptrusts/gateway-down"
      - alert: HighBlockRate
        expr: |
          rate(keeptrusts_decisions_total{decision="blocked"}[5m])
            / rate(keeptrusts_decisions_total[5m]) > 0.3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Block rate exceeds 30% over 5 minutes"
          runbook: "https://runbooks.internal/keeptrusts/high-block-rate"
      - alert: HighGatewayLatency
        expr: |
          histogram_quantile(0.99,
            rate(keeptrusts_request_duration_seconds_bucket[5m])
          ) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway p99 latency exceeds 5 seconds"
```
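The severity labels above can drive paging in Alertmanager. A routing sketch in which the receiver names are placeholders for your own integrations:

```yaml
# alertmanager.yml routing fragment (receiver names are placeholders)
route:
  receiver: default
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall
    - match:
        severity: warning
      receiver: slack-platform-alerts
```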
## Incident Detection Signals
| Signal | Source | Severity | Likely Cause |
|---|---|---|---|
| Gateway health check failing | /health probe | Critical | Process crash, resource exhaustion |
| Block rate spike | Decision events | High | Policy misconfiguration, attack |
| Escalation queue growing | Escalation API | High | Reviewers unavailable, policy too strict |
| Provider error rate spike | Gateway logs | High | Upstream provider outage |
| Latency p99 > 5s | Gateway metrics | Medium | Provider slowdown, policy chain bottleneck |
| Event ingestion lag | API metrics | Medium | Database performance, network issue |
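For a quick spot check outside Prometheus, the block-rate signal can be computed from a window of decision events with jq. A sketch over toy data; the event shape is an assumption, and in practice the window would come from the events API:

```bash
# Five sample decision events (shape assumed; in practice pull a window from the API)
EVENTS='[{"decision":"blocked"},{"decision":"allowed"},{"decision":"blocked"},{"decision":"allowed"},{"decision":"blocked"}]'

# Percentage of blocked decisions in the window
RATE=$(echo "$EVENTS" | jq '100 * (map(select(.decision == "blocked")) | length) / length')
echo "block rate: ${RATE}%"   # 60% here, well above the 30% alert threshold
```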
## Escalation Workflows

When the gateway escalates a decision, it creates an escalation record that requires human review.
### Monitoring the Escalation Queue

```bash
# Check pending escalations
curl -s "https://api.keeptrusts.example/v1/escalations?status=pending" \
  -H "Authorization: Bearer $API_TOKEN" | jq '{
    pending_count: .meta.total,
    oldest: .data[-1].created_at,
    by_policy: [.data[] | .policy_name] | group_by(.) | map({(.[0]): length}) | add
  }'
```
Alert when the escalation backlog exceeds thresholds:
```bash
#!/bin/bash
# scripts/check-escalation-backlog.sh
set -euo pipefail

PENDING=$(curl -s "${KEEPTRUSTS_API_URL}/v1/escalations?status=pending&limit=0" \
  -H "Authorization: Bearer ${API_TOKEN}" | jq '.meta.total')

if [[ "$PENDING" -gt 50 ]]; then
  echo "CRITICAL: ${PENDING} pending escalations"
  # Send alert to PagerDuty / Slack
  exit 2
elif [[ "$PENDING" -gt 20 ]]; then
  echo "WARNING: ${PENDING} pending escalations"
  exit 1
else
  echo "OK: ${PENDING} pending escalations"
  exit 0
fi
```
## Console Incident View
The Keeptrusts console provides real-time visibility during incidents:
### Events Dashboard

Filter events by decision type, time range, and gateway to understand the blast radius:

- Events page → Filter by `decision: blocked` and the incident time window
- Gateway filter → Isolate the affected gateway instance
- User filter → Determine which users are impacted
### Escalations Page
Review and resolve pending escalations during an incident:
- Sort by oldest first to clear the backlog in order
- Bulk-resolve escalations caused by a known policy misconfiguration
- Add resolution notes linking to the incident ticket
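The oldest-first and per-policy views can be reproduced locally with jq over the escalation payload. A sketch on sample data; the field names are assumed to match the queue-monitoring query earlier on this page:

```bash
# Sample escalation payload (field names assumed to match the escalations API)
ESCALATIONS='{"data":[
  {"id":"esc_3","policy_name":"content_safety","created_at":"2026-04-23T14:20:00Z"},
  {"id":"esc_1","policy_name":"content_safety","created_at":"2026-04-23T14:05:00Z"},
  {"id":"esc_2","policy_name":"pii_filter","created_at":"2026-04-23T14:10:00Z"}]}'

# Oldest first, so reviewers clear the backlog in order
echo "$ESCALATIONS" | jq -r '.data | sort_by(.created_at) | .[].id'

# Group by policy to spot a misconfigured policy driving the queue
echo "$ESCALATIONS" | jq -c '.data | group_by(.policy_name) | map({(.[0].policy_name): length}) | add'
```

If one policy dominates the grouping, that points at a policy misconfiguration rather than reviewer capacity.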
## Runbook Automation

### Gateway Restart

```bash
#!/bin/bash
# runbooks/gateway-restart.sh
set -euo pipefail

GATEWAY_ID="${1:?Usage: gateway-restart.sh <gateway-id>}"

echo "Restarting gateway ${GATEWAY_ID}..."

# Kubernetes
kubectl rollout restart deployment/kt-gateway -n platform
kubectl rollout status deployment/kt-gateway -n platform --timeout=120s

# Verify health
sleep 5
if curl -sf "http://kt-gateway.platform.svc:41002/health" > /dev/null; then
  echo "Gateway ${GATEWAY_ID} is healthy after restart."
else
  echo "CRITICAL: Gateway ${GATEWAY_ID} failed health check after restart."
  exit 1
fi
```
### Policy Rollback

Roll back to a previous known-good configuration:

```bash
#!/bin/bash
# runbooks/policy-rollback.sh
set -euo pipefail

CONFIG_ID="${1:?Usage: policy-rollback.sh <config-id> <version>}"
TARGET_VERSION="${2:?Usage: policy-rollback.sh <config-id> <version>}"

echo "Rolling back config ${CONFIG_ID} to version ${TARGET_VERSION}..."

# Fetch the target version content
CONTENT=$(curl -s "${KEEPTRUSTS_API_URL}/v1/configurations/${CONFIG_ID}/versions/${TARGET_VERSION}" \
  -H "Authorization: Bearer ${API_TOKEN}" | jq -r '.data.content')

# Validate before applying
echo "$CONTENT" > /tmp/rollback-config.yaml
kt policy lint --file /tmp/rollback-config.yaml

# Apply the rollback
curl -X PUT "${KEEPTRUSTS_API_URL}/v1/configurations/${CONFIG_ID}" \
  -H "Authorization: Bearer ${API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d "{\"content\": $(echo "$CONTENT" | jq -Rs .)}"

echo "Rollback complete. Gateways will pick up the new config on next sync."
rm -f /tmp/rollback-config.yaml
```
### Emergency Policy Bypass

For critical production incidents where governance policies are incorrectly blocking legitimate traffic:

```yaml
# emergency-passthrough.yaml — use ONLY during declared incidents
gateway:
  port: 41002
  secret_key_ref:
    env: OPENAI_API_KEY
policies:
  - name: emergency-passthrough
    input:
      - type: content_safety
        action: flag  # Log but don't block
    output:
      - type: content_safety
        action: flag
```
## Post-Mortem with Event Data

### Gathering Evidence

Export events from the incident window:

```bash
# Export incident events
curl -X POST "${KEEPTRUSTS_API_URL}/v1/exports" \
  -H "Authorization: Bearer ${API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "events",
    "format": "jsonl",
    "filter": {
      "start_time": "2026-04-23T14:00:00Z",
      "end_time": "2026-04-23T16:00:00Z"
    }
  }'
```
### Analysis Queries

```sql
-- Timeline of the incident
SELECT
  date_trunc('minute', timestamp) AS minute,
  decision,
  COUNT(*) AS count
FROM ai_decision_events
WHERE timestamp BETWEEN '2026-04-23 14:00:00Z' AND '2026-04-23 16:00:00Z'
GROUP BY minute, decision
ORDER BY minute;

-- Which policy caused the most blocks during the incident?
SELECT
  pe.policy_name,
  COUNT(*) AS block_count
FROM policy_evaluations pe
JOIN ai_decision_events e ON pe.event_id = e.id
WHERE e.timestamp BETWEEN '2026-04-23 14:00:00Z' AND '2026-04-23 16:00:00Z'
  AND pe.result = 'block'
GROUP BY pe.policy_name
ORDER BY block_count DESC;

-- Impacted users
SELECT user_id, COUNT(*) AS blocked_requests
FROM ai_decision_events
WHERE timestamp BETWEEN '2026-04-23 14:00:00Z' AND '2026-04-23 16:00:00Z'
  AND decision = 'blocked'
GROUP BY user_id
ORDER BY blocked_requests DESC
LIMIT 20;
```
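When the export lands as JSONL, the timeline query above can be approximated with jq before loading anything into a database. A sketch over toy data; the field names are assumed to match the export schema:

```bash
# Toy JSONL export: one decision event per line (field names assumed)
cat > /tmp/events.jsonl <<'EOF'
{"timestamp":"2026-04-23T14:05:12Z","decision":"blocked","user_id":"u1"}
{"timestamp":"2026-04-23T14:05:40Z","decision":"allowed","user_id":"u2"}
{"timestamp":"2026-04-23T14:06:03Z","decision":"blocked","user_id":"u1"}
EOF

# Per-minute decision counts, mirroring the SQL timeline query:
# truncate each timestamp to the minute, group, and count
jq -s -r 'group_by(.timestamp[:16] + " " + .decision)
  | .[] | "\(.[0].timestamp[:16]) \(.[0].decision) \(length)"' /tmp/events.jsonl
```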
### Post-Mortem Template

```markdown
## Incident: [Title]

**Date:** 2026-04-23 | **Duration:** 2 hours | **Severity:** P1

### Summary

[What happened and who was impacted]

### Timeline

- 14:00 — Policy config deployed via GitOps sync
- 14:05 — Block rate spiked from 2% to 45%
- 14:12 — Alert fired: HighBlockRate
- 14:20 — Root cause identified: overly broad content filter
- 14:25 — Emergency bypass applied
- 14:30 — Policy rolled back to previous version
- 16:00 — Normal block rate restored, bypass removed

### Root Cause

[Policy misconfiguration / provider outage / etc.]

### Governance Event Data

- Total events during window: 12,450
- Incorrectly blocked: 5,602 (45%)
- Affected users: 234
- Affected teams: 8

### Action Items

- [ ] Add policy behavior tests for edge cases
- [ ] Implement canary policy deployment
- [ ] Add block-rate circuit breaker to deployment pipeline
```
## Key Takeaways

- Monitor gateway `/health`, block rate, escalation queue depth, and provider error rates
- Use the console Events page to scope blast radius during incidents
- Maintain runbooks for gateway restart, policy rollback, and emergency bypass
- Export and query decision events to build evidence-based post-mortems
- Every incident is an opportunity to add policy behavior tests that prevent recurrence
## For AI systems

- Canonical terms: `/health` endpoint, `keeptrusts_decisions_total`, `keeptrusts_request_duration_seconds_bucket`, escalation workflow, block rate, `POST /v1/events`, `GET /v1/escalations`, Prometheus alerting rules, runbook automation
- Key detection signals: gateway health probe, block rate > 30% over 5 min, escalation queue growing, provider error rate spike, p99 > 5s
- Best next pages: Observability Patterns, Resilience Engineering, Event-Driven Architecture
## For engineers

- Health check: `curl -f http://localhost:41002/health` — integrate as Kubernetes liveness and readiness probe
- Set Prometheus alerts: `GatewayUnhealthy` (up == 0 for 30s), `HighBlockRate` (> 30% over 5 min), `HighGatewayLatency` (p99 > 5s)
- Escalation workflow: gateway escalates → API creates record → webhook fires → reviewer acts in console
- Post-mortem data: query `GET /v1/events?outcome=blocked&since=<incident_start>` to reconstruct timeline
- Common root causes: policy misconfiguration (sudden block spike), provider outage (5xx spike), stale config (drift from Git)
## For leaders
- AI system incidents include novel failure modes — a single policy typo can block all legitimate traffic across the organization
- Escalation queue growth is a leading indicator of reviewer capacity problems or overly strict policies
- Mean time to detection (MTTD) and mean time to resolution (MTTR) for AI incidents should be tracked as governance KPIs
## Next steps
- Observability for AI-Governed Systems — build the monitoring foundation
- Resilience Engineering for AI Services — prevent incidents through failover
- CI/CD Pipeline Integration — catch policy misconfigurations before they reach production
- Event-Driven AI Architecture — real-time alerting on governance events