Disaster Recovery for AI Governance
AI governance infrastructure is a critical dependency for every AI-powered application. A gateway outage blocks LLM traffic; an API outage creates gaps in the audit trail. This guide establishes recovery procedures, targets, and runbook templates.
Use this page when
- You need to define RTO/RPO targets for AI governance infrastructure
- You are building recovery runbooks for gateway, API database, console, and workers
- You want to set up PostgreSQL WAL archiving, backup verification, and encryption key recovery procedures
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
Component Recovery Profiles
Each Keeptrusts component has a distinct recovery profile:
| Component | State | Recovery Complexity | Impact of Loss |
|---|---|---|---|
| Gateway | Stateless | Low — redeploy with config | AI traffic blocked |
| API | Stateful (Postgres) | Medium — restore from backup | Audit trail gap, auth outage |
| Console | Stateless (BFF) | Low — redeploy | Management UI unavailable |
| Workers | Stateless | Low — redeploy | Export/retention jobs delayed |
RTO/RPO Targets
Define targets based on your organization's requirements:
| Tier | RTO | RPO | Applies To |
|---|---|---|---|
| Critical | 15 minutes | 0 (synchronous replication) | Gateway fleet, API primary |
| High | 1 hour | 15 minutes | API standby, console |
| Standard | 4 hours | 1 hour | Workers, export storage |
| Low | 24 hours | 24 hours | Marketing site, docs |
Gateway Recovery
The gateway is stateless — it loads policy configuration at startup and forwards decision events to the API. Recovery requires only a running binary with access to the policy config.
Recovery Steps
- Deploy new gateway instances from the container image
- Provide policy configuration via one of:
  - Config file mount (`--policy-config policy-config.yaml`)
  - Git-linked repository (auto-syncs on startup)
  - API config endpoint (pulls from control plane)
- Verify health: wait for `/readyz` to return `200` (scriptable; see the sketch below)
- Route traffic: update load balancer or DNS
# Quick recovery with local config
kt gateway run --listen 0.0.0.0:41002 --policy-config /etc/keeptrusts/policy-config.yaml
# Recovery with API-synced config
export KEEPTRUSTS_API_URL="https://api.example.com"
export KEEPTRUSTS_GATEWAY_TOKEN="kt_gk_..."
kt gateway run --listen 0.0.0.0:41002
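The health-verification step can be scripted so traffic is only routed once the gateway is actually serving. A minimal readiness gate, assuming the listen address used above:

```bash
# Wait up to 60 seconds for the gateway to report ready before shifting traffic
for _ in $(seq 1 60); do
  if curl -fsS http://localhost:41002/readyz > /dev/null; then
    echo "gateway ready"
    exit 0
  fi
  sleep 1
done
echo "gateway did not become ready in time" >&2
exit 1
```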
Recovery Time
Gateway cold start is typically under 5 seconds. The primary bottleneck is container scheduling and image pull time.
API Database Backup
Continuous Backup with WAL Archiving
Configure PostgreSQL WAL archiving for point-in-time recovery:
# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://keeptrusts-wal-archive/%f'
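WAL archiving enables point-in-time recovery only together with a periodic base backup (for example via pg_basebackup). A minimal restore-side sketch for PostgreSQL 12 and later, assuming a base backup has already been unpacked into $PGDATA and the archive bucket configured above; the target timestamp is illustrative:

```bash
# Replay archived WAL up to a chosen point in time
cat >> "$PGDATA/postgresql.conf" <<'EOF'
restore_command = 'aws s3 cp s3://keeptrusts-wal-archive/%f %p'
recovery_target_time = '2026-04-23 14:30:00+00'
EOF
touch "$PGDATA/recovery.signal"   # signals Postgres to start in targeted recovery mode
pg_ctl -D "$PGDATA" start
```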
Automated Logical Backups
Schedule daily logical backups:
#!/bin/bash
# backup-keeptrusts-db.sh
set -euo pipefail
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="keeptrusts_backup_${TIMESTAMP}.dump"
pg_dump "$DATABASE_URL" \
--format=custom \
--compress=9 \
--file="/tmp/${BACKUP_FILE}"
aws s3 cp "/tmp/${BACKUP_FILE}" \
"s3://keeptrusts-backups/daily/${BACKUP_FILE}" \
--storage-class STANDARD_IA
rm "/tmp/${BACKUP_FILE}"
Restore Procedure
# Step 1: Create a new database
createdb keeptrusts_restored
# Step 2: Restore from backup
pg_restore --dbname=keeptrusts_restored \
--jobs=4 \
--no-owner \
/tmp/keeptrusts_backup_20260423.dump
# Step 3: Run pending migrations
DATABASE_URL="postgres://...keeptrusts_restored" \
cd api && sqlx migrate run
# Step 4: Point API to restored database
export DATABASE_URL="postgres://keeptrusts:$DB_PASS@db-host:5432/keeptrusts_restored"
# Step 5: Restart API
systemctl restart keeptrusts-api
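After step 5, it is worth confirming the API is serving from the restored database and measuring the data-loss window actually incurred. A sketch; the `events.created_at` column is an assumption about the schema, and the API URL follows the example used earlier:

```bash
# API is back and answering health checks
curl -fsS https://api.example.com/healthz
# Newest restored event approximates the RPO actually achieved (column name assumed)
psql keeptrusts_restored -c "SELECT MAX(created_at) FROM events;"
```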
Backup Verification
Test restores monthly:
# Restore to a test database (create it first)
createdb keeptrusts_dr_test
pg_restore --dbname=keeptrusts_dr_test /tmp/latest_backup.dump
# Run the API test suite against the restored database
DATABASE_URL="postgres://...keeptrusts_dr_test" cargo test --quiet
# Verify row counts match expectations
psql keeptrusts_dr_test -c "SELECT COUNT(*) FROM events;"
psql keeptrusts_dr_test -c "SELECT COUNT(*) FROM configurations;"
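The monthly drill can be wrapped in a single script so it fails loudly under cron or CI. A sketch, assuming the bucket layout and file naming from the backup script above:

```bash
#!/bin/bash
# verify-keeptrusts-backup.sh: restore the newest daily backup into a scratch database
set -euo pipefail
LATEST=$(aws s3 ls s3://keeptrusts-backups/daily/ | sort | tail -n 1 | awk '{print $4}')
aws s3 cp "s3://keeptrusts-backups/daily/${LATEST}" /tmp/latest_backup.dump
dropdb --if-exists keeptrusts_dr_test
createdb keeptrusts_dr_test
pg_restore --dbname=keeptrusts_dr_test --no-owner /tmp/latest_backup.dump
# A missing table or an empty event log fails the drill
psql keeptrusts_dr_test -tAc "SELECT COUNT(*) FROM events;" | grep -qv '^0$'
```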
Console Failover
The console is a stateless Next.js application. Its only dependency is the API server for BFF routes.
Active-Passive Failover
- Deploy console instances in two regions
- Configure DNS failover (Route 53, Cloudflare) with health checks against `/api/health` (see the sketch after this list)
- The passive instance warms up on deploy but receives no traffic
- On primary failure, DNS switches to the passive instance
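A sketch of the DNS side using Route 53 failover routing; the zone ID, record values, and health check ID are placeholders, and Cloudflare load balancing offers an equivalent setup:

```bash
# PRIMARY answers while its health check (against /api/health) passes; SECONDARY takes over otherwise
cat > console-failover.json <<'EOF'
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "console.example.com",
        "Type": "CNAME",
        "SetIdentifier": "primary",
        "Failover": "PRIMARY",
        "TTL": 60,
        "HealthCheckId": "<primary-health-check-id>",
        "ResourceRecords": [{"Value": "console-us-east-1.example.com"}]
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "console.example.com",
        "Type": "CNAME",
        "SetIdentifier": "secondary",
        "Failover": "SECONDARY",
        "TTL": 60,
        "ResourceRecords": [{"Value": "console-eu-west-1.example.com"}]
      }
    }
  ]
}
EOF
aws route53 change-resource-record-sets --hosted-zone-id <zone-id> --change-batch file://console-failover.json
```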
Active-Active
For lower RTO, run console instances in multiple regions behind a global load balancer:
# Kubernetes Ingress with external-dns
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: keeptrusts-console
  annotations:
    external-dns.alpha.kubernetes.io/hostname: console.example.com
    external-dns.alpha.kubernetes.io/ttl: "60"
spec:
  rules:
    - host: console.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: keeptrusts-console
                port:
                  number: 3000
Worker Recovery
Worker binaries (`worker_export`, `worker_lifecycle`, `worker_config`) are stateless processes that poll the database for work. Recovery is straightforward:
- Redeploy the worker container
- The worker resumes processing from the database queue
- In-flight jobs that were interrupted will be retried based on their `status` column
Ensure only one instance of each worker type runs to avoid duplicate processing.
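One way to enforce the single-instance rule at the host level is an exclusive file lock; for multi-host deployments, the equivalent is a Postgres advisory lock held on the worker's own database connection. A sketch for one worker type, where the lock path and binary path are placeholders:

```bash
# Exits immediately if another worker_export already holds the lock on this host
exec flock -n /var/lock/keeptrusts-worker_export.lock /usr/local/bin/worker_export
```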
Encryption Key Recovery
The `KEEPTRUSTS_SECRET_ENCRYPTION_KEY` is critical: losing it means all encrypted secrets (provider keys, Git tokens, webhook secrets) become unrecoverable.
Key Backup Strategy
- Store the encryption key in a separate secret manager (Vault, AWS Secrets Manager); see the sketch after this list
- Maintain an offline backup in a sealed envelope or HSM
- Document the key recovery procedure separately from this runbook
- Never store the encryption key in the same backup as the database
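A sketch of the first point using AWS Secrets Manager; the secret name is a placeholder:

```bash
# Escrow the key separately from database backups (one-time setup)
aws secretsmanager create-secret \
  --name keeptrusts/secret-encryption-key \
  --secret-string "$KEEPTRUSTS_SECRET_ENCRYPTION_KEY"
# Recovery: pull the key back when rebuilding the API host
export KEEPTRUSTS_SECRET_ENCRYPTION_KEY=$(aws secretsmanager get-secret-value \
  --secret-id keeptrusts/secret-encryption-key \
  --query SecretString --output text)
```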
Runbook Template
## Incident: [Component] Failure
**Severity:** P1 / P2 / P3
**RTO Target:** X minutes
**RPO Target:** X minutes
### Detection
- Alert fired: [alert name]
- Dashboard: [link]
- Affected users: [scope]
### Diagnosis
1. Check component health: `curl https://[endpoint]/healthz`
2. Check logs: `kubectl logs -l app=[component] --tail=100`
3. Check dependencies: [database, API, DNS]
### Recovery Steps
1. [Step 1 with exact command]
2. [Step 2 with exact command]
3. [Verification step]
### Post-Recovery
- [ ] Verify data integrity
- [ ] Check audit log for gap
- [ ] Notify stakeholders
- [ ] Schedule post-incident review
DR Testing Schedule
| Test Type | Frequency | Scope |
|---|---|---|
| Backup restore | Monthly | API database |
| Gateway failover | Quarterly | Gateway fleet |
| Console failover | Quarterly | DNS and load balancer |
| Full DR simulation | Annually | All components |
| Encryption key recovery | Annually | Key backup verification |
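The monthly row is easy to automate; an illustrative crontab entry for the restore drill, assuming the verification script from Backup Verification is installed at the path shown:

```bash
# First day of each month at 03:00: run the restore drill (cron mails any failure output)
( crontab -l 2>/dev/null; echo '0 3 1 * * /usr/local/bin/verify-keeptrusts-backup.sh' ) | crontab -
```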
Next steps
- Configure Upgrade Procedures for zero-downtime updates
- Set up Monitoring & Alerting for DR-related alerts
- Review Multi-Region deployment for geographic redundancy
For AI systems
- Canonical terms: RTO, RPO, WAL archiving, stateless gateway, `worker_lifecycle`, `worker_export`, `worker_config`, `KEEPTRUSTS_SECRET_ENCRYPTION_KEY`, advisory lock
- CLI commands: `kt gateway run --listen <host:port> --policy-config <path>`, `export KEEPTRUSTS_API_URL=<url>`, `export KEEPTRUSTS_GATEWAY_TOKEN=<key>`, `kt gateway run --listen <host:port>`
- Health endpoints: `/readyz` (gateway), `/healthz` (API)
- Recovery components: gateway (stateless), API (Postgres-backed), console (stateless BFF), workers (stateless)
- Related pages: Upgrade Procedures, Monitoring & Alerting, Multi-Region
For engineers
- Gateway recovery is under 5 seconds cold start — just deploy the binary with access to policy config
- Configure PostgreSQL WAL archiving for point-in-time API database recovery
- Schedule daily `pg_dump` backups and test restores monthly against a clean database
- Store `KEEPTRUSTS_SECRET_ENCRYPTION_KEY` in a separate secret manager (Vault, AWS Secrets Manager); losing it makes all encrypted secrets unrecoverable
- Ensure only one instance of each worker type runs to avoid duplicate processing
- Validate: run `cargo test --quiet` against a restored database to confirm integrity
For leaders
- Gateway outage blocks all AI traffic; API outage creates audit trail gaps — define component-level RTO/RPO accordingly
- Stateless components (gateway, console, workers) recover in minutes; the API database is the primary DR concern
- Encryption key loss is catastrophic — mandate offline backup and annual verification
- DR testing cadence: monthly backup restores, quarterly failover tests, annual full simulation
- Cross-region deployment (see Multi-Region guide) provides geographic redundancy for critical-tier targets