Disaster Recovery for AI Governance
AI governance infrastructure is a critical dependency for every AI-powered application. A gateway outage blocks LLM traffic; an API outage creates gaps in the audit trail. This guide establishes recovery procedures, targets, and runbook templates.
Use this page when
- You need to define RTO/RPO targets for AI governance infrastructure
- You are building recovery runbooks for gateway, API database, console, and workers
- You want to set up PostgreSQL WAL archiving, backup verification, and encryption key recovery procedures
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
Component Recovery Profiles
Each Keeptrusts component has a distinct recovery profile:
| Component | State | Recovery Complexity | Impact of Loss |
|---|---|---|---|
| Gateway | Stateless | Low — redeploy with config | AI traffic blocked |
| API | Stateful (Postgres) | Medium — restore from backup | Audit trail gap, auth outage |
| Console | Stateless (BFF) | Low — redeploy | Management UI unavailable |
| Workers | Stateless | Low — redeploy | Export/retention jobs delayed |
RTO/RPO Targets
Define targets based on your organization's requirements:
| Tier | RTO | RPO | Applies To |
|---|---|---|---|
| Critical | 15 minutes | 0 (synchronous replication) | Gateway fleet, API primary |
| High | 1 hour | 15 minutes | API standby, console |
| Standard | 4 hours | 1 hour | Workers, export storage |
| Low | 24 hours | 24 hours | Marketing site, docs |
Gateway Recovery
The gateway is stateless — it loads policy configuration at startup and forwards decision events to the API. Recovery requires only a running binary with access to the policy config.
Recovery Steps
- Deploy new gateway instances from the container image
- Provide policy configuration via one of:
  - Config file mount (`--policy-config policy-config.yaml`)
  - Git-linked repository (auto-syncs on startup)
  - API config endpoint (pulls from control plane)
- Verify health: wait for `/readyz` to return `200` (scriptable; see the sketch below)
- Route traffic: update load balancer or DNS
# Quick recovery with local config
kt gateway run --listen 0.0.0.0:41002 --policy-config /etc/keeptrusts/policy-config.yaml
# Recovery with API-synced config
export KEEPTRUSTS_API_URL="https://api.example.com"
export KEEPTRUSTS_GATEWAY_TOKEN="kt_gk_..."
kt gateway run --listen 0.0.0.0:41002
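The health-verification step can be scripted so traffic is only routed once the gateway is actually serving. A minimal readiness gate, assuming the listen address used above:

```bash
# Wait up to 60 seconds for the gateway to report ready before shifting traffic
for _ in $(seq 1 60); do
  if curl -fsS http://localhost:41002/readyz > /dev/null; then
    echo "gateway ready"
    exit 0
  fi
  sleep 1
done
echo "gateway did not become ready in time" >&2
exit 1
```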
Recovery Time
Gateway cold start is typically under 5 seconds. The primary bottleneck is container scheduling and image pull time.
API Database Backup
Continuous Backup with WAL Archiving
Configure PostgreSQL WAL archiving for point-in-time recovery:
# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://keeptrusts-wal-archive/%f'
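WAL archiving enables point-in-time recovery only together with a periodic base backup (for example via pg_basebackup). A minimal restore-side sketch for PostgreSQL 12 and later, assuming a base backup has already been unpacked into $PGDATA and the archive bucket configured above; the target timestamp is illustrative:

```bash
# Replay archived WAL up to a chosen point in time
cat >> "$PGDATA/postgresql.conf" <<'EOF'
restore_command = 'aws s3 cp s3://keeptrusts-wal-archive/%f %p'
recovery_target_time = '2026-04-23 14:30:00+00'
EOF
touch "$PGDATA/recovery.signal"   # signals Postgres to start in targeted recovery mode
pg_ctl -D "$PGDATA" start
```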
Automated Logical Backups
Schedule daily logical backups:
#!/bin/bash
# backup-keeptrusts-db.sh
set -euo pipefail
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="keeptrusts_backup_${TIMESTAMP}.dump"
pg_dump "$DATABASE_URL" \
--format=custom \
--compress=9 \
--file="/tmp/${BACKUP_FILE}"
aws s3 cp "/tmp/${BACKUP_FILE}" \
"s3://keeptrusts-backups/daily/${BACKUP_FILE}" \
--storage-class STANDARD_IA
rm "/tmp/${BACKUP_FILE}"
Restore Procedure
# Step 1: Create a new database
createdb keeptrusts_restored
# Step 2: Restore from backup
pg_restore --dbname=keeptrusts_restored \
--jobs=4 \
--no-owner \
/tmp/keeptrusts_backup_20260423.dump
# Step 3: Run pending migrations
DATABASE_URL="postgres://...keeptrusts_restored" \
cd api && sqlx migrate run
# Step 4: Point API to restored database
export DATABASE_URL="postgres://keeptrusts:$DB_PASS@db-host:5432/keeptrusts_restored"
# Step 5: Restart API
systemctl restart keeptrusts-api
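After step 5, it is worth confirming the API is serving from the restored database and measuring the data-loss window actually incurred. A sketch; the `events.created_at` column is an assumption about the schema, and the API URL follows the example used earlier:

```bash
# API is back and answering health checks
curl -fsS https://api.example.com/healthz
# Newest restored event approximates the RPO actually achieved (column name assumed)
psql keeptrusts_restored -c "SELECT MAX(created_at) FROM events;"
```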
Backup Verification
Test restores monthly:
# Restore to a test database (create it first)
createdb keeptrusts_dr_test
pg_restore --dbname=keeptrusts_dr_test /tmp/latest_backup.dump
# Run the API test suite against the restored database
DATABASE_URL="postgres://...keeptrusts_dr_test" cargo test --quiet
# Verify row counts match expectations
psql keeptrusts_dr_test -c "SELECT COUNT(*) FROM events;"
psql keeptrusts_dr_test -c "SELECT COUNT(*) FROM configurations;"
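The monthly drill can be wrapped in a single script so it fails loudly under cron or CI. A sketch, assuming the bucket layout and file naming from the backup script above:

```bash
#!/bin/bash
# verify-keeptrusts-backup.sh: restore the newest daily backup into a scratch database
set -euo pipefail
LATEST=$(aws s3 ls s3://keeptrusts-backups/daily/ | sort | tail -n 1 | awk '{print $4}')
aws s3 cp "s3://keeptrusts-backups/daily/${LATEST}" /tmp/latest_backup.dump
dropdb --if-exists keeptrusts_dr_test
createdb keeptrusts_dr_test
pg_restore --dbname=keeptrusts_dr_test --no-owner /tmp/latest_backup.dump
# A missing table or an empty event log fails the drill
psql keeptrusts_dr_test -tAc "SELECT COUNT(*) FROM events;" | grep -qv '^0$'
```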
Console Failover
The console is a stateless Next.js application. Its only dependency is the API server for BFF routes.
Active-Passive Failover
- Deploy console instances in two regions
- Configure DNS failover (Route 53, Cloudflare) with health checks against `/api/health` (see the sketch after this list)
- The passive instance warms up on deploy but receives no traffic
- On primary failure, DNS switches to the passive instance
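A sketch of the DNS side using Route 53 failover routing; the zone ID, record values, and health check ID are placeholders, and Cloudflare load balancing offers an equivalent setup:

```bash
# PRIMARY answers while its health check (against /api/health) passes; SECONDARY takes over otherwise
cat > console-failover.json <<'EOF'
{
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "console.example.com",
        "Type": "CNAME",
        "SetIdentifier": "primary",
        "Failover": "PRIMARY",
        "TTL": 60,
        "HealthCheckId": "<primary-health-check-id>",
        "ResourceRecords": [{"Value": "console-us-east-1.example.com"}]
      }
    },
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "console.example.com",
        "Type": "CNAME",
        "SetIdentifier": "secondary",
        "Failover": "SECONDARY",
        "TTL": 60,
        "ResourceRecords": [{"Value": "console-eu-west-1.example.com"}]
      }
    }
  ]
}
EOF
aws route53 change-resource-record-sets --hosted-zone-id <zone-id> --change-batch file://console-failover.json
```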
Active-Active
For lower RTO, run console instances in multiple regions behind a global load balancer:
# Kubernetes Ingress with external-dns
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: keeptrusts-console
  annotations:
    external-dns.alpha.kubernetes.io/hostname: console.example.com
    external-dns.alpha.kubernetes.io/ttl: "60"
spec:
  rules:
    - host: console.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: keeptrusts-console
                port:
                  number: 3000
Worker Recovery
Worker binaries (`worker_export`, `worker_lifecycle`, `worker_config`) are stateless processes that poll the database for work. Recovery is straightforward:
- Redeploy the worker container
- The worker resumes processing from the database queue
- In-flight jobs that were interrupted will be retried based on their `status` column
Ensure only one instance of each worker type runs to avoid duplicate processing.
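One way to enforce the single-instance rule at the host level is an exclusive file lock; for multi-host deployments, the equivalent is a Postgres advisory lock held on the worker's own database connection. A sketch for one worker type, where the lock path and binary path are placeholders:

```bash
# Exits immediately if another worker_export already holds the lock on this host
exec flock -n /var/lock/keeptrusts-worker_export.lock /usr/local/bin/worker_export
```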
Encryption Key Recovery
The `KEEPTRUSTS_SECRET_ENCRYPTION_KEY` is critical: losing it means all encrypted secrets (provider keys, Git tokens, webhook secrets) become unrecoverable.
Key Backup Strategy
- Store the encryption key in a separate secret manager (Vault, AWS Secrets Manager); see the sketch after this list
- Maintain an offline backup in a sealed envelope or HSM
- Document the key recovery procedure separately from this runbook
- Never store the encryption key in the same backup as the database
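A sketch of the first point using AWS Secrets Manager; the secret name is a placeholder:

```bash
# Escrow the key separately from database backups (one-time setup)
aws secretsmanager create-secret \
  --name keeptrusts/secret-encryption-key \
  --secret-string "$KEEPTRUSTS_SECRET_ENCRYPTION_KEY"
# Recovery: pull the key back when rebuilding the API host
export KEEPTRUSTS_SECRET_ENCRYPTION_KEY=$(aws secretsmanager get-secret-value \
  --secret-id keeptrusts/secret-encryption-key \
  --query SecretString --output text)
```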
Runbook Template
## Incident: [Component] Failure
**Severity:** P1 / P2 / P3
**RTO Target:** X minutes
**RPO Target:** X minutes
### Detection
- Alert fired: [alert name]
- Dashboard: [link]
- Affected users: [scope]
### Diagnosis
1. Check component health: `curl https://[endpoint]/healthz`
2. Check logs: `kubectl logs -l app=[component] --tail=100`
3. Check dependencies: [database, API, DNS]
### Recovery Steps
1. [Step 1 with exact command]
2. [Step 2 with exact command]
3. [Verification step]
### Post-Recovery
- [ ] Verify data integrity
- [ ] Check audit log for gap
- [ ] Notify stakeholders
- [ ] Schedule post-incident review
DR Testing Schedule
| Test Type | Frequency | Scope |
|---|---|---|
| Backup restore | Monthly | API database |
| Gateway failover | Quarterly | Gateway fleet |
| Console failover | Quarterly | DNS and load balancer |
| Full DR simulation | Annually | All components |
| Encryption key recovery | Annually | Key backup verification |
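The monthly row is easy to automate; an illustrative crontab entry for the restore drill, assuming the verification script from Backup Verification is installed at the path shown:

```bash
# First day of each month at 03:00: run the restore drill (cron mails any failure output)
( crontab -l 2>/dev/null; echo '0 3 1 * * /usr/local/bin/verify-keeptrusts-backup.sh' ) | crontab -
```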
Next steps
- Configure Upgrade Procedures for zero-downtime updates
- Set up Monitoring & Alerting for DR-related alerts
- Review Multi-Region deployment for geographic redundancy
For AI systems
- Canonical terms: RTO, RPO, WAL archiving, stateless gateway, `worker_lifecycle`, `worker_export`, `worker_config`, `KEEPTRUSTS_SECRET_ENCRYPTION_KEY`, advisory lock
- CLI commands: `kt gateway run --listen <host:port> --policy-config <path>`, `export KEEPTRUSTS_API_URL=<url>`, `export KEEPTRUSTS_GATEWAY_TOKEN=<key>`, `kt gateway run --listen <host:port>`
- Health endpoints: `/readyz` (gateway), `/healthz` (API)
- Recovery components: gateway (stateless), API (Postgres-backed), console (stateless BFF), workers (stateless)
- Related pages: Upgrade Procedures, Monitoring & Alerting, Multi-Region
For engineers
- Gateway recovery is under 5 seconds cold start — just deploy the binary with access to policy config
- Configure PostgreSQL WAL archiving for point-in-time API database recovery
- Schedule daily `pg_dump` backups and test restores monthly against a clean database
- Store `KEEPTRUSTS_SECRET_ENCRYPTION_KEY` in a separate secret manager (Vault, AWS Secrets Manager); losing it makes all encrypted secrets unrecoverable
- Ensure only one instance of each worker type runs to avoid duplicate processing
- Validate: run `cargo test --quiet` against a restored database to confirm integrity
For leaders
- Gateway outage blocks all AI traffic; API outage creates audit trail gaps — define component-level RTO/RPO accordingly
- Stateless components (gateway, console, workers) recover in minutes; the API database is the primary DR concern
- Encryption key loss is catastrophic — mandate offline backup and annual verification
- DR testing cadence: monthly backup restores, quarterly failover tests, annual full simulation
- Cross-region deployment (see Multi-Region guide) provides geographic redundancy for critical-tier targets