
Disaster Recovery for AI Governance

AI governance infrastructure is a critical dependency for every AI-powered application: a gateway outage blocks LLM traffic, and an API outage creates audit trail gaps and breaks authentication. This guide establishes recovery procedures, targets, and runbook templates.

Use this page when

  • You need to define RTO/RPO targets for AI governance infrastructure
  • You are building recovery runbooks for gateway, API database, console, and workers
  • You want to set up PostgreSQL WAL archiving, backup verification, and encryption key recovery procedures

Primary audience

  • Primary: Technical Engineers
  • Secondary: AI Agents, Technical Leaders

Component Recovery Profiles

Each Keeptrusts component has a distinct recovery profile:

| Component | State | Recovery Complexity | Impact of Loss |
| --- | --- | --- | --- |
| Gateway | Stateless | Low — redeploy with config | AI traffic blocked |
| API | Stateful (Postgres) | Medium — restore from backup | Audit trail gap, auth outage |
| Console | Stateless (BFF) | Low — redeploy | Management UI unavailable |
| Workers | Stateless | Low — redeploy | Export/retention jobs delayed |

RTO/RPO Targets

Define targets based on your organization's requirements:

| Tier | RTO | RPO | Applies To |
| --- | --- | --- | --- |
| Critical | 15 minutes | 0 (synchronous replication) | Gateway fleet, API primary |
| High | 1 hour | 15 minutes | API standby, console |
| Standard | 4 hours | 1 hour | Workers, export storage |
| Low | 24 hours | 24 hours | Marketing site, docs |
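
An RPO of zero for the critical tier implies synchronous replication on the API primary. A minimal PostgreSQL sketch (the standby name is a placeholder for your own replica):

# postgresql.conf on the API primary (standby name is an assumption)
synchronous_commit = on
synchronous_standby_names = 'keeptrusts_standby_1'

With this setting, a commit is not acknowledged to the client until the named standby has confirmed the WAL write, so a primary failure loses no acknowledged transactions.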

Gateway Recovery

The gateway is stateless — it loads policy configuration at startup and forwards decision events to the API. Recovery requires only a running binary with access to the policy config.

Recovery Steps

  1. Deploy new gateway instances from the container image
  2. Provide policy configuration via one of:
    • Config file mount (--policy-config policy-config.yaml)
    • Git-linked repository (auto-syncs on startup)
    • API config endpoint (pulls from control plane)
  3. Verify health — wait for /readyz to return 200
  4. Route traffic — update load balancer or DNS
# Quick recovery with local config
kt gateway run --listen 0.0.0.0:41002 --policy-config /etc/keeptrusts/policy-config.yaml

# Recovery with API-synced config
export KEEPTRUSTS_API_URL="https://api.example.com"
export KEEPTRUSTS_GATEWAY_TOKEN="kt_gk_..."

kt gateway run --listen 0.0.0.0:41002

Recovery Time

Gateway cold start is typically under 5 seconds. The primary bottleneck is container scheduling and image pull time.
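
Recovery automation should gate the traffic cutover (step 4) on the readiness check from step 3. A minimal sketch, assuming the gateway listens locally on the port used above:

# Poll /readyz until the gateway reports ready, then shift traffic
until curl -fsS http://localhost:41002/readyz >/dev/null; do
  sleep 1
done
echo "gateway ready: safe to update the load balancer"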

API Database Backup

Continuous Backup with WAL Archiving

Configure PostgreSQL WAL archiving for point-in-time recovery:

# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'aws s3 cp %p s3://keeptrusts-wal-archive/%f'
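
WAL archiving enables point-in-time recovery: restore the most recent base backup into a fresh data directory, then tell PostgreSQL to replay archived WAL up to a target timestamp. A minimal sketch, assuming PostgreSQL 12+ and the S3 archive configured above:

# postgresql.conf on the recovery host
restore_command = 'aws s3 cp s3://keeptrusts-wal-archive/%f %p'
recovery_target_time = '2026-04-23 14:00:00+00'   # example target
recovery_target_action = 'promote'

Create an empty recovery.signal file in the data directory and start PostgreSQL; it replays WAL to the target time and then promotes itself to a normal primary.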

Automated Logical Backups

Schedule daily logical backups:

#!/bin/bash
# backup-keeptrusts-db.sh
set -euo pipefail

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="keeptrusts_backup_${TIMESTAMP}.dump"

# --format=custom produces a pg_restore-compatible archive and handles
# compression itself, so the file is named .dump rather than .sql.gz
pg_dump "$DATABASE_URL" \
  --format=custom \
  --compress=9 \
  --file="/tmp/${BACKUP_FILE}"

aws s3 cp "/tmp/${BACKUP_FILE}" \
  "s3://keeptrusts-backups/daily/${BACKUP_FILE}" \
  --storage-class STANDARD_IA

rm "/tmp/${BACKUP_FILE}"

Restore Procedure

# Step 1: Create a new database
createdb keeptrusts_restored

# Step 2: Restore from the custom-format backup
pg_restore --dbname=keeptrusts_restored \
  --jobs=4 \
  --no-owner \
  /tmp/keeptrusts_backup_20260423.dump

# Step 3: Run pending migrations
cd api && DATABASE_URL="postgres://...keeptrusts_restored" sqlx migrate run

# Step 4: Point the API at the restored database
export DATABASE_URL="postgres://keeptrusts:$DB_PASS@db-host:5432/keeptrusts_restored"

# Step 5: Restart the API
systemctl restart keeptrusts-api

Backup Verification

Test restores monthly:

# Restore into a clean test database
createdb keeptrusts_dr_test
pg_restore --dbname=keeptrusts_dr_test /tmp/latest_backup.dump

# Run the API test suite against the restored database
DATABASE_URL="postgres://...keeptrusts_dr_test" cargo test --quiet

# Verify row counts match expectations
psql keeptrusts_dr_test -c "SELECT COUNT(*) FROM events;"
psql keeptrusts_dr_test -c "SELECT COUNT(*) FROM configurations;"

Console Failover

The console is a stateless Next.js application. Its only dependency is the API server for BFF routes.

Active-Passive Failover

  1. Deploy console instances in two regions
  2. Configure DNS failover (Route 53, Cloudflare) with health checks against /api/health
  3. The passive instance warms up on deploy but receives no traffic
  4. On primary failure, DNS switches to the passive instance
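
A sketch of the DNS side with Route 53 (the zone ID, hostnames, and record file are placeholders):

# Health check against the primary console's /api/health endpoint
aws route53 create-health-check \
  --caller-reference "console-primary-$(date +%s)" \
  --health-check-config '{
    "Type": "HTTPS",
    "FullyQualifiedDomainName": "console-primary.example.com",
    "ResourcePath": "/api/health",
    "RequestInterval": 30,
    "FailureThreshold": 3
  }'

# Attach the returned health check ID to the PRIMARY failover record;
# the SECONDARY record serves traffic only while the primary check fails
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123EXAMPLE \
  --change-batch file://console-failover-records.json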

Active-Active

For lower RTO, run console instances in multiple regions behind a global load balancer:

# Kubernetes Ingress with external-dns
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: keeptrusts-console
  annotations:
    external-dns.alpha.kubernetes.io/hostname: console.example.com
    external-dns.alpha.kubernetes.io/ttl: "60"
spec:
  rules:
    - host: console.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: keeptrusts-console
                port:
                  number: 3000

Worker Recovery

Worker binaries (worker_export, worker_lifecycle, worker_config) are stateless processes that poll the database for work. Recovery is straightforward:

  1. Redeploy the worker container
  2. The worker resumes processing from the database queue
  3. In-flight jobs that were interrupted will be retried based on their status column

Ensure only one instance of each worker type runs to avoid duplicate processing.
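
A Postgres advisory lock is one way to enforce this (a sketch, not necessarily how the workers coordinate internally; the lock key is an arbitrary per-worker-type constant):

# Hold a session-level advisory lock for the lifetime of the worker by
# running it from inside the locking psql session. A second instance
# blocks on pg_advisory_lock until the first exits and releases it.
psql "$DATABASE_URL" <<'EOF'
SELECT pg_advisory_lock(42);   -- hypothetical key for worker_export
\! worker_export               -- lock is held while this runs
EOF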

Encryption Key Recovery

The KEEPTRUSTS_SECRET_ENCRYPTION_KEY is critical — losing it means all encrypted secrets (provider keys, Git tokens, webhook secrets) become unrecoverable.

Key Backup Strategy

  • Store the encryption key in a separate secret manager (Vault, AWS Secrets Manager)
  • Maintain an offline backup in a sealed envelope or HSM
  • Document the key recovery procedure separately from this runbook
  • Never store the encryption key in the same backup as the database
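
Annual key recovery testing (see the schedule below) can compare fingerprints rather than raw key material. A sketch assuming AWS Secrets Manager and a placeholder secret name:

# Print a fingerprint of the live key; compare it against a fingerprint
# of the offline backup without ever displaying the key itself
aws secretsmanager get-secret-value \
  --secret-id keeptrusts/secret-encryption-key \
  --query SecretString --output text | tr -d '\n' | sha256sum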

Runbook Template

## Incident: [Component] Failure

**Severity:** P1 / P2 / P3
**RTO Target:** X minutes
**RPO Target:** X minutes

### Detection
- Alert fired: [alert name]
- Dashboard: [link]
- Affected users: [scope]

### Diagnosis
1. Check component health: `curl https://[endpoint]/healthz`
2. Check logs: `kubectl logs -l app=[component] --tail=100`
3. Check dependencies: [database, API, DNS]

### Recovery Steps
1. [Step 1 with exact command]
2. [Step 2 with exact command]
3. [Verification step]

### Post-Recovery
- [ ] Verify data integrity
- [ ] Check audit log for gap
- [ ] Notify stakeholders
- [ ] Schedule post-incident review

DR Testing Schedule

| Test Type | Frequency | Scope |
| --- | --- | --- |
| Backup restore | Monthly | API database |
| Gateway failover | Quarterly | Gateway fleet |
| Console failover | Quarterly | DNS and load balancer |
| Full DR simulation | Annually | All components |
| Encryption key recovery | Annually | Key backup verification |

Next steps

For AI systems

  • Canonical terms: RTO, RPO, WAL archiving, stateless gateway, worker_lifecycle, worker_export, worker_config, KEEPTRUSTS_SECRET_ENCRYPTION_KEY, advisory lock
  • CLI commands: kt gateway run --listen <host:port> --policy-config <path>, export KEEPTRUSTS_API_URL=<url>, export KEEPTRUSTS_GATEWAY_TOKEN=<key>, kt gateway run --listen <host:port>
  • Health endpoints: /readyz (gateway), /healthz (API)
  • Recovery components: gateway (stateless), API (Postgres-backed), console (stateless BFF), workers (stateless)
  • Related pages: Upgrade Procedures, Monitoring & Alerting, Multi-Region

For engineers

  • Gateway recovery is under 5 seconds cold start — just deploy the binary with access to policy config
  • Configure PostgreSQL WAL archiving for point-in-time API database recovery
  • Schedule daily pg_dump backups and test restores monthly against a clean database
  • Store KEEPTRUSTS_SECRET_ENCRYPTION_KEY in a separate secret manager (Vault, AWS Secrets Manager) — losing it makes all encrypted secrets unrecoverable
  • Ensure only one instance of each worker type runs to avoid duplicate processing
  • Validate: run cargo test --quiet against a restored database to confirm integrity

For leaders

  • Gateway outage blocks all AI traffic; API outage creates audit trail gaps — define component-level RTO/RPO accordingly
  • Stateless components (gateway, console, workers) recover in minutes; the API database is the primary DR concern
  • Encryption key loss is catastrophic — mandate offline backup and annual verification
  • DR testing cadence: monthly backup restores, quarterly failover tests, annual full simulation
  • Cross-region deployment (see Multi-Region guide) provides geographic redundancy for critical-tier targets