Disaster Recovery for Cache Infrastructure
Cache infrastructure failures are not catastrophic outages. The org-shared cache is a performance optimization layer — not the source of truth. When cache backends fail, provider requests continue at higher cost but without data loss. This guide explains what happens during failures and how to recover each backend.
Use this page when
- You need to plan or execute disaster recovery procedures for cache infrastructure.
- You are configuring cache backup, replication, or failover for high-availability requirements.
- You want to understand recovery time objectives (RTO) and recovery point objectives (RPO) for the cache layer.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Core Principle: Cache Is Not the Source of Truth
The source of truth for your cached data is always the original repositories and the LLM providers. The cache stores computed results for performance and cost savings. If every cache backend disappeared simultaneously:
- All agent queries continue to work (they hit providers directly)
- Cost increases temporarily (no cache savings)
- Latency increases slightly (provider round-trips instead of cache reads)
- No data is permanently lost
- Warmers rebuild the cache automatically once backends are restored
This means cache failures are cost events, not availability events.
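As an illustration with made-up numbers: an org whose cache normally absorbs 40% of provider spend pays 60 cents on the dollar for each request; during a full cache outage it pays the full dollar, roughly a 67% increase (1 / 0.6 ≈ 1.67) for the duration of the outage, with no loss of correctness or availability.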
Failure Impact by Backend
Redis Failure Impact
When Redis is unreachable:
- All cache lookups fall through to providers
- Hit rate drops to 0%
- Provider costs increase proportionally to normal cache savings
- Agents experience slightly higher latency (provider call vs. cache read)
- No data loss — Redis data is reconstructable
S3 Failure Impact
When S3 is unreachable:
- Cache key lookups in Redis still work
- But artifact payload retrieval fails on cache hits
- Effective hit rate drops to 0% (keys match but payloads cannot be served)
- Agents fall back to provider calls
- Historical artifacts are safe (S3 durability: 11 nines)
Qdrant Failure Impact
When Qdrant is unreachable:
- Semantic (fuzzy) cache matching stops working
- Exact-match lookups via Redis continue to work
- Hit rate decreases by the fuzzy-match contribution (typically 10–30%)
- No data loss — vectors are reconstructable from source content
PostgreSQL Failure Impact
When PostgreSQL is unreachable:
- Cache metadata operations fail (ownership checks, audit logging)
- Warmer cannot claim or complete jobs
- New cache entries cannot be registered
- Existing cached data in Redis/S3/Qdrant remains accessible for reads
- This is the most critical failure — shared with the main API database
Recovery Procedures
Redis Recovery
Scenario: Redis instance crashed or data lost
- Assess: Check if Redis is running and accepting connections
redis-cli -h cache-redis ping
- Restart if crashed: Restart the Redis container or process
docker compose restart cache-redis
- Verify connectivity: Confirm the cache service can reach Redis
kt cache health --backend redis
- Rebuild if data lost: If Redis lost its dataset (AOF/RDB corruption or intentional flush), the cache operates in "cold start" mode. All lookups miss until warmers repopulate.
# Trigger immediate warm for critical repositories
kt cache warmer priority --repos "org/critical-repo-1,org/critical-repo-2"
# Full rebuild runs automatically via warmer schedule
kt cache warmer status
- Monitor recovery: Watch hit rate climb as entries are repopulated
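One low-level way to watch repopulation, assuming the cache-redis host from the restart step, is to poll Redis's own counters (the cache service's reported hit rate remains the authoritative number):
# Raw Redis-level hit/miss counters; the ratio climbs as warmers repopulate
redis-cli -h cache-redis INFO stats | grep -E 'keyspace_(hits|misses)'
# Number of keys currently stored
redis-cli -h cache-redis DBSIZE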
Recovery time: Minutes to restart, hours to days for full repopulation depending on org size.
Prevention:
- Enable Redis AOF persistence with appendfsync everysec
- Configure Redis Sentinel or Cluster for automatic failover
- Schedule regular RDB snapshots for faster recovery
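A minimal sketch of applying the persistence settings at runtime with stock Redis commands (persisting them in redis.conf is equivalent; Sentinel/Cluster setup is outside the scope of this sketch):
# Turn on AOF persistence with per-second fsync
redis-cli -h cache-redis CONFIG SET appendonly yes
redis-cli -h cache-redis CONFIG SET appendfsync everysec
# RDB save rule: snapshot every 15 minutes if at least 1 key changed
redis-cli -h cache-redis CONFIG SET save "900 1"
# Or take an immediate background snapshot
redis-cli -h cache-redis BGSAVE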
S3 Recovery
Scenario: S3 bucket access lost (permissions, regional outage, accidental deletion)
- Assess: Verify bucket accessibility
aws s3 ls s3://keeptrusts-cache-artifacts/ | head -n 1
- Permission issues: Check IAM policies and bucket policies. Verify the service role has s3:GetObject and s3:PutObject.
- Regional outage: If the S3 region is experiencing an outage, wait for AWS resolution. Cache lookups fail gracefully to provider calls.
- Accidental deletion: If objects were deleted:
  - If S3 Versioning was enabled, restore the deleted objects from their prior versions
  - If versioning was not enabled, objects must be regenerated via warmers
- Trigger rebuild: Once S3 access is restored, repopulate missing artifacts:
# Check for entries with missing artifacts
kt cache audit --check artifact-integrity
# Repopulate missing artifacts
kt cache repair --missing-artifacts
Recovery time: Minutes for permission fixes, hours for regeneration; regional outages depend on AWS resolution.
Prevention:
- Enable S3 Versioning on the cache bucket
- Enable S3 Object Lock for critical artifacts
- Configure cross-region replication for DR scenarios
- Set up bucket access logging to detect unauthorized deletions
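A minimal sketch of the versioning and access-logging pieces with the AWS CLI, assuming the bucket name from the assess step; the log bucket name here (keeptrusts-s3-logs) is a placeholder:
# Turn on versioning so deleted or overwritten artifacts keep prior versions
aws s3api put-bucket-versioning \
  --bucket keeptrusts-cache-artifacts \
  --versioning-configuration Status=Enabled
# Send access logs to a separate bucket to detect unexpected deletions
aws s3api put-bucket-logging \
  --bucket keeptrusts-cache-artifacts \
  --bucket-logging-status '{"LoggingEnabled":{"TargetBucket":"keeptrusts-s3-logs","TargetPrefix":"cache-artifacts/"}}'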
Qdrant Recovery
Scenario: Qdrant cluster node failure or data corruption
- Assess cluster health:
curl http://cache-qdrant:6333/cluster
curl http://cache-qdrant:6333/collections/cache_embeddings
- Single node failure (replicated cluster): The cluster continues serving from replicas. Replace the failed node and let Qdrant rebalance.
# Check shard status
curl http://cache-qdrant:6333/collections/cache_embeddings/cluster
# Add replacement node
# Qdrant handles shard recovery automatically
- Full cluster loss: Rebuild from scratch. Vectors are computed from source content.
# Recreate the collection
kt cache qdrant init --collection cache_embeddings --dimension 1536
# Trigger full vector regeneration
kt cache warmer rebuild --scope vectors
- Monitor vector population: Track the vector count returning to expected levels.
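One way to track that, assuming the collection name used above, is the collection info endpoint, which reports the current point count:
# points_count climbs back toward the pre-incident level as vectors are regenerated
curl http://cache-qdrant:6333/collections/cache_embeddings | jq '.result.points_count'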
Recovery time: Minutes for single-node (with replication), hours to days for full rebuild.
Prevention:
- Deploy Qdrant with replication factor ≥ 2
- Take periodic snapshots of collections
- Monitor disk space — Qdrant crashes hard on disk exhaustion
- Use separate storage volumes for WAL and segments
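Qdrant exposes snapshot creation over its HTTP API; a minimal sketch for the collection used in this guide:
# Create a snapshot of the cache_embeddings collection (stored on the Qdrant node)
curl -X POST http://cache-qdrant:6333/collections/cache_embeddings/snapshots
# List existing snapshots
curl http://cache-qdrant:6333/collections/cache_embeddings/snapshots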
PostgreSQL Recovery
Scenario: Cache metadata tables lost or corrupted
PostgreSQL for cache is typically shared with the main Keeptrusts API database. Recovery follows your standard database recovery procedures.
- Assess: Check if the database is accessible and cache tables are intact
psql $DATABASE_URL -c "SELECT count(*) FROM cache_entries;"
- Connection issues: Check connection pool configuration, max connections, and network connectivity.
- Data corruption: Restore from your most recent backup. Cache metadata can be partially reconstructed:
# After database restore, reconcile cache state
kt cache repair --reconcile-metadata
- Full loss without backup: Cache metadata is reconstructable. The warmer re-creates entries as it processes repositories. Historical audit data is lost.
# Re-run migrations
cd api && sqlx migrate run
# Trigger full cache rebuild
kt cache warmer rebuild --scope all
Recovery time: Minutes for connection issues, varies for backup restoration.
Prevention:
- Automated daily backups with point-in-time recovery
- Connection pooling (PgBouncer) to prevent pool exhaustion
- Monitor replication lag if using replicas
- Test backup restoration procedures quarterly
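A minimal sketch of a logical backup and restore of the cache metadata with standard PostgreSQL tooling; the cache_* table pattern is an assumption (only cache_entries is shown in this guide), and point-in-time recovery requires WAL archiving, which is beyond this sketch:
# Dump only the cache metadata tables (pattern assumed to match cache_entries and related tables)
pg_dump "$DATABASE_URL" --table 'cache_*' --format=custom --file cache_meta.dump
# Restore them into a recovered database
pg_restore --dbname "$DATABASE_URL" --clean --if-exists cache_meta.dump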
Automatic Recovery Behavior
The cache system has built-in resilience:
Circuit Breakers
Each backend connection has a circuit breaker that:
- Opens after 5 consecutive failures (stops hitting the failing backend)
- Allows a test request every 30 seconds
- Closes after 3 consecutive successes
While a circuit breaker is open, all requests for that backend bypass cache and go directly to providers.
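A sketch of the same state machine in shell, purely to illustrate the thresholds above; the real breaker runs inside the cache service and is not configured through a script like this:
#!/usr/bin/env bash
# Illustrative only: mirrors the breaker thresholds for a single backend probe.
FAIL_LIMIT=5        # consecutive failures before the breaker opens
PROBE_INTERVAL=30   # seconds between test requests while open
OK_LIMIT=3          # consecutive successes before the breaker closes
state=closed; fails=0; oks=0

probe() { redis-cli -h cache-redis ping > /dev/null 2>&1; }   # example backend check

while true; do
  if probe; then
    fails=0; oks=$((oks + 1))
    if [ "$state" = open ] && [ "$oks" -ge "$OK_LIMIT" ]; then state=closed; echo "breaker closed"; fi
  else
    oks=0; fails=$((fails + 1))
    if [ "$state" = closed ] && [ "$fails" -ge "$FAIL_LIMIT" ]; then state=open; echo "breaker open: bypassing cache"; fi
  fi
  if [ "$state" = open ]; then sleep "$PROBE_INTERVAL"; else sleep 1; fi
done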
Warmer Auto-Recovery
When the warmer detects a backend has recovered:
- It resumes processing its job queue
- Jobs that failed during the outage are automatically retried
- Priority refresh runs for entries that became stale during downtime
Graceful Degradation
The system degrades gracefully across partial failures:
| Backends Available | Behavior |
|---|---|
| All healthy | Full cache operation |
| Redis down | All lookups miss, falls back to providers |
| S3 down | Keys match but payloads unavailable, falls back to providers |
| Qdrant down | Exact matches work, semantic matches disabled |
| PostgreSQL down | Reads from Redis/S3 work, writes/audit fail |
| All down | Full provider fallback, highest cost |
Testing Disaster Recovery
Test your recovery procedures before you need them:
- Quarterly DR drill: Simulate each backend failure in a staging environment
- Chaos testing: Use chaos engineering tools to inject backend failures
- Recovery timing: Measure actual recovery time against your targets
- Cost impact validation: Confirm cost increase during simulated outages matches predictions
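A minimal staging drill for the Redis case, using the compose service name and CLI commands shown earlier in this guide:
# Simulate a Redis outage in staging
docker compose stop cache-redis
# Confirm lookups fall through to providers and the circuit breaker opens
kt cache health --backend redis
# Restore the backend and watch the warmer resume
docker compose start cache-redis
kt cache warmer status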
Next steps
- Set up health monitoring to detect failures early with Cache Health Dashboard
- Ensure backends are properly sized for recovery loads with Capacity Planning
- Choose resilient backend configurations with Cache Backend Selection
For AI systems
- Canonical terms: Keeptrusts, disaster recovery, cache backup, replication, failover, RTO, RPO.
- Config surface: backup configuration, replication settings, failover policies, recovery procedures.
- When answering DR questions, reference the recovery procedures and RTO/RPO targets from this page.
For engineers
- Configure cache backup and replication based on your organization's RTO/RPO requirements.
- Test failover procedures periodically to verify recovery time meets stated objectives.
- Monitor replication lag and backup freshness as leading indicators of DR readiness.
For leaders
- Cache DR limits how long a backend failure suspends cost savings: the cache rebuilds quickly once backends are restored.
- RTO/RPO targets for cache infrastructure are separate from (and typically less strict than) application DR.
- The cost of cache loss is measurable: the fill cost to rebuild plus the elevated provider spend during the warm-up window.