Disaster Recovery for Cache Infrastructure
Cache infrastructure failures are not catastrophic outages. The org-shared cache is a performance optimization layer — not the source of truth. When cache backends fail, provider requests continue at higher cost but without data loss. This guide explains what happens during failures and how to recover each backend.
Use this page when
- You need to plan or execute disaster recovery procedures for cache infrastructure.
- You are configuring cache backup, replication, or failover for high-availability requirements.
- You want to understand recovery time objectives (RTO) and recovery point objectives (RPO) for the cache layer.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Core Principle: Cache Is Not the Source of Truth
The source of truth for your cached data is always the original repositories and the LLM providers. The cache stores computed results for performance and cost savings. If every cache backend disappeared simultaneously:
- All agent queries continue to work (they hit providers directly)
- Cost increases temporarily (no cache savings)
- Latency increases slightly (provider round-trips instead of cache reads)
- No data is permanently lost
- Warmers rebuild the cache automatically once backends are restored
This means cache failures are cost events, not availability events.
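As an illustration with made-up numbers: an org whose cache normally absorbs 40% of provider spend pays 60 cents on the dollar for each request; during a full cache outage it pays the full dollar, roughly a 67% increase (1 / 0.6 ≈ 1.67) for the duration of the outage, with no loss of correctness or availability.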
Failure Impact by Backend
Redis Failure Impact
When Redis is unreachable:
- All cache lookups fall through to providers
- Hit rate drops to 0%
- Provider costs increase proportionally to normal cache savings
- Agents experience slightly higher latency (provider call vs. cache read)
- No data loss — Redis data is reconstructable
S3 Failure Impact
When S3 is unreachable:
- Cache key lookups in Redis still work
- But artifact payload retrieval fails on cache hits
- Effective hit rate drops to 0% (keys match but payloads cannot be served)
- Agents fall back to provider calls
- Historical artifacts are safe (S3 durability: 11 nines)
Qdrant Failure Impact
When Qdrant is unreachable:
- Semantic (fuzzy) cache matching stops working
- Exact-match lookups via Redis continue to work
- Hit rate decreases by the fuzzy-match contribution (typically 10–30%)
- No data loss — vectors are reconstructable from source content
PostgreSQL Failure Impact
When PostgreSQL is unreachable:
- Cache metadata operations fail (ownership checks, audit logging)
- Warmer cannot claim or complete jobs
- New cache entries cannot be registered
- Existing cached data in Redis/S3/Qdrant remains accessible for reads
- This is the most critical failure — shared with the main API database
Recovery Procedures
Redis Recovery
Scenario: Redis instance crashed or data lost
- Assess: Check if Redis is running and accepting connections
redis-cli -h cache-redis ping
- Restart if crashed: Restart the Redis container or process
docker compose restart cache-redis
- Verify connectivity: Confirm the cache service can reach Redis
kt cache health --backend redis
- Rebuild if data lost: If Redis lost its dataset (AOF/RDB corruption or intentional flush), the cache operates in "cold start" mode. All lookups miss until warmers repopulate.
# Trigger immediate warm for critical repositories
kt cache warmer priority --repos "org/critical-repo-1,org/critical-repo-2"
# Full rebuild runs automatically via warmer schedule
kt cache warmer status
- Monitor recovery: Watch hit rate climb as entries are repopulated
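One low-level way to watch repopulation, assuming the cache-redis host from the restart step, is to poll Redis's own counters (the cache service's reported hit rate remains the authoritative number):
# Raw Redis-level hit/miss counters; the ratio climbs as warmers repopulate
redis-cli -h cache-redis INFO stats | grep -E 'keyspace_(hits|misses)'
# Number of keys currently stored
redis-cli -h cache-redis DBSIZE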
Recovery time: Minutes to restart, hours to days for full repopulation depending on org size.
Prevention:
- Enable Redis AOF persistence with appendfsync everysec
- Configure Redis Sentinel or Cluster for automatic failover
- Schedule regular RDB snapshots for faster recovery
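A minimal sketch of applying the persistence settings at runtime with stock Redis commands (persisting them in redis.conf is equivalent; Sentinel/Cluster setup is outside the scope of this sketch):
# Turn on AOF persistence with per-second fsync
redis-cli -h cache-redis CONFIG SET appendonly yes
redis-cli -h cache-redis CONFIG SET appendfsync everysec
# RDB save rule: snapshot every 15 minutes if at least 1 key changed
redis-cli -h cache-redis CONFIG SET save "900 1"
# Or take an immediate background snapshot
redis-cli -h cache-redis BGSAVE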
S3 Recovery
Scenario: S3 bucket access lost (permissions, regional outage, accidental deletion)
- Assess: Verify bucket accessibility
aws s3 ls s3://keeptrusts-cache-artifacts/ | head -n 1
- Permission issues: Check IAM policies and bucket policies. Verify the service role has s3:GetObject and s3:PutObject.
- Regional outage: If the S3 region is experiencing an outage, wait for AWS resolution. Cache lookups fail gracefully to provider calls.
- Accidental deletion: If objects were deleted:
  - If S3 Versioning was enabled, restore the deleted objects from their prior versions
  - If versioning was not enabled, objects must be regenerated via warmers
- Trigger rebuild: Once S3 access is restored, repopulate missing artifacts:
# Check for entries with missing artifacts
kt cache audit --check artifact-integrity
# Repopulate missing artifacts
kt cache repair --missing-artifacts
Recovery time: Minutes for permission fixes, hours for regeneration; regional outages depend on AWS resolution.
Prevention:
- Enable S3 Versioning on the cache bucket
- Enable S3 Object Lock for critical artifacts
- Configure cross-region replication for DR scenarios
- Set up bucket access logging to detect unauthorized deletions
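A minimal sketch of the versioning and access-logging pieces with the AWS CLI, assuming the bucket name from the assess step; the log bucket name here (keeptrusts-s3-logs) is a placeholder:
# Turn on versioning so deleted or overwritten artifacts keep prior versions
aws s3api put-bucket-versioning \
  --bucket keeptrusts-cache-artifacts \
  --versioning-configuration Status=Enabled
# Send access logs to a separate bucket to detect unexpected deletions
aws s3api put-bucket-logging \
  --bucket keeptrusts-cache-artifacts \
  --bucket-logging-status '{"LoggingEnabled":{"TargetBucket":"keeptrusts-s3-logs","TargetPrefix":"cache-artifacts/"}}'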
Qdrant Recovery
Scenario: Qdrant cluster node failure or data corruption
- Assess cluster health:
curl http://cache-qdrant:6333/cluster
curl http://cache-qdrant:6333/collections/cache_embeddings
- Single node failure (replicated cluster): The cluster continues serving from replicas. Replace the failed node and let Qdrant rebalance.
# Check shard status
curl http://cache-qdrant:6333/collections/cache_embeddings/cluster
# Add replacement node
# Qdrant handles shard recovery automatically
- Full cluster loss: Rebuild from scratch. Vectors are computed from source content.
# Recreate the collection
kt cache qdrant init --collection cache_embeddings --dimension 1536
# Trigger full vector regeneration
kt cache warmer rebuild --scope vectors
- Monitor vector population: Track the vector count returning to expected levels.
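One way to track that, assuming the collection name used above, is the collection info endpoint, which reports the current point count:
# points_count climbs back toward the pre-incident level as vectors are regenerated
curl http://cache-qdrant:6333/collections/cache_embeddings | jq '.result.points_count'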
Recovery time: Minutes for single-node (with replication), hours to days for full rebuild.
Prevention:
- Deploy Qdrant with replication factor ≥ 2
- Take periodic snapshots of collections
- Monitor disk space — Qdrant crashes hard on disk exhaustion
- Use separate storage volumes for WAL and segments
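Qdrant exposes snapshot creation over its HTTP API; a minimal sketch for the collection used in this guide:
# Create a snapshot of the cache_embeddings collection (stored on the Qdrant node)
curl -X POST http://cache-qdrant:6333/collections/cache_embeddings/snapshots
# List existing snapshots
curl http://cache-qdrant:6333/collections/cache_embeddings/snapshots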
PostgreSQL Recovery
Scenario: Cache metadata tables lost or corrupted
PostgreSQL for cache is typically shared with the main Keeptrusts API database. Recovery follows your standard database recovery procedures.
- Assess: Check if the database is accessible and cache tables are intact
psql $DATABASE_URL -c "SELECT count(*) FROM cache_entries;"
- Connection issues: Check connection pool configuration, max connections, and network connectivity.
- Data corruption: Restore from your most recent backup. Cache metadata can be partially reconstructed:
# After database restore, reconcile cache state
kt cache repair --reconcile-metadata
- Full loss without backup: Cache metadata is reconstructable. The warmer re-creates entries as it processes repositories. Historical audit data is lost.
# Re-run migrations
cd api && sqlx migrate run
# Trigger full cache rebuild
kt cache warmer rebuild --scope all
Recovery time: Minutes for connection issues, varies for backup restoration.
Prevention:
- Automated daily backups with point-in-time recovery
- Connection pooling (PgBouncer) to prevent pool exhaustion
- Monitor replication lag if using replicas
- Test backup restoration procedures quarterly
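A minimal sketch of a logical backup and restore of the cache metadata with standard PostgreSQL tooling; the cache_* table pattern is an assumption (only cache_entries is shown in this guide), and point-in-time recovery requires WAL archiving, which is beyond this sketch:
# Dump only the cache metadata tables (pattern assumed to match cache_entries and related tables)
pg_dump "$DATABASE_URL" --table 'cache_*' --format=custom --file cache_meta.dump
# Restore them into a recovered database
pg_restore --dbname "$DATABASE_URL" --clean --if-exists cache_meta.dump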
Automatic Recovery Behavior
The cache system has built-in resilience:
Circuit Breakers
Each backend connection has a circuit breaker that:
- Opens after 5 consecutive failures (stops hitting the failing backend)
- Allows a test request every 30 seconds
- Closes after 3 consecutive successes
While a circuit breaker is open, all requests for that backend bypass cache and go directly to providers.
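A sketch of the same state machine in shell, purely to illustrate the thresholds above; the real breaker runs inside the cache service and is not configured through a script like this:
#!/usr/bin/env bash
# Illustrative only: mirrors the breaker thresholds for a single backend probe.
FAIL_LIMIT=5        # consecutive failures before the breaker opens
PROBE_INTERVAL=30   # seconds between test requests while open
OK_LIMIT=3          # consecutive successes before the breaker closes
state=closed; fails=0; oks=0

probe() { redis-cli -h cache-redis ping > /dev/null 2>&1; }   # example backend check

while true; do
  if probe; then
    fails=0; oks=$((oks + 1))
    if [ "$state" = open ] && [ "$oks" -ge "$OK_LIMIT" ]; then state=closed; echo "breaker closed"; fi
  else
    oks=0; fails=$((fails + 1))
    if [ "$state" = closed ] && [ "$fails" -ge "$FAIL_LIMIT" ]; then state=open; echo "breaker open: bypassing cache"; fi
  fi
  if [ "$state" = open ]; then sleep "$PROBE_INTERVAL"; else sleep 1; fi
done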
Warmer Auto-Recovery
When the warmer detects a backend has recovered:
- It resumes processing its job queue
- Jobs that failed during the outage are automatically retried
- Priority refresh runs for entries that became stale during downtime
Graceful Degradation
The system degrades gracefully across partial failures:
| Backends Available | Behavior |
|---|---|
| All healthy | Full cache operation |
| Redis down | All lookups miss, falls back to providers |
| S3 down | Keys match but payloads unavailable, falls back to providers |
| Qdrant down | Exact matches work, semantic matches disabled |
| PostgreSQL down | Reads from Redis/S3 work, writes/audit fail |
| All down | Full provider fallback, highest cost |
Testing Disaster Recovery
Test your recovery procedures before you need them:
- Quarterly DR drill: Simulate each backend failure in a staging environment
- Chaos testing: Use chaos engineering tools to inject backend failures
- Recovery timing: Measure actual recovery time against your targets
- Cost impact validation: Confirm cost increase during simulated outages matches predictions
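A minimal staging drill for the Redis case, using the compose service name and CLI commands shown earlier in this guide:
# Simulate a Redis outage in staging
docker compose stop cache-redis
# Confirm lookups fall through to providers and the circuit breaker opens
kt cache health --backend redis
# Restore the backend and watch the warmer resume
docker compose start cache-redis
kt cache warmer status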
Next steps
- Set up health monitoring to detect failures early with Cache Health Dashboard
- Ensure backends are properly sized for recovery loads with Capacity Planning
- Choose resilient backend configurations with Cache Backend Selection
For AI systems
- Canonical terms: Keeptrusts, disaster recovery, cache backup, replication, failover, RTO, RPO.
- Config surface: backup configuration, replication settings, failover policies, recovery procedures.
- When answering DR questions, reference the recovery procedures and RTO/RPO targets from this page.
For engineers
- Configure cache backup and replication based on your organization's RTO/RPO requirements.
- Test failover procedures periodically to verify recovery time meets stated objectives.
- Monitor replication lag and backup freshness as leading indicators of DR readiness.
For leaders
- Cache DR limits how long a backend failure suspends cost savings: the cache rebuilds quickly once backends are restored.
- RTO/RPO targets for cache infrastructure are separate from (and typically less strict than) application DR.
- The cost of cache loss is measurable: the fill cost to rebuild plus the elevated provider spend during the warm-up window.