Disaster Recovery for Cache Infrastructure

Cache infrastructure failures are not catastrophic outages. The org-shared cache is a performance optimization layer — not the source of truth. When cache backends fail, provider requests continue at higher cost but without data loss. This guide explains what happens during failures and how to recover each backend.

Use this page when

  • You need to plan or execute disaster recovery procedures for cache infrastructure.
  • You are configuring cache backup, replication, or failover for high-availability requirements.
  • You want to understand recovery time objectives (RTO) and recovery point objectives (RPO) for the cache layer.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Core Principle: Cache Is Not the Source of Truth

The source of truth for your cached data is always the original repositories and the LLM providers. The cache stores computed results for performance and cost savings. If every cache backend disappeared simultaneously:

  • All agent queries continue to work (they hit providers directly)
  • Cost increases temporarily (no cache savings)
  • Latency increases slightly (provider round-trips instead of cache reads)
  • No data is permanently lost
  • Warmers rebuild the cache automatically once backends are restored

This means cache failures are cost events, not availability events.
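The "cost event" framing can be made concrete with back-of-the-envelope arithmetic. A minimal sketch (all input numbers are hypothetical illustration values, not Keeptrusts defaults):

```python
# Rough estimate of the extra provider spend during a cache outage.
# All inputs are hypothetical illustration values, not product defaults.

def outage_cost(requests_per_hour: float, hit_rate: float,
                provider_cost_per_call: float, outage_hours: float) -> float:
    """Extra spend = requests that would have been cache hits,
    now billed as provider calls, over the outage window."""
    extra_calls_per_hour = requests_per_hour * hit_rate
    return extra_calls_per_hour * provider_cost_per_call * outage_hours

# Example: 10,000 req/h, 60% hit rate, $0.002/call, 4-hour Redis outage
cost = outage_cost(10_000, 0.60, 0.002, 4)
print(f"${cost:.2f}")  # → $48.00
```

The same arithmetic, run against your own traffic numbers, gives a budget figure to attach to each backend's recovery-time target.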

Failure Impact by Backend

Redis Failure Impact

When Redis is unreachable:

  • All cache lookups fall through to providers
  • Hit rate drops to 0%
  • Provider costs increase by roughly the amount the cache normally saves
  • Agents experience slightly higher latency (provider call vs. cache read)
  • No data loss — Redis data is reconstructable

S3 Failure Impact

When S3 is unreachable:

  • Cache key lookups in Redis still work
  • But artifact payload retrieval fails on cache hits
  • Effective hit rate drops to 0% (keys match but payloads cannot be served)
  • Agents fall back to provider calls
  • Historical artifacts are safe (S3 durability: 11 nines)

Qdrant Failure Impact

When Qdrant is unreachable:

  • Semantic (fuzzy) cache matching stops working
  • Exact-match lookups via Redis continue to work
  • Hit rate decreases by the fuzzy-match contribution (typically 10–30%)
  • No data loss — vectors are reconstructable from source content

PostgreSQL Failure Impact

When PostgreSQL is unreachable:

  • Cache metadata operations fail (ownership checks, audit logging)
  • Warmer cannot claim or complete jobs
  • New cache entries cannot be registered
  • Existing cached data in Redis/S3/Qdrant remains accessible for reads
  • This is the most critical failure — shared with the main API database

Recovery Procedures

Redis Recovery

Scenario: Redis instance crashed or data lost

  1. Assess: Check if Redis is running and accepting connections
     redis-cli -h cache-redis ping
  2. Restart if crashed: Restart the Redis container or process
     docker compose restart cache-redis
  3. Verify connectivity: Confirm the cache service can reach Redis
     kt cache health --backend redis
  4. Rebuild if data lost: If Redis lost its dataset (AOF/RDB corruption or intentional flush), the cache operates in "cold start" mode. All lookups miss until warmers repopulate.
     # Trigger immediate warm for critical repositories
     kt cache warmer priority --repos "org/critical-repo-1,org/critical-repo-2"

     # Full rebuild runs automatically via warmer schedule
     kt cache warmer status
  5. Monitor recovery: Watch hit rate climb as entries are repopulated

Recovery time: Minutes to restart, hours to days for full repopulation depending on org size.

Prevention:

  • Enable Redis AOF persistence with appendfsync everysec
  • Configure Redis Sentinel or Cluster for automatic failover
  • Schedule regular RDB snapshots for faster recovery
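The persistence settings above map to a redis.conf fragment like the following (a sketch — tune the fsync policy and snapshot cadence to your own durability and recovery-time needs):

```conf
# redis.conf — durability settings for the cache instance (illustrative)
appendonly yes            # enable AOF persistence
appendfsync everysec      # fsync the AOF once per second
save 900 1                # RDB snapshot: after 900s if >= 1 key changed
save 300 100              # ...after 300s if >= 100 keys changed
```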

S3 Recovery

Scenario: S3 bucket access lost (permissions, regional outage, accidental deletion)

  1. Assess: Verify bucket accessibility
     aws s3 ls s3://keeptrusts-cache-artifacts/ --max-items 1
  2. Permission issues: Check IAM policies and bucket policies. Verify the service role has s3:GetObject and s3:PutObject.
  3. Regional outage: If the S3 region is experiencing an outage, wait for AWS resolution. Cache lookups fail gracefully to provider calls.
  4. Accidental deletion: If objects were deleted:
     • Enable S3 Versioning to recover deleted objects
     • If versioning was not enabled, objects must be regenerated via warmers
  5. Trigger rebuild: Once S3 access is restored, repopulate missing artifacts:
     # Check for entries with missing artifacts
     kt cache audit --check artifact-integrity

     # Repopulate missing artifacts
     kt cache repair --missing-artifacts

Recovery time: Minutes for permission fixes, hours for regeneration; regional outages depend on AWS resolution.

Prevention:

  • Enable S3 Versioning on the cache bucket
  • Enable S3 Object Lock for critical artifacts
  • Configure cross-region replication for DR scenarios
  • Set up bucket access logging to detect unauthorized deletions

Qdrant Recovery

Scenario: Qdrant cluster node failure or data corruption

  1. Assess cluster health:
     curl http://cache-qdrant:6333/cluster
     curl http://cache-qdrant:6333/collections/cache_embeddings
  2. Single node failure (replicated cluster): The cluster continues serving from replicas. Replace the failed node and let Qdrant rebalance.
     # Check shard status
     curl http://cache-qdrant:6333/collections/cache_embeddings/cluster

     # Add replacement node
     # Qdrant handles shard recovery automatically
  3. Full cluster loss: Rebuild from scratch. Vectors are computed from source content.
     # Recreate the collection
     kt cache qdrant init --collection cache_embeddings --dimension 1536

     # Trigger full vector regeneration
     kt cache warmer rebuild --scope vectors
  4. Monitor vector population: Track the vector count returning to expected levels.

Recovery time: Minutes for single-node (with replication), hours to days for full rebuild.

Prevention:

  • Deploy Qdrant with replication factor ≥ 2
  • Take periodic snapshots of collections
  • Monitor disk space — Qdrant crashes hard on disk exhaustion
  • Use separate storage volumes for WAL and segments

PostgreSQL Recovery

Scenario: Cache metadata tables lost or corrupted

PostgreSQL for cache is typically shared with the main Keeptrusts API database. Recovery follows your standard database recovery procedures.

  1. Assess: Check if the database is accessible and cache tables are intact
     psql $DATABASE_URL -c "SELECT count(*) FROM cache_entries;"
  2. Connection issues: Check connection pool configuration, max connections, and network connectivity.
  3. Data corruption: Restore from your most recent backup. Cache metadata can be partially reconstructed:
     # After database restore, reconcile cache state
     kt cache repair --reconcile-metadata
  4. Full loss without backup: Cache metadata is reconstructable. The warmer re-creates entries as it processes repositories. Historical audit data is lost.
     # Re-run migrations
     cd api && sqlx migrate run

     # Trigger full cache rebuild
     kt cache warmer rebuild --scope all

Recovery time: Minutes for connection issues, varies for backup restoration.

Prevention:

  • Automated daily backups with point-in-time recovery
  • Connection pooling (PgBouncer) to prevent pool exhaustion
  • Monitor replication lag if using replicas
  • Test backup restoration procedures quarterly

Automatic Recovery Behavior

The cache system has built-in resilience:

Circuit Breakers

Each backend connection has a circuit breaker that:

  • Opens after 5 consecutive failures (stops hitting the failing backend)
  • Allows a test request every 30 seconds
  • Closes after 3 consecutive successes

While a circuit breaker is open, all requests for that backend bypass cache and go directly to providers.
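The thresholds above can be sketched as a minimal circuit breaker. This is illustrative Python, not the actual implementation — the real thresholds come from the service configuration:

```python
import time

class CircuitBreaker:
    """Minimal sketch of the behavior described above: opens after 5
    consecutive failures, probes every 30 s while open, and closes
    again after 3 consecutive successes."""

    def __init__(self, failure_threshold=5, probe_interval=30.0,
                 success_threshold=3, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.probe_interval = probe_interval
        self.success_threshold = success_threshold
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "closed":
            return True
        if self.state == "open":
            # Let a test request through every probe_interval seconds
            if self.clock() - self.opened_at >= self.probe_interval:
                self.state = "half_open"
                return True
            return False
        return True  # half_open: allow requests while testing recovery

    def record_failure(self):
        self.failures += 1
        self.successes = 0
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
            self.failures = 0

    def record_success(self):
        self.failures = 0
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
                self.successes = 0
```

While the breaker is open, the caller skips the backend entirely and goes straight to the provider, which is exactly the fallback behavior described above.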

Warmer Auto-Recovery

When the warmer detects a backend has recovered:

  • It resumes processing its job queue
  • Jobs that failed during the outage are automatically retried
  • Priority refresh runs for entries that became stale during downtime

Graceful Degradation

The system degrades gracefully across partial failures:

Backends available | Behavior
------------------ | --------
All healthy        | Full cache operation
Redis down         | All lookups miss, falls back to providers
S3 down            | Keys match but payloads unavailable, falls back to providers
Qdrant down        | Exact matches work, semantic matches disabled
PostgreSQL down    | Reads from Redis/S3 work, writes/audit fail
All down           | Full provider fallback, highest cost
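The degradation matrix can be expressed as a simple fallback decision. A sketch — the backend names and health-check shape here are assumptions, not the real router:

```python
def lookup_path(health: dict) -> str:
    """Decide how a cache lookup is served given backend health.
    Mirrors the degradation table above; a sketch, not the real router."""
    if not health.get("postgres", True):
        # Metadata writes/audit fail, but existing entries stay readable
        if health.get("redis") and health.get("s3"):
            return "cache-read-only"
        return "provider-fallback"
    if not health.get("redis", True) or not health.get("s3", True):
        # No keys (Redis down) or no payloads (S3 down): every lookup misses
        return "provider-fallback"
    if not health.get("qdrant", True):
        return "exact-match-only"
    return "full-cache"

print(lookup_path({"redis": True, "s3": True,
                   "qdrant": True, "postgres": True}))  # → full-cache
```

The point of the sketch is the ordering: metadata health gates writes, key/payload health gates all hits, and vector health only gates the fuzzy-match contribution.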

Testing Disaster Recovery

Test your recovery procedures before you need them:

  1. Quarterly DR drill: Simulate each backend failure in a staging environment
  2. Chaos testing: Use chaos engineering tools to inject backend failures
  3. Recovery timing: Measure actual recovery time against your targets
  4. Cost impact validation: Confirm cost increase during simulated outages matches predictions

Next steps

For AI systems

  • Canonical terms: Keeptrusts, disaster recovery, cache backup, replication, failover, RTO, RPO.
  • Config surface: backup configuration, replication settings, failover policies, recovery procedures.
  • When answering DR questions, reference the recovery procedures and RTO/RPO targets from this page.

For engineers

  • Configure cache backup and replication based on your organization's RTO/RPO requirements.
  • Test failover procedures periodically to verify recovery time meets stated objectives.
  • Monitor replication lag and backup freshness as leading indicators of DR readiness.

For leaders

  • Cache DR ensures that a backend failure doesn't eliminate accumulated cost savings — the cache rebuilds quickly.
  • RTO/RPO targets for cache infrastructure are separate from (and typically less strict than) application DR.
  • The cost of cache loss is measurable: the warmer's fill cost to rebuild, plus the lost cache savings accumulated over the time to full warm-up.