Troubleshooting Cache Misses
When your cache hit rate drops or specific queries consistently miss, you need a systematic approach to diagnosis. This guide walks you through the six most common causes of cache misses and provides step-by-step resolution for each.
Use this page when
- You are investigating why cache hit rates are lower than expected.
- You need to diagnose specific miss reasons (TTL expiry, key mismatch, invalidation, deny-list exclusion).
- You want a systematic troubleshooting process for identifying and fixing cache miss patterns.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Quick Diagnosis Flowchart
Start here when you see unexpected cache misses:
- Is the cache backend reachable? → If no, see Cache Backend Unreachable
- Is the warmer running? → If no, see Warmer Not Running
- Has the code changed recently? → If yes, see Stale Cache (Code Changed)
- Does the requesting agent have entitlements? → If no, see Entitlement Mismatch
- Does the policy allow cache access? → If no, see Policy Mismatch
- Is this a never-before-seen query? → If yes, see No Match (New Query Pattern)
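The decision flow above can be sketched as a dispatch table over the miss_reason codes this page covers. The event shape and the fallback label below are illustrative assumptions, not a documented schema:

```python
# Map miss_reason codes from this page to the section that covers them.
MISS_REASON_SECTIONS = {
    "backend_error": "Cache Backend Unreachable",
    "timeout": "Cache Backend Unreachable",
    "stale_content": "Stale Cache (Code Changed)",
    "hash_mismatch": "Stale Cache (Code Changed)",
    "entitlement_denied": "Entitlement Mismatch",
    "policy_denied": "Policy Mismatch",
    "policy_mismatch": "Policy Mismatch",
    "no_match": "No Match (New Query Pattern)",
    "not_found": "No Match (New Query Pattern)",
}

def section_for(miss_event: dict) -> str:
    """Return the troubleshooting section for a miss event's reason code."""
    return MISS_REASON_SECTIONS.get(
        miss_event.get("miss_reason", ""), "Warmer Not Running / manual triage"
    )

print(section_for({"miss_reason": "stale_content"}))
```

Unknown or missing reason codes fall through to manual triage, since an absent code usually means the warmer or the event pipeline itself needs attention.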
Stale Cache (Code Changed)
Symptoms: Queries that previously hit the cache now miss. The miss_reason field shows stale_content or hash_mismatch. This typically happens after merges, deployments, or major refactors.
Step-by-step diagnosis:
- Check the miss event in Console → Cache → Recent Misses
- Look at the miss_reason field — if it shows stale_content, the cache entry exists but its content hash no longer matches the current repository state
- Identify which repository changed by checking the repo field on the miss event
- Verify the warmer has a pending job for that repository: Console → Cache → Warmer Jobs
Resolution:
- If the warmer has a pending job, wait for it to complete. Warmers automatically detect code changes and refresh affected entries.
- If no warmer job exists, trigger a manual refresh: Console → Cache → Repository → Refresh Now
- For frequent code changes, reduce the warmer poll interval for that repository
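As a sketch of the last step, a per-repository poll interval override might look like this. The key names below are hypothetical, not a documented configuration schema:

```yaml
# Hypothetical warmer configuration; field names are illustrative.
warmer:
  repositories:
    - repo: my-org/high-churn-service   # placeholder repository name
      poll_interval: 60s                # lowered from a default such as 300s
```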
Policy Mismatch
Symptoms: The miss_reason field shows policy_denied or policy_mismatch. The cache entry exists and is fresh, but the requesting agent's policy does not permit reading from the cache tier where the entry lives.
Step-by-step diagnosis:
- Identify the requesting agent from the miss event
- Check the agent's assigned policy: Console → Agents → [Agent] → Policy
- Look at the cache_access section of the policy
- Compare the cache tier of the entry with the tiers the policy allows
Resolution:
- Update the agent's policy to include the cache tier where the entry resides
- If the entry should live in a tier the agent can access, adjust the cache placement rules
- Verify that the policy change propagates by checking the agent's next cache lookup
For example, a policy granting read access to all three cache tiers, packaged with a pack that chains it:

```yaml
policy:
  cache_access:
    read_tiers:
      - agent-local
      - team-shared
      - org-shared

pack:
  name: troubleshooting-misses-example-1
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - cache_access
```
Entitlement Mismatch
Symptoms: The miss_reason field shows entitlement_denied. The cache entry exists but the requesting identity lacks the entitlement to access it. This commonly occurs when teams have strict data boundaries.
Step-by-step diagnosis:
- Check which team owns the cache entry (Console → Cache → Entry Details → Owner)
- Check the requesting agent's team membership
- Verify the org-shared cache sharing rules permit cross-team access for this content type
- Check if the entry was created with restricted sharing flags
Resolution:
- If cross-team sharing is intended, update the sharing configuration for the owning team
- If the requesting team should have their own entry, verify the warmer is configured to populate entries for that team
- Review entitlement mirror settings if you use connector-based entitlements
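As an illustration of the first resolution step, a sharing configuration for the owning team might look like this. All field names here are hypothetical assumptions, not a documented schema:

```yaml
# Hypothetical sharing configuration for the owning team.
sharing:
  org_shared:
    allow_cross_team: true     # permit other teams to read this team's entries
    content_types:
      - code_context           # placeholder content types
      - query_results
  restricted_flags: []         # clear restrictive flags if cross-team access is intended
```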
No Match (New Query Pattern)
Symptoms: The miss_reason field shows no_match or not_found. No cache entry exists for the query. This is expected for genuinely new queries but problematic if it happens for queries that should be cached.
Step-by-step diagnosis:
- Check if the query pattern matches any existing cache keys (Console → Cache → Search)
- Verify the repository has been indexed by the warmer
- Check if the query uses a model or prompt format the cache recognizes
- Look for slight variations in the query that might prevent matching (whitespace, parameter ordering)
Resolution:
- If the repository is not indexed, add it to the warmer configuration
- If the query format is slightly different from cached entries, check your cache key normalization settings
- For genuinely new patterns, this is expected behavior — the first request fills the cache for subsequent hits
- Consider adding the query pattern to the warmer's seed list for proactive population
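To illustrate why key normalization matters, here is a minimal sketch of a cache key builder that collapses whitespace and sorts parameters so trivially different requests map to the same key. The key format is an assumption, not the product's actual scheme:

```python
import hashlib
import json

def cache_key(query: str, params: dict) -> str:
    """Build a normalized cache key: collapse whitespace, sort parameters."""
    normalized_query = " ".join(query.split())              # collapse whitespace runs
    normalized_params = json.dumps(params, sort_keys=True)  # stable parameter order
    payload = f"{normalized_query}|{normalized_params}"
    return hashlib.sha256(payload.encode()).hexdigest()

# Two superficially different requests produce the same key:
a = cache_key("find  auth\tmiddleware", {"repo": "api", "lang": "go"})
b = cache_key("find auth middleware", {"lang": "go", "repo": "api"})
print(a == b)  # → True
```

Without this kind of normalization, a stray tab or reordered parameter dict produces a different hash and a spurious no_match miss.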
Warmer Not Running
Symptoms: Cache entries are not being refreshed or created. The warmer job queue shows no recent completions. New repositories are not being indexed.
Step-by-step diagnosis:
- Check warmer process health: Console → Cache → Warmers → Status
- Look at the last successful job timestamp
- Check warmer logs for errors or crashes
- Verify the warmer has connectivity to both the cache backend and the source repositories
Resolution:
- If the warmer process crashed, restart it. Check logs for the crash cause.
- If the warmer is running but stalled, check queue depth. A queue backlog indicates the warmer cannot keep up with demand.
- Scale warmer concurrency if queue depth exceeds your acceptable threshold:
```shell
# Increase warmer parallelism
export KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=8
```
- If the warmer cannot reach repositories, check network connectivity and credentials
- See Scaling Cache Warmers for detailed scaling guidance
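A rough way to pick a concurrency value is to size workers against the backlog you want to drain. The throughput numbers below are placeholders; measure your own warmer's job rate:

```python
import math

def recommended_concurrency(queue_depth: int,
                            jobs_per_worker_per_min: float,
                            drain_minutes: float) -> int:
    """Workers needed to drain the backlog within drain_minutes (illustrative)."""
    return max(1, math.ceil(queue_depth / (jobs_per_worker_per_min * drain_minutes)))

# 960 queued jobs, 4 jobs/worker/minute, drain within 30 minutes:
print(recommended_concurrency(960, 4, 30))  # → 8
```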
Cache Backend Unreachable
Symptoms: All cache lookups miss. The miss_reason field shows backend_error or timeout. Backend health checks show red status.
Step-by-step diagnosis:
- Check backend health: Console → Cache → Health Dashboard → Backend Status
- Verify network connectivity from the gateway to the cache backend
- Check if the backend process is running and accepting connections
- Look for resource exhaustion (memory, CPU, disk) on the backend host
Resolution by backend:
Redis/Valkey unreachable
- Verify the Redis process is running: check container or service status
- Test connectivity: attempt a PING from the gateway host
- Check memory usage — Redis evicts entries when memory is full
- Review connection pool settings if you see connection refused errors
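A plain TCP probe (not a full Redis PING) can quickly separate network problems from Redis-level problems. The host and port below are placeholders:

```python
import socket

def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: probe a local Redis before running a real PING via redis-cli.
print(can_connect("127.0.0.1", 6379))
```

If the probe succeeds but PING fails, the issue is at the Redis layer (auth, maxclients, memory) rather than the network.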
S3/GCS unreachable
- Verify credentials are valid and not expired
- Check IAM permissions on the bucket
- Test bucket access from the gateway host with a simple object list
- Check for regional outages in your cloud provider's status page
Qdrant unreachable
- Verify the Qdrant cluster is running and all nodes are healthy
- Check disk space — Qdrant requires disk for index and WAL
- Test collection access with a simple search query
- Review cluster logs for shard rebalancing or recovery operations
Verifying Your Fix
After applying any resolution, verify the fix by:
- Waiting for the next cache lookup from the affected agent or query pattern
- Checking that the miss_reason changes or disappears
- Monitoring the hit rate for the affected scope over the next hour
- Confirming the fix persists across multiple lookup cycles
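As a sketch, fix verification can be reduced to checking that the old miss_reason stops appearing in recent events. The event format is an assumption; substitute your actual log export:

```python
# Hypothetical verification helper: given recent miss events for the affected
# scope, confirm the old miss_reason no longer appears.
def fix_verified(recent_events: list, old_reason: str) -> bool:
    """True if no recent miss event still reports the reason being fixed."""
    return all(e.get("miss_reason") != old_reason for e in recent_events)

events = [{"miss_reason": "no_match"}, {"hit": True}]
print(fix_verified(events, "policy_denied"))  # → True
```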
Next steps
- For recurring stale misses, see Diagnosing Stale Cache Entries
- For systematic hit rate monitoring, see Monitoring Cache Hit Rates
- For backend-specific recovery, see Disaster Recovery
For AI systems
- Canonical terms: Keeptrusts, cache miss, troubleshooting, miss reason, key mismatch, TTL expiry.
- Config surface: miss reason codes, cache key inspection, invalidation logs, deny-list audit.
- When diagnosing cache misses, reference the miss reason taxonomy and diagnostic steps from this page.
For engineers
- Check miss_reason in event logs to identify the specific cause: TTL expiry, key mismatch, invalidation, or deny-list.
- Use cache key inspection to compare expected vs. actual key components when hit rates are low.
- Correlate miss spikes with recent code pushes, config changes, or Fabric artifact rebuilds.
For leaders
- Systematic miss troubleshooting prevents unnecessary cost increases from undiagnosed cache degradation.
- Miss reason tracking enables data-driven configuration improvements rather than guessing.
- Low hit rates are a diagnosable operational issue, not an inherent platform limitation.