Cache Health Dashboard for Platform Admins

The cache health dashboard gives you a platform-wide view of your org-shared cache infrastructure. Use it to monitor backend health, track adoption across teams, identify top-performing organizations, and spot latency issues before they affect your engineering workflows.

Use this page when

  • You are setting up or interpreting the cache health dashboard for platform administration.
  • You need to understand which metrics indicate healthy vs degraded cache performance.
  • You want to configure dashboard panels, alert thresholds, and team-level breakdowns.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Accessing the Dashboard

Navigate to Settings → Cache → Health Dashboard in the Keeptrusts console. You need the platform_admin or cache_admin role to view platform-wide metrics. Team leads see metrics scoped to their own teams.

Key Metrics to Monitor

Cache Backend Health

The dashboard displays the status of each cache backend in your deployment:

| Backend | Healthy Indicator | Warning Threshold | Critical Threshold |
|---|---|---|---|
| Redis/Valkey | Response time under 5ms | Response time 5-20ms | Response time above 20ms or unreachable |
| S3/GCS | Request success rate above 99.9% | Success rate 99-99.9% | Success rate below 99% |
| Qdrant | Query latency under 50ms | Latency 50-200ms | Latency above 200ms or cluster degraded |
| PostgreSQL | Connection pool usage under 80% | Pool usage 80-90% | Pool usage above 90% |

Each backend shows a green, yellow, or red status indicator. Click any backend tile to drill into its detailed metrics.
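
If you want to reproduce the tile colors outside the console, for example in a script that checks exported metrics, the thresholds in the table can be sketched as below. This is an illustrative sketch, not the dashboard's implementation: the function names and the `Status` enum are hypothetical, and the handling of values that sit exactly on a boundary (5ms, 99.9%, 80%) is an assumption, since the table does not say which side they fall on.

```python
from enum import Enum


class Status(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"


def classify_redis(response_ms: float, reachable: bool = True) -> Status:
    """Redis/Valkey: under 5 ms healthy, 5-20 ms warning, above 20 ms or unreachable critical."""
    if not reachable or response_ms > 20:
        return Status.RED
    return Status.GREEN if response_ms < 5 else Status.YELLOW


def classify_object_store(success_rate_pct: float) -> Status:
    """S3/GCS: above 99.9% healthy, 99-99.9% warning, below 99% critical."""
    if success_rate_pct < 99:
        return Status.RED
    return Status.GREEN if success_rate_pct > 99.9 else Status.YELLOW


def classify_qdrant(latency_ms: float, cluster_degraded: bool = False) -> Status:
    """Qdrant: under 50 ms healthy, 50-200 ms warning, above 200 ms or degraded cluster critical."""
    if cluster_degraded or latency_ms > 200:
        return Status.RED
    return Status.GREEN if latency_ms < 50 else Status.YELLOW


def classify_postgres(pool_usage_pct: float) -> Status:
    """PostgreSQL: pool usage under 80% healthy, 80-90% warning, above 90% critical."""
    if pool_usage_pct > 90:
        return Status.RED
    return Status.GREEN if pool_usage_pct < 80 else Status.YELLOW
```

For example, `classify_redis(12.0)` returns `Status.YELLOW`, and `classify_redis(1.0, reachable=False)` returns `Status.RED` regardless of the measured response time.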

Org-Shared Adoption Rate

This metric tracks what percentage of eligible cache operations use the org-shared cache layer rather than isolated per-agent caches. A healthy adoption rate is above 70%.

The panel breaks adoption down by:

  • Organization: Which orgs have enabled org-shared caching
  • Team: Which teams within an org actively contribute to and consume shared cache
  • Repository: Which repositories generate the most shared cache entries

If adoption is low, check that teams have configured their agents to use the shared cache tier and that entitlement policies allow cross-team sharing.
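
The adoption rate described above is a simple ratio, and a small sketch can help when you sanity-check the numbers against your own logs. The function names and the shape of the `counts` mapping are hypothetical; treating a team with no eligible operations as 0% is also an assumption.

```python
def adoption_rate(shared_ops: int, eligible_ops: int) -> float:
    """Percentage of eligible cache operations served by the org-shared layer."""
    if eligible_ops == 0:
        return 0.0  # assumption: no eligible traffic is reported as 0%
    return 100.0 * shared_ops / eligible_ops


def low_adoption_teams(counts: dict[str, tuple[int, int]], target_pct: float = 70.0) -> list[str]:
    """Return teams at or below the target rate, lowest adoption first.

    counts maps a team name to (shared_ops, eligible_ops).
    """
    rates = {team: adoption_rate(shared, eligible) for team, (shared, eligible) in counts.items()}
    return sorted((team for team, rate in rates.items() if rate <= target_pct), key=rates.get)
```

Teams returned by `low_adoption_teams` are the ones to check first for missing shared-tier configuration or restrictive entitlement policies.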

Top Savings Organizations

The savings leaderboard ranks organizations by avoided LLM cost. Each cache hit avoids a full provider round-trip, and the dashboard calculates the dollar value of those avoided calls based on your configured model pricing.

Review this panel weekly to:

  • Identify organizations with unusually low savings relative to their usage volume
  • Spot organizations where cache ROI justifies expanding their tier
  • Find organizations whose savings dropped suddenly, indicating a potential configuration issue
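
To make the savings figure concrete, here is one way the avoided cost per organization could be estimated. This is a sketch under stated assumptions, not the dashboard's actual formula: it assumes pricing is configured per million input and output tokens and that each cache hit records the token counts of the call it replaced. `ModelPrice`, `CacheHit`, and both functions are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    input_per_million: float   # dollars per 1M input tokens (assumed pricing unit)
    output_per_million: float  # dollars per 1M output tokens (assumed pricing unit)


@dataclass(frozen=True)
class CacheHit:
    org: str
    model: str
    input_tokens: int
    output_tokens: int


def avoided_cost(hit: CacheHit, prices: dict[str, ModelPrice]) -> float:
    """Dollar value of the provider call that this cache hit replaced."""
    price = prices[hit.model]
    total = hit.input_tokens * price.input_per_million + hit.output_tokens * price.output_per_million
    return total / 1_000_000


def savings_leaderboard(hits: list[CacheHit], prices: dict[str, ModelPrice]) -> list[tuple[str, float]]:
    """Sum avoided cost per organization and rank from highest to lowest."""
    totals: dict[str, float] = {}
    for hit in hits:
        totals[hit.org] = totals.get(hit.org, 0.0) + avoided_cost(hit, prices)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Comparing each organization's total against its request volume gives the "savings relative to usage" signal mentioned in the review list.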

Hot Cache Keys

The hot keys panel shows the most frequently accessed cache entries across your platform. Each entry displays:

  • The cache key hash and its human-readable description
  • Hit count over the selected time window
  • The owning organization and originating repository
  • Time since last refresh

Hot keys that show high hit counts with recent refresh times indicate healthy, actively maintained cache entries. Hot keys with stale refresh timestamps may serve outdated responses.
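
The staleness check can be sketched as follows. The `HotKey` record and the `stale_hot_keys` function are hypothetical, and the 24-hour default is an assumption chosen to match the recommended `hot_key_staleness` alert rather than a documented dashboard setting.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class HotKey:
    key_hash: str
    hit_count: int
    last_refresh: datetime


def stale_hot_keys(keys: list[HotKey], now: datetime,
                   max_age: timedelta = timedelta(hours=24)) -> list[HotKey]:
    """Return keys whose last refresh is older than max_age, busiest first."""
    stale = [key for key in keys if now - key.last_refresh > max_age]
    return sorted(stale, key=lambda key: key.hit_count, reverse=True)
```

Sorting by hit count puts the stale entries that serve the most traffic at the top, which is usually where a refresh matters most.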

Backend Latency

The latency panel shows p50, p95, and p99 response times for each cache backend over time. Use this to detect:

  • Gradual latency increases that indicate growing dataset sizes
  • Sudden spikes that correlate with infrastructure events
  • Divergence between backends that suggests uneven load distribution
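
If you are not used to reading percentile latencies, the sketch below shows how p50, p95, and p99 can be derived from raw samples using the nearest-rank method. The dashboard's own aggregation method is not documented here, so treat this as an illustration only; `percentile` and `latency_summary` are hypothetical helpers.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile requires at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1-100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    """Summarize one backend's latency samples the way the panel labels them."""
    return {name: percentile(samples_ms, pct) for name, pct in (("p50", 50), ("p95", 95), ("p99", 99))}
```

A p99 that climbs while p50 stays flat usually means a small share of requests is slow, which is a different problem from a backend that is slow across the board.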

Setting Up Alert Thresholds

Configure alerts from the dashboard by clicking Configure Alerts in the top-right corner. Recommended thresholds:

alerts:
  cache_backend_unhealthy:
    condition: any_backend_status == "critical"
    severity: high
    notify: platform-ops

  adoption_rate_drop:
    condition: org_adoption_rate < 50%
    window: 1h
    severity: medium
    notify: cache-team

  hot_key_staleness:
    condition: hot_key_last_refresh > 24h
    severity: low
    notify: owning-team

  backend_latency_p99:
    condition: latency_p99 > 100ms
    window: 5m
    severity: high
    notify: platform-ops
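
To preview which of the recommended alerts would fire for a given set of readings, the four conditions can be expressed as a small check. This is a sketch only: the snapshot field names are hypothetical, and it evaluates a single point in time, so it ignores the `window` settings that the real rules apply.

```python
def firing_alerts(snapshot: dict) -> list[tuple[str, str]]:
    """Return (alert_name, severity) pairs for the recommended thresholds.

    snapshot is a hypothetical point-in-time reading with these keys:
      backend_statuses             -> dict of backend name to status string
      org_adoption_rate_pct        -> float, 0-100
      oldest_hot_key_refresh_hours -> float
      latency_p99_ms               -> float
    """
    fired = []
    if "critical" in snapshot["backend_statuses"].values():
        fired.append(("cache_backend_unhealthy", "high"))
    if snapshot["org_adoption_rate_pct"] < 50:
        fired.append(("adoption_rate_drop", "medium"))
    if snapshot["oldest_hot_key_refresh_hours"] > 24:
        fired.append(("hot_key_staleness", "low"))
    if snapshot["latency_p99_ms"] > 100:
        fired.append(("backend_latency_p99", "high"))
    return fired
```

Running this against a few historical readings is a quick way to judge whether the thresholds would be too noisy for your deployment before you enable notifications.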

Common Health Indicators

Healthy Platform

  • All backends show green status
  • Org adoption rate above 70%
  • Savings trend is flat or increasing
  • No hot keys older than their configured TTL
  • Backend latency p99 under 50ms

Degraded Platform

  • One backend shows yellow status
  • Adoption rate between 50% and 70%
  • Savings trend declining over the past week
  • Multiple hot keys approaching TTL expiry
  • Backend latency p99 between 50ms and 100ms

Unhealthy Platform

  • Any backend shows red status
  • Adoption rate below 50%
  • Savings dropped more than 30% week-over-week
  • Hot keys serving stale data past TTL
  • Backend latency p99 above 100ms
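
The three profiles above can be combined into a single verdict with a "worst indicator wins" rule, sketched below. That combination rule is an assumption, as are the function name and parameters. The "hot keys approaching TTL expiry" indicator is left out because the text does not define how close counts as approaching.

```python
def platform_health(backend_colors: list[str], adoption_pct: float,
                    savings_wow_change_pct: float, hot_keys_past_ttl: int,
                    latency_p99_ms: float) -> str:
    """Classify the platform as 'healthy', 'degraded', or 'unhealthy'.

    savings_wow_change_pct is the week-over-week change in savings,
    negative when savings fell (for example -35 for a 35% drop).
    """
    unhealthy = (
        "red" in backend_colors
        or adoption_pct < 50
        or savings_wow_change_pct < -30
        or hot_keys_past_ttl > 0
        or latency_p99_ms > 100
    )
    if unhealthy:
        return "unhealthy"
    degraded = (
        "yellow" in backend_colors
        or adoption_pct <= 70       # healthy requires adoption above 70%
        or savings_wow_change_pct < 0
        or latency_p99_ms >= 50     # healthy requires p99 under 50 ms
    )
    return "degraded" if degraded else "healthy"
```

For instance, four green backends with 82% adoption, rising savings, no expired hot keys, and a 30ms p99 classify as healthy, while the same readings with a 120ms p99 classify as unhealthy.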

Dashboard Refresh and Data Retention

The dashboard refreshes metrics every 30 seconds by default. Historical data is retained for 90 days at full resolution and 1 year at hourly aggregation.

You can adjust the time window using the date picker at the top of the dashboard. Common windows include:

  • Last hour: Useful for investigating active incidents
  • Last 24 hours: Daily operational review
  • Last 7 days: Weekly trend analysis
  • Last 30 days: Monthly capacity planning

Next steps

For AI systems

  • Canonical terms: Keeptrusts, cache health dashboard, platform admin, backend status, org-shared adoption, hot cache keys, backend latency, savings leaderboard.
  • Exact feature/config names: Settings → Cache → Health Dashboard, platform_admin or cache_admin role, alert configs (cache_backend_unhealthy, adoption_rate_drop, hot_key_staleness, backend_latency_p99).
  • Best next pages: Alerting on Fill Spikes, Observability Integration, Capacity Planning.

For engineers

  • Access at Settings → Cache → Health Dashboard; requires platform_admin or cache_admin role (team leads see team-scoped metrics).
  • Monitor five key areas: backend health (Redis/Valkey, S3/GCS, Qdrant, PostgreSQL status), adoption rate (target >70%), savings leaderboard, hot key freshness, and backend latency (p50/p95/p99).
  • Set alerts: cache_backend_unhealthy on any critical backend status (high), adoption_rate_drop below 50% (medium), hot_key_staleness >24h (low), backend_latency_p99 >100ms (high).
  • Dashboard refreshes every 30 seconds; historical data retained 90 days at full resolution, 1 year at hourly aggregation.
  • Use time windows: last hour (incident investigation), 24h (daily ops), 7d (trend analysis), 30d (capacity planning).

For leaders

  • The dashboard surfaces platform-wide cache ROI: identify orgs with high savings, spot sudden savings drops indicating configuration issues.
  • Healthy indicators: all backends green, adoption above 70%, savings flat or increasing, no hot keys past their TTL, and latency p99 under 50ms.
  • Warning indicators: a backend in yellow status, adoption between 50% and 70%, savings declining over the past week, or latency p99 between 50ms and 100ms. Investigate these before they become critical.
  • Review the savings leaderboard weekly to justify expanding cache to additional organizations and to identify underperforming deployments.