Cache Health Dashboard for Platform Admins

The cache health dashboard gives you a platform-wide view of your org-shared cache infrastructure. Use it to monitor backend health, track adoption across teams, identify top-performing organizations, and spot latency issues before they affect your engineering workflows.

Use this page when

You are setting up or interpreting the cache health dashboard for platform administration.
You need to understand which metrics indicate healthy vs degraded cache performance.
You want to configure dashboard panels, alert thresholds, and team-level breakdowns.

Primary audience

Primary: AI Agents, Technical Engineers
Secondary: Technical Leaders

Accessing the Dashboard

Navigate to Settings → Cache → Health Dashboard in the Keeptrusts console. You need the platform_admin or cache_admin role to view platform-wide metrics. Team leads see metrics scoped to their own teams.

Key Metrics to Monitor

Cache Backend Health

The dashboard displays the status of each cache backend in your deployment:

Backend	Healthy Indicator	Warning Threshold	Critical Threshold
Redis/Valkey	Response time under 5ms	Response time 5-20ms	Response time above 20ms or unreachable
S3/GCS	Request success rate above 99.9%	Success rate 99-99.9%	Success rate below 99%
Qdrant	Query latency under 50ms	Latency 50-200ms	Latency above 200ms or cluster degraded
PostgreSQL	Connection pool usage under 80%	Pool usage 80-90%	Pool usage above 90%

Each backend shows a green, yellow, or red status indicator. Click any backend tile to drill into its detailed metrics.
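The thresholds in the table above can be read as a simple classification rule. The following is a minimal sketch of that rule, not the dashboard's actual implementation: the `THRESHOLDS` table, the `backend_status` function, and the choice to model an unreachable or degraded backend as `None` are all assumptions made for illustration.

```python
from typing import Optional

# (warning_bound, critical_bound, higher_is_worse), copied from the table above.
# The dictionary keys are hypothetical labels, not product identifiers.
THRESHOLDS = {
    "redis": (5.0, 20.0, True),           # response time, ms
    "object_store": (99.9, 99.0, False),  # S3/GCS request success rate, %
    "qdrant": (50.0, 200.0, True),        # query latency, ms
    "postgresql": (80.0, 90.0, True),     # connection pool usage, %
}

def backend_status(backend: str, value: Optional[float]) -> str:
    """Return "green", "yellow", or "red" for one backend reading.

    Assumption: None stands for an unreachable or degraded backend.
    """
    if value is None:
        return "red"
    warn, crit, higher_is_worse = THRESHOLDS[backend]
    if not higher_is_worse:
        # Flip signs so the comparisons below always treat larger as worse.
        value, warn, crit = -value, -warn, -crit
    if value > crit:
        return "red"
    if value >= warn:
        return "yellow"
    return "green"
```

Under these assumptions, a reading that sits exactly on a warning bound (for example, a 5 ms Redis response) is classified as yellow, because the table lists it inside the warning range.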

Org-Shared Adoption Rate

This metric tracks what percentage of eligible cache operations use the org-shared cache layer rather than isolated per-agent caches. A healthy adoption rate is above 70%.

You see adoption broken down by:

Organization: Which orgs have enabled org-shared caching
Team: Which teams within an org actively contribute to and consume shared cache
Repository: Which repositories generate the most shared cache entries

If adoption is low, check that teams have configured their agents to use the shared cache tier and that entitlement policies allow cross-team sharing.
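To make the adoption figure concrete, here is a minimal sketch of how the rate could be computed from raw operation counts and broken down by team. The function names and the `(shared_ops, eligible_ops)` input shape are hypothetical; the dashboard computes this for you, and its internal data model is not described here.

```python
def adoption_rate(shared_ops: int, eligible_ops: int) -> float:
    """Percentage of eligible operations served through the org-shared layer."""
    if eligible_ops <= 0:
        return 0.0  # no eligible traffic; report 0 rather than divide by zero
    return 100.0 * shared_ops / eligible_ops

def teams_below_target(ops_by_team, target_pct=70.0):
    """Return (team, rate) pairs at or below the target, lowest rate first.

    ops_by_team maps a team name to a (shared_ops, eligible_ops) tuple.
    """
    rates = {
        team: adoption_rate(shared, eligible)
        for team, (shared, eligible) in ops_by_team.items()
    }
    low = [(team, rate) for team, rate in rates.items() if rate <= target_pct]
    return sorted(low, key=lambda pair: pair[1])
```

The default target of 70 mirrors the healthy threshold stated above; teams returned by `teams_below_target` are the ones whose agent configuration and entitlement policies are worth checking first.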

Top Savings Organizations

The savings leaderboard ranks organizations by avoided LLM cost. Each cache hit avoids a full provider round-trip, and the dashboard calculates the dollar value of those avoided calls based on your configured model pricing.

Review this panel weekly to:

Identify organizations with unusually low savings relative to their usage volume
Spot organizations where cache ROI justifies expanding their tier
Find organizations whose savings dropped suddenly, indicating a potential configuration issue
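As an illustration of how avoided cost might be derived from cache hits and model pricing, consider the sketch below. It assumes per-million-token input and output prices and a per-hit token count; the `ModelPrice` and `CacheHit` types and the pricing unit are hypothetical and may not match how your pricing is configured.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelPrice:
    input_per_million: float   # dollars per 1M input tokens (assumed unit)
    output_per_million: float  # dollars per 1M output tokens (assumed unit)

@dataclass(frozen=True)
class CacheHit:
    org: str
    model: str
    input_tokens: int
    output_tokens: int

def avoided_cost_by_org(hits, prices):
    """Rank organizations by the provider cost their cache hits avoided."""
    totals = {}
    for hit in hits:
        price = prices[hit.model]
        cost = (
            hit.input_tokens * price.input_per_million
            + hit.output_tokens * price.output_per_million
        ) / 1_000_000
        totals[hit.org] = totals.get(hit.org, 0.0) + cost
    # Highest savings first, matching a leaderboard ordering.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Comparing each organization's total against its request volume is one way to find the "unusually low savings relative to usage" cases listed above.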

Hot Cache Keys

The hot keys panel shows the most frequently accessed cache entries across your platform. Each entry displays:

The cache key hash and its human-readable description
Hit count over the selected time window
The owning organization and originating repository
Time since last refresh

Hot keys that show high hit counts with recent refresh times indicate healthy, actively maintained cache entries. Hot keys with stale refresh timestamps may serve outdated responses.
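The staleness check described above can be sketched as a filter over hot key entries. The entry fields (`key_hash`, `hits`, `last_refresh`) are hypothetical names chosen for this example, and the 24-hour default is borrowed from the `hot_key_staleness` alert recommended later on this page.

```python
from datetime import datetime, timedelta, timezone

def stale_hot_keys(entries, now, max_age=timedelta(hours=24)):
    """Return key hashes whose last refresh is older than max_age, hottest first."""
    stale = [e for e in entries if now - e["last_refresh"] > max_age]
    stale.sort(key=lambda e: e["hits"], reverse=True)
    return [e["key_hash"] for e in stale]

# Example data; all values are made up for illustration.
now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
entries = [
    {"key_hash": "a1", "hits": 900, "last_refresh": now - timedelta(hours=2)},
    {"key_hash": "b2", "hits": 400, "last_refresh": now - timedelta(hours=30)},
    {"key_hash": "c3", "hits": 1200, "last_refresh": now - timedelta(days=3)},
]
print(stale_hot_keys(entries, now))  # ['c3', 'b2']
```

Sorting stale keys by hit count puts the entries most likely to serve outdated responses to many callers at the top of the list.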

Backend Latency

The latency panel shows p50, p95, and p99 response times for each cache backend over time. Use this to detect:

Gradual latency increases that indicate growing dataset sizes
Sudden spikes that correlate with infrastructure events
Divergence between backends that suggests uneven load distribution
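For readers less familiar with percentile metrics, the sketch below shows one common way to compute p50, p95, and p99 from raw latency samples, using the nearest-rank method. This is an illustrative assumption: the page does not state which percentile method the dashboard uses, and the function names are hypothetical.

```python
def percentile(samples_ms, p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of data at or below it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # Integer ceiling of p * n / 100 avoids floating-point rounding surprises.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]

def latency_summary(samples_ms):
    """Return the three percentiles shown on the latency panel."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}
```

A widening gap between p50 and p99 means a small share of requests is much slower than the typical one, which is why a stable median can coexist with a failing p99 alert.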

Setting Up Alert Thresholds

Configure alerts from the dashboard by clicking Configure Alerts in the top-right corner. Recommended thresholds:

alerts:
  cache_backend_unhealthy:
    condition: any_backend_status == "critical"
    severity: high
    notify: platform-ops

  adoption_rate_drop:
    condition: org_adoption_rate < 50%
    window: 1h
    severity: medium
    notify: cache-team

  hot_key_staleness:
    condition: hot_key_last_refresh > 24h
    severity: low
    notify: owning-team

  backend_latency_p99:
    condition: latency_p99 > 100ms
    window: 5m
    severity: high
    notify: platform-ops

Common Health Indicators

Healthy Platform

All backends show green status
Org adoption rate above 70%
Savings trend is flat or increasing
No hot keys older than their configured TTL
Backend latency p99 under 50ms

Degraded Platform

At least one backend shows yellow status and none shows red
Adoption rate between 50–70%
Savings trend declining over the past week
Multiple hot keys approaching TTL expiry
Backend latency p99 between 50–100ms

Unhealthy Platform

Any backend shows red status
Adoption rate below 50%
Savings dropped more than 30% week-over-week
Hot keys serving stale data past TTL
Backend latency p99 above 100ms
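The three indicator lists above amount to a "worst signal wins" rule. Here is a minimal sketch of that rule under stated assumptions: the function name and parameters are hypothetical, savings are expressed as a week-over-week percentage change, and the "hot keys approaching TTL expiry" signal is left out because the page gives no numeric threshold for it.

```python
def platform_health(backend_colors, adoption_pct, savings_wow_change_pct,
                    p99_ms, keys_past_ttl=0):
    """Classify the platform as "healthy", "degraded", or "unhealthy".

    backend_colors: iterable of "green" / "yellow" / "red" tile states.
    savings_wow_change_pct: negative values mean savings fell week-over-week.
    keys_past_ttl: number of hot keys serving data past their TTL.
    """
    colors = list(backend_colors)
    if (
        "red" in colors
        or adoption_pct < 50
        or savings_wow_change_pct < -30
        or keys_past_ttl > 0
        or p99_ms > 100
    ):
        return "unhealthy"
    if (
        "yellow" in colors
        or adoption_pct <= 70
        or savings_wow_change_pct < 0
        or p99_ms >= 50
    ):
        return "degraded"
    return "healthy"
```

Boundary values (adoption of exactly 70%, p99 of exactly 50 ms or 100 ms) fall into the degraded band here, since the healthy list requires "above 70%" and "under 50ms" and the unhealthy list requires "above 100ms".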

Dashboard Refresh and Data Retention

The dashboard refreshes metrics every 30 seconds by default. Historical data is retained for 90 days at full resolution and 1 year at hourly aggregation.

You can adjust the time window using the date picker at the top of the dashboard. Common windows include:

Last hour: Useful for investigating active incidents
Last 24 hours: Daily operational review
Last 7 days: Weekly trend analysis
Last 30 days: Monthly capacity planning
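The retention policy above determines which resolution is available for a given point in the past. A small sketch of that lookup follows; the function name is hypothetical, and treating "1 year" as 365 days is an assumption.

```python
from datetime import timedelta
from typing import Optional

FULL_RESOLUTION = timedelta(days=90)
HOURLY_RETENTION = timedelta(days=365)  # assumption: "1 year" treated as 365 days

def available_resolution(age: timedelta) -> Optional[str]:
    """Return the resolution retained for data of the given age, or None if expired."""
    if age <= FULL_RESOLUTION:
        return "full"
    if age <= HOURLY_RETENTION:
        return "hourly"
    return None
```

All four common windows listed above fall within the 90-day full-resolution period, so only longer custom ranges are limited to hourly aggregates.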

Next steps

Set up alerting on fill cost spikes to catch abnormal cache population patterns
Configure observability integration to export these metrics to your existing monitoring stack
Review capacity planning if backend utilization approaches warning thresholds

For AI systems

Canonical terms: Keeptrusts, cache health dashboard, platform admin, backend status, org-shared adoption, hot cache keys, backend latency, savings leaderboard.
Exact feature/config names: Settings → Cache → Health Dashboard, platform_admin or cache_admin role, alert configs (cache_backend_unhealthy, adoption_rate_drop, hot_key_staleness, backend_latency_p99).
Best next pages: Alerting on Fill Spikes, Observability Integration, Capacity Planning.

For engineers

Access at Settings → Cache → Health Dashboard; requires platform_admin or cache_admin role (team leads see team-scoped metrics).
Monitor five key areas: backend health (Redis/Valkey, S3/GCS, Qdrant, and PostgreSQL status), adoption rate (target >70%), savings leaderboard, hot key freshness, and backend latency (p50, p95, p99).
Set alerts: cache_backend_unhealthy on any critical backend status (high), adoption_rate_drop below 50% (medium), hot_key_staleness >24h (low), backend_latency_p99 >100ms (high).
Dashboard refreshes every 30 seconds; historical data retained 90 days at full resolution, 1 year at hourly aggregation.
Use time windows: last hour (incident investigation), 24h (daily ops), 7d (trend analysis), 30d (capacity planning).

For leaders

The dashboard surfaces platform-wide cache ROI: use it to identify orgs with high savings and to spot sudden savings drops that indicate configuration issues.
Healthy indicators: all backends green, adoption above 70%, savings flat or increasing week-over-week, and latency p99 under 50ms.
Warning indicators: adoption 50-70%, savings declining, latency p99 50-100ms. Investigate these before they become critical.
Review the savings leaderboard weekly to justify expanding cache to additional organizations and to identify underperforming deployments.
