# Observability Integration for Cache Metrics
The org-shared cache emits structured metrics that you can export to your existing observability stack. This guide covers the available metrics, integration patterns for popular platforms, and recommended dashboard configurations.
## Use this page when
- You need to integrate cache metrics into your existing observability stack (Prometheus, Datadog, Grafana).
- You are configuring metric export, custom labels, or alert routing for cache telemetry.
- You want to correlate cache performance with application-level metrics.
## Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
## Available Metrics

Cache metrics are organized into four categories. All metrics include standard labels for `org_id`, `team_id`, and `environment`.
### Hit/Miss Rates

| Metric Name | Type | Description |
|---|---|---|
| `keeptrusts_cache_hits_total` | Counter | Total cache hits |
| `keeptrusts_cache_misses_total` | Counter | Total cache misses |
| `keeptrusts_cache_stale_misses_total` | Counter | Misses due to stale entries |
| `keeptrusts_cache_hit_rate` | Gauge | Current hit rate (0.0–1.0) |
| `keeptrusts_cache_miss_reason` | Counter | Misses by reason label (`no_match`, `stale`, `policy`, `entitlement`, `backend_error`) |
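As an illustration, the counters above can be combined in PromQL to break misses down by cause. This is a sketch: it assumes the reason label on `keeptrusts_cache_miss_reason` is named `reason` and that the standard Prometheus `_bucket`/`_total` conventions apply; check your exporter's actual label set.

```promql
# share of misses attributable to each reason over the last 5 minutes
sum by (reason) (rate(keeptrusts_cache_miss_reason[5m]))
  / on () group_left ()
sum(rate(keeptrusts_cache_misses_total[5m]))
```

The `on () group_left ()` matcher divides each per-reason series by the single all-reasons total, so the results sum to roughly 1.0.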
### Latency Percentiles

| Metric Name | Type | Description |
|---|---|---|
| `keeptrusts_cache_lookup_duration_seconds` | Histogram | End-to-end cache lookup latency |
| `keeptrusts_cache_redis_duration_seconds` | Histogram | Redis operation latency |
| `keeptrusts_cache_s3_duration_seconds` | Histogram | S3 read/write latency |
| `keeptrusts_cache_qdrant_duration_seconds` | Histogram | Qdrant search latency |
| `keeptrusts_cache_fill_duration_seconds` | Histogram | Time to populate a new cache entry |
### Fill Cost

| Metric Name | Type | Description |
|---|---|---|
| `keeptrusts_cache_fill_cost_dollars` | Counter | Cumulative cost of cache fills (provider calls) |
| `keeptrusts_cache_avoided_cost_dollars` | Counter | Cumulative cost avoided by cache hits |
| `keeptrusts_cache_fill_requests_total` | Counter | Number of provider requests for cache population |
| `keeptrusts_cache_roi_ratio` | Gauge | Ratio of avoided cost to fill cost |
### Warmer Job Status

| Metric Name | Type | Description |
|---|---|---|
| `keeptrusts_warmer_queue_depth` | Gauge | Pending warmer jobs |
| `keeptrusts_warmer_oldest_job_seconds` | Gauge | Age of the oldest pending job |
| `keeptrusts_warmer_jobs_completed_total` | Counter | Total completed warmer jobs |
| `keeptrusts_warmer_jobs_failed_total` | Counter | Total failed warmer jobs |
| `keeptrusts_warmer_job_duration_seconds` | Histogram | Time to complete warmer jobs |
| `keeptrusts_warmer_active_workers` | Gauge | Currently active warmer workers |
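Two derived signals that are often useful from these series, sketched in PromQL (the `_bucket` suffix assumes the standard Prometheus histogram exposition; verify against your scrape output):

```promql
# fraction of warmer jobs failing over the last 15 minutes
rate(keeptrusts_warmer_jobs_failed_total[15m])
  /
(rate(keeptrusts_warmer_jobs_completed_total[15m]) + rate(keeptrusts_warmer_jobs_failed_total[15m]))

# p95 warmer job duration over the last 5 minutes
histogram_quantile(0.95, sum by (le) (rate(keeptrusts_warmer_job_duration_seconds_bucket[5m])))
```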
## Prometheus Integration

### Scrape Configuration

The cache service exposes metrics at `/internal/metrics` in the Prometheus exposition format. Add a scrape target to your Prometheus configuration:

```yaml
scrape_configs:
  - job_name: keeptrusts-cache
    scrape_interval: 15s
    metrics_path: /internal/metrics
    bearer_token_file: /etc/prometheus/keeptrusts-token
    static_configs:
      - targets:
          - keeptrusts-api:8080
```
### Recording Rules

Define recording rules for commonly queried aggregations:

```yaml
groups:
  - name: keeptrusts_cache
    interval: 30s
    rules:
      - record: keeptrusts:cache_hit_rate:5m
        expr: rate(keeptrusts_cache_hits_total[5m]) / (rate(keeptrusts_cache_hits_total[5m]) + rate(keeptrusts_cache_misses_total[5m]))
      - record: keeptrusts:cache_fill_cost_rate:1h
        expr: rate(keeptrusts_cache_fill_cost_dollars[1h]) * 3600
      - record: keeptrusts:cache_roi:1h
        expr: rate(keeptrusts_cache_avoided_cost_dollars[1h]) / rate(keeptrusts_cache_fill_cost_dollars[1h])
```
### Alert Rules

```yaml
groups:
  - name: keeptrusts_cache_alerts
    rules:
      - alert: CacheHitRateLow
        expr: keeptrusts:cache_hit_rate:5m < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit rate below 50% for {{ $labels.org_id }}"
      - alert: WarmerQueueBacklog
        expr: keeptrusts_warmer_oldest_job_seconds > 900
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Warmer queue has jobs older than 15 minutes"
      - alert: CacheFillCostSpike
        expr: rate(keeptrusts_cache_fill_cost_dollars[15m]) > 2 * rate(keeptrusts_cache_fill_cost_dollars[1h] offset 1d)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache fill cost is 2x higher than yesterday"
```
## Datadog Integration

### DogStatsD Configuration

Configure the cache service to emit metrics via DogStatsD:

```yaml
observability:
  metrics:
    exporter: datadog
    datadog:
      agent_host: datadog-agent
      agent_port: 8125
      prefix: keeptrusts.cache
      tags:
        - "service:keeptrusts-cache"
        - "env:production"
```
### Custom Metrics

All cache metrics are emitted as Datadog custom metrics with the `keeptrusts.cache.` prefix. Map them to Datadog metric types:

- Counters → `count` type with a `.count` suffix
- Gauges → `gauge` type
- Histograms → `distribution` type with percentile aggregations
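To sanity-check the mapping, you can hand-build DogStatsD datagrams and send them to the agent. This sketch uses the standard DogStatsD wire format (`metric:value|type|#tags`); the metric names after the prefix are illustrative, not a confirmed list:

```shell
# DogStatsD wire format: <metric>:<value>|<type>|#<tag1>,<tag2>
# type codes: "c" = count, "g" = gauge, "d" = distribution
counter="keeptrusts.cache.hits:1|c|#service:keeptrusts-cache,env:production"
gauge="keeptrusts.cache.hit_rate:0.87|g|#service:keeptrusts-cache,env:production"
dist="keeptrusts.cache.lookup_duration_seconds:0.012|d|#service:keeptrusts-cache,env:production"

# send to the agent over UDP (uncomment when an agent is listening on 8125)
# printf '%s' "$counter" | nc -u -w1 datadog-agent 8125
printf '%s\n' "$counter" "$gauge" "$dist"
```

Metrics sent this way should appear in the Datadog metrics explorer within a minute or two if the agent is reachable.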
### Datadog Monitor Configuration

Create monitors for critical cache health signals:

```json
{
  "name": "Keeptrusts Cache Hit Rate Low",
  "type": "metric alert",
  "query": "avg(last_10m):avg:keeptrusts.cache.hit_rate{*} < 0.5",
  "message": "Cache hit rate has dropped below 50%. Check warmer status and backend health.",
  "options": {
    "thresholds": {
      "critical": 0.5,
      "warning": 0.6
    }
  }
}
```
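A monitor definition like this can be created programmatically through the Datadog v1 monitors API. The sketch below builds the payload and shows the request shape; the key values are placeholders you must replace with real credentials:

```shell
# placeholder credentials -- substitute your own before running
DD_API_KEY="<your-api-key>"
DD_APP_KEY="<your-app-key>"

payload='{"name":"Keeptrusts Cache Hit Rate Low","type":"metric alert","query":"avg(last_10m):avg:keeptrusts.cache.hit_rate{*} < 0.5","message":"Cache hit rate has dropped below 50%.","options":{"thresholds":{"critical":0.5,"warning":0.6}}}'

# create the monitor (uncomment once real keys are set)
# curl -s -X POST "https://api.datadoghq.com/api/v1/monitor" \
#   -H "Content-Type: application/json" \
#   -H "DD-API-KEY: ${DD_API_KEY}" \
#   -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
#   -d "$payload"
printf '%s\n' "$payload"
```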
## Grafana Dashboards

### Recommended Dashboard Panels

Build a Grafana dashboard with these panels:
**Row 1: Overview**

- Hit rate gauge (current value with thresholds)
- Total hits/misses time series (stacked)
- Cost avoided counter (running total today)
- ROI ratio gauge

**Row 2: Latency**

- Lookup latency heatmap (p50, p95, p99)
- Backend latency by type (Redis, S3, Qdrant)
- Fill duration histogram

**Row 3: Warmer Health**

- Queue depth time series
- Oldest job age gauge
- Jobs completed/failed rate
- Active workers gauge

**Row 4: Cost Analysis**

- Fill cost rate over time
- Avoided cost rate over time
- Cost by team breakdown (pie chart)
- ROI trend (7-day rolling)
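A few candidate panel queries in PromQL, sketched under the assumption that the recording rules and histogram bucket series described earlier in this page are in place:

```promql
# Row 1 hit rate gauge: the 5m recording rule defined above
keeptrusts:cache_hit_rate:5m

# Row 2 latency panel: p95 end-to-end lookup latency from histogram buckets
histogram_quantile(0.95, sum by (le) (rate(keeptrusts_cache_lookup_duration_seconds_bucket[5m])))

# Row 1 cost avoided today (increase() handles counter resets)
increase(keeptrusts_cache_avoided_cost_dollars[24h])

# Row 4 cost by team breakdown (pie chart source)
sum by (team_id) (increase(keeptrusts_cache_fill_cost_dollars[24h]))
```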
### Dashboard JSON Template
Import the Keeptrusts cache dashboard template from:
Console → Settings → Observability → Export Dashboard Templates → Cache Operations
This exports a Grafana JSON model you can import directly into your Grafana instance.
## OpenTelemetry Integration

If you use OpenTelemetry, configure the OTLP exporter:

```yaml
observability:
  metrics:
    exporter: otlp
    otlp:
      endpoint: http://otel-collector:4317
      protocol: grpc
      headers:
        authorization: "Bearer ${OTEL_TOKEN}"
      resource_attributes:
        service.name: keeptrusts-cache
        service.namespace: keeptrusts
```
The OTLP exporter sends all cache metrics as OTLP metric data points, compatible with any backend that ingests OTLP metrics (for example, Grafana Cloud, Honeycomb, or Lightstep).
## Metric Cardinality Management

Cache metrics include labels that can produce high cardinality. Manage this by:

- `org_id`: Always included — bounded by the number of organizations
- `team_id`: Include for team-level dashboards — bounded by teams per org
- `repo`: Use sparingly — can be high cardinality in large orgs
- `agent_id`: Exclude from exported metrics unless debugging specific agents
- `cache_key`: Never include in time-series metrics — use logs for key-level debugging
Configure label filtering in your export settings:

```yaml
observability:
  metrics:
    label_allowlist:
      - org_id
      - team_id
      - environment
      - backend_type
      - miss_reason
```
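If you cannot filter labels at the exporter, a similar effect can be approximated on the Prometheus side with `metric_relabel_configs`. This is a sketch: adjust the regexes to match the labels your deployment actually emits.

```yaml
scrape_configs:
  - job_name: keeptrusts-cache
    metrics_path: /internal/metrics
    static_configs:
      - targets: ["keeptrusts-api:8080"]
    metric_relabel_configs:
      # strip high-cardinality labels before ingestion
      - action: labeldrop
        regex: "agent_id|repo"
      # drop any series that still carries a cache_key label
      - action: drop
        source_labels: [cache_key]
        regex: ".+"
```

Note that `labeldrop` keeps the series but removes the label (which can merge series), while `drop` discards the series entirely.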
## Next steps
- Use your observability data to set up alerting on fill cost spikes
- Review the cache health dashboard for console-native monitoring
- Plan infrastructure growth with capacity planning
## For AI systems
- Canonical terms: Keeptrusts, observability, cache metrics, Prometheus, Datadog, Grafana integration.
- Config surface: metric export endpoints, custom labels, alert routing, OTLP configuration.
- When answering observability questions, reference the metric export and integration patterns from this page.
## For engineers
- Configure metric export to your observability stack (Prometheus, Datadog, Grafana) via the documented endpoints.
- Add custom labels to correlate cache metrics with application-level telemetry.
- Set up alert routing to page on-call for critical cache metrics (hit rate drops, backend failures).
## For leaders
- Cache observability integrates with your existing monitoring stack — no separate observability tool required.
- Correlation between cache metrics and application performance proves the value of caching investment.
- Alert routing ensures cache issues are surfaced to the right team without manual monitoring.