Observability Integration for Cache Metrics

The org-shared cache emits structured metrics that you can export to your existing observability stack. This guide covers the available metrics, integration patterns for popular platforms, and recommended dashboard configurations.

Use this page when

  • You need to integrate cache metrics into your existing observability stack (Prometheus, Datadog, Grafana).
  • You are configuring metric export, custom labels, or alert routing for cache telemetry.
  • You want to correlate cache performance with application-level metrics.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Available Metrics

Cache metrics are organized into four categories. All metrics include standard labels for org_id, team_id, and environment.

Hit/Miss Rates

| Metric Name | Type | Description |
| --- | --- | --- |
| keeptrusts_cache_hits_total | Counter | Total cache hits |
| keeptrusts_cache_misses_total | Counter | Total cache misses |
| keeptrusts_cache_stale_misses_total | Counter | Misses due to stale entries |
| keeptrusts_cache_hit_rate | Gauge | Current hit rate (0.0–1.0) |
| keeptrusts_cache_miss_reason | Counter | Misses by reason label (no_match, stale, policy, entitlement, backend_error) |
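The miss-reason counter is the fastest way to see why lookups are failing. As a sketch, assuming the reason is exported under a miss_reason label, a per-reason miss-rate query might look like:

```promql
# Cache misses per second over the last 5 minutes, broken down by reason
sum by (miss_reason) (rate(keeptrusts_cache_miss_reason[5m]))
```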

Latency Percentiles

| Metric Name | Type | Description |
| --- | --- | --- |
| keeptrusts_cache_lookup_duration_seconds | Histogram | End-to-end cache lookup latency |
| keeptrusts_cache_redis_duration_seconds | Histogram | Redis operation latency |
| keeptrusts_cache_s3_duration_seconds | Histogram | S3 read/write latency |
| keeptrusts_cache_qdrant_duration_seconds | Histogram | Qdrant search latency |
| keeptrusts_cache_fill_duration_seconds | Histogram | Time to populate a new cache entry |
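Because these are Prometheus histograms, percentiles are computed at query time with histogram_quantile over the metric's _bucket series. A sketch for p95 end-to-end lookup latency:

```promql
# 95th-percentile cache lookup latency over the last 5 minutes
histogram_quantile(
  0.95,
  sum by (le) (rate(keeptrusts_cache_lookup_duration_seconds_bucket[5m]))
)
```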

Fill Cost

| Metric Name | Type | Description |
| --- | --- | --- |
| keeptrusts_cache_fill_cost_dollars | Counter | Cumulative cost of cache fills (provider calls) |
| keeptrusts_cache_avoided_cost_dollars | Counter | Cumulative cost avoided by cache hits |
| keeptrusts_cache_fill_requests_total | Counter | Number of provider requests for cache population |
| keeptrusts_cache_roi_ratio | Gauge | Ratio of avoided cost to fill cost |
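Cost counters can be turned into period totals with increase(). For example, a sketch of provider spend avoided over the last day, broken down by organization:

```promql
# Dollars of provider cost avoided in the past 24 hours, by org
sum by (org_id) (increase(keeptrusts_cache_avoided_cost_dollars[24h]))
```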

Warmer Job Status

| Metric Name | Type | Description |
| --- | --- | --- |
| keeptrusts_warmer_queue_depth | Gauge | Pending warmer jobs |
| keeptrusts_warmer_oldest_job_seconds | Gauge | Age of the oldest pending job |
| keeptrusts_warmer_jobs_completed_total | Counter | Total completed warmer jobs |
| keeptrusts_warmer_jobs_failed_total | Counter | Total failed warmer jobs |
| keeptrusts_warmer_job_duration_seconds | Histogram | Time to complete warmer jobs |
| keeptrusts_warmer_active_workers | Gauge | Currently active warmer workers |
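The completed/failed counters combine into a failure ratio, which is often a better alert signal than a raw failure count. A sketch:

```promql
# Fraction of warmer jobs that failed over the last hour
rate(keeptrusts_warmer_jobs_failed_total[1h])
  / (rate(keeptrusts_warmer_jobs_completed_total[1h]) + rate(keeptrusts_warmer_jobs_failed_total[1h]))
```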

Prometheus Integration

Scrape Configuration

The cache service exposes metrics in Prometheus exposition format at the /internal/metrics endpoint. Add a scrape target to your Prometheus configuration:

```yaml
scrape_configs:
  - job_name: keeptrusts-cache
    scrape_interval: 15s
    static_configs:
      - targets:
          - keeptrusts-api:8080
    metrics_path: /internal/metrics
    bearer_token_file: /etc/prometheus/keeptrusts-token
```

Recording Rules

Define recording rules for commonly queried aggregations:

```yaml
groups:
  - name: keeptrusts_cache
    interval: 30s
    rules:
      - record: keeptrusts:cache_hit_rate:5m
        expr: rate(keeptrusts_cache_hits_total[5m]) / (rate(keeptrusts_cache_hits_total[5m]) + rate(keeptrusts_cache_misses_total[5m]))

      - record: keeptrusts:cache_fill_cost_rate:1h
        expr: rate(keeptrusts_cache_fill_cost_dollars[1h]) * 3600

      - record: keeptrusts:cache_roi:1h
        expr: rate(keeptrusts_cache_avoided_cost_dollars[1h]) / rate(keeptrusts_cache_fill_cost_dollars[1h])
```

Alert Rules

```yaml
groups:
  - name: keeptrusts_cache_alerts
    rules:
      - alert: CacheHitRateLow
        expr: keeptrusts:cache_hit_rate:5m < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit rate below 50% for {{ $labels.org_id }}"

      - alert: WarmerQueueBacklog
        expr: keeptrusts_warmer_oldest_job_seconds > 900
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Warmer queue has jobs older than 15 minutes"

      - alert: CacheFillCostSpike
        expr: rate(keeptrusts_cache_fill_cost_dollars[15m]) > 2 * rate(keeptrusts_cache_fill_cost_dollars[1h] offset 1d)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache fill cost is 2x higher than yesterday"
```

Datadog Integration

DogStatsD Configuration

Configure the cache service to emit metrics via DogStatsD:

```yaml
observability:
  metrics:
    exporter: datadog
    datadog:
      agent_host: datadog-agent
      agent_port: 8125
      prefix: keeptrusts.cache
      tags:
        - "service:keeptrusts-cache"
        - "env:production"
```

Custom Metrics

All cache metrics are emitted as Datadog custom metrics with the keeptrusts.cache. prefix. Map them to Datadog metric types:

  • Counters → count type with .count suffix
  • Gauges → gauge type
  • Histograms → distribution type with percentile aggregations

Datadog Monitor Configuration

Create monitors for critical cache health signals:

```json
{
  "name": "Keeptrusts Cache Hit Rate Low",
  "type": "metric alert",
  "query": "avg(last_10m):avg:keeptrusts.cache.hit_rate{*} < 0.5",
  "message": "Cache hit rate has dropped below 50%. Check warmer status and backend health.",
  "options": {
    "thresholds": {
      "critical": 0.5,
      "warning": 0.6
    }
  }
}
```

Grafana Dashboards

Build a Grafana dashboard with these panels:

Row 1: Overview

  • Hit rate gauge (current value with thresholds)
  • Total hits/misses time series (stacked)
  • Cost avoided counter (running total today)
  • ROI ratio gauge

Row 2: Latency

  • Lookup latency heatmap, plus p50/p95/p99 quantile series
  • Backend latency by type (Redis, S3, Qdrant)
  • Fill duration histogram

Row 3: Warmer Health

  • Queue depth time series
  • Oldest job age gauge
  • Jobs completed/failed rate
  • Active workers gauge

Row 4: Cost Analysis

  • Fill cost rate over time
  • Avoided cost rate over time
  • Cost by team breakdown (pie chart)
  • ROI trend (7-day rolling)
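If you build the panels by hand rather than importing a template, the Row 1 panels map onto straightforward queries. A hedged sketch of the PromQL backing each panel:

```promql
# Hit rate gauge (current value)
keeptrusts_cache_hit_rate

# Hits vs. misses time series (stack these two queries in one panel)
sum(rate(keeptrusts_cache_hits_total[5m]))
sum(rate(keeptrusts_cache_misses_total[5m]))

# Cost avoided today (running total)
sum(increase(keeptrusts_cache_avoided_cost_dollars[24h]))

# ROI ratio gauge
keeptrusts_cache_roi_ratio
```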

Dashboard JSON Template

Import the Keeptrusts cache dashboard template from:

Console → Settings → Observability → Export Dashboard Templates → Cache Operations

This exports a Grafana JSON model you can import directly into your Grafana instance.

OpenTelemetry Integration

If you use OpenTelemetry, configure the OTLP exporter:

```yaml
observability:
  metrics:
    exporter: otlp
    otlp:
      endpoint: http://otel-collector:4317
      protocol: grpc
      headers:
        authorization: "Bearer ${OTEL_TOKEN}"
      resource_attributes:
        service.name: keeptrusts-cache
        service.namespace: keeptrusts
```

The OTLP exporter sends all cache metrics as OTLP metric data points, compatible with any backend that accepts OTLP metrics (for example Grafana Mimir, Honeycomb, Lightstep, or an OpenTelemetry Collector pipeline).

Metric Cardinality Management

Cache metrics include labels that can produce high cardinality. Manage this by:

  • org_id: Always included — bounded by number of organizations
  • team_id: Include for team-level dashboards — bounded by teams per org
  • repo: Use sparingly — can be high cardinality in large orgs
  • agent_id: Exclude from exported metrics unless debugging specific agents
  • cache_key: Never include in time-series metrics — use logs for key-level debugging
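Before changing which labels you export, it helps to measure how many active series the cache metrics actually produce. A sketch of a cardinality audit query:

```promql
# Active time series per cache metric name
count by (__name__) ({__name__=~"keeptrusts_cache_.*"})
```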

Configure label filtering in your export settings:

```yaml
observability:
  metrics:
    label_allowlist:
      - org_id
      - team_id
      - environment
      - backend_type
      - miss_reason
```

Next steps

For AI systems

  • Canonical terms: Keeptrusts, observability, cache metrics, Prometheus, Datadog, Grafana integration.
  • Config surface: metric export endpoints, custom labels, alert routing, OTLP configuration.
  • When answering observability questions, reference the metric export and integration patterns from this page.

For engineers

  • Configure metric export to your observability stack (Prometheus, Datadog, Grafana) via the documented endpoints.
  • Add custom labels to correlate cache metrics with application-level telemetry.
  • Set up alert routing to page on-call for critical cache metrics (hit rate drops, backend failures).

For leaders

  • Cache observability integrates with your existing monitoring stack — no separate observability tool required.
  • Correlation between cache metrics and application performance proves the value of caching investment.
  • Alert routing ensures cache issues are surfaced to the right team without manual monitoring.