# Observability Integration for Cache Metrics
The org-shared cache emits structured metrics that you can export to your existing observability stack. This guide covers the available metrics, integration patterns for popular platforms, and recommended dashboard configurations.
## Use this page when
- You need to integrate cache metrics into your existing observability stack (Prometheus, Datadog, Grafana).
- You are configuring metric export, custom labels, or alert routing for cache telemetry.
- You want to correlate cache performance with application-level metrics.
## Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
## Available Metrics

Cache metrics are organized into four categories. All metrics include standard labels for `org_id`, `team_id`, and `environment`.
### Hit/Miss Rates

| Metric Name | Type | Description |
|---|---|---|
| `keeptrusts_cache_hits_total` | Counter | Total cache hits |
| `keeptrusts_cache_misses_total` | Counter | Total cache misses |
| `keeptrusts_cache_stale_misses_total` | Counter | Misses due to stale entries |
| `keeptrusts_cache_hit_rate` | Gauge | Current hit rate (0.0–1.0) |
| `keeptrusts_cache_miss_reason` | Counter | Misses by reason label (`no_match`, `stale`, `policy`, `entitlement`, `backend_error`) |
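As an illustration, the counters above can be combined in PromQL to break misses down by cause. This is a sketch: it assumes the reason label on `keeptrusts_cache_miss_reason` is named `reason` and that the standard Prometheus `_bucket`/`_total` conventions apply; check your exporter's actual label set.

```promql
# share of misses attributable to each reason over the last 5 minutes
sum by (reason) (rate(keeptrusts_cache_miss_reason[5m]))
  / on () group_left ()
sum(rate(keeptrusts_cache_misses_total[5m]))
```

The `on () group_left ()` matcher divides each per-reason series by the single all-reasons total, so the results sum to roughly 1.0.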
### Latency Percentiles

| Metric Name | Type | Description |
|---|---|---|
| `keeptrusts_cache_lookup_duration_seconds` | Histogram | End-to-end cache lookup latency |
| `keeptrusts_cache_redis_duration_seconds` | Histogram | Redis operation latency |
| `keeptrusts_cache_s3_duration_seconds` | Histogram | S3 read/write latency |
| `keeptrusts_cache_qdrant_duration_seconds` | Histogram | Qdrant search latency |
| `keeptrusts_cache_fill_duration_seconds` | Histogram | Time to populate a new cache entry |
### Fill Cost

| Metric Name | Type | Description |
|---|---|---|
| `keeptrusts_cache_fill_cost_dollars` | Counter | Cumulative cost of cache fills (provider calls) |
| `keeptrusts_cache_avoided_cost_dollars` | Counter | Cumulative cost avoided by cache hits |
| `keeptrusts_cache_fill_requests_total` | Counter | Number of provider requests for cache population |
| `keeptrusts_cache_roi_ratio` | Gauge | Ratio of avoided cost to fill cost |
### Warmer Job Status

| Metric Name | Type | Description |
|---|---|---|
| `keeptrusts_warmer_queue_depth` | Gauge | Pending warmer jobs |
| `keeptrusts_warmer_oldest_job_seconds` | Gauge | Age of the oldest pending job |
| `keeptrusts_warmer_jobs_completed_total` | Counter | Total completed warmer jobs |
| `keeptrusts_warmer_jobs_failed_total` | Counter | Total failed warmer jobs |
| `keeptrusts_warmer_job_duration_seconds` | Histogram | Time to complete warmer jobs |
| `keeptrusts_warmer_active_workers` | Gauge | Currently active warmer workers |
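Two derived signals that are often useful from these series, sketched in PromQL (the `_bucket` suffix assumes the standard Prometheus histogram exposition; verify against your scrape output):

```promql
# fraction of warmer jobs failing over the last 15 minutes
rate(keeptrusts_warmer_jobs_failed_total[15m])
  /
(rate(keeptrusts_warmer_jobs_completed_total[15m]) + rate(keeptrusts_warmer_jobs_failed_total[15m]))

# p95 warmer job duration over the last 5 minutes
histogram_quantile(0.95, sum by (le) (rate(keeptrusts_warmer_job_duration_seconds_bucket[5m])))
```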
## Prometheus Integration

### Scrape Configuration

The cache service exposes metrics at `/internal/metrics` in the Prometheus exposition format. Add a scrape target to your Prometheus configuration:

```yaml
scrape_configs:
  - job_name: keeptrusts-cache
    scrape_interval: 15s
    metrics_path: /internal/metrics
    bearer_token_file: /etc/prometheus/keeptrusts-token
    static_configs:
      - targets:
          - keeptrusts-api:8080
```
### Recording Rules

Define recording rules for commonly queried aggregations:

```yaml
groups:
  - name: keeptrusts_cache
    interval: 30s
    rules:
      - record: keeptrusts:cache_hit_rate:5m
        expr: rate(keeptrusts_cache_hits_total[5m]) / (rate(keeptrusts_cache_hits_total[5m]) + rate(keeptrusts_cache_misses_total[5m]))
      - record: keeptrusts:cache_fill_cost_rate:1h
        expr: rate(keeptrusts_cache_fill_cost_dollars[1h]) * 3600
      - record: keeptrusts:cache_roi:1h
        expr: rate(keeptrusts_cache_avoided_cost_dollars[1h]) / rate(keeptrusts_cache_fill_cost_dollars[1h])
```
### Alert Rules

```yaml
groups:
  - name: keeptrusts_cache_alerts
    rules:
      - alert: CacheHitRateLow
        expr: keeptrusts:cache_hit_rate:5m < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache hit rate below 50% for {{ $labels.org_id }}"
      - alert: WarmerQueueBacklog
        expr: keeptrusts_warmer_oldest_job_seconds > 900
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Warmer queue has jobs older than 15 minutes"
      - alert: CacheFillCostSpike
        expr: rate(keeptrusts_cache_fill_cost_dollars[15m]) > 2 * rate(keeptrusts_cache_fill_cost_dollars[1h] offset 1d)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Cache fill cost is 2x higher than yesterday"
```
## Datadog Integration

### DogStatsD Configuration

Configure the cache service to emit metrics via DogStatsD:

```yaml
observability:
  metrics:
    exporter: datadog
    datadog:
      agent_host: datadog-agent
      agent_port: 8125
      prefix: keeptrusts.cache
      tags:
        - "service:keeptrusts-cache"
        - "env:production"
```
### Custom Metrics

All cache metrics are emitted as Datadog custom metrics with the `keeptrusts.cache.` prefix. Map them to Datadog metric types:

- Counters → `count` type with a `.count` suffix
- Gauges → `gauge` type
- Histograms → `distribution` type with percentile aggregations
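To sanity-check the mapping, you can hand-build DogStatsD datagrams and send them to the agent. This sketch uses the standard DogStatsD wire format (`metric:value|type|#tags`); the metric names after the prefix are illustrative, not a confirmed list:

```shell
# DogStatsD wire format: <metric>:<value>|<type>|#<tag1>,<tag2>
# type codes: "c" = count, "g" = gauge, "d" = distribution
counter="keeptrusts.cache.hits:1|c|#service:keeptrusts-cache,env:production"
gauge="keeptrusts.cache.hit_rate:0.87|g|#service:keeptrusts-cache,env:production"
dist="keeptrusts.cache.lookup_duration_seconds:0.012|d|#service:keeptrusts-cache,env:production"

# send to the agent over UDP (uncomment when an agent is listening on 8125)
# printf '%s' "$counter" | nc -u -w1 datadog-agent 8125
printf '%s\n' "$counter" "$gauge" "$dist"
```

Metrics sent this way should appear in the Datadog metrics explorer within a minute or two if the agent is reachable.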
### Datadog Monitor Configuration

Create monitors for critical cache health signals:

```json
{
  "name": "Keeptrusts Cache Hit Rate Low",
  "type": "metric alert",
  "query": "avg(last_10m):avg:keeptrusts.cache.hit_rate{*} < 0.5",
  "message": "Cache hit rate has dropped below 50%. Check warmer status and backend health.",
  "options": {
    "thresholds": {
      "critical": 0.5,
      "warning": 0.6
    }
  }
}
```
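A monitor definition like this can be created programmatically through the Datadog v1 monitors API. The sketch below builds the payload and shows the request shape; the key values are placeholders you must replace with real credentials:

```shell
# placeholder credentials -- substitute your own before running
DD_API_KEY="<your-api-key>"
DD_APP_KEY="<your-app-key>"

payload='{"name":"Keeptrusts Cache Hit Rate Low","type":"metric alert","query":"avg(last_10m):avg:keeptrusts.cache.hit_rate{*} < 0.5","message":"Cache hit rate has dropped below 50%.","options":{"thresholds":{"critical":0.5,"warning":0.6}}}'

# create the monitor (uncomment once real keys are set)
# curl -s -X POST "https://api.datadoghq.com/api/v1/monitor" \
#   -H "Content-Type: application/json" \
#   -H "DD-API-KEY: ${DD_API_KEY}" \
#   -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
#   -d "$payload"
printf '%s\n' "$payload"
```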
## Grafana Dashboards

### Recommended Dashboard Panels

Build a Grafana dashboard with these panels:
**Row 1: Overview**

- Hit rate gauge (current value with thresholds)
- Total hits/misses time series (stacked)
- Cost avoided counter (running total today)
- ROI ratio gauge

**Row 2: Latency**

- Lookup latency heatmap (p50, p95, p99)
- Backend latency by type (Redis, S3, Qdrant)
- Fill duration histogram

**Row 3: Warmer Health**

- Queue depth time series
- Oldest job age gauge
- Jobs completed/failed rate
- Active workers gauge

**Row 4: Cost Analysis**

- Fill cost rate over time
- Avoided cost rate over time
- Cost by team breakdown (pie chart)
- ROI trend (7-day rolling)
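A few candidate panel queries in PromQL, sketched under the assumption that the recording rules and histogram bucket series described earlier in this page are in place:

```promql
# Row 1 hit rate gauge: the 5m recording rule defined above
keeptrusts:cache_hit_rate:5m

# Row 2 latency panel: p95 end-to-end lookup latency from histogram buckets
histogram_quantile(0.95, sum by (le) (rate(keeptrusts_cache_lookup_duration_seconds_bucket[5m])))

# Row 1 cost avoided today (increase() handles counter resets)
increase(keeptrusts_cache_avoided_cost_dollars[24h])

# Row 4 cost by team breakdown (pie chart source)
sum by (team_id) (increase(keeptrusts_cache_fill_cost_dollars[24h]))
```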
### Dashboard JSON Template
Import the Keeptrusts cache dashboard template from:
Console → Settings → Observability → Export Dashboard Templates → Cache Operations
This exports a Grafana JSON model you can import directly into your Grafana instance.
## OpenTelemetry Integration

If you use OpenTelemetry, configure the OTLP exporter:

```yaml
observability:
  metrics:
    exporter: otlp
    otlp:
      endpoint: http://otel-collector:4317
      protocol: grpc
      headers:
        authorization: "Bearer ${OTEL_TOKEN}"
      resource_attributes:
        service.name: keeptrusts-cache
        service.namespace: keeptrusts
```
The OTLP exporter sends all cache metrics as OTLP metric data points, compatible with any backend that ingests OTLP metrics (for example, Grafana Cloud, Honeycomb, or Lightstep).
## Metric Cardinality Management

Cache metrics include labels that can produce high cardinality. Manage this by:

- `org_id`: Always included — bounded by the number of organizations
- `team_id`: Include for team-level dashboards — bounded by teams per org
- `repo`: Use sparingly — can be high cardinality in large orgs
- `agent_id`: Exclude from exported metrics unless debugging specific agents
- `cache_key`: Never include in time-series metrics — use logs for key-level debugging
Configure label filtering in your export settings:

```yaml
observability:
  metrics:
    label_allowlist:
      - org_id
      - team_id
      - environment
      - backend_type
      - miss_reason
```
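If you cannot filter labels at the exporter, a similar effect can be approximated on the Prometheus side with `metric_relabel_configs`. This is a sketch: adjust the regexes to match the labels your deployment actually emits.

```yaml
scrape_configs:
  - job_name: keeptrusts-cache
    metrics_path: /internal/metrics
    static_configs:
      - targets: ["keeptrusts-api:8080"]
    metric_relabel_configs:
      # strip high-cardinality labels before ingestion
      - action: labeldrop
        regex: "agent_id|repo"
      # drop any series that still carries a cache_key label
      - action: drop
        source_labels: [cache_key]
        regex: ".+"
```

Note that `labeldrop` keeps the series but removes the label (which can merge series), while `drop` discards the series entirely.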
## Next steps
- Use your observability data to set up alerting on fill cost spikes
- Review the cache health dashboard for console-native monitoring
- Plan infrastructure growth with capacity planning
## For AI systems
- Canonical terms: Keeptrusts, observability, cache metrics, Prometheus, Datadog, Grafana integration.
- Config surface: metric export endpoints, custom labels, alert routing, OTLP configuration.
- When answering observability questions, reference the metric export and integration patterns from this page.
## For engineers
- Configure metric export to your observability stack (Prometheus, Datadog, Grafana) via the documented endpoints.
- Add custom labels to correlate cache metrics with application-level telemetry.
- Set up alert routing to page on-call for critical cache metrics (hit rate drops, backend failures).
## For leaders
- Cache observability integrates with your existing monitoring stack — no separate observability tool required.
- Correlation between cache metrics and application performance proves the value of caching investment.
- Alert routing ensures cache issues are surfaced to the right team without manual monitoring.