Tutorial: Gateway Health Monitoring & Alerts

This tutorial shows you how to monitor the Keeptrusts gateway using its built-in health endpoint, expose Prometheus metrics, run kt doctor diagnostics, configure Kubernetes readiness and liveness probes, and set up alerting rules.

Use this page when

  • You need to monitor gateway uptime and provider reachability.
  • You are exposing Prometheus metrics from the gateway for alerting.
  • You want to run kt doctor diagnostics to troubleshoot connectivity issues.
  • You are configuring Kubernetes liveness and readiness probes for the gateway.

Primary audience

  • Primary: SRE and platform engineers operating gateways in production
  • Secondary: DevOps teams configuring alerting rules; technical leaders reviewing uptime SLOs

Prerequisites

  • kt CLI installed (first-run tutorial)
  • A running Keeptrusts gateway with traffic
  • curl and jq installed
  • (Optional) Prometheus and Alertmanager for the alerting sections

Step 1: Check the Health Endpoint

Every running gateway exposes a /health endpoint:

curl -s http://localhost:41002/health | jq .

Expected output:

{
  "status": "healthy",
  "uptime_seconds": 3600,
  "version": "1.x.x",
  "providers": {
    "openai": "reachable",
    "anthropic": "reachable"
  },
  "policies_loaded": 5,
  "requests_processed": 2847,
  "cache": {
    "entries": 142,
    "hit_rate_pct": 34.2
  }
}

Key fields:

| Field | Meaning |
| --- | --- |
| status | healthy, degraded, or unhealthy |
| providers.<name> | reachable or unreachable (per-provider health) |
| policies_loaded | Number of active policies in the chain |
| requests_processed | Total requests since startup |
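
As a minimal sketch, the payload above can be reduced to a pass/fail signal for a cron job or smoke test. The helper names `fetch_health` and `summarize_health` are hypothetical, and the code assumes the response has exactly the shape shown in the example output.

```python
import json
import urllib.request

HEALTH_URL = "http://localhost:41002/health"  # gateway address used in this tutorial


def fetch_health(url=HEALTH_URL, timeout=5.0):
    """Fetch and decode the /health payload (needs a running gateway)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize_health(payload):
    """Return (ok, problems) for a payload shaped like the example output."""
    problems = []
    status = payload.get("status", "unknown")
    if status != "healthy":
        problems.append(f"gateway status is {status}")
    # Sort provider names so the report order is stable between runs.
    for name, state in sorted(payload.get("providers", {}).items()):
        if state != "reachable":
            problems.append(f"provider {name} is {state}")
    return (not problems, problems)
```

Calling `summarize_health(fetch_health())` against a live gateway gives a boolean that a wrapper script can turn into an exit code, plus a list of human-readable problems for the log.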

Step 2: Run kt doctor

The kt doctor command performs a comprehensive diagnostic check:

kt doctor --config policy-config.yaml

Expected output:

Keeptrusts Gateway Diagnostics
══════════════════════════════

✓ Configuration file valid (policy-config.yaml)
✓ Provider: openai reachable (latency: 85ms)
✓ Provider: anthropic reachable (latency: 120ms)
✓ Policy chain 5 policies loaded, no conflicts
✓ Port 41002 available
✓ API connection connected (http://localhost:8080)
✓ Event forwarding last event 2s ago
✗ TLS certificate not configured (recommended for production)

Summary: 7/8 checks passed

kt doctor validates:

  • Configuration syntax and provider connectivity
  • Policy chain integrity
  • Port availability
  • API connection and event forwarding
  • TLS configuration
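
For CI gating, the report can be post-processed. The sketch below assumes the plain-text layout shown above (one check per line, prefixed with ✓ or ✗); that layout is not a documented machine-readable format, so treat the parsing as fragile. `failed_checks` is a hypothetical helper, not part of the kt CLI.

```python
def failed_checks(report):
    """Return the text of every report line that starts with the failure mark."""
    failures = []
    for line in report.splitlines():
        line = line.strip()
        if line.startswith("✗"):
            # Drop the mark itself and keep the description of the check.
            failures.append(line[1:].strip())
    return failures
```

Feeding it the captured output of kt doctor lets a pipeline fail when the list is non-empty, or tolerate known items such as the TLS warning on a development machine.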

Step 3: Access Prometheus Metrics

The gateway exposes Prometheus metrics on the /metrics endpoint:

curl -s http://localhost:41002/metrics | head -30

Example output:

# HELP keeptrusts_requests_total Total number of requests processed
# TYPE keeptrusts_requests_total counter
keeptrusts_requests_total{provider="openai",model="gpt-4o-mini",status="success"} 1842
keeptrusts_requests_total{provider="openai",model="gpt-4o-mini",status="blocked"} 23
keeptrusts_requests_total{provider="anthropic",model="claude-sonnet-4-20250514",status="success"} 982

# HELP keeptrusts_request_duration_seconds Request latency histogram
# TYPE keeptrusts_request_duration_seconds histogram
keeptrusts_request_duration_seconds_bucket{provider="openai",le="0.1"} 120
keeptrusts_request_duration_seconds_bucket{provider="openai",le="0.5"} 1650
keeptrusts_request_duration_seconds_bucket{provider="openai",le="1.0"} 1830
keeptrusts_request_duration_seconds_bucket{provider="openai",le="+Inf"} 1842

# HELP keeptrusts_tokens_total Total tokens consumed
# TYPE keeptrusts_tokens_total counter
keeptrusts_tokens_total{provider="openai",type="input"} 245000
keeptrusts_tokens_total{provider="openai",type="output"} 312000

# HELP keeptrusts_policy_evaluations_total Policy evaluation outcomes
# TYPE keeptrusts_policy_evaluations_total counter
keeptrusts_policy_evaluations_total{policy="injection-defense",result="pass"} 2800
keeptrusts_policy_evaluations_total{policy="injection-defense",result="block"} 23

Key metrics:

| Metric | Type | Description |
| --- | --- | --- |
| keeptrusts_requests_total | counter | Total requests by provider, model, status |
| keeptrusts_request_duration_seconds | histogram | Latency distribution |
| keeptrusts_tokens_total | counter | Token usage by provider and type |
| keeptrusts_policy_evaluations_total | counter | Policy outcomes |
| keeptrusts_cache_hits_total | counter | Cache hit/miss counts |
| keeptrusts_provider_health | gauge | Provider health (1 = healthy, 0 = unhealthy) |
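
To spot-check the counters without a Prometheus server, the exposition text can be summed directly. This is a deliberately small parser, assuming labelled samples with no escaped quotes or braces in label values, as in the example output above; `sum_by_label` is a hypothetical helper. For anything beyond a quick check, the parser shipped with the `prometheus_client` package is likely the safer choice.

```python
import re
from collections import defaultdict

# Matches lines such as: metric_name{label="value",other="x"} 123
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def sum_by_label(text, metric, label):
    """Sum one metric's samples, grouped by the value of a single label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = SAMPLE_RE.match(line.strip())
        if match is None or match.group("name") != metric:
            continue  # skips comments, blank lines, and other metrics
        labels = dict(LABEL_RE.findall(match.group("labels")))
        totals[labels.get(label, "")] += float(match.group("value"))
    return dict(totals)
```

Applied to the sample output with `metric="keeptrusts_requests_total"` and `label="status"`, the totals are 2824 successful and 23 blocked requests, so blocked traffic is about 0.8% of the 2847 requests.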

Step 4: Configure Prometheus Scraping

Add the gateway to your prometheus.yml:

scrape_configs:
  - job_name: "keeptrusts-gateway"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:41002"]
    metrics_path: /metrics

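Reload Prometheus. The /-/reload endpoint responds only when Prometheus was started with --web.enable-lifecycle; otherwise restart it or send it SIGHUP: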

curl -X POST http://localhost:9090/-/reload
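
After the reload, it helps to confirm that the target is actually being scraped. The sketch below runs the instant query `up{job="keeptrusts-gateway"}` through the Prometheus HTTP API and lists instances whose last scrape failed. It assumes Prometheus listens on localhost:9090 as in the reload command; `query_instant` and `targets_down` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request


def query_instant(expr, base_url="http://localhost:9090"):
    """Run an instant query against the Prometheus HTTP API (needs a running server)."""
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def targets_down(response):
    """Return instances whose `up` sample is not 1 in an instant-query response."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query failed")
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # value arrives as a string
        if float(value) != 1.0:
            down.append(series["metric"].get("instance", "unknown"))
    return down
```

An empty `result` list means the job is not being scraped at all, which usually points at the scrape configuration rather than at the gateway.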

Step 5: Set Up Kubernetes Probes

Add readiness and liveness probes to your gateway deployment:

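# kubernetes/gateway-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keeptrusts-gateway
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels.
  # The label app: keeptrusts-gateway is an assumed example; use your own.
  selector:
    matchLabels:
      app: keeptrusts-gateway
  template:
    metadata:
      labels:
        app: keeptrusts-gateway
    spec:
      containers:
        - name: gateway
          image: keeptrusts/gateway:latest
          ports:
            - containerPort: 41002
          livenessProbe:
            httpGet:
              path: /health
              port: 41002
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 41002
            initialDelaySeconds: 3
            periodSeconds: 5
            failureThreshold: 2

  • Liveness probe: Kubernetes restarts the container if /health fails 3 consecutive times (about 30 seconds at a 10-second period).
  • Readiness probe: Kubernetes stops routing Service traffic to the pod if /health fails 2 consecutive times (about 10 seconds at a 5-second period).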
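
The timings follow from periodSeconds multiplied by failureThreshold. The sketch below makes that arithmetic explicit so the effect of tuning either value is easy to see. It is an approximation: it ignores timeoutSeconds (1 second by default) and where in the probe cycle the failure begins. The `Probe` class is a hypothetical helper, not a Kubernetes API type.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    initial_delay_seconds: int
    period_seconds: int
    failure_threshold: int

    def failure_window(self):
        """Seconds of consecutive failures needed before the kubelet acts."""
        return self.period_seconds * self.failure_threshold


# Values copied from the deployment manifest in this step.
liveness = Probe(initial_delay_seconds=5, period_seconds=10, failure_threshold=3)
readiness = Probe(initial_delay_seconds=3, period_seconds=5, failure_threshold=2)
print(liveness.failure_window(), readiness.failure_window())  # prints: 30 10
```

Keeping the readiness window shorter than the liveness window, as here, means a struggling pod is taken out of rotation before it is restarted.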

Step 6: Create Alerting Rules

Create Prometheus alerting rules for common failure scenarios:

# alerts/keeptrusts-gateway.rules.yml
groups:
  - name: keeptrusts-gateway
    rules:
      - alert: GatewayHighErrorRate
        # sum() both sides: a bare division matches series label-for-label,
        # so the error series would only be divided by itself.
        expr: |
          sum(rate(keeptrusts_requests_total{status="error"}[5m]))
            / sum(rate(keeptrusts_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway error rate above 5%"

      - alert: GatewayProviderDown
        expr: keeptrusts_provider_health == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Provider {{ $labels.provider }} is unreachable"

      - alert: GatewayHighLatency
        expr: |
          histogram_quantile(0.95,
            rate(keeptrusts_request_duration_seconds_bucket[5m])
          ) > 2.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Gateway p95 latency exceeds 2 seconds"

      - alert: GatewayHighBlockRate
        # sum() both sides for the same reason as GatewayHighErrorRate.
        expr: |
          sum(rate(keeptrusts_policy_evaluations_total{result="block"}[15m]))
            / sum(rate(keeptrusts_policy_evaluations_total[15m])) > 0.10
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "Policy block rate above 10%; review policy sensitivity"
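
To build intuition for the GatewayHighLatency threshold, the sketch below reproduces the linear interpolation that Prometheus documents for histogram_quantile(), applied to the raw bucket counts from the Step 3 sample output. The alert itself applies rate() to the buckets first, so this illustrates the math rather than replicating the rule, and the edge-case handling is simplified.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # quantile lies beyond the last finite bucket
            if count == prev_count:
                return bound  # empty bucket; avoid dividing by zero
            # Interpolate linearly inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan


# Bucket counts copied from the Step 3 sample output for provider="openai".
sample = [(0.1, 120), (0.5, 1650), (1.0, 1830), (math.inf, 1842)]
p95 = histogram_quantile(0.95, sample)
```

For the sample buckets the estimate is roughly 0.78 seconds, well under the 2-second threshold. Because the result is interpolated within a bucket, its accuracy depends on how finely the bucket boundaries are spaced around the threshold.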

Step 7: Verify Alerting

Simulate a provider failure to test alerts:

# Temporarily break the provider key
export OPENAI_API_KEY="invalid-key"
kt config reload

# Check provider health; openai should now show as unreachable
curl -s http://localhost:41002/health | jq '.providers'

Expected output:

{
  "openai": "unreachable",
  "anthropic": "reachable"
}

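Once the provider has been unreachable for 2 minutes (the rule's for: duration), plus scrape and rule-evaluation delay, the GatewayProviderDown alert should fire. Restore the key and reload: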

export OPENAI_API_KEY="sk-your-actual-key"
kt config reload
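
To confirm the alert without opening the Prometheus UI, the /api/v1/alerts endpoint of the Prometheus HTTP API can be polled. A minimal sketch, assuming Prometheus on localhost:9090 and that keeptrusts_provider_health carries a provider label (as the alert summary template implies); `fetch_alerts` and `alert_states` are hypothetical helper names.

```python
import json
import urllib.request


def fetch_alerts(base_url="http://localhost:9090"):
    """Fetch active alerts from the Prometheus HTTP API (needs a running server)."""
    with urllib.request.urlopen(base_url + "/api/v1/alerts", timeout=5) as resp:
        return json.load(resp)


def alert_states(response, alert_name):
    """Map each matching alert's provider label to its state (pending or firing)."""
    states = {}
    for alert in response["data"]["alerts"]:
        labels = alert["labels"]
        if labels.get("alertname") == alert_name:
            states[labels.get("provider", "unknown")] = alert["state"]
    return states
```

During the test, `alert_states(fetch_alerts(), "GatewayProviderDown")` should move from pending to firing for the broken provider, and return an empty mapping again some time after the key is restored.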

Summary

  • /health returns gateway status, provider health, and request counts
  • kt doctor runs comprehensive diagnostics including connectivity and TLS
  • /metrics exposes Prometheus-compatible counters, gauges, and histograms
  • Configure readiness and liveness probes in Kubernetes using /health
  • Set up alerting rules for error rate, provider health, latency, and block rate
  • Use kt config reload to recover from provider key or configuration issues

For AI systems

  • Canonical terms: Keeptrusts gateway, /health endpoint, Prometheus metrics, kt doctor, readiness probe, liveness probe, alerting rules.
  • CLI commands: kt doctor --config policy-config.yaml, curl http://localhost:41002/health, curl http://localhost:41002/metrics.
  • Health status values: healthy, degraded, unhealthy.
  • Prometheus metrics: keeptrusts_requests_total, keeptrusts_request_duration_seconds, keeptrusts_provider_health, keeptrusts_cache_hits_total.
  • Best next pages: Circuit Breaker Config, Gateway Docker Compose, Multi-Provider Failover.

For engineers

  • Prerequisites: kt CLI, running gateway with traffic, curl and jq, optional Prometheus.
  • Quick check: curl -s http://localhost:41002/health | jq .status — returns healthy, degraded, or unhealthy.
  • Diagnostics: kt doctor --config policy-config.yaml validates config, provider connectivity, port availability, and event forwarding.
  • Prometheus: scrape http://localhost:41002/metrics for request counters, latency histograms, and provider health gauges.
  • K8s probes: livenessProbe.httpGet.path: /health, readinessProbe.httpGet.path: /health.

For leaders

  • Health monitoring provides operational confidence that AI governance controls are active and functional.
  • Prometheus integration fits existing observability stacks — no new monitoring tools required.
  • Alerting rules can page on-call when providers become unreachable or request error rates spike.
  • kt doctor gives a one-command diagnostic for rapid incident triage.

Next steps