Tutorial: Gateway Health Monitoring & Alerts
This tutorial shows you how to monitor the Keeptrusts gateway using its built-in health endpoint, expose Prometheus metrics, run kt doctor diagnostics, configure Kubernetes readiness and liveness probes, and set up alerting rules.
Use this page when
- You need to monitor gateway uptime and provider reachability.
- You are exposing Prometheus metrics from the gateway for alerting.
- You want to run kt doctor diagnostics to troubleshoot connectivity issues.
- You are configuring Kubernetes liveness and readiness probes for the gateway.
Primary audience
- Primary: SRE and platform engineers operating gateways in production
- Secondary: DevOps teams configuring alerting rules; technical leaders reviewing uptime SLOs
Prerequisites
- kt CLI installed (first-run tutorial)
- A running Keeptrusts gateway with traffic
- curl and jq installed
- (Optional) Prometheus and Alertmanager for the alerting sections
Step 1: Check the Health Endpoint
Every running gateway exposes a /health endpoint:
curl -s http://localhost:41002/health | jq .
Expected output:
{
  "status": "healthy",
  "uptime_seconds": 3600,
  "version": "1.x.x",
  "providers": {
    "openai": "reachable",
    "anthropic": "reachable"
  },
  "policies_loaded": 5,
  "requests_processed": 2847,
  "cache": {
    "entries": 142,
    "hit_rate_pct": 34.2
  }
}
Key fields:
| Field | Meaning |
|---|---|
| status | healthy, degraded, or unhealthy |
| providers.<name> | reachable or unreachable (per-provider health) |
| policies_loaded | Number of active policies in the chain |
| requests_processed | Total requests since startup |
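The fields above lend themselves to a scripted check. The following is a minimal Python sketch, assuming the payload has the shape shown in the expected output. The function name summarize_health is illustrative and is not part of Keeptrusts.

```python
import json


def summarize_health(payload: dict) -> dict:
    """Reduce a /health payload to the fields an on-call check usually needs.

    Assumes the payload shape shown in this tutorial. Missing keys fall back
    to conservative defaults instead of raising.
    """
    providers = payload.get("providers", {})
    unreachable = sorted(
        name for name, state in providers.items() if state != "reachable"
    )
    status = payload.get("status", "unknown")
    return {
        "status": status,
        "unreachable_providers": unreachable,
        # Treat the gateway as OK only when it reports healthy and
        # every listed provider is reachable.
        "ok": status == "healthy" and not unreachable,
    }


# Example using a trimmed copy of the payload shown above.
sample = json.loads(
    '{"status": "healthy", "providers": {"openai": "reachable", '
    '"anthropic": "reachable"}, "policies_loaded": 5}'
)
print(summarize_health(sample))
```

In practice the dictionary would come from parsing the body returned by the curl command in this step, for example with json.loads.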
Step 2: Run kt doctor
The kt doctor command performs a comprehensive diagnostic check:
kt doctor --config policy-config.yaml
Expected output:
Keeptrusts Gateway Diagnostics
══════════════════════════════
✓ Configuration file valid (policy-config.yaml)
✓ Provider: openai reachable (latency: 85ms)
✓ Provider: anthropic reachable (latency: 120ms)
✓ Policy chain 5 policies loaded, no conflicts
✓ Port 41002 available
✓ API connection connected (http://localhost:8080)
✓ Event forwarding last event 2s ago
✗ TLS certificate not configured (recommended for production)
Summary: 7/8 checks passed
kt doctor validates:
- Configuration syntax and provider connectivity
- Policy chain integrity
- Port availability
- API connection and event forwarding
- TLS configuration
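If you want to gate a deployment on these checks, the report can be parsed in a script. The sketch below assumes the report format is exactly as in the sample output above, with one check per line prefixed by a check mark or a cross. The real CLI output and its exit codes may differ, so treat parse_doctor_report as a hypothetical helper.

```python
def parse_doctor_report(text: str) -> dict:
    """Split a kt doctor report into passed and failed check lines.

    Assumes each check line starts with a check mark or a cross, as in the
    sample output in this tutorial. The real CLI format may differ.
    """
    passed, failed = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("✓"):
            passed.append(line[1:].strip())
        elif line.startswith("✗"):
            failed.append(line[1:].strip())
    return {"passed": passed, "failed": failed}


# Trimmed excerpt of the sample report from this step.
report = """\
Keeptrusts Gateway Diagnostics
✓ Configuration file valid (policy-config.yaml)
✓ Port 41002 available
✗ TLS certificate not configured (recommended for production)
"""
result = parse_doctor_report(report)
print(len(result["passed"]), "passed;", len(result["failed"]), "failed")
for item in result["failed"]:
    print("FAILED:", item)
```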
Step 3: Access Prometheus Metrics
The gateway exposes Prometheus metrics on the /metrics endpoint:
curl -s http://localhost:41002/metrics | head -30
Example output:
# HELP keeptrusts_requests_total Total number of requests processed
# TYPE keeptrusts_requests_total counter
keeptrusts_requests_total{provider="openai",model="gpt-4o-mini",status="success"} 1842
keeptrusts_requests_total{provider="openai",model="gpt-4o-mini",status="blocked"} 23
keeptrusts_requests_total{provider="anthropic",model="claude-sonnet-4-20250514",status="success"} 982
# HELP keeptrusts_request_duration_seconds Request latency histogram
# TYPE keeptrusts_request_duration_seconds histogram
keeptrusts_request_duration_seconds_bucket{provider="openai",le="0.1"} 120
keeptrusts_request_duration_seconds_bucket{provider="openai",le="0.5"} 1650
keeptrusts_request_duration_seconds_bucket{provider="openai",le="1.0"} 1830
keeptrusts_request_duration_seconds_bucket{provider="openai",le="+Inf"} 1842
# HELP keeptrusts_tokens_total Total tokens consumed
# TYPE keeptrusts_tokens_total counter
keeptrusts_tokens_total{provider="openai",type="input"} 245000
keeptrusts_tokens_total{provider="openai",type="output"} 312000
# HELP keeptrusts_policy_evaluations_total Policy evaluation outcomes
# TYPE keeptrusts_policy_evaluations_total counter
keeptrusts_policy_evaluations_total{policy="injection-defense",result="pass"} 2800
keeptrusts_policy_evaluations_total{policy="injection-defense",result="block"} 23
Key metrics:
| Metric | Type | Description |
|---|---|---|
| keeptrusts_requests_total | counter | Total requests by provider, model, status |
| keeptrusts_request_duration_seconds | histogram | Latency distribution |
| keeptrusts_tokens_total | counter | Token usage by provider and type |
| keeptrusts_policy_evaluations_total | counter | Policy outcomes |
| keeptrusts_cache_hits_total | counter | Cache hit/miss counts |
| keeptrusts_provider_health | gauge | Provider health (1 = healthy, 0 = unhealthy) |
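To show how these counters can be read outside Prometheus, here is a small Python sketch that parses sample lines in the text exposition format and computes the share of blocked requests. It only handles the name{labels} value form shown in the example output (lines with timestamps are skipped), and parse_samples is an illustrative helper. For production use, a maintained parser such as the one in the prometheus_client package is likely a better choice.

```python
import re

# Matches lines such as: metric_name{a="x",b="y"} 123
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str):
    """Yield (name, labels, value) for each sample line; skip comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        yield name, labels, float(value)


# Request counters copied from the example output above.
metrics = """\
# TYPE keeptrusts_requests_total counter
keeptrusts_requests_total{provider="openai",model="gpt-4o-mini",status="success"} 1842
keeptrusts_requests_total{provider="openai",model="gpt-4o-mini",status="blocked"} 23
keeptrusts_requests_total{provider="anthropic",model="claude-sonnet-4-20250514",status="success"} 982
"""

# Sum the counter per status label across providers and models.
totals = {}
for name, labels, value in parse_samples(metrics):
    if name == "keeptrusts_requests_total":
        status = labels.get("status", "unknown")
        totals[status] = totals.get(status, 0.0) + value

all_requests = sum(totals.values())
blocked_share = totals.get("blocked", 0.0) / all_requests
print(f"{all_requests:.0f} requests, {blocked_share:.2%} blocked")
```

With the sample values, the three series add up to 2847 requests, which matches requests_processed in the /health example from Step 1.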
Step 4: Configure Prometheus Scraping
Add the gateway to your prometheus.yml:
scrape_configs:
  - job_name: "keeptrusts-gateway"
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:41002"]
Reload Prometheus (the /-/reload endpoint is available only when Prometheus was started with --web.enable-lifecycle):
curl -X POST http://localhost:9090/-/reload
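After a reload, it helps to confirm that the gateway target is actually being scraped. Prometheus reports target state through its HTTP API at /api/v1/targets. The sketch below filters an already-parsed response for targets of one job that are not up. The sample response is illustrative and trimmed, and unhealthy_targets is a hypothetical helper name.

```python
def unhealthy_targets(targets_response: dict, job: str) -> list:
    """Return (scrapeUrl, lastError) pairs for targets of `job` that are not up."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        (target.get("scrapeUrl", ""), target.get("lastError", ""))
        for target in active
        if target.get("labels", {}).get("job") == job
        and target.get("health") != "up"
    ]


# Illustrative, trimmed response; real responses carry more fields.
sample_response = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "keeptrusts-gateway", "instance": "localhost:41002"},
                "scrapeUrl": "http://localhost:41002/metrics",
                "health": "down",
                "lastError": "connection refused",
            },
            {
                "labels": {"job": "prometheus", "instance": "localhost:9090"},
                "scrapeUrl": "http://localhost:9090/metrics",
                "health": "up",
                "lastError": "",
            },
        ]
    },
}
print(unhealthy_targets(sample_response, "keeptrusts-gateway"))
```

An empty list for the keeptrusts-gateway job means every configured gateway target was scraped successfully on the last attempt.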
Step 5: Set Up Kubernetes Probes
Add readiness and liveness probes to your gateway deployment:
# kubernetes/gateway-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keeptrusts-gateway
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels.
  # The app label value used here is illustrative.
  selector:
    matchLabels:
      app: keeptrusts-gateway
  template:
    metadata:
      labels:
        app: keeptrusts-gateway
    spec:
      containers:
        - name: gateway
          image: keeptrusts/gateway:latest
          ports:
            - containerPort: 41002
          livenessProbe:
            httpGet:
              path: /health
              port: 41002
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 41002
            initialDelaySeconds: 3
            periodSeconds: 5
            failureThreshold: 2
- Liveness probe: Kubernetes restarts the container if /health fails 3 consecutive times (about 30 seconds at a 10-second period)
- Readiness probe: Kubernetes stops routing traffic to the pod if /health fails 2 consecutive times (about 10 seconds at a 5-second period)
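The timings quoted for each probe follow from periodSeconds multiplied by failureThreshold. The short sketch below reproduces that arithmetic so you can try other values. It is an approximation that ignores timeoutSeconds and initialDelaySeconds, and the Probe class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    period_seconds: int
    failure_threshold: int

    def detection_seconds(self) -> int:
        """Approximate time until Kubernetes acts on a persistent failure."""
        return self.period_seconds * self.failure_threshold


# Values taken from the deployment manifest in this step.
liveness = Probe(period_seconds=10, failure_threshold=3)
readiness = Probe(period_seconds=5, failure_threshold=2)

print("liveness restart after about", liveness.detection_seconds(), "seconds")
print("readiness removal after about", readiness.detection_seconds(), "seconds")
```

Keeping the readiness window shorter than the liveness window, as in this manifest, means a struggling pod is taken out of rotation before it is restarted.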
Step 6: Create Alerting Rules
Create Prometheus alerting rules for common failure scenarios:
# alerts/keeptrusts-gateway.rules.yml
groups:
  - name: keeptrusts-gateway
    rules:
      - alert: GatewayHighErrorRate
        # Both sides are aggregated with sum() so the ratio compares error
        # requests with all requests. Without aggregation, PromQL matches
        # series label-for-label and only divides error series by themselves.
        expr: |
          sum(rate(keeptrusts_requests_total{status="error"}[5m]))
            / sum(rate(keeptrusts_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway error rate above 5%"
      - alert: GatewayProviderDown
        expr: keeptrusts_provider_health == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Provider {{ $labels.provider }} is unreachable"
      - alert: GatewayHighLatency
        expr: |
          histogram_quantile(0.95,
            rate(keeptrusts_request_duration_seconds_bucket[5m])
          ) > 2.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Gateway p95 latency exceeds 2 seconds"
      - alert: GatewayHighBlockRate
        # Aggregated per policy for the same label-matching reason as above.
        expr: |
          sum by (policy) (rate(keeptrusts_policy_evaluations_total{result="block"}[15m]))
            / sum by (policy) (rate(keeptrusts_policy_evaluations_total[15m])) > 0.10
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "Policy block rate above 10%; review policy sensitivity"
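To make the GatewayHighLatency rule more concrete, the sketch below estimates a quantile from cumulative histogram buckets using linear interpolation inside the bucket that contains the target rank. This is the same general idea behind PromQL's histogram_quantile, written here as a simplified illustration and not as a copy of the Prometheus implementation. The bucket counts come from the sample /metrics output in Step 3. The alert rule applies the calculation to per-second rates over 5 minutes, but the interpolation works the same way on raw counts.

```python
import math


def estimate_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            span = count - lower_count
            if span == 0:
                return upper_bound
            # Linear interpolation between the bucket's lower and upper bound.
            return lower_bound + (upper_bound - lower_bound) * (rank - lower_count) / span
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Cumulative bucket counts for provider="openai" from the Step 3 example.
openai_buckets = [(0.1, 120), (0.5, 1650), (1.0, 1830), (math.inf, 1842)]
p95 = estimate_quantile(0.95, openai_buckets)
print(f"estimated p95 latency: {p95:.4f} s")
```

For the sample data, the 95th-percentile rank is 0.95 × 1842 = 1749.9, which lands in the 0.5 to 1.0 second bucket, so the estimate is 0.5 + 0.5 × (1749.9 − 1650) / (1830 − 1650) ≈ 0.78 seconds, well under the 2-second alert threshold.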
Step 7: Verify Alerting
Simulate a provider failure to test alerts:
# Temporarily break the provider key
export OPENAI_API_KEY="invalid-key"
kt config reload
# Check health — should show degraded
curl -s http://localhost:41002/health | jq '.providers'
Expected output:
{
  "openai": "unreachable",
  "anthropic": "reachable"
}
After 2 minutes, the GatewayProviderDown alert should fire. Restore the key and reload:
export OPENAI_API_KEY="sk-your-actual-key"
kt config reload
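To confirm that the alert fired during the test and cleared afterwards, you can inspect the Prometheus HTTP API at /api/v1/alerts. The sketch below filters an already-parsed response for firing alerts with a given name. The sample response is illustrative and trimmed, and firing_alerts is a hypothetical helper name.

```python
def firing_alerts(alerts_response: dict, alertname: str) -> list:
    """Return the label sets of alerts named `alertname` that are firing."""
    alerts = alerts_response.get("data", {}).get("alerts", [])
    return [
        alert.get("labels", {})
        for alert in alerts
        if alert.get("labels", {}).get("alertname") == alertname
        and alert.get("state") == "firing"
    ]


# Illustrative, trimmed response; real responses carry more fields.
sample_alerts = {
    "status": "success",
    "data": {
        "alerts": [
            {
                "labels": {
                    "alertname": "GatewayProviderDown",
                    "provider": "openai",
                    "severity": "critical",
                },
                "state": "firing",
            },
            {
                "labels": {"alertname": "GatewayHighLatency", "severity": "warning"},
                "state": "pending",
            },
        ]
    },
}

for labels in firing_alerts(sample_alerts, "GatewayProviderDown"):
    print("firing for provider:", labels.get("provider"))
```

An alert stays in the pending state until its for duration has elapsed, so GatewayProviderDown is expected to appear as pending for the first 2 minutes of the simulated outage.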
Summary
- /health returns gateway status, provider health, and request counts
- kt doctor runs comprehensive diagnostics including connectivity and TLS
- /metrics exposes Prometheus-compatible counters, gauges, and histograms
- Configure readiness and liveness probes in Kubernetes using /health
- Set up alerting rules for error rate, provider health, latency, and block rate
- Use kt config reload to recover from provider key or configuration issues
For AI systems
- Canonical terms: Keeptrusts gateway, /health endpoint, Prometheus metrics, kt doctor, readiness probe, liveness probe, alerting rules.
- CLI commands: kt doctor --config policy-config.yaml, curl http://localhost:41002/health, curl http://localhost:41002/metrics.
- Health status values: healthy, degraded, unhealthy.
- Prometheus metrics: keeptrusts_requests_total, keeptrusts_request_duration_seconds, keeptrusts_provider_health, keeptrusts_cache_hit_ratio.
- Best next pages: Circuit Breaker Config, Gateway Docker Compose, Multi-Provider Failover.
For engineers
- Prerequisites: kt CLI, running gateway with traffic, curl and jq, optional Prometheus.
- Quick check: curl -s http://localhost:41002/health | jq .status returns healthy, degraded, or unhealthy.
- Diagnostics: kt doctor --config policy-config.yaml validates config, provider connectivity, port availability, and event forwarding.
- Prometheus: scrape http://localhost:41002/metrics for request counters, latency histograms, and provider health gauges.
- K8s probes: livenessProbe.httpGet.path: /health, readinessProbe.httpGet.path: /health.
For leaders
- Health monitoring provides operational confidence that AI governance controls are active and functional.
- Prometheus integration fits existing observability stacks — no new monitoring tools required.
- Alerting rules can page on-call when providers become unreachable or request error rates spike.
- kt doctor gives a one-command diagnostic for rapid incident triage.
Next steps
- Circuit Breaker Config — automatic failover when health checks detect provider outages
- Gateway Docker Compose — container health checks and restart policies
- Multi-Provider Failover — priority-based routing triggered by health status