Tutorial: Gateway Health Monitoring & Alerts

This tutorial shows you how to monitor the Keeptrusts gateway using its built-in health endpoint, expose Prometheus metrics, run kt doctor diagnostics, configure Kubernetes readiness and liveness probes, and set up alerting rules.

Use this page when

You need to monitor gateway uptime and provider reachability.
You are exposing Prometheus metrics from the gateway for alerting.
You want to run kt doctor diagnostics to troubleshoot connectivity issues.
You are configuring Kubernetes liveness and readiness probes for the gateway.

Primary audience

Primary: SRE and platform engineers operating gateways in production
Secondary: DevOps teams configuring alerting rules; technical leaders reviewing uptime SLOs

Prerequisites

kt CLI installed (first-run tutorial)
A running Keeptrusts gateway with traffic
curl and jq installed
(Optional) Prometheus and Alertmanager for the alerting sections

Step 1: Check the Health Endpoint

Every running gateway exposes a /health endpoint:

curl -s http://localhost:41002/health | jq .

Expected output:

{
  "status": "healthy",
  "uptime_seconds": 3600,
  "version": "1.x.x",
  "providers": {
    "openai": "reachable",
    "anthropic": "reachable"
  },
  "policies_loaded": 5,
  "requests_processed": 2847,
  "cache": {
    "entries": 142,
    "hit_rate_pct": 34.2
  }
}

Key fields:

Field	Meaning
`status`	`healthy`, `degraded`, or `unhealthy`
`providers.<name>`	`reachable` or `unreachable` — per-provider health
`policies_loaded`	Number of active policies in the chain
`requests_processed`	Total requests since startup
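
For scripted checks, for example in a cron job or a deploy hook, the status field can be turned into an exit code. The following is a minimal sketch and not part of the kt CLI: the function names health_exit_code and unreachable_providers and the exit-code mapping are assumptions made for illustration, and the JSON shape is taken from the sample output above.

```shell
#!/usr/bin/env bash
# Read a /health JSON document on stdin and map it to an exit code:
#   0 = healthy, 1 = degraded, 2 = unhealthy, unknown, or unparseable.
# This mapping is the sketch's own convention, not documented gateway behavior.
health_exit_code() {
  local status
  status=$(jq -r '.status // "unknown"' 2>/dev/null) || status="unknown"
  case "$status" in
    healthy)  return 0 ;;
    degraded) return 1 ;;
    *)        return 2 ;;
  esac
}

# Print the name of every provider whose state is not "reachable", one per line.
unreachable_providers() {
  jq -r '.providers // {} | to_entries[] | select(.value != "reachable") | .key'
}

# Usage against a live gateway (not executed here):
#   curl -s http://localhost:41002/health | health_exit_code || echo "gateway needs attention"
#   curl -s http://localhost:41002/health | unreachable_providers
```

Keeping the parsing in functions that read stdin means the same code works on a live response or on a saved copy from an incident.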

Step 2: Run kt doctor

The kt doctor command performs a comprehensive diagnostic check:

kt doctor --config policy-config.yaml

Expected output:

Keeptrusts Gateway Diagnostics
══════════════════════════════

✓ Configuration file      valid (policy-config.yaml)
✓ Provider: openai        reachable (latency: 85ms)
✓ Provider: anthropic     reachable (latency: 120ms)
✓ Policy chain            5 policies loaded, no conflicts
✓ Port 41002              available
✓ API connection          connected (http://localhost:8080)
✓ Event forwarding        last event 2s ago
✗ TLS certificate         not configured (recommended for production)

Summary: 7/8 checks passed

kt doctor validates:

Configuration syntax and provider connectivity
Policy chain integrity
Port availability
API connection and event forwarding
TLS configuration
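
If you run kt doctor from automation, you may want to surface only the failed checks. This page does not state what exit code kt doctor returns when a check fails, so the sketch below parses the text output instead. It assumes the output format shown above, where failed checks start with ✗; doctor_failures is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print the failed checks (lines starting with the cross mark) from
# `kt doctor` output on stdin. Returns 1 if any failed check is found,
# 0 otherwise.
doctor_failures() {
  local failed
  failed=$(grep '^✗' || true)   # grep exits non-zero on "no match"; that is fine here
  if [ -n "$failed" ]; then
    printf '%s\n' "$failed"
    return 1
  fi
  return 0
}

# Usage (not executed here):
#   kt doctor --config policy-config.yaml | doctor_failures
```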

Step 3: Access Prometheus Metrics

The gateway exposes Prometheus metrics on the /metrics endpoint:

curl -s http://localhost:41002/metrics | head -30

Example output:

# HELP keeptrusts_requests_total Total number of requests processed
# TYPE keeptrusts_requests_total counter
keeptrusts_requests_total{provider="openai",model="gpt-4o-mini",status="success"} 1842
keeptrusts_requests_total{provider="openai",model="gpt-4o-mini",status="blocked"} 23
keeptrusts_requests_total{provider="anthropic",model="claude-sonnet-4-20250514",status="success"} 982

# HELP keeptrusts_request_duration_seconds Request latency histogram
# TYPE keeptrusts_request_duration_seconds histogram
keeptrusts_request_duration_seconds_bucket{provider="openai",le="0.1"} 120
keeptrusts_request_duration_seconds_bucket{provider="openai",le="0.5"} 1650
keeptrusts_request_duration_seconds_bucket{provider="openai",le="1.0"} 1830
keeptrusts_request_duration_seconds_bucket{provider="openai",le="+Inf"} 1842

# HELP keeptrusts_tokens_total Total tokens consumed
# TYPE keeptrusts_tokens_total counter
keeptrusts_tokens_total{provider="openai",type="input"} 245000
keeptrusts_tokens_total{provider="openai",type="output"} 312000

# HELP keeptrusts_policy_evaluations_total Policy evaluation outcomes
# TYPE keeptrusts_policy_evaluations_total counter
keeptrusts_policy_evaluations_total{policy="injection-defense",result="pass"} 2800
keeptrusts_policy_evaluations_total{policy="injection-defense",result="block"} 23

Key metrics:

Metric	Type	Description
`keeptrusts_requests_total`	counter	Total requests by provider, model, status
`keeptrusts_request_duration_seconds`	histogram	Latency distribution
`keeptrusts_tokens_total`	counter	Token usage by provider and type
`keeptrusts_policy_evaluations_total`	counter	Policy outcomes
`keeptrusts_cache_hits_total`	counter	Cache hit/miss counts
`keeptrusts_provider_health`	gauge	Provider health (1 = healthy, 0 = unhealthy)
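
To spot-check a counter without a Prometheus server, you can aggregate the text exposition directly. The sketch below sums all series of one metric that carry a given label pair. sum_metric is a hypothetical helper, and it assumes one sample per line with the value as the last field, as in the example output above.

```shell
#!/usr/bin/env bash
# Sum a counter across every series whose label set contains a fragment.
# Usage: sum_metric <metric_name> <label_fragment> < metrics.txt
sum_metric() {
  local name="$1" fragment="$2"
  awk -v name="$name" -v frag="$fragment" '
    /^#/ { next }                                   # skip HELP/TYPE comments
    index($0, name "{") == 1 && index($0, frag) > 0 { total += $NF }
    END { printf "%d\n", total }'
}

# Usage against a live gateway (not executed here):
#   curl -s http://localhost:41002/metrics | sum_metric keeptrusts_requests_total 'status="blocked"'
```

Applied to the three keeptrusts_requests_total samples shown above, the success total is 2824 (1842 + 982) and the blocked total is 23.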

Step 4: Configure Prometheus Scraping

Add the gateway to your prometheus.yml:

scrape_configs:
  - job_name: "keeptrusts-gateway"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:41002"]
    metrics_path: /metrics

Reload Prometheus (the /-/reload endpoint requires Prometheus to be started with --web.enable-lifecycle):

curl -X POST http://localhost:9090/-/reload
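
After the reload, you can confirm that Prometheus is scraping the gateway by querying the built-in up metric through the Prometheus HTTP API. This sketch assumes Prometheus listens on localhost:9090, as in the reload command above. scrape_status is a hypothetical helper that only formats the API response.

```shell
#!/usr/bin/env bash
# Read a Prometheus /api/v1/query response for the `up` metric on stdin
# and print "<instance> up" or "<instance> down" for each target.
scrape_status() {
  jq -r '.data.result[]
         | "\(.metric.instance) \(if .value[1] == "1" then "up" else "down" end)"'
}

# Usage against a live Prometheus (not executed here):
#   curl -sG http://localhost:9090/api/v1/query \
#     --data-urlencode 'query=up{job="keeptrusts-gateway"}' | scrape_status
```

An empty result usually means the job name in the query does not match the job_name in prometheus.yml, or the new configuration has not been loaded yet.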

Step 5: Set Up Kubernetes Probes

Add readiness and liveness probes to your gateway deployment:

# kubernetes/gateway-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: keeptrusts-gateway
spec:
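  replicas: 3
  # A Deployment requires a selector that matches the pod template labels.
  # The label value below is illustrative; use your own naming scheme.
  selector:
    matchLabels:
      app: keeptrusts-gateway
  template:
    metadata:
      labels:
        app: keeptrusts-gateway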
    spec:
      containers:
        - name: gateway
          image: keeptrusts/gateway:latest
          ports:
            - containerPort: 41002
          livenessProbe:
            httpGet:
              path: /health
              port: 41002
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 41002
            initialDelaySeconds: 3
            periodSeconds: 5
            failureThreshold: 2

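Liveness probe: Kubernetes restarts the container after /health fails 3 consecutive checks (about 30 seconds at a 10-second period).
Readiness probe: Kubernetes stops routing traffic to the pod after /health fails 2 consecutive checks (about 10 seconds at a 5-second period).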
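
The detection windows above come from multiplying periodSeconds by failureThreshold. A small helper (hypothetical, for illustration only) makes it easy to check other values before you edit the manifest. The result is an approximation that ignores timeoutSeconds and probe scheduling jitter.

```shell
#!/usr/bin/env bash
# Approximate seconds until Kubernetes acts on a failing probe:
# periodSeconds * failureThreshold.
probe_window() {
  local period="$1" threshold="$2"
  echo $(( period * threshold ))
}

probe_window 10 3   # liveness values from the manifest above
probe_window 5 2    # readiness values from the manifest above
```

Keeping the readiness window shorter than the liveness window, as the manifest does, lets Kubernetes stop sending traffic to a struggling pod before it resorts to a restart.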

Step 6: Create Alerting Rules

Create Prometheus alerting rules for common failure scenarios:

# alerts/keeptrusts-gateway.rules.yml
groups:
  - name: keeptrusts-gateway
    rules:
      - alert: GatewayHighErrorRate
        expr: |
          sum(rate(keeptrusts_requests_total{status="error"}[5m]))
          / sum(rate(keeptrusts_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway error rate above 5%"

      - alert: GatewayProviderDown
        expr: keeptrusts_provider_health == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Provider {{ $labels.provider }} is unreachable"

      - alert: GatewayHighLatency
        expr: |
          histogram_quantile(0.95,
            rate(keeptrusts_request_duration_seconds_bucket[5m])
          ) > 2.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Gateway p95 latency exceeds 2 seconds"

      - alert: GatewayHighBlockRate
        expr: |
          sum by (policy) (rate(keeptrusts_policy_evaluations_total{result="block"}[15m]))
          / sum by (policy) (rate(keeptrusts_policy_evaluations_total[15m])) > 0.10
        for: 15m
        labels:
          severity: info
        annotations:
          summary: "Policy block rate above 10% — review policy sensitivity"
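
To get a feel for what the GatewayHighLatency expression evaluates, the sketch below applies the same linear interpolation that histogram_quantile() uses, but to raw cumulative bucket counts instead of rates. It is an illustration only: bucket_quantile is a hypothetical helper, and it assumes a single series whose buckets appear in ascending le order, as in the Step 3 example output.

```shell
#!/usr/bin/env bash
# Estimate a quantile from cumulative histogram buckets read on stdin.
# Usage: bucket_quantile <quantile between 0 and 1> < metrics.txt
bucket_quantile() {
  local q="$1"
  awk -v q="$q" '
    /^keeptrusts_request_duration_seconds_bucket[{]/ {
      # Extract the upper bound from le="..." and the cumulative count.
      if (match($0, /le="[^"]+"/) == 0) next
      n++
      bound[n] = substr($0, RSTART + 4, RLENGTH - 5)
      count[n] = $NF + 0
    }
    END {
      if (n < 2) exit 1
      rank = q * count[n]               # count[n] is the +Inf bucket, i.e. the total
      for (i = 1; i <= n; i++) if (count[i] >= rank) break
      # If the rank falls in the +Inf bucket, report the highest finite bound.
      if (bound[i] == "+Inf") { printf "%.2f\n", bound[i - 1]; exit }
      lo = (i == 1) ? 0 : bound[i - 1]
      lo_count = (i == 1) ? 0 : count[i - 1]
      printf "%.2f\n", lo + (bound[i] - lo) * (rank - lo_count) / (count[i] - lo_count)
    }'
}

# Usage against a live gateway (not executed here):
#   curl -s http://localhost:41002/metrics | bucket_quantile 0.95
```

With the Step 3 sample buckets (120, 1650, 1830, and 1842 observations), bucket_quantile 0.95 prints 0.78: the 95th percentile falls in the 0.5 to 1.0 second bucket, well under the 2-second alert threshold.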

Step 7: Verify Alerting

Simulate a provider failure to test alerts:

# Temporarily break the provider key
export OPENAI_API_KEY="invalid-key"
kt config reload

# Check provider health: openai should now be reported as unreachable
curl -s http://localhost:41002/health | jq '.providers'

Expected output:

{
  "openai": "unreachable",
  "anthropic": "reachable"
}

After 2 minutes, the GatewayProviderDown alert should fire. Restore the key and reload:

export OPENAI_API_KEY="sk-your-actual-key"
kt config reload
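
To confirm that GatewayProviderDown actually reached the firing state during the test window, you can read the active alerts from the Prometheus HTTP API. This sketch assumes Prometheus is on localhost:9090 and that the rules from Step 6 are loaded. alert_state is a hypothetical helper that reduces the /api/v1/alerts response to a single word.

```shell
#!/usr/bin/env bash
# Read a Prometheus /api/v1/alerts response on stdin and print the state
# of the named alert: "firing", "pending", or "inactive" when it is absent.
alert_state() {
  local name="$1"
  jq -r --arg name "$name" '
    [.data.alerts[] | select(.labels.alertname == $name) | .state]
    | if   any(.[]; . == "firing")  then "firing"
      elif any(.[]; . == "pending") then "pending"
      else "inactive" end'
}

# Usage against a live Prometheus (not executed here):
#   curl -s http://localhost:9090/api/v1/alerts | alert_state GatewayProviderDown
```

During the first 2 minutes of the simulated outage the alert is expected to show pending, because of the for: 2m clause, and then switch to firing.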

Summary

/health returns gateway status, provider health, and request counts
kt doctor runs comprehensive diagnostics including connectivity and TLS
/metrics exposes Prometheus-compatible counters, gauges, and histograms
Configure readiness and liveness probes in Kubernetes using /health
Set up alerting rules for error rate, provider health, latency, and block rate
Use kt config reload to recover from provider key or configuration issues

For AI systems

Canonical terms: Keeptrusts gateway, /health endpoint, Prometheus metrics, kt doctor, readiness probe, liveness probe, alerting rules.
CLI commands: kt doctor --config policy-config.yaml, curl http://localhost:41002/health, curl http://localhost:41002/metrics.
Health status values: healthy, degraded, unhealthy.
Prometheus metrics: keeptrusts_requests_total, keeptrusts_request_duration_seconds, keeptrusts_tokens_total, keeptrusts_policy_evaluations_total, keeptrusts_cache_hits_total, keeptrusts_provider_health.
Best next pages: Circuit Breaker Config, Gateway Docker Compose, Multi-Provider Failover.

For engineers

Prerequisites: kt CLI, running gateway with traffic, curl and jq, optional Prometheus.
Quick check: curl -s http://localhost:41002/health | jq .status — returns healthy, degraded, or unhealthy.
Diagnostics: kt doctor --config policy-config.yaml validates config, provider connectivity, port availability, and event forwarding.
Prometheus: scrape http://localhost:41002/metrics for request counters, latency histograms, and provider health gauges.
K8s probes: livenessProbe.httpGet.path: /health, readinessProbe.httpGet.path: /health.

For leaders

Health monitoring provides operational confidence that AI governance controls are active and functional.
Prometheus integration fits existing observability stacks — no new monitoring tools required.
Alerting rules can page on-call when providers become unreachable or request error rates spike.
kt doctor gives a one-command diagnostic for rapid incident triage.

Next steps

Circuit Breaker Config — automatic failover when health checks detect provider outages
Gateway Docker Compose — container health checks and restart policies
Multi-Provider Failover — priority-based routing triggered by health status
