Gateway Health Probes and Monitoring Endpoints

Healthy gateways are not just gateways that respond. They are gateways that expose the right signals to the right consumers. Humans need diagnostics. Orchestrators need readiness and liveness. Monitoring systems need metrics. Keeptrusts gives you those surfaces directly through /health, /metrics, and the operator-facing kt doctor command, which means you do not need to invent a separate monitoring contract around the gateway.

Use this page when

  • You are wiring Keeptrusts into Kubernetes probes or container health checks.
  • You want Prometheus or another metrics system scraping the gateway.
  • You need to separate operator diagnostics from automated health signals.

Primary audience

  • Primary: SREs, platform engineers, and DevOps teams
  • Secondary: technical leaders responsible for uptime and incident readiness

The problem

AI gateways sit in a difficult place operationally. They are close enough to the application path that failures are user-visible, but far enough from the application that teams sometimes monitor them like a black box. That is risky.

If you only check whether the process exists, you can miss degraded provider connectivity. If you only look at logs, you cannot let Kubernetes make traffic decisions automatically. If you only watch metrics, you may miss local operator issues like a missing config file or an unwritable state directory.

The deeper issue is that not every health signal is for the same consumer. A readiness probe should decide whether the gateway receives traffic. A liveness probe should decide whether the process needs restarting. An operator diagnostic should help a human find the broken layer. A metrics endpoint should support trend analysis and alerting.

When those roles are mixed together, monitoring becomes noisy and unreliable.

The solution

Keeptrusts works best when you use each surface for its intended job.

Use /health as the machine-readable service signal. Use /metrics as the monitoring and alerting surface. Use kt doctor when a human operator needs to verify control-plane connectivity, config validity, state directory health, and gateway reachability from their execution context.

That separation is what makes the monitoring model strong. The orchestrator can make fast decisions from /health. Prometheus can scrape /metrics continuously. Operators can run kt doctor without conflating local-environment problems with live-service status.

Implementation

The smallest health check is still valuable:

curl -fsS http://localhost:41002/health | jq .

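The -f flag makes curl exit non-zero on an HTTP error status, so the command tells you whether the gateway is healthy at the HTTP layer, and the response body piped through jq gives you a quick diagnostic snapshot. During deployment validation, it is usually the first endpoint to check.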
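During deployment validation it often helps to poll the endpoint rather than check it once. The following is a minimal Python sketch, not part of Keeptrusts: the helper names http_ok and wait_until_healthy, the retry count, and the delay are illustrative assumptions, and it treats any 2xx response from /health as healthy.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True when the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError (4xx/5xx), refused connections, and timeouts
        # are all OSError subclasses.
        return False


def wait_until_healthy(
    probe: Callable[[], bool], attempts: int = 10, delay: float = 1.0
) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Example (assumes a gateway listening on the port used in this article):
# wait_until_healthy(lambda: http_ok("http://localhost:41002/health"))
```

Passing the probe in as a callable keeps the retry loop independent of HTTP, so the same loop can be exercised with a stub before it is pointed at a real gateway.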

Metrics belong in a different lane:

curl -fsS http://localhost:41002/metrics | rg 'keeptrusts_(requests_total|request_duration_seconds|policy_evaluations_total|provider_health)'

That is the right surface for long-running alerting and throughput analysis because it carries counters and histograms rather than a single current status.
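To illustrate why counters suit trend analysis, here is a hedged Python sketch that turns two scrapes of /metrics into a per-second request rate. The metric name keeptrusts_requests_total comes from the command above; the route label, the sample values, and the helper names are made up for illustration, and the parser is a simplification of the Prometheus text format rather than a full implementation.

```python
def sum_counter(exposition: str, name: str) -> float:
    """Sum every sample of `name` across all label sets in one scrape."""
    total = 0.0
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            sample_name = line[: line.index("{")]
            fields = line[line.rindex("}") + 1 :].split()
        else:
            sample_name, *fields = line.split()
        if sample_name == name and fields:
            total += float(fields[0])  # fields[1], if present, is a timestamp
    return total


def per_second_rate(before: str, after: str, name: str, seconds: float) -> float:
    """Average increase per second of a counter between two scrapes."""
    start, end = sum_counter(before, name), sum_counter(after, name)
    # A counter that went backwards means the process restarted; count from zero.
    increase = end - start if end >= start else end
    return increase / seconds


# Hypothetical scrapes taken 15 seconds apart; the `route` label is an assumption.
FIRST = """\
# TYPE keeptrusts_requests_total counter
keeptrusts_requests_total{route="a"} 100
keeptrusts_requests_total{route="b"} 50
"""
SECOND = """\
# TYPE keeptrusts_requests_total counter
keeptrusts_requests_total{route="a"} 120
keeptrusts_requests_total{route="b"} 60
"""

print(per_second_rate(FIRST, SECOND, "keeptrusts_requests_total", 15.0))  # 2.0
```

In practice Prometheus does this calculation for you with rate(); the sketch only shows what a single /health status cannot provide, which is change over time.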

For Kubernetes, keep the probes explicit and simple:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: keeptrusts-gateway
spec:
  template:
    spec:
      containers:
        - name: gateway
          image: keeptrusts/gateway:latest
          livenessProbe:
            httpGet:
              path: /health
              port: 41002
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 41002
            initialDelaySeconds: 3
            periodSeconds: 5
            failureThreshold: 2

That pattern keeps routing decisions fast and legible. The gateway either passes /health quickly enough to receive traffic or it does not.
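The probe settings above imply how quickly Kubernetes reacts. As a rough worked example in Python, the time from /health starting to fail until Kubernetes acts is bounded by about periodSeconds × failureThreshold. This is an approximation that ignores timeoutSeconds (1 second by default) and scheduling jitter, and the function name is illustrative.

```python
def worst_case_detection_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate upper bound on the delay before Kubernetes acts on a failing probe."""
    return period_seconds * failure_threshold


# Values taken from the Deployment manifest in this article.
readiness = worst_case_detection_seconds(period_seconds=5, failure_threshold=2)
liveness = worst_case_detection_seconds(period_seconds=10, failure_threshold=3)

print(f"readiness: about {readiness}s until the pod stops receiving traffic")
print(f"liveness: about {liveness}s until the container is restarted")
```

With these numbers the pod leaves rotation after roughly 10 seconds of failures and is restarted after roughly 30, so traffic is withdrawn before a restart is attempted.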

For Prometheus scraping, define the job directly against /metrics:

scrape_configs:
  - job_name: keeptrusts-gateway
    scrape_interval: 15s
    static_configs:
      - targets: ["keeptrusts-gateway:41002"]
    metrics_path: /metrics

That gives you the raw material for alerts on high error rate, provider unreachability, unusual block-rate changes, or p95 latency growth.

Human diagnostics still belong with kt doctor, not with the probe endpoints:

kt doctor --json | jq '.[] | {name, status, message}'

This is important because the operator path and the service path are not identical. A gateway can be up while the operator environment is missing API credentials. A liveness probe should not care about that. A human running the CLI absolutely should.
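If you want a script to act on the diagnostic output, a small wrapper can list only the checks that did not pass. The Python sketch below relies on the shape implied by the jq filter above (a JSON array of objects with name, status, and message). The status values "ok" and "pass", the sample check names, and the helper names are assumptions, so adjust them to whatever kt doctor actually reports.

```python
import json
import subprocess

# Assumption: these status strings mean a check passed.
PASSING_STATUSES = {"ok", "pass"}


def failing_checks(report_json: str) -> list[dict]:
    """Return the checks whose status is not in PASSING_STATUSES."""
    checks = json.loads(report_json)
    return [
        check
        for check in checks
        if str(check.get("status", "")).lower() not in PASSING_STATUSES
    ]


def run_doctor() -> str:
    """Run `kt doctor --json` and return its stdout (requires kt on PATH)."""
    result = subprocess.run(
        ["kt", "doctor", "--json"], capture_output=True, text=True, check=False
    )
    return result.stdout


# Hypothetical report used for illustration only.
SAMPLE_REPORT = """[
  {"name": "config", "status": "ok", "message": "loaded"},
  {"name": "state-dir", "status": "fail", "message": "not writable"}
]"""

for check in failing_checks(SAMPLE_REPORT):
    print(f"{check['name']}: {check['message']}")  # state-dir: not writable
```

Keep a wrapper like this in operator tooling rather than in a probe, for the same reason the article separates the operator path from the service path.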

One practical monitoring pattern is to combine all three surfaces in every production rollout.

  1. Use /health to decide when the deployment is ready.
  2. Check /metrics after live traffic starts to confirm request and latency counters behave normally.
  3. Run kt doctor from the operator environment to confirm the control-plane path remains healthy.
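The three steps above can be sketched as an ordered gate that reports the first surface that fails. This is a minimal Python illustration with hypothetical names; each check is a callable you supply (for example an HTTP probe of /health, a scrape of /metrics, and a wrapper around kt doctor), and the lambdas here are stand-ins.

```python
from typing import Callable, Optional

Check = tuple[str, Callable[[], bool]]


def first_failing_surface(checks: list[Check]) -> Optional[str]:
    """Run checks in order and return the name of the first one that fails."""
    for surface, check in checks:
        if not check():
            return surface
    return None


# Stand-in checks; replace each lambda with a real probe.
rollout_checks: list[Check] = [
    ("/health", lambda: True),
    ("/metrics", lambda: True),
    ("kt doctor", lambda: False),
]

failed = first_failing_surface(rollout_checks)
print("rollout verified" if failed is None else f"rollout blocked by: {failed}")
```

Stopping at the first failure keeps the answer to "which layer is broken" unambiguous, which matches the order in the list above: service first, then observability, then the operator path.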

That gives you service health, observability health, and operator health without forcing one signal to do every job.

Results and impact

Teams that separate probes, metrics, and diagnostics operate the gateway more confidently because each signal becomes easier to trust. Kubernetes makes cleaner restart and routing decisions. Prometheus alerts become more meaningful. Engineers stop overloading logs and shell checks for problems the HTTP endpoints already expose better.

The larger benefit is incident quality. When a failure happens, the team already knows which surface should answer which question.

Key takeaways

  • Use /health for readiness and liveness decisions.
  • Use /metrics for continuous scraping, trend analysis, and alerting.
  • Use kt doctor for human-operated diagnostics across config, auth, and local environment health.
  • Do not force one health signal to serve all consumers.
  • The combination of probes, metrics, and diagnostics gives a cleaner production operating model than logs alone.

Next steps