Infrastructure Monitoring for AI Systems
Reliable AI governance depends on healthy infrastructure. This guide covers Prometheus metric collection, Grafana dashboards, host and container monitoring, and alerting thresholds tailored to the Keeptrusts platform.
Use this page when
- You need to set up Prometheus scrape configs for Keeptrusts gateway, API, and PostgreSQL metrics.
- You are building Grafana dashboards for gateway throughput, policy evaluation latency, and event ingest rates.
- You need alerting thresholds for gateway health, database connection saturation, or disk growth.
- You want structured logging configuration for log aggregation from gateway and API processes.
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
Metrics Architecture
```
┌────────────┐    ┌────────────┐    ┌─────────────┐
│  Gateway   │    │ API Server │    │ PostgreSQL  │
│  /metrics  │    │  /metrics  │    │  exporter   │
└──────┬─────┘    └──────┬─────┘    └──────┬──────┘
       └─────────────────┼─────────────────┘
                         ▼
                ┌────────────────┐
                │   Prometheus   │
                │    (scrape)    │
                └───────┬────────┘
                        ▼
                ┌────────────────┐
                │    Grafana     │
                │  (dashboards)  │
                └────────────────┘
```
Prometheus Configuration
Scrape Config for Keeptrusts Components
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'keeptrusts-gateway'
    static_configs:
      - targets:
          - 'gateway-1:41002'
          - 'gateway-2:41002'
    metrics_path: /metrics
    scrape_interval: 10s

  - job_name: 'keeptrusts-api'
    static_configs:
      - targets: ['api-server:8080']
    metrics_path: /metrics
    scrape_interval: 15s

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'node'
    static_configs:
      - targets:
          - 'gateway-1:9100'
          - 'gateway-2:9100'
          - 'api-server:9100'
          - 'db-server:9100'
```
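Once these jobs are scraping, target health is visible under Prometheus's `/api/v1/targets` endpoint (the same endpoint used in the Verification section below). A small Python helper for flagging unhealthy targets, assuming the standard v1 API response shape:

```python
def unhealthy_targets(targets_response: dict) -> list[dict]:
    """Return {job, instance, lastError} for every active target not reporting 'up'."""
    return [
        {
            "job": t["labels"].get("job", ""),
            "instance": t["labels"].get("instance", ""),
            "lastError": t.get("lastError", ""),
        }
        for t in targets_response["data"]["activeTargets"]
        if t.get("health") != "up"
    ]

# Example fragment in the shape returned by GET /api/v1/targets.
# Live usage: json.load(urllib.request.urlopen("http://prometheus:9090/api/v1/targets"))
sample = {
    "data": {
        "activeTargets": [
            {"labels": {"job": "keeptrusts-gateway", "instance": "gateway-1:41002"},
             "health": "up"},
            {"labels": {"job": "postgres", "instance": "postgres-exporter:9187"},
             "health": "down", "lastError": "connection refused"},
        ]
    }
}
print(unhealthy_targets(sample))
```

This is a sketch for ad-hoc checks; for ongoing coverage, prefer the `up` metric and the `GatewayDown`-style alert rules shown later in this page.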
Service Discovery with Docker
```yaml
scrape_configs:
  - job_name: 'docker-keeptrusts'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 30s
    relabel_configs:
      # Only scrape containers labeled com.keeptrusts.metrics=true
      - source_labels: [__meta_docker_container_label_com_keeptrusts_metrics]
        regex: 'true'
        action: keep
      - source_labels: [__meta_docker_container_name]
        target_label: container_name
```
Key Metrics
Gateway Metrics
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `keeptrusts_gateway_requests_total` | Counter | Total requests processed | N/A |
| `keeptrusts_gateway_request_duration_seconds` | Histogram | Request latency | p99 > 30s |
| `keeptrusts_gateway_policy_evaluations_total` | Counter | Policy evaluations by result | blocked/total > 50% |
| `keeptrusts_gateway_upstream_errors_total` | Counter | Upstream provider errors | > 10/min |
| `keeptrusts_gateway_active_connections` | Gauge | Current open connections | > 80% of max |
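The counters above are cumulative, which is why every dashboard and alert expression wraps them in `rate()` to get a per-second value. The arithmetic behind `rate(keeptrusts_gateway_requests_total[5m])` can be sketched as follows (a simplification: real PromQL `rate()` also corrects for counter resets and extrapolates to the window boundaries):

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase between the first and last (timestamp, value) samples
    of a cumulative counter. Simplified relative to PromQL rate()."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Five samples of keeptrusts_gateway_requests_total over a 60s window
samples = [(0, 1000), (15, 1150), (30, 1300), (45, 1450), (60, 1600)]
print(simple_rate(samples))  # 10.0 requests/second
```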
API Server Metrics
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `keeptrusts_api_request_duration_seconds` | Histogram | API endpoint latency | p99 > 2s |
| `keeptrusts_api_events_ingested_total` | Counter | Events written | Rate drop > 90% |
| `keeptrusts_api_db_pool_connections` | Gauge | Active DB connections | > 80% of pool |
| `keeptrusts_api_db_query_duration_seconds` | Histogram | Database query time | p99 > 500ms |
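The p99 thresholds above come from `histogram_quantile()`, which estimates a quantile from the cumulative `_bucket` series by linear interpolation within the matching bucket. A minimal Python sketch of that estimation, assuming monotonic cumulative buckets as Prometheus exports them:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets,
    interpolating linearly inside the matching bucket, as PromQL does."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Interpolate between the bucket's lower and upper bounds
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical cumulative buckets for keeptrusts_api_db_query_duration_seconds
buckets = [(0.1, 800), (0.25, 950), (0.5, 990), (1.0, 1000)]
print(histogram_quantile(0.99, buckets))  # ≈ 0.5 — p99 falls in the 0.25–0.5s bucket
```

Note this is an estimate: the accuracy of any reported p99 depends on how finely the histogram's bucket boundaries were chosen around the threshold you alert on.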
Host Metrics (node_exporter)
| Metric | Alert Threshold |
|---|---|
| `node_cpu_seconds_total` (idle) | CPU usage > 85% sustained |
| `node_memory_MemAvailable_bytes` | Available memory < 15% |
| `node_filesystem_avail_bytes` | Disk < 20% free |
| `node_network_receive_errs_total` | > 0 sustained |
Grafana Dashboards
Gateway Overview Dashboard
```json
{
  "title": "Keeptrusts Gateway Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [{
        "expr": "rate(keeptrusts_gateway_requests_total[5m])",
        "legendFormat": "{{instance}}"
      }]
    },
    {
      "title": "Request Latency (p50 / p95 / p99)",
      "type": "timeseries",
      "targets": [
        {"expr": "histogram_quantile(0.5, rate(keeptrusts_gateway_request_duration_seconds_bucket[5m]))"},
        {"expr": "histogram_quantile(0.95, rate(keeptrusts_gateway_request_duration_seconds_bucket[5m]))"},
        {"expr": "histogram_quantile(0.99, rate(keeptrusts_gateway_request_duration_seconds_bucket[5m]))"}
      ]
    },
    {
      "title": "Policy Block Rate",
      "type": "stat",
      "targets": [{
        "expr": "rate(keeptrusts_gateway_policy_evaluations_total{result='blocked'}[5m]) / rate(keeptrusts_gateway_policy_evaluations_total[5m]) * 100"
      }]
    },
    {
      "title": "Active Connections",
      "type": "gauge",
      "targets": [{
        "expr": "keeptrusts_gateway_active_connections"
      }]
    }
  ]
}
```
Infrastructure Health Dashboard
```json
{
  "title": "Keeptrusts Infrastructure Health",
  "panels": [
    {
      "title": "CPU Usage by Host",
      "targets": [{
        "expr": "100 - (avg by(instance)(rate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)"
      }]
    },
    {
      "title": "Memory Usage by Host",
      "targets": [{
        "expr": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"
      }]
    },
    {
      "title": "Disk Usage by Host",
      "targets": [{
        "expr": "(1 - node_filesystem_avail_bytes{mountpoint='/'} / node_filesystem_size_bytes{mountpoint='/'}) * 100"
      }]
    },
    {
      "title": "PostgreSQL Connections",
      "targets": [{
        "expr": "pg_stat_activity_count"
      }]
    }
  ]
}
```
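The memory and disk panels above are simple ratios. The same arithmetic in Python, using hypothetical `node_exporter` readings for one host:

```python
def pct_used(available: float, total: float) -> float:
    """Percentage used, mirroring (1 - avail / total) * 100 from the dashboard."""
    return (1 - available / total) * 100

# Hypothetical readings for one host
mem_available, mem_total = 4 * 2**30, 16 * 2**30    # 4 GiB of 16 GiB available
disk_avail, disk_size = 30 * 2**30, 240 * 2**30     # 30 GiB of 240 GiB free

print(pct_used(mem_available, mem_total))  # 75.0 — memory usage %
print(pct_used(disk_avail, disk_size))     # 87.5 — disk usage %
```

Both values here would trip the thresholds in the host metrics table (available memory below 15%, disk below 20% free).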
Container Metrics
cAdvisor for Docker Deployments
```yaml
# docker-compose.monitoring.yml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8081:8080"
    restart: unless-stopped
```
Key container metrics:
```promql
# Container CPU usage
rate(container_cpu_usage_seconds_total{name=~"keeptrusts.*"}[5m])

# Container memory
container_memory_usage_bytes{name=~"keeptrusts.*"}

# Container network I/O
rate(container_network_receive_bytes_total{name=~"keeptrusts.*"}[5m])
rate(container_network_transmit_bytes_total{name=~"keeptrusts.*"}[5m])
```
Alerting Rules
Prometheus Alert Rules
```yaml
# alerts.yml
groups:
  - name: keeptrusts-gateway
    rules:
      - alert: GatewayHighLatency
        expr: histogram_quantile(0.99, rate(keeptrusts_gateway_request_duration_seconds_bucket[5m])) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway p99 latency exceeds 30s on {{ $labels.instance }}"
      - alert: GatewayHighErrorRate
        # rate() is per-second; multiply by 60 to alert on the 10/min threshold
        expr: rate(keeptrusts_gateway_upstream_errors_total[5m]) * 60 > 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Gateway upstream error rate above 10/min on {{ $labels.instance }}"
      - alert: GatewayDown
        expr: up{job="keeptrusts-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Gateway instance {{ $labels.instance }} is down"

  - name: keeptrusts-api
    rules:
      - alert: APIHighDBLatency
        expr: histogram_quantile(0.99, rate(keeptrusts_api_db_query_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API database p99 latency exceeds 500ms"
      - alert: APIDBPoolExhaustion
        expr: keeptrusts_api_db_pool_connections / keeptrusts_api_db_pool_max > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API database connection pool above 80%"

  - name: keeptrusts-infra
    rules:
      - alert: HostHighCPU
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 85% on {{ $labels.instance }}"
      - alert: HostLowDisk
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Less than 20% disk space free on {{ $labels.instance }}"
```
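The `for:` clause in these rules means an alert fires only after its expression has been continuously true for the stated duration; until then it sits in a pending state, and any false evaluation resets the clock. A simplified Python model of that behavior:

```python
def alert_state(breaches: list[bool], step_s: int, for_s: int) -> str:
    """Walk one evaluation result per step and return the final alert state.

    Simplified model of the Prometheus 'for:' clause: a false evaluation
    resets the pending timer; firing requires for_s of continuous breaches.
    """
    pending_since = None
    state = "inactive"
    for i, breached in enumerate(breaches):
        now = i * step_s
        if not breached:
            pending_since, state = None, "inactive"
        elif pending_since is None:
            pending_since, state = now, "pending"
        elif now - pending_since >= for_s:
            state = "firing"
    return state

# 15s evaluation interval, for: 5m — 21 consecutive breaches span 300s
print(alert_state([True] * 21, step_s=15, for_s=300))                          # firing
# A single recovery mid-window resets the timer, so the alert stays pending
print(alert_state([True] * 10 + [False] + [True] * 10, step_s=15, for_s=300))  # pending
```

This is why a `for: 5m` rule tolerates brief spikes: the breach must hold across every evaluation in the window before anything pages.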
Log Aggregation
Pair metrics with structured log collection:
```yaml
# docker-compose logs to stdout — collect with Loki or Fluentd
services:
  keeptrusts-gateway:
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "5"
        tag: "keeptrusts-gateway"
```

```bash
# Gateway structured logs — pipe to your log aggregator
kt gateway run --policy-config policy-config.yaml 2>&1 | \
  tee /var/log/keeptrusts/gateway.log
```
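If the gateway emits JSON-structured log lines (an assumption — verify the format in your deployment), downstream filtering is straightforward. A sketch using hypothetical log records:

```python
import json

def filter_level(lines: list[str], level: str) -> list[dict]:
    """Parse JSON log lines, keeping records at the given level.
    Non-JSON lines (e.g. startup banners) are skipped, not raised on."""
    records = []
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue
        if rec.get("level") == level:
            records.append(rec)
    return records

# Hypothetical gateway log lines
log_lines = [
    '{"ts": "2024-05-01T12:00:00Z", "level": "info", "msg": "request allowed"}',
    '{"ts": "2024-05-01T12:00:01Z", "level": "error", "msg": "upstream timeout"}',
    'plain text line from a startup banner',
]
print(filter_level(log_lines, "error"))
```

The same level/timestamp fields are what Loki or Fluentd would index, so keeping log output strictly one JSON object per line pays off across the whole pipeline.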
Verification
```bash
# Check Prometheus targets
curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Verify gateway metrics endpoint
curl -s http://localhost:41002/metrics | head -20

# Test alerting rule evaluation
curl -s http://prometheus:9090/api/v1/rules | jq '.data.groups[].rules[] | {name: .name, state: .state}'
```
Next steps
- Backup & Recovery — protect the data behind your dashboards
- Capacity Sizing — use monitoring data to inform scaling decisions
- Security Hardening — secure your monitoring stack
For AI systems
- Canonical terms: Keeptrusts monitoring, Prometheus scrape config, Grafana dashboards, gateway metrics endpoint, alerting thresholds, structured logging, container metrics.
- Key config/commands: Prometheus scrape targets (`gateway:41002/metrics`, `api:8080/metrics`); 10s scrape interval for gateway; alerting rules for high error rate, connection pool saturation, disk growth; `kt gateway run 2>&1 | tee /var/log/keeptrusts/gateway.log` for structured logs.
- Best next pages: Backup & Recovery, Capacity Sizing, Security Hardening.
For engineers
- Prerequisites: Prometheus instance with network access to gateway and API `/metrics` endpoints; Grafana for visualization; optional PostgreSQL exporter.
- Configure a 10s scrape interval for the gateway (latency-sensitive), 15s for the API and PostgreSQL exporter.
- Validate with `curl -s http://localhost:41002/metrics | head -20` to confirm the metrics endpoint, and `curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | {job, health}'` to check scrape health.
- Key alert thresholds: gateway upstream error rate > 10/min for 2 minutes; PostgreSQL connection pool > 80% utilized; disk growth projecting full within 7 days.
For leaders
- Monitoring is essential for proving governance system availability to auditors and regulators.
- Alerting on gateway health prevents silent policy enforcement failures that could result in undetected compliance violations.
- Capacity decisions (when to scale, when to upgrade) depend on monitoring data — invest in dashboards early.
- Structured logging feeds into SIEM systems for security operations and incident response.