Infrastructure Monitoring for AI Systems

Reliable AI governance depends on healthy infrastructure. This guide covers Prometheus metric collection, Grafana dashboards, host and container monitoring, and alerting thresholds tailored to the Keeptrusts platform.

Use this page when

  • You need to set up Prometheus scrape configs for Keeptrusts gateway, API, and PostgreSQL metrics.
  • You are building Grafana dashboards for gateway throughput, policy evaluation latency, and event ingest rates.
  • You need alerting thresholds for gateway health, database connection saturation, or disk growth.
  • You want structured logging configuration for log aggregation from gateway and API processes.

Primary audience

  • Primary: Technical Engineers
  • Secondary: AI Agents, Technical Leaders

Metrics Architecture

┌────────────┐   ┌────────────┐   ┌────────────┐
│  Gateway   │   │ API Server │   │ PostgreSQL │
│  /metrics  │   │  /metrics  │   │  exporter  │
└──────┬─────┘   └──────┬─────┘   └──────┬─────┘
       └────────────────┼────────────────┘
                        ▼
                ┌────────────────┐
                │   Prometheus   │
                │    (scrape)    │
                └───────┬────────┘
                        ▼
                ┌────────────────┐
                │    Grafana     │
                │  (dashboards)  │
                └────────────────┘

Prometheus Configuration

Scrape Config for Keeptrusts Components

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'keeptrusts-gateway'
    static_configs:
      - targets:
          - 'gateway-1:41002'
          - 'gateway-2:41002'
    metrics_path: /metrics
    scrape_interval: 10s

  - job_name: 'keeptrusts-api'
    static_configs:
      - targets: ['api-server:8080']
    metrics_path: /metrics
    scrape_interval: 15s

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'node'
    static_configs:
      - targets:
          - 'gateway-1:9100'
          - 'gateway-2:9100'
          - 'api-server:9100'
          - 'db-server:9100'

Service Discovery with Docker

scrape_configs:
  - job_name: 'docker-keeptrusts'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__meta_docker_container_label_com_keeptrusts_metrics]
        regex: 'true'
        action: keep
      - source_labels: [__meta_docker_container_name]
        target_label: container_name
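
The effect of the `keep` rule above can be sketched in Python. This is an illustration of the matching logic only, not Prometheus code: the container dictionaries and the helper names (`meta_label`, `kept_containers`) are hypothetical. It relies on two documented behaviours: Docker label names have unsupported characters replaced by underscores when they become `__meta_docker_container_label_*` labels, and relabel regexes are anchored at both ends.

```python
import re

META_PREFIX = "__meta_docker_container_label_"


def meta_label(docker_label: str) -> str:
    """Map a Docker label name (e.g. com.keeptrusts.metrics) to its meta label name."""
    return META_PREFIX + re.sub(r"[^a-zA-Z0-9_]", "_", docker_label)


def kept_containers(containers, source_label, regex="true"):
    """Return names of containers that survive an `action: keep` relabel rule.

    `containers` is a hypothetical list of {"name": str, "labels": dict} items.
    A missing source label is treated as an empty string, and the regex must
    match the whole value, as relabel regexes are fully anchored.
    """
    pattern = re.compile(regex)
    kept = []
    for container in containers:
        labels = {meta_label(k): v for k, v in container["labels"].items()}
        value = labels.get(source_label, "")
        if pattern.fullmatch(value):
            kept.append(container["name"])
    return kept


containers = [
    {"name": "keeptrusts-gateway", "labels": {"com.keeptrusts.metrics": "true"}},
    {"name": "keeptrusts-api", "labels": {"com.keeptrusts.metrics": "false"}},
    {"name": "redis", "labels": {}},
]
source = "__meta_docker_container_label_com_keeptrusts_metrics"
print(kept_containers(containers, source))
```

Only containers that carry the label `com.keeptrusts.metrics=true` remain scrape targets; unlabeled containers are dropped because an empty value does not match `true`.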

Key Metrics

Gateway Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| keeptrusts_gateway_requests_total | Counter | Total requests processed | N/A |
| keeptrusts_gateway_request_duration_seconds | Histogram | Request latency | p99 > 30s |
| keeptrusts_gateway_policy_evaluations_total | Counter | Policy evaluations by result | blocked/total > 50% |
| keeptrusts_gateway_upstream_errors_total | Counter | Upstream provider errors | > 10/min |
| keeptrusts_gateway_active_connections | Gauge | Current open connections | > 80% of max |
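
One way to confirm that a gateway exposes the metrics in the table above is to compare the `# TYPE` lines of its `/metrics` output against the expected names and types. The sketch below assumes the endpoint serves the classic Prometheus text format, where counters are declared under their full `_total` name; the helper names are hypothetical, and the sample text is made up for illustration.

```python
# Expected gateway metrics, taken from the table above.
EXPECTED_GATEWAY_METRICS = {
    "keeptrusts_gateway_requests_total": "counter",
    "keeptrusts_gateway_request_duration_seconds": "histogram",
    "keeptrusts_gateway_policy_evaluations_total": "counter",
    "keeptrusts_gateway_upstream_errors_total": "counter",
    "keeptrusts_gateway_active_connections": "gauge",
}


def declared_types(exposition_text):
    """Collect {metric_name: type} from '# TYPE <name> <type>' lines."""
    types = {}
    for line in exposition_text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[:2] == ["#", "TYPE"]:
            types[parts[2]] = parts[3]
    return types


def check_metrics(exposition_text, expected=EXPECTED_GATEWAY_METRICS):
    """Return {name: found_type} for every expected metric that is missing
    (found_type is None) or declared with a different type."""
    found = declared_types(exposition_text)
    return {
        name: found.get(name)
        for name, want in expected.items()
        if found.get(name) != want
    }


# Made-up sample of /metrics output, for illustration only.
sample = """\
# HELP keeptrusts_gateway_requests_total Total requests processed
# TYPE keeptrusts_gateway_requests_total counter
keeptrusts_gateway_requests_total 42
# TYPE keeptrusts_gateway_active_connections gauge
keeptrusts_gateway_active_connections 3
"""
for name, found_type in check_metrics(sample).items():
    print(name, "missing" if found_type is None else f"unexpected type {found_type}")
```

In practice the input would be the body returned by the gateway's `/metrics` endpoint rather than a literal string.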

API Server Metrics

| Metric | Type | Description | Alert Threshold |
| --- | --- | --- | --- |
| keeptrusts_api_request_duration_seconds | Histogram | API endpoint latency | p99 > 2s |
| keeptrusts_api_events_ingested_total | Counter | Events written | Rate drop > 90% |
| keeptrusts_api_db_pool_connections | Gauge | Active DB connections | > 80% of pool |
| keeptrusts_api_db_query_duration_seconds | Histogram | Database query time | p99 > 500ms |
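
The p99 thresholds in the tables above are estimates, because a histogram stores only cumulative bucket counts. The following is a simplified sketch of the linear interpolation that `histogram_quantile` applies to classic histograms. The bucket boundaries and counts are invented for illustration; the real buckets depend on how the Keeptrusts components are instrumented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs and must
    include the +Inf bucket. Values are assumed to be spread evenly inside
    each bucket, so the result is an approximation.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[idx]
    lower, prev = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    return lower + (upper - lower) * (rank - prev) / (count - prev)


# Hypothetical latency buckets in seconds: (le, cumulative count).
buckets = [(1.0, 50), (5.0, 90), (30.0, 99), (math.inf, 100)]
p99 = histogram_quantile(0.99, buckets)
print("p99 estimate:", p99, "breaches 30s threshold:", p99 > 30)
```

A practical consequence: the estimate can never exceed the highest finite bucket bound, so a `p99 > 30s` threshold can only trigger if the histogram defines a finite bucket boundary above 30 seconds.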

Host Metrics (node_exporter)

| Metric | Alert Threshold |
| --- | --- |
| node_cpu_seconds_total (idle) | CPU usage > 85% sustained |
| node_memory_MemAvailable_bytes | Available memory < 15% |
| node_filesystem_avail_bytes | Disk < 20% free |
| node_network_receive_errs_total | > 0 sustained |

Grafana Dashboards

Gateway Overview Dashboard

{
  "title": "Keeptrusts Gateway Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [{
        "expr": "rate(keeptrusts_gateway_requests_total[5m])",
        "legendFormat": "{{instance}}"
      }]
    },
    {
      "title": "Request Latency (p50 / p95 / p99)",
      "type": "timeseries",
      "targets": [
        {"expr": "histogram_quantile(0.5, rate(keeptrusts_gateway_request_duration_seconds_bucket[5m]))"},
        {"expr": "histogram_quantile(0.95, rate(keeptrusts_gateway_request_duration_seconds_bucket[5m]))"},
        {"expr": "histogram_quantile(0.99, rate(keeptrusts_gateway_request_duration_seconds_bucket[5m]))"}
      ]
    },
    {
      "title": "Policy Block Rate",
      "type": "stat",
      "targets": [{
        "expr": "sum(rate(keeptrusts_gateway_policy_evaluations_total{result='blocked'}[5m])) / sum(rate(keeptrusts_gateway_policy_evaluations_total[5m])) * 100"
      }]
    },
    {
      "title": "Active Connections",
      "type": "gauge",
      "targets": [{
        "expr": "keeptrusts_gateway_active_connections"
      }]
    }
  ]
}

Infrastructure Health Dashboard

{
  "title": "Keeptrusts Infrastructure Health",
  "panels": [
    {
      "title": "CPU Usage by Host",
      "targets": [{
        "expr": "100 - (avg by(instance)(rate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)"
      }]
    },
    {
      "title": "Memory Usage by Host",
      "targets": [{
        "expr": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"
      }]
    },
    {
      "title": "Disk Usage by Host",
      "targets": [{
        "expr": "(1 - node_filesystem_avail_bytes{mountpoint='/'} / node_filesystem_size_bytes{mountpoint='/'}) * 100"
      }]
    },
    {
      "title": "PostgreSQL Connections",
      "targets": [{
        "expr": "pg_stat_activity_count"
      }]
    }
  ]
}

Container Metrics

cAdvisor for Docker Deployments

# docker-compose.monitoring.yml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8081:8080"
    restart: unless-stopped

Key container metrics:

# Container CPU usage
rate(container_cpu_usage_seconds_total{name=~"keeptrusts.*"}[5m])

# Container memory
container_memory_usage_bytes{name=~"keeptrusts.*"}

# Container network I/O
rate(container_network_receive_bytes_total{name=~"keeptrusts.*"}[5m])
rate(container_network_transmit_bytes_total{name=~"keeptrusts.*"}[5m])

Alerting Rules

Prometheus Alert Rules

# alerts.yml
groups:
  - name: keeptrusts-gateway
    rules:
      - alert: GatewayHighLatency
        expr: histogram_quantile(0.99, rate(keeptrusts_gateway_request_duration_seconds_bucket[5m])) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway p99 latency exceeds 30s on {{ $labels.instance }}"

      - alert: GatewayHighErrorRate
        # rate() is per second; multiply by 60 to compare against the 10/min threshold
        expr: rate(keeptrusts_gateway_upstream_errors_total[5m]) * 60 > 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Gateway upstream errors exceed 10/min on {{ $labels.instance }}"

      - alert: GatewayDown
        expr: up{job="keeptrusts-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Gateway instance {{ $labels.instance }} is down"

  - name: keeptrusts-api
    rules:
      - alert: APIHighDBLatency
        expr: histogram_quantile(0.99, rate(keeptrusts_api_db_query_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API database p99 latency exceeds 500ms"

      - alert: APIDBPoolExhaustion
        expr: keeptrusts_api_db_pool_connections / keeptrusts_api_db_pool_max > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API database connection pool above 80%"

  - name: keeptrusts-infra
    rules:
      - alert: HostHighCPU
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning

      - alert: HostLowDisk
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.2
        for: 5m
        labels:
          severity: critical
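
Each rule above pairs a threshold with a `for:` duration, so a brief spike does not page anyone. The sketch below simulates that behaviour for a single series, using the GatewayHighLatency values (threshold 30, `for: 5m`) as an example. It is a simplified model of the inactive, pending, and firing states, with hypothetical sample data, and is not Prometheus code.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state at each evaluation.

    `samples` is a list of (timestamp_seconds, value) pairs in evaluation order.
    The alert is pending while the condition holds and becomes firing once it
    has held continuously for at least `for_seconds`.
    """
    states = []
    active_since = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held = timestamp - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            # Condition cleared: the pending timer resets.
            active_since = None
            states.append("inactive")
    return states


# Hypothetical p99 latency values, evaluated once per minute.
values = [10, 35, 35, 35, 35, 35, 35, 10]
samples = [(i * 60, v) for i, v in enumerate(values)]
for (ts, value), state in zip(samples, alert_states(samples, 30, 300)):
    print(f"t={ts:>3}s p99={value:>2}s -> {state}")
```

The condition first holds at t=60s, so the alert only fires at t=360s, five minutes later, and returns to inactive as soon as the value drops below the threshold.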

Log Aggregation

Pair metrics with structured log collection:

# docker-compose: container logs go to stdout; collect them with Loki or Fluentd
services:
  keeptrusts-gateway:
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "5"
        tag: "keeptrusts-gateway"

# Gateway structured logs: pipe to your log aggregator
kt gateway run --policy-config policy-config.yaml 2>&1 | \
  tee /var/log/keeptrusts/gateway.log

Verification

# Check Prometheus targets
curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Verify gateway metrics endpoint
curl -s http://localhost:41002/metrics | head -20

# Test alerting rule evaluation
curl -s http://prometheus:9090/api/v1/rules | jq '.data.groups[].rules[] | {name: .name, state: .state}'
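
The target check above can also be scripted, for example in a deployment smoke test. The sketch below uses only the Python standard library and the same `/api/v1/targets` endpoint as the `curl` command. The function names are hypothetical, the sample payload is made up, and `fetch_targets` is defined but not called here because it needs network access to Prometheus.

```python
import json
import urllib.request


def fetch_targets(base_url="http://prometheus:9090", timeout=5):
    """Fetch the targets payload from the Prometheus HTTP API."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)


def unhealthy_targets(payload):
    """Return (job, instance, last_error) for every active target that is not up."""
    rows = []
    for target in payload["data"]["activeTargets"]:
        if target.get("health") != "up":
            labels = target.get("labels", {})
            rows.append(
                (labels.get("job"), labels.get("instance"), target.get("lastError", ""))
            )
    return rows


# Made-up payload with the same shape as the API response, for illustration.
sample_payload = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "keeptrusts-gateway", "instance": "gateway-1:41002"},
             "health": "up", "lastError": ""},
            {"labels": {"job": "keeptrusts-gateway", "instance": "gateway-2:41002"},
             "health": "down", "lastError": "connection refused"},
        ]
    },
}
for job, instance, error in unhealthy_targets(sample_payload):
    print(f"{job} {instance} is not healthy: {error}")
```

Against a live server, `unhealthy_targets(fetch_targets())` returns an empty list when every scrape target is healthy.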

Next steps

For AI systems

  • Canonical terms: Keeptrusts monitoring, Prometheus scrape config, Grafana dashboards, gateway metrics endpoint, alerting thresholds, structured logging, container metrics.
  • Key config/commands: Prometheus scrape targets (gateway:41002/metrics, api:8080/metrics); 10s scrape interval for gateway; alerting rules for high error rate, connection pool saturation, disk growth; kt gateway run 2>&1 | tee /var/log/keeptrusts/gateway.log for structured logs.
  • Best next pages: Backup & Recovery, Capacity Sizing, Security Hardening.

For engineers

  • Prerequisites: Prometheus instance with network access to gateway and API /metrics endpoints; Grafana for visualization; optional PostgreSQL exporter.
  • Configure 10s scrape interval for gateway (latency-sensitive), 15s for API and PostgreSQL exporter.
  • Validate with: curl -s http://localhost:41002/metrics | head -20 to confirm the metrics endpoint; curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}' to check scrape health.
  • Key alert thresholds: gateway upstream errors > 10/min for 2 minutes; API database connection pool > 80% utilized for 5 minutes; root filesystem < 20% free for 5 minutes.

For leaders

  • Monitoring is essential for proving governance system availability to auditors and regulators.
  • Alerting on gateway health prevents silent policy enforcement failures that could result in undetected compliance violations.
  • Capacity decisions (when to scale, when to upgrade) depend on monitoring data — invest in dashboards early.
  • Structured logging feeds into SIEM systems for security operations and incident response.