Infrastructure Monitoring for AI Systems
Reliable AI governance depends on healthy infrastructure. This guide covers Prometheus metric collection, Grafana dashboards, host and container monitoring, and alerting thresholds tailored to the Keeptrusts platform.
Use this page when
- You need to set up Prometheus scrape configs for Keeptrusts gateway, API, and PostgreSQL metrics.
- You are building Grafana dashboards for gateway throughput, policy evaluation latency, and event ingest rates.
- You need alerting thresholds for gateway health, database connection saturation, or disk growth.
- You want structured logging configuration for log aggregation from gateway and API processes.
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
Metrics Architecture
```
┌────────────┐    ┌────────────┐    ┌─────────────┐
│  Gateway   │    │ API Server │    │ PostgreSQL  │
│  /metrics  │    │  /metrics  │    │  exporter   │
└──────┬─────┘    └──────┬─────┘    └──────┬──────┘
       └─────────────────┼─────────────────┘
                         ▼
                ┌────────────────┐
                │   Prometheus   │
                │    (scrape)    │
                └───────┬────────┘
                        ▼
                ┌────────────────┐
                │    Grafana     │
                │  (dashboards)  │
                └────────────────┘
```
Prometheus Configuration
Scrape Config for Keeptrusts Components
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'keeptrusts-gateway'
    static_configs:
      - targets:
          - 'gateway-1:41002'
          - 'gateway-2:41002'
    metrics_path: /metrics
    scrape_interval: 10s

  - job_name: 'keeptrusts-api'
    static_configs:
      - targets: ['api-server:8080']
    metrics_path: /metrics
    scrape_interval: 15s

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter:9187']

  - job_name: 'node'
    static_configs:
      - targets:
          - 'gateway-1:9100'
          - 'gateway-2:9100'
          - 'api-server:9100'
          - 'db-server:9100'
```
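Once these jobs are scraping, target health is visible under Prometheus's `/api/v1/targets` endpoint (the same endpoint used in the Verification section below). A small Python helper for flagging unhealthy targets, assuming the standard v1 API response shape:

```python
def unhealthy_targets(targets_response: dict) -> list[dict]:
    """Return {job, instance, lastError} for every active target not reporting 'up'."""
    return [
        {
            "job": t["labels"].get("job", ""),
            "instance": t["labels"].get("instance", ""),
            "lastError": t.get("lastError", ""),
        }
        for t in targets_response["data"]["activeTargets"]
        if t.get("health") != "up"
    ]

# Example fragment in the shape returned by GET /api/v1/targets.
# Live usage: json.load(urllib.request.urlopen("http://prometheus:9090/api/v1/targets"))
sample = {
    "data": {
        "activeTargets": [
            {"labels": {"job": "keeptrusts-gateway", "instance": "gateway-1:41002"},
             "health": "up"},
            {"labels": {"job": "postgres", "instance": "postgres-exporter:9187"},
             "health": "down", "lastError": "connection refused"},
        ]
    }
}
print(unhealthy_targets(sample))
```

This is a sketch for ad-hoc checks; for ongoing coverage, prefer the `up` metric and the `GatewayDown`-style alert rules shown later in this page.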
Service Discovery with Docker
```yaml
scrape_configs:
  - job_name: 'docker-keeptrusts'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 30s
    relabel_configs:
      # Only scrape containers labeled com.keeptrusts.metrics=true
      - source_labels: [__meta_docker_container_label_com_keeptrusts_metrics]
        regex: 'true'
        action: keep
      - source_labels: [__meta_docker_container_name]
        target_label: container_name
```
Key Metrics
Gateway Metrics
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `keeptrusts_gateway_requests_total` | Counter | Total requests processed | N/A |
| `keeptrusts_gateway_request_duration_seconds` | Histogram | Request latency | p99 > 30s |
| `keeptrusts_gateway_policy_evaluations_total` | Counter | Policy evaluations by result | blocked/total > 50% |
| `keeptrusts_gateway_upstream_errors_total` | Counter | Upstream provider errors | > 10/min |
| `keeptrusts_gateway_active_connections` | Gauge | Current open connections | > 80% of max |
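The counters above are cumulative, which is why every dashboard and alert expression wraps them in `rate()` to get a per-second value. The arithmetic behind `rate(keeptrusts_gateway_requests_total[5m])` can be sketched as follows (a simplification: real PromQL `rate()` also corrects for counter resets and extrapolates to the window boundaries):

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase between the first and last (timestamp, value) samples
    of a cumulative counter. Simplified relative to PromQL rate()."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

# Five samples of keeptrusts_gateway_requests_total over a 60s window
samples = [(0, 1000), (15, 1150), (30, 1300), (45, 1450), (60, 1600)]
print(simple_rate(samples))  # 10.0 requests/second
```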
API Server Metrics
| Metric | Type | Description | Alert Threshold |
|---|---|---|---|
| `keeptrusts_api_request_duration_seconds` | Histogram | API endpoint latency | p99 > 2s |
| `keeptrusts_api_events_ingested_total` | Counter | Events written | Rate drop > 90% |
| `keeptrusts_api_db_pool_connections` | Gauge | Active DB connections | > 80% of pool |
| `keeptrusts_api_db_query_duration_seconds` | Histogram | Database query time | p99 > 500ms |
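The p99 thresholds above come from `histogram_quantile()`, which estimates a quantile from the cumulative `_bucket` series by linear interpolation within the matching bucket. A minimal Python sketch of that estimation, assuming monotonic cumulative buckets as Prometheus exports them:

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets,
    interpolating linearly inside the matching bucket, as PromQL does."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            # Interpolate between the bucket's lower and upper bounds
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical cumulative buckets for keeptrusts_api_db_query_duration_seconds
buckets = [(0.1, 800), (0.25, 950), (0.5, 990), (1.0, 1000)]
print(histogram_quantile(0.99, buckets))  # ≈ 0.5 — p99 falls in the 0.25–0.5s bucket
```

Note this is an estimate: the accuracy of any reported p99 depends on how finely the histogram's bucket boundaries were chosen around the threshold you alert on.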
Host Metrics (node_exporter)
| Metric | Alert Threshold |
|---|---|
| `node_cpu_seconds_total` (idle) | CPU usage > 85% sustained |
| `node_memory_MemAvailable_bytes` | Available memory < 15% |
| `node_filesystem_avail_bytes` | Disk < 20% free |
| `node_network_receive_errs_total` | > 0 sustained |
Grafana Dashboards
Gateway Overview Dashboard
```json
{
  "title": "Keeptrusts Gateway Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [{
        "expr": "rate(keeptrusts_gateway_requests_total[5m])",
        "legendFormat": "{{instance}}"
      }]
    },
    {
      "title": "Request Latency (p50 / p95 / p99)",
      "type": "timeseries",
      "targets": [
        {"expr": "histogram_quantile(0.5, rate(keeptrusts_gateway_request_duration_seconds_bucket[5m]))"},
        {"expr": "histogram_quantile(0.95, rate(keeptrusts_gateway_request_duration_seconds_bucket[5m]))"},
        {"expr": "histogram_quantile(0.99, rate(keeptrusts_gateway_request_duration_seconds_bucket[5m]))"}
      ]
    },
    {
      "title": "Policy Block Rate",
      "type": "stat",
      "targets": [{
        "expr": "rate(keeptrusts_gateway_policy_evaluations_total{result='blocked'}[5m]) / rate(keeptrusts_gateway_policy_evaluations_total[5m]) * 100"
      }]
    },
    {
      "title": "Active Connections",
      "type": "gauge",
      "targets": [{
        "expr": "keeptrusts_gateway_active_connections"
      }]
    }
  ]
}
```
Infrastructure Health Dashboard
```json
{
  "title": "Keeptrusts Infrastructure Health",
  "panels": [
    {
      "title": "CPU Usage by Host",
      "targets": [{
        "expr": "100 - (avg by(instance)(rate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)"
      }]
    },
    {
      "title": "Memory Usage by Host",
      "targets": [{
        "expr": "(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100"
      }]
    },
    {
      "title": "Disk Usage by Host",
      "targets": [{
        "expr": "(1 - node_filesystem_avail_bytes{mountpoint='/'} / node_filesystem_size_bytes{mountpoint='/'}) * 100"
      }]
    },
    {
      "title": "PostgreSQL Connections",
      "targets": [{
        "expr": "pg_stat_activity_count"
      }]
    }
  ]
}
```
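The memory and disk panels above are simple ratios. The same arithmetic in Python, using hypothetical `node_exporter` readings for one host:

```python
def pct_used(available: float, total: float) -> float:
    """Percentage used, mirroring (1 - avail / total) * 100 from the dashboard."""
    return (1 - available / total) * 100

# Hypothetical readings for one host
mem_available, mem_total = 4 * 2**30, 16 * 2**30    # 4 GiB of 16 GiB available
disk_avail, disk_size = 30 * 2**30, 240 * 2**30     # 30 GiB of 240 GiB free

print(pct_used(mem_available, mem_total))  # 75.0 — memory usage %
print(pct_used(disk_avail, disk_size))     # 87.5 — disk usage %
```

Both values here would trip the thresholds in the host metrics table (available memory below 15%, disk below 20% free).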
Container Metrics
cAdvisor for Docker Deployments
```yaml
# docker-compose.monitoring.yml
services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - "8081:8080"
    restart: unless-stopped
```
Key container metrics:
```promql
# Container CPU usage
rate(container_cpu_usage_seconds_total{name=~"keeptrusts.*"}[5m])

# Container memory
container_memory_usage_bytes{name=~"keeptrusts.*"}

# Container network I/O
rate(container_network_receive_bytes_total{name=~"keeptrusts.*"}[5m])
rate(container_network_transmit_bytes_total{name=~"keeptrusts.*"}[5m])
```
Alerting Rules
Prometheus Alert Rules
```yaml
# alerts.yml
groups:
  - name: keeptrusts-gateway
    rules:
      - alert: GatewayHighLatency
        expr: histogram_quantile(0.99, rate(keeptrusts_gateway_request_duration_seconds_bucket[5m])) > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway p99 latency exceeds 30s on {{ $labels.instance }}"
      - alert: GatewayHighErrorRate
        # rate() is per-second; multiply by 60 to alert on the 10/min threshold
        expr: rate(keeptrusts_gateway_upstream_errors_total[5m]) * 60 > 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Gateway upstream error rate above 10/min on {{ $labels.instance }}"
      - alert: GatewayDown
        expr: up{job="keeptrusts-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Gateway instance {{ $labels.instance }} is down"

  - name: keeptrusts-api
    rules:
      - alert: APIHighDBLatency
        expr: histogram_quantile(0.99, rate(keeptrusts_api_db_query_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API database p99 latency exceeds 500ms"
      - alert: APIDBPoolExhaustion
        expr: keeptrusts_api_db_pool_connections / keeptrusts_api_db_pool_max > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "API database connection pool above 80%"

  - name: keeptrusts-infra
    rules:
      - alert: HostHighCPU
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 85% on {{ $labels.instance }}"
      - alert: HostLowDisk
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Less than 20% disk space free on {{ $labels.instance }}"
```
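The `for:` clause in these rules means an alert fires only after its expression has been continuously true for the stated duration; until then it sits in a pending state, and any false evaluation resets the clock. A simplified Python model of that behavior:

```python
def alert_state(breaches: list[bool], step_s: int, for_s: int) -> str:
    """Walk one evaluation result per step and return the final alert state.

    Simplified model of the Prometheus 'for:' clause: a false evaluation
    resets the pending timer; firing requires for_s of continuous breaches.
    """
    pending_since = None
    state = "inactive"
    for i, breached in enumerate(breaches):
        now = i * step_s
        if not breached:
            pending_since, state = None, "inactive"
        elif pending_since is None:
            pending_since, state = now, "pending"
        elif now - pending_since >= for_s:
            state = "firing"
    return state

# 15s evaluation interval, for: 5m — 21 consecutive breaches span 300s
print(alert_state([True] * 21, step_s=15, for_s=300))                          # firing
# A single recovery mid-window resets the timer, so the alert stays pending
print(alert_state([True] * 10 + [False] + [True] * 10, step_s=15, for_s=300))  # pending
```

This is why a `for: 5m` rule tolerates brief spikes: the breach must hold across every evaluation in the window before anything pages.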
Log Aggregation
Pair metrics with structured log collection:
```yaml
# docker-compose logs to stdout — collect with Loki or Fluentd
services:
  keeptrusts-gateway:
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "5"
        tag: "keeptrusts-gateway"
```

```bash
# Gateway structured logs — pipe to your log aggregator
kt gateway run --policy-config policy-config.yaml 2>&1 | \
  tee /var/log/keeptrusts/gateway.log
```
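If the gateway emits JSON-structured log lines (an assumption — verify the format in your deployment), downstream filtering is straightforward. A sketch using hypothetical log records:

```python
import json

def filter_level(lines: list[str], level: str) -> list[dict]:
    """Parse JSON log lines, keeping records at the given level.
    Non-JSON lines (e.g. startup banners) are skipped, not raised on."""
    records = []
    for line in lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue
        if rec.get("level") == level:
            records.append(rec)
    return records

# Hypothetical gateway log lines
log_lines = [
    '{"ts": "2024-05-01T12:00:00Z", "level": "info", "msg": "request allowed"}',
    '{"ts": "2024-05-01T12:00:01Z", "level": "error", "msg": "upstream timeout"}',
    'plain text line from a startup banner',
]
print(filter_level(log_lines, "error"))
```

The same level/timestamp fields are what Loki or Fluentd would index, so keeping log output strictly one JSON object per line pays off across the whole pipeline.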
Verification
```bash
# Check Prometheus targets
curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'

# Verify gateway metrics endpoint
curl -s http://localhost:41002/metrics | head -20

# Test alerting rule evaluation
curl -s http://prometheus:9090/api/v1/rules | jq '.data.groups[].rules[] | {name: .name, state: .state}'
```
Next steps
- Backup & Recovery — protect the data behind your dashboards
- Capacity Sizing — use monitoring data to inform scaling decisions
- Security Hardening — secure your monitoring stack
For AI systems
- Canonical terms: Keeptrusts monitoring, Prometheus scrape config, Grafana dashboards, gateway metrics endpoint, alerting thresholds, structured logging, container metrics.
- Key config/commands: Prometheus scrape targets (`gateway:41002/metrics`, `api:8080/metrics`); 10s scrape interval for gateway; alerting rules for high error rate, connection pool saturation, disk growth; `kt gateway run 2>&1 | tee /var/log/keeptrusts/gateway.log` for structured logs.
- Best next pages: Backup & Recovery, Capacity Sizing, Security Hardening.
For engineers
- Prerequisites: Prometheus instance with network access to gateway and API `/metrics` endpoints; Grafana for visualization; optional PostgreSQL exporter.
- Configure a 10s scrape interval for the gateway (latency-sensitive), 15s for the API and PostgreSQL exporter.
- Validate with `curl -s http://localhost:41002/metrics | head -20` to confirm the metrics endpoint, and `curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets[] | {job, health}'` to check scrape health.
- Key alert thresholds: gateway upstream error rate > 10/min for 2 minutes; PostgreSQL connection pool > 80% utilized; disk growth projecting full within 7 days.
For leaders
- Monitoring is essential for proving governance system availability to auditors and regulators.
- Alerting on gateway health prevents silent policy enforcement failures that could result in undetected compliance violations.
- Capacity decisions (when to scale, when to upgrade) depend on monitoring data — invest in dashboards early.
- Structured logging feeds into SIEM systems for security operations and incident response.