Monitor AI Governance with Datadog
Datadog provides deep observability into your Keeptrusts deployment — from gateway performance metrics to policy enforcement trends and cost tracking. This guide covers custom metrics, log forwarding, dashboard templates, and alerting.
Use this page when
- You want to monitor Keeptrusts gateway performance and policy enforcement trends in Datadog.
- You need to set up DogStatsD metrics, log forwarding, and custom dashboards for AI governance.
- You are configuring anomaly detection alerts for unusual policy block spikes.
- You need SLO tracking for gateway availability and policy evaluation latency.
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
Architecture overview
Keeptrusts Gateway
→ StatsD / DogStatsD metrics → Datadog Agent → Datadog
→ Logs (stdout/stderr) → Datadog Agent log collection → Datadog
→ OTel Collector sidecar → Datadog Exporter → Datadog APM
Keeptrusts API
→ /v1/webhooks → Datadog Log Intake API (real-time events)
→ kt export → scheduled batch → Datadog Log Archives
Prerequisites
- Datadog account with API and application keys
- Datadog Agent installed on gateway hosts or as a Kubernetes DaemonSet
- Keeptrusts gateway running with logging enabled
Datadog Agent configuration
Kubernetes DaemonSet
# datadog-values.yaml (Helm)
datadog:
  apiKey: <DATADOG_API_KEY>
  appKey: <DATADOG_APP_KEY>
  logs:
    enabled: true
    containerCollectAll: true
  apm:
    portEnabled: true
  dogstatsd:
    useHostPort: true
    hostPortConfig:
      hostPort: 8125
agents:
  containers:
    agent:
      env:
        - name: DD_CONTAINER_LABELS_AS_TAGS
          value: '{"app":"service"}'
Install the Agent with Helm:
helm install datadog-agent datadog/datadog \
  -f datadog-values.yaml \
  --namespace monitoring
Host-based Agent
Add to /etc/datadog-agent/conf.d/keeptrusts.d/conf.yaml:
logs:
  - type: file
    path: /var/log/keeptrusts/gateway.log
    service: keeptrusts-gateway
    source: keeptrusts
    sourcecategory: ai-governance
Custom metrics
Gateway performance metrics
Configure the gateway to emit DogStatsD metrics:
kt gateway run \
  --config policy-config.yaml \
  --statsd-address 127.0.0.1:8125 \
  --statsd-prefix keeptrusts
Key metrics emitted:
| Metric | Type | Description |
|---|---|---|
| keeptrusts.gateway.requests | counter | Total requests processed |
| keeptrusts.gateway.latency | histogram | End-to-end request latency (ms) |
| keeptrusts.gateway.policy.blocks | counter | Requests blocked by policy |
| keeptrusts.gateway.policy.escalations | counter | Requests escalated |
| keeptrusts.gateway.policy.redactions | counter | Responses with redacted content |
| keeptrusts.gateway.upstream.latency | histogram | Upstream LLM provider latency (ms) |
| keeptrusts.gateway.upstream.errors | counter | Upstream provider errors |
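If metrics never arrive, it can help to know that DogStatsD metrics are plain UDP datagrams. A minimal standard-library sketch (host, port, and tag values are illustrative) that hand-builds the same counter and histogram formats the gateway emits:

```python
import socket

def dogstatsd_datagram(name: str, value: float, metric_type: str, tags: list[str]) -> str:
    """Build a DogStatsD wire-format datagram, e.g. 'name:1|c|#tag:a,tag:b'."""
    payload = f"{name}:{value}|{metric_type}"
    if tags:
        payload += "|#" + ",".join(tags)
    return payload

# One gateway request, counted ('c') and timed ('h'); values are illustrative
counter = dogstatsd_datagram("keeptrusts.gateway.requests", 1, "c", ["env:production"])
histogram = dogstatsd_datagram("keeptrusts.gateway.latency", 42.5, "h", ["env:production"])

# Send to the local DogStatsD listener the Agent opens on port 8125
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for datagram in (counter, histogram):
    sock.sendto(datagram.encode("ascii"), ("127.0.0.1", 8125))
```

Sending a hand-built datagram like this from a gateway host is a quick way to confirm the Agent's DogStatsD port is reachable before debugging the gateway itself.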
Custom metrics via DogStatsD
# Example: emit custom metrics from a monitoring script (datadogpy)
from datadog import initialize, statsd

# Point the client at the local DogStatsD listener
initialize(statsd_host='127.0.0.1', statsd_port=8125)

# After querying /v1/events
statsd.gauge('keeptrusts.events.pending_escalations', pending_count, tags=['env:production'])
statsd.increment('keeptrusts.events.exported', tags=['format:csv', 'env:production'])
Log forwarding
Real-time via webhook
Forward Keeptrusts events to the Datadog Log Intake API:
curl -X POST https://api.keeptrusts.com/v1/webhooks \
  -H "Authorization: Bearer $KEEPTRUSTS_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://http-intake.logs.datadoghq.com/api/v2/logs",
    "description": "Forward events to Datadog Logs",
    "event_types": ["event.*"],
    "active": true,
    "headers": {
      "DD-API-KEY": "<DATADOG_API_KEY>",
      "Content-Type": "application/json"
    }
  }'
Log pipeline
Create a Datadog log pipeline for Keeptrusts events:
- Go to Logs → Configuration → Pipelines
- Create a new pipeline with the filter source:keeptrusts
- Add processors:
  - Grok Parser: extract action, policy_name, and model from the event JSON
  - Category Processor: map action=block → severity=error and action=escalate → severity=warning
  - Remapper: set timestamp as the official log date
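The Category Processor rule amounts to a small lookup. A sketch of the equivalent mapping for clarity (the 'info' default for other actions is an assumption, not documented Keeptrusts behavior):

```python
# Map a Keeptrusts event's action to a Datadog log severity,
# mirroring the Category Processor rules above.
SEVERITY_BY_ACTION = {
    "block": "error",
    "escalate": "warning",
}

def enrich(event: dict) -> dict:
    """Return a copy of the event with a severity field based on its action."""
    event = dict(event)  # don't mutate the caller's event
    event["severity"] = SEVERITY_BY_ACTION.get(event.get("action"), "info")
    return event

print(enrich({"action": "block", "policy_name": "no-pii", "model": "gpt-4o"}))
```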
Dashboard template
Create a Keeptrusts governance dashboard with these widgets:
Recommended widgets
| Widget | Query | Visualization |
|---|---|---|
| Request volume | sum:keeptrusts.gateway.requests{*}.as_count() | Timeseries |
| Block rate | (sum:keeptrusts.gateway.policy.blocks / sum:keeptrusts.gateway.requests) * 100 | Query value (%) |
| P95 latency | p95:keeptrusts.gateway.latency{*} | Timeseries |
| Policy blocks by name | sum:keeptrusts.gateway.policy.blocks{*} by {policy_name}.as_count() | Top list |
| Upstream errors | sum:keeptrusts.gateway.upstream.errors{*} by {provider}.as_count() | Bar chart |
| Escalation trend | sum:keeptrusts.gateway.policy.escalations{*}.as_count() | Timeseries |
| Model usage | sum:keeptrusts.gateway.requests{*} by {model}.as_count() | Pie chart |
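The block-rate widget is just a ratio of two counters; a quick sketch of the same arithmetic, useful when sanity-checking the query value against raw counts:

```python
def block_rate_percent(blocks: int, requests: int) -> float:
    """Policy block rate as a percentage, matching the query-value widget."""
    if requests == 0:
        return 0.0
    return round(blocks / requests * 100, 2)

# e.g. 37 blocked out of 1,480 requests
print(block_rate_percent(37, 1480))  # 2.5
```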
Dashboard JSON (import)
{
  "title": "Keeptrusts AI Governance",
  "description": "Real-time AI governance monitoring",
  "widgets": [
    {
      "definition": {
        "title": "Request Volume",
        "type": "timeseries",
        "requests": [
          {
            "q": "sum:keeptrusts.gateway.requests{env:production}.as_count()",
            "display_type": "bars"
          }
        ]
      }
    },
    {
      "definition": {
        "title": "Policy Block Rate",
        "type": "query_value",
        "requests": [
          {
            "q": "(sum:keeptrusts.gateway.policy.blocks{env:production}.as_count() / sum:keeptrusts.gateway.requests{env:production}.as_count()) * 100",
            "aggregator": "avg"
          }
        ],
        "precision": 2,
        "custom_unit": "%"
      }
    }
  ]
}
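The dashboard can also be created through the Datadog Dashboards API (POST /api/v1/dashboard) instead of the UI import. A standard-library sketch that builds the request without sending it; the abbreviated payload and placeholder keys are illustrative, and the layout_type field is added here because the API requires it:

```python
import json
import urllib.request

# Placeholder credentials; in practice read them from a secrets manager
DD_API_KEY = "<DATADOG_API_KEY>"
DD_APP_KEY = "<DATADOG_APP_KEY>"

# Abbreviated payload; use the full dashboard JSON shown above
dashboard = {
    "title": "Keeptrusts AI Governance",
    "description": "Real-time AI governance monitoring",
    "layout_type": "ordered",  # required by the API, not by UI import
    "widgets": [],
}

request = urllib.request.Request(
    "https://api.datadoghq.com/api/v1/dashboard",
    data=json.dumps(dashboard).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "DD-API-KEY": DD_API_KEY,
        "DD-APPLICATION-KEY": DD_APP_KEY,
    },
    method="POST",
)
# urllib.request.urlopen(request)  # uncomment to actually create the dashboard
```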
Anomaly detection
Set up anomaly monitors for unusual AI usage patterns:
Metric: keeptrusts.gateway.policy.blocks
Algorithm: agile
Deviations: 3
Window: 1h
Alert: "Anomalous spike in AI policy blocks detected"
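In Datadog's monitor query language, this spec maps to the anomalies() function. A sketch of a Monitors API payload built from the values above (the notification handle and threshold options are illustrative):

```python
import json

# Monitor query combining the metric, algorithm, and deviation count above:
# anomalies() wraps the metric query with the algorithm name and bound width.
query = (
    "avg(last_1h):anomalies("
    "sum:keeptrusts.gateway.policy.blocks{*}.as_count(), 'agile', 3"
    ") >= 1"
)

monitor = {
    "name": "Anomalous spike in AI policy blocks detected",
    "type": "query alert",
    "query": query,
    "message": "Unusual policy block volume; investigate recent prompts. @slack-ai-governance",
    "options": {"thresholds": {"critical": 1.0}},
}

print(json.dumps(monitor, indent=2))
```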
Create the monitor:
- Go to Monitors → New Monitor → Anomaly
- Select the metric keeptrusts.gateway.policy.blocks
- Set the algorithm to Agile and deviations to 3
- Configure notifications to your Slack channel or PagerDuty
SLO tracking
Track AI governance SLOs:
| SLO | Target | Metric |
|---|---|---|
| Gateway availability | 99.9% | keeptrusts.gateway.requests with no 5xx |
| Policy evaluation latency | P95 < 100ms | keeptrusts.gateway.latency |
| Event delivery success | 99.5% | Webhook delivery success rate |
SLO: Gateway Availability
Type: Monitor-based
Monitor: "Keeptrusts Gateway Health Check"
Target: 99.9% over 30 days
Warning: 99.95%
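The 99.9% target translates into a concrete monthly error budget; the arithmetic:

```python
def error_budget_minutes(target_pct: float, days: int = 30) -> float:
    """Minutes of allowed downtime for an availability target over a window."""
    total_minutes = days * 24 * 60  # 43,200 minutes in 30 days
    return round(total_minutes * (1 - target_pct / 100), 1)

print(error_budget_minutes(99.9))   # 43.2 minutes over 30 days
print(error_budget_minutes(99.95))  # 21.6 minutes (the warning threshold)
```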
Cost monitoring
Track LLM spend through Keeptrusts metrics:
# Query cost data from the API
kt events tail --format json --limit 100 \
| jq '[.[] | .estimated_cost] | add'
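The same aggregation in Python, for scripts that already pull events via the API (the estimated_cost field is taken from the jq filter above; sample values are illustrative):

```python
# Sum estimated cost across a batch of events, mirroring the jq pipeline above.
def total_cost(events: list[dict]) -> float:
    return round(sum(e.get("estimated_cost", 0.0) for e in events), 4)

events = [
    {"model": "gpt-4o", "estimated_cost": 0.0123},
    {"model": "gpt-4o-mini", "estimated_cost": 0.0009},
]
print(total_cost(events))  # 0.0132
```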
Create a Datadog custom metric for cost tracking:
# Emit cost metrics from export data (datadogpy)
from datadog import initialize, statsd

initialize(statsd_host='127.0.0.1', statsd_port=8125)

statsd.gauge('keeptrusts.cost.daily_spend', daily_total, tags=['env:production', 'team:engineering'])
statsd.gauge('keeptrusts.cost.per_model', model_cost, tags=['model:gpt-4o', 'env:production'])
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| No metrics in Datadog | DogStatsD not reachable | Verify Agent is running and port 8125 is accessible |
| Logs missing fields | Pipeline parser misconfigured | Check Grok pattern matches Keeptrusts event JSON |
| Dashboard shows no data | Wrong metric name or tag filter | Verify metric names with datadog-agent status |
| Anomaly alerts too noisy | Deviation threshold too low | Increase deviations to 4 or use robust algorithm |
For AI systems
- Canonical terms: Keeptrusts gateway, DogStatsD, --statsd-address, --statsd-prefix keeptrusts, Datadog Agent, log pipeline, anomaly detection, SLO.
- Key metrics: keeptrusts.gateway.requests, keeptrusts.gateway.latency, keeptrusts.gateway.policy.blocks, keeptrusts.gateway.upstream.latency.
- Integration methods: DogStatsD (real-time metrics), Datadog Agent log collection (structured logs), OTel Collector exporter (APM traces), webhook to Log Intake API (events).
- Best next pages: SIEM integration, PagerDuty incident response, Kubernetes deployment.
For engineers
- Prerequisites: Datadog account with API/app keys, Datadog Agent running (DaemonSet or host-based), gateway started with --statsd-address 127.0.0.1:8125.
- Validate: check datadog-agent status for metric collection, and verify metrics appear under keeptrusts.* in Metrics Explorer.
- Log pipeline: create a pipeline with the filter source:keeptrusts, add a Grok parser for the event JSON, and remap the timestamp.
- Alert tuning: start anomaly detection with the Agile algorithm and 3 deviations; increase to 4 if too noisy.
For leaders
- Visibility: Real-time dashboards show policy enforcement rates, block trends, and LLM spend across teams and models.
- SLOs: Track gateway availability (99.9% target) and policy evaluation latency (P95 < 100ms) with monthly error budget tracking.
- Cost insight: Custom metrics expose daily/weekly AI spend by team, model, and environment for chargeback.
- Incident readiness: Anomaly alerts detect unusual block spikes before they become user-reported incidents.
Next steps
- Feed events to your SIEM for security correlation
- Automate incident response with Datadog-PagerDuty integration
- Deploy on Kubernetes with Datadog Agent DaemonSet