Observability Configuration
The `callbacks:` section sends per-request telemetry to external observability platforms. The `health_monitor:` section enables background health probing of upstream providers.
Use this page when
- You need the exact command, config, API, or integration details for Observability Configuration.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- You want a reference page rather than a guided rollout (for the latter, use the linked workflow pages under Next steps).
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Callbacks
Callbacks are dispatched asynchronously after each proxied request. Six sink types are supported.
```yaml
callbacks:
  - type: "langfuse"
    host: "https://cloud.langfuse.com"
    public_key_env: "LANGFUSE_PUBLIC_KEY"
    secret_key_env: "LANGFUSE_SECRET_KEY"
```
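The dispatch model can be sketched as a small queue-and-worker loop: the request path enqueues an event and returns immediately, and a background worker fans the event out to every configured sink. This is an illustrative sketch, not Keeptrusts code; the class and method names are made up.

```python
import queue
import threading

class CallbackDispatcher:
    """Illustrative async fan-out: enqueue on the hot path, deliver off-thread."""

    def __init__(self, sinks):
        self.sinks = sinks                 # list of callables, one per sink
        self.events = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def emit(self, event: dict) -> None:
        """Non-blocking; called after each proxied request."""
        self.events.put(event)

    def _drain(self) -> None:
        while True:
            event = self.events.get()
            for sink in self.sinks:
                try:
                    sink(event)            # one failing sink must not affect others
                except Exception:
                    pass
            self.events.task_done()

received = []
dispatcher = CallbackDispatcher([received.append])
dispatcher.emit({"model": "gpt-4o", "latency_ms": 412})
dispatcher.events.join()                   # demo only: wait for the worker
```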
Langfuse
```yaml
callbacks:
  - type: "langfuse"
    host: "https://cloud.langfuse.com"     # default
    public_key: "pk-lf-..."                # OR public_key_env
    public_key_env: "LANGFUSE_PUBLIC_KEY"
    secret_key: "sk-lf-..."                # OR secret_key_env
    secret_key_env: "LANGFUSE_SECRET_KEY"
```
Sends traces with input/output messages, token counts, latency, model, and policy decisions.
Datadog
```yaml
callbacks:
  - type: "datadog"
    secret_key_ref:
      env: "DD_API_KEY"
    site: "datadoghq.com"             # default
    service_name: "keeptrusts-proxy"  # default
    tags:
      - "env:production"
      - "team:platform"
```
Sends APM-style spans with model, provider, latency, token usage, and policy verdicts.
Prometheus
```yaml
callbacks:
  - type: "prometheus"
```
Exposes a `/metrics` scrape endpoint with these counters and histograms:

| Metric | Type | Labels |
|---|---|---|
| `keeptrusts_llm_requests_total` | counter | model, provider, status, verdict |
| `keeptrusts_llm_tokens_total` | counter | model, provider, direction (input/output) |
| `keeptrusts_llm_cost_total` | counter | model, provider |
| `keeptrusts_llm_latency_seconds_sum` | histogram | model, provider |
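As a rough illustration of consuming these metrics, here is a minimal parser for one counter in the Prometheus text exposition format. The sample payload is made up; a real consumer would scrape the gateway's `/metrics` endpoint instead.

```python
# Illustrative /metrics payload (not real output from the gateway).
sample = """\
# TYPE keeptrusts_llm_requests_total counter
keeptrusts_llm_requests_total{model="gpt-4o",provider="openai",status="200",verdict="allow"} 42
keeptrusts_llm_requests_total{model="gpt-4o",provider="openai",status="200",verdict="block"} 3
"""

def counter_sum(text: str, name: str) -> float:
    """Sum every label combination of one counter in Prometheus text format."""
    total = 0.0
    for line in text.splitlines():
        # Metric lines start with the metric name; comment lines start with '#'.
        if line.startswith(name):
            total += float(line.rsplit(" ", 1)[1])
    return total

print(counter_sum(sample, "keeptrusts_llm_requests_total"))  # 45.0
```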
Helicone
```yaml
callbacks:
  - type: "helicone"
    secret_key_ref:
      env: "HELICONE_API_KEY"
```
Injects the Helicone-Auth header into upstream requests for automatic logging.
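A sketch of that injection, assuming the conventional `Helicone-Auth: Bearer <key>` header format; the helper name is hypothetical, not a Keeptrusts function.

```python
import os

def build_upstream_headers(base: dict, helicone_key) -> dict:
    """Copy the outgoing headers and add Helicone-Auth when a key is configured."""
    headers = dict(base)
    if helicone_key:
        headers["Helicone-Auth"] = f"Bearer {helicone_key}"
    return headers

headers = build_upstream_headers(
    {"Authorization": "Bearer sk-..."},
    helicone_key=os.environ.get("HELICONE_API_KEY", "hk-test"),
)
```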
Braintrust
```yaml
callbacks:
  - type: "braintrust"
    secret_key_ref:
      env: "BRAINTRUST_API_KEY"
    project_name: "my-project"  # default: "default"
```
Webhook
Send events to any HTTP endpoint with optional HMAC-SHA256 signing.
```yaml
callbacks:
  - type: "webhook"
    url: "https://my-service.example.com/events"
    signing_secret_env: "WEBHOOK_SECRET"  # HMAC-SHA256 signing
    headers:
      X-Custom-Header: "value"
    event_filter:
      types:
        - "block"
        - "escalation"
        - "policy_violation"
        - "quality_failure"
        - "request"
      metadata_match:
        environment: "production"
```
Webhook event filter types:
| Type | When |
|---|---|
| `request` | Every proxied request (default) |
| `block` | Request was blocked by a policy |
| `escalation` | Request was escalated for human review |
| `policy_violation` | A policy triggered a non-blocking violation |
| `quality_failure` | Quality assertion failed |
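On the receiving side, a signed delivery can be verified by recomputing the HMAC-SHA256 over the raw request body with the shared secret. The signature header name and hex-digest encoding in this sketch are assumptions; confirm them against an actual delivery before relying on it.

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Recompute the body HMAC and compare in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)

# Simulate a delivery signed with the shared secret.
secret = b"change-me"
body = b'{"type":"block","model":"gpt-4o"}'
sig = hmac.new(secret, body, hashlib.sha256).hexdigest()

assert verify_signature(secret, body, sig)
assert not verify_signature(secret, b'{"type":"request"}', sig)  # tampered body
```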
Privacy controls
Callbacks support privacy scrubbing before dispatch:
- `redact_message_bodies` — Strip request/response message content
- `redact_user` — Strip user identity headers
- Payload fidelity modes:
  - `full` — All data
  - `identity` — Metadata + user identity, no message content
  - `event_only` — Metadata only, no content or identity
These are controlled at the provider level via `providers.logging`:

```yaml
providers:
  logging:
    redact_message_bodies: true
    redact_api_keys: true
```
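The fidelity modes can be pictured as a scrubbing step applied to each event before dispatch. This is an illustrative sketch; the event field names are assumptions, not the gateway's actual schema.

```python
def scrub(event: dict, mode: str) -> dict:
    """Apply one payload fidelity mode before handing the event to a sink."""
    if mode == "full":
        return dict(event)                       # all data
    # Drop message content for both reduced modes.
    scrubbed = {k: v for k, v in event.items() if k not in ("input", "output")}
    if mode == "identity":
        return scrubbed                          # metadata + user identity
    if mode == "event_only":
        scrubbed.pop("user", None)               # metadata only
        return scrubbed
    raise ValueError(f"unknown fidelity mode: {mode}")

event = {"model": "gpt-4o", "user": "alice", "input": "hi", "output": "hello"}
print(scrub(event, "event_only"))  # {'model': 'gpt-4o'}
```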
Multiple callbacks
You can combine multiple callback sinks. Each receives the same event independently.
```yaml
callbacks:
  - type: "prometheus"
  - type: "langfuse"
    public_key_env: "LANGFUSE_PUBLIC_KEY"
    secret_key_env: "LANGFUSE_SECRET_KEY"
  - type: "webhook"
    url: "https://slack-webhook.example.com/events"
    event_filter:
      types: ["block", "escalation"]
```
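Per-sink event filtering can be sketched as a type check plus a metadata subset match. The event shape assumed here (a `type` field and a `metadata` dict) is illustrative.

```python
def matches(event: dict, types: list, metadata_match: dict) -> bool:
    """Deliver only when the event type is listed and all metadata keys match."""
    if types and event.get("type") not in types:
        return False
    meta = event.get("metadata", {})
    return all(meta.get(k) == v for k, v in metadata_match.items())

# Filter equivalent to: types: ["block", "escalation"], metadata_match: {environment: production}
flt_types = ["block", "escalation"]
flt_meta = {"environment": "production"}

assert matches({"type": "block", "metadata": {"environment": "production"}}, flt_types, flt_meta)
assert not matches({"type": "request", "metadata": {"environment": "production"}}, flt_types, flt_meta)
```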
Health monitor
The `health_monitor:` section runs background probes against provider endpoints and raises alerts on sustained failures.
```yaml
health_monitor:
  unhealthy_threshold: 3
  alert_callback_urls:
    - "https://pagerduty.example.com/events"
    - "https://slack.example.com/webhook"
  providers:
    - name: "openai"
      endpoint: "https://api.openai.com/v1/models"
      interval_seconds: 60
      timeout_ms: 5000
    - name: "anthropic"
      endpoint: "https://api.anthropic.com/v1/models"
      interval_seconds: 60
      timeout_ms: 5000
```
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `unhealthy_threshold` | integer | no | 3 | Consecutive failures before marking unhealthy |
| `alert_callback_urls` | string[] | no | [] | Webhook URLs for status-change alerts |
| `providers[].name` | string | yes | — | Provider identifier for logging |
| `providers[].endpoint` | string | yes | — | URL to probe (GET request) |
| `providers[].interval_seconds` | integer | no | 60 | Seconds between probes |
| `providers[].timeout_ms` | integer | no | 5000 | Probe timeout in milliseconds |
The health monitor:
- Runs a background Tokio task per provider
- Sends HTTP GET to the endpoint at the configured interval
- After `unhealthy_threshold` consecutive failures, marks the provider unhealthy
- POSTs an alert event (JSON) to each URL in `alert_callback_urls`
- Continues probing; marks the provider healthy again after one success
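The failure-counting behavior above amounts to a small state machine: consecutive failures accumulate, and a single success resets the provider to healthy. This sketch is illustrative, not the gateway's Rust implementation.

```python
class ProviderHealth:
    """Track one provider's probe results against an unhealthy threshold."""

    def __init__(self, unhealthy_threshold: int = 3):
        self.threshold = unhealthy_threshold
        self.failures = 0
        self.healthy = True

    def record(self, probe_ok: bool) -> bool:
        """Apply one probe result; return True when the status changed (alert)."""
        before = self.healthy
        if probe_ok:
            self.failures = 0
            self.healthy = True
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.healthy = False
        return self.healthy != before

h = ProviderHealth(unhealthy_threshold=3)
changes = [h.record(ok) for ok in (False, False, False, True)]
print(changes)  # [False, False, True, True]; the two True entries are alerts
```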
Complete observability example
```yaml
pack:
  name: observable-gateway
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: openai-prod
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: anthropic-backup
      provider: anthropic
      model: claude-sonnet-4-20250514
      secret_key_ref:
        env: ANTHROPIC_API_KEY
  logging:
    redact_message_bodies: true
    redact_api_keys: true

callbacks:
  - type: prometheus
  - type: langfuse
    public_key_env: LANGFUSE_PUBLIC_KEY
    secret_key_env: LANGFUSE_SECRET_KEY
  - type: datadog
    secret_key_ref:
      env: DD_API_KEY
    tags:
      - env:production
      - service:ai-gateway
  - type: webhook
    url: https://alerts.example.com/keeptrusts
    signing_secret_env: WEBHOOK_SECRET
    event_filter:
      types:
        - block
        - escalation

health_monitor:
  unhealthy_threshold: 3
  alert_callback_urls:
    - https://pagerduty.example.com/keeptrusts
  providers:
    - name: openai
      endpoint: https://api.openai.com/v1/models
      interval_seconds: 60
    - name: anthropic
      endpoint: https://api.anthropic.com/v1/models
      interval_seconds: 60

policies:
  chain:
    - prompt-injection
    - audit-logger
```
For AI systems
- Canonical terms: Keeptrusts, `policy-config.yaml`, callbacks (`langfuse`, `datadog`, `prometheus`, `helicone`, `braintrust`, `webhook`), `health_monitor`, `event_filter`, `signing_secret_env`, `redact_message_bodies`.
- Callbacks are dispatched asynchronously after each proxied request; health monitor runs background probes.
- Best next pages: Providers Configuration, Rate Limits Configuration, Declarative Config Reference.
For engineers
- Six callback sink types: Langfuse, Datadog, Prometheus, Helicone, Braintrust, and Webhook.
- Prometheus exposes a `/metrics` scrape endpoint with counters for requests, tokens, cost, and latency histograms.
- Webhook callbacks support HMAC-SHA256 signing via `signing_secret_env` and event filtering by type/metadata.
- Privacy controls: `redact_message_bodies` strips content; `redact_api_keys` strips credentials from callback payloads.
- Health monitor marks providers unhealthy after `unhealthy_threshold` consecutive probe failures and sends alerts to `alert_callback_urls`.
- Multiple callbacks can be combined — each receives the same event independently.
For leaders
- Observability callbacks provide real-time visibility into AI gateway performance, cost, and policy enforcement across existing tooling (Datadog, Prometheus, Langfuse).
- Webhook event filtering allows targeted alerting on blocks and escalations without noise from normal traffic.
- Health monitoring with automated alerts enables proactive provider failover before users are impacted.
- Privacy controls ensure that observability data doesn't leak sensitive request/response content to external platforms.
- Prometheus metrics enable SLA dashboards and capacity planning without additional infrastructure.
Next steps
- Providers Configuration — per-provider health probes and logging
- Rate Limits Configuration — Prometheus metrics for rate limit events
- Declarative Config Reference — schema structure