Monitor Gateway Health & Performance from the Console

Your Keeptrusts gateways are the enforcement point for every AI policy. If a gateway goes down or degrades, your governance controls stop working. The console's Gateway Monitoring page gives you real-time visibility into every gateway in your fleet — health status, latency, throughput, error rates, and provider connectivity — so you can catch issues before they affect users.

Use this page when

You need to check the health status, latency, or throughput of your gateway fleet from the console.
You are configuring alerting rules for gateway offline, high error rate, or latency spikes.
You want to plan capacity based on historical throughput and growth trends.

Primary audience

Primary: Platform Engineers and SREs monitoring gateway infrastructure
Secondary: Technical Leaders reviewing capacity, On-call Operators responding to alerts

What You'll Accomplish

View the health status of every gateway in your organization
Monitor latency and throughput metrics per gateway
Track upstream provider availability and error rates
Configure alerting rules for proactive incident detection
Plan capacity based on historical usage patterns

Gateway Inventory

Navigate to Gateways in the console sidebar. The inventory page lists every registered gateway with:

Column	Description
Name	Gateway identifier
Status	Healthy, Degraded, or Offline
Version	Deployed configuration version
Team	Owning team
Region	Deployment region or location
Last Heartbeat	Time since the last health check
Requests (24h)	Request count in the last 24 hours

Click any row to open the gateway detail view.

Figure: Gateway inventory list

Health Status

Each gateway reports its health based on:

Heartbeat — the gateway sends periodic health checks to the API
Error rate — if the 5-minute error rate exceeds the threshold, status degrades
Latency — if P95 latency exceeds the configured ceiling, status degrades

Status Definitions

Status	Meaning	Action
Healthy	All metrics within normal ranges	No action needed
Degraded	One or more metrics exceed warning thresholds	Investigate; may resolve on its own
Offline	No heartbeat received within the timeout period	Immediate investigation required

Health Check Configuration

Configure health thresholds in Settings → Gateways → Health:

# Example health check configuration
gateway_health:
  heartbeat_timeout_seconds: 30
  degraded_error_rate_percent: 5
  degraded_latency_p95_ms: 2000
  offline_heartbeat_missed_count: 3
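
How these thresholds might combine into a single status can be sketched as follows. This is a minimal illustration, not the gateway's actual logic: `HealthThresholds` and `classify` are hypothetical names, and the sketch assumes a gateway is Offline once the time since its last heartbeat exceeds `heartbeat_timeout_seconds`. How `offline_heartbeat_missed_count` interacts with that timeout is not stated in the settings, so it is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthThresholds:
    # Defaults mirror the example configuration; the class itself is hypothetical.
    heartbeat_timeout_seconds: float = 30
    degraded_error_rate_percent: float = 5
    degraded_latency_p95_ms: float = 2000


def classify(seconds_since_heartbeat: float,
             error_rate_percent: float,
             latency_p95_ms: float,
             t: HealthThresholds = HealthThresholds()) -> str:
    """Return "Healthy", "Degraded", or "Offline" under the assumed rules."""
    # Assumption: silence longer than the timeout means Offline.
    if seconds_since_heartbeat > t.heartbeat_timeout_seconds:
        return "Offline"
    # Either metric crossing its threshold is enough to degrade the status.
    if (error_rate_percent > t.degraded_error_rate_percent
            or latency_p95_ms > t.degraded_latency_p95_ms):
        return "Degraded"
    return "Healthy"
```

The Offline check runs first because error rate and latency figures from a gateway that has stopped reporting are stale.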

Gateway detail monitoring workspace

The gateway detail page presents runtime visibility on a single surface rather than on separate Overview, Usage, and Events tabs. A shared time-period picker at the top of the page drives six charts:

Total events
Allowed
Blocked
Escalated
Redacted
Avg quality

Use the shared picker to switch all six charts together between the 1h, 6h, 24h, 7d, and 30d windows.
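
To make the relationship between the picker and the six charts concrete, here is a rough sketch of how the six values could be derived from raw events for one window. The event shape (a dict with `timestamp`, `outcome`, and an optional `quality` score) and the `summarize` function are assumptions made for illustration; the console's real data model may differ.

```python
from datetime import datetime, timedelta, timezone

# The five windows offered by the shared picker.
WINDOWS = {
    "1h": timedelta(hours=1),
    "6h": timedelta(hours=6),
    "24h": timedelta(hours=24),
    "7d": timedelta(days=7),
    "30d": timedelta(days=30),
}

OUTCOMES = ("allowed", "blocked", "escalated", "redacted")


def summarize(events, window, now):
    """Compute the six chart totals for events inside the chosen window."""
    start = now - WINDOWS[window]
    recent = [e for e in events if start <= e["timestamp"] <= now]
    totals = {"total": len(recent)}
    totals.update({name: 0 for name in OUTCOMES})
    for e in recent:
        if e["outcome"] in OUTCOMES:
            totals[e["outcome"]] += 1
    # Average quality only over events that carry a score.
    scores = [e["quality"] for e in recent if e.get("quality") is not None]
    totals["avg_quality"] = sum(scores) / len(scores) if scores else None
    return totals
```

Because every value is computed from the same filtered list, changing the window changes all six results at once, which is the behavior the shared picker provides.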

Throughput Monitoring

Track request volume over time:

Requests per second (RPS) — current and historical
Requests per minute — smoothed view for trend analysis
Peak throughput — highest RPS recorded in the selected time range
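
The three figures above are related by simple arithmetic. The sketch below shows one way to compute them from raw request timestamps, assuming epoch-second timestamps and one-second buckets for the peak; `throughput_stats` is a hypothetical helper, and the console may bucket or smooth differently.

```python
from collections import Counter


def throughput_stats(request_times, range_start, range_end):
    """Summarize throughput for epoch-second timestamps in [range_start, range_end)."""
    in_range = [t for t in request_times if range_start <= t < range_end]
    # Bucket requests by whole second to find the busiest second.
    per_second = Counter(int(t) for t in in_range)
    duration_seconds = range_end - range_start
    return {
        "avg_rps": len(in_range) / duration_seconds,
        "avg_rpm": len(in_range) / (duration_seconds / 60),
        "peak_rps": max(per_second.values(), default=0),
    }
```

Peak RPS can be far higher than average RPS over the same range, which is why the two are worth reading together when sizing a gateway.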

Throughput by Outcome

Break down throughput by policy outcome:

Allowed requests
Blocked requests
Escalated requests
Provider errors

This helps you understand not just volume, but what's happening to the traffic. A high block rate at peak throughput might indicate a false-positive policy issue.
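
As a small worked example of the breakdown, the sketch below turns a list of outcomes into shares of total traffic, so a block rate can be compared between peak and off-peak periods. The outcome labels and the `outcome_shares` function are illustrative assumptions.

```python
from collections import Counter


def outcome_shares(outcomes):
    """Return each outcome's share of total requests as a fraction between 0 and 1."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}
```

For instance, if 3 of 10 requests in a peak window are blocked, the block share is 0.3; if the same gateway blocks closer to 0.05 off-peak, the policy behind the difference is worth reviewing.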

Provider Status

Each gateway tracks connectivity to its configured upstream providers:

Metric	Description
Provider health	Reachable / Unreachable / Degraded
Error rate	Percentage of provider responses that are errors (429, 500, 503)
Rate limit remaining	Remaining capacity before provider throttling
Average response time	Mean provider latency over the last 5 minutes

Provider Failover

If a provider becomes unreachable, the console surfaces it prominently:

The gateway card shows the provider as Unreachable
An alert fires to the configured notification channel
If failover is configured in the policy, the gateway automatically routes to the backup provider
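
The routing decision in the last step can be pictured with a short sketch. It assumes an ordered preference list and a map of provider health values, and it assumes that a Degraded provider still receives traffic; none of this is confirmed gateway behavior, and `pick_provider` is a hypothetical name.

```python
def pick_provider(preferred, health):
    """Return the first provider in preference order that is not Unreachable.

    Providers missing from the health map are treated as Unreachable.
    Returns None when no provider can take traffic.
    """
    for name in preferred:
        if health.get(name, "Unreachable") != "Unreachable":
            return name
    return None
```

Returning `None` rather than raising leaves it to the caller to decide what happens when every provider is down, for example surfacing a provider error to the client.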

Error Rates

The error rate panel shows:

Gateway errors — errors in policy evaluation or internal processing
Provider errors — upstream 4xx/5xx responses
Timeout errors — requests that exceeded the configured timeout
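
One way to think about the three categories is as a classification applied to each failed request. The sketch below is only an illustration: the `timed_out` and `provider_status` fields are assumed, not taken from the Keeptrusts event schema, and the precedence (timeout first, then provider, then gateway) is a design choice made here.

```python
def classify_error(event):
    """Map one failed request to "timeout", "provider", or "gateway"."""
    # A request that hit the configured timeout is counted as a timeout,
    # even if the provider eventually returned an error status.
    if event.get("timed_out"):
        return "timeout"
    status = event.get("provider_status")
    if status is not None and 400 <= status <= 599:
        return "provider"
    # Anything else that failed is attributed to the gateway itself.
    return "gateway"
```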

Error Drill-Down

Click any spike in the error chart to view the contributing events. The console filters the Events page to show only errors from that gateway and time window, letting you quickly identify:

Which provider is failing
Which policy is causing evaluation errors
Whether the errors correlate with a recent configuration change
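
The filter the console applies during drill-down amounts to selecting error events for one gateway and time window, then grouping them. A minimal sketch, assuming events are dicts with hypothetical `gateway`, `ts`, `is_error`, `provider`, and `policy` fields:

```python
from collections import Counter


def drill_down(events, gateway, start, end):
    """Group error events for one gateway in [start, end) by provider and policy."""
    errors = [
        e for e in events
        if e["gateway"] == gateway and start <= e["ts"] < end and e["is_error"]
    ]
    return {
        "by_provider": Counter(e["provider"] for e in errors if e.get("provider")),
        "by_policy": Counter(e["policy"] for e in errors if e.get("policy")),
    }
```

Calling `most_common(1)` on either counter gives the provider or policy that contributes most to the spike.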

Alerting Rules

Configure alerts in Settings → Gateways → Alerts:

Available Alert Types

Alert	Trigger
Gateway offline	No heartbeat for N seconds
High error rate	Error rate exceeds X% for Y minutes
Latency spike	P95 latency exceeds N ms for Y minutes
Throughput drop	RPS drops below X% of the rolling average
Provider unreachable	Upstream provider fails health check
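
Two of these triggers involve a little logic: "exceeds a threshold for Y minutes" and "drops below X% of the rolling average". The sketch below shows one plausible reading of each, assuming one metric sample per minute. The function names are hypothetical and the console's actual evaluation may differ.

```python
def sustained_breach(samples, threshold, duration_minutes):
    """True when the last `duration_minutes` per-minute samples all exceed threshold."""
    if duration_minutes < 1 or len(samples) < duration_minutes:
        return False
    return all(value > threshold for value in samples[-duration_minutes:])


def throughput_dropped(current_rps, recent_rps, min_percent):
    """True when current RPS is below `min_percent` of the rolling average."""
    if not recent_rps:
        return False
    rolling_avg = sum(recent_rps) / len(recent_rps)
    return current_rps < rolling_avg * min_percent / 100
```

Requiring every sample in the window to breach the threshold avoids paging on a single noisy minute, at the cost of a delay equal to the duration.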

Alert Configuration Example

gateway_alerts:
  - name: "Production gateway offline"
    type: gateway_offline
    gateway: "gw-prod-01"
    threshold_seconds: 60
    channel: "pagerduty:production-oncall"

  - name: "High error rate warning"
    type: high_error_rate
    gateway: "gw-prod-01"
    threshold_percent: 5
    duration_minutes: 5
    channel: "slack:#gateway-alerts"

  - name: "Latency degradation"
    type: latency_spike
    gateway: "gw-prod-01"
    p95_threshold_ms: 3000
    duration_minutes: 10
    channel: "email:ops@acme.com"
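
The `channel` values in this example appear to follow a `<type>:<target>` pattern. Assuming that pattern holds (it is inferred from the three examples, not from a documented grammar), a hypothetical validator might split on the first colon only, so that targets containing colons stay intact:

```python
def parse_channel(channel):
    """Split a channel string such as "slack:#gateway-alerts" into (type, target)."""
    kind, sep, target = channel.partition(":")
    if not sep or not kind or not target:
        raise ValueError(f"expected '<type>:<target>', got {channel!r}")
    return kind, target
```

Checking channel strings this way before saving a rule catches typos such as a missing prefix, which would otherwise surface only when an alert fails to deliver.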

Capacity Planning

Use the trends in the gateway detail monitoring workspace to support scaling decisions:

Peak utilization — highest observed load as a percentage of estimated capacity
Growth trend — weekly request volume growth rate
Projected capacity exhaustion — when current growth will exceed single-gateway capacity

Use this data to:

Schedule gateway scaling before demand peaks
Justify infrastructure investment with data-backed projections
Identify underutilized gateways that can be consolidated
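
A projection of capacity exhaustion can be approximated with compound growth. The sketch below assumes a constant weekly growth rate and a known capacity estimate, both simplifications, and `weeks_until_exhaustion` is a hypothetical helper rather than a console feature. With peak load $p$, capacity $c$, and weekly growth rate $g$, it solves $p(1+g)^w = c$ for $w$.

```python
import math


def weeks_until_exhaustion(current_peak_rps, capacity_rps, weekly_growth_rate):
    """Estimate weeks until peak load reaches capacity under compound growth.

    Returns 0.0 if capacity is already reached, and None if load is not growing.
    """
    if current_peak_rps >= capacity_rps:
        return 0.0
    if current_peak_rps <= 0 or weekly_growth_rate <= 0:
        return None
    return math.log(capacity_rps / current_peak_rps) / math.log(1 + weekly_growth_rate)
```

For a gateway at 50% peak utilization growing 5% per week, this gives roughly 14 weeks of headroom, which is the kind of figure that supports scheduling a scale-out before the limit is reached.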

Business Outcomes

Outcome	How Gateway Monitoring Delivers It
Proactive incident detection	Alerts fire before users notice degradation
Reduced mean-time-to-resolve	Error drill-down connects symptoms to root cause in clicks
Provider resilience	Provider status tracking and configured failover keep service running when a provider fails
Cost-efficient scaling	Capacity planning prevents both over-provisioning and under-provisioning

Next steps

Notification Channels — configure where gateway alerts are delivered
Dashboard Mastery — view gateway metrics alongside organization-wide KPIs

For AI systems

Canonical terms: gateway inventory, health status (Healthy/Degraded/Offline), heartbeat, latency metrics (P50/P95/P99), throughput (RPS), provider status, error rate, alerting rules, capacity planning.
Console navigation: Gateways (sidebar) → gateway detail (details panel + six monitoring graphs), Settings → Gateways → Health, Settings → Gateways → Alerts.
Alert types: gateway_offline, high_error_rate, latency_spike, throughput_drop, provider_unreachable.
Health thresholds: heartbeat_timeout_seconds, degraded_error_rate_percent, degraded_latency_p95_ms, offline_heartbeat_missed_count.
Best next pages: Notification Channels, Dashboard Mastery, Performance Tuning (CLI).

For engineers

Navigate to Gateways in the sidebar to see the fleet inventory with status, version, and 24h request count.
Configure health thresholds in Settings → Gateways → Health (heartbeat timeout, error rate, latency ceiling).
Set alerts in Settings → Gateways → Alerts; each alert targets a notification channel (Slack, PagerDuty, email).
If policy evaluation time dominates end-to-end latency, reorder the policy chain or disable logging-only policies during peak.
Use the gateway detail monitoring graphs and selected time window to compare trend changes before making scaling decisions.

For leaders

Proactive alerting means degraded gateways are detected before end users notice — maintaining governance uptime.
Provider status tracking and automatic failover, where a backup provider is configured, keep AI service available during provider outages.
Capacity planning data supports infrastructure investment justification with concrete growth projections.
Error drill-down connects symptoms to root cause in clicks, reducing mean-time-to-resolve for on-call teams.
