
Monitor Gateway Health & Performance from the Console

Your Keeptrusts gateways are the enforcement point for every AI policy. If a gateway goes down or degrades, your governance controls stop working. The console's Gateway Monitoring page gives you real-time visibility into every gateway in your fleet — health status, latency, throughput, error rates, and provider connectivity — so you can catch issues before they affect users.

Use this page when

  • You need to check the health status, latency, or throughput of your gateway fleet from the console.
  • You are configuring alerting rules for gateway offline, high error rate, or latency spikes.
  • You want to plan capacity based on historical throughput and growth trends.

Primary audience

  • Primary: Platform Engineers and SREs monitoring gateway infrastructure
  • Secondary: Technical Leaders reviewing capacity, On-call Operators responding to alerts

What You'll Accomplish

  • View the health status of every gateway in your organization
  • Monitor latency and throughput metrics per gateway
  • Track upstream provider availability and error rates
  • Configure alerting rules for proactive incident detection
  • Plan capacity based on historical usage patterns

Gateway Inventory

Navigate to Gateways in the console sidebar. The inventory page lists every registered gateway with:

| Column | Description |
| --- | --- |
| Name | Gateway identifier |
| Status | Healthy, Degraded, or Offline |
| Version | Deployed configuration version |
| Team | Owning team |
| Region | Deployment region or location |
| Last Heartbeat | Time since the last health check |
| Requests (24h) | Request count in the last 24 hours |

Click any row to open the gateway detail view.

Figure: Gateway inventory list

Health Status

Each gateway reports its health based on:

  • Heartbeat — the gateway sends periodic health checks to the API
  • Error rate — if the 5-minute error rate exceeds the threshold, status degrades
  • Latency — if P95 latency exceeds the configured ceiling, status degrades

Status Definitions

| Status | Meaning | Action |
| --- | --- | --- |
| Healthy | All metrics within normal ranges | No action needed |
| Degraded | One or more metrics exceed warning thresholds | Investigate; may resolve on its own |
| Offline | No heartbeat received within the timeout period | Immediate investigation required |

Health Check Configuration

Configure health thresholds in Settings → Gateways → Health:

# Example health check configuration
gateway_health:
  heartbeat_timeout_seconds: 30
  degraded_error_rate_percent: 5
  degraded_latency_p95_ms: 2000
  offline_heartbeat_missed_count: 3
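
As a rough illustration of how these thresholds could combine into a status, here is a minimal Python sketch. The names `HealthThresholds` and `classify_gateway`, the input parameters, and the precedence rule (Offline is checked first) are assumptions made for illustration. The sketch treats a gateway as Offline once `heartbeat_timeout_seconds` elapses without a heartbeat and does not model `offline_heartbeat_missed_count`; the console's actual evaluation logic may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthThresholds:
    # Defaults mirror the example configuration above.
    heartbeat_timeout_seconds: float = 30
    degraded_error_rate_percent: float = 5
    degraded_latency_p95_ms: float = 2000


DEFAULTS = HealthThresholds()


def classify_gateway(seconds_since_heartbeat, error_rate_percent,
                     latency_p95_ms, thresholds=DEFAULTS):
    """Return 'Healthy', 'Degraded', or 'Offline' (hypothetical helper)."""
    # Assumption: a missing heartbeat outranks every other signal.
    if seconds_since_heartbeat > thresholds.heartbeat_timeout_seconds:
        return "Offline"
    # Either metric exceeding its threshold degrades the gateway.
    if (error_rate_percent > thresholds.degraded_error_rate_percent
            or latency_p95_ms > thresholds.degraded_latency_p95_ms):
        return "Degraded"
    return "Healthy"
```

For example, `classify_gateway(5, 7.5, 800)` returns `"Degraded"` because the error rate is above the 5% threshold while the heartbeat is still fresh.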

Gateway detail monitoring workspace

The gateway detail page presents runtime visibility on a single surface rather than on separate Overview, Usage, and Events tabs. One shared time-period picker at the top of the page drives six charts:

  • Total events
  • Allowed
  • Blocked
  • Escalated
  • Redacted
  • Avg quality

Use the shared picker to switch all six graphs between the 1h, 6h, 24h, 7d, and 30d windows together.

Throughput Monitoring

Track request volume over time:

  • Requests per second (RPS) — current and historical
  • Requests per minute — smoothed view for trend analysis
  • Peak throughput — highest RPS recorded in the selected time range
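
To make the relationship between these three figures concrete, the following Python sketch derives them from raw request timestamps. The function name `throughput_summary` and its input shape are hypothetical; the console computes these values for you, and its bucketing and smoothing may differ.

```python
from collections import Counter


def throughput_summary(timestamps, window_seconds):
    """Summarize request volume for one time window.

    timestamps: request arrival times in seconds (epoch or relative).
    window_seconds: length of the selected time range.
    """
    # Bucket requests into one-second bins to find the busiest second.
    per_second = Counter(int(ts) for ts in timestamps)
    total = len(timestamps)
    return {
        "avg_rps": total / window_seconds,
        "avg_rpm": total / (window_seconds / 60),
        "peak_rps": max(per_second.values(), default=0),
    }
```

The per-minute average smooths out bursts that the peak value exposes, which is why the two are useful side by side.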

Throughput by Outcome

Break down throughput by policy outcome:

  • Allowed requests
  • Blocked requests
  • Escalated requests
  • Provider errors

This helps you understand not just volume, but what's happening to the traffic. A high block rate at peak throughput might indicate a false-positive policy issue.
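
The outcome breakdown amounts to a share-of-total calculation. A minimal Python sketch, assuming each request is labeled with one of four hypothetical outcome strings (the labels and the function name `outcome_shares` are not taken from the product):

```python
from collections import Counter

# Hypothetical labels for the four outcomes listed above.
OUTCOMES = ("allowed", "blocked", "escalated", "provider_error")


def outcome_shares(outcomes):
    """Return each outcome's percentage of the total request count."""
    counts = Counter(outcomes)
    total = sum(counts[o] for o in OUTCOMES)
    if total == 0:
        return {o: 0.0 for o in OUTCOMES}
    return {o: 100.0 * counts[o] / total for o in OUTCOMES}
```

Comparing the `blocked` share during a peak window against a quieter baseline window is one way to spot the false-positive pattern described above.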

Provider Status

Each gateway tracks connectivity to its configured upstream providers:

| Metric | Description |
| --- | --- |
| Provider health | Reachable / Unreachable / Degraded |
| Error rate | Percentage of provider responses that are errors (429, 500, 503) |
| Rate limit remaining | Remaining capacity before provider throttling |
| Average response time | Mean provider latency over the last 5 minutes |
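
As a sketch of how the error rate and average response time could be derived, the Python below works from a list of recent provider responses. The function name `provider_metrics` and the `(status_code, latency_ms)` input shape are assumptions; only the three error codes come from the table above.

```python
from statistics import mean

# Status codes counted as provider errors, per the table above.
ERROR_STATUSES = {429, 500, 503}


def provider_metrics(responses):
    """responses: list of (status_code, latency_ms) from the last 5 minutes."""
    if not responses:
        return {"error_rate_percent": 0.0, "avg_response_ms": None}
    errors = sum(1 for status, _ in responses if status in ERROR_STATUSES)
    return {
        "error_rate_percent": 100.0 * errors / len(responses),
        "avg_response_ms": mean(latency for _, latency in responses),
    }
```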

Provider Failover

If a provider becomes unreachable, the console surfaces it prominently:

  1. The gateway card shows the provider as Unreachable
  2. An alert fires to the configured notification channel
  3. If failover is configured in the policy, the gateway automatically routes to the backup provider
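
The routing decision in step 3 can be pictured as choosing the first usable provider from an ordered list. This Python sketch is illustrative only: `pick_provider` is a hypothetical helper, and it assumes that a Degraded provider still receives traffic, which the gateway may or may not do.

```python
# Assumption: Degraded providers remain eligible; only Unreachable
# (or unknown) providers are skipped.
USABLE_STATES = {"Reachable", "Degraded"}


def pick_provider(providers, health):
    """Return the first usable provider, or None if none is usable.

    providers: ordered list of provider names, primary first.
    health: mapping of provider name to its reported health string.
    """
    for name in providers:
        if health.get(name) in USABLE_STATES:
            return name
    return None
```

With `["primary", "backup"]` and a health map that marks `primary` as `Unreachable`, the function returns `"backup"`.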

Error Rates

The error rate panel shows:

  • Gateway errors — errors in policy evaluation or internal processing
  • Provider errors — upstream 4xx/5xx responses
  • Timeout errors — requests that exceeded the configured timeout

Error Drill-Down

Click any spike in the error chart to view the contributing events. The console filters the Events page to show only errors from that gateway and time window, letting you quickly identify:

  • Which provider is failing
  • Which policy is causing evaluation errors
  • Whether the errors correlate with a recent configuration change

Alerting Rules

Configure alerts in Settings → Gateways → Alerts:

Available Alert Types

| Alert | Trigger |
| --- | --- |
| Gateway offline | No heartbeat for N seconds |
| High error rate | Error rate exceeds X% for Y minutes |
| Latency spike | P95 latency exceeds N ms for Y minutes |
| Throughput drop | RPS drops below X% of the rolling average |
| Provider unreachable | Upstream provider fails health check |
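
The throughput-drop trigger compares the current RPS against a rolling average. A minimal Python sketch of that comparison follows; the class name `ThroughputDropDetector`, the window size, and the choice to exclude the current sample from the average are all assumptions, not documented behavior.

```python
from collections import deque


class ThroughputDropDetector:
    """Flag samples that fall below a percentage of the rolling average."""

    def __init__(self, window, min_percent):
        self.samples = deque(maxlen=window)  # most recent RPS samples
        self.min_percent = min_percent       # the X% threshold

    def observe(self, rps):
        """Record one RPS sample; return True if it counts as a drop."""
        dropped = False
        # Only evaluate once a full window of history is available.
        if len(self.samples) == self.samples.maxlen:
            baseline = sum(self.samples) / len(self.samples)
            dropped = rps < baseline * self.min_percent / 100
        self.samples.append(rps)
        return dropped
```

With a window of 3 and a threshold of 50%, the sequence 100, 100, 100, 40 flags only the last sample, because 40 is below half of the rolling average of 100.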

Alert Configuration Example

gateway_alerts:
  - name: "Production gateway offline"
    type: gateway_offline
    gateway: "gw-prod-01"
    threshold_seconds: 60
    channel: "pagerduty:production-oncall"

  - name: "High error rate warning"
    type: high_error_rate
    gateway: "gw-prod-01"
    threshold_percent: 5
    duration_minutes: 5
    channel: "slack:#gateway-alerts"

  - name: "Latency degradation"
    type: latency_spike
    gateway: "gw-prod-01"
    p95_threshold_ms: 3000
    duration_minutes: 10
    channel: "email:ops@acme.com"
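
The `channel` values in this example appear to follow a `<type>:<target>` pattern. If you generate or lint alert definitions in your own tooling, a small check like the Python sketch below can catch malformed values early. The helper `parse_channel` is hypothetical, and the format is inferred from the example rather than from a published schema.

```python
def parse_channel(channel):
    """Split a channel string such as 'slack:#gateway-alerts'.

    Returns a (type, target) tuple. Splits on the first colon only,
    so targets that contain colons are preserved.
    """
    kind, sep, target = channel.partition(":")
    if not sep or not kind or not target:
        raise ValueError(f"expected '<type>:<target>', got {channel!r}")
    return kind, target
```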

Capacity Planning

Use the gateway detail monitoring workspace trends to support scaling decisions:

  • Peak utilization — highest observed load as a percentage of estimated capacity
  • Growth trend — weekly request volume growth rate
  • Projected capacity exhaustion — when current growth will exceed single-gateway capacity

Use this data to:

  • Schedule gateway scaling before demand peaks
  • Justify infrastructure investment with data-backed projections
  • Identify underutilized gateways that can be consolidated
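
The projected capacity exhaustion figure described above can be approximated with a compound-growth calculation. This Python sketch assumes a constant weekly growth rate and a known capacity estimate; the function name `weeks_until_exhaustion` and its parameters are hypothetical, and the console's projection method may differ.

```python
import math


def weeks_until_exhaustion(current_peak_rps, capacity_rps, weekly_growth_rate):
    """Estimate whole weeks until peak load reaches capacity.

    weekly_growth_rate is a fraction, e.g. 0.05 for 5% per week.
    Returns 0 if already at capacity, None if load is not growing.
    """
    if current_peak_rps <= 0 or weekly_growth_rate <= 0:
        return None
    if current_peak_rps >= capacity_rps:
        return 0
    # Solve current * (1 + g) ** n >= capacity for the smallest integer n.
    ratio = capacity_rps / current_peak_rps
    return math.ceil(math.log(ratio) / math.log(1 + weekly_growth_rate))
```

For a gateway peaking at 500 RPS against an estimated capacity of 1000 RPS (50% peak utilization) and growing 5% per week, the estimate is 15 weeks.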

Business Outcomes

| Outcome | How Gateway Monitoring Delivers It |
| --- | --- |
| Proactive incident detection | Alerts fire before users notice degradation |
| Reduced mean-time-to-resolve | Error drill-down connects symptoms to root cause in clicks |
| Provider resilience | Provider status tracking and failover ensure uninterrupted service |
| Cost-efficient scaling | Capacity planning prevents both over-provisioning and under-provisioning |

Next steps

For AI systems

  • Canonical terms: gateway inventory, health status (Healthy/Degraded/Offline), heartbeat, latency metrics (P50/P95/P99), throughput (RPS), provider status, error rate, alerting rules, capacity planning.
  • Console navigation: Gateways (sidebar) → gateway detail (details panel + six monitoring graphs), Settings → Gateways → Health, Settings → Gateways → Alerts.
  • Alert types: gateway_offline, high_error_rate, latency_spike, throughput_drop, provider_unreachable.
  • Health thresholds: heartbeat_timeout_seconds, degraded_error_rate_percent, degraded_latency_p95_ms, offline_heartbeat_missed_count.
  • Best next pages: Notification Channels, Dashboard Mastery, Performance Tuning (CLI).

For engineers

  • Navigate to Gateways in the sidebar to see the fleet inventory with status, version, and 24h request count.
  • Configure health thresholds in Settings → Gateways → Health (heartbeat timeout, error rate, latency ceiling).
  • Set alerts in Settings → Gateways → Alerts; each alert targets a notification channel (Slack, PagerDuty, email).
  • If policy evaluation time dominates end-to-end latency, reorder the policy chain or disable logging-only policies during peak.
  • Use the gateway detail monitoring graphs and selected time window to compare trend changes before making scaling decisions.

For leaders

  • Proactive alerting means degraded gateways are detected before end users notice — maintaining governance uptime.
  • Provider status tracking and automatic failover ensure uninterrupted AI service even during provider outages.
  • Capacity planning data supports infrastructure investment justification with concrete growth projections.
  • Error drill-down connects symptoms to root cause in clicks, reducing mean-time-to-resolve for on-call teams.