Monitor Gateway Health & Performance from the Console
Your Keeptrusts gateways are the enforcement point for every AI policy. If a gateway goes down or degrades, your governance controls stop working. The console's Gateway Monitoring page gives you real-time visibility into every gateway in your fleet — health status, latency, throughput, error rates, and provider connectivity — so you can catch issues before they affect users.
Use this page when
- You need to check the health status, latency, or throughput of your gateway fleet from the console.
- You are configuring alerting rules for gateway offline, high error rate, or latency spikes.
- You want to plan capacity based on historical throughput and growth trends.
Primary audience
- Primary: Platform Engineers and SREs monitoring gateway infrastructure
- Secondary: Technical Leaders reviewing capacity, On-call Operators responding to alerts
What You'll Accomplish
- View the health status of every gateway in your organization
- Monitor latency and throughput metrics per gateway
- Track upstream provider availability and error rates
- Configure alerting rules for proactive incident detection
- Plan capacity based on historical usage patterns
Gateway Inventory
Navigate to Gateways in the console sidebar. The inventory page lists every registered gateway with:
| Column | Description |
|---|---|
| Name | Gateway identifier |
| Status | Healthy, Degraded, or Offline |
| Version | Deployed configuration version |
| Team | Owning team |
| Region | Deployment region or location |
| Last Heartbeat | Time since the last health check |
| Requests (24h) | Request count in the last 24 hours |
Click any row to open the gateway detail view.
Health Status
Each gateway reports its health based on:
- Heartbeat — the gateway sends periodic health checks to the API
- Error rate — if the 5-minute error rate exceeds the threshold, status degrades
- Latency — if P95 latency exceeds the configured ceiling, status degrades
Status Definitions
| Status | Meaning | Action |
|---|---|---|
| Healthy | All metrics within normal ranges | No action needed |
| Degraded | One or more metrics exceed warning thresholds | Investigate; may resolve on its own |
| Offline | No heartbeat received within the timeout period | Immediate investigation required |
Health Check Configuration
Configure health thresholds in Settings → Gateways → Health:
# Example health check configuration
gateway_health:
  heartbeat_timeout_seconds: 30
  degraded_error_rate_percent: 5
  degraded_latency_p95_ms: 2000
  offline_heartbeat_missed_count: 3
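To make the relationship between these thresholds and the three statuses concrete, here is a minimal Python sketch of how a status could be derived from them. It is illustrative only and not the console's actual evaluation logic: the `classify_gateway` function and `HealthThresholds` class are hypothetical, and the rule that a gateway goes Offline after `offline_heartbeat_missed_count` consecutive timeout periods is an assumption about how the two heartbeat settings combine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthThresholds:
    # Field names mirror the example configuration; values are its defaults.
    heartbeat_timeout_seconds: float = 30
    degraded_error_rate_percent: float = 5
    degraded_latency_p95_ms: float = 2000
    offline_heartbeat_missed_count: int = 3


def classify_gateway(
    seconds_since_heartbeat: float,
    error_rate_percent: float,
    latency_p95_ms: float,
    thresholds: HealthThresholds = HealthThresholds(),
) -> str:
    """Return "Healthy", "Degraded", or "Offline" for one gateway."""
    # ASSUMPTION: Offline means the silence has lasted longer than
    # missed_count consecutive heartbeat timeout periods.
    offline_after = (
        thresholds.heartbeat_timeout_seconds
        * thresholds.offline_heartbeat_missed_count
    )
    if seconds_since_heartbeat > offline_after:
        return "Offline"
    # Either metric crossing its ceiling is enough to degrade the status.
    if (
        error_rate_percent > thresholds.degraded_error_rate_percent
        or latency_p95_ms > thresholds.degraded_latency_p95_ms
    ):
        return "Degraded"
    return "Healthy"
```

Under these assumptions, a gateway that last reported 10 seconds ago with a 6% error rate would be classified as Degraded, while one that has been silent for more than 90 seconds would be Offline regardless of its last known metrics.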
Gateway detail monitoring workspace
The gateway detail page keeps runtime visibility on a single surface instead of separate Overview, Usage, and Events tabs. The page opens with one shared time-period picker that drives six charts:
- Total events
- Allowed
- Blocked
- Escalated
- Redacted
- Avg quality
Use the shared picker to switch all six graphs between the 1h, 6h, 24h, 7d, and 30d windows together.
Throughput Monitoring
Track request volume over time:
- Requests per second (RPS) — current and historical
- Requests per minute — smoothed view for trend analysis
- Peak throughput — highest RPS recorded in the selected time range
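If you export request timestamps and want to reproduce these three figures yourself, the following Python sketch shows one way to compute them. It assumes timestamps are epoch seconds and uses fixed one-second and one-minute buckets; the function names are hypothetical and the console may use a different windowing scheme.

```python
from collections import Counter


def requests_per_second(timestamps: list[float]) -> Counter:
    """Count requests in each whole-second bucket (epoch seconds)."""
    return Counter(int(ts) for ts in timestamps)


def requests_per_minute(timestamps: list[float]) -> Counter:
    """Count requests in each whole-minute bucket for a smoother trend."""
    return Counter(int(ts) // 60 for ts in timestamps)


def peak_rps(timestamps: list[float]) -> int:
    """Highest per-second request count in the given range (0 if empty)."""
    return max(requests_per_second(timestamps).values(), default=0)
```

The per-minute view hides short bursts that the per-second view exposes, which is why peak throughput is taken from the per-second buckets.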
Throughput by Outcome
Break down throughput by policy outcome:
- Allowed requests
- Blocked requests
- Escalated requests
- Provider errors
This helps you understand not just volume, but what's happening to the traffic. A high block rate at peak throughput might indicate a false-positive policy issue.
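As a rough illustration of the outcome breakdown, the sketch below turns a list of per-request outcomes into percentage shares so that an unusually high block rate stands out. The outcome labels and the `outcome_shares` helper are assumptions made for this example, not part of the product.

```python
from collections import Counter


def outcome_shares(outcomes: list[str]) -> dict[str, float]:
    """Return each outcome's share of total traffic, in percent."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {outcome: 100.0 * n / total for outcome, n in counts.items()}


# Hypothetical sample: 6 allowed, 3 blocked, 1 escalated request.
sample = ["allowed"] * 6 + ["blocked"] * 3 + ["escalated"]
shares = outcome_shares(sample)
```

Comparing the blocked share during peak periods with the share during quiet periods is one simple way to check whether a policy is over-blocking under load.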
Provider Status
Each gateway tracks connectivity to its configured upstream providers:
| Metric | Description |
|---|---|
| Provider health | Reachable / Unreachable / Degraded |
| Error rate | Percentage of provider responses that are errors (429, 500, 503) |
| Rate limit remaining | Remaining capacity before provider throttling |
| Average response time | Mean provider latency over the last 5 minutes |
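The error rate and average response time rows can be reproduced from raw provider responses with a short calculation. The Python sketch below treats only 429, 500, and 503 as errors, following the examples in the table; that set, and the function names, are assumptions for illustration, and your gateway may count other status codes as well.

```python
def provider_error_rate(
    status_codes: list[int],
    error_codes: frozenset = frozenset({429, 500, 503}),
) -> float:
    """Percentage of provider responses whose status is an error code."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if code in error_codes)
    return 100.0 * errors / len(status_codes)


def average_response_ms(latencies_ms: list[float]) -> float:
    """Mean provider latency over the supplied window (0.0 if empty)."""
    if not latencies_ms:
        return 0.0
    return sum(latencies_ms) / len(latencies_ms)
```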
Provider Failover
If a provider becomes unreachable, the console surfaces it prominently:
- The gateway card shows the provider as Unreachable
- An alert fires to the configured notification channel
- If failover is configured in the policy, the gateway automatically routes to the backup provider
Error Rates
The error rate panel shows:
- Gateway errors — errors in policy evaluation or internal processing
- Provider errors — upstream 4xx/5xx responses
- Timeout errors — requests that exceeded the configured timeout
Error Drill-Down
Click any spike in the error chart to view the contributing events. The console filters the Events page to show only errors from that gateway and time window, letting you quickly identify:
- Which provider is failing
- Which policy is causing evaluation errors
- Whether the errors correlate with a recent configuration change
Alerting Rules
Configure alerts in Settings → Gateways → Alerts:
Available Alert Types
| Alert | Trigger |
|---|---|
| Gateway offline | No heartbeat for N seconds |
| High error rate | Error rate exceeds X% for Y minutes |
| Latency spike | P95 latency exceeds N ms for Y minutes |
| Throughput drop | RPS drops below X% of the rolling average |
| Provider unreachable | Upstream provider fails health check |
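Two of these triggers carry a time dimension: "exceeds X for Y minutes" and "drops below X% of the rolling average". The Python sketch below shows one plausible way to evaluate them against per-minute samples. It is a hedged illustration rather than the console's implementation; the function names and the choice of a simple mean as the rolling average are assumptions.

```python
def sustained_breach(
    samples: list[float], threshold: float, duration_minutes: int
) -> bool:
    """True if the most recent `duration_minutes` per-minute samples
    (oldest first) all exceed `threshold`."""
    if duration_minutes <= 0 or len(samples) < duration_minutes:
        return False  # Not enough history to confirm a sustained breach.
    return all(value > threshold for value in samples[-duration_minutes:])


def throughput_drop(
    rps_history: list[float], current_rps: float, drop_below_percent: float
) -> bool:
    """True if current RPS is below `drop_below_percent` of the mean
    of `rps_history` (ASSUMPTION: rolling average = simple mean)."""
    if not rps_history:
        return False
    baseline = sum(rps_history) / len(rps_history)
    return current_rps < baseline * drop_below_percent / 100.0
```

Requiring every sample in the window to breach the threshold keeps a single noisy minute from paging the on-call operator, at the cost of firing a few minutes later.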
Alert Configuration Example
gateway_alerts:
  - name: "Production gateway offline"
    type: gateway_offline
    gateway: "gw-prod-01"
    threshold_seconds: 60
    channel: "pagerduty:production-oncall"
  - name: "High error rate warning"
    type: high_error_rate
    gateway: "gw-prod-01"
    threshold_percent: 5
    duration_minutes: 5
    channel: "slack:#gateway-alerts"
  - name: "Latency degradation"
    type: latency_spike
    gateway: "gw-prod-01"
    p95_threshold_ms: 3000
    duration_minutes: 10
    channel: "email:ops@acme.com"
Capacity Planning
Use the trends in the gateway detail monitoring workspace to support scaling decisions:
- Peak utilization — highest observed load as a percentage of estimated capacity
- Growth trend — weekly request volume growth rate
- Projected capacity exhaustion: the point at which demand, at the current growth rate, will exceed single-gateway capacity
Use this data to:
- Schedule gateway scaling before demand peaks
- Justify infrastructure investment with data-backed projections
- Identify underutilized gateways that can be consolidated
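The capacity metrics above can be approximated with a short calculation if you want to sanity-check a projection. The Python sketch below assumes compound weekly growth and a known capacity estimate in RPS; both the growth model and the function names are assumptions made for illustration.

```python
import math


def peak_utilization_percent(peak_rps: float, capacity_rps: float) -> float:
    """Highest observed load as a percentage of estimated capacity."""
    if capacity_rps <= 0:
        raise ValueError("capacity_rps must be positive")
    return 100.0 * peak_rps / capacity_rps


def weeks_until_exhaustion(
    current_peak_rps: float, capacity_rps: float, weekly_growth_rate: float
) -> float:
    """Weeks until peak load reaches capacity, assuming compound growth.

    `weekly_growth_rate` is a fraction, e.g. 0.05 for 5% per week.
    Solves current * (1 + g) ** n = capacity for n.
    """
    if current_peak_rps <= 0 or capacity_rps <= 0:
        raise ValueError("load and capacity must be positive")
    if current_peak_rps >= capacity_rps:
        return 0.0  # Already at or over capacity.
    if weekly_growth_rate <= 0:
        return math.inf  # Flat or shrinking load never reaches capacity.
    return math.log(capacity_rps / current_peak_rps) / math.log1p(
        weekly_growth_rate
    )
```

For example, a gateway peaking at half of its estimated capacity and doubling every week would reach that capacity in one week under this model. Real traffic rarely grows this smoothly, so treat the result as a planning estimate rather than a deadline.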
Business Outcomes
| Outcome | How Gateway Monitoring Delivers It |
|---|---|
| Proactive incident detection | Alerts fire before users notice degradation |
| Reduced mean-time-to-resolve | Error drill-down connects symptoms to root cause in clicks |
| Provider resilience | Provider status tracking and configured failover keep AI service available during provider outages |
| Cost-efficient scaling | Capacity planning prevents both over-provisioning and under-provisioning |
Next steps
- Notification Channels — configure where gateway alerts are delivered
- Dashboard Mastery — view gateway metrics alongside organization-wide KPIs
For AI systems
- Canonical terms: gateway inventory, health status (Healthy/Degraded/Offline), heartbeat, latency metrics (P50/P95/P99), throughput (RPS), provider status, error rate, alerting rules, capacity planning.
- Console navigation: Gateways (sidebar) → gateway detail (details panel + six monitoring graphs), Settings → Gateways → Health, Settings → Gateways → Alerts.
- Alert types: gateway_offline, high_error_rate, latency_spike, throughput_drop, provider_unreachable.
- Health thresholds: heartbeat_timeout_seconds, degraded_error_rate_percent, degraded_latency_p95_ms, offline_heartbeat_missed_count.
- Best next pages: Notification Channels, Dashboard Mastery, Performance Tuning (CLI).
For engineers
- Navigate to Gateways in the sidebar to see the fleet inventory with status, version, and 24h request count.
- Configure health thresholds in Settings → Gateways → Health (heartbeat timeout, error rate, latency ceiling).
- Set alerts in Settings → Gateways → Alerts; each alert targets a notification channel (Slack, PagerDuty, email).
- If policy evaluation time dominates end-to-end latency, reorder the policy chain or disable logging-only policies during peak.
- Use the gateway detail monitoring graphs and selected time window to compare trend changes before making scaling decisions.
For leaders
- Proactive alerting means degraded gateways are detected before end users notice — maintaining governance uptime.
- Provider status tracking and automatic failover ensure uninterrupted AI service even during provider outages.
- Capacity planning data supports infrastructure investment justification with concrete growth projections.
- Error drill-down connects symptoms to root cause in clicks, reducing mean-time-to-resolve for on-call teams.