Diagnose Gateway Issues with kt doctor
When requests fail, latency spikes, or policies misbehave, kt doctor gives you a systematic diagnostic toolkit. It validates connectivity, configuration, upstream health, and performance — then tells you exactly what to fix.
Use this page when
- Requests are failing, latency is spiking, or policies are misbehaving and you need structured diagnostics.
- You want to run kt doctor to validate connectivity, configuration, upstream health, or performance.
- You need to generate a support bundle or integrate health checks into Kubernetes/Docker probes.
Primary audience
- Primary: Technical Engineers and SREs troubleshooting gateway issues
- Secondary: On-call operators, Platform Engineers configuring health probes
Running kt doctor
# Full diagnostic suite
kt doctor
# Target a specific gateway
kt doctor --gateway gw-prod-01
# Run specific check categories
kt doctor --checks connectivity,config,upstream
Full diagnostic output
Keeptrusts Gateway Diagnostics
══════════════════════════════
Gateway: gw-prod-01 (v2.4.1)
Config: /etc/keeptrusts/policy-config.yaml
Connectivity
✓ API reachable (https://api.keeptrusts.com — 45ms)
✓ Gateway port 41002 listening
✓ TLS certificate valid (expires 2025-12-01)
✗ DNS resolution for api.keeptrusts.com took 850ms (threshold: 200ms)
Configuration
✓ Policy config valid (7 policies, 2 phases)
✓ All includes resolved
✓ Config variables: 4/4 bound
⚠ Deprecated policy type "content_filter_v1" in use
Upstream Providers
✓ openai/gpt-4o reachable (112ms)
✓ anthropic/claude-3 reachable (98ms)
✗ azure/gpt-4 connection refused
Performance
✓ Policy chain eval: 12ms avg (last 100 requests)
✓ Memory usage: 142 MB (limit: 512 MB)
⚠ P99 latency: 890ms (threshold: 500ms)
Summary: 2 errors, 2 warnings
→ Fix DNS resolution: check /etc/resolv.conf or configure DNS cache
→ Fix azure/gpt-4: verify AZURE_API_KEY and endpoint URL
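When the report is saved to a file or piped through CI, it can help to tally failures mechanically. The following is a minimal sketch, assuming the output uses the ✗ and ⚠ markers exactly as in the sample above; tally_doctor_output is a hypothetical helper, not part of kt.

```shell
# Hypothetical helper: count error (✗) and warning (⚠) lines in kt doctor
# output read from stdin. Assumes the marker characters shown in the sample.
tally_doctor_output() {
  awk '
    /✗/ { errors++ }
    /⚠/ { warnings++ }
    END { printf "errors=%d warnings=%d\n", errors, warnings }
  '
}
```

A typical invocation would be kt doctor | tally_doctor_output. Applied to the sample report above, the tally should agree with its Summary line.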
Diagnostic categories
Connectivity checks
Validates that the gateway can reach all required services:
kt doctor --checks connectivity
| Check | What it validates |
|---|---|
| API reachability | TCP connection + HTTP health check to the control-plane API |
| Gateway port | The configured gateway port is listening and accepting connections |
| TLS certificate | Certificate validity, expiration, and chain completeness |
| DNS resolution | Resolution time for all configured hostnames |
| Firewall rules | Outbound connectivity to provider endpoints on required ports |
Configuration checks
Validates the active configuration against the schema and runtime state:
kt doctor --checks config
| Check | What it validates |
|---|---|
| Schema validation | Runs the full kt policy lint --file policy-config.yaml check against the active config source |
| Include resolution | All fragment files are accessible and parseable |
| Variable binding | All secret_key_ref and variable references resolve to a value |
| Policy deprecations | Flags deprecated policy types that will be removed in future versions |
| Config staleness | Warns if the config file has not been reloaded recently |
Upstream provider checks
Tests connectivity and authentication to each configured LLM provider:
kt doctor --checks upstream
Upstream Provider Health
────────────────────────
openai/gpt-4o
✓ DNS resolution: 12ms
✓ TCP connection: 45ms
✓ TLS handshake: 89ms
✓ Auth validation: 200 OK
✓ Model available: true
azure/gpt-4
✓ DNS resolution: 15ms
✗ TCP connection: timeout after 5000ms
✗ FAILED: Connection refused
→ Check: AZURE_OPENAI_ENDPOINT environment variable
→ Check: Network allows outbound to *.openai.azure.com:443
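To cross-check a failing provider by hand, curl's --write-out timers cover the same DNS, TCP, and TLS stages. The sketch below is one way to reproduce comparable numbers, not a description of how kt doctor measures them; stage_timings is a hypothetical helper. curl reports cumulative seconds, so the helper subtracts adjacent timers to get per-stage milliseconds.

```shell
# Hypothetical helper: convert curl's cumulative timers (seconds) into
# per-stage milliseconds.
# Arguments: time_namelookup time_connect time_appconnect
stage_timings() {
  awk -v dns="$1" -v tcp="$2" -v tls="$3" 'BEGIN {
    printf "dns=%.0fms tcp=%.0fms tls=%.0fms\n",
      dns * 1000, (tcp - dns) * 1000, (tls - tcp) * 1000
  }'
}

# Example invocation (requires network access; substitute your provider URL):
#   stage_timings $(curl -s -o /dev/null \
#     -w '%{time_namelookup} %{time_connect} %{time_appconnect}' \
#     https://api.openai.com/v1/models)
```

If curl itself cannot connect, the later timers come back as zero, which points to the same network-level causes listed in the hints above.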
Performance checks
Analyzes gateway performance using recent request data:
kt doctor --checks performance
| Metric | Threshold | Description |
|---|---|---|
| Chain eval time | < 50ms avg | Time spent evaluating the policy chain |
| P50 latency | < 200ms | Median total request latency |
| P99 latency | < 500ms | Tail latency (excludes upstream time) |
| Memory usage | < 80% limit | Current RSS relative to configured limit |
| Open connections | < 80% max | Active connections relative to pool size |
| Event queue depth | < 1000 | Pending events waiting for API submission |
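The percentage-based thresholds are plain ratios. As a worked example, the sample report's 142 MB of memory against a 512 MB limit comes to about 27%, well under the 80% cutoff. The sketch below shows the arithmetic; usage_pct is a hypothetical helper, and only the 80% figure comes from the table.

```shell
# Hypothetical helper: report usage as a whole-number percentage of a limit
# and flag it when it is not below the 80% threshold from the table.
usage_pct() {
  used=$1
  limit=$2
  pct=$(( used * 100 / limit ))   # integer division, rounds down
  if [ "$pct" -lt 80 ]; then
    echo "${pct}% ok"
  else
    echo "${pct}% over threshold"
  fi
}
```

Calling usage_pct 142 512 reproduces the memory case from the sample report. The same function applies to open connections relative to pool size.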
Debugging specific issues
Requests returning 409 unexpectedly
# Check which policy is triggering
kt events tail --filter "outcome=blocked" --format detailed
# Inspect the active chain
curl -s http://localhost:8080/keeptrusts/config | jq '.policies.chain'
# Re-run lint on the config source
kt policy lint --file policy-config.yaml
High latency
# Check per-policy latency
kt doctor --checks performance --verbose
# Identify slow policies
kt events tail --filter "latency_ms>500" --format detailed
# Check upstream provider response times
kt doctor --checks upstream --verbose
Events not appearing in the console
# Verify API connectivity
kt doctor --checks connectivity
# Check event queue status
kt doctor --checks performance --verbose | grep "queue"
# Manually test event submission
curl -s -o /dev/null -w "%{http_code}" \
-X POST https://api.keeptrusts.com/v1/events \
-H "Authorization: Bearer $KT_API_KEY" \
-H "Content-Type: application/json" \
-d '{"test": true}'
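The curl command above prints only a status code, so the next step depends on how you read it. The sketch below maps codes to likely causes using general HTTP semantics only; the API's specific responses to this test payload are not documented here, and explain_status is a hypothetical helper.

```shell
# Hypothetical helper: map an HTTP status code to a likely next step.
# Uses general HTTP semantics; the API's specific responses are not assumed.
explain_status() {
  case "$1" in
    000)     echo "no response: check network, DNS, or TLS" ;;
    2??)     echo "accepted: connectivity and credentials look fine" ;;
    401|403) echo "auth problem: check KT_API_KEY" ;;
    4??)     echo "request rejected: check payload and endpoint" ;;
    5??)     echo "server error: retry later or escalate" ;;
    *)       echo "unexpected status: $1" ;;
  esac
}
```

curl writes 000 for %{http_code} when it never received an HTTP response, which is why that case points back to the connectivity checks.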
Config reload failures
# Validate the new config before reloading
kt policy lint --file /path/to/new-config.yaml
# Check if the gateway process can read the file
kt doctor --checks config --verbose
# Reload the running gateway
kt gateway reload \
--name local-dev \
--gateway-url http://localhost:41002 \
--config-path /path/to/new-config.yaml
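The lint and reload steps combine naturally into a guard that only reloads when lint passes. This is a minimal sketch, assuming kt policy lint exits non-zero on an invalid config (this page does not state that explicitly). safe_reload is a hypothetical wrapper, and the gateway name and URL are the example values from above.

```shell
# Hypothetical wrapper: reload the gateway only if the new config passes lint.
# Assumes `kt policy lint` returns a non-zero exit status on failure.
safe_reload() {
  config_path=$1
  if ! kt policy lint --file "$config_path"; then
    echo "lint failed; gateway not reloaded" >&2
    return 1
  fi
  kt gateway reload \
    --name local-dev \
    --gateway-url http://localhost:41002 \
    --config-path "$config_path"
}
```

Usage would be safe_reload /path/to/new-config.yaml. The running gateway keeps its current config whenever lint fails.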
Continuous health monitoring
Run kt doctor as a health check in your monitoring stack:
# Kubernetes liveness probe
# In your deployment manifest:
livenessProbe:
  exec:
    command: ["kt", "doctor", "--checks", "connectivity", "--exit-code"]
  initialDelaySeconds: 10
  periodSeconds: 30
# Docker health check
HEALTHCHECK --interval=30s --timeout=10s \
CMD kt doctor --checks connectivity,config --exit-code || exit 1
Exit codes
| Code | Meaning |
|---|---|
| 0 | All checks passed |
| 1 | One or more errors detected |
| 2 | One or more warnings (no errors) |
| 3 | Doctor command itself failed (invalid arguments, etc.) |
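In scripts and probes, these codes can be mapped to actions. The sketch below treats warnings as healthy, which is a design choice rather than documented guidance: exec-style probes treat any non-zero exit as failure, so if warnings surface as code 2, a probe would fail on them too. doctor_probe is a hypothetical wrapper.

```shell
# Hypothetical wrapper: run kt doctor with the given arguments and treat
# warnings-only (exit code 2) as healthy. Codes follow the table above.
doctor_probe() {
  code=0
  kt doctor "$@" --exit-code >/dev/null 2>&1 || code=$?
  case "$code" in
    0) echo "healthy" ;;
    2) echo "healthy (warnings)" ;;
    1) echo "unhealthy"; return 1 ;;
    *) echo "doctor failed (code $code)"; return 1 ;;
  esac
}
```

For example, doctor_probe --checks connectivity returns success for codes 0 and 2 and failure otherwise. Whether warnings should fail a probe depends on how strictly you want the orchestrator to react.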
Generating a support bundle
When you need to escalate to Keeptrusts support:
# Generate a full diagnostic bundle
kt doctor --bundle --output support-bundle.tar.gz
The bundle includes:
- Full kt doctor output
- Sanitized configuration (secrets redacted)
- Recent event samples (last 100 events)
- Gateway process logs (last 1000 lines)
- System resource snapshot (CPU, memory, disk, network)
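Before sending a bundle outside your organization, it may be worth confirming the redaction yourself. The sketch below uses only standard tar and grep and makes no assumption about the bundle's internal layout; bundle_contains is a hypothetical helper.

```shell
# Hypothetical helper: succeed if any file in a gzip-compressed tar archive
# contains the given literal string (for example, a known API key value).
# Returns 2 when the search string is empty, since an empty pattern
# would match every line.
bundle_contains() {
  bundle=$1
  needle=$2
  [ -n "$needle" ] || return 2
  tar -xzOf "$bundle" | grep -qF -- "$needle"
}
```

One possible use: if bundle_contains support-bundle.tar.gz "$KT_API_KEY"; then echo "secret found in bundle"; fi. A match means the bundle should not be shared until the source of the leak is understood.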
Business outcomes
| Outcome | How diagnostics help |
|---|---|
| Faster incident resolution | Structured diagnostics pinpoint root causes in minutes, not hours |
| Proactive issue detection | Scheduled health checks catch problems before users report them |
| Reduced escalations | Self-service debugging resolves most issues without support tickets |
| Reliable deployments | Pre-deployment health checks validate connectivity and configuration |
| Operational confidence | Continuous monitoring proves the gateway is healthy and performant |
For AI systems
- Canonical terms: kt doctor, --checks connectivity|config|upstream|performance, --exit-code, --bundle, --gateway, --group.
- Exit codes: 0 = all passed, 1 = errors, 2 = warnings only, 3 = command failure.
- Diagnostic categories: connectivity (API, port, TLS, DNS), config (schema, includes, variables, deprecations), upstream (DNS, TCP, TLS, auth per provider), performance (chain eval, latency percentiles, memory, queue depth).
- Best next pages: Performance Tuning, Multi-Gateway, Live Monitoring.
For engineers
- Quick start: kt doctor runs all checks; kt doctor --gateway gw-prod-01 targets a specific instance.
- Use in probes: kt doctor --checks connectivity --exit-code as a Kubernetes liveness probe or Docker HEALTHCHECK.
- Debugging 409s: kt events tail --filter "outcome=blocked" --format detailed to see which policy triggered.
- High latency: kt doctor --checks performance --verbose then check per-policy timing.
- Support bundle: kt doctor --bundle --output support-bundle.tar.gz includes sanitized config, recent events, and logs.
For leaders
- Structured diagnostics reduce mean-time-to-resolve from hours to minutes — less downtime for governed AI services.
- Self-service debugging resolves most issues without vendor support tickets.
- Health checks integrated into orchestration probes provide continuous assurance that governance controls are operational.
- Support bundles include sanitized data only — no secrets leak to external parties.
Next steps
- Tune Gateway Performance: optimize the metrics kt doctor reports
- Operate Multiple Gateways: run diagnostics across a fleet
- Monitor AI Traffic in Real-Time: complement diagnostics with live event streaming