# System Design: Integrating the AI Gateway

This guide covers the end-to-end request flow through the Keeptrusts gateway and shows how to set latency budgets, configure connection pooling, and wire up timeout and circuit-breaker behavior.
## Use this page when
- You are designing the end-to-end request flow for LLM calls through the Keeptrusts gateway
- You need to set explicit latency budgets for each phase (policy evaluation, connection, upstream inference)
- You are configuring connection pooling, timeouts, and circuit breakers in the gateway
- You want to understand how streaming and event emission work in the request pipeline
## Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
## End-to-End Request Flow

Every LLM request passes through a structured pipeline.

### Phase Breakdown
| Phase | Description | Typical Latency |
|---|---|---|
| Input policy evaluation | Content filters, prompt injection detection, DLP | 1–5 ms |
| Provider routing | Target resolution, key lookup | < 1 ms |
| Upstream request | Network round-trip + LLM inference | 200 ms – 30 s |
| Output policy evaluation | Redaction, disclaimers, content filtering | 1–5 ms |
| Event emission | Async POST to control-plane API | 0 ms (non-blocking) |
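To see the pipeline end to end, you can send a single request through the gateway's OpenAI-compatible endpoint. A minimal sketch in TypeScript, assuming the default port 41002 from the config below (the model name and prompt are illustrative); a request blocked during input policy evaluation comes back as `409` with no upstream call:

```typescript
// One request through the full pipeline: input policy → provider routing →
// upstream inference → output policy, with event emission happening async.
const res = await fetch('http://localhost:41002/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'Hello' }],
  }),
});

if (res.status === 409) {
  // Rejected in the input phase; the provider was never called.
  console.error('Policy violation:', await res.json());
} else {
  const completion = await res.json();
  console.log(completion.choices[0].message.content);
}
```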
## Latency Budget Design
Define explicit latency budgets per phase to prevent cascading delays:
```yaml
# policy-config.yaml
gateway:
  listen_port: 41002
  timeouts:
    # Total time from request received to response sent
    request_timeout: 120s
    # Time to establish connection with upstream provider
    connect_timeout: 5s
    # Time waiting for first byte from provider
    first_byte_timeout: 30s
    # Time for policy chain evaluation (input + output)
    policy_timeout: 2s
```
### Latency Budget Breakdown

Guidelines:

- Set `request_timeout` to at least 2× your expected P99 provider latency (see the client-side sketch after this list)
- Streaming responses should use `first_byte_timeout` rather than the total timeout
- The policy evaluation timeout keeps pathological regex or pattern matching from consuming the request budget
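A client's own timeout should sit just above the gateway's `request_timeout`, so the gateway, not the caller, is the first to give up. A minimal client-side sketch, assuming the 120s budget from the config above plus a small margin:

```typescript
// Allow the gateway its full 120s request budget plus a 5s margin before the
// client aborts (AbortSignal.timeout needs Node 17.3+ or a modern browser).
const res = await fetch('http://localhost:41002/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'Hi' }],
  }),
  signal: AbortSignal.timeout(125_000),
});
```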
## Connection Pooling
The gateway maintains persistent connection pools to upstream providers:
```yaml
gateway:
  connection_pool:
    # Max idle connections per provider
    max_idle_per_host: 32
    # How long idle connections stay open
    idle_timeout: 90s
    # Max total connections across all providers
    max_total: 256
```
### Connection Lifecycle

Sizing guidelines:

- `max_idle_per_host`: Match your sustained requests-per-second per provider (see the sizing sketch below)
- `max_total`: Sum of all provider pools + 20% headroom
- `idle_timeout`: Keep below the provider's server-side timeout (typically 120s for OpenAI)
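One way to derive these numbers is Little's Law: concurrent connections ≈ sustained requests per second × mean request latency in seconds. A sketch of that arithmetic (the helper is purely illustrative, not part of the gateway):

```typescript
// Estimate pool sizes from observed traffic using Little's Law:
// concurrent connections ≈ requests/second × mean latency (seconds).
function poolSizing(providers: { rps: number; meanLatencySec: number }[]) {
  const perHost = providers.map((p) => Math.ceil(p.rps * p.meanLatencySec));
  return {
    maxIdlePerHost: perHost,
    // Sum of all provider pools plus 20% headroom, per the guideline above.
    maxTotal: Math.ceil(perHost.reduce((sum, n) => sum + n, 0) * 1.2),
  };
}

// Example: 10 RPS at a 2s mean latency needs ~20 connections for that provider.
console.log(poolSizing([{ rps: 10, meanLatencySec: 2 }]));
```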
## Timeout Configuration

### Per-Provider Timeouts
Different providers have different latency profiles:
```yaml
pack:
  name: system-design-gateway-providers-3
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai
      provider:
        base_url: https://api.openai.com/v1
        secret_key_ref:
          env: OPENAI_API_KEY
    - id: anthropic
      provider:
        base_url: https://api.anthropic.com/v1
        secret_key_ref:
          env: ANTHROPIC_API_KEY
    - id: local-llm
      provider:
        base_url: http://localhost:8080/v1
        secret_key_ref:
          env: LOCAL_LLM_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
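The pack above only registers the three targets. If your gateway version supports per-target overrides, the timeouts could then differ by provider; the `timeouts` block below uses assumed field names and is a sketch of the idea only, so check the configuration reference for the actual schema:

```yaml
providers:
  targets:
    - id: openai
      # Hypothetical per-target override; field names are assumptions
      timeouts:
        first_byte_timeout: 30s
    - id: local-llm
      timeouts:
        # A local model should answer quickly; fail fast if it does not
        first_byte_timeout: 5s
```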
### Streaming Timeout Behavior
For streaming responses, the gateway uses an inter-chunk timeout:
```yaml
gateway:
  streaming:
    # Max time between chunks before considering the stream stalled
    inter_chunk_timeout: 30s
    # Buffer size for streaming responses (for output policy eval)
    buffer_size: 64KB
```
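A client can mirror this behavior by putting a watchdog around each chunk read. A sketch using the Fetch streaming API (the 30s default mirrors `inter_chunk_timeout` above; Node is assumed for `process.stdout`):

```typescript
// Abort a streaming response if no chunk arrives within the inter-chunk
// window, mirroring the gateway's inter_chunk_timeout.
async function readStream(res: Response, interChunkMs = 30_000) {
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    let timer: ReturnType<typeof setTimeout>;
    const stalled = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error('stream stalled')), interChunkMs);
    });
    try {
      const { done, value } = await Promise.race([reader.read(), stalled]);
      if (done) break;
      process.stdout.write(decoder.decode(value, { stream: true }));
    } finally {
      clearTimeout(timer!);
    }
  }
}
```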
## Circuit Breaker Patterns

Protect your application from cascading failures when a provider is down.

### Gateway-Level Circuit Breaker
```yaml
pack:
  name: system-design-gateway-providers-5
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai
      provider:
        base_url: https://api.openai.com/v1
        secret_key_ref:
          env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
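The pack above wires the provider and audit logging; the breaker thresholds themselves would live in gateway configuration. The block below is a sketch with assumed field names, so verify the real keys against the gateway schema reference before relying on it:

```yaml
gateway:
  # Hypothetical circuit-breaker block; every field name here is an assumption
  circuit_breaker:
    # Consecutive upstream failures before the breaker opens
    failure_threshold: 5
    # How long to fail fast before probing the provider again
    cooldown: 30s
```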
### Application-Level Circuit Breaker
Wrap your gateway calls with an application-side circuit breaker for defense in depth:
```typescript
import CircuitBreaker from 'opossum'; // opossum exposes the class as its default export

type Message = { role: 'system' | 'user' | 'assistant'; content: string };

const breaker = new CircuitBreaker(callGateway, {
  timeout: 120_000, // 120s, matches the gateway's request_timeout
  errorThresholdPercentage: 50, // Trip when half of recent requests fail
  resetTimeout: 30_000, // 30s cooldown before a half-open probe
  volumeThreshold: 10, // Min requests in the window before tripping
});

// Served immediately whenever the breaker is open.
breaker.fallback(() => ({
  choices: [{ message: { content: 'Service temporarily unavailable.' } }],
}));

async function callGateway(messages: Message[]) {
  const response = await fetch('http://localhost:41002/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'gpt-4o', messages }),
  });
  if (!response.ok) throw new Error(`Gateway error: ${response.status}`);
  return response.json();
}
```
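Route calls through `breaker.fire(messages)` rather than invoking `callGateway` directly, so every request counts toward the error threshold; while the breaker is open, opossum serves the fallback immediately without touching the gateway.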
## Load Balancer Integration

### Health Check Endpoint
The gateway exposes a health endpoint for load balancer probes:
```bash
curl http://localhost:41002/health
# {"status":"ok","version":"0.12.3","uptime_seconds":86400}
```
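Outside Kubernetes, the same endpoint works as a startup gate. A sketch that polls until the gateway reports healthy (the retry count and interval are arbitrary choices):

```typescript
// Poll /health until the gateway reports ok, then let traffic flow.
async function waitForGateway(url = 'http://localhost:41002/health', attempts = 30) {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url);
      if (res.ok && (await res.json()).status === 'ok') return;
    } catch {
      // Gateway not listening yet; retry below.
    }
    await new Promise((resolve) => setTimeout(resolve, 1_000));
  }
  throw new Error('gateway failed to become healthy');
}
```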
### Kubernetes Service Configuration
```yaml
apiVersion: v1
kind: Service
metadata:
  name: kt-gateway
spec:
  selector:
    app: kt-gateway
  ports:
    - port: 41002
      targetPort: 41002
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kt-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: kt-gateway
  template:
    metadata:
      labels:
        app: kt-gateway
    spec:
      containers:
        - name: kt-gateway
          image: keeptrusts/gateway:latest
          ports:
            - containerPort: 41002
          livenessProbe:
            httpGet:
              path: /health
              port: 41002
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 41002
            initialDelaySeconds: 2
            periodSeconds: 5
```
## DNS and Service Discovery

### Direct Integration

```bash
# Application environment
OPENAI_BASE_URL=http://kt-gateway.internal:41002/v1
```
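Because the gateway speaks the OpenAI wire format, repointing `OPENAI_BASE_URL` is usually the only change an application needs. For example, the official OpenAI Node SDK picks the variable up automatically (shown as an illustration; any OpenAI-compatible client works the same way):

```typescript
import OpenAI from 'openai';

// Reads OPENAI_BASE_URL (the gateway) and OPENAI_API_KEY from the
// environment, so all traffic now flows through kt-gateway.
const client = new OpenAI();

const completion = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello through the gateway' }],
});
console.log(completion.choices[0].message.content);
```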
### Service Mesh Integration
When running in a service mesh (Istio, Linkerd), the gateway acts as an egress point:
```yaml
# Istio ServiceEntry for LLM providers
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: openai-api
spec:
  hosts:
    - api.openai.com
  ports:
    - number: 443
      name: https
      protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL
```
## Next steps
- Resilience Engineering for AI Services — failover and degradation strategies
- Performance Engineering the AI Gateway — benchmarking and optimization
- Observability for AI-Governed Systems — monitoring the request flow
## For AI systems

- Canonical terms: request flow, input phase, output phase, `POST /v1/chat/completions`, `409 Policy Violation`, `POST /v1/events` (async), latency budget, `gateway.timeouts`, `request_timeout`, `connect_timeout`, `first_byte_timeout`, `policy_timeout`, connection pooling
- Key latency targets: input policy < 5 ms, routing < 1 ms, output policy < 5 ms, event emission 0 ms (non-blocking), total overhead < 10 ms
- Best next pages: Architecture Patterns, Performance Engineering, Resilience Engineering
## For engineers

- Set `request_timeout: 120s` (at least 2× expected P99 provider latency), `connect_timeout: 5s`, `first_byte_timeout: 30s`, `policy_timeout: 2s`
- Event emission is non-blocking: the gateway fires `POST /v1/events` asynchronously after responding to the caller
- For streaming: use `first_byte_timeout` instead of the total timeout; output policies evaluate on buffered chunks
- Connection pooling: pre-warm connections at startup to avoid cold-start latency on first requests
- The `409` response from input policy evaluation is immediate; no upstream call is made
## For leaders
- The gateway adds < 10 ms total overhead to LLM requests that typically take 200 ms–30 s — governance cost is negligible
- Proper timeout configuration prevents cascading failures from slow providers affecting all applications
- Asynchronous event emission means governance observability has zero impact on user-facing latency