System Design: Integrating the AI Gateway

This guide covers the end-to-end request flow through the Keeptrusts gateway and shows how to set latency budgets for each phase, configure connection pooling, and wire up timeout and circuit-breaker behavior.

Use this page when

  • You are designing the end-to-end request flow for LLM calls through the Keeptrusts gateway
  • You need to set explicit latency budgets for each phase (policy evaluation, connection, upstream inference)
  • You are configuring connection pooling, timeouts, and circuit breakers in the gateway
  • You want to understand how streaming and event emission work in the request pipeline

Primary audience

  • Primary: Technical Engineers
  • Secondary: AI Agents, Technical Leaders

End-to-End Request Flow

Every LLM request passes through a structured pipeline:

Phase Breakdown

| Phase | Description | Typical Latency |
| --- | --- | --- |
| Input policy evaluation | Content filters, prompt injection detection, DLP | 1–5 ms |
| Provider routing | Target resolution, key lookup | < 1 ms |
| Upstream request | Network round-trip + LLM inference | 200 ms – 30 s |
| Output policy evaluation | Redaction, disclaimers, content filtering | 1–5 ms |
| Event emission | Async POST to control-plane API | 0 ms (non-blocking) |
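
From the caller's side, the phases above collapse into a handful of observable outcomes. As noted later in this guide, a 409 from input policy evaluation is returned immediately, before any upstream call is made. The helper below is an illustrative sketch of how a client might classify gateway responses; the 409-means-policy-violation convention comes from this guide, while the other mappings are assumptions:

```typescript
// Classify a gateway HTTP status into a caller-side action.
type GatewayOutcome = "ok" | "policy_violation" | "retryable" | "upstream_error";

function classifyGatewayStatus(status: number): GatewayOutcome {
  if (status === 200) return "ok";
  // Input policy rejected the request; no upstream call was made,
  // so retrying the identical prompt will fail again.
  if (status === 409) return "policy_violation";
  // 5xx: gateway or provider failure; safe to retry with backoff.
  if (status >= 500) return "retryable";
  return "upstream_error";
}
```

The key design point: a policy violation is deterministic and should not be retried, whereas 5xx failures are transient and belong in a retry/backoff path.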

Latency Budget Design

Define explicit latency budgets per phase to prevent cascading delays:

# policy-config.yaml
gateway:
  listen_port: 41002
  timeouts:
    # Total time from request received to response sent
    request_timeout: 120s
    # Time to establish a connection with the upstream provider
    connect_timeout: 5s
    # Time waiting for the first byte from the provider
    first_byte_timeout: 30s
    # Time for policy chain evaluation (input + output)
    policy_timeout: 2s

Latency Budget Breakdown

Guidelines:

  • Set request_timeout to at least 2× your expected P99 provider latency
  • Streaming responses should use first_byte_timeout rather than total timeout
  • Policy evaluation timeout prevents pathological regex or pattern matching
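
The guidelines above can be expressed as a small helper that derives a timeout budget from a measured P99 provider latency. This is an illustrative function, not a gateway API; the fixed values come from the sample config above:

```typescript
// Derive gateway timeout settings (in ms) from measured P99 provider latency.
function recommendTimeouts(p99ProviderMs: number) {
  return {
    requestTimeoutMs: 2 * p99ProviderMs, // at least 2x expected P99 provider latency
    connectTimeoutMs: 5_000,             // defaults from the sample config above
    firstByteTimeoutMs: 30_000,          // used instead of total timeout for streaming
    policyTimeoutMs: 2_000,              // bounds pathological regex/pattern matching
  };
}
```

For example, a provider with a 60 s P99 yields a 120 s `request_timeout`, matching the sample config.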

Connection Pooling

The gateway maintains persistent connection pools to upstream providers:

gateway:
  connection_pool:
    # Max idle connections per provider
    max_idle_per_host: 32
    # How long idle connections stay open
    idle_timeout: 90s
    # Max total connections across all providers
    max_total: 256

Connection Lifecycle

Sizing guidelines:

  • max_idle_per_host: Match your sustained requests-per-second per provider
  • max_total: Sum of all provider pools + 20% headroom
  • idle_timeout: Keep below the provider's server-side timeout (typically 120s for OpenAI)
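
The sizing rules above are mechanical enough to compute. The sketch below is illustrative (the function and its shape are not a gateway API); the 20% headroom figure is the one stated in the guidelines:

```typescript
// Size connection pools from sustained requests-per-second per provider.
function poolSizing(sustainedRpsPerProvider: Record<string, number>) {
  // max_idle_per_host: match sustained RPS for each provider
  const perHost: Record<string, number> = {};
  for (const [host, rps] of Object.entries(sustainedRpsPerProvider)) {
    perHost[host] = Math.ceil(rps);
  }
  // max_total: sum of all provider pools + 20% headroom
  const sum = Object.values(perHost).reduce((a, b) => a + b, 0);
  return {
    maxIdlePerHost: perHost,
    maxTotal: Math.ceil(sum * 1.2),
  };
}
```

For example, sustained loads of 20 RPS to OpenAI and 10 RPS to Anthropic give per-host pools of 20 and 10 and a `max_total` of 36.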

Timeout Configuration

Per-Provider Timeouts

Different providers have different latency profiles, so start by registering each one as a separate routing target:

pack:
  name: system-design-gateway-providers-3
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai
      provider:
        base_url: https://api.openai.com/v1
        secret_key_ref:
          env: OPENAI_API_KEY
    - id: anthropic
      provider:
        base_url: https://api.anthropic.com/v1
        secret_key_ref:
          env: ANTHROPIC_API_KEY
    - id: local-llm
      provider:
        base_url: http://localhost:8080/v1
        secret_key_ref:
          env: LOCAL_LLM_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
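
The pack above registers the targets but does not itself carry timeout settings. If your gateway version supports per-target overrides, they might look like the following sketch; the `timeouts` block under each target is an assumption modeled on the `gateway.timeouts` keys shown earlier, not a documented schema:

```yaml
providers:
  targets:
    - id: openai
      provider:
        base_url: https://api.openai.com/v1
        secret_key_ref:
          env: OPENAI_API_KEY
      # Assumed per-target keys mirroring gateway.timeouts
      timeouts:
        first_byte_timeout: 30s   # hosted API: allow a slow first token
    - id: local-llm
      provider:
        base_url: http://localhost:8080/v1
        secret_key_ref:
          env: LOCAL_LLM_KEY
      timeouts:
        connect_timeout: 1s       # local process: fail fast
        first_byte_timeout: 10s
```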

Streaming Timeout Behavior

For streaming responses, the gateway uses an inter-chunk timeout:

gateway:
  streaming:
    # Max time between chunks before considering the stream stalled
    inter_chunk_timeout: 30s
    # Buffer size for streaming responses (for output policy eval)
    buffer_size: 64KB
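
A client consuming a streamed response can apply the same inter-chunk rule: reset a timer on every chunk and abort if the gap exceeds the budget. The sketch below is a client-side mirror of the gateway's `inter_chunk_timeout`, not a gateway API:

```typescript
// Collect chunks from an async stream, failing if the gap between
// consecutive chunks exceeds interChunkTimeoutMs.
async function collectWithInterChunkTimeout<T>(
  chunks: AsyncIterable<T>,
  interChunkTimeoutMs: number,
): Promise<T[]> {
  const out: T[] = [];
  const it = chunks[Symbol.asyncIterator]();
  while (true) {
    let timer: ReturnType<typeof setTimeout> | undefined;
    // A promise that rejects if no chunk arrives within the budget.
    const stallGuard = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error("stream stalled")), interChunkTimeoutMs);
    });
    try {
      const { value, done } = await Promise.race([it.next(), stallGuard]);
      if (done) return out;
      out.push(value);
    } finally {
      clearTimeout(timer); // reset the stall timer after every chunk
    }
  }
}
```

Note the timer restarts per chunk: a long stream is fine as long as no single gap stalls.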

Circuit Breaker Patterns

Protect your application from cascading failures when a provider is down:

Gateway-Level Circuit Breaker

pack:
  name: system-design-gateway-providers-5
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai
      provider:
        base_url: https://api.openai.com/v1
        secret_key_ref:
          env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
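
The pack above attaches only the audit-logger policy; the breaker settings themselves are not shown. A gateway-level breaker configuration might look like the following sketch — every key under `circuit_breaker` is an assumption rather than a documented schema, with thresholds chosen to mirror typical application-level settings:

```yaml
gateway:
  circuit_breaker:
    # Assumed keys, not a documented schema
    error_threshold_percentage: 50   # trip once half of sampled requests fail
    volume_threshold: 10             # minimum requests before the breaker can trip
    open_duration: 30s               # reject fast for 30s before a half-open probe
```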

Application-Level Circuit Breaker

Wrap your gateway calls with an application-side circuit breaker for defense in depth:

import CircuitBreaker from 'opossum';

type Message = { role: 'system' | 'user' | 'assistant'; content: string };

const breaker = new CircuitBreaker(callGateway, {
  timeout: 120_000, // 120s, matches the gateway request_timeout
  errorThresholdPercentage: 50,
  resetTimeout: 30_000, // 30s cooldown before a half-open probe
  volumeThreshold: 10, // min requests in the rolling window before tripping
});

breaker.fallback(() => ({
  choices: [{ message: { content: 'Service temporarily unavailable.' } }],
}));

async function callGateway(messages: Message[]) {
  const response = await fetch('http://localhost:41002/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'gpt-4o', messages }),
  });
  if (!response.ok) throw new Error(`Gateway error: ${response.status}`);
  return response.json();
}

Load Balancer Integration

Health Check Endpoint

The gateway exposes a health endpoint for load balancer probes:

curl http://localhost:41002/health
# {"status":"ok","version":"0.12.3","uptime_seconds":86400}
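
A load balancer or sidecar can make a pass/fail decision from that payload. The helper below is an illustrative sketch; the payload shape (`status`, `version`, `uptime_seconds`) matches the sample response above:

```typescript
// Shape of the gateway /health response, per the sample above.
interface HealthResponse {
  status: string;
  version: string;
  uptime_seconds: number;
}

// A probe passes only when the gateway reports "ok".
function isHealthy(body: HealthResponse): boolean {
  return body.status === "ok" && body.uptime_seconds >= 0;
}
```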

Kubernetes Service Configuration

apiVersion: v1
kind: Service
metadata:
  name: kt-gateway
spec:
  selector:
    app: kt-gateway
  ports:
    - port: 41002
      targetPort: 41002
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kt-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: kt-gateway
  template:
    metadata:
      labels:
        app: kt-gateway
    spec:
      containers:
        - name: kt-gateway
          image: keeptrusts/gateway:latest
          ports:
            - containerPort: 41002
          livenessProbe:
            httpGet:
              path: /health
              port: 41002
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 41002
            initialDelaySeconds: 2
            periodSeconds: 5

DNS and Service Discovery

Direct Integration

# Application environment
OPENAI_BASE_URL=http://kt-gateway.internal:41002/v1

Service Mesh Integration

When running in a service mesh (Istio, Linkerd), the gateway acts as an egress point:

# Istio ServiceEntry for LLM providers
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: openai-api
spec:
  hosts:
    - api.openai.com
  ports:
    - number: 443
      name: https
      protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL

Next steps

For AI systems

  • Canonical terms: request flow, input phase, output phase, POST /v1/chat/completions, 409 Policy Violation, POST /v1/events (async), latency budget, gateway.timeouts, request_timeout, connect_timeout, first_byte_timeout, policy_timeout, connection pooling
  • Key latency targets: input policy < 5 ms, routing < 1 ms, output policy < 5 ms, event emission 0 ms (non-blocking), total overhead < 10 ms
  • Best next pages: Architecture Patterns, Performance Engineering, Resilience Engineering

For engineers

  • Set request_timeout: 120s (at least 2× expected P99 provider latency), connect_timeout: 5s, first_byte_timeout: 30s, policy_timeout: 2s
  • Event emission is non-blocking — the gateway fires POST /v1/events asynchronously after responding to the caller
  • For streaming: use first_byte_timeout instead of total timeout; output policies evaluate on buffered chunks
  • Connection pooling: pre-warm connections at startup to avoid cold-start latency on first requests
  • The 409 response from input policy evaluation is immediate — no upstream call is made

For leaders

  • The gateway adds < 10 ms total overhead to LLM requests that typically take 200 ms–30 s — governance cost is negligible
  • Proper timeout configuration prevents cascading failures from slow providers affecting all applications
  • Asynchronous event emission means governance observability has zero impact on user-facing latency