# System Design: Integrating the AI Gateway

This guide covers the end-to-end request flow through the Keeptrusts gateway and shows how to set latency budgets, configure connection pooling, and wire up timeout and circuit-breaker behavior.
## Use this page when
- You are designing the end-to-end request flow for LLM calls through the Keeptrusts gateway
- You need to set explicit latency budgets for each phase (policy evaluation, connection, upstream inference)
- You are configuring connection pooling, timeouts, and circuit breakers in the gateway
- You want to understand how streaming and event emission work in the request pipeline
## Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
## End-to-End Request Flow

Every LLM request passes through a structured pipeline.

### Phase Breakdown
| Phase | Description | Typical Latency |
|---|---|---|
| Input policy evaluation | Content filters, prompt injection detection, DLP | 1–5 ms |
| Provider routing | Target resolution, key lookup | < 1 ms |
| Upstream request | Network round-trip + LLM inference | 200 ms – 30 s |
| Output policy evaluation | Redaction, disclaimers, content filtering | 1–5 ms |
| Event emission | Async POST to control-plane API | 0 ms (non-blocking) |
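To see the pipeline end to end, you can send a single request through the gateway's OpenAI-compatible endpoint. A minimal sketch in TypeScript, assuming the default port 41002 from the config below (the model name and prompt are illustrative); a request blocked during input policy evaluation comes back as `409` with no upstream call:

```typescript
// One request through the full pipeline: input policy → provider routing →
// upstream inference → output policy, with event emission happening async.
const res = await fetch('http://localhost:41002/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'Hello' }],
  }),
});

if (res.status === 409) {
  // Rejected in the input phase; the provider was never called.
  console.error('Policy violation:', await res.json());
} else {
  const completion = await res.json();
  console.log(completion.choices[0].message.content);
}
```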
## Latency Budget Design
Define explicit latency budgets per phase to prevent cascading delays:
```yaml
# policy-config.yaml
gateway:
  listen_port: 41002
  timeouts:
    # Total time from request received to response sent
    request_timeout: 120s
    # Time to establish connection with upstream provider
    connect_timeout: 5s
    # Time waiting for first byte from provider
    first_byte_timeout: 30s
    # Time for policy chain evaluation (input + output)
    policy_timeout: 2s
```
### Latency Budget Breakdown

Guidelines:

- Set `request_timeout` to at least 2× your expected P99 provider latency (see the client-side sketch after this list)
- Streaming responses should use `first_byte_timeout` rather than the total timeout
- The policy evaluation timeout keeps pathological regex or pattern matching from consuming the request budget
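A client's own timeout should sit just above the gateway's `request_timeout`, so the gateway, not the caller, is the first to give up. A minimal client-side sketch, assuming the 120s budget from the config above plus a small margin:

```typescript
// Allow the gateway its full 120s request budget plus a 5s margin before the
// client aborts (AbortSignal.timeout needs Node 17.3+ or a modern browser).
const res = await fetch('http://localhost:41002/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: 'Hi' }],
  }),
  signal: AbortSignal.timeout(125_000),
});
```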
## Connection Pooling
The gateway maintains persistent connection pools to upstream providers:
```yaml
gateway:
  connection_pool:
    # Max idle connections per provider
    max_idle_per_host: 32
    # How long idle connections stay open
    idle_timeout: 90s
    # Max total connections across all providers
    max_total: 256
```
### Connection Lifecycle

Sizing guidelines:

- `max_idle_per_host`: Match your sustained requests-per-second per provider (see the sizing sketch below)
- `max_total`: Sum of all provider pools + 20% headroom
- `idle_timeout`: Keep below the provider's server-side timeout (typically 120s for OpenAI)
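One way to derive these numbers is Little's Law: concurrent connections ≈ sustained requests per second × mean request latency in seconds. A sketch of that arithmetic (the helper is purely illustrative, not part of the gateway):

```typescript
// Estimate pool sizes from observed traffic using Little's Law:
// concurrent connections ≈ requests/second × mean latency (seconds).
function poolSizing(providers: { rps: number; meanLatencySec: number }[]) {
  const perHost = providers.map((p) => Math.ceil(p.rps * p.meanLatencySec));
  return {
    maxIdlePerHost: perHost,
    // Sum of all provider pools plus 20% headroom, per the guideline above.
    maxTotal: Math.ceil(perHost.reduce((sum, n) => sum + n, 0) * 1.2),
  };
}

// Example: 10 RPS at a 2s mean latency needs ~20 connections for that provider.
console.log(poolSizing([{ rps: 10, meanLatencySec: 2 }]));
```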
## Timeout Configuration

### Per-Provider Timeouts
Different providers have different latency profiles:
```yaml
pack:
  name: system-design-gateway-providers-3
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai
      provider:
        base_url: https://api.openai.com/v1
        secret_key_ref:
          env: OPENAI_API_KEY
    - id: anthropic
      provider:
        base_url: https://api.anthropic.com/v1
        secret_key_ref:
          env: ANTHROPIC_API_KEY
    - id: local-llm
      provider:
        base_url: http://localhost:8080/v1
        secret_key_ref:
          env: LOCAL_LLM_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
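The pack above only registers the three targets. If your gateway version supports per-target overrides, the timeouts could then differ by provider; the `timeouts` block below uses assumed field names and is a sketch of the idea only, so check the configuration reference for the actual schema:

```yaml
providers:
  targets:
    - id: openai
      # Hypothetical per-target override; field names are assumptions
      timeouts:
        first_byte_timeout: 30s
    - id: local-llm
      timeouts:
        # A local model should answer quickly; fail fast if it does not
        first_byte_timeout: 5s
```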
### Streaming Timeout Behavior
For streaming responses, the gateway uses an inter-chunk timeout:
```yaml
gateway:
  streaming:
    # Max time between chunks before considering the stream stalled
    inter_chunk_timeout: 30s
    # Buffer size for streaming responses (for output policy eval)
    buffer_size: 64KB
```
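A client can mirror this behavior by putting a watchdog around each chunk read. A sketch using the Fetch streaming API (the 30s default mirrors `inter_chunk_timeout` above; Node is assumed for `process.stdout`):

```typescript
// Abort a streaming response if no chunk arrives within the inter-chunk
// window, mirroring the gateway's inter_chunk_timeout.
async function readStream(res: Response, interChunkMs = 30_000) {
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    let timer: ReturnType<typeof setTimeout>;
    const stalled = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error('stream stalled')), interChunkMs);
    });
    try {
      const { done, value } = await Promise.race([reader.read(), stalled]);
      if (done) break;
      process.stdout.write(decoder.decode(value, { stream: true }));
    } finally {
      clearTimeout(timer!);
    }
  }
}
```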
## Circuit Breaker Patterns

Protect your application from cascading failures when a provider is down.

### Gateway-Level Circuit Breaker
```yaml
pack:
  name: system-design-gateway-providers-5
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai
      provider:
        base_url: https://api.openai.com/v1
        secret_key_ref:
          env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
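The pack above wires the provider and audit logging; the breaker thresholds themselves would live in gateway configuration. The block below is a sketch with assumed field names, so verify the real keys against the gateway schema reference before relying on it:

```yaml
gateway:
  # Hypothetical circuit-breaker block; every field name here is an assumption
  circuit_breaker:
    # Consecutive upstream failures before the breaker opens
    failure_threshold: 5
    # How long to fail fast before probing the provider again
    cooldown: 30s
```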
### Application-Level Circuit Breaker
Wrap your gateway calls with an application-side circuit breaker for defense in depth:
```typescript
import CircuitBreaker from 'opossum'; // opossum exposes the class as its default export

type Message = { role: 'system' | 'user' | 'assistant'; content: string };

const breaker = new CircuitBreaker(callGateway, {
  timeout: 120_000, // 120s, matches the gateway's request_timeout
  errorThresholdPercentage: 50, // Trip when half of recent requests fail
  resetTimeout: 30_000, // 30s cooldown before a half-open probe
  volumeThreshold: 10, // Min requests in the window before tripping
});

// Served immediately whenever the breaker is open.
breaker.fallback(() => ({
  choices: [{ message: { content: 'Service temporarily unavailable.' } }],
}));

async function callGateway(messages: Message[]) {
  const response = await fetch('http://localhost:41002/v1/chat/completions', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'gpt-4o', messages }),
  });
  if (!response.ok) throw new Error(`Gateway error: ${response.status}`);
  return response.json();
}
```
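Route calls through `breaker.fire(messages)` rather than invoking `callGateway` directly, so every request counts toward the error threshold; while the breaker is open, opossum serves the fallback immediately without touching the gateway.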
## Load Balancer Integration

### Health Check Endpoint
The gateway exposes a health endpoint for load balancer probes:
```bash
curl http://localhost:41002/health
# {"status":"ok","version":"0.12.3","uptime_seconds":86400}
```
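Outside Kubernetes, the same endpoint works as a startup gate. A sketch that polls until the gateway reports healthy (the retry count and interval are arbitrary choices):

```typescript
// Poll /health until the gateway reports ok, then let traffic flow.
async function waitForGateway(url = 'http://localhost:41002/health', attempts = 30) {
  for (let i = 0; i < attempts; i++) {
    try {
      const res = await fetch(url);
      if (res.ok && (await res.json()).status === 'ok') return;
    } catch {
      // Gateway not listening yet; retry below.
    }
    await new Promise((resolve) => setTimeout(resolve, 1_000));
  }
  throw new Error('gateway failed to become healthy');
}
```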
### Kubernetes Service Configuration
```yaml
apiVersion: v1
kind: Service
metadata:
  name: kt-gateway
spec:
  selector:
    app: kt-gateway
  ports:
    - port: 41002
      targetPort: 41002
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kt-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: kt-gateway
  template:
    metadata:
      labels:
        app: kt-gateway
    spec:
      containers:
        - name: kt-gateway
          image: keeptrusts/gateway:latest
          ports:
            - containerPort: 41002
          livenessProbe:
            httpGet:
              path: /health
              port: 41002
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 41002
            initialDelaySeconds: 2
            periodSeconds: 5
```
## DNS and Service Discovery

### Direct Integration

```bash
# Application environment
OPENAI_BASE_URL=http://kt-gateway.internal:41002/v1
```
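Because the gateway speaks the OpenAI wire format, repointing `OPENAI_BASE_URL` is usually the only change an application needs. For example, the official OpenAI Node SDK picks the variable up automatically (shown as an illustration; any OpenAI-compatible client works the same way):

```typescript
import OpenAI from 'openai';

// Reads OPENAI_BASE_URL (the gateway) and OPENAI_API_KEY from the
// environment, so all traffic now flows through kt-gateway.
const client = new OpenAI();

const completion = await client.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello through the gateway' }],
});
console.log(completion.choices[0].message.content);
```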
### Service Mesh Integration
When running in a service mesh (Istio, Linkerd), the gateway acts as an egress point:
```yaml
# Istio ServiceEntry for LLM providers
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: openai-api
spec:
  hosts:
    - api.openai.com
  ports:
    - number: 443
      name: https
      protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL
```
## Next steps
- Resilience Engineering for AI Services — failover and degradation strategies
- Performance Engineering the AI Gateway — benchmarking and optimization
- Observability for AI-Governed Systems — monitoring the request flow
## For AI systems

- Canonical terms: request flow, input phase, output phase, `POST /v1/chat/completions`, `409 Policy Violation`, `POST /v1/events` (async), latency budget, `gateway.timeouts`, `request_timeout`, `connect_timeout`, `first_byte_timeout`, `policy_timeout`, connection pooling
- Key latency targets: input policy < 5 ms, routing < 1 ms, output policy < 5 ms, event emission 0 ms (non-blocking), total overhead < 10 ms
- Best next pages: Architecture Patterns, Performance Engineering, Resilience Engineering
## For engineers

- Set `request_timeout: 120s` (at least 2× expected P99 provider latency), `connect_timeout: 5s`, `first_byte_timeout: 30s`, `policy_timeout: 2s`
- Event emission is non-blocking: the gateway fires `POST /v1/events` asynchronously after responding to the caller
- For streaming: use `first_byte_timeout` instead of the total timeout; output policies evaluate on buffered chunks
- Connection pooling: pre-warm connections at startup to avoid cold-start latency on first requests
- The `409` response from input policy evaluation is immediate; no upstream call is made
## For leaders
- The gateway adds < 10 ms total overhead to LLM requests that typically take 200 ms–30 s — governance cost is negligible
- Proper timeout configuration prevents cascading failures from slow providers affecting all applications
- Asynchronous event emission means governance observability has zero impact on user-facing latency