# Performance Engineering the AI Gateway
The gateway sits in the hot path of every LLM request. This guide covers techniques to minimize the overhead it adds, maximize throughput, and establish measurable latency targets.
## Use this page when

- You are tuning gateway connection pooling, keep-alive, and HTTP/2 multiplexing for upstream providers
- You need to establish latency budgets (< 7 ms total gateway overhead at P99)
- You are configuring streaming response optimization or response caching
- You want to benchmark gateway throughput with `kt bench`
## Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
## Gateway Overhead Budget
A well-tuned gateway adds minimal latency to the request path:
| Component | Target Overhead | Measured At |
|---|---|---|
| Input policy evaluation | < 2 ms | P99 |
| Request routing | < 0.5 ms | P99 |
| Connection acquisition | < 1 ms | P99 (warm pool) |
| Output policy evaluation | < 3 ms | P99 |
| Event emission (async) | 0 ms | Non-blocking |
| Total gateway overhead | < 7 ms | P99 |
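The easiest way to verify these targets is `kt bench`, which reports measured gateway overhead separately from end-to-end latency (see Benchmarking with `kt bench` below).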
## Connection Reuse and Keep-Alive

### HTTP/2 Multiplexing

The gateway uses HTTP/2 for upstream connections, multiplexing requests over a single TCP connection:

```yaml
gateway:
  upstream:
    http_version: h2
    # Keep connections alive between requests
    keep_alive:
      enabled: true
      interval: 30s
      timeout: 60s
```
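To confirm that an upstream actually negotiates HTTP/2 rather than falling back to HTTP/1.1, you can inspect the ALPN exchange with `curl` (a quick check against the provider, not a gateway feature; requires a curl build with HTTP/2 support):

```bash
# Look for "ALPN: server accepted h2" in the TLS handshake output
curl -sv --http2 https://api.openai.com/v1/models -o /dev/null 2>&1 | grep -i alpn
```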
### Connection Pool Tuning

```yaml
gateway:
  connection_pool:
    # Per-provider pool sizes
    max_idle_per_host: 32
    # Total pool capacity
    max_total: 256
    # Recycle connections before provider-side timeout
    max_lifetime: 300s
    # Remove idle connections after this duration
    idle_timeout: 90s
```
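A rough way to size the pool is Little's law: in-flight connections ≈ sustained RPS × mean upstream latency. At 100 RPS against a provider answering in 250 ms on average, about 25 connections are in flight at once, so `max_idle_per_host: 32` leaves headroom for bursts without holding hundreds of idle sockets open.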
### Connection Warm-Up

Pre-warm connections on gateway startup to avoid cold-start latency:

```yaml
gateway:
  warmup:
    enabled: true
    # Establish this many connections per provider at startup
    connections_per_provider: 4
    # Lightweight health check to validate the connection
    probe_endpoint: /v1/models
```
## Request Pipelining

### Streaming Response Optimization

For streaming responses, the gateway evaluates output policies on buffered chunks:

```yaml
gateway:
  streaming:
    # Buffer size before running output policy evaluation
    chunk_buffer: 4KB
    # Flush interval: don't hold chunks longer than this
    flush_interval: 100ms
    # Enable chunked transfer encoding passthrough
    passthrough_chunked: true
```
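The two settings trade off against each other: a larger `chunk_buffer` gives output policies more context per evaluation, while a shorter `flush_interval` caps how long any chunk is delayed. With the values above, a chunk waits at most 100 ms before being flushed to the client even if the 4 KB buffer never fills.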
### Concurrent Policy Evaluation

When multiple policies apply, evaluate independent policies in parallel:

```yaml
policies:
  - name: content-filter
    type: output_filter
    parallel_group: output-checks
    action: block
  - name: pii-redaction
    type: output_filter
    parallel_group: output-checks
    action: redact
# These two run concurrently since they share a parallel_group
```
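Grouping matters for the latency budget: with both filters in the same `parallel_group`, output policy latency is bounded by the slower of the two checks rather than their sum, which makes the < 3 ms output policy target easier to hit.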
## Caching Strategies

### Response Caching

Cache identical requests to reduce provider calls and latency:

```yaml
gateway:
  cache:
    enabled: true
    # Cache backend
    backend: memory  # or redis
    # Maximum cache entries
    max_entries: 10000
    # TTL for cached responses
    ttl: 3600s
    # Only cache requests with temperature=0
    cache_when:
      temperature: 0
    # Cache key components
    key_includes: [model, messages, temperature, max_tokens]
```
### Cache Hit Flow
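In outline, the flow follows directly from the configuration above:

1. The gateway derives a cache key from the fields listed in `key_includes`.
2. On a hit within `ttl`, the cached response is returned without a provider call.
3. On a miss, the request is forwarded upstream; if it matches `cache_when` (here, `temperature: 0`), the response is stored under the key for subsequent requests.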
### Cache Invalidation

```bash
# Clear the entire cache
kt cache clear

# Clear cache for a specific model
kt cache clear --model gpt-4o

# View cache statistics
kt cache stats
```
## Benchmarking with `kt bench`

### Basic Benchmark

```bash
# Run 100 requests with 10 concurrent connections
kt bench \
  --url http://localhost:41002/v1/chat/completions \
  --requests 100 \
  --concurrency 10 \
  --model gpt-4o-mini \
  --prompt "Say hello in one word"
```
### Benchmark Output

```text
Benchmark Results:
  Total Requests: 100
  Successful:     98
  Failed:         2
  Duration:       12.3s

Latency (ms):
  P50:   245
  P90:   890
  P95: 1,230
  P99: 2,100
  Max: 3,450

Throughput:
  Requests/sec: 8.13
  Tokens/sec:   4,065

Gateway Overhead:
  P50: 1.2 ms
  P99: 4.8 ms
```
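In this run the gateway itself contributes 4.8 ms at P99, within the < 7 ms budget from the overhead table above; the rest of the latency is provider inference time.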
### Comparative Benchmarks

Compare direct-to-provider vs through-gateway:

```bash
# Direct to provider (baseline)
kt bench \
  --url https://api.openai.com/v1/chat/completions \
  --api-key "$OPENAI_API_KEY" \
  --requests 50 \
  --concurrency 5 \
  --model gpt-4o-mini \
  --prompt "Hello" \
  --output baseline.json

# Through gateway
kt bench \
  --url http://localhost:41002/v1/chat/completions \
  --requests 50 \
  --concurrency 5 \
  --model gpt-4o-mini \
  --prompt "Hello" \
  --output gateway.json

# Compare results
kt bench compare baseline.json gateway.json
```
### Streaming Benchmark

```bash
# Benchmark streaming responses
kt bench \
  --url http://localhost:41002/v1/chat/completions \
  --requests 50 \
  --concurrency 5 \
  --model gpt-4o \
  --prompt "Write a haiku about clouds" \
  --stream \
  --measure-ttfb  # Time to first byte
```
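For streamed responses, time to first byte is dominated by the provider's time-to-first-token plus at most one gateway `flush_interval` of buffering (100 ms with the streaming config above), so a TTFB regression usually points at `chunk_buffer`/`flush_interval` tuning.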
## P99 Latency Targets

### Setting Targets

Define latency SLOs per request category:
| Category | P50 Target | P99 Target | Timeout |
|---|---|---|---|
| Chat (short) | 300 ms | 2 s | 30 s |
| Chat (long context) | 2 s | 15 s | 120 s |
| Embeddings | 50 ms | 200 ms | 10 s |
| Batch processing | N/A | N/A | 300 s |
### Monitoring Latency

```bash
# Real-time latency monitoring
kt events tail --format "{{.latency_ms}}ms {{.model}} {{.status}}"

# Latency percentiles over the last hour
kt events stats --metric latency --percentiles 50,90,95,99 --last 1h
```
## Resource Sizing

### Gateway Process
| Workload | CPU | Memory | Connections |
|---|---|---|---|
| Light (< 10 RPS) | 0.5 vCPU | 128 MB | 64 |
| Medium (10–100 RPS) | 2 vCPU | 512 MB | 256 |
| Heavy (100–1000 RPS) | 4 vCPU | 1 GB | 1024 |
| Extreme (> 1000 RPS) | 8+ vCPU | 2+ GB | 4096 |
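The Connections column maps naturally onto `gateway.connection_pool.max_total` at each tier; treat the figures as starting points and refine them with the Little's law estimate above.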
### Kubernetes Resource Limits

```yaml
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    cpu: "2000m"
    memory: "512Mi"
```
## Performance Anti-Patterns

| Anti-Pattern | Impact | Fix |
|---|---|---|
| New TCP connection per request | +50–100 ms per request | Enable connection pooling |
| Synchronous event emission | Blocks response delivery | Use async event dispatch |
| Unbounded policy timeout | Pathological regex stalls pipeline | Set `policy_timeout` |
| No keep-alive | TCP + TLS handshake per request | Enable `keep_alive` |
| Cache disabled for deterministic queries | Unnecessary provider calls | Cache when `temperature=0` |
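To guard against the unbounded-policy-timeout case, cap evaluation time explicitly. A minimal sketch, assuming `policy_timeout` sits under the gateway's policy configuration (only the option name comes from the table above; `on_timeout` and the exact location in the schema are assumptions to verify against your deployment):

```yaml
gateway:
  policies:
    # Abort any single policy evaluation that exceeds its share of the
    # < 7 ms P99 overhead budget instead of stalling the pipeline
    policy_timeout: 5ms
    # Assumed option: fail closed when a policy times out
    on_timeout: block
```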
## Next steps
- Observability for AI-Governed Systems — monitor performance metrics
- Capacity Planning for AI Workloads — size infrastructure for load
- Distributed Tracing Across AI Services — trace latency bottlenecks
## For AI systems

- Canonical terms: `gateway.connection_pool`, `gateway.upstream.http_version: h2`, `gateway.warmup`, `gateway.streaming`, `kt bench`, keep-alive, connection reuse, P99 latency, `max_idle_per_host`, `chunk_buffer`, `flush_interval`
- Key targets: total gateway overhead < 7 ms P99, input policy < 2 ms, output policy < 3 ms, connection acquisition < 1 ms (warm)
- Best next pages: Capacity Planning, System Design: Integrating the AI Gateway, Resilience Engineering
## For engineers

- Enable HTTP/2 multiplexing: set `gateway.upstream.http_version: h2` to reuse TCP connections across concurrent requests
- Connection pool sizing: `max_idle_per_host: 32`, `max_total: 256`, `max_lifetime: 300s`, `idle_timeout: 90s`
- Warm-up on startup: `gateway.warmup.connections_per_provider: 4` to avoid cold-start latency
- Streaming: buffer 4 KB before output policy evaluation and flush every 100 ms to balance policy coverage against time-to-first-token
- Benchmark with `kt bench --requests 100 --concurrency 10 --model gpt-4o-mini`
## For leaders
- The gateway should add < 7 ms overhead at P99 — negligible compared to 200 ms–30 s LLM inference time
- Performance tuning reduces infrastructure cost by handling more traffic per gateway instance before scaling
- Connection pooling and HTTP/2 are the highest-leverage optimizations — they reduce latency and TCP overhead simultaneously