Performance Engineering the AI Gateway

The gateway sits in the hot path of every LLM request. This guide covers techniques to minimize the overhead it adds, maximize throughput, and establish measurable latency targets.

Use this page when

  • You are tuning gateway connection pooling, keep-alive, and HTTP/2 multiplexing for upstream providers
  • You need to establish latency budgets (< 7 ms total gateway overhead at P99)
  • You are configuring streaming response optimization or response caching
  • You want to benchmark gateway throughput with kt bench

Audience

  • Primary: Technical Engineers
  • Secondary: AI Agents, Technical Leaders

Gateway Overhead Budget

A well-tuned gateway adds minimal latency to the request path:

Component                  Target Overhead   Measured At
Input policy evaluation    < 2 ms            P99
Request routing            < 0.5 ms          P99
Connection acquisition     < 1 ms            P99 (warm pool)
Output policy evaluation   < 3 ms            P99
Event emission (async)     0 ms              Non-blocking
Total gateway overhead     < 7 ms            P99
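
The component targets sum to 6.5 ms, leaving roughly 0.5 ms of headroom against the 7 ms total. In practice the margin is larger: per-component P99s rarely coincide on the same request, so the P99 of the total is below the sum of the component P99s.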

Connection Reuse and Keep-Alive

HTTP/2 Multiplexing

The gateway uses HTTP/2 for upstream connections, multiplexing requests over a single TCP connection:

gateway:
  upstream:
    http_version: h2
    # Keep connections alive between requests
    keep_alive:
      enabled: true
      interval: 30s
      timeout: 60s
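
To confirm that an upstream endpoint actually negotiates HTTP/2, a quick check with curl (assumes curl 7.50+ built with HTTP/2 support):

# Prints "2" when the provider negotiates HTTP/2
curl -so /dev/null -w '%{http_version}\n' https://api.openai.com/v1/models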

Connection Pool Tuning

gateway:
  connection_pool:
    # Per-provider pool sizes
    max_idle_per_host: 32
    # Total pool capacity
    max_total: 256
    # Recycle connections before provider-side timeout
    max_lifetime: 300s
    # Remove idle connections after this duration
    idle_timeout: 90s
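
A rough sizing check via Little's law: in-flight upstream connections ≈ RPS × mean provider latency. At 100 RPS with a 1 s mean completion time, about 100 connections are busy at steady state, so max_total: 256 leaves burst headroom while max_idle_per_host: 32 bounds the idle connections held per provider.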

Connection Warm-Up

Pre-warm connections on gateway startup to avoid cold-start latency:

gateway:
  warmup:
    enabled: true
    # Establish this many connections per provider at startup
    connections_per_provider: 4
    # Lightweight health check to validate the connection
    probe_endpoint: /v1/models
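
Each probe is roughly equivalent to the following lightweight authenticated GET (a sketch, assuming an OpenAI-compatible provider):

# What a warm-up probe amounts to: a cheap GET that forces TCP and TLS
# establishment before real traffic arrives
curl -s https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY" >/dev/null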

Request Pipelining

Streaming Response Optimization

For streaming responses, the gateway evaluates output policies on buffered chunks:

gateway:
  streaming:
    # Buffer size before running output policy evaluation
    chunk_buffer: 4KB
    # Flush interval — don't hold chunks longer than this
    flush_interval: 100ms
    # Enable chunked transfer encoding passthrough
    passthrough_chunked: true
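
With these settings, the gateway adds at most one flush_interval (100 ms here) to time-to-first-token while the first chunk_buffer fills. A minimal streaming request through the gateway, assuming the local gateway address used in the benchmarks below:

# Stream a completion; chunks are policy-checked in 4 KB buffers and
# flushed at least every 100 ms per the configuration above
curl -N http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "stream": true, "messages": [{"role": "user", "content": "Write a haiku about clouds"}]}'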

Concurrent Policy Evaluation

When multiple policies apply, evaluate independent policies in parallel:

policies:
  - name: content-filter
    type: output_filter
    parallel_group: output-checks
    action: block

  - name: pii-redaction
    type: output_filter
    parallel_group: output-checks
    action: redact

# These two run concurrently since they share a parallel_group
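
Because the group's members run concurrently, the group contributes the slowest member's latency to the output-policy budget rather than the sum of all members.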

Caching Strategies

Response Caching

Cache identical requests to reduce provider calls and latency:

gateway:
  cache:
    enabled: true
    # Cache backend
    backend: memory  # or redis
    # Maximum cache entries
    max_entries: 10000
    # TTL for cached responses
    ttl: 3600s
    # Only cache requests with temperature=0
    cache_when:
      temperature: 0
    # Cache key components
    key_includes: [model, messages, temperature, max_tokens]
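
To see a hit end to end, send the same temperature-0 request twice; the second should be served from cache without a provider call (a sketch, assuming the local gateway address used in the benchmarks below):

# Identical deterministic requests: the first misses, the second hits
for i in 1 2; do
  curl -s http://localhost:41002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "gpt-4o-mini", "temperature": 0, "max_tokens": 16, "messages": [{"role": "user", "content": "Say hello in one word"}]}' >/dev/null
done
kt cache stats  # statistics should reflect one miss and one hit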

Cache Hit Flow

On a hit, the gateway builds the cache key from the configured key_includes components, finds a fresh entry, and returns the stored response without an upstream call. On a miss, the request is forwarded to the provider and the response is stored for subsequent identical requests within the TTL.

Cache Invalidation

# Clear the entire cache
kt cache clear

# Clear cache for a specific model
kt cache clear --model gpt-4o

# View cache statistics
kt cache stats

Benchmarking with kt bench

Basic Benchmark

# Run 100 requests with 10 concurrent connections
kt bench \
  --url http://localhost:41002/v1/chat/completions \
  --requests 100 \
  --concurrency 10 \
  --model gpt-4o-mini \
  --prompt "Say hello in one word"

Benchmark Output

Benchmark Results:
  Total Requests: 100
  Successful:     98
  Failed:         2
  Duration:       12.3s

Latency (ms):
  P50: 245
  P90: 890
  P95: 1,230
  P99: 2,100
  Max: 3,450

Throughput:
  Requests/sec: 8.13
  Tokens/sec:   4,065

Gateway Overhead:
  P50: 1.2 ms
  P99: 4.8 ms

Comparative Benchmarks

Compare direct-to-provider requests (the baseline) against requests routed through the gateway:

# Direct to provider (baseline)
kt bench \
  --url https://api.openai.com/v1/chat/completions \
  --api-key "$OPENAI_API_KEY" \
  --requests 50 \
  --concurrency 5 \
  --model gpt-4o-mini \
  --prompt "Hello" \
  --output baseline.json

# Through gateway
kt bench \
  --url http://localhost:41002/v1/chat/completions \
  --requests 50 \
  --concurrency 5 \
  --model gpt-4o-mini \
  --prompt "Hello" \
  --output gateway.json

# Compare results
kt bench compare baseline.json gateway.json

Streaming Benchmark

# Benchmark streaming responses
kt bench \
  --url http://localhost:41002/v1/chat/completions \
  --requests 50 \
  --concurrency 5 \
  --model gpt-4o \
  --prompt "Write a haiku about clouds" \
  --stream \
  --measure-ttfb  # Time to first byte
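
Through-gateway TTFB should exceed a direct-to-provider baseline by no more than roughly one flush_interval plus the input-policy budget; if the gap is larger, revisit the streaming buffer settings above.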

P99 Latency Targets

Setting Targets

Define latency SLOs per request category:

Category              P50 Target   P99 Target   Timeout
Chat (short)          300 ms       2 s          30 s
Chat (long context)   2 s          15 s         120 s
Embeddings            50 ms        200 ms       10 s
Batch processing      N/A          N/A          300 s

Monitoring Latency

# Real-time latency monitoring
kt events tail --format "{{.latency_ms}}ms {{.model}} {{.status}}"

# Latency percentiles over the last hour
kt events stats --metric latency --percentiles 50,90,95,99 --last 1h

Resource Sizing

Gateway Process

Workload                CPU        Memory   Connections
Light (< 10 RPS)        0.5 vCPU   128 MB   64
Medium (10–100 RPS)     2 vCPU     512 MB   256
Heavy (100–1000 RPS)    4 vCPU     1 GB     1024
Extreme (> 1000 RPS)    8+ vCPU    2+ GB    4096

Kubernetes Resource Limits

resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    cpu: "2000m"
    memory: "512Mi"

Performance Anti-Patterns

Anti-Pattern                               Impact                               Fix
New TCP connection per request             +50–100 ms per request               Enable connection pooling
Synchronous event emission                 Blocks response delivery             Use async event dispatch
Unbounded policy timeout                   Pathological regex stalls pipeline   Set policy_timeout
No keep-alive                              TCP + TLS handshake per request      Enable keep_alive
Cache disabled for deterministic queries   Unnecessary provider calls           Cache when temperature=0
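
For the unbounded-timeout row, a sketch of bounding evaluation with the policy_timeout setting named above (its exact placement in the policy schema is an assumption):

policies:
  - name: content-filter
    type: output_filter
    action: block
    # Assumed per-policy placement: cap evaluation time so a pathological
    # pattern cannot stall the response pipeline
    policy_timeout: 50ms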

Next steps

For AI systems

  • Canonical terms: gateway.connection_pool, gateway.upstream.http_version: h2, gateway.warmup, gateway.streaming, kt bench, keep-alive, connection reuse, P99 latency, max_idle_per_host, chunk_buffer, flush_interval
  • Key targets: total gateway overhead < 7 ms P99, input policy < 2 ms, output policy < 3 ms, connection acquisition < 1 ms (warm)
  • Best next pages: Capacity Planning, System Design: Integrating the AI Gateway, Resilience Engineering

For engineers

  • Enable HTTP/2 multiplexing: gateway.upstream.http_version: h2 to reuse TCP connections across concurrent requests
  • Connection pool sizing: max_idle_per_host: 32, max_total: 256, max_lifetime: 300s, idle_timeout: 90s
  • Warm-up on startup: gateway.warmup.connections_per_provider: 4 to avoid cold-start latency
  • Streaming: buffer 4 KB before output policy evaluation, flush every 100 ms to balance policy coverage and time-to-first-token
  • Benchmark with: kt bench --requests 100 --concurrency 10 --model gpt-4o-mini

For leaders

  • The gateway should add < 7 ms overhead at P99 — negligible compared to 200 ms–30 s LLM inference time
  • Performance tuning reduces infrastructure cost by handling more traffic per gateway instance before scaling
  • Connection pooling and HTTP/2 are the highest-leverage optimizations — they reduce latency and TCP overhead simultaneously