# Performance Engineering the AI Gateway
The gateway sits in the hot path of every LLM request. This guide covers techniques to minimize the overhead it adds, maximize throughput, and establish measurable latency targets.
## Use this page when

- You are tuning gateway connection pooling, keep-alive, and HTTP/2 multiplexing for upstream providers
- You need to establish latency budgets (< 7 ms total gateway overhead at P99)
- You are configuring streaming response optimization or response caching
- You want to benchmark gateway throughput with `kt bench`
## Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
## Gateway Overhead Budget
A well-tuned gateway adds minimal latency to the request path:
| Component | Target Overhead | Measured At |
|---|---|---|
| Input policy evaluation | < 2 ms | P99 |
| Request routing | < 0.5 ms | P99 |
| Connection acquisition | < 1 ms | P99 (warm pool) |
| Output policy evaluation | < 3 ms | P99 |
| Event emission (async) | 0 ms | Non-blocking |
| Total gateway overhead | < 7 ms | P99 |
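The easiest way to verify these targets is `kt bench`, which reports measured gateway overhead separately from end-to-end latency (see Benchmarking with `kt bench` below).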
## Connection Reuse and Keep-Alive

### HTTP/2 Multiplexing

The gateway uses HTTP/2 for upstream connections, multiplexing requests over a single TCP connection:

```yaml
gateway:
  upstream:
    http_version: h2
    # Keep connections alive between requests
    keep_alive:
      enabled: true
      interval: 30s
      timeout: 60s
```
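To confirm that an upstream actually negotiates HTTP/2 rather than falling back to HTTP/1.1, you can inspect the ALPN exchange with `curl` (a quick check against the provider, not a gateway feature; requires a curl build with HTTP/2 support):

```bash
# Look for "ALPN: server accepted h2" in the TLS handshake output
curl -sv --http2 https://api.openai.com/v1/models -o /dev/null 2>&1 | grep -i alpn
```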
### Connection Pool Tuning

```yaml
gateway:
  connection_pool:
    # Per-provider pool sizes
    max_idle_per_host: 32
    # Total pool capacity
    max_total: 256
    # Recycle connections before provider-side timeout
    max_lifetime: 300s
    # Remove idle connections after this duration
    idle_timeout: 90s
```
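A rough way to size the pool is Little's law: in-flight connections ≈ sustained RPS × mean upstream latency. At 100 RPS against a provider answering in 250 ms on average, about 25 connections are in flight at once, so `max_idle_per_host: 32` leaves headroom for bursts without holding hundreds of idle sockets open.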
### Connection Warm-Up

Pre-warm connections on gateway startup to avoid cold-start latency:

```yaml
gateway:
  warmup:
    enabled: true
    # Establish this many connections per provider at startup
    connections_per_provider: 4
    # Lightweight health check to validate the connection
    probe_endpoint: /v1/models
```
## Request Pipelining

### Streaming Response Optimization

For streaming responses, the gateway evaluates output policies on buffered chunks:

```yaml
gateway:
  streaming:
    # Buffer size before running output policy evaluation
    chunk_buffer: 4KB
    # Flush interval: don't hold chunks longer than this
    flush_interval: 100ms
    # Enable chunked transfer encoding passthrough
    passthrough_chunked: true
```
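The two settings trade off against each other: a larger `chunk_buffer` gives output policies more context per evaluation, while a shorter `flush_interval` caps how long any chunk is delayed. With the values above, a chunk waits at most 100 ms before being flushed to the client even if the 4 KB buffer never fills.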
### Concurrent Policy Evaluation

When multiple policies apply, evaluate independent policies in parallel:

```yaml
policies:
  - name: content-filter
    type: output_filter
    parallel_group: output-checks
    action: block
  - name: pii-redaction
    type: output_filter
    parallel_group: output-checks
    action: redact
# These two run concurrently since they share a parallel_group
```
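Grouping matters for the latency budget: with both filters in the same `parallel_group`, output policy latency is bounded by the slower of the two checks rather than their sum, which makes the < 3 ms output policy target easier to hit.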
## Caching Strategies

### Response Caching

Cache identical requests to reduce provider calls and latency:

```yaml
gateway:
  cache:
    enabled: true
    # Cache backend
    backend: memory  # or redis
    # Maximum cache entries
    max_entries: 10000
    # TTL for cached responses
    ttl: 3600s
    # Only cache requests with temperature=0
    cache_when:
      temperature: 0
    # Cache key components
    key_includes: [model, messages, temperature, max_tokens]
```
### Cache Hit Flow
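In outline, the flow follows directly from the configuration above:

1. The gateway derives a cache key from the fields listed in `key_includes`.
2. On a hit within `ttl`, the cached response is returned without a provider call.
3. On a miss, the request is forwarded upstream; if it matches `cache_when` (here, `temperature: 0`), the response is stored under the key for subsequent requests.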
### Cache Invalidation

```bash
# Clear the entire cache
kt cache clear

# Clear cache for a specific model
kt cache clear --model gpt-4o

# View cache statistics
kt cache stats
```
## Benchmarking with `kt bench`

### Basic Benchmark

```bash
# Run 100 requests with 10 concurrent connections
kt bench \
  --url http://localhost:41002/v1/chat/completions \
  --requests 100 \
  --concurrency 10 \
  --model gpt-4o-mini \
  --prompt "Say hello in one word"
```
### Benchmark Output

```text
Benchmark Results:
  Total Requests: 100
  Successful:     98
  Failed:         2
  Duration:       12.3s

Latency (ms):
  P50:   245
  P90:   890
  P95: 1,230
  P99: 2,100
  Max: 3,450

Throughput:
  Requests/sec: 8.13
  Tokens/sec:   4,065

Gateway Overhead:
  P50: 1.2 ms
  P99: 4.8 ms
```
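In this run the gateway itself contributes 4.8 ms at P99, within the < 7 ms budget from the overhead table above; the rest of the latency is provider inference time.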
### Comparative Benchmarks

Compare direct-to-provider vs through-gateway:

```bash
# Direct to provider (baseline)
kt bench \
  --url https://api.openai.com/v1/chat/completions \
  --api-key "$OPENAI_API_KEY" \
  --requests 50 \
  --concurrency 5 \
  --model gpt-4o-mini \
  --prompt "Hello" \
  --output baseline.json

# Through gateway
kt bench \
  --url http://localhost:41002/v1/chat/completions \
  --requests 50 \
  --concurrency 5 \
  --model gpt-4o-mini \
  --prompt "Hello" \
  --output gateway.json

# Compare results
kt bench compare baseline.json gateway.json
```
### Streaming Benchmark

```bash
# Benchmark streaming responses
kt bench \
  --url http://localhost:41002/v1/chat/completions \
  --requests 50 \
  --concurrency 5 \
  --model gpt-4o \
  --prompt "Write a haiku about clouds" \
  --stream \
  --measure-ttfb  # Time to first byte
```
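For streamed responses, time to first byte is dominated by the provider's time-to-first-token plus at most one gateway `flush_interval` of buffering (100 ms with the streaming config above), so a TTFB regression usually points at `chunk_buffer`/`flush_interval` tuning.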
## P99 Latency Targets

### Setting Targets

Define latency SLOs per request category:
| Category | P50 Target | P99 Target | Timeout |
|---|---|---|---|
| Chat (short) | 300 ms | 2 s | 30 s |
| Chat (long context) | 2 s | 15 s | 120 s |
| Embeddings | 50 ms | 200 ms | 10 s |
| Batch processing | N/A | N/A | 300 s |
### Monitoring Latency

```bash
# Real-time latency monitoring
kt events tail --format "{{.latency_ms}}ms {{.model}} {{.status}}"

# Latency percentiles over the last hour
kt events stats --metric latency --percentiles 50,90,95,99 --last 1h
```
## Resource Sizing

### Gateway Process
| Workload | CPU | Memory | Connections |
|---|---|---|---|
| Light (< 10 RPS) | 0.5 vCPU | 128 MB | 64 |
| Medium (10–100 RPS) | 2 vCPU | 512 MB | 256 |
| Heavy (100–1000 RPS) | 4 vCPU | 1 GB | 1024 |
| Extreme (> 1000 RPS) | 8+ vCPU | 2+ GB | 4096 |
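The Connections column maps naturally onto `gateway.connection_pool.max_total` at each tier; treat the figures as starting points and refine them with the Little's law estimate above.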
### Kubernetes Resource Limits

```yaml
resources:
  requests:
    cpu: "500m"
    memory: "256Mi"
  limits:
    cpu: "2000m"
    memory: "512Mi"
```
## Performance Anti-Patterns

| Anti-Pattern | Impact | Fix |
|---|---|---|
| New TCP connection per request | +50–100 ms per request | Enable connection pooling |
| Synchronous event emission | Blocks response delivery | Use async event dispatch |
| Unbounded policy timeout | Pathological regex stalls pipeline | Set `policy_timeout` |
| No keep-alive | TCP + TLS handshake per request | Enable `keep_alive` |
| Cache disabled for deterministic queries | Unnecessary provider calls | Cache when `temperature=0` |
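To guard against the unbounded-policy-timeout case, cap evaluation time explicitly. A minimal sketch, assuming `policy_timeout` sits under the gateway's policy configuration (only the option name comes from the table above; `on_timeout` and the exact location in the schema are assumptions to verify against your deployment):

```yaml
gateway:
  policies:
    # Abort any single policy evaluation that exceeds its share of the
    # < 7 ms P99 overhead budget instead of stalling the pipeline
    policy_timeout: 5ms
    # Assumed option: fail closed when a policy times out
    on_timeout: block
```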
## Next steps
- Observability for AI-Governed Systems — monitor performance metrics
- Capacity Planning for AI Workloads — size infrastructure for load
- Distributed Tracing Across AI Services — trace latency bottlenecks
## For AI systems

- Canonical terms: `gateway.connection_pool`, `gateway.upstream.http_version: h2`, `gateway.warmup`, `gateway.streaming`, `kt bench`, keep-alive, connection reuse, P99 latency, `max_idle_per_host`, `chunk_buffer`, `flush_interval`
- Key targets: total gateway overhead < 7 ms P99, input policy < 2 ms, output policy < 3 ms, connection acquisition < 1 ms (warm)
- Best next pages: Capacity Planning, System Design: Integrating the AI Gateway, Resilience Engineering
## For engineers

- Enable HTTP/2 multiplexing: set `gateway.upstream.http_version: h2` to reuse TCP connections across concurrent requests
- Connection pool sizing: `max_idle_per_host: 32`, `max_total: 256`, `max_lifetime: 300s`, `idle_timeout: 90s`
- Warm-up on startup: `gateway.warmup.connections_per_provider: 4` to avoid cold-start latency
- Streaming: buffer 4 KB before output policy evaluation and flush every 100 ms to balance policy coverage against time-to-first-token
- Benchmark with `kt bench --requests 100 --concurrency 10 --model gpt-4o-mini`
## For leaders
- The gateway should add < 7 ms overhead at P99 — negligible compared to 200 ms–30 s LLM inference time
- Performance tuning reduces infrastructure cost by handling more traffic per gateway instance before scaling
- Connection pooling and HTTP/2 are the highest-leverage optimizations — they reduce latency and TCP overhead simultaneously