Tune Gateway Performance for High Throughput
The Keeptrusts gateway sits in the critical path of every LLM request. This guide covers the key levers for minimizing latency, maximizing throughput, and ensuring your gateway handles production traffic at scale.
Use this page when
- You need to reduce gateway latency or increase throughput for production LLM traffic.
- You are tuning connection pools, timeouts, buffer sizes, or concurrency limits in policy-config.yaml.
- You want to benchmark your gateway with kt bench and compare before/after results.
Primary audience
- Primary: Technical Engineers and SREs optimizing gateway performance
- Secondary: Platform Architects planning capacity, AI Agents querying performance baselines
Performance architecture
Client Request
→ Connection Pool (reuse upstream connections)
→ Policy Chain Evaluation (CPU-bound, parallel where possible)
→ Upstream LLM Provider (network-bound, dominant latency)
→ Response Processing (output chain, redaction)
→ Client Response
Typical latency breakdown:
- Policy chain: 5-50ms (tunable)
- Network to LLM: 50-2000ms (provider-dependent)
- Response chain: 5-30ms (tunable)
Connection pooling
The gateway maintains persistent connection pools to upstream providers. Proper pool sizing prevents connection churn and reduces latency. The pack below registers the upstream providers; a pool-sizing sketch follows it.
pack:
name: performance-tuning-providers-1
version: 1.0.0
enabled: true
providers:
targets:
- id: openai
provider:
base_url: https://api.openai.com/v1
secret_key_ref:
store: OPENAI_API_KEY
- id: anthropic
provider:
base_url: https://api.anthropic.com/v1
secret_key_ref:
store: ANTHROPIC_API_KEY
policies:
chain:
- audit-logger
policy:
audit-logger:
immutable: true
retention_days: 365
log_all_access: true
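The pack above registers the providers but does not yet size their pools. Pool settings live under each provider entry (providers.[].pool, listed in the config reference at the end of this page); the exact nesting shown here is an assumption, and the values follow the 100–1,000 req/min row of the table below:
# Pool sizing sketch (nesting under each provider target is an assumption)
providers:
  targets:
    - id: openai
      provider:
        base_url: https://api.openai.com/v1
        secret_key_ref:
          store: OPENAI_API_KEY
      pool:
        max_connections: 50        # total connections to this provider
        max_idle_per_host: 15      # warm connections kept ready between requests
        idle_timeout: 90s          # close idle connections after this long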
Pool sizing guidelines
| Traffic volume | max_connections | max_idle_per_host | idle_timeout |
|---|---|---|---|
| < 100 req/min | 20 | 5 | 120s |
| 100–1,000 req/min | 50 | 15 | 90s |
| 1,000–10,000 req/min | 100 | 30 | 60s |
| > 10,000 req/min | 200+ | 50 | 45s |
Timeout tuning
Configure timeouts to balance reliability against resource utilization:
# policy-config.yaml — timeout configuration
gateway:
timeouts:
connect: 5s # TCP connection establishment
request: 120s # Total time for the upstream request
idle: 90s # Close idle connections
policy_eval: 5s # Max time for policy chain evaluation
streaming_idle: 30s # Idle timeout for streaming responses
Timeout recommendations
| Scenario | connect | request | policy_eval |
|---|---|---|---|
| Real-time chat | 3s | 60s | 2s |
| Batch processing | 10s | 300s | 10s |
| Code generation | 5s | 180s | 5s |
| Streaming responses | 5s | 300s | 5s |
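For example, the real-time chat row maps onto gateway.timeouts as follows; the values come straight from the table and are a starting point rather than a prescription:
# Real-time chat timeout profile (values from the table above)
gateway:
  timeouts:
    connect: 3s        # fail fast on unreachable providers
    request: 60s       # interactive users rarely wait longer than this
    policy_eval: 2s    # keep gateway overhead low for chat traffic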
Buffer sizing
Buffers control how the gateway handles request and response bodies:
gateway:
buffers:
request_body_limit: 10mb # Max request body size
response_body_limit: 50mb # Max response body size
streaming_buffer: 64kb # Buffer size for streaming responses
For streaming responses, a smaller streaming_buffer reduces time-to-first-byte but increases syscall overhead. For batch requests with large payloads, increase request_body_limit accordingly.
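A sketch of the streaming-oriented end of this tradeoff; the values are illustrative, not recommendations:
# Streaming-heavy workload (illustrative values)
gateway:
  buffers:
    streaming_buffer: 16kb       # smaller chunks improve time-to-first-byte
    response_body_limit: 50mb    # same as the example above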
Concurrent request limits
Protect the gateway from overload with concurrency controls:
gateway:
concurrency:
max_concurrent_requests: 500 # Total across all providers
max_queue_size: 1000 # Requests queued when at capacity
queue_timeout: 30s # Max time a request waits in queue
per_user_limit: 50 # Max concurrent requests per user
per_team_limit: 200 # Max concurrent requests per team
When max_concurrent_requests is reached, new requests queue. If the queue is full, the gateway returns 429 Too Many Requests immediately.
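A back-of-the-envelope sizing sketch using Little's law; the traffic volume and upstream latency here are assumptions, so substitute your own measurements:
# Assumed: ~1,000 req/min (~17 req/s) and ~500ms average upstream latency
# Little's law: average in-flight requests ≈ rate × latency ≈ 17 × 0.5 ≈ 9
gateway:
  concurrency:
    max_concurrent_requests: 100   # well above the ~9 average, headroom for bursts
    max_queue_size: 200            # absorbs short spikes before 429s are returned
    queue_timeout: 10s             # bounds the extra latency a queued request can accrue
    per_user_limit: 10             # keeps one noisy client from consuming the pool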
Benchmarking with kt bench
Measure your gateway's performance under controlled conditions:
# Basic benchmark: 100 requests, 10 concurrent
kt bench --target http://localhost:41002/v1/chat/completions \
--requests 100 \
--concurrency 10 \
--payload bench/sample-chat.json
# Sustained load test: 5 minutes at 50 req/s
kt bench --target http://localhost:41002/v1/chat/completions \
--rate 50 \
--duration 5m \
--payload bench/sample-chat.json
# Ramp-up test: gradually increase load
kt bench --target http://localhost:41002/v1/chat/completions \
--rate-start 10 \
--rate-end 200 \
--ramp-duration 2m \
--duration 5m \
--payload bench/sample-chat.json
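Each command reads its request body from bench/sample-chat.json. The payload is whatever body your target endpoint expects; a minimal sketch, assuming an OpenAI-style chat completions request:
{
  "model": "gpt-4o-mini",
  "messages": [
    { "role": "user", "content": "Summarize the benefits of connection pooling in two sentences." }
  ],
  "max_tokens": 128
}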
Benchmark output
Benchmark Results
═════════════════
Target: http://localhost:41002/v1/chat/completions
Duration: 5m 0s
Total: 15,000 requests
Concurrency: 50
Latency:
P50: 142ms
P90: 289ms
P95: 412ms
P99: 687ms
Max: 1,245ms
Throughput: 50.0 req/s (target: 50 req/s)
Status Codes:
200: 14,650 (97.7%)
409: 285 (1.9%) ← policy blocks
429: 42 (0.3%) ← rate limited
502: 23 (0.2%) ← upstream errors
Policy Chain:
Avg eval time: 18ms
Max eval time: 89ms
Latency optimization checklist
# 1. Check current performance baseline
kt doctor --checks performance
# 2. Identify bottlenecks
kt events tail --format detailed --filter "latency_ms>500"
# 3. Review connection pool utilization
kt doctor --checks performance --verbose | grep -A5 "connection"
# 4. Check policy chain timing
kt events tail --since 1h --format json | \
jq '.policy_chain_ms' | sort -n | tail -20
# 5. Run a benchmark after each tuning change
kt bench --target http://localhost:41002/v1/chat/completions \
--requests 1000 --concurrency 50 --payload bench/sample-chat.json
Quick wins
| Optimization | Expected impact | Effort |
|---|---|---|
| Enable connection pooling | 20-50ms latency reduction | Config change |
| Move blocking policies first (see the sketch below) | 5-30ms reduction on blocked requests | Config reorder |
| Increase max_idle_per_host | Reduces connection setup time | Config change |
| Set appropriate request timeout | Prevents resource exhaustion | Config change |
| Tune streaming_buffer size | Better TTFB for streaming | Config change |
| Scale horizontally | Linear throughput increase | Infrastructure |
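The policy-reordering win is a pure chain reorder in policy-config.yaml. A sketch, assuming a hypothetical blocking policy named prompt-guard alongside the audit-logger from the earlier example:
# Cheap blocking policy first; the rest of the chain is skipped for blocked requests
policies:
  chain:
    - prompt-guard     # hypothetical blocking policy, rejects bad requests early
    - audit-logger     # runs only for requests that pass the block check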
Monitoring tuning impact
After making changes, validate with:
# Before/after comparison
kt bench --target http://localhost:41002/v1/chat/completions \
--requests 5000 --concurrency 50 \
--payload bench/sample-chat.json \
--output results/after-tuning.json
# Compare results
kt bench compare results/before-tuning.json results/after-tuning.json
Performance Comparison
──────────────────────
              Before     After      Change
P50:          182ms      142ms      -22.0%
P99:          920ms      687ms      -25.3%
Throughput:   38 req/s   50 req/s   +31.6%
Error rate:   1.2%       0.2%       -83.3%
Business outcomes
| Outcome | How performance tuning helps |
|---|---|
| Lower user-perceived latency | Optimized pools and timeouts cut gateway overhead to single-digit milliseconds |
| Higher throughput | Proper concurrency limits let you serve more users without additional infrastructure |
| Cost efficiency | Better resource utilization means fewer gateway instances for the same traffic |
| Reliability | Timeout and queue tuning prevents cascading failures during traffic spikes |
| Data-driven scaling | Benchmark results quantify exactly when you need to scale horizontally |
For AI systems
- Canonical terms: kt bench, kt doctor --checks performance, connection pool, pool.max_connections, pool.idle_timeout, pool.max_idle_per_host, gateway.timeouts, gateway.buffers, gateway.concurrency, P50/P95/P99 latency.
- Config sections: providers.[].pool, gateway.timeouts (connect, request, idle, policy_eval, streaming_idle), gateway.buffers (request_body_limit, response_body_limit, streaming_buffer), gateway.concurrency (max_concurrent_requests, per_user_limit, per_team_limit).
- Benchmark flags: --requests, --concurrency, --rate, --duration, --rate-start/--rate-end, --ramp-duration, --output, kt bench compare.
- Best next pages: Gateway Diagnostics, Multi-Gateway, Gateway Docker Compose.
For engineers
- Start with baseline: kt doctor --checks performance to see current P50/P99, memory, and connection utilization.
- Tune pools first: match max_connections to your traffic volume (see the pool sizing table on this page).
- Benchmark after each change: kt bench --requests 1000 --concurrency 50 --payload bench/sample-chat.json.
- Compare results: kt bench compare results/before.json results/after.json.
- Quick wins: enable pooling, reorder blocking policies first, increase max_idle_per_host, set appropriate timeouts.
For leaders
- Optimized gateway overhead means lower per-request cost and fewer infrastructure instances required.
- P99 latency targets ensure AI-powered features remain responsive under load — critical for customer-facing applications.
- Data-driven scaling: kt bench results quantify exactly when horizontal scaling is needed vs. config tuning.
- Proper timeout and queue settings prevent cascading failures during traffic spikes, maintaining governance availability.
Next steps
- Gateway Docker Compose — scale beyond a single instance
- Diagnose Gateway Issues — troubleshoot performance problems
- Operate Multiple Gateways — manage a fleet of tuned gateways