Tune Gateway Performance for High Throughput
The Keeptrusts gateway sits in the critical path of every LLM request. This guide covers the key levers for minimizing latency, maximizing throughput, and ensuring your gateway handles production traffic at scale.
Use this page when
- You need to reduce gateway latency or increase throughput for production LLM traffic.
- You are tuning connection pools, timeouts, buffer sizes, or concurrency limits in policy-config.yaml.
- You want to benchmark your gateway with kt bench and compare before/after results.
Primary audience
- Primary: Technical Engineers and SREs optimizing gateway performance
- Secondary: Platform Architects planning capacity, AI Agents querying performance baselines
Performance architecture
Client Request
→ Connection Pool (reuse upstream connections)
→ Policy Chain Evaluation (CPU-bound, parallel where possible)
→ Upstream LLM Provider (network-bound, dominant latency)
→ Response Processing (output chain, redaction)
→ Client Response
Typical latency breakdown:
- Policy chain: 5-50ms (tunable)
- Network to LLM: 50-2000ms (provider-dependent)
- Response chain: 5-30ms (tunable)
Connection pooling
The gateway maintains persistent connection pools to upstream providers. Proper pool sizing prevents connection churn and reduces latency. The pack below registers the upstream providers; a pool-sizing sketch follows it.
pack:
name: performance-tuning-providers-1
version: 1.0.0
enabled: true
providers:
targets:
- id: openai
provider:
base_url: https://api.openai.com/v1
secret_key_ref:
store: OPENAI_API_KEY
- id: anthropic
provider:
base_url: https://api.anthropic.com/v1
secret_key_ref:
store: ANTHROPIC_API_KEY
policies:
chain:
- audit-logger
policy:
audit-logger:
immutable: true
retention_days: 365
log_all_access: true
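The pack above registers the providers but does not yet size their pools. Pool settings live under each provider entry (providers.[].pool, listed in the config reference at the end of this page); the exact nesting shown here is an assumption, and the values follow the 100–1,000 req/min row of the table below:
# Pool sizing sketch (nesting under each provider target is an assumption)
providers:
  targets:
    - id: openai
      provider:
        base_url: https://api.openai.com/v1
        secret_key_ref:
          store: OPENAI_API_KEY
      pool:
        max_connections: 50        # total connections to this provider
        max_idle_per_host: 15      # warm connections kept ready between requests
        idle_timeout: 90s          # close idle connections after this long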
Pool sizing guidelines
| Traffic volume | max_connections | max_idle_per_host | idle_timeout |
|---|---|---|---|
| < 100 req/min | 20 | 5 | 120s |
| 100–1,000 req/min | 50 | 15 | 90s |
| 1,000–10,000 req/min | 100 | 30 | 60s |
| > 10,000 req/min | 200+ | 50 | 45s |
Timeout tuning
Configure timeouts to balance reliability against resource utilization:
# policy-config.yaml — timeout configuration
gateway:
timeouts:
connect: 5s # TCP connection establishment
request: 120s # Total time for the upstream request
idle: 90s # Close idle connections
policy_eval: 5s # Max time for policy chain evaluation
streaming_idle: 30s # Idle timeout for streaming responses
Timeout recommendations
| Scenario | connect | request | policy_eval |
|---|---|---|---|
| Real-time chat | 3s | 60s | 2s |
| Batch processing | 10s | 300s | 10s |
| Code generation | 5s | 180s | 5s |
| Streaming responses | 5s | 300s | 5s |
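For example, the real-time chat row maps onto gateway.timeouts as follows; the values come straight from the table and are a starting point rather than a prescription:
# Real-time chat timeout profile (values from the table above)
gateway:
  timeouts:
    connect: 3s        # fail fast on unreachable providers
    request: 60s       # interactive users rarely wait longer than this
    policy_eval: 2s    # keep gateway overhead low for chat traffic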
Buffer sizing
Buffers control how the gateway handles request and response bodies:
gateway:
buffers:
request_body_limit: 10mb # Max request body size
response_body_limit: 50mb # Max response body size
streaming_buffer: 64kb # Buffer size for streaming responses
For streaming responses, a smaller streaming_buffer reduces time-to-first-byte but increases syscall overhead. For batch requests with large payloads, increase request_body_limit accordingly.
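A sketch of the streaming-oriented end of this tradeoff; the values are illustrative, not recommendations:
# Streaming-heavy workload (illustrative values)
gateway:
  buffers:
    streaming_buffer: 16kb       # smaller chunks improve time-to-first-byte
    response_body_limit: 50mb    # same as the example above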
Concurrent request limits
Protect the gateway from overload with concurrency controls:
gateway:
concurrency:
max_concurrent_requests: 500 # Total across all providers
max_queue_size: 1000 # Requests queued when at capacity
queue_timeout: 30s # Max time a request waits in queue
per_user_limit: 50 # Max concurrent requests per user
per_team_limit: 200 # Max concurrent requests per team
When max_concurrent_requests is reached, new requests queue. If the queue is full, the gateway returns 429 Too Many Requests immediately.
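A back-of-the-envelope sizing sketch using Little's law; the traffic volume and upstream latency here are assumptions, so substitute your own measurements:
# Assumed: ~1,000 req/min (~17 req/s) and ~500ms average upstream latency
# Little's law: average in-flight requests ≈ rate × latency ≈ 17 × 0.5 ≈ 9
gateway:
  concurrency:
    max_concurrent_requests: 100   # well above the ~9 average, headroom for bursts
    max_queue_size: 200            # absorbs short spikes before 429s are returned
    queue_timeout: 10s             # bounds the extra latency a queued request can accrue
    per_user_limit: 10             # keeps one noisy client from consuming the pool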
Benchmarking with kt bench
Measure your gateway's performance under controlled conditions:
# Basic benchmark: 100 requests, 10 concurrent
kt bench --target http://localhost:41002/v1/chat/completions \
--requests 100 \
--concurrency 10 \
--payload bench/sample-chat.json
# Sustained load test: 5 minutes at 50 req/s
kt bench --target http://localhost:41002/v1/chat/completions \
--rate 50 \
--duration 5m \
--payload bench/sample-chat.json
# Ramp-up test: gradually increase load
kt bench --target http://localhost:41002/v1/chat/completions \
--rate-start 10 \
--rate-end 200 \
--ramp-duration 2m \
--duration 5m \
--payload bench/sample-chat.json
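Each command reads its request body from bench/sample-chat.json. The payload is whatever body your target endpoint expects; a minimal sketch, assuming an OpenAI-style chat completions request:
{
  "model": "gpt-4o-mini",
  "messages": [
    { "role": "user", "content": "Summarize the benefits of connection pooling in two sentences." }
  ],
  "max_tokens": 128
}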
Benchmark output
Benchmark Results
═════════════════
Target: http://localhost:41002/v1/chat/completions
Duration: 5m 0s
Total: 15,000 requests
Concurrency: 50
Latency:
P50: 142ms
P90: 289ms
P95: 412ms
P99: 687ms
Max: 1,245ms
Throughput: 50.0 req/s (target: 50 req/s)
Status Codes:
200: 14,650 (97.7%)
409: 285 (1.9%) ← policy blocks
429: 42 (0.3%) ← rate limited
502: 23 (0.2%) ← upstream errors
Policy Chain:
Avg eval time: 18ms
Max eval time: 89ms
Latency optimization checklist
# 1. Check current performance baseline
kt doctor --checks performance
# 2. Identify bottlenecks
kt events tail --format detailed --filter "latency_ms>500"
# 3. Review connection pool utilization
kt doctor --checks performance --verbose | grep -A5 "connection"
# 4. Check policy chain timing
kt events tail --since 1h --format json | \
jq '.policy_chain_ms' | sort -n | tail -20
# 5. Run a benchmark after each tuning change
kt bench --target http://localhost:41002/v1/chat/completions \
--requests 1000 --concurrency 50 --payload bench/sample-chat.json
Quick wins
| Optimization | Expected impact | Effort |
|---|---|---|
| Enable connection pooling | 20-50ms latency reduction | Config change |
| Move blocking policies first (see the sketch below) | 5-30ms reduction on blocked requests | Config reorder |
| Increase max_idle_per_host | Reduces connection setup time | Config change |
| Set appropriate request timeout | Prevents resource exhaustion | Config change |
| Tune streaming_buffer size | Better TTFB for streaming | Config change |
| Scale horizontally | Linear throughput increase | Infrastructure |
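The policy-reordering win is a pure chain reorder in policy-config.yaml. A sketch, assuming a hypothetical blocking policy named prompt-guard alongside the audit-logger from the earlier example:
# Cheap blocking policy first; the rest of the chain is skipped for blocked requests
policies:
  chain:
    - prompt-guard     # hypothetical blocking policy, rejects bad requests early
    - audit-logger     # runs only for requests that pass the block check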
Monitoring tuning impact
After making changes, validate with:
# Before/after comparison
kt bench --target http://localhost:41002/v1/chat/completions \
--requests 5000 --concurrency 50 \
--payload bench/sample-chat.json \
--output results/after-tuning.json
# Compare results
kt bench compare results/before-tuning.json results/after-tuning.json
Performance Comparison
──────────────────────
              Before     After      Change
P50:          182ms      142ms      -22.0%
P99:          920ms      687ms      -25.3%
Throughput:   38 req/s   50 req/s   +31.6%
Error rate:   1.2%       0.2%       -83.3%
Business outcomes
| Outcome | How performance tuning helps |
|---|---|
| Lower user-perceived latency | Optimized pools and timeouts cut gateway overhead to single-digit milliseconds |
| Higher throughput | Proper concurrency limits let you serve more users without additional infrastructure |
| Cost efficiency | Better resource utilization means fewer gateway instances for the same traffic |
| Reliability | Timeout and queue tuning prevents cascading failures during traffic spikes |
| Data-driven scaling | Benchmark results quantify exactly when you need to scale horizontally |
For AI systems
- Canonical terms: kt bench, kt doctor --checks performance, connection pool, pool.max_connections, pool.idle_timeout, pool.max_idle_per_host, gateway.timeouts, gateway.buffers, gateway.concurrency, P50/P95/P99 latency.
- Config sections: providers.[].pool, gateway.timeouts (connect, request, idle, policy_eval, streaming_idle), gateway.buffers (request_body_limit, response_body_limit, streaming_buffer), gateway.concurrency (max_concurrent_requests, per_user_limit, per_team_limit).
- Benchmark flags: --requests, --concurrency, --rate, --duration, --rate-start/--rate-end, --ramp-duration, --output, kt bench compare.
- Best next pages: Gateway Diagnostics, Multi-Gateway, Gateway Docker Compose.
For engineers
- Start with baseline: kt doctor --checks performance to see current P50/P99, memory, and connection utilization.
- Tune pools first: match max_connections to your traffic volume (see the pool sizing table on this page).
- Benchmark after each change: kt bench --requests 1000 --concurrency 50 --payload bench/sample-chat.json.
- Compare results: kt bench compare results/before.json results/after.json.
- Quick wins: enable pooling, reorder blocking policies first, increase max_idle_per_host, set appropriate timeouts.
For leaders
- Optimized gateway overhead means lower per-request cost and fewer infrastructure instances required.
- P99 latency targets ensure AI-powered features remain responsive under load — critical for customer-facing applications.
- Data-driven scaling: kt bench results quantify exactly when horizontal scaling is needed vs. config tuning.
- Proper timeout and queue settings prevent cascading failures during traffic spikes, maintaining governance availability.
Next steps
- Gateway Docker Compose — scale beyond a single instance
- Diagnose Gateway Issues — troubleshoot performance problems
- Operate Multiple Gateways — manage a fleet of tuned gateways