
Tune Gateway Performance for High Throughput

The Keeptrusts gateway sits in the critical path of every LLM request. This guide covers the key levers for minimizing latency, maximizing throughput, and ensuring your gateway handles production traffic at scale.

Use this page when

  • You need to reduce gateway latency or increase throughput for production LLM traffic.
  • You are tuning connection pools, timeouts, buffer sizes, or concurrency limits in policy-config.yaml.
  • You want to benchmark your gateway with kt bench and compare before/after results.

Primary audience

  • Primary: Technical Engineers and SREs optimizing gateway performance
  • Secondary: Platform Architects planning capacity, AI Agents querying performance baselines

Performance architecture

Client Request
→ Connection Pool (reuse upstream connections)
→ Policy Chain Evaluation (CPU-bound, parallel where possible)
→ Upstream LLM Provider (network-bound, dominant latency)
→ Response Processing (output chain, redaction)
→ Client Response

Typical latency breakdown:

  • Policy chain: 5-50ms (tunable)
  • Network to LLM: 50-2000ms (provider-dependent)
  • Response chain: 5-30ms (tunable)
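A quick sanity check on the breakdown above: the tunable stages (policy and response chains) dominate on fast requests but shrink to a few percent once provider network time dominates. A sketch with the illustrative ranges quoted above:

```python
# Estimate the gateway's tunable share of end-to-end latency using the
# typical ranges above (policy 5-50ms, network 50-2000ms, response 5-30ms).

def gateway_overhead_share(policy_ms, network_ms, response_ms):
    """Fraction of total request latency spent in tunable gateway stages."""
    total = policy_ms + network_ms + response_ms
    return (policy_ms + response_ms) / total

# Fast provider, fast chains: gateway overhead matters most.
fast = gateway_overhead_share(5, 50, 5)      # ~16.7% of a 60ms request
# Slow provider, slow chains: network dominates.
slow = gateway_overhead_share(50, 2000, 30)  # ~3.8% of a 2080ms request

print(f"fast path: {fast:.1%}, slow path: {slow:.1%}")
```

This is why the tuning below focuses on shaving the gateway stages and on connection reuse: the provider round trip itself is outside your control.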

Connection pooling

The gateway maintains persistent connection pools to upstream providers. Proper pool sizing prevents connection churn and reduces latency:

pack:
  name: performance-tuning-providers-1
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: openai
        provider:
          base_url: https://api.openai.com/v1
          secret_key_ref:
            store: OPENAI_API_KEY
        pool:
          max_connections: 50     # see sizing table below
          max_idle_per_host: 15
          idle_timeout: 90s
      - id: anthropic
        provider:
          base_url: https://api.anthropic.com/v1
          secret_key_ref:
            store: ANTHROPIC_API_KEY
        pool:
          max_connections: 50
          max_idle_per_host: 15
          idle_timeout: 90s
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true

Pool sizing guidelines

Traffic volume          max_connections   max_idle_per_host   idle_timeout
< 100 req/min           20                5                   120s
100–1,000 req/min       50                15                  90s
1,000–10,000 req/min    100               30                  60s
> 10,000 req/min        200+              50                  45s
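The tiers above can be expressed as a small lookup, e.g. when generating pool config programmatically. A sketch whose boundaries and values come straight from the table (the helper itself is not part of any Keeptrusts SDK):

```python
# Map requests/min to the recommended pool settings from the table above.
def recommend_pool(req_per_min):
    """Return (max_connections, max_idle_per_host, idle_timeout_s)."""
    if req_per_min < 100:
        return (20, 5, 120)
    if req_per_min <= 1_000:
        return (50, 15, 90)
    if req_per_min <= 10_000:
        return (100, 30, 60)
    return (200, 50, 45)  # "200+": scale max_connections with load

print(recommend_pool(500))  # -> (50, 15, 90)
```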

Timeout tuning

Configure timeouts to balance reliability against resource utilization:

# policy-config.yaml — timeout configuration
gateway:
  timeouts:
    connect: 5s           # TCP connection establishment
    request: 120s         # Total time for the upstream request
    idle: 90s             # Close idle connections
    policy_eval: 5s       # Max time for policy chain evaluation
    streaming_idle: 30s   # Idle timeout for streaming responses

Timeout recommendations

Scenario               connect   request   policy_eval
Real-time chat         3s        60s       2s
Batch processing       10s       300s      10s
Code generation        5s        180s      5s
Streaming responses    5s        300s      5s

Buffer sizing

Buffers control how the gateway handles request and response bodies:

gateway:
  buffers:
    request_body_limit: 10mb    # Max request body size
    response_body_limit: 50mb   # Max response body size
    streaming_buffer: 64kb      # Buffer size for streaming responses

For streaming responses, a smaller streaming_buffer reduces time-to-first-byte but increases syscall overhead. For batch requests with large payloads, increase request_body_limit accordingly.

Concurrent request limits

Protect the gateway from overload with concurrency controls:

gateway:
  concurrency:
    max_concurrent_requests: 500   # Total across all providers
    max_queue_size: 1000           # Requests queued when at capacity
    queue_timeout: 30s             # Max time a request waits in queue
    per_user_limit: 50             # Max concurrent requests per user
    per_team_limit: 200            # Max concurrent requests per team

When max_concurrent_requests is reached, new requests queue. If the queue is full, the gateway returns 429 Too Many Requests immediately.
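Because a full queue fails fast with 429, clients should back off and retry rather than immediately resend. A minimal client-side sketch (the transport callable and the retry policy are illustrative, not part of any Keeptrusts SDK):

```python
import time

def send_with_backoff(send, max_retries=5, base_delay=0.5):
    """Retry a request on 429, doubling the delay after each attempt."""
    for attempt in range(max_retries):
        status, body = send()
        if status != 429:
            return status, body
        time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return status, body  # give up after max_retries

# Stubbed transport: gateway queue full twice, then success.
responses = iter([(429, ""), (429, ""), (200, "ok")])
status, body = send_with_backoff(lambda: next(responses), base_delay=0.01)
print(status, body)  # -> 200 ok
```

Honoring a Retry-After header, if the gateway returns one, is preferable to a fixed backoff schedule.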

Benchmarking with kt bench

Measure your gateway's performance under controlled conditions:

# Basic benchmark: 100 requests, 10 concurrent
kt bench --target http://localhost:41002/v1/chat/completions \
  --requests 100 \
  --concurrency 10 \
  --payload bench/sample-chat.json

# Sustained load test: 5 minutes at 50 req/s
kt bench --target http://localhost:41002/v1/chat/completions \
  --rate 50 \
  --duration 5m \
  --payload bench/sample-chat.json

# Ramp-up test: gradually increase load
kt bench --target http://localhost:41002/v1/chat/completions \
  --rate-start 10 \
  --rate-end 200 \
  --ramp-duration 2m \
  --duration 5m \
  --payload bench/sample-chat.json

Benchmark output

Benchmark Results
═════════════════
Target:       http://localhost:41002/v1/chat/completions
Duration:     5m 0s
Total:        15,000 requests
Concurrency:  50

Latency:
  P50: 142ms
  P90: 289ms
  P95: 412ms
  P99: 687ms
  Max: 1,245ms

Throughput: 50.0 req/s (target: 50 req/s)

Status Codes:
  200: 14,650 (97.7%)
  409:    285 (1.9%)  ← policy blocks
  429:     42 (0.3%)  ← rate limited
  502:     23 (0.2%)  ← upstream errors

Policy Chain:
  Avg eval time: 18ms
  Max eval time: 89ms
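The P50/P95/P99 figures are order-statistic percentiles over per-request latencies. If you post-process raw latencies yourself (for example, extracted from kt events output), a nearest-rank computation looks like this (a sketch of the standard definition, not necessarily kt bench's exact method):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

latencies_ms = [100, 120, 135, 140, 142, 150, 289, 350, 412, 687]
print(percentile(latencies_ms, 50))  # -> 142
print(percentile(latencies_ms, 90))  # -> 412
print(percentile(latencies_ms, 99))  # -> 687
```

P99 is the most sensitive to sample size: with only a few hundred requests it is driven by a handful of observations, so prefer longer benchmark runs before trusting tail-latency deltas.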

Latency optimization checklist

# 1. Check current performance baseline
kt doctor --checks performance

# 2. Identify bottlenecks
kt events tail --format detailed --filter "latency_ms>500"

# 3. Review connection pool utilization
kt doctor --checks performance --verbose | grep -A5 "connection"

# 4. Check policy chain timing
kt events tail --since 1h --format json | \
  jq '.policy_chain_ms' | sort -n | tail -20

# 5. Run a benchmark after each tuning change
kt bench --target http://localhost:41002/v1/chat/completions \
  --requests 1000 --concurrency 50 --payload bench/sample-chat.json

Quick wins

Optimization                       Expected impact                         Effort
Enable connection pooling          20-50ms latency reduction               Config change
Move blocking policies first       5-30ms reduction on blocked requests    Config reorder
Increase max_idle_per_host         Reduces connection setup time           Config change
Set appropriate request timeout    Prevents resource exhaustion            Config change
Tune streaming_buffer size         Better TTFB for streaming               Config change
Scale horizontally                 Linear throughput increase              Infrastructure

Monitoring tuning impact

After making changes, validate with:

# Before/after comparison
kt bench --target http://localhost:41002/v1/chat/completions \
  --requests 5000 --concurrency 50 \
  --payload bench/sample-chat.json \
  --output results/after-tuning.json

# Compare results
kt bench compare results/before-tuning.json results/after-tuning.json

Performance Comparison
──────────────────────
              Before     After      Change
P50:          182ms      142ms      -22.0%
P99:          920ms      687ms      -25.3%
Throughput:   38 req/s   50 req/s   +31.6%
Error rate:   1.2%       0.2%       -83.3%
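The Change column is the plain percentage delta relative to the before value. Reproducing it by hand with the numbers above (illustrative only):

```python
def pct_change(before, after):
    """Signed percent change from before to after, rounded to one decimal."""
    return round((after - before) / before * 100, 1)

print(pct_change(182, 142))  # P50 latency  -> -22.0
print(pct_change(920, 687))  # P99 latency  -> -25.3
print(pct_change(38, 50))    # throughput   -> 31.6
print(pct_change(1.2, 0.2))  # error rate   -> -83.3
```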

Business outcomes

Outcome                         How performance tuning helps
Lower user-perceived latency    Optimized pools and timeouts cut gateway overhead to single-digit milliseconds
Higher throughput               Proper concurrency limits let you serve more users without additional infrastructure
Cost efficiency                 Better resource utilization means fewer gateway instances for the same traffic
Reliability                     Timeout and queue tuning prevents cascading failures during traffic spikes
Data-driven scaling             Benchmark results quantify exactly when you need to scale horizontally

For AI systems

  • Canonical terms: kt bench, kt doctor --checks performance, connection pool, pool.max_connections, pool.idle_timeout, pool.max_idle_per_host, gateway.timeouts, gateway.buffers, gateway.concurrency, P50/P95/P99 latency.
  • Config sections: providers.[].pool, gateway.timeouts (connect, request, idle, policy_eval, streaming_idle), gateway.buffers (request_body_limit, response_body_limit, streaming_buffer), gateway.concurrency (max_concurrent_requests, per_user_limit, per_team_limit).
  • Benchmark flags: --requests, --concurrency, --rate, --duration, --rate-start/--rate-end, --ramp-duration, --output, kt bench compare.
  • Best next pages: Gateway Diagnostics, Multi-Gateway, Gateway Docker Compose.

For engineers

  • Start with baseline: kt doctor --checks performance to see current P50/P99, memory, and connection utilization.
  • Tune pools first: match max_connections to your traffic volume (see pool sizing table in this page).
  • Benchmark after each change: kt bench --requests 1000 --concurrency 50 --payload bench/sample-chat.json.
  • Compare results: kt bench compare results/before.json results/after.json.
  • Quick wins: enable pooling, reorder blocking policies first, increase max_idle_per_host, set appropriate timeouts.

For leaders

  • Optimized gateway overhead means lower per-request cost and fewer infrastructure instances required.
  • P99 latency targets ensure AI-powered features remain responsive under load — critical for customer-facing applications.
  • Data-driven scaling: kt bench results quantify exactly when horizontal scaling is needed vs. config tuning.
  • Proper timeout and queue settings prevent cascading failures during traffic spikes, maintaining governance availability.

Next steps