Load Testing the AI Gateway

The Keeptrusts gateway sits in the critical path of every AI request. Performance degradation in the gateway directly impacts application latency and user experience. Load testing validates that the gateway meets throughput and latency requirements under realistic traffic patterns.

Use this page when

  • You need to benchmark gateway throughput and latency under realistic traffic patterns
  • You are using kt bench to measure policy chain overhead, latency percentiles, and throughput ceiling
  • You want to validate rate limit handling, provider 429 responses, and add performance gates to CI

Primary audience

  • Primary: Technical Engineers
  • Secondary: AI Agents, Technical Leaders

Gateway Architecture & Performance Profile

The gateway performs these operations on every request:

  1. Input policy chain — evaluate all input-phase policies (topic control, DLP, rate limits)
  2. Provider forwarding — proxy the request to the upstream LLM provider
  3. Output policy chain — evaluate all output-phase policies (redaction, quality, disclaimers)
  4. Event emission — asynchronously send the decision event to the API

The gateway overhead is the time spent in steps 1, 3, and 4. Step 2 is provider latency, which the gateway cannot control. Load testing focuses on measuring and optimizing the gateway-added overhead.
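If your benchmark output records per-request timing, you can estimate this overhead directly. The field names in the sketch below (total_latency_ms, provider_latency_ms) are hypothetical illustrations; substitute whatever your results file actually contains:

# Mean gateway overhead per request: total time minus provider time.
# total_latency_ms and provider_latency_ms are assumed field names, not
# documented kt bench output fields.
jq '[.responses[] | .total_latency_ms - .provider_latency_ms] | add / length' bench-results.json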

Benchmarking with kt bench

The kt bench command generates synthetic load against a running gateway:

# Basic benchmark: 100 requests, 10 concurrent
kt bench --target http://localhost:41002 \
  --requests 100 \
  --concurrency 10 \
  --model gpt-4o

Configuration Options

# Full benchmark configuration
kt bench \
  --target http://localhost:41002 \
  --requests 1000 \
  --concurrency 50 \
  --duration 60s \
  --model gpt-4o \
  --prompt-file test-prompts.txt \
  --output results.json

Flag           Description                               Default
--target       Gateway URL                               http://localhost:41002
--requests     Total request count                       100
--concurrency  Parallel request workers                  10
--duration     Time-bound test (overrides --requests)    (none)
--model        Target model identifier                   (none)
--prompt-file  File with prompts (one per line)          Built-in prompt
--output       Results output file (JSON)                stdout

Latency Percentile Analysis

Gateway latency is best understood through percentiles, not averages. A healthy p50 can mask severe p99 tail latency.

Reading Benchmark Results

# Run benchmark and analyze percentiles
kt bench --target http://localhost:41002 \
  --requests 500 \
  --concurrency 20 \
  --output bench-results.json

# Parse latency percentiles
jq '.latency_percentiles' bench-results.json

Sample output:

{
  "p50_ms": 12,
  "p75_ms": 18,
  "p90_ms": 34,
  "p95_ms": 52,
  "p99_ms": 128,
  "max_ms": 340,
  "mean_ms": 22
}
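A quick sanity check on a result set like this is the tail-to-median ratio, computed from the documented latency_percentiles fields. In the sample above, p99 (128ms) is more than ten times p50 (12ms), which usually points to queueing or contention rather than uniform policy cost (a rule of thumb, not a kt threshold):

# Ratio of tail to median latency; a large ratio flags tail-latency problems
jq '.latency_percentiles | .p99_ms / .p50_ms' bench-results.json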

Setting Latency Budgets

Define acceptable latency budgets for the gateway overhead (excluding provider time):

Percentile   Budget (gateway overhead)   Typical value
p50          < 15ms                      8–12ms
p95          < 60ms                      30–50ms
p99          < 150ms                     80–130ms

#!/bin/bash
# check-latency-budget.sh — fail if latency exceeds budget

RESULTS="bench-results.json"

P95=$(jq '.latency_percentiles.p95_ms' "$RESULTS")
P99=$(jq '.latency_percentiles.p99_ms' "$RESULTS")

P95_BUDGET=60
P99_BUDGET=150

PASS=true

if [ "$P95" -gt "$P95_BUDGET" ]; then
echo "FAIL: p95 latency ${P95}ms exceeds budget ${P95_BUDGET}ms"
PASS=false
fi

if [ "$P99" -gt "$P99_BUDGET" ]; then
echo "FAIL: p99 latency ${P99}ms exceeds budget ${P99_BUDGET}ms"
PASS=false
fi

if [ "$PASS" = true ]; then
echo "PASS: All latency budgets met (p95=${P95}ms, p99=${P99}ms)"
else
exit 1
fi
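One way to wire this up is to chain the benchmark and the check so the script always reads fresh results:

# Generate fresh results, then enforce the budget (non-zero exit on failure)
kt bench --target http://localhost:41002 \
  --requests 500 --concurrency 20 \
  --output bench-results.json
./check-latency-budget.sh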

Throughput Testing

Measure the maximum requests per second the gateway can sustain:

# Ramp up concurrency to find throughput ceiling
for CONCURRENCY in 10 25 50 100 200; do
echo "Testing concurrency=$CONCURRENCY"
kt bench --target http://localhost:41002 \
--requests 500 \
--concurrency $CONCURRENCY \
--output "bench-c${CONCURRENCY}.json"

RPS=$(jq '.requests_per_second' "bench-c${CONCURRENCY}.json")
P99=$(jq '.latency_percentiles.p99_ms' "bench-c${CONCURRENCY}.json")
echo " RPS: $RPS, p99: ${P99}ms"
done

Throughput vs. Latency Curve

Plot the results to identify the point where latency degrades:

Concurrency   RPS     p50 (ms)   p99 (ms)
10            180     10         45
25            420     12         58
50            750     15         92
100           980     28         210
200           1050    85         580

In this example, throughput plateaus around 100 concurrency while tail latency rises sharply — the gateway is saturated beyond ~1000 RPS.
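To flag the saturation point programmatically rather than by eye, a rough heuristic over the bench-c*.json files from the ramp loop is to find the first step where RPS barely grows but p99 jumps. The 10% and 2x thresholds below are illustrative, not kt defaults:

#!/bin/bash
# find-saturation.sh: rough knee detection over the ramp-loop results.
# Flags the first concurrency step where RPS gain is small but p99 doubles.
PREV_RPS=0
PREV_P99=0
for C in 10 25 50 100 200; do
  RPS=$(jq '.requests_per_second | floor' "bench-c${C}.json")
  P99=$(jq '.latency_percentiles.p99_ms' "bench-c${C}.json")
  if [ "$PREV_RPS" -gt 0 ] \
     && [ $(( (RPS - PREV_RPS) * 100 / PREV_RPS )) -lt 10 ] \
     && [ "$P99" -gt $(( PREV_P99 * 2 )) ]; then
    echo "Saturation at concurrency=$C (p99 ${PREV_P99}ms -> ${P99}ms)"
    exit 0
  fi
  PREV_RPS=$RPS
  PREV_P99=$P99
done
echo "No saturation detected in tested range"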

Provider Rate Limit Testing

LLM providers enforce rate limits. Test how the gateway behaves when the provider returns 429 responses:

# policy-config.yaml — rate limit configuration
rate_limits:
  - name: per-user-rate-limit
    scope: user
    requests_per_minute: 60

  - name: global-rate-limit
    scope: global
    requests_per_minute: 1000

Testing Gateway Rate Limits

# Burst 100 requests in 1 second to trigger rate limiting
kt bench --target http://localhost:41002 \
  --requests 100 \
  --concurrency 100 \
  --duration 1s \
  --output rate-limit-test.json

# Count 429 responses
RATE_LIMITED=$(jq '[.responses[] | select(.status_code == 429)] | length' rate-limit-test.json)
echo "Rate-limited requests: $RATE_LIMITED"

Testing Provider 429 Handling

Verify the gateway gracefully handles upstream provider rate limits:

# Use a mock provider that returns 429 after N requests
# (see mock-gateway.md for mock setup)
kt bench --target http://localhost:41002 \
  --requests 200 \
  --concurrency 20 \
  --output provider-429-test.json

# Verify gateway returns appropriate errors (not 500s)
ERRORS_500=$(jq '[.responses[] | select(.status_code == 500)] | length' provider-429-test.json)
if [ "$ERRORS_500" -gt 0 ]; then
echo "FAIL: Gateway returned $ERRORS_500 internal errors on provider 429s"
exit 1
fi
echo "PASS: Gateway handled provider rate limits gracefully"

Stress Testing Policy Complexity

Policy evaluation time scales with the number and complexity of policies. Test with production-like policy configurations:

# Benchmark with minimal policies
kt gateway run --policy-config minimal-policy.yaml --port 41002 &
kt bench --target http://localhost:41002 --requests 500 --concurrency 50 \
  --output bench-minimal.json
kill %1

# Benchmark with full production policies
kt gateway run --policy-config production-policy.yaml --port 41002 &
kt bench --target http://localhost:41002 --requests 500 --concurrency 50 \
  --output bench-production.json
kill %1

# Compare overhead
MINIMAL_P95=$(jq '.latency_percentiles.p95_ms' bench-minimal.json)
PRODUCTION_P95=$(jq '.latency_percentiles.p95_ms' bench-production.json)
echo "Policy overhead (p95): $((PRODUCTION_P95 - MINIMAL_P95))ms"

CI Performance Gate

Add a performance gate to your deployment pipeline:

# .github/workflows/performance-gate.yml
jobs:
  load-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Start gateway
        run: |
          kt gateway run --policy-config policy-config.yaml --port 41002 &
          sleep 3

      - name: Run benchmark
        run: |
          kt bench --target http://localhost:41002 \
            --requests 500 --concurrency 50 \
            --output bench-results.json

      - name: Check latency budget
        run: ./scripts/check-latency-budget.sh

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: benchmark-results
          path: bench-results.json

Key Takeaways

  • Use kt bench to generate realistic load against the gateway
  • Measure latency in percentiles (p50, p95, p99), not averages
  • Identify the throughput ceiling by ramping concurrency and watching tail latency
  • Test both gateway rate limits and upstream provider 429 handling
  • Measure the latency overhead of your production policy configuration
  • Add performance gates to CI to catch regressions before deployment

For AI systems

  • Canonical terms: kt bench, latency percentiles (p50, p75, p90, p95, p99), throughput ceiling, policy chain overhead, rate limiting, provider 429 handling
  • CLI command: kt bench --target <url> --requests <n> --concurrency <n> --duration <time> --model <model> --output <file>
  • Output fields: latency_percentiles.p50_ms, p95_ms, p99_ms, max_ms, mean_ms
  • Gateway overhead = total latency − provider latency (steps 1 + 3 + 4 of the request path)
  • Related pages: Mock Gateway, Regression Testing, Monitoring & Alerting

For engineers

  • Run kt bench --target http://localhost:41002 --requests 1000 --concurrency 50 --output bench-results.json for a realistic load test
  • Analyze percentiles with jq '.latency_percentiles' bench-results.json — focus on p95/p99, not averages
  • Identify the throughput ceiling by ramping concurrency until p99 latency degrades significantly
  • Test rate limit behavior by exceeding configured limits and verifying 429 responses
  • Measure policy chain overhead by comparing latency with all policies vs. a no-policy baseline config
  • Add CI performance gates: fail the build if p99 exceeds your SLO budget (e.g., 200ms gateway overhead)
  • Validate: confirm benchmark results show consistent p99 across multiple runs (less than 20% variance); a sketch of this check follows below
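A minimal sketch of that variance check, assuming three back-to-back runs against the same gateway and policy configuration:

#!/bin/bash
# p99-variance.sh: run three benchmarks and check the p99 spread across runs.
# The 20% threshold matches the validation guidance above.
P99S=()
for RUN in 1 2 3; do
  kt bench --target http://localhost:41002 \
    --requests 500 --concurrency 50 \
    --output "bench-run${RUN}.json"
  P99S+=("$(jq '.latency_percentiles.p99_ms' "bench-run${RUN}.json")")
done

MIN=$(printf '%s\n' "${P99S[@]}" | sort -n | head -1)
MAX=$(printf '%s\n' "${P99S[@]}" | sort -n | tail -1)
SPREAD=$(( (MAX - MIN) * 100 / MIN ))

if [ "$SPREAD" -gt 20 ]; then
  echo "FAIL: p99 varies ${SPREAD}% across runs (${P99S[*]} ms)"
  exit 1
fi
echo "PASS: p99 stable within ${SPREAD}% (${P99S[*]} ms)"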

For leaders

  • Gateway latency overhead directly impacts application response time — budget for it in your SLOs
  • Percentile-based analysis prevents masking severe tail latency behind healthy averages
  • CI performance gates catch regressions before they affect production users
  • Throughput ceiling determines how many gateway replicas are needed for your peak traffic
  • Rate limit testing validates that cost controls and abuse prevention work under pressure
