Load Testing the AI Gateway
The Keeptrusts gateway sits in the critical path of every AI request. Performance degradation in the gateway directly impacts application latency and user experience. Load testing validates that the gateway meets throughput and latency requirements under realistic traffic patterns.
Use this page when
- You need to benchmark gateway throughput and latency under realistic traffic patterns
- You are using kt bench to measure policy chain overhead, latency percentiles, and throughput ceiling
- You want to validate rate limit handling, provider 429 responses, and add performance gates to CI
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
Gateway Architecture & Performance Profile
The gateway performs these operations on every request:
- Input policy chain — evaluate all input-phase policies (topic control, DLP, rate limits)
- Provider forwarding — proxy the request to the upstream LLM provider
- Output policy chain — evaluate all output-phase policies (redaction, quality, disclaimers)
- Event emission — asynchronously send the decision event to the API
The gateway overhead is the time spent in steps 1, 3, and 4. Step 2 is provider latency, which the gateway cannot control. Load testing focuses on measuring and optimizing the gateway-added overhead.
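That relation can be sanity-checked with plain arithmetic. The numbers and variable names below are illustrative, not actual kt bench output fields:

```shell
# Illustrative only: hypothetical numbers, not real kt bench output keys.
TOTAL_MS=850        # end-to-end latency seen by the client
PROVIDER_MS=828     # time spent waiting on the upstream LLM provider
OVERHEAD_MS=$(( TOTAL_MS - PROVIDER_MS ))
echo "Gateway-added overhead: ${OVERHEAD_MS}ms"
```

If provider latency dominates the total like this, optimizing the gateway's 22ms matters only for latency-sensitive paths; the benchmarks below isolate exactly that slice.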
Benchmarking with kt bench
The kt bench command generates synthetic load against a running gateway:
# Basic benchmark: 100 requests, 10 concurrent
kt bench --target http://localhost:41002 \
--requests 100 \
--concurrency 10 \
--model gpt-4o
Configuration Options
# Full benchmark configuration
kt bench \
--target http://localhost:41002 \
--requests 1000 \
--concurrency 50 \
--duration 60s \
--model gpt-4o \
--prompt-file test-prompts.txt \
--output results.json
| Flag | Description | Default |
|---|---|---|
| --target | Gateway URL | http://localhost:41002 |
| --requests | Total request count | 100 |
| --concurrency | Parallel request workers | 10 |
| --duration | Time-bound test (overrides --requests) | — |
| --model | Target model identifier | — |
| --prompt-file | File with prompts (one per line) | Built-in prompt |
| --output | Results output file (JSON) | stdout |
Latency Percentile Analysis
Gateway latency is best understood through percentiles, not averages. A healthy p50 can mask severe p99 tail latency.
Reading Benchmark Results
# Run benchmark and analyze percentiles
kt bench --target http://localhost:41002 \
--requests 500 \
--concurrency 20 \
--output bench-results.json
# Parse latency percentiles
jq '.latency_percentiles' bench-results.json
Sample output:
{
"p50_ms": 12,
"p75_ms": 18,
"p90_ms": 34,
"p95_ms": 52,
"p99_ms": 128,
"max_ms": 340,
"mean_ms": 22
}
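kt bench computes these percentiles internally, but the derivation is worth knowing when spot-checking results. A minimal sketch of the nearest-rank method over raw per-request latencies, in the same sort/awk style used elsewhere on this page (sample values chosen to echo the output above):

```shell
# Nearest-rank percentile over raw per-request latencies. Sample values are illustrative.
SAMPLES="12 8 34 9 15 52 10 11 128 18"   # latencies in ms, unsorted

pctl() {  # pctl <percentile> <samples...>
  P=$1; shift
  printf '%s\n' "$@" | sort -n | awk -v p="$P" \
    '{a[NR]=$1} END { i = int(NR * p / 100); if (i < NR * p / 100) i++; print a[i] }'
}

P50=$(pctl 50 $SAMPLES)
P95=$(pctl 95 $SAMPLES)
echo "p50=${P50}ms p95=${P95}ms"
```

Note how a single 128ms outlier leaves p50 untouched but defines p95 in a small sample, which is why percentile claims need enough requests behind them.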
Setting Latency Budgets
Define acceptable latency budgets for the gateway overhead (excluding provider time):
| Percentile | Budget (gateway overhead) | Typical value |
|---|---|---|
| p50 | < 15ms | 8–12ms |
| p95 | < 60ms | 30–50ms |
| p99 | < 150ms | 80–130ms |
#!/bin/bash
# check-latency-budget.sh — fail if latency exceeds budget
RESULTS="bench-results.json"
P95=$(jq '.latency_percentiles.p95_ms' "$RESULTS")
P99=$(jq '.latency_percentiles.p99_ms' "$RESULTS")
P95_BUDGET=60
P99_BUDGET=150
PASS=true
if [ "$P95" -gt "$P95_BUDGET" ]; then
echo "FAIL: p95 latency ${P95}ms exceeds budget ${P95_BUDGET}ms"
PASS=false
fi
if [ "$P99" -gt "$P99_BUDGET" ]; then
echo "FAIL: p99 latency ${P99}ms exceeds budget ${P99_BUDGET}ms"
PASS=false
fi
if [ "$PASS" = true ]; then
echo "PASS: All latency budgets met (p95=${P95}ms, p99=${P99}ms)"
else
exit 1
fi
Throughput Testing
Measure the maximum requests per second the gateway can sustain:
# Ramp up concurrency to find throughput ceiling
for CONCURRENCY in 10 25 50 100 200; do
echo "Testing concurrency=$CONCURRENCY"
kt bench --target http://localhost:41002 \
--requests 500 \
--concurrency $CONCURRENCY \
--output "bench-c${CONCURRENCY}.json"
RPS=$(jq '.requests_per_second' "bench-c${CONCURRENCY}.json")
P99=$(jq '.latency_percentiles.p99_ms' "bench-c${CONCURRENCY}.json")
echo " RPS: $RPS, p99: ${P99}ms"
done
Throughput vs. Latency Curve
Plot the results to identify the point where latency degrades:
| Concurrency | RPS | p50 (ms) | p99 (ms) |
|---|---|---|---|
| 10 | 180 | 10 | 45 |
| 25 | 420 | 12 | 58 |
| 50 | 750 | 15 | 92 |
| 100 | 980 | 28 | 210 |
| 200 | 1050 | 85 | 580 |
In this example, throughput plateaus around 100 concurrency while tail latency rises sharply — the gateway is saturated beyond ~1000 RPS.
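Eyeballing the table works, but the check can be scripted. A sketch that flags the first concurrency level whose p99 exceeds a budget, using the table's data inline:

```shell
# Flag the first concurrency level whose p99 blows the budget (data from the table above)
P99_BUDGET=150
SATURATED_AT=""
while read CONC RPS P99; do
  if [ -z "$SATURATED_AT" ] && [ "$P99" -gt "$P99_BUDGET" ]; then
    SATURATED_AT=$CONC
  fi
done <<'EOF'
10 180 45
25 420 58
50 750 92
100 980 210
200 1050 580
EOF
echo "p99 budget first exceeded at concurrency=${SATURATED_AT}"
```

In a real run, feed the loop from the bench-c*.json files produced by the ramp script above instead of a hardcoded table.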
Provider Rate Limit Testing
LLM providers enforce rate limits. Test how the gateway behaves when the provider returns 429 responses:
# policy-config.yaml — rate limit configuration
rate_limits:
- name: per-user-rate-limit
scope: user
requests_per_minute: 60
- name: global-rate-limit
scope: global
requests_per_minute: 1000
Testing Gateway Rate Limits
# Burst 100 requests in 1 second to trigger rate limiting
kt bench --target http://localhost:41002 \
--requests 100 \
--concurrency 100 \
--duration 1s \
--output rate-limit-test.json
# Count 429 responses
RATE_LIMITED=$(jq '[.responses[] | select(.status_code == 429)] | length' rate-limit-test.json)
echo "Rate-limited requests: $RATE_LIMITED"
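As a sanity check on the burst test: with a 60 requests-per-minute per-user limit and a 100-request burst inside one window, a fixed-window limiter should reject roughly 40 requests (sliding windows and token buckets will land on different counts):

```shell
# Rough expectation for a fixed-window limiter; other window algorithms differ.
LIMIT_PER_MIN=60
BURST=100
EXPECTED_429=$(( BURST > LIMIT_PER_MIN ? BURST - LIMIT_PER_MIN : 0 ))
echo "Expect roughly ${EXPECTED_429} rate-limited responses"
```

Compare that figure against the $RATE_LIMITED count from the jq query above; a large gap suggests the limiter is not scoped the way the config intends.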
Testing Provider 429 Handling
Verify the gateway gracefully handles upstream provider rate limits:
# Use a mock provider that returns 429 after N requests
# (see mock-gateway.md for mock setup)
kt bench --target http://localhost:41002 \
--requests 200 \
--concurrency 20 \
--output provider-429-test.json
# Verify gateway returns appropriate errors (not 500s)
ERRORS_500=$(jq '[.responses[] | select(.status_code == 500)] | length' provider-429-test.json)
if [ "$ERRORS_500" -gt 0 ]; then
echo "FAIL: Gateway returned $ERRORS_500 internal errors on provider 429s"
exit 1
fi
echo "PASS: Gateway handled provider rate limits gracefully"
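Here "gracefully" means surfacing the provider's 429 to the caller, or retrying with backoff, rather than converting it into a 500. Purely as an illustration of what a retry policy can look like (not necessarily the gateway's actual behavior), an exponential backoff schedule:

```shell
# Illustrative exponential backoff schedule; assumed behavior, not documented gateway internals.
BASE_MS=100
SCHEDULE=""
for ATTEMPT in 1 2 3 4; do
  DELAY_MS=$(( BASE_MS * (1 << (ATTEMPT - 1)) ))       # 100, 200, 400, 800
  SCHEDULE="${SCHEDULE}${SCHEDULE:+ }${DELAY_MS}"
done
echo "Retry delays (ms): ${SCHEDULE}"
```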
Stress Testing Policy Complexity
Policy evaluation time scales with the number and complexity of policies. Test with production-like policy configurations:
# Benchmark with minimal policies
kt gateway run --policy-config minimal-policy.yaml --port 41002 &
kt bench --target http://localhost:41002 --requests 500 --concurrency 50 \
--output bench-minimal.json
kill $!   # job specs like %1 need job control; $! works in non-interactive scripts
# Benchmark with full production policies
kt gateway run --policy-config production-policy.yaml --port 41002 &
kt bench --target http://localhost:41002 --requests 500 --concurrency 50 \
--output bench-production.json
kill $!
# Compare overhead
MINIMAL_P95=$(jq '.latency_percentiles.p95_ms' bench-minimal.json)
PRODUCTION_P95=$(jq '.latency_percentiles.p95_ms' bench-production.json)
echo "Policy overhead (p95): $((PRODUCTION_P95 - MINIMAL_P95))ms"
CI Performance Gate
Add a performance gate to your deployment pipeline:
# .github/workflows/performance-gate.yml
jobs:
load-test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Start gateway
run: |
kt gateway run --policy-config policy-config.yaml --port 41002 &
sleep 3
- name: Run benchmark
run: |
kt bench --target http://localhost:41002 \
--requests 500 --concurrency 50 \
--output bench-results.json
- name: Check latency budget
run: ./scripts/check-latency-budget.sh
- name: Upload results
uses: actions/upload-artifact@v4
with:
name: benchmark-results
path: bench-results.json
Key Takeaways
- Use kt bench to generate realistic load against the gateway
- Measure latency in percentiles (p50, p95, p99), not averages
- Identify the throughput ceiling by ramping concurrency and watching tail latency
- Test both gateway rate limits and upstream provider 429 handling
- Measure the latency overhead of your production policy configuration
- Add performance gates to CI to catch regressions before deployment
For AI systems
- Canonical terms: kt bench, latency percentiles (p50, p75, p90, p95, p99), throughput ceiling, policy chain overhead, rate limiting, provider 429 handling
- CLI command: kt bench --target <url> --requests <n> --concurrency <n> --duration <time> --model <model> --output <file>
- Output fields: latency_percentiles.p50_ms, p95_ms, p99_ms, max_ms, mean_ms
- Gateway overhead = total latency − provider latency (steps 1 + 3 + 4 of the request path)
- Related pages: Mock Gateway, Regression Testing, Monitoring & Alerting
For engineers
- Run kt bench --target http://localhost:41002 --requests 1000 --concurrency 50 --output bench-results.json for a realistic load test
- Analyze percentiles with jq '.latency_percentiles' bench-results.json — focus on p95/p99, not averages
- Identify the throughput ceiling by ramping concurrency until p99 latency degrades significantly
- Test rate limit behavior by exceeding configured limits and verifying 429 responses
- Measure policy chain overhead by comparing latency with all policies vs. a no-policy baseline config
- Add CI performance gates: fail the build if p99 exceeds your SLO budget (e.g., 200ms gateway overhead)
- Validate: confirm benchmark results show consistent p99 across multiple runs (less than 20% variance)
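The 20% variance check above can be scripted with integer arithmetic; one way to read the bound is max p99 ≤ 1.2 × min p99, which avoids floating point entirely (the p99 values below are illustrative):

```shell
# Run-to-run p99 stability check using integer arithmetic:
# max <= min * 1.2  is equivalent to  5 * max <= 6 * min.
P99_RUNS="128 112 121"    # illustrative p99 values from three benchmark runs
MAX=0; MIN=999999
for V in $P99_RUNS; do
  if [ "$V" -gt "$MAX" ]; then MAX=$V; fi
  if [ "$V" -lt "$MIN" ]; then MIN=$V; fi
done
if [ $(( MAX * 5 )) -le $(( MIN * 6 )) ]; then STABLE=yes; else STABLE=no; fi
echo "p99 stable across runs (within 20%): $STABLE"
```

In practice, populate P99_RUNS with jq '.latency_percentiles.p99_ms' from each run's results file.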
For leaders
- Gateway latency overhead directly impacts application response time — budget for it in your SLOs
- Percentile-based analysis prevents masking severe tail latency behind healthy averages
- CI performance gates catch regressions before they affect production users
- Throughput ceiling determines how many gateway replicas are needed for your peak traffic
- Rate limit testing validates that cost controls and abuse prevention work under pressure
Next steps
- Use Mock Gateway for deterministic load tests without upstream provider variability
- Detect performance regressions with Regression Testing before/after comparison
- Set up Monitoring & Alerting for production latency tracking