Load Testing the AI Gateway
The Keeptrusts gateway sits in the critical path of every AI request. Performance degradation in the gateway directly impacts application latency and user experience. Load testing validates that the gateway meets throughput and latency requirements under realistic traffic patterns.
Use this page when
- You need to benchmark gateway throughput and latency under realistic traffic patterns
- You are using kt bench to measure policy chain overhead, latency percentiles, and throughput ceiling
- You want to validate rate limit handling, provider 429 responses, and add performance gates to CI
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
Gateway Architecture & Performance Profile
The gateway performs these operations on every request:
- Input policy chain — evaluate all input-phase policies (topic control, DLP, rate limits)
- Provider forwarding — proxy the request to the upstream LLM provider
- Output policy chain — evaluate all output-phase policies (redaction, quality, disclaimers)
- Event emission — asynchronously send the decision event to the API
The gateway overhead is the time spent in steps 1, 3, and 4. Step 2 is provider latency, which the gateway cannot control. Load testing focuses on measuring and optimizing the gateway-added overhead.
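That relation can be sanity-checked with plain arithmetic. The numbers and variable names below are illustrative, not actual kt bench output fields:

```shell
# Illustrative only: hypothetical numbers, not real kt bench output keys.
TOTAL_MS=850        # end-to-end latency seen by the client
PROVIDER_MS=828     # time spent waiting on the upstream LLM provider
OVERHEAD_MS=$(( TOTAL_MS - PROVIDER_MS ))
echo "Gateway-added overhead: ${OVERHEAD_MS}ms"
```

If provider latency dominates the total like this, optimizing the gateway's 22ms matters only for latency-sensitive paths; the benchmarks below isolate exactly that slice.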
Benchmarking with kt bench
The kt bench command generates synthetic load against a running gateway:
# Basic benchmark: 100 requests, 10 concurrent
kt bench --target http://localhost:41002 \
--requests 100 \
--concurrency 10 \
--model gpt-4o
Configuration Options
# Full benchmark configuration
kt bench \
--target http://localhost:41002 \
--requests 1000 \
--concurrency 50 \
--duration 60s \
--model gpt-4o \
--prompt-file test-prompts.txt \
--output results.json
| Flag | Description | Default |
|---|---|---|
| --target | Gateway URL | http://localhost:41002 |
| --requests | Total request count | 100 |
| --concurrency | Parallel request workers | 10 |
| --duration | Time-bound test (overrides --requests) | — |
| --model | Target model identifier | — |
| --prompt-file | File with prompts (one per line) | Built-in prompt |
| --output | Results output file (JSON) | stdout |
Latency Percentile Analysis
Gateway latency is best understood through percentiles, not averages. A healthy p50 can mask severe p99 tail latency.
Reading Benchmark Results
# Run benchmark and analyze percentiles
kt bench --target http://localhost:41002 \
--requests 500 \
--concurrency 20 \
--output bench-results.json
# Parse latency percentiles
jq '.latency_percentiles' bench-results.json
Sample output:
{
"p50_ms": 12,
"p75_ms": 18,
"p90_ms": 34,
"p95_ms": 52,
"p99_ms": 128,
"max_ms": 340,
"mean_ms": 22
}
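kt bench computes these percentiles internally, but the derivation is worth knowing when spot-checking results. A minimal sketch of the nearest-rank method over raw per-request latencies, in the same sort/awk style used elsewhere on this page (sample values chosen to echo the output above):

```shell
# Nearest-rank percentile over raw per-request latencies. Sample values are illustrative.
SAMPLES="12 8 34 9 15 52 10 11 128 18"   # latencies in ms, unsorted

pctl() {  # pctl <percentile> <samples...>
  P=$1; shift
  printf '%s\n' "$@" | sort -n | awk -v p="$P" \
    '{a[NR]=$1} END { i = int(NR * p / 100); if (i < NR * p / 100) i++; print a[i] }'
}

P50=$(pctl 50 $SAMPLES)
P95=$(pctl 95 $SAMPLES)
echo "p50=${P50}ms p95=${P95}ms"
```

Note how a single 128ms outlier leaves p50 untouched but defines p95 in a small sample, which is why percentile claims need enough requests behind them.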
Setting Latency Budgets
Define acceptable latency budgets for the gateway overhead (excluding provider time):
| Percentile | Budget (gateway overhead) | Typical value |
|---|---|---|
| p50 | < 15ms | 8–12ms |
| p95 | < 60ms | 30–50ms |
| p99 | < 150ms | 80–130ms |
#!/bin/bash
# check-latency-budget.sh — fail if latency exceeds budget
RESULTS="bench-results.json"
P95=$(jq '.latency_percentiles.p95_ms' "$RESULTS")
P99=$(jq '.latency_percentiles.p99_ms' "$RESULTS")
P95_BUDGET=60
P99_BUDGET=150
PASS=true
if [ "$P95" -gt "$P95_BUDGET" ]; then
echo "FAIL: p95 latency ${P95}ms exceeds budget ${P95_BUDGET}ms"
PASS=false
fi
if [ "$P99" -gt "$P99_BUDGET" ]; then
echo "FAIL: p99 latency ${P99}ms exceeds budget ${P99_BUDGET}ms"
PASS=false
fi
if [ "$PASS" = true ]; then
echo "PASS: All latency budgets met (p95=${P95}ms, p99=${P99}ms)"
else
exit 1
fi
Throughput Testing
Measure the maximum requests per second the gateway can sustain:
# Ramp up concurrency to find throughput ceiling
for CONCURRENCY in 10 25 50 100 200; do
echo "Testing concurrency=$CONCURRENCY"
kt bench --target http://localhost:41002 \
--requests 500 \
--concurrency $CONCURRENCY \
--output "bench-c${CONCURRENCY}.json"
RPS=$(jq '.requests_per_second' "bench-c${CONCURRENCY}.json")
P99=$(jq '.latency_percentiles.p99_ms' "bench-c${CONCURRENCY}.json")
echo " RPS: $RPS, p99: ${P99}ms"
done
Throughput vs. Latency Curve
Plot the results to identify the point where latency degrades:
| Concurrency | RPS | p50 (ms) | p99 (ms) |
|---|---|---|---|
| 10 | 180 | 10 | 45 |
| 25 | 420 | 12 | 58 |
| 50 | 750 | 15 | 92 |
| 100 | 980 | 28 | 210 |
| 200 | 1050 | 85 | 580 |
In this example, throughput plateaus around 100 concurrency while tail latency rises sharply — the gateway is saturated beyond ~1000 RPS.
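Eyeballing the table works, but the check can be scripted. A sketch that flags the first concurrency level whose p99 exceeds a budget, using the table's data inline:

```shell
# Flag the first concurrency level whose p99 blows the budget (data from the table above)
P99_BUDGET=150
SATURATED_AT=""
while read CONC RPS P99; do
  if [ -z "$SATURATED_AT" ] && [ "$P99" -gt "$P99_BUDGET" ]; then
    SATURATED_AT=$CONC
  fi
done <<'EOF'
10 180 45
25 420 58
50 750 92
100 980 210
200 1050 580
EOF
echo "p99 budget first exceeded at concurrency=${SATURATED_AT}"
```

In a real run, feed the loop from the bench-c*.json files produced by the ramp script above instead of a hardcoded table.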
Provider Rate Limit Testing
LLM providers enforce rate limits. Test how the gateway behaves when the provider returns 429 responses:
# policy-config.yaml — rate limit configuration
rate_limits:
- name: per-user-rate-limit
scope: user
requests_per_minute: 60
- name: global-rate-limit
scope: global
requests_per_minute: 1000
Testing Gateway Rate Limits
# Burst 100 requests in 1 second to trigger rate limiting
kt bench --target http://localhost:41002 \
--requests 100 \
--concurrency 100 \
--duration 1s \
--output rate-limit-test.json
# Count 429 responses
RATE_LIMITED=$(jq '[.responses[] | select(.status_code == 429)] | length' rate-limit-test.json)
echo "Rate-limited requests: $RATE_LIMITED"
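As a sanity check on the burst test: with a 60 requests-per-minute per-user limit and a 100-request burst inside one window, a fixed-window limiter should reject roughly 40 requests (sliding windows and token buckets will land on different counts):

```shell
# Rough expectation for a fixed-window limiter; other window algorithms differ.
LIMIT_PER_MIN=60
BURST=100
EXPECTED_429=$(( BURST > LIMIT_PER_MIN ? BURST - LIMIT_PER_MIN : 0 ))
echo "Expect roughly ${EXPECTED_429} rate-limited responses"
```

Compare that figure against the $RATE_LIMITED count from the jq query above; a large gap suggests the limiter is not scoped the way the config intends.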
Testing Provider 429 Handling
Verify the gateway gracefully handles upstream provider rate limits:
# Use a mock provider that returns 429 after N requests
# (see mock-gateway.md for mock setup)
kt bench --target http://localhost:41002 \
--requests 200 \
--concurrency 20 \
--output provider-429-test.json
# Verify gateway returns appropriate errors (not 500s)
ERRORS_500=$(jq '[.responses[] | select(.status_code == 500)] | length' provider-429-test.json)
if [ "$ERRORS_500" -gt 0 ]; then
echo "FAIL: Gateway returned $ERRORS_500 internal errors on provider 429s"
exit 1
fi
echo "PASS: Gateway handled provider rate limits gracefully"
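Here "gracefully" means surfacing the provider's 429 to the caller, or retrying with backoff, rather than converting it into a 500. Purely as an illustration of what a retry policy can look like (not necessarily the gateway's actual behavior), an exponential backoff schedule:

```shell
# Illustrative exponential backoff schedule; assumed behavior, not documented gateway internals.
BASE_MS=100
SCHEDULE=""
for ATTEMPT in 1 2 3 4; do
  DELAY_MS=$(( BASE_MS * (1 << (ATTEMPT - 1)) ))       # 100, 200, 400, 800
  SCHEDULE="${SCHEDULE}${SCHEDULE:+ }${DELAY_MS}"
done
echo "Retry delays (ms): ${SCHEDULE}"
```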
Stress Testing Policy Complexity
Policy evaluation time scales with the number and complexity of policies. Test with production-like policy configurations:
# Benchmark with minimal policies
kt gateway run --policy-config minimal-policy.yaml --port 41002 &
kt bench --target http://localhost:41002 --requests 500 --concurrency 50 \
--output bench-minimal.json
kill $!   # job specs like %1 need job control; $! works in non-interactive scripts
# Benchmark with full production policies
kt gateway run --policy-config production-policy.yaml --port 41002 &
kt bench --target http://localhost:41002 --requests 500 --concurrency 50 \
--output bench-production.json
kill $!
# Compare overhead
MINIMAL_P95=$(jq '.latency_percentiles.p95_ms' bench-minimal.json)
PRODUCTION_P95=$(jq '.latency_percentiles.p95_ms' bench-production.json)
echo "Policy overhead (p95): $((PRODUCTION_P95 - MINIMAL_P95))ms"
CI Performance Gate
Add a performance gate to your deployment pipeline:
# .github/workflows/performance-gate.yml
jobs:
load-test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Start gateway
run: |
kt gateway run --policy-config policy-config.yaml --port 41002 &
sleep 3
- name: Run benchmark
run: |
kt bench --target http://localhost:41002 \
--requests 500 --concurrency 50 \
--output bench-results.json
- name: Check latency budget
run: ./scripts/check-latency-budget.sh
- name: Upload results
uses: actions/upload-artifact@v4
with:
name: benchmark-results
path: bench-results.json
Key Takeaways
- Use kt bench to generate realistic load against the gateway
- Measure latency in percentiles (p50, p95, p99), not averages
- Identify the throughput ceiling by ramping concurrency and watching tail latency
- Test both gateway rate limits and upstream provider 429 handling
- Measure the latency overhead of your production policy configuration
- Add performance gates to CI to catch regressions before deployment
For AI systems
- Canonical terms: kt bench, latency percentiles (p50, p75, p90, p95, p99), throughput ceiling, policy chain overhead, rate limiting, provider 429 handling
- CLI command: kt bench --target <url> --requests <n> --concurrency <n> --duration <time> --model <model> --output <file>
- Output fields: latency_percentiles.p50_ms, p95_ms, p99_ms, max_ms, mean_ms
- Gateway overhead = total latency − provider latency (steps 1 + 3 + 4 of the request path)
- Related pages: Mock Gateway, Regression Testing, Monitoring & Alerting
For engineers
- Run kt bench --target http://localhost:41002 --requests 1000 --concurrency 50 --output bench-results.json for a realistic load test
- Analyze percentiles with jq '.latency_percentiles' bench-results.json — focus on p95/p99, not averages
- Identify the throughput ceiling by ramping concurrency until p99 latency degrades significantly
- Test rate limit behavior by exceeding configured limits and verifying 429 responses
- Measure policy chain overhead by comparing latency with all policies vs. a no-policy baseline config
- Add CI performance gates: fail the build if p99 exceeds your SLO budget (e.g., 200ms gateway overhead)
- Validate: confirm benchmark results show consistent p99 across multiple runs (less than 20% variance)
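The 20% variance check above can be scripted with integer arithmetic; one way to read the bound is max p99 ≤ 1.2 × min p99, which avoids floating point entirely (the p99 values below are illustrative):

```shell
# Run-to-run p99 stability check using integer arithmetic:
# max <= min * 1.2  is equivalent to  5 * max <= 6 * min.
P99_RUNS="128 112 121"    # illustrative p99 values from three benchmark runs
MAX=0; MIN=999999
for V in $P99_RUNS; do
  if [ "$V" -gt "$MAX" ]; then MAX=$V; fi
  if [ "$V" -lt "$MIN" ]; then MIN=$V; fi
done
if [ $(( MAX * 5 )) -le $(( MIN * 6 )) ]; then STABLE=yes; else STABLE=no; fi
echo "p99 stable across runs (within 20%): $STABLE"
```

In practice, populate P99_RUNS with jq '.latency_percentiles.p99_ms' from each run's results file.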
For leaders
- Gateway latency overhead directly impacts application response time — budget for it in your SLOs
- Percentile-based analysis prevents masking severe tail latency behind healthy averages
- CI performance gates catch regressions before they affect production users
- Throughput ceiling determines how many gateway replicas are needed for your peak traffic
- Rate limit testing validates that cost controls and abuse prevention work under pressure
Next steps
- Use Mock Gateway for deterministic load tests without upstream provider variability
- Detect performance regressions with Regression Testing before/after comparison
- Set up Monitoring & Alerting for production latency tracking