
Performance Tuning: Optimizing Gateway for High-Throughput Workloads

High-throughput gateway tuning is mostly about discipline. Teams often assume the provider is the only meaningful latency source and ignore the smaller decisions closer to home: how long the policy chain is, how much concurrency the runtime accepts, whether the workload is spread across enough replicas, and whether anyone is measuring the right slices before changing them. Keeptrusts gives you the levers to tune this path, but the value comes from using them in the right order.

Use this page when

  • Your gateway is handling enough traffic that p95 latency or queueing is becoming visible.
  • You need to improve throughput without loosening governance controls unnecessarily.
  • You want a safe tuning loop built on measurement rather than guesswork.

Primary audience

  • Primary: Platform engineers, SREs, and performance-minded developers
  • Secondary: Technical leaders planning capacity and rollout safety

The problem

Performance issues in governed AI traffic rarely come from one single root cause. The upstream model call is usually the largest latency component, but it is not the only one. Policy evaluation, routing choice, gateway concurrency, and scaling topology all affect the result your users feel.

That is why high-throughput tuning fails when it starts by turning knobs prematurely. Teams add infrastructure before confirming whether the policy chain is unnecessarily long. They remove policy controls before checking whether the gateway is simply underprovisioned. They tune concurrency without measuring whether the provider or model choice is the larger bottleneck.

The other common mistake is tuning from a single average. Throughput work should focus on tail behavior, backlog risk, and the specific routes that matter most to the business. The gateway path for a low-latency interactive assistant should not necessarily look the same as the path for a slower batch workflow.
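
To make the point about averages concrete, here is a minimal sketch that compares the mean with the median and p95 of one latency sample. The latency values are invented for illustration and do not come from any real gateway; `percentile` is a hypothetical helper using the nearest-rank method.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 18 fast requests and 2 slow ones.
latencies_ms = [120] * 18 + [2400] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)

print(f"mean={mean_ms:.0f}ms p50={p50_ms}ms p95={p95_ms}ms")
```

With this sample the mean is 348 ms, which describes neither the typical request (120 ms) nor the slow tail (2400 ms). That gap is the reason to look at tail percentiles per route instead of one blended average.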

The solution

Keeptrusts gives you a clean performance workflow:

  1. Measure current behavior with /metrics and recent event evidence.
  2. Keep latency-sensitive chains short and intentional.
  3. Set runtime concurrency deliberately with --max-concurrency or KEEPTRUSTS_GATEWAY_MAX_CONCURRENCY.
  4. Use Kubernetes replicas and autoscaling when throughput growth is real rather than temporary.
  5. Verify the effect of each change with the same monitoring surfaces you used at the start.

This matters because the gateway is already designed to keep some overhead out of the user path. Keeptrusts event emission is asynchronous, which means observability does not have to block the response path. The tuning work is therefore about the parts you still control directly: routing, policy composition, concurrency, and deployment shape.
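
The general pattern behind asynchronous event emission can be sketched as follows. This is a generic illustration of the idea, not Keeptrusts's implementation: the queue, the worker, the event fields, and the drop-on-full behavior are all assumptions made for the example.

```python
import queue
import threading
import time

# Bounded buffer between the request path and the event sink (size is arbitrary here).
events = queue.Queue(maxsize=1000)
delivered = []


def emit_worker():
    """Background thread: drains events and pays the slow write cost off the request path."""
    while True:
        event = events.get()
        time.sleep(0.05)  # stand-in for a slow network write to an event store
        delivered.append(event)
        events.task_done()


def handle_request(request_id):
    """Request path: builds the response and enqueues the event without waiting for delivery."""
    response = f"response-{request_id}"
    try:
        events.put_nowait({"request_id": request_id, "verdict": "allow"})
    except queue.Full:
        pass  # assumption for this sketch: drop the event rather than block the caller
    return response


threading.Thread(target=emit_worker, daemon=True).start()

start = time.monotonic()
responses = [handle_request(i) for i in range(5)]
request_path_seconds = time.monotonic() - start

events.join()  # wait for the background worker only so the sketch can report delivery
print(f"request path took {request_path_seconds:.4f}s, delivered {len(delivered)} events")
```

The request path finishes without waiting for the five simulated writes, which are paid for by the worker thread. The trade-off in a design like this is what to do when the buffer fills: blocking protects the event record, dropping protects latency.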

Implementation

Start by measuring recent traffic, not by editing the config immediately:

curl -fsS http://localhost:41002/metrics | rg 'keeptrusts_(requests_total|request_duration_seconds|policy_evaluations_total)'

kt events tail --since 10m --json | \
  jq '.[] | {timestamp, verdict, provider, model, config_version}'

That gives you two useful views. Metrics show rate and latency shape. Events show which models, providers, and config versions were actually in play during the period you care about.
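
To turn the raw metrics text into a tail-latency number, one option is to estimate a quantile from histogram buckets. The sketch below assumes that keeptrusts_request_duration_seconds is exposed as a standard Prometheus histogram with `_bucket` series and `le` labels; the source only names the metric, so verify that against your own /metrics output. The sample text and bucket boundaries are invented.

```python
import re

# Assumption: the duration metric is a Prometheus histogram with "_bucket" series.
BUCKET_RE = re.compile(
    r'^keeptrusts_request_duration_seconds_bucket\{(?P<labels>[^}]*)\}\s+(?P<count>\S+)'
)
LE_RE = re.compile(r'le="([^"]+)"')


def parse_buckets(metrics_text):
    """Return sorted (upper_bound, cumulative_count) pairs, summed across label sets."""
    totals = {}
    for line in metrics_text.splitlines():
        match = BUCKET_RE.match(line.strip())
        if not match:
            continue
        le = LE_RE.search(match.group("labels"))
        if not le:
            continue
        bound = float(le.group(1))  # float() also parses "+Inf"
        totals[bound] = totals.get(bound, 0.0) + float(match.group("count"))
    return sorted(totals.items())


def estimate_quantile(buckets, q):
    """Estimate the q-th quantile by linear interpolation inside the matching bucket."""
    if not buckets or buckets[-1][1] == 0:
        return None
    target = q * buckets[-1][1]
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= target:
            if bound == float("inf"):
                return prev_bound  # beyond the last finite bucket: report its upper bound
            span = count - prev_count
            if span == 0:
                return bound
            return prev_bound + (bound - prev_bound) * (target - prev_count) / span
        prev_bound, prev_count = bound, count
    return None


# Invented sample in Prometheus text format.
sample = """
keeptrusts_request_duration_seconds_bucket{le="0.5"} 60
keeptrusts_request_duration_seconds_bucket{le="1"} 90
keeptrusts_request_duration_seconds_bucket{le="2.5"} 98
keeptrusts_request_duration_seconds_bucket{le="+Inf"} 100
"""

p95_seconds = estimate_quantile(parse_buckets(sample), 0.95)
print(f"estimated p95: {p95_seconds:.4f}s")
```

Histogram counters are cumulative since process start, so a window-specific p95 needs the difference between two scrapes taken at the start and end of the window. The estimate is also only as precise as the bucket boundaries allow.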

Next, tune concurrency deliberately:

export KEEPTRUSTS_GATEWAY_MAX_CONCURRENCY=64

kt gateway run \
  --listen 0.0.0.0:41002 \
  --policy-config policy-config.yaml

This is a real runtime lever, but it is not a magic number. Raising concurrency helps when the workload and host can support it. It hurts when the host is already saturated or when the provider path is the real bottleneck. That is why measurement comes first.
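
A rough starting point for the number can come from Little's law: the average number of requests in flight equals the arrival rate multiplied by the average time each request spends in the system. The sketch below applies that relationship; the function name, the headroom factor, and the traffic figures are assumptions for illustration, not Keeptrusts guidance.

```python
import math


def required_concurrency(requests_per_second, mean_latency_seconds, headroom=1.25):
    """Little's law (in-flight = rate x latency), padded by a headroom factor for bursts."""
    if requests_per_second < 0 or mean_latency_seconds < 0 or headroom <= 0:
        raise ValueError("rate and latency must be non-negative, headroom positive")
    return math.ceil(requests_per_second * mean_latency_seconds * headroom)


# Hypothetical measurement: 40 requests/s with a 1.5 s mean end-to-end latency.
steady_state = required_concurrency(40, 1.5, headroom=1.0)
with_headroom = required_concurrency(40, 1.5)

print(f"steady state in flight: {steady_state}, with 25% headroom: {with_headroom}")
```

In this example a limit of 64 covers the steady state of 60 in-flight requests but not the padded figure of 75. That suggests either a higher limit, if the host has capacity, or more replicas. The estimate uses mean latency, so it describes average occupancy and says nothing about tail behavior.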

Policy composition is the next major lever. For latency-sensitive paths, keep the chain focused on controls that are actually needed for that route:

pack:
  name: throughput-tuned
  version: 1.0.0
  enabled: true
policies:
  chain:
    - prompt-injection
    - pii-detector
    - audit-logger
providers:
  targets:
    - id: openai-primary
      provider: openai
      model: gpt-5.4-mini
      base_url: https://api.openai.com
      secret_key_ref:
        env: OPENAI_API_KEY
  routing:
    strategy: ordered

The goal here is not to remove governance. It is to avoid paying for policies that do not add value on a given route. Interactive developer tooling and customer-facing assistants often need a different balance than slower, higher-context workflows.
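
One way to reason about chain length is to compare the chain's total overhead with the route's latency budget. The sketch below assumes policies run sequentially so that their costs add up, which the source does not state. The per-policy latencies and the 300 ms budget are placeholders; substitute values measured on your own gateway.

```python
def chain_overhead_ms(policy_latency_ms, chain):
    """Sum the measured per-policy latency for a chain, assuming sequential evaluation."""
    missing = [name for name in chain if name not in policy_latency_ms]
    if missing:
        raise KeyError(f"no measurement for: {', '.join(missing)}")
    return sum(policy_latency_ms[name] for name in chain)


def budget_share(overhead_ms, route_budget_ms):
    """Fraction of the route's latency budget consumed by policy evaluation."""
    if route_budget_ms <= 0:
        raise ValueError("route_budget_ms must be positive")
    return overhead_ms / route_budget_ms


# Placeholder measurements in milliseconds, not real Keeptrusts numbers.
measured_ms = {"prompt-injection": 12.0, "pii-detector": 8.0, "audit-logger": 1.0}
chain = ["prompt-injection", "pii-detector", "audit-logger"]

overhead = chain_overhead_ms(measured_ms, chain)
share = budget_share(overhead, route_budget_ms=300.0)
print(f"chain overhead: {overhead:.0f} ms ({share:.0%} of a 300 ms budget)")
```

A comparison like this keeps the discussion about a policy's value on a route tied to what it costs there, instead of removing controls on instinct.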

After chain composition, deployment topology becomes the next lever. If one instance is healthy but saturated, add replicas and let the orchestrator distribute traffic. Kubernetes is usually the cleanest path once throughput becomes sustained rather than bursty:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: keeptrusts-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: keeptrusts-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Adding replicas is not just about raw throughput. More replicas also help isolate noisy spikes and reduce the chance that one stressed process shapes the full user experience.
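
To see how a 70% CPU target translates into replica counts, the sketch below applies the core formula documented for the Kubernetes HorizontalPodAutoscaler: desired replicas equal the current replica count multiplied by the ratio of the current metric to the target metric, rounded up. The sketch deliberately ignores the autoscaler's tolerance band and stabilization windows, and the utilization figures are hypothetical.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=2, max_replicas=10):
    """Core HPA formula: ceil(current * currentMetric / targetMetric), clamped to bounds."""
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


# Hypothetical readings against the 70% target and 2-10 replica range from the manifest.
print(desired_replicas(4, 91, 70))    # sustained load above target: scale out
print(desired_replicas(2, 35, 70))    # quiet period: held at minReplicas
print(desired_replicas(10, 140, 70))  # extreme load: capped at maxReplicas
```

CPU is only a proxy here. A gateway that spends most of its time waiting on the provider can queue requests while CPU stays low, so it is worth checking whether CPU utilization actually tracks saturation for your workload before relying on it as the scaling signal.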

One practical tuning loop is:

  1. Capture a 10-minute baseline from /metrics and kt events tail.
  2. Change one variable: chain length, concurrency, or replica count.
  3. Run the same workload window again.
  4. Compare p95 latency, block-rate stability, and request success.

Do not change two or three variables at once unless you are handling an emergency. If you do, you will learn less from the result.
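
The comparison step of the loop can be sketched as a small helper. `RunSummary`, its fields, the drift threshold, and the numbers are all hypothetical; the inputs would come from whatever you captured from /metrics and the event stream for each window.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Summary of one measurement window (hypothetical structure for this sketch)."""
    p95_seconds: float
    total_requests: int
    blocked_requests: int
    failed_requests: int

    @property
    def block_rate(self):
        return self.blocked_requests / self.total_requests

    @property
    def success_rate(self):
        return 1 - self.failed_requests / self.total_requests


def compare(baseline, candidate, max_block_rate_drift=0.01):
    """Report the change in p95, block rate, and success rate between two windows."""
    drift = candidate.block_rate - baseline.block_rate
    return {
        "p95_change_pct": (candidate.p95_seconds - baseline.p95_seconds)
        / baseline.p95_seconds * 100,
        "block_rate_drift": drift,
        "block_rate_stable": abs(drift) <= max_block_rate_drift,
        "success_rate_change": candidate.success_rate - baseline.success_rate,
    }


# Invented numbers for a baseline window and a window after one change.
baseline = RunSummary(p95_seconds=2.0, total_requests=1000, blocked_requests=50, failed_requests=10)
candidate = RunSummary(p95_seconds=1.5, total_requests=1000, blocked_requests=52, failed_requests=8)

report = compare(baseline, candidate)
print(report)
```

Block-rate stability belongs in the comparison because a performance change that also shifts how often requests are blocked has altered governance behavior, not just speed.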

It is also worth separating interactive and non-interactive traffic if they have different latency budgets. Keeptrusts provider routing and deployment topology make that possible without forcing every request through the same operational profile.

Results and impact

Well-tuned gateways feel less like a policy tax and more like part of the platform path. Latency-sensitive requests stay responsive, large traffic spikes are less likely to overwhelm a single instance, and teams can explain exactly which control or runtime change improved the result.

The more strategic impact is confidence. Once the team has a repeatable measurement-and-tuning loop, performance work stops being a debate over guesses and becomes an engineering exercise with defensible tradeoffs.

Key takeaways

  • Measure first with /metrics and recent event data before changing runtime settings.
  • Keep latency-sensitive policy chains short and intentional.
  • Tune --max-concurrency or KEEPTRUSTS_GATEWAY_MAX_CONCURRENCY based on observed behavior, not habit.
  • Use multiple replicas and autoscaling when throughput growth is sustained.
  • Change one tuning variable at a time so the result teaches you something useful.

Next steps