Gateway Performance: Sub-10ms Policy Evaluation Overhead

Yes, sub-10ms policy evaluation overhead is a realistic Keeptrusts target when the gateway is tuned and the policy chain is sensible. The public product claim says measured per-request overhead is 1-5 ms depending on chain complexity, and the engineering guide sets a < 7 ms P99 target for total gateway overhead. That is the right way to read the title: sub-10ms is an operational budget, not a magical property that holds regardless of configuration.

Use this page when

  • You need to explain or validate the Keeptrusts latency claim in practical engineering terms.
  • You are tuning gateway overhead and want to know which settings actually matter.
  • You want a benchmark-driven workflow instead of debating latency in the abstract.

Primary audience

  • Primary: Technical Engineers
  • Secondary: Technical Leaders, SRE teams

The problem

Every gateway in front of an LLM raises the same question: how much latency does governance add? If the answer is vague, teams assume the overhead is large and resist putting the gateway in the critical path.

That assumption is usually wrong because it conflates three different numbers.

The first number is provider latency, which is often tens to thousands of milliseconds and usually dominates end-to-end time.

The second number is gateway overhead, which is the part you actually control.

The third number is policy-chain cost, which depends on what you configured. A short, well-ordered chain behaves very differently from a long chain full of expensive checks.

If those numbers are not separated, people blame the gateway for time the provider spent generating the answer.
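To make the separation concrete, here is a minimal sketch of the arithmetic. The field names and the sample timings are hypothetical, chosen only for illustration; they are not Keeptrusts output. The sketch assumes you can obtain an end-to-end time from the client and a provider time from the upstream call.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    """Timings for one request in milliseconds (hypothetical field names)."""
    end_to_end_ms: float     # measured by the client, through the gateway
    provider_ms: float       # time spent waiting on the upstream provider
    policy_chain_ms: float   # time spent evaluating policies inside the gateway


def gateway_overhead_ms(t: RequestTiming) -> float:
    """Gateway overhead is the part of the request the provider did not spend."""
    return t.end_to_end_ms - t.provider_ms


# Illustrative numbers only: a request dominated by provider latency.
timing = RequestTiming(end_to_end_ms=843.0, provider_ms=839.0, policy_chain_ms=2.5)

overhead = gateway_overhead_ms(timing)
non_policy = overhead - timing.policy_chain_ms  # routing, connection handling, etc.
share = overhead / timing.end_to_end_ms

print(f"provider:         {timing.provider_ms:.1f} ms")
print(f"gateway overhead: {overhead:.1f} ms ({share:.2%} of end-to-end)")
print(f"  policy chain:   {timing.policy_chain_ms:.1f} ms")
print(f"  everything else:{non_policy:.1f} ms")
```

With numbers like these, the provider accounts for almost all of the wall-clock time, which is the point of keeping the three figures apart.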

The solution

The Keeptrusts performance docs separate those concerns clearly.

The public product copy describes the gateway as a compiled Rust binary with a low-overhead policy evaluation path and states a measured per-request overhead of 1-5 ms, varying with policy chain complexity.

The engineering guide then turns that into a budget:

  • input policy evaluation: < 2 ms P99
  • request routing: < 0.5 ms P99
  • connection acquisition on a warm pool: < 1 ms P99
  • output policy evaluation: < 3 ms P99
  • total gateway overhead: < 7 ms P99

That budget is useful because it tells you what “sub-10ms” should mean in operations. It does not mean every request will be fast if the provider takes two seconds. It means the gateway should not be the reason a fast request became slow.
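As a quick consistency check, the per-stage figures above can be added up and compared with the total target. This is a rough sketch: percentiles of separate stages do not add exactly, so treat the sum as a sanity check rather than a proof. The stage keys and the "measured" values are hypothetical names and numbers introduced for illustration.

```python
# P99 budget per stage in milliseconds, copied from the list above.
BUDGET_MS = {
    "input_policy_evaluation": 2.0,
    "request_routing": 0.5,
    "connection_acquisition_warm_pool": 1.0,
    "output_policy_evaluation": 3.0,
}
TOTAL_TARGET_MS = 7.0

stage_sum = sum(BUDGET_MS.values())
headroom = TOTAL_TARGET_MS - stage_sum
print(f"stage budgets sum to {stage_sum} ms, leaving {headroom} ms of headroom")


def over_budget(measured_p99_ms: dict[str, float]) -> list[str]:
    """Return the stages whose measured P99 is not strictly below its budget."""
    return [
        stage
        for stage, limit in BUDGET_MS.items()
        if measured_p99_ms.get(stage, 0.0) >= limit
    ]


# Hypothetical measurements: a cold or churning pool shows up immediately.
measured = {
    "input_policy_evaluation": 1.4,
    "request_routing": 0.3,
    "connection_acquisition_warm_pool": 2.2,
    "output_policy_evaluation": 2.1,
}
print("over budget:", over_budget(measured))
```

The stage budgets sum to 6.5 ms, so the 7 ms total leaves about half a millisecond for everything the list does not itemize.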

The tuning story follows from that. If you want single-digit overhead, keep the connection path warm, keep the chain deliberate, and measure the gateway separately from the provider.

Implementation

Start with the documented connection and warmup settings that remove avoidable connection churn:

gateway:
  upstream:
    http_version: h2
    keep_alive:
      enabled: true
      interval: 30s
      timeout: 60s

    connection_pool:
      max_idle_per_host: 32
      max_total: 256
      max_lifetime: 300s
      idle_timeout: 90s

    warmup:
      enabled: true
      connections_per_provider: 4
      probe_endpoint: /v1/models
Then benchmark instead of guessing:

kt bench \
  --url http://localhost:41002/v1/chat/completions \
  --requests 100 \
  --concurrency 10 \
  --model gpt-5.4-mini \
  --prompt "Say hello in one word"

kt doctor --checks performance

That gives you two kinds of feedback.

kt bench shows end-to-end request behavior under load.

kt doctor --checks performance helps confirm whether the gateway configuration itself is obviously unhealthy.

For ongoing observation, the health and metrics path matters as much as the benchmark. The monitoring guide documents request histograms, policy-evaluation counters, and provider-health metrics that the gateway exposes. That means you can track gateway latency continuously, not just during a one-time test.
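If the request histograms are exported as cumulative buckets (an assumption here; check the monitoring guide for the actual format and metric names), a percentile can be estimated by linear interpolation inside the bucket that contains the target rank. The bucket bounds and counts below are invented for illustration.

```python
def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound_ms, cumulative_count) pairs sorted by
    upper bound. The estimate interpolates linearly within the bucket that
    contains the target rank, so it is only as precise as the bucket layout.
    """
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, cumulative in buckets:
        if cumulative >= target:
            in_bucket = cumulative - prev_count
            if in_bucket == 0:
                return upper
            fraction = (target - prev_count) / in_bucket
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, cumulative
    return buckets[-1][0]


# Hypothetical gateway-overhead histogram: (upper bound in ms, cumulative count).
overhead_buckets = [(1.0, 200), (2.0, 600), (5.0, 950), (10.0, 1000)]

p50 = histogram_quantile(0.50, overhead_buckets)
p99 = histogram_quantile(0.99, overhead_buckets)
print(f"estimated P50: {p50:.2f} ms, estimated P99: {p99:.2f} ms")
```

Because the estimate cannot see inside a bucket, bucket boundaries near the 7 ms target matter: a layout that jumps from 5 ms to 10 ms cannot distinguish a healthy P99 from one that has drifted over budget.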

The simplest tuning rules from the docs are worth following in order.

First, warm and reuse connections. That removes repeated setup cost.

Second, keep the chain short for interactive flows. The more policies you add, the more you should justify each one.

Third, place policies that can block a request early in the chain, so expensive downstream work is skipped for requests that were never going to pass.
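The effect of ordering can be sketched with a toy chain that stops at the first blocking policy. This is an illustrative model, not the Keeptrusts policy engine: the `Policy` type, the policy names, and the per-policy costs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Policy:
    name: str
    cost_ms: float                  # illustrative evaluation cost
    allows: Callable[[str], bool]   # returns False to block the request


def evaluate(chain: list[Policy], prompt: str) -> tuple[bool, float]:
    """Run policies in order and stop at the first one that blocks.

    Returns (allowed, total_cost_ms) for the policies that actually ran.
    """
    spent = 0.0
    for policy in chain:
        spent += policy.cost_ms
        if not policy.allows(prompt):
            return False, spent
    return True, spent


# A cheap check that can block, and an expensive check that passes here.
cheap_blocklist = Policy("blocklist", 0.25, lambda p: "forbidden" not in p)
expensive_scan = Policy("deep_scan", 1.5, lambda p: True)

blocked_prompt = "this contains a forbidden term"
blocker_first = evaluate([cheap_blocklist, expensive_scan], blocked_prompt)
blocker_last = evaluate([expensive_scan, cheap_blocklist], blocked_prompt)
print(blocker_first)  # (False, 0.25)
print(blocker_last)   # (False, 1.75)
```

Both orderings reach the same decision, but the first pays only for the cheap check. For requests that pass every policy the total cost is the same either way, so the gain depends on how often the early policy actually blocks.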

Fourth, measure P50 and P99 separately. P50 tells you the common case. P99 tells you whether the worst interactive experience is drifting.

Fifth, separate provider latency from gateway latency when you interpret the numbers. A slow model output does not disprove a fast gateway.
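Rules four and five combine naturally: subtract provider time from each sample first, then compute percentiles on what remains. The sketch below uses a nearest-rank percentile and hypothetical (end-to-end, provider) pairs; in practice you would want far more than five samples before trusting a P99.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical (end_to_end_ms, provider_ms) pairs, for illustration only.
pairs = [
    (412.0, 409.5),
    (655.0, 651.0),
    (389.0, 387.0),
    (2040.0, 2031.0),
    (501.0, 498.0),
]

# Gateway overhead per request is end-to-end time minus provider time.
overheads = [end_to_end - provider for end_to_end, provider in pairs]

p50 = percentile(overheads, 50)
p99 = percentile(overheads, 99)
print(f"gateway overhead P50: {p50} ms, P99: {p99} ms")
```

In this toy data the median overhead is 3 ms while the worst sample is 9 ms, which is exactly the kind of gap a P50-only view would hide. Note that the slowest end-to-end request (over two seconds) is slow because of the provider, not the gateway.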

Caching is also part of the performance story. It is mainly a cost feature, but a cache hit is usually far faster than an upstream call. If the workload permits it, a cache layer can improve both economics and perceived responsiveness.

Results and impact

The first result is clarity. Teams stop treating “governance latency” as an unknowable tax and start treating it as a measurable budget.

The second result is faster troubleshooting. If P99 jumps, you can ask whether the issue is connection churn, policy complexity, or provider behavior instead of blaming the entire stack at once.

The third result is better rollout confidence. Single-digit gateway overhead is much easier to defend internally than a generic claim that governance is “fast enough.”

There is also a planning benefit. Once you benchmark the real workload, you can scale based on measured throughput and latency instead of on fear. That usually leads to better capacity planning and fewer surprise performance debates late in a rollout.

Key takeaways

  • The practical Keeptrusts performance claim is 1-5 ms measured overhead with a < 7 ms P99 gateway target, not “latency does not matter.”
  • Provider latency usually dominates total response time; gateway overhead should stay in the single-digit range.
  • Connection reuse, warm pools, and deliberate chain design are the main tools for hitting sub-10ms overhead.
  • Benchmark with kt bench; do not reason from intuition alone.
  • Watch P99 and provider metrics together so you do not misdiagnose the bottleneck.

Next steps