Response Caching: Reducing Costs by Serving Repeated Answers Locally

Keeptrusts reduces repeated-answer cost by returning a cached response from the gateway instead of paying the provider for the same work again. The documented model supports exact-match caching for identical prompts, semantic caching for near-duplicates, TTL-based freshness control, and event-level visibility so you can see which requests were cache hits versus provider-backed misses.

Use this page when

You have repetitive prompts and want a direct way to reduce provider spend.
You need to choose between exact-match and semantic caching.
You want practical guidance on when caching should be disabled for stricter retention or freshness requirements.

Primary audience

Primary: Technical Engineers
Secondary: Technical Leaders, FinOps owners

The problem

The easiest AI money to waste is the money spent answering the same question repeatedly. Health checks, repeated support prompts, template-driven code generation, FAQ lookups, and paraphrased internal questions often trigger a full model call even when the answer has already been computed recently.

Without a gateway cache, the provider sees each request as new work. That means the same answer incurs the same token and latency cost over and over again.

There is a second problem hidden behind the first: teams often know they have repeated traffic, but they do not have a trustworthy way to measure how much duplication exists. If you cannot observe hits and misses, the conversation about caching quickly turns into guesswork.

There is also a governance edge case. Some workloads should never reuse cached responses. If you are operating in a zero-retention or strict no-store posture, response reuse may conflict with the policy goal. So the question is not only “can we cache?” but also “when should we not?”

The solution

Keeptrusts solves the cost side and the observability side together.

At runtime, the gateway checks the cache before forwarding the request upstream. If a valid cached response exists, it returns that answer immediately. If not, it forwards the request, evaluates the normal policy chain, returns the response, and stores the result for later reuse.
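
As a rough illustration of the lookup-then-forward flow described above, the following Python sketch models an exact-match cache with TTL expiry. It is not Keeptrusts code: `cache_key`, `ExactCache`, `handle`, and the `call_provider` callback are hypothetical names, the policy chain and any entry-count limit are left out, and the real gateway's key derivation may differ.

```python
import hashlib
import json
import time
from typing import Callable, Dict, List, Optional, Tuple

Message = Dict[str, str]
Provider = Callable[[str, List[Message]], str]


def cache_key(model: str, messages: List[Message]) -> str:
    """Assumed key scheme: identical model and messages give an identical key."""
    payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ExactCache:
    """In-memory exact-match cache with a single TTL (illustrative only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, response = entry
        if self.clock() - stored_at >= self.ttl_seconds:
            del self._entries[key]  # expired entries count as misses
            return None
        return response

    def put(self, key: str, response: str) -> None:
        self._entries[key] = (self.clock(), response)


def handle(cache: ExactCache, model: str, messages: List[Message],
           call_provider: Provider) -> Tuple[str, str]:
    """Check the cache first; only call the provider on a miss."""
    key = cache_key(model, messages)
    cached = cache.get(key)
    if cached is not None:
        return cached, "hit"
    response = call_provider(model, messages)
    cache.put(key, response)
    return response, "miss"
```

Passing the clock in as a parameter is a design choice for the sketch: it lets expiry be exercised without sleeping.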

Two match modes are documented:

exact mode reuses a response only when the message content and the model are identical.
semantic mode also reuses a response for paraphrased requests whose similarity to a cached request clears the configured threshold.

TTL then decides freshness. A short TTL favors freshness over savings. A long TTL favors savings over freshness. Neither is universally correct; it depends on whether the answer changes often.

The platform also surfaces cache behavior in Events and cache statistics. That matters because a cache is only useful if you can measure hit rate, latency benefit, and tokens saved.

Implementation

Start with exact-match caching. It is easier to reason about and has fewer quality trade-offs:

providers:
  targets:
    - id: openai
      provider: openai
      secret_key_ref:
        env: OPENAI_API_KEY

cache:
  enabled: true
  mode: exact
  ttl_seconds: 3600
  max_entries: 10000

Validate and run the gateway:

kt policy lint --file policy-config.yaml
kt gateway run --policy-config policy-config.yaml --listen 0.0.0.0:41002

Then test it with the same request twice and inspect the event stream:

kt events tail --last 2
kt cache stats

The documented event pattern is the important signal. A cache miss shows normal provider tokens and higher latency. A cache hit shows cache=hit, zero provider tokens, and much lower latency.
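
To turn that hit/miss signal into numbers, a sketch like the one below could summarize exported events. The `Event` fields are assumptions made for illustration and do not reflect the actual Keeptrusts event schema. Tokens saved is an estimate: it assumes each hit would otherwise have cost the average of the observed misses.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Event:
    cache: str            # "hit" or "miss" (hypothetical field)
    provider_tokens: int  # tokens billed upstream; expected to be 0 on a hit
    latency_ms: float


def _mean(values: List[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def summarize(events: Iterable[Event]) -> Dict[str, float]:
    """Compute hit rate, latency averages, and an estimate of tokens saved."""
    items = list(events)
    hits = [e for e in items if e.cache == "hit"]
    misses = [e for e in items if e.cache == "miss"]
    avg_miss_tokens = _mean([float(e.provider_tokens) for e in misses])
    return {
        "requests": float(len(items)),
        "hit_rate": len(hits) / len(items) if items else 0.0,
        "avg_hit_latency_ms": _mean([e.latency_ms for e in hits]),
        "avg_miss_latency_ms": _mean([e.latency_ms for e in misses]),
        # Estimate only: each hit is priced at the average miss.
        "estimated_tokens_saved": avg_miss_tokens * len(hits),
    }
```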

When the workload includes paraphrased queries, switch to semantic mode:

cache:
  enabled: true
  mode: semantic
  ttl_seconds: 3600
  max_entries: 10000
  semantic_threshold: 0.92
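
To make the threshold concrete, here is a minimal Python sketch of threshold-based matching. It assumes requests are compared as embedding vectors using cosine similarity, which is a common approach but is not confirmed by the source as the one Keeptrusts uses. The vectors are toy values and `semantic_lookup` is a hypothetical helper.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(query: Sequence[float],
                    entries: List[Tuple[Sequence[float], str]],
                    threshold: float = 0.92) -> Optional[str]:
    """Return the best-matching cached response if it clears the threshold."""
    best_score = -1.0
    best_response: Optional[str] = None
    for vector, response in entries:
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_score, best_response = score, response
    if best_response is not None and best_score >= threshold:
        return best_response
    return None  # nothing similar enough: treat as a miss
```

In this model, raising the threshold trades hit rate for safety: fewer paraphrases match, but a matched answer is less likely to belong to a different question.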

If model behavior or content freshness varies by workload, add model-specific TTLs as documented:

cache:
  enabled: true
  mode: semantic
  ttl_seconds: 1800
  max_entries: 10000
  semantic_threshold: 0.92
  model_overrides:
    - model: gpt-5.4-mini
      ttl_seconds: 7200
    - model: gpt-5.4-mini-mini
      ttl_seconds: 900
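
One way to read the override block is "use the model's own TTL if one is listed, otherwise fall back to the default." The sketch below encodes that reading in Python. The exact-string match and first-match-wins behavior are assumptions, and `effective_ttl` is a hypothetical helper rather than part of the product.

```python
from typing import Any, Dict


def effective_ttl(cache_config: Dict[str, Any], model: str) -> int:
    """Resolve the TTL for a model: first matching override, else the default."""
    for override in cache_config.get("model_overrides", []):
        if override.get("model") == model:
            return int(override["ttl_seconds"])
    return int(cache_config["ttl_seconds"])


# Mirrors the configuration shown above as a plain dictionary.
config = {
    "ttl_seconds": 1800,
    "model_overrides": [
        {"model": "gpt-5.4-mini", "ttl_seconds": 7200},
        {"model": "gpt-5.4-mini-mini", "ttl_seconds": 900},
    ],
}
```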

And when the underlying source data changes, invalidate instead of waiting for the TTL:

kt cache clear
kt cache clear --model gpt-5.4-mini-mini
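
Conceptually, the two commands differ only in scope: one drops everything, the other drops entries for a single model. The Python sketch below models that distinction with an in-memory store. `ModelScopedCache` is hypothetical and says nothing about how the gateway stores entries internally.

```python
from typing import Dict, Optional, Tuple


class ModelScopedCache:
    """Toy cache keyed by (model, request key) to illustrate scoped clearing."""

    def __init__(self) -> None:
        self._entries: Dict[Tuple[str, str], str] = {}

    def put(self, model: str, key: str, response: str) -> None:
        self._entries[(model, key)] = response

    def get(self, model: str, key: str) -> Optional[str]:
        return self._entries.get((model, key))

    def clear(self, model: Optional[str] = None) -> int:
        """Remove all entries, or only one model's; return how many were removed."""
        if model is None:
            removed = len(self._entries)
            self._entries.clear()
            return removed
        doomed = [k for k in self._entries if k[0] == model]
        for k in doomed:
            del self._entries[k]
        return len(doomed)
```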

That is the practical control set: enable, choose a match mode, set freshness, monitor hits, and clear stale entries on purpose.

There is one important governance caveat from the public docs. In Unified Access, cache_enabled should be turned off for workloads where response reuse is not allowed. The reference page states explicitly that stricter zero-data-retention (ZDR) modes should be paired with cache-disabled deployments when policy does not permit request or response reuse. Caching is a cost tool, not a universal default.

If the workload permits caching, combine it with Cost Tracking & Budgets so the savings show up in the spend model as well. A cache hit is not just faster; it is also a different economic path.
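
For a back-of-the-envelope view of that economic path, the sketch below projects spend from a measured hit rate. All inputs are placeholders you would supply yourself. The model is deliberately simple: it assumes a hit costs nothing upstream and ignores any overhead of running the cache, such as computing embeddings in semantic mode.

```python
from typing import Dict


def projected_spend(requests: int, avg_tokens_per_request: float,
                    price_per_1k_tokens: float, hit_rate: float) -> Dict[str, float]:
    """Estimate provider spend with and without caching at a given hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    baseline = requests * avg_tokens_per_request / 1000.0 * price_per_1k_tokens
    with_cache = baseline * (1.0 - hit_rate)  # only misses reach the provider
    return {
        "baseline": baseline,
        "with_cache": with_cache,
        "savings": baseline - with_cache,
    }
```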

Results and impact

The most obvious result is lower spend. Repeated questions stop creating repeated provider charges.

The second result is lower latency. A local cache hit is usually much faster than a full upstream round-trip, which makes repetitive interactive workloads feel more responsive.

The third result is clearer measurement. Because the gateway emits hit and miss information and exposes cache stats, teams can decide whether a workload is actually cache-friendly instead of assuming it is.

There is also a product-design benefit. Once teams can measure repeated-question behavior, they often discover where prompts, templates, or agent loops are unnecessarily duplicative. Caching reduces the bill, but it also reveals waste patterns in the workload itself.

Key takeaways

Use exact caching first; it is simpler and safer than starting with semantic reuse.
Semantic caching is useful, but it requires deliberate thresholds and freshness review.
Cache hits reduce both latency and token cost.
kt events tail, kt cache stats, and kt cache clear are the practical operating commands.
Disable caching for workloads whose retention posture does not permit response reuse.
