Response Caching: Reducing Costs by Serving Repeated Answers Locally
Keeptrusts reduces repeated-answer cost by returning a cached response from the gateway instead of paying the provider for the same work again. The documented model supports exact-match caching for identical prompts, semantic caching for near-duplicates, TTL-based freshness control, and event-level visibility so you can see which requests were cache hits versus provider-backed misses.
Use this page when
- You have repetitive prompts and want a direct way to reduce provider spend.
- You need to choose between exact-match and semantic caching.
- You want practical guidance on when caching should be disabled for stricter retention or freshness requirements.
Primary audience
- Primary: Technical Engineers
- Secondary: Technical Leaders, FinOps owners
The problem
The easiest AI money to waste is the money spent answering the same question repeatedly. Health checks, repeated support prompts, template-driven code generation, FAQ lookups, and paraphrased internal questions often trigger a full model call even when the answer has already been computed recently.
Without a gateway cache, the provider sees each request as new work. That means the same answer incurs the same token and latency cost over and over again.
There is a second problem hidden behind the first: teams often know they have repeated traffic, but they do not have a trustworthy way to measure how much duplication exists. If you cannot observe hits and misses, the conversation about caching quickly turns into guesswork.
There is also a governance edge case. Some workloads should never reuse cached responses. If you are operating in a zero-retention or strict no-store posture, response reuse may conflict with the policy goal. So the question is not only “can we cache?” but also “when should we not?”
The solution
Keeptrusts solves the cost side and the observability side together.
At runtime, the gateway checks the cache before forwarding the request upstream. If a valid cached response exists, it returns that answer immediately. If not, it forwards the request, evaluates the normal policy chain, returns the response, and stores the result for later reuse.
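The check-then-forward flow described above can be sketched in a few lines of Python. This is an illustrative model only, not the Keeptrusts implementation: `ResponseCache`, `handle_request`, and the injected `clock` are hypothetical names introduced here, and the sketch assumes a simple in-memory store with a single TTL.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ResponseCache:
    """Tiny in-memory cache with per-entry expiry (hypothetical, for illustration)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # key -> (expiry timestamp, cached response)
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.clock() >= expires_at:
            # The entry outlived its TTL: drop it so the caller goes upstream.
            del self._entries[key]
            return None
        return response

    def put(self, key: str, response: str) -> None:
        self._entries[key] = (self.clock() + self.ttl_seconds, response)


def handle_request(cache: ResponseCache, key: str,
                   call_provider: Callable[[], str]) -> Tuple[str, str]:
    """Return (response, status), where status is 'hit' or 'miss'."""
    cached = cache.get(key)
    if cached is not None:
        return cached, "hit"
    # Miss path: in a real gateway the policy chain would also run here.
    response = call_provider()
    cache.put(key, response)
    return response, "miss"
```

The provider is only called on the miss path, which is where the cost saving comes from: a second identical request inside the TTL window never leaves the gateway.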
Two match modes are documented:
- exact caches identical message content and model combinations.
- semantic caches paraphrased requests that are similar enough to clear the configured threshold.
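To make "identical message content and model combinations" concrete, here is a minimal sketch of how an exact-match key could be derived. The source does not say how Keeptrusts builds its keys; the function name `exact_cache_key` and the choice of SHA-256 over canonical JSON are assumptions made for illustration.

```python
import hashlib
import json


def exact_cache_key(model: str, messages: list) -> str:
    """Derive a deterministic key from the model name and message content.

    Hypothetical helper: the JSON is canonicalized (sorted keys, fixed
    separators) so that dict key order does not change the key.
    """
    canonical = json.dumps(
        {"model": model, "messages": messages},
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Under this scheme any change to the text, even a missing question mark, yields a different key. That strictness is why exact mode is easy to reason about and why paraphrased traffic needs semantic mode instead.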
TTL then decides freshness. A short TTL favors freshness over savings. A long TTL favors savings over freshness. Neither is universally correct; it depends on whether the answer changes often.
The platform also surfaces cache behavior in Events and cache statistics. That matters because a cache is only useful if you can measure hit rate, latency benefit, and tokens saved.
Implementation
Start with exact-match caching. It is easier to reason about and has fewer quality trade-offs:
providers:
  targets:
    - id: openai
      provider: openai
      secret_key_ref:
        env: OPENAI_API_KEY
cache:
  enabled: true
  mode: exact
  ttl_seconds: 3600
  max_entries: 10000
Validate and run the gateway:
kt policy lint --file policy-config.yaml
kt gateway run --policy-config policy-config.yaml --listen 0.0.0.0:41002
Then test it with the same request twice and inspect the event stream:
kt events tail --last 2
kt cache stats
The documented event pattern is the important signal. A cache miss shows normal provider tokens and higher latency. A cache hit shows cache=hit, zero provider tokens, and much lower latency.
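Once hits and misses are visible, hit rate and a rough tokens-saved estimate are simple arithmetic. The sketch below assumes events have already been exported as dictionaries; the field names `cache` and `provider_tokens` are hypothetical stand-ins for whatever the event stream actually exposes, and the saving is estimated by assuming each hit would otherwise have cost the average miss.

```python
from typing import Dict, Iterable


def summarize_cache_events(events: Iterable[Dict]) -> Dict[str, float]:
    """Summarize hit rate and estimated tokens saved from exported events."""
    events = list(events)
    hits = [e for e in events if e.get("cache") == "hit"]
    misses = [e for e in events if e.get("cache") != "hit"]
    total = len(events)
    hit_rate = len(hits) / total if total else 0.0
    # Estimate: each hit avoided roughly one average-sized provider call.
    avg_miss_tokens = (
        sum(e["provider_tokens"] for e in misses) / len(misses) if misses else 0.0
    )
    return {
        "requests": total,
        "hit_rate": hit_rate,
        "estimated_tokens_saved": avg_miss_tokens * len(hits),
    }
```

For example, two misses costing 120 and 80 tokens plus two hits give a hit rate of 0.5 and an estimated 200 tokens saved.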
When the workload includes paraphrased queries, switch to semantic mode:
cache:
  enabled: true
  mode: semantic
  ttl_seconds: 3600
  max_entries: 10000
  semantic_threshold: 0.92
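To show what a threshold such as 0.92 means in practice, the sketch below models semantic lookup as a nearest-neighbor search over embedding vectors. The source does not state which similarity measure Keeptrusts uses; cosine similarity, the function names, and the precomputed vectors are all assumptions for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(
    query_vec: Sequence[float],
    entries: List[Tuple[Sequence[float], str]],
    threshold: float = 0.92,
) -> Optional[str]:
    """Return the best-matching cached response, or None if below threshold."""
    best_score = -1.0
    best_response = None
    for vec, response in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None
```

Raising the threshold trades hit rate for precision: fewer paraphrases match, but the ones that do are less likely to return an answer to a subtly different question.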
If model behavior or content freshness varies by workload, add model-specific TTLs as documented:
cache:
  enabled: true
  mode: semantic
  ttl_seconds: 1800
  max_entries: 10000
  semantic_threshold: 0.92
  model_overrides:
    - model: gpt-5.4-mini
      ttl_seconds: 7200
    - model: gpt-5.4-mini-mini
      ttl_seconds: 900
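The intended effect of the overrides is that a listed model gets its own TTL and every other model falls back to the top-level value. A minimal sketch of that resolution, assuming exact model-name matching and first-match-wins (neither is confirmed by the source, and `resolve_ttl` is a hypothetical helper):

```python
def resolve_ttl(cache_config: dict, model: str) -> int:
    """Pick the TTL for a model: a matching override, else the default."""
    for override in cache_config.get("model_overrides", []):
        if override.get("model") == model:
            return override["ttl_seconds"]
    return cache_config["ttl_seconds"]


# Dictionary mirroring the relevant keys of the configuration above.
cache_config = {
    "ttl_seconds": 1800,
    "model_overrides": [
        {"model": "gpt-5.4-mini", "ttl_seconds": 7200},
        {"model": "gpt-5.4-mini-mini", "ttl_seconds": 900},
    ],
}
```

With this configuration, a model that is not listed in `model_overrides` resolves to the default of 1800 seconds.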
And when the underlying source data changes, invalidate instead of waiting for the TTL:
kt cache clear
kt cache clear --model gpt-5.4-mini-mini
That is the practical control set: enable, choose a match mode, set freshness, monitor hits, and clear stale entries on purpose.
There is one important governance caveat from the public docs. In Unified Access, cache_enabled should be turned off for workloads where response reuse is not allowed. The reference page is explicit that stricter ZDR (zero data retention) modes should be paired with cache-disabled deployments when request or response reuse is not permitted by policy. That is a useful reminder that caching is a cost tool, not a universal default.
If the workload permits caching, combine it with Cost Tracking & Budgets so the savings show up in the spend model as well. A cache hit is not just faster; it is also a different economic path.
Results and impact
The most obvious result is lower spend. Repeated questions stop creating repeated provider charges.
The second result is lower latency. A local cache hit is usually much faster than a full upstream round-trip, which makes repetitive interactive workloads feel more responsive.
The third result is clearer measurement. Because the gateway emits hit and miss information and exposes cache stats, teams can decide whether a workload is actually cache-friendly instead of assuming it is.
There is also a product-design benefit. Once teams can measure repeated-question behavior, they often discover where prompts, templates, or agent loops are unnecessarily duplicative. Caching reduces the bill, but it also reveals waste patterns in the workload itself.
Key takeaways
- Use exact caching first; it is simpler and safer than starting with semantic reuse.
- Semantic caching is useful, but it requires deliberate thresholds and freshness review.
- Cache hits reduce both latency and token cost.
- kt events tail, kt cache stats, and kt cache clear are the practical operating commands.
- Disable caching for workloads whose retention posture does not permit response reuse.