Response Caching: Stop Paying Twice for the Same Answer
If your organization answers the same question repeatedly, the expensive part of the stack is not the second answer. It is the fact that you are still paying the provider for it. Keeptrusts response caching cuts that waste at the gateway by serving repeated or meaningfully equivalent prompts from cache, reducing latency and preserving wallet capacity for requests that actually need fresh inference.
Use this page when
- You run FAQ, support, internal enablement, or other workloads with repeated prompts and want a concrete caching strategy.
- You need to explain why cache savings are financially real, not just a performance improvement.
- You want to decide when to use exact caching and when semantic caching is worth the additional complexity.
Primary audience
- Primary: Technical Leaders
- Secondary: Technical Engineers, support platform owners
The problem
Repeated LLM traffic hides in plain sight. One team asks the same shipping or refund question thousands of times. Another asks for the same policy summary in slightly different words. A third app sends a deterministic prompt template every time a record changes state. All of those patterns are cacheable, but many organizations still pay full provider price every time because caching is either absent or trapped inside a single application.
That creates a direct cost problem and a budgeting problem.
The direct cost problem is obvious: each repeated request produces another upstream call. On a high-volume workload, even a modest hit rate avoids significant cost, because the provider call is what carries the bill.
The budgeting problem is less obvious but just as important. Without gateway-level caching, repeated prompts still consume wallet balance and still compete with higher-value work for budget. According to the Keeptrusts documentation, cache hits settle at zero cost in billing dashboards and do not debit wallet balance. A cache hit is therefore not only faster. It also preserves spend capacity for the next uncached request.
The final problem is measurement. Teams often enable a local cache and assume it is working, but cannot prove hit rate, avoided cost, or freshness quality. Without dashboards and exports, response caching sounds like a performance idea instead of a budget control.
The solution
Keeptrusts supports both exact and semantic caching at the gateway.
Exact caching is the right choice for deterministic prompts. The gateway hashes the serialized request, and when an identical request appears again it serves the stored response instead of calling the provider. This is ideal for structured workflows, templated prompts, and predictable internal assistants.
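To make the exact-match idea concrete, here is a minimal Python sketch of a hash-keyed cache. It is an illustration under stated assumptions, not the Keeptrusts implementation: the `cache_key` function, the `ExactCache` class, the use of SHA-256 over canonical JSON, and the `call_provider` callback are all hypothetical names and choices made for this example.

```python
import hashlib
import json


def cache_key(request: dict) -> str:
    """Hash a canonical serialization so that key order does not change the key."""
    canonical = json.dumps(request, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ExactCache:
    """Illustrative in-memory exact cache (hypothetical, no TTL or eviction)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, request: dict, call_provider):
        key = cache_key(request)
        if key in self._store:
            # Identical request seen before: no upstream call is made.
            self.hits += 1
            return self._store[key]
        # First occurrence: pay for one provider call and remember the answer.
        self.misses += 1
        response = call_provider(request)
        self._store[key] = response
        return response
```

In this sketch, two requests with the same fields in a different order produce the same key, so only the first one reaches the provider. A single changed character in the prompt produces a different key, which is why exact mode suits templated traffic rather than free-form questions.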
Semantic caching is the right choice when users rephrase the same question in different words. The gateway stores and compares embeddings, then returns a cached response when the similarity threshold is high enough. This is useful for support bots, search-adjacent assistants, and help-desk workflows where the wording changes but the intent does not.
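The threshold comparison can be sketched in a few lines of Python. This is a simplified illustration, not the gateway's actual logic: it assumes cosine similarity as the metric, takes embeddings as plain lists of floats that some embedding model has already produced, and scans entries linearly. The `SemanticCache` class and its method names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Illustrative semantic cache: linear scan over stored embeddings."""

    def __init__(self, threshold: float = 0.93):
        self.threshold = threshold
        self._entries = []  # list of (embedding, response) pairs

    def store(self, embedding, response):
        self._entries.append((list(embedding), response))

    def lookup(self, embedding):
        """Return the best-matching response, or None if below the threshold."""
        best_score, best_response = 0.0, None
        for stored, response in self._entries:
            score = cosine_similarity(embedding, stored)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

The threshold is the main tuning lever in a design like this. Raising it reduces the chance of returning an answer to a question that only looks similar, at the cost of a lower hit rate. Lowering it does the opposite.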
Both modes reduce provider spend, but they do so in a controlled way. Exact caching eliminates duplicate work with near-zero ambiguity. Semantic caching trades a little complexity for a higher hit rate on paraphrased traffic. The business decision is simple: use exact mode where correctness depends on strict identity, and semantic mode where repeated intent matters more than repeated phrasing.
Implementation
For a support workload, semantic caching is often the better fit because the same question is asked in many forms.
cache:
  enabled: true
  mode: semantic
  similarity_threshold: 0.93
  ttl_seconds: 7200
  max_entries: 50000
  embedding_provider: voyage-lite
  namespace: support-bot-prod
providers:
  targets:
    - id: openai-gpt4o
      provider: openai:chat:gpt-5.4-mini
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: voyage-lite
      provider: voyage:embedding:voyage-3-lite
      secret_key_ref:
        env: VOYAGE_API_KEY
For the runtime backend, a shared cache helps multiple gateway nodes reuse the same answers:
export KEEPTRUSTS_LLM_CACHE_ENABLED=true
export KEEPTRUSTS_LLM_CACHE_BACKEND=redis
export KEEPTRUSTS_LLM_CACHE_REDIS_URL="redis://redis.internal:6379/0"
export KEEPTRUSTS_LLM_CACHE_REDIS_KEY_PREFIX="kt:llm-cache:prod"
Those two layers do different jobs. The YAML defines what should be cached and how similar a prompt must be to qualify. The environment variables define where shared cache state lives. Together, they let the gateway answer repeated requests without creating another provider charge.
From an operations standpoint, monitor three things in the dashboard and monthly export: hit rate, avoided provider cost, and stale misses. If hit rate is low, your workload may need semantic mode, a better namespace strategy, or longer TTL. If stale misses are high, the workload may not be stable enough for caching at the current setting. If avoided cost is high and the hit quality is still good, you have one of the cleanest ROI levers in the entire LLM stack.
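If you want to sanity-check the dashboard figures against an export, a small script is enough. The Python sketch below assumes a hypothetical export format in which each record carries a `cache_hit` flag and a `provider_cost_usd` value. Those field names are invented for illustration and are not taken from the Keeptrusts export schema. Avoided cost is estimated as the number of hits multiplied by the average cost of the uncached calls, which is itself an approximation.

```python
def summarize(records):
    """Compute hit rate, paid cost, and estimated avoided cost from request records.

    Each record is a dict with hypothetical keys:
      cache_hit (bool) and provider_cost_usd (float, 0.0 for hits).
    """
    total = len(records)
    hits = sum(1 for r in records if r["cache_hit"])
    misses = total - hits
    paid = sum(r["provider_cost_usd"] for r in records if not r["cache_hit"])
    # Approximate what each hit would have cost using the average uncached call.
    avg_paid = paid / misses if misses else 0.0
    return {
        "hit_rate": hits / total if total else 0.0,
        "paid_cost_usd": paid,
        "estimated_avoided_cost_usd": hits * avg_paid,
    }
```

Tracking these numbers per namespace rather than globally makes it easier to see which workload benefits from a longer TTL or a switch to semantic mode.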
Results and impact
Assume a support bot handles 120,000 prompts per month. If 28 percent of those prompts are semantically equivalent to prior requests, that is about 33,600 upstream calls per month you no longer need to buy. Even if each request is inexpensive on its own, the aggregate savings become meaningful because support traffic is high frequency.
The bigger financial win is that cache hits settle at zero cost and do not debit wallet balance. So the support team is not just spending less. Its budget retains capacity for the uncached questions that genuinely need model execution. That is a different outcome from simple performance caching inside one app, where finance may still struggle to connect faster responses to real budget preservation.
Latency drops as well, which matters for adoption. A support team is more willing to route additional traffic through the governed gateway when repeated questions return quickly. That creates a virtuous cycle: more governed traffic produces better visibility, better visibility improves routing and budgeting decisions, and those decisions further reduce cost.
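The arithmetic behind that estimate is easy to reproduce and adapt to your own volumes. In the Python sketch below, the prompt volume and hit rate come from the example above, while the per-call price of 0.003 USD is a made-up placeholder. Substitute your own blended provider cost.

```python
def projected_avoided_calls(monthly_prompts: int, hit_rate: float) -> int:
    """Number of upstream calls a given hit rate would avoid per month."""
    return round(monthly_prompts * hit_rate)


def projected_savings(monthly_prompts: int, hit_rate: float,
                      cost_per_call_usd: float) -> float:
    """Avoided provider spend per month, given an assumed average cost per call."""
    return projected_avoided_calls(monthly_prompts, hit_rate) * cost_per_call_usd


avoided = projected_avoided_calls(120_000, 0.28)
print(avoided)  # 33600

# 0.003 USD per call is a hypothetical placeholder, not a quoted price.
print(f"{projected_savings(120_000, 0.28, 0.003):.2f}")  # 100.80
```

The same function shows how sensitive the result is to hit rate: doubling the hit rate doubles the avoided calls, while the per-call price only scales the dollar figure.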
Caching should not be oversold as a universal answer. It works best on repetitive, stable workloads. But for those workloads, it is one of the most direct ways to stop paying twice for the same answer. When paired with dashboards and exports, the savings become measurable enough for finance and engineering to treat cache policy as a budget strategy, not just a performance tweak.
Key takeaways
- Exact caching is best for deterministic prompts. Semantic caching is best for repeated intent with varied wording.
- The financial value of caching is not abstract. Cache hits avoid provider cost and preserve wallet balance.
- Shared backends make cache economics stronger because multiple gateway nodes can reuse the same answer pool.
- Dashboards and exports are how you prove that hit rate and avoided cost justify the rollout.