Tutorial: Caching AI Responses to Cut Costs
This tutorial walks you through enabling response caching in the Keeptrusts gateway, choosing between exact-match and semantic-match modes, setting TTL policies, monitoring cache hit rates, and calculating cost savings.
Use this page when
- You are configuring response caching for the first time on a Keeptrusts gateway.
- You want to reduce LLM provider costs by serving repeated or semantically similar prompts from cache.
- You need to choose between exact-match and semantic-match caching modes.
- You want to monitor cache hit rates and calculate cost savings.
Primary audience
- Primary: Platform engineers and DevOps teams configuring gateway cost optimisation
- Secondary: Engineering managers evaluating LLM cost reduction strategies; AI agents automating cache configuration
Prerequisites
- kt CLI installed (first-run tutorial)
- An OpenAI-compatible API key exported as OPENAI_API_KEY
- A running Keeptrusts API instance (for event tracking)
- curl and jq installed
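If the key is not already exported, set it in the shell that will run the gateway (the value shown is a placeholder):
# Placeholder value; replace with your real provider key
export OPENAI_API_KEY="sk-..."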
How Caching Works
The gateway intercepts requests before they reach the LLM provider. If a matching cached response exists and has not expired, it is returned immediately — saving both latency and token costs.
| Mode | Match criteria | Best for |
|---|---|---|
| exact | Identical message content and model | Repeated identical prompts (e.g., system health checks) |
| semantic | Embedding similarity above threshold | Paraphrased or near-duplicate queries |
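Conceptually, an exact-match lookup keys the cache on the model plus the full message list, so any change to either produces a miss. The one-liner below illustrates the idea; it is a sketch, not the gateway's actual key format:
# Illustrative only: a digest over model + messages approximates an exact-match cache key
printf '%s' 'gpt-4o-mini:[{"role":"user","content":"What is the capital of France?"}]' | sha256sum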
Step 1: Create the Cache Configuration
Create policy-config.yaml with caching enabled:
version: '1'
providers:
  targets:
    - id: openai
      provider: openai
      secret_key_ref:
        env: OPENAI_API_KEY
cache:
  enabled: true
  mode: exact
  ttl_seconds: 3600
  max_entries: 10000
policies:
  - name: basic-filter
    type: content_filter
    action: flag
    config:
      categories:
        - hate
      threshold: medium
Step 2: Validate and Start the Gateway
kt policy lint --file policy-config.yaml
Expected output:
✓ Configuration is valid
Providers: 1 (openai)
Cache: exact (TTL 3600s, max 10000 entries)
Policies: 1 (basic-filter)
Start the gateway:
kt gateway run --policy-config policy-config.yaml --port 41002
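If your application already talks to an OpenAI-compatible endpoint, you can usually route it through the gateway by overriding the base URL; most OpenAI-compatible SDKs honour the OPENAI_BASE_URL environment variable, though your client library may differ:
# Send client traffic through the gateway instead of directly to the provider
export OPENAI_BASE_URL="http://localhost:41002/v1"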
Step 3: Verify Exact-Match Caching
Send the same request twice:
# First request — cache MISS (forwarded to provider)
curl -s http://localhost:41002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}' | jq '{model: .model, tokens: .usage.total_tokens}'
# Second request — identical prompt, cache HIT
curl -s http://localhost:41002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}' | jq '{model: .model, tokens: .usage.total_tokens}'
The second request returns instantly with zero provider token cost.
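To see the latency difference directly, time repeated calls with curl's built-in timer:
# Time two identical calls; cached responses should return in a few milliseconds
# (the first iteration may already be a hit if the prompt above is still cached)
for i in 1 2; do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" http://localhost:41002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
done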
Step 4: Confirm Cache Hits in Events
kt events tail --last 2
Expected output:
[2026-04-23T14:10:01Z] REQUEST id=evt_aaa111 cache=miss model=gpt-4o-mini tokens=45 latency=310ms
[2026-04-23T14:10:02Z] REQUEST id=evt_bbb222 cache=hit model=gpt-4o-mini tokens=0 latency=3ms
The cache=hit event shows zero tokens and dramatically lower latency.
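To watch only one kind of decision, filter the JSON event stream:
# Show only cache hits from the most recent 50 events
kt events tail --last 50 --format json | jq 'select(.cache == "hit")'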
Step 5: Switch to Semantic Caching
For applications where users rephrase the same question, switch to semantic mode. Update policy-config.yaml:
cache:
  enabled: true
  mode: semantic
  ttl_seconds: 3600
  max_entries: 10000
  semantic_threshold: 0.92
Reload the configuration:
kt config reload
Test with paraphrased prompts:
# Original query
curl -s http://localhost:41002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}' > /dev/null
# Paraphrased query — semantic cache HIT
curl -s http://localhost:41002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"messages": [
{"role": "user", "content": "Tell me the capital city of France."}
]
}' | jq '{model: .model, tokens: .usage.total_tokens}'
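It is also worth checking that the threshold is not matching unrelated questions; a genuinely different prompt should register as a miss in the event log:
# A different question should fall below the 0.92 similarity threshold and MISS
curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "What is the population of France?"}
    ]
  }' > /dev/null

kt events tail --last 1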
Step 6: Set Model-Specific TTLs
Different models can have different TTL policies:
cache:
  enabled: true
  mode: semantic
  ttl_seconds: 1800
  max_entries: 10000
  semantic_threshold: 0.92
  model_overrides:
    - model: gpt-4o
      ttl_seconds: 7200
    - model: gpt-4o-mini
      ttl_seconds: 900
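As in Step 2, you can lint the updated file before reloading to confirm the overrides parse:
kt policy lint --file policy-config.yaml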
Reload the config:
kt config reload
Step 7: Calculate Cost Savings
Aggregate cache hit rate and estimated savings:
kt events tail --last 100 --format json | jq -s '
{
total: length,
cache_hits: (map(select(.cache == "hit")) | length),
cache_misses: (map(select(.cache == "miss")) | length),
hit_rate_pct: ((map(select(.cache == "hit")) | length) / length * 100 | floor),
tokens_saved: (map(select(.cache == "hit")) | map(.usage.estimated_tokens // 0) | add // 0)
}
'
Example output:
{
"total": 100,
"cache_hits": 38,
"cache_misses": 62,
"hit_rate_pct": 38,
"tokens_saved": 4560
}
In this sample, the 38% hit rate saved 4,560 tokens across 100 requests, which at gpt-4o-mini's $0.15 per 1M input tokens amounts to well under a cent. Savings scale with prompt length, request volume, and model price: a workload averaging around 4,500 input tokens per request would save roughly $0.68 per 1,000 cached requests at the same price.
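To turn tokens saved into a dollar figure yourself, multiply by the per-token price; a quick shell calculation using the sample output above:
# Dollar value of the cached tokens at gpt-4o-mini's input price ($0.15 per 1M tokens)
TOKENS_SAVED=4560
PRICE_PER_MILLION=0.15
awk -v t="$TOKENS_SAVED" -v p="$PRICE_PER_MILLION" 'BEGIN { printf "saved: $%.4f\n", (t / 1000000) * p }'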
Step 8: Invalidate Stale Cache
Force-clear the cache when underlying data changes:
kt cache clear
Expected output:
✓ Cache cleared (10000 entries removed)
To clear entries for a specific model only:
kt cache clear --model gpt-4o-mini
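Before clearing, kt cache stats reports the current entry count, hit rate, and memory usage, which can help you decide between a full and a model-scoped clear:
# Inspect the cache before deciding what to clear
kt cache stats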
Summary
- cache.mode: exact matches identical prompts for zero-cost repeats
- cache.mode: semantic catches paraphrased queries above a similarity threshold
- kt events tail shows cache=hit / cache=miss per request
- kt cache clear invalidates stale entries on demand
- Monitor hit rate over time and adjust ttl_seconds and semantic_threshold to balance freshness and savings
For AI systems
- Canonical terms: Keeptrusts gateway, response caching, semantic cache, exact-match cache, TTL, cache hit rate.
- Config fields: cache.enabled, cache.mode (exact | semantic), cache.ttl_seconds, cache.max_entries, cache.semantic_threshold.
- CLI commands: kt gateway run, kt policy lint, kt events tail, kt cache stats, kt cache clear.
- Best next pages: Cost Tracking & Budgets, Context Compression, Event Tailing.
For engineers
- Prerequisites: kt CLI installed, OPENAI_API_KEY exported, gateway running.
- Validate config: kt policy lint --file policy-config.yaml confirms cache mode and TTL are valid.
- Verify cache hits: kt events tail shows cache=hit / cache=miss per request.
- Clear stale cache: kt cache clear or kt cache clear --model gpt-4o-mini.
- Monitor: kt cache stats outputs current entries, hit rate, and memory usage.
For leaders
- Caching can reduce LLM API spend by 30–60% for workloads with repetitive prompts (e.g., customer support, code generation templates).
- Semantic caching introduces a quality trade-off; monitor the semantic_threshold setting to avoid serving stale or irrelevant answers.
- TTL controls how fresh cached answers must be; shorter TTLs trade cost savings for data freshness.
- No additional infrastructure cost — caching runs in-process within the gateway.
Next steps
- Context Compression — combine caching with token reduction for maximum savings
- Cost Tracking & Budgets — monitor how caching reduces wallet spend
- Event Tailing — stream cache hit/miss decisions in real time
- Model Routing & A/B Testing — compare cached vs. live response quality