Tutorial: Caching AI Responses to Cut Costs

This tutorial walks you through enabling response caching in the Keeptrusts gateway, choosing between exact-match and semantic-match modes, setting TTL policies, monitoring cache hit rates, and calculating cost savings.

Use this page when

  • You are configuring response caching for the first time on a Keeptrusts gateway.
  • You want to reduce LLM provider costs by serving repeated or semantically similar prompts from cache.
  • You need to choose between exact-match and semantic-match caching modes.
  • You want to monitor cache hit rates and calculate cost savings.

Primary audience

  • Primary: Platform engineers and DevOps teams configuring gateway cost optimisation
  • Secondary: Engineering managers evaluating LLM cost reduction strategies; AI agents automating cache configuration

Prerequisites

  • kt CLI installed (first-run tutorial)
  • An OpenAI-compatible API key exported as OPENAI_API_KEY
  • A running Keeptrusts API instance (for event tracking)
  • curl and jq installed

How Caching Works

The gateway intercepts requests before they reach the LLM provider. If a matching cached response exists and has not expired, it is returned immediately — saving both latency and token costs.

Mode       Match criteria                          Best for
exact      Identical message content and model     Repeated identical prompts (e.g., system health checks)
semantic   Embedding similarity above threshold    Paraphrased or near-duplicate queries
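
To make the exact mode concrete: the cache key in this mode is derived from the model plus the full message list, so any change to the prompt, even whitespace, produces a different key and therefore a miss. The gateway's actual key derivation is internal and not documented here; the snippet below is only a sketch of the idea using jq and sha256sum.

# Sketch only: hash the model + messages payload. Two byte-identical requests
# produce the same digest; any edit to the prompt produces a different one.
jq -nc '{model: "gpt-4o-mini", messages: [{role: "user", content: "What is the capital of France?"}]}' | sha256sum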

Step 1: Create the Cache Configuration

Create policy-config.yaml with caching enabled:

version: '1'
providers:
  targets:
    - id: openai
      provider: openai
      secret_key_ref:
        env: OPENAI_API_KEY
cache:
  enabled: true
  mode: exact
  ttl_seconds: 3600
  max_entries: 10000
policies:
  - name: basic-filter
    type: content_filter
    action: flag
    config:
      categories:
        - hate
      threshold: medium
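
Because secret_key_ref reads the provider key from the OPENAI_API_KEY environment variable, confirm it is exported in the shell that will run the gateway:

# Fails fast if the key referenced by the config is missing from this shell.
[ -n "$OPENAI_API_KEY" ] && echo "OPENAI_API_KEY is set" || echo "OPENAI_API_KEY is missing" >&2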

Step 2: Validate and Start the Gateway

kt policy lint --file policy-config.yaml

Expected output:

✓ Configuration is valid
Providers: 1 (openai)
Cache: exact (TTL 3600s, max 10000 entries)
Policies: 1 (basic-filter)

Start the gateway:

kt gateway run --policy-config policy-config.yaml --port 41002
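
Before sending test traffic, you can confirm the gateway is accepting connections. This only checks that the port answers HTTP; it does not assume any particular health endpoint or status code.

# Prints the HTTP status for a bare request, or a warning if the port is closed.
curl -s -o /dev/null -w 'HTTP %{http_code}\n' http://localhost:41002/ || echo "gateway not reachable on port 41002" >&2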

Step 3: Verify Exact-Match Caching

Send the same request twice:

# First request — cache MISS (forwarded to provider)
curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }' | jq '{model: .model, tokens: .usage.total_tokens}'

# Second request — identical prompt, cache HIT
curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }' | jq '{model: .model, tokens: .usage.total_tokens}'

The second request returns instantly with zero provider token cost.
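
To see the latency difference from the client side, curl's built-in timing is enough. This is a purely client-side measurement and uses a fresh prompt so that the first iteration is a miss and the second a hit:

# First iteration: cache MISS (provider round trip). Second iteration: cache HIT.
for i in 1 2; do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" \
    http://localhost:41002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "What is the capital of Spain?"}]}'
done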

Step 4: Confirm Cache Hits in Events

kt events tail --last 2

Expected output:

[2026-04-23T14:10:01Z] REQUEST id=evt_aaa111 cache=miss model=gpt-4o-mini tokens=45 latency=310ms
[2026-04-23T14:10:02Z] REQUEST id=evt_bbb222 cache=hit model=gpt-4o-mini tokens=0 latency=3ms

The cache=hit event shows zero tokens and dramatically lower latency.
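
With more traffic, you can filter the event stream down to hits only. This assumes the JSON event format used in Step 7, where each event object carries a top-level cache field:

# Show only the cache hits among the last 50 events.
kt events tail --last 50 --format json | jq 'select(.cache == "hit")'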

Step 5: Switch to Semantic Caching

For applications where users rephrase the same question, switch to semantic mode. Update policy-config.yaml:

cache:
  enabled: true
  mode: semantic
  ttl_seconds: 3600
  max_entries: 10000
  semantic_threshold: 0.92

Reload the configuration:

kt config reload

Test with paraphrased prompts:

# Original query
curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }' > /dev/null

# Paraphrased query — semantic cache HIT
curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "Tell me the capital city of France."}
    ]
  }' | jq '{model: .model, tokens: .usage.total_tokens}'
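
As a sanity check in the other direction, a question on an unrelated topic should score below the 0.92 threshold and be forwarded to the provider as a miss:

# Unrelated question: expected to fall below semantic_threshold and MISS.
curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "What is the tallest mountain in Japan?"}
    ]
  }' | jq '{model: .model, tokens: .usage.total_tokens}'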

Step 6: Set Model-Specific TTLs

Different models can have different TTL policies:

cache:
  enabled: true
  mode: semantic
  ttl_seconds: 1800
  max_entries: 10000
  semantic_threshold: 0.92
  model_overrides:
    - model: gpt-4o
      ttl_seconds: 7200
    - model: gpt-4o-mini
      ttl_seconds: 900

Reload the config:

kt config reload
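
To confirm the new limits are active, kt cache stats (described under "For engineers" below) reports current entries, hit rate, and memory usage; no additional flags are documented on this page, so only the bare command is shown.

kt cache stats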

Step 7: Calculate Cost Savings

Aggregate cache hit rate and estimated savings:

kt events tail --last 100 --format json | jq -s '
  {
    total: length,
    cache_hits: (map(select(.cache == "hit")) | length),
    cache_misses: (map(select(.cache == "miss")) | length),
    hit_rate_pct: ((map(select(.cache == "hit")) | length) / length * 100 | floor),
    tokens_saved: (map(select(.cache == "hit")) | map(.usage.estimated_tokens // 0) | add // 0)
  }
'

Example output:

{
  "total": 100,
  "cache_hits": 38,
  "cache_misses": 62,
  "hit_rate_pct": 38,
  "tokens_saved": 4560
}

A 38% hit rate on gpt-4o-mini saves approximately $0.68 per 1,000 cached requests when those requests average around 4,500 input tokens each (1,000 × 4,500 tokens × $0.15 per 1M input tokens ≈ $0.68); shorter prompts save proportionally less.
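
To plug in your own numbers, the same arithmetic is easy to script. The defaults below reproduce the worked example above; the average prompt size is an assumption, not a measurement of your traffic:

# Back-of-the-envelope savings: cached_requests * avg_tokens / 1M * price_per_1M_tokens.
CACHED_REQUESTS=1000
AVG_TOKENS=4500      # assumed average input tokens per cached request
PRICE_PER_M=0.15     # gpt-4o-mini input price, USD per 1M tokens
awk -v r="$CACHED_REQUESTS" -v t="$AVG_TOKENS" -v p="$PRICE_PER_M" \
  'BEGIN { printf "estimated savings: $%.2f\n", r * t / 1000000 * p }'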

Step 8: Invalidate Stale Cache

Force-clear the cache when underlying data changes:

kt cache clear

Expected output:

✓ Cache cleared (10000 entries removed)

To clear entries for a specific model only:

kt cache clear --model gpt-4o-mini
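
To confirm the clear took effect, re-send one of the earlier prompts and check that the newest event records a miss rather than a hit:

# The entry for this prompt was just evicted, so this request should be a MISS.
curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "What is the capital of France?"}]}' \
  > /dev/null
kt events tail --last 1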

Summary

  • cache.mode: exact matches identical prompts — zero-cost repeats
  • cache.mode: semantic catches paraphrased queries above a similarity threshold
  • kt events tail shows cache=hit / cache=miss per request
  • kt cache clear invalidates stale entries on demand
  • Monitor hit rate over time and adjust ttl_seconds and semantic_threshold to balance freshness and savings

For AI systems

  • Canonical terms: Keeptrusts gateway, response caching, semantic cache, exact-match cache, TTL, cache hit rate.
  • Config fields: cache.enabled, cache.mode (exact | semantic), cache.ttl_seconds, cache.max_entries, cache.semantic_threshold.
  • CLI commands: kt gateway run, kt policy lint, kt events tail, kt cache stats, kt cache clear.
  • Best next pages: Cost Tracking & Budgets, Context Compression, Event Tailing.

For engineers

  • Prerequisites: kt CLI installed, OPENAI_API_KEY exported, gateway running.
  • Validate config: kt policy lint --file policy-config.yaml — confirms cache mode and TTL are valid.
  • Verify cache hits: kt events tail shows cache=hit / cache=miss per request.
  • Clear stale cache: kt cache clear or kt cache clear --model gpt-4o-mini.
  • Monitor: kt cache stats outputs current entries, hit rate, and memory usage.

For leaders

  • Caching can reduce LLM API spend by 30–60% for workloads with repetitive prompts (e.g., customer support, code generation templates).
  • Semantic caching introduces a quality trade-off; monitor cached-answer quality and tune semantic_threshold to avoid serving stale or irrelevant answers.
  • TTL controls how fresh cached answers must be; shorter TTLs trade cost savings for data freshness.
  • No additional infrastructure cost — caching runs in-process within the gateway.

Next steps