Tutorial: Caching AI Responses to Cut Costs
This tutorial walks you through enabling response caching in the Keeptrusts gateway, choosing between exact-match and semantic-match modes, setting TTL policies, monitoring cache hit rates, and calculating cost savings.
Use this page when
- You are configuring response caching for the first time on a Keeptrusts gateway.
- You want to reduce LLM provider costs by serving repeated or semantically similar prompts from cache.
- You need to choose between exact-match and semantic-match caching modes.
- You want to monitor cache hit rates and calculate cost savings.
Primary audience
- Primary: Platform engineers and DevOps teams configuring gateway cost optimisation
- Secondary: Engineering managers evaluating LLM cost reduction strategies; AI agents automating cache configuration
Prerequisites
- kt CLI installed (first-run tutorial)
- An OpenAI-compatible API key exported as OPENAI_API_KEY
- A running Keeptrusts API instance (for event tracking)
- curl and jq installed
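If the key is not already exported, set it in the shell that will run the gateway (the value shown is a placeholder):
# Placeholder value; replace with your real provider key
export OPENAI_API_KEY="sk-..."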
How Caching Works
The gateway intercepts requests before they reach the LLM provider. If a matching cached response exists and has not expired, it is returned immediately — saving both latency and token costs.
| Mode | Match criteria | Best for |
|---|---|---|
| exact | Identical message content and model | Repeated identical prompts (e.g., system health checks) |
| semantic | Embedding similarity above threshold | Paraphrased or near-duplicate queries |
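Conceptually, an exact-match lookup keys the cache on the model plus the full message list, so any change to either produces a miss. The one-liner below illustrates the idea; it is a sketch, not the gateway's actual key format:
# Illustrative only: a digest over model + messages approximates an exact-match cache key
printf '%s' 'gpt-4o-mini:[{"role":"user","content":"What is the capital of France?"}]' | sha256sum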
Step 1: Create the Cache Configuration
Create policy-config.yaml with caching enabled:
version: '1'
providers:
  targets:
    - id: openai
      provider: openai
      secret_key_ref:
        env: OPENAI_API_KEY
cache:
  enabled: true
  mode: exact
  ttl_seconds: 3600
  max_entries: 10000
policies:
  - name: basic-filter
    type: content_filter
    action: flag
    config:
      categories:
        - hate
      threshold: medium
Step 2: Validate and Start the Gateway
kt policy lint --file policy-config.yaml
Expected output:
✓ Configuration is valid
Providers: 1 (openai)
Cache: exact (TTL 3600s, max 10000 entries)
Policies: 1 (basic-filter)
Start the gateway:
kt gateway run --policy-config policy-config.yaml --port 41002
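If your application already talks to an OpenAI-compatible endpoint, you can usually route it through the gateway by overriding the base URL; most OpenAI-compatible SDKs honour the OPENAI_BASE_URL environment variable, though your client library may differ:
# Send client traffic through the gateway instead of directly to the provider
export OPENAI_BASE_URL="http://localhost:41002/v1"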
Step 3: Verify Exact-Match Caching
Send the same request twice:
# First request — cache MISS (forwarded to provider)
curl -s http://localhost:41002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}' | jq '{model: .model, tokens: .usage.total_tokens}'
# Second request — identical prompt, cache HIT
curl -s http://localhost:41002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}' | jq '{model: .model, tokens: .usage.total_tokens}'
The second request returns instantly with zero provider token cost.
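To see the latency difference directly, time repeated calls with curl's built-in timer:
# Time two identical calls; cached responses should return in a few milliseconds
# (the first iteration may already be a hit if the prompt above is still cached)
for i in 1 2; do
  curl -s -o /dev/null -w "request $i: %{time_total}s\n" http://localhost:41002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "What is the capital of France?"}]}'
done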
Step 4: Confirm Cache Hits in Events
kt events tail --last 2
Expected output:
[2026-04-23T14:10:01Z] REQUEST id=evt_aaa111 cache=miss model=gpt-4o-mini tokens=45 latency=310ms
[2026-04-23T14:10:02Z] REQUEST id=evt_bbb222 cache=hit model=gpt-4o-mini tokens=0 latency=3ms
The cache=hit event shows zero tokens and dramatically lower latency.
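To watch only one kind of decision, filter the JSON event stream:
# Show only cache hits from the most recent 50 events
kt events tail --last 50 --format json | jq 'select(.cache == "hit")'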
Step 5: Switch to Semantic Caching
For applications where users rephrase the same question, switch to semantic mode. Update policy-config.yaml:
cache:
  enabled: true
  mode: semantic
  ttl_seconds: 3600
  max_entries: 10000
  semantic_threshold: 0.92
Reload the configuration:
kt config reload
Test with paraphrased prompts:
# Original query
curl -s http://localhost:41002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"messages": [
{"role": "user", "content": "What is the capital of France?"}
]
}' > /dev/null
# Paraphrased query — semantic cache HIT
curl -s http://localhost:41002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"messages": [
{"role": "user", "content": "Tell me the capital city of France."}
]
}' | jq '{model: .model, tokens: .usage.total_tokens}'
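It is also worth checking that the threshold is not matching unrelated questions; a genuinely different prompt should register as a miss in the event log:
# A different question should fall below the 0.92 similarity threshold and MISS
curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "user", "content": "What is the population of France?"}
    ]
  }' > /dev/null

kt events tail --last 1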
Step 6: Set Model-Specific TTLs
Different models can have different TTL policies:
cache:
  enabled: true
  mode: semantic
  ttl_seconds: 1800
  max_entries: 10000
  semantic_threshold: 0.92
  model_overrides:
    - model: gpt-4o
      ttl_seconds: 7200
    - model: gpt-4o-mini
      ttl_seconds: 900
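As in Step 2, you can lint the updated file before reloading to confirm the overrides parse:
kt policy lint --file policy-config.yaml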
Reload the config:
kt config reload
Step 7: Calculate Cost Savings
Aggregate cache hit rate and estimated savings:
kt events tail --last 100 --format json | jq -s '
{
total: length,
cache_hits: (map(select(.cache == "hit")) | length),
cache_misses: (map(select(.cache == "miss")) | length),
hit_rate_pct: ((map(select(.cache == "hit")) | length) / length * 100 | floor),
tokens_saved: (map(select(.cache == "hit")) | map(.usage.estimated_tokens // 0) | add // 0)
}
'
Example output:
{
"total": 100,
"cache_hits": 38,
"cache_misses": 62,
"hit_rate_pct": 38,
"tokens_saved": 4560
}
In this sample, the 38% hit rate saved 4,560 tokens across 100 requests, which at gpt-4o-mini's $0.15 per 1M input tokens amounts to well under a cent. Savings scale with prompt length, request volume, and model price: a workload averaging around 4,500 input tokens per request would save roughly $0.68 per 1,000 cached requests at the same price.
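To turn tokens saved into a dollar figure yourself, multiply by the per-token price; a quick shell calculation using the sample output above:
# Dollar value of the cached tokens at gpt-4o-mini's input price ($0.15 per 1M tokens)
TOKENS_SAVED=4560
PRICE_PER_MILLION=0.15
awk -v t="$TOKENS_SAVED" -v p="$PRICE_PER_MILLION" 'BEGIN { printf "saved: $%.4f\n", (t / 1000000) * p }'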
Step 8: Invalidate Stale Cache
Force-clear the cache when underlying data changes:
kt cache clear
Expected output:
✓ Cache cleared (10000 entries removed)
To clear entries for a specific model only:
kt cache clear --model gpt-4o-mini
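Before clearing, kt cache stats reports the current entry count, hit rate, and memory usage, which can help you decide between a full and a model-scoped clear:
# Inspect the cache before deciding what to clear
kt cache stats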
Summary
- cache.mode: exact matches identical prompts for zero-cost repeats
- cache.mode: semantic catches paraphrased queries above a similarity threshold
- kt events tail shows cache=hit / cache=miss per request
- kt cache clear invalidates stale entries on demand
- Monitor hit rate over time and adjust ttl_seconds and semantic_threshold to balance freshness and savings
For AI systems
- Canonical terms: Keeptrusts gateway, response caching, semantic cache, exact-match cache, TTL, cache hit rate.
- Config fields: cache.enabled, cache.mode (exact | semantic), cache.ttl_seconds, cache.max_entries, cache.semantic_threshold.
- CLI commands: kt gateway run, kt policy lint, kt events tail, kt cache stats, kt cache clear.
- Best next pages: Cost Tracking & Budgets, Context Compression, Event Tailing.
For engineers
- Prerequisites: kt CLI installed, OPENAI_API_KEY exported, gateway running.
- Validate config: kt policy lint --file policy-config.yaml confirms cache mode and TTL are valid.
- Verify cache hits: kt events tail shows cache=hit / cache=miss per request.
- Clear stale cache: kt cache clear or kt cache clear --model gpt-4o-mini.
- Monitor: kt cache stats outputs current entries, hit rate, and memory usage.
For leaders
- Caching can reduce LLM API spend by 30–60% for workloads with repetitive prompts (e.g., customer support, code generation templates).
- Semantic caching introduces a quality trade-off; monitor the semantic_threshold setting to avoid serving stale or irrelevant answers.
- TTL controls how fresh cached answers must be; shorter TTLs trade cost savings for data freshness.
- No additional infrastructure cost — caching runs in-process within the gateway.
Next steps
- Context Compression — combine caching with token reduction for maximum savings
- Cost Tracking & Budgets — monitor how caching reduces wallet spend
- Event Tailing — stream cache hit/miss decisions in real time
- Model Routing & A/B Testing — compare cached vs. live response quality