Semantic Caching
The Keeptrusts semantic cache intercepts outbound LLM requests and checks whether an equivalent query has been answered recently. On a cache hit, the gateway returns the stored response immediately without forwarding to the upstream provider — reducing both cost and end-to-end latency. On a cache miss, the request is forwarded normally and the response is stored for future hits.
Use this page when
- You need the exact command, config, API, or integration details for Semantic Caching.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
Keeptrusts supports two cache modes:
- Exact caching — cache key is a deterministic hash of the prompt. Works best for templated queries, repeated system prompts, or any workload where inputs are structurally identical.
- Semantic caching — cache key is computed from the prompt's embedding vector. Queries that are phrased differently but mean the same thing hit the same cache entry. Works best for FAQ bots, search assistants, and customer support applications.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Storage Backends
Cache behavior is configured in YAML, but cache storage is selected with environment variables so operators can choose per-node or shared storage without changing policy files.
Supported backends today:
- KEEPTRUSTS_LLM_CACHE_BACKEND=memory keeps exact responses and semantic embeddings in-process on a single gateway node.
- KEEPTRUSTS_LLM_CACHE_BACKEND=filesystem persists exact responses on local disk and keeps semantic embeddings in-process on a single node.
- KEEPTRUSTS_LLM_CACHE_BACKEND=redis enables shared exact-cache storage and shared semantic embedding persistence across gateway nodes.
- KEEPTRUSTS_LLM_CACHE_BACKEND=s3 enables shared exact-cache storage in an S3-compatible object store.
- KEEPTRUSTS_LLM_CACHE_BACKEND=gcs enables shared exact-cache storage in Google Cloud Storage using S3 interoperability credentials.
- KEEPTRUSTS_LLM_CACHE_BACKEND=qdrant enables shared semantic vector persistence and lookup in Qdrant.
Redis backend requirements:
- Build the CLI with --features distributed.
- Set KEEPTRUSTS_LLM_CACHE_REDIS_URL or reuse REDIS_URL.
- Optionally set KEEPTRUSTS_LLM_CACHE_REDIS_KEY_PREFIX to isolate environments or tenants sharing one Redis cluster.
S3 and GCS backend requirements:
- Build the CLI with --features distributed.
- Set the bucket, credentials, and endpoint environment variables provided by your deployment owner or runtime environment.
- Treat these backends as exact-cache storage only. They do not persist semantic embeddings or serve semantic hits.
Qdrant backend requirements:
- Build the CLI with --features distributed.
- Set KEEPTRUSTS_LLM_CACHE_QDRANT_URL.
- Optionally set KEEPTRUSTS_LLM_CACHE_QDRANT_COLLECTION and KEEPTRUSTS_LLM_CACHE_QDRANT_API_KEY.
Example operator environment:
export KEEPTRUSTS_LLM_CACHE_ENABLED=true
export KEEPTRUSTS_LLM_CACHE_BACKEND=redis
export KEEPTRUSTS_LLM_CACHE_REDIS_URL="redis://redis.internal:6379/0"
export KEEPTRUSTS_LLM_CACHE_REDIS_KEY_PREFIX="kt:llm-cache:prod"
Example S3-compatible exact-cache environment:
export KEEPTRUSTS_LLM_CACHE_ENABLED=true
export KEEPTRUSTS_LLM_CACHE_BACKEND=s3
export KEEPTRUSTS_LLM_CACHE_S3_BUCKET="keeptrusts-cache"
export KEEPTRUSTS_LLM_CACHE_S3_REGION="us-east-1"
export KEEPTRUSTS_LLM_CACHE_S3_ENDPOINT="https://minio.internal:9000"
export KEEPTRUSTS_LLM_CACHE_S3_ACCESS_KEY_ID="minio-access-key"
export KEEPTRUSTS_LLM_CACHE_S3_SECRET_ACCESS_KEY="minio-secret-key"
export KEEPTRUSTS_LLM_CACHE_S3_FORCE_PATH_STYLE=true
Example Qdrant semantic-cache environment:
export KEEPTRUSTS_LLM_CACHE_ENABLED=true
export KEEPTRUSTS_LLM_CACHE_BACKEND=qdrant
export KEEPTRUSTS_LLM_CACHE_QDRANT_URL="http://qdrant.internal:6333"
export KEEPTRUSTS_LLM_CACHE_QDRANT_COLLECTION="keeptrusts_llm_cache"
Cache Modes
exact mode
In exact mode, the cache key is computed as a SHA-256 hash of the serialised request messages (system + user content, in order). Two requests match only if every token is identical.
cache:
  enabled: true
  mode: exact
  ttl_seconds: 3600    # entries expire after 1 hour
  max_entries: 10000   # evict oldest entries when this limit is reached
Use exact caching when:
- Your application generates structured, templated prompts.
- Requests contain deterministic system prompts followed by a small set of possible user inputs.
- You want zero false-positive cache hits (only truly identical prompts share a cached response).
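For intuition, here is a minimal Python sketch of the hashing scheme exact mode describes: a SHA-256 digest over the serialised message list. The canonicalisation details (key ordering, separators) are illustrative assumptions, not the gateway's actual internal format:

import hashlib
import json

def exact_cache_key(messages: list[dict]) -> str:
    # Canonical serialisation so the same messages always produce the same key.
    payload = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Two structurally identical requests share one key; any token change is a miss.
key = exact_cache_key([
    {"role": "system", "content": "You are a support bot."},
    {"role": "user", "content": "How do I reset my password?"},
])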
semantic mode
In semantic mode, the cache key is computed from an embedding vector of the user's last message (or the full conversation turn, configurable). A new request is a cache hit if its embedding's cosine similarity to a stored embedding exceeds similarity_threshold.
cache:
  enabled: true
  mode: semantic
  similarity_threshold: 0.92   # cosine similarity required for a cache hit (0.0–1.0)
  ttl_seconds: 7200
  max_entries: 50000
  embedding_provider: voyage-lite   # provider target used to compute cache embeddings
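The hit decision itself is a plain cosine-similarity comparison. A small Python sketch of that check, for reference (the vectors would come from the configured embedding_provider):

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def is_semantic_hit(query_vec: list[float], stored_vec: list[float],
                    similarity_threshold: float = 0.92) -> bool:
    # A stored entry is a hit when its similarity meets or exceeds the threshold.
    return cosine_similarity(query_vec, stored_vec) >= similarity_threshold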
Use semantic caching when:
- Users rephrase the same underlying question in different ways.
- Your workload is FAQ-heavy or search-adjacent.
- Slightly imprecise cache hits are acceptable (governed by your similarity_threshold).
Setting similarity_threshold too low (e.g., 0.7) can produce incorrect cache hits where a stored response is returned for a question the original response didn't actually answer. Start at 0.92 and tune downward only after reviewing cache-hit quality.
SemanticCacheConfig Field Reference
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable the cache. No caching occurs when false. |
| mode | string | "exact" | Cache strategy: exact or semantic. |
| similarity_threshold | float | 0.92 | Cosine similarity required for a semantic cache hit. Ignored in exact mode. |
| ttl_seconds | integer | 3600 | Seconds after which a cache entry expires. Set 0 for no expiry. |
| max_entries | integer | 10000 | Maximum number of entries in the cache. When exceeded, the least-recently-used entry is evicted. |
| embedding_provider | string | — | Provider target ID used to compute the query embedding for semantic mode. Must resolve to an embedding-type target. |
| namespace | string | "default" | Logical cache namespace. Use per-tenant or per-consumer-group namespaces to prevent cross-tenant cache sharing. |
| cache_on_stream | bool | false | Whether to cache streamed (stream: true) responses. When true, the gateway buffers the full stream before storing it. |
| exclude_system_from_key | bool | false | When true, the system message is excluded from the cache key computation. Useful when system messages vary per request but the user intent is the same. |
Configuring Semantic Cache
Basic exact cache
The simplest configuration — hash-based matching, 1-hour TTL, in-memory storage:
cache:
  enabled: true
  mode: exact
  ttl_seconds: 3600
  max_entries: 20000
  namespace: "my-app-v1"
Semantic cache with Voyage embeddings
pack:
  name: caching-providers-4
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai-gpt4o
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: voyage-lite
      provider: voyage:embedding:voyage-3-lite
      secret_key_ref:
        env: VOYAGE_API_KEY
cache:
  enabled: true
  mode: semantic
  similarity_threshold: 0.93   # semantic cache wired to the voyage-lite embedding target
  ttl_seconds: 7200
  max_entries: 50000
  embedding_provider: voyage-lite
  namespace: "support-bot-prod"
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
When a request arrives:
- Keeptrusts calls voyage-lite to embed the user query.
- The resulting vector is compared against all stored vectors in the support-bot-prod namespace.
- If any stored vector has cosine similarity ≥ 0.93, the stored response is returned immediately.
- If no match is found, the request is forwarded to openai-gpt4o and the response + embedding are stored.
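In pseudocode terms, the flow looks roughly like the following Python sketch. The names here (cache.nearest, embed, upstream) are hypothetical stand-ins for the gateway's internals, shown only to make the hit/miss branch explicit:

def handle_request(messages, cache, embed, upstream, threshold=0.93):
    query_vec = embed(messages[-1]["content"])   # embed the latest user message
    match = cache.nearest(query_vec)             # closest stored vector in the namespace
    if match is not None and match.similarity >= threshold:
        return match.response                    # hit: answer without an upstream call
    response = upstream(messages)                # miss: forward to the chat target
    cache.store(query_vec, response)             # persist response + embedding for next time
    return response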
Cache with OpenAI embeddings
pack:
  name: caching-providers-5
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai-embed-small
      provider: openai:embedding:text-embedding-3-small
      secret_key_ref:
        env: OPENAI_API_KEY
cache:
  enabled: true
  mode: semantic
  similarity_threshold: 0.92   # semantic cache using the OpenAI embedding target
  ttl_seconds: 7200
  max_entries: 50000
  embedding_provider: openai-embed-small
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Exact cache for deterministic batch jobs
For batch-processing workloads where prompts are machine-generated and fully deterministic, exact caching gives maximum hit rates with no embedding overhead:
cache:
  enabled: true
  mode: exact
  ttl_seconds: 86400             # 24-hour cache; batch prompts rarely change intra-day
  max_entries: 500000
  namespace: "batch-classifier"
  exclude_system_from_key: false # include system message — it's part of the prompt identity
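Exact mode pays off here because machine-generated prompts are byte-identical across runs. A hypothetical Python template illustrating the pattern (the classifier wording is invented for illustration):

# Identical ticket bodies produce byte-identical prompts, which hash to the
# same exact-cache key, so repeats within the 24-hour TTL never hit the upstream.
TEMPLATE = "Classify this ticket as one of: billing, bug, feature.\n\n{body}"

def build_prompt(ticket_body: str) -> list[dict]:
    return [
        {"role": "system", "content": "You are a ticket classifier."},
        {"role": "user", "content": TEMPLATE.format(body=ticket_body.strip())},
    ]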
Cache Hit Indicator
Keeptrusts injects a response header to indicate whether the response was served from cache:
| Header | Values | Description |
|---|---|---|
| X-KT-Cache-Hit | true / false | Whether this response was served from the cache. |
| X-KT-Cache-Mode | exact / semantic | The active cache mode. |
| X-KT-Cache-Similarity | float string | (Semantic mode only) Cosine similarity score of the matched entry. |
| X-KT-Cache-Namespace | string | The cache namespace this hit came from. |
| X-KT-Cache-Age | integer | Seconds since the cached entry was stored. |
Use these headers in your application to track cache hit rates and to selectively bypass the cache when freshness is required.
curl -s -D - https://gateway.example.com/v1/chat/completions \
  -H "Authorization: Bearer $KT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"What is Keeptrusts?"}]}' \
  | grep -i x-kt-cache
# X-KT-Cache-Hit: true
# X-KT-Cache-Mode: semantic
# X-KT-Cache-Similarity: 0.9641
# X-KT-Cache-Age: 1834
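The same check works from application code. A short sketch using the third-party Python requests package, reusing the gateway URL and key from the curl example above:

import os
import requests

resp = requests.post(
    "https://gateway.example.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['KT_API_KEY']}"},
    json={"model": "gpt-4o",
          "messages": [{"role": "user", "content": "What is Keeptrusts?"}]},
)
# Inspect the cache headers to track hit rates in your own metrics.
if resp.headers.get("X-KT-Cache-Hit") == "true":
    print("served from cache, age:", resp.headers.get("X-KT-Cache-Age"), "s")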
Cache Warming
Pre-populate the cache with answers to known high-frequency questions using the admin cache-warm endpoint. This is especially useful after a gateway restart (which clears the in-memory cache) or when deploying to a new namespace.
# Warm the cache with a list of seed Q&A pairs
curl -X POST https://api.keeptrusts.com/v1/cache/warm \
  -H "Authorization: Bearer $KT_ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespace": "support-bot-prod",
    "entries": [
      {
        "messages": [{"role": "user", "content": "How do I reset my password?"}],
        "response": "To reset your password, click Forgot Password on the login page..."
      },
      {
        "messages": [{"role": "user", "content": "What are your business hours?"}],
        "response": "Our support team is available Monday–Friday, 9am–6pm UTC."
      }
    ]
  }'
Cache-warm entries are treated identically to organically cached entries: they have the same TTL and are subject to the same LRU eviction policy.
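For larger seed sets, the same endpoint can be driven from a script. A sketch in Python (using the third-party requests package) that warms the cache from a local file; the file name and its shape are assumptions, while the endpoint and payload match the curl example above:

import json
import os
import requests

# seed_qa.json is a hypothetical local file: a list of
# {"messages": [...], "response": "..."} objects.
with open("seed_qa.json") as f:
    entries = json.load(f)

resp = requests.post(
    "https://api.keeptrusts.com/v1/cache/warm",
    headers={"Authorization": f"Bearer {os.environ['KT_ADMIN_KEY']}"},
    json={"namespace": "support-bot-prod", "entries": entries},
)
resp.raise_for_status()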
Cache Invalidation
TTL expiry
Every cache entry has a ttl_seconds lifetime. When the TTL elapses, the entry is evicted and the next matching request causes a cache miss, forwarding to the upstream and refreshing the stored entry.
Manual flush
Flush all entries in a namespace using the admin cache API:
# Flush all entries in a specific namespace
curl -X DELETE https://api.keeptrusts.com/v1/cache/namespaces/support-bot-prod \
  -H "Authorization: Bearer $KT_ADMIN_KEY"

# Flush entries older than a specific age
curl -X DELETE "https://api.keeptrusts.com/v1/cache/namespaces/support-bot-prod?older_than_seconds=3600" \
  -H "Authorization: Bearer $KT_ADMIN_KEY"
Per-request bypass
Clients can bypass the cache for a specific request by including the X-KT-Cache-Bypass: true header:
curl -X POST https://gateway.example.com/v1/chat/completions \
  -H "Authorization: Bearer $KT_API_KEY" \
  -H "X-KT-Cache-Bypass: true" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[...]}'
Bypassed requests are always forwarded to the upstream. Their responses are not stored in the cache. Use this for requests that require the freshest possible answer (e.g., real-time data queries, personalised responses).
Cache Namespacing
Cache entries are scoped to a namespace. Use separate namespaces to prevent different applications, tenants, or consumer groups from sharing cached responses.
Per-consumer-group namespacing
Combine cache namespacing with consumer groups to give each team an isolated cache:
consumer_groups:
  - name: finance-team
    api_keys:
      - env:FINANCE_KEY
    chain:
      - content-safety
    upstream: openai-gpt4o
  - name: legal-team
    api_keys:
      - env:LEGAL_KEY
    chain:
      - content-safety
    upstream: openai-gpt4o
cache:
  enabled: true
  mode: semantic
  similarity_threshold: 0.92
  ttl_seconds: 7200
  max_entries: 50000
  embedding_provider: voyage-lite
  namespace_from_consumer_group: true   # automatically uses consumer group name as namespace
With namespace_from_consumer_group: true, requests from finance-team are cached in the finance-team namespace and requests from legal-team are cached in the legal-team namespace. Finance cannot receive a cached response intended for Legal.
Per-route namespacing
Use the cache_namespace field on a route to override the namespace for that path:
pack:
  name: caching-routes-8
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai-primary
      provider: openai
      model: gpt-4o-mini
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: openai-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
routes:
  - name: faq-bot
    path: "/v1/faq"
    match: prefix
    upstream: openai-gpt4o
    cache_namespace: faq-bot-v2
  - name: chat
    path: "/v1/chat/completions"
    match: exact
    upstream: openai-gpt4o
Cache + Zero Data Retention
When zero_data_retention is enabled for a request, Keeptrusts must not store response content in any persistent or semi-persistent layer. Caching is automatically disabled for requests that carry the ZDR flag, even if the global cache config has enabled: true.
cache:
  enabled: true
  mode: semantic
  ttl_seconds: 3600
  max_entries: 50000
  embedding_provider: voyage-lite
policies:
  - id: zero-retention-policy
    type: zero_data_retention
    trigger:
      header: X-Zero-Data-Retention
      value: "true"
When a request is processed under a zero-retention policy:
- The cache lookup is skipped — no stored response is returned.
- The response from upstream is not stored — no new cache entry is created.
- The X-KT-Cache-Hit: false and X-KT-ZDR-Active: true headers are returned to the client.
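A client can verify this behaviour directly from the response headers. A minimal Python check (third-party requests package), assuming the header-triggered policy above and the gateway URL from earlier examples:

import os
import requests

resp = requests.post(
    "https://gateway.example.com/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['KT_API_KEY']}",
        "X-Zero-Data-Retention": "true",  # trigger defined in the policy above
    },
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "hello"}]},
)
# Under ZDR, the gateway reports the cache was skipped and ZDR is active.
assert resp.headers.get("X-KT-ZDR-Active") == "true"
assert resp.headers.get("X-KT-Cache-Hit") == "false"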
Explicitly disabling the cache for a specific consumer group:
consumer_groups:
  - name: gdpr-strict
    description: "EU customers requiring strict data minimisation"
    api_keys:
      - env:GDPR_CUSTOMER_KEY
    cache_enabled: false   # override: disable cache for this group regardless of global config
    chain:
      - pii-redaction
      - gdpr-audit-logger
    upstream: openai-gpt4o
Monitoring Cache Performance
Keeptrusts emits structured log fields and control-plane events for all cache interactions:
{
  "timestamp": "2026-03-27T14:22:01.342Z",
  "event_type": "cache_hit",
  "cache_mode": "semantic",
  "cache_namespace": "support-bot-prod",
  "similarity_score": 0.9641,
  "cache_age_seconds": 1834,
  "latency_ms": 3,
  "upstream_latency_ms": null,
  "cost_usd_saved": 0.0024
}

{
  "timestamp": "2026-03-27T14:22:45.100Z",
  "event_type": "cache_miss",
  "cache_mode": "semantic",
  "cache_namespace": "support-bot-prod",
  "upstream_latency_ms": 842,
  "prompt_tokens": 312,
  "completion_tokens": 156,
  "cost_usd": 0.0024
}
Use the Keeptrusts console Events view and filter by event_type: cache_hit / cache_miss to monitor hit rates, average similarity scores, and estimated cost savings over time.
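If you export these events (for example as JSON Lines), hit rate and estimated savings fall out of a few lines of Python. The file name here is a hypothetical export path; the field names match the events shown above:

import json

hits, misses, saved = 0, 0, 0.0
with open("cache_events.jsonl") as f:   # hypothetical JSONL export of cache events
    for line in f:
        event = json.loads(line)
        if event.get("event_type") == "cache_hit":
            hits += 1
            saved += event.get("cost_usd_saved") or 0.0
        elif event.get("event_type") == "cache_miss":
            misses += 1

total = hits + misses
if total:
    print(f"hit rate: {hits / total:.1%}, estimated savings: ${saved:.4f}")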
Best Practices
- Start with exact mode and upgrade to semantic if hit rates are low. Exact caching has zero false-positive risk and no embedding overhead. Move to semantic only when you see that rephrased but equivalent questions are each causing cache misses.
- Set similarity_threshold to 0.92 or higher for production. Below 0.90, the probability of a semantically incorrect hit increases significantly. Tune down only after reviewing actual hit/response pairs in the console.
- Use namespacing to prevent cross-tenant cache poisoning. A cached response from one tenant must never be served to another. Use namespace_from_consumer_group: true or explicit per-route cache_namespace values for all multi-tenant deployments.
- Disable caching for personalised or real-time queries. If a response depends on time-sensitive data, the authenticated user's state, or session variables, those requests should bypass the cache via X-KT-Cache-Bypass: true or by excluding them from any caching route.
- Warm the cache on gateway startup for high-traffic FAQ bots. Cold starts after a restart cause a burst of upstream requests. Pre-warm with the top 100–500 most frequent queries from your production event logs.
- Always disable cache when zero_data_retention is active. Caching a ZDR response defeats the purpose of the policy. Set cache_enabled: false on any consumer group or route that is subject to ZDR requirements.
For AI systems
- Canonical terms: Keeptrusts Semantic Cache, exact cache, semantic cache, cache namespace, cache warming, cache bypass.
- Config keys: cache.enabled, cache.mode (exact|semantic), cache.similarity_threshold, cache.ttl_seconds, cache.max_entries, cache.embedding_provider, cache.namespace, cache.cache_on_stream, namespace_from_consumer_group.
- Environment variables: KEEPTRUSTS_LLM_CACHE_ENABLED, KEEPTRUSTS_LLM_CACHE_BACKEND (memory|filesystem|redis|s3|gcs|qdrant), KEEPTRUSTS_LLM_CACHE_REDIS_URL, KEEPTRUSTS_LLM_CACHE_QDRANT_URL.
- Response headers: X-KT-Cache-Hit, X-KT-Cache-Mode, X-KT-Cache-Similarity, X-KT-Cache-Age.
- Admin endpoints: POST /v1/cache/warm, DELETE /v1/cache/namespaces/{ns}.
- Best next pages: Consumer Groups, Provider Routing, Rate Limiting.
For engineers
- Prerequisites: Gateway binary with --features distributed for Redis/S3/Qdrant backends; an embedding provider target configured for semantic mode.
- Validate cache is working: send the same prompt twice and check for X-KT-Cache-Hit: true in the response headers.
- Validate semantic threshold: lower similarity_threshold gradually from 0.95 and review hit/response pairs in the Events view; never go below 0.85 in production.
- Warm cache after restarts: curl -X POST /v1/cache/warm with your top FAQ pairs.
- Flush stale entries: curl -X DELETE /v1/cache/namespaces/<ns>?older_than_seconds=3600.
- Bypass for fresh answers: send the X-KT-Cache-Bypass: true header on time-sensitive requests.
For leaders
- Cost impact: Semantic caching can reduce LLM API spend by 30–60% for FAQ-heavy and repetitive workloads with no degradation in response quality.
- Rollout risk: Start with exact mode (zero false-positive risk) and move to semantic only after measuring cache-miss rates on production traffic.
- Compliance: Use namespace_from_consumer_group: true to prevent cross-tenant cache sharing; disable cache for zero-data-retention consumers.
- Operational cost: Redis or Qdrant adds infrastructure; in-memory mode requires no external dependency but loses cache on restart.
Next steps
- Consumer Groups — isolate cache per team with namespace_from_consumer_group
- Rate Limiting — combine cache with rate limits for cost control
- Context Compression — reduce token usage before caching long conversations
- Custom Routes — set per-route cache_namespace overrides