Semantic Caching
The Keeptrusts semantic cache intercepts outbound LLM requests and checks whether an equivalent query has been answered recently. On a cache hit, the gateway returns the stored response immediately without forwarding to the upstream provider — reducing both cost and end-to-end latency. On a cache miss, the request is forwarded normally and the response is stored for future hits.
Use this page when
- You need the exact command, config, API, or integration details for Semantic Caching.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
Keeptrusts supports two cache modes:
- Exact caching — cache key is a deterministic hash of the prompt. Works best for templated queries, repeated system prompts, or any workload where inputs are structurally identical.
- Semantic caching — cache key is computed from the prompt's embedding vector. Queries that are phrased differently but mean the same thing hit the same cache entry. Works best for FAQ bots, search assistants, and customer support applications.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Storage Backends
Cache behavior is configured in YAML, but cache storage is selected with environment variables so operators can choose per-node or shared storage without changing policy files.
Supported backends today:
- KEEPTRUSTS_LLM_CACHE_BACKEND=memory keeps exact responses and semantic embeddings in-process on a single gateway node.
- KEEPTRUSTS_LLM_CACHE_BACKEND=filesystem persists exact responses on local disk and keeps semantic embeddings in-process on a single node.
- KEEPTRUSTS_LLM_CACHE_BACKEND=redis enables shared exact-cache storage and shared semantic embedding persistence across gateway nodes.
- KEEPTRUSTS_LLM_CACHE_BACKEND=s3 enables shared exact-cache storage in an S3-compatible object store.
- KEEPTRUSTS_LLM_CACHE_BACKEND=gcs enables shared exact-cache storage in Google Cloud Storage using S3 interoperability credentials.
- KEEPTRUSTS_LLM_CACHE_BACKEND=qdrant enables shared semantic vector persistence and lookup in Qdrant.
Redis backend requirements:
- Build the CLI with --features distributed.
- Set KEEPTRUSTS_LLM_CACHE_REDIS_URL or reuse REDIS_URL.
- Optionally set KEEPTRUSTS_LLM_CACHE_REDIS_KEY_PREFIX to isolate environments or tenants sharing one Redis cluster.
S3 and GCS backend requirements:
- Build the CLI with --features distributed.
- Set the bucket, credentials, and endpoint environment variables provided by your deployment owner or runtime environment.
- Treat these backends as exact-cache storage only. They do not persist semantic embeddings or serve semantic hits.
Qdrant backend requirements:
- Build the CLI with --features distributed.
- Set KEEPTRUSTS_LLM_CACHE_QDRANT_URL.
- Optionally set KEEPTRUSTS_LLM_CACHE_QDRANT_COLLECTION and KEEPTRUSTS_LLM_CACHE_QDRANT_API_KEY.
Example operator environment:
export KEEPTRUSTS_LLM_CACHE_ENABLED=true
export KEEPTRUSTS_LLM_CACHE_BACKEND=redis
export KEEPTRUSTS_LLM_CACHE_REDIS_URL="redis://redis.internal:6379/0"
export KEEPTRUSTS_LLM_CACHE_REDIS_KEY_PREFIX="kt:llm-cache:prod"
Example S3-compatible exact-cache environment:
export KEEPTRUSTS_LLM_CACHE_ENABLED=true
export KEEPTRUSTS_LLM_CACHE_BACKEND=s3
export KEEPTRUSTS_LLM_CACHE_S3_BUCKET="keeptrusts-cache"
export KEEPTRUSTS_LLM_CACHE_S3_REGION="us-east-1"
export KEEPTRUSTS_LLM_CACHE_S3_ENDPOINT="https://minio.internal:9000"
export KEEPTRUSTS_LLM_CACHE_S3_ACCESS_KEY_ID="minio-access-key"
export KEEPTRUSTS_LLM_CACHE_S3_SECRET_ACCESS_KEY="minio-secret-key"
export KEEPTRUSTS_LLM_CACHE_S3_FORCE_PATH_STYLE=true
Example Qdrant semantic-cache environment:
export KEEPTRUSTS_LLM_CACHE_ENABLED=true
export KEEPTRUSTS_LLM_CACHE_BACKEND=qdrant
export KEEPTRUSTS_LLM_CACHE_QDRANT_URL="http://qdrant.internal:6333"
export KEEPTRUSTS_LLM_CACHE_QDRANT_COLLECTION="keeptrusts_llm_cache"
Cache Modes
exact mode
In exact mode, the cache key is computed as a SHA-256 hash of the serialised request messages (system + user content, in order). Two requests match only if every token is identical.
cache:
  enabled: true
  mode: exact
  ttl_seconds: 3600    # entries expire after 1 hour
  max_entries: 10000   # evict oldest entries when this limit is reached
Use exact caching when:
- Your application generates structured, templated prompts.
- Requests contain deterministic system prompts followed by a small set of possible user inputs.
- You want zero false-positive cache hits (only truly identical prompts share a cached response).
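For intuition, here is a minimal Python sketch of the hashing scheme exact mode describes: a SHA-256 digest over the serialised message list. The canonicalisation details (key ordering, separators) are illustrative assumptions, not the gateway's actual internal format:

import hashlib
import json

def exact_cache_key(messages: list[dict]) -> str:
    # Canonical serialisation so the same messages always produce the same key.
    payload = json.dumps(messages, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Two structurally identical requests share one key; any token change is a miss.
key = exact_cache_key([
    {"role": "system", "content": "You are a support bot."},
    {"role": "user", "content": "How do I reset my password?"},
])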
semantic mode
In semantic mode, the cache key is computed from an embedding vector of the user's last message (or the full conversation turn, configurable). A new request is a cache hit if its embedding's cosine similarity to a stored embedding exceeds similarity_threshold.
cache:
  enabled: true
  mode: semantic
  similarity_threshold: 0.92   # cosine similarity required for a cache hit (0.0–1.0)
  ttl_seconds: 7200
  max_entries: 50000
  embedding_provider: voyage-lite   # provider target used to compute cache embeddings
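The hit decision itself is a plain cosine-similarity comparison. A small Python sketch of that check, for reference (the vectors would come from the configured embedding_provider):

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def is_semantic_hit(query_vec: list[float], stored_vec: list[float],
                    similarity_threshold: float = 0.92) -> bool:
    # A stored entry is a hit when its similarity meets or exceeds the threshold.
    return cosine_similarity(query_vec, stored_vec) >= similarity_threshold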
Use semantic caching when:
- Users rephrase the same underlying question in different ways.
- Your workload is FAQ-heavy or search-adjacent.
- Slightly imprecise cache hits are acceptable (governed by your similarity_threshold).
Setting similarity_threshold too low (e.g., 0.7) can produce incorrect cache hits where a stored response is returned for a question the original response didn't actually answer. Start at 0.92 and tune downward only after reviewing cache-hit quality.
SemanticCacheConfig Field Reference
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable the cache. No caching occurs when false. |
| mode | string | "exact" | Cache strategy: exact or semantic. |
| similarity_threshold | float | 0.92 | Cosine similarity required for a semantic cache hit. Ignored in exact mode. |
| ttl_seconds | integer | 3600 | Seconds after which a cache entry expires. Set 0 for no expiry. |
| max_entries | integer | 10000 | Maximum number of entries in the cache. When exceeded, the least-recently-used entry is evicted. |
| embedding_provider | string | — | Provider target ID used to compute the query embedding for semantic mode. Must resolve to an embedding-type target. |
| namespace | string | "default" | Logical cache namespace. Use per-tenant or per-consumer-group namespaces to prevent cross-tenant cache sharing. |
| cache_on_stream | bool | false | Whether to cache streamed (stream: true) responses. When true, the gateway buffers the full stream before storing it. |
| exclude_system_from_key | bool | false | When true, the system message is excluded from the cache key computation. Useful when system messages vary per request but the user intent is the same. |
Configuring Semantic Cache
Basic exact cache
The simplest configuration — hash-based matching, 1-hour TTL, in-memory storage:
cache:
  enabled: true
  mode: exact
  ttl_seconds: 3600
  max_entries: 20000
  namespace: "my-app-v1"
Semantic cache with Voyage embeddings
pack:
  name: caching-providers-4
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai-gpt4o
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: voyage-lite
      provider: voyage:embedding:voyage-3-lite
      secret_key_ref:
        env: VOYAGE_API_KEY
cache:
  enabled: true
  mode: semantic
  similarity_threshold: 0.93   # semantic cache wired to the voyage-lite embedding target
  ttl_seconds: 7200
  max_entries: 50000
  embedding_provider: voyage-lite
  namespace: "support-bot-prod"
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
When a request arrives:
- Keeptrusts calls voyage-lite to embed the user query.
- The resulting vector is compared against all stored vectors in the support-bot-prod namespace.
- If any stored vector has cosine similarity ≥ 0.93, the stored response is returned immediately.
- If no match is found, the request is forwarded to openai-gpt4o and the response + embedding are stored.
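In pseudocode terms, the flow looks roughly like the following Python sketch. The names here (cache.nearest, embed, upstream) are hypothetical stand-ins for the gateway's internals, shown only to make the hit/miss branch explicit:

def handle_request(messages, cache, embed, upstream, threshold=0.93):
    query_vec = embed(messages[-1]["content"])   # embed the latest user message
    match = cache.nearest(query_vec)             # closest stored vector in the namespace
    if match is not None and match.similarity >= threshold:
        return match.response                    # hit: answer without an upstream call
    response = upstream(messages)                # miss: forward to the chat target
    cache.store(query_vec, response)             # persist response + embedding for next time
    return response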
Cache with OpenAI embeddings
pack:
  name: caching-providers-5
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai-embed-small
      provider: openai:embedding:text-embedding-3-small
      secret_key_ref:
        env: OPENAI_API_KEY
cache:
  enabled: true
  mode: semantic
  similarity_threshold: 0.92   # semantic cache using the OpenAI embedding target
  ttl_seconds: 7200
  max_entries: 50000
  embedding_provider: openai-embed-small
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Exact cache for deterministic batch jobs
For batch-processing workloads where prompts are machine-generated and fully deterministic, exact caching gives maximum hit rates with no embedding overhead:
cache:
  enabled: true
  mode: exact
  ttl_seconds: 86400             # 24-hour cache; batch prompts rarely change intra-day
  max_entries: 500000
  namespace: "batch-classifier"
  exclude_system_from_key: false # include system message — it's part of the prompt identity
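Exact mode pays off here because machine-generated prompts are byte-identical across runs. A hypothetical Python template illustrating the pattern (the classifier wording is invented for illustration):

# Identical ticket bodies produce byte-identical prompts, which hash to the
# same exact-cache key, so repeats within the 24-hour TTL never hit the upstream.
TEMPLATE = "Classify this ticket as one of: billing, bug, feature.\n\n{body}"

def build_prompt(ticket_body: str) -> list[dict]:
    return [
        {"role": "system", "content": "You are a ticket classifier."},
        {"role": "user", "content": TEMPLATE.format(body=ticket_body.strip())},
    ]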
Cache Hit Indicator
Keeptrusts injects a response header to indicate whether the response was served from cache:
| Header | Values | Description |
|---|---|---|
| X-KT-Cache-Hit | true / false | Whether this response was served from the cache. |
| X-KT-Cache-Mode | exact / semantic | The active cache mode. |
| X-KT-Cache-Similarity | float string | (Semantic mode only) Cosine similarity score of the matched entry. |
| X-KT-Cache-Namespace | string | The cache namespace this hit came from. |
| X-KT-Cache-Age | integer | Seconds since the cached entry was stored. |
Use these headers in your application to track cache hit rates and to selectively bypass the cache when freshness is required.
curl -s -D - https://gateway.example.com/v1/chat/completions \
  -H "Authorization: Bearer $KT_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"What is Keeptrusts?"}]}' \
  | grep -i x-kt-cache
# X-KT-Cache-Hit: true
# X-KT-Cache-Mode: semantic
# X-KT-Cache-Similarity: 0.9641
# X-KT-Cache-Age: 1834
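The same check works from application code. A short sketch using the third-party Python requests package, reusing the gateway URL and key from the curl example above:

import os
import requests

resp = requests.post(
    "https://gateway.example.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['KT_API_KEY']}"},
    json={"model": "gpt-4o",
          "messages": [{"role": "user", "content": "What is Keeptrusts?"}]},
)
# Inspect the cache headers to track hit rates in your own metrics.
if resp.headers.get("X-KT-Cache-Hit") == "true":
    print("served from cache, age:", resp.headers.get("X-KT-Cache-Age"), "s")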
Cache Warming
Pre-populate the cache with answers to known high-frequency questions using the admin cache-warm endpoint. This is especially useful after a gateway restart (which clears the in-memory cache) or when deploying to a new namespace.
# Warm the cache with a list of seed Q&A pairs
curl -X POST https://api.keeptrusts.com/v1/cache/warm \
  -H "Authorization: Bearer $KT_ADMIN_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "namespace": "support-bot-prod",
    "entries": [
      {
        "messages": [{"role": "user", "content": "How do I reset my password?"}],
        "response": "To reset your password, click Forgot Password on the login page..."
      },
      {
        "messages": [{"role": "user", "content": "What are your business hours?"}],
        "response": "Our support team is available Monday–Friday, 9am–6pm UTC."
      }
    ]
  }'
Cache-warm entries are treated identically to organically cached entries: they have the same TTL and are subject to the same LRU eviction policy.
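For larger seed sets, the same endpoint can be driven from a script. A sketch in Python (using the third-party requests package) that warms the cache from a local file; the file name and its shape are assumptions, while the endpoint and payload match the curl example above:

import json
import os
import requests

# seed_qa.json is a hypothetical local file: a list of
# {"messages": [...], "response": "..."} objects.
with open("seed_qa.json") as f:
    entries = json.load(f)

resp = requests.post(
    "https://api.keeptrusts.com/v1/cache/warm",
    headers={"Authorization": f"Bearer {os.environ['KT_ADMIN_KEY']}"},
    json={"namespace": "support-bot-prod", "entries": entries},
)
resp.raise_for_status()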
Cache Invalidation
TTL expiry
Every cache entry has a ttl_seconds lifetime. When the TTL elapses, the entry is evicted and the next matching request causes a cache miss, forwarding to the upstream and refreshing the stored entry.
Manual flush
Flush all entries in a namespace using the admin cache API:
# Flush all entries in a specific namespace
curl -X DELETE https://api.keeptrusts.com/v1/cache/namespaces/support-bot-prod \
  -H "Authorization: Bearer $KT_ADMIN_KEY"

# Flush entries older than a specific age
curl -X DELETE "https://api.keeptrusts.com/v1/cache/namespaces/support-bot-prod?older_than_seconds=3600" \
  -H "Authorization: Bearer $KT_ADMIN_KEY"
Per-request bypass
Clients can bypass the cache for a specific request by including the X-KT-Cache-Bypass: true header:
curl -X POST https://gateway.example.com/v1/chat/completions \
  -H "Authorization: Bearer $KT_API_KEY" \
  -H "X-KT-Cache-Bypass: true" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[...]}'
Bypassed requests are always forwarded to the upstream. Their responses are not stored in the cache. Use this for requests that require the freshest possible answer (e.g., real-time data queries, personalised responses).
Cache Namespacing
Cache entries are scoped to a namespace. Use separate namespaces to prevent different applications, tenants, or consumer groups from sharing cached responses.
Per-consumer-group namespacing
Combine cache namespacing with consumer groups to give each team an isolated cache:
consumer_groups:
  - name: finance-team
    api_keys:
      - env:FINANCE_KEY
    chain:
      - content-safety
    upstream: openai-gpt4o
  - name: legal-team
    api_keys:
      - env:LEGAL_KEY
    chain:
      - content-safety
    upstream: openai-gpt4o
cache:
  enabled: true
  mode: semantic
  similarity_threshold: 0.92
  ttl_seconds: 7200
  max_entries: 50000
  embedding_provider: voyage-lite
  namespace_from_consumer_group: true   # automatically uses consumer group name as namespace
With namespace_from_consumer_group: true, requests from finance-team are cached in the finance-team namespace and requests from legal-team are cached in the legal-team namespace. Finance cannot receive a cached response intended for Legal.
Per-route namespacing
Use the cache_namespace field on a route to override the namespace for that path:
pack:
  name: caching-routes-8
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai-primary
      provider: openai
      model: gpt-4o-mini
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: openai-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
routes:
  - name: faq-bot
    path: "/v1/faq"
    match: prefix
    upstream: openai-gpt4o
    cache_namespace: faq-bot-v2
  - name: chat
    path: "/v1/chat/completions"
    match: exact
    upstream: openai-gpt4o
Cache + Zero Data Retention
When zero_data_retention is enabled for a request, Keeptrusts must not store response content in any persistent or semi-persistent layer. Caching is automatically disabled for requests that carry the ZDR flag, even if the global cache config has enabled: true.
cache:
  enabled: true
  mode: semantic
  ttl_seconds: 3600
  max_entries: 50000
  embedding_provider: voyage-lite
policies:
  - id: zero-retention-policy
    type: zero_data_retention
    trigger:
      header: X-Zero-Data-Retention
      value: "true"
When a request is processed under a zero-retention policy:
- The cache lookup is skipped — no stored response is returned.
- The response from upstream is not stored — no new cache entry is created.
- The X-KT-Cache-Hit: false and X-KT-ZDR-Active: true headers are returned to the client.
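A client can verify this behaviour directly from the response headers. A minimal Python check (third-party requests package), assuming the header-triggered policy above and the gateway URL from earlier examples:

import os
import requests

resp = requests.post(
    "https://gateway.example.com/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.environ['KT_API_KEY']}",
        "X-Zero-Data-Retention": "true",  # trigger defined in the policy above
    },
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "hello"}]},
)
# Under ZDR, the gateway reports the cache was skipped and ZDR is active.
assert resp.headers.get("X-KT-ZDR-Active") == "true"
assert resp.headers.get("X-KT-Cache-Hit") == "false"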
Explicitly disabling the cache for a specific consumer group:
consumer_groups:
  - name: gdpr-strict
    description: "EU customers requiring strict data minimisation"
    api_keys:
      - env:GDPR_CUSTOMER_KEY
    cache_enabled: false   # override: disable cache for this group regardless of global config
    chain:
      - pii-redaction
      - gdpr-audit-logger
    upstream: openai-gpt4o
Monitoring Cache Performance
Keeptrusts emits structured log fields and control-plane events for all cache interactions:
{
  "timestamp": "2026-03-27T14:22:01.342Z",
  "event_type": "cache_hit",
  "cache_mode": "semantic",
  "cache_namespace": "support-bot-prod",
  "similarity_score": 0.9641,
  "cache_age_seconds": 1834,
  "latency_ms": 3,
  "upstream_latency_ms": null,
  "cost_usd_saved": 0.0024
}

{
  "timestamp": "2026-03-27T14:22:45.100Z",
  "event_type": "cache_miss",
  "cache_mode": "semantic",
  "cache_namespace": "support-bot-prod",
  "upstream_latency_ms": 842,
  "prompt_tokens": 312,
  "completion_tokens": 156,
  "cost_usd": 0.0024
}
Use the Keeptrusts console Events view and filter by event_type: cache_hit / cache_miss to monitor hit rates, average similarity scores, and estimated cost savings over time.
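If you export these events (for example as JSON Lines), hit rate and estimated savings fall out of a few lines of Python. The file name here is a hypothetical export path; the field names match the events shown above:

import json

hits, misses, saved = 0, 0, 0.0
with open("cache_events.jsonl") as f:   # hypothetical JSONL export of cache events
    for line in f:
        event = json.loads(line)
        if event.get("event_type") == "cache_hit":
            hits += 1
            saved += event.get("cost_usd_saved") or 0.0
        elif event.get("event_type") == "cache_miss":
            misses += 1

total = hits + misses
if total:
    print(f"hit rate: {hits / total:.1%}, estimated savings: ${saved:.4f}")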
Best Practices
- Start with exact mode and upgrade to semantic if hit rates are low. Exact caching has zero false-positive risk and no embedding overhead. Move to semantic only when you see that rephrased but equivalent questions are each causing cache misses.
- Set similarity_threshold to 0.92 or higher for production. Below 0.90, the probability of a semantically incorrect hit increases significantly. Tune down only after reviewing actual hit/response pairs in the console.
- Use namespacing to prevent cross-tenant cache poisoning. A cached response from one tenant must never be served to another. Use namespace_from_consumer_group: true or explicit per-route cache_namespace values for all multi-tenant deployments.
- Disable caching for personalised or real-time queries. If a response depends on time-sensitive data, the authenticated user's state, or session variables, those requests should bypass the cache via X-KT-Cache-Bypass: true or by excluding them from any caching route.
- Warm the cache on gateway startup for high-traffic FAQ bots. Cold starts after a restart cause a burst of upstream requests. Pre-warm with the top 100–500 most frequent queries from your production event logs.
- Always disable cache when zero_data_retention is active. Caching a ZDR response defeats the purpose of the policy. Set cache_enabled: false on any consumer group or route that is subject to ZDR requirements.
For AI systems
- Canonical terms: Keeptrusts Semantic Cache, exact cache, semantic cache, cache namespace, cache warming, cache bypass.
- Config keys: cache.enabled, cache.mode (exact|semantic), cache.similarity_threshold, cache.ttl_seconds, cache.max_entries, cache.embedding_provider, cache.namespace, cache.cache_on_stream, namespace_from_consumer_group.
- Environment variables: KEEPTRUSTS_LLM_CACHE_ENABLED, KEEPTRUSTS_LLM_CACHE_BACKEND (memory|filesystem|redis|s3|gcs|qdrant), KEEPTRUSTS_LLM_CACHE_REDIS_URL, KEEPTRUSTS_LLM_CACHE_QDRANT_URL.
- Response headers: X-KT-Cache-Hit, X-KT-Cache-Mode, X-KT-Cache-Similarity, X-KT-Cache-Age.
- Admin endpoints: POST /v1/cache/warm, DELETE /v1/cache/namespaces/{ns}.
- Best next pages: Consumer Groups, Provider Routing, Rate Limiting.
For engineers
- Prerequisites: Gateway binary with --features distributed for Redis/S3/Qdrant backends; an embedding provider target configured for semantic mode.
- Validate cache is working: send the same prompt twice and check for X-KT-Cache-Hit: true in the response headers.
- Validate semantic threshold: lower similarity_threshold gradually from 0.95 and review hit/response pairs in the Events view; never go below 0.85 in production.
- Warm cache after restarts: curl -X POST /v1/cache/warm with your top FAQ pairs.
- Flush stale entries: curl -X DELETE /v1/cache/namespaces/<ns>?older_than_seconds=3600.
- Bypass for fresh answers: send the X-KT-Cache-Bypass: true header on time-sensitive requests.
For leaders
- Cost impact: Semantic caching can reduce LLM API spend by 30–60% for FAQ-heavy and repetitive workloads with no degradation in response quality.
- Rollout risk: Start with exact mode (zero false-positive risk) and move to semantic only after measuring cache-miss rates on production traffic.
- Compliance: Use namespace_from_consumer_group: true to prevent cross-tenant cache sharing; disable cache for zero-data-retention consumers.
- Operational cost: Redis or Qdrant adds infrastructure; in-memory mode requires no external dependency but loses cache on restart.
Next steps
- Consumer Groups — isolate cache per team with namespace_from_consumer_group
- Rate Limiting — combine cache with rate limits for cost control
- Context Compression — reduce token usage before caching long conversations
- Custom Routes — set per-route cache_namespace overrides