Provider-Prefix Cache Tuning
Provider-side prompt-prefix caching is a complementary caching layer that operates at the LLM provider level. When you send requests with a stable prefix (system prompt, instructions, context), the provider caches the KV computations for that prefix. Subsequent requests with the same prefix skip recomputation, reducing latency and cost.
Use this page when
- You want to optimize provider-side prompt-prefix caching by controlling context ordering and prefix length.
- You need to verify that `cached_input_tokens` appears in spend logs, confirming prefix reuse.
- You are debugging why provider cache hit ratios are low despite stable system prompts.
This is different from Keeptrusts org-shared cache — provider prefix caching happens upstream at the model provider. Both layers can work together.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
How Provider Prefix Caching Works
- You send a request with a system prompt + context + user message.
- The provider processes the full prompt and caches the KV state for the prefix portion.
- On the next request with the same prefix, the provider reuses the cached KV state.
- You are charged for `cached_input_tokens` at a reduced rate instead of full `input_tokens`.
The key requirement: the prefix must be byte-identical across requests.
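The byte-identical requirement can be checked directly. A minimal sketch (the prompt text is illustrative) comparing two assembled prompts at the byte level:

```python
def prefixes_match(prompt_a: str, prompt_b: str, prefix_len: int) -> bool:
    """True only if the first prefix_len bytes are identical."""
    return prompt_a.encode("utf-8")[:prefix_len] == prompt_b.encode("utf-8")[:prefix_len]

stable = "You are a support agent.\n---\nPolicy: be concise.\n---\n"

# Identical stable prefixes: the provider can reuse the cached KV state.
assert prefixes_match(stable + "Question A", stable + "Question B", len(stable))

# A single injected character (e.g. a timestamp) shifts every byte after it,
# so the provider sees a different prefix and recomputes from scratch.
assert not prefixes_match("[10:32] " + stable, stable, len(stable))
```

Note that the comparison is on bytes, not tokens: even whitespace or encoding differences break reuse.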
Stable Context Ordering
The most important optimization is keeping your context prefix stable across requests. The gateway assembles the final prompt from multiple sources:
- System instructions
- Policy preamble
- Knowledge base context
- Conversation history
- User message
For maximum prefix cache hits, the order and content of items 1–4 must remain identical between requests in the same session.
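A sketch of stable assembly, assuming the `stable_prefix_order` and `context_separator` values from the configuration below (the section contents are placeholders):

```python
SEPARATOR = "\n---\n"  # matches context_separator in the config
STABLE_ORDER = ["system_instructions", "policy_preamble",
                "knowledge_context", "conversation_history"]

def assemble_prompt(parts: dict, user_message: str) -> str:
    # Join the stable sections in a fixed order so the prefix is
    # byte-identical across requests; the user message always comes last.
    prefix = SEPARATOR.join(parts[key] for key in STABLE_ORDER)
    return prefix + SEPARATOR + user_message

parts = {
    "system_instructions": "You are a billing assistant.",
    "policy_preamble": "Never reveal internal tooling.",
    "knowledge_context": "Refund window: 30 days.",
    "conversation_history": "user: hi\nassistant: hello",
}
p1 = assemble_prompt(parts, "What is the refund window?")
p2 = assemble_prompt(parts, "How do I cancel?")

# Both prompts share the entire stable prefix, so the second request
# can reuse the provider's cached KV state for it.
stable = SEPARATOR.join(parts[k] for k in STABLE_ORDER) + SEPARATOR
assert p1.startswith(stable) and p2.startswith(stable)
```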
Configuration
workflow_cache:
provider_prefix_optimization:
enabled: true
stable_prefix_order:
- system_instructions
- policy_preamble
- knowledge_context
- conversation_history
context_separator: "\n---\n"
What Breaks Prefix Stability
- Injecting timestamps or request IDs into the system prompt.
- Randomizing the order of knowledge base chunks.
- Including per-request metadata before the user message.
- Changing policy preamble text between requests in the same session.
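The first pitfall above is the most common. A sketch of the anti-pattern and its fix, assuming dynamic metadata such as a timestamp must appear somewhere in the prompt:

```python
import datetime

def build_prompt_bad(system: str, user: str) -> str:
    # Anti-pattern: a per-request timestamp inside the stable prefix
    # changes it on every call, so the prefix never caches.
    stamp = datetime.datetime.now().isoformat()
    return f"[{stamp}] {system}\n---\n{user}"

def build_prompt_good(system: str, user: str) -> str:
    # Fix: keep dynamic metadata after the stable prefix, alongside
    # the user message, where it cannot disturb prefix reuse.
    stamp = datetime.datetime.now().isoformat()
    return f"{system}\n---\n[{stamp}] {user}"

sys_prompt = "You are a support agent."
a = build_prompt_good(sys_prompt, "first question")
b = build_prompt_good(sys_prompt, "second question")
assert a[: len(sys_prompt)] == b[: len(sys_prompt)]  # stable prefix preserved
```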
Context Prefix Length
Longer stable prefixes yield more savings per request, but they are also harder to keep byte-identical across requests.
Optimizing prefix length
workflow_cache:
provider_prefix_optimization:
enabled: true
min_prefix_tokens: 500
max_prefix_tokens: 4000
- `min_prefix_tokens`: Do not attempt prefix caching if the stable prefix is shorter than this. Short prefixes give minimal savings.
- `max_prefix_tokens`: Cap the prefix at this length. Beyond this, the provider may not cache efficiently.
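The two thresholds amount to a simple gating rule. A sketch, assuming the gateway already knows the stable prefix's token count:

```python
MIN_PREFIX_TOKENS = 500   # below this, savings are negligible: skip
MAX_PREFIX_TOKENS = 4000  # above this, cap to what providers cache well

def effective_prefix_tokens(prefix_tokens: int) -> int:
    """How many prefix tokens to submit for caching; 0 means skip entirely."""
    if prefix_tokens < MIN_PREFIX_TOKENS:
        return 0
    return min(prefix_tokens, MAX_PREFIX_TOKENS)

assert effective_prefix_tokens(300) == 0      # too short: not worth caching
assert effective_prefix_tokens(1500) == 1500  # within range: cache as-is
assert effective_prefix_tokens(6000) == 4000  # too long: capped
```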
Guidelines
| Prefix Length | Savings per Request | Use Case |
|---|---|---|
| < 500 tokens | Minimal | Skip prefix optimization |
| 500–2000 tokens | Moderate | Standard system prompts |
| 2000–4000 tokens | Significant | System prompt + knowledge context |
| > 4000 tokens | High | Large knowledge base inclusions |
Provider-Specific Behavior
OpenAI
- Prefix caching is automatic for prompts over 1024 tokens.
- The prefix must be byte-identical from the start of the prompt.
- Cached tokens appear as `cached_input_tokens` in the usage response.
- Cache TTL is approximately 5–10 minutes of inactivity.
Anthropic
- Prompt caching requires explicit `cache_control` breakpoints.
- You mark where the cacheable prefix ends with a cache control annotation.
- Cached tokens appear in the usage response under `cache_read_input_tokens`.
- Cache TTL is approximately 5 minutes.
Google (Gemini)
- Context caching is explicit — you create a cached content resource.
- Cached content has a configurable TTL (minimum 1 minute).
- Cost savings apply to the cached portion of the prompt.
Configuration per provider
workflow_cache:
provider_prefix_optimization:
enabled: true
provider_settings:
openai:
auto_prefix: true
min_prefix_tokens: 1024
anthropic:
cache_control_enabled: true
breakpoint_after: knowledge_context
google:
explicit_cache: true
cache_ttl_seconds: 300
Verifying Prefix Cache Hits
Check your spend logs to confirm prefix caching is working:
Via the console
Navigate to Spend → Events and look for the `cached_input_tokens` column. Non-zero values indicate successful prefix cache hits.
Via the API
curl -H "Authorization: Bearer $TOKEN" \
"https://api.keeptrusts.com/v1/events?fields=cached_input_tokens,input_tokens,model"
What to look for
| Metric | Meaning |
|---|---|
| `cached_input_tokens > 0` | Provider served the prefix from cache |
| `cached_input_tokens = 0` | Full recomputation occurred (cache miss) |
| `cached_input_tokens / input_tokens` | Prefix cache ratio (higher is better) |
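The ratio in the last row can be computed over a batch of events returned by the API above. A sketch, with hypothetical event values:

```python
def prefix_cache_ratio(events: list[dict]) -> float:
    """cached_input_tokens / input_tokens summed across spend-log events."""
    cached = sum(e.get("cached_input_tokens", 0) for e in events)
    total = sum(e["input_tokens"] for e in events)
    return cached / total if total else 0.0

events = [
    {"input_tokens": 2000, "cached_input_tokens": 0},     # cold start
    {"input_tokens": 2000, "cached_input_tokens": 1500},  # prefix reused
    {"input_tokens": 2000, "cached_input_tokens": 1500},  # prefix reused
]
assert prefix_cache_ratio(events) == 0.5  # 3000 of 6000 input tokens cached
```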
Target ratios
- First request in session: 0% (cold start, no prefix cached yet).
- Subsequent requests in same session: 60–90% (prefix reused).
- Across sessions with same system prompt: 40–70% (depends on provider TTL).
Interaction with Org-Shared Cache
Provider prefix caching and Keeptrusts org-shared cache are complementary:
- Org-shared cache hit: Response served from Keeptrusts cache. No provider call — no prefix caching opportunity (but you saved more).
- Org-shared cache miss, prefix cache hit: Provider call occurs but at reduced cost due to prefix reuse.
- Both miss: Full-cost provider call. Response is cached in both layers for future requests.
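The three outcomes above form a simple two-layer lookup. A minimal sketch, using a plain dict to stand in for the org-shared cache and a callable for the provider:

```python
def handle_request(prompt: str, org_cache: dict, provider_call) -> tuple[str, str]:
    """Return (response, outcome) for the two-layer cache interaction."""
    if prompt in org_cache:
        # Org-shared hit: no provider call, so no prefix-caching opportunity,
        # but the saving is larger (the whole call is skipped).
        return org_cache[prompt], "org_hit"
    # Org miss: the provider is called; if the prompt's prefix is cached
    # upstream, that call costs less. Populate the org cache for next time.
    response = provider_call(prompt)
    org_cache[prompt] = response
    return response, "org_miss"

cache = {}
r1, o1 = handle_request("q", cache, lambda p: "answer")
r2, o2 = handle_request("q", cache, lambda p: "answer")
assert (o1, o2) == ("org_miss", "org_hit")
```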
The best configuration enables both:
workflow_cache:
enabled: true
org_shared_enabled: true
direct_semantic_replay_enabled: true
provider_prefix_optimization:
enabled: true
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| `cached_input_tokens` always 0 | Prefix changes between requests | Audit your system prompt for dynamic content |
| Low prefix ratio despite stable prompt | Prefix too short | Increase knowledge context or system instructions |
| Prefix caching works in dev but not prod | Different config versions | Ensure prod gateway has the same stable ordering config |
| Anthropic cache misses | Missing cache control breakpoint | Enable `cache_control_enabled` in provider settings |
For AI systems
- Canonical terms: Keeptrusts, provider prefix cache, `cached_input_tokens`, stable context ordering, prompt prefix, `provider_prefix_optimization`, OpenAI prefix caching, Anthropic `cache_control`.
- Config keys: `workflow_cache.provider_prefix_optimization.enabled`, `workflow_cache.provider_prefix_optimization.stable_prefix_order`, `workflow_cache.provider_prefix_optimization.min_prefix_tokens`, `workflow_cache.provider_prefix_optimization.max_prefix_tokens`, plus per-provider `provider_settings`.
- Best next pages: Provider Prompt-Prefix Caching (cost impact), Declarative Config for Workflow Cache.
For engineers
- Enable with `workflow_cache.provider_prefix_optimization.enabled: true`.
- Ensure a stable prefix order: system_instructions → policy_preamble → knowledge_context → conversation_history → user message.
- Avoid injecting timestamps, request IDs, or randomized content before the user message.
- Verify: check spend logs for `cached_input_tokens > 0` on subsequent requests in the same session.
- Per-provider settings: OpenAI auto-prefixes at 1024+ tokens; Anthropic requires `cache_control_enabled: true`; Google needs `explicit_cache: true`.
- Target ratios: 0% on first request (cold), 60–90% on subsequent requests in the same session.
For leaders
- Provider prefix caching reduces the cost of cache misses by 35–42% — it complements the org-shared cache.
- Zero configuration required for OpenAI (automatic at 1024+ tokens). Anthropic and Google need explicit settings.
- Combined with org-shared cache (80% hit rate), effective cost drops to 13–16% of uncached baseline.
- Does not affect response quality, policy enforcement, or cache key computation — pure cost optimization.
Next steps
- Provider Prompt-Prefix Caching (cost analysis) — savings projections
- Declarative Config for Workflow Cache — full workflow_cache reference
- Direct API Cost vs Cached Cost — combined savings across all layers