Provider-Prefix Cache Tuning
Provider-side prompt-prefix caching is a complementary caching layer that operates at the LLM provider level. When you send requests with a stable prefix (system prompt, instructions, context), the provider caches the KV computations for that prefix. Subsequent requests with the same prefix skip recomputation, reducing latency and cost.
Use this page when
- You want to optimize provider-side prompt-prefix caching by controlling context ordering and prefix length.
- You need to verify that `cached_input_tokens` appears in spend logs, confirming prefix reuse.
- You are debugging why provider cache hit ratios are low despite stable system prompts.
This is different from Keeptrusts org-shared cache — provider prefix caching happens upstream at the model provider. Both layers can work together.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
How Provider Prefix Caching Works
- You send a request with a system prompt + context + user message.
- The provider processes the full prompt and caches the KV state for the prefix portion.
- On the next request with the same prefix, the provider reuses the cached KV state.
- You are charged for `cached_input_tokens` at a reduced rate instead of full `input_tokens`.
The key requirement: the prefix must be byte-identical across requests.
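The byte-identical requirement can be checked directly. A minimal sketch (the prompt text is illustrative) comparing two assembled prompts at the byte level:

```python
def prefixes_match(prompt_a: str, prompt_b: str, prefix_len: int) -> bool:
    """True only if the first prefix_len bytes are identical."""
    return prompt_a.encode("utf-8")[:prefix_len] == prompt_b.encode("utf-8")[:prefix_len]

stable = "You are a support agent.\n---\nPolicy: be concise.\n---\n"

# Identical stable prefixes: the provider can reuse the cached KV state.
assert prefixes_match(stable + "Question A", stable + "Question B", len(stable))

# A single injected character (e.g. a timestamp) shifts every byte after it,
# so the provider sees a different prefix and recomputes from scratch.
assert not prefixes_match("[10:32] " + stable, stable, len(stable))
```

Note that the comparison is on bytes, not tokens: even whitespace or encoding differences break reuse.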
Stable Context Ordering
The most important optimization is keeping your context prefix stable across requests. The gateway assembles the final prompt from multiple sources:
- System instructions
- Policy preamble
- Knowledge base context
- Conversation history
- User message
For maximum prefix cache hits, the order and content of items 1–4 must remain identical between requests in the same session.
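A sketch of stable assembly, assuming the `stable_prefix_order` and `context_separator` values from the configuration below (the section contents are placeholders):

```python
SEPARATOR = "\n---\n"  # matches context_separator in the config
STABLE_ORDER = ["system_instructions", "policy_preamble",
                "knowledge_context", "conversation_history"]

def assemble_prompt(parts: dict, user_message: str) -> str:
    # Join the stable sections in a fixed order so the prefix is
    # byte-identical across requests; the user message always comes last.
    prefix = SEPARATOR.join(parts[key] for key in STABLE_ORDER)
    return prefix + SEPARATOR + user_message

parts = {
    "system_instructions": "You are a billing assistant.",
    "policy_preamble": "Never reveal internal tooling.",
    "knowledge_context": "Refund window: 30 days.",
    "conversation_history": "user: hi\nassistant: hello",
}
p1 = assemble_prompt(parts, "What is the refund window?")
p2 = assemble_prompt(parts, "How do I cancel?")

# Both prompts share the entire stable prefix, so the second request
# can reuse the provider's cached KV state for it.
stable = SEPARATOR.join(parts[k] for k in STABLE_ORDER) + SEPARATOR
assert p1.startswith(stable) and p2.startswith(stable)
```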
Configuration
workflow_cache:
provider_prefix_optimization:
enabled: true
stable_prefix_order:
- system_instructions
- policy_preamble
- knowledge_context
- conversation_history
context_separator: "\n---\n"
What Breaks Prefix Stability
- Injecting timestamps or request IDs into the system prompt.
- Randomizing the order of knowledge base chunks.
- Including per-request metadata before the user message.
- Changing policy preamble text between requests in the same session.
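The first pitfall above is the most common. A sketch of the anti-pattern and its fix, assuming dynamic metadata such as a timestamp must appear somewhere in the prompt:

```python
import datetime

def build_prompt_bad(system: str, user: str) -> str:
    # Anti-pattern: a per-request timestamp inside the stable prefix
    # changes it on every call, so the prefix never caches.
    stamp = datetime.datetime.now().isoformat()
    return f"[{stamp}] {system}\n---\n{user}"

def build_prompt_good(system: str, user: str) -> str:
    # Fix: keep dynamic metadata after the stable prefix, alongside
    # the user message, where it cannot disturb prefix reuse.
    stamp = datetime.datetime.now().isoformat()
    return f"{system}\n---\n[{stamp}] {user}"

sys_prompt = "You are a support agent."
a = build_prompt_good(sys_prompt, "first question")
b = build_prompt_good(sys_prompt, "second question")
assert a[: len(sys_prompt)] == b[: len(sys_prompt)]  # stable prefix preserved
```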
Context Prefix Length
Longer stable prefixes yield more savings per request, but they are also harder to keep byte-identical across requests.
Optimizing prefix length
workflow_cache:
provider_prefix_optimization:
enabled: true
min_prefix_tokens: 500
max_prefix_tokens: 4000
- `min_prefix_tokens`: Do not attempt prefix caching if the stable prefix is shorter than this. Short prefixes give minimal savings.
- `max_prefix_tokens`: Cap the prefix at this length. Beyond this, the provider may not cache efficiently.
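The two thresholds amount to a simple gating rule. A sketch, assuming the gateway already knows the stable prefix's token count:

```python
MIN_PREFIX_TOKENS = 500   # below this, savings are negligible: skip
MAX_PREFIX_TOKENS = 4000  # above this, cap to what providers cache well

def effective_prefix_tokens(prefix_tokens: int) -> int:
    """How many prefix tokens to submit for caching; 0 means skip entirely."""
    if prefix_tokens < MIN_PREFIX_TOKENS:
        return 0
    return min(prefix_tokens, MAX_PREFIX_TOKENS)

assert effective_prefix_tokens(300) == 0      # too short: not worth caching
assert effective_prefix_tokens(1500) == 1500  # within range: cache as-is
assert effective_prefix_tokens(6000) == 4000  # too long: capped
```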
Guidelines
| Prefix Length | Savings per Request | Use Case |
|---|---|---|
| < 500 tokens | Minimal | Skip prefix optimization |
| 500–2000 tokens | Moderate | Standard system prompts |
| 2000–4000 tokens | Significant | System prompt + knowledge context |
| > 4000 tokens | High | Large knowledge base inclusions |
Provider-Specific Behavior
OpenAI
- Prefix caching is automatic for prompts over 1024 tokens.
- The prefix must be byte-identical from the start of the prompt.
- Cached tokens appear as `cached_input_tokens` in the usage response.
- Cache TTL is approximately 5–10 minutes of inactivity.
Anthropic
- Prompt caching requires explicit `cache_control` breakpoints.
- You mark where the cacheable prefix ends with a cache control annotation.
- Cached tokens appear in the usage response under `cache_read_input_tokens`.
- Cache TTL is approximately 5 minutes.
Google (Gemini)
- Context caching is explicit — you create a cached content resource.
- Cached content has a configurable TTL (minimum 1 minute).
- Cost savings apply to the cached portion of the prompt.
Configuration per provider
workflow_cache:
provider_prefix_optimization:
enabled: true
provider_settings:
openai:
auto_prefix: true
min_prefix_tokens: 1024
anthropic:
cache_control_enabled: true
breakpoint_after: knowledge_context
google:
explicit_cache: true
cache_ttl_seconds: 300
Verifying Prefix Cache Hits
Check your spend logs to confirm prefix caching is working:
Via the console
Navigate to Spend → Events and look for the `cached_input_tokens` column. Non-zero values indicate successful prefix cache hits.
Via the API
curl -H "Authorization: Bearer $TOKEN" \
"https://api.keeptrusts.com/v1/events?fields=cached_input_tokens,input_tokens,model"
What to look for
| Metric | Meaning |
|---|---|
| `cached_input_tokens > 0` | Provider served the prefix from cache |
| `cached_input_tokens = 0` | Full recomputation occurred (cache miss) |
| `cached_input_tokens / input_tokens` | Prefix cache ratio (higher is better) |
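The ratio in the last row can be computed over a batch of events returned by the API above. A sketch, with hypothetical event values:

```python
def prefix_cache_ratio(events: list[dict]) -> float:
    """cached_input_tokens / input_tokens summed across spend-log events."""
    cached = sum(e.get("cached_input_tokens", 0) for e in events)
    total = sum(e["input_tokens"] for e in events)
    return cached / total if total else 0.0

events = [
    {"input_tokens": 2000, "cached_input_tokens": 0},     # cold start
    {"input_tokens": 2000, "cached_input_tokens": 1500},  # prefix reused
    {"input_tokens": 2000, "cached_input_tokens": 1500},  # prefix reused
]
assert prefix_cache_ratio(events) == 0.5  # 3000 of 6000 input tokens cached
```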
Target ratios
- First request in session: 0% (cold start, no prefix cached yet).
- Subsequent requests in same session: 60–90% (prefix reused).
- Across sessions with same system prompt: 40–70% (depends on provider TTL).
Interaction with Org-Shared Cache
Provider prefix caching and Keeptrusts org-shared cache are complementary:
- Org-shared cache hit: Response served from Keeptrusts cache. No provider call — no prefix caching opportunity (but you saved more).
- Org-shared cache miss, prefix cache hit: Provider call occurs but at reduced cost due to prefix reuse.
- Both miss: Full-cost provider call. Response is cached in both layers for future requests.
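The three outcomes above form a simple two-layer lookup. A minimal sketch, using a plain dict to stand in for the org-shared cache and a callable for the provider:

```python
def handle_request(prompt: str, org_cache: dict, provider_call) -> tuple[str, str]:
    """Return (response, outcome) for the two-layer cache interaction."""
    if prompt in org_cache:
        # Org-shared hit: no provider call, so no prefix-caching opportunity,
        # but the saving is larger (the whole call is skipped).
        return org_cache[prompt], "org_hit"
    # Org miss: the provider is called; if the prompt's prefix is cached
    # upstream, that call costs less. Populate the org cache for next time.
    response = provider_call(prompt)
    org_cache[prompt] = response
    return response, "org_miss"

cache = {}
r1, o1 = handle_request("q", cache, lambda p: "answer")
r2, o2 = handle_request("q", cache, lambda p: "answer")
assert (o1, o2) == ("org_miss", "org_hit")
```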
The best configuration enables both:
workflow_cache:
enabled: true
org_shared_enabled: true
direct_semantic_replay_enabled: true
provider_prefix_optimization:
enabled: true
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| `cached_input_tokens` always 0 | Prefix changes between requests | Audit your system prompt for dynamic content |
| Low prefix ratio despite stable prompt | Prefix too short | Increase knowledge context or system instructions |
| Prefix caching works in dev but not prod | Different config versions | Ensure prod gateway has the same stable ordering config |
| Anthropic cache misses | Missing cache control breakpoint | Enable `cache_control_enabled` in provider settings |
For AI systems
- Canonical terms: Keeptrusts, provider prefix cache, `cached_input_tokens`, stable context ordering, prompt prefix, `provider_prefix_optimization`, OpenAI prefix caching, Anthropic `cache_control`.
- Config keys: `workflow_cache.provider_prefix_optimization.enabled`, `workflow_cache.provider_prefix_optimization.stable_prefix_order`, `workflow_cache.provider_prefix_optimization.min_prefix_tokens`, `workflow_cache.provider_prefix_optimization.max_prefix_tokens`, plus per-provider `provider_settings`.
- Best next pages: Provider Prompt-Prefix Caching (cost impact), Declarative Config for Workflow Cache.
For engineers
- Enable with `workflow_cache.provider_prefix_optimization.enabled: true`.
- Ensure a stable prefix order: system_instructions → policy_preamble → knowledge_context → conversation_history → user message.
- Avoid injecting timestamps, request IDs, or randomized content before the user message.
- Verify: check spend logs for `cached_input_tokens > 0` on subsequent requests in the same session.
- Per-provider settings: OpenAI auto-prefixes at 1024+ tokens; Anthropic requires `cache_control_enabled: true`; Google needs `explicit_cache: true`.
- Target ratios: 0% on first request (cold), 60–90% on subsequent requests in the same session.
For leaders
- Provider prefix caching reduces the cost of cache misses by 35–42% — it complements the org-shared cache.
- Zero configuration required for OpenAI (automatic at 1024+ tokens). Anthropic and Google need explicit settings.
- Combined with org-shared cache (80% hit rate), effective cost drops to 13–16% of uncached baseline.
- Does not affect response quality, policy enforcement, or cache key computation — pure cost optimization.
Next steps
- Provider Prompt-Prefix Caching (cost analysis) — savings projections
- Declarative Config for Workflow Cache — full workflow_cache reference
- Direct API Cost vs Cached Cost — combined savings across all layers