Provider-Prefix Cache Tuning

Provider-side prompt-prefix caching is a complementary caching layer that operates at the LLM provider level. When you send requests with a stable prefix (system prompt, instructions, context), the provider caches the KV computations for that prefix. Subsequent requests with the same prefix skip recomputation, reducing latency and cost.

Use this page when

  • You want to optimize provider-side prompt-prefix caching by controlling context ordering and prefix length.
  • You need to verify that cached_input_tokens appears in spend logs, confirming prefix reuse.
  • You are debugging why provider cache hit ratios are low despite stable system prompts.

This is different from the Keeptrusts org-shared cache: provider prefix caching happens upstream at the model provider. Both layers can work together.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

How Provider Prefix Caching Works

  1. You send a request with a system prompt + context + user message.
  2. The provider processes the full prompt and caches the KV state for the prefix portion.
  3. On the next request with the same prefix, the provider reuses the cached KV state.
  4. You are charged for cached_input_tokens at a reduced rate instead of full input_tokens.

The key requirement: the prefix must be byte-identical across requests.
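
Providers match the prefix byte for byte, not semantically. A quick way to spot accidental differences between two requests is to compare a hash of the assembled prefix text. This is only an illustration of the requirement, not something the gateway does for you:

import hashlib

prefix_a = "You are a support assistant.\n---\nAlways cite the policy ID.\n---\n"
prefix_b = "You are a support assistant.\n---\nAlways cite the policy ID. \n---\n"  # one stray space

print(hashlib.sha256(prefix_a.encode()).hexdigest())
print(hashlib.sha256(prefix_b.encode()).hexdigest())
# The digests differ, so the prefixes are not byte-identical and the provider
# would treat the second request as a prefix cache miss.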

Stable Context Ordering

The most important optimization is keeping your context prefix stable across requests. The gateway assembles the final prompt from multiple sources:

  1. System instructions
  2. Policy preamble
  3. Knowledge base context
  4. Conversation history
  5. User message

For maximum prefix cache hits, the order and content of items 1–4 must remain identical between requests in the same session.

Configuration

workflow_cache:
  provider_prefix_optimization:
    enabled: true
    stable_prefix_order:
      - system_instructions
      - policy_preamble
      - knowledge_context
      - conversation_history
    context_separator: "\n---\n"
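
Conceptually, the gateway joins the stable sections in the configured stable_prefix_order with context_separator and appends the user message last. A simplified sketch of that assembly (an illustration of the assumed behavior, not gateway source code):

SEPARATOR = "\n---\n"

def assemble_prompt(system_instructions: str, policy_preamble: str,
                    knowledge_context: str, conversation_history: str,
                    user_message: str) -> str:
    # Join the stable sections in the configured stable_prefix_order.
    stable_prefix = SEPARATOR.join([
        system_instructions,
        policy_preamble,
        knowledge_context,
        conversation_history,
    ])
    # The user message is appended last, so per-request content never
    # invalidates the cached prefix.
    return stable_prefix + SEPARATOR + user_message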

What Breaks Prefix Stability

  • Injecting timestamps or request IDs into the system prompt.
  • Randomizing the order of knowledge base chunks.
  • Including per-request metadata before the user message.
  • Changing policy preamble text between requests in the same session.
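
For example, a timestamp interpolated into the system prompt defeats prefix caching, while the same timestamp placed after the stable prefix does not. The prompts below are hypothetical and only illustrate the placement rule:

from datetime import datetime, timezone

request_time = datetime.now(timezone.utc).isoformat()

# Breaks prefix stability: the timestamp changes on every request, so the
# system prompt is never byte-identical and the provider cannot reuse the prefix.
bad_system_prompt = f"You are a support assistant. Current time: {request_time}."

# Preserves prefix stability: keep the system prompt static and place
# per-request values after the stable prefix, alongside the user message.
good_system_prompt = "You are a support assistant."
user_message = f"(request time: {request_time}) How do I reset my password?"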

Context Prefix Length

Longer stable prefixes save more per request, but more of the prompt has to stay byte-identical between requests for the cache to keep hitting.

Optimizing prefix length

workflow_cache:
  provider_prefix_optimization:
    enabled: true
    min_prefix_tokens: 500
    max_prefix_tokens: 4000

  • min_prefix_tokens: Do not attempt prefix caching if the stable prefix is shorter than this. Short prefixes give minimal savings.
  • max_prefix_tokens: Cap the prefix at this length. Beyond this, the provider may not cache efficiently.
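
As a rough sketch of how these two thresholds gate the optimization (illustration only: counting tokens with tiktoken is an assumption, and the real gateway logic may differ):

import tiktoken

encoder = tiktoken.get_encoding("cl100k_base")

def plan_prefix(stable_prefix: str,
                min_prefix_tokens: int = 500,
                max_prefix_tokens: int = 4000):
    """Return the number of prefix tokens worth caching, or None to skip."""
    n_tokens = len(encoder.encode(stable_prefix))
    if n_tokens < min_prefix_tokens:
        return None                          # too short: savings would be minimal
    return min(n_tokens, max_prefix_tokens)  # cap the cacheable prefix length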

Guidelines

Prefix Length      | Savings per Request | Use Case
< 500 tokens       | Minimal             | Skip prefix optimization
500–2000 tokens    | Moderate            | Standard system prompts
2000–4000 tokens   | Significant         | System prompt + knowledge context
> 4000 tokens      | High                | Large knowledge base inclusions

Provider-Specific Behavior

OpenAI

  • Prefix caching is automatic for prompts of 1024 tokens or longer.
  • The prefix must be byte-identical from the start of the prompt.
  • Cached tokens appear as cached_input_tokens in the usage response.
  • Cache TTL is approximately 5–10 minutes of inactivity.
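
A hedged sketch of checking OpenAI's cached-token accounting directly with the official Python SDK. The model name is an example, and the usage field names (prompt_tokens_details.cached_tokens) reflect OpenAI's raw usage object and may vary by SDK version; when requests go through the gateway, the same figure appears as cached_input_tokens in spend logs.

from openai import OpenAI

client = OpenAI()
# A long, static system prompt acts as the cacheable prefix (padded past 1024 tokens).
system_prompt = "You are a support assistant for Acme. " + "Follow the refund policy. " * 300

def ask(question: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    # Cached prefix tokens reported by OpenAI for this request.
    return response.usage.prompt_tokens_details.cached_tokens

print(ask("How do I reset my password?"))  # typically 0 while the prefix is cold
print(ask("What is the refund window?"))   # typically non-zero once the prefix is cached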

Anthropic

  • Prompt caching requires explicit cache_control breakpoints.
  • You mark where the cacheable prefix ends with a cache control annotation.
  • Cached tokens appear in the usage response under cache_read_input_tokens.
  • Cache TTL is approximately 5 minutes.
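
A hedged sketch of setting a cache_control breakpoint with the official Anthropic Python SDK. The model name is an example, and exact fields may vary by SDK version; when requests go through the gateway, enabling cache_control_enabled in provider settings applies the breakpoint for you.

import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": "You are a support assistant for Acme. " + "Follow the refund policy. " * 300,
            "cache_control": {"type": "ephemeral"},  # cacheable prefix ends here
        }
    ],
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
# 0 on the first call; non-zero once the prefix has been cached and reused.
print(response.usage.cache_read_input_tokens)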

Google (Gemini)

  • Context caching is explicit — you create a cached content resource.
  • Cached content has a configurable TTL (minimum 1 minute).
  • Cost savings apply to the cached portion of the prompt.

Configuration per provider

workflow_cache:
  provider_prefix_optimization:
    enabled: true
    provider_settings:
      openai:
        auto_prefix: true
        min_prefix_tokens: 1024
      anthropic:
        cache_control_enabled: true
        breakpoint_after: knowledge_context
      google:
        explicit_cache: true
        cache_ttl_seconds: 300

Verifying Prefix Cache Hits

Check your spend logs to confirm prefix caching is working:

Via the console

Navigate to Spend → Events and look for the cached_input_tokens column. Non-zero values indicate successful prefix cache hits.

Via the API

curl -H "Authorization: Bearer $TOKEN" \
"https://api.keeptrusts.com/v1/events?fields=cached_input_tokens,input_tokens,model"

What to look for

Metric                             | Meaning
cached_input_tokens > 0            | Provider served the prefix from cache
cached_input_tokens = 0            | Full recomputation occurred (cache miss)
cached_input_tokens / input_tokens | Prefix cache ratio (higher is better)
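
To compute the prefix cache ratio programmatically from the events endpoint above, a small sketch (the response envelope key "events" is an assumption; adjust the field access to the actual response shape):

import os
import requests

response = requests.get(
    "https://api.keeptrusts.com/v1/events",
    params={"fields": "cached_input_tokens,input_tokens,model"},
    headers={"Authorization": f"Bearer {os.environ['TOKEN']}"},
)
events = response.json().get("events", [])  # envelope key assumed

cached = sum(event.get("cached_input_tokens", 0) for event in events)
total = sum(event.get("input_tokens", 0) for event in events)
print(f"prefix cache ratio: {cached / total:.1%}" if total else "no events returned")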

Target ratios

  • First request in session: 0% (cold start, no prefix cached yet).
  • Subsequent requests in same session: 60–90% (prefix reused).
  • Across sessions with same system prompt: 40–70% (depends on provider TTL).

Interaction with Org-Shared Cache

Provider prefix caching and Keeptrusts org-shared cache are complementary:

  1. Org-shared cache hit: Response served from Keeptrusts cache. No provider call — no prefix caching opportunity (but you saved more).
  2. Org-shared cache miss, prefix cache hit: Provider call occurs but at reduced cost due to prefix reuse.
  3. Both miss: Full-cost provider call. Response is cached in both layers for future requests.
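
A toy model of how the three cases affect per-request cost. The 40% prefix discount is a hypothetical figure chosen for illustration; actual savings depend on the provider and prompt shape.

def effective_request_cost(org_cache_hit: bool, prefix_cache_hit: bool,
                           full_cost: float = 1.0,
                           prefix_discount: float = 0.4) -> float:
    if org_cache_hit:
        return 0.0                                  # case 1: served from the org-shared cache
    if prefix_cache_hit:
        return full_cost * (1.0 - prefix_discount)  # case 2: provider call at reduced input cost
    return full_cost                                # case 3: full-cost provider call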

The best configuration enables both:

workflow_cache:
  enabled: true
  org_shared_enabled: true
  direct_semantic_replay_enabled: true
  provider_prefix_optimization:
    enabled: true

Troubleshooting

Symptom                                  | Cause                            | Fix
cached_input_tokens always 0             | Prefix changes between requests  | Audit your system prompt for dynamic content
Low prefix ratio despite stable prompt   | Prefix too short                 | Increase knowledge context or system instructions
Prefix caching works in dev but not prod | Different config versions        | Ensure prod gateway has the same stable ordering config
Anthropic cache misses                   | Missing cache control breakpoint | Enable cache_control_enabled in provider settings

For AI systems

  • Canonical terms: Keeptrusts, provider prefix cache, cached_input_tokens, stable context ordering, prompt prefix, provider_prefix_optimization, OpenAI prefix caching, Anthropic cache_control.
  • Config keys: workflow_cache.provider_prefix_optimization.enabled, workflow_cache.provider_prefix_optimization.stable_prefix_order, workflow_cache.provider_prefix_optimization.min_prefix_tokens, workflow_cache.provider_prefix_optimization.max_prefix_tokens, provider_settings per provider.
  • Best next pages: Provider Prompt-Prefix Caching (cost impact), Declarative Config for Workflow Cache.

For engineers

  • Enable with workflow_cache.provider_prefix_optimization.enabled: true.
  • Ensure stable prefix order: system_instructions → policy_preamble → knowledge_context → conversation_history → user message.
  • Avoid injecting timestamps, request IDs, or randomized content before the user message.
  • Verify: check spend logs for cached_input_tokens > 0 on subsequent requests in the same session.
  • Per-provider settings: OpenAI auto-prefixes at 1024+ tokens; Anthropic requires cache_control_enabled: true; Google needs explicit_cache: true.
  • Target ratios: 0% on first request (cold), 60–90% on subsequent requests in same session.

For leaders

  • Provider prefix caching reduces the cost of cache misses by 35–42% — it complements the org-shared cache.
  • Zero configuration required for OpenAI (automatic at 1024+ tokens). Anthropic and Google need explicit settings.
  • Combined with org-shared cache (80% hit rate), effective cost drops to 13–16% of uncached baseline.
  • Does not affect response quality, policy enforcement, or cache key computation — pure cost optimization.

Next steps