Provider Prompt-Prefix Caching
Provider-side prompt-prefix caching is an optimization layer below the Keeptrusts org-shared cache. When a cache miss sends a request upstream, stable context ordering helps the provider serve portions of the prompt from its own internal cache — reducing your per-request cost even on misses.
Use this page when
- You want to understand how provider-side prompt-prefix caching saves tokens even on cache misses.
- You are configuring Keeptrusts context ordering to maximize provider KV cache hits.
- You need to quantify the additional cost reduction from stable prefix matching.
Primary audience
- Primary: Technical Leaders
- Secondary: Technical Engineers, AI Agents
How Provider Prefix Caching Works
Major LLM providers cache the computed key-value (KV) representations of prompt prefixes:
- OpenAI — Caches prompt prefixes and reports `cached_input_tokens` in the response usage
- Anthropic — Supports explicit cache breakpoints and reports `cache_read_input_tokens`
- Google — Offers context caching with explicit TTLs
When consecutive requests share the same prefix (system prompt, codebase context, instructions), the provider skips recomputation of those tokens. You pay a reduced rate — typically 50% off input token pricing — for the cached portion.
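The billing arithmetic can be sketched in a few lines. This is an illustration, not a Keeptrusts API; the per-token rate and the 50% discount are assumptions taken from the typical pricing described above:

```python
def effective_input_cost(input_tokens: int, cached_tokens: int,
                         rate_per_token: float,
                         cached_discount: float = 0.5) -> float:
    """Input cost after the provider's cached-token discount.

    cached_discount is the fraction knocked off the rate for cached
    tokens (0.5 models the typical 50% discount mentioned above).
    """
    uncached = input_tokens - cached_tokens
    return (uncached * rate_per_token
            + cached_tokens * rate_per_token * (1 - cached_discount))

# 10,000 input tokens, 8,000 served from the prefix cache, at an
# assumed $3 per 1M input tokens:
cost = effective_input_cost(10_000, 8_000, 3.0e-6)
# full price would be $0.030; with the discount it is $0.018
```

The cached portion is billed at half rate, so the larger the shared prefix, the closer the total input cost gets to 50% of list price.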
How Keeptrusts Enables Higher Provider Cache Hits
The Codebase Context Fabric assembles context for each request using a stable ordering algorithm:
- System instructions (fixed per gateway configuration)
- Codebase structure context (deterministic tree ordering)
- Relevant file contents (sorted by path)
- Conversation history
- User prompt
Because the ordering is deterministic across all engineers on the same codebase, requests from different team members produce identical prefixes up to the divergence point (typically the user prompt). This maximizes the prefix length that providers can cache.
Without stable ordering, equivalent context assembled in random order produces different token sequences — defeating provider-side caching entirely.
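The effect of deterministic ordering can be shown with a small sketch. The `assemble_context` helper is hypothetical, not the actual Codebase Context Fabric; it only illustrates that sorting by path makes the prefix independent of how the files happened to be collected:

```python
# Hypothetical sketch of stable context assembly: the same inputs always
# yield the same prefix, regardless of collection order.
def assemble_context(system: str, tree: str, files: dict,
                     history: list, user_prompt: str) -> str:
    parts = [system, tree]
    # Sort file contents by path so every engineer gets an identical ordering.
    parts += [files[path] for path in sorted(files)]
    parts += history
    parts.append(user_prompt)  # divergence point: only this differs per request
    return "\n".join(parts)

files = {"src/b.py": "B", "src/a.py": "A"}
ctx1 = assemble_context("sys", "tree", files, [], "prompt from engineer 1")
ctx2 = assemble_context("sys", "tree", dict(reversed(files.items())),
                        [], "prompt from engineer 2")

# Identical shared prefix up to the user prompt, despite different
# dict insertion order and different engineers:
assert ctx1.rsplit("\n", 1)[0] == ctx2.rsplit("\n", 1)[0]
```

Two engineers on the same codebase thus send byte-identical prefixes, which is exactly what the provider's KV cache matches on.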
Viewing Cached Token Metrics
In the console, navigate to Cost Center → Spend Logs to see per-request token breakdowns:
| Field | Description |
|---|---|
| `input_tokens` | Total input tokens sent to the provider |
| `cached_input_tokens` | Tokens served from the provider's prefix cache |
| `output_tokens` | Tokens generated by the provider |
| `effective_input_cost` | Actual cost after the cached-token discount |
A typical request with good prefix caching:
```json
{
  "model": "gpt-4o",
  "input_tokens": 4200,
  "cached_input_tokens": 3800,
  "output_tokens": 650,
  "effective_input_cost": 0.0069,
  "full_input_cost_would_be": 0.0126
}
```
In this example, roughly 90% of input tokens hit the provider's cache. With those 3,800 tokens billed at the typical 50% discount, input cost drops from $0.0126 to $0.0069, an overall reduction of about 45%.
Relationship to Org-Shared Cache
Provider prefix caching and the Keeptrusts org-shared cache operate at different levels:
| Layer | Scope | Effect |
|---|---|---|
| Org-shared cache | Full request+response | Eliminates the upstream call entirely |
| Provider prefix cache | Input token prefix only | Reduces cost of the upstream call |
The evaluation order:
1. Request arrives at gateway
2. Org-shared cache lookup → Hit? Return cached response. Done.
3. Cache miss → Forward to provider with stable-ordered context
4. Provider applies its own prefix caching → Reduced input token cost
5. Response returns → Stored in org-shared cache for future hits
Provider prefix caching only matters on cache misses. Once the org-shared cache reaches steady-state hit rates (70-90%), most requests never reach the provider at all.
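The evaluation order above can be sketched as a few lines of pseudo-gateway logic. The cache and provider objects here are hypothetical stand-ins, not Keeptrusts APIs:

```python
# Sketch of the two-layer evaluation order. org_cache stands in for the
# org-shared cache; provider stands in for the upstream LLM call.
def handle_request(request, org_cache: dict, provider):
    key = request["cache_key"]
    if key in org_cache:                  # org-shared cache hit: done,
        return org_cache[key]             # the provider is never called
    context = request["stable_context"]   # miss: forward stable-ordered context
    response = provider(context)          # provider applies its own prefix caching
    org_cache[key] = response             # store for future org-wide hits
    return response

calls = []
def fake_provider(ctx):
    calls.append(ctx)
    return "answer"

cache = {}
req = {"cache_key": "k1", "stable_context": "sys|tree|files|prompt"}
handle_request(req, cache, fake_provider)  # miss: goes upstream once
handle_request(req, cache, fake_provider)  # hit: served from org cache
assert len(calls) == 1
```

The second request never reaches `fake_provider`, which is why provider prefix caching only matters for the miss path.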
Correctness Guarantees
Provider prefix caching is a pure cost optimization. It does not affect:
- Response quality or content
- Policy enforcement (all policies run before and after the provider call)
- Cache key computation (org-shared cache keys are independent of provider behavior)
- Determinism of outputs (provider caching is transparent to the caller)
If a provider disables or changes their caching behavior, your system continues to function identically — you simply pay full input token rates on misses.
Maximizing Provider Cache Hits
To get the best provider-side cache rates:
- Keep system prompts stable — Avoid per-request randomization in system instructions
- Use consistent context ordering — The Codebase Context Fabric handles this automatically
- Minimize context churn — Rapid file changes reduce prefix overlap between requests
- Batch similar work — Engineers working on the same area produce similar prefixes
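The first point is worth seeing concretely. A hypothetical example of why per-request randomization hurts: a nonce at the start of the system prompt changes the very first tokens, so no two requests share any prefix:

```python
import random
import string

# Illustrative only: a nonce in the system prompt defeats prefix matching.
def system_prompt(randomize: bool) -> str:
    nonce = "".join(random.choices(string.ascii_lowercase, k=8)) if randomize else ""
    return f"{nonce}You are a coding assistant for this repository."

stable = {system_prompt(randomize=False) for _ in range(100)}
churny = {system_prompt(randomize=True) for _ in range(100)}

assert len(stable) == 1   # identical prefix every time: fully cacheable
assert len(churny) > 1    # randomization produces a new prefix per request
```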
Estimating Provider Cache Savings
For a team of 100 engineers with 80% org-shared cache hit rate:
- 20% of requests go upstream (cache misses)
- Of those, ~70-85% of input tokens typically hit provider prefix cache
- Effective input cost reduction on misses: ~35-42%
Combined with org-shared cache:
| Scenario | Cost vs Baseline |
|---|---|
| No caching | 100% |
| Provider prefix cache only | ~80% |
| Org-shared cache only (80% hit) | ~20% |
| Both layers combined | ~13-16% |
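The table's arithmetic can be checked with a back-of-envelope sketch. The 75% input/output cost split is an assumption not stated in the table; the other figures come from the scenario above:

```python
# Back-of-envelope check of the combined-savings table.
org_hit_rate = 0.80          # org-shared cache hit rate
prefix_hit_fraction = 0.75   # miss-request input tokens served from provider cache
prefix_discount = 0.50       # typical discount on cached input tokens
input_fraction = 0.75        # ASSUMED share of request cost that is input tokens

# Cost of one upstream (miss) request, relative to an uncached request:
miss_cost = (input_fraction * (1 - prefix_hit_fraction * prefix_discount)
             + (1 - input_fraction))          # ~0.72 of baseline

prefix_only = miss_cost                       # every request goes upstream
org_only = 1 - org_hit_rate                   # 0.20 of baseline
combined = (1 - org_hit_rate) * miss_cost     # ~0.14, within the 13-16% row
```

With these assumptions the combined figure lands at roughly 14% of baseline; the exact value shifts with the input/output split and the per-provider discount, which is why the table quotes a range.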
The org-shared cache delivers the dominant savings. Provider prefix caching provides incremental benefit on the remaining misses.
Provider-Specific Behavior
OpenAI
Reports `cached_tokens` in the `usage.prompt_tokens_details` object. Cached tokens are billed at 50% of the standard input rate. Caching is automatic — no explicit opt-in required.
Anthropic
Supports explicit `cache_control` breakpoints. Reports `cache_read_input_tokens` in usage. Cached tokens billed at approximately 10% of standard input rate. Keeptrusts inserts cache breakpoints at optimal boundaries automatically.
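Keeptrusts places these breakpoints for you; for reference, a manually placed breakpoint in a raw Anthropic Messages API request looks roughly like this (the model name and prompt strings are placeholders):

```python
# Sketch of an Anthropic Messages API payload with an explicit cache
# breakpoint. Everything up to and including the block marked with
# cache_control is eligible for prefix caching.
payload = {
    "model": "claude-sonnet-example",  # placeholder model name
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": "<stable system instructions>"},
        {
            "type": "text",
            "text": "<codebase context, sorted by path>",
            "cache_control": {"type": "ephemeral"},  # breakpoint goes here
        },
    ],
    "messages": [
        {"role": "user", "content": "<user prompt, the divergence point>"}
    ],
}
```

Placing the breakpoint after the stable context and before the user prompt maximizes the cached span while keeping the per-request portion outside it.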
Google (Gemini)
Offers explicit context caching with configurable TTLs. Cached tokens billed at 75% discount. Requires explicit cache creation — Keeptrusts manages cache lifecycle per gateway configuration.
Monitoring in the Console
The Cost Center → Provider Efficiency view shows:
- Provider cache hit rate over time
- Average cached prefix length per model
- Savings from provider caching (separate from org-shared cache savings)
- Trend line showing prefix cache effectiveness as codebase context stabilizes
Next steps
- Direct API Cost vs Cached Cost — see combined savings across all layers
- Single-Flight Fill — reduce fill cost further with request deduplication
- Tracking Avoided Cost — monitor org-shared cache savings
For AI systems
- Canonical terms: Keeptrusts, provider prefix caching, KV cache, cached_input_tokens, stable context ordering, prompt prefix, OpenAI cached tokens, Anthropic prompt caching.
- Config keys: `cache.provider_prefix_optimization: true`, `cache.context_ordering: stable`.
- Metrics: `cached_input_tokens`, provider cache hit rate, cached prefix length, per-model caching behavior.
- Best next pages: Provider Prefix Tuning, Direct API Cost vs Cached Cost.
For engineers
- Enable with `cache.provider_prefix_optimization: true` — Keeptrusts reorders system prompt, fabric context, and user content into a deterministic prefix.
- OpenAI: Automatic for prompts > 1024 tokens. 50% discount on cached tokens. TTL: 5–10 minutes of inactivity.
- Anthropic: Explicit cache-control breakpoints. 90% discount on cached tokens. TTL: 5 minutes.
- DeepSeek: 90%+ discount on cached tokens (DeepSeek R1, V3). Automatic, no config needed.
- Monitor via `cached_input_tokens` in usage responses. Console shows provider cache rate under Cost Center → Provider Efficiency.
- Works in parallel with org-shared cache: prefix caching saves tokens on misses, org-shared cache eliminates calls entirely on hits.
For leaders
- Provider prefix caching delivers 30–50% token cost reduction on cache misses — cost savings without any behavior change.
- Combined with the org-shared cache at an 80% hit rate, effective cost drops to roughly 13–16% of baseline, an overall reduction of about 85%.
- No additional spend to enable — just configuration. Savings appear immediately in provider billing.
- The savings compound as team size grows: more engineers = more stable prefix patterns = higher prefix cache hit rates.
- Monitor in the console: Cost Center → Provider Efficiency shows prefix savings separately from org-shared savings.