Provider Prompt-Prefix Caching
Provider-side prompt-prefix caching is an optimization layer below the Keeptrusts org-shared cache. When a cache miss sends a request upstream, stable context ordering helps the provider serve portions of the prompt from its own internal cache — reducing your per-request cost even on misses.
Use this page when
- You want to understand how provider-side prompt-prefix caching saves tokens even on cache misses.
- You are configuring Keeptrusts context ordering to maximize provider KV cache hits.
- You need to quantify the additional cost reduction from stable prefix matching.
Primary audience
- Primary: Technical Leaders
- Secondary: Technical Engineers, AI Agents
How Provider Prefix Caching Works
Major LLM providers cache the computed key-value (KV) representations of prompt prefixes:
- OpenAI — Caches prompt prefixes and reports `cached_input_tokens` in the response usage
- Anthropic — Supports explicit cache breakpoints and reports `cache_read_input_tokens`
- Google — Offers context caching with explicit TTLs
When consecutive requests share the same prefix (system prompt, codebase context, instructions), the provider skips recomputation of those tokens. You pay a reduced rate — typically 50% off input token pricing — for the cached portion.
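The billing arithmetic can be sketched in a few lines. This is an illustration, not a Keeptrusts API; the per-token rate and the 50% discount are assumptions taken from the typical pricing described above:

```python
def effective_input_cost(input_tokens: int, cached_tokens: int,
                         rate_per_token: float,
                         cached_discount: float = 0.5) -> float:
    """Input cost after the provider's cached-token discount.

    cached_discount is the fraction knocked off the rate for cached
    tokens (0.5 models the typical 50% discount mentioned above).
    """
    uncached = input_tokens - cached_tokens
    return (uncached * rate_per_token
            + cached_tokens * rate_per_token * (1 - cached_discount))

# 10,000 input tokens, 8,000 served from the prefix cache, at an
# assumed $3 per 1M input tokens:
cost = effective_input_cost(10_000, 8_000, 3.0e-6)
# full price would be $0.030; with the discount it is $0.018
```

The cached portion is billed at half rate, so the larger the shared prefix, the closer the total input cost gets to 50% of list price.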
How Keeptrusts Enables Higher Provider Cache Hits
The Codebase Context Fabric assembles context for each request using a stable ordering algorithm:
- System instructions (fixed per gateway configuration)
- Codebase structure context (deterministic tree ordering)
- Relevant file contents (sorted by path)
- Conversation history
- User prompt
Because the ordering is deterministic across all engineers on the same codebase, requests from different team members produce identical prefixes up to the divergence point (typically the user prompt). This maximizes the prefix length that providers can cache.
Without stable ordering, equivalent context assembled in random order produces different token sequences — defeating provider-side caching entirely.
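The effect of deterministic ordering can be shown with a small sketch. The `assemble_context` helper is hypothetical, not the actual Codebase Context Fabric; it only illustrates that sorting by path makes the prefix independent of how the files happened to be collected:

```python
# Hypothetical sketch of stable context assembly: the same inputs always
# yield the same prefix, regardless of collection order.
def assemble_context(system: str, tree: str, files: dict,
                     history: list, user_prompt: str) -> str:
    parts = [system, tree]
    # Sort file contents by path so every engineer gets an identical ordering.
    parts += [files[path] for path in sorted(files)]
    parts += history
    parts.append(user_prompt)  # divergence point: only this differs per request
    return "\n".join(parts)

files = {"src/b.py": "B", "src/a.py": "A"}
ctx1 = assemble_context("sys", "tree", files, [], "prompt from engineer 1")
ctx2 = assemble_context("sys", "tree", dict(reversed(files.items())),
                        [], "prompt from engineer 2")

# Identical shared prefix up to the user prompt, despite different
# dict insertion order and different engineers:
assert ctx1.rsplit("\n", 1)[0] == ctx2.rsplit("\n", 1)[0]
```

Two engineers on the same codebase thus send byte-identical prefixes, which is exactly what the provider's KV cache matches on.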
Viewing Cached Token Metrics
In the console, navigate to Cost Center → Spend Logs to see per-request token breakdowns:
| Field | Description |
|---|---|
| `input_tokens` | Total input tokens sent to the provider |
| `cached_input_tokens` | Tokens served from the provider's prefix cache |
| `output_tokens` | Tokens generated by the provider |
| `effective_input_cost` | Actual cost after the cached-token discount |
A typical request with good prefix caching:
```json
{
  "model": "gpt-4o",
  "input_tokens": 4200,
  "cached_input_tokens": 3800,
  "output_tokens": 650,
  "effective_input_cost": 0.0069,
  "full_input_cost_would_be": 0.0126
}
```
In this example, roughly 90% of input tokens hit the provider's cache. With those 3,800 tokens billed at the typical 50% discount, input cost drops from $0.0126 to $0.0069, an overall reduction of about 45%.
Relationship to Org-Shared Cache
Provider prefix caching and the Keeptrusts org-shared cache operate at different levels:
| Layer | Scope | Effect |
|---|---|---|
| Org-shared cache | Full request+response | Eliminates the upstream call entirely |
| Provider prefix cache | Input token prefix only | Reduces cost of the upstream call |
The evaluation order:
1. Request arrives at gateway
2. Org-shared cache lookup → Hit? Return cached response. Done.
3. Cache miss → Forward to provider with stable-ordered context
4. Provider applies its own prefix caching → Reduced input token cost
5. Response returns → Stored in org-shared cache for future hits
Provider prefix caching only matters on cache misses. Once the org-shared cache reaches steady-state hit rates (70-90%), most requests never reach the provider at all.
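The evaluation order above can be sketched as a few lines of pseudo-gateway logic. The cache and provider objects here are hypothetical stand-ins, not Keeptrusts APIs:

```python
# Sketch of the two-layer evaluation order. org_cache stands in for the
# org-shared cache; provider stands in for the upstream LLM call.
def handle_request(request, org_cache: dict, provider):
    key = request["cache_key"]
    if key in org_cache:                  # org-shared cache hit: done,
        return org_cache[key]             # the provider is never called
    context = request["stable_context"]   # miss: forward stable-ordered context
    response = provider(context)          # provider applies its own prefix caching
    org_cache[key] = response             # store for future org-wide hits
    return response

calls = []
def fake_provider(ctx):
    calls.append(ctx)
    return "answer"

cache = {}
req = {"cache_key": "k1", "stable_context": "sys|tree|files|prompt"}
handle_request(req, cache, fake_provider)  # miss: goes upstream once
handle_request(req, cache, fake_provider)  # hit: served from org cache
assert len(calls) == 1
```

The second request never reaches `fake_provider`, which is why provider prefix caching only matters for the miss path.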
Correctness Guarantees
Provider prefix caching is a pure cost optimization. It does not affect:
- Response quality or content
- Policy enforcement (all policies run before and after the provider call)
- Cache key computation (org-shared cache keys are independent of provider behavior)
- Determinism of outputs (provider caching is transparent to the caller)
If a provider disables or changes their caching behavior, your system continues to function identically — you simply pay full input token rates on misses.
Maximizing Provider Cache Hits
To get the best provider-side cache rates:
- Keep system prompts stable — Avoid per-request randomization in system instructions
- Use consistent context ordering — The Codebase Context Fabric handles this automatically
- Minimize context churn — Rapid file changes reduce prefix overlap between requests
- Batch similar work — Engineers working on the same area produce similar prefixes
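The first point is worth seeing concretely. A hypothetical example of why per-request randomization hurts: a nonce at the start of the system prompt changes the very first tokens, so no two requests share any prefix:

```python
import random
import string

# Illustrative only: a nonce in the system prompt defeats prefix matching.
def system_prompt(randomize: bool) -> str:
    nonce = "".join(random.choices(string.ascii_lowercase, k=8)) if randomize else ""
    return f"{nonce}You are a coding assistant for this repository."

stable = {system_prompt(randomize=False) for _ in range(100)}
churny = {system_prompt(randomize=True) for _ in range(100)}

assert len(stable) == 1   # identical prefix every time: fully cacheable
assert len(churny) > 1    # randomization produces a new prefix per request
```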
Estimating Provider Cache Savings
For a team of 100 engineers with 80% org-shared cache hit rate:
- 20% of requests go upstream (cache misses)
- Of those, ~70-85% of input tokens typically hit provider prefix cache
- Effective input cost reduction on misses: ~35-42%
Combined with org-shared cache:
| Scenario | Cost vs Baseline |
|---|---|
| No caching | 100% |
| Provider prefix cache only | ~80% |
| Org-shared cache only (80% hit) | ~20% |
| Both layers combined | ~13-16% |
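The table's arithmetic can be checked with a back-of-envelope sketch. The 75% input/output cost split is an assumption not stated in the table; the other figures come from the scenario above:

```python
# Back-of-envelope check of the combined-savings table.
org_hit_rate = 0.80          # org-shared cache hit rate
prefix_hit_fraction = 0.75   # miss-request input tokens served from provider cache
prefix_discount = 0.50       # typical discount on cached input tokens
input_fraction = 0.75        # ASSUMED share of request cost that is input tokens

# Cost of one upstream (miss) request, relative to an uncached request:
miss_cost = (input_fraction * (1 - prefix_hit_fraction * prefix_discount)
             + (1 - input_fraction))          # ~0.72 of baseline

prefix_only = miss_cost                       # every request goes upstream
org_only = 1 - org_hit_rate                   # 0.20 of baseline
combined = (1 - org_hit_rate) * miss_cost     # ~0.14, within the 13-16% row
```

With these assumptions the combined figure lands at roughly 14% of baseline; the exact value shifts with the input/output split and the per-provider discount, which is why the table quotes a range.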
The org-shared cache delivers the dominant savings. Provider prefix caching provides incremental benefit on the remaining misses.
Provider-Specific Behavior
OpenAI
Reports `cached_tokens` in the `usage.prompt_tokens_details` object. Cached tokens are billed at 50% of the standard input rate. Caching is automatic — no explicit opt-in required.
Anthropic
Supports explicit `cache_control` breakpoints. Reports `cache_read_input_tokens` in usage. Cached tokens billed at approximately 10% of standard input rate. Keeptrusts inserts cache breakpoints at optimal boundaries automatically.
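Keeptrusts places these breakpoints for you; for reference, a manually placed breakpoint in a raw Anthropic Messages API request looks roughly like this (the model name and prompt strings are placeholders):

```python
# Sketch of an Anthropic Messages API payload with an explicit cache
# breakpoint. Everything up to and including the block marked with
# cache_control is eligible for prefix caching.
payload = {
    "model": "claude-sonnet-example",  # placeholder model name
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": "<stable system instructions>"},
        {
            "type": "text",
            "text": "<codebase context, sorted by path>",
            "cache_control": {"type": "ephemeral"},  # breakpoint goes here
        },
    ],
    "messages": [
        {"role": "user", "content": "<user prompt, the divergence point>"}
    ],
}
```

Placing the breakpoint after the stable context and before the user prompt maximizes the cached span while keeping the per-request portion outside it.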
Google (Gemini)
Offers explicit context caching with configurable TTLs. Cached tokens billed at 75% discount. Requires explicit cache creation — Keeptrusts manages cache lifecycle per gateway configuration.
Monitoring in the Console
The Cost Center → Provider Efficiency view shows:
- Provider cache hit rate over time
- Average cached prefix length per model
- Savings from provider caching (separate from org-shared cache savings)
- Trend line showing prefix cache effectiveness as codebase context stabilizes
Next steps
- Direct API Cost vs Cached Cost — see combined savings across all layers
- Single-Flight Fill — reduce fill cost further with request deduplication
- Tracking Avoided Cost — monitor org-shared cache savings
For AI systems
- Canonical terms: Keeptrusts, provider prefix caching, KV cache, cached_input_tokens, stable context ordering, prompt prefix, OpenAI cached tokens, Anthropic prompt caching.
- Config keys: `cache.provider_prefix_optimization: true`, `cache.context_ordering: stable`.
- Metrics: `cached_input_tokens`, provider cache hit rate, cached prefix length, per-model caching behavior.
- Best next pages: Provider Prefix Tuning, Direct API Cost vs Cached Cost.
For engineers
- Enable with `cache.provider_prefix_optimization: true` — Keeptrusts reorders system prompt, fabric context, and user content into a deterministic prefix.
- OpenAI: Automatic for prompts > 1024 tokens. 50% discount on cached tokens. TTL: 5–10 minutes of inactivity.
- Anthropic: Explicit cache-control breakpoints. 90% discount on cached tokens. TTL: 5 minutes.
- DeepSeek: 90%+ discount on cached tokens (DeepSeek R1, V3). Automatic, no config needed.
- Monitor via `cached_input_tokens` in usage responses. Console shows provider cache rate under Cost Center → Provider Efficiency.
- Works in parallel with org-shared cache: prefix caching saves tokens on misses, org-shared cache eliminates calls entirely on hits.
For leaders
- Provider prefix caching delivers 30–50% token cost reduction on cache misses — cost savings without any behavior change.
- Combined with the org-shared cache at an 80% hit rate, effective cost drops to roughly 13–16% of baseline, an overall reduction of about 85%.
- No additional spend to enable — just configuration. Savings appear immediately in provider billing.
- The savings compound as team size grows: more engineers = more stable prefix patterns = higher prefix cache hit rates.
- Monitor in the console: Cost Center → Provider Efficiency shows prefix savings separately from org-shared savings.