Provider Prompt-Prefix Caching

Provider-side prompt-prefix caching is an optimization layer below the Keeptrusts org-shared cache. When a cache miss sends a request upstream, stable context ordering helps the provider serve portions of the prompt from its own internal cache — reducing your per-request cost even on misses.

Use this page when

  • You want to understand how provider-side prompt-prefix caching saves tokens even on cache misses.
  • You are configuring Keeptrusts context ordering to maximize provider KV cache hits.
  • You need to quantify the additional cost reduction from stable prefix matching.

Primary audience

  • Primary: Technical Leaders
  • Secondary: Technical Engineers, AI Agents

How Provider Prefix Caching Works

Major LLM providers cache the computed key-value (KV) representations of prompt prefixes:

  • OpenAI — Caches prompt prefixes automatically and reports cached_tokens in the response usage
  • Anthropic — Supports explicit cache breakpoints and reports cache_read_input_tokens
  • Google — Offers context caching with explicit TTLs

When consecutive requests share the same prefix (system prompt, codebase context, instructions), the provider skips recomputation of those tokens. You pay a reduced rate for the cached portion: typically 50–90% off input token pricing, depending on the provider.
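
If you call the provider directly, the same counters appear in the provider's own usage object. A minimal sketch (Python, OpenAI SDK; the prompt content is invented for illustration):

# Sketch: reading OpenAI's cached-token count from a Chat Completions response.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
system_prompt = "You are a code assistant for the acme/api repository. " * 200  # stable prefix, >1024 tokens

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Summarize the open TODOs in this module."},
    ],
)

usage = response.usage
cached = usage.prompt_tokens_details.cached_tokens  # tokens served from OpenAI's prefix cache
print(f"{cached}/{usage.prompt_tokens} input tokens were cached")

On a second call sharing the same system prompt, cached_tokens rises to cover the shared prefix.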

How Keeptrusts Enables Higher Provider Cache Hits

The Codebase Context Fabric assembles context for each request using a stable ordering algorithm:

  1. System instructions (fixed per gateway configuration)
  2. Codebase structure context (deterministic tree ordering)
  3. Relevant file contents (sorted by path)
  4. Conversation history
  5. User prompt

Because the ordering is deterministic across all engineers on the same codebase, requests from different team members produce identical prefixes up to the divergence point (typically the user prompt). This maximizes the prefix length that providers can cache.

Without stable ordering, equivalent context assembled in random order produces different token sequences — defeating provider-side caching entirely.
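
A simplified sketch of the idea (not the Fabric's actual implementation; the function and its inputs are illustrative): build the prompt from sorted, deterministic pieces so identical inputs always yield byte-identical prefixes.

def assemble_context(system_prompt: str, files: dict[str, str],
                     history: list[str], user_prompt: str) -> str:
    # Identical inputs must produce byte-identical output, so the provider's
    # prefix cache can match across engineers on the same codebase.
    parts = [system_prompt]                          # 1. fixed per gateway config
    parts.append("\n".join(sorted(files)))           # 2. deterministic tree ordering
    for path in sorted(files):                       # 3. file contents sorted by path
        parts.append(f"--- {path} ---\n{files[path]}")
    parts.extend(history)                            # 4. conversation history
    parts.append(user_prompt)                        # 5. divergence point
    return "\n\n".join(parts)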

Viewing Cached Token Metrics

In the console, navigate to Cost Center → Spend Logs to see per-request token breakdowns:

  • input_tokens — Total input tokens sent to the provider
  • cached_input_tokens — Tokens served from the provider's prefix cache
  • output_tokens — Tokens generated by the provider
  • effective_input_cost — Actual input cost after the cached-token discount

A typical request with good prefix caching:

{
  "model": "gpt-4o",
  "input_tokens": 4200,
  "cached_input_tokens": 3800,
  "output_tokens": 650,
  "effective_input_cost": 0.0069,
  "full_input_cost_would_be": 0.0126
}

In this example, roughly 90% of input tokens hit the provider's cache. Billed at half the standard rate, those 3,800 tokens cut the effective input cost from $0.0126 to $0.0069, a reduction of about 45%.

Relationship to Org-Shared Cache

Provider prefix caching and the Keeptrusts org-shared cache operate at different levels:

  • Org-shared cache — covers the full request and response; a hit eliminates the upstream call entirely
  • Provider prefix cache — covers the input token prefix only; reduces the cost of the upstream call

The evaluation order:

  1. Request arrives at gateway
  2. Org-shared cache lookup → Hit? Return cached response. Done.
  3. Cache miss → Forward to provider with stable-ordered context
  4. Provider applies its own prefix caching → Reduced input token cost
  5. Response returns → Stored in org-shared cache for future hits

Provider prefix caching only matters on cache misses. Once the org-shared cache reaches steady-state hit rates (70-90%), most requests never reach the provider at all.
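
In code terms, the two layers compose roughly like this (a sketch under assumed names; org_cache, provider, and the key derivation are illustrative, not Keeptrusts internals):

import hashlib

def handle_request(request, org_cache, provider):
    # Steps 1-2: org-shared cache lookup; a hit never reaches the provider.
    key = hashlib.sha256(request.canonical_bytes()).hexdigest()
    cached = org_cache.get(key)
    if cached is not None:
        return cached  # zero upstream cost
    # Steps 3-4: miss; forward stable-ordered context so the provider's own
    # prefix cache discounts the unchanged leading tokens.
    response = provider.complete(request.stable_ordered_context())
    # Step 5: store for future org-wide hits.
    org_cache.set(key, response)
    return response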

Correctness Guarantees

Provider prefix caching is a pure cost optimization. It does not affect:

  • Response quality or content
  • Policy enforcement (all policies run before and after the provider call)
  • Cache key computation (org-shared cache keys are independent of provider behavior)
  • Determinism of outputs (provider caching is transparent to the caller)

If a provider disables or changes their caching behavior, your system continues to function identically — you simply pay full input token rates on misses.

Maximizing Provider Cache Hits

To get the best provider-side cache rates:

  1. Keep system prompts stable — Avoid per-request randomization in system instructions (see the snippet after this list)
  2. Use consistent context ordering — The Codebase Context Fabric handles this automatically
  3. Minimize context churn — Rapid file changes reduce prefix overlap between requests
  4. Batch similar work — Engineers working on the same area produce similar prefixes
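
To make the first point concrete: interpolating volatile data such as a timestamp into the system prompt changes the very first tokens of every request and invalidates the whole prefix (illustrative snippets):

from datetime import datetime, timezone

# Defeats prefix caching: the first tokens differ on every request.
volatile_prompt = f"[{datetime.now(timezone.utc).isoformat()}] You are a code assistant."

# Cache-friendly: keep the shared prefix static and append volatile data
# after it (e.g. in the user message).
stable_prompt = "You are a code assistant."
user_message = f"Current time: {datetime.now(timezone.utc).isoformat()}\n\nReview the attached diff."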

Estimating Provider Cache Savings

For a team of 100 engineers with 80% org-shared cache hit rate:

  • 20% of requests go upstream (cache misses)
  • Of those, ~70-85% of input tokens typically hit provider prefix cache
  • Effective input cost reduction on misses: ~35-42%

Combined with org-shared cache:

  • No caching — 100% of baseline cost
  • Provider prefix cache only — ~80%
  • Org-shared cache only (80% hit rate) — ~20%
  • Both layers combined — ~13-16%

The org-shared cache delivers the dominant savings. Provider prefix caching provides incremental benefit on the remaining misses.
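
The combined row follows from simple arithmetic. A back-of-envelope sketch (the per-miss cost range is implied by the table's "~80%" row; output tokens, which are never discounted, keep it above the pure input-token figure):

org_miss_rate = 0.20                         # 80% org-shared cache hit rate
miss_cost_low, miss_cost_high = 0.65, 0.80   # total cost of a prefix-cached miss vs baseline

combined_low = org_miss_rate * miss_cost_low    # 0.13
combined_high = org_miss_rate * miss_cost_high  # 0.16
print(f"combined: ~{combined_low:.0%}-{combined_high:.0%} of baseline")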

Provider-Specific Behavior

OpenAI

Reports cached_tokens in the usage.prompt_tokens_details object. Cached tokens are billed at 50% of the standard input rate. Caching is automatic — no explicit opt-in required.

Anthropic

Supports explicit cache_control breakpoints. Reports cache_read_input_tokens in usage. Cached tokens are billed at approximately 10% of the standard input rate. Keeptrusts inserts cache breakpoints at optimal boundaries automatically.
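
When calling Anthropic directly rather than through the gateway, a breakpoint looks like this (Python, Anthropic SDK; the context string is a stand-in):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
codebase_context = "def handle(request): ...\n" * 400  # stand-in; must exceed the model's caching minimum

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a code assistant.\n\n" + codebase_context,
            "cache_control": {"type": "ephemeral"},  # cache everything up to this breakpoint
        }
    ],
    messages=[{"role": "user", "content": "Explain the retry logic in handle()."}],
)

print(response.usage.cache_read_input_tokens)      # tokens read from the cache (~10% rate)
print(response.usage.cache_creation_input_tokens)  # tokens written to the cache on this call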

Google (Gemini)

Offers explicit context caching with configurable TTLs. Cached tokens are billed at a 75% discount. Requires explicit cache creation — Keeptrusts manages the cache lifecycle per gateway configuration.

Monitoring in the Console

The Cost Center → Provider Efficiency view shows:

  • Provider cache hit rate over time
  • Average cached prefix length per model
  • Savings from provider caching (separate from org-shared cache savings)
  • Trend line showing prefix cache effectiveness as codebase context stabilizes

Next steps

For AI systems

  • Canonical terms: Keeptrusts, provider prefix caching, KV cache, cached_input_tokens, stable context ordering, prompt prefix, OpenAI cached tokens, Anthropic prompt caching.
  • Config keys: cache.provider_prefix_optimization: true, cache.context_ordering: stable.
  • Metrics: cached_input_tokens, provider cache hit rate, cached prefix length, per-model caching behavior.
  • Best next pages: Provider Prefix Tuning, Direct API Cost vs Cached Cost.

For engineers

  • Enable with cache.provider_prefix_optimization: true — Keeptrusts reorders system prompt, fabric context, and user content into a deterministic prefix.
  • OpenAI: Automatic for prompts > 1024 tokens. 50% discount on cached tokens. TTL: 5–10 minutes of inactivity.
  • Anthropic: Explicit cache-control breakpoints. 90% discount on cached tokens. TTL: 5 minutes.
  • DeepSeek: 90%+ discount on cached tokens (DeepSeek R1, V3). Automatic, no config needed.
  • Monitor via cached_input_tokens in usage responses. Console shows provider cache rate under Cost Center → Provider Efficiency.
  • Works in parallel with org-shared cache: prefix caching saves tokens on misses, org-shared cache eliminates calls entirely on hits.

For leaders

  • Provider prefix caching delivers 30–50% token cost reduction on cache misses — cost savings without any behavior change.
  • Combined with the org-shared cache (80% hit rate), total cost falls to roughly 13–16% of baseline, an overall reduction of about 85%.
  • No additional spend to enable — just configuration. Savings appear immediately in provider billing.
  • The savings compound as team size grows: more engineers = more stable prefix patterns = higher prefix cache hit rates.
  • Monitor in the console: Cost Center → Provider Efficiency shows prefix savings separately from org-shared savings.