# Cache and Fabric Cost Adjustments in Estimates
Cache hits and Codebase Context Fabric retrieval both affect your cost estimates. A cache hit can reduce your estimated cost to near zero, while fabric retrieval adds context tokens that increase input cost. The pre-dispatch estimate breaks these down transparently so you understand exactly what drives the cost of each request.
## Use this page when

- You need to understand how cache hits (full or partial) reduce your pre-dispatch cost estimate.
- You are configuring `cache_hit_confidence_threshold` or `fabric_retrieval_cost_per_query`.
- You want to understand the combined cost breakdown (provider cost, cache savings, fabric cost) shown in the chat UI.
## Primary audience

- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
## Cache Hit Cost Adjustments

### Full Cache Hits
When the gateway detects a full cache hit — an exact key match with a valid TTL — the estimate reflects near-zero provider cost:
- `cache_savings_estimate` equals the full provider cost that would have been incurred.
- `estimated_total_cost` drops to the cache lookup overhead only (typically negligible).
- The `confidence` remains at the level determined by the tokenizer adapter, since the token count is still estimated for display purposes.
A full cache hit means the exact same prompt (including system message and context) was sent previously and the cached response is still valid. You pay no provider tokens — only the minimal overhead of cache key lookup and response delivery.
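As an illustrative sketch (the dictionary keys mirror this page's estimate fields, but the function itself is hypothetical), a full hit turns the entire provider cost into savings:

```python
def full_hit_estimate(provider_cost: float, lookup_overhead: float = 0.0) -> dict:
    """Full cache hit: the whole provider cost becomes savings and only
    the (typically negligible) cache lookup overhead remains."""
    return {
        "cache_savings_estimate": provider_cost,  # full cost avoided
        "estimated_total_cost": lookup_overhead,  # near zero
    }

est = full_hit_estimate(0.0105)
# est["estimated_total_cost"] is 0.0 — no provider tokens are billed
```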
### Partial Cache Hits
Partial cache hits occur when some context in the request matches cached content but the prompt itself is new. Common scenarios include:
- Cached Knowledge Base context — The KB assets injected into your request have a cached embedding or retrieval result, but your prompt text is new.
- Cached system message — The system message and static context are cached at the provider level (e.g., OpenAI prompt caching), reducing input token cost for that portion.
- Conversation history overlap — Previous turns in the conversation are cached, and only new tokens are billed by the provider.
For partial cache hits, the savings are proportional to the cached portion:
```
cache_savings_estimate = cached_token_count * input_cost_per_token
```
The estimate shows the reduced cost for cached tokens while billing the full rate for new tokens.
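A minimal sketch of the proportional formula above (hypothetical helper names, not the gateway's actual API):

```python
def partial_hit_costs(cached_tokens: int, new_tokens: int,
                      input_cost_per_token: float) -> dict:
    """Cached tokens generate savings; new tokens bill at the full rate."""
    return {
        "cache_savings_estimate": cached_tokens * input_cost_per_token,
        "billed_input_cost": new_tokens * input_cost_per_token,
    }

# 400 cached of 1,500 input tokens at $0.000003 per token
costs = partial_hit_costs(400, 1100, 0.000003)
```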
### Cache Confidence Threshold

Not all potential cache hits are reflected in the estimate. The gateway only includes cache savings when it is confident a hit will occur. This is controlled by `cost_estimation.cache_hit_confidence_threshold`:

```yaml
cost_estimation:
  cache_hit_confidence_threshold: 0.8
```
The confidence score (0.0 to 1.0) is based on:
- Key match certainty — Does the cache key exactly match, or is it a fuzzy match?
- TTL validity — How much time remains before the cached entry expires?
- Provider cache behavior — Does the provider guarantee cache hits for matching prefixes, or is caching best-effort?
When confidence is below the threshold, the estimate assumes no cache hit and shows the full provider cost. This prevents misleading estimates where a cache miss would result in an unexpectedly higher actual cost.
| Threshold | Behavior |
|---|---|
| `1.0` | Only show savings for guaranteed cache hits (exact match, valid TTL, deterministic provider) |
| `0.8` (default) | Show savings when a hit is highly likely |
| `0.5` | Show savings for probable hits (more optimistic estimates) |
| `0.0` | Always show potential savings regardless of confidence |
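The gating rule can be sketched as follows (illustrative only; the function and parameter names are assumptions):

```python
def savings_if_confident(provider_cost: float, potential_savings: float,
                         hit_confidence: float, threshold: float = 0.8) -> float:
    """Apply cache savings only when confidence clears the threshold;
    otherwise assume a miss and show the full provider cost."""
    if hit_confidence >= threshold:
        return provider_cost - potential_savings
    return provider_cost

savings_if_confident(0.0105, 0.0012, 0.95)  # savings applied
savings_if_confident(0.0105, 0.0012, 0.60)  # below 0.8: full cost shown
```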
## Fabric Retrieval Cost Adjustments

### How Fabric Context Affects Estimates
When Codebase Context Fabric is enabled, the gateway retrieves relevant context from the fabric index before assembling the prompt. This affects costs in two ways:
- Retrieval cost — The semantic search query against the fabric index has its own cost model.
- Additional input tokens — Retrieved context is injected into the prompt, increasing the input token count.
### Fabric Retrieval Cost

Each fabric query incurs a configurable cost:

```yaml
cost_estimation:
  fabric_retrieval_cost_per_query: 0.0
```
The default is 0.0 (free) because most self-hosted fabric indexes have no per-query cost. If you use a managed embedding service or external vector database with per-query pricing, set this to reflect the actual cost per retrieval operation.
The retrieval cost is fixed per query regardless of how many tokens are returned. It appears in the estimate as a separate line item.
### Fabric Context Token Cost

Retrieved fabric context tokens count toward input tokens at the standard model rate. The estimate includes these tokens in `estimated_input_tokens`:

```
total_input_tokens = prompt_tokens + system_message_tokens + kb_context_tokens + fabric_context_tokens + history_tokens
```
The gateway estimates fabric context size based on your configured retrieval parameters:
- `fabric.max_context_tokens` — Maximum tokens to retrieve from the fabric index.
- `fabric.top_k` — Number of chunks to retrieve.
- Historical average chunk size for your fabric index.
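The token sum above is a straight addition of prompt components; a sketch (parameter names mirror the formula, the function itself is an assumption):

```python
def estimate_input_tokens(prompt_tokens: int, system_message_tokens: int,
                          kb_context_tokens: int, fabric_context_tokens: int,
                          history_tokens: int) -> int:
    """Fabric context tokens count toward input at the standard model
    rate, alongside every other prompt component."""
    return (prompt_tokens + system_message_tokens + kb_context_tokens
            + fabric_context_tokens + history_tokens)

estimate_input_tokens(600, 200, 300, 250, 150)  # → 1500
```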
### Including or Excluding Fabric Costs

You control whether fabric costs appear in estimates:

```yaml
cost_estimation:
  include_fabric_costs: true
```
| Setting | Behavior |
|---|---|
| `true` (default) | Estimates include fabric retrieval cost and fabric context tokens |
| `false` | Estimates exclude fabric costs — useful if fabric is free and you want simpler estimates |
When set to `false`, fabric context tokens are still counted for accuracy in reconciliation, but they do not appear in the pre-dispatch estimate shown to users.
## Combined Cost Breakdown
The estimate provides a combined view that separates cost components:
| Field | Description |
|---|---|
| `provider_cost` | Base cost of sending the prompt to the LLM provider (input + output tokens) |
| `cache_savings` | Estimated reduction from cache hits (subtracted from provider cost) |
| `fabric_retrieval_cost` | Cost of querying the fabric index |
| `net_estimated_cost` | Final estimated cost: `provider_cost - cache_savings + fabric_retrieval_cost` |
This breakdown appears in the chat UI cost badge tooltip and in the Cost Center detail view.
### Example Breakdown

For a request with a partial cache hit and fabric context:

```
Provider cost (input):   $0.0045  (1,500 tokens × $0.000003)
Provider cost (output):  $0.0060  (500 tokens × $0.000012)
Cache savings:          -$0.0012  (400 cached tokens × $0.000003)
Fabric retrieval:        $0.0000  (free tier)
─────────────────────────────────
Net estimated cost:      $0.0093
```
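The example can be reproduced with a small sketch of the net-cost formula (a hypothetical function mirroring the breakdown fields, not the gateway's actual API):

```python
def net_estimated_cost(input_tokens: int, output_tokens: int,
                       input_rate: float, output_rate: float,
                       cached_tokens: int = 0,
                       fabric_retrieval_cost: float = 0.0) -> float:
    """net = provider_cost - cache_savings + fabric_retrieval_cost"""
    provider_cost = input_tokens * input_rate + output_tokens * output_rate
    cache_savings = cached_tokens * input_rate
    return provider_cost - cache_savings + fabric_retrieval_cost

# 1,500 input / 500 output tokens, 400 cached, free fabric tier
round(net_estimated_cost(1500, 500, 0.000003, 0.000012, cached_tokens=400), 6)
# → 0.0093
```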
## Configuration Reference

```yaml
cost_estimation:
  # Cache settings
  cache_hit_confidence_threshold: 0.8
  include_cache_savings_in_estimate: true

  # Fabric settings
  include_fabric_costs: true
  fabric_retrieval_cost_per_query: 0.0

  # Display settings
  show_cost_breakdown: true
  breakdown_components:
    - provider_cost
    - cache_savings
    - fabric_retrieval_cost
    - net_estimated_cost
```
| Setting | Default | Description |
|---|---|---|
| `cache_hit_confidence_threshold` | `0.8` | Minimum confidence to include cache savings in the estimate |
| `include_cache_savings_in_estimate` | `true` | Show cache savings in the estimate |
| `include_fabric_costs` | `true` | Include fabric retrieval and context costs |
| `fabric_retrieval_cost_per_query` | `0.0` | Per-query cost for fabric index searches |
| `show_cost_breakdown` | `true` | Show the component breakdown in the UI |
## Interaction With Wallet Reservations
When both cache savings and fabric costs are included in the estimate, the wallet reservation reflects the net amount:
- The reservation equals `net_estimated_cost`, not the gross `provider_cost`.
- If a cache hit does not materialize (confidence was below 1.0 and the hit failed), the actual cost exceeds the reservation. The wallet absorbs the difference at settlement.
- To avoid under-reservation, set `cache_hit_confidence_threshold` to `1.0` for conservative wallet management.
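A sketch of the settlement shortfall when an anticipated hit misses (illustrative; not the wallet's actual API):

```python
def settlement_shortfall(reserved: float, actual: float) -> float:
    """Amount the wallet absorbs at settlement when the actual cost
    exceeds the (net) reserved estimate, e.g. a missed cache hit."""
    return max(0.0, actual - reserved)

# Reserved $0.0093 assuming a cache hit; the hit missed, actual $0.0105
settlement_shortfall(0.0093, 0.0105)
```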
## Next steps
- Pre-Dispatch Prompt Cost Estimates — understand the full estimation flow.
- Token Estimation Across Providers — learn how token counts are estimated for different models.
- Estimate vs Actual Cost Reconciliation — see how estimates are compared to actual costs after the response arrives.
## For AI systems

- Canonical terms: Keeptrusts, cache cost adjustments, fabric cost adjustments, pre-dispatch estimates, full cache hit, partial cache hit, cache savings estimate, fabric retrieval cost, net estimated cost, cache confidence threshold.
- Feature/config names: `cost_estimation.cache_hit_confidence_threshold`, `cost_estimation.include_cache_savings_in_estimate`, `cost_estimation.include_fabric_costs`, `cost_estimation.fabric_retrieval_cost_per_query`, `cost_estimation.show_cost_breakdown`, `cache_savings_estimate`, `provider_cost`, `net_estimated_cost`.
- Best next pages: Pre-Dispatch Prompt Cost Estimates, Token Estimation Across Providers, Estimate vs Actual Cost Reconciliation.
## For engineers

- Set `cache_hit_confidence_threshold: 0.8` (default) for balanced estimates. Use `1.0` for conservative wallet management that avoids under-reservation.
- Set `fabric_retrieval_cost_per_query: 0.0` for self-hosted fabric indexes. Update this if using a managed embedding service with per-query pricing.
- Validate: check the cost badge tooltip in the chat UI for the component breakdown (provider cost, cache savings, fabric cost, net cost).
- Wallet interaction: reservations use the net estimated cost. If a cache hit fails to materialize, the wallet absorbs the difference at settlement.
## For leaders
- Cache savings in estimates give engineers real-time visibility into cache ROI before each request.
- Conservative confidence thresholds (1.0) prevent wallet under-reservation at the cost of less optimistic estimates shown to users.
- Fabric retrieval is typically free for self-hosted indexes — the cost impact comes from additional input tokens (fabric context), not the query itself.
- Cost breakdown transparency helps teams understand where spend goes: provider tokens vs. cache savings vs. fabric context overhead.