# Cache and Fabric Cost Adjustments in Estimates
Cache hits and Codebase Context Fabric retrieval both affect your cost estimates. A cache hit can reduce your estimated cost to near zero, while fabric retrieval adds context tokens that increase input cost. The pre-dispatch estimate breaks these down transparently so you understand exactly what drives the cost of each request.
## Use this page when

- You need to understand how cache hits (full or partial) reduce your pre-dispatch cost estimate.
- You are configuring `cache_hit_confidence_threshold` or `fabric_retrieval_cost_per_query`.
- You want to understand the combined cost breakdown (provider cost, cache savings, fabric cost) shown in the chat UI.
## Primary audience

- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
## Cache Hit Cost Adjustments

### Full Cache Hits
When the gateway detects a full cache hit — an exact key match with a valid TTL — the estimate reflects near-zero provider cost:
- `cache_savings_estimate` equals the full provider cost that would have been incurred.
- `estimated_total_cost` drops to the cache lookup overhead only (typically negligible).
- The `confidence` remains at the level determined by the tokenizer adapter, since the token count is still estimated for display purposes.
A full cache hit means the exact same prompt (including system message and context) was sent previously and the cached response is still valid. You pay no provider tokens — only the minimal overhead of cache key lookup and response delivery.
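As an illustrative sketch (the dictionary keys mirror this page's estimate fields, but the function itself is hypothetical), a full hit turns the entire provider cost into savings:

```python
def full_hit_estimate(provider_cost: float, lookup_overhead: float = 0.0) -> dict:
    """Full cache hit: the whole provider cost becomes savings and only
    the (typically negligible) cache lookup overhead remains."""
    return {
        "cache_savings_estimate": provider_cost,  # full cost avoided
        "estimated_total_cost": lookup_overhead,  # near zero
    }

est = full_hit_estimate(0.0105)
# est["estimated_total_cost"] is 0.0 — no provider tokens are billed
```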
### Partial Cache Hits
Partial cache hits occur when some context in the request matches cached content but the prompt itself is new. Common scenarios include:
- Cached Knowledge Base context — The KB assets injected into your request have a cached embedding or retrieval result, but your prompt text is new.
- Cached system message — The system message and static context are cached at the provider level (e.g., OpenAI prompt caching), reducing input token cost for that portion.
- Conversation history overlap — Previous turns in the conversation are cached, and only new tokens are billed by the provider.
For partial cache hits, the savings are proportional to the cached portion:
```
cache_savings_estimate = cached_token_count * input_cost_per_token
```
The estimate shows the reduced cost for cached tokens while billing the full rate for new tokens.
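A minimal sketch of the proportional formula above (hypothetical helper names, not the gateway's actual API):

```python
def partial_hit_costs(cached_tokens: int, new_tokens: int,
                      input_cost_per_token: float) -> dict:
    """Cached tokens generate savings; new tokens bill at the full rate."""
    return {
        "cache_savings_estimate": cached_tokens * input_cost_per_token,
        "billed_input_cost": new_tokens * input_cost_per_token,
    }

# 400 cached of 1,500 input tokens at $0.000003 per token
costs = partial_hit_costs(400, 1100, 0.000003)
```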
### Cache Confidence Threshold

Not all potential cache hits are reflected in the estimate. The gateway only includes cache savings when it is confident a hit will occur. This is controlled by `cost_estimation.cache_hit_confidence_threshold`:

```yaml
cost_estimation:
  cache_hit_confidence_threshold: 0.8
```
The confidence score (0.0 to 1.0) is based on:
- Key match certainty — Does the cache key exactly match, or is it a fuzzy match?
- TTL validity — How much time remains before the cached entry expires?
- Provider cache behavior — Does the provider guarantee cache hits for matching prefixes, or is caching best-effort?
When confidence is below the threshold, the estimate assumes no cache hit and shows the full provider cost. This prevents misleading estimates where a cache miss would result in an unexpectedly higher actual cost.
| Threshold | Behavior |
|---|---|
| `1.0` | Only show savings for guaranteed cache hits (exact match, valid TTL, deterministic provider) |
| `0.8` (default) | Show savings when a hit is highly likely |
| `0.5` | Show savings for probable hits (more optimistic estimates) |
| `0.0` | Always show potential savings regardless of confidence |
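The gating rule can be sketched as follows (illustrative only; the function and parameter names are assumptions):

```python
def savings_if_confident(provider_cost: float, potential_savings: float,
                         hit_confidence: float, threshold: float = 0.8) -> float:
    """Apply cache savings only when confidence clears the threshold;
    otherwise assume a miss and show the full provider cost."""
    if hit_confidence >= threshold:
        return provider_cost - potential_savings
    return provider_cost

savings_if_confident(0.0105, 0.0012, 0.95)  # savings applied
savings_if_confident(0.0105, 0.0012, 0.60)  # below 0.8: full cost shown
```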
## Fabric Retrieval Cost Adjustments

### How Fabric Context Affects Estimates
When Codebase Context Fabric is enabled, the gateway retrieves relevant context from the fabric index before assembling the prompt. This affects costs in two ways:
- Retrieval cost — The semantic search query against the fabric index has its own cost model.
- Additional input tokens — Retrieved context is injected into the prompt, increasing the input token count.
### Fabric Retrieval Cost

Each fabric query incurs a configurable cost:

```yaml
cost_estimation:
  fabric_retrieval_cost_per_query: 0.0
```
The default is 0.0 (free) because most self-hosted fabric indexes have no per-query cost. If you use a managed embedding service or external vector database with per-query pricing, set this to reflect the actual cost per retrieval operation.
The retrieval cost is fixed per query regardless of how many tokens are returned. It appears in the estimate as a separate line item.
### Fabric Context Token Cost

Retrieved fabric context tokens count toward input tokens at the standard model rate. The estimate includes these tokens in `estimated_input_tokens`:

```
total_input_tokens = prompt_tokens + system_message_tokens + kb_context_tokens + fabric_context_tokens + history_tokens
```
The gateway estimates fabric context size based on your configured retrieval parameters:
- `fabric.max_context_tokens` — Maximum tokens to retrieve from the fabric index.
- `fabric.top_k` — Number of chunks to retrieve.
- Historical average chunk size for your fabric index.
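The token sum above is a straight addition of prompt components; a sketch (parameter names mirror the formula, the function itself is an assumption):

```python
def estimate_input_tokens(prompt_tokens: int, system_message_tokens: int,
                          kb_context_tokens: int, fabric_context_tokens: int,
                          history_tokens: int) -> int:
    """Fabric context tokens count toward input at the standard model
    rate, alongside every other prompt component."""
    return (prompt_tokens + system_message_tokens + kb_context_tokens
            + fabric_context_tokens + history_tokens)

estimate_input_tokens(600, 200, 300, 250, 150)  # → 1500
```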
### Including or Excluding Fabric Costs

You control whether fabric costs appear in estimates:

```yaml
cost_estimation:
  include_fabric_costs: true
```
| Setting | Behavior |
|---|---|
| `true` (default) | Estimates include fabric retrieval cost and fabric context tokens |
| `false` | Estimates exclude fabric costs — useful if fabric is free and you want simpler estimates |
When set to `false`, fabric context tokens are still counted for accuracy in reconciliation, but they do not appear in the pre-dispatch estimate shown to users.
## Combined Cost Breakdown
The estimate provides a combined view that separates cost components:
| Field | Description |
|---|---|
| `provider_cost` | Base cost of sending the prompt to the LLM provider (input + output tokens) |
| `cache_savings` | Estimated reduction from cache hits (subtracted from provider cost) |
| `fabric_retrieval_cost` | Cost of querying the fabric index |
| `net_estimated_cost` | Final estimated cost: `provider_cost - cache_savings + fabric_retrieval_cost` |
This breakdown appears in the chat UI cost badge tooltip and in the Cost Center detail view.
### Example Breakdown

For a request with a partial cache hit and fabric context:

```
Provider cost (input):   $0.0045  (1,500 tokens × $0.000003)
Provider cost (output):  $0.0060  (500 tokens × $0.000012)
Cache savings:          -$0.0012  (400 cached tokens × $0.000003)
Fabric retrieval:        $0.0000  (free tier)
─────────────────────────────────
Net estimated cost:      $0.0093
```
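The example can be reproduced with a small sketch of the net-cost formula (a hypothetical function mirroring the breakdown fields, not the gateway's actual API):

```python
def net_estimated_cost(input_tokens: int, output_tokens: int,
                       input_rate: float, output_rate: float,
                       cached_tokens: int = 0,
                       fabric_retrieval_cost: float = 0.0) -> float:
    """net = provider_cost - cache_savings + fabric_retrieval_cost"""
    provider_cost = input_tokens * input_rate + output_tokens * output_rate
    cache_savings = cached_tokens * input_rate
    return provider_cost - cache_savings + fabric_retrieval_cost

# 1,500 input / 500 output tokens, 400 cached, free fabric tier
round(net_estimated_cost(1500, 500, 0.000003, 0.000012, cached_tokens=400), 6)
# → 0.0093
```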
## Configuration Reference

```yaml
cost_estimation:
  # Cache settings
  cache_hit_confidence_threshold: 0.8
  include_cache_savings_in_estimate: true

  # Fabric settings
  include_fabric_costs: true
  fabric_retrieval_cost_per_query: 0.0

  # Display settings
  show_cost_breakdown: true
  breakdown_components:
    - provider_cost
    - cache_savings
    - fabric_retrieval_cost
    - net_estimated_cost
```
| Setting | Default | Description |
|---|---|---|
| `cache_hit_confidence_threshold` | `0.8` | Minimum confidence to include cache savings in the estimate |
| `include_cache_savings_in_estimate` | `true` | Show cache savings in the estimate |
| `include_fabric_costs` | `true` | Include fabric retrieval and context costs |
| `fabric_retrieval_cost_per_query` | `0.0` | Per-query cost for fabric index searches |
| `show_cost_breakdown` | `true` | Show the component breakdown in the UI |
## Interaction With Wallet Reservations
When both cache savings and fabric costs are included in the estimate, the wallet reservation reflects the net amount:
- The reservation equals `net_estimated_cost`, not the gross `provider_cost`.
- If a cache hit does not materialize (confidence was below 1.0 and the hit failed), the actual cost exceeds the reservation. The wallet absorbs the difference at settlement.
- To avoid under-reservation, set `cache_hit_confidence_threshold` to `1.0` for conservative wallet management.
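A sketch of the settlement shortfall when an anticipated hit misses (illustrative; not the wallet's actual API):

```python
def settlement_shortfall(reserved: float, actual: float) -> float:
    """Amount the wallet absorbs at settlement when the actual cost
    exceeds the (net) reserved estimate, e.g. a missed cache hit."""
    return max(0.0, actual - reserved)

# Reserved $0.0093 assuming a cache hit; the hit missed, actual $0.0105
settlement_shortfall(0.0093, 0.0105)
```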
## Next steps
- Pre-Dispatch Prompt Cost Estimates — understand the full estimation flow.
- Token Estimation Across Providers — learn how token counts are estimated for different models.
- Estimate vs Actual Cost Reconciliation — see how estimates are compared to actual costs after the response arrives.
## For AI systems

- Canonical terms: Keeptrusts, cache cost adjustments, fabric cost adjustments, pre-dispatch estimates, full cache hit, partial cache hit, cache savings estimate, fabric retrieval cost, net estimated cost, cache confidence threshold.
- Feature/config names: `cost_estimation.cache_hit_confidence_threshold`, `cost_estimation.include_cache_savings_in_estimate`, `cost_estimation.include_fabric_costs`, `cost_estimation.fabric_retrieval_cost_per_query`, `cost_estimation.show_cost_breakdown`, `cache_savings_estimate`, `provider_cost`, `net_estimated_cost`.
- Best next pages: Pre-Dispatch Prompt Cost Estimates, Token Estimation Across Providers, Estimate vs Actual Cost Reconciliation.
## For engineers

- Set `cache_hit_confidence_threshold: 0.8` (default) for balanced estimates. Use `1.0` for conservative wallet management that avoids under-reservation.
- Set `fabric_retrieval_cost_per_query: 0.0` for self-hosted fabric indexes. Update this if using a managed embedding service with per-query pricing.
- Validate: check the cost badge tooltip in the chat UI for the component breakdown (provider cost, cache savings, fabric cost, net cost).
- Wallet interaction: reservations use the net estimated cost. If a cache hit fails to materialize, the wallet absorbs the difference at settlement.
## For leaders
- Cache savings in estimates give engineers real-time visibility into cache ROI before each request.
- Conservative confidence thresholds (1.0) prevent wallet under-reservation at the cost of less optimistic estimates shown to users.
- Fabric retrieval is typically free for self-hosted indexes — the cost impact comes from additional input tokens (fabric context), not the query itself.
- Cost breakdown transparency helps teams understand where spend goes: provider tokens vs. cache savings vs. fabric context overhead.