Pre-Dispatch Prompt Cost Estimates

Keeptrusts calculates an estimated cost before your prompt is sent to the LLM provider. This pre-dispatch estimate supports balance checks, cache-savings math, and estimate-vs-actual reconciliation for the request lifecycle.

Use this page when

  • You need to understand how the cost estimate is calculated before a prompt is sent to the provider.
  • You are configuring output_token_multiplier, show_estimate_in_chat, or block_if_exceeds_balance.
  • You want to understand confidence levels (high, medium, low) and what drives them.

Primary audience

  • Primary: Technical Engineers
  • Secondary: AI Agents, Technical Leaders

How the Estimation Flow Works

When you submit a prompt in the chat workbench, the following sequence occurs before the request leaves the gateway:

  1. Prompt assembly — The gateway assembles the full request payload including your prompt text, system message, and any injected context from Knowledge Base assets or Codebase Context Fabric.
  2. Token counting — A tokenizer adapter estimates the total input token count for the assembled payload.
  3. Output estimation — The gateway estimates how many output tokens the model will generate based on configuration defaults or model-specific heuristics.
  4. Pricing lookup — The model-pricing catalog provides per-token rates for the target model.
  5. Cache evaluation — If a cache hit is expected based on key matching, the estimate applies a discount.
  6. Estimate delivery — The calculated estimate is attached to the request lifecycle so balance enforcement, telemetry, and reconciliation can use the same pre-dispatch numbers.

The current console chat workbench does not render this estimate inline in the composer. Instead, Keeptrusts uses the estimate for wallet and governance decisions before dispatch, and surfaces cost evidence through post-send usage and reconciliation records.
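The sequence above can be condensed into a rough sketch. This is an illustrative model only; the function and parameter names (estimate_cost, count_tokens, the per-token rates) are hypothetical, not the gateway's actual internals.

```python
# Illustrative sketch of the pre-dispatch estimation flow; all names are
# hypothetical and the tokenizer is a crude character-based stand-in.

def count_tokens(text: str) -> int:
    # Fallback approximation: roughly 4 characters per token.
    return max(1, len(text) // 4)

def estimate_cost(prompt: str, system: str, context: str,
                  max_tokens: int, input_rate: float, output_rate: float,
                  output_token_multiplier: float = 0.5,
                  cache_savings: float = 0.0) -> dict:
    # Steps 1-2: assemble the payload and count input tokens.
    input_tokens = count_tokens(prompt + system + context)
    # Step 3: estimate output tokens from the configured heuristic.
    output_tokens = int(max_tokens * output_token_multiplier)
    # Step 4: apply per-token pricing from the catalog lookup.
    input_cost = input_tokens * input_rate
    output_cost = output_tokens * output_rate
    # Steps 5-6: subtract expected cache savings and hand the estimate
    # to balance enforcement, telemetry, and reconciliation.
    return {
        "estimated_input_tokens": input_tokens,
        "estimated_output_tokens": output_tokens,
        "estimated_input_cost": input_cost,
        "estimated_output_cost": output_cost,
        "cache_savings_estimate": cache_savings,
        "estimated_total_cost": input_cost + output_cost - cache_savings,
    }
```

The real gateway replaces the crude tokenizer with a model-matched adapter; see Token Estimation Across Providers.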

The cost_estimate Object

The gateway returns a structured cost_estimate object with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| estimated_input_tokens | integer | Estimated number of input tokens in the full request |
| estimated_output_tokens | integer | Estimated number of output tokens the model will generate |
| estimated_input_cost | decimal | Estimated cost for input tokens in the specified currency |
| estimated_output_cost | decimal | Estimated cost for output tokens in the specified currency |
| estimated_total_cost | decimal | Input cost plus output cost, minus any cache savings |
| cache_savings_estimate | decimal | Estimated cost reduction from cache hits |
| currency | string | Currency code (e.g., USD) |
| model_id | string | The model used for pricing lookup |
| confidence | enum | high, medium, or low; indicates estimate reliability |
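For illustration, a populated cost_estimate might look like the following; all values and the model ID are hypothetical, but the fields and the total-cost invariant match the table above.

```python
# Hypothetical cost_estimate payload; values are illustrative, not real pricing.
cost_estimate = {
    "estimated_input_tokens": 1850,
    "estimated_output_tokens": 1024,
    "estimated_input_cost": 0.00555,
    "estimated_output_cost": 0.01536,
    "cache_savings_estimate": 0.00100,
    "estimated_total_cost": 0.01991,  # input + output - cache savings
    "currency": "USD",
    "model_id": "example-model-v1",
    "confidence": "high",
}

# Invariant implied by the field descriptions:
# total = input cost + output cost - cache savings.
assert abs(
    cost_estimate["estimated_total_cost"]
    - (cost_estimate["estimated_input_cost"]
       + cost_estimate["estimated_output_cost"]
       - cost_estimate["cache_savings_estimate"])
) < 1e-9
```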

Confidence Levels

The confidence field tells you how reliable the estimate is:

  • high — Exact pricing is known for the model from the pricing catalog, and the tokenizer adapter matches the provider's tokenizer. You can trust this estimate closely.
  • medium — Pricing is interpolated from a related model family, or the tokenizer adapter is approximate. Expect up to 10% variance from actual cost.
  • low — The estimate uses fallback pricing or a character-based token approximation. Variance may exceed 20%.

What Feeds Into the Estimate

Model Pricing

The gateway looks up per-token pricing from the model-pricing catalog. This catalog contains input and output rates for each model, updated as providers change their pricing. If an exact model match is not found, the gateway falls back to the model family rate, then to a configurable default rate.
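The fallback chain (exact model, then model family, then configurable default) can be sketched as follows. The catalog contents and the family-derivation rule are assumptions for illustration, not the actual catalog schema.

```python
# Hypothetical pricing catalog with per-token input/output rates.
CATALOG = {
    "example-model-v1": {"input": 0.000003, "output": 0.000015},
}
FAMILY_RATES = {
    "example-model": {"input": 0.000004, "output": 0.000020},
}
DEFAULT_RATE = {"input": 0.000005, "output": 0.000025}

def lookup_rates(model_id: str) -> dict:
    # 1. Exact model match.
    if model_id in CATALOG:
        return CATALOG[model_id]
    # 2. Model family fallback (assumed here: strip the version suffix).
    family = model_id.rsplit("-", 1)[0]
    if family in FAMILY_RATES:
        return FAMILY_RATES[family]
    # 3. Configurable default rate (yields confidence: low).
    return DEFAULT_RATE
```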

Input Token Estimation

The input token count includes everything sent to the provider:

  • Your prompt text
  • The system message configured for the gateway or chat session
  • Knowledge Base context injected based on active bindings
  • Codebase Context Fabric results if fabric retrieval is enabled
  • Conversation history included in the request window

The gateway uses a tokenizer adapter matched to the target model's tokenizer family. See Token Estimation Across Providers for details on adapter selection.

Output Token Estimation

Since the gateway cannot know in advance how many tokens the model will generate, it uses a configurable heuristic:

  • The default estimate is max_tokens * output_token_multiplier, where output_token_multiplier defaults to 0.5.
  • Model-specific overrides can set a fixed expected output length.
  • If the prompt includes structured output instructions (JSON mode, function calling), the estimate adjusts based on schema complexity.
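The heuristic above can be sketched as a small function; the override table and its contents are hypothetical, standing in for the model-specific overrides the page describes.

```python
# Hypothetical per-model fixed output lengths (model-specific overrides).
FIXED_OUTPUT_OVERRIDES = {"example-classifier-v1": 16}

def estimate_output_tokens(model_id: str, max_tokens: int,
                           output_token_multiplier: float = 0.5) -> int:
    # A fixed override takes precedence over the multiplier heuristic.
    if model_id in FIXED_OUTPUT_OVERRIDES:
        return FIXED_OUTPUT_OVERRIDES[model_id]
    return int(max_tokens * output_token_multiplier)
```

For example, with max_tokens of 4096 and the default multiplier of 0.5, the estimate is 2048 output tokens.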

Cache Discount

If the gateway determines that a cache hit is likely — based on exact key matching and valid TTL — the estimate reduces the provider cost accordingly. See Cache and Fabric Cost Adjustments for the full cache costing model.
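A minimal sketch of the discount step, assuming for illustration that an expected cache hit avoids the provider cost entirely; the real costing model is described on the Cache and Fabric Cost Adjustments page.

```python
def apply_cache_discount(input_cost: float, output_cost: float,
                         cache_hit_expected: bool) -> tuple[float, float]:
    # Assumption: an exact-key hit within TTL skips the provider call,
    # so the full provider cost becomes the estimated savings.
    provider_cost = input_cost + output_cost
    savings = provider_cost if cache_hit_expected else 0.0
    return provider_cost - savings, savings
```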

How Estimates Surface Today

Pre-dispatch estimates currently feed:

  • wallet and balance enforcement before the provider call
  • cache and fabric savings calculations
  • estimate-vs-actual reconciliation after the response settles
  • telemetry or downstream product surfaces that consume the cost_estimate object

If your team has configured a wallet with balance limits, the estimate is still checked against the remaining balance before dispatch, even though the chat composer no longer shows an inline badge.

Configuration

You can tune pre-dispatch estimation behavior through your gateway configuration:

cost_estimation:
  enabled: true
  output_token_multiplier: 0.5
  show_estimate_in_chat: true
  block_if_exceeds_balance: false
  confidence_minimum: low

| Setting | Default | Description |
| --- | --- | --- |
| enabled | true | Enable or disable pre-dispatch cost estimation |
| output_token_multiplier | 0.5 | Fraction of max_tokens used for output estimation |
| show_estimate_in_chat | true | Legacy inline-chat display flag for clients that choose to render pre-dispatch estimates |
| block_if_exceeds_balance | false | Block requests that would exceed wallet balance |
| confidence_minimum | low | Minimum confidence level required to show estimates |
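As a sketch, block_if_exceeds_balance could gate dispatch as follows; the function name and wallet shape are hypothetical, standing in for the gateway's actual enforcement hook.

```python
def should_dispatch(estimated_total_cost: float, wallet_balance: float,
                    block_if_exceeds_balance: bool) -> bool:
    # Hard budget limit: refuse to send the request when the pre-dispatch
    # estimate exceeds the remaining wallet balance.
    if block_if_exceeds_balance and estimated_total_cost > wallet_balance:
        return False
    # Otherwise dispatch; reconciliation records the actual cost later.
    return True
```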

Limitations

  • Estimates are predictions, not guarantees. Actual costs depend on the model's response length and provider billing.
  • Streaming responses do not update the estimate mid-stream. Reconciliation happens after the full response arrives.
  • If the model-pricing catalog does not have an entry for your model, estimates fall back to a default rate with confidence: low.

Next steps

For AI systems

  • Canonical terms: Keeptrusts, pre-dispatch cost estimate, cost_estimate object, estimated_input_tokens, estimated_output_tokens, confidence levels, model-pricing catalog, output_token_multiplier.
  • Feature/config names: cost_estimation.enabled, cost_estimation.output_token_multiplier, cost_estimation.show_estimate_in_chat, cost_estimation.block_if_exceeds_balance, cost_estimation.confidence_minimum, cost_estimate.confidence (high/medium/low), cost_estimate.cache_savings_estimate.
  • Best next pages: Token Estimation Across Providers, Estimate vs Actual Cost Reconciliation, Cache and Fabric Cost Adjustments.

For engineers

  • The estimate is computed before dispatch and can be consumed by wallet checks, audit flows, or custom client surfaces even though the default console chat composer no longer shows an inline badge.
  • Tune output_token_multiplier (default 0.5) to match expected response lengths for your common use cases. Lower for classification, higher for code generation.
  • Confidence levels: high = exact tokenizer match + known pricing; medium = interpolated pricing or approximate tokenizer; low = character fallback.
  • If the model-pricing catalog is missing an entry, estimates fall back to a default rate with confidence: low. Add the model to the catalog for accurate estimates.

For leaders

  • Pre-dispatch estimates give engineers cost awareness before spending — preventing surprise charges and enabling informed model selection.
  • block_if_exceeds_balance: true enforces hard budget limits at the request level, preventing wallet overdraft.
  • Confidence levels signal data quality: track the percentage of requests at each confidence level to identify pricing catalog gaps.
  • Pre-dispatch estimates still support budget controls and reconciliation even when the default chat UI does not display them inline.