Pre-Dispatch Prompt Cost Estimates

Keeptrusts calculates an estimated cost before your prompt is sent to the LLM provider. This pre-dispatch estimate supports balance checks, cache-savings math, and estimate-vs-actual reconciliation for the request lifecycle.

Use this page when

  • You need to understand how the cost estimate is calculated before a prompt is sent to the provider.
  • You are configuring output_token_multiplier, show_estimate_in_chat, or block_if_exceeds_balance.
  • You want to understand confidence levels (high, medium, low) and what drives them.

Primary audience

  • Primary: Technical Engineers
  • Secondary: AI Agents, Technical Leaders

How the Estimation Flow Works

When you submit a prompt in the chat workbench, the following sequence occurs before the request leaves the gateway:

  1. Prompt assembly — The gateway assembles the full request payload including your prompt text, system message, and any injected context from Knowledge Base assets or Codebase Context Fabric.
  2. Token counting — A tokenizer adapter estimates the total input token count for the assembled payload.
  3. Output estimation — The gateway estimates how many output tokens the model will generate based on configuration defaults or model-specific heuristics.
  4. Pricing lookup — The model-pricing catalog provides per-token rates for the target model.
  5. Cache evaluation — If a cache hit is expected based on key matching, the estimate applies a discount.
  6. Estimate delivery — The calculated estimate is attached to the request lifecycle so balance enforcement, telemetry, and reconciliation can use the same pre-dispatch numbers.

The current console chat workbench does not render this estimate inline in the composer. Instead, Keeptrusts uses the estimate for wallet and governance decisions before dispatch, and surfaces cost evidence through post-send usage and reconciliation records.
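The sequence above can be condensed into a rough sketch. This is an illustrative model only; the function and parameter names (estimate_cost, count_tokens, the per-token rates) are hypothetical, not the gateway's actual internals.

```python
# Illustrative sketch of the pre-dispatch estimation flow; all names are
# hypothetical and the tokenizer is a crude character-based stand-in.

def count_tokens(text: str) -> int:
    # Fallback approximation: roughly 4 characters per token.
    return max(1, len(text) // 4)

def estimate_cost(prompt: str, system: str, context: str,
                  max_tokens: int, input_rate: float, output_rate: float,
                  output_token_multiplier: float = 0.5,
                  cache_savings: float = 0.0) -> dict:
    # Steps 1-2: assemble the payload and count input tokens.
    input_tokens = count_tokens(prompt + system + context)
    # Step 3: estimate output tokens from the configured heuristic.
    output_tokens = int(max_tokens * output_token_multiplier)
    # Step 4: apply per-token pricing from the catalog lookup.
    input_cost = input_tokens * input_rate
    output_cost = output_tokens * output_rate
    # Steps 5-6: subtract expected cache savings and hand the estimate
    # to balance enforcement, telemetry, and reconciliation.
    return {
        "estimated_input_tokens": input_tokens,
        "estimated_output_tokens": output_tokens,
        "estimated_input_cost": input_cost,
        "estimated_output_cost": output_cost,
        "cache_savings_estimate": cache_savings,
        "estimated_total_cost": input_cost + output_cost - cache_savings,
    }
```

The real gateway replaces the crude tokenizer with a model-matched adapter; see Token Estimation Across Providers.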

The cost_estimate Object

The gateway returns a structured cost_estimate object with the following fields:

| Field | Type | Description |
| --- | --- | --- |
| estimated_input_tokens | integer | Estimated number of input tokens in the full request |
| estimated_output_tokens | integer | Estimated number of output tokens the model will generate |
| estimated_input_cost | decimal | Estimated cost for input tokens in the specified currency |
| estimated_output_cost | decimal | Estimated cost for output tokens in the specified currency |
| estimated_total_cost | decimal | Input cost plus output cost, minus any cache savings |
| cache_savings_estimate | decimal | Estimated cost reduction from cache hits |
| currency | string | Currency code (e.g., USD) |
| model_id | string | The model used for pricing lookup |
| confidence | enum | high, medium, or low; indicates estimate reliability |
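For illustration, a populated cost_estimate might look like the following; all values and the model ID are hypothetical, but the fields and the total-cost invariant match the table above.

```python
# Hypothetical cost_estimate payload; values are illustrative, not real pricing.
cost_estimate = {
    "estimated_input_tokens": 1850,
    "estimated_output_tokens": 1024,
    "estimated_input_cost": 0.00555,
    "estimated_output_cost": 0.01536,
    "cache_savings_estimate": 0.00100,
    "estimated_total_cost": 0.01991,  # input + output - cache savings
    "currency": "USD",
    "model_id": "example-model-v1",
    "confidence": "high",
}

# Invariant implied by the field descriptions:
# total = input cost + output cost - cache savings.
assert abs(
    cost_estimate["estimated_total_cost"]
    - (cost_estimate["estimated_input_cost"]
       + cost_estimate["estimated_output_cost"]
       - cost_estimate["cache_savings_estimate"])
) < 1e-9
```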

Confidence Levels

The confidence field tells you how reliable the estimate is:

  • high — Exact pricing is known for the model from the pricing catalog, and the tokenizer adapter matches the provider's tokenizer. You can trust this estimate closely.
  • medium — Pricing is interpolated from a related model family, or the tokenizer adapter is approximate. Expect up to 10% variance from actual cost.
  • low — The estimate uses fallback pricing or a character-based token approximation. Variance may exceed 20%.

What Feeds Into the Estimate

Model Pricing

The gateway looks up per-token pricing from the model-pricing catalog. This catalog contains input and output rates for each model, updated as providers change their pricing. If an exact model match is not found, the gateway falls back to the model family rate, then to a configurable default rate.
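The fallback chain (exact model, then model family, then configurable default) can be sketched as follows. The catalog contents and the family-derivation rule are assumptions for illustration, not the actual catalog schema.

```python
# Hypothetical pricing catalog with per-token input/output rates.
CATALOG = {
    "example-model-v1": {"input": 0.000003, "output": 0.000015},
}
FAMILY_RATES = {
    "example-model": {"input": 0.000004, "output": 0.000020},
}
DEFAULT_RATE = {"input": 0.000005, "output": 0.000025}

def lookup_rates(model_id: str) -> dict:
    # 1. Exact model match.
    if model_id in CATALOG:
        return CATALOG[model_id]
    # 2. Model family fallback (assumed here: strip the version suffix).
    family = model_id.rsplit("-", 1)[0]
    if family in FAMILY_RATES:
        return FAMILY_RATES[family]
    # 3. Configurable default rate (yields confidence: low).
    return DEFAULT_RATE
```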

Input Token Estimation

The input token count includes everything sent to the provider:

  • Your prompt text
  • The system message configured for the gateway or chat session
  • Knowledge Base context injected based on active bindings
  • Codebase Context Fabric results if fabric retrieval is enabled
  • Conversation history included in the request window

The gateway uses a tokenizer adapter matched to the target model's tokenizer family. See Token Estimation Across Providers for details on adapter selection.

Output Token Estimation

Since the gateway cannot know in advance how many tokens the model will generate, it uses a configurable heuristic:

  • The default estimate is max_tokens * output_token_multiplier, where output_token_multiplier defaults to 0.5.
  • Model-specific overrides can set a fixed expected output length.
  • If the prompt includes structured output instructions (JSON mode, function calling), the estimate adjusts based on schema complexity.
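The heuristic above can be sketched as a small function; the override table and its contents are hypothetical, standing in for the model-specific overrides the page describes.

```python
# Hypothetical per-model fixed output lengths (model-specific overrides).
FIXED_OUTPUT_OVERRIDES = {"example-classifier-v1": 16}

def estimate_output_tokens(model_id: str, max_tokens: int,
                           output_token_multiplier: float = 0.5) -> int:
    # A fixed override takes precedence over the multiplier heuristic.
    if model_id in FIXED_OUTPUT_OVERRIDES:
        return FIXED_OUTPUT_OVERRIDES[model_id]
    return int(max_tokens * output_token_multiplier)
```

For example, with max_tokens of 4096 and the default multiplier of 0.5, the estimate is 2048 output tokens.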

Cache Discount

If the gateway determines that a cache hit is likely — based on exact key matching and valid TTL — the estimate reduces the provider cost accordingly. See Cache and Fabric Cost Adjustments for the full cache costing model.
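A minimal sketch of the discount step, assuming for illustration that an expected cache hit avoids the provider cost entirely; the real costing model is described on the Cache and Fabric Cost Adjustments page.

```python
def apply_cache_discount(input_cost: float, output_cost: float,
                         cache_hit_expected: bool) -> tuple[float, float]:
    # Assumption: an exact-key hit within TTL skips the provider call,
    # so the full provider cost becomes the estimated savings.
    provider_cost = input_cost + output_cost
    savings = provider_cost if cache_hit_expected else 0.0
    return provider_cost - savings, savings
```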

How Estimates Surface Today

Pre-dispatch estimates currently feed:

  • wallet and balance enforcement before the provider call
  • cache and fabric savings calculations
  • estimate-vs-actual reconciliation after the response settles
  • telemetry or downstream product surfaces that consume the cost_estimate object

If your team has configured a wallet with balance limits, the estimate is still checked against the remaining balance before dispatch, even though the chat composer no longer shows an inline badge.

Configuration

You can tune pre-dispatch estimation behavior through your gateway configuration:

cost_estimation:
  enabled: true
  output_token_multiplier: 0.5
  show_estimate_in_chat: true
  block_if_exceeds_balance: false
  confidence_minimum: low

| Setting | Default | Description |
| --- | --- | --- |
| enabled | true | Enable or disable pre-dispatch cost estimation |
| output_token_multiplier | 0.5 | Fraction of max_tokens used for output estimation |
| show_estimate_in_chat | true | Legacy inline-chat display flag for clients that choose to render pre-dispatch estimates |
| block_if_exceeds_balance | false | Block requests that would exceed wallet balance |
| confidence_minimum | low | Minimum confidence level required to show estimates |
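As a sketch, block_if_exceeds_balance could gate dispatch as follows; the function name and wallet shape are hypothetical, standing in for the gateway's actual enforcement hook.

```python
def should_dispatch(estimated_total_cost: float, wallet_balance: float,
                    block_if_exceeds_balance: bool) -> bool:
    # Hard budget limit: refuse to send the request when the pre-dispatch
    # estimate exceeds the remaining wallet balance.
    if block_if_exceeds_balance and estimated_total_cost > wallet_balance:
        return False
    # Otherwise dispatch; reconciliation records the actual cost later.
    return True
```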

Limitations

  • Estimates are predictions, not guarantees. Actual costs depend on the model's response length and provider billing.
  • Streaming responses do not update the estimate mid-stream. Reconciliation happens after the full response arrives.
  • If the model-pricing catalog does not have an entry for your model, estimates fall back to a default rate with confidence: low.

Next steps

For AI systems

  • Canonical terms: Keeptrusts, pre-dispatch cost estimate, cost_estimate object, estimated_input_tokens, estimated_output_tokens, confidence levels, model-pricing catalog, output_token_multiplier.
  • Feature/config names: cost_estimation.enabled, cost_estimation.output_token_multiplier, cost_estimation.show_estimate_in_chat, cost_estimation.block_if_exceeds_balance, cost_estimation.confidence_minimum, cost_estimate.confidence (high/medium/low), cost_estimate.cache_savings_estimate.
  • Best next pages: Token Estimation Across Providers, Estimate vs Actual Cost Reconciliation, Cache and Fabric Cost Adjustments.

For engineers

  • The estimate is computed before dispatch and can be consumed by wallet checks, audit flows, or custom client surfaces even though the default console chat composer no longer shows an inline badge.
  • Tune output_token_multiplier (default 0.5) to match expected response lengths for your common use cases. Lower for classification, higher for code generation.
  • Confidence levels: high = exact tokenizer match + known pricing; medium = interpolated pricing or approximate tokenizer; low = character fallback.
  • If the model-pricing catalog is missing an entry, estimates fall back to a default rate with confidence: low. Add the model to the catalog for accurate estimates.

For leaders

  • Pre-dispatch estimates give engineers cost awareness before spending — preventing surprise charges and enabling informed model selection.
  • block_if_exceeds_balance: true enforces hard budget limits at the request level, preventing wallet overdraft.
  • Confidence levels signal data quality: track the percentage of requests at each confidence level to identify pricing catalog gaps.
  • Pre-dispatch estimates still support budget controls and reconciliation even when the default chat UI does not display them inline.