Token Estimation Across Providers
Keeptrusts estimates token counts locally without calling provider tokenizer APIs. This keeps estimation fast, avoids additional API costs, and works offline. Built-in adapters cover major provider tokenizer families, and you can register custom adapters for private or self-hosted models.
Use this page when
- You need to understand how Keeptrusts estimates token counts locally without calling provider APIs.
- You are selecting or registering a tokenizer adapter (`cl100k_base`, `o200k_base`, custom, or character fallback).
- You are troubleshooting persistent estimation divergence (>10%) for a specific model.
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
How Adapter Selection Works
When the gateway needs to estimate tokens for a request, it selects an adapter based on the model's tokenizer_family field from the model-pricing catalog:
- The gateway looks up the target model in the pricing catalog.
- It reads the `tokenizer_family` field (e.g., `cl100k_base`, `o200k_base`, `character_estimate`).
- It loads the corresponding adapter to perform local token counting.
- If no `tokenizer_family` is set, the gateway falls back to the character-based estimator.
This selection happens once per request and adds negligible latency (typically less than 1ms).
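The steps above can be sketched in a few lines. This is an illustrative model of the selection flow, not Keeptrusts' actual API — the catalog and registry shapes here are assumptions for the example:

```python
# Sketch of adapter selection from the model-pricing catalog.
# Catalog and registry structures are hypothetical, for illustration only.

def character_estimate(text: str) -> int:
    # Character-based fallback: roughly 4 characters per token.
    return max(1, len(text) // 4)

ADAPTER_REGISTRY = {
    # Stand-in word counter; a real adapter would run the cl100k_base BPE.
    "cl100k_base": lambda text: len(text.split()),
    "character_estimate": character_estimate,
}

PRICING_CATALOG = {
    "gpt-4": {"tokenizer_family": "cl100k_base"},
    "internal-llm": {},  # no tokenizer_family declared
}

def select_adapter(model_id: str):
    """Look up the model's tokenizer_family; fall back to character estimation."""
    family = PRICING_CATALOG.get(model_id, {}).get("tokenizer_family")
    return ADAPTER_REGISTRY.get(family, character_estimate)

print(select_adapter("gpt-4")("Estimate this prompt."))
print(select_adapter("internal-llm")("Estimate this prompt."))
```

Note that both an absent `tokenizer_family` and an unrecognized one resolve to the same fallback, which matches the selection rules described above.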
Built-In Adapters
tiktoken — OpenAI and Azure OpenAI Models
The tiktoken adapter provides accurate token estimation for all OpenAI and Azure OpenAI models. It uses the same tokenization algorithm as the provider, producing `confidence: high` estimates.
Two encoding variants are supported:
cl100k_base
Used by the GPT-4 and GPT-3.5 model families:
- `gpt-4`
- `gpt-4-turbo`
- `gpt-3.5-turbo`
- All dated snapshots of the above (e.g., `gpt-4-0613`)
Token estimation with cl100k_base is typically within 0-2% of the provider's actual count.
o200k_base
Used by the GPT-4o family and newer models:
- `gpt-4o`
- `gpt-4o-mini`
- `o1`
- `o1-mini`
- `o3`
- `o3-mini`
The o200k_base encoding has a larger vocabulary (200k tokens) and produces fewer tokens for the same text compared to cl100k_base. Estimates are typically exact or within 1% variance.
Character-Based Fallback
For providers without public tokenizers — such as certain Anthropic, Cohere, or Mistral models — the gateway uses a character-based approximation:
estimated_tokens = character_count / 4
This produces `confidence: low` estimates. The divisor of 4 is a reasonable average for Latin-script text, but it may undercount for CJK characters and overcount for highly technical content with long tokens.
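A minimal sketch of the fallback heuristic above; the function name is illustrative, not Keeptrusts' internal API:

```python
# Character-based fallback: estimated_tokens = character_count / 4.

def character_estimate(text: str) -> int:
    # ~4 characters per token; CJK or code-heavy text can deviate noticeably.
    return max(1, len(text) // 4)

prompt = "The quick brown fox jumps over the lazy dog."
print(len(prompt), character_estimate(prompt))  # 44 characters -> 11 tokens
```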
Providers using this fallback include:
- Models without a declared `tokenizer_family` in the pricing catalog
- Self-hosted models that do not publish tokenizer specifications
- Any model where the adapter registry has no matching entry
Custom Adapters
You can register custom tokenizer adapters for private or self-hosted models. A custom adapter implements the token counting interface and is associated with a tokenizer_family name.
Registering a Custom Adapter
Add a custom adapter in your gateway configuration:
```yaml
cost_estimation:
  custom_tokenizers:
    - family: "my-custom-tokenizer"
      type: "tiktoken_compatible"
      vocabulary_file: "/path/to/vocab.bpe"
    - family: "simple-word-split"
      type: "regex"
      pattern: "\\s+"
      overhead_factor: 1.3
```
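To make the `regex` adapter concrete, here is a sketch of what one might do with the `simple-word-split` settings above. The class and method names are assumptions for illustration, not Keeptrusts' adapter interface:

```python
import math
import re

class RegexTokenizerAdapter:
    """Hypothetical regex-type adapter: split on a pattern, inflate by overhead."""

    def __init__(self, pattern: str, overhead_factor: float = 1.0):
        self.pattern = re.compile(pattern)
        self.overhead_factor = overhead_factor

    def count_tokens(self, text: str) -> int:
        # Split on the pattern, drop empty pieces, then inflate the word
        # count to approximate subword tokenization.
        pieces = [p for p in self.pattern.split(text) if p]
        return math.ceil(len(pieces) * self.overhead_factor)

adapter = RegexTokenizerAdapter(r"\s+", overhead_factor=1.3)
print(adapter.count_tokens("estimate tokens for this prompt"))  # 5 words -> ceil(6.5) = 7
```

The overhead factor compensates for subword splitting: a word-level count of 5 becomes an estimate of 7, on the assumption that real tokenizers emit roughly 1.3 tokens per whitespace-delimited word.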
Adapter Types
| Type | Description | Use Case |
|---|---|---|
| `tiktoken_compatible` | BPE tokenizer with a custom vocabulary file | Models fine-tuned from OpenAI base with modified vocab |
| `regex` | Splits text by regex pattern, applies overhead factor | Simple estimation for models with word-level tokenization |
| `external` | Calls a local HTTP endpoint for token counting | Models with proprietary tokenizers you host locally |
Assigning a Custom Adapter to a Model
After registering the adapter, set the tokenizer_family field in the model-pricing catalog entry:
```json
{
  "model_id": "my-org/custom-llm-v2",
  "tokenizer_family": "my-custom-tokenizer",
  "input_cost_per_token": 0.000003,
  "output_cost_per_token": 0.000012
}
```
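A mismatch between the catalog's `tokenizer_family` and the registered adapter families silently routes the model to the character fallback, so it can be worth validating catalog entries up front. A hypothetical sanity check, assuming an illustrative set of registered families:

```python
import json

# Families assumed registered for this example; not a Keeptrusts default set.
KNOWN_FAMILIES = {"cl100k_base", "o200k_base", "my-custom-tokenizer"}

catalog_entry = json.loads("""
{
  "model_id": "my-org/custom-llm-v2",
  "tokenizer_family": "my-custom-tokenizer",
  "input_cost_per_token": 0.000003,
  "output_cost_per_token": 0.000012
}
""")

family = catalog_entry.get("tokenizer_family")
if family is not None and family not in KNOWN_FAMILIES:
    raise ValueError(f"unregistered tokenizer_family: {family}")
print(f"{catalog_entry['model_id']} -> {family or 'character fallback'}")
```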
Verifying Token Estimation Accuracy
To check how well your estimates track actual provider usage:
- Review reconciliation data — Open the Cost Center in the console and filter by model. The reconciliation view shows `estimated_input_tokens` vs. `actual_input_tokens` for each request.
- Check the variance trend — Persistent variance above 10% for a specific model suggests the adapter is miscalibrated.
- Compare against provider tools — Use the provider's tokenizer playground (e.g., OpenAI's tiktoken web tool) to spot-check a sample prompt against Keeptrusts' estimate.
When Estimates Diverge From Actuals by More Than 10%
If you observe consistent divergence:
- Verify the `tokenizer_family` assignment — Ensure the model's pricing catalog entry specifies the correct tokenizer family. A mismatch (e.g., using `cl100k_base` for a `gpt-4o` model) causes systematic over- or under-counting.
- Check for special tokens — Some providers count system tokens, tool definitions, or function call schemas differently. These may not be visible in your prompt text but add to the provider's reported count.
- Update the pricing catalog — If the provider has migrated a model to a new tokenizer, update the `tokenizer_family` field accordingly.
- Consider a custom adapter — For self-hosted models with non-standard tokenization, the character fallback is intentionally conservative. A custom adapter matched to your model's actual tokenizer eliminates the variance.
Output Token Estimation
Output tokens cannot be counted in advance because the model has not yet generated its response. The gateway uses estimation_config.output_token_multiplier to predict output length:
estimated_output_tokens = max_tokens * output_token_multiplier
The default output_token_multiplier is 0.5, meaning the estimate assumes the model will use half of its allowed output budget. You can tune this per model or per gateway:
```yaml
cost_estimation:
  output_token_multiplier: 0.5
  model_overrides:
    "gpt-4o":
      output_token_multiplier: 0.3
    "claude-3-opus":
      output_token_multiplier: 0.7
```
Models that tend to produce shorter responses (summarization, classification) benefit from a lower multiplier. Models used for code generation or long-form content may need a higher value.
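The formula and override lookup above can be sketched as follows; the helper function and config dictionaries mirror the YAML example but are assumptions, not gateway code:

```python
# estimated_output_tokens = max_tokens * output_token_multiplier,
# with per-model overrides taking precedence over the default.

DEFAULT_MULTIPLIER = 0.5
MODEL_OVERRIDES = {"gpt-4o": 0.3, "claude-3-opus": 0.7}

def estimate_output_tokens(model_id: str, max_tokens: int) -> int:
    multiplier = MODEL_OVERRIDES.get(model_id, DEFAULT_MULTIPLIER)
    return int(max_tokens * multiplier)

print(estimate_output_tokens("gpt-4o", 1000))         # 300
print(estimate_output_tokens("claude-3-opus", 1000))  # 700
print(estimate_output_tokens("unknown-model", 1000))  # 500 (default 0.5)
```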
Adapter Performance
All built-in adapters run locally in the gateway process with no network calls:
| Adapter | Typical Latency | Memory Overhead |
|---|---|---|
| `cl100k_base` | less than 1ms per 4K tokens | ~4MB vocabulary |
| `o200k_base` | less than 1ms per 4K tokens | ~6MB vocabulary |
| Character fallback | less than 0.1ms | None |
| Custom (`tiktoken_compatible`) | less than 2ms per 4K tokens | Depends on vocab size |
| Custom (`external`) | 5-50ms | None (network call) |
Next steps
- Pre-Dispatch Prompt Cost Estimates — see how token estimates feed into the full cost calculation.
- Estimate vs Actual Cost Reconciliation — understand what happens when estimates diverge from actuals.
- Cache and Fabric Cost Adjustments — learn how cached context affects token counts and costs.
For AI systems
- Canonical terms: Keeptrusts, token estimation, tokenizer adapters, tiktoken, cl100k_base, o200k_base, character-based fallback, custom tokenizer, output_token_multiplier, model-pricing catalog tokenizer_family.
- Feature/config names: `cost_estimation.custom_tokenizers`, `cost_estimation.output_token_multiplier`, `cost_estimation.model_overrides`; tokenizer types: `tiktoken_compatible`, `regex`, `external`; `tokenizer_family` field in model-pricing catalog.
- Best next pages: Pre-Dispatch Prompt Cost Estimates, Estimate vs Actual Cost Reconciliation, Cache and Fabric Cost Adjustments.
For engineers
- Adapter selection: gateway reads `tokenizer_family` from the model-pricing catalog entry. Ensure each model has the correct family (`cl100k_base` for GPT-4, `o200k_base` for GPT-4o/o1/o3).
- Character fallback (÷4): used when no tokenizer family is set. Produces `confidence: low` estimates. Register a custom adapter for better accuracy.
- Troubleshooting >10% divergence: check `tokenizer_family` assignment, account for special tokens (tool definitions, system tokens), and verify the pricing catalog model version.
- Performance: all built-in adapters run locally with less than 1ms latency per 4K tokens and no network calls.
For leaders
- Accurate token estimation drives reliable pre-dispatch cost visibility — critical for engineer trust in cost badges and wallet management.
- The character fallback (÷4) is intentionally conservative; investing in proper adapter configuration per model eliminates 20%+ estimation variance.
- Custom adapters support private/self-hosted models without requiring external API calls for tokenization.
- Track the percentage of requests using `confidence: low` as a metric for pricing catalog completeness.