Token Estimation Across Providers
Keeptrusts estimates token counts locally without calling provider tokenizer APIs. This keeps estimation fast, avoids additional API costs, and works offline. Built-in adapters cover major provider tokenizer families, and you can register custom adapters for private or self-hosted models.
Use this page when
- You need to understand how Keeptrusts estimates token counts locally without calling provider APIs.
- You are selecting or registering a tokenizer adapter (`cl100k_base`, `o200k_base`, custom, or character fallback).
- You are troubleshooting persistent estimation divergence (>10%) for a specific model.
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
How Adapter Selection Works
When the gateway needs to estimate tokens for a request, it selects an adapter based on the model's tokenizer_family field from the model-pricing catalog:
- The gateway looks up the target model in the pricing catalog.
- It reads the `tokenizer_family` field (e.g., `cl100k_base`, `o200k_base`, `character_estimate`).
- It loads the corresponding adapter to perform local token counting.
- If no `tokenizer_family` is set, the gateway falls back to the character-based estimator.
This selection happens once per request and adds negligible latency (typically less than 1ms).
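The steps above can be sketched in a few lines. This is an illustrative model of the selection flow, not Keeptrusts' actual API — the catalog and registry shapes here are assumptions for the example:

```python
# Sketch of adapter selection from the model-pricing catalog.
# Catalog and registry structures are hypothetical, for illustration only.

def character_estimate(text: str) -> int:
    # Character-based fallback: roughly 4 characters per token.
    return max(1, len(text) // 4)

ADAPTER_REGISTRY = {
    # Stand-in word counter; a real adapter would run the cl100k_base BPE.
    "cl100k_base": lambda text: len(text.split()),
    "character_estimate": character_estimate,
}

PRICING_CATALOG = {
    "gpt-4": {"tokenizer_family": "cl100k_base"},
    "internal-llm": {},  # no tokenizer_family declared
}

def select_adapter(model_id: str):
    """Look up the model's tokenizer_family; fall back to character estimation."""
    family = PRICING_CATALOG.get(model_id, {}).get("tokenizer_family")
    return ADAPTER_REGISTRY.get(family, character_estimate)

print(select_adapter("gpt-4")("Estimate this prompt."))
print(select_adapter("internal-llm")("Estimate this prompt."))
```

Note that both an absent `tokenizer_family` and an unrecognized one resolve to the same fallback, which matches the selection rules described above.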
Built-In Adapters
tiktoken — OpenAI and Azure OpenAI Models
The tiktoken adapter provides accurate token estimation for all OpenAI and Azure OpenAI models. It uses the same tokenization algorithm as the provider, producing `confidence: high` estimates.
Two encoding variants are supported:
cl100k_base
Used by the GPT-4 and GPT-3.5 model families:
- `gpt-4`
- `gpt-4-turbo`
- `gpt-3.5-turbo`
- All dated snapshots of the above (e.g., `gpt-4-0613`)
Token estimation with cl100k_base is typically within 0-2% of the provider's actual count.
o200k_base
Used by the GPT-4o family and newer models:
- `gpt-4o`
- `gpt-4o-mini`
- `o1`
- `o1-mini`
- `o3`
- `o3-mini`
The o200k_base encoding has a larger vocabulary (200k tokens) and produces fewer tokens for the same text compared to cl100k_base. Estimates are typically exact or within 1% variance.
Character-Based Fallback
For providers without public tokenizers — such as certain Anthropic, Cohere, or Mistral models — the gateway uses a character-based approximation:
estimated_tokens = character_count / 4
This produces `confidence: low` estimates. The divisor of 4 is a reasonable average for Latin-script text, but it may undercount for CJK characters and overcount for highly technical content with long tokens.
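A minimal sketch of the fallback heuristic above; the function name is illustrative, not Keeptrusts' internal API:

```python
# Character-based fallback: estimated_tokens = character_count / 4.

def character_estimate(text: str) -> int:
    # ~4 characters per token; CJK or code-heavy text can deviate noticeably.
    return max(1, len(text) // 4)

prompt = "The quick brown fox jumps over the lazy dog."
print(len(prompt), character_estimate(prompt))  # 44 characters -> 11 tokens
```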
Providers using this fallback include:
- Models without a declared `tokenizer_family` in the pricing catalog
- Self-hosted models that do not publish tokenizer specifications
- Any model where the adapter registry has no matching entry
Custom Adapters
You can register custom tokenizer adapters for private or self-hosted models. A custom adapter implements the token counting interface and is associated with a tokenizer_family name.
Registering a Custom Adapter
Add a custom adapter in your gateway configuration:
```yaml
cost_estimation:
  custom_tokenizers:
    - family: "my-custom-tokenizer"
      type: "tiktoken_compatible"
      vocabulary_file: "/path/to/vocab.bpe"
    - family: "simple-word-split"
      type: "regex"
      pattern: "\\s+"
      overhead_factor: 1.3
```
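To make the `regex` adapter concrete, here is a sketch of what one might do with the `simple-word-split` settings above. The class and method names are assumptions for illustration, not Keeptrusts' adapter interface:

```python
import math
import re

class RegexTokenizerAdapter:
    """Hypothetical regex-type adapter: split on a pattern, inflate by overhead."""

    def __init__(self, pattern: str, overhead_factor: float = 1.0):
        self.pattern = re.compile(pattern)
        self.overhead_factor = overhead_factor

    def count_tokens(self, text: str) -> int:
        # Split on the pattern, drop empty pieces, then inflate the word
        # count to approximate subword tokenization.
        pieces = [p for p in self.pattern.split(text) if p]
        return math.ceil(len(pieces) * self.overhead_factor)

adapter = RegexTokenizerAdapter(r"\s+", overhead_factor=1.3)
print(adapter.count_tokens("estimate tokens for this prompt"))  # 5 words -> ceil(6.5) = 7
```

The overhead factor compensates for subword splitting: a word-level count of 5 becomes an estimate of 7, on the assumption that real tokenizers emit roughly 1.3 tokens per whitespace-delimited word.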
Adapter Types
| Type | Description | Use Case |
|---|---|---|
| `tiktoken_compatible` | BPE tokenizer with a custom vocabulary file | Models fine-tuned from OpenAI base with modified vocab |
| `regex` | Splits text by regex pattern, applies overhead factor | Simple estimation for models with word-level tokenization |
| `external` | Calls a local HTTP endpoint for token counting | Models with proprietary tokenizers you host locally |
Assigning a Custom Adapter to a Model
After registering the adapter, set the tokenizer_family field in the model-pricing catalog entry:
```json
{
  "model_id": "my-org/custom-llm-v2",
  "tokenizer_family": "my-custom-tokenizer",
  "input_cost_per_token": 0.000003,
  "output_cost_per_token": 0.000012
}
```
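A mismatch between the catalog's `tokenizer_family` and the registered adapter families silently routes the model to the character fallback, so it can be worth validating catalog entries up front. A hypothetical sanity check, assuming an illustrative set of registered families:

```python
import json

# Families assumed registered for this example; not a Keeptrusts default set.
KNOWN_FAMILIES = {"cl100k_base", "o200k_base", "my-custom-tokenizer"}

catalog_entry = json.loads("""
{
  "model_id": "my-org/custom-llm-v2",
  "tokenizer_family": "my-custom-tokenizer",
  "input_cost_per_token": 0.000003,
  "output_cost_per_token": 0.000012
}
""")

family = catalog_entry.get("tokenizer_family")
if family is not None and family not in KNOWN_FAMILIES:
    raise ValueError(f"unregistered tokenizer_family: {family}")
print(f"{catalog_entry['model_id']} -> {family or 'character fallback'}")
```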
Verifying Token Estimation Accuracy
To check how well your estimates track actual provider usage:
- Review reconciliation data — Open the Cost Center in the console and filter by model. The reconciliation view shows `estimated_input_tokens` vs. `actual_input_tokens` for each request.
- Check the variance trend — Persistent variance above 10% for a specific model suggests the adapter is miscalibrated.
- Compare against provider tools — Use the provider's tokenizer playground (e.g., OpenAI's tiktoken web tool) to spot-check a sample prompt against Keeptrusts' estimate.
When Estimates Diverge From Actuals by More Than 10%
If you observe consistent divergence:
- Verify the `tokenizer_family` assignment — Ensure the model's pricing catalog entry specifies the correct tokenizer family. A mismatch (e.g., using `cl100k_base` for a `gpt-4o` model) causes systematic over- or under-counting.
- Check for special tokens — Some providers count system tokens, tool definitions, or function call schemas differently. These may not be visible in your prompt text but add to the provider's reported count.
- Update the pricing catalog — If the provider has migrated a model to a new tokenizer, update the `tokenizer_family` field accordingly.
- Consider a custom adapter — For self-hosted models with non-standard tokenization, the character fallback is intentionally conservative. A custom adapter matched to your model's actual tokenizer eliminates the variance.
Output Token Estimation
Output tokens cannot be counted in advance because the model has not yet generated its response. The gateway uses estimation_config.output_token_multiplier to predict output length:
estimated_output_tokens = max_tokens * output_token_multiplier
The default output_token_multiplier is 0.5, meaning the estimate assumes the model will use half of its allowed output budget. You can tune this per model or per gateway:
```yaml
cost_estimation:
  output_token_multiplier: 0.5
  model_overrides:
    "gpt-4o":
      output_token_multiplier: 0.3
    "claude-3-opus":
      output_token_multiplier: 0.7
```
Models that tend to produce shorter responses (summarization, classification) benefit from a lower multiplier. Models used for code generation or long-form content may need a higher value.
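The formula and override lookup above can be sketched as follows; the helper function and config dictionaries mirror the YAML example but are assumptions, not gateway code:

```python
# estimated_output_tokens = max_tokens * output_token_multiplier,
# with per-model overrides taking precedence over the default.

DEFAULT_MULTIPLIER = 0.5
MODEL_OVERRIDES = {"gpt-4o": 0.3, "claude-3-opus": 0.7}

def estimate_output_tokens(model_id: str, max_tokens: int) -> int:
    multiplier = MODEL_OVERRIDES.get(model_id, DEFAULT_MULTIPLIER)
    return int(max_tokens * multiplier)

print(estimate_output_tokens("gpt-4o", 1000))         # 300
print(estimate_output_tokens("claude-3-opus", 1000))  # 700
print(estimate_output_tokens("unknown-model", 1000))  # 500 (default 0.5)
```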
Adapter Performance
All built-in adapters run locally in the gateway process with no network calls:
| Adapter | Typical Latency | Memory Overhead |
|---|---|---|
| `cl100k_base` | less than 1ms per 4K tokens | ~4MB vocabulary |
| `o200k_base` | less than 1ms per 4K tokens | ~6MB vocabulary |
| Character fallback | less than 0.1ms | None |
| Custom (`tiktoken_compatible`) | less than 2ms per 4K tokens | Depends on vocab size |
| Custom (`external`) | 5-50ms | None (network call) |
Next steps
- Pre-Dispatch Prompt Cost Estimates — see how token estimates feed into the full cost calculation.
- Estimate vs Actual Cost Reconciliation — understand what happens when estimates diverge from actuals.
- Cache and Fabric Cost Adjustments — learn how cached context affects token counts and costs.
For AI systems
- Canonical terms: Keeptrusts, token estimation, tokenizer adapters, tiktoken, cl100k_base, o200k_base, character-based fallback, custom tokenizer, output_token_multiplier, model-pricing catalog tokenizer_family.
- Feature/config names: `cost_estimation.custom_tokenizers`, `cost_estimation.output_token_multiplier`, `cost_estimation.model_overrides`; tokenizer types: `tiktoken_compatible`, `regex`, `external`; `tokenizer_family` field in model-pricing catalog.
- Best next pages: Pre-Dispatch Prompt Cost Estimates, Estimate vs Actual Cost Reconciliation, Cache and Fabric Cost Adjustments.
For engineers
- Adapter selection: gateway reads `tokenizer_family` from the model-pricing catalog entry. Ensure each model has the correct family (`cl100k_base` for GPT-4, `o200k_base` for GPT-4o/o1/o3).
- Character fallback (÷4): used when no tokenizer family is set. Produces `confidence: low` estimates. Register a custom adapter for better accuracy.
- Troubleshooting >10% divergence: check `tokenizer_family` assignment, account for special tokens (tool definitions, system tokens), and verify the pricing catalog model version.
- Performance: all built-in adapters run locally with less than 1ms latency per 4K tokens and no network calls.
For leaders
- Accurate token estimation drives reliable pre-dispatch cost visibility — critical for engineer trust in cost badges and wallet management.
- The character fallback (÷4) is intentionally conservative; investing in proper adapter configuration per model eliminates 20%+ estimation variance.
- Custom adapters support private/self-hosted models without requiring external API calls for tokenization.
- Track the percentage of requests using `confidence: low` as a metric for pricing catalog completeness.