Context Compression
Context compression automatically truncates or summarises conversation history when a multi-turn conversation approaches the provider's context window limit. The gateway inspects the total token count of the request before forwarding it; when the count exceeds the configured trigger threshold, compression is applied transparently before the request reaches the upstream provider.
Use this page when
- You need the exact command, config, API, or integration details for Context Compression.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
This means your application code never needs to manage context window budgeting manually. Long-running chat sessions, agentic tool loops, and multi-turn assistants simply keep appending messages — the gateway handles the pruning.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Why Context Compression?
LLM providers have finite context windows. Current limits range from 8K tokens (older models) to 128K–200K tokens (latest frontier models), but even large windows fill up in long-running agent sessions that include many tool calls, large tool outputs, or lengthy system prompts.
Without compression
Without any mitigation strategy:
- Requests eventually fail with context_length_exceeded (HTTP 400; OpenAI error code context_length_exceeded, Anthropic invalid_request_error).
- The failure happens at the provider level after all policy checks and network overhead, wasting time and tokens.
- Application code must catch the error, manually truncate history, and retry — adding latency and complexity (a sketch of this loop follows below).
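The manual mitigation usually looks something like the sketch below: catch the provider's context-length error, drop the oldest non-system message, and retry. It is illustrative only; it assumes the official openai Python SDK and inspects the error text for the code. With compression enabled at the gateway, none of this lives in application code.

# Illustrative sketch of the manual truncate-and-retry loop that context compression removes.
from openai import OpenAI, BadRequestError

client = OpenAI()

def chat_with_manual_truncation(messages, model="gpt-4o"):
    while True:
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except BadRequestError as exc:
            # Provider rejected the request because the context window is full.
            if "context_length_exceeded" not in str(exc):
                raise
            # Drop the oldest non-system message and retry: extra latency, extra code.
            for i, msg in enumerate(messages):
                if msg["role"] != "system":
                    del messages[i]
                    break
            else:
                raise  # nothing left to drop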
With compression
With context compression enabled:
- The gateway inspects the total token count of every request before forwarding.
- When the count exceeds max_context_tokens × trigger_ratio (default 90%), the compression strategy fires.
- The pruned request is forwarded to the upstream; the client and application code never see an error.
- Compression events are logged so you can observe them in the Keeptrusts console.
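A minimal sketch of the resulting append-only client pattern, assuming the gateway exposes an OpenAI-compatible chat endpoint; the base URL, API key, and target id below are placeholders, not canonical values.

# Illustrative only: point the OpenAI SDK at the gateway and keep appending messages.
from openai import OpenAI

client = OpenAI(base_url="https://gateway.example.com/v1", api_key="GATEWAY_KEY")
messages = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(user_text: str) -> str:
    messages.append({"role": "user", "content": user_text})
    # No token budgeting here: the gateway compresses the history if it nears the limit.
    reply = client.chat.completions.create(model="openai-gpt4o", messages=messages)
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    return answer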
ProviderContextCompression Fields
Context compression is configured under a provider target's context_compression key, or at the global level under context_compression to apply to all targets.
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable context compression for this target. |
| strategy | string | "drop_oldest" | Compression strategy: drop_oldest or summarize (see Strategies). |
| preserve_system_message | bool | true | When true, the system message is never removed regardless of how aggressive the compression is. |
| preserve_first_n | integer | 0 | Always preserve the first N user/assistant message pairs from the conversation, in addition to the system message. |
| preserve_last_n | integer | 5 | Always preserve the most recent N user/assistant message pairs from the conversation. |
| max_messages | integer | null | Hard cap on the total number of messages in the request (excluding the system message). Older messages are dropped when the count exceeds this value. Applied before token-based compression. |
| max_context_tokens | integer | — | Maximum tokens for this target's context window. Defaults to the provider catalog value if not set. |
| trigger_ratio | float | 0.9 | Fraction of max_context_tokens at which compression triggers (e.g., 0.9 = fire when 90% full). |
| message_compression_strategy | string | "omit" | How individual messages are handled when dropped: omit removes them entirely; truncate keeps the message but cuts its content to fit. |
| tokenizer | string | "cl100k_base" | Tiktoken tokenizer used to count tokens. Use "o200k_base" for GPT-4o and o-series; "claude" for Anthropic models. |
Strategies
drop_oldest
The drop_oldest strategy removes messages from the beginning of the conversation (after any preserve_first_n protection) until the total token count drops below the target threshold.
[ system ] [ turn 1 ] [ turn 2 ] [ turn 3 ] [ turn 4 ] [ turn 5 ]
↑ oldest ↑ newest
When compression fires and needs to shed 2 turns, turns 1 and 2 are removed:
[ system ] [ turn 3 ] [ turn 4 ] [ turn 5 ]
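The pruning logic can be pictured with the sketch below. It is illustrative, not the gateway's implementation; count_tokens stands in for a tokenizer-backed estimator such as tiktoken.

# Illustrative sketch of drop_oldest pruning (not the gateway's actual code).
def drop_oldest(messages, count_tokens, target_tokens,
                preserve_system_message=True, preserve_first_n=0, preserve_last_n=5):
    non_system = [i for i, m in enumerate(messages) if m["role"] != "system"]
    protected = {i for i, m in enumerate(messages)
                 if preserve_system_message and m["role"] == "system"}
    protected.update(non_system[:preserve_first_n * 2])       # first N user/assistant pairs
    if preserve_last_n:
        protected.update(non_system[-preserve_last_n * 2:])   # most recent N pairs

    keep = set(range(len(messages)))
    for i in non_system:                                       # walk from the oldest message
        if sum(count_tokens(messages[j]) for j in keep) <= target_tokens:
            break
        if i not in protected:
            keep.discard(i)
    return [messages[j] for j in sorted(keep)]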
Use drop_oldest when:
- Recent context is more relevant than historical context (most chat applications).
- The conversation is a simple turn-by-turn dialogue with no persistent references to early messages.
- You want the simplest, fastest compression behaviour with no additional LLM calls.
summarize (planned)
The summarize strategy calls a configured summarisation model to condense the messages being dropped into a short summary, which is prepended to the remaining history as a synthetic assistant message before the originals are removed.
context_compression:
  enabled: true
  strategy: summarize
  summarizer:
    provider: openai-gpt4o-mini   # cheap model for summarisation
    max_summary_tokens: 300
    summary_prompt: |
      Summarise the following conversation history in 3–5 sentences,
      preserving key facts, decisions, and unresolved questions.
The summarize strategy is planned for a future release. The current stable strategy is drop_oldest. Configuring strategy: summarize falls back to drop_oldest until the feature ships.
Trigger threshold
Compression fires when the estimated token count of the full request exceeds:
trigger_tokens = max_context_tokens × trigger_ratio
For example, with max_context_tokens: 128000 and trigger_ratio: 0.9, compression fires when the request contains more than 115,200 tokens. After compression, the gateway targets a token count at or below max_context_tokens × 0.75 (a 15-point headroom below the trigger) to avoid repeatedly re-triggering compression on every subsequent message.
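A quick worked check of those numbers:

# Worked example using the values above.
max_context_tokens = 128_000
trigger_ratio = 0.9
trigger_tokens = max_context_tokens * trigger_ratio        # 115,200: compression fires above this
post_compression_target = max_context_tokens * 0.75        # 96,000: the gateway prunes to roughly here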
Configuration Examples
Basic drop_oldest
The most common configuration — preserve the system message and recent context, drop everything else when approaching the limit:
pack:
  name: context-compression-providers-2
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai-gpt4o
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
      context_compression:
        enabled: true
        strategy: drop_oldest
        preserve_system_message: true
        preserve_last_n: 10          # keep the 10 most recent turns
        max_context_tokens: 128000
        trigger_ratio: 0.9
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
With this config, a conversation of 100 turns that hits 90% of the 128K context window will have its oldest messages stripped, keeping the system message and the 10 most recent turns.
Agentic tool loop — preserve anchoring context
Agentic workflows often establish critical context in the first few turns (task description, goals, constraints) that must never be dropped, plus produce many intermediate tool-call/result pairs that are safe to drop once the agent has moved on:
pack:
  name: context-compression-providers-3
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: anthropic-sonnet
      provider: anthropic:chat:claude-3-5-sonnet-20241022
      secret_key_ref:
        env: ANTHROPIC_API_KEY
      context_compression:
        enabled: true
        strategy: drop_oldest
        preserve_system_message: true
        preserve_first_n: 2          # task definition turns
        preserve_last_n: 15          # active working memory
        max_context_tokens: 200000
        tokenizer: "claude"
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Example message budget at compression time for a 200K-token model:
| Region | Messages kept | Reason |
|---|---|---|
| System | 1 | Always preserved |
| First 2 turns | 2 | Task definition |
| Last 15 turns | 15 | Active working memory |
| Everything else | dropped | Old tool outputs |
Multi-turn chat — rolling window
For a customer support chatbot where no single message is critical to preserve, use a max_messages rolling window to keep memory overhead bounded regardless of token count:
pack:
  name: context-compression-providers-4
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai-gpt4o-mini
      provider: openai:chat:gpt-4o-mini
      secret_key_ref:
        env: OPENAI_API_KEY
      context_compression:
        enabled: true
        strategy: drop_oldest
        max_messages: 20             # rolling window: cap the history at 20 messages
        preserve_last_n: 5
        max_context_tokens: 128000
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
The max_messages: 20 limit fires before token-based compression. As a result, the conversation is bounded to 20 messages regardless of whether the token trigger is reached.
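That ordering can be sketched as follows; the sketch is illustrative and reuses the drop_oldest sketch from the Strategies section. The message-count cap is applied first, and the token trigger then decides whether drop_oldest needs to run at all.

# Illustrative ordering: hard message cap first, token-based compression second.
def compress(messages, count_tokens, cfg):
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]

    # 1. max_messages rolling window (the system message is excluded from the count).
    if cfg.get("max_messages") and len(history) > cfg["max_messages"]:
        history = history[-cfg["max_messages"]:]
    pruned = system + history

    # 2. Token trigger: only run drop_oldest if still above the threshold.
    trigger = cfg["max_context_tokens"] * cfg.get("trigger_ratio", 0.9)
    if sum(count_tokens(m) for m in pruned) > trigger:
        pruned = drop_oldest(pruned, count_tokens,
                             target_tokens=cfg["max_context_tokens"] * 0.75)
    return pruned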
Per-target with global fallback
You can set a global default and override per target:
pack:
  name: context-compression-providers-5
  version: 1.0.0
  enabled: true
context_compression:               # global default applied to all targets (illustrative values)
  enabled: true
  strategy: drop_oldest
  preserve_last_n: 5
providers:
  targets:
    - id: openai-gpt4o
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: gpt4o-large-context
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
      context_compression:         # per-target override (illustrative values)
        enabled: true
        preserve_last_n: 20
        trigger_ratio: 0.8
    - id: gpt4o-mini
      provider: openai:chat:gpt-4o-mini
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Combining with max_context_tokens
max_context_tokens sets the upper bound for this provider target's context window. It serves two purposes:
- Pre-request rejection gate — requests that still exceed max_context_tokens after compression are rejected with a 413 Content Too Large response rather than being forwarded and failing at the provider level.
- Compression trigger — the trigger token count is max_context_tokens × trigger_ratio.
pack:
  name: context-compression-providers-6
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai-gpt4o
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
      context_compression:
        enabled: true
        strategy: drop_oldest
        max_context_tokens: 128000   # rejection gate and compression trigger base
        trigger_ratio: 0.9
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
If after compression the request still exceeds max_context_tokens, the gateway returns:
{
  "error": {
    "type": "context_too_long",
    "message": "Request context cannot be compressed below max_context_tokens: 128000. Reduce the size of your system message or conversation history.",
    "code": "context_too_long"
  }
}
This hard-gate prevents your application from silently receiving a truncated, nonsensical response from a provider that would otherwise accept an oversized request and hallucinate.
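If you want application code to surface this condition explicitly, the handling sketch below is one option; the URL, API key, and payload wrapper are placeholders, and the error shape follows the response shown above.

# Illustrative handling of the gateway's context_too_long rejection (HTTP 413).
import requests

messages = [{"role": "user", "content": "..."}]
resp = requests.post(
    "https://gateway.example.com/v1/chat/completions",
    json={"model": "openai-gpt4o", "messages": messages},
    headers={"Authorization": "Bearer GATEWAY_KEY"},
)
body = resp.json()
if resp.status_code == 413 and body.get("error", {}).get("code") == "context_too_long":
    # Even compression could not fit the request: shrink the system prompt or start a new session.
    raise RuntimeError(body["error"]["message"])
resp.raise_for_status()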
Monitoring Compression Events
Keeptrusts logs every compression event as a structured context_compression event forwarded to the control-plane API:
{
  "timestamp": "2026-03-27T14:31:00.411Z",
  "event_type": "context_compression",
  "target_id": "openai-gpt4o",
  "strategy": "drop_oldest",
  "pre_compression_tokens": 116842,
  "post_compression_tokens": 96240,
  "messages_before": 48,
  "messages_after": 18,
  "messages_dropped": 30,
  "system_message_preserved": true,
  "first_n_preserved": 2,
  "last_n_preserved": 10,
  "trigger_ratio_applied": 0.9,
  "max_context_tokens": 128000
}
Key fields to monitor in the Events view:
| Field | What to watch for |
|---|---|
| messages_dropped | Consistently high values indicate conversations need rethinking or preserve_last_n needs increasing. |
| post_compression_tokens | Should stay well below max_context_tokens. If it approaches the limit, lower trigger_ratio. |
| event_type: context_compression frequency | High frequency on short conversations indicates a system prompt that is too large. |
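For offline analysis, a small sketch assuming you have exported events as JSON Lines (the file path is a placeholder):

# Illustrative offline analysis of exported context_compression events.
import json
from statistics import mean

with open("events.jsonl") as fh:                      # exported events, one JSON object per line
    events = [json.loads(line) for line in fh if line.strip()]

compressions = [e for e in events if e.get("event_type") == "context_compression"]
if compressions:
    print("compression events:", len(compressions))
    print("avg messages_dropped:", mean(e["messages_dropped"] for e in compressions))
    print("max post_compression_tokens:", max(e["post_compression_tokens"] for e in compressions))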
Best Practices
- Always set preserve_system_message: true. The system message usually contains the model's persona, safety guidelines, and task instructions. Dropping it produces undefined model behaviour. This is the default but worth keeping explicit in your config.
- Set preserve_last_n to at least 4–6 for interactive chat. Users expect the model to remember what was said two or three turns ago. Dropping everything before the last turn produces obviously broken conversations.
- Use preserve_first_n for agentic task briefings. The first 1–2 turns often contain the task goal, constraints, and available tools. Preserving these prevents the agent from losing track of its objective mid-session.
- Use max_messages to bound memory use on high-volume bots. For customer support bots with millions of daily sessions, uncapped conversation histories consume unbounded memory. A max_messages: 20 limit keeps per-session memory predictable.
- Tune trigger_ratio downward if you see post_compression_tokens close to max_context_tokens. A trigger ratio of 0.9 leaves only 10% headroom. If reply completions consume 10% or more of the context window, you'll hit the limit mid-turn. Try 0.80 for models that generate long completions.
- Monitor compression events and react to them. Frequent compression is a signal that your application is generating longer conversations than the model was designed for. Consider splitting the conversation into sessions, using session summaries at handoff, or upgrading to a provider with a larger context window.
Pre-Compression Context Flush
When the layered memory system is enabled, the gateway can run a deterministic context flush step before falling back to lossy compression. This preserves important context that would otherwise be permanently lost.
How it works
- After normal context assembly (frozen memory + ranked recall + episodic history), the gateway checks if the total prompt exceeds the provider's token budget.
- If it does and context_flush_enabled = true on the agent, the gateway calls the API's context flush endpoint.
- The flush produces a condensed summary of the conversation so far, stored as a history_condensations record.
- A recall document is created from the condensation so it can be picked up on future context resolutions.
- The gateway re-resolves context once. If the prompt now fits, the request proceeds without any lossy compression.
- If it still doesn't fit, the normal lossy compression strategy fires as a fallback (sketched below).
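The same flow in pseudocode; every helper name (assemble_context, token_count, call_flush_endpoint, store_recall_document, lossy_compress, FlushError) is a placeholder, not the gateway's actual API.

# Illustrative pseudocode for the pre-compression context flush flow; all helpers are placeholders.
def resolve_prompt(agent, conversation, budget_tokens):
    prompt = assemble_context(conversation)          # frozen memory + ranked recall + episodic history
    if token_count(prompt) <= budget_tokens:
        return prompt
    if agent.context_flush_enabled:
        try:
            condensation = call_flush_endpoint(conversation, timeout_ms=agent.context_flush_timeout_ms)
            store_recall_document(condensation)      # picked up on future context resolutions
            prompt = assemble_context(conversation)  # re-resolve once
            if token_count(prompt) <= budget_tokens:
                return prompt                        # fits: no lossy compression needed
        except FlushError:
            if agent.context_flush_failure_policy == "fail_closed":
                raise                                # regulated profiles: fail rather than lose context
            # fallback_to_lossy: fall through to lossy compression
    return lossy_compress(prompt, budget_tokens)     # drop_oldest fires as the fallback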
Flush failure policy
The context_flush_failure_policy agent setting controls what happens when the flush step fails or times out:
| Policy | Behavior |
|---|---|
| fallback_to_lossy (default) | The request proceeds with normal lossy compression. No context is preserved but the request completes. |
| fail_closed | The request is rejected with a structured error. Use this for regulated profiles where silent context loss is not acceptable. |
Configuration
# Agent-level settings (API / console)
context_flush_enabled: true
context_flush_timeout_ms: 5000
context_flush_failure_policy: "fallback_to_lossy"
Monitoring
Context flush events appear in history entry metadata and the gateway telemetry. Look for:
- context_flush_invocations — how often flush is triggered
- context_flush_successes — how often flush avoids lossy compression
- context_flush_fallbacks — how often flush falls back to lossy compression
- flush_duration_ms — latency added by the flush step
For AI systems
- Canonical terms: Keeptrusts Context Compression, drop_oldest strategy, summarize strategy (planned), context flush, trigger ratio, preserve_system_message.
- Config keys: context_compression.enabled, context_compression.strategy (drop_oldest|summarize), context_compression.preserve_system_message, context_compression.preserve_first_n, context_compression.preserve_last_n, context_compression.max_messages, context_compression.max_context_tokens, context_compression.trigger_ratio, context_compression.tokenizer (cl100k_base|o200k_base|claude).
- Agent-level settings: context_flush_enabled, context_flush_timeout_ms, context_flush_failure_policy (fallback_to_lossy|fail_closed).
- Error response: context_too_long when the request cannot be compressed below max_context_tokens.
- Event type: context_compression with fields pre_compression_tokens, post_compression_tokens, messages_dropped.
- Best next pages: Provider Fallback, Semantic Caching, Provider Routing.
For engineers
- Prerequisites: set max_context_tokens to match the provider model's actual context window (e.g., 128000 for GPT-4o, 200000 for Claude 3.5 Sonnet).
- Use tokenizer: "o200k_base" for GPT-4o/o-series models and tokenizer: "claude" for Anthropic models.
- Validate: send a conversation exceeding 90% of the context window and confirm the response succeeds with a context_compression event in the Events view (a validation sketch follows this list).
- Tune trigger_ratio downward (e.g., 0.80) if post_compression_tokens frequently approaches max_context_tokens.
- For agentic workflows: set preserve_first_n: 2 to retain the task briefing across compression cycles.
- Monitor messages_dropped — consistently high values suggest sessions need splitting or a larger context window model.
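A validation sketch for the step above; the gateway URL, API key, and target id are placeholders. It pads the history well past 90% of a 128K window and confirms the request still succeeds, after which a context_compression event should appear in the Events view.

# Illustrative validation: overfill the context window and confirm the request still succeeds.
from openai import OpenAI

client = OpenAI(base_url="https://gateway.example.com/v1", api_key="GATEWAY_KEY")

filler = "lorem ipsum " * 2000                      # several thousand tokens per message
messages = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(30):                                 # comfortably past 90% of a 128K window
    messages.append({"role": "user", "content": f"chunk {i}: {filler}"})
    messages.append({"role": "assistant", "content": "noted."})
messages.append({"role": "user", "content": "Summarise what I sent you."})

resp = client.chat.completions.create(model="openai-gpt4o", messages=messages)
print(resp.choices[0].message.content)              # then check the Events view for a context_compression event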
For leaders
- Reliability: context compression prevents context_length_exceeded errors from reaching end users, eliminating a class of silent failures in long-running chat sessions and agent loops.
- Cost: compression reduces forwarded token count, directly lowering per-request cost on long conversations.
- Compliance: context_flush_failure_policy: fail_closed ensures regulated workflows never silently lose context — they fail explicitly for human review.
- Capacity planning: frequent compression events signal the need to upgrade to larger-context models or redesign session boundaries.
Next steps
- Provider Fallback — route to larger-context models when compression is insufficient
- Semantic Caching — cache responses to reduce repeated context window pressure
- Model Groups — define fallback to larger-context model pools
- Rate Limiting — token rate limits interact with compressed token counts