
Context Compression

Context compression automatically truncates or summarises conversation history when a multi-turn conversation approaches the provider's context window limit. The gateway inspects the total token count of the request before forwarding it; when the count exceeds the configured trigger threshold, compression is applied transparently before the request reaches the upstream provider.

Use this page when

  • You need the exact command, config, API, or integration details for Context Compression.
  • You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
  • You want a guided rollout instead of a reference page; start from the linked workflow pages in Next steps.

This means your application code never needs to manage context window budgeting manually. Long-running chat sessions, agentic tool loops, and multi-turn assistants simply keep appending messages — the gateway handles the pruning.


Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Why Context Compression?

LLM providers have finite context windows. Current limits range from 8K tokens (older models) to 128K–200K tokens (latest frontier models), but even large windows fill up in long-running agent sessions that include many tool calls, large tool outputs, or lengthy system prompts.

Without compression

Without any mitigation strategy:

  • Requests eventually fail with context_length_exceeded (HTTP 400, OpenAI error code context_length_exceeded / Anthropic invalid_request_error).
  • The failure happens at the provider level after all policy checks and network overhead, wasting time and tokens.
  • Application code must catch the error, manually truncate history, and retry — adding latency and complexity.

With compression

With context compression enabled:

  • The gateway inspects the total token count of every request before forwarding.
  • When the count exceeds max_context_tokens × trigger_ratio (default 90%), the compression strategy fires.
  • The pruned request is forwarded to the upstream; the client and application code never see an error.
  • Compression events are logged so you can observe them in the Keeptrusts console.

Provider context_compression Fields

Context compression is configured under a provider target's context_compression key, or at the global level under context_compression to apply to all targets.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | false | Enable context compression for this target. |
| strategy | string | "drop_oldest" | Compression strategy: drop_oldest or summarize (see Strategies). |
| preserve_system_message | bool | true | When true, the system message is never removed, regardless of how aggressive the compression is. |
| preserve_first_n | integer | 0 | Always preserve the first N user/assistant message pairs from the conversation, in addition to the system message. |
| preserve_last_n | integer | 5 | Always preserve the most recent N user/assistant message pairs from the conversation. |
| max_messages | integer | null | Hard cap on the total number of messages in the request (excluding the system message). Older messages are dropped when the count exceeds this value. Applied before token-based compression. |
| max_context_tokens | integer | (provider catalog) | Maximum tokens for this target's context window. Defaults to the provider catalog value if not set. |
| trigger_ratio | float | 0.9 | Fraction of max_context_tokens at which compression triggers (e.g., 0.9 fires when 90% full). |
| message_compression_strategy | string | "omit" | How individual messages are handled when dropped: omit removes them entirely; truncate keeps the message but cuts its content to fit. |
| tokenizer | string | "cl100k_base" | Tiktoken tokenizer used to count tokens. Use "o200k_base" for GPT-4o and o-series models; "claude" for Anthropic models. |
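Taken together, the fields map onto a single config block. The following sketch shows every documented key; the values are the documented defaults, except where a comment notes otherwise:

```yaml
context_compression:
  enabled: true                        # default is false; shown enabled here
  strategy: drop_oldest                # summarize is planned
  preserve_system_message: true
  preserve_first_n: 0
  preserve_last_n: 5
  max_messages: null                   # no hard message cap
  max_context_tokens: 128000           # defaults to the provider catalog value
  trigger_ratio: 0.9
  message_compression_strategy: omit   # or truncate
  tokenizer: cl100k_base               # o200k_base for GPT-4o; claude for Anthropic
```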

Strategies

drop_oldest

The drop_oldest strategy removes messages from the beginning of the conversation (after any preserve_first_n protection) until the total token count drops below the target threshold.

[ system ] [ turn 1 ] [ turn 2 ] [ turn 3 ] [ turn 4 ] [ turn 5 ]
             ↑ oldest                                     ↑ newest

When compression fires and needs to shed 2 turns, turns 1 and 2 are removed:

[ system ] [ turn 3 ] [ turn 4 ] [ turn 5 ]

Use drop_oldest when:

  • Recent context is more relevant than historical context (most chat applications).
  • The conversation is a simple turn-by-turn dialogue with no persistent references to early messages.
  • You want the simplest, fastest compression behaviour with no additional LLM calls.
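The pruning loop can be sketched in a few lines. This is an illustrative model, not the gateway's implementation; count_tokens stands in for the configured tiktoken tokenizer, and preserve_last_n handling is omitted for brevity:

```python
def count_tokens(msg: dict) -> int:
    # Crude stand-in for the configured tiktoken tokenizer.
    return len(msg["content"].split())

def drop_oldest(messages: list, target_tokens: int,
                preserve_first_n: int = 0, preserve_system: bool = True) -> list:
    """Drop the oldest unprotected messages until the total fits target_tokens."""
    pruned = list(messages)
    protected = 0
    if preserve_system and pruned and pruned[0]["role"] == "system":
        protected += 1
    protected += preserve_first_n * 2  # each preserved turn is a user/assistant pair
    while sum(count_tokens(m) for m in pruned) > target_tokens and len(pruned) > protected:
        del pruned[protected]  # remove the oldest message outside the protected prefix
    return pruned
```

Because deletion always happens just past the protected prefix, the system message and any preserve_first_n turns survive no matter how much must be shed.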

summarize (planned)

The summarize strategy calls a configured summarisation model to condense the messages that are about to be dropped into a short summary, which is prepended to the remaining history as a synthetic assistant message.

context_compression:
  enabled: true
  strategy: summarize
  summarizer:
    provider: openai-gpt4o-mini  # cheap model for summarisation
    max_summary_tokens: 300
    summary_prompt: |
      Summarise the following conversation history in 3–5 sentences,
      preserving key facts, decisions, and unresolved questions.

The summarize strategy is planned for a future release. The current stable strategy is drop_oldest. Configuring strategy: summarize falls back to drop_oldest until the feature ships.

Trigger threshold

Compression fires when the estimated token count of the full request exceeds:

trigger_tokens = max_context_tokens × trigger_ratio

For example, with max_context_tokens: 128000 and trigger_ratio: 0.9, compression fires when the request contains more than 115,200 tokens. After compression, the gateway targets a token count at or below max_context_tokens × 0.75 (a 15-point headroom below the trigger) to avoid repeatedly re-triggering compression on every subsequent message.
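In code, the arithmetic above looks like this (a sketch; the 15-point post-compression headroom mirrors the documented default behaviour):

```python
def compression_bounds(max_context_tokens: int, trigger_ratio: float = 0.9) -> tuple:
    """Return (trigger_tokens, post_compression_target) for a context window."""
    trigger_tokens = int(max_context_tokens * trigger_ratio)
    # After compression the gateway targets trigger_ratio - 0.15
    # (0.75 for the default 0.9) to avoid re-triggering on every turn.
    post_target = int(max_context_tokens * (trigger_ratio - 0.15))
    return trigger_tokens, post_target

print(compression_bounds(128_000))  # (115200, 96000)
```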


Configuration Examples

Basic drop_oldest

The most common configuration — preserve the system message and recent context, drop everything else when approaching the limit:

pack:
  name: context-compression-providers-2
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: openai-gpt4o
        provider: openai:chat:gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
        context_compression:
          enabled: true
          strategy: drop_oldest
          preserve_system_message: true
          preserve_last_n: 10
          max_context_tokens: 128000
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true

With this config, a conversation of 100 turns that hits 90% of the 128K context window will have its oldest messages stripped, keeping the system message and the 10 most recent turns.

Agentic tool loop — preserve anchoring context

Agentic workflows often establish critical context in the first few turns (task description, goals, constraints) that must never be dropped, plus produce many intermediate tool-call/result pairs that are safe to drop once the agent has moved on:

pack:
  name: context-compression-providers-3
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: anthropic-sonnet
        provider: anthropic:chat:claude-3-5-sonnet-20241022
        secret_key_ref:
          env: ANTHROPIC_API_KEY
        context_compression:
          enabled: true
          strategy: drop_oldest
          preserve_system_message: true
          preserve_first_n: 2    # task briefing
          preserve_last_n: 15    # active working memory
          max_context_tokens: 200000
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true

Example message budget at compression time for a 200K-token model:

| Region | Messages kept | Reason |
| --- | --- | --- |
| System | 1 | Always preserved |
| First 2 turns | 2 | Task definition |
| Last 15 turns | 15 | Active working memory |
| Everything else | dropped | Old tool outputs |

Multi-turn chat — rolling window

For a customer support chatbot where no single message is critical to preserve, use a max_messages rolling window to keep memory overhead bounded regardless of token count:

pack:
  name: context-compression-providers-4
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: openai-gpt4o-mini
        provider: openai:chat:gpt-4o-mini
        secret_key_ref:
          env: OPENAI_API_KEY
        context_compression:
          enabled: true
          strategy: drop_oldest
          preserve_system_message: true
          max_messages: 20
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true

The max_messages: 20 limit fires before token-based compression. As a result, the conversation is bounded to 20 messages regardless of whether the token trigger is reached.
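The documented ordering (message cap first, token-based compression second) can be sketched as follows; apply_max_messages is illustrative, not gateway code:

```python
def apply_max_messages(messages: list, max_messages: int) -> list:
    """Cap non-system messages at max_messages, dropping the oldest first.

    Per the field reference, max_messages excludes the system message
    and is applied before any token-based compression step.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) > max_messages:
        rest = rest[len(rest) - max_messages:]  # keep only the newest
    return system + rest
```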

Per-target with global fallback

You can set a global default and override per target:

pack:
  name: context-compression-providers-5
  version: 1.0.0
  enabled: true
  context_compression:          # global default applied to all targets
    enabled: true
    strategy: drop_oldest
    preserve_last_n: 5
  providers:
    targets:
      - id: openai-gpt4o
        provider: openai:chat:gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: gpt4o-large-context
        provider: openai:chat:gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
        context_compression:    # per-target override (illustrative values)
          preserve_last_n: 20
          trigger_ratio: 0.8
      - id: gpt4o-mini
        provider: openai:chat:gpt-4o-mini
        secret_key_ref:
          env: OPENAI_API_KEY
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true

Combining with max_context_tokens

max_context_tokens sets the upper bound for this provider target's context window. It serves two purposes:

  1. Pre-request rejection gate — requests that still exceed max_context_tokens after compression are rejected with a 413 Content Too Large response rather than being forwarded and failing at the provider level.
  2. Compression trigger — the trigger token count is max_context_tokens × trigger_ratio.

pack:
  name: context-compression-providers-6
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: openai-gpt4o
        provider: openai:chat:gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
        context_compression:
          enabled: true
          strategy: drop_oldest
          max_context_tokens: 128000
          trigger_ratio: 0.9
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true

If after compression the request still exceeds max_context_tokens, the gateway returns:

{
  "error": {
    "type": "context_too_long",
    "message": "Request context cannot be compressed below max_context_tokens: 128000. Reduce the size of your system message or conversation history.",
    "code": "context_too_long"
  }
}

This hard gate prevents your application from silently receiving a truncated, nonsensical response from a provider that would otherwise accept an oversized request and hallucinate.
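When a client does need to react to this error, the response shape above is what it sees. A hypothetical handler (handle_gateway_error is illustrative, not part of any SDK):

```python
import json

def handle_gateway_error(status_code: int, body: str) -> str:
    """Decide how to react to a gateway error response (illustrative)."""
    err = json.loads(body).get("error", {})
    if status_code == 413 and err.get("code") == "context_too_long":
        # Retrying as-is cannot succeed: shrink the system prompt,
        # split the session, or raise max_context_tokens instead.
        return "shrink_and_retry"
    return "propagate"
```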


Monitoring Compression Events

Keeptrusts logs every compression event as a structured context_compression event forwarded to the control-plane API:

{
  "timestamp": "2026-03-27T14:31:00.411Z",
  "event_type": "context_compression",
  "target_id": "openai-gpt4o",
  "strategy": "drop_oldest",
  "pre_compression_tokens": 116842,
  "post_compression_tokens": 96240,
  "messages_before": 48,
  "messages_after": 18,
  "messages_dropped": 30,
  "system_message_preserved": true,
  "first_n_preserved": 2,
  "last_n_preserved": 10,
  "trigger_ratio_applied": 0.9,
  "max_context_tokens": 128000
}

Key fields to monitor in the Events view:

| Field | What to watch for |
| --- | --- |
| messages_dropped | Consistently high values indicate conversations need rethinking or preserve_last_n needs increasing. |
| post_compression_tokens | Should stay well below max_context_tokens. If it approaches the limit, lower trigger_ratio. |
| context_compression event frequency | High frequency on short conversations indicates a system prompt that is too large. |

Best Practices

  1. Always set preserve_system_message: true. The system message usually contains the model's persona, safety guidelines, and task instructions. Dropping it produces undefined model behaviour. This is the default but worth keeping explicit in your config.

  2. Set preserve_last_n to at least 4–6 for interactive chat. Users expect the model to remember what was said two or three turns ago. Dropping everything before the last turn produces obviously broken conversations.

  3. Use preserve_first_n for agentic task briefings. The first 1–2 turns often contain the task goal, constraints, and available tools. Preserving these prevents the agent from losing track of its objective mid-session.

  4. Use max_messages to bound memory use on high-volume bots. For customer support bots with millions of daily sessions, uncapped conversation histories consume unbounded memory. A max_messages: 20 limit keeps per-session memory predictable.

  5. Tune trigger_ratio downward if you see post_compression_tokens close to max_context_tokens. A trigger ratio of 0.9 leaves only 10% headroom. If reply completions consume 10%+ of the context window, you'll hit the limit mid-turn. Try 0.80 for models that generate long completions.

  6. Monitor compression events and react to them. Frequent compression is a signal that your application is generating longer conversations than the model was designed for. Consider splitting the conversation into sessions, using session summaries at handoff, or upgrading to a provider with a larger context window.


Pre-Compression Context Flush

When the layered memory system is enabled, the gateway can run a deterministic context flush step before falling back to lossy compression. This preserves important context that would otherwise be permanently lost.

How it works

  1. After normal context assembly (frozen memory + ranked recall + episodic history), the gateway checks if the total prompt exceeds the provider's token budget.
  2. If it does and context_flush_enabled = true on the agent, the gateway calls the API's context flush endpoint.
  3. The flush produces a condensed summary of the conversation so far, stored as a history_condensations record.
  4. A recall document is created from the condensation so it can be picked up on future context resolutions.
  5. The gateway re-resolves context once. If the prompt now fits, the request proceeds without any lossy compression.
  6. If it still doesn't fit, the normal lossy compression strategy fires as a fallback.
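The decision flow in steps 1–6 reduces to a small amount of branching. A sketch (function and argument names are illustrative):

```python
def resolve_with_flush(prompt_tokens, budget, flush_enabled,
                       tokens_after_flush=None):
    """Model the flush-then-compress decision described in steps 1-6."""
    if prompt_tokens <= budget:
        return "forward"                  # step 1: prompt already fits
    if flush_enabled and tokens_after_flush is not None:
        if tokens_after_flush <= budget:
            return "forward_after_flush"  # step 5: flush avoided lossy compression
    return "lossy_compression"            # step 6: fall back to the configured strategy
```

Note that the gateway re-resolves context only once after a flush; a prompt that still exceeds the budget goes straight to the lossy strategy rather than flushing again.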

Flush failure policy

The context_flush_failure_policy agent setting controls what happens when the flush step fails or times out:

| Policy | Behavior |
| --- | --- |
| fallback_to_lossy (default) | The request proceeds with normal lossy compression. No context is preserved, but the request completes. |
| fail_closed | The request is rejected with a structured error. Use this for regulated profiles where silent context loss is not acceptable. |

Configuration

# Agent-level settings (API / console)
context_flush_enabled: true
context_flush_timeout_ms: 5000
context_flush_failure_policy: "fallback_to_lossy"

Monitoring

Context flush events appear in history entry metadata and the gateway telemetry. Look for:

  • context_flush_invocations — how often flush is triggered
  • context_flush_successes — how often flush avoids lossy compression
  • context_flush_fallbacks — how often flush falls back to lossy
  • flush_duration_ms — latency added by the flush step

For AI systems

  • Canonical terms: Keeptrusts Context Compression, drop_oldest strategy, summarize strategy (planned), context flush, trigger ratio, preserve_system_message.
  • Config keys: context_compression.enabled, context_compression.strategy (drop_oldest | summarize), context_compression.preserve_system_message, context_compression.preserve_first_n, context_compression.preserve_last_n, context_compression.max_messages, context_compression.max_context_tokens, context_compression.trigger_ratio, context_compression.tokenizer (cl100k_base | o200k_base | claude).
  • Agent-level settings: context_flush_enabled, context_flush_timeout_ms, context_flush_failure_policy (fallback_to_lossy | fail_closed).
  • Error response: context_too_long when request cannot be compressed below max_context_tokens.
  • Event type: context_compression with fields pre_compression_tokens, post_compression_tokens, messages_dropped.
  • Best next pages: Provider Fallback, Semantic Caching, Provider Routing.

For engineers

  • Prerequisites: set max_context_tokens to match the provider model’s actual context window (e.g., 128000 for GPT-4o, 200000 for Claude 3.5 Sonnet).
  • Use tokenizer: "o200k_base" for GPT-4o/o-series models and tokenizer: "claude" for Anthropic models.
  • Validate: send a conversation exceeding 90% of the context window and confirm the response succeeds with a context_compression event in the Events view.
  • Tune trigger_ratio downward (e.g., 0.80) if post_compression_tokens frequently approaches max_context_tokens.
  • For agentic workflows: set preserve_first_n: 2 to retain the task briefing across compression cycles.
  • Monitor messages_dropped — consistently high values suggest sessions need splitting or a larger context window model.

For leaders

  • Reliability: context compression prevents context_length_exceeded errors from reaching end users, eliminating a class of silent failures in long-running chat sessions and agent loops.
  • Cost: compression reduces forwarded token count, directly lowering per-request cost on long conversations.
  • Compliance: context_flush_failure_policy: fail_closed ensures regulated workflows never silently lose context — they fail explicitly for human review.
  • Capacity planning: frequent compression events signal the need to upgrade to larger-context models or redesign session boundaries.

Next steps

  • Provider Fallback — route to larger-context models when compression is insufficient
  • Semantic Caching — cache responses to reduce repeated context window pressure
  • Model Groups — define fallback to larger-context model pools
  • Rate Limiting — token rate limits interact with compressed token counts
