Context Compression
Context compression automatically truncates or summarises conversation history when a multi-turn conversation approaches the provider's context window limit. The gateway inspects the total token count of the request before forwarding it; when the count exceeds the configured trigger threshold, compression is applied transparently before the request reaches the upstream provider.
Use this page when
- You need the exact command, config, API, or integration details for Context Compression.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
This means your application code never needs to manage context window budgeting manually. Long-running chat sessions, agentic tool loops, and multi-turn assistants simply keep appending messages — the gateway handles the pruning.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Why Context Compression?
LLM providers have finite context windows. Current limits range from 8K tokens (older models) to 128K–200K tokens (latest frontier models), but even large windows fill up in long-running agent sessions that include many tool calls, large tool outputs, or lengthy system prompts.
Without compression
Without any mitigation strategy:
- Requests eventually fail with context_length_exceeded (HTTP 400; OpenAI error code context_length_exceeded, Anthropic invalid_request_error).
- The failure happens at the provider level after all policy checks and network overhead, wasting time and tokens.
- Application code must catch the error, manually truncate history, and retry — adding latency and complexity (a sketch of this loop follows below).
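The manual mitigation usually looks something like the sketch below: catch the provider's context-length error, drop the oldest non-system message, and retry. It is illustrative only; it assumes the official openai Python SDK and inspects the error text for the code. With compression enabled at the gateway, none of this lives in application code.

# Illustrative sketch of the manual truncate-and-retry loop that context compression removes.
from openai import OpenAI, BadRequestError

client = OpenAI()

def chat_with_manual_truncation(messages, model="gpt-4o"):
    while True:
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except BadRequestError as exc:
            # Provider rejected the request because the context window is full.
            if "context_length_exceeded" not in str(exc):
                raise
            # Drop the oldest non-system message and retry: extra latency, extra code.
            for i, msg in enumerate(messages):
                if msg["role"] != "system":
                    del messages[i]
                    break
            else:
                raise  # nothing left to drop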
With compression
With context compression enabled:
- The gateway inspects the total token count of every request before forwarding.
- When the count exceeds max_context_tokens × trigger_ratio (default 90%), the compression strategy fires.
- The pruned request is forwarded to the upstream; the client and application code never see an error.
- Compression events are logged so you can observe them in the Keeptrusts console.
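A minimal sketch of the resulting append-only client pattern, assuming the gateway exposes an OpenAI-compatible chat endpoint; the base URL, API key, and target id below are placeholders, not canonical values.

# Illustrative only: point the OpenAI SDK at the gateway and keep appending messages.
from openai import OpenAI

client = OpenAI(base_url="https://gateway.example.com/v1", api_key="GATEWAY_KEY")
messages = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(user_text: str) -> str:
    messages.append({"role": "user", "content": user_text})
    # No token budgeting here: the gateway compresses the history if it nears the limit.
    reply = client.chat.completions.create(model="openai-gpt4o", messages=messages)
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})
    return answer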
ProviderContextCompression Fields
Context compression is configured under a provider target's context_compression key, or at the global level under context_compression to apply to all targets.
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable context compression for this target. |
| strategy | string | "drop_oldest" | Compression strategy: drop_oldest or summarize (see Strategies). |
| preserve_system_message | bool | true | When true, the system message is never removed regardless of how aggressive the compression is. |
| preserve_first_n | integer | 0 | Always preserve the first N user/assistant message pairs from the conversation, in addition to the system message. |
| preserve_last_n | integer | 5 | Always preserve the most recent N user/assistant message pairs from the conversation. |
| max_messages | integer | null | Hard cap on the total number of messages in the request (excluding the system message). Older messages are dropped when the count exceeds this value. Applied before token-based compression. |
| max_context_tokens | integer | — | Maximum tokens for this target's context window. Defaults to the provider catalog value if not set. |
| trigger_ratio | float | 0.9 | Fraction of max_context_tokens at which compression triggers (e.g., 0.9 = fire when 90% full). |
| message_compression_strategy | string | "omit" | How individual messages are handled when dropped: omit removes them entirely; truncate keeps the message but cuts its content to fit. |
| tokenizer | string | "cl100k_base" | Tiktoken tokenizer used to count tokens. Use "o200k_base" for GPT-4o and o-series; "claude" for Anthropic models. |
Strategies
drop_oldest
The drop_oldest strategy removes messages from the beginning of the conversation (after any preserve_first_n protection) until the total token count drops below the target threshold.
[ system ] [ turn 1 ] [ turn 2 ] [ turn 3 ] [ turn 4 ] [ turn 5 ]
↑ oldest ↑ newest
When compression fires and needs to shed 2 turns, turns 1 and 2 are removed:
[ system ] [ turn 3 ] [ turn 4 ] [ turn 5 ]
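The pruning logic can be pictured with the sketch below. It is illustrative, not the gateway's implementation; count_tokens stands in for a tokenizer-backed estimator such as tiktoken.

# Illustrative sketch of drop_oldest pruning (not the gateway's actual code).
def drop_oldest(messages, count_tokens, target_tokens,
                preserve_system_message=True, preserve_first_n=0, preserve_last_n=5):
    non_system = [i for i, m in enumerate(messages) if m["role"] != "system"]
    protected = {i for i, m in enumerate(messages)
                 if preserve_system_message and m["role"] == "system"}
    protected.update(non_system[:preserve_first_n * 2])       # first N user/assistant pairs
    if preserve_last_n:
        protected.update(non_system[-preserve_last_n * 2:])   # most recent N pairs

    keep = set(range(len(messages)))
    for i in non_system:                                       # walk from the oldest message
        if sum(count_tokens(messages[j]) for j in keep) <= target_tokens:
            break
        if i not in protected:
            keep.discard(i)
    return [messages[j] for j in sorted(keep)]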
Use drop_oldest when:
- Recent context is more relevant than historical context (most chat applications).
- The conversation is a simple turn-by-turn dialogue with no persistent references to early messages.
- You want the simplest, fastest compression behaviour with no additional LLM calls.
summarize (planned)
The summarize strategy calls a configured summarisation model to condense the messages being dropped into a short summary, which is prepended to the remaining history as a synthetic assistant message before the originals are removed.
context_compression:
  enabled: true
  strategy: summarize
  summarizer:
    provider: openai-gpt4o-mini   # cheap model for summarisation
    max_summary_tokens: 300
    summary_prompt: |
      Summarise the following conversation history in 3–5 sentences,
      preserving key facts, decisions, and unresolved questions.
The summarize strategy is planned for a future release. The current stable strategy is drop_oldest. Configuring strategy: summarize falls back to drop_oldest until the feature ships.
Trigger threshold
Compression fires when the estimated token count of the full request exceeds:
trigger_tokens = max_context_tokens × trigger_ratio
For example, with max_context_tokens: 128000 and trigger_ratio: 0.9, compression fires when the request contains more than 115,200 tokens. After compression, the gateway targets a token count at or below max_context_tokens × 0.75 (a 15-point headroom below the trigger) to avoid repeatedly re-triggering compression on every subsequent message.
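A quick worked check of those numbers:

# Worked example using the values above.
max_context_tokens = 128_000
trigger_ratio = 0.9
trigger_tokens = max_context_tokens * trigger_ratio        # 115,200: compression fires above this
post_compression_target = max_context_tokens * 0.75        # 96,000: the gateway prunes to roughly here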
Configuration Examples
Basic drop_oldest
The most common configuration — preserve the system message and recent context, drop everything else when approaching the limit:
pack:
  name: context-compression-providers-2
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai-gpt4o
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
      context_compression:
        enabled: true
        strategy: drop_oldest
        preserve_system_message: true
        preserve_last_n: 10          # keep the 10 most recent turns
        max_context_tokens: 128000
        trigger_ratio: 0.9
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
With this config, a conversation of 100 turns that hits 90% of the 128K context window will have its oldest messages stripped, keeping the system message and the 10 most recent turns.
Agentic tool loop — preserve anchoring context
Agentic workflows often establish critical context in the first few turns (task description, goals, constraints) that must never be dropped, plus produce many intermediate tool-call/result pairs that are safe to drop once the agent has moved on:
pack:
  name: context-compression-providers-3
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: anthropic-sonnet
      provider: anthropic:chat:claude-3-5-sonnet-20241022
      secret_key_ref:
        env: ANTHROPIC_API_KEY
      context_compression:
        enabled: true
        strategy: drop_oldest
        preserve_system_message: true
        preserve_first_n: 2          # task definition turns
        preserve_last_n: 15          # active working memory
        max_context_tokens: 200000
        tokenizer: "claude"
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Example message budget at compression time for a 200K-token model:
| Region | Messages kept | Reason |
|---|---|---|
| System | 1 | Always preserved |
| First 2 turns | 2 | Task definition |
| Last 15 turns | 15 | Active working memory |
| Everything else | dropped | Old tool outputs |
Multi-turn chat — rolling window
For a customer support chatbot where no single message is critical to preserve, use a max_messages rolling window to keep memory overhead bounded regardless of token count:
pack:
  name: context-compression-providers-4
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai-gpt4o-mini
      provider: openai:chat:gpt-4o-mini
      secret_key_ref:
        env: OPENAI_API_KEY
      context_compression:
        enabled: true
        strategy: drop_oldest
        max_messages: 20             # rolling window: cap the history at 20 messages
        preserve_last_n: 5
        max_context_tokens: 128000
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
The max_messages: 20 limit fires before token-based compression. As a result, the conversation is bounded to 20 messages regardless of whether the token trigger is reached.
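That ordering can be sketched as follows; the sketch is illustrative and reuses the drop_oldest sketch from the Strategies section. The message-count cap is applied first, and the token trigger then decides whether drop_oldest needs to run at all.

# Illustrative ordering: hard message cap first, token-based compression second.
def compress(messages, count_tokens, cfg):
    system = [m for m in messages if m["role"] == "system"]
    history = [m for m in messages if m["role"] != "system"]

    # 1. max_messages rolling window (the system message is excluded from the count).
    if cfg.get("max_messages") and len(history) > cfg["max_messages"]:
        history = history[-cfg["max_messages"]:]
    pruned = system + history

    # 2. Token trigger: only run drop_oldest if still above the threshold.
    trigger = cfg["max_context_tokens"] * cfg.get("trigger_ratio", 0.9)
    if sum(count_tokens(m) for m in pruned) > trigger:
        pruned = drop_oldest(pruned, count_tokens,
                             target_tokens=cfg["max_context_tokens"] * 0.75)
    return pruned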
Per-target with global fallback
You can set a global default and override per target:
pack:
  name: context-compression-providers-5
  version: 1.0.0
  enabled: true
context_compression:               # global default applied to all targets (illustrative values)
  enabled: true
  strategy: drop_oldest
  preserve_last_n: 5
providers:
  targets:
    - id: openai-gpt4o
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: gpt4o-large-context
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
      context_compression:         # per-target override (illustrative values)
        enabled: true
        preserve_last_n: 20
        trigger_ratio: 0.8
    - id: gpt4o-mini
      provider: openai:chat:gpt-4o-mini
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Combining with max_context_tokens
max_context_tokens sets the upper bound for this provider target's context window. It serves two purposes:
- Pre-request rejection gate — requests that still exceed max_context_tokens after compression are rejected with a 413 Content Too Large response rather than being forwarded and failing at the provider level.
- Compression trigger — the trigger token count is max_context_tokens × trigger_ratio.
pack:
  name: context-compression-providers-6
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai-gpt4o
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
      context_compression:
        enabled: true
        strategy: drop_oldest
        max_context_tokens: 128000   # rejection gate and compression trigger base
        trigger_ratio: 0.9
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
If after compression the request still exceeds max_context_tokens, the gateway returns:
{
  "error": {
    "type": "context_too_long",
    "message": "Request context cannot be compressed below max_context_tokens: 128000. Reduce the size of your system message or conversation history.",
    "code": "context_too_long"
  }
}
This hard-gate prevents your application from silently receiving a truncated, nonsensical response from a provider that would otherwise accept an oversized request and hallucinate.
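If you want application code to surface this condition explicitly, the handling sketch below is one option; the URL, API key, and payload wrapper are placeholders, and the error shape follows the response shown above.

# Illustrative handling of the gateway's context_too_long rejection (HTTP 413).
import requests

messages = [{"role": "user", "content": "..."}]
resp = requests.post(
    "https://gateway.example.com/v1/chat/completions",
    json={"model": "openai-gpt4o", "messages": messages},
    headers={"Authorization": "Bearer GATEWAY_KEY"},
)
body = resp.json()
if resp.status_code == 413 and body.get("error", {}).get("code") == "context_too_long":
    # Even compression could not fit the request: shrink the system prompt or start a new session.
    raise RuntimeError(body["error"]["message"])
resp.raise_for_status()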
Monitoring Compression Events
Keeptrusts logs every compression event as a structured context_compression event forwarded to the control-plane API:
{
  "timestamp": "2026-03-27T14:31:00.411Z",
  "event_type": "context_compression",
  "target_id": "openai-gpt4o",
  "strategy": "drop_oldest",
  "pre_compression_tokens": 116842,
  "post_compression_tokens": 96240,
  "messages_before": 48,
  "messages_after": 18,
  "messages_dropped": 30,
  "system_message_preserved": true,
  "first_n_preserved": 2,
  "last_n_preserved": 10,
  "trigger_ratio_applied": 0.9,
  "max_context_tokens": 128000
}
Key fields to monitor in the Events view:
| Field | What to watch for |
|---|---|
| messages_dropped | Consistently high values indicate conversations need rethinking or preserve_last_n needs increasing. |
| post_compression_tokens | Should stay well below max_context_tokens. If it approaches the limit, lower trigger_ratio. |
| event_type: context_compression frequency | High frequency on short conversations indicates a system prompt that is too large. |
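For offline analysis, a small sketch assuming you have exported events as JSON Lines (the file path is a placeholder):

# Illustrative offline analysis of exported context_compression events.
import json
from statistics import mean

with open("events.jsonl") as fh:                      # exported events, one JSON object per line
    events = [json.loads(line) for line in fh if line.strip()]

compressions = [e for e in events if e.get("event_type") == "context_compression"]
if compressions:
    print("compression events:", len(compressions))
    print("avg messages_dropped:", mean(e["messages_dropped"] for e in compressions))
    print("max post_compression_tokens:", max(e["post_compression_tokens"] for e in compressions))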
Best Practices
- Always set preserve_system_message: true. The system message usually contains the model's persona, safety guidelines, and task instructions. Dropping it produces undefined model behaviour. This is the default but worth keeping explicit in your config.
- Set preserve_last_n to at least 4–6 for interactive chat. Users expect the model to remember what was said two or three turns ago. Dropping everything before the last turn produces obviously broken conversations.
- Use preserve_first_n for agentic task briefings. The first 1–2 turns often contain the task goal, constraints, and available tools. Preserving these prevents the agent from losing track of its objective mid-session.
- Use max_messages to bound memory use on high-volume bots. For customer support bots with millions of daily sessions, uncapped conversation histories consume unbounded memory. A max_messages: 20 limit keeps per-session memory predictable.
- Tune trigger_ratio downward if you see post_compression_tokens close to max_context_tokens. A trigger ratio of 0.9 leaves only 10% headroom. If reply completions consume 10% or more of the context window, you'll hit the limit mid-turn. Try 0.80 for models that generate long completions.
- Monitor compression events and react to them. Frequent compression is a signal that your application is generating longer conversations than the model was designed for. Consider splitting the conversation into sessions, using session summaries at handoff, or upgrading to a provider with a larger context window.
Pre-Compression Context Flush
When the layered memory system is enabled, the gateway can run a deterministic context flush step before falling back to lossy compression. This preserves important context that would otherwise be permanently lost.
How it works
- After normal context assembly (frozen memory + ranked recall + episodic history), the gateway checks if the total prompt exceeds the provider's token budget.
- If it does and context_flush_enabled = true on the agent, the gateway calls the API's context flush endpoint.
- The flush produces a condensed summary of the conversation so far, stored as a history_condensations record.
- A recall document is created from the condensation so it can be picked up on future context resolutions.
- The gateway re-resolves context once. If the prompt now fits, the request proceeds without any lossy compression.
- If it still doesn't fit, the normal lossy compression strategy fires as a fallback (sketched below).
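The same flow in pseudocode; every helper name (assemble_context, token_count, call_flush_endpoint, store_recall_document, lossy_compress, FlushError) is a placeholder, not the gateway's actual API.

# Illustrative pseudocode for the pre-compression context flush flow; all helpers are placeholders.
def resolve_prompt(agent, conversation, budget_tokens):
    prompt = assemble_context(conversation)          # frozen memory + ranked recall + episodic history
    if token_count(prompt) <= budget_tokens:
        return prompt
    if agent.context_flush_enabled:
        try:
            condensation = call_flush_endpoint(conversation, timeout_ms=agent.context_flush_timeout_ms)
            store_recall_document(condensation)      # picked up on future context resolutions
            prompt = assemble_context(conversation)  # re-resolve once
            if token_count(prompt) <= budget_tokens:
                return prompt                        # fits: no lossy compression needed
        except FlushError:
            if agent.context_flush_failure_policy == "fail_closed":
                raise                                # regulated profiles: fail rather than lose context
            # fallback_to_lossy: fall through to lossy compression
    return lossy_compress(prompt, budget_tokens)     # drop_oldest fires as the fallback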
Flush failure policy
The context_flush_failure_policy agent setting controls what happens when the flush step fails or times out:
| Policy | Behavior |
|---|---|
| fallback_to_lossy (default) | The request proceeds with normal lossy compression. No context is preserved but the request completes. |
| fail_closed | The request is rejected with a structured error. Use this for regulated profiles where silent context loss is not acceptable. |
Configuration
# Agent-level settings (API / console)
context_flush_enabled: true
context_flush_timeout_ms: 5000
context_flush_failure_policy: "fallback_to_lossy"
Monitoring
Context flush events appear in history entry metadata and the gateway telemetry. Look for:
- context_flush_invocations — how often flush is triggered
- context_flush_successes — how often flush avoids lossy compression
- context_flush_fallbacks — how often flush falls back to lossy compression
- flush_duration_ms — latency added by the flush step
For AI systems
- Canonical terms: Keeptrusts Context Compression, drop_oldest strategy, summarize strategy (planned), context flush, trigger ratio, preserve_system_message.
- Config keys: context_compression.enabled, context_compression.strategy (drop_oldest|summarize), context_compression.preserve_system_message, context_compression.preserve_first_n, context_compression.preserve_last_n, context_compression.max_messages, context_compression.max_context_tokens, context_compression.trigger_ratio, context_compression.tokenizer (cl100k_base|o200k_base|claude).
- Agent-level settings: context_flush_enabled, context_flush_timeout_ms, context_flush_failure_policy (fallback_to_lossy|fail_closed).
- Error response: context_too_long when the request cannot be compressed below max_context_tokens.
- Event type: context_compression with fields pre_compression_tokens, post_compression_tokens, messages_dropped.
- Best next pages: Provider Fallback, Semantic Caching, Provider Routing.
For engineers
- Prerequisites: set max_context_tokens to match the provider model's actual context window (e.g., 128000 for GPT-4o, 200000 for Claude 3.5 Sonnet).
- Use tokenizer: "o200k_base" for GPT-4o/o-series models and tokenizer: "claude" for Anthropic models.
- Validate: send a conversation exceeding 90% of the context window and confirm the response succeeds with a context_compression event in the Events view (a validation sketch follows this list).
- Tune trigger_ratio downward (e.g., 0.80) if post_compression_tokens frequently approaches max_context_tokens.
- For agentic workflows: set preserve_first_n: 2 to retain the task briefing across compression cycles.
- Monitor messages_dropped — consistently high values suggest sessions need splitting or a larger context window model.
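A validation sketch for the step above; the gateway URL, API key, and target id are placeholders. It pads the history well past 90% of a 128K window and confirms the request still succeeds, after which a context_compression event should appear in the Events view.

# Illustrative validation: overfill the context window and confirm the request still succeeds.
from openai import OpenAI

client = OpenAI(base_url="https://gateway.example.com/v1", api_key="GATEWAY_KEY")

filler = "lorem ipsum " * 2000                      # several thousand tokens per message
messages = [{"role": "system", "content": "You are a helpful assistant."}]
for i in range(30):                                 # comfortably past 90% of a 128K window
    messages.append({"role": "user", "content": f"chunk {i}: {filler}"})
    messages.append({"role": "assistant", "content": "noted."})
messages.append({"role": "user", "content": "Summarise what I sent you."})

resp = client.chat.completions.create(model="openai-gpt4o", messages=messages)
print(resp.choices[0].message.content)              # then check the Events view for a context_compression event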
For leaders
- Reliability: context compression prevents context_length_exceeded errors from reaching end users, eliminating a class of silent failures in long-running chat sessions and agent loops.
- Cost: compression reduces forwarded token count, directly lowering per-request cost on long conversations.
- Compliance: context_flush_failure_policy: fail_closed ensures regulated workflows never silently lose context — they fail explicitly for human review.
- Capacity planning: frequent compression events signal the need to upgrade to larger-context models or redesign session boundaries.
Next steps
- Provider Fallback — route to larger-context models when compression is insufficient
- Semantic Caching — cache responses to reduce repeated context window pressure
- Model Groups — define fallback to larger-context model pools
- Rate Limiting — token rate limits interact with compressed token counts