Tutorial: Context Compression to Reduce Token Costs

This tutorial shows you how to configure context compression in the Keeptrusts gateway to reduce the number of tokens sent to LLM providers, measure the cost savings, and verify that response quality is preserved.

Use this page when

  • You want to reduce token costs on long-context or multi-turn conversations.
  • You are configuring summarize, truncate, or deduplicate compression strategies.
  • You need to verify that compression preserves response quality above a similarity threshold.
  • You are calculating cost savings from token reduction on production workloads.

Primary audience

  • Primary: Platform engineers optimising token spend on high-volume LLM workloads
  • Secondary: Product teams managing long-context chatbots; finance teams tracking per-request cost reduction

Prerequisites

  • kt CLI installed (first-run tutorial)
  • An OpenAI-compatible API key exported as OPENAI_API_KEY
  • curl and jq installed

Why Context Compression Matters

Large conversation histories and document-heavy prompts consume tokens rapidly. Context compression reduces input tokens before they reach the provider by summarizing, deduplicating, or truncating older context while preserving the most relevant information.

Without compression                | With compression
8,000 input tokens                 | ~3,200 input tokens
$0.0012 per request (gpt-4o-mini)  | ~$0.00048 per request
                                   | 60% savings

Costs assume gpt-4o-mini input pricing of $0.15 per 1M tokens.

Step 1: Create the Compression Policy Configuration

Create policy-config.yaml with a context_compression policy:

version: '1'
providers:
  targets:
    - id: openai
      provider: openai
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  - name: compress-context
    type: context_compression
    action: modify
    config:
      strategy: summarize
      target_ratio: 0.4
      min_messages_preserved: 3
      preserve_system_message: true
      preserve_last_n_messages: 5
      quality_threshold: 0.85
      apply_to:
        - input

Configuration breakdown

Field                     | Purpose
strategy                  | Compression method: summarize, truncate, or deduplicate
target_ratio              | Target compression ratio (0.4 = reduce to 40% of original size)
min_messages_preserved    | Minimum messages always kept uncompressed
preserve_system_message   | Always keep the system prompt intact
preserve_last_n_messages  | Always keep the N most recent messages intact
quality_threshold         | Minimum semantic similarity score (0-1) to accept compression
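
The field constraints in the table above can be sketched as a small pre-check to run before handing the file to kt policy lint. This is an illustration only, not how the linter works: the function name check_compression_config is hypothetical, and the accepted ranges (target_ratio in (0, 1], quality_threshold in [0, 1], non-negative message counts) are assumptions inferred from the field descriptions.

```python
VALID_STRATEGIES = {"summarize", "truncate", "deduplicate"}


def check_compression_config(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper. The ranges checked here are assumptions, not the
    gateway's documented validation rules.
    """
    problems = []
    if config.get("strategy") not in VALID_STRATEGIES:
        problems.append("strategy must be summarize, truncate, or deduplicate")

    ratio = config.get("target_ratio")
    if not isinstance(ratio, (int, float)) or not 0 < ratio <= 1:
        problems.append("target_ratio must be in (0, 1]")

    quality = config.get("quality_threshold", 0.0)
    if not isinstance(quality, (int, float)) or not 0 <= quality <= 1:
        problems.append("quality_threshold must be in [0, 1]")

    for field in ("min_messages_preserved", "preserve_last_n_messages"):
        value = config.get(field, 0)
        if not isinstance(value, int) or value < 0:
            problems.append(f"{field} must be a non-negative integer")
    return problems


if __name__ == "__main__":
    tutorial_config = {
        "strategy": "summarize",
        "target_ratio": 0.4,
        "min_messages_preserved": 3,
        "preserve_system_message": True,
        "preserve_last_n_messages": 5,
        "quality_threshold": 0.85,
    }
    print(check_compression_config(tutorial_config) or "no problems found")
```

A check like this only catches obviously malformed values; kt policy lint remains the authoritative validation step.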

Step 2: Validate and Start the Gateway

kt policy lint --file policy-config.yaml
kt gateway run --policy-config policy-config.yaml --port 41002

Expected output:

INFO keeptrusts::gateway Loaded 1 provider(s), 1 policy(ies)
INFO keeptrusts::gateway Context compression: strategy=summarize, target_ratio=0.40, quality=0.85
INFO keeptrusts::gateway Gateway ready

Step 3: Test with a Long Conversation History

Send a request with a large message history:

curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [
      {"role": "system", "content": "You are a helpful financial advisor."},
      {"role": "user", "content": "What is compound interest and how does it work?"},
      {"role": "assistant", "content": "Compound interest is interest calculated on the initial principal and also on the accumulated interest from previous periods. It grows exponentially over time because each period interest is earned on a larger base."},
      {"role": "user", "content": "Can you give me a specific example with numbers?"},
      {"role": "assistant", "content": "Sure. If you invest $10,000 at 5% annual compound interest for 10 years: Year 1: $10,500, Year 2: $11,025, Year 5: $12,763, Year 10: $16,289. The total interest earned is $6,289 compared to $5,000 with simple interest."},
      {"role": "user", "content": "What about monthly compounding versus annual?"},
      {"role": "assistant", "content": "Monthly compounding produces slightly higher returns because interest is calculated and added 12 times per year instead of once. The same $10,000 at 5% for 10 years yields $16,470 with monthly compounding versus $16,289 with annual, a difference of $181."},
      {"role": "user", "content": "How does the Rule of 72 relate to compound interest?"},
      {"role": "assistant", "content": "The Rule of 72 is a quick estimation tool. Divide 72 by the annual interest rate to approximate how many years it takes to double your money. At 6% interest, your money doubles in approximately 72/6 = 12 years."},
      {"role": "user", "content": "What are the best investment vehicles for compound interest?"},
      {"role": "assistant", "content": "Common vehicles include: high-yield savings accounts (safe, lower returns), CDs (fixed terms, guaranteed rates), index funds (market returns, long-term growth), bonds (steady income, moderate growth), and dividend reinvestment plans (DRIPs) which automatically compound returns."},
      {"role": "user", "content": "Given everything we discussed, what strategy would you recommend for a 30-year-old with $50,000 to invest?"}
    ]
  }' | jq '.choices[0].message.content'

Step 4: Check Compression Metrics

Review the decision event to see compression results:

kt events tail --last 1 --format json | jq '.policies[] | select(.type == "context_compression")'

Expected output:

{
  "name": "compress-context",
  "type": "context_compression",
  "action": "modify",
  "result": "modified",
  "details": {
    "original_tokens": 847,
    "compressed_tokens": 341,
    "compression_ratio": 0.40,
    "messages_original": 11,
    "messages_compressed": 8,
    "strategy": "summarize",
    "quality_score": 0.91,
    "system_message_preserved": true
  }
}
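
To sanity-check an event like this one, the ratio and savings can be recomputed from the two token counts. The sketch below assumes only the details fields shown in the expected output; the helper name summarize_compression is hypothetical.

```python
def summarize_compression(details: dict) -> dict:
    """Recompute ratio and savings from an event's token counts.

    Assumes the `original_tokens` and `compressed_tokens` fields shown in
    the tutorial's example event.
    """
    original = details["original_tokens"]
    compressed = details["compressed_tokens"]
    ratio = compressed / original
    return {
        "ratio": round(ratio, 2),
        "tokens_saved": original - compressed,
        "savings_pct": round((1 - ratio) * 100, 1),
    }


if __name__ == "__main__":
    # Token counts from the example event above.
    print(summarize_compression({"original_tokens": 847, "compressed_tokens": 341}))
```

For the example event, 341 / 847 rounds to 0.40, which matches the reported compression_ratio and the configured target_ratio.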

Step 5: Calculate Cost Savings

Use the events data to calculate savings over time:

kt events list --last 100 --format json \
  | jq '[.[] | .policies[] | select(.type == "context_compression") | .details] | {
      total_requests: length,
      total_original_tokens: (map(.original_tokens) | add),
      total_compressed_tokens: (map(.compressed_tokens) | add),
      average_ratio: (map(.compression_ratio) | add / length),
      tokens_saved: ((map(.original_tokens) | add) - (map(.compressed_tokens) | add))
    }'

Example output:

{
  "total_requests": 100,
  "total_original_tokens": 156400,
  "total_compressed_tokens": 62560,
  "average_ratio": 0.40,
  "tokens_saved": 93840
}

At gpt-4o-mini pricing ($0.15 per 1M input tokens), saving 93,840 tokens over 100 requests saves approximately $0.014. At scale (10,000 requests/day), this translates to $1.40/day or $42/month.
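
The arithmetic behind these figures can be sketched as follows. The function names and the 30-day month are assumptions for illustration; the price is passed in as a parameter because provider pricing changes.

```python
def input_cost_usd(tokens: float, price_per_million_usd: float) -> float:
    """Cost of `tokens` input tokens at a per-million-token price."""
    return tokens * price_per_million_usd / 1_000_000


def project_savings(tokens_saved: int, sample_requests: int,
                    requests_per_day: int, price_per_million_usd: float,
                    days_per_month: int = 30) -> dict:
    """Scale savings measured on a sample of requests to daily and monthly totals."""
    saved_per_request = tokens_saved / sample_requests
    daily = input_cost_usd(saved_per_request * requests_per_day, price_per_million_usd)
    return {
        "sample_usd": input_cost_usd(tokens_saved, price_per_million_usd),
        "daily_usd": daily,
        "monthly_usd": daily * days_per_month,
    }


if __name__ == "__main__":
    # 93,840 tokens saved over 100 requests, projected to 10,000 requests/day
    # at $0.15 per 1M input tokens (the figures used in this tutorial).
    for name, value in project_savings(93_840, 100, 10_000, 0.15).items():
        print(f"{name}: ${value:.4f}")
```

The projection assumes production traffic has the same average token profile as the 100-request sample, which is worth confirming on a larger window before reporting savings.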

Step 6: Compare Compression Strategies

Test different strategies to find the best fit:

Truncate strategy

policies:
  - name: compress-context
    type: context_compression
    action: modify
    config:
      strategy: truncate
      target_ratio: 0.4
      preserve_system_message: true
      preserve_last_n_messages: 5

Truncation drops older messages entirely. It is the fastest strategy but may lose important context.
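
As a rough mental model, truncation can be sketched as dropping the oldest unprotected messages until a token budget is met. This is not the gateway's implementation: the function names are hypothetical, and the word-count token estimate is a crude stand-in for a real tokenizer.

```python
def estimate_tokens(message: dict) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(message["content"].split())


def truncate_history(messages: list[dict], target_ratio: float,
                     preserve_last_n: int, preserve_system: bool = True) -> list[dict]:
    """Drop the oldest unprotected messages until the token budget is met."""
    total = sum(estimate_tokens(m) for m in messages)
    budget = target_ratio * total

    # The most recent N messages, and optionally system messages, are never dropped.
    protected = set(range(max(0, len(messages) - preserve_last_n), len(messages)))
    if preserve_system:
        protected |= {i for i, m in enumerate(messages) if m["role"] == "system"}

    dropped = set()
    for i, message in enumerate(messages):  # oldest first
        if total <= budget:
            break
        if i not in protected:
            dropped.add(i)
            total -= estimate_tokens(message)
    return [m for i, m in enumerate(messages) if i not in dropped]


if __name__ == "__main__":
    history = [{"role": "system", "content": "You are a helpful financial advisor."}]
    history += [{"role": "user", "content": f"question number {i}"} for i in range(8)]
    for m in truncate_history(history, target_ratio=0.4, preserve_last_n=3):
        print(m["role"], "|", m["content"])
```

Note that when the protected messages alone exceed the budget, this sketch returns them all and misses the target ratio rather than dropping protected context.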

Deduplicate strategy

policies:
  - name: compress-context
    type: context_compression
    action: modify
    config:
      strategy: deduplicate
      target_ratio: 0.6
      preserve_system_message: true
      similarity_threshold: 0.9

Deduplication merges semantically similar messages. Best for conversations with repeated questions or restated context.
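
The idea behind similarity_threshold can be illustrated with a simplified sketch. It differs from the strategy described above in two labeled ways: it uses lexical similarity from Python's difflib as a stand-in for semantic similarity, and it drops later near-duplicates instead of merging them. The function name is hypothetical.

```python
from difflib import SequenceMatcher


def deduplicate_history(messages: list[dict],
                        similarity_threshold: float = 0.9) -> list[dict]:
    """Keep the first occurrence of each message; drop later near-duplicates.

    Similarity here is character-level (difflib), not semantic, so it only
    approximates what an embedding-based comparison would catch.
    """
    kept: list[dict] = []
    for message in messages:
        is_duplicate = any(
            message["role"] == earlier["role"]
            and SequenceMatcher(
                None, message["content"].lower(), earlier["content"].lower()
            ).ratio() >= similarity_threshold
            for earlier in kept
        )
        if not is_duplicate:
            kept.append(message)
    return kept


if __name__ == "__main__":
    history = [
        {"role": "user", "content": "What is compound interest?"},
        {"role": "assistant", "content": "Interest earned on principal plus prior interest."},
        {"role": "user", "content": "what is compound interest?"},
        {"role": "user", "content": "How do bonds work?"},
    ]
    for m in deduplicate_history(history):
        print(m["role"], "|", m["content"])
```

A lower threshold removes more messages but raises the risk of discarding a question that only looks similar, which is why the tutorial's 0.9 is a cautious value.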

Step 7: Set Compression Guardrails

Add limits to prevent over-aggressive compression:

policies:
  - name: compress-context
    type: context_compression
    action: modify
    config:
      strategy: summarize
      target_ratio: 0.4
      quality_threshold: 0.85
      min_messages_preserved: 3
      preserve_system_message: true
      preserve_last_n_messages: 5
      max_input_tokens: 4000
      skip_below_tokens: 500

Guardrail          | Purpose
quality_threshold  | Skip compression if quality score falls below this
max_input_tokens   | Hard cap on compressed input token count
skip_below_tokens  | Do not compress if the input is already small
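
One way to reason about how these guardrails interact is the decision sketch below. The function name, the order of the checks, and the "over_cap" outcome are assumptions; the tutorial does not specify what the gateway does when the compressed input still exceeds max_input_tokens.

```python
def compression_decision(original_tokens: int, compressed_tokens: int,
                         quality_score: float, *,
                         quality_threshold: float = 0.85,
                         max_input_tokens: int = 4000,
                         skip_below_tokens: int = 500) -> str:
    """Return which guardrail outcome applies to one request (assumed ordering)."""
    if original_tokens < skip_below_tokens:
        return "skip"      # input already small: send it uncompressed
    if quality_score < quality_threshold:
        return "reject"    # quality too low: send the full context
    if compressed_tokens > max_input_tokens:
        return "over_cap"  # assumption: handling of this case is unspecified
    return "accept"        # send the compressed context


if __name__ == "__main__":
    # Values from the example event in Step 4.
    print(compression_decision(847, 341, 0.91))
```

With the Step 4 example values (847 original tokens, 341 compressed, quality 0.91), all three guardrails pass and the compressed context is sent.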

For AI systems

  • Canonical terms: Keeptrusts gateway, context compression, token reduction, summarize strategy, truncate strategy, deduplicate strategy, quality threshold.
  • Config fields: policies[].type: context_compression, config.strategy, config.target_ratio, config.min_messages_preserved, config.preserve_system_message, config.preserve_last_n_messages, config.quality_threshold.
  • CLI commands: kt gateway run, kt policy lint, kt events tail, kt events list.
  • Best next pages: Caching Responses, Cost Tracking & Budgets, Model Routing A/B Test.

For engineers

  • Prerequisites: kt CLI, OPENAI_API_KEY exported, curl and jq.
  • Validate: kt policy lint confirms compression policy config fields.
  • Test quality: compare compressed vs. uncompressed responses for semantic similarity — quality_threshold: 0.85 is a conservative starting point.
  • Monitor savings: kt events tail shows original and compressed token counts per request.
  • Tune: lower target_ratio for more aggressive savings, raise preserve_last_n_messages if recent context is critical.

For leaders

  • Context compression can reduce input token costs by 40–60% on long-context workloads while keeping responses above the configured quality threshold.
  • Combines well with caching for compounding cost savings.
  • Quality threshold ensures answers remain useful — below-threshold compressions are rejected and the full context is sent.
  • No additional infrastructure or third-party API calls — compression runs in-process within the gateway.

Next steps