Tutorial: Context Compression to Reduce Token Costs
This tutorial shows you how to configure context compression in the Keeptrusts gateway to reduce the number of tokens sent to LLM providers, measure the cost savings, and verify that response quality is preserved.
Use this page when
- You want to reduce token costs on long-context or multi-turn conversations.
- You are configuring summarize, truncate, or deduplicate compression strategies.
- You need to verify that compression preserves response quality above a similarity threshold.
- You are calculating cost savings from token reduction on production workloads.
Primary audience
- Primary: Platform engineers optimising token spend on high-volume LLM workloads
- Secondary: Product teams managing long-context chatbots; finance teams tracking per-request cost reduction
Prerequisites
- kt CLI installed (first-run tutorial)
- An OpenAI-compatible API key exported as OPENAI_API_KEY
- curl and jq installed
Why Context Compression Matters
Large conversation histories and document-heavy prompts consume tokens rapidly. Context compression reduces input tokens before they reach the provider by summarizing, deduplicating, or truncating older context while preserving the most relevant information.
| Without compression | With compression |
|---|---|
| 8,000 input tokens | ~3,200 input tokens |
| $0.0012 per request (gpt-4o-mini input, $0.15/1M tokens) | $0.00048 per request |
| — | 60% savings |
Step 1: Create the Compression Policy Configuration
Create policy-config.yaml with a context_compression policy:
version: '1'
providers:
  targets:
    - id: openai
      provider: openai
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  - name: compress-context
    type: context_compression
    action: modify
    config:
      strategy: summarize
      target_ratio: 0.4
      min_messages_preserved: 3
      preserve_system_message: true
      preserve_last_n_messages: 5
      quality_threshold: 0.85
    apply_to:
      - input
Configuration breakdown
| Field | Purpose |
|---|---|
| strategy | Compression method: summarize, truncate, or deduplicate |
| target_ratio | Target compression ratio (0.4 = reduce to 40% of original size) |
| min_messages_preserved | Minimum number of messages always kept uncompressed |
| preserve_system_message | Always keep the system prompt intact |
| preserve_last_n_messages | Always keep the N most recent messages intact |
| quality_threshold | Minimum semantic similarity score (0-1) required to accept a compression |
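The preserve_* fields carve out messages that compression may never touch, leaving only the middle of the conversation eligible. A minimal Python sketch of that partitioning (the partition helper is hypothetical, for illustration only, not the gateway's implementation):

```python
# Hypothetical illustration of how preserve_system_message and
# preserve_last_n_messages carve out untouchable messages, leaving only
# the middle of the conversation eligible for compression.

def partition(messages, preserve_system_message=True, preserve_last_n_messages=5):
    """Split messages into (preserved head, compressible middle, preserved tail)."""
    head, body = [], list(messages)
    if preserve_system_message and body and body[0]["role"] == "system":
        head, body = body[:1], body[1:]
    tail = body[-preserve_last_n_messages:] if preserve_last_n_messages else []
    middle = body[:len(body) - len(tail)]
    return head, middle, tail

convo = [{"role": "system", "content": "You are a helpful financial advisor."}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i}"}
    for i in range(10)
]
head, middle, tail = partition(convo)
print(len(head), len(middle), len(tail))  # 1 5 5
```

With an 11-message conversation and the defaults above, only the five middle messages are candidates for summarization; the system prompt and the five most recent turns pass through untouched.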
Step 2: Validate and Start the Gateway
kt policy lint --file policy-config.yaml
kt gateway run --policy-config policy-config.yaml --port 41002
Expected output:
INFO keeptrusts::gateway Loaded 1 provider(s), 1 policy(ies)
INFO keeptrusts::gateway Context compression: strategy=summarize, target_ratio=0.40, quality=0.85
INFO keeptrusts::gateway Gateway ready
Step 3: Test with a Long Conversation History
Send a request with a large message history:
curl -s http://localhost:41002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"messages": [
{"role": "system", "content": "You are a helpful financial advisor."},
{"role": "user", "content": "What is compound interest and how does it work?"},
{"role": "assistant", "content": "Compound interest is interest calculated on the initial principal and also on the accumulated interest from previous periods. It grows exponentially over time because each period interest is earned on a larger base."},
{"role": "user", "content": "Can you give me a specific example with numbers?"},
{"role": "assistant", "content": "Sure. If you invest $10,000 at 5% annual compound interest for 10 years: Year 1: $10,500, Year 2: $11,025, Year 5: $12,763, Year 10: $16,289. The total interest earned is $6,289 compared to $5,000 with simple interest."},
{"role": "user", "content": "What about monthly compounding versus annual?"},
{"role": "assistant", "content": "Monthly compounding produces slightly higher returns because interest is calculated and added 12 times per year instead of once. The same $10,000 at 5% for 10 years yields $16,470 with monthly compounding versus $16,289 with annual — a difference of $181."},
{"role": "user", "content": "How does the Rule of 72 relate to compound interest?"},
{"role": "assistant", "content": "The Rule of 72 is a quick estimation tool. Divide 72 by the annual interest rate to approximate how many years it takes to double your money. At 6% interest, your money doubles in approximately 72/6 = 12 years."},
{"role": "user", "content": "What are the best investment vehicles for compound interest?"},
{"role": "assistant", "content": "Common vehicles include: high-yield savings accounts (safe, lower returns), CDs (fixed terms, guaranteed rates), index funds (market returns, long-term growth), bonds (steady income, moderate growth), and dividend reinvestment plans (DRIPs) which automatically compound returns."},
{"role": "user", "content": "Given everything we discussed, what strategy would you recommend for a 30-year-old with $50,000 to invest?"}
]
}' | jq '.choices[0].message.content'
Step 4: Check Compression Metrics
Review the decision event to see compression results:
kt events tail --last 1 --format json | jq '.policies[] | select(.type == "context_compression")'
Expected output:
{
  "name": "compress-context",
  "type": "context_compression",
  "action": "modify",
  "result": "modified",
  "details": {
    "original_tokens": 847,
    "compressed_tokens": 341,
    "compression_ratio": 0.40,
    "messages_original": 11,
    "messages_compressed": 8,
    "strategy": "summarize",
    "quality_score": 0.91,
    "system_message_preserved": true
  }
}
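The event arithmetic is easy to sanity-check: compression_ratio is compressed tokens divided by original tokens, and the modification stands only because quality_score clears the configured threshold. A quick check in Python:

```python
# Re-derive the figures from the decision event above.
details = {"original_tokens": 847, "compressed_tokens": 341, "quality_score": 0.91}
quality_threshold = 0.85  # from policy-config.yaml

ratio = details["compressed_tokens"] / details["original_tokens"]
accepted = details["quality_score"] >= quality_threshold
print(f"ratio={ratio:.2f} accepted={accepted}")  # ratio=0.40 accepted=True
```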
Step 5: Calculate Cost Savings
Use the events data to calculate savings over time:
kt events list --last 100 --format json \
  | jq '[.[].policies[] | select(.type == "context_compression") | .details]
        | {
            total_requests: length,
            total_original_tokens: (map(.original_tokens) | add),
            total_compressed_tokens: (map(.compressed_tokens) | add),
            average_ratio: (map(.compression_ratio) | add / length),
            tokens_saved: ((map(.original_tokens) | add) - (map(.compressed_tokens) | add))
          }'
Example output:
{
  "total_requests": 100,
  "total_original_tokens": 156400,
  "total_compressed_tokens": 62560,
  "average_ratio": 0.40,
  "tokens_saved": 93840
}
At gpt-4o-mini pricing ($0.15 per 1M input tokens), saving 93,840 tokens over 100 requests saves approximately $0.014. At scale (10,000 requests/day), this translates to $1.40/day or $42/month.
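The savings arithmetic can be reproduced directly from the aggregated token counts, using the $0.15/1M input price quoted above:

```python
# Reproduce the savings figures from the aggregated events output above.
PRICE_PER_MILLION_INPUT = 0.15  # gpt-4o-mini input price used in this tutorial

total_original = 156_400
total_compressed = 62_560
requests = 100

tokens_saved = total_original - total_compressed               # 93,840 tokens
dollars_saved = tokens_saved * PRICE_PER_MILLION_INPUT / 1e6   # ~$0.014

# Extrapolate the per-request average to 10,000 requests/day.
daily = (tokens_saved / requests) * 10_000 * PRICE_PER_MILLION_INPUT / 1e6

print(f"saved {tokens_saved} tokens (~${dollars_saved:.3f}); ~${daily:.2f}/day at 10k req/day")
```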
Step 6: Compare Compression Strategies
Test different strategies to find the best fit:
Truncate strategy
policies:
  - name: compress-context
    type: context_compression
    action: modify
    config:
      strategy: truncate
      target_ratio: 0.4
      preserve_system_message: true
      preserve_last_n_messages: 5
Truncation drops older messages entirely. It is the fastest but may lose important context.
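A minimal sketch of what truncation does under this config. The truncate helper is hypothetical, and word counts stand in for real token counts:

```python
# Hypothetical sketch of the truncate strategy: keep the system message and
# the last N messages, then re-admit older messages newest-first while the
# token estimate stays within target_ratio of the original size.
# Word counts are a crude stand-in for a real tokenizer.

def truncate(messages, target_ratio=0.4, preserve_last_n=5, keep_system=True):
    def tokens(msgs):
        return sum(len(m["content"].split()) for m in msgs)

    budget = tokens(messages) * target_ratio
    head = [m for m in messages[:1] if keep_system and m["role"] == "system"]
    body = messages[len(head):]
    kept = body[-preserve_last_n:]  # always retained, even if over budget
    for m in reversed(body[:-preserve_last_n]):
        if tokens(head + [m] + kept) > budget:
            break
        kept.insert(0, m)
    return head + kept

msgs = [{"role": "system", "content": "You are a helpful financial advisor."}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": "filler words " * 20}
    for i in range(10)
]
print(len(msgs), "->", len(truncate(msgs)))  # 11 -> 6
```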
Deduplicate strategy
policies:
  - name: compress-context
    type: context_compression
    action: modify
    config:
      strategy: deduplicate
      target_ratio: 0.6
      preserve_system_message: true
      similarity_threshold: 0.9
Deduplication merges semantically similar messages. Best for conversations with repeated questions or restated context.
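Conceptually, deduplication keeps a message only if nothing similar is already retained. A rough sketch, using difflib string similarity as a stand-in for whatever semantic similarity measure the gateway applies (the deduplicate helper is illustrative only):

```python
# Rough illustration of deduplication: drop a message when it is
# near-identical (above similarity_threshold) to one already kept.
from difflib import SequenceMatcher

def deduplicate(messages, similarity_threshold=0.9):
    kept = []
    for m in messages:
        is_duplicate = any(
            k["role"] == m["role"]
            and SequenceMatcher(None, k["content"], m["content"]).ratio() >= similarity_threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(m)
    return kept

msgs = [
    {"role": "user", "content": "What is compound interest and how does it work?"},
    {"role": "user", "content": "What is compound interest, and how does it work?"},
    {"role": "user", "content": "Which accounts compound monthly?"},
]
print(len(deduplicate(msgs)))  # 2
```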
Step 7: Set Compression Guardrails
Add limits to prevent over-aggressive compression:
policies:
  - name: compress-context
    type: context_compression
    action: modify
    config:
      strategy: summarize
      target_ratio: 0.4
      quality_threshold: 0.85
      min_messages_preserved: 3
      preserve_system_message: true
      preserve_last_n_messages: 5
      max_input_tokens: 4000
      skip_below_tokens: 500
| Guardrail | Purpose |
|---|---|
| quality_threshold | Skip compression if the quality score falls below this value |
| max_input_tokens | Hard cap on the compressed input token count |
| skip_below_tokens | Do not compress if the input is already small |
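One plausible ordering of these checks, as a sketch. The should_apply helper is hypothetical; the real gateway may sequence the checks differently or handle the max_input_tokens cap by truncating further rather than skipping:

```python
# Sketch of the guardrail checks from the table above, in one plausible order:
# skip tiny inputs, reject low-quality compressions, enforce the hard cap.
def should_apply(original_tokens, compressed_tokens, quality_score,
                 quality_threshold=0.85, max_input_tokens=4000,
                 skip_below_tokens=500):
    if original_tokens < skip_below_tokens:
        return False, "input below skip_below_tokens; sent as-is"
    if quality_score < quality_threshold:
        return False, "quality below threshold; full context sent instead"
    if compressed_tokens > max_input_tokens:
        return False, "compressed input still exceeds max_input_tokens"
    return True, "compressed context accepted"

print(should_apply(847, 341, 0.91))  # (True, 'compressed context accepted')
print(should_apply(847, 341, 0.72))  # rejected on quality
```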
For AI systems
- Canonical terms: Keeptrusts gateway, context compression, token reduction, summarize strategy, truncate strategy, deduplicate strategy, quality threshold.
- Config fields: policies[].type: context_compression, config.strategy, config.target_ratio, config.min_messages_preserved, config.preserve_system_message, config.preserve_last_n_messages, config.quality_threshold.
- CLI commands: kt gateway run, kt policy lint, kt events tail.
- Best next pages: Caching Responses, Cost Tracking & Budgets, Model Routing A/B Test.
For engineers
- Prerequisites: kt CLI, OPENAI_API_KEY exported, curl and jq.
- Validate: kt policy lint confirms the compression policy config fields.
- Test quality: compare compressed vs. uncompressed responses for semantic similarity; quality_threshold: 0.85 is a conservative starting point.
- Monitor savings: kt events tail shows original and compressed token counts per request.
- Tune: lower target_ratio for more aggressive savings; raise preserve_last_n_messages if recent context is critical.
For leaders
- Context compression can reduce input token costs by 40–60% on long-context workloads without degrading answer quality.
- Combines well with caching for compounding cost savings.
- Quality threshold ensures answers remain useful — below-threshold compressions are rejected and the full context is sent.
- No additional infrastructure or third-party API calls — compression runs in-process within the gateway.
Next steps
- Cost Tracking & Budgets — monitor how compression reduces wallet spend
- Caching Responses — combine caching with compression for compounding savings
- Model Routing A/B Test — route smaller compressed contexts to cheaper models