Tutorial: Context Compression to Reduce Token Costs
This tutorial shows you how to configure context compression in the Keeptrusts gateway to reduce the number of tokens sent to LLM providers, measure the cost savings, and verify that response quality is preserved.
Use this page when
- You want to reduce token costs on long-context or multi-turn conversations.
- You are configuring summarize, truncate, or deduplicate compression strategies.
- You need to verify that compression preserves response quality above a similarity threshold.
- You are calculating cost savings from token reduction on production workloads.
Primary audience
- Primary: Platform engineers optimising token spend on high-volume LLM workloads
- Secondary: Product teams managing long-context chatbots; finance teams tracking per-request cost reduction
Prerequisites
- kt CLI installed (first-run tutorial)
- An OpenAI-compatible API key exported as OPENAI_API_KEY
- curl and jq installed
Why Context Compression Matters
Large conversation histories and document-heavy prompts consume tokens rapidly. Context compression reduces input tokens before they reach the provider by summarizing, deduplicating, or truncating older context while preserving the most relevant information.
| Without compression | With compression |
|---|---|
| 8,000 input tokens | ~3,200 input tokens |
| $0.0012 per request (gpt-4o-mini input, $0.15/1M tokens) | $0.00048 per request |
| — | 60% savings |
Step 1: Create the Compression Policy Configuration
Create policy-config.yaml with a context_compression policy:
version: '1'
providers:
  targets:
    - id: openai
      provider: openai
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  - name: compress-context
    type: context_compression
    action: modify
    config:
      strategy: summarize
      target_ratio: 0.4
      min_messages_preserved: 3
      preserve_system_message: true
      preserve_last_n_messages: 5
      quality_threshold: 0.85
    apply_to:
      - input
Configuration breakdown
| Field | Purpose |
|---|---|
| strategy | Compression method: summarize, truncate, or deduplicate |
| target_ratio | Target compression ratio (0.4 = reduce to 40% of original size) |
| min_messages_preserved | Minimum number of messages always kept uncompressed |
| preserve_system_message | Always keep the system prompt intact |
| preserve_last_n_messages | Always keep the N most recent messages intact |
| quality_threshold | Minimum semantic similarity score (0-1) required to accept a compression |
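The preserve_* fields carve out messages that compression may never touch, leaving only the middle of the conversation eligible. A minimal Python sketch of that partitioning (the partition helper is hypothetical, for illustration only, not the gateway's implementation):

```python
# Hypothetical illustration of how preserve_system_message and
# preserve_last_n_messages carve out untouchable messages, leaving only
# the middle of the conversation eligible for compression.

def partition(messages, preserve_system_message=True, preserve_last_n_messages=5):
    """Split messages into (preserved head, compressible middle, preserved tail)."""
    head, body = [], list(messages)
    if preserve_system_message and body and body[0]["role"] == "system":
        head, body = body[:1], body[1:]
    tail = body[-preserve_last_n_messages:] if preserve_last_n_messages else []
    middle = body[:len(body) - len(tail)]
    return head, middle, tail

convo = [{"role": "system", "content": "You are a helpful financial advisor."}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i}"}
    for i in range(10)
]
head, middle, tail = partition(convo)
print(len(head), len(middle), len(tail))  # 1 5 5
```

With an 11-message conversation and the defaults above, only the five middle messages are candidates for summarization; the system prompt and the five most recent turns pass through untouched.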
Step 2: Validate and Start the Gateway
kt policy lint --file policy-config.yaml
kt gateway run --policy-config policy-config.yaml --port 41002
Expected output:
INFO keeptrusts::gateway Loaded 1 provider(s), 1 policy(ies)
INFO keeptrusts::gateway Context compression: strategy=summarize, target_ratio=0.40, quality=0.85
INFO keeptrusts::gateway Gateway ready
Step 3: Test with a Long Conversation History
Send a request with a large message history:
curl -s http://localhost:41002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-mini",
"messages": [
{"role": "system", "content": "You are a helpful financial advisor."},
{"role": "user", "content": "What is compound interest and how does it work?"},
{"role": "assistant", "content": "Compound interest is interest calculated on the initial principal and also on the accumulated interest from previous periods. It grows exponentially over time because each period interest is earned on a larger base."},
{"role": "user", "content": "Can you give me a specific example with numbers?"},
{"role": "assistant", "content": "Sure. If you invest $10,000 at 5% annual compound interest for 10 years: Year 1: $10,500, Year 2: $11,025, Year 5: $12,763, Year 10: $16,289. The total interest earned is $6,289 compared to $5,000 with simple interest."},
{"role": "user", "content": "What about monthly compounding versus annual?"},
{"role": "assistant", "content": "Monthly compounding produces slightly higher returns because interest is calculated and added 12 times per year instead of once. The same $10,000 at 5% for 10 years yields $16,470 with monthly compounding versus $16,289 with annual — a difference of $181."},
{"role": "user", "content": "How does the Rule of 72 relate to compound interest?"},
{"role": "assistant", "content": "The Rule of 72 is a quick estimation tool. Divide 72 by the annual interest rate to approximate how many years it takes to double your money. At 6% interest, your money doubles in approximately 72/6 = 12 years."},
{"role": "user", "content": "What are the best investment vehicles for compound interest?"},
{"role": "assistant", "content": "Common vehicles include: high-yield savings accounts (safe, lower returns), CDs (fixed terms, guaranteed rates), index funds (market returns, long-term growth), bonds (steady income, moderate growth), and dividend reinvestment plans (DRIPs) which automatically compound returns."},
{"role": "user", "content": "Given everything we discussed, what strategy would you recommend for a 30-year-old with $50,000 to invest?"}
]
}' | jq '.choices[0].message.content'
Step 4: Check Compression Metrics
Review the decision event to see compression results:
kt events tail --last 1 --format json | jq '.policies[] | select(.type == "context_compression")'
Expected output:
{
  "name": "compress-context",
  "type": "context_compression",
  "action": "modify",
  "result": "modified",
  "details": {
    "original_tokens": 847,
    "compressed_tokens": 341,
    "compression_ratio": 0.40,
    "messages_original": 11,
    "messages_compressed": 8,
    "strategy": "summarize",
    "quality_score": 0.91,
    "system_message_preserved": true
  }
}
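The event arithmetic is easy to sanity-check: compression_ratio is compressed tokens divided by original tokens, and the modification stands only because quality_score clears the configured threshold. A quick check in Python:

```python
# Re-derive the figures from the decision event above.
details = {"original_tokens": 847, "compressed_tokens": 341, "quality_score": 0.91}
quality_threshold = 0.85  # from policy-config.yaml

ratio = details["compressed_tokens"] / details["original_tokens"]
accepted = details["quality_score"] >= quality_threshold
print(f"ratio={ratio:.2f} accepted={accepted}")  # ratio=0.40 accepted=True
```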
Step 5: Calculate Cost Savings
Use the events data to calculate savings over time:
kt events list --last 100 --format json \
  | jq '[.[].policies[] | select(.type == "context_compression") | .details]
        | {
            total_requests: length,
            total_original_tokens: (map(.original_tokens) | add),
            total_compressed_tokens: (map(.compressed_tokens) | add),
            average_ratio: (map(.compression_ratio) | add / length),
            tokens_saved: ((map(.original_tokens) | add) - (map(.compressed_tokens) | add))
          }'
Example output:
{
  "total_requests": 100,
  "total_original_tokens": 156400,
  "total_compressed_tokens": 62560,
  "average_ratio": 0.40,
  "tokens_saved": 93840
}
At gpt-4o-mini pricing ($0.15 per 1M input tokens), saving 93,840 tokens over 100 requests saves approximately $0.014. At scale (10,000 requests/day), this translates to $1.40/day or $42/month.
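The savings arithmetic can be reproduced directly from the aggregated token counts, using the $0.15/1M input price quoted above:

```python
# Reproduce the savings figures from the aggregated events output above.
PRICE_PER_MILLION_INPUT = 0.15  # gpt-4o-mini input price used in this tutorial

total_original = 156_400
total_compressed = 62_560
requests = 100

tokens_saved = total_original - total_compressed               # 93,840 tokens
dollars_saved = tokens_saved * PRICE_PER_MILLION_INPUT / 1e6   # ~$0.014

# Extrapolate the per-request average to 10,000 requests/day.
daily = (tokens_saved / requests) * 10_000 * PRICE_PER_MILLION_INPUT / 1e6

print(f"saved {tokens_saved} tokens (~${dollars_saved:.3f}); ~${daily:.2f}/day at 10k req/day")
```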
Step 6: Compare Compression Strategies
Test different strategies to find the best fit:
Truncate strategy
policies:
  - name: compress-context
    type: context_compression
    action: modify
    config:
      strategy: truncate
      target_ratio: 0.4
      preserve_system_message: true
      preserve_last_n_messages: 5
Truncation drops older messages entirely. It is the fastest but may lose important context.
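A minimal sketch of what truncation does under this config. The truncate helper is hypothetical, and word counts stand in for real token counts:

```python
# Hypothetical sketch of the truncate strategy: keep the system message and
# the last N messages, then re-admit older messages newest-first while the
# token estimate stays within target_ratio of the original size.
# Word counts are a crude stand-in for a real tokenizer.

def truncate(messages, target_ratio=0.4, preserve_last_n=5, keep_system=True):
    def tokens(msgs):
        return sum(len(m["content"].split()) for m in msgs)

    budget = tokens(messages) * target_ratio
    head = [m for m in messages[:1] if keep_system and m["role"] == "system"]
    body = messages[len(head):]
    kept = body[-preserve_last_n:]  # always retained, even if over budget
    for m in reversed(body[:-preserve_last_n]):
        if tokens(head + [m] + kept) > budget:
            break
        kept.insert(0, m)
    return head + kept

msgs = [{"role": "system", "content": "You are a helpful financial advisor."}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": "filler words " * 20}
    for i in range(10)
]
print(len(msgs), "->", len(truncate(msgs)))  # 11 -> 6
```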
Deduplicate strategy
policies:
  - name: compress-context
    type: context_compression
    action: modify
    config:
      strategy: deduplicate
      target_ratio: 0.6
      preserve_system_message: true
      similarity_threshold: 0.9
Deduplication merges semantically similar messages. Best for conversations with repeated questions or restated context.
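Conceptually, deduplication keeps a message only if nothing similar is already retained. A rough sketch, using difflib string similarity as a stand-in for whatever semantic similarity measure the gateway applies (the deduplicate helper is illustrative only):

```python
# Rough illustration of deduplication: drop a message when it is
# near-identical (above similarity_threshold) to one already kept.
from difflib import SequenceMatcher

def deduplicate(messages, similarity_threshold=0.9):
    kept = []
    for m in messages:
        is_duplicate = any(
            k["role"] == m["role"]
            and SequenceMatcher(None, k["content"], m["content"]).ratio() >= similarity_threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(m)
    return kept

msgs = [
    {"role": "user", "content": "What is compound interest and how does it work?"},
    {"role": "user", "content": "What is compound interest, and how does it work?"},
    {"role": "user", "content": "Which accounts compound monthly?"},
]
print(len(deduplicate(msgs)))  # 2
```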
Step 7: Set Compression Guardrails
Add limits to prevent over-aggressive compression:
policies:
  - name: compress-context
    type: context_compression
    action: modify
    config:
      strategy: summarize
      target_ratio: 0.4
      quality_threshold: 0.85
      min_messages_preserved: 3
      preserve_system_message: true
      preserve_last_n_messages: 5
      max_input_tokens: 4000
      skip_below_tokens: 500
| Guardrail | Purpose |
|---|---|
| quality_threshold | Skip compression if the quality score falls below this value |
| max_input_tokens | Hard cap on the compressed input token count |
| skip_below_tokens | Do not compress if the input is already small |
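One plausible ordering of these checks, as a sketch. The should_apply helper is hypothetical; the real gateway may sequence the checks differently or handle the max_input_tokens cap by truncating further rather than skipping:

```python
# Sketch of the guardrail checks from the table above, in one plausible order:
# skip tiny inputs, reject low-quality compressions, enforce the hard cap.
def should_apply(original_tokens, compressed_tokens, quality_score,
                 quality_threshold=0.85, max_input_tokens=4000,
                 skip_below_tokens=500):
    if original_tokens < skip_below_tokens:
        return False, "input below skip_below_tokens; sent as-is"
    if quality_score < quality_threshold:
        return False, "quality below threshold; full context sent instead"
    if compressed_tokens > max_input_tokens:
        return False, "compressed input still exceeds max_input_tokens"
    return True, "compressed context accepted"

print(should_apply(847, 341, 0.91))  # (True, 'compressed context accepted')
print(should_apply(847, 341, 0.72))  # rejected on quality
```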
For AI systems
- Canonical terms: Keeptrusts gateway, context compression, token reduction, summarize strategy, truncate strategy, deduplicate strategy, quality threshold.
- Config fields: policies[].type: context_compression, config.strategy, config.target_ratio, config.min_messages_preserved, config.preserve_system_message, config.preserve_last_n_messages, config.quality_threshold.
- CLI commands: kt gateway run, kt policy lint, kt events tail.
- Best next pages: Caching Responses, Cost Tracking & Budgets, Model Routing A/B Test.
For engineers
- Prerequisites: kt CLI, OPENAI_API_KEY exported, curl and jq.
- Validate: kt policy lint confirms the compression policy config fields.
- Test quality: compare compressed vs. uncompressed responses for semantic similarity; quality_threshold: 0.85 is a conservative starting point.
- Monitor savings: kt events tail shows original and compressed token counts per request.
- Tune: lower target_ratio for more aggressive savings; raise preserve_last_n_messages if recent context is critical.
For leaders
- Context compression can reduce input token costs by 40–60% on long-context workloads without degrading answer quality.
- Combines well with caching for compounding cost savings.
- Quality threshold ensures answers remain useful — below-threshold compressions are rejected and the full context is sent.
- No additional infrastructure or third-party API calls — compression runs in-process within the gateway.
Next steps
- Cost Tracking & Budgets — monitor how compression reduces wallet spend
- Caching Responses — combine caching with compression for compounding savings
- Model Routing A/B Test — route smaller compressed contexts to cheaper models