Tutorial: Managing Context Window in Chat
Every language model has a finite context window. When your conversation grows beyond that limit, older messages are truncated or compressed. The Keeptrusts chat workbench provides tools to monitor token usage, configure compression behavior, and manage long conversations effectively.
Use this page when
- You need to monitor token usage and understand what consumes your model's context window.
- You want to configure context compression (summarize, sliding-window, or none) for long conversations.
- You are troubleshooting truncated responses or unexpected context resets.
Primary audience
- Primary: Technical Engineers (power users managing long conversations)
- Secondary: AI Agents (context-aware prompting), Technical Leaders (cost implications)
Prerequisites
- Access to the Keeptrusts chat workbench
- A configured model with a known context window size
- Basic understanding of how tokens relate to text length
Step 1: Read the Token Counter
The token counter is displayed in the conversation toolbar and updates after each message.
- Start or open a conversation in the chat workbench.
- Look for the token gauge in the toolbar — it shows current usage as a fraction of the model's context limit.
- The gauge changes color as usage increases:
| Color | Usage Level | Meaning |
|---|---|---|
| Green | 0–50% | Plenty of context remaining |
| Yellow | 50–80% | Approaching limit; consider summarizing |
| Red | 80–100% | Near limit; truncation or compression imminent |
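The color bands map directly onto the usage fraction. Below is a minimal sketch of that mapping, assuming the thresholds from the table above; the function name and boundary handling are illustrative, not the workbench's internals.

```python
def gauge_color(used_tokens: int, context_limit: int) -> str:
    """Map context usage to a gauge color using the thresholds above."""
    usage = used_tokens / context_limit
    if usage < 0.5:
        return "green"   # plenty of context remaining
    if usage < 0.8:
        return "yellow"  # approaching limit; consider summarizing
    return "red"         # near limit; compression or truncation imminent

print(gauge_color(96_000, 128_000))  # yellow (75% of a 128k window)
```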
Click the token counter to expand a detailed breakdown showing input tokens, output tokens, system prompt tokens, and any injected context (such as knowledge base assets).
Step 2: Understand Context Composition
The context window is consumed by multiple components. Understanding the breakdown helps you manage it.
- System Prompt — the base instructions and persona definition. This is always included and counts against your limit.
- Knowledge Base Context — any bound knowledge assets injected into the conversation.
- Conversation History — all prior user and assistant messages.
- Current Prompt — the message you are about to send.
- Reserved Output — tokens reserved for the model's response (typically configured via max_tokens).
The token breakdown panel shows each component's contribution. If knowledge base assets consume a large share, consider binding fewer assets or shorter ones.
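To see how these components add up, here is a back-of-the-envelope budget. The numbers are invented for illustration; only the component names come from the breakdown above.

```python
# Hypothetical token counts for one conversation on a 128k-context model.
context_limit = 128_000
components = {
    "system_prompt": 1_200,
    "knowledge_base_context": 24_000,  # bound knowledge assets
    "conversation_history": 58_000,    # prior user and assistant messages
    "current_prompt": 800,
    "reserved_output": 4_096,          # max_tokens held back for the response
}

used = sum(components.values())
print(f"{used:,} of {context_limit:,} tokens ({used / context_limit:.0%} used)")
# 88,096 of 128,000 tokens (69% used)
```

In this example the gauge would already be yellow, driven mostly by conversation history and knowledge assets.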
Step 3: Configure Context Compression
When the conversation approaches the context limit, compression can summarize older messages to free space.
- Open conversation Settings from the toolbar.
- Scroll to the Context Management section.
- Configure compression options:
| Setting | Description | Default |
|---|---|---|
| Compression Mode | none, summarize, or sliding-window | sliding-window |
| Compression Threshold | Percentage of context usage that triggers compression | 80% |
| Summary Model | Model used to generate conversation summaries | Same as chat model |
| Preserve Recent | Number of recent message pairs to always keep uncompressed | 5 |
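If you track these settings in a script or review checklist, a structured representation keeps the defaults in one place. The dataclass below is purely illustrative; only the setting names and defaults are taken from the table above.

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class ContextManagementConfig:
    """Illustrative mirror of the Context Management settings."""
    compression_mode: Literal["none", "summarize", "sliding-window"] = "sliding-window"
    compression_threshold: float = 0.80  # usage fraction that triggers compression
    summary_model: Optional[str] = None  # None means "same as chat model"
    preserve_recent: int = 5             # recent message pairs kept uncompressed
```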
Compression Modes Explained
- None — no compression. When the context is full, the oldest messages are hard-truncated.
- Summarize — when the threshold is reached, older messages are replaced with a concise summary generated by the summary model.
- Sliding Window — maintains a rolling window of recent messages. Messages outside the window are dropped without summarization.
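The modes differ only in what happens to messages outside the preserved window. Here is a compact sketch of sliding-window behavior; the helper, and the assumption that messages are stored oldest-first in alternating user/assistant turns, are illustrative.

```python
def apply_sliding_window(messages: list[dict], preserve_recent: int) -> list[dict]:
    """Keep only the most recent message pairs; drop the rest without summarizing.

    Assumes `messages` is oldest-first and alternates user/assistant turns,
    so one preserved pair equals two entries.
    """
    keep = preserve_recent * 2
    return messages[-keep:] if len(messages) > keep else messages
```

Summarize mode would instead replace the dropped slice (`messages[:-keep]`) with a single summary message from the summary model, while none mode skips both and relies on hard truncation.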
Step 4: Observe Truncation Behavior
When the context limit is reached and compression cannot free enough space, truncation occurs.
- Send messages until the token gauge enters the red zone.
- Continue the conversation past the context limit.
- A truncation notice appears in the conversation, indicating which messages were removed or summarized.
The truncation notice shows:
- How many messages were removed or compressed.
- The approximate token savings.
- A link to view the full, untruncated conversation history.
Truncated messages are not deleted. They remain in the conversation history and are accessible through the Full History view.
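The notice's figures are simple aggregates over the affected messages. Here is a sketch of how they could be derived, assuming each message record carries a `token_count` field; both the field and the helper are assumptions.

```python
def truncation_notice(removed: list[dict]) -> str:
    """Build the notice text: message count plus approximate token savings."""
    saved = sum(m["token_count"] for m in removed)
    return f"{len(removed)} messages removed or compressed (~{saved:,} tokens freed)"
```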
Step 5: Manage Long Conversations
For extended conversations that span many exchanges, use these strategies to maintain quality.
Strategy 1: Pin Important Messages
- Hover over a message you want to preserve.
- Click the Pin icon.
- Pinned messages are never truncated or compressed, regardless of context pressure.
Use pinning sparingly: each pinned message occupies context on every subsequent turn, shrinking the space left for new messages.
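In effect, pins join the fixed overhead alongside the system prompt and reserved output. A rough sketch of the remaining budget, with invented numbers and a hypothetical helper:

```python
def free_context(limit: int, system: int, pinned: int, reserved_output: int) -> int:
    """Tokens left for history and the current prompt after fixed overhead."""
    return limit - system - pinned - reserved_output

print(free_context(limit=128_000, system=1_200, pinned=6_500, reserved_output=4_096))
# 116204; every pinned token comes straight out of this budget
```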
Strategy 2: Branch the Conversation
- At any point, click Branch on a message to create a new conversation thread from that point.
- The branch starts fresh with only the system prompt and the branched message as context.
- The original conversation continues independently.
Branching is useful when a conversation shifts topic and you want full context dedicated to the new direction.
Strategy 3: Reset Context with Summary
- Click Summarize & Reset in the toolbar.
- The chat workbench generates a summary of the entire conversation so far.
- A new conversation begins with the summary injected as context.
This gives you a clean context window while preserving the essential information from the prior discussion.
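Conceptually, Summarize & Reset is a two-step operation: generate a summary of the conversation so far, then seed a fresh conversation with it. The sketch below uses a hypothetical client object; Keeptrusts does not necessarily expose this as an API, and every name here is an assumption.

```python
def summarize_and_reset(client, conversation_id: str) -> str:
    """Hypothetical flow: summarize the old thread, then start a fresh one."""
    history = client.get_messages(conversation_id)         # full prior transcript
    summary = client.summarize(history)                    # runs the summary model
    return client.create_conversation(context=[summary])   # summary is the only carried-over context
```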
Step 6: Monitor Context Across Team Conversations
If you manage a team, monitor how context is being used across conversations.
- Open the Keeptrusts console and navigate to Chat Analytics.
- Review the Context Utilization panel, which shows average and peak context usage across team conversations.
- Identify conversations that frequently hit the context limit — these may benefit from shorter system prompts or fewer knowledge base bindings.
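Under the hood, the panel reduces to per-conversation usage fractions. A sketch of the average/peak aggregation it reports, over made-up data:

```python
# Hypothetical peak usage fractions, one per team conversation.
usages = [0.42, 0.95, 0.61, 0.88, 0.97]

average = sum(usages) / len(usages)
peak = max(usages)
hot = [u for u in usages if u >= 0.80]  # conversations that entered the red zone
print(f"avg {average:.0%}, peak {peak:.0%}, {len(hot)} conversations near the limit")
# avg 77%, peak 97%, 3 conversations near the limit
```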
Token Counting Accuracy
Token counts in the chat workbench are estimates based on the selected model's tokenizer. Actual token usage may differ slightly because:
- Different models use different tokenization algorithms.
- Special tokens (start/end of message markers) are counted but not visible.
- Image or file attachments have model-specific token costs.
The token counter refreshes after each API response with the actual token count reported by the provider.
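You can reproduce a rough client-side estimate with an open tokenizer such as tiktoken. This is only an approximation: the workbench's estimate depends on the selected model's own tokenizer, and cl100k_base is just one common encoding.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one common encoding; models differ
text = "Every language model has a finite context window."
tokens = enc.encode(text)
print(len(tokens), "tokens for", len(text), "characters")
# English prose averages roughly 4 characters per token, but special tokens
# and attachments make the provider-reported count the authoritative one.
```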
Troubleshooting
| Issue | Solution |
|---|---|
| Token counter shows 0 | Refresh the page; the counter initializes after the first message |
| Compression summary is too brief | Increase the summary model's max_tokens or switch to a more capable summary model |
| Pinned messages not preserved | Verify the pin icon is active (highlighted) on the message |
| Context resets unexpectedly | Check the compression threshold — it may be set too low |
Layered Context Model
When the layered memory system is enabled, the token counter breakdown shows three additional lanes:
| Lane | Description |
|---|---|
| Always remembered | Frozen memory facts that appear every turn (stable, tiny) |
| Knowledge used | Ranked knowledge base assets and memories |
| Past session context | Episodic recall from prior sessions |
These lanes are assembled before your conversation messages and contribute to total context usage. The "Context used" panel on session detail pages shows which items from each lane were injected and can link back to their source records.
If the combined context exceeds the model limit:
- The gateway first attempts a context flush (condensing older context into a summary)
- If that isn't enough, normal compression fires as a fallback
See Context Compression for details on the flush step.
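The ordering matters: the flush runs first, and compression only fires if the flush cannot free enough space. Here is a minimal sketch of that control flow, with the token savings passed in as stand-in numbers and every name assumed rather than taken from the gateway:

```python
def fit_context(total_tokens: int, limit: int, flush_savings: int, compress_savings: int) -> str:
    """Illustrative fallback order when assembled context exceeds the model limit."""
    if total_tokens <= limit:
        return "fits"
    total_tokens -= flush_savings       # step 1: context flush condenses older context
    if total_tokens <= limit:
        return "flush was enough"
    total_tokens -= compress_savings    # step 2: normal compression as a fallback
    return "flush + compression" if total_tokens <= limit else "still over limit"

print(fit_context(total_tokens=140_000, limit=128_000, flush_savings=9_000, compress_savings=20_000))
# flush + compression
```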
Next steps
- Tutorial: Searching Chat History — find and revisit past conversations including truncated content.
- Tutorial: Multi-Turn Conversation Policies — set conversation-level token budgets via policy.
- Tutorial: System Prompts in Chat — optimize your system prompt to save context space.
For AI systems
- Canonical terms: Keeptrusts chat workbench, context window, token counter, token gauge, context compression, sliding-window, summarize mode, conversation truncation, Compression Threshold, Preserve Recent, context flush, layered context model.
- Config names: Compression Mode (none/summarize/sliding-window), Compression Threshold (default 80%), Summary Model, Preserve Recent (default 5 pairs).
- Context composition: System Prompt + Knowledge Base Context + Conversation History + Current Prompt + Reserved Output.
- Best next pages: Chat History Search, Multi-Turn Policies, System Prompts.
For engineers
- Prerequisites: a configured model with a known context window size; understanding of token–text relationship (~4 chars per token for English).
- Validation: Send messages until the token gauge turns yellow (50%) → verify the color change. Trigger the compression threshold → verify older messages are summarized or dropped. Check the token breakdown panel for per-component counts.
- Optimization: Start new conversations for unrelated topics; bind fewer/shorter knowledge assets; set a concise system prompt.
For leaders
- Context management directly impacts cost — longer contexts mean higher per-message token charges.
- Compression extends conversation usefulness without proportional cost increase.
- Teams doing document analysis or multi-session research benefit most from tuned compression settings.
- Monitor token consumption trends in analytics to identify teams that need larger context window models.