Tutorial: Managing Context Window in Chat
Every language model has a finite context window. When your conversation grows beyond that limit, older messages are truncated or compressed. The Keeptrusts chat workbench provides tools to monitor token usage, configure compression behavior, and manage long conversations effectively.
Use this page when
- You need to monitor token usage and understand what consumes your model's context window.
- You want to configure context compression (summarize, sliding-window, or none) for long conversations.
- You are troubleshooting truncated responses or unexpected context resets.
Primary audience
- Primary: Technical Engineers (power users managing long conversations)
- Secondary: AI Agents (context-aware prompting), Technical Leaders (cost implications)
Prerequisites
- Access to the Keeptrusts chat workbench
- A configured model with a known context window size
- Basic understanding of how tokens relate to text length
Step 1: Read the Token Counter
The token counter is displayed in the conversation toolbar and updates after each message.
- Start or open a conversation in the chat workbench.
- Look for the token gauge in the toolbar — it shows current usage as a fraction of the model's context limit.
- The gauge changes color as usage increases:
| Color | Usage Level | Meaning |
|---|---|---|
| Green | 0–50% | Plenty of context remaining |
| Yellow | 50–80% | Approaching limit; consider summarizing |
| Red | 80–100% | Near limit; truncation or compression imminent |
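The color bands map directly onto the usage fraction. Below is a minimal sketch of that mapping, assuming the thresholds from the table above; the function name and boundary handling are illustrative, not the workbench's internals.

```python
def gauge_color(used_tokens: int, context_limit: int) -> str:
    """Map context usage to a gauge color using the thresholds above."""
    usage = used_tokens / context_limit
    if usage < 0.5:
        return "green"   # plenty of context remaining
    if usage < 0.8:
        return "yellow"  # approaching limit; consider summarizing
    return "red"         # near limit; compression or truncation imminent

print(gauge_color(96_000, 128_000))  # yellow (75% of a 128k window)
```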
Click the token counter to expand a detailed breakdown showing input tokens, output tokens, system prompt tokens, and any injected context (such as knowledge base assets).
Step 2: Understand Context Composition
The context window is consumed by multiple components. Understanding the breakdown helps you manage it.
- System Prompt — the base instructions and persona definition. This is always included and counts against your limit.
- Knowledge Base Context — any bound knowledge assets injected into the conversation.
- Conversation History — all prior user and assistant messages.
- Current Prompt — the message you are about to send.
- Reserved Output — tokens reserved for the model's response (typically configured via max_tokens).
The token breakdown panel shows each component's contribution. If knowledge base assets consume a large share, consider binding fewer assets or shorter ones.
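To see how these components add up, here is a back-of-the-envelope budget. The numbers are invented for illustration; only the component names come from the breakdown above.

```python
# Hypothetical token counts for one conversation on a 128k-context model.
context_limit = 128_000
components = {
    "system_prompt": 1_200,
    "knowledge_base_context": 24_000,  # bound knowledge assets
    "conversation_history": 58_000,    # prior user and assistant messages
    "current_prompt": 800,
    "reserved_output": 4_096,          # max_tokens held back for the response
}

used = sum(components.values())
print(f"{used:,} of {context_limit:,} tokens ({used / context_limit:.0%} used)")
# 88,096 of 128,000 tokens (69% used)
```

In this example the gauge would already be yellow, driven mostly by conversation history and knowledge assets.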
Step 3: Configure Context Compression
When the conversation approaches the context limit, compression can summarize older messages to free space.
- Open conversation Settings from the toolbar.
- Scroll to the Context Management section.
- Configure compression options:
| Setting | Description | Default |
|---|---|---|
| Compression Mode | none, summarize, or sliding-window | sliding-window |
| Compression Threshold | Percentage of context usage that triggers compression | 80% |
| Summary Model | Model used to generate conversation summaries | Same as chat model |
| Preserve Recent | Number of recent message pairs to always keep uncompressed | 5 |
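If you track these settings in a script or review checklist, a structured representation keeps the defaults in one place. The dataclass below is purely illustrative; only the setting names and defaults are taken from the table above.

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class ContextManagementConfig:
    """Illustrative mirror of the Context Management settings."""
    compression_mode: Literal["none", "summarize", "sliding-window"] = "sliding-window"
    compression_threshold: float = 0.80  # usage fraction that triggers compression
    summary_model: Optional[str] = None  # None means "same as chat model"
    preserve_recent: int = 5             # recent message pairs kept uncompressed
```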
Compression Modes Explained
- None — no compression. When the context is full, the oldest messages are hard-truncated.
- Summarize — when the threshold is reached, older messages are replaced with a concise summary generated by the summary model.
- Sliding Window — maintains a rolling window of recent messages. Messages outside the window are dropped without summarization.
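The modes differ only in what happens to messages outside the preserved window. Here is a compact sketch of sliding-window behavior; the helper, and the assumption that messages are stored oldest-first in alternating user/assistant turns, are illustrative.

```python
def apply_sliding_window(messages: list[dict], preserve_recent: int) -> list[dict]:
    """Keep only the most recent message pairs; drop the rest without summarizing.

    Assumes `messages` is oldest-first and alternates user/assistant turns,
    so one preserved pair equals two entries.
    """
    keep = preserve_recent * 2
    return messages[-keep:] if len(messages) > keep else messages
```

Summarize mode would instead replace the dropped slice (`messages[:-keep]`) with a single summary message from the summary model, while none mode skips both and relies on hard truncation.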
Step 4: Observe Truncation Behavior
When the context limit is reached and compression cannot free enough space, truncation occurs.
- Send messages until the token gauge enters the red zone.
- Continue the conversation past the context limit.
- A truncation notice appears in the conversation, indicating which messages were removed or summarized.
The truncation notice shows:
- How many messages were removed or compressed.
- The approximate token savings.
- A link to view the full, untruncated conversation history.
Truncated messages are not deleted. They remain in the conversation history and are accessible through the Full History view.
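The notice's figures are simple aggregates over the affected messages. Here is a sketch of how they could be derived, assuming each message record carries a `token_count` field; both the field and the helper are assumptions.

```python
def truncation_notice(removed: list[dict]) -> str:
    """Build the notice text: message count plus approximate token savings."""
    saved = sum(m["token_count"] for m in removed)
    return f"{len(removed)} messages removed or compressed (~{saved:,} tokens freed)"
```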
Step 5: Manage Long Conversations
For extended conversations that span many exchanges, use these strategies to maintain quality.
Strategy 1: Pin Important Messages
- Hover over a message you want to preserve.
- Click the Pin icon.
- Pinned messages are never truncated or compressed, regardless of context pressure.
Use pinning sparingly: each pinned message occupies context on every subsequent turn, shrinking the space left for new messages.
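In effect, pins join the fixed overhead alongside the system prompt and reserved output. A rough sketch of the remaining budget, with invented numbers and a hypothetical helper:

```python
def free_context(limit: int, system: int, pinned: int, reserved_output: int) -> int:
    """Tokens left for history and the current prompt after fixed overhead."""
    return limit - system - pinned - reserved_output

print(free_context(limit=128_000, system=1_200, pinned=6_500, reserved_output=4_096))
# 116204; every pinned token comes straight out of this budget
```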
Strategy 2: Branch the Conversation
- At any point, click Branch on a message to create a new conversation thread from that point.
- The branch starts fresh with only the system prompt and the branched message as context.
- The original conversation continues independently.
Branching is useful when a conversation shifts topic and you want full context dedicated to the new direction.
Strategy 3: Reset Context with Summary
- Click Summarize & Reset in the toolbar.
- The chat workbench generates a summary of the entire conversation so far.
- A new conversation begins with the summary injected as context.
This gives you a clean context window while preserving the essential information from the prior discussion.
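Conceptually, Summarize & Reset is a two-step operation: generate a summary of the conversation so far, then seed a fresh conversation with it. The sketch below uses a hypothetical client object; Keeptrusts does not necessarily expose this as an API, and every name here is an assumption.

```python
def summarize_and_reset(client, conversation_id: str) -> str:
    """Hypothetical flow: summarize the old thread, then start a fresh one."""
    history = client.get_messages(conversation_id)         # full prior transcript
    summary = client.summarize(history)                    # runs the summary model
    return client.create_conversation(context=[summary])   # summary is the only carried-over context
```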
Step 6: Monitor Context Across Team Conversations
If you manage a team, monitor how context is being used across conversations.
- Open the Keeptrusts console and navigate to Chat Analytics.
- Review the Context Utilization panel, which shows average and peak context usage across team conversations.
- Identify conversations that frequently hit the context limit — these may benefit from shorter system prompts or fewer knowledge base bindings.
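Under the hood, the panel reduces to per-conversation usage fractions. A sketch of the average/peak aggregation it reports, over made-up data:

```python
# Hypothetical peak usage fractions, one per team conversation.
usages = [0.42, 0.95, 0.61, 0.88, 0.97]

average = sum(usages) / len(usages)
peak = max(usages)
hot = [u for u in usages if u >= 0.80]  # conversations that entered the red zone
print(f"avg {average:.0%}, peak {peak:.0%}, {len(hot)} conversations near the limit")
# avg 77%, peak 97%, 3 conversations near the limit
```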
Token Counting Accuracy
Token counts in the chat workbench are estimates based on the selected model's tokenizer. Actual token usage may differ slightly because:
- Different models use different tokenization algorithms.
- Special tokens (start/end of message markers) are counted but not visible.
- Image or file attachments have model-specific token costs.
The token counter refreshes after each API response with the actual token count reported by the provider.
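You can reproduce a rough client-side estimate with an open tokenizer such as tiktoken. This is only an approximation: the workbench's estimate depends on the selected model's own tokenizer, and cl100k_base is just one common encoding.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one common encoding; models differ
text = "Every language model has a finite context window."
tokens = enc.encode(text)
print(len(tokens), "tokens for", len(text), "characters")
# English prose averages roughly 4 characters per token, but special tokens
# and attachments make the provider-reported count the authoritative one.
```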
Troubleshooting
| Issue | Solution |
|---|---|
| Token counter shows 0 | Refresh the page; the counter initializes after the first message |
| Compression summary is too brief | Increase the summary model's max_tokens or switch to a more capable summary model |
| Pinned messages not preserved | Verify the pin icon is active (highlighted) on the message |
| Context resets unexpectedly | Check the compression threshold — it may be set too low |
Layered Context Model
When the layered memory system is enabled, the token counter breakdown shows three additional lanes:
| Lane | Description |
|---|---|
| Always remembered | Frozen memory facts that appear every turn (stable, tiny) |
| Knowledge used | Ranked knowledge base assets and memories |
| Past session context | Episodic recall from prior sessions |
These lanes are assembled before your conversation messages and contribute to total context usage. The "Context used" panel on session detail pages shows which items from each lane were injected and can link back to their source records.
If the combined context exceeds the model limit:
- The gateway first attempts a context flush (condensing older context into a summary)
- If that isn't enough, normal compression fires as a fallback
See Context Compression for details on the flush step.
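The ordering matters: the flush runs first, and compression only fires if the flush cannot free enough space. Here is a minimal sketch of that control flow, with the token savings passed in as stand-in numbers and every name assumed rather than taken from the gateway:

```python
def fit_context(total_tokens: int, limit: int, flush_savings: int, compress_savings: int) -> str:
    """Illustrative fallback order when assembled context exceeds the model limit."""
    if total_tokens <= limit:
        return "fits"
    total_tokens -= flush_savings       # step 1: context flush condenses older context
    if total_tokens <= limit:
        return "flush was enough"
    total_tokens -= compress_savings    # step 2: normal compression as a fallback
    return "flush + compression" if total_tokens <= limit else "still over limit"

print(fit_context(total_tokens=140_000, limit=128_000, flush_savings=9_000, compress_savings=20_000))
# flush + compression
```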
Next steps
- Tutorial: Searching Chat History — find and revisit past conversations including truncated content.
- Tutorial: Multi-Turn Conversation Policies — set conversation-level token budgets via policy.
- Tutorial: System Prompts in Chat — optimize your system prompt to save context space.
For AI systems
- Canonical terms: Keeptrusts chat workbench, context window, token counter, token gauge, context compression, sliding-window, summarize mode, conversation truncation, Compression Threshold, Preserve Recent, context flush, layered context model.
- Config names: Compression Mode (none/summarize/sliding-window), Compression Threshold (default 80%), Summary Model, Preserve Recent (default 5 pairs).
- Context composition: System Prompt + Knowledge Base Context + Conversation History + Current Prompt + Reserved Output.
- Best next pages: Chat History Search, Multi-Turn Policies, System Prompts.
For engineers
- Prerequisites: a configured model with a known context window size; understanding of token–text relationship (~4 chars per token for English).
- Validation: Send messages until the token gauge turns yellow (50%) → verify the color change. Trigger the compression threshold → verify older messages are summarized or dropped. Check the token breakdown panel for per-component counts.
- Optimization: Start new conversations for unrelated topics; bind fewer/shorter knowledge assets; set a concise system prompt.
For leaders
- Context management directly impacts cost — longer contexts mean higher per-message token charges.
- Compression extends conversation usefulness without proportional cost increase.
- Teams doing document analysis or multi-session research benefit most from tuned compression settings.
- Monitor token consumption trends in analytics to identify teams that need larger context window models.