Tutorial: Managing the Context Window in Chat

Every language model has a finite context window. When your conversation grows beyond that limit, older messages are truncated or compressed. The Keeptrusts chat workbench provides tools to monitor token usage, configure compression behavior, and manage long conversations effectively.

Use this page when

  • You need to monitor token usage and understand what consumes your model's context window.
  • You want to configure context compression (summarize, sliding-window, or none) for long conversations.
  • You are troubleshooting truncated responses or unexpected context resets.

Primary audience

  • Primary: Technical Engineers (power users managing long conversations)
  • Secondary: AI Agents (context-aware prompting), Technical Leaders (cost implications)

Prerequisites

  • Access to the Keeptrusts chat workbench
  • A configured model with a known context window size
  • Basic understanding of how tokens relate to text length

Step 1: Read the Token Counter

The token counter is displayed in the conversation toolbar and updates after each message.

  1. Start or open a conversation in the chat workbench.
  2. Look for the token gauge in the toolbar — it shows current usage as a fraction of the model's context limit.
  3. The gauge changes color as usage increases:
  Color  | Usage Level | Meaning
  Green  | 0–50%       | Plenty of context remaining
  Yellow | 50–80%      | Approaching limit; consider summarizing
  Red    | 80–100%     | Near limit; truncation or compression imminent

Click the token counter to expand a detailed breakdown showing input tokens, output tokens, system prompt tokens, and any injected context (such as knowledge base assets).
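
The color thresholds above amount to a simple lookup on the usage fraction. Here is a minimal sketch (the function name and example numbers are hypothetical; this is not part of the Keeptrusts API):

```python
def gauge_color(used_tokens: int, context_limit: int) -> str:
    """Map context usage to the gauge colors from the table above."""
    usage = used_tokens / context_limit
    if usage < 0.50:
        return "green"   # plenty of context remaining
    if usage < 0.80:
        return "yellow"  # approaching the limit; consider summarizing
    return "red"         # near the limit; truncation or compression imminent

print(gauge_color(96_000, 128_000))  # "yellow" at 75% usage
```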

Step 2: Understand Context Composition

The context window is consumed by multiple components. Understanding the breakdown helps you manage it.

  • System Prompt — the base instructions and persona definition. This is always included and counts against your limit.
  • Knowledge Base Context — any bound knowledge assets injected into the conversation.
  • Conversation History — all prior user and assistant messages.
  • Current Prompt — the message you are about to send.
  • Reserved Output — tokens reserved for the model's response (typically configured via max_tokens).

The token breakdown panel shows each component's contribution. If knowledge base assets consume a large share, consider binding fewer assets or shorter ones.
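
To make the arithmetic concrete, here is an illustrative budget check. The component names mirror the list above; all the numbers are hypothetical:

```python
# Hypothetical per-component token counts, mirroring the breakdown above.
context = {
    "system_prompt": 1_200,
    "knowledge_base_context": 6_500,
    "conversation_history": 42_000,
    "current_prompt": 300,
    "reserved_output": 4_096,  # max_tokens reserved for the response
}
CONTEXT_LIMIT = 128_000  # assumed model context window

used = sum(context.values())
print(f"{used:,} / {CONTEXT_LIMIT:,} tokens ({used / CONTEXT_LIMIT:.0%})")
# -> 54,096 / 128,000 tokens (42%)
```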

Step 3: Configure Context Compression

When the conversation approaches the context limit, compression can summarize older messages to free space.

  1. Open conversation Settings from the toolbar.
  2. Scroll to the Context Management section.
  3. Configure compression options:
  Setting               | Description                                                | Default
  Compression Mode      | none, summarize, or sliding-window                         | sliding-window
  Compression Threshold | Percentage of context usage that triggers compression      | 80%
  Summary Model         | Model used to generate conversation summaries              | Same as chat model
  Preserve Recent       | Number of recent message pairs to always keep uncompressed | 5
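
For reference, the settings map onto a structure like the following. This is a hypothetical sketch; the workbench exposes these options only through the Settings UI, not as a config file:

```python
# Hypothetical mirror of the Context Management settings above.
context_management = {
    "compression_mode": "sliding-window",  # none | summarize | sliding-window
    "compression_threshold": 0.80,         # trigger at 80% context usage
    "summary_model": None,                 # None = same as the chat model
    "preserve_recent": 5,                  # recent message pairs kept verbatim
}
```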

Compression Modes Explained

  • None — no compression. When the context is full, the oldest messages are hard-truncated.
  • Summarize — when the threshold is reached, older messages are replaced with a concise summary generated by the summary model.
  • Sliding Window — maintains a rolling window of recent messages. Messages outside the window are dropped without summarization.
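
The behavioral difference between the modes can be sketched in a few lines. This is illustrative pseudologic, not Keeptrusts source code; `summarize` stands in for a call to the configured summary model, and Preserve Recent is treated as a plain message count for brevity (the real setting counts message pairs):

```python
from typing import Callable

def compress(messages: list[str], mode: str, preserve_recent: int,
             summarize: Callable[[list[str]], str]) -> list[str]:
    """Illustrative sketch of the three compression modes described above."""
    recent = messages[-preserve_recent:]
    older = messages[:-preserve_recent]

    if mode == "none":
        return messages  # left intact; hard truncation happens only when full
    if mode == "sliding-window":
        return recent    # older messages are dropped without summarization
    if mode == "summarize":
        return [summarize(older)] + recent
    raise ValueError(f"unknown compression mode: {mode!r}")
```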

Step 4: Observe Truncation Behavior

When the context limit is reached and compression cannot free enough space, truncation occurs.

  1. Send messages until the token gauge enters the red zone.
  2. Continue the conversation past the context limit.
  3. A truncation notice appears in the conversation, indicating which messages were removed or summarized.

The truncation notice shows:

  • How many messages were removed or compressed.
  • The approximate token savings.
  • A link to view the full, untruncated conversation history.

Truncated messages are not deleted. They remain in the conversation history and are accessible through the Full History view.

Step 5: Manage Long Conversations

For extended conversations that span many exchanges, use these strategies to maintain quality.

Strategy 1: Pin Important Messages

  1. Hover over a message you want to preserve.
  2. Click the Pin icon.
  3. Pinned messages are never truncated or compressed, regardless of context pressure.

Use pinning sparingly — each pinned message permanently reduces available context.
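
Conceptually, pinning removes a message from the pool of compression candidates while its tokens stay budgeted. A rough sketch, using the ~4-characters-per-token heuristic and a hypothetical (text, pinned) message shape:

```python
messages = [
    ("Project requirements: must support SSO and audit logging.", True),
    ("Here is a first draft of the rollout plan...", False),
    ("Feedback on the draft...", False),
]

compressible = [text for text, pinned in messages if not pinned]
pinned_tokens = sum(len(text) // 4 for text, pinned in messages if pinned)
print(f"{len(compressible)} compressible messages; "
      f"~{pinned_tokens} tokens permanently held by pins")
```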

Strategy 2: Branch the Conversation

  1. At any point, click Branch on a message to create a new conversation thread from that point.
  2. The branch starts fresh with only the system prompt and the branched message as context.
  3. The original conversation continues independently.

Branching is useful when a conversation shifts topic and you want full context dedicated to the new direction.

Strategy 3: Reset Context with Summary

  1. Click Summarize & Reset in the toolbar.
  2. The chat workbench generates a summary of the entire conversation so far.
  3. A new conversation begins with the summary injected as context.

This gives you a clean context window while preserving the essential information from the prior discussion.
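
In outline, the flow condenses the prior exchange into a single message that seeds the new context. A minimal sketch, assuming a `summarize` callable that wraps the summary model (hypothetical names; the workbench performs all of this for you via the toolbar):

```python
from typing import Callable

def summarize_and_reset(history: list[dict], system_prompt: str,
                        summarize: Callable[[list[dict]], str]) -> list[dict]:
    """Start a fresh context seeded with a summary of the prior conversation."""
    summary = summarize(history)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "system",
         "content": f"Summary of the prior discussion:\n{summary}"},
    ]
```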

Step 6: Monitor Context Across Team Conversations

If you manage a team, monitor how context is being used across conversations.

  1. Open the Keeptrusts console and navigate to Chat Analytics.
  2. Review the Context Utilization panel, which shows average and peak context usage across team conversations.
  3. Identify conversations that frequently hit the context limit — these may benefit from shorter system prompts or fewer knowledge base bindings.

Token Counting Accuracy

Token counts in the chat workbench are estimates based on the selected model's tokenizer. Actual token usage may differ slightly because:

  • Different models use different tokenization algorithms.
  • Special tokens (start/end of message markers) are counted but not visible.
  • Image or file attachments have model-specific token costs.

The token counter refreshes after each API response with the actual token count reported by the provider.
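
If you want a local estimate before sending, an OpenAI-style tokenizer gives a reasonable approximation. A sketch using the open-source tiktoken library (this assumes a cl100k_base-compatible model; the workbench's own estimate depends on the model you selected):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "Every language model has a finite context window."
print(len(enc.encode(text)))  # estimate; providers may report slightly more
```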

Troubleshooting

  Issue                            | Solution
  Token counter shows 0            | Refresh the page; the counter initializes after the first message
  Compression summary is too brief | Increase the summary model's max_tokens or switch to a more capable summary model
  Pinned messages not preserved    | Verify the pin icon is active (highlighted) on the message
  Context resets unexpectedly      | Check the compression threshold; it may be set too low

Layered Context Model

When the layered memory system is enabled, the token counter breakdown shows three additional lanes:

  Lane                 | Description
  Always remembered    | Frozen memory facts that appear every turn (stable, tiny)
  Knowledge used       | Ranked knowledge base assets and memories
  Past session context | Episodic recall from prior sessions

These lanes are assembled before your conversation messages and contribute to total context usage. The "Context used" panel on session detail pages shows which items from each lane were injected and can link back to their source records.

If the combined context exceeds the model limit:

  1. The gateway first attempts a context flush (condensing older context into a summary).
  2. If that isn't enough, normal compression fires as a fallback.

See Context Compression for details on the flush step.
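
The assembly order and the two-stage overflow handling can be sketched as follows. All helper names here are hypothetical; this is not the gateway's actual interface:

```python
def assemble_context(always_remembered: list[str], knowledge: list[str],
                     episodic: list[str], conversation: list[str],
                     limit: int, flush, compress) -> list[str]:
    """Lanes first, then conversation; flush, then compression, on overflow."""
    context = always_remembered + knowledge + episodic + conversation

    def tokens(items: list[str]) -> int:
        return sum(len(item) for item in items) // 4  # ~4 chars per token

    if tokens(context) > limit:
        context = flush(context)     # condense older context into a summary
    if tokens(context) > limit:
        context = compress(context)  # normal compression as a fallback
    return context
```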

Next steps

For AI systems

  • Canonical terms: Keeptrusts chat workbench, context window, token counter, token gauge, context compression, sliding-window, summarize mode, conversation truncation, Compression Threshold, Preserve Recent, context flush, layered context model.
  • Config names: Compression Mode (none/summarize/sliding-window), Compression Threshold (default 80%), Summary Model, Preserve Recent (default 5 pairs).
  • Context composition: System Prompt + Knowledge Base Context + Conversation History + Current Prompt + Reserved Output.
  • Best next pages: Chat History Search, Multi-Turn Policies, System Prompts.

For engineers

  • Prerequisites: a configured model with a known context window size; understanding of token–text relationship (~4 chars per token for English).
  • Validation: Send messages until token gauge turns yellow (50%) → verify color change. Trigger compression threshold → verify older messages are summarized or dropped. Check token breakdown panel for per-component counts.
  • Optimization: Start new conversations for unrelated topics; bind fewer/shorter knowledge assets; set a concise system prompt.

For leaders

  • Context management directly impacts cost — longer contexts mean higher per-message token charges.
  • Compression extends conversation usefulness without proportional cost increase.
  • Teams doing document analysis or multi-session research benefit most from tuned compression settings.
  • Monitor token consumption trends in analytics to identify teams that need larger context window models.