Streaming Responses & Real-Time Chat
This tutorial explains how the Keeptrusts chat workbench handles streaming responses: how tokens are displayed as they arrive, how policies evaluate streaming output, how interruptions are handled, and what latency to expect.
Use this page when
- You want to understand how token-by-token streaming works in the chat workbench.
- You need to know how output policies evaluate streaming content (accumulation vs. post-stream).
- You are diagnosing latency issues, stream interruptions, or typing indicator problems.
Primary audience
- Primary: Technical Engineers (understanding streaming UX and policy interaction)
- Secondary: AI Agents (streaming API consumers), Technical Leaders (latency expectations)
Prerequisites
- Authenticated access to the Keeptrusts chat workbench
- A gateway with a streaming-capable model configured
- Familiarity with the first conversation tutorial
Step 1: Understand Streaming in Keeptrusts
When you send a message in the chat workbench, the model generates its response token by token. Instead of waiting for the complete response, Keeptrusts streams tokens to the chat interface as they are produced:
User sends message
→ Gateway evaluates input policies
→ Request forwarded to model provider
→ Model streams tokens back
→ Gateway evaluates output policies on the stream
→ Chat workbench renders tokens incrementally
This provides a responsive experience — you see the response forming in real time rather than waiting for the entire generation to complete.
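The sketch below shows what this flow looks like from a client's perspective, assuming the gateway exposes a streaming HTTP endpoint that emits SSE-style `data:` events. The endpoint path and payload shape are illustrative assumptions, not the actual Keeptrusts API; check your gateway's API reference for the real contract.

```typescript
// Consume a streamed chat response through the gateway. The endpoint path,
// request shape, and SSE "data:" framing are assumptions for illustration.
async function streamChat(prompt: string, onToken: (token: string) => void): Promise<void> {
  const response = await fetch("/api/gateway/chat", {                    // hypothetical endpoint
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages: [{ role: "user", content: prompt }], stream: true }),
  });

  if (!response.ok || !response.body) {
    throw new Error(`Request failed or was blocked: ${response.status}`); // e.g. an input policy block
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;                                                      // stream complete
    for (const line of decoder.decode(value, { stream: true }).split("\n")) {
      if (line.startsWith("data: ")) onToken(line.slice("data: ".length)); // one chunk per event
    }
  }
}
```

Calling `streamChat(prompt, token => { outputElement.textContent += token; })` would reproduce the incremental rendering described in the next step.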
Step 2: Observe Token-by-Token Display
When a response streams in, the chat workbench displays it progressively:
- Send a message in the chat workbench.
- Watch the response area — text appears word-by-word or phrase-by-phrase.
- A typing indicator or cursor animation shows that the response is still generating.
- When streaming completes, the indicator disappears and the full response is finalized.
What you see during streaming
| UI Element | During Streaming | After Completion |
|---|---|---|
| Response text | Growing progressively | Complete and static |
| Typing indicator | Visible (pulsing cursor or dots) | Hidden |
| Send button | Disabled (or shows stop option) | Enabled |
| Token count | Incrementing | Final count |
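As a rough illustration of the behavior in the table, a client renderer might look like the following. The element IDs are hypothetical, and the real workbench handles all of this internally.

```typescript
// Illustration of the table above: the response text grows as tokens arrive,
// a typing indicator shows while streaming, and Send is re-enabled at the end.
// Element IDs are hypothetical; the real workbench manages this internally.
async function renderStream(tokens: AsyncIterable<string>): Promise<void> {
  const output = document.getElementById("response-text")!;
  const indicator = document.getElementById("typing-indicator")!;
  const sendButton = document.getElementById("send") as HTMLButtonElement;

  indicator.hidden = false;                  // streaming started
  sendButton.disabled = true;
  for await (const token of tokens) {
    output.textContent += token;             // response text grows progressively
  }
  indicator.hidden = true;                   // streaming finished
  sendButton.disabled = false;
}
```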
Step 3: Understand Policy Evaluation on Streams
The gateway evaluates policies on streaming output differently than on complete responses:
Input phase (pre-stream)
Input policies are evaluated on your full prompt before the request is forwarded to the model, and therefore before any streaming begins. If an input policy blocks the request, the block response is returned immediately and no streaming occurs.
Output phase (during stream)
Output policy evaluation on streaming content depends on the policy type:
| Policy Type | Streaming Behavior |
|---|---|
| Disclaimers | Appended after the stream completes |
| Content redaction | Applied inline as tokens arrive (may cause brief pauses) |
| Toxicity filters | Evaluated on accumulated content; may halt the stream |
| Token limits | Stream is cut off when the limit is reached |
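The following gateway-side sketch illustrates how these behaviors can combine for one hypothetical policy set: tokens pass through inline, a toxicity check runs on accumulated content and can halt the stream, and a disclaimer is appended after the stream completes. The `isToxic()` check and policy shape are assumptions for illustration, not the actual policy engine.

```typescript
// Stream-aware output policy sketch: pass tokens through inline, halt on an
// accumulated toxicity check, append a disclaimer after the stream completes.
// isToxic() is a hypothetical stand-in for a real policy evaluator.
async function* applyOutputPolicies(
  upstream: AsyncIterable<string>,
  isToxic: (text: string) => boolean,
  disclaimer: string,
): AsyncGenerator<string> {
  let accumulated = "";
  for await (const token of upstream) {
    accumulated += token;                         // evaluate on accumulated content
    if (isToxic(accumulated)) {
      yield "\n[response halted by output policy]";
      return;                                     // halt the stream
    }
    yield token;                                  // pass the token through inline
  }
  yield `\n\n${disclaimer}`;                      // disclaimer appended after the stream completes
}
```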
Step 4: Handle Stream Interruptions
Stopping a response
If you want to stop a response while it is streaming:
- Look for the Stop button (typically a square icon) that replaces the send button during streaming.
- Click Stop.
- The stream is terminated and generation stops.
The partial response is preserved in the conversation history; you can send a new message to continue.
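On the client side, stopping a stream typically amounts to aborting the in-flight request. A minimal sketch, assuming the request is a fetch stream that accepts an AbortSignal (the endpoint path is hypothetical and the read loop is elided; see the Step 1 sketch):

```typescript
// Client-side Stop: abort the in-flight request and keep whatever has already
// been rendered. Endpoint path is hypothetical.
const controller = new AbortController();

async function sendWithStop(prompt: string): Promise<void> {
  try {
    const response = await fetch("/api/gateway/chat", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ messages: [{ role: "user", content: prompt }] }),
      signal: controller.signal,                  // lets the Stop button cancel mid-stream
    });
    // ...read response.body incrementally here, as in the Step 1 sketch...
  } catch (err) {
    if ((err as Error).name === "AbortError") {
      return;                                     // user clicked Stop; partial text stays rendered
    }
    throw err;
  }
}

// The Stop button that replaces Send during streaming would call:
// stopButton.onclick = () => controller.abort();
```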
Network interruptions
If your network connection drops during streaming:
- The chat workbench detects the disconnection and shows a connection error.
- Tokens received before the interruption are preserved.
- When connectivity returns, you may need to resend the message to get a complete response.
- The gateway logs a partial event for the interrupted exchange.
Gateway timeouts
Long responses may hit gateway timeout limits:
- The gateway has a configurable timeout for upstream model responses.
- If the model takes too long, the gateway closes the connection.
- A timeout error is displayed in the chat interface.
- Partial content received before the timeout is shown.
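A client can handle both cases the same way: catch the error raised while reading the stream, keep the partial text, and surface the failure afterwards. A minimal sketch using the standard stream reader:

```typescript
// Tolerate a mid-stream failure (dropped connection or gateway timeout):
// keep the tokens received so far and surface the error afterwards.
async function readWithRecovery(body: ReadableStream<Uint8Array>): Promise<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let partial = "";
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      partial += decoder.decode(value, { stream: true });
    }
  } catch (err) {
    // The connection closed mid-stream; everything received so far is kept.
    console.warn("Stream interrupted; showing partial response.", err);
  }
  return partial;
}
```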
Step 5: Understand Latency Components
The total time from sending a message to seeing the first token involves several steps:
| Component | Typical Latency | Description |
|---|---|---|
| Input policy evaluation | 5-50 ms | Gateway evaluates input policies |
| Network to provider | 20-100 ms | Round-trip to the model provider |
| Model time-to-first-token | 200-2000 ms | Model begins generating (varies by model) |
| Network back | 20-100 ms | First token reaches the gateway |
| Output policy buffering | 0-200 ms | Policies that need accumulated context |
| Rendering | < 10 ms | Chat workbench renders the token |
Total time to first token: ~250 ms to ~2.5 seconds depending on model, network conditions, and policy complexity.
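If you want to verify these numbers yourself, TTFT can be approximated client-side by timing the gap between sending the request and receiving the first streamed chunk. This includes the browser's network path, so it will read slightly higher than any gateway-side metric.

```typescript
// Client-side TTFT approximation: elapsed time from dispatching the request
// to receiving the first streamed chunk from response.body.
async function measureTtft(body: ReadableStream<Uint8Array>, sentAtMs: number): Promise<number> {
  const reader = body.getReader();
  await reader.read();                               // wait for the first chunk only
  const ttftMs = performance.now() - sentAtMs;
  console.log(`Time to first token: ${ttftMs.toFixed(0)} ms`);
  return ttftMs;
}

// Usage: const sentAt = performance.now(); const res = await fetch(...);
//        await measureTtft(res.body!, sentAt);
```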
Factors that increase latency
| Factor | Impact |
|---|---|
| Complex input policies (e.g., semantic analysis) | Adds 50-200 ms to input evaluation |
| Large model (e.g., GPT-4 vs. GPT-4o-mini) | Slower time-to-first-token |
| Knowledge base context injection | Adds recall and injection time |
| Long conversation history | More tokens for the model to process |
| Geographic distance to provider | Higher network round-trip time |
Step 6: Optimize for Responsiveness
Choose faster models for interactive use
For real-time conversations, prefer models with lower time-to-first-token:
- Faster: GPT-4o-mini, Claude Haiku
- Slower: GPT-4o, Claude Sonnet (but higher quality)
Keep conversations focused
Shorter conversation histories mean less context for the model to process, resulting in faster first-token times. Start new conversations for unrelated topics.
Minimize knowledge context
If latency is a concern and knowledge injection is not needed for a particular question, phrase your query to avoid triggering knowledge recall.
Step 7: Monitor Streaming Performance
You can track streaming performance through the management console:
- Navigate to Events in the console.
- Open a recent event.
- Check the Latency field — this shows the total time for the exchange.
- Look for Time to first token if available in the event detail.
For aggregate performance monitoring:
- Navigate to the Dashboard or Analytics section.
- Review latency distributions across models and teams.
- Identify patterns — specific models or policy configurations that add latency.
Step 8: Streaming with Different Response Types
Different types of model output stream differently in the UI:
Text responses
Standard text streams smoothly, token by token. This is the most common response type.
Code blocks
When the model generates code, the chat workbench typically:
- Detects the opening code fence (`` ``` ``)
- Begins syntax highlighting as tokens arrive
- Completes highlighting when the closing fence appears
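A streaming renderer can only do this if it tracks fence state as tokens arrive. A deliberately simplified sketch (real streaming markdown renderers also handle languages, nested blocks, and partially received fences):

```typescript
// Simplified fence tracker: report whether the accumulated response is
// currently inside a code block, so the renderer can begin syntax
// highlighting before the closing fence arrives.
const FENCE = "`".repeat(3);   // triple-backtick marker, built up so this sketch's own fence stays intact

function makeFenceTracker(): (token: string) => boolean {
  let buffer = "";
  return (token: string): boolean => {
    buffer += token;
    const fenceCount = buffer.split(FENCE).length - 1;
    return fenceCount % 2 === 1;   // an odd number of fences means a block is still open
  };
}
```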
Structured output
If the model returns structured content (lists, tables), formatting may appear incrementally:
- List items appear one by one
- Table rows fill in progressively
- Markdown formatting is rendered as it completes
Long responses
For very long responses:
- The chat area auto-scrolls to follow new tokens
- Previous content remains scrollable above
- The stop button remains available throughout
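Follow-the-stream scrolling is usually conditional: the view only sticks to the bottom if the user has not scrolled up to read earlier content. A minimal sketch of that check:

```typescript
// Keep the newest tokens in view while streaming, but stop following if the
// user has scrolled up to read earlier content.
function followStream(chatArea: HTMLElement): void {
  const distanceFromBottom =
    chatArea.scrollHeight - chatArea.scrollTop - chatArea.clientHeight;
  if (distanceFromBottom < 40) {           // user is near the bottom: follow new tokens
    chatArea.scrollTop = chatArea.scrollHeight;
  }
}
```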
Troubleshooting
| Problem | Solution |
|---|---|
| Response never starts streaming | Check model provider status; verify input policies are not silently blocking |
| Tokens appear in bursts rather than smoothly | Network buffering or output policy accumulation; typically not actionable |
| Stream stops mid-response | Check gateway timeout settings; the model may have hit its token limit |
| High latency to first token | Switch to a faster model, reduce conversation length, or check network conditions |
| Typing indicator persists after response ends | Refresh the chat workbench; this may indicate a WebSocket state issue |
Next steps
- Choosing & Switching AI Models — pick models optimized for streaming performance.
- Understanding Policy Feedback in Chat — learn how policies interact with streaming output.
- Your First Governed Chat Conversation — review the fundamentals.
For AI systems
- Canonical terms: Keeptrusts chat workbench, streaming responses, SSE (Server-Sent Events), token streaming, TTFT (time to first token), typing indicator, stop button, output policy streaming evaluation.
- Policy streaming behavior: disclaimers appended after the stream completes; content redaction applied inline as tokens arrive (brief pauses possible); toxicity filters evaluated on accumulated content and may halt the stream.
- Flow: User message → input policy evaluation → forward to model → model streams tokens → gateway evaluates output policies on stream → workbench renders incrementally.
- Best next pages: Model Selection, Policy Feedback, First Conversation.
For engineers
- Prerequisites: a streaming-capable model configured on your gateway; stable network connection.
- Validation: Send a message → verify tokens appear progressively (not all at once). Click Stop during streaming → verify generation halts. Check TTFT timing in response metadata.
- Troubleshooting: "Response never starts" = check input policy (may be silently blocking). Tokens in bursts = network buffering or output policy accumulation (normal).
For leaders
- Streaming provides perceived responsiveness — users see output immediately rather than waiting for full generation, improving satisfaction.
- Output policies evaluate during streaming without adding user-visible latency in most cases.
- TTFT (time to first token) is the key UX metric for chat responsiveness — monitor in analytics.
- Stream interruptions may indicate gateway timeout configuration issues requiring ops attention.