Streaming Responses & Real-Time Chat

This tutorial explains how the Keeptrusts chat workbench handles streaming responses: how tokens are displayed as they arrive, how policies evaluate streaming output, how interruptions are handled, and what latency to expect.

Use this page when

  • You want to understand how token-by-token streaming works in the chat workbench.
  • You need to know how output policies evaluate streaming content (accumulation vs. post-stream).
  • You are diagnosing latency issues, stream interruptions, or typing indicator problems.

Primary audience

  • Primary: Technical Engineers (understanding streaming UX and policy interaction)
  • Secondary: AI Agents (streaming API consumers), Technical Leaders (latency expectations)

Prerequisites

  • Authenticated access to the Keeptrusts chat workbench
  • A gateway with a streaming-capable model configured
  • Familiarity with the first conversation tutorial

Step 1: Understand Streaming in Keeptrusts

When you send a message in the chat workbench, the model generates its response token by token. Instead of waiting for the complete response, Keeptrusts streams tokens to the chat interface as they are produced:

User sends message
→ Gateway evaluates input policies
→ Request forwarded to model provider
→ Model streams tokens back
→ Gateway evaluates output policies on the stream
→ Chat workbench renders tokens incrementally

This provides a responsive experience — you see the response forming in real time rather than waiting for the entire generation to complete.
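
To make the flow concrete, here is a minimal sketch of a client consuming such a stream. The endpoint URL, headers, and request body shape are assumptions for illustration, not the documented gateway API; consult your gateway configuration for the actual contract.

```typescript
// Hypothetical sketch: consuming a streamed chat completion from a gateway.
// The URL, headers, and payload shape are assumptions, not the documented API.
async function streamChat(prompt: string, onToken: (t: string) => void): Promise<void> {
  const response = await fetch("https://gateway.example.com/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json", Accept: "text/event-stream" },
    body: JSON.stringify({ messages: [{ role: "user", content: prompt }], stream: true }),
  });

  if (!response.ok || !response.body) {
    throw new Error(`Request failed: ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();

  // Read chunks as the gateway forwards them from the model provider.
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // A real SSE stream may pack several "data: ..." lines into one chunk;
    // here we simply hand the decoded text to the caller.
    onToken(decoder.decode(value, { stream: true }));
  }
}
```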

Step 2: Observe Token-by-Token Display

When a response streams in, the chat workbench displays it progressively:

  1. Send a message in the chat workbench.
  2. Watch the response area — text appears word-by-word or phrase-by-phrase.
  3. A typing indicator or cursor animation shows that the response is still generating.
  4. When streaming completes, the indicator disappears and the full response is finalized.

What you see during streaming

| UI Element | During Streaming | After Completion |
| --- | --- | --- |
| Response text | Growing progressively | Complete and static |
| Typing indicator | Visible (pulsing cursor or dots) | Hidden |
| Send button | Disabled (or shows a stop option) | Enabled |
| Token count | Incrementing | Final count |
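
As a rough illustration of the table above, a client could render tokens progressively and toggle the typing indicator along these lines. The element IDs are hypothetical, and `stream` stands in for any token source such as the streamChat sketch earlier on this page.

```typescript
// Hypothetical sketch: progressive rendering with a typing indicator.
// Element IDs are illustrative; `stream` is any token source, e.g. streamChat() above.
type TokenStreamer = (prompt: string, onToken: (t: string) => void) => Promise<void>;

async function renderStreamingResponse(prompt: string, stream: TokenStreamer): Promise<void> {
  const responseEl = document.getElementById("response")!;
  const indicatorEl = document.getElementById("typing-indicator")!;
  const sendButton = document.getElementById("send") as HTMLButtonElement;

  sendButton.disabled = true;            // Send is disabled while streaming
  indicatorEl.hidden = false;            // typing indicator becomes visible
  responseEl.textContent = "";

  try {
    await stream(prompt, (token) => {
      responseEl.textContent += token;   // response text grows progressively
    });
  } finally {
    indicatorEl.hidden = true;           // indicator disappears on completion
    sendButton.disabled = false;         // Send is re-enabled
  }
}
```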

Step 3: Understand Policy Evaluation on Streams

The gateway evaluates policies on streaming output differently than on complete responses:

Input phase (pre-stream)

Input policies are evaluated on your full prompt before the request is forwarded to the model, and therefore before any streaming begins. If an input policy blocks the request, you see the block response immediately and no streaming occurs.
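
In client terms, an input-policy block arrives as an error response before any stream is opened. A hedged sketch, assuming the gateway signals blocks with a non-2xx status and a JSON error body (both assumptions):

```typescript
// Hypothetical sketch: detecting an input-policy block before any streaming begins.
// Status code handling and error payload shape are assumptions about the gateway.
async function sendWithBlockHandling(
  url: string,
  body: unknown,
): Promise<ReadableStream<Uint8Array> | null> {
  const response = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });

  if (!response.ok) {
    // The gateway rejected the prompt during input policy evaluation:
    // no stream was opened, so report the block and stop here.
    const error = await response.json().catch(() => ({}));
    console.warn("Blocked by input policy:", error);
    return null;
  }

  return response.body; // a stream only exists when input policies passed
}
```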

Output phase (during stream)

Output policy evaluation on streaming content depends on the policy type:

| Policy Type | Streaming Behavior |
| --- | --- |
| Disclaimers | Appended after the stream completes |
| Content redaction | Applied inline as tokens arrive (may cause brief pauses) |
| Toxicity filters | Evaluated on accumulated content; may halt the stream |
| Token limits | Stream is cut off when the limit is reached |

Some policies require the full response to evaluate (e.g., semantic analysis). These policies buffer a portion of the stream internally, which may introduce slight latency before tokens appear in the UI.
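
As an illustrative mental model (not the gateway's actual implementation), the behaviors in the table above could compose around a token stream as follows; isToxic, redact, and the disclaimer text are placeholders:

```typescript
// Hypothetical sketch of output policy handling on a token stream.
// isToxic(), redact(), and the disclaimer are placeholders, not real gateway APIs.
async function* applyOutputPolicies(
  tokens: AsyncIterable<string>,
  opts: { disclaimer?: string; maxTokens: number },
): AsyncGenerator<string> {
  let accumulated = "";
  let count = 0;

  for await (const token of tokens) {
    if (++count > opts.maxTokens) return;   // token limit: cut the stream off

    accumulated += token;
    if (isToxic(accumulated)) return;       // toxicity: evaluated on accumulated content

    yield redact(token);                    // redaction: applied inline per token
  }

  if (opts.disclaimer) yield "\n\n" + opts.disclaimer; // disclaimer: appended post-stream
}

// Placeholder policy checks for illustration only.
function isToxic(_text: string): boolean { return false; }
function redact(token: string): string { return token; }
```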

Step 4: Handle Stream Interruptions

Stopping a response

If you want to stop a response while it is streaming:

  1. Look for the Stop button (typically a square icon) that replaces the send button during streaming.
  2. Click Stop.
  3. The stream is terminated. The response up to that point is preserved in the conversation.

The partial response remains in the conversation history. You can send a new message to continue.
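
Programmatic clients can achieve the same effect with the standard AbortController API. The following wiring is illustrative, not the workbench's source:

```typescript
// Hypothetical sketch: wiring a Stop button to cancel an in-flight stream.
// Uses the standard AbortController API; the endpoint and element ID are illustrative.
const controller = new AbortController();

document.getElementById("stop")?.addEventListener("click", () => {
  controller.abort(); // terminates the request; tokens already rendered are preserved
});

try {
  const response = await fetch("/chat", { method: "POST", signal: controller.signal });
  // ... read and render the stream as in the earlier sketch ...
} catch (err) {
  // AbortError means the user clicked Stop; anything else is a real failure.
  if ((err as Error).name !== "AbortError") throw err;
}
```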

Network interruptions

If your network connection drops during streaming:

  • The chat workbench detects the disconnection and shows a connection error.
  • Tokens received before the interruption are preserved.
  • When connectivity returns, you may need to resend the message to get a complete response.
  • The gateway logs a partial event for the interrupted exchange.

Gateway timeouts

Long responses may hit gateway timeout limits:

  • The gateway has a configurable timeout for upstream model responses.
  • If the model takes too long, the gateway closes the connection.
  • A timeout error is displayed in the chat interface.
  • Partial content received before the timeout is shown.
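
A client can also enforce its own deadline while keeping whatever tokens already arrived. This is an assumption-laden sketch; the 60-second value is arbitrary and unrelated to the gateway's configurable timeout.

```typescript
// Hypothetical sketch: client-side timeout that preserves partial content.
// The 60 s deadline is an example value, not a documented gateway default.
async function readWithTimeout(body: ReadableStream<Uint8Array>, ms = 60_000): Promise<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let text = "";

  const timer = setTimeout(() => reader.cancel("client timeout"), ms);
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;                        // normal completion or cancelled stream
      text += decoder.decode(value, { stream: true });
    }
  } finally {
    clearTimeout(timer);
  }
  return text;                                // partial content is still returned on timeout
}
```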

Step 5: Understand Latency Components

The total time from sending a message to seeing the first token involves several steps:

| Component | Typical Latency | Description |
| --- | --- | --- |
| Input policy evaluation | 5-50 ms | Gateway evaluates input policies |
| Network to provider | 20-100 ms | Round-trip to the model provider |
| Model time-to-first-token | 200-2000 ms | Model begins generating (varies by model) |
| Network back | 20-100 ms | First token reaches the gateway |
| Output policy buffering | 0-200 ms | Policies that need accumulated context |
| Rendering | < 10 ms | Chat workbench renders the token |

Total time to first token: ~250 ms to ~2.5 seconds depending on model, network conditions, and policy complexity.
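
To measure time to first token from the client side, timestamp the request and the first received chunk. A minimal sketch, assuming a fetch-based streaming endpoint like the earlier examples:

```typescript
// Hypothetical sketch: measuring time to first token (TTFT) on the client.
async function measureTtft(url: string, body: unknown): Promise<number> {
  const start = performance.now();

  const response = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });

  const reader = response.body!.getReader();
  const { value } = await reader.read();       // wait for the first chunk only
  const ttft = performance.now() - start;

  reader.cancel();                              // stop reading; we only wanted TTFT
  console.log(`TTFT: ${ttft.toFixed(0)} ms, first chunk bytes: ${value?.length ?? 0}`);
  return ttft;
}
```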

Factors that increase latency

| Factor | Impact |
| --- | --- |
| Complex input policies (e.g., semantic analysis) | Adds 50-200 ms to input evaluation |
| Large model (e.g., GPT-4 vs. GPT-4o-mini) | Slower time-to-first-token |
| Knowledge base context injection | Adds recall and injection time |
| Long conversation history | More tokens for the model to process |
| Geographic distance to provider | Higher network round-trip time |

Step 6: Optimize for Responsiveness

Choose faster models for interactive use

For real-time conversations, prefer models with lower time-to-first-token:

  • Faster: GPT-4o-mini, Claude Haiku
  • Slower: GPT-4o, Claude Sonnet (but higher quality)

Keep conversations focused

Shorter conversation histories mean less context for the model to process, resulting in faster first-token times. Start new conversations for unrelated topics.

Minimize knowledge context

If latency is a concern and knowledge injection is not needed for a particular question, phrase your query to avoid triggering knowledge recall.

Step 7: Monitor Streaming Performance

You can track streaming performance through the management console:

  1. Navigate to Events in the console.
  2. Open a recent event.
  3. Check the Latency field — this shows the total time for the exchange.
  4. Look for Time to first token if available in the event detail.

For aggregate performance monitoring:

  1. Navigate to the Dashboard or Analytics section.
  2. Review latency distributions across models and teams.
  3. Identify patterns — specific models or policy configurations that add latency.

Step 8: Streaming with Different Response Types

Different types of model output stream differently in the UI:

Text responses

Standard text streams smoothly, token by token. This is the most common response type.

Code blocks

When the model generates code, the chat workbench typically does the following (a rough fence-detection sketch follows this list):

  • Detects the opening code fence (```)
  • Begins syntax highlighting as tokens arrive
  • Completes highlighting when the closing fence appears
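
The fence detection above can be approximated by counting fences in the accumulated text: an odd count means a code block is currently open. A small illustrative sketch:

```typescript
// Hypothetical sketch: tracking Markdown code fences in a streamed buffer so the
// renderer knows when a code block is open (odd fence count) or closed (even count).
// "\x60" is a backtick; the pattern matches lines that begin with three backticks.
const FENCE = new RegExp("^\\x60{3}", "gm");

let buffer = "";

function onToken(token: string): { insideCodeBlock: boolean } {
  buffer += token;
  const fences = buffer.match(FENCE) ?? [];
  // While a block is open, the renderer can apply syntax highlighting as tokens arrive.
  return { insideCodeBlock: fences.length % 2 === 1 };
}
```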

Structured output

If the model returns structured content (lists, tables), formatting may appear incrementally:

  • List items appear one by one
  • Table rows fill in progressively
  • Markdown formatting is rendered as it completes

Long responses

For very long responses:

  • The chat area auto-scrolls to follow new tokens
  • Previous content remains scrollable above
  • The stop button remains available throughout

Troubleshooting

| Problem | Solution |
| --- | --- |
| Response never starts streaming | Check model provider status; verify input policies are not silently blocking |
| Tokens appear in bursts rather than smoothly | Network buffering or output policy accumulation; typically not actionable |
| Stream stops mid-response | Check gateway timeout settings; the model may have hit its token limit |
| High latency to first token | Switch to a faster model, reduce conversation length, or check network conditions |
| Typing indicator persists after response ends | Refresh the chat workbench; this may indicate a WebSocket state issue |

Next steps

For AI systems

  • Canonical terms: Keeptrusts chat workbench, streaming responses, SSE (Server-Sent Events), token streaming, TTFT (time to first token), typing indicator, stop button, output policy streaming evaluation.
  • Policy streaming behavior: disclaimers appended after stream completes; redaction accumulates tokens before evaluating; toxicity filters may pause the stream for buffered evaluation.
  • Flow: User message → input policy evaluation → forward to model → model streams tokens → gateway evaluates output policies on stream → workbench renders incrementally.
  • Best next pages: Model Selection, Policy Feedback, First Conversation.

For engineers

  • Prerequisites: a streaming-capable model configured on your gateway; stable network connection.
  • Validation: Send a message → verify tokens appear progressively (not all at once). Click Stop during streaming → verify generation halts. Check TTFT timing in response metadata.
  • Troubleshooting: "Response never starts" = check input policy (may be silently blocking). Tokens in bursts = network buffering or output policy accumulation (normal).

For leaders

  • Streaming provides perceived responsiveness — users see output immediately rather than waiting for full generation, improving satisfaction.
  • Output policies evaluate during streaming without adding user-visible latency in most cases.
  • TTFT (time to first token) is the key UX metric for chat responsiveness — monitor in analytics.
  • Stream interruptions may indicate gateway timeout configuration issues requiring ops attention.