Streaming & SSE
Keeptrusts supports real-time streaming of LLM responses via Server-Sent Events (SSE). Policies are applied to both the initial request and to streamed response chunks, enabling real-time content filtering without buffering entire responses.
Use this page when
- You need to understand how the Keeptrusts gateway enforces policies on streamed LLM responses.
- You are configuring streaming_mode (realtime vs. buffered) for specific policies.
- You want to confirm which providers support streaming and how protocol translation works.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Configuration
Streaming is enabled by default when the client sends "stream": true in the request body; no gateway-level configuration is needed. The pack below routes requests to an OpenAI target and audit-logs every exchange:
pack:
  name: streaming-sse-providers-1
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true
How It Works
Client Keeptrusts Gateway Provider
│ │ │
│── POST (stream: true) ──► │ │
│ │── Policy check (input) ─► │
│ │── Forward request ──────► │
│ │ │
│ │◄── SSE chunk 1 ───────── │
│◄── SSE chunk 1 (checked) │ (output policy check) │
│ │◄── SSE chunk 2 ───────── │
│◄── SSE chunk 2 (checked) │ │
│ │◄── [DONE] ───────────── │
│◄── [DONE] ────────────── │ │
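On the wire, each checked chunk is a standard SSE event. For illustration, OpenAI-style chunk events look like this (IDs and most fields abbreviated):
data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hel"}}]}

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"lo"}}]}

data: [DONE]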
Input Policies
Applied to the full request before forwarding:
- prompt-injection — Scans complete prompt
- pii-detector — Redacts PII in input
- rbac — Checks user permissions
Output Policies
Applied to streamed chunks:
- safety-filter — Blocks unsafe content chunks
- pii-detector — Redacts PII in output as it streams
- audit-logger — Logs each chunk for complete audit trail
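For example, when pii-detector runs in realtime mode, a chunk containing PII is rewritten before it reaches the client. The redaction marker shown here is illustrative, not the exact Keeptrusts output format:
data: {"choices":[{"index":0,"delta":{"content":"Reach me at [REDACTED]"}}]}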
Streaming with All Providers
Keeptrusts handles streaming across different provider protocols:
| Provider | Streaming Protocol | Notes |
|---|---|---|
| OpenAI | SSE (text/event-stream) | Standard SSE |
| Anthropic | SSE | Anthropic streaming events |
| Google Gemini | SSE (streamGenerateContent) | Auto-translated from OpenAI format |
| Groq | SSE | OpenAI-compatible |
| Mistral | SSE | OpenAI-compatible |
| Azure OpenAI | SSE | OpenAI-compatible |
| Ollama | Newline-delimited JSON | Translated to SSE |
| vLLM | SSE | OpenAI-compatible |
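As an example of protocol translation, Ollama streams newline-delimited JSON objects, which the gateway re-emits as SSE events. The model name and field mapping below are illustrative:
# Ollama emits one JSON object per line:
{"model":"llama3","message":{"role":"assistant","content":"Hel"},"done":false}
{"model":"llama3","message":{"role":"assistant","content":"lo"},"done":true}

# The gateway re-emits each object as an OpenAI-style SSE event:
data: {"choices":[{"index":0,"delta":{"content":"Hel"}}]}

data: {"choices":[{"index":0,"delta":{"content":"lo"}}]}

data: [DONE]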
Buffered vs. Real-Time Policy Checks
Some policies require the full response for accurate detection. Configure buffering behavior per policy with policy.<kind>.streaming_mode:
pack:
  name: streaming-sse-example-2
  version: 1.0.0
  enabled: true
policies:
  chain:
    - pii-detector
    - quality-scorer
policy:
  pii-detector:
    action: redact
    streaming_mode: realtime
  quality-scorer:
    streaming_mode: buffered
    thresholds:
      min_aggregate: 0.7
| Mode | Behavior | Use For |
|---|---|---|
| realtime | Check each chunk as it arrives | PII, safety filter |
| buffered | Buffer full response, then check | Quality scoring, citation verification |
Local Streaming SSE
The Keeptrusts gateway supports local streaming for both /v1/chat/completions and /v1/responses endpoints:
# Stream a chat completion through the gateway
curl -N http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "stream": true,
    "messages": [{"role": "user", "content": "Hello"}]
  }'
The gateway streams SSE events in real-time while applying all configured policies on each chunk.
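The /v1/responses endpoint streams the same way. This sketch assumes the gateway mirrors OpenAI's Responses API request shape ("input" instead of "messages"):
# Stream through the Responses API endpoint (request shape assumed OpenAI-compatible)
curl -N http://localhost:41002/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "stream": true,
    "input": "Hello"
  }'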
For AI systems
- Canonical terms: streaming, SSE, Server-Sent Events, real-time policy, buffered policy, streaming_mode.
- Config key: policy.<kind>.streaming_mode with values realtime or buffered.
- Streaming is activated by "stream": true in the client request body — no gateway config change needed.
- Supported providers: OpenAI, Anthropic, Google Gemini, Groq, Mistral, Azure OpenAI, Ollama, vLLM.
- Input policies run on the full request before forwarding; output policies run per-chunk.
- Endpoints: /v1/chat/completions, /v1/responses.
- Related pages: kt gateway run, Multi-Provider Fallback, WebSocket Gateway.
For engineers
- Prerequisites: A running gateway with at least one streaming-capable provider target.
- Test: curl -N http://localhost:41002/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"gpt-4o","stream":true,"messages":[{"role":"user","content":"Hello"}]}' — you should see SSE chunks.
- Policy modes: Use streaming_mode: realtime for PII and safety (low-latency per-chunk). Use streaming_mode: buffered for quality scoring (needs full response).
- Protocol translation: Ollama (newline-delimited JSON) and Gemini (streamGenerateContent) are auto-translated to SSE. No client changes needed.
- Troubleshooting: If chunks arrive but aren't filtered, check that the output policy has streaming_mode: realtime. If latency spikes, check whether a buffered policy is blocking the stream.
For leaders
- Streaming support means users get real-time responses while safety policies still run on every chunk — no UX trade-off required.
- Buffered policies (like quality scoring) add latency proportional to response length; decide per-policy whether real-time enforcement or full-response accuracy is more important.
- All major LLM providers are supported for streaming, avoiding vendor lock-in.
- Audit logging captures the complete streamed response for compliance, even though users see it chunk-by-chunk.
Next steps
- kt gateway run — Start the gateway
- WebSocket Gateway — Bidirectional real-time connections
- Multi-Provider Fallback — Provider routing with streaming
- CLI overview