Streaming & SSE
Keeptrusts supports real-time streaming of LLM responses via Server-Sent Events (SSE). Policies are applied to both the initial request and to streamed response chunks, enabling real-time content filtering without buffering entire responses.
Use this page when
- You need to understand how the Keeptrusts gateway enforces policies on streamed LLM responses.
- You are configuring streaming_mode (realtime vs. buffered) for specific policies.
- You want to confirm which providers support streaming and how protocol translation works.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Configuration
Streaming is enabled by default when the client sends "stream": true in the request body; no gateway-level configuration is needed. The pack below routes requests to an OpenAI target and audit-logs every exchange:
pack:
  name: streaming-sse-providers-1
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true
How It Works
Client Keeptrusts Gateway Provider
│ │ │
│── POST (stream: true) ──► │ │
│ │── Policy check (input) ─► │
│ │── Forward request ──────► │
│ │ │
│ │◄── SSE chunk 1 ───────── │
│◄── SSE chunk 1 (checked) │ (output policy check) │
│ │◄── SSE chunk 2 ───────── │
│◄── SSE chunk 2 (checked) │ │
│ │◄── [DONE] ───────────── │
│◄── [DONE] ────────────── │ │
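On the wire, each checked chunk is a standard SSE event. For illustration, OpenAI-style chunk events look like this (IDs and most fields abbreviated):
data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hel"}}]}

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"lo"}}]}

data: [DONE]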
Input Policies
Applied to the full request before forwarding:
- prompt-injection — Scans complete prompt
- pii-detector — Redacts PII in input
- rbac — Checks user permissions
Output Policies
Applied to streamed chunks:
- safety-filter — Blocks unsafe content chunks
- pii-detector — Redacts PII in output as it streams
- audit-logger — Logs each chunk for complete audit trail
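For example, when pii-detector runs in realtime mode, a chunk containing PII is rewritten before it reaches the client. The redaction marker shown here is illustrative, not the exact Keeptrusts output format:
data: {"choices":[{"index":0,"delta":{"content":"Reach me at [REDACTED]"}}]}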
Streaming with All Providers
Keeptrusts handles streaming across different provider protocols:
| Provider | Streaming Protocol | Notes |
|---|---|---|
| OpenAI | SSE (text/event-stream) | Standard SSE |
| Anthropic | SSE | Anthropic streaming events |
| Google Gemini | SSE (streamGenerateContent) | Auto-translated from OpenAI format |
| Groq | SSE | OpenAI-compatible |
| Mistral | SSE | OpenAI-compatible |
| Azure OpenAI | SSE | OpenAI-compatible |
| Ollama | Newline-delimited JSON | Translated to SSE |
| vLLM | SSE | OpenAI-compatible |
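As an example of protocol translation, Ollama streams newline-delimited JSON objects, which the gateway re-emits as SSE events. The model name and field mapping below are illustrative:
# Ollama emits one JSON object per line:
{"model":"llama3","message":{"role":"assistant","content":"Hel"},"done":false}
{"model":"llama3","message":{"role":"assistant","content":"lo"},"done":true}

# The gateway re-emits each object as an OpenAI-style SSE event:
data: {"choices":[{"index":0,"delta":{"content":"Hel"}}]}

data: {"choices":[{"index":0,"delta":{"content":"lo"}}]}

data: [DONE]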
Buffered vs. Real-Time Policy Checks
Some policies require the full response for accurate detection. Configure buffering behavior per policy with policy.<kind>.streaming_mode:
pack:
  name: streaming-sse-example-2
  version: 1.0.0
  enabled: true
policies:
  chain:
    - pii-detector
    - quality-scorer
policy:
  pii-detector:
    action: redact
    streaming_mode: realtime
  quality-scorer:
    streaming_mode: buffered
    thresholds:
      min_aggregate: 0.7
| Mode | Behavior | Use For |
|---|---|---|
| realtime | Check each chunk as it arrives | PII, safety filter |
| buffered | Buffer full response, then check | Quality scoring, citation verification |
Local Streaming SSE
The Keeptrusts gateway supports local streaming for both /v1/chat/completions and /v1/responses endpoints:
# Stream a chat completion through the gateway
curl -N http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "stream": true,
    "messages": [{"role": "user", "content": "Hello"}]
  }'
The gateway streams SSE events in real-time while applying all configured policies on each chunk.
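The /v1/responses endpoint streams the same way. This sketch assumes the gateway mirrors OpenAI's Responses API request shape ("input" instead of "messages"):
# Stream through the Responses API endpoint (request shape assumed OpenAI-compatible)
curl -N http://localhost:41002/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "stream": true,
    "input": "Hello"
  }'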
For AI systems
- Canonical terms: streaming, SSE, Server-Sent Events, real-time policy, buffered policy, streaming_mode.
- Config key: policy.<kind>.streaming_mode with values realtime or buffered.
- Streaming is activated by "stream": true in the client request body — no gateway config change needed.
- Supported providers: OpenAI, Anthropic, Google Gemini, Groq, Mistral, Azure OpenAI, Ollama, vLLM.
- Input policies run on the full request before forwarding; output policies run per-chunk.
- Endpoints: /v1/chat/completions, /v1/responses.
- Related pages: kt gateway run, Multi-Provider Fallback, WebSocket Gateway.
For engineers
- Prerequisites: A running gateway with at least one streaming-capable provider target.
- Test: curl -N http://localhost:41002/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"gpt-4o","stream":true,"messages":[{"role":"user","content":"Hello"}]}' — you should see SSE chunks.
- Policy modes: Use streaming_mode: realtime for PII and safety (low-latency per-chunk). Use streaming_mode: buffered for quality scoring (needs full response).
- Protocol translation: Ollama (newline-delimited JSON) and Gemini (streamGenerateContent) are auto-translated to SSE. No client changes needed.
- Troubleshooting: If chunks arrive but aren't filtered, check that the output policy has streaming_mode: realtime. If latency spikes, check whether a buffered policy is blocking the stream.
For leaders
- Streaming support means users get real-time responses while safety policies still run on every chunk — no UX trade-off required.
- Buffered policies (like quality scoring) add latency proportional to response length; decide per-policy whether real-time enforcement or full-response accuracy is more important.
- All major LLM providers are supported for streaming, avoiding vendor lock-in.
- Audit logging captures the complete streamed response for compliance, even though users see it chunk-by-chunk.
Next steps
- kt gateway run — Start the gateway
- WebSocket Gateway — Bidirectional real-time connections
- Multi-Provider Fallback — Provider routing with streaming
- CLI overview