Streaming & SSE

Keeptrusts supports real-time streaming of LLM responses via Server-Sent Events (SSE). Policies are applied to both the initial request and to streamed response chunks, enabling real-time content filtering without buffering entire responses.

Use this page when

  • You need to understand how the Keeptrusts gateway enforces policies on streamed LLM responses.
  • You are configuring streaming_mode (realtime vs. buffered) for specific policies.
  • You want to confirm which providers support streaming and how protocol translation works.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Configuration

Streaming is enabled by default when the client sends "stream": true in the request body. No gateway-level configuration is needed.

A standard pack works unchanged; for example, a single OpenAI target with an audit policy:

pack:
  name: streaming-sse-providers-1
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true

How It Works

Client                 Keeptrusts Gateway               Provider
   │                            │                            │
   │── POST (stream: true) ────►│                            │
   │                            │── Policy check (input)     │
   │                            │── Forward request ────────►│
   │                            │                            │
   │                            │◄── SSE chunk 1 ────────────│
   │◄── SSE chunk 1 (checked) ──│  (output policy check)     │
   │                            │◄── SSE chunk 2 ────────────│
   │◄── SSE chunk 2 (checked) ──│                            │
   │                            │◄── [DONE] ─────────────────│
   │◄── [DONE] ─────────────────│                            │

Input Policies

Applied to the full request before forwarding:

  • prompt-injection — Scans complete prompt
  • pii-detector — Redacts PII in input
  • rbac — Checks user permissions

Output Policies

Applied to each streamed chunk as it arrives (a conceptual sketch follows this list):

  • safety-filter — Blocks unsafe content chunks
  • pii-detector — Redacts PII in output as it streams
  • audit-logger — Logs each chunk for complete audit trail
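
To make the per-chunk model concrete, the following is a minimal sketch of the idea rather than the Keeptrusts implementation: each realtime output policy maps a chunk to a transformed chunk (or drops it), and the result is forwarded immediately. Policy and helper names here are hypothetical.

# Conceptual sketch only: per-chunk (realtime) output policy application.
# A policy returns the (possibly redacted) chunk, or None to block it.
from typing import Callable, Iterable, Optional

Policy = Callable[[str], Optional[str]]

def redact_pii(chunk: str) -> Optional[str]:
    # Toy stand-in for a real PII detector.
    return chunk.replace("123-45-6789", "[REDACTED]")

def apply_output_policies(chunks: Iterable[str], policies: list[Policy]) -> Iterable[str]:
    for chunk in chunks:
        out: Optional[str] = chunk
        for policy in policies:
            if out is None:
                break  # an earlier policy blocked this chunk
            out = policy(out)
        if out is not None:
            yield out  # forward the checked chunk without waiting for the rest

for piece in apply_output_policies(["My SSN is ", "123-45-6789."], [redact_pii]):
    print(piece, end="")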

Streaming with All Providers

Keeptrusts handles streaming across different provider protocols:

Provider        Streaming Protocol             Notes
OpenAI          SSE (text/event-stream)        Standard SSE
Anthropic       SSE                            Anthropic streaming events
Google Gemini   SSE (streamGenerateContent)    Auto-translated from OpenAI format
Groq            SSE                            OpenAI-compatible
Mistral         SSE                            OpenAI-compatible
Azure OpenAI    SSE                            OpenAI-compatible
Ollama          Newline-delimited JSON         Translated to SSE
vLLM            SSE                            OpenAI-compatible
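
The translation rows in the table can be pictured with a short sketch. The code below is illustrative only: it assumes Ollama-style newline-delimited JSON (one object per line with a done flag) and emits OpenAI-style SSE data: events; it is not the gateway's actual translation layer.

# Illustrative only: turn newline-delimited JSON chunks into SSE "data:" events.
import json
from typing import Iterable, Iterator

def ndjson_to_sse(lines: Iterable[str]) -> Iterator[str]:
    for line in lines:
        if not line.strip():
            continue  # skip keep-alive blank lines
        event = json.loads(line)
        if event.get("done"):
            yield "data: [DONE]\n\n"  # OpenAI-style stream terminator
        else:
            yield f"data: {json.dumps(event)}\n\n"

sample = [
    '{"message": {"role": "assistant", "content": "Hel"}, "done": false}',
    '{"message": {"role": "assistant", "content": "lo"}, "done": false}',
    '{"done": true}',
]
for sse_event in ndjson_to_sse(sample):
    print(sse_event, end="")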

Buffered vs. Real-Time Policy Checks

Some policies require the full response for accurate detection. Configure buffering per policy with streaming_mode:

pack:
  name: streaming-sse-example-2
  version: 1.0.0
  enabled: true
policies:
  chain:
    - pii-detector
    - quality-scorer
policy:
  pii-detector:
    streaming_mode: realtime
    action: redact
  quality-scorer:
    streaming_mode: buffered
    thresholds:
      min_aggregate: 0.7

Mode       Behavior                            Use For
realtime   Check each chunk as it arrives      PII, safety filter
buffered   Buffer full response, then check    Quality scoring, citation verification
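
In contrast to the realtime sketch earlier, a buffered policy holds the entire stream before checking it, which is why latency grows with response length. A minimal sketch, with a toy length-based scorer standing in for the real quality-scorer:

# Sketch of buffered mode: accumulate all chunks, then run one check.
from typing import Iterable

def buffered_quality_check(chunks: Iterable[str], min_aggregate: float = 0.7) -> str:
    full = "".join(chunks)  # nothing reaches the client until the stream ends
    score = min(1.0, len(full) / 40)  # toy stand-in for a real quality score
    if score < min_aggregate:
        raise ValueError(f"quality {score:.2f} below threshold {min_aggregate}")
    return full

print(buffered_quality_check(["This answer is ", "long enough to pass."]))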

Local Streaming SSE

The Keeptrusts gateway supports local streaming for both /v1/chat/completions and /v1/responses endpoints:

# Stream a chat completion through the gateway
curl -N http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "stream": true,
    "messages": [{"role": "user", "content": "Hello"}]
  }'

The gateway streams SSE events in real-time while applying all configured policies on each chunk.
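
Any OpenAI-compatible client can consume this stream by pointing its base URL at the gateway. Below is a minimal sketch using the OpenAI Python SDK; the placeholder API key assumes provider credentials are resolved by the gateway (via secret_key_ref), not by the client:

# Stream through the gateway with the OpenAI Python SDK (openai>=1.0).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="placeholder")

stream = client.chat.completions.create(
    model="gpt-4o",
    stream=True,
    messages=[{"role": "user", "content": "Hello"}],
)
for chunk in stream:
    if not chunk.choices:
        continue  # some providers send usage-only chunks
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # chunks arrive already policy-checked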

For AI systems

  • Canonical terms: streaming, SSE, Server-Sent Events, real-time policy, buffered policy, streaming_mode.
  • Config key: policy.<kind>.streaming_mode with values realtime or buffered.
  • Streaming is activated by "stream": true in the client request body — no gateway config change needed.
  • Supported providers: OpenAI, Anthropic, Google Gemini, Groq, Mistral, Azure OpenAI, Ollama, vLLM.
  • Input policies run on the full request before forwarding; output policies run per-chunk.
  • Endpoints: /v1/chat/completions, /v1/responses.
  • Related pages: kt gateway run, Multi-Provider Fallback, WebSocket Gateway.

For engineers

  • Prerequisites: A running gateway with at least one streaming-capable provider target.
  • Test: curl -N http://localhost:41002/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"gpt-4o","stream":true,"messages":[{"role":"user","content":"Hello"}]}' — you should see SSE chunks.
  • Policy modes: Use streaming_mode: realtime for PII and safety (low-latency per-chunk). Use streaming_mode: buffered for quality scoring (needs full response).
  • Protocol translation: Ollama (newline-delimited JSON) and Gemini (streamGenerateContent) are auto-translated to SSE. No client changes needed.
  • Troubleshooting: If chunks arrive but aren't filtered, check that the output policy has streaming_mode: realtime. If latency spikes, check whether a buffered policy is blocking the stream.

For leaders

  • Streaming support means users get real-time responses while safety policies still run on every chunk — no UX trade-off required.
  • Buffered policies (like quality scoring) add latency proportional to response length; decide per-policy whether real-time enforcement or full-response accuracy is more important.
  • All major LLM providers are supported for streaming, avoiding vendor lock-in.
  • Audit logging captures the complete streamed response for compliance, even though users see it chunk-by-chunk.

Next steps