Cerebras
Cerebras delivers ultra-fast LLM inference powered by its wafer-scale chip architecture, achieving speeds several times faster than GPU-based providers. Keeptrusts integrates natively with the Cerebras Inference API using OpenAI-compatible transport, so you get low-latency responses and real-time policy enforcement without added overhead.
Use this page when
- You need the exact command, config, API, or integration details for Cerebras.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
- A Cerebras account and API key from cloud.cerebras.ai
- Keeptrusts CLI installed
Export the key in your shell before starting the gateway:

```bash
export CEREBRAS_API_KEY="csk-..."
```
Configuration
Define the Cerebras provider target and policy chain in `policy-config.yaml`:

```yaml
pack:
  name: cerebras-gateway
  version: 0.1.0
  enabled: true

policies:
  chain:
    - prompt-injection
    - pii-detector
    - audit-logger

providers:
  targets:
    - id: cerebras-llama
      provider: cerebras:chat:llama3.3-70b
      base_url: https://api.cerebras.ai/v1
      secret_key_ref:
        env: CEREBRAS_API_KEY
```
Start the gateway:
```bash
kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml
```
Provider Fields
| Field | Type | Default | Description |
|---|---|---|---|
| `provider` | string | – | `"cerebras"` or `"cerebras:chat:<model>"` |
| `base_url` | string | `https://api.cerebras.ai/v1` | Cerebras API base URL (auto-detected) |
| `secret_key_ref` | object | `CEREBRAS_API_KEY` | Object reference to the env var holding the Cerebras API key |
| `format` | string | `"openai"` | Wire format; Cerebras uses an OpenAI-compatible API |
Supported Models
| Model ID | Context | Approx. Speed | Use Case |
|---|---|---|---|
| `llama3.3-70b` | 128K | ~2,100 t/s | High quality, still very fast |
| `llama3.1-70b` | 128K | ~2,100 t/s | Previous generation, high throughput |
| `llama3.1-8b` | 128K | ~6,000 t/s | Ultra-fast, cost-effective, low latency |
Cerebras token speeds are hardware-limited by the wafer-scale chip — throughput figures are per-request and do not degrade under load.
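If you want to spot-check these figures through your own gateway, the sketch below times one completion per model. It assumes the gateway from the Configuration section is listening on localhost:41002 and that the `usage` field is forwarded from the upstream response; end-to-end numbers include network and prompt-processing time, so expect them to land below the per-token generation rates quoted above.

```python
# Rough end-to-end throughput check through the local gateway (sketch).
# Assumes the gateway is listening on localhost:41002 and forwards `usage`.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")

for model in ["llama3.3-70b", "llama3.1-8b"]:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Write a 300-word overview of WebAssembly."}],
        max_tokens=512,
    )
    elapsed = time.perf_counter() - start
    tokens = response.usage.completion_tokens
    print(f"{model}: {tokens} tokens in {elapsed:.2f}s (~{tokens / elapsed:.0f} t/s end-to-end)")
```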
Client Examples
Python:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="unused",
)

response = client.chat.completions.create(
    model="llama3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain WebAssembly and its use cases."},
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```

Node.js:

```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:41002/v1',
  apiKey: 'unused',
});

const response = await client.chat.completions.create({
  model: 'llama3.3-70b',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Explain WebAssembly and its use cases.' },
  ],
  max_tokens: 1024,
});
console.log(response.choices[0].message.content);
```

cURL:

```bash
curl http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3-70b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain WebAssembly and its use cases."}
    ],
    "max_tokens": 1024
  }'
```
Streaming
Cerebras supports streaming chat completions. The gateway forwards SSE chunks in real time — the wafer-scale chip's high throughput means the first token arrives extremely quickly:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="unused",
)

stream = client.chat.completions.create(
    model="llama3.3-70b",
    messages=[{"role": "user", "content": "List 10 design patterns with one-sentence descriptions."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```
Advanced Configuration
Latency-Optimized Routing
Use Cerebras as the primary target for latency-sensitive workloads with a GPU-based fallback:
```yaml
pack:
  name: cerebras-providers-2
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: cerebras-fast
      provider: cerebras:chat:llama3.3-70b
      secret_key_ref:
        env: CEREBRAS_API_KEY
    - id: openai-fallback
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY

policies:
  chain:
    - audit-logger

policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true
```
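Before committing a latency-sensitive workload to this setup, it can help to measure time-to-first-token through the gateway. The following is a minimal sketch, assuming the gateway above is running on localhost:41002; how and when requests fail over to `openai-fallback` is not shown here (see the Provider routing page in Next steps).

```python
# Minimal time-to-first-token probe (sketch): stream one request through the
# gateway and report how long the first content chunk takes to arrive.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama3.3-70b",
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    # Some chunks carry no content (e.g. role or finish metadata); skip them.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"First token after {time.perf_counter() - start:.3f}s")
        break
```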
High-Throughput Batch Pipelines with llama3.1-8b
The llama3.1-8b model achieves ~6,000 tokens/second, making it ideal for classification, summarization pipelines, and batch annotation:
```yaml
pack:
  name: cerebras-providers-3
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: cerebras-8b
      provider: cerebras:chat:llama3.1-8b
      secret_key_ref:
        env: CEREBRAS_API_KEY

policies:
  chain:
    - audit-logger

policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true
```
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")

items = ["statement one", "statement two", "statement three"]
for item in items:
    response = client.chat.completions.create(
        model="llama3.1-8b",
        messages=[{"role": "user", "content": f'Classify as positive, neutral, or negative: "{item}"'}],
        max_tokens=5,
    )
    print(item, "->", response.choices[0].message.content.strip())
```
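The loop above issues requests one at a time, so the client round-trip, not the model, sets the batch rate. A thread pool is one way to keep several requests in flight; the sketch below assumes the same gateway and model, and the worker count is a placeholder, so check your Cerebras rate limits before raising it.

```python
# Concurrent variant of the classification loop (sketch).
# Uses the synchronous OpenAI client from a thread pool; the worker count is
# a placeholder and should respect your gateway and provider rate limits.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")
items = ["statement one", "statement two", "statement three"]

def classify(item: str) -> str:
    # Same short-answer prompt and low max_tokens as the serial example above.
    response = client.chat.completions.create(
        model="llama3.1-8b",
        messages=[{"role": "user", "content": f'Classify as positive, neutral, or negative: "{item}"'}],
        max_tokens=5,
    )
    return response.choices[0].message.content.strip()

with ThreadPoolExecutor(max_workers=8) as pool:
    for item, label in zip(items, pool.map(classify, items)):
        print(item, "->", label)
```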
Best Practices
- Prefer `llama3.3-70b` for quality, `llama3.1-8b` for throughput: choose based on whether latency-per-token or tokens-per-second matters more to your workload.
- Set a low `max_tokens` for classification: Cerebras pricing is per token, and short-answer tasks with `max_tokens: 20` dramatically reduce cost while benefiting from full speed.
- Use the `audit-logger` policy for compliance: Cerebras does not provide per-inference audit logs; capture them at the Keeptrusts layer to maintain a complete request record.
- Apply `prompt-injection` for user-facing apps: Cerebras's high inference speed makes real-time injection detection practical with no perceptible latency penalty.
- Cold-start is not a factor: unlike serverless GPU providers, Cerebras chips have no warm-up time, so first-token latency is consistently low even after idle periods.
For AI systems
- Canonical terms: Keeptrusts gateway, Cerebras, wafer-scale inference, provider target, `policy-config.yaml`, `provider: "cerebras"`, `CEREBRAS_API_KEY`.
- Config field names: `provider`, `base_url: "https://api.cerebras.ai/v1"`, `secret_key_ref.env: "CEREBRAS_API_KEY"`, `format: "openai"`.
- Provider shorthand: `cerebras:chat:<model>` (e.g., `cerebras:chat:llama3.3-70b`).
- Key models: `llama3.3-70b` (~2,100 t/s), `llama3.1-8b` (~6,000 t/s).
- Best next pages: Groq integration (alternative fast inference), Provider routing, Policy configuration.
For engineers
- Prerequisites: Cerebras API key (`CEREBRAS_API_KEY` env var, from cloud.cerebras.ai) and the `kt` CLI installed.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Validate: `curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"llama3.3-70b","messages":[{"role":"user","content":"hello"}]}'`.
- Cerebras uses an OpenAI-compatible API; standard OpenAI SDKs work without modification.
- No cold-start latency: wafer-scale chips maintain consistent first-token latency even after idle periods.
- For latency-critical workloads, use `llama3.1-8b` (~6,000 t/s); for quality, use `llama3.3-70b` (~2,100 t/s).
For leaders
- Cerebras offers the fastest available LLM inference (2,100–6,000 tokens/second), enabling real-time policy enforcement with no perceptible latency penalty.
- Per-token pricing makes short-answer classification tasks extremely cost-effective with `max_tokens` caps.
- Hardware-limited throughput does not degrade under load, providing predictable cost and latency guarantees.
- Limited model selection (Llama family only): pair with a GPU-based fallback provider for model diversity.
Next steps
- Groq integration — alternative ultra-fast inference provider (LPU-based)
- Together AI integration — broader open-model catalog with fast inference
- Provider routing strategies — latency-based routing with GPU fallback
- Policy configuration — prompt-injection and audit-logger policy reference
- Quickstart – install `kt` and run your first gateway