Cerebras

Cerebras delivers ultra-fast LLM inference powered by its wafer-scale chip architecture, achieving speeds several times faster than GPU-based providers. Keeptrusts integrates natively with the Cerebras Inference API using OpenAI-compatible transport, so you get low-latency responses and real-time policy enforcement without added overhead.

Use this page when

  • You need the exact command, config, API, or integration details for Cerebras.
  • You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
  • You want a guided rollout instead of a reference page; in that case, use the workflow pages linked under Next steps.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Prerequisites
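
Create an API key at cloud.cerebras.ai and export it before starting the gateway: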

export CEREBRAS_API_KEY="csk-..."

Configuration

pack:
  name: cerebras-gateway
  version: 0.1.0
  enabled: true
policies:
  chain:
    - prompt-injection
    - pii-detector
    - audit-logger
providers:
  targets:
    - id: cerebras-llama
      provider: cerebras:chat:llama3.3-70b
      base_url: https://api.cerebras.ai/v1
      secret_key_ref:
        env: CEREBRAS_API_KEY

Start the gateway:

kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml
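
Once the gateway is listening, smoke-test it with a plain HTTP request (no client-side API key is needed; the gateway holds the Cerebras credential):

curl http://localhost:41002/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"llama3.3-70b","messages":[{"role":"user","content":"hello"}]}'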

Provider Fields

Field            Type     Default                      Description
provider         string   (none)                       "cerebras" or "cerebras:chat:<model>"
base_url         string   https://api.cerebras.ai/v1   Cerebras API base URL (auto-detected)
secret_key_ref   object   CEREBRAS_API_KEY             Object reference to the env var holding the Cerebras API key
format           string   "openai"                     Wire format; Cerebras exposes an OpenAI-compatible API
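
For reference, a target entry with every field above set explicitly; the id is illustrative, and base_url and format simply restate their documented defaults, so they are normally omitted:

providers:
  targets:
    - id: cerebras-explicit   # illustrative name
      provider: cerebras:chat:llama3.3-70b
      base_url: https://api.cerebras.ai/v1   # default, shown for completeness
      format: openai                          # default wire format
      secret_key_ref:
        env: CEREBRAS_API_KEY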

Supported Models

Model ID       Context  Approx. Speed  Use Case
llama3.3-70b   128K     ~2,100 t/s     High quality, still very fast
llama3.1-70b   128K     ~2,100 t/s     Previous generation, high throughput
llama3.1-8b    128K     ~6,000 t/s     Ultra-fast, cost-effective, low latency

Cerebras token speeds are hardware-limited by the wafer-scale chip — throughput figures are per-request and do not degrade under load.

Client Examples

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="unused",  # not checked; the gateway holds the real key via secret_key_ref
)

response = client.chat.completions.create(
    model="llama3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain WebAssembly and its use cases."},
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)

Streaming

Cerebras supports streaming chat completions. The gateway forwards SSE chunks in real time — the wafer-scale chip's high throughput means the first token arrives extremely quickly:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="unused",
)

stream = client.chat.completions.create(
    model="llama3.3-70b",
    messages=[{"role": "user", "content": "List 10 design patterns with one-sentence descriptions."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Advanced Configuration

Latency-Optimized Routing

Use Cerebras as the primary target for latency-sensitive workloads with a GPU-based fallback:

pack:
  name: cerebras-providers-2
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: cerebras-fast
      provider: cerebras:chat:llama3.3-70b
      secret_key_ref:
        env: CEREBRAS_API_KEY
    - id: openai-fallback
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true
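
With this pack, client code is unchanged: requests still go to the single gateway endpoint, and target selection happens inside the gateway. This page does not define the failover semantics, so see Provider routing (linked under Next steps) for how target order and fallback are configured.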

High-Throughput Batch Pipelines with llama3.1-8b

The llama3.1-8b model achieves ~6,000 tokens/second, making it ideal for classification, summarization pipelines, and batch annotation:

pack:
  name: cerebras-providers-3
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: cerebras-8b
      provider: cerebras:chat:llama3.1-8b
      secret_key_ref:
        env: CEREBRAS_API_KEY
policies:
  chain:
    - audit-logger
policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true

A minimal client-side batch loop, keeping each answer short with a tight max_tokens cap:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")

items = ["statement one", "statement two", "statement three"]
for item in items:
    response = client.chat.completions.create(
        model="llama3.1-8b",
        messages=[{"role": "user", "content": f'Classify as positive, neutral, or negative: "{item}"'}],
        max_tokens=5,
    )
    print(item, "->", response.choices[0].message.content.strip())

Best Practices

  • Prefer llama3.3-70b for quality and llama3.1-8b for throughput; choose based on whether answer quality or raw tokens-per-second matters more to your workload (see the sketch after this list)
  • Set low max_tokens for classification — Cerebras pricing is per token; short-answer tasks with max_tokens: 20 dramatically reduce cost while benefiting from full speed
  • Use the audit-logger policy for compliance — Cerebras does not provide per-inference audit logs; capture them at the Keeptrusts layer to maintain a complete request record
  • Apply prompt-injection for user-facing apps — Cerebras's high inference speed makes real-time injection detection practical with no perceptible latency penalty
  • Cold-start is not a factor — unlike serverless GPU providers, Cerebras chips have no warm-up time; first-token latency is consistently low even after idle periods
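
A sketch of the first two practices combined, assuming only the documented model IDs and gateway endpoint; the helper names are illustrative:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")

# Quality-sensitive requests go to the larger model with room to answer;
# high-volume classification goes to llama3.1-8b with a tight token cap.
def answer(prompt: str) -> str:
    response = client.chat.completions.create(
        model="llama3.3-70b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1024,
    )
    return response.choices[0].message.content

def classify(text: str) -> str:
    response = client.chat.completions.create(
        model="llama3.1-8b",
        messages=[{"role": "user", "content": f'Classify as positive, neutral, or negative: "{text}"'}],
        max_tokens=5,  # short-answer cap keeps per-token cost minimal
    )
    return response.choices[0].message.content.strip()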

For AI systems

  • Canonical terms: Keeptrusts gateway, Cerebras, wafer-scale inference, provider target, policy-config.yaml, provider: "cerebras", CEREBRAS_API_KEY.
  • Config field names: provider, base_url: "https://api.cerebras.ai/v1", secret_key_ref.env: "CEREBRAS_API_KEY", format: "openai".
  • Provider shorthand: cerebras:chat:<model> (e.g., cerebras:chat:llama3.3-70b).
  • Key models: llama3.3-70b (~2,100 t/s), llama3.1-8b (~6,000 t/s).
  • Best next pages: Groq integration (alternative fast inference), Provider routing, Policy configuration.

For engineers

  • Prerequisites: Cerebras API key (CEREBRAS_API_KEY env var from cloud.cerebras.ai), kt CLI installed.
  • Start command: kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml.
  • Validate: curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"llama3.3-70b","messages":[{"role":"user","content":"hello"}]}'.
  • Cerebras exposes an OpenAI-compatible API, so standard OpenAI SDKs work without modification.
  • No cold-start latency — wafer-scale chips maintain consistent first-token latency even after idle periods.
  • For latency-critical workloads, use llama3.1-8b (~6,000 t/s); for quality, use llama3.3-70b (~2,100 t/s).

For leaders

  • Cerebras offers the fastest available LLM inference (2,100–6,000 tokens/second) — enables real-time policy enforcement with no perceptible latency penalty.
  • Per-token pricing makes short-answer classification tasks extremely cost-effective with max_tokens caps.
  • Hardware-limited throughput does not degrade under load, providing predictable cost and latency guarantees.
  • Limited model selection (Llama family only) — pair with a GPU-based fallback provider for model diversity.

Next steps