Cerebras
Cerebras delivers ultra-fast LLM inference powered by its wafer-scale chip architecture, achieving speeds several times faster than GPU-based providers. Keeptrusts integrates natively with the Cerebras Inference API using OpenAI-compatible transport, so you get low-latency responses and real-time policy enforcement without added overhead.
Use this page when
- You need the exact command, config, API, or integration details for Cerebras.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
- A Cerebras account and API key from cloud.cerebras.ai
- Keeptrusts CLI installed
Export the key in your shell before starting the gateway:

```bash
export CEREBRAS_API_KEY="csk-..."
```
Configuration
Define the Cerebras provider target and policy chain in `policy-config.yaml`:

```yaml
pack:
  name: cerebras-gateway
  version: 0.1.0
  enabled: true

policies:
  chain:
    - prompt-injection
    - pii-detector
    - audit-logger

providers:
  targets:
    - id: cerebras-llama
      provider: cerebras:chat:llama3.3-70b
      base_url: https://api.cerebras.ai/v1
      secret_key_ref:
        env: CEREBRAS_API_KEY
```
Start the gateway:
```bash
kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml
```
Provider Fields
| Field | Type | Default | Description |
|---|---|---|---|
| `provider` | string | – | `"cerebras"` or `"cerebras:chat:<model>"` |
| `base_url` | string | `https://api.cerebras.ai/v1` | Cerebras API base URL (auto-detected) |
| `secret_key_ref` | object | `CEREBRAS_API_KEY` | Object reference to the env var holding the Cerebras API key |
| `format` | string | `"openai"` | Wire format; Cerebras uses an OpenAI-compatible API |
Supported Models
| Model ID | Context | Approx. Speed | Use Case |
|---|---|---|---|
| `llama3.3-70b` | 128K | ~2,100 t/s | High quality, still very fast |
| `llama3.1-70b` | 128K | ~2,100 t/s | Previous generation, high throughput |
| `llama3.1-8b` | 128K | ~6,000 t/s | Ultra-fast, cost-effective, low latency |
Cerebras token speeds are hardware-limited by the wafer-scale chip — throughput figures are per-request and do not degrade under load.
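If you want to spot-check these figures through your own gateway, the sketch below times one completion per model. It assumes the gateway from the Configuration section is listening on localhost:41002 and that the `usage` field is forwarded from the upstream response; end-to-end numbers include network and prompt-processing time, so expect them to land below the per-token generation rates quoted above.

```python
# Rough end-to-end throughput check through the local gateway (sketch).
# Assumes the gateway is listening on localhost:41002 and forwards `usage`.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")

for model in ["llama3.3-70b", "llama3.1-8b"]:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Write a 300-word overview of WebAssembly."}],
        max_tokens=512,
    )
    elapsed = time.perf_counter() - start
    tokens = response.usage.completion_tokens
    print(f"{model}: {tokens} tokens in {elapsed:.2f}s (~{tokens / elapsed:.0f} t/s end-to-end)")
```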
Client Examples
Python:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="unused",
)

response = client.chat.completions.create(
    model="llama3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain WebAssembly and its use cases."},
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```

Node.js:

```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:41002/v1',
  apiKey: 'unused',
});

const response = await client.chat.completions.create({
  model: 'llama3.3-70b',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Explain WebAssembly and its use cases.' },
  ],
  max_tokens: 1024,
});
console.log(response.choices[0].message.content);
```

cURL:

```bash
curl http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3-70b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain WebAssembly and its use cases."}
    ],
    "max_tokens": 1024
  }'
```
Streaming
Cerebras supports streaming chat completions. The gateway forwards SSE chunks in real time — the wafer-scale chip's high throughput means the first token arrives extremely quickly:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="unused",
)

stream = client.chat.completions.create(
    model="llama3.3-70b",
    messages=[{"role": "user", "content": "List 10 design patterns with one-sentence descriptions."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```
Advanced Configuration
Latency-Optimized Routing
Use Cerebras as the primary target for latency-sensitive workloads with a GPU-based fallback:
```yaml
pack:
  name: cerebras-providers-2
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: cerebras-fast
      provider: cerebras:chat:llama3.3-70b
      secret_key_ref:
        env: CEREBRAS_API_KEY
    - id: openai-fallback
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY

policies:
  chain:
    - audit-logger

policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true
```
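Before committing a latency-sensitive workload to this setup, it can help to measure time-to-first-token through the gateway. The following is a minimal sketch, assuming the gateway above is running on localhost:41002; how and when requests fail over to `openai-fallback` is not shown here (see the Provider routing page in Next steps).

```python
# Minimal time-to-first-token probe (sketch): stream one request through the
# gateway and report how long the first content chunk takes to arrive.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama3.3-70b",
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    # Some chunks carry no content (e.g. role or finish metadata); skip them.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"First token after {time.perf_counter() - start:.3f}s")
        break
```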
High-Throughput Batch Pipelines with llama3.1-8b
The llama3.1-8b model achieves ~6,000 tokens/second, making it ideal for classification, summarization pipelines, and batch annotation:
```yaml
pack:
  name: cerebras-providers-3
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: cerebras-8b
      provider: cerebras:chat:llama3.1-8b
      secret_key_ref:
        env: CEREBRAS_API_KEY

policies:
  chain:
    - audit-logger

policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true
```
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")

items = ["statement one", "statement two", "statement three"]
for item in items:
    response = client.chat.completions.create(
        model="llama3.1-8b",
        messages=[{"role": "user", "content": f'Classify as positive, neutral, or negative: "{item}"'}],
        max_tokens=5,
    )
    print(item, "->", response.choices[0].message.content.strip())
```
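The loop above issues requests one at a time, so the client round-trip, not the model, sets the batch rate. A thread pool is one way to keep several requests in flight; the sketch below assumes the same gateway and model, and the worker count is a placeholder, so check your Cerebras rate limits before raising it.

```python
# Concurrent variant of the classification loop (sketch).
# Uses the synchronous OpenAI client from a thread pool; the worker count is
# a placeholder and should respect your gateway and provider rate limits.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")
items = ["statement one", "statement two", "statement three"]

def classify(item: str) -> str:
    # Same short-answer prompt and low max_tokens as the serial example above.
    response = client.chat.completions.create(
        model="llama3.1-8b",
        messages=[{"role": "user", "content": f'Classify as positive, neutral, or negative: "{item}"'}],
        max_tokens=5,
    )
    return response.choices[0].message.content.strip()

with ThreadPoolExecutor(max_workers=8) as pool:
    for item, label in zip(items, pool.map(classify, items)):
        print(item, "->", label)
```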
Best Practices
- Prefer `llama3.3-70b` for quality, `llama3.1-8b` for throughput: choose based on whether latency-per-token or tokens-per-second matters more to your workload.
- Set a low `max_tokens` for classification: Cerebras pricing is per token, and short-answer tasks with `max_tokens: 20` dramatically reduce cost while benefiting from full speed.
- Use the `audit-logger` policy for compliance: Cerebras does not provide per-inference audit logs; capture them at the Keeptrusts layer to maintain a complete request record.
- Apply `prompt-injection` for user-facing apps: Cerebras's high inference speed makes real-time injection detection practical with no perceptible latency penalty.
- Cold-start is not a factor: unlike serverless GPU providers, Cerebras chips have no warm-up time, so first-token latency is consistently low even after idle periods.
For AI systems
- Canonical terms: Keeptrusts gateway, Cerebras, wafer-scale inference, provider target, `policy-config.yaml`, `provider: "cerebras"`, `CEREBRAS_API_KEY`.
- Config field names: `provider`, `base_url: "https://api.cerebras.ai/v1"`, `secret_key_ref.env: "CEREBRAS_API_KEY"`, `format: "openai"`.
- Provider shorthand: `cerebras:chat:<model>` (e.g., `cerebras:chat:llama3.3-70b`).
- Key models: `llama3.3-70b` (~2,100 t/s), `llama3.1-8b` (~6,000 t/s).
- Best next pages: Groq integration (alternative fast inference), Provider routing, Policy configuration.
For engineers
- Prerequisites: Cerebras API key (`CEREBRAS_API_KEY` env var, from cloud.cerebras.ai) and the `kt` CLI installed.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Validate: `curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"llama3.3-70b","messages":[{"role":"user","content":"hello"}]}'`.
- Cerebras uses an OpenAI-compatible API; standard OpenAI SDKs work without modification.
- No cold-start latency: wafer-scale chips maintain consistent first-token latency even after idle periods.
- For latency-critical workloads, use `llama3.1-8b` (~6,000 t/s); for quality, use `llama3.3-70b` (~2,100 t/s).
For leaders
- Cerebras offers the fastest available LLM inference (2,100–6,000 tokens/second), enabling real-time policy enforcement with no perceptible latency penalty.
- Per-token pricing makes short-answer classification tasks extremely cost-effective with `max_tokens` caps.
- Hardware-limited throughput does not degrade under load, providing predictable cost and latency guarantees.
- Limited model selection (Llama family only): pair with a GPU-based fallback provider for model diversity.
Next steps
- Groq integration — alternative ultra-fast inference provider (LPU-based)
- Together AI integration — broader open-model catalog with fast inference
- Provider routing strategies — latency-based routing with GPU fallback
- Policy configuration — prompt-injection and audit-logger policy reference
- Quickstart – install `kt` and run your first gateway