Groq
Groq delivers ultra-low latency inference powered by custom LPU hardware, making it one of the fastest hosted LLM providers available. Keeptrusts routes Groq traffic through its policy engine, so you get real-time safety enforcement, audit trails, and observability without sacrificing speed.
Use this page when
- You need the exact command, config, API, or integration details for Groq.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- You want a reference page; for a guided rollout instead, use the linked workflow pages in Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
- Create a Groq account at console.groq.com.
- Generate an API key from the API Keys section of the Groq console.
- Export the key as an environment variable:
```bash
export GROQ_API_KEY="gsk_..."
```
Keeptrusts auto-detects GROQ_API_KEY when the provider is set to groq, so no additional environment configuration is required.
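If your key lives under a different environment variable, reference it explicitly with `secret_key_ref` on the provider entry (the entry format is shown under Configuration below). A minimal sketch; the variable name `MY_GROQ_KEY` is only an example:

```yaml
providers:
  targets:
    - id: groq-llama-70b
      provider: groq:chat:llama-3.3-70b-versatile
      secret_key_ref:
        env: MY_GROQ_KEY   # overrides the auto-detected GROQ_API_KEY
```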
Configuration
Add a Groq provider to the providers list in your Keeptrusts policy configuration:
```yaml
pack:
  name: groq-providers-1
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: groq-llama-70b
      provider: groq:chat:llama-3.3-70b-versatile

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
The shorthand `provider: "groq"` uses Groq's default model. Use the full form `groq:chat:<model>` to pin a specific model.
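For example, both forms below are valid provider entries; only the second pins the model (a sketch, IDs are illustrative):

```yaml
providers:
  targets:
    - id: groq-default
      provider: groq                                # Groq's default model
    - id: groq-llama-70b
      provider: groq:chat:llama-3.3-70b-versatile   # pinned model (recommended)
```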
Provider Fields
| Field | Type | Required | Description |
|---|---|---|---|
| `id` | string | Yes | Unique identifier for this provider entry. |
| `provider` | string | Yes | Provider selector. Use `"groq"` or `"groq:chat:<model>"`. |
| `model` | string | No | Model name override. Ignored when the model is embedded in `provider`. |
| `base_url` | string | No | API base URL. Auto-detected as `https://api.groq.com/openai/v1`. |
| `secret_key_ref` | object | No | Reference to the environment variable holding the API key. Auto-detected as `GROQ_API_KEY`. |
| `timeout_seconds` | integer | No | Maximum seconds to wait for a non-streaming response. Default: 30. |
| `stream_timeout_seconds` | integer | No | Maximum seconds to wait between streamed chunks. Default: 120. |
| `max_context_tokens` | integer | No | Context window size in tokens. Used for prompt-length policy checks. |
| `format` | string | No | Wire format. Auto-detected as `"openai"` (OpenAI-compatible). |
| `provider_type` | string | No | Explicit provider type hint. Rarely needed; auto-inferred from `provider`. |
| `description` | string | No | Human-readable label shown in the console and audit logs. |
| `weight` | number | No | Routing weight when used in a provider group (0.0–1.0). |
| `data_policy` | object | No | Data-handling metadata: `region`, `retention`, `pii_allowed`. |
| `pricing` | object | No | Cost metadata: `input_per_1k`, `output_per_1k` (USD per 1K tokens). |
| `health_probe` | object | No | Liveness probe config: `enabled`, `interval_seconds`, `timeout_seconds`. |
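The sketch below exercises the optional fields from the table; the timeout, weight, pricing, and region values are placeholders, not recommendations:

```yaml
providers:
  targets:
    - id: groq-llama-70b
      provider: groq:chat:llama-3.3-70b-versatile
      base_url: https://api.groq.com/openai/v1   # auto-detected; shown for completeness
      secret_key_ref:
        env: GROQ_API_KEY                        # auto-detected; override if needed
      timeout_seconds: 30
      stream_timeout_seconds: 120
      max_context_tokens: 131072
      format: openai
      description: "Groq Llama 3.3 70B via Keeptrusts"
      weight: 1.0
      data_policy:
        region: "us"            # placeholder
        retention: "none"
        pii_allowed: false
      pricing:
        input_per_1k: 0.0006    # USD per 1K input tokens; placeholder
        output_per_1k: 0.0008   # USD per 1K output tokens; placeholder
      health_probe:
        enabled: true
        interval_seconds: 30
        timeout_seconds: 5
```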
Supported Models
| Model | Context Window | Notes |
|---|---|---|
| `llama-3.3-70b-versatile` | 131,072 | General-purpose, best quality on Groq |
| `llama-3.1-8b-instant` | 131,072 | Fastest option, ideal for high-throughput tasks |
| `mixtral-8x7b-32768` | 32,768 | Mixture-of-experts, strong reasoning |
| `gemma2-9b-it` | 8,192 | Google Gemma 2, instruction-tuned |
Model availability is subject to Groq's catalog. Run `kt providers list --provider groq` to see the current set.
Client Examples
Point your application at the Keeptrusts gateway and use Groq models as if you were calling OpenAI.
Python:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",  # Keeptrusts gateway
    api_key="any",                         # gateway handles upstream auth
)

response = client.chat.completions.create(
    model="groq:chat:llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in two sentences."},
    ],
)
print(response.choices[0].message.content)
```
Node.js:

```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:41002/v1", // Keeptrusts gateway
  apiKey: "any",                        // gateway handles upstream auth
});

const response = await client.chat.completions.create({
  model: "groq:chat:llama-3.3-70b-versatile",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain quantum computing in two sentences." },
  ],
});
console.log(response.choices[0].message.content);
```
cURL:

```bash
curl http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "groq:chat:llama-3.3-70b-versatile",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in two sentences."}
    ]
  }'
```
Streaming
Streaming is the default for Groq through the Keeptrusts gateway. Policy checks (redaction, disclaimers, content filtering) are applied per-chunk in real time.
```python
stream = client.chat.completions.create(
    model="groq:chat:llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a haiku about latency."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
Use `stream_timeout_seconds` to control how long the gateway waits between chunks before treating the stream as stalled.
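For example, tightening stall detection for an interactive workload might look like this (the 30-second value is illustrative):

```yaml
providers:
  targets:
    - id: groq-llama-70b
      provider: groq:chat:llama-3.3-70b-versatile
      stream_timeout_seconds: 30   # treat the stream as stalled after 30s without a chunk
```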
Advanced Configuration
Fallback Chains
Route traffic to a backup provider if Groq becomes unavailable:
```yaml
provider_groups:
  - id: fast-with-fallback
    strategy: "fallback"
    providers:
      - groq-llama-70b       # primary: ultra-low latency
      - together-llama-70b   # fallback
```
Weighted Routing
Distribute traffic across Groq models or providers:
```yaml
provider_groups:
  - id: groq-balanced
    strategy: "weighted"
    providers:
      - id: groq-llama-70b
        provider: "groq:chat:llama-3.3-70b-versatile"
        weight: 0.7
      - id: groq-llama-8b
        provider: "groq:chat:llama-3.1-8b-instant"
        weight: 0.3
```
Model Groups
Restrict which models a policy allows:
```yaml
model_groups:
  groq-approved:
    - "groq:chat:llama-3.3-70b-versatile"
    - "groq:chat:llama-3.1-8b-instant"

policies:
  - name: prod-policy
    allowed_model_groups:
      - groq-approved
```
Best Practices
- Pin models explicitly: use `groq:chat:llama-3.3-70b-versatile` instead of bare `groq` so policy audits reference a deterministic model.
- Set `max_context_tokens`: Groq models have large context windows; setting this field enables Keeptrusts's prompt-length policy checks before the request reaches Groq.
- Enable health probes: Groq's LPU infrastructure is highly reliable, but probes let Keeptrusts trigger fallback routing automatically if latency spikes.
- Use `data_policy.retention: "none"`: Groq does not retain prompt data by default; documenting this in config keeps your audit trail consistent.
- Prefer streaming for long outputs: Groq's low latency makes streaming especially effective; combine with `stream_timeout_seconds` for resilient long-form generation. A combined sketch applying these recommendations follows this list.
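A sketch that applies these recommendations to a single target; the numeric values are illustrative:

```yaml
providers:
  targets:
    - id: groq-llama-70b
      provider: groq:chat:llama-3.3-70b-versatile   # pinned model
      max_context_tokens: 131072                    # enables prompt-length policy checks
      stream_timeout_seconds: 120                   # resilient long-form streaming
      data_policy:
        retention: "none"                           # matches Groq's default no-retention posture
      health_probe:
        enabled: true                               # allows automatic fallback on latency spikes
        interval_seconds: 30
        timeout_seconds: 5
```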
For AI systems
- Canonical terms: Keeptrusts gateway, Groq, LPU, Language Processing Unit, ultra-low latency, provider target, policy-config.yaml, `provider: "groq"`, `GROQ_API_KEY`.
- Config field names: `provider`, `model`, `base_url: "https://api.groq.com/openai/v1"`, `secret_key_ref.env: "GROQ_API_KEY"`, `format: "openai"`, `data_policy`.
- Provider shorthand: `groq:chat:<model>` (e.g., `groq:chat:llama-3.3-70b-versatile`).
- Key behavior: Groq does not retain prompt data by default; configure `data_policy.retention: "none"` for audit consistency.
- Best next pages: Cerebras integration (alternative fast inference), Together AI integration, Policy configuration.
For engineers
- Prerequisites: Groq API key (`GROQ_API_KEY` env var from console.groq.com), `kt` CLI installed.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Validate: `curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"groq:chat:llama-3.3-70b-versatile","messages":[{"role":"user","content":"hello"}]}'`.
- Groq uses an OpenAI-compatible API; standard OpenAI SDKs work without modification.
- Prefer streaming for long outputs; Groq's low latency makes streaming especially effective.
- Set `data_policy.retention: "none"` to match Groq's default no-retention posture in your audit trail.
For leaders
- Groq's LPU hardware delivers sub-second inference latency — enables real-time AI features without perceptible delay.
- No data retention by default aligns with strict compliance postures — document this in config for audit evidence.
- Limited model catalog (primarily Llama and Mixtral variants) — pair with a broader provider for model diversity.
- Competitive pricing for high-throughput workloads; combine with `audit-logger` for complete request accounting.
Next steps
- Cerebras integration: alternative ultra-fast wafer-scale inference
- Together AI integration: broader open-model catalog
- Provider routing strategies: latency-based routing with fallback
- Policy configuration: audit-logger and PII policy reference
- Quickstart: install `kt` and run your first gateway