Groq

Groq delivers ultra-low latency inference powered by custom LPU hardware, making it one of the fastest hosted LLM providers available. Keeptrusts routes Groq traffic through its policy engine, so you get real-time safety enforcement, audit trails, and observability without sacrificing speed.

Use this page when

  • You need the exact command, config, API, or integration details for Groq.
  • You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
  • You want a guided rollout rather than a reference page; in that case, follow the linked workflow pages in Next steps.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Prerequisites

  1. Create a Groq account at console.groq.com.
  2. Generate an API key from the API Keys section of the Groq console.
  3. Export the key as an environment variable:
export GROQ_API_KEY="gsk_..."

Keeptrusts auto-detects GROQ_API_KEY when the provider is set to groq, so no additional environment configuration is required.

Configuration

Add a Groq provider to the providers list in your Keeptrusts policy configuration:

policy-config.yaml
pack:
  name: groq-providers-1
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: groq-llama-70b
      provider: groq:chat:llama-3.3-70b-versatile
      policies:
        chain:
          - audit-logger
        policy:
          audit-logger:
            immutable: true
            retention_days: 365
            log_all_access: true

The shorthand provider: "groq" uses Groq's default model. Use the full form groq:chat:<model> to pin a specific model.
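
For illustration, a minimal sketch showing both forms as target entries (the ids are arbitrary):

providers:
  targets:
    - id: groq-default
      provider: "groq"                                # Groq's default model
    - id: groq-pinned
      provider: "groq:chat:llama-3.3-70b-versatile"   # pinned model, preferred for deterministic audits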

Provider Fields

Field                  | Type    | Required | Description
id                     | string  | Yes      | Unique identifier for this provider entry.
provider               | string  | Yes      | Provider selector. Use "groq" or "groq:chat:<model>".
model                  | string  | No       | Model name override. Ignored when the model is embedded in provider.
base_url               | string  | No       | API base URL. Auto-detected as https://api.groq.com/openai/v1.
secret_key_ref         | object  | No       | Reference to the environment variable holding the API key. Auto-detected as GROQ_API_KEY.
timeout_seconds        | integer | No       | Maximum seconds to wait for a non-streaming response. Default: 30.
stream_timeout_seconds | integer | No       | Maximum seconds to wait between streamed chunks. Default: 120.
max_context_tokens     | integer | No       | Context window size in tokens. Used for prompt-length policy checks.
format                 | string  | No       | Wire format. Auto-detected as "openai" (OpenAI-compatible).
provider_type          | string  | No       | Explicit provider type hint. Rarely needed; auto-inferred from provider.
description            | string  | No       | Human-readable label shown in the console and audit logs.
weight                 | number  | No       | Routing weight when used in a provider group (0.0–1.0).
data_policy            | object  | No       | Data-handling metadata: region, retention, pii_allowed.
pricing                | object  | No       | Cost metadata: input_per_1k, output_per_1k (USD per 1K tokens).
health_probe           | object  | No       | Liveness probe config: enabled, interval_seconds, timeout_seconds.
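
As an illustrative sketch, a target entry that sets several of the optional fields above might look like the following (all values are placeholders, not recommendations):

providers:
  targets:
    - id: groq-llama-70b
      provider: "groq:chat:llama-3.3-70b-versatile"
      description: "Primary low-latency chat model"
      base_url: "https://api.groq.com/openai/v1"   # normally auto-detected
      secret_key_ref:
        env: "GROQ_API_KEY"                        # normally auto-detected
      timeout_seconds: 30
      pricing:
        input_per_1k: 0.0006    # placeholder; use Groq's current rates
        output_per_1k: 0.0008   # placeholder; use Groq's current rates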

Supported Models

Model                   | Context Window | Notes
llama-3.3-70b-versatile | 131,072        | General-purpose, best quality on Groq
llama-3.1-8b-instant    | 131,072        | Fastest option, ideal for high-throughput tasks
mixtral-8x7b-32768      | 32,768         | Mixture-of-experts, strong reasoning
gemma2-9b-it            | 8,192          | Google Gemma 2, instruction-tuned

Model availability is subject to Groq's catalog. Run kt providers list --provider groq to see the current set.

Client Examples

Point your application at the Keeptrusts gateway and use Groq models as if you were calling OpenAI.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",  # Keeptrusts gateway
    api_key="any",  # gateway handles upstream auth
)

response = client.chat.completions.create(
    model="groq:chat:llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in two sentences."},
    ],
)

print(response.choices[0].message.content)

Streaming

Streaming is the default for Groq through the Keeptrusts gateway. Policy checks (redaction, disclaimers, content filtering) are applied per-chunk in real time.

stream = client.chat.completions.create(
    model="groq:chat:llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a haiku about latency."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Use stream_timeout_seconds to control how long the gateway waits between chunks before treating the stream as stalled.
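
A minimal sketch of a target that tightens that cutoff (the 20-second value is only an example):

providers:
  targets:
    - id: groq-llama-70b
      provider: "groq:chat:llama-3.3-70b-versatile"
      stream_timeout_seconds: 20   # treat the stream as stalled after 20 s without a new chunk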

Advanced Configuration

Fallback Chains

Route traffic to a backup provider if Groq becomes unavailable:

policy-config.yaml
provider_groups:
  - id: fast-with-fallback
    strategy: "fallback"
    providers:
      - groq-llama-70b       # primary, ultra-low latency
      - together-llama-70b   # fallback

Weighted Routing

Distribute traffic across Groq models or providers:

policy-config.yaml
provider_groups:
  - id: groq-balanced
    strategy: "weighted"
    providers:
      - id: groq-llama-70b
        provider: "groq:chat:llama-3.3-70b-versatile"
        weight: 0.7
      - id: groq-llama-8b
        provider: "groq:chat:llama-3.1-8b-instant"
        weight: 0.3

Model Groups

Restrict which models a policy allows:

policy-config.yaml
model_groups:
  groq-approved:
    - "groq:chat:llama-3.3-70b-versatile"
    - "groq:chat:llama-3.1-8b-instant"

policies:
  - name: prod-policy
    allowed_model_groups:
      - groq-approved

Best Practices

  • Pin models explicitly — use groq:chat:llama-3.3-70b-versatile instead of bare groq so policy audits reference a deterministic model (the sketch after this list combines this with the practices below).
  • Set max_context_tokens — Groq models have large context windows; setting this field enables Keeptrusts's prompt-length policy checks before the request reaches Groq.
  • Enable health probes — Groq's LPU infrastructure is highly reliable, but probes let Keeptrusts trigger fallback routing automatically if latency spikes.
  • Use data_policy.retention: "none" — Groq does not retain prompt data by default; documenting this in config keeps your audit trail consistent.
  • Prefer streaming for long outputs — Groq's low latency makes streaming especially effective; combine with stream_timeout_seconds for resilient long-form generation.
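
A minimal sketch combining these practices in a single target entry, using only fields from the Provider Fields table above (values are illustrative):

providers:
  targets:
    - id: groq-llama-70b
      provider: "groq:chat:llama-3.3-70b-versatile"   # pinned model for deterministic audits
      max_context_tokens: 131072                      # enables prompt-length checks before the request reaches Groq
      data_policy:
        retention: "none"                             # documents Groq's default no-retention posture
      health_probe:
        enabled: true                                 # allows automatic fallback routing on latency spikes
        interval_seconds: 30
        timeout_seconds: 5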

For AI systems

  • Canonical terms: Keeptrusts gateway, Groq, LPU, Language Processing Unit, ultra-low latency, provider target, policy-config.yaml, provider: "groq", GROQ_API_KEY.
  • Config field names: provider, model, base_url: "https://api.groq.com/openai/v1", secret_key_ref.env: "GROQ_API_KEY", format: "openai", data_policy.
  • Provider shorthand: groq:chat:<model> (e.g., groq:chat:llama-3.3-70b-versatile).
  • Key behavior: Groq does not retain prompt data by default — configure data_policy.retention: "none" for audit consistency.
  • Best next pages: Cerebras integration (alternative fast inference), Together AI integration, Policy configuration.

For engineers

  • Prerequisites: Groq API key (GROQ_API_KEY env var from console.groq.com), kt CLI installed.
  • Start command: kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml.
  • Validate: curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"groq:chat:llama-3.3-70b-versatile","messages":[{"role":"user","content":"hello"}]}'.
  • Groq uses OpenAI-compatible API — standard OpenAI SDKs work without modification.
  • Prefer streaming for long outputs — Groq's low latency makes streaming especially effective.
  • Set data_policy.retention: "none" to match Groq's default no-retention posture in your audit trail.

For leaders

  • Groq's LPU hardware delivers sub-second inference latency — enables real-time AI features without perceptible delay.
  • No data retention by default aligns with strict compliance postures — document this in config for audit evidence.
  • Limited model catalog (primarily Llama and Mixtral variants) — pair with a broader provider for model diversity.
  • Competitive pricing for high-throughput workloads; combine with audit-logger for complete request accounting.

Next steps