Groq
Groq delivers ultra-low latency inference powered by custom LPU hardware, making it one of the fastest hosted LLM providers available. Keeptrusts routes Groq traffic through its policy engine, so you get real-time safety enforcement, audit trails, and observability without sacrificing speed.
Use this page when
- You need the exact command, config, API, or integration details for Groq.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- You want a reference page; for a guided rollout instead, use the linked workflow pages in Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
- Create a Groq account at console.groq.com.
- Generate an API key from the API Keys section of the Groq console.
- Export the key as an environment variable:
```bash
export GROQ_API_KEY="gsk_..."
```
Keeptrusts auto-detects GROQ_API_KEY when the provider is set to groq, so no additional environment configuration is required.
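If your key lives under a different environment variable, reference it explicitly with `secret_key_ref` on the provider entry (the entry format is shown under Configuration below). A minimal sketch; the variable name `MY_GROQ_KEY` is only an example:

```yaml
providers:
  targets:
    - id: groq-llama-70b
      provider: groq:chat:llama-3.3-70b-versatile
      secret_key_ref:
        env: MY_GROQ_KEY   # overrides the auto-detected GROQ_API_KEY
```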
Configuration
Add a Groq provider to the providers list in your Keeptrusts policy configuration:
```yaml
pack:
  name: groq-providers-1
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: groq-llama-70b
      provider: groq:chat:llama-3.3-70b-versatile

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
The shorthand `provider: "groq"` uses Groq's default model. Use the full form `groq:chat:<model>` to pin a specific model.
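For example, both forms below are valid provider entries; only the second pins the model (a sketch, IDs are illustrative):

```yaml
providers:
  targets:
    - id: groq-default
      provider: groq                                # Groq's default model
    - id: groq-llama-70b
      provider: groq:chat:llama-3.3-70b-versatile   # pinned model (recommended)
```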
Provider Fields
| Field | Type | Required | Description |
|---|---|---|---|
| `id` | string | Yes | Unique identifier for this provider entry. |
| `provider` | string | Yes | Provider selector. Use `"groq"` or `"groq:chat:<model>"`. |
| `model` | string | No | Model name override. Ignored when the model is embedded in `provider`. |
| `base_url` | string | No | API base URL. Auto-detected as `https://api.groq.com/openai/v1`. |
| `secret_key_ref` | object | No | Reference to the environment variable holding the API key. Auto-detected as `GROQ_API_KEY`. |
| `timeout_seconds` | integer | No | Maximum seconds to wait for a non-streaming response. Default: 30. |
| `stream_timeout_seconds` | integer | No | Maximum seconds to wait between streamed chunks. Default: 120. |
| `max_context_tokens` | integer | No | Context window size in tokens. Used for prompt-length policy checks. |
| `format` | string | No | Wire format. Auto-detected as `"openai"` (OpenAI-compatible). |
| `provider_type` | string | No | Explicit provider type hint. Rarely needed; auto-inferred from `provider`. |
| `description` | string | No | Human-readable label shown in the console and audit logs. |
| `weight` | number | No | Routing weight when used in a provider group (0.0–1.0). |
| `data_policy` | object | No | Data-handling metadata: `region`, `retention`, `pii_allowed`. |
| `pricing` | object | No | Cost metadata: `input_per_1k`, `output_per_1k` (USD per 1K tokens). |
| `health_probe` | object | No | Liveness probe config: `enabled`, `interval_seconds`, `timeout_seconds`. |
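The sketch below exercises the optional fields from the table; the timeout, weight, pricing, and region values are placeholders, not recommendations:

```yaml
providers:
  targets:
    - id: groq-llama-70b
      provider: groq:chat:llama-3.3-70b-versatile
      base_url: https://api.groq.com/openai/v1   # auto-detected; shown for completeness
      secret_key_ref:
        env: GROQ_API_KEY                        # auto-detected; override if needed
      timeout_seconds: 30
      stream_timeout_seconds: 120
      max_context_tokens: 131072
      format: openai
      description: "Groq Llama 3.3 70B via Keeptrusts"
      weight: 1.0
      data_policy:
        region: "us"            # placeholder
        retention: "none"
        pii_allowed: false
      pricing:
        input_per_1k: 0.0006    # USD per 1K input tokens; placeholder
        output_per_1k: 0.0008   # USD per 1K output tokens; placeholder
      health_probe:
        enabled: true
        interval_seconds: 30
        timeout_seconds: 5
```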
Supported Models
| Model | Context Window | Notes |
|---|---|---|
| `llama-3.3-70b-versatile` | 131,072 | General-purpose, best quality on Groq |
| `llama-3.1-8b-instant` | 131,072 | Fastest option, ideal for high-throughput tasks |
| `mixtral-8x7b-32768` | 32,768 | Mixture-of-experts, strong reasoning |
| `gemma2-9b-it` | 8,192 | Google Gemma 2, instruction-tuned |
Model availability is subject to Groq's catalog. Run `kt providers list --provider groq` to see the current set.
Client Examples
Point your application at the Keeptrusts gateway and use Groq models as if you were calling OpenAI.
Python:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",  # Keeptrusts gateway
    api_key="any",                         # gateway handles upstream auth
)

response = client.chat.completions.create(
    model="groq:chat:llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantum computing in two sentences."},
    ],
)
print(response.choices[0].message.content)
```
Node.js:

```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:41002/v1", // Keeptrusts gateway
  apiKey: "any",                        // gateway handles upstream auth
});

const response = await client.chat.completions.create({
  model: "groq:chat:llama-3.3-70b-versatile",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain quantum computing in two sentences." },
  ],
});
console.log(response.choices[0].message.content);
```
cURL:

```bash
curl http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "groq:chat:llama-3.3-70b-versatile",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in two sentences."}
    ]
  }'
```
Streaming
Streaming is the default for Groq through the Keeptrusts gateway. Policy checks (redaction, disclaimers, content filtering) are applied per-chunk in real time.
```python
stream = client.chat.completions.create(
    model="groq:chat:llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a haiku about latency."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
Use `stream_timeout_seconds` to control how long the gateway waits between chunks before treating the stream as stalled.
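For example, tightening stall detection for an interactive workload might look like this (the 30-second value is illustrative):

```yaml
providers:
  targets:
    - id: groq-llama-70b
      provider: groq:chat:llama-3.3-70b-versatile
      stream_timeout_seconds: 30   # treat the stream as stalled after 30s without a chunk
```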
Advanced Configuration
Fallback Chains
Route traffic to a backup provider if Groq becomes unavailable:
```yaml
provider_groups:
  - id: fast-with-fallback
    strategy: "fallback"
    providers:
      - groq-llama-70b       # primary: ultra-low latency
      - together-llama-70b   # fallback
```
Weighted Routing
Distribute traffic across Groq models or providers:
```yaml
provider_groups:
  - id: groq-balanced
    strategy: "weighted"
    providers:
      - id: groq-llama-70b
        provider: "groq:chat:llama-3.3-70b-versatile"
        weight: 0.7
      - id: groq-llama-8b
        provider: "groq:chat:llama-3.1-8b-instant"
        weight: 0.3
```
Model Groups
Restrict which models a policy allows:
```yaml
model_groups:
  groq-approved:
    - "groq:chat:llama-3.3-70b-versatile"
    - "groq:chat:llama-3.1-8b-instant"

policies:
  - name: prod-policy
    allowed_model_groups:
      - groq-approved
```
Best Practices
- Pin models explicitly: use `groq:chat:llama-3.3-70b-versatile` instead of bare `groq` so policy audits reference a deterministic model.
- Set `max_context_tokens`: Groq models have large context windows; setting this field enables Keeptrusts's prompt-length policy checks before the request reaches Groq.
- Enable health probes: Groq's LPU infrastructure is highly reliable, but probes let Keeptrusts trigger fallback routing automatically if latency spikes.
- Use `data_policy.retention: "none"`: Groq does not retain prompt data by default; documenting this in config keeps your audit trail consistent.
- Prefer streaming for long outputs: Groq's low latency makes streaming especially effective; combine with `stream_timeout_seconds` for resilient long-form generation. A combined sketch applying these recommendations follows this list.
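A sketch that applies these recommendations to a single target; the numeric values are illustrative:

```yaml
providers:
  targets:
    - id: groq-llama-70b
      provider: groq:chat:llama-3.3-70b-versatile   # pinned model
      max_context_tokens: 131072                    # enables prompt-length policy checks
      stream_timeout_seconds: 120                   # resilient long-form streaming
      data_policy:
        retention: "none"                           # matches Groq's default no-retention posture
      health_probe:
        enabled: true                               # allows automatic fallback on latency spikes
        interval_seconds: 30
        timeout_seconds: 5
```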
For AI systems
- Canonical terms: Keeptrusts gateway, Groq, LPU, Language Processing Unit, ultra-low latency, provider target, policy-config.yaml, `provider: "groq"`, `GROQ_API_KEY`.
- Config field names: `provider`, `model`, `base_url: "https://api.groq.com/openai/v1"`, `secret_key_ref.env: "GROQ_API_KEY"`, `format: "openai"`, `data_policy`.
- Provider shorthand: `groq:chat:<model>` (e.g., `groq:chat:llama-3.3-70b-versatile`).
- Key behavior: Groq does not retain prompt data by default; configure `data_policy.retention: "none"` for audit consistency.
- Best next pages: Cerebras integration (alternative fast inference), Together AI integration, Policy configuration.
For engineers
- Prerequisites: Groq API key (`GROQ_API_KEY` env var from console.groq.com), `kt` CLI installed.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Validate: `curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"groq:chat:llama-3.3-70b-versatile","messages":[{"role":"user","content":"hello"}]}'`.
- Groq uses an OpenAI-compatible API; standard OpenAI SDKs work without modification.
- Prefer streaming for long outputs; Groq's low latency makes streaming especially effective.
- Set `data_policy.retention: "none"` to match Groq's default no-retention posture in your audit trail.
For leaders
- Groq's LPU hardware delivers sub-second inference latency — enables real-time AI features without perceptible delay.
- No data retention by default aligns with strict compliance postures — document this in config for audit evidence.
- Limited model catalog (primarily Llama and Mixtral variants) — pair with a broader provider for model diversity.
- Competitive pricing for high-throughput workloads; combine with `audit-logger` for complete request accounting.
Next steps
- Cerebras integration: alternative ultra-fast wafer-scale inference
- Together AI integration: broader open-model catalog
- Provider routing strategies: latency-based routing with fallback
- Policy configuration: audit-logger and PII policy reference
- Quickstart: install `kt` and run your first gateway