OpenLLM
OpenLLM (by BentoML) provides production-ready LLM serving with OpenAI-compatible endpoints, supporting models from Hugging Face and custom fine-tunes. Keeptrusts routes OpenLLM endpoints through its policy engine so every request is governed, redacted, and audited whether you are serving locally or on a cloud deployment.
Use this page when
- You need the exact command, config, API, or integration details for OpenLLM.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
Install OpenLLM and start a model server before configuring Keeptrusts:
```bash
# Install OpenLLM
pip install openllm

# Serve a model locally (downloads weights on first run)
openllm serve llama3:8b-instruct --port 3000

# Verify the OpenAI-compatible endpoint
curl http://localhost:3000/v1/models
```
The server binds to `http://localhost:3000` by default. For cloud deployments, set `OPENLLM_API_KEY` in your environment and use the deployment URL as `base_url`. Configure Keeptrusts and start `kt gateway run` only after the OpenLLM server is ready.
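Putting the order together, a minimal startup sketch (the `kt gateway run` flags match the command shown under For engineers below; the readiness loop is an assumption, not a required step):

```bash
# 1. Start the OpenLLM server and wait until it accepts requests
openllm serve llama3:8b-instruct --port 3000 &
until curl -sf http://localhost:3000/v1/models > /dev/null; do sleep 2; done

# 2. Cloud deployments only: export the API key before starting the gateway
# export OPENLLM_API_KEY="your-bentocloud-api-key"

# 3. Start the Keeptrusts gateway once the model server is ready
kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml
```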
Configuration
Add an OpenLLM target to your policy-config.yaml. The provider field identifies the runtime and the model being served.
```yaml
providers:
  targets:
    - id: openllm-llama3
      provider: openllm:chat:llama3:8b-instruct
      base_url: http://localhost:3000/v1
    - id: openllm-llama3-70b
      provider: openllm:chat:llama3:70b-instruct
      base_url: http://localhost:3001/v1
    - id: openllm-cloud-mistral
      provider: openllm:chat:mistral:7b-instruct
      base_url: https://your-openllm-deployment.bentoml.com/v1
      secret_key_ref:
        env: OPENLLM_API_KEY

policies:
  - id: openllm-governance
    description: Enforce data governance for OpenLLM traffic
    rules:
      - type: pii_detection
        action: redact
        patterns:
          - ssn
          - credit_card
          - email
      - type: content_filter
        action: block
        categories:
          - violence
          - self_harm
```
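With this pack loaded, a prompt that matches one of the `pii_detection` patterns is redacted before it reaches the model. A minimal sketch of a call that would trigger the `ssn` rule (the exact redaction placeholder Keeptrusts substitutes is not documented on this page):

```python
from openai import OpenAI

# All traffic goes through the Keeptrusts gateway, which applies the
# openllm-governance policy before forwarding to OpenLLM.
client = OpenAI(base_url="http://localhost:41002/v1", api_key="kt-your-api-key")

response = client.chat.completions.create(
    model="openllm:chat:llama3:8b-instruct",
    messages=[
        # The SSN below matches the `ssn` pattern, so the gateway redacts it
        # before the prompt reaches the model.
        {"role": "user", "content": "Summarize this note: patient SSN 123-45-6789."},
    ],
)
print(response.choices[0].message.content)
```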
Provider Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `id` | string | yes | — | Unique identifier for this target. Referenced in routing rules and shown in audit logs. |
| `provider` | string | yes | — | Provider string: `openllm`, `openllm:chat:<model>`, or `openllm:completion:<model>`. |
| `model` | string | no | Derived from `provider` | Overrides the model name separately when using the bare `openllm` provider. |
| `base_url` | string | no | `http://localhost:3000/v1` | Full base URL of the OpenLLM server, including the `/v1` path. |
| `secret_key_ref` | object | no | — | Reference to the environment variable holding a bearer token. Required for cloud deployments; optional for local. |
| `timeout_seconds` | integer | no | 30 | Request timeout for non-streaming calls, in seconds. |
| `format` | string | no | `openai` | Wire format. OpenLLM exposes an OpenAI-compatible API; this should not need to be changed. |
| `description` | string | no | — | Human-readable label shown in the console and audit logs. |
| `weight` | integer | no | 1 | Relative routing weight when this target belongs to a load-balanced group. |
| `health_probe` | boolean | no | false | When true, Keeptrusts periodically checks the base URL and marks the target unhealthy if unreachable. |
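For illustration, a single target that exercises the optional fields from this table (the values are placeholders, not recommendations):

```yaml
providers:
  targets:
    - id: openllm-llama3-weighted
      provider: openllm               # bare provider; model set separately
      model: llama3:8b-instruct
      base_url: http://localhost:3000/v1
      timeout_seconds: 60
      weight: 2                       # receives twice the traffic within its group
      health_probe: true              # periodically check the base URL
      description: Primary Llama 3 8B target
```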
Supported Models
OpenLLM supports any model available on Hugging Face. The following are commonly used with openllm serve:
| Model | Description | Use Case |
|---|---|---|
| `llama3:8b-instruct` | Meta Llama 3 8B Instruct | General chat and instruction following |
| `llama3:70b-instruct` | Meta Llama 3 70B Instruct | Complex reasoning and analysis |
| `mistral:7b-instruct` | Mistral 7B Instruct v0.3 | Fast, broadly capable chat |
| `phi3:medium-4k-instruct` | Microsoft Phi-3 Medium | Efficient reasoning, 4K context |
| `qwen2:7b-instruct` | Alibaba Qwen 2 7B | Multilingual instruction following |
To list models available to your OpenLLM installation:
```bash
openllm models
```
Use the exact model identifier shown by `openllm models` (including the tag) in the `provider` field.
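You can also confirm the identifier programmatically through the same OpenAI-compatible endpoint. A sketch against a local server, assuming it does not enforce authentication (`client.models.list()` is the standard OpenAI SDK call behind `/v1/models`):

```python
from openai import OpenAI

# Query the OpenLLM server directly; the dummy key satisfies the SDK,
# which requires a non-empty api_key even when the server ignores it.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed-locally")

for model in client.models.list():
    print(model.id)  # use this exact string in the provider field
```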
Client Examples
- Python
- Node.js
- cURL
```python
from openai import OpenAI

# Point at the Keeptrusts gateway, not OpenLLM directly
client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

response = client.chat.completions.create(
    model="openllm:chat:llama3:8b-instruct",  # matches the provider string in policy-config.yaml
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the difference between fine-tuning and RAG."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:41002/v1",
  apiKey: "kt-your-api-key",
});

async function main() {
  const response = await client.chat.completions.create({
    model: "openllm:chat:llama3:8b-instruct",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "What are the key considerations for production LLM deployments?" },
    ],
    temperature: 0.5,
    max_tokens: 512,
  });
  console.log(response.choices[0].message.content);
}

main().catch(console.error);
```
```bash
curl -s http://localhost:41002/v1/chat/completions \
  -H "Authorization: Bearer kt-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openllm:chat:llama3:8b-instruct",
    "messages": [
      { "role": "user", "content": "Summarize the key principles of AI governance." }
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }' | jq .
```
Streaming
Keeptrusts forwards streaming responses from OpenLLM as OpenAI-compatible Server-Sent Events (SSE). No changes are required on the client side.
- Python
- cURL
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

# stream=True returns an iterator of OpenAI-compatible chunks
stream = client.chat.completions.create(
    model="openllm:chat:llama3:8b-instruct",
    messages=[{"role": "user", "content": "Describe the architecture of a transformer model in detail."}],
    max_tokens=600,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```
```bash
curl -s http://localhost:41002/v1/chat/completions \
  -H "Authorization: Bearer kt-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openllm:chat:llama3:8b-instruct",
    "messages": [{ "role": "user", "content": "Count to 5 slowly." }],
    "stream": true,
    "max_tokens": 64
  }'
```
What Keeptrusts does during streaming:
- Policy rules (redaction, blocking) are applied to the assembled response before any chunk is forwarded to the client.
- Token usage fields from OpenLLM are surfaced in the final SSE chunk as standard `usage` counters.
- If a policy violation is detected mid-stream, the stream is terminated and a governance event is recorded in the audit log (see the client sketch below).
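Because a violation can end the stream early, clients should treat an aborted stream as a policy outcome rather than a transport failure. A minimal sketch, assuming the gateway surfaces the termination to the SDK as an API error:

```python
from openai import OpenAI, APIError

client = OpenAI(base_url="http://localhost:41002/v1", api_key="kt-your-api-key")

try:
    stream = client.chat.completions.create(
        model="openllm:chat:llama3:8b-instruct",
        messages=[{"role": "user", "content": "Draft a long report."}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
except APIError as exc:
    # Assumption: a mid-stream policy violation is surfaced as an API error;
    # the governance event itself is recorded in the Keeptrusts audit log.
    print(f"\nStream terminated by gateway policy: {exc}")
```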
Advanced Configuration
Multi-Model Routing
Run multiple OpenLLM server instances on different ports and load-balance across them:
```bash
# Terminal 1 — Llama 3 8B on port 3000
openllm serve llama3:8b-instruct --port 3000

# Terminal 2 — Mistral 7B on port 3001
openllm serve mistral:7b-instruct --port 3001
```
```yaml
pack:
  name: openllm-providers-2
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: openllm-llama3
      provider: openllm:chat:llama3:8b-instruct
      base_url: http://localhost:3000/v1
    - id: openllm-mistral
      provider: openllm:chat:mistral:7b-instruct
      base_url: http://localhost:3001/v1

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
Cloud Deployment with BentoCloud
When using a BentoCloud-managed OpenLLM deployment, supply the deployment URL and an API key:
```bash
export OPENLLM_API_KEY="your-bentocloud-api-key"
```

```yaml
pack:
  name: openllm-providers-3
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: openllm-bentocloud
      provider: openllm:chat:llama3:70b-instruct
      base_url: https://your-deployment.bentoml.com/v1
      secret_key_ref:
        env: OPENLLM_API_KEY

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
Timeouts for Large Models
Large models (70B+) can take much longer to generate responses. Increase `timeout_seconds` accordingly:
```yaml
pack:
  name: openllm-providers-4
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: openllm-large
      provider: openllm:chat:llama3:70b-instruct
      base_url: http://localhost:3000/v1
      timeout_seconds: 120

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
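Raise the client-side timeout as well, or the SDK may give up before the gateway does. The openai SDK accepts a `timeout` in seconds:

```python
from openai import OpenAI

# Match (or exceed) the target's timeout_seconds so slow 70B responses
# are not cut off by the client first.
client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
    timeout=120.0,
)
```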
Best Practices
- Start OpenLLM before `kt gateway run`. Keeptrusts performs a health check on startup if `health_probe: true` is set; a target whose server isn't ready will be marked unhealthy. A readiness sketch follows this list.
- Include `/v1` in `base_url`. OpenLLM's OpenAI-compatible API is served under the `/v1` path prefix; omitting it will result in 404 errors.
- Use environment variables for API keys. Never hard-code API keys in `policy-config.yaml`; use `secret_key_ref` to reference an environment variable.
- Match the model identifier exactly. Use the same model string returned by `openllm models` (e.g., `llama3:8b-instruct`). Keeptrusts forwards it verbatim to the OpenLLM server.
- Set `timeout_seconds` based on model size. 70B models may need 90–120 seconds per request at lower hardware budgets.
- Log all traffic. Keeptrusts records events and traces for every request; this is essential for compliance and debugging in both local and cloud deployments.
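A readiness check along these lines (a sketch using `requests`; it polls the same `/v1/models` route used in Prerequisites) can gate `kt gateway run` in startup scripts:

```python
import sys
import time

import requests

def wait_for_openllm(base_url: str = "http://localhost:3000/v1", attempts: int = 30) -> bool:
    """Poll the OpenLLM server's model list until it responds."""
    for _ in range(attempts):
        try:
            if requests.get(f"{base_url}/models", timeout=2).ok:
                return True
        except requests.ConnectionError:
            pass
        time.sleep(2)
    return False

if __name__ == "__main__":
    # Exit 0 when the server is ready so a wrapper can start `kt gateway run`.
    sys.exit(0 if wait_for_openllm() else 1)
```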
For AI systems
- Canonical terms: Keeptrusts gateway, OpenLLM, BentoML, self-hosted serving, provider target, policy-config.yaml.
- Config field names: `provider`, `model`, `base_url`, `format: "openai"`, `timeout_seconds`, `health_probe`.
- Key behavior: OpenLLM serves open-weight models with an OpenAI-compatible API; Keeptrusts routes to it and applies policies.
- Constraint: Set `timeout_seconds` based on model size; 70B models may need 90–120 seconds on lower hardware.
- Best next pages: vLLM integration, HuggingFace integration, Policy configuration.
For engineers
- Prerequisites: OpenLLM installed and serving a model (`openllm serve <model>`), `kt` CLI installed.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Validate: `curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: Bearer kt-your-api-key' -d '{"model":"your-model","messages":[{"role":"user","content":"hello"}]}'`.
- Set `timeout_seconds` based on model size; 70B models at lower hardware budgets need 90–120 seconds per request.
- Keeptrusts records events for every request; essential for compliance and debugging in both local and cloud deployments.
- No `secret_key_ref` is needed for local deployments.
For leaders
- OpenLLM enables self-hosted inference with full data sovereignty — no data leaves your infrastructure.
- BentoML deployment framework supports containerized, scalable serving — Keeptrusts adds the governance layer.
- Keeptrusts audit logging provides compliance evidence where the serving framework has no built-in audit trail.
- Hardware cost is the primary expense; plan capacity for model size and expected concurrency.
Next steps
- vLLM integration — alternative high-throughput self-hosted serving
- HuggingFace integration — hosted model inference endpoints
- Ollama integration — simpler local model serving
- Policy configuration — audit-logger and prompt-injection reference
- Quickstart — install `kt` and run your first gateway