OpenLLM
OpenLLM (by BentoML) provides production-ready LLM serving with OpenAI-compatible endpoints, supporting models from Hugging Face and custom fine-tunes. Keeptrusts routes OpenLLM endpoints through its policy engine so every request is governed, redacted, and audited whether you are serving locally or on a cloud deployment.
Use this page when
- You need the exact command, config, API, or integration details for OpenLLM.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
Install OpenLLM and start a model server before configuring Keeptrusts:
```bash
# Install OpenLLM
pip install openllm

# Serve a model locally (downloads weights on first run)
openllm serve llama3:8b-instruct --port 3000

# Verify the OpenAI-compatible endpoint
curl http://localhost:3000/v1/models
```
The server binds to `http://localhost:3000` by default. For cloud deployments, set `OPENLLM_API_KEY` in your environment and use the deployment URL as `base_url`. Configure Keeptrusts and start `kt gateway run` only after the OpenLLM server is ready.
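Putting the order together, a minimal startup sketch (the `kt gateway run` flags match the command shown under For engineers below; the readiness loop is an assumption, not a required step):

```bash
# 1. Start the OpenLLM server and wait until it accepts requests
openllm serve llama3:8b-instruct --port 3000 &
until curl -sf http://localhost:3000/v1/models > /dev/null; do sleep 2; done

# 2. Cloud deployments only: export the API key before starting the gateway
# export OPENLLM_API_KEY="your-bentocloud-api-key"

# 3. Start the Keeptrusts gateway once the model server is ready
kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml
```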
Configuration
Add an OpenLLM target to your policy-config.yaml. The provider field identifies the runtime and the model being served.
```yaml
providers:
  targets:
    - id: openllm-llama3
      provider: openllm:chat:llama3:8b-instruct
      base_url: http://localhost:3000/v1
    - id: openllm-llama3-70b
      provider: openllm:chat:llama3:70b-instruct
      base_url: http://localhost:3001/v1
    - id: openllm-cloud-mistral
      provider: openllm:chat:mistral:7b-instruct
      base_url: https://your-openllm-deployment.bentoml.com/v1
      secret_key_ref:
        env: OPENLLM_API_KEY

policies:
  - id: openllm-governance
    description: Enforce data governance for OpenLLM traffic
    rules:
      - type: pii_detection
        action: redact
        patterns:
          - ssn
          - credit_card
          - email
      - type: content_filter
        action: block
        categories:
          - violence
          - self_harm
```
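With this pack loaded, a prompt that matches one of the `pii_detection` patterns is redacted before it reaches the model. A minimal sketch of a call that would trigger the `ssn` rule (the exact redaction placeholder Keeptrusts substitutes is not documented on this page):

```python
from openai import OpenAI

# All traffic goes through the Keeptrusts gateway, which applies the
# openllm-governance policy before forwarding to OpenLLM.
client = OpenAI(base_url="http://localhost:41002/v1", api_key="kt-your-api-key")

response = client.chat.completions.create(
    model="openllm:chat:llama3:8b-instruct",
    messages=[
        # The SSN below matches the `ssn` pattern, so the gateway redacts it
        # before the prompt reaches the model.
        {"role": "user", "content": "Summarize this note: patient SSN 123-45-6789."},
    ],
)
print(response.choices[0].message.content)
```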
Provider Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `id` | string | yes | — | Unique identifier for this target. Referenced in routing rules and shown in audit logs. |
| `provider` | string | yes | — | Provider string: `openllm`, `openllm:chat:<model>`, or `openllm:completion:<model>`. |
| `model` | string | no | Derived from `provider` | Overrides the model name separately when using the bare `openllm` provider. |
| `base_url` | string | no | `http://localhost:3000/v1` | Full base URL of the OpenLLM server, including the `/v1` path. |
| `secret_key_ref` | object | no | — | Reference to the environment variable holding a bearer token. Required for cloud deployments; optional for local. |
| `timeout_seconds` | integer | no | 30 | Request timeout for non-streaming calls, in seconds. |
| `format` | string | no | `openai` | Wire format. OpenLLM exposes an OpenAI-compatible API; this should not need to be changed. |
| `description` | string | no | — | Human-readable label shown in the console and audit logs. |
| `weight` | integer | no | 1 | Relative routing weight when this target belongs to a load-balanced group. |
| `health_probe` | boolean | no | false | When true, Keeptrusts periodically checks the base URL and marks the target unhealthy if unreachable. |
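For illustration, a single target that exercises the optional fields from this table (the values are placeholders, not recommendations):

```yaml
providers:
  targets:
    - id: openllm-llama3-weighted
      provider: openllm               # bare provider; model set separately
      model: llama3:8b-instruct
      base_url: http://localhost:3000/v1
      timeout_seconds: 60
      weight: 2                       # receives twice the traffic within its group
      health_probe: true              # periodically check the base URL
      description: Primary Llama 3 8B target
```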
Supported Models
OpenLLM supports any model available on Hugging Face. The following are commonly used with openllm serve:
| Model | Description | Use Case |
|---|---|---|
| `llama3:8b-instruct` | Meta Llama 3 8B Instruct | General chat and instruction following |
| `llama3:70b-instruct` | Meta Llama 3 70B Instruct | Complex reasoning and analysis |
| `mistral:7b-instruct` | Mistral 7B Instruct v0.3 | Fast, broadly capable chat |
| `phi3:medium-4k-instruct` | Microsoft Phi-3 Medium | Efficient reasoning, 4K context |
| `qwen2:7b-instruct` | Alibaba Qwen 2 7B | Multilingual instruction following |
To list models available to your OpenLLM installation:
```bash
openllm models
```
Use the exact model identifier shown by `openllm models` (including the tag) in the `provider` field.
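You can also confirm the identifier programmatically through the same OpenAI-compatible endpoint. A sketch against a local server, assuming it does not enforce authentication (`client.models.list()` is the standard OpenAI SDK call behind `/v1/models`):

```python
from openai import OpenAI

# Query the OpenLLM server directly; the dummy key satisfies the SDK,
# which requires a non-empty api_key even when the server ignores it.
client = OpenAI(base_url="http://localhost:3000/v1", api_key="not-needed-locally")

for model in client.models.list():
    print(model.id)  # use this exact string in the provider field
```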
Client Examples
- Python
- Node.js
- cURL
```python
from openai import OpenAI

# Point at the Keeptrusts gateway, not OpenLLM directly
client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

response = client.chat.completions.create(
    model="openllm:chat:llama3:8b-instruct",  # matches the provider string in policy-config.yaml
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the difference between fine-tuning and RAG."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:41002/v1",
  apiKey: "kt-your-api-key",
});

async function main() {
  const response = await client.chat.completions.create({
    model: "openllm:chat:llama3:8b-instruct",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "What are the key considerations for production LLM deployments?" },
    ],
    temperature: 0.5,
    max_tokens: 512,
  });
  console.log(response.choices[0].message.content);
}

main().catch(console.error);
```
```bash
curl -s http://localhost:41002/v1/chat/completions \
  -H "Authorization: Bearer kt-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openllm:chat:llama3:8b-instruct",
    "messages": [
      { "role": "user", "content": "Summarize the key principles of AI governance." }
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }' | jq .
```
Streaming
Keeptrusts forwards streaming responses from OpenLLM as OpenAI-compatible Server-Sent Events (SSE). No changes are required on the client side.
- Python
- cURL
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

# stream=True returns an iterator of OpenAI-compatible chunks
stream = client.chat.completions.create(
    model="openllm:chat:llama3:8b-instruct",
    messages=[{"role": "user", "content": "Describe the architecture of a transformer model in detail."}],
    max_tokens=600,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```
```bash
curl -s http://localhost:41002/v1/chat/completions \
  -H "Authorization: Bearer kt-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openllm:chat:llama3:8b-instruct",
    "messages": [{ "role": "user", "content": "Count to 5 slowly." }],
    "stream": true,
    "max_tokens": 64
  }'
```
What Keeptrusts does during streaming:
- Policy rules (redaction, blocking) are applied to the assembled response before any chunk is forwarded to the client.
- Token usage fields from OpenLLM are surfaced in the final SSE chunk as standard `usage` counters.
- If a policy violation is detected mid-stream, the stream is terminated and a governance event is recorded in the audit log (see the client sketch below).
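Because a violation can end the stream early, clients should treat an aborted stream as a policy outcome rather than a transport failure. A minimal sketch, assuming the gateway surfaces the termination to the SDK as an API error:

```python
from openai import OpenAI, APIError

client = OpenAI(base_url="http://localhost:41002/v1", api_key="kt-your-api-key")

try:
    stream = client.chat.completions.create(
        model="openllm:chat:llama3:8b-instruct",
        messages=[{"role": "user", "content": "Draft a long report."}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
except APIError as exc:
    # Assumption: a mid-stream policy violation is surfaced as an API error;
    # the governance event itself is recorded in the Keeptrusts audit log.
    print(f"\nStream terminated by gateway policy: {exc}")
```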
Advanced Configuration
Multi-Model Routing
Run multiple OpenLLM server instances on different ports and load-balance across them:
```bash
# Terminal 1 — Llama 3 8B on port 3000
openllm serve llama3:8b-instruct --port 3000

# Terminal 2 — Mistral 7B on port 3001
openllm serve mistral:7b-instruct --port 3001
```
```yaml
pack:
  name: openllm-providers-2
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: openllm-llama3
      provider: openllm:chat:llama3:8b-instruct
      base_url: http://localhost:3000/v1
    - id: openllm-mistral
      provider: openllm:chat:mistral:7b-instruct
      base_url: http://localhost:3001/v1

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
Cloud Deployment with BentoCloud
When using a BentoCloud-managed OpenLLM deployment, supply the deployment URL and an API key:
```bash
export OPENLLM_API_KEY="your-bentocloud-api-key"
```

```yaml
pack:
  name: openllm-providers-3
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: openllm-bentocloud
      provider: openllm:chat:llama3:70b-instruct
      base_url: https://your-deployment.bentoml.com/v1
      secret_key_ref:
        env: OPENLLM_API_KEY

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
Timeouts for Large Models
Large models (70B+) can take much longer to generate responses. Increase `timeout_seconds` accordingly:
```yaml
pack:
  name: openllm-providers-4
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: openllm-large
      provider: openllm:chat:llama3:70b-instruct
      base_url: http://localhost:3000/v1
      timeout_seconds: 120

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
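Raise the client-side timeout as well, or the SDK may give up before the gateway does. The openai SDK accepts a `timeout` in seconds:

```python
from openai import OpenAI

# Match (or exceed) the target's timeout_seconds so slow 70B responses
# are not cut off by the client first.
client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
    timeout=120.0,
)
```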
Best Practices
- Start OpenLLM before `kt gateway run`. Keeptrusts performs a health check on startup if `health_probe: true` is set; a target whose server isn't ready will be marked unhealthy. A readiness sketch follows this list.
- Include `/v1` in `base_url`. OpenLLM's OpenAI-compatible API is served under the `/v1` path prefix; omitting it will result in 404 errors.
- Use environment variables for API keys. Never hard-code API keys in `policy-config.yaml`; use `secret_key_ref` to reference an environment variable.
- Match the model identifier exactly. Use the same model string returned by `openllm models` (e.g., `llama3:8b-instruct`). Keeptrusts forwards it verbatim to the OpenLLM server.
- Set `timeout_seconds` based on model size. 70B models may need 90–120 seconds per request at lower hardware budgets.
- Log all traffic. Keeptrusts records events and traces for every request; this is essential for compliance and debugging in both local and cloud deployments.
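A readiness check along these lines (a sketch using `requests`; it polls the same `/v1/models` route used in Prerequisites) can gate `kt gateway run` in startup scripts:

```python
import sys
import time

import requests

def wait_for_openllm(base_url: str = "http://localhost:3000/v1", attempts: int = 30) -> bool:
    """Poll the OpenLLM server's model list until it responds."""
    for _ in range(attempts):
        try:
            if requests.get(f"{base_url}/models", timeout=2).ok:
                return True
        except requests.ConnectionError:
            pass
        time.sleep(2)
    return False

if __name__ == "__main__":
    # Exit 0 when the server is ready so a wrapper can start `kt gateway run`.
    sys.exit(0 if wait_for_openllm() else 1)
```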
For AI systems
- Canonical terms: Keeptrusts gateway, OpenLLM, BentoML, self-hosted serving, provider target, policy-config.yaml.
- Config field names: `provider`, `model`, `base_url`, `format: "openai"`, `timeout_seconds`, `health_probe`.
- Key behavior: OpenLLM serves open-weight models with an OpenAI-compatible API; Keeptrusts routes to it and applies policies.
- Constraint: Set `timeout_seconds` based on model size; 70B models may need 90–120 seconds on lower hardware.
- Best next pages: vLLM integration, HuggingFace integration, Policy configuration.
For engineers
- Prerequisites: OpenLLM installed and serving a model (`openllm serve <model>`), `kt` CLI installed.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Validate: `curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -H 'Authorization: Bearer kt-your-api-key' -d '{"model":"your-model","messages":[{"role":"user","content":"hello"}]}'`.
- Set `timeout_seconds` based on model size; 70B models at lower hardware budgets need 90–120 seconds per request.
- Keeptrusts records events for every request; essential for compliance and debugging in both local and cloud deployments.
- No `secret_key_ref` is needed for local deployments.
For leaders
- OpenLLM enables self-hosted inference with full data sovereignty — no data leaves your infrastructure.
- BentoML deployment framework supports containerized, scalable serving — Keeptrusts adds the governance layer.
- Keeptrusts audit logging provides compliance evidence where the serving framework has no built-in audit trail.
- Hardware cost is the primary expense; plan capacity for model size and expected concurrency.
Next steps
- vLLM integration — alternative high-throughput self-hosted serving
- HuggingFace integration — hosted model inference endpoints
- Ollama integration — simpler local model serving
- Policy configuration — audit-logger and prompt-injection reference
- Quickstart — install `kt` and run your first gateway