Ollama
Ollama runs open-source models locally on CPU or GPU and exposes a simple REST API. Keeptrusts routes traffic to Ollama through its enforcement engine, translating OpenAI-format requests into Ollama's native API format and converting responses back, so your existing OpenAI-compatible client code works without modification.
Use this page when
- You need the exact command, config, API, or integration details for Ollama.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
Before configuring the Keeptrusts integration, make sure Ollama is installed and at least one model is available:
# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Pull models you intend to use
ollama pull llama3.3
ollama pull phi4
ollama pull mistral
# Verify the server is reachable (default port 11434)
curl http://localhost:11434/api/tags
Ollama must be running before `kt gateway run` starts. By default, Ollama binds to `http://localhost:11434`.
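If you script your startup, a quick reachability check prevents the gateway from starting against a server that is down or has no models pulled. The following is a minimal sketch using only the Python standard library and the `/api/tags` endpoint shown above; adjust the URL if your Ollama server is not on the default port.

```python
# Preflight check: confirm Ollama is up before launching the gateway.
# /api/tags is Ollama's model-listing endpoint (same one used above with curl).
import json
import sys
import urllib.request

OLLAMA_URL = "http://localhost:11434"

def ollama_ready(base_url: str = OLLAMA_URL) -> bool:
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            models = json.load(resp).get("models", [])
            print(f"Ollama is up with {len(models)} model(s) pulled.")
            return True
    except OSError as exc:
        print(f"Ollama not reachable at {base_url}: {exc}", file=sys.stderr)
        return False

if __name__ == "__main__":
    sys.exit(0 if ollama_ready() else 1)
```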
Configuration
Add an Ollama target to your policy-config.yaml. The provider field controls which model and endpoint kind Keeptrusts uses.
providers:
  targets:
    - id: ollama-llama3
      provider: ollama:chat:llama3.3
      base_url: http://localhost:11434
    - id: ollama-phi4
      provider: ollama:chat:phi4
      base_url: http://localhost:11434
    - id: ollama-mistral-completion
      provider: ollama:completion:mistral
      base_url: http://localhost:11434
    - id: ollama-embed
      provider: ollama:embedding:nomic-embed-text
      base_url: http://localhost:11434

policies:
  - id: local-privacy-policy
    description: Block PII in locally-run models
    rules:
      - type: pii_detection
        action: redact
        patterns:
          - ssn
          - credit_card
          - email
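As a convenience, you can sanity-check the targets in `policy-config.yaml` against the running servers before starting the gateway. This is an illustrative sketch, not a Keeptrusts command; it assumes PyYAML is installed and that your config uses the `providers.targets` layout shown above.

```python
# Sanity-check policy-config.yaml: every Ollama target should point at a
# reachable server. Illustrative only; requires `pip install pyyaml`.
import urllib.request
import yaml

with open("policy-config.yaml") as f:
    config = yaml.safe_load(f)

for target in config.get("providers", {}).get("targets", []):
    base_url = target.get("base_url", "http://localhost:11434")
    try:
        urllib.request.urlopen(f"{base_url}/api/tags", timeout=5)
        status = "reachable"
    except OSError as exc:
        status = f"UNREACHABLE ({exc})"
    print(f"{target['id']:<28} {target['provider']:<40} {status}")
```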
Provider Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `id` | string | yes | — | Unique identifier for this target within the config. Used in routing rules and logs. |
| `provider` | string | yes | — | Provider string: `ollama`, `ollama:chat:<model>`, `ollama:completion:<model>`, or `ollama:embedding:<model>`. |
| `model` | string | no | Derived from `provider` | Override the model name separately when using the bare `ollama` provider. |
| `base_url` | string | no | `http://localhost:11434` | Full base URL of the Ollama server, including port. |
| `secret_key_ref` | object | no | `OLLAMA_API_KEY` | Object reference to the environment variable holding the bearer token. Only needed if Ollama is running behind an auth gateway. |
| `timeout_seconds` | integer | no | 30 | Request timeout for non-streaming calls. |
| `stream_timeout_seconds` | integer | no | 120 | Timeout for the full streaming response. |
| `max_context_tokens` | integer | no | model default | Override the context window size passed to Ollama. |
| `description` | string | no | — | Human-readable label shown in the console and audit logs. |
| `weight` | integer | no | 1 | Relative routing weight when multiple targets are in the same group. |
| `health_probe` | boolean | no | false | When true, Keeptrusts periodically checks `GET /` on the base URL and marks the target unhealthy if unreachable. |
Supported Models
List locally available models at runtime:
ollama list
Pull additional models as needed:
ollama pull llama3.3:70b # large, high quality
ollama pull llama3.1:8b # balanced
ollama pull phi4 # Microsoft, strong reasoning
ollama pull mistral:7b # fast, broadly capable
ollama pull gemma2:9b # Google Gemma 2
ollama pull qwen2.5:72b # Alibaba Qwen, multilingual
ollama pull nomic-embed-text # embeddings model
ollama pull mxbai-embed-large # high-dimensional embeddings
Use the exact name shown by ollama list (including tag) in the provider field:
provider: "ollama:chat:llama3.3:70b"
Client Examples
Examples are shown in Python, Node.js, and cURL.
from openai import OpenAI

# Point at the Keeptrusts gateway, not Ollama directly
client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

# Chat completion — same API as OpenAI
response = client.chat.completions.create(
    model="ollama:chat:llama3.3",  # matches the provider string in the config
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in plain English."},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)

# Embeddings
embed_response = client.embeddings.create(
    model="ollama:embedding:nomic-embed-text",
    input="Keeptrusts governs AI model interactions.",
)
print(embed_response.data[0].embedding[:5])
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:41002/v1",
apiKey: "kt-your-api-key",
});
async function main() {
// Chat completion
const response = await client.chat.completions.create({
model: "ollama:chat:llama3.3",
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "What are the best practices for prompt engineering?" },
],
temperature: 0.5,
});
console.log(response.choices[0].message.content);
// Embeddings
const embed = await client.embeddings.create({
model: "ollama:embedding:nomic-embed-text",
input: "Keeptrusts enforces AI policies.",
});
console.log("Embedding dimensions:", embed.data[0].embedding.length);
}
main().catch(console.error);
# Chat completion through Keeptrusts gateway
curl -s http://localhost:41002/v1/chat/completions \
  -H "Authorization: Bearer kt-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama:chat:llama3.3",
    "messages": [
      { "role": "user", "content": "Summarize the key principles of AI governance." }
    ],
    "temperature": 0.7
  }' | jq .

# Embeddings
curl -s http://localhost:41002/v1/embeddings \
  -H "Authorization: Bearer kt-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama:embedding:nomic-embed-text",
    "input": "Document text to embed"
  }' | jq '.data[0].embedding | length'
Streaming
Keeptrusts handles Ollama's native NDJSON streaming format and converts it to OpenAI-compatible Server-Sent Events (SSE) for your client. No changes are needed on the client side.
Examples are shown in Python and cURL.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

# Request a streaming response; chunks arrive as they are generated
stream = client.chat.completions.create(
    model="ollama:chat:llama3.3",
    messages=[{"role": "user", "content": "Write a short poem about distributed systems."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
curl -s http://localhost:41002/v1/chat/completions \
  -H "Authorization: Bearer kt-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama:chat:llama3.3",
    "messages": [{ "role": "user", "content": "Count to 10." }],
    "stream": true
  }'
# Each line is a standard SSE data: {...} chunk
What Keeptrusts does during streaming (see the sketch below):
- Ollama NDJSON lines are parsed incrementally and re-emitted as OpenAI SSE `data:` chunks.
- Policy rules (redaction, blocking) are applied to the assembled response before any chunk is forwarded.
- Ollama's `eval_count` / `prompt_eval_count` counters are surfaced in the final SSE chunk as `usage.completion_tokens` / `usage.prompt_tokens`.
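To make the mapping concrete, here is a rough sketch of how an Ollama NDJSON chat chunk corresponds to an OpenAI-style SSE chunk. It is illustrative only, not Keeptrusts' implementation; ids, timestamps, and model fields are omitted.

```python
# Conceptual sketch of the NDJSON -> SSE mapping (not Keeptrusts' internals).
# A non-final Ollama chat chunk looks like:
#   {"message": {"role": "assistant", "content": "Hi"}, "done": false}
# and the final chunk adds "done": true, "eval_count", "prompt_eval_count".
import json

def ollama_chunk_to_sse(line: str) -> str:
    chunk = json.loads(line)
    if chunk.get("done"):
        payload = {
            "choices": [{"delta": {}, "finish_reason": "stop"}],
            "usage": {
                "prompt_tokens": chunk.get("prompt_eval_count", 0),
                "completion_tokens": chunk.get("eval_count", 0),
            },
        }
    else:
        payload = {
            "choices": [{"delta": {"content": chunk["message"]["content"]},
                         "finish_reason": None}],
        }
    # A real SSE stream would also terminate with a final "data: [DONE]" line.
    return f"data: {json.dumps(payload)}\n\n"
```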
Advanced Configuration
Multi-Model Routing
Use multiple Ollama targets with different weights to load-balance across models or GPU nodes:
pack:
  name: ollama-providers-3
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: ollama-llama3-large
      provider: ollama:chat:llama3.3:70b
      base_url: http://gpu-node-1:11434
      weight: 2
    - id: ollama-llama3-small
      provider: ollama:chat:llama3.1:8b
      base_url: http://gpu-node-2:11434
      weight: 1

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
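For intuition, the `weight` field behaves like a relative share in weighted selection. The sketch below is purely conceptual; the gateway's actual load-balancing algorithm is not specified on this page.

```python
# Illustrative weighted target selection (conceptual only).
import random

targets = [
    {"id": "ollama-llama3-large", "weight": 2},
    {"id": "ollama-llama3-small", "weight": 1},
]

def pick_target(targets):
    weights = [t.get("weight", 1) for t in targets]
    return random.choices(targets, weights=weights, k=1)[0]

# Roughly two of every three requests land on the large node.
print(pick_target(targets)["id"])
```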
Policy Enforcement on Local Models
Local inference does not mean ungoverned inference. Apply the same policy rules you use for cloud providers:
policies:
  - id: "local-data-policy"
    description: "Enforce data handling on all local inference"
    rules:
      - type: "pii_detection"
        action: "redact"
        patterns: ["ssn", "credit_card", "passport_number"]
      - type: "topic_block"
        action: "block"
        topics: ["weapons_manufacturing", "illegal_activity"]
      - type: "prompt_injection"
        action: "block"
      - type: "response_length"
        action: "truncate"
        max_tokens: 2048
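A quick way to confirm the policy is active is to send a prompt containing obviously fake PII through the gateway and inspect what comes back. The sketch below reuses the client setup from the examples above; how the redaction is rendered depends on your policy configuration.

```python
# Exercise the pii_detection rule above through the gateway. The SSN is fake;
# the exact redaction marker depends on your policy configuration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="kt-your-api-key")

response = client.chat.completions.create(
    model="ollama:chat:llama3.3",
    messages=[{
        "role": "user",
        "content": "Summarize this note: applicant SSN 123-45-6789, prefers email contact.",
    }],
)
# The local model should only ever see the redacted prompt.
print(response.choices[0].message.content)
```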
Routing to Ollama by Request Metadata
routing:
  rules:
    - match:
        metadata:
          task_type: "embedding"
      target: "ollama-embed"
    - match:
        metadata:
          latency_class: "low"
      target: "ollama-phi4"
    - default:
        target: "ollama-llama3"
Best Practices
- Pull models before starting the gateway. Keeptrusts will fail health probes if a model is missing. Run `ollama pull <model>` as part of your startup script (see the sketch after this list).
- Enable `health_probe: true` for targets when running multiple Ollama nodes. Keeptrusts will automatically stop routing to unreachable instances.
- Set `timeout_seconds` based on model size. A 70B model on CPU can take significantly longer than a 7B model on GPU. Start with `timeout_seconds: 120` for large models.
- Use `stream: true` for long responses. Streaming reduces time-to-first-token and avoids gateway buffering limits.
- Keep `base_url` explicit. Even though `http://localhost:11434` is the default, setting it explicitly makes the config portable and self-documenting.
- Separate embedding targets from chat targets. This ensures policy rules scoped to chat traffic do not interfere with embedding pipelines.
- Do not expose Ollama directly. Route all traffic through Keeptrusts, even in development, to ensure consistent audit logging and policy enforcement.
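A minimal startup helper that follows the first practice might look like the sketch below. The `ollama` and `kt` commands are the ones used throughout this page; the model list and config path are placeholders to adapt.

```python
# Startup helper sketch: pull required models, then launch the gateway.
# Adapt MODELS and the policy-config path to your deployment.
import subprocess

MODELS = ["llama3.3", "phi4", "nomic-embed-text"]

for model in MODELS:
    # Safe to re-run: `ollama pull` verifies or updates an already-pulled model.
    subprocess.run(["ollama", "pull", model], check=True)

subprocess.run(
    ["kt", "gateway", "run", "--listen", "0.0.0.0:41002",
     "--policy-config", "policy-config.yaml"],
    check=True,
)
```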
For AI systems
- Canonical terms: Keeptrusts gateway, Ollama, local models, self-hosted, embeddings, provider target, `policy-config.yaml`, `provider: "ollama"`.
- Config field names: `provider`, `model`, `base_url: "http://localhost:11434"`, `format: "openai"`, `timeout_seconds`, `health_probe`.
- Provider shorthand: `ollama:chat:<model>` (e.g., `ollama:chat:llama3.3`).
- Key behavior: Ollama serves models locally with an OpenAI-compatible API; Keeptrusts routes to it and applies policies.
- Best next pages: llama.cpp integration, vLLM integration, Policy configuration.
For engineers
- Prerequisites: Ollama installed and running (`ollama serve`), model pulled (`ollama pull llama3.3`), `kt` CLI installed.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Validate: `curl http://localhost:41002/v1/chat/completions -H 'Authorization: Bearer kt-your-api-key' -H 'Content-Type: application/json' -d '{"model":"ollama:chat:llama3.3","messages":[{"role":"user","content":"hello"}]}'`.
- Separate embedding targets from chat targets to avoid policy conflicts between chat and embedding workloads.
- Route all traffic through Keeptrusts (never expose Ollama directly) to ensure consistent audit logging even in development.
- Ollama auto-loads models on first request; first-request latency may be higher while the model loads into memory.
For leaders
- Ollama keeps all inference local — no data leaves the developer's machine, suitable for prototyping with sensitive data.
- Zero per-token cost (hardware-only) makes Ollama ideal for development, testing, and low-volume production workloads.
- Keeptrusts audit logging provides compliance evidence for local inference with no vendor-side audit trail.
- Limited throughput compared to cloud providers — plan migration to vLLM or hosted providers when scaling beyond single-machine capacity.
Next steps
- llama.cpp integration — lower-level local inference with GGUF models
- vLLM integration — production-grade self-hosted serving
- HuggingFace integration — hosted open models when local capacity is insufficient
- Policy configuration — audit-logger and PII policy reference
- Quickstart — install `kt` and run your first gateway