Ollama

Ollama runs open-source models locally on CPU or GPU and exposes a simple REST API. Keeptrusts routes requests to Ollama through its enforcement engine, translating OpenAI-format requests into Ollama's native API format and converting responses back, so your existing OpenAI-compatible client code works without modification.

Use this page when

  • You need the exact command, config, API, or integration details for Ollama.
  • You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
  • You want a guided rollout rather than a reference page; in that case, follow the linked workflow pages in Next steps.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Prerequisites

Before configuring the Keeptrusts integration, make sure Ollama is installed and at least one model is available:

# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull models you intend to use
ollama pull llama3.3
ollama pull phi4
ollama pull mistral

# Verify the server is reachable (default port 11434)
curl http://localhost:11434/api/tags

Ollama must be running before kt gateway run starts. By default it binds to http://localhost:11434.
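
If you script your rollout, a quick preflight check against Ollama's /api/tags endpoint confirms the server is reachable and that every model referenced in your config is already pulled. This is a minimal sketch using only the Python standard library; the model list is an assumption you should adapt to your own targets.

import json
import sys
from urllib.error import URLError
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"
# Models referenced by the provider targets in policy-config.yaml (adjust as needed).
REQUIRED_MODELS = ["llama3.3", "phi4", "mistral", "nomic-embed-text"]

try:
    with urlopen(f"{OLLAMA_URL}/api/tags", timeout=5) as resp:
        tags = json.load(resp)
except URLError as exc:
    sys.exit(f"Ollama is not reachable at {OLLAMA_URL}: {exc}")

# /api/tags returns {"models": [{"name": "llama3.3:latest", ...}, ...]}
installed = {m["name"] for m in tags.get("models", [])}
missing = [
    m for m in REQUIRED_MODELS
    if not any(name == m or name.startswith(m + ":") for name in installed)
]
if missing:
    sys.exit("Missing models, run ollama pull first: " + ", ".join(missing))
print(f"Ollama is up with {len(installed)} model(s) available.")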

Configuration

Add an Ollama target to your policy-config.yaml. The provider field controls which model and endpoint kind Keeptrusts uses.

providers:
  targets:
    - id: ollama-llama3
      provider: ollama:chat:llama3.3
      base_url: http://localhost:11434
    - id: ollama-phi4
      provider: ollama:chat:phi4
      base_url: http://localhost:11434
    - id: ollama-mistral-completion
      provider: ollama:completion:mistral
      base_url: http://localhost:11434
    - id: ollama-embed
      provider: ollama:embedding:nomic-embed-text
      base_url: http://localhost:11434

policies:
  - id: local-privacy-policy
    description: Block PII in locally-run models
    rules:
      - type: pii_detection
        action: redact
        patterns:
          - ssn
          - credit_card
          - email

Provider Fields

  • id (string, required): Unique identifier for this target within the config. Used in routing rules and logs.
  • provider (string, required): Provider string: ollama, ollama:chat:<model>, ollama:completion:<model>, or ollama:embedding:<model>.
  • model (string, optional; default: derived from provider): Override the model name separately when using the bare ollama provider.
  • base_url (string, optional; default: http://localhost:11434): Full base URL of the Ollama server, including port.
  • secret_key_ref (object, optional; default: OLLAMA_API_KEY): Object reference to the environment variable holding the bearer token. Only needed if Ollama is running behind an auth gateway.
  • timeout_seconds (integer, optional; default: 30): Request timeout for non-streaming calls.
  • stream_timeout_seconds (integer, optional; default: 120): Timeout for the full streaming response.
  • max_context_tokens (integer, optional; default: model default): Override the context window size passed to Ollama.
  • description (string, optional): Human-readable label shown in the console and audit logs.
  • weight (integer, optional; default: 1): Relative routing weight when multiple targets are in the same group.
  • health_probe (boolean, optional; default: false): When true, Keeptrusts periodically checks GET / on the base URL and marks the target unhealthy if unreachable.

Supported Models

List locally available models at runtime:

ollama list

Pull additional models as needed:

ollama pull llama3.3:70b # large, high quality
ollama pull llama3.1:8b # balanced
ollama pull phi4 # Microsoft, strong reasoning
ollama pull mistral:7b # fast, broadly capable
ollama pull gemma2:9b # Google Gemma 2
ollama pull qwen2.5:72b # Alibaba Qwen, multilingual
ollama pull nomic-embed-text # embeddings model
ollama pull mxbai-embed-large # high-dimensional embeddings

Use the exact name shown by ollama list (including tag) in the provider field:

provider: "ollama:chat:llama3.3:70b"

Client Examples

from openai import OpenAI

# Point at the Keeptrusts gateway, not Ollama directly
client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

# Chat completion — same API as OpenAI
response = client.chat.completions.create(
    model="ollama:chat:llama3.3",  # matches the provider string in policy-config.yaml
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in plain English."},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)

# Embeddings
embed_response = client.embeddings.create(
    model="ollama:embedding:nomic-embed-text",
    input="Keeptrusts governs AI model interactions.",
)
print(embed_response.data[0].embedding[:5])
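
If a large model on CPU routinely takes longer than your client-side timeout allows, you can raise the limit per request. A minimal sketch using the OpenAI Python SDK's with_options helper, mirroring the timeout_seconds: 120 guidance from Best Practices; the gateway-side limit is configured separately on the target.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

# Allow up to 120 seconds for a single non-streaming call to a large local model.
slow_client = client.with_options(timeout=120.0)
response = slow_client.chat.completions.create(
    model="ollama:chat:llama3.3:70b",
    messages=[{"role": "user", "content": "Summarize the CAP theorem."}],
)
print(response.choices[0].message.content)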

Streaming

Keeptrusts handles Ollama's native NDJSON streaming format and converts it to OpenAI-compatible Server-Sent Events (SSE) for your client. No changes are needed on the client side.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

# stream=True yields OpenAI-style chunks as the gateway converts Ollama's NDJSON to SSE
stream = client.chat.completions.create(
    model="ollama:chat:llama3.3",
    messages=[{"role": "user", "content": "Write a short poem about distributed systems."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

What Keeptrusts does during streaming:

  • Ollama's NDJSON stream is parsed line by line and re-emitted as OpenAI-compatible SSE chunks.
  • Policy rules (redaction, blocking) are applied to the assembled response before any chunk is forwarded.
  • Ollama's eval_count / prompt_eval_count counters are surfaced in the final SSE chunk as usage.completion_tokens / usage.prompt_tokens.
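
Because token counts arrive only on the final chunk, capture usage while iterating rather than after the stream closes. A minimal sketch, assuming the gateway populates usage on the last SSE chunk as described above:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="kt-your-api-key")

usage = None
stream = client.chat.completions.create(
    model="ollama:chat:llama3.3",
    messages=[{"role": "user", "content": "Name three consensus algorithms."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage is not None:  # populated only on the final chunk
        usage = chunk.usage
print()
if usage:
    print(f"prompt_tokens={usage.prompt_tokens} completion_tokens={usage.completion_tokens}")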

Advanced Configuration

Multi-Model Routing

Use multiple Ollama targets with different weights to load-balance across models or GPU nodes:

pack:
  name: ollama-providers-3
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: ollama-llama3-large
      provider: ollama:chat:llama3.3:70b
      base_url: http://gpu-node-1:11434
      weight: 1
    - id: ollama-llama3-small
      provider: ollama:chat:llama3.1:8b
      base_url: http://gpu-node-2:11434
      weight: 3

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

Policy Enforcement on Local Models

Local inference does not mean ungoverned inference. Apply the same policy rules you use for cloud providers:

policies:
  - id: "local-data-policy"
    description: "Enforce data handling on all local inference"
    rules:
      - type: "pii_detection"
        action: "redact"
        patterns: ["ssn", "credit_card", "passport_number"]

      - type: "topic_block"
        action: "block"
        topics: ["weapons_manufacturing", "illegal_activity"]

      - type: "prompt_injection"
        action: "block"

      - type: "response_length"
        action: "truncate"
        max_tokens: 2048
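
To confirm the policy is actually enforced, send a request that deliberately contains a synthetic identifier and inspect what comes back. A minimal sketch; the exact redaction behavior (placeholder text, request side vs. response side) depends on your Keeptrusts configuration, so treat the final check as illustrative.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="kt-your-api-key")

# Synthetic SSN used only to exercise the pii_detection rule above.
response = client.chat.completions.create(
    model="ollama:chat:llama3.3",
    messages=[{"role": "user", "content": "My SSN is 123-45-6789. Repeat it back to me."}],
)
content = response.choices[0].message.content
print(content)

# With action: redact in effect, the raw value should not survive verbatim.
assert "123-45-6789" not in content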

Routing to Ollama by Request Metadata

routing:
  rules:
    - match:
        metadata:
          task_type: "embedding"
      target: "ollama-embed"

    - match:
        metadata:
          latency_class: "low"
      target: "ollama-phi4"

    - default:
        target: "ollama-llama3"

Best Practices

  • Pull models before starting the gateway. Keeptrusts will fail health probes if a model is missing. Run ollama pull <model> as part of your startup script (see the sketch after this list).
  • Enable health_probe: true for targets when running multiple Ollama nodes. Keeptrusts will automatically stop routing to unreachable instances.
  • Set timeout_seconds based on model size. A 70B model on CPU can take significantly longer than a 7B model on GPU. Start with timeout_seconds: 120 for large models.
  • Use stream: true for long responses. Streaming reduces time-to-first-token and avoids gateway buffering limits.
  • Keep base_url explicit. Even though http://localhost:11434 is the default, setting it explicitly makes the config portable and self-documenting.
  • Separate embedding targets from chat targets. This ensures policy rules scoped to chat traffic do not interfere with embedding pipelines.
  • Do not expose Ollama directly. Route all traffic through Keeptrusts, even in development, to ensure consistent audit logging and policy enforcement.
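
As a minimal sketch of such a startup script, the following pulls each required model and then launches the gateway. The kt gateway run flags are taken from the For engineers section below; the model list is an assumption to adapt to your config.

import subprocess

# Models referenced by provider targets in policy-config.yaml (adjust as needed).
REQUIRED_MODELS = ["llama3.3", "phi4", "nomic-embed-text"]

# Make sure every model is present locally before the gateway starts routing to it.
for model in REQUIRED_MODELS:
    subprocess.run(["ollama", "pull", model], check=True)

# Start the Keeptrusts gateway with the policy config from this page.
subprocess.run(
    [
        "kt", "gateway", "run",
        "--listen", "0.0.0.0:41002",
        "--policy-config", "policy-config.yaml",
    ],
    check=True,
)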

For AI systems

  • Canonical terms: Keeptrusts gateway, Ollama, local models, self-hosted, embeddings, provider target, policy-config.yaml, provider: "ollama".
  • Config field names: provider, model, base_url: "http://localhost:11434", format: "openai", timeout_seconds, health_probe.
  • Provider shorthand: ollama:chat:<model> (e.g., ollama:chat:llama3.3).
  • Key behavior: Ollama serves models locally with an OpenAI-compatible API; Keeptrusts routes to it and applies policies.
  • Best next pages: llama.cpp integration, vLLM integration, Policy configuration.

For engineers

  • Prerequisites: Ollama installed and running (ollama serve), model pulled (ollama pull llama3.3), kt CLI installed.
  • Start command: kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml.
  • Validate: curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"ollama:chat:llama3.3","messages":[{"role":"user","content":"hello"}]}'.
  • Separate embedding targets from chat targets to avoid policy conflicts between chat and embedding workloads.
  • Route all traffic through Keeptrusts — never expose Ollama directly — to ensure consistent audit logging even in development.
  • Ollama auto-loads models on first request; first-request latency may be higher while the model loads into memory.

For leaders

  • Ollama keeps all inference local — no data leaves the developer's machine, suitable for prototyping with sensitive data.
  • Zero per-token cost (hardware-only) makes Ollama ideal for development, testing, and low-volume production workloads.
  • Keeptrusts audit logging provides compliance evidence for local inference with no vendor-side audit trail.
  • Limited throughput compared to cloud providers — plan migration to vLLM or hosted providers when scaling beyond single-machine capacity.

Next steps