Ollama
Ollama runs open-source models locally on CPU or GPU and exposes a simple REST API. Keeptrusts routes traffic to Ollama through its enforcement engine, translating OpenAI-format requests into Ollama's native API format and converting responses back, so your existing OpenAI-compatible client code works without modification.
Use this page when
- You need the exact command, config, API, or integration details for Ollama.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
Before configuring the Keeptrusts integration, make sure Ollama is installed and at least one model is available:
# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh
# Pull models you intend to use
ollama pull llama3.3
ollama pull phi4
ollama pull mistral
# Verify the server is reachable (default port 11434)
curl http://localhost:11434/api/tags
Ollama must be running before `kt gateway run` starts. By default, Ollama binds to `http://localhost:11434`.
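If you script your startup, a quick reachability check prevents the gateway from starting against a server that is down or has no models pulled. The following is a minimal sketch using only the Python standard library and the `/api/tags` endpoint shown above; adjust the URL if your Ollama server is not on the default port.

```python
# Preflight check: confirm Ollama is up before launching the gateway.
# /api/tags is Ollama's model-listing endpoint (same one used above with curl).
import json
import sys
import urllib.request

OLLAMA_URL = "http://localhost:11434"

def ollama_ready(base_url: str = OLLAMA_URL) -> bool:
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            models = json.load(resp).get("models", [])
            print(f"Ollama is up with {len(models)} model(s) pulled.")
            return True
    except OSError as exc:
        print(f"Ollama not reachable at {base_url}: {exc}", file=sys.stderr)
        return False

if __name__ == "__main__":
    sys.exit(0 if ollama_ready() else 1)
```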
Configuration
Add an Ollama target to your policy-config.yaml. The provider field controls which model and endpoint kind Keeptrusts uses.
providers:
  targets:
    - id: ollama-llama3
      provider: ollama:chat:llama3.3
      base_url: http://localhost:11434
    - id: ollama-phi4
      provider: ollama:chat:phi4
      base_url: http://localhost:11434
    - id: ollama-mistral-completion
      provider: ollama:completion:mistral
      base_url: http://localhost:11434
    - id: ollama-embed
      provider: ollama:embedding:nomic-embed-text
      base_url: http://localhost:11434

policies:
  - id: local-privacy-policy
    description: Block PII in locally-run models
    rules:
      - type: pii_detection
        action: redact
        patterns:
          - ssn
          - credit_card
          - email
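As a convenience, you can sanity-check the targets in `policy-config.yaml` against the running servers before starting the gateway. This is an illustrative sketch, not a Keeptrusts command; it assumes PyYAML is installed and that your config uses the `providers.targets` layout shown above.

```python
# Sanity-check policy-config.yaml: every Ollama target should point at a
# reachable server. Illustrative only; requires `pip install pyyaml`.
import urllib.request
import yaml

with open("policy-config.yaml") as f:
    config = yaml.safe_load(f)

for target in config.get("providers", {}).get("targets", []):
    base_url = target.get("base_url", "http://localhost:11434")
    try:
        urllib.request.urlopen(f"{base_url}/api/tags", timeout=5)
        status = "reachable"
    except OSError as exc:
        status = f"UNREACHABLE ({exc})"
    print(f"{target['id']:<28} {target['provider']:<40} {status}")
```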
Provider Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `id` | string | yes | — | Unique identifier for this target within the config. Used in routing rules and logs. |
| `provider` | string | yes | — | Provider string: `ollama`, `ollama:chat:<model>`, `ollama:completion:<model>`, or `ollama:embedding:<model>`. |
| `model` | string | no | Derived from `provider` | Override the model name separately when using the bare `ollama` provider. |
| `base_url` | string | no | `http://localhost:11434` | Full base URL of the Ollama server, including port. |
| `secret_key_ref` | object | no | `OLLAMA_API_KEY` | Object reference to the environment variable holding the bearer token. Only needed if Ollama is running behind an auth gateway. |
| `timeout_seconds` | integer | no | 30 | Request timeout for non-streaming calls. |
| `stream_timeout_seconds` | integer | no | 120 | Timeout for the full streaming response. |
| `max_context_tokens` | integer | no | model default | Override the context window size passed to Ollama. |
| `description` | string | no | — | Human-readable label shown in the console and audit logs. |
| `weight` | integer | no | 1 | Relative routing weight when multiple targets are in the same group. |
| `health_probe` | boolean | no | false | When true, Keeptrusts periodically checks `GET /` on the base URL and marks the target unhealthy if unreachable. |
Supported Models
List locally available models at runtime:
ollama list
Pull additional models as needed:
ollama pull llama3.3:70b # large, high quality
ollama pull llama3.1:8b # balanced
ollama pull phi4 # Microsoft, strong reasoning
ollama pull mistral:7b # fast, broadly capable
ollama pull gemma2:9b # Google Gemma 2
ollama pull qwen2.5:72b # Alibaba Qwen, multilingual
ollama pull nomic-embed-text # embeddings model
ollama pull mxbai-embed-large # high-dimensional embeddings
Use the exact name shown by ollama list (including tag) in the provider field:
provider: "ollama:chat:llama3.3:70b"
Client Examples
Examples are shown in Python, Node.js, and cURL.
from openai import OpenAI

# Point at the Keeptrusts gateway, not Ollama directly
client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

# Chat completion — same API as OpenAI
response = client.chat.completions.create(
    model="ollama:chat:llama3.3",  # matches the provider string in the config
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in plain English."},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)

# Embeddings
embed_response = client.embeddings.create(
    model="ollama:embedding:nomic-embed-text",
    input="Keeptrusts governs AI model interactions.",
)
print(embed_response.data[0].embedding[:5])
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:41002/v1",
apiKey: "kt-your-api-key",
});
async function main() {
// Chat completion
const response = await client.chat.completions.create({
model: "ollama:chat:llama3.3",
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "What are the best practices for prompt engineering?" },
],
temperature: 0.5,
});
console.log(response.choices[0].message.content);
// Embeddings
const embed = await client.embeddings.create({
model: "ollama:embedding:nomic-embed-text",
input: "Keeptrusts enforces AI policies.",
});
console.log("Embedding dimensions:", embed.data[0].embedding.length);
}
main().catch(console.error);
# Chat completion through Keeptrusts gateway
curl -s http://localhost:41002/v1/chat/completions \
  -H "Authorization: Bearer kt-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama:chat:llama3.3",
    "messages": [
      { "role": "user", "content": "Summarize the key principles of AI governance." }
    ],
    "temperature": 0.7
  }' | jq .

# Embeddings
curl -s http://localhost:41002/v1/embeddings \
  -H "Authorization: Bearer kt-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama:embedding:nomic-embed-text",
    "input": "Document text to embed"
  }' | jq '.data[0].embedding | length'
Streaming
Keeptrusts handles Ollama's native NDJSON streaming format and converts it to OpenAI-compatible Server-Sent Events (SSE) for your client. No changes are needed on the client side.
Examples are shown in Python and cURL.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

# Request a streaming response; chunks arrive as they are generated
stream = client.chat.completions.create(
    model="ollama:chat:llama3.3",
    messages=[{"role": "user", "content": "Write a short poem about distributed systems."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
curl -s http://localhost:41002/v1/chat/completions \
  -H "Authorization: Bearer kt-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ollama:chat:llama3.3",
    "messages": [{ "role": "user", "content": "Count to 10." }],
    "stream": true
  }'
# Each line is a standard SSE data: {...} chunk
What Keeptrusts does during streaming (see the sketch below):
- Ollama NDJSON lines are parsed incrementally and re-emitted as OpenAI SSE `data:` chunks.
- Policy rules (redaction, blocking) are applied to the assembled response before any chunk is forwarded.
- Ollama's `eval_count` / `prompt_eval_count` counters are surfaced in the final SSE chunk as `usage.completion_tokens` / `usage.prompt_tokens`.
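To make the mapping concrete, here is a rough sketch of how an Ollama NDJSON chat chunk corresponds to an OpenAI-style SSE chunk. It is illustrative only, not Keeptrusts' implementation; ids, timestamps, and model fields are omitted.

```python
# Conceptual sketch of the NDJSON -> SSE mapping (not Keeptrusts' internals).
# A non-final Ollama chat chunk looks like:
#   {"message": {"role": "assistant", "content": "Hi"}, "done": false}
# and the final chunk adds "done": true, "eval_count", "prompt_eval_count".
import json

def ollama_chunk_to_sse(line: str) -> str:
    chunk = json.loads(line)
    if chunk.get("done"):
        payload = {
            "choices": [{"delta": {}, "finish_reason": "stop"}],
            "usage": {
                "prompt_tokens": chunk.get("prompt_eval_count", 0),
                "completion_tokens": chunk.get("eval_count", 0),
            },
        }
    else:
        payload = {
            "choices": [{"delta": {"content": chunk["message"]["content"]},
                         "finish_reason": None}],
        }
    # A real SSE stream would also terminate with a final "data: [DONE]" line.
    return f"data: {json.dumps(payload)}\n\n"
```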
Advanced Configuration
Multi-Model Routing
Use multiple Ollama targets with different weights to load-balance across models or GPU nodes:
pack:
  name: ollama-providers-3
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: ollama-llama3-large
      provider: ollama:chat:llama3.3:70b
      base_url: http://gpu-node-1:11434
      weight: 2
    - id: ollama-llama3-small
      provider: ollama:chat:llama3.1:8b
      base_url: http://gpu-node-2:11434
      weight: 1

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
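For intuition, the `weight` field behaves like a relative share in weighted selection. The sketch below is purely conceptual; the gateway's actual load-balancing algorithm is not specified on this page.

```python
# Illustrative weighted target selection (conceptual only).
import random

targets = [
    {"id": "ollama-llama3-large", "weight": 2},
    {"id": "ollama-llama3-small", "weight": 1},
]

def pick_target(targets):
    weights = [t.get("weight", 1) for t in targets]
    return random.choices(targets, weights=weights, k=1)[0]

# Roughly two of every three requests land on the large node.
print(pick_target(targets)["id"])
```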
Policy Enforcement on Local Models
Local inference does not mean ungoverned inference. Apply the same policy rules you use for cloud providers:
policies:
  - id: "local-data-policy"
    description: "Enforce data handling on all local inference"
    rules:
      - type: "pii_detection"
        action: "redact"
        patterns: ["ssn", "credit_card", "passport_number"]
      - type: "topic_block"
        action: "block"
        topics: ["weapons_manufacturing", "illegal_activity"]
      - type: "prompt_injection"
        action: "block"
      - type: "response_length"
        action: "truncate"
        max_tokens: 2048
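A quick way to confirm the policy is active is to send a prompt containing obviously fake PII through the gateway and inspect what comes back. The sketch below reuses the client setup from the examples above; how the redaction is rendered depends on your policy configuration.

```python
# Exercise the pii_detection rule above through the gateway. The SSN is fake;
# the exact redaction marker depends on your policy configuration.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="kt-your-api-key")

response = client.chat.completions.create(
    model="ollama:chat:llama3.3",
    messages=[{
        "role": "user",
        "content": "Summarize this note: applicant SSN 123-45-6789, prefers email contact.",
    }],
)
# The local model should only ever see the redacted prompt.
print(response.choices[0].message.content)
```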
Routing to Ollama by Request Metadata
routing:
  rules:
    - match:
        metadata:
          task_type: "embedding"
      target: "ollama-embed"
    - match:
        metadata:
          latency_class: "low"
      target: "ollama-phi4"
    - default:
        target: "ollama-llama3"
Best Practices
- Pull models before starting the gateway. Keeptrusts will fail health probes if a model is missing. Run `ollama pull <model>` as part of your startup script (see the sketch after this list).
- Enable `health_probe: true` for targets when running multiple Ollama nodes. Keeptrusts will automatically stop routing to unreachable instances.
- Set `timeout_seconds` based on model size. A 70B model on CPU can take significantly longer than a 7B model on GPU. Start with `timeout_seconds: 120` for large models.
- Use `stream: true` for long responses. Streaming reduces time-to-first-token and avoids gateway buffering limits.
- Keep `base_url` explicit. Even though `http://localhost:11434` is the default, setting it explicitly makes the config portable and self-documenting.
- Separate embedding targets from chat targets. This ensures policy rules scoped to chat traffic do not interfere with embedding pipelines.
- Do not expose Ollama directly. Route all traffic through Keeptrusts, even in development, to ensure consistent audit logging and policy enforcement.
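A minimal startup helper that follows the first practice might look like the sketch below. The `ollama` and `kt` commands are the ones used throughout this page; the model list and config path are placeholders to adapt.

```python
# Startup helper sketch: pull required models, then launch the gateway.
# Adapt MODELS and the policy-config path to your deployment.
import subprocess

MODELS = ["llama3.3", "phi4", "nomic-embed-text"]

for model in MODELS:
    # Safe to re-run: `ollama pull` verifies or updates an already-pulled model.
    subprocess.run(["ollama", "pull", model], check=True)

subprocess.run(
    ["kt", "gateway", "run", "--listen", "0.0.0.0:41002",
     "--policy-config", "policy-config.yaml"],
    check=True,
)
```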
For AI systems
- Canonical terms: Keeptrusts gateway, Ollama, local models, self-hosted, embeddings, provider target, `policy-config.yaml`, `provider: "ollama"`.
- Config field names: `provider`, `model`, `base_url: "http://localhost:11434"`, `format: "openai"`, `timeout_seconds`, `health_probe`.
- Provider shorthand: `ollama:chat:<model>` (e.g., `ollama:chat:llama3.3`).
- Key behavior: Ollama serves models locally with an OpenAI-compatible API; Keeptrusts routes to it and applies policies.
- Best next pages: llama.cpp integration, vLLM integration, Policy configuration.
For engineers
- Prerequisites: Ollama installed and running (`ollama serve`), model pulled (`ollama pull llama3.3`), `kt` CLI installed.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Validate: `curl http://localhost:41002/v1/chat/completions -H 'Authorization: Bearer kt-your-api-key' -H 'Content-Type: application/json' -d '{"model":"ollama:chat:llama3.3","messages":[{"role":"user","content":"hello"}]}'`.
- Separate embedding targets from chat targets to avoid policy conflicts between chat and embedding workloads.
- Route all traffic through Keeptrusts (never expose Ollama directly) to ensure consistent audit logging even in development.
- Ollama auto-loads models on first request; first-request latency may be higher while the model loads into memory.
For leaders
- Ollama keeps all inference local — no data leaves the developer's machine, suitable for prototyping with sensitive data.
- Zero per-token cost (hardware-only) makes Ollama ideal for development, testing, and low-volume production workloads.
- Keeptrusts audit logging provides compliance evidence for local inference with no vendor-side audit trail.
- Limited throughput compared to cloud providers — plan migration to vLLM or hosted providers when scaling beyond single-machine capacity.
Next steps
- llama.cpp integration — lower-level local inference with GGUF models
- vLLM integration — production-grade self-hosted serving
- HuggingFace integration — hosted open models when local capacity is insufficient
- Policy configuration — audit-logger and PII policy reference
- Quickstart — install `kt` and run your first gateway