Llamafile

Llamafile packages GGUF models into single self-contained executables that run on any operating system without installation, drivers, or dependencies. Keeptrusts connects to llamafile's built-in llama.cpp HTTP server through its policy gateway, giving you full governance over local inference without changing your existing OpenAI-compatible client code.

Use this page when

You need the exact command, config, API, or integration details for Llamafile.
You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.

Primary audience

Primary: AI Agents, Technical Engineers
Secondary: Technical Leaders

Prerequisites

Download a .llamafile executable for the model you want to run. No installation is required — just make it executable and run it.

# Download a llamafile (example: LLaVA 1.5 7B)
curl -LO https://huggingface.co/Mozilla/llava-1.5-7b-hf-llamafile/resolve/main/llava-1.5-7b-q4.llamafile

# Make executable (macOS / Linux)
chmod +x llava-1.5-7b-q4.llamafile

# Start the built-in llama.cpp server on port 8080 (OpenAI-compatible)
./llava-1.5-7b-q4.llamafile --server --port 8080

# Verify it is reachable
curl http://localhost:8080/v1/models

On Windows, rename the file with a .exe extension before running. The server binds to http://localhost:8080 by default. Keeptrusts must be configured and kt gateway run started after the llamafile server is up.

Configuration

Add a llamafile target to your policy-config.yaml. The provider field identifies the runtime and optionally specifies the model name served by that instance.

providers:
  targets:
  - id: llamafile-chat
    provider: llamafile:chat:llava-1.5-7b
    base_url: http://localhost:8080
  - id: llamafile-mistral
    provider: llamafile:chat:mistral-7b-instruct
    base_url: http://localhost:8081
  - id: llamafile-codegen
    provider: llamafile:completion:wizardcoder
    base_url: http://localhost:8082
policies:
- id: local-privacy-policy
  description: Block PII before sending to local model
  rules:
  - type: pii_detection
    action: redact
    patterns:
    - ssn
    - credit_card
    - email
  - type: content_filter
    action: block
    categories:
    - violence
    - self_harm

Provider Fields

Field	Type	Required	Default	Description
`id`	string	yes	—	Unique identifier for this target. Referenced in routing rules and shown in audit logs.
`provider`	string	yes	—	Provider string: `llamafile`, `llamafile:chat:<model>`, or `llamafile:completion:<model>`.
`model`	string	no	Derived from provider	Override model name separately when using the bare `llamafile` provider.
`base_url`	string	no	`http://localhost:8080`	Base URL of the llamafile server instance, including port.
`secret_key_ref`	object	no	—	Object reference to the environment variable holding a bearer token. Only needed if the llamafile server is behind an auth gateway.
`timeout_seconds`	integer	no	`30`	Request timeout for non-streaming calls, in seconds.
`format`	string	no	`openai`	Wire format. Llamafile uses the OpenAI-compatible format; this should not need to be changed.
`description`	string	no	—	Human-readable label shown in the console and audit logs.
`weight`	integer	no	`1`	Relative routing weight when this target belongs to a load-balanced group.
`health_probe`	boolean	no	`false`	When `true`, Keeptrusts periodically checks the base URL and marks the target unhealthy if unreachable.

Supported Models

Llamafile supports any model available in GGUF format. The following are commonly used with official pre-packaged executables:

Model	Description	Use Case
`llava-1.5-7b`	LLaVA 1.5 7B (multimodal)	Vision + language tasks
`mistral-7b-instruct`	Mistral 7B Instruct v0.3	General instruction following
`gemma-2-2b-it`	Google Gemma 2 2B Instruct	Lightweight, efficient chat
`wizardcoder-python-13b`	WizardCoder 13B	Code generation and completion
`tinyllama-1.1b-chat`	TinyLlama 1.1B Chat	Edge deployments, low memory

Use the model name you pass to --model (or the executable name) as the model identifier in the provider field. Keeptrusts forwards it verbatim to the llamafile server.

Client Examples

Python
Node.js
cURL

from openai import OpenAI

# Point at Keeptrusts gateway, not llamafile directly
client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

response = client.chat.completions.create(
    model="llamafile:chat:mistral-7b-instruct",  # matches provider id in config
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the attention mechanism in transformers."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:41002/v1",
  apiKey: "kt-your-api-key",
});

async function main() {
  const response = await client.chat.completions.create({
    model: "llamafile:chat:mistral-7b-instruct",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "What are the best practices for local LLM deployment?" },
    ],
    temperature: 0.5,
    max_tokens: 512,
  });

  console.log(response.choices[0].message.content);
}

main().catch(console.error);

curl -s http://localhost:41002/v1/chat/completions \
  -H "Authorization: Bearer kt-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llamafile:chat:mistral-7b-instruct",
    "messages": [
      { "role": "user", "content": "Summarize the key principles of AI governance." }
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }' | jq .

Streaming

Keeptrusts forwards streaming responses from the llamafile server as OpenAI-compatible Server-Sent Events (SSE). No changes are required on the client side.

Python
cURL

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

with client.chat.completions.stream(
    model="llamafile:chat:mistral-7b-instruct",
    messages=[{"role": "user", "content": "Write a short story about a robot learning to paint."}],
    max_tokens=400,
) as stream:
    for chunk in stream:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
print()

curl -s http://localhost:41002/v1/chat/completions \
  -H "Authorization: Bearer kt-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llamafile:chat:mistral-7b-instruct",
    "messages": [{ "role": "user", "content": "Count to 5 slowly." }],
    "stream": true,
    "max_tokens": 64
  }'

What Keeptrusts does during streaming:

Policy rules (redaction, blocking) are applied to the assembled response before any chunk is forwarded to the client.
Token usage fields from the llamafile server are surfaced in the final SSE chunk as standard usage counters.
If a policy violation is detected mid-stream, the stream is terminated and a governance event is recorded in the audit log.

Advanced Configuration

Running Multiple Llamafile Instances

Run multiple .llamafile executables on different ports and load-balance across them:

# Terminal 1 — Mistral 7B on port 8080
./mistral-7b-instruct.llamafile --server --port 8080

# Terminal 2 — Gemma 2 2B on port 8081
./gemma-2-2b-it.llamafile --server --port 8081

pack:
  name: llamafile-providers-2
  version: 1.0.0
  enabled: true
providers:
  targets:
  - id: llamafile-mistral
    provider: llamafile:chat:mistral-7b-instruct
    base_url: http://localhost:8080
  - id: llamafile-gemma
    provider: llamafile:chat:gemma-2-2b-it
    base_url: http://localhost:8081
policies:
  chain:
  - audit-logger
policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true

GPU Acceleration

Pass llama.cpp GPU flags directly to the llamafile executable at startup:

# Use all available GPU layers (CUDA / Metal)
./mistral-7b-instruct.llamafile --server --port 8080 -ngl 99

# Limit GPU layers for lower VRAM budgets
./mistral-7b-instruct.llamafile --server --port 8080 -ngl 32

Keeptrusts communicates with the server over HTTP regardless of how the model is loaded and is unaffected by the GPU configuration.

Timeouts for Large Models

Large GGUF models (13B+) can take longer to generate responses. Increase the timeout accordingly:

pack:
  name: llamafile-providers-3
  version: 1.0.0
  enabled: true
providers:
  targets:
  - id: llamafile-large
    provider: llamafile:chat:wizardcoder-python-13b
    base_url: http://localhost:8080
policies:
  chain:
  - audit-logger
policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true

Best Practices

Start llamafile before kt gateway run. Keeptrusts performs a health check on startup if health_probe: true is set; a server that isn't ready will mark the target unhealthy.
Use one port per model. Llamafile loads exactly one model per process. Use separate ports and separate targets entries if you need multiple models simultaneously.
Match the model name to the executable. The model portion of the provider string is forwarded to the llamafile server. Use the same name you passed to --model or that the server reports at /v1/models.
Keep max_tokens bounded. Local inference generates at native speed. Set max_tokens in client requests or in your policy to avoid unexpectedly long runs.
Pin temperature for deterministic outputs. When llamafile is used in automated pipelines, set temperature: 0 to get reproducible results.
Log all local inference. Even though no data leaves the host, Keeptrusts still records events and traces for each request — this provides auditability for compliance and debugging.

For AI systems

Canonical terms: Keeptrusts gateway, Llamafile, llamafile, single-file inference, local AI, self-hosted, provider target, policy-config.yaml.
Config field names: provider, model, base_url: "http://localhost:8080", format: "openai", timeout_seconds, health_probe.
Key behavior: Llamafile bundles model weights and inference runtime into a single executable — Keeptrusts routes to its OpenAI-compatible API.
Constraint: One model per llamafile process. Zero external dependencies required.
Best next pages: llama.cpp integration, Ollama integration, Policy configuration.

For engineers

Prerequisites: Llamafile binary downloaded and running (e.g., ./mistral-7b.llamafile --port 8081), kt CLI installed.
Start command: kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml.
Validate: curl http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"local-model","messages":[{"role":"user","content":"hello"}]}'.
Set temperature: 0 for deterministic outputs when used in automated pipelines.
Keeptrusts records events for every request even though no data leaves the host — providing auditability for compliance.
No secret_key_ref needed for local deployments.

For leaders

Llamafile provides the simplest possible air-gapped deployment — single file, no dependencies, no data egress.
Keeptrusts audit logging creates compliance evidence for inference that has zero vendor-side observability.
Zero operational complexity for model deployment — ideal for developer workstations, edge devices, or environments without container infrastructure.
Limited to single-model serving; plan for Ollama or vLLM if multi-model requirements emerge.

Next steps

llama.cpp integration — more flexible local inference with multi-model support
Ollama integration — local model management with easy model switching
vLLM integration — production-grade self-hosted serving
Policy configuration — audit-logger and safety policy reference
Quickstart — install kt and run your first gateway

Use this page when​

Primary audience​

Prerequisites​

Configuration​

Provider Fields​

Supported Models​

Client Examples​

Streaming​

Advanced Configuration​

Running Multiple Llamafile Instances​

GPU Acceleration​

Timeouts for Large Models​

Best Practices​

For AI systems​

For engineers​

For leaders​

Next steps​