llama.cpp
llama.cpp provides efficient C++ inference for GGUF models with an OpenAI-compatible HTTP server that runs on CPU, Apple Silicon, and CUDA GPUs. Keeptrusts connects to llama.cpp servers and enforces policies on local inference without adding cloud dependencies, making it ideal for air-gapped or data-sensitive deployments.
Use this page when
- You need the exact command, config, API, or integration details for llama.cpp.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- You want a reference page rather than a guided rollout; for a guided rollout, use the workflow pages linked under Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
Build or install llama-server and download a GGUF model file before starting the gateway.
# Clone and build llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j$(nproc)
# Download a GGUF model (example: Llama 3.1 8B Q4_K_M quantization)
# Models are available at https://huggingface.co
huggingface-cli download \
bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--local-dir ./models
# Start the server
./build/bin/llama-server \
-m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--port 8080 \
--host 0.0.0.0 \
-c 4096
# Verify the server is reachable
curl http://localhost:8080/v1/models
The server exposes an OpenAI-compatible API at http://localhost:8080 by default.
Configuration
Add a llama.cpp target to your policy-config.yaml. Set base_url explicitly or export LLAMA_BASE_URL to override the default.
providers:
targets:
- id: llama-cpp-chat
provider: llama.cpp:chat:llama-3.1
base_url: http://localhost:8080
- id: llama-cpp-completion
provider: llama.cpp:completion:llama-3.1
base_url: http://localhost:8080
- id: llama-cpp-mistral
provider: llama.cpp:chat:mistral-7b
base_url: http://localhost:8081
policies:
- id: local-inference-policy
description: Policy controls for llama.cpp local inference
rules:
- type: pii_detection
action: redact
patterns:
- ssn
- credit_card
- email
- type: prompt_injection
action: block
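Keeptrusts enforces these rules internally. As a rough illustration of what a redact action does, here is a minimal regex-based sketch; the patterns are simplified stand-ins, not Keeptrusts' actual detection logic:

```python
import re

# Illustrative patterns only -- Keeptrusts' real detectors are internal
# and more robust than these simple regexes.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> str:
    """Replace each matched span with a [REDACTED:<pattern>] placeholder."""
    for name, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{name}]", text)
    return text
```

Because redaction happens at the gateway, the llama-server process never sees the original PII.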
You can also configure the base URL via environment variable — useful for development environments where the port changes:
export LLAMA_BASE_URL=http://localhost:8080
When base_url is omitted from the provider config, Keeptrusts reads LLAMA_BASE_URL. If neither is set, it falls back to http://localhost:8080.
Provider Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| id | string | yes | — | Unique identifier for this target. Referenced in routing rules and logs. |
| provider | string | yes | — | Provider string: llama.cpp, llama.cpp:chat:<model>, or llama.cpp:completion:<model>. |
| model | string | no | Derived from provider | Override the model name separately when using the bare llama.cpp provider. |
| base_url | string | no | $LLAMA_BASE_URL or http://localhost:8080 | Base URL of the llama-server instance. Omit to use the LLAMA_BASE_URL env var. |
| secret_key_ref | object | no | — | Object reference to the environment variable holding the bearer token. Local deployments typically do not use auth. |
| timeout_seconds | integer | no | 30 | Request timeout for non-streaming calls. Increase for large quantizations on CPU. |
| stream_timeout_seconds | integer | no | 120 | Timeout for the full streaming response. |
| format | string | no | openai | Wire format. Always openai for llama.cpp; no translation is performed. |
| description | string | no | — | Human-readable label shown in the console and audit logs. |
| weight | integer | no | 1 | Relative routing weight for group-based routing. |
| health_probe | boolean | no | false | When true, Keeptrusts periodically verifies the llama-server is reachable. |
Supported Models
Any GGUF-format model is supported. The model name in the provider string is a label used by Keeptrusts — it does not need to match the filename. The actual model loaded depends on how you started llama-server.
Quantization Levels
Choosing the right quantization balances quality, memory, and speed:
| Quantization | Size (7B) | Quality | Use case |
|---|---|---|---|
| Q8_0 | ~7 GB | Near-lossless | High quality, plenty of RAM |
| Q5_K_M | ~5 GB | Very good | Balanced quality/speed |
| Q4_K_M | ~4 GB | Good | Most common, fast on CPU |
| Q3_K_M | ~3 GB | Acceptable | Memory-constrained |
| Q2_K | ~2.5 GB | Degraded | Minimal footprint only |
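The sizes above follow roughly from parameter count times bits per weight. A back-of-envelope estimator; the bits-per-weight values are approximate rules of thumb, and real GGUF files vary slightly by architecture:

```python
# Approximate bits per weight for common GGUF quantizations (rule-of-thumb
# values, not exact; actual file sizes depend on the model architecture).
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
    "Q2_K": 3.0,
}

def estimate_size_gb(params_billions: float, quant: str) -> float:
    """File size estimate in GB: parameters x bits-per-weight / 8 bits-per-byte."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8
```

For a 7B model this reproduces the table within a few hundred megabytes, which is close enough for capacity planning.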
# Example: download multiple quantizations
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
Meta-Llama-3.1-8B-Instruct-Q8_0.gguf \
--local-dir ./models
Client Examples
- Python
- Node.js
- cURL
from openai import OpenAI
# Point at Keeptrusts gateway, not llama-server directly
client = OpenAI(
base_url="http://localhost:41002/v1",
api_key="kt-your-api-key",
)
# Chat completion — same API as OpenAI
response = client.chat.completions.create(
model="llama.cpp:chat:llama-3.1",  # matches the provider string in policy-config.yaml
messages=[
{"role": "system", "content": "You are a concise technical assistant."},
{"role": "user", "content": "Explain quantization-aware training in two sentences."},
],
temperature=0.6,
max_tokens=256,
)
print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:41002/v1",
apiKey: "kt-your-api-key",
});
async function main() {
const response = await client.chat.completions.create({
model: "llama.cpp:chat:llama-3.1",
messages: [
{ role: "system", content: "You are a concise technical assistant." },
{ role: "user", content: "What are the trade-offs of running GGUF models on CPU vs GPU?" },
],
temperature: 0.6,
max_tokens: 300,
});
console.log(response.choices[0].message.content);
}
main().catch(console.error);
curl -s http://localhost:41002/v1/chat/completions \
-H "Authorization: Bearer kt-your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "llama.cpp:chat:llama-3.1",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "What is GGUF format and why does it matter?" }
],
"temperature": 0.7,
"max_tokens": 256
}' | jq .
Streaming
llama-server supports OpenAI-compatible SSE streaming. Keeptrusts forwards stream chunks through its policy enforcement layer and delivers them to the client as standard SSE.
- Python
- cURL
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:41002/v1",
api_key="kt-your-api-key",
)
stream = client.chat.completions.create(
    model="llama.cpp:chat:llama-3.1",
    messages=[{"role": "user", "content": "Describe the GGUF file format in detail."}],
    max_tokens=400,
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
curl -s http://localhost:41002/v1/chat/completions \
-H "Authorization: Bearer kt-your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "llama.cpp:chat:llama-3.1",
"messages": [{ "role": "user", "content": "List the advantages of llama.cpp." }],
"stream": true,
"max_tokens": 200
}'
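If you consume the stream without an SDK, each SSE data: line carries a JSON chunk whose choices[0].delta.content holds the next text fragment, and the stream ends with a data: [DONE] sentinel. A minimal parser over already-received lines might look like this sketch:

```python
import json

def collect_sse_text(lines: list[str]) -> str:
    """Reassemble assistant text from OpenAI-style SSE 'data:' lines."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip comments, blank keep-alives, other fields
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        if delta.get("content"):
            parts.append(delta["content"])
    return "".join(parts)
```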
Advanced Configuration
Context Window Configuration
Set the context window size when starting the server. Larger windows use more memory:
# 8K context (default for many models)
./build/bin/llama-server -m models/llama-3.1-8b-Q4_K_M.gguf -c 8192 --port 8080
# 32K extended context (check model support)
./build/bin/llama-server -m models/llama-3.1-8b-Q4_K_M.gguf -c 32768 --port 8080
# Limit context in Keeptrusts to stay safely under the server's max
- id: "llama-cpp-chat"
provider: "llama.cpp:chat:llama-3.1"
base_url: "http://localhost:8080"
max_context_tokens: 7168 # leave headroom below server's 8192 limit
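The headroom arithmetic is simple: reserve room for generated tokens under the server's -c cap. A small helper; the 1024-token reserve is an illustrative choice, not a Keeptrusts default:

```python
def prompt_token_budget(server_ctx: int, generation_reserve: int = 1024) -> int:
    """Tokens available for the prompt after reserving room for generation.

    With the server started at -c 8192 and a 1024-token reserve, this
    yields 7168, the max_context_tokens value used in the config above.
    """
    if generation_reserve >= server_ctx:
        raise ValueError("reserve must be smaller than the context window")
    return server_ctx - generation_reserve
```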
Multiple Models with Separate Server Instances
Run multiple llama-server processes on different ports and configure a Keeptrusts routing group:
# Terminal 1 — fast small model
./build/bin/llama-server \
-m models/phi-3.5-mini-Q4_K_M.gguf \
--port 8080 -c 4096 --host 0.0.0.0
# Terminal 2 — high-quality larger model
./build/bin/llama-server \
-m models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf \
--port 8081 -c 8192 --host 0.0.0.0
pack:
name: llama-cpp-providers-3
version: 1.0.0
enabled: true
providers:
targets:
- id: llama-phi35
provider: llama.cpp:chat:phi-3.5-mini
base_url: http://localhost:8080
- id: llama-llama31
provider: llama.cpp:chat:llama-3.1
base_url: http://localhost:8081
policies:
chain:
- audit-logger
policy:
audit-logger:
immutable: true
retention_days: 365
log_all_access: true
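With multiple targets defined, the optional weight field drives group-based routing. Conceptually, weighted selection walks cumulative weights, as in this sketch; it is an illustration, not Keeptrusts' actual router:

```python
def pick_target(targets: list[dict], r: float) -> dict:
    """Select a target given r in [0, 1) by walking cumulative weights.

    Each target's 'weight' defaults to 1, mirroring the provider field
    default. A target with weight 3 receives ~3x the traffic of weight 1.
    """
    weights = [t.get("weight", 1) for t in targets]
    threshold = r * sum(weights)
    cumulative = 0
    for target, w in zip(targets, weights):
        cumulative += w
        if threshold < cumulative:
            return target
    return targets[-1]  # guard against floating-point edge cases
```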
GGUF Model Selection for Air-Gapped Deployments
In air-gapped environments, download models ahead of time and verify checksums:
# Download with checksum verification
huggingface-cli download \
bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--local-dir /secure/models
sha256sum /secure/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
# Compare against the checksum published on the model card
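The checksum step can also be scripted, which is convenient when staging many models into an air-gapped environment. This reads the file in chunks so multi-gigabyte GGUF files never load fully into memory:

```python
import hashlib

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute a file's SHA-256 digest in 1 MiB chunks.

    Compare the result against the checksum published on the model card
    before moving the file into the secure environment.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```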
Best Practices
- Choose quantization based on available RAM. Q4_K_M is the most common starting point. Use Q5_K_M or Q8_0 when quality matters and memory allows.
- Set `timeout_seconds` generously for CPU inference. Large models on CPU can take 30–120 seconds per response. A conservative default is `timeout_seconds: 120`.
- Use `health_probe: true` in multi-model setups. If a llama-server process crashes or a model fails to load, Keeptrusts will stop routing to that target automatically.
- Set `-c` (context window) on the server before `max_context_tokens` in Keeptrusts. If Keeptrusts sends a request that exceeds the server's context window, llama-server will return an error. Keep the Keeptrusts limit a few hundred tokens below the server's hard cap.
- Use `LLAMA_BASE_URL` in development. Setting the env var avoids hardcoding localhost addresses across multiple config files during local development.
- Do not expose llama-server directly. Route all traffic through Keeptrusts to capture audit events and enforce policies even on local-only deployments.
- Run separate server instances per model. llama-server loads one model per process. For multi-model routing, start one process per model rather than trying to hot-swap models.
For AI systems
- Canonical terms: Keeptrusts gateway, llama.cpp, llama-server, GGUF, local inference, self-hosted, provider target, policy-config.yaml.
- Config field names: `provider`, `model`, `base_url: "http://localhost:8081"`, `format: "openai"`, `timeout_seconds`, `health_probe`.
- Key behavior: llama.cpp's built-in server (`llama-server`) exposes an OpenAI-compatible API; Keeptrusts routes to it and applies policies.
- Constraint: One model per llama-server process. For multi-model routing, run separate processes on different ports.
- Best next pages: Llamafile integration, Ollama integration, vLLM integration.
For engineers
- Prerequisites: llama.cpp built and running (`llama-server --model your-model.gguf --port 8081`), GGUF model file downloaded, `kt` CLI installed.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Validate through the gateway: `curl http://localhost:41002/v1/chat/completions -H 'Authorization: Bearer kt-your-api-key' -H 'Content-Type: application/json' -d '{"model":"llama.cpp:chat:llama-3.1","messages":[{"role":"user","content":"hello"}]}'`.
- Route all traffic through Keeptrusts, never exposing llama-server directly, to capture audit events and enforce policies even on local deployments.
- For multi-model setups, start one llama-server per model on separate ports and configure separate provider targets.
- No `secret_key_ref` needed for local deployments (set `api_key: ""` or omit it).
For leaders
- llama.cpp enables fully air-gapped inference — no data leaves the host machine, addressing the strictest data sovereignty requirements.
- Keeptrusts audit logging provides compliance evidence for local inference that has no vendor-side audit trail.
- Hardware cost is the primary expense (GPU/CPU) — no per-token charges, but throughput is limited by available compute.
- One model per process means infrastructure scaling requires explicit capacity planning for multi-model deployments.
Next steps
- Llamafile integration — single-file local inference with zero dependencies
- Ollama integration — user-friendly local model management
- vLLM integration — high-throughput self-hosted serving
- Policy configuration — audit-logger and prompt-injection reference
- Quickstart — install `kt` and run your first gateway