llama.cpp
llama.cpp provides efficient C++ inference for GGUF models with an OpenAI-compatible HTTP server that runs on CPU, Apple Silicon, and CUDA GPUs. Keeptrusts connects to llama.cpp servers and enforces policies on local inference without adding cloud dependencies, making it ideal for air-gapped or data-sensitive deployments.
Use this page when
- You need the exact command, config, API, or integration details for llama.cpp.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- You want a reference page rather than a guided rollout; for a guided rollout, use the workflow pages linked under Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
Build or install llama-server and download a GGUF model file before starting the gateway.
# Clone and build llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j$(nproc)
# Download a GGUF model (example: Llama 3.1 8B Q4_K_M quantization)
# Models are available at https://huggingface.co
huggingface-cli download \
bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--local-dir ./models
# Start the server
./build/bin/llama-server \
-m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--port 8080 \
--host 0.0.0.0 \
-c 4096
# Verify the server is reachable
curl http://localhost:8080/v1/models
The server exposes an OpenAI-compatible API at http://localhost:8080 by default.
Configuration
Add a llama.cpp target to your policy-config.yaml. Set base_url explicitly or export LLAMA_BASE_URL to override the default.
providers:
targets:
- id: llama-cpp-chat
provider: llama.cpp:chat:llama-3.1
base_url: http://localhost:8080
- id: llama-cpp-completion
provider: llama.cpp:completion:llama-3.1
base_url: http://localhost:8080
- id: llama-cpp-mistral
provider: llama.cpp:chat:mistral-7b
base_url: http://localhost:8081
policies:
- id: local-inference-policy
description: Policy controls for llama.cpp local inference
rules:
- type: pii_detection
action: redact
patterns:
- ssn
- credit_card
- email
- type: prompt_injection
action: block
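Keeptrusts enforces these rules internally. As a rough illustration of what a redact action does, here is a minimal regex-based sketch; the patterns are simplified stand-ins, not Keeptrusts' actual detection logic:

```python
import re

# Illustrative patterns only -- Keeptrusts' real detectors are internal
# and more robust than these simple regexes.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text: str) -> str:
    """Replace each matched span with a [REDACTED:<pattern>] placeholder."""
    for name, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{name}]", text)
    return text
```

Because redaction happens at the gateway, the llama-server process never sees the original PII.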
You can also configure the base URL via environment variable — useful for development environments where the port changes:
export LLAMA_BASE_URL=http://localhost:8080
When base_url is omitted from the provider config, Keeptrusts reads LLAMA_BASE_URL. If neither is set, it falls back to http://localhost:8080.
Provider Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| id | string | yes | — | Unique identifier for this target. Referenced in routing rules and logs. |
| provider | string | yes | — | Provider string: llama.cpp, llama.cpp:chat:<model>, or llama.cpp:completion:<model>. |
| model | string | no | Derived from provider | Override the model name separately when using the bare llama.cpp provider. |
| base_url | string | no | $LLAMA_BASE_URL or http://localhost:8080 | Base URL of the llama-server instance. Omit to use the LLAMA_BASE_URL env var. |
| secret_key_ref | object | no | — | Object reference to the environment variable holding the bearer token. Local deployments typically do not use auth. |
| timeout_seconds | integer | no | 30 | Request timeout for non-streaming calls. Increase for large quantizations on CPU. |
| stream_timeout_seconds | integer | no | 120 | Timeout for the full streaming response. |
| format | string | no | openai | Wire format. Always openai for llama.cpp; no translation is performed. |
| description | string | no | — | Human-readable label shown in the console and audit logs. |
| weight | integer | no | 1 | Relative routing weight for group-based routing. |
| health_probe | boolean | no | false | When true, Keeptrusts periodically verifies the llama-server is reachable. |
Supported Models
Any GGUF-format model is supported. The model name in the provider string is a label used by Keeptrusts — it does not need to match the filename. The actual model loaded depends on how you started llama-server.
Quantization Levels
Choosing the right quantization balances quality, memory, and speed:
| Quantization | Size (7B) | Quality | Use case |
|---|---|---|---|
| Q8_0 | ~7 GB | Near-lossless | High quality, plenty of RAM |
| Q5_K_M | ~5 GB | Very good | Balanced quality/speed |
| Q4_K_M | ~4 GB | Good | Most common, fast on CPU |
| Q3_K_M | ~3 GB | Acceptable | Memory-constrained |
| Q2_K | ~2.5 GB | Degraded | Minimal footprint only |
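The sizes above follow roughly from parameter count times bits per weight. A back-of-envelope estimator; the bits-per-weight values are approximate rules of thumb, and real GGUF files vary slightly by architecture:

```python
# Approximate bits per weight for common GGUF quantizations (rule-of-thumb
# values, not exact; actual file sizes depend on the model architecture).
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
    "Q2_K": 3.0,
}

def estimate_size_gb(params_billions: float, quant: str) -> float:
    """File size estimate in GB: parameters x bits-per-weight / 8 bits-per-byte."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8
```

For a 7B model this reproduces the table within a few hundred megabytes, which is close enough for capacity planning.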
# Example: download multiple quantizations
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
Meta-Llama-3.1-8B-Instruct-Q8_0.gguf \
--local-dir ./models
Client Examples
- Python
- Node.js
- cURL
from openai import OpenAI
# Point at Keeptrusts gateway, not llama-server directly
client = OpenAI(
base_url="http://localhost:41002/v1",
api_key="kt-your-api-key",
)
# Chat completion — same API as OpenAI
response = client.chat.completions.create(
model="llama.cpp:chat:llama-3.1",  # matches the provider string in policy-config.yaml
messages=[
{"role": "system", "content": "You are a concise technical assistant."},
{"role": "user", "content": "Explain quantization-aware training in two sentences."},
],
temperature=0.6,
max_tokens=256,
)
print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:41002/v1",
apiKey: "kt-your-api-key",
});
async function main() {
const response = await client.chat.completions.create({
model: "llama.cpp:chat:llama-3.1",
messages: [
{ role: "system", content: "You are a concise technical assistant." },
{ role: "user", content: "What are the trade-offs of running GGUF models on CPU vs GPU?" },
],
temperature: 0.6,
max_tokens: 300,
});
console.log(response.choices[0].message.content);
}
main().catch(console.error);
curl -s http://localhost:41002/v1/chat/completions \
-H "Authorization: Bearer kt-your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "llama.cpp:chat:llama-3.1",
"messages": [
{ "role": "system", "content": "You are a helpful assistant." },
{ "role": "user", "content": "What is GGUF format and why does it matter?" }
],
"temperature": 0.7,
"max_tokens": 256
}' | jq .
Streaming
llama-server supports OpenAI-compatible SSE streaming. Keeptrusts forwards stream chunks through its policy enforcement layer and delivers them to the client as standard SSE.
- Python
- cURL
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:41002/v1",
api_key="kt-your-api-key",
)
stream = client.chat.completions.create(
    model="llama.cpp:chat:llama-3.1",
    messages=[{"role": "user", "content": "Describe the GGUF file format in detail."}],
    max_tokens=400,
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
curl -s http://localhost:41002/v1/chat/completions \
-H "Authorization: Bearer kt-your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "llama.cpp:chat:llama-3.1",
"messages": [{ "role": "user", "content": "List the advantages of llama.cpp." }],
"stream": true,
"max_tokens": 200
}'
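If you consume the stream without an SDK, each SSE data: line carries a JSON chunk whose choices[0].delta.content holds the next text fragment, and the stream ends with a data: [DONE] sentinel. A minimal parser over already-received lines might look like this sketch:

```python
import json

def collect_sse_text(lines: list[str]) -> str:
    """Reassemble assistant text from OpenAI-style SSE 'data:' lines."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip comments, blank keep-alives, other fields
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        if delta.get("content"):
            parts.append(delta["content"])
    return "".join(parts)
```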
Advanced Configuration
Context Window Configuration
Set the context window size when starting the server. Larger windows use more memory:
# 8K context (default for many models)
./build/bin/llama-server -m models/llama-3.1-8b-Q4_K_M.gguf -c 8192 --port 8080
# 32K extended context (check model support)
./build/bin/llama-server -m models/llama-3.1-8b-Q4_K_M.gguf -c 32768 --port 8080
# Limit context in Keeptrusts to stay safely under the server's max
- id: "llama-cpp-chat"
provider: "llama.cpp:chat:llama-3.1"
base_url: "http://localhost:8080"
max_context_tokens: 7168 # leave headroom below server's 8192 limit
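The headroom arithmetic is simple: reserve room for generated tokens under the server's -c cap. A small helper; the 1024-token reserve is an illustrative choice, not a Keeptrusts default:

```python
def prompt_token_budget(server_ctx: int, generation_reserve: int = 1024) -> int:
    """Tokens available for the prompt after reserving room for generation.

    With the server started at -c 8192 and a 1024-token reserve, this
    yields 7168, the max_context_tokens value used in the config above.
    """
    if generation_reserve >= server_ctx:
        raise ValueError("reserve must be smaller than the context window")
    return server_ctx - generation_reserve
```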
Multiple Models with Separate Server Instances
Run multiple llama-server processes on different ports and configure a Keeptrusts routing group:
# Terminal 1 — fast small model
./build/bin/llama-server \
-m models/phi-3.5-mini-Q4_K_M.gguf \
--port 8080 -c 4096 --host 0.0.0.0
# Terminal 2 — high-quality larger model
./build/bin/llama-server \
-m models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf \
--port 8081 -c 8192 --host 0.0.0.0
pack:
name: llama-cpp-providers-3
version: 1.0.0
enabled: true
providers:
targets:
- id: llama-phi35
provider: llama.cpp:chat:phi-3.5-mini
base_url: http://localhost:8080
- id: llama-llama31
provider: llama.cpp:chat:llama-3.1
base_url: http://localhost:8081
policies:
chain:
- audit-logger
policy:
audit-logger:
immutable: true
retention_days: 365
log_all_access: true
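With multiple targets defined, the optional weight field drives group-based routing. Conceptually, weighted selection walks cumulative weights, as in this sketch; it is an illustration, not Keeptrusts' actual router:

```python
def pick_target(targets: list[dict], r: float) -> dict:
    """Select a target given r in [0, 1) by walking cumulative weights.

    Each target's 'weight' defaults to 1, mirroring the provider field
    default. A target with weight 3 receives ~3x the traffic of weight 1.
    """
    weights = [t.get("weight", 1) for t in targets]
    threshold = r * sum(weights)
    cumulative = 0
    for target, w in zip(targets, weights):
        cumulative += w
        if threshold < cumulative:
            return target
    return targets[-1]  # guard against floating-point edge cases
```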
GGUF Model Selection for Air-Gapped Deployments
In air-gapped environments, download models ahead of time and verify checksums:
# Download with checksum verification
huggingface-cli download \
bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--local-dir /secure/models
sha256sum /secure/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
# Compare against the checksum published on the model card
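The checksum step can also be scripted, which is convenient when staging many models into an air-gapped environment. This reads the file in chunks so multi-gigabyte GGUF files never load fully into memory:

```python
import hashlib

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute a file's SHA-256 digest in 1 MiB chunks.

    Compare the result against the checksum published on the model card
    before moving the file into the secure environment.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```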
Best Practices
- Choose quantization based on available RAM. Q4_K_M is the most common starting point. Use Q5_K_M or Q8_0 when quality matters and memory allows.
- Set `timeout_seconds` generously for CPU inference. Large models on CPU can take 30–120 seconds per response. A conservative default is `timeout_seconds: 120`.
- Use `health_probe: true` in multi-model setups. If a llama-server process crashes or a model fails to load, Keeptrusts will stop routing to that target automatically.
- Set `-c` (context window) on the server before `max_context_tokens` in Keeptrusts. If Keeptrusts sends a request that exceeds the server's context window, llama-server will return an error. Keep the Keeptrusts limit a few hundred tokens below the server's hard cap.
- Use `LLAMA_BASE_URL` in development. Setting the env var avoids hardcoding localhost addresses across multiple config files during local development.
- Do not expose llama-server directly. Route all traffic through Keeptrusts to capture audit events and enforce policies even on local-only deployments.
- Run separate server instances per model. llama-server loads one model per process. For multi-model routing, start one process per model rather than trying to hot-swap models.
For AI systems
- Canonical terms: Keeptrusts gateway, llama.cpp, llama-server, GGUF, local inference, self-hosted, provider target, policy-config.yaml.
- Config field names: `provider`, `model`, `base_url: "http://localhost:8081"`, `format: "openai"`, `timeout_seconds`, `health_probe`.
- Key behavior: llama.cpp's built-in server (`llama-server`) exposes an OpenAI-compatible API; Keeptrusts routes to it and applies policies.
- Constraint: One model per llama-server process. For multi-model routing, run separate processes on different ports.
- Best next pages: Llamafile integration, Ollama integration, vLLM integration.
For engineers
- Prerequisites: llama.cpp built and running (`llama-server --model your-model.gguf --port 8081`), GGUF model file downloaded, `kt` CLI installed.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Validate through the gateway: `curl http://localhost:41002/v1/chat/completions -H 'Authorization: Bearer kt-your-api-key' -H 'Content-Type: application/json' -d '{"model":"llama.cpp:chat:llama-3.1","messages":[{"role":"user","content":"hello"}]}'`.
- Route all traffic through Keeptrusts, never exposing llama-server directly, to capture audit events and enforce policies even on local deployments.
- For multi-model setups, start one llama-server per model on separate ports and configure separate provider targets.
- No `secret_key_ref` needed for local deployments (set `api_key: ""` or omit it).
For leaders
- llama.cpp enables fully air-gapped inference — no data leaves the host machine, addressing the strictest data sovereignty requirements.
- Keeptrusts audit logging provides compliance evidence for local inference that has no vendor-side audit trail.
- Hardware cost is the primary expense (GPU/CPU) — no per-token charges, but throughput is limited by available compute.
- One model per process means infrastructure scaling requires explicit capacity planning for multi-model deployments.
Next steps
- Llamafile integration — single-file local inference with zero dependencies
- Ollama integration — user-friendly local model management
- vLLM integration — high-throughput self-hosted serving
- Policy configuration — audit-logger and prompt-injection reference
- Quickstart — install `kt` and run your first gateway