llama.cpp

llama.cpp provides efficient C++ inference for GGUF models with an OpenAI-compatible HTTP server that runs on CPU, Apple Silicon, and CUDA GPUs. Keeptrusts connects to llama.cpp servers and enforces policies on local inference without adding cloud dependencies, making it ideal for air-gapped or data-sensitive deployments.

Use this page when

  • You need the exact command, config, API, or integration details for llama.cpp.
  • You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
  • If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Prerequisites

Build or install llama-server and download a GGUF model file before starting the gateway.

# Clone and build llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j$(nproc)

# Download a GGUF model (example: Llama 3.1 8B Q4_K_M quantization)
# Models are available at https://huggingface.co
huggingface-cli download \
  bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models

# Start the server
./build/bin/llama-server \
  -m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --port 8080 \
  --host 0.0.0.0 \
  -c 4096

# Verify the server is reachable
curl http://localhost:8080/v1/models

The server exposes an OpenAI-compatible API at http://localhost:8080 by default.
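
You can also sanity-check the server from Python before placing the gateway in front of it. A minimal sketch, assuming the openai package is installed and llama-server is running on port 8080 with no authentication (the API key is a placeholder):

from openai import OpenAI

# Talk to llama-server directly; no API key is enforced locally, so any placeholder works.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

# List the model(s) reported by llama-server.
for model in client.models.list().data:
    print(model.id)

# Send a one-off chat completion to confirm inference works end to end.
response = client.chat.completions.create(
    model="local-model",  # llama-server serves whatever was passed via -m, regardless of this label
    messages=[{"role": "user", "content": "Say hello in five words."}],
    max_tokens=32,
)
print(response.choices[0].message.content)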

Configuration

Add a llama.cpp target to your policy-config.yaml. Set base_url explicitly or export LLAMA_BASE_URL to override the default.

providers:
  targets:
    - id: llama-cpp-chat
      provider: llama.cpp:chat:llama-3.1
      base_url: http://localhost:8080
    - id: llama-cpp-completion
      provider: llama.cpp:completion:llama-3.1
      base_url: http://localhost:8080
    - id: llama-cpp-mistral
      provider: llama.cpp:chat:mistral-7b
      base_url: http://localhost:8081

policies:
  - id: local-inference-policy
    description: Policy controls for llama.cpp local inference
    rules:
      - type: pii_detection
        action: redact
        patterns:
          - ssn
          - credit_card
          - email
      - type: prompt_injection
        action: block

You can also configure the base URL via environment variable — useful for development environments where the port changes:

export LLAMA_BASE_URL=http://localhost:8080

When base_url is omitted from the provider config, Keeptrusts reads LLAMA_BASE_URL. If neither is set, it falls back to http://localhost:8080.
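
The resolution order can be summarized in a few lines. This is only an illustrative sketch of the precedence described above, not Keeptrusts' internal code:

import os

def resolve_base_url(config_base_url=None):
    # 1. An explicit base_url on the provider target wins.
    if config_base_url:
        return config_base_url
    # 2. Otherwise the LLAMA_BASE_URL environment variable is used.
    env_url = os.environ.get("LLAMA_BASE_URL")
    if env_url:
        return env_url
    # 3. Finally, fall back to the documented default.
    return "http://localhost:8080"

print(resolve_base_url(None))  # value of LLAMA_BASE_URL, or http://localhost:8080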

Provider Fields

Field                  | Type    | Required | Default                                   | Description
id                     | string  | yes      | (none)                                    | Unique identifier for this target. Referenced in routing rules and logs.
provider               | string  | yes      | (none)                                    | Provider string: llama.cpp, llama.cpp:chat:<model>, or llama.cpp:completion:<model>.
model                  | string  | no       | Derived from provider                     | Override the model name separately when using the bare llama.cpp provider.
base_url               | string  | no       | $LLAMA_BASE_URL or http://localhost:8080  | Base URL of the llama-server instance. Omit to use the LLAMA_BASE_URL env var.
secret_key_ref         | object  | no       | (none)                                    | Object reference to the environment variable holding the bearer token. Local deployments typically do not use auth.
timeout_seconds        | integer | no       | 30                                        | Request timeout for non-streaming calls. Increase for large quantizations on CPU.
stream_timeout_seconds | integer | no       | 120                                       | Timeout for the full streaming response.
format                 | string  | no       | openai                                    | Wire format. Always openai for llama.cpp; no translation is performed.
description            | string  | no       | (none)                                    | Human-readable label shown in the console and audit logs.
weight                 | integer | no       | 1                                         | Relative routing weight for group-based routing.
health_probe           | boolean | no       | false                                     | When true, Keeptrusts periodically verifies the llama-server is reachable.

Supported Models

Any GGUF-format model is supported. The model name in the provider string is a label used by Keeptrusts — it does not need to match the filename. The actual model loaded depends on how you started llama-server.

Quantization Levels

Choosing the right quantization balances quality, memory, and speed:

Quantization | Size (7B) | Quality       | Use case
Q8_0         | ~7 GB     | Near-lossless | High quality, plenty of RAM
Q5_K_M       | ~5 GB     | Very good     | Balanced quality/speed
Q4_K_M       | ~4 GB     | Good          | Most common, fast on CPU
Q3_K_M       | ~3 GB     | Acceptable    | Memory-constrained
Q2_K         | ~2.5 GB   | Degraded      | Minimal footprint only

# Example: download multiple quantizations
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  Meta-Llama-3.1-8B-Instruct-Q8_0.gguf \
  --local-dir ./models
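
To predict whether a given quantization will fit in RAM, file size scales roughly with parameter count times bits per weight. A quick sketch; the bits-per-weight values are rough approximations chosen to match the table above, not exact GGUF figures:

# Rough estimate of GGUF file size from parameter count and quantization level.
APPROX_BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
    "Q3_K_M": 3.9,
    "Q2_K": 3.0,
}

def approx_size_gb(params_billions, quant):
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return params_billions * 1e9 * bits / 8 / 1e9

for quant in APPROX_BITS_PER_WEIGHT:
    print(f"{quant}: ~{approx_size_gb(7, quant):.1f} GB for a 7B model")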

Client Examples

from openai import OpenAI

# Point at the Keeptrusts gateway, not llama-server directly
client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

# Chat completion: same API as OpenAI
response = client.chat.completions.create(
    model="llama.cpp:chat:llama-3.1",  # matches the provider string in policy-config.yaml
    messages=[
        {"role": "system", "content": "You are a concise technical assistant."},
        {"role": "user", "content": "Explain quantization-aware training in two sentences."},
    ],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

Streaming

llama-server supports OpenAI-compatible SSE streaming. Keeptrusts forwards stream chunks through its policy enforcement layer and delivers them to the client as standard SSE.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

# Request a streamed response; each chunk carries an incremental content delta.
stream = client.chat.completions.create(
    model="llama.cpp:chat:llama-3.1",
    messages=[{"role": "user", "content": "Describe the GGUF file format in detail."}],
    max_tokens=400,
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()

Advanced Configuration

Context Window Configuration

Set the context window size when starting the server. Larger windows use more memory:

# 8K context (default for many models)
./build/bin/llama-server -m models/llama-3.1-8b-Q4_K_M.gguf -c 8192 --port 8080

# 32K extended context (check model support)
./build/bin/llama-server -m models/llama-3.1-8b-Q4_K_M.gguf -c 32768 --port 8080

# Limit context in Keeptrusts to stay safely under the server's max
- id: "llama-cpp-chat"
  provider: "llama.cpp:chat:llama-3.1"
  base_url: "http://localhost:8080"
  max_context_tokens: 7168  # leave headroom below the server's 8192 limit
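
If your client assembles long prompts, it can also help to budget tokens before calling the gateway. A rough sketch, assuming a server started with -c 8192; the characters-per-token heuristic is a crude stand-in for the model's real tokenizer:

# Clamp max_tokens so prompt + completion stays under the server's -c limit.
SERVER_CONTEXT = 8192   # value passed to llama-server via -c
SAFETY_MARGIN = 512     # headroom for chat-template tokens and tokenizer variance

def clamp_max_tokens(prompt, requested_max_tokens):
    approx_prompt_tokens = len(prompt) // 4  # rough estimate, not the real tokenizer
    budget = SERVER_CONTEXT - SAFETY_MARGIN - approx_prompt_tokens
    return max(0, min(requested_max_tokens, budget))

print(clamp_max_tokens("Explain GGUF. " * 100, 4096))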

Multiple Models with Separate Server Instances

Run multiple llama-server processes on different ports and configure a Keeptrusts routing group:

# Terminal 1 — fast small model
./build/bin/llama-server \
  -m models/phi-3.5-mini-Q4_K_M.gguf \
  --port 8080 -c 4096 --host 0.0.0.0

# Terminal 2 — high-quality larger model
./build/bin/llama-server \
  -m models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf \
  --port 8081 -c 8192 --host 0.0.0.0

pack:
  name: llama-cpp-providers-3
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: llama-phi35
      provider: llama.cpp:chat:phi-3.5-mini
      base_url: http://localhost:8080
    - id: llama-llama31
      provider: llama.cpp:chat:llama-3.1
      base_url: http://localhost:8081

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
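
Clients then select a target per request by passing the matching provider string as the model name. A minimal sketch against the gateway, using the two provider strings configured above; the fast-versus-quality split is only an illustration:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="kt-your-api-key")

def ask(prompt, high_quality=False):
    # Small, fast model by default; larger model when quality matters more than latency.
    model = "llama.cpp:chat:llama-3.1" if high_quality else "llama.cpp:chat:phi-3.5-mini"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return response.choices[0].message.content

print(ask("Summarize GGUF in one sentence."))
print(ask("Write a detailed design overview of GGUF.", high_quality=True))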

GGUF Model Selection for Air-Gapped Deployments

In air-gapped environments, download models ahead of time and verify checksums:

# Download with checksum verification
huggingface-cli download \
  bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
  Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
  --local-dir /secure/models

sha256sum /secure/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
# Compare against the checksum published on the model card
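
The same check can be scripted for transfer into the air-gapped environment. A small sketch; the expected checksum is a placeholder you replace with the value from the model card:

import hashlib

EXPECTED_SHA256 = "paste-model-card-checksum-here"  # placeholder
MODEL_PATH = "/secure/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"

def sha256_of(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Stream the multi-gigabyte file in chunks instead of loading it into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

actual = sha256_of(MODEL_PATH)
print("OK" if actual == EXPECTED_SHA256 else f"MISMATCH: {actual}")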

Best Practices

  • Choose quantization based on available RAM. Q4_K_M is the most common starting point. Use Q5_K_M or Q8_0 when quality matters and memory allows.
  • Set timeout_seconds generously for CPU inference. Large models on CPU can take 30–120 seconds per response. A conservative default is timeout_seconds: 120; a matching client-side timeout is sketched after this list.
  • Use health_probe: true in multi-model setups. If a llama-server process crashes or a model fails to load, Keeptrusts will stop routing to that target automatically.
  • Set -c (context window) on the server before max_context_tokens in Keeptrusts. If Keeptrusts sends a request that exceeds the server's context window, llama-server will return an error. Keep the Keeptrusts limit a few hundred tokens below the server's hard cap.
  • Use LLAMA_BASE_URL in development. Setting the env var avoids hardcoding localhost addresses across multiple config files during local development.
  • Do not expose llama-server directly. Route all traffic through Keeptrusts to capture audit events and enforce policies even on local-only deployments.
  • Run separate server instances per model. llama-server loads one model per process. For multi-model routing, start one process per model rather than trying to hot-swap models.
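
To complement a generous timeout_seconds on the gateway, the client's own request timeout usually needs raising too. A minimal sketch with the OpenAI Python SDK; the specific values are only examples:

from openai import OpenAI

# Default client timeouts are tuned for fast cloud APIs; CPU inference needs more slack.
client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
    timeout=180.0,  # example value; keep it at or above the gateway's timeout_seconds
)

# A per-request override is also possible without rebuilding the client.
slow_client = client.with_options(timeout=300.0)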

For AI systems

  • Canonical terms: Keeptrusts gateway, llama.cpp, llama-server, GGUF, local inference, self-hosted, provider target, policy-config.yaml.
  • Config field names: provider, model, base_url (e.g. "http://localhost:8081"), format (always "openai"), timeout_seconds, health_probe.
  • Key behavior: llama.cpp's built-in server (llama-server) exposes an OpenAI-compatible API; Keeptrusts routes to it and applies policies.
  • Constraint: One model per llama-server process. For multi-model routing, run separate processes on different ports.
  • Best next pages: Llamafile integration, Ollama integration, vLLM integration.

For engineers

  • Prerequisites: llama.cpp built and running (llama-server --model your-model.gguf --port 8081), GGUF model file downloaded, kt CLI installed.
  • Start command: kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml.
  • Validate (queries llama-server directly; match the port you started it on): curl http://localhost:8081/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"local-model","messages":[{"role":"user","content":"hello"}]}'.
  • Route all traffic through Keeptrusts — never expose llama-server directly — to capture audit events and enforce policies even on local deployments.
  • For multi-model setups, start one llama-server per model on separate ports and configure separate provider targets.
  • No secret_key_ref needed for local deployments (set api_key: "" or omit).

For leaders

  • llama.cpp enables fully air-gapped inference — no data leaves the host machine, addressing the strictest data sovereignty requirements.
  • Keeptrusts audit logging provides compliance evidence for local inference that has no vendor-side audit trail.
  • Hardware cost is the primary expense (GPU/CPU) — no per-token charges, but throughput is limited by available compute.
  • One model per process means infrastructure scaling requires explicit capacity planning for multi-model deployments.

Next steps