Text Generation WebUI
Text Generation WebUI (oobabooga) provides a browser-based interface and an OpenAI-compatible API for local LLM inference, supporting GGUF, GPTQ, AWQ, and Transformers models. Keeptrusts uses the built-in OpenAI-compatible extension to route requests through its policy engine, adding enforcement, redaction, and audit trails to any model loaded in the WebUI.
Use this page when
- You need the exact command, config, API, or integration details for Text Generation WebUI.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
Install Text Generation WebUI and enable its OpenAI-compatible API extension before configuring Keeptrusts:
# Clone and set up (one-time)
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
# Start with a specific model and the OpenAI extension enabled
python server.py --model Llama-3.1-8B-Instruct --api --api-port 5000
Alternatively, enable the OpenAI extension through the UI: Settings → Extensions → openai → Apply and restart.
Verify the endpoint is reachable:
curl http://localhost:5000/v1/models
The server binds to `http://localhost:5000` by default. Configure Keeptrusts and start `kt gateway run` only after the WebUI server is up.
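Because the gateway should start only after the WebUI is reachable, launch scripts can gate on a readiness poll of the `/v1/models` endpoint. The sketch below is illustrative and not part of the `kt` CLI; the `probe` parameter is injectable purely for testing.

```python
import time
import urllib.error
import urllib.request


def wait_for_webui(base_url="http://localhost:5000/v1", timeout=60.0, probe=None):
    """Poll the WebUI's /models endpoint until it answers, or give up after `timeout` seconds."""
    if probe is None:
        def probe(url):
            try:
                with urllib.request.urlopen(url + "/models", timeout=2) as resp:
                    return resp.status == 200
            except (urllib.error.URLError, OSError):
                return False
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(base_url):
            return True
        time.sleep(1.0)
    return False
```

In a launch script you would call `wait_for_webui()` and only then invoke `kt gateway run`.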
Configuration
Add a Text Generation WebUI target to your policy-config.yaml. The provider field identifies the runtime and the model currently loaded in the WebUI.
providers:
  targets:
    - id: webui-chat
      provider: text-generation-webui:chat:Llama-3.1-8B-Instruct
      base_url: http://localhost:5000/v1
    - id: webui-mistral
      provider: text-generation-webui:chat:Mistral-7B-v0.3
      base_url: http://localhost:5001/v1

policies:
  - id: webui-governance
    description: Enforce data governance for WebUI traffic
    rules:
      - type: pii_detection
        action: redact
        patterns:
          - ssn
          - credit_card
          - email
      - type: content_filter
        action: block
        categories:
          - violence
          - self_harm
Provider Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `id` | string | yes | — | Unique identifier for this target. Referenced in routing rules and shown in audit logs. |
| `provider` | string | yes | — | Provider string: `text-generation-webui` or `text-generation-webui:chat:<model>`. |
| `model` | string | no | Derived from `provider` | Override the model name separately when using the bare `text-generation-webui` provider. |
| `base_url` | string | no | `http://localhost:5000/v1` | Full base URL of the WebUI server, including the `/v1` path. |
| `secret_key_ref` | object | no | — | Object reference to the environment variable holding a bearer token. Only needed if WebUI is behind an auth gateway. |
| `timeout_seconds` | integer | no | 30 | Request timeout for non-streaming calls, in seconds. |
| `format` | string | no | `openai` | Wire format. The WebUI OpenAI extension uses the OpenAI-compatible format; this should not need to be changed. |
| `description` | string | no | — | Human-readable label shown in the console and audit logs. |
| `weight` | integer | no | 1 | Relative routing weight when this target belongs to a load-balanced group. |
| `health_probe` | boolean | no | false | When `true`, Keeptrusts periodically checks the base URL and marks the target unhealthy if unreachable. |
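For reference, a target exercising every optional field might look like the following sketch. The values are illustrative, and the `env`-style shape of `secret_key_ref` is an assumption rather than something confirmed above:

```yaml
providers:
  targets:
    - id: webui-secured
      provider: text-generation-webui        # bare provider string
      model: Llama-3.1-8B-Instruct           # required when provider omits the model
      base_url: http://localhost:5000/v1
      secret_key_ref:
        env: WEBUI_GATEWAY_TOKEN             # only if WebUI sits behind an auth gateway
      timeout_seconds: 60
      format: openai
      description: Governed WebUI target (example)
      weight: 2
      health_probe: true
```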
Supported Models
Text Generation WebUI supports any model that can be loaded through its interface. Common models used in governance contexts include:
| Model | Format | Use Case |
|---|---|---|
| Llama-3.1-8B-Instruct | GGUF / transformers | General-purpose chat |
| Mistral-7B-v0.3 | GGUF / transformers | Fast, broadly capable chat |
| Phi-3-mini-4k-instruct | GGUF / transformers | Lightweight, efficient reasoning |
| Mixtral-8x7B-Instruct-v0.1 | GGUF | High capability, mixture of experts |
| CodeLlama-13B-Instruct | GGUF | Code generation and completion |
The model name in the provider field should match the folder name as shown in the WebUI Model tab. Keeptrusts forwards it verbatim to the OpenAI extension endpoint.
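To avoid typos, the exact loaded model name can be read back from the extension itself: an OpenAI-compatible `/v1/models` response carries model ids under a `data` array. A sketch (the injectable `fetch` parameter exists only for testing):

```python
import json
import urllib.request


def loaded_model_ids(base_url="http://localhost:5000/v1", fetch=None):
    """Return the model ids the WebUI reports, for use verbatim in `provider` strings."""
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=5) as resp:
                return json.load(resp)
    payload = fetch(base_url + "/models")
    return [m["id"] for m in payload.get("data", [])]
```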
Client Examples
- Python
- Node.js
- cURL
from openai import OpenAI

# Point at the Keeptrusts gateway, not the WebUI directly
client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

response = client.chat.completions.create(
    model="text-generation-webui:chat:Llama-3.1-8B-Instruct",  # the provider string from policy-config.yaml
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the difference between GGUF and GPTQ model formats."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({
  baseURL: "http://localhost:41002/v1",
  apiKey: "kt-your-api-key",
});

async function main() {
  const response = await client.chat.completions.create({
    model: "text-generation-webui:chat:Llama-3.1-8B-Instruct",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "What makes local LLM inference suitable for privacy-sensitive workloads?" },
    ],
    temperature: 0.5,
    max_tokens: 512,
  });
  console.log(response.choices[0].message.content);
}

main().catch(console.error);
curl -s http://localhost:41002/v1/chat/completions \
-H "Authorization: Bearer kt-your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "text-generation-webui:chat:Llama-3.1-8B-Instruct",
"messages": [
{ "role": "user", "content": "Summarize the key principles of AI governance." }
],
"temperature": 0.7,
"max_tokens": 256
}' | jq .
Streaming
Keeptrusts forwards streaming responses from the WebUI OpenAI extension as OpenAI-compatible Server-Sent Events (SSE). No changes are required on the client side.
- Python
- cURL
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

stream = client.chat.completions.create(
    model="text-generation-webui:chat:Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about responsible AI."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
curl -s http://localhost:41002/v1/chat/completions \
-H "Authorization: Bearer kt-your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "text-generation-webui:chat:Llama-3.1-8B-Instruct",
"messages": [{ "role": "user", "content": "Count to 5 slowly." }],
"stream": true,
"max_tokens": 64
}'
What Keeptrusts does during streaming:
- Policy rules (redaction, blocking) are applied to the assembled response before any chunk is forwarded to the client.
- Token usage fields from the WebUI extension are surfaced in the final SSE chunk as standard `usage` counters.
- If a policy violation is detected mid-stream, the stream is terminated and a governance event is recorded in the audit log.
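The exact way a mid-stream policy stop surfaces to clients is not pinned down here; assuming it appears as an error raised during iteration (with the openai SDK, typically an `APIError`), a defensive consumer can keep the partial text it already received. A sketch:

```python
def consume_stream(deltas):
    """Drain an iterator of streamed text deltas, retaining partial output
    if the gateway terminates the stream on a policy violation."""
    parts = []
    terminated = False
    try:
        for text in deltas:
            if text:
                parts.append(text)
    except RuntimeError:
        # Stand-in for the SDK error assumed to signal a gateway policy stop.
        terminated = True
    return "".join(parts), terminated
```

In real client code the `except` clause would catch the SDK's streaming error type instead of `RuntimeError`.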
Advanced Configuration
Running Multiple WebUI Instances
To serve multiple models simultaneously, run separate Text Generation WebUI processes on different ports:
# Terminal 1 — Llama 3.1 8B on port 5000
python server.py --model Llama-3.1-8B-Instruct --api --api-port 5000
# Terminal 2 — Mistral 7B on port 5001
python server.py --model Mistral-7B-v0.3 --api --api-port 5001
Then register each instance as a separate provider target:

pack:
  name: text-generation-webui-providers-2
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: webui-llama
      provider: text-generation-webui:chat:Llama-3.1-8B-Instruct
      base_url: http://localhost:5000/v1
    - id: webui-mistral
      provider: text-generation-webui:chat:Mistral-7B-v0.3
      base_url: http://localhost:5001/v1

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Adjusting Generation Parameters
WebUI generation parameters (temperature, repetition penalty, etc.) are set in the WebUI interface or passed in the request body. Keeptrusts forwards all standard OpenAI parameters unmodified:
response = client.chat.completions.create(
    model="text-generation-webui:chat:Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Draft a privacy policy summary."}],
    temperature=0.3,  # lower = more deterministic
    top_p=0.9,
    frequency_penalty=0.1,
    max_tokens=1024,
)
Timeouts for Large Models
Large models (13B+) at full precision may generate slowly. Increase the timeout to avoid premature failures:
pack:
  name: text-generation-webui-providers-3
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: webui-large
      provider: text-generation-webui:chat:Mixtral-8x7B-Instruct-v0.1
      base_url: http://localhost:5000/v1
      timeout_seconds: 120

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Best Practices
- Enable the OpenAI extension before starting Keeptrusts. Without the extension active, the WebUI does not expose a `/v1` API and all requests will fail.
- Start WebUI before `kt gateway run`. Keeptrusts performs a health check on startup if `health_probe: true` is set; a server that isn't ready will mark the target unhealthy.
- Include `/v1` in `base_url`. The OpenAI extension serves its API under the `/v1` prefix. Omitting it will result in 404 errors.
- Match the model name exactly. Use the folder name as shown in the WebUI Model tab. Keeptrusts forwards it verbatim to the extension endpoint.
- Use one WebUI process per model. Text Generation WebUI loads one model at a time. For multi-model configurations, run separate processes on different ports.
- Log all local inference. Even though no data leaves the host, Keeptrusts still records events and traces for each request — providing auditability for compliance and debugging.
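The `/v1` rule above is easy to check mechanically before starting the gateway. A minimal preflight sketch (not part of the `kt` CLI):

```python
def check_base_url(base_url):
    """Return a list of problems with a target's base_url, per the practices above."""
    problems = []
    if not base_url.startswith(("http://", "https://")):
        problems.append("base_url must be an absolute http(s) URL")
    if not base_url.rstrip("/").endswith("/v1"):
        problems.append("base_url should include the /v1 prefix "
                        "(the OpenAI extension serves its API there; omitting it yields 404s)")
    return problems
```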
For AI systems
- Canonical terms: Keeptrusts gateway, Text Generation WebUI, oobabooga, Gradio, local inference, self-hosted, provider target, policy-config.yaml.
- Config field names: `provider`, `model`, `base_url: "http://localhost:5000/v1"`, `format: "openai"`, `timeout_seconds`, `health_probe`.
- Key behavior: Text Generation WebUI exposes an OpenAI-compatible API extension; Keeptrusts routes to it and applies policies.
- Constraint: One model per WebUI process. For multi-model configs, run separate processes on different ports.
- Best next pages: llama.cpp integration, Ollama integration, Policy configuration.
For engineers
- Prerequisites: Text Generation WebUI running with the `--api` flag (exposes the OpenAI-compatible endpoint), a model loaded, and the `kt` CLI installed.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Validate: `curl http://localhost:41002/v1/chat/completions -H 'Authorization: Bearer kt-your-api-key' -H 'Content-Type: application/json' -d '{"model":"text-generation-webui:chat:Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"hello"}]}'`.
- Use one WebUI process per model — for multi-model configurations, run separate processes on different ports.
- Keeptrusts records events for every request, even for local inference — providing auditability for compliance and debugging.
- No `secret_key_ref` is needed for local deployments.
For leaders
- Text Generation WebUI enables fully local inference with GPU acceleration — no data leaves the host.
- Keeptrusts audit logging provides compliance evidence for local inference with no vendor-side audit trail.
- One-model-per-process constraint means multi-model deployments require proportional hardware and port planning.
- Suitable for research, development, and air-gapped environments where cloud inference is not permitted.
Next steps
- llama.cpp integration — lighter-weight local inference without a GUI
- Ollama integration — simpler local model management with automatic model switching
- vLLM integration — production-grade self-hosted serving
- Policy configuration — audit-logger and safety policy reference
- Quickstart — install `kt` and run your first gateway