Text Generation WebUI
Text Generation WebUI (oobabooga) provides a browser-based interface and an OpenAI-compatible API for local LLM inference, supporting GGUF, GPTQ, AWQ, and Transformers models. Keeptrusts uses the built-in OpenAI-compatible extension to route requests through its policy engine, adding enforcement, redaction, and audit trails to any model loaded in the WebUI.
Use this page when
- You need the exact command, config, API, or integration details for Text Generation WebUI.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
Install Text Generation WebUI and enable its OpenAI-compatible API extension before configuring Keeptrusts:
# Clone and set up (one-time)
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt
# Start with a specific model and the OpenAI extension enabled
python server.py --model Llama-3.1-8B-Instruct --api --api-port 5000
Alternatively, enable the OpenAI extension through the UI: Settings → Extensions → openai → Apply and restart.
Verify the endpoint is reachable:
curl http://localhost:5000/v1/models
The server binds to `http://localhost:5000` by default. Configure Keeptrusts and start `kt gateway run` only after the WebUI server is up.
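Because the gateway should start only after the WebUI is reachable, launch scripts can gate on a readiness poll of the `/v1/models` endpoint. The sketch below is illustrative and not part of the `kt` CLI; the `probe` parameter is injectable purely for testing.

```python
import time
import urllib.error
import urllib.request


def wait_for_webui(base_url="http://localhost:5000/v1", timeout=60.0, probe=None):
    """Poll the WebUI's /models endpoint until it answers, or give up after `timeout` seconds."""
    if probe is None:
        def probe(url):
            try:
                with urllib.request.urlopen(url + "/models", timeout=2) as resp:
                    return resp.status == 200
            except (urllib.error.URLError, OSError):
                return False
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe(base_url):
            return True
        time.sleep(1.0)
    return False
```

In a launch script you would call `wait_for_webui()` and only then invoke `kt gateway run`.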
Configuration
Add a Text Generation WebUI target to your policy-config.yaml. The provider field identifies the runtime and the model currently loaded in the WebUI.
providers:
  targets:
    - id: webui-chat
      provider: text-generation-webui:chat:Llama-3.1-8B-Instruct
      base_url: http://localhost:5000/v1
    - id: webui-mistral
      provider: text-generation-webui:chat:Mistral-7B-v0.3
      base_url: http://localhost:5001/v1

policies:
  - id: webui-governance
    description: Enforce data governance for WebUI traffic
    rules:
      - type: pii_detection
        action: redact
        patterns:
          - ssn
          - credit_card
          - email
      - type: content_filter
        action: block
        categories:
          - violence
          - self_harm
Provider Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `id` | string | yes | — | Unique identifier for this target. Referenced in routing rules and shown in audit logs. |
| `provider` | string | yes | — | Provider string: `text-generation-webui` or `text-generation-webui:chat:<model>`. |
| `model` | string | no | Derived from `provider` | Override the model name separately when using the bare `text-generation-webui` provider. |
| `base_url` | string | no | `http://localhost:5000/v1` | Full base URL of the WebUI server, including the `/v1` path. |
| `secret_key_ref` | object | no | — | Object reference to the environment variable holding a bearer token. Only needed if WebUI is behind an auth gateway. |
| `timeout_seconds` | integer | no | 30 | Request timeout for non-streaming calls, in seconds. |
| `format` | string | no | `openai` | Wire format. The WebUI OpenAI extension uses the OpenAI-compatible format; this should not need to be changed. |
| `description` | string | no | — | Human-readable label shown in the console and audit logs. |
| `weight` | integer | no | 1 | Relative routing weight when this target belongs to a load-balanced group. |
| `health_probe` | boolean | no | false | When `true`, Keeptrusts periodically checks the base URL and marks the target unhealthy if unreachable. |
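For reference, a target exercising every optional field might look like the following sketch. The values are illustrative, and the `env`-style shape of `secret_key_ref` is an assumption rather than something confirmed above:

```yaml
providers:
  targets:
    - id: webui-secured
      provider: text-generation-webui        # bare provider string
      model: Llama-3.1-8B-Instruct           # required when provider omits the model
      base_url: http://localhost:5000/v1
      secret_key_ref:
        env: WEBUI_GATEWAY_TOKEN             # only if WebUI sits behind an auth gateway
      timeout_seconds: 60
      format: openai
      description: Governed WebUI target (example)
      weight: 2
      health_probe: true
```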
Supported Models
Text Generation WebUI supports any model that can be loaded through its interface. Common models used in governance contexts include:
| Model | Format | Use Case |
|---|---|---|
| Llama-3.1-8B-Instruct | GGUF / transformers | General-purpose chat |
| Mistral-7B-v0.3 | GGUF / transformers | Fast, broadly capable chat |
| Phi-3-mini-4k-instruct | GGUF / transformers | Lightweight, efficient reasoning |
| Mixtral-8x7B-Instruct-v0.1 | GGUF | High capability, mixture of experts |
| CodeLlama-13B-Instruct | GGUF | Code generation and completion |
The model name in the provider field should match the folder name as shown in the WebUI Model tab. Keeptrusts forwards it verbatim to the OpenAI extension endpoint.
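To avoid typos, the exact loaded model name can be read back from the extension itself: an OpenAI-compatible `/v1/models` response carries model ids under a `data` array. A sketch (the injectable `fetch` parameter exists only for testing):

```python
import json
import urllib.request


def loaded_model_ids(base_url="http://localhost:5000/v1", fetch=None):
    """Return the model ids the WebUI reports, for use verbatim in `provider` strings."""
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=5) as resp:
                return json.load(resp)
    payload = fetch(base_url + "/models")
    return [m["id"] for m in payload.get("data", [])]
```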
Client Examples
- Python
- Node.js
- cURL
from openai import OpenAI

# Point at the Keeptrusts gateway, not the WebUI directly
client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

response = client.chat.completions.create(
    model="text-generation-webui:chat:Llama-3.1-8B-Instruct",  # the provider string from policy-config.yaml
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the difference between GGUF and GPTQ model formats."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({
  baseURL: "http://localhost:41002/v1",
  apiKey: "kt-your-api-key",
});

async function main() {
  const response = await client.chat.completions.create({
    model: "text-generation-webui:chat:Llama-3.1-8B-Instruct",
    messages: [
      { role: "system", content: "You are a helpful assistant." },
      { role: "user", content: "What makes local LLM inference suitable for privacy-sensitive workloads?" },
    ],
    temperature: 0.5,
    max_tokens: 512,
  });
  console.log(response.choices[0].message.content);
}

main().catch(console.error);
curl -s http://localhost:41002/v1/chat/completions \
-H "Authorization: Bearer kt-your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "text-generation-webui:chat:Llama-3.1-8B-Instruct",
"messages": [
{ "role": "user", "content": "Summarize the key principles of AI governance." }
],
"temperature": 0.7,
"max_tokens": 256
}' | jq .
Streaming
Keeptrusts forwards streaming responses from the WebUI OpenAI extension as OpenAI-compatible Server-Sent Events (SSE). No changes are required on the client side.
- Python
- cURL
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

stream = client.chat.completions.create(
    model="text-generation-webui:chat:Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about responsible AI."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
curl -s http://localhost:41002/v1/chat/completions \
-H "Authorization: Bearer kt-your-api-key" \
-H "Content-Type: application/json" \
-d '{
"model": "text-generation-webui:chat:Llama-3.1-8B-Instruct",
"messages": [{ "role": "user", "content": "Count to 5 slowly." }],
"stream": true,
"max_tokens": 64
}'
What Keeptrusts does during streaming:
- Policy rules (redaction, blocking) are applied to the assembled response before any chunk is forwarded to the client.
- Token usage fields from the WebUI extension are surfaced in the final SSE chunk as standard `usage` counters.
- If a policy violation is detected mid-stream, the stream is terminated and a governance event is recorded in the audit log.
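The exact way a mid-stream policy stop surfaces to clients is not pinned down here; assuming it appears as an error raised during iteration (with the openai SDK, typically an `APIError`), a defensive consumer can keep the partial text it already received. A sketch:

```python
def consume_stream(deltas):
    """Drain an iterator of streamed text deltas, retaining partial output
    if the gateway terminates the stream on a policy violation."""
    parts = []
    terminated = False
    try:
        for text in deltas:
            if text:
                parts.append(text)
    except RuntimeError:
        # Stand-in for the SDK error assumed to signal a gateway policy stop.
        terminated = True
    return "".join(parts), terminated
```

In real client code the `except` clause would catch the SDK's streaming error type instead of `RuntimeError`.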
Advanced Configuration
Running Multiple WebUI Instances
To serve multiple models simultaneously, run separate Text Generation WebUI processes on different ports:
# Terminal 1 — Llama 3.1 8B on port 5000
python server.py --model Llama-3.1-8B-Instruct --api --api-port 5000
# Terminal 2 — Mistral 7B on port 5001
python server.py --model Mistral-7B-v0.3 --api --api-port 5001
Then register each instance as a separate provider target:

pack:
  name: text-generation-webui-providers-2
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: webui-llama
      provider: text-generation-webui:chat:Llama-3.1-8B-Instruct
      base_url: http://localhost:5000/v1
    - id: webui-mistral
      provider: text-generation-webui:chat:Mistral-7B-v0.3
      base_url: http://localhost:5001/v1

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Adjusting Generation Parameters
WebUI generation parameters (temperature, repetition penalty, etc.) are set in the WebUI interface or passed in the request body. Keeptrusts forwards all standard OpenAI parameters unmodified:
response = client.chat.completions.create(
    model="text-generation-webui:chat:Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Draft a privacy policy summary."}],
    temperature=0.3,  # lower = more deterministic
    top_p=0.9,
    frequency_penalty=0.1,
    max_tokens=1024,
)
Timeouts for Large Models
Large models (13B+) at full precision may generate slowly. Increase the timeout to avoid premature failures:
pack:
  name: text-generation-webui-providers-3
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: webui-large
      provider: text-generation-webui:chat:Mixtral-8x7B-Instruct-v0.1
      base_url: http://localhost:5000/v1
      timeout_seconds: 120

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Best Practices
- Enable the OpenAI extension before starting Keeptrusts. Without the extension active, the WebUI does not expose a `/v1` API and all requests will fail.
- Start WebUI before `kt gateway run`. Keeptrusts performs a health check on startup if `health_probe: true` is set; a server that isn't ready will mark the target unhealthy.
- Include `/v1` in `base_url`. The OpenAI extension serves its API under the `/v1` prefix. Omitting it will result in 404 errors.
- Match the model name exactly. Use the folder name as shown in the WebUI Model tab. Keeptrusts forwards it verbatim to the extension endpoint.
- Use one WebUI process per model. Text Generation WebUI loads one model at a time. For multi-model configurations, run separate processes on different ports.
- Log all local inference. Even though no data leaves the host, Keeptrusts still records events and traces for each request — providing auditability for compliance and debugging.
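The `/v1` rule above is easy to check mechanically before starting the gateway. A minimal preflight sketch (not part of the `kt` CLI):

```python
def check_base_url(base_url):
    """Return a list of problems with a target's base_url, per the practices above."""
    problems = []
    if not base_url.startswith(("http://", "https://")):
        problems.append("base_url must be an absolute http(s) URL")
    if not base_url.rstrip("/").endswith("/v1"):
        problems.append("base_url should include the /v1 prefix "
                        "(the OpenAI extension serves its API there; omitting it yields 404s)")
    return problems
```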
For AI systems
- Canonical terms: Keeptrusts gateway, Text Generation WebUI, oobabooga, Gradio, local inference, self-hosted, provider target, policy-config.yaml.
- Config field names: `provider`, `model`, `base_url: "http://localhost:5000/v1"`, `format: "openai"`, `timeout_seconds`, `health_probe`.
- Key behavior: Text Generation WebUI exposes an OpenAI-compatible API extension; Keeptrusts routes to it and applies policies.
- Constraint: One model per WebUI process. For multi-model configs, run separate processes on different ports.
- Best next pages: llama.cpp integration, Ollama integration, Policy configuration.
For engineers
- Prerequisites: Text Generation WebUI running with the `--api` flag (exposes the OpenAI-compatible endpoint), a model loaded, and the `kt` CLI installed.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Validate: `curl http://localhost:41002/v1/chat/completions -H 'Authorization: Bearer kt-your-api-key' -H 'Content-Type: application/json' -d '{"model":"text-generation-webui:chat:Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"hello"}]}'`.
- Use one WebUI process per model — for multi-model configurations, run separate processes on different ports.
- Keeptrusts records events for every request, even for local inference — providing auditability for compliance and debugging.
- No `secret_key_ref` is needed for local deployments.
For leaders
- Text Generation WebUI enables fully local inference with GPU acceleration — no data leaves the host.
- Keeptrusts audit logging provides compliance evidence for local inference with no vendor-side audit trail.
- One-model-per-process constraint means multi-model deployments require proportional hardware and port planning.
- Suitable for research, development, and air-gapped environments where cloud inference is not permitted.
Next steps
- llama.cpp integration — lighter-weight local inference without a GUI
- Ollama integration — simpler local model management with automatic model switching
- vLLM integration — production-grade self-hosted serving
- Policy configuration — audit-logger and safety policy reference
- Quickstart — install `kt` and run your first gateway