# vLLM
vLLM is a high-throughput LLM serving engine with a built-in OpenAI-compatible API. Keeptrusts connects to vLLM deployments — local, containerized, or cloud — and enforces policies without modifying the underlying vLLM setup, giving you governance over high-performance inference at any scale.
## Use this page when

- You need the exact command, config, API, or integration details for vLLM.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
## Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
## Prerequisites

vLLM must be running and accessible before `kt gateway run` starts. vLLM's OpenAI-compatible server listens on port 8000 by default; the examples on this page pass `--port 8080` explicitly so the server matches the configuration used throughout.

```bash
# Install vLLM (requires Python 3.9+; CUDA recommended)
pip install vllm

# Start the server with a Hugging Face model
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mixtral-8x7B-v0.1 \
  --port 8080

# Verify the API is reachable
curl http://localhost:8080/v1/models
```
For containerized deployment (the container also defaults to port 8000, so pass `--port 8080` to match the published port):

```bash
docker run --gpus all \
  -p 8080:8080 \
  vllm/vllm-openai:latest \
  --model mistralai/Mixtral-8x7B-v0.1 \
  --port 8080
```
## Configuration

Add a vLLM target to your `policy-config.yaml`. Because vLLM is OpenAI-compatible, Keeptrusts forwards requests directly without format translation.

```yaml
providers:
  targets:
    - id: vllm-mixtral-chat
      provider: vllm:chat:mistralai/Mixtral-8x7B-v0.1
      base_url: http://localhost:8080/v1
    - id: vllm-mixtral-completion
      provider: vllm:completion:mistralai/Mixtral-8x7B-v0.1
      base_url: http://localhost:8080/v1
    - id: vllm-embed
      provider: vllm:embedding:BAAI/bge-large-en-v1.5
      base_url: http://localhost:8081/v1
    - id: vllm-remote
      provider: vllm:chat:meta-llama/Llama-3.1-70B-Instruct
      base_url: https://vllm.internal.example.com/v1
      secret_key_ref:
        env: VLLM_API_KEY

policies:
  - id: vllm-compliance-policy
    description: Compliance controls for vLLM serving
    rules:
      - type: pii_detection
        action: redact
        patterns:
          - ssn
          - credit_card
          - phi
      - type: prompt_injection
        action: block
```
### Provider Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `id` | string | yes | — | Unique identifier for this target. Referenced in routing rules, logs, and the console. |
| `provider` | string | yes | — | Provider string: `vllm`, `vllm:chat:<model>`, `vllm:completion:<model>`, or `vllm:embedding:<model>`. Use the full Hugging Face model ID as the model name. |
| `model` | string | no | derived from `provider` | Overrides the model name separately when using the bare `vllm` provider. |
| `base_url` | string | no | `http://localhost:8080/v1` | Full base URL, including the `/v1` path prefix. |
| `secret_key_ref` | object | no | — | Reference to the environment variable holding the bearer token. Local deployments typically do not use auth. |
| `timeout_seconds` | integer | no | 30 | Request timeout for non-streaming calls. Increase for large batch sizes. |
| `stream_timeout_seconds` | integer | no | 120 | Timeout for the full streaming response. |
| `max_context_tokens` | integer | no | model default | Soft cap on the context window size forwarded to vLLM. |
| `format` | string | no | `openai` | Wire format. Always `openai` for vLLM; no translation is performed. |
| `description` | string | no | — | Human-readable label shown in the console and audit logs. |
| `weight` | integer | no | 1 | Relative routing weight when multiple targets are in the same group. |
| `health_probe` | boolean | no | false | When true, Keeptrusts periodically calls `GET /v1/models` to verify the server is up. |
| `allow_insecure_tls` | boolean | no | false | Skip TLS certificate verification. Use only for internal deployments with self-signed certificates. |
## Supported Models

vLLM can serve any model with a supported architecture on the Hugging Face Hub. Use the full `org/model-name` identifier:

```bash
# List models loaded by the running vLLM server
curl http://localhost:8080/v1/models | jq '.data[].id'
```
Popular models:
| Model | Provider string | Notes |
|---|---|---|
| Mixtral 8x7B | `vllm:chat:mistralai/Mixtral-8x7B-v0.1` | MoE, strong reasoning |
| Llama 3.1 70B | `vllm:chat:meta-llama/Llama-3.1-70B-Instruct` | High quality, requires 2+ GPUs |
| Llama 3.1 8B | `vllm:chat:meta-llama/Llama-3.1-8B-Instruct` | Fast, single-GPU |
| Mistral 7B | `vllm:chat:mistralai/Mistral-7B-Instruct-v0.3` | Broadly capable |
| Qwen 2.5 72B | `vllm:chat:Qwen/Qwen2.5-72B-Instruct` | Multilingual |
| BGE Large | `vllm:embedding:BAAI/bge-large-en-v1.5` | English embeddings |
The model string in the `provider` field must exactly match the model ID reported by `GET /v1/models`.
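Before wiring a target, it can help to confirm that every configured model ID is actually being served. A minimal sketch using only the standard library (the `/v1/models` response shape is standard OpenAI wire format; the helper names here are illustrative, not part of Keeptrusts):

```python
import json
import urllib.request


def missing_models(configured: list[str], served: list[str]) -> list[str]:
    """Return configured model IDs that the server does not report."""
    served_set = set(served)
    return [m for m in configured if m not in served_set]


def served_model_ids(base_url: str) -> list[str]:
    """Fetch model IDs from an OpenAI-compatible /v1/models endpoint."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        data = json.load(resp)
    return [entry["id"] for entry in data["data"]]


# Example (against a running server):
#   gaps = missing_models(["mistralai/Mixtral-8x7B-v0.1"],
#                         served_model_ids("http://localhost:8080/v1"))
#   if gaps: raise SystemExit(f"Not served by vLLM: {gaps}")
```

Run this against each `base_url` in your config before starting the gateway; a non-empty result would surface as 404s at request time otherwise.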
## Client Examples

**Python**

```python
from openai import OpenAI

# Point at the Keeptrusts gateway, not vLLM directly
client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

# Chat completion
response = client.chat.completions.create(
    model="vllm:chat:mistralai/Mixtral-8x7B-v0.1",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What are the trade-offs between MoE and dense transformer architectures?"},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")
```

**Node.js**

```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:41002/v1",
  apiKey: "kt-your-api-key",
});

async function main() {
  const response = await client.chat.completions.create({
    model: "vllm:chat:mistralai/Mixtral-8x7B-v0.1",
    messages: [
      { role: "system", content: "You are a helpful AI assistant." },
      { role: "user", content: "Explain mixture-of-experts architecture in simple terms." },
    ],
    temperature: 0.7,
    max_tokens: 512,
  });
  console.log(response.choices[0].message.content);
  console.log("Tokens used:", response.usage?.total_tokens);
}

main().catch(console.error);
```

**cURL**

```bash
curl -s http://localhost:41002/v1/chat/completions \
  -H "Authorization: Bearer kt-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm:chat:mistralai/Mixtral-8x7B-v0.1",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Describe vLLM paged attention in one paragraph." }
    ],
    "temperature": 0.5,
    "max_tokens": 300
  }' | jq .
```
## Streaming

vLLM supports OpenAI-compatible SSE streaming natively. Keeptrusts passes the stream through its enforcement layer and forwards the SSE chunks to your client.

**Python**

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

# stream=True returns an iterator of ChatCompletionChunk objects
stream = client.chat.completions.create(
    model="vllm:chat:mistralai/Mixtral-8x7B-v0.1",
    messages=[{"role": "user", "content": "Write a concise overview of transformer self-attention."}],
    max_tokens=400,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```

**cURL**

```bash
curl -s http://localhost:41002/v1/chat/completions \
  -H "Authorization: Bearer kt-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm:chat:mistralai/Mixtral-8x7B-v0.1",
    "messages": [{ "role": "user", "content": "List 5 key vLLM features." }],
    "stream": true,
    "max_tokens": 200
  }'
```
## Advanced Configuration

### Multi-GPU Setup with Tensor Parallelism

For large models that require multiple GPUs, start vLLM with `--tensor-parallel-size`:

```bash
# 2-GPU setup for Llama 3.1 70B
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --port 8080

# 4-GPU setup for Mixtral 8x22B
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mixtral-8x22B-v0.1 \
  --tensor-parallel-size 4 \
  --port 8080
```
The Keeptrusts configuration does not change; `base_url` still points to the single vLLM endpoint.
Dockerized vLLM Deployment
# docker-compose.yaml excerpt
services:
vllm:
image: vllm/vllm-openai:latest
ports:
- "8080:8080"
volumes:
- huggingface-cache:/root/.cache/huggingface
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
command:
- "--model"
- "mistralai/Mixtral-8x7B-v0.1"
- "--port"
- "8080"
- "--tensor-parallel-size"
- "2"
Point base_url at the Docker service name from within the Keeptrusts container:
base_url: "http://vllm:8080/v1"
Multiple vLLM Instances with Failover
pack:
name: vllm-providers-4
version: 1.0.0
enabled: true
providers:
targets:
- id: vllm-primary
provider: vllm:chat:mistralai/Mixtral-8x7B-v0.1
base_url: http://vllm-node-1:8080/v1
- id: vllm-secondary
provider: vllm:chat:mistralai/Mixtral-8x7B-v0.1
base_url: http://vllm-node-2:8080/v1
policies:
chain:
- audit-logger
policy:
audit-logger:
immutable: true
retention_days: 365
log_all_access: true
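Conceptually, routing across targets like these combines the `weight` field with health status: unhealthy targets are excluded, and the remainder are picked in proportion to their weights. The sketch below illustrates that idea only; it is not Keeptrusts' actual routing implementation, and the target dictionaries are hypothetical.

```python
import random


def pick_target(targets: list[dict], rng: random.Random) -> dict:
    """Weighted random choice among healthy targets (illustrative only)."""
    healthy = [t for t in targets if t.get("healthy", True)]
    if not healthy:
        raise RuntimeError("no healthy vLLM targets")
    weights = [t.get("weight", 1) for t in healthy]
    return rng.choices(healthy, weights=weights, k=1)[0]


targets = [
    {"id": "vllm-primary", "weight": 3, "healthy": True},
    {"id": "vllm-secondary", "weight": 1, "healthy": True},
]
# If vllm-primary is marked unhealthy, all traffic shifts to vllm-secondary;
# otherwise roughly three quarters of requests land on vllm-primary.
```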
### Health Probe for GPU Memory Checks

When `health_probe: true` is set, Keeptrusts polls `GET /v1/models`. If vLLM has run out of GPU memory or the process has crashed, the target is marked unhealthy and traffic is rerouted to healthy instances.

To monitor GPU memory separately, set up an external probe that checks `nvidia-smi` and calls the Keeptrusts admin API to manually mark targets as healthy or unhealthy.
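Such an external probe might look like the following sketch. The `nvidia-smi` query flags are real; the reporting step is left as a comment because the admin API endpoint is deployment-specific and not shown here.

```python
import subprocess


def gpu_memory_fractions(smi_output: str) -> list[float]:
    """Parse output of `nvidia-smi --query-gpu=memory.used,memory.total
    --format=csv,noheader,nounits` (one 'used, total' line per GPU, in MiB)
    into used/total fractions."""
    fractions = []
    for line in smi_output.strip().splitlines():
        used, total = (float(x) for x in line.split(","))
        fractions.append(used / total)
    return fractions


def gpus_healthy(smi_output: str, threshold: float = 0.95) -> bool:
    """Healthy when every GPU is below the memory-usage threshold."""
    return all(f < threshold for f in gpu_memory_fractions(smi_output))


def probe() -> bool:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    healthy = gpus_healthy(out)
    # Report `healthy` to the Keeptrusts admin API here; see your admin API
    # reference for the exact endpoint.
    return healthy
```

Run it on a schedule (cron or a systemd timer) on each vLLM node; the 0.95 threshold is an arbitrary starting point to tune for your workload.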
## Best Practices

- **Match the model ID exactly.** The `provider` model string must match the ID returned by `GET /v1/models`. Mismatches result in 404 errors from vLLM.
- **Set `allow_insecure_tls: true` only for internal deployments.** Never disable TLS verification for publicly accessible vLLM instances.
- **Increase timeout values for large models.** Models with high sequence lengths or large parameter counts can have multi-second time to first token. Set `timeout_seconds: 120` or higher for 70B+ models.
- **Enable `health_probe: true` in production.** vLLM can OOM under GPU memory pressure. Health probes let Keeptrusts detect and route around down instances automatically.
- **Do not expose vLLM directly.** Route all client traffic through Keeptrusts to enforce policies and maintain a complete audit trail.
- **Start vLLM before the Keeptrusts gateway.** If vLLM is unavailable at startup and `health_probe` is disabled, the first requests will fail with connection errors.
- **Use `max_context_tokens` to cap inference costs.** Setting a conservative limit in the Keeptrusts config prevents runaway token consumption even if the client sends large contexts.
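Clients can also pre-trim oversized conversations before they ever reach the gateway, so requests do not bounce off the server-side cap. A rough sketch using the common "about 4 characters per token" heuristic for English text (both the heuristic and the helper names are illustrative, not Keeptrusts' token accounting):

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def trim_to_budget(messages: list[dict], max_context_tokens: int) -> list[dict]:
    """Keep the system message plus the most recent messages that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_context_tokens - sum(approx_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    for m in reversed(rest):  # walk newest to oldest
        cost = approx_tokens(m["content"])
        if cost > budget:
            break
        kept.append(m)
        budget -= cost
    return system + list(reversed(kept))
```

A proper implementation would count tokens with the model's own tokenizer; the point here is only that dropping the oldest turns client-side keeps behavior predictable when the gateway enforces a hard cap.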
## For AI systems

- Canonical terms: Keeptrusts gateway, vLLM, PagedAttention, high-throughput serving, self-hosted, provider target, policy-config.yaml.
- Config field names: `provider`, `model`, `base_url`, `format: "openai"`, `timeout_seconds`, `max_context_tokens`, `health_probe`.
- Key behavior: vLLM serves models with an OpenAI-compatible API; Keeptrusts routes to it, applies policies, and caps context size.
- Constraint: Start vLLM before the Keeptrusts gateway if `health_probe` is disabled; first requests will fail with connection errors otherwise.
- Best next pages: Ollama integration, llama.cpp integration, HuggingFace integration.
## For engineers

- Prerequisites: vLLM installed and serving a model (`python -m vllm.entrypoints.openai.api_server --model <model> --port 8080`), `kt` CLI installed.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Validate: `curl http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"your-model","messages":[{"role":"user","content":"hello"}]}'`.
- Start vLLM before the gateway: if vLLM is unavailable at startup and `health_probe` is disabled, first requests fail.
- Use `max_context_tokens` to cap inference costs; it prevents runaway token consumption from large client contexts.
- No `secret_key_ref` is needed for local deployments. Enable `health_probe` in production to detect vLLM restarts.
## For leaders

- vLLM provides production-grade self-hosted serving with high throughput via PagedAttention; it is best-in-class for GPU utilization.
- Full data sovereignty: no data leaves your infrastructure, satisfying strict data residency requirements.
- Hardware cost is the primary expense; vLLM maximizes tokens per GPU dollar through efficient memory management.
- Keeptrusts' `max_context_tokens` enforcement prevents unexpected cost spikes from large context requests.
## Next steps

- Ollama integration: simpler local serving for development and small-scale deployments
- llama.cpp integration: CPU-optimized inference for smaller models
- HuggingFace integration: hosted alternative when self-hosting is impractical
- Policy configuration: audit-logger and prompt-injection reference
- Quickstart: install `kt` and run your first gateway