vLLM

vLLM is a high-throughput LLM serving engine with a built-in OpenAI-compatible API. Keeptrusts connects to vLLM deployments — local, containerized, or cloud — and enforces policies without modifying the underlying vLLM setup, giving you governance over high-performance inference at any scale.

Use this page when

  • You need the exact command, config, API, or integration details for vLLM.
  • You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
  • If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Prerequisites

vLLM must be running and accessible before kt gateway run starts. The examples on this page start the server with its OpenAI-compatible API on port 8080 (vLLM's own default is port 8000).

# Install vLLM (requires Python 3.9+, CUDA recommended)
pip install vllm

# Start the server with a Hugging Face model
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mixtral-8x7B-v0.1 \
  --port 8080

# Verify the API is reachable
curl http://localhost:8080/v1/models

For containerized deployment:

docker run --gpus all \
  -p 8080:8080 \
  vllm/vllm-openai:latest \
  --model mistralai/Mixtral-8x7B-v0.1

Configuration

Add a vLLM target to your policy-config.yaml. Because vLLM is OpenAI-compatible, Keeptrusts forwards requests directly without format translation.

providers:
  targets:
    - id: vllm-mixtral-chat
      provider: vllm:chat:mistralai/Mixtral-8x7B-v0.1
      base_url: http://localhost:8080/v1
    - id: vllm-mixtral-completion
      provider: vllm:completion:mistralai/Mixtral-8x7B-v0.1
      base_url: http://localhost:8080/v1
    - id: vllm-embed
      provider: vllm:embedding:BAAI/bge-large-en-v1.5
      base_url: http://localhost:8081/v1
    - id: vllm-remote
      provider: vllm:chat:meta-llama/Llama-3.1-70B-Instruct
      base_url: https://vllm.internal.example.com/v1
      secret_key_ref:
        env: VLLM_API_KEY
policies:
  - id: vllm-compliance-policy
    description: Compliance controls for vLLM serving
    rules:
      - type: pii_detection
        action: redact
        patterns:
          - ssn
          - credit_card
          - phi
      - type: prompt_injection
        action: block

Provider Fields

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| id | string | yes | | Unique identifier for this target. Referenced in routing rules, logs, and the console. |
| provider | string | yes | | Provider string: vllm, vllm:chat:<model>, vllm:completion:<model>, or vllm:embedding:<model>. Use the full Hugging Face model ID as the model name. |
| model | string | no | Derived from provider | Override the model name separately when using the bare vllm provider. |
| base_url | string | no | http://localhost:8080/v1 | Full base URL, including the /v1 path prefix. |
| secret_key_ref | object | no | | Object reference to the environment variable holding the bearer token. Local deployments typically do not use auth. |
| timeout_seconds | integer | no | 30 | Request timeout for non-streaming calls. Increase for large batch sizes. |
| stream_timeout_seconds | integer | no | 120 | Timeout for the full streaming response. |
| max_context_tokens | integer | no | model default | Soft cap on context window size forwarded to vLLM. |
| format | string | no | openai | Wire format. Always openai for vLLM; no translation is performed. |
| description | string | no | | Human-readable label shown in the console and audit logs. |
| weight | integer | no | 1 | Relative routing weight when multiple targets are in the same group. |
| health_probe | boolean | no | false | When true, Keeptrusts periodically calls GET /v1/models to verify the server is up. |
| allow_insecure_tls | boolean | no | false | Skip TLS certificate verification. Use only for internal deployments with self-signed certificates. |

Supported Models

vLLM supports any model available on Hugging Face Hub. Use the full org/model-name identifier:

# List models loaded by the running vLLM server
curl http://localhost:8080/v1/models | jq '.data[].id'

Popular models:

| Model | Provider string | Notes |
| --- | --- | --- |
| Mixtral 8x7B | vllm:chat:mistralai/Mixtral-8x7B-v0.1 | MoE, strong reasoning |
| Llama 3.1 70B | vllm:chat:meta-llama/Llama-3.1-70B-Instruct | High quality, requires 2+ GPUs |
| Llama 3.1 8B | vllm:chat:meta-llama/Llama-3.1-8B-Instruct | Fast, single-GPU |
| Mistral 7B | vllm:chat:mistralai/Mistral-7B-Instruct-v0.3 | Broadly capable |
| Qwen 2.5 72B | vllm:chat:Qwen/Qwen2.5-72B-Instruct | Multilingual |
| BGE Large | vllm:embedding:BAAI/bge-large-en-v1.5 | English embeddings |

The model string in the provider field must exactly match the model ID reported by GET /v1/models.
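
Before editing the config, you can confirm what the server actually reports. A minimal check, assuming vLLM is listening on localhost:8080 without authentication and using the Mixtral model ID from the examples above:

import json
import urllib.request

# Model ID expected by the provider string (vllm:chat:<model>)
expected = "mistralai/Mixtral-8x7B-v0.1"

# Ask the running vLLM server which model IDs it serves
with urllib.request.urlopen("http://localhost:8080/v1/models") as resp:
    served = [m["id"] for m in json.load(resp)["data"]]

if expected in served:
    print(f"OK: {expected} is served")
else:
    print(f"Mismatch: server reports {served}")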

Client Examples

from openai import OpenAI

# Point at the Keeptrusts gateway, not vLLM directly
client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

# Chat completion
response = client.chat.completions.create(
    model="vllm:chat:mistralai/Mixtral-8x7B-v0.1",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What are the trade-offs between MoE and dense transformer architectures?"},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

Streaming

vLLM supports OpenAI-compatible SSE streaming natively. Keeptrusts passes the stream through its enforcement layer and forwards the SSE chunks to your client.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

# Request a streamed chat completion; the gateway forwards vLLM's SSE chunks
stream = client.chat.completions.create(
    model="vllm:chat:mistralai/Mixtral-8x7B-v0.1",
    messages=[{"role": "user", "content": "Write a concise overview of transformer self-attention."}],
    max_tokens=400,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
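
Embedding targets work the same way through the gateway. A minimal sketch, assuming the vllm-embed target from the configuration above is being served by its own vLLM instance and that the embedding provider string is passed as the model name, just like the chat examples:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

# Embedding request routed to the vllm-embed target
response = client.embeddings.create(
    model="vllm:embedding:BAAI/bge-large-en-v1.5",
    input=["PagedAttention batches requests efficiently."],
)
print(len(response.data[0].embedding))  # vector dimensionality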

Advanced Configuration

Multi-GPU Setup with Tensor Parallelism

For large models that require multiple GPUs, start vLLM with --tensor-parallel-size:

# 2-GPU setup for Llama 3.1 70B
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --port 8080

# 4-GPU setup for Mixtral 8x22B
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mixtral-8x22B-v0.1 \
  --tensor-parallel-size 4 \
  --port 8080

The Keeptrusts configuration does not change; base_url still points at the single vLLM endpoint, regardless of how many GPUs serve the model.
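
Before choosing a --tensor-parallel-size, it can help to confirm how many GPUs the host actually exposes. A small sketch using PyTorch, which is installed alongside vLLM:

import torch

# tensor-parallel-size must not exceed the number of visible GPUs
gpus = torch.cuda.device_count()
print(f"Visible GPUs: {gpus}")
for i in range(gpus):
    props = torch.cuda.get_device_properties(i)
    print(f"  {i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB")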

Dockerized vLLM Deployment

# docker-compose.yaml excerpt
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8080:8080"
    volumes:
      - huggingface-cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command:
      - "--model"
      - "mistralai/Mixtral-8x7B-v0.1"
      - "--port"
      - "8080"
      - "--tensor-parallel-size"
      - "2"

Point base_url at the Docker service name from within the Keeptrusts container:

base_url: "http://vllm:8080/v1"

Multiple vLLM Instances with Failover

pack:
  name: vllm-providers-4
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: vllm-primary
      provider: vllm:chat:mistralai/Mixtral-8x7B-v0.1
      base_url: http://vllm-node-1:8080/v1
    - id: vllm-secondary
      provider: vllm:chat:mistralai/Mixtral-8x7B-v0.1
      base_url: http://vllm-node-2:8080/v1
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

Health Probe for GPU Memory Checks

When health_probe: true is set, Keeptrusts polls GET /v1/models. If vLLM is OOM or the process has crashed, the target is marked unhealthy and traffic is rerouted to healthy instances.

To monitor GPU memory separately, set up an external probe that checks nvidia-smi and calls the Keeptrusts admin API to manually mark targets as healthy or unhealthy.
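
One way to implement such a probe is a small scheduled script that reads free GPU memory from nvidia-smi and flips the target's health accordingly. The sketch below is illustrative only: the /admin/targets/<id>/health route, the payload, and the missing authentication are placeholders, so substitute the actual admin API details from the API Reference.

import subprocess
import urllib.request

GATEWAY_ADMIN = "http://localhost:41002"  # placeholder admin base URL
TARGET_ID = "vllm-mixtral-chat"
MIN_FREE_MIB = 2048

# Read free memory (MiB) for every GPU via nvidia-smi
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout
free = [int(line) for line in out.splitlines() if line.strip()]

healthy = bool(free) and min(free) >= MIN_FREE_MIB

# Hypothetical admin call -- replace the path and add auth per the real Keeptrusts admin API
req = urllib.request.Request(
    f"{GATEWAY_ADMIN}/admin/targets/{TARGET_ID}/health",
    data=b'{"healthy": true}' if healthy else b'{"healthy": false}',
    headers={"Content-Type": "application/json"},
    method="PUT",
)
urllib.request.urlopen(req)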

Best Practices

  • Match the model ID exactly. The provider model string must match the ID returned by GET /v1/models. Mismatches result in 404 errors from vLLM.
  • Set allow_insecure_tls: true only for internal deployments. Never disable TLS verification for publicly accessible vLLM instances.
  • Increase timeout values for large models. Models with high sequence lengths or large parameter counts can have multi-second TTFT. Set timeout_seconds: 120 or higher for 70B+ models.
  • Enable health_probe: true in production. vLLM can OOM under GPU memory pressure. Health probes let Keeptrusts detect and route around down instances automatically.
  • Do not expose vLLM directly. Route all client traffic through Keeptrusts to enforce policies and maintain a complete audit trail.
  • Start vLLM before the Keeptrusts gateway. If vLLM is unavailable at startup and health_probe is disabled, the first requests will fail with connection errors; a readiness-check sketch follows this list.
  • Use max_context_tokens to cap inference costs. Setting a conservative limit in the Keeptrusts config prevents runaway token consumption even if the client sends large contexts.
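
A small readiness check covers the startup-ordering point above: poll vLLM's /v1/models endpoint and only launch kt gateway run once it answers. A minimal sketch, assuming vLLM listens on localhost:8080:

import time
import urllib.error
import urllib.request

# Poll /v1/models until vLLM answers, then it is safe to run `kt gateway run`
for attempt in range(60):
    try:
        with urllib.request.urlopen("http://localhost:8080/v1/models", timeout=2):
            print("vLLM is ready")
            break
    except (urllib.error.URLError, OSError):
        time.sleep(5)
else:
    raise SystemExit("vLLM did not become ready in time")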

For AI systems

  • Canonical terms: Keeptrusts gateway, vLLM, PagedAttention, high-throughput serving, self-hosted, provider target, policy-config.yaml.
  • Config field names: provider, model, base_url, format: "openai", timeout_seconds, max_context_tokens, health_probe.
  • Key behavior: vLLM serves models with an OpenAI-compatible API; Keeptrusts routes to it, applies policies, and caps context size.
  • Constraint: Start vLLM before the Keeptrusts gateway if health_probe is disabled — first requests will fail with connection errors otherwise.
  • Best next pages: Ollama integration, llama.cpp integration, HuggingFace integration.

For engineers

  • Prerequisites: vLLM installed and serving a model (python -m vllm.entrypoints.openai.api_server --model <model> --port 8080), kt CLI installed.
  • Start command: kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml.
  • Validate vLLM directly: curl http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"your-model","messages":[{"role":"user","content":"hello"}]}'.
  • Start vLLM before the gateway — if vLLM is unavailable at startup and health_probe is disabled, first requests fail.
  • Use max_context_tokens to cap inference costs — prevents runaway token consumption from large client contexts.
  • No secret_key_ref needed for local deployments. Enable health_probe for production to detect vLLM restarts.

For leaders

  • vLLM provides production-grade self-hosted serving with high throughput via PagedAttention — best-in-class for GPU utilization.
  • Full data sovereignty — no data leaves your infrastructure, satisfying strict data residency requirements.
  • Hardware cost is the primary expense; vLLM maximizes tokens-per-GPU-dollar through efficient memory management.
  • Keeptrusts max_context_tokens enforcement prevents unexpected cost spikes from large context requests.

Next steps