# vLLM
vLLM is a high-throughput LLM serving engine with a built-in OpenAI-compatible API. Keeptrusts connects to vLLM deployments — local, containerized, or cloud — and enforces policies without modifying the underlying vLLM setup, giving you governance over high-performance inference at any scale.
## Use this page when

- You need the exact command, config, API, or integration details for vLLM.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
## Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
## Prerequisites

vLLM must be running and accessible before `kt gateway run` starts. vLLM's OpenAI-compatible server listens on port 8000 by default; the examples on this page pass `--port 8080` explicitly so the server matches the configuration used throughout.

```bash
# Install vLLM (requires Python 3.9+; CUDA recommended)
pip install vllm

# Start the server with a Hugging Face model
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mixtral-8x7B-v0.1 \
  --port 8080

# Verify the API is reachable
curl http://localhost:8080/v1/models
```
For containerized deployment (the container also defaults to port 8000, so pass `--port 8080` to match the published port):

```bash
docker run --gpus all \
  -p 8080:8080 \
  vllm/vllm-openai:latest \
  --model mistralai/Mixtral-8x7B-v0.1 \
  --port 8080
```
## Configuration

Add a vLLM target to your `policy-config.yaml`. Because vLLM is OpenAI-compatible, Keeptrusts forwards requests directly without format translation.

```yaml
providers:
  targets:
    - id: vllm-mixtral-chat
      provider: vllm:chat:mistralai/Mixtral-8x7B-v0.1
      base_url: http://localhost:8080/v1
    - id: vllm-mixtral-completion
      provider: vllm:completion:mistralai/Mixtral-8x7B-v0.1
      base_url: http://localhost:8080/v1
    - id: vllm-embed
      provider: vllm:embedding:BAAI/bge-large-en-v1.5
      base_url: http://localhost:8081/v1
    - id: vllm-remote
      provider: vllm:chat:meta-llama/Llama-3.1-70B-Instruct
      base_url: https://vllm.internal.example.com/v1
      secret_key_ref:
        env: VLLM_API_KEY

policies:
  - id: vllm-compliance-policy
    description: Compliance controls for vLLM serving
    rules:
      - type: pii_detection
        action: redact
        patterns:
          - ssn
          - credit_card
          - phi
      - type: prompt_injection
        action: block
```
### Provider Fields

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `id` | string | yes | — | Unique identifier for this target. Referenced in routing rules, logs, and the console. |
| `provider` | string | yes | — | Provider string: `vllm`, `vllm:chat:<model>`, `vllm:completion:<model>`, or `vllm:embedding:<model>`. Use the full Hugging Face model ID as the model name. |
| `model` | string | no | derived from `provider` | Overrides the model name separately when using the bare `vllm` provider. |
| `base_url` | string | no | `http://localhost:8080/v1` | Full base URL, including the `/v1` path prefix. |
| `secret_key_ref` | object | no | — | Reference to the environment variable holding the bearer token. Local deployments typically do not use auth. |
| `timeout_seconds` | integer | no | 30 | Request timeout for non-streaming calls. Increase for large batch sizes. |
| `stream_timeout_seconds` | integer | no | 120 | Timeout for the full streaming response. |
| `max_context_tokens` | integer | no | model default | Soft cap on the context window size forwarded to vLLM. |
| `format` | string | no | `openai` | Wire format. Always `openai` for vLLM; no translation is performed. |
| `description` | string | no | — | Human-readable label shown in the console and audit logs. |
| `weight` | integer | no | 1 | Relative routing weight when multiple targets are in the same group. |
| `health_probe` | boolean | no | false | When true, Keeptrusts periodically calls `GET /v1/models` to verify the server is up. |
| `allow_insecure_tls` | boolean | no | false | Skip TLS certificate verification. Use only for internal deployments with self-signed certificates. |
## Supported Models

vLLM can serve any model with a supported architecture on the Hugging Face Hub. Use the full `org/model-name` identifier:

```bash
# List models loaded by the running vLLM server
curl http://localhost:8080/v1/models | jq '.data[].id'
```
Popular models:
| Model | Provider string | Notes |
|---|---|---|
| Mixtral 8x7B | `vllm:chat:mistralai/Mixtral-8x7B-v0.1` | MoE, strong reasoning |
| Llama 3.1 70B | `vllm:chat:meta-llama/Llama-3.1-70B-Instruct` | High quality, requires 2+ GPUs |
| Llama 3.1 8B | `vllm:chat:meta-llama/Llama-3.1-8B-Instruct` | Fast, single-GPU |
| Mistral 7B | `vllm:chat:mistralai/Mistral-7B-Instruct-v0.3` | Broadly capable |
| Qwen 2.5 72B | `vllm:chat:Qwen/Qwen2.5-72B-Instruct` | Multilingual |
| BGE Large | `vllm:embedding:BAAI/bge-large-en-v1.5` | English embeddings |
The model string in the `provider` field must exactly match the model ID reported by `GET /v1/models`.
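Before wiring a target, it can help to confirm that every configured model ID is actually being served. A minimal sketch using only the standard library (the `/v1/models` response shape is standard OpenAI wire format; the helper names here are illustrative, not part of Keeptrusts):

```python
import json
import urllib.request


def missing_models(configured: list[str], served: list[str]) -> list[str]:
    """Return configured model IDs that the server does not report."""
    served_set = set(served)
    return [m for m in configured if m not in served_set]


def served_model_ids(base_url: str) -> list[str]:
    """Fetch model IDs from an OpenAI-compatible /v1/models endpoint."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        data = json.load(resp)
    return [entry["id"] for entry in data["data"]]


# Example (against a running server):
#   gaps = missing_models(["mistralai/Mixtral-8x7B-v0.1"],
#                         served_model_ids("http://localhost:8080/v1"))
#   if gaps: raise SystemExit(f"Not served by vLLM: {gaps}")
```

Run this against each `base_url` in your config before starting the gateway; a non-empty result would surface as 404s at request time otherwise.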
## Client Examples

**Python**

```python
from openai import OpenAI

# Point at the Keeptrusts gateway, not vLLM directly
client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

# Chat completion
response = client.chat.completions.create(
    model="vllm:chat:mistralai/Mixtral-8x7B-v0.1",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What are the trade-offs between MoE and dense transformer architectures?"},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")
```

**Node.js**

```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:41002/v1",
  apiKey: "kt-your-api-key",
});

async function main() {
  const response = await client.chat.completions.create({
    model: "vllm:chat:mistralai/Mixtral-8x7B-v0.1",
    messages: [
      { role: "system", content: "You are a helpful AI assistant." },
      { role: "user", content: "Explain mixture-of-experts architecture in simple terms." },
    ],
    temperature: 0.7,
    max_tokens: 512,
  });
  console.log(response.choices[0].message.content);
  console.log("Tokens used:", response.usage?.total_tokens);
}

main().catch(console.error);
```

**cURL**

```bash
curl -s http://localhost:41002/v1/chat/completions \
  -H "Authorization: Bearer kt-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm:chat:mistralai/Mixtral-8x7B-v0.1",
    "messages": [
      { "role": "system", "content": "You are a helpful assistant." },
      { "role": "user", "content": "Describe vLLM paged attention in one paragraph." }
    ],
    "temperature": 0.5,
    "max_tokens": 300
  }' | jq .
```
## Streaming

vLLM supports OpenAI-compatible SSE streaming natively. Keeptrusts passes the stream through its enforcement layer and forwards the SSE chunks to your client.

**Python**

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="kt-your-api-key",
)

# stream=True returns an iterator of ChatCompletionChunk objects
stream = client.chat.completions.create(
    model="vllm:chat:mistralai/Mixtral-8x7B-v0.1",
    messages=[{"role": "user", "content": "Write a concise overview of transformer self-attention."}],
    max_tokens=400,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```

**cURL**

```bash
curl -s http://localhost:41002/v1/chat/completions \
  -H "Authorization: Bearer kt-your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm:chat:mistralai/Mixtral-8x7B-v0.1",
    "messages": [{ "role": "user", "content": "List 5 key vLLM features." }],
    "stream": true,
    "max_tokens": 200
  }'
```
## Advanced Configuration

### Multi-GPU Setup with Tensor Parallelism

For large models that require multiple GPUs, start vLLM with `--tensor-parallel-size`:

```bash
# 2-GPU setup for Llama 3.1 70B
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --port 8080

# 4-GPU setup for Mixtral 8x22B
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mixtral-8x22B-v0.1 \
  --tensor-parallel-size 4 \
  --port 8080
```
The Keeptrusts configuration does not change; `base_url` still points to the single vLLM endpoint.
Dockerized vLLM Deployment
# docker-compose.yaml excerpt
services:
vllm:
image: vllm/vllm-openai:latest
ports:
- "8080:8080"
volumes:
- huggingface-cache:/root/.cache/huggingface
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
command:
- "--model"
- "mistralai/Mixtral-8x7B-v0.1"
- "--port"
- "8080"
- "--tensor-parallel-size"
- "2"
Point base_url at the Docker service name from within the Keeptrusts container:
base_url: "http://vllm:8080/v1"
Multiple vLLM Instances with Failover
pack:
name: vllm-providers-4
version: 1.0.0
enabled: true
providers:
targets:
- id: vllm-primary
provider: vllm:chat:mistralai/Mixtral-8x7B-v0.1
base_url: http://vllm-node-1:8080/v1
- id: vllm-secondary
provider: vllm:chat:mistralai/Mixtral-8x7B-v0.1
base_url: http://vllm-node-2:8080/v1
policies:
chain:
- audit-logger
policy:
audit-logger:
immutable: true
retention_days: 365
log_all_access: true
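Conceptually, routing across targets like these combines the `weight` field with health status: unhealthy targets are excluded, and the remainder are picked in proportion to their weights. The sketch below illustrates that idea only; it is not Keeptrusts' actual routing implementation, and the target dictionaries are hypothetical.

```python
import random


def pick_target(targets: list[dict], rng: random.Random) -> dict:
    """Weighted random choice among healthy targets (illustrative only)."""
    healthy = [t for t in targets if t.get("healthy", True)]
    if not healthy:
        raise RuntimeError("no healthy vLLM targets")
    weights = [t.get("weight", 1) for t in healthy]
    return rng.choices(healthy, weights=weights, k=1)[0]


targets = [
    {"id": "vllm-primary", "weight": 3, "healthy": True},
    {"id": "vllm-secondary", "weight": 1, "healthy": True},
]
# If vllm-primary is marked unhealthy, all traffic shifts to vllm-secondary;
# otherwise roughly three quarters of requests land on vllm-primary.
```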
### Health Probe for GPU Memory Checks

When `health_probe: true` is set, Keeptrusts polls `GET /v1/models`. If vLLM has run out of GPU memory or the process has crashed, the target is marked unhealthy and traffic is rerouted to healthy instances.

To monitor GPU memory separately, set up an external probe that checks `nvidia-smi` and calls the Keeptrusts admin API to manually mark targets as healthy or unhealthy.
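Such an external probe might look like the following sketch. The `nvidia-smi` query flags are real; the reporting step is left as a comment because the admin API endpoint is deployment-specific and not shown here.

```python
import subprocess


def gpu_memory_fractions(smi_output: str) -> list[float]:
    """Parse output of `nvidia-smi --query-gpu=memory.used,memory.total
    --format=csv,noheader,nounits` (one 'used, total' line per GPU, in MiB)
    into used/total fractions."""
    fractions = []
    for line in smi_output.strip().splitlines():
        used, total = (float(x) for x in line.split(","))
        fractions.append(used / total)
    return fractions


def gpus_healthy(smi_output: str, threshold: float = 0.95) -> bool:
    """Healthy when every GPU is below the memory-usage threshold."""
    return all(f < threshold for f in gpu_memory_fractions(smi_output))


def probe() -> bool:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    healthy = gpus_healthy(out)
    # Report `healthy` to the Keeptrusts admin API here; see your admin API
    # reference for the exact endpoint.
    return healthy
```

Run it on a schedule (cron or a systemd timer) on each vLLM node; the 0.95 threshold is an arbitrary starting point to tune for your workload.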
## Best Practices

- **Match the model ID exactly.** The `provider` model string must match the ID returned by `GET /v1/models`. Mismatches result in 404 errors from vLLM.
- **Set `allow_insecure_tls: true` only for internal deployments.** Never disable TLS verification for publicly accessible vLLM instances.
- **Increase timeout values for large models.** Models with high sequence lengths or large parameter counts can have multi-second time to first token. Set `timeout_seconds: 120` or higher for 70B+ models.
- **Enable `health_probe: true` in production.** vLLM can OOM under GPU memory pressure. Health probes let Keeptrusts detect and route around down instances automatically.
- **Do not expose vLLM directly.** Route all client traffic through Keeptrusts to enforce policies and maintain a complete audit trail.
- **Start vLLM before the Keeptrusts gateway.** If vLLM is unavailable at startup and `health_probe` is disabled, the first requests will fail with connection errors.
- **Use `max_context_tokens` to cap inference costs.** Setting a conservative limit in the Keeptrusts config prevents runaway token consumption even if the client sends large contexts.
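Clients can also pre-trim oversized conversations before they ever reach the gateway, so requests do not bounce off the server-side cap. A rough sketch using the common "about 4 characters per token" heuristic for English text (both the heuristic and the helper names are illustrative, not Keeptrusts' token accounting):

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4)


def trim_to_budget(messages: list[dict], max_context_tokens: int) -> list[dict]:
    """Keep the system message plus the most recent messages that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    budget = max_context_tokens - sum(approx_tokens(m["content"]) for m in system)
    kept: list[dict] = []
    for m in reversed(rest):  # walk newest to oldest
        cost = approx_tokens(m["content"])
        if cost > budget:
            break
        kept.append(m)
        budget -= cost
    return system + list(reversed(kept))
```

A proper implementation would count tokens with the model's own tokenizer; the point here is only that dropping the oldest turns client-side keeps behavior predictable when the gateway enforces a hard cap.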
## For AI systems

- Canonical terms: Keeptrusts gateway, vLLM, PagedAttention, high-throughput serving, self-hosted, provider target, policy-config.yaml.
- Config field names: `provider`, `model`, `base_url`, `format: "openai"`, `timeout_seconds`, `max_context_tokens`, `health_probe`.
- Key behavior: vLLM serves models with an OpenAI-compatible API; Keeptrusts routes to it, applies policies, and caps context size.
- Constraint: Start vLLM before the Keeptrusts gateway if `health_probe` is disabled; first requests will fail with connection errors otherwise.
- Best next pages: Ollama integration, llama.cpp integration, HuggingFace integration.
## For engineers

- Prerequisites: vLLM installed and serving a model (`python -m vllm.entrypoints.openai.api_server --model <model> --port 8080`), `kt` CLI installed.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Validate: `curl http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"your-model","messages":[{"role":"user","content":"hello"}]}'`.
- Start vLLM before the gateway: if vLLM is unavailable at startup and `health_probe` is disabled, first requests fail.
- Use `max_context_tokens` to cap inference costs; it prevents runaway token consumption from large client contexts.
- No `secret_key_ref` is needed for local deployments. Enable `health_probe` in production to detect vLLM restarts.
## For leaders

- vLLM provides production-grade self-hosted serving with high throughput via PagedAttention; it is best-in-class for GPU utilization.
- Full data sovereignty: no data leaves your infrastructure, satisfying strict data residency requirements.
- Hardware cost is the primary expense; vLLM maximizes tokens per GPU dollar through efficient memory management.
- Keeptrusts' `max_context_tokens` enforcement prevents unexpected cost spikes from large context requests.
## Next steps

- Ollama integration: simpler local serving for development and small-scale deployments
- llama.cpp integration: CPU-optimized inference for smaller models
- HuggingFace integration: hosted alternative when self-hosting is impractical
- Policy configuration: audit-logger and prompt-injection reference
- Quickstart: install `kt` and run your first gateway