HuggingFace
HuggingFace offers two deployment paths for open-weight models: the Serverless Inference API for rapid prototyping and lower-volume usage, and Dedicated Inference Endpoints for production workloads requiring private GPU allocation, custom regions, and VPC networking. Keeptrusts supports both modes through the huggingface provider, automatically translating HuggingFace's native request/response format to the OpenAI-compatible shape your clients expect, so you can enforce prompt injection detection, PII redaction, and audit logging over open-weight models without changing your application code.
Use this page when
- You need the exact command, config, API, or integration details for HuggingFace.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
- A HuggingFace account with an API token (HF_API_TOKEN)
- For serverless: your account must have access to any gated models you intend to use (e.g. Llama 3)
- For dedicated endpoints: a running Inference Endpoint (created via HuggingFace Endpoints)
- kt CLI installed and authenticated (kt auth login)
Set your token before starting the gateway:
export HF_API_TOKEN="hf_..."
Configuration
Minimal — serverless inference
pack:
name: huggingface-providers-1
version: 1.0.0
enabled: true
providers:
targets:
- id: hf-llama-8b
provider: huggingface:chat:meta-llama/Meta-Llama-3.1-8B-Instruct
secret_key_ref:
env: HF_API_TOKEN
policies:
chain:
- audit-logger
policy:
audit-logger:
immutable: true
retention_days: 365
log_all_access: true
Full governance config — serverless + dedicated endpoints
pack:
name: huggingface-governed
version: 1.0.0
enabled: true
policies:
chain:
- prompt-injection
- pii-detector
- safety-filter
- content-filter
- audit-logger
policy:
pii-detector:
action: redact
entities:
- PERSON
- EMAIL_ADDRESS
- PHONE_NUMBER
- CREDIT_CARD
safety-filter:
check_toxicity: true
action: block
content-filter:
categories:
- hate_speech
- harassment
action: block
audit-logger:
destination: api
include_request: true
include_response: true
include_policy_decisions: true
providers:
targets:
- id: hf-llama-8b
provider: huggingface:chat:meta-llama/Meta-Llama-3.1-8B-Instruct
secret_key_ref:
env: HF_API_TOKEN
- id: hf-llama-70b
provider: huggingface:chat:meta-llama/Meta-Llama-3.1-70B-Instruct
secret_key_ref:
env: HF_API_TOKEN
- id: hf-mistral-7b
provider: huggingface:chat:mistralai/Mistral-7B-Instruct-v0.3
secret_key_ref:
env: HF_API_TOKEN
- id: hf-sky-t1
provider: huggingface:chat:NovaSky-Berkeley/Sky-T1-32B-Preview
secret_key_ref:
env: HF_API_TOKEN
- id: hf-dedicated-llama-70b
provider: huggingface:chat:meta-llama/Meta-Llama-3.1-70B-Instruct
base_url: https://{unique-id}.us-east-1.aws.endpoints.huggingface.cloud/v1
secret_key_ref:
env: HF_API_TOKEN
Provider Fields
| Field | Required | Description |
|---|---|---|
| provider | Yes | "huggingface" or "huggingface:chat:{org/model-id}" |
| secret_key_ref | Yes | Environment variable holding the HuggingFace API token (e.g. HF_API_TOKEN) |
| base_url | No | For serverless, defaults to https://api-inference.huggingface.co/models/{model}/v1/chat/completions. For dedicated endpoints, set to your endpoint URL: https://{unique-id}.{region}.aws.endpoints.huggingface.cloud/v1 |
| model | No | Full org/model path when using the bare "huggingface" provider |
| format | No | "huggingface" — Keeptrusts auto-translates to/from OpenAI-compatible format for your clients |
| provider_type | No | "huggingface" — explicitly marks the provider family for format translation routing |
| stream_timeout_seconds | No | Increase for large models on cold-start dedicated endpoints |
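The shorthand huggingface:chat:{org/model-id} form used in the configs above covers most cases; the bare huggingface provider form spells the fields out individually. A sketch, assuming only the field names from the table above; the endpoint placeholder and the 120-second timeout are illustrative values, not defaults:

providers:
  targets:
    - id: hf-dedicated-llama-70b
      provider: huggingface                       # bare provider form
      model: meta-llama/Meta-Llama-3.1-70B-Instruct
      base_url: https://{unique-id}.us-east-1.aws.endpoints.huggingface.cloud/v1
      format: "huggingface"                       # enables request/response translation
      provider_type: "huggingface"                # marks the provider family for translation routing
      stream_timeout_seconds: 120                 # headroom for cold starts
      secret_key_ref:
        env: HF_API_TOKEN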
Supported Models
Serverless Inference API
The following models are available via the HuggingFace Serverless API. Serverless is subject to rate limits that vary by account tier (free, PRO, Enterprise).
| Model | Context | Type | Notes |
|---|---|---|---|
| meta-llama/Meta-Llama-3.1-8B-Instruct | 128k | Chat | Fastest Llama; recommended for high-volume |
| meta-llama/Meta-Llama-3.1-70B-Instruct | 128k | Chat | Best open-weight balance of quality and speed |
| mistralai/Mistral-7B-Instruct-v0.3 | 32k | Chat | Compact and efficient; strong instruction following |
| NovaSky-Berkeley/Sky-T1-32B-Preview | 32k | Chat / Reasoning | Strong reasoning model from Berkeley; free weights |
| sentence-transformers/all-MiniLM-L6-v2 | 512 tokens | Embeddings | Fast, compact semantic embeddings |
| BAAI/bge-large-en-v1.5 | 512 tokens | Embeddings | High-quality dense retrieval embeddings |
Dedicated Inference Endpoints
Dedicated endpoints support any HuggingFace model. Common production choices:
| Model | Context | Use Case |
|---|---|---|
| meta-llama/Meta-Llama-3.1-70B-Instruct | 128k | General enterprise chat |
| meta-llama/Meta-Llama-3.1-405B-Instruct | 128k | Max capability open-weight |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | 64k | High-throughput MoE |
| Qwen/Qwen2.5-72B-Instruct | 128k | Strong multilingual |
| deepseek-ai/DeepSeek-R1 | 64k | Reasoning-specialised |
Client Examples
Start the gateway:
export HF_API_TOKEN="hf_..."
kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml
- Python
- Node.js
- cURL
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:41002/v1",
api_key="unused", # auth handled by Keeptrusts
)
# Serverless — Llama 3.1 8B
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain the attention mechanism in transformers."},
],
max_tokens=1024,
temperature=0.7,
)
print(response.choices[0].message.content)
# Dedicated endpoint — Llama 3.1 70B
dedicated = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
messages=[
{"role": "user", "content": "Write a production-ready Python function to validate EU VAT numbers."},
],
max_tokens=2048,
temperature=0.2,
)
print(dedicated.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:41002/v1",
apiKey: "unused",
});
// Serverless — Llama 3.1 8B
const response = await client.chat.completions.create({
model: "meta-llama/Meta-Llama-3.1-8B-Instruct",
messages: [
{ role: "system", content: "You are a helpful assistant." },
{
role: "user",
content: "Explain the attention mechanism in transformers.",
},
],
max_tokens: 1024,
temperature: 0.7,
});
console.log(response.choices[0].message.content);
// Dedicated endpoint — Llama 3.1 70B
const dedicated = await client.chat.completions.create({
model: "meta-llama/Meta-Llama-3.1-70B-Instruct",
messages: [
{
role: "user",
content:
"Write a production-ready TypeScript function to validate EU VAT numbers.",
},
],
max_tokens: 2048,
temperature: 0.2,
});
console.log(dedicated.choices[0].message.content);
# Serverless — Llama 3.1 8B
curl -s http://localhost:41002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain the attention mechanism in transformers."}
],
"max_tokens": 1024,
"temperature": 0.7
}' | jq .choices[0].message.content
# Dedicated endpoint — Llama 3.1 70B
curl -s http://localhost:41002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"messages": [
{"role": "user", "content": "Write a production-ready Python function to validate EU VAT numbers."}
],
"max_tokens": 2048,
"temperature": 0.2
}' | jq .choices[0].message.content
Streaming
HuggingFace Inference Endpoints support SSE streaming for text-generation models. Keeptrusts translates the stream from HuggingFace's native SSE format to the OpenAI-compatible data: {"choices":[{"delta":{...}}]} shape your client expects.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Write a detailed explanation of how RLHF works, step by step.",
        }
    ],
    max_tokens=2048,
    stream=True,  # request SSE streaming from the gateway
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
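Raw HTTP clients can request the same stream by adding the OpenAI-compatible stream flag. A minimal cURL sketch, assuming the gateway started above; -N disables output buffering so the data: chunks print as they arrive:

curl -sN http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "messages": [
      {"role": "user", "content": "Write a detailed explanation of how RLHF works, step by step."}
    ],
    "max_tokens": 2048,
    "stream": true
  }'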
Cold-start note — Serverless inference endpoints may take 20–60 seconds to warm up after a period of inactivity. Dedicated endpoints remain warm for a configurable minimum time but can still cold-start after scale-to-zero events. Increase stream_timeout_seconds to accommodate cold starts:
pack:
name: huggingface-providers-3
version: 1.0.0
enabled: true
providers:
targets:
- id: hf-llama-70b
provider: huggingface:chat:meta-llama/Meta-Llama-3.1-70B-Instruct
stream_timeout_seconds: 120
secret_key_ref:
env: HF_API_TOKEN
policies:
chain:
- audit-logger
policy:
audit-logger:
immutable: true
retention_days: 365
log_all_access: true
Advanced Configuration
Switching between serverless and dedicated endpoints
Use Keeptrusts's routing policy to send low-priority requests to the serverless tier and production requests to the dedicated endpoint. This maximises utilisation of your dedicated GPU allocation while keeping development costs low:
policies:
chain:
- prompt-injection
- pii-detector
- router
- audit-logger
policy:
router:
rules:
- when_role: production
target: hf-dedicated-llama-70b
- when_role: developer
target: hf-llama-8b
- default:
target: hf-llama-8b
providers:
targets:
- id: hf-llama-8b
provider: huggingface:chat:meta-llama/Meta-Llama-3.1-8B-Instruct
secret_key_ref:
env: HF_API_TOKEN
- id: hf-dedicated-llama-70b
provider: huggingface:chat:meta-llama/Meta-Llama-3.1-70B-Instruct
base_url: https://{unique-id}.us-east-1.aws.endpoints.huggingface.cloud/v1
secret_key_ref:
env: HF_API_TOKEN
EU-region dedicated endpoints for GDPR compliance
HuggingFace Dedicated Endpoints can be deployed in eu-west-1 (Ireland) and eu-central-1 (Frankfurt). For GDPR workloads, use an EU endpoint and document the deployment region in your data processing records:
pack:
name: huggingface-providers-5
version: 1.0.0
enabled: true
providers:
targets:
- id: hf-eu-llama-70b
provider: huggingface:chat:meta-llama/Meta-Llama-3.1-70B-Instruct
base_url: https://{unique-id}.eu-west-1.aws.endpoints.huggingface.cloud/v1
secret_key_ref:
env: HF_API_TOKEN
policies:
chain:
- audit-logger
policy:
audit-logger:
immutable: true
retention_days: 365
log_all_access: true
Embeddings via serverless API
HuggingFace hosts embedding models suitable for RAG pipelines. Keeptrusts routes embedding requests through the gateway under the same token handling and PII policies as chat:
pack:
name: huggingface-providers-6
version: 1.0.0
enabled: true
providers:
targets:
- id: hf-bge-embeddings
provider: huggingface
model: BAAI/bge-large-en-v1.5
secret_key_ref:
env: HF_API_TOKEN
policies:
chain:
- audit-logger
policy:
audit-logger:
immutable: true
retention_days: 365
log_all_access: true
from openai import OpenAI
client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")
embedding = client.embeddings.create(
model="BAAI/bge-large-en-v1.5",
input="enterprise AI governance policy enforcement gateway",
)
print(f"Vector dimensions: {len(embedding.data[0].embedding)}")
Best Practices
- For production, always use Dedicated Inference Endpoints — Serverless inference is subject to shared rate limits, cold starts, and variable latency. Dedicated endpoints provide guaranteed GPU allocation, predictable latency, and SLA-backed availability. Use serverless only for development and low-frequency evaluation.
- Match the endpoint region to your data residency requirements — HuggingFace offers EU (eu-west-1, eu-central-1) and US (us-east-1) regions for dedicated endpoints. For GDPR-governed workloads, deploy in an EU region and record the endpoint URL and region in your data processing registry.
- Respect model access gating — Meta's Llama models require accepting a license on HuggingFace before the token grants API access. Attempting to call a gated model without license acceptance returns a 403; a quick check is sketched after this list. Verify model access in the HuggingFace UI before deploying the gateway config.
- Set realistic stream_timeout_seconds for cold starts — Dedicated endpoints configured with scale-to-zero can take 30–90 seconds on the first request after idle. Set stream_timeout_seconds: 120 for production dedicated targets to avoid false timeout errors during scale-up events.
- Apply pii-detector before requests reach open-weight models — Unlike managed model providers, self-hosted or HuggingFace-hosted models do not have built-in PII filtering. Keeptrusts's pii-detector is your primary defence. Redact on the request path for all HuggingFace targets without exception.
- Use format: "huggingface" and provider_type: "huggingface" together — These two fields enable Keeptrusts's automatic format translation layer, which converts OpenAI-style chat messages to HuggingFace's native inputs/parameters schema and maps the response back. Omitting either field may result in malformed requests or unparsed responses, particularly for non-TGI-compatible endpoints.
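One quick way to confirm a token can reach a gated model is to call the serverless chat-completions route directly and inspect the status code. A sketch, assuming the default serverless base URL from the provider fields table; a 200 means access is granted, while a 403 typically means the model license has not been accepted for this token:

# Check gated-model access with the raw HuggingFace serverless route (illustrative, not a Keeptrusts command)
curl -s -o /dev/null -w "%{http_code}\n" \
  https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct/v1/chat/completions \
  -H "Authorization: Bearer $HF_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 1}'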
For AI systems
- Canonical terms: Keeptrusts gateway, HuggingFace, Inference Endpoints, TGI (Text Generation Inference), provider target, policy-config.yaml, provider: "huggingface", HF_API_TOKEN.
- Config field names: provider, model, base_url, secret_key_ref.env: "HF_API_TOKEN", format: "huggingface", provider_type: "huggingface", stream_timeout_seconds.
- Key behavior: Keeptrusts translates OpenAI-style chat messages to HuggingFace's native inputs/parameters schema and maps responses back.
- Both format: "huggingface" and provider_type: "huggingface" are required for automatic format translation.
- Best next pages: vLLM integration (self-hosted), Together AI integration, Policy configuration.
For engineers
- Prerequisites: HuggingFace token (HF_API_TOKEN env var from huggingface.co/settings/tokens), Inference Endpoint or public Inference API, kt CLI installed.
- Start command: kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml.
- Validate: curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"meta-llama/Meta-Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"hello"}]}'.
- Set both format: "huggingface" and provider_type: "huggingface" — omitting either may cause malformed requests or unparsed responses.
- For dedicated Inference Endpoints, set base_url to your endpoint URL (e.g., https://xxx.us-east-1.aws.endpoints.huggingface.cloud/v1).
- Set stream_timeout_seconds based on model size — 70B+ models on shared infrastructure may need 90–120 seconds.
For leaders
- HuggingFace provides access to thousands of open-weight models — Keeptrusts enables governance over this broad model catalog.
- Inference Endpoints offer dedicated GPU instances with data residency options (AWS, GCP, Azure regions).
- Open-weight models eliminate vendor lock-in for the model layer — Keeptrusts provides the consistent governance layer across model changes.
- Format translation between OpenAI and HuggingFace schemas means existing application code works unchanged when switching model providers.
Next steps
- vLLM integration — self-hosted high-throughput serving for HuggingFace models
- Together AI integration — hosted open models with faster inference
- Ollama integration — local model serving for development
- Policy configuration — prompt-injection and PII policy reference
- Quickstart — install kt and run your first gateway