
HuggingFace

HuggingFace offers two deployment paths for open-weight models: the Serverless Inference API for rapid prototyping and lower-volume usage, and Dedicated Inference Endpoints for production workloads requiring private GPU allocation, custom regions, and VPC networking. Keeptrusts supports both modes through the huggingface provider, automatically translating HuggingFace's native request/response format to the OpenAI-compatible shape your clients expect, so you can enforce prompt injection detection, PII redaction, and audit logging over open-weight models without changing your application code.

Use this page when

  • You need the exact command, config, API, or integration details for HuggingFace.
  • You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
  • You want a guided rollout instead of this reference page; in that case, use the linked workflow pages in Next steps.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Prerequisites

  • A HuggingFace account with an API token (HF_API_TOKEN)
  • For serverless: your account must have access to any gated models you intend to use (e.g. Llama 3)
  • For dedicated endpoints: a running Inference Endpoint (created via HuggingFace Endpoints)
  • kt CLI installed and authenticated (kt auth login)

Set your token before starting the gateway:

export HF_API_TOKEN="hf_..."

Configuration

Minimal — serverless inference

pack:
  name: huggingface-providers-1
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: hf-llama-8b
      provider: huggingface:chat:meta-llama/Meta-Llama-3.1-8B-Instruct
      secret_key_ref:
        env: HF_API_TOKEN
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

Full governance config — serverless + dedicated endpoints

pack:
  name: huggingface-governed
  version: 1.0.0
  enabled: true
policies:
  chain:
    - prompt-injection
    - pii-detector
    - safety-filter
    - content-filter
    - audit-logger
  policy:
    pii-detector:
      action: redact
      entities:
        - PERSON
        - EMAIL_ADDRESS
        - PHONE_NUMBER
        - CREDIT_CARD
    safety-filter:
      check_toxicity: true
      action: block
    content-filter:
      categories:
        - hate_speech
        - harassment
      action: block
    audit-logger:
      destination: api
      include_request: true
      include_response: true
      include_policy_decisions: true
providers:
  targets:
    - id: hf-llama-8b
      provider: huggingface:chat:meta-llama/Meta-Llama-3.1-8B-Instruct
      secret_key_ref:
        env: HF_API_TOKEN
    - id: hf-llama-70b
      provider: huggingface:chat:meta-llama/Meta-Llama-3.1-70B-Instruct
      secret_key_ref:
        env: HF_API_TOKEN
    - id: hf-mistral-7b
      provider: huggingface:chat:mistralai/Mistral-7B-Instruct-v0.3
      secret_key_ref:
        env: HF_API_TOKEN
    - id: hf-sky-t1
      provider: huggingface:chat:NovaSky-Berkeley/Sky-T1-32B-Preview
      secret_key_ref:
        env: HF_API_TOKEN
    - id: hf-dedicated-llama-70b
      provider: huggingface:chat:meta-llama/Meta-Llama-3.1-70B-Instruct
      base_url: https://{unique-id}.us-east-1.aws.endpoints.huggingface.cloud/v1
      secret_key_ref:
        env: HF_API_TOKEN

Provider Fields

  • provider (required) — "huggingface" or "huggingface:chat:{org/model-id}"
  • secret_key_ref (required) — Environment variable holding the HuggingFace API token (e.g. HF_API_TOKEN)
  • base_url (optional) — For serverless, defaults to https://api-inference.huggingface.co/models/{model}/v1/chat/completions. For dedicated endpoints, set to your endpoint URL: https://{unique-id}.{region}.aws.endpoints.huggingface.cloud/v1
  • model (optional) — Full org/model path when using the bare "huggingface" provider
  • format (optional) — "huggingface"; Keeptrusts auto-translates to/from OpenAI-compatible format for your clients
  • provider_type (optional) — "huggingface"; explicitly marks the provider family for format translation routing
  • stream_timeout_seconds (optional) — Increase for large models on cold-start dedicated endpoints

Supported Models

Serverless Inference API

The following models are available via the HuggingFace Serverless API. Serverless is subject to rate limits that vary by account tier (free, PRO, Enterprise).

  • meta-llama/Meta-Llama-3.1-8B-Instruct — 128k context, chat. Fastest Llama; recommended for high-volume use.
  • meta-llama/Meta-Llama-3.1-70B-Instruct — 128k context, chat. Best open-weight balance of quality and speed.
  • mistralai/Mistral-7B-Instruct-v0.3 — 32k context, chat. Compact and efficient; strong instruction following.
  • NovaSky-Berkeley/Sky-T1-32B-Preview — 32k context, chat/reasoning. Strong reasoning model from Berkeley; free weights.
  • sentence-transformers/all-MiniLM-L6-v2 — 512-token context, embeddings. Fast, compact semantic embeddings.
  • BAAI/bge-large-en-v1.5 — 512-token context, embeddings. High-quality dense retrieval embeddings.
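Serverless rate limits surface to clients as HTTP 429 errors. A small client-side retry wrapper can smooth bursty development workloads; the sketch below is not part of Keeptrusts, and the `openai.RateLimitError` usage mentioned in the comment is an assumption about your client library:

```python
import time

def call_with_backoff(fn, retryable=(Exception,), max_retries=4, base_delay=1.0):
    """Retry fn with exponential backoff when a retryable exception is raised.
    For the OpenAI Python client, pass retryable=(openai.RateLimitError,) so
    serverless 429s are retried: 1s, 2s, 4s, ... between attempts."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_retries:
                raise  # budget exhausted; surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))
```

For example: `call_with_backoff(lambda: client.chat.completions.create(...), retryable=(openai.RateLimitError,))`.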

Dedicated Inference Endpoints

Dedicated endpoints support any HuggingFace model. Common production choices:

  • meta-llama/Meta-Llama-3.1-70B-Instruct — 128k context. General enterprise chat.
  • meta-llama/Meta-Llama-3.1-405B-Instruct — 128k context. Max capability open-weight.
  • mistralai/Mixtral-8x22B-Instruct-v0.1 — 64k context. High-throughput MoE.
  • Qwen/Qwen2.5-72B-Instruct — 128k context. Strong multilingual.
  • deepseek-ai/DeepSeek-R1 — 64k context. Reasoning-specialised.

Client Examples

Start the gateway:

export HF_API_TOKEN="hf_..."
kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="unused",  # auth handled by Keeptrusts
)

# Serverless — Llama 3.1 8B
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the attention mechanism in transformers."},
    ],
    max_tokens=1024,
    temperature=0.7,
)
print(response.choices[0].message.content)

# Dedicated endpoint — Llama 3.1 70B
dedicated = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[
        {"role": "user", "content": "Write a production-ready Python function to validate EU VAT numbers."},
    ],
    max_tokens=2048,
    temperature=0.2,
)
print(dedicated.choices[0].message.content)

Streaming

HuggingFace Inference Endpoints support SSE streaming for text-generation models. Keeptrusts translates the stream from HuggingFace's native SSE format to the OpenAI-compatible data: {"choices":[{"delta":{...}}]} shape your client expects.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")

stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Write a detailed explanation of how RLHF works, step by step.",
        }
    ],
    max_tokens=2048,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()

Cold-start note — Serverless inference endpoints may take 20–60 seconds to warm up after a period of inactivity. Dedicated endpoints remain warm for a configurable minimum time but can still cold-start after scale-to-zero events. Increase stream_timeout_seconds to accommodate cold starts:

pack:
  name: huggingface-providers-3
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: hf-llama-70b
      provider: huggingface:chat:meta-llama/Meta-Llama-3.1-70B-Instruct
      stream_timeout_seconds: 120
      secret_key_ref:
        env: HF_API_TOKEN
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
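A generous timeout only keeps the connection open during scale-up. If you would rather absorb the cold start before user traffic arrives, a warm-up poll can be sketched as follows; this is a hypothetical client-side helper, not a Keeptrusts feature:

```python
import time

def wait_until_warm(ping, max_wait_seconds=120, interval=5.0):
    """Poll a cheap request (e.g. a 1-token completion sent through the
    gateway) until the endpoint answers, tolerating scale-to-zero cold starts.
    Raises the last ping error once the deadline has passed."""
    deadline = time.monotonic() + max_wait_seconds
    while True:
        try:
            return ping()
        except Exception:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)
```

Run it once at deployment time, e.g. `wait_until_warm(lambda: client.chat.completions.create(model=..., messages=[{"role": "user", "content": "ping"}], max_tokens=1))`.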

Advanced Configuration

Switching between serverless and dedicated endpoints

Use Keeptrusts's routing policy to send low-priority requests to the serverless tier and production requests to the dedicated endpoint. This maximises utilisation of your dedicated GPU allocation while keeping development costs low:

policies:
  chain:
    - prompt-injection
    - pii-detector
    - router
    - audit-logger
  policy:
    router:
      rules:
        - when_role: production
          target: hf-dedicated-llama-70b
        - when_role: developer
          target: hf-llama-8b
        - default:
          target: hf-llama-8b
providers:
  targets:
    - id: hf-llama-8b
      provider: huggingface:chat:meta-llama/Meta-Llama-3.1-8B-Instruct
      secret_key_ref:
        env: HF_API_TOKEN
    - id: hf-dedicated-llama-70b
      provider: huggingface:chat:meta-llama/Meta-Llama-3.1-70B-Instruct
      base_url: https://{unique-id}.us-east-1.aws.endpoints.huggingface.cloud/v1
      secret_key_ref:
        env: HF_API_TOKEN

EU-region dedicated endpoints for GDPR compliance

HuggingFace Dedicated Endpoints can be deployed in eu-west-1 (Ireland) and eu-central-1 (Frankfurt). For GDPR workloads, use an EU endpoint and document the deployment region in your data processing records:

pack:
  name: huggingface-providers-5
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: hf-eu-llama-70b
      provider: huggingface:chat:meta-llama/Meta-Llama-3.1-70B-Instruct
      base_url: https://{unique-id}.eu-west-1.aws.endpoints.huggingface.cloud/v1
      secret_key_ref:
        env: HF_API_TOKEN
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
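As a deploy-time guard, you can assert that a configured base_url really points at an EU region before recording it in your data processing registry. The helper below is a hypothetical sketch, not part of Keeptrusts; it assumes the endpoint URL format shown above:

```python
EU_REGIONS = ("eu-west-1", "eu-central-1")

def is_eu_endpoint(base_url):
    """True when a dedicated endpoint URL targets an EU region, based on the
    https://{unique-id}.{region}.aws.endpoints.huggingface.cloud/v1 format."""
    return any(
        f".{region}.aws.endpoints.huggingface.cloud" in base_url
        for region in EU_REGIONS
    )
```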

Embeddings via serverless API

HuggingFace hosts embedding models suitable for RAG pipelines. Keeptrusts gateways embedding requests using the same token and PII policies as chat:

pack:
  name: huggingface-providers-6
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: hf-bge-embeddings
      provider: huggingface
      model: BAAI/bge-large-en-v1.5
      secret_key_ref:
        env: HF_API_TOKEN
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")

embedding = client.embeddings.create(
    model="BAAI/bge-large-en-v1.5",
    input="enterprise AI governance policy enforcement gateway",
)
print(f"Vector dimensions: {len(embedding.data[0].embedding)}")
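Once vectors come back, ranking documents in a RAG pipeline reduces to a similarity metric over the embeddings. A minimal cosine-similarity sketch in plain Python:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors,
    as used for dense-retrieval ranking in a RAG pipeline."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

In practice you would compute this between a query embedding and each document embedding, then keep the top-k scoring documents.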

Best Practices

  1. For production, always use Dedicated Inference Endpoints — Serverless inference is subject to shared rate limits, cold starts, and variable latency. Dedicated endpoints provide guaranteed GPU allocation, predictable latency, and SLA-backed availability. Use serverless only for development and low-frequency evaluation.

  2. Match the endpoint region to your data residency requirements — HuggingFace offers EU (eu-west-1, eu-central-1) and US (us-east-1) regions for dedicated endpoints. For GDPR-governed workloads, deploy in an EU region and record the endpoint URL and region in your data processing registry.

  3. Respect model access gating — Meta's Llama models require accepting a license on HuggingFace before the token grants API access. Attempting to call a gated model without license acceptance returns a 403. Verify model access in the HuggingFace UI before deploying the gateway config.

  4. Set realistic stream_timeout_seconds for cold starts — Dedicated endpoints configured with scale-to-zero can take 30–90 seconds on the first request after idle. Set stream_timeout_seconds: 120 for production dedicated targets to avoid false timeout errors during scale-up events.

  5. Apply pii-detector before requests reach open-weight models — Unlike managed model providers, self-hosted or HuggingFace-hosted models do not have built-in PII filtering. Keeptrusts's pii-detector is your primary defence. Redact on the request path for all HuggingFace targets without exception.

  6. Use format: "huggingface" and provider_type: "huggingface" together — These two fields enable Keeptrusts's automatic format translation layer, which converts OpenAI-style chat messages to HuggingFace's native inputs/parameters schema and maps the response back. Omitting either field may result in malformed requests or unparsed responses, particularly for non-TGI-compatible endpoints.
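Best practice 3 (model access gating) can be automated with a preflight check before deploying the gateway config. The sketch below queries the public Hub API; the endpoint https://huggingface.co/api/models/{id} and the 401/403 interpretation are assumptions about the Hub's behaviour, not Keeptrusts functionality:

```python
import urllib.error
import urllib.request

GATED_STATUSES = (401, 403)

def is_gated_denial(status_code):
    """True when the Hub status code suggests a gated model whose license
    the token's owner has not accepted (or an underscoped token)."""
    return status_code in GATED_STATUSES

def check_model_access(model_id, token):
    """Preflight a (possibly gated) model against the public Hub API
    before referencing it in a gateway provider target."""
    req = urllib.request.Request(
        f"https://huggingface.co/api/models/{model_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(req, timeout=10):
            return True
    except urllib.error.HTTPError as err:
        if is_gated_denial(err.code):
            return False  # accept the model license on huggingface.co, then retry
        raise
```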

For AI systems

  • Canonical terms: Keeptrusts gateway, HuggingFace, Inference Endpoints, TGI (Text Generation Inference), provider target, policy-config.yaml, provider: "huggingface", HF_API_TOKEN.
  • Config field names: provider, model, base_url, secret_key_ref.env: "HF_API_TOKEN", format: "huggingface", provider_type: "huggingface", stream_timeout_seconds.
  • Key behavior: Keeptrusts translates OpenAI-style chat messages to HuggingFace's native inputs/parameters schema and maps responses back.
  • Both format: "huggingface" and provider_type: "huggingface" are required for automatic format translation.
  • Best next pages: vLLM integration (self-hosted), Together AI integration, Policy configuration.

For engineers

  • Prerequisites: HuggingFace token (HF_API_TOKEN env var from huggingface.co/settings/tokens), Inference Endpoint or public Inference API, kt CLI installed.
  • Start command: kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml.
  • Validate: curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"meta-llama/Meta-Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"hello"}]}'.
  • Set both format: "huggingface" and provider_type: "huggingface" — omitting either may cause malformed requests or unparsed responses.
  • For dedicated Inference Endpoints, set base_url to your endpoint URL (e.g., https://{unique-id}.us-east-1.aws.endpoints.huggingface.cloud/v1).
  • Set stream_timeout_seconds based on model size — 70B+ models on shared infrastructure may need 90–120 seconds.

For leaders

  • HuggingFace provides access to thousands of open-weight models — Keeptrusts enables governance over this broad model catalog.
  • Inference Endpoints offer dedicated GPU instances with data residency options (AWS, GCP, Azure regions).
  • Open-weight models eliminate vendor lock-in for the model layer — Keeptrusts provides the consistent governance layer across model changes.
  • Format translation between OpenAI and HuggingFace schemas means existing application code works unchanged when switching model providers.

Next steps