HuggingFace
HuggingFace offers two deployment paths for open-weight models: the Serverless Inference API for rapid prototyping and lower-volume usage, and Dedicated Inference Endpoints for production workloads requiring private GPU allocation, custom regions, and VPC networking. Keeptrusts supports both modes through the huggingface provider, automatically translating HuggingFace's native request/response format to the OpenAI-compatible shape your clients expect, so you can enforce prompt injection detection, PII redaction, and audit logging over open-weight models without changing your application code.
Use this page when
- You need the exact command, config, API, or integration details for HuggingFace.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
- A HuggingFace account with an API token (HF_API_TOKEN)
- For serverless: your account must have access to any gated models you intend to use (e.g. Llama 3)
- For dedicated endpoints: a running Inference Endpoint (created via HuggingFace Endpoints)
- kt CLI installed and authenticated (kt auth login)
Set your token before starting the gateway:
export HF_API_TOKEN="hf_..."
Configuration
Minimal — serverless inference
pack:
name: huggingface-providers-1
version: 1.0.0
enabled: true
providers:
targets:
- id: hf-llama-8b
provider: huggingface:chat:meta-llama/Meta-Llama-3.1-8B-Instruct
secret_key_ref:
env: HF_API_TOKEN
policies:
chain:
- audit-logger
policy:
audit-logger:
immutable: true
retention_days: 365
log_all_access: true
Full governance config — serverless + dedicated endpoints
pack:
name: huggingface-governed
version: 1.0.0
enabled: true
policies:
chain:
- prompt-injection
- pii-detector
- safety-filter
- content-filter
- audit-logger
policy:
pii-detector:
action: redact
entities:
- PERSON
- EMAIL_ADDRESS
- PHONE_NUMBER
- CREDIT_CARD
safety-filter:
check_toxicity: true
action: block
content-filter:
categories:
- hate_speech
- harassment
action: block
audit-logger:
destination: api
include_request: true
include_response: true
include_policy_decisions: true
providers:
targets:
- id: hf-llama-8b
provider: huggingface:chat:meta-llama/Meta-Llama-3.1-8B-Instruct
secret_key_ref:
env: HF_API_TOKEN
- id: hf-llama-70b
provider: huggingface:chat:meta-llama/Meta-Llama-3.1-70B-Instruct
secret_key_ref:
env: HF_API_TOKEN
- id: hf-mistral-7b
provider: huggingface:chat:mistralai/Mistral-7B-Instruct-v0.3
secret_key_ref:
env: HF_API_TOKEN
- id: hf-sky-t1
provider: huggingface:chat:NovaSky-Berkeley/Sky-T1-32B-Preview
secret_key_ref:
env: HF_API_TOKEN
- id: hf-dedicated-llama-70b
provider: huggingface:chat:meta-llama/Meta-Llama-3.1-70B-Instruct
base_url: https://{unique-id}.us-east-1.aws.endpoints.huggingface.cloud/v1
secret_key_ref:
env: HF_API_TOKEN
Provider Fields
| Field | Required | Description |
|---|---|---|
| provider | Yes | "huggingface" or "huggingface:chat:{org/model-id}" |
| secret_key_ref | Yes | Environment variable holding the HuggingFace API token (e.g. HF_API_TOKEN) |
| base_url | No | For serverless, defaults to https://api-inference.huggingface.co/models/{model}/v1/chat/completions. For dedicated endpoints, set to your endpoint URL: https://{unique-id}.{region}.aws.endpoints.huggingface.cloud/v1 |
| model | No | Full org/model path when using the bare "huggingface" provider |
| format | No | "huggingface" — Keeptrusts auto-translates to/from OpenAI-compatible format for your clients |
| provider_type | No | "huggingface" — explicitly marks the provider family for format translation routing |
| stream_timeout_seconds | No | Increase for large models on cold-start dedicated endpoints |
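The shorthand huggingface:chat:{org/model-id} form used in the configs above covers most cases; the bare huggingface provider form spells the fields out individually. A sketch, assuming only the field names from the table above; the endpoint placeholder and the 120-second timeout are illustrative values, not defaults:

providers:
  targets:
    - id: hf-dedicated-llama-70b
      provider: huggingface                       # bare provider form
      model: meta-llama/Meta-Llama-3.1-70B-Instruct
      base_url: https://{unique-id}.us-east-1.aws.endpoints.huggingface.cloud/v1
      format: "huggingface"                       # enables request/response translation
      provider_type: "huggingface"                # marks the provider family for translation routing
      stream_timeout_seconds: 120                 # headroom for cold starts
      secret_key_ref:
        env: HF_API_TOKEN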
Supported Models
Serverless Inference API
The following models are available via the HuggingFace Serverless API. Serverless is subject to rate limits that vary by account tier (free, PRO, Enterprise).
| Model | Context | Type | Notes |
|---|---|---|---|
| meta-llama/Meta-Llama-3.1-8B-Instruct | 128k | Chat | Fastest Llama; recommended for high-volume |
| meta-llama/Meta-Llama-3.1-70B-Instruct | 128k | Chat | Best open-weight balance of quality and speed |
| mistralai/Mistral-7B-Instruct-v0.3 | 32k | Chat | Compact and efficient; strong instruction following |
| NovaSky-Berkeley/Sky-T1-32B-Preview | 32k | Chat / Reasoning | Strong reasoning model from Berkeley; free weights |
| sentence-transformers/all-MiniLM-L6-v2 | 512 tokens | Embeddings | Fast, compact semantic embeddings |
| BAAI/bge-large-en-v1.5 | 512 tokens | Embeddings | High-quality dense retrieval embeddings |
Dedicated Inference Endpoints
Dedicated endpoints support any HuggingFace model. Common production choices:
| Model | Context | Use Case |
|---|---|---|
| meta-llama/Meta-Llama-3.1-70B-Instruct | 128k | General enterprise chat |
| meta-llama/Meta-Llama-3.1-405B-Instruct | 128k | Max capability open-weight |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | 64k | High-throughput MoE |
| Qwen/Qwen2.5-72B-Instruct | 128k | Strong multilingual |
| deepseek-ai/DeepSeek-R1 | 64k | Reasoning-specialised |
Client Examples
Start the gateway:
export HF_API_TOKEN="hf_..."
kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml
- Python
- Node.js
- cURL
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:41002/v1",
api_key="unused", # auth handled by Keeptrusts
)
# Serverless — Llama 3.1 8B
response = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain the attention mechanism in transformers."},
],
max_tokens=1024,
temperature=0.7,
)
print(response.choices[0].message.content)
# Dedicated endpoint — Llama 3.1 70B
dedicated = client.chat.completions.create(
model="meta-llama/Meta-Llama-3.1-70B-Instruct",
messages=[
{"role": "user", "content": "Write a production-ready Python function to validate EU VAT numbers."},
],
max_tokens=2048,
temperature=0.2,
)
print(dedicated.choices[0].message.content)
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "http://localhost:41002/v1",
apiKey: "unused",
});
// Serverless — Llama 3.1 8B
const response = await client.chat.completions.create({
model: "meta-llama/Meta-Llama-3.1-8B-Instruct",
messages: [
{ role: "system", content: "You are a helpful assistant." },
{
role: "user",
content: "Explain the attention mechanism in transformers.",
},
],
max_tokens: 1024,
temperature: 0.7,
});
console.log(response.choices[0].message.content);
// Dedicated endpoint — Llama 3.1 70B
const dedicated = await client.chat.completions.create({
model: "meta-llama/Meta-Llama-3.1-70B-Instruct",
messages: [
{
role: "user",
content:
"Write a production-ready TypeScript function to validate EU VAT numbers.",
},
],
max_tokens: 2048,
temperature: 0.2,
});
console.log(dedicated.choices[0].message.content);
# Serverless — Llama 3.1 8B
curl -s http://localhost:41002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain the attention mechanism in transformers."}
],
"max_tokens": 1024,
"temperature": 0.7
}' | jq .choices[0].message.content
# Dedicated endpoint — Llama 3.1 70B
curl -s http://localhost:41002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
"messages": [
{"role": "user", "content": "Write a production-ready Python function to validate EU VAT numbers."}
],
"max_tokens": 2048,
"temperature": 0.2
}' | jq .choices[0].message.content
Streaming
HuggingFace Inference Endpoints support SSE streaming for text-generation models. Keeptrusts translates the stream from HuggingFace's native SSE format to the OpenAI-compatible data: {"choices":[{"delta":{...}}]} shape your client expects.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-70B-Instruct",
    messages=[
        {
            "role": "user",
            "content": "Write a detailed explanation of how RLHF works, step by step.",
        }
    ],
    max_tokens=2048,
    stream=True,  # request SSE streaming from the gateway
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
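Raw HTTP clients can request the same stream by adding the OpenAI-compatible stream flag. A minimal cURL sketch, assuming the gateway started above; -N disables output buffering so the data: chunks print as they arrive:

curl -sN http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "messages": [
      {"role": "user", "content": "Write a detailed explanation of how RLHF works, step by step."}
    ],
    "max_tokens": 2048,
    "stream": true
  }'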
Cold-start note — Serverless inference endpoints may take 20–60 seconds to warm up after a period of inactivity. Dedicated endpoints remain warm for a configurable minimum time but can still cold-start after scale-to-zero events. Increase stream_timeout_seconds to accommodate cold starts:
pack:
name: huggingface-providers-3
version: 1.0.0
enabled: true
providers:
targets:
- id: hf-llama-70b
provider: huggingface:chat:meta-llama/Meta-Llama-3.1-70B-Instruct
stream_timeout_seconds: 120
secret_key_ref:
env: HF_API_TOKEN
policies:
chain:
- audit-logger
policy:
audit-logger:
immutable: true
retention_days: 365
log_all_access: true
Advanced Configuration
Switching between serverless and dedicated endpoints
Use Keeptrusts's routing policy to send low-priority requests to the serverless tier and production requests to the dedicated endpoint. This maximises utilisation of your dedicated GPU allocation while keeping development costs low:
policies:
chain:
- prompt-injection
- pii-detector
- router
- audit-logger
policy:
router:
rules:
- when_role: production
target: hf-dedicated-llama-70b
- when_role: developer
target: hf-llama-8b
- default:
target: hf-llama-8b
providers:
targets:
- id: hf-llama-8b
provider: huggingface:chat:meta-llama/Meta-Llama-3.1-8B-Instruct
secret_key_ref:
env: HF_API_TOKEN
- id: hf-dedicated-llama-70b
provider: huggingface:chat:meta-llama/Meta-Llama-3.1-70B-Instruct
base_url: https://{unique-id}.us-east-1.aws.endpoints.huggingface.cloud/v1
secret_key_ref:
env: HF_API_TOKEN
EU-region dedicated endpoints for GDPR compliance
HuggingFace Dedicated Endpoints can be deployed in eu-west-1 (Ireland) and eu-central-1 (Frankfurt). For GDPR workloads, use an EU endpoint and document the deployment region in your data processing records:
pack:
name: huggingface-providers-5
version: 1.0.0
enabled: true
providers:
targets:
- id: hf-eu-llama-70b
provider: huggingface:chat:meta-llama/Meta-Llama-3.1-70B-Instruct
base_url: https://{unique-id}.eu-west-1.aws.endpoints.huggingface.cloud/v1
secret_key_ref:
env: HF_API_TOKEN
policies:
chain:
- audit-logger
policy:
audit-logger:
immutable: true
retention_days: 365
log_all_access: true
Embeddings via serverless API
HuggingFace hosts embedding models suitable for RAG pipelines. Keeptrusts routes embedding requests through the gateway under the same token handling and PII policies as chat:
pack:
name: huggingface-providers-6
version: 1.0.0
enabled: true
providers:
targets:
- id: hf-bge-embeddings
provider: huggingface
model: BAAI/bge-large-en-v1.5
secret_key_ref:
env: HF_API_TOKEN
policies:
chain:
- audit-logger
policy:
audit-logger:
immutable: true
retention_days: 365
log_all_access: true
from openai import OpenAI
client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")
embedding = client.embeddings.create(
model="BAAI/bge-large-en-v1.5",
input="enterprise AI governance policy enforcement gateway",
)
print(f"Vector dimensions: {len(embedding.data[0].embedding)}")
Best Practices
- For production, always use Dedicated Inference Endpoints — Serverless inference is subject to shared rate limits, cold starts, and variable latency. Dedicated endpoints provide guaranteed GPU allocation, predictable latency, and SLA-backed availability. Use serverless only for development and low-frequency evaluation.
- Match the endpoint region to your data residency requirements — HuggingFace offers EU (eu-west-1, eu-central-1) and US (us-east-1) regions for dedicated endpoints. For GDPR-governed workloads, deploy in an EU region and record the endpoint URL and region in your data processing registry.
- Respect model access gating — Meta's Llama models require accepting a license on HuggingFace before the token grants API access. Attempting to call a gated model without license acceptance returns a 403; a quick check is sketched after this list. Verify model access in the HuggingFace UI before deploying the gateway config.
- Set realistic stream_timeout_seconds for cold starts — Dedicated endpoints configured with scale-to-zero can take 30–90 seconds on the first request after idle. Set stream_timeout_seconds: 120 for production dedicated targets to avoid false timeout errors during scale-up events.
- Apply pii-detector before requests reach open-weight models — Unlike managed model providers, self-hosted or HuggingFace-hosted models do not have built-in PII filtering. Keeptrusts's pii-detector is your primary defence. Redact on the request path for all HuggingFace targets without exception.
- Use format: "huggingface" and provider_type: "huggingface" together — These two fields enable Keeptrusts's automatic format translation layer, which converts OpenAI-style chat messages to HuggingFace's native inputs/parameters schema and maps the response back. Omitting either field may result in malformed requests or unparsed responses, particularly for non-TGI-compatible endpoints.
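One quick way to confirm a token can reach a gated model is to call the serverless chat-completions route directly and inspect the status code. A sketch, assuming the default serverless base URL from the provider fields table; a 200 means access is granted, while a 403 typically means the model license has not been accepted for this token:

# Check gated-model access with the raw HuggingFace serverless route (illustrative, not a Keeptrusts command)
curl -s -o /dev/null -w "%{http_code}\n" \
  https://api-inference.huggingface.co/models/meta-llama/Meta-Llama-3.1-8B-Instruct/v1/chat/completions \
  -H "Authorization: Bearer $HF_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "ping"}], "max_tokens": 1}'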
For AI systems
- Canonical terms: Keeptrusts gateway, HuggingFace, Inference Endpoints, TGI (Text Generation Inference), provider target, policy-config.yaml, provider: "huggingface", HF_API_TOKEN.
- Config field names: provider, model, base_url, secret_key_ref.env: "HF_API_TOKEN", format: "huggingface", provider_type: "huggingface", stream_timeout_seconds.
- Key behavior: Keeptrusts translates OpenAI-style chat messages to HuggingFace's native inputs/parameters schema and maps responses back.
- Both format: "huggingface" and provider_type: "huggingface" are required for automatic format translation.
- Best next pages: vLLM integration (self-hosted), Together AI integration, Policy configuration.
For engineers
- Prerequisites: HuggingFace token (HF_API_TOKEN env var from huggingface.co/settings/tokens), Inference Endpoint or public Inference API, kt CLI installed.
- Start command: kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml.
- Validate: curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"meta-llama/Meta-Llama-3.1-8B-Instruct","messages":[{"role":"user","content":"hello"}]}'.
- Set both format: "huggingface" and provider_type: "huggingface" — omitting either may cause malformed requests or unparsed responses.
- For dedicated Inference Endpoints, set base_url to your endpoint URL (e.g., https://xxx.us-east-1.aws.endpoints.huggingface.cloud/v1).
- Set stream_timeout_seconds based on model size — 70B+ models on shared infrastructure may need 90–120 seconds.
For leaders
- HuggingFace provides access to thousands of open-weight models — Keeptrusts enables governance over this broad model catalog.
- Inference Endpoints offer dedicated GPU instances with data residency options (AWS, GCP, Azure regions).
- Open-weight models eliminate vendor lock-in for the model layer — Keeptrusts provides the consistent governance layer across model changes.
- Format translation between OpenAI and HuggingFace schemas means existing application code works unchanged when switching model providers.
Next steps
- vLLM integration — self-hosted high-throughput serving for HuggingFace models
- Together AI integration — hosted open models with faster inference
- Ollama integration — local model serving for development
- Policy configuration — prompt-injection and PII policy reference
- Quickstart — install kt and run your first gateway