
Cloudera AI

Cloudera AI Inference provides enterprise LLM hosting on Cloudera Data Platform (CDP) with data residency guarantees, private deployment, and integration with existing Cloudera governance and security controls. Organizations running sensitive workloads that cannot go to public cloud APIs can host models such as Llama 3.3 70B and Mistral within their own CDP environment.

Use this page when

  • You need the exact command, config, API, or integration details for Cloudera AI.
  • You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
  • You want a precise reference page; for a guided rollout instead, use the linked workflow pages in Next steps.

Keeptrusts enforces governance policies on Cloudera-hosted models — prompt-injection detection, PII redaction, content safety filters, and audit logging — without requiring any changes to the Cloudera deployment itself. The gateway sits in front of the Cloudera AI Inference endpoint and applies your policy chain on every request and response.
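
Conceptually, the policy chain behaves like ordered middleware: each policy either transforms the request/response text or blocks it. The sketch below illustrates that flow with hypothetical function names and toy heuristics; it is not the actual Keeptrusts implementation.

```python
import re

# Toy policies, illustrative only. Real detection is far more sophisticated.
def prompt_injection(text):
    # Block when a simple injection heuristic fires.
    if "ignore previous instructions" in text.lower():
        raise ValueError("blocked: prompt injection")
    return text

def pii_redactor(text):
    # Redact a toy PII pattern (e-mail addresses).
    return re.sub(r"\b[\w.]+@[\w.]+\b", "[REDACTED]", text)

def apply_chain(text, chain):
    # Each policy either transforms the text or raises to block the request.
    for policy in chain:
        text = policy(text)
    return text

safe = apply_chain("Contact alice@example.com", [prompt_injection, pii_redactor])
print(safe)  # Contact [REDACTED]
```

The same chain runs on responses as well, which is how redaction and audit logging apply in both directions.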

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Prerequisites

  1. Cloudera AI Inference endpoint — deploy one or more model endpoints in your CDP environment. Your administrator will provide the full endpoint URL and a suitable access token.
  2. Keeptrusts CLI — install kt (quickstart guide).
  3. Export your Cloudera access token so the gateway can read it at startup:
export CLOUDERA_API_KEY="your-cloudera-access-token"

Unlike cloud-hosted providers, Cloudera does not have a fixed public base URL — every deployment has its own endpoint. The base_url field is required and must point at the root of your Cloudera AI Inference endpoint (e.g. https://your-domain/namespaces/serving-default/endpoints/llama-3-3-70b/v1).
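
A quick shape check can catch a misconfigured base_url before startup. This is an assumption-level sketch (the function name is made up, not part of the kt CLI): it only verifies that the URL is HTTPS, has a host, and ends in /v1.

```python
from urllib.parse import urlparse

def looks_like_cloudera_base_url(url: str) -> bool:
    # Hypothetical helper: checks scheme, host, and the trailing /v1 segment.
    parts = urlparse(url)
    return (
        parts.scheme == "https"
        and bool(parts.netloc)
        and parts.path.rstrip("/").endswith("/v1")
    )

print(looks_like_cloudera_base_url(
    "https://your-domain/namespaces/serving-default/endpoints/llama-3-3-70b/v1"
))  # True
```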

Configuration

A complete policy-config.yaml that routes traffic through a Cloudera AI Inference endpoint with prompt-injection, PII, and safety policies:

pack:
  name: cloudera-gateway
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - prompt-injection
      - pii-detector
      - safety-filter
      - audit-logger
    policy:
      prompt-injection:
        threshold: 0.8
        action: block
      pii-detector:
        action: redact
      safety-filter:
        mode: strict
        action: block
      audit-logger:
        retention_days: 365
  providers:
    strategy: single
    targets:
      - id: cloudera-llama
        provider: cloudera
        model: meta/llama-3.3-70b-instruct
        base_url: https://your-domain/namespaces/serving-default/endpoints/llama-3-3-70b/v1
        secret_key_ref:
          env: CLOUDERA_API_KEY

Start the gateway:

kt gateway run \
  --listen 0.0.0.0:41002 \
  --policy-config policy-config.yaml

Compact Provider Shorthand

You can encode the model directly in the provider field. The two forms below are equivalent:

# Shorthand — model embedded in the provider string
- id: "cloudera-llama"
  provider: "cloudera:chat:meta/llama-3.3-70b-instruct"
  base_url: "https://your-domain/namespaces/serving-default/endpoints/llama-3-3-70b/v1"

# Explicit — separate provider and model fields
- id: "cloudera-llama"
  provider: "cloudera"
  model: "meta/llama-3.3-70b-instruct"
  base_url: "https://your-domain/namespaces/serving-default/endpoints/llama-3-3-70b/v1"
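
The shorthand can be thought of as splitting on colons into provider, mode, and model. The sketch below shows one plausible normalization; it is illustrative, not Keeptrusts source code, though the field names mirror the config above.

```python
def normalize_target(target: dict) -> dict:
    # If the provider field carries "provider:mode:model" shorthand and no
    # explicit model is set, expand it into separate provider/model fields.
    provider = target["provider"]
    if ":" in provider and "model" not in target:
        provider_id, _mode, model = provider.split(":", 2)
        target = {**target, "provider": provider_id, "model": model}
    return target

t = normalize_target({
    "id": "cloudera-llama",
    "provider": "cloudera:chat:meta/llama-3.3-70b-instruct",
})
print(t["provider"], t["model"])  # cloudera meta/llama-3.3-70b-instruct
```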

Provider Fields

All fields available on a providers.targets[] entry for Cloudera AI:

| Field | Type | Default | Description |
|---|---|---|---|
| id | string | required | Unique identifier for this target. Used in logs, the console dashboard, and routing decisions. |
| provider | string | required | Provider ID. Use "cloudera" or the shorthand "cloudera:chat:<model>". |
| model | string | required | Model name as registered in your Cloudera AI deployment, e.g. "meta/llama-3.3-70b-instruct". |
| base_url | string | required | Full URL to your Cloudera AI Inference endpoint root. Must be your organization-specific endpoint. |
| secret_key_ref | object | CLOUDERA_API_KEY | Object reference to the environment variable holding the CDP access token. |
| format | string | "openai" | Wire format. Cloudera AI Inference exposes an OpenAI-compatible API. |
| timeout_seconds | integer | 60 | Maximum wall-clock time for non-streaming requests. Larger enterprise models may need higher values. |
| stream_timeout_seconds | integer | inherits timeout_seconds | Maximum wall-clock time for streaming requests. Set higher for long completions. |
| max_context_tokens | integer | none | Maximum token budget for the request. When set, the gateway rejects requests that exceed this limit before forwarding upstream. |
| description | string | none | Human-readable label shown in the console dashboard and health-check output. |
| weight | float | 1.0 | Routing weight used by the weighted_round_robin strategy. |
| health_probe | object | none | Active health probe configuration. Sub-fields: enabled (bool), interval_seconds (int), timeout_seconds (int). |
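
The defaults in the table compose in a simple way: unset fields take their documented default, and stream_timeout_seconds inherits timeout_seconds unless set explicitly. A sketch of that merge (hypothetical helper, values taken from the table above):

```python
# Documented defaults from the field table; illustrative only.
DEFAULTS = {
    "format": "openai",
    "timeout_seconds": 60,
    "weight": 1.0,
}

def with_defaults(target: dict) -> dict:
    # Explicit target fields win over defaults.
    filled = {**DEFAULTS, **target}
    # stream_timeout_seconds inherits timeout_seconds unless set explicitly.
    filled.setdefault("stream_timeout_seconds", filled["timeout_seconds"])
    return filled

t = with_defaults({"id": "cloudera-llama", "provider": "cloudera",
                   "model": "meta/llama-3.3-70b-instruct",
                   "timeout_seconds": 120})
print(t["stream_timeout_seconds"])  # 120
```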

Supported Models

The models available depend entirely on what your organization has deployed in Cloudera AI. Common deployments include:

| Model | Notes |
|---|---|
| meta/llama-3.3-70b-instruct | Meta Llama 3.3 70B — strong general-purpose instruction-following |
| meta/llama-3.1-8b-instruct | Llama 3.1 8B — fast and efficient for lower-latency workloads |
| mistral/mistral-7b-instruct | Mistral 7B — compact multilingual model |
| ibm/granite-13b-instruct-v2 | IBM Granite — enterprise-focused instruction model |

Contact your Cloudera administrator for the exact model IDs and endpoint URLs available in your CDP deployment.

Keeptrusts passes the model field through to the upstream endpoint as-is. Use the exact model identifier string that your Cloudera AI Inference deployment expects.

Client Examples

Once the gateway is running, point your client SDK to http://localhost:41002 (the address from the kt gateway run command above) instead of your Cloudera endpoint URL. The standard OpenAI SDK works directly — no Cloudera-specific SDK is needed.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="unused",  # auth is handled by Keeptrusts via CLOUDERA_API_KEY
)

response = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful enterprise assistant."},
        {"role": "user", "content": "Summarize the key compliance requirements for HIPAA data handling."},
    ],
    temperature=0.3,
    max_tokens=512,
)

print(response.choices[0].message.content)

Streaming

Keeptrusts fully supports streaming for Cloudera AI Inference. Set stream: true in your request — the gateway applies policies to each chunk in real time. Enterprise-hosted models may have higher first-token latency, so configure stream_timeout_seconds generously:

pack:
  name: cloudera-providers-3
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: cloudera-streaming
        provider: cloudera
        model: meta/llama-3.3-70b-instruct
        base_url: https://your-domain/namespaces/serving-default/endpoints/llama-3-3-70b/v1
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true

Client code:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")

stream = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Draft a data processing agreement summary."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Advanced Configuration

Multi-Endpoint Failover

Route to a backup Cloudera AI endpoint if the primary is unavailable:

pack:
  name: cloudera-providers-4
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: cloudera-primary
        provider: cloudera:chat:meta/llama-3.3-70b-instruct
        base_url: https://your-domain/namespaces/serving-default/endpoints/llama-3-3-70b-primary/v1
        secret_key_ref:
          env: CLOUDERA_API_KEY
      - id: cloudera-backup
        provider: cloudera:chat:meta/llama-3.3-70b-instruct
        base_url: https://your-domain/namespaces/serving-default/endpoints/llama-3-3-70b-backup/v1
        secret_key_ref:
          env: CLOUDERA_API_KEY
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true
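
The failover behavior amounts to trying targets in order until one responds. A rough sketch of that loop (hypothetical call_target function; the gateway's actual retry and error-classification semantics may differ):

```python
def route_with_failover(targets, call_target):
    # Try each target in declaration order; first success wins.
    last_error = None
    for target in targets:
        try:
            return call_target(target)
        except ConnectionError as exc:
            last_error = exc  # target unavailable, fall through to the next
    raise RuntimeError("all targets unavailable") from last_error

def fake_call(target):
    # Simulate the primary endpoint being down.
    if target["id"] == "cloudera-primary":
        raise ConnectionError("endpoint unreachable")
    return f"served by {target['id']}"

targets = [{"id": "cloudera-primary"}, {"id": "cloudera-backup"}]
print(route_with_failover(targets, fake_call))  # served by cloudera-backup
```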

Multi-Model Routing

Route different workloads to different Cloudera-hosted models:

pack:
  name: cloudera-providers-5
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: cloudera-llama-70b
        provider: cloudera:chat:meta/llama-3.3-70b-instruct
        base_url: https://your-domain/namespaces/serving-default/endpoints/llama70b/v1
        secret_key_ref:
          env: CLOUDERA_API_KEY
      - id: cloudera-llama-8b
        provider: cloudera:chat:meta/llama-3.1-8b-instruct
        base_url: https://your-domain/namespaces/serving-default/endpoints/llama8b/v1
        secret_key_ref:
          env: CLOUDERA_API_KEY
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true
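
With this config, a request's model field selects the matching target. A minimal sketch of that lookup (illustrative; the gateway's actual matching rules may differ):

```python
def pick_target(targets, requested_model):
    # Return the first target whose model matches the request.
    for target in targets:
        if target["model"] == requested_model:
            return target
    raise LookupError(f"no target serves {requested_model}")

targets = [
    {"id": "cloudera-llama-70b", "model": "meta/llama-3.3-70b-instruct"},
    {"id": "cloudera-llama-8b", "model": "meta/llama-3.1-8b-instruct"},
]
print(pick_target(targets, "meta/llama-3.1-8b-instruct")["id"])  # cloudera-llama-8b
```

Clients keep using the standard OpenAI model parameter; only the gateway config changes.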

Best Practices

  • Specify the full base_url — Cloudera AI Inference endpoints are organization-specific. Always provide the complete endpoint URL including namespace, serving environment, and endpoint name. Do not rely on Keeptrusts to synthesize paths.
  • Use environment variables for credentials — set secret_key_ref to a variable name and keep CDP access tokens out of policy-config.yaml. Rotate tokens via environment updates without redeploying the gateway.
  • Set higher timeouts for large models — enterprise-hosted 70B models may have 10–30 second first-token latencies under load. Set timeout_seconds: 120 and stream_timeout_seconds: 300 to avoid premature client disconnects.
  • Enable health_probe — on-premise infrastructure can have scheduled maintenance windows; the health probe ensures Keeptrusts detects unavailability promptly and activates fallback routing.
  • Combine with PII redaction — a primary use case for Cloudera is data residency. Pair the pii-detector policy with Cloudera to ensure sensitive fields are redacted before they leave your application layer, providing defense in depth even within your private network.
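
The health-probe recommendation boils down to a periodic check that marks targets healthy or unhealthy so routing can skip them. A minimal sketch, with a fake probe standing in for the real HTTP check (function names are hypothetical, not kt internals):

```python
def run_probes(targets, probe):
    # Run one probe pass; probe(target) returns True when the target responds.
    for target in targets:
        target["healthy"] = probe(target)

def healthy_targets(targets):
    # Routing only considers targets not marked unhealthy.
    return [t for t in targets if t.get("healthy", True)]

targets = [{"id": "cloudera-primary"}, {"id": "cloudera-backup"}]
# Pretend the primary is in a maintenance window.
run_probes(targets, probe=lambda t: t["id"] != "cloudera-primary")
print([t["id"] for t in healthy_targets(targets)])  # ['cloudera-backup']
```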

For AI systems

  • Canonical terms: Keeptrusts gateway, Cloudera AI, Cloudera Machine Learning (CML), on-premise, private cloud, provider target, policy-config.yaml.
  • Config field names: provider, model, base_url, secret_key_ref.env, format: "openai", health_probe, timeout_seconds.
  • Key behavior: Keeptrusts routes to Cloudera's OpenAI-compatible inference endpoints within private infrastructure.
  • Best next pages: vLLM integration (alternative self-hosted), Ollama integration, Policy configuration.

For engineers

  • Prerequisites: Cloudera ML workspace with deployed model endpoint, API credentials, kt CLI installed, network access to the CML endpoint.
  • Start command: kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml.
  • Validate: curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"your-cml-model","messages":[{"role":"user","content":"hello"}]}'.
  • Enable health_probe to detect scheduled maintenance windows and trigger fallback routing automatically.
  • Combine with pii-detector policy for defense-in-depth — redact sensitive data before it reaches the model even within your private network.

For leaders

  • Cloudera deployments keep all data within your private network — no prompts or completions leave the organization's infrastructure.
  • On-premise maintenance windows require health-probe-driven fallback routing to maintain availability SLAs.
  • Keeptrusts audit logging provides compliance evidence even for fully on-premise inference that has no vendor-side audit trail.
  • Pair PII redaction with Cloudera for layered data protection that satisfies data residency regulations.

Next steps