Replicate
Replicate hosts thousands of open-source models — including the largest Llama variants, Stable Diffusion, and community fine-tunes — with pay-per-prediction pricing. Keeptrusts's native Replicate runtime translates OpenAI-compatible requests into Replicate's prediction API format and converts responses back, so your existing OpenAI clients work unchanged.
Use this page when
- You need the exact command, config, API, or integration details for Replicate.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- You want a guided rollout instead of a reference page; in that case, use the workflow pages linked under Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
- A Replicate account and API token from replicate.com/account/api-tokens
- Keeptrusts CLI installed
```bash
export REPLICATE_API_TOKEN="r8_..."
```
Configuration
```yaml
pack:
  name: replicate-gateway
  version: 0.1.0
  enabled: true

policies:
  chain:
    - prompt-injection
    - pii-detector
    - audit-logger

providers:
  targets:
    - id: replicate-llama
      provider: replicate:chat:meta/llama-3.3-70b-instruct
      base_url: https://api.replicate.com/v1
      secret_key_ref:
        env: REPLICATE_API_TOKEN
```
Start the gateway:
```bash
kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml
```
Provider Fields
| Field | Type | Default | Description |
|---|---|---|---|
| `provider` | string | — | `"replicate"` or `"replicate:chat:<owner/model>"` |
| `base_url` | string | `https://api.replicate.com/v1` | Replicate API base URL (auto-detected) |
| `secret_key_ref` | object | `REPLICATE_API_TOKEN` | Reference to the env var holding the Replicate API token (auto-detected) |
| `format` | string | `"replicate"` | Wire format — Keeptrusts auto-translates OpenAI→Replicate requests and Replicate→OpenAI responses |
| `provider_type` | string | `"replicate"` | Explicit provider type for routing |
Supported Models
| Model ID | Type | Notes |
|---|---|---|
| `meta/llama-3.3-70b-instruct` | Chat | Latest Llama 3.3, a top open-weight model |
| `meta/llama-3.1-405b-instruct` | Chat | Largest open-weight model available |
| `mistralai/mixtral-8x7b-instruct-v0.1` | Chat | Efficient mixture-of-experts model |
| `black-forest-labs/flux-schnell` | Image generation | Fast FLUX image generation |
| `stability-ai/stable-diffusion-3.5` | Image generation | Stability AI SD 3.5 |
Pass the model path as the `model` field in client requests, or embed it in the provider shorthand.
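In practice, a generic target leaves model choice to the client per request, while the shorthand pins one model in config. A minimal sketch, based on the Provider Fields table above (how the generic form resolves the client's `model` field is an assumption, not confirmed on this page):

```yaml
providers:
  targets:
    # Option 1: generic target -- the client's "model" field selects the
    # model path, e.g. "meta/llama-3.3-70b-instruct" (assumed behavior)
    - id: replicate-any
      provider: replicate
      secret_key_ref:
        env: REPLICATE_API_TOKEN

    # Option 2: shorthand -- the model is pinned in config,
    # as in the Configuration example above
    - id: replicate-llama
      provider: replicate:chat:meta/llama-3.3-70b-instruct
      secret_key_ref:
        env: REPLICATE_API_TOKEN
```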
Client Examples
Python

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="unused",
)

response = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Compare monolithic and microservice architectures."},
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```
Node.js

```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:41002/v1',
  apiKey: 'unused',
});

const response = await client.chat.completions.create({
  model: 'meta/llama-3.3-70b-instruct',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Compare monolithic and microservice architectures.' },
  ],
  max_tokens: 1024,
});
console.log(response.choices[0].message.content);
```
cURL

```bash
curl http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Compare monolithic and microservice architectures."}
    ],
    "max_tokens": 1024
  }'
```
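Whichever client you use, a successful response arrives in the standard OpenAI chat-completion shape after Keeptrusts converts Replicate's prediction output. An abridged sketch with illustrative values:

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "meta/llama-3.3-70b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Monolithic architectures package all components into one deployable unit..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 31, "completion_tokens": 214, "total_tokens": 245}
}
```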
Streaming
Replicate supports streaming responses. Keeptrusts forwards server-sent events transparently:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="unused",
)

stream = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Explain transformer attention mechanisms."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```
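On the wire, each forwarded event is an OpenAI-style chunk whose `delta` carries the incremental text, terminated by a `[DONE]` sentinel. An abridged sketch with illustrative content:

```text
data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Attention"}}]}

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" lets each token"}}]}

data: [DONE]
```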
Advanced Configuration
Custom and Fine-Tuned Models
Replicate hosts community fine-tunes and private model versions. Reference them by their full versioned path:
```yaml
pack:
  name: replicate-providers-2
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: custom-fine-tune
      provider: replicate:chat:your-org/your-fine-tuned-model
      secret_key_ref:
        env: REPLICATE_API_TOKEN

policies:
  chain:
    - audit-logger

policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true
```
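Replicate addresses an exact version as `owner/model:versionhash`. Assuming the provider shorthand accepts the same suffix (this page does not confirm that syntax), a pinned target might look like the sketch below; the hash is a placeholder:

```yaml
providers:
  targets:
    - id: custom-fine-tune-pinned
      # placeholder version hash -- copy the real one from the model's
      # versions list on replicate.com
      provider: replicate:chat:your-org/your-fine-tuned-model:5c7d5dc6...
      secret_key_ref:
        env: REPLICATE_API_TOKEN
```

Pinning trades automatic upstream improvements for reproducibility, which is usually the right call in production (see Best Practices below).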
Fallback to OpenAI
Configure Replicate as primary with OpenAI as fallback for high-availability deployments:
```yaml
pack:
  name: replicate-providers-3
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: replicate-primary
      provider: replicate:chat:meta/llama-3.3-70b-instruct
      secret_key_ref:
        env: REPLICATE_API_TOKEN
    - id: openai-fallback
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY

policies:
  chain:
    - audit-logger

policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true
```
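From the client's perspective nothing changes when fallback engages: the same request is served by whichever target is healthy. A minimal client-side guard, assuming the gateway surfaces an exhausted fallback chain as a standard HTTP error that the OpenAI SDK raises:

```python
import openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")

try:
    response = client.chat.completions.create(
        model="meta/llama-3.3-70b-instruct",
        messages=[{"role": "user", "content": "hello"}],
        max_tokens=64,
    )
    print(response.choices[0].message.content)
except openai.APIStatusError as err:
    # Reached only if both the Replicate target and the OpenAI
    # fallback fail (assumed gateway behavior).
    print(f"gateway error: HTTP {err.status_code}")
```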
Best Practices
- Use versioned model paths in production — Replicate model behavior can change with upstream updates; pin a specific model version hash for reproducibility (see the sketch under Custom and Fine-Tuned Models).
- Set `max_tokens` explicitly — Replicate predictions are billed per token; leaving it unbounded can produce unexpectedly large and costly responses.
- Apply `pii-detector` before sending — inputs are processed on Replicate's infrastructure; ensure sensitive data is redacted at the gateway layer before it leaves your network.
- Combine with `audit-logger` — Replicate does not provide per-inference audit logs; capture them at the Keeptrusts layer for compliance.
- Monitor cold-start latency — some Replicate models require a warm-up period after inactivity; set an appropriate `timeout_seconds` for bursty or infrequent workloads (a sketch follows this list).
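The sketch referenced in the last bullet, assuming `timeout_seconds` is a per-target field (the name used in For AI systems and For engineers below); the value is illustrative:

```yaml
providers:
  targets:
    - id: replicate-llama
      provider: replicate:chat:meta/llama-3.3-70b-instruct
      # generous timeout to absorb a 10-30 s cold start on the first
      # request after inactivity (illustrative value)
      timeout_seconds: 120
      secret_key_ref:
        env: REPLICATE_API_TOKEN
```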
For AI systems
- Canonical terms: Keeptrusts gateway, Replicate, serverless GPU, cold start, open models, provider target, policy-config.yaml, `provider: "replicate"`.
- Config field names: `provider`, `model`, `base_url`, `secret_key_ref.env: "REPLICATE_API_TOKEN"`, `format`, `timeout_seconds`.
- Key behavior: Replicate serves open models on serverless GPUs; Keeptrusts translates between its OpenAI-compatible API and Replicate's prediction API and adds policy enforcement.
- Constraint: cold-start latency is common for infrequently used models — set an appropriate `timeout_seconds`.
- Best next pages: HuggingFace integration, Together AI integration, Policy configuration.
For engineers
- Prerequisites: Replicate API token (`REPLICATE_API_TOKEN` env var from replicate.com/account/api-tokens), `kt` CLI installed.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Validate: `curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"meta/llama-3.3-70b-instruct","messages":[{"role":"user","content":"hello"}]}'`.
- Replicate does not provide per-inference audit logs — Keeptrusts `audit-logger` is required for compliance.
- Set a generous `timeout_seconds` for bursty or infrequent workloads — a cold start can add 10–30 seconds to the first request.
- Monitor cold-start latency via the Keeptrusts events dashboard to tune timeout and fallback thresholds.
For leaders
- Replicate's serverless model means zero fixed infrastructure cost — you pay only for compute time used.
- Cold-start latency (10–30s) makes Replicate unsuitable for latency-critical production paths without warm-up strategies.
- Keeptrusts provides the audit trail that Replicate's serverless platform does not offer natively.
- Broad open-model catalog with no upfront commitment — ideal for experimentation and low-volume production workloads.
Next steps
- Together AI integration — always-warm hosted open models
- HuggingFace integration — dedicated Inference Endpoints for consistent latency
- Provider routing strategies — fallback routing for cold-start mitigation
- Policy configuration — audit-logger reference
- Quickstart — install `kt` and run your first gateway