Replicate

Replicate hosts thousands of open-source models — including the largest Llama variants, Stable Diffusion, and community fine-tunes — with pay-per-prediction pricing. Keeptrusts's native Replicate runtime translates OpenAI-compatible requests into Replicate's prediction API format and converts responses back, so your existing OpenAI clients work unchanged.

Use this page when

  • You need the exact command, config, API, or integration details for Replicate.
  • You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
  • You want a guided rollout instead of a reference page; in that case, follow the linked workflow pages under Next steps.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Prerequisites

export REPLICATE_API_TOKEN="r8_..."
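Get the token from replicate.com/account/api-tokens. To confirm it works before wiring it into the gateway, you can query Replicate's account endpoint directly; this is an optional sanity check against Replicate's public API, not a Keeptrusts command:

import os
import requests

# Optional sanity check: GET /v1/account returns the authenticated account
# when the token is valid; any auth failure raises via raise_for_status().
resp = requests.get(
    "https://api.replicate.com/v1/account",
    headers={"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())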

Configuration

pack:
  name: replicate-gateway
  version: 0.1.0
  enabled: true
policies:
  chain:
    - prompt-injection
    - pii-detector
    - audit-logger
providers:
  targets:
    - id: replicate-llama
      provider: replicate:chat:meta/llama-3.3-70b-instruct
      base_url: https://api.replicate.com/v1
      secret_key_ref:
        env: REPLICATE_API_TOKEN

Start the gateway:

kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml

Provider Fields

Field | Type | Default | Description
provider | string | (none) | "replicate" or "replicate:chat:<owner/model>"
base_url | string | https://api.replicate.com/v1 | Replicate API base URL (auto-detected)
secret_key_ref | object | REPLICATE_API_TOKEN | Object reference to the env var holding the Replicate API token (auto-detected)
format | string | "replicate" | Wire format; Keeptrusts auto-translates OpenAI requests to Replicate and Replicate responses back to OpenAI (see the sketch below)
provider_type | string | "replicate" | Explicit provider type for routing
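The format row refers to the request/response translation described in the introduction. A rough sketch of the shape change is below; it is illustrative only, and the Replicate-side field names are assumptions based on Replicate's public prediction API, not Keeptrusts internals:

# Illustrative sketch of the translation the gateway performs.
# The Replicate-side names here are assumptions, not Keeptrusts internals.

openai_style_request = {
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 256,
}

# Replicate's prediction API takes an "input" object on a model-scoped endpoint,
# e.g. POST https://api.replicate.com/v1/models/meta/llama-3.3-70b-instruct/predictions
replicate_style_request = {
    "input": {
        "prompt": "Hello",
        "max_tokens": 256,
    }
}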

Supported Models

Model ID | Type | Notes
meta/llama-3.3-70b-instruct | Chat | Latest Llama 3.3, top open-weight model
meta/llama-3.1-405b-instruct | Chat | Largest open-weight model available
mistralai/mixtral-8x7b-instruct-v0.1 | Chat | Efficient mixture-of-experts model
black-forest-labs/flux-schnell | Image generation | Fast FLUX image generation
stability-ai/stable-diffusion-3.5 | Image generation | Stability AI SD 3.5

Pass the model path as the model field in client requests, or embed it in the provider shorthand.

Client Examples

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="unused",
)

response = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Compare monolithic and microservice architectures."},
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)

Streaming

Replicate supports streaming responses. Keeptrusts forwards server-sent events transparently:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="unused",
)

stream = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Explain transformer attention mechanisms."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)

Advanced Configuration

Custom and Fine-Tuned Models

Replicate hosts community fine-tunes and private model versions. Reference them by their full versioned path:

pack:
  name: replicate-providers-2
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: custom-fine-tune
      provider: replicate:chat:your-org/your-fine-tuned-model
      secret_key_ref:
        env: REPLICATE_API_TOKEN
policies:
  chain:
    - audit-logger
policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true
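To pin a specific model version for reproducibility, as recommended under Best Practices, Replicate accepts an owner/model:version reference. A minimal client-side sketch follows, assuming the gateway forwards the versioned path to Replicate unchanged; the version hash is a placeholder:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")

response = client.chat.completions.create(
    # Hypothetical version hash; replace with the version you want to pin.
    model="your-org/your-fine-tuned-model:8b5c9f2e",
    messages=[{"role": "user", "content": "Smoke-test the pinned version."}],
    max_tokens=256,
)
print(response.choices[0].message.content)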

Fallback to OpenAI

Configure Replicate as primary with OpenAI as fallback for high-availability deployments:

pack:
  name: replicate-providers-3
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: replicate-primary
      provider: replicate:chat:meta/llama-3.3-70b-instruct
      secret_key_ref:
        env: REPLICATE_API_TOKEN
    - id: openai-fallback
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true

Best Practices

  • Use versioned model paths in production — Replicate model behavior can change with upstream updates; pin a specific model version hash for reproducibility
  • Set max_tokens explicitly — Replicate predictions are billed per token; leaving it unbounded can produce unexpectedly large and costly responses
  • Apply pii-detector before sending — inputs are processed on Replicate's infrastructure; ensure sensitive data is redacted at the gateway layer before leaving your network
  • Combine with audit-logger — Replicate does not provide per-inference audit logs; capture them at the Keeptrusts layer for compliance
  • Monitor cold-start latency — some Replicate models require a warm-up period after inactivity; set an appropriate timeout_seconds for bursty or infrequent workloads (see the client-side sketch after this list)
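
A minimal client-side sketch applying the max_tokens and timeout guidance above. The gateway-side timeout_seconds setting belongs in the provider target config; the value below only governs how long the client waits:

from openai import OpenAI

# Generous client-side timeout to ride out Replicate cold starts (10-30s on the
# first request), plus an explicit max_tokens cap to keep per-prediction cost bounded.
client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="unused",
    timeout=120.0,  # seconds
)

response = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Summarize the fallback configuration."}],
    max_tokens=512,
)
print(response.choices[0].message.content)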

For AI systems

  • Canonical terms: Keeptrusts gateway, Replicate, serverless GPU, cold start, open models, provider target, policy-config.yaml, provider: "replicate".
  • Config field names: provider, model, base_url, secret_key_ref.env: "REPLICATE_API_TOKEN", format: "replicate", timeout_seconds.
  • Key behavior: Replicate serves open models on serverless GPUs via its prediction API; Keeptrusts translates OpenAI-compatible requests into that format and adds policy enforcement.
  • Constraint: Cold-start latency is common for infrequently-used models — set appropriate timeout_seconds.
  • Best next pages: HuggingFace integration, Together AI integration, Policy configuration.

For engineers

  • Prerequisites: Replicate API token (REPLICATE_API_TOKEN env var from replicate.com/account/api-tokens), kt CLI installed.
  • Start command: kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml.
  • Validate: curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"meta/llama-3.3-70b-instruct","messages":[{"role":"user","content":"hello"}]}'.
  • Replicate does not provide per-inference audit logs — Keeptrusts audit-logger is required for compliance.
  • Set generous timeout_seconds for bursty/infrequent workloads — cold-start can add 10–30 seconds on first request.
  • Monitor cold-start latency via Keeptrusts events dashboard to tune timeout and fallback thresholds.

For leaders

  • Replicate's serverless model means zero fixed infrastructure cost — you pay only for compute time used.
  • Cold-start latency (10–30s) makes Replicate unsuitable for latency-critical production paths without warm-up strategies.
  • Keeptrusts provides the audit trail that Replicate's serverless platform does not offer natively.
  • Broad open-model catalog with no upfront commitment — ideal for experimentation and low-volume production workloads.

Next steps