Replicate
Replicate hosts thousands of open-source models — including the largest Llama variants, Stable Diffusion, and community fine-tunes — with pay-per-prediction pricing. Keeptrusts's native Replicate runtime translates OpenAI-compatible requests into Replicate's prediction API format and converts responses back, so your existing OpenAI clients work unchanged.
Use this page when
- You need the exact command, config, API, or integration details for Replicate.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- You want a guided rollout instead of a reference page; in that case, use the workflow pages linked under Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
- A Replicate account and API token from replicate.com/account/api-tokens
- Keeptrusts CLI installed
```bash
export REPLICATE_API_TOKEN="r8_..."
```
Configuration
```yaml
pack:
  name: replicate-gateway
  version: 0.1.0
  enabled: true

policies:
  chain:
    - prompt-injection
    - pii-detector
    - audit-logger

providers:
  targets:
    - id: replicate-llama
      provider: replicate:chat:meta/llama-3.3-70b-instruct
      base_url: https://api.replicate.com/v1
      secret_key_ref:
        env: REPLICATE_API_TOKEN
```
Start the gateway:
```bash
kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml
```
Provider Fields
| Field | Type | Default | Description |
|---|---|---|---|
| `provider` | string | — | `"replicate"` or `"replicate:chat:<owner/model>"` |
| `base_url` | string | `https://api.replicate.com/v1` | Replicate API base URL (auto-detected) |
| `secret_key_ref` | object | `REPLICATE_API_TOKEN` | Reference to the env var holding the Replicate API token (auto-detected) |
| `format` | string | `"replicate"` | Wire format — Keeptrusts auto-translates OpenAI→Replicate requests and Replicate→OpenAI responses |
| `provider_type` | string | `"replicate"` | Explicit provider type for routing |
Supported Models
| Model ID | Type | Notes |
|---|---|---|
| `meta/llama-3.3-70b-instruct` | Chat | Latest Llama 3.3, a top open-weight model |
| `meta/llama-3.1-405b-instruct` | Chat | Largest open-weight model available |
| `mistralai/mixtral-8x7b-instruct-v0.1` | Chat | Efficient mixture-of-experts model |
| `black-forest-labs/flux-schnell` | Image generation | Fast FLUX image generation |
| `stability-ai/stable-diffusion-3.5` | Image generation | Stability AI SD 3.5 |
Pass the model path as the `model` field in client requests, or embed it in the provider shorthand.
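In practice, a generic target leaves model choice to the client per request, while the shorthand pins one model in config. A minimal sketch, based on the Provider Fields table above (how the generic form resolves the client's `model` field is an assumption, not confirmed on this page):

```yaml
providers:
  targets:
    # Option 1: generic target -- the client's "model" field selects the
    # model path, e.g. "meta/llama-3.3-70b-instruct" (assumed behavior)
    - id: replicate-any
      provider: replicate
      secret_key_ref:
        env: REPLICATE_API_TOKEN

    # Option 2: shorthand -- the model is pinned in config,
    # as in the Configuration example above
    - id: replicate-llama
      provider: replicate:chat:meta/llama-3.3-70b-instruct
      secret_key_ref:
        env: REPLICATE_API_TOKEN
```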
Client Examples
Python

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="unused",
)

response = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Compare monolithic and microservice architectures."},
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```
Node.js

```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:41002/v1',
  apiKey: 'unused',
});

const response = await client.chat.completions.create({
  model: 'meta/llama-3.3-70b-instruct',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Compare monolithic and microservice architectures.' },
  ],
  max_tokens: 1024,
});
console.log(response.choices[0].message.content);
```
cURL

```bash
curl http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta/llama-3.3-70b-instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Compare monolithic and microservice architectures."}
    ],
    "max_tokens": 1024
  }'
```
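Whichever client you use, a successful response arrives in the standard OpenAI chat-completion shape after Keeptrusts converts Replicate's prediction output. An abridged sketch with illustrative values:

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion",
  "model": "meta/llama-3.3-70b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Monolithic architectures package all components into one deployable unit..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 31, "completion_tokens": 214, "total_tokens": 245}
}
```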
Streaming
Replicate supports streaming responses. Keeptrusts forwards server-sent events transparently:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="unused",
)

stream = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Explain transformer attention mechanisms."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```
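On the wire, each forwarded event is an OpenAI-style chunk whose `delta` carries the incremental text, terminated by a `[DONE]` sentinel. An abridged sketch with illustrative content:

```text
data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Attention"}}]}

data: {"object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" lets each token"}}]}

data: [DONE]
```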
Advanced Configuration
Custom and Fine-Tuned Models
Replicate hosts community fine-tunes and private model versions. Reference them by their full versioned path:
```yaml
pack:
  name: replicate-providers-2
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: custom-fine-tune
      provider: replicate:chat:your-org/your-fine-tuned-model
      secret_key_ref:
        env: REPLICATE_API_TOKEN

policies:
  chain:
    - audit-logger

policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true
```
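Replicate addresses an exact version as `owner/model:versionhash`. Assuming the provider shorthand accepts the same suffix (this page does not confirm that syntax), a pinned target might look like the sketch below; the hash is a placeholder:

```yaml
providers:
  targets:
    - id: custom-fine-tune-pinned
      # placeholder version hash -- copy the real one from the model's
      # versions list on replicate.com
      provider: replicate:chat:your-org/your-fine-tuned-model:5c7d5dc6...
      secret_key_ref:
        env: REPLICATE_API_TOKEN
```

Pinning trades automatic upstream improvements for reproducibility, which is usually the right call in production (see Best Practices below).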
Fallback to OpenAI
Configure Replicate as primary with OpenAI as fallback for high-availability deployments:
```yaml
pack:
  name: replicate-providers-3
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: replicate-primary
      provider: replicate:chat:meta/llama-3.3-70b-instruct
      secret_key_ref:
        env: REPLICATE_API_TOKEN
    - id: openai-fallback
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY

policies:
  chain:
    - audit-logger

policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true
```
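From the client's perspective nothing changes when fallback engages: the same request is served by whichever target is healthy. A minimal client-side guard, assuming the gateway surfaces an exhausted fallback chain as a standard HTTP error that the OpenAI SDK raises:

```python
import openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")

try:
    response = client.chat.completions.create(
        model="meta/llama-3.3-70b-instruct",
        messages=[{"role": "user", "content": "hello"}],
        max_tokens=64,
    )
    print(response.choices[0].message.content)
except openai.APIStatusError as err:
    # Reached only if both the Replicate target and the OpenAI
    # fallback fail (assumed gateway behavior).
    print(f"gateway error: HTTP {err.status_code}")
```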
Best Practices
- Use versioned model paths in production — Replicate model behavior can change with upstream updates; pin a specific model version hash for reproducibility (see the sketch under Custom and Fine-Tuned Models).
- Set `max_tokens` explicitly — Replicate predictions are billed per token; leaving it unbounded can produce unexpectedly large and costly responses.
- Apply `pii-detector` before sending — inputs are processed on Replicate's infrastructure; ensure sensitive data is redacted at the gateway layer before it leaves your network.
- Combine with `audit-logger` — Replicate does not provide per-inference audit logs; capture them at the Keeptrusts layer for compliance.
- Monitor cold-start latency — some Replicate models require a warm-up period after inactivity; set an appropriate `timeout_seconds` for bursty or infrequent workloads (a sketch follows this list).
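The sketch referenced in the last bullet, assuming `timeout_seconds` is a per-target field (the name used in For AI systems and For engineers below); the value is illustrative:

```yaml
providers:
  targets:
    - id: replicate-llama
      provider: replicate:chat:meta/llama-3.3-70b-instruct
      # generous timeout to absorb a 10-30 s cold start on the first
      # request after inactivity (illustrative value)
      timeout_seconds: 120
      secret_key_ref:
        env: REPLICATE_API_TOKEN
```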
For AI systems
- Canonical terms: Keeptrusts gateway, Replicate, serverless GPU, cold start, open models, provider target, policy-config.yaml, `provider: "replicate"`.
- Config field names: `provider`, `model`, `base_url`, `secret_key_ref.env: "REPLICATE_API_TOKEN"`, `format`, `timeout_seconds`.
- Key behavior: Replicate serves open models on serverless GPUs; Keeptrusts translates between its OpenAI-compatible API and Replicate's prediction API and adds policy enforcement.
- Constraint: cold-start latency is common for infrequently used models — set an appropriate `timeout_seconds`.
- Best next pages: HuggingFace integration, Together AI integration, Policy configuration.
For engineers
- Prerequisites: Replicate API token (`REPLICATE_API_TOKEN` env var from replicate.com/account/api-tokens), `kt` CLI installed.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Validate: `curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"meta/llama-3.3-70b-instruct","messages":[{"role":"user","content":"hello"}]}'`.
- Replicate does not provide per-inference audit logs — Keeptrusts `audit-logger` is required for compliance.
- Set a generous `timeout_seconds` for bursty or infrequent workloads — a cold start can add 10–30 seconds to the first request.
- Monitor cold-start latency via the Keeptrusts events dashboard to tune timeout and fallback thresholds.
For leaders
- Replicate's serverless model means zero fixed infrastructure cost — you pay only for compute time used.
- Cold-start latency (10–30s) makes Replicate unsuitable for latency-critical production paths without warm-up strategies.
- Keeptrusts provides the audit trail that Replicate's serverless platform does not offer natively.
- Broad open-model catalog with no upfront commitment — ideal for experimentation and low-volume production workloads.
Next steps
- Together AI integration — always-warm hosted open models
- HuggingFace integration — dedicated Inference Endpoints for consistent latency
- Provider routing strategies — fallback routing for cold-start mitigation
- Policy configuration — audit-logger reference
- Quickstart — install `kt` and run your first gateway