Together AI

Together AI provides fast, cost-effective inference for open-weight models through an OpenAI-compatible REST API. Keeptrusts sits between your application and Together's API endpoint, enforcing policy chains — prompt-injection detection, PII redaction, safety filters, content-quality scoring — on every request and response without requiring application-side changes.

Use this page when

  • You need the exact command, config, API, or integration details for Together AI.
  • You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
  • You want a guided rollout instead of a reference page — in that case, use the linked workflow pages in Next steps.

Because Together exposes a standard /v1/chat/completions surface, Keeptrusts needs no format translation. Requests and responses flow through the gateway in native OpenAI wire format, and any OpenAI SDK client can be pointed at the gateway with zero code changes.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Prerequisites

  1. Together API key — create one in the Together Console → API Keys.
  2. Keeptrusts CLI — install kt (quickstart guide).
  3. Export your key so the gateway can read it at startup:
export TOGETHER_API_KEY="your-together-api-key"

When the provider field is set to "together", Keeptrusts auto-detects both the base URL (https://api.together.xyz/v1) and the API key environment variable (TOGETHER_API_KEY). You only need to override these if you use a custom deployment or a non-standard env-var name.
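Under these defaults, a target can omit both fields entirely. The sketch below (the id is illustrative) shows the minimal auto-detected form:

```yaml
providers:
  strategy: single
  targets:
    # base_url and secret_key_ref are auto-detected for provider: together
    - id: together-minimal
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
```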

Configuration

A minimal policy-config.yaml that routes traffic through Together and enforces prompt-injection, PII, safety, and audit policies:

pack:
  name: together-gateway
  version: 1.0.0
  enabled: true
policies:
  chain:
    - prompt-injection
    - pii-detector
    - safety-filter
    - audit-logger
  policy:
    prompt-injection:
      threshold: 0.8
      action: block
    pii-detector:
      action: redact
    safety-filter:
      mode: strict
      action: block
    audit-logger:
      retention_days: 365
providers:
  strategy: single
  targets:
    - id: together-llama-70b
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      base_url: https://api.together.xyz/v1
      secret_key_ref:
        env: TOGETHER_API_KEY

Start the gateway:

kt gateway run \
  --listen 0.0.0.0:8080 \
  --policy-config policy-config.yaml

Compact Provider Shorthand

You can encode the model directly in the provider field. The two forms below are equivalent:

# Shorthand — model embedded in the provider string
- id: "together-llama"
  provider: "together:chat:meta-llama/Llama-3.3-70B-Instruct-Turbo"

# Explicit — separate provider and model fields
- id: "together-llama"
  provider: "together"
  model: "meta-llama/Llama-3.3-70B-Instruct-Turbo"

The shorthand form is convenient for quick setups. The explicit form is preferred when you also set pricing, health_probe, or other per-target fields.

Provider Fields

All fields available on a providers.targets[] entry for Together AI:

  • id (string, required) — Unique identifier for this target. Used in logs, dashboards, and routing references.
  • provider (string, required) — Provider ID. Use "together" or the shorthand "together:chat:<model>".
  • model (string, required) — Full model path in Together's org/model format (e.g. "meta-llama/Llama-3.3-70B-Instruct-Turbo").
  • base_url (string, default https://api.together.xyz/v1) — API base URL. Auto-detected when provider is "together". Override for private deployments or gateway chains.
  • secret_key_ref (object, default TOGETHER_API_KEY) — Reference to the environment variable holding the API key. Auto-detected for Together targets. Use distinct names per environment (e.g. TOGETHER_API_KEY_PROD).
  • timeout_seconds (integer, default 60) — Maximum wall-clock time for non-streaming requests before the gateway returns a timeout error.
  • stream_timeout_seconds (integer, defaults to timeout_seconds) — Maximum time for streaming requests. Set higher than timeout_seconds when long generations are expected.
  • format (string, default "openai") — Wire format. Together's API is natively OpenAI-compatible, so this is always "openai".
  • description (string, optional) — Human-readable label shown in the Keeptrusts console dashboards, event logs, and trace views.
  • weight (float, default 1.0) — Routing weight when using the weighted_round_robin strategy. Higher values receive proportionally more traffic.
  • pricing (object, optional) — Token pricing in USD per 1 million tokens. Fields: prompt (input) and completion (output). Enables cost dashboards, per-request cost tracking, and budget enforcement policies.
  • health_probe (object, optional) — Active health-check configuration. Sub-fields: enabled (bool), interval_seconds (int), timeout_seconds (int). When enabled, the gateway periodically probes the upstream and removes unhealthy targets from rotation.
  • quantizations (string, optional) — Model quantization level (e.g. "fp16", "fp8", "int8", "int4"). Informational — used for dashboards and routing metadata when Together offers multiple quantized variants of the same model.
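For illustration, a single target combining several of the optional fields above (the numeric values are placeholders, not recommendations — check Together's pricing page for current rates):

```yaml
- id: together-llama-70b
  provider: together
  model: meta-llama/Llama-3.3-70B-Instruct-Turbo
  description: "Primary chat model"
  timeout_seconds: 60
  stream_timeout_seconds: 300
  weight: 1.0
  pricing:
    prompt: 0.88      # USD per 1M input tokens — example figure
    completion: 0.88  # USD per 1M output tokens — example figure
  health_probe:
    enabled: true
    interval_seconds: 30
    timeout_seconds: 5
  quantizations: "fp8"
```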

Supported Models

Together's catalog contains hundreds of open-weight models. The table below lists popular choices that work well with Keeptrusts:

  • meta-llama/Llama-3.3-70B-Instruct-Turbo — 128K context, chat. General-purpose flagship with strong reasoning and instruction-following.
  • meta-llama/Llama-3.1-8B-Instruct-Turbo — 128K context, chat. Fast and cost-effective for latency-sensitive workloads.
  • mistralai/Mixtral-8x22B-Instruct-v0.1 — 64K context, chat. Mixture-of-experts architecture; strong reasoning at lower cost.
  • Qwen/QwQ-32B — 32K context, chat. Multilingual, with strong mathematical and logical reasoning.

Any model available on the Together Models page can be used — set the model field to the full org/model path. Keeptrusts passes the model identifier through to the upstream without validation, so newly added models work immediately.

Client Examples

Once the gateway is running, point your client to http://localhost:8080 instead of https://api.together.xyz/v1. Clients send standard OpenAI-format requests — no SDK changes are needed beyond the base URL.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="unused",  # auth is handled by Keeptrusts via TOGETHER_API_KEY
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the CAP theorem in distributed systems."},
    ],
    temperature=0.7,
    max_tokens=512,
)

print(response.choices[0].message.content)

Streaming

Keeptrusts fully supports Together's streaming mode. The gateway applies policies to each SSE chunk in real time — prompt-injection checks run on the assembled request before it reaches Together, and content filters process each response chunk as it arrives.

Set stream: true in your request and configure stream_timeout_seconds to accommodate longer generations:

pack:
  name: together-ai-providers-3
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: together-streaming
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      stream_timeout_seconds: 300
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Write a poem about open-source AI."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Advanced Configuration

Multi-Model Fallback

Automatically fail over from the primary 70B model to the faster 8B model when the primary is unhealthy or times out. The gateway tries targets in order and stops at the first successful response:
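The try-in-order behavior can be sketched in a few lines of Python. This is an illustrative model only, not the gateway's implementation; the callables stand in for upstream requests:

```python
def complete_with_fallback(targets, request):
    """Try each target in order; return the first successful response."""
    errors = []
    for target in targets:
        try:
            return target(request)
        except Exception as exc:  # unhealthy target, timeout, upstream error
            errors.append(exc)
    raise RuntimeError(f"all {len(targets)} targets failed: {errors}")

def primary(request):
    raise TimeoutError("70B timed out")

def fallback(request):
    return {"model": "8B", "content": "ok"}

# The primary fails, so the 8B fallback serves the request:
result = complete_with_fallback([primary, fallback], {"prompt": "hi"})
```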

pack:
  name: together-ai-providers-4
  version: 1.0.0
  enabled: true
providers:
  strategy: fallback
  targets:
    - id: together-70b-primary
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY
    - id: together-8b-fallback
      provider: together
      model: meta-llama/Llama-3.1-8B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

Cross-Provider Fallback

Use Together as the primary and a different provider as the safety net. Both targets share the same policy chain, so governance is consistent regardless of which upstream serves the request:

pack:
  name: together-ai-providers-5
  version: 1.0.0
  enabled: true
providers:
  strategy: fallback
  targets:
    - id: together-primary
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY
    - id: openai-fallback
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

Latency-Based Routing

Route each request to the target with the lowest observed p50 latency. The gateway continuously measures upstream response times and adjusts routing weights:
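The selection rule reduces to "lowest median of recent latency samples". A minimal sketch (a real router would also smooth samples and age out stale measurements):

```python
import statistics

def pick_lowest_p50(latency_samples):
    """Pick the target id with the lowest p50 (median) observed latency.

    latency_samples maps target id -> list of recent latencies in seconds.
    """
    return min(latency_samples,
               key=lambda target: statistics.median(latency_samples[target]))

samples = {
    "together-70b": [1.9, 2.1, 2.4],  # larger model, slower responses
    "together-8b": [0.4, 0.5, 0.6],
}
best = pick_lowest_p50(samples)
```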

pack:
  name: together-ai-providers-6
  version: 1.0.0
  enabled: true
providers:
  strategy: latency
  targets:
    - id: together-70b
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY
    - id: together-8b
      provider: together
      model: meta-llama/Llama-3.1-8B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

Weighted A/B Testing

Split traffic across models by weight for experimentation. Combine with audit-logger and the Keeptrusts console to compare quality, latency, and cost per variant:
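Weighted selection is a proportional random draw over the targets' weight fields. A sketch, assuming an 80/20 split:

```python
import random

def pick_variant(targets, rng=random):
    """Choose a target id in proportion to its weight field."""
    ids = [t["id"] for t in targets]
    weights = [t.get("weight", 1.0) for t in targets]
    return rng.choices(ids, weights=weights, k=1)[0]

targets = [
    {"id": "variant-llama", "weight": 0.8},
    {"id": "variant-qwen", "weight": 0.2},
]

# Over many draws, traffic converges to the configured split:
counts = {"variant-llama": 0, "variant-qwen": 0}
rng = random.Random(0)  # seeded for reproducibility
for _ in range(1000):
    counts[pick_variant(targets, rng)] += 1
```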

pack:
  name: together-ai-providers-7
  version: 1.0.0
  enabled: true
providers:
  strategy: weighted_round_robin
  targets:
    - id: variant-llama
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      weight: 0.8
      secret_key_ref:
        env: TOGETHER_API_KEY
    - id: variant-qwen
      provider: together
      model: Qwen/QwQ-32B
      weight: 0.2
      secret_key_ref:
        env: TOGETHER_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

Circuit Breaker

Temporarily remove unhealthy targets from the active rotation. After failure_threshold consecutive failures the target is opened; after recovery_timeout_seconds the gateway sends a limited number of probe requests before fully closing the circuit:
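Conceptually, the breaker is a small state machine over closed, open, and half-open states. The sketch below mirrors the behavior described above using the same field names; it is an illustration, not the gateway's implementation:

```python
import time

class CircuitBreaker:
    """Minimal closed/open/half-open circuit breaker."""

    def __init__(self, failure_threshold=5, recovery_timeout_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_seconds = recovery_timeout_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True  # closed: pass traffic normally
        # open: allow a probe only once the recovery timeout has elapsed
        return self.clock() - self.opened_at >= self.recovery_timeout_seconds

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # probe succeeded: close the circuit

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # threshold reached: open the circuit

# Drive it with a fake clock to show the transitions:
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, recovery_timeout_seconds=10.0,
                         clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()          # second failure opens the circuit
blocked = breaker.allow_request() # open: request is rejected
now[0] = 11.0                     # recovery timeout elapses
probe_allowed = breaker.allow_request()  # half-open: probe goes through
```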

pack:
  name: together-ai-providers-8
  version: 1.0.0
  enabled: true
providers:
  circuit_breaker:
    failure_threshold: 5
    recovery_timeout_seconds: 30
  targets:
    - id: together-main
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

Retry Policy

Retry transient upstream failures with exponential backoff. Only the status codes listed in retryable_status_codes trigger retries — client errors (4xx) are returned immediately:
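The retry loop described above can be sketched as follows. The status-code set and the helper are illustrative assumptions, not the gateway's actual implementation:

```python
import time

RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}  # illustrative set

def with_retries(call, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Retry `call` with exponential backoff on retryable status codes.

    `call` returns a (status, body) pair. Non-retryable statuses,
    including 4xx client errors, are returned immediately.
    """
    for attempt in range(max_attempts):
        status, body = call()
        if status < 400 or status not in RETRYABLE_STATUS_CODES:
            return status, body  # success or non-retryable error
        if attempt < max_attempts - 1:
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    return status, body  # exhausted: surface the last failure

# A flaky upstream that fails twice with 503, then succeeds:
responses = iter([(503, None), (503, None), (200, "ok")])
status, body = with_retries(lambda: next(responses), sleep=lambda s: None)
```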

pack:
  name: together-ai-providers-9
  version: 1.0.0
  enabled: true
providers:
  retry:
    max_attempts: 3
    backoff: exponential
    retryable_status_codes: [429, 500, 502, 503]
  targets:
    - id: together-main
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

Cost Tracking

Declare pricing on each target to enable per-request cost calculations in the Keeptrusts console. Costs are computed from token usage reported by Together and are visible in the Events dashboard:
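The arithmetic behind per-request cost is token counts times per-million-token prices. A sketch using the OpenAI-format usage object and the target's pricing field (the price figures are examples only):

```python
def request_cost_usd(usage, pricing):
    """Compute per-request cost from token usage and per-1M-token pricing.

    `usage` mimics the OpenAI-format usage object returned upstream;
    `pricing` matches the target's pricing field (USD per 1M tokens).
    """
    return (usage["prompt_tokens"] * pricing["prompt"]
            + usage["completion_tokens"] * pricing["completion"]) / 1_000_000

usage = {"prompt_tokens": 1_200, "completion_tokens": 800}
pricing = {"prompt": 0.88, "completion": 0.88}  # example figures only
cost = request_cost_usd(usage, pricing)  # 2,000 tokens at $0.88/M
```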

pack:
  name: together-ai-providers-10
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: together-llama-70b
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      pricing:
        prompt: 0.88      # USD per 1M tokens — check Together's pricing page
        completion: 0.88
      secret_key_ref:
        env: TOGETHER_API_KEY
    - id: together-llama-8b
      provider: together
      model: meta-llama/Llama-3.1-8B-Instruct-Turbo
      pricing:
        prompt: 0.18
        completion: 0.18
      secret_key_ref:
        env: TOGETHER_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

Best Practices

  • Together is OpenAI-compatible — no format translation is needed. Use any OpenAI SDK client with only a base-URL change.
  • Use full model paths — Together model IDs follow the org/model pattern (e.g. meta-llama/Llama-3.3-70B-Instruct-Turbo). Never abbreviate.
  • Prefer Turbo variants — models ending in -Turbo are optimized for throughput on Together's infrastructure and are typically the best price-to-performance choice.
  • Use quantizations to document which quantization level is deployed when Together offers multiple variants of the same model (e.g. fp16 vs int8).
  • Enable health probes on production targets so routing strategies (fallback, latency, weighted) can react to Together API outages or degradations within seconds.
  • Prefer fallback strategy for critical workloads — pair Together with a second provider (OpenAI, Anthropic, etc.) to guarantee availability.
  • Declare pricing on every target, even if approximate — this enables cost dashboards, per-request cost attribution, and budget-enforcement policies in the console.
  • Separate API keys per environment — use distinct secret_key_ref values like TOGETHER_API_KEY_DEV and TOGETHER_API_KEY_PROD to prevent dev traffic from consuming production quotas.
  • Set stream_timeout_seconds for streaming workloads — long generations (code, documents) can exceed the default timeout_seconds; a separate streaming timeout avoids premature disconnects.
  • Combine circuit_breaker with retry for resilient production configs — retries handle transient blips while the circuit breaker prevents cascading failures during sustained outages.

For AI systems

  • Canonical terms: Keeptrusts gateway, Together AI, Together, open models, fast inference, fine-tuning, provider target, policy-config.yaml, provider: "together", TOGETHER_API_KEY.
  • Config field names: provider, model, base_url: "https://api.together.xyz/v1", secret_key_ref.env: "TOGETHER_API_KEY", format: "openai", stream_timeout_seconds, circuit_breaker.
  • Provider shorthand: together:chat:<model> (e.g., together:chat:meta-llama/Llama-3.3-70B-Instruct-Turbo).
  • Key behavior: Together AI hosts open models with optimized inference and OpenAI-compatible API; Keeptrusts adds policy enforcement.
  • Best next pages: Fireworks integration, Groq integration, Provider routing.

For engineers

  • Prerequisites: Together AI API key (TOGETHER_API_KEY env var from api.together.xyz), kt CLI installed.
  • Start command: kt gateway run --listen 0.0.0.0:8080 --policy-config policy-config.yaml.
  • Validate: curl http://localhost:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"meta-llama/Llama-3.3-70B-Instruct-Turbo","messages":[{"role":"user","content":"hello"}]}'.
  • Set stream_timeout_seconds for streaming — long generations (code, documents) can exceed default timeout_seconds.
  • Combine circuit_breaker with retry for resilient production configs — retries handle transient blips, circuit breaker prevents cascading failures.
  • Together AI uses OpenAI-compatible API — standard OpenAI SDKs work without modification.

For leaders

  • Together AI offers broad open-model catalog with competitive pricing and fast inference — good balance of cost, speed, and model selection.
  • Fine-tuning support means you can serve custom models through the same API — Keeptrusts policies apply uniformly.
  • OpenAI-compatible format enables switching between Together AI and other providers with only config changes.
  • Circuit breaker and retry configuration in Keeptrusts provide production resilience beyond what Together AI offers natively.

Next steps