Together AI
Together AI provides fast, cost-effective inference for open-weight models through an OpenAI-compatible REST API. Keeptrusts sits between your application and Together's API endpoint, enforcing policy chains — prompt-injection detection, PII redaction, safety filters, content-quality scoring — on every request and response without requiring application-side changes.
Use this page when
- You need the exact command, config, API, or integration details for Together AI.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
Because Together exposes a standard /v1/chat/completions surface, Keeptrusts needs no format translation. Requests and responses flow through the gateway in native OpenAI wire format, and any OpenAI SDK client can be pointed at the gateway with zero code changes.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
- Together API key — create one in the Together Console → API Keys.
- Keeptrusts CLI — install `kt` (see the quickstart guide).
- Export your key so the gateway can read it at startup:

```bash
export TOGETHER_API_KEY="your-together-api-key"
```
When the provider field is set to "together", Keeptrusts auto-detects both the base URL (https://api.together.xyz/v1) and the API key environment variable (TOGETHER_API_KEY). You only need to override these if you use a custom deployment or a non-standard env-var name.
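If you do need to override the defaults — for example a private deployment or a renamed env var — both fields are settable per target. The values below are illustrative:

```yaml
providers:
  targets:
    - id: together-custom
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      base_url: https://my-private-gateway.example.com/v1  # custom deployment (illustrative)
      secret_key_ref:
        env: TOGETHER_API_KEY_PROD  # non-standard env-var name
```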
Configuration
A minimal policy-config.yaml that routes traffic through Together with prompt-injection and PII policies:
```yaml
pack:
  name: together-gateway
  version: 1.0.0
  enabled: true

policies:
  chain:
    - prompt-injection
    - pii-detector
    - safety-filter
    - audit-logger
  policy:
    prompt-injection:
      threshold: 0.8
      action: block
    pii-detector:
      action: redact
    safety-filter:
      mode: strict
      action: block
    audit-logger:
      retention_days: 365

providers:
  strategy: single
  targets:
    - id: together-llama-70b
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      base_url: https://api.together.xyz/v1
      secret_key_ref:
        env: TOGETHER_API_KEY
```
Start the gateway:
```bash
kt gateway run \
  --listen 0.0.0.0:41002 \
  --policy-config policy-config.yaml
```
Compact Provider Shorthand
You can encode the model directly in the provider field. The two forms below are equivalent:
```yaml
# Shorthand — model embedded in the provider string
- id: "together-llama"
  provider: "together:chat:meta-llama/Llama-3.3-70B-Instruct-Turbo"

# Explicit — separate provider and model fields
- id: "together-llama"
  provider: "together"
  model: "meta-llama/Llama-3.3-70B-Instruct-Turbo"
```
The shorthand form is convenient for quick setups. The explicit form is preferred when you also set pricing, health_probe, or other per-target fields.
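To make the shorthand structure concrete, here is an illustrative parser for the `provider:mode:model` format — a sketch of the string layout, not Keeptrusts' actual parsing code:

```python
def parse_provider_shorthand(value: str):
    """Split 'together:chat:org/model' into (provider, mode, model).

    A bare provider ID like 'together' yields (provider, None, None).
    The model part may itself contain '/' and is left intact.
    """
    parts = value.split(":", 2)
    if len(parts) == 1:
        return parts[0], None, None
    if len(parts) == 3:
        provider, mode, model = parts
        return provider, mode, model
    raise ValueError(f"malformed provider shorthand: {value!r}")
```

For example, `parse_provider_shorthand("together:chat:meta-llama/Llama-3.3-70B-Instruct-Turbo")` yields the provider ID, the mode, and the full model path that the explicit form spells out in separate fields.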
Provider Fields
All fields available on a providers.targets[] entry for Together AI:
| Field | Type | Default | Description |
|---|---|---|---|
| `id` | string | required | Unique identifier for this target. Used in logs, dashboards, and routing references. |
| `provider` | string | required | Provider ID. Use `"together"` or the shorthand `"together:chat:<model>"`. |
| `model` | string | required | Full model path in Together's org/model format (e.g. `"meta-llama/Llama-3.3-70B-Instruct-Turbo"`). |
| `base_url` | string | `https://api.together.xyz/v1` | API base URL. Auto-detected when provider is `"together"`. Override for private deployments or gateway chains. |
| `secret_key_ref` | object | `TOGETHER_API_KEY` | Object reference to the environment variable holding the API key. Auto-detected for Together targets. Use distinct names per environment (e.g. `TOGETHER_API_KEY_PROD`). |
| `timeout_seconds` | integer | 60 | Maximum wall-clock time for non-streaming requests before the gateway returns a timeout error. |
| `stream_timeout_seconds` | integer | falls back to `timeout_seconds` | Maximum time for streaming requests. Set higher than `timeout_seconds` when long generations are expected. |
| `format` | string | `"openai"` | Wire format. Together's API is natively OpenAI-compatible, so this is always `"openai"`. |
| `description` | string | none | Human-readable label shown in the Keeptrusts console dashboards, event logs, and trace views. |
| `weight` | float | 1.0 | Routing weight when using the `weighted_round_robin` strategy. Higher values receive proportionally more traffic. |
| `pricing` | object | none | Token pricing in USD per 1 million tokens. Fields: `prompt` (input) and `completion` (output). Enables cost dashboards, per-request cost tracking, and budget enforcement policies. |
| `health_probe` | object | none | Active health-check configuration. Sub-fields: `enabled` (bool), `interval_seconds` (int), `timeout_seconds` (int). When enabled, the gateway periodically probes the upstream and removes unhealthy targets from rotations. |
| `quantizations` | string | none | Model quantization level (e.g. `"fp16"`, `"fp8"`, `"int8"`, `"int4"`). Informational — used for dashboards and routing metadata when Together offers multiple quantized variants of the same model. |
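Putting several of these fields together, a fully specified target might look like the sketch below. Pricing and probe values are illustrative — check Together's current price list before relying on them:

```yaml
- id: together-llama-70b
  provider: together
  model: meta-llama/Llama-3.3-70B-Instruct-Turbo
  description: Primary 70B chat target
  timeout_seconds: 60
  stream_timeout_seconds: 300
  weight: 1.0
  pricing:
    prompt: 0.88       # USD per 1M input tokens (illustrative)
    completion: 0.88   # USD per 1M output tokens (illustrative)
  health_probe:
    enabled: true
    interval_seconds: 30
    timeout_seconds: 5
  quantizations: fp8
  secret_key_ref:
    env: TOGETHER_API_KEY
```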
Supported Models
Together's catalog contains hundreds of open-weight models. The table below lists popular choices that work well with Keeptrusts:
| Model | Context Window | Type | Typical Use |
|---|---|---|---|
| `meta-llama/Llama-3.3-70B-Instruct-Turbo` | 128K | Chat | General-purpose flagship, strong reasoning and instruction-following |
| `meta-llama/Llama-3.1-8B-Instruct-Turbo` | 128K | Chat | Fast and cost-effective for latency-sensitive workloads |
| `mistralai/Mixtral-8x22B-Instruct-v0.1` | 64K | Chat | Mixture-of-experts architecture, strong reasoning at lower cost |
| `Qwen/QwQ-32B` | 32K | Chat | Multilingual, strong mathematical and logical reasoning |
Any model available on the Together Models page can be used — set the model field to the full org/model path. Keeptrusts passes the model identifier through to the upstream without validation, so newly added models work immediately.
Client Examples
Once the gateway is running, point your client to http://localhost:41002 (the address from the `kt gateway run` command above) instead of https://api.together.xyz/v1. Clients send standard OpenAI-format requests — no SDK changes are needed beyond the base URL.
- Python
- Node.js
- cURL
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="unused",  # auth is handled by Keeptrusts via TOGETHER_API_KEY
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the CAP theorem in distributed systems."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:41002/v1",
  apiKey: "unused", // auth handled by Keeptrusts via TOGETHER_API_KEY
});

const response = await client.chat.completions.create({
  model: "meta-llama/Llama-3.3-70B-Instruct-Turbo",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain the CAP theorem in distributed systems." },
  ],
  temperature: 0.7,
  max_tokens: 512,
});
console.log(response.choices[0].message.content);
```
```bash
curl http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain the CAP theorem in distributed systems."}
    ],
    "temperature": 0.7,
    "max_tokens": 512
  }'
```
Streaming
Keeptrusts fully supports Together's streaming mode. The gateway applies policies to each SSE chunk in real time — prompt-injection checks run on the assembled request before it reaches Together, and content filters process each response chunk as it arrives.
Set stream: true in your request and configure stream_timeout_seconds to accommodate longer generations:
```yaml
pack:
  name: together-ai-providers-3
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: together-streaming
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      stream_timeout_seconds: 300  # allow long generations to finish

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
- Python
- Node.js
- cURL
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Write a poem about open-source AI."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:41002/v1",
  apiKey: "unused",
});

const stream = await client.chat.completions.create({
  model: "meta-llama/Llama-3.3-70B-Instruct-Turbo",
  messages: [{ role: "user", content: "Write a poem about open-source AI." }],
  stream: true,
});
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
```
```bash
curl http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",
    "messages": [{"role": "user", "content": "Write a poem about open-source AI."}],
    "stream": true
  }'
```
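If you consume the raw stream from the cURL example rather than an SDK, each event arrives as a `data: {...}` SSE line in the OpenAI chat-completions chunk format, terminated by a `data: [DONE]` sentinel. A minimal sketch of assembling those lines into the full response text:

```python
import json

def assemble_sse(lines):
    """Concatenate delta content from 'data: {...}' SSE lines."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):  # final chunks may carry an empty delta
            text.append(delta["content"])
    return "".join(text)
```

This mirrors what the Python and Node.js streaming clients above do internally; a production consumer would also handle multi-line events and reconnects.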
Advanced Configuration
Multi-Model Fallback
Automatically fail over from the primary 70B model to the faster 8B model when the primary is unhealthy or times out. The gateway tries targets in order and stops at the first successful response:
```yaml
pack:
  name: together-ai-providers-4
  version: 1.0.0
  enabled: true

providers:
  strategy: fallback
  targets:
    - id: together-70b-primary
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY
    - id: together-8b-fallback
      provider: together
      model: meta-llama/Llama-3.1-8B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
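The ordered-fallback behavior described above can be sketched in a few lines — try targets in declared order and return the first success. This is an illustrative model of the strategy, not the gateway's implementation; `call` stands in for the upstream request:

```python
def call_with_fallback(targets, call):
    """Try each target in order; return (target_id, response) on first success."""
    errors = {}
    for target in targets:
        try:
            return target["id"], call(target)
        except Exception as exc:  # treat any upstream error as a miss
            errors[target["id"]] = exc
    raise RuntimeError(f"all targets failed: {list(errors)}")
```

The real gateway additionally applies per-target timeouts and records which target served each request for the audit log.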
Cross-Provider Fallback
Use Together as the primary and a different provider as the safety net. Both targets share the same policy chain, so governance is consistent regardless of which upstream serves the request:
```yaml
pack:
  name: together-ai-providers-5
  version: 1.0.0
  enabled: true

providers:
  strategy: fallback
  targets:
    - id: together-primary
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY
    - id: openai-fallback
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
Latency-Based Routing
Route each request to the target with the lowest observed p50 latency. The gateway continuously measures upstream response times and adjusts routing weights:
```yaml
pack:
  name: together-ai-providers-6
  version: 1.0.0
  enabled: true

providers:
  strategy: latency
  targets:
    - id: together-70b
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY
    - id: together-8b
      provider: together
      model: meta-llama/Llama-3.1-8B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
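The selection rule — lowest observed p50 — reduces to taking the median of each target's recent latency samples and picking the minimum. A sketch with illustrative sample data:

```python
import statistics

def pick_lowest_p50(latency_samples):
    """latency_samples: {target_id: [recent response times in ms]}.

    Returns the target ID with the lowest median (p50) latency.
    """
    return min(latency_samples,
               key=lambda t: statistics.median(latency_samples[t]))
```

The gateway keeps these samples as a rolling window per target, so routing shifts automatically as upstream performance changes.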
Weighted A/B Testing
Split traffic across models by weight for experimentation. Combine with audit-logger and the Keeptrusts console to compare quality, latency, and cost per variant:
```yaml
pack:
  name: together-ai-providers-7
  version: 1.0.0
  enabled: true

providers:
  strategy: weighted_round_robin
  targets:
    - id: variant-llama
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      weight: 0.8
      secret_key_ref:
        env: TOGETHER_API_KEY
    - id: variant-qwen
      provider: together
      model: Qwen/QwQ-32B
      weight: 0.2
      secret_key_ref:
        env: TOGETHER_API_KEY

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
Circuit Breaker
Temporarily remove unhealthy targets from the active rotation. After failure_threshold consecutive failures the target is opened; after recovery_timeout_seconds the gateway sends a limited number of probe requests before fully closing the circuit:
```yaml
pack:
  name: together-ai-providers-8
  version: 1.0.0
  enabled: true

providers:
  circuit_breaker:               # placement and values illustrative
    failure_threshold: 5
    recovery_timeout_seconds: 30
  targets:
    - id: together-main
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
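The state machine behind this — closed while healthy, open after `failure_threshold` consecutive failures, a probe allowed after `recovery_timeout_seconds` — can be sketched as a toy class. The field names come from the description above; the implementation is a simplified illustration, not gateway code:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout_seconds=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self, now=None):
        """True if the target may be tried (closed, or open long enough to probe)."""
        if self.opened_at is None:
            return True
        now = time.monotonic() if now is None else now
        return now - self.opened_at >= self.recovery_timeout  # half-open probe

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self, now=None):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic() if now is None else now  # open
```

A real breaker also limits how many probes run concurrently in the half-open state before fully closing.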
Retry Policy
Retry transient upstream failures with exponential backoff. Only the status codes listed in retryable_status_codes trigger retries — client errors (4xx) are returned immediately:
```yaml
pack:
  name: together-ai-providers-9
  version: 1.0.0
  enabled: true

providers:
  retry:                         # placement and values illustrative
    max_attempts: 3
    backoff: exponential
    retryable_status_codes: [429, 500, 502, 503, 504]
  targets:
    - id: together-main
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
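Exponential backoff doubles the wait before each successive retry, capped so a long outage does not produce unbounded delays. A sketch of both the delay schedule and the status-code gate; parameter names and the status list are illustrative, not Keeptrusts fields:

```python
def backoff_delays(max_attempts, base_seconds=0.5, cap_seconds=30.0):
    """Delay before each retry attempt: base * 2^attempt, capped."""
    return [min(base_seconds * (2 ** attempt), cap_seconds)
            for attempt in range(max_attempts)]

# Typical transient upstream statuses; 4xx client errors (other than any
# explicitly listed) are returned to the caller immediately.
RETRYABLE = {429, 500, 502, 503, 504}

def should_retry(status_code):
    return status_code in RETRYABLE
```

With the defaults, three retries wait 0.5s, 1s, then 2s; adding random jitter to each delay is a common refinement to avoid synchronized retry storms.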
Cost Tracking
Declare pricing on each target to enable per-request cost calculations in the Keeptrusts console. Costs are computed from token usage reported by Together and are visible in the Events dashboard:
```yaml
pack:
  name: together-ai-providers-10
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: together-llama-70b
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      pricing:
        prompt: 0.88       # USD per 1M tokens — check current Together pricing
        completion: 0.88
      secret_key_ref:
        env: TOGETHER_API_KEY
    - id: together-llama-8b
      provider: together
      model: meta-llama/Llama-3.1-8B-Instruct-Turbo
      pricing:
        prompt: 0.18       # USD per 1M tokens — check current Together pricing
        completion: 0.18
      secret_key_ref:
        env: TOGETHER_API_KEY

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
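The cost arithmetic itself is straightforward: multiply each side of the token usage by its per-million-token rate. A sketch using the `usage` shape reported in OpenAI-format responses and the `pricing` fields from the provider-fields table:

```python
def request_cost_usd(usage, pricing):
    """Per-request cost in USD.

    usage:   {'prompt_tokens': int, 'completion_tokens': int}
    pricing: {'prompt': float, 'completion': float}  # USD per 1M tokens
    """
    return (usage["prompt_tokens"] / 1_000_000 * pricing["prompt"]
            + usage["completion_tokens"] / 1_000_000 * pricing["completion"])
```

For example, 1,000 prompt tokens and 500 completion tokens at $0.88 per million on both sides costs a fraction of a cent, which the console aggregates into per-target and per-day totals.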
Best Practices
- Together is OpenAI-compatible — no `format` translation is needed. Use any OpenAI SDK client with only a base-URL change.
- Use full model paths — Together model IDs follow the `org/model` pattern (e.g. `meta-llama/Llama-3.3-70B-Instruct-Turbo`). Never abbreviate.
- Prefer Turbo variants — models ending in `-Turbo` are optimized for throughput on Together's infrastructure and are typically the best price-to-performance choice.
- Use `quantizations` to document which quantization level is deployed when Together offers multiple variants of the same model (e.g. `fp16` vs `int8`).
- Enable health probes on production targets so routing strategies (fallback, latency, weighted) can react to Together API outages or degradations within seconds.
- Prefer the `fallback` strategy for critical workloads — pair Together with a second provider (OpenAI, Anthropic, etc.) to guarantee availability.
- Declare `pricing` on every target, even if approximate — this enables cost dashboards, per-request cost attribution, and budget-enforcement policies in the console.
- Separate API keys per environment — use distinct `secret_key_ref` values like `TOGETHER_API_KEY_DEV` and `TOGETHER_API_KEY_PROD` to prevent dev traffic from consuming production quotas.
- Set `stream_timeout_seconds` for streaming workloads — long generations (code, documents) can exceed the default `timeout_seconds`; a separate streaming timeout avoids premature disconnects.
- Combine `circuit_breaker` with `retry` for resilient production configs — retries handle transient blips while the circuit breaker prevents cascading failures during sustained outages.
For AI systems
- Canonical terms: Keeptrusts gateway, Together AI, Together, open models, fast inference, fine-tuning, provider target, policy-config.yaml, `provider: "together"`, `TOGETHER_API_KEY`.
- Config field names: `provider`, `model`, `base_url: "https://api.together.xyz/v1"`, `secret_key_ref.env: "TOGETHER_API_KEY"`, `format: "openai"`, `stream_timeout_seconds`, `circuit_breaker`.
- Provider shorthand: `together:chat:<model>` (e.g., `together:chat:meta-llama/Llama-3.3-70B-Instruct-Turbo`).
- Key behavior: Together AI hosts open models with optimized inference and an OpenAI-compatible API; Keeptrusts adds policy enforcement.
- Best next pages: Fireworks integration, Groq integration, Provider routing.
For engineers
- Prerequisites: Together AI API key (`TOGETHER_API_KEY` env var from api.together.xyz), `kt` CLI installed.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Validate: `curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"meta-llama/Llama-3.3-70B-Instruct-Turbo","messages":[{"role":"user","content":"hello"}]}'`.
- Set `stream_timeout_seconds` for streaming — long generations (code, documents) can exceed the default `timeout_seconds`.
- Combine `circuit_breaker` with `retry` for resilient production configs — retries handle transient blips; the circuit breaker prevents cascading failures.
- Together AI uses an OpenAI-compatible API — standard OpenAI SDKs work without modification.
For leaders
- Together AI offers broad open-model catalog with competitive pricing and fast inference — good balance of cost, speed, and model selection.
- Fine-tuning support means you can serve custom models through the same API — Keeptrusts policies apply uniformly.
- OpenAI-compatible format enables switching between Together AI and other providers with only config changes.
- Circuit breaker and retry configuration in Keeptrusts provide production resilience beyond what Together AI offers natively.
Next steps
- Fireworks integration — alternative fast inference with function calling
- Groq integration — ultra-low latency inference (LPU)
- HuggingFace integration — alternative access to open models
- Provider routing strategies — weighted routing and fallback configuration
- Quickstart — install `kt` and run your first gateway