Together AI
Together AI provides fast, cost-effective inference for open-weight models through an OpenAI-compatible REST API. Keeptrusts sits between your application and Together's API endpoint, enforcing policy chains — prompt-injection detection, PII redaction, safety filters, content-quality scoring — on every request and response without requiring application-side changes.
Use this page when
- You need the exact command, config, API, or integration details for Together AI.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
Because Together exposes a standard /v1/chat/completions surface, Keeptrusts needs no format translation. Requests and responses flow through the gateway in native OpenAI wire format, and any OpenAI SDK client can be pointed at the gateway with zero code changes.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Prerequisites
- Together API key — create one in the Together Console → API Keys.
- Keeptrusts CLI — install `kt` (see the quickstart guide).
- Export your key so the gateway can read it at startup:

```bash
export TOGETHER_API_KEY="your-together-api-key"
```
When the provider field is set to "together", Keeptrusts auto-detects both the base URL (https://api.together.xyz/v1) and the API key environment variable (TOGETHER_API_KEY). You only need to override these if you use a custom deployment or a non-standard env-var name.
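If you do need to override the defaults — for example a private deployment or a renamed env var — both fields are settable per target. The values below are illustrative:

```yaml
providers:
  targets:
    - id: together-custom
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      base_url: https://my-private-gateway.example.com/v1  # custom deployment (illustrative)
      secret_key_ref:
        env: TOGETHER_API_KEY_PROD  # non-standard env-var name
```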
Configuration
A minimal policy-config.yaml that routes traffic through Together with prompt-injection and PII policies:
```yaml
pack:
  name: together-gateway
  version: 1.0.0
  enabled: true

policies:
  chain:
    - prompt-injection
    - pii-detector
    - safety-filter
    - audit-logger
  policy:
    prompt-injection:
      threshold: 0.8
      action: block
    pii-detector:
      action: redact
    safety-filter:
      mode: strict
      action: block
    audit-logger:
      retention_days: 365

providers:
  strategy: single
  targets:
    - id: together-llama-70b
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      base_url: https://api.together.xyz/v1
      secret_key_ref:
        env: TOGETHER_API_KEY
```
Start the gateway:
```bash
kt gateway run \
  --listen 0.0.0.0:41002 \
  --policy-config policy-config.yaml
```
Compact Provider Shorthand
You can encode the model directly in the provider field. The two forms below are equivalent:
```yaml
# Shorthand — model embedded in the provider string
- id: "together-llama"
  provider: "together:chat:meta-llama/Llama-3.3-70B-Instruct-Turbo"

# Explicit — separate provider and model fields
- id: "together-llama"
  provider: "together"
  model: "meta-llama/Llama-3.3-70B-Instruct-Turbo"
```
The shorthand form is convenient for quick setups. The explicit form is preferred when you also set pricing, health_probe, or other per-target fields.
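To make the shorthand structure concrete, here is an illustrative parser for the `provider:mode:model` format — a sketch of the string layout, not Keeptrusts' actual parsing code:

```python
def parse_provider_shorthand(value: str):
    """Split 'together:chat:org/model' into (provider, mode, model).

    A bare provider ID like 'together' yields (provider, None, None).
    The model part may itself contain '/' and is left intact.
    """
    parts = value.split(":", 2)
    if len(parts) == 1:
        return parts[0], None, None
    if len(parts) == 3:
        provider, mode, model = parts
        return provider, mode, model
    raise ValueError(f"malformed provider shorthand: {value!r}")
```

For example, `parse_provider_shorthand("together:chat:meta-llama/Llama-3.3-70B-Instruct-Turbo")` yields the provider ID, the mode, and the full model path that the explicit form spells out in separate fields.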
Provider Fields
All fields available on a providers.targets[] entry for Together AI:
| Field | Type | Default | Description |
|---|---|---|---|
| `id` | string | required | Unique identifier for this target. Used in logs, dashboards, and routing references. |
| `provider` | string | required | Provider ID. Use `"together"` or the shorthand `"together:chat:<model>"`. |
| `model` | string | required | Full model path in Together's org/model format (e.g. `"meta-llama/Llama-3.3-70B-Instruct-Turbo"`). |
| `base_url` | string | `https://api.together.xyz/v1` | API base URL. Auto-detected when provider is `"together"`. Override for private deployments or gateway chains. |
| `secret_key_ref` | object | `TOGETHER_API_KEY` | Object reference to the environment variable holding the API key. Auto-detected for Together targets. Use distinct names per environment (e.g. `TOGETHER_API_KEY_PROD`). |
| `timeout_seconds` | integer | 60 | Maximum wall-clock time for non-streaming requests before the gateway returns a timeout error. |
| `stream_timeout_seconds` | integer | falls back to `timeout_seconds` | Maximum time for streaming requests. Set higher than `timeout_seconds` when long generations are expected. |
| `format` | string | `"openai"` | Wire format. Together's API is natively OpenAI-compatible, so this is always `"openai"`. |
| `description` | string | none | Human-readable label shown in the Keeptrusts console dashboards, event logs, and trace views. |
| `weight` | float | 1.0 | Routing weight when using the `weighted_round_robin` strategy. Higher values receive proportionally more traffic. |
| `pricing` | object | none | Token pricing in USD per 1 million tokens. Fields: `prompt` (input) and `completion` (output). Enables cost dashboards, per-request cost tracking, and budget enforcement policies. |
| `health_probe` | object | none | Active health-check configuration. Sub-fields: `enabled` (bool), `interval_seconds` (int), `timeout_seconds` (int). When enabled, the gateway periodically probes the upstream and removes unhealthy targets from rotations. |
| `quantizations` | string | none | Model quantization level (e.g. `"fp16"`, `"fp8"`, `"int8"`, `"int4"`). Informational — used for dashboards and routing metadata when Together offers multiple quantized variants of the same model. |
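Putting several of these fields together, a fully specified target might look like the sketch below. Pricing and probe values are illustrative — check Together's current price list before relying on them:

```yaml
- id: together-llama-70b
  provider: together
  model: meta-llama/Llama-3.3-70B-Instruct-Turbo
  description: Primary 70B chat target
  timeout_seconds: 60
  stream_timeout_seconds: 300
  weight: 1.0
  pricing:
    prompt: 0.88       # USD per 1M input tokens (illustrative)
    completion: 0.88   # USD per 1M output tokens (illustrative)
  health_probe:
    enabled: true
    interval_seconds: 30
    timeout_seconds: 5
  quantizations: fp8
  secret_key_ref:
    env: TOGETHER_API_KEY
```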
Supported Models
Together's catalog contains hundreds of open-weight models. The table below lists popular choices that work well with Keeptrusts:
| Model | Context Window | Type | Typical Use |
|---|---|---|---|
| `meta-llama/Llama-3.3-70B-Instruct-Turbo` | 128K | Chat | General-purpose flagship, strong reasoning and instruction-following |
| `meta-llama/Llama-3.1-8B-Instruct-Turbo` | 128K | Chat | Fast and cost-effective for latency-sensitive workloads |
| `mistralai/Mixtral-8x22B-Instruct-v0.1` | 64K | Chat | Mixture-of-experts architecture, strong reasoning at lower cost |
| `Qwen/QwQ-32B` | 32K | Chat | Multilingual, strong mathematical and logical reasoning |
Any model available on the Together Models page can be used — set the model field to the full org/model path. Keeptrusts passes the model identifier through to the upstream without validation, so newly added models work immediately.
Client Examples
Once the gateway is running, point your client to http://localhost:41002 (the address from the `kt gateway run` command above) instead of https://api.together.xyz/v1. Clients send standard OpenAI-format requests — no SDK changes are needed beyond the base URL.
- Python
- Node.js
- cURL
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:41002/v1",
    api_key="unused",  # auth is handled by Keeptrusts via TOGETHER_API_KEY
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the CAP theorem in distributed systems."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:41002/v1",
  apiKey: "unused", // auth handled by Keeptrusts via TOGETHER_API_KEY
});

const response = await client.chat.completions.create({
  model: "meta-llama/Llama-3.3-70B-Instruct-Turbo",
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "Explain the CAP theorem in distributed systems." },
  ],
  temperature: 0.7,
  max_tokens: 512,
});
console.log(response.choices[0].message.content);
```
```bash
curl http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain the CAP theorem in distributed systems."}
    ],
    "temperature": 0.7,
    "max_tokens": 512
  }'
```
Streaming
Keeptrusts fully supports Together's streaming mode. The gateway applies policies to each SSE chunk in real time — prompt-injection checks run on the assembled request before it reaches Together, and content filters process each response chunk as it arrives.
Set stream: true in your request and configure stream_timeout_seconds to accommodate longer generations:
```yaml
pack:
  name: together-ai-providers-3
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: together-streaming
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      stream_timeout_seconds: 300  # allow long generations to finish

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
- Python
- Node.js
- cURL
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:41002/v1", api_key="unused")

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Write a poem about open-source AI."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://localhost:41002/v1",
  apiKey: "unused",
});

const stream = await client.chat.completions.create({
  model: "meta-llama/Llama-3.3-70B-Instruct-Turbo",
  messages: [{ role: "user", content: "Write a poem about open-source AI." }],
  stream: true,
});
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content;
  if (content) process.stdout.write(content);
}
```
```bash
curl http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "meta-llama/Llama-3.3-70B-Instruct-Turbo",
    "messages": [{"role": "user", "content": "Write a poem about open-source AI."}],
    "stream": true
  }'
```
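If you consume the raw stream from the cURL example rather than an SDK, each event arrives as a `data: {...}` SSE line in the OpenAI chat-completions chunk format, terminated by a `data: [DONE]` sentinel. A minimal sketch of assembling those lines into the full response text:

```python
import json

def assemble_sse(lines):
    """Concatenate delta content from 'data: {...}' SSE lines."""
    text = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):  # final chunks may carry an empty delta
            text.append(delta["content"])
    return "".join(text)
```

This mirrors what the Python and Node.js streaming clients above do internally; a production consumer would also handle multi-line events and reconnects.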
Advanced Configuration
Multi-Model Fallback
Automatically fail over from the primary 70B model to the faster 8B model when the primary is unhealthy or times out. The gateway tries targets in order and stops at the first successful response:
```yaml
pack:
  name: together-ai-providers-4
  version: 1.0.0
  enabled: true

providers:
  strategy: fallback
  targets:
    - id: together-70b-primary
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY
    - id: together-8b-fallback
      provider: together
      model: meta-llama/Llama-3.1-8B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
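The ordered-fallback behavior described above can be sketched in a few lines — try targets in declared order and return the first success. This is an illustrative model of the strategy, not the gateway's implementation; `call` stands in for the upstream request:

```python
def call_with_fallback(targets, call):
    """Try each target in order; return (target_id, response) on first success."""
    errors = {}
    for target in targets:
        try:
            return target["id"], call(target)
        except Exception as exc:  # treat any upstream error as a miss
            errors[target["id"]] = exc
    raise RuntimeError(f"all targets failed: {list(errors)}")
```

The real gateway additionally applies per-target timeouts and records which target served each request for the audit log.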
Cross-Provider Fallback
Use Together as the primary and a different provider as the safety net. Both targets share the same policy chain, so governance is consistent regardless of which upstream serves the request:
```yaml
pack:
  name: together-ai-providers-5
  version: 1.0.0
  enabled: true

providers:
  strategy: fallback
  targets:
    - id: together-primary
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY
    - id: openai-fallback
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
Latency-Based Routing
Route each request to the target with the lowest observed p50 latency. The gateway continuously measures upstream response times and adjusts routing weights:
```yaml
pack:
  name: together-ai-providers-6
  version: 1.0.0
  enabled: true

providers:
  strategy: latency
  targets:
    - id: together-70b
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY
    - id: together-8b
      provider: together
      model: meta-llama/Llama-3.1-8B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
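The selection rule — lowest observed p50 — reduces to taking the median of each target's recent latency samples and picking the minimum. A sketch with illustrative sample data:

```python
import statistics

def pick_lowest_p50(latency_samples):
    """latency_samples: {target_id: [recent response times in ms]}.

    Returns the target ID with the lowest median (p50) latency.
    """
    return min(latency_samples,
               key=lambda t: statistics.median(latency_samples[t]))
```

The gateway keeps these samples as a rolling window per target, so routing shifts automatically as upstream performance changes.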
Weighted A/B Testing
Split traffic across models by weight for experimentation. Combine with audit-logger and the Keeptrusts console to compare quality, latency, and cost per variant:
```yaml
pack:
  name: together-ai-providers-7
  version: 1.0.0
  enabled: true

providers:
  strategy: weighted_round_robin
  targets:
    - id: variant-llama
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      weight: 0.8
      secret_key_ref:
        env: TOGETHER_API_KEY
    - id: variant-qwen
      provider: together
      model: Qwen/QwQ-32B
      weight: 0.2
      secret_key_ref:
        env: TOGETHER_API_KEY

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
Circuit Breaker
Temporarily remove unhealthy targets from the active rotation. After failure_threshold consecutive failures the target is opened; after recovery_timeout_seconds the gateway sends a limited number of probe requests before fully closing the circuit:
```yaml
pack:
  name: together-ai-providers-8
  version: 1.0.0
  enabled: true

providers:
  circuit_breaker:               # placement and values illustrative
    failure_threshold: 5
    recovery_timeout_seconds: 30
  targets:
    - id: together-main
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
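The state machine behind this — closed while healthy, open after `failure_threshold` consecutive failures, a probe allowed after `recovery_timeout_seconds` — can be sketched as a toy class. The field names come from the description above; the implementation is a simplified illustration, not gateway code:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout_seconds=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout_seconds
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self, now=None):
        """True if the target may be tried (closed, or open long enough to probe)."""
        if self.opened_at is None:
            return True
        now = time.monotonic() if now is None else now
        return now - self.opened_at >= self.recovery_timeout  # half-open probe

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close the circuit

    def record_failure(self, now=None):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic() if now is None else now  # open
```

A real breaker also limits how many probes run concurrently in the half-open state before fully closing.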
Retry Policy
Retry transient upstream failures with exponential backoff. Only the status codes listed in retryable_status_codes trigger retries — client errors (4xx) are returned immediately:
```yaml
pack:
  name: together-ai-providers-9
  version: 1.0.0
  enabled: true

providers:
  retry:                         # placement and values illustrative
    max_attempts: 3
    backoff: exponential
    retryable_status_codes: [429, 500, 502, 503, 504]
  targets:
    - id: together-main
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      secret_key_ref:
        env: TOGETHER_API_KEY

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
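Exponential backoff doubles the wait before each successive retry, capped so a long outage does not produce unbounded delays. A sketch of both the delay schedule and the status-code gate; parameter names and the status list are illustrative, not Keeptrusts fields:

```python
def backoff_delays(max_attempts, base_seconds=0.5, cap_seconds=30.0):
    """Delay before each retry attempt: base * 2^attempt, capped."""
    return [min(base_seconds * (2 ** attempt), cap_seconds)
            for attempt in range(max_attempts)]

# Typical transient upstream statuses; 4xx client errors (other than any
# explicitly listed) are returned to the caller immediately.
RETRYABLE = {429, 500, 502, 503, 504}

def should_retry(status_code):
    return status_code in RETRYABLE
```

With the defaults, three retries wait 0.5s, 1s, then 2s; adding random jitter to each delay is a common refinement to avoid synchronized retry storms.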
Cost Tracking
Declare pricing on each target to enable per-request cost calculations in the Keeptrusts console. Costs are computed from token usage reported by Together and are visible in the Events dashboard:
```yaml
pack:
  name: together-ai-providers-10
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: together-llama-70b
      provider: together
      model: meta-llama/Llama-3.3-70B-Instruct-Turbo
      pricing:
        prompt: 0.88       # USD per 1M tokens — check current Together pricing
        completion: 0.88
      secret_key_ref:
        env: TOGETHER_API_KEY
    - id: together-llama-8b
      provider: together
      model: meta-llama/Llama-3.1-8B-Instruct-Turbo
      pricing:
        prompt: 0.18       # USD per 1M tokens — check current Together pricing
        completion: 0.18
      secret_key_ref:
        env: TOGETHER_API_KEY

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
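The cost arithmetic itself is straightforward: multiply each side of the token usage by its per-million-token rate. A sketch using the `usage` shape reported in OpenAI-format responses and the `pricing` fields from the provider-fields table:

```python
def request_cost_usd(usage, pricing):
    """Per-request cost in USD.

    usage:   {'prompt_tokens': int, 'completion_tokens': int}
    pricing: {'prompt': float, 'completion': float}  # USD per 1M tokens
    """
    return (usage["prompt_tokens"] / 1_000_000 * pricing["prompt"]
            + usage["completion_tokens"] / 1_000_000 * pricing["completion"])
```

For example, 1,000 prompt tokens and 500 completion tokens at $0.88 per million on both sides costs a fraction of a cent, which the console aggregates into per-target and per-day totals.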
Best Practices
- Together is OpenAI-compatible — no `format` translation is needed. Use any OpenAI SDK client with only a base-URL change.
- Use full model paths — Together model IDs follow the `org/model` pattern (e.g. `meta-llama/Llama-3.3-70B-Instruct-Turbo`). Never abbreviate.
- Prefer Turbo variants — models ending in `-Turbo` are optimized for throughput on Together's infrastructure and are typically the best price-to-performance choice.
- Use `quantizations` to document which quantization level is deployed when Together offers multiple variants of the same model (e.g. `fp16` vs `int8`).
- Enable health probes on production targets so routing strategies (fallback, latency, weighted) can react to Together API outages or degradations within seconds.
- Prefer the `fallback` strategy for critical workloads — pair Together with a second provider (OpenAI, Anthropic, etc.) to guarantee availability.
- Declare `pricing` on every target, even if approximate — this enables cost dashboards, per-request cost attribution, and budget-enforcement policies in the console.
- Separate API keys per environment — use distinct `secret_key_ref` values like `TOGETHER_API_KEY_DEV` and `TOGETHER_API_KEY_PROD` to prevent dev traffic from consuming production quotas.
- Set `stream_timeout_seconds` for streaming workloads — long generations (code, documents) can exceed the default `timeout_seconds`; a separate streaming timeout avoids premature disconnects.
- Combine `circuit_breaker` with `retry` for resilient production configs — retries handle transient blips while the circuit breaker prevents cascading failures during sustained outages.
For AI systems
- Canonical terms: Keeptrusts gateway, Together AI, Together, open models, fast inference, fine-tuning, provider target, policy-config.yaml, `provider: "together"`, `TOGETHER_API_KEY`.
- Config field names: `provider`, `model`, `base_url: "https://api.together.xyz/v1"`, `secret_key_ref.env: "TOGETHER_API_KEY"`, `format: "openai"`, `stream_timeout_seconds`, `circuit_breaker`.
- Provider shorthand: `together:chat:<model>` (e.g., `together:chat:meta-llama/Llama-3.3-70B-Instruct-Turbo`).
- Key behavior: Together AI hosts open models with optimized inference and an OpenAI-compatible API; Keeptrusts adds policy enforcement.
- Best next pages: Fireworks integration, Groq integration, Provider routing.
For engineers
- Prerequisites: Together AI API key (`TOGETHER_API_KEY` env var from api.together.xyz), `kt` CLI installed.
- Start command: `kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml`.
- Validate: `curl http://localhost:41002/v1/chat/completions -H 'Content-Type: application/json' -d '{"model":"meta-llama/Llama-3.3-70B-Instruct-Turbo","messages":[{"role":"user","content":"hello"}]}'`.
- Set `stream_timeout_seconds` for streaming — long generations (code, documents) can exceed the default `timeout_seconds`.
- Combine `circuit_breaker` with `retry` for resilient production configs — retries handle transient blips; the circuit breaker prevents cascading failures.
- Together AI uses an OpenAI-compatible API — standard OpenAI SDKs work without modification.
For leaders
- Together AI offers broad open-model catalog with competitive pricing and fast inference — good balance of cost, speed, and model selection.
- Fine-tuning support means you can serve custom models through the same API — Keeptrusts policies apply uniformly.
- OpenAI-compatible format enables switching between Together AI and other providers with only config changes.
- Circuit breaker and retry configuration in Keeptrusts provide production resilience beyond what Together AI offers natively.
Next steps
- Fireworks integration — alternative fast inference with function calling
- Groq integration — ultra-low latency inference (LPU)
- HuggingFace integration — alternative access to open models
- Provider routing strategies — weighted routing and fallback configuration
- Quickstart — install `kt` and run your first gateway