
Tutorial: A/B Testing AI Models with Traffic Splitting

This tutorial shows you how to configure weighted model groups in the Keeptrusts gateway to split traffic between two LLM providers, measure quality and cost differences through decision events, and adjust routing weights based on results.

Use this page when

  • You are configuring weighted traffic splitting between two or more LLM models.
  • You want to compare quality, latency, and cost across providers using production traffic.
  • You need to adjust routing weights based on A/B test results.
  • You are evaluating a model migration (e.g., GPT-4o to Claude) with data-driven evidence.

Primary audience

  • Primary: ML engineers and platform teams evaluating model quality and cost trade-offs
  • Secondary: Product managers deciding on model selection; finance teams comparing per-model spend

Prerequisites

  • kt CLI installed (first-run tutorial)
  • API keys for OpenAI and Anthropic exported as environment variables
  • A running Keeptrusts API instance (for event analytics)
  • curl and jq installed

How Traffic Splitting Works

The gateway evaluates model_groups at request time and selects a target model based on configured weights. Every decision event records which model handled the request, enabling side-by-side comparison of quality, latency, and cost.
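
The gateway's internal selection code is not shown in this tutorial, but weighted routing is conventionally implemented as cumulative-weight sampling: draw a random point across the total weight and walk the buckets until the draw lands inside one. A minimal sketch (the function is illustrative, not Keeptrusts source; the model names and weights mirror the config in Step 2):

```python
import random

def pick_target(targets, rng=random.random):
    """Draw a point in [0, total_weight) and walk the cumulative
    weights until the draw falls inside a target's bucket."""
    total = sum(weight for _, weight in targets)
    draw = rng() * total
    cumulative = 0.0
    for model, weight in targets:
        cumulative += weight
        if draw < cumulative:
            return model
    return targets[-1][0]  # guard against floating-point edge cases

targets = [("gpt-4o", 70), ("claude-sonnet-4-20250514", 30)]
counts = {model: 0 for model, _ in targets}
for _ in range(10_000):
    counts[pick_target(targets)] += 1
# counts approaches a 7000/3000 split as the sample grows
```

Because each request is an independent draw, small batches will scatter around the configured ratio; the split only converges over many requests.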

Step 1: Set Provider API Keys

Export both provider API keys:

export OPENAI_API_KEY="sk-your-openai-key"
export ANTHROPIC_API_KEY="sk-ant-your-anthropic-key"

Step 2: Create the A/B Test Configuration

Create policy-config.yaml with a model group that splits traffic 70/30:

version: '1'
providers:
  targets:
    - id: openai
      provider: openai
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: anthropic
      provider: anthropic
      secret_key_ref:
        env: ANTHROPIC_API_KEY
model_groups:
  - name: quality-ab-test
    strategy: weighted
    targets:
      - provider: openai
        model: gpt-4o
        weight: 70
      - provider: anthropic
        model: claude-sonnet-4-20250514
        weight: 30
policies:
  - name: basic-filter
    type: content_filter
    action: flag
    config:
      categories:
        - hate
      threshold: medium

Step 3: Validate and Start the Gateway

kt policy lint --file policy-config.yaml

Expected output:

✓ Configuration is valid
Providers: 2 (openai, anthropic)
Model groups: 1 (quality-ab-test)
Policies: 1 (basic-filter)

Start the gateway:

kt gateway run --policy-config policy-config.yaml --port 41002

Expected output:

INFO keeptrusts::gateway Starting gateway on 0.0.0.0:41002
INFO keeptrusts::gateway Loaded 2 provider(s), 1 model group(s), 1 policy(ies)
INFO keeptrusts::gateway Model group "quality-ab-test": gpt-4o (70%) / claude-sonnet-4-20250514 (30%)
INFO keeptrusts::gateway Gateway ready

Step 4: Send Test Requests

Send a batch of requests through the gateway. Pass the model group name in the model field — the gateway selects the underlying provider model according to the configured weights, so you never name a specific provider model yourself:

for i in $(seq 1 20); do
  curl -s http://localhost:41002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "quality-ab-test",
      "messages": [
        {"role": "user", "content": "Explain the principle of least privilege in two sentences."}
      ]
    }' | jq '{model: .model, tokens: .usage.total_tokens}'
  echo "---"
done

You should see roughly 14 of 20 requests routed to gpt-4o and 6 to claude-sonnet-4-20250514. With only 20 requests, expect some variance around the 70/30 target.

Step 5: Analyze Routing Distribution

Use kt events tail to verify the traffic split:

kt events tail --last 20 --format json | jq -r '.model' | sort | uniq -c | sort -rn

Expected output (approximately):

14 gpt-4o
6 claude-sonnet-4-20250514
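
Small samples will rarely land exactly on the configured split. A chi-square goodness-of-fit check — sketched by hand below, not a kt feature — tells you whether an observed drift is still consistent with the configured weights. The 16/4 run is a hypothetical example:

```python
def chi_square_fit(observed, weights):
    """Chi-square statistic comparing observed request counts against
    the split implied by the configured routing weights."""
    total = sum(observed.values())
    weight_sum = sum(weights.values())
    stat = 0.0
    for model, count in observed.items():
        expected = total * weights[model] / weight_sum
        stat += (count - expected) ** 2 / expected
    return stat

weights = {"gpt-4o": 70, "claude-sonnet-4-20250514": 30}
# Hypothetical run that drifted to 16/4:
stat = chi_square_fit({"gpt-4o": 16, "claude-sonnet-4-20250514": 4}, weights)
# With 1 degree of freedom, reject the 70/30 hypothesis only if the
# statistic exceeds 3.84 (the 5% critical value); 16/4 does not.
```

A 14/6 run scores exactly 0 (it matches the expected counts), and even 16/4 scores well under the 3.84 threshold, so neither would indicate a routing bug.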

Step 6: Compare Cost per Model

Aggregate cost data from events:

kt events tail --last 20 --format json | jq -s '
  group_by(.model) | map({
    model: .[0].model,
    requests: length,
    total_tokens: (map(.usage.total_tokens) | add),
    avg_latency_ms: (map(.latency_ms) | add / length | floor)
  })
'

Example output:

[
  {
    "model": "gpt-4o",
    "requests": 14,
    "total_tokens": 1820,
    "avg_latency_ms": 340
  },
  {
    "model": "claude-sonnet-4-20250514",
    "requests": 6,
    "total_tokens": 780,
    "avg_latency_ms": 290
  }
]
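
Token totals become dollar figures once multiplied by per-token prices. The prices below are placeholders, not current provider quotes — substitute the rates from your providers' pricing pages. Note the sketch blends input and output tokens, which real pricing bills at different rates:

```python
# Placeholder per-million-token prices -- NOT current provider quotes.
PRICE_PER_MTOK = {
    "gpt-4o": 5.00,
    "claude-sonnet-4-20250514": 6.00,
}

def cost_usd(model, total_tokens):
    """Blended cost estimate (real pricing splits input/output tokens)."""
    return total_tokens / 1_000_000 * PRICE_PER_MTOK[model]

# Using the token totals from the example output above:
for model, tokens in [("gpt-4o", 1820), ("claude-sonnet-4-20250514", 780)]:
    print(f"{model}: ~${cost_usd(model, tokens):.4f} over {tokens} tokens")
```

For a fair comparison, divide each model's cost by its request count — raw totals are skewed by the 70/30 traffic split.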

Step 7: Adjust Weights Based on Results

If Claude shows better latency or cost efficiency, shift more traffic to it. Update policy-config.yaml:

model_groups:
  - name: quality-ab-test
    strategy: weighted
    targets:
      - provider: openai
        model: gpt-4o
        weight: 50
      - provider: anthropic
        model: claude-sonnet-4-20250514
        weight: 50

Reload without downtime:

kt config reload

Expected output:

✓ Configuration reloaded
Model group "quality-ab-test": gpt-4o (50%) / claude-sonnet-4-20250514 (50%)

Step 8: Graduate the Winner

Once you have enough data, set the winning model to 100% weight:

model_groups:
  - name: quality-ab-test
    strategy: weighted
    targets:
      - provider: anthropic
        model: claude-sonnet-4-20250514
        weight: 100

Reload and confirm:

kt config reload
kt events tail --last 5

All new requests now route to the graduated model.

Summary

  • model_groups with strategy: weighted splits traffic by percentage
  • kt events tail provides real-time routing and cost data
  • kt config reload adjusts weights without restarting the gateway
  • Graduate the winning model by setting its weight to 100

For AI systems

  • Canonical terms: Keeptrusts gateway, model groups, traffic splitting, weighted routing, A/B test, model comparison.
  • Config fields: model_groups[].name, model_groups[].strategy: weighted, model_groups[].targets[].weight, model_groups[].targets[].provider, model_groups[].targets[].model.
  • CLI commands: kt gateway run, kt policy lint, kt events tail, kt events tail --format json | jq '.model'.
  • Best next pages: Traffic Mirroring, Multi-Provider Failover, Cost Tracking & Budgets.

For engineers

  • Prerequisites: kt CLI, API keys for both providers (e.g., OPENAI_API_KEY + ANTHROPIC_API_KEY), running Keeptrusts API, curl and jq.
  • Validate: kt policy lint confirms model group strategy and weights sum correctly.
  • Measure: kt events tail --format json | jq '{model, latency_ms, tokens: .usage.total_tokens}' shows per-request routing decisions.
  • Adjust weights: edit weight values and hot-reload — no restart needed.
  • Sufficient sample: run at least 100 requests per variant before drawing conclusions.
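
The 100-requests-per-variant rule of thumb above can be sanity-checked with a standard two-sample power calculation. A sketch, where the 120 ms standard deviation and 50 ms target difference are illustrative numbers, not measurements from this tutorial:

```python
import math

def sample_size_per_variant(sigma, delta, z_alpha=1.96, z_beta=0.84):
    """Requests per variant needed to detect a mean difference of
    `delta` at ~95% confidence and ~80% power, given a per-request
    standard deviation `sigma` (two-sample z-approximation)."""
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# e.g. latency std dev ~120 ms, detecting a 50 ms difference:
n = sample_size_per_variant(sigma=120, delta=50)  # ~91 per variant
```

With those assumptions the formula lands near 91 requests per variant, which is why ~100 is a reasonable floor; noisier metrics or smaller differences require proportionally more traffic.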

For leaders

  • A/B testing enables data-driven model selection instead of relying on benchmarks alone.
  • Compare cost per request, latency, and output quality side-by-side on your actual workloads.
  • Weighted routing lets you gradually shift traffic (e.g., 90/10 → 50/50 → 0/100) to de-risk migrations.
  • Decision events provide an auditable record of which model served each request.

Next steps