
Tutorial: A/B Testing AI Models with Traffic Splitting

This tutorial shows you how to configure weighted model groups in the Keeptrusts gateway to split traffic between two LLM providers, measure quality and cost differences through decision events, and adjust routing weights based on results.

Use this page when

  • You are configuring weighted traffic splitting between two or more LLM models.
  • You want to compare quality, latency, and cost across providers using production traffic.
  • You need to adjust routing weights based on A/B test results.
  • You are evaluating a model migration (e.g., GPT-4o to Claude) with data-driven evidence.

Primary audience

  • Primary: ML engineers and platform teams evaluating model quality and cost trade-offs
  • Secondary: Product managers deciding on model selection; finance teams comparing per-model spend

Prerequisites

  • kt CLI installed (first-run tutorial)
  • API keys for OpenAI and Anthropic exported as environment variables
  • A running Keeptrusts API instance (for event analytics)
  • curl and jq installed

How Traffic Splitting Works

The gateway evaluates model_groups at request time and selects a target model based on configured weights. Every decision event records which model handled the request, enabling side-by-side comparison of quality, latency, and cost.
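
The gateway's internal selection code is not shown in this tutorial, but weighted routing is conventionally implemented as cumulative-weight sampling: draw a random point across the total weight and walk the buckets until the draw lands inside one. A minimal sketch (the function is illustrative, not Keeptrusts source; the model names and weights mirror the config in Step 2):

```python
import random

def pick_target(targets, rng=random.random):
    """Draw a point in [0, total_weight) and walk the cumulative
    weights until the draw falls inside a target's bucket."""
    total = sum(weight for _, weight in targets)
    draw = rng() * total
    cumulative = 0.0
    for model, weight in targets:
        cumulative += weight
        if draw < cumulative:
            return model
    return targets[-1][0]  # guard against floating-point edge cases

targets = [("gpt-4o", 70), ("claude-sonnet-4-20250514", 30)]
counts = {model: 0 for model, _ in targets}
for _ in range(10_000):
    counts[pick_target(targets)] += 1
# counts approaches a 7000/3000 split as the sample grows
```

Because each request is an independent draw, small batches will scatter around the configured ratio; the split only converges over many requests.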

Step 1: Set Provider API Keys

Export both provider API keys:

export OPENAI_API_KEY="sk-your-openai-key"
export ANTHROPIC_API_KEY="sk-ant-your-anthropic-key"

Step 2: Create the A/B Test Configuration

Create policy-config.yaml with a model group that splits traffic 70/30:

version: '1'
providers:
  targets:
    - id: openai
      provider: openai
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: anthropic
      provider: anthropic
      secret_key_ref:
        env: ANTHROPIC_API_KEY
model_groups:
  - name: quality-ab-test
    strategy: weighted
    targets:
      - provider: openai
        model: gpt-4o
        weight: 70
      - provider: anthropic
        model: claude-sonnet-4-20250514
        weight: 30
policies:
  - name: basic-filter
    type: content_filter
    action: flag
    config:
      categories:
        - hate
      threshold: medium

Step 3: Validate and Start the Gateway

kt policy lint --file policy-config.yaml

Expected output:

✓ Configuration is valid
Providers: 2 (openai, anthropic)
Model groups: 1 (quality-ab-test)
Policies: 1 (basic-filter)

Start the gateway:

kt gateway run --policy-config policy-config.yaml --port 41002

Expected output:

INFO keeptrusts::gateway Starting gateway on 0.0.0.0:41002
INFO keeptrusts::gateway Loaded 2 provider(s), 1 model group(s), 1 policy(ies)
INFO keeptrusts::gateway Model group "quality-ab-test": gpt-4o (70%) / claude-sonnet-4-20250514 (30%)
INFO keeptrusts::gateway Gateway ready

Step 4: Send Test Requests

Send a batch of requests through the gateway. Pass the model group name in the model field — the gateway selects the underlying provider model according to the configured weights, so you never name a specific provider model yourself:

for i in $(seq 1 20); do
  curl -s http://localhost:41002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "quality-ab-test",
      "messages": [
        {"role": "user", "content": "Explain the principle of least privilege in two sentences."}
      ]
    }' | jq '{model: .model, tokens: .usage.total_tokens}'
  echo "---"
done

You should see roughly 14 of 20 requests routed to gpt-4o and 6 to claude-sonnet-4-20250514. With only 20 requests, expect some variance around the 70/30 target.

Step 5: Analyze Routing Distribution

Use kt events tail to verify the traffic split:

kt events tail --last 20 --format json | jq -r '.model' | sort | uniq -c | sort -rn

Expected output (approximately):

14 gpt-4o
6 claude-sonnet-4-20250514
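
Small samples will rarely land exactly on the configured split. A chi-square goodness-of-fit check — sketched by hand below, not a kt feature — tells you whether an observed drift is still consistent with the configured weights. The 16/4 run is a hypothetical example:

```python
def chi_square_fit(observed, weights):
    """Chi-square statistic comparing observed request counts against
    the split implied by the configured routing weights."""
    total = sum(observed.values())
    weight_sum = sum(weights.values())
    stat = 0.0
    for model, count in observed.items():
        expected = total * weights[model] / weight_sum
        stat += (count - expected) ** 2 / expected
    return stat

weights = {"gpt-4o": 70, "claude-sonnet-4-20250514": 30}
# Hypothetical run that drifted to 16/4:
stat = chi_square_fit({"gpt-4o": 16, "claude-sonnet-4-20250514": 4}, weights)
# With 1 degree of freedom, reject the 70/30 hypothesis only if the
# statistic exceeds 3.84 (the 5% critical value); 16/4 does not.
```

A 14/6 run scores exactly 0 (it matches the expected counts), and even 16/4 scores well under the 3.84 threshold, so neither would indicate a routing bug.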

Step 6: Compare Cost per Model

Aggregate cost data from events:

kt events tail --last 20 --format json | jq -s '
  group_by(.model) | map({
    model: .[0].model,
    requests: length,
    total_tokens: (map(.usage.total_tokens) | add),
    avg_latency_ms: (map(.latency_ms) | add / length | floor)
  })
'

Example output:

[
  {
    "model": "gpt-4o",
    "requests": 14,
    "total_tokens": 1820,
    "avg_latency_ms": 340
  },
  {
    "model": "claude-sonnet-4-20250514",
    "requests": 6,
    "total_tokens": 780,
    "avg_latency_ms": 290
  }
]
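
Token totals become dollar figures once multiplied by per-token prices. The prices below are placeholders, not current provider quotes — substitute the rates from your providers' pricing pages. Note the sketch blends input and output tokens, which real pricing bills at different rates:

```python
# Placeholder per-million-token prices -- NOT current provider quotes.
PRICE_PER_MTOK = {
    "gpt-4o": 5.00,
    "claude-sonnet-4-20250514": 6.00,
}

def cost_usd(model, total_tokens):
    """Blended cost estimate (real pricing splits input/output tokens)."""
    return total_tokens / 1_000_000 * PRICE_PER_MTOK[model]

# Using the token totals from the example output above:
for model, tokens in [("gpt-4o", 1820), ("claude-sonnet-4-20250514", 780)]:
    print(f"{model}: ~${cost_usd(model, tokens):.4f} over {tokens} tokens")
```

For a fair comparison, divide each model's cost by its request count — raw totals are skewed by the 70/30 traffic split.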

Step 7: Adjust Weights Based on Results

If Claude shows better latency or cost efficiency, shift more traffic to it. Update policy-config.yaml:

model_groups:
  - name: quality-ab-test
    strategy: weighted
    targets:
      - provider: openai
        model: gpt-4o
        weight: 50
      - provider: anthropic
        model: claude-sonnet-4-20250514
        weight: 50

Reload without downtime:

kt config reload

Expected output:

✓ Configuration reloaded
Model group "quality-ab-test": gpt-4o (50%) / claude-sonnet-4-20250514 (50%)

Step 8: Graduate the Winner

Once you have enough data, set the winning model to 100% weight:

model_groups:
  - name: quality-ab-test
    strategy: weighted
    targets:
      - provider: anthropic
        model: claude-sonnet-4-20250514
        weight: 100

Reload and confirm:

kt config reload
kt events tail --last 5

All new requests now route to the graduated model.

Summary

  • model_groups with strategy: weighted splits traffic by percentage
  • kt events tail provides real-time routing and cost data
  • kt config reload adjusts weights without restarting the gateway
  • Graduate the winning model by setting its weight to 100

For AI systems

  • Canonical terms: Keeptrusts gateway, model groups, traffic splitting, weighted routing, A/B test, model comparison.
  • Config fields: model_groups[].name, model_groups[].strategy: weighted, model_groups[].targets[].weight, model_groups[].targets[].provider, model_groups[].targets[].model.
  • CLI commands: kt gateway run, kt policy lint, kt events tail, kt events tail --format json | jq '.model'.
  • Best next pages: Traffic Mirroring, Multi-Provider Failover, Cost Tracking & Budgets.

For engineers

  • Prerequisites: kt CLI, API keys for both providers (e.g., OPENAI_API_KEY + ANTHROPIC_API_KEY), running Keeptrusts API, curl and jq.
  • Validate: kt policy lint confirms model group strategy and weights sum correctly.
  • Measure: kt events tail --format json | jq '{model, latency_ms, tokens: .usage.total_tokens}' shows per-request routing decisions.
  • Adjust weights: edit weight values and hot-reload — no restart needed.
  • Sufficient sample: run at least 100 requests per variant before drawing conclusions.
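
The 100-requests-per-variant rule of thumb above can be sanity-checked with a standard two-sample power calculation. A sketch, where the 120 ms standard deviation and 50 ms target difference are illustrative numbers, not measurements from this tutorial:

```python
import math

def sample_size_per_variant(sigma, delta, z_alpha=1.96, z_beta=0.84):
    """Requests per variant needed to detect a mean difference of
    `delta` at ~95% confidence and ~80% power, given a per-request
    standard deviation `sigma` (two-sample z-approximation)."""
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)

# e.g. latency std dev ~120 ms, detecting a 50 ms difference:
n = sample_size_per_variant(sigma=120, delta=50)  # ~91 per variant
```

With those assumptions the formula lands near 91 requests per variant, which is why ~100 is a reasonable floor; noisier metrics or smaller differences require proportionally more traffic.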

For leaders

  • A/B testing enables data-driven model selection instead of relying on benchmarks alone.
  • Compare cost per request, latency, and output quality side-by-side on your actual workloads.
  • Weighted routing lets you gradually shift traffic (e.g., 90/10 → 50/50 → 0/100) to de-risk migrations.
  • Decision events provide an auditable record of which model served each request.

Next steps