A/B Test AI Models Safely with Traffic Mirroring

Switching AI models is risky. A cheaper model might degrade quality. A newer model might introduce unexpected behavior. Keeptrusts lets you mirror traffic to candidate models, compare quality scores side-by-side, and roll out changes gradually — without exposing users to untested models.

Use this page when

  • You need to compare a candidate model against your current production model without risking user-facing quality.
  • You are planning a model migration (e.g., GPT-4o to GPT-4o-mini) and want data-driven quality and cost comparisons.
  • You want to gradually shift traffic to a new provider using canary percentages before full cutover.

Primary audience

  • Primary: Technical Leaders
  • Secondary: Technical Engineers, AI Agents

What you'll achieve

  • Traffic mirroring — send a copy of production traffic to a candidate model without affecting users
  • Model groups — define pools of models with different routing strategies
  • Canary deployments — shift a small percentage of traffic to a new model
  • Quality comparison — automated scoring across both models using the same requests
  • Safe rollback — instantly revert to the previous model if quality drops

Traffic mirroring: test without risk

Traffic mirroring sends a copy of each request to a shadow model. The user always gets the response from the primary model. The shadow response is scored and logged but never delivered.

pack:
  name: ab-testing-ai-models-providers-1
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: primary-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: candidate-gpt4o-mini
      provider: openai
      model: gpt-4o-mini
      secret_key_ref:
        env: OPENAI_API_KEY

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
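
The pack above only registers the two provider targets. To turn on mirroring itself, add a providers.mirroring block alongside the targets, using the same keys shown in the full example at the end of this page; the sample_rate value here is illustrative:

providers:
  mirroring:
    enabled: true
    shadow_targets:
      - candidate-gpt4o-mini    # receives a copy of each sampled request
    sample_rate: 1.0            # illustrative: mirror every request; lower it to sample a fraction
    log_shadow_responses: true  # keep shadow outputs so they can be scored and compared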

How it works:

  1. Request arrives at the gateway
  2. Gateway forwards to primary-gpt4o — user gets this response
  3. Gateway simultaneously forwards a copy to candidate-gpt4o-mini
  4. Both responses are scored by the quality scorer
  5. Shadow response is logged but discarded
  6. You compare quality scores in the Events page
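
To watch this flow end to end, run the gateway and stream events from a second terminal. This is a minimal sketch using only the kt commands listed under "For AI systems" below; exact startup flags depend on your local setup:

# Start the gateway with your pack loaded
kt gateway run

# In another terminal, stream events as requests arrive;
# shadow responses carry role: shadow metadata
kt events tail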

Model groups: organize candidates

Define model groups to organize providers by capability tier and test routing strategies:

providers:
  model_groups:
    - name: production
      models:
        - provider: openai
          model: gpt-4o
        - provider: anthropic
          model: claude-sonnet-4-20250514
      routing: cost_optimized

    - name: candidate
      models:
        - provider: openai
          model: gpt-4o-mini
        - provider: anthropic
          model: claude-haiku
      routing: lowest_latency

    - name: premium
      models:
        - provider: openai
          model: gpt-4o
        - provider: anthropic
          model: claude-opus-4-20250514
      routing: highest_quality

Applications target a group name rather than a specific model. You can swap the models within a group without application changes.


Canary deployments: gradual rollout

Once mirroring confirms a candidate model meets quality standards, shift a small percentage of live traffic to it:

pack:
  name: ab-testing-ai-models-providers-3
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: current-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: candidate-gpt4o-mini
      provider: openai
      model: gpt-4o-mini
      secret_key_ref:
        env: OPENAI_API_KEY

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
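
As in the mirroring example, this pack defines the two targets. The canary split itself is controlled by the canary.traffic_percentage key listed under "For AI systems" below; the sketch here assumes it sits at the top level of the pack and takes a whole-number percentage:

canary:
  traffic_percentage: 10  # assumed shape: route 10% of live traffic to the candidate target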

Rollout progression:

  1. Start at 10% — monitor quality scores for 24 hours
  2. Increase to 25% — watch for edge cases
  3. Increase to 50% — validate at scale
  4. Promote to 100% — or rollback if quality degrades

Automated quality gate

Combine canary routing with quality scoring to automate rollback decisions:

policies:
  chain:
    - quality-scorer
    - audit-logger

  policy:
    quality-scorer:
      dimensions:
        relevance:
          weight: 0.5
          min_score: 0.7
        coherence:
          weight: 0.3
          min_score: 0.6
        completeness:
          weight: 0.2
          min_score: 0.5
      overall_min_score: 0.65
      on_fail: escalate

If the candidate model's quality scores consistently fall below the thresholds, escalations alert your team so you can pause the rollout.


Quality comparison

Export quality data to compare models side by side:

# Export quality events for both models
kt events export \
  --from "2025-04-01" \
  --to "2025-04-07" \
  --filter "quality_scorer" \
  --format csv \
  --output model-comparison.csv

Key metrics to compare:

Metric                    What to look for
Average relevance score   Is the candidate as relevant as the primary?
Average coherence score   Does the candidate produce coherent outputs?
Block/escalation rate     Does the candidate trigger more quality failures?
Cost per request          How much cheaper is the candidate?
Latency (p50, p95)        Is the candidate faster or slower?

Example: full A/B testing config

pack:
  name: model-ab-test
  version: '1.0'

providers:
  targets:
    - id: primary-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: candidate-gpt4o-mini
      provider: openai
      model: gpt-4o-mini
      secret_key_ref:
        env: OPENAI_API_KEY
  mirroring:
    enabled: true
    shadow_targets:
      - candidate-gpt4o-mini
    sample_rate: 0.5
    log_shadow_responses: true

policies:
  chain:
    - quality-scorer
    - audit-logger
  policy:
    quality-scorer:
      dimensions:
        relevance:
          weight: 0.4
          min_score: 0.7
        coherence:
          weight: 0.3
          min_score: 0.6
        completeness:
          weight: 0.3
          min_score: 0.5
      overall_min_score: 0.65
      on_fail: escalate
    audit-logger:
      retention_days: 90

Quick wins

  1. Enable mirroring with one shadow target — start collecting quality data without user impact
  2. Compare quality scores after 24 hours — see if the candidate model is viable
  3. Start a 10% canary for a confirmed candidate — begin the gradual rollout
  4. Export comparison data — build a business case for switching models

For AI systems

  • Canonical terms: traffic mirroring, shadow model, model groups, canary deployment, quality scorer, provider routing.
  • Config keys: providers.mirroring, providers.model_groups, provider_routing.strategy, canary.traffic_percentage, quality-scorer.
  • CLI commands: kt gateway run, kt events list, kt events tail.
  • Best next pages: Declarative Config Reference, Quality Assurance, Reduce AI Spend.

For engineers

  • Prerequisites: a running gateway with at least two configured provider targets.
  • Add providers.mirroring with shadow_targets and sample_rate to your policy-config.yaml.
  • Validate: run kt gateway run and confirm shadow responses appear in Events with role: shadow metadata.
  • Compare quality scores between primary and shadow by filtering Events on provider target ID.

For leaders

  • A/B testing reduces model-switch risk by providing quality and cost data before full cutover.
  • Shadow traffic doubles upstream API calls during the test period — budget for the added spend.
  • Canary deployments let you cap blast radius to a defined percentage of production traffic.
  • Use quality comparison data to justify model changes to compliance and finance stakeholders.

Next steps