A/B Test AI Models Safely with Traffic Mirroring
Switching AI models is risky. A cheaper model might degrade quality. A newer model might introduce unexpected behavior. Keeptrusts lets you mirror traffic to candidate models, compare quality scores side-by-side, and roll out changes gradually — without exposing users to untested models.
Use this page when
- You need to compare a candidate model against your current production model without risking user-facing quality.
- You are planning a model migration (e.g., GPT-4o to GPT-4o-mini) and want data-driven quality and cost comparisons.
- You want to gradually shift traffic to a new provider using canary percentages before full cutover.
Primary audience
- Primary: Technical Leaders
- Secondary: Technical Engineers, AI Agents
What you'll achieve
- Traffic mirroring — send a copy of production traffic to a candidate model without affecting users
- Model groups — define pools of models with different routing strategies
- Canary deployments — shift a small percentage of traffic to a new model
- Quality comparison — automated scoring across both models using the same requests
- Safe rollback — instantly revert to the previous model if quality drops
Traffic mirroring: test without risk
Traffic mirroring sends a copy of each request to a shadow model. The user always gets the response from the primary model. The shadow response is scored and logged but never delivered.
pack:
  name: ab-testing-ai-models-providers-1
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: primary-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: candidate-gpt4o-mini
      provider: openai
      model: gpt-4o-mini
      secret_key_ref:
        env: OPENAI_API_KEY
  mirroring:
    enabled: true
    shadow_targets:
      - candidate-gpt4o-mini
    sample_rate: 1.0   # mirror every request; lower this to sample a fraction
    log_shadow_responses: true

policies:
  chain:
    # add quality-scorer to the chain (configured later on this page) to score both responses
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
How it works:
- Request arrives at the gateway
- Gateway forwards to `primary-gpt4o`, and the user gets this response
- Gateway simultaneously forwards a copy to `candidate-gpt4o-mini`
- Both responses are scored by the quality scorer
- Shadow response is logged but discarded
- You compare quality scores in the Events page
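To confirm the mirror is live, list recent events from the CLI; a mirrored request should produce two scored entries, one per target. `kt events list` is the command this page's CLI index references:
# List recent gateway events; look for paired primary and shadow entries
kt events list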
Model groups: organize candidates
Define model groups to organize providers by capability tier and test routing strategies:
providers:
  model_groups:
    - name: production
      models:
        - provider: openai
          model: gpt-4o
        - provider: anthropic
          model: claude-sonnet-4-20250514
      routing: cost_optimized
    - name: candidate
      models:
        - provider: openai
          model: gpt-4o-mini
        - provider: anthropic
          model: claude-haiku
      routing: lowest_latency
    - name: premium
      models:
        - provider: openai
          model: gpt-4o
        - provider: anthropic
          model: claude-opus-4-20250514
      routing: highest_quality
Applications target a group name rather than a specific model. You can swap the models within a group without application changes.
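For illustration, a client call that targets the candidate group could look like the sketch below. The local URL, the OpenAI-style endpoint, and passing the group name in the model field are all assumptions about your deployment, not documented behavior:
# Hypothetical request: "model" names a model group, not a specific model,
# so models inside the group can be swapped without touching this code
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "candidate", "messages": [{"role": "user", "content": "Hello"}]}'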
Canary deployments: gradual rollout
Once mirroring confirms a candidate model meets quality standards, shift a small percentage of live traffic to it:
pack:
  name: ab-testing-ai-models-providers-3
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: current-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: candidate-gpt4o-mini
      provider: openai
      model: gpt-4o-mini
      secret_key_ref:
        env: OPENAI_API_KEY

# Canary routing: traffic_percentage is the key this page's config index lists;
# see the Declarative Config Reference for the full schema
canary:
  traffic_percentage: 10

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Rollout progression:
- Start at 10% — monitor quality scores for 24 hours
- Increase to 25% — watch for edge cases
- Increase to 50% — validate at scale
- Promote to 100% — or rollback if quality degrades
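Each step is a one-line config change. The sketch below uses the `canary.traffic_percentage` key from the config index on this page; verify the exact nesting in the Declarative Config Reference:
# Step 2 of the progression: raise the candidate's share of live traffic to 25%
canary:
  traffic_percentage: 25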
Automated quality gate
Combine canary routing with quality scoring to automate rollback decisions:
policies:
  chain:
    - quality-scorer
    - audit-logger
  policy:
    quality-scorer:
      dimensions:
        relevance:
          weight: 0.5
          min_score: 0.7
        coherence:
          weight: 0.3
          min_score: 0.6
        completeness:
          weight: 0.2
          min_score: 0.5
      overall_min_score: 0.65
      on_fail: escalate
If the candidate model's quality scores consistently fall below these thresholds, escalations alert your team so you can pause the rollout before promoting further.
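While a stage is baking, you can stream events live and watch for quality-scorer escalations before raising the percentage:
# Stream gateway events in real time during the canary window
kt events tail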
Quality comparison
Export quality data to compare models side by side:
# Export quality events for both models
kt events export \
--from "2025-04-01" \
--to "2025-04-07" \
--filter "quality_scorer" \
--format csv \
--output model-comparison.csv
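For a quick look at the export without a spreadsheet, you can aggregate the CSV on the command line. The column positions below are assumptions; check the header of your export first:
# Average overall quality score per provider target (assumes column 2 holds
# the target ID and column 5 the overall score; adjust to your CSV header)
awk -F',' 'NR > 1 { sum[$2] += $5; n[$2]++ }
     END { for (t in n) printf "%s: %.3f\n", t, sum[t] / n[t] }' model-comparison.csv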
Key metrics to compare:
| Metric | What to look for |
|---|---|
| Average relevance score | Is the candidate as relevant as the primary? |
| Average coherence score | Does the candidate produce coherent outputs? |
| Block/escalation rate | Does the candidate trigger more quality failures? |
| Cost per request | How much cheaper is the candidate? |
| Latency (p50, p95) | Is the candidate faster or slower? |
Example: full A/B testing config
pack:
  name: model-ab-test
  version: '1.0'

providers:
  targets:
    - id: primary-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: candidate-gpt4o-mini
      provider: openai
      model: gpt-4o-mini
      secret_key_ref:
        env: OPENAI_API_KEY
  mirroring:
    enabled: true
    shadow_targets:
      - candidate-gpt4o-mini
    sample_rate: 0.5   # mirror half of production requests to the shadow target
    log_shadow_responses: true

policies:
  chain:
    - quality-scorer
    - audit-logger
  policy:
    quality-scorer:
      dimensions:
        relevance:
          weight: 0.4
          min_score: 0.7
        coherence:
          weight: 0.3
          min_score: 0.6
        completeness:
          weight: 0.3
          min_score: 0.5
      overall_min_score: 0.65
      on_fail: escalate
    audit-logger:
      retention_days: 90
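To put the example into service, save it as `policy-config.yaml` (the file the engineering checklist below references) and start the gateway:
# Start the gateway; shadow responses and quality scores then appear in Events
kt gateway run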
Quick wins
- Enable mirroring with one shadow target — start collecting quality data without user impact
- Compare quality scores after 24 hours — see if the candidate model is viable
- Start a 10% canary for a confirmed candidate — begin the gradual rollout
- Export comparison data — build a business case for switching models
For AI systems
- Canonical terms: traffic mirroring, shadow model, model groups, canary deployment, quality scorer, provider routing.
- Config keys: `providers.mirroring`, `providers.model_groups`, `provider_routing.strategy`, `canary.traffic_percentage`, `quality-scorer`.
- CLI commands: `kt gateway run`, `kt events list`, `kt events tail`.
- Best next pages: Declarative Config Reference, Quality Assurance, Reduce AI Spend.
For engineers
- Prerequisites: a running gateway with at least two configured provider targets.
- Add `providers.mirroring` with `shadow_targets` and `sample_rate` to your `policy-config.yaml`.
- Validate: run `kt gateway run` and confirm shadow responses appear in Events with `role: shadow` metadata.
- Compare quality scores between primary and shadow by filtering Events on provider target ID.
For leaders
- A/B testing reduces model-switch risk by providing quality and cost data before full cutover.
- Shadow traffic doubles upstream API calls during the test period — budget for the added spend.
- Canary deployments let you cap blast radius to a defined percentage of production traffic.
- Use quality comparison data to justify model changes to compliance and finance stakeholders.
Next steps
- Reduce AI Spend by 40% — use A/B testing to validate cost optimization
- Quality Assurance for AI Outputs — deep dive into quality scoring
- Centralize AI Observability — track model performance across providers
- Gateways & Actions — understand routing strategies
- Declarative Config Reference — full provider routing configuration