A/B Test AI Models Safely with Traffic Mirroring
Switching AI models is risky. A cheaper model might degrade quality. A newer model might introduce unexpected behavior. Keeptrusts lets you mirror traffic to candidate models, compare quality scores side-by-side, and roll out changes gradually — without exposing users to untested models.
Use this page when
- You need to compare a candidate model against your current production model without risking user-facing quality.
- You are planning a model migration (e.g., GPT-4o to GPT-4o-mini) and want data-driven quality and cost comparisons.
- You want to gradually shift traffic to a new provider using canary percentages before full cutover.
Primary audience
- Primary: Technical Leaders
- Secondary: Technical Engineers, AI Agents
What you'll achieve
- Traffic mirroring — send a copy of production traffic to a candidate model without affecting users
- Model groups — define pools of models with different routing strategies
- Canary deployments — shift a small percentage of traffic to a new model
- Quality comparison — automated scoring across both models using the same requests
- Safe rollback — instantly revert to the previous model if quality drops
Traffic mirroring: test without risk
Traffic mirroring sends a copy of each request to a shadow model. The user always gets the response from the primary model. The shadow response is scored and logged but never delivered.
pack:
  name: ab-testing-ai-models-providers-1
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: primary-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: candidate-gpt4o-mini
      provider: openai
      model: gpt-4o-mini
      secret_key_ref:
        env: OPENAI_API_KEY
  mirroring:
    enabled: true
    shadow_targets:
      - candidate-gpt4o-mini
    sample_rate: 1.0   # mirror every request; lower this to sample a fraction
    log_shadow_responses: true

policies:
  chain:
    # add quality-scorer to the chain (configured later on this page) to score both responses
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
How it works:
- Request arrives at the gateway
- Gateway forwards to `primary-gpt4o`, and the user gets this response
- Gateway simultaneously forwards a copy to `candidate-gpt4o-mini`
- Both responses are scored by the quality scorer
- Shadow response is logged but discarded
- You compare quality scores in the Events page
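To confirm the mirror is live, list recent events from the CLI; a mirrored request should produce two scored entries, one per target. `kt events list` is the command this page's CLI index references:
# List recent gateway events; look for paired primary and shadow entries
kt events list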
Model groups: organize candidates
Define model groups to organize providers by capability tier and test routing strategies:
providers:
  model_groups:
    - name: production
      models:
        - provider: openai
          model: gpt-4o
        - provider: anthropic
          model: claude-sonnet-4-20250514
      routing: cost_optimized
    - name: candidate
      models:
        - provider: openai
          model: gpt-4o-mini
        - provider: anthropic
          model: claude-haiku
      routing: lowest_latency
    - name: premium
      models:
        - provider: openai
          model: gpt-4o
        - provider: anthropic
          model: claude-opus-4-20250514
      routing: highest_quality
Applications target a group name rather than a specific model. You can swap the models within a group without application changes.
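For illustration, a client call that targets the candidate group could look like the sketch below. The local URL, the OpenAI-style endpoint, and passing the group name in the model field are all assumptions about your deployment, not documented behavior:
# Hypothetical request: "model" names a model group, not a specific model,
# so models inside the group can be swapped without touching this code
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "candidate", "messages": [{"role": "user", "content": "Hello"}]}'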
Canary deployments: gradual rollout
Once mirroring confirms a candidate model meets quality standards, shift a small percentage of live traffic to it:
pack:
  name: ab-testing-ai-models-providers-3
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: current-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: candidate-gpt4o-mini
      provider: openai
      model: gpt-4o-mini
      secret_key_ref:
        env: OPENAI_API_KEY

# Canary routing: traffic_percentage is the key this page's config index lists;
# see the Declarative Config Reference for the full schema
canary:
  traffic_percentage: 10

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
Rollout progression:
- Start at 10% — monitor quality scores for 24 hours
- Increase to 25% — watch for edge cases
- Increase to 50% — validate at scale
- Promote to 100% — or rollback if quality degrades
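Each step is a one-line config change. The sketch below uses the `canary.traffic_percentage` key from the config index on this page; verify the exact nesting in the Declarative Config Reference:
# Step 2 of the progression: raise the candidate's share of live traffic to 25%
canary:
  traffic_percentage: 25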
Automated quality gate
Combine canary routing with quality scoring to automate rollback decisions:
policies:
  chain:
    - quality-scorer
    - audit-logger
  policy:
    quality-scorer:
      dimensions:
        relevance:
          weight: 0.5
          min_score: 0.7
        coherence:
          weight: 0.3
          min_score: 0.6
        completeness:
          weight: 0.2
          min_score: 0.5
      overall_min_score: 0.65
      on_fail: escalate
If the candidate model's quality scores consistently fall below these thresholds, escalations alert your team so you can pause the rollout before promoting further.
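While a stage is baking, you can stream events live and watch for quality-scorer escalations before raising the percentage:
# Stream gateway events in real time during the canary window
kt events tail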
Quality comparison
Export quality data to compare models side by side:
# Export quality events for both models
kt events export \
--from "2025-04-01" \
--to "2025-04-07" \
--filter "quality_scorer" \
--format csv \
--output model-comparison.csv
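For a quick look at the export without a spreadsheet, you can aggregate the CSV on the command line. The column positions below are assumptions; check the header of your export first:
# Average overall quality score per provider target (assumes column 2 holds
# the target ID and column 5 the overall score; adjust to your CSV header)
awk -F',' 'NR > 1 { sum[$2] += $5; n[$2]++ }
     END { for (t in n) printf "%s: %.3f\n", t, sum[t] / n[t] }' model-comparison.csv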
Key metrics to compare:
| Metric | What to look for |
|---|---|
| Average relevance score | Is the candidate as relevant as the primary? |
| Average coherence score | Does the candidate produce coherent outputs? |
| Block/escalation rate | Does the candidate trigger more quality failures? |
| Cost per request | How much cheaper is the candidate? |
| Latency (p50, p95) | Is the candidate faster or slower? |
Example: full A/B testing config
pack:
  name: model-ab-test
  version: '1.0'

providers:
  targets:
    - id: primary-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: candidate-gpt4o-mini
      provider: openai
      model: gpt-4o-mini
      secret_key_ref:
        env: OPENAI_API_KEY
  mirroring:
    enabled: true
    shadow_targets:
      - candidate-gpt4o-mini
    sample_rate: 0.5   # mirror half of production requests to the shadow target
    log_shadow_responses: true

policies:
  chain:
    - quality-scorer
    - audit-logger
  policy:
    quality-scorer:
      dimensions:
        relevance:
          weight: 0.4
          min_score: 0.7
        coherence:
          weight: 0.3
          min_score: 0.6
        completeness:
          weight: 0.3
          min_score: 0.5
      overall_min_score: 0.65
      on_fail: escalate
    audit-logger:
      retention_days: 90
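To put the example into service, save it as `policy-config.yaml` (the file the engineering checklist below references) and start the gateway:
# Start the gateway; shadow responses and quality scores then appear in Events
kt gateway run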
Quick wins
- Enable mirroring with one shadow target — start collecting quality data without user impact
- Compare quality scores after 24 hours — see if the candidate model is viable
- Start a 10% canary for a confirmed candidate — begin the gradual rollout
- Export comparison data — build a business case for switching models
For AI systems
- Canonical terms: traffic mirroring, shadow model, model groups, canary deployment, quality scorer, provider routing.
- Config keys: `providers.mirroring`, `providers.model_groups`, `provider_routing.strategy`, `canary.traffic_percentage`, `quality-scorer`.
- CLI commands: `kt gateway run`, `kt events list`, `kt events tail`.
- Best next pages: Declarative Config Reference, Quality Assurance, Reduce AI Spend.
For engineers
- Prerequisites: a running gateway with at least two configured provider targets.
- Add `providers.mirroring` with `shadow_targets` and `sample_rate` to your `policy-config.yaml`.
- Validate: run `kt gateway run` and confirm shadow responses appear in Events with `role: shadow` metadata.
- Compare quality scores between primary and shadow by filtering Events on provider target ID.
For leaders
- A/B testing reduces model-switch risk by providing quality and cost data before full cutover.
- Shadow traffic doubles upstream API calls during the test period — budget for the added spend.
- Canary deployments let you cap blast radius to a defined percentage of production traffic.
- Use quality comparison data to justify model changes to compliance and finance stakeholders.
Next steps
- Reduce AI Spend by 40% — use A/B testing to validate cost optimization
- Quality Assurance for AI Outputs — deep dive into quality scoring
- Centralize AI Observability — track model performance across providers
- Gateways & Actions — understand routing strategies
- Declarative Config Reference — full provider routing configuration