Traffic Mirroring & A/B Testing
Keeptrusts supports shadow traffic mirroring and A/B testing to safely evaluate new models without impacting production. Traffic mirroring lets you send a copy of live requests to a secondary provider in the background — with no impact on response latency for your users. A/B testing lets you split production traffic across two or more model variants with explicit weights, so you can run statistically valid model comparisons entirely within the gateway layer.
Use this page when
- You need the exact command, config, API, or integration details for Traffic Mirroring & A/B Testing.
- You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
- If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Traffic Mirroring
When mirroring is enabled, Keeptrusts forwards a copy of each sampled request to a secondary provider (the mirror target) after the primary provider has responded. The primary response is always returned to the client; the mirror call is fire-and-forget and its result is captured only in the event log.
Configuration Fields
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `false` | Enable traffic mirroring for this gateway or route. |
| `mirror_target` | string | — | Provider target ID to receive mirrored traffic. Must match a target declared under `providers.targets`. |
| `sample_rate` | float (0.0–1.0) | `1.0` | Fraction of requests to mirror: `0.1` mirrors 10% of traffic, `1.0` mirrors all. |
| `log_mirror_response` | bool | `true` | When true, the mirror provider's response body is captured in the event log for later analysis. |
| `timeout_ms` | integer | `5000` | Maximum time to wait for the mirror response before discarding it. Does not affect the primary response. |
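Assembled into config, the fields above form a `traffic_mirror` block like this sketch (the target ID and values are illustrative):

```yaml
traffic_mirror:
  enabled: true
  mirror_target: openai-shadow   # must match an id under providers.targets
  sample_rate: 0.1               # mirror 10% of requests
  log_mirror_response: true      # capture mirror response bodies in the event log
  timeout_ms: 5000               # discard mirror responses slower than 5s
```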
How It Works
Client Request
│
▼
┌─────────────┐ primary response
│ Keeptrusts │ ──────────────────────► Client Response
│ Gateway │
└──────┬──────┘
│ (sampled, parallel, fire-and-forget)
▼
┌─────────────────────┐
│ Mirror Target │ response captured in event log
│ (e.g. gpt-4o) │ (mirror: true)
└─────────────────────┘
Primary response latency is never affected by mirror target latency. If the mirror call exceeds `timeout_ms`, the mirror response is discarded and an event with `mirror_timeout: true` is emitted.
Use Cases
- Model validation before promoting: Run the challenger model as a mirror at 10% traffic for a week before routing any production requests to it.
- Compliance auditing: Mirror all requests to an auditing provider that applies stricter policy checks without affecting end-user responses.
- Cost profiling: Mirror 5% of requests to a premium model to estimate cost deltas before committing to a full switch.
- Regression detection: Mirror production traffic against a new model version and compare output quality scores offline.
YAML Example
The following configuration routes primary traffic to claude-3-5-sonnet and mirrors 10% of requests to gpt-4o for shadow evaluation.
pack:
  name: traffic-mirroring-ab-testing-providers-1
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: anthropic-primary
        provider: anthropic:claude-3-5-sonnet-20241022
        secret_key_ref:
          env: ANTHROPIC_API_KEY
      - id: openai-shadow
        provider: openai:chat:gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
  traffic_mirror:
    enabled: true
    mirror_target: openai-shadow
    sample_rate: 0.1
    log_mirror_response: true
    timeout_ms: 5000
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true
A/B Testing
A/B testing in Keeptrusts routes production traffic across two or more model variants according to explicit weights. Unlike mirroring, both variants receive real traffic and both return real responses to clients — the gateway selects one variant per request based on the configured split. Variant selection can be made sticky so that the same session or user always receives the same variant.
Configuration Fields
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `false` | Enable A/B testing for this gateway or route. |
| `variants` | list | — | Ordered list of `AbTestVariant` entries defining each variant. |
| `sticky_by` | string | `none` | Stickiness scope: `session`, `user`, or `none`. |
AbTestVariant Fields
| Field | Type | Description |
|---|---|---|
| `provider_id` | string | Target provider ID (must match a `providers.targets` entry). |
| `weight` | integer | Relative weight. Traffic share = weight / sum(all weights). |
| `label` | string | Optional human-readable label recorded in trace metadata (e.g. `control`, `challenger`). |
Stickiness Modes
| Mode | Behaviour |
|---|---|
| `none` | Variant is selected randomly on every request. Good for aggregate-level analysis. |
| `session` | All requests sharing the same session cookie or `X-Session-Id` header are routed to the same variant. |
| `user` | All requests sharing the same `X-User-Id` header are routed to the same variant. Useful for per-user experiment cohorts. |
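Combined, these fields form an `ab_test` block like the following sketch (a 90/10 split, sticky per user; the target IDs are illustrative):

```yaml
ab_test:
  enabled: true
  sticky_by: user                   # same X-User-Id always gets the same variant
  variants:
    - provider_id: openai-control   # must match an id under providers.targets
      weight: 90                    # 90 / (90 + 10) = 90% of traffic
      label: control
    - provider_id: openai-new
      weight: 10                    # 10% of traffic
      label: challenger
```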
When stickiness is enabled, Keeptrusts maintains a lightweight in-memory hash of session/user → variant assignments. The assignment is deterministic (hash-based), so it survives gateway restarts without requiring an external state store.
YAML Example
An 80/20 split between a production model (labelled control) and a challenger (labelled challenger), sticky by user.
pack:
  name: traffic-mirroring-ab-testing-providers-2
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: openai-control
        provider: openai:chat:gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: openai-challenger
        provider: openai:chat:gpt-4o-mini
        secret_key_ref:
          env: OPENAI_API_KEY
  ab_test:
    enabled: true
    sticky_by: user
    variants:
      - provider_id: openai-control
        weight: 80
        label: control
      - provider_id: openai-challenger
        weight: 20
        label: challenger
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true
Combining Mirror + A/B
You can layer mirroring on top of A/B testing. In the following pattern:
- A/B routes 80% of traffic to the production model and 20% to the challenger.
- All challenger traffic is mirrored to a dedicated logging-only endpoint that records full response payloads for offline analysis.
pack:
  name: traffic-mirroring-ab-testing-providers-3
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: prod-model
        provider: anthropic:claude-3-5-sonnet-20241022
        secret_key_ref:
          env: ANTHROPIC_API_KEY
      - id: challenger-model
        provider: anthropic:claude-3-opus-20240229
        secret_key_ref:
          env: ANTHROPIC_API_KEY
      - id: logging-endpoint
        provider: openai:chat:gpt-4o
        base_url: https://ingest.internal.example.com
        secret_key_ref:
          env: INTERNAL_LOG_KEY
  ab_test:
    enabled: true
    variants:
      - provider_id: prod-model
        weight: 80
        label: control
      - provider_id: challenger-model
        weight: 20
        label: challenger
  traffic_mirror:
    enabled: true
    mirror_target: logging-endpoint
    sample_rate: 1.0
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true
`sample_rate` applies on top of the A/B split. To mirror only challenger traffic, set `mirror_target: logging-endpoint` with `sample_rate: 1.0`. Because the mirror runs after variant selection, only the 20% of requests served by the challenger are forwarded to the logging endpoint.

Analyzing Results
Mirror Events
Every mirrored request produces an event in the Keeptrusts event log with the flag mirror: true. Mirror events include:
- `mirror_target`: the ID of the mirror provider.
- `mirror_latency_ms`: end-to-end mirror call duration.
- `mirror_response_body`: full response payload when `log_mirror_response: true`.
- `mirror_timeout`: `true` if the mirror call exceeded `timeout_ms`.
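As an illustrative sketch only (the exact event envelope is not specified on this page), a mirror event carrying the fields above might look like:

```json
{
  "mirror": true,
  "mirror_target": "openai-shadow",
  "mirror_latency_ms": 1840,
  "mirror_timeout": false,
  "mirror_response_body": "…full provider response, present when log_mirror_response is true…"
}
```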
To query mirror events in the console, filter by mirror: true in the Events view, or use the API:
curl -s "https://api.keeptrusts.com/v1/events?mirror=true&limit=100" \
-H "Authorization: Bearer $KEEPTRUSTS_API_TOKEN" | jq '.events[].mirror_latency_ms'
A/B Variant Metadata
The selected A/B variant is also propagated into OTLP trace metadata under the ab_variant key. The public Keeptrusts trace API has been removed, so inspect that metadata in VictoriaTraces or your own OTLP backend when you need span-level comparison.
Comparing Quality Scores Across Variants
If you have quality scoring enabled via policy rules, each event also carries a quality_score field. You can compare scores across A/B variants to measure model quality differences:
# Average quality score per variant
curl -s "https://api.keeptrusts.com/v1/events?limit=5000" \
  -H "Authorization: Bearer $KEEPTRUSTS_API_TOKEN" \
  | jq '.events | group_by(.ab_variant) | map({
      variant: .[0].ab_variant,
      avg_quality: (map(.quality_score // 0) | add / length),
      count: length
    })'
Promoting a Challenger Model
Once your challenger model has accumulated sufficient data and meets your quality and latency thresholds, you can graduate it to full production traffic in three steps:
Step 1 — Increase challenger weight
Shift the A/B split from 80/20 to 50/50 and monitor for a few hours:
ab_test:
  enabled: true
  variants:
    - provider_id: prod-model
      weight: 50
      label: control
    - provider_id: challenger-model
      weight: 50
      label: challenger
Step 2 — Route all traffic to challenger
Set the challenger to weight: 100 and the control to weight: 0, or simply remove the control variant:
ab_test:
  enabled: true
  variants:
    - provider_id: challenger-model
      weight: 100
      label: production
Step 3 — Disable A/B and set default target
Once stable, disable A/B testing entirely and promote the challenger to providers.default_target:
ab_test:
  enabled: false
providers:
  default_target: challenger-model
Optionally keep the old model configured as a named target so it can be re-enabled quickly for rollback.
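With the old target retained, a rollback amounts to flipping `default_target` back (IDs follow the promotion example above):

```yaml
providers:
  default_target: prod-model   # challenger-model stays declared for quick re-promotion
```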
Best Practices
- Start mirrors at low sample rates. Begin at `sample_rate: 0.05` or lower to limit cost impact before you know the mirror model's behavior.
- Always set `timeout_ms` on mirrors. Without a timeout, a slow mirror provider can accumulate open connections under high traffic and consume file descriptors. A value between 5 and 10 seconds is safe for most cloud LLMs.
- Use `sticky_by: user` for user-facing A/B experiments. Per-request random selection (mode `none`) can cause the same user to see different model behaviors within the same conversation, which degrades user experience and makes manual QA harder.
- Keep variant labels short and consistent. Labels appear in trace metadata and events. Using `control`/`challenger` (rather than model-name strings) makes it straightforward to re-run your analysis scripts when you rotate models.
- Run experiments for statistical significance. A 20% challenger split on 100 requests per day gives a very wide confidence interval. Aim for at least 500–1000 requests per variant before drawing quality conclusions.
- Combine with circuit breakers. Each A/B variant should have its own circuit breaker configuration in `providers.targets`. If a challenger model starts failing, the circuit breaker will open and the gateway will fall back to the control without disrupting the experiment framework.
Route-Level Overrides
Traffic mirroring and A/B testing can be scoped to specific routes rather than applied globally. This lets you run experiments on one endpoint (e.g., /v1/chat/completions) while keeping other routes (e.g., /v1/embeddings) deterministic and unmirrored.
pack:
  name: traffic-mirroring-ab-testing-routes-7
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: openai-control
        provider: openai
        model: gpt-4o-mini
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: openai-new
        provider: openai
        model: gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
      - id: embeddings-provider
        provider: openai
        model: text-embedding-3-small
        secret_key_ref:
          env: OPENAI_API_KEY
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true
  routes:
    - path: "/v1/chat/completions"
      ab_test:
        enabled: true
        sticky_by: user
        variants:
          - provider_id: openai-control
            weight: 90
            label: control
          - provider_id: openai-new
            weight: 10
            label: challenger
    - path: "/v1/embeddings"
      ab_test:
        enabled: false
      providers:
        default_target: embeddings-provider
Route-level settings take precedence over top-level settings. A null value inherits the top-level setting; an explicit enabled: false disables it regardless of the top-level setting.
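Mirroring can be scoped per route in the same way; a sketch of a `routes[].traffic_mirror` override, with illustrative target IDs:

```yaml
routes:
  - path: "/v1/chat/completions"
    traffic_mirror:
      enabled: true
      mirror_target: openai-shadow   # must match an id under providers.targets
      sample_rate: 0.05
  - path: "/v1/embeddings"
    traffic_mirror:
      enabled: false                 # explicit false overrides any top-level mirror
```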
Metrics and Alerting
Keeptrusts emits Prometheus-compatible metrics for both mirroring and A/B testing surfaces:
| Metric | Labels | Description |
|---|---|---|
| `kt_mirror_requests_total` | `mirror_target`, `status` | Total mirrored requests, labelled by status (`success`, `timeout`, `error`). |
| `kt_mirror_latency_ms` | `mirror_target` | Histogram of mirror call durations. |
| `kt_ab_variant_requests_total` | `variant_label`, `provider_id` | Total requests routed to each A/B variant. |
| `kt_ab_variant_latency_ms` | `variant_label` | Histogram of response latency per variant. |
Access the metrics endpoint at `http://<gateway-host>:9090/metrics` when Prometheus integration is enabled in your config:
observability:
  prometheus:
    enabled: true
    port: 9090
    path: /metrics
You can build a Grafana alert on `kt_mirror_requests_total{status="timeout"}` that fires when the mirror provider starts timing out at a rate that suggests degradation, so you catch problems before formally evaluating experiment results.
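If you manage alerts as Prometheus alerting rules rather than in Grafana, an equivalent rule might look like this sketch (the rule name and threshold are illustrative):

```yaml
groups:
  - name: keeptrusts-experiments
    rules:
      - alert: MirrorTargetTimingOut
        # fire when mirror timeouts exceed ~0.1/s over 5 minutes, sustained for 10 minutes
        expr: rate(kt_mirror_requests_total{status="timeout"}[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Mirror target {{ $labels.mirror_target }} is timing out"
```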
Experiment Lifecycle Reference
| Phase | Action | Config Change |
|---|---|---|
| Design | Select challenger model and success metrics | None — planning only |
| Shadow | Mirror 5–10% of traffic, no user impact | traffic_mirror.enabled: true, sample_rate: 0.05 |
| Canary | Route 10–20% to challenger with A/B | ab_test.enabled: true, weights 80/20 |
| Ramp | Gradually increase challenger share | Adjust weight values in increments |
| Promote | Route all traffic to challenger | Set challenger weight: 100 or update default_target |
| Cleanup | Remove control target from config | Remove old target entry, disable A/B |
For AI systems
- Canonical terms: Keeptrusts Traffic Mirroring, A/B Testing, shadow traffic, mirror target, A/B variant, sticky routing, experiment lifecycle.
- Config keys: `traffic_mirror.enabled`, `traffic_mirror.mirror_target`, `traffic_mirror.sample_rate`, `traffic_mirror.log_mirror_response`, `traffic_mirror.timeout_ms`, `ab_test.enabled`, `ab_test.sticky_by` (`none` | `session` | `user`), `ab_test.variants[].provider_id`, `ab_test.variants[].weight`, `ab_test.variants[].label`.
- Event fields: `mirror: true`, `mirror_target`, `mirror_latency_ms`, `mirror_timeout`, `ab_variant`.
- Metrics: `kt_mirror_requests_total`, `kt_mirror_latency_ms`, `kt_ab_variant_requests_total`, `kt_ab_variant_latency_ms`.
- Route-level override: `routes[].ab_test` and `routes[].traffic_mirror` override top-level settings for that path.
- Best next pages: Provider Routing, Model Groups, Custom Routes.
For engineers
- Prerequisites: at least two provider targets (primary and mirror/challenger); Prometheus enabled for metrics.
- Start mirrors at `sample_rate: 0.05` to limit cost before validating mirror model behavior.
- Always set `timeout_ms` on mirrors (5000–10000 ms) to prevent file descriptor exhaustion under high traffic.
- Use `sticky_by: user` for A/B experiments on user-facing endpoints to avoid inconsistent model behavior within a conversation.
- Promote a challenger: shift weights gradually (80/20 → 50/50 → 100/0), monitoring `kt_ab_variant_latency_ms` at each step.
- Query mirror results: `GET /v1/events?mirror=true` returns mirror latency and response quality for offline comparison.
- Combine with circuit breakers: each A/B variant should have its own `circuit_breaker` config to prevent a failing challenger from disrupting the experiment.
For leaders
- Risk-free model evaluation: traffic mirroring tests new models on real production queries with zero user impact.
- Data-driven decisions: A/B testing provides statistically valid quality and latency comparisons before committing to a model switch.
- Cost visibility: mirror events capture `cost_usd` per mirrored request, enabling precise cost projections before full migration.
- Gradual rollout: the experiment lifecycle (shadow → canary → ramp → promote) minimizes blast radius during model transitions.
Next steps
- Provider Routing — routing strategies that determine primary model selection
- Model Groups — define model pools for A/B variant targets
- Custom Routes — scope experiments to specific API paths
- Circuit Breakers & Retry — protect experiments from challenger model failures