Traffic Mirroring & A/B Testing

Keeptrusts supports shadow traffic mirroring and A/B testing to safely evaluate new models without impacting production. Traffic mirroring lets you send a copy of live requests to a secondary provider in the background — with no impact on response latency for your users. A/B testing lets you split production traffic across two or more model variants with explicit weights, so you can run statistically valid model comparisons entirely within the gateway layer.

Use this page when

  • You need the exact command, config, API, or integration details for Traffic Mirroring & A/B Testing.
  • You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
  • If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Traffic Mirroring

When mirroring is enabled, Keeptrusts forwards a copy of each sampled request to a secondary provider (the mirror target) after the primary provider has responded. The primary response is always returned to the client; the mirror call is fire-and-forget and its result is captured only in the event log.

Configuration Fields

Field | Type | Default | Description
enabled | bool | false | Enable traffic mirroring for this gateway or route.
mirror_target | string | (required) | Provider target ID to receive mirrored traffic. Must match a target declared under providers.targets.
sample_rate | float (0.0–1.0) | 1.0 | Fraction of requests to mirror. 0.1 mirrors 10% of traffic; 1.0 mirrors all.
log_mirror_response | bool | true | When true, the mirror provider's response body is captured in the event log for later analysis.
timeout_ms | integer | 5000 | Maximum time to wait for the mirror response before discarding it. Does not affect the primary response.
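
A minimal traffic_mirror block assembling the fields above (the target ID is illustrative and must match a providers.targets entry):

traffic_mirror:
  enabled: true
  mirror_target: openai-shadow    # illustrative target ID
  sample_rate: 0.05               # mirror 5% of requests
  log_mirror_response: true
  timeout_ms: 5000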

How It Works

Client Request
      │
      ▼
┌─────────────┐    primary response
│  Keeptrusts │ ─────────────────────►  Client Response
│   Gateway   │
└──────┬──────┘
       │  (sampled, parallel, fire-and-forget)
       ▼
┌─────────────────────┐
│    Mirror Target    │   response captured in event log
│    (e.g. gpt-4o)    │   (mirror: true)
└─────────────────────┘

Primary response latency is never affected by mirror-target latency. If the mirror call exceeds timeout_ms, the result is discarded and an event with mirror_timeout: true is emitted.

Use Cases

  • Model validation before promoting: Run the challenger model as a mirror at 10% traffic for a week before routing any production requests to it.
  • Compliance auditing: Mirror all requests to an auditing provider that applies stricter policy checks without affecting end-user responses.
  • Cost profiling: Mirror 5% of requests to a premium model to estimate cost deltas before committing to a full switch.
  • Regression detection: Mirror production traffic against a new model version and compare output quality scores offline.

YAML Example

The following configuration routes primary traffic to claude-3-5-sonnet and mirrors 10% of requests to gpt-4o for shadow evaluation.

pack:
  name: traffic-mirroring-ab-testing-providers-1
  version: 1.0.0
  enabled: true
providers:
  default_target: anthropic-primary    # primary traffic goes to Claude
  targets:
    - id: anthropic-primary
      provider: anthropic:claude-3-5-sonnet-20241022
      secret_key_ref:
        env: ANTHROPIC_API_KEY
    - id: openai-shadow
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
traffic_mirror:
  enabled: true
  mirror_target: openai-shadow    # shadow copies go to gpt-4o
  sample_rate: 0.1                # mirror 10% of requests
  log_mirror_response: true
  timeout_ms: 5000
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

A/B Testing

A/B testing in Keeptrusts routes production traffic across two or more model variants according to explicit weights. Unlike mirroring, both variants receive real traffic and both return real responses to clients — the gateway selects one variant per request based on the configured split. Variant selection can be made sticky so that the same session or user always receives the same variant.

Configuration Fields

Field | Type | Default | Description
enabled | bool | false | Enable A/B testing for this gateway or route.
variants | list | (required) | Ordered list of AbTestVariant entries defining each variant.
sticky_by | string | none | Stickiness scope: session, user, or none.

AbTestVariant Fields

Field | Type | Description
provider_id | string | Target provider ID (must match a providers.targets entry).
weight | integer | Relative weight. Traffic share = weight / sum(all weights).
label | string | Optional human-readable label recorded in trace metadata (e.g. "control", "challenger").
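
Weights are relative, not percentages. In this sketch (provider IDs are hypothetical), weights of 3 and 1 produce a 75/25 split:

ab_test:
  enabled: true
  sticky_by: none
  variants:
    - provider_id: model-a    # 3 / (3 + 1) = 75% of traffic
      weight: 3
      label: control
    - provider_id: model-b    # 1 / (3 + 1) = 25% of traffic
      weight: 1
      label: challenger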

Stickiness Modes

Mode | Behaviour
none | Variant is selected randomly on every request. Good for aggregate-level analysis.
session | All requests sharing the same session cookie or X-Session-Id header are routed to the same variant.
user | All requests sharing the same X-User-Id header are routed to the same variant. Useful for per-user experiment cohorts.
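
For example, a client can pin itself to a per-user cohort by sending the X-User-Id header on each request (gateway host, auth header, and request body are placeholders):

curl -s "https://<gateway-host>/v1/chat/completions" \
  -H "Authorization: Bearer $KEEPTRUSTS_API_TOKEN" \
  -H "X-User-Id: user-42" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'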

When stickiness is enabled, Keeptrusts maintains a lightweight in-memory hash of session/user → variant assignments. The assignment is deterministic (hash-based), so it survives gateway restarts without requiring an external state store.
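
Keeptrusts' exact implementation is not shown here, but the general technique is easy to sketch: hash the sticky key onto the cumulative variant weights. A minimal Python illustration (function and variable names are hypothetical):

import hashlib

def assign_variant(sticky_key: str, variants: list[tuple[str, int]]) -> str:
    """variants: (provider_id, weight) pairs, as in ab_test.variants."""
    total = sum(weight for _, weight in variants)
    # Stable hash of the session/user ID into a bucket in [0, total).
    digest = hashlib.sha256(sticky_key.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    # Walk the cumulative weights to find the bucket's variant.
    for provider_id, weight in variants:
        if bucket < weight:
            return provider_id
        bucket -= weight
    return variants[-1][0]  # unreachable when total > 0

# assign_variant("user-42", [("prod-model", 80), ("challenger-model", 20)])
# always returns the same variant for "user-42", with no external state.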

YAML Example

An 80/20 split between a production model (labelled control) and a challenger (labelled challenger), sticky by user.

pack:
  name: traffic-mirroring-ab-testing-providers-2
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai-control
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: openai-challenger
      provider: openai:chat:gpt-4o-mini
      secret_key_ref:
        env: OPENAI_API_KEY
ab_test:
  enabled: true
  sticky_by: user
  variants:
    - provider_id: openai-control
      weight: 80
      label: control
    - provider_id: openai-challenger
      weight: 20
      label: challenger
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

Combining Mirror + A/B

You can layer mirroring on top of A/B testing. In the following pattern:

  1. A/B routes 80% of traffic to the production model and 20% to the challenger.
  2. All challenger traffic is mirrored to a dedicated logging-only endpoint that records full response payloads for offline analysis.

pack:
  name: traffic-mirroring-ab-testing-providers-3
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: prod-model
      provider: anthropic:claude-3-5-sonnet-20241022
      secret_key_ref:
        env: ANTHROPIC_API_KEY
    - id: challenger-model
      provider: anthropic:claude-3-opus-20240229
      secret_key_ref:
        env: ANTHROPIC_API_KEY
    - id: logging-endpoint
      provider: openai:chat:gpt-4o
      base_url: https://ingest.internal.example.com
      secret_key_ref:
        env: INTERNAL_LOG_KEY
ab_test:
  enabled: true
  variants:
    - provider_id: prod-model
      weight: 80
      label: control
    - provider_id: challenger-model
      weight: 20
      label: challenger
traffic_mirror:
  enabled: true
  mirror_target: logging-endpoint   # mirrored copies land here
  sample_rate: 1.0                  # runs after variant selection; see note below
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

Mirror sample_rate applies on top of the A/B split, and the mirror runs after variant selection. In this pattern the mirror is scoped to the challenger's traffic: with mirror_target: logging-endpoint and sample_rate: 1.0, every request the challenger serves (the 20% slice) is forwarded to the logging endpoint, while control traffic is left unmirrored.

Analyzing Results

Mirror Events

Every mirrored request produces an event in the Keeptrusts event log with the flag mirror: true. Mirror events include:

  • mirror_target: the ID of the mirror provider.
  • mirror_latency_ms: end-to-end mirror call duration.
  • mirror_response_body: full response payload when log_mirror_response: true.
  • mirror_timeout: true if the mirror call exceeded timeout_ms.
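
An illustrative mirror event shape assembled from these fields (values and the elided response body are examples; the full event carries additional standard fields):

{
  "mirror": true,
  "mirror_target": "openai-shadow",
  "mirror_latency_ms": 842,
  "mirror_timeout": false,
  "mirror_response_body": { "id": "...", "choices": [ ... ] }
}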

To query mirror events in the console, filter by mirror: true in the Events view, or use the API:

curl -s "https://api.keeptrusts.com/v1/events?mirror=true&limit=100" \
-H "Authorization: Bearer $KEEPTRUSTS_API_TOKEN" | jq '.events[].mirror_latency_ms'

A/B Variant Metadata

The selected A/B variant is also propagated into OTLP trace metadata under the ab_variant key. The public Keeptrusts trace API has been removed, so inspect that metadata in VictoriaTraces or your own OTLP backend when you need span-level comparison.

Comparing Quality Scores Across Variants

If you have quality scoring enabled via policy rules, each event also carries a quality_score field. You can compare scores across A/B variants to measure model quality differences:

# Average quality score per variant
curl -s "https://api.keeptrusts.com/v1/events?limit=5000" \
  -H "Authorization: Bearer $KEEPTRUSTS_API_TOKEN" \
  | jq '.events | group_by(.ab_variant) | map({
      variant: .[0].ab_variant,
      avg_quality: (map(.quality_score // 0) | add / length),
      count: length
    })'

Promoting a Challenger Model

Once your challenger model has accumulated sufficient data and meets your quality and latency thresholds, you can graduate it to full production traffic in three steps:

Step 1 — Increase challenger weight

Shift the A/B split from 80/20 to 50/50 and monitor for a few hours:

ab_test:
  enabled: true
  variants:
    - provider_id: prod-model
      weight: 50
      label: control
    - provider_id: challenger-model
      weight: 50
      label: challenger
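
Per-variant p95 latency is the usual signal to watch during this step. A PromQL sketch, assuming the kt_ab_variant_latency_ms histogram exposes the standard Prometheus _bucket series:

histogram_quantile(
  0.95,
  sum by (le, variant_label) (rate(kt_ab_variant_latency_ms_bucket[5m]))
)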

Step 2 — Route all traffic to challenger

Set the challenger to weight: 100 and the control to weight: 0, or simply remove the control variant:

ab_test:
  enabled: true
  variants:
    - provider_id: challenger-model
      weight: 100
      label: production

Step 3 — Disable A/B and set default target

Once stable, disable A/B testing entirely and promote the challenger to providers.default_target:

ab_test:
  enabled: false

providers:
  default_target: challenger-model

Optionally keep the old model configured as a named target so it can be re-enabled quickly for rollback.


Best Practices

  1. Start mirrors at low sample rates. Begin at sample_rate: 0.05 or lower to limit cost impact before you know the mirror model's behavior.

  2. Always set timeout_ms on mirrors. Without a timeout, a slow mirror provider can accumulate open connections under high traffic and exhaust file descriptors. A value between 5 and 10 seconds is safe for most cloud LLMs.

  3. Use sticky_by: user for user-facing A/B experiments. Random per-request selection (sticky_by: none) can cause the same user to see different model behaviors within the same conversation, which degrades user experience and makes manual QA harder.

  4. Keep variant labels short and consistent. Labels appear in trace metadata and events. Using control / challenger (rather than model-name strings) keeps your analysis scripts reusable when you rotate models.

  5. Run experiments for statistical significance. A 20% challenger split on 100 requests per day gives a very wide confidence interval. Aim for at least 500–1000 requests per variant before drawing quality conclusions (see the sketch after this list).

  6. Combine with circuit breakers. A/B variants should each have their own circuit breaker configuration in providers.targets. If a challenger model starts failing, the circuit breaker will open and the gateway will fall back to the control without disrupting the experiment framework.
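
As a rough significance check for item 5, you can compare mean quality_score between variants with Welch's t-test. A minimal Python sketch, assuming you have exported per-event scores (e.g. via the events API shown earlier) into two lists:

from math import sqrt
from statistics import mean, variance

def welch_t(control: list[float], challenger: list[float]) -> float:
    """Welch's t statistic for the difference in mean quality score."""
    n1, n2 = len(control), len(challenger)
    v1, v2 = variance(control), variance(challenger)
    return (mean(challenger) - mean(control)) / sqrt(v1 / n1 + v2 / n2)

# With a 20% challenger on 100 requests/day, a week of data is only
# ~140 challenger samples; |t| rarely clears ~2 unless the true gap is
# large, which is why 500-1000 samples per variant is a safer target.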


Route-Level Overrides

Traffic mirroring and A/B testing can be scoped to specific routes rather than applied globally. This lets you run experiments on one endpoint (e.g., /v1/chat/completions) while keeping other routes (e.g., /v1/embeddings) deterministic and unmirrored.

pack:
  name: traffic-mirroring-ab-testing-routes-7
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai-control
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: openai-new
      provider: openai:chat:gpt-4o-mini
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: embeddings-provider
      provider: openai:embeddings:text-embedding-3-small   # illustrative embeddings target
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
routes:
  - path: "/v1/chat/completions"
    ab_test:
      enabled: true
      sticky_by: user
      variants:
        - provider_id: openai-control
          weight: 90
          label: control
        - provider_id: openai-new
          weight: 10
          label: challenger
  - path: "/v1/embeddings"
    ab_test:
      enabled: false
    providers:
      default_target: embeddings-provider
Route-level settings take precedence over top-level settings. A null value inherits the top-level setting; an explicit enabled: false disables it regardless of the top-level setting.
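
A minimal sketch of that inheritance rule, assuming a top-level traffic_mirror block is defined:

routes:
  - path: "/v1/chat/completions"
    traffic_mirror: null       # inherits the top-level mirror settings
  - path: "/v1/embeddings"
    traffic_mirror:
      enabled: false           # explicit opt-out, regardless of top level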


Metrics and Alerting

Keeptrusts emits Prometheus-compatible metrics for both traffic mirroring and A/B testing:

Metric | Labels | Description
kt_mirror_requests_total | mirror_target, status | Total mirrored requests, labelled by status (success, timeout, error).
kt_mirror_latency_ms | mirror_target | Histogram of mirror call durations.
kt_ab_variant_requests_total | variant_label, provider_id | Total requests routed to each A/B variant.
kt_ab_variant_latency_ms | variant_label | Histogram of response latency per variant.

Access the metrics endpoint at http://<gateway-host>:9090/metrics when Prometheus integration is enabled in your config:

observability:
  prometheus:
    enabled: true
    port: 9090
    path: /metrics

You can build a Grafana alert on kt_mirror_requests_total{status="timeout"} to catch a degraded mirror provider before you formally evaluate experiment results.
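
A sketch of that alert as a standard Prometheus alerting rule (threshold and durations are placeholders to tune for your traffic):

groups:
  - name: keeptrusts-mirroring
    rules:
      - alert: MirrorTimeoutsElevated
        expr: rate(kt_mirror_requests_total{status="timeout"}[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Mirror target is timing out at an elevated rate"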


Experiment Lifecycle Reference

Phase | Action | Config Change
Design | Select challenger model and success metrics | None (planning only)
Shadow | Mirror 5–10% of traffic, no user impact | traffic_mirror.enabled: true, sample_rate: 0.05
Canary | Route 10–20% to challenger with A/B | ab_test.enabled: true, weights 80/20
Ramp | Gradually increase challenger share | Adjust weight values in increments
Promote | Route all traffic to challenger | Set challenger weight: 100 or update default_target
Cleanup | Remove control target from config | Remove old target entry, disable A/B

For AI systems

  • Canonical terms: Keeptrusts Traffic Mirroring, A/B Testing, shadow traffic, mirror target, A/B variant, sticky routing, experiment lifecycle.
  • Config keys: traffic_mirror.enabled, traffic_mirror.mirror_target, traffic_mirror.sample_rate, traffic_mirror.log_mirror_response, traffic_mirror.timeout_ms, ab_test.enabled, ab_test.sticky_by (none | session | user), ab_test.variants[].provider_id, ab_test.variants[].weight, ab_test.variants[].label.
  • Event fields: mirror: true, mirror_target, mirror_latency_ms, mirror_timeout, ab_variant.
  • Metrics: kt_mirror_requests_total, kt_mirror_latency_ms, kt_ab_variant_requests_total, kt_ab_variant_latency_ms.
  • Route-level override: routes[].ab_test and routes[].traffic_mirror override top-level settings for that path.
  • Best next pages: Provider Routing, Model Groups, Custom Routes.

For engineers

  • Prerequisites: at least two provider targets (primary and mirror/challenger); Prometheus enabled for metrics.
  • Start mirrors at sample_rate: 0.05 to limit cost before validating mirror model behavior.
  • Always set timeout_ms on mirrors (5000–10000ms) to prevent file descriptor exhaustion under high traffic.
  • Use sticky_by: user for A/B experiments on user-facing endpoints to avoid inconsistent model behavior within a conversation.
  • Promote a challenger: shift weights gradually (80/20 → 50/50 → 100/0), monitoring kt_ab_variant_latency_ms at each step.
  • Query mirror results: GET /v1/events?mirror=true shows mirror latency and response quality for offline comparison.
  • Combine with circuit breakers: each A/B variant should have its own circuit_breaker config to prevent a failing challenger from disrupting the experiment.

For leaders

  • Risk-free model evaluation: traffic mirroring tests new models on real production queries with zero user impact.
  • Data-driven decisions: A/B testing provides statistically valid quality and latency comparisons before committing to a model switch.
  • Cost visibility: mirror events capture cost_usd per mirrored request, enabling precise cost projections before full migration.
  • Gradual rollout: the experiment lifecycle (shadow → canary → ramp → promote) minimizes blast radius during model transitions.

Next steps