Traffic Mirroring & A/B Testing

Keeptrusts supports shadow traffic mirroring and A/B testing to safely evaluate new models without impacting production. Traffic mirroring lets you send a copy of live requests to a secondary provider in the background — with no impact on response latency for your users. A/B testing lets you split production traffic across two or more model variants with explicit weights, so you can run statistically valid model comparisons entirely within the gateway layer.

Use this page when

  • You need the exact command, config, API, or integration details for Traffic Mirroring & A/B Testing.
  • You are wiring automation or AI retrieval and need canonical names, examples, and constraints.
  • If you want a guided rollout instead of a reference page, use the linked workflow pages in Next steps.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Traffic Mirroring

When mirroring is enabled, Keeptrusts forwards a copy of each sampled request to a secondary provider (the mirror target) after the primary provider has responded. The primary response is always returned to the client; the mirror call is fire-and-forget and its result is captured only in the event log.

Configuration Fields

Field | Type | Default | Description
enabled | bool | false | Enable traffic mirroring for this gateway or route.
mirror_target | string | (required) | Provider target ID to receive mirrored traffic. Must match a target declared under providers.targets.
sample_rate | float (0.0–1.0) | 1.0 | Fraction of requests to mirror. 0.1 mirrors 10% of traffic; 1.0 mirrors all.
log_mirror_response | bool | true | When true, the mirror provider's response body is captured in the event log for later analysis.
timeout_ms | integer | 5000 | Maximum time to wait for the mirror response before discarding it. Does not affect the primary response.
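
A minimal traffic_mirror block assembling the fields above (the target ID is illustrative and must match a providers.targets entry):

traffic_mirror:
  enabled: true
  mirror_target: openai-shadow    # illustrative target ID
  sample_rate: 0.05               # mirror 5% of requests
  log_mirror_response: true
  timeout_ms: 5000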

How It Works

Client Request
      │
      ▼
┌─────────────┐    primary response
│  Keeptrusts │ ─────────────────────►  Client Response
│   Gateway   │
└──────┬──────┘
       │  (sampled, parallel, fire-and-forget)
       ▼
┌─────────────────────┐
│    Mirror Target    │   response captured in event log
│    (e.g. gpt-4o)    │   (mirror: true)
└─────────────────────┘

Primary response latency is never affected by mirror-target latency. If the mirror call exceeds timeout_ms, the result is discarded and an event with mirror_timeout: true is emitted.

Use Cases

  • Model validation before promoting: Run the challenger model as a mirror at 10% traffic for a week before routing any production requests to it.
  • Compliance auditing: Mirror all requests to an auditing provider that applies stricter policy checks without affecting end-user responses.
  • Cost profiling: Mirror 5% of requests to a premium model to estimate cost deltas before committing to a full switch.
  • Regression detection: Mirror production traffic against a new model version and compare output quality scores offline.

YAML Example

The following configuration routes primary traffic to claude-3-5-sonnet and mirrors 10% of requests to gpt-4o for shadow evaluation.

pack:
  name: traffic-mirroring-ab-testing-providers-1
  version: 1.0.0
  enabled: true
providers:
  default_target: anthropic-primary    # primary traffic goes to Claude
  targets:
    - id: anthropic-primary
      provider: anthropic:claude-3-5-sonnet-20241022
      secret_key_ref:
        env: ANTHROPIC_API_KEY
    - id: openai-shadow
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
traffic_mirror:
  enabled: true
  mirror_target: openai-shadow    # shadow copies go to gpt-4o
  sample_rate: 0.1                # mirror 10% of requests
  log_mirror_response: true
  timeout_ms: 5000
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

A/B Testing

A/B testing in Keeptrusts routes production traffic across two or more model variants according to explicit weights. Unlike mirroring, both variants receive real traffic and both return real responses to clients — the gateway selects one variant per request based on the configured split. Variant selection can be made sticky so that the same session or user always receives the same variant.

Configuration Fields

Field | Type | Default | Description
enabled | bool | false | Enable A/B testing for this gateway or route.
variants | list | (required) | Ordered list of AbTestVariant entries defining each variant.
sticky_by | string | none | Stickiness scope: session, user, or none.

AbTestVariant Fields

Field | Type | Description
provider_id | string | Target provider ID (must match a providers.targets entry).
weight | integer | Relative weight. Traffic share = weight / sum(all weights).
label | string | Optional human-readable label recorded in trace metadata (e.g. "control", "challenger").
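
Weights are relative, not percentages. In this sketch (provider IDs are hypothetical), weights of 3 and 1 produce a 75/25 split:

ab_test:
  enabled: true
  sticky_by: none
  variants:
    - provider_id: model-a    # 3 / (3 + 1) = 75% of traffic
      weight: 3
      label: control
    - provider_id: model-b    # 1 / (3 + 1) = 25% of traffic
      weight: 1
      label: challenger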

Stickiness Modes

Mode | Behaviour
none | Variant is selected randomly on every request. Good for aggregate-level analysis.
session | All requests sharing the same session cookie or X-Session-Id header are routed to the same variant.
user | All requests sharing the same X-User-Id header are routed to the same variant. Useful for per-user experiment cohorts.
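
For example, a client can pin itself to a per-user cohort by sending the X-User-Id header on each request (gateway host, auth header, and request body are placeholders):

curl -s "https://<gateway-host>/v1/chat/completions" \
  -H "Authorization: Bearer $KEEPTRUSTS_API_TOKEN" \
  -H "X-User-Id: user-42" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'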

When stickiness is enabled, Keeptrusts maintains a lightweight in-memory hash of session/user → variant assignments. The assignment is deterministic (hash-based), so it survives gateway restarts without requiring an external state store.
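
Keeptrusts' exact implementation is not shown here, but the general technique is easy to sketch: hash the sticky key onto the cumulative variant weights. A minimal Python illustration (function and variable names are hypothetical):

import hashlib

def assign_variant(sticky_key: str, variants: list[tuple[str, int]]) -> str:
    """variants: (provider_id, weight) pairs, as in ab_test.variants."""
    total = sum(weight for _, weight in variants)
    # Stable hash of the session/user ID into a bucket in [0, total).
    digest = hashlib.sha256(sticky_key.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    # Walk the cumulative weights to find the bucket's variant.
    for provider_id, weight in variants:
        if bucket < weight:
            return provider_id
        bucket -= weight
    return variants[-1][0]  # unreachable when total > 0

# assign_variant("user-42", [("prod-model", 80), ("challenger-model", 20)])
# always returns the same variant for "user-42", with no external state.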

YAML Example

An 80/20 split between a production model (labelled control) and a challenger (labelled challenger), sticky by user.

pack:
  name: traffic-mirroring-ab-testing-providers-2
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai-control
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: openai-challenger
      provider: openai:chat:gpt-4o-mini
      secret_key_ref:
        env: OPENAI_API_KEY
ab_test:
  enabled: true
  sticky_by: user
  variants:
    - provider_id: openai-control
      weight: 80
      label: control
    - provider_id: openai-challenger
      weight: 20
      label: challenger
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

Combining Mirror + A/B

You can layer mirroring on top of A/B testing. In the following pattern:

  1. A/B routes 80% of traffic to the production model and 20% to the challenger.
  2. All challenger traffic is mirrored to a dedicated logging-only endpoint that records full response payloads for offline analysis.

pack:
  name: traffic-mirroring-ab-testing-providers-3
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: prod-model
      provider: anthropic:claude-3-5-sonnet-20241022
      secret_key_ref:
        env: ANTHROPIC_API_KEY
    - id: challenger-model
      provider: anthropic:claude-3-opus-20240229
      secret_key_ref:
        env: ANTHROPIC_API_KEY
    - id: logging-endpoint
      provider: openai:chat:gpt-4o
      base_url: https://ingest.internal.example.com
      secret_key_ref:
        env: INTERNAL_LOG_KEY
ab_test:
  enabled: true
  variants:
    - provider_id: prod-model
      weight: 80
      label: control
    - provider_id: challenger-model
      weight: 20
      label: challenger
traffic_mirror:
  enabled: true
  mirror_target: logging-endpoint   # mirrored copies land here
  sample_rate: 1.0                  # runs after variant selection; see note below
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

Mirror sample_rate applies on top of the A/B split, and the mirror runs after variant selection. In this pattern the mirror is scoped to the challenger's traffic: with mirror_target: logging-endpoint and sample_rate: 1.0, every request the challenger serves (the 20% slice) is forwarded to the logging endpoint, while control traffic is left unmirrored.

Analyzing Results

Mirror Events

Every mirrored request produces an event in the Keeptrusts event log with the flag mirror: true. Mirror events include:

  • mirror_target: the ID of the mirror provider.
  • mirror_latency_ms: end-to-end mirror call duration.
  • mirror_response_body: full response payload when log_mirror_response: true.
  • mirror_timeout: true if the mirror call exceeded timeout_ms.
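
An illustrative mirror event shape assembled from these fields (values and the elided response body are examples; the full event carries additional standard fields):

{
  "mirror": true,
  "mirror_target": "openai-shadow",
  "mirror_latency_ms": 842,
  "mirror_timeout": false,
  "mirror_response_body": { "id": "...", "choices": [ ... ] }
}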

To query mirror events in the console, filter by mirror: true in the Events view, or use the API:

curl -s "https://api.keeptrusts.com/v1/events?mirror=true&limit=100" \
-H "Authorization: Bearer $KEEPTRUSTS_API_TOKEN" | jq '.events[].mirror_latency_ms'

A/B Variant Metadata

The selected A/B variant is also propagated into OTLP trace metadata under the ab_variant key. The public Keeptrusts trace API has been removed, so inspect that metadata in VictoriaTraces or your own OTLP backend when you need span-level comparison.

Comparing Quality Scores Across Variants

If you have quality scoring enabled via policy rules, each event also carries a quality_score field. You can compare scores across A/B variants to measure model quality differences:

# Average quality score per variant
curl -s "https://api.keeptrusts.com/v1/events?limit=5000" \
  -H "Authorization: Bearer $KEEPTRUSTS_API_TOKEN" \
  | jq '.events | group_by(.ab_variant) | map({
      variant: .[0].ab_variant,
      avg_quality: (map(.quality_score // 0) | add / length),
      count: length
    })'

Promoting a Challenger Model

Once your challenger model has accumulated sufficient data and meets your quality and latency thresholds, you can graduate it to full production traffic in three steps:

Step 1 — Increase challenger weight

Shift the A/B split from 80/20 to 50/50 and monitor for a few hours:

ab_test:
  enabled: true
  variants:
    - provider_id: prod-model
      weight: 50
      label: control
    - provider_id: challenger-model
      weight: 50
      label: challenger
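
Per-variant p95 latency is the usual signal to watch during this step. A PromQL sketch, assuming the kt_ab_variant_latency_ms histogram exposes the standard Prometheus _bucket series:

histogram_quantile(
  0.95,
  sum by (le, variant_label) (rate(kt_ab_variant_latency_ms_bucket[5m]))
)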

Step 2 — Route all traffic to challenger

Set the challenger to weight: 100 and the control to weight: 0, or simply remove the control variant:

ab_test:
  enabled: true
  variants:
    - provider_id: challenger-model
      weight: 100
      label: production

Step 3 — Disable A/B and set default target

Once stable, disable A/B testing entirely and promote the challenger to providers.default_target:

ab_test:
  enabled: false

providers:
  default_target: challenger-model

Optionally keep the old model configured as a named target so it can be re-enabled quickly for rollback.


Best Practices

  1. Start mirrors at low sample rates. Begin at sample_rate: 0.05 or lower to limit cost impact before you know the mirror model's behavior.

  2. Always set timeout_ms on mirrors. Without a timeout, a slow mirror provider can accumulate open connections under high traffic and exhaust file descriptors. A value between 5 and 10 seconds is safe for most cloud LLMs.

  3. Use sticky_by: user for user-facing A/B experiments. Random per-request selection (sticky_by: none) can cause the same user to see different model behaviors within the same conversation, which degrades user experience and makes manual QA harder.

  4. Keep variant labels short and consistent. Labels appear in trace metadata and events. Using control / challenger (rather than model-name strings) keeps your analysis scripts reusable when you rotate models.

  5. Run experiments for statistical significance. A 20% challenger split on 100 requests per day gives a very wide confidence interval. Aim for at least 500–1000 requests per variant before drawing quality conclusions (see the sketch after this list).

  6. Combine with circuit breakers. A/B variants should each have their own circuit breaker configuration in providers.targets. If a challenger model starts failing, the circuit breaker will open and the gateway will fall back to the control without disrupting the experiment framework.
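
As a rough significance check for item 5, you can compare mean quality_score between variants with Welch's t-test. A minimal Python sketch, assuming you have exported per-event scores (e.g. via the events API shown earlier) into two lists:

from math import sqrt
from statistics import mean, variance

def welch_t(control: list[float], challenger: list[float]) -> float:
    """Welch's t statistic for the difference in mean quality score."""
    n1, n2 = len(control), len(challenger)
    v1, v2 = variance(control), variance(challenger)
    return (mean(challenger) - mean(control)) / sqrt(v1 / n1 + v2 / n2)

# With a 20% challenger on 100 requests/day, a week of data is only
# ~140 challenger samples; |t| rarely clears ~2 unless the true gap is
# large, which is why 500-1000 samples per variant is a safer target.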


Route-Level Overrides

Traffic mirroring and A/B testing can be scoped to specific routes rather than applied globally. This lets you run experiments on one endpoint (e.g., /v1/chat/completions) while keeping other routes (e.g., /v1/embeddings) deterministic and unmirrored.

pack:
  name: traffic-mirroring-ab-testing-routes-7
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai-control
      provider: openai:chat:gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: openai-new
      provider: openai:chat:gpt-4o-mini
      secret_key_ref:
        env: OPENAI_API_KEY
    - id: embeddings-provider
      provider: openai:embeddings:text-embedding-3-small   # illustrative embeddings target
      secret_key_ref:
        env: OPENAI_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
routes:
  - path: "/v1/chat/completions"
    ab_test:
      enabled: true
      sticky_by: user
      variants:
        - provider_id: openai-control
          weight: 90
          label: control
        - provider_id: openai-new
          weight: 10
          label: challenger
  - path: "/v1/embeddings"
    ab_test:
      enabled: false
    providers:
      default_target: embeddings-provider
Route-level settings take precedence over top-level settings. A null value inherits the top-level setting; an explicit enabled: false disables it regardless of the top-level setting.
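
A minimal sketch of that inheritance rule, assuming a top-level traffic_mirror block is defined:

routes:
  - path: "/v1/chat/completions"
    traffic_mirror: null       # inherits the top-level mirror settings
  - path: "/v1/embeddings"
    traffic_mirror:
      enabled: false           # explicit opt-out, regardless of top level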


Metrics and Alerting

Keeptrusts emits Prometheus-compatible metrics for both traffic mirroring and A/B testing:

Metric | Labels | Description
kt_mirror_requests_total | mirror_target, status | Total mirrored requests, labelled by status (success, timeout, error).
kt_mirror_latency_ms | mirror_target | Histogram of mirror call durations.
kt_ab_variant_requests_total | variant_label, provider_id | Total requests routed to each A/B variant.
kt_ab_variant_latency_ms | variant_label | Histogram of response latency per variant.

Access the metrics endpoint at http://<gateway-host>:9090/metrics when Prometheus integration is enabled in your config:

observability:
  prometheus:
    enabled: true
    port: 9090
    path: /metrics

You can build a Grafana alert on kt_mirror_requests_total{status="timeout"} to catch a degraded mirror provider before you formally evaluate experiment results.
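
A sketch of that alert as a standard Prometheus alerting rule (threshold and durations are placeholders to tune for your traffic):

groups:
  - name: keeptrusts-mirroring
    rules:
      - alert: MirrorTimeoutsElevated
        expr: rate(kt_mirror_requests_total{status="timeout"}[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Mirror target is timing out at an elevated rate"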


Experiment Lifecycle Reference

Phase | Action | Config Change
Design | Select challenger model and success metrics | None (planning only)
Shadow | Mirror 5–10% of traffic, no user impact | traffic_mirror.enabled: true, sample_rate: 0.05
Canary | Route 10–20% to challenger with A/B | ab_test.enabled: true, weights 80/20
Ramp | Gradually increase challenger share | Adjust weight values in increments
Promote | Route all traffic to challenger | Set challenger weight: 100 or update default_target
Cleanup | Remove control target from config | Remove old target entry, disable A/B

For AI systems

  • Canonical terms: Keeptrusts Traffic Mirroring, A/B Testing, shadow traffic, mirror target, A/B variant, sticky routing, experiment lifecycle.
  • Config keys: traffic_mirror.enabled, traffic_mirror.mirror_target, traffic_mirror.sample_rate, traffic_mirror.log_mirror_response, traffic_mirror.timeout_ms, ab_test.enabled, ab_test.sticky_by (none | session | user), ab_test.variants[].provider_id, ab_test.variants[].weight, ab_test.variants[].label.
  • Event fields: mirror: true, mirror_target, mirror_latency_ms, mirror_timeout, ab_variant.
  • Metrics: kt_mirror_requests_total, kt_mirror_latency_ms, kt_ab_variant_requests_total, kt_ab_variant_latency_ms.
  • Route-level override: routes[].ab_test and routes[].traffic_mirror override top-level settings for that path.
  • Best next pages: Provider Routing, Model Groups, Custom Routes.

For engineers

  • Prerequisites: at least two provider targets (primary and mirror/challenger); Prometheus enabled for metrics.
  • Start mirrors at sample_rate: 0.05 to limit cost before validating mirror model behavior.
  • Always set timeout_ms on mirrors (5000–10000ms) to prevent file descriptor exhaustion under high traffic.
  • Use sticky_by: user for A/B experiments on user-facing endpoints to avoid inconsistent model behavior within a conversation.
  • Promote a challenger: shift weights gradually (80/20 → 50/50 → 100/0), monitoring kt_ab_variant_latency_ms at each step.
  • Query mirror results: GET /v1/events?mirror=true shows mirror latency and response quality for offline comparison.
  • Combine with circuit breakers: each A/B variant should have its own circuit_breaker config to prevent a failing challenger from disrupting the experiment.

For leaders

  • Risk-free model evaluation: traffic mirroring tests new models on real production queries with zero user impact.
  • Data-driven decisions: A/B testing provides statistically valid quality and latency comparisons before committing to a model switch.
  • Cost visibility: mirror events capture cost_usd per mirrored request, enabling precise cost projections before full migration.
  • Gradual rollout: the experiment lifecycle (shadow → canary → ramp → promote) minimizes blast radius during model transitions.

Next steps