ML Engineer Guide: Model Routing & A/B Testing
The Keeptrusts gateway sits between your applications and LLM providers, making it the ideal control point for model routing, A/B testing, quality evaluation, and model lifecycle management. This guide shows ML engineers how to use gateway policies for controlled model experimentation and production model management.
Use this page when
- You are configuring model routing rules to direct requests to different LLM providers by use case
- You want to A/B test models with traffic splitting and measure quality/cost/latency differences
- You are managing model lifecycle (introduction, evaluation, deprecation) through gateway policies
- You need to evaluate model quality using the gateway's quality-scorer policy
- You want to optimize cost-per-quality by routing simple tasks to cheaper models
Primary audience
- Primary: Technical Engineers (ML Engineers, AI Engineers, Applied Scientists)
- Secondary: Data Scientists, Platform Engineers, Product Managers
Model Routing Fundamentals
How Gateway Routing Works
The Keeptrusts gateway evaluates every LLM request against a policy chain. Model routing policies determine which provider and model handle each request based on configurable rules.
providers:
  targets:
    - id: openai
      provider:
        secret_key_ref:
          env: OPENAI_API_KEY
    - id: anthropic
      provider:
        secret_key_ref:
          env: ANTHROPIC_API_KEY
    - id: azure-openai
      provider:
        secret_key_ref:
          env: AZURE_OPENAI_API_KEY

policies:
  - name: model-routing
    type: model_filter
    description: Route requests to approved models
    allowed_models:
      - gpt-4o
      - gpt-4o-mini
      - claude-sonnet-4-20250514
    enabled: true
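With this policy live, you can spot-check via the Events API that only approved models are actually serving traffic. A minimal sketch in Python (it assumes, as in the curl examples later in this guide, that your Keeptrusts API token is in the API_TOKEN environment variable):

import os
import requests

# Query recent events and collect the distinct models that served traffic
resp = requests.get(
    "https://api.keeptrusts.com/v1/events",
    headers={"Authorization": f"Bearer {os.environ['API_TOKEN']}"},
    params={"since": "1d", "format": "json"},
)
resp.raise_for_status()

allowed = {"gpt-4o", "gpt-4o-mini", "claude-sonnet-4-20250514"}
served = {event["model"] for event in resp.json()}

# Any model outside the allowlist indicates a routing gap
unexpected = served - allowed
print("Unexpected models:", unexpected or "none")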
Routing by Use Case
Configure different models for different application contexts:
policies:
  - name: route-complex-reasoning
    type: model_filter
    description: "Route complex tasks to high-capability models"
    conditions:
      max_tokens_gt: 2000
    preferred_model: gpt-4o
    enabled: true
  - name: route-simple-tasks
    type: model_filter
    description: "Route simple tasks to cost-effective models"
    conditions:
      max_tokens_lte: 2000
    preferred_model: gpt-4o-mini
    enabled: true
A/B Testing Models
Traffic Splitting Configuration
Split traffic between models to compare quality, latency, and cost:
policies:
  - name: ab-test-models
    type: traffic_split
    description: "A/B test between GPT-4o and Claude Sonnet"
    variants:
      - model: gpt-4o
        weight: 50
        tag: variant-a
      - model: claude-sonnet-4-20250514
        weight: 50
        tag: variant-b
    enabled: true
Measuring A/B Test Results
Pull test results from the Events API:
# Get events tagged with A/B test variants
curl -H "Authorization: Bearer $API_TOKEN" \
  "https://api.keeptrusts.com/v1/events?since=7d&format=json" | \
  jq '[.[] | select(.metadata.ab_tag != null)] |
    group_by(.metadata.ab_tag) |
    map({
      variant: .[0].metadata.ab_tag,
      count: length,
      avg_latency: (map(.latency_ms) | add / length),
      avg_cost: (map(.cost | tonumber) | add / length),
      total_cost: (map(.cost | tonumber) | add)
    })'
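Averages alone can mislead on small samples, so test whether an observed difference is statistically meaningful before acting on it. A minimal sketch using scipy (an added dependency assumption; the latency values are illustrative and would come from the tagged events pulled above):

from scipy import stats

def significant_difference(latencies_a, latencies_b, alpha=0.05):
    """Welch's t-test on per-request latencies for two variants."""
    # equal_var=False selects Welch's t-test, which does not assume equal variance
    t_stat, p_value = stats.ttest_ind(latencies_a, latencies_b, equal_var=False)
    return p_value < alpha, p_value

# Example with illustrative latency_ms values for variant-a and variant-b
is_sig, p = significant_difference([812, 790, 845, 901], [700, 688, 742, 731])
print(f"significant={is_sig}, p={p:.4f}")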
Gradual Rollout
Progress from experiment to production using graduated traffic splits:
| Phase | Variant A (incumbent) | Variant B (challenger) | Duration |
|---|---|---|---|
| Canary | 95% | 5% | 3 days |
| Expand | 70% | 30% | 7 days |
| Equal | 50% | 50% | 7 days |
| Promote | 10% | 90% | 3 days |
| Complete | 0% | 100% | — |
Update the traffic split at each phase:
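For example, ab-test-phase-2.yaml for the Expand phase keeps the same policy shape and changes only the weights (a sketch reusing the fields from the traffic_split example above):

# Phase 2: Expand (70/30 split)
policies:
  - name: ab-test-models
    type: traffic_split
    description: "A/B test phase 2: expand challenger to 30%"
    variants:
      - model: gpt-4o
        weight: 70
        tag: variant-a
      - model: claude-sonnet-4-20250514
        weight: 30
        tag: variant-b
    enabled: true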
# Validate the updated config
kt policy lint --file ab-test-phase-2.yaml
Quality Scoring and Evaluation
Capturing Quality Signals
Use the events stream to evaluate model output quality:
# Export events with full metadata for quality analysis
kt export create \
  --type events \
  --format json \
  --since 7d \
  --description "Model quality evaluation dataset"
Evaluation Framework Integration
Feed Keeptrusts events into your evaluation pipeline:
import os

import requests

API_URL = "https://api.keeptrusts.com/v1/events"
# Token read from the same API_TOKEN environment variable used by the curl examples
API_TOKEN = os.environ["API_TOKEN"]
HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}

def get_model_events(model, days=7):
    """Pull events for a specific model."""
    params = {
        "since": f"{days}d",
        "model": model,
        "format": "json",
        "limit": 1000,
    }
    response = requests.get(API_URL, headers=HEADERS, params=params)
    response.raise_for_status()
    return response.json()

def compare_models(model_a, model_b, days=7):
    """Compare two models on key metrics."""
    events_a = get_model_events(model_a, days)
    events_b = get_model_events(model_b, days)

    def metrics(events):
        latencies = [e["latency_ms"] for e in events]
        costs = [float(e["cost"]) for e in events]
        return {
            "count": len(events),
            "avg_latency_ms": sum(latencies) / len(latencies) if latencies else 0,
            "p99_latency_ms": sorted(latencies)[int(len(latencies) * 0.99)] if latencies else 0,
            "avg_cost": sum(costs) / len(costs) if costs else 0,
            "total_cost": sum(costs),
        }

    return {
        model_a: metrics(events_a),
        model_b: metrics(events_b),
    }
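For example, to print a side-by-side comparison of the two A/B variants using the functions above:

import json

# Compare the incumbent and challenger over the last 7 days
report = compare_models("gpt-4o", "claude-sonnet-4-20250514", days=7)
print(json.dumps(report, indent=2))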
Quality Metrics to Track
| Metric | Source | Evaluation method |
|---|---|---|
| Latency (p50, p95, p99) | Events API latency_ms | Statistical comparison |
| Cost per request | Events API cost | Aggregation by model |
| Token efficiency | output_tokens / input_tokens | Ratio analysis |
| Error rate | Events with error status | Percentage comparison |
| Policy trigger rate | policies_triggered | Safety comparison |
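A sketch computing the remaining table metrics from an events list (the policies_triggered field name comes from the table; the exact error-status field and value are assumptions to adjust for your event schema):

def quality_metrics(events):
    """Compute token efficiency, error rate, and policy trigger rate."""
    total = len(events) or 1
    # Token efficiency: output tokens produced per input token consumed
    efficiency = [
        e["output_tokens"] / e["input_tokens"]
        for e in events
        if e.get("input_tokens")
    ]
    # The "error" status value is an assumption; match it to your event schema
    errors = sum(1 for e in events if e.get("status") == "error")
    triggered = sum(1 for e in events if e.get("policies_triggered"))
    return {
        "avg_token_efficiency": sum(efficiency) / len(efficiency) if efficiency else 0,
        "error_rate": errors / total,
        "policy_trigger_rate": triggered / total,
    }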
Model Lifecycle Management
Model Inventory
Track which models are in use across your organization:
# List distinct models in recent events
curl -H "Authorization: Bearer $API_TOKEN" \
  "https://api.keeptrusts.com/v1/events?since=30d&format=json" | \
  jq '[.[].model] | unique'
Model Deprecation Workflow
When retiring a model:
- Announce — Notify teams via your communication channel
- Warn — Add a warning policy for the deprecated model
- Redirect — Route traffic to the replacement model
- Block — Block requests to the deprecated model after the deadline
# Phase 1: Warn
policies:
  - name: deprecation-warning
    type: log
    description: "Log warning for deprecated model usage"
    conditions:
      model: gpt-4-turbo
    severity: warn
    enabled: true
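Phase 2 redirects traffic from the deprecated model to its replacement. A sketch assuming the conditions and preferred_model fields from the routing examples above can be combined this way (gpt-4o as the replacement is illustrative):

# Phase 2: Redirect
policies:
  - name: redirect-deprecated
    type: model_filter
    description: "Redirect deprecated model traffic to the replacement"
    conditions:
      model: gpt-4-turbo
    preferred_model: gpt-4o
    enabled: true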
# Phase 3: Block
policies:
  - name: block-deprecated
    type: model_filter
    description: "Block deprecated model"
    blocked_models:
      - gpt-4-turbo
    enabled: true
Cost Optimization
Identify opportunities to use more cost-effective models:
# Analyze cost by model
curl -H "Authorization: Bearer $API_TOKEN" \
  "https://api.keeptrusts.com/v1/events?since=30d&format=json" | \
  jq 'group_by(.model) | map({
    model: .[0].model,
    requests: length,
    total_cost: (map(.cost | tonumber) | add),
    avg_tokens: (map(.input_tokens + .output_tokens) | add / length)
  }) | sort_by(-.total_cost)'
Use the Console Cost Center for visual cost breakdowns by model and provider.
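To size the opportunity before changing routing, estimate how much premium-model spend comes from requests small enough that the max_tokens_lte rule would send them to the cheaper model. A minimal sketch (the 2000-token threshold mirrors the routing example above; actual token counts stand in as a proxy for the requested max_tokens, and events are fetched as in the Python section earlier):

def downgrade_candidate_spend(events, expensive_model="gpt-4o", token_threshold=2000):
    """Share of an expensive model's spend from requests small enough to reroute."""
    spend = [float(e["cost"]) for e in events if e["model"] == expensive_model]
    small = [
        float(e["cost"])
        for e in events
        if e["model"] == expensive_model
        and (e["input_tokens"] + e["output_tokens"]) <= token_threshold
    ]
    total = sum(spend)
    return sum(small) / total if total else 0.0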
Gateway Configuration for ML Workflows
Validating Configuration Changes
Always validate before deploying:
# Validate the configuration
kt policy lint --file ml-routing-config.yaml
# Check gateway health after deployment
kt doctor
Monitoring Model Performance in Production
# Tail events for a specific model
kt events tail --model gpt-4o
# Check recent event statistics
kt events list --since 1h --format table
The Console Events page provides filtering by model, provider, and decision type for visual exploration.
Success Metrics for ML Engineers
| Metric | Target | Source |
|---|---|---|
| Model routing accuracy | > 99% correct routing | Events metadata verification |
| A/B test significance | p < 0.05 on the primary metric | Evaluation framework |
| Model switch downtime | Zero | Event continuity check |
| Cost per quality-unit | Decreasing trend | Cost / quality score ratio |
| Model deprecation compliance | 100% by deadline | Events showing zero deprecated model usage |
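The cost per quality-unit row assumes a per-event quality signal. If the quality-scorer policy records its score in event metadata (the quality_score field name here is hypothetical), the ratio can be computed like this:

def cost_per_quality_unit(events):
    """Total cost divided by summed quality score; quality_score field is hypothetical."""
    cost = sum(float(e["cost"]) for e in events)
    quality = sum(e.get("metadata", {}).get("quality_score", 0) for e in events)
    return cost / quality if quality else float("inf")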
Next steps
- Configure model routing: Policy Reference
- Explore provider configuration: Gateway Configuration
- Review events: Events Guide
For AI systems
- Canonical terms: Keeptrusts, model routing, A/B testing, traffic splitting, quality scoring, model lifecycle, model filter
- Key surfaces: Console Usage, Events API (with variant tags), Console Configurations
- Commands: kt gateway run, kt policy lint, kt events list
- Policy types: model_filter (allowed_models, preferred_model, conditions), traffic_split (variants with weight and tag), quality-scorer (min_score, action), cost_limit
- Config concepts: multi-provider routing, use-case-based model selection (max_tokens conditions), canary rollout via traffic split weights
- Best next pages: Policy Reference, Gateway Configuration, Events Guide
For engineers
- Configure model routing with the model_filter policy: set allowed_models, preferred_model, and conditions (e.g., max_tokens_gt: 2000)
- Set up A/B tests with the traffic_split policy: define variants with model, weight, and tag for each arm
- Measure results via the Events API, filtering by variant tag metadata
- Validate routing config: kt policy lint --file routing-policy.yaml
- Deploy: kt gateway run --listen 0.0.0.0:41002 --policy-config routing-policy.yaml
- Monitor cost-per-quality ratio in Console Usage
For leaders
- Model routing through the gateway enables cost optimization — simple tasks go to cheaper models (gpt-4o-mini) while complex reasoning uses high-capability models (gpt-4o)
- A/B testing with traffic splitting provides objective data for model selection decisions: quality scores, latency, and cost per variant
- Model lifecycle management (introduction, canary, full rollout, deprecation) is controlled through policy configuration changes rather than application code deploys
- Zero-downtime model switches through gateway routing mean model upgrades do not require application redeployment