ML Engineer Guide: Model Routing & A/B Testing

The Keeptrusts gateway sits at the intersection of your applications and LLM providers, making it the ideal control point for model routing, A/B testing, quality evaluation, and model lifecycle management. This guide shows ML engineers how to leverage gateway policies for controlled model experimentation and production model management.

Use this page when

  • You are configuring model routing rules to direct requests to different LLM providers by use case
  • You want to A/B test models with traffic splitting and measure quality/cost/latency differences
  • You are managing model lifecycle (introduction, evaluation, deprecation) through gateway policies
  • You need to evaluate model quality using the gateway's quality-scorer policy
  • You want to optimize cost-per-quality by routing simple tasks to cheaper models

Primary audience

  • Primary: Technical Engineers (ML Engineers, AI Engineers, Applied Scientists)
  • Secondary: Data Scientists, Platform Engineers, Product Managers

Model Routing Fundamentals

How Gateway Routing Works

The Keeptrusts gateway evaluates every LLM request against a policy chain. Model routing policies determine which provider and model handle each request based on configurable rules.

```yaml
providers:
  targets:
    - id: openai
      provider:
        secret_key_ref:
          env: OPENAI_API_KEY
    - id: anthropic
      provider:
        secret_key_ref:
          env: ANTHROPIC_API_KEY
    - id: azure-openai
      provider:
        secret_key_ref:
          env: AZURE_OPENAI_API_KEY

policies:
  - name: model-routing
    type: model_filter
    description: Route requests to approved models
    allowed_models:
      - gpt-4o
      - gpt-4o-mini
      - claude-sonnet-4-20250514
    enabled: true
```
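
To put this configuration into service, validate it and start the gateway against it. Both commands appear elsewhere on this page; the listen address mirrors the deployment example in the engineer checklist below:

```bash
# Validate the routing config, then start the gateway with it
kt policy lint --file ml-routing-config.yaml
kt gateway run --listen 0.0.0.0:41002 --policy-config ml-routing-config.yaml
```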

Routing by Use Case

Configure different models for different application contexts:

```yaml
policies:
  - name: route-complex-reasoning
    type: model_filter
    description: "Route complex tasks to high-capability models"
    conditions:
      max_tokens_gt: 2000
    preferred_model: gpt-4o
    enabled: true

  - name: route-simple-tasks
    type: model_filter
    description: "Route simple tasks to cost-effective models"
    conditions:
      max_tokens_lte: 2000
    preferred_model: gpt-4o-mini
    enabled: true
```
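
To see the split in action, send a request through the gateway. The sketch below assumes the gateway exposes an OpenAI-compatible chat completions route on its listen port; the endpoint path is an assumption, so adjust it to your deployment:

```bash
# Assumed endpoint: an OpenAI-compatible route on the gateway's listen port.
# max_tokens: 4096 satisfies max_tokens_gt: 2000, so route-complex-reasoning
# steers this request to gpt-4o.
curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "max_tokens": 4096,
    "messages": [{"role": "user", "content": "Walk through the trade-offs in this design."}]
  }'
```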

A/B Testing Models

Traffic Splitting Configuration

Split traffic between models to compare quality, latency, and cost:

```yaml
policies:
  - name: ab-test-models
    type: traffic_split
    description: "A/B test between GPT-4o and Claude Sonnet"
    variants:
      - model: gpt-4o
        weight: 50
        tag: variant-a
      - model: claude-sonnet-4-20250514
        weight: 50
        tag: variant-b
    enabled: true
```

Measuring A/B Test Results

Each variant's tag is written to event metadata (surfaced as metadata.ab_tag below), so you can pull and aggregate test results from the Events API:

```bash
# Get events tagged with A/B test variants
curl -H "Authorization: Bearer $API_TOKEN" \
  "https://api.keeptrusts.com/v1/events?since=7d&format=json" | \
  jq '[.[] | select(.metadata.ab_tag != null)] |
    group_by(.metadata.ab_tag) |
    map({
      variant: .[0].metadata.ab_tag,
      count: length,
      avg_latency: (map(.latency_ms) | add / length),
      avg_cost: (map(.cost | tonumber) | add / length),
      total_cost: (map(.cost | tonumber) | add)
    })'
```
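
To decide whether a difference between variants is real rather than noise, run a significance test on the per-variant samples. A minimal sketch with SciPy, assuming you saved the raw events JSON (the curl output before the jq aggregation) to events.json; the p < 0.05 threshold matches the success metrics at the end of this page:

```python
import json

from scipy import stats

# Raw events saved from the curl export above
with open("events.json") as f:
    events = [e for e in json.load(f) if e.get("metadata", {}).get("ab_tag")]

# Split latency samples by A/B variant tag
a = [e["latency_ms"] for e in events if e["metadata"]["ab_tag"] == "variant-a"]
b = [e["latency_ms"] for e in events if e["metadata"]["ab_tag"] == "variant-b"]

# Welch's t-test: does not assume equal variance between variants
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
print(f"t={t_stat:.3f}, p={p_value:.4f}")  # p < 0.05 suggests a real latency difference
```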

Gradual Rollout

Progress from experiment to production using graduated traffic splits:

| Phase | Variant A (incumbent) | Variant B (challenger) | Duration |
|---|---|---|---|
| Canary | 95% | 5% | 3 days |
| Expand | 70% | 30% | 7 days |
| Equal | 50% | 50% | 7 days |
| Promote | 10% | 90% | 3 days |
| Complete | 0% | 100% | |

Update the traffic split at each phase:
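
For example, a sketch of ab-test-phase-2.yaml for the Expand phase, identical to the A/B policy above with only the weights changed:

```yaml
# ab-test-phase-2.yaml: Expand phase (70/30)
policies:
  - name: ab-test-models
    type: traffic_split
    description: "A/B test between GPT-4o and Claude Sonnet"
    variants:
      - model: gpt-4o
        weight: 70
        tag: variant-a
      - model: claude-sonnet-4-20250514
        weight: 30
        tag: variant-b
    enabled: true
```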

```bash
# Validate the updated config
kt policy lint --file ab-test-phase-2.yaml
```

Quality Scoring and Evaluation

Capturing Quality Signals

Use the events stream to evaluate model output quality:

```bash
# Export events with full metadata for quality analysis
kt export create \
  --type events \
  --format json \
  --since 7d \
  --description "Model quality evaluation dataset"
```
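
This page's policy summary also lists a quality-scorer policy type with min_score and action fields; a minimal sketch under that assumption (check the Policy Reference for the exact schema):

```yaml
policies:
  - name: score-model-outputs
    type: quality-scorer
    description: "Flag low-quality completions"
    min_score: 0.7   # assumed threshold field; see Policy Reference
    action: log      # assumed action value; a stricter setup might block
    enabled: true
```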

Evaluation Framework Integration

Feed Keeptrusts events into your evaluation pipeline:

```python
import os

import requests

API_URL = "https://api.keeptrusts.com/v1/events"
# Token supplied via environment, matching the $API_TOKEN used in the curl examples
API_TOKEN = os.environ["API_TOKEN"]
HEADERS = {"Authorization": f"Bearer {API_TOKEN}"}

def get_model_events(model, days=7):
    """Pull events for a specific model."""
    params = {
        "since": f"{days}d",
        "model": model,
        "format": "json",
        "limit": 1000,
    }
    response = requests.get(API_URL, headers=HEADERS, params=params)
    response.raise_for_status()
    return response.json()

def compare_models(model_a, model_b, days=7):
    """Compare two models on key metrics."""
    events_a = get_model_events(model_a, days)
    events_b = get_model_events(model_b, days)

    def metrics(events):
        latencies = [e["latency_ms"] for e in events]
        costs = [float(e["cost"]) for e in events]
        return {
            "count": len(events),
            "avg_latency_ms": sum(latencies) / len(latencies) if latencies else 0,
            "p99_latency_ms": sorted(latencies)[int(len(latencies) * 0.99)] if latencies else 0,
            "avg_cost": sum(costs) / len(costs) if costs else 0,
            "total_cost": sum(costs),
        }

    return {
        model_a: metrics(events_a),
        model_b: metrics(events_b),
    }
```
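
For example, comparing the two variants from the traffic split above:

```python
# Compare the A/B variants over the last week
report = compare_models("gpt-4o", "claude-sonnet-4-20250514", days=7)
for model, m in report.items():
    print(model, m)
```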

Quality Metrics to Track

| Metric | Source | Evaluation method |
|---|---|---|
| Latency (p50, p95, p99) | Events API `latency_ms` | Statistical comparison |
| Cost per request | Events API `cost` | Aggregation by model |
| Token efficiency | `output_tokens / input_tokens` | Ratio analysis |
| Error rate | Events with error status | Percentage comparison |
| Policy trigger rate | `policies_triggered` | Safety comparison |
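
As a sketch, token efficiency and error rate can be derived from the same events export; the `.status == "error"` predicate is an assumption about the event schema, so check yours:

```bash
# Token efficiency and error rate per model; the .status == "error"
# predicate is an assumption about the event schema
curl -H "Authorization: Bearer $API_TOKEN" \
  "https://api.keeptrusts.com/v1/events?since=7d&format=json" | \
  jq 'group_by(.model) | map({
    model: .[0].model,
    token_efficiency: ((map(.output_tokens) | add) / (map(.input_tokens) | add)),
    error_rate: ((map(select(.status == "error")) | length) / length)
  })'
```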

Model Lifecycle Management

Model Inventory

Track which models are in use across your organization:

```bash
# List distinct models in recent events
curl -H "Authorization: Bearer $API_TOKEN" \
  "https://api.keeptrusts.com/v1/events?since=30d&format=json" | \
  jq '[.[].model] | unique'
```

Model Deprecation Workflow

When retiring a model:

  1. Announce — Notify teams via your communication channel
  2. Warn — Add a warning policy for the deprecated model
  3. Redirect — Route traffic to the replacement model
  4. Block — Block requests to the deprecated model after the deadline

Example policies for the configurable phases follow.
```yaml
# Phase 2: Warn
policies:
  - name: deprecation-warning
    type: log
    description: "Log warning for deprecated model usage"
    conditions:
      model: gpt-4-turbo
    severity: warn
    enabled: true
```
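
Phase 3 (Redirect) can combine the model_filter fields shown earlier on this page; a sketch, assuming conditions can match the requested model the same way the warning policy does:

```yaml
# Phase 3: Redirect
policies:
  - name: redirect-deprecated
    type: model_filter
    description: "Redirect deprecated model traffic to the replacement"
    conditions:
      model: gpt-4-turbo     # match requests for the deprecated model
    preferred_model: gpt-4o  # assumed replacement target
    enabled: true
```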

```yaml
# Phase 4: Block
policies:
  - name: block-deprecated
    type: model_filter
    description: "Block deprecated model"
    blocked_models:
      - gpt-4-turbo
    enabled: true
```

Cost Optimization

Identify opportunities to use more cost-effective models:

```bash
# Analyze cost by model
curl -H "Authorization: Bearer $API_TOKEN" \
  "https://api.keeptrusts.com/v1/events?since=30d&format=json" | \
  jq 'group_by(.model) | map({
    model: .[0].model,
    requests: length,
    total_cost: (map(.cost | tonumber) | add),
    avg_tokens: (map(.input_tokens + .output_tokens) | add / length)
  }) | sort_by(-.total_cost)'
```

Use the Console Cost Center for visual cost breakdowns by model and provider.

Gateway Configuration for ML Workflows

Validating Configuration Changes

Always validate before deploying:

```bash
# Validate the configuration
kt policy lint --file ml-routing-config.yaml

# Check gateway health after deployment
kt doctor
```

Monitoring Model Performance in Production

```bash
# Tail events for a specific model
kt events tail --model gpt-4o

# Check recent event statistics
kt events list --since 1h --format table
```

The Console Events page provides filtering by model, provider, and decision type for visual exploration.

Success Metrics for ML Engineers

| Metric | Target | Source |
|---|---|---|
| Model routing accuracy | > 99% correct routing | Events metadata verification |
| A/B test significance | p < 0.05 | Evaluation framework |
| Model switch downtime | Zero | Event continuity check |
| Cost per quality-unit | Decreasing trend | Cost / quality score ratio |
| Model deprecation compliance | 100% by deadline | Events showing zero deprecated-model usage |

Next steps

For AI systems

  • Canonical terms: Keeptrusts, model routing, A/B testing, traffic splitting, quality scoring, model lifecycle, model filter
  • Key surfaces: Console Usage, Events API (with variant tags), Console Configurations
  • Commands: kt gateway run, kt policy lint, kt events list
  • Policy types: model_filter (allowed_models, preferred_model, conditions), traffic_split (variants with weight and tag), quality-scorer (min_score, action), cost_limit
  • Config concepts: multi-provider routing, use-case-based model selection (max_tokens conditions), canary rollout via traffic split weights
  • Best next pages: Policy Reference, Gateway Configuration, Events Guide

For engineers

  • Configure model routing with model_filter policy: set allowed_models, preferred_model, and conditions (e.g., max_tokens_gt: 2000)
  • Set up A/B tests with traffic_split policy: define variants with model, weight, and tag for each arm
  • Measure results via Events API filtering by variant tag metadata
  • Validate routing config: kt policy lint --file routing-policy.yaml
  • Deploy: kt gateway run --listen 0.0.0.0:41002 --policy-config routing-policy.yaml
  • Monitor cost-per-quality ratio in Console Usage

For leaders

  • Model routing through the gateway enables cost optimization — simple tasks go to cheaper models (gpt-4o-mini) while complex reasoning uses high-capability models (gpt-4o)
  • A/B testing with traffic splitting provides objective data for model selection decisions: quality scores, latency, and cost per variant
  • Model lifecycle management (introduction, canary, full rollout, deprecation) is controlled through policy configuration changes rather than application code deploys
  • Zero-downtime model switches through gateway routing mean model upgrades do not require application redeployment