Multi-Model Chat Comparison

The Keeptrusts Chat Workbench lets you compare multiple LLMs within the same governed environment. Because all models route through the same gateway policy chain, you get an apples-to-apples comparison with consistent governance applied.

Use this page when

  • You need to compare LLMs side by side with consistent governance policies applied.
  • You are benchmarking cost, latency, and quality across providers (OpenAI, Anthropic, Google) in a real policy environment.
  • You want to document model selection recommendations backed by governed test data.
  • You are preparing test prompts and evaluation criteria for model comparison workflows.

Primary audience

  • Primary: AI Engineers evaluating model performance, Platform Administrators selecting default models
  • Secondary: Technical Leaders making vendor decisions, Finance stakeholders comparing model costs

Why Compare Models in a Governed Environment

Most model comparison tools evaluate LLMs in isolation, without the governance layer that affects real-world usage. Comparing models through the Keeptrusts gateway gives you:

  • Consistent policy application: Every model is subject to the same input and output policies.
  • Real cost data: Actual token costs from your negotiated provider rates.
  • True latency measurements: End-to-end latency including gateway policy evaluation.
  • Compliance-ready results: Comparison data is recorded in the audit trail.

Setting Up Model Comparison

Available Models

The models available for comparison depend on your gateway configuration. Review your policy configuration to see which providers and models are enabled:

pack:
  name: multi-model-comparison-providers-1
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: openai
      provider:
    - id: anthropic
      provider:
    - id: google
      provider:

policies:
  chain:
    - audit-logger

policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true

All configured models appear in the Chat Workbench model selector.

Preparing Test Prompts

For meaningful comparisons, prepare a consistent set of test prompts (a sketch of one possible structure follows this list):

  1. Define evaluation criteria: What matters most — accuracy, creativity, conciseness, speed, cost?
  2. Create prompt categories: Group prompts by use case (summarization, analysis, code generation, Q&A).
  3. Include edge cases: Add prompts that test policy boundaries to see how different models behave under governance.
  4. Document expected outcomes: Define what a good response looks like for each prompt.
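
One way to keep the prompt set consistent across runs is to store it as structured data. The sketch below shows one possible layout; the categories, field names, and example prompts are illustrative, not a prescribed format.

# test_prompts.py - illustrative prompt catalog (field names are assumptions, not a product schema)
TEST_PROMPTS = [
    {
        "id": "summarize-quarterly-report",
        "category": "summarization",
        "prompt": "Summarize the attached quarterly report in five bullet points.",
        "criteria": ["accuracy", "conciseness"],
        "expected": "Covers revenue, costs, and outlook without fabricated figures.",
    },
    {
        "id": "code-review-suggestion",
        "category": "code generation",
        "prompt": "Review this function and suggest improvements: ...",
        "criteria": ["accuracy", "completeness"],
        "expected": "Identifies the defect and proposes an idiomatic fix.",
    },
    {
        "id": "policy-boundary-check",
        "category": "edge case",
        "prompt": "Draft an email that includes a customer's account number.",
        "criteria": ["policy compliance"],
        "expected": "The output policy redacts or blocks the sensitive content.",
    },
]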

Side-by-Side Model Testing

Manual Comparison

The simplest approach is to test the same prompt across models sequentially:

  1. Open the Chat Workbench.
  2. Select the first model (e.g., gpt-4o).
  3. Start a new conversation and send your test prompt.
  4. Note the response quality, response time, and any policy interventions.
  5. Select the second model (e.g., claude-sonnet).
  6. Start a new conversation and send the same prompt.
  7. Compare the responses.

Structured Comparison

For systematic evaluation, create a comparison matrix:

| Prompt | Model A (gpt-4o) | Model B (claude-sonnet) | Model C (gemini-pro) |
| --- | --- | --- | --- |
| Summarize quarterly report | Quality: 4/5, Tokens: 280 | Quality: 5/5, Tokens: 310 | Quality: 3/5, Tokens: 250 |
| Code review suggestion | Quality: 5/5, Tokens: 450 | Quality: 4/5, Tokens: 520 | Quality: 4/5, Tokens: 380 |
| Policy compliance check | Quality: 4/5, Tokens: 200 | Quality: 4/5, Tokens: 190 | Quality: 3/5, Tokens: 220 |
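
The matrix above can also be kept as structured records so that per-model averages fall out automatically. A minimal sketch, assuming quality scores and token counts are recorded by hand from the Chat Workbench:

from collections import defaultdict

# Each record mirrors one cell of the comparison matrix above.
results = [
    {"prompt": "Summarize quarterly report", "model": "gpt-4o", "quality": 4, "tokens": 280},
    {"prompt": "Summarize quarterly report", "model": "claude-sonnet", "quality": 5, "tokens": 310},
    {"prompt": "Summarize quarterly report", "model": "gemini-pro", "quality": 3, "tokens": 250},
    {"prompt": "Code review suggestion", "model": "gpt-4o", "quality": 5, "tokens": 450},
    # ... remaining rows from the matrix ...
]

# Group rows by model and report averages.
by_model = defaultdict(list)
for row in results:
    by_model[row["model"]].append(row)

for model, rows in by_model.items():
    avg_quality = sum(r["quality"] for r in rows) / len(rows)
    avg_tokens = sum(r["tokens"] for r in rows) / len(rows)
    print(f"{model}: avg quality {avg_quality:.1f}, avg tokens {avg_tokens:.0f}")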

Quality Comparison

Evaluation Dimensions

Rate each model's responses across multiple dimensions:

| Dimension | What to Evaluate |
| --- | --- |
| Accuracy | Are facts correct and claims verifiable? |
| Relevance | Does the response address the actual question? |
| Completeness | Does it cover all aspects of the prompt? |
| Conciseness | Is it appropriately brief without losing content? |
| Format compliance | Does it follow requested output formats? |
| Knowledge grounding | Does it properly cite bound knowledge assets? |
| Policy compliance | Does it avoid triggering output policies? |

Scoring Methodology

Use a consistent 1-5 scale:

| Score | Meaning |
| --- | --- |
| 5 | Excellent — meets or exceeds all criteria |
| 4 | Good — minor improvements possible |
| 3 | Adequate — usable but with notable gaps |
| 2 | Below expectations — significant issues |
| 1 | Poor — fails to meet basic requirements |
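
One simple way to turn per-dimension ratings into the single quality score used in later tables (for example, an average of 4.3 for a model) is to average the dimension scores for each response. This is an assumed convention, not a prescribed one:

from statistics import mean

# Ratings on the 1-5 scale above for a single response, keyed by evaluation dimension.
ratings = {
    "accuracy": 4,
    "relevance": 5,
    "completeness": 4,
    "conciseness": 3,
    "format_compliance": 5,
    "knowledge_grounding": 4,
    "policy_compliance": 5,
}

quality_score = mean(ratings.values())  # (4+5+4+3+5+4+5) / 7 = ~4.3 for this response
print(f"Quality score: {quality_score:.1f}")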

Cost Comparison

Per-Message Cost Analysis

Use the Events page in the console to compare costs:

  1. Filter events by the test conversation IDs.
  2. Note the input_tokens, output_tokens, and cost_usd for each event.
  3. Compare the cost of equivalent responses across models.

Cost Efficiency Ratio

Calculate the cost efficiency for each model:

Cost Efficiency = Quality Score / Cost per Message

A higher ratio indicates better value. For example:

| Model | Avg Quality | Avg Cost | Cost Efficiency |
| --- | --- | --- | --- |
| gpt-4o | 4.3 | $0.012 | 358 |
| claude-sonnet | 4.5 | $0.015 | 300 |
| gpt-4o-mini | 3.8 | $0.003 | 1267 |
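
Worked against the example table, the ratio is straightforward to compute. The figures below are the illustrative numbers from above, not benchmarks:

# Cost Efficiency = Quality Score / Cost per Message (example figures from the table above)
models = {
    "gpt-4o": {"avg_quality": 4.3, "avg_cost_usd": 0.012},
    "claude-sonnet": {"avg_quality": 4.5, "avg_cost_usd": 0.015},
    "gpt-4o-mini": {"avg_quality": 3.8, "avg_cost_usd": 0.003},
}

for name, m in models.items():
    efficiency = m["avg_quality"] / m["avg_cost_usd"]
    print(f"{name}: cost efficiency {efficiency:.0f}")  # ~358, 300, and ~1267 respectively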

Token Economy

Compare how many tokens each model uses to produce equivalent-quality responses. Some models are more verbose, which affects both cost and user experience.

Latency Benchmarking

Measuring Latency

Chat events record timing data that enables latency comparison:

  • Time to first token: How quickly the model starts generating.
  • Total response time: End-to-end time from prompt to complete response.
  • Gateway overhead: Time spent in policy evaluation (consistent across models).

Latency Factors

When comparing latency, account for:

| Factor | Impact |
| --- | --- |
| Model size | Larger models are generally slower |
| Input length | Longer prompts increase processing time |
| Output length | More tokens take more time to generate |
| Provider load | Provider-side congestion varies by time of day |
| Gateway policies | Complex policy chains add consistent overhead |
| Knowledge grounding | Asset recall adds context preparation time |

Benchmarking Methodology

For reliable latency benchmarks (a median-latency sketch follows this list):

  1. Test at similar times of day to control for provider load.
  2. Use identical prompts across all models.
  3. Run multiple iterations (minimum 5 per prompt per model).
  4. Report median latency, not average (to reduce outlier impact).
  5. Separate gateway overhead from provider latency.
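
A minimal sketch of steps 3 and 4: run each prompt several times per model and report the median rather than the mean. The latency values here are placeholders standing in for numbers you would collect from decision events:

from statistics import median

# Latency in milliseconds over five iterations per model (placeholder values, not measurements).
runs = {
    "gpt-4o": [820, 910, 1450, 870, 905],
    "claude-sonnet": [760, 795, 810, 2200, 788],
}

for model, latencies in runs.items():
    # The median is robust to the occasional slow outlier (e.g., the 2200 ms run above).
    print(f"{model}: median {median(latencies):.0f} ms over {len(latencies)} runs")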

Recording and Sharing Results

Using the Console

Comparison results are automatically captured in decision events. You can:

  1. Tag test conversations with a consistent identifier.
  2. Export the tagged events for analysis.
  3. Share the export with stakeholders.

Using the API

Query events for comparison analysis:

curl "$API_URL/v1/events?type=chat&from=2026-04-20&to=2026-04-21" \
-H "Authorization: Bearer $TOKEN" | \
jq '.events[] | {model, input_tokens, output_tokens, cost_usd, response_time_ms}'
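
If you prefer a script to jq, the sketch below groups the same export by model. It assumes the response shape shown above (an events array with model, cost_usd, and response_time_ms fields); treat the field names as illustrative and adjust them to your actual payload:

import json
import sys
from collections import defaultdict

# Usage (hypothetical): curl ... > events.json && python summarize_events.py < events.json
events = json.load(sys.stdin)["events"]

totals = defaultdict(lambda: {"cost": 0.0, "latency": [], "count": 0})
for e in events:
    t = totals[e["model"]]
    t["cost"] += e.get("cost_usd", 0.0)
    t["latency"].append(e.get("response_time_ms", 0))
    t["count"] += 1

for model, t in totals.items():
    avg_latency = sum(t["latency"]) / max(t["count"], 1)
    print(f"{model}: {t['count']} events, ${t['cost']:.4f} total, {avg_latency:.0f} ms avg")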

Documenting Recommendations

After comparison testing, document your findings:

  1. Primary recommendation: The model that best balances quality, cost, and latency for your use case.
  2. Alternative: A fallback model for specific scenarios (e.g., high-volume, low-complexity tasks).
  3. Not recommended: Models that underperformed with reasons.
  4. Policy interactions: Note any models that frequently triggered policies.

Best Practices

| Practice | Why It Matters |
| --- | --- |
| Use identical prompts across models | Ensures fair comparison |
| Test with governance applied | Reflects real production conditions |
| Include cost in evaluation | Quality alone does not determine the best choice |
| Run multiple iterations | Reduces noise from provider variability |
| Document and share findings | Enables informed team-wide model selection |
| Re-benchmark periodically | Model performance changes with provider updates |

Next steps

For AI systems

  • Canonical terms: multi-model comparison, side-by-side testing, model benchmarking, provider evaluation, cost comparison, latency measurement, policy-consistent evaluation.
  • Config: providers section in policy-config.yaml determines available models. All models pass through the same policy chain.
  • Best next pages: Chat API Integration, Customizing the Chat Experience, Chat Analytics.

For engineers

  • All models configured in the gateway's providers section appear in the Chat Workbench model selector.
  • Use identical prompts across models for fair comparison — the same governance policies apply to all.
  • Record latency from decision events (latency_ms and upstream_latency_ms) to separate gateway overhead from provider speed.
  • Compare token costs using settled amounts in the Spend page, not list prices.
  • Automate comparison via the Chat API (POST /v1/chat/completions) with different model values for repeatable benchmarks (see the sketch after this list).
  • Re-run benchmarks periodically — provider performance changes with model updates.
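
A minimal benchmarking sketch, assuming the gateway exposes an OpenAI-compatible POST /v1/chat/completions endpoint as noted above; the response fields (usage, token counts) and the error handling are simplified assumptions:

import os
import time
import requests

API_URL = os.environ["API_URL"]
TOKEN = os.environ["TOKEN"]
MODELS = ["gpt-4o", "claude-sonnet", "gemini-pro"]  # use the models enabled in your providers section
PROMPT = "Summarize the attached quarterly report in five bullet points."

for model in MODELS:
    start = time.monotonic()
    resp = requests.post(
        f"{API_URL}/v1/chat/completions",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"model": model, "messages": [{"role": "user", "content": PROMPT}]},
        timeout=120,
    )
    resp.raise_for_status()
    elapsed_ms = (time.monotonic() - start) * 1000
    usage = resp.json().get("usage", {})  # 'usage' shape assumes an OpenAI-compatible response
    print(f"{model}: {elapsed_ms:.0f} ms end-to-end, "
          f"{usage.get('prompt_tokens')} in / {usage.get('completion_tokens')} out tokens")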

For leaders

  • Governed comparison ensures real-world evaluation — results reflect actual policy overhead and cost, not isolated benchmarks.
  • Cost data comes from actual provider billing, not estimates, enabling accurate vendor negotiations.
  • Comparison results are recorded in the audit trail, supporting procurement decisions with evidence.
  • Model selection impacts both quality and budget — document the trade-off for each use case.
  • Re-benchmarking after provider updates prevents lock-in to underperforming defaults.