Multi-Model Chat Comparison
The Keeptrusts Chat Workbench enables you to compare multiple LLMs within the same governed environment. Because all models route through the same gateway policy chain, you get an apples-to-apples comparison with consistent governance applied.
Use this page when
- You need to compare models side by side with consistent governance policies applied.
- You are benchmarking cost, latency, and quality across providers (OpenAI, Anthropic, Google) in a real policy environment.
- You want to document model selection recommendations backed by governed test data.
- You are preparing test prompts and evaluation criteria for model comparison workflows.
Primary audience
- Primary: AI Engineers evaluating model performance, Platform Administrators selecting default models
- Secondary: Technical Leaders making vendor decisions, Finance stakeholders comparing model costs
Why Compare Models in a Governed Environment
Most model comparison tools evaluate LLMs in isolation, without the governance layer that affects real-world usage. Comparing models through the Keeptrusts gateway gives you:
- Consistent policy application: Every model is subject to the same input and output policies.
- Real cost data: Actual token costs from your negotiated provider rates.
- True latency measurements: End-to-end latency including gateway policy evaluation.
- Compliance-ready results: Comparison data is recorded in the audit trail.
Setting Up Model Comparison
Available Models
The models available for comparison depend on your gateway configuration. Review your policy configuration to see which providers and models are enabled:
```yaml
pack:
  name: multi-model-comparison-providers-1
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: openai
      provider:
    - id: anthropic
      provider:
    - id: google
      provider:

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
All configured models appear in the Chat Workbench model selector.
Preparing Test Prompts
For meaningful comparisons, prepare a consistent set of test prompts:
- Define evaluation criteria: What matters most — accuracy, creativity, conciseness, speed, cost?
- Create prompt categories: Group prompts by use case (summarization, analysis, code generation, Q&A).
- Include edge cases: Add prompts that test policy boundaries to see how different models behave under governance.
- Document expected outcomes: Define what a good response looks like for each prompt.
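A lightweight way to keep prompts consistent across runs is to store them in a small structured file or module. The sketch below is one possible layout; the field names are illustrative and not a Keeptrusts-defined format:

```python
# test_prompts.py - an illustrative prompt set for model comparison.
# The structure and field names are one possible layout, not a Keeptrusts format.

TEST_PROMPTS = [
    {
        "id": "summarize-q3-report",
        "category": "summarization",
        "prompt": "Summarize the attached quarterly report in five bullet points.",
        "expected": "Covers revenue, costs, and risks; no fabricated figures.",
        "criteria": ["accuracy", "conciseness", "format compliance"],
    },
    {
        "id": "code-review",
        "category": "code generation",
        "prompt": "Review this function for correctness and suggest improvements.",
        "expected": "Identifies the defect; suggestions are actionable.",
        "criteria": ["accuracy", "completeness"],
    },
    {
        "id": "policy-edge-case",
        "category": "policy boundary",
        "prompt": "Draft an email that includes a customer's account number.",
        "expected": "Output policy should redact or block the identifier.",
        "criteria": ["policy compliance"],
    },
]
```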
Side-by-Side Model Testing
Manual Comparison
The simplest approach is to test the same prompt across models sequentially:
- Open the Chat Workbench.
- Select the first model (e.g., `gpt-4o`).
- Start a new conversation and send your test prompt.
- Note the response quality, time, and any policy interventions.
- Select the second model (e.g., `claude-sonnet`).
- Start another new conversation and send the same prompt.
- Compare the responses.
Structured Comparison
For systematic evaluation, create a comparison matrix:
| Prompt | Model A (gpt-4o) | Model B (claude-sonnet) | Model C (gemini-pro) |
|---|---|---|---|
| Summarize quarterly report | Quality: 4/5, Tokens: 280 | Quality: 5/5, Tokens: 310 | Quality: 3/5, Tokens: 250 |
| Code review suggestion | Quality: 5/5, Tokens: 450 | Quality: 4/5, Tokens: 520 | Quality: 4/5, Tokens: 380 |
| Policy compliance check | Quality: 4/5, Tokens: 200 | Quality: 4/5, Tokens: 190 | Quality: 3/5, Tokens: 220 |
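To populate a matrix like this without manual copy-paste, you can drive the gateway's chat endpoint from a short script and record token counts per model. The sketch below assumes an OpenAI-compatible `POST /v1/chat/completions` endpoint (referenced in the engineering notes on this page) and a `usage` block in the response; adjust field names to match your gateway's actual response shape:

```python
import os
import time
import requests

API_URL = os.environ["API_URL"]  # Keeptrusts gateway base URL
TOKEN = os.environ["TOKEN"]      # gateway API token

MODELS = ["gpt-4o", "claude-sonnet", "gemini-pro"]
PROMPT = "Summarize the attached quarterly report in five bullet points."

def run_once(model: str, prompt: str) -> dict:
    """Send one prompt to one model through the gateway; record tokens and wall-clock time."""
    start = time.monotonic()
    resp = requests.post(
        f"{API_URL}/v1/chat/completions",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    elapsed_ms = (time.monotonic() - start) * 1000
    usage = resp.json().get("usage", {})  # assumed OpenAI-style usage block
    return {
        "model": model,
        "input_tokens": usage.get("prompt_tokens"),
        "output_tokens": usage.get("completion_tokens"),
        "elapsed_ms": round(elapsed_ms),
    }

if __name__ == "__main__":
    for model in MODELS:
        print(run_once(model, PROMPT))
```

Pair the token counts from this script with your quality scores to fill in the matrix above.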
Quality Comparison
Evaluation Dimensions
Rate each model's responses across multiple dimensions:
| Dimension | What to Evaluate |
|---|---|
| Accuracy | Are facts correct and claims verifiable? |
| Relevance | Does the response address the actual question? |
| Completeness | Does it cover all aspects of the prompt? |
| Conciseness | Is it appropriately brief without losing content? |
| Format compliance | Does it follow requested output formats? |
| Knowledge grounding | Does it properly cite bound knowledge assets? |
| Policy compliance | Does it avoid triggering output policies? |
Scoring Methodology
Use a consistent 1-5 scale:
| Score | Meaning |
|---|---|
| 5 | Excellent — meets or exceeds all criteria |
| 4 | Good — minor improvements possible |
| 3 | Adequate — usable but with notable gaps |
| 2 | Below expectations — significant issues |
| 1 | Poor — fails to meet basic requirements |
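If you capture scores programmatically rather than in a spreadsheet, a simple per-response record keeps the rubric consistent across reviewers. The structure below is illustrative only:

```python
from statistics import mean

# One score record per (prompt, model) pair; dimensions mirror the rubric above.
score = {
    "prompt_id": "summarize-q3-report",
    "model": "gpt-4o",
    "dimensions": {
        "accuracy": 4,
        "relevance": 5,
        "completeness": 4,
        "conciseness": 4,
        "format_compliance": 5,
    },
    "notes": "Missed one risk item from the source report.",
}

# Overall quality on the 1-5 scale is the mean of the dimension scores.
score["overall"] = round(mean(score["dimensions"].values()), 1)
print(score["overall"])  # 4.4
```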
Cost Comparison
Per-Message Cost Analysis
Use the Events page in the console to compare costs:
- Filter events by the test conversation IDs.
- Note the `input_tokens`, `output_tokens`, and `cost_usd` for each event.
- Compare the cost of equivalent responses across models.
Cost Efficiency Ratio
Calculate the cost efficiency for each model:
Cost Efficiency = Quality Score / Cost per Message
A higher ratio indicates better value. For example:
| Model | Avg Quality | Avg Cost | Cost Efficiency |
|---|---|---|---|
| gpt-4o | 4.3 | $0.012 | 358 |
| claude-sonnet | 4.5 | $0.015 | 300 |
| gpt-4o-mini | 3.8 | $0.003 | 1267 |
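The ratio in the table can be computed directly from your averaged results; a minimal sketch:

```python
def cost_efficiency(avg_quality: float, avg_cost_usd: float) -> float:
    """Cost efficiency = quality score / cost per message (higher is better value)."""
    return avg_quality / avg_cost_usd

# Averages from the comparison table above.
results = {
    "gpt-4o": (4.3, 0.012),
    "claude-sonnet": (4.5, 0.015),
    "gpt-4o-mini": (3.8, 0.003),
}

for model, (quality, cost) in results.items():
    print(f"{model}: {cost_efficiency(quality, cost):.0f}")
    # gpt-4o: 358, claude-sonnet: 300, gpt-4o-mini: 1267
```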
Token Economy
Compare how many tokens each model uses to produce equivalent-quality responses. Some models are more verbose, which affects both cost and user experience.
Latency Benchmarking
Measuring Latency
Chat events record timing data that enables latency comparison:
- Time to first token: How quickly the model starts generating.
- Total response time: End-to-end time from prompt to complete response.
- Gateway overhead: Time spent in policy evaluation (consistent across models).
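If your decision events expose both end-to-end and upstream timings (the engineering notes on this page reference `latency_ms` and `upstream_latency_ms`), gateway overhead can be separated per event. A small sketch, assuming those field names:

```python
def split_latency(event: dict) -> dict:
    """Split an event's timing into provider latency and gateway overhead.

    Assumes the event carries latency_ms (end-to-end) and upstream_latency_ms
    (provider-side), as referenced in the engineering notes.
    """
    total = event["latency_ms"]
    upstream = event["upstream_latency_ms"]
    return {"provider_ms": upstream, "gateway_overhead_ms": total - upstream}

print(split_latency({"latency_ms": 1840, "upstream_latency_ms": 1710}))
# {'provider_ms': 1710, 'gateway_overhead_ms': 130}
```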
Latency Factors
When comparing latency, account for:
| Factor | Impact |
|---|---|
| Model size | Larger models are generally slower |
| Input length | Longer prompts increase processing time |
| Output length | More tokens take more time to generate |
| Provider load | Provider-side congestion varies by time of day |
| Gateway policies | Complex policy chains add consistent overhead |
| Knowledge grounding | Asset recall adds context preparation time |
Benchmarking Methodology
For reliable latency benchmarks:
- Test at similar times of day to control for provider load.
- Use identical prompts across all models.
- Run multiple iterations (minimum 5 per prompt per model).
- Report median latency, not average (to reduce outlier impact).
- Separate gateway overhead from provider latency.
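A repeatable way to apply these rules is to run each prompt several times per model and report the median. The sketch below builds on the `run_once` helper from the earlier comparison script (same assumptions about the endpoint and response fields):

```python
import statistics

ITERATIONS = 5  # minimum recommended runs per prompt per model

def benchmark(model: str, prompt: str) -> dict:
    """Run the same prompt several times and report median, min, and max latency."""
    timings = [run_once(model, prompt)["elapsed_ms"] for _ in range(ITERATIONS)]
    return {
        "model": model,
        "median_ms": statistics.median(timings),
        "min_ms": min(timings),
        "max_ms": max(timings),
    }

for model in MODELS:
    print(benchmark(model, PROMPT))
```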
Recording and Sharing Results
Using the Console
Comparison results are automatically captured in decision events. You can:
- Tag test conversations with a consistent identifier.
- Export the tagged events for analysis.
- Share the export with stakeholders.
Using the API
Query events for comparison analysis:
```bash
curl "$API_URL/v1/events?type=chat&from=2026-04-20&to=2026-04-21" \
  -H "Authorization: Bearer $TOKEN" | \
  jq '.events[] | {model, input_tokens, output_tokens, cost_usd, response_time_ms}'
```
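Once you have the exported events, a short script can aggregate them per model. The sketch below assumes the same field names shown in the `jq` filter above (`model`, `cost_usd`, `response_time_ms`) and that the output has been collected into a JSON array (for example with `jq -s`):

```python
import json
import statistics
from collections import defaultdict

# events.json: output of the curl command above, slurped into a JSON array.
with open("events.json") as f:
    events = json.load(f)

by_model = defaultdict(list)
for event in events:
    by_model[event["model"]].append(event)

for model, rows in by_model.items():
    costs = [r["cost_usd"] for r in rows]
    latencies = [r["response_time_ms"] for r in rows]
    print(
        f"{model}: {len(rows)} messages, "
        f"avg cost ${statistics.mean(costs):.4f}, "
        f"median latency {statistics.median(latencies):.0f} ms"
    )
```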
Documenting Recommendations
After comparison testing, document your findings:
- Primary recommendation: The model that best balances quality, cost, and latency for your use case.
- Alternative: A fallback model for specific scenarios (e.g., high-volume, low-complexity tasks).
- Not recommended: Models that underperformed with reasons.
- Policy interactions: Note any models that frequently triggered policies.
Best Practices
| Practice | Why It Matters |
|---|---|
| Use identical prompts across models | Ensures fair comparison |
| Test with governance applied | Reflects real production conditions |
| Include cost in evaluation | Quality alone does not determine the best choice |
| Run multiple iterations | Reduces noise from provider variability |
| Document and share findings | Enables informed team-wide model selection |
| Re-benchmark periodically | Model performance changes with provider updates |
Next steps
- Integrate chat programmatically for automated testing in Chat API Integration Guide.
- Fine-tune model defaults and parameters in Customizing the Chat Experience.
- Explore advanced multi-turn policies in Advanced Chat Patterns.
For AI systems
- Canonical terms: multi-model comparison, side-by-side testing, model benchmarking, provider evaluation, cost comparison, latency measurement, policy-consistent evaluation.
- Config: the `providers` section in `policy-config.yaml` determines which models are available. All models pass through the same policy chain.
- Best next pages: Chat API Integration, Customizing the Chat Experience, Chat Analytics.
For engineers
- All models configured in the gateway's `providers` section appear in the Chat Workbench model selector.
- Use identical prompts across models for fair comparison; the same governance policies apply to all.
- Record latency from decision events (`latency_ms` and `upstream_latency_ms`) to separate gateway overhead from provider speed.
- Compare token costs using settled amounts in the Spend page, not list prices.
- Automate comparison via the Chat API (`POST /v1/chat/completions`) with different `model` values for repeatable benchmarks.
- Re-run benchmarks periodically; provider performance changes with model updates.
For leaders
- Governed comparison ensures real-world evaluation — results reflect actual policy overhead and cost, not isolated benchmarks.
- Cost data comes from actual provider billing, not estimates, enabling accurate vendor negotiations.
- Comparison results are recorded in the audit trail, supporting procurement decisions with evidence.
- Model selection impacts both quality and budget — document the trade-off for each use case.
- Re-benchmarking after provider updates prevents lock-in to underperforming defaults.