Multi-Model Chat Comparison
The Keeptrusts Chat Workbench enables you to compare multiple LLMs within the same governed environment. Because all models route through the same gateway policy chain, you get an apples-to-apples comparison with consistent governance applied.
Use this page when
- You need to compare models side by side with consistent governance policies applied.
- You are benchmarking cost, latency, and quality across providers (OpenAI, Anthropic, Google) in a real policy environment.
- You want to document model selection recommendations backed by governed test data.
- You are preparing test prompts and evaluation criteria for model comparison workflows.
Primary audience
- Primary: AI Engineers evaluating model performance, Platform Administrators selecting default models
- Secondary: Technical Leaders making vendor decisions, Finance stakeholders comparing model costs
Why Compare Models in a Governed Environment
Most model comparison tools evaluate LLMs in isolation, without the governance layer that affects real-world usage. Comparing models through the Keeptrusts gateway gives you:
- Consistent policy application: Every model is subject to the same input and output policies.
- Real cost data: Actual token costs from your negotiated provider rates.
- True latency measurements: End-to-end latency including gateway policy evaluation.
- Compliance-ready results: Comparison data is recorded in the audit trail.
Setting Up Model Comparison
Available Models
The models available for comparison depend on your gateway configuration. Review your policy configuration to see which providers and models are enabled:
```yaml
pack:
  name: multi-model-comparison-providers-1
  version: 1.0.0
  enabled: true

providers:
  targets:
    - id: openai
      provider:
    - id: anthropic
      provider:
    - id: google
      provider:

policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
```
All configured models appear in the Chat Workbench model selector.
Preparing Test Prompts
For meaningful comparisons, prepare a consistent set of test prompts:
- Define evaluation criteria: What matters most — accuracy, creativity, conciseness, speed, cost?
- Create prompt categories: Group prompts by use case (summarization, analysis, code generation, Q&A).
- Include edge cases: Add prompts that test policy boundaries to see how different models behave under governance.
- Document expected outcomes: Define what a good response looks like for each prompt.
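A lightweight way to keep prompts consistent across runs is to store them in a small structured file or module. The sketch below is one possible layout; the field names are illustrative and not a Keeptrusts-defined format:

```python
# test_prompts.py - an illustrative prompt set for model comparison.
# The structure and field names are one possible layout, not a Keeptrusts format.

TEST_PROMPTS = [
    {
        "id": "summarize-q3-report",
        "category": "summarization",
        "prompt": "Summarize the attached quarterly report in five bullet points.",
        "expected": "Covers revenue, costs, and risks; no fabricated figures.",
        "criteria": ["accuracy", "conciseness", "format compliance"],
    },
    {
        "id": "code-review",
        "category": "code generation",
        "prompt": "Review this function for correctness and suggest improvements.",
        "expected": "Identifies the defect; suggestions are actionable.",
        "criteria": ["accuracy", "completeness"],
    },
    {
        "id": "policy-edge-case",
        "category": "policy boundary",
        "prompt": "Draft an email that includes a customer's account number.",
        "expected": "Output policy should redact or block the identifier.",
        "criteria": ["policy compliance"],
    },
]
```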
Side-by-Side Model Testing
Manual Comparison
The simplest approach is to test the same prompt across models sequentially:
- Open the Chat Workbench.
- Select the first model (e.g., `gpt-4o`).
- Start a new conversation and send your test prompt.
- Note the response quality, time, and any policy interventions.
- Select the second model (e.g., `claude-sonnet`).
- Start another new conversation and send the same prompt.
- Compare the responses.
Structured Comparison
For systematic evaluation, create a comparison matrix:
| Prompt | Model A (gpt-4o) | Model B (claude-sonnet) | Model C (gemini-pro) |
|---|---|---|---|
| Summarize quarterly report | Quality: 4/5, Tokens: 280 | Quality: 5/5, Tokens: 310 | Quality: 3/5, Tokens: 250 |
| Code review suggestion | Quality: 5/5, Tokens: 450 | Quality: 4/5, Tokens: 520 | Quality: 4/5, Tokens: 380 |
| Policy compliance check | Quality: 4/5, Tokens: 200 | Quality: 4/5, Tokens: 190 | Quality: 3/5, Tokens: 220 |
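To populate a matrix like this without manual copy-paste, you can drive the gateway's chat endpoint from a short script and record token counts per model. The sketch below assumes an OpenAI-compatible `POST /v1/chat/completions` endpoint (referenced in the engineering notes on this page) and a `usage` block in the response; adjust field names to match your gateway's actual response shape:

```python
import os
import time
import requests

API_URL = os.environ["API_URL"]  # Keeptrusts gateway base URL
TOKEN = os.environ["TOKEN"]      # gateway API token

MODELS = ["gpt-4o", "claude-sonnet", "gemini-pro"]
PROMPT = "Summarize the attached quarterly report in five bullet points."

def run_once(model: str, prompt: str) -> dict:
    """Send one prompt to one model through the gateway; record tokens and wall-clock time."""
    start = time.monotonic()
    resp = requests.post(
        f"{API_URL}/v1/chat/completions",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"model": model, "messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    elapsed_ms = (time.monotonic() - start) * 1000
    usage = resp.json().get("usage", {})  # assumed OpenAI-style usage block
    return {
        "model": model,
        "input_tokens": usage.get("prompt_tokens"),
        "output_tokens": usage.get("completion_tokens"),
        "elapsed_ms": round(elapsed_ms),
    }

if __name__ == "__main__":
    for model in MODELS:
        print(run_once(model, PROMPT))
```

Pair the token counts from this script with your quality scores to fill in the matrix above.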
Quality Comparison
Evaluation Dimensions
Rate each model's responses across multiple dimensions:
| Dimension | What to Evaluate |
|---|---|
| Accuracy | Are facts correct and claims verifiable? |
| Relevance | Does the response address the actual question? |
| Completeness | Does it cover all aspects of the prompt? |
| Conciseness | Is it appropriately brief without losing content? |
| Format compliance | Does it follow requested output formats? |
| Knowledge grounding | Does it properly cite bound knowledge assets? |
| Policy compliance | Does it avoid triggering output policies? |
Scoring Methodology
Use a consistent 1-5 scale:
| Score | Meaning |
|---|---|
| 5 | Excellent — meets or exceeds all criteria |
| 4 | Good — minor improvements possible |
| 3 | Adequate — usable but with notable gaps |
| 2 | Below expectations — significant issues |
| 1 | Poor — fails to meet basic requirements |
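If you capture scores programmatically rather than in a spreadsheet, a simple per-response record keeps the rubric consistent across reviewers. The structure below is illustrative only:

```python
from statistics import mean

# One score record per (prompt, model) pair; dimensions mirror the rubric above.
score = {
    "prompt_id": "summarize-q3-report",
    "model": "gpt-4o",
    "dimensions": {
        "accuracy": 4,
        "relevance": 5,
        "completeness": 4,
        "conciseness": 4,
        "format_compliance": 5,
    },
    "notes": "Missed one risk item from the source report.",
}

# Overall quality on the 1-5 scale is the mean of the dimension scores.
score["overall"] = round(mean(score["dimensions"].values()), 1)
print(score["overall"])  # 4.4
```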
Cost Comparison
Per-Message Cost Analysis
Use the Events page in the console to compare costs:
- Filter events by the test conversation IDs.
- Note the `input_tokens`, `output_tokens`, and `cost_usd` for each event.
- Compare the cost of equivalent responses across models.
Cost Efficiency Ratio
Calculate the cost efficiency for each model:
Cost Efficiency = Quality Score / Cost per Message
A higher ratio indicates better value. For example:
| Model | Avg Quality | Avg Cost | Cost Efficiency |
|---|---|---|---|
| gpt-4o | 4.3 | $0.012 | 358 |
| claude-sonnet | 4.5 | $0.015 | 300 |
| gpt-4o-mini | 3.8 | $0.003 | 1267 |
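The ratio in the table can be computed directly from your averaged results; a minimal sketch:

```python
def cost_efficiency(avg_quality: float, avg_cost_usd: float) -> float:
    """Cost efficiency = quality score / cost per message (higher is better value)."""
    return avg_quality / avg_cost_usd

# Averages from the comparison table above.
results = {
    "gpt-4o": (4.3, 0.012),
    "claude-sonnet": (4.5, 0.015),
    "gpt-4o-mini": (3.8, 0.003),
}

for model, (quality, cost) in results.items():
    print(f"{model}: {cost_efficiency(quality, cost):.0f}")
    # gpt-4o: 358, claude-sonnet: 300, gpt-4o-mini: 1267
```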
Token Economy
Compare how many tokens each model uses to produce equivalent-quality responses. Some models are more verbose, which affects both cost and user experience.
Latency Benchmarking
Measuring Latency
Chat events record timing data that enables latency comparison:
- Time to first token: How quickly the model starts generating.
- Total response time: End-to-end time from prompt to complete response.
- Gateway overhead: Time spent in policy evaluation (consistent across models).
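If your decision events expose both end-to-end and upstream timings (the engineering notes on this page reference `latency_ms` and `upstream_latency_ms`), gateway overhead can be separated per event. A small sketch, assuming those field names:

```python
def split_latency(event: dict) -> dict:
    """Split an event's timing into provider latency and gateway overhead.

    Assumes the event carries latency_ms (end-to-end) and upstream_latency_ms
    (provider-side), as referenced in the engineering notes.
    """
    total = event["latency_ms"]
    upstream = event["upstream_latency_ms"]
    return {"provider_ms": upstream, "gateway_overhead_ms": total - upstream}

print(split_latency({"latency_ms": 1840, "upstream_latency_ms": 1710}))
# {'provider_ms': 1710, 'gateway_overhead_ms': 130}
```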
Latency Factors
When comparing latency, account for:
| Factor | Impact |
|---|---|
| Model size | Larger models are generally slower |
| Input length | Longer prompts increase processing time |
| Output length | More tokens take more time to generate |
| Provider load | Provider-side congestion varies by time of day |
| Gateway policies | Complex policy chains add consistent overhead |
| Knowledge grounding | Asset recall adds context preparation time |
Benchmarking Methodology
For reliable latency benchmarks:
- Test at similar times of day to control for provider load.
- Use identical prompts across all models.
- Run multiple iterations (minimum 5 per prompt per model).
- Report median latency, not average (to reduce outlier impact).
- Separate gateway overhead from provider latency.
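A repeatable way to apply these rules is to run each prompt several times per model and report the median. The sketch below builds on the `run_once` helper from the earlier comparison script (same assumptions about the endpoint and response fields):

```python
import statistics

ITERATIONS = 5  # minimum recommended runs per prompt per model

def benchmark(model: str, prompt: str) -> dict:
    """Run the same prompt several times and report median, min, and max latency."""
    timings = [run_once(model, prompt)["elapsed_ms"] for _ in range(ITERATIONS)]
    return {
        "model": model,
        "median_ms": statistics.median(timings),
        "min_ms": min(timings),
        "max_ms": max(timings),
    }

for model in MODELS:
    print(benchmark(model, PROMPT))
```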
Recording and Sharing Results
Using the Console
Comparison results are automatically captured in decision events. You can:
- Tag test conversations with a consistent identifier.
- Export the tagged events for analysis.
- Share the export with stakeholders.
Using the API
Query events for comparison analysis:
```bash
curl "$API_URL/v1/events?type=chat&from=2026-04-20&to=2026-04-21" \
  -H "Authorization: Bearer $TOKEN" | \
  jq '.events[] | {model, input_tokens, output_tokens, cost_usd, response_time_ms}'
```
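Once you have the exported events, a short script can aggregate them per model. The sketch below assumes the same field names shown in the `jq` filter above (`model`, `cost_usd`, `response_time_ms`) and that the output has been collected into a JSON array (for example with `jq -s`):

```python
import json
import statistics
from collections import defaultdict

# events.json: output of the curl command above, slurped into a JSON array.
with open("events.json") as f:
    events = json.load(f)

by_model = defaultdict(list)
for event in events:
    by_model[event["model"]].append(event)

for model, rows in by_model.items():
    costs = [r["cost_usd"] for r in rows]
    latencies = [r["response_time_ms"] for r in rows]
    print(
        f"{model}: {len(rows)} messages, "
        f"avg cost ${statistics.mean(costs):.4f}, "
        f"median latency {statistics.median(latencies):.0f} ms"
    )
```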
Documenting Recommendations
After comparison testing, document your findings:
- Primary recommendation: The model that best balances quality, cost, and latency for your use case.
- Alternative: A fallback model for specific scenarios (e.g., high-volume, low-complexity tasks).
- Not recommended: Models that underperformed with reasons.
- Policy interactions: Note any models that frequently triggered policies.
Best Practices
| Practice | Why It Matters |
|---|---|
| Use identical prompts across models | Ensures fair comparison |
| Test with governance applied | Reflects real production conditions |
| Include cost in evaluation | Quality alone does not determine the best choice |
| Run multiple iterations | Reduces noise from provider variability |
| Document and share findings | Enables informed team-wide model selection |
| Re-benchmark periodically | Model performance changes with provider updates |
Next steps
- Integrate chat programmatically for automated testing in Chat API Integration Guide.
- Fine-tune model defaults and parameters in Customizing the Chat Experience.
- Explore advanced multi-turn policies in Advanced Chat Patterns.
For AI systems
- Canonical terms: multi-model comparison, side-by-side testing, model benchmarking, provider evaluation, cost comparison, latency measurement, policy-consistent evaluation.
- Config: the `providers` section in `policy-config.yaml` determines which models are available. All models pass through the same policy chain.
- Best next pages: Chat API Integration, Customizing the Chat Experience, Chat Analytics.
For engineers
- All models configured in the gateway's `providers` section appear in the Chat Workbench model selector.
- Use identical prompts across models for fair comparison; the same governance policies apply to all.
- Record latency from decision events (`latency_ms` and `upstream_latency_ms`) to separate gateway overhead from provider speed.
- Compare token costs using settled amounts in the Spend page, not list prices.
- Automate comparison via the Chat API (`POST /v1/chat/completions`) with different `model` values for repeatable benchmarks.
- Re-run benchmarks periodically; provider performance changes with model updates.
For leaders
- Governed comparison ensures real-world evaluation — results reflect actual policy overhead and cost, not isolated benchmarks.
- Cost data comes from actual provider billing, not estimates, enabling accurate vendor negotiations.
- Comparison results are recorded in the audit trail, supporting procurement decisions with evidence.
- Model selection impacts both quality and budget — document the trade-off for each use case.
- Re-benchmarking after provider updates prevents lock-in to underperforming defaults.