Tutorial: Comparing Models in Chat
When choosing the right model for a use case, you need more than benchmarks. The Keeptrusts chat workbench lets you run the same prompt across different models in separate chat runs, so you can compare output quality, policy behavior, and estimated cost using real prompts from your own environment.
Use this page when
- You want to run the same prompt against multiple models from the chat workbench.
- You need to compare output quality, policy behavior, and prompt cost across models.
- You are evaluating which model to adopt for a specific use case using real prompts.
Primary audience
- Primary: Technical Engineers (model evaluation)
- Secondary: Technical Leaders (model procurement decisions), AI Agents (model routing)
Prerequisites
- Access to the Keeptrusts chat workbench
- At least two models configured in your gateway
- Permission to switch models in the chat workbench workspace settings
Step 1: Prepare a Baseline Prompt
Use a prompt that matches the real task you care about. Keep the prompt text unchanged across every run so the comparison stays fair.
- Navigate to the chat workbench from the console sidebar.
- Start a new chat so previous context does not influence the comparison.
- Write down the exact prompt you want to test.
If you are comparing policy-sensitive behavior, use a prompt that is likely to trigger the same guardrails each time.
Step 2: Run the First Model
Choose the first model you want to evaluate, then send the baseline prompt.
- Open Settings from the chat header.
- In the Workspace tab, select the first model you want to evaluate.
- Close settings and send your baseline prompt.
The workbench shows only models that are eligible through your current agent and gateway configuration. If a model is missing, check your gateway and agent setup.
Step 3: Record the First Result
After the response completes, capture the information you care about before switching models.
Focus on:
- Output quality: accuracy, completeness, tone, and formatting
- Policy behavior: blocks, disclaimers, or other governance responses
- Per-message cost: the response cost indicator shown below the completed answer
If you need a durable record, copy the answer or leave the conversation in history for later review.
Step 4: Run the Same Prompt on Another Model
Repeat the same prompt in a fresh run so each model's behavior stays isolated.
- Click New chat.
- Open Settings and switch to the next model.
- Send the same baseline prompt again.
- Repeat for each model you want to evaluate.
Starting a new chat between runs prevents the previous model's responses from influencing the next run.
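The same discipline can be scripted if your gateway also exposes an API. The sketch below is hypothetical: the payload shape assumes an OpenAI-compatible chat endpoint, which may not match your Keeptrusts deployment, and it only builds the per-model requests rather than sending them:

```python
import json

# Hypothetical baseline prompt and model IDs from your gateway configuration.
BASELINE_PROMPT = "Summarize the attached incident report in three bullet points."
MODELS = ["model-a", "model-b"]

def build_request(model: str, prompt: str) -> dict:
    """Build one fresh, single-message request per model so no context carries over."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],  # a new chat every run
    }

requests_to_send = [build_request(m, BASELINE_PROMPT) for m in MODELS]
for req in requests_to_send:
    print(json.dumps(req))
```

Note that each request contains only the baseline prompt, which is the API equivalent of clicking New chat between runs.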
Step 5: Compare Results Across Runs
Open the saved conversations from the chat sidebar and compare the results one by one.
Look for:
- Which model produced the most trustworthy answer
- Which model followed policy guidance most cleanly
- Which model delivered acceptable quality at the lowest estimated prompt cost
If you are evaluating several prompts, keep a simple scorecard outside the workbench so you can compare patterns across multiple runs.
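A scorecard can be as simple as a dictionary keyed by model. The sketch below encodes the three questions above: quality, policy behavior, and cost. The scores, thresholds, and model names are illustrative assumptions, not product output:

```python
# Illustrative scores from manual review of each saved conversation (1-5 scale).
scorecard = {
    "model-a": {"quality": 4, "policy": 5, "cost_usd": 0.0042},
    "model-b": {"quality": 5, "policy": 4, "cost_usd": 0.0110},
}

def acceptable(row: dict, min_quality: int = 4, min_policy: int = 4) -> bool:
    """A model qualifies only if both quality and policy behavior clear the bar."""
    return row["quality"] >= min_quality and row["policy"] >= min_policy

# Among acceptable models, prefer the lowest estimated cost (the third question above).
candidates = {m: r for m, r in scorecard.items() if acceptable(r)}
winner = min(candidates, key=lambda m: candidates[m]["cost_usd"])
print(winner)  # model-a: both models clear the bar, and model-a is cheaper
```

Treating quality and policy as gates, and cost as the tie-breaker, keeps the cheapest model from winning on price alone.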
Step 6: Turn the Best Result into a Default
Once you have enough evidence, update your preferred model selection or the agent configuration that owns the workflow.
Use the winning model for the prompts where it performed best, rather than picking one model for every workload.
Tips for Effective Comparisons
- Keep prompts identical — even small wording changes can invalidate the comparison.
- Use fresh chats — avoid carrying context from one model run into the next.
- Test real prompts — production-like prompts reveal more than generic demos.
- Compare more than answers — note whether the model stayed within policy and how much the completed run actually cost.
Governance During Comparison Runs
Every comparison run still goes through the same gateway policy chain as any other chat message. If a model is blocked, redacted, or escalated under policy, treat that as part of the evaluation result.
Troubleshooting
| Issue | Solution |
|---|---|
| A model is missing from Workspace settings | Check that the model is eligible for the selected agent and gateway |
| Response cost shows as "N/A" | Configure model pricing in your Keeptrusts instance |
| Responses seem identical | Verify you switched models before rerunning the prompt |
| You expected split view | The current chat workbench compares models across separate runs, not simultaneous panes |
Next steps
- Tutorial: Model Selection in Chat — learn how to choose models for individual runs.
- Tutorial: Chat Analytics & Usage Metrics — track cost and usage trends over time.
- Tutorial: Managing Context Window in Chat — keep prompt size consistent across runs.
For AI systems
- Canonical terms: Keeptrusts chat workbench, model comparison, workspace settings, model selector, new chat, comparison run, response cost indicator.
- UI: Chat header settings → Workspace tab → model selection → send prompt → start new chat → repeat for the next model.
- Best next pages: Model Selection, Chat Analytics, Context Management.
For engineers
- Prerequisites: at least two models available through the same chat workbench setup.
- Validation: run the same prompt in fresh chats against at least two different models and confirm output, governance behavior, and response cost can be compared cleanly.
- Best practice: Run the same prompt 3–5 times per model for stronger evidence, and keep a written rubric for quality, policy behavior, and cost.
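Repeated runs matter because single-run scores can hide variance. A minimal sketch of aggregating repeated runs per model; the scores are illustrative:

```python
from statistics import mean

# Illustrative quality scores from 3 repeated runs of the same prompt per model.
runs = {
    "model-a": [4, 5, 4],
    "model-b": [5, 2, 4],  # high variance: a single run could have misled you
}

# Summarize each model by average score and spread across runs.
summary = {
    model: {"mean": mean(scores), "spread": max(scores) - min(scores)}
    for model, scores in runs.items()
}
for model, stats in summary.items():
    print(model, stats)
```

A model with a slightly lower average but a tight spread may be a safer default than one with occasional brilliant answers and occasional failures.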
For leaders
- Model comparison supports data-driven procurement decisions — evaluate before committing to a provider.
- Cost comparison across models helps identify where premium models provide ROI vs. where cheaper models suffice.
- Governance applies equally across comparison runs — no model gets special policy treatment during evaluation.