
Tutorial: Comparing Models in Chat

When choosing the right model for a use case, you need more than benchmarks. The Keeptrusts chat workbench lets you run the same prompt across different models in separate chat runs so you can compare output quality, policy behavior, and estimated prompt cost using real prompts from your own environment.

Use this page when

  • You want to run the same prompt against multiple models from the chat workbench.
  • You need to compare output quality, policy behavior, and prompt cost across models.
  • You are evaluating which model to adopt for a specific use case using real prompts.

Primary audience

  • Primary: Technical Engineers (model evaluation)
  • Secondary: Technical Leaders (model procurement decisions), AI Agents (model routing)

Prerequisites

  • Access to the Keeptrusts chat workbench
  • At least two models configured in your gateway
  • Permission to switch models in the chat workbench workspace settings

Step 1: Prepare a Baseline Prompt

Use a prompt that matches the real task you care about. Keep the prompt text unchanged across every run so the comparison stays fair.

  1. Navigate to the chat workbench from the console sidebar.
  2. Start a new chat so previous context does not influence the comparison.
  3. Write down the exact prompt you want to test.

If you are comparing policy-sensitive behavior, use a prompt that is likely to trigger the same guardrails each time.
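
For example, a baseline prompt for a policy-sensitive summarization task might look like the following. The wording is purely illustrative; substitute a real prompt from your own workload.

```
Summarize the attached customer incident report in three bullet points
for an executive audience. Quote the customer's account ID and contact
details exactly as they appear in the report.
```

A prompt like this exercises both output quality (summarization) and policy behavior (handling of personal data), so differences between models show up in a single run.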

Step 2: Run the First Model

Choose the first model you want to evaluate, then send the baseline prompt.

  1. Open Settings from the chat header.
  2. In the Workspace tab, select the first model you want to evaluate.
  3. Close settings and send your baseline prompt.

The workbench shows only models that are eligible through your current agent and gateway configuration. If a model is missing, check your gateway and agent setup.

Step 3: Record the First Result

After the response completes, capture the information you care about before switching models.

Focus on:

  • Output quality: accuracy, completeness, tone, and formatting
  • Policy behavior: blocks, disclaimers, or other governance responses
  • Per-message cost: the response cost indicator shown below the completed answer

If you need a durable record, copy the answer or leave the conversation in history for later review.
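
If you want a structured record, a minimal per-run entry might look like the sketch below. The field names and values are illustrative assumptions, not a Keeptrusts schema.

```python
# A minimal per-run record kept outside the workbench.
# Field names and values are illustrative, not a Keeptrusts schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ComparisonRun:
    model: str                 # model selected in Workspace settings
    prompt: str                # the unchanged baseline prompt
    quality_score: int         # e.g. 1-5 for accuracy, completeness, tone
    policy_notes: str          # blocks, disclaimers, or other governance responses
    cost_usd: Optional[float]  # response cost indicator; None if shown as "N/A"

run = ComparisonRun(
    model="model-a",
    prompt="Summarize the attached incident report in three bullet points.",
    quality_score=4,
    policy_notes="Added a compliance disclaimer; nothing blocked.",
    cost_usd=0.0042,
)
```

Recording the same fields for every run makes the side-by-side comparison in Step 5 straightforward.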

Step 4: Run the Same Prompt on Another Model

Send the identical baseline prompt in a fresh run so each model's behavior stays isolated.

  1. Click New chat.
  2. Open Settings and switch to the next model.
  3. Send the same baseline prompt again.
  4. Repeat for each model you want to evaluate.

Starting a new chat between runs keeps earlier responses from influencing the next model.

Step 5: Compare Results Across Runs

Open the saved conversations from the chat sidebar and compare the results one by one.

Look for:

  • Which model produced the most trustworthy answer
  • Which model followed policy guidance most cleanly
  • Which model delivered acceptable quality at the lowest estimated prompt cost

If you are evaluating several prompts, keep a simple scorecard outside the workbench so you can compare patterns across multiple runs.
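
A scorecard comparison could be as small as the sketch below. The scores, costs, and model names are placeholders; fill them in from your own runs.

```python
# Illustrative scorecard: one entry per model run with the same baseline prompt.
runs = [
    {"model": "model-a", "quality": 4, "policy_ok": True,  "cost_usd": 0.0042},
    {"model": "model-b", "quality": 5, "policy_ok": True,  "cost_usd": 0.0180},
    {"model": "model-c", "quality": 3, "policy_ok": False, "cost_usd": 0.0011},
]

# Keep only runs that stayed within policy, then rank by quality per dollar.
eligible = [r for r in runs if r["policy_ok"]]
ranked = sorted(eligible, key=lambda r: r["quality"] / r["cost_usd"], reverse=True)

for r in ranked:
    print(f"{r['model']}: quality={r['quality']}, cost=${r['cost_usd']:.4f}")
```

Ranking by quality per dollar is just one possible tiebreaker; weight the criteria however your use case demands.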

Step 6: Turn the Best Result into a Default

Once you have enough evidence, update your preferred model selection or the agent configuration that owns the workflow.

Use the winning model for the prompts where it performed best, rather than picking one model for every workload.
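
If the workflow is driven by code rather than the workbench, the outcome of a comparison can be captured as a simple routing table. Everything here is a hypothetical sketch: the use-case keys and model names are made up, and this is not a Keeptrusts configuration format.

```python
# Hypothetical routing table built from comparison results.
# Use-case keys and model names are made up for illustration.
DEFAULT_MODEL_BY_USE_CASE = {
    "support_summaries": "model-b",    # best quality on policy-sensitive prompts
    "bulk_classification": "model-c",  # acceptable quality at the lowest cost
}

def pick_model(use_case: str, fallback: str = "model-a") -> str:
    """Return the evaluated default for a use case, or a general-purpose fallback."""
    return DEFAULT_MODEL_BY_USE_CASE.get(use_case, fallback)
```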

Tips for Effective Comparisons

  • Keep prompts identical — even small wording changes can invalidate the comparison.
  • Use fresh chats — avoid carrying context from one model run into the next.
  • Test real prompts — production-like prompts reveal more than generic demos.
  • Compare more than answers — note whether the model stayed within policy and how much the completed run actually cost.

Governance During Comparison Runs

Every comparison run still goes through the same gateway policy chain as any other chat message. If a model is blocked, redacted, or escalated under policy, treat that as part of the evaluation result.

Troubleshooting

  • A model is missing from Workspace settings: check that the model is eligible for the selected agent and gateway.
  • Response cost shows as "N/A": configure model pricing in your Keeptrusts instance.
  • Responses seem identical: verify you switched models before rerunning the prompt.
  • You expected split view: the current chat workbench compares models across separate runs, not simultaneous panes.

Next steps

For AI systems

  • Canonical terms: Keeptrusts chat workbench, model comparison, workspace settings, model selector, new chat, comparison run, response cost indicator.
  • UI: Chat header settings → Workspace tab → model selection → send prompt → start new chat → repeat for the next model.
  • Best next pages: Model Selection, Chat Analytics, Context Management.

For engineers

  • Prerequisites: at least two models available through the same chat workbench setup.
  • Validation: run the same prompt in fresh chats against at least two different models and confirm output, governance behavior, and response cost can be compared cleanly.
  • Best practice: Run the same prompt 3–5 times per model for stronger evidence, and keep a written rubric for quality, policy behavior, and cost, as in the sketch below.
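
For example, averaging rubric scores across repeated runs might look like this minimal sketch (scores and model names are illustrative):

```python
# Average rubric scores over repeated runs of the same prompt per model.
from statistics import mean

rubric_scores = {
    "model-a": [4, 4, 5, 3],  # one overall rubric score per repeated run
    "model-b": [5, 5, 4, 5],
}

for model, scores in rubric_scores.items():
    print(f"{model}: mean={mean(scores):.2f} over {len(scores)} runs")
```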

For leaders

  • Model comparison supports data-driven procurement decisions — evaluate before committing to a provider.
  • Cost comparison across models helps identify where premium models provide ROI vs. where cheaper models suffice.
  • Governance applies equally across comparison runs — no model gets special policy treatment during evaluation.