Guarantee AI Output Quality with Automated Scoring

AI models produce inconsistent output. The same prompt can return a concise, well-sourced answer on one call and a hallucinated, rambling response on the next. Keeptrusts evaluates every response at the gateway — scoring quality, verifying citations, and rewriting substandard outputs before they reach your users.

Use this page when

  • You need to enforce minimum quality standards on AI responses before they reach users.
  • You are deploying citation verification to catch hallucinations in knowledge-grounded applications.
  • You want automated quality scoring with configurable thresholds and action-on-failure (escalate, block, or rewrite).

Primary audience

  • Primary: Technical Leaders
  • Secondary: Technical Engineers, AI Agents

What you'll achieve

  • Automated quality scoring on every AI response with configurable thresholds
  • Citation verification that checks whether responses are grounded in provided context
  • Response rewriting that fixes low-quality outputs instead of blocking them
  • Quality assertions that enforce minimum standards per endpoint or team
  • Benchmarking data to compare model quality across providers over time

Quality scorer: score every response

The quality-scorer policy evaluates responses against configurable quality dimensions and takes action when scores fall below your thresholds.

policies:
  chain:
    - quality-scorer
    - audit-logger

policy:
  quality-scorer:
    dimensions:
      relevance:
        weight: 0.4
        min_score: 0.7
      coherence:
        weight: 0.3
        min_score: 0.6
      completeness:
        weight: 0.3
        min_score: 0.5
    overall_min_score: 0.65
    on_fail: escalate
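
The weights above sum to 1.0. Assuming the overall score is the weighted average of the dimension scores, a response scoring 0.8 on relevance, 0.6 on coherence, and 0.5 on completeness comes out to 0.4 × 0.8 + 0.3 × 0.6 + 0.3 × 0.5 = 0.65, landing exactly on the overall_min_score floor. A single dimension falling below its own min_score triggers the action even if the overall score passes.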

How it works at runtime:

  1. The upstream provider returns a response
  2. The quality scorer evaluates the response across each dimension
  3. If any dimension or the overall score falls below the threshold, the configured action fires
  4. The score is attached to the event record for trend analysis
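
For illustration, the score payload attached to an event record might look like the following (field names are illustrative, not a documented schema):

{
  "policy": "quality_scorer",
  "scores": {
    "relevance": 0.82,
    "coherence": 0.74,
    "completeness": 0.58,
    "overall": 0.72
  },
  "action": "none"
}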

Choosing an action on failure

Action   | Behavior                                                | Use when
---------|---------------------------------------------------------|-------------------------------------------------------
escalate | Deliver the response but flag it for review             | Monitoring phase — understand baseline quality
block    | Reject the response with a 409                          | Strict quality requirements — customer-facing outputs
rewrite  | Pass the response through the rewriter before delivery  | You want to fix outputs, not block them

Citation verification: ground responses in facts

The citation-verifier policy checks whether a response is grounded in the context that was provided — catching hallucinations before they reach users.

policies:
  chain:
    - citation-verifier
    - quality-scorer
    - audit-logger

policy:
  citation-verifier:
    mode: strict
    min_grounding_score: 0.8
    on_ungrounded: escalate
    check_factual_consistency: true
    log_citation_records: true

When knowledge base assets are bound to a configuration, the citation verifier compares response claims against the retrieved context. Ungrounded claims are flagged, and citation records are written for audit.

# Review citation records for a configuration
kt events list \
  --configuration-id my-config \
  --filter "citation_verifier" \
  --limit 20
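
The exact citation record schema is not shown on this page, but a record flagged by the verifier could reasonably carry the claim, its grounding score, and the action taken, along these lines (all field names are assumptions):

{
  "policy": "citation_verifier",
  "claim": "Revenue grew 14% year over year.",
  "grounded": false,
  "grounding_score": 0.42,
  "action": "escalate"
}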

Response rewriting: fix instead of block

The response-rewriter policy transforms substandard responses instead of rejecting them. Use it when blocking would degrade user experience.

policies:
  chain:
    - quality-scorer
    - response-rewriter
    - audit-logger

policy:
  quality-scorer:
    overall_min_score: 0.65
    on_fail: rewrite
  response-rewriter:
    add_disclaimer: true
    disclaimer_text: "This response has been reviewed for quality."
    truncate_at_tokens: 2000
    remove_repetition: true
    enforce_format: markdown

The rewriter can:

  • Add disclaimers to responses that triggered quality flags
  • Truncate overly verbose responses
  • Remove repetitive content
  • Enforce output format constraints
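
Putting the settings above together: a verbose, repetitive response would be delivered with duplicate passages removed, truncated at 2,000 tokens, formatted as markdown, and carrying the configured disclaimer. The sketch below assumes the disclaimer is appended; placement may differ in your deployment.

[rewritten response: repetition removed, truncated at 2,000 tokens, markdown-formatted]

This response has been reviewed for quality.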

Quality assertions: enforce minimum standards

Combine the quality scorer and citation verifier with strict thresholds and on_fail: block to create hard quality gates per endpoint or use case.

policies:
  chain:
    - quality-scorer
    - citation-verifier
    - audit-logger

policy:
  quality-scorer:
    dimensions:
      relevance:
        weight: 0.5
        min_score: 0.8
      coherence:
        weight: 0.3
        min_score: 0.7
      completeness:
        weight: 0.2
        min_score: 0.6
    overall_min_score: 0.75
    on_fail: block
  citation-verifier:
    mode: strict
    min_grounding_score: 0.85
    on_ungrounded: block

This configuration blocks any response that:

  • Scores below 0.8 on relevance
  • Scores below 0.7 on coherence
  • Scores below 0.6 on completeness
  • Has an overall quality score below 0.75
  • Is less than 85% grounded in provided context
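
When the block action fires, the caller receives a 409 instead of the model output. The error schema is not documented on this page; an illustrative rejection might look like:

HTTP/1.1 409 Conflict

{
  "error": "quality_gate_failed",
  "policy": "quality_scorer",
  "dimension": "relevance",
  "score": 0.71,
  "threshold": 0.8
}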

Benchmarking across providers

Use quality scoring data to compare model performance over time. Every scored response is logged as an event with full dimension breakdowns.

# Export quality events for analysis
kt events export \
  --from "2025-04-01" \
  --to "2025-04-30" \
  --filter "quality_scorer" \
  --format csv \
  --output quality-report.csv
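
Once exported, the CSV aggregates with standard tooling. A minimal sketch, assuming the export places the model name in column 2 and the overall score in column 5 (check your export's header row before relying on these positions):

# Average overall score per model
awk -F',' 'NR > 1 { sum[$2] += $5; n[$2]++ }
  END { for (m in sum) printf "%s\t%.3f\n", m, sum[m] / n[m] }' quality-report.csv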

Track these metrics in the console Events page:

Metric                          | What it tells you
--------------------------------|-----------------------------------------------------
Average overall score per model | Which provider delivers the best quality
Relevance score trend           | Whether quality is improving or degrading over time
Citation grounding rate         | How often responses are factually supported
Block/escalation rate           | How often outputs fail to meet your standards

Example: full quality-enforced config

pack:
  name: quality-enforced-gateway
  version: "1.0"

policies:
  chain:
    - citation-verifier
    - quality-scorer
    - response-rewriter
    - audit-logger

policy:
  citation-verifier:
    mode: strict
    min_grounding_score: 0.8
    on_ungrounded: escalate
    log_citation_records: true
  quality-scorer:
    dimensions:
      relevance:
        weight: 0.4
        min_score: 0.7
      coherence:
        weight: 0.3
        min_score: 0.6
      completeness:
        weight: 0.3
        min_score: 0.5
    overall_min_score: 0.65
    on_fail: rewrite
  response-rewriter:
    add_disclaimer: true
    remove_repetition: true
  audit-logger:
    retention_days: 365

providers:
  targets:
    - id: openai-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
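
To validate the chain end to end, send a request through the gateway and confirm the policies fire. The example below assumes the gateway exposes an OpenAI-compatible chat completions endpoint on localhost:8080; adjust the URL, port, and payload for your deployment:

# Send a test request through the gateway (endpoint is an assumption)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Summarize our Q3 results."}]}'

# Confirm the policies fired on the request
kt events list --limit 5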

Quick wins

  1. Deploy quality-scorer with on_fail: escalate — measure baseline quality without impacting users
  2. Add citation-verifier on knowledge-grounded configs — catch hallucinations immediately
  3. Export a week of quality events — identify which models and prompts produce the lowest quality
  4. Set a quality floor with on_fail: block on customer-facing endpoints — protect end users

For AI systems

  • Canonical terms: quality-scorer, citation-verifier, response-rewriter, quality dimensions (relevance, coherence, completeness), grounding score.
  • Config keys: policy.quality-scorer.dimensions, policy.quality-scorer.overall_min_score, policy.quality-scorer.on_fail, policy.citation-verifier.mode, policy.citation-verifier.min_grounding_score.
  • Actions on failure: escalate (flag for review), block (reject with 409), rewrite (fix and deliver).
  • Best next pages: Knowledge Grounding, A/B Testing, Centralize Observability.

For engineers

  • Prerequisites: gateway running with quality-scorer and/or citation-verifier in the policy chain.
  • Configure quality dimensions with weights and minimum scores: relevance (0.4), coherence (0.3), completeness (0.3).
  • Set on_fail: escalate during initial rollout to understand baseline quality before enforcing blocks.
  • Enable citation-verifier with mode: strict and min_grounding_score: 0.8 for knowledge-grounded apps.
  • Validate: send a request that produces a low-quality response and confirm the configured action fires.

For leaders

  • Inconsistent AI output quality erodes user trust and creates liability risk for customer-facing applications.
  • Automated scoring enforces quality gates without manual review of every response.
  • Citation verification catches hallucinations before they reach users — critical for regulated industries.
  • Quality trend data supports model comparison decisions and justifies model upgrade investments.

Next steps