Guarantee AI Output Quality with Automated Scoring
AI models produce inconsistent output. The same prompt can return a concise, well-sourced answer on one call and a hallucinated, rambling response on the next. Keeptrusts evaluates every response at the gateway — scoring quality, verifying citations, and rewriting substandard outputs before they reach your users.
Use this page when
- You need to enforce minimum quality standards on AI responses before they reach users.
- You are deploying citation verification to catch hallucinations in knowledge-grounded applications.
- You want automated quality scoring with configurable thresholds and action-on-failure (escalate, block, or rewrite).
Primary audience
- Primary: Technical Leaders
- Secondary: Technical Engineers, AI Agents
What you'll achieve
- Automated quality scoring on every AI response with configurable thresholds
- Citation verification that checks whether responses are grounded in provided context
- Response rewriting that fixes low-quality outputs instead of blocking them
- Quality assertions that enforce minimum standards per endpoint or team
- Benchmarking data to compare model quality across providers over time
Quality scorer: score every response
The `quality-scorer` policy evaluates responses against configurable quality dimensions and takes action when scores fall below your thresholds.
```yaml
policies:
  chain:
    - quality-scorer
    - audit-logger
  policy:
    quality-scorer:
      dimensions:
        relevance:
          weight: 0.4
          min_score: 0.7
        coherence:
          weight: 0.3
          min_score: 0.6
        completeness:
          weight: 0.3
          min_score: 0.5
      overall_min_score: 0.65
      on_fail: escalate
```
How it works at runtime:
- The upstream provider returns a response
- The quality scorer evaluates the response across each dimension
- If any dimension or the overall score falls below the threshold, the configured action fires
- The score is attached to the event record for trend analysis
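The scoring model itself is internal to the gateway, but the thresholds compose in a predictable way. Here is a minimal sketch of the arithmetic implied by the config above, assuming the overall score is the weighted sum of the dimension scores:

```python
# Sketch: how per-dimension scores and the overall threshold interact.
# Assumes overall = sum(weight * score); the gateway's internal scoring
# model is not exposed, so treat this as illustrative only.

DIMENSIONS = {
    # dimension: (weight, min_score), taken from the example config
    "relevance":    (0.4, 0.7),
    "coherence":    (0.3, 0.6),
    "completeness": (0.3, 0.5),
}
OVERALL_MIN_SCORE = 0.65

def passes(scores: dict[str, float]) -> bool:
    """True only if every dimension and the weighted overall score pass."""
    overall = sum(w * scores[d] for d, (w, _) in DIMENSIONS.items())
    dims_ok = all(scores[d] >= m for d, (_, m) in DIMENSIONS.items())
    return dims_ok and overall >= OVERALL_MIN_SCORE

# Overall passes (0.705 >= 0.65), but completeness 0.45 < 0.5, so on_fail fires:
print(passes({"relevance": 0.9, "coherence": 0.7, "completeness": 0.45}))  # False
```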
Choosing an action on failure
| Action | Behavior | Use when |
|---|---|---|
| `escalate` | Deliver the response but flag it for review | Monitoring phase — understand baseline quality |
| `block` | Reject the response with a 409 | Strict quality requirements — customer-facing outputs |
| `rewrite` | Pass the response through the rewriter before delivery | You want to fix outputs, not block them |
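With `block`, the caller sees the 409 directly, so clients should handle it explicitly. A minimal client-side sketch, assuming an OpenAI-compatible chat endpoint on the gateway (the URL, request shape, and error body here are illustrative assumptions, not the documented API):

```python
# Sketch: client-side handling of a quality block. The endpoint and request
# shape are assumptions for illustration; the 409 status for blocked
# responses is the documented behavior.
import requests

resp = requests.post(
    "https://gateway.example.com/v1/chat/completions",  # hypothetical endpoint
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "Hi"}]},
    timeout=30,
)
if resp.status_code == 409:
    # Response failed the quality gate: retry, fall back, or surface an error
    print("Blocked by quality policy:", resp.text)
else:
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])
```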
Citation verification: ground responses in facts
The `citation-verifier` policy checks whether a response is grounded in the context that was provided — catching hallucinations before they reach users.
```yaml
policies:
  chain:
    - citation-verifier
    - quality-scorer
    - audit-logger
  policy:
    citation-verifier:
      mode: strict
      min_grounding_score: 0.8
      on_ungrounded: escalate
      check_factual_consistency: true
      log_citation_records: true
```
When knowledge base assets are bound to a configuration, the citation verifier compares response claims against the retrieved context. Ungrounded claims are flagged, and citation records are written for audit.
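The verifier's claim extraction and entailment checks are internal, but the reported grounding score behaves like the fraction of response claims supported by the retrieved context. A toy sketch of that idea, with substring matching standing in for real entailment:

```python
# Toy sketch: grounding score as "fraction of claims supported by context".
# Real citation verification extracts claims and checks entailment; the
# substring matching here is a stand-in for illustration only.

def grounding_score(claims: list[str], context: str) -> float:
    supported = sum(1 for c in claims if c.lower() in context.lower())
    return supported / len(claims) if claims else 1.0

context = "The warranty covers parts for 24 months. Labor is excluded."
claims = [
    "The warranty covers parts for 24 months",
    "Labor is covered for 12 months",  # hallucinated: not in the context
]
print(grounding_score(claims, context))  # 0.5, below min_grounding_score: 0.8
```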
```bash
# Review citation records for a configuration
kt events list \
  --configuration-id my-config \
  --filter "citation_verifier" \
  --limit 20
```
Response rewriting: fix instead of block
The `response-rewriter` policy transforms substandard responses instead of rejecting them. Use it when blocking would degrade user experience.
```yaml
policies:
  chain:
    - quality-scorer
    - response-rewriter
    - audit-logger
  policy:
    quality-scorer:
      overall_min_score: 0.65
      on_fail: rewrite
    response-rewriter:
      add_disclaimer: true
      disclaimer_text: "This response has been reviewed for quality."
      truncate_at_tokens: 2000
      remove_repetition: true
      enforce_format: markdown
```
The rewriter can:
- Add disclaimers to responses that triggered quality flags
- Truncate overly verbose responses
- Remove repetitive content
- Enforce output format constraints
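A rough sketch of those transformations in plain Python. The real rewriter runs inside the gateway and its heuristics are not exposed; this only illustrates the shape of each option:

```python
# Sketch: approximating the rewriter options above. Illustrative only;
# the gateway's actual rewriting logic is internal.
import re

DISCLAIMER = "This response has been reviewed for quality."

def rewrite(text: str, max_tokens: int = 2000) -> str:
    # remove_repetition: drop consecutive duplicate sentences
    sentences = re.split(r"(?<=[.!?])\s+", text)
    deduped = [s for i, s in enumerate(sentences) if i == 0 or s != sentences[i - 1]]
    out = " ".join(deduped)
    # truncate_at_tokens: crude whitespace tokenization for illustration
    words = out.split()
    if len(words) > max_tokens:
        out = " ".join(words[:max_tokens]) + " [truncated]"
    # add_disclaimer (enforce_format: markdown renders it as a blockquote)
    return f"{out}\n\n> {DISCLAIMER}"

print(rewrite("Same point. Same point. Different point."))
```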
Quality assertions: enforce minimum standards
Combine the quality scorer with specific assertions to create hard quality gates per use case.
```yaml
policies:
  chain:
    - quality-scorer
    - citation-verifier
    - audit-logger
  policy:
    quality-scorer:
      dimensions:
        relevance:
          weight: 0.5
          min_score: 0.8
        coherence:
          weight: 0.3
          min_score: 0.7
        completeness:
          weight: 0.2
          min_score: 0.6
      overall_min_score: 0.75
      on_fail: block
    citation-verifier:
      mode: strict
      min_grounding_score: 0.85
      on_ungrounded: block
```
This configuration blocks any response that:
- Scores below 0.8 on relevance
- Scores below 0.7 on coherence
- Scores below 0.6 on completeness
- Has an overall quality score below 0.75
- Is less than 85% grounded in the provided context
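The gate reduces to a conjunction of threshold checks. The gateway evaluates this internally; the sketch below only mirrors the thresholds from the config above:

```python
# Sketch: the combined quality gate as boolean assertions. Thresholds
# mirror the example config; the real evaluation happens inside the gateway.
def passes_gate(relevance: float, coherence: float,
                completeness: float, grounding: float) -> bool:
    overall = 0.5 * relevance + 0.3 * coherence + 0.2 * completeness
    return (
        relevance >= 0.8
        and coherence >= 0.7
        and completeness >= 0.6
        and overall >= 0.75
        and grounding >= 0.85
    )

# A strong response that is nonetheless blocked for weak grounding:
print(passes_gate(0.85, 0.75, 0.9, 0.80))  # False: grounding 0.80 < 0.85
```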
Benchmarking across providers
Use quality scoring data to compare model performance over time. Every scored response is logged as an event with full dimension breakdowns.
```bash
# Export quality events for analysis
kt events export \
  --from "2025-04-01" \
  --to "2025-04-30" \
  --filter "quality_scorer" \
  --format csv \
  --output quality-report.csv
```
Track these metrics in the console Events page:
| Metric | What it tells you |
|---|---|
| Average overall score per model | Which provider delivers the best quality |
| Relevance score trend | Whether quality is improving or degrading over time |
| Citation grounding rate | How often responses are factually supported |
| Block/escalation rate | How often outputs fail to meet your standards |
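A sketch of computing two of these metrics from the exported CSV. The column names (`model`, `overall_score`, `action`) are assumptions; inspect `quality-report.csv` for the actual schema before running this:

```python
# Sketch: deriving benchmark metrics from the exported events CSV.
# Column names are assumptions for illustration.
import pandas as pd

df = pd.read_csv("quality-report.csv")

# Average overall score per model: which provider delivers the best quality
print(df.groupby("model")["overall_score"].mean().sort_values(ascending=False))

# Block/escalation rate: how often outputs failed to meet your standards
print(df["action"].isin(["block", "escalate"]).mean())
```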
Example: full quality-enforced config
```yaml
pack:
  name: quality-enforced-gateway
  version: "1.0"

policies:
  chain:
    - citation-verifier
    - quality-scorer
    - response-rewriter
    - audit-logger
  policy:
    citation-verifier:
      mode: strict
      min_grounding_score: 0.8
      on_ungrounded: escalate
      log_citation_records: true
    quality-scorer:
      dimensions:
        relevance:
          weight: 0.4
          min_score: 0.7
        coherence:
          weight: 0.3
          min_score: 0.6
        completeness:
          weight: 0.3
          min_score: 0.5
      overall_min_score: 0.65
      on_fail: rewrite
    response-rewriter:
      add_disclaimer: true
      remove_repetition: true
    audit-logger:
      retention_days: 365

providers:
  targets:
    - id: openai-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
```
Quick wins
- Deploy `quality-scorer` with `on_fail: escalate` — measure baseline quality without impacting users
- Add `citation-verifier` on knowledge-grounded configs — catch hallucinations immediately
- Export a week of quality events — identify which models and prompts produce the lowest quality
- Set a quality floor with `on_fail: block` on customer-facing endpoints — protect end users
For AI systems
- Canonical terms: `quality-scorer`, `citation-verifier`, `response-rewriter`, quality dimensions (relevance, coherence, completeness), grounding score.
- Config keys: `policy.quality-scorer.dimensions`, `policy.quality-scorer.overall_min_score`, `policy.quality-scorer.on_fail`, `policy.citation-verifier.mode`, `policy.citation-verifier.min_grounding_score`.
- Actions on failure: `escalate` (flag for review), `block` (reject with 409), `rewrite` (fix and deliver).
- Best next pages: Knowledge Grounding, A/B Testing, Centralize Observability.
For engineers
- Prerequisites: gateway running with `quality-scorer` and/or `citation-verifier` in the policy chain.
- Configure quality dimensions with weights and minimum scores: relevance (0.4), coherence (0.3), completeness (0.3).
- Set `on_fail: escalate` during initial rollout to understand baseline quality before enforcing blocks.
- Enable `citation-verifier` with `mode: strict` and `min_grounding_score: 0.8` for knowledge-grounded apps.
- Validate: send a request that produces a low-quality response and confirm the configured action fires.
For leaders
- Inconsistent AI output quality erodes user trust and creates liability risk for customer-facing applications.
- Automated scoring enforces quality gates without manual review of every response.
- Citation verification catches hallucinations before they reach users — critical for regulated industries.
- Quality trend data supports model comparison decisions and justifies model upgrade investments.
Next steps
- Knowledge-Grounded Responses — improve quality by grounding responses in verified context
- Centralize AI Observability — track quality metrics alongside cost and compliance
- Policy Controls Catalog — full list of quality and output-phase controls
- Events — explore quality scoring data in the event stream
- Export Evidence for a Review — generate quality reports for stakeholders