Guarantee AI Output Quality with Automated Scoring
AI models produce inconsistent output. The same prompt can return a concise, well-sourced answer on one call and a hallucinated, rambling response on the next. Keeptrusts evaluates every response at the gateway — scoring quality, verifying citations, and rewriting substandard outputs before they reach your users.
Use this page when
- You need to enforce minimum quality standards on AI responses before they reach users.
- You are deploying citation verification to catch hallucinations in knowledge-grounded applications.
- You want automated quality scoring with configurable thresholds and action-on-failure (escalate, block, or rewrite).
Primary audience
- Primary: Technical Leaders
- Secondary: Technical Engineers, AI Agents
What you'll achieve
- Automated quality scoring on every AI response with configurable thresholds
- Citation verification that checks whether responses are grounded in provided context
- Response rewriting that fixes low-quality outputs instead of blocking them
- Quality assertions that enforce minimum standards per endpoint or team
- Benchmarking data to compare model quality across providers over time
Quality scorer: score every response
The `quality-scorer` policy evaluates responses against configurable quality dimensions and takes action when scores fall below your thresholds.
```yaml
policies:
  chain:
    - quality-scorer
    - audit-logger
  policy:
    quality-scorer:
      dimensions:
        relevance:
          weight: 0.4
          min_score: 0.7
        coherence:
          weight: 0.3
          min_score: 0.6
        completeness:
          weight: 0.3
          min_score: 0.5
      overall_min_score: 0.65
      on_fail: escalate
```
How it works at runtime:
- The upstream provider returns a response
- The quality scorer evaluates the response across each dimension
- If any dimension or the overall score falls below the threshold, the configured action fires
- The score is attached to the event record for trend analysis
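The scoring model itself is internal to the gateway, but the thresholds compose in a predictable way. Here is a minimal sketch of the arithmetic implied by the config above, assuming the overall score is the weighted sum of the dimension scores:

```python
# Sketch: how per-dimension scores and the overall threshold interact.
# Assumes overall = sum(weight * score); the gateway's internal scoring
# model is not exposed, so treat this as illustrative only.

DIMENSIONS = {
    # dimension: (weight, min_score), taken from the example config
    "relevance":    (0.4, 0.7),
    "coherence":    (0.3, 0.6),
    "completeness": (0.3, 0.5),
}
OVERALL_MIN_SCORE = 0.65

def passes(scores: dict[str, float]) -> bool:
    """True only if every dimension and the weighted overall score pass."""
    overall = sum(w * scores[d] for d, (w, _) in DIMENSIONS.items())
    dims_ok = all(scores[d] >= m for d, (_, m) in DIMENSIONS.items())
    return dims_ok and overall >= OVERALL_MIN_SCORE

# Overall passes (0.705 >= 0.65), but completeness 0.45 < 0.5, so on_fail fires:
print(passes({"relevance": 0.9, "coherence": 0.7, "completeness": 0.45}))  # False
```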
Choosing an action on failure
| Action | Behavior | Use when |
|---|---|---|
| `escalate` | Deliver the response but flag it for review | Monitoring phase — understand baseline quality |
| `block` | Reject the response with a 409 | Strict quality requirements — customer-facing outputs |
| `rewrite` | Pass the response through the rewriter before delivery | You want to fix outputs, not block them |
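With `block`, the caller sees the 409 directly, so clients should handle it explicitly. A minimal client-side sketch, assuming an OpenAI-compatible chat endpoint on the gateway (the URL, request shape, and error body here are illustrative assumptions, not the documented API):

```python
# Sketch: client-side handling of a quality block. The endpoint and request
# shape are assumptions for illustration; the 409 status for blocked
# responses is the documented behavior.
import requests

resp = requests.post(
    "https://gateway.example.com/v1/chat/completions",  # hypothetical endpoint
    json={"model": "gpt-4o", "messages": [{"role": "user", "content": "Hi"}]},
    timeout=30,
)
if resp.status_code == 409:
    # Response failed the quality gate: retry, fall back, or surface an error
    print("Blocked by quality policy:", resp.text)
else:
    resp.raise_for_status()
    print(resp.json()["choices"][0]["message"]["content"])
```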
Citation verification: ground responses in facts
The `citation-verifier` policy checks whether a response is grounded in the context that was provided — catching hallucinations before they reach users.
```yaml
policies:
  chain:
    - citation-verifier
    - quality-scorer
    - audit-logger
  policy:
    citation-verifier:
      mode: strict
      min_grounding_score: 0.8
      on_ungrounded: escalate
      check_factual_consistency: true
      log_citation_records: true
```
When knowledge base assets are bound to a configuration, the citation verifier compares response claims against the retrieved context. Ungrounded claims are flagged, and citation records are written for audit.
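The verifier's claim extraction and entailment checks are internal, but the reported grounding score behaves like the fraction of response claims supported by the retrieved context. A toy sketch of that idea, with substring matching standing in for real entailment:

```python
# Toy sketch: grounding score as "fraction of claims supported by context".
# Real citation verification extracts claims and checks entailment; the
# substring matching here is a stand-in for illustration only.

def grounding_score(claims: list[str], context: str) -> float:
    supported = sum(1 for c in claims if c.lower() in context.lower())
    return supported / len(claims) if claims else 1.0

context = "The warranty covers parts for 24 months. Labor is excluded."
claims = [
    "The warranty covers parts for 24 months",
    "Labor is covered for 12 months",  # hallucinated: not in the context
]
print(grounding_score(claims, context))  # 0.5, below min_grounding_score: 0.8
```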
```bash
# Review citation records for a configuration
kt events list \
  --configuration-id my-config \
  --filter "citation_verifier" \
  --limit 20
```
Response rewriting: fix instead of block
The `response-rewriter` policy transforms substandard responses instead of rejecting them. Use it when blocking would degrade user experience.
```yaml
policies:
  chain:
    - quality-scorer
    - response-rewriter
    - audit-logger
  policy:
    quality-scorer:
      overall_min_score: 0.65
      on_fail: rewrite
    response-rewriter:
      add_disclaimer: true
      disclaimer_text: "This response has been reviewed for quality."
      truncate_at_tokens: 2000
      remove_repetition: true
      enforce_format: markdown
```
The rewriter can:
- Add disclaimers to responses that triggered quality flags
- Truncate overly verbose responses
- Remove repetitive content
- Enforce output format constraints
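A rough sketch of those transformations in plain Python. The real rewriter runs inside the gateway and its heuristics are not exposed; this only illustrates the shape of each option:

```python
# Sketch: approximating the rewriter options above. Illustrative only;
# the gateway's actual rewriting logic is internal.
import re

DISCLAIMER = "This response has been reviewed for quality."

def rewrite(text: str, max_tokens: int = 2000) -> str:
    # remove_repetition: drop consecutive duplicate sentences
    sentences = re.split(r"(?<=[.!?])\s+", text)
    deduped = [s for i, s in enumerate(sentences) if i == 0 or s != sentences[i - 1]]
    out = " ".join(deduped)
    # truncate_at_tokens: crude whitespace tokenization for illustration
    words = out.split()
    if len(words) > max_tokens:
        out = " ".join(words[:max_tokens]) + " [truncated]"
    # add_disclaimer (enforce_format: markdown renders it as a blockquote)
    return f"{out}\n\n> {DISCLAIMER}"

print(rewrite("Same point. Same point. Different point."))
```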
Quality assertions: enforce minimum standards
Combine the quality scorer with specific assertions to create hard quality gates per use case.
```yaml
policies:
  chain:
    - quality-scorer
    - citation-verifier
    - audit-logger
  policy:
    quality-scorer:
      dimensions:
        relevance:
          weight: 0.5
          min_score: 0.8
        coherence:
          weight: 0.3
          min_score: 0.7
        completeness:
          weight: 0.2
          min_score: 0.6
      overall_min_score: 0.75
      on_fail: block
    citation-verifier:
      mode: strict
      min_grounding_score: 0.85
      on_ungrounded: block
```
This configuration blocks any response that:
- Scores below 0.8 on relevance
- Scores below 0.7 on coherence
- Scores below 0.6 on completeness
- Has an overall quality score below 0.75
- Is less than 85% grounded in the provided context
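The gate reduces to a conjunction of threshold checks. The gateway evaluates this internally; the sketch below only mirrors the thresholds from the config above:

```python
# Sketch: the combined quality gate as boolean assertions. Thresholds
# mirror the example config; the real evaluation happens inside the gateway.
def passes_gate(relevance: float, coherence: float,
                completeness: float, grounding: float) -> bool:
    overall = 0.5 * relevance + 0.3 * coherence + 0.2 * completeness
    return (
        relevance >= 0.8
        and coherence >= 0.7
        and completeness >= 0.6
        and overall >= 0.75
        and grounding >= 0.85
    )

# A strong response that is nonetheless blocked for weak grounding:
print(passes_gate(0.85, 0.75, 0.9, 0.80))  # False: grounding 0.80 < 0.85
```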
Benchmarking across providers
Use quality scoring data to compare model performance over time. Every scored response is logged as an event with full dimension breakdowns.
```bash
# Export quality events for analysis
kt events export \
  --from "2025-04-01" \
  --to "2025-04-30" \
  --filter "quality_scorer" \
  --format csv \
  --output quality-report.csv
```
Track these metrics in the console Events page:
| Metric | What it tells you |
|---|---|
| Average overall score per model | Which provider delivers the best quality |
| Relevance score trend | Whether quality is improving or degrading over time |
| Citation grounding rate | How often responses are factually supported |
| Block/escalation rate | How often outputs fail to meet your standards |
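A sketch of computing two of these metrics from the exported CSV. The column names (`model`, `overall_score`, `action`) are assumptions; inspect `quality-report.csv` for the actual schema before running this:

```python
# Sketch: deriving benchmark metrics from the exported events CSV.
# Column names are assumptions for illustration.
import pandas as pd

df = pd.read_csv("quality-report.csv")

# Average overall score per model: which provider delivers the best quality
print(df.groupby("model")["overall_score"].mean().sort_values(ascending=False))

# Block/escalation rate: how often outputs failed to meet your standards
print(df["action"].isin(["block", "escalate"]).mean())
```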
Example: full quality-enforced config
```yaml
pack:
  name: quality-enforced-gateway
  version: "1.0"

policies:
  chain:
    - citation-verifier
    - quality-scorer
    - response-rewriter
    - audit-logger
  policy:
    citation-verifier:
      mode: strict
      min_grounding_score: 0.8
      on_ungrounded: escalate
      log_citation_records: true
    quality-scorer:
      dimensions:
        relevance:
          weight: 0.4
          min_score: 0.7
        coherence:
          weight: 0.3
          min_score: 0.6
        completeness:
          weight: 0.3
          min_score: 0.5
      overall_min_score: 0.65
      on_fail: rewrite
    response-rewriter:
      add_disclaimer: true
      remove_repetition: true
    audit-logger:
      retention_days: 365

providers:
  targets:
    - id: openai-gpt4o
      provider: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
```
Quick wins
- Deploy `quality-scorer` with `on_fail: escalate` — measure baseline quality without impacting users
- Add `citation-verifier` on knowledge-grounded configs — catch hallucinations immediately
- Export a week of quality events — identify which models and prompts produce the lowest quality
- Set a quality floor with `on_fail: block` on customer-facing endpoints — protect end users
For AI systems
- Canonical terms: `quality-scorer`, `citation-verifier`, `response-rewriter`, quality dimensions (relevance, coherence, completeness), grounding score.
- Config keys: `policy.quality-scorer.dimensions`, `policy.quality-scorer.overall_min_score`, `policy.quality-scorer.on_fail`, `policy.citation-verifier.mode`, `policy.citation-verifier.min_grounding_score`.
- Actions on failure: `escalate` (flag for review), `block` (reject with 409), `rewrite` (fix and deliver).
- Best next pages: Knowledge Grounding, A/B Testing, Centralize Observability.
For engineers
- Prerequisites: gateway running with `quality-scorer` and/or `citation-verifier` in the policy chain.
- Configure quality dimensions with weights and minimum scores: relevance (0.4), coherence (0.3), completeness (0.3).
- Set `on_fail: escalate` during initial rollout to understand baseline quality before enforcing blocks.
- Enable `citation-verifier` with `mode: strict` and `min_grounding_score: 0.8` for knowledge-grounded apps.
- Validate: send a request that produces a low-quality response and confirm the configured action fires.
For leaders
- Inconsistent AI output quality erodes user trust and creates liability risk for customer-facing applications.
- Automated scoring enforces quality gates without manual review of every response.
- Citation verification catches hallucinations before they reach users — critical for regulated industries.
- Quality trend data supports model comparison decisions and justifies model upgrade investments.
Next steps
- Knowledge-Grounded Responses — improve quality by grounding responses in verified context
- Centralize AI Observability — track quality metrics alongside cost and compliance
- Policy Controls Catalog — full list of quality and output-phase controls
- Events — explore quality scoring data in the event stream
- Export Evidence for a Review — generate quality reports for stakeholders