AI Output Quality Scoring & Validation

Quality assurance for AI outputs goes beyond ensuring the system runs — it means validating that responses meet accuracy, relevance, and compliance thresholds. Keeptrusts provides policy-driven quality controls that score, flag, and gate AI outputs before they reach end users.

Use this page when

  • You need to configure quality scoring policies that gate AI outputs on relevance, hallucination risk, and response length
  • You are implementing factual grounding checks against a bound knowledge base
  • You want to validate citation coverage, set up quality gates, and monitor quality trends in the console

Primary audience

  • Primary: Technical Engineers
  • Secondary: AI Agents, Technical Leaders

Response Quality Policies

Quality policies evaluate LLM responses against configurable thresholds. Unlike content-safety policies that block harmful outputs, quality policies assess whether outputs are useful, grounded, and appropriate.

Configuring Quality Thresholds

# policy-config.yaml — quality scoring policies
policies:
  - name: response-quality-gate
    type: quality
    action: escalate
    thresholds:
      min_relevance_score: 0.7
      max_hallucination_risk: 0.3
      min_response_length: 50
      max_response_length: 4000

  - name: factual-grounding-check
    type: grounding
    action: flag
    knowledge_base: product-docs-v3
    min_citation_coverage: 0.6
    require_sources: true

When a response falls below the min_relevance_score or exceeds max_hallucination_risk, the gateway escalates the event for human review rather than silently delivering a low-quality answer.
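
A quick way to confirm the gate is firing is to pull recent events and list any whose scores breach the thresholds above. A minimal sketch, assuming events carry the quality_scores fields shown later on this page:

# List the last 20 events and surface any below the configured
# relevance threshold (0.7 in the policy above)
kt events list --last 20 --format json \
  | jq '[ .[]
          | select((.quality_scores.relevance_score // 1) < 0.7)
          | {relevance: .quality_scores.relevance_score,
             risk: .quality_scores.hallucination_risk} ]'
# the // 1 fallback skips events that have no quality_scores block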

Factual Grounding Checks

Grounding policies verify that AI responses are anchored in authoritative sources. This is critical for regulated industries where hallucinated claims create liability.

How Grounding Works

  1. The gateway receives the LLM response
  2. The grounding policy compares response claims against bound knowledge base assets
  3. Each claim is scored for citation coverage
  4. Responses below the threshold are flagged or blocked

Testing Grounding Policies

# Send a prompt that should be grounded in the knowledge base
curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {"role": "system", "content": "Answer using only the provided knowledge base."},
      {"role": "user", "content": "What is the maximum retention period for audit logs?"}
    ]
  }' | jq '.choices[0].message.content'

# Check the event for grounding scores
kt events list --last 1 --format json | jq '.[0].quality_scores'

Expected output for a well-grounded response:

{
  "relevance_score": 0.85,
  "hallucination_risk": 0.12,
  "citation_coverage": 0.78,
  "sources_cited": 3
}
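
For contrast, a response that fails the gates configured above might score like this (illustrative values, not real output):

{
  "relevance_score": 0.52,
  "hallucination_risk": 0.41,
  "citation_coverage": 0.33,
  "sources_cited": 0
}

Here the quality policy escalates (0.52 is below the 0.7 relevance minimum, and the 0.41 hallucination risk exceeds the 0.3 ceiling), and the grounding policy flags the response (0.33 is below the 0.6 coverage minimum).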

Knowledge Base Citation Verification

When knowledge base assets are bound to a gateway configuration, the gateway tracks which assets were recalled and cited during each interaction.

Managing Knowledge Base Assets

# List active knowledge base assets
kt knowledge-base list --status active

# Promote a draft asset to active
kt knowledge-base promote --id kb_asset_12345

# Bind assets to a gateway configuration
kt knowledge-base bind --config production --asset-ids kb_asset_12345,kb_asset_67890

Verifying Citations in Events

After processing, each event records citation metadata:

# Query events with citation details
kt events list --last 5 --format json | jq '.[].citations'

Sample citation record:

{
  "citations": [
    {
      "asset_id": "kb_asset_12345",
      "asset_title": "Data Retention Policy v2.1",
      "chunk_id": "chunk_0042",
      "relevance": 0.91
    }
  ]
}

Writing Citation Tests

Automate citation verification in your test suite:

#!/bin/bash
# test-citations.sh — verify knowledge base citations

# Trigger an interaction; the response body itself is not needed,
# since we inspect the event the gateway records for it
RESPONSE=$(curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Explain the data retention policy"}]
  }')

# Fetch the latest event
EVENT=$(kt events list --last 1 --format json)

# Assert citations exist
CITATION_COUNT=$(echo "$EVENT" | jq '.[0].citations | length')
if [ "$CITATION_COUNT" -gt 0 ]; then
  echo "PASS: Response includes $CITATION_COUNT citation(s)"
else
  echo "FAIL: No citations found — response may be hallucinated"
  exit 1
fi

# Assert minimum relevance
MIN_RELEVANCE=$(echo "$EVENT" | jq '[.[0].citations[].relevance] | min')
THRESHOLD="0.6"
if awk "BEGIN {exit !($MIN_RELEVANCE >= $THRESHOLD)}"; then
  echo "PASS: All citations meet minimum relevance ($MIN_RELEVANCE >= $THRESHOLD)"
else
  echo "FAIL: Low-relevance citation detected ($MIN_RELEVANCE < $THRESHOLD)"
  exit 1
fi
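
Run the script against a gateway configuration that has knowledge base assets bound (see the bind command above); without bound assets, the citation check is expected to fail:

chmod +x test-citations.sh
./test-citations.sh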

Automated Quality Gates

Quality gates act as checkpoints in the AI output pipeline. Configure multiple gates for layered validation:

policies:
  - name: length-gate
    type: quality
    action: block
    thresholds:
      min_response_length: 20
    message: "Response too short to be useful."

  - name: relevance-gate
    type: quality
    action: escalate
    thresholds:
      min_relevance_score: 0.65
    escalation_channel: quality-review

  - name: grounding-gate
    type: grounding
    action: flag
    knowledge_base: company-policies
    min_citation_coverage: 0.5

Gate Evaluation Order

Gates are evaluated in the output phase of the policy chain:

  1. Length gate — rejects trivially short responses immediately
  2. Relevance gate — escalates low-relevance responses for human review
  3. Grounding gate — flags poorly grounded responses in the event stream
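
To see the gates in action, send a prompt that invites a trivially short answer and then inspect the recorded event; what you observe depends on which actions (block, escalate, flag) your policies configure. A sketch against the local gateway used in earlier examples:

# A prompt likely to trip the length gate
curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Reply with the single word OK."}]
  }'

# Inspect the scores recorded for the latest event
kt events list --last 1 --format json | jq '.[0].quality_scores'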

Monitoring Quality Metrics

Use the console dashboard to track quality trends:

  • Navigate to Events → filter by quality_scores.relevance_score < 0.7
  • Group by model and time window to spot quality degradation
  • Set up webhook notifications for escalation spikes

# CLI: query low-quality events from the last 24 hours
kt events list --from "24h" --filter "quality_scores.relevance_score < 0.7" \
  --format table

Quality Scoring in CI

Integrate quality validation into your deployment pipeline:

#!/bin/bash
# ci-quality-check.sh — run against staging gateway

PROMPTS_FILE="test-prompts.json"
FAILURES=0

# Feed prompts via process substitution rather than a pipeline,
# so FAILURES increments survive the loop (a pipeline would run
# the while body in a subshell and discard them)
while read -r prompt; do
  curl -s http://staging-gateway:41002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$prompt" > /dev/null

  EVENT=$(kt events list --last 1 --format json)
  SCORE=$(echo "$EVENT" | jq '.[0].quality_scores.relevance_score // 0')

  if awk "BEGIN {exit !($SCORE < 0.65)}"; then
    echo "WARN: Low quality score ($SCORE) for prompt"
    FAILURES=$((FAILURES + 1))
  fi
done < <(jq -c '.[]' "$PROMPTS_FILE")

if [ "$FAILURES" -gt 0 ]; then
  echo "Quality gate failed: $FAILURES low-quality responses"
  exit 1
fi
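
The script expects test-prompts.json to be a JSON array of complete chat-completion request bodies, one per test case (each element is posted verbatim by the jq -c '.[]' loop). A minimal example:

[
  {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What is the maximum retention period for audit logs?"}]
  },
  {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Explain the data retention policy"}]
  }
]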

Key Takeaways

  • Quality policies provide deterministic enforcement over non-deterministic LLM outputs
  • Factual grounding checks verify responses against bound knowledge base assets
  • Citation records in events enable automated verification of source attribution
  • Layer multiple quality gates — length, relevance, grounding — for comprehensive validation
  • Monitor quality trends in the console dashboard and set up escalation alerts
  • Integrate quality scoring into CI to catch degradation before deployment

For AI systems

  • Canonical terms: quality policy, grounding policy, type: quality, type: grounding, relevance score, hallucination risk, citation coverage, knowledge base binding
  • Policy config keys: min_relevance_score, max_hallucination_risk, min_response_length, max_response_length, min_citation_coverage, require_sources
  • Event metadata: quality_scores.relevance_score, quality_scores.hallucination_risk, quality_scores.citation_coverage, quality_scores.sources_cited
  • CLI commands: kt knowledge-base list --status active, kt events list --last 1 --format json
  • Related pages: Accessibility Testing, Testing AI Systems, Regression Testing

For engineers

  • Configure type: quality policies with thresholds for min_relevance_score, max_hallucination_risk, and response length bounds
  • Add type: grounding policies that reference a bound knowledge base and set min_citation_coverage
  • Query event quality_scores to verify scoring is active: kt events list --last 1 --format json | jq '.[0].quality_scores'
  • Manage knowledge base assets with kt knowledge-base list --status active (alias: kt kb)
  • Build quality gate scripts that send test prompts and assert scores exceed thresholds
  • Validate: send a prompt answerable from the knowledge base, confirm citation_coverage > min_citation_coverage and sources_cited > 0

For leaders

  • Quality policies prevent low-quality AI responses from reaching users — escalation alerts flag issues for human review
  • Factual grounding against an authoritative knowledge base reduces hallucination risk and liability
  • Citation tracking provides an audit trail of which sources informed each response
  • Quality trends in the console dashboard reveal model degradation over time
  • Layered quality gates (length + relevance + grounding) provide comprehensive coverage without over-blocking
