AI Output Quality Scoring & Validation

Quality assurance for AI outputs goes beyond ensuring the system runs — it means validating that responses meet accuracy, relevance, and compliance thresholds. Keeptrusts provides policy-driven quality controls that score, flag, and gate AI outputs before they reach end users.

Use this page when

  • You need to configure quality scoring policies that gate AI outputs on relevance, hallucination risk, and response length
  • You are implementing factual grounding checks against a bound knowledge base
  • You want to validate citation coverage, set up quality gates, and monitor quality trends in the console

Primary audience

  • Primary: Technical Engineers
  • Secondary: AI Agents, Technical Leaders

Response Quality Policies

Quality policies evaluate LLM responses against configurable thresholds. Unlike content-safety policies that block harmful outputs, quality policies assess whether outputs are useful, grounded, and appropriate.

Configuring Quality Thresholds

# policy-config.yaml — quality scoring policies
policies:
  - name: response-quality-gate
    type: quality
    action: escalate
    thresholds:
      min_relevance_score: 0.7
      max_hallucination_risk: 0.3
      min_response_length: 50
      max_response_length: 4000

  - name: factual-grounding-check
    type: grounding
    action: flag
    knowledge_base: product-docs-v3
    min_citation_coverage: 0.6
    require_sources: true

When a response falls below the min_relevance_score or exceeds max_hallucination_risk, the gateway escalates the event for human review rather than silently delivering a low-quality answer.
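
A quick way to confirm the gate is firing is to pull recent events and list any whose scores breach the thresholds above. A minimal sketch, assuming events carry the quality_scores fields shown later on this page:

# List the last 20 events and surface any below the configured
# relevance threshold (0.7 in the policy above)
kt events list --last 20 --format json \
  | jq '[ .[]
          | select((.quality_scores.relevance_score // 1) < 0.7)
          | {relevance: .quality_scores.relevance_score,
             risk: .quality_scores.hallucination_risk} ]'
# the // 1 fallback skips events that have no quality_scores block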

Factual Grounding Checks

Grounding policies verify that AI responses are anchored in authoritative sources. This is critical for regulated industries where hallucinated claims create liability.

How Grounding Works

  1. The gateway receives the LLM response
  2. The grounding policy compares response claims against bound knowledge base assets
  3. Each claim is scored for citation coverage
  4. Responses below the threshold are flagged or blocked

Testing Grounding Policies

# Send a prompt that should be grounded in the knowledge base
curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {"role": "system", "content": "Answer using only the provided knowledge base."},
      {"role": "user", "content": "What is the maximum retention period for audit logs?"}
    ]
  }' | jq '.choices[0].message.content'

# Check the event for grounding scores
kt events list --last 1 --format json | jq '.[0].quality_scores'

Expected output for a well-grounded response:

{
  "relevance_score": 0.85,
  "hallucination_risk": 0.12,
  "citation_coverage": 0.78,
  "sources_cited": 3
}
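
For contrast, a response that fails the gates configured above might score like this (illustrative values, not real output):

{
  "relevance_score": 0.52,
  "hallucination_risk": 0.41,
  "citation_coverage": 0.33,
  "sources_cited": 0
}

Here the quality policy escalates (0.52 is below the 0.7 relevance minimum, and the 0.41 hallucination risk exceeds the 0.3 ceiling), and the grounding policy flags the response (0.33 is below the 0.6 coverage minimum).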

Knowledge Base Citation Verification

When knowledge base assets are bound to a gateway configuration, the gateway tracks which assets were recalled and cited during each interaction.

Managing Knowledge Base Assets

# List active knowledge base assets
kt knowledge-base list --status active

# Promote a draft asset to active
kt knowledge-base promote --id kb_asset_12345

# Bind assets to a gateway configuration
kt knowledge-base bind --config production --asset-ids kb_asset_12345,kb_asset_67890

Verifying Citations in Events

After processing, each event records citation metadata:

# Query events with citation details
kt events list --last 5 --format json | jq '.[].citations'

Sample citation record:

{
  "citations": [
    {
      "asset_id": "kb_asset_12345",
      "asset_title": "Data Retention Policy v2.1",
      "chunk_id": "chunk_0042",
      "relevance": 0.91
    }
  ]
}

Writing Citation Tests

Automate citation verification in your test suite:

#!/bin/bash
# test-citations.sh — verify knowledge base citations

# Trigger an interaction; the response body itself is not needed,
# since we inspect the event the gateway records for it
RESPONSE=$(curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Explain the data retention policy"}]
  }')

# Fetch the latest event
EVENT=$(kt events list --last 1 --format json)

# Assert citations exist
CITATION_COUNT=$(echo "$EVENT" | jq '.[0].citations | length')
if [ "$CITATION_COUNT" -gt 0 ]; then
  echo "PASS: Response includes $CITATION_COUNT citation(s)"
else
  echo "FAIL: No citations found — response may be hallucinated"
  exit 1
fi

# Assert minimum relevance
MIN_RELEVANCE=$(echo "$EVENT" | jq '[.[0].citations[].relevance] | min')
THRESHOLD="0.6"
if awk "BEGIN {exit !($MIN_RELEVANCE >= $THRESHOLD)}"; then
  echo "PASS: All citations meet minimum relevance ($MIN_RELEVANCE >= $THRESHOLD)"
else
  echo "FAIL: Low-relevance citation detected ($MIN_RELEVANCE < $THRESHOLD)"
  exit 1
fi
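
Run the script against a gateway configuration that has knowledge base assets bound (see the bind command above); without bound assets, the citation check is expected to fail:

chmod +x test-citations.sh
./test-citations.sh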

Automated Quality Gates

Quality gates act as checkpoints in the AI output pipeline. Configure multiple gates for layered validation:

policies:
  - name: length-gate
    type: quality
    action: block
    thresholds:
      min_response_length: 20
    message: "Response too short to be useful."

  - name: relevance-gate
    type: quality
    action: escalate
    thresholds:
      min_relevance_score: 0.65
    escalation_channel: quality-review

  - name: grounding-gate
    type: grounding
    action: flag
    knowledge_base: company-policies
    min_citation_coverage: 0.5

Gate Evaluation Order

Gates are evaluated in the output phase of the policy chain:

  1. Length gate — rejects trivially short responses immediately
  2. Relevance gate — escalates low-relevance responses for human review
  3. Grounding gate — flags poorly grounded responses in the event stream
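
To see the gates in action, send a prompt that invites a trivially short answer and then inspect the recorded event; what you observe depends on which actions (block, escalate, flag) your policies configure. A sketch against the local gateway used in earlier examples:

# A prompt likely to trip the length gate
curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Reply with the single word OK."}]
  }'

# Inspect the scores recorded for the latest event
kt events list --last 1 --format json | jq '.[0].quality_scores'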

Monitoring Quality Metrics

Use the console dashboard to track quality trends:

  • Navigate to Events → filter by quality_scores.relevance_score < 0.7
  • Group by model and time window to spot quality degradation
  • Set up webhook notifications for escalation spikes

# CLI: query low-quality events from the last 24 hours
kt events list --from "24h" --filter "quality_scores.relevance_score < 0.7" \
  --format table

Quality Scoring in CI

Integrate quality validation into your deployment pipeline:

#!/bin/bash
# ci-quality-check.sh — run against staging gateway

PROMPTS_FILE="test-prompts.json"
FAILURES=0

# Feed prompts via process substitution rather than a pipeline,
# so FAILURES increments survive the loop (a pipeline would run
# the while body in a subshell and discard them)
while read -r prompt; do
  curl -s http://staging-gateway:41002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$prompt" > /dev/null

  EVENT=$(kt events list --last 1 --format json)
  SCORE=$(echo "$EVENT" | jq '.[0].quality_scores.relevance_score // 0')

  if awk "BEGIN {exit !($SCORE < 0.65)}"; then
    echo "WARN: Low quality score ($SCORE) for prompt"
    FAILURES=$((FAILURES + 1))
  fi
done < <(jq -c '.[]' "$PROMPTS_FILE")

if [ "$FAILURES" -gt 0 ]; then
  echo "Quality gate failed: $FAILURES low-quality responses"
  exit 1
fi
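
The script expects test-prompts.json to be a JSON array of complete chat-completion request bodies, one per test case (each element is posted verbatim by the jq -c '.[]' loop). A minimal example:

[
  {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What is the maximum retention period for audit logs?"}]
  },
  {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Explain the data retention policy"}]
  }
]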

Key Takeaways

  • Quality policies provide deterministic enforcement over non-deterministic LLM outputs
  • Factual grounding checks verify responses against bound knowledge base assets
  • Citation records in events enable automated verification of source attribution
  • Layer multiple quality gates — length, relevance, grounding — for comprehensive validation
  • Monitor quality trends in the console dashboard and set up escalation alerts
  • Integrate quality scoring into CI to catch degradation before deployment

For AI systems

  • Canonical terms: quality policy, grounding policy, type: quality, type: grounding, relevance score, hallucination risk, citation coverage, knowledge base binding
  • Policy config keys: min_relevance_score, max_hallucination_risk, min_response_length, max_response_length, min_citation_coverage, require_sources
  • Event metadata: quality_scores.relevance_score, quality_scores.hallucination_risk, quality_scores.citation_coverage, quality_scores.sources_cited
  • CLI commands: kt knowledge-base list --status active, kt events list --last 1 --format json
  • Related pages: Accessibility Testing, Testing AI Systems, Regression Testing

For engineers

  • Configure type: quality policies with thresholds for min_relevance_score, max_hallucination_risk, and response length bounds
  • Add type: grounding policies that reference a bound knowledge base and set min_citation_coverage
  • Query event quality_scores to verify scoring is active: kt events list --last 1 --format json | jq '.[0].quality_scores'
  • Manage knowledge base assets with kt knowledge-base list --status active (alias: kt kb)
  • Build quality gate scripts that send test prompts and assert scores exceed thresholds
  • Validate: send a prompt answerable from the knowledge base, confirm citation_coverage > min_citation_coverage and sources_cited > 0

For leaders

  • Quality policies prevent low-quality AI responses from reaching users — escalation alerts flag issues for human review
  • Factual grounding against an authoritative knowledge base reduces hallucination risk and liability
  • Citation tracking provides an audit trail of which sources informed each response
  • Quality trends in the console dashboard reveal model degradation over time
  • Layered quality gates (length + relevance + grounding) provide comprehensive coverage without over-blocking
