AI Output Quality Scoring & Validation
Quality assurance for AI outputs goes beyond ensuring the system runs — it means validating that responses meet accuracy, relevance, and compliance thresholds. Keeptrusts provides policy-driven quality controls that score, flag, and gate AI outputs before they reach end users.
Use this page when
- You need to configure quality scoring policies that gate AI outputs on relevance, hallucination risk, and response length
- You are implementing factual grounding checks against a bound knowledge base
- You want to validate citation coverage, set up quality gates, and monitor quality trends in the console
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
Response Quality Policies
Quality policies evaluate LLM responses against configurable thresholds. Unlike content-safety policies that block harmful outputs, quality policies assess whether outputs are useful, grounded, and appropriate.
Configuring Quality Thresholds
# policy-config.yaml — quality scoring policies
policies:
  - name: response-quality-gate
    type: quality
    action: escalate
    thresholds:
      min_relevance_score: 0.7
      max_hallucination_risk: 0.3
      min_response_length: 50
      max_response_length: 4000
  - name: factual-grounding-check
    type: grounding
    action: flag
    knowledge_base: product-docs-v3
    min_citation_coverage: 0.6
    require_sources: true
When a response falls below the min_relevance_score or exceeds max_hallucination_risk, the gateway escalates the event for human review rather than silently delivering a low-quality answer.
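The gating decision itself is a plain threshold comparison. A minimal sketch of the logic, using the thresholds configured above and hypothetical scores (the gateway applies this internally to the quality_scores it computes for each response):
# Hypothetical scores for one response
RELEVANCE=0.62        # below min_relevance_score (0.7)
HALLUCINATION=0.25    # within max_hallucination_risk (0.3)
if awk "BEGIN {exit !($RELEVANCE < 0.7 || $HALLUCINATION > 0.3)}"; then
  echo "ESCALATE: hold for human review instead of delivering the response"
fi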
Factual Grounding Checks
Grounding policies verify that AI responses are anchored in authoritative sources. This is critical for regulated industries where hallucinated claims create liability.
How Grounding Works
- The gateway receives the LLM response
- The grounding policy compares response claims against bound knowledge base assets
- Each claim is scored for citation coverage (see the sketch after this list)
- Responses below the threshold are flagged or blocked
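A minimal sketch of the arithmetic behind that score, assuming coverage is the fraction of extracted claims backed by at least one citation (the gateway computes the real value and records it as quality_scores.citation_coverage):
# Illustrative only: 3 of 5 extracted claims matched a knowledge base chunk
TOTAL_CLAIMS=5
CITED_CLAIMS=3
awk "BEGIN {printf \"citation_coverage: %.2f\n\", $CITED_CLAIMS / $TOTAL_CLAIMS}"
# Prints 0.60, which meets factual-grounding-check's min_citation_coverage of 0.6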
Testing Grounding Policies
# Send a prompt that should be grounded in the knowledge base
curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [
      {"role": "system", "content": "Answer using only the provided knowledge base."},
      {"role": "user", "content": "What is the maximum retention period for audit logs?"}
    ]
  }' | jq '.choices[0].message.content'
# Check the event for grounding scores
kt events list --last 1 --format json | jq '.[0].quality_scores'
Expected output for a well-grounded response:
{
  "relevance_score": 0.85,
  "hallucination_risk": 0.12,
  "citation_coverage": 0.78,
  "sources_cited": 3
}
Knowledge Base Citation Verification
When knowledge base assets are bound to a gateway configuration, the gateway tracks which assets were recalled and cited during each interaction.
Managing Knowledge Base Assets
# List active knowledge base assets
kt knowledge-base list --status active
# Promote a draft asset to active
kt knowledge-base promote --id kb_asset_12345
# Bind assets to a gateway configuration
kt knowledge-base bind --config production --asset-ids kb_asset_12345,kb_asset_67890
Verifying Citations in Events
After processing, each event records citation metadata:
# Query events with citation details
kt events list --last 5 --format json | jq '.[].citations'
Sample citation record:
{
  "citations": [
    {
      "asset_id": "kb_asset_12345",
      "asset_title": "Data Retention Policy v2.1",
      "chunk_id": "chunk_0042",
      "relevance": 0.91
    }
  ]
}
Writing Citation Tests
Automate citation verification in your test suite:
#!/bin/bash
# test-citations.sh — verify knowledge base citations

# Send a test prompt; the response body itself isn't needed, since scores
# and citations are read from the recorded event.
curl -s http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Explain the data retention policy"}]
  }' > /dev/null

# Fetch the latest event
EVENT=$(kt events list --last 1 --format json)

# Assert citations exist
CITATION_COUNT=$(echo "$EVENT" | jq '.[0].citations | length')
if [ "$CITATION_COUNT" -gt 0 ]; then
  echo "PASS: Response includes $CITATION_COUNT citation(s)"
else
  echo "FAIL: No citations found — response may be hallucinated"
  exit 1
fi

# Assert minimum relevance across all citations
MIN_RELEVANCE=$(echo "$EVENT" | jq '[.[0].citations[].relevance] | min')
THRESHOLD="0.6"
if awk "BEGIN {exit !($MIN_RELEVANCE >= $THRESHOLD)}"; then
  echo "PASS: All citations meet minimum relevance ($MIN_RELEVANCE >= $THRESHOLD)"
else
  echo "FAIL: Low-relevance citation detected ($MIN_RELEVANCE < $THRESHOLD)"
  exit 1
fi
Automated Quality Gates
Quality gates act as checkpoints in the AI output pipeline. Configure multiple gates for layered validation:
policies:
  - name: length-gate
    type: quality
    action: block
    thresholds:
      min_response_length: 20
    message: "Response too short to be useful."
  - name: relevance-gate
    type: quality
    action: escalate
    thresholds:
      min_relevance_score: 0.65
    escalation_channel: quality-review
  - name: grounding-gate
    type: grounding
    action: flag
    knowledge_base: company-policies
    min_citation_coverage: 0.5
Gate Evaluation Order
Gates are evaluated in the output phase of the policy chain:
- Length gate — rejects trivially short responses immediately
- Relevance gate — escalates low-relevance responses for human review
- Grounding gate — flags poorly grounded responses in the event stream
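A sketch of the chain's net effect as plain shell logic, using the thresholds configured above and hypothetical scores (illustrative only; the gateway enforces this server-side):
# Hypothetical scores for a single response
RESPONSE_LENGTH=120
RELEVANCE=0.58
CITATION_COVERAGE=0.45

# 1. length-gate (action: block) rejects delivery outright
if [ "$RESPONSE_LENGTH" -lt 20 ]; then
  echo "BLOCK: Response too short to be useful."; exit 1
fi

# 2. relevance-gate (action: escalate) holds the response for human review
if awk "BEGIN {exit !($RELEVANCE < 0.65)}"; then
  echo "ESCALATE: relevance $RELEVANCE < 0.65 -> quality-review channel"
fi

# 3. grounding-gate (action: flag) annotates the event; delivery continues
if awk "BEGIN {exit !($CITATION_COVERAGE < 0.5)}"; then
  echo "FLAG: citation coverage $CITATION_COVERAGE < 0.5"
fi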
Monitoring Quality Metrics
Use the console dashboard to track quality trends:
- Navigate to Events → filter by quality_scores.relevance_score < 0.7
- Group by model and time window to spot quality degradation (see the jq sketch below)
- Set up webhook notifications for escalation spikes
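The model and time-window grouping can be approximated from the CLI with jq. This assumes each event record carries a top-level model field; adjust the path to your event schema:
# Average relevance per model over the last 24 hours
kt events list --from "24h" --format json | jq -r '
  group_by(.model)
  | map({model: .[0].model,
         avg_relevance: (map(.quality_scores.relevance_score // 0) | add / length)})
  | .[] | "\(.model)  avg_relevance=\(.avg_relevance)"'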
# CLI: query low-quality events from the last 24 hours
kt events list --from "24h" --filter "quality_scores.relevance_score < 0.7" \
--format table
Quality Scoring in CI
Integrate quality validation into your deployment pipeline:
#!/bin/bash
# ci-quality-check.sh — run against staging gateway
PROMPTS_FILE="test-prompts.json"
FAILURES=0

# Process substitution (rather than a pipe) keeps the while loop in the
# current shell, so the FAILURES counter survives past the loop.
while read -r prompt; do
  curl -s http://staging-gateway:41002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$prompt" > /dev/null
  EVENT=$(kt events list --last 1 --format json)
  SCORE=$(echo "$EVENT" | jq '.[0].quality_scores.relevance_score // 0')
  if awk "BEGIN {exit !($SCORE < 0.65)}"; then
    echo "WARN: Low quality score ($SCORE) for prompt"
    FAILURES=$((FAILURES + 1))
  fi
done < <(jq -c '.[]' "$PROMPTS_FILE")

if [ "$FAILURES" -gt 0 ]; then
  echo "Quality gate failed: $FAILURES low-quality responses"
  exit 1
fi
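The script expects test-prompts.json to be a JSON array of complete request payloads, one per test case. A minimal example, with placeholder prompts that should be answerable from your knowledge base:
# Example test-prompts.json (prompts are placeholders)
cat > test-prompts.json <<'EOF'
[
  {"model": "gpt-4o",
   "messages": [{"role": "user", "content": "What is the maximum retention period for audit logs?"}]},
  {"model": "gpt-4o",
   "messages": [{"role": "user", "content": "Explain the data retention policy"}]}
]
EOF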
Key Takeaways
- Quality policies provide deterministic enforcement over non-deterministic LLM outputs
- Factual grounding checks verify responses against bound knowledge base assets
- Citation records in events enable automated verification of source attribution
- Layer multiple quality gates — length, relevance, grounding — for comprehensive validation
- Monitor quality trends in the console dashboard and set up escalation alerts
- Integrate quality scoring into CI to catch degradation before deployment
For AI systems
- Canonical terms: quality policy, grounding policy, type: quality, type: grounding, relevance score, hallucination risk, citation coverage, knowledge base binding
- Policy config keys: min_relevance_score, max_hallucination_risk, min_response_length, max_response_length, min_citation_coverage, require_sources
- Event metadata: quality_scores.relevance_score, quality_scores.hallucination_risk, quality_scores.citation_coverage, quality_scores.sources_cited
- CLI commands: kt knowledge-base list --status active, kt events list --last 1 --format json
- Related pages: Accessibility Testing, Testing AI Systems, Regression Testing
For engineers
- Configure type: quality policies with thresholds for min_relevance_score, max_hallucination_risk, and response length bounds
- Add type: grounding policies that reference a bound knowledge base and set min_citation_coverage
- Query event quality_scores to verify scoring is active: kt events list --last 1 --format json | jq '.[0].quality_scores'
- Manage knowledge base assets with kt knowledge-base list --status active (alias: kt kb)
- Build quality gate scripts that send test prompts and assert scores exceed thresholds
- Validate: send a prompt answerable from the knowledge base, confirm citation_coverage > min_citation_coverage and sources_cited > 0
For leaders
- Quality policies prevent low-quality AI responses from reaching users — escalation alerts flag issues for human review
- Factual grounding against an authoritative knowledge base reduces hallucination risk and liability
- Citation tracking provides an audit trail of which sources informed each response
- Quality trends in the console dashboard reveal model degradation over time
- Layered quality gates (length + relevance + grounding) provide comprehensive coverage without over-blocking
Next steps
- Add Accessibility Testing readability policies for inclusive AI outputs
- Use Testing AI Systems patterns to assert on quality scores deterministically
- Detect quality regressions with Regression Testing before/after comparison