
Quality Scorer

The quality-scorer policy evaluates AI response quality using a configurable mix of deterministic assertions, LLM-graded rubrics, NLP benchmarks, RAG evaluations, and agent trajectory checks. It enforces minimum score thresholds and controls what happens when a response fails quality gates — blocking, retrying, or falling back to a safe message.

Use this page when

  • You need to evaluate AI response quality using assertions, LLM judges, NLP benchmarks, or RAG evaluations.
  • You are configuring quality thresholds that block, retry, or escalate low-quality responses.
  • You want to set up quality scoring with industry modes, custom assertion packs, or agent trajectory checks.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Configuration

pack:
  name: quality-scorer-example-1
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer
policy:
  quality-scorer:
    industry: finance
    min_output_chars: 50
    min_sentences: 2
    mock_scoring: false
    providers:
      - id: quality-judge
        label: GPT-4o Judge
        provider: openai
        model: gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
        config:
          temperature: 0.0
    benchmarks:
      ragas_faithfulness: true
      ragas_relevancy: true
      coherence: true
      completeness: true
    assertions:
      - type: is-json
        name: response-is-json
        enabled: true
        threshold: 1.0
        weight: 0.3
        mode: enforce
        severity: critical
        config: {}
      - type: llm-rubric
        name: finance-accuracy
        enabled: true
        threshold: 0.85
        weight: 0.5
        mode: enforce
        severity: critical
        config:
          rubric: Evaluate whether the response is financially accurate, appropriately caveated, and suitable for internal review. Score from 0 to 1.
      - type: context-faithfulness
        name: rag-grounding
        enabled: true
        threshold: 0.8
        weight: 0.2
        mode: enforce
        severity: warning
        config: {}
    thresholds:
      min_aggregate: 0.7
      min_faithfulness: 0.8
      min_relevancy: 0.75
      min_coherence: 0.65
      min_completeness: 0.7
      min_accuracy: 0.8
    weights:
      faithfulness: 0.25
      relevancy: 0.25
      coherence: 0.15
      completeness: 0.15
      accuracy: 0.2
    failure_action:
      action: fallback
      fallback_message: I apologize, but I cannot provide a sufficiently accurate response at this time.
      max_retries: 2
    pass_policy:
      strategy: weighted_average
      threshold: 0.7
    industry_profiles:
      finance:
        min_aggregate: 0.85
        min_accuracy: 0.95
        min_faithfulness: 0.9
        min_relevancy: 0.8
        min_coherence: 0.75
        min_completeness: 0.8
      healthcare:
        min_aggregate: 0.9
        min_accuracy: 0.95
        min_faithfulness: 0.95
        min_relevancy: 0.85
        min_coherence: 0.8
        min_completeness: 0.85

Fields

Top-Level Properties

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| industry | string | "" | Name of an industry profile to activate from industry_profiles. When set, the profile's thresholds override the top-level thresholds. |
| min_output_chars | integer | 0 | Minimum character count for the AI response. Responses shorter than this fail immediately. Range: 0–10000000. |
| min_sentences | integer | 0 | Minimum sentence count for the AI response. Range: 0–1000000. |
| mock_scoring | boolean | false | When true, all assertions return synthetic pass scores. Useful for development and testing pipelines without consuming LLM judge tokens. |
| providers | array | [] | List of QualityProviderTarget entries — LLM providers used for graded assertions (e.g., llm-rubric, factuality). Also accepted as targets. |
| benchmarks | object |  | Toggles standard NLP/RAG benchmarks. See benchmarks below. |
| assertions | array | [] | Ordered list of QualityAssertion rules. See assertions below. |
| thresholds | object |  | Minimum pass scores for aggregate and per-type metrics. See thresholds below. |
| weights | object |  | Per-type scoring weights used for weighted average calculation. See weights below. |
| industry_profiles | object | {} | Named industry profiles with custom threshold overrides. See industry_profiles below. |
| failure_action | object |  | What to do when a response fails quality gates. See failure_action below. |
| pass_policy | object |  | Overall pass/fail aggregation strategy. See pass_policy below. |

providers / targets (QualityProviderTarget)

Each entry can be either a string (shorthand) or an object (full config).

String shorthand: "openai:gpt-4o" — parsed as provider:model.

Object format:

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | No | Unique identifier for this provider target. |
| label | string | No | Human-readable label shown in scoring reports. |
| provider | string | No | Provider name (e.g., "openai", "anthropic", "azure"). |
| target | string | No | Target identifier (alternative to provider + model). |
| model | string | No | Model name (e.g., "gpt-4o", "claude-sonnet-4-20250514"). |
| base_url | string | No | Custom base URL for the provider API. |
| api_base | string | No | Alias for base_url. |
| secret_key_ref | string or object | No | Reference to the API key: an environment variable name, or an object naming one (e.g., { env: OPENAI_API_KEY }, as used in the examples on this page). |
| headers | object | No | Additional HTTP headers sent with requests. |
| config | object | No | Provider-specific configuration (e.g., temperature, max_tokens). |
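Both forms can be mixed in one providers list. A minimal sketch (the id and environment variable names here are illustrative, not defaults):

```yaml
providers:
  # String shorthand, parsed as provider:model
  - openai:gpt-4o
  # Full object form for an explicit judge configuration
  - id: backup-judge               # illustrative id
    provider: anthropic
    model: claude-sonnet-4-20250514
    secret_key_ref:
      env: ANTHROPIC_API_KEY       # illustrative env var name
    config:
      temperature: 0.0
```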

benchmarks

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| ragas_faithfulness | boolean | false | Enable RAGAS faithfulness benchmark — measures whether the response is grounded in the provided context. |
| ragas_relevancy | boolean | false | Enable RAGAS relevancy benchmark — measures whether the response addresses the question. |
| bleu_score | boolean | false | Enable BLEU score — measures n-gram overlap with a reference response. |
| nli_entailment | boolean | false | Enable Natural Language Inference entailment check — verifies logical consistency. |
| coherence | boolean | false | Enable coherence benchmark — measures logical flow and consistency within the response. |
| completeness | boolean | false | Enable completeness benchmark — measures whether the response fully addresses the query. |

assertions (QualityAssertion)

| Property | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| type | string |  | Yes | Assertion type. See Assertion Types below for the full list. |
| name | string |  | No | Human-readable name for reporting. Length: 1–200. |
| enabled | boolean | true | No | Whether this assertion is active. Set false to skip without removing. |
| threshold | number |  | No | Minimum score (0–1) for this assertion to pass. |
| weight | number |  | No | Relative weight (0–1) in aggregate scoring. |
| config | object | {} | No | Assertion-specific configuration. Contents vary by type (e.g., rubricPrompt for llm-rubric, pattern for regex). |
| mode | string | "enforce" | No | Execution mode. "enforce" blocks on failure. "audit" logs but passes through. "shadow" runs silently for data collection. |
| severity | string | "critical" | No | Alert severity on failure. One of "critical", "warning", "info". |
| pack | string |  | No | Name of an assertion pack to inline. Used with type: "assert-set" to reference reusable assertion groups. |
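The pack property is easiest to see in a config fragment. A minimal sketch, assuming a reusable assertion group named format-checks has been defined elsewhere (the names here are illustrative):

```yaml
assertions:
  # Inline a reusable assertion group by pack name
  - type: assert-set
    name: shared-format-checks
    pack: format-checks
  # Regular assertions can sit alongside it in the same list
  - type: is-json
    name: valid-json
    threshold: 1.0
```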

Assertion Types

Deterministic

Evaluate output using exact matching, pattern matching, or structural validation. No LLM calls required.

| Type | Description |
| --- | --- |
| contains | Output contains the specified substring. |
| contains-all | Output contains all specified substrings. |
| contains-any | Output contains at least one specified substring. |
| icontains | Case-insensitive contains. |
| icontains-all | Case-insensitive contains-all. |
| icontains-any | Case-insensitive contains-any. |
| equals | Output exactly equals the expected value. |
| regex | Output matches the specified regular expression. |
| starts-with | Output begins with the specified string. |
| word-count | Output word count falls within the specified range. |
| is-json | Output is valid JSON. |
| is-html | Output is valid HTML. |
| is-sql | Output is valid SQL. |
| is-xml | Output is valid XML. |
| contains-json | Output contains a valid JSON substring. |
| contains-html | Output contains a valid HTML substring. |
| contains-sql | Output contains a valid SQL substring. |
| contains-xml | Output contains a valid XML substring. |
| is-valid-function-call | Output is a valid function call structure. |
| is-valid-openai-function-call | Output is a valid OpenAI function call. |
| is-valid-openai-tools-call | Output is a valid OpenAI tools call. |
| levenshtein | Levenshtein edit distance to expected output is within threshold. |
| latency | Response latency is within the specified limit (milliseconds). |
| cost | Response cost is within the specified limit (USD). |
| finish-reason | The model's finish reason matches the expected value. |
| f-score | F-score (precision/recall harmonic mean) against expected output. |
| tool-call-f1 | F1 score for tool call accuracy against expected tool calls. |

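As an illustration of the deterministic family, the stand-ins below mimic three of the types above (is-json, contains-all, word-count) as local 0-or-1 scorers; they are sketches, not the gateway's actual implementation:

```python
import json

def is_json(output: str) -> float:
    """Score 1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
        return 1.0
    except (ValueError, TypeError):
        return 0.0

def contains_all(output: str, values: list[str]) -> float:
    """Score 1.0 only if every required substring is present."""
    return 1.0 if all(v in output for v in values) else 0.0

def word_count(output: str, min_words: int, max_words: int) -> float:
    """Score 1.0 if the word count falls inside [min_words, max_words]."""
    n = len(output.split())
    return 1.0 if min_words <= n <= max_words else 0.0
```

Because checks like these run locally, they cost nothing per evaluation, which is why deterministic gates are worth running before any LLM-graded assertion.
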
LLM-Graded

Use an LLM judge (configured via providers) to evaluate output quality.

| Type | Description |
| --- | --- |
| llm-rubric | Grade output against a free-form rubric prompt. |
| search-rubric | Grade output with a rubric that includes search/retrieval context. |
| model-graded-closedqa | Grade correctness for closed-domain question answering. |
| factuality | Grade factual accuracy against a known correct answer. |
| g-eval | G-Eval framework scoring with customizable criteria. |
| answer-relevance | Grade whether the answer is relevant to the question. |
| similar | Semantic similarity to a reference output (LLM-judged). |
| semantic-similarity | Semantic similarity using embedding distance. |
| classifier | Classify output into predefined categories. |
| moderation | Run content moderation check on the output. |
| select-best | Compare multiple outputs and select the best one. |

Benchmark

Standard NLP metrics computed against reference outputs.

| Type | Description |
| --- | --- |
| rouge | ROUGE score (recall-oriented understudy for gisting evaluation). |
| rouge-n | ROUGE-N score (n-gram overlap). |
| meteor | METEOR score (semantic-aware translation metric). |
| gleu | GLEU score (Google-BLEU variant). |
| bleu_score | BLEU score (bilingual evaluation understudy). |
| perplexity | Model perplexity on the output. |
| perplexity-score | Normalized perplexity score. |
| nli_entailment | Natural Language Inference entailment check. |
| coherence | Coherence score for logical flow. |
| completeness | Completeness score for full coverage of the query. |

RAG (Retrieval-Augmented Generation)

Evaluate quality in RAG pipelines where context documents are provided.

| Type | Description |
| --- | --- |
| context-recall | Measures how much of the ground truth is captured by retrieved context. |
| context-relevance | Measures relevance of retrieved context to the question. |
| context-faithfulness | Measures whether the response is faithful to the retrieved context. |
| rag-document-exfiltration | Detects if the model leaks full retrieved documents verbatim. |
| rag-poisoning | Detects signs of poisoned context documents influencing the output. |
| rag-source-attribution | Verifies that the response correctly attributes sources. |
| conversation-relevance | Measures relevance within a multi-turn conversation context. |

Agent Trajectory

Evaluate agent behavior across multi-step workflows.

| Type | Description |
| --- | --- |
| trajectory:goal-success | Whether the agent achieved the stated goal. |
| trajectory:tool-used | Whether the agent used a specific expected tool. |
| trajectory:tool-sequence | Whether the agent used tools in the expected sequence. |
| trajectory:step-count | Whether the agent completed the task within a step budget. |
| trace-span-count | Number of trace spans falls within expected range. |
| trace-span-duration | Trace span durations fall within expected limits. |
| trace-error-spans | Number of error spans is within acceptable limits. |

Programmatic

Run custom code to evaluate output.

| Type | Description |
| --- | --- |
| javascript | Execute a JavaScript function that returns a score. |
| python | Execute a Python function that returns a score. |
| webhook | Send the output to an external webhook for scoring. |

Meta

Control assertion grouping and special behaviors.

| Type | Description |
| --- | --- |
| assert-set | Group multiple assertions into a reusable pack. Use with the pack property. |
| is-refusal | Check whether the model refused to answer. |
| max-score | Cap the assertion score at a maximum value. |
| pi | Prompt injection detection assertion. |

thresholds

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| min_aggregate | number | 0.7 | Minimum weighted aggregate score across all assertions and benchmarks. |
| min_faithfulness | number | 0.8 | Minimum faithfulness score. |
| min_relevancy | number | 0.75 | Minimum relevancy score. |
| min_bleu | number | 0.4 | Minimum BLEU score. |
| min_coherence | number | 0.65 | Minimum coherence score. |
| min_completeness | number | 0.7 | Minimum completeness score. |
| min_accuracy | number | 0.8 | Minimum accuracy score. |

weights

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| faithfulness | number | 0.25 | Weight of faithfulness in aggregate calculation. |
| relevancy | number | 0.25 | Weight of relevancy in aggregate calculation. |
| bleu | number | 0.2 | Weight of BLEU score in aggregate calculation. |
| coherence | number | 0.15 | Weight of coherence in aggregate calculation. |
| completeness | number | 0.15 | Weight of completeness in aggregate calculation. |
| accuracy | number | 0.2 | Weight of accuracy in aggregate calculation. |
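As a worked example of how these weights might combine per-type scores, here is a minimal sketch; it assumes (since this page does not spell it out) that weights are renormalized over the metrics actually scored:

```python
def weighted_aggregate(scores, weights):
    """Weighted average over the metrics that were actually scored.

    Assumes weights are renormalized over present metrics; the
    gateway's exact normalization may differ.
    """
    total_w = sum(weights[k] for k in scores if k in weights)
    if total_w == 0:
        return 0.0
    return sum(scores[k] * weights[k] for k in scores if k in weights) / total_w

# Default weights from the table above
weights = {"faithfulness": 0.25, "relevancy": 0.25, "bleu": 0.2,
           "coherence": 0.15, "completeness": 0.15, "accuracy": 0.2}

# Suppose only three metrics were scored on this response
scores = {"faithfulness": 0.9, "relevancy": 0.8, "coherence": 0.7}
agg = weighted_aggregate(scores, weights)
# (0.9*0.25 + 0.8*0.25 + 0.7*0.15) / (0.25 + 0.25 + 0.15) ≈ 0.815
```

Under thresholds.min_aggregate: 0.7 this response would pass; under a finance profile requiring 0.85 it would not.
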

failure_action

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| action | string | "block" | What to do when the response fails quality gates. One of "block", "fallback", "retry". |
| fallback_message | string | "I apologize, but I cannot provide a sufficiently accurate response at this time." | Message returned to the user when action is "fallback". |
| max_retries | integer | 0 | Number of retry attempts before falling back or blocking. Range: 0–5. Only used when action is "retry". |

pass_policy

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| strategy | string | "all" | Aggregation strategy. "all" requires every assertion to pass. "quorum" requires a fraction to pass. "weighted_average" computes a weighted score. |
| quorum | number |  | Fraction of assertions that must pass (0–1). Used when strategy is "quorum". |
| threshold | number |  | Minimum weighted average score (0–1). Used when strategy is "weighted_average". |

industry_profiles

A map of profile names to QualityProfile objects. Each profile overrides the top-level thresholds when activated via the industry property.

QualityProfile:

| Property | Type | Range | Description |
| --- | --- | --- | --- |
| min_aggregate | number | 0–1 | Override for thresholds.min_aggregate. |
| min_accuracy | number | 0–1 | Override for thresholds.min_accuracy. |
| min_faithfulness | number | 0–1 | Override for thresholds.min_faithfulness. |
| min_relevancy | number | 0–1 | Override for thresholds.min_relevancy. |
| min_coherence | number | 0–1 | Override for thresholds.min_coherence. |
| min_completeness | number | 0–1 | Override for thresholds.min_completeness. |

Use Cases

1. Simple Format Assertions

Validate that the response is valid JSON, contains required fields, and matches an expected pattern.

policy:
  quality-scorer:
    assertions:
      - type: is-json
        name: valid-json
        threshold: 1.0
        mode: enforce
        severity: critical
      - type: contains-all
        name: required-fields
        threshold: 1.0
        config:
          values:
            - '"status"'
            - '"data"'
            - '"timestamp"'
      - type: regex
        name: iso-date-format
        config:
          pattern: '\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}'
      - type: word-count
        name: minimum-length
        config:
          min: 20
          max: 500
    failure_action:
      action: block
pack:
  name: quality-scorer-example-2
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer

2. RAG Quality with Faithfulness and Relevancy Benchmarks

Ensure RAG pipeline outputs are grounded in retrieved context and relevant to the query.

policy:
  quality-scorer:
    benchmarks:
      ragas_faithfulness: true
      ragas_relevancy: true
      coherence: true
      completeness: true
    assertions:
      - type: context-faithfulness
        name: grounded-in-context
        threshold: 0.85
        weight: 0.4
        mode: enforce
        severity: critical
      - type: context-relevance
        name: context-is-relevant
        threshold: 0.75
        weight: 0.3
        mode: enforce
        severity: warning
      - type: rag-source-attribution
        name: cites-sources
        threshold: 0.8
        weight: 0.2
        mode: enforce
        severity: warning
      - type: rag-document-exfiltration
        name: no-document-leak
        threshold: 1.0
        weight: 0.1
        mode: enforce
        severity: critical
    thresholds:
      min_aggregate: 0.8
      min_faithfulness: 0.85
      min_relevancy: 0.75
    failure_action:
      action: fallback
      fallback_message: I could not generate a sufficiently grounded response. Please rephrase your question.
pack:
  name: quality-scorer-example-3
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer

3. LLM-Graded Rubric with External Provider

Use GPT-4o as a judge to grade responses against domain-specific rubrics.

pack:
  name: quality-scorer-example-4
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer
policy:
  quality-scorer:
    providers:
      - id: gpt4o-judge
        label: GPT-4o Quality Judge
        provider: openai
        model: gpt-4o
        secret_key_ref:
          env: OPENAI_JUDGE_API_KEY
        config:
          temperature: 0.0
          max_tokens: 1024
    assertions:
      - type: llm-rubric
        name: legal-accuracy
        threshold: 0.9
        weight: 0.4
        mode: enforce
        severity: critical
        config:
          rubric: |
            Evaluate the legal accuracy of this response:
            1. Are legal citations correct and verifiable?
            2. Is the legal reasoning logically sound?
            3. Are relevant statutes and case law referenced?
            Score from 0 to 1.
      - type: factuality
        name: fact-check
        threshold: 0.85
        weight: 0.3
        mode: enforce
        severity: critical
        config:
          reference_statement: The response must cite only verifiable legal authorities and must not invent statutes or case law.
      - type: answer-relevance
        name: stays-on-topic
        threshold: 0.8
        weight: 0.2
        mode: enforce
        severity: warning
        config: {}
      - type: moderation
        name: content-safety
        threshold: 1.0
        weight: 0.1
        mode: enforce
        severity: critical
        config: {}
    pass_policy:
      strategy: weighted_average
      threshold: 0.85

4. Industry-Specific Profiles

Define reusable threshold profiles and activate them by setting industry.

pack:
  name: quality-scorer-example-5
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer
policy:
  quality-scorer:
    industry: healthcare
    providers:
      - id: clinical-judge
        label: Clinical Quality Judge
        provider: openai
        model: gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
        config:
          temperature: 0.0
    benchmarks:
      ragas_faithfulness: true
      coherence: true
      completeness: true
    assertions:
      - type: llm-rubric
        name: clinical-accuracy
        threshold: 0.95
        mode: enforce
        severity: critical
        config:
          rubric: Evaluate clinical accuracy. Are diagnoses evidence-based? Are drug interactions noted?
      - type: context-faithfulness
        name: evidence-grounding
        threshold: 0.9
        mode: enforce
        severity: critical
        config: {}
    industry_profiles:
      finance:
        min_aggregate: 0.85
        min_accuracy: 0.95
        min_faithfulness: 0.9
        min_relevancy: 0.8
        min_coherence: 0.75
        min_completeness: 0.8
      healthcare:
        min_aggregate: 0.9
        min_accuracy: 0.95
        min_faithfulness: 0.95
        min_relevancy: 0.85
        min_coherence: 0.8
        min_completeness: 0.85
      legal:
        min_aggregate: 0.88
        min_accuracy: 0.92
        min_faithfulness: 0.9
        min_relevancy: 0.85
        min_coherence: 0.8
        min_completeness: 0.82
    failure_action:
      action: block

5. Assertion Pack with Mixed Deterministic and Graded

Combine structural validation with LLM-graded quality in a single assertion set.

policy:
  quality-scorer:
    providers:
      - openai:gpt-4o
    assertions:
      - type: is-json
        name: valid-structure
        threshold: 1.0
        mode: enforce
        severity: critical
      - type: contains
        name: has-recommendation
        config:
          value: recommendation
        mode: enforce
        severity: warning
      - type: latency
        name: response-time
        config:
          maxLatency: 5000
        mode: audit
        severity: info
      - type: cost
        name: cost-cap
        config:
          maxCost: 0.05
        mode: audit
        severity: info
      - type: llm-rubric
        name: quality-rubric
        threshold: 0.8
        weight: 0.5
        mode: enforce
        severity: critical
        config:
          rubricPrompt: Is the recommendation actionable, specific, and supported by data?
      - type: semantic-similarity
        name: consistent-with-baseline
        threshold: 0.7
        weight: 0.3
        mode: enforce
        severity: warning
    pass_policy:
      strategy: quorum
      quorum: 0.8
    failure_action:
      action: retry
      max_retries: 2
pack:
  name: quality-scorer-example-6
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer

6. Agent Trajectory Assertions

Evaluate whether an AI agent completed its task correctly by checking tool usage and step counts.

policy:
  quality-scorer:
    assertions:
      - type: trajectory:goal-success
        name: task-completed
        threshold: 1.0
        weight: 0.4
        mode: enforce
        severity: critical
      - type: trajectory:tool-used
        name: used-search-tool
        threshold: 1.0
        weight: 0.2
        mode: enforce
        severity: warning
        config:
          tool: web_search
      - type: trajectory:tool-sequence
        name: correct-workflow
        threshold: 1.0
        weight: 0.2
        mode: audit
        severity: info
        config:
          sequence:
            - search_database
            - analyze_results
            - generate_report
      - type: trajectory:step-count
        name: efficient-execution
        threshold: 1.0
        weight: 0.1
        mode: audit
        severity: info
        config:
          maxSteps: 10
      - type: trace-error-spans
        name: no-errors
        threshold: 1.0
        weight: 0.1
        mode: enforce
        severity: warning
        config:
          maxErrors: 0
    pass_policy:
      strategy: weighted_average
      threshold: 0.85
pack:
  name: quality-scorer-example-7
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer

How It Works

  1. Minimum length check — Before running assertions, the gateway checks min_output_chars and min_sentences. If the response is too short, it fails immediately without consuming assertion evaluation resources.

  2. Assertion evaluation — Each enabled assertion in the assertions array is evaluated in order:

    • Deterministic assertions (e.g., is-json, contains, regex) run locally with no external calls.
    • LLM-graded assertions (e.g., llm-rubric, factuality) send the output to the configured providers for scoring.
    • Benchmark assertions (e.g., rouge, bleu_score) compute standard NLP metrics against reference outputs.
    • RAG assertions evaluate context faithfulness and relevance using the retrieval context.
    • Agent trajectory assertions inspect the trace spans for goal completion, tool usage, and step counts.
  3. Score aggregation — Individual assertion scores are combined according to pass_policy.strategy:

    • "all" — Every assertion must meet its threshold. One failure fails the entire evaluation.
    • "quorum" — At least pass_policy.quorum fraction of assertions must pass.
    • "weighted_average" — Scores are combined using weights and compared against pass_policy.threshold.
  4. Threshold comparison — The aggregate score and per-type scores are compared against thresholds. If an industry profile is active, its overrides take precedence.

  5. Failure handling — When the response fails quality gates, failure_action determines the outcome:

    • "block" — The response is suppressed entirely.
    • "retry" — The request is re-sent to the LLM up to max_retries times. If all retries fail, behavior falls through to block.
    • "fallback" — The fallback_message is returned instead of the AI response.
  6. Mode enforcement — Only assertions with mode: "enforce" affect the pass/fail outcome. "audit" assertions are scored and logged but do not block. "shadow" assertions run silently for observability only.

  7. Mock scoring — When mock_scoring: true, all assertions return synthetic passing scores. This allows CI/CD pipelines to test policy configuration without real LLM calls.

  8. Event emission — All assertion results, aggregate scores, and failure actions are emitted as structured events to POST /v1/events.
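The aggregation strategies in step 3 can be sketched as follows; this is an illustrative model (the function and tuple shapes are assumptions, not the gateway's internals):

```python
def passes(results, strategy="all", quorum=None, threshold=None):
    """Sketch of pass_policy aggregation over enforce-mode assertions.

    results: list of (score, threshold, weight) tuples.
    """
    if strategy == "all":
        # Every assertion must meet its own threshold.
        return all(score >= thr for score, thr, _ in results)
    if strategy == "quorum":
        # At least `quorum` fraction of assertions must pass.
        passed = sum(1 for score, thr, _ in results if score >= thr)
        return passed / len(results) >= quorum
    if strategy == "weighted_average":
        # Combine scores by weight and compare to the policy threshold.
        total_w = sum(w for _, _, w in results) or 1.0
        avg = sum(score * w for score, _, w in results) / total_w
        return avg >= threshold
    raise ValueError(f"unknown strategy: {strategy}")

results = [(1.0, 1.0, 0.3),   # e.g. is-json passed
           (0.9, 0.85, 0.5),  # e.g. llm-rubric passed
           (0.7, 0.8, 0.2)]   # e.g. context-faithfulness failed
```

With these example scores, "all" fails because one assertion missed its threshold, while a weighted average of 0.89 still clears a pass_policy.threshold of 0.85, which is why the weighted strategy is more forgiving of a single low-weight miss.
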

Combining With Other Policies

| Policy | Interaction |
| --- | --- |
| prompt-injection | Runs before quality scoring. Injected prompts are blocked before they reach the LLM, so quality scoring only evaluates legitimate responses. |
| pii-filter | Runs before quality scoring on the output. PII is redacted before the quality scorer evaluates the response, so assertions see the redacted version. |
| content-filter | Runs before quality scoring. Blocked content never reaches the quality scorer. |
| agent-firewall | Runs before quality scoring. The firewall controls which tools the agent can call; the quality scorer evaluates the final response. |
| rate-limiter | Independent. Rate limiting controls request throughput; quality scoring evaluates response content. Both are enforced. |
| disclaimer | Runs after quality scoring. Disclaimers are appended to the response after it passes quality gates, so they do not affect scoring. |

Best Practices

  • Start with deterministic assertions (is-json, contains, regex) before adding LLM-graded assertions. Deterministic checks are free, fast, and repeatable — they catch format issues before expensive LLM judge calls.
  • Use mode: "audit" for new assertions. Run in audit mode for a week to collect baseline scores before switching to "enforce". This prevents unexpected blocking in production.
  • Set mock_scoring: true in CI/CD. This validates your policy configuration syntax without consuming LLM tokens or requiring API keys.
  • Define industry_profiles for multi-vertical deployments. Instead of duplicating entire policy configs, use profiles to override only the thresholds that differ.
  • Use "weighted_average" pass policy for nuanced scoring. The "all" strategy is strict — one low-weight informational assertion can block a response. Weighted average lets you balance strictness with flexibility.
  • Keep max_retries at 1–2. Each retry doubles LLM cost. More than 2 retries suggests a fundamental model or prompt issue, not a transient quality dip.
  • Separate assertion packs by concern. Use assert-set with pack to group related assertions (e.g., "format-checks", "clinical-accuracy") so they can be reused across policies.
  • Use trajectory assertions for agent workflows. Traditional text-quality assertions evaluate the final response; trajectory assertions evaluate the agent's decision-making process.
  • Configure a dedicated low-temperature judge provider. Set temperature: 0.0 on your judge LLM to maximize scoring reproducibility across evaluations.

For AI systems

  • Canonical terms: Keeptrusts, quality-scorer, assertions, thresholds, min_aggregate, failure_action, providers, benchmarks, industry, mock_scoring, retry, fallback_message
  • Config/command names: policy.quality-scorer, assertions[], thresholds.min_aggregate, failure_action (block/retry/fallback_message), providers[] (judge LLMs), benchmarks, industry
  • Best next pages: Quality Assertions Configuration, Config Testing, Human Oversight

For engineers

  • Prerequisites: For LLM-judged assertions: a configured judge provider. For RAG benchmarks: context documents attached to requests. Define thresholds.min_aggregate based on acceptable quality floor.
  • Validation: Run kt policy test with quality-focused test cases. Monitor aggregate scores in Events. Adjust thresholds based on false-blocking rates. Use mock_scoring: true for development.
  • Key commands: kt policy lint, kt policy test, kt events tail, console Events page

For leaders

  • Governance: Quality scoring defines your organization's minimum acceptable AI output bar. It prevents low-quality, hallucinated, or irrelevant responses from reaching users.
  • Cost: LLM-judged assertions consume additional tokens per request (judge model cost). NLP benchmarks and deterministic assertions are near-free. Budget for judge model costs in high-volume deployments.
  • Rollout: Start with deterministic assertions (format checks, keyword requirements). Add LLM-judged assertions for critical quality dimensions. Set failure_action: retry before block to maximize response delivery.
