Quality Assertions Configuration
The quality-scorer policy supports 65+ assertion types for evaluating LLM output quality. Assertions can check string content, compute NLP metrics, call external LLM judges, and validate structured outputs.
Use this page when
- You are defining quality assertions for the `quality-scorer` policy in `policy-config.yaml`.
- You need to validate LLM output using string checks, NLP metrics, LLM judges, or structured output validation.
- You are choosing assertion types, configuring thresholds, or building custom assertion packs.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Quick reference
policy:
quality-scorer:
assertions:
- type: contains
value: disclaimer
- type: llm-rubric
rubric: Response must be factually accurate
threshold: 0.8
thresholds:
min_aggregate: 0.8
failure_action:
action: block
pack:
name: config-quality-assertions-example-1
version: 1.0.0
enabled: true
policies:
chain:
- quality-scorer
Assertion structure
Each assertion is an object with a type and type-specific fields:
- type: "contains" # required: assertion type
name: "has-disclaimer" # optional: human-readable name
value: "disclaimer" # type-specific field
threshold: 0.8 # optional: min score to pass (default varies)
weight: 1.0 # optional: relative weight (0.0–10.0, default 1.0)
enabled: true # optional: enable/disable (default true)
mode: "enforce" # optional: enforce | audit | shadow
severity: "critical" # optional: critical | warning | info
negate: false # optional: invert match
case_sensitive: true # optional: case sensitivity (default true)
provider: "judge-provider" # optional: provider for LLM-backed assertions
pack: "my-pack" # optional: reference an assertion pack
Assertion modes
| Mode | Behavior |
|---|---|
| `enforce` | Failure contributes to the verdict and can block the response. |
| `audit` | Evaluated and reported, but never causes a block. |
| `shadow` | Evaluated in the background; results are logged at debug level only. |
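For example, the same policy can mix modes while you tune a new check (an illustrative sketch):
assertions:
  - type: "contains" # enforced: a failure can block the response
    value: "disclaimer"
    mode: "enforce"
  - type: "llm-rubric" # audited: scored and reported, never blocks
    rubric: "Response is well-structured"
    mode: "audit"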
Assertion severity
Severity applies only when `mode: enforce` is set and the assertion fails:
| Severity | Behavior |
|---|---|
| `critical` | Triggers the configured failure action. |
| `warning` | Reported, but does not trigger the failure action. |
| `info` | Reported only; no action is taken. |
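A sketch combining severities under `mode: enforce` (values illustrative): only the first assertion can trigger the configured failure action.
assertions:
  - type: "moderation"
    config:
      categories: ["hate"]
    mode: "enforce"
    severity: "critical" # failure triggers failure_action
  - type: "word-count"
    config:
      min: 50
    mode: "enforce"
    severity: "warning" # failure is reported but cannot block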
String assertions
contains / icontains
Check if the output contains a specific string.
assertions:
- type: "contains"
value: "disclaimer"
case_sensitive: false
- type: "icontains" # case-insensitive shorthand
value: "not financial advice"
contains-all / contains-any
assertions:
- type: "contains-all"
values: ["source", "citation", "reference"]
- type: "contains-any"
values: ["yes", "no", "maybe"]
starts-with
assertions:
- type: "starts-with"
value: "Based on the provided context"
regex
assertions:
- type: "regex"
config:
pattern: '\d{4}-\d{2}-\d{2}'
case_insensitive: true
word-count
assertions:
- type: "word-count"
config:
min: 50
max: 500
# OR exact count:
- type: "word-count"
config:
exact: 100
Similarity assertions
similar
Cosine similarity against an expected output.
assertions:
- type: "similar"
config:
expected: "The capital of France is Paris"
threshold: 0.85
levenshtein
Edit distance from a reference string.
assertions:
- type: "levenshtein"
config:
reference: "expected output text"
max_distance: 10
case_sensitive: false
semantic-similarity
Provider-backed embedding similarity.
assertions:
- type: "semantic-similarity"
config:
reference: "The model should explain the concept clearly"
threshold: 0.80
provider: "embed-provider"
NLP metric assertions
rouge / rouge-n
assertions:
- type: "rouge"
config:
reference: "The expected summary text"
variant: "rouge-l" # rouge-1 | rouge-2 | rouge-l
threshold: 0.6
- type: "rouge-n"
config:
reference: "Expected text"
n: 2 # 1–4
threshold: 0.5
meteor
assertions:
- type: "meteor"
config:
reference: "The expected output"
threshold: 0.5
gleu
assertions:
- type: "gleu"
config:
reference: "Expected output"
max_n: 4 # 1–4, default 4
threshold: 0.4
f-score
Token-level F-score against a reference.
assertions:
- type: "f-score"
config:
reference: "expected output text"
beta: 1 # default 1 (F1)
case_sensitive: false
threshold: 0.7
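For reference, assuming this assertion follows the standard definition, the score is F_β = (1 + β²) · precision · recall / (β² · precision + recall); with `beta: 1` this is the harmonic mean of token precision and recall (F1).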
perplexity / perplexity-score
Check the model's perplexity on the output against a maximum value.
assertions:
- type: "perplexity"
config:
max_value: 50
LLM-judged assertions
These assertions call an LLM provider to evaluate quality.
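The `provider` field on these assertions references a judge defined in `policy.quality-scorer.providers[]`; a minimal sketch (IDs and model are illustrative):
policy:
  quality-scorer:
    providers:
      - id: "judge-provider"
        provider: "openai"
        model: "gpt-4o"
        secret_key_ref:
          env: "OPENAI_API_KEY"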
llm-rubric
Grade the response against a rubric.
assertions:
- type: "llm-rubric"
config:
rubric: "The response must be factually accurate and cite sources"
reference: "Optional reference answer"
required_terms: ["source", "citation"]
threshold: 0.8
provider: "judge-provider"
model-graded-closedqa
Closed-book QA grading: does the output match the reference answer?
assertions:
- type: "model-graded-closedqa"
config:
reference_answer: "Paris is the capital of France"
question: "What is the capital of France?"
threshold: 0.9
provider: "judge-provider"
factuality
assertions:
- type: "factuality"
config:
reference_statement: "The Earth orbits the Sun in approximately 365.25 days"
threshold: 0.9
g-eval
General evaluation with custom criteria.
assertions:
- type: "g-eval"
config:
criteria: "Coherence and logical flow"
rubric: "Response should present ideas in logical order"
threshold: 0.7
answer-relevance
assertions:
- type: "answer-relevance"
config:
query: "Explain photosynthesis"
threshold: 0.8
search-rubric
assertions:
- type: "search-rubric"
config:
rubric: "The answer correctly addresses the search query"
reference: "Expected answer content"
threshold: 0.7
select-best
Compare multiple response choices.
assertions:
- type: "select-best"
config:
criteria: "Most accurate and helpful response"
candidate_source: "response_choices"
threshold: 0.6
RAG assertions
context-faithfulness
Does the response stay faithful to the provided context?
assertions:
- type: "context-faithfulness"
config:
require_context: true
threshold: 0.8
context-relevance
assertions:
- type: "context-relevance"
config:
query: "What are the benefits of exercise?"
threshold: 0.7
context-recall
assertions:
- type: "context-recall"
config:
ground_truth: "Exercise improves cardiovascular health, reduces stress, and strengthens bones"
threshold: 0.7
rag-document-exfiltration
Detect if the model is leaking verbatim document content.
assertions:
- type: "rag-document-exfiltration"
config:
max_verbatim_chars: 200
max_verbatim_ratio: 0.5
threshold: 1.0
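With the values above, the assertion presumably fails any response that reproduces a single verbatim run of more than 200 characters from a source document, or whose verbatim content exceeds half the response (assuming the two limits are checked independently).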
rag-poisoning
assertions:
- type: "rag-poisoning"
config:
poisoned_context: "This context has been tampered with"
threshold: 1.0
rag-source-attribution
assertions:
- type: "rag-source-attribution"
config:
require_attribution: true
threshold: 0.8
Agent trajectory assertions
trajectory:goal-success
assertions:
- type: "trajectory:goal-success"
config:
goal: "Book a flight from NYC to London"
success_terms: ["booking confirmed", "reservation"]
threshold: 1.0
trajectory:tool-used
assertions:
- type: "trajectory:tool-used"
config:
tools: ["flight_search", "book_flight"]
match_all: true
trajectory:tool-sequence
assertions:
- type: "trajectory:tool-sequence"
config:
tools: ["search", "validate", "book"]
allow_gaps: true
trajectory:step-count
assertions:
- type: "trajectory:step-count"
config:
min: 2
max: 10
step_type: "tool_call"
Structured output assertions
is-json / contains-json
Check that the output is valid JSON, or that it contains a valid JSON value somewhere in the text.
assertions:
- type: "is-json"
- type: "contains-json"
schema-match
Validate output against a JSON Schema.
assertions:
- type: "schema-match"
config:
schema:
type: "object"
required: ["name", "age"]
properties:
name: { type: "string" }
age: { type: "integer", minimum: 0 }
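With this schema, an output of `{"name": "Ada", "age": 36}` passes, while `{"name": "Ada"}` fails the `required` check and `{"name": "Ada", "age": -1}` violates the `minimum` constraint (standard JSON Schema semantics).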
json-path
assertions:
- type: "json-path"
config:
path: "$.results[0].score"
expected: 0.95
is-html / is-xml / is-sql
assertions:
- type: "is-html"
config:
required_tags: ["h1", "p"]
- type: "is-xml"
config:
root_tag: "response"
- type: "is-sql"
config:
allowed_statements: ["select"]
required_tables: ["users", "orders"]
Function call assertions
is-valid-openai-function-call
assertions:
- type: "is-valid-openai-function-call"
config:
function_name: "get_weather"
schema:
type: "object"
required: ["city"]
properties:
city: { type: "string" }
is-valid-openai-tools-call
assertions:
- type: "is-valid-openai-tools-call"
config:
tool_name: "search"
allow_partial: false
tool-call-f1
assertions:
- type: "tool-call-f1"
config:
expected_tools: ["search", "calculate", "format"]
match_arguments: false
threshold: 0.8
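As a worked example, assuming standard set-based F1 over tool names: if the agent calls `search`, `calculate`, and an extra `lookup`, precision and recall are both 2/3, so F1 ≈ 0.67 and the assertion fails against `threshold: 0.8`.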
Script assertions
javascript
assertions:
- type: "javascript"
config:
code: "return output.length > 100"
python
assertions:
- type: "python"
config:
code: "return len(output.split()) >= 50"
Cost and latency assertions
cost
assertions:
- type: "cost"
config:
max_cost: 0.05 # USD per request
latency
assertions:
- type: "latency"
config:
max_ms: 3000
Special assertions
is-refusal
Check if the model refused to answer.
assertions:
- type: "is-refusal"
config:
expected: false # pass if model does NOT refuse
conversation-relevance
assertions:
- type: "conversation-relevance"
config:
window: 3 # check relevance within last 3 messages
threshold: 0.7
moderation
assertions:
- type: "moderation"
config:
categories: ["violence", "hate", "self-harm"]
blocked_terms: ["offensive-term"]
classifier
assertions:
- type: "classifier"
config:
expected_class: "positive"
min_score: 0.8
blocked_terms: ["spam"]
webhook
assertions:
- type: "webhook"
config:
url: "https://my-validator.example.com/check"
timeout_ms: 3000
assert-set
Compose multiple assertions with pass criteria.
assertions:
- type: "assert-set"
config:
sources:
- type: "contains"
value: "disclaimer"
- type: "regex"
config:
pattern: '\d{4}'
min_pass_count: 1 # OR: min_pass_ratio: 0.5
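With `min_pass_count: 1`, the set passes when at least one of the two sources passes; `min_pass_ratio: 0.5` would express the same requirement as a fraction of the sources.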
max-score
Score multiple candidate sources and take the highest result.
assertions:
- type: "max-score"
config:
sources:
- type: "similar"
config:
expected: "Paris is the capital"
- type: "contains"
value: "Paris"
include_base_metrics: true
threshold
Check a named quality metric, such as one computed by the quality benchmarks below, against minimum and maximum bounds.
assertions:
- type: "threshold"
config:
metric: "faithfulness"
min: 0.8
max: 1.0
Assertion packs
Define reusable assertion bundles and reference them by name.
assertion_packs:
safety-basics:
- type: moderation
config:
categories:
- violence
- hate
- self-harm
- type: is-refusal
config:
expected: false
accuracy-checks:
- type: context-faithfulness
threshold: 0.8
- type: contains
value: source
policy:
quality-scorer:
assertions:
- pack: safety-basics
- pack: accuracy-checks
- type: llm-rubric
rubric: Response must be helpful
pack:
name: config-quality-assertions-example-52
version: 1.0.0
enabled: true
policies:
chain:
- quality-scorer
Quality benchmarks
Enable built-in quality metrics computed for every response.
policy:
quality-scorer:
benchmarks:
      ragas_faithfulness: true # RAGAS faithfulness: is the response grounded in the context?
      ragas_relevancy: true # RAGAS answer relevancy to the query
      bleu_score: true # BLEU overlap with reference text
      nli_entailment: false # natural-language-inference entailment check
      coherence: true # logical flow of the response
      completeness: true # whether the response fully addresses the query
pack:
name: config-quality-assertions-example-53
version: 1.0.0
enabled: true
policies:
chain:
- quality-scorer
Thresholds and weights
Set per-metric minimum scores and relative weights for the aggregate quality score.
policy:
quality-scorer:
thresholds:
min_aggregate: 0.8
min_faithfulness: 0.75
min_relevancy: 0.75
min_bleu: 0.4
min_coherence: 0.65
min_completeness: 0.7
min_accuracy: 0.8
weights:
faithfulness: 0.25
relevancy: 0.25
bleu: 0.2
coherence: 0.15
completeness: 0.15
accuracy: 0.2
pack:
name: config-quality-assertions-example-54
version: 1.0.0
enabled: true
policies:
chain:
- quality-scorer
Pass policy
Control how multiple assertion results combine into a verdict.
policy:
quality-scorer:
pass_policy:
strategy: weighted_average
threshold: 0.75
pack:
name: config-quality-assertions-example-55
version: 1.0.0
enabled: true
policies:
chain:
- quality-scorer
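Assuming `weighted_average` computes the weight-normalized mean of assertion scores, three assertions scoring 0.9, 0.8, and 0.5 with weights 1.0, 1.0, and 2.0 aggregate to (0.9 + 0.8 + 2 × 0.5) / 4 = 0.675, which fails the 0.75 threshold.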
Industry profiles
Pre-built threshold profiles for common industries.
policy:
quality-scorer:
industry: finance
industry_profiles:
finance:
min_aggregate: 0.9
min_accuracy: 0.95
min_faithfulness: 0.9
healthcare:
min_aggregate: 0.85
min_accuracy: 0.9
legal:
min_aggregate: 0.85
min_relevancy: 0.9
pack:
name: config-quality-assertions-example-56
version: 1.0.0
enabled: true
policies:
chain:
- quality-scorer
LLM judge configuration
Use an LLM to evaluate output quality.
policy:
quality-scorer:
judge:
enabled: true
endpoint: openai
model: gpt-4o
secret_key_ref:
env: OPENAI_API_KEY
timeout_ms: 5000
threshold: 0.7
warn_threshold: 0.5
rationale_capture: true
sampling_rate: 0.5
scorer_name: quality-judge
pack:
name: config-quality-assertions-example-57
version: 1.0.0
enabled: true
policies:
chain:
- quality-scorer
Failure action
Define what happens when the quality verdict fails.
policy:
quality-scorer:
failure_action:
action: block
fallback_message: Response quality below threshold.
max_retries: 2
pack:
name: config-quality-assertions-example-58
version: 1.0.0
enabled: true
policies:
chain:
- quality-scorer
Regression monitoring
Sample a fraction of production traffic and alert when quality scores degrade.
policy:
quality-scorer:
regression_monitoring:
enabled: true
sample_rate: 0.1
alert_threshold: 0.6
pack:
name: config-quality-assertions-example-59
version: 1.0.0
enabled: true
policies:
chain:
- quality-scorer
Complete quality scorer example
pack:
name: "quality-enforced"
version: "1.0.0"
enabled: true
providers:
targets:
- id: "openai-prod"
provider: "openai"
model: "gpt-4o"
secret_key_ref:
env: "OPENAI_API_KEY"
assertion_packs:
safety:
- type: "moderation"
config:
categories: ["violence", "hate"]
- type: "is-refusal"
config:
expected: false
policies:
chain:
- "prompt-injection"
- "quality-scorer"
policy:
quality-scorer:
providers:
- id: "judge"
provider: "openai"
model: "gpt-4o"
secret_key_ref:
env: "OPENAI_API_KEY"
config:
temperature: 0.0
benchmarks:
ragas_faithfulness: true
ragas_relevancy: true
assertions:
- pack: "safety"
- type: "llm-rubric"
rubric: "Response must be accurate, helpful, and well-structured"
threshold: 0.8
provider: "judge"
- type: "contains"
value: "source"
mode: "audit"
severity: "warning"
thresholds:
min_aggregate: 0.80
min_faithfulness: 0.75
weights:
faithfulness: 0.40
relevancy: 0.35
bleu: 0.25
pass_policy:
strategy: "weighted_average"
threshold: 0.75
failure_action:
action: "block"
fallback_message: "Quality check failed."
regression_monitoring:
enabled: true
sample_rate: 0.1
For AI systems
- Canonical terms: Keeptrusts, quality-scorer, assertions, threshold, weight, mode, severity, llm-rubric, context-faithfulness, rouge, semantic-similarity, is-json, trajectory
- Config/command names: `policy.quality-scorer.assertions[]`, assertion `type`, `threshold`, `weight`, `mode` (enforce/audit/shadow), `severity` (critical/warning/info), `negate`, `pack`
- Best next pages: Quality Scorer, Config Testing, Declarative Config Reference
For engineers
- Prerequisites: A `quality-scorer` block in your policy config. For LLM-judged assertions, a configured judge provider in `policy.quality-scorer.providers[]`.
- Validation: Run `kt policy test --json` from the pack directory to execute inline test suites. Check assertion results in the JSON output, then inspect decision events in the console Events page or with `kt events tail`.
- Key commands: `kt policy test`, `kt policy lint`, `kt events tail`
For leaders
- Governance: Quality assertions define your organization's minimum acceptable AI output bar. Critical-severity assertions that block responses directly impact user experience — review failure rates before tightening thresholds.
- Cost: LLM-judged assertions (`llm-rubric`, `context-faithfulness`) consume additional tokens per request. Each assertion call adds latency and cost proportional to the judge model's pricing.
- Rollout: Start assertions in `audit` mode to collect baseline scores without blocking traffic. Promote to `enforce` mode once false-positive rates are acceptable.
Next steps
- Quality Scorer — Parent policy configuration and thresholds
- Config Testing — Inline test suites for assertions
- Declarative Config Reference — Full config schema
- Policy Controls Catalog — All available policy kinds