Quality Assertions Configuration

The quality-scorer policy supports 65+ assertion types for evaluating LLM output quality. Assertions can check string content, compute NLP metrics, call external LLM judges, and validate structured outputs.

Use this page when

  • You are defining quality assertions for the quality-scorer policy in policy-config.yaml.
  • You need to validate LLM output using string checks, NLP metrics, LLM judges, or structured output validation.
  • You are choosing assertion types, configuring thresholds, or building custom assertion packs.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Quick reference

policy:
  quality-scorer:
    assertions:
      - type: contains
        value: disclaimer
      - type: llm-rubric
        rubric: Response must be factually accurate
        threshold: 0.8
    thresholds:
      min_aggregate: 0.8
    failure_action:
      action: block

pack:
  name: config-quality-assertions-example-1
  version: 1.0.0
  enabled: true

policies:
  chain:
    - quality-scorer

Assertion structure

Each assertion is an object with a type and type-specific fields:

- type: "contains"            # required: assertion type
  name: "has-disclaimer"      # optional: human-readable name
  value: "disclaimer"         # type-specific field
  threshold: 0.8              # optional: min score to pass (default varies)
  weight: 1.0                 # optional: relative weight (0.0–10.0, default 1.0)
  enabled: true               # optional: enable/disable (default true)
  mode: "enforce"             # optional: enforce | audit | shadow
  severity: "critical"        # optional: critical | warning | info
  negate: false               # optional: invert the match
  case_sensitive: true        # optional: case sensitivity (default true)
  provider: "judge-provider"  # optional: provider for LLM-backed assertions
  pack: "my-pack"             # optional: reference an assertion pack

Assertion modes

  • enforce: failure contributes to the verdict and can block the response.
  • audit: evaluated and reported, but never causes a block.
  • shadow: evaluated in the background; results are logged at debug level only.

Assertion severity

Severity applies only when mode is enforce and the assertion fails:

  • critical: triggers the configured failure action.
  • warning: reported, but does not trigger the failure action.
  • info: reported only; no action taken.

String assertions

contains / icontains

Check if the output contains a specific string.

assertions:
  - type: "contains"
    value: "disclaimer"
    case_sensitive: false
  - type: "icontains"  # case-insensitive shorthand
    value: "not financial advice"

contains-all / contains-any

assertions:
  - type: "contains-all"
    values: ["source", "citation", "reference"]
  - type: "contains-any"
    values: ["yes", "no", "maybe"]

starts-with

assertions:
  - type: "starts-with"
    value: "Based on the provided context"

regex

assertions:
  - type: "regex"
    config:
      pattern: '\d{4}-\d{2}-\d{2}'
      case_insensitive: true

word-count

assertions:
  - type: "word-count"
    config:
      min: 50
      max: 500
  # OR exact count:
  - type: "word-count"
    config:
      exact: 100

Similarity assertions

similar

Cosine similarity against an expected output.

assertions:
  - type: "similar"
    config:
      expected: "The capital of France is Paris"
      threshold: 0.85

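Conceptually, the assertion embeds both texts and passes when the cosine similarity of the vectors meets the threshold. A minimal sketch of the math (illustrative only; the embeddings themselves come from a provider):

```python
def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similar_passes(candidate_vec, expected_vec, threshold=0.85):
    # The assertion passes when similarity meets the configured threshold.
    return cosine_similarity(candidate_vec, expected_vec) >= threshold
```
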
levenshtein

Edit distance from a reference string.

assertions:
  - type: "levenshtein"
    config:
      reference: "expected output text"
      max_distance: 10
      case_sensitive: false

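The edit-distance computation behind this assertion can be sketched in a few lines (an illustration of the metric, not the product's implementation):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[len(b)]

def levenshtein_passes(output, reference, max_distance=10, case_sensitive=False):
    # Pass when the edit distance is within the configured budget.
    if not case_sensitive:
        output, reference = output.lower(), reference.lower()
    return levenshtein(output, reference) <= max_distance
```
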
semantic-similarity

Provider-backed embedding similarity.

assertions:
  - type: "semantic-similarity"
    config:
      reference: "The model should explain the concept clearly"
      threshold: 0.80
      provider: "embed-provider"

NLP metric assertions

rouge / rouge-n

assertions:
  - type: "rouge"
    config:
      reference: "The expected summary text"
      variant: "rouge-l"  # rouge-1 | rouge-2 | rouge-l
      threshold: 0.6
  - type: "rouge-n"
    config:
      reference: "Expected text"
      n: 2  # 1–4
      threshold: 0.5

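ROUGE-N at its core is recall-oriented n-gram overlap against the reference. A simplified sketch, ignoring stemming and the other refinements full ROUGE implementations apply:

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams over a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate: str, reference: str, n: int = 2) -> float:
    # Clipped n-gram matches divided by the reference n-gram count (recall).
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    if not ref:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / sum(ref.values())
```
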
meteor

assertions:
  - type: "meteor"
    config:
      reference: "The expected output"
      threshold: 0.5

gleu

assertions:
  - type: "gleu"
    config:
      reference: "Expected output"
      max_n: 4  # 1–4, default 4
      threshold: 0.4

f-score

Token-level F-score against a reference.

assertions:
  - type: "f-score"
    config:
      reference: "expected output text"
      beta: 1  # default 1 (F1)
      case_sensitive: false
      threshold: 0.7

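Token-level F-beta combines precision and recall over the token multisets of output and reference. A sketch of the underlying formula:

```python
from collections import Counter

def token_f_score(output: str, reference: str, beta: float = 1.0,
                  case_sensitive: bool = False) -> float:
    # Precision/recall over token multisets, combined into F-beta.
    if not case_sensitive:
        output, reference = output.lower(), reference.lower()
    out_tokens, ref_tokens = Counter(output.split()), Counter(reference.split())
    overlap = sum((out_tokens & ref_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(out_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```
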
perplexity / perplexity-score

assertions:
  - type: "perplexity"
    config:
      max_value: 50

LLM-judged assertions

These assertions call an LLM provider to evaluate quality.

llm-rubric

Grade the response against a rubric.

assertions:
  - type: "llm-rubric"
    config:
      rubric: "The response must be factually accurate and cite sources"
      reference: "Optional reference answer"
      required_terms: ["source", "citation"]
      threshold: 0.8
      provider: "judge-provider"

model-graded-closedqa

Closed-book QA grading: does the output match the reference answer?

assertions:
  - type: "model-graded-closedqa"
    config:
      reference_answer: "Paris is the capital of France"
      question: "What is the capital of France?"
      threshold: 0.9
      provider: "judge-provider"

factuality

assertions:
  - type: "factuality"
    config:
      reference_statement: "The Earth orbits the Sun in approximately 365.25 days"
      threshold: 0.9

g-eval

General evaluation with custom criteria.

assertions:
  - type: "g-eval"
    config:
      criteria: "Coherence and logical flow"
      rubric: "Response should present ideas in logical order"
      threshold: 0.7

answer-relevance

assertions:
  - type: "answer-relevance"
    config:
      query: "Explain photosynthesis"
      threshold: 0.8

search-rubric

assertions:
  - type: "search-rubric"
    config:
      rubric: "The answer correctly addresses the search query"
      reference: "Expected answer content"
      threshold: 0.7

select-best

Compare multiple response choices.

assertions:
  - type: "select-best"
    config:
      criteria: "Most accurate and helpful response"
      candidate_source: "response_choices"
      threshold: 0.6

RAG assertions

context-faithfulness

Does the response stay faithful to the provided context?

assertions:
  - type: "context-faithfulness"
    config:
      require_context: true
      threshold: 0.8

context-relevance

assertions:
  - type: "context-relevance"
    config:
      query: "What are the benefits of exercise?"
      threshold: 0.7

context-recall

assertions:
  - type: "context-recall"
    config:
      ground_truth: "Exercise improves cardiovascular health, reduces stress, and strengthens bones"
      threshold: 0.7

rag-document-exfiltration

Detect if the model is leaking verbatim document content.

assertions:
  - type: "rag-document-exfiltration"
    config:
      max_verbatim_chars: 200
      max_verbatim_ratio: 0.5
      threshold: 1.0

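One simplified way to measure verbatim leakage is the longest run of characters the response shares with a source document, compared against the configured limits. A sketch of the idea (not the product's exact detection logic):

```python
def longest_verbatim_run(response: str, document: str) -> int:
    # Length of the longest substring of `response` that appears verbatim
    # in `document` (simple O(n*m) dynamic programming).
    best = 0
    prev = [0] * (len(document) + 1)
    for ch_r in response:
        curr = [0]
        for j, ch_d in enumerate(document, start=1):
            run = prev[j - 1] + 1 if ch_r == ch_d else 0
            curr.append(run)
            best = max(best, run)
        prev = curr
    return best

def exfiltration_passes(response, document,
                        max_verbatim_chars=200, max_verbatim_ratio=0.5):
    # Pass when the verbatim run stays under both configured limits.
    run = longest_verbatim_run(response, document)
    ratio = run / len(response) if response else 0.0
    return run <= max_verbatim_chars and ratio <= max_verbatim_ratio
```
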
rag-poisoning

assertions:
  - type: "rag-poisoning"
    config:
      poisoned_context: "This context has been tampered with"
      threshold: 1.0

rag-source-attribution

assertions:
  - type: "rag-source-attribution"
    config:
      require_attribution: true
      threshold: 0.8

Agent trajectory assertions

trajectory:goal-success

assertions:
  - type: "trajectory:goal-success"
    config:
      goal: "Book a flight from NYC to London"
      success_terms: ["booking confirmed", "reservation"]
      threshold: 1.0

trajectory:tool-used

assertions:
  - type: "trajectory:tool-used"
    config:
      tools: ["flight_search", "book_flight"]
      match_all: true

trajectory:tool-sequence

assertions:
  - type: "trajectory:tool-sequence"
    config:
      tools: ["search", "validate", "book"]
      allow_gaps: true

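With allow_gaps: true, the expected tools must appear in order but other calls may occur between them; without gaps they must be contiguous. A sketch of that check:

```python
def tool_sequence_passes(called, expected, allow_gaps=True):
    # With gaps: `expected` must be a subsequence of `called`.
    # Without gaps: `expected` must appear as a contiguous slice.
    if allow_gaps:
        it = iter(called)
        return all(tool in it for tool in expected)
    return any(called[i:i + len(expected)] == expected
               for i in range(len(called) - len(expected) + 1))
```
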
trajectory:step-count

assertions:
  - type: "trajectory:step-count"
    config:
      min: 2
      max: 10
      step_type: "tool_call"

Structured output assertions

is-json / contains-json

assertions:
  - type: "is-json"
  - type: "contains-json"

schema-match

Validate output against a JSON Schema.

assertions:
  - type: "schema-match"
    config:
      schema:
        type: "object"
        required: ["name", "age"]
        properties:
          name: { type: "string" }
          age: { type: "integer", minimum: 0 }

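For the schema above, validation boils down to checking the value's type, required keys, and per-property constraints. A deliberately minimal sketch (real JSON Schema validators handle far more keywords):

```python
def validate_object(data, schema):
    # Minimal check for the object schema shown above: top-level type,
    # required keys, and per-property type / minimum.
    type_map = {"object": dict, "string": str,
                "integer": int, "number": (int, float)}
    if not isinstance(data, type_map[schema.get("type", "object")]):
        return False
    if any(key not in data for key in schema.get("required", [])):
        return False
    for key, spec in schema.get("properties", {}).items():
        if key not in data:
            continue  # only `required` makes a property mandatory
        value = data[key]
        if not isinstance(value, type_map[spec["type"]]):
            return False
        if "minimum" in spec and value < spec["minimum"]:
            return False
    return True

# The schema from the example above, as a Python dict.
EXAMPLE_SCHEMA = {
    "type": "object",
    "required": ["name", "age"],
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
    },
}
```
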
json-path

assertions:
  - type: "json-path"
    config:
      path: "$.results[0].score"
      expected: 0.95

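A path like $.results[0].score walks object keys and array indexes. A tiny resolver sketch covering only dot keys and integer indexes (full JSONPath engines support much more syntax):

```python
import json
import re

def resolve_path(document, path):
    # Resolve a simple JSONPath: dot-separated keys and [N] indexes only.
    value = json.loads(document) if isinstance(document, str) else document
    for key, index in re.findall(r'\.([A-Za-z_][\w-]*)|\[(\d+)\]', path):
        value = value[int(index)] if index else value[key]
    return value
```
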
is-html / is-xml / is-sql

assertions:
  - type: "is-html"
    config:
      required_tags: ["h1", "p"]
  - type: "is-xml"
    config:
      root_tag: "response"
  - type: "is-sql"
    config:
      allowed_statements: ["select"]
      required_tables: ["users", "orders"]

Function call assertions

is-valid-openai-function-call

assertions:
  - type: "is-valid-openai-function-call"
    config:
      function_name: "get_weather"
      schema:
        type: "object"
        required: ["city"]
        properties:
          city: { type: "string" }

is-valid-openai-tools-call

assertions:
  - type: "is-valid-openai-tools-call"
    config:
      tool_name: "search"
      allow_partial: false

tool-call-f1

assertions:
  - type: "tool-call-f1"
    config:
      expected_tools: ["search", "calculate", "format"]
      match_arguments: false
      threshold: 0.8

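The score is the F1 between the set of expected tools and the tools the model actually called. A sketch of the computation:

```python
def tool_call_f1(expected, actual):
    # F1 over tool-name sets: precision against calls made,
    # recall against calls expected.
    expected_set, actual_set = set(expected), set(actual)
    overlap = len(expected_set & actual_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(actual_set)
    recall = overlap / len(expected_set)
    return 2 * precision * recall / (precision + recall)
```
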
Script assertions

javascript

assertions:
  - type: "javascript"
    config:
      code: "return output.length > 100"

python

assertions:
  - type: "python"
    config:
      code: "return len(output.split()) >= 50"

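The snippet receives the model output and returns a boolean. One plausible way a host could evaluate such a snippet (an assumption for illustration, not the documented runtime; exec-ing untrusted code would also need sandboxing):

```python
def run_python_assertion(code: str, output: str):
    # Wrap the configured snippet in a function body so its `return`
    # statement works, then call it with the model output in scope.
    namespace = {}
    indented = "\n".join("    " + line for line in code.splitlines())
    exec(f"def _assertion(output):\n{indented}", namespace)
    return namespace["_assertion"](output)
```
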
Cost and latency assertions

cost

assertions:
  - type: "cost"
    config:
      max_cost: 0.05  # USD per request

latency

assertions:
  - type: "latency"
    config:
      max_ms: 3000

Special assertions

is-refusal

Check if the model refused to answer.

assertions:
  - type: "is-refusal"
    config:
      expected: false  # pass if model does NOT refuse

conversation-relevance

assertions:
  - type: "conversation-relevance"
    config:
      window: 3  # check relevance within last 3 messages
      threshold: 0.7

moderation

assertions:
  - type: "moderation"
    config:
      categories: ["violence", "hate", "self-harm"]
      blocked_terms: ["offensive-term"]

classifier

assertions:
  - type: "classifier"
    config:
      expected_class: "positive"
      min_score: 0.8
      blocked_terms: ["spam"]

webhook

assertions:
  - type: "webhook"
    config:
      url: "https://my-validator.example.com/check"
      timeout_ms: 3000

assert-set

Compose multiple assertions with pass criteria.

assertions:
  - type: "assert-set"
    config:
      sources:
        - type: "contains"
          value: "disclaimer"
        - type: "regex"
          config:
            pattern: '\d{4}'
      min_pass_count: 1  # OR: min_pass_ratio: 0.5

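The set passes when enough of its sources pass, per min_pass_count or min_pass_ratio. A sketch of that aggregation (treating "all sources must pass" as the assumed default when neither knob is set):

```python
def assert_set_passes(results, min_pass_count=None, min_pass_ratio=None):
    # `results` is the list of pass/fail booleans from the sub-assertions.
    passed = sum(results)
    if min_pass_count is not None:
        return passed >= min_pass_count
    if min_pass_ratio is not None:
        return passed / len(results) >= min_pass_ratio
    return all(results)  # assumed default: every source must pass
```
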
max-score

Aggregate scores from multiple sources.

assertions:
  - type: "max-score"
    config:
      sources:
        - type: "similar"
          config:
            expected: "Paris is the capital"
        - type: "contains"
          value: "Paris"
      include_base_metrics: true

threshold

Check a named quality metric against a threshold.

assertions:
  - type: "threshold"
    config:
      metric: "faithfulness"
      min: 0.8
      max: 1.0

Assertion packs

Define reusable assertion bundles and reference them by name.

assertion_packs:
  safety-basics:
    - type: moderation
      config:
        categories:
          - violence
          - hate
          - self-harm
    - type: is-refusal
      config:
        expected: false
  accuracy-checks:
    - type: context-faithfulness
      threshold: 0.8
    - type: contains
      value: source

policy:
  quality-scorer:
    assertions:
      - pack: safety-basics
      - pack: accuracy-checks
      - type: llm-rubric
        rubric: Response must be helpful

pack:
  name: config-quality-assertions-example-52
  version: 1.0.0
  enabled: true

policies:
  chain:
    - quality-scorer

Quality benchmarks

Enable built-in quality metrics computed for every response.

policy:
  quality-scorer:
    benchmarks:
      ragas_faithfulness: true
      ragas_relevancy: true
      bleu_score: true
      nli_entailment: false
      coherence: true
      completeness: true

pack:
  name: config-quality-assertions-example-53
  version: 1.0.0
  enabled: true

policies:
  chain:
    - quality-scorer

Thresholds and weights

policy:
  quality-scorer:
    thresholds:
      min_aggregate: 0.8
      min_faithfulness: 0.75
      min_relevancy: 0.75
      min_bleu: 0.4
      min_coherence: 0.65
      min_completeness: 0.7
      min_accuracy: 0.8
    weights:
      faithfulness: 0.25
      relevancy: 0.25
      bleu: 0.2
      coherence: 0.15
      completeness: 0.15
      accuracy: 0.2

pack:
  name: config-quality-assertions-example-54
  version: 1.0.0
  enabled: true

policies:
  chain:
    - quality-scorer

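Note that the weights in the example sum to 1.2, which suggests they are normalized when combined. A sketch of how weighted aggregation and per-metric minimums might interact (illustrative only, not the product's exact formula):

```python
def aggregate_score(scores, weights):
    # Weighted average of per-metric scores; weights are normalized so
    # they need not sum to exactly 1.0.
    total = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total

def verdict(scores, weights, thresholds):
    # Fail if the aggregate or any per-metric minimum is violated.
    if aggregate_score(scores, weights) < thresholds.get("min_aggregate", 0.0):
        return False
    for key, minimum in thresholds.items():
        metric = key.removeprefix("min_")
        if metric != "aggregate" and scores.get(metric, 1.0) < minimum:
            return False
    return True
```
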
Pass policy

Control how multiple assertion results combine into a verdict.

policy:
  quality-scorer:
    pass_policy:
      strategy: weighted_average
      threshold: 0.75

pack:
  name: config-quality-assertions-example-55
  version: 1.0.0
  enabled: true

policies:
  chain:
    - quality-scorer

Industry profiles

Pre-built threshold profiles for common industries.

policy:
  quality-scorer:
    industry: finance
    industry_profiles:
      finance:
        min_aggregate: 0.9
        min_accuracy: 0.95
        min_faithfulness: 0.9
      healthcare:
        min_aggregate: 0.85
        min_accuracy: 0.9
      legal:
        min_aggregate: 0.85
        min_relevancy: 0.9

pack:
  name: config-quality-assertions-example-56
  version: 1.0.0
  enabled: true

policies:
  chain:
    - quality-scorer

LLM judge configuration

Use an LLM to evaluate output quality.

policy:
  quality-scorer:
    judge:
      enabled: true
      endpoint: openai
      model: gpt-4o
      secret_key_ref:
        env: OPENAI_API_KEY
      timeout_ms: 5000
      threshold: 0.7
      warn_threshold: 0.5
      rationale_capture: true
      sampling_rate: 0.5
      scorer_name: quality-judge

pack:
  name: config-quality-assertions-example-57
  version: 1.0.0
  enabled: true

policies:
  chain:
    - quality-scorer

Failure action

policy:
  quality-scorer:
    failure_action:
      action: block
      fallback_message: Response quality below threshold.
      max_retries: 2

pack:
  name: config-quality-assertions-example-58
  version: 1.0.0
  enabled: true

policies:
  chain:
    - quality-scorer

Regression monitoring

policy:
  quality-scorer:
    regression_monitoring:
      enabled: true
      sample_rate: 0.1
      alert_threshold: 0.6

pack:
  name: config-quality-assertions-example-59
  version: 1.0.0
  enabled: true

policies:
  chain:
    - quality-scorer

Complete quality scorer example

pack:
  name: "quality-enforced"
  version: "1.0.0"
  enabled: true

providers:
  targets:
    - id: "openai-prod"
      provider: "openai"
      model: "gpt-4o"
      secret_key_ref:
        env: "OPENAI_API_KEY"

assertion_packs:
  safety:
    - type: "moderation"
      config:
        categories: ["violence", "hate"]
    - type: "is-refusal"
      config:
        expected: false

policies:
  chain:
    - "prompt-injection"
    - "quality-scorer"

policy:
  quality-scorer:
    providers:
      - id: "judge"
        provider: "openai"
        model: "gpt-4o"
        secret_key_ref:
          env: "OPENAI_API_KEY"
        config:
          temperature: 0.0

    benchmarks:
      ragas_faithfulness: true
      ragas_relevancy: true

    assertions:
      - pack: "safety"
      - type: "llm-rubric"
        rubric: "Response must be accurate, helpful, and well-structured"
        threshold: 0.8
        provider: "judge"
      - type: "contains"
        value: "source"
        mode: "audit"
        severity: "warning"

    thresholds:
      min_aggregate: 0.80
      min_faithfulness: 0.75

    weights:
      faithfulness: 0.40
      relevancy: 0.35
      bleu: 0.25

    pass_policy:
      strategy: "weighted_average"
      threshold: 0.75

    failure_action:
      action: "block"
      fallback_message: "Quality check failed."

    regression_monitoring:
      enabled: true
      sample_rate: 0.1

For AI systems

  • Canonical terms: Keeptrusts, quality-scorer, assertions, threshold, weight, mode, severity, llm-rubric, context-faithfulness, rouge, semantic-similarity, is-json, trajectory
  • Config/command names: policy.quality-scorer.assertions[], assertion type, threshold, weight, mode (enforce/audit/shadow), severity (critical/warning/info), negate, pack
  • Best next pages: Quality Scorer, Config Testing, Declarative Config Reference

For engineers

  • Prerequisites: A quality-scorer block in your policy config. For LLM-judged assertions, a configured judge provider in policy.quality-scorer.providers[].
  • Validation: Run kt policy test --json from the pack directory to execute inline test suites. Check assertion results in the JSON output, then inspect decision events in the console Events page or kt events tail.
  • Key commands: kt policy test, kt policy lint, kt events tail

For leaders

  • Governance: Quality assertions define your organization's minimum acceptable AI output bar. Critical-severity assertions that block responses directly impact user experience — review failure rates before tightening thresholds.
  • Cost: LLM-judged assertions (llm-rubric, context-faithfulness) consume additional tokens per request. Each assertion call adds latency and cost proportional to the judge model's pricing.
  • Rollout: Start assertions in audit mode to collect baseline scores without blocking traffic. Promote to enforce mode once false-positive rates are acceptable.

Next steps