Quality Assertions Configuration
The quality-scorer policy supports 65+ assertion types for evaluating LLM output quality. Assertions can check string content, compute NLP metrics, call external LLM judges, and validate structured outputs.
Use this page when
- You are defining quality assertions for the `quality-scorer` policy in `policy-config.yaml`.
- You need to validate LLM output using string checks, NLP metrics, LLM judges, or structured output validation.
- You are choosing assertion types, configuring thresholds, or building custom assertion packs.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Quick reference
policy:
quality-scorer:
assertions:
- type: contains
value: disclaimer
- type: llm-rubric
rubric: Response must be factually accurate
threshold: 0.8
thresholds:
min_aggregate: 0.8
failure_action:
action: block
pack:
name: config-quality-assertions-example-1
version: 1.0.0
enabled: true
policies:
chain:
- quality-scorer
Assertion structure
Each assertion is an object with a type and type-specific fields:
- type: "contains" # required: assertion type
name: "has-disclaimer" # optional: human-readable name
value: "disclaimer" # type-specific field
threshold: 0.8 # optional: min score to pass (default varies)
weight: 1.0 # optional: relative weight (0.0–10.0, default 1.0)
enabled: true # optional: enable/disable (default true)
mode: "enforce" # optional: enforce | audit | shadow
severity: "critical" # optional: critical | warning | info
negate: false # optional: invert match
case_sensitive: true # optional: case sensitivity (default true)
provider: "judge-provider" # optional: provider for LLM-backed assertions
pack: "my-pack" # optional: reference an assertion pack
Assertion modes
| Mode | Behavior |
|---|---|
| `enforce` | Failure contributes to the verdict and can block the response. |
| `audit` | Evaluated and reported, but never causes a block. |
| `shadow` | Evaluated in the background; results are logged at debug level only. |
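For example, the same policy can mix modes while you tune a new check (an illustrative sketch):
assertions:
  - type: "contains" # enforced: a failure can block the response
    value: "disclaimer"
    mode: "enforce"
  - type: "llm-rubric" # audited: scored and reported, never blocks
    rubric: "Response is well-structured"
    mode: "audit"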
Assertion severity
Severity applies only when `mode: enforce` is set and the assertion fails:
| Severity | Behavior |
|---|---|
| `critical` | Triggers the configured failure action. |
| `warning` | Reported, but does not trigger the failure action. |
| `info` | Reported only; no action is taken. |
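A sketch combining severities under `mode: enforce` (values illustrative): only the first assertion can trigger the configured failure action.
assertions:
  - type: "moderation"
    config:
      categories: ["hate"]
    mode: "enforce"
    severity: "critical" # failure triggers failure_action
  - type: "word-count"
    config:
      min: 50
    mode: "enforce"
    severity: "warning" # failure is reported but cannot block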
String assertions
contains / icontains
Check if the output contains a specific string.
assertions:
- type: "contains"
value: "disclaimer"
case_sensitive: false
- type: "icontains" # case-insensitive shorthand
value: "not financial advice"
contains-all / contains-any
assertions:
- type: "contains-all"
values: ["source", "citation", "reference"]
- type: "contains-any"
values: ["yes", "no", "maybe"]
starts-with
assertions:
- type: "starts-with"
value: "Based on the provided context"
regex
assertions:
- type: "regex"
config:
pattern: '\d{4}-\d{2}-\d{2}'
case_insensitive: true
word-count
assertions:
- type: "word-count"
config:
min: 50
max: 500
# OR exact count:
- type: "word-count"
config:
exact: 100
Similarity assertions
similar
Cosine similarity against an expected output.
assertions:
- type: "similar"
config:
expected: "The capital of France is Paris"
threshold: 0.85
levenshtein
Edit distance from a reference string.
assertions:
- type: "levenshtein"
config:
reference: "expected output text"
max_distance: 10
case_sensitive: false
semantic-similarity
Provider-backed embedding similarity.
assertions:
- type: "semantic-similarity"
config:
reference: "The model should explain the concept clearly"
threshold: 0.80
provider: "embed-provider"
NLP metric assertions
rouge / rouge-n
assertions:
- type: "rouge"
config:
reference: "The expected summary text"
variant: "rouge-l" # rouge-1 | rouge-2 | rouge-l
threshold: 0.6
- type: "rouge-n"
config:
reference: "Expected text"
n: 2 # 1–4
threshold: 0.5
meteor
assertions:
- type: "meteor"
config:
reference: "The expected output"
threshold: 0.5
gleu
assertions:
- type: "gleu"
config:
reference: "Expected output"
max_n: 4 # 1–4, default 4
threshold: 0.4
f-score
Token-level F-score against a reference.
assertions:
- type: "f-score"
config:
reference: "expected output text"
beta: 1 # default 1 (F1)
case_sensitive: false
threshold: 0.7
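For reference, assuming this assertion follows the standard definition, the score is F_β = (1 + β²) · precision · recall / (β² · precision + recall); with `beta: 1` this is the harmonic mean of token precision and recall (F1).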
perplexity / perplexity-score
Check the model's perplexity on the output against a maximum value.
assertions:
- type: "perplexity"
config:
max_value: 50
LLM-judged assertions
These assertions call an LLM provider to evaluate quality.
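The `provider` field on these assertions references a judge defined in `policy.quality-scorer.providers[]`; a minimal sketch (IDs and model are illustrative):
policy:
  quality-scorer:
    providers:
      - id: "judge-provider"
        provider: "openai"
        model: "gpt-4o"
        secret_key_ref:
          env: "OPENAI_API_KEY"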
llm-rubric
Grade the response against a rubric.
assertions:
- type: "llm-rubric"
config:
rubric: "The response must be factually accurate and cite sources"
reference: "Optional reference answer"
required_terms: ["source", "citation"]
threshold: 0.8
provider: "judge-provider"
model-graded-closedqa
Closed-book QA grading: does the output match the reference answer?
assertions:
- type: "model-graded-closedqa"
config:
reference_answer: "Paris is the capital of France"
question: "What is the capital of France?"
threshold: 0.9
provider: "judge-provider"
factuality
assertions:
- type: "factuality"
config:
reference_statement: "The Earth orbits the Sun in approximately 365.25 days"
threshold: 0.9
g-eval
General evaluation with custom criteria.
assertions:
- type: "g-eval"
config:
criteria: "Coherence and logical flow"
rubric: "Response should present ideas in logical order"
threshold: 0.7
answer-relevance
assertions:
- type: "answer-relevance"
config:
query: "Explain photosynthesis"
threshold: 0.8
search-rubric
assertions:
- type: "search-rubric"
config:
rubric: "The answer correctly addresses the search query"
reference: "Expected answer content"
threshold: 0.7
select-best
Compare multiple response choices.
assertions:
- type: "select-best"
config:
criteria: "Most accurate and helpful response"
candidate_source: "response_choices"
threshold: 0.6
RAG assertions
context-faithfulness
Does the response stay faithful to the provided context?
assertions:
- type: "context-faithfulness"
config:
require_context: true
threshold: 0.8
context-relevance
assertions:
- type: "context-relevance"
config:
query: "What are the benefits of exercise?"
threshold: 0.7
context-recall
assertions:
- type: "context-recall"
config:
ground_truth: "Exercise improves cardiovascular health, reduces stress, and strengthens bones"
threshold: 0.7
rag-document-exfiltration
Detect if the model is leaking verbatim document content.
assertions:
- type: "rag-document-exfiltration"
config:
max_verbatim_chars: 200
max_verbatim_ratio: 0.5
threshold: 1.0
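With the values above, the assertion presumably fails any response that reproduces a single verbatim run of more than 200 characters from a source document, or whose verbatim content exceeds half the response (assuming the two limits are checked independently).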
rag-poisoning
assertions:
- type: "rag-poisoning"
config:
poisoned_context: "This context has been tampered with"
threshold: 1.0
rag-source-attribution
assertions:
- type: "rag-source-attribution"
config:
require_attribution: true
threshold: 0.8
Agent trajectory assertions
trajectory:goal-success
assertions:
- type: "trajectory:goal-success"
config:
goal: "Book a flight from NYC to London"
success_terms: ["booking confirmed", "reservation"]
threshold: 1.0
trajectory:tool-used
assertions:
- type: "trajectory:tool-used"
config:
tools: ["flight_search", "book_flight"]
match_all: true
trajectory:tool-sequence
assertions:
- type: "trajectory:tool-sequence"
config:
tools: ["search", "validate", "book"]
allow_gaps: true
trajectory:step-count
assertions:
- type: "trajectory:step-count"
config:
min: 2
max: 10
step_type: "tool_call"
Structured output assertions
is-json / contains-json
Check that the output is valid JSON, or that it contains a valid JSON value somewhere in the text.
assertions:
- type: "is-json"
- type: "contains-json"
schema-match
Validate output against a JSON Schema.
assertions:
- type: "schema-match"
config:
schema:
type: "object"
required: ["name", "age"]
properties:
name: { type: "string" }
age: { type: "integer", minimum: 0 }
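With this schema, an output of `{"name": "Ada", "age": 36}` passes, while `{"name": "Ada"}` fails the `required` check and `{"name": "Ada", "age": -1}` violates the `minimum` constraint (standard JSON Schema semantics).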
json-path
assertions:
- type: "json-path"
config:
path: "$.results[0].score"
expected: 0.95
is-html / is-xml / is-sql
assertions:
- type: "is-html"
config:
required_tags: ["h1", "p"]
- type: "is-xml"
config:
root_tag: "response"
- type: "is-sql"
config:
allowed_statements: ["select"]
required_tables: ["users", "orders"]
Function call assertions
is-valid-openai-function-call
assertions:
- type: "is-valid-openai-function-call"
config:
function_name: "get_weather"
schema:
type: "object"
required: ["city"]
properties:
city: { type: "string" }
is-valid-openai-tools-call
assertions:
- type: "is-valid-openai-tools-call"
config:
tool_name: "search"
allow_partial: false
tool-call-f1
assertions:
- type: "tool-call-f1"
config:
expected_tools: ["search", "calculate", "format"]
match_arguments: false
threshold: 0.8
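As a worked example, assuming standard set-based F1 over tool names: if the agent calls `search`, `calculate`, and an extra `lookup`, precision and recall are both 2/3, so F1 ≈ 0.67 and the assertion fails against `threshold: 0.8`.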
Script assertions
javascript
assertions:
- type: "javascript"
config:
code: "return output.length > 100"
python
assertions:
- type: "python"
config:
code: "return len(output.split()) >= 50"
Cost and latency assertions
cost
assertions:
- type: "cost"
config:
max_cost: 0.05 # USD per request
latency
assertions:
- type: "latency"
config:
max_ms: 3000
Special assertions
is-refusal
Check if the model refused to answer.
assertions:
- type: "is-refusal"
config:
expected: false # pass if model does NOT refuse
conversation-relevance
assertions:
- type: "conversation-relevance"
config:
window: 3 # check relevance within last 3 messages
threshold: 0.7
moderation
assertions:
- type: "moderation"
config:
categories: ["violence", "hate", "self-harm"]
blocked_terms: ["offensive-term"]
classifier
assertions:
- type: "classifier"
config:
expected_class: "positive"
min_score: 0.8
blocked_terms: ["spam"]
webhook
assertions:
- type: "webhook"
config:
url: "https://my-validator.example.com/check"
timeout_ms: 3000
assert-set
Compose multiple assertions with pass criteria.
assertions:
- type: "assert-set"
config:
sources:
- type: "contains"
value: "disclaimer"
- type: "regex"
config:
pattern: '\d{4}'
min_pass_count: 1 # OR: min_pass_ratio: 0.5
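With `min_pass_count: 1`, the set passes when at least one of the two sources passes; `min_pass_ratio: 0.5` would express the same requirement as a fraction of the sources.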
max-score
Score multiple candidate sources and take the highest result.
assertions:
- type: "max-score"
config:
sources:
- type: "similar"
config:
expected: "Paris is the capital"
- type: "contains"
value: "Paris"
include_base_metrics: true
threshold
Check a named quality metric, such as one computed by the quality benchmarks below, against minimum and maximum bounds.
assertions:
- type: "threshold"
config:
metric: "faithfulness"
min: 0.8
max: 1.0
Assertion packs
Define reusable assertion bundles and reference them by name.
assertion_packs:
safety-basics:
- type: moderation
config:
categories:
- violence
- hate
- self-harm
- type: is-refusal
config:
expected: false
accuracy-checks:
- type: context-faithfulness
threshold: 0.8
- type: contains
value: source
policy:
quality-scorer:
assertions:
- pack: safety-basics
- pack: accuracy-checks
- type: llm-rubric
rubric: Response must be helpful
pack:
name: config-quality-assertions-example-52
version: 1.0.0
enabled: true
policies:
chain:
- quality-scorer
Quality benchmarks
Enable built-in quality metrics computed for every response.
policy:
quality-scorer:
benchmarks:
      ragas_faithfulness: true # RAGAS faithfulness: is the response grounded in the context?
      ragas_relevancy: true # RAGAS answer relevancy to the query
      bleu_score: true # BLEU overlap with reference text
      nli_entailment: false # natural-language-inference entailment check
      coherence: true # logical flow of the response
      completeness: true # whether the response fully addresses the query
pack:
name: config-quality-assertions-example-53
version: 1.0.0
enabled: true
policies:
chain:
- quality-scorer
Thresholds and weights
Set per-metric minimum scores and relative weights for the aggregate quality score.
policy:
quality-scorer:
thresholds:
min_aggregate: 0.8
min_faithfulness: 0.75
min_relevancy: 0.75
min_bleu: 0.4
min_coherence: 0.65
min_completeness: 0.7
min_accuracy: 0.8
weights:
faithfulness: 0.25
relevancy: 0.25
bleu: 0.2
coherence: 0.15
completeness: 0.15
accuracy: 0.2
pack:
name: config-quality-assertions-example-54
version: 1.0.0
enabled: true
policies:
chain:
- quality-scorer
Pass policy
Control how multiple assertion results combine into a verdict.
policy:
quality-scorer:
pass_policy:
strategy: weighted_average
threshold: 0.75
pack:
name: config-quality-assertions-example-55
version: 1.0.0
enabled: true
policies:
chain:
- quality-scorer
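Assuming `weighted_average` computes the weight-normalized mean of assertion scores, three assertions scoring 0.9, 0.8, and 0.5 with weights 1.0, 1.0, and 2.0 aggregate to (0.9 + 0.8 + 2 × 0.5) / 4 = 0.675, which fails the 0.75 threshold.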
Industry profiles
Pre-built threshold profiles for common industries.
policy:
quality-scorer:
industry: finance
industry_profiles:
finance:
min_aggregate: 0.9
min_accuracy: 0.95
min_faithfulness: 0.9
healthcare:
min_aggregate: 0.85
min_accuracy: 0.9
legal:
min_aggregate: 0.85
min_relevancy: 0.9
pack:
name: config-quality-assertions-example-56
version: 1.0.0
enabled: true
policies:
chain:
- quality-scorer
LLM judge configuration
Use an LLM to evaluate output quality.
policy:
quality-scorer:
judge:
enabled: true
endpoint: openai
model: gpt-4o
secret_key_ref:
env: OPENAI_API_KEY
timeout_ms: 5000
threshold: 0.7
warn_threshold: 0.5
rationale_capture: true
sampling_rate: 0.5
scorer_name: quality-judge
pack:
name: config-quality-assertions-example-57
version: 1.0.0
enabled: true
policies:
chain:
- quality-scorer
Failure action
Define what happens when the quality verdict fails.
policy:
quality-scorer:
failure_action:
action: block
fallback_message: Response quality below threshold.
max_retries: 2
pack:
name: config-quality-assertions-example-58
version: 1.0.0
enabled: true
policies:
chain:
- quality-scorer
Regression monitoring
Sample a fraction of production traffic and alert when quality scores degrade.
policy:
quality-scorer:
regression_monitoring:
enabled: true
sample_rate: 0.1
alert_threshold: 0.6
pack:
name: config-quality-assertions-example-59
version: 1.0.0
enabled: true
policies:
chain:
- quality-scorer
Complete quality scorer example
pack:
name: "quality-enforced"
version: "1.0.0"
enabled: true
providers:
targets:
- id: "openai-prod"
provider: "openai"
model: "gpt-4o"
secret_key_ref:
env: "OPENAI_API_KEY"
assertion_packs:
safety:
- type: "moderation"
config:
categories: ["violence", "hate"]
- type: "is-refusal"
config:
expected: false
policies:
chain:
- "prompt-injection"
- "quality-scorer"
policy:
quality-scorer:
providers:
- id: "judge"
provider: "openai"
model: "gpt-4o"
secret_key_ref:
env: "OPENAI_API_KEY"
config:
temperature: 0.0
benchmarks:
ragas_faithfulness: true
ragas_relevancy: true
assertions:
- pack: "safety"
- type: "llm-rubric"
rubric: "Response must be accurate, helpful, and well-structured"
threshold: 0.8
provider: "judge"
- type: "contains"
value: "source"
mode: "audit"
severity: "warning"
thresholds:
min_aggregate: 0.80
min_faithfulness: 0.75
weights:
faithfulness: 0.40
relevancy: 0.35
bleu: 0.25
pass_policy:
strategy: "weighted_average"
threshold: 0.75
failure_action:
action: "block"
fallback_message: "Quality check failed."
regression_monitoring:
enabled: true
sample_rate: 0.1
For AI systems
- Canonical terms: Keeptrusts, quality-scorer, assertions, threshold, weight, mode, severity, llm-rubric, context-faithfulness, rouge, semantic-similarity, is-json, trajectory
- Config/command names: `policy.quality-scorer.assertions[]`, assertion `type`, `threshold`, `weight`, `mode` (enforce/audit/shadow), `severity` (critical/warning/info), `negate`, `pack`
- Best next pages: Quality Scorer, Config Testing, Declarative Config Reference
For engineers
- Prerequisites: A `quality-scorer` block in your policy config. For LLM-judged assertions, a configured judge provider in `policy.quality-scorer.providers[]`.
- Validation: Run `kt policy test --json` from the pack directory to execute inline test suites. Check assertion results in the JSON output, then inspect decision events in the console Events page or with `kt events tail`.
- Key commands: `kt policy test`, `kt policy lint`, `kt events tail`
For leaders
- Governance: Quality assertions define your organization's minimum acceptable AI output bar. Critical-severity assertions that block responses directly impact user experience — review failure rates before tightening thresholds.
- Cost: LLM-judged assertions (`llm-rubric`, `context-faithfulness`) consume additional tokens per request. Each assertion call adds latency and cost proportional to the judge model's pricing.
- Rollout: Start assertions in `audit` mode to collect baseline scores without blocking traffic. Promote to `enforce` mode once false-positive rates are acceptable.
Next steps
- Quality Scorer — Parent policy configuration and thresholds
- Config Testing — Inline test suites for assertions
- Declarative Config Reference — Full config schema
- Policy Controls Catalog — All available policy kinds