
Quality Scorer

The quality-scorer policy evaluates AI response quality using a configurable mix of deterministic assertions, LLM-graded rubrics, NLP benchmarks, RAG evaluations, and agent trajectory checks. It enforces minimum score thresholds and controls what happens when a response fails quality gates — blocking, retrying, or falling back to a safe message.

Use this page when

  • You need to evaluate AI response quality using assertions, LLM judges, NLP benchmarks, or RAG evaluations.
  • You are configuring quality thresholds that block, retry, or escalate low-quality responses.
  • You want to set up quality scoring with industry modes, custom assertion packs, or agent trajectory checks.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Configuration

pack:
  name: quality-scorer-example-1
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer
policy:
  quality-scorer:
    industry: finance
    min_output_chars: 50
    min_sentences: 2
    mock_scoring: false
    providers:
      - id: quality-judge
        label: GPT-4o Judge
        provider: openai
        model: gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
        config:
          temperature: 0.0
    benchmarks:
      ragas_faithfulness: true
      ragas_relevancy: true
      coherence: true
      completeness: true
    assertions:
      - type: is-json
        name: response-is-json
        enabled: true
        threshold: 1.0
        weight: 0.3
        mode: enforce
        severity: critical
        config: {}
      - type: llm-rubric
        name: finance-accuracy
        enabled: true
        threshold: 0.85
        weight: 0.5
        mode: enforce
        severity: critical
        config:
          rubric: Evaluate whether the response is financially accurate, appropriately caveated, and suitable for internal review. Score from 0 to 1.
      - type: context-faithfulness
        name: rag-grounding
        enabled: true
        threshold: 0.8
        weight: 0.2
        mode: enforce
        severity: warning
        config: {}
    thresholds:
      min_aggregate: 0.7
      min_faithfulness: 0.8
      min_relevancy: 0.75
      min_coherence: 0.65
      min_completeness: 0.7
      min_accuracy: 0.8
    weights:
      faithfulness: 0.25
      relevancy: 0.25
      coherence: 0.15
      completeness: 0.15
      accuracy: 0.2
    failure_action:
      action: fallback
      fallback_message: I apologize, but I cannot provide a sufficiently accurate response at this time.
      max_retries: 2
    pass_policy:
      strategy: weighted_average
      threshold: 0.7
    industry_profiles:
      finance:
        min_aggregate: 0.85
        min_accuracy: 0.95
        min_faithfulness: 0.9
        min_relevancy: 0.8
        min_coherence: 0.75
        min_completeness: 0.8
      healthcare:
        min_aggregate: 0.9
        min_accuracy: 0.95
        min_faithfulness: 0.95
        min_relevancy: 0.85
        min_coherence: 0.8
        min_completeness: 0.85

Fields

Top-Level Properties

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| industry | string | "" | Name of an industry profile to activate from industry_profiles. When set, the profile's thresholds override the top-level thresholds. |
| min_output_chars | integer | 0 | Minimum character count for the AI response. Responses shorter than this fail immediately. Range: 0–10000000. |
| min_sentences | integer | 0 | Minimum sentence count for the AI response. Range: 0–1000000. |
| mock_scoring | boolean | false | When true, all assertions return synthetic pass scores. Useful for development and testing pipelines without consuming LLM judge tokens. |
| providers | array | [] | List of QualityProviderTarget entries — LLM providers used for graded assertions (e.g., llm-rubric, factuality). Also accepted as targets. |
| benchmarks | object |  | Toggles standard NLP/RAG benchmarks. See benchmarks below. |
| assertions | array | [] | Ordered list of QualityAssertion rules. See assertions below. |
| thresholds | object |  | Minimum pass scores for aggregate and per-type metrics. See thresholds below. |
| weights | object |  | Per-type scoring weights used for weighted average calculation. See weights below. |
| industry_profiles | object | {} | Named industry profiles with custom threshold overrides. See industry_profiles below. |
| failure_action | object |  | What to do when a response fails quality gates. See failure_action below. |
| pass_policy | object |  | Overall pass/fail aggregation strategy. See pass_policy below. |

providers / targets (QualityProviderTarget)

Each entry can be either a string (shorthand) or an object (full config).

String shorthand: "openai:gpt-4o" — parsed as provider:model.

Object format:

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| id | string | No | Unique identifier for this provider target. |
| label | string | No | Human-readable label shown in scoring reports. |
| provider | string | No | Provider name (e.g., "openai", "anthropic", "azure"). |
| target | string | No | Target identifier (alternative to provider + model). |
| model | string | No | Model name (e.g., "gpt-4o", "claude-sonnet-4-20250514"). |
| base_url | string | No | Custom base URL for the provider API. |
| api_base | string | No | Alias for base_url. |
| secret_key_ref | string or object | No | Reference to the API key: an environment variable name, or an object naming one (e.g., { env: OPENAI_API_KEY }, as used in the examples on this page). |
| headers | object | No | Additional HTTP headers sent with requests. |
| config | object | No | Provider-specific configuration (e.g., temperature, max_tokens). |
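Both forms can be mixed in one providers list. A minimal sketch (the id and environment variable names here are illustrative, not defaults):

```yaml
providers:
  # String shorthand, parsed as provider:model
  - openai:gpt-4o
  # Full object form for an explicit judge configuration
  - id: backup-judge               # illustrative id
    provider: anthropic
    model: claude-sonnet-4-20250514
    secret_key_ref:
      env: ANTHROPIC_API_KEY       # illustrative env var name
    config:
      temperature: 0.0
```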

benchmarks

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| ragas_faithfulness | boolean | false | Enable RAGAS faithfulness benchmark — measures whether the response is grounded in the provided context. |
| ragas_relevancy | boolean | false | Enable RAGAS relevancy benchmark — measures whether the response addresses the question. |
| bleu_score | boolean | false | Enable BLEU score — measures n-gram overlap with a reference response. |
| nli_entailment | boolean | false | Enable Natural Language Inference entailment check — verifies logical consistency. |
| coherence | boolean | false | Enable coherence benchmark — measures logical flow and consistency within the response. |
| completeness | boolean | false | Enable completeness benchmark — measures whether the response fully addresses the query. |

assertions (QualityAssertion)

| Property | Type | Default | Required | Description |
| --- | --- | --- | --- | --- |
| type | string |  | Yes | Assertion type. See Assertion Types below for the full list. |
| name | string |  | No | Human-readable name for reporting. Length: 1–200. |
| enabled | boolean | true | No | Whether this assertion is active. Set false to skip without removing. |
| threshold | number |  | No | Minimum score (0–1) for this assertion to pass. |
| weight | number |  | No | Relative weight (0–1) in aggregate scoring. |
| config | object | {} | No | Assertion-specific configuration. Contents vary by type (e.g., rubricPrompt for llm-rubric, pattern for regex). |
| mode | string | "enforce" | No | Execution mode. "enforce" blocks on failure. "audit" logs but passes through. "shadow" runs silently for data collection. |
| severity | string | "critical" | No | Alert severity on failure. One of "critical", "warning", "info". |
| pack | string |  | No | Name of an assertion pack to inline. Used with type: "assert-set" to reference reusable assertion groups. |
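The pack property is easiest to see in a config fragment. A minimal sketch, assuming a reusable assertion group named format-checks has been defined elsewhere (the names here are illustrative):

```yaml
assertions:
  # Inline a reusable assertion group by pack name
  - type: assert-set
    name: shared-format-checks
    pack: format-checks
  # Regular assertions can sit alongside it in the same list
  - type: is-json
    name: valid-json
    threshold: 1.0
```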

Assertion Types

Deterministic

Evaluate output using exact matching, pattern matching, or structural validation. No LLM calls required.

| Type | Description |
| --- | --- |
| contains | Output contains the specified substring. |
| contains-all | Output contains all specified substrings. |
| contains-any | Output contains at least one specified substring. |
| icontains | Case-insensitive contains. |
| icontains-all | Case-insensitive contains-all. |
| icontains-any | Case-insensitive contains-any. |
| equals | Output exactly equals the expected value. |
| regex | Output matches the specified regular expression. |
| starts-with | Output begins with the specified string. |
| word-count | Output word count falls within the specified range. |
| is-json | Output is valid JSON. |
| is-html | Output is valid HTML. |
| is-sql | Output is valid SQL. |
| is-xml | Output is valid XML. |
| contains-json | Output contains a valid JSON substring. |
| contains-html | Output contains a valid HTML substring. |
| contains-sql | Output contains a valid SQL substring. |
| contains-xml | Output contains a valid XML substring. |
| is-valid-function-call | Output is a valid function call structure. |
| is-valid-openai-function-call | Output is a valid OpenAI function call. |
| is-valid-openai-tools-call | Output is a valid OpenAI tools call. |
| levenshtein | Levenshtein edit distance to expected output is within threshold. |
| latency | Response latency is within the specified limit (milliseconds). |
| cost | Response cost is within the specified limit (USD). |
| finish-reason | The model's finish reason matches the expected value. |
| f-score | F-score (precision/recall harmonic mean) against expected output. |
| tool-call-f1 | F1 score for tool call accuracy against expected tool calls. |

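As an illustration of the deterministic family, the stand-ins below mimic three of the types above (is-json, contains-all, word-count) as local 0-or-1 scorers; they are sketches, not the gateway's actual implementation:

```python
import json

def is_json(output: str) -> float:
    """Score 1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
        return 1.0
    except (ValueError, TypeError):
        return 0.0

def contains_all(output: str, values: list[str]) -> float:
    """Score 1.0 only if every required substring is present."""
    return 1.0 if all(v in output for v in values) else 0.0

def word_count(output: str, min_words: int, max_words: int) -> float:
    """Score 1.0 if the word count falls inside [min_words, max_words]."""
    n = len(output.split())
    return 1.0 if min_words <= n <= max_words else 0.0
```

Because checks like these run locally, they cost nothing per evaluation, which is why deterministic gates are worth running before any LLM-graded assertion.
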
LLM-Graded

Use an LLM judge (configured via providers) to evaluate output quality.

| Type | Description |
| --- | --- |
| llm-rubric | Grade output against a free-form rubric prompt. |
| search-rubric | Grade output with a rubric that includes search/retrieval context. |
| model-graded-closedqa | Grade correctness for closed-domain question answering. |
| factuality | Grade factual accuracy against a known correct answer. |
| g-eval | G-Eval framework scoring with customizable criteria. |
| answer-relevance | Grade whether the answer is relevant to the question. |
| similar | Semantic similarity to a reference output (LLM-judged). |
| semantic-similarity | Semantic similarity using embedding distance. |
| classifier | Classify output into predefined categories. |
| moderation | Run content moderation check on the output. |
| select-best | Compare multiple outputs and select the best one. |

Benchmark

Standard NLP metrics computed against reference outputs.

| Type | Description |
| --- | --- |
| rouge | ROUGE score (recall-oriented understudy for gisting evaluation). |
| rouge-n | ROUGE-N score (n-gram overlap). |
| meteor | METEOR score (semantic-aware translation metric). |
| gleu | GLEU score (Google-BLEU variant). |
| bleu_score | BLEU score (bilingual evaluation understudy). |
| perplexity | Model perplexity on the output. |
| perplexity-score | Normalized perplexity score. |
| nli_entailment | Natural Language Inference entailment check. |
| coherence | Coherence score for logical flow. |
| completeness | Completeness score for full coverage of the query. |

RAG (Retrieval-Augmented Generation)

Evaluate quality in RAG pipelines where context documents are provided.

| Type | Description |
| --- | --- |
| context-recall | Measures how much of the ground truth is captured by retrieved context. |
| context-relevance | Measures relevance of retrieved context to the question. |
| context-faithfulness | Measures whether the response is faithful to the retrieved context. |
| rag-document-exfiltration | Detects if the model leaks full retrieved documents verbatim. |
| rag-poisoning | Detects signs of poisoned context documents influencing the output. |
| rag-source-attribution | Verifies that the response correctly attributes sources. |
| conversation-relevance | Measures relevance within a multi-turn conversation context. |

Agent Trajectory

Evaluate agent behavior across multi-step workflows.

| Type | Description |
| --- | --- |
| trajectory:goal-success | Whether the agent achieved the stated goal. |
| trajectory:tool-used | Whether the agent used a specific expected tool. |
| trajectory:tool-sequence | Whether the agent used tools in the expected sequence. |
| trajectory:step-count | Whether the agent completed the task within a step budget. |
| trace-span-count | Number of trace spans falls within expected range. |
| trace-span-duration | Trace span durations fall within expected limits. |
| trace-error-spans | Number of error spans is within acceptable limits. |

Programmatic

Run custom code to evaluate output.

| Type | Description |
| --- | --- |
| javascript | Execute a JavaScript function that returns a score. |
| python | Execute a Python function that returns a score. |
| webhook | Send the output to an external webhook for scoring. |

Meta

Control assertion grouping and special behaviors.

| Type | Description |
| --- | --- |
| assert-set | Group multiple assertions into a reusable pack. Use with the pack property. |
| is-refusal | Check whether the model refused to answer. |
| max-score | Cap the assertion score at a maximum value. |
| pi | Prompt injection detection assertion. |

thresholds

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| min_aggregate | number | 0.7 | Minimum weighted aggregate score across all assertions and benchmarks. |
| min_faithfulness | number | 0.8 | Minimum faithfulness score. |
| min_relevancy | number | 0.75 | Minimum relevancy score. |
| min_bleu | number | 0.4 | Minimum BLEU score. |
| min_coherence | number | 0.65 | Minimum coherence score. |
| min_completeness | number | 0.7 | Minimum completeness score. |
| min_accuracy | number | 0.8 | Minimum accuracy score. |

weights

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| faithfulness | number | 0.25 | Weight of faithfulness in aggregate calculation. |
| relevancy | number | 0.25 | Weight of relevancy in aggregate calculation. |
| bleu | number | 0.2 | Weight of BLEU score in aggregate calculation. |
| coherence | number | 0.15 | Weight of coherence in aggregate calculation. |
| completeness | number | 0.15 | Weight of completeness in aggregate calculation. |
| accuracy | number | 0.2 | Weight of accuracy in aggregate calculation. |
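As a worked example of how these weights might combine per-type scores, here is a minimal sketch; it assumes (since this page does not spell it out) that weights are renormalized over the metrics actually scored:

```python
def weighted_aggregate(scores, weights):
    """Weighted average over the metrics that were actually scored.

    Assumes weights are renormalized over present metrics; the
    gateway's exact normalization may differ.
    """
    total_w = sum(weights[k] for k in scores if k in weights)
    if total_w == 0:
        return 0.0
    return sum(scores[k] * weights[k] for k in scores if k in weights) / total_w

# Default weights from the table above
weights = {"faithfulness": 0.25, "relevancy": 0.25, "bleu": 0.2,
           "coherence": 0.15, "completeness": 0.15, "accuracy": 0.2}

# Suppose only three metrics were scored on this response
scores = {"faithfulness": 0.9, "relevancy": 0.8, "coherence": 0.7}
agg = weighted_aggregate(scores, weights)
# (0.9*0.25 + 0.8*0.25 + 0.7*0.15) / (0.25 + 0.25 + 0.15) ≈ 0.815
```

Under thresholds.min_aggregate: 0.7 this response would pass; under a finance profile requiring 0.85 it would not.
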

failure_action

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| action | string | "block" | What to do when the response fails quality gates. One of "block", "fallback", "retry". |
| fallback_message | string | "I apologize, but I cannot provide a sufficiently accurate response at this time." | Message returned to the user when action is "fallback". |
| max_retries | integer | 0 | Number of retry attempts before falling back or blocking. Range: 0–5. Only used when action is "retry". |

pass_policy

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| strategy | string | "all" | Aggregation strategy. "all" requires every assertion to pass. "quorum" requires a fraction to pass. "weighted_average" computes a weighted score. |
| quorum | number |  | Fraction of assertions that must pass (0–1). Used when strategy is "quorum". |
| threshold | number |  | Minimum weighted average score (0–1). Used when strategy is "weighted_average". |

industry_profiles

A map of profile names to QualityProfile objects. Each profile overrides the top-level thresholds when activated via the industry property.

QualityProfile:

| Property | Type | Range | Description |
| --- | --- | --- | --- |
| min_aggregate | number | 0–1 | Override for thresholds.min_aggregate. |
| min_accuracy | number | 0–1 | Override for thresholds.min_accuracy. |
| min_faithfulness | number | 0–1 | Override for thresholds.min_faithfulness. |
| min_relevancy | number | 0–1 | Override for thresholds.min_relevancy. |
| min_coherence | number | 0–1 | Override for thresholds.min_coherence. |
| min_completeness | number | 0–1 | Override for thresholds.min_completeness. |

Use Cases

1. Simple Format Assertions

Validate that the response is valid JSON, contains required fields, and matches an expected pattern.

policy:
  quality-scorer:
    assertions:
      - type: is-json
        name: valid-json
        threshold: 1.0
        mode: enforce
        severity: critical
      - type: contains-all
        name: required-fields
        threshold: 1.0
        config:
          values:
            - '"status"'
            - '"data"'
            - '"timestamp"'
      - type: regex
        name: iso-date-format
        config:
          pattern: '\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}'
      - type: word-count
        name: minimum-length
        config:
          min: 20
          max: 500
    failure_action:
      action: block
pack:
  name: quality-scorer-example-2
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer

2. RAG Quality with Faithfulness and Relevancy Benchmarks

Ensure RAG pipeline outputs are grounded in retrieved context and relevant to the query.

policy:
  quality-scorer:
    benchmarks:
      ragas_faithfulness: true
      ragas_relevancy: true
      coherence: true
      completeness: true
    assertions:
      - type: context-faithfulness
        name: grounded-in-context
        threshold: 0.85
        weight: 0.4
        mode: enforce
        severity: critical
      - type: context-relevance
        name: context-is-relevant
        threshold: 0.75
        weight: 0.3
        mode: enforce
        severity: warning
      - type: rag-source-attribution
        name: cites-sources
        threshold: 0.8
        weight: 0.2
        mode: enforce
        severity: warning
      - type: rag-document-exfiltration
        name: no-document-leak
        threshold: 1.0
        weight: 0.1
        mode: enforce
        severity: critical
    thresholds:
      min_aggregate: 0.8
      min_faithfulness: 0.85
      min_relevancy: 0.75
    failure_action:
      action: fallback
      fallback_message: I could not generate a sufficiently grounded response. Please rephrase your question.
pack:
  name: quality-scorer-example-3
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer

3. LLM-Graded Rubric with External Provider

Use GPT-4o as a judge to grade responses against domain-specific rubrics.

pack:
  name: quality-scorer-example-4
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer
policy:
  quality-scorer:
    providers:
      - id: gpt4o-judge
        label: GPT-4o Quality Judge
        provider: openai
        model: gpt-4o
        secret_key_ref:
          env: OPENAI_JUDGE_API_KEY
        config:
          temperature: 0.0
          max_tokens: 1024
    assertions:
      - type: llm-rubric
        name: legal-accuracy
        threshold: 0.9
        weight: 0.4
        mode: enforce
        severity: critical
        config:
          rubric: |
            Evaluate the legal accuracy of this response:
            1. Are legal citations correct and verifiable?
            2. Is the legal reasoning logically sound?
            3. Are relevant statutes and case law referenced?
            Score from 0 to 1.
      - type: factuality
        name: fact-check
        threshold: 0.85
        weight: 0.3
        mode: enforce
        severity: critical
        config:
          reference_statement: The response must cite only verifiable legal authorities and must not invent statutes or case law.
      - type: answer-relevance
        name: stays-on-topic
        threshold: 0.8
        weight: 0.2
        mode: enforce
        severity: warning
        config: {}
      - type: moderation
        name: content-safety
        threshold: 1.0
        weight: 0.1
        mode: enforce
        severity: critical
        config: {}
    pass_policy:
      strategy: weighted_average
      threshold: 0.85

4. Industry-Specific Profiles

Define reusable threshold profiles and activate them by setting industry.

pack:
  name: quality-scorer-example-5
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer
policy:
  quality-scorer:
    industry: healthcare
    providers:
      - id: clinical-judge
        label: Clinical Quality Judge
        provider: openai
        model: gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
        config:
          temperature: 0.0
    benchmarks:
      ragas_faithfulness: true
      coherence: true
      completeness: true
    assertions:
      - type: llm-rubric
        name: clinical-accuracy
        threshold: 0.95
        mode: enforce
        severity: critical
        config:
          rubric: Evaluate clinical accuracy. Are diagnoses evidence-based? Are drug interactions noted?
      - type: context-faithfulness
        name: evidence-grounding
        threshold: 0.9
        mode: enforce
        severity: critical
        config: {}
    industry_profiles:
      finance:
        min_aggregate: 0.85
        min_accuracy: 0.95
        min_faithfulness: 0.9
        min_relevancy: 0.8
        min_coherence: 0.75
        min_completeness: 0.8
      healthcare:
        min_aggregate: 0.9
        min_accuracy: 0.95
        min_faithfulness: 0.95
        min_relevancy: 0.85
        min_coherence: 0.8
        min_completeness: 0.85
      legal:
        min_aggregate: 0.88
        min_accuracy: 0.92
        min_faithfulness: 0.9
        min_relevancy: 0.85
        min_coherence: 0.8
        min_completeness: 0.82
    failure_action:
      action: block

5. Assertion Pack with Mixed Deterministic and Graded

Combine structural validation with LLM-graded quality in a single assertion set.

policy:
  quality-scorer:
    providers:
      - openai:gpt-4o
    assertions:
      - type: is-json
        name: valid-structure
        threshold: 1.0
        mode: enforce
        severity: critical
      - type: contains
        name: has-recommendation
        config:
          value: recommendation
        mode: enforce
        severity: warning
      - type: latency
        name: response-time
        config:
          maxLatency: 5000
        mode: audit
        severity: info
      - type: cost
        name: cost-cap
        config:
          maxCost: 0.05
        mode: audit
        severity: info
      - type: llm-rubric
        name: quality-rubric
        threshold: 0.8
        weight: 0.5
        mode: enforce
        severity: critical
        config:
          rubricPrompt: Is the recommendation actionable, specific, and supported by data?
      - type: semantic-similarity
        name: consistent-with-baseline
        threshold: 0.7
        weight: 0.3
        mode: enforce
        severity: warning
    pass_policy:
      strategy: quorum
      quorum: 0.8
    failure_action:
      action: retry
      max_retries: 2
pack:
  name: quality-scorer-example-6
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer

6. Agent Trajectory Assertions

Evaluate whether an AI agent completed its task correctly by checking tool usage and step counts.

policy:
  quality-scorer:
    assertions:
      - type: trajectory:goal-success
        name: task-completed
        threshold: 1.0
        weight: 0.4
        mode: enforce
        severity: critical
      - type: trajectory:tool-used
        name: used-search-tool
        threshold: 1.0
        weight: 0.2
        mode: enforce
        severity: warning
        config:
          tool: web_search
      - type: trajectory:tool-sequence
        name: correct-workflow
        threshold: 1.0
        weight: 0.2
        mode: audit
        severity: info
        config:
          sequence:
            - search_database
            - analyze_results
            - generate_report
      - type: trajectory:step-count
        name: efficient-execution
        threshold: 1.0
        weight: 0.1
        mode: audit
        severity: info
        config:
          maxSteps: 10
      - type: trace-error-spans
        name: no-errors
        threshold: 1.0
        weight: 0.1
        mode: enforce
        severity: warning
        config:
          maxErrors: 0
    pass_policy:
      strategy: weighted_average
      threshold: 0.85
pack:
  name: quality-scorer-example-7
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer

How It Works

  1. Minimum length check — Before running assertions, the gateway checks min_output_chars and min_sentences. If the response is too short, it fails immediately without consuming assertion evaluation resources.

  2. Assertion evaluation — Each enabled assertion in the assertions array is evaluated in order:

    • Deterministic assertions (e.g., is-json, contains, regex) run locally with no external calls.
    • LLM-graded assertions (e.g., llm-rubric, factuality) send the output to the configured providers for scoring.
    • Benchmark assertions (e.g., rouge, bleu_score) compute standard NLP metrics against reference outputs.
    • RAG assertions evaluate context faithfulness and relevance using the retrieval context.
    • Agent trajectory assertions inspect the trace spans for goal completion, tool usage, and step counts.
  3. Score aggregation — Individual assertion scores are combined according to pass_policy.strategy:

    • "all" — Every assertion must meet its threshold. One failure fails the entire evaluation.
    • "quorum" — At least pass_policy.quorum fraction of assertions must pass.
    • "weighted_average" — Scores are combined using weights and compared against pass_policy.threshold.
  4. Threshold comparison — The aggregate score and per-type scores are compared against thresholds. If an industry profile is active, its overrides take precedence.

  5. Failure handling — When the response fails quality gates, failure_action determines the outcome:

    • "block" — The response is suppressed entirely.
    • "retry" — The request is re-sent to the LLM up to max_retries times. If all retries fail, behavior falls through to block.
    • "fallback" — The fallback_message is returned instead of the AI response.
  6. Mode enforcement — Only assertions with mode: "enforce" affect the pass/fail outcome. "audit" assertions are scored and logged but do not block. "shadow" assertions run silently for observability only.

  7. Mock scoring — When mock_scoring: true, all assertions return synthetic passing scores. This allows CI/CD pipelines to test policy configuration without real LLM calls.

  8. Event emission — All assertion results, aggregate scores, and failure actions are emitted as structured events to POST /v1/events.
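The aggregation strategies in step 3 can be sketched as follows; this is an illustrative model (the function and tuple shapes are assumptions, not the gateway's internals):

```python
def passes(results, strategy="all", quorum=None, threshold=None):
    """Sketch of pass_policy aggregation over enforce-mode assertions.

    results: list of (score, threshold, weight) tuples.
    """
    if strategy == "all":
        # Every assertion must meet its own threshold.
        return all(score >= thr for score, thr, _ in results)
    if strategy == "quorum":
        # At least `quorum` fraction of assertions must pass.
        passed = sum(1 for score, thr, _ in results if score >= thr)
        return passed / len(results) >= quorum
    if strategy == "weighted_average":
        # Combine scores by weight and compare to the policy threshold.
        total_w = sum(w for _, _, w in results) or 1.0
        avg = sum(score * w for score, _, w in results) / total_w
        return avg >= threshold
    raise ValueError(f"unknown strategy: {strategy}")

results = [(1.0, 1.0, 0.3),   # e.g. is-json passed
           (0.9, 0.85, 0.5),  # e.g. llm-rubric passed
           (0.7, 0.8, 0.2)]   # e.g. context-faithfulness failed
```

With these example scores, "all" fails because one assertion missed its threshold, while a weighted average of 0.89 still clears a pass_policy.threshold of 0.85, which is why the weighted strategy is more forgiving of a single low-weight miss.
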

Combining With Other Policies

| Policy | Interaction |
| --- | --- |
| prompt-injection | Runs before quality scoring. Injected prompts are blocked before they reach the LLM, so quality scoring only evaluates legitimate responses. |
| pii-filter | Runs before quality scoring on the output. PII is redacted before the quality scorer evaluates the response, so assertions see the redacted version. |
| content-filter | Runs before quality scoring. Blocked content never reaches the quality scorer. |
| agent-firewall | Runs before quality scoring. The firewall controls which tools the agent can call; the quality scorer evaluates the final response. |
| rate-limiter | Independent. Rate limiting controls request throughput; quality scoring evaluates response content. Both are enforced. |
| disclaimer | Runs after quality scoring. Disclaimers are appended to the response after it passes quality gates, so they do not affect scoring. |

Best Practices

  • Start with deterministic assertions (is-json, contains, regex) before adding LLM-graded assertions. Deterministic checks are free, fast, and repeatable — they catch format issues before expensive LLM judge calls.
  • Use mode: "audit" for new assertions. Run in audit mode for a week to collect baseline scores before switching to "enforce". This prevents unexpected blocking in production.
  • Set mock_scoring: true in CI/CD. This validates your policy configuration syntax without consuming LLM tokens or requiring API keys.
  • Define industry_profiles for multi-vertical deployments. Instead of duplicating entire policy configs, use profiles to override only the thresholds that differ.
  • Use "weighted_average" pass policy for nuanced scoring. The "all" strategy is strict — one low-weight informational assertion can block a response. Weighted average lets you balance strictness with flexibility.
  • Keep max_retries at 1–2. Each retry doubles LLM cost. More than 2 retries suggests a fundamental model or prompt issue, not a transient quality dip.
  • Separate assertion packs by concern. Use assert-set with pack to group related assertions (e.g., "format-checks", "clinical-accuracy") so they can be reused across policies.
  • Use trajectory assertions for agent workflows. Traditional text-quality assertions evaluate the final response; trajectory assertions evaluate the agent's decision-making process.
  • Configure a dedicated low-temperature judge provider. Set temperature: 0.0 on your judge LLM to maximize scoring reproducibility across evaluations.

For AI systems

  • Canonical terms: Keeptrusts, quality-scorer, assertions, thresholds, min_aggregate, failure_action, providers, benchmarks, industry, mock_scoring, retry, fallback_message
  • Config/command names: policy.quality-scorer, assertions[], thresholds.min_aggregate, failure_action (block/retry/fallback_message), providers[] (judge LLMs), benchmarks, industry
  • Best next pages: Quality Assertions Configuration, Config Testing, Human Oversight

For engineers

  • Prerequisites: For LLM-judged assertions: a configured judge provider. For RAG benchmarks: context documents attached to requests. Define thresholds.min_aggregate based on acceptable quality floor.
  • Validation: Run kt policy test with quality-focused test cases. Monitor aggregate scores in Events. Adjust thresholds based on false-blocking rates. Use mock_scoring: true for development.
  • Key commands: kt policy lint, kt policy test, kt events tail, console Events page

For leaders

  • Governance: Quality scoring defines your organization's minimum acceptable AI output bar. It prevents low-quality, hallucinated, or irrelevant responses from reaching users.
  • Cost: LLM-judged assertions consume additional tokens per request (judge model cost). NLP benchmarks and deterministic assertions are near-free. Budget for judge model costs in high-volume deployments.
  • Rollout: Start with deterministic assertions (format checks, keyword requirements). Add LLM-judged assertions for critical quality dimensions. Set failure_action: retry before block to maximize response delivery.
