Quality Scorer
The quality-scorer policy evaluates AI response quality using a configurable mix of deterministic assertions, LLM-graded rubrics, NLP benchmarks, RAG evaluations, and agent trajectory checks. It enforces minimum score thresholds and controls what happens when a response fails quality gates — blocking, retrying, or falling back to a safe message.
Use this page when
- You need to evaluate AI response quality using assertions, LLM judges, NLP benchmarks, or RAG evaluations.
- You are configuring quality thresholds that block, retry, or escalate low-quality responses.
- You want to set up quality scoring with industry modes, custom assertion packs, or agent trajectory checks.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Configuration
```yaml
pack:
  name: quality-scorer-example-1
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer
policy:
  quality-scorer:
    industry: finance
    min_output_chars: 50
    min_sentences: 2
    mock_scoring: false
    providers:
      - id: quality-judge
        label: GPT-4o Judge
        provider: openai
        model: gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
        config:
          temperature: 0.0
    benchmarks:
      ragas_faithfulness: true
      ragas_relevancy: true
      coherence: true
      completeness: true
    assertions:
      - type: is-json
        name: response-is-json
        enabled: true
        threshold: 1.0
        weight: 0.3
        mode: enforce
        severity: critical
        config: {}
      - type: llm-rubric
        name: finance-accuracy
        enabled: true
        threshold: 0.85
        weight: 0.5
        mode: enforce
        severity: critical
        config:
          rubric: Evaluate whether the response is financially accurate, appropriately caveated, and suitable for internal review. Score from 0 to 1.
      - type: context-faithfulness
        name: rag-grounding
        enabled: true
        threshold: 0.8
        weight: 0.2
        mode: enforce
        severity: warning
        config: {}
    thresholds:
      min_aggregate: 0.7
      min_faithfulness: 0.8
      min_relevancy: 0.75
      min_coherence: 0.65
      min_completeness: 0.7
      min_accuracy: 0.8
    weights:
      faithfulness: 0.25
      relevancy: 0.25
      coherence: 0.15
      completeness: 0.15
      accuracy: 0.2
    failure_action:
      action: fallback
      fallback_message: I apologize, but I cannot provide a sufficiently accurate response at this time.
      max_retries: 2
    pass_policy:
      strategy: weighted_average
      threshold: 0.7
    industry_profiles:
      finance:
        min_aggregate: 0.85
        min_accuracy: 0.95
        min_faithfulness: 0.9
        min_relevancy: 0.8
        min_coherence: 0.75
        min_completeness: 0.8
      healthcare:
        min_aggregate: 0.9
        min_accuracy: 0.95
        min_faithfulness: 0.95
        min_relevancy: 0.85
        min_coherence: 0.8
        min_completeness: 0.85
```
Fields
Top-Level Properties
| Property | Type | Default | Description |
|---|---|---|---|
industry | string | "" | Name of an industry profile to activate from industry_profiles. When set, the profile's thresholds override the top-level thresholds. |
min_output_chars | integer | 0 | Minimum character count for the AI response. Responses shorter than this fail immediately. Range: 0–10000000. |
min_sentences | integer | 0 | Minimum sentence count for the AI response. Range: 0–1000000. |
mock_scoring | boolean | false | When true, all assertions return synthetic pass scores. Useful for development and testing pipelines without consuming LLM judge tokens. |
providers | array | [] | List of QualityProviderTarget entries — LLM providers used for graded assertions (e.g., llm-rubric, factuality). Also accepted as targets. |
benchmarks | object | — | Toggle standard NLP/RAG benchmarks. See benchmarks below. |
assertions | array | [] | Ordered list of QualityAssertion rules. See assertions below. |
thresholds | object | — | Minimum pass scores for aggregate and per-type metrics. See thresholds below. |
weights | object | — | Per-type scoring weights used for weighted average calculation. See weights below. |
industry_profiles | object | {} | Named industry profiles with custom threshold overrides. See industry_profiles below. |
failure_action | object | — | What to do when a response fails quality gates. See failure_action below. |
pass_policy | object | — | Overall pass/fail aggregation strategy. See pass_policy below. |
providers / targets (QualityProviderTarget)
Each entry can be either a string (shorthand) or an object (full config).
String shorthand: "openai:gpt-4o" — parsed as provider:model.
Object format:
| Property | Type | Required | Description |
|---|---|---|---|
id | string | No | Unique identifier for this provider target. |
label | string | No | Human-readable label shown in scoring reports. |
provider | string | No | Provider name (e.g., "openai", "anthropic", "azure"). |
target | string | No | Target identifier (alternative to provider + model). |
model | string | No | Model name (e.g., "gpt-4o", "claude-sonnet-4-20250514"). |
base_url | string | No | Custom base URL for the provider API. |
api_base | string | No | Alias for base_url. |
secret_key_ref | string | No | Environment variable name containing the API key. |
headers | object | No | Additional HTTP headers sent with requests. |
config | object | No | Provider-specific configuration (e.g., temperature, max_tokens). |
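For illustration, the sketch below declares one provider in each form. The Anthropic entry and its `ANTHROPIC_API_KEY` variable are assumptions for the example, not values from this page.

```yaml
providers:
  - openai:gpt-4o                    # string shorthand, parsed as provider:model
  - id: anthropic-judge              # full object form
    label: Claude Judge
    provider: anthropic
    model: claude-sonnet-4-20250514
    secret_key_ref:
      env: ANTHROPIC_API_KEY         # assumed env var name
    config:
      temperature: 0.0
```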
benchmarks
| Property | Type | Default | Description |
|---|---|---|---|
ragas_faithfulness | boolean | false | Enable RAGAS faithfulness benchmark — measures whether the response is grounded in the provided context. |
ragas_relevancy | boolean | false | Enable RAGAS relevancy benchmark — measures whether the response addresses the question. |
bleu_score | boolean | false | Enable BLEU score — measures n-gram overlap with a reference response. |
nli_entailment | boolean | false | Enable Natural Language Inference entailment check — verifies logical consistency. |
coherence | boolean | false | Enable coherence benchmark — measures logical flow and consistency within the response. |
completeness | boolean | false | Enable completeness benchmark — measures whether the response fully addresses the query. |
assertions (QualityAssertion)
| Property | Type | Default | Required | Description |
|---|---|---|---|---|
type | string | — | Yes | Assertion type. See Assertion Types below for the full list. |
name | string | — | No | Human-readable name for reporting. Length: 1–200. |
enabled | boolean | true | No | Whether this assertion is active. Set false to skip without removing. |
threshold | number | — | No | Minimum score (0–1) for this assertion to pass. |
weight | number | — | No | Relative weight (0–1) in aggregate scoring. |
config | object | {} | No | Assertion-specific configuration. Contents vary by type (e.g., rubricPrompt for llm-rubric, pattern for regex). |
mode | string | "enforce" | No | Execution mode. "enforce" blocks on failure. "audit" logs but passes through. "shadow" runs silently for data collection. |
severity | string | "critical" | No | Alert severity on failure. One of "critical", "warning", "info". |
pack | string | — | No | Name of an assertion pack to inline. Used with type: "assert-set" to reference reusable assertion groups. |
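As a hedged sketch of the `pack` property, the snippet below inlines a reusable assertion group via `assert-set`. The pack name `format-checks` is hypothetical, and how packs are defined and registered is outside the scope of this snippet.

```yaml
assertions:
  - type: assert-set
    name: shared-format-checks
    pack: format-checks       # hypothetical reusable assertion pack
    mode: enforce
    severity: critical
```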
Assertion Types
Deterministic
Evaluate output using exact matching, pattern matching, or structural validation. No LLM calls required.
| Type | Description |
|---|---|
contains | Output contains the specified substring. |
contains-all | Output contains all specified substrings. |
contains-any | Output contains at least one specified substring. |
icontains | Case-insensitive contains. |
icontains-all | Case-insensitive contains-all. |
icontains-any | Case-insensitive contains-any. |
equals | Output exactly equals the expected value. |
regex | Output matches the specified regular expression. |
starts-with | Output begins with the specified string. |
word-count | Output word count falls within the specified range. |
is-json | Output is valid JSON. |
is-html | Output is valid HTML. |
is-sql | Output is valid SQL. |
is-xml | Output is valid XML. |
contains-json | Output contains a valid JSON substring. |
contains-html | Output contains a valid HTML substring. |
contains-sql | Output contains a valid SQL substring. |
contains-xml | Output contains a valid XML substring. |
is-valid-function-call | Output is a valid function call structure. |
is-valid-openai-function-call | Output is a valid OpenAI function call. |
is-valid-openai-tools-call | Output is a valid OpenAI tools call. |
levenshtein | Levenshtein edit distance to expected output is within threshold. |
latency | Response latency is within the specified limit (milliseconds). |
cost | Response cost is within the specified limit (USD). |
finish-reason | The model's finish reason matches the expected value. |
f-score | F-score (precision/recall harmonic mean) against expected output. |
tool-call-f1 | F1 score for tool call accuracy against expected tool calls. |
LLM-Graded
Use an LLM judge (configured via providers) to evaluate output quality.
| Type | Description |
|---|---|
llm-rubric | Grade output against a free-form rubric prompt. |
search-rubric | Grade output with a rubric that includes search/retrieval context. |
model-graded-closedqa | Grade correctness for closed-domain question answering. |
factuality | Grade factual accuracy against a known correct answer. |
g-eval | G-Eval framework scoring with customizable criteria. |
answer-relevance | Grade whether the answer is relevant to the question. |
similar | Semantic similarity to a reference output (LLM-judged). |
semantic-similarity | Semantic similarity using embedding distance. |
classifier | Classify output into predefined categories. |
moderation | Run content moderation check on the output. |
select-best | Compare multiple outputs and select the best one. |
Benchmark
Standard NLP metrics computed against reference outputs.
| Type | Description |
|---|---|
rouge | ROUGE score (recall-oriented understudy for gisting evaluation). |
rouge-n | ROUGE-N score (n-gram overlap). |
meteor | METEOR score (semantic-aware translation metric). |
gleu | GLEU score (Google-BLEU variant). |
bleu_score | BLEU score (bilingual evaluation understudy). |
perplexity | Model perplexity on the output. |
perplexity-score | Normalized perplexity score. |
nli_entailment | Natural Language Inference entailment check. |
coherence | Coherence score for logical flow. |
completeness | Completeness score for full coverage of the query. |
RAG (Retrieval-Augmented Generation)
Evaluate quality in RAG pipelines where context documents are provided.
| Type | Description |
|---|---|
context-recall | Measures how much of the ground truth is captured by retrieved context. |
context-relevance | Measures relevance of retrieved context to the question. |
context-faithfulness | Measures whether the response is faithful to the retrieved context. |
rag-document-exfiltration | Detects if the model leaks full retrieved documents verbatim. |
rag-poisoning | Detects signs of poisoned context documents influencing the output. |
rag-source-attribution | Verifies that the response correctly attributes sources. |
conversation-relevance | Measures relevance within a multi-turn conversation context. |
Agent Trajectory
Evaluate agent behavior across multi-step workflows.
| Type | Description |
|---|---|
trajectory:goal-success | Whether the agent achieved the stated goal. |
trajectory:tool-used | Whether the agent used a specific expected tool. |
trajectory:tool-sequence | Whether the agent used tools in the expected sequence. |
trajectory:step-count | Whether the agent completed the task within a step budget. |
trace-span-count | Number of trace spans falls within expected range. |
trace-span-duration | Trace span durations fall within expected limits. |
trace-error-spans | Number of error spans is within acceptable limits. |
Programmatic
Run custom code to evaluate output.
| Type | Description |
|---|---|
javascript | Execute a JavaScript function that returns a score. |
python | Execute a Python function that returns a score. |
webhook | Send the output to an external webhook for scoring. |
Meta
Control assertion grouping and special behaviors.
| Type | Description |
|---|---|
assert-set | Group multiple assertions into a reusable pack. Use with the pack property. |
is-refusal | Check whether the model refused to answer. |
max-score | Cap the assertion score at a maximum value. |
pi | Prompt injection detection assertion. |
thresholds
| Property | Type | Default | Description |
|---|---|---|---|
min_aggregate | number | 0.7 | Minimum weighted aggregate score across all assertions and benchmarks. |
min_faithfulness | number | 0.8 | Minimum faithfulness score. |
min_relevancy | number | 0.75 | Minimum relevancy score. |
min_bleu | number | 0.4 | Minimum BLEU score. |
min_coherence | number | 0.65 | Minimum coherence score. |
min_completeness | number | 0.7 | Minimum completeness score. |
min_accuracy | number | 0.8 | Minimum accuracy score. |
weights
| Property | Type | Default | Description |
|---|---|---|---|
faithfulness | number | 0.25 | Weight of faithfulness in aggregate calculation. |
relevancy | number | 0.25 | Weight of relevancy in aggregate calculation. |
bleu | number | 0.2 | Weight of BLEU score in aggregate calculation. |
coherence | number | 0.15 | Weight of coherence in aggregate calculation. |
completeness | number | 0.15 | Weight of completeness in aggregate calculation. |
accuracy | number | 0.2 | Weight of accuracy in aggregate calculation. |
failure_action
| Property | Type | Default | Description |
|---|---|---|---|
action | string | "block" | What to do when the response fails quality gates. One of "block", "fallback", "retry". |
fallback_message | string | "I apologize, but I cannot provide a sufficiently accurate response at this time." | Message returned to the user when action is "fallback". |
max_retries | integer | 0 | Number of retry attempts before falling back or blocking. Range: 0–5. Only used when action is "retry". |
pass_policy
| Property | Type | Default | Description |
|---|---|---|---|
strategy | string | "all" | Aggregation strategy. "all" requires every assertion to pass. "quorum" requires a fraction to pass. "weighted_average" computes a weighted score. |
quorum | number | — | Fraction of assertions that must pass (0–1). Used when strategy is "quorum". |
threshold | number | — | Minimum weighted average score (0–1). Used when strategy is "weighted_average". |
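As a rough worked example, using the weights from the configuration example above and assuming the aggregate is a weight-normalized sum of per-type scores (the per-type scores here are invented for illustration):

```text
faithfulness  0.90 x 0.25 = 0.225
relevancy     0.70 x 0.25 = 0.175
coherence     0.80 x 0.15 = 0.120
completeness  0.60 x 0.15 = 0.090
accuracy      0.85 x 0.20 = 0.170
aggregate = 0.780  ->  passes pass_policy.threshold: 0.7
```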
industry_profiles
A map of profile names to QualityProfile objects. Each profile overrides the top-level thresholds when activated via the industry property.
QualityProfile:
| Property | Type | Range | Description |
|---|---|---|---|
min_aggregate | number | 0–1 | Override for thresholds.min_aggregate. |
min_accuracy | number | 0–1 | Override for thresholds.min_accuracy. |
min_faithfulness | number | 0–1 | Override for thresholds.min_faithfulness. |
min_relevancy | number | 0–1 | Override for thresholds.min_relevancy. |
min_coherence | number | 0–1 | Override for thresholds.min_coherence. |
min_completeness | number | 0–1 | Override for thresholds.min_completeness. |
Use Cases
1. Simple Format Assertions
Validate that the response is valid JSON, contains required fields, and matches an expected pattern.
```yaml
policy:
  quality-scorer:
    assertions:
      - type: is-json
        name: valid-json
        threshold: 1.0
        mode: enforce
        severity: critical
      - type: contains-all
        name: required-fields
        threshold: 1.0
        config:
          values:
            - '"status"'
            - '"data"'
            - '"timestamp"'
      - type: regex
        name: iso-date-format
        config:
          pattern: '\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}'
      - type: word-count
        name: minimum-length
        config:
          min: 20
          max: 500
    failure_action:
      action: block
pack:
  name: quality-scorer-example-2
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer
```
2. RAG Quality with Faithfulness and Relevancy Benchmarks
Ensure RAG pipeline outputs are grounded in retrieved context and relevant to the query.
```yaml
policy:
  quality-scorer:
    benchmarks:
      ragas_faithfulness: true
      ragas_relevancy: true
      coherence: true
      completeness: true
    assertions:
      - type: context-faithfulness
        name: grounded-in-context
        threshold: 0.85
        weight: 0.4
        mode: enforce
        severity: critical
      - type: context-relevance
        name: context-is-relevant
        threshold: 0.75
        weight: 0.3
        mode: enforce
        severity: warning
      - type: rag-source-attribution
        name: cites-sources
        threshold: 0.8
        weight: 0.2
        mode: enforce
        severity: warning
      - type: rag-document-exfiltration
        name: no-document-leak
        threshold: 1.0
        weight: 0.1
        mode: enforce
        severity: critical
    thresholds:
      min_aggregate: 0.8
      min_faithfulness: 0.85
      min_relevancy: 0.75
    failure_action:
      action: fallback
      fallback_message: I could not generate a sufficiently grounded response. Please rephrase your question.
pack:
  name: quality-scorer-example-3
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer
```
3. LLM-Graded Rubric with External Provider
Use GPT-4o as a judge to grade responses against domain-specific rubrics.
```yaml
pack:
  name: quality-scorer-example-4
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer
policy:
  quality-scorer:
    providers:
      - id: gpt4o-judge
        label: GPT-4o Quality Judge
        provider: openai
        model: gpt-4o
        secret_key_ref:
          env: OPENAI_JUDGE_API_KEY
        config:
          temperature: 0.0
          max_tokens: 1024
    assertions:
      - type: llm-rubric
        name: legal-accuracy
        threshold: 0.9
        weight: 0.4
        mode: enforce
        severity: critical
        config:
          rubric: |
            Evaluate the legal accuracy of this response:
            1. Are legal citations correct and verifiable?
            2. Is the legal reasoning logically sound?
            3. Are relevant statutes and case law referenced?
            Score from 0 to 1.
      - type: factuality
        name: fact-check
        threshold: 0.85
        weight: 0.3
        mode: enforce
        severity: critical
        config:
          reference_statement: The response must cite only verifiable legal authorities and must not invent statutes or case law.
      - type: answer-relevance
        name: stays-on-topic
        threshold: 0.8
        weight: 0.2
        mode: enforce
        severity: warning
        config: {}
      - type: moderation
        name: content-safety
        threshold: 1.0
        weight: 0.1
        mode: enforce
        severity: critical
        config: {}
    pass_policy:
      strategy: weighted_average
      threshold: 0.85
```
4. Industry-Specific Profiles
Define reusable threshold profiles and activate them by setting industry.
```yaml
pack:
  name: quality-scorer-example-5
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer
policy:
  quality-scorer:
    industry: healthcare
    providers:
      - id: clinical-judge
        label: Clinical Quality Judge
        provider: openai
        model: gpt-4o
        secret_key_ref:
          env: OPENAI_API_KEY
        config:
          temperature: 0.0
    benchmarks:
      ragas_faithfulness: true
      coherence: true
      completeness: true
    assertions:
      - type: llm-rubric
        name: clinical-accuracy
        threshold: 0.95
        mode: enforce
        severity: critical
        config:
          rubric: Evaluate clinical accuracy. Are diagnoses evidence-based? Are drug interactions noted?
      - type: context-faithfulness
        name: evidence-grounding
        threshold: 0.9
        mode: enforce
        severity: critical
        config: {}
    industry_profiles:
      finance:
        min_aggregate: 0.85
        min_accuracy: 0.95
        min_faithfulness: 0.9
        min_relevancy: 0.8
        min_coherence: 0.75
        min_completeness: 0.8
      healthcare:
        min_aggregate: 0.9
        min_accuracy: 0.95
        min_faithfulness: 0.95
        min_relevancy: 0.85
        min_coherence: 0.8
        min_completeness: 0.85
      legal:
        min_aggregate: 0.88
        min_accuracy: 0.92
        min_faithfulness: 0.9
        min_relevancy: 0.85
        min_coherence: 0.8
        min_completeness: 0.82
    failure_action:
      action: block
```
5. Assertion Pack with Mixed Deterministic and Graded
Combine structural validation with LLM-graded quality in a single assertion set.
```yaml
policy:
  quality-scorer:
    providers:
      - openai:gpt-4o
    assertions:
      - type: is-json
        name: valid-structure
        threshold: 1.0
        mode: enforce
        severity: critical
      - type: contains
        name: has-recommendation
        config:
          value: recommendation
        mode: enforce
        severity: warning
      - type: latency
        name: response-time
        config:
          maxLatency: 5000
        mode: audit
        severity: info
      - type: cost
        name: cost-cap
        config:
          maxCost: 0.05
        mode: audit
        severity: info
      - type: llm-rubric
        name: quality-rubric
        threshold: 0.8
        weight: 0.5
        mode: enforce
        severity: critical
        config:
          rubricPrompt: Is the recommendation actionable, specific, and supported by data?
      - type: semantic-similarity
        name: consistent-with-baseline
        threshold: 0.7
        weight: 0.3
        mode: enforce
        severity: warning
    pass_policy:
      strategy: quorum
      quorum: 0.8
    failure_action:
      action: retry
      max_retries: 2
pack:
  name: quality-scorer-example-6
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer
```
6. Agent Trajectory Assertions
Evaluate whether an AI agent completed its task correctly by checking tool usage and step counts.
```yaml
policy:
  quality-scorer:
    assertions:
      - type: trajectory:goal-success
        name: task-completed
        threshold: 1.0
        weight: 0.4
        mode: enforce
        severity: critical
      - type: trajectory:tool-used
        name: used-search-tool
        threshold: 1.0
        weight: 0.2
        mode: enforce
        severity: warning
        config:
          tool: web_search
      - type: trajectory:tool-sequence
        name: correct-workflow
        threshold: 1.0
        weight: 0.2
        mode: audit
        severity: info
        config:
          sequence:
            - search_database
            - analyze_results
            - generate_report
      - type: trajectory:step-count
        name: efficient-execution
        threshold: 1.0
        weight: 0.1
        mode: audit
        severity: info
        config:
          maxSteps: 10
      - type: trace-error-spans
        name: no-errors
        threshold: 1.0
        weight: 0.1
        mode: enforce
        severity: warning
        config:
          maxErrors: 0
    pass_policy:
      strategy: weighted_average
      threshold: 0.85
pack:
  name: quality-scorer-example-7
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer
```
How It Works
- Minimum length check — Before running assertions, the gateway checks `min_output_chars` and `min_sentences`. If the response is too short, it fails immediately without consuming assertion evaluation resources.
- Assertion evaluation — Each enabled assertion in the `assertions` array is evaluated in order:
  - Deterministic assertions (e.g., `is-json`, `contains`, `regex`) run locally with no external calls.
  - LLM-graded assertions (e.g., `llm-rubric`, `factuality`) send the output to the configured `providers` for scoring.
  - Benchmark assertions (e.g., `rouge`, `bleu_score`) compute standard NLP metrics against reference outputs.
  - RAG assertions evaluate context faithfulness and relevance using the retrieval context.
  - Agent trajectory assertions inspect the trace spans for goal completion, tool usage, and step counts.
- Score aggregation — Individual assertion scores are combined according to `pass_policy.strategy`:
  - `"all"` — Every assertion must meet its `threshold`. One failure fails the entire evaluation.
  - `"quorum"` — At least a `pass_policy.quorum` fraction of assertions must pass.
  - `"weighted_average"` — Scores are combined using `weights` and compared against `pass_policy.threshold`.
- Threshold comparison — The aggregate score and per-type scores are compared against `thresholds`. If an `industry` profile is active, its overrides take precedence.
- Failure handling — When the response fails quality gates, `failure_action` determines the outcome:
  - `"block"` — The response is suppressed entirely.
  - `"retry"` — The request is re-sent to the LLM up to `max_retries` times. If all retries fail, behavior falls through to block.
  - `"fallback"` — The `fallback_message` is returned instead of the AI response.
- Mode enforcement — Only assertions with `mode: "enforce"` affect the pass/fail outcome. `"audit"` assertions are scored and logged but do not block. `"shadow"` assertions run silently for observability only. See the sketch after this list.
- Mock scoring — When `mock_scoring: true`, all assertions return synthetic passing scores. This allows CI/CD pipelines to test policy configuration without real LLM calls.
- Event emission — All assertion results, aggregate scores, and failure actions are emitted as structured events to `POST /v1/events`.
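As a quick illustration of the three modes, the sketch below applies each mode to a different assertion; the assertion names and rubric text are placeholders, not part of the policy schema.

```yaml
assertions:
  - type: is-json
    name: block-on-invalid-json
    mode: enforce        # failures count toward pass/fail and can trigger failure_action
  - type: llm-rubric
    name: rubric-baseline
    mode: audit          # scored and logged, never blocks
    config:
      rubric: Is the response clear and well structured? Score from 0 to 1.
  - type: semantic-similarity
    name: shadow-similarity
    mode: shadow         # runs silently, for data collection only
```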
Combining With Other Policies
| Policy | Interaction |
|---|---|
| prompt-injection | Runs before quality scoring. Injected prompts are blocked before they reach the LLM, so quality scoring only evaluates legitimate responses. |
| pii-filter | Runs before quality scoring on the output. PII is redacted before the quality scorer evaluates the response, so assertions see the redacted version. |
| content-filter | Runs before quality scoring. Blocked content never reaches the quality scorer. |
| agent-firewall | Runs before quality scoring. The firewall controls which tools the agent can call; the quality scorer evaluates the final response. |
| rate-limiter | Independent. Rate limiting controls request throughput; quality scoring evaluates response content. Both are enforced. |
| disclaimer | Runs after quality scoring. Disclaimers are appended to the response after it passes quality gates, so they do not affect scoring. |
Best Practices
- Start with deterministic assertions (`is-json`, `contains`, `regex`) before adding LLM-graded assertions. Deterministic checks are free, fast, and reproducible — they catch format issues before expensive LLM judge calls.
- Use `mode: "audit"` for new assertions. Run in audit mode for a week to collect baseline scores before switching to `"enforce"`. This prevents unexpected blocking in production.
- Set `mock_scoring: true` in CI/CD (see the sketch after this list). This validates your policy configuration syntax without consuming LLM tokens or requiring API keys.
- Define `industry_profiles` for multi-vertical deployments. Instead of duplicating entire policy configs, use profiles to override only the thresholds that differ.
- Use the `"weighted_average"` pass policy for nuanced scoring. The `"all"` strategy is strict — one low-weight informational assertion can block a response. Weighted average lets you balance strictness with flexibility.
- Keep `max_retries` at 1–2. Each retry adds another full LLM call, and more than 2 retries suggests a fundamental model or prompt issue, not a transient quality dip.
- Separate assertion packs by concern. Use `assert-set` with `pack` to group related assertions (e.g., "format-checks", "clinical-accuracy") so they can be reused across policies.
- Use trajectory assertions for agent workflows. Traditional text-quality assertions evaluate the final response; trajectory assertions evaluate the agent's decision-making process.
- Configure a dedicated low-temperature judge provider. Set `temperature: 0.0` on your judge LLM to maximize scoring reproducibility across evaluations.
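A minimal CI sketch, referenced from the `mock_scoring` bullet above. The pack and assertion names are placeholders; with `mock_scoring: true`, assertions return synthetic passing scores, so no judge API key is needed.

```yaml
pack:
  name: quality-scorer-ci          # placeholder name
  version: 1.0.0
  enabled: true
  policies:
    chain:
      - quality-scorer
policy:
  quality-scorer:
    mock_scoring: true             # assertions return synthetic passing scores
    assertions:
      - type: is-json
        name: valid-structure
        threshold: 1.0
      - type: llm-rubric
        name: quality-rubric
        threshold: 0.8
        config:
          rubric: Is the answer accurate and complete? Score from 0 to 1.
```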
For AI systems
- Canonical terms: trusts, quality-scorer, assertions, thresholds, min_aggregate, failure_action, providers, benchmarks, industry, mock_scoring, retry, fallback_message
- Config/command names: `policy.quality-scorer`, `assertions[]`, `thresholds.min_aggregate`, `failure_action` (block/retry/fallback_message), `providers[]` (judge LLMs), `benchmarks`, `industry`
- Best next pages: Quality Assertions Configuration, Config Testing, Human Oversight
For engineers
- Prerequisites: For LLM-judged assertions, a configured judge provider. For RAG benchmarks, context documents attached to requests. Define `thresholds.min_aggregate` based on your acceptable quality floor.
- Validation: Run `kt policy test` with quality-focused test cases. Monitor aggregate scores in Events. Adjust thresholds based on false-blocking rates. Use `mock_scoring: true` for development.
- Key commands: `kt policy lint`, `kt policy test`, `kt events tail`, console Events page
For leaders
- Governance: Quality scoring defines your organization's minimum acceptable AI output bar. It prevents low-quality, hallucinated, or irrelevant responses from reaching users.
- Cost: LLM-judged assertions consume additional tokens per request (judge model cost). NLP benchmarks and deterministic assertions are near-free. Budget for judge model costs in high-volume deployments.
- Rollout: Start with deterministic assertions (format checks, keyword requirements). Add LLM-judged assertions for critical quality dimensions. Set `failure_action: retry` before `block` to maximize response delivery.
Next steps
- Quality Assertions Configuration — Full assertion type reference
- Config Testing — Test quality assertions offline
- Human Oversight — Escalate low-quality responses
- Providers Configuration — Configure judge providers