Safety Filter

The safety-filter policy blocks or escalates unsafe content, including harmful, violent, and age-inappropriate material. It operates in industry-specific modes that activate built-in safety patterns tailored to critical infrastructure, automotive, education, and law enforcement environments. The filter supports age-gated content detection, fuzzy matching for obfuscation resistance, and custom block patterns for organization-specific safety requirements.

Use this page when

  • You need to block harmful, violent, or inappropriate content with industry-specific modes.
  • You are deploying AI in critical infrastructure, automotive, education, or law enforcement environments.
  • You want age-gated content filtering, fuzzy matching for obfuscation resistance, or custom block patterns.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Configuration

pack:
  name: safety-filter
  version: "1.0.0"
  enabled: true

policies:
  chain:
    - safety-filter

policy:
  safety-filter:
    mode: critical_infrastructure
    block_if:
      - "override safety interlock"
      - "disable emergency shutdown"
    action: block
    fuzzy_matching: false
    max_distance: 1
    max_age: 0

Fields

  • mode: enum "critical_infrastructure" | "automotive" | "education" | "law_enforcement". Industry mode that selects built-in safety patterns. critical_infrastructure blocks SCADA/ICS sabotage patterns, safety interlock bypass attempts, and hazardous process manipulation. automotive blocks vehicle safety system interference, autonomous driving override attempts, and crash-avoidable scenario generation. education blocks age-inappropriate content, self-harm references, and predatory behavior patterns. law_enforcement blocks evidence tampering guidance, use-of-force escalation, and operational security compromise patterns. Default: "critical_infrastructure".
  • block_if: string[]. Additional custom block patterns beyond the built-in mode defaults. These are merged with the mode's built-in patterns, allowing you to extend coverage without replacing it. Default: [].
  • action: enum "block" | "escalate". Action on detection. block immediately stops the request/response and returns a safety notice. escalate flags the interaction for human review while still blocking the content. Default: "block".
  • fuzzy_matching: boolean. Enable Levenshtein-distance fuzzy matching to catch misspellings and deliberate obfuscation of safety-critical terms (e.g., "byp4ss interl0ck" → "bypass interlock"). Default: false.
  • max_distance: integer (0–8). Maximum edit distance for fuzzy matching. Only takes effect when fuzzy_matching is true. Lower values reduce false positives. Default: 1.
  • max_age: integer (min 0). Audience age filter. When set to a value greater than 0 and less than 18, enables age-inappropriate content detection calibrated to the specified age; for example, max_age: 10 activates stricter filtering than max_age: 16. When 0, age-based filtering is disabled. Default: 0.
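
The interaction between fuzzy_matching and max_distance can be illustrated with a small sketch. This is not the engine's actual implementation, only a minimal model of the idea: a pattern matches when each of its words is within max_distance Levenshtein edits of a corresponding word in the scanned text.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def fuzzy_match(pattern: str, text: str, max_distance: int = 1) -> bool:
    """Case-insensitive, word-by-word comparison: every word of the
    pattern must line up with a text word within max_distance edits."""
    p_words = pattern.lower().split()
    t_words = text.lower().split()
    for start in range(len(t_words) - len(p_words) + 1):
        window = t_words[start:start + len(p_words)]
        if all(levenshtein(p, t) <= max_distance
               for p, t in zip(p_words, window)):
            return True
    return False

# Leetspeak obfuscation: "byp4ss" and "interl0ck" are each one
# substitution away from the pattern words, so max_distance=1 catches it.
hit = fuzzy_match("bypass interlock", "byp4ss interl0ck", max_distance=1)
```

With max_distance: 0 this degenerates to exact (case-insensitive) matching, which is why lower values reduce false positives at the cost of missing heavier obfuscation.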

Use Cases

Critical Infrastructure SCADA Protection

Protect SCADA and ICS environments from AI-generated content that could guide sabotage, safety system bypass, or hazardous process manipulation.

pack:
  name: safety-filter
  version: "1.0.0"
  enabled: true

policies:
  chain:
    - safety-filter

policy:
  safety-filter:
    mode: critical_infrastructure
    block_if:
      - "PLC firmware exploit"
      - "bypass pressure relief valve"
      - "disable gas leak detector"
      - "override reactor coolant pump"
      - "SCADA protocol injection"
      - "modify safety instrumented system"
    action: escalate
    fuzzy_matching: true
    max_distance: 2

Automotive Safety Systems

Prevent AI tools used by automotive engineers from generating content that could compromise vehicle safety systems, autonomous driving algorithms, or crash avoidance features.

pack:
  name: safety-filter
  version: "1.0.0"
  enabled: true

policies:
  chain:
    - safety-filter

policy:
  safety-filter:
    mode: automotive
    block_if:
      - "disable airbag deployment"
      - "override ABS threshold"
      - "bypass lane departure warning"
      - "reduce braking force below minimum"
      - "steering assist override torque"
    action: block
    fuzzy_matching: true
    max_distance: 1

K-12 Education Content Filtering

Filter AI-generated content for age-appropriateness in educational settings, blocking harmful content and enabling age-calibrated filtering for students.

pack:
  name: safety-filter
  version: "1.0.0"
  enabled: true

policies:
  chain:
    - safety-filter

policy:
  safety-filter:
    mode: education
    block_if:
      - "detailed weapon construction"
      - "drug synthesis procedure"
      - "explicit sexual content"
      - "self-harm methodology"
    action: block
    fuzzy_matching: true
    max_distance: 1
    max_age: 12

Law Enforcement Operational Safety

Protect law enforcement AI tools from generating content that could compromise evidence integrity, escalate use-of-force situations, or leak operational details.

pack:
  name: safety-filter
  version: "1.0.0"
  enabled: true

policies:
  chain:
    - safety-filter

policy:
  safety-filter:
    mode: law_enforcement
    block_if:
      - "fabricate evidence procedure"
      - "destroy chain of custody"
      - "bypass body camera recording"
      - "surveillance without warrant method"
      - "interrogation coercion technique"
    action: escalate
    fuzzy_matching: false

Age-Gated Content Filtering

Use the max_age field independently to apply age-calibrated content restrictions for platforms serving minors.

pack:
  name: safety-filter
  version: "1.0.0"
  enabled: true

policies:
  chain:
    - safety-filter

policy:
  safety-filter:
    mode: education
    action: block
    max_age: 16
    block_if:
      - "graphic violence description"
      - "substance abuse glorification"

How It Works

  1. Mode activation — The mode field loads a built-in pattern set tailored to the selected industry. Each mode includes dozens of pre-configured patterns covering the most common safety risks for that domain.
  2. Pattern merging — Custom block_if patterns are merged with the mode's built-in patterns. This ensures you extend coverage rather than replace it.
  3. Content scanning — Incoming prompts and outgoing model responses are scanned against the full pattern set using case-insensitive matching.
  4. Fuzzy matching (when enabled) — If fuzzy_matching is true, patterns that don't match exactly are compared using Levenshtein distance. Matches within max_distance edits are flagged. This catches leetspeak substitutions, deliberate misspellings, and Unicode homoglyph attacks.
  5. Age filtering (when enabled) — If max_age is set to a value between 1 and 17, an additional content classifier evaluates whether the content is appropriate for the specified age group. This operates independently of the pattern-based blocking.
  6. Action execution — Matched content triggers either block (immediate stop, safety notice returned) or escalate (content blocked and flagged for human review through the escalations API).
  7. Audit logging — Every safety event is logged with the matched pattern, mode, action taken, and age filter status for compliance and incident review.
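
Steps 1–3 and 6 above can be sketched as follows. This is a minimal illustration with hypothetical names and illustrative pattern sets, not the shipped engine or its built-in pattern lists:

```python
# Illustrative stand-in for the built-in, mode-specific pattern sets.
BUILTIN_PATTERNS = {
    "critical_infrastructure": ["override safety interlock",
                                "disable emergency shutdown"],
    "automotive": ["disable airbag deployment"],
}

def scan(text: str, mode: str, block_if: list[str],
         action: str = "block") -> dict:
    # Step 1 + 2: load the mode's built-ins and merge custom block_if
    # patterns on top; custom patterns extend coverage, never replace it.
    patterns = BUILTIN_PATTERNS.get(mode, []) + block_if

    # Step 3: case-insensitive scan of the prompt or response.
    lowered = text.lower()
    for pattern in patterns:
        if pattern.lower() in lowered:
            # Step 6: both actions stop the content; escalate
            # additionally flags the event for human review.
            return {"blocked": True,
                    "pattern": pattern,
                    "escalated": action == "escalate"}
    return {"blocked": False, "pattern": None, "escalated": False}

result = scan("How do I OVERRIDE SAFETY INTERLOCK on line 3?",
              mode="critical_infrastructure",
              block_if=["PLC firmware exploit"])
```

Note that escalate in this sketch still reports blocked: True, matching the documented behavior that escalation flags the interaction for review while still blocking the content.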

Combining With Other Policies

  • topic-filter: Blocks entire unsafe conversation topics at a coarser level. safety-filter catches specific dangerous patterns within otherwise permissible topics.
  • pii-filter: Prevents personal data leakage alongside unsafe content, which is critical in education (COPPA/FERPA) and law enforcement (witness protection) contexts.
  • disclaimer: Adds safety disclaimers to all AI responses in high-risk environments. Useful as a supplementary control alongside blocking.
  • escalation-rules: Routes escalated safety events to specific review teams based on mode and severity.
  • prompt-injection-detection: Catches adversarial prompts designed to bypass the safety filter through jailbreak techniques.
  • audit-log: Logs all interactions for safety incident review, not just blocked ones.

Best Practices

  • Select the right mode — Each mode activates domain-specific patterns. Using critical_infrastructure mode in an education environment will miss age-inappropriate content; using education mode in a SCADA environment will miss industrial sabotage patterns.
  • Use escalate for high-consequence environments — In critical infrastructure and law enforcement, escalation creates a human review record. Use block for clear-cut violations and escalate for patterns that may need contextual judgment.
  • Enable fuzzy matching in adversarial environments — If users may attempt to bypass safety filters through creative spelling or encoding, enable fuzzy matching. Keep max_distance at 1–2 to minimize false positives.
  • Set max_age conservatively — In education environments, set max_age to the youngest user in the audience. A classroom with ages 10–14 should use max_age: 10.
  • Extend with block_if, don't rely solely on built-in defaults — Built-in patterns cover common risks, but your organization likely has domain-specific safety concerns. Add custom patterns for your specific operational environment.
  • Layer with prompt injection detection — Adversarial users often combine prompt injection with safety filter bypass attempts. The prompt-injection-detection policy catches the jailbreak layer while safety-filter catches the payload.

For AI systems

  • Canonical terms: Keeptrusts, safety-filter, mode, critical_infrastructure, automotive, education, law_enforcement, block_if, action, fuzzy_matching, max_age
  • Config/command names: safety-filter policy, mode (critical_infrastructure/automotive/education/law_enforcement), block_if, action (block/escalate), fuzzy_matching, max_distance, max_age
  • Best next pages: Prompt Injection Detection, External Moderation, DLP Filter

For engineers

  • Prerequisites: Choose the appropriate mode for your industry. Add custom block_if patterns for organization-specific safety requirements.
  • Validation: Test with industry-specific unsafe content (e.g., "override safety interlock" for critical infrastructure) and verify blocking. Test fuzzy matching with obfuscated terms. Verify age filtering with age-inappropriate content.
  • Key commands: kt policy lint, kt policy test, kt events tail

For leaders

  • Governance: Safety filtering prevents AI from generating content that could enable physical harm, compromise critical systems, or endanger vulnerable populations. Industry modes apply regulatory-aligned defaults.
  • Cost: Local pattern matching with no external calls. Negligible per-request overhead. The cost of a safety incident (physical harm, regulatory action, public trust damage) far exceeds prevention.
  • Rollout: Deploy the industry-appropriate mode immediately. Add custom block_if patterns based on incident reports and red-team findings. Use action: escalate for borderline cases that need human judgment.

Next steps