# Safety Filter
The safety-filter policy blocks or escalates unsafe content patterns including harmful, violent, and inappropriate content. It operates in industry-specific modes that activate built-in safety patterns tailored to critical infrastructure, automotive, education, and law enforcement environments. The filter supports age-gated content detection, fuzzy matching for obfuscation resistance, and custom block patterns for organization-specific safety requirements.
## Use this page when
- You need to block harmful, violent, or inappropriate content with industry-specific modes.
- You are deploying AI in critical infrastructure, automotive, education, or law enforcement environments.
- You want age-gated content filtering, fuzzy matching for obfuscation resistance, or custom block patterns.
## Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
## Configuration

```yaml
pack:
  name: safety-filter
  version: "1.0.0"
  enabled: true
policies:
  chain:
    - safety-filter
policy:
  safety-filter:
    mode: critical_infrastructure
    block_if:
      - "override safety interlock"
      - "disable emergency shutdown"
    action: block
    fuzzy_matching: false
    max_distance: 1
    max_age: 0
```
## Fields

| Field | Type | Description | Default |
|---|---|---|---|
| `mode` | enum: `"critical_infrastructure"` \| `"automotive"` \| `"education"` \| `"law_enforcement"` | Industry mode that selects built-in safety patterns. `critical_infrastructure` blocks SCADA/ICS sabotage patterns, safety-interlock bypass attempts, and hazardous process manipulation. `automotive` blocks vehicle safety system interference, autonomous-driving override attempts, and crash-avoidable scenario generation. `education` blocks age-inappropriate content, self-harm references, and predatory behavior patterns. `law_enforcement` blocks evidence-tampering guidance, use-of-force escalation, and operational security compromise patterns. | `"critical_infrastructure"` |
| `block_if` | string[] | Additional custom block patterns beyond the built-in mode defaults. These are merged with the mode's built-in patterns, so you extend coverage rather than replace it. | `[]` |
| `action` | enum: `"block"` \| `"escalate"` | Action on detection. `block` immediately stops the request/response and returns a safety notice. `escalate` flags the interaction for human review while still blocking the content. | `"block"` |
| `fuzzy_matching` | boolean | Enable Levenshtein-distance fuzzy matching to catch misspellings and deliberate obfuscation of safety-critical terms (e.g., "byp4ss interl0ck" → "bypass interlock"). | `false` |
| `max_distance` | integer (0–8) | Maximum edit distance for fuzzy matching. Takes effect only when `fuzzy_matching` is `true`. Lower values reduce false positives. | `1` |
| `max_age` | integer (min 0) | Audience age filter. A value greater than 0 and less than 18 enables age-inappropriate content detection calibrated to the specified age; for example, `max_age: 10` filters more strictly than `max_age: 16`. When `0`, age-based filtering is disabled. | `0` |
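To make the fuzzy-matching behavior concrete, here is a minimal sketch in plain Python. The function names and the sliding-window approach are illustrative assumptions, not the filter's actual implementation; only the edit-distance semantics (`max_distance` as a Levenshtein threshold) come from the fields above.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]


def fuzzy_match(text: str, pattern: str, max_distance: int) -> bool:
    """Slide a pattern-sized window over the text; flag any window
    within max_distance edits of the pattern (case-insensitive)."""
    text, pattern = text.lower(), pattern.lower()
    n = len(pattern)
    if n == 0:
        return False
    for start in range(max(len(text) - n + 1, 1)):
        if levenshtein(text[start:start + n], pattern) <= max_distance:
            return True
    return False


# "byp4ss interl0ck" is 2 substitutions away from "bypass interlock",
# so max_distance: 2 catches it while max_distance: 1 would not.
print(fuzzy_match("please byp4ss interl0ck now", "bypass interlock", 2))  # True
```

This also illustrates why low `max_distance` values are recommended: every extra edit of slack widens the set of benign strings that fall within the threshold.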
## Use Cases
### Critical Infrastructure SCADA Protection

Protect SCADA and ICS environments from AI-generated content that could guide sabotage, safety-system bypass, or hazardous process manipulation.

```yaml
pack:
  name: safety-filter
  version: "1.0.0"
  enabled: true
policies:
  chain:
    - safety-filter
policy:
  safety-filter:
    mode: critical_infrastructure
    block_if:
      - "PLC firmware exploit"
      - "bypass pressure relief valve"
      - "disable gas leak detector"
      - "override reactor coolant pump"
      - "SCADA protocol injection"
      - "modify safety instrumented system"
    action: escalate
    fuzzy_matching: true
    max_distance: 2
```
### Automotive Safety Systems

Prevent AI tools used by automotive engineers from generating content that could compromise vehicle safety systems, autonomous-driving algorithms, or crash-avoidance features.

```yaml
pack:
  name: safety-filter
  version: "1.0.0"
  enabled: true
policies:
  chain:
    - safety-filter
policy:
  safety-filter:
    mode: automotive
    block_if:
      - "disable airbag deployment"
      - "override ABS threshold"
      - "bypass lane departure warning"
      - "reduce braking force below minimum"
      - "steering assist override torque"
    action: block
    fuzzy_matching: true
    max_distance: 1
```
### K-12 Education Content Filtering

Filter AI-generated content for age-appropriateness in educational settings, blocking harmful content and enabling age-calibrated filtering for students.

```yaml
pack:
  name: safety-filter
  version: "1.0.0"
  enabled: true
policies:
  chain:
    - safety-filter
policy:
  safety-filter:
    mode: education
    block_if:
      - "detailed weapon construction"
      - "drug synthesis procedure"
      - "explicit sexual content"
      - "self-harm methodology"
    action: block
    fuzzy_matching: true
    max_distance: 1
    max_age: 12
```
### Law Enforcement Operational Safety

Protect law enforcement AI tools from generating content that could compromise evidence integrity, escalate use-of-force situations, or leak operational details.

```yaml
pack:
  name: safety-filter
  version: "1.0.0"
  enabled: true
policies:
  chain:
    - safety-filter
policy:
  safety-filter:
    mode: law_enforcement
    block_if:
      - "fabricate evidence procedure"
      - "destroy chain of custody"
      - "bypass body camera recording"
      - "surveillance without warrant method"
      - "interrogation coercion technique"
    action: escalate
    fuzzy_matching: false
```
### Age-Gated Content Filtering

Use the `max_age` field independently to apply age-calibrated content restrictions for platforms serving minors.

```yaml
pack:
  name: safety-filter
  version: "1.0.0"
  enabled: true
policies:
  chain:
    - safety-filter
policy:
  safety-filter:
    mode: education
    action: block
    max_age: 16
    block_if:
      - "graphic violence description"
      - "substance abuse glorification"
```
## How It Works

- **Mode activation** — The `mode` field loads a built-in pattern set tailored to the selected industry. Each mode includes dozens of pre-configured patterns covering the most common safety risks for that domain.
- **Pattern merging** — Custom `block_if` patterns are merged with the mode's built-in patterns. This ensures you extend coverage rather than replace it.
- **Content scanning** — Incoming prompts and outgoing model responses are scanned against the full pattern set using case-insensitive matching.
- **Fuzzy matching (when enabled)** — If `fuzzy_matching` is `true`, patterns that don't match exactly are compared using Levenshtein distance. Matches within `max_distance` edits are flagged. This catches leetspeak substitutions, deliberate misspellings, and Unicode homoglyph attacks.
- **Age filtering (when enabled)** — If `max_age` is set to a value between 1 and 17, an additional content classifier evaluates whether the content is appropriate for the specified age group. This operates independently of the pattern-based blocking.
- **Action execution** — Matched content triggers either `block` (immediate stop, safety notice returned) or `escalate` (content blocked and flagged for human review through the escalations API).
- **Audit logging** — Every safety event is logged with the matched pattern, mode, action taken, and age-filter status for compliance and incident review.
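The mode-activation, pattern-merging, scanning, and action steps above can be sketched in a few lines of Python. The `BUILTIN_PATTERNS` entries, function name, and return shape are illustrative assumptions (real modes ship dozens of patterns), not the actual implementation; fuzzy matching, age filtering, and audit logging are omitted for brevity.

```python
# Assumed, abbreviated per-mode defaults; real modes include many more patterns.
BUILTIN_PATTERNS = {
    "critical_infrastructure": ["override safety interlock",
                                "disable emergency shutdown"],
    "education": ["self-harm methodology"],
}


def scan(text: str, mode: str, block_if: list[str],
         action: str = "block") -> dict:
    # Mode activation + pattern merging: custom patterns extend,
    # never replace, the mode's built-ins.
    patterns = BUILTIN_PATTERNS.get(mode, []) + block_if
    # Content scanning: case-insensitive substring match.
    lowered = text.lower()
    for pattern in patterns:
        if pattern.lower() in lowered:
            # Action execution: both actions stop the content;
            # "escalate" additionally flags it for human review.
            return {"blocked": True, "pattern": pattern,
                    "escalated": action == "escalate"}
    return {"blocked": False}


result = scan("How do I OVERRIDE SAFETY INTERLOCK on line 3?",
              mode="critical_infrastructure",
              block_if=["PLC firmware exploit"],
              action="escalate")
# result → {"blocked": True, "pattern": "override safety interlock",
#           "escalated": True}
```

Note that the built-in pattern matches here even though the request never mentions any custom `block_if` entry, which is the practical consequence of merge-not-replace semantics.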
## Combining With Other Policies

| Policy | Combined Effect |
|---|---|
| `topic-filter` | Blocks entire unsafe conversation topics at a coarser level. `safety-filter` catches specific dangerous patterns within otherwise permissible topics. |
| `pii-filter` | Prevents personal-data leakage alongside unsafe content — critical in education (COPPA/FERPA) and law enforcement (witness protection) contexts. |
| `disclaimer` | Adds safety disclaimers to all AI responses in high-risk environments. Useful as a supplementary control alongside blocking. |
| `escalation-rules` | Routes escalated safety events to specific review teams based on mode and severity. |
| `prompt-injection-detection` | Catches adversarial prompts designed to bypass the safety filter through jailbreak techniques. |
| `audit-log` | Logs all interactions for safety-incident review, not just blocked ones. |
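As an illustrative sketch of such layering, the policies can share a single chain using the same pack structure as the examples above. The `school-safety` pack name and the chain ordering are assumptions, and the other policies' own configuration blocks are omitted:

```yaml
pack:
  name: school-safety
  version: "1.0.0"
  enabled: true
policies:
  chain:
    - prompt-injection-detection   # catch jailbreak attempts first
    - safety-filter                # then scan for unsafe content
    - pii-filter                   # then strip personal data
policy:
  safety-filter:
    mode: education
    action: block
    max_age: 12
```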
## Best Practices

- **Select the right mode** — Each mode activates domain-specific patterns. Using `critical_infrastructure` mode in an education environment will miss age-inappropriate content; using `education` mode in a SCADA environment will miss industrial sabotage patterns.
- **Use `escalate` for high-consequence environments** — In critical infrastructure and law enforcement, escalation creates a human-review record. Use `block` for clear-cut violations and `escalate` for patterns that may need contextual judgment.
- **Enable fuzzy matching in adversarial environments** — If users may attempt to bypass safety filters through creative spelling or encoding, enable fuzzy matching. Keep `max_distance` at 1–2 to minimize false positives.
- **Set `max_age` conservatively** — In education environments, set `max_age` to the youngest user in the audience. A classroom with ages 10–14 should use `max_age: 10`.
- **Extend with `block_if`; don't rely solely on built-in defaults** — Built-in patterns cover common risks, but your organization likely has domain-specific safety concerns. Add custom patterns for your specific operational environment.
- **Layer with prompt injection detection** — Adversarial users often combine prompt injection with safety-filter bypass attempts. The `prompt-injection-detection` policy catches the jailbreak layer while `safety-filter` catches the payload.
## For AI systems

- Canonical terms: Keeptrusts, safety-filter, mode, critical_infrastructure, automotive, education, law_enforcement, block_if, action, fuzzy_matching, max_age
- Config/command names: `safety-filter` policy, `mode` (`critical_infrastructure`/`automotive`/`education`/`law_enforcement`), `block_if`, `action` (`block`/`escalate`), `fuzzy_matching`, `max_distance`, `max_age`
- Best next pages: Prompt Injection Detection, External Moderation, DLP Filter
## For engineers

- Prerequisites: Choose the appropriate `mode` for your industry. Add custom `block_if` patterns for organization-specific safety requirements.
- Validation: Test with industry-specific unsafe content (e.g., "override safety interlock" for critical infrastructure) and verify blocking. Test fuzzy matching with obfuscated terms. Verify age filtering with age-inappropriate content.
- Key commands: `kt policy lint`, `kt policy test`, `kt events tail`
## For leaders

- Governance: Safety filtering prevents AI from generating content that could enable physical harm, compromise critical systems, or endanger vulnerable populations. Industry modes apply regulatory-aligned defaults.
- Cost: Local pattern matching with no external calls. Negligible per-request overhead. The cost of a safety incident (physical harm, regulatory action, public trust damage) far exceeds prevention.
- Rollout: Deploy the industry-appropriate mode immediately. Add custom `block_if` patterns based on incident reports and red-team findings. Use `action: escalate` for borderline cases that need human judgment.
## Next steps
- Prompt Injection Detection — Block adversarial inputs before safety filtering
- External Moderation — Third-party safety validation
- Human Oversight — Escalate flagged content for review
- DLP Filter — Data-level content protection