Prompt Injection Defense: Detection, Blocking, and Encoding Normalization
Prompt injection defense in Keeptrusts means putting prompt-injection at the start of the request chain, keeping the decoding and normalization checks enabled, defining attack patterns under detection.attack_patterns, and accepting the current runtime behavior: when the policy detects an attack, it blocks the request before any provider call is made.
Use this page when
- You need an implementation-accurate view of how Keeptrusts handles prompt-injection today.
- You want to understand why normalization and fake-boundary checks matter as much as plain regex matching.
- You are deciding how to combine prompt-injection defense with DLP, routing, or human review.
Primary audience
- Primary: Technical Engineers and Security reviewers
- Secondary: Technical Leaders and AI platform owners
What this policy protects
The prompt-injection policy protects the input boundary. That sounds simple, but the important detail is where the defense lives: before the upstream model call.
That means the gateway gets a chance to inspect the request while it is still only input text and conversation history, not an already-completed provider response.
The current runtime looks at several categories of signal:
- Plain regex attack patterns such as instructions to ignore previous rules or reveal the system prompt.
- Encoded variants, including Base64-like segments.
- Unicode-normalized variants.
- Homoglyph variants that try to disguise suspicious text with look-alike characters.
- Fake system-boundary tokens and delimiter confusion.
- Multi-turn attack chains.
- Optional embedding similarity when you want a semantic confirmation path or always-on embedding detection.
This is why prompt injection cannot be treated as a single regex list. If you only look at raw text, you miss the encoded and boundary-confusion forms that real attacks use.
The configuration shape that matters
One detail trips teams up repeatedly: attack patterns belong under detection.attack_patterns, not as a top-level field.
The rest of the shape matters too. Encoding and boundary checks are first-class configuration blocks because they control how the gateway derives candidate text before matching. That is the difference between catching ignore previous instructions in plain English and missing the same payload because it arrived in a slightly transformed form.
Example: local detector with normalization enabled
pack:
name: prompt-injection-defense
version: 1.0.0
enabled: true
policies:
chain:
- prompt-injection
- dlp-filter
- audit-logger
policy:
prompt-injection:
use_embedding: false
detection:
attack_patterns:
- "ignore.*previous.*instructions"
- "forget.*system.*prompt"
- "reveal.*system.*prompt"
encoding:
decode_base64: true
normalize_unicode: true
detect_homoglyphs: true
boundaries:
enforce_delimiters: true
reject_fake_boundaries: true
dlp-filter:
blocked_terms:
- do not distribute
action: block
audit-logger:
retention_days: 365
This example is intentionally local-only. It is the cleanest place to start because it gives you the core boundary defense without introducing external embedding dependencies.
The chain order is deliberate. prompt-injection runs first because it is meant to stop the request before other policies spend time on an obviously adversarial turn. dlp-filter comes later because it is a content control, not the primary attack detector.
Why normalization is part of the defense, not a convenience feature
The most practical controls in this policy are often the least dramatic-sounding ones.
decode_base64 matters because some attackers will hide instructions in Base64-like segments and rely on downstream systems or human reviewers not decoding them mentally.
normalize_unicode matters because visually similar or escaped text can bypass naive pattern matching.
detect_homoglyphs matters because a request that looks harmless at a glance may still encode the same instruction when look-alike characters are folded to a common form.
reject_fake_boundaries and enforce_delimiters matter because prompt injection often tries to impersonate trusted structure instead of only using explicit jailbreak language. Tokens that pretend to be system boundaries are dangerous precisely because they attack the role boundary rather than the topic.
Those are not optional niceties. They are how the gateway turns a shallow text scan into a more realistic attack detector.
Blocking is the current behavior
The current runtime does not treat prompt injection as a soft warning. When it detects a positive signal, it blocks.
That matters operationally. If you were expecting warn, rewrite, or a custom end-user response message, that is not how this policy works today. The documented behavior is a block with reason_code = "prompt_injection.detected" or the data-poisoning variant when the anomaly path triggers.
This strict behavior is why prompt injection should be one of the first policies you author and test. It sits at a hard decision point in the request path.
It also means provider failover is not a substitute. If prompt injection detection blocks the request, multi-provider resilience never enters the picture because the request is intentionally prevented from reaching any provider.
When to add embeddings
The docs support optional embedding detection for cases where plain patterns and boundary checks are not enough. That is useful when you expect novel paraphrases rather than repeated jailbreak strings.
But there is a practical sequence here.
Start by getting the local detector, normalization, and boundary logic working. Those controls are cheap, deterministic, and easy to reason about.
Then add embedding detection if your threat model justifies the extra cost or latency. The important part is not to skip the local defenses just because semantic detection sounds more advanced. Embeddings are a supplement, not a replacement for clear input-boundary controls.
How to validate this policy
Validate prompt injection with attacks that exercise more than one code path.
kt policy lint --file policy-config.yaml
After linting, test at least these cases against the running gateway:
- A plain-language jailbreak such as an instruction to ignore previous instructions.
- A Base64-encoded variant of the same instruction.
- A fake boundary token such as a string that pretends to be a system-role delimiter.
- A multi-turn attack that begins as a benign request and then pivots.
- A normal request that should be allowed so you can watch for false positives.
That last case is important. Strong boundary defense is useful only if the normal traffic lane remains credible.
Where this fits with other controls
Prompt injection is one layer, not the whole safety model.
DLP Filter still matters because an allowed request can contain custom restricted phrases even when it is not a jailbreak.
Data Routing Policy still matters because a safe request may still need to be restricted to zero-retention or no-training providers.
Human Oversight still matters on high-risk output routes where human review is required even after the input boundary is clean.
That is the right composition model: attack defense at the boundary, data controls before routing, and output governance after the provider returns.
Key takeaways
- Put
prompt-injectionfirst in the chain so attacks are stopped before upstream calls. - Define patterns under
detection.attack_patterns; do not flatten them into the wrong schema shape. - Keep Base64 decoding, Unicode normalization, homoglyph folding, and fake-boundary checks enabled in production.
- Expect a block on detection. The current runtime does not treat prompt injection as a warning-only control.
- Combine prompt-injection defense with DLP, routing, and output-review policies rather than treating it as a complete governance stack by itself.