Skip to main content

Prompt Injection Attacks: Anatomy, Detection Methods, and Gateway Defense

Prompt Injection Attacks: Anatomy, Detection Methods, and Gateway Defense

Prompt injection is not a mysterious model failure. It is an input-layer attack that tries to rewrite instructions, blur role boundaries, expose hidden context, or coerce a model into unsafe behavior. If you wait until after the upstream call to respond, you have already lost the main security boundary. Keeptrusts addresses prompt injection where it belongs: at the gateway, before the model sees the request.

Use this page when

  • You need a concrete explanation of how prompt injection attacks are structured in real traffic.
  • You are choosing detection methods and need to know what should run at the request boundary.
  • You want a policy chain that blocks prompt injection attempts and limits collateral damage if a request is still risky.

Primary audience

  • Primary: Technical Engineers
  • Secondary: Technical Leaders, AI Agents

The problem

Most teams talk about prompt injection as if it were a single string such as "ignore previous instructions." In practice, attacks are more varied and more operational than that.

The first class of attacks is direct instruction override. The attacker tells the model to forget the system prompt, reveal hidden rules, act as an unrestricted persona, or prioritize the user over the application. These are the easiest attacks to recognize and the easiest ones to underestimate. If your gateway does not stop them reliably, the rest of the chain is already under stress.

The second class is boundary confusion. Attackers know that LLM applications rely on message roles and separators. They therefore inject fake system tags, delimiter tricks, or quoted blocks that look like trusted instructions. A raw string match is often too weak here because the malicious content may be wrapped in symbols, nested inside markdown, or blended into long inputs designed to look harmless.

The third class is obfuscation. Real attacks are frequently encoded, normalized strangely, or padded with noise. Base64 fragments, Unicode escape sequences, homoglyph substitutions, and multi-turn priming all exist to defeat brittle filters. If your security control evaluates only the exact bytes that arrived over the wire, it is easy to miss the actual meaning of the prompt.

The fourth class is objective-driven injection. The attacker is not trying to win an abstract red-team contest. They want something concrete: the system prompt, hidden tool rules, a secret already present in context, a reply that violates policy, or a response that helps them exploit another system. That is why prompt injection is not only a model alignment issue. It is also a data protection and abuse prevention issue.

This is where gateway design matters. A model cannot reliably distinguish trusted control text from attacker-controlled instructions in every case. The application should not delegate that problem to the model itself. The gateway has to normalize the request, inspect it, and make a hard decision before the request crosses the provider boundary.

The solution

Keeptrusts treats prompt injection as a request-boundary control, not a post-processing cleanup task. The primary policy is Prompt Injection Detection, which joins the request messages, normalizes content, evaluates regex attack patterns, checks fake boundaries and delimiter confusion, and can optionally force embedding-based similarity detection on every request.

That first decision should not stand alone. A practical chain adds DLP Filter so the same request cannot also leak hard-block material such as API keys, private keys, or restricted project terms. It adds Safety Filter so requests that are dangerous even without classic injection patterns can still be blocked or escalated. The operational framing behind that chain is already captured in Block Prompt Injection Attacks Before They Reach Your Models, Prevent Sensitive Data Leaks in AI Requests, and Implement Zero-Trust AI with Defense-in-Depth Policies.

The important design choice is ordering. prompt-injection goes first because adversarial requests should fail before you spend time on later stages. dlp-filter follows because secrets and internal terms are hard-block material. safety-filter then acts as a second containment layer for requests that may not be classic injection but are still clearly unsafe.

Implementation

This is a defensible baseline for a gateway lane exposed to untrusted user input:

pack:
name: prompt-injection-defense
version: "1.0.0"
enabled: true

policies:
chain:
- prompt-injection
- dlp-filter
- safety-filter

policy:
prompt-injection:
use_embedding: true
detection:
embedding_threshold: 0.8
attack_patterns:
- "ignore.*previous.*instructions"
- "forget.*system.*prompt"
- "reveal.*system.*prompt"
encoding:
decode_base64: true
normalize_unicode: true
detect_homoglyphs: true
boundaries:
enforce_delimiters: true
reject_fake_boundaries: true

dlp-filter:
detect_patterns:
- 'AKIA[0-9A-Z]{16}'
- 'ghp_[0-9A-Za-z]{36}'
- '-----BEGIN (RSA |EC )?PRIVATE KEY-----'
blocked_terms:
- Project Atlas
- hidden runbook
action: block
fuzzy_matching: true
max_distance: 1
sensitivity_level: restricted

safety-filter:
mode: law_enforcement
block_if:
- "doxx this employee"
- "target this executive"
- "impersonate the help desk"
action: block
fuzzy_matching: true
max_distance: 1
max_age: 0

This example stays within documented behavior. prompt-injection uses normalization, fake-boundary checks, and optional embedding detection. dlp-filter enforces only the patterns and terms you actually configure, which is exactly how the policy works today. safety-filter uses a custom keyword set because its runtime is keyword-based, not classifier-based.

Validate the config before rollout and then watch blocked events instead of assuming the control is active:

kt policy lint --file prompt-injection-defense.yaml
kt gateway run --listen 0.0.0.0:41002 --policy-config prompt-injection-defense.yaml
kt events tail --json --limit 20 --event-type decision --verdict blocked

That last command matters because prompt injection tuning is not a one-time project. You need to confirm that the gateway is catching real attack strings, encoded variants, and fake-boundary attempts without producing unacceptable noise on legitimate traffic. When you do see false positives, fix the local pattern or term set that caused them. Do not weaken the whole boundary because a single regex needs refinement.

Results and impact

Teams that implement prompt injection defense at the gateway usually see two immediate improvements. First, they stop depending on the model to protect itself from user-supplied instructions. Second, they get far cleaner incident evidence because every blocked request is logged as a governance decision instead of becoming an unexplained bad model response later.

The bigger improvement is architectural discipline. Once prompt injection is treated as a request-boundary problem, the rest of the security model becomes easier to reason about. DLP stops secret leakage. Safety controls stop obviously harmful requests. Zero-trust routing and session controls remain separate concerns. That separation is useful because it makes failure modes easier to diagnose and policy changes easier to test.

Key takeaways

Next steps