Prompt Injection Detection

The prompt-injection policy defends against jailbreak and prompt-injection attacks by combining regex pattern matching, embedding-based semantic similarity, and boundary enforcement. It inspects every user message before forwarding to the AI provider and can block, log, or escalate suspicious input.

Use this page when

You are configuring jailbreak and prompt-injection detection for your gateway.
You need to tune embedding thresholds, attack patterns, encoding normalization, or boundary enforcement.
You want to understand how prompt-injection interacts with MCP tool governance in agent workflows.

Keeptrusts ships a curated set of attack patterns out of the box and supports external embedding backends for advanced semantic detection.

Primary audience

Primary: AI Agents, Technical Engineers
Secondary: Technical Leaders

MCP Bridge Boundary

When provider: mcp is used with the native HTTP bridge, prompt-injection defenses still apply at the gateway boundary before a tools/call frame is forwarded upstream. This matters for agent-style deployments where a malicious prompt can try to coerce the model into selecting unsafe tools rather than only returning unsafe text.

Use MCP tool allowlists and tool-security policies alongside prompt-injection so the gateway can stop unsafe tool invocation attempts before they leave the bridge.
Treat prompt-injection as the content boundary and MCP tool governance as the action boundary. They should be deployed together for high-risk agent workflows.

MCP Bridge Governance

When the gateway handles MCP (Model Context Protocol) tool calls, it annotates each outbound JSON-RPC request with governance metadata:

keeptrusts_session_id — the request-scoped session identifier
keeptrusts_conversation_id — the conversation context (from X-Conversation-ID header)
keeptrusts_agent_id — the agent responsible for the tool call

These metadata fields enable MCP servers to participate in audit trails and allow downstream systems to correlate tool calls with the governing session context.

Configuration

pack:
  name: "injection-protection"
  version: "0.1.0"
  enabled: true

policies:
  chain:
    - prompt-injection

policy:
  prompt-injection:
    embedding_threshold: 0.75
    backend: "local"
    endpoint: ""
    model: "text-embedding-3-small"
    api_key: "${OPENAI_API_KEY}"
    timeout_ms: 3000
    attack_patterns:
      - "ignore.*previous.*instructions"
      - "forget.*system.*prompt"
      - "you are now.*DAN"
      - "roleplay as.*unrestricted"
      - "bypass.*safety"
      - "disregard.*rules"
    encoding:
      decode_base64: true
      normalize_unicode: true
      detect_homoglyphs: true
    boundaries:
      enforce_delimiters: true
      reject_fake_boundaries: true
    response:
      action: "block"
      message: "Request blocked: potential prompt injection detected"
      log_level: "warn"

Fields

Field	Type	Default	Description
`embedding_threshold`	number (0–1)	`0.75`	Cosine similarity threshold for embedding-based injection detection. Lower values are stricter (more false positives); higher values are more permissive.
`backend`	enum	`"local"`	Embedding backend to use. `"local"` runs pattern matching only; `"external"` queries a remote embedding API for semantic similarity scoring.
`endpoint`	string	`""`	URL of the external embedding backend (e.g., `https://api.openai.com/v1/embeddings`). Required when `backend` is `"external"`.
`model`	string	`"text-embedding-3-small"`	Embedding model name sent to the external backend. Only used when `backend` is `"external"`.
`api_key`	string	`""`	API key for the external embedding backend. Use environment variable references (e.g., `${OPENAI_API_KEY}`) instead of hardcoding secrets.
`timeout_ms`	integer (1–60000)	`3000`	Timeout in milliseconds for external embedding requests. Requests exceeding this limit fail open or closed depending on `response.action`.
`attack_patterns`	string[]	(see default list)	Regex patterns matched against normalized user input. Each pattern is evaluated case-insensitively. Add domain-specific patterns to extend detection.

`encoding` sub-object

Controls input normalization applied before pattern matching.

Field	Type	Default	Description
`encoding.decode_base64`	boolean	`true`	Decode Base64-encoded segments before pattern matching. Prevents attackers from hiding payloads in encoded strings.
`encoding.normalize_unicode`	boolean	`true`	Normalize Unicode escapes (e.g., `\u0069\u0067\u006e\u006f\u0072\u0065` → `ignore`). Defeats Unicode obfuscation.
`encoding.detect_homoglyphs`	boolean	`true`	Detect homoglyph substitutions (e.g., Cyrillic `а` in place of Latin `a`). Maps visually similar characters to their ASCII equivalents.

`boundaries` sub-object

Enforces structural separation between system prompts and user input.

Field	Type	Default	Description
`boundaries.enforce_delimiters`	boolean	`true`	Validate that system and user message delimiters are correctly structured. Rejects messages that attempt to blur role boundaries.
`boundaries.reject_fake_boundaries`	boolean	`true`	Reject user messages that contain fake system prompt delimiters (e.g., `<\|system\|>`, `### System:`) designed to trick the model into treating user text as system instructions.

`response` sub-object

Controls how the gateway responds when an injection is detected.

Field	Type	Default	Description
`response.action`	enum	`"block"`	Action to take on detection. Currently supports `"block"` which returns an error to the caller.
`response.message`	string	`"Request blocked: potential prompt injection detected"`	Message returned to the caller when the request is blocked.
`response.log_level`	enum	`"warn"`	Log level for injection detection events. One of `"trace"`, `"debug"`, `"info"`, `"warn"`, or `"error"`.

Use Cases

1. Default local detection with custom attack patterns

Use local pattern matching with additional domain-specific patterns for a financial services application:

policy:
  prompt-injection: {}
pack:
  name: prompt-injection-example-2
  version: 1.0.0
  enabled: true
policies:
  chain:
  - prompt-injection

2. External embedding-based detection with OpenAI

Use OpenAI's embedding API for semantic similarity scoring alongside regex patterns:

policy:
  prompt-injection: {}
pack:
  name: prompt-injection-example-3
  version: 1.0.0
  enabled: true
policies:
  chain:
  - prompt-injection

3. Full defense-in-depth with encoding normalization

Maximum protection with all encoding normalization layers enabled and a strict threshold:

policy:
  prompt-injection: {}
pack:
  name: prompt-injection-example-4
  version: 1.0.0
  enabled: true
policies:
  chain:
  - prompt-injection

4. Minimal latency configuration

Disable external embeddings and use only key regex patterns for lowest possible latency:

policy:
  prompt-injection: {}
pack:
  name: prompt-injection-example-5
  version: 1.0.0
  enabled: true
policies:
  chain:
  - prompt-injection

5. Combined with agent-firewall for MCP tool safety

Layer prompt injection detection before agent-firewall to protect both the LLM and any downstream tool calls:

policies:
  chain:
    - prompt-injection
    - agent-firewall

policy:
  prompt-injection:
    embedding_threshold: 0.70
    backend: "external"
    endpoint: "https://api.openai.com/v1/embeddings"
    model: "text-embedding-3-small"
    api_key: "${OPENAI_API_KEY}"
    timeout_ms: 3000
    attack_patterns:
      - "ignore.*previous.*instructions"
      - "forget.*system.*prompt"
      - "you are now.*DAN"
      - "bypass.*safety"
      - "call.*tool.*without.*permission"
      - "execute.*shell.*command"
    encoding:
      decode_base64: true
      normalize_unicode: true
      detect_homoglyphs: true
    boundaries:
      enforce_delimiters: true
      reject_fake_boundaries: true
    response:
      action: "block"
      message: "Request blocked: potential injection targeting tool execution"
      log_level: "error"

  agent-firewall:
    allowed_tools:
      - "search"
      - "calculator"
    block_unknown: true

How It Works

The prompt injection policy processes each incoming user message through multiple detection layers:

Input normalization — The raw user message is normalized based on the encoding configuration. Base64 segments are decoded, Unicode escapes are resolved, and homoglyph characters are mapped to ASCII equivalents. This prevents attackers from bypassing detection through encoding tricks.
Boundary validation — If boundaries.enforce_delimiters is enabled, the policy verifies that the message structure preserves the correct system/user role separation. If boundaries.reject_fake_boundaries is enabled, any user message containing tokens that mimic system prompt delimiters (e.g., <|system|>, ### System:, [INST]) is flagged immediately.
Pattern matching — The normalized message is tested against each regex in attack_patterns. Patterns are evaluated case-insensitively. If any pattern matches, the message is flagged.
Embedding similarity (external backend only) — When backend is "external", the normalized message is sent to the configured embedding endpoint. The resulting vector is compared against a set of known injection vectors using cosine similarity. If any similarity score exceeds embedding_threshold, the message is flagged.
Response — If the message was flagged by any detection layer, the gateway applies the configured response.action. For "block", the request is rejected with the configured response.message and the event is logged at the specified response.log_level. The original request is never forwarded to the AI provider.
Event emission — Regardless of outcome, a decision event is emitted to the Keeptrusts API containing the detection results, matched patterns, similarity scores (if applicable), and the action taken. This provides a full audit trail.

Combining With Other Policies

Prompt injection detection is most effective as the first policy in the chain. It should run before content-filtering or redaction policies since those assume the input is non-malicious:

policies:
  chain:
    - prompt-injection      # Block attacks first
    - pii-detector          # Redact PII from clean input
    - content-filter        # Apply content rules
    - disclaimer            # Append disclaimers

Common combinations:

Combination	Purpose
`prompt-injection` → `pii-detector`	Block injection attacks, then redact PII from legitimate requests
`prompt-injection` → `agent-firewall`	Prevent injection-driven tool abuse in agentic workflows
`prompt-injection` → `content-filter` → `disclaimer`	Full input sanitization pipeline with compliance disclaimers
`prompt-injection` → `hipaa-phi-detector`	Healthcare environments needing both injection defense and PHI protection

Best Practices

Always place prompt injection first in the policy chain. Other policies assume the input is not adversarial. Running pattern detection first prevents downstream policies from processing crafted payloads.
Start with the default patterns, then add domain-specific ones. The built-in patterns cover well-known jailbreak techniques. Extend them with patterns specific to your use case (e.g., financial commands, code execution keywords).
Use environment variables for API keys. Never hardcode api_key in policy configuration files. Use ${ENV_VAR} syntax to reference secrets from the environment.
Tune embedding_threshold based on your risk tolerance. A threshold of 0.60 catches more attacks but may flag legitimate creative writing prompts. A threshold of 0.85 is more permissive but may miss novel attack variants. Start at 0.75 and adjust based on false positive logs.
Enable all encoding normalizations in production. Disabling decode_base64, normalize_unicode, or detect_homoglyphs creates blind spots that attackers can exploit. Only disable for latency-critical paths where the risk is accepted.
Keep boundaries.reject_fake_boundaries enabled. Fake boundary injection is one of the most effective attack classes. Disabling this protection is not recommended unless your application legitimately includes system-like delimiters in user messages.
Set response.log_level to "error" in production. This ensures injection attempts generate alerts in your monitoring stack rather than being silently logged at lower severity.
Review detection logs regularly. Injection techniques evolve rapidly. Periodically review blocked requests to identify new patterns that should be added to attack_patterns and to reduce false positives.

For AI systems

Canonical terms: Keeptrusts, prompt-injection, embedding_threshold, attack_patterns, encoding, boundaries, response, action, block, backend, local, external, MCP bridge
Config/command names: policy.prompt-injection, embedding_threshold, backend (local/external), attack_patterns, encoding (decode_base64, normalize_unicode, detect_homoglyphs), boundaries (enforce_delimiters, reject_fake_boundaries), response.action
Best next pages: Safety Filter, PII Detector, External Moderation, Tool Security

For engineers

Prerequisites: For external backend: an embedding API endpoint and key. For local: no prerequisites. Place prompt-injection first in policies.chain.
Validation: Test with known jailbreak phrases ("ignore previous instructions", "you are now DAN") and verify blocking. Test base64-encoded and Unicode-normalized attacks. Check false-positive rates with legitimate creative prompts.
Key commands: kt policy lint, kt policy test, kt events tail, kt gateway run

For leaders

Governance: Prompt injection is the #1 attack vector against AI systems. This policy must be first in every chain. It protects all downstream policies from processing adversarial inputs.
Cost: Local mode (pattern matching) is near-free. External embedding mode adds per-request API cost and ~50-200ms latency. The cost of a successful jailbreak (data exfiltration, reputation damage) far exceeds detection costs.
Rollout: Deploy immediately with default patterns and local backend. Add external embedding backend for production environments where novel attack detection is critical. Tune embedding_threshold based on false-positive monitoring.

Next steps

Safety Filter — Content safety after injection filtering
PII Detector — PII detection on sanitized input
Tool Security — Protect tool calls from injection
External Moderation — Third-party content safety

Use this page when​

Primary audience​

MCP Bridge Boundary​

MCP Bridge Governance​

Configuration​

Fields​

encoding sub-object​

boundaries sub-object​

response sub-object​

Use Cases​

1. Default local detection with custom attack patterns​

2. External embedding-based detection with OpenAI​

3. Full defense-in-depth with encoding normalization​

4. Minimal latency configuration​

5. Combined with agent-firewall for MCP tool safety​

How It Works​

Combining With Other Policies​

Best Practices​

For AI systems​

For engineers​

For leaders​

Next steps​