Skip to main content
Browse docs

Prompt Injection Detection

The prompt-injection policy defends against jailbreak and prompt-injection attacks by combining regex pattern matching, embedding-based semantic similarity, and boundary enforcement. It inspects every user message before forwarding to the AI provider and can block, log, or escalate suspicious input.

Use this page when

  • You are configuring jailbreak and prompt-injection detection for your gateway.
  • You need to tune embedding thresholds, attack patterns, encoding normalization, or boundary enforcement.
  • You want to understand how prompt-injection interacts with MCP tool governance in agent workflows.

Keeptrusts ships a curated set of attack patterns out of the box and supports external embedding backends for advanced semantic detection.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

MCP Bridge Boundary

When provider: mcp is used with the native HTTP bridge, prompt-injection defenses still apply at the gateway boundary before a tools/call frame is forwarded upstream. This matters for agent-style deployments where a malicious prompt can try to coerce the model into selecting unsafe tools rather than only returning unsafe text.

  • Use MCP tool allowlists and tool-security policies alongside prompt-injection so the gateway can stop unsafe tool invocation attempts before they leave the bridge.
  • Treat prompt-injection as the content boundary and MCP tool governance as the action boundary. They should be deployed together for high-risk agent workflows.

MCP Bridge Governance

When the gateway handles MCP (Model Context Protocol) tool calls, it annotates each outbound JSON-RPC request with governance metadata:

  • keeptrusts_session_id — the request-scoped session identifier
  • keeptrusts_conversation_id — the conversation context (from X-Conversation-ID header)
  • keeptrusts_agent_id — the agent responsible for the tool call

These metadata fields enable MCP servers to participate in audit trails and allow downstream systems to correlate tool calls with the governing session context.

Configuration

pack:
name: "injection-protection"
version: "0.1.0"
enabled: true

policies:
chain:
- prompt-injection

policy:
prompt-injection:
embedding_threshold: 0.75
backend: "local"
endpoint: ""
model: "text-embedding-3-small"
api_key: "${OPENAI_API_KEY}"
timeout_ms: 3000
attack_patterns:
- "ignore.*previous.*instructions"
- "forget.*system.*prompt"
- "you are now.*DAN"
- "roleplay as.*unrestricted"
- "bypass.*safety"
- "disregard.*rules"
encoding:
decode_base64: true
normalize_unicode: true
detect_homoglyphs: true
boundaries:
enforce_delimiters: true
reject_fake_boundaries: true
response:
action: "block"
message: "Request blocked: potential prompt injection detected"
log_level: "warn"

Fields

FieldTypeDefaultDescription
embedding_thresholdnumber (0–1)0.75Cosine similarity threshold for embedding-based injection detection. Lower values are stricter (more false positives); higher values are more permissive.
backendenum"local"Embedding backend to use. "local" runs pattern matching only; "external" queries a remote embedding API for semantic similarity scoring.
endpointstring""URL of the external embedding backend (e.g., https://api.openai.com/v1/embeddings). Required when backend is "external".
modelstring"text-embedding-3-small"Embedding model name sent to the external backend. Only used when backend is "external".
api_keystring""API key for the external embedding backend. Use environment variable references (e.g., ${OPENAI_API_KEY}) instead of hardcoding secrets.
timeout_msinteger (1–60000)3000Timeout in milliseconds for external embedding requests. Requests exceeding this limit fail open or closed depending on response.action.
attack_patternsstring[](see default list)Regex patterns matched against normalized user input. Each pattern is evaluated case-insensitively. Add domain-specific patterns to extend detection.

encoding sub-object

Controls input normalization applied before pattern matching.

FieldTypeDefaultDescription
encoding.decode_base64booleantrueDecode Base64-encoded segments before pattern matching. Prevents attackers from hiding payloads in encoded strings.
encoding.normalize_unicodebooleantrueNormalize Unicode escapes (e.g., \u0069\u0067\u006e\u006f\u0072\u0065ignore). Defeats Unicode obfuscation.
encoding.detect_homoglyphsbooleantrueDetect homoglyph substitutions (e.g., Cyrillic а in place of Latin a). Maps visually similar characters to their ASCII equivalents.

boundaries sub-object

Enforces structural separation between system prompts and user input.

FieldTypeDefaultDescription
boundaries.enforce_delimitersbooleantrueValidate that system and user message delimiters are correctly structured. Rejects messages that attempt to blur role boundaries.
boundaries.reject_fake_boundariesbooleantrueReject user messages that contain fake system prompt delimiters (e.g., <|system|>, ### System:) designed to trick the model into treating user text as system instructions.

response sub-object

Controls how the gateway responds when an injection is detected.

FieldTypeDefaultDescription
response.actionenum"block"Action to take on detection. Currently supports "block" which returns an error to the caller.
response.messagestring"Request blocked: potential prompt injection detected"Message returned to the caller when the request is blocked.
response.log_levelenum"warn"Log level for injection detection events. One of "trace", "debug", "info", "warn", or "error".

Use Cases

1. Default local detection with custom attack patterns

Use local pattern matching with additional domain-specific patterns for a financial services application:

policy:
prompt-injection: {}
pack:
name: prompt-injection-example-2
version: 1.0.0
enabled: true
policies:
chain:
- prompt-injection

2. External embedding-based detection with OpenAI

Use OpenAI's embedding API for semantic similarity scoring alongside regex patterns:

policy:
prompt-injection: {}
pack:
name: prompt-injection-example-3
version: 1.0.0
enabled: true
policies:
chain:
- prompt-injection

3. Full defense-in-depth with encoding normalization

Maximum protection with all encoding normalization layers enabled and a strict threshold:

policy:
prompt-injection: {}
pack:
name: prompt-injection-example-4
version: 1.0.0
enabled: true
policies:
chain:
- prompt-injection

4. Minimal latency configuration

Disable external embeddings and use only key regex patterns for lowest possible latency:

policy:
prompt-injection: {}
pack:
name: prompt-injection-example-5
version: 1.0.0
enabled: true
policies:
chain:
- prompt-injection

5. Combined with agent-firewall for MCP tool safety

Layer prompt injection detection before agent-firewall to protect both the LLM and any downstream tool calls:

policies:
chain:
- prompt-injection
- agent-firewall

policy:
prompt-injection:
embedding_threshold: 0.70
backend: "external"
endpoint: "https://api.openai.com/v1/embeddings"
model: "text-embedding-3-small"
api_key: "${OPENAI_API_KEY}"
timeout_ms: 3000
attack_patterns:
- "ignore.*previous.*instructions"
- "forget.*system.*prompt"
- "you are now.*DAN"
- "bypass.*safety"
- "call.*tool.*without.*permission"
- "execute.*shell.*command"
encoding:
decode_base64: true
normalize_unicode: true
detect_homoglyphs: true
boundaries:
enforce_delimiters: true
reject_fake_boundaries: true
response:
action: "block"
message: "Request blocked: potential injection targeting tool execution"
log_level: "error"

agent-firewall:
allowed_tools:
- "search"
- "calculator"
block_unknown: true

How It Works

The prompt injection policy processes each incoming user message through multiple detection layers:

  1. Input normalization — The raw user message is normalized based on the encoding configuration. Base64 segments are decoded, Unicode escapes are resolved, and homoglyph characters are mapped to ASCII equivalents. This prevents attackers from bypassing detection through encoding tricks.

  2. Boundary validation — If boundaries.enforce_delimiters is enabled, the policy verifies that the message structure preserves the correct system/user role separation. If boundaries.reject_fake_boundaries is enabled, any user message containing tokens that mimic system prompt delimiters (e.g., <|system|>, ### System:, [INST]) is flagged immediately.

  3. Pattern matching — The normalized message is tested against each regex in attack_patterns. Patterns are evaluated case-insensitively. If any pattern matches, the message is flagged.

  4. Embedding similarity (external backend only) — When backend is "external", the normalized message is sent to the configured embedding endpoint. The resulting vector is compared against a set of known injection vectors using cosine similarity. If any similarity score exceeds embedding_threshold, the message is flagged.

  5. Response — If the message was flagged by any detection layer, the gateway applies the configured response.action. For "block", the request is rejected with the configured response.message and the event is logged at the specified response.log_level. The original request is never forwarded to the AI provider.

  6. Event emission — Regardless of outcome, a decision event is emitted to the Keeptrusts API containing the detection results, matched patterns, similarity scores (if applicable), and the action taken. This provides a full audit trail.

Combining With Other Policies

Prompt injection detection is most effective as the first policy in the chain. It should run before content-filtering or redaction policies since those assume the input is non-malicious:

policies:
chain:
- prompt-injection # Block attacks first
- pii-detector # Redact PII from clean input
- content-filter # Apply content rules
- disclaimer # Append disclaimers

Common combinations:

CombinationPurpose
prompt-injectionpii-detectorBlock injection attacks, then redact PII from legitimate requests
prompt-injectionagent-firewallPrevent injection-driven tool abuse in agentic workflows
prompt-injectioncontent-filterdisclaimerFull input sanitization pipeline with compliance disclaimers
prompt-injectionhipaa-phi-detectorHealthcare environments needing both injection defense and PHI protection

Best Practices

  • Always place prompt injection first in the policy chain. Other policies assume the input is not adversarial. Running pattern detection first prevents downstream policies from processing crafted payloads.
  • Start with the default patterns, then add domain-specific ones. The built-in patterns cover well-known jailbreak techniques. Extend them with patterns specific to your use case (e.g., financial commands, code execution keywords).
  • Use environment variables for API keys. Never hardcode api_key in policy configuration files. Use ${ENV_VAR} syntax to reference secrets from the environment.
  • Tune embedding_threshold based on your risk tolerance. A threshold of 0.60 catches more attacks but may flag legitimate creative writing prompts. A threshold of 0.85 is more permissive but may miss novel attack variants. Start at 0.75 and adjust based on false positive logs.
  • Enable all encoding normalizations in production. Disabling decode_base64, normalize_unicode, or detect_homoglyphs creates blind spots that attackers can exploit. Only disable for latency-critical paths where the risk is accepted.
  • Keep boundaries.reject_fake_boundaries enabled. Fake boundary injection is one of the most effective attack classes. Disabling this protection is not recommended unless your application legitimately includes system-like delimiters in user messages.
  • Set response.log_level to "error" in production. This ensures injection attempts generate alerts in your monitoring stack rather than being silently logged at lower severity.
  • Review detection logs regularly. Injection techniques evolve rapidly. Periodically review blocked requests to identify new patterns that should be added to attack_patterns and to reduce false positives.

For AI systems

  • Canonical terms: Keeptrusts, prompt-injection, embedding_threshold, attack_patterns, encoding, boundaries, response, action, block, backend, local, external, MCP bridge
  • Config/command names: policy.prompt-injection, embedding_threshold, backend (local/external), attack_patterns, encoding (decode_base64, normalize_unicode, detect_homoglyphs), boundaries (enforce_delimiters, reject_fake_boundaries), response.action
  • Best next pages: Safety Filter, PII Detector, External Moderation, Tool Security

For engineers

  • Prerequisites: For external backend: an embedding API endpoint and key. For local: no prerequisites. Place prompt-injection first in policies.chain.
  • Validation: Test with known jailbreak phrases ("ignore previous instructions", "you are now DAN") and verify blocking. Test base64-encoded and Unicode-normalized attacks. Check false-positive rates with legitimate creative prompts.
  • Key commands: kt policy lint, kt policy test, kt events tail, kt gateway run

For leaders

  • Governance: Prompt injection is the #1 attack vector against AI systems. This policy must be first in every chain. It protects all downstream policies from processing adversarial inputs.
  • Cost: Local mode (pattern matching) is near-free. External embedding mode adds per-request API cost and ~50-200ms latency. The cost of a successful jailbreak (data exfiltration, reputation damage) far exceeds detection costs.
  • Rollout: Deploy immediately with default patterns and local backend. Add external embedding backend for production environments where novel attack detection is critical. Tune embedding_threshold based on false-positive monitoring.

Next steps