Prompt Injection Detection
The prompt-injection policy defends against jailbreak and prompt-injection attacks by combining regex pattern matching, embedding-based semantic similarity, and boundary enforcement. It inspects every user message before forwarding to the AI provider and can block, log, or escalate suspicious input.
Use this page when
- You are configuring jailbreak and prompt-injection detection for your gateway.
- You need to tune embedding thresholds, attack patterns, encoding normalization, or boundary enforcement.
- You want to understand how prompt-injection interacts with MCP tool governance in agent workflows.
Keeptrusts ships a curated set of attack patterns out of the box and supports external embedding backends for advanced semantic detection.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
MCP Bridge Boundary
When `provider: mcp` is used with the native HTTP bridge, prompt-injection defenses still apply at the gateway boundary before a `tools/call` frame is forwarded upstream. This matters for agent-style deployments, where a malicious prompt can try to coerce the model into selecting unsafe tools rather than only returning unsafe text.
- Use MCP tool allowlists and tool-security policies alongside `prompt-injection` so the gateway can stop unsafe tool invocation attempts before they leave the bridge.
- Treat `prompt-injection` as the content boundary and MCP tool governance as the action boundary. They should be deployed together for high-risk agent workflows.
MCP Bridge Governance
When the gateway handles MCP (Model Context Protocol) tool calls, it annotates each outbound JSON-RPC request with governance metadata:
- `keeptrusts_session_id` — the request-scoped session identifier
- `keeptrusts_conversation_id` — the conversation context (from the `X-Conversation-ID` header)
- `keeptrusts_agent_id` — the agent responsible for the tool call
These metadata fields enable MCP servers to participate in audit trails and allow downstream systems to correlate tool calls with the governing session context.
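As an illustration, the annotation step might look like the following sketch. The metadata field names come from the list above; their placement under the JSON-RPC `params._meta` object is an assumption for illustration, not a documented wire format.

```python
import copy

def annotate_mcp_request(frame: dict, session_id: str,
                         conversation_id: str, agent_id: str) -> dict:
    """Attach Keeptrusts governance metadata to an outbound MCP JSON-RPC frame.

    Placing the fields under params._meta is an assumption for illustration.
    """
    annotated = copy.deepcopy(frame)  # leave the original frame untouched
    meta = annotated.setdefault("params", {}).setdefault("_meta", {})
    meta["keeptrusts_session_id"] = session_id
    meta["keeptrusts_conversation_id"] = conversation_id  # from X-Conversation-ID
    meta["keeptrusts_agent_id"] = agent_id
    return annotated

frame = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
         "params": {"name": "search", "arguments": {"q": "quarterly report"}}}
annotated = annotate_mcp_request(frame, "sess-123", "conv-456", "agent-789")
```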
Configuration
```yaml
pack:
  name: "injection-protection"
  version: "0.1.0"
  enabled: true

policies:
  chain:
    - prompt-injection

policy:
  prompt-injection:
    embedding_threshold: 0.75
    backend: "local"
    endpoint: ""
    model: "text-embedding-3-small"
    api_key: "${OPENAI_API_KEY}"
    timeout_ms: 3000
    attack_patterns:
      - "ignore.*previous.*instructions"
      - "forget.*system.*prompt"
      - "you are now.*DAN"
      - "roleplay as.*unrestricted"
      - "bypass.*safety"
      - "disregard.*rules"
    encoding:
      decode_base64: true
      normalize_unicode: true
      detect_homoglyphs: true
    boundaries:
      enforce_delimiters: true
      reject_fake_boundaries: true
    response:
      action: "block"
      message: "Request blocked: potential prompt injection detected"
      log_level: "warn"
```
Fields
| Field | Type | Default | Description |
|---|---|---|---|
| `embedding_threshold` | number (0–1) | `0.75` | Cosine similarity threshold for embedding-based injection detection. Lower values are stricter (more false positives); higher values are more permissive. |
| `backend` | enum | `"local"` | Embedding backend to use. `"local"` runs pattern matching only; `"external"` queries a remote embedding API for semantic similarity scoring. |
| `endpoint` | string | `""` | URL of the external embedding backend (e.g., `https://api.openai.com/v1/embeddings`). Required when `backend` is `"external"`. |
| `model` | string | `"text-embedding-3-small"` | Embedding model name sent to the external backend. Only used when `backend` is `"external"`. |
| `api_key` | string | `""` | API key for the external embedding backend. Use environment variable references (e.g., `${OPENAI_API_KEY}`) instead of hardcoding secrets. |
| `timeout_ms` | integer (1–60000) | `3000` | Timeout in milliseconds for external embedding requests. Requests exceeding this limit fail open or closed depending on `response.action`. |
| `attack_patterns` | string[] | (see default list) | Regex patterns matched against normalized user input. Each pattern is evaluated case-insensitively. Add domain-specific patterns to extend detection. |
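To make the `embedding_threshold` semantics concrete, here is a minimal sketch of the similarity check, assuming the gateway flags a message when its cosine similarity to any known injection embedding meets or exceeds the threshold. The vectors are toy three-dimensional values; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def flags_injection(message_vec, known_injection_vecs, embedding_threshold=0.75):
    # Flag when any known-attack embedding is at least as similar as the threshold.
    return any(cosine_similarity(message_vec, v) >= embedding_threshold
               for v in known_injection_vecs)

known = [[1.0, 0.0, 0.1]]      # toy "known injection" embedding
attack_like = [0.9, 0.1, 0.2]  # close to the known vector
benign = [0.0, 1.0, 0.0]       # nearly orthogonal to it
```

Lowering the threshold (e.g., to `0.60`) flags more borderline messages, which is why the table describes lower values as stricter.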
`encoding` sub-object
Controls input normalization applied before pattern matching.
| Field | Type | Default | Description |
|---|---|---|---|
| `encoding.decode_base64` | boolean | `true` | Decode Base64-encoded segments before pattern matching. Prevents attackers from hiding payloads in encoded strings. |
| `encoding.normalize_unicode` | boolean | `true` | Normalize Unicode escapes (e.g., `\u0069\u0067\u006e\u006f\u0072\u0065` → `ignore`). Defeats Unicode obfuscation. |
| `encoding.detect_homoglyphs` | boolean | `true` | Detect homoglyph substitutions (e.g., Cyrillic `а` in place of Latin `a`). Maps visually similar characters to their ASCII equivalents. |
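The three normalizations can be sketched roughly as follows. This is a minimal illustration, not the gateway's implementation: the Base64-run heuristic and the tiny homoglyph table are assumptions, and a real deployment would use a much fuller confusables map.

```python
import base64
import re
import unicodedata

# Tiny illustrative homoglyph map (Cyrillic lookalikes); real tables are far larger.
HOMOGLYPHS = {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p", "\u0441": "c"}

def normalize(text: str) -> str:
    # decode_base64: decode plausible Base64 runs so hidden payloads become visible.
    def _try_decode(match: re.Match) -> str:
        try:
            decoded = base64.b64decode(match.group(0), validate=True).decode("utf-8")
            return decoded if decoded.isprintable() else match.group(0)
        except Exception:
            return match.group(0)  # not valid Base64/UTF-8: leave it alone
    text = re.sub(r"[A-Za-z0-9+/]{16,}={0,2}", _try_decode, text)
    # normalize_unicode: fold compatibility characters to canonical forms.
    text = unicodedata.normalize("NFKC", text)
    # detect_homoglyphs: map visually similar characters back to ASCII.
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)
```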
`boundaries` sub-object
Enforces structural separation between system prompts and user input.
| Field | Type | Default | Description |
|---|---|---|---|
| `boundaries.enforce_delimiters` | boolean | `true` | Validate that system and user message delimiters are correctly structured. Rejects messages that attempt to blur role boundaries. |
| `boundaries.reject_fake_boundaries` | boolean | `true` | Reject user messages that contain fake system prompt delimiters (e.g., `<\|system\|>`, `### System:`) designed to trick the model into treating user text as system instructions. |
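A minimal sketch of the fake-boundary check, assuming a regex scan for forged role delimiters. The exact token list the gateway uses is not documented here, so the patterns below are illustrative.

```python
import re

# Illustrative forged-delimiter patterns; a production list would be broader.
FAKE_BOUNDARY_PATTERNS = [
    r"<\|\s*system\s*\|>",      # chat-template style: <|system|>
    r"^#{1,4}\s*System\s*:",    # markdown-header style: ### System:
    r"\[INST\]",                # instruction-tag style: [INST]
]

def has_fake_boundary(user_message: str) -> bool:
    """Return True if the user message contains a forged system delimiter."""
    return any(re.search(p, user_message, re.IGNORECASE | re.MULTILINE)
               for p in FAKE_BOUNDARY_PATTERNS)
```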
`response` sub-object
Controls how the gateway responds when an injection is detected.
| Field | Type | Default | Description |
|---|---|---|---|
| `response.action` | enum | `"block"` | Action to take on detection. Currently supports `"block"`, which returns an error to the caller. |
| `response.message` | string | `"Request blocked: potential prompt injection detected"` | Message returned to the caller when the request is blocked. |
| `response.log_level` | enum | `"warn"` | Log level for injection detection events. One of `"trace"`, `"debug"`, `"info"`, `"warn"`, or `"error"`. |
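The detection-to-response handoff might look like this sketch. The `(allowed, error_message)` return shape and the fallback to `WARNING` for unmapped level names are illustrative assumptions, not the gateway's internal API.

```python
import logging

def apply_response(flagged: bool, response_cfg: dict):
    """Apply the configured response when a detection layer has flagged a message."""
    if not flagged:
        return True, None  # clean input passes through unchanged
    # Map the configured log_level name onto a logging constant; fall back to WARNING
    # for names Python's logging module doesn't define (e.g., "trace").
    level = getattr(logging, response_cfg.get("log_level", "warn").upper(),
                    logging.WARNING)
    logging.log(level, "prompt injection detected; action=%s",
                response_cfg.get("action", "block"))
    if response_cfg.get("action", "block") == "block":
        return False, response_cfg.get(
            "message", "Request blocked: potential prompt injection detected")
    return True, None
```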
Use Cases
1. Default local detection with custom attack patterns
Use local pattern matching with additional domain-specific patterns for a financial services application:
```yaml
policy:
  prompt-injection:
    backend: "local"
    attack_patterns:
      - "ignore.*previous.*instructions"
      - "forget.*system.*prompt"
      - "bypass.*safety"
      # Illustrative domain-specific patterns for a financial services app:
      - "transfer.*all.*funds"
      - "override.*transaction.*limit"

pack:
  name: prompt-injection-example-2
  version: 1.0.0
  enabled: true

policies:
  chain:
    - prompt-injection
```
2. External embedding-based detection with OpenAI
Use OpenAI's embedding API for semantic similarity scoring alongside regex patterns:
```yaml
policy:
  prompt-injection:
    embedding_threshold: 0.75
    backend: "external"
    endpoint: "https://api.openai.com/v1/embeddings"
    model: "text-embedding-3-small"
    api_key: "${OPENAI_API_KEY}"
    timeout_ms: 3000

pack:
  name: prompt-injection-example-3
  version: 1.0.0
  enabled: true

policies:
  chain:
    - prompt-injection
```
3. Full defense-in-depth with encoding normalization
Maximum protection with all encoding normalization layers enabled and a strict threshold:
```yaml
policy:
  prompt-injection:
    embedding_threshold: 0.65   # stricter than the 0.75 default
    backend: "external"
    endpoint: "https://api.openai.com/v1/embeddings"
    model: "text-embedding-3-small"
    api_key: "${OPENAI_API_KEY}"
    encoding:
      decode_base64: true
      normalize_unicode: true
      detect_homoglyphs: true
    boundaries:
      enforce_delimiters: true
      reject_fake_boundaries: true
    response:
      action: "block"
      log_level: "error"

pack:
  name: prompt-injection-example-4
  version: 1.0.0
  enabled: true

policies:
  chain:
    - prompt-injection
```
4. Minimal latency configuration
Disable external embeddings and use only key regex patterns for lowest possible latency:
```yaml
policy:
  prompt-injection:
    backend: "local"   # no external embedding calls
    attack_patterns:   # trimmed to a few high-signal patterns
      - "ignore.*previous.*instructions"
      - "you are now.*DAN"
      - "bypass.*safety"

pack:
  name: prompt-injection-example-5
  version: 1.0.0
  enabled: true

policies:
  chain:
    - prompt-injection
```
5. Combined with agent-firewall for MCP tool safety
Layer prompt injection detection before agent-firewall to protect both the LLM and any downstream tool calls:
```yaml
policies:
  chain:
    - prompt-injection
    - agent-firewall

policy:
  prompt-injection:
    embedding_threshold: 0.70
    backend: "external"
    endpoint: "https://api.openai.com/v1/embeddings"
    model: "text-embedding-3-small"
    api_key: "${OPENAI_API_KEY}"
    timeout_ms: 3000
    attack_patterns:
      - "ignore.*previous.*instructions"
      - "forget.*system.*prompt"
      - "you are now.*DAN"
      - "bypass.*safety"
      - "call.*tool.*without.*permission"
      - "execute.*shell.*command"
    encoding:
      decode_base64: true
      normalize_unicode: true
      detect_homoglyphs: true
    boundaries:
      enforce_delimiters: true
      reject_fake_boundaries: true
    response:
      action: "block"
      message: "Request blocked: potential injection targeting tool execution"
      log_level: "error"
  agent-firewall:
    allowed_tools:
      - "search"
      - "calculator"
    block_unknown: true
```
How It Works
The prompt injection policy processes each incoming user message through multiple detection layers:
1. Input normalization — The raw user message is normalized based on the `encoding` configuration. Base64 segments are decoded, Unicode escapes are resolved, and homoglyph characters are mapped to ASCII equivalents. This prevents attackers from bypassing detection through encoding tricks.
2. Boundary validation — If `boundaries.enforce_delimiters` is enabled, the policy verifies that the message structure preserves the correct system/user role separation. If `boundaries.reject_fake_boundaries` is enabled, any user message containing tokens that mimic system prompt delimiters (e.g., `<|system|>`, `### System:`, `[INST]`) is flagged immediately.
3. Pattern matching — The normalized message is tested against each regex in `attack_patterns`. Patterns are evaluated case-insensitively. If any pattern matches, the message is flagged.
4. Embedding similarity (external backend only) — When `backend` is `"external"`, the normalized message is sent to the configured embedding endpoint. The resulting vector is compared against a set of known injection vectors using cosine similarity. If any similarity score exceeds `embedding_threshold`, the message is flagged.
5. Response — If the message was flagged by any detection layer, the gateway applies the configured `response.action`. For `"block"`, the request is rejected with the configured `response.message` and the event is logged at the specified `response.log_level`. The original request is never forwarded to the AI provider.
6. Event emission — Regardless of outcome, a decision event is emitted to the Keeptrusts API containing the detection results, matched patterns, similarity scores (if applicable), and the action taken. This provides a full audit trail.
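The layered order above can be condensed into a short sketch. Normalization and embedding similarity are elided for brevity, and the return shape and delimiter list are illustrative, not the gateway's API:

```python
import re

FAKE_BOUNDARY_TOKENS = ("<|system|>", "### system:", "[inst]")  # illustrative list

def detect(message: str, attack_patterns: list[str]) -> dict:
    """Mirror the documented layer order: boundary validation, then pattern matching."""
    lowered = message.lower()
    for token in FAKE_BOUNDARY_TOKENS:
        if token in lowered:
            return {"flagged": True, "layer": "boundaries", "match": token}
    for pattern in attack_patterns:
        if re.search(pattern, message, re.IGNORECASE):  # case-insensitive, per the docs
            return {"flagged": True, "layer": "patterns", "match": pattern}
    return {"flagged": False, "layer": None, "match": None}
```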
Combining With Other Policies
Prompt injection detection is most effective as the first policy in the chain. It should run before content-filtering or redaction policies since those assume the input is non-malicious:
```yaml
policies:
  chain:
    - prompt-injection   # Block attacks first
    - pii-detector       # Redact PII from clean input
    - content-filter     # Apply content rules
    - disclaimer         # Append disclaimers
```
Common combinations:
| Combination | Purpose |
|---|---|
| `prompt-injection` → `pii-detector` | Block injection attacks, then redact PII from legitimate requests |
| `prompt-injection` → `agent-firewall` | Prevent injection-driven tool abuse in agentic workflows |
| `prompt-injection` → `content-filter` → `disclaimer` | Full input sanitization pipeline with compliance disclaimers |
| `prompt-injection` → `hipaa-phi-detector` | Healthcare environments needing both injection defense and PHI protection |
Best Practices
- Always place prompt injection first in the policy chain. Other policies assume the input is not adversarial. Running pattern detection first prevents downstream policies from processing crafted payloads.
- Start with the default patterns, then add domain-specific ones. The built-in patterns cover well-known jailbreak techniques. Extend them with patterns specific to your use case (e.g., financial commands, code execution keywords).
- Use environment variables for API keys. Never hardcode `api_key` in policy configuration files. Use `${ENV_VAR}` syntax to reference secrets from the environment.
- Tune `embedding_threshold` based on your risk tolerance. A threshold of `0.60` catches more attacks but may flag legitimate creative writing prompts. A threshold of `0.85` is more permissive but may miss novel attack variants. Start at `0.75` and adjust based on false-positive logs.
- Enable all encoding normalizations in production. Disabling `decode_base64`, `normalize_unicode`, or `detect_homoglyphs` creates blind spots that attackers can exploit. Only disable them for latency-critical paths where the risk is accepted.
- Keep `boundaries.reject_fake_boundaries` enabled. Fake boundary injection is one of the most effective attack classes. Disabling this protection is not recommended unless your application legitimately includes system-like delimiters in user messages.
- Set `response.log_level` to `"error"` in production. This ensures injection attempts generate alerts in your monitoring stack rather than being silently logged at lower severity.
- Review detection logs regularly. Injection techniques evolve rapidly. Periodically review blocked requests to identify new patterns that should be added to `attack_patterns` and to reduce false positives.
For AI systems
- Canonical terms: Keeptrusts, `prompt-injection`, `embedding_threshold`, `attack_patterns`, `encoding`, `boundaries`, `response`, `action`, `block`, `backend`, `local`, `external`, MCP bridge
- Config/command names: `policy.prompt-injection`, `embedding_threshold`, `backend` (`local`/`external`), `attack_patterns`, `encoding` (`decode_base64`, `normalize_unicode`, `detect_homoglyphs`), `boundaries` (`enforce_delimiters`, `reject_fake_boundaries`), `response.action`
- Best next pages: Safety Filter, PII Detector, External Moderation, Tool Security
For engineers
- Prerequisites: For the `external` backend, an embedding API endpoint and key; for `local`, no prerequisites. Place `prompt-injection` first in `policies.chain`.
- Validation: Test with known jailbreak phrases ("ignore previous instructions", "you are now DAN") and verify blocking. Test Base64-encoded and Unicode-obfuscated attacks. Check false-positive rates with legitimate creative prompts.
- Key commands: `kt policy lint`, `kt policy test`, `kt events tail`, `kt gateway run`
For leaders
- Governance: Prompt injection is the #1 attack vector against AI systems. This policy must be first in every chain. It protects all downstream policies from processing adversarial inputs.
- Cost: Local mode (pattern matching) is near-free. External embedding mode adds per-request API cost and ~50-200ms latency. The cost of a successful jailbreak (data exfiltration, reputation damage) far exceeds detection costs.
- Rollout: Deploy immediately with default patterns and the `local` backend. Add the `external` embedding backend for production environments where novel attack detection is critical. Tune `embedding_threshold` based on false-positive monitoring.
Next steps
- Safety Filter — Content safety after injection filtering
- PII Detector — PII detection on sanitized input
- Tool Security — Protect tool calls from injection
- External Moderation — Third-party content safety