Skip to main content
Browse docs
By Audience
Getting Started
Configuration
Use Cases
IDE Integration
Third-Party Integrations
Engineering Cache
Console
API Reference
Gateway
Workflow Guides
Templates
Providers and SDKs
Industry Guides
Advanced Guides
Browse by Role
Deployment Guides
In-Depth Guides
Tutorials
FAQ

Block Prompt Injection Attacks Before They Reach Your Models

Prompt injection is the top security risk for AI applications. Attackers embed malicious instructions in user input to hijack model behavior — extracting data, bypassing safety filters, or making models perform unauthorized actions. Keeptrusts detects and blocks these attacks at the gateway before they ever reach your models.

Use this page when

  • You need to protect AI-powered applications against adversarial prompt injection, jailbreaks, or instruction override attacks.
  • You are deploying AI agents with tool-calling capability and need to prevent unauthorized tool execution via injected instructions.
  • You want to understand the multi-layer detection approach (pattern, embedding, safety filter) before configuring your gateway.

Primary audience

  • Primary: Technical Leaders
  • Secondary: Technical Engineers, AI Agents

What you'll achieve

  • Real-time prompt injection detection using pattern matching and embedding analysis
  • Automatic blocking of jailbreak attempts, delimiter attacks, and instruction overrides
  • Agent-specific protection with tool-call validation and action limits
  • Escalation routing for borderline cases that need human review
  • Attack visibility through detailed event logging of every blocked attempt

How Keeptrusts detects prompt injection

The gateway evaluates every incoming request through multiple detection layers:

Layer 1: Pattern-based detection

The prompt-injection policy matches known attack patterns including:

  • Instruction overrides — "Ignore previous instructions and..."
  • Delimiter escapes — attempts to break out of system/user message boundaries
  • Base64/Unicode evasion — encoded payloads designed to bypass simple string matching
  • Role confusion — injecting fake system messages within user input
policies:
chain:
- prompt-injection
- audit-logger

policy:
prompt-injection:
response:
action: block
message: "Request blocked: potential prompt injection detected"
encoding:
decode_base64: true
normalize_unicode: true
boundaries:
enforce_delimiters: true

Layer 2: Embedding-based semantic detection

For sophisticated attacks that evade pattern matching, the embedding_threshold parameter uses semantic similarity to detect injection intent:

policy:
prompt-injection: {}
pack:
name: block-prompt-injection-example-2
version: 1.0.0
enabled: true
policies:
chain:
- prompt-injection

The embedding detector compares the semantic intent of user input against known injection patterns. Inputs that exceed the similarity threshold are blocked regardless of surface-level text.

Layer 3: Safety filters

The safety-filter policy catches broader categories of unsafe content including content that may not be injection but is still inappropriate:

policies:
chain:
- prompt-injection
- safety-filter
- audit-logger

policy:
prompt-injection:
embedding_threshold: 0.8
response:
action: block
safety-filter:
categories:
- violence
- hate_speech
- self_harm
- sexual_content
action: block

Protecting AI agents

AI agents that can call tools present a larger attack surface. An injected prompt could instruct the agent to call execute_code, shell_command, or send_email with attacker-controlled parameters.

The agent-firewall policy addresses this:

policies:
chain:
- prompt-injection
- agent-firewall
- audit-logger

policy:
prompt-injection:
embedding_threshold: 0.8
response:
action: block
agent-firewall:
allowed_tools:
- search
- summarize
- retrieve_document
blocked_tools:
- execute_code
- shell_command
- send_email
- delete_record
max_actions_per_session: 25

This configuration:

  • Allowlists safe tools — only search, summarize, and retrieve_document are permitted
  • Blocklists dangerous tools — code execution, shell access, and email are explicitly denied
  • Limits action count — prevents runaway agent loops from executing more than 25 actions

See Agent Firewall Template for advanced configurations.


Escalation routing for borderline cases

Not every suspicious input should be blocked outright. For borderline cases, route to human review:

policies:
chain:
- prompt-injection
- human-oversight
- audit-logger

policy:
prompt-injection:
embedding_threshold: 0.6
response:
action: escalate
human-oversight:
escalate_on:
- prompt_injection_detected
require_resolution_within_hours: 4

With action: escalate, the request is flagged in the Escalations queue for a reviewer to inspect, claim, and resolve.


Defense in depth: combining all layers

A production-grade prompt injection defense uses all layers together:

pack:
name: prompt-injection-defense
version: "1.0"

policies:
chain:
- prompt-injection
- safety-filter
- agent-firewall
- pii-detector
- audit-logger

policy:
prompt-injection:
embedding_threshold: 0.8
response:
action: block
message: "Request blocked: security policy violation"
encoding:
decode_base64: true
normalize_unicode: true
boundaries:
enforce_delimiters: true

safety-filter:
categories:
- violence
- hate_speech
- self_harm
action: block

agent-firewall:
allowed_tools:
- search
- summarize
blocked_tools:
- execute_code
- shell_command
max_actions_per_session: 50

pii-detector:
action: redact
redaction:
marker_format: label

audit-logger:
retention_days: 365

Monitoring attack attempts

After deploying prompt injection defenses:

  1. Check the Events page — filter by policy_type=prompt-injection to see blocked attempts
  2. Review escalations — inspect borderline cases in the escalation queue
  3. Export attack data — use kt export create --filter "policy_type=prompt-injection" for security analysis
  4. Track block rate trends — a sudden spike may indicate a targeted attack

Quick wins

  1. Deploy prompt-injection with action: block — immediate protection, zero application changes
  2. Enable decode_base64 and normalize_unicode — catch evasion attempts
  3. Add agent-firewall for any AI agent — restrict tool access to what's actually needed
  4. Set up safety-filter — catch harmful content that isn't technically injection
  5. Review blocked events weekly — tune thresholds based on false positive rates

For AI systems

  • Canonical terms: prompt-injection policy, embedding_threshold, safety-filter, agent-firewall, delimiter enforcement, escalation.
  • Config keys: policy.prompt-injection.response.action, policy.prompt-injection.embedding_threshold, policy.prompt-injection.encoding, policy.prompt-injection.boundaries.
  • CLI commands: kt gateway run, kt events list --filter "prompt_injection".
  • Best next pages: Govern AI Agents, Agent Firewall Template, Policy Controls Catalog.

For engineers

  • Prerequisites: gateway running, prompt-injection policy in the chain.
  • Add embedding_threshold: 0.8 to catch semantic injection attacks that evade pattern rules.
  • Enable encoding.decode_base64 and encoding.normalize_unicode to prevent evasion via encoding.
  • Validate: send a known injection payload (e.g., “Ignore previous instructions”) and confirm a 409 block response.
  • Monitor: filter Events by policy_type=prompt-injection and action block to track blocked attempts.

For leaders

  • Prompt injection is the #1 AI security risk (OWASP LLM Top 10); this control mitigates it at the infrastructure layer.
  • Blocking happens before the request reaches the provider — zero exposure of your models to adversarial input.
  • Escalation routing lets borderline cases reach human reviewers, reducing false-positive disruption.
  • Every blocked attempt is logged for security audit and incident response evidence.

Next steps