Block Prompt Injection Attacks Before They Reach Your Models

Prompt injection is the top security risk for AI applications. Attackers embed malicious instructions in user input to hijack model behavior — extracting data, bypassing safety filters, or making models perform unauthorized actions. Keeptrusts detects and blocks these attacks at the gateway before they ever reach your models.

Use this page when

You need to protect AI-powered applications against adversarial prompt injection, jailbreaks, or instruction override attacks.
You are deploying AI agents with tool-calling capability and need to prevent unauthorized tool execution via injected instructions.
You want to understand the multi-layer detection approach (pattern, embedding, safety filter) before configuring your gateway.

Primary audience

Primary: Technical Leaders
Secondary: Technical Engineers, AI Agents

What you'll achieve

Real-time prompt injection detection using pattern matching and embedding analysis
Automatic blocking of jailbreak attempts, delimiter attacks, and instruction overrides
Agent-specific protection with tool-call validation and action limits
Escalation routing for borderline cases that need human review
Attack visibility through detailed event logging of every blocked attempt

How Keeptrusts detects prompt injection

The gateway evaluates every incoming request through multiple detection layers:

Layer 1: Pattern-based detection

The prompt-injection policy matches known attack patterns including:

Instruction overrides — "Ignore previous instructions and..."
Delimiter escapes — attempts to break out of system/user message boundaries
Base64/Unicode evasion — encoded payloads designed to bypass simple string matching
Role confusion — injecting fake system messages within user input

policies:
  chain:
    - prompt-injection
    - audit-logger

policy:
  prompt-injection:
    response:
      action: block
      message: "Request blocked: potential prompt injection detected"
    encoding:
      decode_base64: true
      normalize_unicode: true
    boundaries:
      enforce_delimiters: true

Layer 2: Embedding-based semantic detection

For sophisticated attacks that evade pattern matching, the embedding_threshold parameter uses semantic similarity to detect injection intent:

policy:
  prompt-injection: {}
pack:
  name: block-prompt-injection-example-2
  version: 1.0.0
  enabled: true
policies:
  chain:
  - prompt-injection

The embedding detector compares the semantic intent of user input against known injection patterns. Inputs that exceed the similarity threshold are blocked regardless of surface-level text.

Layer 3: Safety filters

The safety-filter policy catches broader categories of unsafe content including content that may not be injection but is still inappropriate:

policies:
  chain:
    - prompt-injection
    - safety-filter
    - audit-logger

policy:
  prompt-injection:
    embedding_threshold: 0.8
    response:
      action: block
  safety-filter:
    categories:
      - violence
      - hate_speech
      - self_harm
      - sexual_content
    action: block

Protecting AI agents

AI agents that can call tools present a larger attack surface. An injected prompt could instruct the agent to call execute_code, shell_command, or send_email with attacker-controlled parameters.

The agent-firewall policy addresses this:

policies:
  chain:
    - prompt-injection
    - agent-firewall
    - audit-logger

policy:
  prompt-injection:
    embedding_threshold: 0.8
    response:
      action: block
  agent-firewall:
    allowed_tools:
      - search
      - summarize
      - retrieve_document
    blocked_tools:
      - execute_code
      - shell_command
      - send_email
      - delete_record
    max_actions_per_session: 25

This configuration:

Allowlists safe tools — only search, summarize, and retrieve_document are permitted
Blocklists dangerous tools — code execution, shell access, and email are explicitly denied
Limits action count — prevents runaway agent loops from executing more than 25 actions

See Agent Firewall Template for advanced configurations.

Escalation routing for borderline cases

Not every suspicious input should be blocked outright. For borderline cases, route to human review:

policies:
  chain:
    - prompt-injection
    - human-oversight
    - audit-logger

policy:
  prompt-injection:
    embedding_threshold: 0.6
    response:
      action: escalate
  human-oversight:
    escalate_on:
      - prompt_injection_detected
    require_resolution_within_hours: 4

With action: escalate, the request is flagged in the Escalations queue for a reviewer to inspect, claim, and resolve.

Defense in depth: combining all layers

A production-grade prompt injection defense uses all layers together:

pack:
  name: prompt-injection-defense
  version: "1.0"

policies:
  chain:
    - prompt-injection
    - safety-filter
    - agent-firewall
    - pii-detector
    - audit-logger

policy:
  prompt-injection:
    embedding_threshold: 0.8
    response:
      action: block
      message: "Request blocked: security policy violation"
    encoding:
      decode_base64: true
      normalize_unicode: true
    boundaries:
      enforce_delimiters: true

  safety-filter:
    categories:
      - violence
      - hate_speech
      - self_harm
    action: block

  agent-firewall:
    allowed_tools:
      - search
      - summarize
    blocked_tools:
      - execute_code
      - shell_command
    max_actions_per_session: 50

  pii-detector:
    action: redact
    redaction:
      marker_format: label

  audit-logger:
    retention_days: 365

Monitoring attack attempts

After deploying prompt injection defenses:

Check the Events page — filter by policy_type=prompt-injection to see blocked attempts
Review escalations — inspect borderline cases in the escalation queue
Export attack data — use kt export create --filter "policy_type=prompt-injection" for security analysis
Track block rate trends — a sudden spike may indicate a targeted attack

Quick wins

Deploy prompt-injection with action: block — immediate protection, zero application changes
Enable decode_base64 and normalize_unicode — catch evasion attempts
Add agent-firewall for any AI agent — restrict tool access to what's actually needed
Set up safety-filter — catch harmful content that isn't technically injection
Review blocked events weekly — tune thresholds based on false positive rates

For AI systems

Canonical terms: prompt-injection policy, embedding_threshold, safety-filter, agent-firewall, delimiter enforcement, escalation.
Config keys: policy.prompt-injection.response.action, policy.prompt-injection.embedding_threshold, policy.prompt-injection.encoding, policy.prompt-injection.boundaries.
CLI commands: kt gateway run, kt events list --filter "prompt_injection".
Best next pages: Govern AI Agents, Agent Firewall Template, Policy Controls Catalog.

For engineers

Prerequisites: gateway running, prompt-injection policy in the chain.
Add embedding_threshold: 0.8 to catch semantic injection attacks that evade pattern rules.
Enable encoding.decode_base64 and encoding.normalize_unicode to prevent evasion via encoding.
Validate: send a known injection payload (e.g., “Ignore previous instructions”) and confirm a 409 block response.
Monitor: filter Events by policy_type=prompt-injection and action block to track blocked attempts.

For leaders

Prompt injection is the #1 AI security risk (OWASP LLM Top 10); this control mitigates it at the infrastructure layer.
Blocking happens before the request reaches the provider — zero exposure of your models to adversarial input.
Escalation routing lets borderline cases reach human reviewers, reducing false-positive disruption.
Every blocked attempt is logged for security audit and incident response evidence.

Next steps

Govern AI Agents — comprehensive agent governance controls
Prevent Sensitive Data Leaks — stop data exfiltration via injection
Prompt Injection Template — ready-to-deploy template
Agent Firewall Template — agent-specific protection template
Policy Controls Catalog — full control inventory

Use this page when​

Primary audience​

What you'll achieve​

How Keeptrusts detects prompt injection​

Layer 1: Pattern-based detection​

Layer 2: Embedding-based semantic detection​

Layer 3: Safety filters​

Protecting AI agents​

Escalation routing for borderline cases​

Defense in depth: combining all layers​

Monitoring attack attempts​

Quick wins​

For AI systems​

For engineers​

For leaders​

Next steps​