Adversarial Inputs: Governance for Intentionally Malformed AI Requests

Not every dangerous AI request looks dangerous when you first read it. Attackers encode payloads, swap characters with homoglyphs, insert fake role boundaries, and split malicious intent across multiple turns so simple pattern checks miss the real content. Governance for adversarial input therefore starts with normalization and boundary enforcement at the gateway. Keeptrusts provides those controls in policy, which is the right place to make them testable and consistent.

Use this page when

You are seeing malformed, encoded, or boundary-confusing inputs in AI traffic.
You need to govern evasion techniques such as Base64 wrapping, Unicode tricks, and fake system markers.
You want a policy chain that normalizes hostile input before deciding whether to block it.

Primary audience

Primary: Technical Engineers
Secondary: Technical Leaders, AI Agents

The problem

Malicious AI input is often designed to defeat the exact kind of control teams deploy first: a literal text match against a short list of suspicious strings.

Base64 wrapping is a common example. The input looks like an opaque blob, but once decoded it contains a direct instruction override or a request to reveal the system prompt. Unicode normalization attacks are similar. Characters that appear harmless or visually equivalent can be used to bypass naive string checks. Homoglyph substitution makes this worse by allowing characters from different scripts to impersonate trusted tokens.

Another category is fake boundary injection. Attackers insert content that looks like a system or developer role marker inside user-controlled text. If the downstream application or operator treats those markers as meaningful, the attacker has already shaped the control surface. Even if the application is correct, the model itself may still respond differently when boundary-like text is embedded in the prompt.

Then there are multi-turn attacks. The first prompt looks harmless and establishes a frame. The next one introduces the real override. The third makes a seemingly minor change that turns earlier content into a policy bypass. If your governance view is only a per-message keyword scan, you miss the attack chain.

These cases are why adversarial-input governance cannot be reduced to a single list of bad words. The system must normalize what the attacker sent, evaluate the normalized content, and then combine that result with explicit DLP and safety policy so the request fails closed when the input is malicious or clearly unsafe.

The solution

Keeptrusts already exposes the right mechanics in Prompt Injection Detection. The policy can decode Base64-like content, normalize Unicode, fold homoglyphs, detect fake boundaries, enforce delimiter checks, and optionally force embedding-based similarity evaluation. That makes it the core control for intentionally malformed requests.

You should then chain DLP Filter behind it, because decoded content may expose patterns that were invisible in the raw input. A hidden key, an internal codename, or a sealed case reference becomes obvious only after normalization. Finally, use Safety Filter for dangerous requests whose main signal is unsafe intent rather than classic boundary manipulation.

This same layering appears in Block Prompt Injection Attacks Before They Reach Your Models, Prevent Sensitive Data Leaks in AI Requests, and Implement Zero-Trust AI with Defense-in-Depth Policies. The governing principle is simple: normalize first, evaluate second, route or block based on explicit policy, not hopeful prompt engineering.

Implementation

Use a config that turns the normalization features on by default for hostile traffic lanes:

pack:
  name: malformed-input-guard
  version: "1.0.0"
  enabled: true

policies:
  chain:
    - prompt-injection
    - dlp-filter
    - safety-filter

policy:
  prompt-injection:
    use_embedding: true
    detection:
      embedding_threshold: 0.8
      attack_patterns:
        - "ignore.*previous.*instructions"
        - "forget.*system.*prompt"
        - "reveal.*system.*prompt"
    encoding:
      decode_base64: true
      normalize_unicode: true
      detect_homoglyphs: true
    boundaries:
      enforce_delimiters: true
      reject_fake_boundaries: true

  dlp-filter:
    detect_patterns:
      - 'AKIA[0-9A-Z]{16}'
      - 'ghp_[0-9A-Za-z]{36}'
    blocked_terms:
      - hidden system prompt
      - sandbox escape
    action: block
    fuzzy_matching: true
    max_distance: 1
    sensitivity_level: high

  safety-filter:
    mode: critical_infrastructure
    block_if:
      - "disable emergency shutdown"
      - "override safety interlock"
    action: block
    fuzzy_matching: true
    max_distance: 1
    max_age: 0

The point of this config is not that every malformed prompt matches a single regex. The point is that the gateway normalizes the content before the decisive checks happen. decode_base64, normalize_unicode, and detect_homoglyphs matter because attackers use those transformations specifically to reduce the usefulness of literal inspection.

dlp-filter follows because normalization can reveal sensitive patterns that were hidden in the raw message. safety-filter follows because some adversarial inputs are really an evasion wrapper around obviously harmful asks. You still want a separate control that says the content itself is unacceptable.

Validate and monitor the lane with concrete commands:

kt policy lint --file malformed-input-guard.yaml
kt gateway run --listen 0.0.0.0:41002 --policy-config malformed-input-guard.yaml
kt events tail --json --limit 20 --event-type decision --verdict blocked

If you are tuning this lane, feed it known encoded attacks, homoglyph variants, and fake system markers rather than only plain English override strings. The purpose of the configuration is to catch what the attacker meant, not only what they typed in the most obvious form.

Results and impact

The immediate benefit of this approach is fewer blind spots. Teams stop assuming that malformed input is someone else's parsing problem and start treating it as a security decision at the gateway. That reduces the chance that encoded or role-confusing content slips through because the application handled it inconsistently.

The second benefit is cleaner policy reasoning. When normalization happens inside the same gateway control path as prompt detection, DLP, and safety enforcement, you do not need separate ad hoc filters scattered across multiple services. One path evaluates the request, one event record explains the decision, and one policy file expresses the intended defense.

The third benefit is resilience during model changes. Providers and model families vary in how they interpret malformed prompt structure. A gateway normalization layer gives you a stable defense regardless of which upstream model is currently serving the workload.

Key takeaways

Intentionally malformed AI requests are usually evasion attempts, not harmless formatting noise.
Prompt Injection Detection gives Keeptrusts the normalization and boundary checks needed to govern those requests.
DLP Filter should run after normalization so hidden secrets and restricted terms are still caught.
Safety Filter adds a separate decision point for clearly dangerous intent.
The related operating guidance remains Block Prompt Injection Attacks Before They Reach Your Models, Prevent Sensitive Data Leaks in AI Requests, and Implement Zero-Trust AI with Defense-in-Depth Policies.

Adversarial Inputs: Governance for Intentionally Malformed AI Requests

Use this page when​

Primary audience​

The problem​

The solution​

Implementation​

Results and impact​

Key takeaways​

Next steps​