Code Sanitation: Preventing Malicious Code in AI Outputs

Generated code becomes dangerous when users or agents treat it like trusted automation. The failure is not only that a model produced something risky. The failure is that the risky output survived the governance layer and reached an execution path. Keeptrusts addresses that with the output-stage policy documented at Code Sanitizer. The page name says code sanitation for continuity, but the implemented policy_kind is code-sanitizer, and that distinction matters when you write real policy YAML.

Use this page when

You need to reduce the risk of malicious or obviously unsafe generated code reaching users or automation.
You want to understand the implemented code-sanitizer behavior rather than a generic secure-coding promise.
You are pairing output filtering with agent tool governance for higher-risk workflows.

Primary audience

Primary: Platform engineers, security engineers, and AI application owners
Secondary: Technical leaders evaluating execution safety for AI-generated code

The problem

AI systems frequently generate shell commands, SQL, cloud CLI snippets, or infrastructure automation. In many teams, that output is reviewed by a human before anything happens. In agentic systems, however, generated code can move much faster. A model may propose a command, another component may pass it to a tool, and a human may only see the result after the fact. If the output boundary has no concrete filter, unsafe code-like content can slip into execution paths that were supposed to be governed.

The trap is assuming a tool policy alone solves this. Tool Security and Agent Firewall are important, but they govern tool requests and tool actions. They are not output filters. If the model returns dangerous shell or SQL text in a plain response, the output boundary still needs its own control.

The solution

Use Code Sanitizer as the output-stage filter for concrete dangerous patterns, and keep expectations aligned with the implementation. The built-in detector is intentionally small. It looks for patterns such as rm -rf, drop table, the cloud metadata IP 169.254.169.254, localhost curl calls, chmod 777, and recursive S3 copy commands. It can also compile your additional_patterns and either redact or block the response when matches occur.

That makes the policy valuable for catching obvious high-risk output before it reaches downstream consumers. It does not make the model a secure-code generator, and it is not a general vulnerability scanner. Treat it as a targeted output guardrail.
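The detect-then-redact-or-block behavior described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the Keeptrusts implementation: the names BUILTIN_PATTERNS, Verdict, and sanitize are hypothetical, and the regular expressions only approximate the built-in patterns listed on this page.

```python
import re
from dataclasses import dataclass

# Hypothetical approximations of the built-in patterns named on this page.
# The real policy's expressions may differ.
BUILTIN_PATTERNS = [
    r"rm\s+-rf",
    r"drop\s+table",
    r"169\.254\.169\.254",
    r"curl\b[^\n]*localhost",
    r"chmod\s+777",
    r"aws\s+s3\s+cp\b[^\n]*--recursive",
]


@dataclass
class Verdict:
    action: str    # "allowed", "redacted", or "blocked"
    content: str   # what a downstream consumer would receive
    matches: list  # pattern strings that matched


def sanitize(text, additional_patterns=(), block_on_match=True):
    """Check model output against built-in and additional patterns."""
    compiled = [
        re.compile(p, re.IGNORECASE)
        for p in [*BUILTIN_PATTERNS, *additional_patterns]
    ]
    matches = [c.pattern for c in compiled if c.search(text)]
    if not matches:
        return Verdict("allowed", text, [])
    if block_on_match:
        # Blocking withholds the whole response.
        return Verdict("blocked", "", matches)
    # Redaction replaces only the matched spans.
    redacted = text
    for c in compiled:
        redacted = c.sub("[REDACTED]", redacted)
    return Verdict("redacted", redacted, matches)
```

In this sketch, a response containing rm -rf is withheld entirely when block_on_match is true, while the redaction path keeps the surrounding text and replaces only the matched span. Anything that does not match a pattern passes through unchanged, which is why a filter of this kind is a targeted guardrail rather than a scanner.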

In higher-risk agentic flows, pair it with Tool Validation, Tool Security, and Agent Firewall. Those controls govern what can be invoked; the code-sanitizer policy governs what the model is allowed to emit as final content.

Implementation

This chain combines request hardening, tool hardening, and output sanitation so unsafe code has fewer paths to execution.

pack:
  name: code-output-guard
  version: 1.0.0
  enabled: true

policies:
  chain:
    - prompt-injection
    - tool-validation
    - tool-security
    - agent-firewall
    - code-sanitizer
    - audit-logger

policy:
  prompt-injection:
    use_embedding: true
    detection:
      embedding_threshold: 0.78
    encoding:
      decode_base64: true
      normalize_unicode: true
      detect_homoglyphs: true
    boundaries:
      enforce_delimiters: true
      reject_fake_boundaries: true

  tool-validation:
    declared_tools:
      - web_search
      - knowledge_lookup
    allow_undeclared: false

  tool-security:
    analysis_mode: local
    blocked_patterns:
      - rm -rf
      - drop table
      - file://
    blocked_entity_types:
      - jwt
      - private_key

  agent-firewall:
    blocked_tools:
      - shell_command
      - delete_database
    max_actions_per_window: 2
    max_actions_per_session: 8

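  code-sanitizer:
    enabled: true
    # Block the response when any pattern matches
    block_on_match: true
    # Extra regexes checked in addition to the built-in patterns
    additional_patterns:
      - 'kubectl\s+delete\s+namespace'
      - 'terraform\s+destroy'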

  audit-logger: {}

Lint the config, then review recent events to confirm that flagged output is being blocked or redacted as expected:

kt policy lint --file code-output-guard.yaml
kt events tail --since 1h --verdict blocked --json
kt events tail --since 1h --verdict redacted --json

If your deployment allows generated commands to move into an automated execution path, keep block_on_match: true. If the workflow is advisory and always human-reviewed, a redaction-oriented rollout may be enough while you tune additional_patterns.
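While tuning additional_patterns, it can help to check each expression against known-good and known-bad samples before rollout. The sketch below does that with Python's re module. It rests on assumptions: the sample strings and the check_patterns helper are hypothetical, and the regex engine used by the policy may differ from Python's in edge cases, so treat a passing run as a first check rather than proof.

```python
import re

# The two expressions from the code-sanitizer block above.
ADDITIONAL_PATTERNS = [
    r"kubectl\s+delete\s+namespace",
    r"terraform\s+destroy",
]

# Hypothetical samples: output that should be flagged...
SHOULD_MATCH = [
    "kubectl delete namespace payments",
    "kubectl  delete\tnamespace staging",
    "terraform destroy -auto-approve",
]

# ...and output that should pass through untouched.
SHOULD_NOT_MATCH = [
    "kubectl get namespace payments",
    "terraform plan -destroy",
    "Avoid deleting namespaces without review.",
]


def check_patterns(patterns, positives, negatives):
    """Return (missed positives, false hits) for a set of regex patterns."""
    compiled = [re.compile(p) for p in patterns]  # raises re.error if invalid

    def hit(sample):
        return any(c.search(sample) for c in compiled)

    missed = [s for s in positives if not hit(s)]
    false_hits = [s for s in negatives if hit(s)]
    return missed, false_hits


missed, false_hits = check_patterns(
    ADDITIONAL_PATTERNS, SHOULD_MATCH, SHOULD_NOT_MATCH
)
print("missed:", missed)
print("false hits:", false_hits)
```

Two empty lists mean every dangerous sample was caught and no benign sample was flagged. A non-empty false hits list is the signal to tighten a pattern before moving from redaction to block_on_match: true.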

Results and impact

The biggest benefit is that obviously dangerous output stops being a silent copy-paste hazard. Teams can still use AI for code assistance, but the most recognizable destructive patterns do not flow through untouched.

There is also an architectural benefit. Once teams separate output sanitation from tool governance, they stop overloading a single policy with impossible expectations. That usually leads to better tuning because each control has one clear job.

Key takeaways

The implemented output policy is code-sanitizer, documented on Code Sanitizer.
It catches a small built-in set of dangerous patterns plus any additional_patterns you add and test; it is not a full vulnerability scanner.
Use block_on_match: true whenever generated code may feed automation.
Pair output sanitation with Tool Validation, Tool Security, and Agent Firewall for agentic flows.
Output safety is a separate boundary and deserves its own control.
