Code Sanitation: Preventing Malicious Code in AI Outputs
Generated code becomes dangerous when users or agents treat it like trusted automation. The failure is not only that a model produced something risky. The failure is that the risky output survived the governance layer and reached an execution path. Keeptrusts addresses that with the output-stage policy documented at Code Sanitizer. The page name says code sanitation for continuity, but the implemented policy_kind is code-sanitizer, and that distinction matters when you write real policy YAML.
Use this page when
- You need to reduce the risk of malicious or obviously unsafe generated code reaching users or automation.
- You want to understand the implemented code-sanitizer behavior rather than a generic secure-coding promise.
- You are pairing output filtering with agent tool governance for higher-risk workflows.
Primary audience
- Primary: Platform engineers, security engineers, and AI application owners
- Secondary: Technical leaders evaluating execution safety for AI-generated code
The problem
AI systems frequently generate shell commands, SQL, cloud CLI snippets, or infrastructure automation. In many teams, that output is reviewed by a human before anything happens. In agentic systems, however, generated code can move much faster. A model may propose a command, another component may pass it to a tool, and a human may only see the result after the fact. If the output boundary has no concrete filter, unsafe code-like content can slip into execution paths that were supposed to be governed.
The trap is assuming a tool policy alone solves this. Tool Security and Agent Firewall are important, but they govern tool requests and tool actions. They are not output filters. If the model returns dangerous shell or SQL text in a plain response, the code boundary still needs its own control.
The solution
Use Code Sanitizer as the output-stage filter for concrete dangerous patterns, and keep expectations aligned with the implementation. The built-in detector is intentionally small. It looks for patterns such as rm -rf, drop table, the cloud metadata IP 169.254.169.254, localhost curl calls, chmod 777, and recursive S3 copy commands. It can also compile your additional_patterns and either redact or block the response when matches occur.
That makes the policy valuable for catching obvious high-risk output before it reaches downstream consumers. It does not make the model a secure-code generator, and it is not a general vulnerability scanner. Treat it as a targeted output guardrail.
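To make the match-then-redact-or-block behavior concrete, here is a minimal Python sketch of a pattern-based output filter. It is not the Keeptrusts implementation: the regular expressions are rough approximations of the pattern families listed above, and the names `BUILTIN_PATTERNS`, `SanitizeResult`, and `sanitize` are hypothetical, chosen only for illustration.

```python
import re
from dataclasses import dataclass

# Illustrative approximations of the pattern families named in the text.
# These regexes are assumptions, not the policy's actual expressions.
BUILTIN_PATTERNS = [
    r"\brm\s+-rf",                             # recursive force delete
    r"drop\s+table",                           # destructive SQL
    r"169\.254\.169\.254",                     # cloud metadata IP
    r"curl\s+\S*(?:localhost|127\.0\.0\.1)",   # localhost curl calls
    r"chmod\s+777",                            # world-writable permissions
    r"aws\s+s3\s+cp\s+.*--recursive",          # recursive S3 copy
]


@dataclass
class SanitizeResult:
    verdict: str   # "allowed", "redacted", or "blocked"
    content: str   # text passed downstream (empty when blocked)
    matches: list  # substrings that triggered the filter


def sanitize(text, additional_patterns=(), block_on_match=False):
    """Scan model output and either redact or block on any match."""
    compiled = [
        re.compile(p, re.IGNORECASE)
        for p in [*BUILTIN_PATTERNS, *additional_patterns]
    ]
    matches = [m.group(0) for rx in compiled for m in rx.finditer(text)]
    if not matches:
        return SanitizeResult("allowed", text, [])
    if block_on_match:
        # Blocking withholds the entire response.
        return SanitizeResult("blocked", "", matches)
    # Redaction replaces only the matched spans.
    redacted = text
    for rx in compiled:
        redacted = rx.sub("[REDACTED]", redacted)
    return SanitizeResult("redacted", redacted, matches)
```

Under these assumptions, `sanitize("Run rm -rf /tmp/build to clean up")` returns a redacted result, while the same call with `block_on_match=True` withholds the whole response. The sketch also shows why the policy is a targeted guardrail: anything outside the pattern list passes through unchanged.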
In higher-risk agentic flows, pair it with Tool Validation, Tool Security, and Agent Firewall. Those controls govern what can be invoked. code-sanitizer governs what the model is allowed to emit as final content.
Implementation
This chain pairs request hardening, tool hardening, and output sanitation so unsafe code has fewer paths to execution.
pack:
  name: code-output-guard
  version: 1.0.0
  enabled: true
policies:
  chain:
    - prompt-injection
    - tool-validation
    - tool-security
    - agent-firewall
    - code-sanitizer
    - audit-logger
  policy:
    prompt-injection:
      use_embedding: true
      detection:
        embedding_threshold: 0.78
      encoding:
        decode_base64: true
        normalize_unicode: true
        detect_homoglyphs: true
      boundaries:
        enforce_delimiters: true
        reject_fake_boundaries: true
    tool-validation:
      declared_tools:
        - web_search
        - knowledge_lookup
      allow_undeclared: false
    tool-security:
      analysis_mode: local
      blocked_patterns:
        - rm -rf
        - drop table
        - file://
      blocked_entity_types:
        - jwt
        - private_key
    agent-firewall:
      blocked_tools:
        - shell_command
        - delete_database
      max_actions_per_window: 2
      max_actions_per_session: 8
    code-sanitizer:
      enabled: true
      block_on_match: true
      additional_patterns:
        - 'kubectl\s+delete\s+namespace'
        - 'terraform\s+destroy'
    audit-logger: {}
Lint the config, then check whether flagged output was blocked or redacted during the review window (the last hour in this example):
kt policy lint --file code-output-guard.yaml
kt events tail --since 1h --verdict blocked --json
kt events tail --since 1h --verdict redacted --json
If your deployment allows generated commands to move into an automated execution path, keep block_on_match: true. If the workflow is advisory and always human-reviewed, a redaction-oriented rollout may be enough while you tune additional_patterns.
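Before rolling out custom additional_patterns, it can help to check them against sample outputs that should and should not match. The following Python sketch does that for the two patterns in the example config. It assumes Python's `re` syntax is close enough to the policy's regex engine for these simple expressions, which is not confirmed here; the sample strings and the `check_patterns` helper are hypothetical.

```python
import re

# The two additional_patterns from the example config.
ADDITIONAL_PATTERNS = [
    r"kubectl\s+delete\s+namespace",
    r"terraform\s+destroy",
]

# Hypothetical sample outputs used only to exercise the patterns.
SHOULD_MATCH = [
    "kubectl delete namespace staging",
    "kubectl  delete   namespace prod",
    "terraform destroy -auto-approve",
]
SHOULD_NOT_MATCH = [
    "kubectl get namespace staging",
    "terraform plan -out=tfplan",
    "Avoid destroying state by hand.",
]


def check_patterns(patterns, should_match, should_not_match):
    """Return a list of failures; an empty list means every sample behaved."""
    # re.compile raises re.error if a pattern is not valid regex syntax.
    compiled = [re.compile(p) for p in patterns]
    failures = []
    for sample in should_match:
        if not any(rx.search(sample) for rx in compiled):
            failures.append(f"missed: {sample!r}")
    for sample in should_not_match:
        if any(rx.search(sample) for rx in compiled):
            failures.append(f"false positive: {sample!r}")
    return failures


if __name__ == "__main__":
    for failure in check_patterns(
        ADDITIONAL_PATTERNS, SHOULD_MATCH, SHOULD_NOT_MATCH
    ):
        print(failure)
```

An empty failure list suggests the patterns are neither too narrow nor too broad for the samples you chose. Confirm the behavior against real traffic with the event queries above before relying on a pattern in a blocking rollout.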
Results and impact
The biggest benefit is that obviously dangerous output stops being a silent copy-paste hazard. Teams can still use AI for code assistance, but the most recognizable destructive patterns do not flow through untouched.
There is also an architectural benefit. Once teams separate output sanitation from tool governance, they stop overloading a single policy with impossible expectations. That usually leads to better tuning because each control has one clear job.
Key takeaways
- The implemented output policy is code-sanitizer, documented on Code Sanitizer.
- It catches a small built-in set of dangerous patterns plus tested additional_patterns; it is not a full vulnerability scanner.
- Use block_on_match: true whenever generated code may feed automation.
- Pair output sanitation with Tool Validation, Tool Security, and Agent Firewall for agentic flows.
- Output safety is a separate boundary and deserves its own control.