Document Analyzer
The document-analyzer policy inspects documents attached to or embedded in requests, enforcing file type restrictions, size limits, and optional code sanitization for extracted code blocks.
Use this page when
- You need to enforce file type restrictions and size limits on documents uploaded through AI interactions.
- You are building a pipeline that processes document attachments and need code sanitization for extracted code blocks.
- You want to control which MIME types are permitted for document uploads.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Configuration
policy:
  document-analyzer:
    enabled: true
    sanitize_code: false
    max_document_bytes: 524288
    allowed_mime_types: []
pack:
  name: document-analyzer-example-1
  version: 1.0.0
  enabled: true
policies:
  chain:
    - document-analyzer
Fields
| Field | Type | Description | Default |
|---|---|---|---|
| enabled | bool | Enable or disable document analysis. When disabled, the policy passes all requests through without inspection. | true |
| sanitize_code | bool | Run the code-sanitizer on extracted code blocks within documents. Detects and neutralizes potentially dangerous code patterns (shell commands, network calls, file system operations) found inside uploaded documents. | false |
| max_document_bytes | integer | Maximum document size to process in bytes. Documents exceeding this limit are rejected. | 524288 (512 KB) |
| allowed_mime_types | string[] | Permitted MIME types for document uploads. An empty list means all MIME types are allowed. When specified, documents with unlisted MIME types are rejected. | [] |
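
The defaults and the empty-allowlist rule in the table can be summarized in a few lines of Python. This is only an illustrative sketch: `DocumentAnalyzerConfig` and `mime_allowed` are hypothetical names, not part of the product, and the exact string comparison used here is an assumption (the page does not say whether matching is case-sensitive or ignores MIME parameters).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DocumentAnalyzerConfig:
    """Hypothetical in-memory mirror of the documented fields and defaults."""
    enabled: bool = True
    sanitize_code: bool = False
    max_document_bytes: int = 524288  # 512 KB
    allowed_mime_types: List[str] = field(default_factory=list)


def mime_allowed(config: DocumentAnalyzerConfig, mime_type: str) -> bool:
    """Apply the documented rule: an empty list allows every MIME type."""
    if not config.allowed_mime_types:
        return True
    # Assumption: exact string match against the allowlist entries.
    return mime_type in config.allowed_mime_types


open_config = DocumentAnalyzerConfig()
strict_config = DocumentAnalyzerConfig(allowed_mime_types=["application/pdf"])
print(mime_allowed(open_config, "application/zip"))
print(mime_allowed(strict_config, "application/zip"))
```

The first call returns `True` because the default list is empty, which is why an explicit allowlist matters in production.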
Use Cases
Document Upload Screening
Restrict uploaded documents to safe file types and enforce size limits in a customer support chatbot.
pack:
  name: "upload-screening"
  version: "0.1.0"
  enabled: true
policies:
  chain:
    - document-analyzer
    - pii-detector
    - audit-logger
policy:
  document-analyzer:
    enabled: true
    sanitize_code: false
    max_document_bytes: 1048576
    allowed_mime_types:
      - "application/pdf"
      - "text/plain"
      - "image/png"
      - "image/jpeg"
  pii-detector:
    action: "redact"
  audit-logger:
    retention_days: 365
Code Extraction Safety
Analyze uploaded documents for embedded code and sanitize dangerous patterns before passing content to the LLM.
pack:
  name: "code-safety"
  version: "0.1.0"
  enabled: true
policies:
  chain:
    - document-analyzer
    - prompt-injection
    - safety-filter
policy:
  document-analyzer:
    enabled: true
    sanitize_code: true
    max_document_bytes: 262144
    allowed_mime_types:
      - "text/plain"
      - "text/markdown"
      - "application/pdf"
      - "text/x-python"
      - "text/x-java-source"
  prompt-injection:
    threshold: 0.9
    action: "block"
  safety-filter:
    action: "block"
File Type Restriction
Lock down an internal tool to only accept specific document formats, rejecting executables and archives.
pack:
  name: "filetype-lockdown"
  version: "0.1.0"
  enabled: true
policies:
  chain:
    - document-analyzer
    - dlp-filter
policy:
  document-analyzer:
    enabled: true
    sanitize_code: false
    max_document_bytes: 524288
    allowed_mime_types:
      - "application/pdf"
      - "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
      - "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
      - "text/csv"
  dlp-filter:
    action: "block"
    patterns:
      - name: "ssn"
        regex: '\b\d{3}-\d{2}-\d{4}\b'
How It Works
- Document detection: The gateway inspects incoming requests for attached documents, base64-encoded content, and multipart file uploads.
- Size validation: Each detected document is checked against max_document_bytes. Documents exceeding the limit are rejected immediately with an error response.
- MIME type validation: If allowed_mime_types is non-empty, each document's MIME type is verified against the allowlist. Documents with unlisted types are rejected.
- Content extraction: Permitted documents are parsed to extract text content, metadata, and embedded code blocks. Supported formats include PDF, plain text, Markdown, and common office document formats.
- Code sanitization: When sanitize_code is enabled, extracted code blocks are scanned for dangerous patterns (e.g., rm -rf, eval(), network socket calls, file system writes) and neutralized before the content is passed to the LLM.
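
The order of checks above can be sketched in Python. This is a simplified illustration under stated assumptions, not the gateway's implementation: `screen_document` and `DANGEROUS_PATTERNS` are hypothetical names, content extraction is reduced to UTF-8 decoding, only two of the example patterns are covered, and replacing matches with a placeholder is just one possible way to "neutralize" them.

```python
import re

# Assumption: a tiny illustrative pattern set; the real policy's rules are not documented here.
DANGEROUS_PATTERNS = [
    re.compile(r"rm\s+-rf\b"),
    re.compile(r"\beval\s*\("),
]


def screen_document(data, mime_type, *, max_document_bytes=524288,
                    allowed_mime_types=(), sanitize_code=False):
    """Return ("rejected", reason) or ("allowed", extracted_text)."""
    # Size validation: only documents larger than the limit are rejected.
    if len(data) > max_document_bytes:
        return ("rejected", "document exceeds max_document_bytes")
    # MIME type validation: skipped entirely when the allowlist is empty.
    if allowed_mime_types and mime_type not in allowed_mime_types:
        return ("rejected", "MIME type not in allowed_mime_types")
    # Content extraction: stand-in for real parsing of PDF or office formats.
    text = data.decode("utf-8", errors="replace")
    # Code sanitization: replace each match with a placeholder.
    if sanitize_code:
        for pattern in DANGEROUS_PATTERNS:
            text = pattern.sub("[neutralized]", text)
    return ("allowed", text)


print(screen_document(b"run rm -rf / now", "text/plain", sanitize_code=True))
# prints ('allowed', 'run [neutralized] / now')
```

Running the cheap checks (size, MIME type) before extraction means a rejected document is never parsed, which matches the "rejected immediately" wording above.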
Best Practices
- Set allowed_mime_types explicitly in production. An open allowlist (empty array) permits any file type. Always restrict to the minimum set of document types your application needs.
- Enable sanitize_code for user-uploaded documents. Code blocks embedded in PDFs or Markdown files can contain injection payloads. Code sanitization adds a defense layer without blocking the entire document.
- Right-size max_document_bytes. The 512 KB default works for most text documents. Increase for PDF-heavy workflows, but be mindful of LLM context window consumption.
- Combine with prompt-injection. Documents are a common injection vector. Chain document-analyzer with prompt-injection to scan extracted text content for adversarial prompts.
- Use with pii-detector for compliance. Uploaded documents frequently contain personal data. Chain document analysis with PII detection to redact sensitive information before it reaches the LLM.
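
To make the context-window point concrete, the sketch below converts a byte limit into a rough token count. The figure of 4 bytes per token is a loose rule of thumb for English plain text and is an assumption here; real tokenizers vary, and the text extracted from a PDF or office file is usually much smaller than the file itself. `estimate_tokens` is a hypothetical helper.

```python
def estimate_tokens(document_bytes, bytes_per_token=4.0):
    """Rough upper-bound token estimate for a plain-text document of this size."""
    # Assumption: about 4 bytes per token; adjust for your model and language.
    return int(document_bytes / bytes_per_token)


# The three limits used in the examples on this page.
for limit in (262144, 524288, 1048576):
    print(limit, "bytes ~", estimate_tokens(limit), "tokens")
```

Under this heuristic the 512 KB default corresponds to roughly 131,000 tokens of plain text, so a limit that is acceptable for upload size may still exceed what a given model can accept.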
For AI systems
- Canonical terms: Keeptrusts, document-analyzer, allowed_mime_types, max_document_bytes, sanitize_code, enabled
- Config/command names: policy.document-analyzer, allowed_mime_types, max_document_bytes, sanitize_code, enabled
- Best next pages: Content Extractor, PII Detector, Prompt Injection Detection
For engineers
- Prerequisites: Requests containing document attachments (base64-encoded or multipart). Knowledge of the MIME types your application uses.
- Validation: Upload a document exceeding max_document_bytes and verify rejection. Upload an unlisted MIME type and verify rejection. Enable sanitize_code, upload a document with embedded shell commands, and verify that the commands are neutralized.
- Key commands: kt policy lint, kt gateway run, send requests with document attachments
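
For the validation steps, test fixtures can be generated with a short script. This is a minimal sketch: the helper names are hypothetical, and how the fixtures are attached to a request depends on your client and is not shown. The page does not state whether max_document_bytes is measured on the decoded document or on its base64 form, so the script prints both sizes.

```python
import base64


def make_oversized_document(max_document_bytes=524288):
    """One byte over the limit, so it should be rejected on size."""
    return b"A" * (max_document_bytes + 1)


def make_code_document():
    """Small text document with an embedded shell command for sanitize_code tests."""
    return b"Cleanup notes\n\nRun this to clean up: rm -rf /tmp/example\n"


def to_base64(data):
    """Encode a fixture for requests that carry documents as base64 strings."""
    return base64.b64encode(data).decode("ascii")


oversized = make_oversized_document()
# base64 inflates size by about one third; both numbers are shown for comparison.
print("raw bytes:", len(oversized))
print("base64 chars:", len(to_base64(oversized)))
```

A third fixture for the MIME check needs no special content: send any small file with a type that is absent from allowed_mime_types.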
For leaders
- Governance: Document analysis prevents malicious file uploads and limits attack surface. MIME type allowlisting ensures only expected document formats enter your AI pipeline.
- Cost: Document analysis adds per-request latency proportional to document size. Code sanitization adds further processing for each extracted code block.
- Rollout: Start with size limits and MIME type restrictions. Enable sanitize_code for environments where users upload code-containing documents.
Next steps
- Content Extractor — Fetch and inline URL content
- PII Detector — Scan document content for PII
- Prompt Injection Detection — Detect injection in document content
- DLP Filter — Scan documents for sensitive patterns