Document Analyzer

The document-analyzer policy inspects documents attached to or embedded in requests, enforcing file type restrictions, size limits, and optional code sanitization for extracted code blocks.

Use this page when

  • You need to enforce file type restrictions and size limits on documents uploaded through AI interactions.
  • You are building a pipeline that processes document attachments and need code sanitization for extracted code blocks.
  • You want to control which MIME types are permitted for document uploads.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Configuration

policy:
  document-analyzer:
    enabled: true
    sanitize_code: false
    max_document_bytes: 524288
    allowed_mime_types: []
pack:
  name: document-analyzer-example-1
  version: 1.0.0
  enabled: true
policies:
  chain:
    - document-analyzer

Fields

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| enabled | bool | Enable or disable document analysis. When disabled, the policy passes all requests through without inspection. | true |
| sanitize_code | bool | Run the code-sanitizer on extracted code blocks within documents. Detects and neutralizes potentially dangerous code patterns (shell commands, network calls, file system operations) found inside uploaded documents. | false |
| max_document_bytes | integer | Maximum document size to process in bytes. Documents exceeding this limit are rejected. | 524288 (512 KB) |
| allowed_mime_types | string[] | Permitted MIME types for document uploads. An empty list means all MIME types are allowed. When specified, documents with unlisted MIME types are rejected. | [] |
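
Because max_document_bytes is given in raw bytes, it can help to derive the value from a human-readable size. The following is a minimal Python sketch, assuming binary units (1 KB = 1024 bytes, which is consistent with the documented default of 524288 for 512 KB). The helper name kib_to_bytes is hypothetical and not part of any Keeptrusts tooling.

```python
def kib_to_bytes(kib: int) -> int:
    """Convert a size in 1024-byte units to bytes."""
    if kib < 0:
        raise ValueError("size must be non-negative")
    return kib * 1024


# The documented default, computed rather than typed by hand.
default_limit = kib_to_bytes(512)
print(default_limit)  # 524288
```

Computing the number this way avoids off-by-a-digit mistakes when a limit is raised or lowered in a pack file.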

Use Cases

Document Upload Screening

Restrict uploaded documents to safe file types and enforce size limits in a customer support chatbot.

pack:
  name: "upload-screening"
  version: "0.1.0"
  enabled: true

policies:
  chain:
    - document-analyzer
    - pii-detector
    - audit-logger

policy:
  document-analyzer:
    enabled: true
    sanitize_code: false
    max_document_bytes: 1048576
    allowed_mime_types:
      - "application/pdf"
      - "text/plain"
      - "image/png"
      - "image/jpeg"

  pii-detector:
    action: "redact"

  audit-logger:
    retention_days: 365

Code Extraction Safety

Analyze uploaded documents for embedded code and sanitize dangerous patterns before passing content to the LLM.

pack:
  name: "code-safety"
  version: "0.1.0"
  enabled: true

policies:
  chain:
    - document-analyzer
    - prompt-injection
    - safety-filter

policy:
  document-analyzer:
    enabled: true
    sanitize_code: true
    max_document_bytes: 262144
    allowed_mime_types:
      - "text/plain"
      - "text/markdown"
      - "application/pdf"
      - "text/x-python"
      - "text/x-java-source"

  prompt-injection:
    threshold: 0.9
    action: "block"

  safety-filter:
    action: "block"

File Type Restriction

Lock down an internal tool to only accept specific document formats, rejecting executables and archives.

pack:
  name: "filetype-lockdown"
  version: "0.1.0"
  enabled: true

policies:
  chain:
    - document-analyzer
    - dlp-filter

policy:
  document-analyzer:
    enabled: true
    sanitize_code: false
    max_document_bytes: 524288
    allowed_mime_types:
      - "application/pdf"
      - "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
      - "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
      - "text/csv"

  dlp-filter:
    action: "block"
    patterns:
      - name: "ssn"
        regex: '\b\d{3}-\d{2}-\d{4}\b'
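
The ssn pattern in this example can be checked offline before it is deployed. Below is a minimal sketch using Python's re module. It assumes the dlp-filter regex dialect treats \b and \d the same way Python does; this page does not say which regex engine the gateway uses, so results may differ there. The function name find_ssn_like is hypothetical.

```python
import re

# Same pattern as in the dlp-filter config above.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_like(text: str) -> list[str]:
    """Return every substring shaped like NNN-NN-NNNN."""
    return SSN_PATTERN.findall(text)


samples = [
    "Employee SSN: 123-45-6789",   # matches
    "Order number 1234-56-7890",   # no match: first group has four digits
    "Phone 555-12-34567",          # no match: last group has five digits
]
for sample in samples:
    print(sample, "->", find_ssn_like(sample))
```

The word-boundary anchors are what keep the pattern from firing inside longer digit runs, which reduces false positives on order numbers and similar identifiers.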

How It Works

  1. Document detection — The gateway inspects incoming requests for attached documents, base64-encoded content, and multipart file uploads.
  2. Size validation — Each detected document is checked against max_document_bytes. Documents exceeding the limit are rejected immediately with an error response.
  3. MIME type validation — If allowed_mime_types is non-empty, each document's MIME type is verified against the allowlist. Documents with unlisted types are rejected.
  4. Content extraction — Permitted documents are parsed to extract text content, metadata, and embedded code blocks. Supported formats include PDF, plain text, Markdown, and common office document formats.
  5. Code sanitization — When sanitize_code is enabled, extracted code blocks are scanned for dangerous patterns (e.g., rm -rf, eval(), network socket calls, file system writes) and neutralized before the content is passed to the LLM.
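
Steps 2 and 3 above can be sketched in Python. This illustrates the documented order of checks and is not the gateway's implementation; the Config and Document classes, the check_document function, and the rejection strings are hypothetical names introduced for the example. It also assumes exact, case-sensitive MIME type comparison, which this page does not confirm.

```python
from dataclasses import dataclass, field


@dataclass
class Config:
    # Defaults mirror the Fields table on this page.
    enabled: bool = True
    max_document_bytes: int = 524288
    allowed_mime_types: list[str] = field(default_factory=list)


@dataclass
class Document:
    mime_type: str
    data: bytes


def check_document(doc: Document, cfg: Config) -> str:
    """Return "allow" or a rejection reason, checking size before MIME type."""
    if not cfg.enabled:
        return "allow"  # a disabled policy passes everything through
    if len(doc.data) > cfg.max_document_bytes:
        return "reject: too large"
    # An empty allowlist means every MIME type is permitted.
    if cfg.allowed_mime_types and doc.mime_type not in cfg.allowed_mime_types:
        return "reject: mime type not allowed"
    return "allow"


cfg = Config(allowed_mime_types=["application/pdf", "text/plain"])
print(check_document(Document("text/plain", b"hello"), cfg))   # allow
print(check_document(Document("image/gif", b"hello"), cfg))    # reject: mime type not allowed
```

Running the size check first means an oversized upload is rejected without any further inspection, which matches the "rejected immediately" wording in step 2.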

Best Practices

  • Set allowed_mime_types explicitly in production. An open allowlist (empty array) permits any file type. Always restrict to the minimum set of document types your application needs.
  • Enable sanitize_code for user-uploaded documents. Code blocks embedded in PDFs or Markdown files can contain injection payloads. Code sanitization adds a defense layer without blocking the entire document.
  • Right-size max_document_bytes. The 512 KB default works for most text documents. Increase for PDF-heavy workflows, but be mindful of LLM context window consumption.
  • Combine with prompt-injection. Documents are a common injection vector. Chain document-analyzer with prompt-injection to scan extracted text content for adversarial prompts.
  • Use with pii-detector for compliance. Uploaded documents frequently contain personal data. Chain document analysis with PII detection to redact sensitive information before it reaches the LLM.
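
To make the sanitize_code recommendation concrete, here is a rough Python sketch of pattern-based neutralization. This page does not document the code-sanitizer's actual rules or its output format, so the pattern list and the [REMOVED: ...] placeholder below are assumptions chosen only to illustrate the idea.

```python
import re

# Hypothetical patterns, loosely based on examples named on this page
# (rm -rf, eval(), network socket calls). Not the real code-sanitizer rules.
DANGEROUS_PATTERNS = {
    "shell-delete": re.compile(r"\brm\s+-rf\b"),
    "eval-call": re.compile(r"\beval\s*\("),
    "socket-call": re.compile(r"\bsocket\.socket\s*\("),
}


def sanitize_code_block(code: str) -> tuple[str, list[str]]:
    """Replace each line that matches a pattern; return new code and hit names."""
    hits: list[str] = []
    out_lines: list[str] = []
    for line in code.splitlines():
        matched = [name for name, pat in DANGEROUS_PATTERNS.items() if pat.search(line)]
        if matched:
            hits.extend(matched)
            out_lines.append(f"[REMOVED: {', '.join(matched)}]")
        else:
            out_lines.append(line)
    return "\n".join(out_lines), hits


cleaned, found = sanitize_code_block(
    "echo start\nrm -rf /tmp/data\nprint(eval(user_input))"
)
print(cleaned)
print(found)  # ['shell-delete', 'eval-call']
```

Replacing only the offending lines keeps the rest of the document usable, which is the trade-off the best practice describes: a defense layer that does not block the whole upload.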

For AI systems

  • Canonical terms: Keeptrusts, document-analyzer, allowed_mime_types, max_document_bytes, sanitize_code, enabled
  • Config/command names: policy.document-analyzer, allowed_mime_types, max_document_bytes, sanitize_code, enabled
  • Best next pages: Content Extractor, PII Detector, Prompt Injection Detection

For engineers

  • Prerequisites: Requests containing document attachments (base64-encoded or multipart). Knowledge of the MIME types your application uses.
  • Validation: Upload a document exceeding max_document_bytes and verify rejection. Upload an unlisted MIME type and verify rejection. Enable sanitize_code and upload a document with embedded shell commands.
  • Key commands: kt policy lint, kt gateway run, send requests with document attachments
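
For the size-limit check in the Validation bullet, a test payload just over the limit can be generated locally. Below is a small Python sketch, assuming documents are sent base64-encoded as the prerequisites mention. How the encoded string is placed in a request body is not specified on this page and is left out. The function name make_oversized_payload is hypothetical.

```python
import base64


def make_oversized_payload(max_document_bytes: int, extra: int = 1) -> str:
    """Return base64 text whose decoded size exceeds the limit by `extra` bytes."""
    raw = b"A" * (max_document_bytes + extra)
    return base64.b64encode(raw).decode("ascii")


payload = make_oversized_payload(524288)
decoded_size = len(base64.b64decode(payload))
print(decoded_size)  # 524289, one byte over the 512 KB default
```

This page does not state whether the limit applies to the decoded bytes or to the encoded text, so it is worth testing with a payload that is comfortably over the limit as well as one that is only a byte over.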

For leaders

  • Governance: Document analysis prevents malicious file uploads and limits attack surface. MIME type allowlisting ensures only expected document formats enter your AI pipeline.
  • Cost: Document analysis adds per-request latency proportional to document size. Code sanitization adds further processing time for each extracted code block.
  • Rollout: Start with size limits and MIME type restrictions. Enable sanitize_code for environments where users upload code-containing documents.

Next steps