Document Analyzer

The document-analyzer policy inspects documents attached to or embedded in requests, enforcing file type restrictions, size limits, and optional code sanitization for extracted code blocks.

Use this page when

You need to enforce file type restrictions and size limits on documents uploaded through AI interactions.
You are building a pipeline that processes document attachments and need code sanitization for extracted code blocks.
You want to control which MIME types are permitted for document uploads.

Primary audience

Primary: AI Agents, Technical Engineers
Secondary: Technical Leaders

Configuration

policy:
  document-analyzer:
    enabled: true
    sanitize_code: false
    max_document_bytes: 524288
    allowed_mime_types: []
pack:
  name: document-analyzer-example-1
  version: 1.0.0
  enabled: true
policies:
  chain:
  - document-analyzer

Fields

Field	Type	Description	Default
`enabled`	bool	Enable or disable document analysis. When disabled, the policy passes all requests through without inspection.	`true`
`sanitize_code`	bool	Run the code-sanitizer on extracted code blocks within documents. Detects and neutralizes potentially dangerous code patterns (shell commands, network calls, file system operations) found inside uploaded documents.	`false`
`max_document_bytes`	integer	Maximum document size to process in bytes. Documents exceeding this limit are rejected.	`524288` (512 KiB)
`allowed_mime_types`	string[]	Permitted MIME types for document uploads. An empty list means all MIME types are allowed. When specified, documents with unlisted MIME types are rejected.	`[]`
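
The field table can be mirrored in a small local check before a config is deployed. The sketch below is illustrative only: `DocumentAnalyzerConfig` and its `problems` method are hypothetical names, not part of the product, and the checks are assumptions about what a sane config looks like rather than a description of what `kt policy lint` verifies.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentAnalyzerConfig:
    """Hypothetical mirror of the policy.document-analyzer fields (illustration only)."""

    enabled: bool = True
    sanitize_code: bool = False
    max_document_bytes: int = 524288  # 512 * 1024, the documented default
    allowed_mime_types: list[str] = field(default_factory=list)  # empty = allow all

    def problems(self) -> list[str]:
        """Return human-readable problems; an empty list means the values look sane."""
        found = []
        size = self.max_document_bytes
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(size, int) or isinstance(size, bool) or size <= 0:
            found.append("max_document_bytes must be a positive integer")
        for mime in self.allowed_mime_types:
            # Assumed shape check: exactly one "/" with text on both sides.
            if mime.count("/") != 1 or mime.startswith("/") or mime.endswith("/"):
                found.append(f"not a type/subtype MIME string: {mime!r}")
        return found


if __name__ == "__main__":
    print(DocumentAnalyzerConfig(allowed_mime_types=["application/pdf", "pdf"]).problems())
```

A check like this catches typos such as `pdf` in place of `application/pdf`, which would otherwise cause every upload of that type to be rejected.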

Use Cases

Document Upload Screening

Restrict uploaded documents to safe file types and enforce size limits in a customer support chatbot.

pack:
  name: "upload-screening"
  version: "0.1.0"
  enabled: true

policies:
  chain:
    - document-analyzer
    - pii-detector
    - audit-logger

policy:
  document-analyzer:
    enabled: true
    sanitize_code: false
    max_document_bytes: 1048576
    allowed_mime_types:
      - "application/pdf"
      - "text/plain"
      - "image/png"
      - "image/jpeg"

  pii-detector:
    action: "redact"

  audit-logger:
    retention_days: 365

Code Extraction Safety

Analyze uploaded documents for embedded code and sanitize dangerous patterns before passing content to the LLM.

pack:
  name: "code-safety"
  version: "0.1.0"
  enabled: true

policies:
  chain:
    - document-analyzer
    - prompt-injection
    - safety-filter

policy:
  document-analyzer:
    enabled: true
    sanitize_code: true
    max_document_bytes: 262144
    allowed_mime_types:
      - "text/plain"
      - "text/markdown"
      - "application/pdf"
      - "text/x-python"
      - "text/x-java-source"

  prompt-injection:
    threshold: 0.9
    action: "block"

  safety-filter:
    action: "block"

File Type Restriction

Lock down an internal tool to only accept specific document formats, rejecting executables and archives.

pack:
  name: "filetype-lockdown"
  version: "0.1.0"
  enabled: true

policies:
  chain:
    - document-analyzer
    - dlp-filter

policy:
  document-analyzer:
    enabled: true
    sanitize_code: false
    max_document_bytes: 524288
    allowed_mime_types:
      - "application/pdf"
      - "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
      - "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
      - "text/csv"

  dlp-filter:
    action: "block"
    patterns:
      - name: "ssn"
        regex: '\b\d{3}-\d{2}-\d{4}\b'
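
The `ssn` pattern in the config above can be tried locally before it is deployed. This is a minimal sketch that assumes the `dlp-filter` regex engine treats `\b` and `\d` the way Python's `re` module does; regex engines differ, so treat the result as a sanity check only. `find_ssn_like` is a hypothetical helper name.

```python
import re

# Same pattern as the dlp-filter example: three digits, two digits, four digits,
# separated by hyphens and bounded by word boundaries.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_like(text: str) -> list[str]:
    """Return every substring of text that matches the SSN-shaped pattern."""
    return SSN_PATTERN.findall(text)


if __name__ == "__main__":
    print(find_ssn_like("Customer SSN 123-45-6789 is on file."))
    print(find_ssn_like("Order number 1234-56-7890 shipped."))
```

The word boundaries keep the pattern from matching inside longer digit runs, such as the order number in the second call.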

How It Works

Document detection: The gateway inspects incoming requests for attached documents, base64-encoded content, and multipart file uploads.
Size validation: Each detected document is checked against `max_document_bytes`. Documents exceeding the limit are rejected immediately with an error response.
MIME type validation: If `allowed_mime_types` is non-empty, each document's MIME type is verified against the allowlist. Documents with unlisted types are rejected.
Content extraction: Permitted documents are parsed to extract text content, metadata, and embedded code blocks. Supported formats include PDF, plain text, Markdown, and common office document formats.
Code sanitization: When `sanitize_code` is enabled, extracted code blocks are scanned for dangerous patterns (e.g., `rm -rf`, `eval()`, network socket calls, file system writes) and neutralized before the content is passed to the LLM.
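
The size and MIME type steps above can be sketched as a single ordered check. This is an illustration of the documented behavior, not the gateway's implementation: `screen_document` and its rejection strings are hypothetical, and the exact string comparison of MIME types is an assumption (the gateway may normalize case or strip parameters).

```python
from typing import Optional, Sequence


def screen_document(
    size_bytes: int,
    mime_type: str,
    enabled: bool = True,
    max_document_bytes: int = 524288,
    allowed_mime_types: Sequence[str] = (),
) -> Optional[str]:
    """Return a rejection reason, or None if the document may proceed to extraction."""
    if not enabled:
        # A disabled policy passes every request through without inspection.
        return None
    # Size is checked first: only documents strictly over the limit are rejected.
    if size_bytes > max_document_bytes:
        return "document exceeds max_document_bytes"
    # An empty allowlist means every MIME type is permitted.
    if allowed_mime_types and mime_type not in allowed_mime_types:
        return "MIME type not in allowed_mime_types"
    return None


if __name__ == "__main__":
    allowlist = ["application/pdf", "text/plain"]
    print(screen_document(1024, "application/pdf", allowed_mime_types=allowlist))
    print(screen_document(1024, "application/zip", allowed_mime_types=allowlist))
    print(screen_document(600000, "application/pdf", allowed_mime_types=allowlist))
```

Checking size before type mirrors the order listed above, so an oversized file is rejected without its type being examined.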

Best Practices

Set `allowed_mime_types` explicitly in production. An open allowlist (empty array) permits any file type. Always restrict to the minimum set of document types your application needs.
Enable `sanitize_code` for user-uploaded documents. Code blocks embedded in PDFs or Markdown files can contain injection payloads. Code sanitization adds a defense layer without blocking the entire document.
Right-size `max_document_bytes`. The 512 KiB (524288-byte) default works for most text documents. Increase it for PDF-heavy workflows, but be mindful of LLM context window consumption.
Combine with `prompt-injection`. Documents are a common injection vector. Chain `document-analyzer` with `prompt-injection` to scan extracted text content for adversarial prompts.
Use with `pii-detector` for compliance. Uploaded documents frequently contain personal data. Chain document analysis with PII detection to redact sensitive information before it reaches the LLM.
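
When sizing `max_document_bytes`, keep in mind that base64 encoding inflates a payload by roughly one third. The sketch below computes the encoded length for the three limits used on this page. Whether the gateway measures the limit against the decoded or the encoded size is not stated here, so that is left as an open assumption; `base64_length` is a hypothetical helper.

```python
import base64


def base64_length(raw_bytes: int) -> int:
    """Length in characters of the padded base64 encoding of raw_bytes bytes."""
    # Every group of up to 3 input bytes becomes 4 output characters.
    return 4 * ((raw_bytes + 2) // 3)


if __name__ == "__main__":
    for label, limit in [("256 KiB", 262144), ("512 KiB", 524288), ("1 MiB", 1048576)]:
        print(f"{label}: {limit} raw bytes -> {base64_length(limit)} base64 characters")

    # Cross-check the formula against the standard library on a real payload.
    sample = b"\x00" * 524288
    print(base64_length(len(sample)) == len(base64.b64encode(sample)))
```

If request bodies are also capped upstream (for example by a proxy), the encoded figure is the one to compare against that cap.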

For AI systems

Canonical terms: Keeptrusts, document-analyzer, allowed_mime_types, max_document_bytes, sanitize_code, enabled
Config/command names: policy.document-analyzer, allowed_mime_types, max_document_bytes, sanitize_code, enabled
Best next pages: Content Extractor, PII Detector, Prompt Injection Detection

For engineers

Prerequisites: Requests containing document attachments (base64-encoded or multipart). Knowledge of the MIME types your application uses.
Validation: Upload a document exceeding `max_document_bytes` and verify rejection. Upload an unlisted MIME type and verify rejection. Enable `sanitize_code`, upload a document with embedded shell commands, and verify that the commands are neutralized before the content reaches the LLM.
Key commands: `kt policy lint`, `kt gateway run`, then send requests with document attachments
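
The validation steps above need a handful of test files. The following sketch generates them locally. `write_fixtures` and the file names are hypothetical, and how the files are attached to a request depends on your client and gateway setup, which this sketch does not cover.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def write_fixtures(directory: Path, max_document_bytes: int = 524288) -> dict[str, Path]:
    """Write test documents for the size, MIME type, and sanitization checks."""
    # A small zip archive, useful as an "unlisted MIME type" upload.
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("readme.txt", "hello")

    fixtures = {
        "at_limit.txt": b"a" * max_document_bytes,          # should be accepted
        "over_limit.txt": b"a" * (max_document_bytes + 1),  # should be rejected
        "archive.zip": archive.getvalue(),                  # rejected if not allowlisted
        # Markdown with an indented code block containing a shell command.
        "shell_snippet.md": b"# Notes\n\n    rm -rf /tmp/example\n",
    }
    paths = {}
    for name, payload in fixtures.items():
        path = directory / name
        path.write_bytes(payload)
        paths[name] = path
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name, path in write_fixtures(Path(tmp)).items():
            print(name, path.stat().st_size)
```

The at-limit and over-limit pair differ by a single byte, which makes an off-by-one error in the size check easy to spot.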

For leaders

Governance: Document analysis prevents malicious file uploads and limits attack surface. MIME type allowlisting ensures only expected document formats enter your AI pipeline.
Cost: Document analysis adds per-request latency proportional to document size. Code sanitization adds additional processing for each extracted code block.
Rollout: Start with size limits and MIME type restrictions. Enable sanitize_code for environments where users upload code-containing documents.

Next steps

Content Extractor — Fetch and inline URL content
PII Detector — Scan document content for PII
Prompt Injection Detection — Detect injection in document content
DLP Filter — Scan documents for sensitive patterns
