Content Extractor

The content-extractor policy fetches and inlines content from URLs found in request messages, enabling RAG pipelines, document summarization, and link verification with strict host allowlisting and size controls.

Use this page when

You are building a RAG pipeline that needs to fetch and inline content from URLs referenced in user messages.
You need to enforce host allowlisting and size limits on external URL fetches during request processing.
You are setting up document summarization or link verification with strict security controls.

Primary audience

Primary: AI Agents, Technical Engineers
Secondary: Technical Leaders

Configuration

policy:
  content-extractor:
    allow_hosts:
    - docs.example.com
    - wiki.internal.corp
    timeout_ms: 2000
    max_bytes: 65536
    fetch_urls: true
    action_on_error: warn
pack:
  name: content-extractor-example-1
  version: 1.0.0
  enabled: true
policies:
  chain:
  - content-extractor

Fields

Field	Type	Description	Default
`allow_hosts`	string[]	Permitted fetch hostnames. Only URLs matching these hosts will be fetched. An empty list means no URLs are allowed — hosts must be explicitly listed. Supports exact hostnames only, not wildcards.	`[]`
`timeout_ms`	integer (min: 1)	Fetch timeout per URL in milliseconds. Requests that exceed this timeout are treated as errors.	`2000`
`max_bytes`	integer (min: 1)	Maximum response body size in bytes. Responses exceeding this limit are truncated.	`65536` (64 KB)
`fetch_urls`	bool	Automatically detect and fetch URLs found in request content. When disabled, the policy is effectively a no-op.	`true`
`action_on_error`	string	Action when a URL fetch fails (timeout, host not allowed, size exceeded): `"warn"` logs the failure and continues, `"block"` rejects the entire request.	`"warn"`
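
The constraints in the table can be checked before a config is deployed. The following is a minimal sketch, not the gateway's own validator: the `ContentExtractorConfig` class and its `validate` method are hypothetical names introduced for illustration, and only the field names, defaults, and limits come from the table above.

```python
from dataclasses import dataclass, field


@dataclass
class ContentExtractorConfig:
    """Hypothetical in-memory mirror of the policy.content-extractor block."""

    allow_hosts: list[str] = field(default_factory=list)
    timeout_ms: int = 2000
    max_bytes: int = 65536
    fetch_urls: bool = True
    action_on_error: str = "warn"

    def validate(self) -> list[str]:
        """Return a list of human-readable problems (empty if none found)."""
        problems = []
        if self.timeout_ms < 1:
            problems.append("timeout_ms must be >= 1")
        if self.max_bytes < 1:
            problems.append("max_bytes must be >= 1")
        if self.action_on_error not in ("warn", "block"):
            problems.append('action_on_error must be "warn" or "block"')
        for host in self.allow_hosts:
            # The table states that only exact hostnames are supported.
            if "*" in host:
                problems.append(f"wildcards are not supported in allow_hosts: {host}")
        if self.fetch_urls and not self.allow_hosts:
            # Not invalid, but almost certainly unintended.
            problems.append("allow_hosts is empty, so no URL will be fetched")
        return problems


print(ContentExtractorConfig(allow_hosts=["docs.example.com"]).validate())
```

A config that relies entirely on defaults reports the empty `allow_hosts` list, which matches the table: with no hosts listed, nothing is fetched.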

Use Cases

RAG Pipeline URL Fetching

Automatically fetch referenced documentation URLs and inline their content for the LLM to use as context.

pack:
  name: "rag-content-fetch"
  version: "0.1.0"
  enabled: true

policies:
  chain:
    - content-extractor
    - prompt-injection
    - audit-logger

policy:
  content-extractor:
    allow_hosts:
      - "docs.company.com"
      - "confluence.company.com"
      - "github.com"
    timeout_ms: 5000
    max_bytes: 131072
    fetch_urls: true
    action_on_error: "warn"

  prompt-injection:
    threshold: 0.8
    action: "block"

  audit-logger:
    retention_days: 90

Document Summarization Preprocessing

Fetch linked documents for a summarization pipeline with strict size limits and error handling.

pack:
  name: "doc-summarizer"
  version: "0.1.0"
  enabled: true

policies:
  chain:
    - content-extractor
    - safety-filter

policy:
  content-extractor:
    allow_hosts:
      - "storage.googleapis.com"
      - "s3.amazonaws.com"
    timeout_ms: 10000
    max_bytes: 524288
    fetch_urls: true
    action_on_error: "block"

  safety-filter:
    action: "block"

Link Verification

Verify that URLs referenced in requests are reachable and serve permitted content, blocking requests with broken or disallowed links.

pack:
  name: "link-verifier"
  version: "0.1.0"
  enabled: true

policies:
  chain:
    - content-extractor
    - dlp-filter

policy:
  content-extractor:
    allow_hosts:
      - "api.example.com"
    timeout_ms: 3000
    max_bytes: 1024
    fetch_urls: true
    action_on_error: "block"

  dlp-filter:
    action: "redact"
    patterns:
      - name: "api_key"
        regex: "sk-[a-zA-Z0-9]{48}"

How It Works

URL detection — When fetch_urls is enabled, the gateway scans request message content for URLs using standard URL pattern matching.
Host validation — Each detected URL's hostname is checked against the allow_hosts list. URLs with disallowed hosts are skipped (or trigger action_on_error if set to "block").
Fetch execution — Permitted URLs are fetched via HTTP GET with the configured timeout_ms. Responses larger than max_bytes are truncated at the byte limit.
Content inlining — Successfully fetched content is appended to the request context so the LLM can reference it directly. The original URL remains in the message for traceability.
Error handling — Fetch failures (timeouts, HTTP errors, host denials, size overflows) are handled according to action_on_error: "warn" logs the error and continues processing, "block" rejects the entire request.
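
The five steps above can be sketched in a few lines of Python. This is an illustrative approximation under stated assumptions, not the gateway's implementation: the URL regex, the function names (`detect_urls`, `host_allowed`, `extract`), and the returned dictionary shape are all hypothetical, and the `fetch` callable is injected so the sketch runs without network access.

```python
import re
from urllib.parse import urlsplit

# Assumed pattern; the gateway's actual URL matching rules are not documented here.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def detect_urls(text):
    """Step 1: find http(s) URLs and trim trailing sentence punctuation."""
    return [m.group(0).rstrip(".,;:!?") for m in URL_PATTERN.finditer(text)]


def host_allowed(url, allow_hosts):
    """Step 2: exact hostname match against the allowlist (no wildcards)."""
    host = urlsplit(url).hostname  # lowercased by urlsplit, None if missing
    return host is not None and host in {h.lower() for h in allow_hosts}


def extract(message, allow_hosts, max_bytes, action_on_error, fetch):
    """Steps 3-5: fetch permitted URLs, truncate, inline, and handle errors.

    `fetch` is a caller-supplied callable (url -> bytes) that may raise,
    for example on a timeout or an HTTP error.
    """
    inlined, warnings = [], []
    for url in detect_urls(message):
        try:
            if not host_allowed(url, allow_hosts):
                raise PermissionError(f"host not allowed: {url}")
            body = fetch(url)[:max_bytes]  # truncate at the byte limit
        except Exception as exc:
            if action_on_error == "block":
                return {"blocked": True, "reason": str(exc)}
            warnings.append(str(exc))  # "warn": record and keep going
            continue
        inlined.append((url, body))
    return {"blocked": False, "inlined": inlined, "warnings": warnings}


if __name__ == "__main__":
    message = "Summarize https://docs.example.com/guide and http://evil.test/x."
    result = extract(message, ["docs.example.com"], 16, "warn", lambda u: b"A" * 64)
    print(result)
```

With `action_on_error` set to `"warn"`, the disallowed host produces a warning and the allowed URL is still inlined. With `"block"`, the same message is rejected as soon as the disallowed host is reached.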

Best Practices

Always specify allow_hosts explicitly. An empty list blocks all fetches by default. This is a security control: keep the list as narrow as possible, and remember that wildcards are not supported, so every host must be listed by its exact name.
Set conservative max_bytes limits. Large fetched documents consume LLM context window tokens. Start with 64 KB and increase only for specific use cases like document summarization.
Use "block" for critical pipelines. If the fetched content is essential for the LLM to produce a correct response (e.g., RAG), set action_on_error to "block" to avoid hallucinated answers based on missing context.
Combine with prompt-injection. Fetched content from external URLs is an injection vector. Always chain content-extractor with prompt-injection to scan inlined content.
Tune timeout settings. The 2-second default is appropriate for internal documentation. Increase timeout_ms for external hosts or large documents, but be mindful of the impact on end-to-end latency, since every fetch can add up to timeout_ms to request processing.
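
To make the size and timeout guidance concrete, the sketch below does rough budgeting arithmetic. Both helpers are hypothetical, and both rest on assumptions that are not stated in this page: roughly 4 bytes per token (a common rule of thumb for English text that varies by tokenizer and content), and a worst case in which URLs are fetched one after another.

```python
def estimate_tokens(max_bytes, bytes_per_token=4.0):
    """Rough upper bound on tokens contributed by one fetched document.

    bytes_per_token=4.0 is an assumed heuristic, not a gateway setting.
    """
    return int(max_bytes / bytes_per_token)


def worst_case_fetch_latency_ms(url_count, timeout_ms, sequential=True):
    """Upper bound on added latency if every fetch runs into its timeout.

    Whether the gateway fetches sequentially or concurrently is an
    assumption here; both cases are shown.
    """
    if url_count == 0:
        return 0
    return url_count * timeout_ms if sequential else timeout_ms


if __name__ == "__main__":
    for max_bytes in (65536, 131072, 524288):
        print(max_bytes, "bytes ~", estimate_tokens(max_bytes), "tokens")
    print(worst_case_fetch_latency_ms(3, 2000), "ms worst case for 3 URLs")
```

Under these assumptions, the 64 KB default corresponds to roughly 16,000 tokens per document, while a 512 KB limit corresponds to roughly 131,000 tokens, which may exceed the context window of some models.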

For AI systems

Canonical terms: Keeptrusts, content-extractor, allow_hosts, fetch_urls, timeout_ms, max_bytes, action_on_error, RAG, URL fetching
Config/command names: policy.content-extractor, allow_hosts, fetch_urls, timeout_ms, max_bytes, action_on_error (warn/block)
Best next pages: Document Analyzer, Prompt Injection Detection, Safety Filter

For engineers

Prerequisites: URLs referenced in user messages must resolve to hosts listed in allow_hosts. The gateway must have network access to those hosts.
Validation: Add content-extractor to policies.chain, configure allow_hosts, and send a request containing a URL. Verify the fetched content appears in the forwarded request. Test rejection by referencing a non-allowed host.
Key commands: kt policy lint, kt gateway run, curl with a message containing a URL

For leaders

Governance: Content extraction introduces external data fetching into your AI pipeline. The allow_hosts field is your security boundary — only explicitly listed hosts are fetched. Review this list as part of your supply-chain security posture.
Cost: Each URL fetch adds latency (bounded by timeout_ms). Fetching large documents increases request processing time. Set max_bytes to control memory and bandwidth consumption.
Rollout: Start with a restrictive allow_hosts list covering only internal documentation systems. Expand as you validate content quality and security.

Next steps

Document Analyzer — File type restrictions and code sanitization
Prompt Injection Detection — Protect against injection in fetched content
DLP Filter — Scan fetched content for sensitive data
Safety Filter — Block unsafe fetched content
