Content Extractor

The content-extractor policy fetches and inlines content from URLs found in request messages, enabling RAG pipelines, document summarization, and link verification with strict host allowlisting and size controls.

Use this page when

  • You are building a RAG pipeline that needs to fetch and inline content from URLs referenced in user messages.
  • You need to enforce host allowlisting and size limits on external URL fetches during request processing.
  • You are setting up document summarization or link verification with strict security controls.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Configuration

policy:
  content-extractor:
    allow_hosts:
      - docs.example.com
      - wiki.internal.corp
    timeout_ms: 2000
    max_bytes: 65536
    fetch_urls: true
    action_on_error: warn
pack:
  name: content-extractor-example-1
  version: 1.0.0
  enabled: true
policies:
  chain:
    - content-extractor

Fields

| Field | Type | Description | Default |
| --- | --- | --- | --- |
| allow_hosts | string[] | Permitted fetch hostnames. Only URLs matching these hosts will be fetched. An empty list means no URLs are allowed; hosts must be explicitly listed. Supports exact hostnames only, not wildcards. | [] |
| timeout_ms | integer (min: 1) | Fetch timeout per URL in milliseconds. Requests that exceed this timeout are treated as errors. | 2000 |
| max_bytes | integer (min: 1) | Maximum response body size in bytes. Responses exceeding this limit are truncated. | 65536 (64 KB) |
| fetch_urls | bool | Automatically detect and fetch URLs found in request content. When disabled, the policy is effectively a no-op. | true |
| action_on_error | string | Action when a URL fetch fails (timeout, host not allowed, size exceeded): "warn" logs the failure and continues, "block" rejects the entire request. | "warn" |
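
The constraints in the table can be checked before a config is deployed. The following is a minimal sketch in Python, not the gateway's own validator: the `ContentExtractorConfig` class and `validate` function are hypothetical names introduced here, and only the field names, defaults, and limits come from the table above.

```python
from dataclasses import dataclass, field

VALID_ACTIONS = {"warn", "block"}


@dataclass
class ContentExtractorConfig:
    """Hypothetical mirror of the content-extractor fields and their documented defaults."""
    allow_hosts: list[str] = field(default_factory=list)
    timeout_ms: int = 2000
    max_bytes: int = 65536
    fetch_urls: bool = True
    action_on_error: str = "warn"


def validate(cfg: ContentExtractorConfig) -> list[str]:
    """Return a list of problems; an empty list means the config passes these checks."""
    problems = []
    if cfg.timeout_ms < 1:
        problems.append("timeout_ms must be >= 1")
    if cfg.max_bytes < 1:
        problems.append("max_bytes must be >= 1")
    if cfg.action_on_error not in VALID_ACTIONS:
        problems.append('action_on_error must be "warn" or "block"')
    for host in cfg.allow_hosts:
        # Exact hostnames only: reject wildcards, paths or full URLs, and blank entries.
        if "*" in host or "/" in host or not host.strip():
            problems.append(f"allow_hosts entry is not an exact hostname: {host!r}")
    return problems


print(validate(ContentExtractorConfig(allow_hosts=["docs.example.com"])))
print(validate(ContentExtractorConfig(allow_hosts=["*.example.com"], timeout_ms=0)))
```

An empty `allow_hosts` list passes these checks because it is a legal value; it simply means no URL will be fetched.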

Use Cases

RAG Pipeline URL Fetching

Automatically fetch referenced documentation URLs and inline their content for the LLM to use as context.

pack:
  name: "rag-content-fetch"
  version: "0.1.0"
  enabled: true

policies:
  chain:
    - content-extractor
    - prompt-injection
    - audit-logger

policy:
  content-extractor:
    allow_hosts:
      - "docs.company.com"
      - "confluence.company.com"
      - "github.com"
    timeout_ms: 5000
    max_bytes: 131072
    fetch_urls: true
    action_on_error: "warn"

  prompt-injection:
    threshold: 0.8
    action: "block"

  audit-logger:
    retention_days: 90

Document Summarization Preprocessing

Fetch linked documents for a summarization pipeline with strict size limits and error handling.

pack:
  name: "doc-summarizer"
  version: "0.1.0"
  enabled: true

policies:
  chain:
    - content-extractor
    - safety-filter

policy:
  content-extractor:
    allow_hosts:
      - "storage.googleapis.com"
      - "s3.amazonaws.com"
    timeout_ms: 10000
    max_bytes: 524288
    fetch_urls: true
    action_on_error: "block"

  safety-filter:
    action: "block"

Link Verification

Verify that URLs referenced in requests are reachable and serve permitted content, blocking requests with broken or disallowed links.

pack:
  name: "link-verifier"
  version: "0.1.0"
  enabled: true

policies:
  chain:
    - content-extractor
    - dlp-filter

policy:
  content-extractor:
    allow_hosts:
      - "api.example.com"
    timeout_ms: 3000
    max_bytes: 1024
    fetch_urls: true
    action_on_error: "block"

  dlp-filter:
    action: "redact"
    patterns:
      - name: "api_key"
        regex: "sk-[a-zA-Z0-9]{48}"
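
To see what the `api_key` pattern in this example matches, it can be tried in isolation. This is a rough Python sketch: the `redact` helper and the `[REDACTED]` placeholder are assumptions made for illustration, since this page does not describe how dlp-filter rewrites matched text.

```python
import re

# Pattern copied from the dlp-filter example: "sk-" followed by 48 alphanumeric characters.
API_KEY = re.compile(r"sk-[a-zA-Z0-9]{48}")


def redact(text: str, placeholder: str = "[REDACTED]") -> str:
    """Replace every match of the pattern with a placeholder (hypothetical behavior)."""
    return API_KEY.sub(placeholder, text)


sample_key = "sk-" + "a1B2" * 12  # 48 alphanumeric characters after the prefix
print(redact(f"token={sample_key};"))   # token=[REDACTED];
print(redact("token=sk-tooshort;"))     # token=sk-tooshort;
```

Shorter strings that merely start with `sk-` are left untouched, so the pattern only fires on values of the expected length.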

How It Works

  1. URL detection: When fetch_urls is enabled, the gateway scans request message content for URLs using standard URL pattern matching.
  2. Host validation: Each detected URL's hostname is checked against the allow_hosts list. A URL whose host is not listed is never fetched; the denial is handled according to action_on_error, so "warn" skips the URL and "block" rejects the request.
  3. Fetch execution: Permitted URLs are fetched via HTTP GET with the configured timeout_ms. Responses larger than max_bytes are truncated at the byte limit.
  4. Content inlining: Successfully fetched content is appended to the request context so the LLM can reference it directly. The original URL remains in the message for traceability.
  5. Error handling: Fetch failures (timeouts, HTTP errors, host denials, size overflows) are handled according to action_on_error. With "warn" the error is logged and processing continues; with "block" the entire request is rejected.
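
The five steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the gateway's implementation: the `extract` function, the `RequestBlocked` exception, the URL pattern, and the inlined text format are all hypothetical. The fetcher is passed in as a callable so the sketch runs without network access, and oversized bodies are truncated as described in step 3.

```python
import re
from urllib.parse import urlsplit

# Hypothetical URL pattern; the gateway's actual matcher is not documented on this page.
URL_RE = re.compile(r"https?://[^\s<>]+")


class RequestBlocked(Exception):
    """Raised when action_on_error is "block" and a URL cannot be inlined."""


def extract(message, allow_hosts, max_bytes, action_on_error, fetch):
    """Return (inlined_blocks, warnings) for one message.

    fetch is an injected callable (url -> bytes) that is expected to raise
    on timeout or HTTP error.
    """
    allowed = {h.lower() for h in allow_hosts}
    inlined, warnings = [], []
    for raw in URL_RE.findall(message):          # step 1: URL detection
        url = raw.rstrip(".,;:!?)\"'")           # drop trailing sentence punctuation
        try:
            host = (urlsplit(url).hostname or "").lower()
            if host not in allowed:              # step 2: exact-match host validation
                raise PermissionError(f"host not allowed: {host}")
            body = fetch(url)[:max_bytes]        # step 3: fetch, truncate at byte limit
        except Exception as exc:                 # step 5: error handling
            if action_on_error == "block":
                raise RequestBlocked(f"{url}: {exc}") from exc
            warnings.append(f"{url}: {exc}")
            continue
        text = body.decode("utf-8", errors="replace")
        inlined.append(f"[content from {url}]\n{text}")  # step 4: content inlining
    return inlined, warnings


def fake_fetch(url):
    """Stand-in fetcher that returns 100 bytes for any URL."""
    return b"x" * 100


msg = "Summarize https://docs.example.com/guide and https://evil.test/x."
blocks, warns = extract(msg, ["docs.example.com"], 10, "warn", fake_fetch)
print(blocks)
print(warns)
```

With "warn", the disallowed URL produces a warning and the allowed one is inlined with a 10-byte body. Passing "block" instead raises `RequestBlocked` on the first failure.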

Best Practices

  • Always specify allow_hosts explicitly. An empty list blocks all fetches by default. This is a security control: wildcards are not supported, so list each host by its exact name and keep the list as narrow as possible.
  • Set conservative max_bytes limits. Large fetched documents consume LLM context window tokens. Start with 64 KB and increase only for specific use cases like document summarization.
  • Use "block" for critical pipelines. If the fetched content is essential for the LLM to produce a correct response (e.g., RAG), set action_on_error to "block" to avoid hallucinated answers based on missing context.
  • Combine with prompt-injection. Fetched content from external URLs is an injection vector. Always chain content-extractor with prompt-injection to scan inlined content.
  • Monitor timeout settings. A 2-second default is appropriate for internal documentation. Increase timeout_ms for external hosts or large documents, but be mindful of the impact on end-to-end latency.
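
The trade-offs between max_bytes, timeout_ms, and latency can be estimated with simple arithmetic. The sketch below is a rough upper bound only. It assumes URLs are fetched one after another and that inlined text averages about 4 bytes per token; neither assumption is stated on this page, and `worst_case_budget` is a hypothetical helper.

```python
import math


def worst_case_budget(url_count, timeout_ms, max_bytes, bytes_per_token=4.0):
    """Return (added_latency_ms, inlined_tokens) upper bounds.

    Assumes sequential fetches and a fixed bytes-per-token ratio, both hypothetical.
    """
    latency_ms = url_count * timeout_ms
    tokens = math.ceil(url_count * max_bytes / bytes_per_token)
    return latency_ms, tokens


# Three URLs with the default settings (2000 ms, 65536 bytes).
print(worst_case_budget(3, 2000, 65536))  # (6000, 49152)
```

Under these assumptions, three URLs at the defaults could add up to 6 seconds and roughly 49,000 tokens of context, which is a useful check before raising either limit.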

For AI systems

  • Canonical terms: Keeptrusts, content-extractor, allow_hosts, fetch_urls, timeout_ms, max_bytes, action_on_error, RAG, URL fetching
  • Config/command names: policy.content-extractor, allow_hosts, fetch_urls, timeout_ms, max_bytes, action_on_error (warn/block)
  • Best next pages: Document Analyzer, Prompt Injection Detection, Safety Filter

For engineers

  • Prerequisites: URLs referenced in user messages must point to hosts listed in allow_hosts, and the gateway must have network access to those hosts.
  • Validation: Add content-extractor to policies.chain, configure allow_hosts, and send a request containing a URL. Verify the fetched content appears in the forwarded request. Test rejection by referencing a non-allowed host.
  • Key commands: kt policy lint, kt gateway run, curl with a message containing a URL

For leaders

  • Governance: Content extraction introduces external data fetching into your AI pipeline. The allow_hosts field is your security boundary — only explicitly listed hosts are fetched. Review this list as part of your supply-chain security posture.
  • Cost: Each URL fetch adds latency (bounded by timeout_ms). Fetching large documents increases request processing time. Set max_bytes to control memory and bandwidth consumption.
  • Rollout: Start with a restrictive allow_hosts list covering only internal documentation systems. Expand as you validate content quality and security.

Next steps