External Moderation

External moderation policies route LLM inputs and outputs through third-party content safety, PII detection, or custom moderation services before allowing traffic to proceed. Keeptrusts supports nine providers out of the box — from managed APIs like OpenAI Moderation and Azure Content Safety to self-hosted options like Presidio and generic webhooks — letting you layer multiple moderation checks into a single policy pipeline.

Use this page when

You need to route AI inputs/outputs through third-party content safety services (OpenAI Moderation, Azure Content Safety, Bedrock Guardrails, Lakera, Presidio, etc.).
You are layering multiple moderation providers for defense-in-depth content safety.
You need self-hosted PII detection via Presidio or custom webhook-based moderation.

Primary audience

Primary: AI Agents, Technical Engineers
Secondary: Technical Leaders

Configuration

Fields

Property	Type	Default	Description
`provider`	`enum`	`"webhook"`	Moderation provider. One of `"openai-moderation"`, `"azure-content-safety"`, `"bedrock-apply-guardrail"`, `"embedding-endpoint"`, `"webhook"`, `"presidio"`, `"guardrails-ai"`, `"dynamo-ai"`, `"lakera"`.
`secret_key_ref`	`string`	—	Name of the environment variable that holds the API key for the chosen provider. Never put raw secrets in the policy file.
`endpoint`	`string` (uri)	—	Full URL of the provider's moderation endpoint. Required for `webhook`, `embedding-endpoint`, `presidio`, `guardrails-ai`, and `dynamo-ai`. Optional when the provider has a well-known default (e.g., OpenAI, Lakera).
`categories`	`string[]`	`[]`	Content categories the provider should evaluate. Interpretation is provider-specific (e.g., OpenAI category names, Azure category names). An empty array means "check all available categories".
`threshold`	`number`	`0.5`	Confidence score between `0` and `1`. Content flagged at or above this threshold is blocked. Lower values are more aggressive; higher values allow more borderline content through.
`timeout_ms`	`integer`	`3000`	Maximum time in milliseconds to wait for the moderation provider to respond. Must be ≥ 1.
`fail_closed`	`boolean`	`false`	Behavior when the provider is unreachable or returns an error. `true` blocks the request (fail-closed / deny-by-default). `false` allows it through (fail-open).
`webhook_headers`	`object`	`{}`	Additional HTTP headers sent with every request to a `webhook` provider. Useful for authentication tokens, tracing headers, or tenant identifiers.
`aws_region`	`string`	—	AWS region for the Bedrock Guardrails API (e.g., `"us-east-1"`, `"eu-west-1"`). Required when `provider` is `"bedrock-apply-guardrail"`.
`aws_access_key_env`	`string`	—	Environment variable holding the AWS access key ID. Required for Bedrock unless instance-profile or IRSA credentials are available.
`aws_secret_key_env`	`string`	—	Environment variable holding the AWS secret access key. Required alongside `aws_access_key_env`.
`aws_session_token_env`	`string`	—	Environment variable holding a temporary AWS session token. Optional; used with STS-assumed roles or federated credentials.
`guardrail_id`	`string`	—	Bedrock guardrail identifier (e.g., `"abc123"`). Found in the AWS Bedrock console under Guardrails. Required when `provider` is `"bedrock-apply-guardrail"`.
`guardrail_version`	`string`	—	Version of the Bedrock guardrail to apply. Use `"DRAFT"` during development or a numeric version like `"1"` in production. Required when `provider` is `"bedrock-apply-guardrail"`.
`embedding_model`	`string`	—	Model identifier for the embedding endpoint (e.g., `"text-embedding-3-small"`, `"all-MiniLM-L6-v2"`). Required when `provider` is `"embedding-endpoint"`.
`reference_texts`	`string[]`	`[]`	Reference texts for embedding-based similarity detection. Content whose embedding is too close to any reference text (above `threshold`) is flagged. Required when `provider` is `"embedding-endpoint"`.
`presidio_language`	`string`	—	Language code for Presidio's NLP engine (e.g., `"en"`, `"de"`, `"es"`). Required when `provider` is `"presidio"`.
`presidio_entities`	`string[]`	`[]`	Presidio entity types to detect (e.g., `PHONE_NUMBER`, `EMAIL_ADDRESS`, `CREDIT_CARD`, `US_SSN`). An empty array means detect all supported entity types.
`guard_name`	`string`	—	Name of the Guardrails AI guard to execute. Must match a guard registered in your Guardrails AI server. Required when `provider` is `"guardrails-ai"`.
`policy_id`	`string`	—	Dynamo AI policy identifier. Found in the Dynamo AI dashboard. Required when `provider` is `"dynamo-ai"`.
`lakera_categories`	`string[]`	`[]`	Lakera Guard category filters (e.g., `"prompt_injection"`, `"jailbreak"`, `"pii"`, `"toxicity"`). An empty array checks all categories.

Provider-Specific Configuration

OpenAI Moderation

Uses OpenAI's Moderation API to classify content across safety categories including harassment, hate speech, self-harm, sexual content, and violence. Each category returns a confidence score; content is blocked when any checked category exceeds the threshold.

Required fields: secret_key_ref

Optional fields: endpoint (defaults to https://api.openai.com/v1/moderations), categories, threshold, timeout_ms, fail_closed

Supported categories: harassment, harassment/threatening, hate, hate/threatening, self-harm, self-harm/instructions, self-harm/intent, sexual, sexual/minors, violence, violence/graphic

Webhook verdict mapping

Generic webhook providers may return either a boolean flagged field or an explicit textual verdict in verdict or action.

Keeptrusts treats deny, block, blocked, reject, rejected, flagged, and review as blocking verdicts. allow, none, and omitted verdicts with flagged: false are treated as allow decisions.

{
  "verdict": "deny",
  "reason": "prompt injection candidate",
  "scores": {
    "prompt_injection": 0.97
  }
}

Azure Content Safety

Uses Microsoft Azure Content Safety for enterprise-grade content moderation. Azure returns severity levels (0–6) per category; Keeptrusts normalizes these to the 0–1 threshold scale.

Required fields: secret_key_ref, endpoint

Optional fields: categories, threshold, timeout_ms, fail_closed

Supported categories: Hate, SelfHarm, Sexual, Violence

AWS Bedrock Guardrails

Uses Amazon Bedrock's ApplyGuardrail API to enforce guardrails configured in the AWS console. Supports content filters, denied topics, word filters, sensitive information filters, and contextual grounding checks.

Required fields: aws_region, guardrail_id, guardrail_version, and at least one of (aws_access_key_env + aws_secret_key_env) or instance-profile credentials.

Optional fields: aws_session_token_env, categories, threshold, timeout_ms, fail_closed

For temporary credentials (e.g., from aws sts assume-role), also set aws_session_token_env:

Embedding Endpoint

Routes content through a custom embedding model and flags it when the cosine similarity to any reference_texts entry exceeds the threshold. This enables semantic similarity-based moderation — useful for detecting content that is topically close to known-harmful examples even when exact keywords differ.

Required fields: endpoint, embedding_model, reference_texts

Optional fields: secret_key_ref, threshold, timeout_ms, fail_closed

Webhook

Sends an HTTP POST with the content to any URL. The webhook must return a JSON response with a flagged boolean and optional categories array. This is the default provider and the most flexible — use it to integrate with in-house moderation services, serverless functions, or any HTTP-accessible classifier.

Required fields: endpoint

Optional fields: secret_key_ref, webhook_headers, categories, threshold, timeout_ms, fail_closed

Expected response format:

{
  "flagged": true,
  "score": 0.92,
  "categories": ["toxicity", "threat"]
}

Presidio

Uses Microsoft Presidio for PII (Personally Identifiable Information) detection and de-identification. Presidio runs as a self-hosted service and supports dozens of entity types across multiple languages.

Required fields: endpoint, presidio_language

Optional fields: presidio_entities, secret_key_ref, threshold, timeout_ms, fail_closed

Common entity types: PERSON, PHONE_NUMBER, EMAIL_ADDRESS, CREDIT_CARD, US_SSN, US_PASSPORT, IBAN_CODE, IP_ADDRESS, MEDICAL_LICENSE, LOCATION, DATE_TIME, NRP (nationality/religious/political group)

Guardrails AI

Executes a named guard from Guardrails AI. Guards are composable validation pipelines that can check for toxicity, PII, hallucination, competitor mentions, off-topic responses, and more.

Required fields: endpoint, guard_name

Optional fields: secret_key_ref, threshold, timeout_ms, fail_closed

Dynamo AI

Evaluates content against a policy defined in Dynamo AI. Policies in Dynamo AI can cover regulatory compliance, brand safety, and custom business rules.

Required fields: endpoint, policy_id

Optional fields: secret_key_ref, threshold, timeout_ms, fail_closed

Lakera

Uses Lakera Guard for prompt injection detection, jailbreak prevention, PII detection, and content safety. Lakera specializes in adversarial-input defense for LLM applications.

Required fields: secret_key_ref

Optional fields: endpoint (defaults to Lakera's hosted API), lakera_categories, threshold, timeout_ms, fail_closed

Supported categories: prompt_injection, jailbreak, pii, toxicity, unknown_links

Use Cases

Content Safety with OpenAI Moderation

Block harmful content across standard safety categories before it reaches the model or the end user.

Enterprise Content Safety with Azure

Organizations using Azure can leverage Azure Content Safety for compliant, region-specific moderation with data residency guarantees.

AWS-Native Guardrails with Bedrock

Teams on AWS can enforce centrally-managed Bedrock guardrails across all LLM traffic passing through Keeptrusts, without modifying application code.

PII Detection with Presidio

Detect and block messages containing personally identifiable information before they are sent to an LLM, helping comply with GDPR, HIPAA, and other data-protection regulations.

Custom Webhook Integration

Integrate your own moderation logic — a FastAPI service, a Lambda function, or any HTTP endpoint — into the Keeptrusts pipeline.

Prompt Injection Defense with Lakera

Protect against prompt injection and jailbreak attempts at the input layer, before prompts reach the model.

Multi-Provider Moderation Pipeline

Chain multiple moderation providers for defense-in-depth. Policies are evaluated in order — Lakera catches prompt injections first, Presidio strips PII next, and OpenAI Moderation enforces content safety last.

How It Works

Interception — When a request or response passes through the Keeptrusts gateway, each external-moderation policy in the pipeline is evaluated in order.
Dispatch — Keeptrusts serializes the content and sends it to the configured provider's endpoint using the provider-specific protocol (REST API, AWS SDK call, or webhook POST).
Evaluation — The provider analyzes the content and returns category scores, a flagged boolean, or a pass/fail decision depending on the provider type.
Threshold check — Keeptrusts compares the returned scores against threshold. If any checked category meets or exceeds the threshold, the content is flagged.
Decision — Flagged content is blocked and an event is recorded with the moderation result, provider, flagged categories, and scores. Unflagged content proceeds to the next policy or to the upstream model.
Error handling — If the provider is unreachable or returns an error within timeout_ms, the fail_closed setting determines whether the request is blocked (safe default) or allowed through (availability-first).

Combining With Other Policies

External moderation works well alongside other Keeptrusts policy types:

Keyword filters → Use keyword policies for fast, deterministic blocking of known-bad terms, and external moderation for nuanced semantic checks.
Quality scorer → Run quality scoring on model outputs after external moderation has verified content safety on inputs.
Rate limiting → Apply rate limits before external moderation to avoid unnecessary API calls against moderation providers during traffic spikes.
Redaction → Chain Presidio moderation (to detect PII) with a redaction policy (to mask it) instead of blocking outright.
Disclaimers → Add disclaimers to responses that pass moderation but touch sensitive topics.

Best Practices

Start fail-open, move to fail-closed. Begin with fail_closed: false while calibrating thresholds, then switch to true once you have confidence in provider reliability and latency.
Tune thresholds per category. A threshold of 0.5 is a reasonable starting point, but lower it for high-risk categories (hate speech, self-harm) and raise it for noisier ones (sexual innuendo in medical contexts).
Set realistic timeouts. Cloud moderation APIs typically respond in 100–500 ms. Set timeout_ms to 2–3× the observed P99 latency to avoid spurious timeouts while still protecting overall request latency.
Use environment variables for all secrets. Never hard-code API keys in policy YAML. Use secret_key_ref, aws_access_key_env, and aws_secret_key_env to reference secrets from the environment.
Layer providers for defense-in-depth. No single provider catches everything. Combine a prompt-injection specialist (Lakera) with a PII detector (Presidio) and a general content safety provider (OpenAI, Azure) for comprehensive coverage.
Monitor moderation latency. External moderation adds latency to every request. Use Keeptrusts's trace and event data to track per-provider response times and flag degradation before it impacts users.
Scope categories explicitly. Leaving categories empty checks everything, which may produce false positives. List only the categories relevant to your use case to reduce noise.
Test with realistic payloads. Validate thresholds against representative content from your domain — medical, legal, and financial text often triggers generic safety models at inappropriate confidence levels.
Pin Bedrock guardrail versions. Use a numeric guardrail_version in production rather than "DRAFT" to ensure consistent behavior across deployments.
Keep webhook endpoints idempotent. If your webhook provider is a custom service, ensure it handles retries and duplicate requests gracefully.

For AI systems

Canonical terms: Keeptrusts, external-moderation, provider, openai-moderation, azure-content-safety, bedrock-apply-guardrail, presidio, lakera, webhook, guardrails-ai, dynamo-ai, embedding-endpoint
Config/command names: external-moderation policy, provider enum, threshold, categories, fail_closed, timeout_ms, secret_key_ref, guardrail_id, presidio_entities, lakera_categories
Best next pages: Safety Filter, Prompt Injection Detection, PII Detector

For engineers

Prerequisites: API keys or endpoints for your chosen moderation provider(s). For Presidio, a running Presidio instance. For Bedrock, AWS credentials with Guardrails access.
Validation: Send known-flaggable content and verify the moderation provider returns a block/flag. Test fail_closed: true by pointing to an unreachable endpoint and confirming requests are blocked.
Key commands: kt policy lint, kt gateway run, kt events tail

For leaders

Governance: External moderation provides independent third-party safety validation. For regulated environments, this demonstrates defense-in-depth beyond self-hosted controls.
Cost: Each moderation call adds latency (bounded by timeout_ms) and per-request cost from the moderation provider. OpenAI Moderation is free; others charge per-request fees.
Rollout: Start with a single provider (OpenAI Moderation is free and fast). Layer additional providers for high-risk deployments. Set fail_closed: true in production to prevent bypass on provider outages.

Next steps

Safety Filter — Built-in content safety patterns
Prompt Injection Detection — Input attack detection
PII Detector — Built-in PII detection
DLP Filter — Pattern-based data loss prevention

Use this page when​

Primary audience​

Configuration​

Fields​

Provider-Specific Configuration​

OpenAI Moderation​

Webhook verdict mapping​

Azure Content Safety​

AWS Bedrock Guardrails​

Embedding Endpoint​

Webhook​

Presidio​

Guardrails AI​

Dynamo AI​

Lakera​

Use Cases​

Content Safety with OpenAI Moderation​

Enterprise Content Safety with Azure​

AWS-Native Guardrails with Bedrock​

PII Detection with Presidio​

Custom Webhook Integration​

Prompt Injection Defense with Lakera​

Multi-Provider Moderation Pipeline​

How It Works​

Combining With Other Policies​

Best Practices​

For AI systems​

For engineers​

For leaders​

Next steps​

Use this page when

Primary audience

Configuration

Fields

Provider-Specific Configuration

OpenAI Moderation

Webhook verdict mapping

Azure Content Safety

AWS Bedrock Guardrails

Embedding Endpoint

Webhook

Presidio

Guardrails AI

Dynamo AI

Lakera

Use Cases

Content Safety with OpenAI Moderation

Enterprise Content Safety with Azure

AWS-Native Guardrails with Bedrock

PII Detection with Presidio

Custom Webhook Integration

Prompt Injection Defense with Lakera

Multi-Provider Moderation Pipeline

How It Works

Combining With Other Policies

Best Practices

For AI systems

For engineers

For leaders

Next steps