External Moderation
External moderation policies route LLM inputs and outputs through third-party content safety, PII detection, or custom moderation services before allowing traffic to proceed. Keeptrusts supports nine providers out of the box — from managed APIs like OpenAI Moderation and Azure Content Safety to self-hosted options like Presidio and generic webhooks — letting you layer multiple moderation checks into a single policy pipeline.
Use this page when
- You need to route AI inputs/outputs through third-party content safety services (OpenAI Moderation, Azure Content Safety, Bedrock Guardrails, Lakera, Presidio, etc.).
- You are layering multiple moderation providers for defense-in-depth content safety.
- You need self-hosted PII detection via Presidio or custom webhook-based moderation.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Configuration
Fields
| Property | Type | Default | Description |
|---|---|---|---|
provider | enum | "webhook" | Moderation provider. One of "openai-moderation", "azure-content-safety", "bedrock-apply-guardrail", "embedding-endpoint", "webhook", "presidio", "guardrails-ai", "dynamo-ai", "lakera". |
secret_key_ref | string | — | Name of the environment variable that holds the API key for the chosen provider. Never put raw secrets in the policy file. |
endpoint | string (uri) | — | Full URL of the provider's moderation endpoint. Required for webhook, embedding-endpoint, presidio, guardrails-ai, and dynamo-ai. Optional when the provider has a well-known default (e.g., OpenAI, Lakera). |
categories | string[] | [] | Content categories the provider should evaluate. Interpretation is provider-specific (e.g., OpenAI category names, Azure category names). An empty array means "check all available categories". |
threshold | number | 0.5 | Confidence score between 0 and 1. Content flagged at or above this threshold is blocked. Lower values are more aggressive; higher values allow more borderline content through. |
timeout_ms | integer | 3000 | Maximum time in milliseconds to wait for the moderation provider to respond. Must be ≥ 1. |
fail_closed | boolean | false | Behavior when the provider is unreachable or returns an error. true blocks the request (fail-closed / deny-by-default). false allows it through (fail-open). |
webhook_headers | object | {} | Additional HTTP headers sent with every request to a webhook provider. Useful for authentication tokens, tracing headers, or tenant identifiers. |
aws_region | string | — | AWS region for the Bedrock Guardrails API (e.g., "us-east-1", "eu-west-1"). Required when provider is "bedrock-apply-guardrail". |
aws_access_key_env | string | — | Environment variable holding the AWS access key ID. Required for Bedrock unless instance-profile or IRSA credentials are available. |
aws_secret_key_env | string | — | Environment variable holding the AWS secret access key. Required alongside aws_access_key_env. |
aws_session_token_env | string | — | Environment variable holding a temporary AWS session token. Optional; used with STS-assumed roles or federated credentials. |
guardrail_id | string | — | Bedrock guardrail identifier (e.g., "abc123"). Found in the AWS Bedrock console under Guardrails. Required when provider is "bedrock-apply-guardrail". |
guardrail_version | string | — | Version of the Bedrock guardrail to apply. Use "DRAFT" during development or a numeric version like "1" in production. Required when provider is "bedrock-apply-guardrail". |
embedding_model | string | — | Model identifier for the embedding endpoint (e.g., "text-embedding-3-small", "all-MiniLM-L6-v2"). Required when provider is "embedding-endpoint". |
reference_texts | string[] | [] | Reference texts for embedding-based similarity detection. Content whose embedding is too close to any reference text (above threshold) is flagged. Required when provider is "embedding-endpoint". |
presidio_language | string | — | Language code for Presidio's NLP engine (e.g., "en", "de", "es"). Required when provider is "presidio". |
presidio_entities | string[] | [] | Presidio entity types to detect (e.g., PHONE_NUMBER, EMAIL_ADDRESS, CREDIT_CARD, US_SSN). An empty array means detect all supported entity types. |
guard_name | string | — | Name of the Guardrails AI guard to execute. Must match a guard registered in your Guardrails AI server. Required when provider is "guardrails-ai". |
policy_id | string | — | Dynamo AI policy identifier. Found in the Dynamo AI dashboard. Required when provider is "dynamo-ai". |
lakera_categories | string[] | [] | Lakera Guard category filters (e.g., "prompt_injection", "jailbreak", "pii", "toxicity"). An empty array checks all categories. |
Provider-Specific Configuration
OpenAI Moderation
Uses OpenAI's Moderation API to classify content across safety categories including harassment, hate speech, self-harm, sexual content, and violence. Each category returns a confidence score; content is blocked when any checked category exceeds the threshold.
Required fields: secret_key_ref
Optional fields: endpoint (defaults to https://api.openai.com/v1/moderations), categories, threshold, timeout_ms, fail_closed
Supported categories: harassment, harassment/threatening, hate, hate/threatening, self-harm, self-harm/instructions, self-harm/intent, sexual, sexual/minors, violence, violence/graphic
Webhook verdict mapping
Generic webhook providers may return either a boolean flagged field or an explicit textual verdict in verdict or action.
Keeptrusts treats deny, block, blocked, reject, rejected, flagged, and review as blocking verdicts. allow, none, and omitted verdicts with flagged: false are treated as allow decisions.
{
"verdict": "deny",
"reason": "prompt injection candidate",
"scores": {
"prompt_injection": 0.97
}
}
Azure Content Safety
Uses Microsoft Azure Content Safety for enterprise-grade content moderation. Azure returns severity levels (0–6) per category; Keeptrusts normalizes these to the 0–1 threshold scale.
Required fields: secret_key_ref, endpoint
Optional fields: categories, threshold, timeout_ms, fail_closed
Supported categories: Hate, SelfHarm, Sexual, Violence
AWS Bedrock Guardrails
Uses Amazon Bedrock's ApplyGuardrail API to enforce guardrails configured in the AWS console. Supports content filters, denied topics, word filters, sensitive information filters, and contextual grounding checks.
Required fields: aws_region, guardrail_id, guardrail_version, and at least one of (aws_access_key_env + aws_secret_key_env) or instance-profile credentials.
Optional fields: aws_session_token_env, categories, threshold, timeout_ms, fail_closed
For temporary credentials (e.g., from aws sts assume-role), also set aws_session_token_env:
Embedding Endpoint
Routes content through a custom embedding model and flags it when the cosine similarity to any reference_texts entry exceeds the threshold. This enables semantic similarity-based moderation — useful for detecting content that is topically close to known-harmful examples even when exact keywords differ.
Required fields: endpoint, embedding_model, reference_texts
Optional fields: secret_key_ref, threshold, timeout_ms, fail_closed
Webhook
Sends an HTTP POST with the content to any URL. The webhook must return a JSON response with a flagged boolean and optional categories array. This is the default provider and the most flexible — use it to integrate with in-house moderation services, serverless functions, or any HTTP-accessible classifier.
Required fields: endpoint
Optional fields: secret_key_ref, webhook_headers, categories, threshold, timeout_ms, fail_closed
Expected response format:
{
"flagged": true,
"score": 0.92,
"categories": ["toxicity", "threat"]
}
Presidio
Uses Microsoft Presidio for PII (Personally Identifiable Information) detection and de-identification. Presidio runs as a self-hosted service and supports dozens of entity types across multiple languages.
Required fields: endpoint, presidio_language
Optional fields: presidio_entities, secret_key_ref, threshold, timeout_ms, fail_closed
Common entity types: PERSON, PHONE_NUMBER, EMAIL_ADDRESS, CREDIT_CARD, US_SSN, US_PASSPORT, IBAN_CODE, IP_ADDRESS, MEDICAL_LICENSE, LOCATION, DATE_TIME, NRP (nationality/religious/political group)
Guardrails AI
Executes a named guard from Guardrails AI. Guards are composable validation pipelines that can check for toxicity, PII, hallucination, competitor mentions, off-topic responses, and more.
Required fields: endpoint, guard_name
Optional fields: secret_key_ref, threshold, timeout_ms, fail_closed
Dynamo AI
Evaluates content against a policy defined in Dynamo AI. Policies in Dynamo AI can cover regulatory compliance, brand safety, and custom business rules.
Required fields: endpoint, policy_id
Optional fields: secret_key_ref, threshold, timeout_ms, fail_closed
Lakera
Uses Lakera Guard for prompt injection detection, jailbreak prevention, PII detection, and content safety. Lakera specializes in adversarial-input defense for LLM applications.
Required fields: secret_key_ref
Optional fields: endpoint (defaults to Lakera's hosted API), lakera_categories, threshold, timeout_ms, fail_closed
Supported categories: prompt_injection, jailbreak, pii, toxicity, unknown_links
Use Cases
Content Safety with OpenAI Moderation
Block harmful content across standard safety categories before it reaches the model or the end user.
Enterprise Content Safety with Azure
Organizations using Azure can leverage Azure Content Safety for compliant, region-specific moderation with data residency guarantees.
AWS-Native Guardrails with Bedrock
Teams on AWS can enforce centrally-managed Bedrock guardrails across all LLM traffic passing through Keeptrusts, without modifying application code.
PII Detection with Presidio
Detect and block messages containing personally identifiable information before they are sent to an LLM, helping comply with GDPR, HIPAA, and other data-protection regulations.
Custom Webhook Integration
Integrate your own moderation logic — a FastAPI service, a Lambda function, or any HTTP endpoint — into the Keeptrusts pipeline.
Prompt Injection Defense with Lakera
Protect against prompt injection and jailbreak attempts at the input layer, before prompts reach the model.
Multi-Provider Moderation Pipeline
Chain multiple moderation providers for defense-in-depth. Policies are evaluated in order — Lakera catches prompt injections first, Presidio strips PII next, and OpenAI Moderation enforces content safety last.
How It Works
- Interception — When a request or response passes through the Keeptrusts gateway, each
external-moderationpolicy in the pipeline is evaluated in order. - Dispatch — Keeptrusts serializes the content and sends it to the configured provider's endpoint using the provider-specific protocol (REST API, AWS SDK call, or webhook POST).
- Evaluation — The provider analyzes the content and returns category scores, a flagged boolean, or a pass/fail decision depending on the provider type.
- Threshold check — Keeptrusts compares the returned scores against
threshold. If any checked category meets or exceeds the threshold, the content is flagged. - Decision — Flagged content is blocked and an event is recorded with the moderation result, provider, flagged categories, and scores. Unflagged content proceeds to the next policy or to the upstream model.
- Error handling — If the provider is unreachable or returns an error within
timeout_ms, thefail_closedsetting determines whether the request is blocked (safe default) or allowed through (availability-first).
Combining With Other Policies
External moderation works well alongside other Keeptrusts policy types:
- Keyword filters → Use keyword policies for fast, deterministic blocking of known-bad terms, and external moderation for nuanced semantic checks.
- Quality scorer → Run quality scoring on model outputs after external moderation has verified content safety on inputs.
- Rate limiting → Apply rate limits before external moderation to avoid unnecessary API calls against moderation providers during traffic spikes.
- Redaction → Chain Presidio moderation (to detect PII) with a redaction policy (to mask it) instead of blocking outright.
- Disclaimers → Add disclaimers to responses that pass moderation but touch sensitive topics.
Best Practices
- Start fail-open, move to fail-closed. Begin with
fail_closed: falsewhile calibrating thresholds, then switch totrueonce you have confidence in provider reliability and latency. - Tune thresholds per category. A
thresholdof0.5is a reasonable starting point, but lower it for high-risk categories (hate speech, self-harm) and raise it for noisier ones (sexual innuendo in medical contexts). - Set realistic timeouts. Cloud moderation APIs typically respond in 100–500 ms. Set
timeout_msto 2–3× the observed P99 latency to avoid spurious timeouts while still protecting overall request latency. - Use environment variables for all secrets. Never hard-code API keys in policy YAML. Use
secret_key_ref,aws_access_key_env, andaws_secret_key_envto reference secrets from the environment. - Layer providers for defense-in-depth. No single provider catches everything. Combine a prompt-injection specialist (Lakera) with a PII detector (Presidio) and a general content safety provider (OpenAI, Azure) for comprehensive coverage.
- Monitor moderation latency. External moderation adds latency to every request. Use Keeptrusts's trace and event data to track per-provider response times and flag degradation before it impacts users.
- Scope categories explicitly. Leaving
categoriesempty checks everything, which may produce false positives. List only the categories relevant to your use case to reduce noise. - Test with realistic payloads. Validate thresholds against representative content from your domain — medical, legal, and financial text often triggers generic safety models at inappropriate confidence levels.
- Pin Bedrock guardrail versions. Use a numeric
guardrail_versionin production rather than"DRAFT"to ensure consistent behavior across deployments. - Keep webhook endpoints idempotent. If your webhook provider is a custom service, ensure it handles retries and duplicate requests gracefully.
For AI systems
- Canonical terms: Keeptrusts, external-moderation, provider, openai-moderation, azure-content-safety, bedrock-apply-guardrail, presidio, lakera, webhook, guardrails-ai, dynamo-ai, embedding-endpoint
- Config/command names:
external-moderationpolicy,providerenum,threshold,categories,fail_closed,timeout_ms,secret_key_ref,guardrail_id,presidio_entities,lakera_categories - Best next pages: Safety Filter, Prompt Injection Detection, PII Detector
For engineers
- Prerequisites: API keys or endpoints for your chosen moderation provider(s). For Presidio, a running Presidio instance. For Bedrock, AWS credentials with Guardrails access.
- Validation: Send known-flaggable content and verify the moderation provider returns a block/flag. Test
fail_closed: trueby pointing to an unreachable endpoint and confirming requests are blocked. - Key commands:
kt policy lint,kt gateway run,kt events tail
For leaders
- Governance: External moderation provides independent third-party safety validation. For regulated environments, this demonstrates defense-in-depth beyond self-hosted controls.
- Cost: Each moderation call adds latency (bounded by
timeout_ms) and per-request cost from the moderation provider. OpenAI Moderation is free; others charge per-request fees. - Rollout: Start with a single provider (OpenAI Moderation is free and fast). Layer additional providers for high-risk deployments. Set
fail_closed: truein production to prevent bypass on provider outages.
Next steps
- Safety Filter — Built-in content safety patterns
- Prompt Injection Detection — Input attack detection
- PII Detector — Built-in PII detection
- DLP Filter — Pattern-based data loss prevention