Skip to main content
Browse docs

External Moderation

External moderation policies route LLM inputs and outputs through third-party content safety, PII detection, or custom moderation services before allowing traffic to proceed. Keeptrusts supports nine providers out of the box — from managed APIs like OpenAI Moderation and Azure Content Safety to self-hosted options like Presidio and generic webhooks — letting you layer multiple moderation checks into a single policy pipeline.

Use this page when

  • You need to route AI inputs/outputs through third-party content safety services (OpenAI Moderation, Azure Content Safety, Bedrock Guardrails, Lakera, Presidio, etc.).
  • You are layering multiple moderation providers for defense-in-depth content safety.
  • You need self-hosted PII detection via Presidio or custom webhook-based moderation.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Configuration

Fields

PropertyTypeDefaultDescription
providerenum"webhook"Moderation provider. One of "openai-moderation", "azure-content-safety", "bedrock-apply-guardrail", "embedding-endpoint", "webhook", "presidio", "guardrails-ai", "dynamo-ai", "lakera".
secret_key_refstringName of the environment variable that holds the API key for the chosen provider. Never put raw secrets in the policy file.
endpointstring (uri)Full URL of the provider's moderation endpoint. Required for webhook, embedding-endpoint, presidio, guardrails-ai, and dynamo-ai. Optional when the provider has a well-known default (e.g., OpenAI, Lakera).
categoriesstring[][]Content categories the provider should evaluate. Interpretation is provider-specific (e.g., OpenAI category names, Azure category names). An empty array means "check all available categories".
thresholdnumber0.5Confidence score between 0 and 1. Content flagged at or above this threshold is blocked. Lower values are more aggressive; higher values allow more borderline content through.
timeout_msinteger3000Maximum time in milliseconds to wait for the moderation provider to respond. Must be ≥ 1.
fail_closedbooleanfalseBehavior when the provider is unreachable or returns an error. true blocks the request (fail-closed / deny-by-default). false allows it through (fail-open).
webhook_headersobject{}Additional HTTP headers sent with every request to a webhook provider. Useful for authentication tokens, tracing headers, or tenant identifiers.
aws_regionstringAWS region for the Bedrock Guardrails API (e.g., "us-east-1", "eu-west-1"). Required when provider is "bedrock-apply-guardrail".
aws_access_key_envstringEnvironment variable holding the AWS access key ID. Required for Bedrock unless instance-profile or IRSA credentials are available.
aws_secret_key_envstringEnvironment variable holding the AWS secret access key. Required alongside aws_access_key_env.
aws_session_token_envstringEnvironment variable holding a temporary AWS session token. Optional; used with STS-assumed roles or federated credentials.
guardrail_idstringBedrock guardrail identifier (e.g., "abc123"). Found in the AWS Bedrock console under Guardrails. Required when provider is "bedrock-apply-guardrail".
guardrail_versionstringVersion of the Bedrock guardrail to apply. Use "DRAFT" during development or a numeric version like "1" in production. Required when provider is "bedrock-apply-guardrail".
embedding_modelstringModel identifier for the embedding endpoint (e.g., "text-embedding-3-small", "all-MiniLM-L6-v2"). Required when provider is "embedding-endpoint".
reference_textsstring[][]Reference texts for embedding-based similarity detection. Content whose embedding is too close to any reference text (above threshold) is flagged. Required when provider is "embedding-endpoint".
presidio_languagestringLanguage code for Presidio's NLP engine (e.g., "en", "de", "es"). Required when provider is "presidio".
presidio_entitiesstring[][]Presidio entity types to detect (e.g., PHONE_NUMBER, EMAIL_ADDRESS, CREDIT_CARD, US_SSN). An empty array means detect all supported entity types.
guard_namestringName of the Guardrails AI guard to execute. Must match a guard registered in your Guardrails AI server. Required when provider is "guardrails-ai".
policy_idstringDynamo AI policy identifier. Found in the Dynamo AI dashboard. Required when provider is "dynamo-ai".
lakera_categoriesstring[][]Lakera Guard category filters (e.g., "prompt_injection", "jailbreak", "pii", "toxicity"). An empty array checks all categories.

Provider-Specific Configuration

OpenAI Moderation

Uses OpenAI's Moderation API to classify content across safety categories including harassment, hate speech, self-harm, sexual content, and violence. Each category returns a confidence score; content is blocked when any checked category exceeds the threshold.

Required fields: secret_key_ref

Optional fields: endpoint (defaults to https://api.openai.com/v1/moderations), categories, threshold, timeout_ms, fail_closed

Supported categories: harassment, harassment/threatening, hate, hate/threatening, self-harm, self-harm/instructions, self-harm/intent, sexual, sexual/minors, violence, violence/graphic

Webhook verdict mapping

Generic webhook providers may return either a boolean flagged field or an explicit textual verdict in verdict or action.

Keeptrusts treats deny, block, blocked, reject, rejected, flagged, and review as blocking verdicts. allow, none, and omitted verdicts with flagged: false are treated as allow decisions.

{
"verdict": "deny",
"reason": "prompt injection candidate",
"scores": {
"prompt_injection": 0.97
}
}

Azure Content Safety

Uses Microsoft Azure Content Safety for enterprise-grade content moderation. Azure returns severity levels (0–6) per category; Keeptrusts normalizes these to the 0–1 threshold scale.

Required fields: secret_key_ref, endpoint

Optional fields: categories, threshold, timeout_ms, fail_closed

Supported categories: Hate, SelfHarm, Sexual, Violence

AWS Bedrock Guardrails

Uses Amazon Bedrock's ApplyGuardrail API to enforce guardrails configured in the AWS console. Supports content filters, denied topics, word filters, sensitive information filters, and contextual grounding checks.

Required fields: aws_region, guardrail_id, guardrail_version, and at least one of (aws_access_key_env + aws_secret_key_env) or instance-profile credentials.

Optional fields: aws_session_token_env, categories, threshold, timeout_ms, fail_closed

For temporary credentials (e.g., from aws sts assume-role), also set aws_session_token_env:

Embedding Endpoint

Routes content through a custom embedding model and flags it when the cosine similarity to any reference_texts entry exceeds the threshold. This enables semantic similarity-based moderation — useful for detecting content that is topically close to known-harmful examples even when exact keywords differ.

Required fields: endpoint, embedding_model, reference_texts

Optional fields: secret_key_ref, threshold, timeout_ms, fail_closed

Webhook

Sends an HTTP POST with the content to any URL. The webhook must return a JSON response with a flagged boolean and optional categories array. This is the default provider and the most flexible — use it to integrate with in-house moderation services, serverless functions, or any HTTP-accessible classifier.

Required fields: endpoint

Optional fields: secret_key_ref, webhook_headers, categories, threshold, timeout_ms, fail_closed

Expected response format:

{
"flagged": true,
"score": 0.92,
"categories": ["toxicity", "threat"]
}

Presidio

Uses Microsoft Presidio for PII (Personally Identifiable Information) detection and de-identification. Presidio runs as a self-hosted service and supports dozens of entity types across multiple languages.

Required fields: endpoint, presidio_language

Optional fields: presidio_entities, secret_key_ref, threshold, timeout_ms, fail_closed

Common entity types: PERSON, PHONE_NUMBER, EMAIL_ADDRESS, CREDIT_CARD, US_SSN, US_PASSPORT, IBAN_CODE, IP_ADDRESS, MEDICAL_LICENSE, LOCATION, DATE_TIME, NRP (nationality/religious/political group)

Guardrails AI

Executes a named guard from Guardrails AI. Guards are composable validation pipelines that can check for toxicity, PII, hallucination, competitor mentions, off-topic responses, and more.

Required fields: endpoint, guard_name

Optional fields: secret_key_ref, threshold, timeout_ms, fail_closed

Dynamo AI

Evaluates content against a policy defined in Dynamo AI. Policies in Dynamo AI can cover regulatory compliance, brand safety, and custom business rules.

Required fields: endpoint, policy_id

Optional fields: secret_key_ref, threshold, timeout_ms, fail_closed

Lakera

Uses Lakera Guard for prompt injection detection, jailbreak prevention, PII detection, and content safety. Lakera specializes in adversarial-input defense for LLM applications.

Required fields: secret_key_ref

Optional fields: endpoint (defaults to Lakera's hosted API), lakera_categories, threshold, timeout_ms, fail_closed

Supported categories: prompt_injection, jailbreak, pii, toxicity, unknown_links

Use Cases

Content Safety with OpenAI Moderation

Block harmful content across standard safety categories before it reaches the model or the end user.

Enterprise Content Safety with Azure

Organizations using Azure can leverage Azure Content Safety for compliant, region-specific moderation with data residency guarantees.

AWS-Native Guardrails with Bedrock

Teams on AWS can enforce centrally-managed Bedrock guardrails across all LLM traffic passing through Keeptrusts, without modifying application code.

PII Detection with Presidio

Detect and block messages containing personally identifiable information before they are sent to an LLM, helping comply with GDPR, HIPAA, and other data-protection regulations.

Custom Webhook Integration

Integrate your own moderation logic — a FastAPI service, a Lambda function, or any HTTP endpoint — into the Keeptrusts pipeline.

Prompt Injection Defense with Lakera

Protect against prompt injection and jailbreak attempts at the input layer, before prompts reach the model.

Multi-Provider Moderation Pipeline

Chain multiple moderation providers for defense-in-depth. Policies are evaluated in order — Lakera catches prompt injections first, Presidio strips PII next, and OpenAI Moderation enforces content safety last.

How It Works

  1. Interception — When a request or response passes through the Keeptrusts gateway, each external-moderation policy in the pipeline is evaluated in order.
  2. Dispatch — Keeptrusts serializes the content and sends it to the configured provider's endpoint using the provider-specific protocol (REST API, AWS SDK call, or webhook POST).
  3. Evaluation — The provider analyzes the content and returns category scores, a flagged boolean, or a pass/fail decision depending on the provider type.
  4. Threshold check — Keeptrusts compares the returned scores against threshold. If any checked category meets or exceeds the threshold, the content is flagged.
  5. Decision — Flagged content is blocked and an event is recorded with the moderation result, provider, flagged categories, and scores. Unflagged content proceeds to the next policy or to the upstream model.
  6. Error handling — If the provider is unreachable or returns an error within timeout_ms, the fail_closed setting determines whether the request is blocked (safe default) or allowed through (availability-first).

Combining With Other Policies

External moderation works well alongside other Keeptrusts policy types:

  • Keyword filters → Use keyword policies for fast, deterministic blocking of known-bad terms, and external moderation for nuanced semantic checks.
  • Quality scorer → Run quality scoring on model outputs after external moderation has verified content safety on inputs.
  • Rate limiting → Apply rate limits before external moderation to avoid unnecessary API calls against moderation providers during traffic spikes.
  • Redaction → Chain Presidio moderation (to detect PII) with a redaction policy (to mask it) instead of blocking outright.
  • Disclaimers → Add disclaimers to responses that pass moderation but touch sensitive topics.

Best Practices

  1. Start fail-open, move to fail-closed. Begin with fail_closed: false while calibrating thresholds, then switch to true once you have confidence in provider reliability and latency.
  2. Tune thresholds per category. A threshold of 0.5 is a reasonable starting point, but lower it for high-risk categories (hate speech, self-harm) and raise it for noisier ones (sexual innuendo in medical contexts).
  3. Set realistic timeouts. Cloud moderation APIs typically respond in 100–500 ms. Set timeout_ms to 2–3× the observed P99 latency to avoid spurious timeouts while still protecting overall request latency.
  4. Use environment variables for all secrets. Never hard-code API keys in policy YAML. Use secret_key_ref, aws_access_key_env, and aws_secret_key_env to reference secrets from the environment.
  5. Layer providers for defense-in-depth. No single provider catches everything. Combine a prompt-injection specialist (Lakera) with a PII detector (Presidio) and a general content safety provider (OpenAI, Azure) for comprehensive coverage.
  6. Monitor moderation latency. External moderation adds latency to every request. Use Keeptrusts's trace and event data to track per-provider response times and flag degradation before it impacts users.
  7. Scope categories explicitly. Leaving categories empty checks everything, which may produce false positives. List only the categories relevant to your use case to reduce noise.
  8. Test with realistic payloads. Validate thresholds against representative content from your domain — medical, legal, and financial text often triggers generic safety models at inappropriate confidence levels.
  9. Pin Bedrock guardrail versions. Use a numeric guardrail_version in production rather than "DRAFT" to ensure consistent behavior across deployments.
  10. Keep webhook endpoints idempotent. If your webhook provider is a custom service, ensure it handles retries and duplicate requests gracefully.

For AI systems

  • Canonical terms: Keeptrusts, external-moderation, provider, openai-moderation, azure-content-safety, bedrock-apply-guardrail, presidio, lakera, webhook, guardrails-ai, dynamo-ai, embedding-endpoint
  • Config/command names: external-moderation policy, provider enum, threshold, categories, fail_closed, timeout_ms, secret_key_ref, guardrail_id, presidio_entities, lakera_categories
  • Best next pages: Safety Filter, Prompt Injection Detection, PII Detector

For engineers

  • Prerequisites: API keys or endpoints for your chosen moderation provider(s). For Presidio, a running Presidio instance. For Bedrock, AWS credentials with Guardrails access.
  • Validation: Send known-flaggable content and verify the moderation provider returns a block/flag. Test fail_closed: true by pointing to an unreachable endpoint and confirming requests are blocked.
  • Key commands: kt policy lint, kt gateway run, kt events tail

For leaders

  • Governance: External moderation provides independent third-party safety validation. For regulated environments, this demonstrates defense-in-depth beyond self-hosted controls.
  • Cost: Each moderation call adds latency (bounded by timeout_ms) and per-request cost from the moderation provider. OpenAI Moderation is free; others charge per-request fees.
  • Rollout: Start with a single provider (OpenAI Moderation is free and fast). Layer additional providers for high-risk deployments. Set fail_closed: true in production to prevent bypass on provider outages.

Next steps