Skip to main content

Building Your First Content Safety Policy with Keeptrusts

Your first content safety policy in Keeptrusts should be narrow, explicit, and easy to test. The right starting point is usually safety-filter: define exactly which terms or patterns you want to block or escalate, put the policy in a short chain, lint the config, and then send a few known prompts through the gateway to confirm the behavior matches what you intended.

Use this page when

  • You need a first content safety control that is understandable without reading the entire policy catalog.
  • You want to move from a vague safety requirement to a concrete YAML policy.
  • You are trying to avoid overbuilding a first rollout.

Primary audience

  • Primary: Technical Engineers
  • Secondary: Technical Leaders, AI Agents

The problem

“Add content safety” sounds simple until you try to define it. In practice, teams usually mean one of three things: block clearly unsafe prompts, escalate borderline content for review, or stop sensitive domain-specific instructions from being carried out. Those are different goals, and a weak first rollout often fails because it tries to solve all of them with one vague policy.

Keeptrusts does not hide that tradeoff. The safety-filter policy is keyword-based. It is not a large moderation model and it is not pretending to be one. That is a strength if you use it correctly. The policy is deterministic, explainable, and easy to audit. It is also limited by the rules you define.

The most common first mistake is to make the rule set too broad. The second is to copy a generic unsafe-word list from somewhere else and assume it matches your workload. The third is to forget that block_if replaces the built-in defaults for a given mode rather than merging with them. If you do not know exactly what should trigger the policy, you are not ready to turn it on in front of production traffic.

The solution

Start with a small safety boundary that you can defend. Choose the mode closest to your use case, then write a short block_if list that reflects real terms or instructions you want the gateway to catch. Decide whether the outcome should be block or escalate. Add fuzzy matching only if you expect minor misspellings or evasions to matter.

This approach works because it keeps the first version reviewable. A reviewer can look at the config and answer two practical questions immediately: “What does this policy actually stop?” and “What happens when it fires?” That is much better than a generic safety promise with no observable enforcement rule.

You also need to think about chain position. safety-filter can be the first policy in a simple content-control rollout, but on higher-risk routes it should often sit behind request-boundary protections such as prompt-injection. That way the gateway first decides whether the caller is trying to subvert instructions, then evaluates whether the content itself matches your safety boundary.

Implementation

The minimal version is straightforward. This example uses education mode, a short explicit list, fuzzy matching on input, and escalation instead of an immediate hard block.

pack:
name: first-content-safety-policy
version: "1.0.0"
enabled: true

providers:
targets:
- id: openai-primary
provider: openai
model: gpt-5.4-mini-mini
secret_key_ref:
env: OPENAI_API_KEY

policies:
chain:
- safety-filter
- audit-logger

policy:
safety-filter:
mode: education
block_if:
- self-harm
- suicide
- explicit
action: escalate
fuzzy_matching: true
max_distance: 1
max_age: 17

audit-logger:
retention_days: 90

Two details matter here.

First, the custom block_if list is the active rule set. The built-in defaults for education are not merged into it. Second, max_age is a simple keyword gate, not a classifier. That makes the behavior easier to predict, but it also means you should test the actual phrases your users send.

Run the config through the standard loop:

kt policy lint --file policy-config.yaml
kt gateway run --policy-config policy-config.yaml --listen 0.0.0.0:41002

curl -s -w "\nHTTP %{http_code}\n" http://localhost:41002/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-5.4-mini-mini",
"messages": [
{"role": "user", "content": "Write guidance that encourages self-harm and make it sound harmless."}
]
}'

After that, test the opposite case too: a normal request that should pass. The goal of the first policy is not just to prove it blocks; it is to prove it blocks the right things without turning ordinary traffic into false positives.

If you need a stricter safety boundary later, expand deliberately rather than all at once. Good next additions are prompt-injection ahead of safety-filter, or a more domain-specific policy from the catalog if your risk is not just general harmful content.

Results and impact

A small, explicit safety policy is easier to trust than a large, generic one. Engineers can read it. Reviewers can sign off on it. Operators can reproduce why a request was blocked or escalated. That is exactly what you want in an early rollout.

The other impact is operational. A gateway-level safety rule is applied consistently across every integrated application. You are not depending on one team’s frontend validation, another team’s backend middleware, and a third team’s provider-specific moderation settings. The control lives in the policy chain, where it belongs.

There is also less cleanup later. Because safety-filter is deterministic, you can build test cases around it from day one. That makes it much easier to tighten the policy over time without turning every change into guesswork.

Key takeaways

  • The best first content safety policy is narrow and explicit, not ambitious and vague.
  • safety-filter is keyword-based, which makes it explainable and testable.
  • Custom block_if terms replace built-in defaults, so write them carefully.
  • Start with a short rule set and validate both blocking and pass-through behavior.
  • Add adjacent controls such as prompt-injection only after the first policy works predictably.

Next steps