Custom Data Classification with Regex Patterns and Embeddings
The most reliable way to build custom data classification in Keeptrusts is to combine exact pattern controls with similarity-based detection instead of forcing one technique to do everything. Use regexes and literal terms when you know the shape of the data, use Embedding Detector when you need semantic near-match detection, and add PII Detector or DLP Filter depending on whether the right outcome is shared redaction or custom block-or-redact logic.
Use this page when
- You need to classify organization-specific secrets, identifiers, codenames, or sensitive phrases that Keeptrusts does not ship as a built-in catalog.
- You want to understand when regexes are enough and when similarity-based detection is worth adding.
- You need to tune detection without inventing a custom sidecar service around the gateway.
Primary audience
- Primary: Technical Engineers and policy authors
- Secondary: Technical Leaders responsible for data-governance coverage
Start with the question, not the tool
Before you choose a policy, ask what kind of data you are trying to classify.
If the data has a stable shape, regex is usually the right first answer.
Examples:
- internal employee IDs such as EMP-442871
- account references such as ACCT-12345678
- API-key prefixes or structured ticket IDs
If the data is more conceptual, regex is usually not enough.
Examples:
- a prompt that paraphrases a trade-secret concept without using the exact codename
- a message that hints at a proprietary process using similar wording
- euphemistic references to export-controlled or sensitive internal topics
That is where similarity-based controls become useful.
What each policy is best at
Keeptrusts already gives you the right building blocks. You do not need to overload one policy with every responsibility.
pii-detector for redaction-oriented custom identifiers
PII Detector supports detect_patterns for custom regexes. Those matches are recorded as generic_id, and the policy can return redact or block depending on action.
Use it when you want structured identifiers scrubbed through the shared redaction pipeline.
dlp-filter for organization-specific regexes and term lists
DLP Filter evaluates your configured detect_patterns and blocked_terms. It can either block or return a redact verdict, and it also supports fuzzy matching for near-miss terms.
Use it when you want a custom content gate for secrets, codenames, domains, or classification phrases that are specific to your environment.
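To make the idea of fuzzy matching on near-miss terms concrete, here is a minimal Python sketch of edit-distance matching. It is an illustration only: the helper names `edit_distance` and `fuzzy_contains` are hypothetical, and the source does not state which distance metric or tokenization DLP Filter actually uses. The sketch assumes Levenshtein distance over word windows.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from a
                curr[j - 1] + 1,     # insert a character into a
                prev[j - 1] + cost,  # substitute (or keep) a character
            ))
        prev = curr
    return prev[-1]


def fuzzy_contains(text: str, term: str, max_distance: int = 1) -> bool:
    """True if some word window in text is within max_distance edits of term."""
    words = text.lower().split()
    target = term.lower()
    width = len(target.split())
    for start in range(len(words) - width + 1):
        window = " ".join(words[start:start + width])
        if edit_distance(window, target) <= max_distance:
            return True
    return False


# A one-character misspelling is caught; an unrelated phrase is not.
print(fuzzy_contains("status of project titen please", "Project Titan"))
print(fuzzy_contains("the project timeline is fine", "Project Titan"))
```

With a distance of 1, a single typo in a codename still matches, while longer unrelated words stay clear. Larger distances widen the net quickly, which is one reason to keep the value small for short terms.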
embedding-detector for semantic lookalikes
Embedding Detector performs similarity-based detection with a built-in local bag-of-words cosine-similarity backend, plus an optional external embedding path when the CLI is built with the embedding-external feature.
Use it when exact matching misses too many near variants.
That combination is the real pattern:
- exact patterns for precision
- similarity categories for recall
A layered custom-classification config
This pack uses all three layers without inventing unsupported fields:
pack:
  name: custom-classification-stack
  version: 1.0.0
  enabled: true
policies:
  chain:
    - pii-detector
    - dlp-filter
    - embedding-detector
  policy:
    pii-detector:
      action: redact
      detect_patterns:
        - 'EMP-\d{6}'
        - 'ACCT-\d{8,12}'
      redaction:
        marker_format: label
        include_metadata: true
        custom_markers:
          generic_id: "[REDACTED-ID]"
    dlp-filter:
      detect_patterns:
        - 'PRJ-[A-Z]{3}-\d{4}'
        - 'AKIA[0-9A-Z]{16}'
      blocked_terms:
        - Project Titan
        - board-only forecast
      action: block
      fuzzy_matching: true
      max_distance: 1
    embedding-detector:
      backend: local
      similarity_threshold: 0.8
      action: block
      categories:
        - label: trade_secret
          reference_text: proprietary manufacturing process internal formula secret recipe
        - label: competitive_intel
          reference_text: competitor pricing strategy acquisition target market positioning
This example works because each layer has a clear job.
- pii-detector sanitizes structured identifiers and emits redaction metadata.
- dlp-filter blocks organization-specific secrets and terms that should never leave the boundary.
- embedding-detector catches semantically similar descriptions that exact patterns may miss.
The regex side: where precision comes from
Regex and exact terms are still the foundation of most custom classification.
That is because they are deterministic, cheap, and easy to audit.
dlp-filter.detect_patterns is the right place for exact pattern classes that are unique to your organization. blocked_terms is the right place for literal names, domains, or phrases that should trigger even without a regex structure.
pii-detector.detect_patterns is slightly different. It is best when the thing you are matching should flow through the shared redaction engine. The docs are explicit that these custom regex matches are recorded as generic_id, not as new custom labels. That is a useful constraint to remember during design.
If you need a stable custom category label for policy logic, embedding-detector.categories[].label is often the cleaner place to express the taxonomy.
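The generic_id constraint is easier to see in a small local experiment. The Python sketch below applies the two identifier patterns from the example pack with the standard `re` module and tags every hit with the same label. It is not the Keeptrusts redaction engine: the `redact` function and the shape of the findings are assumptions made for illustration.

```python
import re

# Patterns copied from the pii-detector block of the example pack.
DETECT_PATTERNS = [r"EMP-\d{6}", r"ACCT-\d{8,12}"]
# Mirrors custom_markers.generic_id in the example pack.
MARKER = "[REDACTED-ID]"


def redact(text: str) -> tuple[str, list[dict]]:
    """Replace every pattern match with the marker and list what was found."""
    findings = []
    for pattern in DETECT_PATTERNS:
        for match in re.finditer(pattern, text):
            # Every custom regex hit carries the same generic label,
            # regardless of which pattern produced it.
            findings.append({"label": "generic_id", "value": match.group()})
        text = re.sub(pattern, MARKER, text)
    return text, findings


clean, found = redact("Ask EMP-442871 about ACCT-12345678")
print(clean)
print([f["label"] for f in found])
```

Note that these patterns are unanchored, so in Python `EMP-\d{6}` also matches the first six digits of a longer number such as EMP-4428719. Whether Keeptrusts applies any boundary handling is not stated in the source, so it is worth covering that case in tests and adding `\b` where a strict length matters.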
The embedding side: where recall comes from
Similarity detection matters when users do not write the same phrase every time.
This is common in real enterprise traffic. People paraphrase. They misspell codenames. They describe a trade secret without using the official internal label.
That is what Embedding Detector is for.
Its default local mode is important. You do not need to stand up a separate embedding service just to get started. The built-in local backend gives you a practical first semantic layer, and the docs recommend starting there before considering an external endpoint.
That should influence rollout strategy.
- Start with a precise regex and term baseline.
- Add local embedding categories for the concepts exact matching misses.
- Raise or lower similarity_threshold based on false positives and misses.
- Consider the external backend only if you have a strong reason to accept the extra dependency.
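To build intuition for threshold tuning, the Python sketch below implements a plain bag-of-words cosine similarity and scores a prompt against the two reference texts from the example pack. It is a toy model under stated assumptions: the tokenization, the absence of stop-word handling or weighting, and the helper names are all hypothetical, and the actual local backend may score text differently.

```python
import math
import re
from collections import Counter

# Reference texts copied from the embedding-detector block of the example pack.
CATEGORIES = {
    "trade_secret": "proprietary manufacturing process internal formula secret recipe",
    "competitive_intel": "competitor pricing strategy acquisition target market positioning",
}


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_category(prompt: str) -> tuple[str, float]:
    """Return the highest-scoring category label and its score."""
    scores = {label: cosine_similarity(prompt, ref) for label, ref in CATEGORIES.items()}
    label = max(scores, key=scores.get)
    return label, scores[label]


prompt = "our internal formula and secret recipe for the manufacturing process"
print(best_category(prompt))
```

In this toy scorer the prompt shares six of the seven reference words yet scores roughly 0.72, below the 0.8 used in the example pack, because filler words dilute the vector. That does not predict how the Keeptrusts backend will score the same prompt; it only shows why a threshold should be set from representative test prompts.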
How to connect custom classification to downstream governance
Custom classification is rarely the final goal. It is usually the first control in a larger workflow.
If the classified content is structured sensitive data, redact it.
If the content should never cross the boundary, block it.
If the content is part of a regulated output workflow, pair the request-side classification with output-side controls.
For example:
- use Financial Compliance when the generated answer could drift into investment guidance
- use Healthcare Compliance when the generated answer could look like treatment or diagnosis advice
- use Human Oversight or routed review when the remaining cases require accountable human judgment
That is the bigger design principle: custom classification should narrow the problem before the heavier downstream controls engage.
How to validate the classification pack
Do not guess whether a regex or similarity threshold is right. Test it.
kt policy lint --file policy-config.yaml
kt policy test --json
For a custom classifier, the test pack should include at least four case types:
- exact matches that must trigger
- near matches that should trigger only with fuzzy or embedding detection
- benign content that should be allowed
- realistic mixed prompts where the content includes both safe and sensitive text
That is especially important with embeddings. A threshold that looks good on paper can be noisy in real phrasing. The local backend is intentionally simple, so representative tests matter more than theoretical confidence.
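One way to keep the four case types organized before translating them into policy tests is a small table of prompts and expectations. The Python sketch below is a hypothetical harness, not the kt test format: the `Case` structure and `exact_layer_hits` function are assumptions, and the checker models only exact regex and term matching using the dlp-filter values from the example pack.

```python
import re
from dataclasses import dataclass

# Values copied from the dlp-filter block of the example pack.
PATTERNS = [r"PRJ-[A-Z]{3}-\d{4}", r"AKIA[0-9A-Z]{16}"]
TERMS = ["project titan", "board-only forecast"]


@dataclass
class Case:
    kind: str         # exact, near, benign, or mixed
    prompt: str
    expect_hit: bool  # expectation for the exact layer alone


def exact_layer_hits(prompt: str) -> bool:
    """Model exact matching only: regex search plus case-insensitive terms."""
    lowered = prompt.lower()
    return (any(re.search(p, prompt) for p in PATTERNS)
            or any(term in lowered for term in TERMS))


CASES = [
    Case("exact", "Summarize PRJ-ABC-1234 for me", True),
    # Misspelled codename: the exact layer should miss it, which is the
    # gap fuzzy matching or similarity detection is meant to close.
    Case("near", "What is the status of Project Titen?", False),
    Case("benign", "Draft a welcome note for new hires", False),
    Case("mixed", "Fix the typo in this memo, then attach the board-only forecast", True),
]


def failures(cases: list[Case]) -> list[Case]:
    """Return the cases where the exact layer disagrees with the expectation."""
    return [c for c in cases if exact_layer_hits(c.prompt) != c.expect_hit]


print(failures(CASES))
```

Recording the expectation per layer makes it explicit which cases rely on fuzzy or embedding detection, so a later threshold change that silently stops catching them shows up as a failed test.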
Common design mistakes
One mistake is using embedding-detector as the first tool for data that already has a stable pattern. Regex is better there.
Another mistake is pushing every custom class into pii-detector when the real need is not redaction but a broader block-or-review decision. That is what dlp-filter and embedding-detector are for.
A third mistake is expecting custom regexes in pii-detector to create distinct category labels. They do not. They surface as generic_id unless you build the taxonomy elsewhere.
Key takeaways
- Use regexes and literal terms first for exact, auditable classification.
- Use pii-detector when the right outcome is shared redaction.
- Use dlp-filter for custom block-or-redact term and regex enforcement.
- Use embedding-detector when near-match or conceptual detection matters.
- Pair custom classification with finance, healthcare, or human-review controls when the downstream decision is about generated output.