Custom Data Classification with Regex Patterns and Embeddings

The most reliable way to build custom data classification in Keeptrusts is to combine exact pattern controls with similarity-based detection instead of forcing one technique to do everything. Use regexes and literal terms when you know the shape of the data, and use Embedding Detector when you need semantic near-match detection. For the exact-match layer, choose PII Detector when the right outcome is shared redaction, or DLP Filter when you need custom block-or-redact logic.

Use this page when

You need to classify organization-specific secrets, identifiers, codenames, or sensitive phrases that Keeptrusts does not ship as a built-in catalog.
You want to understand when regexes are enough and when similarity-based detection is worth adding.
You need to tune detection without inventing a custom sidecar service around the gateway.

Primary audience

Primary: Technical Engineers and policy authors
Secondary: Technical Leaders responsible for data-governance coverage

Start with the question, not the tool

Before you choose a policy, ask what kind of data you are trying to classify.

If the data has a stable shape, regex is usually the right first answer.

Examples:

internal employee IDs such as EMP-442871
account references such as ACCT-12345678
API-key prefixes or structured ticket IDs
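
To make "stable shape" concrete, here is a minimal Python sketch that classifies text with the two identifier shapes listed above. It runs outside Keeptrusts and is only meant for prototyping a pattern before it goes into a policy. The label names `employee_id` and `account_ref`, the `classify` helper, and the added `\b` word boundaries are assumptions made for illustration.

```python
import re

# Shapes taken from the examples above. The labels and the \b word
# boundaries are illustrative assumptions, not Keeptrusts identifiers.
PATTERNS = {
    "employee_id": re.compile(r"\bEMP-\d{6}\b"),
    "account_ref": re.compile(r"\bACCT-\d{8,12}\b"),
}


def classify(text: str) -> dict[str, list[str]]:
    """Return every match per label, skipping labels with no hits."""
    hits: dict[str, list[str]] = {}
    for label, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[label] = found
    return hits


if __name__ == "__main__":
    print(classify("Escalate EMP-442871 on ACCT-12345678 today"))
```

The word boundaries keep a longer digit run such as EMP-4428711 from being treated as a six-digit employee ID. Whether you want that strictness depends on how your identifiers appear in real traffic.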

If the data is more conceptual, regex is usually not enough.

Examples:

a prompt that paraphrases a trade-secret concept without using the exact codename
a message that hints at a proprietary process using similar wording
euphemistic references to export-controlled or sensitive internal topics

That is where similarity-based controls become useful.

What each policy is best at

Keeptrusts already gives you the right building blocks. You do not need to overload one policy with every responsibility.

`pii-detector` for redaction-oriented custom identifiers

PII Detector supports detect_patterns for custom regexes. Those matches are recorded as generic_id, and the policy can return redact or block depending on action.

Use it when you want structured identifiers scrubbed through the shared redaction pipeline.
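
As a rough mental model of redaction-oriented matching, the sketch below replaces every custom-pattern match with a single marker and counts the hits. It is not the Keeptrusts redaction engine. The `redact` helper is hypothetical, and the marker text mirrors the `generic_id` marker used in the example config on this page.

```python
import re

# Custom regexes of the kind you would list under detect_patterns.
CUSTOM_PATTERNS = [r"EMP-\d{6}", r"ACCT-\d{8,12}"]

# All custom matches share one marker, mirroring the generic_id behavior
# described above. The marker text itself is an illustrative choice.
MARKER = "[REDACTED-ID]"

# Wrap each pattern in a non-capturing group so alternation stays scoped.
COMBINED = re.compile("|".join(f"(?:{p})" for p in CUSTOM_PATTERNS))


def redact(text: str) -> tuple[str, int]:
    """Return the redacted text and the number of replacements made."""
    return COMBINED.subn(MARKER, text)


if __name__ == "__main__":
    print(redact("EMP-442871 owns ACCT-12345678"))
```

Because every match collapses to the same marker, the output cannot tell an employee ID from an account reference. That is the practical consequence of custom matches being recorded as generic_id.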

`dlp-filter` for organization-specific regexes and term lists

DLP Filter evaluates your configured detect_patterns and blocked_terms. It can either block or return a redact verdict, and it also supports fuzzy matching for near-miss terms.

Use it when you want a custom content gate for secrets, codenames, domains, or classification phrases that are specific to your environment.
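
To build intuition for what fuzzy matching with a small distance budget catches, here is a self-contained sketch based on Levenshtein edit distance over word windows. This is an assumption about the general technique, not a description of how DLP Filter implements fuzzy matching. The `fuzzy_hit` helper and its windowing strategy are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def fuzzy_hit(text: str, term: str, max_distance: int = 1) -> bool:
    """True if any word window in text is within max_distance of term."""
    words = text.lower().split()
    term_words = term.lower().split()
    n = len(term_words)
    target = " ".join(term_words)
    for start in range(len(words) - n + 1):
        window = " ".join(words[start:start + n])
        if edit_distance(window, target) <= max_distance:
            return True
    return False


if __name__ == "__main__":
    print(fuzzy_hit("status of Project Titen today", "Project Titan"))
```

A distance budget of 1 catches a single typo such as "Titen" but not a different word such as "Tundra". Larger budgets widen the net quickly on short terms, so keep them small and test them.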

`embedding-detector` for semantic lookalikes

Embedding Detector performs similarity-based detection with a built-in local bag-of-words cosine-similarity backend, plus an optional external embedding path when the CLI is built with the embedding-external feature.

Use it when exact matching misses too many near variants.
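
The sketch below shows the general bag-of-words cosine-similarity technique in plain Python, so you can reason about scores before tuning a threshold. It is a simplified illustration and not the Keeptrusts backend. The tokenizer (lowercased alphanumeric runs, raw term counts, no stop-word handling) is an assumption.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count alphanumeric tokens (assumed tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    reference = bag_of_words(
        "proprietary manufacturing process internal formula secret recipe"
    )
    prompt = bag_of_words("our secret recipe and internal formula")
    print(round(cosine_similarity(reference, prompt), 3))
```

In this sketch, a prompt that shares four of the seven reference words scores about 0.62. Word-overlap scores drop quickly once a prompt adds words that are not in the reference text, which is worth keeping in mind when you choose reference text and a threshold.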

That combination is the real pattern:

exact patterns for precision
similarity categories for recall

A layered custom-classification config

This pack uses all three layers without inventing unsupported fields:

pack:
  name: custom-classification-stack
  version: 1.0.0
  enabled: true

policies:
  chain:
    - pii-detector
    - dlp-filter
    - embedding-detector

policy:
  pii-detector:
    action: redact
    detect_patterns:
      - 'EMP-\d{6}'
      - 'ACCT-\d{8,12}'
    redaction:
      marker_format: label
      include_metadata: true
      custom_markers:
        generic_id: "[REDACTED-ID]"

  dlp-filter:
    detect_patterns:
      - 'PRJ-[A-Z]{3}-\d{4}'
      - 'AKIA[0-9A-Z]{16}'
    blocked_terms:
      - Project Titan
      - board-only forecast
    action: block
    fuzzy_matching: true
    max_distance: 1

  embedding-detector:
    backend: local
    similarity_threshold: 0.8
    action: block
    categories:
      - label: trade_secret
        reference_text: proprietary manufacturing process internal formula secret recipe
      - label: competitive_intel
        reference_text: competitor pricing strategy acquisition target market positioning

This example works because each layer has a clear job.

pii-detector sanitizes structured identifiers and emits redaction metadata.
dlp-filter blocks organization-specific secrets and terms that should never leave the boundary.
embedding-detector catches semantically similar descriptions that exact patterns may miss.

The regex side: where precision comes from

Regex and exact terms are still the foundation of most custom classification.

That is because they are deterministic, cheap, and easy to audit.

dlp-filter.detect_patterns is the right place for exact pattern classes that are unique to your organization. blocked_terms is the right place for literal names, domains, or phrases that should trigger even without a regex structure.

pii-detector.detect_patterns is slightly different. It is best when the thing you are matching should flow through the shared redaction engine. The docs are explicit that these custom regex matches are recorded as generic_id, not as new custom labels. That is a useful constraint to remember during design.

If you need a stable custom category label for policy logic, embedding-detector.categories[].label is often the cleaner place to express the taxonomy.

The embedding side: where recall comes from

Similarity detection matters when users do not write the same phrase every time.

This is common in real enterprise traffic. People paraphrase. They misspell codenames. They describe a trade secret without using the official internal label.

That is what Embedding Detector is for.

Its default local mode lowers the cost of getting started: you do not need to stand up a separate embedding service. The built-in local backend gives you a practical first semantic layer, and the docs recommend starting there before considering an external endpoint.

That should influence rollout strategy.

Start with a precise regex and term baseline.
Add local embedding categories for the concepts exact matching misses.
Raise or lower similarity_threshold based on false positives and misses.
Consider the external backend only if you have a strong reason to accept the extra dependency.
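
The threshold-tuning step can be approached as a small sweep over labeled samples. The sketch below counts false positives and misses at several candidate thresholds. The sample scores are made up for illustration, the `sweep` helper is hypothetical, and the rule that a score at or above the threshold triggers is an assumption about how the comparison works.

```python
def sweep(samples, thresholds):
    """Count (false_positives, misses) per threshold.

    samples: list of (similarity_score, should_trigger) pairs.
    Assumes a sample triggers when score >= threshold.
    """
    report = {}
    for t in thresholds:
        false_positives = sum(
            1 for score, sensitive in samples if score >= t and not sensitive
        )
        misses = sum(
            1 for score, sensitive in samples if score < t and sensitive
        )
        report[t] = (false_positives, misses)
    return report


if __name__ == "__main__":
    # Illustrative scores only; collect real ones from your own test prompts.
    labeled = [(0.92, True), (0.81, True), (0.74, True),
               (0.78, False), (0.40, False)]
    for threshold, counts in sweep(labeled, [0.7, 0.8, 0.9]).items():
        print(threshold, counts)
```

With these sample values, 0.7 produces one false positive and no misses, while 0.9 produces no false positives and two misses. The right trade-off depends on whether the configured action blocks traffic or only flags it.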

How to connect custom classification to downstream governance

Custom classification is rarely the final goal. It is usually the first control in a larger workflow.

If the classified content is structured sensitive data, redact it.

If the content should never cross the boundary, block it.

If the content is part of a regulated output workflow, pair the request-side classification with output-side controls.

For example:

use Financial Compliance when the generated answer could drift into investment guidance
use Healthcare Compliance when the generated answer could look like treatment or diagnosis advice
use Human Oversight or routed review when the remaining cases require accountable human judgment

That is the bigger design principle: custom classification should narrow the problem before the heavier downstream controls engage.

How to validate the classification pack

Do not guess whether a regex or similarity threshold is right. Test it.

kt policy lint --file policy-config.yaml
kt policy test --json

For a custom classifier, the test pack should include at least four case types:

exact matches that must trigger
near matches that should trigger only with fuzzy or embedding detection
benign content that should allow
realistic mixed prompts where the content includes both safe and sensitive text
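
One way to draft the four case types before writing them into a real test pack is a small table-driven check. The sketch below is not the `kt policy test` format. The `exact_verdict` function is a hypothetical stand-in for an exact-match layer only, and its case-insensitive term matching is an assumption.

```python
import re

# Patterns and terms copied from the dlp-filter example on this page.
DETECT_PATTERNS = [
    re.compile(r"PRJ-[A-Z]{3}-\d{4}"),
    re.compile(r"AKIA[0-9A-Z]{16}"),
]
BLOCKED_TERMS = ["project titan", "board-only forecast"]


def exact_verdict(text: str) -> str:
    """Stand-in exact-match layer: block on any pattern or literal term."""
    if any(p.search(text) for p in DETECT_PATTERNS):
        return "block"
    lowered = text.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return "block"
    return "allow"


# (case type, prompt, verdict expected from the exact-match layer alone)
CASES = [
    ("exact", "Share the Project Titan roadmap", "block"),
    ("near", "Share the Project Titen roadmap", "allow"),
    ("benign", "Summarize the public release notes", "allow"),
    ("mixed", "Thanks for the update. Ticket PRJ-ABC-1234 is attached.", "block"),
]


def run_cases():
    """Return (case type, passed) for every case in the table."""
    return [(kind, exact_verdict(text) == expected)
            for kind, text, expected in CASES]


if __name__ == "__main__":
    for kind, passed in run_cases():
        print(kind, "ok" if passed else "FAILED")
```

The near-match case is expected to pass through the exact layer on purpose. That documents that catching it is the job of fuzzy matching or embedding detection, and it makes a regression visible if an exact pattern is later loosened too far.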

That is especially important with embeddings. A threshold that looks good on paper can produce noisy results on real phrasing. The local backend is intentionally simple, so representative tests matter more than theoretical confidence.

Common design mistakes

One mistake is using embedding-detector as the first tool for data that already has a stable pattern. Regex is better there.

Another mistake is pushing every custom class into pii-detector when the real need is not redaction but a broader block-or-review decision. That is what dlp-filter and embedding-detector are for.

A third mistake is expecting custom regexes in pii-detector to create distinct category labels. They do not. They surface as generic_id unless you build the taxonomy elsewhere.

Key takeaways

Use regexes and literal terms first for exact, auditable classification.
Use pii-detector when the right outcome is shared redaction.
Use dlp-filter for custom block-or-redact term and regex enforcement.
Use embedding-detector when near-match or conceptual detection matters.
Pair custom classification with finance, healthcare, or human-review controls when the downstream decision is about generated output.
