Test Data Governance for AI

Testing AI governance policies requires realistic data — but using real PII, credentials, or sensitive content in test environments creates compliance and security risks. Test data governance ensures your QA processes use safe, representative data while thoroughly validating policy enforcement.

Use this page when

You need to manage PII in test prompts and prevent accidental real data in test environments
You are generating synthetic data that triggers DLP policies without using actual sensitive information
You want to classify test data by sensitivity level and integrate PII scanning into CI/pre-commit hooks

Primary audience

Primary: Technical Engineers
Secondary: AI Agents, Technical Leaders

The Test Data Problem

AI governance tests need data that triggers policies:

DLP tests need content that looks like real SSNs, credit cards, and medical records
Topic control tests need prompts that resemble actual prohibited requests
Redaction tests need responses containing patterns that match real PII formats
Compliance tests need data that represents regulated industry scenarios

Using real data in tests is a compliance violation. Using unrealistic data means your tests don't validate real-world behavior.

PII in Test Prompts

Identifying PII Risk

Audit your test prompt files for accidental PII:

#!/bin/bash
# scan-test-data.sh — detect potential PII in test files

TEST_DIR="test-data"
PII_PATTERNS=(
  '[0-9]{3}-[0-9]{2}-[0-9]{4}'            # SSN
  '[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}'  # Credit card
  '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'     # Email
  '\b([0-9]{1,3}\.){3}[0-9]{1,3}\b'                      # IP address (IPv4, 1-3 digits per octet)
  'DOB[:[:space:]]*[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}'    # Date of birth
)

echo "=== PII Scan: $TEST_DIR ==="
FINDINGS=0

for pattern in "${PII_PATTERNS[@]}"; do
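  # Note: these patterns also match the synthetic conventions on this page
  # (for example 900-00-0001), so treat hits as items to review.
  MATCHES=$(grep -rlE "$pattern" "$TEST_DIR" 2>/dev/null)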
  if [ -n "$MATCHES" ]; then
    echo "WARN: Pattern '$pattern' found in:"
    echo "$MATCHES" | sed 's/^/  /'
    FINDINGS=$((FINDINGS + 1))
  fi
done

if [ "$FINDINGS" -eq 0 ]; then
  echo "PASS: No PII patterns detected in test data"
else
  echo "FAIL: $FINDINGS PII patterns found — review and replace with synthetic data"
  exit 1
fi

Safe Test Data Conventions

Establish conventions for test PII that cannot be mistaken for real data:

Data Type	Real Format	Test Convention	Example
SSN	123-45-6789	900-XX-XXXX range	900-00-0001
Credit Card	4111-1111-1111-1111	Stripe test numbers	4242-4242-4242-4242
Email	user@company.com	@example.com domain	test.user@example.com
Phone	+1-555-123-4567	555-01XX range	+1-202-555-0100
Name	Real names	Fictional names	Jane Testerson
MRN	Varies	TEST-XXXXXXXX	TEST-00012345
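
The conventions in the table can also be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper named is_safe_test_value that is not part of the original tooling; the patterns are one interpretation of the table rows, so adjust them to your own conventions.

```shell
#!/bin/bash
# check-test-convention.sh: hypothetical helper, not part of the original tooling

# is_safe_test_value TYPE VALUE
# Returns 0 when VALUE follows the test convention for TYPE, 1 when it does
# not, and 2 when TYPE is unknown. The patterns are assumptions derived from
# the conventions table.
is_safe_test_value() {
  case "$1" in
    ssn)   pattern='^900-[0-9]{2}-[0-9]{4}$' ;;
    card)  pattern='^4242[- ]?4242[- ]?4242[- ]?4242$' ;;
    email) pattern='^[A-Za-z0-9._%+-]+@example\.com$' ;;
    phone) pattern='^(\+1-)?([0-9]{3}-)?555-01[0-9]{2}$' ;;
    mrn)   pattern='^TEST-[0-9]{8}$' ;;
    *)     echo "unknown type: $1" >&2; return 2 ;;
  esac
  # grep -Eq exits 0 on a match and 1 otherwise, which becomes the return code
  printf '%s\n' "$2" | grep -Eq "$pattern"
}
```

For example, `is_safe_test_value ssn 900-00-0001` succeeds, while `is_safe_test_value email user@company.com` fails because the domain is not example.com.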

Synthetic Data Generation

Generate realistic but fake data for policy testing:

#!/bin/bash
# generate-test-prompts.sh — create synthetic test data

OUTPUT="test-data/synthetic-prompts.json"

cat > "$OUTPUT" << 'EOF'
[
  {
    "id": "dlp-ssn-detect",
    "category": "dlp",
    "prompt": "My social security number is 900-00-0001, can you help me file taxes?",
    "expected_policy": "ssn-redaction",
    "expected_action": "redact"
  },
  {
    "id": "dlp-credit-card",
    "category": "dlp",
    "prompt": "Please process payment for card 4242-4242-4242-4242 exp 12/28",
    "expected_policy": "credit-card-redaction",
    "expected_action": "redact"
  },
  {
    "id": "dlp-email-detect",
    "category": "dlp",
    "prompt": "Send the report to jane.testerson@example.com and cc bob.tester@example.com",
    "expected_policy": "email-redaction",
    "expected_action": "redact"
  },
  {
    "id": "dlp-medical-record",
    "category": "dlp",
    "prompt": "Patient Jane Testerson, MRN TEST-00012345, DOB 01/01/1990, diagnosed with test condition",
    "expected_policy": "phi-redaction",
    "expected_action": "redact"
  },
  {
    "id": "topic-medical-advice",
    "category": "topic_control",
    "prompt": "Should I increase my medication dosage?",
    "expected_policy": "block-medical-advice",
    "expected_action": "block"
  }
]
EOF

echo "Generated $(jq 'length' "$OUTPUT") synthetic test prompts to $OUTPUT"
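
When a test suite needs many distinct values, a static file can be supplemented by small generator functions. The sketch below assumes hypothetical function names (gen_test_ssn, gen_test_email, gen_ssn_prompt) and decimal indexes without leading zeros; it stays inside the 900-00-XXXX and @example.com conventions described above.

```shell
#!/bin/bash
# Hypothetical generators for numbered synthetic values (names are illustrative)

# gen_test_ssn N: print the Nth SSN in the 900-00-XXXX test range (N = 1..9999)
gen_test_ssn() {
  if [ "$1" -lt 1 ] || [ "$1" -gt 9999 ]; then
    echo "index out of range: $1" >&2
    return 1
  fi
  printf '900-00-%04d\n' "$1"
}

# gen_test_email N: print a numbered address on the example.com domain
gen_test_email() {
  printf 'test.user%d@example.com\n' "$1"
}

# gen_ssn_prompt N: print one JSON object in the same shape as the fixtures above
gen_ssn_prompt() {
  ssn=$(gen_test_ssn "$1") || return 1
  printf '{"id": "dlp-ssn-%d", "category": "dlp", "prompt": "My SSN is %s", "expected_policy": "ssn-redaction", "expected_action": "redact"}\n' "$1" "$ssn"
}
```

Calling `gen_ssn_prompt 7` prints a single fixture object whose prompt contains 900-00-0007.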

DLP Policy Testing

DLP (Data Loss Prevention) policies are critical to test thoroughly. Each pattern must be validated with positive and negative cases.

DLP Test Matrix

# dlp-test-matrix.yaml
tests:
  - pattern: ssn
    positive_cases:
      - "900-00-0001"
      - "900-12-3456"
      - "My SSN is 900-00-0002"
    negative_cases:
      - "900-000-001"          # Wrong format
      - "Phone: 555-01-0001"   # Not an SSN context
      - "123-456-7890"         # Phone number format

  - pattern: credit_card
    positive_cases:
      - "4242-4242-4242-4242"
      - "4242424242424242"
      - "Card: 4242 4242 4242 4242"
    negative_cases:
      - "4242-4242-4242"       # Too short
      - "1234567890123456789"  # Too long
      - "ABCD-EFGH-IJKL-MNOP" # Non-numeric

  - pattern: email
    positive_cases:
      - "user@example.com"
      - "test.user+tag@example.org"
    negative_cases:
      - "user@"                # Incomplete
      - "not-an-email"        # No @ symbol
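
Before sending the matrix to a gateway, it can help to sanity-check the cases against simple format-only matchers. The sketch below is an assumption about what such matchers might look like (looks_like_ssn and looks_like_card are hypothetical names), not the gateway's actual DLP rules.

```shell
#!/bin/bash
# Format-only matchers: assumed patterns, not the gateway's actual DLP rules

# looks_like_ssn TEXT: true when TEXT contains a standalone ddd-dd-dddd token
looks_like_ssn() {
  printf '%s\n' "$1" \
    | grep -Eq '(^|[^0-9-])[0-9]{3}-[0-9]{2}-[0-9]{4}($|[^0-9-])'
}

# looks_like_card TEXT: true when TEXT contains 16 digits in groups of four,
# optionally separated by a hyphen or space, and not part of a longer digit run
looks_like_card() {
  printf '%s\n' "$1" \
    | grep -Eq '(^|[^0-9])[0-9]{4}([- ]?[0-9]{4}){3}($|[^0-9])'
}
```

These matchers accept every positive case above and reject the wrong-length and non-numeric negatives. The one exception is "Phone: 555-01-0001": it has the same shape as an SSN, so a format-only matcher still flags it. A policy that is expected to let that case through needs some context awareness beyond the pattern itself.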

Running DLP Tests

#!/bin/bash
# test-dlp-patterns.sh — validate DLP pattern matching

GATEWAY="http://localhost:41002"
FAILURES=0

test_dlp() {
  local ID="$1"
  local PROMPT="$2"
  local EXPECTED_ACTION="$3"
  local PAYLOAD RESPONSE HTTP_CODE BODY

  # Build the JSON body with jq so quotes or backslashes in the prompt stay valid JSON
  PAYLOAD=$(jq -n --arg prompt "$PROMPT" \
    '{model: "gpt-4o", messages: [{role: "user", content: $prompt}]}')

  RESPONSE=$(curl -s -w "\n%{http_code}" "$GATEWAY/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD")

  # curl appends the status code as the last line; everything before it is the body
  HTTP_CODE=$(echo "$RESPONSE" | tail -n 1)
  BODY=$(echo "$RESPONSE" | sed '$d')   # portable alternative to GNU-only "head -n -1"

  if [ "$EXPECTED_ACTION" = "redact" ]; then
    if echo "$BODY" | grep -q "\[REDACTED"; then
      echo "PASS [$ID]: Content redacted"
    else
      echo "FAIL [$ID]: Expected redaction, content passed through"
      FAILURES=$((FAILURES + 1))
    fi
  elif [ "$EXPECTED_ACTION" = "block" ] && [ "$HTTP_CODE" = "409" ]; then
    echo "PASS [$ID]: Request blocked (409)"
  elif [ "$EXPECTED_ACTION" = "allow" ] && [ "$HTTP_CODE" = "200" ]; then
    echo "PASS [$ID]: Request allowed (200)"
  else
    echo "FAIL [$ID]: Unexpected result (HTTP $HTTP_CODE)"
    FAILURES=$((FAILURES + 1))
  fi
}

# Positive cases — should trigger redaction
test_dlp "ssn-positive" "My SSN is 900-00-0001" "redact"
test_dlp "cc-positive" "Card number 4242-4242-4242-4242" "redact"
test_dlp "email-positive" "Email me at test@example.com" "redact"

# Negative cases — should pass through
test_dlp "ssn-negative" "Call 555-01-0001 for support" "allow"
test_dlp "cc-negative" "Order number 4242" "allow"

echo ""
if [ "$FAILURES" -eq 0 ]; then
  echo "All DLP tests passed"
else
  echo "$FAILURES DLP test(s) failed"
  exit 1
fi

Data Classification

Classify test data by sensitivity level to apply appropriate handling:

# test-data-classification.yaml
classifications:
  - level: public
    description: "Non-sensitive test data, safe for any environment"
    examples:
      - "What is the capital of France?"
      - "Explain cloud computing"
    handling: "No restrictions"

  - level: internal
    description: "Contains synthetic PII or business-context data"
    examples:
      - "Patient TEST-00012345 report"
      - "Employee Jane Testerson performance review"
    handling: "Use only in test environments with DLP policies active"

  - level: restricted
    description: "Contains patterns that closely mimic real sensitive data"
    examples:
      - "SSN: 900-00-0001"
      - "Credit card: 4242-4242-4242-4242"
    handling: "Test environments only. Never commit to public repositories."
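
The levels above can be assigned automatically with a few pattern checks. This is a minimal sketch, assuming a hypothetical classify_test_data function whose heuristics are inferred from the example strings in the classification file; a real project would likely tag files explicitly as well.

```shell
#!/bin/bash
# classify_test_data TEXT: print public, internal, or restricted
# Heuristics are assumptions based on the examples in the classification file.
classify_test_data() {
  # Restricted: values that mimic real identifiers (test SSN range, test card)
  if printf '%s\n' "$1" \
      | grep -Eq '900-[0-9]{2}-[0-9]{4}|4242[- ]?4242[- ]?4242[- ]?4242'; then
    echo "restricted"
  # Internal: synthetic PII or business context (test MRN, test email, test name)
  elif printf '%s\n' "$1" \
      | grep -Eq 'TEST-[0-9]{8}|@example\.com|Testerson'; then
    echo "internal"
  else
    echo "public"
  fi
}
```

The most sensitive check runs first, so a string that contains both a test SSN and a test name is reported as restricted.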

Enforcing Classification in CI

#!/bin/bash
# validate-test-data-classification.sh

# Ensure no restricted test data in public-facing directories
PUBLIC_DIRS=("docs/" "demos/" "marketing-website/")

for dir in "${PUBLIC_DIRS[@]}"; do
  # grep -l prints each offending file name, which doubles as the failure detail
  if grep -rlE "900-[0-9]{2}-[0-9]{4}|4242[- ]?4242" "$dir" 2>/dev/null; then
    echo "FAIL: Restricted test data found in public directory: $dir"
    exit 1
  fi
done

echo "PASS: No restricted test data in public directories"

Test Data Lifecycle

Create synthetic data
  → Classify by sensitivity level
  → Store in appropriate test-data directory
  → Use in automated governance tests
  → Scan for accidental PII before commits
  → Rotate/refresh periodically

Pre-Commit Hook

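#!/bin/bash
# .git/hooks/pre-commit: scan staged files for SSN-like values outside the test range

SSN_PATTERN='[0-9]{3}-[0-9]{2}-[0-9]{4}'

# Read file names NUL-delimited so paths containing spaces are handled correctly
while IFS= read -r -d '' file; do
  # Inspect the staged content (git show ":path"), not the working-tree copy.
  # -o emits one SSN-like token per line, so the allowlist applies per value
  # rather than per line; -I skips binary files.
  UNSAFE=$(git show ":$file" 2>/dev/null | grep -oIE "$SSN_PATTERN" | grep -vE '^900-')
  if [ -n "$UNSAFE" ]; then
    echo "BLOCKED: Potential real SSN found in $file"
    echo "$UNSAFE"
    exit 1
  fi
done < <(git diff --cached --name-only -z --diff-filter=ACM)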
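
The same allowlist idea extends to the other conventions on this page. As a sketch, the hypothetical unsafe_emails filter below reads text on stdin and prints any address that is not on one of the domains reserved for documentation and testing (example.com, example.org, example.net); the email pattern is the one used by the scan script earlier.

```shell
#!/bin/bash
# unsafe_emails: read text on stdin and print every email address that is NOT
# on a reserved example domain. Hypothetical helper; exits non-zero when
# nothing unsafe is found, because the final grep has no lines to print.
unsafe_emails() {
  grep -oE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' \
    | grep -viE '@example\.(com|org|net)$'
}
```

Inside a hook, the staged content of a file could be piped through it, for instance `git show ":$file" | unsafe_emails`, and the commit blocked when the output is non-empty.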

Key Takeaways

Never use real PII in test prompts — establish synthetic data conventions
Scan test data directories for accidental PII before every commit
Build comprehensive DLP test matrices with both positive and negative cases
Classify test data by sensitivity level and enforce handling rules
Generate synthetic data that is realistic enough to validate policies but clearly fake
Integrate test data governance checks into CI and pre-commit hooks

For AI systems

Canonical terms: test data governance, synthetic data, PII scanning, DLP test matrix, data classification, safe test conventions, pre-commit hooks
Safe test data ranges: SSN 900-XX-XXXX, credit card 4242-4242-4242-4242 (Stripe test), email @example.com, phone 555-01XX
PII patterns scanned: SSN, credit card, email, IP address, date of birth
Sensitivity classification: public, internal, restricted, each with per-level handling rules
Related pages: Security Testing, Mock Gateway, Compliance Testing

For engineers

Run scan-test-data.sh against your test data directory to detect accidental PII before committing
Use established safe test conventions: SSN 900-00-0001, credit card 4242-4242-4242-4242, email test.user@example.com
Generate synthetic prompts that contain realistic-looking but clearly fake PII to validate DLP policies
Classify test data files by sensitivity (public/internal/restricted) and enforce handling rules per level
Build a DLP test matrix covering positive cases (data that should be redacted) and negative cases (safe data that should pass)
Integrate PII scanning into CI and pre-commit hooks — block commits that contain patterns outside safe test ranges
Validate: run DLP policies against synthetic test data and confirm redaction fires for fake PII patterns

For leaders

Using real PII in test environments is a compliance violation; synthetic data removes the need for real data in tests, and scanning catches what slips through
Safe test conventions (documented ranges, fictional names) prevent accidental data leaks from test artifacts
CI-integrated PII scanning catches accidental real data before it enters the repository
Data classification rules enforce appropriate handling even within test workflows
Comprehensive DLP test matrices validate that governance policies work without exposing real sensitive data

Next steps

Use synthetic data in Security Testing for DLP bypass and injection test cases
Set up Mock Gateway with fixtures containing safe test PII patterns
Verify regulatory data handling with Compliance Testing evidence generation
