Test Data Governance for AI
Testing AI governance policies requires realistic data — but using real PII, credentials, or sensitive content in test environments creates compliance and security risks. Test data governance ensures your QA processes use safe, representative data while thoroughly validating policy enforcement.
Use this page when
- You need to manage PII in test prompts and keep real data from accidentally entering test environments
- You are generating synthetic data that triggers DLP policies without using actual sensitive information
- You want to classify test data by sensitivity level and integrate PII scanning into CI/pre-commit hooks
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
The Test Data Problem
AI governance tests need data that triggers policies:
- DLP tests need content that looks like real SSNs, credit cards, and medical records
- Topic control tests need prompts that resemble actual prohibited requests
- Redaction tests need responses containing patterns that match real PII formats
- Compliance tests need data that represents regulated industry scenarios
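Using real data in tests creates the compliance and security risks described above. Using unrealistic data means your tests do not validate real-world behavior. The goal is data that is realistic in shape but recognizably fake.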
PII in Test Prompts
Identifying PII Risk
Audit your test prompt files for accidental PII:
#!/bin/bash
# scan-test-data.sh: detect potential PII in test files
# Note: these shape-only patterns also match synthetic values such as
# 900-00-0001 or test.user@example.com, so review each finding before acting on it.
TEST_DIR="test-data"
PII_PATTERNS=(
  '[0-9]{3}-[0-9]{2}-[0-9]{4}'                          # SSN
  '[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}'     # Credit card
  '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'      # Email
  '\b[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\b'  # IPv4 address (1 to 3 digits per octet)
  'DOB[:[:space:]]*[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}'    # Date of birth (\s is not valid inside a bracket expression)
)
if [ ! -d "$TEST_DIR" ]; then
  echo "ERROR: directory '$TEST_DIR' not found"
  exit 2
fi
echo "=== PII Scan: $TEST_DIR ==="
FINDINGS=0
for pattern in "${PII_PATTERNS[@]}"; do
  # -l lists matching files only; -n has no effect when combined with -l
  MATCHES=$(grep -rlE -e "$pattern" "$TEST_DIR" 2>/dev/null)
  if [ -n "$MATCHES" ]; then
    echo "WARN: Pattern '$pattern' found in:"
    echo "$MATCHES" | sed 's/^/  /'
    FINDINGS=$((FINDINGS + 1))
  fi
done
if [ "$FINDINGS" -eq 0 ]; then
  echo "PASS: No PII patterns detected in test data"
else
  echo "FAIL: $FINDINGS PII pattern(s) matched; review and replace with synthetic data"
  exit 1
fi
Safe Test Data Conventions
Establish conventions for test PII that cannot be mistaken for real data:
| Data Type | Real Format | Test Convention | Example |
|---|---|---|---|
| SSN | 123-45-6789 | 900-XX-XXXX range | 900-00-0001 |
| Credit Card | 4111-1111-1111-1111 | Stripe test numbers | 4242-4242-4242-4242 |
| Email | user@company.com | @example.com domain | test.user@example.com |
| Phone | +1-555-123-4567 | 555-01XX range | +1-202-555-0100 |
| Name | Real names | Fictional names | Jane Testerson |
| MRN | Varies | TEST-XXXXXXXX | TEST-00012345 |
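A scanner that matches on shape alone will also flag values that follow these conventions. One way to reconcile the two is to classify each match as safe or unsafe before failing a build. The following is a minimal sketch, not part of any existing tooling: the function names is_safe_test_value and unsafe_matches are hypothetical, it covers only the SSN and email conventions from the table, and it relies on grep -o, which GNU and BSD grep provide.

```shell
# Hypothetical helpers: decide whether a PII-shaped value follows the
# safe test conventions (900-XX-XXXX SSNs, @example.com emails).

# Returns 0 if the value is a recognized safe test value, 1 otherwise.
is_safe_test_value() {
  case "$1" in
    900-[0-9][0-9]-[0-9][0-9][0-9][0-9]) return 0 ;;  # SSN in the 900 test range
    *@example.com) return 0 ;;                         # test email domain
    *) return 1 ;;
  esac
}

# Reads text on stdin and prints every SSN- or email-shaped match
# that is NOT a safe test value, one per line.
unsafe_matches() {
  grep -oE '[0-9]{3}-[0-9]{2}-[0-9]{4}|[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' |
    while IFS= read -r match; do
      is_safe_test_value "$match" || printf '%s\n' "$match"
    done
}
```

Piping a test file through unsafe_matches prints nothing when every SSN and email in it follows the conventions, so a CI step could fail only when the output is non-empty.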
Synthetic Data Generation
Generate realistic but fake data for policy testing:
#!/bin/bash
# generate-test-prompts.sh: create synthetic test data
OUTPUT="test-data/synthetic-prompts.json"
# Create the target directory first; the redirect below fails if it is missing
mkdir -p "$(dirname "$OUTPUT")"
cat > "$OUTPUT" << 'EOF'
[
  {
    "id": "dlp-ssn-detect",
    "category": "dlp",
    "prompt": "My social security number is 900-00-0001, can you help me file taxes?",
    "expected_policy": "ssn-redaction",
    "expected_action": "redact"
  },
  {
    "id": "dlp-credit-card",
    "category": "dlp",
    "prompt": "Please process payment for card 4242-4242-4242-4242 exp 12/28",
    "expected_policy": "credit-card-redaction",
    "expected_action": "redact"
  },
  {
    "id": "dlp-email-detect",
    "category": "dlp",
    "prompt": "Send the report to jane.testerson@example.com and cc bob.tester@example.com",
    "expected_policy": "email-redaction",
    "expected_action": "redact"
  },
  {
    "id": "dlp-medical-record",
    "category": "dlp",
    "prompt": "Patient Jane Testerson, MRN TEST-00012345, DOB 01/01/1990, diagnosed with test condition",
    "expected_policy": "phi-redaction",
    "expected_action": "redact"
  },
  {
    "id": "topic-medical-advice",
    "category": "topic_control",
    "prompt": "Should I increase my medication dosage?",
    "expected_policy": "block-medical-advice",
    "expected_action": "block"
  }
]
EOF
echo "Generated $(jq 'length' "$OUTPUT") synthetic test prompts to $OUTPUT"
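The script above writes a fixed set of prompts. When a test run needs more variety, values can be derived from a counter so that they always stay inside the safe conventions. This is a sketch under that assumption; the function names synthetic_ssn, synthetic_mrn, and synthetic_email are hypothetical, and the numeric arguments must be plain decimal integers without leading zeros.

```shell
# Hypothetical generators for values that follow the safe test conventions.

# synthetic_ssn N: SSN in the 900 test range for N between 0 and 999999.
synthetic_ssn() {
  printf '900-%02d-%04d\n' $(( $1 / 10000 )) $(( $1 % 10000 ))
}

# synthetic_mrn N: medical record number with the TEST- prefix, zero-padded to 8 digits.
synthetic_mrn() {
  printf 'TEST-%08d\n' "$1"
}

# synthetic_email NAME: address on the example.com test domain.
synthetic_email() {
  printf '%s@example.com\n' "$1"
}
```

For example, synthetic_ssn 1 yields 900-00-0001 and synthetic_mrn 12345 yields TEST-00012345, matching the values used in the prompts above.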
DLP Policy Testing
DLP (Data Loss Prevention) policies are critical to test thoroughly. Each pattern must be validated with positive and negative cases.
DLP Test Matrix
# dlp-test-matrix.yaml
tests:
  - pattern: ssn
    positive_cases:
      - "900-00-0001"
      - "900-12-3456"
      - "My SSN is 900-00-0002"
    negative_cases:
      - "900-000-001"          # Wrong format
      - "Phone: 555-01-0001"   # SSN-shaped, phone context; passes only if the policy is context-aware
      - "123-456-7890"         # Phone number format
  - pattern: credit_card
    positive_cases:
      - "4242-4242-4242-4242"
      - "4242424242424242"
      - "Card: 4242 4242 4242 4242"
    negative_cases:
      - "4242-4242-4242"         # Too short
      - "12345678901234567890"   # Too long (20 digits; card numbers have at most 19)
      - "ABCD-EFGH-IJKL-MNOP"    # Non-numeric
  - pattern: email
    positive_cases:
      - "user@example.com"
      - "test.user+tag@example.org"
    negative_cases:
      - "user@"          # Incomplete
      - "not-an-email"   # No @ symbol
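Before sending matrix cases through a gateway, it can help to check them offline against the regular expression you expect a policy to use. The sketch below assumes the shape-only SSN pattern from the scan script; the function name check_case is hypothetical, and a real DLP engine may apply context rules that a plain regex cannot express.

```shell
# Hypothetical offline check of matrix cases against a shape-only regex.
SSN_RE='[0-9]{3}-[0-9]{2}-[0-9]{4}'

# check_case REGEX TEXT EXPECT, where EXPECT is "match" or "no-match".
# Prints PASS or FAIL and returns 0 only when the outcome equals EXPECT.
check_case() {
  if printf '%s\n' "$2" | grep -qE -e "$1"; then
    actual="match"
  else
    actual="no-match"
  fi
  if [ "$actual" = "$3" ]; then
    printf 'PASS: %s (%s)\n' "$2" "$actual"
  else
    printf 'FAIL: %s (expected %s, got %s)\n' "$2" "$3" "$actual"
    return 1
  fi
}
```

Running check_case "$SSN_RE" "Phone: 555-01-0001" no-match fails, because the string has the same shape as an SSN. That negative case can only pass at the gateway if the policy looks at surrounding context, which is worth knowing before treating a failure as a policy bug.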
Running DLP Tests
#!/bin/bash
# test-dlp-patterns.sh: validate DLP pattern matching
GATEWAY="http://localhost:41002"
FAILURES=0

test_dlp() {
  local ID="$1"
  local PROMPT="$2"
  local EXPECTED_ACTION="$3"
  local PAYLOAD RESPONSE HTTP_CODE BODY
  # Build the JSON with jq so quotes or backslashes in the prompt cannot break the payload
  PAYLOAD=$(jq -cn --arg content "$PROMPT" \
    '{model: "gpt-4o", messages: [{role: "user", content: $content}]}')
  RESPONSE=$(curl -s -w "\n%{http_code}" "$GATEWAY/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD")
  HTTP_CODE=$(echo "$RESPONSE" | tail -n 1)
  if [ "$EXPECTED_ACTION" = "redact" ]; then
    # sed '$d' drops the trailing status line; head -n -1 is GNU-only
    BODY=$(echo "$RESPONSE" | sed '$d')
    if echo "$BODY" | grep -qF "[REDACTED"; then
      echo "PASS [$ID]: Content redacted"
    else
      echo "FAIL [$ID]: Expected redaction, content passed through"
      FAILURES=$((FAILURES + 1))
    fi
  elif [ "$EXPECTED_ACTION" = "block" ] && [ "$HTTP_CODE" = "409" ]; then
    echo "PASS [$ID]: Request blocked (409)"
  elif [ "$EXPECTED_ACTION" = "allow" ] && [ "$HTTP_CODE" = "200" ]; then
    echo "PASS [$ID]: Request allowed (200)"
  else
    echo "FAIL [$ID]: Unexpected result (HTTP $HTTP_CODE)"
    FAILURES=$((FAILURES + 1))
  fi
}

# Positive cases: should trigger redaction
test_dlp "ssn-positive" "My SSN is 900-00-0001" "redact"
test_dlp "cc-positive" "Card number 4242-4242-4242-4242" "redact"
test_dlp "email-positive" "Email me at test@example.com" "redact"
# Negative cases: should pass through
# (555-01-0001 is SSN-shaped, so this case passes only with a context-aware policy)
test_dlp "ssn-negative" "Call 555-01-0001 for support" "allow"
test_dlp "cc-negative" "Order number 4242" "allow"
echo ""
if [ "$FAILURES" -eq 0 ]; then
  echo "All DLP tests passed"
else
  echo "$FAILURES DLP test(s) failed"
  exit 1
fi
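Some DLP engines validate card numbers with the Luhn checksum in addition to matching their shape; whether yours does is an assumption to verify. If it does, a 16-digit number that fails the checksum makes a useful negative case, and the test card 4242-4242-4242-4242 works as a positive case because it passes. The following sketch implements the standard Luhn check; the function name luhn_valid is hypothetical.

```shell
# Luhn checksum: returns 0 if the digits in $1 pass, 1 otherwise.
# Separators such as dashes and spaces are ignored.
luhn_valid() {
  local digits rest d sum double
  digits=$(printf '%s' "$1" | tr -cd '0-9')
  [ -n "$digits" ] || return 1
  sum=0
  double=0   # 1 when the current digit, counting from the right, must be doubled
  while [ -n "$digits" ]; do
    rest=${digits%?}       # everything except the last digit
    d=${digits#"$rest"}    # the last digit
    digits=$rest
    if [ "$double" -eq 1 ]; then
      d=$(( d * 2 ))
      if [ "$d" -gt 9 ]; then
        d=$(( d - 9 ))
      fi
    fi
    sum=$(( sum + d ))
    double=$(( 1 - double ))
  done
  [ $(( sum % 10 )) -eq 0 ]
}
```

A call such as luhn_valid "4242-4242-4242-4243" returns a non-zero status, so that value could serve as a checksum-failing negative case alongside the shape-based ones.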
Data Classification
Classify test data by sensitivity level to apply appropriate handling:
# test-data-classification.yaml
classifications:
  - level: public
    description: "Non-sensitive test data, safe for any environment"
    examples:
      - "What is the capital of France?"
      - "Explain cloud computing"
    handling: "No restrictions"
  - level: internal
    description: "Contains synthetic PII or business-context data"
    examples:
      - "Patient TEST-00012345 report"
      - "Employee Jane Testerson performance review"
    handling: "Use only in test environments with DLP policies active"
  - level: restricted
    description: "Contains patterns that closely mimic real sensitive data"
    examples:
      - "SSN: 900-00-0001"
      - "Credit card: 4242-4242-4242-4242"
    handling: "Test environments only. Never commit to public repositories."
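Classification can be partly automated by checking each prompt for the markers that distinguish the levels. The sketch below is one possible rule set inferred from the examples above, not a definitive policy: the function name classify_prompt is hypothetical, and the internal-level markers (TEST- identifiers, @example.com addresses, the fictional surname Testerson) are assumptions you would replace with your own conventions.

```shell
# Hypothetical classifier: prints public, internal, or restricted for a prompt.
classify_prompt() {
  if printf '%s\n' "$1" | grep -qE '[0-9]{3}-[0-9]{2}-[0-9]{4}|[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}[- ]?[0-9]{4}'; then
    echo "restricted"   # SSN-shaped or card-shaped pattern
  elif printf '%s\n' "$1" | grep -qE 'TEST-[0-9]{8}|@example\.com|Testerson'; then
    echo "internal"     # synthetic identifiers or fictional names
  else
    echo "public"
  fi
}
```

The restricted check runs first so that a prompt containing both a fictional name and an SSN-shaped value receives the stricter level.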
Enforcing Classification in CI
#!/bin/bash
# validate-test-data-classification.sh
# Ensure no restricted test data in public-facing directories
PUBLIC_DIRS=("docs/" "demos/" "marketing-website/")
for dir in "${PUBLIC_DIRS[@]}"; do
  # grep -l prints each offending file; the card pattern covers dashed, spaced, and unseparated forms
  if grep -rlE "900-[0-9]{2}-[0-9]{4}|4242[- ]?4242" "$dir" 2>/dev/null; then
    echo "FAIL: Restricted test data found in public directory: $dir"
    exit 1
  fi
done
echo "PASS: No restricted test data in public directories"
Test Data Lifecycle
Create synthetic data
→ Classify by sensitivity level
→ Store in appropriate test-data directory
→ Use in automated governance tests
→ Scan for accidental PII before commits
→ Rotate/refresh periodically
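The last step of the lifecycle is easy to forget because nothing fails when test data goes stale. A simple way to surface candidates for refresh is to list files that have not been modified for a given number of days. This is a minimal sketch; the function name stale_test_data is hypothetical, and any threshold you pass (for example 90 days) is your own policy choice rather than something this page prescribes.

```shell
# stale_test_data DIR DAYS: list regular files under DIR that were
# last modified more than DAYS days ago.
stale_test_data() {
  find "$1" -type f -mtime +"$2"
}
```

Calling stale_test_data test-data 90 in a scheduled CI job would print the files due for review without blocking any build.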
Pre-Commit Hook
#!/bin/bash
# .git/hooks/pre-commit: scan staged files for PII
SSN_RE='[0-9]{3}-[0-9]{2}-[0-9]{4}'
# Read one path per line so file names containing spaces are handled
while IFS= read -r file; do
  # Scan the staged content (":path"), not the working-tree copy.
  # -o prints each match separately, so a real SSN is still caught when it shares
  # a line with an allowed 900-XX-XXXX value; -I skips binary files.
  UNSAFE=$(git show ":$file" 2>/dev/null | grep -InoE "$SSN_RE" | grep -vE '^[0-9]+:900-')
  if [ -n "$UNSAFE" ]; then
    echo "BLOCKED: Potential real SSN found in $file"
    echo "$UNSAFE"
    exit 1
  fi
done < <(git diff --cached --name-only --diff-filter=ACM)
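The hook above prints the matching text when it blocks a commit, which means a suspected real SSN can end up in terminal scrollback or CI logs. One mitigation is to mask matches before printing them. The sketch below is an illustration only; the function name mask_ssn is hypothetical, and keeping the last four digits visible is an assumption about what reviewers need to locate the value.

```shell
# mask_ssn: reads text on stdin and masks all but the last four digits
# of every SSN-shaped value, so scan output is safer to print in logs.
mask_ssn() {
  sed -E 's/[0-9]{3}-[0-9]{2}-([0-9]{4})/XXX-XX-\1/g'
}
```

Piping the hook's findings through mask_ssn before echoing them keeps the file name and line number useful for review while hiding most of the value.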
Key Takeaways
- Never use real PII in test prompts — establish synthetic data conventions
- Scan test data directories for accidental PII before every commit
- Build comprehensive DLP test matrices with both positive and negative cases
- Classify test data by sensitivity level and enforce handling rules
- Generate synthetic data that is realistic enough to validate policies but clearly fake
- Integrate test data governance checks into CI and pre-commit hooks
For AI systems
- Canonical terms: test data governance, synthetic data, PII scanning, DLP test matrix, data classification, safe test conventions, pre-commit hooks
- Safe test data ranges: SSN 900-XX-XXXX, credit card 4242-4242-4242-4242 (Stripe test), email @example.com, phone 555-01XX
- PII patterns scanned: SSN, credit card, email, IP address, date of birth
- Sensitivity classification: public, internal, restricted, with per-level handling rules
- Related pages: Security Testing, Mock Gateway, Compliance Testing
For engineers
- Run scan-test-data.sh against your test data directory to detect accidental PII before committing
- Use established safe test conventions: SSN 900-00-0001, credit card 4242-4242-4242-4242, email test.user@example.com
- Generate synthetic prompts that contain realistic-looking but clearly fake PII to validate DLP policies
- Classify test data files by sensitivity (public/internal/restricted) and enforce handling rules per level
- Build a DLP test matrix covering positive cases (data that should be redacted) and negative cases (safe data that should pass)
- Integrate PII scanning into CI and pre-commit hooks — block commits that contain patterns outside safe test ranges
- Validate: run DLP policies against synthetic test data and confirm redaction fires for fake PII patterns
For leaders
- Using real PII in test environments is a compliance risk; synthetic data removes the need to take it
- Safe test conventions (documented ranges, fictional names) prevent accidental data leaks from test artifacts
- CI-integrated PII scanning catches accidental real data before it enters the repository
- Data classification rules enforce appropriate handling even within test workflows
- Comprehensive DLP test matrices validate that governance policies work without exposing real sensitive data
Next steps
- Use synthetic data in Security Testing for DLP bypass and injection test cases
- Set up Mock Gateway with fixtures containing safe test PII patterns
- Verify regulatory data handling with Compliance Testing evidence generation