Regression Testing AI Policies
Policy changes are among the highest-risk modifications in an AI governance platform. A single misconfigured rule can silently allow prohibited content or incorrectly block legitimate traffic. Regression testing ensures that policy updates produce only the intended behavioral changes.
Use this page when
- You are testing policy config changes to ensure they produce only the intended behavioral differences
- You need before/after event comparison, snapshot testing, or kt policy lint validation
- You want to build a fixed test-prompt suite and automate regression detection in CI
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
Policy Change Impact Analysis
Before applying a policy change, understand the blast radius. Compare the current and proposed configurations to identify which rules are added, modified, or removed.
Structural Diff
# Compare current and proposed policy configs
diff --unified current-policy-config.yaml proposed-policy-config.yaml
For YAML-aware comparison, use kt policy lint to parse both files and highlight semantic differences:
# Validate the proposed config and check for structural issues
kt policy lint --file proposed-policy-config.yaml
# Compare policy names and actions between versions
diff <(grep -E "^\s+- name:|^\s+action:" current-policy-config.yaml) \
     <(grep -E "^\s+- name:|^\s+action:" proposed-policy-config.yaml)
Identifying Affected Policy Rules
Create a mapping of what changed:
#!/bin/bash
# policy-diff.sh — identify changed policies
CURRENT="current-policy-config.yaml"
PROPOSED="proposed-policy-config.yaml"
echo "=== Policies in current config ==="
grep "^\s*- name:" "$CURRENT" | sed 's/.*name: //'
echo ""
echo "=== Policies in proposed config ==="
grep "^\s*- name:" "$PROPOSED" | sed 's/.*name: //'
echo ""
echo "=== Added policies ==="
# Sort both lists so reordered (but unchanged) policies are not flagged
diff <(grep "^\s*- name:" "$CURRENT" | sed 's/.*name: //' | sort) \
     <(grep "^\s*- name:" "$PROPOSED" | sed 's/.*name: //' | sort) \
  | grep "^>" | sed 's/^> //'
echo ""
echo "=== Removed policies ==="
diff <(grep "^\s*- name:" "$CURRENT" | sed 's/.*name: //' | sort) \
     <(grep "^\s*- name:" "$PROPOSED" | sed 's/.*name: //' | sort) \
  | grep "^<" | sed 's/^< //'
Before/After Event Comparison
The most reliable regression test replays a fixed set of prompts through both the old and new policy configurations, then compares the decision outcomes.
Step 1: Build a Test Prompt Suite
Create a JSON file with categorized test prompts:
[
{
"id": "block-pii-ssn",
"category": "dlp",
"expected_decision": "redact",
"request": {
"model": "gpt-4o",
"messages": [{"role": "user", "content": "My SSN is 123-45-6789, can you verify it?"}]
}
},
{
"id": "allow-general-question",
"category": "passthrough",
"expected_decision": "allow",
"request": {
"model": "gpt-4o",
"messages": [{"role": "user", "content": "What is the capital of France?"}]
}
},
{
"id": "block-medical-advice",
"category": "topic_control",
"expected_decision": "block",
"request": {
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Should I stop taking my medication?"}]
}
}
]
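Steps 2 and 3 below call a ./scripts/run-regression-prompts.sh helper that this page does not otherwise show. A minimal sketch, assuming the gateway answers on /v1/chat/completions and signals allow with HTTP 200 and block with HTTP 409 (a real gateway may expose its decision in a response header or body field instead):

```shell
#!/bin/bash
# run-regression-prompts.sh — replay a test-prompt suite against a running
# gateway and emit one JSON array of {id, decision, http_code} records.
# Usage: ./run-regression-prompts.sh test-prompts.json [gateway-url]
run_suite() {
  local prompts="$1"
  local gateway="${2:-http://localhost:41002}"
  jq -c '.[]' "$prompts" | while read -r entry; do
    local id request response http_code decision
    id=$(echo "$entry" | jq -r '.id')
    request=$(echo "$entry" | jq -c '.request')
    # -w appends the status code on its own line after the body
    response=$(curl -s -w "\n%{http_code}" "$gateway/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -d "$request")
    http_code=$(echo "$response" | tail -n 1)
    # Assumed status-to-decision mapping; adjust to your gateway's contract
    case "$http_code" in
      200) decision="allow" ;;
      409) decision="block" ;;
      *)   decision="error" ;;
    esac
    jq -cn --arg id "$id" --arg decision "$decision" \
      --argjson http_code "$http_code" \
      '{id: $id, decision: $decision, http_code: $http_code}'
  done | jq -s '.'
}

if [ "$#" -ge 1 ]; then
  run_suite "$@"
fi
```

Run it as ./scripts/run-regression-prompts.sh test-prompts.json while the gateway is up; the emitted array matches the {id, decision} shape the comparison step expects.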
Step 2: Run Against Current Config
# Start gateway with current config
kt gateway run --policy-config current-policy-config.yaml --port 41002 &
GATEWAY_PID=$!
sleep 3
# Run test prompts and capture decisions
./scripts/run-regression-prompts.sh test-prompts.json > results-before.json
kill $GATEWAY_PID
Step 3: Run Against Proposed Config
# Start gateway with proposed config
kt gateway run --policy-config proposed-policy-config.yaml --port 41002 &
GATEWAY_PID=$!
sleep 3
# Run same prompts
./scripts/run-regression-prompts.sh test-prompts.json > results-after.json
kill $GATEWAY_PID
Step 4: Compare Results
# Diff decisions
diff <(jq '[.[] | {id, decision}]' results-before.json) \
     <(jq '[.[] | {id, decision}]' results-after.json)
# Count decision changes (each changed record surfaces as a "<" line from the
# before file; results for newly added test cases appear only as ">" lines)
REGRESSIONS=$(diff <(jq -c '.[] | {id, decision}' results-before.json) \
                   <(jq -c '.[] | {id, decision}' results-after.json) \
              | grep -c "^<")
echo "Regressions detected: $REGRESSIONS"
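The diff above only shows that something changed; you can also assert each result against the expected_decision recorded in the suite itself. A sketch using jq, with inline sample files standing in for test-prompts.json and results-after.json:

```shell
# Illustrative stand-ins for test-prompts.json and results-after.json
cat > /tmp/test-prompts.json <<'EOF'
[{"id": "block-pii-ssn", "expected_decision": "redact"},
 {"id": "allow-general-question", "expected_decision": "allow"}]
EOF
cat > /tmp/results-after.json <<'EOF'
[{"id": "block-pii-ssn", "decision": "allow"},
 {"id": "allow-general-question", "decision": "allow"}]
EOF

# Join expected vs. actual by id and report any mismatches
jq -rn \
  --slurpfile expected /tmp/test-prompts.json \
  --slurpfile actual /tmp/results-after.json '
  ($actual[0] | map({(.id): .decision}) | add) as $got
  | $expected[0][]
  | select(.expected_decision != $got[.id])
  | "\(.id): expected \(.expected_decision), got \($got[.id])"
' | tee /tmp/mismatches.txt
# → block-pii-ssn: expected redact, got allow
```

An empty report means every decision matched; a non-empty one names exactly which test cases regressed.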
Configuration Validation with kt policy lint
Always validate before deploying:
# Syntax and schema validation
kt policy lint --file policy-config.yaml
# Check for common issues:
# - Duplicate policy names
# - Invalid action types
# - Missing required fields
# - Regex syntax errors in DLP patterns
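If you need a standalone check for one of these issues without the CLI, duplicate policy names can be caught with standard tools. A sketch against an illustrative config:

```shell
# Illustrative config with a duplicated policy name
cat > /tmp/sample-policy-config.yaml <<'EOF'
policies:
  - name: block-pii
    action: redact
  - name: allow-general
    action: allow
  - name: block-pii
    action: block
EOF

# List any policy names that appear more than once
grep -E '^[[:space:]]*- name:' /tmp/sample-policy-config.yaml \
  | sed 's/.*name: //' \
  | sort | uniq -d | tee /tmp/dup-names.txt
# → block-pii
```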
Integrate validation as a pre-commit hook:
#!/bin/bash
# .git/hooks/pre-commit — validate policy configs
# --diff-filter=ACM skips deleted files, which would fail the lint on a missing path
CONFIGS=$(git diff --cached --name-only --diff-filter=ACM | grep 'policy-config\.yaml$')
for config in $CONFIGS; do
  echo "Validating $config..."
  if ! kt policy lint --file "$config"; then
    echo "ERROR: Policy config validation failed for $config"
    exit 1
  fi
done
Snapshot Testing Policy Outputs
Snapshot testing captures the full gateway decision for a set of prompts and stores it as a golden file. Future runs compare against this snapshot.
Creating Snapshots
#!/bin/bash
# create-policy-snapshot.sh
CONFIG="$1"
PROMPTS="test-prompts.json"
SNAPSHOT_DIR="snapshots"
mkdir -p "$SNAPSHOT_DIR"
kt gateway run --policy-config "$CONFIG" --port 41099 &
GATEWAY_PID=$!
sleep 3
# Start fresh so stale snapshot lines do not accumulate between runs
: > "$SNAPSHOT_DIR/decisions.jsonl"
jq -c '.[]' "$PROMPTS" | while read -r entry; do
  ID=$(echo "$entry" | jq -r '.id')
  REQUEST=$(echo "$entry" | jq -c '.request')
  RESPONSE=$(curl -s -w "\n%{http_code}" http://localhost:41099/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$REQUEST")
  HTTP_CODE=$(echo "$RESPONSE" | tail -n 1)
  # Build the record with jq so special characters in IDs are escaped safely
  jq -cn --arg id "$ID" --argjson code "$HTTP_CODE" \
    '{id: $id, http_code: $code}' >> "$SNAPSHOT_DIR/decisions.jsonl"
done
kill $GATEWAY_PID
echo "Snapshot saved to $SNAPSHOT_DIR/decisions.jsonl"
Comparing Against Snapshots
#!/bin/bash
# compare-policy-snapshot.sh
SNAPSHOT="snapshots/decisions.jsonl"
CURRENT="current-decisions.jsonl"
# Generate current decisions (same script as above, output to CURRENT)
DIFF_COUNT=$(diff <(sort "$SNAPSHOT") <(sort "$CURRENT") | grep "^[<>]" | wc -l)
if [ "$DIFF_COUNT" -eq 0 ]; then
  echo "PASS: No regressions detected"
else
  echo "FAIL: $DIFF_COUNT decision differences found"
  diff <(sort "$SNAPSHOT") <(sort "$CURRENT")
  exit 1
fi
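A raw diff of sorted snapshots can be hard to read once the suite grows; joining the two files by id yields a per-test-case change report instead. A sketch with illustrative snapshot lines:

```shell
# Illustrative before/after snapshot files
cat > /tmp/snap-before.jsonl <<'EOF'
{"id": "block-pii-ssn", "http_code": 200}
{"id": "allow-general-question", "http_code": 200}
EOF
cat > /tmp/snap-after.jsonl <<'EOF'
{"id": "block-pii-ssn", "http_code": 409}
{"id": "allow-general-question", "http_code": 200}
EOF

# Report each id whose recorded status changed between snapshots
jq -rn \
  --slurpfile before /tmp/snap-before.jsonl \
  --slurpfile after /tmp/snap-after.jsonl '
  ($after | map({(.id): .http_code}) | add) as $new
  | $before[]
  | select(.http_code != $new[.id])
  | "\(.id): \(.http_code) -> \($new[.id])"
' | tee /tmp/changed-ids.txt
# → block-pii-ssn: 200 -> 409
```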
Regression Test Matrix
Maintain a test matrix covering all policy types:
| Policy Type | Test Case Count | Assertion Type |
|---|---|---|
| Topic control (block) | 10+ per topic | HTTP 409 |
| DLP redaction | 5+ per pattern | Redacted markers in body |
| Rate limiting | 3+ burst scenarios | HTTP 429 after N requests |
| Spend limits | 2+ per budget tier | HTTP 402 or budget error |
| Disclaimers | 5+ trigger phrases | Disclaimer appended to body |
| Escalation | 5+ flagged scenarios | Event status = escalated |
| Passthrough | 10+ benign prompts | HTTP 200, no modification |
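Rows of this matrix can be encoded as machine-checkable status assertions. A minimal sketch, assuming each captured result records its category and http_code (the status mapping mirrors the table; body-content and escalation checks are omitted for brevity):

```shell
# Expected HTTP status per policy category, mirroring the matrix above
declare -A EXPECTED_STATUS=(
  [topic_control]=409
  [rate_limit]=429
  [spend_limit]=402
  [passthrough]=200
)

# Illustrative captured results standing in for real gateway output
cat > /tmp/matrix-results.jsonl <<'EOF'
{"id": "block-medical-advice", "category": "topic_control", "http_code": 409}
{"id": "allow-general-question", "category": "passthrough", "http_code": 200}
{"id": "burst-4th-request", "category": "rate_limit", "http_code": 200}
EOF

FAILURES=0
while read -r line; do
  category=$(echo "$line" | jq -r '.category')
  code=$(echo "$line" | jq -r '.http_code')
  want=${EXPECTED_STATUS[$category]}
  if [ -n "$want" ] && [ "$code" != "$want" ]; then
    echo "FAIL $(echo "$line" | jq -r '.id'): want HTTP $want, got $code"
    FAILURES=$((FAILURES + 1))
  fi
done < /tmp/matrix-results.jsonl
echo "Matrix failures: $FAILURES" | tee /tmp/matrix-failures.txt
```

Reading the results file via redirection (rather than a pipe) keeps the FAILURES counter in the parent shell, so the final tally is accurate.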
Automating in CI
# .github/workflows/policy-regression.yml
name: policy-regression
on:
  pull_request:
    paths:
      - "policy-config.yaml"

jobs:
  regression-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Install the kt CLI here before the steps below
      - name: Validate configs
        run: kt policy lint --file policy-config.yaml
      - name: Run snapshot comparison
        run: |
          ./scripts/create-policy-snapshot.sh policy-config.yaml
          ./scripts/compare-policy-snapshot.sh
      - name: Upload regression report
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: regression-report
          path: snapshots/
Key Takeaways
- Always run kt policy lint before deploying policy changes
- Build a fixed test-prompt suite covering every policy type and expected decision
- Use before/after comparison to detect unintended behavioral changes
- Snapshot testing provides a quick regression check for policy outputs
- Automate regression tests in CI to catch issues before they reach production
- Maintain a regression test matrix that grows with your policy configuration
For AI systems
- Canonical terms: regression testing, policy diff, before/after comparison, snapshot testing, kt policy lint, test-prompt suite, change impact analysis
- CLI commands: kt policy lint --file <path>, kt gateway run --policy-config <path>
- Test structure: JSON test-prompt suite with id, category, expected_decision (block/allow/redact), and request payload
- Comparison approach: run the same prompts through the old and new configs, then diff the gateway decisions
- Related pages: Config-as-Code, Mock Gateway, Testing AI Systems
For engineers
- Always run kt policy lint --file <proposed> before deploying any policy change
- Build a fixed test-prompt suite covering every policy type: DLP (SSN, credit card), topic control (blocked topics), rate limits, and passthrough
- Run prompts through the current config, capture decisions; then run through the proposed config and diff
- Use snapshot testing: save expected decision outcomes in a file and compare against actual results
- Automate in CI: run the regression suite on every PR that touches policy-config.yaml
- Validate: any unexpected decision change (e.g., a previously blocked prompt now passes) should fail the build
For leaders
- Policy changes are high-risk — a single misconfigured rule can silently allow prohibited content or block legitimate traffic
- Before/after testing proves the change does exactly what was intended and nothing else
- CI-integrated regression tests prevent production incidents from untested policy modifications
- A growing test-prompt suite provides increasing confidence as the policy configuration evolves
- Snapshot testing scales to hundreds of test cases without manual verification
Next steps
- Structure your configs for safe changes with Config-as-Code promotion workflows
- Use Mock Gateway for fast, deterministic regression testing without provider costs
- Learn Testing AI Systems patterns for policy-as-test-oracle methodology