Regression Testing AI Policies
Policy changes are among the highest-risk modifications in an AI governance platform. A single misconfigured rule can silently allow prohibited content or incorrectly block legitimate traffic. Regression testing ensures that policy updates produce only the intended behavioral changes.
Use this page when
- You are testing policy config changes to ensure they produce only the intended behavioral differences
- You need before/after event comparison, snapshot testing, or kt policy lint validation
- You want to build a fixed test-prompt suite and automate regression detection in CI
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
Policy Change Impact Analysis
Before applying a policy change, understand the blast radius. Compare the current and proposed configurations to identify which rules are added, modified, or removed.
Structural Diff
# Compare current and proposed policy configs
diff --unified current-policy-config.yaml proposed-policy-config.yaml
For YAML-aware comparison, use kt policy lint to parse both files and highlight semantic differences:
# Validate the proposed config and check for structural issues
kt policy lint --file proposed-policy-config.yaml
# Compare policy names and actions between versions
diff <(grep -E "^\s+- name:|^\s+action:" current-policy-config.yaml) \
     <(grep -E "^\s+- name:|^\s+action:" proposed-policy-config.yaml)
Identifying Affected Policy Rules
Create a mapping of what changed:
#!/bin/bash
# policy-diff.sh — identify changed policies
CURRENT="current-policy-config.yaml"
PROPOSED="proposed-policy-config.yaml"
echo "=== Policies in current config ==="
grep "^\s*- name:" "$CURRENT" | sed 's/.*name: //'
echo ""
echo "=== Policies in proposed config ==="
grep "^\s*- name:" "$PROPOSED" | sed 's/.*name: //'
echo ""
echo "=== Added policies ==="
# Sort both lists so reordered (but unchanged) policies are not flagged
diff <(grep "^\s*- name:" "$CURRENT" | sed 's/.*name: //' | sort) \
     <(grep "^\s*- name:" "$PROPOSED" | sed 's/.*name: //' | sort) \
  | grep "^>" | sed 's/^> //'
echo ""
echo "=== Removed policies ==="
diff <(grep "^\s*- name:" "$CURRENT" | sed 's/.*name: //' | sort) \
     <(grep "^\s*- name:" "$PROPOSED" | sed 's/.*name: //' | sort) \
  | grep "^<" | sed 's/^< //'
Before/After Event Comparison
The most reliable regression test replays a fixed set of prompts through both the old and new policy configurations, then compares the decision outcomes.
Step 1: Build a Test Prompt Suite
Create a JSON file with categorized test prompts:
[
{
"id": "block-pii-ssn",
"category": "dlp",
"expected_decision": "redact",
"request": {
"model": "gpt-4o",
"messages": [{"role": "user", "content": "My SSN is 123-45-6789, can you verify it?"}]
}
},
{
"id": "allow-general-question",
"category": "passthrough",
"expected_decision": "allow",
"request": {
"model": "gpt-4o",
"messages": [{"role": "user", "content": "What is the capital of France?"}]
}
},
{
"id": "block-medical-advice",
"category": "topic_control",
"expected_decision": "block",
"request": {
"model": "gpt-4o",
"messages": [{"role": "user", "content": "Should I stop taking my medication?"}]
}
}
]
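Steps 2 and 3 below call a ./scripts/run-regression-prompts.sh helper that this page does not otherwise show. A minimal sketch, assuming the gateway answers on /v1/chat/completions and signals allow with HTTP 200 and block with HTTP 409 (a real gateway may expose its decision in a response header or body field instead):

```shell
#!/bin/bash
# run-regression-prompts.sh — replay a test-prompt suite against a running
# gateway and emit one JSON array of {id, decision, http_code} records.
# Usage: ./run-regression-prompts.sh test-prompts.json [gateway-url]
run_suite() {
  local prompts="$1"
  local gateway="${2:-http://localhost:41002}"
  jq -c '.[]' "$prompts" | while read -r entry; do
    local id request response http_code decision
    id=$(echo "$entry" | jq -r '.id')
    request=$(echo "$entry" | jq -c '.request')
    # -w appends the status code on its own line after the body
    response=$(curl -s -w "\n%{http_code}" "$gateway/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -d "$request")
    http_code=$(echo "$response" | tail -n 1)
    # Assumed status-to-decision mapping; adjust to your gateway's contract
    case "$http_code" in
      200) decision="allow" ;;
      409) decision="block" ;;
      *)   decision="error" ;;
    esac
    jq -cn --arg id "$id" --arg decision "$decision" \
      --argjson http_code "$http_code" \
      '{id: $id, decision: $decision, http_code: $http_code}'
  done | jq -s '.'
}

if [ "$#" -ge 1 ]; then
  run_suite "$@"
fi
```

Run it as ./scripts/run-regression-prompts.sh test-prompts.json while the gateway is up; the emitted array matches the {id, decision} shape the comparison step expects.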
Step 2: Run Against Current Config
# Start gateway with current config
kt gateway run --policy-config current-policy-config.yaml --port 41002 &
GATEWAY_PID=$!
sleep 3
# Run test prompts and capture decisions
./scripts/run-regression-prompts.sh test-prompts.json > results-before.json
kill $GATEWAY_PID
Step 3: Run Against Proposed Config
# Start gateway with proposed config
kt gateway run --policy-config proposed-policy-config.yaml --port 41002 &
GATEWAY_PID=$!
sleep 3
# Run same prompts
./scripts/run-regression-prompts.sh test-prompts.json > results-after.json
kill $GATEWAY_PID
Step 4: Compare Results
# Diff decisions
diff <(jq '[.[] | {id, decision}]' results-before.json) \
     <(jq '[.[] | {id, decision}]' results-after.json)
# Count decision changes (each changed record surfaces as a "<" line from the
# before file; results for newly added test cases appear only as ">" lines)
REGRESSIONS=$(diff <(jq -c '.[] | {id, decision}' results-before.json) \
                   <(jq -c '.[] | {id, decision}' results-after.json) \
              | grep -c "^<")
echo "Regressions detected: $REGRESSIONS"
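The diff above only shows that something changed; you can also assert each result against the expected_decision recorded in the suite itself. A sketch using jq, with inline sample files standing in for test-prompts.json and results-after.json:

```shell
# Illustrative stand-ins for test-prompts.json and results-after.json
cat > /tmp/test-prompts.json <<'EOF'
[{"id": "block-pii-ssn", "expected_decision": "redact"},
 {"id": "allow-general-question", "expected_decision": "allow"}]
EOF
cat > /tmp/results-after.json <<'EOF'
[{"id": "block-pii-ssn", "decision": "allow"},
 {"id": "allow-general-question", "decision": "allow"}]
EOF

# Join expected vs. actual by id and report any mismatches
jq -rn \
  --slurpfile expected /tmp/test-prompts.json \
  --slurpfile actual /tmp/results-after.json '
  ($actual[0] | map({(.id): .decision}) | add) as $got
  | $expected[0][]
  | select(.expected_decision != $got[.id])
  | "\(.id): expected \(.expected_decision), got \($got[.id])"
' | tee /tmp/mismatches.txt
# → block-pii-ssn: expected redact, got allow
```

An empty report means every decision matched; a non-empty one names exactly which test cases regressed.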
Configuration Validation with kt policy lint
Always validate before deploying:
# Syntax and schema validation
kt policy lint --file policy-config.yaml
# Check for common issues:
# - Duplicate policy names
# - Invalid action types
# - Missing required fields
# - Regex syntax errors in DLP patterns
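If you need a standalone check for one of these issues without the CLI, duplicate policy names can be caught with standard tools. A sketch against an illustrative config:

```shell
# Illustrative config with a duplicated policy name
cat > /tmp/sample-policy-config.yaml <<'EOF'
policies:
  - name: block-pii
    action: redact
  - name: allow-general
    action: allow
  - name: block-pii
    action: block
EOF

# List any policy names that appear more than once
grep -E '^[[:space:]]*- name:' /tmp/sample-policy-config.yaml \
  | sed 's/.*name: //' \
  | sort | uniq -d | tee /tmp/dup-names.txt
# → block-pii
```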
Integrate validation as a pre-commit hook:
#!/bin/bash
# .git/hooks/pre-commit — validate policy configs
# --diff-filter=ACM skips deleted files, which would fail the lint on a missing path
CONFIGS=$(git diff --cached --name-only --diff-filter=ACM | grep 'policy-config\.yaml$')
for config in $CONFIGS; do
  echo "Validating $config..."
  if ! kt policy lint --file "$config"; then
    echo "ERROR: Policy config validation failed for $config"
    exit 1
  fi
done
Snapshot Testing Policy Outputs
Snapshot testing captures the full gateway decision for a set of prompts and stores it as a golden file. Future runs compare against this snapshot.
Creating Snapshots
#!/bin/bash
# create-policy-snapshot.sh
CONFIG="$1"
PROMPTS="test-prompts.json"
SNAPSHOT_DIR="snapshots"
mkdir -p "$SNAPSHOT_DIR"
kt gateway run --policy-config "$CONFIG" --port 41099 &
GATEWAY_PID=$!
sleep 3
# Start fresh so stale snapshot lines do not accumulate between runs
: > "$SNAPSHOT_DIR/decisions.jsonl"
jq -c '.[]' "$PROMPTS" | while read -r entry; do
  ID=$(echo "$entry" | jq -r '.id')
  REQUEST=$(echo "$entry" | jq -c '.request')
  RESPONSE=$(curl -s -w "\n%{http_code}" http://localhost:41099/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$REQUEST")
  HTTP_CODE=$(echo "$RESPONSE" | tail -n 1)
  # Build the record with jq so special characters in IDs are escaped safely
  jq -cn --arg id "$ID" --argjson code "$HTTP_CODE" \
    '{id: $id, http_code: $code}' >> "$SNAPSHOT_DIR/decisions.jsonl"
done
kill $GATEWAY_PID
echo "Snapshot saved to $SNAPSHOT_DIR/decisions.jsonl"
Comparing Against Snapshots
#!/bin/bash
# compare-policy-snapshot.sh
SNAPSHOT="snapshots/decisions.jsonl"
CURRENT="current-decisions.jsonl"
# Generate current decisions (same script as above, output to CURRENT)
DIFF_COUNT=$(diff <(sort "$SNAPSHOT") <(sort "$CURRENT") | grep "^[<>]" | wc -l)
if [ "$DIFF_COUNT" -eq 0 ]; then
  echo "PASS: No regressions detected"
else
  echo "FAIL: $DIFF_COUNT decision differences found"
  diff <(sort "$SNAPSHOT") <(sort "$CURRENT")
  exit 1
fi
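A raw diff of sorted snapshots can be hard to read once the suite grows; joining the two files by id yields a per-test-case change report instead. A sketch with illustrative snapshot lines:

```shell
# Illustrative before/after snapshot files
cat > /tmp/snap-before.jsonl <<'EOF'
{"id": "block-pii-ssn", "http_code": 200}
{"id": "allow-general-question", "http_code": 200}
EOF
cat > /tmp/snap-after.jsonl <<'EOF'
{"id": "block-pii-ssn", "http_code": 409}
{"id": "allow-general-question", "http_code": 200}
EOF

# Report each id whose recorded status changed between snapshots
jq -rn \
  --slurpfile before /tmp/snap-before.jsonl \
  --slurpfile after /tmp/snap-after.jsonl '
  ($after | map({(.id): .http_code}) | add) as $new
  | $before[]
  | select(.http_code != $new[.id])
  | "\(.id): \(.http_code) -> \($new[.id])"
' | tee /tmp/changed-ids.txt
# → block-pii-ssn: 200 -> 409
```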
Regression Test Matrix
Maintain a test matrix covering all policy types:
| Policy Type | Test Case Count | Assertion Type |
|---|---|---|
| Topic control (block) | 10+ per topic | HTTP 409 |
| DLP redaction | 5+ per pattern | Redacted markers in body |
| Rate limiting | 3+ burst scenarios | HTTP 429 after N requests |
| Spend limits | 2+ per budget tier | HTTP 402 or budget error |
| Disclaimers | 5+ trigger phrases | Disclaimer appended to body |
| Escalation | 5+ flagged scenarios | Event status = escalated |
| Passthrough | 10+ benign prompts | HTTP 200, no modification |
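Rows of this matrix can be encoded as machine-checkable status assertions. A minimal sketch, assuming each captured result records its category and http_code (the status mapping mirrors the table; body-content and escalation checks are omitted for brevity):

```shell
# Expected HTTP status per policy category, mirroring the matrix above
declare -A EXPECTED_STATUS=(
  [topic_control]=409
  [rate_limit]=429
  [spend_limit]=402
  [passthrough]=200
)

# Illustrative captured results standing in for real gateway output
cat > /tmp/matrix-results.jsonl <<'EOF'
{"id": "block-medical-advice", "category": "topic_control", "http_code": 409}
{"id": "allow-general-question", "category": "passthrough", "http_code": 200}
{"id": "burst-4th-request", "category": "rate_limit", "http_code": 200}
EOF

FAILURES=0
while read -r line; do
  category=$(echo "$line" | jq -r '.category')
  code=$(echo "$line" | jq -r '.http_code')
  want=${EXPECTED_STATUS[$category]}
  if [ -n "$want" ] && [ "$code" != "$want" ]; then
    echo "FAIL $(echo "$line" | jq -r '.id'): want HTTP $want, got $code"
    FAILURES=$((FAILURES + 1))
  fi
done < /tmp/matrix-results.jsonl
echo "Matrix failures: $FAILURES" | tee /tmp/matrix-failures.txt
```

Reading the results file via redirection (rather than a pipe) keeps the FAILURES counter in the parent shell, so the final tally is accurate.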
Automating in CI
# .github/workflows/policy-regression.yml
name: policy-regression
on:
  pull_request:
    paths:
      - "policy-config.yaml"

jobs:
  regression-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Install the kt CLI here before the steps below
      - name: Validate configs
        run: kt policy lint --file policy-config.yaml
      - name: Run snapshot comparison
        run: |
          ./scripts/create-policy-snapshot.sh policy-config.yaml
          ./scripts/compare-policy-snapshot.sh
      - name: Upload regression report
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: regression-report
          path: snapshots/
Key Takeaways
- Always run kt policy lint before deploying policy changes
- Build a fixed test-prompt suite covering every policy type and expected decision
- Use before/after comparison to detect unintended behavioral changes
- Snapshot testing provides a quick regression check for policy outputs
- Automate regression tests in CI to catch issues before they reach production
- Maintain a regression test matrix that grows with your policy configuration
For AI systems
- Canonical terms: regression testing, policy diff, before/after comparison, snapshot testing, kt policy lint, test-prompt suite, change impact analysis
- CLI commands: kt policy lint --file <path>, kt gateway run --policy-config <path>
- Test structure: JSON test-prompt suite with id, category, expected_decision (block/allow/redact), and request payload
- Comparison approach: run the same prompts through the old and new configs, then diff the gateway decisions
- Related pages: Config-as-Code, Mock Gateway, Testing AI Systems
For engineers
- Always run kt policy lint --file <proposed> before deploying any policy change
- Build a fixed test-prompt suite covering every policy type: DLP (SSN, credit card), topic control (blocked topics), rate limits, and passthrough
- Run prompts through the current config, capture decisions; then run through the proposed config and diff
- Use snapshot testing: save expected decision outcomes in a file and compare against actual results
- Automate in CI: run the regression suite on every PR that touches policy-config.yaml
- Validate: any unexpected decision change (e.g., a previously blocked prompt now passes) should fail the build
For leaders
- Policy changes are high-risk — a single misconfigured rule can silently allow prohibited content or block legitimate traffic
- Before/after testing proves the change does exactly what was intended and nothing else
- CI-integrated regression tests prevent production incidents from untested policy modifications
- A growing test-prompt suite provides increasing confidence as the policy configuration evolves
- Snapshot testing scales to hundreds of test cases without manual verification
Next steps
- Structure your configs for safe changes with Config-as-Code promotion workflows
- Use Mock Gateway for fast, deterministic regression testing without provider costs
- Learn Testing AI Systems patterns for policy-as-test-oracle methodology