Regression Testing AI Policies

Policy changes are among the highest-risk modifications in an AI governance platform. A single misconfigured rule can silently allow prohibited content or incorrectly block legitimate traffic. Regression testing ensures that policy updates produce only the intended behavioral changes.

Use this page when

  • You are testing policy config changes to ensure they produce only the intended behavioral differences
  • You need before/after event comparison, snapshot testing, or kt policy lint validation
  • You want to build a fixed test-prompt suite and automate regression detection in CI

Primary audience

  • Primary: Technical Engineers
  • Secondary: AI Agents, Technical Leaders

Policy Change Impact Analysis

Before applying a policy change, understand the blast radius. Compare the current and proposed configurations to identify which rules are added, modified, or removed.

Structural Diff

# Compare current and proposed policy configs
diff --unified current-policy-config.yaml proposed-policy-config.yaml

A raw unified diff is noisy on large configs. Validate the proposed file with kt policy lint, then narrow the comparison to the fields that drive behavior, such as policy names and actions:

# Validate the proposed config and check for structural issues
kt policy lint --file proposed-policy-config.yaml

# Compare policy names and actions between versions
diff <(grep -E "^\s+- name:|^\s+action:" current-policy-config.yaml) \
     <(grep -E "^\s+- name:|^\s+action:" proposed-policy-config.yaml)
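
If the yq YAML processor (mikefarah yq v4) is available, a normalized diff is less brittle than grepping for specific keys, since sorting keys before comparing ignores pure reordering; a sketch, assuming yq is installed:

# Normalize key order in both files, then diff (assumes mikefarah yq v4)
diff <(yq -P 'sort_keys(..)' current-policy-config.yaml) \
     <(yq -P 'sort_keys(..)' proposed-policy-config.yaml)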

Identifying Affected Policy Rules

Create a mapping of added and removed policies (modified rules surface in the structural diff above):

#!/bin/bash
# policy-diff.sh — identify changed policies

CURRENT="current-policy-config.yaml"
PROPOSED="proposed-policy-config.yaml"

echo "=== Policies in current config ==="
grep "^\s*- name:" "$CURRENT" | sed 's/.*name: //'

echo ""
echo "=== Policies in proposed config ==="
grep "^\s*- name:" "$PROPOSED" | sed 's/.*name: //'

echo ""
echo "=== Added policies ==="
diff <(grep "^\s*- name:" "$CURRENT" | sed 's/.*name: //') \
     <(grep "^\s*- name:" "$PROPOSED" | sed 's/.*name: //') \
  | grep "^>" | sed 's/^> //'

echo ""
echo "=== Removed policies ==="
diff <(grep "^\s*- name:" "$CURRENT" | sed 's/.*name: //') \
     <(grep "^\s*- name:" "$PROPOSED" | sed 's/.*name: //') \
  | grep "^<" | sed 's/^< //'

Before/After Event Comparison

The most reliable regression test replays a fixed set of prompts through both the old and new policy configurations, then compares the decision outcomes.

Step 1: Build a Test Prompt Suite

Create a JSON file with categorized test prompts:

[
  {
    "id": "block-pii-ssn",
    "category": "dlp",
    "expected_decision": "redact",
    "request": {
      "model": "gpt-4o",
      "messages": [{"role": "user", "content": "My SSN is 123-45-6789, can you verify it?"}]
    }
  },
  {
    "id": "allow-general-question",
    "category": "passthrough",
    "expected_decision": "allow",
    "request": {
      "model": "gpt-4o",
      "messages": [{"role": "user", "content": "What is the capital of France?"}]
    }
  },
  {
    "id": "block-medical-advice",
    "category": "topic_control",
    "expected_decision": "block",
    "request": {
      "model": "gpt-4o",
      "messages": [{"role": "user", "content": "Should I stop taking my medication?"}]
    }
  }
]
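
Before relying on the suite, it is worth a quick sanity check that the file parses, ids are unique, and every expected decision is a known value; a sketch using jq:

# Fails (nonzero exit) if the suite contains duplicate ids
jq -e 'map(.id) | length == (unique | length)' test-prompts.json

# Fails if any expected_decision is not one of the documented values
jq -e 'all(.[]; .expected_decision | IN("allow", "block", "redact"))' test-prompts.json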

Step 2: Run Against Current Config

# Start gateway with current config
kt gateway run --policy-config current-policy-config.yaml --port 41002 &
GATEWAY_PID=$!
sleep 3

# Run test prompts and capture decisions
./scripts/run-regression-prompts.sh test-prompts.json > results-before.json

kill $GATEWAY_PID
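
The runner script referenced above is not shown in this guide. Below is a minimal sketch of what it could look like; the status-to-decision mapping follows the regression test matrix later on this page (200 = allow, 409 = block, 429 = rate limited) and is an assumption to adapt to how your gateway actually reports decisions. Note that redaction also returns HTTP 200, so distinguishing redact from allow requires a body check.

#!/bin/bash
# run-regression-prompts.sh — illustrative sketch, not shipped with the product.
# Replays each test prompt against the local gateway and prints a JSON array
# of {id, decision} objects on stdout, as consumed in Step 4.
# ASSUMPTION: the decision is derived from the HTTP status per the test
# matrix on this page; adjust the mapping for your deployment.

PROMPTS="${1:?usage: run-regression-prompts.sh <test-prompts.json>}"
PORT="${GATEWAY_PORT:-41002}"

jq -c '.[]' "$PROMPTS" | while read -r entry; do
  ID=$(echo "$entry" | jq -r '.id')
  REQUEST=$(echo "$entry" | jq -c '.request')

  HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
    "http://localhost:$PORT/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "$REQUEST")

  case "$HTTP_CODE" in
    200) DECISION="allow" ;;        # redact also returns 200; add a body check to distinguish
    409) DECISION="block" ;;
    429) DECISION="rate_limited" ;;
    *)   DECISION="error_$HTTP_CODE" ;;
  esac

  printf '{"id": "%s", "decision": "%s"}\n' "$ID" "$DECISION"
done | jq -s '.'   # slurp the per-prompt objects into a single JSON array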

Step 3: Run Against Proposed Config

# Start gateway with proposed config
kt gateway run --policy-config proposed-policy-config.yaml --port 41002 &
GATEWAY_PID=$!
sleep 3

# Run same prompts
./scripts/run-regression-prompts.sh test-prompts.json > results-after.json

kill $GATEWAY_PID

Step 4: Compare Results

# Diff decisions
diff <(jq '[.[] | {id, decision}]' results-before.json) \
     <(jq '[.[] | {id, decision}]' results-after.json)

# Count changed decisions: each change appears in the diff as a '<' (before)
# and '>' (after) pair, so counting '<' lines counts affected test cases
REGRESSIONS=$(diff <(jq -c '.[] | {id, decision}' results-before.json) \
                   <(jq -c '.[] | {id, decision}' results-after.json) | grep "^<" | wc -l)

echo "Regressions detected: $REGRESSIONS"

Configuration Validation with kt policy lint

Always validate before deploying:

# Syntax and schema validation
kt policy lint --file policy-config.yaml

# Check for common issues:
# - Duplicate policy names
# - Invalid action types
# - Missing required fields
# - Regex syntax errors in DLP patterns

Integrate validation as a pre-commit hook:

#!/bin/bash
# .git/hooks/pre-commit — validate policy configs

CONFIGS=$(git diff --cached --name-only | grep "policy-config.yaml")

for config in $CONFIGS; do
  echo "Validating $config..."
  if ! kt policy lint --file "$config"; then
    echo "ERROR: Policy config validation failed for $config"
    exit 1
  fi
done
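
Git only runs the hook if it is executable:

chmod +x .git/hooks/pre-commit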

Snapshot Testing Policy Outputs

Snapshot testing captures the full gateway decision for a set of prompts and stores it as a golden file. Future runs compare against this snapshot.

Creating Snapshots

#!/bin/bash
# create-policy-snapshot.sh

CONFIG="$1"
PROMPTS="test-prompts.json"
SNAPSHOT_DIR="snapshots"

mkdir -p "$SNAPSHOT_DIR"
# Start fresh so reruns do not append to a stale snapshot
: > "$SNAPSHOT_DIR/decisions.jsonl"

kt gateway run --policy-config "$CONFIG" --port 41099 &
GATEWAY_PID=$!
sleep 3

jq -c '.[]' "$PROMPTS" | while read -r entry; do
  ID=$(echo "$entry" | jq -r '.id')
  REQUEST=$(echo "$entry" | jq -c '.request')

  RESPONSE=$(curl -s -w "\n%{http_code}" http://localhost:41099/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$REQUEST")

  HTTP_CODE=$(echo "$RESPONSE" | tail -1)
  BODY=$(echo "$RESPONSE" | head -n -1)  # unused for now; useful if you add body assertions

  echo "{\"id\": \"$ID\", \"http_code\": $HTTP_CODE}" >> "$SNAPSHOT_DIR/decisions.jsonl"
done

kill $GATEWAY_PID
echo "Snapshot saved to $SNAPSHOT_DIR/decisions.jsonl"

Comparing Against Snapshots

#!/bin/bash
# compare-policy-snapshot.sh

SNAPSHOT="snapshots/decisions.jsonl"
CURRENT="current-decisions.jsonl"

# Generate current decisions (same script as above, output to CURRENT)

DIFF_COUNT=$(diff <(sort "$SNAPSHOT") <(sort "$CURRENT") | grep "^[<>]" | wc -l)

if [ "$DIFF_COUNT" -eq 0 ]; then
echo "PASS: No regressions detected"
else
echo "FAIL: $DIFF_COUNT decision differences found"
diff <(sort "$SNAPSHOT") <(sort "$CURRENT")
exit 1
fi

Regression Test Matrix

Maintain a test matrix covering all policy types:

Policy Type           | Test Case Count       | Assertion Type
----------------------|-----------------------|-----------------------------
Topic control (block) | 10+ per topic         | HTTP 409
DLP redaction         | 5+ per pattern        | Redacted markers in body
Rate limiting         | 3+ burst scenarios    | HTTP 429 after N requests
Spend limits          | 2+ per budget tier    | HTTP 402 or budget error
Disclaimers           | 5+ trigger phrases    | Disclaimer appended to body
Escalation            | 5+ flagged scenarios  | Event status = escalated
Passthrough           | 10+ benign prompts    | HTTP 200, no modification
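
A quick way to check the suite against this matrix is to count test cases per category, using the test-prompts.json structure from above:

# Summarize suite coverage per category to spot gaps against the matrix
jq 'group_by(.category) | map({category: .[0].category, count: length})' test-prompts.json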

Automating in CI

# .github/workflows/policy-regression.yml
# Assumes the kt CLI is available on the runner; add an install step if not.
name: policy-regression

on:
  pull_request:
    paths:
      - "policy-config.yaml"

jobs:
  regression-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Validate configs
        run: kt policy lint --file policy-config.yaml

      - name: Run snapshot comparison
        run: |
          ./scripts/create-policy-snapshot.sh policy-config.yaml
          ./scripts/compare-policy-snapshot.sh

      - name: Upload regression report
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: regression-report
          path: snapshots/

Key Takeaways

  • Always run kt policy lint before deploying policy changes
  • Build a fixed test-prompt suite covering every policy type and expected decision
  • Use before/after comparison to detect unintended behavioral changes
  • Snapshot testing provides a quick regression check for policy outputs
  • Automate regression tests in CI to catch issues before they reach production
  • Maintain a regression test matrix that grows with your policy configuration

For AI systems

  • Canonical terms: regression testing, policy diff, before/after comparison, snapshot testing, kt policy lint, test-prompt suite, change impact analysis
  • CLI commands: kt policy lint --file <path>, kt gateway run --policy-config <path>
  • Test structure: JSON test-prompt suite with id, category, expected_decision (block/allow/redact), and request payload
  • Comparison approach: run same prompts through old and new configs, diff the gateway decisions
  • Related pages: Config-as-Code, Mock Gateway, Testing AI Systems

For engineers

  • Always run kt policy lint --file <proposed> before deploying any policy change
  • Build a fixed test-prompt suite covering every policy type: DLP (SSN, credit card), topic control (blocked topics), rate limits, and passthrough
  • Run prompts through the current config, capture decisions; then run through the proposed config and diff
  • Use snapshot testing: save expected decision outcomes in a file and compare against actual results
  • Automate in CI: run the regression suite on every PR that touches policy-config.yaml
  • Validate: any unexpected decision change (e.g., a previously-blocked prompt now passes) should fail the build

For leaders

  • Policy changes are high-risk — a single misconfigured rule can silently allow prohibited content or block legitimate traffic
  • Before/after testing proves the change does exactly what was intended and nothing else
  • CI-integrated regression tests prevent production incidents from untested policy modifications
  • A growing test-prompt suite provides increasing confidence as the policy configuration evolves
  • Snapshot testing scales to hundreds of test cases without manual verification

Next steps

  • Structure your configs for safe changes with Config-as-Code promotion workflows
  • Use Mock Gateway for fast, deterministic regression testing without provider costs
  • Learn Testing AI Systems patterns for policy-as-test-oracle methodology