QA Guide: Testing AI Systems with Governance

Testing AI systems presents unique challenges compared to traditional software. LLM responses are non-deterministic, context-dependent, and difficult to validate with exact assertions. Keeptrusts governance policies make this problem tractable by providing deterministic enforcement boundaries around non-deterministic outputs.

Use this page when

  • You need a QA strategy for non-deterministic AI outputs using governance policies as your test oracle
  • You are implementing gateway replay testing to detect regressions from policy or model changes
  • You want to assert on policy decisions (block, redact, allow) rather than LLM output content

Primary audience

  • Primary: Technical Engineers
  • Secondary: AI Agents, Technical Leaders

The Determinism Challenge

Traditional QA relies on deterministic inputs producing deterministic outputs. With LLMs:

  • The same prompt can produce different responses across runs
  • Temperature, sampling, and model version all affect output
  • Token-level randomness makes exact-match assertions impractical

Governance policies provide the deterministic layer. While the LLM response varies, the policy enforcement is fully deterministic — a blocked topic always blocks, a redaction pattern always redacts, and a spending limit always enforces.
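
One way to see this concretely: send the same blocked prompt twice and compare only the status codes. The response text may differ between runs, but the decision will not. A minimal sketch, assuming a gateway configured with the blocking policy from the next section is listening on port 41002:

# Same prompt, two runs: output text may vary, the policy decision does not
for i in 1 2; do
  curl -s -o /dev/null -w "run $i: %{http_code}\n" \
    http://localhost:41002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Diagnose this patient"}]}'
done
# Expect the same status code both times (409 for a blocked topic)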

Policy-as-Test-Oracle

Use governance policies as your test oracle. Instead of asserting on the LLM output content, assert on the gateway decision:

# policy-config.yaml — test policy for healthcare scenarios
policies:
  - name: block-medical-diagnosis
    type: topic_control
    action: block
    topics:
      - medical_diagnosis
    message: "Direct medical diagnoses are not permitted."

  - name: redact-patient-ids
    type: dlp
    action: redact
    patterns:
      - name: patient_id
        regex: "PAT-\\d{8}"
        replacement: "[REDACTED-ID]"

With this policy, your test assertions become deterministic:

# Test: medical diagnosis topic is blocked
RESPONSE=$(curl -s -w "\n%{http_code}" http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Diagnose this patient with these symptoms: fever, cough"}]
  }')

HTTP_CODE=$(echo "$RESPONSE" | tail -1)
# Assert: gateway returns 409 (policy block)
[ "$HTTP_CODE" = "409" ] && echo "PASS: diagnosis blocked" || echo "FAIL: expected 409, got $HTTP_CODE"

Gateway Replay Testing

Replay testing captures real traffic and replays it through updated policy configurations to detect regressions.

Capture Events

Use the CLI to export decision events from a time window:

# Export recent events as JSON for replay
kt events list --from "2025-04-01T00:00:00Z" --to "2025-04-02T00:00:00Z" \
  --format json --output events-snapshot.json
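
Before replaying, it can help to sanity-check the snapshot. A sketch that tallies events by decision, assuming each event carries the decision field used in the comparison step below:

# Count captured events per decision type
jq -r '.[].decision' events-snapshot.json | sort | uniq -c

# Optionally narrow the replay set to blocked events only
# ("block" is an assumed decision value; adjust to your event schema)
jq '[.[] | select(.decision == "block")]' events-snapshot.json > blocked-events.json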

Replay Through Updated Policies

Start a local gateway with the new policy configuration and replay captured prompts:

# Start gateway with updated policies
kt gateway run --policy-config updated-policy-config.yaml --port 41002 &

# Replay each captured prompt
jq -c '.[] | .request' events-snapshot.json | while read -r request; do
  RESULT=$(echo "$request" | curl -s -w "\n%{http_code}" \
    -X POST http://localhost:41002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @-)
  echo "$RESULT"
done
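
For a quick pass/fail signal before the full decision diff below, the same loop can emit just the status codes and tally them:

# Tally status codes across the replayed requests
jq -c '.[] | .request' events-snapshot.json | while read -r request; do
  echo "$request" | curl -s -o /dev/null -w "%{http_code}\n" \
    -X POST http://localhost:41002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @-
done | sort | uniq -c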

Compare Decisions

Build a comparison report between original and replayed decisions:

# Compare original vs replayed outcomes
kt events list --format json --output replayed-events.json

# Diff the decision fields
diff <(jq '[.[].decision]' events-snapshot.json) \
     <(jq '[.[].decision]' replayed-events.json)
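
diff reports that the arrays differ but not which events changed. For a per-event report, pairing the two decision lists line by line is often more useful; this sketch assumes both files list events in the same order:

# Per-event mismatch report: original decision -> replayed decision
paste <(jq -r '.[].decision' events-snapshot.json) \
      <(jq -r '.[].decision' replayed-events.json) \
  | awk '$1 != $2 { printf "event %d: %s -> %s\n", NR - 1, $1, $2 }'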

Test Categories for Governance

Structure your test suite around these categories:

Category            What to Test                       Oracle
Block policies      Prohibited topics, content types   HTTP 409 response
Redaction policies  PII patterns, sensitive data       Response body contains redacted markers
Rate limits         Request frequency, token budgets   HTTP 429 after threshold
Spend limits        Cost per request, wallet balance   HTTP 402 or budget error
Escalation          Flagged content triggers           Event with escalation status
Passthrough         Allowed content flows cleanly      HTTP 200 with LLM response
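
Most categories reduce to the status-code assertion shown earlier. Rate limits differ in that the oracle only fires once a threshold is crossed, so the test must send a burst. A sketch as a standalone script, assuming a limit low enough to trip within 50 requests:

# Test: rate limit returns 429 once the threshold is exceeded
for i in $(seq 1 50); do
  CODE=$(curl -s -o /dev/null -w "%{http_code}" \
    http://localhost:41002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "ping"}]}')
  if [ "$CODE" = "429" ]; then
    echo "PASS: rate limit enforced after $i requests"
    exit 0
  fi
done
echo "FAIL: no 429 within 50 requests"
exit 1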

Validating Policy Configuration

Before deploying policies, validate the configuration syntax:

# Validate policy config YAML structure
kt policy lint --file policy-config.yaml

# Run the generated pack tests from the current directory
kt policy test --json
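
Assuming the CLI follows the usual convention of exiting non-zero on failure, both commands can gate a deploy script directly; the shape of the --json output is also an assumption here:

# Gate deployment on policy validation (assumes non-zero exit on failure)
kt policy lint --file policy-config.yaml || { echo "policy lint failed"; exit 1; }
kt policy test --json > policy-test-results.json || { echo "policy tests failed"; exit 1; }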

Environment Isolation

Run governance tests in isolated environments to avoid polluting production event streams:

# Start a test-scoped gateway with separate event routing
export KEEPTRUSTS_API_URL="http://localhost:8080"
export KEEPTRUSTS_GATEWAY_TOKEN="$TEST_GATEWAY_KEY"

kt gateway run \
  --listen 0.0.0.0:41099 \
  --policy-config test-policy-config.yaml

Use dedicated API tokens and gateway keys for test environments. This ensures test events are identifiable and can be cleaned up without affecting production data.
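
A quick smoke test confirms the isolated gateway is reachable with the test credentials before the suite runs. The bearer-token header is an assumption; adjust it to your gateway's authentication scheme:

# Smoke test the test-scoped gateway (Authorization header is an assumption)
curl -f -s http://localhost:41099/health \
  -H "Authorization: Bearer $KEEPTRUSTS_GATEWAY_TOKEN" \
  && echo "test gateway ready" || { echo "test gateway unreachable"; exit 1; }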

CI Integration Pattern

Add governance tests to your CI pipeline:

# .github/workflows/governance-qa.yml
jobs:
  policy-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Validate policy config
        run: kt policy lint --file policy-config.yaml

      - name: Start test gateway
        run: |
          kt gateway run --policy-config policy-config.yaml --port 41002 &
          sleep 3

      - name: Check gateway health
        run: curl -f http://localhost:41002/health

      - name: Run policy assertion tests
        run: ./scripts/run-policy-tests.sh
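
The workflow delegates its assertions to scripts/run-policy-tests.sh, which this guide does not define. A minimal sketch of what it might contain, reusing the block and passthrough assertions from above:

#!/usr/bin/env bash
# scripts/run-policy-tests.sh — minimal policy assertion suite (illustrative)
set -euo pipefail

GATEWAY_URL="${GATEWAY_URL:-http://localhost:41002}"

# Blocked topic must return 409
CODE=$(curl -s -o /dev/null -w "%{http_code}" "$GATEWAY_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Diagnose this patient"}]}')
[ "$CODE" = "409" ] || { echo "FAIL: expected 409, got $CODE"; exit 1; }

# Allowed content must pass through with 200
CODE=$(curl -s -o /dev/null -w "%{http_code}" "$GATEWAY_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Explain HIPAA at a high level"}]}')
[ "$CODE" = "200" ] || { echo "FAIL: expected 200, got $CODE"; exit 1; }

echo "All policy assertions passed"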

Key Takeaways

  • Treat governance policies as your deterministic test oracle for non-deterministic AI outputs
  • Structure tests around policy decisions (block, redact, allow) rather than LLM content
  • Use replay testing to detect regressions when updating policy configurations
  • Validate configuration syntax with kt policy lint before deployment
  • Isolate test environments with dedicated gateway keys and API tokens
  • Integrate policy tests into CI for continuous governance verification

For AI systems

  • Canonical terms: policy-as-test-oracle, gateway replay testing, deterministic enforcement, non-deterministic output, decision assertion, event capture, kt events list
  • Key insight: LLM responses are non-deterministic but policy enforcement is fully deterministic — assert on decisions, not content
  • CLI commands: kt events list --from <time> --to <time> --format json --output events-snapshot.json, kt policy lint --file <path>
  • Test assertion pattern: send prompt → check HTTP status (409 = blocked, 200 = allowed) → verify event decision
  • Related pages: Regression Testing, Mock Gateway, Quality Scoring

For engineers

  • Use governance policies as your test oracle: assert on gateway decisions (HTTP 409 for block, 200 for allow) not on LLM content
  • Structure tests around policy behavior: DLP redaction triggers on pattern, topic control blocks on category, rate limits return 429
  • Capture real traffic with kt events list --format json --output events-snapshot.json for replay testing
  • Replay captured prompts through updated policy configs to detect unintended behavioral changes
  • Validate config syntax with kt policy lint --file <path> before deploying to any environment
  • Isolate test environments with dedicated gateway keys and API tokens to prevent production interference
  • Integrate policy tests into CI for continuous governance verification on every code change

For leaders

  • Policy-as-test-oracle solves the fundamental QA challenge of non-deterministic AI outputs
  • Deterministic policy enforcement means governance tests are reliable and repeatable (no flaky tests)
  • Replay testing detects regressions when updating either policies or models
  • CI-integrated governance testing ensures every deployment maintains compliance and safety guarantees
  • This approach scales to thousands of test cases without requiring manual review of each LLM response
