QA Guide: Testing AI Systems with Governance

Testing AI systems presents unique challenges compared to traditional software. LLM responses are non-deterministic, context-dependent, and difficult to validate with exact assertions. Keeptrusts governance policies make this problem tractable by providing deterministic enforcement boundaries around non-deterministic outputs.

Use this page when

  • You need a QA strategy for non-deterministic AI outputs using governance policies as your test oracle
  • You are implementing gateway replay testing to detect regressions from policy or model changes
  • You want to assert on policy decisions (block, redact, allow) rather than LLM output content

Primary audience

  • Primary: Technical Engineers
  • Secondary: AI Agents, Technical Leaders

The Determinism Challenge

Traditional QA relies on deterministic inputs producing deterministic outputs. With LLMs:

  • The same prompt can produce different responses across runs
  • Temperature, sampling, and model version all affect output
  • Token-level randomness makes exact-match assertions impractical

Governance policies provide the deterministic layer. While the LLM response varies, the policy enforcement is fully deterministic — a blocked topic always blocks, a redaction pattern always redacts, and a spending limit always enforces.
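
One way to see this concretely: send the same blocked prompt twice and compare only the status codes. The response text may differ between runs, but the decision will not. A minimal sketch, assuming a gateway configured with the blocking policy from the next section is listening on port 41002:

# Same prompt, two runs: output text may vary, the policy decision does not
for i in 1 2; do
  curl -s -o /dev/null -w "run $i: %{http_code}\n" \
    http://localhost:41002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Diagnose this patient"}]}'
done
# Expect the same status code both times (409 for a blocked topic)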

Policy-as-Test-Oracle

Use governance policies as your test oracle. Instead of asserting on the LLM output content, assert on the gateway decision:

# policy-config.yaml — test policy for healthcare scenarios
policies:
  - name: block-medical-diagnosis
    type: topic_control
    action: block
    topics:
      - medical_diagnosis
    message: "Direct medical diagnoses are not permitted."

  - name: redact-patient-ids
    type: dlp
    action: redact
    patterns:
      - name: patient_id
        regex: "PAT-\\d{8}"
        replacement: "[REDACTED-ID]"

With this policy, your test assertions become deterministic:

# Test: medical diagnosis topic is blocked
RESPONSE=$(curl -s -w "\n%{http_code}" http://localhost:41002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Diagnose this patient with these symptoms: fever, cough"}]
  }')

HTTP_CODE=$(echo "$RESPONSE" | tail -1)
# Assert: gateway returns 409 (policy block)
[ "$HTTP_CODE" = "409" ] && echo "PASS: diagnosis blocked" || echo "FAIL: expected 409, got $HTTP_CODE"

Gateway Replay Testing

Replay testing captures real traffic and replays it through updated policy configurations to detect regressions.

Capture Events

Use the CLI to export decision events from a time window:

# Export recent events as JSON for replay
kt events list --from "2025-04-01T00:00:00Z" --to "2025-04-02T00:00:00Z" \
  --format json --output events-snapshot.json
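
Before replaying, it can help to sanity-check the snapshot. A sketch that tallies events by decision, assuming each event carries the decision field used in the comparison step below:

# Count captured events per decision type
jq -r '.[].decision' events-snapshot.json | sort | uniq -c

# Optionally narrow the replay set to blocked events only
# ("block" is an assumed decision value; adjust to your event schema)
jq '[.[] | select(.decision == "block")]' events-snapshot.json > blocked-events.json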

Replay Through Updated Policies

Start a local gateway with the new policy configuration and replay captured prompts:

# Start gateway with updated policies
kt gateway run --policy-config updated-policy-config.yaml --port 41002 &

# Replay each captured prompt
jq -c '.[] | .request' events-snapshot.json | while read -r request; do
  RESULT=$(echo "$request" | curl -s -w "\n%{http_code}" \
    -X POST http://localhost:41002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @-)
  echo "$RESULT"
done
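
For a quick pass/fail signal before the full decision diff below, the same loop can emit just the status codes and tally them:

# Tally status codes across the replayed requests
jq -c '.[] | .request' events-snapshot.json | while read -r request; do
  echo "$request" | curl -s -o /dev/null -w "%{http_code}\n" \
    -X POST http://localhost:41002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d @-
done | sort | uniq -c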

Compare Decisions

Build a comparison report between original and replayed decisions:

# Compare original vs replayed outcomes
kt events list --format json --output replayed-events.json

# Diff the decision fields
diff <(jq '[.[].decision]' events-snapshot.json) \
     <(jq '[.[].decision]' replayed-events.json)
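
diff reports that the arrays differ but not which events changed. For a per-event report, pairing the two decision lists line by line is often more useful; this sketch assumes both files list events in the same order:

# Per-event mismatch report: original decision -> replayed decision
paste <(jq -r '.[].decision' events-snapshot.json) \
      <(jq -r '.[].decision' replayed-events.json) \
  | awk '$1 != $2 { printf "event %d: %s -> %s\n", NR - 1, $1, $2 }'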

Test Categories for Governance

Structure your test suite around these categories:

Category            What to Test                       Oracle
Block policies      Prohibited topics, content types   HTTP 409 response
Redaction policies  PII patterns, sensitive data       Response body contains redacted markers
Rate limits         Request frequency, token budgets   HTTP 429 after threshold
Spend limits        Cost per request, wallet balance   HTTP 402 or budget error
Escalation          Flagged content triggers           Event with escalation status
Passthrough         Allowed content flows cleanly      HTTP 200 with LLM response
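
Most categories reduce to the status-code assertion shown earlier. Rate limits differ in that the oracle only fires once a threshold is crossed, so the test must send a burst. A sketch as a standalone script, assuming a limit low enough to trip within 50 requests:

# Test: rate limit returns 429 once the threshold is exceeded
for i in $(seq 1 50); do
  CODE=$(curl -s -o /dev/null -w "%{http_code}" \
    http://localhost:41002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "ping"}]}')
  if [ "$CODE" = "429" ]; then
    echo "PASS: rate limit enforced after $i requests"
    exit 0
  fi
done
echo "FAIL: no 429 within 50 requests"
exit 1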

Validating Policy Configuration

Before deploying policies, validate the configuration syntax:

# Validate policy config YAML structure
kt policy lint --file policy-config.yaml

# Run the generated pack tests from the current directory
kt policy test --json
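
Assuming the CLI follows the usual convention of exiting non-zero on failure, both commands can gate a deploy script directly; the shape of the --json output is also an assumption here:

# Gate deployment on policy validation (assumes non-zero exit on failure)
kt policy lint --file policy-config.yaml || { echo "policy lint failed"; exit 1; }
kt policy test --json > policy-test-results.json || { echo "policy tests failed"; exit 1; }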

Environment Isolation

Run governance tests in isolated environments to avoid polluting production event streams:

# Start a test-scoped gateway with separate event routing
export KEEPTRUSTS_API_URL="http://localhost:8080"
export KEEPTRUSTS_GATEWAY_TOKEN="$TEST_GATEWAY_KEY"

kt gateway run \
  --listen 0.0.0.0:41099 \
  --policy-config test-policy-config.yaml

Use dedicated API tokens and gateway keys for test environments. This ensures test events are identifiable and can be cleaned up without affecting production data.
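
A quick smoke test confirms the isolated gateway is reachable with the test credentials before the suite runs. The bearer-token header is an assumption; adjust it to your gateway's authentication scheme:

# Smoke test the test-scoped gateway (Authorization header is an assumption)
curl -f -s http://localhost:41099/health \
  -H "Authorization: Bearer $KEEPTRUSTS_GATEWAY_TOKEN" \
  && echo "test gateway ready" || { echo "test gateway unreachable"; exit 1; }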

CI Integration Pattern

Add governance tests to your CI pipeline:

# .github/workflows/governance-qa.yml
jobs:
  policy-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Validate policy config
        run: kt policy lint --file policy-config.yaml

      - name: Start test gateway
        run: |
          kt gateway run --policy-config policy-config.yaml --port 41002 &
          sleep 3

      - name: Check gateway health
        run: curl -f http://localhost:41002/health

      - name: Run policy assertion tests
        run: ./scripts/run-policy-tests.sh
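
The workflow delegates its assertions to scripts/run-policy-tests.sh, which this guide does not define. A minimal sketch of what it might contain, reusing the block and passthrough assertions from above:

#!/usr/bin/env bash
# scripts/run-policy-tests.sh — minimal policy assertion suite (illustrative)
set -euo pipefail

GATEWAY_URL="${GATEWAY_URL:-http://localhost:41002}"

# Blocked topic must return 409
CODE=$(curl -s -o /dev/null -w "%{http_code}" "$GATEWAY_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Diagnose this patient"}]}')
[ "$CODE" = "409" ] || { echo "FAIL: expected 409, got $CODE"; exit 1; }

# Allowed content must pass through with 200
CODE=$(curl -s -o /dev/null -w "%{http_code}" "$GATEWAY_URL/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "Explain HIPAA at a high level"}]}')
[ "$CODE" = "200" ] || { echo "FAIL: expected 200, got $CODE"; exit 1; }

echo "All policy assertions passed"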

Key Takeaways

  • Treat governance policies as your deterministic test oracle for non-deterministic AI outputs
  • Structure tests around policy decisions (block, redact, allow) rather than LLM content
  • Use replay testing to detect regressions when updating policy configurations
  • Validate configuration syntax with kt policy lint before deployment
  • Isolate test environments with dedicated gateway keys and API tokens
  • Integrate policy tests into CI for continuous governance verification

For AI systems

  • Canonical terms: policy-as-test-oracle, gateway replay testing, deterministic enforcement, non-deterministic output, decision assertion, event capture, kt events list
  • Key insight: LLM responses are non-deterministic but policy enforcement is fully deterministic — assert on decisions, not content
  • CLI commands: kt events list --from <time> --to <time> --format json --output events-snapshot.json, kt policy lint --file <path>
  • Test assertion pattern: send prompt → check HTTP status (409 = blocked, 200 = allowed) → verify event decision
  • Related pages: Regression Testing, Mock Gateway, Quality Scoring

For engineers

  • Use governance policies as your test oracle: assert on gateway decisions (HTTP 409 for block, 200 for allow) not on LLM content
  • Structure tests around policy behavior: DLP redaction triggers on pattern, topic control blocks on category, rate limits return 429
  • Capture real traffic with kt events list --format json --output events-snapshot.json for replay testing
  • Replay captured prompts through updated policy configs to detect unintended behavioral changes
  • Validate config syntax with kt policy lint --file <path> before deploying to any environment
  • Isolate test environments with dedicated gateway keys and API tokens to prevent production interference
  • Integrate policy tests into CI for continuous governance verification on every code change

For leaders

  • Policy-as-test-oracle solves the fundamental QA challenge of non-deterministic AI outputs
  • Deterministic policy enforcement means governance tests are reliable and repeatable (no flaky tests)
  • Replay testing detects regressions when updating either policies or models
  • CI-integrated governance testing ensures every deployment maintains compliance and safety guarantees
  • This approach scales to thousands of test cases without requiring manual review of each LLM response
