Setting Semantic Replay Thresholds
Direct semantic replay returns a cached response when a new request is semantically similar to a previously cached one. The similarity_threshold controls how similar two requests must be before the cache serves the stored response.
Use this page when
- You need to choose the right similarity_threshold value for your team’s accuracy and cost needs.
- You are monitoring threshold effectiveness and deciding whether to adjust up or down.
- You want per-agent threshold overrides for different use cases (security vs explanation vs boilerplate).
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
How It Works
When a request arrives:
- The gateway computes a semantic embedding of the request.
- It searches the cache for entries with a cosine similarity above the threshold.
- If a match is found, the cached response is returned without calling the upstream provider.
- If no match exceeds the threshold, the request goes to the provider and the response is cached.
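The lookup steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the gateway's implementation: the in-memory list cache, the `lookup` function, and the toy two-dimensional embeddings are all assumptions made for the example. It also assumes a strict "greater than" comparison, following the wording "above the threshold"; how the gateway treats an exact tie is not stated here.

```python
import math
from typing import Optional


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def lookup(
    cache: list[tuple[list[float], str]],
    query: list[float],
    threshold: float = 0.95,
) -> Optional[str]:
    """Return the best cached response if its similarity is above the threshold.

    `cache` is a hypothetical list of (embedding, response) pairs.
    Returning None stands for "forward the request to the provider".
    """
    best_score, best_response = -1.0, None
    for embedding, response in cache:
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_score, best_response = score, response
    # Assumption: strict comparison, matching "above the threshold".
    return best_response if best_score > threshold else None


cache = [([1.0, 0.0], "cached answer A"), ([0.0, 1.0], "cached answer B")]
near_match = lookup(cache, [1.0, 0.05])  # similarity to A is about 0.999
no_match = lookup(cache, [1.0, 1.0])     # similarity to either entry is about 0.707
```

With the default of 0.95, the first query replays "cached answer A" and the second falls through to the provider.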
Default Threshold
workflow_cache:
  similarity_threshold: 0.95
The default of 0.95 is conservative — it requires very high similarity before replaying a cached response. This prioritizes accuracy over hit rate.
Understanding the Threshold Scale
| Threshold | Behavior | Hit Rate | Accuracy Risk |
|---|---|---|---|
| 0.99 | Near-exact match only | Very low | Minimal |
| 0.95 | Default — high similarity required | Moderate | Low |
| 0.92 | Relaxed — broader pattern matching | High | Moderate |
| 0.88 | Aggressive — wider semantic matching | Very high | Higher |
| 0.85 | Maximum sharing — use with caution | Highest | Significant |
Threshold Selection Guidelines
Code generation and refactoring (0.95–0.98)
For tasks where precise context matters — generating new code, refactoring existing code, or explaining specific implementations:
workflow_cache:
  similarity_threshold: 0.96
Higher thresholds ensure the cached response truly matches the intent. A request about "adding an Axum handler for user creation" should not return a cached response about "adding an Axum handler for event ingestion."
Documentation and explanation (0.90–0.94)
For tasks where the answer is more general — explaining concepts, describing patterns, or answering "how do I" questions:
workflow_cache:
  similarity_threshold: 0.92
Lower thresholds work here because the answer to "how do I write a unit test in Rust" is broadly applicable regardless of minor context differences.
Boilerplate and templates (0.88–0.92)
For repetitive tasks where the output is highly predictable — generating test scaffolding, creating migration templates, or adding standard error handling:
workflow_cache:
  similarity_threshold: 0.90
These tasks produce similar outputs regardless of small input variations.
Security-sensitive operations (0.97–0.99)
For tasks involving auth, encryption, access control, or compliance:
workflow_cache:
  similarity_threshold: 0.98
Higher thresholds prevent a cached response about one security context from being replayed in a different security context.
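As a quick sanity check, the four guideline ranges can be encoded as data and used to flag a configured value that falls outside the range for its task type. The sketch below is a hypothetical helper, not part of the product: the task keys and the `within_guideline` function are names invented for this example, and only the numeric ranges come from this page.

```python
# Guideline ranges from this page, keyed by hypothetical task labels.
GUIDELINE_RANGES = {
    "code_generation": (0.95, 0.98),
    "documentation": (0.90, 0.94),
    "boilerplate": (0.88, 0.92),
    "security": (0.97, 0.99),
}


def within_guideline(task: str, threshold: float) -> bool:
    """True if `threshold` lies inside the guideline range for `task` (inclusive)."""
    low, high = GUIDELINE_RANGES[task]
    return low <= threshold <= high


security_ok = within_guideline("security", 0.98)           # inside 0.97-0.99
boilerplate_strict = within_guideline("boilerplate", 0.95)  # above 0.88-0.92
```

A value outside the range is not necessarily wrong, but it is worth a second look before deploying.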
Threshold Impact on Accuracy vs Savings
Conservative (0.95+)
- Cache hits only when requests are nearly identical.
- Low risk of serving an irrelevant response.
- Lower cost savings — more requests go to the provider.
- Best for teams that prioritize correctness over cost.
Balanced (0.91–0.94)
- Cache hits for requests with shared patterns and intent.
- Moderate risk — occasionally a cached response may be slightly off-target.
- Good cost savings — 30–50% of similar requests served from cache.
- Best for teams with consistent coding patterns across engineers.
Aggressive (0.85–0.90)
- Cache hits for broadly similar requests.
- Higher risk of serving responses that do not perfectly match the request.
- Maximum cost savings — 50–70% of requests served from cache.
- Best for teams doing repetitive tasks with predictable outputs.
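The three profiles above can be expressed as a small classifier, which is handy when reviewing many configs at once. This is a sketch under stated assumptions: the function name is hypothetical, and because the documented ranges leave small gaps (for example between 0.94 and 0.95), values in a gap are assigned to the lower profile here.

```python
def threshold_profile(threshold: float) -> str:
    """Map a threshold to the profile names used on this page.

    Assumption: values that fall between documented ranges (such as 0.945)
    are grouped with the lower profile.
    """
    if threshold >= 0.95:
        return "conservative"
    if threshold >= 0.91:
        return "balanced"
    if threshold >= 0.85:
        return "aggressive"
    return "below documented range"


profiles = [threshold_profile(t) for t in (0.98, 0.92, 0.88, 0.80)]
```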
Monitoring Threshold Effectiveness
After setting your threshold, monitor these metrics:
Cache hit rate
Check the org dashboard or spend logs for the ratio of cache hits to total requests. A healthy hit rate depends on your team size and work patterns:
- Small team (2–5 engineers): 15–30% hit rate is normal.
- Medium team (5–20 engineers): 25–45% hit rate is achievable.
- Large team (20+ engineers): 35–60% hit rate is possible.
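To compare your own numbers against these bands, a minimal sketch follows. The function names are hypothetical, and the input counts are assumed to come from whatever export your dashboard or spend logs provide. Because the listed team sizes overlap at 5 and 20, this sketch assumes a team of exactly 5 counts as small and a team of exactly 20 counts as large.

```python
def hit_rate(cache_hits: int, total_requests: int) -> float:
    """Ratio of cache hits to total requests (0.0 when there is no traffic)."""
    if total_requests == 0:
        return 0.0
    return cache_hits / total_requests


def expected_band(engineers: int) -> tuple[float, float]:
    """Typical hit-rate band from this page for a given team size.

    Assumption: boundary sizes 5 and 20 go to the small and large bands.
    """
    if engineers >= 20:
        return (0.35, 0.60)
    if engineers > 5:
        return (0.25, 0.45)
    return (0.15, 0.30)


rate = hit_rate(cache_hits=300, total_requests=1000)
low, high = expected_band(engineers=12)
in_band = low <= rate <= high
```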
User satisfaction signals
If engineers report receiving irrelevant cached responses, your threshold is too low. Look for:
- Regeneration requests immediately following a cache hit.
- Feedback signals indicating "not helpful" on cached responses.
- Engineers disabling the cache for specific tasks.
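The first signal, regeneration right after a cache hit, can be estimated from an event log. The sketch below assumes a hypothetical log format: a time-ordered list of dictionaries with `session`, `type` ("request" or "regenerate"), and `cache_hit` fields. None of these field names come from the product; adapt them to whatever your logs actually contain.

```python
def regeneration_rate_after_hits(events: list[dict]) -> float:
    """Share of cache hits that were immediately followed by a regeneration
    in the same session. `events` must be in time order."""
    last_event_by_session: dict[str, dict] = {}
    hits = 0
    regenerated_hits = 0
    for event in events:
        previous = last_event_by_session.get(event["session"])
        if (
            event["type"] == "regenerate"
            and previous is not None
            and previous["cache_hit"]
        ):
            regenerated_hits += 1
        if event["cache_hit"]:
            hits += 1
        last_event_by_session[event["session"]] = event
    return regenerated_hits / hits if hits else 0.0


events = [
    {"session": "a", "type": "request", "cache_hit": True},
    {"session": "a", "type": "regenerate", "cache_hit": False},
    {"session": "b", "type": "request", "cache_hit": True},
    {"session": "b", "type": "request", "cache_hit": False},
]
rate_after_hits = regeneration_rate_after_hits(events)
```

In this sample, one of the two cache hits was regenerated, so the rate is 0.5. A rate that climbs after you lower the threshold suggests the new value is too permissive.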
Cost savings
Compare your monthly spend before and after enabling semantic replay. The spend dashboard shows cached_input_tokens vs input_tokens to quantify savings.
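A rough way to turn those two fields into a single number is the cached share of input tokens. The sketch below assumes that `input_tokens` counts tokens sent to the provider and `cached_input_tokens` counts tokens served from cache, with no overlap between the two. That interpretation is an assumption; if your dashboard reports cached tokens as a subset of `input_tokens`, divide by `input_tokens` alone instead.

```python
def cached_share(cached_input_tokens: int, input_tokens: int) -> float:
    """Fraction of all input tokens that were served from cache.

    Assumption: the two counts are disjoint, so the total is their sum.
    """
    total = cached_input_tokens + input_tokens
    if total == 0:
        return 0.0
    return cached_input_tokens / total


share = cached_share(cached_input_tokens=300_000, input_tokens=700_000)
```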
Per-Agent Threshold Overrides
You can set different thresholds for different agents:
workflow_cache:
  similarity_threshold: 0.95
  agent_overrides:
    - agent_id: "code-reviewer"
      similarity_threshold: 0.92
    - agent_id: "security-auditor"
      similarity_threshold: 0.98
This allows you to relax thresholds for general-purpose agents while keeping them strict for security-sensitive ones.
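One way to picture how an override resolves is shown below. This is a sketch of the likely lookup order (an agent-specific value if present, otherwise the global default), not confirmed gateway behavior. The `effective_threshold` function is hypothetical, and the dictionary mirrors the YAML above as it would look after parsing.

```python
# The YAML config above, represented as a parsed dictionary.
CONFIG = {
    "workflow_cache": {
        "similarity_threshold": 0.95,
        "agent_overrides": [
            {"agent_id": "code-reviewer", "similarity_threshold": 0.92},
            {"agent_id": "security-auditor", "similarity_threshold": 0.98},
        ],
    }
}


def effective_threshold(config: dict, agent_id: str) -> float:
    """Return the override for `agent_id`, or the global default if none exists.

    Assumption: the first matching override wins.
    """
    cache_config = config["workflow_cache"]
    for override in cache_config.get("agent_overrides", []):
        if override["agent_id"] == agent_id:
            return override["similarity_threshold"]
    return cache_config["similarity_threshold"]


reviewer = effective_threshold(CONFIG, "code-reviewer")
auditor = effective_threshold(CONFIG, "security-auditor")
other = effective_threshold(CONFIG, "docs-writer")  # no override defined
```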
Adjusting Over Time
Start with the default (0.95) and adjust based on observed behavior:
- Deploy with 0.95 and monitor for two weeks.
- If hit rates are below 20% and your team does repetitive work, lower to 0.92.
- If engineers report irrelevant responses, raise to 0.96 or higher.
- Check spend logs monthly to verify the threshold delivers meaningful savings.
- Re-evaluate after team size changes or when adopting new frameworks.
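The two adjustment rules above can be written as a small decision function. This is a sketch, with a hypothetical function name and inputs; it also assumes that reports of irrelevant responses take precedence over a low hit rate when both occur, which this page does not specify.

```python
def suggest_threshold(
    current: float,
    hit_rate: float,
    repetitive_work: bool,
    irrelevant_reports: bool,
) -> float:
    """Suggest the next threshold using the rules of thumb on this page.

    Assumption: accuracy complaints outrank a low hit rate.
    """
    if irrelevant_reports:
        return max(current, 0.96)  # raise to 0.96 or keep a stricter value
    if hit_rate < 0.20 and repetitive_work:
        return min(current, 0.92)  # lower to 0.92 or keep a looser value
    return current


lowered = suggest_threshold(0.95, hit_rate=0.10, repetitive_work=True, irrelevant_reports=False)
raised = suggest_threshold(0.95, hit_rate=0.10, repetitive_work=True, irrelevant_reports=True)
unchanged = suggest_threshold(0.95, hit_rate=0.30, repetitive_work=True, irrelevant_reports=False)
```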
For AI systems
- Canonical terms: Keeptrusts, similarity_threshold, semantic replay, cosine similarity, cache accuracy, threshold tuning, agent_overrides.
- Config keys: workflow_cache.similarity_threshold, workflow_cache.agent_overrides[].agent_id, workflow_cache.agent_overrides[].similarity_threshold.
- Best next pages: Controlling Semantic Replay by Scope, Per-Agent Cache Policies, Cache Hit Rates: What Good Looks Like.
For engineers
- Default: 0.95 (conservative). Start here and adjust after 2 weeks of observation.
- Threshold guidelines: code generation 0.95–0.98, documentation/explanation 0.90–0.94, boilerplate 0.88–0.92, security-sensitive 0.97–0.99.
- Monitor: hit rate, regeneration requests after cache hits (frequent regenerations suggest the threshold is too low), cost savings in spend dashboard.
- Per-agent overrides allow different thresholds without changing the global default.
- If hit rates < 20% and team does repetitive work, lower to 0.92. If engineers report irrelevant responses, raise to 0.96+.
For leaders
- The threshold directly controls the cost-vs-accuracy trade-off: lower = more savings, higher = more precision.
- Conservative (0.95+): low risk, lower savings. Best for teams that prioritize correctness.
- Aggressive (0.85–0.90): high savings (50–70% of requests cached), but occasional off-target responses.
- Monitor user satisfaction signals: regeneration requests and “not helpful” feedback indicate the threshold is too low.
- Re-evaluate after team size changes or when adopting new frameworks.
Next steps
- Controlling Semantic Replay by Scope — layered scope precedence
- Per-Agent Cache Policies — control replay per agent type
- Cache Hit Rates: What Good Looks Like — expected benchmarks