Setting Semantic Replay Thresholds

Direct semantic replay returns a cached response when a new request is semantically similar to a previously cached one. The similarity_threshold controls how similar two requests must be before the cache serves the stored response.

Use this page when

  • You need to choose the right similarity_threshold value for your team’s accuracy and cost needs.
  • You are monitoring threshold effectiveness and deciding whether to adjust up or down.
  • You want per-agent threshold overrides for different use cases (security vs explanation vs boilerplate).

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

How It Works

When a request arrives:

  1. The gateway computes a semantic embedding of the request.
  2. It searches the cache for entries with a cosine similarity above the threshold.
  3. If a match is found, the cached response is returned without calling the upstream provider.
  4. If no match exceeds the threshold, the request goes to the provider and the response is cached.
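The lookup flow above can be sketched as follows. This is a simplified model, not the gateway's actual internals: the embedding function, the list-based cache store, and the variable names are all hypothetical stand-ins.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def lookup(request_embedding, cache, threshold=0.95):
    """Return the best cached response at or above the threshold, or None.

    `cache` is a list of (embedding, response) pairs -- a toy stand-in
    for the gateway's vector index.
    """
    best_score, best_response = threshold, None
    for embedding, response in cache:
        score = cosine_similarity(request_embedding, embedding)
        if score >= best_score:
            best_score, best_response = score, response
    return best_response

cache = [([1.0, 0.0, 0.1], "cached answer")]
# Nearly identical embedding: served from cache without a provider call.
print(lookup([1.0, 0.0, 0.12], cache))  # -> cached answer
# Dissimilar embedding: no match, so the request would go to the provider.
print(lookup([0.0, 1.0, 0.0], cache))   # -> None
```

Returning `None` here corresponds to step 4: the request is forwarded to the provider and the fresh response is then cached.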

Default Threshold

workflow_cache:
  similarity_threshold: 0.95

The default of 0.95 is conservative — it requires very high similarity before replaying a cached response. This prioritizes accuracy over hit rate.

Understanding the Threshold Scale

| Threshold | Behavior | Hit Rate | Accuracy Risk |
| --- | --- | --- | --- |
| 0.99 | Near-exact match only | Very low | Minimal |
| 0.95 | Default — high similarity required | Moderate | Low |
| 0.92 | Relaxed — broader pattern matching | High | Moderate |
| 0.88 | Aggressive — wider semantic matching | Very high | Higher |
| 0.85 | Maximum sharing — use with caution | Highest | Significant |

Threshold Selection Guidelines

Code generation and refactoring (0.95–0.98)

For tasks where precise context matters — generating new code, refactoring existing code, or explaining specific implementations:

workflow_cache:
  similarity_threshold: 0.96

Higher thresholds ensure the cached response truly matches the intent. A request about "adding an Axum handler for user creation" should not return a cached response about "adding an Axum handler for event ingestion."

Documentation and explanation (0.90–0.94)

For tasks where the answer is more general — explaining concepts, describing patterns, or answering "how do I" questions:

workflow_cache:
  similarity_threshold: 0.92

Lower thresholds work here because the answer to "how do I write a unit test in Rust" is broadly applicable regardless of minor context differences.

Boilerplate and templates (0.88–0.92)

For repetitive tasks where the output is highly predictable — generating test scaffolding, creating migration templates, or adding standard error handling:

workflow_cache:
  similarity_threshold: 0.90

These tasks produce similar outputs regardless of small input variations.

Security-sensitive operations (0.97–0.99)

For tasks involving auth, encryption, access control, or compliance:

workflow_cache:
  similarity_threshold: 0.98

Higher thresholds prevent a cached response about one security context from being replayed in a different security context.

Threshold Impact on Accuracy vs Savings

Conservative (0.95+)

  • Cache hits only when requests are nearly identical.
  • Low risk of serving an irrelevant response.
  • Lower cost savings — more requests go to the provider.
  • Best for teams that prioritize correctness over cost.

Balanced (0.91–0.94)

  • Cache hits for requests with shared patterns and intent.
  • Moderate risk — occasionally a cached response may be slightly off-target.
  • Good cost savings — 30–50% of similar requests served from cache.
  • Best for teams with consistent coding patterns across engineers.

Aggressive (0.85–0.90)

  • Cache hits for broadly similar requests.
  • Higher risk of serving responses that do not perfectly match the request.
  • Maximum cost savings — 50–70% of requests served from cache.
  • Best for teams doing repetitive tasks with predictable outputs.
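As a rough back-of-envelope, you can translate a hit rate into expected provider spend. The 30–70% ranges above are this page's figures; the formula below is my simplification, which assumes every cache hit avoids one provider call of average cost and ignores cache-serving overhead.

```python
def estimated_spend(baseline_monthly_spend, hit_rate):
    """Rough planning estimate of provider spend with caching enabled.

    Assumes each cache hit avoids one provider call and that all
    requests cost roughly the same -- a simplification only.
    """
    return baseline_monthly_spend * (1.0 - hit_rate)

# Balanced regime: ~40% of requests served from cache.
print(estimated_spend(10_000, 0.40))  # -> 6000.0
```

Use this only to sanity-check expectations before deployment; the spend dashboard gives the real numbers afterwards.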

Monitoring Threshold Effectiveness

After setting your threshold, monitor these metrics:

Cache hit rate

Check the org dashboard or spend logs for the ratio of cache hits to total requests. A healthy hit rate depends on your team size and work patterns:

  • Small team (2–5 engineers): 15–30% hit rate is normal.
  • Medium team (5–20 engineers): 25–45% hit rate is achievable.
  • Large team (20+ engineers): 35–60% hit rate is possible.
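If you export request logs rather than reading the dashboard, the hit rate is a simple ratio. The `cache_hit` field name below is a hypothetical export format, not the product's actual schema.

```python
def hit_rate(log_entries):
    """Fraction of requests served from cache.

    Each entry is a dict with a boolean 'cache_hit' field
    (a hypothetical export format).
    """
    if not log_entries:
        return 0.0
    hits = sum(1 for entry in log_entries if entry["cache_hit"])
    return hits / len(log_entries)

logs = [
    {"cache_hit": True},
    {"cache_hit": False},
    {"cache_hit": True},
    {"cache_hit": False},
]
print(f"{hit_rate(logs):.0%}")  # -> 50%
```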

User satisfaction signals

If engineers report receiving irrelevant cached responses, your threshold is too low. Look for:

  • Regeneration requests immediately following a cache hit.
  • Feedback signals indicating "not helpful" on cached responses.
  • Engineers disabling the cache for specific tasks.

Cost savings

Compare your monthly spend before and after enabling semantic replay. The spend dashboard shows cached_input_tokens vs input_tokens to quantify savings.
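A quick ratio turns those two dashboard figures into a cached share. This sketch assumes `cached_input_tokens` and `input_tokens` are disjoint counters (cached vs provider-billed); if the dashboard reports them differently, adjust accordingly.

```python
def cached_token_share(cached_input_tokens, input_tokens):
    """Share of all input tokens served from cache.

    Assumes the two counters are disjoint: cached tokens vs tokens
    billed by the provider.
    """
    total = cached_input_tokens + input_tokens
    return cached_input_tokens / total if total else 0.0

# Illustrative numbers only, not real dashboard output.
share = cached_token_share(cached_input_tokens=3_000_000, input_tokens=7_000_000)
print(f"{share:.0%}")  # -> 30%
```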

Per-Agent Threshold Overrides

You can set different thresholds for different agents:

workflow_cache:
  similarity_threshold: 0.95
  agent_overrides:
    - agent_id: "code-reviewer"
      similarity_threshold: 0.92
    - agent_id: "security-auditor"
      similarity_threshold: 0.98

This allows you to relax thresholds for general-purpose agents while keeping them strict for security-sensitive ones.
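Override resolution can be modeled as a lookup that falls back to the global default. The dict below mirrors the workflow_cache YAML shape as parsed config; the resolution function itself is an illustration, not the gateway's code.

```python
def effective_threshold(config, agent_id):
    """Resolve the similarity threshold for an agent.

    A per-agent override wins; otherwise the global default applies.
    `config` mirrors the workflow_cache YAML as a plain dict.
    """
    for override in config.get("agent_overrides", []):
        if override["agent_id"] == agent_id:
            return override["similarity_threshold"]
    return config["similarity_threshold"]

config = {
    "similarity_threshold": 0.95,
    "agent_overrides": [
        {"agent_id": "code-reviewer", "similarity_threshold": 0.92},
        {"agent_id": "security-auditor", "similarity_threshold": 0.98},
    ],
}
print(effective_threshold(config, "security-auditor"))  # -> 0.98
print(effective_threshold(config, "docs-writer"))       # -> 0.95
```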

Adjusting Over Time

Start with the default (0.95) and adjust based on observed behavior:

  1. Deploy with 0.95 and monitor for two weeks.
  2. If hit rates are below 20% and your team does repetitive work, lower to 0.92.
  3. If engineers report irrelevant responses, raise to 0.96 or higher.
  4. Check spend logs monthly to verify the threshold delivers meaningful savings.
  5. Re-evaluate after team size changes or when adopting new frameworks.
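Steps 2 and 3 above can be captured as a small decision helper. The 20% floor and the 0.92/0.96 targets come from this page; the helper itself is only a sketch of the tuning rule, not a supported feature.

```python
def suggest_threshold(current, hit_rate, repetitive_work, irrelevant_reports):
    """Suggest the next threshold per the tuning steps above.

    - Engineers report irrelevant responses -> raise to at least 0.96.
    - Hit rate under 20% on repetitive work -> lower to 0.92.
    - Otherwise keep the current value.
    """
    if irrelevant_reports:
        return max(current, 0.96)
    if hit_rate < 0.20 and repetitive_work:
        return min(current, 0.92)
    return current

print(suggest_threshold(0.95, hit_rate=0.12, repetitive_work=True, irrelevant_reports=False))  # -> 0.92
print(suggest_threshold(0.95, hit_rate=0.35, repetitive_work=True, irrelevant_reports=True))   # -> 0.96
```

Accuracy complaints take priority over hit rate in this sketch, matching the order of the steps above: never trade correctness for savings.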

For AI systems

For engineers

  • Default: 0.95 (conservative). Start here and adjust after 2 weeks of observation.
  • Threshold guidelines: code generation 0.95–0.98, documentation/explanation 0.90–0.94, boilerplate 0.88–0.92, security-sensitive 0.97–0.99.
  • Monitor: hit rate, regeneration requests immediately after cache hits (frequent regenerations mean the threshold is too low), and cost savings in the spend dashboard.
  • Per-agent overrides allow different thresholds without changing the global default.
  • If hit rates < 20% and team does repetitive work, lower to 0.92. If engineers report irrelevant responses, raise to 0.96+.

For leaders

  • The threshold directly controls the cost-vs-accuracy trade-off: lower = more savings, higher = more precision.
  • Conservative (0.95+): low risk, lower savings. Best for teams that prioritize correctness.
  • Aggressive (0.85–0.90): high savings (50–70% of requests cached), but occasional off-target responses.
  • Monitor user satisfaction signals: regeneration requests and “not helpful” feedback indicate the threshold is too low.
  • Re-evaluate after team size changes or when adopting new frameworks.

Next steps