Reducing Redundant LLM Calls Across Your Team

Every redundant LLM call is money wasted — tokens sent for a response that already exists somewhere in your organization. This guide covers practical strategies for identifying and eliminating redundant calls using org-shared cache, fabric context, and single-flight coordination.

Use this page when

  • You suspect your team is making repeated LLM calls for questions already answered elsewhere in the organization.
  • You want to implement org-shared deduplication, semantic dedup, or single-flight coordination.
  • You need to quantify the cost of redundancy and the savings from elimination.

Primary audience

  • Primary: Technical Leaders
  • Secondary: Technical Engineers, AI Agents

Common Redundancy Patterns

Pattern 1: Same File Explanation

The most common redundancy. Multiple engineers ask "what does this file/function do?" about the same code:

Before caching:

Engineer A: "Explain src/auth/middleware.ts" → 4,200 tokens → $0.013
Engineer B: "What does the auth middleware do?" → 3,800 tokens → $0.011
Engineer C: "How does middleware.ts handle auth?" → 4,100 tokens → $0.012
...
(15 engineers ask about the same file this week)
Total: 15 × ~$0.012 = $0.18 for one file

After caching:

Engineer A: "Explain src/auth/middleware.ts" → cache miss → $0.013
Engineers B-O: → cache hit × 14 → $0.00
Total: $0.013 (saved $0.17 on one file)

Scale this across 500+ files that engineers regularly ask about, and the savings compound to thousands per month.
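
The arithmetic above can be sketched as a quick estimator. This is an illustrative helper, not a product API; it assumes the simple model from the example (first ask pays full price, every later ask is a free cache hit):

```python
def redundancy_savings(askers: int, avg_cost_per_call: float) -> dict:
    """Estimate spend on one repeated question before and after caching.

    Assumes the first ask is a cache miss (full price) and every
    subsequent ask is a cache hit ($0), as in the example above.
    """
    before = askers * avg_cost_per_call   # every engineer pays upstream
    after = avg_cost_per_call             # only the first ask pays
    return {
        "before": round(before, 3),
        "after": round(after, 3),
        "saved": round(before - after, 3),
    }

# 15 engineers asking about one file at ~$0.012 per call:
print(redundancy_savings(15, 0.012))
```

Multiply the `saved` figure by the number of hot files in your codebase to project monthly savings.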

Pattern 2: Same API Documentation Lookup

Engineers frequently ask AI to explain API contracts, endpoint behavior, and integration patterns:

Before caching:

Monday: 8 engineers ask about the payment webhook contract
Tuesday: 5 engineers ask about the auth token refresh flow
Wednesday: 12 engineers ask about the event ingest API
...
Each query: ~5,000 tokens, ~$0.015
Weekly redundant cost: (8+5+12) × $0.015 = $0.375 for just 3 APIs

After caching:

First query per API: $0.015 (fill)
All subsequent queries: $0.00 (cache hit)
Weekly cost: 3 × $0.015 = $0.045 (88% reduction)

Pattern 3: Same Error Diagnosis

When a recurring error appears, multiple engineers paste the same stack trace:

Before caching:

Production alert fires at 2:00 PM
2:05 PM - Engineer A pastes stack trace → $0.02
2:07 PM - Engineer B pastes same trace → $0.02
2:10 PM - Engineer C pastes same trace → $0.02
2:12 PM - Engineer D pastes same trace → $0.02
2:15 PM - Engineer E pastes same trace → $0.02
Total for one incident: 5 × $0.02 = $0.10

After caching:

2:05 PM - Engineer A → cache miss → $0.02
2:07 PM - Engineers B-E → cache hit → $0.00
Total: $0.02 (saved $0.08 per incident)

For teams experiencing 2-3 incidents per week with 5-10 responders each, this adds up quickly.

Pattern 4: Onboarding Questions

New engineers ask the same questions that every previous new hire asked:

Before caching:

Each new hire asks ~200 codebase questions in their first month
90% of these questions were asked by previous new hires
200 questions × $0.015 avg × 10 new hires/quarter = $30/quarter in redundant onboarding

After caching:

First new hire: 200 questions → ~20 cache misses + 180 hits
Subsequent hires: 200 questions → ~5 misses + 195 hits
Total for 10 hires: $0.015 × (20 + 9×5) = $0.98 (vs $30 uncached)

How Org-Shared Cache Deduplicates Automatically

You don't need to identify redundancies manually. Org-shared cache handles deduplication transparently:

Exact Deduplication

When two engineers send prompts with identical normalized content, the cache key matches exactly:

Key = hash(org_id, entitlement_digest, config_version, "Explain PaymentService.processRefund()")

Both engineers hit the same key → one upstream call serves both.
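
The key formula above can be sketched in a few lines. The field names come from the formula; the concrete hash (SHA-256 over a delimited join) and the whitespace-collapsing normalization are assumptions for illustration:

```python
import hashlib

def cache_key(org_id: str, entitlement_digest: str,
              config_version: str, prompt: str) -> str:
    """Build an exact-match cache key from the formula above.

    Normalization here is just whitespace collapsing; the real
    normalizer and hash construction are implementation details.
    """
    normalized = " ".join(prompt.split())
    material = "\x1f".join([org_id, entitlement_digest,
                            config_version, normalized])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

# Two engineers, identical normalized prompt -> identical key:
k1 = cache_key("acme", "d41d8cd9", "v7",
               "Explain PaymentService.processRefund()")
k2 = cache_key("acme", "d41d8cd9", "v7",
               "  Explain   PaymentService.processRefund()  ")
assert k1 == k2
```

Because `entitlement_digest` and `config_version` are part of the key, engineers with different entitlements or configs never share entries.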

Semantic Deduplication

When two engineers ask the same question with different wording, semantic matching identifies the overlap:

"How does processRefund work?" ≈ "Explain the refund processing logic"
Semantic similarity: 0.94 (above threshold 0.85)
→ Cache hit
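
The threshold check above reduces to a cosine-similarity comparison between prompt embeddings. A minimal sketch, using toy vectors in place of a real embedding model:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm

def is_semantic_hit(similarity: float, threshold: float = 0.85) -> bool:
    """A rephrased prompt replays the cached answer when its
    embedding similarity clears the configured threshold."""
    return similarity >= threshold

# Toy embeddings standing in for "How does processRefund work?"
# and "Explain the refund processing logic":
v1 = [0.8, 0.5, 0.3]
v2 = [0.7, 0.6, 0.3]
sim = cosine_similarity(v1, v2)
assert is_semantic_hit(sim)  # the two phrasings converge on one entry
```

Raising the threshold trades hit rate for precision: fewer replays, but less risk of serving an answer to a subtly different question.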

Fabric-Mediated Deduplication

When fabric context makes prompts converge (same pre-built summaries attached), even loosely related questions about the same code hit cache:

Engineer A's prompt = fabric_context(auth.ts) + "explain this"
Engineer B's prompt = fabric_context(auth.ts) + "what does this do"
Same fabric context prefix → high cache key overlap → hit
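
The convergence above can be sketched as follows. Hashing the shared context prefix separately from the question is an assumed mechanism for illustration; the artifact string is made up:

```python
import hashlib

def fabric_prompt(context_artifact: str, question: str) -> tuple[str, str]:
    """Assemble a prompt as fabric context + question, returning the
    context-prefix hash alongside the full prompt. Prompts that share
    a prefix hash can share cached work on that prefix."""
    prefix_hash = hashlib.sha256(context_artifact.encode("utf-8")).hexdigest()
    return prefix_hash, context_artifact + "\n\n" + question

# Both engineers get the same pre-built summary attached:
ctx = "file_summary(auth.ts): JWT validation middleware summary"
hash_a, _ = fabric_prompt(ctx, "explain this")
hash_b, _ = fabric_prompt(ctx, "what does this do")
assert hash_a == hash_b  # same fabric prefix -> same cacheable prefix
```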

How Fabric Context Reduces Prompt Size

Beyond deduplication, fabric reduces the token count of each request, saving money even on cache misses:

Without Fabric

Prompt: "Explain this code: [500 lines of raw source code pasted]"
Input tokens: 12,000
Cost: $0.036

With Fabric

Prompt: "Explain this module" + [attached: file_summary(auth.ts), 200 tokens]
Input tokens: 800
Cost: $0.0024

Fabric reduces input tokens by 40-80% per request because:

  • File summaries replace raw file contents (10× compression)
  • Repo maps replace manual project structure description
  • Dependency graphs replace manual import chain enumeration
  • Symbol indexes enable precise lookup instead of sending entire files

Token Reduction by Artifact Type

Artifact            Replaces                                Token reduction
file_summary        Raw file content as context             80-90%
repo_map            Manual project structure description    70-85%
dependency_graph    Import chain enumeration                75-90%
api_inventory       Endpoint documentation lookup           60-80%
symbol_index        Full-file search for a function         85-95%

How Single-Flight Fill Prevents Concurrent Duplicates

Single-flight fill is the real-time deduplication mechanism for concurrent requests:

The Problem Without Single-Flight

09:00:00.100 - Request A: "explain auth flow" → miss → upstream call
09:00:00.200 - Request B: "explain auth flow" → miss → upstream call (DUPLICATE!)
09:00:00.350 - Request C: "explain auth flow" → miss → upstream call (DUPLICATE!)
09:00:00.500 - Request D: "explain auth flow" → miss → upstream call (DUPLICATE!)

Result: 4 upstream calls, response available at 09:00:03
Cost: 4× the necessary amount

The Solution With Single-Flight

09:00:00.100 - Request A: "explain auth flow" → miss → becomes flight leader
09:00:00.200 - Request B: "explain auth flow" → miss → joins flight, waits
09:00:00.350 - Request C: "explain auth flow" → miss → joins flight, waits
09:00:00.500 - Request D: "explain auth flow" → miss → joins flight, waits

09:00:03.000 - Leader receives response → serves A, B, C, D → caches
Result: 1 upstream call, 4 responses delivered
Cost: 1× (75% savings on this burst alone)
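
The leader/follower coordination above can be sketched with standard threading primitives. This is a minimal single-process illustration, not the product's implementation (which coordinates across the whole org):

```python
import threading

class SingleFlight:
    """Coalesce concurrent calls for the same key into one upstream call.

    Minimal sketch: the first caller for a key becomes the flight
    leader; followers block on an Event until the leader publishes
    the result. (Error handling and result eviction are omitted.)
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._flights: dict = {}   # key -> Event for the in-progress flight
        self._results: dict = {}   # key -> last completed result

    def do(self, key, fn):
        with self._lock:
            if key in self._flights:              # a flight is in progress
                event, leader = self._flights[key], False
            else:                                 # become the flight leader
                event = threading.Event()
                self._flights[key] = event
                leader = True
        if leader:
            try:
                self._results[key] = fn()         # the one upstream call
            finally:
                event.set()                       # wake all followers
                with self._lock:
                    del self._flights[key]
        else:
            event.wait()                          # join the flight
        return self._results[key]

# Usage: four concurrent requests for the same key -> one upstream call.
import time

call_count = 0
def fetch_explanation():
    global call_count
    call_count += 1
    time.sleep(0.2)               # simulate upstream latency
    return "auth flow explanation"

sf = SingleFlight()
threads = [threading.Thread(target=sf.do,
                            args=("explain auth flow", fetch_explanation))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# call_count is 1: only the leader reached upstream
```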

When Single-Flight Fires Most

  • Morning startup: Teams begin work simultaneously (9-10 AM spike)
  • After deployments: Engineers explore new code simultaneously
  • Incident response: Multiple engineers investigate the same symptoms
  • Sprint planning: Engineers research the same features simultaneously
  • After standups: Engineers act on the same discussion points

Before/After Metrics

Team of 100 Engineers — Monthly Metrics

Metric                     Before      After (Month 1)    After (Month 3)
Total LLM calls            100,000     100,000            100,000
Calls reaching provider    100,000     25,000             12,000
Cache hits                 0           75,000             88,000
Single-flight dedup        0           3,000              4,000
Monthly provider cost      $4,000      $1,000             $480
Avoided cost               $0          $3,000             $3,520
Savings rate               0%          75%                88%

Per-Engineer Impact

Metric                       Before    After (steady state)
Prompts/day                  50        50 (unchanged)
Prompts hitting provider     50        6
Daily cost per engineer      $2.00     $0.24
Monthly cost per engineer    $40       $4.80

Engineers notice no difference in their experience — responses arrive just as fast (often faster from cache). The savings are entirely behind the scenes.

Actionable Steps to Maximize Deduplication

1. Connect Your Highest-Traffic Repos First

The repo that generates the most AI prompts offers the highest deduplication potential. Check your spend dashboard for prompt volume by repository context.

2. Enable Fabric for All Artifact Types

Each artifact type creates another axis of cache convergence. Don't skip artifacts — each one increases the chance of cross-engineer cache hits.

3. Keep TTL Appropriate to Change Rate

  • Stable code (shared libs): 7-day TTL → maximum reuse
  • Active development (main app): 24-hour TTL → good balance
  • Rapid iteration (feature branches): 4-hour TTL → freshness priority
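
The tiers above can be expressed as a simple policy table. The field names and structure here are illustrative, not a documented config schema:

```python
# Hypothetical TTL policy mapping a repo's change-rate tier to a
# cache TTL in seconds, mirroring the guidance above.
TTL_POLICY = {
    "stable": 7 * 24 * 3600,   # shared libs: 7-day TTL, maximum reuse
    "active": 24 * 3600,       # main app: 24-hour TTL, good balance
    "rapid":  4 * 3600,        # feature branches: 4-hour TTL, freshness
}

def ttl_for(change_rate: str) -> int:
    """Return the cache TTL in seconds for a repo's change-rate tier."""
    return TTL_POLICY[change_rate]

print(ttl_for("stable"))
```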

4. Monitor Single-Flight Events

Check your dashboard for single-flight coordination events. High single-flight counts indicate times of peak concurrent redundancy — these are the moments of maximum savings.

5. Don't Restrict AI Usage

The more engineers use AI tools, the more the cache fills, and the cheaper each subsequent query becomes. Liberal AI usage policies actually reduce per-query cost in a cached environment.

Next steps

For AI systems

  • Canonical terms to keep: redundant calls, deduplication, org-shared cache, semantic dedup, single-flight fill, fabric context reduction, cache hit, redundancy pattern.
  • Dedup layers: exact-match (org-shared), semantic similarity (threshold-based), fabric-mediated (context prefill), single-flight (concurrent request coordination).
  • Best next pages: Cache Hit Rates, Semantic Replay Thresholds, How 100 Engineers Share One Cache.

For engineers

  • Four dedup layers (outermost first): org-shared exact match → semantic replay → fabric context prefill → single-flight concurrent dedup.
  • Common redundancy patterns: same-file explanations (60%+ overlap across team), repeated error lookups, duplicate dependency questions.
  • Enable semantic replay with semantic_similarity_threshold: 0.92 for aggressive dedup; raise to 0.95 for conservative.
  • Single-flight: when N engineers ask the same question within seconds, only 1 upstream call is made. Others wait and share the response.
  • TTL tuning by codebase stability: stable library repos → 48h, active feature branches → 4h, rapid iteration → 2h.
  • Monitor single-flight events in the dashboard — high counts = peak concurrent redundancy = maximum savings.

For leaders

  • Typical 100-engineer team wastes 40–60% of LLM spend on redundant calls (same questions asked by different people or tools).
  • Org-shared cache eliminates the largest redundancy category immediately — no behavior change required from engineers.
  • Single-flight coordination eliminates the "morning surge" pattern where everyone asks similar questions at standup/start of day.
  • Key policy insight: liberal AI usage policies reduce per-query cost because more usage fills the cache faster.
  • Combined strategies reduce effective per-query cost by 70–85% within the first two weeks.