Reducing Redundant LLM Calls Across Your Team
Every redundant LLM call is money wasted — tokens sent for a response that already exists somewhere in your organization. This guide covers practical strategies for identifying and eliminating redundant calls using org-shared cache, fabric context, and single-flight coordination.
Use this page when
- You suspect your team is making repeated LLM calls for questions already answered elsewhere in the organization.
- You want to implement org-shared deduplication, semantic dedup, or single-flight coordination.
- You need to quantify the cost of redundancy and the savings from elimination.
Primary audience
- Primary: Technical Leaders
- Secondary: Technical Engineers, AI Agents
Common Redundancy Patterns
Pattern 1: Same File Explanation
The most common redundancy. Multiple engineers ask "what does this file/function do?" about the same code:
Before caching:
Engineer A: "Explain src/auth/middleware.ts" → 4,200 tokens → $0.013
Engineer B: "What does the auth middleware do?" → 3,800 tokens → $0.011
Engineer C: "How does middleware.ts handle auth?" → 4,100 tokens → $0.012
...
(15 engineers ask about the same file this week)
Total: 15 × ~$0.012 = $0.18 for one file
After caching:
Engineer A: "Explain src/auth/middleware.ts" → cache miss → $0.013
Engineers B-O: → cache hit × 14 → $0.00
Total: $0.013 (saved $0.17 on one file)
Scale this across the 500+ files engineers regularly ask about and this pattern alone saves hundreds of dollars per month; combined with the other patterns below, the savings reach into the thousands.
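To see how the arithmetic scales, here is a minimal back-of-the-envelope sketch in Python. The file count, askers per file, and per-query cost are illustrative assumptions taken from the example above, not measured values.

```python
# Back-of-the-envelope estimate of savings from deduplicating repeated
# "explain this file" queries. All inputs are illustrative assumptions.

def dedup_savings(files: int, askers_per_file: int, cost_per_query: float) -> dict:
    uncached = files * askers_per_file * cost_per_query  # every ask hits the provider
    cached = files * cost_per_query                      # only the first ask per file pays
    return {
        "uncached": round(uncached, 2),
        "cached": round(cached, 2),
        "saved": round(uncached - cached, 2),
    }

# 500 files, ~15 askers per file per week, ~$0.012 per query
weekly = dedup_savings(files=500, askers_per_file=15, cost_per_query=0.012)
print(weekly)  # {'uncached': 90.0, 'cached': 6.0, 'saved': 84.0} -> roughly $360/month
```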
Pattern 2: Same API Documentation Lookup
Engineers frequently ask AI to explain API contracts, endpoint behavior, and integration patterns:
Before caching:
Monday: 8 engineers ask about the payment webhook contract
Tuesday: 5 engineers ask about the auth token refresh flow
Wednesday: 12 engineers ask about the event ingest API
...
Each query: ~5,000 tokens, ~$0.015
Weekly redundant cost: (8+5+12) × $0.015 = $0.375 for just 3 APIs
After caching:
First query per API: $0.015 (fill)
All subsequent queries: $0.00 (cache hit)
Weekly cost: 3 × $0.015 = $0.045 (88% reduction)
Pattern 3: Same Error Diagnosis
When a recurring error appears, multiple engineers paste the same stack trace:
Before caching:
Production alert fires at 2:00 PM
2:05 PM - Engineer A pastes stack trace → $0.02
2:07 PM - Engineer B pastes same trace → $0.02
2:10 PM - Engineer C pastes same trace → $0.02
2:12 PM - Engineer D pastes same trace → $0.02
2:15 PM - Engineer E pastes same trace → $0.02
Total for one incident: 5 × $0.02 = $0.10
After caching:
2:05 PM - Engineer A → cache miss → $0.02
2:07 PM - Engineers B-E → cache hit → $0.00
Total: $0.02 (saved $0.08 per incident)
For teams experiencing 2-3 incidents per week with 5-10 responders each, this adds up quickly.
Pattern 4: Onboarding Questions
New engineers ask the same questions that every previous new hire asked:
Before caching:
Each new hire asks ~200 codebase questions in their first month
90% of these questions were asked by previous new hires
200 questions × $0.015 avg × 10 new hires/quarter = $30/quarter in onboarding queries, ~90% of it redundant
After caching:
First new hire: 200 questions → ~20 cache misses + 180 hits
Subsequent hires: 200 questions → ~5 misses + 195 hits
Total for 10 hires: $0.015 × (20 + 9×5) = $0.98 (vs $30 uncached)
How Org-Shared Cache Deduplicates Automatically
You don't need to identify redundancies manually. Org-shared cache handles deduplication transparently:
Exact Deduplication
When two engineers send prompts with identical normalized content, the cache key matches exactly:
Key = hash(org_id, entitlement_digest, config_version, "Explain PaymentService.processRefund()")
Both engineers hit the same key → one upstream call serves both.
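A minimal sketch of how such a key could be derived, assuming the prompt is whitespace-normalized before hashing. The function name and the normalization rules are illustrative assumptions, not the cache's actual implementation.

```python
import hashlib

def cache_key(org_id: str, entitlement_digest: str, config_version: str, prompt: str) -> str:
    # Normalize the prompt so trivially different submissions (extra whitespace,
    # trailing newlines) still map to the same key. These normalization rules
    # are an assumption for illustration.
    normalized = " ".join(prompt.split())
    material = "\x1f".join([org_id, entitlement_digest, config_version, normalized])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

# Two engineers, same normalized prompt -> same key -> one upstream call.
a = cache_key("org-42", "ent-d1", "cfg-7", "Explain PaymentService.processRefund()")
b = cache_key("org-42", "ent-d1", "cfg-7", "Explain  PaymentService.processRefund()\n")
assert a == b
```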
Semantic Deduplication
When two engineers ask the same question with different wording, semantic matching identifies the overlap:
"How does processRefund work?" ≈ "Explain the refund processing logic"
Semantic similarity: 0.94 (above threshold 0.85)
→ Cache hit
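A sketch of threshold-based semantic matching, assuming a cache that stores an embedding alongside each response. Here embed is a placeholder for whatever embedding model the cache uses, and the linear scan is for illustration only.

```python
import math

SIMILARITY_THRESHOLD = 0.85  # matches the example above; higher = more conservative

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def semantic_lookup(prompt, cache, embed):
    """Return a cached response whose stored prompt embedding is similar enough.

    `cache` maps cache keys to (embedding, response) pairs; a real implementation
    would use an approximate nearest-neighbor index rather than a linear scan.
    """
    query_vec = embed(prompt)
    best_score, best_response = 0.0, None
    for _key, (vec, response) in cache.items():
        score = cosine(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= SIMILARITY_THRESHOLD else None
```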
Fabric-Mediated Deduplication
When fabric context makes prompts converge (same pre-built summaries attached), even loosely related questions about the same code hit cache:
Engineer A's prompt = fabric_context(auth.ts) + "explain this"
Engineer B's prompt = fabric_context(auth.ts) + "what does this do"
Same fabric context prefix → high cache key overlap → hit
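A sketch of why this convergence happens, assuming prompts are assembled with the shared fabric artifact placed first. The build_prompt helper and the summary string are illustrative, not the actual assembly code.

```python
def build_prompt(fabric_context: str, question: str) -> str:
    # Put the shared, pre-built artifact first so prompts from different
    # engineers share as long a common prefix as possible.
    return f"{fabric_context}\n\n{question}"

auth_summary = "file_summary(auth.ts): <200-token pre-built summary>"  # shared artifact

prompt_a = build_prompt(auth_summary, "explain this")
prompt_b = build_prompt(auth_summary, "what does this do")

# Both prompts start with the identical fabric prefix; a prefix-aware cache
# (or the semantic layer above) can serve B from the work already done for A.
common = len(auth_summary)
assert prompt_a[:common] == prompt_b[:common]
```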
How Fabric Context Reduces Prompt Size
Beyond deduplication, fabric reduces the token count of each request, saving money even on cache misses:
Without Fabric
Prompt: "Explain this code: [500 lines of raw source code pasted]"
Input tokens: 12,000
Cost: $0.036
With Fabric
Prompt: "Explain this module" + [attached: file_summary(auth.ts), 200 tokens]
Input tokens: 800
Cost: $0.0024
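A sketch of the substitution this example describes, assuming a hypothetical context_for helper that prefers a pre-built file_summary over raw source. The token heuristic is a rough word-count approximation, not a real tokenizer.

```python
def approx_tokens(text: str) -> int:
    # Crude heuristic (~0.75 words per token), for illustration only;
    # real counts come from the model's tokenizer.
    return int(len(text.split()) / 0.75)

def context_for(path: str, raw_source: str, summaries: dict[str, str]) -> str:
    # Prefer the pre-built fabric summary; fall back to raw source only
    # when no artifact exists for the file.
    return summaries.get(path, raw_source)

raw_source = "..."  # imagine ~500 lines of middleware.ts pasted here (~12,000 tokens)
summaries = {
    "src/auth/middleware.ts": "file_summary: validates JWTs, refreshes expired tokens, "
                              "attaches the user to the request context, rejects on failure."
}

context = context_for("src/auth/middleware.ts", raw_source, summaries)
prompt = f"{context}\n\nExplain this module"
print(approx_tokens(prompt))  # a few hundred tokens instead of ~12,000
```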
Fabric typically reduces input tokens by 40-80% per request because:
- File summaries replace raw file contents (10× compression)
- Repo maps replace manual project structure description
- Dependency graphs replace manual import chain enumeration
- Symbol indexes enable precise lookup instead of sending entire files
Token Reduction by Artifact Type
| Artifact | Replaces | Token reduction |
|---|---|---|
| file_summary | Raw file content as context | 80-90% |
| repo_map | Manual project structure description | 70-85% |
| dependency_graph | Import chain enumeration | 75-90% |
| api_inventory | Endpoint documentation lookup | 60-80% |
| symbol_index | Full-file search for a function | 85-95% |
How Single-Flight Fill Prevents Concurrent Duplicates
Single-flight fill is the real-time deduplication mechanism for concurrent requests:
The Problem Without Single-Flight
09:00:00.100 - Request A: "explain auth flow" → miss → upstream call
09:00:00.200 - Request B: "explain auth flow" → miss → upstream call (DUPLICATE!)
09:00:00.350 - Request C: "explain auth flow" → miss → upstream call (DUPLICATE!)
09:00:00.500 - Request D: "explain auth flow" → miss → upstream call (DUPLICATE!)
Result: 4 upstream calls, response available at 09:00:03
Cost: 4× the necessary amount
The Solution With Single-Flight
09:00:00.100 - Request A: "explain auth flow" → miss → becomes flight leader
09:00:00.200 - Request B: "explain auth flow" → miss → joins flight, waits
09:00:00.350 - Request C: "explain auth flow" → miss → joins flight, waits
09:00:00.500 - Request D: "explain auth flow" → miss → joins flight, waits
09:00:03.000 - Leader receives response → serves A, B, C, D → caches
Result: 1 upstream call, 4 responses delivered
Cost: 1× (75% savings on this burst alone)
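A minimal asyncio sketch of the single-flight pattern. The class and method names are illustrative, and a production version would also need timeouts, an error-handling policy, and coordination across processes rather than within one event loop.

```python
import asyncio

class SingleFlight:
    """Coalesce concurrent requests for the same cache key into one upstream call."""

    def __init__(self) -> None:
        self._in_flight: dict[str, asyncio.Future] = {}

    async def do(self, key: str, upstream_call):
        if key in self._in_flight:
            # A flight leader already exists: wait for its result instead of
            # issuing a duplicate upstream call.
            return await self._in_flight[key]
        future = asyncio.get_running_loop().create_future()
        self._in_flight[key] = future
        try:
            result = await upstream_call()   # only the leader pays for this call
            future.set_result(result)
            return result
        except Exception as exc:
            future.set_exception(exc)
            raise
        finally:
            del self._in_flight[key]         # later arrivals become a fresh miss

async def demo() -> None:
    calls = 0

    async def explain_auth_flow() -> str:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.1)             # stand-in for real LLM latency
        return "auth flow explanation"

    sf = SingleFlight()
    results = await asyncio.gather(
        *(sf.do("explain auth flow", explain_auth_flow) for _ in range(4))
    )
    print(calls, len(results))               # 1 upstream call, 4 responses delivered

asyncio.run(demo())
```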
When Single-Flight Fires Most
- Morning startup: Teams begin work simultaneously (9-10 AM spike)
- After deployments: Engineers explore new code simultaneously
- Incident response: Multiple engineers investigate the same symptoms
- Sprint planning: Engineers research the same features simultaneously
- After standups: Engineers act on the same discussion points
Before/After Metrics
Team of 100 Engineers — Monthly Metrics
| Metric | Before | After (Month 1) | After (Month 3) |
|---|---|---|---|
| Total LLM calls | 100,000 | 100,000 | 100,000 |
| Calls reaching provider | 100,000 | 25,000 | 12,000 |
| Cache hits | 0 | 75,000 | 88,000 |
| Single-flight dedup | 0 | 3,000 | 4,000 |
| Monthly provider cost | $4,000 | $1,000 | $480 |
| Avoided cost | $0 | $3,000 | $3,520 |
| Savings rate | 0% | 75% | 88% |
Per-Engineer Impact
| Metric | Before | After (steady state) |
|---|---|---|
| Prompts/day | 50 | 50 (unchanged) |
| Prompts hitting provider | 50 | 6 |
| Daily cost per engineer | $2.00 | $0.24 |
| Monthly cost per engineer | $40 | $4.80 |
Engineers notice no difference in their experience — responses arrive just as fast (often faster from cache). The savings are entirely behind the scenes.
Actionable Steps to Maximize Deduplication
1. Connect Your Highest-Traffic Repos First
The repo that generates the most AI prompts offers the highest deduplication potential. Check your spend dashboard for prompt volume by repository context.
2. Enable Fabric for All Artifact Types
Each artifact type creates another axis of cache convergence. Don't skip artifacts — each one increases the chance of cross-engineer cache hits.
3. Keep TTL Appropriate to Change Rate
- Stable code (shared libs): 7-day TTL → maximum reuse
- Active development (main app): 24-hour TTL → good balance
- Rapid iteration (feature branches): 4-hour TTL → freshness priority (see the config sketch after this list)
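These tiers could be expressed as a simple per-repository cache policy. A sketch under that assumption; the repo names and the policy shape are hypothetical, and only the TTL values come from the list above.

```python
# Hypothetical TTL policy keyed by how quickly the code changes.
TTL_POLICY = {
    "stable": 7 * 24 * 3600,  # shared libraries: 7 days
    "active": 24 * 3600,      # main application: 24 hours
    "rapid": 4 * 3600,        # feature branches: 4 hours
}

REPO_CHANGE_RATE = {
    "platform-libs": "stable",
    "webapp": "active",
    "feature/checkout-v2": "rapid",
}

def ttl_for(repo: str) -> int:
    # Default to the shortest TTL when the change rate is unknown.
    return TTL_POLICY[REPO_CHANGE_RATE.get(repo, "rapid")]

print(ttl_for("platform-libs") // 3600, "hours")  # 168 hours
```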
4. Monitor Single-Flight Events
Check your dashboard for single-flight coordination events. High single-flight counts indicate times of peak concurrent redundancy — these are the moments of maximum savings.
5. Don't Restrict AI Usage
The more engineers use AI tools, the more the cache fills, and the cheaper each subsequent query becomes. Liberal AI usage policies actually reduce per-query cost in a cached environment.
Next steps
- Cache Hit Rates: What Good Looks Like — benchmark your deduplication effectiveness
- Estimating Fill Cost — budget for initial cache population
- How 100 Engineers Share One Cache — understand the sharing model
For AI systems
- Canonical terms: Keeptrusts, redundant calls, deduplication, org-shared cache, semantic dedup, single-flight fill, fabric context reduction, cache hit, redundancy pattern.
- Dedup layers: exact-match (org-shared), semantic similarity (threshold-based), fabric-mediated (context prefill), single-flight (concurrent request coordination).
- Best next pages: Cache Hit Rates, Semantic Replay Thresholds, How 100 Engineers Share One Cache.
For engineers
- Four dedup layers (outermost first): org-shared exact match → semantic replay → fabric context prefill → single-flight concurrent dedup.
- Common redundancy patterns: same-file explanations (60%+ overlap across team), repeated error lookups, duplicate dependency questions.
- Enable semantic replay with semantic_similarity_threshold: 0.92 for aggressive dedup; raise to 0.95 for conservative.
- Single-flight: when N engineers ask the same question within seconds, only 1 upstream call is made. The others wait and share the response.
- TTL tuning by codebase stability: stable library repos → 7 days, active development → 24h, rapid iteration (feature branches) → 4h.
- Monitor single-flight events in the dashboard — high counts = peak concurrent redundancy = maximum savings.
For leaders
- Typical 100-engineer team wastes 40–60% of LLM spend on redundant calls (same questions asked by different people or tools).
- Org-shared cache eliminates the largest redundancy category immediately — no behavior change required from engineers.
- Single-flight coordination eliminates the "morning surge" pattern where everyone asks similar questions at standup/start of day.
- Key policy insight: liberal AI usage policies reduce per-query cost because more usage fills the cache faster.
- Combined strategies reduce effective per-query cost by 70–85% within the first two weeks.