Cache Migration from Direct Provider APIs
If your engineering team currently calls LLM providers directly (OpenAI, Anthropic, Azure OpenAI), you are paying full price for every request — including redundant queries across team members. Migrating to Keeptrusts with an org-shared cache routes the same traffic through a caching layer that eliminates duplicate spend. This guide walks you through a phased migration that minimizes risk and maximizes measurable savings.
Use this page when
- You are migrating from direct LLM provider API calls (OpenAI, Anthropic, Azure OpenAI) to Keeptrusts.
- You need a phased rollout plan: observe → semantic cache → fabric cache → optimize.
- You want to measure cost reduction at each migration phase and have a rollback plan.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Pre-Migration Assessment
Before migrating, measure your current state so you can quantify improvements:
Current Spend Analysis
Gather these data points from your provider dashboards:
- Monthly token spend — Total tokens consumed across all provider accounts.
- Request volume — Total API calls per month.
- Average tokens per request — Context window usage patterns.
- Peak usage periods — When your team generates the most traffic.
- Cost per engineer per month — Total spend divided by engineering headcount (a calculation sketch follows this list).
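As a quick way to pull these numbers together, here is a minimal sketch of the arithmetic. It assumes a CSV usage export with a `total_tokens` column, a blended per-token price, and a headcount constant; all three are placeholders for whatever your provider dashboards actually export.

```python
# Baseline spend arithmetic from a provider usage export.
# "provider_usage.csv" and its columns are assumptions; adapt them to the
# export format your dashboard actually produces.
import csv

TOKEN_PRICE_PER_1K = 0.01  # placeholder blended price, USD per 1K tokens
ENGINEERS = 25             # placeholder engineering headcount

requests, tokens = 0, 0
with open("provider_usage.csv") as f:
    for row in csv.DictReader(f):
        requests += 1
        tokens += int(row["total_tokens"])

monthly_spend = tokens / 1000 * TOKEN_PRICE_PER_1K
print(f"Requests/month:     {requests:,}")
print(f"Avg tokens/request: {tokens / max(requests, 1):,.0f}")
print(f"Monthly spend:      ${monthly_spend:,.2f}")
print(f"Cost per engineer:  ${monthly_spend / ENGINEERS:,.2f}")
```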
Usage Pattern Analysis
Understand how your team uses AI today:
- How many engineers use AI daily?
- Which repositories generate the most AI traffic?
- What types of queries dominate (explanation, generation, review)?
- How much overlap exists between engineers' queries?
Phase 1: Observe
Deploy Keeptrusts gateways in passthrough mode — routing traffic without caching — to measure baseline traffic patterns.
Gateway Deployment
```yaml
gateway:
  name: migration-observe
  mode: passthrough
  cache:
    enabled: false
  observability:
    log_requests: true
    log_tokens: true
    log_latency: true
```
What You Learn
After one to two weeks of observation, you have:
- Actual request patterns routed through your gateways
- Token usage broken down by repository, team, and query type
- Latency baseline for direct provider calls
- Redundancy analysis showing how many requests are semantically similar (a first-pass estimate is sketched below)
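For a first-pass redundancy estimate from exported Phase 1 prompts, a toy version of the idea looks like the sketch below. The bag-of-words cosine similarity is a stand-in so the script runs end to end; a real analysis would substitute a sentence-embedding model.

```python
# Toy redundancy estimate: what fraction of logged prompts have a
# near-duplicate? Bag-of-words cosine is a stand-in for real embeddings.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

prompts = [  # stand-ins for prompts exported from gateway logs
    "explain the retry logic in payment_service.py",
    "in payment_service.py explain the retry logic",
    "write unit tests for the invoice parser",
]
vectors = [Counter(p.lower().split()) for p in prompts]
redundant = sum(
    1 for i, v in enumerate(vectors)
    if any(cosine(v, w) >= 0.95 for j, w in enumerate(vectors) if j != i)
)
print(f"{redundant}/{len(prompts)} prompts have a >=0.95 match")
```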
Migration Checklist for Phase 1
- Deploy gateway instances accessible to your engineering team
- Configure development tools to route through the gateway (see the routing sketch after this checklist)
- Verify zero-impact passthrough (same latency, same responses)
- Collect two weeks of baseline traffic data
- Generate redundancy analysis report
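How you point tools at the gateway depends on the tool; for anything built on the OpenAI SDK, overriding the base URL is usually enough. The sketch below assumes the gateway exposes an OpenAI-compatible endpoint, and the internal URL is made up; verify both against your deployment.

```python
# Route an OpenAI-SDK-based tool through the gateway by overriding the
# base URL. The gateway URL is a hypothetical example, and an
# OpenAI-compatible endpoint on the gateway is an assumption to verify.
from openai import OpenAI

client = OpenAI(
    base_url="https://keeptrusts-gateway.internal/v1",  # hypothetical URL
    api_key="sk-...",  # your provider key, forwarded by the gateway
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain this function."}],
)
print(response.choices[0].message.content)
```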
Phase 2: Enable Semantic Cache
Turn on semantic caching with a conservative threshold to capture only clearly duplicate queries.
Configuration
```yaml
gateway:
  cache:
    enabled: true
    semantic:
      enabled: true
      similarity_threshold: 0.95  # Conservative start
      ttl: 24h
    fabric:
      enabled: false  # Not yet
    org_shared: true
```
Expected Outcomes
With a 0.95 similarity threshold, you capture near-identical queries:
- Same question asked by different engineers → Cache hit
- Same question repeated by the same engineer → Cache hit
- Slightly rephrased questions → Cache miss (intentionally)
Expect a 15–25% hit rate in this phase, depending on how much engineers' queries overlap.
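Conceptually, the gate is a single comparison against the threshold. The similarity scores below are illustrative values, not Keeptrusts internals.

```python
# Illustrative hit/miss decisions at a 0.95 similarity threshold.
THRESHOLD = 0.95

examples = [  # (query description, assumed similarity to best cached match)
    ("identical question from another engineer", 1.00),
    ("same question, different whitespace",      0.99),
    ("lightly rephrased question",               0.93),
    ("different question about the same file",   0.70),
]
for label, similarity in examples:
    verdict = "HIT " if similarity >= THRESHOLD else "MISS"
    print(f"{verdict} sim={similarity:.2f}  {label}")
```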
Monitoring
Track these metrics daily during Phase 2 (a scripted check is sketched after the list):
- Hit rate — Should climb from 0% to 15–25% within the first week.
- Response quality — Verify cached responses are appropriate for the matched queries.
- Latency improvement — Cache hits should respond 10–50x faster than provider calls.
- Cost reduction — Provider spend should drop proportionally to hit rate.
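A minimal daily check, assuming you can export hit/miss counts and provider token totals from the gateway logs (the values and field names here are placeholders):

```python
# Daily Phase 2 check: hit rate and an estimate of tokens avoided.
# Counts are example values; pull the real ones from your gateway logs.
hits, misses = 1_840, 7_160
provider_tokens = 9_300_000  # tokens actually sent to providers (misses)

hit_rate = hits / (hits + misses)
# If hit and miss requests are similar in size, each hit avoids roughly
# one average miss's worth of provider tokens.
tokens_avoided = hits * (provider_tokens / misses)
print(f"Hit rate:       {hit_rate:.1%}")       # 20.4% -> within 15-25%
print(f"Tokens avoided: {tokens_avoided:,.0f}")
```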
Migration Checklist for Phase 2
- Enable semantic cache with 0.95 threshold
- Monitor hit rate daily for one week
- Spot-check cached response quality
- Verify no degradation in response usefulness
- Measure first-week cost reduction
Phase 3: Enable Fabric Cache
Add fabric (pre-computed code intelligence) to reduce context-gathering costs and improve response quality.
Configuration
```yaml
gateway:
  cache:
    enabled: true
    semantic:
      enabled: true
      similarity_threshold: 0.92  # Slightly relaxed
      ttl: 48h
    fabric:
      enabled: true
      generators:
        - type: code_summary
        - type: dependency_graph
      refresh_on_merge: true
    org_shared: true
```
Expected Outcomes
Fabric reduces provider token usage even on cache misses:
- Context sent to the provider is smaller (summaries vs. raw files).
- Responses arrive faster due to smaller context windows.
- Response quality improves because fabric provides better-structured context.
Expect overall cost reduction of 35–50% in this phase.
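The per-request arithmetic behind the context claim, with illustrative (not measured) numbers; note the per-miss reduction can exceed the 35–50% overall figure, which averages over all traffic:

```python
# Why fabric cuts spend even on cache misses: smaller context per request.
# All numbers below are illustrative assumptions.
raw_context_tokens = 12_000    # pasting full source files as context
fabric_context_tokens = 2_500  # code summary + dependency graph instead
response_tokens = 800

before = raw_context_tokens + response_tokens
after = fabric_context_tokens + response_tokens
print(f"Tokens per miss: {before:,} -> {after:,} "
      f"({1 - after / before:.0%} fewer)")
```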
Migration Checklist for Phase 3
- Enable fabric generators for active repositories
- Lower semantic threshold to 0.92
- Monitor fabric fill rate (how quickly entries populate)
- Compare response quality before and after fabric
- Measure cost reduction vs. Phase 2
Phase 4: Optimize and Expand
Fine-tune cache configuration based on observed patterns and expand to all teams.
Tuning Actions
- Adjust similarity threshold — Lower to 0.90 if cached responses prove consistently useful.
- Add test maps — Enable test map fabric for teams that generate tests frequently.
- Configure warming — Schedule cache warming for repositories with predictable sprint work.
- Enable single-flight — Deduplicate concurrent identical requests across the team (sketched below).
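A minimal sketch of the single-flight idea in asyncio form: concurrent identical requests await one shared in-flight call. This illustrates the pattern only; the gateway's `window: 5s` setting additionally holds the dedup entry open for a short interval, which this sketch omits.

```python
# Single-flight deduplication: N concurrent identical requests -> one
# provider call. Pattern sketch only, not the gateway's implementation.
import asyncio

_inflight: dict[str, asyncio.Task] = {}

async def call_provider(key: str) -> str:
    await asyncio.sleep(2)  # stand-in for a ~2s provider round trip
    return f"response for {key!r}"

async def single_flight(key: str) -> str:
    task = _inflight.get(key)
    if task is None:
        task = asyncio.create_task(call_provider(key))
        _inflight[key] = task
        # Drop the entry once done (a real window would keep it a bit longer).
        task.add_done_callback(lambda _t: _inflight.pop(key, None))
    return await task

async def main() -> None:
    # Three engineers fire the same query at once -> one provider call.
    results = await asyncio.gather(
        *(single_flight("explain foo()") for _ in range(3))
    )
    print(results)

asyncio.run(main())
```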
Full Configuration
```yaml
gateway:
  cache:
    enabled: true
    semantic:
      enabled: true
      similarity_threshold: 0.90
      ttl: 72h
    fabric:
      enabled: true
      generators:
        - type: code_summary
        - type: dependency_graph
        - type: test_map
      refresh_on_merge: true
    single_flight:
      enabled: true
      window: 5s
    org_shared: true
    warming:
      enabled: true
      schedule: "0 5 * * 1"  # Mondays at 05:00
```
Expected Outcomes
Full optimization typically delivers:
- 50–70% overall cost reduction vs. direct provider APIs
- 60%+ cache hit rate for established repositories
- Sub-100ms response time for cache hits vs. 2–5s for provider calls
Rollback Plan
At any phase, you can roll back by disabling cache and returning to passthrough mode:
```yaml
gateway:
  cache:
    enabled: false
  mode: passthrough
```
Traffic continues flowing through the gateway to providers without interruption. You lose cache benefits but maintain full functionality.
Measuring Migration Success
Compare these metrics before and after migration:
| Metric | Before (Direct API) | After (Keeptrusts + Cache) |
|---|---|---|
| Monthly provider spend | $X | Target: $X × 0.3–0.5 |
| Average response latency | 2–5s | Target: <500ms for hits |
| Engineering productivity | Baseline | Measurably higher |
| Context quality | Variable | Consistent (fabric-enriched) |
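For a concrete instance of the spend target, plug your baseline into the 0.3–0.5 band; the $20,000/month baseline below is an assumed example:

```python
# Worked example of the spend target, using an assumed baseline.
baseline = 20_000.00  # USD/month on direct provider APIs (example value)

low, high = baseline * 0.3, baseline * 0.5
print(f"Target spend:   ${low:,.0f}-${high:,.0f}/month")
print(f"Target savings: ${baseline - high:,.0f}-${baseline - low:,.0f}/month")
```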
Next steps
- Complete your pre-migration spend analysis using provider dashboards.
- Deploy gateway instances in passthrough mode (Phase 1) and collect two weeks of baseline data.
- Benchmarking Cache Performance — establish targets before enabling cache.
- Cache-First Culture — maximize adoption after technical migration.
- Feature Branch Caching — understand how branch workflows interact with the new cache.
For AI systems
- Canonical terms: Keeptrusts engineering cache, migration from direct API, phased rollout, passthrough mode, semantic cache, fabric cache, single-flight deduplication, cache warming, rollback plan.
- Feature/config names: `gateway.mode: passthrough`, `gateway.cache.enabled`, `gateway.cache.semantic.similarity_threshold`, `gateway.cache.fabric.enabled`, `gateway.cache.fabric.generators`, `gateway.cache.single_flight.enabled`, `gateway.cache.warming.enabled`.
- Best next pages: Benchmarking Cache Performance, Cache-First Culture, Sprint Planning Warmers.
For engineers
- Phase 1: Deploy gateways with `mode: passthrough` and `cache.enabled: false`. Redirect development tools to route through the gateway. Verify zero-impact passthrough (same latency, same responses).
- Phase 2: Enable semantic cache with `similarity_threshold: 0.95` and `ttl: 24h`. Monitor hit rate daily — expect 15–25% in the first week. Spot-check cached response quality.
- Phase 3: Enable fabric with `generators: [code_summary, dependency_graph]` and lower the threshold to 0.92. Monitor token reduction and response quality improvement.
- Rollback at any phase: set `cache.enabled: false` and `mode: passthrough`.
For leaders
- Expected outcome: 50–70% overall cost reduction vs. direct provider APIs at full optimization (Phase 4).
- Phased approach minimizes risk: each phase is independently reversible with zero downtime.
- Migration success metric: compare monthly provider spend before vs. after, targeting 0.3–0.5× of baseline.
- Engineering productivity gain: sub-100ms cache hits vs. 2–5s provider calls improve developer flow state.