Cache Migration from Direct Provider APIs

If your engineering team currently calls LLM providers directly (OpenAI, Anthropic, Azure OpenAI), you are paying full price for every request — including redundant queries across team members. Migrating to Keeptrusts with an org-shared cache routes the same traffic through a caching layer that eliminates duplicate spend. This guide walks you through a phased migration that minimizes risk and maximizes measurable savings.

Use this page when

  • You are migrating from direct LLM provider API calls (OpenAI, Anthropic, Azure OpenAI) to Keeptrusts.
  • You need a phased rollout plan: observe → semantic cache → fabric cache → optimize.
  • You want to measure cost reduction at each migration phase and have a rollback plan.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Pre-Migration Assessment

Before migrating, measure your current state so you can quantify improvements:

Current Spend Analysis

Gather these data points from your provider dashboards (a script for combining them follows the list):

  • Monthly token spend — Total tokens consumed across all provider accounts.
  • Request volume — Total API calls per month.
  • Average tokens per request — Context window usage patterns.
  • Peak usage periods — when your team generates the most traffic.
  • Cost per engineer per month — Total spend divided by engineering headcount.
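
If your provider exports usage data, a short script can combine these numbers. A minimal sketch in Python, assuming a hypothetical CSV export with timestamp, tokens, and cost_usd columns; adjust the field names to whatever your provider's export actually contains:

import csv
from collections import Counter

def spend_summary(csv_path, engineer_count):
    # Hypothetical export format: timestamp (ISO 8601), tokens (int), cost_usd (float).
    with open(csv_path) as f:
        rows = list(csv.DictReader(f))
    total_tokens = sum(int(r["tokens"]) for r in rows)
    total_cost = sum(float(r["cost_usd"]) for r in rows)
    by_hour = Counter(r["timestamp"][:13] for r in rows)  # requests per hour bucket
    return {
        "monthly_token_spend": total_tokens,
        "request_volume": len(rows),
        "avg_tokens_per_request": total_tokens / max(len(rows), 1),
        "peak_hour": by_hour.most_common(1)[0][0] if by_hour else None,
        "cost_per_engineer": total_cost / max(engineer_count, 1),
    }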

Usage Pattern Analysis

Understand how your team uses AI today:

  • How many engineers use AI daily?
  • Which repositories generate the most AI traffic?
  • What types of queries dominate (explanation, generation, review)?
  • How much overlap exists between engineers' queries?

Phase 1: Observe

Deploy Keeptrusts gateways in passthrough mode — routing traffic without caching — to measure baseline traffic patterns.

Gateway Deployment

gateway:
  name: migration-observe
  mode: passthrough
  cache:
    enabled: false
  observability:
    log_requests: true
    log_tokens: true
    log_latency: true
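
With the gateway running, point existing clients at it instead of the provider. A minimal sketch using the OpenAI Python SDK's base_url override; the gateway URL is a placeholder for your own deployment, and it assumes the gateway exposes an OpenAI-compatible endpoint, which you should verify before rollout:

from openai import OpenAI

# Placeholder address; substitute your gateway's URL. In passthrough mode
# the gateway forwards this request to the provider unchanged.
client = OpenAI(
    base_url="http://keeptrusts-gateway.internal:8080/v1",
    api_key="YOUR_PROVIDER_KEY",
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain what this function does."}],
)
print(resp.choices[0].message.content)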

What You Learn

After one to two weeks of observation, you have:

  • Actual request patterns routed through your gateways
  • Token usage broken down by repository, team, and query type
  • Latency baseline for direct provider calls
  • Redundancy analysis showing how many requests are semantically similar (a rough estimation sketch follows this list)
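
A crude way to estimate redundancy from logged prompts, before any semantic analysis, is to count normalized exact duplicates; this undercounts semantic overlap, so treat it as a lower bound. A minimal sketch, assuming you can export the logged prompt strings:

import re
from collections import Counter

def redundancy_report(queries):
    # Normalize whitespace and case, then count exact duplicates.
    # Real semantic analysis uses embeddings; this is only a lower bound.
    norm = [re.sub(r"\s+", " ", q.strip().lower()) for q in queries]
    counts = Counter(norm)
    redundant = sum(c - 1 for c in counts.values())
    return {
        "total_requests": len(queries),
        "redundant_requests": redundant,
        "redundancy_rate": redundant / max(len(queries), 1),
    }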

Migration Checklist for Phase 1

  • Deploy gateway instances accessible to your engineering team
  • Configure development tools to route through the gateway
  • Verify zero-impact passthrough (same latency, same responses)
  • Collect two weeks of baseline traffic data
  • Generate redundancy analysis report

Phase 2: Enable Semantic Cache

Turn on semantic caching with a conservative threshold to capture only clearly duplicate queries.

Configuration

gateway:
  cache:
    enabled: true
    semantic:
      enabled: true
      similarity_threshold: 0.95  # Conservative start
      ttl: 24h
    fabric:
      enabled: false  # Not yet
    org_shared: true

Expected Outcomes

With a 0.95 similarity threshold, you capture near-identical queries:

  • Same question asked by different engineers → Cache hit
  • Same question repeated by the same engineer → Cache hit
  • Slightly rephrased questions → Cache miss (intentionally)

Expect 15–25% hit rate in this phase, depending on team overlap.
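
The decision rule behind these outcomes is a similarity threshold over query embeddings. A minimal sketch of that rule (an illustration, not Keeptrusts internals) using plain cosine similarity; the embedding step is whatever model your gateway actually uses:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def lookup(query_vec, cached, threshold=0.95):
    # cached: list of (embedding, response) pairs already in the cache.
    # Returns the cached response on a hit, or None on a miss.
    best_sim, best_resp = -1.0, None
    for vec, resp in cached:
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_sim, best_resp = sim, resp
    return best_resp if best_sim >= threshold else None

At 0.95, only near-identical embeddings clear the bar; lowering the threshold in later phases trades stricter matching for more hits.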

Monitoring

Track these metrics daily during Phase 2 (the sketch after this list shows one way to compute them from gateway logs):

  • Hit rate — Should climb from 0% to 15–25% within the first week.
  • Response quality — Verify cached responses are appropriate for the matched queries.
  • Latency improvement — Cache hits should respond 10–50x faster than provider calls.
  • Cost reduction — Provider spend should drop proportionally to hit rate.
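
A minimal sketch for computing these from logs, assuming a hypothetical JSONL request log with cache_hit, tokens, and latency_ms fields; adapt it to whatever your gateway actually emits, and replace the placeholder blended token price with your provider's real pricing:

import json

def daily_metrics(log_path, price_per_1k_tokens=0.01):
    # price_per_1k_tokens is a placeholder blended rate, not a real price.
    hits = total = hit_tokens = 0
    hit_ms, miss_ms = [], []
    with open(log_path) as f:
        for line in f:
            r = json.loads(line)
            total += 1
            if r["cache_hit"]:
                hits += 1
                hit_tokens += r["tokens"]
                hit_ms.append(r["latency_ms"])
            else:
                miss_ms.append(r["latency_ms"])
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return {
        "hit_rate": hits / max(total, 1),
        "est_savings_usd": hit_tokens / 1000 * price_per_1k_tokens,
        "avg_hit_latency_ms": avg(hit_ms),
        "avg_miss_latency_ms": avg(miss_ms),
    }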

Migration Checklist for Phase 2

  • Enable semantic cache with 0.95 threshold
  • Monitor hit rate daily for one week
  • Spot-check cached response quality
  • Verify no degradation in response usefulness
  • Measure first-week cost reduction

Phase 3: Enable Fabric Cache

Add fabric (pre-computed code intelligence) to reduce context-gathering costs and improve response quality.

Configuration

gateway:
  cache:
    enabled: true
    semantic:
      enabled: true
      similarity_threshold: 0.92  # Slightly relaxed
      ttl: 48h
    fabric:
      enabled: true
      generators:
        - type: code_summary
        - type: dependency_graph
      refresh_on_merge: true
    org_shared: true

Expected Outcomes

Fabric reduces provider token usage even on cache misses:

  • Context sent to the provider is smaller (summaries vs. raw files).
  • Responses arrive faster due to smaller context windows.
  • Response quality improves because fabric provides better-structured context.

Expect overall cost reduction of 35–50% in this phase.
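
To see why summaries shrink spend even on misses, compare a rough token estimate for a raw file against its summary. A back-of-envelope sketch using the common ~4 characters-per-token approximation, with entirely hypothetical sizes; it illustrates the arithmetic only, not how fabric generates summaries:

def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text and code.
    return len(text) // 4

# Hypothetical sizes: a ~24 KB source module vs. a ~1.2 KB fabric-style summary.
raw_file = "x" * 24000   # ~6,000 tokens of raw context
summary = "x" * 1200     # ~300 tokens of summarized context

reduction = 1 - approx_tokens(summary) / approx_tokens(raw_file)
print(f"Context tokens cut by ~{reduction:.0%} for this file")  # ~95%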

Migration Checklist for Phase 3

  • Enable fabric generators for active repositories
  • Lower semantic threshold to 0.92
  • Monitor fabric fill rate (how quickly entries populate)
  • Compare response quality before and after fabric
  • Measure cost reduction vs. Phase 2

Phase 4: Optimize and Expand

Fine-tune cache configuration based on observed patterns and expand to all teams.

Tuning Actions

  • Adjust similarity threshold — Lower to 0.90 if cached responses prove consistently useful.
  • Add test maps — Enable test map fabric for teams that generate tests frequently.
  • Configure warming — Schedule cache warming for repositories with predictable sprint work.
  • Enable single-flight — Deduplicate concurrent identical requests across the team (see the sketch after this list).
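
Single-flight collapses concurrent identical requests so only one reaches the provider while the others wait for its result. A minimal sketch of the idea in Python (an illustration of the pattern, not the gateway's implementation):

import threading

class SingleFlight:
    # The first caller for a key ("leader") executes fn; concurrent callers
    # with the same key block until the leader finishes, then share its result.
    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}

    def do(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = {"done": threading.Event(), "result": None}
                self._inflight[key] = entry
        if leader:
            try:
                entry["result"] = fn()
            finally:
                with self._lock:
                    self._inflight.pop(key, None)
                entry["done"].set()
        else:
            entry["done"].wait()
        return entry["result"]

# Usage: sf.do(prompt_hash, lambda: call_provider(prompt))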

Full Configuration

gateway:
  cache:
    enabled: true
    semantic:
      enabled: true
      similarity_threshold: 0.90
      ttl: 72h
    fabric:
      enabled: true
      generators:
        - type: code_summary
        - type: dependency_graph
        - type: test_map
      refresh_on_merge: true
    single_flight:
      enabled: true
      window: 5s
    org_shared: true
    warming:
      enabled: true
      schedule: "0 5 * * 1"  # 05:00 every Monday

Expected Outcomes

Full optimization typically delivers:

  • 50–70% overall cost reduction vs. direct provider APIs
  • 60%+ cache hit rate for established repositories
  • Sub-100ms response time for cache hits vs. 2–5s for provider calls

Rollback Plan

At any phase, you can roll back by disabling cache and returning to passthrough mode:

gateway:
  mode: passthrough
  cache:
    enabled: false

Traffic continues flowing through the gateway to providers without interruption. You lose cache benefits but maintain full functionality.

Measuring Migration Success

Compare these metrics before and after migration:

Metric | Before (Direct API) | After (Keeptrusts + Cache)
------ | ------------------- | --------------------------
Monthly provider spend | $X | Target: $X × 0.3–0.5
Average response latency | 2–5s | Target: <500ms for hits
Engineering productivity | Baseline | Measurably higher
Context quality | Variable | Consistent (fabric-enriched)

Next steps

  • Complete your pre-migration spend analysis using provider dashboards.
  • Deploy gateway instances in passthrough mode (Phase 1) and collect two weeks of baseline data.
  • Benchmarking Cache Performance — establish targets before enabling cache.
  • Cache-First Culture — maximize adoption after technical migration.
  • Feature Branch Caching — understand how branch workflows interact with the new cache.

For AI systems

  • Canonical terms: Keeptrusts engineering cache, migration from direct API, phased rollout, passthrough mode, semantic cache, fabric cache, single-flight deduplication, cache warming, rollback plan.
  • Feature/config names: gateway.mode: passthrough, gateway.cache.enabled, gateway.cache.semantic.similarity_threshold, gateway.cache.fabric.enabled, gateway.cache.fabric.generators, gateway.cache.single_flight.enabled, gateway.cache.warming.enabled.
  • Best next pages: Benchmarking Cache Performance, Cache-First Culture, Sprint Planning Warmers.

For engineers

  • Phase 1: Deploy gateways with mode: passthrough and cache.enabled: false. Redirect development tools to route through the gateway. Verify zero-impact passthrough (same latency, same responses).
  • Phase 2: Enable semantic cache with similarity_threshold: 0.95 and ttl: 24h. Monitor hit rate daily — expect 15–25% in the first week. Spot-check cached response quality.
  • Phase 3: Enable fabric with generators: [code_summary, dependency_graph] and lower threshold to 0.92. Monitor token reduction and response quality improvement.
  • Rollback at any phase: set cache.enabled: false and mode: passthrough.

For leaders

  • Expected outcome: 50–70% overall cost reduction vs. direct provider APIs at full optimization (Phase 4).
  • Phased approach minimizes risk: each phase is independently reversible with zero downtime.
  • Migration success metric: compare monthly provider spend before vs. after, targeting 0.3–0.5× of baseline.
  • Engineering productivity gain: sub-100ms cache hits vs. 2–5s provider calls improve developer flow state.