Cache Migration from Direct Provider APIs
If your engineering team currently calls LLM providers directly (OpenAI, Anthropic, Azure OpenAI), you are paying full price for every request — including redundant queries across team members. Migrating to Keeptrusts with an org-shared cache routes the same traffic through a caching layer that eliminates duplicate spend. This guide walks you through a phased migration that minimizes risk and maximizes measurable savings.
Use this page when
- You are migrating from direct LLM provider API calls (OpenAI, Anthropic, Azure OpenAI) to Keeptrusts.
- You need a phased rollout plan: observe → semantic cache → fabric cache → optimize.
- You want to measure cost reduction at each migration phase and have a rollback plan.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Pre-Migration Assessment
Before migrating, measure your current state so you can quantify improvements:
Current Spend Analysis
Gather these data points from your provider dashboards:
- Monthly token spend — Total tokens consumed across all provider accounts.
- Request volume — Total API calls per month.
- Average tokens per request — Context window usage patterns.
- Peak usage periods — When your team generates the most traffic.
- Cost per engineer per month — Total spend divided by engineering headcount (a calculation sketch follows this list).
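As a quick way to pull these numbers together, here is a minimal sketch of the arithmetic. It assumes a CSV usage export with a `total_tokens` column, a blended per-token price, and a headcount constant; all three are placeholders for whatever your provider dashboards actually export.

```python
# Baseline spend arithmetic from a provider usage export.
# "provider_usage.csv" and its columns are assumptions; adapt them to the
# export format your dashboard actually produces.
import csv

TOKEN_PRICE_PER_1K = 0.01  # placeholder blended price, USD per 1K tokens
ENGINEERS = 25             # placeholder engineering headcount

requests, tokens = 0, 0
with open("provider_usage.csv") as f:
    for row in csv.DictReader(f):
        requests += 1
        tokens += int(row["total_tokens"])

monthly_spend = tokens / 1000 * TOKEN_PRICE_PER_1K
print(f"Requests/month:     {requests:,}")
print(f"Avg tokens/request: {tokens / max(requests, 1):,.0f}")
print(f"Monthly spend:      ${monthly_spend:,.2f}")
print(f"Cost per engineer:  ${monthly_spend / ENGINEERS:,.2f}")
```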
Usage Pattern Analysis
Understand how your team uses AI today:
- How many engineers use AI daily?
- Which repositories generate the most AI traffic?
- What types of queries dominate (explanation, generation, review)?
- How much overlap exists between engineers' queries?
Phase 1: Observe
Deploy Keeptrusts gateways in passthrough mode — routing traffic without caching — to measure baseline traffic patterns.
Gateway Deployment
```yaml
gateway:
  name: migration-observe
  mode: passthrough
  cache:
    enabled: false
  observability:
    log_requests: true
    log_tokens: true
    log_latency: true
```
What You Learn
After one to two weeks of observation, you have:
- Actual request patterns routed through your gateways
- Token usage broken down by repository, team, and query type
- Latency baseline for direct provider calls
- Redundancy analysis showing how many requests are semantically similar (a first-pass estimate is sketched below)
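For a first-pass redundancy estimate from exported Phase 1 prompts, a toy version of the idea looks like the sketch below. The bag-of-words cosine similarity is a stand-in so the script runs end to end; a real analysis would substitute a sentence-embedding model.

```python
# Toy redundancy estimate: what fraction of logged prompts have a
# near-duplicate? Bag-of-words cosine is a stand-in for real embeddings.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

prompts = [  # stand-ins for prompts exported from gateway logs
    "explain the retry logic in payment_service.py",
    "in payment_service.py explain the retry logic",
    "write unit tests for the invoice parser",
]
vectors = [Counter(p.lower().split()) for p in prompts]
redundant = sum(
    1 for i, v in enumerate(vectors)
    if any(cosine(v, w) >= 0.95 for j, w in enumerate(vectors) if j != i)
)
print(f"{redundant}/{len(prompts)} prompts have a >=0.95 match")
```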
Migration Checklist for Phase 1
- Deploy gateway instances accessible to your engineering team
- Configure development tools to route through the gateway (see the routing sketch after this checklist)
- Verify zero-impact passthrough (same latency, same responses)
- Collect two weeks of baseline traffic data
- Generate redundancy analysis report
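How you point tools at the gateway depends on the tool; for anything built on the OpenAI SDK, overriding the base URL is usually enough. The sketch below assumes the gateway exposes an OpenAI-compatible endpoint, and the internal URL is made up; verify both against your deployment.

```python
# Route an OpenAI-SDK-based tool through the gateway by overriding the
# base URL. The gateway URL is a hypothetical example, and an
# OpenAI-compatible endpoint on the gateway is an assumption to verify.
from openai import OpenAI

client = OpenAI(
    base_url="https://keeptrusts-gateway.internal/v1",  # hypothetical URL
    api_key="sk-...",  # your provider key, forwarded by the gateway
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain this function."}],
)
print(response.choices[0].message.content)
```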
Phase 2: Enable Semantic Cache
Turn on semantic caching with a conservative threshold to capture only clearly duplicate queries.
Configuration
```yaml
gateway:
  cache:
    enabled: true
    semantic:
      enabled: true
      similarity_threshold: 0.95  # Conservative start
      ttl: 24h
    fabric:
      enabled: false  # Not yet
    org_shared: true
```
Expected Outcomes
With a 0.95 similarity threshold, you capture near-identical queries:
- Same question asked by different engineers → Cache hit
- Same question repeated by the same engineer → Cache hit
- Slightly rephrased questions → Cache miss (intentionally)
Expect a 15–25% hit rate in this phase, depending on how much engineers' queries overlap.
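Conceptually, the gate is a single comparison against the threshold. The similarity scores below are illustrative values, not Keeptrusts internals.

```python
# Illustrative hit/miss decisions at a 0.95 similarity threshold.
THRESHOLD = 0.95

examples = [  # (query description, assumed similarity to best cached match)
    ("identical question from another engineer", 1.00),
    ("same question, different whitespace",      0.99),
    ("lightly rephrased question",               0.93),
    ("different question about the same file",   0.70),
]
for label, similarity in examples:
    verdict = "HIT " if similarity >= THRESHOLD else "MISS"
    print(f"{verdict} sim={similarity:.2f}  {label}")
```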
Monitoring
Track these metrics daily during Phase 2 (a scripted check is sketched after the list):
- Hit rate — Should climb from 0% to 15–25% within the first week.
- Response quality — Verify cached responses are appropriate for the matched queries.
- Latency improvement — Cache hits should respond 10–50x faster than provider calls.
- Cost reduction — Provider spend should drop proportionally to hit rate.
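A minimal daily check, assuming you can export hit/miss counts and provider token totals from the gateway logs (the values and field names here are placeholders):

```python
# Daily Phase 2 check: hit rate and an estimate of tokens avoided.
# Counts are example values; pull the real ones from your gateway logs.
hits, misses = 1_840, 7_160
provider_tokens = 9_300_000  # tokens actually sent to providers (misses)

hit_rate = hits / (hits + misses)
# If hit and miss requests are similar in size, each hit avoids roughly
# one average miss's worth of provider tokens.
tokens_avoided = hits * (provider_tokens / misses)
print(f"Hit rate:       {hit_rate:.1%}")       # 20.4% -> within 15-25%
print(f"Tokens avoided: {tokens_avoided:,.0f}")
```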
Migration Checklist for Phase 2
- Enable semantic cache with 0.95 threshold
- Monitor hit rate daily for one week
- Spot-check cached response quality
- Verify no degradation in response usefulness
- Measure first-week cost reduction
Phase 3: Enable Fabric Cache
Add fabric (pre-computed code intelligence) to reduce context-gathering costs and improve response quality.
Configuration
```yaml
gateway:
  cache:
    enabled: true
    semantic:
      enabled: true
      similarity_threshold: 0.92  # Slightly relaxed
      ttl: 48h
    fabric:
      enabled: true
      generators:
        - type: code_summary
        - type: dependency_graph
      refresh_on_merge: true
    org_shared: true
```
Expected Outcomes
Fabric reduces provider token usage even on cache misses:
- Context sent to the provider is smaller (summaries vs. raw files).
- Responses arrive faster due to smaller context windows.
- Response quality improves because fabric provides better-structured context.
Expect overall cost reduction of 35–50% in this phase.
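The per-request arithmetic behind the context claim, with illustrative (not measured) numbers; note the per-miss reduction can exceed the 35–50% overall figure, which averages over all traffic:

```python
# Why fabric cuts spend even on cache misses: smaller context per request.
# All numbers below are illustrative assumptions.
raw_context_tokens = 12_000    # pasting full source files as context
fabric_context_tokens = 2_500  # code summary + dependency graph instead
response_tokens = 800

before = raw_context_tokens + response_tokens
after = fabric_context_tokens + response_tokens
print(f"Tokens per miss: {before:,} -> {after:,} "
      f"({1 - after / before:.0%} fewer)")
```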
Migration Checklist for Phase 3
- Enable fabric generators for active repositories
- Lower semantic threshold to 0.92
- Monitor fabric fill rate (how quickly entries populate)
- Compare response quality before and after fabric
- Measure cost reduction vs. Phase 2
Phase 4: Optimize and Expand
Fine-tune cache configuration based on observed patterns and expand to all teams.
Tuning Actions
- Adjust similarity threshold — Lower to 0.90 if cached responses prove consistently useful.
- Add test maps — Enable test map fabric for teams that generate tests frequently.
- Configure warming — Schedule cache warming for repositories with predictable sprint work.
- Enable single-flight — Deduplicate concurrent identical requests across the team (sketched below).
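A minimal sketch of the single-flight idea in asyncio form: concurrent identical requests await one shared in-flight call. This illustrates the pattern only; the gateway's `window: 5s` setting additionally holds the dedup entry open for a short interval, which this sketch omits.

```python
# Single-flight deduplication: N concurrent identical requests -> one
# provider call. Pattern sketch only, not the gateway's implementation.
import asyncio

_inflight: dict[str, asyncio.Task] = {}

async def call_provider(key: str) -> str:
    await asyncio.sleep(2)  # stand-in for a ~2s provider round trip
    return f"response for {key!r}"

async def single_flight(key: str) -> str:
    task = _inflight.get(key)
    if task is None:
        task = asyncio.create_task(call_provider(key))
        _inflight[key] = task
        # Drop the entry once done (a real window would keep it a bit longer).
        task.add_done_callback(lambda _t: _inflight.pop(key, None))
    return await task

async def main() -> None:
    # Three engineers fire the same query at once -> one provider call.
    results = await asyncio.gather(
        *(single_flight("explain foo()") for _ in range(3))
    )
    print(results)

asyncio.run(main())
```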
Full Configuration
```yaml
gateway:
  cache:
    enabled: true
    semantic:
      enabled: true
      similarity_threshold: 0.90
      ttl: 72h
    fabric:
      enabled: true
      generators:
        - type: code_summary
        - type: dependency_graph
        - type: test_map
      refresh_on_merge: true
    single_flight:
      enabled: true
      window: 5s
    org_shared: true
    warming:
      enabled: true
      schedule: "0 5 * * 1"  # Mondays at 05:00
```
Expected Outcomes
Full optimization typically delivers:
- 50–70% overall cost reduction vs. direct provider APIs
- 60%+ cache hit rate for established repositories
- Sub-100ms response time for cache hits vs. 2–5s for provider calls
Rollback Plan
At any phase, you can roll back by disabling cache and returning to passthrough mode:
```yaml
gateway:
  cache:
    enabled: false
  mode: passthrough
```
Traffic continues flowing through the gateway to providers without interruption. You lose cache benefits but maintain full functionality.
Measuring Migration Success
Compare these metrics before and after migration:
| Metric | Before (Direct API) | After (Keeptrusts + Cache) |
|---|---|---|
| Monthly provider spend | $X | Target: $X × 0.3–0.5 |
| Average response latency | 2–5s | Target: <500ms for hits |
| Engineering productivity | Baseline | Measurably higher |
| Context quality | Variable | Consistent (fabric-enriched) |
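For a concrete instance of the spend target, plug your baseline into the 0.3–0.5 band; the $20,000/month baseline below is an assumed example:

```python
# Worked example of the spend target, using an assumed baseline.
baseline = 20_000.00  # USD/month on direct provider APIs (example value)

low, high = baseline * 0.3, baseline * 0.5
print(f"Target spend:   ${low:,.0f}-${high:,.0f}/month")
print(f"Target savings: ${baseline - high:,.0f}-${baseline - low:,.0f}/month")
```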
Next steps
- Complete your pre-migration spend analysis using provider dashboards.
- Deploy gateway instances in passthrough mode (Phase 1) and collect two weeks of baseline data.
- Benchmarking Cache Performance — establish targets before enabling cache.
- Cache-First Culture — maximize adoption after technical migration.
- Feature Branch Caching — understand how branch workflows interact with the new cache.
For AI systems
- Canonical terms: Keeptrusts engineering cache, migration from direct API, phased rollout, passthrough mode, semantic cache, fabric cache, single-flight deduplication, cache warming, rollback plan.
- Feature/config names: `gateway.mode: passthrough`, `gateway.cache.enabled`, `gateway.cache.semantic.similarity_threshold`, `gateway.cache.fabric.enabled`, `gateway.cache.fabric.generators`, `gateway.cache.single_flight.enabled`, `gateway.cache.warming.enabled`.
- Best next pages: Benchmarking Cache Performance, Cache-First Culture, Sprint Planning Warmers.
For engineers
- Phase 1: Deploy gateways with `mode: passthrough` and `cache.enabled: false`. Redirect development tools to route through the gateway. Verify zero-impact passthrough (same latency, same responses).
- Phase 2: Enable semantic cache with `similarity_threshold: 0.95` and `ttl: 24h`. Monitor hit rate daily — expect 15–25% in the first week. Spot-check cached response quality.
- Phase 3: Enable fabric with `generators: [code_summary, dependency_graph]` and lower the threshold to 0.92. Monitor token reduction and response quality improvement.
- Rollback at any phase: set `cache.enabled: false` and `mode: passthrough`.
For leaders
- Expected outcome: 50–70% overall cost reduction vs. direct provider APIs at full optimization (Phase 4).
- Phased approach minimizes risk: each phase is independently reversible with zero downtime.
- Migration success metric: compare monthly provider spend before vs. after, targeting 0.3–0.5× of baseline.
- Engineering productivity gain: sub-100ms cache hits vs. 2–5s provider calls improve developer flow state.