Benchmarking Cache Performance
You need concrete performance data to validate your cache configuration, justify investment, and identify optimization opportunities. This guide covers the key metrics to benchmark, target values for each metric, and how to run load tests against your cache infrastructure.
Use this page when
- You need to establish baseline performance metrics for your cache deployment.
- You are running load tests or spike tests before onboarding more teams.
- You need to set alert thresholds for semantic hit latency, fabric retrieval, or hit rate regressions.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Core Benchmark Metrics
Hit Latency
Hit latency measures the time from cache lookup initiation to response delivery for entries that exist in cache.
- What it measures — The overhead that caching adds to a request when a hit occurs.
- Target — Under 50ms for semantic cache hits, under 20ms for fabric lookups.
- Concern threshold — Above 100ms indicates cache backend performance issues.
Semantic Latency
Semantic latency measures the time to compute embedding similarity and retrieve the best-matching cached response.
- What it measures — The cost of semantic matching against the cache index.
- Target — Under 80ms for indexes with fewer than 100,000 entries.
- Concern threshold — Above 200ms suggests index size optimization is needed.
Fabric Retrieval
Fabric retrieval measures the time to fetch pre-computed code intelligence artifacts (summaries, graphs, maps).
- What it measures — Storage backend read performance for fabric entries.
- Target — Under 30ms for local backends, under 100ms for remote storage.
- Concern threshold — Above 150ms indicates storage latency or network issues.
Single-Flight Wait
Single-flight wait measures how long deduplicated requests wait for an in-progress computation to complete.
- What it measures — The time secondary requests spend waiting when a cache miss triggers computation that multiple requesters need.
- Target — Under 3000ms (matches typical provider response time).
- Concern threshold — Above 5000ms suggests provider latency issues or overly broad deduplication windows.
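The deduplication being measured here follows the classic single-flight pattern: one leader computes, concurrent followers for the same key wait and share the result. A minimal Python sketch of the idea (illustrative only, not Keeptrusts' implementation; the `SingleFlight` class and `do` method are hypothetical names, and follower-side error propagation is elided):

```python
import threading

class SingleFlight:
    """Deduplicate concurrent computations for the same cache key."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done event, shared result holder)

    def do(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, holder = entry
        if leader:
            try:
                holder["value"] = fn()  # only the leader pays for computation
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()
            return holder["value"]
        # Follower: wait for the leader, bounded by the 5000ms concern threshold.
        done.wait(timeout=5.0)
        return holder.get("value")
```

The single-flight wait metric is exactly the time followers spend blocked in `done.wait` — which is why it tracks provider response time rather than cache backend speed.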
Economics Emit
Economics emit measures the time to record cost-avoidance events after a cache hit.
- What it measures — Overhead of cost tracking on cached responses.
- Target — Under 5ms (non-blocking, async recording).
- Concern threshold — Above 20ms indicates event pipeline backpressure.
Running Benchmarks
Synthetic Benchmark
Run a synthetic benchmark against your cache infrastructure:
```shell
kt cache benchmark \
  --target gateway.internal:41002 \
  --concurrency 50 \
  --duration 300s \
  --scenario mixed \
  --report benchmark-results.json
```
The mixed scenario simulates realistic traffic patterns: 60% semantic lookups, 30% fabric retrievals, 10% cache misses.
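If you drive the cache with a custom client instead of `kt cache benchmark`, the same traffic profile can be reproduced with weighted sampling. A minimal Python sketch; `sample_operations` and the operation labels are illustrative names — only the 60/30/10 split comes from this page:

```python
import random

# Operation mix from the mixed scenario: 60% semantic lookups,
# 30% fabric retrievals, 10% forced cache misses.
SCENARIO_WEIGHTS = {"semantic": 0.60, "fabric": 0.30, "miss": 0.10}

def sample_operations(n, seed=7):
    """Draw n operations matching the mixed traffic profile.

    A fixed seed keeps runs comparable when benchmarking repeatedly.
    """
    rng = random.Random(seed)
    return rng.choices(list(SCENARIO_WEIGHTS),
                       weights=list(SCENARIO_WEIGHTS.values()), k=n)
```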
Scenario-Specific Benchmarks
Run targeted benchmarks for specific cache operations:
```shell
# Semantic cache benchmark
kt cache benchmark --scenario semantic-only --concurrency 100

# Fabric retrieval benchmark
kt cache benchmark --scenario fabric-only --concurrency 200

# Single-flight deduplication benchmark
kt cache benchmark --scenario single-flight --concurrency 500

# Cold-start benchmark (all misses)
kt cache benchmark --scenario cold-start --concurrency 50
```
Production Traffic Replay
Replay recorded production traffic against a staging cache to benchmark with realistic query patterns:
```shell
kt cache benchmark \
  --replay production-traffic-2024-03.jsonl \
  --target staging-gateway:41002 \
  --speed 2x \
  --report replay-benchmark.json
```
Performance Targets by Team Size
| Metric | Small team (5-15) | Medium team (15-50) | Large team (50+) |
|---|---|---|---|
| Semantic hit latency | <50ms | <50ms | <80ms |
| Fabric retrieval | <30ms | <50ms | <80ms |
| Single-flight wait | <3000ms | <3000ms | <4000ms |
| Overall hit rate | >40% | >55% | >65% |
| Cache fill time | <5000ms | <5000ms | <5000ms |
Larger teams achieve higher hit rates because more engineers contribute to and benefit from the shared cache.
Load Testing
Sustained Load Test
Test cache performance under sustained engineering load:
```shell
kt cache loadtest \
  --target gateway.internal:41002 \
  --engineers-simulated 50 \
  --interactions-per-hour 20 \
  --duration 4h \
  --report load-test-results.json
```
This simulates 50 engineers each making 20 AI interactions per hour — a realistic mid-sprint workload.
Spike Load Test
Test cache behavior during demand spikes (sprint start, team onboarding):
```shell
kt cache loadtest \
  --target gateway.internal:41002 \
  --spike-pattern ramp-up \
  --peak-concurrency 200 \
  --ramp-duration 10m \
  --peak-duration 30m \
  --report spike-test-results.json
```
Cache Eviction Test
Test behavior when cache reaches capacity and eviction occurs:
```shell
kt cache loadtest \
  --target gateway.internal:41002 \
  --fill-to-capacity true \
  --overflow-percentage 120 \
  --report eviction-test-results.json
```
Interpreting Results
Latency Distribution
Look at percentile distributions, not just averages:
- p50 — Typical experience for most requests.
- p95 — Experience during moderate contention.
- p99 — Worst case excluding outliers.
- p99.9 — True worst case that users occasionally experience.
A cache with 20ms p50 but 500ms p99 has a tail latency problem that affects user experience during peak usage.
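These percentiles can be computed directly from raw per-request latencies. A minimal Python sketch using the nearest-rank method (it assumes you have already extracted latency samples in milliseconds from the benchmark report; `latency_report` is an illustrative helper, not part of the `kt` CLI):

```python
import math

def percentile(samples_ms, p):
    """Nearest-rank percentile: the smallest sample value such that
    at least p percent of all samples are less than or equal to it."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

def latency_report(samples_ms):
    """Summarize latencies at the percentiles discussed above."""
    return {p: percentile(samples_ms, p) for p in (50, 95, 99, 99.9)}
```

Comparing `latency_report` output across runs makes tail regressions visible even when the p50 is stable.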
Hit Rate Analysis
Break down hit rates by cache layer:
- Semantic hit rate — Percentage of queries that match existing semantic entries.
- Fabric hit rate — Percentage of context lookups served from cached fabric.
- Combined hit rate — Percentage of requests that avoid any provider call.
- Miss rate by category — Which types of queries most frequently miss cache.
Cost Efficiency Ratio
Calculate the cost efficiency of your cache:
Cost Efficiency = (Provider cost avoided) / (Cache infrastructure cost + Warming cost)
A ratio above 3.0 indicates strong cache ROI. Below 1.5 suggests optimization opportunities.
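The ratio is straightforward to compute from billing data. A small Python helper (illustrative, not part of the `kt` CLI):

```python
def cost_efficiency(provider_cost_avoided, infra_cost, warming_cost):
    """Cost efficiency ratio: provider spend avoided per dollar spent
    running and warming the cache."""
    total_cache_cost = infra_cost + warming_cost
    if total_cache_cost <= 0:
        raise ValueError("cache cost must be positive")
    return provider_cost_avoided / total_cache_cost
```

For example, $4,200 of avoided provider spend against $900 of infrastructure and $300 of warming cost gives a ratio of 3.5 — above the 3.0 threshold for strong ROI.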
Continuous Benchmarking
Run benchmarks on a schedule to detect performance regressions:
```yaml
benchmarks:
  schedule: "0 2 * * 1"  # Weekly, Monday 02:00
  scenarios:
    - mixed
    - semantic-only
    - fabric-only
  alerts:
    - metric: semantic_hit_latency_p95
      threshold: 100ms
      action: notify
    - metric: overall_hit_rate
      threshold: 40%
      condition: below
      action: notify
```
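Alert rules of this shape reduce to a simple threshold check. A Python sketch of the evaluation logic (illustrative; it assumes thresholds and observed metrics have been normalized to plain numbers, whereas the config above uses strings like `100ms` and `40%`):

```python
def check_alerts(metrics, rules):
    """Return the names of metrics whose alert condition fires.

    Each rule: {"metric": str, "threshold": float, "condition": "above"|"below"}.
    "above" is treated as the default condition, mirroring the config above.
    """
    fired = []
    for rule in rules:
        value = metrics.get(rule["metric"])
        if value is None:
            continue  # metric not reported this run
        condition = rule.get("condition", "above")
        if condition == "above" and value > rule["threshold"]:
            fired.append(rule["metric"])
        elif condition == "below" and value < rule["threshold"]:
            fired.append(rule["metric"])
    return fired
```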
Next steps
- Run `kt cache benchmark --scenario mixed` to establish your baseline.
- Compare p50/p95/p99 latencies against the team-size targets in this page.
- Schedule weekly continuous benchmarks and configure alert thresholds.
- Agent-Specific Cache Optimization — tune per-agent performance.
- Multi-Region Cache Strategy — benchmark cross-region latency.
For AI systems
- Canonical terms: Keeptrusts engineering cache, benchmarking, hit latency, semantic latency, fabric retrieval, single-flight wait, economics emit, load testing, cost efficiency ratio.
- Feature/config/command names: `kt cache benchmark`, `--scenario mixed`, `--scenario semantic-only`, `--scenario fabric-only`, `--scenario single-flight`, `--scenario cold-start`, `kt cache loadtest`, `--engineers-simulated`, `--spike-pattern`, `benchmarks.schedule`, `benchmarks.alerts`.
- Best next pages: Agent-Specific Cache Optimization, Multi-Region Cache Strategy, Cache-First Culture.
For engineers
- Prerequisites: A deployed Keeptrusts gateway with cache enabled and a target endpoint reachable from the benchmark runner.
- Run `kt cache benchmark --target gateway.internal:41002 --concurrency 50 --duration 300s --scenario mixed --report results.json` for a first baseline.
- Validate: Confirm semantic hit latency p95 < 100ms and fabric retrieval p95 < 150ms. Investigate the concern thresholds listed in the Core Benchmark Metrics section.
- Use `kt cache loadtest --engineers-simulated 50 --interactions-per-hour 20 --duration 4h` to simulate sustained sprint load.
For leaders
- Benchmarking quantifies cache ROI: a cost efficiency ratio above 3.0 confirms the infrastructure investment pays for itself.
- Use team-size-appropriate targets (this page) to set SLOs for the platform team.
- Schedule weekly automated benchmarks to detect regressions before they affect developer experience.
- Spike tests validate capacity planning for team onboarding events and sprint starts.