Benchmarking Cache Performance
You need concrete performance data to validate your cache configuration, justify investment, and identify optimization opportunities. This guide covers the key metrics to benchmark, target values for each metric, and how to run load tests against your cache infrastructure.
Use this page when
- You need to establish baseline performance metrics for your cache deployment.
- You are running load tests or spike tests before onboarding more teams.
- You need to set alert thresholds for semantic hit latency, fabric retrieval, or hit rate regressions.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Core Benchmark Metrics
Hit Latency
Hit latency measures the time from cache lookup initiation to response delivery for entries that exist in cache.
- What it measures — The overhead that caching adds to a request when a hit occurs.
- Target — Under 50ms for semantic cache hits, under 20ms for fabric lookups.
- Concern threshold — Above 100ms indicates cache backend performance issues.
Semantic Latency
Semantic latency measures the time to compute embedding similarity and retrieve the best-matching cached response.
- What it measures — The cost of semantic matching against the cache index.
- Target — Under 80ms for indexes with fewer than 100,000 entries.
- Concern threshold — Above 200ms suggests index size optimization is needed.
Fabric Retrieval
Fabric retrieval measures the time to fetch pre-computed code intelligence artifacts (summaries, graphs, maps).
- What it measures — Storage backend read performance for fabric entries.
- Target — Under 30ms for local backends, under 100ms for remote storage.
- Concern threshold — Above 150ms indicates storage latency or network issues.
Single-Flight Wait
Single-flight wait measures how long deduplicated requests wait for an in-progress computation to complete.
- What it measures — The time secondary requests spend waiting when a cache miss triggers computation that multiple requesters need.
- Target — Under 3000ms (matches typical provider response time).
- Concern threshold — Above 5000ms suggests provider latency issues or overly broad deduplication windows.
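The deduplication being measured here follows the classic single-flight pattern: one leader computes, concurrent followers for the same key wait and share the result. A minimal Python sketch of the idea (illustrative only, not Keeptrusts' implementation; the `SingleFlight` class and `do` method are hypothetical names, and follower-side error propagation is elided):

```python
import threading

class SingleFlight:
    """Deduplicate concurrent computations for the same cache key."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done event, shared result holder)

    def do(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, holder = entry
        if leader:
            try:
                holder["value"] = fn()  # only the leader pays for computation
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()
            return holder["value"]
        # Follower: wait for the leader, bounded by the 5000ms concern threshold.
        done.wait(timeout=5.0)
        return holder.get("value")
```

The single-flight wait metric is exactly the time followers spend blocked in `done.wait` — which is why it tracks provider response time rather than cache backend speed.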
Economics Emit
Economics emit measures the time to record cost-avoidance events after a cache hit.
- What it measures — Overhead of cost tracking on cached responses.
- Target — Under 5ms (non-blocking, async recording).
- Concern threshold — Above 20ms indicates event pipeline backpressure.
Running Benchmarks
Synthetic Benchmark
Run a synthetic benchmark against your cache infrastructure:
```shell
kt cache benchmark \
  --target gateway.internal:41002 \
  --concurrency 50 \
  --duration 300s \
  --scenario mixed \
  --report benchmark-results.json
```
The mixed scenario simulates realistic traffic patterns: 60% semantic lookups, 30% fabric retrievals, 10% cache misses.
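If you drive the cache with a custom client instead of `kt cache benchmark`, the same traffic profile can be reproduced with weighted sampling. A minimal Python sketch; `sample_operations` and the operation labels are illustrative names — only the 60/30/10 split comes from this page:

```python
import random

# Operation mix from the mixed scenario: 60% semantic lookups,
# 30% fabric retrievals, 10% forced cache misses.
SCENARIO_WEIGHTS = {"semantic": 0.60, "fabric": 0.30, "miss": 0.10}

def sample_operations(n, seed=7):
    """Draw n operations matching the mixed traffic profile.

    A fixed seed keeps runs comparable when benchmarking repeatedly.
    """
    rng = random.Random(seed)
    return rng.choices(list(SCENARIO_WEIGHTS),
                       weights=list(SCENARIO_WEIGHTS.values()), k=n)
```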
Scenario-Specific Benchmarks
Run targeted benchmarks for specific cache operations:
```shell
# Semantic cache benchmark
kt cache benchmark --scenario semantic-only --concurrency 100

# Fabric retrieval benchmark
kt cache benchmark --scenario fabric-only --concurrency 200

# Single-flight deduplication benchmark
kt cache benchmark --scenario single-flight --concurrency 500

# Cold-start benchmark (all misses)
kt cache benchmark --scenario cold-start --concurrency 50
```
Production Traffic Replay
Replay recorded production traffic against a staging cache to benchmark with realistic query patterns:
```shell
kt cache benchmark \
  --replay production-traffic-2024-03.jsonl \
  --target staging-gateway:41002 \
  --speed 2x \
  --report replay-benchmark.json
```
Performance Targets by Team Size
| Metric | Small team (5-15) | Medium team (15-50) | Large team (50+) |
|---|---|---|---|
| Semantic hit latency | <50ms | <50ms | <80ms |
| Fabric retrieval | <30ms | <50ms | <80ms |
| Single-flight wait | <3000ms | <3000ms | <4000ms |
| Overall hit rate | >40% | >55% | >65% |
| Cache fill time | <5000ms | <5000ms | <5000ms |
Larger teams achieve higher hit rates because more engineers contribute to and benefit from the shared cache.
Load Testing
Sustained Load Test
Test cache performance under sustained engineering load:
```shell
kt cache loadtest \
  --target gateway.internal:41002 \
  --engineers-simulated 50 \
  --interactions-per-hour 20 \
  --duration 4h \
  --report load-test-results.json
```
This simulates 50 engineers each making 20 AI interactions per hour — a realistic mid-sprint workload.
Spike Load Test
Test cache behavior during demand spikes (sprint start, team onboarding):
```shell
kt cache loadtest \
  --target gateway.internal:41002 \
  --spike-pattern ramp-up \
  --peak-concurrency 200 \
  --ramp-duration 10m \
  --peak-duration 30m \
  --report spike-test-results.json
```
Cache Eviction Test
Test behavior when cache reaches capacity and eviction occurs:
```shell
kt cache loadtest \
  --target gateway.internal:41002 \
  --fill-to-capacity true \
  --overflow-percentage 120 \
  --report eviction-test-results.json
```
Interpreting Results
Latency Distribution
Look at percentile distributions, not just averages:
- p50 — Typical experience for most requests.
- p95 — Experience during moderate contention.
- p99 — Worst case excluding outliers.
- p99.9 — True worst case that users occasionally experience.
A cache with 20ms p50 but 500ms p99 has a tail latency problem that affects user experience during peak usage.
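These percentiles can be computed directly from raw per-request latencies. A minimal Python sketch using the nearest-rank method (it assumes you have already extracted latency samples in milliseconds from the benchmark report; `latency_report` is an illustrative helper, not part of the `kt` CLI):

```python
import math

def percentile(samples_ms, p):
    """Nearest-rank percentile: the smallest sample value such that
    at least p percent of all samples are less than or equal to it."""
    ordered = sorted(samples_ms)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

def latency_report(samples_ms):
    """Summarize latencies at the percentiles discussed above."""
    return {p: percentile(samples_ms, p) for p in (50, 95, 99, 99.9)}
```

Comparing `latency_report` output across runs makes tail regressions visible even when the p50 is stable.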
Hit Rate Analysis
Break down hit rates by cache layer:
- Semantic hit rate — Percentage of queries that match existing semantic entries.
- Fabric hit rate — Percentage of context lookups served from cached fabric.
- Combined hit rate — Percentage of requests that avoid any provider call.
- Miss rate by category — Which types of queries most frequently miss cache.
Cost Efficiency Ratio
Calculate the cost efficiency of your cache:
Cost Efficiency = (Provider cost avoided) / (Cache infrastructure cost + Warming cost)
A ratio above 3.0 indicates strong cache ROI. Below 1.5 suggests optimization opportunities.
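The ratio is straightforward to compute from billing data. A small Python helper (illustrative, not part of the `kt` CLI):

```python
def cost_efficiency(provider_cost_avoided, infra_cost, warming_cost):
    """Cost efficiency ratio: provider spend avoided per dollar spent
    running and warming the cache."""
    total_cache_cost = infra_cost + warming_cost
    if total_cache_cost <= 0:
        raise ValueError("cache cost must be positive")
    return provider_cost_avoided / total_cache_cost
```

For example, $4,200 of avoided provider spend against $900 of infrastructure and $300 of warming cost gives a ratio of 3.5 — above the 3.0 threshold for strong ROI.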
Continuous Benchmarking
Run benchmarks on a schedule to detect performance regressions:
```yaml
benchmarks:
  schedule: "0 2 * * 1"  # Weekly, Monday 02:00
  scenarios:
    - mixed
    - semantic-only
    - fabric-only
  alerts:
    - metric: semantic_hit_latency_p95
      threshold: 100ms
      action: notify
    - metric: overall_hit_rate
      threshold: 40%
      condition: below
      action: notify
```
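Alert rules of this shape reduce to a simple threshold check. A Python sketch of the evaluation logic (illustrative; it assumes thresholds and observed metrics have been normalized to plain numbers, whereas the config above uses strings like `100ms` and `40%`):

```python
def check_alerts(metrics, rules):
    """Return the names of metrics whose alert condition fires.

    Each rule: {"metric": str, "threshold": float, "condition": "above"|"below"}.
    "above" is treated as the default condition, mirroring the config above.
    """
    fired = []
    for rule in rules:
        value = metrics.get(rule["metric"])
        if value is None:
            continue  # metric not reported this run
        condition = rule.get("condition", "above")
        if condition == "above" and value > rule["threshold"]:
            fired.append(rule["metric"])
        elif condition == "below" and value < rule["threshold"]:
            fired.append(rule["metric"])
    return fired
```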
Next steps
- Run `kt cache benchmark --scenario mixed` to establish your baseline.
- Compare p50/p95/p99 latencies against the team-size targets in this page.
- Schedule weekly continuous benchmarks and configure alert thresholds.
- Agent-Specific Cache Optimization — tune per-agent performance.
- Multi-Region Cache Strategy — benchmark cross-region latency.
For AI systems
- Canonical terms: Keeptrusts engineering cache, benchmarking, hit latency, semantic latency, fabric retrieval, single-flight wait, economics emit, load testing, cost efficiency ratio.
- Feature/config/command names: `kt cache benchmark`, `--scenario mixed`, `--scenario semantic-only`, `--scenario fabric-only`, `--scenario single-flight`, `--scenario cold-start`, `kt cache loadtest`, `--engineers-simulated`, `--spike-pattern`, `benchmarks.schedule`, `benchmarks.alerts`.
- Best next pages: Agent-Specific Cache Optimization, Multi-Region Cache Strategy, Cache-First Culture.
For engineers
- Prerequisites: A deployed Keeptrusts gateway with cache enabled and a target endpoint reachable from the benchmark runner.
- Run `kt cache benchmark --target gateway.internal:41002 --concurrency 50 --duration 300s --scenario mixed --report results.json` for a first baseline.
- Validate: Confirm semantic hit latency p95 < 100ms and fabric retrieval p95 < 150ms. Investigate the concern thresholds listed in the Core Benchmark Metrics section.
- Use `kt cache loadtest --engineers-simulated 50 --interactions-per-hour 20 --duration 4h` to simulate sustained sprint load.
For leaders
- Benchmarking quantifies cache ROI: a cost efficiency ratio above 3.0 confirms the infrastructure investment pays for itself.
- Use team-size-appropriate targets (this page) to set SLOs for the platform team.
- Schedule weekly automated benchmarks to detect regressions before they affect developer experience.
- Spike tests validate capacity planning for team onboarding events and sprint starts.