Tuning Cache Warmer Concurrency

The KEEPTRUSTS_CACHE_WARMER_CONCURRENCY setting controls how many cache generation jobs each warmer instance processes simultaneously. Tuning this value correctly ensures your cache stays warm without overwhelming your infrastructure.

Use this page when

  • You need to tune KEEPTRUSTS_CACHE_WARMER_CONCURRENCY based on your workload, resource constraints, and queue depth.
  • You are deciding between vertical scaling (more concurrency per instance) and horizontal scaling (more instances).
  • You see queue depth alerts firing and need to determine the right scaling response.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Default Configuration

Each warmer instance defaults to processing 4 concurrent jobs:

KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=4

This means a single warmer process runs up to 4 artifact generation tasks in parallel. Each task may involve LLM calls, file parsing, embedding computation, or graph construction depending on the artifact type.
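The bounded-pool behavior can be pictured with a minimal sketch. This is not the actual warmer implementation (which is not documented here), just the general pattern, assuming an asyncio-style worker where a semaphore caps in-flight jobs:

```python
import asyncio

CONCURRENCY = 4  # mirrors KEEPTRUSTS_CACHE_WARMER_CONCURRENCY

async def run_jobs(num_jobs: int, concurrency: int) -> int:
    """Run num_jobs dummy warming tasks, at most `concurrency` at once.
    Returns the peak number of simultaneously running tasks."""
    sem = asyncio.Semaphore(concurrency)
    active = 0
    peak = 0

    async def job() -> None:
        nonlocal active, peak
        async with sem:                # claim a slot in the pool
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stands in for LLM calls / parsing
            active -= 1

    await asyncio.gather(*(job() for _ in range(num_jobs)))
    return peak

if __name__ == "__main__":
    peak = asyncio.run(run_jobs(num_jobs=20, concurrency=CONCURRENCY))
    print(f"peak concurrency: {peak}")  # never exceeds 4
```

However many jobs are queued, only 4 touch the CPU, memory, and LLM endpoint at once, which is why the setting bounds resource consumption per instance.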

Understanding Resource Consumption

Different artifact types have different resource profiles:

| Artifact Type | CPU | Memory | LLM Calls | Typical Duration |
|---|---|---|---|---|
| repo_map | Low | Low | 0 | 5–15s |
| dependency_graph | Medium | Medium | 0 | 10–30s |
| test_map | Medium | Low | 0 | 10–20s |
| api_inventory | Medium | Medium | 1–3 | 30–60s |
| symbol_index | High | High | 0 | 30–120s |
| embedding_index | High | High | Many | 60–300s |
| file_summary | Low | Low | 1 per file | 10–30s per file |

When tuning concurrency, consider the mix of artifact types in your queue. A queue dominated by embedding_index jobs needs lower concurrency than one processing mostly repo_map entries.
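One way to reason about a mixed queue is to compute the average job duration weighted by the mix. A rough sketch, using midpoints of the duration ranges from the table above (the queue mixes here are illustrative):

```python
# Midpoint durations (seconds) from the resource-profile table above.
DURATION = {
    "repo_map": 10,
    "dependency_graph": 20,
    "test_map": 15,
    "api_inventory": 45,
    "symbol_index": 75,
    "embedding_index": 180,
}

def avg_job_seconds(queue_mix: dict[str, int]) -> float:
    """Average duration of a job given counts per artifact type."""
    total_jobs = sum(queue_mix.values())
    total_seconds = sum(DURATION[t] * n for t, n in queue_mix.items())
    return total_seconds / total_jobs

# A queue dominated by heavy embedding jobs...
heavy = avg_job_seconds({"embedding_index": 40, "repo_map": 10})
# ...versus one dominated by light repo_map jobs.
light = avg_job_seconds({"embedding_index": 10, "repo_map": 40})
print(f"heavy mix: {heavy:.0f}s/job, light mix: {light:.0f}s/job")
```

The heavier the average job, the longer each concurrency slot stays occupied, so the same setting produces very different memory pressure and drain rates depending on the mix.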

When to Increase Concurrency

Increase KEEPTRUSTS_CACHE_WARMER_CONCURRENCY when:

Large Repositories

Repositories with 500K+ lines of code generate many warming jobs on connect. Higher concurrency reduces the time to full cache coverage:

KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=8

Many Connected Repositories

Organizations with 50+ repositories produce a steady stream of commit-triggered jobs. Increase concurrency to prevent queue buildup:

KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=6

Slow Embedding Generation

If your embedding model has high latency (>1s per chunk), higher concurrency keeps the pipeline saturated while waiting for responses:

KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=10

Rapid Development Cadence

Teams pushing 50+ commits per day need warmers that keep pace with change velocity:

KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=8

When to Decrease Concurrency

Decrease KEEPTRUSTS_CACHE_WARMER_CONCURRENCY when:

Limited Compute Resources

If the warmer shares a host with other services, high concurrency can starve critical workloads:

KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=2

Shared Infrastructure

When multiple warmer instances compete for the same database connection pool or LLM endpoint, reduce per-instance concurrency to avoid contention:

KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=2

Memory Constraints

symbol_index and embedding_index jobs can consume 500MB–2GB each. On memory-constrained hosts, limit concurrency to prevent OOM kills:

KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=2

Rate-Limited LLM Providers

If your LLM provider enforces strict rate limits, reduce concurrency to stay within quota:

KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=3
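To pick a value against a quota, note that each in-flight job issues roughly calls_per_job calls every job_duration seconds. A back-of-the-envelope sketch (the quota and job profile below are illustrative, loosely based on the api_inventory row above):

```python
def max_concurrency_for_quota(requests_per_min: float,
                              calls_per_job: float,
                              job_duration_s: float) -> int:
    """Largest concurrency whose steady-state LLM call rate stays in quota."""
    # Each concurrent job issues calls_per_job calls every job_duration_s seconds.
    calls_per_min_per_job = calls_per_job * 60.0 / job_duration_s
    return int(requests_per_min // calls_per_min_per_job)

# e.g. a 12 req/min quota, api_inventory jobs making ~3 calls over ~45s:
print(max_concurrency_for_quota(12, 3, 45))  # 3 concurrent jobs stay in quota
```

Bursts can still exceed the steady-state rate, so leave headroom below the computed ceiling.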

Monitoring Queue Depth

Track queue depth to determine if your concurrency settings are adequate. Check the console under Settings → Engineering Cache → Warmers → Queue:

  • Queue depth: Number of jobs waiting. Sustained depth > 50 suggests you need more capacity.
  • Oldest job age: Time the oldest queued job has been waiting. If this exceeds 10 minutes during normal operation, increase concurrency or add instances.
  • Processing rate: Jobs completed per minute. Compare against job arrival rate.
  • Average job duration: Helps predict drain time for the current queue.
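Together, these metrics let you estimate how long the current queue will take to drain. A back-of-the-envelope sketch (all inputs hypothetical):

```python
def drain_minutes(queue_depth: int,
                  instances: int,
                  concurrency: int,
                  avg_job_seconds: float) -> float:
    """Approximate minutes to empty the queue, ignoring new arrivals."""
    jobs_per_minute = instances * concurrency * 60.0 / avg_job_seconds
    return queue_depth / jobs_per_minute

# 200 queued jobs, 2 instances at concurrency 4, ~30s per job:
print(f"{drain_minutes(200, 2, 4, 30):.1f} min")  # ~12.5 minutes
```

If the estimated drain time regularly exceeds your alert thresholds, add capacity before the queue alerts fire rather than after.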

Alerting on Queue Depth

Configure alerts to notify you when the queue grows beyond acceptable levels:

engineering_cache:
  alerts:
    queue_depth_warning: 50
    queue_depth_critical: 200
    oldest_job_age_warning_seconds: 600
    oldest_job_age_critical_seconds: 1800

Horizontal Scaling with Multiple Workers

Instead of increasing concurrency on a single instance, you can deploy multiple warmer instances. Each instance independently claims jobs from the shared queue using advisory locks, preventing duplicate work.

Scaling Formula

A good starting point:

Total concurrency = instances × KEEPTRUSTS_CACHE_WARMER_CONCURRENCY

For example, 3 instances with concurrency 4 gives you 12 parallel jobs total.

When to Scale Horizontally vs. Vertically

| Approach | Use When |
|---|---|
| Increase concurrency (vertical) | Single host has spare CPU/memory |
| Add instances (horizontal) | Need fault tolerance, or host is at capacity |
| Both | Large organizations with high warming demand |

Deploying Multiple Instances

In Docker Compose, use the replicas setting:

services:
  keeptrusts-cache-warmer:
    image: keeptrusts/api:latest
    command: ["/usr/local/bin/worker_cache_warmer"]
    environment:
      KEEPTRUSTS_CACHE_WARMER_CONCURRENCY: "4"
    deploy:
      replicas: 3

In Kubernetes, set the Deployment replicas:

spec:
  replicas: 3

Avoiding Database Connection Exhaustion

Each concurrent job holds a database connection. Calculate your total connection requirement:

Max connections = instances × concurrency × 2 (jobs + polling)

Ensure your database connection pool (KEEPTRUSTS_DB_MAX_CONNECTIONS) accommodates all warmer instances plus the main API. Consider using PgBouncer for connection pooling in large deployments.
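The budget check is simple arithmetic. A sketch encoding the rule of thumb above (API_POOL is a hypothetical reservation for the main API):

```python
API_POOL = 20  # hypothetical connections reserved for the main API

def warmer_connections(instances: int, concurrency: int) -> int:
    # instances × concurrency × 2, per the rule of thumb above
    # (the ×2 covers queue polling on top of the per-job connections).
    return instances * concurrency * 2

def pool_is_sufficient(db_max_connections: int,
                       instances: int,
                       concurrency: int) -> bool:
    """True if KEEPTRUSTS_DB_MAX_CONNECTIONS covers warmers plus the API."""
    return warmer_connections(instances, concurrency) + API_POOL <= db_max_connections

# 3 instances at concurrency 4 need 24 warmer connections:
print(warmer_connections(3, 4))      # 24
print(pool_is_sufficient(50, 3, 4))  # True: 24 + 20 <= 50
```

Run this check before scaling out: adding instances multiplies connection demand even when per-instance concurrency stays constant.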

| Organization Size | Repos | Instances | Concurrency | Total Parallel |
|---|---|---|---|---|
| Small (1–10 devs) | 1–5 | 1 | 4 | 4 |
| Medium (10–50 devs) | 5–20 | 2 | 4 | 8 |
| Large (50–200 devs) | 20–100 | 3 | 6 | 18 |
| Enterprise (200+ devs) | 100+ | 5+ | 8 | 40+ |

These are starting points. Monitor queue depth and job age after deployment and adjust based on observed behavior.

Verifying Your Configuration

After changing concurrency settings, verify the warmer is operating correctly:

# Check warmer logs for the new concurrency value
docker compose logs keeptrusts-cache-warmer | grep "concurrency"
# Expected: INFO worker_cache_warmer: starting cache warmer worker concurrency=8

# Monitor queue drain rate
watch -n5 'curl -s http://localhost:8081/health | jq .queue_depth'

Next steps

For AI systems

  • Canonical terms: Keeptrusts, KEEPTRUSTS_CACHE_WARMER_CONCURRENCY, horizontal scaling, queue depth, advisory lock, warmer instances, worker_cache_warmer.
  • Config keys: KEEPTRUSTS_CACHE_WARMER_CONCURRENCY, engineering_cache.alerts.queue_depth_warning, engineering_cache.alerts.queue_depth_critical, engineering_cache.alerts.oldest_job_age_warning_seconds.
  • Best next pages: Setting Up Cache Warmers, Warmer Triggers, Cache TTL and Expiry.

For engineers

  • Default: 4 concurrent jobs per instance. Resource profiles vary: repo_map is light (5–15s), embedding_index is heavy (60–300s, 500MB–2GB).
  • Increase concurrency (6–8) for: large repos (500K+ LOC), many connected repos (50+), slow embedding models, high commit velocity (50+/day).
  • Decrease concurrency (2) for: shared hosts, memory constraints, rate-limited LLM providers.
  • Total concurrency formula: instances × KEEPTRUSTS_CACHE_WARMER_CONCURRENCY. E.g., 3 instances × 4 = 12 parallel jobs.
  • Max DB connections needed: instances × concurrency × 2. Ensure pool accommodates all warmers + main API.
  • Monitor: sustained queue depth > 50 or oldest job age > 10 min means you need more capacity.

For leaders

  • Concurrency directly trades infrastructure cost against cache freshness latency.
  • Under-provisioned warmers mean slower cache fills and delayed cost savings after code changes.
  • Horizontal scaling provides fault tolerance (one instance failure doesn’t stop warming).
  • Enterprise deployments (200+ devs, 100+ repos) typically need 5+ instances with concurrency 8 (40+ parallel jobs).