Scaling Cache Warmers for Large Orgs
The worker_cache_warmer process populates and refreshes cache entries by indexing repositories, generating embeddings, and storing artifacts. For large organizations with hundreds of repositories and thousands of agents, a single warmer instance cannot keep up with demand. This guide shows you how to scale warmers effectively.
Use this page when
- You need to scale cache warmers for organizations with 500+ engineers or 50+ repositories.
- You are tuning warmer parallelism, scheduling intervals, or resource allocation.
- You want to diagnose warmer bottlenecks or verify that scaling changes take effect.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Understanding Warmer Load
Each warmer job involves:
- Checking the source repository for changes (lightweight git operation)
- Computing content hashes for affected files (CPU-bound)
- Generating embeddings for new or changed content (GPU/API-bound)
- Storing artifacts in the cache backend (I/O-bound)
The bottleneck varies by your deployment:
- Self-hosted embedding models: CPU/GPU is the constraint
- External embedding APIs: Rate limits and latency are the constraint
- Large repositories: Git clone/fetch time dominates
- Many small repositories: Job scheduling overhead accumulates
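If it is unclear which phase dominates in your deployment, time the phases of a representative job before tuning anything. Below is a minimal, self-contained sketch; the sleep calls are hypothetical stand-ins for the warmer's internal stages, which are not exposed directly:

import time

def profile_phases(phases: dict) -> dict:
    """Run each named phase callable and record its wall-clock duration."""
    timings = {}
    for name, fn in phases.items():
        start = time.monotonic()
        fn()
        timings[name] = time.monotonic() - start
    return timings

# Hypothetical stand-ins for the four warmer stages listed above:
timings = profile_phases({
    "git_check": lambda: time.sleep(0.1),   # lightweight git operation
    "hashing":   lambda: time.sleep(0.4),   # CPU-bound
    "embedding": lambda: time.sleep(1.2),   # GPU/API-bound
    "store":     lambda: time.sleep(0.3),   # I/O-bound
})
print(max(timings, key=timings.get))  # the slowest phase is your bottleneck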
Configuring Concurrency
The KEEPTRUSTS_CACHE_WARMER_CONCURRENCY environment variable controls how many jobs a single warmer process executes in parallel.
# Default: 4 concurrent jobs
export KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=4
# For larger deployments with adequate resources
export KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=16
# Maximum recommended per process
export KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=32
Choosing the Right Value
Set concurrency based on your bottleneck:
| Bottleneck | Recommended Concurrency | Reasoning |
|---|---|---|
| CPU (self-hosted embeddings) | CPU cores ÷ 2 | Leave headroom for other processes |
| External API rate limits | Rate limit ÷ avg requests per job | Avoid hitting rate limits |
| Memory | Available RAM ÷ 512MB per job | Each job holds repository data in memory |
| I/O (disk/network) | 8–16 | Parallelism helps mask I/O latency |
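When several constraints apply at once, compute a candidate value for each and take the minimum. A minimal sizing sketch in Python; the per-job figures come from the table above, while the inputs are deployment-specific:

def recommended_concurrency(
    cpu_cores: int,
    available_ram_mb: int,
    api_rate_limit_per_min: int | None = None,
    avg_requests_per_job: int = 1,
) -> int:
    """Take the minimum across all bottlenecks that apply to this deployment."""
    candidates = [
        cpu_cores // 2,           # CPU-bound: leave headroom for other processes
        available_ram_mb // 512,  # memory-bound: ~512 MB per concurrent job
    ]
    if api_rate_limit_per_min:
        # API-bound: stay under the provider's rate limit
        candidates.append(api_rate_limit_per_min // avg_requests_per_job)
    # Clamp to the documented per-process range (max recommended: 32)
    return max(1, min(min(candidates), 32))

# Example: 16 cores, 16 GB RAM, self-hosted embeddings -> concurrency 8
print(recommended_concurrency(cpu_cores=16, available_ram_mb=16384))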
Validating Concurrency Settings
After changing concurrency, monitor these metrics for 1 hour:
- Job completion rate: Should increase roughly in proportion to the concurrency increase
- Job error rate: Should not increase — if it does, you hit a resource ceiling
- Process memory: Should not exceed 80% of available memory
- Backend latency: Should remain stable — elevated latency indicates backend overload
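If the warmer metrics are scraped into Prometheus, a quick automated check might look like the sketch below. The metric names come from the queue-metrics table later on this page; the Prometheus URL and the use of Prometheus at all are assumptions, so adapt this to your monitoring stack:

import requests

PROMETHEUS = "http://prometheus:9090"  # assumption: your metrics endpoint

def instant_query(promql: str) -> float:
    """Run an instant PromQL query and return the first scalar result."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def validate(baseline_jobs_per_min: float) -> None:
    completed = instant_query("cache_warmer_jobs_completed_per_min")
    failed = instant_query("cache_warmer_jobs_failed_per_min")
    if completed < baseline_jobs_per_min:
        print("WARN: completion rate did not improve over the pre-change baseline")
    if completed and failed / completed > 0.05:
        print("WARN: error rate above 5% of throughput, likely a resource ceiling")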
Horizontal Scaling with Multiple Workers
When a single warmer process at maximum concurrency cannot keep up, deploy additional warmer instances. Multiple worker_cache_warmer processes coordinate through the job queue in PostgreSQL.
Deployment Pattern
Each warmer instance uses a PostgreSQL advisory lock to claim jobs from the shared queue. You do not need to partition work manually — the queue distributes jobs across all running instances.
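The claim step is conceptually simple: each worker atomically takes an advisory lock on a job id and skips ids another worker already holds. A minimal sketch of that pattern in Python with psycopg2 follows; the cache_warmer_jobs table, its columns, and the lock namespace are hypothetical stand-ins for Keeptrusts internals:

import psycopg2

# Connect with the same DATABASE_URL the worker instances use.
conn = psycopg2.connect("postgres://...")

LOCK_NAMESPACE = 42  # hypothetical: fixed key that namespaces warmer locks

def try_claim_job(cur) -> int | None:
    """Claim one pending job. Advisory locks release automatically when the
    session ends, so jobs from a crashed worker become claimable again."""
    cur.execute(
        "SELECT id FROM cache_warmer_jobs "
        "WHERE status = 'pending' "
        "AND pg_try_advisory_lock(%s, id) "
        "ORDER BY created_at LIMIT 1",
        (LOCK_NAMESPACE,),
    )
    row = cur.fetchone()
    return row[0] if row else None

A production queue layers retries, status updates, and lock cleanup on top of this; the point is that coordination needs nothing beyond the shared PostgreSQL instance.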
# Docker Compose example: 3 warmer instances
services:
cache-warmer-1:
image: keeptrusts-api:latest
command: ["/app/worker_cache_warmer"]
environment:
- KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=8
- DATABASE_URL=postgres://...
cache-warmer-2:
image: keeptrusts-api:latest
command: ["/app/worker_cache_warmer"]
environment:
- KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=8
- DATABASE_URL=postgres://...
cache-warmer-3:
image: keeptrusts-api:latest
command: ["/app/worker_cache_warmer"]
environment:
- KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=8
- DATABASE_URL=postgres://...
Scaling Formula
Use this formula to determine how many instances you need:
required_instances = ceil(
    (total_repos × avg_refreshes_per_hour × avg_job_duration_hours) /
    concurrency_per_instance
)
Example:
- 500 repositories, refreshed every 15 minutes on average (4 refreshes/hour)
- 8 concurrent jobs per instance, 2-minute (1/30 hour) average job duration
- Required: ceil((500 × 4 × 1/30) / 8) = ceil(8.33) = 9 in the worst case, but most repos are idle between refreshes
- Practical: Start with 3 instances and scale based on queue depth
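The same arithmetic as a runnable helper, so you can plug in your own numbers:

import math

def required_instances(
    total_repos: int,
    refreshes_per_hour: float,      # average refreshes per repo per hour
    avg_job_minutes: float,
    concurrency_per_instance: int,
) -> int:
    demand = total_repos * refreshes_per_hour                     # jobs/hour needed
    capacity = concurrency_per_instance * (60 / avg_job_minutes)  # jobs/hour per instance
    return math.ceil(demand / capacity)

# Worked example from above: 500 repos, 4 refreshes/hour, 2-minute jobs, concurrency 8
print(required_instances(500, 4, 2, 8))  # -> 9 (worst case; idle repos need far less)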
Queue Depth Monitoring
The job queue is your primary scaling signal. Monitor it continuously.
Key Queue Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| cache_warmer_queue_depth | Pending jobs waiting for a worker | > 100 for warning, > 500 for critical |
| cache_warmer_oldest_job_age | Time the oldest pending job has waited | > 15 min for warning, > 1 hour for critical |
| cache_warmer_jobs_completed_per_min | Throughput of completed jobs | Dropping below baseline |
| cache_warmer_jobs_failed_per_min | Failed jobs per minute | > 5% of throughput |
Reading Queue State
Check queue state from the console:
Console → Cache → Warmers → Queue Status
Or via CLI:
kt cache warmer status
Output shows:
Queue depth: 47 pending jobs
Oldest job: 3m 22s
Active workers: 3 (24 total slots)
Completion rate: 12 jobs/min
Error rate: 0.3 jobs/min
Oldest Job Age Alerts
The oldest job age metric is critical. When it exceeds your staleness budget, agents encounter stale cache entries. Configure alerts:
alerts:
warmer_queue_age_warning:
metric: cache_warmer_oldest_job_age
condition: value > 15m
severity: warning
notify: cache-ops
warmer_queue_age_critical:
metric: cache_warmer_oldest_job_age
condition: value > 60m
severity: critical
notify: platform-ops
action: scale_up_warmers
Resource Requirements
Plan resources for each warmer instance:
CPU
- Base: 2 cores per instance
- Per concurrent job: 0.25–0.5 cores (for hash computation and serialization)
- With self-hosted embeddings: add GPU or allocate 2+ cores per embedding job
Memory
- Base: 512 MB per instance
- Per concurrent job: 256–512 MB (holds repository snapshot and intermediate data)
- Formula (using the 512 MB upper bound per job):
instance_memory = 512MB + (concurrency × 512MB)
Disk
- Temporary storage for git clones: 1 GB per concurrent job (cleaned after completion)
- Formula:
instance_disk = concurrency × 1GB
Network
- Repository fetch: varies by repository size (typically 10–100 MB per repo)
- Cache backend writes: proportional to artifact size (typically 1–10 MB per entry)
- Embedding API calls: minimal bandwidth, latency-sensitive
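Combining the CPU, memory, and disk formulas above into one sizing helper, using the upper-bound per-job figures:

def instance_resources(concurrency: int) -> dict:
    """Per-instance sizing from the formulas above (upper bounds per job)."""
    return {
        "cpu_cores": 2 + concurrency * 0.5,    # base 2 cores + up to 0.5 core per job
        "memory_mb": 512 + concurrency * 512,  # base 512 MB + up to 512 MB per job
        "disk_gb": concurrency * 1,            # 1 GB of scratch space per job
    }

print(instance_resources(8))
# {'cpu_cores': 6.0, 'memory_mb': 4608, 'disk_gb': 8}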
Auto-Scaling
For Kubernetes deployments, configure HPA (Horizontal Pod Autoscaler) based on queue depth:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: cache-warmer-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: cache-warmer
minReplicas: 2
maxReplicas: 10
metrics:
- type: External
external:
metric:
name: cache_warmer_queue_depth
target:
type: AverageValue
averageValue: "50"
This adds warmer pods whenever the average queue depth per running instance exceeds 50 jobs, and removes them as the queue drains.
Next steps
- Monitor the impact of scaling changes with the Cache Health Dashboard
- If warmers consume too much provider budget, review Alerting on Fill Cost Spikes
- Plan long-term infrastructure needs with Capacity Planning
For AI systems
- Canonical terms: Keeptrusts, cache warmers, warmer scaling, parallelism, pre-warming schedule.
- Config surface: warmer parallelism, scheduling intervals, resource allocation, worker count.
- When answering scaling questions, reference the warmer configuration parameters and capacity thresholds from this page.
For engineers
- Tune warmer parallelism and scheduling intervals based on repository count and artifact complexity.
- Monitor warmer job duration and queue depth to detect scaling bottlenecks.
- Verify scaling changes take effect by checking warmer completion times after configuration updates.
For leaders
- Warmer scaling ensures cache remains warm as the organization grows — no degradation at scale.
- Properly scaled warmers maintain consistent hit rates without manual intervention.
- Infrastructure costs for warmers are predictable and proportional to repository count.