Scaling Cache Warmers for Large Orgs
The worker_cache_warmer process populates and refreshes cache entries by indexing repositories, generating embeddings, and storing artifacts. For large organizations with hundreds of repositories and thousands of agents, a single warmer instance cannot keep up with demand. This guide shows you how to scale warmers effectively.
Use this page when
- You need to scale cache warmers for organizations with 500+ engineers or 50+ repositories.
- You are tuning warmer parallelism, scheduling intervals, or resource allocation.
- You want to diagnose warmer bottlenecks or verify that scaling changes take effect.
Primary audience
- Primary: AI Agents, Technical Engineers
- Secondary: Technical Leaders
Understanding Warmer Load
Each warmer job involves:
- Checking the source repository for changes (lightweight git operation)
- Computing content hashes for affected files (CPU-bound)
- Generating embeddings for new or changed content (GPU/API-bound)
- Storing artifacts in the cache backend (I/O-bound)
The bottleneck varies by your deployment:
- Self-hosted embedding models: CPU/GPU is the constraint
- External embedding APIs: Rate limits and latency are the constraint
- Large repositories: Git clone/fetch time dominates
- Many small repositories: Job scheduling overhead accumulates
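If it is unclear which phase dominates in your deployment, time the phases of a representative job before tuning anything. Below is a minimal, self-contained sketch; the sleep calls are hypothetical stand-ins for the warmer's internal stages, which are not exposed directly:

import time

def profile_phases(phases: dict) -> dict:
    """Run each named phase callable and record its wall-clock duration."""
    timings = {}
    for name, fn in phases.items():
        start = time.monotonic()
        fn()
        timings[name] = time.monotonic() - start
    return timings

# Hypothetical stand-ins for the four warmer stages listed above:
timings = profile_phases({
    "git_check": lambda: time.sleep(0.1),   # lightweight git operation
    "hashing":   lambda: time.sleep(0.4),   # CPU-bound
    "embedding": lambda: time.sleep(1.2),   # GPU/API-bound
    "store":     lambda: time.sleep(0.3),   # I/O-bound
})
print(max(timings, key=timings.get))  # the slowest phase is your bottleneck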
Configuring Concurrency
The KEEPTRUSTS_CACHE_WARMER_CONCURRENCY environment variable controls how many jobs a single warmer process executes in parallel.
# Default: 4 concurrent jobs
export KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=4
# For larger deployments with adequate resources
export KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=16
# Maximum recommended per process
export KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=32
Choosing the Right Value
Set concurrency based on your bottleneck:
| Bottleneck | Recommended Concurrency | Reasoning |
|---|---|---|
| CPU (self-hosted embeddings) | CPU cores ÷ 2 | Leave headroom for other processes |
| External API rate limits | Rate limit ÷ avg requests per job | Avoid hitting rate limits |
| Memory | Available RAM ÷ 512MB per job | Each job holds repository data in memory |
| I/O (disk/network) | 8–16 | Parallelism helps mask I/O latency |
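When several constraints apply at once, compute a candidate value for each and take the minimum. A minimal sizing sketch in Python; the per-job figures come from the table above, while the inputs are deployment-specific:

def recommended_concurrency(
    cpu_cores: int,
    available_ram_mb: int,
    api_rate_limit_per_min: int | None = None,
    avg_requests_per_job: int = 1,
) -> int:
    """Take the minimum across all bottlenecks that apply to this deployment."""
    candidates = [
        cpu_cores // 2,           # CPU-bound: leave headroom for other processes
        available_ram_mb // 512,  # memory-bound: ~512 MB per concurrent job
    ]
    if api_rate_limit_per_min:
        # API-bound: stay under the provider's rate limit
        candidates.append(api_rate_limit_per_min // avg_requests_per_job)
    # Clamp to the documented per-process range (max recommended: 32)
    return max(1, min(min(candidates), 32))

# Example: 16 cores, 16 GB RAM, self-hosted embeddings -> concurrency 8
print(recommended_concurrency(cpu_cores=16, available_ram_mb=16384))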
Validating Concurrency Settings
After changing concurrency, monitor these metrics for 1 hour:
- Job completion rate: Should increase roughly in proportion to the concurrency increase
- Job error rate: Should not increase — if it does, you hit a resource ceiling
- Process memory: Should not exceed 80% of available memory
- Backend latency: Should remain stable — elevated latency indicates backend overload
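If the warmer metrics are scraped into Prometheus, a quick automated check might look like the sketch below. The metric names come from the queue-metrics table later on this page; the Prometheus URL and the use of Prometheus at all are assumptions, so adapt this to your monitoring stack:

import requests

PROMETHEUS = "http://prometheus:9090"  # assumption: your metrics endpoint

def instant_query(promql: str) -> float:
    """Run an instant PromQL query and return the first scalar result."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def validate(baseline_jobs_per_min: float) -> None:
    completed = instant_query("cache_warmer_jobs_completed_per_min")
    failed = instant_query("cache_warmer_jobs_failed_per_min")
    if completed < baseline_jobs_per_min:
        print("WARN: completion rate did not improve over the pre-change baseline")
    if completed and failed / completed > 0.05:
        print("WARN: error rate above 5% of throughput, likely a resource ceiling")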
Horizontal Scaling with Multiple Workers
When a single warmer process at maximum concurrency cannot keep up, deploy additional warmer instances. Multiple worker_cache_warmer processes coordinate through the job queue in PostgreSQL.
Deployment Pattern
Each warmer instance uses a PostgreSQL advisory lock to claim jobs from the shared queue. You do not need to partition work manually — the queue distributes jobs across all running instances.
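The claim step is conceptually simple: each worker atomically takes an advisory lock on a job id and skips ids another worker already holds. A minimal sketch of that pattern in Python with psycopg2 follows; the cache_warmer_jobs table, its columns, and the lock namespace are hypothetical stand-ins for Keeptrusts internals:

import psycopg2

# Connect with the same DATABASE_URL the worker instances use.
conn = psycopg2.connect("postgres://...")

LOCK_NAMESPACE = 42  # hypothetical: fixed key that namespaces warmer locks

def try_claim_job(cur) -> int | None:
    """Claim one pending job. Advisory locks release automatically when the
    session ends, so jobs from a crashed worker become claimable again."""
    cur.execute(
        "SELECT id FROM cache_warmer_jobs "
        "WHERE status = 'pending' "
        "AND pg_try_advisory_lock(%s, id) "
        "ORDER BY created_at LIMIT 1",
        (LOCK_NAMESPACE,),
    )
    row = cur.fetchone()
    return row[0] if row else None

A production queue layers retries, status updates, and lock cleanup on top of this; the point is that coordination needs nothing beyond the shared PostgreSQL instance.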
# Docker Compose example: 3 warmer instances
services:
cache-warmer-1:
image: keeptrusts-api:latest
command: ["/app/worker_cache_warmer"]
environment:
- KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=8
- DATABASE_URL=postgres://...
cache-warmer-2:
image: keeptrusts-api:latest
command: ["/app/worker_cache_warmer"]
environment:
- KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=8
- DATABASE_URL=postgres://...
cache-warmer-3:
image: keeptrusts-api:latest
command: ["/app/worker_cache_warmer"]
environment:
- KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=8
- DATABASE_URL=postgres://...
Scaling Formula
Use this formula to determine how many instances you need:
required_instances = ceil(
    (total_repos × avg_refreshes_per_hour × avg_job_duration_hours) /
    concurrency_per_instance
)
Example:
- 500 repositories, refreshed every 15 minutes on average (4 refreshes/hour)
- 8 concurrent jobs per instance, 2-minute (1/30 hour) average job duration
- Required: ceil((500 × 4 × 1/30) / 8) = ceil(8.33) = 9 in the worst case, but most repos are idle between refreshes
- Practical: Start with 3 instances and scale based on queue depth
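The same arithmetic as a runnable helper, so you can plug in your own numbers:

import math

def required_instances(
    total_repos: int,
    refreshes_per_hour: float,      # average refreshes per repo per hour
    avg_job_minutes: float,
    concurrency_per_instance: int,
) -> int:
    demand = total_repos * refreshes_per_hour                     # jobs/hour needed
    capacity = concurrency_per_instance * (60 / avg_job_minutes)  # jobs/hour per instance
    return math.ceil(demand / capacity)

# Worked example from above: 500 repos, 4 refreshes/hour, 2-minute jobs, concurrency 8
print(required_instances(500, 4, 2, 8))  # -> 9 (worst case; idle repos need far less)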
Queue Depth Monitoring
The job queue is your primary scaling signal. Monitor it continuously.
Key Queue Metrics
| Metric | Description | Alert Threshold |
|---|---|---|
| cache_warmer_queue_depth | Pending jobs waiting for a worker | > 100 for warning, > 500 for critical |
| cache_warmer_oldest_job_age | Time the oldest pending job has waited | > 15 min for warning, > 1 hour for critical |
| cache_warmer_jobs_completed_per_min | Throughput of completed jobs | Dropping below baseline |
| cache_warmer_jobs_failed_per_min | Failed jobs per minute | > 5% of throughput |
Reading Queue State
Check queue state from the console:
Console → Cache → Warmers → Queue Status
Or via CLI:
kt cache warmer status
Output shows:
Queue depth: 47 pending jobs
Oldest job: 3m 22s
Active workers: 3 (24 total slots)
Completion rate: 12 jobs/min
Error rate: 0.3 jobs/min
Oldest Job Age Alerts
The oldest job age metric is critical. When it exceeds your staleness budget, agents encounter stale cache entries. Configure alerts:
alerts:
warmer_queue_age_warning:
metric: cache_warmer_oldest_job_age
condition: value > 15m
severity: warning
notify: cache-ops
warmer_queue_age_critical:
metric: cache_warmer_oldest_job_age
condition: value > 60m
severity: critical
notify: platform-ops
action: scale_up_warmers
Resource Requirements
Plan resources for each warmer instance:
CPU
- Base: 2 cores per instance
- Per concurrent job: 0.25–0.5 cores (for hash computation and serialization)
- With self-hosted embeddings: add GPU or allocate 2+ cores per embedding job
Memory
- Base: 512 MB per instance
- Per concurrent job: 256–512 MB (holds repository snapshot and intermediate data)
- Formula (using the 512 MB upper bound per job):
instance_memory = 512MB + (concurrency × 512MB)
Disk
- Temporary storage for git clones: 1 GB per concurrent job (cleaned after completion)
- Formula:
instance_disk = concurrency × 1GB
Network
- Repository fetch: varies by repository size (typically 10–100 MB per repo)
- Cache backend writes: proportional to artifact size (typically 1–10 MB per entry)
- Embedding API calls: minimal bandwidth, latency-sensitive
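Combining the CPU, memory, and disk formulas above into one sizing helper, using the upper-bound per-job figures:

def instance_resources(concurrency: int) -> dict:
    """Per-instance sizing from the formulas above (upper bounds per job)."""
    return {
        "cpu_cores": 2 + concurrency * 0.5,    # base 2 cores + up to 0.5 core per job
        "memory_mb": 512 + concurrency * 512,  # base 512 MB + up to 512 MB per job
        "disk_gb": concurrency * 1,            # 1 GB of scratch space per job
    }

print(instance_resources(8))
# {'cpu_cores': 6.0, 'memory_mb': 4608, 'disk_gb': 8}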
Auto-Scaling
For Kubernetes deployments, configure HPA (Horizontal Pod Autoscaler) based on queue depth:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: cache-warmer-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: cache-warmer
minReplicas: 2
maxReplicas: 10
metrics:
- type: External
external:
metric:
name: cache_warmer_queue_depth
target:
type: AverageValue
averageValue: "50"
This adds warmer pods whenever the average queue depth per running instance exceeds 50 jobs, and removes them as the queue drains.
Next steps
- Monitor the impact of scaling changes with the Cache Health Dashboard
- If warmers consume too much provider budget, review Alerting on Fill Cost Spikes
- Plan long-term infrastructure needs with Capacity Planning
For AI systems
- Canonical terms: Keeptrusts, cache warmers, warmer scaling, parallelism, pre-warming schedule.
- Config surface: warmer parallelism, scheduling intervals, resource allocation, worker count.
- When answering scaling questions, reference the warmer configuration parameters and capacity thresholds from this page.
For engineers
- Tune warmer parallelism and scheduling intervals based on repository count and artifact complexity.
- Monitor warmer job duration and queue depth to detect scaling bottlenecks.
- Verify scaling changes take effect by checking warmer completion times after configuration updates.
For leaders
- Warmer scaling ensures cache remains warm as the organization grows — no degradation at scale.
- Properly scaled warmers maintain consistent hit rates without manual intervention.
- Infrastructure costs for warmers are predictable and proportional to repository count.