Scaling Cache Warmers for Large Orgs

The worker_cache_warmer process populates and refreshes cache entries by indexing repositories, generating embeddings, and storing artifacts. For large organizations with hundreds of repositories and thousands of agents, a single warmer instance cannot keep up with demand. This guide shows you how to scale warmers effectively.

Use this page when

  • You need to scale cache warmers for organizations with 500+ engineers or 50+ repositories.
  • You are tuning warmer parallelism, scheduling intervals, or resource allocation.
  • You want to diagnose warmer bottlenecks or verify that scaling changes take effect.

Primary audience

  • Primary: AI Agents, Technical Engineers
  • Secondary: Technical Leaders

Understanding Warmer Load

Each warmer job involves:

  1. Checking the source repository for changes (lightweight git operation)
  2. Computing content hashes for affected files (CPU-bound)
  3. Generating embeddings for new or changed content (GPU/API-bound)
  4. Storing artifacts in the cache backend (I/O-bound)

The bottleneck varies by your deployment:

  • Self-hosted embedding models: CPU/GPU is the constraint
  • External embedding APIs: Rate limits and latency are the constraint
  • Large repositories: Git clone/fetch time dominates
  • Many small repositories: Job scheduling overhead accumulates

Configuring Concurrency

The KEEPTRUSTS_CACHE_WARMER_CONCURRENCY environment variable controls how many jobs a single warmer process executes in parallel.

# Default: 4 concurrent jobs
export KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=4

# For larger deployments with adequate resources
export KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=16

# Maximum recommended per process
export KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=32

Choosing the Right Value

Set concurrency based on your bottleneck:

| Bottleneck | Recommended Concurrency | Reasoning |
| --- | --- | --- |
| CPU (self-hosted embeddings) | CPU cores ÷ 2 | Leave headroom for other processes |
| External API rate limits | Rate limit ÷ avg requests per job | Avoid hitting rate limits |
| Memory | Available RAM ÷ 512 MB per job | Each job holds repository data in memory |
| I/O (disk/network) | 8–16 | Parallelism helps mask I/O latency |
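
To make the table concrete, here is a minimal sizing sketch. The function name and its inputs are ours, not part of Keeptrusts; it simply encodes the heuristics above and clamps the result to the documented per-process range:

def recommended_concurrency(bottleneck: str,
                            cpu_cores: int = 0,
                            rate_limit_per_min: int = 0,
                            avg_requests_per_job: int = 1,
                            available_ram_mb: int = 0) -> int:
    """Encode the sizing heuristics from the table above (illustrative only)."""
    if bottleneck == "cpu":
        # Self-hosted embeddings: leave headroom for other processes.
        value = cpu_cores // 2
    elif bottleneck == "api":
        # Stay under the external API's rate limit.
        value = rate_limit_per_min // max(avg_requests_per_job, 1)
    elif bottleneck == "memory":
        # Each job holds roughly 512 MB of repository data.
        value = available_ram_mb // 512
    else:
        # I/O-bound: moderate parallelism masks disk/network latency.
        value = 12
    # Clamp to the documented range: default 4, maximum recommended 32.
    return max(1, min(value, 32))

# Example: a 16-core host running self-hosted embeddings -> concurrency 8
print(recommended_concurrency("cpu", cpu_cores=16))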

Validating Concurrency Settings

After changing concurrency, monitor these metrics for 1 hour:

  • Job completion rate: Should rise roughly in proportion to the concurrency increase
  • Job error rate: Should not increase — if it does, you hit a resource ceiling
  • Process memory: Should not exceed 80% of available memory
  • Backend latency: Should remain stable — elevated latency indicates backend overload
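
If you sample these metrics before and after the change, the checklist can be automated. A rough sketch, assuming hypothetical snapshot keys (the key names are ours, not Keeptrusts metric names):

def scaling_change_ok(before: dict, after: dict, concurrency_ratio: float) -> list:
    """Flag checklist violations after a concurrency change (illustrative).

    before/after are metric snapshots; all key names here are hypothetical.
    """
    problems = []
    # Completion rate should rise roughly in proportion to the concurrency increase.
    if after["jobs_per_min"] < before["jobs_per_min"] * concurrency_ratio * 0.8:
        problems.append("completion rate did not scale; likely a resource ceiling")
    # Error rate should not increase.
    if after["errors_per_min"] > before["errors_per_min"] * 1.1:
        problems.append("error rate increased; resource ceiling hit")
    # Memory should stay under 80% of what is available.
    if after["memory_pct"] > 80:
        problems.append("process memory above 80% of available")
    # Backend latency should remain stable.
    if after["backend_latency_ms"] > before["backend_latency_ms"] * 1.5:
        problems.append("backend latency elevated; backend may be overloaded")
    return problems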

Horizontal Scaling with Multiple Workers

When a single warmer process at maximum concurrency cannot keep up, deploy additional warmer instances. Multiple worker_cache_warmer processes coordinate through the job queue in PostgreSQL.

Deployment Pattern

Each warmer instance uses a PostgreSQL advisory lock to claim jobs from the shared queue. You do not need to partition work manually — the queue distributes jobs across all running instances.
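
The claiming pattern itself is standard PostgreSQL. The sketch below is illustrative only: the cache_warmer_jobs table and its columns are hypothetical (the real schema is internal), but it shows how advisory locks keep two instances from claiming the same job:

import psycopg2

def claim_next_job(conn):
    """Claim one pending job via an advisory lock (illustrative sketch).

    The table name cache_warmer_jobs and its columns are hypothetical.
    """
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id FROM cache_warmer_jobs "
            "WHERE status = 'pending' ORDER BY created_at LIMIT 20"
        )
        for (job_id,) in cur.fetchall():
            # pg_try_advisory_lock returns true only for the first claimant;
            # other warmer instances skip this job and try the next candidate.
            cur.execute("SELECT pg_try_advisory_lock(%s)", (job_id,))
            if cur.fetchone()[0]:
                return job_id  # hold the lock until the job completes
    return None  # queue empty, or all candidates claimed by other workers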

# Docker Compose example: 3 warmer instances
services:
  cache-warmer-1:
    image: keeptrusts-api:latest
    command: ["/app/worker_cache_warmer"]
    environment:
      - KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=8
      - DATABASE_URL=postgres://...

  cache-warmer-2:
    image: keeptrusts-api:latest
    command: ["/app/worker_cache_warmer"]
    environment:
      - KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=8
      - DATABASE_URL=postgres://...

  cache-warmer-3:
    image: keeptrusts-api:latest
    command: ["/app/worker_cache_warmer"]
    environment:
      - KEEPTRUSTS_CACHE_WARMER_CONCURRENCY=8
      - DATABASE_URL=postgres://...

Scaling Formula

Use this formula to estimate a worst-case instance count (job duration in minutes):

required_instances = ceil(
  (total_repos × refreshes_per_hour × avg_job_duration_min) /
  (concurrency_per_instance × 60)
)

Example:

  • 500 repositories, refreshed every 15 minutes on average (4 refreshes per hour)
  • 8 concurrent jobs per instance, 2 minutes average job duration
  • Required: ceil((500 × 4 × 2) / (8 × 60)) = ceil(8.33) = 9 instances at full load
  • Practical: most refresh jobs find no changes and finish quickly, so start with 3 instances and scale based on queue depth (see the sketch below)
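
The same calculation in code, using the example numbers (the function name is ours):

import math

def required_instances(total_repos: int,
                       refreshes_per_hour: float,
                       avg_job_duration_min: float,
                       concurrency_per_instance: int) -> int:
    """Worst-case instance count: assumes every refresh does full work."""
    jobs_per_hour = total_repos * refreshes_per_hour
    # Each slot completes 60 / avg_job_duration_min jobs per hour.
    per_instance_per_hour = concurrency_per_instance * (60 / avg_job_duration_min)
    return math.ceil(jobs_per_hour / per_instance_per_hour)

# 500 repos refreshed 4x/hour, 2-minute jobs, concurrency 8 -> 9
print(required_instances(500, 4, 2, 8))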

Queue Depth Monitoring

The job queue is your primary scaling signal. Monitor it continuously.

Key Queue Metrics

| Metric | Description | Alert Threshold |
| --- | --- | --- |
| cache_warmer_queue_depth | Pending jobs waiting for a worker | > 100 warning, > 500 critical |
| cache_warmer_oldest_job_age | Time the oldest pending job has waited | > 15 min warning, > 1 hour critical |
| cache_warmer_jobs_completed_per_min | Throughput of completed jobs | Dropping below baseline |
| cache_warmer_jobs_failed_per_min | Failed jobs per minute | > 5% of throughput |

Reading Queue State

Check queue state from the console:

Console → Cache → Warmers → Queue Status

Or via CLI:

kt cache warmer status

Output shows:

Queue depth: 47 pending jobs
Oldest job: 3m 22s
Active workers: 3 (24 total slots)
Completion rate: 12 jobs/min
Error rate: 0.3 jobs/min
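
If you script around the CLI, this output is straightforward to parse. A minimal sketch, assuming the exact output format shown above:

import re
import subprocess

def warmer_queue_depth() -> int:
    """Parse kt cache warmer status; assumes the output format shown above."""
    out = subprocess.run(["kt", "cache", "warmer", "status"],
                         capture_output=True, text=True, check=True).stdout
    match = re.search(r"Queue depth:\s*(\d+) pending", out)
    if match is None:
        raise ValueError("unexpected status output")
    return int(match.group(1))

# Example: flag the documented warning threshold
if warmer_queue_depth() > 100:
    print("warning: warmer queue depth above 100")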

Oldest Job Age Alerts

The oldest job age metric is critical. When it exceeds your staleness budget, agents encounter stale cache entries. Configure alerts:

alerts:
  warmer_queue_age_warning:
    metric: cache_warmer_oldest_job_age
    condition: value > 15m
    severity: warning
    notify: cache-ops

  warmer_queue_age_critical:
    metric: cache_warmer_oldest_job_age
    condition: value > 60m
    severity: critical
    notify: platform-ops
    action: scale_up_warmers

Resource Requirements

Plan resources for each warmer instance:

CPU

  • Base: 2 cores per instance
  • Per concurrent job: 0.25–0.5 cores (for hash computation and serialization)
  • With self-hosted embeddings: add GPU or allocate 2+ cores per embedding job

Memory

  • Base: 512 MB per instance
  • Per concurrent job: 256–512 MB (holds repository snapshot and intermediate data)
  • Formula: instance_memory = 512MB + (concurrency × 512MB)

Disk

  • Temporary storage for git clones: 1 GB per concurrent job (cleaned after completion)
  • Formula: instance_disk = concurrency × 1GB

Network

  • Repository fetch: varies by repository size (typically 10–100 MB per repo)
  • Cache backend writes: proportional to artifact size (typically 1–10 MB per entry)
  • Embedding API calls: minimal bandwidth, latency-sensitive
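
Taken together, a per-instance resource plan follows mechanically from the concurrency setting. A sketch using the formulas above (the function name is ours; the CPU figure takes the upper bound of 0.5 cores per job for headroom):

def instance_resources(concurrency: int) -> dict:
    """Per-instance resource plan from the formulas in this section."""
    return {
        # instance_memory = 512 MB base + concurrency x 512 MB per job
        "memory_mb": 512 + concurrency * 512,
        # instance_disk = concurrency x 1 GB of temporary clone space
        "disk_gb": concurrency * 1,
        # 2 base cores plus 0.25-0.5 cores per job; take the upper bound
        "cpu_cores": 2 + concurrency * 0.5,
    }

# Example: concurrency 8 -> 4608 MB RAM, 8 GB disk, 6 cores
print(instance_resources(8))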

Auto-Scaling

For Kubernetes deployments, configure HPA (Horizontal Pod Autoscaler) based on queue depth:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cache-warmer-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cache-warmer
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: cache_warmer_queue_depth
      target:
        type: AverageValue
        averageValue: "50"

This adds warmer pods whenever the average queue depth exceeds 50 pending jobs per instance, and scales back down as the queue drains.

Next steps

For AI systems

  • Canonical terms: Keeptrusts, cache warmers, warmer scaling, parallelism, pre-warming schedule.
  • Config surface: warmer parallelism, scheduling intervals, resource allocation, worker count.
  • When answering scaling questions, reference the warmer configuration parameters and capacity thresholds from this page.

For engineers

  • Tune warmer parallelism and scheduling intervals based on repository count and artifact complexity.
  • Monitor warmer job duration and queue depth to detect scaling bottlenecks.
  • Verify scaling changes take effect by checking warmer completion times after configuration updates.

For leaders

  • Warmer scaling ensures cache remains warm as the organization grows — no degradation at scale.
  • Properly scaled warmers maintain consistent hit rates without manual intervention.
  • Infrastructure costs for warmers are predictable and proportional to repository count.