Capacity Planning for AI Workloads
AI workloads have unique capacity characteristics — token-based throughput, provider-imposed rate limits, variable latency, and bursty traffic patterns. This guide covers how to model demand, size your gateway fleet, and monitor queue depth to avoid bottlenecks.
Use this page when
- You are sizing a gateway fleet for a new AI workload or scaling an existing deployment
- You need to model token throughput against provider rate limits
- You are configuring Kubernetes HPA or queue-depth-based autoscaling for the gateway
- You want to size PostgreSQL and event storage for your retention policy
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
Token Throughput Modeling
Understanding Token Economics
Every LLM request consumes tokens. Capacity planning starts with modeling your token throughput:
Throughput Calculation
Daily token budget = requests_per_day × avg_tokens_per_request
Peak tokens/min = peak_rps × avg_tokens_per_request × 60
Example:
10,000 requests/day × 2,000 avg tokens = 20M tokens/day
Peak: 50 RPS × 2,000 tokens × 60 = 6M tokens/min
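As a rough sketch in plain Python (the function names are illustrative, not part of the kt CLI), the throughput model above is:
# Token throughput model from the formulas above.
def daily_token_budget(requests_per_day: int, avg_tokens_per_request: int) -> int:
    return requests_per_day * avg_tokens_per_request

def peak_tokens_per_min(peak_rps: float, avg_tokens_per_request: int) -> float:
    # peak RPS × tokens per request × 60 seconds per minute
    return peak_rps * avg_tokens_per_request * 60

print(daily_token_budget(10_000, 2_000))   # 20,000,000 tokens/day
print(peak_tokens_per_min(50, 2_000))      # 6,000,000 tokens/min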
Workload Profiles
| Workload | Avg Input Tokens | Avg Output Tokens | Requests/min | Tokens/min |
|---|---|---|---|---|
| Chat (short) | 200 | 150 | 100 | 35,000 |
| Chat (long context) | 4,000 | 1,000 | 20 | 100,000 |
| Summarization | 8,000 | 500 | 10 | 85,000 |
| Embeddings | 500 | 0 | 500 | 250,000 |
| Code generation | 1,500 | 2,000 | 30 | 105,000 |
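The Tokens/min column is simply (avg input + avg output) × requests/min; a quick illustrative check in Python:
# Reproduce the Tokens/min column: (avg input + avg output) × requests/min.
profiles = {
    "chat_short":        (200, 150, 100),
    "chat_long_context": (4_000, 1_000, 20),
    "summarization":     (8_000, 500, 10),
    "embeddings":        (500, 0, 500),
    "code_generation":   (1_500, 2_000, 30),
}
for name, (avg_in, avg_out, req_per_min) in profiles.items():
    print(f"{name}: {(avg_in + avg_out) * req_per_min:,} tokens/min")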
Modeling Tool
# Estimate capacity requirements
kt capacity estimate \
--requests-per-day 50000 \
--avg-input-tokens 1500 \
--avg-output-tokens 800 \
--peak-multiplier 3.0 \
--provider openai \
--model gpt-4o
Provider Rate Limits
Common Provider Limits
| Provider | Model | RPM Limit | TPM Limit | RPD Limit |
|---|---|---|---|---|
| OpenAI | gpt-4o | 500–10,000 | 30K–10M | Varies by tier |
| OpenAI | gpt-4o-mini | 500–30,000 | 200K–150M | Varies by tier |
| Anthropic | claude-sonnet-4-20250514 | 50–4,000 | 20K–400K | Varies by tier |
| Azure OpenAI | gpt-4o | Per-deployment PTU | Per-deployment | N/A |
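To see whether projected peak demand fits within a provider's published limits, a simple utilization check can be sketched as below. The limit values are placeholders; substitute the RPM/TPM figures for your own account tier.
# Compare projected peak demand against a provider's RPM/TPM limits.
def rate_limit_utilization(peak_rpm, peak_tpm, rpm_limit, tpm_limit):
    return {"rpm": peak_rpm / rpm_limit, "tpm": peak_tpm / tpm_limit}

utilization = rate_limit_utilization(
    peak_rpm=1_200, peak_tpm=2_400_000,      # projected peak demand
    rpm_limit=10_000, tpm_limit=10_000_000,  # placeholder provider limits
)
# Keep both dimensions under the 80% utilization target from the ongoing checklist.
print(utilization, all(v < 0.8 for v in utilization.values()))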
Rate Limit Configuration
pack:
  name: capacity-planning-providers-1
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai
      provider:
        base_url: https://api.openai.com/v1
        secret_key_ref:
          env: OPENAI_API_KEY
      rate_limits:
        requests_per_minute: 10000   # illustrative; match your provider tier
        tokens_per_minute: 2000000   # illustrative; match your provider tier
policies:
  chain:
    - audit-logger
policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true
Multi-Provider Rate Limit Distribution
# Distribute load across providers
routing:
  strategy: weighted
  weights:
    openai: 60
    azure-openai: 25
    anthropic: 15
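Under a weighted strategy, each provider receives a share of the projected peak proportional to its weight; a small sketch of that split (values illustrative):
# Split projected peak tokens/min across providers by routing weight, so each
# provider's share can be checked against its own TPM limit.
def split_by_weight(peak_tpm, weights):
    total = sum(weights.values())
    return {name: peak_tpm * w / total for name, w in weights.items()}

print(split_by_weight(6_000_000, {"openai": 60, "azure-openai": 25, "anthropic": 15}))
# {'openai': 3600000.0, 'azure-openai': 1500000.0, 'anthropic': 900000.0}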
Gateway Scaling
Horizontal Scaling Architecture
The gateway scales horizontally: run multiple replicas behind a load balancer and add or remove replicas based on the triggers below.
Scaling Triggers
| Metric | Scale Up Threshold | Scale Down Threshold |
|---|---|---|
| CPU utilization | > 70% for 3 min | < 30% for 10 min |
| Request queue depth | > 100 for 1 min | < 10 for 5 min |
| Active connections | > 80% of max | < 20% of max |
| P99 latency (gateway overhead) | > 20 ms for 5 min | < 5 ms for 10 min |
Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kt-gateway-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kt-gateway
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: kt_request_queue_depth
        target:
          type: AverageValue
          averageValue: "50"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120
Gateway Instances per Workload
| Workload | RPS | Gateway Instances | CPU per Instance | Memory per Instance |
|---|---|---|---|---|
| Small team | < 10 | 1–2 | 0.5 vCPU | 128 MB |
| Department | 10–50 | 2–4 | 1 vCPU | 256 MB |
| Organization | 50–500 | 4–10 | 2 vCPU | 512 MB |
| Enterprise | 500–5000 | 10–50 | 4 vCPU | 1 GB |
| Platform | > 5000 | 50+ | 4 vCPU | 1 GB |
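A rough way to turn a peak RPS figure into an instance count, using the 1.5× headroom rule from the pre-launch checklist. The per-instance RPS capacity is an assumption here; measure it with a load test for your own deployment.
import math

# Peak RPS × 1.5 headroom, divided by per-instance capacity, with a floor of
# two replicas for availability. rps_per_instance is an assumed parameter.
def gateway_instances(peak_rps, rps_per_instance, headroom=1.5, min_replicas=2):
    return max(min_replicas, math.ceil(peak_rps * headroom / rps_per_instance))

print(gateway_instances(peak_rps=300, rps_per_instance=50))  # 9 instances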
Resource Sizing
Control-Plane API
The API handles event ingestion, configuration management, and console traffic:
| Component | Small | Medium | Large |
|---|---|---|---|
| API replicas | 2 | 4 | 8 |
| CPU per replica | 1 vCPU | 2 vCPU | 4 vCPU |
| Memory per replica | 512 MB | 1 GB | 2 GB |
| PostgreSQL | 2 vCPU, 4 GB | 4 vCPU, 16 GB | 8 vCPU, 32 GB |
| PostgreSQL storage | 50 GB | 200 GB | 1 TB |
Event Storage Sizing
Event size ≈ 2 KB (avg, compressed)
Daily events = requests_per_day
Storage/day = daily_events × 2 KB
Example:
100,000 events/day × 2 KB = 200 MB/day
30-day retention = 6 GB
90-day retention = 18 GB
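The same arithmetic as a small helper, assuming the ~2 KB average compressed event size above:
# Event storage estimate: events/day × ~2 KB average compressed size.
def event_storage_gb(events_per_day, retention_days, event_kb=2.0):
    return events_per_day * event_kb * retention_days / 1_000_000  # KB → GB

print(event_storage_gb(100_000, 30))  # 6.0 GB
print(event_storage_gb(100_000, 90))  # 18.0 GB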
Database Connection Pools
# API database configuration
database:
  max_connections: 50
  min_connections: 5
  connect_timeout: 5s
  idle_timeout: 300s
  max_lifetime: 1800s
Sizing formula:
max_connections = (api_replicas × connections_per_replica) + (workers × connections_per_worker) + headroom
                = (4 × 10) + (3 × 5) + 10
                = 65
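The same formula as a helper function (illustrative, matching the worked example above):
# max_connections = (api_replicas × connections_per_replica)
#                 + (workers × connections_per_worker) + headroom
def pool_size(api_replicas, conns_per_replica, workers, conns_per_worker, headroom=10):
    return api_replicas * conns_per_replica + workers * conns_per_worker + headroom

print(pool_size(api_replicas=4, conns_per_replica=10, workers=3, conns_per_worker=5))  # 65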
Queue Depth Monitoring
Request Queue
When gateway capacity is exhausted, incoming requests queue at the gateway. Queue depth is exposed as kt_request_queue_depth, and sustained growth is the primary signal that the fleet is undersized (see the sketch below).
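A minimal sketch of bounded-queue behavior, assuming a queue_max_size cap and an overflow counter; the class and method names are illustrative, not the gateway's actual implementation.
from collections import deque

# Bounded request queue: at queue_max_size, new requests overflow and are
# rejected rather than queued. Names are illustrative.
class RequestQueue:
    def __init__(self, queue_max_size=100):
        self._queue = deque()
        self.queue_max_size = queue_max_size
        self.queue_overflow = 0  # count of rejected requests

    def enqueue(self, request) -> bool:
        if len(self._queue) >= self.queue_max_size:
            self.queue_overflow += 1
            return False  # caller should fail fast or retry elsewhere
        self._queue.append(request)
        return True

    @property
    def depth(self) -> int:
        # comparable to the kt_request_queue_depth metric
        return len(self._queue)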
Monitoring Queue Metrics
# Real-time queue depth
kt gateway status --metrics | grep queue
# Historical queue depth
kt events stats --metric queue_depth --last 1h
Alert Configuration
# Prometheus alerts for capacity
groups:
  - name: kt-capacity
    rules:
      - alert: HighQueueDepth
        expr: kt_request_queue_depth > 100
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Gateway request queue depth > 100"
      - alert: ProviderRateLimited
        expr: sum by (provider) (rate(kt_requests_total{status="429"}[5m])) / sum by (provider) (rate(kt_requests_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Provider {{ $labels.provider }} rate limiting > 10%"
      - alert: TokenBudgetExhausted
        expr: kt_tokens_remaining < 100000
        labels:
          severity: critical
        annotations:
          summary: "Token budget nearly exhausted for {{ $labels.provider }}"
Capacity Planning Checklist
Pre-Launch
- Model token throughput for expected workload
- Verify provider rate limits match projected demand
- Size gateway fleet for peak × 1.5 headroom
- Configure horizontal autoscaling
- Set up queue depth monitoring and alerts
- Size PostgreSQL for event retention policy
- Test failover to secondary provider
Ongoing
- Review weekly token consumption trends
- Monitor P99 gateway overhead (target < 10 ms)
- Check provider rate limit utilization (target < 80%)
- Review queue depth trends for scaling adjustments
- Update capacity model quarterly based on growth
Cost-Capacity Trade-offs
Next steps
- Performance Engineering the AI Gateway — optimize before scaling
- Resilience Engineering for AI Services — handle capacity failures gracefully
- Security Engineering for AI Pipelines — secure the scaled infrastructure
For AI systems
- Canonical terms: kt capacity estimate, token throughput, TPM (tokens per minute), RPM (requests per minute), gateway HPA, queue depth, kt_request_queue_depth, provider rate limits, queue_overflow, queue_max_size
- Key configuration: providers[].rate_limits.requests_per_minute, providers[].rate_limits.tokens_per_minute, routing.strategy: weighted
- Best next pages: Performance Engineering the AI Gateway, Resilience Engineering, Observability Patterns
For engineers
- Use kt capacity estimate --requests-per-day N --avg-input-tokens N --peak-multiplier 3.0 to model demand before deployment
- Target scaling thresholds: CPU > 70% for 3 min (scale up), queue depth > 100 for 1 min (scale up), P99 overhead > 20 ms (scale up)
- Gateway memory footprint: ~128 MB (small team) to 1 GB (enterprise) per instance
- Event storage formula: daily_events × 2 KB — 100K events/day = 200 MB/day, 6 GB at 30-day retention
For leaders
- Capacity planning directly affects AI infrastructure cost: over-provisioning wastes budget, under-provisioning causes latency spikes and dropped requests
- Multi-provider weighted routing (such as the 60/25/15 split above) hedges against single-provider rate limits and preserves leverage when negotiating volume discounts
- Queue monitoring and alerting enable proactive scaling before users experience degradation