Capacity Planning for AI Workloads
AI workloads have unique capacity characteristics — token-based throughput, provider-imposed rate limits, variable latency, and bursty traffic patterns. This guide covers how to model demand, size your gateway fleet, and monitor queue depth to avoid bottlenecks.
Use this page when
- You are sizing a gateway fleet for a new AI workload or scaling an existing deployment
- You need to model token throughput against provider rate limits
- You are configuring Kubernetes HPA or queue-depth-based autoscaling for the gateway
- You want to size PostgreSQL and event storage for your retention policy
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
Token Throughput Modeling
Understanding Token Economics
Every LLM request consumes tokens. Capacity planning starts with modeling your token throughput:
Throughput Calculation
Daily token budget = requests_per_day × avg_tokens_per_request
Peak tokens/min = peak_rps × avg_tokens_per_request × 60
Example:
10,000 requests/day × 2,000 avg tokens = 20M tokens/day
Peak: 50 RPS × 2,000 tokens × 60 = 6M tokens/min
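As a rough sketch in plain Python (the function names are illustrative, not part of the kt CLI), the throughput model above is:
# Token throughput model from the formulas above.
def daily_token_budget(requests_per_day: int, avg_tokens_per_request: int) -> int:
    return requests_per_day * avg_tokens_per_request

def peak_tokens_per_min(peak_rps: float, avg_tokens_per_request: int) -> float:
    # peak RPS × tokens per request × 60 seconds per minute
    return peak_rps * avg_tokens_per_request * 60

print(daily_token_budget(10_000, 2_000))   # 20,000,000 tokens/day
print(peak_tokens_per_min(50, 2_000))      # 6,000,000 tokens/min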
Workload Profiles
| Workload | Avg Input Tokens | Avg Output Tokens | Requests/min | Tokens/min |
|---|---|---|---|---|
| Chat (short) | 200 | 150 | 100 | 35,000 |
| Chat (long context) | 4,000 | 1,000 | 20 | 100,000 |
| Summarization | 8,000 | 500 | 10 | 85,000 |
| Embeddings | 500 | 0 | 500 | 250,000 |
| Code generation | 1,500 | 2,000 | 30 | 105,000 |
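The Tokens/min column is simply (avg input + avg output) × requests/min; a quick illustrative check in Python:
# Reproduce the Tokens/min column: (avg input + avg output) × requests/min.
profiles = {
    "chat_short":        (200, 150, 100),
    "chat_long_context": (4_000, 1_000, 20),
    "summarization":     (8_000, 500, 10),
    "embeddings":        (500, 0, 500),
    "code_generation":   (1_500, 2_000, 30),
}
for name, (avg_in, avg_out, req_per_min) in profiles.items():
    print(f"{name}: {(avg_in + avg_out) * req_per_min:,} tokens/min")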
Modeling Tool
# Estimate capacity requirements
kt capacity estimate \
--requests-per-day 50000 \
--avg-input-tokens 1500 \
--avg-output-tokens 800 \
--peak-multiplier 3.0 \
--provider openai \
--model gpt-4o
Provider Rate Limits
Common Provider Limits
| Provider | Model | RPM Limit | TPM Limit | RPD Limit |
|---|---|---|---|---|
| OpenAI | gpt-4o | 500–10,000 | 30K–10M | Varies by tier |
| OpenAI | gpt-4o-mini | 500–30,000 | 200K–150M | Varies by tier |
| Anthropic | claude-sonnet-4-20250514 | 50–4,000 | 20K–400K | Varies by tier |
| Azure OpenAI | gpt-4o | Per-deployment PTU | Per-deployment | N/A |
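To see whether projected peak demand fits within a provider's published limits, a simple utilization check can be sketched as below. The limit values are placeholders; substitute the RPM/TPM figures for your own account tier.
# Compare projected peak demand against a provider's RPM/TPM limits.
def rate_limit_utilization(peak_rpm, peak_tpm, rpm_limit, tpm_limit):
    return {"rpm": peak_rpm / rpm_limit, "tpm": peak_tpm / tpm_limit}

utilization = rate_limit_utilization(
    peak_rpm=1_200, peak_tpm=2_400_000,      # projected peak demand
    rpm_limit=10_000, tpm_limit=10_000_000,  # placeholder provider limits
)
# Keep both dimensions under the 80% utilization target from the ongoing checklist.
print(utilization, all(v < 0.8 for v in utilization.values()))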
Rate Limit Configuration
pack:
  name: capacity-planning-providers-1
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai
      provider:
        base_url: https://api.openai.com/v1
        secret_key_ref:
          env: OPENAI_API_KEY
      rate_limits:
        requests_per_minute: 10000   # illustrative; match your provider tier
        tokens_per_minute: 2000000   # illustrative; match your provider tier
policies:
  chain:
    - audit-logger
policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true
Multi-Provider Rate Limit Distribution
# Distribute load across providers
routing:
  strategy: weighted
  weights:
    openai: 60
    azure-openai: 25
    anthropic: 15
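Under a weighted strategy, each provider receives a share of the projected peak proportional to its weight; a small sketch of that split (values illustrative):
# Split projected peak tokens/min across providers by routing weight, so each
# provider's share can be checked against its own TPM limit.
def split_by_weight(peak_tpm, weights):
    total = sum(weights.values())
    return {name: peak_tpm * w / total for name, w in weights.items()}

print(split_by_weight(6_000_000, {"openai": 60, "azure-openai": 25, "anthropic": 15}))
# {'openai': 3600000.0, 'azure-openai': 1500000.0, 'anthropic': 900000.0}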
Gateway Scaling
Horizontal Scaling Architecture
The gateway scales horizontally: run multiple replicas behind a load balancer and add or remove replicas based on the triggers below.
Scaling Triggers
| Metric | Scale Up Threshold | Scale Down Threshold |
|---|---|---|
| CPU utilization | > 70% for 3 min | < 30% for 10 min |
| Request queue depth | > 100 for 1 min | < 10 for 5 min |
| Active connections | > 80% of max | < 20% of max |
| P99 latency (gateway overhead) | > 20 ms for 5 min | < 5 ms for 10 min |
Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kt-gateway-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kt-gateway
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: kt_request_queue_depth
        target:
          type: AverageValue
          averageValue: "50"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120
Gateway Instances per Workload
| Workload | RPS | Gateway Instances | CPU per Instance | Memory per Instance |
|---|---|---|---|---|
| Small team | < 10 | 1–2 | 0.5 vCPU | 128 MB |
| Department | 10–50 | 2–4 | 1 vCPU | 256 MB |
| Organization | 50–500 | 4–10 | 2 vCPU | 512 MB |
| Enterprise | 500–5000 | 10–50 | 4 vCPU | 1 GB |
| Platform | > 5000 | 50+ | 4 vCPU | 1 GB |
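A rough way to turn a peak RPS figure into an instance count, using the 1.5× headroom rule from the pre-launch checklist. The per-instance RPS capacity is an assumption here; measure it with a load test for your own deployment.
import math

# Peak RPS × 1.5 headroom, divided by per-instance capacity, with a floor of
# two replicas for availability. rps_per_instance is an assumed parameter.
def gateway_instances(peak_rps, rps_per_instance, headroom=1.5, min_replicas=2):
    return max(min_replicas, math.ceil(peak_rps * headroom / rps_per_instance))

print(gateway_instances(peak_rps=300, rps_per_instance=50))  # 9 instances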
Resource Sizing
Control-Plane API
The API handles event ingestion, configuration management, and console traffic:
| Component | Small | Medium | Large |
|---|---|---|---|
| API replicas | 2 | 4 | 8 |
| CPU per replica | 1 vCPU | 2 vCPU | 4 vCPU |
| Memory per replica | 512 MB | 1 GB | 2 GB |
| PostgreSQL | 2 vCPU, 4 GB | 4 vCPU, 16 GB | 8 vCPU, 32 GB |
| PostgreSQL storage | 50 GB | 200 GB | 1 TB |
Event Storage Sizing
Event size ≈ 2 KB (avg, compressed)
Daily events = requests_per_day
Storage/day = daily_events × 2 KB
Example:
100,000 events/day × 2 KB = 200 MB/day
30-day retention = 6 GB
90-day retention = 18 GB
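The same arithmetic as a small helper, assuming the ~2 KB average compressed event size above:
# Event storage estimate: events/day × ~2 KB average compressed size.
def event_storage_gb(events_per_day, retention_days, event_kb=2.0):
    return events_per_day * event_kb * retention_days / 1_000_000  # KB → GB

print(event_storage_gb(100_000, 30))  # 6.0 GB
print(event_storage_gb(100_000, 90))  # 18.0 GB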
Database Connection Pools
# API database configuration
database:
  max_connections: 50
  min_connections: 5
  connect_timeout: 5s
  idle_timeout: 300s
  max_lifetime: 1800s
Sizing formula:
max_connections = (api_replicas × connections_per_replica) + (workers × connections_per_worker) + headroom
                = (4 × 10) + (3 × 5) + 10
                = 65
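The same formula as a helper function (illustrative, matching the worked example above):
# max_connections = (api_replicas × connections_per_replica)
#                 + (workers × connections_per_worker) + headroom
def pool_size(api_replicas, conns_per_replica, workers, conns_per_worker, headroom=10):
    return api_replicas * conns_per_replica + workers * conns_per_worker + headroom

print(pool_size(api_replicas=4, conns_per_replica=10, workers=3, conns_per_worker=5))  # 65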
Queue Depth Monitoring
Request Queue
When gateway capacity is exhausted, incoming requests queue at the gateway. Queue depth is exposed as kt_request_queue_depth, and sustained growth is the primary signal that the fleet is undersized (see the sketch below).
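A minimal sketch of bounded-queue behavior, assuming a queue_max_size cap and an overflow counter; the class and method names are illustrative, not the gateway's actual implementation.
from collections import deque

# Bounded request queue: at queue_max_size, new requests overflow and are
# rejected rather than queued. Names are illustrative.
class RequestQueue:
    def __init__(self, queue_max_size=100):
        self._queue = deque()
        self.queue_max_size = queue_max_size
        self.queue_overflow = 0  # count of rejected requests

    def enqueue(self, request) -> bool:
        if len(self._queue) >= self.queue_max_size:
            self.queue_overflow += 1
            return False  # caller should fail fast or retry elsewhere
        self._queue.append(request)
        return True

    @property
    def depth(self) -> int:
        # comparable to the kt_request_queue_depth metric
        return len(self._queue)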
Monitoring Queue Metrics
# Real-time queue depth
kt gateway status --metrics | grep queue
# Historical queue depth
kt events stats --metric queue_depth --last 1h
Alert Configuration
# Prometheus alerts for capacity
groups:
  - name: kt-capacity
    rules:
      - alert: HighQueueDepth
        expr: kt_request_queue_depth > 100
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Gateway request queue depth > 100"
      - alert: ProviderRateLimited
        expr: sum by (provider) (rate(kt_requests_total{status="429"}[5m])) / sum by (provider) (rate(kt_requests_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Provider {{ $labels.provider }} rate limiting > 10%"
      - alert: TokenBudgetExhausted
        expr: kt_tokens_remaining < 100000
        labels:
          severity: critical
        annotations:
          summary: "Token budget nearly exhausted for {{ $labels.provider }}"
Capacity Planning Checklist
Pre-Launch
- Model token throughput for expected workload
- Verify provider rate limits match projected demand
- Size gateway fleet for peak × 1.5 headroom
- Configure horizontal autoscaling
- Set up queue depth monitoring and alerts
- Size PostgreSQL for event retention policy
- Test failover to secondary provider
Ongoing
- Review weekly token consumption trends
- Monitor P99 gateway overhead (target < 10 ms)
- Check provider rate limit utilization (target < 80%)
- Review queue depth trends for scaling adjustments
- Update capacity model quarterly based on growth
Cost-Capacity Trade-offs
Next steps
- Performance Engineering the AI Gateway — optimize before scaling
- Resilience Engineering for AI Services — handle capacity failures gracefully
- Security Engineering for AI Pipelines — secure the scaled infrastructure
For AI systems
- Canonical terms: kt capacity estimate, token throughput, TPM (tokens per minute), RPM (requests per minute), gateway HPA, queue depth, kt_request_queue_depth, provider rate limits, queue_overflow, queue_max_size
- Key configuration: providers[].rate_limits.requests_per_minute, providers[].rate_limits.tokens_per_minute, routing.strategy: weighted
- Best next pages: Performance Engineering the AI Gateway, Resilience Engineering, Observability Patterns
For engineers
- Use kt capacity estimate --requests-per-day N --avg-input-tokens N --peak-multiplier 3.0 to model demand before deployment
- Target scaling thresholds: CPU > 70% for 3 min (scale up), queue depth > 100 for 1 min (scale up), P99 overhead > 20 ms (scale up)
- Gateway memory footprint: ~128 MB (small team) to 1 GB (enterprise) per instance
- Event storage formula: daily_events × 2 KB — 100K events/day = 200 MB/day, 6 GB at 30-day retention
For leaders
- Capacity planning directly affects AI infrastructure cost: over-provisioning wastes budget, under-provisioning causes latency spikes and dropped requests
- Multi-provider weighted routing (such as the 60/25/15 split above) hedges against single-provider rate limits and preserves leverage when negotiating volume discounts
- Queue monitoring and alerting enable proactive scaling before users experience degradation