
Capacity Planning for AI Workloads

AI workloads have unique capacity characteristics — token-based throughput, provider-imposed rate limits, variable latency, and bursty traffic patterns. This guide covers how to model demand, size your gateway fleet, and monitor queue depth to avoid bottlenecks.

Use this page when

  • You are sizing a gateway fleet for a new AI workload or scaling an existing deployment
  • You need to model token throughput against provider rate limits
  • You are configuring Kubernetes HPA or queue-depth-based autoscaling for the gateway
  • You want to size PostgreSQL and event storage for your retention policy

Primary audience

  • Primary: Technical Engineers
  • Secondary: AI Agents, Technical Leaders

Token Throughput Modeling

Understanding Token Economics

Every LLM request consumes tokens. Capacity planning starts with modeling your token throughput:

Throughput Calculation

Daily token budget = requests_per_day × avg_tokens_per_request
Peak tokens/min = peak_rps × avg_tokens_per_request × 60

Example:
10,000 requests/day × 2,000 avg tokens = 20M tokens/day
Peak: 50 RPS × 2,000 tokens × 60 = 6M tokens/min
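The same arithmetic is easy to script when you want to sweep several workload shapes. Below is a minimal Python sketch of the two formulas above, using the worked example's numbers (not recommendations):

# Token-throughput model from the formulas above.
def daily_token_budget(requests_per_day: int, avg_tokens_per_request: int) -> int:
    """Daily token budget = requests_per_day x avg_tokens_per_request."""
    return requests_per_day * avg_tokens_per_request

def peak_tokens_per_min(peak_rps: float, avg_tokens_per_request: int) -> float:
    """Peak tokens/min = peak_rps x avg_tokens_per_request x 60."""
    return peak_rps * avg_tokens_per_request * 60

print(daily_token_budget(10_000, 2_000))  # 20,000,000 tokens/day
print(peak_tokens_per_min(50, 2_000))     # 6,000,000 tokens/min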

Workload Profiles

Workload            | Avg Input Tokens | Avg Output Tokens | Requests/min | Tokens/min
Chat (short)        | 200              | 150               | 100          | 35,000
Chat (long context) | 4,000            | 1,000             | 20           | 100,000
Summarization       | 8,000            | 500               | 10           | 85,000
Embeddings          | 500              | 0                 | 500          | 250,000
Code generation     | 1,500            | 2,000             | 30           | 105,000

Modeling Tool

# Estimate capacity requirements
kt capacity estimate \
  --requests-per-day 50000 \
  --avg-input-tokens 1500 \
  --avg-output-tokens 800 \
  --peak-multiplier 3.0 \
  --provider openai \
  --model gpt-4o

Provider Rate Limits

Common Provider Limits

Provider     | Model                    | RPM Limit          | TPM Limit      | RPD Limit
OpenAI       | gpt-4o                   | 500–10,000         | 30K–10M        | Varies by tier
OpenAI       | gpt-4o-mini              | 500–30,000         | 200K–150M      | Varies by tier
Anthropic    | claude-sonnet-4-20250514 | 50–4,000           | 20K–400K       | Varies by tier
Azure OpenAI | gpt-4o                   | Per-deployment PTU | Per-deployment | N/A
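Before launch, compare projected peak demand against your tier's actual ceilings; the ongoing checklist later in this guide targets under 80% utilization. A small sketch, with placeholder limits standing in for the real values from your provider dashboard:

# Compare projected peak demand against provider rate limits.
# The limit values below are placeholders; read your tier's real
# RPM/TPM ceilings from the provider dashboard.
def utilization(peak_rpm: float, peak_tpm: float,
                rpm_limit: float, tpm_limit: float) -> dict:
    """Fraction of each limit consumed at peak; > 0.8 means not enough headroom."""
    return {"RPM": peak_rpm / rpm_limit, "TPM": peak_tpm / tpm_limit}

for name, frac in utilization(
    peak_rpm=50 * 60,       # 50 RPS peak from the worked example above
    peak_tpm=6_000_000,     # peak tokens/min from the worked example
    rpm_limit=10_000,       # placeholder tier limit
    tpm_limit=10_000_000,   # placeholder tier limit
).items():
    flag = "OK" if frac < 0.8 else "insufficient headroom"
    print(f"{name}: {frac:.0%} of limit ({flag})")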

Rate Limit Configuration

Per-target limits use the providers[].rate_limits keys listed under "For AI systems" below; the values shown here are illustrative.

pack:
  name: capacity-planning-providers-1
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai
      provider:
        base_url: https://api.openai.com/v1
        secret_key_ref:
          env: OPENAI_API_KEY
      rate_limits:
        requests_per_minute: 5000   # example values; set to your tier's limits
        tokens_per_minute: 800000
policies:
  chain:
    - audit-logger
policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true

Multi-Provider Rate Limit Distribution

# Distribute load across providers
routing:
  strategy: weighted
  weights:
    openai: 60
    azure-openai: 25
    anthropic: 15
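The effect of the weighted strategy is easy to model: each request is routed to a provider in proportion to its weight. A minimal sketch of proportional selection (not necessarily the gateway's actual algorithm):

import random
from collections import Counter

# Weights from the routing config above; any positive values work,
# since selection is proportional.
WEIGHTS = {"openai": 60, "azure-openai": 25, "anthropic": 15}

def pick_provider() -> str:
    """Choose a provider with probability proportional to its weight."""
    return random.choices(list(WEIGHTS), weights=list(WEIGHTS.values()), k=1)[0]

# Over many requests the split converges on roughly 60/25/15.
print(Counter(pick_provider() for _ in range(10_000)))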

Gateway Scaling

Horizontal Scaling Architecture

The gateway scales horizontally: replicas are added or removed based on the triggers below and the HPA configuration that follows.

Scaling Triggers

Metric                         | Scale Up Threshold | Scale Down Threshold
CPU utilization                | > 70% for 3 min    | < 30% for 10 min
Request queue depth            | > 100 for 1 min    | < 10 for 5 min
Active connections             | > 80% of max       | < 20% of max
P99 latency (gateway overhead) | > 20 ms for 5 min  | < 5 ms for 10 min

Kubernetes HPA

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kt-gateway-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kt-gateway
  minReplicas: 2
  maxReplicas: 20
  metrics:
    # Scale on CPU and on the gateway's own queue-depth metric
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Pods
      pods:
        metric:
          name: kt_request_queue_depth
        target:
          type: AverageValue
          averageValue: "50"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # react to load spikes within a minute
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # shed capacity slowly to avoid flapping
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120

Gateway Instances per Workload

Workload     | RPS       | Gateway Instances | CPU per Instance | Memory per Instance
Small team   | < 10      | 1–2               | 0.5 vCPU         | 128 MB
Department   | 10–50     | 2–4               | 1 vCPU           | 256 MB
Organization | 50–500    | 4–10              | 2 vCPU           | 512 MB
Enterprise   | 500–5,000 | 10–50             | 4 vCPU           | 1 GB
Platform     | > 5,000   | 50+               | 4 vCPU           | 1 GB

Resource Sizing

Control-Plane API

The API handles event ingestion, configuration management, and console traffic:

Component          | Small         | Medium        | Large
API replicas       | 2             | 4             | 8
CPU per replica    | 1 vCPU        | 2 vCPU        | 4 vCPU
Memory per replica | 512 MB        | 1 GB          | 2 GB
PostgreSQL         | 2 vCPU, 4 GB  | 4 vCPU, 16 GB | 8 vCPU, 32 GB
PostgreSQL storage | 50 GB         | 200 GB        | 1 TB

Event Storage Sizing

Event size ≈ 2 KB (avg, compressed)
Daily events = requests_per_day
Storage/day = daily_events × 2 KB

Example:
100,000 events/day × 2 KB = 200 MB/day
30-day retention = 6 GB
90-day retention = 18 GB
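Since storage grows linearly with traffic and retention, the sizing reduces to one function. A sketch using the ~2 KB average event size assumed above:

# Event storage sizing: daily_events x avg_event_size x retention_days.
AVG_EVENT_KB = 2  # average compressed event size assumed above

def storage_gb(daily_events: int, retention_days: int) -> float:
    """Storage in GB (decimal) for the given retention window."""
    return daily_events * AVG_EVENT_KB * retention_days / 1_000_000

print(storage_gb(100_000, 30))  # 6.0
print(storage_gb(100_000, 90))  # 18.0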

Database Connection Pools

# API database configuration
database:
  max_connections: 50
  min_connections: 5
  connect_timeout: 5s
  idle_timeout: 300s
  max_lifetime: 1800s

Sizing formula:

max_connections = (api_replicas × connections_per_replica) + worker_connections + headroom
= (4 × 10) + (3 × 5) + 10
= 65
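The same formula expressed as a function, so each input is explicit (the replica, worker, and connection counts below are the example's values, not defaults):

# max_connections = (api_replicas x connections_per_replica)
#                   + worker_connections + headroom
def pool_size(api_replicas: int, conns_per_replica: int,
              workers: int, conns_per_worker: int, headroom: int) -> int:
    return (api_replicas * conns_per_replica
            + workers * conns_per_worker
            + headroom)

print(pool_size(api_replicas=4, conns_per_replica=10,
                workers=3, conns_per_worker=5, headroom=10))  # 65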

Queue Depth Monitoring

Request Queue

When gateway capacity is exhausted, incoming requests queue at the gateway; sustained queue growth is the earliest signal that the fleet needs to scale.

Monitoring Queue Metrics

# Real-time queue depth
kt gateway status --metrics | grep queue

# Historical queue depth
kt events stats --metric queue_depth --last 1h

Alert Configuration

# Prometheus alerts for capacity
groups:
  - name: kt-capacity
    rules:
      - alert: HighQueueDepth
        expr: kt_request_queue_depth > 100
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Gateway request queue depth > 100"

      - alert: ProviderRateLimited
        expr: rate(kt_requests_total{status="429"}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Provider {{ $labels.provider }} returning 429s at > 0.1/s"

      - alert: TokenBudgetExhausted
        expr: kt_tokens_remaining < 100000
        labels:
          severity: critical
        annotations:
          summary: "Token budget nearly exhausted for {{ $labels.provider }}"

Capacity Planning Checklist

Pre-Launch

  • Model token throughput for expected workload
  • Verify provider rate limits match projected demand
  • Size gateway fleet for peak × 1.5 headroom
  • Configure horizontal autoscaling
  • Set up queue depth monitoring and alerts
  • Size PostgreSQL for event retention policy
  • Test failover to secondary provider

Ongoing

  • Review weekly token consumption trends
  • Monitor P99 gateway overhead (target < 10 ms)
  • Check provider rate limit utilization (target < 80%)
  • Review queue depth trends for scaling adjustments
  • Update capacity model quarterly based on growth


Next steps

For AI systems

  • Canonical terms: kt capacity estimate, token throughput, TPM (tokens per minute), RPM (requests per minute), gateway HPA, queue depth, kt_request_queue_depth, provider rate limits, queue_overflow, queue_max_size
  • Key configuration: providers[].rate_limits.requests_per_minute, providers[].rate_limits.tokens_per_minute, routing.strategy: weighted
  • Best next pages: Performance Engineering the AI Gateway, Resilience Engineering, Observability Patterns

For engineers

  • Use kt capacity estimate --requests-per-day N --avg-input-tokens N --peak-multiplier 3.0 to model demand before deployment
  • Target scaling thresholds: CPU > 70% for 3 min (scale up), queue depth > 100 for 1 min (scale up), P99 overhead > 20 ms (scale up)
  • Gateway memory footprint: ~128 MB (small team) to 1 GB (enterprise) per instance
  • Event storage formula: daily_events × 2 KB — 100K events/day = 200 MB/day, 6 GB at 30-day retention

For leaders

  • Capacity planning directly affects AI infrastructure cost: over-provisioning wastes budget, under-provisioning causes latency spikes and dropped requests
  • Multi-provider weighted routing (60/25/15 splits) hedges against single-provider rate limits and strengthens your position in volume-discount negotiations
  • Queue monitoring and alerting enable proactive scaling before users experience degradation