Load Balancing AI Gateway Traffic
Running multiple Keeptrusts gateway instances behind a load balancer provides high availability and horizontal throughput scaling. This guide covers configuration for nginx, HAProxy, and AWS ALB, with attention to health checks, session affinity, and streaming response handling.
Use this page when
- You are running multiple Keeptrusts gateway instances and need load distribution.
- You need to configure nginx, HAProxy, or AWS ALB with health checks for gateway backends.
- Streaming LLM responses require special load balancer configuration (chunked transfer, timeouts).
- You want to verify round-robin distribution and streaming passthrough work correctly.
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
Architecture Overview
┌─────────────────┐
│ Load Balancer │
│ (443 / TLS) │
└────────┬────────┘
┌──────────────┼──────────────┐
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Gateway-1 │ │ Gateway-2 │ │ Gateway-3 │
│ :41002 │ │ :41002 │ │ :41002 │
└─────┬──────┘ └─────┬──────┘ └─────┬──────┘
└────────────────┼────────────────┘
▼
┌────────────────┐
│ API Server │
│ :8080 │
└────────────────┘
Each gateway instance is stateless — it loads its policy configuration at startup and forwards decision events to the API. No session state is stored locally, making round-robin the default balancing strategy.
Health Check Endpoints
The gateway exposes a health endpoint for load balancer probes:
# Liveness — process is running
curl -s http://localhost:41002/health
# Returns: 200 OK (JSON body includes instance_id)
# Readiness — configuration loaded, upstream reachable
curl -s http://localhost:41002/health/ready
# Returns: 200 OK or 503 Service Unavailable
Configure your load balancer to use /health/ready for routing decisions and /health for restart decisions.
nginx Configuration
Basic Round-Robin
upstream keeptrusts_gateways {
server 10.0.0.10:41002;
server 10.0.0.11:41002;
server 10.0.0.12:41002;
keepalive 64;
}
server {
listen 443 ssl http2;
server_name gateway.example.com;
ssl_certificate /etc/ssl/gateway.crt;
ssl_certificate_key /etc/ssl/gateway.key;
# Streaming support — critical for LLM responses
proxy_buffering off;
proxy_cache off;
chunked_transfer_encoding on;
# Timeouts for long-running LLM requests
proxy_connect_timeout 10s;
proxy_read_timeout 300s; # LLM responses can take minutes
proxy_send_timeout 60s;
location / {
proxy_pass http://keeptrusts_gateways;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
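Round-robin spreads requests evenly by count, but LLM request durations vary widely; a few long streaming requests can pile up on one instance. If that happens, `least_conn` balances on active connections instead. A sketch of the alternative upstream block, reusing the illustrative addresses above:

```nginx
upstream keeptrusts_gateways {
    least_conn;  # pick the backend with the fewest active connections
    server 10.0.0.10:41002;
    server 10.0.0.11:41002;
    server 10.0.0.12:41002;
    keepalive 64;
}
```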
Weighted Distribution
Route more traffic to higher-capacity nodes:
upstream keeptrusts_gateways {
server 10.0.0.10:41002 weight=3; # 8 vCPU
server 10.0.0.11:41002 weight=2; # 4 vCPU
server 10.0.0.12:41002 weight=1; # 2 vCPU
keepalive 64;
}
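Weights translate directly into request share: over a full rotation each backend receives weight/total of the traffic. A quick check of the shares implied by the weights above:

```shell
# Share per backend = weight / total_weight (weights 3, 2, 1 from the example)
total=$((3 + 2 + 1))
for w in 3 2 1; do
  share=$((100 * w / total))
  echo "weight=$w -> ${share}% of requests"
done
```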
Active Health Checks (nginx Plus)
upstream keeptrusts_gateways {
zone keeptrusts 64k;
server 10.0.0.10:41002;
server 10.0.0.11:41002;
server 10.0.0.12:41002;
health_check interval=5s fails=3 passes=2 uri=/health/ready;
}
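Open-source nginx lacks the `health_check` directive, but it does passive health checking out of the box: `max_fails` and `fail_timeout` take a server out of rotation after consecutive failed proxy attempts. A sketch:

```nginx
upstream keeptrusts_gateways {
    # After 3 failed attempts, skip this server for 10s (passive health check)
    server 10.0.0.10:41002 max_fails=3 fail_timeout=10s;
    server 10.0.0.11:41002 max_fails=3 fail_timeout=10s;
    server 10.0.0.12:41002 max_fails=3 fail_timeout=10s;
    keepalive 64;
}
```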
HAProxy Configuration
global
maxconn 4096
log stdout format raw local0
defaults
mode http
timeout connect 10s
timeout client 300s
timeout server 300s
option httplog
option dontlognull
frontend gateway_https
bind *:443 ssl crt /etc/ssl/gateway.pem
default_backend keeptrusts_gateways
backend keeptrusts_gateways
balance roundrobin
option httpchk GET /health/ready
http-check expect status 200
# HAProxy streams responses by default; keep request buffering off as well
no option http-buffer-request
server gw1 10.0.0.10:41002 check inter 5s fall 3 rise 2
server gw2 10.0.0.11:41002 check inter 5s fall 3 rise 2
server gw3 10.0.0.12:41002 check inter 5s fall 3 rise 2
# Stats page for monitoring
frontend stats
bind *:8404
stats enable
stats uri /stats
stats refresh 10s
stats admin if LOCALHOST
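The `check inter 5s fall 3 rise 2` parameters above implement hysteresis: a backend is marked DOWN only after 3 consecutive failed probes and recovers only after 2 consecutive passes, so a single flaky probe never flips routing. A minimal model of that state machine (the probe results are made up):

```shell
# Model of "fall 3 rise 2": DOWN after 3 consecutive failures, UP after 2 passes
state=UP fails=0 passes=0
for r in ok ok fail fail fail ok ok fail; do
  if [ "$r" = fail ]; then
    passes=0
    fails=$((fails + 1))
    # Flip to DOWN only once the fall threshold is reached
    if [ "$state" = UP ] && [ "$fails" -ge 3 ]; then state=DOWN; fi
  else
    fails=0
    passes=$((passes + 1))
    # Flip back to UP only once the rise threshold is reached
    if [ "$state" = DOWN ] && [ "$passes" -ge 2 ]; then state=UP; fi
  fi
  echo "probe=$r state=$state"
done
```

Note how the trailing single `fail` leaves the backend UP: one bad probe resets the pass counter but does not trigger a state change.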
AWS Application Load Balancer (ALB)
Target Group Configuration
# Create target group with health check
aws elbv2 create-target-group \
--name keeptrusts-gateways \
--protocol HTTP \
--port 41002 \
--vpc-id vpc-0abc123 \
--health-check-protocol HTTP \
--health-check-path /health/ready \
--health-check-interval-seconds 10 \
--healthy-threshold-count 2 \
--unhealthy-threshold-count 3 \
--target-type instance
# Register gateway instances
aws elbv2 register-targets \
--target-group-arn arn:aws:elasticloadbalancing:...:targetgroup/keeptrusts-gateways/... \
--targets Id=i-0abc123 Id=i-0def456 Id=i-0ghi789
Listener Configuration
# HTTPS listener with TLS termination
aws elbv2 create-listener \
--load-balancer-arn arn:aws:elasticloadbalancing:...:loadbalancer/app/keeptrusts-lb/... \
--protocol HTTPS \
--port 443 \
--certificates CertificateArn=arn:aws:acm:...:certificate/... \
--default-actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:...:targetgroup/keeptrusts-gateways/...
Streaming Support
ALB supports chunked transfer encoding by default. Ensure idle timeout accommodates long LLM responses:
aws elbv2 modify-load-balancer-attributes \
--load-balancer-arn arn:aws:elasticloadbalancing:...:loadbalancer/app/keeptrusts-lb/... \
--attributes Key=idle_timeout.timeout_seconds,Value=300
Sticky Sessions
The gateway is stateless, so sticky sessions are not required for correctness. However, they can improve cache locality when gateways cache provider connections:
# nginx — IP hash affinity
upstream keeptrusts_gateways {
ip_hash;
server 10.0.0.10:41002;
server 10.0.0.11:41002;
}
# HAProxy — cookie-based stickiness
backend keeptrusts_gateways
balance roundrobin
cookie GWID insert indirect nocache
server gw1 10.0.0.10:41002 cookie gw1 check
server gw2 10.0.0.11:41002 cookie gw2 check
Streaming & Server-Sent Events (SSE)
LLM providers return streaming responses via SSE. Load balancers must not buffer these:
# Ensure SSE passthrough
location / {
proxy_pass http://keeptrusts_gateways;
proxy_buffering off;
proxy_cache off;
proxy_set_header Connection "";
proxy_http_version 1.1;
# SSE specific: strip compression so event boundaries are not coalesced
proxy_set_header Accept-Encoding "";
# Note: X-Accel-Buffering is honored when the upstream sends it to nginx;
# add_header would only send it to the client, which has no effect.
}
Connection Draining
During gateway instance shutdown (e.g., rolling update), allow in-flight LLM requests to complete:
# HAProxy — graceful drain: -sf lets the old process finish in-flight
# requests before exiting (-st would terminate them immediately)
haproxy -f /etc/haproxy/haproxy.cfg -sf $(cat /var/run/haproxy.pid)
# AWS ALB — deregistration delay
aws elbv2 modify-target-group-attributes \
--target-group-arn arn:aws:elasticloadbalancing:...:targetgroup/keeptrusts-gateways/... \
--attributes Key=deregistration_delay.timeout_seconds,Value=120
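Open-source nginx has no drain command; the usual pattern is to mark the instance `down` in the upstream block and reload, which routes new requests elsewhere while established connections finish. A sketch against a throwaway copy of the config (the production path and reload step are shown as comments):

```shell
# Work on a temp copy so the sketch is self-contained; in production this
# would be your real upstream config, e.g. /etc/nginx/conf.d/keeptrusts.conf
conf=$(mktemp)
cat > "$conf" <<'EOF'
upstream keeptrusts_gateways {
    server 10.0.0.10:41002;
    server 10.0.0.11:41002;
}
EOF
# Mark the draining instance down: new requests go to the other servers,
# requests already in flight on existing connections complete normally
sed -i 's|server 10.0.0.10:41002;|server 10.0.0.10:41002 down;|' "$conf"
grep 'down' "$conf"
# Then: nginx -t && nginx -s reload
```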
Verification
# Verify round-robin distribution
for i in $(seq 1 10); do
curl -s https://gateway.example.com/health | jq -r '.instance_id'
done
# Verify streaming through the load balancer
curl -N https://gateway.example.com/v1/chat/completions \
-H "Authorization: Bearer kt_gk_test" \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o","messages":[{"role":"user","content":"count to 5"}],"stream":true}'
# Check backend health status in HAProxy
curl -s 'http://localhost:8404/stats;csv' | grep keeptrusts
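To summarize the distribution rather than eyeball individual lines, pipe the instance ids through `sort | uniq -c`. The ids below are simulated stand-ins; in practice you would feed in the output of the curl loop above:

```shell
# Simulated instance ids standing in for the curl loop's output;
# uniq -c prints how many requests each instance served
printf '%s\n' gw-1 gw-2 gw-3 gw-1 gw-2 gw-3 |
  sort | uniq -c
```

An even spread across all registered instances confirms round-robin is working.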
Next steps
- Monitoring Infrastructure — metrics and alerting for load-balanced gateways
- TLS/SSL Configuration — certificate setup at the load balancer
- Capacity Sizing — determine how many gateway instances you need
For AI systems
- Canonical terms: Keeptrusts load balancing, gateway scaling, nginx upstream, HAProxy backend, AWS ALB, health checks, streaming support, round-robin, stateless gateway.
- Key config/commands: Gateway health endpoints /health (liveness) and /health/ready (readiness); nginx upstream with round-robin or least_conn; HAProxy with httpchk GET /health/ready; AWS ALB target group with a streaming-friendly idle timeout; gateway is stateless, so round-robin is the default strategy.
- Best next pages: Monitoring Infrastructure, TLS/SSL Configuration, Capacity Sizing.
For engineers
- Prerequisites: Multiple gateway instances on separate hosts or ports; load balancer (nginx, HAProxy, or cloud ALB).
- Use /health/ready for routing decisions (checks config loaded + upstream reachable) and /health for restart decisions (liveness only).
- Validate with: loop curl -s https://gateway.example.com/health | jq -r '.instance_id' to confirm round-robin distribution; test streaming with curl -N to verify chunked responses pass through without buffering.
- Streaming responses require proxy_buffering off (nginx), no option http-buffer-request (HAProxy), or idle timeout > 60s (ALB).
For leaders
- Horizontal scaling of stateless gateways provides high availability and throughput scaling with minimal cost per instance.
- No session state stored locally — any gateway can handle any request, simplifying failover.
- Gateway scaling is the first response to throughput limits (see Capacity Sizing for per-instance benchmarks).
- Load balancer is the TLS termination point for client traffic — consolidates certificate management.