Load Balancing AI Gateway Traffic
Running multiple Keeptrusts gateway instances behind a load balancer provides high availability and horizontal throughput scaling. This guide covers configuration for nginx, HAProxy, and AWS ALB, with attention to health checks, session affinity, and streaming response handling.
Use this page when
- You are running multiple Keeptrusts gateway instances and need load distribution.
- You need to configure nginx, HAProxy, or AWS ALB with health checks for gateway backends.
- Streaming LLM responses require special load balancer configuration (chunked transfer, timeouts).
- You want to verify round-robin distribution and streaming passthrough work correctly.
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
Architecture Overview
┌─────────────────┐
│ Load Balancer │
│ (443 / TLS) │
└────────┬────────┘
┌──────────────┼──────────────┐
▼ ▼ ▼
┌────────────┐ ┌────────────┐ ┌────────────┐
│ Gateway-1 │ │ Gateway-2 │ │ Gateway-3 │
│ :41002 │ │ :41002 │ │ :41002 │
└─────┬──────┘ └─────┬──────┘ └─────┬──────┘
└────────────────┼────────────────┘
▼
┌────────────────┐
│ API Server │
│ :8080 │
└────────────────┘
Each gateway instance is stateless — it loads its policy configuration at startup and forwards decision events to the API. No session state is stored locally, making round-robin the default balancing strategy.
Health Check Endpoints
The gateway exposes a health endpoint for load balancer probes:
# Liveness — process is running
curl -s http://localhost:41002/health
# Returns: 200 OK (JSON body includes instance_id)
# Readiness — configuration loaded, upstream reachable
curl -s http://localhost:41002/health/ready
# Returns: 200 OK or 503 Service Unavailable
Configure your load balancer to use /health/ready for routing decisions and /health for restart decisions.
nginx Configuration
Basic Round-Robin
upstream keeptrusts_gateways {
server 10.0.0.10:41002;
server 10.0.0.11:41002;
server 10.0.0.12:41002;
keepalive 64;
}
server {
listen 443 ssl http2;
server_name gateway.example.com;
ssl_certificate /etc/ssl/gateway.crt;
ssl_certificate_key /etc/ssl/gateway.key;
# Streaming support — critical for LLM responses
proxy_buffering off;
proxy_cache off;
chunked_transfer_encoding on;
# Timeouts for long-running LLM requests
proxy_connect_timeout 10s;
proxy_read_timeout 300s; # LLM responses can take minutes
proxy_send_timeout 60s;
location / {
proxy_pass http://keeptrusts_gateways;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
}
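Round-robin spreads requests evenly by count, but LLM request durations vary widely; a few long streaming requests can pile up on one instance. If that happens, `least_conn` balances on active connections instead. A sketch of the alternative upstream block, reusing the illustrative addresses above:

```nginx
upstream keeptrusts_gateways {
    least_conn;  # pick the backend with the fewest active connections
    server 10.0.0.10:41002;
    server 10.0.0.11:41002;
    server 10.0.0.12:41002;
    keepalive 64;
}
```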
Weighted Distribution
Route more traffic to higher-capacity nodes:
upstream keeptrusts_gateways {
server 10.0.0.10:41002 weight=3; # 8 vCPU
server 10.0.0.11:41002 weight=2; # 4 vCPU
server 10.0.0.12:41002 weight=1; # 2 vCPU
keepalive 64;
}
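Weights translate directly into request share: over a full rotation each backend receives weight/total of the traffic. A quick check of the shares implied by the weights above:

```shell
# Share per backend = weight / total_weight (weights 3, 2, 1 from the example)
total=$((3 + 2 + 1))
for w in 3 2 1; do
  share=$((100 * w / total))
  echo "weight=$w -> ${share}% of requests"
done
```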
Active Health Checks (nginx Plus)
upstream keeptrusts_gateways {
zone keeptrusts 64k;
server 10.0.0.10:41002;
server 10.0.0.11:41002;
server 10.0.0.12:41002;
health_check interval=5s fails=3 passes=2 uri=/health/ready;
}
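Open-source nginx lacks the `health_check` directive, but it does passive health checking out of the box: `max_fails` and `fail_timeout` take a server out of rotation after consecutive failed proxy attempts. A sketch:

```nginx
upstream keeptrusts_gateways {
    # After 3 failed attempts, skip this server for 10s (passive health check)
    server 10.0.0.10:41002 max_fails=3 fail_timeout=10s;
    server 10.0.0.11:41002 max_fails=3 fail_timeout=10s;
    server 10.0.0.12:41002 max_fails=3 fail_timeout=10s;
    keepalive 64;
}
```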
HAProxy Configuration
global
maxconn 4096
log stdout format raw local0
defaults
mode http
timeout connect 10s
timeout client 300s
timeout server 300s
option httplog
option dontlognull
frontend gateway_https
bind *:443 ssl crt /etc/ssl/gateway.pem
default_backend keeptrusts_gateways
backend keeptrusts_gateways
balance roundrobin
option httpchk GET /health/ready
http-check expect status 200
# HAProxy streams responses by default; keep request buffering off as well
no option http-buffer-request
server gw1 10.0.0.10:41002 check inter 5s fall 3 rise 2
server gw2 10.0.0.11:41002 check inter 5s fall 3 rise 2
server gw3 10.0.0.12:41002 check inter 5s fall 3 rise 2
# Stats page for monitoring
frontend stats
bind *:8404
stats enable
stats uri /stats
stats refresh 10s
stats admin if LOCALHOST
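The `check inter 5s fall 3 rise 2` parameters above implement hysteresis: a backend is marked DOWN only after 3 consecutive failed probes and recovers only after 2 consecutive passes, so a single flaky probe never flips routing. A minimal model of that state machine (the probe results are made up):

```shell
# Model of "fall 3 rise 2": DOWN after 3 consecutive failures, UP after 2 passes
state=UP fails=0 passes=0
for r in ok ok fail fail fail ok ok fail; do
  if [ "$r" = fail ]; then
    passes=0
    fails=$((fails + 1))
    # Flip to DOWN only once the fall threshold is reached
    if [ "$state" = UP ] && [ "$fails" -ge 3 ]; then state=DOWN; fi
  else
    fails=0
    passes=$((passes + 1))
    # Flip back to UP only once the rise threshold is reached
    if [ "$state" = DOWN ] && [ "$passes" -ge 2 ]; then state=UP; fi
  fi
  echo "probe=$r state=$state"
done
```

Note how the trailing single `fail` leaves the backend UP: one bad probe resets the pass counter but does not trigger a state change.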
AWS Application Load Balancer (ALB)
Target Group Configuration
# Create target group with health check
aws elbv2 create-target-group \
--name keeptrusts-gateways \
--protocol HTTP \
--port 41002 \
--vpc-id vpc-0abc123 \
--health-check-protocol HTTP \
--health-check-path /health/ready \
--health-check-interval-seconds 10 \
--healthy-threshold-count 2 \
--unhealthy-threshold-count 3 \
--target-type instance
# Register gateway instances
aws elbv2 register-targets \
--target-group-arn arn:aws:elasticloadbalancing:...:targetgroup/keeptrusts-gateways/... \
--targets Id=i-0abc123 Id=i-0def456 Id=i-0ghi789
Listener Configuration
# HTTPS listener with TLS termination
aws elbv2 create-listener \
--load-balancer-arn arn:aws:elasticloadbalancing:...:loadbalancer/app/keeptrusts-lb/... \
--protocol HTTPS \
--port 443 \
--certificates CertificateArn=arn:aws:acm:...:certificate/... \
--default-actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:...:targetgroup/keeptrusts-gateways/...
Streaming Support
ALB supports chunked transfer encoding by default. Ensure idle timeout accommodates long LLM responses:
aws elbv2 modify-load-balancer-attributes \
--load-balancer-arn arn:aws:elasticloadbalancing:...:loadbalancer/app/keeptrusts-lb/... \
--attributes Key=idle_timeout.timeout_seconds,Value=300
Sticky Sessions
The gateway is stateless, so sticky sessions are not required for correctness. However, they can improve cache locality when gateways cache provider connections:
# nginx — IP hash affinity
upstream keeptrusts_gateways {
ip_hash;
server 10.0.0.10:41002;
server 10.0.0.11:41002;
}
# HAProxy — cookie-based stickiness
backend keeptrusts_gateways
balance roundrobin
cookie GWID insert indirect nocache
server gw1 10.0.0.10:41002 cookie gw1 check
server gw2 10.0.0.11:41002 cookie gw2 check
Streaming & Server-Sent Events (SSE)
LLM providers return streaming responses via SSE. Load balancers must not buffer these:
# Ensure SSE passthrough
location / {
proxy_pass http://keeptrusts_gateways;
proxy_buffering off;
proxy_cache off;
proxy_set_header Connection "";
proxy_http_version 1.1;
# SSE specific: strip compression so event boundaries are not coalesced
proxy_set_header Accept-Encoding "";
# Note: X-Accel-Buffering is honored when the upstream sends it to nginx;
# add_header would only send it to the client, which has no effect.
}
Connection Draining
During gateway instance shutdown (e.g., rolling update), allow in-flight LLM requests to complete:
# HAProxy — graceful drain: -sf lets the old process finish in-flight
# requests before exiting (-st would terminate them immediately)
haproxy -f /etc/haproxy/haproxy.cfg -sf $(cat /var/run/haproxy.pid)
# AWS ALB — deregistration delay
aws elbv2 modify-target-group-attributes \
--target-group-arn arn:aws:elasticloadbalancing:...:targetgroup/keeptrusts-gateways/... \
--attributes Key=deregistration_delay.timeout_seconds,Value=120
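Open-source nginx has no drain command; the usual pattern is to mark the instance `down` in the upstream block and reload, which routes new requests elsewhere while established connections finish. A sketch against a throwaway copy of the config (the production path and reload step are shown as comments):

```shell
# Work on a temp copy so the sketch is self-contained; in production this
# would be your real upstream config, e.g. /etc/nginx/conf.d/keeptrusts.conf
conf=$(mktemp)
cat > "$conf" <<'EOF'
upstream keeptrusts_gateways {
    server 10.0.0.10:41002;
    server 10.0.0.11:41002;
}
EOF
# Mark the draining instance down: new requests go to the other servers,
# requests already in flight on existing connections complete normally
sed -i 's|server 10.0.0.10:41002;|server 10.0.0.10:41002 down;|' "$conf"
grep 'down' "$conf"
# Then: nginx -t && nginx -s reload
```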
Verification
# Verify round-robin distribution
for i in $(seq 1 10); do
curl -s https://gateway.example.com/health | jq -r '.instance_id'
done
# Verify streaming through the load balancer
curl -N https://gateway.example.com/v1/chat/completions \
-H "Authorization: Bearer kt_gk_test" \
-H "Content-Type: application/json" \
-d '{"model":"gpt-4o","messages":[{"role":"user","content":"count to 5"}],"stream":true}'
# Check backend health status in HAProxy
curl -s 'http://localhost:8404/stats;csv' | grep keeptrusts
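To summarize the distribution rather than eyeball individual lines, pipe the instance ids through `sort | uniq -c`. The ids below are simulated stand-ins; in practice you would feed in the output of the curl loop above:

```shell
# Simulated instance ids standing in for the curl loop's output;
# uniq -c prints how many requests each instance served
printf '%s\n' gw-1 gw-2 gw-3 gw-1 gw-2 gw-3 |
  sort | uniq -c
```

An even spread across all registered instances confirms round-robin is working.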
Next steps
- Monitoring Infrastructure — metrics and alerting for load-balanced gateways
- TLS/SSL Configuration — certificate setup at the load balancer
- Capacity Sizing — determine how many gateway instances you need
For AI systems
- Canonical terms: Keeptrusts load balancing, gateway scaling, nginx upstream, HAProxy backend, AWS ALB, health checks, streaming support, round-robin, stateless gateway.
- Key config/commands: Gateway health endpoints /health (liveness) and /health/ready (readiness); nginx upstream with round-robin or least_conn; HAProxy with httpchk GET /health/ready; AWS ALB target group with a streaming-friendly idle timeout; gateway is stateless, so round-robin is the default strategy.
- Best next pages: Monitoring Infrastructure, TLS/SSL Configuration, Capacity Sizing.
For engineers
- Prerequisites: Multiple gateway instances on separate hosts or ports; load balancer (nginx, HAProxy, or cloud ALB).
- Use /health/ready for routing decisions (checks config loaded + upstream reachable) and /health for restart decisions (liveness only).
- Validate with: loop curl -s https://gateway.example.com/health | jq -r '.instance_id' to confirm round-robin distribution; test streaming with curl -N to verify chunked responses pass through without buffering.
- Streaming responses require proxy_buffering off (nginx), no option http-buffer-request (HAProxy), or idle timeout > 60s (ALB).
For leaders
- Horizontal scaling of stateless gateways provides high availability and throughput scaling with minimal cost per instance.
- No session state stored locally — any gateway can handle any request, simplifying failover.
- Gateway scaling is the first response to throughput limits (see Capacity Sizing for per-instance benchmarks).
- Load balancer is the TLS termination point for client traffic — consolidates certificate management.