Load Balancing AI Gateway Traffic

Running multiple Keeptrusts gateway instances behind a load balancer provides high availability and horizontal throughput scaling. This guide covers configuration for nginx, HAProxy, and AWS ALB, with attention to health checks, session affinity, and streaming response handling.

Use this page when

  • You are running multiple Keeptrusts gateway instances and need load distribution.
  • You need to configure nginx, HAProxy, or AWS ALB with health checks for gateway backends.
  • Streaming LLM responses require special load balancer configuration (chunked transfer, timeouts).
  • You want to verify round-robin distribution and streaming passthrough work correctly.

Primary audience

  • Primary: Technical Engineers
  • Secondary: AI Agents, Technical Leaders

Architecture Overview

             ┌─────────────────┐
             │  Load Balancer  │
             │   (443 / TLS)   │
             └────────┬────────┘
      ┌───────────────┼───────────────┐
      ▼               ▼               ▼
┌────────────┐  ┌────────────┐  ┌────────────┐
│ Gateway-1  │  │ Gateway-2  │  │ Gateway-3  │
│   :41002   │  │   :41002   │  │   :41002   │
└─────┬──────┘  └─────┬──────┘  └─────┬──────┘
      └───────────────┼───────────────┘
                      ▼
             ┌────────────────┐
             │   API Server   │
             │     :8080      │
             └────────────────┘

Each gateway instance is stateless — it loads its policy configuration at startup and forwards decision events to the API. No session state is stored locally, making round-robin the default balancing strategy.

Health Check Endpoints

The gateway exposes a health endpoint for load balancer probes:

# Liveness — process is running
curl -s http://localhost:41002/health
# Returns: 200 OK

# Readiness — configuration loaded, upstream reachable
curl -s http://localhost:41002/health/ready
# Returns: 200 OK or 503 Service Unavailable

Configure your load balancer to use /health/ready for routing decisions and /health for restart decisions.
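
The routing-versus-restart split can be sketched as a small decision function. The snippet below is illustrative only: the `probe_action` helper and its action labels are hypothetical names for a supervisor script, not part of the gateway API; only the semantics of the two endpoints come from this guide.

```python
# Sketch of how a supervisor or load balancer interprets the two probes.
# probe_action and the returned labels are illustrative, not a gateway API.

def probe_action(liveness_ok: bool, readiness_ok: bool) -> str:
    """Map /health and /health/ready results to an operator action."""
    if not liveness_ok:
        # /health failed: the process itself is broken -> restart it
        return "restart"
    if not readiness_ok:
        # /health/ready failed: process is alive but config/upstream is not
        # serviceable -> pull it from rotation, keep it running to recover
        return "stop-routing"
    return "route"

print(probe_action(True, True))    # healthy instance keeps receiving traffic
print(probe_action(True, False))   # drained from rotation, not restarted
print(probe_action(False, False))  # restarted by the supervisor
```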

nginx Configuration

Basic Round-Robin

upstream keeptrusts_gateways {
    server 10.0.0.10:41002;
    server 10.0.0.11:41002;
    server 10.0.0.12:41002;
    keepalive 64;
}

server {
    listen 443 ssl http2;
    server_name gateway.example.com;

    ssl_certificate     /etc/ssl/gateway.crt;
    ssl_certificate_key /etc/ssl/gateway.key;

    # Streaming support — critical for LLM responses
    proxy_buffering off;
    proxy_cache off;
    chunked_transfer_encoding on;

    # Timeouts for long-running LLM requests
    proxy_connect_timeout 10s;
    proxy_read_timeout    300s;  # LLM responses can take minutes
    proxy_send_timeout    60s;

    location / {
        proxy_pass http://keeptrusts_gateways;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

Weighted Distribution

Route more traffic to higher-capacity nodes:

upstream keeptrusts_gateways {
    server 10.0.0.10:41002 weight=3;  # 8 vCPU
    server 10.0.0.11:41002 weight=2;  # 4 vCPU
    server 10.0.0.12:41002 weight=1;  # 2 vCPU
    keepalive 64;
}
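
As a sketch of how those weights shape traffic, the snippet below implements a smooth weighted round-robin pass (the scheduling style nginx is known to use); the gw1/gw2/gw3 names stand in for the upstream servers above.

```python
# Illustrative smooth weighted round-robin: each backend receives picks in
# proportion to its weight, interleaved rather than in bursts.
from collections import Counter

def smooth_wrr(servers: dict, n: int) -> list:
    """Return the sequence of n backend picks for the given weights."""
    current = {name: 0 for name in servers}
    total = sum(servers.values())
    picks = []
    for _ in range(n):
        # Each backend's current score grows by its weight...
        for name, weight in servers.items():
            current[name] += weight
        # ...the highest score wins and pays back the total weight.
        best = max(current, key=current.get)
        current[best] -= total
        picks.append(best)
    return picks

weights = {"gw1": 3, "gw2": 2, "gw3": 1}  # mirrors the 8/4/2 vCPU nodes
sequence = smooth_wrr(weights, 6)
print(sequence)
print(Counter(sequence))  # per cycle of 6: gw1 x3, gw2 x2, gw3 x1
```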

Active Health Checks (nginx Plus)

upstream keeptrusts_gateways {
    zone keeptrusts 64k;
    server 10.0.0.10:41002;
    server 10.0.0.11:41002;
    server 10.0.0.12:41002;
}

server {
    # health_check is a location-level directive in nginx Plus
    location / {
        proxy_pass http://keeptrusts_gateways;
        health_check interval=5s fails=3 passes=2 uri=/health/ready;
    }
}

HAProxy Configuration

global
    maxconn 4096
    log stdout format raw local0

defaults
    mode http
    timeout connect 10s
    timeout client 300s
    timeout server 300s
    option httplog
    option dontlognull

frontend gateway_https
    bind *:443 ssl crt /etc/ssl/gateway.pem
    default_backend keeptrusts_gateways

backend keeptrusts_gateways
    balance roundrobin
    option httpchk GET /health/ready
    http-check expect status 200

    # HAProxy streams responses by default, so no buffering directive is
    # needed for SSE/chunked passthrough
    server gw1 10.0.0.10:41002 check inter 5s fall 3 rise 2
    server gw2 10.0.0.11:41002 check inter 5s fall 3 rise 2
    server gw3 10.0.0.12:41002 check inter 5s fall 3 rise 2

# Stats page for monitoring
frontend stats
    bind *:8404
    stats enable
    stats uri /stats
    stats refresh 10s
    stats admin if LOCALHOST

AWS Application Load Balancer (ALB)

Target Group Configuration

# Create target group with health check
aws elbv2 create-target-group \
  --name keeptrusts-gateways \
  --protocol HTTP \
  --port 41002 \
  --vpc-id vpc-0abc123 \
  --health-check-protocol HTTP \
  --health-check-path /health/ready \
  --health-check-interval-seconds 10 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --target-type instance

# Register gateway instances
aws elbv2 register-targets \
  --target-group-arn arn:aws:elasticloadbalancing:...:targetgroup/keeptrusts-gateways/... \
  --targets Id=i-0abc123 Id=i-0def456 Id=i-0ghi789

Listener Configuration

# HTTPS listener with TLS termination
aws elbv2 create-listener \
  --load-balancer-arn arn:aws:elasticloadbalancing:...:loadbalancer/app/keeptrusts-lb/... \
  --protocol HTTPS \
  --port 443 \
  --certificates CertificateArn=arn:aws:acm:...:certificate/... \
  --default-actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:...:targetgroup/keeptrusts-gateways/...

Streaming Support

ALB supports chunked transfer encoding by default. Ensure the idle timeout (60 seconds by default) accommodates long LLM responses:

aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:...:loadbalancer/app/keeptrusts-lb/... \
  --attributes Key=idle_timeout.timeout_seconds,Value=300

Sticky Sessions

The gateway is stateless, so sticky sessions are not required for correctness. However, they can improve cache locality when gateways cache provider connections:

# nginx — IP hash affinity
upstream keeptrusts_gateways {
    ip_hash;
    server 10.0.0.10:41002;
    server 10.0.0.11:41002;
}

# HAProxy — cookie-based stickiness
backend keeptrusts_gateways
    balance roundrobin
    cookie GWID insert indirect nocache
    server gw1 10.0.0.10:41002 cookie gw1 check
    server gw2 10.0.0.11:41002 cookie gw2 check
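
The affinity idea behind ip_hash can be sketched in a few lines: hashing the client address deterministically maps each client to one backend, so repeated requests reuse that backend's warm provider connections. The hash choice here (SHA-256 of the address string) is purely illustrative; nginx uses its own internal hash.

```python
# Illustrative client-IP affinity: same address always maps to the same
# backend, independent of request order. Backend list mirrors the config.
import hashlib

BACKENDS = ["10.0.0.10:41002", "10.0.0.11:41002"]

def pick_backend(client_ip: str) -> str:
    """Deterministically map a client address to one backend."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    return BACKENDS[int.from_bytes(digest[:4], "big") % len(BACKENDS)]

# Repeated requests from one client land on the same instance:
assert pick_backend("203.0.113.7") == pick_backend("203.0.113.7")
print(pick_backend("203.0.113.7"))
```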

Streaming & Server-Sent Events (SSE)

LLM providers return streaming responses via SSE. Load balancers must not buffer these:

# Ensure SSE passthrough
location / {
    proxy_pass http://keeptrusts_gateways;
    proxy_buffering off;
    proxy_cache off;
    proxy_set_header Connection "";
    proxy_http_version 1.1;

    # SSE specific
    proxy_set_header Accept-Encoding "";
    add_header X-Accel-Buffering no;
}
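
To see why buffering breaks SSE, consider what a client consumes: discrete `data:` lines that should arrive as soon as each chunk is produced, with chunk boundaries that need not align with event boundaries. A minimal, illustrative parser (the sample events are made up, not real gateway payloads):

```python
# Sketch of incremental SSE consumption: events are reassembled from
# arbitrary chunk boundaries, which is why the LB must pass chunks through
# unbuffered instead of holding the whole response.

def iter_sse_data(chunks):
    """Yield the payload of each `data:` line as byte chunks arrive."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.startswith(b"data: "):
                yield line[len(b"data: "):].decode()

# A chunk boundary falls mid-event; the parser still recovers both events:
chunks = [b"data: one\n\nda", b"ta: two\n\ndata: [DONE]\n\n"]
print(list(iter_sse_data(chunks)))  # ['one', 'two', '[DONE]']
```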

Connection Draining

During gateway instance shutdown (e.g., rolling update), allow in-flight LLM requests to complete:

# HAProxy — graceful drain (-sf lets the old process finish in-flight
# connections before exiting; -st would terminate them immediately)
haproxy -f /etc/haproxy/haproxy.cfg -sf $(cat /var/run/haproxy.pid)

# AWS ALB — deregistration delay
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:...:targetgroup/keeptrusts-gateways/... \
  --attributes Key=deregistration_delay.timeout_seconds,Value=120
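
The drain sequence both mechanisms rely on can be sketched as: fail readiness, wait for in-flight requests up to a deadline, then exit. The `Drainer` class below is purely illustrative; the real gateway handles this internally.

```python
# Illustrative drain logic: readiness flips to failing (so the LB stops
# routing here), then shutdown waits for in-flight requests to complete.
import threading
import time

class Drainer:
    def __init__(self):
        self.ready = True       # what /health/ready would report
        self.in_flight = 0
        self._lock = threading.Lock()

    def start_request(self):
        with self._lock:
            self.in_flight += 1

    def finish_request(self):
        with self._lock:
            self.in_flight -= 1

    def drain(self, deadline_s: float) -> bool:
        """Fail readiness, then wait for in-flight work to finish."""
        self.ready = False      # LB health checks now see 503
        end = time.monotonic() + deadline_s
        while time.monotonic() < end:
            with self._lock:
                if self.in_flight == 0:
                    return True  # safe to stop the process
            time.sleep(0.01)
        return False             # deadline hit with requests still open

d = Drainer()
d.start_request()
threading.Timer(0.05, d.finish_request).start()  # request ends after 50 ms
print(d.drain(1.0))  # True — the in-flight request completed in time
```

The deregistration delay above plays the role of `deadline_s`: it bounds how long the load balancer keeps the draining instance reachable for in-flight work.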

Verification

# Verify round-robin distribution
for i in $(seq 1 10); do
  curl -s https://gateway.example.com/health | jq -r '.instance_id'
done

# Verify streaming through the load balancer
curl -N https://gateway.example.com/v1/chat/completions \
  -H "Authorization: Bearer kt_gk_test" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"count to 5"}],"stream":true}'

# Check backend health status in HAProxy
curl -s http://localhost:8404/stats\;csv | grep keeptrusts

Next steps

For AI systems

  • Canonical terms: Keeptrusts load balancing, gateway scaling, nginx upstream, HAProxy backend, AWS ALB, health checks, streaming support, round-robin, stateless gateway.
  • Key config/commands: Gateway health endpoints /health (liveness) and /health/ready (readiness); nginx upstream with round-robin or weighted distribution; HAProxy with httpchk GET /health/ready; AWS ALB target group with a raised idle timeout for streaming; gateway is stateless — round-robin is the default strategy.
  • Best next pages: Monitoring Infrastructure, TLS/SSL Configuration, Capacity Sizing.

For engineers

  • Prerequisites: Multiple gateway instances on separate hosts or ports; load balancer (nginx, HAProxy, or cloud ALB).
  • Use /health/ready for routing decisions (checks config loaded + upstream reachable) and /health for restart decisions (liveness only).
  • Validate with: loop curl -s https://gateway.example.com/health | jq -r '.instance_id' to confirm round-robin distribution; test streaming with curl -N to verify chunked responses pass through without buffering.
  • Streaming responses require proxy_buffering off (nginx), HAProxy's default unbuffered response streaming (no extra directive), or a raised idle timeout on ALB (default 60 s; this guide sets 300 s).

For leaders

  • Horizontal scaling of stateless gateways provides high availability and throughput scaling with minimal cost per instance.
  • No session state stored locally — any gateway can handle any request, simplifying failover.
  • Gateway scaling is the first response to throughput limits (see Capacity Sizing for per-instance benchmarks).
  • Load balancer is the TLS termination point for client traffic — consolidates certificate management.