Gateway High Availability: Scaling with Multiple Instances

Keeptrusts scales with multiple instances by treating gateways as a fleet, not as one special process. In practice that means running more than one gateway, monitoring each instance through /health and kt doctor, grouping instances for bulk operations, and rolling configuration changes across the fleet instead of restarting or updating everything at once.

Use this page when

  • You are moving from one local or staging gateway to a production footprint with redundancy.
  • You need a practical explanation of how multiple Keeptrusts gateways are operated together.
  • You want to reduce the blast radius of gateway outages and policy rollouts.

Primary audience

  • Primary: Technical Engineers
  • Secondary: Technical Leaders, SRE and platform teams

The problem

A single gateway is easy to understand and easy to operate, but it is also a single operational boundary. If that one process is unhealthy, overloaded, misconfigured, or restarted at the wrong time, every governed request behind it is affected.

The first problem is availability. Even a well-behaved service still needs maintenance windows, provider diagnostics, config changes, and host restarts. If only one instance serves traffic, every routine operation becomes user-visible.

The second problem is rollout risk. Policy configs change behavior immediately. A good change still deserves observation. A bad change should affect one instance or one batch first, not every environment at once.

The third problem is topology. Teams often need separate gateways by environment, region, or compliance boundary. Production in one geography may need a different operating group from staging or from a regulated workload. Once that happens, high availability is not just “two copies of the same server.” It is fleet management.

The solution

Keeptrusts documents a fleet model rather than a single-box model.

At the infrastructure layer, you run multiple gateway instances and expose health through /health. The health response reports status, provider reachability, requests processed, and cache data when relevant. That is what your platform uses to decide whether an instance is ready for traffic.
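As a rough illustration of how a platform might turn that response into a traffic decision, the sketch below evaluates an already-parsed health payload. The field names (`status`, `providers`, `reachable`) and the `"healthy"` value are assumptions made for this example, not the documented schema; check the real /health output before relying on them.

```python
from typing import Any, Mapping


def is_ready(health: Mapping[str, Any], min_reachable_providers: int = 1) -> bool:
    """Decide whether a gateway should receive traffic.

    Assumed (hypothetical) payload shape:
    {"status": "healthy", "providers": {"<name>": {"reachable": true}, ...}}
    """
    # An instance that does not report itself healthy is never ready.
    if health.get("status") != "healthy":
        return False
    # Count providers the gateway says it can reach.
    providers = health.get("providers", {})
    reachable = sum(1 for p in providers.values() if p.get("reachable"))
    return reachable >= min_reachable_providers
```

Requiring at least one reachable provider is a design choice in this sketch: a process that is up but cannot reach any provider is treated as not ready.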

At the operational layer, the kt CLI manages multiple gateways from one workstation. The published multi-gateway workflow includes gateway groups, fleet-wide health views, centralized event tailing, rolling config pushes, canary-style updates, and rollback on error.

At the rollout layer, the safety mechanism is sequencing. You do not push a new config everywhere and hope. You dry-run, push to one gateway or one batch, check health and verdict mix, then continue.

That combination is what makes multiple instances useful. Redundancy alone helps with uptime. Redundancy plus controlled rollout is what makes a gateway fleet safe to operate.

Implementation

Start with the runtime side. If you deploy on Kubernetes, the documented monitoring guide uses readiness and liveness probes against /health and more than one replica:

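apiVersion: apps/v1
kind: Deployment
metadata:
  name: keeptrusts-gateway
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels;
  # the label value used here is illustrative.
  selector:
    matchLabels:
      app: keeptrusts-gateway
  template:
    metadata:
      labels:
        app: keeptrusts-gateway
    spec:
      containers:
        - name: gateway
          image: keeptrusts/gateway:latest
          ports:
            - containerPort: 41002
          # Restart the container if /health keeps failing.
          livenessProbe:
            httpGet:
              path: /health
              port: 41002
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          # Keep the pod out of the traffic path until /health succeeds.
          readinessProbe:
            httpGet:
              path: /health
              port: 41002
            initialDelaySeconds: 3
            periodSeconds: 5
            failureThreshold: 2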

That solves the basic problem of instance failure: a gateway that fails its readiness probe is taken out of the traffic path, a gateway that fails its liveness probe is restarted, and the healthy replicas keep serving.

Then add the fleet-management layer so configuration changes are safe as well as redundant. The documented CLI model uses gateway groups:

groups:
  production:
    gateways:
      - gw-prod-01
      - gw-prod-02
    config_source: policies/production.yaml

  eu:
    gateways:
      - gw-eu-01
    config_source: policies/eu-production.yaml
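To make the group structure concrete, here is a small sketch that mirrors the configuration above as a Python dictionary and resolves a group name to its gateways. The helper names `gateways_in` and `find_overlaps` are hypothetical, and the overlap check is an optional sanity check invented for this example, not a documented Keeptrusts rule.

```python
from typing import Dict, List

# The same groups as the YAML above, written as plain Python data.
GROUPS: Dict[str, dict] = {
    "production": {
        "gateways": ["gw-prod-01", "gw-prod-02"],
        "config_source": "policies/production.yaml",
    },
    "eu": {
        "gateways": ["gw-eu-01"],
        "config_source": "policies/eu-production.yaml",
    },
}


def gateways_in(group: str, groups: Dict[str, dict] = GROUPS) -> List[str]:
    """Return the gateways that a bulk operation on `group` would target."""
    if group not in groups:
        raise KeyError(f"unknown gateway group: {group}")
    return list(groups[group]["gateways"])


def find_overlaps(groups: Dict[str, dict] = GROUPS) -> Dict[str, List[str]]:
    """Return gateways listed in more than one group, with the groups they appear in."""
    membership: Dict[str, List[str]] = {}
    for name, spec in groups.items():
        for gateway in spec["gateways"]:
            membership.setdefault(gateway, []).append(name)
    return {gw: names for gw, names in membership.items() if len(names) > 1}
```

A gateway that appears in two groups would receive pushes from two config sources, so a check like this can be worth running before a bulk operation.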

With groups in place, you can do monitored rollouts instead of all-at-once pushes:

kt gateway health --group production

kt config push --file policy-config.yaml --group production \
  --strategy rolling \
  --batch-size 1 \
  --pause-between 60s \
  --rollback-on-error

kt events tail --all-gateways

That operating loop is the important part.

First, confirm the fleet is healthy.

Second, push to one gateway at a time.

Third, watch the metrics and the event stream during the pause window.

Fourth, let --rollback-on-error revert the change automatically if a gateway fails its health check after the push.
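The four steps above can be sketched as a simple control loop. This is a simulation under stated assumptions, not how kt implements rolling pushes: `rolling_push` and its callback parameters are hypothetical, and undoing every updated gateway on the first failure is one possible rollback policy among several.

```python
from typing import Callable, List, Sequence


def rolling_push(
    gateways: Sequence[str],
    apply: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    batch_size: int = 1,
) -> List[str]:
    """Apply a change batch by batch; undo everything on the first failure.

    Returns the gateways left on the new config (empty if rolled back).
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    updated: List[str] = []
    for start in range(0, len(gateways), batch_size):
        batch = gateways[start:start + batch_size]
        for gateway in batch:
            apply(gateway)
            updated.append(gateway)
        # A real rollout would wait here (the pause window) before checking health.
        if not all(is_healthy(gateway) for gateway in batch):
            # Undo in reverse order so the most recent change is reverted first.
            for gateway in reversed(updated):
                rollback(gateway)
            return []
    return updated
```

With a batch size of 1, a bad config reaches at most one gateway before the loop stops, which is the blast-radius argument for rolling over all-at-once pushes.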

You can also use kt doctor as part of the maintenance routine when one gateway looks suspect or when provider reachability differs across regions. The platform docs are clear that health monitoring is not only about whether the process is up. It is also about whether the providers, event forwarding path, and policy chain are behaving correctly.

This is why high availability in Keeptrusts is as much an operational workflow as an infrastructure pattern. Running three replicas without controlled rollout still leaves you exposed to bad config pushes. Running careful rollouts without multiple instances still leaves you exposed to host failures. You need both.

Results and impact

The first result is better uptime. A single host restart or degraded gateway no longer becomes a full outage for governed traffic.

The second result is lower rollout risk. Rolling or canary-style updates reduce the blast radius of a bad change and make rollback faster because the error is detected before the whole fleet is touched.

The third result is better regional and organizational separation. Teams can keep different gateways or gateway groups for staging, production, or region-specific operations without losing centralized visibility.

There is also a governance benefit. When a gateway fleet is observable as a fleet, you can answer questions such as which region is degraded, which version is active where, and whether a new policy caused block-rate changes on only one subset of gateways.
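As one example of answering the block-rate question, the sketch below computes a block rate per gateway from a list of events. The event shape (`gateway` and `verdict` keys, with `"block"` as the blocking verdict) is an assumption for illustration; real event records may use different field names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def block_rate_by_gateway(events: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Return the fraction of blocked requests for each gateway seen in `events`."""
    totals: Dict[str, int] = defaultdict(int)
    blocked: Dict[str, int] = defaultdict(int)
    for event in events:
        gateway = event["gateway"]
        totals[gateway] += 1
        if event["verdict"] == "block":
            blocked[gateway] += 1
    # Every gateway in `totals` has at least one event, so the division is safe.
    return {gateway: blocked[gateway] / totals[gateway] for gateway in totals}
```

Comparing these rates between the gateways that already have a new policy and those that do not is a simple way to spot a change that only affects one subset of the fleet.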

Key takeaways

  • High availability is not only “more replicas”; it is replicas plus health checks plus controlled rollout.
  • /health and readiness probes keep bad instances out of the traffic path.
  • Gateway groups make bulk operations safe and reviewable.
  • Rolling updates reduce blast radius compared with all-at-once config pushes.
  • Fleet health and event visibility are part of the HA story, not separate concerns.

Next steps