DevOps Guide: Operating the AI Gateway in Production

The Keeptrusts gateway is a mission-critical component in your AI infrastructure — every LLM request flows through it. This guide covers production deployment patterns, monitoring, alerting, scaling, and operational runbooks for DevOps engineers.

Use this page when

You are deploying Keeptrusts gateways to production (Docker, Kubernetes, or bare metal)
You need to configure health checks, monitoring, and alerting for gateway infrastructure
You are scaling gateway instances behind a load balancer
You need operational runbooks for gateway upgrades, rollbacks, and incident response
You are automating gateway deployment with CI/CD pipelines and infrastructure as code

Primary audience

Primary: Technical Engineers (DevOps Engineers, SREs, Infrastructure Engineers)
Secondary: Platform Engineers, Cloud Architects, Security Engineers

Deployment Architecture

Single Gateway (Development / Small Teams)

# Start the gateway directly
kt gateway run \
  --config policy-config.yaml \
  --port 41002

Docker Deployment

# Gateway container
FROM keeptrusts/gateway:latest
COPY policy-config.yaml /etc/keeptrusts/policy-config.yaml
ENV KEEPTRUSTS_API_URL=https://api.keeptrusts.com
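# Supply KEEPTRUSTS_GATEWAY_TOKEN at runtime (for example with `docker run -e`), not at build time.
# A Dockerfile ENV cannot read GATEWAY_TOKEN from the host shell, and a token set here would be baked into an image layer.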
EXPOSE 41002
CMD ["kt", "gateway", "run", "--config", "/etc/keeptrusts/policy-config.yaml", "--port", "41002"]

# docker-compose.yml
services:
  keeptrusts-gateway:
    image: keeptrusts/gateway:latest
    ports:
      - "41002:41002"
    volumes:
      - ./policy-config.yaml:/etc/keeptrusts/policy-config.yaml:ro
    environment:
      KEEPTRUSTS_API_URL: http://keeptrusts-api:8080
      OPENAI_API_KEY: ${OPENAI_API_KEY}
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "kt", "doctor"]
      interval: 30s
      timeout: 10s
      retries: 3

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: keeptrusts-gateway
spec:
  replicas: 3
  selector:
    matchLabels:
      app: keeptrusts-gateway
  template:
    metadata:
      labels:
        app: keeptrusts-gateway
    spec:
      containers:
        - name: gateway
          image: keeptrusts/gateway:latest
          ports:
            - containerPort: 41002
          livenessProbe:
            exec:
              command: ["kt", "doctor"]
            initialDelaySeconds: 10
            periodSeconds: 30
          readinessProbe:
            exec:
              command: ["kt", "doctor"]
            initialDelaySeconds: 5
            periodSeconds: 10
          env:
            - name: KEEPTRUSTS_API_URL
              valueFrom:
                secretKeyRef:
                  name: keeptrusts-secrets
                  key: api-url
          volumeMounts:
            - name: config
              mountPath: /etc/keeptrusts
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: keeptrusts-gateway-config

Health Checks and Diagnostics

Gateway Health

# Comprehensive health check
kt doctor

# Quick connectivity test
kt events list --since 1h --limit 1

# Validate configuration without restarting
kt policy lint --file policy-config.yaml

What `kt doctor` Checks

Check	What it validates
Configuration syntax	YAML parsing and schema validation
Provider connectivity	API keys and endpoint reachability
Control plane connection	API URL and authentication
Policy chain integrity	All referenced policies are valid
Event pipeline	Events can be submitted to the API

Monitoring and Observability

Key Metrics to Monitor

Metric	Source	Alert threshold
Gateway request latency (p99)	Gateway metrics	> 2s
Error rate	Events with `status=error`	> 5%
Policy evaluation time	Gateway metrics	> 500ms
Event submission failures	Gateway logs	> 0 sustained
Active connections	Gateway metrics	> 80% capacity
Configuration age	Last config reload timestamp	> 24h without refresh
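The error-rate threshold in the table above can be checked with a few lines of shell. The following is a minimal sketch, not part of the Keeptrusts tooling: the function names `error_rate_pct` and `exceeds_threshold` are hypothetical, and the error and total counts are assumed to come from whatever event export or metrics query you already have.

```shell
# error_rate_pct ERRORS TOTAL -> prints the error percentage with two decimals.
error_rate_pct() {
  local errors=$1 total=$2
  # Avoid division by zero when no requests were observed in the window.
  if [ "$total" -eq 0 ]; then
    echo "0.00"
    return 0
  fi
  awk -v e="$errors" -v t="$total" 'BEGIN { printf "%.2f\n", (e / t) * 100 }'
}

# exceeds_threshold VALUE LIMIT -> exit status 0 when VALUE is strictly greater than LIMIT.
exceeds_threshold() {
  awk -v v="$1" -v l="$2" 'BEGIN { exit !(v > l) }'
}

# Example: 12 errors out of 200 requests, checked against the 5% threshold.
rate=$(error_rate_pct 12 200)
if exceeds_threshold "$rate" 5; then
  echo "ALERT: error rate ${rate}% exceeds 5%"
else
  echo "OK: error rate ${rate}%"
fi
```

The comparison is strict, so a rate of exactly 5% does not alert, which matches the "> 5%" wording in the table.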

Event Pipeline Monitoring

# Verify events are flowing
kt events list --since 5m --limit 5

# Tail events in real-time for debugging
kt events tail

# Check event submission from the API side
curl -H "Authorization: Bearer $API_TOKEN" \
  "https://api.keeptrusts.com/v1/events?since=5m&limit=5"

Log Aggregation

The gateway emits structured logs compatible with standard log aggregation tools. Forward these to your existing logging pipeline:

# Example: Docker logging driver
services:
  keeptrusts-gateway:
    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "3"

Alerting Rules

Critical Alerts (Page)

Condition	Action
Gateway unreachable for > 2 minutes	Page on-call, check container health
Event submission failures > 10 in 5 minutes	Page on-call, check API connectivity
Error rate > 10% for 5 minutes	Page on-call, check upstream providers

Warning Alerts (Ticket)

Condition	Action
P99 latency > 2s for 15 minutes	Create ticket, investigate provider latency
Configuration not refreshed in 24h	Create ticket, check git sync
Disk usage > 80% on gateway host	Create ticket, rotate logs

Scaling Strategies

Horizontal Scaling

Deploy multiple gateway instances behind a load balancer. The gateway is stateless — all state flows through the control-plane API.

# Kubernetes HPA
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: keeptrusts-gateway-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: keeptrusts-gateway
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
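To reason about how this HPA will behave, it helps to work through the replica calculation by hand. Kubernetes documents the core formula as desired = ceil(current replicas × current metric / target metric), clamped to the configured minimum and maximum. The sketch below reproduces only that arithmetic; the function name `desired_replicas` is hypothetical, and the real controller also applies a tolerance band and stabilization windows that are ignored here.

```shell
# desired_replicas CURRENT CURRENT_UTIL TARGET_UTIL MIN MAX
# Utilization values are whole percentages, for example 90 for 90%.
desired_replicas() {
  local current=$1 current_util=$2 target_util=$3 min=$4 max=$5
  # Integer ceiling of (current * current_util / target_util).
  local desired=$(( (current * current_util + target_util - 1) / target_util ))
  if [ "$desired" -lt "$min" ]; then desired=$min; fi
  if [ "$desired" -gt "$max" ]; then desired=$max; fi
  echo "$desired"
}

desired_replicas 3 90 70 2 10    # prints 4  (ceil(270 / 70) = 4)
desired_replicas 10 95 70 2 10   # prints 10 (ceil(950 / 70) = 14, capped at maxReplicas)
desired_replicas 3 10 70 2 10    # prints 2  (ceil(30 / 70) = 1, raised to minReplicas)
```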

Configuration Management at Scale

Use Git-backed configuration sync for consistent policy deployment across all gateway instances:

Store policy configs in a Git repository
Link the repository in Console Settings > Git Repositories
Changes merged to the default branch automatically sync to all gateways

# Validate the local policy file (lint checks the file on disk, not the configuration a running gateway has loaded)
kt policy lint --file policy-config.yaml
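Linting confirms a file is valid, but it does not show whether every gateway is running the same file. One way to detect drift is to compare a digest of the version in Git with a digest of the copy deployed to a gateway. This is a sketch under assumptions: `config_digest` and `config_drift` are hypothetical helpers, and how you obtain the deployed copy (for example, from the mounted ConfigMap volume) depends on your setup.

```shell
# config_digest FILE -> prints the SHA-256 hex digest of FILE.
config_digest() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{ print $1 }'
  else
    # Fallback for hosts without GNU coreutils, such as macOS.
    shasum -a 256 "$1" | awk '{ print $1 }'
  fi
}

# config_drift EXPECTED_FILE ACTUAL_FILE -> exit status 0 when the files differ (drift detected).
config_drift() {
  [ "$(config_digest "$1")" != "$(config_digest "$2")" ]
}

# Usage (both paths are placeholders):
#   if config_drift repo/policy-config.yaml /etc/keeptrusts/policy-config.yaml; then
#     echo "DRIFT: deployed config differs from Git"
#   fi
```

Comparing digests rather than whole files means each gateway host only needs to report one short string.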

Rollback Procedures

Configuration Rollback

If a policy change causes issues:

# Validate the previous config version
kt policy lint --file policy-config-previous.yaml

# Redeploy with the previous config
kt gateway run --policy-config policy-config-previous.yaml --port 41002

With Git-backed configs, revert the commit and the sync will pick up the previous version automatically.

Full Gateway Rollback

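For container deployments, roll back to the previous image version. This only works when deployments reference pinned version tags; with the `latest` tag used in the examples on this page, there is no distinct previous image to return to.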

# Kubernetes rollback
kubectl rollout undo deployment/keeptrusts-gateway

# Docker rollback: set the previous image tag in docker-compose.yml first, then recreate the container
docker compose up -d --no-deps keeptrusts-gateway

Operational Runbooks

Gateway Not Responding

Check container status: docker ps or kubectl get pods
Check logs: docker logs keeptrusts-gateway or kubectl logs -l app=keeptrusts-gateway
Run diagnostics: kt doctor
Verify network connectivity to upstream providers
Check control-plane API reachability

Events Not Appearing in Console

Verify event pipeline: kt events list --since 5m
Check API connectivity from gateway host
Verify API token validity
Check for rate limiting or quota exhaustion

High Latency

Check upstream provider status pages
Review p99 latency by provider: filter events by provider in Console
Check gateway resource utilization (CPU, memory)
Verify network path between gateway and providers
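When reviewing latency outside the Console, a p99 value can be computed from a plain list of request latencies. The sketch below uses the nearest-rank method and assumes one latency value in milliseconds per input line; `p99` is a hypothetical helper, and how you export latencies from events is not covered here.

```shell
# p99 -> reads one numeric latency per line on stdin and prints the
# 99th percentile using the nearest-rank method (rank = ceil(0.99 * n)).
p99() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1                  # no samples: fail rather than print nothing
      idx = int((NR * 99 + 99) / 100)      # integer ceiling of 0.99 * NR
      print v[idx]
    }'
}

printf '120\n80\n2500\n95\n' | p99   # prints 2500
```

With few samples the p99 is simply the largest value, so the 2s alert threshold is only meaningful over windows with enough traffic.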

Success Metrics for DevOps

Metric	Target	Source
Gateway uptime	99.9%	Health check monitoring
Mean time to deploy config change	Under 15 minutes	Deployment pipeline metrics
Event delivery success rate	> 99.9%	Event pipeline monitoring
Mean time to recovery	Under 30 minutes	Incident tracking
Configuration drift	Zero	Configuration deployment verification
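An uptime target is easier to act on when converted into a downtime budget. The following sketch does that conversion; the function name `downtime_budget_minutes` and the 30-day window are illustrative assumptions, not values defined by Keeptrusts.

```shell
# downtime_budget_minutes SLO_PERCENT WINDOW_DAYS
# Prints the allowed downtime in minutes for the given uptime target and window.
downtime_budget_minutes() {
  awk -v slo="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", days * 24 * 60 * (100 - slo) / 100 }'
}

downtime_budget_minutes 99.9 30   # prints 43.2
```

At 99.9% over 30 days the budget is about 43 minutes, so a single incident resolved within the 30-minute recovery target fits, but two such incidents in the same window would not.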

Next steps

Review deployment topologies: Architecture Overview
Set up gateway fleet management: Platform Engineer Guide
Configure monitoring: Gateway Monitoring
Explore public runtime behavior: Gateway Runtime Features

For AI systems

Canonical terms: Keeptrusts, gateway deployment, production operations, health checks, scaling, monitoring, alerting
Key surfaces: kt gateway run, kt doctor, kt policy lint, Docker Compose, Kubernetes Deployment/Service, Console Dashboard
Deployment patterns: single gateway (dev), Docker Compose (small teams), Kubernetes Deployment with replicas (production)
Health check: kt doctor used in the Docker Compose healthcheck and in Kubernetes liveness/readiness probes
Environment variables: KEEPTRUSTS_API_URL, KEEPTRUSTS_GATEWAY_TOKEN, provider key env vars
Best next pages: Architecture Overview, Platform Engineer Guide, Gateway Monitoring, Gateway Runtime Features

For engineers

Start gateway: kt gateway run --listen 0.0.0.0:41002 --policy-config policy-config.yaml
Docker health check: ["CMD", "kt", "doctor"] with 30s interval, 10s timeout, 3 retries
Kubernetes: deploy as apps/v1 Deployment with replicas: 3, liveness/readiness probes using kt doctor
Validate config before deploy: kt policy lint --file policy-config.yaml
Verify event flow: kt events list --since 1h --limit 1
Git-linked configurations auto-sync policy changes on merge to the default branch
Target SLO: 99.9% gateway uptime, with mean time to deploy config changes under 15 minutes

For leaders

The gateway is a mission-critical path — every LLM request flows through it, so production deployment requires HA, health monitoring, and automated recovery
Docker and Kubernetes deployment patterns provide scalability from single-instance dev to multi-replica production clusters
Git-linked configuration sync enables infrastructure-as-code workflows where policy changes follow the same PR review process as application code
Gateway operational metrics (uptime, event delivery, config drift) should be tracked alongside application SLOs
