IT Director Guide: AI Infrastructure Management

As an IT Director overseeing AI infrastructure, you manage the operational reality of running AI services at scale: capacity planning, vendor relationships, budgets, SLAs, and change processes. Keeptrusts gives you the operational data and controls needed to manage AI as a governed enterprise service rather than a collection of ungoverned experiments.

Use this page when

  • You are planning AI infrastructure capacity (users, requests/day, gateway sizing)
  • You need to manage multiple LLM vendor relationships and track provider performance
  • You are forecasting AI service budgets and negotiating SLAs
  • You want to implement change management processes for AI governance configurations
  • You need to run AI as a governed enterprise service rather than ad hoc experiments

Primary audience

  • Primary: Technical Leaders (IT Directors, Heads of IT Operations)
  • Secondary: Procurement, CTOs, DevOps Engineers, Finance

Capacity Planning

Current State Assessment

Before planning capacity, baseline your current AI usage through the Console and API:

# Get current usage volume
curl -H "Authorization: Bearer $API_TOKEN" \
  "https://api.keeptrusts.com/v1/events?since=30d&group_by=gateway"

# Identify peak usage patterns
curl -H "Authorization: Bearer $API_TOKEN" \
  "https://api.keeptrusts.com/v1/events?since=7d"

The Console Dashboard provides visual trends for request volumes, active users, and policy trigger rates.

Capacity Planning Model

Dimension | Current | 6-Month Projection | Planning Action
Active users | Measure via events | +50-100% with AI adoption wave | Provision gateway capacity
Requests/day | Measure via events | 2-3x growth expected | Load test gateway cluster
Teams onboarded | Count gateways | All engineering + business teams | Template and onboarding plan
LLM providers | Count in config | Add 1-2 for redundancy | Provider evaluation
Storage (events) | Measure via API | Linear growth with usage | Event retention policy
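
To make the projection column concrete, the growth ranges in the table can be applied to a measured baseline. The following is a minimal sketch, not a Keeptrusts tool: the baseline figures and the helper name `project_range` are hypothetical, and the multipliers simply restate the table (+50-100% users is 1.5x to 2.0x, requests grow 2x to 3x).

```python
def project_range(current: float, low_factor: float, high_factor: float) -> tuple[float, float]:
    """Return the (low, high) six-month projection for a measured baseline."""
    if current < 0:
        raise ValueError("baseline must be non-negative")
    return current * low_factor, current * high_factor


# Hypothetical baseline values; replace with figures measured from your own events.
baseline_users = 200
baseline_requests_per_day = 12_000

# Multipliers taken from the planning table: +50-100% users, 2-3x requests.
users_low, users_high = project_range(baseline_users, 1.5, 2.0)
req_low, req_high = project_range(baseline_requests_per_day, 2.0, 3.0)

print(f"Users in 6 months: {users_low:.0f}-{users_high:.0f}")
print(f"Requests/day in 6 months: {req_low:.0f}-{req_high:.0f}")
```

Planning against the high end of each range leaves headroom if adoption outpaces the forecast.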

Gateway Infrastructure Sizing

Deployment Size | Users | Requests/Day | Gateway Instances | Resources
Small | 1-50 | Under 5K | 1 | 2 CPU, 4GB RAM
Medium | 50-500 | 5K-50K | 2-3 | 4 CPU, 8GB RAM each
Large | 500+ | 50K+ | 5+ | 8 CPU, 16GB RAM each
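
The sizing table can be read as a lookup: find the tier each dimension implies, then take the larger of the two. The sketch below encodes that reading. Two things are assumptions rather than documented behavior: how the overlapping boundaries are resolved (50 users counts as Small, 500 users as Medium), and the rule that the more demanding dimension wins. The function names are hypothetical.

```python
TIERS = ["Small", "Medium", "Large"]

# Instance counts and resources copied from the sizing table.
TIER_SPECS = {
    "Small": {"instances": "1", "resources": "2 CPU, 4GB RAM"},
    "Medium": {"instances": "2-3", "resources": "4 CPU, 8GB RAM each"},
    "Large": {"instances": "5+", "resources": "8 CPU, 16GB RAM each"},
}


def tier_index_for_users(users: int) -> int:
    # Assumption: boundary values fall into the smaller tier.
    if users <= 50:
        return 0
    if users <= 500:
        return 1
    return 2


def tier_index_for_requests(requests_per_day: int) -> int:
    # "Under 5K" is Small, 5K up to (but not including) 50K is Medium.
    if requests_per_day < 5_000:
        return 0
    if requests_per_day < 50_000:
        return 1
    return 2


def recommend_tier(users: int, requests_per_day: int) -> str:
    """Pick the larger tier implied by either users or request volume."""
    index = max(tier_index_for_users(users), tier_index_for_requests(requests_per_day))
    return TIERS[index]


tier = recommend_tier(users=120, requests_per_day=60_000)
print(tier, TIER_SPECS[tier])
```

A small team with heavy automated traffic lands in a larger tier than its headcount alone suggests, which is why both dimensions are checked.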

Vendor Management

LLM Provider Portfolio

Manage multiple LLM providers as a portfolio with Keeptrusts providing the governance abstraction:

pack:
  name: it-director-providers-1
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai
      provider:
        secret_key_ref:
          env: OPENAI_API_KEY
    - id: anthropic
      provider:
        secret_key_ref:
          env: ANTHROPIC_API_KEY
    - id: azure-openai
      provider:
        base_url: https://your-instance.openai.azure.com
        secret_key_ref:
          env: AZURE_OPENAI_API_KEY
policies:
  chain:
    - audit-logger
policy:
  audit-logger:
    immutable: true
    retention_days: 365
    log_all_access: true

Vendor Performance Tracking

Use Keeptrusts event data to track provider performance:

# Provider comparison: usage volume and policy triggers
curl -H "Authorization: Bearer $API_TOKEN" \
  "https://api.keeptrusts.com/v1/events?since=30d&group_by=provider"

# Export provider performance data for vendor review
kt export create \
  --type events \
  --format csv \
  --since 90d \
  --description "Quarterly vendor performance review"
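
Once an export is available as CSV, a per-provider policy trigger rate can be computed with the standard library. This is a sketch under stated assumptions: the export's column names are not documented on this page, so `provider` and `policy_triggered` are hypothetical and should be replaced with the headers in your actual file. The sample rows are illustrative data, not real results.

```python
import csv
import io
from collections import defaultdict


def trigger_rate_by_provider(csv_file) -> dict:
    """Return {provider: fraction of events that triggered a policy}."""
    totals = defaultdict(int)
    triggered = defaultdict(int)
    for row in csv.DictReader(csv_file):
        provider = row["provider"]  # hypothetical column name
        totals[provider] += 1
        # Hypothetical column holding "true"/"false" per event.
        if row["policy_triggered"].strip().lower() == "true":
            triggered[provider] += 1
    return {p: triggered[p] / totals[p] for p in totals}


# Illustrative sample standing in for an exported file.
SAMPLE = """provider,policy_triggered
openai,true
openai,false
openai,false
openai,false
anthropic,false
anthropic,true
"""

rates = trigger_rate_by_provider(io.StringIO(SAMPLE))
for provider, rate in sorted(rates.items()):
    print(f"{provider}: {rate:.0%} of events triggered a policy")
```

For a real export, pass an opened file instead of the in-memory sample, for example `open("export.csv", newline="")`.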

Vendor Review Checklist

Review Item | Source | Frequency
Usage volume per provider | Console Events + API | Monthly
Cost per provider | Console Usage | Monthly
Policy trigger rate by provider | Event analytics | Quarterly
Provider uptime and reliability | Gateway health logs | Monthly
Contract renewal timeline | Vendor management tool | Quarterly
Security and compliance certifications | Vendor documentation | Annually

Budget Forecasting

AI Spend Components

Cost Category | Tracking Method | Forecast Input
LLM API usage | Console Cost Center | Historical trend + growth rate
Gateway infrastructure | Infrastructure monitoring | Capacity plan
Keeptrusts platform | License management | User/team count
Internal team costs | HR / Finance | Headcount plan
Training and onboarding | Project tracking | Adoption roadmap

Pulling Cost Data for Forecasting

# Monthly cost breakdown by team
curl -H "Authorization: Bearer $API_TOKEN" \
  "https://api.keeptrusts.com/v1/events?since=30d&group_by=gateway"

# Quarterly cost trend data
kt export create \
  --type events \
  --format csv \
  --since 90d \
  --description "Quarterly cost trend for budget forecast"

Budget Governance Through Policy

Enforce per-team spending limits to prevent budget overruns:

policies:
  - name: team-cost-cap
    type: cost_limit
    monthly_limit: 5000
    action: block
    enabled: true

The Console Cost Center shows real-time spend against budget for each team and provides alerts when teams approach their limits.
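
The cap-and-alert behavior described above can be illustrated with a small decision function. This is only a sketch of the idea, not how Keeptrusts evaluates the `cost_limit` policy: the function name `budget_status` and the 80% warning threshold are hypothetical, while the 5000 default mirrors `monthly_limit` in the policy.

```python
def budget_status(spend: float, monthly_limit: float = 5000, warn_ratio: float = 0.8) -> str:
    """Classify a team's month-to-date spend against its cap.

    Returns "block" at or above the limit, "warn" at or above the
    warning threshold (hypothetical, 80% by default), otherwise "ok".
    """
    if monthly_limit <= 0:
        raise ValueError("monthly_limit must be positive")
    if spend >= monthly_limit:
        return "block"
    if spend >= warn_ratio * monthly_limit:
        return "warn"
    return "ok"


# Illustrative month-to-date figures for three hypothetical teams.
for team, spend in [("platform", 1200.0), ("data-science", 4300.0), ("support", 5100.0)]:
    print(f"{team}: {budget_status(spend)}")
```

Setting the warning threshold below the hard limit gives team leads time to request a budget change before requests are blocked.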

SLA Management

Internal AI Service SLAs

Define SLAs for the AI governance platform as an internal IT service:

SLA Component | Target | Measurement
Gateway availability | 99.9% uptime | Health check monitoring
Policy enforcement latency | Under 50ms added latency | Gateway performance logs
Event ingestion | Under 5 seconds end-to-end | Event timestamp analysis
Escalation response | Under 30 min for P1 | Console Escalation queue
Configuration changes | Under 15 min to deploy | Config change audit log
Export completion | Under 30 min for standard exports | Export job tracking
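
An availability target is easier to negotiate when it is expressed as a downtime budget. The sketch below converts the 99.9% target into allowed minutes of downtime for a reporting period; the function name is hypothetical and the calculation is plain arithmetic rather than anything Keeptrusts-specific.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int) -> float:
    """Minutes of downtime permitted by an availability target over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


monthly = allowed_downtime_minutes(99.9, 30)
yearly = allowed_downtime_minutes(99.9, 365)

print(f"30-day budget at 99.9%: {monthly:.1f} minutes")
print(f"365-day budget at 99.9%: {yearly / 60:.2f} hours")
```

At 99.9%, a 30-day month allows roughly 43 minutes of downtime, so a single extended maintenance window can consume the whole budget.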

Monitoring SLA Compliance

# Verify gateway health (available as a health check endpoint)
kt doctor

# Check event pipeline latency
kt events list --since 1h --limit 10

The Console Dashboard provides real-time indicators for gateway health and event processing.

SLA Reporting

# Generate SLA compliance report data
kt export create \
  --type events \
  --format csv \
  --since 30d \
  --description "Monthly SLA compliance data"

Change Management

Change Categories for AI Governance

Change Type | Risk Level | Approval | Process
Policy threshold adjustment | Low | Team lead | Config change + validate
New policy addition | Medium | IT Director | Review + test + deploy
Provider addition/removal | Medium | IT Director + Security | Vendor review + config change
Gateway infrastructure change | High | Change Advisory Board | Full change process
Major version upgrade | High | Change Advisory Board | Staging test + phased rollout
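
If change requests are filed through a ticketing tool or script, the category table can be encoded as a lookup so each request is routed to the right approver. This is a minimal sketch: the dictionary keys (such as `new_policy`) are hypothetical identifiers chosen for illustration, while the risk levels and approvers restate the table.

```python
# Risk level and approver per change type, restating the table above.
# The keys are hypothetical identifiers, not Keeptrusts values.
CHANGE_RULES = {
    "policy_threshold_adjustment": ("Low", "Team lead"),
    "new_policy": ("Medium", "IT Director"),
    "provider_change": ("Medium", "IT Director + Security"),
    "gateway_infrastructure_change": ("High", "Change Advisory Board"),
    "major_version_upgrade": ("High", "Change Advisory Board"),
}


def route_change(change_type: str) -> dict:
    """Return the risk level and required approver for a change type."""
    try:
        risk, approver = CHANGE_RULES[change_type]
    except KeyError:
        raise ValueError(f"unknown change type: {change_type!r}") from None
    return {"risk": risk, "approver": approver}


print(route_change("provider_change"))
```

Rejecting unknown change types, rather than defaulting to a low-risk path, keeps uncategorized changes from bypassing review.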

Change Process with Keeptrusts

  1. Request: Document the change and its business justification
  2. Validate: Test the configuration change before deployment:
     kt policy lint --file updated-policy.yaml
  3. Approve: Route through the appropriate approval chain
  4. Deploy: Apply the change through Console or Git-linked configuration
  5. Verify: Confirm the change is active and functioning:
     kt events list --since 1h --limit 10
  6. Audit: The Console Audit Log automatically records all changes with who, what, and when

Rollback Procedures

If a change causes issues, roll back using Git-linked configurations (revert the commit) or through the Console by reverting to the previous policy configuration.

Change Audit Trail

The Console Audit Log captures:

  • Policy configuration changes with before/after comparison
  • User access modifications
  • Gateway key rotations
  • Provider configuration updates
  • Export and administrative actions

# Pull recent changes for change review meetings
curl -H "Authorization: Bearer $API_TOKEN" \
  "https://api.keeptrusts.com/v1/events?since=7d"

Operational Runbook

Daily Operations

Task | Tool | Owner
Check gateway health | kt doctor + Console Dashboard | Operations team
Review escalation queue | Console Escalations | SOC / governance team
Monitor cost trends | Console Cost Center | FinOps

Weekly Operations

Task | Tool | Owner
Review SLA metrics | Export reports | IT Director
Analyze usage trends | Console Dashboard | IT Director
Review pending changes | Change management tool | IT Director

Monthly Operations

Task | Tool | Owner
Vendor performance review | Export data | IT Director
Budget reconciliation | Cost Center exports | Finance + IT Director
Capacity review | Event volume trends | IT Director
Policy effectiveness review | Event analytics | Governance team

IT Director Workflow with Keeptrusts

Task | Frequency | Tool
Infrastructure health review | Daily | Console Dashboard + kt doctor
Budget tracking | Weekly | Console Cost Center
Vendor performance review | Monthly | Event exports by provider
Capacity planning | Quarterly | Event volume analysis
SLA reporting | Monthly | Export reports
Change Advisory Board prep | Per change | Console Audit Log

Success Metrics for IT Management

Metric | Target | Source
Service availability | 99.9% uptime | Gateway health monitoring
Budget variance | Within 5% of forecast | Cost Center vs. budget
Change success rate | > 95% without rollback | Change records
SLA compliance | > 99% of targets met | SLA tracking reports
Team onboarding time | Under 2 business days | Onboarding tracker
Vendor satisfaction | All contracts renewed on time | Vendor management
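
Two of these metrics, budget variance and change success rate, reduce to simple ratios. The sketch below shows one way to compute them and check them against the targets in the table. The function names and input figures are hypothetical, and it assumes "within 5%" means the absolute variance is at most 5%.

```python
def budget_variance_pct(actual: float, forecast: float) -> float:
    """Signed variance of actual spend against forecast, in percent."""
    if forecast <= 0:
        raise ValueError("forecast must be positive")
    return (actual - forecast) / forecast * 100


def change_success_rate_pct(total_changes: int, rolled_back: int) -> float:
    """Share of changes that did not require a rollback, in percent."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    return (total_changes - rolled_back) / total_changes * 100


# Hypothetical quarter: 103,000 spent against a 100,000 forecast,
# and 40 changes deployed with 1 rollback.
variance = budget_variance_pct(actual=103_000, forecast=100_000)
success = change_success_rate_pct(total_changes=40, rolled_back=1)

print(f"Budget variance: {variance:+.1f}% (target met: {abs(variance) <= 5})")
print(f"Change success rate: {success:.1f}% (target met: {success > 95})")
```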

For AI systems

  • Canonical terms: Keeptrusts, capacity planning, vendor management, budget forecasting, SLA management, change management, infrastructure sizing
  • Key surfaces: Console Dashboard, Console Usage, Events API (group_by=gateway, group_by=provider), Console Settings, Console Configurations
  • Commands: kt export create, kt events list, kt doctor
  • Sizing guide: Small (1-50 users, under 5K req/day, 1 instance), Medium (50-500 users, 2-3 instances), Large (500+ users, 5+ instances)
  • Config: multi-provider providers block, cost_limit policy, Git-linked config sync for change management
  • Best next pages: Architecture Overview, DevOps Guide, Procurement Guide

For engineers

  • Baseline current usage: GET /v1/events?since=30d&group_by=gateway for volume assessment
  • Track provider performance: GET /v1/events?since=30d&group_by=provider for vendor review data
  • Export performance data: kt export create --type events --format csv --since 90d --description "Quarterly vendor review"
  • Gateway sizing: 2 CPU/4GB for under 5K req/day, 4 CPU/8GB for 5K-50K, and 8 CPU/16GB for 50K+
  • Verify health: kt doctor before any capacity change

For leaders

  • Capacity planning model projects 2-3x request growth as AI adoption accelerates — plan gateway infrastructure ahead of demand
  • Multi-provider portfolio managed through Keeptrusts gives negotiating leverage and eliminates single-vendor risk
  • Usage reporting provides actual spend data (not estimates) for accurate budget forecasting by team, provider, and model
  • Change management for AI configurations uses Git-linked sync — policy changes follow PR review, approval, and audit trail
  • SLA tracking through event data (latency, error rates, uptime) enables evidence-based vendor accountability

Next steps