IT Director Guide: AI Infrastructure Management
As an IT Director overseeing AI infrastructure, you manage the operational reality of running AI services at scale — capacity planning, vendor relationships, budgets, SLAs, and change processes. Keeptrusts gives you the operational data and controls needed to manage AI as a governed enterprise service rather than a collection of ungoverned experiments.
Use this page when
- You are planning AI infrastructure capacity (users, requests/day, gateway sizing)
- You need to manage multiple LLM vendor relationships and track provider performance
- You are forecasting AI service budgets and negotiating SLAs
- You want to implement change management processes for AI governance configurations
- You need to run AI as a governed enterprise service rather than ad hoc experiments
Primary audience
- Primary: Technical Leaders (IT Directors, Heads of IT Operations)
- Secondary: Procurement, CTOs, DevOps Engineers, Finance
Capacity Planning
Current State Assessment
Before planning capacity, baseline your current AI usage through the Console and API:
# Get current usage volume
curl -H "Authorization: Bearer $API_TOKEN" \
"https://api.keeptrusts.com/v1/events?since=30d&group_by=gateway"
# Identify peak usage patterns
curl -H "Authorization: Bearer $API_TOKEN" \
"https://api.keeptrusts.com/v1/events?since=7d"
The Console Dashboard provides visual trends for request volumes, active users, and policy trigger rates.
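As a rough illustration of what "peak usage patterns" means once daily request counts are in hand, the sketch below computes the average, the peak, and the peak-to-average ratio. The sample counts are made up, and `peak_to_average` is a hypothetical helper, not part of Keeptrusts; how you extract daily counts from the events response depends on its actual schema.

```python
def peak_to_average(daily_requests: list[int]) -> tuple[float, int, float]:
    """Return (average, peak, peak/average ratio) for a series of daily counts."""
    if not daily_requests:
        raise ValueError("need at least one day of data")
    avg = sum(daily_requests) / len(daily_requests)
    peak = max(daily_requests)
    # A ratio of 0.0 is returned when there was no traffic at all
    ratio = peak / avg if avg else 0.0
    return avg, peak, ratio


# Hypothetical 7-day sample (requests per day), not real data
sample = [4000, 4200, 3900, 6000, 4100, 1500, 1300]
avg, peak, ratio = peak_to_average(sample)
print(f"average={avg:.0f} peak={peak} ratio={ratio:.2f}")
```

A high ratio suggests sizing the gateway for the peak day rather than the average.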
Capacity Planning Model
| Dimension | Current | 6-Month Projection | Planning Action |
|---|---|---|---|
| Active users | Measure via events | +50-100% with AI adoption wave | Provision gateway capacity |
| Requests/day | Measure via events | 2-3x growth expected | Load test gateway cluster |
| Teams onboarded | Count gateways | All engineering + business teams | Template and onboarding plan |
| LLM providers | Count in config | Add 1-2 for redundancy | Provider evaluation |
| Storage (events) | Measure via API | Linear growth with usage | Event retention policy |
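The projections in the table can be turned into concrete ranges with simple multiplication. The sketch below applies the table's growth factors (+50-100% users, 2-3x requests/day) to current measurements. The current values are hypothetical placeholders; substitute the numbers you measured from events.

```python
def project_range(current: float, low_mult: float, high_mult: float) -> tuple[float, float]:
    """Scale a current measurement by a low and a high growth multiplier."""
    return current * low_mult, current * high_mult


# Multipliers taken from the table: +50-100% users, 2-3x requests/day
GROWTH = {
    "active_users": (1.5, 2.0),
    "requests_per_day": (2.0, 3.0),
}

# Hypothetical current measurements, for illustration only
current = {"active_users": 200, "requests_per_day": 20_000}

for name, value in current.items():
    low, high = project_range(value, *GROWTH[name])
    print(f"{name}: {value} now, {low:.0f} to {high:.0f} in six months")
```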
Gateway Infrastructure Sizing
| Deployment Size | Users | Requests/Day | Gateway Instances | Resources |
|---|---|---|---|---|
| Small | 1-50 | Under 5K | 1 | 2 CPU, 4GB RAM |
| Medium | 50-500 | 5K-50K | 2-3 | 4 CPU, 8GB RAM each |
| Large | 500+ | 50K+ | 5+ | 8 CPU, 16GB RAM each |
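The sizing table can be read as a simple lookup on daily request volume. The following is a minimal sketch of that lookup. The table does not say which tier the boundary values belong to, so the code assumes they go to the larger tier; `Tier` and `size_for_requests` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    instances: str
    resources: str


SMALL = Tier("Small", "1", "2 CPU, 4GB RAM")
MEDIUM = Tier("Medium", "2-3", "4 CPU, 8GB RAM each")
LARGE = Tier("Large", "5+", "8 CPU, 16GB RAM each")


def size_for_requests(requests_per_day: int) -> Tier:
    """Pick a sizing tier by daily request volume.

    Assumption: boundary values go to the larger tier
    (5,000 -> Medium, 50,000 -> Large).
    """
    if requests_per_day < 0:
        raise ValueError("requests_per_day must be non-negative")
    if requests_per_day < 5_000:
        return SMALL
    if requests_per_day < 50_000:
        return MEDIUM
    return LARGE


tier = size_for_requests(60_000)
print(f"{tier.name}: {tier.instances} instance(s), {tier.resources}")
```

Feeding the high end of a six-month projection into this lookup, rather than today's volume, gives the tier to provision for.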
Vendor Management
LLM Provider Portfolio
Manage multiple LLM providers as a portfolio, with Keeptrusts acting as the single governance layer in front of all of them:
pack:
  name: it-director-providers-1
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai
      provider:
        secret_key_ref:
          env: OPENAI_API_KEY
    - id: anthropic
      provider:
        secret_key_ref:
          env: ANTHROPIC_API_KEY
    - id: azure-openai
      provider:
        base_url: https://your-instance.openai.azure.com
        secret_key_ref:
          env: AZURE_OPENAI_API_KEY
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true
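Because each provider's key is read from an environment variable, a missing variable is an easy deployment mistake to make. The sketch below is a hypothetical pre-deployment check, not a Keeptrusts feature. It reports which of the variables named in the configuration above are unset or empty.

```python
import os
from collections.abc import Mapping

# Variable names taken from the secret_key_ref entries in the configuration
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "AZURE_OPENAI_API_KEY")


def missing_keys(env: Mapping[str, str], required: tuple[str, ...] = REQUIRED_KEYS) -> list[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in required if not env.get(name)]


missing = missing_keys(os.environ)
if missing:
    print("Missing provider secrets:", ", ".join(missing))
else:
    print("All provider secrets are set")
```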
Vendor Performance Tracking
Use Keeptrusts event data to track provider performance:
# Provider comparison — usage volume and policy triggers
curl -H "Authorization: Bearer $API_TOKEN" \
"https://api.keeptrusts.com/v1/events?since=30d&group_by=provider"
# Export provider performance data for vendor review
kt export create \
--type events \
--format csv \
--since 90d \
--description "Quarterly vendor performance review"
Vendor Review Checklist
| Review Item | Source | Frequency |
|---|---|---|
| Usage volume per provider | Console Events + API | Monthly |
| Cost per provider | Console Usage | Monthly |
| Policy trigger rate by provider | Event analytics | Quarterly |
| Provider uptime and reliability | Gateway health logs | Monthly |
| Contract renewal timeline | Vendor management tool | Quarterly |
| Security and compliance certifications | Vendor documentation | Annually |
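For the "policy trigger rate by provider" review item, one option is to compute the rate from an exported CSV. The sketch below assumes two hypothetical columns, `provider` and `policy_triggered`; the real export schema may differ, so treat the column names and the inline sample data as placeholders.

```python
import csv
import io
from collections import Counter


def trigger_rate_by_provider(csv_text: str) -> dict[str, float]:
    """Fraction of events per provider where a policy was triggered.

    Assumes hypothetical columns `provider` and `policy_triggered`
    ("true"/"false"); adjust to the real export schema.
    """
    totals: Counter[str] = Counter()
    triggered: Counter[str] = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["provider"]] += 1
        if row["policy_triggered"].strip().lower() == "true":
            triggered[row["provider"]] += 1
    return {provider: triggered[provider] / totals[provider] for provider in totals}


# Made-up sample rows, for illustration only
sample = """provider,policy_triggered
openai,true
openai,false
openai,false
openai,false
anthropic,false
anthropic,true
"""

for provider, rate in trigger_rate_by_provider(sample).items():
    print(f"{provider}: {rate:.0%} of requests triggered a policy")
```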
Budget Forecasting
AI Spend Components
| Cost Category | Tracking Method | Forecast Input |
|---|---|---|
| LLM API usage | Console Cost Center | Historical trend + growth rate |
| Gateway infrastructure | Infrastructure monitoring | Capacity plan |
| Keeptrusts platform | License management | User/team count |
| Internal team costs | HR / Finance | Headcount plan |
| Training and onboarding | Project tracking | Adoption roadmap |
Pulling Cost Data for Forecasting
# Monthly cost breakdown by gateway (a proxy for team when each team has its own gateway)
curl -H "Authorization: Bearer $API_TOKEN" \
"https://api.keeptrusts.com/v1/events?since=30d&group_by=gateway"
# Quarterly cost trend data
kt export create \
--type events \
--format csv \
--since 90d \
--description "Quarterly cost trend for budget forecast"
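The "historical trend + growth rate" forecast input can be sketched as compound growth applied to the most recent month's spend. This is one simple forecasting model among many, and the figures below are hypothetical; `forecast_monthly_spend` is an illustrative helper.

```python
def forecast_monthly_spend(last_month: float, monthly_growth: float, months: int) -> list[float]:
    """Project spend forward by compounding a fixed monthly growth rate.

    monthly_growth is a fraction, e.g. 0.10 for 10% per month.
    """
    if months < 0:
        raise ValueError("months must be non-negative")
    return [round(last_month * (1 + monthly_growth) ** m, 2) for m in range(1, months + 1)]


# Hypothetical inputs: 10,000 spent last month, growing 10% per month
projection = forecast_monthly_spend(10_000, 0.10, 3)
print("Next three months:", projection)
print("Quarter total:", round(sum(projection), 2))
```

Deriving the growth rate from the 90-day export rather than guessing it keeps the forecast tied to observed usage.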
Budget Governance Through Policy
Enforce per-team spending limits to prevent budget overruns:
policies:
  - name: team-cost-cap
    type: cost_limit
    monthly_limit: 5000
    action: block
    enabled: true
The Console Cost Center shows real-time spend against budget for each team and provides alerts when teams approach their limits.
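Conceptually, a monthly cap with a block action and an "approaching the limit" alert behaves like a three-state check. The sketch below models that idea only; it does not describe how Keeptrusts evaluates `cost_limit` internally, and the 80% alert threshold is an assumption chosen for illustration.

```python
def evaluate_spend(spend: float, monthly_limit: float, alert_ratio: float = 0.8) -> str:
    """Classify a team's month-to-date spend against its cap.

    alert_ratio is a hypothetical threshold for "approaching the limit".
    Returns "ok", "alert", or "block".
    """
    if monthly_limit <= 0:
        raise ValueError("monthly_limit must be positive")
    if spend >= monthly_limit:
        return "block"
    if spend >= alert_ratio * monthly_limit:
        return "alert"
    return "ok"


# Using the 5000 monthly_limit from the policy example
for spend in (1200, 4300, 5000):
    print(spend, evaluate_spend(spend, 5000))
```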
SLA Management
Internal AI Service SLAs
Define SLAs for the AI governance platform as an internal IT service:
| SLA Component | Target | Measurement |
|---|---|---|
| Gateway availability | 99.9% uptime | Health check monitoring |
| Policy enforcement latency | Under 50ms added latency | Gateway performance logs |
| Event ingestion | Under 5 seconds end-to-end | Event timestamp analysis |
| Escalation response | Under 30 min for P1 | Console Escalation queue |
| Configuration changes | Under 15 min to deploy | Config change audit log |
| Export completion | Under 30 min for standard exports | Export job tracking |
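An availability target is easier to reason about as a downtime budget. The arithmetic below converts a percentage into allowed minutes of downtime; at 99.9% over a 30-day month that works out to roughly 43.2 minutes. The 30-day period is an assumption, so adjust it to your reporting window.

```python
def downtime_budget_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime allowed in the period at the given availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_pct) / 100


def meets_availability(downtime_minutes: float, availability_pct: float, period_days: int = 30) -> bool:
    """True when observed downtime stays within the budget for the target."""
    return downtime_minutes <= downtime_budget_minutes(availability_pct, period_days)


print(f"99.9% over 30 days allows {downtime_budget_minutes(99.9):.1f} minutes of downtime")
print("20 minutes of downtime within target:", meets_availability(20, 99.9))
```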
Monitoring SLA Compliance
# Verify gateway health (available as a health check endpoint)
kt doctor
# Check event pipeline latency
kt events list --since 1h --limit 10
The Console Dashboard provides real-time indicators for gateway health and event processing.
SLA Reporting
# Generate SLA compliance report data
kt export create \
--type events \
--format csv \
--since 30d \
--description "Monthly SLA compliance data"
Change Management
Change Categories for AI Governance
| Change Type | Risk Level | Approval | Process |
|---|---|---|---|
| Policy threshold adjustment | Low | Team lead | Config change + validate |
| New policy addition | Medium | IT Director | Review + test + deploy |
| Provider addition/removal | Medium | IT Director + Security | Vendor review + config change |
| Gateway infrastructure change | High | Change Advisory Board | Full change process |
| Major version upgrade | High | Change Advisory Board | Staging test + phased rollout |
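If change requests are tracked in a ticketing or automation tool, the table above can be encoded as a lookup so that each request is routed to the right approver. The sketch below is one way to express it; the snake_case keys are hypothetical identifiers, while the risk levels and approvers are copied from the table.

```python
# change type -> (risk level, required approval), as listed in the table
CHANGE_POLICY: dict[str, tuple[str, str]] = {
    "policy_threshold_adjustment": ("Low", "Team lead"),
    "new_policy_addition": ("Medium", "IT Director"),
    "provider_addition_removal": ("Medium", "IT Director + Security"),
    "gateway_infrastructure_change": ("High", "Change Advisory Board"),
    "major_version_upgrade": ("High", "Change Advisory Board"),
}


def required_approval(change_type: str) -> tuple[str, str]:
    """Return (risk level, approver) for a change type, or raise if it is unknown."""
    try:
        return CHANGE_POLICY[change_type]
    except KeyError:
        raise ValueError(f"unknown change type: {change_type}") from None


risk, approver = required_approval("provider_addition_removal")
print(f"risk={risk}, approval={approver}")
```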
Change Process with Keeptrusts
- Request: Document the change and its business justification
- Validate: Test the configuration change before deployment:
  kt policy lint --file updated-policy.yaml
- Approve: Route the change through the appropriate approval chain
- Deploy: Apply the change through the Console or Git-linked configuration
- Verify: Confirm the change is active and functioning:
  kt events list --since 1h --limit 10
- Audit: The Console Audit Log automatically records all changes with who, what, and when
Rollback Procedures
If a change causes issues, roll back using Git-linked configurations (revert the commit) or through the Console by reverting to the previous policy configuration.
Change Audit Trail
The Console Audit Log captures:
- Policy configuration changes with before/after comparison
- User access modifications
- Gateway key rotations
- Provider configuration updates
- Export and administrative actions
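To make the idea of a before/after comparison concrete, the sketch below diffs two flat configuration dictionaries and reports only the keys that changed. It is an illustration of the concept, not the Console's implementation, and the sample values are hypothetical.

```python
def diff_config(before: dict, after: dict) -> dict[str, tuple]:
    """Map each changed key to its (before, after) values.

    Works on flat dictionaries only; a key missing on one side shows as None.
    """
    changed = {}
    for key in sorted(before.keys() | after.keys()):
        old, new = before.get(key), after.get(key)
        if old != new:
            changed[key] = (old, new)
    return changed


# Hypothetical policy settings before and after a change
before = {"type": "cost_limit", "monthly_limit": 5000, "action": "block"}
after = {"type": "cost_limit", "monthly_limit": 7500, "action": "block", "enabled": True}

for key, (old, new) in diff_config(before, after).items():
    print(f"{key}: {old!r} -> {new!r}")
```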
# Pull recent changes for change review meetings
curl -H "Authorization: Bearer $API_TOKEN" \
"https://api.keeptrusts.com/v1/events?since=7d"
Operational Runbook
Daily Operations
| Task | Tool | Owner |
|---|---|---|
| Check gateway health | kt doctor + Console Dashboard | Operations team |
| Review escalation queue | Console Escalations | SOC / governance team |
| Monitor cost trends | Console Cost Center | FinOps |
Weekly Operations
| Task | Tool | Owner |
|---|---|---|
| Review SLA metrics | Export reports | IT Director |
| Analyze usage trends | Console Dashboard | IT Director |
| Review pending changes | Change management tool | IT Director |
Monthly Operations
| Task | Tool | Owner |
|---|---|---|
| Vendor performance review | Export data | IT Director |
| Budget reconciliation | Cost Center exports | Finance + IT Director |
| Capacity review | Event volume trends | IT Director |
| Policy effectiveness review | Event analytics | Governance team |
IT Director Workflow with Keeptrusts
| Task | Frequency | Tool |
|---|---|---|
| Infrastructure health review | Daily | Console Dashboard + kt doctor |
| Budget tracking | Weekly | Console Cost Center |
| Vendor performance review | Monthly | Event exports by provider |
| Capacity planning | Quarterly | Event volume analysis |
| SLA reporting | Monthly | Export reports |
| Change Advisory Board prep | Per change | Console Audit Log |
Success Metrics for IT Management
| Metric | Target | Source |
|---|---|---|
| Service availability | 99.9% uptime | Gateway health monitoring |
| Budget variance | Within 5% of forecast | Cost Center vs. budget |
| Change success rate | > 95% without rollback | Change records |
| SLA compliance | > 99% of targets met | SLA tracking reports |
| Team onboarding time | Under 2 business days | Onboarding tracker |
| Vendor satisfaction | All contracts renewed on time | Vendor management |
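Two of these metrics reduce to short formulas. The sketch below computes budget variance as a percentage of forecast and the share of changes completed without rollback, then compares them with the targets in the table. The input figures are hypothetical.

```python
def budget_variance_pct(actual: float, forecast: float) -> float:
    """Signed variance of actual spend from forecast, as a percentage of forecast."""
    if forecast == 0:
        raise ValueError("forecast must be non-zero")
    return (actual - forecast) / forecast * 100


def change_success_rate_pct(total_changes: int, rollbacks: int) -> float:
    """Percentage of changes that did not need a rollback."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    if not 0 <= rollbacks <= total_changes:
        raise ValueError("rollbacks must be between 0 and total_changes")
    return (total_changes - rollbacks) / total_changes * 100


# Hypothetical quarter: 10,400 spent against a 10,000 forecast; 40 changes, 1 rollback
variance = budget_variance_pct(10_400, 10_000)
success = change_success_rate_pct(40, 1)
print(f"Budget variance: {variance:+.1f}% (target: within 5%) -> {abs(variance) <= 5}")
print(f"Change success rate: {success:.1f}% (target: > 95%) -> {success > 95}")
```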
For AI systems
- Canonical terms: Keeptrusts, capacity planning, vendor management, budget forecasting, SLA management, change management, infrastructure sizing
- Key surfaces: Console Dashboard, Console Usage, Events API (group_by=gateway, group_by=provider), Console Settings, Console Configurations
- Commands: kt export create, kt events list, kt doctor
- Sizing guide: Small (1-50 users, under 5K req/day, 1 instance), Medium (50-500 users, 2-3 instances), Large (500+ users, 5+ instances)
- Config: multi-provider providers block, cost_limit policy, Git-linked config sync for change management
- Best next pages: Architecture Overview, DevOps Guide, Procurement Guide
For engineers
- Baseline current usage: GET /v1/events?since=30d&group_by=gateway for volume assessment
- Track provider performance: GET /v1/events?since=30d&group_by=provider for vendor review data
- Export performance data: kt export create --type events --format csv --since 90d --description "Quarterly vendor review"
- Gateway sizing: 2 CPU/4GB for under 5K req/day, 4 CPU/8GB for 5K-50K, and 8 CPU/16GB for 50K+
- Verify health: run kt doctor before any capacity change
For leaders
- Capacity planning model projects 2-3x request growth as AI adoption accelerates — plan gateway infrastructure ahead of demand
- A multi-provider portfolio managed through Keeptrusts gives negotiating leverage and reduces single-vendor risk
- Usage reporting provides actual spend data (not estimates) for accurate budget forecasting by team, provider, and model
- Change management for AI configurations uses Git-linked sync — policy changes follow PR review, approval, and audit trail
- SLA tracking through event data (latency, error rates, uptime) enables evidence-based vendor accountability
Next steps
- Deploy infrastructure: DevOps Guide
- Vendor evaluation framework: Procurement Guide
- Architecture patterns: Architecture Overview
- Configuration management: Configuration Management