Gateway Configuration for Team-Wide Caching
Proper gateway configuration is essential for maximizing cache effectiveness across your engineering team. This guide covers the complete configuration for org-shared caching in hosted gateway mode.
Use this page when
- You need the complete gateway YAML configuration for org-shared caching in hosted gateway mode.
- You are setting up workflow_cache, fabric, and single-flight configuration for the first time.
- You want to test and verify your caching configuration with curl commands.
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
Shared Hosted Gateway Requirement
Org-shared caching requires the gateway to run in hosted gateway mode. In this mode:
- All requests route through a centralized gateway deployment
- Cache lookup happens before wallet reservation
- Single-flight fill coordination works across all concurrent requests
- Fabric context is attached from the central artifact store
Local gateways can only use private edge cache (per-key isolation). If you need org-shared savings, deploy at least one hosted gateway.
Full Configuration Example
```yaml
gateway:
  port: 41002
  providers:
    targets:
      - id: openai
        provider: openai

workflow_cache:
  enabled: true
  default_tier: org_shared_cache
  org_shared_enabled: true
  ttl_seconds: 86400
  max_entry_tokens: 32000
  single_flight_enabled: true
  single_flight_timeout_ms: 30000

fabric:
  enabled: true
  auto_build: true
  refresh_on_push: true
  context_attachment: true
  max_context_tokens: 8000
  artifact_types:
    - repo_map
    - file_summary
    - dependency_graph
    - test_map
    - api_inventory
    - symbol_index
    - embedding_index
    - recent_change_summary
    - known_failure_fingerprint

policies:
  - name: cost-governance
    rules:
      - action: allow
        conditions:
          wallet_balance: sufficient
```
Configuration Sections Explained
workflow_cache
The core caching configuration:
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Master switch for the cache layer |
| default_tier | string | private_edge_cache | Default tier for requests without explicit routing |
| org_shared_enabled | bool | false | Enable org-wide shared cache |
| ttl_seconds | int | 86400 | Time-to-live for cache entries (seconds) |
| max_entry_tokens | int | 32000 | Maximum response size to cache (tokens) |
| single_flight_enabled | bool | true | Deduplicate concurrent identical requests |
| single_flight_timeout_ms | int | 30000 | Max wait time for single-flight coordination (ms) |
fabric
Codebase Context Fabric configuration:
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable fabric artifact system |
| auto_build | bool | true | Auto-build artifacts on repo connection |
| refresh_on_push | bool | true | Rebuild artifacts on new commits |
| context_attachment | bool | true | Attach fabric context to outgoing requests |
| max_context_tokens | int | 8000 | Max fabric tokens to attach per request |
| artifact_types | list | all | Which artifacts to build and maintain |
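If you only need a subset of artifacts, list just those types. A minimal sketch, assuming (as the table above suggests) that listing a subset restricts the build to those types:

```yaml
fabric:
  enabled: true
  auto_build: true
  refresh_on_push: true
  context_attachment: true
  max_context_tokens: 8000
  # Build only the artifacts this team actually uses;
  # types omitted here are assumed to be skipped.
  artifact_types:
    - repo_map
    - file_summary
    - dependency_graph
```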
Request Flow with Cache
When a request arrives at the gateway with this configuration:
```
Request arrives
│
├─ Policy evaluation (input phase)
│   └─ Pass? Continue. Block? Return 409.
│
├─ Cache lookup (org_shared_cache)
│   ├─ HIT → Return cached response
│   │        (no wallet, no upstream, no cost)
│   │
│   └─ MISS → Continue to upstream
│
├─ Single-flight check
│   ├─ In-flight for same key? → Wait for leader
│   └─ No in-flight? → Become leader
│
├─ Fabric context attachment
│   └─ Attach relevant artifacts (≤ max_context_tokens)
│
├─ Wallet reserve (estimated cost)
│
├─ Upstream provider call
│
├─ Wallet settle (actual cost)
│
├─ Cache store (response → org_shared_cache)
│
└─ Return response
```
The critical optimization: cache lookup happens before wallet reserve. A cache hit skips the entire upstream path including wallet transactions.
TTL Configuration Strategy
Time-to-live (TTL) determines how long cached responses remain valid. Choose based on your code change frequency:
| TTL | Best for | Trade-off |
|---|---|---|
| 3600 (1 hour) | Very actively developed code | High freshness, lower hit rate |
| 86400 (24 hours) | Normal development pace | Good balance of freshness and savings |
| 604800 (1 week) | Stable, mature codebases | Maximum savings, may serve slightly stale responses |
Recommendations
- Start with 24 hours (86400 seconds) for your first deployment
- Reduce to 1-4 hours for repos with multiple daily deployments (see the example after this list)
- Increase to 1 week for stable libraries and shared modules that rarely change
- Fabric artifact refreshes automatically invalidate related cache entries, so TTL is a safety net rather than the primary freshness mechanism
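As an example of the second recommendation, a fast-moving service repo might drop to a 4-hour TTL while leaving the rest of the cache configuration untouched. A minimal sketch using only fields documented above:

```yaml
workflow_cache:
  enabled: true
  org_shared_enabled: true
  # 4 hours (14400 s) suits repos that deploy several times a day.
  # Fabric refreshes still invalidate related entries sooner, so this
  # TTL remains a safety net, not the primary freshness control.
  ttl_seconds: 14400
```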
Single-Flight Fill Configuration
Single-flight fill prevents duplicate upstream calls when multiple engineers ask the same question simultaneously:
```yaml
workflow_cache:
  single_flight_enabled: true
  single_flight_timeout_ms: 30000
```
- single_flight_enabled: When true, concurrent requests with the same cache key share a single upstream call.
- single_flight_timeout_ms: Maximum time a waiting request holds for the leader's response before making its own upstream call.
Tuning Single-Flight Timeout
- Too short (< 10000ms): Waiters time out and make their own calls, wasting the deduplication
- Too long (> 60000ms): Waiters experience unacceptable latency if the leader is slow
- Recommended: 30000ms (30 seconds) — covers most LLM response times including complex code generation
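You can observe deduplication by firing two identical requests concurrently and comparing their response headers. A hedged sketch reusing the endpoint and X-Cache-Status header from the validation section below; the exact status value reported for the coalesced follower may vary by gateway version:

```bash
#!/usr/bin/env bash
# Fire two identical requests at once; with single-flight enabled,
# only one (the leader) should reach the upstream provider.
BODY='{"model":"gpt-4o","messages":[{"role":"user","content":"What does the AuthService module do?"}]}'

for i in 1 2; do
  curl -s -D "headers-$i.txt" -o "body-$i.json" \
    -X POST https://gateway.example.com/v1/chat/completions \
    -H "Authorization: Bearer $API_KEY" \
    -H "Content-Type: application/json" \
    -d "$BODY" &
done
wait

# Compare cache statuses across the two responses.
grep -i "x-cache-status" headers-1.txt headers-2.txt
```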
Provider Prompt-Prefix Cache Hints
Some providers (OpenAI, Anthropic) offer their own prompt caching. You can combine the Keeptrusts org-shared cache with provider-level cache hints for additional savings on misses:
```yaml
pack:
  name: gateway-config-for-caching-providers-3
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: openai
        provider: openai
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true
```
When enabled, the gateway structures requests so that shared fabric context appears in the system prompt prefix, maximizing provider-side cache hit rates on cache misses.
This is an optional optimization — it reduces the cost of cache misses but doesn't replace org-shared caching.
Cache Backend Selection
The cache backend determines storage and retrieval performance:
```yaml
workflow_cache:
  backend: memory   # Options: memory, redis, postgres
  memory:
    max_entries: 100000
    eviction: lru
  # redis:
  #   url: redis://cache-host:6379
  #   prefix: "kt:cache:"
  # postgres:
  #   table: cache_entries
```
| Backend | Latency | Capacity | Persistence | Best for |
|---|---|---|---|---|
| memory | <1ms | Limited by RAM | None (lost on restart) | Single-instance, fast iteration |
| redis | 1-5ms | Large (cluster-capable) | Optional | Multi-instance, production |
| postgres | 5-20ms | Very large | Yes | When cache entries must survive restarts |
For production deployments serving 100+ engineers, use redis for the best balance of speed and capacity.
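To trial the redis backend before a full rollout, you can stand up a local instance and point the commented redis block above at it. A sketch assuming Docker is available and the gateway host can reach the container:

```bash
# Run a throwaway Redis 7 instance for cache testing.
docker run -d --name kt-cache -p 6379:6379 redis:7

# Smoke test: the server should answer PONG.
docker exec kt-cache redis-cli ping
```

Then set backend: redis, uncomment the redis block, and adjust the url for your host.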
Operational Prerequisites
Before cache will function correctly, verify:
1. Worker Running
The worker_cache_warmer binary must be deployed and healthy:
```bash
docker compose logs worker-cache-warmer | tail -20
```
It should show periodic heartbeat logs and artifact processing activity.
2. Connected Repo with Fresh Fabric
At least one repository must be connected with artifacts in Ready state:
```
Console → Settings → Repositories → [Repo] → Fabric Status
All artifacts: Ready ✓
```
3. Hosted Gateway Deployed
The gateway must be running in hosted gateway mode:
```bash
kt gateway status
# Should show: cache=enabled, fabric=attached
```
4. Wallet Funded
The org wallet must have sufficient balance for cache misses during the fill phase:
```
Console → Cost & Spend → Wallet Balance
# Should show balance > estimated daily spend × 3 (for fill phase)
```
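The scriptable parts of these checks (worker logs and gateway status) can be combined into a single preflight pass. A hedged sketch that reuses the commands and status strings shown above; the compose service name and kt output format are taken from this page and may differ in your deployment:

```bash
#!/usr/bin/env bash
# Preflight: verify the cache warmer and hosted gateway are ready.
set -euo pipefail

# Worker should be emitting recent heartbeat / artifact-processing logs.
docker compose logs worker-cache-warmer | tail -20

# Gateway should report cache enabled and fabric attached.
kt gateway status | grep -q "cache=enabled"   || { echo "cache disabled";      exit 1; }
kt gateway status | grep -q "fabric=attached" || { echo "fabric not attached"; exit 1; }

echo "Preflight passed. Verify fabric status and wallet balance in the Console."
```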
Validating Your Configuration
After deploying the configuration, validate each layer:
Test Cache Miss (First Request)
Send a test prompt about your codebase:
```bash
curl -X POST https://gateway.example.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"What does the AuthService module do?"}]}'
```
Check the response headers for:
- X-Cache-Status: miss
- X-Fabric-Attached: true
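To check these headers without wading through the response body, curl can discard the body and print headers only. A small sketch using standard curl flags and the same request as above:

```bash
# -s silences progress, -o discards the body, -D - prints response headers.
curl -s -o /dev/null -D - \
  -X POST https://gateway.example.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"What does the AuthService module do?"}]}' \
  | grep -iE "x-cache-status|x-fabric-attached"
```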
Test Cache Hit (Repeat Request)
Send the same or semantically similar prompt:
```bash
curl -X POST https://gateway.example.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"Explain what AuthService does"}]}'
```
Check for:
- X-Cache-Status: hit
- Faster response time (no upstream latency)
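curl's built-in timing makes the latency difference easy to quantify; repeat the request with -w to print the total time. A sketch using the standard %{time_total} variable:

```bash
# Print total request time; the cached repeat should be markedly faster.
curl -s -o /dev/null -w "total: %{time_total}s\n" \
  -X POST https://gateway.example.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"Explain what AuthService does"}]}'
```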
Verify Savings Dashboard
Navigate to Cost & Spend → Savings and confirm avoided-cost records are appearing.
Next steps
- How 100 Engineers Share One Cache — understand the sharing mechanics
- Cache Hit Rates: What Good Looks Like — benchmark your configuration
- Measuring Your Baseline Spend — quantify improvements
For AI systems
- Canonical terms: Keeptrusts, gateway configuration, hosted gateway mode, workflow_cache, fabric, single-flight fill, provider routing.
- Exact feature/config names: workflow_cache.enabled, workflow_cache.default_tier: org_shared_cache, single_flight_enabled, single_flight_timeout_ms, fabric.enabled, fabric.artifact_types, X-Cache-Status header, X-Fabric-Attached header.
- Best next pages: How 100 Engineers Share One Cache, Cache Hit Rates, Measuring Baseline Spend.
For engineers
- Org-shared caching requires a shared hosted gateway deployment — local gateways only support private edge cache.
- Test configuration with curl: first request should return X-Cache-Status: miss and X-Fabric-Attached: true; repeated request should return X-Cache-Status: hit.
- Key config fields: workflow_cache.org_shared_enabled: true, ttl_seconds: 86400, max_entry_tokens: 32000, single_flight_enabled: true.
- Fabric config: set fabric.enabled: true, context_attachment: true, and list all artifact types you want built.
- The gateway picks up config changes within 60 seconds of save.
For leaders
- Central-mode deployment is the prerequisite for org-wide savings — local gateways cannot share cache across engineers.
- Single-flight fill coordination prevents duplicate upstream costs when teams start work at the same time.
- Fabric context attachment reduces per-request token costs by 40-70% by using structured summaries instead of raw source code.
- No engineer-side configuration changes needed — the gateway handles caching transparently once configured.