Gateway Configuration for Team-Wide Caching

Proper gateway configuration is essential for maximizing cache effectiveness across your engineering team. This guide covers the complete configuration for org-shared caching in hosted gateway mode.

Use this page when

  • You need the complete gateway YAML configuration for org-shared caching in hosted gateway mode.
  • You are setting up workflow_cache, fabric, and single-flight configuration for the first time.
  • You want to test and verify your caching configuration with curl commands.

Primary audience

  • Primary: Technical Engineers
  • Secondary: AI Agents, Technical Leaders

Shared Hosted Gateway Requirement

Org-shared caching requires the gateway to run in hosted gateway mode. In hosted gateway mode:

  • All requests route through a centralized gateway deployment
  • Cache lookup happens before wallet reservation
  • Single-flight fill coordination works across all concurrent requests
  • Fabric context is attached from the central artifact store

Local gateways can only use private edge cache (per-key isolation). If you need org-shared savings, deploy at least one hosted gateway.

Full Configuration Example

gateway:
  port: 41002
  providers:
    targets:
      - id: openai
        provider: openai
  workflow_cache:
    enabled: true
    default_tier: org_shared_cache
    org_shared_enabled: true
    ttl_seconds: 86400
    max_entry_tokens: 32000
    single_flight_enabled: true
    single_flight_timeout_ms: 30000
  fabric:
    enabled: true
    auto_build: true
    refresh_on_push: true
    context_attachment: true
    max_context_tokens: 8000
    artifact_types:
      - repo_map
      - file_summary
      - dependency_graph
      - test_map
      - api_inventory
      - symbol_index
      - embedding_index
      - recent_change_summary
      - known_failure_fingerprint
  policies:
    - name: cost-governance
      rules:
        - action: allow
          conditions:
            wallet_balance: sufficient

Configuration Sections Explained

workflow_cache

The core caching configuration:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | false | Master switch for the cache layer |
| default_tier | string | private_edge_cache | Default tier for requests without explicit routing |
| org_shared_enabled | bool | false | Enable org-wide shared cache |
| ttl_seconds | int | 86400 | Time-to-live for cache entries (seconds) |
| max_entry_tokens | int | 32000 | Maximum response size to cache (tokens) |
| single_flight_enabled | bool | true | Deduplicate concurrent identical requests |
| single_flight_timeout_ms | int | 30000 | Max wait time for single-flight coordination |
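
Taken together, a minimal sketch that turns on org-shared caching and leaves every other field at its default might look like this (field names as documented above):

```yaml
workflow_cache:
  enabled: true                    # master switch; defaults to false
  org_shared_enabled: true         # share entries org-wide
  default_tier: org_shared_cache   # route requests to the shared tier by default
```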

fabric

Codebase Context Fabric configuration:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | false | Enable fabric artifact system |
| auto_build | bool | true | Auto-build artifacts on repo connection |
| refresh_on_push | bool | true | Rebuild artifacts on new commits |
| context_attachment | bool | true | Attach fabric context to outgoing requests |
| max_context_tokens | int | 8000 | Max fabric tokens to attach per request |
| artifact_types | list | all | Which artifacts to build and maintain |
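
As one illustration, a token-budget-conscious deployment might build only summary-style artifacts. This is a sketch using the fields above; the reduced 4000-token budget and the artifact selection are illustrative choices, not recommended defaults:

```yaml
fabric:
  enabled: true
  context_attachment: true
  max_context_tokens: 4000    # illustrative: half the example budget
  artifact_types:             # build only lightweight summary artifacts
    - repo_map
    - file_summary
    - recent_change_summary
```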

Request Flow with Cache

When a request arrives at the gateway with this configuration:

Request arrives

├─ Policy evaluation (input phase)
│ └─ Pass? Continue. Block? Return 409.

├─ Cache lookup (org_shared_cache)
│ ├─ HIT → Return cached response
│ │ (no wallet, no upstream, no cost)
│ │
│ └─ MISS → Continue to upstream

├─ Single-flight check
│ ├─ In-flight for same key? → Wait for leader
│ └─ No in-flight? → Become leader

├─ Fabric context attachment
│ └─ Attach relevant artifacts (≤ max_context_tokens)

├─ Wallet reserve (estimated cost)

├─ Upstream provider call

├─ Wallet settle (actual cost)

├─ Cache store (response → org_shared_cache)

└─ Return response

The critical optimization: cache lookup happens before wallet reserve. A cache hit skips the entire upstream path including wallet transactions.

TTL Configuration Strategy

Time-to-live (TTL) determines how long cached responses remain valid. Choose based on your code change frequency:

| TTL | Best for | Trade-off |
| --- | --- | --- |
| 3600 (1 hour) | Very actively developed code | High freshness, lower hit rate |
| 86400 (24 hours) | Normal development pace | Good balance of freshness and savings |
| 604800 (1 week) | Stable, mature codebases | Maximum savings, may serve slightly stale responses |

Recommendations

  • Start with 24 hours (86400 seconds) for your first deployment
  • Reduce to 1-4 hours for repos with multiple daily deployments
  • Increase to 1 week for stable libraries and shared modules that rarely change
  • Fabric artifact refreshes automatically invalidate related cache entries, so TTL is a safety net rather than the primary freshness mechanism
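
For instance, a repo that deploys several times a day could override only the TTL, leaving the rest of the full example unchanged. The value below is one point inside the recommended 1-4 hour window, not a prescribed setting:

```yaml
workflow_cache:
  ttl_seconds: 14400   # 4 hours for a high-churn repo; fabric refreshes still invalidate earlier
```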

Single-Flight Fill Configuration

Single-flight fill prevents duplicate upstream calls when multiple engineers ask the same question simultaneously:

workflow_cache:
  single_flight_enabled: true
  single_flight_timeout_ms: 30000

  • single_flight_enabled: When true, concurrent requests with the same cache key share a single upstream call
  • single_flight_timeout_ms: How long a waiting request waits for the leader's response before falling back to its own upstream call

Tuning Single-Flight Timeout

  • Too short (< 10000ms): Waiters time out and make their own calls, wasting the deduplication
  • Too long (> 60000ms): Waiters experience unacceptable latency if the leader is slow
  • Recommended: 30000ms (30 seconds) — covers most LLM response times including complex code generation
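
As a hedged sketch, a latency-sensitive deployment could shorten the wait while keeping deduplication on. The 15000 ms figure is illustrative, chosen between the two failure modes above; 30000 ms remains the general recommendation:

```yaml
workflow_cache:
  single_flight_enabled: true
  single_flight_timeout_ms: 15000   # illustrative: fall back to an independent upstream call sooner
```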

Provider Prompt-Prefix Cache Hints

Some providers (OpenAI, Anthropic) offer their own prompt caching. You can combine Keeptrusts' org-shared cache with provider-level cache hints for additional savings on misses:

pack:
  name: gateway-config-for-caching-providers-3
  version: 1.0.0
  enabled: true
providers:
  targets:
    - id: openai
      provider: openai
policies:
  chain:
    - audit-logger
  policy:
    audit-logger:
      immutable: true
      retention_days: 365
      log_all_access: true

When enabled, the gateway structures requests so that shared fabric context appears in the system prompt prefix, maximizing provider-side cache hit rates on cache misses.

This is an optional optimization — it reduces the cost of cache misses but doesn't replace org-shared caching.

Cache Backend Selection

The cache backend determines storage and retrieval performance:

workflow_cache:
  backend: memory   # Options: memory, redis, postgres
  memory:
    max_entries: 100000
    eviction: lru
  # redis:
  #   url: redis://cache-host:6379
  #   prefix: "kt:cache:"
  # postgres:
  #   table: cache_entries

| Backend | Latency | Capacity | Persistence | Best for |
| --- | --- | --- | --- | --- |
| memory | <1ms | Limited by RAM | None (lost on restart) | Single-instance, fast iteration |
| redis | 1-5ms | Large (cluster-capable) | Optional | Multi-instance, production |
| postgres | 5-20ms | Very large | Yes | When you want cache entries durable |

For production deployments serving 100+ engineers, use redis for the best balance of speed and capacity.
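
Under that recommendation, a production sketch would switch the backend and uncomment the Redis settings shown in the example above (the hostname is a placeholder for your own Redis endpoint):

```yaml
workflow_cache:
  backend: redis
  redis:
    url: redis://cache-host:6379   # placeholder host; point at your Redis cluster
    prefix: "kt:cache:"            # key namespace, as in the example above
```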

Operational Prerequisites

Before the cache will function correctly, verify:

1. Worker Running

The worker_cache_warmer binary must be deployed and healthy:

docker compose logs worker-cache-warmer | tail -20

It should show periodic heartbeat logs and artifact processing activity.

2. Connected Repo with Fresh Fabric

At least one repository must be connected with artifacts in Ready state:

Console → Settings → Repositories → [Repo] → Fabric Status
All artifacts: Ready ✓

3. Hosted Gateway Deployed

The gateway must be running in hosted gateway mode:

kt gateway status
# Should show: cache=enabled, fabric=attached

4. Wallet Funded

The org wallet must have sufficient balance for cache misses during the fill phase:

Console → Cost & Spend → Wallet Balance
# Should show balance > estimated daily spend × 3 (for fill phase)
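
To make the buffer rule concrete, here is a small shell sketch; the daily-spend figure is a made-up example, not a product metric:

```shell
# Hedged sketch: size the wallet at 3x estimated daily spend to cover the fill phase.
# DAILY_SPEND_USD is an assumed illustrative figure.
DAILY_SPEND_USD=120
MIN_BALANCE=$((DAILY_SPEND_USD * 3))
echo "Minimum recommended wallet balance: \$${MIN_BALANCE}"
```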

Validating Your Configuration

After deploying the configuration, validate each layer:

Test Cache Miss (First Request)

Send a test prompt about your codebase:

curl -i -X POST https://gateway.example.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"What does the AuthService module do?"}]}'

Check the response headers for:

  • X-Cache-Status: miss
  • X-Fabric-Attached: true

Test Cache Hit (Repeat Request)

Send the same or semantically similar prompt:

curl -i -X POST https://gateway.example.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"Explain what AuthService does"}]}'

Check for:

  • X-Cache-Status: hit
  • Faster response time (no upstream latency)

Verify Savings Dashboard

Navigate to Cost & Spend → Savings and confirm avoided-cost records are appearing.

Next steps

For AI systems

  • Canonical terms: Keeptrusts, gateway configuration, hosted gateway mode, workflow_cache, fabric, single-flight fill, provider routing.
  • Exact feature/config names: workflow_cache.enabled, workflow_cache.default_tier: org_shared_cache, single_flight_enabled, single_flight_timeout_ms, fabric.enabled, fabric.artifact_types, X-Cache-Status header, X-Fabric-Attached header.
  • Best next pages: How 100 Engineers Share One Cache, Cache Hit Rates, Measuring Baseline Spend.

For engineers

  • Org-shared caching requires a shared hosted gateway deployment — local gateways only support private edge cache.
  • Test configuration with curl: first request should return X-Cache-Status: miss and X-Fabric-Attached: true; repeated request should return X-Cache-Status: hit.
  • Key config fields: workflow_cache.org_shared_enabled: true, ttl_seconds: 86400, max_entry_tokens: 32000, single_flight_enabled: true.
  • Fabric config: set fabric.enabled: true, context_attachment: true, and list all artifact types you want built.
  • The gateway picks up config changes within 60 seconds of save.

For leaders

  • Central-mode deployment is the prerequisite for org-wide savings — local gateways cannot share cache across engineers.
  • Single-flight fill coordination prevents duplicate upstream costs when teams start work at the same time.
  • Fabric context attachment reduces per-request token costs by 40-70% by using structured summaries instead of raw source code.
  • No engineer-side configuration changes needed — the gateway handles caching transparently once configured.