Gateway Configuration for Team-Wide Caching
Proper gateway configuration is essential for maximizing cache effectiveness across your engineering team. This guide covers the complete configuration for org-shared caching in hosted gateway mode.
Use this page when
- You need the complete gateway YAML configuration for org-shared caching in hosted gateway mode.
- You are setting up workflow_cache, fabric, and single-flight configuration for the first time.
- You want to test and verify your caching configuration with curl commands.
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
Shared Hosted Gateway Requirement
Org-shared caching requires the gateway to run in hosted gateway mode. In this mode:
- All requests route through a centralized gateway deployment
- Cache lookup happens before wallet reservation
- Single-flight fill coordination works across all concurrent requests
- Fabric context is attached from the central artifact store
Local gateways can only use private edge cache (per-key isolation). If you need org-shared savings, deploy at least one hosted gateway.
Full Configuration Example
```yaml
gateway:
  port: 41002
  providers:
    targets:
      - id: openai
        provider: openai

workflow_cache:
  enabled: true
  default_tier: org_shared_cache
  org_shared_enabled: true
  ttl_seconds: 86400
  max_entry_tokens: 32000
  single_flight_enabled: true
  single_flight_timeout_ms: 30000

fabric:
  enabled: true
  auto_build: true
  refresh_on_push: true
  context_attachment: true
  max_context_tokens: 8000
  artifact_types:
    - repo_map
    - file_summary
    - dependency_graph
    - test_map
    - api_inventory
    - symbol_index
    - embedding_index
    - recent_change_summary
    - known_failure_fingerprint

policies:
  - name: cost-governance
    rules:
      - action: allow
        conditions:
          wallet_balance: sufficient
```
Configuration Sections Explained
workflow_cache
The core caching configuration:
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Master switch for the cache layer |
| default_tier | string | private_edge_cache | Default tier for requests without explicit routing |
| org_shared_enabled | bool | false | Enable org-wide shared cache |
| ttl_seconds | int | 86400 | Time-to-live for cache entries (seconds) |
| max_entry_tokens | int | 32000 | Maximum response size to cache (tokens) |
| single_flight_enabled | bool | true | Deduplicate concurrent identical requests |
| single_flight_timeout_ms | int | 30000 | Max wait time for single-flight coordination (ms) |
fabric
Codebase Context Fabric configuration:
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable fabric artifact system |
| auto_build | bool | true | Auto-build artifacts on repo connection |
| refresh_on_push | bool | true | Rebuild artifacts on new commits |
| context_attachment | bool | true | Attach fabric context to outgoing requests |
| max_context_tokens | int | 8000 | Max fabric tokens to attach per request |
| artifact_types | list | all | Which artifacts to build and maintain |
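If you only need a subset of artifacts, list just those types. A minimal sketch, assuming (as the table above suggests) that listing a subset restricts the build to those types:

```yaml
fabric:
  enabled: true
  auto_build: true
  refresh_on_push: true
  context_attachment: true
  max_context_tokens: 8000
  # Build only the artifacts this team actually uses;
  # types omitted here are assumed to be skipped.
  artifact_types:
    - repo_map
    - file_summary
    - dependency_graph
```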
Request Flow with Cache
When a request arrives at the gateway with this configuration:
```
Request arrives
│
├─ Policy evaluation (input phase)
│   └─ Pass? Continue. Block? Return 409.
│
├─ Cache lookup (org_shared_cache)
│   ├─ HIT → Return cached response
│   │        (no wallet, no upstream, no cost)
│   │
│   └─ MISS → Continue to upstream
│
├─ Single-flight check
│   ├─ In-flight for same key? → Wait for leader
│   └─ No in-flight? → Become leader
│
├─ Fabric context attachment
│   └─ Attach relevant artifacts (≤ max_context_tokens)
│
├─ Wallet reserve (estimated cost)
│
├─ Upstream provider call
│
├─ Wallet settle (actual cost)
│
├─ Cache store (response → org_shared_cache)
│
└─ Return response
```
The critical optimization: cache lookup happens before wallet reserve. A cache hit skips the entire upstream path including wallet transactions.
TTL Configuration Strategy
Time-to-live (TTL) determines how long cached responses remain valid. Choose based on your code change frequency:
| TTL | Best for | Trade-off |
|---|---|---|
| 3600 (1 hour) | Very actively developed code | High freshness, lower hit rate |
| 86400 (24 hours) | Normal development pace | Good balance of freshness and savings |
| 604800 (1 week) | Stable, mature codebases | Maximum savings, may serve slightly stale responses |
Recommendations
- Start with 24 hours (86400 seconds) for your first deployment
- Reduce to 1-4 hours for repos with multiple daily deployments (see the example after this list)
- Increase to 1 week for stable libraries and shared modules that rarely change
- Fabric artifact refreshes automatically invalidate related cache entries, so TTL is a safety net rather than the primary freshness mechanism
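As an example of the second recommendation, a fast-moving service repo might drop to a 4-hour TTL while leaving the rest of the cache configuration untouched. A minimal sketch using only fields documented above:

```yaml
workflow_cache:
  enabled: true
  org_shared_enabled: true
  # 4 hours (14400 s) suits repos that deploy several times a day.
  # Fabric refreshes still invalidate related entries sooner, so this
  # TTL remains a safety net, not the primary freshness control.
  ttl_seconds: 14400
```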
Single-Flight Fill Configuration
Single-flight fill prevents duplicate upstream calls when multiple engineers ask the same question simultaneously:
```yaml
workflow_cache:
  single_flight_enabled: true
  single_flight_timeout_ms: 30000
```
- single_flight_enabled: When true, concurrent requests with the same cache key share a single upstream call.
- single_flight_timeout_ms: Maximum time a waiting request holds for the leader's response before making its own upstream call.
Tuning Single-Flight Timeout
- Too short (< 10000ms): Waiters time out and make their own calls, wasting the deduplication
- Too long (> 60000ms): Waiters experience unacceptable latency if the leader is slow
- Recommended: 30000ms (30 seconds) — covers most LLM response times including complex code generation
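You can observe deduplication by firing two identical requests concurrently and comparing their response headers. A hedged sketch reusing the endpoint and X-Cache-Status header from the validation section below; the exact status value reported for the coalesced follower may vary by gateway version:

```bash
#!/usr/bin/env bash
# Fire two identical requests at once; with single-flight enabled,
# only one (the leader) should reach the upstream provider.
BODY='{"model":"gpt-4o","messages":[{"role":"user","content":"What does the AuthService module do?"}]}'

for i in 1 2; do
  curl -s -D "headers-$i.txt" -o "body-$i.json" \
    -X POST https://gateway.example.com/v1/chat/completions \
    -H "Authorization: Bearer $API_KEY" \
    -H "Content-Type: application/json" \
    -d "$BODY" &
done
wait

# Compare cache statuses across the two responses.
grep -i "x-cache-status" headers-1.txt headers-2.txt
```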
Provider Prompt-Prefix Cache Hints
Some providers (OpenAI, Anthropic) offer their own prompt caching. You can combine the Keeptrusts org-shared cache with provider-level cache hints for additional savings on misses:
```yaml
pack:
  name: gateway-config-for-caching-providers-3
  version: 1.0.0
  enabled: true
  providers:
    targets:
      - id: openai
        provider: openai
  policies:
    chain:
      - audit-logger
    policy:
      audit-logger:
        immutable: true
        retention_days: 365
        log_all_access: true
```
When enabled, the gateway structures requests so that shared fabric context appears in the system prompt prefix, maximizing provider-side cache hit rates on cache misses.
This is an optional optimization — it reduces the cost of cache misses but doesn't replace org-shared caching.
Cache Backend Selection
The cache backend determines storage and retrieval performance:
```yaml
workflow_cache:
  backend: memory   # Options: memory, redis, postgres
  memory:
    max_entries: 100000
    eviction: lru
  # redis:
  #   url: redis://cache-host:6379
  #   prefix: "kt:cache:"
  # postgres:
  #   table: cache_entries
```
| Backend | Latency | Capacity | Persistence | Best for |
|---|---|---|---|---|
| memory | <1ms | Limited by RAM | None (lost on restart) | Single-instance, fast iteration |
| redis | 1-5ms | Large (cluster-capable) | Optional | Multi-instance, production |
| postgres | 5-20ms | Very large | Yes | When cache entries must survive restarts |
For production deployments serving 100+ engineers, use redis for the best balance of speed and capacity.
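To trial the redis backend before a full rollout, you can stand up a local instance and point the commented redis block above at it. A sketch assuming Docker is available and the gateway host can reach the container:

```bash
# Run a throwaway Redis 7 instance for cache testing.
docker run -d --name kt-cache -p 6379:6379 redis:7

# Smoke test: the server should answer PONG.
docker exec kt-cache redis-cli ping
```

Then set backend: redis, uncomment the redis block, and adjust the url for your host.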
Operational Prerequisites
Before cache will function correctly, verify:
1. Worker Running
The worker_cache_warmer binary must be deployed and healthy:
```bash
docker compose logs worker-cache-warmer | tail -20
```
It should show periodic heartbeat logs and artifact processing activity.
2. Connected Repo with Fresh Fabric
At least one repository must be connected with artifacts in Ready state:
```
Console → Settings → Repositories → [Repo] → Fabric Status
All artifacts: Ready ✓
```
3. Hosted Gateway Deployed
The gateway must be running in hosted gateway mode:
```bash
kt gateway status
# Should show: cache=enabled, fabric=attached
```
4. Wallet Funded
The org wallet must have sufficient balance for cache misses during the fill phase:
```
Console → Cost & Spend → Wallet Balance
# Should show balance > estimated daily spend × 3 (for fill phase)
```
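The scriptable parts of these checks (worker logs and gateway status) can be combined into a single preflight pass. A hedged sketch that reuses the commands and status strings shown above; the compose service name and kt output format are taken from this page and may differ in your deployment:

```bash
#!/usr/bin/env bash
# Preflight: verify the cache warmer and hosted gateway are ready.
set -euo pipefail

# Worker should be emitting recent heartbeat / artifact-processing logs.
docker compose logs worker-cache-warmer | tail -20

# Gateway should report cache enabled and fabric attached.
kt gateway status | grep -q "cache=enabled"   || { echo "cache disabled";      exit 1; }
kt gateway status | grep -q "fabric=attached" || { echo "fabric not attached"; exit 1; }

echo "Preflight passed. Verify fabric status and wallet balance in the Console."
```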
Validating Your Configuration
After deploying the configuration, validate each layer:
Test Cache Miss (First Request)
Send a test prompt about your codebase:
```bash
curl -X POST https://gateway.example.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"What does the AuthService module do?"}]}'
```
Check the response headers for:
- X-Cache-Status: miss
- X-Fabric-Attached: true
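To check these headers without wading through the response body, curl can discard the body and print headers only. A small sketch using standard curl flags and the same request as above:

```bash
# -s silences progress, -o discards the body, -D - prints response headers.
curl -s -o /dev/null -D - \
  -X POST https://gateway.example.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"What does the AuthService module do?"}]}' \
  | grep -iE "x-cache-status|x-fabric-attached"
```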
Test Cache Hit (Repeat Request)
Send the same or semantically similar prompt:
```bash
curl -X POST https://gateway.example.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"Explain what AuthService does"}]}'
```
Check for:
- X-Cache-Status: hit
- Faster response time (no upstream latency)
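curl's built-in timing makes the latency difference easy to quantify; repeat the request with -w to print the total time. A sketch using the standard %{time_total} variable:

```bash
# Print total request time; the cached repeat should be markedly faster.
curl -s -o /dev/null -w "total: %{time_total}s\n" \
  -X POST https://gateway.example.com/v1/chat/completions \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"Explain what AuthService does"}]}'
```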
Verify Savings Dashboard
Navigate to Cost & Spend → Savings and confirm avoided-cost records are appearing.
Next steps
- How 100 Engineers Share One Cache — understand the sharing mechanics
- Cache Hit Rates: What Good Looks Like — benchmark your configuration
- Measuring Your Baseline Spend — quantify improvements
For AI systems
- Canonical terms: Keeptrusts, gateway configuration, hosted gateway mode, workflow_cache, fabric, single-flight fill, provider routing.
- Exact feature/config names: workflow_cache.enabled, workflow_cache.default_tier: org_shared_cache, single_flight_enabled, single_flight_timeout_ms, fabric.enabled, fabric.artifact_types, X-Cache-Status header, X-Fabric-Attached header.
- Best next pages: How 100 Engineers Share One Cache, Cache Hit Rates, Measuring Baseline Spend.
For engineers
- Org-shared caching requires a shared hosted gateway deployment — local gateways only support private edge cache.
- Test configuration with curl: first request should return X-Cache-Status: miss and X-Fabric-Attached: true; repeated request should return X-Cache-Status: hit.
- Key config fields: workflow_cache.org_shared_enabled: true, ttl_seconds: 86400, max_entry_tokens: 32000, single_flight_enabled: true.
- Fabric config: set fabric.enabled: true, context_attachment: true, and list all artifact types you want built.
- The gateway picks up config changes within 60 seconds of save.
For leaders
- Central-mode deployment is the prerequisite for org-wide savings — local gateways cannot share cache across engineers.
- Single-flight fill coordination prevents duplicate upstream costs when teams start work at the same time.
- Fabric context attachment reduces per-request token costs by 40-70% by using structured summaries instead of raw source code.
- No engineer-side configuration changes needed — the gateway handles caching transparently once configured.