Distributed Cache Architecture: L1 to Control Plane
Agent gateway group cache sharing relies on a three-tier architecture. No gateway ever fetches cache entries directly from another gateway. Instead, all cache state flows through centralized, auditable layers.
Use this page when
- You need to understand the three-tier cache architecture (L1 memory, control-plane metadata in PostgreSQL, shared payload/vector backends).
- You are planning infrastructure for Redis/Valkey, S3/GCS, or Qdrant to support agent gateway group caching.
- You want to understand cache read/write flows and why gateways never communicate cache data directly.
Primary audience
- Primary: Technical Engineers
- Secondary: AI Agents, Technical Leaders
Architecture Overview
┌─────────────────────────────────────────────────────────┐
│                   Agent Gateway Group                   │
│                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐           │
│  │Gateway A │    │Gateway B │    │Gateway C │           │
│  │   (L1)   │    │   (L1)   │    │   (L1)   │           │
│  └────┬─────┘    └────┬─────┘    └────┬─────┘           │
│       │               │               │                 │
└───────┼───────────────┼───────────────┼─────────────────┘
        │               │               │
        ▼               ▼               ▼
┌─────────────────────────────────────────────────────────┐
│           Control-Plane Metadata (PostgreSQL)           │
│      Authoritative org-shared cache metadata store      │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│            Shared Payload / Vector Backends             │
│         Redis/Valkey │ S3/GCS │ Qdrant (vector)         │
└─────────────────────────────────────────────────────────┘
Tier 1: Gateway L1 Local Memory
Each gateway maintains a local in-memory cache (L1). This is the fastest tier — responses served from L1 have sub-millisecond overhead.
Characteristics
| Property | Value |
|---|---|
| Storage location | Gateway process memory |
| Latency | Sub-millisecond |
| Scope | Single gateway instance |
| Persistence | None — lost on restart |
| Size | Configurable, typically 256 MB–2 GB |
| Eviction | LRU (Least Recently Used) |
Purpose
L1 exists as a performance optimization only. It reduces repeated round-trips to the control plane for frequently accessed entries. L1 is not authoritative — the control-plane metadata store is the source of truth.
What L1 Stores
- Full response payloads for recent cache hits
- Negative cache markers (confirmed misses) to avoid repeated lookups
- TTL-bounded entries that expire independently of the shared tier
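A minimal sketch of this tier, assuming entry count (rather than bytes) as the size budget; the class and method names are illustrative, not the gateway's actual implementation:

```python
import time
from collections import OrderedDict

class L1Cache:
    """In-process L1 sketch: LRU eviction plus per-entry TTL expiry."""

    def __init__(self, max_entries=1024):
        self.max_entries = max_entries
        self._entries = OrderedDict()  # key -> (expires_at, payload)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, payload = item
        if now >= expires_at:            # TTL expired: drop the entry
            del self._entries[key]
            return None
        self._entries.move_to_end(key)   # mark as most recently used
        return payload

    def put(self, key, payload, ttl_seconds, now=None):
        now = time.monotonic() if now is None else now
        self._entries[key] = (now + ttl_seconds, payload)
        self._entries.move_to_end(key)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```

A real L1 would also store negative markers as a distinct sentinel payload so confirmed misses are cached, too.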
L1 and Group Sharing
L1 is private to each gateway instance. Gateway A's L1 does not directly populate Gateway B's L1. However, when Gateway B queries the shared tier and gets a hit (originally written by Gateway A), Gateway B populates its own L1 for subsequent requests.
Tier 2: Control-Plane Metadata (PostgreSQL)
The control-plane metadata store is the authoritative record of all org-shared cache entries. It lives in the platform's PostgreSQL database alongside other governance data.
Characteristics
| Property | Value |
|---|---|
| Storage location | PostgreSQL (control-plane database) |
| Latency | 1–10 ms (network dependent) |
| Scope | Org-wide, group-scoped |
| Persistence | Durable, backed up |
| Size | Metadata only — not full payloads |
| Eviction | TTL-based + retention policies |
What the Metadata Store Contains
- Cache entry keys and their associated gates (org, codebase, policy, model, entitlement)
- Pointers to payload locations in the shared backend
- Entry creation timestamps and TTLs
- Audit fields: which gateway created the entry, which agent group owns it
- Validity flags: whether the entry has been invalidated
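The entry shape above can be sketched as follows; the field names and key-derivation scheme are assumptions for illustration, not the real schema:

```python
import hashlib
from dataclasses import dataclass

def cache_key(org_id: str, codebase_id: str, policy_version: str,
              model: str, entitlement: str, prompt: str) -> str:
    """Every gate participates in the key, so entries can never match
    across orgs, codebases, policies, models, or entitlements."""
    material = "|".join([org_id, codebase_id, policy_version, model, entitlement, prompt])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()

@dataclass
class CacheMetadata:
    """One control-plane row: metadata and a payload pointer, never the payload itself."""
    key: str
    org_id: str
    backend: str            # e.g. "redis" or "s3"
    payload_pointer: str    # location in the shared backend
    created_by_gateway: str # audit field
    ttl_seconds: int
    invalidated: bool = False
```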
Why Metadata Lives in the Control Plane
Centralizing metadata in PostgreSQL provides:
- Auditability — every cache entry is traceable through the governance audit log.
- Topology independence — gateways can be added, removed, or replaced without metadata loss.
- Consistent security enforcement — entitlement and residency checks happen at the metadata layer.
- Transactional guarantees — cache invalidation is atomic and consistent.
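A sketch of the transactional invalidation guarantee, using an in-memory SQLite database as a stand-in for PostgreSQL (the table and column names are assumed, not the real schema):

```python
import sqlite3

# In-memory SQLite stands in for the control-plane database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE cache_metadata (
        key         TEXT PRIMARY KEY,
        org_id      TEXT NOT NULL,
        pointer     TEXT NOT NULL,
        invalidated INTEGER NOT NULL DEFAULT 0
    )""")
conn.executemany(
    "INSERT INTO cache_metadata (key, org_id, pointer) VALUES (?, ?, ?)",
    [("k1", "org-1", "redis/p1"), ("k2", "org-1", "s3/p2"), ("k3", "org-2", "redis/p3")])
conn.commit()

def invalidate_org_entries(conn, org_id):
    """One transaction flips every entry for the org; concurrent readers see
    either all entries valid or all invalidated, never a partial state."""
    with conn:  # commits on success, rolls back on any error
        conn.execute(
            "UPDATE cache_metadata SET invalidated = 1 WHERE org_id = ?", (org_id,))
```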
Tier 3: Shared Payload and Vector Backends
Full response payloads and vector embeddings live in dedicated shared storage backends. The control-plane metadata store holds pointers to these locations.
Supported Backends
| Backend | Use Case | Stored Data |
|---|---|---|
| Redis / Valkey | Hot payload cache | Serialized LLM responses, small payloads |
| S3 / GCS | Large or cold payloads | Large responses, batch results, artifacts |
| Qdrant | Semantic vector cache | Embedding vectors for similarity-based cache lookup |
Redis / Valkey
You configure Redis or Valkey as the hot payload store for frequently accessed cache entries. Gateways retrieve payloads by reference (the pointer stored in control-plane metadata).
- Low latency (1–5 ms)
- Suitable for responses under 1 MB
- TTL-based eviction aligned with cache entry TTLs
- Cluster mode supported for horizontal scaling
S3 / GCS
For larger payloads or entries that you want to retain longer, configure an S3-compatible or GCS backend. Gateways retrieve payloads via presigned URLs generated by the control plane.
- Higher latency (50–200 ms)
- Cost-effective for large or infrequently accessed entries
- Lifecycle policies manage storage costs
- Data residency controls via bucket location
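A sketch of how a gateway might route a payload between the hot and cold backends, taking the 1 MB sizing guidance above as the cutoff (the function name and threshold are illustrative, not configuration the platform exposes):

```python
HOT_PAYLOAD_LIMIT = 1 * 1024 * 1024  # 1 MB, per the Redis/Valkey sizing note above

def choose_backend(payload: bytes, retain_long_term: bool = False) -> str:
    """Route small hot payloads to Redis/Valkey; route large or
    long-retention payloads to S3/GCS."""
    if retain_long_term or len(payload) >= HOT_PAYLOAD_LIMIT:
        return "s3"
    return "redis"
```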
Qdrant (Vector Cache)
When semantic caching is enabled, vector embeddings are stored in Qdrant. The control plane coordinates similarity searches across the vector index.
- Enables cache hits for semantically equivalent (not just identical) requests
- Configurable similarity threshold
- Org-scoped collections ensure isolation
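A pure-Python sketch of the threshold-gated similarity lookup that Qdrant provides at scale (the 0.92 default threshold here is an arbitrary illustration):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def semantic_lookup(query_vec, index, threshold=0.92):
    """index: list of (entry_key, embedding) pairs. Returns the key of the
    closest entry that clears the similarity threshold, else None (miss)."""
    best_key, best_score = None, threshold
    for key, vec in index:
        score = cosine_similarity(query_vec, vec)
        if score >= best_score:
            best_key, best_score = key, score
    return best_key
```

Raising the threshold trades hit rate for precision: near-duplicates still hit, merely related requests fall through to the LLM.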
No Direct Gateway-to-Gateway Cache Fetches
A deliberate architectural constraint: gateways never communicate cache data directly to each other. All cache access flows through the control plane and shared backends.
Why This Constraint Exists
| Reason | Explanation |
|---|---|
| Centralized auditability | Every cache access is logged at the control plane |
| Topology independence | Gateway count can change without reconfiguring mesh connections |
| Simpler security model | One enforcement point for entitlement and residency checks |
| No split-brain risk | Metadata consistency is managed by PostgreSQL, not distributed consensus |
| Easier operations | No gateway discovery, no peer authentication, no gossip protocols |
What This Means in Practice
- Adding a new gateway to a group requires zero configuration on existing gateways.
- Removing a gateway from a group has no impact on other gateways' cache access.
- Network partitions between gateways do not affect cache consistency.
- You can replace all gateways in a group simultaneously without losing shared cache.
Cache Read Flow
When a gateway receives a request and checks the shared cache:
- L1 check — the gateway checks local memory and returns immediately on a hit.
- Metadata lookup — the gateway queries the control-plane metadata store with the computed cache key.
- Gate validation — the control plane verifies org, entitlement, residency, and policy gates.
- Payload retrieval — if metadata exists and gates pass, the gateway fetches the payload from the referenced shared backend.
- L1 population — the gateway stores the result in L1 for future requests.
- Response — the cached response is returned to the caller.
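The steps above can be sketched end to end; plain dicts stand in for the L1, metadata store, and payload backends, and only the org gate is shown:

```python
def read_from_cache(key, l1, metadata_store, backends, caller_org):
    # 1. L1 check: fastest path, private to this gateway
    payload = l1.get(key)
    if payload is not None:
        return payload
    # 2. Metadata lookup against the control plane (the source of truth)
    meta = metadata_store.get(key)
    if meta is None or meta["invalidated"]:
        return None  # full miss: forward the request to the LLM
    # 3. Gate validation (org gate shown; policy/entitlement/residency elided)
    if meta["org_id"] != caller_org:
        return None
    # 4. Payload retrieval by pointer from the shared backend
    payload = backends[meta["backend"]].get(meta["pointer"])
    if payload is None:
        return None
    # 5. L1 population for subsequent requests on this gateway
    l1[key] = payload
    return payload
```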
Cache Write Flow
When a gateway produces a new response to cache:
- Response received — the gateway receives a response from the upstream LLM.
- Metadata write — the gateway writes cache metadata to the control plane (key, gates, TTL, payload pointer).
- Payload write — the gateway writes the full payload to the configured shared backend.
- L1 population — the gateway stores the result in its own L1.
- Confirmation — the control plane acknowledges the write.
Other gateways in the group can immediately read the new entry via their own metadata lookups.
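The write steps can be sketched the same way, again with dicts standing in for PostgreSQL and the shared backend (all names are illustrative):

```python
import uuid

def write_to_cache(key, payload, gates, ttl_seconds, gateway_id,
                   metadata_store, backends, backend_name, l1):
    """Returns the payload pointer recorded in control-plane metadata."""
    pointer = f"{backend_name}/{uuid.uuid4().hex}"
    # 2. Metadata write to the control plane (key, gates, TTL, payload pointer)
    metadata_store[key] = {
        "gates": gates,
        "backend": backend_name,
        "pointer": pointer,
        "ttl_seconds": ttl_seconds,
        "created_by_gateway": gateway_id,  # audit field
        "invalidated": False,
    }
    # 3. Payload write to the configured shared backend
    backends[backend_name][pointer] = payload
    # 4. Populate this gateway's own L1
    l1[key] = payload
    return pointer
```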
Operational Considerations
Latency Budget
| Tier | Typical Latency | When Used |
|---|---|---|
| L1 hit | < 1 ms | Repeated identical requests on same gateway |
| Metadata + Redis hit | 5–15 ms | Cross-gateway cache hit, hot payload |
| Metadata + S3 hit | 50–200 ms | Cross-gateway cache hit, cold payload |
| Full miss | 1–10 ms overhead | Metadata lookup confirms no entry; forward to LLM (L1 negative markers avoid repeat lookups) |
Monitoring
Monitor these metrics to assess cache architecture health:
- L1 hit ratio per gateway
- Control-plane metadata lookup latency (p50, p99)
- Shared backend retrieval latency
- Cache write success rate
- Cross-gateway hit ratio (hits served from entries written by a different gateway)
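As an illustration of the first and last metrics, a sketch that derives the two hit ratios from per-request events (the event shape is an assumption, not an exported format):

```python
def cache_hit_ratios(events):
    """events: (serving_gateway, writing_gateway) pairs per request, where
    writing_gateway is None on a miss. Returns (hit_ratio, cross_gateway_ratio)."""
    if not events:
        return 0.0, 0.0
    hits = [e for e in events if e[1] is not None]
    cross = [e for e in hits if e[1] != e[0]]
    return len(hits) / len(events), len(cross) / len(events)
```

A rising cross-gateway ratio is the signal that group sharing is paying off: gateways are serving entries their peers wrote.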
Next steps
- What Are Gateway Groups? — conceptual overview
- Cache Sharing Across Gateways — cache key mechanics
- Configuring Gateway Groups — setup and management
- Gateway Failover Without Cache Loss — failover behavior
For AI systems
- Canonical terms: Keeptrusts, distributed cache architecture, L1 local memory, control-plane metadata, PostgreSQL, shared payload backend, Redis, Valkey, S3, GCS, Qdrant, vector cache, semantic cache, three-tier architecture.
- Feature/config names: L1 cache (256 MB–2 GB, LRU eviction), control-plane metadata store (PostgreSQL), hot payload store (Redis/Valkey), cold payload store (S3/GCS), vector index (Qdrant), gateway L1 hit ratio, cross-gateway hit ratio.
- Best next pages: Cache Sharing Across Gateways, Configuring Gateway Groups, Gateway Failover Without Cache Loss.
For engineers
- L1 is process memory on each gateway (sub-ms reads, lost on restart). Size it based on your gateway’s memory budget (256 MB–2 GB typical).
- Control-plane metadata: 1–10 ms lookups, durable in PostgreSQL. This is the source of truth for cache entry existence and gate validation.
- Shared backends: Redis/Valkey for hot payloads (1–5 ms), S3/GCS for cold/large payloads (50–200 ms), Qdrant for semantic vector search.
- Monitor: L1 hit ratio per gateway, metadata lookup latency (p50/p99), shared backend retrieval latency, cache write success rate, cross-gateway hit ratio.
For leaders
- No gateway-to-gateway communication simplifies operations: no mesh networking, no peer discovery, no gossip protocols, no split-brain risk.
- Centralized metadata in PostgreSQL provides full auditability of every cache access through the governance audit log.
- Infrastructure cost: Redis cluster for hot payloads, S3 for cold storage (with lifecycle policies), Qdrant for semantic matching. Scale independently per tier.
- Topology independence means you can replace, scale, or relocate gateways without impacting shared cache state.