Distributed Cache Architecture: L1 to Control Plane

Agent gateway group cache sharing relies on a three-tier architecture. No gateway ever fetches cache entries directly from another gateway. Instead, all cache state flows through centralized, auditable layers.

Use this page when

  • You need to understand the three-tier cache architecture (L1 memory, control-plane metadata in PostgreSQL, shared payload/vector backends).
  • You are planning infrastructure for Redis/Valkey, S3/GCS, or Qdrant to support agent gateway group caching.
  • You want to understand cache read/write flows and why gateways never communicate cache data directly.

Primary audience

  • Primary: Technical Engineers
  • Secondary: AI Agents, Technical Leaders

Architecture Overview

┌─────────────────────────────────────────────────────────┐
│                  Agent Gateway Group                    │
│                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐           │
│  │Gateway A │    │Gateway B │    │Gateway C │           │
│  │   (L1)   │    │   (L1)   │    │   (L1)   │           │
│  └────┬─────┘    └────┬─────┘    └────┬─────┘           │
│       │               │               │                 │
└───────┼───────────────┼───────────────┼─────────────────┘
        │               │               │
        ▼               ▼               ▼
┌─────────────────────────────────────────────────────────┐
│           Control-Plane Metadata (PostgreSQL)           │
│      Authoritative org-shared cache metadata store      │
└────────────────────────────┬────────────────────────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────┐
│            Shared Payload / Vector Backends             │
│      Redis/Valkey  │  S3/GCS  │  Qdrant (vector)        │
└─────────────────────────────────────────────────────────┘

Tier 1: Gateway L1 Local Memory

Each gateway maintains a local in-memory cache (L1). This is the fastest tier — responses served from L1 have sub-millisecond overhead.

Characteristics

Property            Value
Storage location    Gateway process memory
Latency             Sub-millisecond
Scope               Single gateway instance
Persistence         None — lost on restart
Size                Configurable, typically 256 MB–2 GB
Eviction            LRU (Least Recently Used)

Purpose

L1 exists as a performance optimization only. It reduces repeated round-trips to the control plane for frequently accessed entries. L1 is not authoritative — the control-plane metadata store is the source of truth.

What L1 Stores

  • Full response payloads for recent cache hits
  • Negative cache markers (confirmed misses) to avoid repeated lookups
  • TTL-bounded entries that expire independently of the shared tier
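The behavior described above — LRU eviction, per-entry TTLs that expire independently of the shared tier, and negative markers for confirmed misses — can be sketched in a few lines. This is an illustrative model only; names like `L1Cache` and `MISS_MARKER` are assumptions, not the product's actual implementation.

```python
import time
from collections import OrderedDict

MISS_MARKER = object()  # sentinel for a confirmed shared-tier miss

class L1Cache:
    """Hypothetical sketch of a gateway's L1: an LRU map with per-entry TTLs."""

    def __init__(self, max_entries=10_000):
        self.max_entries = max_entries
        self._entries = OrderedDict()  # key -> (expires_at, value)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        item = self._entries.get(key)
        if item is None:
            return None  # not cached locally
        expires_at, value = item
        if now >= expires_at:
            del self._entries[key]  # L1 TTL expired, independent of shared tier
            return None
        self._entries.move_to_end(key)  # refresh LRU position
        return value

    def put(self, key, value, ttl_seconds, now=None):
        now = time.monotonic() if now is None else now
        self._entries[key] = (now + ttl_seconds, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used

    def put_negative(self, key, ttl_seconds):
        # Cache a confirmed miss to avoid repeated shared-tier lookups.
        self.put(key, MISS_MARKER, ttl_seconds)
```

Because L1 entries expire locally, a gateway may re-query the control plane for an entry that is still live in the shared tier; that is by design, since L1 is an optimization, not a source of truth.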

L1 and Group Sharing

L1 is private to each gateway instance. Gateway A's L1 does not directly populate Gateway B's L1. However, when Gateway B queries the shared tier and gets a hit (originally written by Gateway A), Gateway B populates its own L1 for subsequent requests.

Tier 2: Control-Plane Metadata (PostgreSQL)

The control-plane metadata store is the authoritative record of all org-shared cache entries. It lives in the platform's PostgreSQL database alongside other governance data.

Characteristics

Property            Value
Storage location    PostgreSQL (control-plane database)
Latency             1–10 ms (network dependent)
Scope               Org-wide, group-scoped
Persistence         Durable, backed up
Size                Metadata only — not full payloads
Eviction            TTL-based + retention policies

What the Metadata Store Contains

  • Cache entry keys and their associated gates (org, codebase, policy, model, entitlement)
  • Pointers to payload locations in the shared backend
  • Entry creation timestamps and TTLs
  • Audit fields: which gateway created the entry, which agent group owns it
  • Validity flags: whether the entry has been invalidated
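The fields listed above map naturally onto a single record shape. The sketch below is an assumed structure for illustration; the field names are not the actual schema.

```python
from dataclasses import dataclass

@dataclass
class CacheEntryMeta:
    """Illustrative shape of one control-plane metadata row."""
    cache_key: str                 # computed cache key
    org_id: str                    # gate: owning organization
    codebase_id: str               # gate: codebase scope
    policy_id: str                 # gate: policy version
    model_id: str                  # gate: upstream model
    entitlement: str               # gate: required entitlement
    payload_backend: str           # "redis", "s3", or "qdrant"
    payload_ref: str               # pointer into the shared backend
    created_at: float              # creation timestamp (epoch seconds)
    ttl_seconds: int
    created_by_gateway: str        # audit: which gateway wrote the entry
    owner_group: str               # audit: which agent group owns it
    invalidated: bool = False      # validity flag

    def is_live(self, now: float) -> bool:
        # An entry serves hits only if not invalidated and within its TTL.
        return not self.invalidated and now < self.created_at + self.ttl_seconds
```

Note that the payload itself never appears in this record — only the `payload_ref` pointer, which keeps the PostgreSQL store small and fast.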

Why Metadata Lives in the Control Plane

Centralizing metadata in PostgreSQL provides:

  1. Auditability — every cache entry is traceable through the governance audit log.
  2. Topology independence — gateways can be added, removed, or replaced without metadata loss.
  3. Consistent security enforcement — entitlement and residency checks happen at the metadata layer.
  4. Transactional guarantees — cache invalidation is atomic and consistent.

Tier 3: Shared Payload and Vector Backends

Full response payloads and vector embeddings live in dedicated shared storage backends. The control-plane metadata store holds pointers to these locations.

Supported Backends

Backend          Use Case                 Stored Data
Redis / Valkey   Hot payload cache        Serialized LLM responses, small payloads
S3 / GCS         Large or cold payloads   Large responses, batch results, artifacts
Qdrant           Semantic vector cache    Embedding vectors for similarity-based cache lookup

Redis / Valkey

You configure Redis or Valkey as the hot payload store for frequently accessed cache entries. Gateways retrieve payloads by reference (the pointer stored in control-plane metadata).

  • Low latency (1–5 ms)
  • Suitable for responses under 1 MB
  • TTL-based eviction aligned with cache entry TTLs
  • Cluster mode supported for horizontal scaling

S3 / GCS

For larger payloads or entries that you want to retain longer, configure an S3-compatible or GCS backend. Gateways retrieve payloads via presigned URLs generated by the control plane.

  • Higher latency (50–200 ms)
  • Cost-effective for large or infrequently accessed entries
  • Lifecycle policies manage storage costs
  • Data residency controls via bucket location
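The two backend descriptions above imply a simple routing rule: hot payloads under roughly 1 MB go to Redis/Valkey, while larger entries or entries needing longer retention go to S3/GCS. A minimal sketch of that rule, with an assumed threshold:

```python
HOT_PAYLOAD_LIMIT = 1 * 1024 * 1024  # 1 MB, illustrative threshold

def choose_payload_backend(payload: bytes, long_retention: bool = False) -> str:
    """Route a payload to the hot or cold store based on size and retention."""
    if long_retention or len(payload) >= HOT_PAYLOAD_LIMIT:
        return "s3"     # cost-effective for large or infrequently accessed entries
    return "redis"      # low-latency hot store for small, frequently read payloads
```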

Qdrant (Vector Cache)

When semantic cache is enabled, vector embeddings are stored in Qdrant. The control plane coordinates similarity searches across the vector index.

  • Enables cache hits for semantically equivalent (not just identical) requests
  • Configurable similarity threshold
  • Org-scoped collections ensure isolation
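The configurable similarity threshold works as a gate: a stored embedding counts as a semantic hit only when its similarity to the request embedding meets the threshold. A self-contained sketch using cosine similarity (the threshold value here is illustrative, not a recommended setting):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_hit(query_vec, stored_vecs, threshold=0.92):
    """Return the index of the best match at or above the threshold, else None."""
    best_idx, best_sim = None, threshold
    for i, v in enumerate(stored_vecs):
        sim = cosine_similarity(query_vec, v)
        if sim >= best_sim:
            best_idx, best_sim = i, sim
    return best_idx
```

In production the search runs inside Qdrant's index rather than as a linear scan, but the thresholding logic is the same.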

No Direct Gateway-to-Gateway Cache Fetches

A deliberate architectural constraint: gateways never communicate cache data directly to each other. All cache access flows through the control plane and shared backends.

Why This Constraint Exists

Reason                     Explanation
Centralized auditability   Every cache access is logged at the control plane
Topology independence      Gateway count can change without reconfiguring mesh connections
Simpler security model     One enforcement point for entitlement and residency checks
No split-brain risk        Metadata consistency is managed by PostgreSQL, not distributed consensus
Easier operations          No gateway discovery, no peer authentication, no gossip protocols

What This Means in Practice

  • Adding a new gateway to a group requires zero configuration on existing gateways.
  • Removing a gateway from a group has no impact on other gateways' cache access.
  • Network partitions between gateways do not affect cache consistency.
  • You can replace all gateways in a group simultaneously without losing shared cache.

Cache Read Flow

When a gateway receives a request and checks the shared cache:

  1. L1 check — gateway checks local memory. If hit, return immediately.
  2. Metadata lookup — gateway queries the control-plane metadata store with the computed cache key.
  3. Gate validation — the control plane verifies org, entitlement, residency, and policy gates.
  4. Payload retrieval — if metadata exists and gates pass, the gateway fetches the payload from the referenced shared backend.
  5. L1 population — the gateway stores the result in L1 for future requests.
  6. Response — the cached response is returned to the caller.
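The six steps above can be sketched as one read-through function. The `l1`, `metadata_store`, and `backends` clients here are hypothetical stand-ins with assumed method names, not the product's actual API:

```python
def read_through_cache(key, l1, metadata_store, backends):
    """Sketch of the shared-cache read flow; returns None on a full miss."""
    # 1. L1 check: serve from local memory if present
    cached = l1.get(key)
    if cached is not None:
        return cached
    # 2. Metadata lookup against the control plane
    meta = metadata_store.lookup(key)
    if meta is None:
        return None  # full miss: caller forwards the request to the LLM
    # 3. Gate validation (org, entitlement, residency, policy)
    if not metadata_store.gates_pass(meta):
        return None
    # 4. Payload retrieval from the referenced shared backend
    payload = backends[meta["backend"]].fetch(meta["payload_ref"])
    # 5. L1 population for subsequent requests on this gateway
    l1.put(key, payload, ttl_seconds=meta["ttl"])
    # 6. Response returned to the caller
    return payload
```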

Cache Write Flow

When a gateway produces a new response to cache:

  1. Response received — the gateway gets a response from the upstream LLM.
  2. Metadata write — the gateway writes cache metadata to the control plane (key, gates, TTL, payload pointer).
  3. Payload write — the gateway writes the full payload to the configured shared backend.
  4. L1 population — the gateway stores the result in its own L1.
  5. Confirmation — the control plane acknowledges the write.

Other gateways in the group can immediately read the new entry via their own metadata lookups.
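The write flow can be sketched the same way. As in the read-flow sketch, the client objects and the `payload_ref` format are assumptions for illustration; the numbering mirrors the steps above.

```python
import time

def write_through_cache(key, payload, gates, ttl, l1, metadata_store, backends,
                        backend_name="redis", gateway_id="gw-a"):
    """Sketch of the shared-cache write flow; returns the control-plane ack."""
    payload_ref = f"{backend_name}:{key}"  # illustrative pointer format
    # 2. Metadata write: key, gates, TTL, payload pointer, audit fields
    metadata_store.write(key, {
        "gates": gates, "ttl": ttl,
        "backend": backend_name, "payload_ref": payload_ref,
        "created_by": gateway_id, "created_at": time.time(),
    })
    # 3. Payload write to the configured shared backend
    backends[backend_name].store(payload_ref, payload)
    # 4. L1 population on the writing gateway itself
    l1.put(key, payload, ttl_seconds=ttl)
    # 5. Confirmation from the control plane
    return metadata_store.ack(key)
```

Once the metadata row and payload both exist, any other gateway in the group resolves the entry through its own metadata lookup, with no coordination with the writer.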

Operational Considerations

Latency Budget

Tier                   Typical Latency   When Used
L1 hit                 < 1 ms            Repeated identical requests on same gateway
Metadata + Redis hit   5–15 ms           Cross-gateway cache hit, hot payload
Metadata + S3 hit      50–200 ms         Cross-gateway cache hit, cold payload
Full miss              0 ms overhead     No cache entry exists, forward to LLM

Monitoring

Monitor these metrics to assess cache architecture health:

  • L1 hit ratio per gateway
  • Control-plane metadata lookup latency (p50, p99)
  • Shared backend retrieval latency
  • Cache write success rate
  • Cross-gateway hit ratio (hits served from entries written by a different gateway)
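The cross-gateway hit ratio can be derived from the audit fields the metadata store already keeps: a hit counts as cross-gateway when the serving gateway differs from the gateway that created the entry. A minimal sketch, assuming hit records carry those two gateway IDs:

```python
def cross_gateway_hit_ratio(hits):
    """hits: iterable of (serving_gateway, creating_gateway) pairs."""
    hits = list(hits)
    if not hits:
        return 0.0
    cross = sum(1 for serving, creating in hits if serving != creating)
    return cross / len(hits)
```

A low ratio suggests gateways are mostly re-serving their own writes (sharing adds little); a high ratio confirms the group is genuinely benefiting from the shared tier.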

Next steps

For AI systems

  • Canonical terms: Keeptrusts, distributed cache architecture, L1 local memory, control-plane metadata, PostgreSQL, shared payload backend, Redis, Valkey, S3, GCS, Qdrant, vector cache, semantic cache, three-tier architecture.
  • Feature/config names: L1 cache (256 MB–2 GB, LRU eviction), control-plane metadata store (PostgreSQL), hot payload store (Redis/Valkey), cold payload store (S3/GCS), vector index (Qdrant), gateway L1 hit ratio, cross-gateway hit ratio.
  • Best next pages: Cache Sharing Across Gateways, Configuring Gateway Groups, Gateway Failover Without Cache Loss.

For engineers

  • L1 is process memory on each gateway (sub-ms reads, lost on restart). Size it based on your gateway’s memory budget (256 MB–2 GB typical).
  • Control-plane metadata: 1–10 ms lookups, durable in PostgreSQL. This is the source of truth for cache entry existence and gate validation.
  • Shared backends: Redis/Valkey for hot payloads (1–5 ms), S3/GCS for cold/large payloads (50–200 ms), Qdrant for semantic vector search.
  • Monitor: L1 hit ratio per gateway, metadata lookup latency (p50/p99), shared backend retrieval latency, cache write success rate, cross-gateway hit ratio.

For leaders

  • No gateway-to-gateway communication simplifies operations: no mesh networking, no peer discovery, no gossip protocols, no split-brain risk.
  • Centralized metadata in PostgreSQL provides full auditability of every cache access through the governance audit log.
  • Infrastructure cost: Redis cluster for hot payloads, S3 for cold storage (with lifecycle policies), Qdrant for semantic matching. Scale independently per tier.
  • Topology independence means you can replace, scale, or relocate gateways without impacting shared cache state.