Single-Flight Fill: One Request Serves Many

When multiple engineers send the same prompt within seconds of each other, which is common at the start of the workday, the gateway collapses the requests that arrive while the first one is still in flight into a single upstream request. One engineer pays the fill cost. Everyone else receives the cached result. This is single-flight fill.

Use this page when

You want to understand how concurrent identical requests are deduplicated to save cost during cache fill.
You are diagnosing morning-surge cost spikes and want to verify single-flight is collapsing duplicates.
You need to tune the single-flight timeout or monitor collapse metrics in the console.

Primary audience

Primary: Technical Leaders
Secondary: Technical Engineers, AI Agents

The Morning Surge Problem

Consider a team of 100 engineers who start work at 9:00 AM. Within the first 10 minutes:

60 engineers open their IDE and trigger context-loading prompts
Many of these prompts are identical (same codebase, same system context, same common questions)
Without single-flight: 60 upstream API calls, paying 60 separate fill costs
With single-flight: ~5-8 upstream API calls (roughly one per distinct prompt), with the rest served from the shared result

The gateway detects that multiple in-flight requests share the same cache key and holds subsequent callers until the first response returns.

How Single-Flight Works

Engineer A sends prompt → Cache miss → Forward upstream → Wait for response
Engineer B sends same prompt → Cache miss → Detect in-flight request → Wait
Engineer C sends same prompt → Cache miss → Detect in-flight request → Wait
...
Provider responds to Engineer A's request
  → Store in cache
  → Serve cached response to Engineer A
  → Serve cached response to Engineer B
  → Serve cached response to Engineer C

Only one upstream call fires. Only one fill cost is incurred. All waiting engineers receive the response with minimal additional latency (typically < 100ms beyond the first response).
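The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the gateway's implementation: it assumes a thread-per-request server and an in-memory dict as the cache, and the names `SingleFlightCache`, `get`, and `fetch_upstream` are hypothetical. It shows the three properties described above: one upstream call per key, waiters held until the leader finishes, and failures propagated to every waiter without being cached.

```python
import threading
from typing import Callable, Dict, Optional


class _Flight:
    """One in-flight upstream call and the outcome its waiters will share."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: object = None
        self.error: Optional[Exception] = None


class SingleFlightCache:
    """Illustrative single-flight cache: concurrent identical keys share one upstream call."""

    def __init__(self, fetch_upstream: Callable[[str], object]) -> None:
        self._fetch = fetch_upstream              # the expensive upstream call
        self._cache: Dict[str, object] = {}       # successful fills only
        self._inflight: Dict[str, _Flight] = {}   # cache key -> flight in progress
        self._lock = threading.Lock()

    def get(self, key: str) -> object:
        with self._lock:
            if key in self._cache:                # normal cache hit: no flight at all
                return self._cache[key]
            flight = self._inflight.get(key)
            is_leader = flight is None
            if is_leader:                         # first requester starts the fill
                flight = _Flight()
                self._inflight[key] = flight

        if not is_leader:
            flight.done.wait()                    # hold the waiter until the leader finishes
            if flight.error is not None:
                raise flight.error                # waiters receive the same error
            return flight.result

        try:
            flight.result = self._fetch(key)      # the single upstream call
        except Exception as exc:
            flight.error = exc
        with self._lock:
            if flight.error is None:
                self._cache[key] = flight.result  # failed responses are never cached
            del self._inflight[key]
        flight.done.set()                         # release every waiter at once
        if flight.error is not None:
            raise flight.error
        return flight.result
```

The leader releases all waiters with a single `Event.set()`, which is why collapsed requesters see only a small serve delay after the upstream response. Go's `golang.org/x/sync/singleflight` package provides the deduplication half of this pattern through `Group.Do`; caching is layered on separately.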

Cost Impact

For the morning surge example:

Metric	Without Single-Flight	With Single-Flight
Upstream calls	60	~6
Fill cost (at $0.08/request avg)	$4.80	$0.48
Cache entries created	60 (duplicates)	6 (unique)
Latency for first requester	~2s	~2s
Latency for collapsed requesters	~2s each (parallel)	~2.1s (wait + serve)

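At $4.32 avoided per surge, roughly 20 working days of morning surges avoid about $86 of fill cost per month from the surge window alone, before the ongoing cache-hit savings even begin. Full-day collapse savings are larger (see the scale example below).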

Viewing Single-Flight Metrics

Navigate to Cost Center → Cache Performance in the console to see:

Metric	Description
`single_flight_collapses`	Number of requests that waited on an in-flight fill
`single_flight_groups`	Number of distinct cache keys that had multiple waiters
`avg_waiters_per_group`	Average number of requests collapsed per single-flight group
`max_waiters_per_group`	Peak collapse count (useful for capacity planning)
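The four metrics relate to each other as sketched below. This assumes the gateway keeps one record per flight with the number of requests that waited on it; `FlightRecord` and `single_flight_metrics` are hypothetical names, and counting a group as any flight with at least one waiter is an assumption about what "multiple waiters" means.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FlightRecord:
    """Hypothetical per-flight record for one cache key's fill."""
    cache_key: str
    waiters: int  # requests collapsed onto this flight; the leader is not counted


def single_flight_metrics(flights: List[FlightRecord]) -> dict:
    # Only flights that actually had waiters count as single-flight groups.
    groups = [f for f in flights if f.waiters > 0]
    collapses = sum(f.waiters for f in groups)
    return {
        "single_flight_collapses": collapses,
        "single_flight_groups": len(groups),
        "avg_waiters_per_group": collapses / len(groups) if groups else 0.0,
        "max_waiters_per_group": max((f.waiters for f in groups), default=0),
    }
```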

Example Metrics Output

{
  "period": "2026-04-01T09:00:00Z/2026-04-01T09:10:00Z",
  "single_flight_collapses": 54,
  "single_flight_groups": 6,
  "avg_waiters_per_group": 9,
  "max_waiters_per_group": 18,
  "fill_cost_incurred": 0.48,
  "fill_cost_avoided_by_collapse": 4.32
}

In this 10-minute window, single-flight saved $4.32 by reducing 60 potential upstream calls to 6.
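The dollar figures in the example output can be cross-checked from the counts. This sketch assumes each of the 6 groups corresponds to exactly one upstream fill, which holds for this window (6 calls served 60 requests); `reconcile` is a hypothetical helper, not a console feature.

```python
import json

# The example output above, verbatim.
EXAMPLE = """
{
  "period": "2026-04-01T09:00:00Z/2026-04-01T09:10:00Z",
  "single_flight_collapses": 54,
  "single_flight_groups": 6,
  "avg_waiters_per_group": 9,
  "max_waiters_per_group": 18,
  "fill_cost_incurred": 0.48,
  "fill_cost_avoided_by_collapse": 4.32
}
"""


def reconcile(metrics: dict) -> dict:
    fills = metrics["single_flight_groups"]              # assumed: one upstream fill per group
    per_fill = metrics["fill_cost_incurred"] / fills     # average cost of one fill
    collapses = metrics["single_flight_collapses"]
    return {
        "avg_fill_cost": round(per_fill, 2),
        "actual_upstream_calls": fills,
        "potential_upstream_calls": fills + collapses,   # leaders plus collapsed waiters
        "avg_waiters_check": collapses / fills,
        "avoided_check": round(collapses * per_fill, 2), # each collapse avoids one fill
    }


report = reconcile(json.loads(EXAMPLE))
```

The checks reproduce the reported values: $0.08 per fill, 60 potential calls reduced to 6, 9 waiters per group, and $4.32 avoided.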

Latency Characteristics

Single-flight introduces minimal latency for waiting requests:

First requester: Normal upstream latency (1-5 seconds depending on model and token count)
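Collapsed requesters: At most the first requester's latency + ~10-50ms (local cache serve time); a waiter only waits for the remainder of the in-flight call, so later arrivals wait less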
Waiters share the flight's outcome: The gateway holds waiters until the upstream response arrives or the upstream call fails; the hold is bounded by the single_flight_timeout_ms setting

If the upstream call fails, all waiters receive the error. The failed response is not cached.
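The latency bound for collapsed requesters follows from simple arithmetic: a waiter that joins a flight partway through only waits for the time left on the shared upstream call, plus the local serve time. The function below is an illustrative sketch; the 30 ms default is an assumed midpoint of the ~10-50ms range quoted above.

```python
def collapsed_latency_s(leader_latency_s: float, arrival_offset_s: float,
                        serve_s: float = 0.03) -> float:
    """Latency seen by a waiter that joins arrival_offset_s after the leader started.

    serve_s is the local cache serve time (0.03 s is an illustrative value).
    """
    if not 0 <= arrival_offset_s < leader_latency_s:
        raise ValueError("waiter must arrive while the upstream call is in flight")
    remaining = leader_latency_s - arrival_offset_s  # time left on the shared upstream call
    return remaining + serve_s
```

The worst case is a waiter arriving immediately after the leader (offset 0), which gives the leader's latency plus the serve time; a waiter arriving 1.5 s into a 2 s call sees about 0.53 s.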

When Single-Flight Activates

Single-flight activates when:

A request produces a cache key that matches an in-flight upstream call
The in-flight call has not yet completed
The requesting engineer has valid entitlements for the cached content

Single-flight does not activate when:

The cache already contains a valid entry (that is a normal cache hit, which is cheaper still)
The in-flight request targets a different model or configuration
Policy evaluation produces a different outcome for the waiting request
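The activation rules above can be expressed as a small decision function. This is a hypothetical sketch: `choose_path`, the `Path` values, and the idea of storing the leader's policy outcome alongside each in-flight key are illustrative assumptions. How the gateway serves a request for which single-flight does not activate is not specified on this page, so the sketch only labels that path.

```python
from enum import Enum
from typing import Dict, Set


class Path(Enum):
    CACHE_HIT = "cache_hit"          # valid entry already cached (cheapest path)
    COLLAPSE = "collapse"            # wait on the matching in-flight fill
    LEAD = "lead"                    # no matching flight: start the upstream fill
    NOT_COLLAPSED = "not_collapsed"  # single-flight does not activate for this request


def choose_path(key: str, cached: Set[str], inflight: Dict[str, str],
                entitled: bool, policy_outcome: str) -> Path:
    """Decide how one request is served.

    cached: keys with a valid cache entry.
    inflight: cache key -> policy outcome evaluated for that flight's leader.
    A different model or configuration yields a different key, so it never matches.
    """
    if key in cached:
        return Path.CACHE_HIT
    leader_outcome = inflight.get(key)
    if leader_outcome is None:
        return Path.LEAD
    if entitled and policy_outcome == leader_outcome:
        return Path.COLLAPSE
    return Path.NOT_COLLAPSED
```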

Interaction with Wallet Reserve

For single-flight collapsed requests:

The first requester's wallet is reserved and settled normally (fill cost)
Collapsed requesters are not charged — they receive a cache hit once the fill completes
Avoided-cost records are emitted for all collapsed requesters

This means collapsed requests appear in your savings dashboard as avoided cost, just like regular cache hits.
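The wallet treatment of one successful flight can be sketched as a list of ledger records. The `LedgerRecord` shape and field names are hypothetical; valuing each avoided-cost record at the fill cost is an assumption consistent with the example metrics (54 collapses × $0.08 = $4.32). Upstream failures are not modeled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LedgerRecord:
    """Hypothetical ledger entry; field names are illustrative."""
    engineer: str
    kind: str       # "fill" (reserved and settled) or "avoided" (no charge)
    charged: float
    avoided: float


def settle_flight(leader: str, waiters: List[str], fill_cost: float) -> List[LedgerRecord]:
    # The leader's wallet pays the fill cost through the normal reserve/settle cycle.
    records = [LedgerRecord(leader, "fill", charged=fill_cost, avoided=0.0)]
    # Each collapsed waiter is charged nothing and gets an avoided-cost record.
    records += [LedgerRecord(w, "avoided", charged=0.0, avoided=fill_cost) for w in waiters]
    return records
```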

Maximizing Single-Flight Effectiveness

To increase the frequency of single-flight collapses:

Standardize IDE configurations — Consistent system prompts produce identical cache keys
Use shared gateway configurations — All engineers on the same codebase should route through the same gateway config
Align work schedules loosely — Teams starting at similar times naturally produce concurrent identical requests
Keep codebase context stable — Rapid file changes between requests reduce key overlap
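Why consistency matters is easiest to see in how a cache key might be derived. The sketch below is hypothetical: the gateway's real key inputs and hashing are not documented here. It assumes the key covers model, configuration, system prompt, context files, and prompt, and it normalizes whitespace and context ordering so that equivalent requests hash identically.

```python
import hashlib
import json
from typing import Dict


def cache_key(model: str, config: Dict[str, object], system_prompt: str,
              context_files: Dict[str, str], prompt: str) -> str:
    """Hypothetical cache-key derivation; real key inputs may differ."""
    payload = {
        "model": model,            # different model -> different key -> no shared flight
        "config": config,
        "system_prompt": system_prompt.strip(),
        # Sort context by path so the same files in a different order hash the same.
        "context": sorted(context_files.items()),
        "prompt": prompt.strip(),
    }
    # sort_keys makes the encoding independent of dict insertion order.
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()
```

Without the sorting step, two engineers whose IDEs attach the same files in a different order would produce different keys and never share a flight. Any change to file contents, however, still changes the key, which is why rapid edits reduce overlap.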

Monitoring and Alerting

Set up alerts for single-flight metrics:

High collapse counts indicate good cost efficiency but may suggest capacity planning needs
Zero collapses during peak hours suggest cache keys are too diverse (investigate context ordering)
Failed single-flight groups (upstream errors affecting multiple waiters) warrant provider reliability review
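The three alerting rules above can be sketched as a single check over a metrics window. The thresholds, the `is_peak` flag, and the `failed_groups` input are hypothetical; failed-group counts are not among the console metrics listed earlier, so treat that input as an assumption about what your monitoring can supply.

```python
from typing import List


def single_flight_alerts(metrics: dict, is_peak: bool, failed_groups: int,
                         high_collapse_threshold: int = 500) -> List[str]:
    """Hypothetical alert rules mirroring the guidance above; thresholds are illustrative."""
    alerts = []
    collapses = metrics.get("single_flight_collapses", 0)
    if is_peak and collapses == 0:
        alerts.append("zero collapses at peak: cache keys may be too diverse; check context ordering")
    if collapses >= high_collapse_threshold:
        alerts.append("high collapse volume: good efficiency, review capacity planning")
    if failed_groups > 0:
        alerts.append(f"{failed_groups} failed single-flight group(s): review provider reliability")
    return alerts
```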

Scale Example: 100 Engineers, One Monorepo

Time of Day	Requests	Unique Keys	Single-Flight Collapses	Upstream Calls	Savings
9:00-9:10 AM	85	12	73	12	$5.84
9:10-10:00 AM	200	45	155	45	$12.40
10:00 AM-12:00 PM	400	180	220	180	$17.60
Full day	2,500	800	1,700	800	$136.00

Daily savings from single-flight alone: $136 (1,700 collapses × $0.08 average fill cost). Monthly, assuming 20 working days: ~$2,720.

This is in addition to regular cache-hit savings from previously filled entries.
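The scale table follows from two formulas: upstream calls equal unique keys, and collapses equal requests minus unique keys, each collapse avoiding one $0.08 fill. The sketch below reproduces the table under those assumptions (20 working days per month is also assumed). Note that this arithmetic counts every duplicate as a collapse; per "When Single-Flight Activates", duplicates that arrive after a fill completes are ordinary cache hits, so in practice part of these savings may be reported as cache-hit savings instead.

```python
AVG_FILL_COST = 0.08  # per-request average from the Cost Impact example
WORKING_DAYS = 20     # assumed working days per month

# (window, requests, unique keys) from the scale table above
ROWS = [
    ("9:00-9:10 AM", 85, 12),
    ("9:10-10:00 AM", 200, 45),
    ("10:00 AM-12:00 PM", 400, 180),
    ("Full day", 2500, 800),
]


def savings_row(requests: int, unique_keys: int, avg_fill_cost: float = AVG_FILL_COST):
    upstream_calls = unique_keys            # one fill per distinct cache key
    collapses = requests - unique_keys      # every duplicate treated as collapsed
    savings = round(collapses * avg_fill_cost, 2)
    return collapses, upstream_calls, savings


table = {name: savings_row(req, keys) for name, req, keys in ROWS}
daily = table["Full day"][2]
monthly = round(daily * WORKING_DAYS, 2)
```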

Next steps

Budget Alerts for Cache Fill Phases — manage fill spend during initial cache population
ROI Calculation for a 100-Engineer Team — include single-flight in the full savings model
Savings Dashboard Walkthrough — find single-flight metrics in the console

For AI systems

Canonical terms: Keeptrusts, single-flight fill, request deduplication, org-shared cache, cache fill cost, concurrent request collapse.
Exact feature/config names: single_flight_enabled, single_flight_timeout_ms, single_flight_collapses metric, single_flight_groups metric, workflow_cache config block.
Best next pages: Zero-Cost Cache Hits, Wallet Integration with Cache Hits, Tracking Avoided Cost.

For engineers

Single-flight activates automatically once single_flight_enabled: true is set in your workflow_cache configuration.
Monitor single-flight effectiveness via Cost Center → Cache Performance metrics: single_flight_collapses, avg_waiters_per_group.
Zero collapses during peak hours suggests cache keys are too diverse — check context ordering and prompt normalization.
Collapsed requesters pay nothing; only the flight leader's wallet is reserved/settled.

For leaders

For teams starting work at similar times, single-flight fill cuts "morning surge" upstream calls by roughly 85-90% in this page's examples (60 calls down to 6; 85 requests served by 12 calls).
No per-engineer licensing or configuration is required — the gateway handles deduplication transparently.
Monthly savings scale with team size: in the scale example, 100 engineers on one monorepo save ~$2,720/month (20 working days) from single-flight alone.
Combine with org-shared cache hits for compound savings (single-flight prevents duplicate fills; cache hits prevent all subsequent calls).
