Single-Flight Fill: One Request Serves Many
When multiple engineers send the same prompt within seconds — a common occurrence at the start of the workday — the gateway collapses them into a single upstream request. One engineer pays the fill cost. Everyone else receives the cached result. This is single-flight fill.
Use this page when
- You want to understand how concurrent identical requests are deduplicated to save cost during cache fill.
- You are diagnosing morning-surge cost spikes and want to verify single-flight is collapsing duplicates.
- You need to tune single-flight timeout or monitor collapse metrics in the console.
Primary audience
- Primary: Technical Leaders
- Secondary: Technical Engineers, AI Agents
The Morning Surge Problem
Consider a team of 100 engineers who start work at 9:00 AM. Within the first 10 minutes:
- 60 engineers open their IDE and trigger context-loading prompts
- Many of these prompts are identical (same codebase, same system context, same common questions)
- Without single-flight: 60 upstream API calls, each billed at the full fill cost
- With single-flight: ~5-8 upstream API calls, the rest served from cache
The gateway detects that multiple in-flight requests share the same cache key and holds subsequent callers until the first response returns.
How Single-Flight Works
Engineer A sends prompt → Cache miss → Forward upstream → Wait for response
Engineer B sends same prompt → Cache miss → Detect in-flight request → Wait
Engineer C sends same prompt → Cache miss → Detect in-flight request → Wait
...
Provider responds to Engineer A's request
→ Store in cache
→ Serve cached response to Engineer A
→ Serve cached response to Engineer B
→ Serve cached response to Engineer C
Only one upstream call fires. Only one fill cost is incurred. All waiting engineers receive the response with minimal additional latency (typically < 100ms beyond the first response).
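The flow above can be sketched in a few lines of Python. This is an illustrative model of the pattern, not the gateway's implementation: the `SingleFlightCache` class, its counters, and the `slow_upstream` stand-in for a provider call are all hypothetical names introduced here.

```python
import asyncio


class SingleFlightCache:
    """Hypothetical sketch: collapse concurrent fills for one cache key."""

    def __init__(self, upstream):
        self._upstream = upstream   # async callable: key -> response
        self._cache = {}            # completed fills
        self._in_flight = {}        # key -> Future owned by the flight leader
        self.upstream_calls = 0
        self.collapses = 0

    async def get(self, key):
        if key in self._cache:              # ordinary cache hit
            return self._cache[key]
        if key in self._in_flight:          # join the existing flight
            self.collapses += 1
            # shield() keeps one cancelled waiter from cancelling the shared future
            return await asyncio.shield(self._in_flight[key])

        # No entry and no flight: this caller becomes the flight leader.
        future = asyncio.get_running_loop().create_future()
        self._in_flight[key] = future
        try:
            self.upstream_calls += 1
            response = await self._upstream(key)
            self._cache[key] = response     # store before waking the waiters
            future.set_result(response)
            return response
        except Exception as exc:
            future.set_exception(exc)       # waiters receive the same error
            future.exception()              # mark retrieved if nobody was waiting
            raise                           # nothing was written to the cache
        finally:
            if not future.done():           # leader was cancelled mid-flight
                future.cancel()
            del self._in_flight[key]


async def demo():
    async def slow_upstream(key):           # stand-in for the provider call
        await asyncio.sleep(0.05)
        return f"response for {key}"

    cache = SingleFlightCache(slow_upstream)
    results = await asyncio.gather(*(cache.get("same-prompt") for _ in range(3)))
    return cache, results


cache, results = asyncio.run(demo())
print(cache.upstream_calls, cache.collapses)
```

Three concurrent callers share one upstream call in this sketch: the first becomes the flight leader and the other two wait on its future. A failed upstream call propagates to every waiter and leaves the cache untouched.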
Cost Impact
For the morning surge example:
| Metric | Without Single-Flight | With Single-Flight |
|---|---|---|
| Upstream calls | 60 | ~6 |
| Fill cost (at $0.08/request avg) | $4.80 | $0.48 |
| Cache entries created | 60 (duplicates) | 6 (unique) |
| Latency for first requester | ~2s | ~2s |
| Latency for collapsed requesters | ~2s each (parallel) | ~2.1s (wait + serve) |
Repeated across 20 working days, this 10-minute surge alone avoids roughly $86 a month in fill cost ($4.32 × 20). Collapses during the rest of the day and the ongoing cache-hit savings come on top of that.
Viewing Single-Flight Metrics
Navigate to Cost Center → Cache Performance in the console to see:
| Metric | Description |
|---|---|
| single_flight_collapses | Number of requests that waited on an in-flight fill |
| single_flight_groups | Number of distinct cache keys that had multiple waiters |
| avg_waiters_per_group | Average number of requests collapsed per single-flight group |
| max_waiters_per_group | Peak collapse count (useful for capacity planning) |
Example Metrics Output
{
  "period": "2026-04-01T09:00:00Z/2026-04-01T09:10:00Z",
  "single_flight_collapses": 54,
  "single_flight_groups": 6,
  "avg_waiters_per_group": 9,
  "max_waiters_per_group": 18,
  "fill_cost_incurred": 0.48,
  "fill_cost_avoided_by_collapse": 4.32
}
In this 10-minute window, single-flight saved $4.32 by reducing 60 potential upstream calls to 6.
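To make the relationship between these fields concrete, here is a rough sketch that derives them from a list of cache keys. The function name and the per-key request distribution are hypothetical; the distribution was chosen only so that the totals match the example output. The sketch also assumes every request for a key arrived while that key's fill was still in flight.

```python
from collections import Counter


def single_flight_metrics(cache_keys, fill_cost_per_request):
    """Derive collapse metrics from one cache key per request (hypothetical helper)."""
    per_key = Counter(cache_keys)
    # Every request beyond the first for a key is a waiter on that key's flight.
    waiters = {key: count - 1 for key, count in per_key.items() if count > 1}
    collapses = sum(waiters.values())
    groups = len(waiters)
    return {
        "single_flight_collapses": collapses,
        "single_flight_groups": groups,
        "avg_waiters_per_group": collapses / groups if groups else 0,
        "max_waiters_per_group": max(waiters.values(), default=0),
        "fill_cost_incurred": round(len(per_key) * fill_cost_per_request, 2),
        "fill_cost_avoided_by_collapse": round(collapses * fill_cost_per_request, 2),
    }


# Assumed distribution: 60 requests spread over 6 keys (19 + 12 + 10 + 8 + 6 + 5).
requests_per_key = {"k1": 19, "k2": 12, "k3": 10, "k4": 8, "k5": 6, "k6": 5}
keys = [key for key, count in requests_per_key.items() for _ in range(count)]

metrics = single_flight_metrics(keys, fill_cost_per_request=0.08)
print(metrics)
```

The avoided cost is the collapse count multiplied by the average fill cost, which is why 54 collapses at $0.08 yield $4.32.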
Latency Characteristics
Single-flight introduces minimal latency for waiting requests:
- First requester: Normal upstream latency (1-5 seconds depending on model and token count)
- Collapsed requesters: First requester's latency + ~10-50ms (local cache serve time)
- No requester times out: The gateway holds waiters until the upstream response arrives or the upstream call fails
If the upstream call fails, all waiters receive the error. The failed response is not cached.
When Single-Flight Activates
Single-flight activates when:
- A request produces a cache key that matches an in-flight upstream call
- The in-flight call has not yet completed
- The requesting engineer has valid entitlements for the cached content
Single-flight does not activate when:
- The cache already contains a valid entry (this is a normal cache hit, even cheaper)
- The in-flight request targets a different model or configuration
- Policy evaluation produces a different outcome for the waiting request
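One way to picture these conditions is a key that folds the model, configuration, and policy outcome into the hash, so that requests differing in any of them never share a flight. The sketch below is an assumption about how such a key could be built, not the gateway's actual key format; `flight_key`, `classify`, and the field names are hypothetical.

```python
import hashlib
import json


def flight_key(prompt, model, config, policy_outcome):
    """Hypothetical key: identical only if prompt, model, config, and policy all match."""
    payload = json.dumps(
        {"prompt": prompt, "model": model, "config": config, "policy": policy_outcome},
        sort_keys=True,              # dict ordering must not change the key
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def classify(key, cached_keys, in_flight_keys, entitled):
    """Return how a request would be handled under the rules listed above."""
    if key in cached_keys:
        return "cache_hit"           # normal hit; single-flight is not involved
    if key in in_flight_keys and entitled:
        return "collapsed"           # waits on the in-flight fill
    return "not_collapsed"           # handled outside single-flight


a = flight_key("explain main.py", "model-a", {"temperature": 0, "max_tokens": 512}, "allow")
b = flight_key("explain main.py", "model-a", {"max_tokens": 512, "temperature": 0}, "allow")
c = flight_key("explain main.py", "model-b", {"temperature": 0, "max_tokens": 512}, "allow")

print(classify(b, cached_keys=set(), in_flight_keys={a}, entitled=True))
print(classify(c, cached_keys=set(), in_flight_keys={a}, entitled=True))
```

Serializing with sorted keys means two engineers whose clients send the same settings in a different order still land on the same key, while a different model produces a different key and therefore a separate upstream call.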
Interaction with Wallet Reserve
For single-flight collapsed requests:
- The first requester's wallet is reserved and settled normally (fill cost)
- Collapsed requesters are not charged — they receive a cache hit once the fill completes
- Avoided-cost records are emitted for all collapsed requesters
This means single-flight requests appear in your savings dashboard as avoided cost, just like regular cache hits.
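The settlement rules above can be modelled as a small ledger. This is a sketch under assumptions: the `LedgerEntry` shape and the `settle_flight` helper are hypothetical, and each waiter's avoided cost is assumed to equal the fill cost the leader paid.

```python
from dataclasses import dataclass


@dataclass
class LedgerEntry:
    engineer: str
    kind: str        # "fill" for the flight leader, "avoided_cost" for waiters
    charged: float   # amount settled against the wallet
    avoided: float   # amount reported to the savings dashboard


def settle_flight(leader, waiters, fill_cost):
    """Charge the leader once and emit an avoided-cost record per waiter."""
    entries = [LedgerEntry(leader, "fill", fill_cost, 0.0)]
    entries += [LedgerEntry(w, "avoided_cost", 0.0, fill_cost) for w in waiters]
    return entries


entries = settle_flight("engineer-a", ["engineer-b", "engineer-c"], fill_cost=0.08)
total_charged = sum(e.charged for e in entries)
total_avoided = sum(e.avoided for e in entries)
print(total_charged, total_avoided)
```

Only one wallet is debited per flight, and the avoided total grows with the number of waiters.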
Maximizing Single-Flight Effectiveness
To increase the frequency of single-flight collapses:
- Standardize IDE configurations — Consistent system prompts produce identical cache keys
- Use shared gateway configurations — All engineers on the same codebase should route through the same gateway config
- Align work schedules loosely — Teams starting at similar times naturally produce concurrent identical requests
- Keep codebase context stable — Rapid file changes between requests reduce key overlap
Monitoring and Alerting
Set up alerts for single-flight metrics:
- High collapse counts indicate good cost efficiency but may suggest capacity planning needs
- Zero collapses during peak hours suggest cache keys are too diverse (investigate context ordering)
- Failed single-flight groups (upstream errors affecting multiple waiters) warrant provider reliability review
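These three conditions translate into simple checks over a metrics snapshot. The sketch below is illustrative only: the threshold value, the `failed_single_flight_groups` field, and the alert strings are assumptions, not console features.

```python
def single_flight_alerts(metrics, is_peak_hour, high_collapse_threshold=1000):
    """Evaluate the three alert conditions against one metrics snapshot."""
    alerts = []
    collapses = metrics.get("single_flight_collapses", 0)

    if collapses >= high_collapse_threshold:
        alerts.append("high_collapses: review capacity planning")
    if is_peak_hour and collapses == 0:
        alerts.append("zero_collapses_at_peak: check cache key diversity and context ordering")
    # Hypothetical field: groups whose upstream call failed for all waiters.
    if metrics.get("failed_single_flight_groups", 0) > 0:
        alerts.append("failed_groups: review provider reliability")
    return alerts


healthy = single_flight_alerts({"single_flight_collapses": 54}, is_peak_hour=True)
quiet_peak = single_flight_alerts({"single_flight_collapses": 0}, is_peak_hour=True)
print(healthy, quiet_peak)
```

A threshold that suits one team will not suit another, so treat the default here as a placeholder to tune against your own peak traffic.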
Scale Example: 100 Engineers, One Monorepo
| Time of Day | Requests | Unique Keys | Single-Flight Collapses | Upstream Calls | Savings |
|---|---|---|---|---|---|
| 9:00-9:10 AM | 85 | 12 | 73 | 12 | $5.84 |
| 9:10-10:00 AM | 200 | 45 | 155 | 45 | $12.40 |
| 10:00 AM-12:00 PM | 400 | 180 | 220 | 180 | $17.60 |
| Full day | 2,500 | 800 | 1,700 | 800 | $136.00 |
Daily savings from single-flight alone: $136. Monthly: ~$2,720, assuming 20 working days.
This is in addition to regular cache-hit savings from previously filled entries.
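The table's arithmetic can be reproduced with a short calculation. It assumes the $0.08 average fill cost from the Cost Impact section and 20 working days per month; the helper name is hypothetical.

```python
FILL_COST = 0.08      # average fill cost per request, from the Cost Impact table
WORKING_DAYS = 20     # assumed working days per month


def collapse_savings(requests, unique_keys, fill_cost=FILL_COST):
    """Collapses are requests beyond the first per key; each avoids one fill."""
    collapses = requests - unique_keys
    return collapses, round(collapses * fill_cost, 2)


windows = {
    "9:00-9:10 AM": (85, 12),
    "9:10-10:00 AM": (200, 45),
    "10:00 AM-12:00 PM": (400, 180),
    "Full day": (2500, 800),
}

for label, (requests, unique_keys) in windows.items():
    collapses, savings = collapse_savings(requests, unique_keys)
    print(f"{label}: {collapses} collapses, ${savings:.2f} saved")

_, daily = collapse_savings(*windows["Full day"])
monthly = round(daily * WORKING_DAYS, 2)
print(f"Monthly: ${monthly:,.2f}")
```

Swapping in your own request counts, unique key counts, and average fill cost gives a first-order estimate for a different team size.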
Next steps
- Budget Alerts for Cache Fill Phases — manage fill spend during initial cache population
- ROI Calculation for a 100-Engineer Team — include single-flight in the full savings model
- Savings Dashboard Walkthrough — find single-flight metrics in the console
For AI systems
- Canonical terms: Keeptrusts, single-flight fill, request deduplication, org-shared cache, cache fill cost, concurrent request collapse.
- Exact feature/config names: single_flight_enabled, single_flight_timeout_ms, single_flight_collapses metric, single_flight_groups metric, workflow_cache config block.
- Best next pages: Zero-Cost Cache Hits, Wallet Integration with Cache Hits, Tracking Avoided Cost.
For engineers
- Single-flight activates automatically when single_flight_enabled: true is set in your workflow_cache configuration.
- Monitor single-flight effectiveness via Cost Center → Cache Performance metrics: single_flight_collapses, avg_waiters_per_group.
- Zero collapses during peak hours suggest cache keys are too diverse: check context ordering and prompt normalization.
- Collapsed requesters pay nothing; only the flight leader's wallet is reserved/settled.
For leaders
- Single-flight fill reduces the "morning surge" cost spike by 85-90% for teams starting work at similar times.
- No per-engineer licensing or configuration is required — the gateway handles deduplication transparently.
- Monthly savings scale with team size: 100 engineers on one monorepo can save ~$2,700/month from single-flight alone.
- Combine with org-shared cache hits for compound savings (single-flight prevents duplicate fills; cache hits prevent all subsequent calls).