When a bounded queue is the right call
Fail-fast gets the good press. But there are real cases where letting requests wait is the better engineering decision.
The advice is familiar: reject early, protect latency, surface overload. It's clean, and it's usually right.
But "usually" isn't "always." There are workloads where rejecting a request costs more than holding it for a moment.
The case for waiting
A bounded queue makes sense when three conditions are true at the same time:
- Overload is brief. The spike resolves on its own: a burst of webhook deliveries, a cron-triggered batch, a sudden page load from a social media link. Demand exceeds capacity for seconds, not minutes.
- A late answer is better than no answer. The caller benefits from a response even if it takes an extra 200ms. A payment confirmation. A document conversion. A search result the user is still waiting for.
- The caller doesn't have a tighter timeout. If the client will give up after 500ms and the queue adds 400ms of wait time, you're doing work that nobody will read the result of.
When all three hold, a small queue absorbs the burst without a single rejection. Fail-fast would have returned errors for requests that could have been served half a second later.
What "bounded" means and why it matters
An unbounded queue is a memory leak with a purpose. It accepts everything, grows without limit, and eventually kills the process. Every queue needs a ceiling.
```js
const reports = createExpressBulkhead({
  name: 'reports',
  maxConcurrent: 4,
  maxQueue: 8,
  queueWaitTimeoutMs: 250,
});
```
Three constraints, all doing different jobs:
- `maxConcurrent` caps active work. This protects the downstream resource.
- `maxQueue` caps waiting work. This protects memory and keeps the queue from growing without bound.
- `queueWaitTimeoutMs` caps wait time. This protects the caller from a slow "no."
Remove any one of these and the queue stops being helpful. No concurrency limit means the queue never fills and work piles up downstream. No queue limit means memory grows unbounded. No timeout means callers wait forever for capacity that may never free up.
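The interplay of the three constraints can be sketched as a small admission gate. This is a hypothetical implementation for illustration, not the internals of `createExpressBulkhead`; the class and its method names are made up.

```typescript
type Waiter = {
  resolve: () => void;
  reject: (e: Error) => void;
  timer: ReturnType<typeof setTimeout>;
};

class Bulkhead {
  private active = 0;
  private queue: Waiter[] = [];

  constructor(
    private maxConcurrent: number,      // caps active work
    private maxQueue: number,           // caps waiting work
    private queueWaitTimeoutMs: number, // caps time spent waiting
  ) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }

  private acquire(): Promise<void> {
    if (this.active < this.maxConcurrent) {
      this.active++; // a slot is free: admit immediately
      return Promise.resolve();
    }
    if (this.queue.length >= this.maxQueue) {
      // queue is full: fail fast rather than grow without bound
      return Promise.reject(new Error('bulkhead full'));
    }
    return new Promise((resolve, reject) => {
      const waiter: Waiter = {
        resolve,
        reject,
        timer: setTimeout(() => {
          // waited too long: give the caller a bounded "no"
          this.queue.splice(this.queue.indexOf(waiter), 1);
          reject(new Error('queue wait timed out'));
        }, this.queueWaitTimeoutMs),
      };
      this.queue.push(waiter);
    });
  }

  private release(): void {
    const next = this.queue.shift();
    if (next) {
      clearTimeout(next.timer);
      next.resolve(); // hand the freed slot to the oldest waiter
    } else {
      this.active--;
    }
  }
}
```

Note how each failure path maps to one constraint: a full queue rejects instantly, and a timed-out wait rejects late but bounded.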
The timeout is the most important setting
A queue without a wait timeout is a trap. It converts every rejection into a slow rejection: the caller waits for capacity, capacity never arrives, and eventually the connection dies or the client gives up.
The timeout should be shorter than the caller's patience. If your API clients time out at 5 seconds and your work takes 2 seconds, a queueWaitTimeoutMs of 1000 gives a queued request a chance to be admitted and still finish within the client's window. A queueWaitTimeoutMs of 4000 means the request might get admitted, start work, and then get killed by the client timeout before it finishes.
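That budgeting can be captured in a throwaway helper. The function name and the 500ms safety margin are assumptions for illustration, not part of any library:

```typescript
// How long can a request sit in the queue and still finish
// inside the client's timeout, leaving some safety margin?
function maxSafeQueueWait(
  clientTimeoutMs: number,
  workMs: number,
  safetyMarginMs = 500, // assumed headroom for network and serialization
): number {
  return Math.max(0, clientTimeoutMs - workMs - safetyMarginMs);
}

// 5s client timeout, 2s of work, 500ms margin -> at most 2500ms of waiting.
// A queueWaitTimeoutMs of 1000 fits the budget; 4000 blows it.
const budget = maxSafeQueueWait(5000, 2000); // 2500
```

If the budget comes out at zero, queuing cannot help that route: every queued request would finish after the client stopped listening.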
Where queues go wrong
Queues hide overload. That's their feature and their failure mode.
Under sustained load, a bounded queue fills, wait times climb to the timeout ceiling, and then requests start being rejected anyway, after the caller has already been waiting. The system did more work (managing the queue, admitting late requests, possibly starting work that won't finish in time) for a worse outcome than fail-fast would have delivered.
You can see this happen in real time at loadlens.dev. Set requests per second above capacity, enable a bounded queue, and watch P95 latency climb while fail-fast stays flat. The queue absorbs a few early rejections, but under sustained pressure it trades those rejections for latency that affects everyone.
The insight: queues are good at absorbing spikes. They are bad at absorbing sustained overload. The difference is duration.
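One way to see the spike-versus-sustained distinction is a toy per-tick simulation. The model is deliberately crude (fixed capacity per tick, queued work carried over, no wait timeout) and is not a model of loadlens.dev:

```typescript
// Count served and rejected requests over a series of ticks,
// given a per-tick capacity and a bounded queue.
function simulate(arrivalsPerTick: number[], capacity: number, maxQueue: number) {
  let queue = 0;
  let served = 0;
  let rejected = 0;
  for (const arrivals of arrivalsPerTick) {
    let demand = queue + arrivals;          // carried-over backlog plus new work
    const done = Math.min(demand, capacity);
    served += done;
    demand -= done;
    rejected += Math.max(0, demand - maxQueue); // overflow past the queue ceiling
    queue = Math.min(demand, maxQueue);         // backlog waits for the next tick
  }
  return { served, rejected, backlog: queue };
}

// A brief spike: the queue drains the burst, zero rejections.
const spike = simulate([10, 2, 2, 2], 4, 8);     // rejected: 0
// Sustained overload: the queue fills, then rejects anyway -- after a wait.
const sustained = simulate([10, 10, 10, 10], 4, 8); // rejected: 16
```

Same capacity, same queue; only the duration of the overload differs, and it flips the outcome.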
A decision framework
Use maxQueue: 0 (fail-fast) when:
- The route serves interactive users who expect fast feedback
- Overload is sustained or unpredictable in duration
- The client has its own timeout and a late response has no value
- You'd rather show a clean error than a slow page
Use maxQueue > 0 (bounded queue) when:
- The work is too valuable to drop and a late answer still has value
- Spikes are short and capacity recovers within the timeout window
- The caller is tolerant of added latency (batch jobs, internal services, async workflows)
- You've set a `queueWaitTimeoutMs` that's shorter than the caller's patience
Never use an unbounded queue. If you don't know what to set maxQueue to, start with the same value as maxConcurrent and work down from there.
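Applied to the framework, here are two illustrative configurations. The route names and values are examples, not recommendations:

```typescript
// Interactive route: fail fast, a late search result has no value.
const interactiveSearch = {
  name: 'search',
  maxConcurrent: 10,
  maxQueue: 0,
};

// Batch-ish route: a late report still has value, so queue briefly.
const reportExport = {
  name: 'reports',
  maxConcurrent: 4,
  maxQueue: 4,             // start at or below maxConcurrent, then tune down
  queueWaitTimeoutMs: 250, // shorter than any internal caller's patience
};
```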
The honest version
Most teams reach for queues because rejection feels like failure. A 503 on the dashboard looks bad. A slow response doesn't show up until someone checks p95.
That instinct is worth examining. A queue doesn't reduce overload. It redistributes the pain from "some users get errors" to "all users get slower." Whether that's better depends on your users, your SLAs, and how long the overload lasts.
When the conditions are right (brief spikes, valuable work, patient callers) a bounded queue is the correct engineering decision. When they're not, it's a way to hide a problem you should be solving.