Stop letting LLM calls blow up your server

Most LLM tooling optimizes execution. async-bulkhead-llm decides whether a request should run at all.

LLM calls are expensive in every dimension: latency, cost, and compute. A single Claude or GPT call can occupy a connection for seconds, burn through thousands of tokens, and tie up downstream resources the entire time.

Most backend code treats LLM calls like any other async operation. Fire the request, await the response, hope for the best. That works until traffic spikes, and suddenly you have 200 concurrent LLM calls, provider rate-limit errors, and a $400 bill from a 10-minute window.

The fix is admission control: deciding whether a request should run before it reaches the provider.

What async-bulkhead-llm does

It sits between your route handler and your LLM client. Before a call goes out, it checks two things: is there capacity (a free concurrency slot), and is there budget (room under the token ceiling)? If both pass, the request proceeds. If either fails, it's rejected immediately.

import { createLLMBulkhead } from 'async-bulkhead-llm';

const bulkhead = createLLMBulkhead({
  model:         'claude-sonnet-4',
  maxConcurrent: 10,                  // concurrency slots
  tokenBudget:   { budget: 200_000 }, // token ceiling
});

const result = await bulkhead.run(request, async () => {
  return callYourLLMProvider(request);
});

If 10 calls are already in flight, the 11th is rejected before it ever touches the provider. If the token budget is exhausted, same thing. No silent queuing, no cascading overload.
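
In a route handler, that rejection shows up as a thrown error you can map to a 429. Here's a sketch with Express; the handler shape is an assumption, and real code would distinguish admission rejections from provider failures rather than catching everything:

import express from 'express';
import { createLLMBulkhead } from 'async-bulkhead-llm';

const app = express();
app.use(express.json());

const bulkhead = createLLMBulkhead({
  model: 'claude-sonnet-4',
  maxConcurrent: 10,
  tokenBudget: { budget: 200_000 },
});

app.post('/chat', async (req, res) => {
  try {
    const result = await bulkhead.run(req.body, () => callYourLLMProvider(req.body));
    res.json(result);
  } catch (err) {
    // Assumed: an admission rejection surfaces as a catchable error.
    // Production code would check the error type before answering 429.
    res.status(429).json({ error: 'At capacity, retry shortly.' });
  }
});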

Why this isn't just p-limit

Concurrency-limiting libraries like p-limit or Bottleneck count tasks. That's fine for uniform work, but LLM calls aren't uniform. A 50-token classification and a 4,000-token document summary both count as "one task," but they have wildly different cost and resource impact.

async-bulkhead-llm is token-aware. It estimates input tokens from the request and reserves input + max_output against a budget before admission. When the call completes, it refunds the difference between what was reserved and what was actually used.

const result = await bulkhead.run(
  request,
  async () => callLLM(request),
  {
    // Report real usage so the unused part of the reservation is refunded.
    getUsage: (response) => ({
      input:  response.usage.input_tokens,
      output: response.usage.output_tokens,
    }),
  },
);

This means a burst of cheap classification calls won't be blocked by the same limits as expensive summarization calls. The budget tracks real cost, not just concurrency.
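
For intuition, here's the reserve-then-refund arithmetic spelled out. This is illustrative bookkeeping, not the library's internals; the estimates and numbers are made up:

// Illustrative only — not the library's actual accounting.
let available = 200_000;                  // configured token budget

// Admission: reserve the worst case (input estimate + max_output).
const reserved = 1_200 + 1_000;
if (reserved > available) throw new Error('rejected: over budget');
available -= reserved;                    // 200_000 -> 197_800

// Completion: getUsage reports what the call actually consumed.
const actual = { input: 1_150, output: 430 };

// Refund the unused part of the reservation.
available += reserved - (actual.input + actual.output);   // -> 198_420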

In-flight deduplication

If three users ask the same question within a few seconds, you don't need three LLM calls. async-bulkhead-llm detects identical in-flight requests and shares a single call across all of them.

const bulkhead = createLLMBulkhead({
  model:         'claude-sonnet-4',
  maxConcurrent: 10,
  deduplication: true,
});

This is especially useful for search-style workloads where multiple users hit the same query. One LLM call, one slot consumed, one set of tokens burned, three users served.
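
Under the hood, dedup like this is usually promise coalescing: identical in-flight requests join one pending promise. A generic sketch of the pattern, not the library's code; the cache key would be something like a hash of model plus prompt plus parameters:

// Generic promise-coalescing pattern (not the library's implementation).
const inFlight = new Map<string, Promise<unknown>>();

async function dedupedCall<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;             // join the live call

  const pending = fn().finally(() => inFlight.delete(key)); // clear on settle
  inFlight.set(key, pending);
  return pending;
}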

Interactive vs batch

Not all LLM workloads have the same tolerance for delay. A user waiting for a chat response needs fail-fast. A background job processing a document queue can wait.

// User-facing — reject immediately if busy
const chat = createLLMBulkhead({
  model: 'claude-sonnet-4',
  maxConcurrent: 10,
  profile: 'interactive',
});

// Background — bounded queue, willing to wait
const batch = createLLMBulkhead({
  model: 'claude-sonnet-4',
  maxConcurrent: 4,
  profile: 'batch',
});

This is the same tradeoff LoadLens visualizes: fail-fast protects responsiveness, queuing absorbs bursts but risks latency under sustained pressure. The difference is that here you're applying it specifically to LLM calls, where the stakes per request are higher.
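
Stripped of everything else, the two profiles are two admission policies over the same slot counter. A hypothetical sketch for intuition, not the library's internals:

// Hypothetical sketch of the two policies, for intuition only.
class Slots {
  private inFlight = 0;
  private waiters: Array<() => void> = [];

  constructor(private max: number, private maxQueue: number) {}

  // 'interactive': reject immediately when every slot is taken.
  tryAcquire(): boolean {
    if (this.inFlight >= this.max) return false;
    this.inFlight++;
    return true;
  }

  // 'batch': wait in a bounded queue for a slot to free up.
  async acquire(): Promise<void> {
    if (this.tryAcquire()) return;
    if (this.waiters.length >= this.maxQueue) throw new Error('queue full');
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  release(): void {
    const next = this.waiters.shift();
    if (next) next();      // hand the slot straight to a waiter
    else this.inFlight--;
  }
}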

Graceful shutdown

When your process receives SIGTERM, you don't want in-flight LLM calls to be killed mid-response. You also don't want new calls to be admitted while you're draining.

process.on('SIGTERM', async () => {
  bulkhead.close();
  await bulkhead.drain();
  process.exit(0);
});

close() rejects all pending waiters and blocks future admission. drain() resolves when every in-flight call has completed.
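
One ordering note: in a real service, stop accepting HTTP traffic before you drain, so nothing new arrives mid-shutdown. A sketch with Node's built-in server; the 30-second timeout guard is an addition of this example, not a library feature:

import { createServer } from 'node:http';

const server = createServer(/* ...your handler... */);

process.on('SIGTERM', async () => {
  server.close();      // stop accepting new connections first
  bulkhead.close();    // then block new admissions, reject pending waiters

  // Guard against a hung provider call keeping the process alive forever.
  await Promise.race([
    bulkhead.drain(),
    new Promise((resolve) => setTimeout(resolve, 30_000)),
  ]);
  process.exit(0);
});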

Where it fits

async-bulkhead-llm is not an LLM framework. It doesn't build chains, manage prompts, or select models. It answers one question: should this request run right now?

Use LangChain to decide what to run. Use your provider SDK to decide how to call it. Use async-bulkhead-llm to decide whether it should run at all.

npm install async-bulkhead-llm