feat: add configurable BullMQ worker stall options to prevent duplicate job retries #6495
Nek-11 wants to merge 7 commits into
BullMQ's default stall detection (30s interval, 1 retry) causes long-running jobs like document store upserts to be silently retried, resulting in duplicate vector embeddings in the database.

Adds two new environment variables to control this behaviour:
- WORKER_STALLED_INTERVAL (ms): how often BullMQ checks for stalled jobs. Defaults to 300000 (5 min) instead of BullMQ's 30s default.
- WORKER_MAX_STALLED_COUNT: how many times a stalled job is retried before failing. Defaults to 0 (fail immediately, no retry) instead of BullMQ's default of 1.

Setting WORKER_MAX_STALLED_COUNT=0 prevents duplicate processing on retry while still allowing the job to be re-triggered manually if needed.
Code Review
This pull request introduces two new environment variables, WORKER_STALLED_INTERVAL and WORKER_MAX_STALLED_COUNT, to configure the stalled job interval and maximum stalled count for the queue worker in BaseQueue.ts. The reviewer suggested explicitly specifying the radix in parseInt and implementing fail-fast validation to throw an error if the environment variables contain invalid non-numeric values, rather than silently falling back to defaults.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
const WORKER_STALLED_INTERVAL = process.env.WORKER_STALLED_INTERVAL ? parseInt(process.env.WORKER_STALLED_INTERVAL) : 300000
const WORKER_MAX_STALLED_COUNT = process.env.WORKER_MAX_STALLED_COUNT ? parseInt(process.env.WORKER_MAX_STALLED_COUNT) : 0
When parsing environment variables with parseInt, it is safer to specify the radix (base 10) explicitly to avoid any parsing ambiguity. Additionally, if the environment variable is set to an invalid non-numeric value, parseInt will return NaN. To promote fail-fast behavior, we should throw an error for invalid input types from external sources (like environment variables) rather than silently falling back to a default value.
const WORKER_STALLED_INTERVAL = (() => {
  const val = process.env.WORKER_STALLED_INTERVAL;
  if (!val) return 300000;
  const parsed = parseInt(val, 10);
  if (isNaN(parsed)) throw new Error("Invalid WORKER_STALLED_INTERVAL");
  return parsed;
})();
const WORKER_MAX_STALLED_COUNT = (() => {
  const val = process.env.WORKER_MAX_STALLED_COUNT;
  if (!val) return 0;
  const parsed = parseInt(val, 10);
  if (isNaN(parsed)) throw new Error("Invalid WORKER_MAX_STALLED_COUNT");
  return parsed;
})();
References
- When handling potentially invalid data from external sources (like an API response), prefer throwing an error for invalid input types rather than silently returning a default or empty value. This promotes fail-fast behavior.
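One caveat on the suggestion above: parseInt accepts a numeric prefix, so a value such as "300s" parses to 300 and passes an isNaN check. The following is a minimal sketch of a stricter variant, not code from this PR. The helper name readIntEnv, the Env type, and the whitespace trimming are assumptions made for illustration; the env object stands in for process.env.

```typescript
// Hypothetical helper (not part of the PR): read a non-negative integer
// from an environment-style map, failing fast on anything else.
type Env = Record<string, string | undefined>

function readIntEnv(env: Env, name: string, fallback: number): number {
    const raw = env[name]
    // Unset or empty: use the documented default.
    if (raw === undefined || raw.trim() === '') {
        return fallback
    }
    const trimmed = raw.trim()
    // parseInt('300s', 10) returns 300, so validate the whole string first.
    if (!/^\d+$/.test(trimmed)) {
        throw new Error(`Invalid ${name}: expected a non-negative integer, got "${raw}"`)
    }
    return parseInt(trimmed, 10)
}

// Sample input; in the server this map would be process.env.
const sampleEnv: Env = { WORKER_STALLED_INTERVAL: '600000' }
const stalledInterval = readIntEnv(sampleEnv, 'WORKER_STALLED_INTERVAL', 300000)
const maxStalledCount = readIntEnv(sampleEnv, 'WORKER_MAX_STALLED_COUNT', 0)
```

Here stalledInterval takes the configured value and maxStalledCount falls back to its default, while a value like "300s" or "-1" would throw at startup instead of being silently truncated or accepted.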
/gemini review again please
Code Review
This pull request introduces configurable worker stalled interval and max stalled count settings parsed from environment variables, along with validation checks. The review feedback suggests using nullish coalescing (??) instead of ternary operators for more idiomatic TypeScript environment variable fallbacks, and wrapping the if statement bodies in curly braces for better maintainability.
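As a rough illustration of the `??` form with braced if bodies, here is a minimal sketch rather than the code actually committed. The function name stalledIntervalFrom and the env parameter (standing in for process.env) are assumptions. Note one behavioural difference from the ternary: `??` only falls back on undefined or null, so an empty string is parsed and rejected rather than replaced by the default.

```typescript
// Illustrative only: the function name and the env parameter are assumptions.
function stalledIntervalFrom(env: Record<string, string | undefined>): number {
    // `??` substitutes the default only when the variable is undefined or null.
    const raw = env['WORKER_STALLED_INTERVAL'] ?? '300000'
    const parsed = parseInt(raw, 10)
    if (isNaN(parsed)) {
        throw new Error('Invalid WORKER_STALLED_INTERVAL')
    }
    return parsed
}

const unsetResult = stalledIntervalFrom({})
const explicitResult = stalledIntervalFrom({ WORKER_STALLED_INTERVAL: '1000' })
```

With the ternary form, WORKER_STALLED_INTERVAL="" would quietly use the default; with `??` it throws, which is arguably closer to the fail-fast intent but is a change worth deciding on deliberately.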
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Problem
BullMQ's default stall detection uses a 30-second interval with 1 automatic retry (maxStalledCount: 1). Long-running jobs, such as document store upserts that involve web scraping and LLM summarisation, frequently exceed this threshold and get silently retried.

This causes the entire job to re-run, reinserting all vector embeddings into the database. Even with a record manager in full cleanup mode, the mid-stall inconsistency means duplicates slip through, resulting in a bloated vector store.

Solution
Expose BullMQ's stalledInterval and maxStalledCount Worker options as environment variables so operators can tune them for their workload:
- WORKER_STALLED_INTERVAL: default 300000 (5 min)
- WORKER_MAX_STALLED_COUNT: default 0

The new defaults are intentionally conservative: a 5-minute stall check instead of BullMQ's 30 seconds, and no automatic retry of stalled jobs.

Operators who prefer the old retry behaviour can set WORKER_MAX_STALLED_COUNT=1 to restore it.

Changes
- packages/server/src/queue/BaseQueue.ts: added WORKER_STALLED_INTERVAL and WORKER_MAX_STALLED_COUNT constants read from env, passed to the BullMQ Worker constructor
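
To make the wiring concrete, here is a minimal sketch of how the two variables could map onto BullMQ's stalledInterval and maxStalledCount Worker options. It is not the actual BaseQueue.ts code: the StallOptions interface and buildStallOptions function are hypothetical names, and the env parameter stands in for process.env so the snippet runs without bullmq installed.

```typescript
// The two BullMQ Worker options this PR exposes.
interface StallOptions {
    stalledInterval: number // milliseconds between stalled-job checks
    maxStalledCount: number // times a stalled job is retried before it fails
}

// Hypothetical helper: derive the options from an environment-style map,
// using the defaults described in this PR.
function buildStallOptions(env: Record<string, string | undefined>): StallOptions {
    const interval = env['WORKER_STALLED_INTERVAL']
    const count = env['WORKER_MAX_STALLED_COUNT']
    return {
        stalledInterval: interval ? parseInt(interval, 10) : 300000,
        maxStalledCount: count ? parseInt(count, 10) : 0
    }
}

// With bullmq available, the result would be spread into the Worker options:
//   new Worker(queueName, processor, { connection, ...buildStallOptions(process.env) })
// queueName, processor and connection are placeholders here.
const prDefaults = buildStallOptions({})
const oldBehaviour = buildStallOptions({
    WORKER_STALLED_INTERVAL: '30000',
    WORKER_MAX_STALLED_COUNT: '1'
})
```

In this sketch prDefaults carries the new defaults, and oldBehaviour shows the settings an operator would use to restore BullMQ's stock 30-second interval and single retry.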