
Fix for "One or more background workers are no longer alive. Exiting" errors#136

Open
MikeUU332 wants to merge 1 commit into MIC-DKFZ:master from MikeUU332:master

Conversation

@MikeUU332

See also MIC-DKFZ/nnUNet#2910; this is the same issue and fix.
Hello,

We encountered the "One or more background workers are no longer alive. Exiting" error as described in:

#134
#133

when running nnUNet on an HPC cluster. A colleague and I looked into this issue and came across this link:
joblib/threadpoolctl#176

which says that using with threadpool_limits is not thread safe and recommends a global mutex instead. with threadpool_limits is used in both the nnUNet and batchgenerators libraries. By using a global mutex instead of with threadpool_limits, as done in this pull request, we no longer encounter this issue. We verified that nnUNet output is similar (it cannot be identical with the non-deterministic multi-threaded augmenter).
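For illustration, here is a minimal sketch of that pattern (names are mine, not the actual batchgenerators change): a single module-level lock serializes entry and exit of a non-thread-safe limiter such as threadpoolctl's threadpool_limits, so concurrent threads never race inside it.

```python
import threading
from contextlib import contextmanager

# Hypothetical sketch of the pattern suggested in joblib/threadpoolctl#176:
# all threads in the process share one mutex, and the (non-thread-safe)
# limiter's __enter__/__exit__ are only ever called while holding it.
_LIMITS_LOCK = threading.Lock()

@contextmanager
def locked_limits(limiter):
    """Serialize enter/exit of a limiter context manager with a global lock."""
    with _LIMITS_LOCK:
        limiter.__enter__()   # apply the thread limits under the lock
    try:
        yield
    finally:
        with _LIMITS_LOCK:
            limiter.__exit__(None, None, None)  # restore limits under the lock
```

In real code the limiter would be threadpoolctl.threadpool_limits(limits=1); the sketch accepts any context manager so it stays self-contained.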

We encountered this error with both nnUNet and the batchgenerators library so a similar pull request was made in nnUNet.

Added global mutex instead of threadpool_limits
@tahsin43

tahsin43 commented Apr 9, 2026

Hello, I had the same issue. If anyone is still using an older version and has this problem, I got it to work on the older wheel /nnunetv2-2.5.1+computecanada-py3-none-any.whl

After a little digging and some help from an LLM (Claude), I can't claim I truly understand it, but I somehow got it to work on a remote SLURM cluster.

This is what it identified. I am not sure everything it says is correct, so if anyone wants to correct me, I encourage it.

Root cause: /dev/shm exhaustion in the DataLoader handoff

TMPDIR and JOBLIB_TEMP_FOLDER control where multiprocessing temp files live — that defaults to /dev/shm
and is invisible to any nnUNet_* env var. They are different layers of the
stack, and setting one does not affect the other.
So if you use any SLURM-type cluster with sbatch, I had to add:

#SBATCH --tmp=100G
ulimit -n 65536
export TMPDIR=$SLURM_TMPDIR
export JOBLIB_TEMP_FOLDER=$SLURM_TMPDIR

to get it to work.
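To sanity-check those two layers before a run, a few quick probes (illustrative, not part of my actual script) show where each layer will actually write and what limits the job sees:

```shell
# Pre-flight checks inside the job allocation:
df -h /dev/shm                  # shared-memory size/usage (DataLoader handoff)
ulimit -n                       # open-file limit the workers will inherit
echo "TMPDIR=${TMPDIR:-/tmp}"   # where multiprocessing temp files will land
```

If TMPDIR still prints /tmp, the export did not reach the job environment.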

Example for my trainer.sh script configuration:

I had it set up like this:
#SBATCH --cpus-per-task=8
#SBATCH --gpus=h100_3g.40gb:1
#SBATCH --mem=128000M
#SBATCH --array=1,2,3,4
#SBATCH --output=nnunet_array_%A_%a.log
#SBATCH --tmp=100G
ulimit -n 65536

# Performance & Logging

export PYTHONUNBUFFERED=1
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export nnUNet_compile=False
export nnUNet_n_proc_DA=6
export TMPDIR=$SLURM_TMPDIR
export JOBLIB_TEMP_FOLDER=$SLURM_TMPDIR

PROJECT_DIR="$SCRATCH_BASE/nnunet_V1"
RESULTS_DIR="$PROJECT_DIR/nnUNet_results"
PREPROCESSED_BACKUP="$PROJECT_DIR/nnUNet_preprocessed_backup"

# Path configuration

export nnUNet_raw="$SLURM_TMPDIR/nnUNet_raw"
export nnUNet_preprocessed="$SLURM_TMPDIR/nnUNet_preprocessed"
export nnUNet_results="$RESULTS_DIR"
mkdir -p "$nnUNet_raw" "$nnUNet_preprocessed" "$nnUNet_results"

# =========================================================
# Sync preprocessed data from backup (preprocessing already done)

log_info "Syncing preprocessed data from backup..."
rsync -a "$PREPROCESSED_BACKUP/Dataset${DATASET_ID}/" "$PREPROCESSED_DEST/" || exit 1

FOLD=${SLURM_ARRAY_TASK_ID}

# =========================================================
# Training (shared by all folds)

log_info "Starting training (${CONFIGURATION}, fold ${FOLD}, plans ${PLANS_NAME})..."

if [ "$USE_CHECKPOINT" = true ]; then
TRAIN_CMD="nnUNetv2_train ${DATASET_ID} ${CONFIGURATION} ${FOLD} -p ${PLANS_NAME} --c"
else
TRAIN_CMD="nnUNetv2_train ${DATASET_ID} ${CONFIGURATION} ${FOLD} -p ${PLANS_NAME}"
fi

if eval "$TRAIN_CMD"; then
log_info "Fold ${FOLD} completed successfully."
exit 0
fi

LLM Interpretation:
PyTorch's multi-process DataLoader (and batchgenerators, which nnU-Net uses) passes
loaded batches between worker processes and the main trainer process through
shared memory at /dev/shm. The worker writes a tensor there, the trainer
mmaps it — zero copy, very fast.
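That handoff can be mimicked with the standard library (a toy sketch, not nnU-Net's actual code): a writer creates a block that on Linux is backed by a file in /dev/shm, and a reader attaches to it by name without copying the data.

```python
from multiprocessing import shared_memory

# Toy sketch of the zero-copy handoff. The "worker" writes bytes into a
# shared block; the "trainer" attaches to the same block by name. On Linux
# the backing file lives in /dev/shm, so exhausting /dev/shm makes exactly
# this step fail.
producer = shared_memory.SharedMemory(create=True, size=8)
try:
    producer.buf[:5] = b"batch"                                # worker: write
    consumer = shared_memory.SharedMemory(name=producer.name)  # trainer: attach
    payload = bytes(consumer.buf[:5])                          # trainer: read
    consumer.close()
finally:
    producer.close()
    producer.unlink()   # release the /dev/shm segment
```

With real tensors the principle is the same, just at gigabyte scale per batch, which is why an undersized or full /dev/shm kills workers.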
On my cluster (Rorqual), when I ran the training script via sbatch, I got this bug:

Runtime error I previously encountered:

"Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
File "/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v4/Compiler/gcccore/python/3.11.5/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
self.run()
File "/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v4/Compiler/gcccore/python/3.11.5/lib/python3.11/threading.py", line 975, in run
self._target(*self._args, **self._kwargs)
File "/localscratch/perseb.9966457.0/env/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/localscratch/perseb.9966457.0/env/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Traceback (most recent call last):
File "/localscratch/perseb.9966457.0/env/bin/nnUNetv2_train", line 6, in
sys.exit(run_training_entry())"

@FabianIsensee
Member

Hey there, I am a bit confused here as this is something we have never experienced and cannot reproduce. We run this successfully on local workstations, an LSF cluster with varying hard- and software configurations, and a SLURM cluster. No issues at all. So for me it would be essential to get a repro for the issue as a first step.

The resolution proposed in this PR doesn't really seem to address the underlying issue. It just skips the call to threadpoolctl instead of wrapping it in a mutex as suggested in the linked issue? Or am I completely misunderstanding what is happening? The invocation of threadpoolctl is essential because it tells all the libraries (numpy, torch etc) that they should only use one thread in their backends. Omitting threadpoolctl will cause each background worker to potentially use all available threads whenever possible, quickly overloading the system (imagine a system with 512 threads where there are like 200 background workers each thinking they should do basic matrix maths with 512 threads. Welcome to hell... :-) )

Best,
Fabian

@tahsin43

Hi @FabianIsensee,

Thanks for the follow-up on #2910. Honestly, I am a little new to deep learning and nnUNet, so pardon the lack of knowledge. It's been a nightmare to work on remote clusters, because everything works fine on my Mac until I have to run the training there.

I have two distinct crashes with very different underlying causes that I think are worth disambiguating. I think my first mistake was using the remote's built-in cached Python wheels (they do not let me pip install the default nnunetv2 and pytorch versions), which forced me into runtime patches to get training running.

As a workaround for my next runs, I can try installing the latest version locally, rsync it to the remote, and run it to see if I get the same errors:

git clone https://github.com/MIC-DKFZ/nnUNet.git
cd nnUNet
pip install -e .

Before I dump my tracebacks on you, I want to re-run against the most stable version that works on your cluster. Can you let me know:

  • Which nnunetv2 version do you consider most stable right now for 3d_fullres + ResEncL on torch ≥ 2.4?
  • Which torch version (and numpy / scipy / scikit-learn / threadpoolctl / batchgenerators) do you actively test against for 2.5.x releases? Any known-good pin set would be ideal — I'll ask the CC admins to make those exact wheels available, and I can try to install those wheels and run again.

I'll run the same dataset + same SLURM job on the version you recommend and report back whether the crashes reproduce. If they don't reproduce, it might just be a version issue.

The current configuration of the Rorqual cluster (Compute Canada) looks like this:


1. Machine / cluster / OS

  • Cluster: Digital Research Alliance of Canada — Rorqual (SLURM, Lustre-backed /scratch)
  • Job resources (SLURM):
    • --gpus=h100_3g.40gb:1 (H100 MIG slice, 40 GB)
    • --cpus-per-task=8
    • --mem=128000M
    • --tmp=100G (node-local NVMe, exposed as $SLURM_TMPDIR)
  • Node-local storage: dataset is staged to $SLURM_TMPDIR at job start; nnUNet_raw and nnUNet_preprocessed point there. Results go back to Lustre.
  • Login-shell limits inside the job: ulimit -n 65536 (explicitly raised from the default).
  • Filesystem: $SLURM_TMPDIR is local NVMe, but anything touching the venv or wheelhouse goes through CVMFS (/cvmfs/soft.computecanada.ca/...), which is a read-only HTTP-backed filesystem.

2. Python environment

  • Python: 3.11.5, from CVMFS EasyBuild module gcccore/python/3.11.5
    (/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v4/Compiler/gcccore/python/3.11.5)
  • Venv: created fresh in $SLURM_TMPDIR/env at the start of every job with virtualenv --no-download
  • Install command: pip install --no-index torch nnunetv2
    (resolves against Compute Canada's CVMFS wheelhouse; no network, no PyPI)

These are the dependencies installed in the job when i did
pip install --no-index torch nnunetv2

package version
nnunetv2 2.5.1+computecanada
torch 2.11.0+computecanada (CC internal build, ahead of upstream)
numpy 2.4.2+computecanada
scipy 1.17.0+computecanada
scikit-learn 1.8.0+computecanada
scikit-image 0.26.0+computecanada
threadpoolctl 3.6.0+computecanada
batchgenerators 0.25.1+computecanada
batchgeneratorsv2 0.3.0+computecanada
dynamic-network-architectures 0.3.1+computecanada
acvl-utils 0.2+computecanada
fft-conv-pytorch 1.2.0+computecanada
SimpleITK 2.3.1+computecanada
pandas 3.0.0+computecanada

3. Local source patches I had to apply before training would start

Two patches are required at venv-setup time. I apply them (via sed and a heredoc) after pip install, before the training command. Both are unrelated to dataloading / threadpoolctl.

These are the patches I had to add to make it work with nnunetv2 2.5.1:

# 3. Apply patches (for incompatible torch/nnunetv2 versions)
if [ "$APPLY_PATCHES" = true ]; then
  log_info "Applying library patches..."

  # Patch A: PolyLRScheduler (torch 2.6+ changed LRScheduler.__init__ signature)
  POLYLR_PATH=$(python -c "import nnunetv2; import os; print(os.path.join(nnunetv2.__path__[0], 'training/lr_scheduler/polylr.py'))")
  cat <<EOF >"$POLYLR_PATH"
from torch.optim.lr_scheduler import _LRScheduler
class PolyLRScheduler(_LRScheduler):
    def __init__(self, optimizer, initial_lr: float, max_steps: int, exponent: float = 0.9, current_step: int = None):
        self.optimizer = optimizer
        self.initial_lr = initial_lr
        self.max_steps = max_steps
        self.exponent = exponent
        self.ctr = 0
        super().__init__(optimizer, current_step if current_step is not None else -1)
    def step(self, current_step=None):
        if current_step is None or current_step == -1:
            current_step = self.ctr
            self.ctr += 1
        new_lr = self.initial_lr * (1 - current_step / self.max_steps) ** self.exponent
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = new_lr
        self._last_lr = [group['lr'] for group in self.optimizer.param_groups]
    def get_last_lr(self):
        return self._last_lr
EOF
  log_info "Patch A (PolyLR) applied to $POLYLR_PATH"

  # Patch B: PyTorch 2.6+ unpickling fix (weights_only=False)
  TRAINER_PATH=$(python -c "import nnunetv2; import os; print(os.path.join(nnunetv2.__path__[0], 'training/nnUNetTrainer/nnUNetTrainer.py'))")
  sed -i 's/map_location=self.device)/map_location=self.device, weights_only=False)/g' "$TRAINER_PATH"
  log_info "Patch B (Unpickling) applied to $TRAINER_PATH"
else log_info "Skipping patches (APPLY_PATCHES=false)"
fi

4. Job env vars used

export PYTHONUNBUFFERED=1
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export nnUNet_compile=False
export nnUNet_n_proc_DA=<varies — see timeline>
export TMPDIR=$SLURM_TMPDIR
export JOBLIB_TEMP_FOLDER=$SLURM_TMPDIR
ulimit -n 65536

5. Dataset / run command

  • Dataset 707, 5 input channels (2 continuous + 3 binary), 185 cases
  • patch_size=[40, 320, 320], batch_size=2
  • Plans used: nnUNetPlans and nnUNetResEncUNetLPlans
  • Command: nnUNetv2_train 707 3d_fullres <fold> -p <plans> --c

6. The crashes I've seen so far (for context)

I'm including these now so you know what shape of failure I am encountering, but I'd rather not have you dig into these until I try again.

Crash A — fold 0, Epoch 6, fresh run

Config differences from current script:

  • nnUNet_n_proc_DA=1
  • no ulimit -n 65536
  • no --tmp=100G / TMPDIR=$SLURM_TMPDIR / JOBLIB_TEMP_FOLDER=$SLURM_TMPDIR

Epochs 0–5 ran cleanly (80–120 s each). At Epoch 6:

2026-04-08 17:45:15: Current learning rate: 0.00995
Exception in thread Thread-1 (results_loop):
Traceback (most recent call last):
  File ".../batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive...")

No child-process traceback was emitted. Workers died silently.

Mitigations that partially helped

After Crash A, I added:

ulimit -n 65536
#SBATCH --tmp=100G
export TMPDIR=$SLURM_TMPDIR
export JOBLIB_TEMP_FOLDER=$SLURM_TMPDIR

With those in place, training ran cleanly from Epoch 0 through Epoch 299.

Crash B — fold 3, Epoch 300, on resume with --c

Config: same as above + nnUNet_n_proc_DA=6, plans = nnUNetResEncUNetLPlans, resumed from checkpoint_best.pth.

On re-entering Epoch 300, the first next(self.dataloader_val) inside validation_step died with a real child traceback this time:

2026-04-09 16:27:07: Epoch 300
2026-04-09 16:27:07: Current learning rate: 0.00725
Exception in thread Thread-2 (results_loop):
  File ".../batchgenerators/.../nondet_multi_threaded_augmenter.py", line 108, in results_loop
    item = in_queue.get()
  File ".../multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File ".../torch/multiprocessing/reductions.py", line 540, in rebuild_storage_fd
    fd = df.detach()
  File ".../multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File ".../multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File ".../multiprocessing/connection.py", line 751, in answer_challenge
    message = connection.recv_bytes(256)
  ...
ConnectionResetError: [Errno 104] Connection reset by peer

  File ".../nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1377, in run_training
    val_outputs.append(self.validation_step(next(self.dataloader_val)))

The script's auto-retry with --c reproduced the identical crash at the same fold/epoch — deterministic on resume.

  • Crash A (silent, early, fresh run) matches the threadpoolctl-deadlock shape: no child stderr, parent just notices workers are gone.
  • Crash B has a real child traceback inside torch's FD-sharing code and has nothing to do with threadpoolctl.

Both get wrapped by the same nondet_multi_threaded_augmenter "workers are no longer alive" umbrella, which makes them look identical at the top of a bug report. I strongly suspect some of the reports you're getting for #2910 are Crash-B-type (torch IPC flakiness) being attributed to threadpoolctl by accident.
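On the Crash-A side, one way to make silent worker deaths speak is to wrap the worker's target so any exception is printed before the process exits. This is a generic sketch of mine, not a batchgenerators patch:

```python
import sys
import traceback
from functools import wraps

def loud_worker(target):
    """Wrap a worker's target function so any crash prints a full traceback
    to stderr (and flushes it) before the process dies, instead of leaving
    the parent with only 'workers are no longer alive'."""
    @wraps(target)
    def wrapped(*args, **kwargs):
        try:
            return target(*args, **kwargs)
        except BaseException:
            traceback.print_exc(file=sys.stderr)
            sys.stderr.flush()   # make sure the message escapes the dying process
            raise
    return wrapped
```

The flush matters: with buffered stdio, a crashing child can take its error message with it, which looks exactly like Crash A's silent death.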

What's currently working for me

Setting nnUNet_n_proc_DA=0 in the job env seems to resolve both crashes — no failures since. But I understand that's effectively "disable multi-worker DA", which is exactly what you *don't* want long-term. So it's a workaround, not a fix.


Let me know if you need any more info. I can send you the plans / fingerprint / splits JSONs (the raw data is patient imaging and can't be shared — everything else is fine). I can also send the threadpool_info() dump from a freshly activated env pre-training, a full pip freeze, lscpu, etc.


Compute node details (captured from an interactive job on Rorqual)

Host / OS

  • Node: rg13302 — Dell PowerEdge XE8640 (firmware 2.9.4)
  • OS: AlmaLinux 9.7 (Moss Jungle Cat), RHEL-9 lineage
  • Kernel: 5.14.0-611.42.1.el9_7.x86_64 (PREEMPT_DYNAMIC, x86_64)
  • glibc: 2.37 (Gentoo 2.37-r3, Compute Canada toolchain)
  • SLURM: 24.11.7

CPU / memory

  • Intel Xeon Gold 6448Y (Sapphire Rapids), 2 sockets × 32 cores = 64 physical cores, 1 thread/core
  • 510 GB RAM on the node (my job requested 128 GB)
  • Node TmpDisk=1700000 MB (1.7 TB local); my job gets $SLURM_TMPDIR on ZFS

GPU

  • NVIDIA H100 80GB HBM3, MIG enabled
  • My slice: MIG 3g.40gb (40 GB, 60 SMs)
  • Driver: 570.211.01, runtime CUDA 12.8

Filesystems

  • /localscratch — ZFS (3.4 TB, node-local)
  • /home — Lustre (lustre08, over o2ib2 / InfiniBand)
  • /scratch — Lustre (lustre10, 19 TB)
  • CVMFS (/cvmfs/soft.computecanada.ca) → read-only HTTP-backed, where venv + wheels come from

ulimit inside interactive probe job (no override)

  • open files (-n): 51200 — my training script explicitly raises this to 65536
  • max locked memory: unlimited
  • max user processes: 2061572
  • stack / data / virtual mem: unlimited

Python / torch (captured inside the training venv)

  • Python: 3.11.5 (main, Sep 19 2023) [GCC 12.3.1 20230526]
  • torch: 2.11.0 (Compute Canada internal build, ahead of upstream)
  • torch.version.cuda: 12.9 (compiled against 12.9; driver reports 12.8 runtime → forward-compatible, but worth noting)
  • torch.backends.cudnn.version(): 91301
  • torch.version.git_version: 70d99e998b4955e0049d13a98d77ae1b14db1f45
  • Compiled with: GCC 12.3, C++17, Intel MKL-DNN v3.10.2, OpenMP 4.5
  • torch.multiprocessing.get_sharing_strategy(): file_descriptor ← default, and exactly the code path Crash B dies in
  • GPU visible to torch: NVIDIA H100 80GB HBM3 MIG 3g.40gb

Three things that stand out from this dump

  1. Sharing strategy is file_descriptor. Crash B's full traceback goes through rebuild_storage_fd → df.detach() → resource_sharer.get_connection → ConnectionResetError. That is literally the file_descriptor IPC path. Forcing torch.multiprocessing.set_sharing_strategy('file_system') is the obvious diagnostic — I'll run that on the next fold-3 resume if you agree.
  2. CUDA version skew: driver is CUDA 12.8, torch is built against CUDA 12.9. Forward-compat so it should be fine, but it's one more place CC's rebuilt torch is out of sync with the rest of the stack.
  3. Default ulimit -n on this node is 51200, not unlimited. With 64 physical cores and multiple worker processes each holding open FDs for shared-memory segments, tempfiles, and Lustre handles, 51200 is not generous. My training script raises it to 65536 explicitly, and Crash A happened on a script that didn't — plausibly contributing to the silent worker death.
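The diagnostic in point 1 is a two-liner at the top of the training entry point, before any DataLoader or background worker is created. This uses the standard torch.multiprocessing API; whether it actually cures Crash B here is exactly what the re-run would test:

```python
import torch.multiprocessing as mp

# Switch tensor IPC from the default 'file_descriptor' strategy (the path
# Crash B's traceback goes through) to 'file_system', which shares tensors
# via named files instead of passing FDs over a socket. Must run before
# any workers are spawned.
mp.set_sharing_strategy('file_system')
assert mp.get_sharing_strategy() == 'file_system'
```

Note that file_system trades the FD path for temp files that must be cleaned up, so it is a diagnostic, not an unconditional improvement.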
