
Fix for "One or more background workers are no longer alive. Exiting" errors#136

Open
MikeUU332 wants to merge 1 commit into MIC-DKFZ:master from MikeUU332:master

Conversation

@MikeUU332

See also MIC-DKFZ/nnUNet#2910; this is the same issue and fix.
Hello,

We encountered the "One or more background workers are no longer alive. Exiting" error as described in:

#134
#133

when running nnUNet on an HPC cluster. A colleague and I looked into this issue and came across this link:
joblib/threadpoolctl#176

which says that using with threadpool_limits is not thread safe and recommends a global mutex instead. with threadpool_limits is used in both the nnUNet and batchgenerators libraries. By using a global mutex instead of with threadpool_limits, as done in this pull request, we no longer encounter this issue. We verified that nnUNet output is similar (it cannot be identical with the non-deterministic multi-threaded augmenter).
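For illustration, here is a minimal sketch of that pattern (names are mine, not the actual batchgenerators change): a single module-level lock serializes entry and exit of a non-thread-safe limiter such as threadpoolctl's threadpool_limits, so concurrent threads never race inside it.

```python
import threading
from contextlib import contextmanager

# Hypothetical sketch of the pattern suggested in joblib/threadpoolctl#176:
# all threads in the process share one mutex, and the (non-thread-safe)
# limiter's __enter__/__exit__ are only ever called while holding it.
_LIMITS_LOCK = threading.Lock()

@contextmanager
def locked_limits(limiter):
    """Serialize enter/exit of a limiter context manager with a global lock."""
    with _LIMITS_LOCK:
        limiter.__enter__()   # apply the thread limits under the lock
    try:
        yield
    finally:
        with _LIMITS_LOCK:
            limiter.__exit__(None, None, None)  # restore limits under the lock
```

In real code the limiter would be threadpoolctl.threadpool_limits(limits=1); the sketch accepts any context manager so it stays self-contained.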

We encountered this error with both nnUNet and the batchgenerators library so a similar pull request was made in nnUNet.

Added global mutex instead of threadpool_limits
@tahsin43

tahsin43 commented Apr 9, 2026

Hello, I had the same issue. If anyone is still using an older version and has this problem, I got it to work on the older wheel /nnunetv2-2.5.1+computecanada-py3-none-any.whl

After a little digging and some help from an LLM (Claude), I can't claim I truly understand it, but I somehow got it to work on a remote SLURM cluster.

This is what it identified. I am not sure everything it says is correct, so if anyone wants to correct me, I encourage it.

Root cause: /dev/shm exhaustion in the DataLoader handoff

TMPDIR and JOBLIB_TEMP_FOLDER control where multiprocessing temp files live — that defaults to /dev/shm
and is invisible to any nnUNet_* env var. They are different layers of the
stack, and setting one does not affect the other.
So if you use any SLURM-type cluster with sbatch, I had to add:

#SBATCH --tmp=100G
ulimit -n 65536
export TMPDIR=$SLURM_TMPDIR
export JOBLIB_TEMP_FOLDER=$SLURM_TMPDIR

to get it to work.
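To sanity-check those two layers before a run, a few quick probes (illustrative, not part of my actual script) show where each layer will actually write and what limits the job sees:

```shell
# Pre-flight checks inside the job allocation:
df -h /dev/shm                  # shared-memory size/usage (DataLoader handoff)
ulimit -n                       # open-file limit the workers will inherit
echo "TMPDIR=${TMPDIR:-/tmp}"   # where multiprocessing temp files will land
```

If TMPDIR still prints /tmp, the export did not reach the job environment.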

Example for my trainer.sh script configuration:

I had it set up like this:
#SBATCH --cpus-per-task=8
#SBATCH --gpus=h100_3g.40gb:1
#SBATCH --mem=128000M
#SBATCH --array=1,2,3,4
#SBATCH --output=nnunet_array_%A_%a.log
#SBATCH --tmp=100G
ulimit -n 65536

# Performance & Logging

export PYTHONUNBUFFERED=1
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export nnUNet_compile=False
export nnUNet_n_proc_DA=6
export TMPDIR=$SLURM_TMPDIR
export JOBLIB_TEMP_FOLDER=$SLURM_TMPDIR

PROJECT_DIR="$SCRATCH_BASE/nnunet_V1"
RESULTS_DIR="$PROJECT_DIR/nnUNet_results"
PREPROCESSED_BACKUP="$PROJECT_DIR/nnUNet_preprocessed_backup"

# Path configuration

export nnUNet_raw="$SLURM_TMPDIR/nnUNet_raw"
export nnUNet_preprocessed="$SLURM_TMPDIR/nnUNet_preprocessed"
export nnUNet_results="$RESULTS_DIR"
mkdir -p "$nnUNet_raw" "$nnUNet_preprocessed" "$nnUNet_results"

# =========================================================
# Sync preprocessed data from backup (preprocessing already done)

log_info "Syncing preprocessed data from backup..."
rsync -a "$PREPROCESSED_BACKUP/Dataset${DATASET_ID}/" "$PREPROCESSED_DEST/" || exit 1

FOLD=${SLURM_ARRAY_TASK_ID}

# =========================================================
# Training (shared by all folds)

log_info "Starting training (${CONFIGURATION}, fold ${FOLD}, plans ${PLANS_NAME})..."

if [ "$USE_CHECKPOINT" = true ]; then
TRAIN_CMD="nnUNetv2_train ${DATASET_ID} ${CONFIGURATION} ${FOLD} -p ${PLANS_NAME} --c"
else
TRAIN_CMD="nnUNetv2_train ${DATASET_ID} ${CONFIGURATION} ${FOLD} -p ${PLANS_NAME}"
fi

if eval "$TRAIN_CMD"; then
log_info "Fold ${FOLD} completed successfully."
exit 0
fi

LLM Interpretation:
PyTorch's multi-process DataLoader (and batchgenerators, which nnU-Net uses) passes
loaded batches between worker processes and the main trainer process through
shared memory at /dev/shm. The worker writes a tensor there, the trainer
mmaps it — zero copy, very fast.
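That handoff can be mimicked with the standard library (a toy sketch, not nnU-Net's actual code): a writer creates a block that on Linux is backed by a file in /dev/shm, and a reader attaches to it by name without copying the data.

```python
from multiprocessing import shared_memory

# Toy sketch of the zero-copy handoff. The "worker" writes bytes into a
# shared block; the "trainer" attaches to the same block by name. On Linux
# the backing file lives in /dev/shm, so exhausting /dev/shm makes exactly
# this step fail.
producer = shared_memory.SharedMemory(create=True, size=8)
try:
    producer.buf[:5] = b"batch"                                # worker: write
    consumer = shared_memory.SharedMemory(name=producer.name)  # trainer: attach
    payload = bytes(consumer.buf[:5])                          # trainer: read
    consumer.close()
finally:
    producer.close()
    producer.unlink()   # release the /dev/shm segment
```

With real tensors the principle is the same, just at gigabyte scale per batch, which is why an undersized or full /dev/shm kills workers.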
On my cluster (Rorqual), when I ran the training script via sbatch, I got this bug:

Runtime error I previously encountered:

"Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
File "/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v4/Compiler/gcccore/python/3.11.5/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
self.run()
File "/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v4/Compiler/gcccore/python/3.11.5/lib/python3.11/threading.py", line 975, in run
self._target(*self._args, **self._kwargs)
File "/localscratch/perseb.9966457.0/env/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/localscratch/perseb.9966457.0/env/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Traceback (most recent call last):
File "/localscratch/perseb.9966457.0/env/bin/nnUNetv2_train", line 6, in
sys.exit(run_training_entry())"

@FabianIsensee
Member

Hey there, I am a bit confused here as this is something we have never experienced and cannot reproduce. We run this successfully on local workstations, an LSF cluster with varying hard- and software configurations, and a SLURM cluster. No issues at all. So for me it would be essential to get a repro for the issue as a first step.

The resolution proposed in this PR doesn't really seem to address the underlying issue. It just skips the call to threadpoolctl instead of wrapping it in a mutex as suggested in the linked issue? Or am I completely misunderstanding what is happening? The invocation of threadpoolctl is essential because it tells all the libraries (numpy, torch etc) that they should only use one thread in their backends. Omitting threadpoolctl will cause each background worker to potentially use all available threads whenever possible, quickly overloading the system (imagine a system with 512 threads where there are like 200 background workers each thinking they should do basic matrix maths with 512 threads. Welcome to hell... :-) )

Best,
Fabian

@tahsin43

Hi @FabianIsensee,

Thanks for the follow-up on #2910. Honestly, I am a little new to deep learning and nnUNet, so pardon the lack of knowledge. It's been a nightmare to work on remote clusters, because everything works fine on my Mac until I have to run the training there.

I have two distinct crashes with very different underlying causes that I think are worth disambiguating. I think my first mistake was using the remote's built-in cached Python wheels (they do not let me pip install the default nnunetv2 and pytorch versions), which forced me into runtime patches to get training running.

As a workaround for my next runs, I can try installing the latest version locally, rsync it to the remote, and run it to see if I get the same errors:

git clone https://github.com/MIC-DKFZ/nnUNet.git
cd nnUNet
pip install -e .

Before I dump my tracebacks on you, I want to re-run against the most stable version that works on your cluster. Can you let me know:

  • Which nnunetv2 version do you consider most stable right now for 3d_fullres + ResEncL on torch ≥ 2.4?
  • Which torch version (and numpy / scipy / scikit-learn / threadpoolctl / batchgenerators) do you actively test against for 2.5.x releases? Any known-good pin set would be ideal — I'll ask the CC admins to make those exact wheels available, and I can try to install those wheels and run again.

I'll run the same dataset + same SLURM job on the version you recommend and report back whether the crashes reproduce. If they don't reproduce, it might just be a version issue.

The current configuration of the Rorqual cluster (Compute Canada) looks like this:


1. Machine / cluster / OS

  • Cluster: Digital Research Alliance of Canada — Rorqual (SLURM, Lustre-backed /scratch)
  • Job resources (SLURM):
    • --gpus=h100_3g.40gb:1 (H100 MIG slice, 40 GB)
    • --cpus-per-task=8
    • --mem=128000M
    • --tmp=100G (node-local NVMe, exposed as $SLURM_TMPDIR)
  • Node-local storage: dataset is staged to $SLURM_TMPDIR at job start; nnUNet_raw and nnUNet_preprocessed point there. Results go back to Lustre.
  • Login-shell limits inside the job: ulimit -n 65536 (explicitly raised from the default).
  • Filesystem: $SLURM_TMPDIR is local NVMe, but anything touching the venv or wheelhouse goes through CVMFS (/cvmfs/soft.computecanada.ca/...), which is a read-only HTTP-backed filesystem.

2. Python environment

  • Python: 3.11.5, from CVMFS EasyBuild module gcccore/python/3.11.5
    (/cvmfs/soft.computecanada.ca/easybuild/software/2023/x86-64-v4/Compiler/gcccore/python/3.11.5)
  • Venv: created fresh in $SLURM_TMPDIR/env at the start of every job with virtualenv --no-download
  • Install command: pip install --no-index torch nnunetv2
    (resolves against Compute Canada's CVMFS wheelhouse; no network, no PyPI)

These are the dependencies installed in the job when i did
pip install --no-index torch nnunetv2

package version
nnunetv2 2.5.1+computecanada
torch 2.11.0+computecanada (CC internal build, ahead of upstream)
numpy 2.4.2+computecanada
scipy 1.17.0+computecanada
scikit-learn 1.8.0+computecanada
scikit-image 0.26.0+computecanada
threadpoolctl 3.6.0+computecanada
batchgenerators 0.25.1+computecanada
batchgeneratorsv2 0.3.0+computecanada
dynamic-network-architectures 0.3.1+computecanada
acvl-utils 0.2+computecanada
fft-conv-pytorch 1.2.0+computecanada
SimpleITK 2.3.1+computecanada
pandas 3.0.0+computecanada

3. Local source patches I had to apply before training would start

Two patches are required at venv-setup time. I apply them (via sed and a heredoc) after pip install, before the training command. Both are unrelated to dataloading / threadpoolctl.

These are the patches I had to add to make it work with nnunetv2 2.5.1:

# 3. Apply patches (for incompatible torch/nnunetv2 versions)
if [ "$APPLY_PATCHES" = true ]; then
  log_info "Applying library patches..."

  # Patch A: PolyLRScheduler (torch 2.6+ changed LRScheduler.__init__ signature)
  POLYLR_PATH=$(python -c "import nnunetv2; import os; print(os.path.join(nnunetv2.__path__[0], 'training/lr_scheduler/polylr.py'))")
  cat <<EOF >"$POLYLR_PATH"
from torch.optim.lr_scheduler import _LRScheduler
class PolyLRScheduler(_LRScheduler):
    def __init__(self, optimizer, initial_lr: float, max_steps: int, exponent: float = 0.9, current_step: int = None):
        self.optimizer = optimizer
        self.initial_lr = initial_lr
        self.max_steps = max_steps
        self.exponent = exponent
        self.ctr = 0
        super().__init__(optimizer, current_step if current_step is not None else -1)
    def step(self, current_step=None):
        if current_step is None or current_step == -1:
            current_step = self.ctr
            self.ctr += 1
        new_lr = self.initial_lr * (1 - current_step / self.max_steps) ** self.exponent
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = new_lr
        self._last_lr = [group['lr'] for group in self.optimizer.param_groups]
    def get_last_lr(self):
        return self._last_lr
EOF
  log_info "Patch A (PolyLR) applied to $POLYLR_PATH"

  # Patch B: PyTorch 2.6+ unpickling fix (weights_only=False)
  TRAINER_PATH=$(python -c "import nnunetv2; import os; print(os.path.join(nnunetv2.__path__[0], 'training/nnUNetTrainer/nnUNetTrainer.py'))")
  sed -i 's/map_location=self.device)/map_location=self.device, weights_only=False)/g' "$TRAINER_PATH"
  log_info "Patch B (Unpickling) applied to $TRAINER_PATH"
else log_info "Skipping patches (APPLY_PATCHES=false)"
fi

4. Job env vars used

export PYTHONUNBUFFERED=1
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export nnUNet_compile=False
export nnUNet_n_proc_DA=<varies — see timeline>
export TMPDIR=$SLURM_TMPDIR
export JOBLIB_TEMP_FOLDER=$SLURM_TMPDIR
ulimit -n 65536

5. Dataset / run command

  • Dataset 707, 5 input channels (2 continuous + 3 binary), 185 cases
  • patch_size=[40, 320, 320], batch_size=2
  • Plans used: nnUNetPlans and nnUNetResEncUNetLPlans
  • Command: nnUNetv2_train 707 3d_fullres <fold> -p <plans> --c

6. The crashes I've seen so far (for context)

I'm including these now so you know what shape of failure I am encountering, but I'd rather not have you dig into these until I try again.

Crash A — fold 0, Epoch 6, fresh run

Config differences from current script:

  • nnUNet_n_proc_DA=1
  • no ulimit -n 65536
  • no --tmp=100G / TMPDIR=$SLURM_TMPDIR / JOBLIB_TEMP_FOLDER=$SLURM_TMPDIR

Epochs 0–5 ran cleanly (80–120 s each). At Epoch 6:

2026-04-08 17:45:15: Current learning rate: 0.00995
Exception in thread Thread-1 (results_loop):
Traceback (most recent call last):
  File ".../batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive...")

No child-process traceback was emitted. Workers died silently.

Mitigations that partially helped

After Crash A, I added:

ulimit -n 65536
#SBATCH --tmp=100G
export TMPDIR=$SLURM_TMPDIR
export JOBLIB_TEMP_FOLDER=$SLURM_TMPDIR

With those in place, training ran cleanly from Epoch 0 through Epoch 299.

Crash B — fold 3, Epoch 300, on resume with --c

Config: same as above + nnUNet_n_proc_DA=6, plans = nnUNetResEncUNetLPlans, resumed from checkpoint_best.pth.

On re-entering Epoch 300, the first next(self.dataloader_val) inside validation_step died with a real child traceback this time:

2026-04-09 16:27:07: Epoch 300
2026-04-09 16:27:07: Current learning rate: 0.00725
Exception in thread Thread-2 (results_loop):
  File ".../batchgenerators/.../nondet_multi_threaded_augmenter.py", line 108, in results_loop
    item = in_queue.get()
  File ".../multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File ".../torch/multiprocessing/reductions.py", line 540, in rebuild_storage_fd
    fd = df.detach()
  File ".../multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File ".../multiprocessing/resource_sharer.py", line 86, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File ".../multiprocessing/connection.py", line 751, in answer_challenge
    message = connection.recv_bytes(256)
  ...
ConnectionResetError: [Errno 104] Connection reset by peer

  File ".../nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1377, in run_training
    val_outputs.append(self.validation_step(next(self.dataloader_val)))

The script's auto-retry with --c reproduced the identical crash at the same fold/epoch — deterministic on resume.

  • Crash A (silent, early, fresh run) matches the threadpoolctl-deadlock shape: no child stderr, parent just notices workers are gone.
  • Crash B has a real child traceback inside torch's FD-sharing code and has nothing to do with threadpoolctl.

Both get wrapped by the same nondet_multi_threaded_augmenter "workers are no longer alive" umbrella, which makes them look identical at the top of a bug report. I strongly suspect some of the reports you're getting for #2910 are Crash-B-type (torch IPC flakiness) being attributed to threadpoolctl by accident.
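On the Crash-A side, one way to make silent worker deaths speak is to wrap the worker's target so any exception is printed before the process exits. This is a generic sketch of mine, not a batchgenerators patch:

```python
import sys
import traceback
from functools import wraps

def loud_worker(target):
    """Wrap a worker's target function so any crash prints a full traceback
    to stderr (and flushes it) before the process dies, instead of leaving
    the parent with only 'workers are no longer alive'."""
    @wraps(target)
    def wrapped(*args, **kwargs):
        try:
            return target(*args, **kwargs)
        except BaseException:
            traceback.print_exc(file=sys.stderr)
            sys.stderr.flush()   # make sure the message escapes the dying process
            raise
    return wrapped
```

The flush matters: with buffered stdio, a crashing child can take its error message with it, which looks exactly like Crash A's silent death.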

What's currently working for me

Setting nnUNet_n_proc_DA=0 in the job env seems to resolve both crashes — no failures since. But I understand that's effectively "disable multi-worker DA", which is exactly what you *don't* want long-term. So it's a workaround, not a fix.


Let me know if you need any more info. I can send you the plans / fingerprint / splits JSONs (the raw data is patient imaging and can't be shared — everything else is fine). I can also send the threadpool_info() dump from a freshly activated env pre-training, a full pip freeze, lscpu, etc.


Compute node details (captured from an interactive job on Rorqual)

Host / OS

  • Node: rg13302 — Dell PowerEdge XE8640 (firmware 2.9.4)
  • OS: AlmaLinux 9.7 (Moss Jungle Cat), RHEL-9 lineage
  • Kernel: 5.14.0-611.42.1.el9_7.x86_64 (PREEMPT_DYNAMIC, x86_64)
  • glibc: 2.37 (Gentoo 2.37-r3, Compute Canada toolchain)
  • SLURM: 24.11.7

CPU / memory

  • Intel Xeon Gold 6448Y (Sapphire Rapids), 2 sockets × 32 cores = 64 physical cores, 1 thread/core
  • 510 GB RAM on the node (my job requested 128 GB)
  • Node TmpDisk=1700000 MB (1.7 TB local); my job gets $SLURM_TMPDIR on ZFS

GPU

  • NVIDIA H100 80GB HBM3, MIG enabled
  • My slice: MIG 3g.40gb (40 GB, 60 SMs)
  • Driver: 570.211.01, runtime CUDA 12.8

Filesystems

  • /localscratch — ZFS (3.4 TB, node-local)
  • /home — Lustre (lustre08, over o2ib2 / InfiniBand)
  • /scratch — Lustre (lustre10, 19 TB)
  • CVMFS (/cvmfs/soft.computecanada.ca) → read-only HTTP-backed, where venv + wheels come from

ulimit inside interactive probe job (no override)

  • open files (-n): 51200 — my training script explicitly raises this to 65536
  • max locked memory: unlimited
  • max user processes: 2061572
  • stack / data / virtual mem: unlimited

Python / torch (captured inside the training venv)

  • Python: 3.11.5 (main, Sep 19 2023) [GCC 12.3.1 20230526]
  • torch: 2.11.0 (Compute Canada internal build, ahead of upstream)
  • torch.version.cuda: 12.9 (compiled against 12.9; driver reports 12.8 runtime → forward-compatible, but worth noting)
  • torch.backends.cudnn.version(): 91301
  • torch.version.git_version: 70d99e998b4955e0049d13a98d77ae1b14db1f45
  • Compiled with: GCC 12.3, C++17, Intel MKL-DNN v3.10.2, OpenMP 4.5
  • torch.multiprocessing.get_sharing_strategy(): file_descriptor ← default, and exactly the code path Crash B dies in
  • GPU visible to torch: NVIDIA H100 80GB HBM3 MIG 3g.40gb

Three things that stand out from this dump

  1. Sharing strategy is file_descriptor. Crash B's full traceback goes through rebuild_storage_fd → df.detach() → resource_sharer.get_connection → ConnectionResetError. That is literally the file_descriptor IPC path. Forcing torch.multiprocessing.set_sharing_strategy('file_system') is the obvious diagnostic — I'll run that on the next fold-3 resume if you agree.
  2. CUDA version skew: driver is CUDA 12.8, torch is built against CUDA 12.9. Forward-compat so it should be fine, but it's one more place CC's rebuilt torch is out of sync with the rest of the stack.
  3. Default ulimit -n on this node is 51200, not unlimited. With 64 physical cores and multiple worker processes each holding open FDs for shared-memory segments, tempfiles, and Lustre handles, 51200 is not generous. My training script raises it to 65536 explicitly, and Crash A happened on a script that didn't — plausibly contributing to the silent worker death.
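The diagnostic in point 1 is a two-liner at the top of the training entry point, before any DataLoader or background worker is created. This uses the standard torch.multiprocessing API; whether it actually cures Crash B here is exactly what the re-run would test:

```python
import torch.multiprocessing as mp

# Switch tensor IPC from the default 'file_descriptor' strategy (the path
# Crash B's traceback goes through) to 'file_system', which shares tensors
# via named files instead of passing FDs over a socket. Must run before
# any workers are spawned.
mp.set_sharing_strategy('file_system')
assert mp.get_sharing_strategy() == 'file_system'
```

Note that file_system trades the FD path for temp files that must be cleaned up, so it is a diagnostic, not an unconditional improvement.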
