Fix for "One or more background workers are no longer alive. Exiting" errors #136
MikeUU332 wants to merge 1 commit into MIC-DKFZ:master
Conversation
Added global mutex instead of threadpool_limits
Hello, I had the same issue. If anyone is still on an older version and hits this problem, I got it working with the older wheel /nnunetv2-2.5.1+computecanada-py3-none-any.whl. After a little digging and some help from an LLM (Claude), I can't claim I truly understand it, but I somehow got it to work on a remote SLURM cluster. This is what it identified; I'm not sure everything it says is correct, so if anyone wants to correct me, I encourage it. Root cause:
Hey there, I am a bit confused here, as this is something we have never experienced and cannot reproduce. We run this successfully on local workstations, an LSF cluster with varying hardware and software configurations, and a SLURM cluster. No issues at all. So for me it would be essential to get a repro of the issue as a first step.

The resolution proposed in this PR doesn't really seem to address the underlying issue. It just skips the call to threadpoolctl instead of wrapping it in a mutex as suggested in the linked issue? Or am I completely misunderstanding what is happening?

The invocation of threadpoolctl is essential because it tells all the libraries (numpy, torch, etc.) that they should only use one thread in their backends. Omitting threadpoolctl will cause each background worker to potentially use all available threads whenever possible, quickly overloading the system (imagine a system with 512 threads where there are something like 200 background workers, each thinking it should do basic matrix maths with 512 threads. Welcome to hell... :-) )

Best,
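For context, the effect of the threadpoolctl call described above can be sketched with environment variables (a rough stand-in only, not nnU-Net's actual code: threadpool_limits adjusts the backend pools dynamically at runtime, whereas these variables must be set before the numeric libraries are imported):

```python
import os

# Rough sketch of what capping each background worker at one backend thread
# achieves. threadpoolctl's threadpool_limits(limits=1) does this at runtime;
# the environment-variable version below only works if it runs before
# numpy/torch are imported in the worker process.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "1"

# Without such a cap, each of e.g. 200 data-augmentation workers may spawn
# as many BLAS/OpenMP threads as the node has cores, oversubscribing it.
print(os.environ["OMP_NUM_THREADS"])
```

This is why simply skipping the call, rather than serializing it, trades one failure mode for another.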
Hi @FabianIsensee,

Thanks for the follow-up on #2910. Honestly, I am a little new to deep learning and nnU-Net, so pardon my lack of knowledge. Working with remote clusters has been a nightmare, because everything works fine on my Mac until I have to run the training there.

I have two distinct crashes with very different underlying causes that I think are worth disambiguating. I think my first mistake was using the remote's built-in cached Python wheels, because the cluster does not let me pip install the default versions of nnunetv2 and PyTorch, which forced me to apply runtime patches as a workaround to get training running. For my next runs I can try to install the latest version locally, rsync it to the remote, and see whether I get the same errors.

Before I dump my tracebacks on you, I want to re-run with the most stable version that works on your cluster: I'll run the same dataset and the same SLURM job on the version you recommend and report back whether the crashes reproduce. If they don't reproduce, it might just be a version problem.

The current configuration of the Rorqual cluster (Compute Canada) looks like this:

1. Machine / cluster / OS
2. Python environment
These are the dependencies installed in the job when I did
3. Local source patches I had to apply before training would start

Two patches are required at venv-setup time; I apply them via my setup script. These are the patches I had to add to make it work with nnunetv2 2.5.1:

# 3. Apply patches (for incompatible torch/nnunetv2 versions)
if [ "$APPLY_PATCHES" = true ]; then
log_info "Applying library patches..."
# Patch A: PolyLRScheduler (torch 2.6+ changed LRScheduler.__init__ signature)
POLYLR_PATH=$(python -c "import nnunetv2; import os; print(os.path.join(nnunetv2.__path__[0], 'training/lr_scheduler/polylr.py'))")
cat <<EOF >"$POLYLR_PATH"
from torch.optim.lr_scheduler import _LRScheduler
class PolyLRScheduler(_LRScheduler):
    def __init__(self, optimizer, initial_lr: float, max_steps: int, exponent: float = 0.9, current_step: int = None):
        self.optimizer = optimizer
        self.initial_lr = initial_lr
        self.max_steps = max_steps
        self.exponent = exponent
        self.ctr = 0
        super().__init__(optimizer, current_step if current_step is not None else -1)

    def step(self, current_step=None):
        if current_step is None or current_step == -1:
            current_step = self.ctr
            self.ctr += 1
        new_lr = self.initial_lr * (1 - current_step / self.max_steps) ** self.exponent
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = new_lr
        self._last_lr = [group['lr'] for group in self.optimizer.param_groups]

    def get_last_lr(self):
        return self._last_lr
EOF
log_info "Patch A (PolyLR) applied to $POLYLR_PATH"
# Patch B: PyTorch 2.6+ unpickling fix (weights_only=False)
TRAINER_PATH=$(python -c "import nnunetv2; import os; print(os.path.join(nnunetv2.__path__[0], 'training/nnUNetTrainer/nnUNetTrainer.py'))")
sed -i 's/map_location=self.device)/map_location=self.device, weights_only=False)/g' "$TRAINER_PATH"
log_info "Patch B (Unpickling) applied to $TRAINER_PATH"
else
    log_info "Skipping patches (APPLY_PATCHES=false)"
fi

4. Job env vars used

export PYTHONUNBUFFERED=1
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export nnUNet_compile=False
export nnUNet_n_proc_DA=<varies — see timeline>
export TMPDIR=$SLURM_TMPDIR
export JOBLIB_TEMP_FOLDER=$SLURM_TMPDIR
ulimit -n 65536

5. Dataset / run command
6. The crashes I've seen so far (for context)

I'm including these now so you know what shape of failure I am encountering, but I'd rather not have you dig into these until I try again.

Crash A — fold 0, Epoch 6, fresh run

Config differences from current script:
Epochs 0–5 ran cleanly (80–120 s each). At Epoch 6: no child-process traceback was emitted; the workers died silently.

Mitigations that partially helped

After Crash A, I added:

ulimit -n 65536
#SBATCH --tmp=100G
export TMPDIR=$SLURM_TMPDIR
export JOBLIB_TEMP_FOLDER=$SLURM_TMPDIR

With those in place, training ran cleanly from Epoch 0 through Epoch 299.

Crash B — fold 3, Epoch 300, on resume with
See also MIC-DKFZ/nnUNet#2910; this is the same issue and fix.
Hello,
We encountered the "One or more background workers are no longer alive. Exiting" error as described in:
#134
#133
when running nnUNet on an HPC cluster. A colleague and I looked into this issue and came across this link:
joblib/threadpoolctl#176
which says that using with threadpool_limits is not thread safe and recommends using a global mutex instead. with threadpool_limits is used in both the nnUNet and batchgenerators libraries. By using a global mutex instead of with threadpool_limits, as done in this pull request, we no longer encounter this issue. We verified that the nnUNet output is similar (it cannot be identical, since the multi-threaded launcher is non-deterministic).
We encountered this error with both nnUNet and the batchgenerators library, so a similar pull request was made in nnUNet.