
Fix NODE_ID bug in SLURM process group initialization #64

Open
vitusbenson wants to merge 1 commit into develop from fix/slurm-node-id-bug

@vitusbenson
Collaborator

Bug

_init_process_group_slurm() crashes with TypeError: list indices must be integers or slices, not NoneType when launched via srun on a SLURM cluster.

Root Cause

Line 280 uses _WorkerInfo.NODE_ID to index into the tasks_per_node list:

local_world_size = tasks_per_node[_WorkerInfo.NODE_ID]

But _WorkerInfo.NODE_ID is still None at this point: it is only populated later by _initialize_via_tcp() (called on line 282). The local variable node_id (parsed from SLURM_NODEID on line 268) already holds the correct value.
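
The failure mode can be reproduced in isolation, without SLURM or dmlcloud. The sketch below is illustrative only: `_WorkerInfoStub` and `index_with_unset_node_id` are hypothetical names, not part of dmlcloud, and the stub merely imitates a `NODE_ID` attribute that has not been set yet.

```python
class _WorkerInfoStub:
    """Hypothetical stand-in for _WorkerInfo before _initialize_via_tcp() has run."""

    NODE_ID = None  # not populated yet


def index_with_unset_node_id(tasks_per_node):
    """Imitate the buggy lookup; return the error message, or None if it succeeds."""
    try:
        tasks_per_node[_WorkerInfoStub.NODE_ID]
    except TypeError as exc:
        return str(exc)
    return None


# One task on a single node, as in the MWE.
message = index_with_unset_node_id([1])
print(message)
```

Indexing a list with `None` raises the same kind of `TypeError` as the traceback reported for develop.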

MWE

#!/bin/bash
#SBATCH --partition=gpu1
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1

srun python -c "import dmlcloud as dml; dml.init()"

Output on develop:

  File "dmlcloud/core/distributed.py", line 280, in _init_process_group_slurm
    local_world_size = tasks_per_node[_WorkerInfo.NODE_ID]
TypeError: list indices must be integers or slices, not NoneType

Output with fix:

Connecting via slurm and TCPStore:
  rank: 0, world size: 1, local rank: 0, local world size: 1, node id: 0
torch.distributed initialized
SUCCESS: dml.init() completed

Tested on MPCDF DAIS (H200 nodes, SLURM 24.x).

Fix

Replace _WorkerInfo.NODE_ID with the local node_id variable that was already correctly parsed from SLURM_NODEID on line 268.
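
A minimal sketch of the corrected lookup, assuming node_id is obtained by converting SLURM_NODEID to an integer. The helper name `local_world_size_from_env` and its `environ` parameter are hypothetical and exist only to make the example runnable outside a SLURM job; the actual change is limited to swapping the index expression.

```python
import os


def local_world_size_from_env(tasks_per_node, environ=os.environ):
    """Return this node's task count, indexed by the node id that SLURM exports."""
    # Assumption: SLURM_NODEID is parsed to an int, as described for line 268.
    node_id = int(environ["SLURM_NODEID"])
    # Index with the local variable, not a class attribute that is set later.
    return tasks_per_node[node_id]


# Simulated environment for a single-node, single-task job like the MWE.
fake_env = {"SLURM_NODEID": "0"}
print(local_world_size_from_env([1], fake_env))
```

Because node_id comes straight from the environment, the lookup no longer depends on the order in which the process group state is initialized.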

_init_process_group_slurm() used _WorkerInfo.NODE_ID to index into
tasks_per_node, but _WorkerInfo.NODE_ID is None at that point —
it only gets set later by _initialize_via_tcp(). Use the local
node_id variable (parsed from SLURM_NODEID) instead.