Fix NODE_ID bug in SLURM process group initialization#64
Open
vitusbenson wants to merge 1 commit intodevelopfrom
Open
Fix NODE_ID bug in SLURM process group initialization#64vitusbenson wants to merge 1 commit intodevelopfrom
vitusbenson wants to merge 1 commit intodevelopfrom
Conversation
_init_process_group_slurm() used _WorkerInfo.NODE_ID to index into tasks_per_node, but _WorkerInfo.NODE_ID is None at that point — it only gets set later by _initialize_via_tcp(). Use the local node_id variable (parsed from SLURM_NODEID) instead.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Bug
_init_process_group_slurm()crashes withTypeError: list indices must be integers or slices, not NoneTypewhen launched viasrunon a SLURM cluster.Root Cause
Line 280 uses
_WorkerInfo.NODE_IDto index into thetasks_per_nodelist:But
_WorkerInfo.NODE_IDisNoneat this point — it only gets populated later by_initialize_via_tcp()(called on line 282). The local variablenode_id(parsed fromSLURM_NODEIDon line 268) holds the correct value.MWE
Output on
develop:Output with fix:
Tested on MPCDF DAIS (H200 nodes, SLURM 24.x).
Fix
Replace
_WorkerInfo.NODE_IDwith the localnode_idvariable that was already correctly parsed fromSLURM_NODEIDon line 268.