Distributed training currently does not work across multiple nodes in compliant detonation chamber. The cause is a bug in the set_environment_variables_for_nccl_backend method in pymarlin.utils.distributed: for multi-node runs, the master node's address is read from the environment variable AZ_BATCH_MASTER_NODE rather than AZ_BATCHAI_MPI_MASTER_NODE. This works for signed builds, but AZ_BATCH_MASTER_NODE is disabled in detonation chamber. Can we modify the codebase so that multi-node training works there as well? Based on this Stack Overflow post, the recommendation seems to be to always use AZ_BATCHAI_MPI_MASTER_NODE over AZ_BATCH_MASTER_NODE.
Given the current implementation of set_environment_variables_for_nccl_backend, this should be a straightforward change: remove the if statement on single_node so that the master address is always read from AZ_BATCHAI_MPI_MASTER_NODE. I have already verified this change in compliant detonation chamber.
import os


def set_environment_variables_for_nccl_backend():
    """Sets distributed training environments for azureml openmpi runs with NCCL backend."""
    # NCCL environment. Still works without it.
    os.environ["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
    os.environ["NCCL_IB_DISABLE"] = "0"  # for IB

    single_node = int(os.environ["OMPI_COMM_WORLD_LOCAL_SIZE"]) == int(
        os.environ["OMPI_COMM_WORLD_SIZE"]
    )
    if single_node:
        master_node = os.environ["AZ_BATCHAI_MPI_MASTER_NODE"]
        master_port = "54965"
    else:
        # Multi-node branch: reads AZ_BATCH_MASTER_NODE, which is disabled in detonation chamber.
        master_node_params = os.environ["AZ_BATCH_MASTER_NODE"].split(":")
        master_node = master_node_params[0]
        master_port = (
            os.environ["MASTER_PORT"] if "MASTER_PORT" in os.environ else "6105"
        )

    # set env variables
    os.environ["MASTER_ADDR"] = master_node
    os.environ["MASTER_PORT"] = master_port
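
For reference, here is a minimal sketch of what the function could look like with the single_node branch removed. It is not necessarily the exact change I tested. In particular, the port handling is an assumption: the sketch honors MASTER_PORT if it is already set and otherwise falls back to the port the single-node branch used ("54965"). The maintainers may prefer a different default.

```python
import os


def set_environment_variables_for_nccl_backend():
    """Sketch: set NCCL env vars, always using AZ_BATCHAI_MPI_MASTER_NODE."""
    # NCCL environment, unchanged from the current implementation.
    os.environ["NCCL_SOCKET_IFNAME"] = "^docker0,lo"
    os.environ["NCCL_IB_DISABLE"] = "0"  # for IB

    # No single_node check: the same variable is read for single- and
    # multi-node runs, so AZ_BATCH_MASTER_NODE is never needed.
    master_node = os.environ["AZ_BATCHAI_MPI_MASTER_NODE"]

    # Assumption (not from the current code): keep a preset MASTER_PORT,
    # otherwise reuse the port of the former single-node branch.
    master_port = os.environ.get("MASTER_PORT", "54965")

    # set env variables
    os.environ["MASTER_ADDR"] = master_node
    os.environ["MASTER_PORT"] = master_port
```

With this version, OMPI_COMM_WORLD_LOCAL_SIZE and OMPI_COMM_WORLD_SIZE are no longer read by this function, since they were only used to compute single_node.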