Enabling GPU direct communication for Cray-MPICH by JPRichings · Pull Request #724 · QuEST-Kit/QuEST

JPRichings · 2026-04-15T16:16:37Z

This needs testing but here is an initial suggestion for checking support for GPU enabled MPI in CRAY_MPICH this should enable direct inter GPU comms for a wider range of systems in the HPC space that don't always support OpenMPI for there scale out network.

Plan to test on MI210 nodes on ARCHER2 when they are available.

JPRichings · 2026-04-25T21:18:12Z

Should have updated this before:

Hardware: 1 node 4 gpus Mi210 ARCHER2

before (no MPI direct comms):

QFT run time: 9.49068s
Total run time: 10.2344s

after (with MPI direct comms):

QFT run time: 2.29135s
Total run time: 2.92993s

I think it might be worth switching on GPU direct comms... ;)

JPRichings · 2026-04-28T15:57:36Z

Test all failing due to:

[ 80%] Building CXX object _deps/catch2-build/src/CMakeFiles/Catch2.dir/catch2/matchers/catch_matchers_predicate.cpp.o
/Users/runner/work/QuEST/QuEST/quest/src/comm/comm_config.cpp:75:5: error: use of undeclared identifier 'MPI_Get_library_version'
   75 |     MPI_Get_library_version(version_string, resultlen);
      |     ^
1 error generated.

I will look to add macro guard around this mpi call to fix non-mpi compilation.
need to set some sensible default buffer sizes
Need to remove the debugging messages for the code

JPRichings · 2026-04-28T19:14:01Z

need to return NONE for comm_whichMpi and false for comm_set_isMpiGpuAware

TysonRayJones · 2026-05-31T06:53:23Z

Hmm I think this diff is unnecessarily convoluted. It seeks to extend the systems for which we can detect GPU-awareness, but creates a comm_config.cpp dependency on an API env function:

bool comm_isMpiGpuAware() {
	return getQuESTEnv().isMPIGPUAware;
}

and performs brittle string processing to establish the MPI version...

enum Mpi_version {NONE, OPENMPI, CRAYMPICH};

int comm_whichMpi() {

    #ifdef COMPILE_MPI
        char version_string[MPI_MAX_LIBRARY_VERSION_STRING];
    #else
        char version_string[];
    #endif

    int resultlen[] = {0};

    #ifdef COMPILE_MPI
        MPI_Get_library_version(version_string, resultlen);
    #endif
    enum Mpi_version version = NONE;

    // Check if Openmpi used
    #ifdef OPEN_MPI
        version = OPENMPI;
    #endif

    // Check if Cray MPI used 
	
    const char* cray_string = "CRAY MPICH";

    std::string v_string = version_string;

    if (v_string.find(cray_string) != string::npos) {
        version = CRAYMPICH;
    }

    return version;

}

just so that it can merely look for a single env variable, MPICH_GPU_SUPPORTED_ENABLED:

bool comm_set_isMpiGpuAware() {

   int mpi_lib = comm_whichMpi();
  
    if(OPENMPI==mpi_lib) {
        // definitely not GPU-aware if compiler declares it is not
        #if defined(MPIX_CUDA_AWARE_SUPPORT) && ! MPIX_CUDA_AWARE_SUPPORT
            return false;
        #endif

        // check CUDA-awareness at run-time if we know it's principally supported
        #if defined(MPIX_CUDA_AWARE_SUPPORT)
            return (bool) MPIX_Query_cuda_support();
	#endif
    }

    if(CRAYMPICH==mpi_lib) { 
          
       const char* var = std::getenv("MPICH_GPU_SUPPORT_ENABLED");
       
       return (bool) var;
    }

    // if we can't ascertain CUDA-awareness, just assume no to avoid seg-fault
    return false;
}

In my understanding, this is all unnecessary because the check of "is GPU aware" is always fail safe. If we cannot detect an affirmative "yes", then we default to "no". Checking the MPI versions is redundant, given we can just check for the macros directly - when they are not defined, the check is null (NOT false) and we safely proceed.

The original comm_isMpiGpuAware function was:

bool comm_isMpiGpuAware() {

    // definitely not GPU-aware if compiler declares it is not
    #if defined(MPIX_CUDA_AWARE_SUPPORT) && ! MPIX_CUDA_AWARE_SUPPORT
        return false;
    #endif

    // check CUDA-awareness at run-time if we know it's principally supported
    #if defined(MPIX_CUDA_AWARE_SUPPORT)
        return (bool) MPIX_Query_cuda_support();
    #endif

    // if we can't ascertain CUDA-awareness, just assume no to avoid seg-fault
    return false;
}

If my understanding is correct, to support CRAY MPICH, we just need to introduce 3 lines before the return false, like:

    // check whether an MPICH env-var indicates support (we assume it never lies!)
    static const char* var = std::getenv("MPICH_GPU_SUPPORT_ENABLED"); // load once
    if (var && std::atoi(var))
        return true;

The only functional difference between that and this PR's diff is that the PR will never try to find the MPICH env-var when it knows OpenMPI is used. That doesn't add any robustness (the OpenMPI checks were safe, since undefined macros in C default to falsiness), except if we believe OpenMPI systems might have an erroneous MPICH_GPU_SUPPORT_ENABLED=1 env-var. And of course that is absurd, and much less likely than the true danger inherent to both these solutions; that an MPICH user will have somehow erroneously overriden MPICH_GPU_SUPPORT_ENABLED=1.

(Note too this branch doesn't compile, beyond just the isMPIGPUAware typo; MPI_MAX_LIBRARY_VERSION_STRING is not defined in Clang's MPI)

It seems best to close the PR and create a fresh one with the 3 lines mentioned above. Have I understood all this right?

TysonRayJones · 2026-05-31T07:00:26Z

Note I like the idea of attaching isMpiGpuAware to the environment at runtime. We should keep that, but then of course report it within reportQuESTEnv() and getEnvironmentString() so users can see it! I can add all that in a new PR

(Also note to self; rename getEnvironmentString() to getQuESTEnvString() - iirc this function exists for a pyQuEST hook, and @rrmeister can tolerate an API break ;) )

initial commit enabling GPU direct communictation for Cray-MPICH

2e5441f

JPRichings requested a review from otbrown April 15, 2026 16:16

JPRichings added the enhancement label Apr 15, 2026

JPRichings mentioned this pull request Apr 16, 2026

Cray MPICH GPU-GPU direct communication not supported #727

Open

JPRichings added 2 commits April 17, 2026 00:10

Update to GPU aware MPI checking logic

09805cd

update gpu aware setting logic

55fff7c

otbrown changed the title ~~initial commit enabling GPU direct communictation for Cray-MPICH~~ initial commit enabling GPU direct communication for Cray-MPICH Apr 27, 2026

TysonRayJones reviewed May 3, 2026

View reviewed changes

Comment thread quest/include/environment.h Outdated

JPRichings added 3 commits May 28, 2026 17:08

update formatting

3550cd5

Add MPI version string length and compile guard

ec138c7

remove debug printing

46618e2

JPRichings changed the title ~~initial commit enabling GPU direct communication for Cray-MPICH~~ Enabling GPU direct communication for Cray-MPICH May 28, 2026

otbrown marked this pull request as ready for review May 30, 2026 18:37

TysonRayJones mentioned this pull request May 31, 2026

Add GPU-aware MPICH detection #766

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enabling GPU direct communication for Cray-MPICH#724

Enabling GPU direct communication for Cray-MPICH#724
JPRichings wants to merge 6 commits into
QuEST-Kit:develfrom
JPRichings:devel

JPRichings commented Apr 15, 2026

Uh oh!

JPRichings commented Apr 25, 2026 •

edited

Loading

Uh oh!

JPRichings commented Apr 28, 2026 •

edited

Loading

Uh oh!

JPRichings commented Apr 28, 2026

Uh oh!

Uh oh!

TysonRayJones commented May 31, 2026

Uh oh!

TysonRayJones commented May 31, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

JPRichings commented Apr 15, 2026

Uh oh!

JPRichings commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JPRichings commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JPRichings commented Apr 28, 2026

Uh oh!

Uh oh!

TysonRayJones commented May 31, 2026

Uh oh!

TysonRayJones commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

JPRichings commented Apr 25, 2026 •

edited

Loading

JPRichings commented Apr 28, 2026 •

edited

Loading

TysonRayJones commented May 31, 2026 •

edited

Loading