Set gpu tpb#736
Conversation
…. Yes, it's confusing. Yes, the OpenMP ARB know.
…they would be easy.
|
Rudimentary testing done with: #include <cstdio>
#include "quest.h"
int main (void)
{
const int NQUBITS = 24;
const int TPB = 32;
initQuESTEnv();
reportQuESTEnv();
std::printf("Initial number of threads per block: %d\n", getQuESTGpuThreadsPerBlock());
setQuESTGpuThreadsPerBlock(TPB);
std::printf("New number of threads per block: %d\n", getQuESTGpuThreadsPerBlock());
Qureg qureg = createForcedQureg(NQUBITS);
std::printf("Initialising Qureg.\n");
initPlusState(qureg);
reportQureg(qureg);
std::printf("Applying Quantum Fourier Transform.\n");
applyFullQuantumFourierTransform(qureg, false);
reportQureg(qureg);
destroyQureg(qureg);
finalizeQuESTEnv();
return 0;
} |
|
Why would |
|
Is there an advantage to users having to set this as a runtime hyperparameter? My (mostly undeveloped) belief is we can use occupancy tools (alluded to here) to automate this. I definitely shy from giving users a greater onus to optimise for their settings (like other prolific softwares), which the v4 overhaul was supposed to avoid (via e.g. the autodeployer). Note too that the kernels so far are very primitive - each thread handles the updating of the minimum possible number of amplitudes (often just one!). I quite like that because it's very readable and simple (great for an open-source scientific project) but is an obvious site for optimisation.
It's true that it will never be anywhere as big as the quantities |
|
Hi Tyson, I just noticed the fixed value to 128 and have a feeling that it was large. I just wanted a handle so I could write a benchmark so we can easily automate performance tuning ourselves. I have not played with the occupancy tools but I should take a proper look as this might solve this automatically. My other concern is that there are differences between Nvidia and AMD on optimal sizes due to hardware differences so we might not be able to reply on the occupancy tuning in all cases unless this becomes available on all platforms. |
I guess it's very GPU specific! I think For illustration, the next smallest size is Of course, newer GPUs support more active blocks per SM (even when the max active threads per SM is unchanged). E.g. CC=8 supports up to 32 active blocks per SM, so we could shrink to Certainly seems prudent to consult a CUDA runtime API, if that doesn't hurt our AMD compatibility! |
|
Apologies, probably won't get to look at this again this week, but very happy to set this value programmatically if it can be done! As it's architecture dependent, we definitely do need a way to adjust it, and ideally both at runtime and compile time. At compile time, so kindly HPC support teams can compile and maintain a tuned version, and at runtime, so they can scan through values without having to recompile in between. I'll have a chat with James abour approaches later this week! I 100% agree that we don't really want unknowing users messing around with this. I think something like an |
|
Fair enough - you've convinced me! Being able to runtime adjust is of course extremely helping during development of a user-friendlier adaptive system anyhow. I like the sound of |
|
Should validate TPB is multiple of 32! |
|
Rather than attempt to post a thumbs up on each comment while on train WiFi, I'll just comment thanks @TysonRayJones here! I'm hoping to give this branch some proper attention on Thursday/Friday. |
|
No prob! They're almost all nits I can sweep through rapidly. You can flag any you disagree with or for which there's nuance, and I can address the remainder next time I'm on QuEST. (No need to co-author the squash with me for my minor and mostly stylistic changes!) |
|
Okay, so I've managed a handful of minor updates on this today 🤦♂️ I've kicked back the grand plan to create a perftune.h and will just focus on getting this version polished enough to merge for v4.3 |
…emoved superfluous getQuESTEnv
|
Getting there -- still need to add tests, documentation, and review some |
…idated in the internal gpu function to stop me wonder where that is in future
|
I'll fix the validation issues, and move the API function; I think it doesn't belong in |
TysonRayJones
left a comment
There was a problem hiding this comment.
(I'll address these in #767)
| | `USER_SOURCE_NAMES` | (Undefined), String | The source file for a user program which will be compiled alongside QuEST. `USER_OUTPUT_EXE_NAME` *must* also be defined. | | ||
| | `USER_OUTPUT_EXE_NAME` | (Undefined), String | The name of the executable which will be created from the provided `USER_SOURCE_NAMES`. `USER_SOURCE_NAMES` *must* also be defined. | | ||
|
|
||
| | `QUEST_GPU_NUM_THREADS_PER_BLOCK` | (128), Number | The default number of threads per block QuEST will use when offloading to a GPU. *Must* be a multiple of 32. For AMD GPUs this *should* be a multiple of 64. | |
There was a problem hiding this comment.
A note on naming: QUEST_GPU_NUM_THREADS_PER_BLOCK documented here is inconsistent with the actual parameter name, QUEST_DEFAULT_NUM_THREADS_PER_BLOCK. But I think both GPU and DEFAULT are important to include, so advocate for the third name:
QUEST_DEFAULT_GPU_NUM_THREADS_PER_BLOCKThis clarifies the threads discussed here are GPU specific (not to be confused with CPU threads), and that this only overrides the default, in the same fashion as QUEST_DEFAULT_VALIDATION_EPSILON (which is an env-var).
| | `USER_SOURCE_NAMES` | (Undefined), String | The source file for a user program which will be compiled alongside QuEST. `USER_OUTPUT_EXE_NAME` *must* also be defined. | | ||
| | `USER_OUTPUT_EXE_NAME` | (Undefined), String | The name of the executable which will be created from the provided `USER_SOURCE_NAMES`. `USER_SOURCE_NAMES` *must* also be defined. | | ||
|
|
||
| | `QUEST_GPU_NUM_THREADS_PER_BLOCK` | (128), Number | The default number of threads per block QuEST will use when offloading to a GPU. *Must* be a multiple of 32. For AMD GPUs this *should* be a multiple of 64. | |
There was a problem hiding this comment.
Further, what is the motivation of making this variable a CMake/compile-time constant? The fact it can be runtime overriden means we don't gain any performance benefit. So we can penalty-free improve the flexibility by making this instead an environment variable, just like QUEST_DEFAULT_VALIDATION_EPSILON, so that changing it doesn't require recompilation
There was a problem hiding this comment.
See reply on #767 -- the flexibility is a convenience, the compilation is the important bit 😅
| QuESTEnv getQuESTEnv(); | ||
|
|
||
|
|
||
| /** @notyetdoced |
There was a problem hiding this comment.
Have to add a newline after command @notyetdoced, or the text below gets included in the command's created warning box
| * OpenMP the user can just export OMP_NUM_THREADS or call omp_set_num_threads. | ||
| */ | ||
| int getQuESTNumGpuThreadsPerBlock(); | ||
| void setQuESTNumGpuThreadsPerBlock(const int newThreadsPerBlock); |
There was a problem hiding this comment.
Needs separate TODO doc. We should reference these functions from initCustomQuESTEnv (or somewhere similar) to indicate further control of runtime environment is possible
| {"isOmpCompiled", cpu_isOpenmpCompiled()}, | ||
| {"isCuQuantumCompiled", gpu_isCuQuantumCompiled()}, | ||
| {"isGpuCompiled", gpu_isGpuCompiled()}, | ||
| {"isHipCompiled", gpu_isHipCompiled()}, |
There was a problem hiding this comment.
Great idea to introduce this, but I don't like that the binary is HIP-focused. We can replace isHipCompiled with "gpuPlatform", with options "CUDA" or "HIP"
There was a problem hiding this comment.
I think the binary of isHipCompiled would be useful to attach directly to QuESTEnv too, like James did for isMpiGpuAware
| # GPU Performance Tuning | ||
| ## We do not print this value when configuring CMake as it is for advanced users only. | ||
|
|
||
| set(QUEST_GPU_NUM_THREADS_PER_BLOCK 128 | ||
| CACHE | ||
| STRING | ||
| "The default number of threads per block QuEST will use when offloading to a GPU. Set to 128 by default. Must be a multiple of 32." | ||
| ) | ||
| mark_as_advanced(QUEST_GPU_NUM_THREADS_PER_BLOCK) | ||
|
|
||
| math(EXPR quest_tpb_remainder "${QUEST_GPU_NUM_THREADS_PER_BLOCK} % 32") | ||
| if ((NOT (quest_tpb_remainder EQUAL 0)) OR (QUEST_GPU_NUM_THREADS_PER_BLOCK LESS 32)) | ||
| message(FATAL_ERROR "QUEST_GPU_NUM_THREADS_PER_BLOCK must be a multiple of 32. QUEST_GPU_NUM_THREADS_PER_BLOCK=${QUEST_GPU_NUM_THREADS_PER_BLOCK}.") | ||
| endif() |
There was a problem hiding this comment.
Remove if we switch to env-var
| set(QUEST_COMPILE_SUBCOMM ${QUEST_ENABLE_SUBCOMM}) | ||
| set(QUEST_COMPILE_CUQUANTUM ${QUEST_ENABLE_CUQUANTUM}) | ||
| set(QUEST_INCLUDE_DEPRECATED_FUNCTIONS ${QUEST_ENABLE_DEPRECATED_API}) | ||
| set(QUEST_DEFAULT_NUM_THREADS_PER_BLOCK ${QUEST_GPU_NUM_THREADS_PER_BLOCK}) |
There was a problem hiding this comment.
Remove if we switch to env-var
| globalEnvPtr->isMultithreaded, | ||
| globalEnvPtr->isDistributed, | ||
| globalEnvPtr->userOwnsMpi, | ||
| numThreads, |
There was a problem hiding this comment.
can print numGpuThreadsPerBlock here (and rename threads to cpuThreads)
There was a problem hiding this comment.
(holding off on this; everything else reported here is [at least intendedly] fixed over runtime. The semi-exception is the reported num-threads, but users cannot modify that via QuEST: instead they'd need to use the OMP runtime. we can return and add gpuNumThreads when we sort out making cpuNumThreads runtime variable)
| } | ||
|
|
||
|
|
||
| TEST_CASE( "QuESTNumGpuThreadsPerBlock", TEST_CATEGORY ) { |
There was a problem hiding this comment.
must be the API function name
There was a problem hiding this comment.
Fair, just need to split it into "set" and "get" in that case.
| SECTION( LABEL_CORRECTNESS ) { | ||
| // Check that it initially matches the compile time value | ||
| // stored in config.h | ||
| REQUIRE(getQuESTNumGpuThreadsPerBlock() == QUEST_DEFAULT_NUM_THREADS_PER_BLOCK); |
There was a problem hiding this comment.
nit: spaces around macro invocation, i.e. REQUIRE( x )
Creating a facility for users to runtime set threads per block for tuning the GPU implementation. NOTE: only applies to kernels that are not handled by Thrust, which does its own thing. Resolves #735.
I considered and rejected the idea of creating a symmetric interface for the CPU for users who don't know
OMP_NUM_THREADSoromp_set_num_threads()exist, but that's much riskier as the point of truth is external (in the OpenMP runtime).TODO:
Should gpu_getNumThreadsPerBlock return a.qindex? ProbablyCreate a new home for user facing API, as environment doesn't really make sense.Kicking this back to next release.Query seemingly unused branch at
if constexpr (NumTargs != -1) {
QuEST/quest/src/gpu/gpu_subroutines.cpp
Line 453 in b7d4a29