[hipCUB] Centralized system for disabling tests#7241
Conversation
ff75615 to
efd2efa
Compare
This change adds a centralized mechanism for disabling tests in hipCUB. Tests can be disabled by adding rules to the text file hipcub/test/control.txt. These rules allow you to: - Disable tests or whole test suites by name using a regex - Disable tests only for particular architectures (using an architecture regex). The architecture regex supports "keywords": sets of commonly associated architectures (eg. all APU architectures, or all navi3x devices). These keywords provide a shortcut that makes it easy to disable tests for whole families of GPUs at once. - Disable only individual input sizes. You can specify ranges using relational operators (eg. disable all tests sizes >100). You can also use arithmetic operators when specifying the size (eg. (1 << 32 + 1)). This can be handy because test sizes are often specified as offsets from powers of 2 in the code. So in these cases, you can just directly copy and paste them into the rule. - Disable tests only when they're running a particular environment (OS or build configuration option). For example, you can disable tests when using ASAN or Valgrind, when running on Windows or Linux, or when running on an Nvidia platform. - Print a message after disabling a test to let the user know why it's disabled. This system brings a few advantages: - We often encounter situations where we need to temporarily disable a test. It is easy to forget to remove the code that disables the test if it is scatter throughout the files in the test directory. Using a centralized system reduces the chances of this, since everything's in one place, and provides a way to quickly look up which tests are currently disabled and why. - Keeping the rules in a text file allows us to enable/disable tests without needing to rebuild, since the text file is read at runtime. - Minimal code changes are needed in order to hookup existing tests to the centralized system. Modying tests so that their test fixture class inherits from test_controller::ControlledTest (as described below) causes a test disablement check to happen automatically before the test is run. For most tests, filtering individual sizes requires adding single macro call. - Code for detecting environment or platform state (eg. which OS we're running on, whether we're using ASAN or Valgrind) can be centralized in the test controller class instead of being replicated within each test file that requires checks. - Uses a custom rule format that's parsed inside library test code (the TestController class), so no additional dependencies (eg. JSON or XML parser) are required. - Defining an environment variable (export HIPCUB_EXTRA_TC_INFO=1) will cause the test controller to print all of test and sizes it's skipping, and the line number of the line in the control file that caused them to be skipped. This can be useful when you need to test out a new rule, or if you need to locate which control file lines affect a particular test. There are also a few potential disadvantages: - When writing a rule, the regexes that are used for matching test names require a little extra effort to construct, since you have to escape special characters and keep in mind that whitespace matters. - The text file rule format is not hierarchically organized like JSON objects. Rules are just a list of items delimited by a colon. The text parsing system is complex, and requires additional unit tests of it's own (see test_hipcub_test_controller.cpp) to validate. I've done my best to mitigate this by adding lots of documentation. - Since GTEST_SKIP only skips the currently running function, we have to wrap the size filter calls in macros. Adding macros to our unit tests (even if it's only one per test) is ugly, and easy to forget to do. I can add some documentation about how to write unit tests that adhere to the test controller to try to help reduce the changes of forgetting. - While this change is able to replace a number of individual GTEST_SKIP calls in existing unit tests with rules in the central text file, some GTEST_SKIP calls depend on values the condition checks on variables that are local to the test function (eg. whether the bock size or wavefront size exceeds some limit). Currently, these conditions cannot be captured inside rules. If it's deemed useful, this could probably be done in a follow-up change.
Added documentation, fixed bug with nvidia build type - changed it an arch_regex keyword instead. Removed unnecessary code.
8599a55 to
200a1c7
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. ❌ Your project status has failed because the head coverage (69.24%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## develop #7241 +/- ##
===========================================
+ Coverage 62.78% 63.06% +0.28%
===========================================
Files 972 1014 +42
Lines 185366 187970 +2604
Branches 21999 22100 +101
===========================================
+ Hits 116376 118529 +2153
- Misses 61008 61408 +400
- Partials 7982 8033 +51
*This pull request uses carry forward flags. Click here to find out more. 🚀 New features to boost your workflow:
|
Simplify the control-file-finding logic so that it searches relative to the path to the currently running binary (instead of using some fixed pathes inserted by CMake). Fix up documentation errors, minor bug fixes.
9a0a9e6 to
9255f20
Compare
9255f20 to
d8fbba5
Compare
d8fbba5 to
d5216ae
Compare
amd-jmahovsky
left a comment
There was a problem hiding this comment.
Some comments and observations.
| } | ||
|
|
||
| // Each entry represents the parsed information from one line of the control file. | ||
| std::vector<ControlInfo> control_data; |
There was a problem hiding this comment.
control_info for the var name? When reading this code I wasn't sure if ControlInfo and control_data were the same thing.
| // - Translates architecture keywords used in the control file intor | ||
| // regexes that are matched against the gfx id of the device currently | ||
| // in use. Please see the control file comments for more on these keywords. | ||
| class TestController |
There was a problem hiding this comment.
One general comment: This class does a lot of different things. It might make for clearer code if the ControlInfo/data parsing from the file was split out, then the TestController doesn't have the burden of handling all that parsing/token/etc. This is assuming I am understanding correctly that the filtering only needs the parsed ControlInfo/data and nothing else in the control file.
| #endif // defined(HIPCUB_ROCPRIM_API) | ||
|
|
||
| // -- Macros to use in unit tests -- | ||
| // Use macros here, even though they're ugle, since it's the only way to call |
| /HipcubDeviceRadixSort\/.*/ : /<apus>/ : >= (1 << 32) : * : "Skipping sizes over 2^32 on APUs" | ||
| /HipcubDeviceRadixSortLargeInput.*/ : /<apus>/ : >= (1 << 32) : * : "Skipping sizes over 2^32 on APUs" | ||
| /HipcubDeviceMergeSort\/.*/ : /<apus>/ : >= (1 << 32) : * : "Skipping sizes over 2^32 on APUs" | ||
| /HipcubIteratorTests\/\d+\.TestTex.+/ : /<mi300-family>|navi4x-family|<apus>/ : * : * : "Test not run on gfx94x, gfx95x, or gfx120x as texture cache API is not supported." |
There was a problem hiding this comment.
Should navi4x-family be in angle brackets like mi300-family?
| #else | ||
| // Linux does not add a '\0' at the end of the path, so we must do that manually. | ||
| // If truncate if the path exceeds max_len chars. | ||
| int bytes_read = readlink("/proc/self/exe", path_buf, max_len); |
There was a problem hiding this comment.
The readlink() system call can return -1 if it fails. That doesn't seem to be handled. Could it fail in this scenario and does that need to be handled?
|
|
||
| // The name may contain extra bits we don't need - eg. the xnack portion of "gfx942:xnack+". | ||
| const auto length = sizeof(hipDeviceProp_t::gcnArchName); | ||
| char* arch_end = std::find_if(arch_name, |
There was a problem hiding this comment.
If this doesn't find anything then arch_end will point one past the end of the string and then we will write a 0 out of bounds.
| for (size_t i = 0; !sizes.empty() && i < this->control_data.size(); i++) | ||
| { | ||
| // Grab the filter information that was parsed from the current line. | ||
| ControlInfo info = this->control_data[i]; |
There was a problem hiding this comment.
Could this be a const reference to avoid some copying?
| info.build_type_test_fns.push_back(it->second); | ||
| } | ||
|
|
||
| build_type_part = match_result.suffix(); |
There was a problem hiding this comment.
This doesn't seem to be used after being set...?
| } | ||
| } | ||
|
|
||
| size_part = match_result.suffix(); |
There was a problem hiding this comment.
does match_result change in this loop? i.e. can we just set size_part once?
| // Parsed from size filters. These are unary functions that accept a size and return true if that size should be skipped. | ||
| std::vector<std::function<bool(size_t)>> size_test_fns; | ||
| // Set to true if the user has specified a '*' to skip all test sizes. | ||
| bool disable_all_sizes; |
There was a problem hiding this comment.
should some of these members be default initialized? i.e. disable_all_sizes = false etc.
There was a problem hiding this comment.
(might just be good defensive coding practice..)
Motivation
We often need to temporarily disable tests until some root cause (potentially in another component) is tracked down and fixed. This results in a lot of disablement checks scattered throughout the test files. It's easy to forget to re-enable a test when something's fixed, and there can be code duplication in the checks that different files perform. In addition, it takes some time and effort to figure out which tests are disabled and why (you have to open and examine each individual test file and locate the checks).
It would be nice to have a single, centralized file that we can disable tests from. The system that reads this file could also encapsulate the checks we need to perform within a single class.
Technical Details
This change adds a centralized mechanism for disabling tests in hipCUB. Tests can be disabled by adding rules to the text file located at
hipcub/test/control.txt.These rules allow you to:
This system brings a few advantages:
test_controller::ControlledTestcauses a test disablement check to happen automatically before the test is run. For most tests, filtering individual sizes requires adding a single macro call.export HIPCUB_EXTRA_TC_INFO=1) will cause the test controller to print out all of tests and sizes it's skipping, and the line number of the line in the control file that caused them to be skipped. This can be useful when you need to test out a new rule, or if you need to locate which control file lines affect a particular test.There are also a few potential disadvantages:
test_hipcub_test_controller.cpp) to validate. I've done my best to mitigate this by adding lots of documentation.GTEST_SKIPonly skips the currently running function, we have to wrap the size filter calls in macros. Adding macros to our unit tests (even if it's only one per test) is ugly, and easy to forget to do. I've added some documentation about how to write unit tests that adhere to the test controller to try to help.GTEST_SKIPcalls in existing unit tests with rules in the central text file, someGTEST_SKIPcalls depend on condition checks that use variables that are local to the test function (eg. whether the block size or wavefront size exceeds some limit). Currently, these conditions cannot be captured inside rules. If it's useful, perhaps this is something we could think about in a follow-up change.Test Plan
Build and run all unit tests. Ensure that tests that used to be skipped are still correctly skipped (under the right conditions).
Build and run the test controller's own internal tests (the
test_hipcub_test_controllertarget).Test Result
Tests are skipped correctly locally. TheRock CI change here (ROCm/TheRock#5504) allows the test control file through into the final installation directory.
Submission Checklist