[hipCUB] Centralized system for disabling tests by umfranzw · Pull Request #7241 · ROCm/rocm-libraries

umfranzw · 2026-05-09T04:16:03Z

Motivation

We often need to temporarily disable tests until some root cause (potentially in another component) is tracked down and fixed. This results in a lot of disablement checks scattered throughout the test files. It's easy to forget to re-enable a test when something's fixed, and there can be code duplication in the checks that different files perform. In addition, it takes some time and effort to figure out which tests are disabled and why (you have to open and examine each individual test file and locate the checks).

It would be nice to have a single, centralized file that we can disable tests from. The system that reads this file could also encapsulate the checks we need to perform within a single class.

Technical Details

This change adds a centralized mechanism for disabling tests in hipCUB. Tests can be disabled by adding rules to the text file located at hipcub/test/control.txt.

These rules allow you to:

Disable tests or whole test suites by name using a regex.
Disable tests only for particular architectures (using an architecture regex). The architecture regex supports "keywords": sets of commonly associated architectures (eg. all APU architectures, or all navi3x devices, or nvidia devices). These keywords provide a shortcut that makes it easy to disable tests for whole families of GPUs at once.
Disable individual input sizes, or ranges using relational operators (eg. disable all tests sizes >100). You can also use arithmetic operators when specifying the size (eg. (1 << 32 + 1)). This can be handy because test sizes are often specified as offsets from powers of 2 in the code. So in these cases, you can just directly copy and paste them into the rule.
Disable tests only when they're running in a particular environment (OS or build configuration option). For example, you can disable tests when using ASAN or Valgrind, or when running on Windows or Linux,.
Print a message after disabling a test to let the user know why it's disabled.

This system brings a few advantages:

As mentioned above, using a centralized system reduces the chances of forgetting to re-enable, since everything's in one place, and provides a way to quickly look up which tests are currently disabled and why.
Keeping the rules in a text file allows us to enable/disable tests without needing to rebuild, since the text file is read at runtime.
Minimal code changes are needed in order to hookup existing tests to the centralized system. Modifying tests so that their test fixture class inherits from test_controller::ControlledTest causes a test disablement check to happen automatically before the test is run. For most tests, filtering individual sizes requires adding a single macro call.
Code for detecting environment or platform state (eg. which OS we're running on, whether we're using ASAN or Valgrind) can be centralized in the test controller class instead of being replicated within each test file that requires checks.
Uses a custom rule format that's parsed inside the test controller class, so no additional dependencies (eg. JSON or XML parsers) are required.
Defining an environment variable (export HIPCUB_EXTRA_TC_INFO=1) will cause the test controller to print out all of tests and sizes it's skipping, and the line number of the line in the control file that caused them to be skipped. This can be useful when you need to test out a new rule, or if you need to locate which control file lines affect a particular test.

There are also a few potential disadvantages:

When writing a rule, the regexes that are used for matching test names require a little extra effort to construct, since you have to escape special characters and keep in mind that whitespace matters.
The text file rule format is not hierarchically organized like JSON objects. Rules are just a list of items delimited by a colon. The text parsing code is complex, and requires additional unit tests of it's own (see test_hipcub_test_controller.cpp) to validate. I've done my best to mitigate this by adding lots of documentation.
Since GTEST_SKIP only skips the currently running function, we have to wrap the size filter calls in macros. Adding macros to our unit tests (even if it's only one per test) is ugly, and easy to forget to do. I've added some documentation about how to write unit tests that adhere to the test controller to try to help.
While this change is able to replace a number of individual GTEST_SKIP calls in existing unit tests with rules in the central text file, some GTEST_SKIP calls depend on condition checks that use variables that are local to the test function (eg. whether the block size or wavefront size exceeds some limit). Currently, these conditions cannot be captured inside rules. If it's useful, perhaps this is something we could think about in a follow-up change.

Test Plan

Build and run all unit tests. Ensure that tests that used to be skipped are still correctly skipped (under the right conditions).
Build and run the test controller's own internal tests (the test_hipcub_test_controller target).

Test Result

Tests are skipped correctly locally. TheRock CI change here (ROCm/TheRock#5504) allows the test control file through into the final installation directory.

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

This change adds a centralized mechanism for disabling tests in hipCUB. Tests can be disabled by adding rules to the text file hipcub/test/control.txt. These rules allow you to: - Disable tests or whole test suites by name using a regex - Disable tests only for particular architectures (using an architecture regex). The architecture regex supports "keywords": sets of commonly associated architectures (eg. all APU architectures, or all navi3x devices). These keywords provide a shortcut that makes it easy to disable tests for whole families of GPUs at once. - Disable only individual input sizes. You can specify ranges using relational operators (eg. disable all tests sizes >100). You can also use arithmetic operators when specifying the size (eg. (1 << 32 + 1)). This can be handy because test sizes are often specified as offsets from powers of 2 in the code. So in these cases, you can just directly copy and paste them into the rule. - Disable tests only when they're running a particular environment (OS or build configuration option). For example, you can disable tests when using ASAN or Valgrind, when running on Windows or Linux, or when running on an Nvidia platform. - Print a message after disabling a test to let the user know why it's disabled. This system brings a few advantages: - We often encounter situations where we need to temporarily disable a test. It is easy to forget to remove the code that disables the test if it is scatter throughout the files in the test directory. Using a centralized system reduces the chances of this, since everything's in one place, and provides a way to quickly look up which tests are currently disabled and why. - Keeping the rules in a text file allows us to enable/disable tests without needing to rebuild, since the text file is read at runtime. - Minimal code changes are needed in order to hookup existing tests to the centralized system. Modying tests so that their test fixture class inherits from test_controller::ControlledTest (as described below) causes a test disablement check to happen automatically before the test is run. For most tests, filtering individual sizes requires adding single macro call. - Code for detecting environment or platform state (eg. which OS we're running on, whether we're using ASAN or Valgrind) can be centralized in the test controller class instead of being replicated within each test file that requires checks. - Uses a custom rule format that's parsed inside library test code (the TestController class), so no additional dependencies (eg. JSON or XML parser) are required. - Defining an environment variable (export HIPCUB_EXTRA_TC_INFO=1) will cause the test controller to print all of test and sizes it's skipping, and the line number of the line in the control file that caused them to be skipped. This can be useful when you need to test out a new rule, or if you need to locate which control file lines affect a particular test. There are also a few potential disadvantages: - When writing a rule, the regexes that are used for matching test names require a little extra effort to construct, since you have to escape special characters and keep in mind that whitespace matters. - The text file rule format is not hierarchically organized like JSON objects. Rules are just a list of items delimited by a colon. The text parsing system is complex, and requires additional unit tests of it's own (see test_hipcub_test_controller.cpp) to validate. I've done my best to mitigate this by adding lots of documentation. - Since GTEST_SKIP only skips the currently running function, we have to wrap the size filter calls in macros. Adding macros to our unit tests (even if it's only one per test) is ugly, and easy to forget to do. I can add some documentation about how to write unit tests that adhere to the test controller to try to help reduce the changes of forgetting. - While this change is able to replace a number of individual GTEST_SKIP calls in existing unit tests with rules in the central text file, some GTEST_SKIP calls depend on values the condition checks on variables that are local to the test function (eg. whether the bock size or wavefront size exceeds some limit). Currently, these conditions cannot be captured inside rules. If it's deemed useful, this could probably be done in a follow-up change.

Added documentation, fixed bug with nvidia build type - changed it an arch_regex keyword instead. Removed unnecessary code.

codecov-commenter · 2026-05-13T01:32:07Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

❌ Your project status has failed because the head coverage (69.24%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #7241      +/-   ##
===========================================
+ Coverage    62.78%   63.06%   +0.28%     
===========================================
  Files          972     1014      +42     
  Lines       185366   187970    +2604     
  Branches     21999    22100     +101     
===========================================
+ Hits        116376   118529    +2153     
- Misses       61008    61408     +400     
- Partials      7982     8033      +51

Flag	Coverage Δ	*Carryforward flag
hipBLAS	`90.65% <ø> (ø)`	Carriedforward from d5216ae
hipBLASLt	`39.85% <ø> (ø)`	Carriedforward from d5216ae
hipCUB	`82.68% <ø> (?)`
hipDNN	`85.56% <ø> (ø)`	Carriedforward from d5216ae
hipFFT	`49.22% <ø> (ø)`	Carriedforward from d5216ae
hipSOLVER	`69.24% <ø> (ø)`	Carriedforward from d5216ae
rocBLAS	`48.11% <ø> (ø)`	Carriedforward from d5216ae
rocFFT	`47.60% <ø> (ø)`	Carriedforward from d5216ae
rocRAND	`57.02% <ø> (ø)`	Carriedforward from d5216ae

*This pull request uses carry forward flags. Click here to find out more.
see 42 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Simplify the control-file-finding logic so that it searches relative to the path to the currently running binary (instead of using some fixed pathes inserted by CMake). Fix up documentation errors, minor bug fixes.

amd-jmahovsky

Some comments and observations.

amd-jmahovsky · 2026-05-11T21:12:21Z

+    }
+
+    // Each entry represents the parsed information from one line of the control file.
+    std::vector<ControlInfo> control_data;


control_info for the var name? When reading this code I wasn't sure if ControlInfo and control_data were the same thing.

amd-jmahovsky · 2026-05-11T21:17:20Z

+// - Translates architecture keywords used in the control file intor
+//   regexes that are matched against the gfx id of the device currently
+//   in use. Please see the control file comments for more on these keywords.
+class TestController


One general comment: This class does a lot of different things. It might make for clearer code if the ControlInfo/data parsing from the file was split out, then the TestController doesn't have the burden of handling all that parsing/token/etc. This is assuming I am understanding correctly that the filtering only needs the parsed ControlInfo/data and nothing else in the control file.

amd-jmahovsky · 2026-05-11T21:18:15Z

+#endif // defined(HIPCUB_ROCPRIM_API)
+
+// -- Macros to use in unit tests --
+// Use macros here, even though they're ugle, since it's the only way to call


amd-jmahovsky · 2026-05-20T21:05:11Z

+/HipcubDeviceRadixSort\/.*/                : /<apus>/                              : >= (1 << 32)    : *              : "Skipping sizes over 2^32 on APUs"
+/HipcubDeviceRadixSortLargeInput.*/        : /<apus>/                              : >= (1 << 32)    : *              : "Skipping sizes over 2^32 on APUs"
+/HipcubDeviceMergeSort\/.*/                : /<apus>/                              : >= (1 << 32)    : *              : "Skipping sizes over 2^32 on APUs"
+/HipcubIteratorTests\/\d+\.TestTex.+/      : /<mi300-family>|navi4x-family|<apus>/ : *               : *              : "Test not run on gfx94x, gfx95x, or gfx120x as texture cache API is not supported."


Should navi4x-family be in angle brackets like mi300-family?

amd-jmahovsky · 2026-05-20T21:08:46Z

+#else
+        // Linux does not add a '\0' at the end of the path, so we must do that manually.
+        // If truncate if the path exceeds max_len chars.
+        int bytes_read = readlink("/proc/self/exe", path_buf, max_len);


The readlink() system call can return -1 if it fails. That doesn't seem to be handled. Could it fail in this scenario and does that need to be handled?

amd-jmahovsky · 2026-05-20T21:13:51Z

+
+        // The name may contain extra bits we don't need - eg. the xnack portion of "gfx942:xnack+".
+        const auto length = sizeof(hipDeviceProp_t::gcnArchName);
+        char* arch_end = std::find_if(arch_name,


If this doesn't find anything then arch_end will point one past the end of the string and then we will write a 0 out of bounds.

amd-jmahovsky · 2026-05-20T21:15:43Z

+        for (size_t i = 0; !sizes.empty() && i < this->control_data.size(); i++)
+        {
+            // Grab the filter information that was parsed from the current line.
+            ControlInfo info = this->control_data[i];


Could this be a const reference to avoid some copying?

amd-jmahovsky · 2026-05-20T21:24:27Z

+                    info.build_type_test_fns.push_back(it->second);
+            }
+
+            build_type_part = match_result.suffix();


This doesn't seem to be used after being set...?

amd-jmahovsky · 2026-05-20T21:27:57Z

+                }
+            }
+
+            size_part = match_result.suffix();


does match_result change in this loop? i.e. can we just set size_part once?

amd-jmahovsky · 2026-05-20T21:31:20Z

+        // Parsed from size filters. These are unary functions that accept a size and return true if that size should be skipped.
+        std::vector<std::function<bool(size_t)>> size_test_fns;
+        // Set to true if the user has specified a '*' to skip all test sizes.
+        bool disable_all_sizes;


should some of these members be default initialized? i.e. disable_all_sizes = false etc.

(might just be good defensive coding practice..)

umfranzw requested a review from a team as a code owner May 9, 2026 04:16

github-actions Bot added the project: hipcub label May 9, 2026

umfranzw marked this pull request as draft May 9, 2026 04:17

assistant-librarian Bot added the organization: ROCm label May 9, 2026

umfranzw force-pushed the users/umfranzw/test_control branch 3 times, most recently from ff75615 to efd2efa Compare May 11, 2026 13:15