[hipCUB] Centralized system for disabling tests#7241

Open
umfranzw wants to merge 6 commits into develop from users/umfranzw/test_control

@umfranzw umfranzw commented May 9, 2026

Motivation

We often need to temporarily disable tests until some root cause (potentially in another component) is tracked down and fixed. This results in a lot of disablement checks scattered throughout the test files. It's easy to forget to re-enable a test when something's fixed, and there can be code duplication in the checks that different files perform. In addition, it takes some time and effort to figure out which tests are disabled and why (you have to open and examine each individual test file and locate the checks).

It would be nice to have a single, centralized file that we can disable tests from. The system that reads this file could also encapsulate the checks we need to perform within a single class.

Technical Details

This change adds a centralized mechanism for disabling tests in hipCUB. Tests can be disabled by adding rules to the text file located at hipcub/test/control.txt.

These rules allow you to:

  • Disable tests or whole test suites by name using a regex.
  • Disable tests only for particular architectures (using an architecture regex). The architecture regex supports "keywords": sets of commonly associated architectures (eg. all APU architectures, or all navi3x devices, or nvidia devices). These keywords provide a shortcut that makes it easy to disable tests for whole families of GPUs at once.
  • Disable individual input sizes, or ranges using relational operators (eg. disable all test sizes >100). You can also use arithmetic operators when specifying the size (eg. (1 << 32 + 1)). This can be handy because test sizes are often specified as offsets from powers of 2 in the code. So in these cases, you can just directly copy and paste them into the rule.
  • Disable tests only when they're running in a particular environment (OS or build configuration option). For example, you can disable tests when using ASAN or Valgrind, or when running on Windows or Linux.
  • Print a message after disabling a test to let the user know why it's disabled.
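
As a rough illustration of the rule format, here is a minimal sketch of splitting one rule line into its colon-delimited fields. It is not the PR's parser: `trim` and `split_rule` are hypothetical helpers, and the field order (test regex, architecture regex, size filter, environment, message) is inferred from the rules quoted in the review comments further down.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Strips leading and trailing spaces/tabs.
std::string trim(const std::string& s)
{
    const auto first = s.find_first_not_of(" \t");
    if (first == std::string::npos)
        return std::string();
    const auto last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Hypothetical helper (not the PR's parser): splits one rule line into fields.
// A field that starts with '/' or '"' runs to the matching closing character,
// so colons inside a regex or a message are not treated as delimiters.
std::vector<std::string> split_rule(const std::string& line)
{
    std::vector<std::string> fields;
    std::string current;
    char closing = '\0'; // non-zero while inside a /regex/ or "message"
    for (std::size_t i = 0; i < line.size(); ++i)
    {
        const char c = line[i];
        if (closing != '\0')
        {
            current += c;
            if (c == '\\' && i + 1 < line.size())
                current += line[++i]; // keep an escaped character such as "\/"
            else if (c == closing)
                closing = '\0';
        }
        else if ((c == '/' || c == '"') && trim(current).empty())
        {
            closing = c;
            current += c;
        }
        else if (c == ':')
        {
            fields.push_back(trim(current));
            current.clear();
        }
        else
        {
            current += c;
        }
    }
    assert(closing == '\0' && "unterminated regex or message");
    fields.push_back(trim(current));
    return fields;
}
```

Tracking the opening delimiter only at the start of a field means a `/` used as a division operator inside a size expression would not be mistaken for the start of a regex.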

This system brings a few advantages:

  • As mentioned above, using a centralized system reduces the chances of forgetting to re-enable, since everything's in one place, and provides a way to quickly look up which tests are currently disabled and why.
  • Keeping the rules in a text file allows us to enable/disable tests without needing to rebuild, since the text file is read at runtime.
  • Minimal code changes are needed in order to hook up existing tests to the centralized system. Modifying tests so that their test fixture class inherits from test_controller::ControlledTest causes a test disablement check to happen automatically before the test is run. For most tests, filtering individual sizes requires adding a single macro call.
  • Code for detecting environment or platform state (eg. which OS we're running on, whether we're using ASAN or Valgrind) can be centralized in the test controller class instead of being replicated within each test file that requires checks.
  • Uses a custom rule format that's parsed inside the test controller class, so no additional dependencies (eg. JSON or XML parsers) are required.
  • Defining an environment variable (export HIPCUB_EXTRA_TC_INFO=1) will cause the test controller to print out all of the tests and sizes it's skipping, and the line number of the line in the control file that caused them to be skipped. This can be useful when you need to test out a new rule, or if you need to locate which control file lines affect a particular test.
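
For readers unfamiliar with this kind of switch, the following is a minimal sketch of how an environment variable such as HIPCUB_EXTRA_TC_INFO could gate the extra output. The function names and the message layout are hypothetical, and treating only the value "1" as enabled is an assumption; the controller may simply check whether the variable is defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <ostream>
#include <string>

// Assumption: only the exact value "1" turns the extra output on.
bool extra_info_enabled()
{
    const char* value = std::getenv("HIPCUB_EXTRA_TC_INFO");
    return value != nullptr && std::string(value) == "1";
}

// Hypothetical reporting helper: prints which test/size was skipped and the
// control file line responsible, but only when the variable is set.
void report_skip(std::ostream&      out,
                 const std::string& test_name,
                 std::size_t        size,
                 int                control_line)
{
    assert(control_line > 0); // lines are counted from 1 in this sketch
    if (extra_info_enabled())
    {
        out << "Skipping " << test_name << " (size " << size
            << "), control file line " << control_line << '\n';
    }
}
```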

There are also a few potential disadvantages:

  • When writing a rule, the regexes that are used for matching test names require a little extra effort to construct, since you have to escape special characters and keep in mind that whitespace matters.
  • The text file rule format is not hierarchically organized like JSON objects. Rules are just a list of items delimited by a colon. The text parsing code is complex, and requires additional unit tests of its own (see test_hipcub_test_controller.cpp) to validate. I've done my best to mitigate this by adding lots of documentation.
  • Since GTEST_SKIP only skips the currently running function, we have to wrap the size filter calls in macros. Adding macros to our unit tests (even if it's only one per test) is ugly, and easy to forget to do. I've added some documentation about how to write unit tests that adhere to the test controller to try to help.
  • While this change is able to replace a number of individual GTEST_SKIP calls in existing unit tests with rules in the central text file, some GTEST_SKIP calls depend on condition checks that use variables that are local to the test function (eg. whether the block size or wavefront size exceeds some limit). Currently, these conditions cannot be captured inside rules. If it's useful, perhaps this is something we could think about in a follow-up change.
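
The macro requirement follows from GTEST_SKIP expanding to a `return` statement, which only leaves the function it is written in. The sketch below reproduces that effect without GoogleTest; every name in it (`size_is_disabled`, `skip_if_disabled`, `SKIP_IF_DISABLED`, the two stand-in test bodies) is hypothetical and is not the macro added by this PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the controller's size check.
bool size_is_disabled(std::size_t size)
{
    return size > 100;
}

// Function version: the `return` only leaves this helper, not the test body.
void skip_if_disabled(std::size_t size)
{
    if (size_is_disabled(size))
        return;
}

// Macro version: the `return` is pasted into the caller, so it leaves the caller.
#define SKIP_IF_DISABLED(size)      \
    do                              \
    {                               \
        if (size_is_disabled(size)) \
            return;                 \
    } while (false)

// Stand-ins for test bodies; `ran` records whether the body was reached.
void body_using_function(std::size_t size, bool& ran)
{
    skip_if_disabled(size);
    ran = true;
}

void body_using_macro(std::size_t size, bool& ran)
{
    SKIP_IF_DISABLED(size);
    ran = true;
}

// Usage: with a disabled size, only the macro version skips the body.
void demo()
{
    bool ran = false;
    body_using_macro(200, ran);
    assert(!ran);
    body_using_function(200, ran);
    assert(ran);
}
```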

Test Plan

Build and run all unit tests. Ensure that tests that used to be skipped are still correctly skipped (under the right conditions).
Build and run the test controller's own internal tests (the test_hipcub_test_controller target).

Test Result

Tests are skipped correctly locally. TheRock CI change here (ROCm/TheRock#5504) allows the test control file through into the final installation directory.

Submission Checklist

@umfranzw umfranzw requested a review from a team as a code owner May 9, 2026 04:16
@umfranzw umfranzw marked this pull request as draft May 9, 2026 04:17
@umfranzw umfranzw force-pushed the users/umfranzw/test_control branch 3 times, most recently from ff75615 to efd2efa Compare May 11, 2026 13:15
Comment thread projects/hipcub/test/control.txt Outdated
Comment thread projects/hipcub/test/control.txt Outdated
Comment thread projects/hipcub/test/control.txt Outdated
Comment thread projects/hipcub/test/hipcub/test_utils_data_generation.hpp Outdated
Comment thread projects/hipcub/test/hipcub/test_utils_controller.hpp.in Outdated
Comment thread projects/hipcub/test/hipcub/test_utils_controller.hpp.in Outdated
umfranzw added 3 commits May 12, 2026 14:04
This change adds a centralized mechanism for disabling tests in hipCUB.
Tests can be disabled by adding rules to the text file hipcub/test/control.txt.

These rules allow you to:
- Disable tests or whole test suites by name using a regex
- Disable tests only for particular architectures (using an architecture regex).
  The architecture regex supports "keywords": sets of commonly associated
  architectures (eg. all APU architectures, or all navi3x devices). These keywords
  provide a shortcut that makes it easy to disable tests for whole families of GPUs at once.
- Disable only individual input sizes. You can specify ranges using relational operators
  (eg. disable all test sizes >100). You can also use arithmetic operators when specifying
  the size (eg. (1 << 32 + 1)). This can be handy because test sizes are often specified
  as offsets from powers of 2 in the code. So in these cases, you can just directly
  copy and paste them into the rule.
- Disable tests only when they're running in a particular environment (OS or build configuration
  option). For example, you can disable tests when using ASAN or Valgrind, when running on
  Windows or Linux, or when running on an Nvidia platform.
- Print a message after disabling a test to let the user know why it's disabled.

This system brings a few advantages:
- We often encounter situations where we need to temporarily disable a test. It is easy
  to forget to remove the code that disables the test if it is scattered throughout the
  files in the test directory. Using a centralized system reduces the chances of this,
  since everything's in one place, and provides a way to quickly look up which tests are
  currently disabled and why.
- Keeping the rules in a text file allows us to enable/disable tests without needing
  to rebuild, since the text file is read at runtime.
- Minimal code changes are needed in order to hook up existing tests to the centralized system.
  Modifying tests so that their test fixture class inherits from test_controller::ControlledTest
  (as described below) causes a test disablement check to happen automatically before the test
  is run. For most tests, filtering individual sizes requires adding a single macro call.
- Code for detecting environment or platform state (eg. which OS we're running on, whether
  we're using ASAN or Valgrind) can be centralized in the test controller class instead of
  being replicated within each test file that requires checks.
- Uses a custom rule format that's parsed inside library test code (the TestController class),
  so no additional dependencies (eg. JSON or XML parser) are required.
- Defining an environment variable (export HIPCUB_EXTRA_TC_INFO=1) will cause the test controller
  to print all of the tests and sizes it's skipping, and the line number of the line in the control file
  that caused them to be skipped. This can be useful when you need to test out a new rule, or if
  you need to locate which control file lines affect a particular test.

There are also a few potential disadvantages:
- When writing a rule, the regexes that are used for matching test names require a little extra
  effort to construct, since you have to escape special characters and keep in mind that whitespace
  matters.
- The text file rule format is not hierarchically organized like JSON objects. Rules are just a
  list of items delimited by a colon. The text parsing system is complex, and requires additional
  unit tests of its own (see test_hipcub_test_controller.cpp) to validate. I've done my best to
  mitigate this by adding lots of documentation.
- Since GTEST_SKIP only skips the currently running function, we have to wrap the size filter
  calls in macros. Adding macros to our unit tests (even if it's only one per test) is ugly, and
  easy to forget to do. I can add some documentation about how to write unit tests that adhere to
  the test controller to try to help reduce the chances of forgetting.
- While this change is able to replace a number of individual GTEST_SKIP calls in existing unit tests
  with rules in the central text file, some GTEST_SKIP calls depend on condition checks that use
  variables that are local to the test function (eg. whether the block size or wavefront size exceeds some
  limit). Currently, these conditions cannot be captured inside rules. If it's deemed useful, this could
  probably be done in a follow-up change.
Added documentation, fixed bug with nvidia build type - changed it to
an arch_regex keyword instead. Removed unnecessary code.
@umfranzw umfranzw force-pushed the users/umfranzw/test_control branch from 8599a55 to 200a1c7 Compare May 13, 2026 00:43
codecov-commenter commented May 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

❌ Your project status has failed because the head coverage (69.24%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #7241      +/-   ##
===========================================
+ Coverage    62.78%   63.06%   +0.28%     
===========================================
  Files          972     1014      +42     
  Lines       185366   187970    +2604     
  Branches     21999    22100     +101     
===========================================
+ Hits        116376   118529    +2153     
- Misses       61008    61408     +400     
- Partials      7982     8033      +51     
| Flag | Coverage Δ | *Carryforward flag |
|---|---|---|
| hipBLAS | 90.65% <ø> (ø) | Carriedforward from d5216ae |
| hipBLASLt | 39.85% <ø> (ø) | Carriedforward from d5216ae |
| hipCUB | 82.68% <ø> (?) | |
| hipDNN | 85.56% <ø> (ø) | Carriedforward from d5216ae |
| hipFFT | 49.22% <ø> (ø) | Carriedforward from d5216ae |
| hipSOLVER | 69.24% <ø> (ø) | Carriedforward from d5216ae |
| rocBLAS | 48.11% <ø> (ø) | Carriedforward from d5216ae |
| rocFFT | 47.60% <ø> (ø) | Carriedforward from d5216ae |
| rocRAND | 57.02% <ø> (ø) | Carriedforward from d5216ae |

*This pull request uses carry forward flags. Click here to find out more.
see 42 files with indirect coverage changes

Simplify the control-file-finding logic so that it searches
relative to the path to the currently running binary (instead
of using some fixed paths inserted by CMake).

Fix up documentation errors, minor bug fixes.
@umfranzw umfranzw force-pushed the users/umfranzw/test_control branch 11 times, most recently from 9a0a9e6 to 9255f20 Compare May 28, 2026 13:52
@umfranzw umfranzw force-pushed the users/umfranzw/test_control branch from 9255f20 to d8fbba5 Compare May 29, 2026 00:33
@umfranzw umfranzw force-pushed the users/umfranzw/test_control branch from d8fbba5 to d5216ae Compare May 29, 2026 14:08
@umfranzw umfranzw marked this pull request as ready for review May 29, 2026 14:14
@amd-jmahovsky amd-jmahovsky left a comment

Some comments and observations.

}

// Each entry represents the parsed information from one line of the control file.
std::vector<ControlInfo> control_data;

control_info for the var name? When reading this code I wasn't sure if ControlInfo and control_data were the same thing.

// - Translates architecture keywords used in the control file intor
// regexes that are matched against the gfx id of the device currently
// in use. Please see the control file comments for more on these keywords.
class TestController

One general comment: This class does a lot of different things. It might make for clearer code if the ControlInfo/data parsing from the file was split out, then the TestController doesn't have the burden of handling all that parsing/token/etc. This is assuming I am understanding correctly that the filtering only needs the parsed ControlInfo/data and nothing else in the control file.

#endif // defined(HIPCUB_ROCPRIM_API)

// -- Macros to use in unit tests --
// Use macros here, even though they're ugle, since it's the only way to call

ugle->ugly

/HipcubDeviceRadixSort\/.*/ : /<apus>/ : >= (1 << 32) : * : "Skipping sizes over 2^32 on APUs"
/HipcubDeviceRadixSortLargeInput.*/ : /<apus>/ : >= (1 << 32) : * : "Skipping sizes over 2^32 on APUs"
/HipcubDeviceMergeSort\/.*/ : /<apus>/ : >= (1 << 32) : * : "Skipping sizes over 2^32 on APUs"
/HipcubIteratorTests\/\d+\.TestTex.+/ : /<mi300-family>|navi4x-family|<apus>/ : * : * : "Test not run on gfx94x, gfx95x, or gfx120x as texture cache API is not supported."

Should navi4x-family be in angle brackets like mi300-family?
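
To illustrate why the brackets could matter, here is a minimal sketch that assumes keywords are only expanded when written as `<keyword>`. The keyword table, the gfx patterns in it, the function names, and the use of `std::regex_search` are all hypothetical and are not taken from the test controller.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Hypothetical keyword table; the patterns are illustrative only.
const std::map<std::string, std::string>& keyword_table()
{
    static const std::map<std::string, std::string> table = {
        {"mi300-family", "gfx94\\d"},
        {"navi4x-family", "gfx120\\d"},
    };
    return table;
}

// Replaces each known <keyword> with a parenthesized regex.
// Unknown keywords and bare words are left untouched.
std::string expand_keywords(const std::string& arch_regex)
{
    static const std::regex keyword_re("<([a-z0-9-]+)>");
    std::string result;
    std::size_t last = 0;
    const std::sregex_iterator end;
    for (std::sregex_iterator it(arch_regex.begin(), arch_regex.end(), keyword_re); it != end; ++it)
    {
        const auto pos = static_cast<std::size_t>(it->position());
        result += arch_regex.substr(last, pos - last);
        const auto found = keyword_table().find((*it)[1].str());
        result += (found != keyword_table().end()) ? "(" + found->second + ")" : it->str();
        last = pos + static_cast<std::size_t>(it->length());
    }
    result += arch_regex.substr(last);
    return result;
}

// Matches the expanded regex against the gfx id of the device in use.
bool arch_matches(const std::string& arch_regex, const std::string& gfx_id)
{
    assert(!gfx_id.empty());
    return std::regex_search(gfx_id, std::regex(expand_keywords(arch_regex)));
}
```

Under these assumptions, a bare `navi4x-family` stays a literal pattern that no gfx id contains, so the rule would silently never apply to those devices.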

#else
// Linux does not add a '\0' at the end of the path, so we must do that manually.
// If truncate if the path exceeds max_len chars.
int bytes_read = readlink("/proc/self/exe", path_buf, max_len);

The readlink() system call can return -1 if it fails. That doesn't seem to be handled. Could it fail in this scenario and does that need to be handled?
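
One possible way to handle the failure case, as a minimal Linux-only sketch: `current_exe_path` and the buffer size are hypothetical, and returning an empty string on failure is just one option (the caller could also report the value of errno).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unistd.h> // readlink (POSIX)

// Returns the path of the running executable, or an empty string on failure.
std::string current_exe_path()
{
    constexpr std::size_t max_len = 4096; // hypothetical buffer size
    char path_buf[max_len];
    // readlink() does not append '\0' and silently truncates long paths.
    const ssize_t bytes_read = readlink("/proc/self/exe", path_buf, max_len - 1);
    if (bytes_read < 0)
        return std::string(); // readlink failed; errno holds the reason
    assert(static_cast<std::size_t>(bytes_read) < max_len);
    // Constructing from (pointer, length) needs no terminator in the buffer.
    return std::string(path_buf, static_cast<std::size_t>(bytes_read));
}
```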


// The name may contain extra bits we don't need - eg. the xnack portion of "gfx942:xnack+".
const auto length = sizeof(hipDeviceProp_t::gcnArchName);
char* arch_end = std::find_if(arch_name,

If this doesn't find anything then arch_end will point one past the end of the string and then we will write a 0 out of bounds.
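
A minimal sketch of one way to avoid the out-of-bounds write: copy the prefix into a `std::string` instead of writing a terminator into the buffer. `base_arch_name` is a hypothetical helper, and the predicate (stop at ':' or '\0') is an assumption, since the original `std::find_if` call is cut off in the snippet above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Returns the architecture name up to (not including) the first ':' or '\0'.
// `length` is the buffer size, e.g. sizeof(hipDeviceProp_t::gcnArchName).
std::string base_arch_name(const char* arch_name, std::size_t length)
{
    assert(arch_name != nullptr);
    const char* buffer_end = arch_name + length;
    const char* arch_end   = std::find_if(arch_name,
                                        buffer_end,
                                        [](char c) { return c == ':' || c == '\0'; });
    // If nothing is found, arch_end == buffer_end and the whole buffer is
    // copied; nothing is ever written through arch_end.
    return std::string(arch_name, arch_end);
}
```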

for (size_t i = 0; !sizes.empty() && i < this->control_data.size(); i++)
{
// Grab the filter information that was parsed from the current line.
ControlInfo info = this->control_data[i];

Could this be a const reference to avoid some copying?

info.build_type_test_fns.push_back(it->second);
}

build_type_part = match_result.suffix();

This doesn't seem to be used after being set...?

}
}

size_part = match_result.suffix();

does match_result change in this loop? i.e. can we just set size_part once?

// Parsed from size filters. These are unary functions that accept a size and return true if that size should be skipped.
std::vector<std::function<bool(size_t)>> size_test_fns;
// Set to true if the user has specified a '*' to skip all test sizes.
bool disable_all_sizes;

should some of these members be default initialized? i.e. disable_all_sizes = false etc.

(might just be good defensive coding practice..)
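
A minimal sketch of what that could look like, using only the two members visible in the snippet above (the real ControlInfo has more fields). `should_skip_size` is a hypothetical helper added to show how the flag would be read.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

struct ControlInfo
{
    // Unary functions that return true if a size should be skipped.
    std::vector<std::function<bool(std::size_t)>> size_test_fns;
    // Default member initializer: without it, `ControlInfo info;` would
    // leave this bool with an indeterminate value.
    bool disable_all_sizes = false;
};

// Hypothetical helper: true if any filter in `info` rejects `size`.
bool should_skip_size(const ControlInfo& info, std::size_t size)
{
    if (info.disable_all_sizes)
        return true;
    for (const auto& fn : info.size_test_fns)
    {
        assert(fn); // an empty std::function would throw when called
        if (fn(size))
            return true;
    }
    return false;
}
```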
