[ROCm][CI] Refine gating tests#37243
[ROCm][CI] Refine gating tests#37243AndreasKaratzas wants to merge 9 commits intovllm-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request refines the gating tests for ROCm CI by updating mirrored tests on AMD devices. The changes largely align with the goals stated in the description, such as adding new mirrors and removing obsolete ones. However, I have identified two areas for improvement. First, there is a discrepancy in misc.yaml where a mirror is added to a test whose label does not match the one in the pull request description, which could affect test coverage correctness. Second, a new mirrored test in models_language.yaml introduces duplicated commands, creating a future maintainability issue. Addressing these points will enhance the accuracy and long-term stability of the CI configuration.
| mirror: | ||
| amd: | ||
| device: mi250_1 | ||
| depends_on: | ||
| - image-build-amd |
There was a problem hiding this comment.
There is a discrepancy between the PR description and this change. The description specifies adding a mirror for the Examples test in misc.yaml with device mi250_1. However, this change adds the mirror to the Speculative Decoding (1 GPU) test instead. This could lead to incorrect test coverage for the mirroring process. Please ensure the change aligns with the intended test plan to guarantee correct test gating.
| commands: | ||
| - export TORCH_NCCL_BLOCKING_WAIT=1 | ||
| # NOTE: The rest is in complete parity with CUDA tests | ||
| - pip freeze | grep -E 'torch' | ||
| - pytest -v -s models/language -m 'core_model and slow_test' --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --shard-id=$$BUILDKITE_PARALLEL_JOB |
There was a problem hiding this comment.
The commands for this mirrored step are duplicated from the main test step (lines 32-33), with only the addition of an export command. This creates a maintainability issue, as any changes to the pytest command will need to be updated in two places. The comment # NOTE: The rest is in complete parity with CUDA tests highlights this fragility.
To avoid this duplication, you could use YAML anchors. This would involve defining the common commands with an anchor and then referencing it in both the main commands section and here. For example, you could chain the commands into a single string with && to allow prepending the export command in the mirror configuration.
|
This pull request has merge conflicts that must be resolved before it can be |
|
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
bf94e35 to
3f0b48e
Compare
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
|
Related but not blocking PR: |
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
|
Removed |
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
This PR refines AMD mirror gating amidst new MI300 nodes.
Mirror removals
Mirror blocks removed from the source test files to smoothly transition again to gating signal:
Mirrors kept
New
mi300_1(ormi250_1where noted) mirrors:Device load
cc @kenroche