[ROCm][CI] Refine gating tests #37243

Open
AndreasKaratzas wants to merge 9 commits into vllm-project:main from ROCm:akaratza_gating_heterogeneous

@AndreasKaratzas AndreasKaratzas commented Mar 17, 2026

This PR refines AMD mirror gating now that new MI300 nodes are available.

Mirror removals

Mirror blocks removed from the source test files so that these tests transition smoothly back to a gating signal:

| File | Test Label | Previous AMD Device |
| --- | --- | --- |
| engine.yaml | V1 e2e (2 GPUs) | mi325_2 |
| engine.yaml | V1 e2e (4 GPUs) | mi325_4 |
| misc.yaml | V1 Spec Decode | mi325_1 |
| models_basic.yaml | Basic Models Tests (Other) | mi325_1 |
| models_language.yaml | Language Models Test (Extended Generation) | mi325_1 |
| models_multimodal.yaml | Multi-Modal Models (Standard) 4: other + whisper | mi325_1 |

Mirrors kept

New mi300_1 (or mi250_1 where noted) mirrors:

| File | Test Label | From -> To |
| --- | --- | --- |
| entrypoints.yaml | Entrypoints Integration (API Server openai - Part 1) | mi325_1 -> mi300_1 |
| entrypoints.yaml | Entrypoints Integration (API Server openai - Part 2) | mi325_1 -> mi300_1 |
| misc.yaml | V1 Sample + Logits | mi325_1 -> mi300_1 |
| misc.yaml | V1 Core + KV + Metrics | mi325_1 -> mi300_1 |
| models_language.yaml | Language Models Test (Extended Pooling) | mi325_1 -> mi300_1 |
| models_multimodal.yaml | Multi-Modal Models (Standard) 1: qwen2 | mi325_1 -> mi300_1 |
| models_multimodal.yaml | Multi-Modal Models (Standard) 3: llava + qwen2_vl | mi325_1 -> mi300_1 |
| models_multimodal.yaml | Multi-Modal Models (Extended Generation 1) | mi325_1 -> mi300_1 |
| samplers.yaml | Samplers Test | mi325_1 -> mi250_1 |
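
For illustration, a retargeted mirror block might look like the sketch below. The key layout (`mirror` / `amd` / `device` / `depends_on`) follows the snippet quoted in the review thread on this PR. Pairing `image-build-amd` with `mi300_1` is an assumption, since that snippet shows it only alongside `mi250_1`.

```yaml
# Sketch only: the surrounding step definition is omitted.
mirror:
  amd:
    device: mi300_1        # previously mi325_1
    depends_on:
      - image-build-amd    # assumed to be unchanged by the retarget
```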

Device load

| Device | Mirrors added |
| --- | --- |
| mi300_1 | 8 |
| mi250_1 | 1 |

cc @kenroche

@mergify mergify Bot added the ci/build and rocm labels Mar 17, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Mar 17, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refines the gating tests for ROCm CI by updating mirrored tests on AMD devices. The changes largely align with the goals stated in the description, such as adding new mirrors and removing obsolete ones. However, I have identified two areas for improvement. First, there is a discrepancy in misc.yaml where a mirror is added to a test whose label does not match the one in the pull request description, which could affect test coverage correctness. Second, a new mirrored test in models_language.yaml introduces duplicated commands, creating a future maintainability issue. Addressing these points will enhance the accuracy and long-term stability of the CI configuration.

Comment thread .buildkite/test_areas/misc.yaml Outdated
Comment on lines +91 to +95
mirror:
  amd:
    device: mi250_1
    depends_on:
      - image-build-amd

high

There is a discrepancy between the PR description and this change. The description specifies adding a mirror for the Examples test in misc.yaml with device mi250_1. However, this change adds the mirror to the Speculative Decoding (1 GPU) test instead. This could lead to incorrect test coverage for the mirroring process. Please ensure the change aligns with the intended test plan to guarantee correct test gating.

Comment on lines +40 to +44
commands:
- export TORCH_NCCL_BLOCKING_WAIT=1
# NOTE: The rest is in complete parity with CUDA tests
- pip freeze | grep -E 'torch'
- pytest -v -s models/language -m 'core_model and slow_test' --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT --shard-id=$$BUILDKITE_PARALLEL_JOB

high

The commands for this mirrored step are duplicated from the main test step (lines 32-33), with only the addition of an export command. This creates a maintainability issue, as any changes to the pytest command will need to be updated in two places. The comment # NOTE: The rest is in complete parity with CUDA tests highlights this fragility.

To avoid this duplication, you could use YAML anchors. This would involve defining the common commands with an anchor and then referencing it in both the main commands section and here. For example, you could chain the commands into a single string with && to allow prepending the export command in the mirror configuration.
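
A minimal sketch of that suggestion, assuming the pipeline generator resolves standard YAML anchors and aliases and accepts a `commands` list under `mirror.amd` (neither is confirmed in this thread). The step label, the anchor name `shared_test_cmd`, and the `device` value are hypothetical and used only for illustration.

```yaml
# Hypothetical step: label, anchor name, and device are illustrative only.
- label: Example Language Models Step
  commands:
    # Chain the shared commands into one folded string so a single anchor covers them.
    - &shared_test_cmd >-
      pip freeze | grep -E 'torch' &&
      pytest -v -s models/language -m 'core_model and slow_test'
      --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
      --shard-id=$$BUILDKITE_PARALLEL_JOB
  mirror:
    amd:
      device: mi300_1
      depends_on:
        - image-build-amd
      commands:
        # ROCm-only prefix, followed by the aliased shared command string.
        - export TORCH_NCCL_BLOCKING_WAIT=1
        - *shared_test_cmd
```

YAML has no native way to concatenate lists, which is why the shared commands are folded into a single scalar before being aliased. The trade-off is that the chained commands then run as one entry rather than as separate list items.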

mergify Bot commented Mar 19, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AndreasKaratzas.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify Bot commented Mar 20, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @AndreasKaratzas.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@AndreasKaratzas AndreasKaratzas force-pushed the akaratza_gating_heterogeneous branch from bf94e35 to 3f0b48e Compare April 20, 2026 23:00
@mergify mergify Bot removed the needs-rebase label Apr 20, 2026
@AndreasKaratzas AndreasKaratzas marked this pull request as ready for review April 20, 2026 23:04

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@AndreasKaratzas AndreasKaratzas added the ready label Apr 20, 2026

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@AndreasKaratzas

Removed Language Models Test (Extended Pooling) and Entrypoints Integration (API Server openai - Part 1) from the list of proposed tests for now.
