[None][fix] fix tinygemm barrier bug#15338
Conversation
📝 WalkthroughWalkthroughThis PR refines synchronization logic in TinyGEMM2's compute kernel. The wait loop that blocks on weight and activation readiness now dynamically recalculates barrier pointers per loop iteration based on the current stage, ensuring correct synchronization across stage transitions rather than relying on stale cached pointers. ChangesBarrier Synchronization Fix
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_kernel.cuh (1)
341-345: 💤 Low valueOptional: Remove now-partially-redundant cached barrier pointers for consistency.
After the fix,
bar_ptr_wtandbar_ptr_actare only used for the initial try_wait before the loop. For consistency with the inline computation now used inside the wait loop, consider removing these cached variables and inlining the computation here as well.♻️ Proposed refactor
- uint32_t bar_ptr_wt = __cvta_generic_to_shared(&bar_wt_ready[stage]); - uint32_t bar_ptr_act = __cvta_generic_to_shared(&bar_act_ready[stage]); - - bool weight_ready = bar_try_wait(bar_ptr_wt, phase); - bool act_ready = bar_try_wait(bar_ptr_act, phase); + bool weight_ready = bar_try_wait(__cvta_generic_to_shared(&bar_wt_ready[stage]), phase); + bool act_ready = bar_try_wait(__cvta_generic_to_shared(&bar_act_ready[stage]), phase);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_kernel.cuh` around lines 341 - 345, Remove the now-redundant cached barrier pointer variables bar_ptr_wt and bar_ptr_act: instead of computing them once and passing to bar_try_wait, call bar_try_wait(__cvta_generic_to_shared(&bar_wt_ready[stage]), phase) and bar_try_wait(__cvta_generic_to_shared(&bar_act_ready[stage]), phase) inline; delete the declarations of uint32_t bar_ptr_wt and bar_ptr_act and update any uses (the initial try_wait checks) to use the inlined __cvta_generic_to_shared(...) expressions so the code matches the loop’s inline computation style.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_kernel.cuh`:
- Around line 341-345: Remove the now-redundant cached barrier pointer variables
bar_ptr_wt and bar_ptr_act: instead of computing them once and passing to
bar_try_wait, call bar_try_wait(__cvta_generic_to_shared(&bar_wt_ready[stage]),
phase) and bar_try_wait(__cvta_generic_to_shared(&bar_act_ready[stage]), phase)
inline; delete the declarations of uint32_t bar_ptr_wt and bar_ptr_act and
update any uses (the initial try_wait checks) to use the inlined
__cvta_generic_to_shared(...) expressions so the code matches the loop’s inline
computation style.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 620f19cb-424b-4c56-a915-f528c4190912
📒 Files selected for processing (1)
cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_kernel.cuh
|
/bot run |
|
PR_Github #54059 [ run ] triggered by Bot. Commit: |
|
PR_Github #54059 [ run ] completed with state
|
|
/bot run |
|
PR_Github #54096 [ run ] triggered by Bot. Commit: |
|
PR_Github #54096 [ run ] completed with state
|
|
/bot run |
|
PR_Github #54100 [ run ] triggered by Bot. Commit: |
|
PR_Github #54100 [ run ] completed with state
|
|
/bot run |
|
PR_Github #54109 [ run ] triggered by Bot. Commit: |
|
PR_Github #54109 [ run ] completed with state
|
|
/bot run |
|
PR_Github #54187 [ run ] triggered by Bot. Commit: |
|
PR_Github #54187 [ run ] completed with state
|
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
5293dba to
ffd72f7
Compare
|
/bot run |
|
PR_Github #54199 [ run ] triggered by Bot. Commit: |
Summary by CodeRabbit
Description
How is this bug?
Root cause
bar_ptr_wt / bar_ptr_actare computed ONCE before the loop for the initial stage and never updated, while thestageadvances by 4 everyki.ki==0, so try_wait returns ready immediately and the consumer proceeds to ldmatrix a stage the TMA has not filled yet -> corrupted output (or a pipeline hang).Why don't bugs appear when running it exclusively?
Also update the integration in flashinfer: flashinfer-ai/flashinfer#3630
Thanks to @LorrinWWW for reporting this bug and for providing a very detailed script to reproduce it.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.