[None][fix] fix tinygemm barrier bug #15338

Open
yweng0828 wants to merge 1 commit into NVIDIA:main from yweng0828:yweng/fix_tinygemm_wrong_bar_try_wait
@yweng0828 yweng0828 commented Jun 13, 2026

Collaborator

Summary by CodeRabbit

  • Bug Fixes
    • Improved synchronization logic in tensor compute kernels to enhance correctness and stability of mathematical operations.

Description

How does the bug manifest?

  • When Tinygemm runs concurrently with UVA copies, the GEMM produces incorrect results.
  • Running Tinygemm on its own does not trigger the problem.
  • The bug is not triggered unless UVA copies or similar workloads are running at the same time.

Root cause

  • The consumer (compute warp) spin-wait polls a stale barrier pointer.
  • bar_ptr_wt / bar_ptr_act are computed ONCE before the loop for the initial stage and never updated, while the stage advances by 4 every ki. The wait should instead poll the CURRENT stage's ready barriers.
  • When the prefetch is missed (producer behind, e.g., under heavy concurrent SM->sysmem traffic), the spin-wait polls the wrong, initial-stage barrier. That barrier already completed at ki==0, so try_wait returns ready immediately and the consumer proceeds to ldmatrix a stage that the TMA has not filled yet -> corrupted output (or a pipeline hang).
  • Polling the live stage makes the wait check the barrier that actually gates this iteration's data.
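
The difference between the stale and the live poll can be sketched on the host. This is a minimal model, not the kernel code: `kNumStages`, `ReadyBarrier`, `model_try_wait`, `stale_poll`, and `live_poll` are hypothetical names, the ring size of 16 is an assumption (the description only states that the stage advances by 4 every ki), and the parity check is a simplified stand-in for a parity-style try_wait.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical sizes: the PR only states that the stage advances by 4 per ki.
constexpr int kNumStages = 16;
constexpr int kStageStep = 4;

// Host-side stand-in for one "ready" barrier: counts completed fills of its stage.
struct ReadyBarrier {
    uint32_t completed_phases = 0;
};

using Barriers = std::array<ReadyBarrier, kNumStages>;

// Simplified parity-style try_wait: true once the phase with parity `phase`
// has completed, i.e. the barrier's current parity differs from `phase`.
bool model_try_wait(const ReadyBarrier& bar, uint32_t phase) {
    return (bar.completed_phases & 1u) != (phase & 1u);
}

// Producer side: the load for `stage` has landed.
void producer_fill(Barriers& bars, int stage) {
    ++bars[stage].completed_phases;
}

// Buggy poll: the barrier was picked once for the initial stage and is reused
// for every ki, so it no longer matches the stage being read.
bool stale_poll(const Barriers& bars, int initial_stage, uint32_t phase) {
    return model_try_wait(bars[initial_stage], phase);
}

// Fixed poll: the barrier is re-derived from the stage this iteration reads.
bool live_poll(const Barriers& bars, int ki, uint32_t phase) {
    const int stage = (ki * kStageStep) % kNumStages;
    return model_try_wait(bars[stage], phase);
}
```

In this model, once only stage 0 has been filled, `stale_poll(bars, 0, 0)` already reports ready when the consumer is at ki = 1, while `live_poll(bars, 1, 0)` keeps reporting not ready until `producer_fill(bars, 4)` has run.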

Why doesn't the bug appear when Tinygemm runs exclusively?

  • It is triggered only when the producer lags behind and the prefetch misses. When Tinygemm runs exclusively, the producer is always ahead, so the bug never surfaces.
  • Pinned-host writeback creates the trigger condition by contending for the memory subsystem and slowing down the TMA producer.
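
The timing dependence can be illustrated with a toy host-side model. Everything here is an assumption made for illustration: `count_premature_reads` and `fill_time` are hypothetical, the consumer is taken to spend one time step per iteration, and only a single pass over the stages is modelled so the phase never flips.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical timing model: fill_time[ki] is the step at which the producer
// finishes the tile for iteration ki; the consumer spends one step per iteration.
// Returns how many iterations read a tile before the producer has filled it.
int count_premature_reads(const std::vector<int>& fill_time, bool poll_live_stage) {
    int premature = 0;
    int now = 0;
    for (std::size_t ki = 0; ki < fill_time.size(); ++ki) {
        const bool prefetch_hit = fill_time[ki] <= now;
        if (!prefetch_hit) {
            // Spin-wait path: only reached when the producer is behind.
            // The fixed wait blocks on this iteration's barrier; the buggy wait
            // blocks on iteration 0's barrier, which completed long ago.
            const int gate = poll_live_stage ? fill_time[ki] : fill_time[0];
            now = std::max(now, gate);
            if (fill_time[ki] > now) {
                ++premature;  // consumer proceeds to a tile that is not filled yet
            }
        }
        ++now;  // one step of compute for this iteration
    }
    return premature;
}
```

With a producer that stays ahead (fill times `{0, 0, 1, 2}`), the spin-wait path is never entered and both variants report 0 premature reads. With a lagging producer (fill times `{0, 3, 6, 9}`), the stale variant reports 3 premature reads and the live variant reports 0.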

Also update the integration in flashinfer: flashinfer-ai/flashinfer#3630

Thanks to @LorrinWWW for reporting this bug and for providing a very detailed script to reproduce it.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@coderabbitai

coderabbitai Bot commented Jun 13, 2026

Contributor

📝 Walkthrough

This PR refines synchronization logic in TinyGEMM2's compute kernel. The wait loop that blocks on weight and activation readiness now dynamically recalculates barrier pointers per loop iteration based on the current stage, ensuring correct synchronization across stage transitions rather than relying on stale cached pointers.

Changes

Barrier Synchronization Fix

| Layer / File(s) | Summary |
| --- | --- |
| Compute loop barrier pointer refresh (cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_kernel.cuh) | Wait loop bar_try_wait calls now compute shared-barrier addresses using the current stage value on each iteration via __cvta_generic_to_shared(&bar_*_ready[stage]) instead of reusing cached pointers, ensuring stage-specific barrier synchronization. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The PR description provides detailed explanation of the bug, root cause, and fix, but the Test Coverage section is empty and no checklist items are marked as completed. | Add test coverage information and mark completed checklist items. Explicitly state which tests safeguard the barrier synchronization fix. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly identifies the change as a bug fix to the tinygemm barrier logic, directly matching the code change described in the raw summary. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Contributor

🧹 Nitpick comments (1)
cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_kernel.cuh (1)

341-345: 💤 Low value

Optional: Remove now-partially-redundant cached barrier pointers for consistency.

After the fix, bar_ptr_wt and bar_ptr_act are only used for the initial try_wait before the loop. For consistency with the inline computation now used inside the wait loop, consider removing these cached variables and inlining the computation here as well.

♻️ Proposed refactor
-        uint32_t bar_ptr_wt = __cvta_generic_to_shared(&bar_wt_ready[stage]);
-        uint32_t bar_ptr_act = __cvta_generic_to_shared(&bar_act_ready[stage]);
-
-        bool weight_ready = bar_try_wait(bar_ptr_wt, phase);
-        bool act_ready = bar_try_wait(bar_ptr_act, phase);
+        bool weight_ready = bar_try_wait(__cvta_generic_to_shared(&bar_wt_ready[stage]), phase);
+        bool act_ready = bar_try_wait(__cvta_generic_to_shared(&bar_act_ready[stage]), phase);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_kernel.cuh` around lines 341 -
345, Remove the now-redundant cached barrier pointer variables bar_ptr_wt and
bar_ptr_act: instead of computing them once and passing to bar_try_wait, call
bar_try_wait(__cvta_generic_to_shared(&bar_wt_ready[stage]), phase) and
bar_try_wait(__cvta_generic_to_shared(&bar_act_ready[stage]), phase) inline;
delete the declarations of uint32_t bar_ptr_wt and bar_ptr_act and update any
uses (the initial try_wait checks) to use the inlined
__cvta_generic_to_shared(...) expressions so the code matches the loop’s inline
computation style.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 620f19cb-424b-4c56-a915-f528c4190912

📥 Commits

Reviewing files that changed from the base of the PR and between 4e1776a and 5293dba.

📒 Files selected for processing (1)
  • cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_kernel.cuh

@yweng0828

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #54059 [ run ] triggered by Bot. Commit: 5293dba Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54059 [ run ] completed with state SUCCESS. Commit: 5293dba
/LLM/main/L0_MergeRequest_PR pipeline #43143 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yweng0828

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #54096 [ run ] triggered by Bot. Commit: 5293dba Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54096 [ run ] completed with state FAILURE. Commit: 5293dba
/LLM/main/L0_MergeRequest_PR pipeline #43179 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yweng0828

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #54100 [ run ] triggered by Bot. Commit: 5293dba Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54100 [ run ] completed with state SUCCESS. Commit: 5293dba
/LLM/main/L0_MergeRequest_PR pipeline #43184 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yweng0828

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #54109 [ run ] triggered by Bot. Commit: 5293dba Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54109 [ run ] completed with state SUCCESS. Commit: 5293dba
/LLM/main/L0_MergeRequest_PR pipeline #43193 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@dongfengy dongfengy left a comment

Collaborator

Thanks!

@yweng0828

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #54187 [ run ] triggered by Bot. Commit: 5293dba Link to invocation

@tensorrt-cicd

Collaborator

PR_Github #54187 [ run ] completed with state SUCCESS. Commit: 5293dba
/LLM/main/L0_MergeRequest_PR pipeline #43266 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
@yweng0828 yweng0828 force-pushed the yweng/fix_tinygemm_wrong_bar_try_wait branch from 5293dba to ffd72f7 on June 15, 2026 03:09
@yweng0828

Collaborator Author

/bot run

@tensorrt-cicd

Collaborator

PR_Github #54199 [ run ] triggered by Bot. Commit: ffd72f7 Link to invocation

3 participants