[None][fix] fix tinygemm barrier bug by yweng0828 · Pull Request #15338 · NVIDIA/TensorRT-LLM

yweng0828 · 2026-06-13T15:36:38Z

Summary by CodeRabbit

Bug Fixes
- Improved synchronization logic in tensor compute kernels to enhance correctness and stability of mathematical operations.

Description

How does this bug manifest?

When Tinygemm and UVA copies run concurrently, the GEMM produces incorrect results.
Running Tinygemm on its own does not reproduce the problem.
The bug is not triggered unless UVA copies or similar workloads are running alongside it.

Root cause

The consumer (compute warp) spin-wait polls a stale barrier pointer.
bar_ptr_wt / bar_ptr_act are computed ONCE before the loop, for the initial stage, and are never updated, while the stage advances by 4 every ki.
When the prefetch misses (the producer is behind, e.g., under heavy concurrent SM->sysmem traffic), the spin-wait polls the wrong, initial-stage barrier. That barrier already completed at ki==0, so try_wait returns ready immediately and the consumer proceeds to ldmatrix a stage the TMA has not filled yet -> corrupted output (or a pipeline hang).
The fix polls the CURRENT stage's ready barriers, so the wait checks the barrier that actually gates this iteration's data.
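
The difference between the stale and the live poll can be sketched with a small host-side C++ model. This is not the kernel code: `ReadyBarriers`, `producer_fill`, `consumer_may_read_buggy`, and `consumer_may_read_fixed` are hypothetical names, the ring depth of 16 is an assumption, and the parity rule in `try_wait` is a simplified stand-in for the hardware barrier wait.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kNumStages = 16;  // assumed ring depth, not taken from the PR

// Host-side stand-in for the per-stage "ready" barriers.
struct ReadyBarriers {
    // Number of phases each stage's barrier has completed so far.
    std::array<std::uint32_t, kNumStages> completed_phases{};

    // Producer side: the copy into `stage` has landed, so its barrier completes a phase.
    void producer_fill(int stage) {
        assert(stage >= 0 && stage < kNumStages);
        ++completed_phases[stage];
    }

    // Simplified parity wait: true once the phase with parity `phase` has completed.
    bool try_wait(int stage, std::uint32_t phase) const {
        assert(stage >= 0 && stage < kNumStages);
        return (completed_phases[stage] & 1u) != phase;
    }
};

// Buggy shape: the wait always polls the barrier of the initial stage,
// no matter which stage the consumer is about to read.
bool consumer_may_read_buggy(const ReadyBarriers& bars, int initial_stage,
                             int /*current_stage*/, std::uint32_t phase) {
    return bars.try_wait(initial_stage, phase);
}

// Fixed shape: the wait polls the barrier of the stage being read.
bool consumer_may_read_fixed(const ReadyBarriers& bars, int current_stage,
                             std::uint32_t phase) {
    return bars.try_wait(current_stage, phase);
}
```

In this model, once the producer has filled only stage 0, the buggy wait lets the consumer through for stage 4 while the fixed wait keeps it spinning until stage 4 is filled.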

Why doesn't the bug appear when Tinygemm runs exclusively?

It is only triggered when the producer lags behind and the prefetch misses. When Tinygemm runs exclusively, the producer is always ahead, so the bug never surfaces.
Pinned-host writeback creates the trigger condition by competing for the memory subsystem and slowing down the TMA producer.
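
A rough timing model illustrates why the bug stays hidden while the producer keeps up. This is a sketch under simplifying assumptions: `count_unfilled_reads` and its cost parameters are hypothetical, the producer is treated as filling one stage every `producer_cost` time units without back-pressure, and ring wrap-around and phase flips are ignored.

```cpp
#include <cassert>

// Counts how many K iterations read a stage before the producer has filled it.
// poll_live_stage == false models the stale-pointer wait; true models the fix.
int count_unfilled_reads(int num_k_iters, int producer_cost, int consumer_cost,
                         bool poll_live_stage) {
    assert(num_k_iters > 0 && producer_cost > 0 && consumer_cost > 0);
    int unfilled_reads = 0;
    long long now = 0;  // consumer's clock
    for (int ki = 0; ki < num_k_iters; ++ki) {
        // Time at which the producer finishes filling this iteration's stage.
        const long long filled_at = static_cast<long long>(ki + 1) * producer_cost;
        // The cached pointer is only correct for the initial stage (ki == 0).
        const bool polls_correct_barrier = poll_live_stage || ki == 0;
        if (now < filled_at) {
            if (polls_correct_barrier) {
                now = filled_at;   // spin until the data has really arrived
            } else {
                ++unfilled_reads;  // stale barrier is already complete: wait returns at once
            }
        }
        now += consumer_cost;      // compute on the stage
    }
    return unfilled_reads;
}
```

With a fast producer, for example `count_unfilled_reads(8, 1, 2, false)`, the stale wait is never exercised and the count is zero. Once the producer is slower than the consumer, for example `count_unfilled_reads(8, 3, 2, false)`, every iteration after the first reads an unfilled stage, while the live-stage variant still returns zero.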

Also update the integration in flashinfer: flashinfer-ai/flashinfer#3630

Thanks to @LorrinWWW for reporting this bug and for providing a very detailed script to reproduce it.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-06-13T15:39:31Z

📝 Walkthrough

Walkthrough

This PR refines synchronization logic in TinyGEMM2's compute kernel. The wait loop that blocks on weight and activation readiness now dynamically recalculates barrier pointers per loop iteration based on the current stage, ensuring correct synchronization across stage transitions rather than relying on stale cached pointers.
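
To make "stale cached pointers" concrete, the following host-side C++ sketch lists which barrier index the wait polls on each K iteration, with and without the per-iteration recalculation. `polled_stages` is a hypothetical helper, and the ring depth of 16 is an assumption; only the step of 4 per `ki` comes from the PR description.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kNumStages = 16;  // assumed ring depth, not taken from the PR
constexpr int kStageStep = 4;   // the PR states the stage advances by 4 every ki

// Returns the barrier index polled on each K iteration.
// poll_live_stage == false reuses the index cached before the loop.
std::vector<int> polled_stages(int num_k_iters, int initial_stage, bool poll_live_stage) {
    assert(num_k_iters >= 0 && initial_stage >= 0 && initial_stage < kNumStages);
    std::vector<int> polled;
    polled.reserve(static_cast<std::size_t>(num_k_iters));
    const int cached_stage = initial_stage;  // computed once, never refreshed
    int stage = initial_stage;
    for (int ki = 0; ki < num_k_iters; ++ki) {
        polled.push_back(poll_live_stage ? stage : cached_stage);
        stage = (stage + kStageStep) % kNumStages;  // advance to the next stage
    }
    return polled;
}
```

Starting from stage 1, the cached variant polls index 1 on every iteration, whereas the refreshed variant polls 1, 5, 9, 13 and then wraps back to 1.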

Changes

Barrier Synchronization Fix

Layer / File(s)	Summary
Compute loop barrier pointer refresh `cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_kernel.cuh`	Wait loop `bar_try_wait` calls now compute shared-barrier addresses using the current `stage` value on each iteration via `__cvta_generic_to_shared(&bar_*_ready[stage])` instead of reusing cached pointers, ensuring stage-specific barrier synchronization.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description provides detailed explanation of the bug, root cause, and fix, but the Test Coverage section is empty and no checklist items are marked as completed.	Add test coverage information and mark completed checklist items. Explicitly state which tests safeguard the barrier synchronization fix.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly identifies the change as a bug fix to the tinygemm barrier logic, directly matching the code change described in the raw summary.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

🧹 Nitpick comments (1)

cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_kernel.cuh (1)

341-345: 💤 Low value

Optional: Remove now-partially-redundant cached barrier pointers for consistency.

After the fix, bar_ptr_wt and bar_ptr_act are only used for the initial try_wait before the loop. For consistency with the inline computation now used inside the wait loop, consider removing these cached variables and inlining the computation here as well.

♻️ Proposed refactor

-        uint32_t bar_ptr_wt = __cvta_generic_to_shared(&bar_wt_ready[stage]);
-        uint32_t bar_ptr_act = __cvta_generic_to_shared(&bar_act_ready[stage]);
-
-        bool weight_ready = bar_try_wait(bar_ptr_wt, phase);
-        bool act_ready = bar_try_wait(bar_ptr_act, phase);
+        bool weight_ready = bar_try_wait(__cvta_generic_to_shared(&bar_wt_ready[stage]), phase);
+        bool act_ready = bar_try_wait(__cvta_generic_to_shared(&bar_act_ready[stage]), phase);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_kernel.cuh` around lines 341 -
345, Remove the now-redundant cached barrier pointer variables bar_ptr_wt and
bar_ptr_act: instead of computing them once and passing to bar_try_wait, call
bar_try_wait(__cvta_generic_to_shared(&bar_wt_ready[stage]), phase) and
bar_try_wait(__cvta_generic_to_shared(&bar_act_ready[stage]), phase) inline;
delete the declarations of uint32_t bar_ptr_wt and bar_ptr_act and update any
uses (the initial try_wait checks) to use the inlined
__cvta_generic_to_shared(...) expressions so the code matches the loop’s inline
computation style.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 620f19cb-424b-4c56-a915-f528c4190912

📥 Commits

Reviewing files that changed from the base of the PR and between 4e1776a and 5293dba.

📒 Files selected for processing (1)

cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_kernel.cuh

yweng0828 · 2026-06-13T15:50:48Z

/bot run

tensorrt-cicd · 2026-06-13T15:56:17Z

PR_Github #54059 [ run ] triggered by Bot. Commit: 5293dba Link to invocation

tensorrt-cicd · 2026-06-13T19:42:48Z

PR_Github #54059 [ run ] completed with state SUCCESS. Commit: 5293dba
/LLM/main/L0_MergeRequest_PR pipeline #43143 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yweng0828 · 2026-06-14T07:49:56Z

/bot run

tensorrt-cicd · 2026-06-14T07:55:06Z

PR_Github #54096 [ run ] triggered by Bot. Commit: 5293dba Link to invocation

tensorrt-cicd · 2026-06-14T08:15:31Z

PR_Github #54096 [ run ] completed with state FAILURE. Commit: 5293dba
/LLM/main/L0_MergeRequest_PR pipeline #43179 completed with status: 'FAILURE'

yweng0828 · 2026-06-14T08:29:26Z

/bot run

tensorrt-cicd · 2026-06-14T08:35:21Z

PR_Github #54100 [ run ] triggered by Bot. Commit: 5293dba Link to invocation

tensorrt-cicd · 2026-06-14T10:03:21Z

PR_Github #54100 [ run ] completed with state SUCCESS. Commit: 5293dba
/LLM/main/L0_MergeRequest_PR pipeline #43184 completed with status: 'FAILURE'

yweng0828 · 2026-06-14T10:05:32Z

/bot run

tensorrt-cicd · 2026-06-14T10:11:32Z

PR_Github #54109 [ run ] triggered by Bot. Commit: 5293dba Link to invocation

tensorrt-cicd · 2026-06-14T11:23:12Z

PR_Github #54109 [ run ] completed with state SUCCESS. Commit: 5293dba
/LLM/main/L0_MergeRequest_PR pipeline #43193 completed with status: 'FAILURE'

dongfengy

Thanks!

yweng0828 · 2026-06-15T02:15:32Z

/bot run

tensorrt-cicd · 2026-06-15T02:20:50Z

PR_Github #54187 [ run ] triggered by Bot. Commit: 5293dba Link to invocation

tensorrt-cicd · 2026-06-15T02:58:46Z

PR_Github #54187 [ run ] completed with state SUCCESS. Commit: 5293dba
/LLM/main/L0_MergeRequest_PR pipeline #43266 completed with status: 'FAILURE'

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

yweng0828 · 2026-06-15T03:09:45Z

/bot run

tensorrt-cicd · 2026-06-15T03:16:11Z

PR_Github #54199 [ run ] triggered by Bot. Commit: ffd72f7 Link to invocation

github-actions Bot assigned yweng0828 Jun 13, 2026

coderabbitai Bot reviewed Jun 13, 2026

yweng0828 requested a review from dongfengy June 13, 2026 15:42

yweng0828 mentioned this pull request Jun 13, 2026

fix: fix tinygemm barrier bug flashinfer-ai/flashinfer#3630

Draft

5 tasks

dongfengy approved these changes Jun 14, 2026

fix tinygemm barrier bug

ffd72f7

Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>

yweng0828 force-pushed the yweng/fix_tinygemm_wrong_bar_try_wait branch from 5293dba to ffd72f7 Compare June 15, 2026 03:09
