perf(dsv4): spread decode_indexer rope loop across cores via output de-aliasing by wangqin1723-max · Pull Request #358 · hw-native-sys/pypto-lib

wangqin1723-max · 2026-05-22T10:02:34Z

Summary

De-alias the rope loop output: rope_slice read qr_proj_flat and rope_write wrote back in-place into its ROPE columns. Although the 32 per-o0 iterations touch disjoint rows, the scheduler could not disambiguate the read+write (RAW/WAR) aliasing on qr_proj_flat at slice granularity and serialized every iteration onto a single cube+vector core pair — the rope window was 67–76% of total wall-clock.
Route rope output to a fresh qr_rope_out tensor so qr_proj_flat stays read-only across the loop; qr_hadamard then K-splits its matmul (NOPE half from qr_proj_flat, ROPE half from qr_rope_out). The scheduler now spreads the rope iterations across ~18 cube cores: rope window 863–1357us → 128us, total wall-clock ~880us (stable) vs 1293–1808us baseline.
Also includes GROUP-chunking of the per-head/per-batch loops (rope GRP=4, qr_hadamard split at GRP=4 from quant at GRP=2, score loop SCORE_B_GROUP=8) to amortize per-task launch overhead over taller, numerically-identical tiles.
Validated on a2a3: idx_kv_cache and score PASS, precision-neutral (score max_error_ratio unchanged at 0.005). topk_idxs FAIL is a known pre-existing pypto-HEAD issue and is unchanged by this PR.
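The de-aliasing and K-split described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the kernel code: nested lists stand in for device tensors, `toy_rope` is a hypothetical placeholder for the real rotary transform, and the 2+2 column split and 4x4 Hadamard matrix are made-up sizes. It shows that writing the rope result to a fresh `qr_rope_out` and summing two partial matmuls gives the same result as the in-place version, while `qr_proj_flat` is never written.

```python
# Illustrative only: plain Python lists stand in for device tensors.
NOPE, ROPE = 2, 2  # hypothetical column split: [NOPE | ROPE]


def matmul(a, b):
    """Naive row-major matmul on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def toy_rope(cols):
    """Hypothetical stand-in for the rotary transform: rotate each pair by 90 degrees."""
    out = []
    for i in range(0, len(cols), 2):
        out += [-cols[i + 1], cols[i]]
    return out


qr_proj_flat = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
hadamard = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]

# Before: the rope result is written back into the ROPE columns (done on a copy here).
in_place = [row[:] for row in qr_proj_flat]
for row in in_place:
    row[NOPE:] = toy_rope(row[NOPE:])
reference = matmul(in_place, hadamard)

# After: qr_proj_flat is only read; the rope output goes to a fresh tensor.
qr_rope_out = [toy_rope(row[NOPE:]) for row in qr_proj_flat]

# K-split: the first NOPE weight rows pair with qr_proj_flat, the rest with qr_rope_out.
nope_part = matmul([row[:NOPE] for row in qr_proj_flat], hadamard[:NOPE])
rope_part = matmul(qr_rope_out, hadamard[NOPE:])
k_split = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(nope_part, rope_part)]

assert k_split == reference
```

Because every iteration only reads `qr_proj_flat` and writes its own rows of `qr_rope_out`, a dependency tracker that works at whole-tensor granularity no longer sees a read and a write on the same tensor.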

Related Issues

N/A

coderabbitai · 2026-05-22T10:02:48Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e128ae33-c5a2-423e-bfcc-6fb4518262f4

📥 Commits

Reviewing files that changed from the base of the PR and between 4b79863 and e182a36.

📒 Files selected for processing (1)

models/deepseek/v4/decode_indexer.py

📝 Walkthrough

Walkthrough

This PR refactors the DeepSeek-V4 decode_indexer kernel's tiling strategy: it introduces head and batch group-chunking parameters to enable coarser-grained task parallelism, updates the ROPE→Hadamard→INT8 pipeline to process multiple heads per task using a fresh intermediate tensor, and rewrites score computation to batch-group accumulation using global buffers for safe disjoint slice writes.

Changes

Kernel Tiling and Dataflow Optimization

| Layer / File(s) | Summary |
| --- | --- |
| Configuration constants and divisibility constraints (`models/deepseek/v4/decode_indexer.py`) | `HEAD_GROUP`, `HEAD_GROUP_ROPE`, and `SCORE_B_GROUP` parameters are introduced with divisibility asserts to enforce tiling preconditions for the refactored grouped loops. |
| ROPE processing with head grouping (`models/deepseek/v4/decode_indexer.py`) | The ROPE loop is refactored to process grouped head rows into a separate `qr_rope_out` tensor; the FP32 accumulator and slicing are adjusted for the grouped-head output shape; Hadamard/INT8 quantization is reindexed to consume the grouped ROPE output while writing quantized results at the original head granularity. |
| Score computation with batch grouping (`models/deepseek/v4/decode_indexer.py`) | Score computation is rewritten to group multiple batches per task, replacing per-batch temporaries with global memory tensors indexed by `score_row0`; score initialization, quantization, accumulation, and the fused dequant/weight/relu/store are updated to use disjoint grouped slices. |
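As a rough illustration of the batch grouping and disjoint slice writes summarized in the table, the sketch below walks a buffer in groups of `SCORE_B_GROUP` rows. The batch count `B = 16`, the helper `grouped_row_ranges`, and the list used as a "global buffer" are assumptions made for this example; only the names `SCORE_B_GROUP` and `score_row0` and the divisibility assert come from the PR.

```python
B = 16  # hypothetical batch count
SCORE_B_GROUP = 8 if B >= 8 else B
assert B % SCORE_B_GROUP == 0, "B must be divisible by SCORE_B_GROUP"


def grouped_row_ranges(total_rows, group):
    """Yield the [row0, row0 + group) range owned by each grouped task."""
    for row0 in range(0, total_rows, group):
        yield row0, row0 + group


score = [None] * B  # stands in for a global score buffer

for task_id, (score_row0, row_end) in enumerate(grouped_row_ranges(B, SCORE_B_GROUP)):
    # Each task writes only its own slice, so no two tasks touch the same row.
    score[score_row0:row_end] = [task_id] * SCORE_B_GROUP

# Every row was written exactly once, by exactly one task.
assert None not in score
```

With these example values the loop launches `B // SCORE_B_GROUP` tasks instead of `B`, which is the launch-overhead amortization the PR description refers to.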

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

hw-native-sys/pypto-lib#350: Heavily refactors the same decode_indexer.py tiling/dataflow, including ROPE→Hadamard and score-accumulation indexing.
hw-native-sys/pypto-lib#260: Modifies ROPE→Hadamard and score layout in the decode/indexer pipeline; related reindexing and grouped-head changes.
hw-native-sys/pypto-lib#339: Aligns with grouped-head ROPE/qkv_proj_rope staging and downstream expectations used by this PR.

Poem

🐰 I hop through grouped heads with glee,

tiles chunked and ropes fall free,
BF16 streams in tidy rows,
scores gather where the buffer grows,
safe slices, silent as a tree.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly describes the main performance optimization: spreading the decode_indexer rope loop across cores through output de-aliasing, which directly corresponds to the primary change in the changeset. |
| Description check | ✅ Passed | The description comprehensively explains the changes including the rope loop de-aliasing strategy, performance improvements, GROUP-chunking optimizations, and validation results, all directly related to the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

🧹 Nitpick comments (1)

models/deepseek/v4/decode_indexer.py (1)

64-74: ⚡ Quick win

Make group selection divisor-aware to avoid unnecessary config rejection.

Line 68 and Line 72 choose fixed groups (4/8) and then assert divisibility, which can reject otherwise valid static configs (for example odd T, or B not divisible by 8). Prefer picking the largest supported divisor first, then keep the asserts as sanity checks.

♻️ Suggested diff

-HEAD_GROUP = 2 if T >= 2 else 1
+HEAD_GROUP = 2 if (T >= 2 and T % 2 == 0) else 1
 HEAD_ROWS = IDX_N_HEADS * HEAD_GROUP
 ...
-HEAD_GROUP_ROPE = 4 if T >= 4 else HEAD_GROUP
+HEAD_GROUP_ROPE = 4 if (T >= 4 and T % 4 == 0) else HEAD_GROUP
 HEAD_ROWS_ROPE = IDX_N_HEADS * HEAD_GROUP_ROPE
 ...
-SCORE_B_GROUP = 8 if B >= 8 else B
+SCORE_B_GROUP = 8 if (B >= 8 and B % 8 == 0) else (4 if B % 4 == 0 else (2 if B % 2 == 0 else 1))
 assert B % SCORE_B_GROUP == 0, "B must be divisible by SCORE_B_GROUP"
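One way to express the "largest supported divisor" idea from this nitpick is a small helper. This is a sketch, not code from the PR: `pick_group` and its candidate tuples are hypothetical, and the values of `T` and `B` are made up for the example.

```python
def pick_group(n, candidates=(8, 4, 2, 1)):
    """Return the largest candidate that divides n (1 qualifies for any n >= 1)."""
    for g in candidates:
        if n >= g and n % g == 0:
            return g
    raise ValueError(f"no candidate divides {n}")


T, B = 6, 12  # hypothetical static config values

HEAD_GROUP = pick_group(T, (2, 1))
HEAD_GROUP_ROPE = pick_group(T, (4, 2, 1))
SCORE_B_GROUP = pick_group(B)

# The asserts stay as sanity checks, as the review suggests.
assert T % HEAD_GROUP == 0 and T % HEAD_GROUP_ROPE == 0
assert B % SCORE_B_GROUP == 0, "B must be divisible by SCORE_B_GROUP"
```

One design point worth checking before adopting a fallback like this: the original expression `8 if B >= 8 else B` uses `B` itself as the group when `B` is below 8, so a power-of-two fallback picks a smaller group for values such as `B = 6` (2 instead of 6).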

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/decode_indexer.py` around lines 64 - 74, The
group-selection logic uses fixed constants (4 and 8) then asserts divisibility,
which can reject valid configs; change HEAD_GROUP_ROPE and SCORE_B_GROUP to
choose the largest supported divisor of T and B respectively (for example pick 4
if T % 4 == 0, else fall back to HEAD_GROUP/2/1 as appropriate; pick 8 if B % 8
== 0, else try 4,2,1) before computing HEAD_ROWS_ROPE and using SCORE_B_GROUP,
and keep the existing asserts as sanity checks; update references to
HEAD_GROUP_ROPE, HEAD_ROWS_ROPE, and SCORE_B_GROUP so they are derived from that
divisor-selection logic rather than hardcoded 4/8.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cf192232-1902-48d7-8826-fac980dad1ad

📥 Commits

Reviewing files that changed from the base of the PR and between be3c794 and 4b79863.

📒 Files selected for processing (1)

models/deepseek/v4/decode_indexer.py

gemini-code-assist

Code Review

This pull request introduces group-chunking and task folding optimizations to the decode_indexer.py script. By grouping multiple tokens and batches into single tasks, the changes amortize task launch overhead and improve parallelization across hardware cores. Key modifications include the introduction of intermediate tensors to avoid read-after-write hazards and the restructuring of loops for ROPE, Hadamard, and scoring operations to utilize these grouped tasks. I have no feedback to provide.

…e-aliasing

Two layers of latency reduction on the DeepSeek-V4 decode indexer, validated on a2a3 (idx_kv_cache + score PASS, precision-neutral; topk_idxs is a known pre-existing pypto-HEAD issue, unchanged):

1. GROUP-chunking of the per-head/per-batch loops (rope GRP=4, qr_hadamard split at GRP=4 from quant at GRP=2, score loop SCORE_B_GROUP=8) to amortize per-task launch overhead over taller, numerically-identical tiles.
2. De-alias the rope loop output. rope_slice read qr_proj_flat and rope_write wrote back in-place into its ROPE columns; although the 32 iterations touch disjoint rows, the scheduler could not disambiguate the read+write (RAW/WAR) aliasing on qr_proj_flat at slice granularity and serialized all iterations onto a single cube+vector core pair (the rope window was 67-76% of total wall-clock). Routing rope output to a fresh qr_rope_out tensor keeps qr_proj_flat read-only; qr_hadamard then K-splits its matmul (NOPE half from qr_proj_flat, ROPE half from qr_rope_out). The scheduler now spreads the rope iterations across ~18 cube cores: rope window 863-1357us -> 128us, total wall-clock ~880us (stable) vs 1293-1808us baseline.
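For a quick sense of scale, the reported timings can be turned into speedup ratios. The numbers below are copied from the commit message; pairing the low end of each baseline range with the low end of the other, and treating "~880us" as exactly 880, are assumptions made only for this back-of-the-envelope sketch.

```python
rope_before_us = (863, 1357)    # baseline rope window range
rope_after_us = 128             # rope window after de-aliasing
total_before_us = (1293, 1808)  # baseline total wall-clock range
total_after_us = 880            # reported as "~880us"; exact value assumed here

rope_speedup = tuple(t / rope_after_us for t in rope_before_us)
total_speedup = tuple(t / total_after_us for t in total_before_us)

print(f"rope window: {rope_speedup[0]:.1f}x to {rope_speedup[1]:.1f}x faster")
print(f"total:       {total_speedup[0]:.1f}x to {total_speedup[1]:.1f}x faster")
```

Under those assumptions the rope window shrinks by roughly 6.7x to 10.6x, while the end-to-end gain is roughly 1.5x to 2.1x, since the rest of the kernel is unchanged by the de-aliasing.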

coderabbitai Bot reviewed May 22, 2026

gemini-code-assist Bot reviewed May 22, 2026

wangqin1723-max force-pushed the perf/dsv4-decode-indexer-rope-core-spread branch from 4b79863 to e182a36 · May 23, 2026 08:26

wangqin1723-max wants to merge 1 commit into hw-native-sys:main from wangqin1723-max:perf/dsv4-decode-indexer-rope-core-spread
