perf(dsv4): spread decode_indexer rope loop across cores via output de-aliasing#358

Open
wangqin1723-max wants to merge 1 commit into
hw-native-sys:mainfrom
wangqin1723-max:perf/dsv4-decode-indexer-rope-core-spread

Conversation

@wangqin1723-max
Contributor

Summary

  • De-alias the rope loop output. Previously, rope_slice read qr_proj_flat and rope_write wrote the result back in place into its ROPE columns. Although the 32 per-o0 iterations touch disjoint rows, the scheduler could not disambiguate the read+write (RAW/WAR) aliasing on qr_proj_flat at slice granularity, so it serialized every iteration onto a single cube+vector core pair. As a result, the rope window took 67–76% of total wall-clock time.
  • Route rope output to a fresh qr_rope_out tensor so qr_proj_flat stays read-only across the loop; qr_hadamard then K-splits its matmul (NOPE half from qr_proj_flat, ROPE half from qr_rope_out). The scheduler now spreads the rope iterations across ~18 cube cores: rope window 863–1357us → 128us, total wall-clock ~880us (stable) vs 1293–1808us baseline.
  • Also includes GROUP-chunking of the per-head/per-batch loops (rope GRP=4, qr_hadamard split at GRP=4 from quant at GRP=2, score loop SCORE_B_GROUP=8) to amortize per-task launch overhead over taller, numerically-identical tiles.
  • Validated on a2a3: idx_kv_cache and score PASS, precision-neutral (score max_error_ratio unchanged at 0.005). topk_idxs FAIL is a known pre-existing pypto-HEAD issue and is unchanged by this PR.
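The de-aliasing and K-split described above can be illustrated with a small, hedged sketch. It uses plain Python lists rather than the kernel's tensor API, and the column split (`NOPE`, `ROPE`), the `toy_rope` transform, and the helper names are all assumptions made for illustration; only `qr_proj_flat` and `qr_rope_out` are names taken from the PR. The point it shows is that writing the rope result to a fresh tensor and splitting the following matmul along K gives the same result as the in-place version.

```python
# Illustrative only: plain-Python stand-ins for the kernel's tensors.
NOPE, ROPE = 2, 2  # hypothetical column split; the real dims are not given in the PR


def matmul(a, b):
    """Naive matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def toy_rope(cols):
    """Placeholder for the rotary transform (here: rotate each pair by 90 degrees)."""
    out = []
    for i in range(0, len(cols), 2):
        out += [-cols[i + 1], cols[i]]
    return out


def in_place_path(qr_proj_flat, weight):
    """Old dataflow: the rope result overwrites the ROPE columns of its own input."""
    buf = [row[:] for row in qr_proj_flat]
    for row in buf:
        row[NOPE:] = toy_rope(row[NOPE:])  # read and write hit the same tensor
    return matmul(buf, weight)


def de_aliased_path(qr_proj_flat, weight):
    """New dataflow: rope writes to a fresh tensor; the matmul is split along K."""
    qr_rope_out = [toy_rope(row[NOPE:]) for row in qr_proj_flat]  # input stays read-only
    nope_part = matmul([row[:NOPE] for row in qr_proj_flat], weight[:NOPE])
    rope_part = matmul(qr_rope_out, weight[NOPE:])
    return [[a + b for a, b in zip(r0, r1)] for r0, r1 in zip(nope_part, rope_part)]


qr = [[1, 2, 3, 4], [5, 6, 7, 8]]
w = [[1, 0, 2, 1], [0, 1, 1, 3], [2, 2, 0, 1], [1, 0, 1, 0]]
assert in_place_path(qr, w) == de_aliased_path(qr, w)
```

Because each row of `qr_rope_out` depends only on the matching read-only row of the input, the iterations carry no write hazard on a shared tensor, which is the property the scheduler needs in order to spread them across cores.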

Related Issues

N/A

@coderabbitai

coderabbitai Bot commented May 22, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e128ae33-c5a2-423e-bfcc-6fb4518262f4

📥 Commits

Reviewing files that changed from the base of the PR and between 4b79863 and e182a36.

📒 Files selected for processing (1)
  • models/deepseek/v4/decode_indexer.py

📝 Walkthrough

This PR refactors the DeepSeek-V4 decode_indexer kernel's tiling strategy: it introduces head and batch group-chunking parameters to enable coarser-grained task parallelism, updates the ROPE→Hadamard→INT8 pipeline to process multiple heads per task using a fresh intermediate tensor, and rewrites score computation to batch-group accumulation using global buffers for safe disjoint slice writes.

Changes

Kernel Tiling and Dataflow Optimization

| Layer / File(s) | Summary |
|---|---|
| Configuration constants and divisibility constraints (models/deepseek/v4/decode_indexer.py) | HEAD_GROUP, HEAD_GROUP_ROPE, and SCORE_B_GROUP parameters are introduced with divisibility asserts to enforce tiling preconditions for the refactored grouped loops. |
| ROPE processing with head grouping (models/deepseek/v4/decode_indexer.py) | The ROPE loop is refactored to process grouped head rows into a separate qr_rope_out tensor; the FP32 accumulator and slicing are adjusted for the grouped-head output shape; and Hadamard/INT8 quantization is reindexed to consume the grouped ROPE output while writing quantized results at the original head granularity. |
| Score computation with batch grouping (models/deepseek/v4/decode_indexer.py) | Score computation is rewritten to group multiple batches per task, replacing per-batch temporaries with global memory tensors indexed by score_row0; score initialization, quantization, accumulation, and the fused dequant/weight/relu/store are updated to use disjoint grouped slices. |
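As a rough illustration of the group-chunking idea in the table above, the sketch below compares one task per row with one task per group of rows. The helper names, the toy `scale_tile` operation, and the row count of 32 are assumptions for illustration, not code from decode_indexer.py. It shows that grouping reduces the number of tasks while leaving the numerical result unchanged, and that the divisibility assert guards the tiling precondition.

```python
def per_row_tasks(rows, fn):
    """Baseline: launch one task per row; returns (results, task_count)."""
    return [fn([r])[0] for r in rows], len(rows)


def grouped_tasks(rows, fn, group):
    """Grouped: each task handles a taller tile of `group` consecutive rows."""
    assert len(rows) % group == 0, "row count must be divisible by group"
    out, tasks = [], 0
    for start in range(0, len(rows), group):
        out.extend(fn(rows[start:start + group]))  # disjoint output slice per task
        tasks += 1
    return out, tasks


def scale_tile(tile):
    """Toy per-tile computation standing in for the real kernel body."""
    return [[2 * x for x in row] for row in tile]


rows = [[i, i + 1] for i in range(32)]  # illustrative row count
base, n_base = per_row_tasks(rows, scale_tile)
grouped, n_grouped = grouped_tasks(rows, scale_tile, group=4)
assert base == grouped
print(n_base, n_grouped)  # 32 8
```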

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#350: Heavily refactors the same decode_indexer.py tiling/dataflow, including ROPE→Hadamard and score-accumulation indexing.
  • hw-native-sys/pypto-lib#260: Modifies ROPE→Hadamard and score layout in the decode/indexer pipeline; related reindexing and grouped-head changes.
  • hw-native-sys/pypto-lib#339: Aligns with grouped-head ROPE/qkv_proj_rope staging and downstream expectations used by this PR.

Poem

🐰 I hop through grouped heads with glee,

tiles chunked and ropes fall free,
BF16 streams in tidy rows,
scores gather where the buffer grows,
safe slices, silent as a tree.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly describes the main performance optimization: spreading the decode_indexer rope loop across cores through output de-aliasing, which directly corresponds to the primary change in the changeset. |
| Description check | ✅ Passed | The description comprehensively explains the changes including the rope loop de-aliasing strategy, performance improvements, GROUP-chunking optimizations, and validation results, all directly related to the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
models/deepseek/v4/decode_indexer.py (1)

64-74: ⚡ Quick win

Make group selection divisor-aware to avoid unnecessary config rejection.

Line 68 and Line 72 choose fixed groups (4/8) and then assert divisibility, which can reject otherwise valid static configs (for example odd T, or B not divisible by 8). Prefer picking the largest supported divisor first, then keep the asserts as sanity checks.

♻️ Suggested diff
-HEAD_GROUP = 2 if T >= 2 else 1
+HEAD_GROUP = 2 if (T >= 2 and T % 2 == 0) else 1
 HEAD_ROWS = IDX_N_HEADS * HEAD_GROUP
 ...
-HEAD_GROUP_ROPE = 4 if T >= 4 else HEAD_GROUP
+HEAD_GROUP_ROPE = 4 if (T >= 4 and T % 4 == 0) else HEAD_GROUP
 HEAD_ROWS_ROPE = IDX_N_HEADS * HEAD_GROUP_ROPE
 ...
-SCORE_B_GROUP = 8 if B >= 8 else B
+SCORE_B_GROUP = 8 if (B >= 8 and B % 8 == 0) else (4 if B % 4 == 0 else (2 if B % 2 == 0 else 1))
 assert B % SCORE_B_GROUP == 0, "B must be divisible by SCORE_B_GROUP"
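One way to express the divisor-aware selection suggested here is a small helper that tries candidate group sizes from largest to smallest. This is a sketch only: `pick_group` and its candidate tuples are hypothetical and do not appear in the PR, and the existing asserts would still remain as sanity checks.

```python
def pick_group(n, candidates=(8, 4, 2, 1)):
    """Return the first candidate that divides n; candidates are listed largest first."""
    for g in candidates:
        if n >= g and n % g == 0:
            return g
    return 1  # a group of 1 always divides n


# Hypothetical usage mirroring the constants named in the review comment.
B, T = 12, 6
SCORE_B_GROUP = pick_group(B)                # 12 is not divisible by 8, so falls back to 4
HEAD_GROUP_ROPE = pick_group(T, (4, 2, 1))   # 6 is not divisible by 4, so falls back to 2
assert B % SCORE_B_GROUP == 0, "B must be divisible by SCORE_B_GROUP"
assert T % HEAD_GROUP_ROPE == 0, "T must be divisible by HEAD_GROUP_ROPE"
```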
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/deepseek/v4/decode_indexer.py` around lines 64 - 74, The
group-selection logic uses fixed constants (4 and 8) then asserts divisibility,
which can reject valid configs; change HEAD_GROUP_ROPE and SCORE_B_GROUP to
choose the largest supported divisor of T and B respectively (for example pick 4
if T % 4 == 0, else fall back to HEAD_GROUP/2/1 as appropriate; pick 8 if B % 8
== 0, else try 4,2,1) before computing HEAD_ROWS_ROPE and using SCORE_B_GROUP,
and keep the existing asserts as sanity checks; update references to
HEAD_GROUP_ROPE, HEAD_ROWS_ROPE, and SCORE_B_GROUP so they are derived from that
divisor-selection logic rather than hardcoded 4/8.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cf192232-1902-48d7-8826-fac980dad1ad

📥 Commits

Reviewing files that changed from the base of the PR and between be3c794 and 4b79863.

📒 Files selected for processing (1)
  • models/deepseek/v4/decode_indexer.py

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces group-chunking and task folding optimizations to the decode_indexer.py script. By grouping multiple tokens and batches into single tasks, the changes amortize task launch overhead and improve parallelization across hardware cores. Key modifications include the introduction of intermediate tensors to avoid read-after-write hazards and the restructuring of loops for ROPE, Hadamard, and scoring operations to utilize these grouped tasks. I have no feedback to provide.

…e-aliasing

Two layers of latency reduction on the DeepSeek-V4 decode indexer, validated
on a2a3 (idx_kv_cache + score PASS, precision-neutral; topk_idxs is a known
pre-existing pypto-HEAD issue, unchanged):

1. GROUP-chunking of the per-head/per-batch loops (rope GRP=4, qr_hadamard
   split at GRP=4 from quant at GRP=2, score loop SCORE_B_GROUP=8) to amortize
   per-task launch overhead over taller, numerically-identical tiles.

2. De-alias the rope loop output. rope_slice read qr_proj_flat and rope_write
   wrote back in-place into its ROPE columns; although the 32 iterations touch
   disjoint rows, the scheduler could not disambiguate the read+write (RAW/WAR)
   aliasing on qr_proj_flat at slice granularity and serialized all iterations
   onto a single cube+vector core pair (the rope window was 67-76% of total
   wall-clock). Routing rope output to a fresh qr_rope_out tensor keeps
   qr_proj_flat read-only; qr_hadamard then K-splits its matmul (NOPE half from
   qr_proj_flat, ROPE half from qr_rope_out). The scheduler now spreads the rope
   iterations across ~18 cube cores: rope window 863-1357us -> 128us, total
   wall-clock ~880us (stable) vs 1293-1808us baseline.
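
For reference, the speedup ranges implied by the timings quoted above can be worked out directly. The snippet only restates the reported figures; the variable names are illustrative, and the range endpoints are paired low-to-low and high-to-high as an assumption.

```python
# Timings as reported in the commit message, in microseconds.
rope_before_us = (863, 1357)
rope_after_us = 128
total_before_us = (1293, 1808)
total_after_us = 880

rope_speedup = tuple(round(t / rope_after_us, 1) for t in rope_before_us)
total_speedup = tuple(round(t / total_after_us, 2) for t in total_before_us)

print(rope_speedup)   # (6.7, 10.6)
print(total_speedup)  # (1.47, 2.05)
```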
@wangqin1723-max wangqin1723-max force-pushed the perf/dsv4-decode-indexer-rope-core-spread branch from 4b79863 to e182a36 Compare May 23, 2026 08:26