Update: reduce DeepSeek V4 sparse attention tasks #301

Draft
high-cloud wants to merge 1 commit into hw-native-sys:main from high-cloud:update/deepseek-v4-swa-attn-perf
Conversation

Contributor

@high-cloud high-cloud commented May 16, 2026

Summary

  • Fuse sparse attention online softmax into one device task per head tile.
  • Pack grouped output rows per batch instead of per head.
  • Keep attention_swa tensor setup chunked for larger T scaling.
  • Validated attention_swa and sparse_attn ratio 0/4/128 on a2a3.
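
The online-softmax accumulation named in the first bullet can be sketched in plain Python. This is only an illustration, not the kernel itself: the function names are hypothetical, the state is assumed to be seeded from the first selected key (mirroring a loop that starts at index 1), and any sink term in the final normalization is left out.

```python
import math

def online_softmax_attention(scores, values):
    """Softmax-weighted sum of `values`, consuming one score at a time."""
    # Assumption: the running state is seeded from the first key.
    mi = scores[0]          # running maximum of the scores seen so far
    li = 1.0                # running softmax denominator, relative to mi
    oi = list(values[0])    # running unnormalized output, relative to mi
    for s, v in zip(scores[1:], values[1:]):
        mi_new = max(mi, s)
        alpha = math.exp(mi - mi_new)   # rescales the old state
        beta = math.exp(s - mi_new)     # weight of the new key
        li = alpha * li + beta
        oi = [alpha * o + beta * x for o, x in zip(oi, v)]
        mi = mi_new
    return [o / li for o in oi]

def reference_attention(scores, values):
    """Two-pass softmax used only to check the streaming version."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]
```

Because the running maximum, denominator, and output are carried together, the whole accumulation needs a single pass over the selected keys, which is what makes it a candidate for fusing into one task.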

Related Issues

None


coderabbitai Bot commented May 16, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1a9e6d8a-df26-4a4e-bd3b-298c8a6ae665

📥 Commits

Reviewing files that changed from the base of the PR and between 317296d and 2c02f00.

📒 Files selected for processing (1)
  • models/deepseek/v4/sparse_attn.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • models/deepseek/v4/sparse_attn.py

📝 Walkthrough

The PR refactors sparse attention accumulation from explicit stepwise loop mutations to a modern pl.range iterator pattern with initialization and yield semantics, while repositioning a core-group scope annotation to wrap the full packing loop rather than nesting inside it.

Changes

Sparse Attention Refactoring

| Layer / File(s) | Summary |
| --- | --- |
| Sparse attention accumulation with pl.range iterator (models/deepseek/v4/sparse_attn.py) | Sparse softmax mi/li/oi accumulation over sparse_k is replaced: explicit stepwise loop mutations → pl.range iterator with init_values, per-kk mi_new/li_new/oi_new computation, and pl.yield_ to return final accumulated (mi_final, li_final, oi_final) for sink-biased normalization and output. |
| Core-group scope repositioning for packing (models/deepseek/v4/sparse_attn.py) | The pl.at(CORE_GROUP, name_hint="cfa_proj_pack_o_packed") scope is moved to enclose the entire for h in pl.parallel(...) packing loop setup/body rather than nesting only inside the loop body. |
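
For reference, the per-kk update described in the first row can be written as a recurrence, read off the loop body quoted in the review comment further down this thread. The symbols are notation introduced here, not names from the code: $m$, $l$, $o$ stand for mi, li, oi, $c$ for SOFTMAX_SCALE, $q$ for the query row, and $kv_k$ for the k-th selected key/value row. The sink-biased final normalization is not shown in the thread and is left out.

```latex
\begin{aligned}
s_k      &= c \,\langle q, kv_k \rangle \\
m_k      &= \max(m_{k-1},\, s_k) \\
\alpha_k &= \exp(m_{k-1} - m_k), \qquad \beta_k = \exp(s_k - m_k) \\
l_k      &= \alpha_k\, l_{k-1} + \beta_k \\
o_k      &= \alpha_k\, o_{k-1} + \beta_k\, kv_k
\end{aligned}
```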

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly Related PRs

  • hw-native-sys/pypto-lib#225: Both PRs refactor DeepSeek-V4 decode sparse attention's core sparse softmax/accumulation logic; the main PR modernizes the control flow pattern while the linked PR introduces a fused sparse-attention kernel performing the same sparse attention normalization prior to grouped output projection.

Poem

🐇 The mi's and li's were looping slow,
But pl.range made them gleam and glow,
With init and yield, the flow runs true,
Sparse softmax shines in patterns new! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title refers to reducing DeepSeek V4 sparse attention tasks, which aligns with the main changes of fusing sparse attention softmax and optimizing task structure. |
| Description check | ✅ Passed | The description details the specific optimization goals including fusing sparse attention softmax, packing output rows per batch, and validation performed, all of which relate directly to the changeset. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors the attention mechanisms in models/deepseek/v4/attention_swa.py and models/deepseek/v4/sparse_attn.py by removing unnecessary batch chunking loops and optimizing the pl.at block structures. A critical issue was identified in sparse_attn.py where the variables used for final normalization are not defined if the sparse_k loop is empty (e.g., when sparse_k is 1), which would cause a runtime error. I recommend initializing these variables with their initial values before the loop to ensure they are always defined.

Comment on lines +196 to +215
                 for kk, (mi_loop, li_loop, oi_loop) in pl.range(
                     1,
                     sparse_k,
                     init_values=(mi_init, li_init, oi_init),
                 ):
                     cur_kv_batch = pl.col_expand(
                         pl.full([MATMUL_ROW_PAD, HEAD_DIM], dtype=pl.FP32, value=0.0),
                         pl.cast(kv_topk_batch[kk : kk + 1, 0 : HEAD_DIM], target_type=pl.FP32),
                     )
-                    cur_score = pl.row_sum(pl.mul(q_batch, kv_batch))
+                    cur_score = pl.row_sum(pl.mul(q_batch, cur_kv_batch))
                     cur_mi = pl.mul(cur_score, SOFTMAX_SCALE)
-                    mi_new = pl.maximum(mi, cur_mi)
-                    alpha = pl.exp(pl.sub(mi, mi_new))
+                    mi_new = pl.maximum(mi_loop, cur_mi)
+                    alpha = pl.exp(pl.sub(mi_loop, mi_new))
                     beta = pl.exp(pl.sub(cur_mi, mi_new))
-                    li = pl.add(pl.mul(alpha, li), beta)
-                    oi = pl.add(
-                        pl.row_expand_mul(oi, alpha),
-                        pl.row_expand_mul(kv_batch, beta),
+                    li_new = pl.add(pl.mul(alpha, li_loop), beta)
+                    oi_new = pl.add(
+                        pl.row_expand_mul(oi_loop, alpha),
+                        pl.row_expand_mul(cur_kv_batch, beta),
                     )
-                    mi = mi_new
+                    (mi_final, li_final, oi_final) = pl.yield_(mi_new, li_new, oi_new)
high

The variables mi_final, li_final, and oi_final are only assigned within the pl.range loop body. If sparse_k is 1 (which occurs during decode when only the current token is valid in the window and no compressed tokens are selected), the loop range pl.range(1, 1) will be empty. Consequently, these variables will remain undefined when they are accessed for the final normalization and output assembly at lines 218-220, leading to a runtime error. They should be initialized with the _init values before the loop to ensure correctness for all sparse_k values.

                mi_final, li_final, oi_final = mi_init, li_init, oi_init
                for kk, (mi_loop, li_loop, oi_loop) in pl.range(
                    1,
                    sparse_k,
                    init_values=(mi_init, li_init, oi_init),
                ):
                    cur_kv_batch = pl.col_expand(
                        pl.full([MATMUL_ROW_PAD, HEAD_DIM], dtype=pl.FP32, value=0.0),
                        pl.cast(kv_topk_batch[kk : kk + 1, 0 : HEAD_DIM], target_type=pl.FP32),
                    )
                    cur_score = pl.row_sum(pl.mul(q_batch, cur_kv_batch))
                    cur_mi = pl.mul(cur_score, SOFTMAX_SCALE)
                    mi_new = pl.maximum(mi_loop, cur_mi)
                    alpha = pl.exp(pl.sub(mi_loop, mi_new))
                    beta = pl.exp(pl.sub(cur_mi, mi_new))
                    li_new = pl.add(pl.mul(alpha, li_loop), beta)
                    oi_new = pl.add(
                        pl.row_expand_mul(oi_loop, alpha),
                        pl.row_expand_mul(cur_kv_batch, beta),
                    )
                    mi_final, li_final, oi_final = mi_new, li_new, oi_new
                    pl.yield_(mi_new, li_new, oi_new)
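
The empty-range concern raised above can be reproduced in plain Python, as a rough analogy only. Whether pl.range with init_values already binds the yield targets to the initial values on zero iterations is not established in this thread, so the sketch below says nothing about the DSL itself; the function names and the `kk * 10` body are hypothetical placeholders.

```python
def last_state_unsafe(stop):
    """Returns the last loop state; breaks when the range is empty."""
    for kk in range(1, stop):
        final = kk * 10   # bound only if the body runs at least once
    return final          # UnboundLocalError when range(1, stop) is empty

def last_state_safe(stop, init=0):
    """Same loop, but the result is seeded before iterating."""
    final = init
    for kk in range(1, stop):
        final = kk * 10
    return final
```

With `stop == 1` the unsafe variant fails because the loop body never runs, while the seeded variant returns `init`, which is the behavior the suggestion aims for when sparse_k is 1.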

- Fuse sparse attention online softmax into one device task per head tile
- Pack grouped output rows per batch instead of per head
- Keep attention_swa tensor setup chunked for larger T scaling
@high-cloud high-cloud force-pushed the update/deepseek-v4-swa-attn-perf branch from 317296d to 2c02f00 Compare May 18, 2026 01:27
@high-cloud high-cloud changed the title Update: reduce DeepSeek V4 SWA attention tasks Update: reduce DeepSeek V4 sparse attention tasks May 18, 2026
@high-cloud high-cloud marked this pull request as draft May 18, 2026 06:13
1 participant