Update: reduce DeepSeek V4 sparse attention tasks by high-cloud · Pull Request #301 · hw-native-sys/pypto-lib

high-cloud · 2026-05-16T16:13:49Z

Summary

Fuse sparse attention online softmax into one device task per head tile.
Pack grouped output rows per batch instead of per head.
Keep attention_swa tensor setup chunked for larger T scaling.
Validated attention_swa and sparse_attn ratio 0/4/128 on a2a3.

Related Issues

None

coderabbitai · 2026-05-16T16:14:00Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1a9e6d8a-df26-4a4e-bd3b-298c8a6ae665

📥 Commits

Reviewing files that changed from the base of the PR and between 317296d and 2c02f00.

📒 Files selected for processing (1)

models/deepseek/v4/sparse_attn.py

🚧 Files skipped from review as they are similar to previous changes (1)

models/deepseek/v4/sparse_attn.py

📝 Walkthrough

The PR refactors sparse attention accumulation from explicit stepwise loop mutations to a modern pl.range iterator pattern with initialization and yield semantics, while repositioning a core-group scope annotation to wrap the full packing loop rather than nesting inside it.

Changes

Sparse Attention Refactoring

Layer / File(s)	Summary
Sparse attention accumulation with pl.range iterator `models/deepseek/v4/sparse_attn.py`	Sparse softmax mi/li/oi accumulation over sparse_k is replaced: explicit stepwise loop mutations → pl.range iterator with init_values, per-kk mi_new/li_new/oi_new computation, and pl.yield_ to return final accumulated (mi_final, li_final, oi_final) for sink-biased normalization and output.
Core-group scope repositioning for packing `models/deepseek/v4/sparse_attn.py`	The `pl.at(CORE_GROUP, name_hint="cfa_proj_pack_o_packed")` scope is moved to enclose the entire `for h in pl.parallel(...)` packing loop setup/body rather than nesting only inside the loop body.
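For readers unfamiliar with the mi/li/oi recurrence named in the table, the following is a minimal NumPy sketch of online softmax accumulation for a single query. It is not the `pl` kernel. The function names are hypothetical, the state is assumed to be initialised from row 0 (suggested by the loop starting at index 1, but not shown in this thread), the same `kv` rows serve as both keys and values as in the diff, and the sink-biased normalization step is omitted.

```python
import numpy as np


def online_softmax_attend(q, kv, scale):
    """Attend one query over the rows of kv with a running (mi, li, oi) state.

    mi: running maximum of the scaled scores
    li: running sum of exp(score - mi)
    oi: running exp-weighted sum of the kv rows
    """
    # Assumption: the initial state comes from row 0, so the loop starts at 1.
    mi = scale * float(q @ kv[0])
    li = 1.0
    oi = kv[0].astype(np.float64).copy()
    for kk in range(1, kv.shape[0]):
        cur_mi = scale * float(q @ kv[kk])
        mi_new = max(mi, cur_mi)
        alpha = np.exp(mi - mi_new)      # rescales the previously accumulated state
        beta = np.exp(cur_mi - mi_new)   # weight of the current row
        li = alpha * li + beta
        oi = alpha * oi + beta * kv[kk]
        mi = mi_new
    return oi / li


def reference_attend(q, kv, scale):
    """Two-pass softmax attention, used only to check the online version."""
    scores = scale * (kv @ q)
    weights = np.exp(scores - scores.max())
    return (weights / weights.sum()) @ kv


rng = np.random.default_rng(0)
q = rng.standard_normal(8)
kv = rng.standard_normal((5, 8))
print(np.allclose(online_softmax_attend(q, kv, 0.5), reference_attend(q, kv, 0.5)))
```

Because each step only needs the previous (mi, li, oi) triple, the whole accumulation can be carried as loop state, which is what makes it a natural fit for a single fused task.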

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly Related PRs

hw-native-sys/pypto-lib#225: Both PRs refactor DeepSeek-V4 decode sparse attention's core sparse softmax/accumulation logic; the main PR modernizes the control flow pattern while the linked PR introduces a fused sparse-attention kernel performing the same sparse attention normalization prior to grouped output projection.

Poem

🐇 The mi's and li's were looping slow,
But pl.range made them gleam and glow,
With init and yield, the flow runs true,
Sparse softmax shines in patterns new! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title refers to reducing DeepSeek V4 sparse attention tasks, which aligns with the main changes of fusing sparse attention softmax and optimizing task structure.
Description check	✅ Passed	The description details the specific optimization goals including fusing sparse attention softmax, packing output rows per batch, and validation performed, all of which relate directly to the changeset.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

gemini-code-assist

Code Review

This pull request refactors the attention mechanisms in models/deepseek/v4/attention_swa.py and models/deepseek/v4/sparse_attn.py by removing unnecessary batch chunking loops and optimizing the pl.at block structures. A critical issue was identified in sparse_attn.py where the variables used for final normalization are not defined if the sparse_k loop is empty (e.g., when sparse_k is 1), which would cause a runtime error. I recommend initializing these variables with their initial values before the loop to ensure they are always defined.

gemini-code-assist · 2026-05-16T16:26:35Z

+                for kk, (mi_loop, li_loop, oi_loop) in pl.range(
+                    1,
+                    sparse_k,
+                    init_values=(mi_init, li_init, oi_init),
+                ):
+                    cur_kv_batch = pl.col_expand(
                        pl.full([MATMUL_ROW_PAD, HEAD_DIM], dtype=pl.FP32, value=0.0),
                        pl.cast(kv_topk_batch[kk : kk + 1, 0 : HEAD_DIM], target_type=pl.FP32),
                    )
-                    cur_score = pl.row_sum(pl.mul(q_batch, kv_batch))
+                    cur_score = pl.row_sum(pl.mul(q_batch, cur_kv_batch))
                    cur_mi = pl.mul(cur_score, SOFTMAX_SCALE)
-                    mi_new = pl.maximum(mi, cur_mi)
-                    alpha = pl.exp(pl.sub(mi, mi_new))
+                    mi_new = pl.maximum(mi_loop, cur_mi)
+                    alpha = pl.exp(pl.sub(mi_loop, mi_new))
                    beta = pl.exp(pl.sub(cur_mi, mi_new))
-                    li = pl.add(pl.mul(alpha, li), beta)
-                    oi = pl.add(
-                        pl.row_expand_mul(oi, alpha),
-                        pl.row_expand_mul(kv_batch, beta),
+                    li_new = pl.add(pl.mul(alpha, li_loop), beta)
+                    oi_new = pl.add(
+                        pl.row_expand_mul(oi_loop, alpha),
+                        pl.row_expand_mul(cur_kv_batch, beta),
                    )
-                    mi = mi_new
+                    (mi_final, li_final, oi_final) = pl.yield_(mi_new, li_new, oi_new)


The variables mi_final, li_final, and oi_final are only assigned within the pl.range loop body. If sparse_k is 1 (which occurs during decode when only the current token is valid in the window and no compressed tokens are selected), the loop range pl.range(1, 1) will be empty. Consequently, these variables will remain undefined when they are accessed for the final normalization and output assembly at lines 218-220, leading to a runtime error. They should be initialized with the _init values before the loop to ensure correctness for all sparse_k values.

                mi_final, li_final, oi_final = mi_init, li_init, oi_init
                for kk, (mi_loop, li_loop, oi_loop) in pl.range(
                    1,
                    sparse_k,
                    init_values=(mi_init, li_init, oi_init),
                ):
                    cur_kv_batch = pl.col_expand(
                        pl.full([MATMUL_ROW_PAD, HEAD_DIM], dtype=pl.FP32, value=0.0),
                        pl.cast(kv_topk_batch[kk : kk + 1, 0 : HEAD_DIM], target_type=pl.FP32),
                    )
                    cur_score = pl.row_sum(pl.mul(q_batch, cur_kv_batch))
                    cur_mi = pl.mul(cur_score, SOFTMAX_SCALE)
                    mi_new = pl.maximum(mi_loop, cur_mi)
                    alpha = pl.exp(pl.sub(mi_loop, mi_new))
                    beta = pl.exp(pl.sub(cur_mi, mi_new))
                    li_new = pl.add(pl.mul(alpha, li_loop), beta)
                    oi_new = pl.add(
                        pl.row_expand_mul(oi_loop, alpha),
                        pl.row_expand_mul(cur_kv_batch, beta),
                    )
                    mi_final, li_final, oi_final = mi_new, li_new, oi_new
                    pl.yield_(mi_new, li_new, oi_new)
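The hazard described in this comment can be illustrated in ordinary Python, where a name assigned only inside a loop body stays unbound if the loop runs zero times. This is an analogy only: the function names are hypothetical, and whether `pl.range` with `init_values` and `pl.yield_` leaves the yielded names unbound for an empty range is not confirmed anywhere in this thread.

```python
def last_state_unguarded(init, steps):
    """Return the final accumulated state; fails when steps is empty."""
    state = init
    for step in steps:
        state = state + step
        final = state  # bound only if the loop body runs at least once
    return final


def last_state_guarded(init, steps):
    """Same accumulation, with the result bound before the loop."""
    final = init  # an empty loop now falls back to the initial state
    state = init
    for step in steps:
        state = state + step
        final = state
    return final


print(last_state_guarded(1.0, []))          # empty loop: initial state is returned
print(last_state_guarded(1.0, [2.0, 3.0]))  # non-empty loop: accumulated state

try:
    last_state_unguarded(1.0, [])
except UnboundLocalError as exc:
    print("unguarded version failed:", exc)
```

If the DSL already returns the `init_values` for a zero-trip loop, the pre-loop assignment is redundant but harmless. If it does not, the assignment is what keeps the `sparse_k == 1` case defined.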

Update: reduce DeepSeek V4 sparse attention tasks

2c02f00

- Fuse sparse attention online softmax into one device task per head tile
- Pack grouped output rows per batch instead of per head
- Keep attention_swa tensor setup chunked for larger T scaling

high-cloud force-pushed the update/deepseek-v4-swa-attn-perf branch from 317296d to 2c02f00 · May 18, 2026 01:27

high-cloud changed the title ~~Update: reduce DeepSeek V4 SWA attention tasks~~ Update: reduce DeepSeek V4 sparse attention tasks May 18, 2026

high-cloud marked this pull request as draft May 18, 2026 06:13

high-cloud wants to merge 1 commit into hw-native-sys:main from high-cloud:update/deepseek-v4-swa-attn-perf
