perf(dsv4): spread decode_indexer rope loop across cores via output de-aliasing#358
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR refactors the DeepSeek-V4 decode_indexer kernel's tiling strategy: it introduces head and batch group-chunking parameters to enable coarser-grained task parallelism, updates the ROPE→Hadamard→INT8 pipeline to process multiple heads per task using a fresh intermediate tensor, and rewrites score computation to batch-group accumulation using global buffers for safe disjoint slice writes. ChangesKernel Tiling and Dataflow Optimization
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
models/deepseek/v4/decode_indexer.py (1)
64-74: ⚡ Quick winMake group selection divisor-aware to avoid unnecessary config rejection.
Line 68 and Line 72 choose fixed groups (
4/8) and then assert divisibility, which can reject otherwise valid static configs (for example oddT, orBnot divisible by 8). Prefer picking the largest supported divisor first, then keep the asserts as sanity checks.♻️ Suggested diff
-HEAD_GROUP = 2 if T >= 2 else 1 +HEAD_GROUP = 2 if (T >= 2 and T % 2 == 0) else 1 HEAD_ROWS = IDX_N_HEADS * HEAD_GROUP ... -HEAD_GROUP_ROPE = 4 if T >= 4 else HEAD_GROUP +HEAD_GROUP_ROPE = 4 if (T >= 4 and T % 4 == 0) else HEAD_GROUP HEAD_ROWS_ROPE = IDX_N_HEADS * HEAD_GROUP_ROPE ... -SCORE_B_GROUP = 8 if B >= 8 else B +SCORE_B_GROUP = 8 if (B >= 8 and B % 8 == 0) else (4 if B % 4 == 0 else (2 if B % 2 == 0 else 1)) assert B % SCORE_B_GROUP == 0, "B must be divisible by SCORE_B_GROUP"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@models/deepseek/v4/decode_indexer.py` around lines 64 - 74, The group-selection logic uses fixed constants (4 and 8) then asserts divisibility, which can reject valid configs; change HEAD_GROUP_ROPE and SCORE_B_GROUP to choose the largest supported divisor of T and B respectively (for example pick 4 if T % 4 == 0, else fall back to HEAD_GROUP/2/1 as appropriate; pick 8 if B % 8 == 0, else try 4,2,1) before computing HEAD_ROWS_ROPE and using SCORE_B_GROUP, and keep the existing asserts as sanity checks; update references to HEAD_GROUP_ROPE, HEAD_ROWS_ROPE, and SCORE_B_GROUP so they are derived from that divisor-selection logic rather than hardcoded 4/8.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@models/deepseek/v4/decode_indexer.py`:
- Around line 64-74: The group-selection logic uses fixed constants (4 and 8)
then asserts divisibility, which can reject valid configs; change
HEAD_GROUP_ROPE and SCORE_B_GROUP to choose the largest supported divisor of T
and B respectively (for example pick 4 if T % 4 == 0, else fall back to
HEAD_GROUP/2/1 as appropriate; pick 8 if B % 8 == 0, else try 4,2,1) before
computing HEAD_ROWS_ROPE and using SCORE_B_GROUP, and keep the existing asserts
as sanity checks; update references to HEAD_GROUP_ROPE, HEAD_ROWS_ROPE, and
SCORE_B_GROUP so they are derived from that divisor-selection logic rather than
hardcoded 4/8.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: cf192232-1902-48d7-8826-fac980dad1ad
📒 Files selected for processing (1)
models/deepseek/v4/decode_indexer.py
There was a problem hiding this comment.
Code Review
This pull request introduces group-chunking and task folding optimizations to the decode_indexer.py script. By grouping multiple tokens and batches into single tasks, the changes amortize task launch overhead and improve parallelization across hardware cores. Key modifications include the introduction of intermediate tensors to avoid read-after-write hazards and the restructuring of loops for ROPE, Hadamard, and scoring operations to utilize these grouped tasks. I have no feedback to provide.
…e-aliasing Two layers of latency reduction on the DeepSeek-V4 decode indexer, validated on a2a3 (idx_kv_cache + score PASS, precision-neutral; topk_idxs is a known pre-existing pypto-HEAD issue, unchanged): 1. GROUP-chunking of the per-head/per-batch loops (rope GRP=4, qr_hadamard split at GRP=4 from quant at GRP=2, score loop SCORE_B_GROUP=8) to amortize per-task launch overhead over taller, numerically-identical tiles. 2. De-alias the rope loop output. rope_slice read qr_proj_flat and rope_write wrote back in-place into its ROPE columns; although the 32 iterations touch disjoint rows, the scheduler could not disambiguate the read+write (RAW/WAR) aliasing on qr_proj_flat at slice granularity and serialized all iterations onto a single cube+vector core pair (the rope window was 67-76% of total wall-clock). Routing rope output to a fresh qr_rope_out tensor keeps qr_proj_flat read-only; qr_hadamard then K-splits its matmul (NOPE half from qr_proj_flat, ROPE half from qr_rope_out). The scheduler now spreads the rope iterations across ~18 cube cores: rope window 863-1357us -> 128us, total wall-clock ~880us (stable) vs 1293-1808us baseline.
4b79863 to
e182a36
Compare
Summary
rope_slicereadqr_proj_flatandrope_writewrote back in-place into its ROPE columns. Although the 32 per-o0iterations touch disjoint rows, the scheduler could not disambiguate the read+write (RAW/WAR) aliasing onqr_proj_flatat slice granularity and serialized every iteration onto a single cube+vector core pair — the rope window was 67–76% of total wall-clock.qr_rope_outtensor soqr_proj_flatstays read-only across the loop;qr_hadamardthen K-splits its matmul (NOPE half fromqr_proj_flat, ROPE half fromqr_rope_out). The scheduler now spreads the rope iterations across ~18 cube cores: rope window 863–1357us → 128us, total wall-clock ~880us (stable) vs 1293–1808us baseline.qr_hadamardsplit at GRP=4 from quant at GRP=2, score loopSCORE_B_GROUP=8) to amortize per-task launch overhead over taller, numerically-identical tiles.idx_kv_cacheandscorePASS, precision-neutral (scoremax_error_ratio unchanged at 0.005).topk_idxsFAIL is a known pre-existing pypto-HEAD issue and is unchanged by this PR.Related Issues
N/A