spec-dec: support device-aware recurrent GPU tape placement on ROCm multi-GPU#32
spec-dec: support device-aware recurrent GPU tape placement on ROCm multi-GPU#32nycdubliner wants to merge 2 commits into
Conversation
…ulti-GPU For multi-GPU ROCm/HIP setups, allocating all spec-dec tape buffers on a single device (e.g. ROCm0) causes execution failures or severe bottlenecks when recurrent layers are split across devices (e.g. ROCm1 cannot access ROCm0 source views). This commit enables device-aware recurrent GPU tape placement: - Tape buffers are allocated per recurrent layer on the exact physical GPU device assigned to compile/run that layer. - During DFlash tape capture and direct GPU replay, state memory is verified to reside on the local physical GPU before executing. - Non-local device memory accesses and associated ROCm IPC/peer faults are avoided. - A fallback path to CPU tape capture and replay is preserved when `GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=0` is set or on single-GPU setups. - Retained optimized scheduler callback-mode synchronization checks to minimize overhead when Hidden/GDN callback evaluation is enabled.
nycdubliner
left a comment
There was a problem hiding this comment.
Expert PR Review — Claude Opus 4.6 (Thinking)
Model under test: Qwen 3.6 27B (Q5_K_S target / Q4_K_M DFlash drafter)
Hardware: 2× RX 7800 XT, ROCm/HIP gfx1101
Overall Assessment
This is a well-structured PR that addresses a real performance bottleneck. The core idea — per-layer device-aware tape buffer allocation using model.dev_layer(il) — is the correct architectural approach. The diff is 222+/126- across 5 files, touching the right subsystems. The ROCm backend-name awareness ("CUDA" || "ROCm") is applied consistently. Good first contribution.
Verdict: Approve with requested changes (mostly documentation and one correctness concern).
What works well
-
allocate_tape_gpurefactor — Per-layer ggml context + buffer allocation keyed tomodel.dev_layer(il)viabackend_for_dev(). This is the right fix. Each tape layer'sbuf,ctx, anddevare now owned at layer granularity with correct cleanup in the destructor. -
dflash_gpu_backend_reg()helper — Centralizes the CUDA-then-ROCm registry lookup, eliminating 5 separateggml_backend_reg_by_name("CUDA")callsites. Clean dedup. -
dflash_is_cuda_compatible_tensor()helper — Replaces 3 localis_cuda_tensorlambdas with a shared function that checks both"CUDA"and"ROCm"buffer names. Good. -
build_recurrent_copy_plan/copy_cellinllama-memory-recurrent.cpp— Removing the singlecopy_plan_deviceand tracking per-entry device with per-deviceset_device/sync_devicecalls. This is the correct multi-GPU D2D pattern. -
tape_replay_gdn_direct_gpudevice validation — Now checks that ALL inputs (state, k, v, gate, beta) are on the same device before launching, and correctly setsreplay_device = -2for heterogeneous launches with per-ptr sync. Solid. -
dflash_memory_seq_cp_recurrent_ordered— Removed themodel.n_devices() > 1early-return gate and replaced it with per-backend sync across all GPU devices. Correct. -
ggml scheduler callback gate — Moving
ggml_backend_synchronize(split_backend)inside theif (need)block is a targeted optimization that avoids unnecessary CPU-GPU sync when no callback is registered for a tensor. This is safe because theneedflag already tracks callback presence.
Requested Changes
1. allocate_hidden_gpu still has an ungated n_devices() > 1 early return
// llama-context.cpp L1969-1973 (unmodified in this diff)
if (model.n_devices() > 1) {
dflash_capture->hidden_gpu.clear();
...
return;
}This means multi-GPU tape is enabled but hidden GPU capture always falls back to eval callbacks. The PR body says "hidden capture stays on eval callback" — so this is intentional. But it's not documented in the code. Please add a comment at L1969 explaining why hidden GPU capture is not yet enabled for multi-GPU (e.g., "// Hidden GPU capture requires same-device graph output tensors; not yet supported for multi-GPU layer splits").
2. allocate_prefill_gpu still has an ungated n_devices() > 1 early return
// llama-context.cpp L2061-2063 (unmodified in this diff)
if (model.n_devices() > 1) {
return false;
}Same situation — this is probably intentional but should have a comment noting it's not yet multi-GPU aware.
3. Env var GGML_DFLASH_ALLOW_MULTI_GPU_TAPE behavior
The implementation at L1138:
return env && env[0] != '\0' && std::strcmp(env, "0") != 0;This means =1, =yes, =banana all enable it. This matches the pattern used by GGML_DFLASH_GPU_RING, so it's consistent — good. But the env var is undocumented. Please add at minimum:
- A one-line comment at the
dflash_allow_multi_gpu_tape()function explaining when/why to use it - A note in the PR body about the interaction: GPU Ring (
GGML_DFLASH_GPU_RING) controls the cross-attention ring path; this env var controls the recurrent tape path. They are independent.
4. set_dflash_gpu_capture forced-on behavior
dflash_capture->gpu_capture_enabled =
enabled || (model.n_devices() > 1 && dflash_allow_multi_gpu_tape());This means if the server calls set_dflash_gpu_capture(false) but the env var is set, GPU capture stays enabled on multi-GPU. This is probably correct for the tape path, but it means an explicit disable request is silently overridden. Consider at least a LLAMA_LOG_INFO when this override fires, so users don't get confused if they try to force CPU-only.
5. std::map include
Adding #include <map> for the layers_by_dev diagnostic log is fine, but the project style prefers avoiding "fancy-looking modern STL constructs" (CONTRIBUTING.md). A simple loop counting devices would avoid the include. Not blocking — just flagging for awareness.
6. touched_devices vector in copy_cell hot path
std::vector<int> touched_devices; // allocated every copy_cell callcopy_cell is called per-cell during recurrent state management. Allocating a std::vector on every call adds heap pressure. Consider a small fixed-size array (max 8 devices) or moving touched_devices into the class as a reusable scratch buffer, similar to how copy_plan_entries is already a member.
7. AI disclosure
CONTRIBUTING.md requires: "Disclose that AI was used in your PR description." The PR body doesn't currently mention AI assistance. If AI tools were used during development, please add a disclosure line.
Minor / Nit
- L1408 indentation change: The
dflash_eval_callbackblock has a pure re-indent (4 spaces less). This makes the diff noisy but doesn't change logic. Consider splitting whitespace-only changes into a separate commit for cleaner review history. - Unused variable removal:
n_embd_rwas removed fromtape_replay— correct, it was unused after the refactor. Good cleanup. - Double
fn_preparecall intape_replay_gdn_direct_from_cpu_tapeat L2930-2931 still exists (called once in the validation loop, then again in the launch loop). This was pre-existing, not introduced by this PR, but worth noting.
Questions for the author
- Have you tested with
GGML_DFLASH_ALLOW_MULTI_GPU_TAPEunset (not=0, literally absent from environment) to confirm the existing behavior is byte-identical? - Is there a reason
tape_replay_conv_gpuat L2959 still has the hardmodel.n_devices() > 1 → return falsegate? This means conv state replay always falls back to CPU on multi-GPU even with the env var set. Is conv replay not needed for the measured speedup, or is this a follow-up? - The validation matrix in the PR body only shows
kv_report_module. The continuation-state doc shows 3 prompts with varying acceptance (27%–41%). Consider including all 3 in the PR body for reviewer confidence.
Summary
| Area | Status |
|---|---|
| Core tape placement correctness | ✅ Sound — per-layer device-aware allocation |
| ROCm compatibility | ✅ Consistent CUDA/ROCm buffer name + registry handling |
| Multi-GPU D2D copy plan | ✅ Per-device set/sync instead of single-device assumption |
| Replay device validation | ✅ All inputs verified same-device before launch |
| Fallback preservation | ✅ CPU path intact when env var disabled/unset |
| Documentation | |
| Performance hot path | touched_devices vector allocation in copy_cell |
| AI disclosure | |
| Scheduler callback gate | ✅ Safe optimization |
- Documented n_devices() > 1 multi-GPU limitations for hidden and prefill GPU allocations. - Documented role of GGML_DFLASH_ALLOW_MULTI_GPU_TAPE environment variable. - Added warning logging when explicitly disabling GPU capture is overridden by the env var. - Avoid std::map dependency in allocate_tape_gpu by using a vector of structures to count device occurrences. - Move touched_devices vector out of copy_cell hot-path stack to class member to avoid heap pressure.
✅ Review Complete — Ready for Maintainer ReviewAll review items from the initial review have been addressed in commit 4b208f7:
Additional validation noted: Determinism test, VRAM leak profiling (15 requests), multi-turn session stability. Recommendation: This PR is ready to come out of draft and be sent to the maintainer for final review. Review model: Claude Opus 4.6 (Thinking) |
|
Thank you. I'm currently investigating rebasing to latest llama.cpp, so it might take some time before I review it. |
Phase 2 Follow-up Review —
|
|
@Anbeeld Thanks for your work. :) Just in case it's useful, I've a follow up PR clearing out the rest of the Multi-GPU ROCm stuff I found here: |
Goal / Overview
This PR hardens and optimizes the speculative decoding DFlash tape-placement path for multi-GPU ROCm/HIP environments.
Problem
Previously, on multi-GPU targets, speculative decoding tape buffers (specifically recurrent memory hidden/state buffers) were allocated entirely on a single GPU (usually ROCm0/device 0). When recurrent layers were split across devices (e.g. some on ROCm0, some on ROCm1), the backend code for direct GPU replay failed shape/device-visibility checks. Specifically, ROCm1 could not access a source view whose buffer was resident on ROCm0 for graph
SETexecution, resulting in crash or fallback to extremely slow CPU paths.Solution
dflash_is_cuda_compatible_tensor/fn_ptr_device) to verify that all inputs/outputs for GDN direct GPU replay are on the local physical device before enqueueing.GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=0is set or on single-GPU setups.if (need) { ggml_backend_synchronize(...); }) to minimize CPU-GPU callback synchronization gates.touched_devicestracking vector incopy_cellto a class member scratch allocationcopy_plan_touched_devicesto avoid hot-path heap allocations and reduce pressure.<map>Dependency: Replacedstd::mapcounting of backend devices inallocate_tape_gpuwith a simple vector struct array to align with project style guidelines.Environment Variables & Controls
GGML_DFLASH_GPU_RING: Controls the cross-attention ring path for spec-dec tree verification.GGML_DFLASH_ALLOW_MULTI_GPU_TAPE: Controls the recurrent memory tape path (GPU/CPU placement).Note
These environment variables act independently.
GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=1enables device-aware recurrent GPU tape routing on multi-GPU targets, while cross-attention hidden-state capture stays on eval callbacks.Validation Matrix (Dual AMD Radeon RX 7800 XT, ROCm 7.2.3)
Tested on the prompt
dflash-prompts/kv_report_module.txt(512 tokens):GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=1GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=0Determinsim Test: Validated that output responses produced under all three runs are textually identical.
Leak Profiling Test: Verified VRAM allocation remains flat and doesn't leak under 15 consecutive requests.
Multi-Turn Test: Validated multi-turn conversational session continuation without segment faults or VRAM corruption.
Answers to Reviewer Questions
tape_replay_conv_gpustill have themodel.n_devices() > 1 -> return falsegate?The convolution states (
r_l) are extremely small (kernel width is 4) compared to the recurrent GDN state matrices (s_l). While GDN state replay was a critical multi-GPU bottleneck requiringtape_replay_gdn_direct_gpu, conv state CPU/fallback execution (tape_replay_conv) has virtually zero performance impact on decode speed. Keeping the multi-GPU conv replay on the fallback path avoids complex multi-device scheduling logic for tiny buffers while fully preserving the speedup.How to Reproduce / Benchmark Commands
Run the commands below from the
model-testingparent directory containingdflash_harness.py.1. Vanilla Baseline Benchmark Command
2. DFlash Patched (GPU Tape Enabled) Benchmark Command
3. DFlash Fallback (GPU Tape Disabled) Benchmark Command
Disclosure: This Pull Request was developed with the assistance of AI coding assistants (Antigravity).