
spec-dec: support device-aware recurrent GPU tape placement on ROCm multi-GPU#32

Open
nycdubliner wants to merge 2 commits into Anbeeld:main from nycdubliner:rocm-multi-gpu-tape

@nycdubliner nycdubliner commented May 23, 2026

Goal / Overview

This PR hardens and optimizes the speculative decoding DFlash tape-placement path for multi-GPU ROCm/HIP environments.

Problem

Previously, on multi-GPU targets, speculative decoding tape buffers (specifically the recurrent-memory hidden/state buffers) were all allocated on a single GPU (usually ROCm0, device 0). When recurrent layers were split across devices (e.g. some on ROCm0, some on ROCm1), the backend code for direct GPU replay failed its shape/device-visibility checks: ROCm1 could not access a source view whose buffer was resident on ROCm0 for graph SET execution. The result was either a crash or a fallback to the much slower CPU path.

Solution

  1. Device-Aware Recurrent Tape Buffers: Tape buffers are now allocated per recurrent layer, on the physical GPU device that runs that layer.
  2. Local GPU Visibility Replay: Resident memory placement is checked (dflash_is_cuda_compatible_tensor / fn_ptr_device) to verify that all inputs/outputs for GDN direct GPU replay are on the local physical device before enqueueing.
  3. CPU Fallback Preservation: Safely preserved the CPU tape capture/replay pathway when GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=0 is set or on single-GPU setups.
  4. Optimized Callback Sync Gate: Retained optimized scheduler callback-mode synchronization checks (if (need) { ggml_backend_synchronize(...); }) to minimize CPU-GPU callback synchronization gates.
  5. Class Scratch Allocation: Moved the touched_devices tracking vector in copy_cell to a class member scratch buffer, copy_plan_touched_devices, to avoid hot-path heap allocations and reduce heap pressure.
  6. Removed <map> Dependency: Replaced std::map counting of backend devices in allocate_tape_gpu with a simple vector of structs to align with project style guidelines.
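
The first two points can be illustrated with a minimal sketch. It is not the PR's code: tape_layer, place_tape_per_layer and can_replay_locally are hypothetical names, and a plain vector of device ids stands in for model.dev_layer(il).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a per-layer tape buffer; the real structs in the PR differ.
struct tape_layer {
    int         layer;   // recurrent layer index
    int         device;  // device that owns this layer's tape buffer
    std::size_t bytes;   // size of the tape buffer
};

// Place each recurrent layer's tape on the device that runs that layer.
// layer_device[il] plays the role of model.dev_layer(il).
std::vector<tape_layer> place_tape_per_layer(const std::vector<int> & layer_device,
                                             std::size_t bytes_per_layer) {
    std::vector<tape_layer> tape;
    tape.reserve(layer_device.size());
    for (std::size_t il = 0; il < layer_device.size(); ++il) {
        assert(layer_device[il] >= 0 && "layer must be assigned to a device");
        tape.push_back({ (int) il, layer_device[il], bytes_per_layer });
    }
    return tape;
}

// Direct GPU replay is only attempted when every operand lives on the executing device.
bool can_replay_locally(int exec_device, const std::vector<int> & operand_devices) {
    for (int dev : operand_devices) {
        if (dev != exec_device) {
            return false; // caller would fall back to the CPU replay path
        }
    }
    return true;
}
```

With a layer split of {0, 0, 1, 1}, layers 2 and 3 get their tape on device 1, so a replay executing on device 1 never has to read a buffer resident on device 0.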

Environment Variables & Controls

  • GGML_DFLASH_GPU_RING: Controls the cross-attention ring path for spec-dec tree verification.
  • GGML_DFLASH_ALLOW_MULTI_GPU_TAPE: Controls the recurrent memory tape path (GPU/CPU placement).

Note

These environment variables act independently. GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=1 enables device-aware recurrent GPU tape routing on multi-GPU targets, while cross-attention hidden-state capture stays on eval callbacks.


Validation Matrix (Dual AMD Radeon RX 7800 XT, ROCm 7.2.3)

Tested on the prompt dflash-prompts/kv_report_module.txt (generation capped at 512 tokens via --max-tokens 512):

  • Vanilla Baseline: Speculative decoding disabled.
    • Result: 18.40 tok/s
  • DFlash Patched (GPU Tape Enabled): GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=1
    • Result: 28.53 tok/s (1.55x speedup), 41.30% acceptance rate, zero errors.
  • DFlash Fallback (GPU Tape Disabled): GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=0
    • Result: 10.18 tok/s (Falls back cleanly to CPU recurrent replay, no faults or memory errors observed).
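
The ratios behind these rows can be rechecked with a few lines of arithmetic. The sketch below only restates the three throughput numbers above; the function names are illustrative.

```cpp
#include <cassert>
#include <cstdio>

// Ratio of a measured decode rate to the vanilla baseline.
double speedup(double tok_per_s, double baseline_tok_per_s) {
    assert(baseline_tok_per_s > 0.0);
    return tok_per_s / baseline_tok_per_s;
}

// Prints the two DFlash rows relative to the 18.40 tok/s baseline.
void print_validation_ratios() {
    const double baseline = 18.40;
    std::printf("GPU tape: %.2fx\n", speedup(28.53, baseline));
    std::printf("CPU tape: %.2fx\n", speedup(10.18, baseline));
}
```

The GPU tape row works out to about 1.55x, matching the figure reported above. The CPU fallback row works out to about 0.55x, i.e. on this setup the fallback is slower than running without speculative decoding at all.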

Determinism Test: Validated that the output responses produced by all three runs are textually identical.
Leak Profiling Test: Verified that VRAM allocation remains flat and does not leak over 15 consecutive requests.
Multi-Turn Test: Validated multi-turn conversational session continuation without segmentation faults or VRAM corruption.


Answers to Reviewer Questions

  • Why does tape_replay_conv_gpu still have the model.n_devices() > 1 -> return false gate?
    The convolution states (r_l) are extremely small (kernel width is 4) compared to the recurrent GDN state matrices (s_l). While GDN state replay was a critical multi-GPU bottleneck requiring tape_replay_gdn_direct_gpu, conv state CPU/fallback execution (tape_replay_conv) has virtually zero performance impact on decode speed. Keeping the multi-GPU conv replay on the fallback path avoids complex multi-device scheduling logic for tiny buffers while fully preserving the speedup.

How to Reproduce / Benchmark Commands

Run the commands below from the model-testing parent directory containing dflash_harness.py.

1. Vanilla Baseline Benchmark Command

./dflash_harness.py \
  --runtime "vanilla-refresh-post-clean" \
  --model "unsloth/Qwen3.6-27B-GGUF:Q5_K_S" \
  --draft "" \
  --prompt-file "dflash-prompts/kv_report_module.txt" \
  --prompt-name "kv_report_module" \
  --max-tokens 512 \
  --repo-commit "07ac3cec6-patched" \
  --batch 512 --ubatch 128 --ctx 8192 --cache-k q5_0 --cache-v q4_1 --adaptive off --gpu-ring 0 --split layer \
  --server-cmd "cd beellama.cpp && ./build/bin/llama-server -hf unsloth/Qwen3.6-27B-GGUF:Q5_K_S --no-mmproj -ngl all -sm layer -ts 1,1 -c 8192 -np 1 --kv-unified -b 512 -ub 128 --cache-type-k q5_0 --cache-type-v q4_1 --flash-attn on -fit off --cache-ram 0 -rea off --host 0.0.0.0 --port 8082"

2. DFlash Patched (GPU Tape Enabled) Benchmark Command

GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=1 ./dflash_harness.py \
  --runtime "dflash-patched-gpu-tape" \
  --model "unsloth/Qwen3.6-27B-GGUF:Q5_K_S" \
  --draft "Anbeeld/Qwen3.6-27B-DFlash-GGUF:Q4_K_M" \
  --prompt-file "dflash-prompts/kv_report_module.txt" \
  --prompt-name "kv_report_module" \
  --max-tokens 512 \
  --repo-commit "07ac3cec6-patched" \
  --n-max 12 --cross-ctx 512 --draft-ctx 2048 --batch 512 --ubatch 128 --ctx 8192 --cache-k q5_0 --cache-v q4_1 --adaptive off --gpu-ring 0 --split layer \
  --server-cmd "cd beellama.cpp && ./build/bin/llama-server -hf unsloth/Qwen3.6-27B-GGUF:Q5_K_S --no-mmproj --spec-draft-hf Anbeeld/Qwen3.6-27B-DFlash-GGUF:Q4_K_M --spec-type dflash --spec-branch-budget 0 --spec-dflash-cross-ctx 512 --spec-draft-ctx-size 2048 --spec-draft-n-max 12 --no-spec-dm-adaptive -ngl all --spec-draft-ngl all -sm layer -ts 1,1 -c 8192 -np 1 --kv-unified -b 512 -ub 128 --cache-type-k q5_0 --cache-type-v q4_1 --flash-attn on -fit off --cache-ram 0 -rea off --host 0.0.0.0 --port 8082"

3. DFlash Fallback (GPU Tape Disabled) Benchmark Command

GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=0 ./dflash_harness.py \
  --runtime "dflash-fallback-cpu-tape" \
  --model "unsloth/Qwen3.6-27B-GGUF:Q5_K_S" \
  --draft "Anbeeld/Qwen3.6-27B-DFlash-GGUF:Q4_K_M" \
  --prompt-file "dflash-prompts/kv_report_module.txt" \
  --prompt-name "kv_report_module" \
  --max-tokens 512 \
  --repo-commit "07ac3cec6-patched" \
  --n-max 12 --cross-ctx 512 --draft-ctx 2048 --batch 512 --ubatch 128 --ctx 8192 --cache-k q5_0 --cache-v q4_1 --adaptive off --gpu-ring 0 --split layer \
  --server-cmd "cd beellama.cpp && ./build/bin/llama-server -hf unsloth/Qwen3.6-27B-GGUF:Q5_K_S --no-mmproj --spec-draft-hf Anbeeld/Qwen3.6-27B-DFlash-GGUF:Q4_K_M --spec-type dflash --spec-branch-budget 0 --spec-dflash-cross-ctx 512 --spec-draft-ctx-size 2048 --spec-draft-n-max 12 --no-spec-dm-adaptive -ngl all --spec-draft-ngl all -sm layer -ts 1,1 -c 8192 -np 1 --kv-unified -b 512 -ub 128 --cache-type-k q5_0 --cache-type-v q4_1 --flash-attn on -fit off --cache-ram 0 -rea off --host 0.0.0.0 --port 8082"

Disclosure: This Pull Request was developed with the assistance of AI coding assistants (Antigravity).

…ulti-GPU

For multi-GPU ROCm/HIP setups, allocating all spec-dec tape buffers on a single
device (e.g. ROCm0) causes execution failures or severe bottlenecks when recurrent
layers are split across devices (e.g. ROCm1 cannot access ROCm0 source views).

This commit enables device-aware recurrent GPU tape placement:
- Tape buffers are allocated per recurrent layer on the exact physical GPU
  device assigned to compile/run that layer.
- During DFlash tape capture and direct GPU replay, state memory is verified
  to reside on the local physical GPU before executing.
- Non-local device memory accesses and associated ROCm IPC/peer faults are avoided.
- A fallback path to CPU tape capture and replay is preserved when
  `GGML_DFLASH_ALLOW_MULTI_GPU_TAPE=0` is set or on single-GPU setups.
- Retained optimized scheduler callback-mode synchronization checks to minimize
  overhead when Hidden/GDN callback evaluation is enabled.
@github-actions github-actions Bot added the ggml label May 23, 2026

@nycdubliner nycdubliner left a comment

Expert PR Review — Claude Opus 4.6 (Thinking)

Model under test: Qwen 3.6 27B (Q5_K_S target / Q4_K_M DFlash drafter)
Hardware: 2× RX 7800 XT, ROCm/HIP gfx1101


Overall Assessment

This is a well-structured PR that addresses a real performance bottleneck. The core idea — per-layer device-aware tape buffer allocation using model.dev_layer(il) — is the correct architectural approach. The diff is 222+/126- across 5 files, touching the right subsystems. The ROCm backend-name awareness ("CUDA" || "ROCm") is applied consistently. Good first contribution.

Verdict: Approve with requested changes (mostly documentation and one correctness concern).


What works well

  1. allocate_tape_gpu refactor — Per-layer ggml context + buffer allocation keyed to model.dev_layer(il) via backend_for_dev(). This is the right fix. Each tape layer's buf, ctx, and dev are now owned at layer granularity with correct cleanup in the destructor.

  2. dflash_gpu_backend_reg() helper — Centralizes the CUDA-then-ROCm registry lookup, eliminating 5 separate ggml_backend_reg_by_name("CUDA") callsites. Clean dedup.

  3. dflash_is_cuda_compatible_tensor() helper — Replaces 3 local is_cuda_tensor lambdas with a shared function that checks both "CUDA" and "ROCm" buffer names. Good.

  4. build_recurrent_copy_plan / copy_cell in llama-memory-recurrent.cpp — Removing the single copy_plan_device and tracking per-entry device with per-device set_device/sync_device calls. This is the correct multi-GPU D2D pattern.

  5. tape_replay_gdn_direct_gpu device validation — Now checks that ALL inputs (state, k, v, gate, beta) are on the same device before launching, and correctly sets replay_device = -2 for heterogeneous launches with per-ptr sync. Solid.

  6. dflash_memory_seq_cp_recurrent_ordered — Removed the model.n_devices() > 1 early-return gate and replaced it with per-backend sync across all GPU devices. Correct.

  7. ggml scheduler callback gate — Moving ggml_backend_synchronize(split_backend) inside the if (need) block is a targeted optimization that avoids unnecessary CPU-GPU sync when no callback is registered for a tensor. This is safe because the need flag already tracks callback presence.
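
Item 7 can be illustrated with a small model of the scheduler loop. This is a sketch under assumptions, not the ggml scheduler: count_syncs is a hypothetical function, and a counter stands in for ggml_backend_synchronize(split_backend).

```cpp
#include <cassert>
#include <vector>

// Counts backend synchronizations for one graph split.
// need[i] models the `need` flag: true when the eval callback asked to observe node i.
int count_syncs(const std::vector<bool> & need, bool sync_only_when_needed) {
    int syncs = 0;
    for (bool n : need) {
        // (the node would be enqueued on the split backend here)
        if (n || !sync_only_when_needed) {
            ++syncs; // stands in for ggml_backend_synchronize(split_backend)
        }
    }
    return syncs;
}

void check_sync_gate() {
    const std::vector<bool> need = { true, false, false, true };
    assert(count_syncs(need, false) == 4); // ungated: one sync per node
    assert(count_syncs(need, true)  == 2); // gated: only nodes the callback observes
}
```

The saving scales with the share of nodes the callback does not observe, which is why the gate matters most when only a few tensors per split are captured.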


Requested Changes

1. allocate_hidden_gpu still has an ungated n_devices() > 1 early return

// llama-context.cpp L1969-1973 (unmodified in this diff)
if (model.n_devices() > 1) {
    dflash_capture->hidden_gpu.clear();
    ...
    return;
}

This means multi-GPU tape is enabled but hidden GPU capture always falls back to eval callbacks. The PR body says "hidden capture stays on eval callback" — so this is intentional. But it's not documented in the code. Please add a comment at L1969 explaining why hidden GPU capture is not yet enabled for multi-GPU (e.g., "// Hidden GPU capture requires same-device graph output tensors; not yet supported for multi-GPU layer splits").

2. allocate_prefill_gpu still has an ungated n_devices() > 1 early return

// llama-context.cpp L2061-2063 (unmodified in this diff)
if (model.n_devices() > 1) {
    return false;
}

Same situation — this is probably intentional but should have a comment noting it's not yet multi-GPU aware.

3. Env var GGML_DFLASH_ALLOW_MULTI_GPU_TAPE behavior

The implementation at L1138:

return env && env[0] != '\0' && std::strcmp(env, "0") != 0;

This means =1, =yes, =banana all enable it. This matches the pattern used by GGML_DFLASH_GPU_RING, so it's consistent — good. But the env var is undocumented. Please add at minimum:

  • A one-line comment at the dflash_allow_multi_gpu_tape() function explaining when/why to use it
  • A note in the PR body about the interaction: GPU Ring (GGML_DFLASH_GPU_RING) controls the cross-attention ring path; this env var controls the recurrent tape path. They are independent.
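
The truthiness rule quoted above can be checked in isolation. In this sketch the expression is the one from the review, while env_flag_enabled is a hypothetical wrapper that takes the string directly instead of calling getenv.

```cpp
#include <cassert>
#include <cstring>

// Same expression as quoted above, applied to an already-fetched value.
// nullptr models "variable absent from the environment".
bool env_flag_enabled(const char * env) {
    return env && env[0] != '\0' && std::strcmp(env, "0") != 0;
}

void check_env_flag_cases() {
    assert(!env_flag_enabled(nullptr));  // unset
    assert(!env_flag_enabled(""));       // set but empty
    assert(!env_flag_enabled("0"));      // explicit disable
    assert( env_flag_enabled("1"));
    assert( env_flag_enabled("banana")); // any other non-empty string enables it
    assert( env_flag_enabled("false"));  // note: only the exact string "0" disables
}
```

One consequence worth documenting alongside the variable: values such as "false", "off" or "00" enable the feature under this rule, because only the exact string "0" is treated as off.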

4. set_dflash_gpu_capture forced-on behavior

dflash_capture->gpu_capture_enabled =
    enabled || (model.n_devices() > 1 && dflash_allow_multi_gpu_tape());

This means if the server calls set_dflash_gpu_capture(false) but the env var is set, GPU capture stays enabled on multi-GPU. This is probably correct for the tape path, but it means an explicit disable request is silently overridden. Consider at least a LLAMA_LOG_INFO when this override fires, so users don't get confused if they try to force CPU-only.
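
One way to make the override visible is to compute it explicitly and log when it fires. The sketch below is an assumption about how that could look: capture_decision and resolve_gpu_capture are hypothetical names, and the actual logging call is left to the caller.

```cpp
#include <cassert>

struct capture_decision {
    bool enabled;     // value that would be stored in gpu_capture_enabled
    bool overridden;  // true when an explicit "false" request was ignored
};

// Mirrors: enabled || (n_devices > 1 && allow_multi_gpu_tape)
capture_decision resolve_gpu_capture(bool requested, int n_devices, bool allow_multi_gpu_tape) {
    const bool forced = n_devices > 1 && allow_multi_gpu_tape;
    return { requested || forced, !requested && forced };
}

void check_override_cases() {
    assert( resolve_gpu_capture(false, 2, true).enabled);     // forced on
    assert( resolve_gpu_capture(false, 2, true).overridden);  // caller should log here
    assert(!resolve_gpu_capture(false, 1, true).enabled);     // single GPU: request honored
    assert(!resolve_gpu_capture(true,  2, true).overridden);  // nothing to override
}
```

The caller would emit the log line only when overridden is true, so a normal enable request stays silent.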

5. std::map include

Adding #include <map> for the layers_by_dev diagnostic log is fine, but the project style prefers avoiding "fancy-looking modern STL constructs" (CONTRIBUTING.md). A simple loop counting devices would avoid the include. Not blocking — just flagging for awareness.
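
For a handful of devices, a linear scan over a small vector does the same job as the map. A minimal sketch, assuming the per-layer device ids are available as a vector; dev_count and count_layers_by_dev are hypothetical names.

```cpp
#include <cassert>
#include <vector>

struct dev_count {
    int dev;       // device id
    int n_layers;  // number of tape layers placed on it
};

// Counts layers per device in first-seen order, without std::map.
std::vector<dev_count> count_layers_by_dev(const std::vector<int> & layer_device) {
    std::vector<dev_count> counts;
    for (int dev : layer_device) {
        assert(dev >= 0);
        bool found = false;
        for (dev_count & c : counts) {
            if (c.dev == dev) {
                c.n_layers++;
                found = true;
                break;
            }
        }
        if (!found) {
            counts.push_back({ dev, 1 });
        }
    }
    return counts;
}
```

The scan is quadratic in the number of distinct devices, which is negligible here since it only feeds a diagnostic log and the device count is small.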

6. touched_devices vector in copy_cell hot path

std::vector<int> touched_devices;  // allocated every copy_cell call

copy_cell is called per-cell during recurrent state management. Allocating a std::vector on every call adds heap pressure. Consider a small fixed-size array (max 8 devices) or moving touched_devices into the class as a reusable scratch buffer, similar to how copy_plan_entries is already a member.
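
The reusable-scratch variant could look like the sketch below. It is illustrative only: copy_planner is a hypothetical class, and copy_cell here just collects distinct device ids rather than issuing copies.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

class copy_planner {
public:
    // Records the distinct devices touched by one cell copy and returns how many there were.
    std::size_t copy_cell(const std::vector<int> & entry_devices) {
        touched_devices.clear(); // drops the elements but keeps the allocated capacity
        for (int dev : entry_devices) {
            if (std::find(touched_devices.begin(), touched_devices.end(), dev) == touched_devices.end()) {
                touched_devices.push_back(dev);
            }
        }
        return touched_devices.size();
    }

    std::size_t scratch_capacity() const { return touched_devices.capacity(); }

private:
    std::vector<int> touched_devices; // member scratch buffer, reused across calls
};
```

After the first call has grown the buffer, later calls that touch no more devices than before reuse the same storage, so the hot path performs no further heap allocation.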

7. AI disclosure

CONTRIBUTING.md requires: "Disclose that AI was used in your PR description." The PR body doesn't currently mention AI assistance. If AI tools were used during development, please add a disclosure line.


Minor / Nit

  • L1408 indentation change: The dflash_eval_callback block has a pure re-indent (4 spaces less). This makes the diff noisy but doesn't change logic. Consider splitting whitespace-only changes into a separate commit for cleaner review history.
  • Unused variable removal: n_embd_r was removed from tape_replay — correct, it was unused after the refactor. Good cleanup.
  • Double fn_prepare call in tape_replay_gdn_direct_from_cpu_tape at L2930-2931 still exists (called once in the validation loop, then again in the launch loop). This was pre-existing, not introduced by this PR, but worth noting.

Questions for the author

  1. Have you tested with GGML_DFLASH_ALLOW_MULTI_GPU_TAPE unset (not =0, literally absent from environment) to confirm the existing behavior is byte-identical?
  2. Is there a reason tape_replay_conv_gpu at L2959 still has the hard model.n_devices() > 1 → return false gate? This means conv state replay always falls back to CPU on multi-GPU even with the env var set. Is conv replay not needed for the measured speedup, or is this a follow-up?
  3. The validation matrix in the PR body only shows kv_report_module. The continuation-state doc shows 3 prompts with varying acceptance (27%–41%). Consider including all 3 in the PR body for reviewer confidence.

Summary

| Area | Status |
| --- | --- |
| Core tape placement correctness | ✅ Sound: per-layer device-aware allocation |
| ROCm compatibility | ✅ Consistent CUDA/ROCm buffer name + registry handling |
| Multi-GPU D2D copy plan | ✅ Per-device set/sync instead of single-device assumption |
| Replay device validation | ✅ All inputs verified same-device before launch |
| Fallback preservation | ✅ CPU path intact when env var disabled/unset |
| Documentation | ⚠️ Env var undocumented; ungated paths need comments |
| Performance hot path | ⚠️ touched_devices vector allocation in copy_cell |
| AI disclosure | ⚠️ Missing per CONTRIBUTING.md |
| Scheduler callback gate | ✅ Safe optimization |

- Documented n_devices() > 1 multi-GPU limitations for hidden and prefill GPU allocations.
- Documented role of GGML_DFLASH_ALLOW_MULTI_GPU_TAPE environment variable.
- Added warning logging when explicitly disabling GPU capture is overridden by the env var.
- Avoid std::map dependency in allocate_tape_gpu by using a vector of structures to count device occurrences.
- Move touched_devices vector out of copy_cell hot-path stack to class member to avoid heap pressure.
@nycdubliner

✅ Review Complete — Ready for Maintainer Review

All review items from the initial review have been addressed in commit 4b208f7:

  • ✅ allocate_hidden_gpu / allocate_prefill_gpu multi-GPU gates documented
  • ✅ Env var documented in PR body with interaction table
  • ✅ Override warning log added to set_dflash_gpu_capture
  • ✅ std::map replaced with a simple vector of structs
  • ✅ touched_devices moved to class member (copy_plan_touched_devices)
  • ✅ AI disclosure added
  • ✅ tape_replay_conv_gpu multi-GPU gate clarified (follow-up item, not a blocker; the CPU path handles it)

Additional validation noted: Determinism test, VRAM leak profiling (15 requests), multi-turn session stability.

Recommendation: This PR is ready to come out of draft and be sent to the maintainer for final review.

Review model: Claude Opus 4.6 (Thinking)

@nycdubliner nycdubliner marked this pull request as ready for review May 23, 2026 14:19
@Anbeeld

Anbeeld commented May 23, 2026

Thank you. I'm currently investigating rebasing to latest llama.cpp, so it might take some time before I review it.

@nycdubliner

Phase 2 Follow-up Review — 7fd860667 (branch rocm-multi-gpu-conv-replay)

This is a pre-review of the Phase 2 work (GPU conv replay + hidden-state capture + GPU ring on multi-GPU) ahead of a formal PR once #32 merges.

All five blocking/important issues from the previous review have been addressed. Full item-by-item:

✅ Dynamic P2P gating: cudaDeviceCanAccessPeer + cudaDeviceEnablePeerAccess is now called lazily inside dflash_cross_ring_gpu_write_d2d on first cross-device write, gated by a static bool peer_enabled[GGML_CUDA_MAX_DEVICES][GGML_CUDA_MAX_DEVICES]. Both directions are enabled; cudaGetLastError() correctly absorbs cudaErrorPeerAccessAlreadyEnabled if called twice. GGML_CUDA_P2P is no longer a prerequisite. ✓

✅ Dead struct fields: dflash_hidden_gpu_layer removed. buf/ctx fields removed from dflash_hidden_gpu. Destructor cleaned up. ✓

✅ API deduplication: llama_dflash_allow_multi_gpu_tape() is now a single public function declared in llama.h, implemented in llama-context.cpp. Duplicate in common/speculative.cpp removed. ✓

✅ GGML_CUDA_MAX_DEVICES: all three [32] magic values replaced. #ifndef guard in cross-ring-interleave.cu is appropriate for standalone compilation. ✓

✅ Self-gating comment: comment added at tape_replay_conv call site explaining unconditional GPU attempt. ✓


One nit (not blocking): peer_enabled is a static local with a non-atomic read-modify-write. Concurrent first calls for the same device pair could both invoke cudaDeviceEnablePeerAccess — this is safe because the duplicate call returns cudaErrorPeerAccessAlreadyEnabled which is already cleared by cudaGetLastError(). Worth a one-line comment for future readers.
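
The duplicate-call case described in this nit can be modeled without a GPU. The sketch below is a sequential mock and does not demonstrate thread safety: fake_runtime, peer_gate, peer_status and MAX_DEVICES are hypothetical stand-ins for the CUDA/HIP runtime and the static peer_enabled table.

```cpp
#include <cassert>

constexpr int MAX_DEVICES = 4; // hypothetical bound for this sketch

enum class peer_status { ok, already_enabled };

// Stand-in for the GPU runtime: a second enable call for the same
// pair reports already_enabled instead of failing hard.
struct fake_runtime {
    bool mapped[MAX_DEVICES][MAX_DEVICES] = {};
    int  enable_calls = 0;

    peer_status enable_peer_access(int from, int to) {
        ++enable_calls;
        if (mapped[from][to]) {
            return peer_status::already_enabled;
        }
        mapped[from][to] = true;
        return peer_status::ok;
    }
};

// Lazy gate: enables both directions on first use of a device pair.
struct peer_gate {
    bool peer_enabled[MAX_DEVICES][MAX_DEVICES] = {};

    void ensure(fake_runtime & rt, int a, int b) {
        assert(a >= 0 && a < MAX_DEVICES && b >= 0 && b < MAX_DEVICES);
        if (peer_enabled[a][b]) {
            return; // already handled, no runtime call
        }
        // ok and already_enabled both leave the pair mapped, so the status is ignored
        (void) rt.enable_peer_access(a, b);
        (void) rt.enable_peer_access(b, a);
        peer_enabled[a][b] = true;
        peer_enabled[b][a] = true;
    }
};
```

A second gate instance calling ensure on the same runtime plays the role of a caller that lost the race: it repeats the enable calls, receives already_enabled, and the mapping is unchanged.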


Benchmark validation noted: 31.32 tok/s (1.71×) with cpu_copy=0.000 ms — meaningful improvement over PR #32's 28.53 tok/s (1.55×), consistent with removing hidden capture callback sync overhead.

Verdict: Ready to open as a PR stacked on #32. The nit can go in the PR description rather than a code change.

Review model: Claude Sonnet 4.6 (Thinking)

@nycdubliner

@Anbeeld Thanks for your work. :)

Just in case it's useful, I've a follow up PR clearing out the rest of the Multi-GPU ROCm stuff I found here:
nycdubliner#1
