GPU fixes by illuzen · Pull Request #58 · Quantus-Network/quantus-miner

illuzen · 2026-04-22T05:07:39Z

Turns out our cancellation flag strategy was completely broken. GPUs can't check updates to VRAM mid-dispatch. This changes the cancellation strategy to live entirely in rust. It is defined by --cpu-batch-size and --gpu-batch-size CLI args.

The logs are also improved now.

n13 · 2026-04-22T05:23:05Z

PR #58 Review — "GPU fixes" link

Verdict: approve with minor suggestions. All CI checks pass (fmt / clippy / build & test / bench), I also ran cargo build --workspace, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace --locked, and cargo fmt --check locally on pr-58 — everything is green.

What the PR does (TL;DR)

Removes the in-shader cancel-flag check — GPU dispatches can't reliably observe host buffer writes mid-dispatch, so the old mechanism was silently broken.
Replaces it with host-side cancellation between Rust-driven batches (--gpu-batch-size, --cpu-batch-size).
Introduces a small CancelCheck trait with two implementations: AtomicBoolCancelCheck (bench/example) and JobIdCancelCheck (service). The service now drives cancellation and stale-result detection with a single AtomicU64 counter — eliminating the previous "set flag → sleep 1 ms → clear flag" dance in WorkerPool::start_job.
Switches job channels from bounded(1) + try_send to unbounded() with an explicit drain-to-latest in each worker.
Greatly expands GPU vendor detection (Blackwell, RDNA 4, Battlemage, M4 Ultra) and adds a fallback warning asking users to report unrecognized hardware.
Improves logging (per-worker/per-job elapsed + formatted hashrate, 0x prefix on hashes).

What's great

Root cause is correct. WGPU/Metal/Vulkan/DX12 don't guarantee host-visible storage-buffer writes become visible to an already-dispatched kernel without a barrier. Moving cancellation out of the shader is the right architectural call, and it also reclaims shader cycles previously spent on (j % cancel_check_interval) == 0 divergent checks.
JobIdCancelCheck is elegant — a single counter both triggers cancellation and disambiguates stale results. Much cleaner than the old cancel-flag + sleep trick in WorkerPool::start_job.
Unbounded channel + drain-to-latest removes the "stale job stuck in queue" failure mode and the need to race a 1 ms sleep against the worker.

Suggestions / things to consider

CPU inner loop now does a runtime modulo every hash (engine-cpu/src/lib.rs:153):

        loop {
            // Check for cancellation every batch_size hashes
            if hash_count.is_multiple_of(self.batch_size) && cancel.is_cancelled() {
                return EngineStatus::Cancelled { hash_count };

self.batch_size is a runtime u64, so is_multiple_of can't fold to a bit-mask/shift and generally lowers to an actual division. The old code was one AtomicBool::load(Relaxed). Poseidon almost certainly dominates, but it's worth microbenchmarking — or replace with a decrementing local counter to fully sidestep the divide:

let mut until_check: u64 = self.batch_size;
loop {
    if until_check == 0 {
        if cancel.is_cancelled() { return EngineStatus::Cancelled { hash_count }; }
        until_check = self.batch_size;
    }
    until_check -= 1;
    // ... hash ...
}

--gpu-batch-size / --cpu-batch-size flags don't flow into benchmark. The Benchmark subcommand uses DEFAULT_GPU_BATCH_SIZE / DEFAULT_CPU_BATCH_SIZE, and the criterion benches hard-code 10_000_000 (GPU) and 10_000 (CPU) — a 10× discrepancy from the service default of 1_000_000 for GPU. Consider either (a) plumbing the flags into Benchmark too, or (b) referencing the CLI DEFAULT_* constants from engine-*/benches/* so the numbers can't drift. Users comparing bench results to real runs will be confused otherwise.
GpuEngine::run_single_batch doesn't actually use &self. It reads only gpu_ctx / resources. Could be a free function or fn run_single_batch(gpu_ctx: &GpuContext, resources: &GpuResources, ...). Tiny style nit.
WorkerPool::cancel() semantics changed silently. Old cancel() set an AtomicBool. New one bumps current_job_id by 1, and then the next start_job() bumps by another 1. That's fine, but it means job IDs in logs become non-contiguous after a disconnect (quic.rs:53). A one-line comment explaining that would save a future maintainer some head-scratching.
Dropped public API on GpuEngine (new, try_new() (old), with_cancel_interval, try_with_cancel_interval) and on ServiceConfig (Default impl). This is a breaking change for any out-of-tree consumer and should be called out in the changelog / version bump.
GPU cancellation latency = one batch. On a 5 MH/s iGPU, the default --gpu-batch-size=1_000_000 means up to ~200 ms of "cancel ignored" per batch. Fine as a default, but worth a sentence in the README explaining the latency/overhead trade-off (smaller batch = more dispatch overhead, larger batch = slower cancellation).
Minor DRY opportunity in worker_loop — the "compute hash_rate if elapsed>0 / log X worker Y : Z hashes in Ws (rate)" block is copy-pasted three times (stale, exhausted, cancelled). Could be a two-line helper.

Log inlining: most of the new log::*! calls follow AGENTS.md style ("Job {new_job_id}"), but a few still use positional args, e.g.:

        log::debug!(
            "[JOB DISPATCH] Job {} dispatched to {} workers",
            new_job_id,
            self.job_senders.len()
        );

Inlinable per AGENTS.md.

Correctness spot-checks I did

hash_count = 0 on first CPU iteration → 0.is_multiple_of(batch_size) is true → pre-cancel is honored. Test engine_respects_immediate_cancellation covers this.
JobIdCancelCheck uses Ordering::Relaxed read vs SeqCst writes — safe since the only thing we care about is eventual visibility of the counter value; there's no cross-variable ordering dependency.
Drain-to-latest in worker_loop plus job-ID comparison handles the "two new jobs arrived while worker was busy" case correctly.
Shader binding count went from 6 to 5; the bind group, layout, and dispatch-config buffer size (16 → 12) are all updated consistently.
run_single_batch advances current_start by this_batch_size, not by dispatched_nonces — correct, because the returned hash_count is only an execution-work estimate, while this_batch_size is the logical stride we asked for.

Nice fix — the previous GPU cancel path was a real silent-failure bug and this replaces it with something simpler and more auditable.

n13

Nice let's ship it!

Check the AI comments, some of these things should be a follow up PR or in this PR if easy to do but it certainly beats GPU mining not working at all

illuzen added 5 commits April 21, 2026 18:19

improve logging

4bad16f

don't let the queue spill, just drain it

fec7d26

move cancellation out of the shader, since it doesn't work

602d47c

simplify batch size mechanism, improve logs

f022095

fmt

1841deb

n13 approved these changes Apr 22, 2026

View reviewed changes

address review items

44bfc81

illuzen merged commit d7bf278 into main Apr 22, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GPU fixes#58

GPU fixes#58
illuzen merged 6 commits into
mainfrom
illuzen/gpu-fixes

illuzen commented Apr 22, 2026

Uh oh!

n13 commented Apr 22, 2026

Uh oh!

n13 left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

illuzen commented Apr 22, 2026

Uh oh!

n13 commented Apr 22, 2026

PR #58 Review — "GPU fixes" link

What the PR does (TL;DR)

What's great

Suggestions / things to consider

Correctness spot-checks I did

Uh oh!

n13 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants