Skip to content

GPU fixes#58

Merged
illuzen merged 6 commits into
mainfrom
illuzen/gpu-fixes
Apr 22, 2026
Merged

GPU fixes#58
illuzen merged 6 commits into
mainfrom
illuzen/gpu-fixes

Conversation

@illuzen
Copy link
Copy Markdown
Contributor

@illuzen illuzen commented Apr 22, 2026

Turns out our cancellation flag strategy was completely broken. GPUs can't check updates to VRAM mid-dispatch. This changes the cancellation strategy to live entirely in rust. It is defined by --cpu-batch-size and --gpu-batch-size CLI args.

The logs are also improved now.

@n13
Copy link
Copy Markdown
Contributor

n13 commented Apr 22, 2026

PR #58 Review — "GPU fixes" link

Verdict: approve with minor suggestions. All CI checks pass (fmt / clippy / build & test / bench), I also ran cargo build --workspace, cargo clippy --workspace --all-targets -- -D warnings, cargo test --workspace --locked, and cargo fmt --check locally on pr-58 — everything is green.

What the PR does (TL;DR)

  1. Removes the in-shader cancel-flag check — GPU dispatches can't reliably observe host buffer writes mid-dispatch, so the old mechanism was silently broken.
  2. Replaces it with host-side cancellation between Rust-driven batches (--gpu-batch-size, --cpu-batch-size).
  3. Introduces a small CancelCheck trait with two implementations: AtomicBoolCancelCheck (bench/example) and JobIdCancelCheck (service). The service now drives cancellation and stale-result detection with a single AtomicU64 counter — eliminating the previous "set flag → sleep 1 ms → clear flag" dance in WorkerPool::start_job.
  4. Switches job channels from bounded(1) + try_send to unbounded() with an explicit drain-to-latest in each worker.
  5. Greatly expands GPU vendor detection (Blackwell, RDNA 4, Battlemage, M4 Ultra) and adds a fallback warning asking users to report unrecognized hardware.
  6. Improves logging (per-worker/per-job elapsed + formatted hashrate, 0x prefix on hashes).

What's great

  • Root cause is correct. WGPU/Metal/Vulkan/DX12 don't guarantee host-visible storage-buffer writes become visible to an already-dispatched kernel without a barrier. Moving cancellation out of the shader is the right architectural call, and it also reclaims shader cycles previously spent on (j % cancel_check_interval) == 0 divergent checks.
  • JobIdCancelCheck is elegant — a single counter both triggers cancellation and disambiguates stale results. Much cleaner than the old cancel-flag + sleep trick in WorkerPool::start_job.
  • Unbounded channel + drain-to-latest removes the "stale job stuck in queue" failure mode and the need to race a 1 ms sleep against the worker.

Suggestions / things to consider

  1. CPU inner loop now does a runtime modulo every hash (engine-cpu/src/lib.rs:153):

            loop {
                // Check for cancellation every batch_size hashes
                if hash_count.is_multiple_of(self.batch_size) && cancel.is_cancelled() {
                    return EngineStatus::Cancelled { hash_count };
    

    self.batch_size is a runtime u64, so is_multiple_of can't fold to a bit-mask/shift and generally lowers to an actual division. The old code was one AtomicBool::load(Relaxed). Poseidon almost certainly dominates, but it's worth microbenchmarking — or replace with a decrementing local counter to fully sidestep the divide:

    let mut until_check: u64 = self.batch_size;
    loop {
        if until_check == 0 {
            if cancel.is_cancelled() { return EngineStatus::Cancelled { hash_count }; }
            until_check = self.batch_size;
        }
        until_check -= 1;
        // ... hash ...
    }
  2. --gpu-batch-size / --cpu-batch-size flags don't flow into benchmark. The Benchmark subcommand uses DEFAULT_GPU_BATCH_SIZE / DEFAULT_CPU_BATCH_SIZE, and the criterion benches hard-code 10_000_000 (GPU) and 10_000 (CPU) — a 10× discrepancy from the service default of 1_000_000 for GPU. Consider either (a) plumbing the flags into Benchmark too, or (b) referencing the CLI DEFAULT_* constants from engine-*/benches/* so the numbers can't drift. Users comparing bench results to real runs will be confused otherwise.

  3. GpuEngine::run_single_batch doesn't actually use &self. It reads only gpu_ctx / resources. Could be a free function or fn run_single_batch(gpu_ctx: &GpuContext, resources: &GpuResources, ...). Tiny style nit.

  4. WorkerPool::cancel() semantics changed silently. Old cancel() set an AtomicBool. New one bumps current_job_id by 1, and then the next start_job() bumps by another 1. That's fine, but it means job IDs in logs become non-contiguous after a disconnect (quic.rs:53). A one-line comment explaining that would save a future maintainer some head-scratching.

  5. Dropped public API on GpuEngine (new, try_new() (old), with_cancel_interval, try_with_cancel_interval) and on ServiceConfig (Default impl). This is a breaking change for any out-of-tree consumer and should be called out in the changelog / version bump.

  6. GPU cancellation latency = one batch. On a 5 MH/s iGPU, the default --gpu-batch-size=1_000_000 means up to ~200 ms of "cancel ignored" per batch. Fine as a default, but worth a sentence in the README explaining the latency/overhead trade-off (smaller batch = more dispatch overhead, larger batch = slower cancellation).

  7. Minor DRY opportunity in worker_loop — the "compute hash_rate if elapsed>0 / log X worker Y : Z hashes in Ws (rate)" block is copy-pasted three times (stale, exhausted, cancelled). Could be a two-line helper.

  8. Log inlining: most of the new log::*! calls follow AGENTS.md style ("Job {new_job_id}"), but a few still use positional args, e.g.:

            log::debug!(
                "[JOB DISPATCH] Job {} dispatched to {} workers",
                new_job_id,
                self.job_senders.len()
            );
    

    Inlinable per AGENTS.md.

Correctness spot-checks I did

  • hash_count = 0 on first CPU iteration → 0.is_multiple_of(batch_size) is true → pre-cancel is honored. Test engine_respects_immediate_cancellation covers this.
  • JobIdCancelCheck uses Ordering::Relaxed read vs SeqCst writes — safe since the only thing we care about is eventual visibility of the counter value; there's no cross-variable ordering dependency.
  • Drain-to-latest in worker_loop plus job-ID comparison handles the "two new jobs arrived while worker was busy" case correctly.
  • Shader binding count went from 6 to 5; the bind group, layout, and dispatch-config buffer size (16 → 12) are all updated consistently.
  • run_single_batch advances current_start by this_batch_size, not by dispatched_nonces — correct, because the returned hash_count is only an execution-work estimate, while this_batch_size is the logical stride we asked for.

Nice fix — the previous GPU cancel path was a real silent-failure bug and this replaces it with something simpler and more auditable.

Copy link
Copy Markdown
Contributor

@n13 n13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice let's ship it!

Check the AI comments, some of these things should be a follow up PR or in this PR if easy to do but it certainly beats GPU mining not working at all

@illuzen illuzen merged commit d7bf278 into main Apr 22, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants